Feature/upload flow #167

Merged
merged 60 commits into develop from feature/upload-flow
Jan 30, 2017

mfeurer
Collaborator

@mfeurer mfeurer commented Sep 2, 2016

No description provided.

The OpenMLFlow parameters and components attributes are now of type
OrderedDict, with the keys being the names of the parameters/components
and the values being either the default value of the hyperparameter or
the actual component. This makes creating OpenMLFlows easier.
They can still be nicely uploaded.

Also, there was a bug in the deserialization, which always
returned the model of the serialized flow.
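
A minimal sketch of the structure described in this commit message, assuming a simplified stand-in class. `SketchFlow` and the parameter values are hypothetical and not the real `OpenMLFlow`; the sketch only shows the two ordered mappings, from parameter name to default value and from component name to sub-flow.

```python
from collections import OrderedDict


class SketchFlow(object):
    """Hypothetical stand-in for a flow; not the real OpenMLFlow class."""

    def __init__(self, name, parameters=None, components=None):
        self.name = name
        # parameter name -> default value of the hyperparameter
        self.parameters = OrderedDict(parameters or [])
        # component name -> the actual sub-flow
        self.components = OrderedDict(components or [])


tree = SketchFlow('DecisionTreeClassifier', parameters=[('max_depth', '1')])
boosting = SketchFlow(
    'AdaBoostClassifier',
    parameters=[('n_estimators', '50'), ('learning_rate', '1.0')],
    components=[('base_estimator', tree)],
)

# Insertion order is preserved, so the layout is stable across runs.
print(list(boosting.parameters))   # ['n_estimators', 'learning_rate']
print(list(boosting.components))   # ['base_estimator']
```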
@coveralls

Coverage increased (+0.2%) to 89.747% when pulling 5a0750a on feature/upload-flow into a407b75 on develop.

@coveralls

Coverage increased (+0.3%) to 89.848% when pulling 7cd5741 on feature/upload-flow into a407b75 on develop.

This was referenced Sep 9, 2016
@mfeurer mfeurer changed the title from "WIP Feature/upload flow" to "Feature/upload flow" Sep 9, 2016
@mfeurer
Collaborator Author

mfeurer commented Sep 9, 2016

This PR introduces the following new flow-related features:

  1. Add get_flow
  2. Fix an abstract flow specification. It is enforced that, for each flow, the parameter default values are strings. This is necessary for deserialization because only the library-specific code knows how to interpret the data saved here.
  3. Add a converter from/to scikit-learn. It allows serialization and deserialization of scikit-learn flows to and from the server. Each parameter default value is encoded as a JSON object to allow uploading through the 'abstract' flow specification. This currently expects scikit-learn 0.18 (and already works with the new model selection module).
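
The string-only defaults from items 2 and 3 can be sketched as follows. This is an illustration under assumptions, not the converter in this PR: `encode_defaults` and `decode_defaults` are hypothetical helper names, and the parameter values are made up.

```python
import json
from collections import OrderedDict


def encode_defaults(params):
    """Encode every default value as a JSON string (hypothetical helper)."""
    return OrderedDict((name, json.dumps(value)) for name, value in params.items())


def decode_defaults(encoded):
    """Inverse of encode_defaults: parse each JSON string back to a value."""
    return OrderedDict((name, json.loads(value)) for name, value in encoded.items())


params = OrderedDict([('n_estimators', 50), ('learning_rate', 1.0),
                      ('random_state', None), ('algorithm', 'SAMME.R')])
encoded = encode_defaults(params)

# Every default is now a string, which is what the abstract specification needs.
assert all(isinstance(value, str) for value in encoded.values())
print(encoded['random_state'])  # null
print(encoded['algorithm'])     # "SAMME.R"
assert decode_defaults(encoded) == params
```

Only the library-specific side needs to know that the strings are JSON; the abstract flow just stores them.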

I will now write docstrings to explain the implementation. I will write documentation once the code part of the PR is approved.

@mfeurer
Collaborator Author

mfeurer commented Sep 9, 2016

Okay, the unit tests are now working. @amueller @janvanrijn what do you think of this PR?

@coveralls

Coverage increased (+0.3%) to 89.81% when pulling bd0175a on feature/upload-flow into a407b75 on develop.

@mfeurer
Collaborator Author

mfeurer commented Sep 9, 2016

Apparently, there is still a bug in creating the names of the flows. The current unit test should not produce the following name:

TEST65c1217799sklearn.model_selection._search.RandomizedSearchCV(sklearn.pipeline.Pipeline(sklearn.preprocessing.data.StandardScaler,sklearn.ensemble.weight_boosting.AdaBoostClassifier(sklearn.tree.tree.DecisionTreeClassifier)),sklearn.preprocessing.data.StandardScaler,sklearn.ensemble.weight_boosting.AdaBoostClassifier(sklearn.tree.tree.DecisionTreeClassifier))

In particular, it should also contain the CV object. To do: test for the created component names in the unit test, and add a more complex unit test to the scikit-learn converter test.
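
To make the intended shape concrete, here is a rough sketch of a recursive name builder in which every sub-component appears exactly once, nested under its parent, and the CV object is included. The `Class(step=Sub,...)` format, the `build_name` helper, and the choice of `StratifiedKFold` as the CV object are all assumptions for illustration, not the naming scheme implemented in this PR.

```python
from collections import OrderedDict


def build_name(class_name, components):
    """Build a nested flow name (hypothetical format: Class(step=Sub,...)).

    ``components`` maps a step or parameter name to a
    ``(class_name, components)`` pair describing the sub-flow.
    """
    if not components:
        return class_name
    parts = ['%s=%s' % (step, build_name(*sub)) for step, sub in components.items()]
    return '%s(%s)' % (class_name, ','.join(parts))


tree = ('DecisionTreeClassifier', OrderedDict())
boosting = ('AdaBoostClassifier', OrderedDict([('base_estimator', tree)]))
pipeline = ('Pipeline', OrderedDict([
    ('scaler', ('StandardScaler', OrderedDict())),
    ('boosting', boosting),
]))
search = ('RandomizedSearchCV', OrderedDict([
    ('estimator', pipeline),
    ('cv', ('StratifiedKFold', OrderedDict())),
]))

print(build_name(*search))
```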

@coveralls

Coverage increased (+0.3%) to 89.81% when pulling 6bec1fa on feature/upload-flow into a407b75 on develop.

@coveralls

Coverage increased (+0.3%) to 89.817% when pulling 7e6a545 on feature/upload-flow into a407b75 on develop.

@coveralls

Coverage increased (+0.3%) to 89.817% when pulling 604d01e on feature/upload-flow into a407b75 on develop.

@mfeurer
Collaborator Author

mfeurer commented Sep 15, 2016

No idea why the PR check fails. It's at least not the fault of this PR.

@coveralls

Coverage increased (+0.3%) to 89.861% when pulling 4c7673f on feature/upload-flow into a407b75 on develop.

@mfeurer
Collaborator Author

mfeurer commented Jan 26, 2017

Just for reference, it worked this afternoon (https://travis-ci.org/openml/openml-python/builds/195473597), so I assume the problem is on the OpenML side.

Regarding the pipeline, I can see the issue and we'd need to add the step name to the flow name. Will be doing this in a few minutes, okay?

@amueller
Contributor

No rush on my side ;) I'll be awake longer than you are, I imagine.

Contributor

@amueller amueller left a comment

First batch of comments, not done ;)

@@ -5,6 +5,7 @@
else:
    from urllib.error import URLError

import six
Contributor

why if it's not used here?

@@ -0,0 +1,3 @@
# Dummy to allow mock classes in the test files to have a version number for
# their parent module
__version__ = '0.1'
Contributor

I would probably prefer mocking but it's ok for now.

@@ -0,0 +1 @@
__version__ = 1.0
Contributor

why is that here? dummy learn is a fake learning library for testing?

Collaborator Author

When I executed the tests locally under Python 3, the module was imported as dummy_module.dummy_forest, but on travis-ci as tests.flows.dummy_learn.dummy_forest. I have changed the imports now, so this is no longer needed.

@@ -42,12 +42,15 @@ def setUp(self):
self.cached = True
# amueller's read/write key that he will throw away later
openml.config.apikey = "610344db6388d9ba34f6db45a3cf71de"
#openml.config.server = "http://capa.win.tue.nl/api/v1/xml"
openml.config.server = "https://test.openml.org/api/v1/xml"
self.production_server = "https://www.openml.org/api/v1/xml"
Contributor

shouldn't this be openml.config.server?

Collaborator Author

Yes, thanks!

        return {}

    def set_params(self, params):
        return None
Contributor

I feel it would be better to return self if possible.
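
For illustration, a minimal sketch of the convention being suggested, assuming a hypothetical `DummyEstimator` test double. scikit-learn estimators return `self` from `set_params`, which allows call chaining; note that the sketch uses the `**params` signature from scikit-learn rather than the single `params` argument in the snippet above.

```python
class DummyEstimator(object):
    """Hypothetical test double that follows the scikit-learn convention."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self, deep=True):
        return {'alpha': self.alpha}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        # Returning self (instead of None) allows call chaining.
        return self


estimator = DummyEstimator().set_params(alpha=0.5)
print(estimator.get_params())  # {'alpha': 0.5}
```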

Collaborator Author

Okay.

# Add the component to the list of components, add a
# component reference as a placeholder to the list of
# parameters, which will be replaced by the real component
# when deserealizing the parameter
Contributor

Old typo ;)

else:
    name = class_name

# Get the external versions of all sub-components
Contributor

This should probably be a function. This function is too long already.

Collaborator Author

Extracted a function here, as well as for checking that a component is not used multiple times in a flow.

to_visit_stack.extend(sub_components.values())
while len(to_visit_stack) > 0:
    visitee = to_visit_stack.pop()
    for external_version in visitee.external_version.split(','):
Contributor

So this is a recursion into subcomponents that have already been constructed and which have lists of external versions, right? Maybe a comment?

Collaborator Author

Okay.

    visitee = to_visit_stack.pop()
    for external_version in visitee.external_version.split(','):
        external_versions.add(external_version)
    to_visit_stack.extend(visitee.components.values())
Contributor

Why is this necessary if visitee already has an external_version string that we just parsed? That contains all the external versions of all subcomponents, right?

Collaborator Author

You're right. I also changed the comment I added based on your comment above.

    for external_version in visitee.external_version.split(','):
        external_versions.add(external_version)
    to_visit_stack.extend(visitee.components.values())
external_versions = list(sorted(external_versions))
Contributor

Shouldn't we make sure they are unique?

Collaborator Author

By sorting it it becomes unique, right? Or did I miss something?
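
A small sketch of the version collection discussed in this thread, with hypothetical names (`Component`, `collect_external_versions`). Sorting on its own does not remove duplicates. The quoted code calls `external_versions.add(...)`, which suggests the collection is a set, and in this sketch it is the set that guarantees uniqueness while sorting only fixes the order.

```python
class Component(object):
    """Hypothetical sub-flow carrying a comma-separated version string."""

    def __init__(self, external_version):
        self.external_version = external_version


def collect_external_versions(own_version, sub_components):
    versions = {own_version}
    for component in sub_components:
        # Each component's string is assumed to already include the versions
        # of its own sub-components, so no further recursion is needed.
        versions.update(component.external_version.split(','))
    # The set removes duplicates; sorting only makes the order deterministic.
    return ','.join(sorted(versions))


subs = [Component('sklearn==0.18.1'),
        Component('dummy_learn==1.0,sklearn==0.18.1')]
print(collect_external_versions('sklearn==0.18.1', subs))
# dummy_learn==1.0,sklearn==0.18.1
```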

@joaquinvanschoren
Contributor

I fixed some settings.

This works: https://test.openml.org/api/v1/data/1?api_key=xxx
This too: http://test.openml.org/api/v1/data/1?api_key=xxx
This works: http://capa.win.tue.nl/api/v1/json/data/1
This does not: https://capa.win.tue.nl/api/v1/json/data/1
That’s because capa.win.tue.nl has not been added to the SSL certificate yet (the change has been requested).

Contributor

@amueller amueller left a comment

First batch of comments.

@amueller
Contributor

@joaquinvanschoren great, that allows us to run the tests by just changing the URL of the test server.

Contributor

@amueller amueller left a comment

Some coverage comments.

@@ -105,7 +377,7 @@ def _ensure_flow_exists(self):
"""
import sklearn
flow_version = 'sklearn_' + sklearn.__version__
- _, _, flow_id = _check_flow_exists(self._get_name(), flow_version)
+ _, _, flow_id = _check_flow_exists(self.name, flow_version)
Contributor

I get an error in the tests in line 384: publish returns self, which is not iterable...

Collaborator Author

This is indeed a bug, but while trying to write a test that covers this function, I uncovered a bug in OpenML...
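
The failure mode reported here can be reproduced in isolation. This is only a sketch with a hypothetical `SketchFlow` class: unpacking into three names requires an iterable, and an object returned from `publish()` is not one.

```python
class SketchFlow(object):
    """Hypothetical flow whose publish() returns the flow itself."""

    def publish(self):
        return self


flow = SketchFlow()
message = None
try:
    # Unpacking expects an iterable of three items, not a flow object.
    _, _, flow_id = flow.publish()
except TypeError as error:
    message = str(error)
    print('TypeError:', message)
```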

@@ -117,8 +389,42 @@ def _ensure_flow_exists(self):

return int(flow_id)
Contributor

This line doesn't seem to be covered.

Collaborator Author

Done.


if isinstance(o, dict):
    if 'oml:name' in o and 'oml:description' in o:
        # TODO check if this code is actually called
Contributor

It's not. At least not in the tests.

Collaborator Author

Done

elif isinstance(o, (list, tuple)):
    rval = [flow_to_sklearn(element, **kwargs) for element in o]
    if isinstance(o, tuple):
        rval = tuple(rval)
Contributor

not covered.

Collaborator Author

Done
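
A test for this branch mostly needs to check that the container type survives the recursion. The sketch below uses a hypothetical `convert` function in place of `flow_to_sklearn` (leaves are simply turned into strings) and exercises both the list and the tuple path.

```python
def convert(o):
    """Recursively convert a nested structure (hypothetical stand-in).

    Leaves are passed through ``str`` only to make the recursion visible.
    """
    if isinstance(o, (list, tuple)):
        rval = [convert(element) for element in o]
        # Rebuild a tuple so the container type survives the conversion.
        if isinstance(o, tuple):
            rval = tuple(rval)
        return rval
    return str(o)


assert convert([1, (2, 3)]) == ['1', ('2', '3')]
assert isinstance(convert((1, 2)), tuple)
assert isinstance(convert([1, 2]), list)
```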

# in the brackets) as the identifier
pos = identifier.find('(')
if pos >= 0:
    identifier = identifier[:pos]
Contributor

not covered

Collaborator Author

This piece of code would actually have been a bug if it was triggered. Removed.


# Replace the component placeholder by the actual flow
if isinstance(rval, dict) and 'oml-python:serialized_object' in rval:
    parameter_name, step = rval['value'].split('__')
Contributor

Not covered?!

Collaborator Author

This is apparently already done in flow_to_sklearn. Deleting this now.

@mfeurer
Collaborator Author

mfeurer commented Jan 26, 2017

@joaquinvanschoren thanks, the tests work again.

@amueller I just added a fix for the name collision.

There is one test failing right now, I will now take care of it.

Contributor

@amueller amueller left a comment

I think this is about as good as I can do. I didn't go through all the details, but if you address at least the main comments, I think we're good.

    rval = None
elif isinstance(o, six.string_types):
    rval = o
elif isinstance(o, (bool, int, float)):
Contributor

Could the three cases here be done together, as I suggested elsewhere?

Collaborator Author

Done.
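
One way the merged branch could look, assuming the three cases are the `None`, string, and plain-number branches and that each of them returns the value unchanged. `passthrough_simple` is a hypothetical helper, and `str` stands in for `six.string_types` on Python 3.

```python
def passthrough_simple(o):
    """Return simple values unchanged (hypothetical helper for illustration)."""
    # None, strings and plain numbers need no conversion, so a single
    # branch covers all three cases.
    if o is None or isinstance(o, (str, bool, int, float)):
        return o
    raise TypeError('unsupported type: %r' % type(o))


print(passthrough_simple(None))   # None
print(passthrough_simple('gini'))  # gini
print(passthrough_simple(0.5))    # 0.5
```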

# Steps in a pipeline or feature union
parameter_value = list()
for sub_component_tuple in rval:
    identifier, sub_component = sub_component_tuple
Contributor

maybe put the inside of this loop or the loop in a function? a lot of indentation levels and variables to keep track of.

Collaborator Author

I extracted the whole loop that gets the model information into its own function. This makes _serialize_model() easier to read. Also, I added information on how to further restructure the code in case one has to touch it again.

# different value, it is still correct as it is a propagation of the
# subclasses' module name
self.assertIn(flow.external_version,
              ['dummy_learn==1.0,sklearn==0.18.1',
Contributor

You're still hard-coding the sklearn version here... Can you import it and get it that way?

Collaborator Author

Thanks, fixed.
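
A sketch of the kind of change this refers to, not the actual fix: the expected string is built from a version value instead of a literal. `expected_external_version` is a hypothetical helper; in the real test the argument would come from `sklearn.__version__`.

```python
def expected_external_version(sklearn_version):
    """Format the expected string for a given scikit-learn version."""
    return 'dummy_learn==1.0,sklearn==%s' % sklearn_version


# In the real test this would be called with sklearn.__version__ after
# ``import sklearn``; a literal is used here only to keep the sketch
# free of third-party imports.
print(expected_external_version('0.18.1'))  # dummy_learn==1.0,sklearn==0.18.1
```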

@mfeurer
Collaborator Author

mfeurer commented Jan 26, 2017

Thanks a lot! I'll do my best to improve the code :)

@amueller
Contributor

I'm happy to look again later today (though I imagine you want to sleep at some point) or next week.
If you want to merge today, that's fine, but please open issues for the things I pointed out (like the bug in OpenML that you found).

The PR is really big, and I think doing iterative improvements after the merge is gonna be easier than trying to get everything right now.

@mfeurer
Collaborator Author

mfeurer commented Jan 26, 2017

Yep, I'll have to stop now. But instead of sleeping I need to prepare a presentation...

@amueller
Contributor

good luck with that :)

@joaquinvanschoren
Contributor

joaquinvanschoren commented Jan 26, 2017 via email

@mfeurer
Collaborator Author

mfeurer commented Jan 27, 2017

@joaquinvanschoren the only test that currently fails is openml/OpenML#360.

@mfeurer
Collaborator Author

mfeurer commented Jan 27, 2017

Okay, I tackled all of @amueller's issues and from my side this is ready. Waiting for a fix on OpenML.org to make sure all unit tests are fine.

@joaquinvanschoren
Contributor

joaquinvanschoren commented Jan 28, 2017 via email

@mfeurer mfeurer merged commit 31bf79e into develop Jan 30, 2017
@mfeurer
Collaborator Author

mfeurer commented Jan 30, 2017

Thanks @amueller @joaquinvanschoren @janvanrijn for getting this done :)

@mfeurer mfeurer deleted the feature/upload-flow branch January 30, 2017 20:39
@joaquinvanschoren
Contributor

👍 🎉 🎉 🎉 Awesome! Beers on me the next time we see each other!
