-
-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/upload flow #167
Feature/upload flow #167
Conversation
OpenMLFlow parameters and components attribute are now of type OrderedDict, with keys being the name of the parameter/component and value the either the default value of the hyperparameter or the actual component. This makes creating OpenMLFlows easier. They can still be nicely uploaded. Also, there was a bug in the deserialization, which returned always the model of the serialized flow.
This PR introduces the following new flow-related features:
I will now write docstrings to explain the implementation. I will write documentation once the code part of the PR is approved. |
Okay, the unit tests are now working. @amueller @janvanrijn what do you think of this PR? |
Apparently, there is still a bug in creating the names of the flows. The current unit test should not produce the following name:
Especially, it should also contain the CV object -> test for the created component names in the unit test, also add a more complex unit test to the scikit-learn converter test. |
No idea why the PR check fails. It's at least not the fault of this PR. |
Just for reference, it worked this afternoon: https://travis-ci.org/openml/openml-python/builds/195473597 so I assume this is on the OpenML side Regarding the pipeline, I can see the issue and we'd need to add the step name to the name. Will be doing this in a few minutes, okay? |
No rush on my side ;) I'll be awake longer than you are, I imagine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First batch of comments, not done ;)
@@ -5,6 +5,7 @@ | |||
else: | |||
from urllib.error import URLError | |||
|
|||
import six |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why if it's not used here?
@@ -0,0 +1,3 @@ | |||
# Dummy to allow mock classes in the test files to have a version number for | |||
# their parent module | |||
__version__ = '0.1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably prefer mocking but it's ok for now.
@@ -0,0 +1 @@ | |||
__version__ = 1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is that here? dummy learn is a a fake learning library for testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I execute the tests locally under python3, it imported the module as dummy_module.dummy_forest
, on travis-ci as tests.flows.dummy_learn.dummy_forest
. I changed the imports now, so this is no longer needed.
@@ -42,12 +42,15 @@ def setUp(self): | |||
self.cached = True | |||
# amueller's read/write key that he will throw away later | |||
openml.config.apikey = "610344db6388d9ba34f6db45a3cf71de" | |||
#openml.config.server = "http://capa.win.tue.nl/api/v1/xml" | |||
openml.config.server = "https://test.openml.org/api/v1/xml" | |||
self.production_server = "https://www.openml.org/api/v1/xml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this bee openml.config.server
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks!
return {} | ||
|
||
def set_params(self, params): | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it would be better to return self if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay.
# Add the component to the list of components, add a | ||
# component reference as a placeholder to the list of | ||
# parameters, which will be replaced by the real component | ||
# when deserealizing the parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old typo ;)
else: | ||
name = class_name | ||
|
||
# Get the external versions of all sub-components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be a function. This function is too long already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted a function here, as well as for checking that a component is not used multiple times in a flow.
to_visit_stack.extend(sub_components.values()) | ||
while len(to_visit_stack) > 0: | ||
visitee = to_visit_stack.pop() | ||
for external_version in visitee.external_version.split(','): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is a recursion into subcomponents that have already been constructed and which have lists of external versions, right? Maybe a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay.
visitee = to_visit_stack.pop() | ||
for external_version in visitee.external_version.split(','): | ||
external_versions.add(external_version) | ||
to_visit_stack.extend(visitee.components.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary if visitee already has an external_version
string that we just parsed? That contains all the external versions of all subcomponents, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I also changed the comment I added based on your comment above.
for external_version in visitee.external_version.split(','): | ||
external_versions.add(external_version) | ||
to_visit_stack.extend(visitee.components.values()) | ||
external_versions = list(sorted(external_versions)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we make sure they are unique?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By sorting it it becomes unique, right? Or did I miss something?
I fixed some settings this works https://test.openml.org/api/v1/data/1?api_key=xxx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First batch of comments.
@joaquinvanschoren great that allows us to run the tests by just changing the url of the test server. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some coverage comments.
@@ -105,7 +377,7 @@ def _ensure_flow_exists(self): | |||
""" | |||
import sklearn | |||
flow_version = 'sklearn_' + sklearn.__version__ | |||
_, _, flow_id = _check_flow_exists(self._get_name(), flow_version) | |||
_, _, flow_id = _check_flow_exists(self.name, flow_version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get an error in the tests in line 384. publish returns self which is not iterable....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed a bug, but on trying to write a test which covers this function I uncovered a bug on OpenML...
@@ -117,8 +389,42 @@ def _ensure_flow_exists(self): | |||
|
|||
return int(flow_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line doesn't seem to be covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
if isinstance(o, dict): | ||
if 'oml:name' in o and 'oml:description' in o: | ||
# TODO check if this code is actually called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not. At least not in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
elif isinstance(o, (list, tuple)): | ||
rval = [flow_to_sklearn(element, **kwargs) for element in o] | ||
if isinstance(o, tuple): | ||
rval = tuple(rval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
# in the brackets) as the identifier | ||
pos = identifier.find('(') | ||
if pos >= 0: | ||
identifier = identifier[:pos] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not covered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece of code would actually have been a bug if it was triggered. Removed.
|
||
# Replace the component placeholder by the actual flow | ||
if isinstance(rval, dict) and 'oml-python:serialized_object' in rval: | ||
parameter_name, step = rval['value'].split('__') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not covered?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is apparently already done in flow_to_sklearn
. Deleting this now.
@joaquinvanschoren thanks, the tests work again. @amueller I just added a fix for the name collision. There is one test failing right now, I will now take care of it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is about as good as I can do. I didn't go through all the details, but if you address at least the main comments, I think we're good.
rval = None | ||
elif isinstance(o, six.string_types): | ||
rval = o | ||
elif isinstance(o, (bool, int, float)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could the three cases here could be done together as I suggested elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# Steps in a pipeline or feature union | ||
parameter_value = list() | ||
for sub_component_tuple in rval: | ||
identifier, sub_component = sub_component_tuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe put the inside of this loop or the loop in a function? a lot of indentation levels and variables to keep track of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I extracted the whole loop to get model information into its own function. This makes the _serialize_model()
easier to read. Also, I added information on how to further restructure the code in case one has to touch it again.
# different value, it is still correct as it is a propagation of the | ||
# subclasses' module name | ||
self.assertIn(flow.external_version, | ||
['dummy_learn==1.0,sklearn==0.18.1', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're still hard-coding the sklearn version here.... you can import it and get it that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
Thanks a lot! I'll do my best to improve the code :) |
I'm happy to look again later today (though I imagine you want to sleep at some point) or next week. The PR is really big and I think doing iterative improvements after merge are gonna be easier than trying to get everything right now. |
Yep, I'll have to stop now. But instead of sleeping I need to prepare a presentation... |
good luck with that :) |
Good luck and thanks!
On Thu, Jan 26, 2017 at 10:04 PM Andreas Mueller ***@***.***> wrote:
good luck with that :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#167 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABpQVy3t5JdDmmB3yAekvTq1L1YxlnUMks5rWQpEgaJpZM4Jz1ST>
.
--
Thank you,
Joaquin
|
@joaquinvanschoren the only test that currently fails is openml/OpenML#360. |
Okay, I tackled all of @amueller 's issues and from my side this is ready. Waiting for a fix on OpenML.org to make sure all unittests are fine. |
I added the run_details field and answered the issue about the '=' in the
url.
On Fri, Jan 27, 2017 at 5:34 PM Matthias Feurer ***@***.***> wrote:
Okay, I tackled all of @amueller <https://github.com/amueller> 's issues
and from my side this is ready. Waiting for a fix on OpenML.org to make
sure all unittests are fine.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#167 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABpQV2yW_KTrN3Xe-4FmZyy_2IM3iFsEks5rWhxlgaJpZM4Jz1ST>
.
--
Thank you,
Joaquin
|
Thanks @amueller @joaquinvanschoren @janvanrijn getting this done :) |
👍 🎉 🎉 🎉 Awesome! Beers on me the next time we see each other! |
No description provided.