
Allow running a flow on a task #253

Merged
mfeurer merged 9 commits into develop from add/#193
May 18, 2017

Conversation

@mfeurer (Collaborator) commented May 11, 2017

Side effect: during unit testing, a sentinel can be added to the flow name so that tests do not touch flows which are already uploaded to the server.

Addresses #193.

Further changes:

  • the model of a flow is now created when downloading the flow
  • the parameters of a run are now parsed when the run is executed
  • the interface of several functions is simplified
  • downloading a flow from a flow id is simplified
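The sentinel trick mentioned above can be sketched roughly as follows; the helper name `add_test_sentinel` is hypothetical and only illustrates the pattern, not the actual test-suite code:

```python
import uuid

def add_test_sentinel(flow_name):
    """Append a unique sentinel to a flow name so that unit tests never
    match (or interfere with) flows already uploaded to the server.
    Hypothetical helper illustrating the pattern described in this PR."""
    sentinel = "TEST%s" % uuid.uuid4().hex[:12]
    return "%s_%s" % (flow_name, sentinel)

name = add_test_sentinel("sklearn.tree.DecisionTreeClassifier")
# the sentinel makes the name unique, so a server-side lookup by name
# will not find a pre-existing flow
```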

@mfeurer force-pushed the add/#193 branch 4 times, most recently from fcf6574 to 28beba7 on May 16, 2017 at 11:51
also parse parameters when running a flow on a task

fix publishing error
@mfeurer changed the title from "WIP: allow running a flow on a task" to "Allow running a flow on a task" on May 16, 2017
@mfeurer requested a review from janvanrijn on May 16, 2017 at 12:02
@codecov-io

Codecov Report

Merging #253 into develop will increase coverage by 0.13%.
The diff coverage is 97.67%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #253      +/-   ##
===========================================
+ Coverage    90.45%   90.59%   +0.13%     
===========================================
  Files           24       24              
  Lines         2064     2094      +30     
===========================================
+ Hits          1867     1897      +30     
  Misses         197      197
Impacted Files Coverage Δ
openml/__init__.py 100% <100%> (ø) ⬆️
openml/flows/functions.py 89.23% <100%> (-0.33%) ⬇️
openml/flows/__init__.py 100% <100%> (ø) ⬆️
openml/setups/functions.py 98.48% <100%> (+0.04%) ⬆️
openml/runs/__init__.py 100% <100%> (ø) ⬆️
openml/flows/flow.py 94.7% <100%> (+0.45%) ⬆️
openml/runs/functions.py 88.18% <94.73%> (+0.17%) ⬆️
openml/runs/run.py 95.3% <95.65%> (+0.41%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea4c9be...8fadddc. Read the comment docs.

@codecov-io commented May 16, 2017

Codecov Report

Merging #253 into develop will increase coverage by 0.14%.
The diff coverage is 94.11%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #253      +/-   ##
===========================================
+ Coverage    90.45%   90.59%   +0.14%     
===========================================
  Files           24       24              
  Lines         2064     2138      +74     
===========================================
+ Hits          1867     1937      +70     
- Misses         197      201       +4
Impacted Files Coverage Δ
openml/flows/__init__.py 100% <100%> (ø) ⬆️
openml/exceptions.py 100% <100%> (ø) ⬆️
openml/runs/__init__.py 100% <100%> (ø) ⬆️
openml/flows/functions.py 91.56% <100%> (+2.01%) ⬆️
openml/__init__.py 100% <100%> (ø) ⬆️
openml/runs/run.py 93.08% <84.84%> (-1.81%) ⬇️
openml/flows/flow.py 94.15% <95.45%> (-0.09%) ⬇️
openml/setups/functions.py 97.22% <96%> (-1.22%) ⬇️
openml/runs/functions.py 88.45% <96.55%> (+0.43%) ⬆️
openml/_api_calls.py 88.05% <0%> (+2.98%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea4c9be...faf5b26. Read the comment docs.

@janvanrijn (Member) left a comment

Preliminary review :)

Comment thread openml/flows/flow.py Outdated
return cls(**arguments)
flow = cls(**arguments)

if 'sklearn' in arguments['external_version']:
janvanrijn (Member):

we could restrict this even further:

  • 'sklearn.'
  • startswith 'sklearn.' (?)

mfeurer (Collaborator, author):

I will replace this with startswith('sklearn'). In the long run, we probably need to build a plugin-like system in which the converters can register themselves for an 'external_version' string.
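For context on the difference between the two checks, here is a plain-Python sketch (no openml needed; the version strings are made up):

```python
external_versions = [
    "sklearn==0.18.1",       # genuine sklearn flow
    "my_sklearn_fork==1.0",  # contains 'sklearn' but is not sklearn
]

# The substring check ('sklearn' in v) accepts both strings;
# the prefix check proposed in the review accepts only the first.
substring_hits = [v for v in external_versions if "sklearn" in v]
prefix_hits = [v for v in external_versions if v.startswith("sklearn")]
```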

Comment thread openml/flows/flow.py Outdated
flow = openml.flows.functions.get_flow(flow_id)
try:
_check_flow(self)
openml.flows.functions.assert_flows_equal(self, flow)
janvanrijn (Member):

.. so we actually do expect an error here, in some cases

mfeurer (Collaborator, author):

It's still on my to-do list to test for this expected error.

Comment thread openml/flows/functions.py
flow = OpenMLFlow._from_dict(flow_dict)

if 'sklearn' in flow.external_version:
flow.model = flow_to_sklearn(flow)
janvanrijn (Member):

startswith, (..again) ?

mfeurer (Collaborator, author):

This piece of code has been removed, so there is nothing to change here.

Comment thread openml/runs/functions.py Outdated
# returns flow id if the flow exists on the server, False otherwise
flow_id = flow_exists(flow.name, flow.external_version)

if flow_id == False:
janvanrijn (Member):

Now we are back to publishing a flow before knowing whether it is actually valid (i.e., before we have first run the task)?

mfeurer (Collaborator, author):

That's actually an issue I somehow forgot. Will take care of this.
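A side note on the `flow_id == False` comparison in the excerpt above: since `flow_exists` returns either an id or `False`, an identity check is the safer idiom in Python, because `0 == False` evaluates to true while `0 is False` does not. A minimal sketch (function name is illustrative):

```python
def describe_flow(flow_id):
    """flow_id is either an int id or False (flow not on the server)."""
    if flow_id is False:  # identity check: never confuses id 0 with False
        return "flow not on server"
    return "flow %d on server" % flow_id
```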

Comment thread openml/runs/functions.py Outdated
flow = get_flow(flow_id)
setup_id = setup_exists(flow, model)
if avoid_duplicate_runs:
flow_from_server = get_flow(flow.flow_id)
janvanrijn (Member):

Why don't we require that the 'run_flow_on_task' runs on a flow from the server?

mfeurer (Collaborator, author):

We then couldn't change the parameter values of the model in the flow as the run function would always use the parameters from the server.

Comment thread openml/runs/functions.py Outdated
# TODO (neccessary? is this a post condition of this function)
flow = get_flow(flow_id)

run.flow_id = flow.flow_id
janvanrijn (Member):

should be set in run constructor

mfeurer (Collaborator, author):

Of course

Comment thread openml/runs/run.py Outdated
# server before parsing the parameters
stack = list()
stack.append(flow)
while len(stack) > 0:
janvanrijn (Member):

can we make a separate function of this (modularity / readability)

janvanrijn (Member):

and reusability :)

mfeurer (Collaborator, author):

I actually thought the very same thing ;) It's already done.
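The extracted helper could look roughly like this; the class and function names are illustrative stand-ins, not the actual openml API:

```python
class Flow:
    """Minimal stand-in for an OpenMLFlow: a name plus named subflows."""
    def __init__(self, name, components=None):
        self.name = name
        self.components = components or {}

def iterate_flows(flow):
    """Yield a flow and all of its subflows without recursion,
    using an explicit stack as in the excerpt above."""
    stack = [flow]
    while stack:
        current = stack.pop()
        yield current
        stack.extend(current.components.values())

pipeline = Flow("pipeline", {
    "preprocessor": Flow("scaler"),
    "estimator": Flow("tree"),
})
names = sorted(f.name for f in iterate_flows(pipeline))
```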

Comment thread openml/setups/functions.py Outdated
openml_param_settings = openml.runs.OpenMLRun._parse_parameters(sklearn_model, downloaded_flow)
description = xmltodict.unparse(_to_dict(downloaded_flow.flow_id, openml_param_settings), pretty=True)
file_elements = {'description': ('description.arff',description)}
openml_param_settings = openml.runs.OpenMLRun._parse_parameters(flow)
janvanrijn (Member):

maybe explicitly state in comments that this function raises an error if the flow does not contain all flow ids?

def _perform_run(self, task_id, num_instances, clf,
random_state_value=None, check_setup=True):

def _remove_random_state(flow):
janvanrijn (Member):

why remove random state? This seems like part of the behaviour we want to test

mfeurer (Collaborator, author):

It's removed after checking the value, to make sure that assert_flows_equal works.

@mfeurer (Collaborator, author) commented May 17, 2017

Flows are published after running them (only if they haven't been published before). @janvanrijn this is ready for review again.

@janvanrijn (Member) left a comment

To me this seems like a good PR. Much of the code is now shorter/simpler. However, some unit tests fail (4 on my system).

This error occurs 3 times:

Error

Traceback (most recent call last):
  File "/home/vanrijn/projects/openml-python/tests/test_runs/test_run_functions.py", line 347, in test_get_run_trace
    run = openml.runs.run_model_on_task(task, clf, avoid_duplicate_runs=True)
  File "/home/vanrijn/projects/openml-python/openml/runs/functions.py", line 36, in run_model_on_task
    flow_tags=flow_tags, seed=seed)
  File "/home/vanrijn/projects/openml-python/openml/runs/functions.py", line 73, in run_flow_on_task
    raise ValueError('Cannot check if a run exists if the '
ValueError: Cannot check if a run exists if the corresponding flow has not been published yet!

This one occurs once (seems similar):

Failure
Expected :"Penalty term must be positive; got \(C=u?'abc'\)"
Actual   :"Cannot check if a run exists if the corresponding flow has not been published yet!"

ValueError: Cannot check if a run exists if the corresponding flow has not been published yet!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vanrijn/projects/openml-python/tests/test_runs/test_run_functions.py", line 214, in test_check_erronous_sklearn_flow_fails
    model=clf)
AssertionError: "Penalty term must be positive; got \(C=u?'abc'\)" does not match "Cannot check if a run exists if the corresponding flow has not been published yet!"

Comment thread openml/flows/flow.py
self : OpenMLFlow

"""
import openml.flows.functions
janvanrijn (Member):

imports at the top! (right?)

mfeurer (Collaborator, author):

This is not possible because of cyclic dependencies. In particular, flow.py tries to import functions.py in order to call get_flow(), while functions.py tries to import flow.py in order to instantiate an OpenMLFlow. I will add a comment.
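The deferred-import pattern can be demonstrated without openml; here a stub module registered in `sys.modules` stands in for functions.py (the names and setup are purely illustrative):

```python
import sys
import types

# Register a fake module, mimicking openml.flows.functions. In the real
# package this module would in turn import the module defining publish(),
# which is why a top-level import there would be cyclic.
funcmod = types.ModuleType("funcmod")
funcmod.get_flow = lambda flow_id: "flow #%d" % flow_id
sys.modules["funcmod"] = funcmod

def publish(flow_id):
    # Deferred import: by the time publish() is called, 'funcmod' is
    # fully initialised, so the circular dependency is harmless.
    import funcmod
    return funcmod.get_flow(flow_id)

result = publish(253)
```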

@mfeurer (Collaborator, author) commented May 18, 2017

Sorry for the failing tests; it seems I was too sloppy yesterday evening. The tests are passing now.

@janvanrijn (Member) left a comment

some minor comment requests

Comment thread openml/runs/functions.py

# skips the run if it already exists and the user opts for this in the config file.
# also, if the flow is not present on the server, the check is not needed.
flow_id = flow_exists(flow.name, flow.external_version)
janvanrijn (Member):

document somewhere that we need to set 'avoid_duplicate_runs' to false if we want offline experiments

Comment thread openml/runs/run.py
# <openml.flows.flow.OpenMLFlow object at 0x7fed87978160> is not JSON serializable
# Python3.6 exception message:
# Object of type 'OpenMLFlow' is not JSON serializable
if 'OpenMLFlow' in e.args[0] and \
janvanrijn (Member):

document what happens in case of the catch (and why)

janvanrijn (Member):

In the catch, please try to define what can reasonably fall into it (because we handle it further down), and raise an exception if we get something unexpected.

mfeurer (Collaborator, author):

I will add additional checks.
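A sketch of such a narrowed catch; `json` and a dummy class stand in for the real flow object, and the substitution logic is simplified:

```python
import json

class OpenMLFlow:
    """Dummy stand-in for the real flow class."""

def safe_dumps(obj):
    """Serialize obj, tolerating only the known 'OpenMLFlow is not JSON
    serializable' failure; anything unexpected is re-raised."""
    try:
        return json.dumps(obj)
    except TypeError as e:
        # Python 3.5: "<...OpenMLFlow object at 0x...> is not JSON serializable"
        # Python 3.6+: "Object of type ...OpenMLFlow... is not JSON serializable"
        if "OpenMLFlow" in e.args[0] and "not JSON serializable" in e.args[0]:
            return None  # known case: caller substitutes the flow id later
        raise  # unexpected serialization failure: surface it

serialized = safe_dumps({"value": 1})
skipped = safe_dumps({"flow": OpenMLFlow()})
```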

def setup_exists(flow, model=None):
'''
Checks whether a flow / hyperparameter configuration already exists on the server
Checks whether a hyperparameter configuration already exists on the server.
janvanrijn (Member):

please document why model can be None (i.e., flow.model is set)
please raise an exception if both flow.model and model are set and they do not agree (should never happen, but still..)

mfeurer (Collaborator, author):

What do you mean by models do not agree? You mean names don't match? Or the parameter names don't match? Or the parameter values?

Matching of parameter names is done somewhat in _parse_parameters, but I could make that more strict (I'll think about it...)

mfeurer (Collaborator, author):

Okay, I'll add a more strict check.

parameters[_flow_id][_param_name] = _param_value

def _reconstruct_flow(_flow, _params):
# sets the values of flow parameters (and subflows) to
janvanrijn (Member):

small todo (sorry, this should have been mine in the previous pull request): document what types _flow (a flow object?) and _params are? That would make this function easier to understand.
