New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#297] Complete Python SDK #299
Conversation
@@ -0,0 +1,123 @@ | |||
from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stuff could also be in api
, I just split it out for the sake of code readability.
I think the names "flow" and "api" are kind of weird, though I guess they are just internal names. Maybe a better name for "flow" would be "functions" since it has all the top-level functions, and then "api" could be "objects" and contain all the objects. Then you could have |
Updated based on feedback. Still need to do the rename of the internal files -- was thinking about changing |
mlflow/tracking/flow.py
Outdated
|
||
_EXPERIMENT_ID_ENV_VAR = "MLFLOW_EXPERIMENT_ID" | ||
_RUN_ID_ENV_VAR = "MLFLOW_RUN_ID" | ||
_active_run = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered making _active_run thread local? I'm imagining a scenario in which a user is running a bunch of different runs simultaneously using joblib or a standard library thread pool and gets unexpectedly bitten when all the threads end up sharing a run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes a lot of sense. Changing the scope of _active_run, or letting users manually create and pass around fluent runs, are both on the table, but I would prefer to leave that to a future PR.
Codecov Report
@@ Coverage Diff @@
## master #299 +/- ##
==========================================
+ Coverage 55.94% 56.07% +0.13%
==========================================
Files 108 112 +4
Lines 5554 5566 +12
==========================================
+ Hits 3107 3121 +14
+ Misses 2447 2445 -2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mlflow/tracking/flow.py
Outdated
def __enter__(self): | ||
pass | ||
|
||
def __exit__(self, exc_type, exc_val, exc_tb): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could/should we call end_run
here? I do like having ActiveRun
only mutate / alter its own state, so maybe we don't want to call end_run
. If we don't call end_run
, we should also unset _RUN_ID_ENV_VAR
here as end_run
does, so that future runs launched in start_run
don't reuse the same run ID here
mlflow/tracking/flow.py
Outdated
|
||
def __exit__(self, exc_type, exc_val, exc_tb): | ||
status = "FINISHED" if exc_type is None else "FAILED" | ||
get_service().set_terminated(self.info.run_uuid, status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think we might see failures if a user calls mlflow.set_tracking_uri
within a with mlflow.start_run(...)
block, since set_tracking_uri
could change the store we use in get_service
(this was why the original ActiveRun kept track of the run's backing store). I do like the shift towards a simpler ActiveRun, maybe the right thing to do is raise an exception if set_tracking_uri
is called while a run is active, or just tell users not to call set_tracking_uri
within a run.
@@ -44,42 +44,43 @@ def predicted(model, data): | |||
|
|||
def test_model_save_load(model, data, predicted): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: here we can use pytest's tmpdir
fixture (i.e. just pass tmpdir
to the test function and it'll automatically create/cleanup a tempdir for us)
assert all(pyfunc_loaded.predict(x).values == predicted) | ||
# Loading pyfunc model | ||
pyfunc_loaded = mlflow.pyfunc.load_pyfunc(path) | ||
assert all(pyfunc_loaded.predict(x).values == predicted) | ||
|
||
|
||
def test_model_log(model, data, predicted): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: here we can use the tracking_uri_mock
fixture from tests/projects/utils.py
, which sets the tracking URI to a tempdir for us
|
||
|
||
def test_model_log(model, data, predicted): | ||
x, y = data | ||
old_uri = tracking.get_tracking_uri() | ||
old_uri = mlflow.get_tracking_uri() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: here we can use the tracking_uri_mock
fixture from tests/projects/utils.py
, which sets the tracking URI to a tempdir for us / unsets it afterwards. Although feel free to ignore that suggestion if you think it'll make the diff really big (using tracking_uri_mock
will basically unindent a ton of code blocks, we could shift to using it in a separate PR)
tests/projects/test_projects.py
Outdated
@@ -53,6 +53,7 @@ def test_fetch_project(local_git_repo, local_git_repo_uri): | |||
(os.path.dirname(TEST_PROJECT_DIR), os.path.basename(TEST_PROJECT_DIR), TEST_PROJECT_DIR), | |||
] | |||
for base_uri, subdirectory, expected in test_list: | |||
print("HELLO %s" % base_uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe change the wording of this print statement?
@@ -13,6 +13,7 @@ def sftp_mock(): | |||
return MagicMock(autospec=pysftp.Connection) | |||
|
|||
|
|||
@pytest.mark.large |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Why the large annotation here? Seems fine to add, but was this test slow beforehand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It takes about 60 seconds to throw the error.
mlflow/tracking/__init__.py
Outdated
"get_service", | ||
"get_tracking_uri", | ||
"set_tracking_uri", | ||
"_get_git_commit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: why expose these internal methods in __all__
? Doing so means that from mlflow.tracking import *
will import methods like _get_git_commit
.
If _get_git_commit
etc is used in other modules maybe we could make it a "public" API within mlflow.tracking.flow
(e.g. mlflow.tracking.flow.get_git_commit
), since mlflow.tracking.flow
itself won't be a publicly documented module (assuming I understand correctly). Admittedly this would make our code less robust to future refactors of mlflow.tracking.flow
.
@@ -0,0 +1,198 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to add a doc comment to each module. I know some don't have them yet, but this makes the code easier to understand.
Do we want to write a small migration instruction thing for this? Could just go into the changelog under 0.5.0. |
@@ -0,0 +1,8 @@ | |||
mlflow.entities |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably also update docs/source/tracking.rst to describe the two APIs, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation is important, and I would like to get that closely reviewed. Let me do that in a follow-up PR which I'll go ahead and draft.
mlflow/__init__.py
Outdated
@@ -1,22 +1,43 @@ | |||
"""Provides the MLflow fluent API, allowing management of an active MLflow run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Micro style nit: I would put the """
on its own line in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I don't mind changing, but I just wanted something to point to for future people -- let me know if you have a style guide in mind)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think PEP8 allows you to have the quotes on their own line in this case. Specifically in https://www.python.org/dev/peps/pep-0257/#multi-line-docstrings : " The summary line may be on the same line as the opening quotes or on the next line. The entire docstring is indented the same as the quotes at its first line (see example below)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, switched all the module comments to this form.
mlflow/__init__.py
Outdated
@@ -1,22 +1,43 @@ | |||
"""Provides the MLflow fluent API, allowing management of an active MLflow run. | |||
For example:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using ..code::
instead of ::
will make it look better in the generated docs (for instance it adds the "copy this code" button and syntax highlighting).
@@ -0,0 +1,23 @@ | |||
from mlflow.entities.experiment import Experiment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could add a doc comment here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
_tracking_uri = None | ||
|
||
|
||
def set_tracking_uri(uri): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this and get_tracking_uri
are public functions, so do we want to move them or keep them in tracking
? The rest of the stuff in this module is private. I guess you can set them through mlflow
directly so it may be OK to move them, but it might also be better to just keep this module private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dependency structure is a bit weird. _get_store()
has a lot of helper utilities to it (which is why I wanted to move it to its own py file), and it relies on _tracking_uri
which is set by set_tracking_uri
. fluent
and service
depend on utils
, so I can't move _tracking_uri
.
I could move the set/get methods to mutate a global specified in utils.py
, but I'm not sure that's cleaner.
All comments addressed @mateiz @smurching |
Went over it again and it looks good to me. |
The goal of this PR is to complete the Python SDK. I split the currently-exposed APIs into two kinds:
mlflow.tracking
-- exposes Pythonic versions of the REST API calls via aMLflowService
object. This is implemented internally inmlflow.tracking.service
.mlflow
-- intended for the "fluent" version of the API, where invokinglog_param
is a top-level function that will start a run if none exists, and contains a notion of an "active run". This is implemented internally inmlflow.tracking.fluent
.The goal here is that users can use the top-level
mlflow
API in their simple run code, but if they want to do more complicated or management-level operations (e.g., delete an experiment), they can rely on the Python version of the REST API, effectively.