Complete Python SDK #297

Closed · aarondav opened this issue Aug 13, 2018 · 6 comments

@aarondav
Contributor

Currently, MLflow provides three interfaces:

  1. REST API -- the fundamental, JSON-y truth.
  2. CLI -- a subset of the API (e.g., mlflow experiments list), as well as some more powerful integrated workflows like mlflow run.
  3. Python SDK -- currently, a subset of the REST API, focused on components needed for tracking a single experiment.

There is work in progress for adding R and Java/Scala SDK support as well.

We should complete the Python SDK. In particular, support is currently uneven across experiment CRUD, run CRUD, and metric/artifact CRUD.

@aarondav
Contributor Author

The main question here is how we should expose the resulting API. So far, we have APIs split between mlflow itself (e.g., log_param, log_metric) and mlflow.tracking (create_experiment, get_run).

Second, we have to decide whether the API should be Pythonic (e.g., log_metric(key, value)) or Proto-tastic (e.g., log_metric(Metric(key, value))).
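To make the contrast concrete, here is a minimal sketch of the two signatures; the Metric class below is an illustrative stand-in, not the actual mlflow.entities definition.

```python
# Hypothetical sketch of the two styles; Metric is a stand-in entity.
import time
from dataclasses import dataclass


@dataclass
class Metric:
    key: str
    value: float
    timestamp: int = 0


# "Proto-tastic": the caller constructs the entity explicitly.
def log_metric_proto(metric: Metric) -> None:
    print(f"logging {metric.key}={metric.value} at {metric.timestamp}")


# "Pythonic": plain arguments, the entity is built internally.
def log_metric(key: str, value: float) -> None:
    log_metric_proto(Metric(key, value, timestamp=int(time.time())))


log_metric("rmse", 0.87)                    # Pythonic call
log_metric_proto(Metric("rmse", 0.87))      # Proto-tastic call
```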

@aarondav
Contributor Author

As described in boto/boto3#112, boto has two APIs -- one that closely matches the lower-level interface (Proto-tastic) and a higher-level one with a more Pythonic interface. Since the former is "easy" and the latter "nice", this split seems reasonable.

I might propose we have an autogenerated client API somewhere like mlflow.api or mlflow.sdk, and then provide wrappers for the useful components in mlflow proper. This would mean adding things like list_experiments and create_experiment to mlflow, though, which might over-saturate the mlflow module with APIs that are not in common use.
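As a rough illustration of that layering (the module name mlflow.api, the endpoint base URI, and the wrapper below are hypothetical, shown only to illustrate the split):

```python
# Sketch of the proposed split: a thin autogenerated layer that mirrors
# the REST endpoints, plus a friendlier wrapper re-exported at the top level.
import requests

# Hypothetical base URI; the real tracking server path may differ.
BASE_URI = "http://localhost:5000/api/2.0/preview/mlflow"


# --- low-level layer (would live in something like mlflow.api) ---
def create_experiment_api(name: str) -> dict:
    resp = requests.post(f"{BASE_URI}/experiments/create", json={"name": name})
    resp.raise_for_status()
    return resp.json()          # raw JSON, exactly as the server returns it


# --- high-level wrapper (would be exposed from mlflow proper) ---
def create_experiment(name: str) -> str:
    """Create an experiment and return just its id."""
    return create_experiment_api(name)["experiment_id"]
```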

@mateiz
Contributor

mateiz commented Aug 14, 2018

I'd probably just start with the purely Pythonic one, unless there are a lot of APIs we don't want to include in it for some reason. Otherwise, people get confused about which API to use, and various users will come to depend on both, making both difficult to maintain. The other issue with Protobufs is that if we expose them in the public API, we might never be able to get rid of them without a lot of trouble.

Regarding mlflow vs mlflow.tracking, the original idea was to alias some very commonly used functions in mlflow but have everything else in mlflow.tracking. I'd say add everything to tracking first. Even exposing a few functions in mlflow might have been a bad idea because some people ask what the difference is, although I see a lot of Python packages that do the same thing (for instance Flask and Click).

BTW, one thing to consider is how to have some of the APIs return data in an easy-to-process format. For example, I'd love to get the result of SearchRuns as a Pandas DataFrame. Maybe this can be done using an alternate version of the call, or a parameter such as format=pandas.
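For illustration, such a call could look like the following; search_runs, the Run stand-in, and the format parameter are hypothetical here, not existing MLflow APIs.

```python
# Sketch of a search call that can hand back a pandas DataFrame.
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Run:
    run_uuid: str
    metrics: dict = field(default_factory=dict)


def search_runs(experiment_ids, filter_string="", format="list"):
    # Stand-in for the real search; normally this would call the REST API.
    runs = [Run("abc123", {"rmse": 0.87}), Run("def456", {"rmse": 0.91})]
    if format == "pandas":
        return pd.DataFrame(
            [{"run_uuid": r.run_uuid, **r.metrics} for r in runs]
        )
    return runs


df = search_runs(experiment_ids=[0], format="pandas")
print(df)  # one row per run, one column per metric
```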

aarondav added a commit to aarondav/mlflow that referenced this issue Aug 14, 2018
aarondav added a commit to aarondav/mlflow that referenced this issue Aug 17, 2018
aarondav added a commit that referenced this issue Aug 17, 2018
@aarondav
Contributor Author

PR merged!

@andremesarovic

andremesarovic commented Aug 17, 2018

Coming from an API-centric background, I believe having a low-level client that faithfully mirrors the actual HTTP calls is crucial. This client is the building block on top of which richer, more domain-specific clients can be built. A low-level client and a language-oriented client are ends of a continuum, not mutually exclusive options, so the boto paradigm seems like a good one. Low-level API clients are easy to create. Furthermore, designing only an opinionated client can be problematic, since other users will have different, unanticipated needs.

I see the MLflow REST API as a foundational core feature on top of which different clients and applications can be built. A high-level API and a multi-lingual client strategy are important to building out MLflow as a powerful next-gen ML management platform. I could see someone wanting to create their own custom UI for hyperparameter optimization written on top of the API, or a custom rich client tailored to their specific use cases. This would be in line with today's ubiquitous rich API ecosystem, where businesses expose APIs and customers create value-add with client applications. A good read on the topic: APIs: A Strategy Guide.

Here's a sample low-level Python MLflow client, mlflow_client/mlflow_api_client.py, that is similar to the Java client API. It is slightly opinionated in that it flattens nested JSON responses and simplifies the search input signature.
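As a rough sketch of that style of client (this is not the contents of the linked file, just an illustration; the endpoint paths follow the current REST API and the base URI is assumed):

```python
# Minimal low-level client that mirrors the REST endpoints one-to-one
# and returns the parsed JSON unmodified.
import requests


class MlflowApiClient:
    def __init__(self, base_uri="http://localhost:5000/api/2.0/preview/mlflow"):
        self.base_uri = base_uri

    def _get(self, path, params=None):
        resp = requests.get(f"{self.base_uri}/{path}", params=params)
        resp.raise_for_status()
        return resp.json()

    def list_experiments(self):
        return self._get("experiments/list")

    def get_experiment(self, experiment_id):
        return self._get("experiments/get", params={"experiment_id": experiment_id})

    def get_run(self, run_uuid):
        return self._get("runs/get", params={"run_uuid": run_uuid})
```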

I would also certainly add CRUD capabilities to API resources so folks can more effectively manage (update, delete) their experiments and runs. A richer domain model would also help group experiments into "buckets". Say a company wants to group one set of experiments into "self-driving-cars" and another into "self-driving-trucks": a flat experiment namespace won't scale once you have many experiments.

An autogenerated API sounds good, but I would shy away from exposing Protobuf concepts. Protobuf data concepts have already leaked into the current REST payloads (e.g., unnecessary wrapper fields such as in the "create run" response, where the data is buried two levels down in "run/info"). Protobuf is an internal implementation detail; the API should only expose RESTful concepts.
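For example, the "create run" response has roughly this shape (abbreviated and illustrative), forcing the caller to dig through the wrapper objects:

```python
# Abbreviated, illustrative shape of a "runs/create" response payload.
response = {
    "run": {
        "info": {
            "run_uuid": "abc123",
            "experiment_id": "0",
            "status": "RUNNING",
        }
    }
}

# The useful fields sit two levels down:
run_uuid = response["run"]["info"]["run_uuid"]
```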

The current API also uses RPC-like endpoints such as “experiments/list”, “experiments/get”, etc. instead of standard HTTP/REST verbs and resource names. A more RESTful API would look like:

| Current method | Current endpoint | Proposed method | Proposed endpoint |
| --- | --- | --- | --- |
| GET | experiments/list | GET | experiments |
| GET | experiments/get?experiment_id=$EXP_ID | GET | experiments/$EXP_ID |
| POST | experiments/create | POST | experiments |
| GET | runs/get?run_uuid=$RUN_ID | GET | runs/$RUN_ID |
| POST | runs/create | POST | runs |
| POST | runs/update | PUT (PATCH?) | runs/$RUN_ID |

@aarondav
Contributor Author

You raise a few distinct points. Let me try to summarize them to make sure I understand each.

Python API proposal

The proposed API here looks pretty similar to the one introduced in #299. The key differences I see are:

  1. The proposed API returns raw JSON objects as opposed to the Python mlflow.entities. The entities provide some significant utility, I think -- mainly that users can call help() or use IPython's autocomplete on them, and that Python errors are thrown if you attempt to access a field that could not possibly be defined, which helps prevent typos in user code (see the sketch after this list).
  2. You introduce an easier-to-use search API. I think this makes sense, as the current one with experiment_ids and anded_expressions is difficult to reason about and construct. We may want to introduce this simpler API at the REST level too, though, because the REST APIs should be easy to construct by hand.
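A quick sketch of the difference in point 1 (field names here only loosely mirror mlflow.entities):

```python
from dataclasses import dataclass

# Raw JSON: a typo in a key name fails silently with .get(), or only at
# lookup time with [].
run_json = {"info": {"run_uuid": "abc123", "status": "RUNNING"}}
run_json["info"].get("run_uiid")   # typo -> silently returns None


# Entity objects: attribute access works with help() and autocomplete,
# and a typo raises AttributeError immediately.
@dataclass
class RunInfo:
    run_uuid: str
    status: str


info = RunInfo(run_uuid="abc123", status="RUNNING")
print(info.run_uuid)               # "abc123"
# info.run_uiid                    # AttributeError: no attribute 'run_uiid'
```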

Let me know if I am missing some other major differences between the proposed API and the one in #299.

CRUD-y REST API

You point out that our current REST API does not follow RESTful ideals as closely as it could, and you're absolutely right. The reason we originally designed the REST API this way is simply that we happened to have infrastructure and tooling to take protos and convert them into REST API constructs, and that infrastructure cannot support the RESTful ideal. So we were constrained to using GET with query parameters and POST with JSON bodies for input.

Although our API should not in principle be constrained by our existing technology and tooling, it would still take significant effort to convert the APIs on all available servers to a more RESTful design. For this reason, I am still in favor of keeping the straightforward-but-not-completely-RESTful API. We can definitely introduce the more RESTful style later in a v2 API and continue to support both.
