
[question] Standard API endpoints? #845

Closed
vsoch opened this issue Feb 14, 2022 · 15 comments

@vsoch
Contributor

vsoch commented Feb 14, 2022

Hi! Is there any work in the online ML community to define a standard set of API endpoints / interactions for a service (which implementations could then adopt and extend as needed)? An example in the containers community would be the OCI distribution-spec: https://github.com/opencontainers/distribution-spec and I've made one for workflows too: https://github.com/panoptes-organization/monitor-schema/blob/main/spec.md.

I'm asking because if a bunch of us are making similar servers, it might make sense to work from the same or a similar design. Thank you!

@MaxHalford
Member

Hey!

I don't believe there is such a standard, which says a lot about the maturity of the field. But I may be wrong :)

I'm sure it's not too hard to work out some specs. Is there an established format to write down such a spec? Like an RFC?

@vsoch
Contributor Author

vsoch commented Feb 18, 2022

I've never gone through creating a formal RFC, although I did create an RFC template for opencontainers! https://specs.opencontainers.org/image-spec/?v=v1.0.1. I think early work probably wouldn't need to be official - my thinking is I'll write up a spec.md doc alongside what I'm testing and see if anyone else is interested.

@vsoch
Contributor Author

vsoch commented Mar 1, 2022

heyo! So I started a very basic spec, and it's based off of chantilly and then django-river-ml. I tried to keep it as simple as possible since it's the first shot 👉 https://vsoch.github.io/riverapi/getting_started/spec.html. Please provide any feedback, and point anyone in this direction who might be interested in helping or thinking more about it! Closing the issue since my question is answered and resolved.

vsoch closed this as completed Mar 1, 2022
@MaxHalford
Member

Very cool!

I take it this overlaps with tools like OpenAPI and Swagger. But those are generated once the implementation is done; they're not specs.

The routes look good to me. One thing though: in my view there should also be a /label route. You give it a label and an ID, and those are matched with the features passed during a prediction. You don't pass the features directly in the /learn route. I know this is counter-intuitive, but it makes sense when you think about it. Maybe you already know what I mean, so I won't expand, but please let me know if this isn't clear.

@vsoch
Contributor Author

vsoch commented Mar 2, 2022

I take it this overlaps with tools like OpenAPI and Swagger. But those are generated once the implementation is done; they're not specs.

Oh indeed! Yes I can add that view to the django plugin - there are easy ways to do that.

The routes look good to me. One thing though: in my view there should also be a /label route. You give it a label and an ID, and those are matched with the features passed during a prediction. You don't pass the features directly in the /learn route. I know this is counter-intuitive, but it makes sense when you think about it. Maybe you already know what I mean, so I won't expand, but please let me know if this isn't clear.

So you are saying /label would be the route to provide features with a ground truth, and /learn doesn't require it? I don't know exactly what you mean so the additional explanation would be helpful! I can definitely make more time later this week to hack on this bit.

@MaxHalford
Member

So you are saying /label would be the route to provide features with a ground truth, and /learn doesn't require it? I don't know exactly what you mean so the additional explanation would be helpful! I can definitely make more time later this week to hack on this bit.

So this is what I try to explain in my talks, but it's not an easy concept. Basically:

  • /predict takes as input an ID and a set of features.
  • /label takes as input an ID and a label.

Under the hood, the features and the label can be joined to make the model learn. This is helpful because it avoids stuttering: the features are passed once in /predict, and not a second time in /label. This isn't just convenient; it's also the more correct way to proceed, because it avoids having feature discrepancies between /predict and /label.

It is up to the system to decide what to do when /label happens. It can update the model synchronously, essentially doing what a /learn route would do. Or it can store the label in a DB and let the learning happen in the background.

Does that make more sense?
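
To make the join concrete, here's a minimal sketch of what a server could do (hypothetical Flask routes, an in-memory cache, and a toy river model; not chantilly's actual implementation):

from flask import Flask, jsonify, request
from river import linear_model, preprocessing

app = Flask(__name__)
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
cache = {}  # prediction ID -> features, kept until the ground truth arrives

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json()
    cache[payload['id']] = payload['features']  # remember the features for the later join
    return jsonify({'prediction': model.predict_one(payload['features'])})

@app.route('/label', methods=['POST'])
def label():
    payload = request.get_json()
    x = cache.pop(payload['id'])          # join the label with the features seen at /predict
    model.learn_one(x, payload['label'])  # or enqueue this and learn in the background
    return jsonify({})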

@vsoch
Contributor Author

vsoch commented Mar 2, 2022

I can give it a shot! If you updated the example in your test.py for chantilly with this approach, what would that look like?

@MaxHalford
Member

Ah well something like this I suppose:

import requests

x = {...}   # the features for this observation
uuid = ...  # a unique identifier for this prediction

requests.post('/predict', json={'features': x, 'id': uuid})

# later, once the ground truth is known
label = True
requests.post('/label', json={'id': uuid, 'label': label})

@vsoch
Contributor Author

vsoch commented Mar 2, 2022

Gotcha, so you would store one or more labels with a model name and identifier? E.g.:

self.db[f"labels/{model_name}/{identifier}"] = ["label"]

@vsoch
Contributor Author

vsoch commented Mar 2, 2022

And a label != a ground truth provided in /learn?

@MaxHalford
Member

Yes, you could store it like that. But once you consider the case of multiple models being updated in parallel, this storage scheme might not make much sense.

And yes, label and ground truth are synonyms.

@vsoch
Contributor Author

vsoch commented Mar 2, 2022

Agree, so just to clarify the use cases:

  • If I know a label at the time of learning, I can provide it as ground truth
  • If I don't know a label at learning time, I can add it later, but I need to keep track of the identifier

Where exactly does this identifier come from? I see it's optional for various endpoints, but it's not clear how it's generated. Shouldn't the server be generating it (and returning it somewhere) for the user, and then the user could do something like update a previous identifier?

I also think that if ground truth == label, the API should use the terms consistently, choosing either ground truth or label (but not both). What do you think?

@MaxHalford
Member

If I know a label at the time of learning, I can provide it as ground truth

Indeed, when you have a ground truth, it usually means you made a prediction beforehand.

Where exactly does this identifier come from? I see it's optional for various endpoints, but it's not clear how it's generated. Shouldn't the server be generating it (and returning it somewhere) for the user, and then the user could do something like update a previous identifier?

It depends. Ideally the user should provide this, but you could also generate one for each prediction as a convenience for the user.

I also think that if ground truth == label, the API should use the terms consistently, choosing either ground truth or label (but not both). What do you think?

Yes, I suppose so. I would go with ground truth, as label is usually only used for classification.
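
Putting those two points together, the client side might look something like this, assuming the server mints an identifier when one isn't supplied and the field is named ground_truth (hypothetical names and URL, not part of any finalized spec):

import requests

base = 'http://localhost:8000'  # wherever the service is running

# No identifier supplied: the server generates one and returns it.
resp = requests.post(f'{base}/predict', json={'features': {'x1': 1.0, 'x2': 0.5}})
identifier = resp.json()['id']  # assumed response field

# Later, the same identifier ties the ground truth back to those features.
requests.post(f'{base}/label', json={'id': identifier, 'ground_truth': 1.5})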

@vsoch
Contributor Author

vsoch commented Mar 5, 2022

Follow-up question about the label here: instead of trying to store it, can we not just use it to update the metrics from the previous prediction (and then delete the identifier from the cache, since we've labeled it and reflected the accuracy etc. in the model)? It looks like in the current implementation, when we get a ground truth for a label we:

  1. use it to update metrics
  2. use it in model.learn_one along with the features from the prediction
  3. announce to any stream listeners

So I'm inclined for /label to do the same and not actually save/cache the label anywhere; it's basically the same as /predict minus doing the prediction, because we get it from the cache (rough sketch below). Does that sound ok?
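
A rough sketch of that /label behavior, assuming a river-style model and metric, a cache of prior predictions, and a placeholder announce hook (all hypothetical names, not the actual chantilly or django-river-ml code):

def handle_label(payload, model, metric, cache, announce):
    """Sketch of a /label handler: no extra label storage, just update and move on."""
    entry = cache.pop(payload['id'])                   # features + prediction saved at /predict time
    y_true = payload['label']

    metric.update(y_true, entry['prediction'])         # 1. update metrics with the ground truth
    model.learn_one(entry['features'], y_true)         # 2. learn from the cached features
    announce({'id': payload['id'], 'label': y_true})   # 3. notify any stream listeners

    return {'ok': True}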

@MaxHalford
Member

Yes of course, you can do that. I'm only saying that doing the learning in the background might be desirable for performance reasons.
