
[Startup Plan] Don't manage to get CPU optimized inference API #31

Matthieu-Tinycoaching opened this issue Jun 9, 2021 · 6 comments

@Matthieu-Tinycoaching

Hi community,

I have subscribed to a 7-day free trial of the Startup Plan, and I wish to test the CPU-optimized Inference API on this model: https://huggingface.co/Matthieu/stsb-xlm-r-multilingual-custom

However, when using the code below:

import json
import requests

API_URL = "https://api-inference.huggingface.co/models/Matthieu/stsb-xlm-r-multilingual-custom"
headers = {"Authorization": "Bearer API_ORG_TOKEN"}  # replace with your org token

def query(payload):
    # POST the JSON payload and return the parsed body together with the
    # x-compute-type header, which reports the compute used for inference.
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8")), response.headers.get("x-compute-type")

payload1 = {"inputs": "Navigateur Web : Ce logiciel permet d'accéder à des pages web depuis votre ordinateur. Il en existe plusieurs téléchargeables gratuitement comme Google Chrome ou Mozilla. Certains sont même déjà installés comme Safari sur Mac OS et Edge sur Microsoft.", "options": {"use_cache": False}}

sentence_embeddings1, x_compute_type1 = query(payload1)
print(sentence_embeddings1)
print(x_compute_type1)

I get the sentence embeddings, but the x-compute-type header of my request returns cpu, not cpu+optimized. Do I have to request something to get CPU-optimized inference?

Thanks!

@LysandreJik
Member

Maybe of interest to @Narsil

@Narsil
Contributor

Narsil commented Jun 9, 2021

Hi @Matthieu-Tinycoaching, this is linked to #26.

Community images do not implement:

  • private models
  • GPU inference
  • Acceleration

So what you are seeing is normal and expected.
If you don't mind, let's keep the discussion over there, as all three are correlated.

@Matthieu-Tinycoaching
Author

Matthieu-Tinycoaching commented Jun 9, 2021

Hi @Narsil thanks for the feedback.

However, I don't understand how I can then test the accelerated CPU Inference API on my custom public model.

What is testable on the Accelerated Inference API, and what should I expect to gain from the Startup Plan free trial?

@Narsil
Contributor

Narsil commented Jun 9, 2021

Hi, you can test transformers-based models with all the API features, but not sentence-transformers models at the moment.

Also, feature-extraction, even in transformers, does not have every optimization enabled by default.
feature-extraction extracts raw hidden states, so it might be more sensitive to quantization than other pipelines, and we don't know how sensitive end users are to that. It is available for every architecture in transformers, which might also lead to smaller speedups than expected (or sometimes slowdowns) on some architectures if we simply use the defaults.
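For context, the kind of optimization in question is along the lines of PyTorch dynamic quantization; the sketch below is a generic illustration, not necessarily what the hosted API does server-side:

import torch
from transformers import AutoModel

# Generic dynamic-quantization sketch (illustrative only, not the hosted
# API's actual setup): nn.Linear weights are converted to int8, which
# speeds up CPU inference but slightly perturbs the raw hidden states
# that feature-extraction returns.
model = AutoModel.from_pretrained("xlm-roberta-base")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)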

But if you pin your model, we will be able to run a few tests and optimize this pipeline so you can test performance.

Anticipating a bit: feature-extraction and sentence embeddings are usually very fast, so maybe try batching part of the inputs; it will reduce the HTTP + network overhead of the overall computation. (Simply send a list of strings within inputs instead of a single sentence.)
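
For example, a batched request could look like the following (an untested sketch reusing the query helper from above, and assuming the endpoint accepts a list of strings; see the next comment regarding current batch support):

# Batch several sentences into one request to amortize HTTP + network
# overhead. Untested sketch: assumes the API accepts a list under
# "inputs" and returns one embedding per input.
batch_payload = {
    "inputs": [
        "First sentence to embed.",
        "Second sentence to embed.",
        "Third sentence to embed.",
    ],
    "options": {"use_cache": False},
}

embeddings, compute_type = query(batch_payload)
print(len(embeddings))  # expected: one embedding per input sentence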

@osanseviero
Member

osanseviero commented Jun 9, 2021

Hi @Narsil.

Anticipating a bit: feature-extraction and sentence embeddings are usually very fast, so maybe try batching part of the inputs; it will reduce the HTTP + network overhead of the overall computation. (Simply send a list of strings within inputs instead of a single sentence.)

Please correct me if I'm wrong, but there is no batch support at the moment (although it should be almost trivial to add; it was also requested by @Kvit in UKPLab/sentence-transformers#925 (comment)).

@Matthieu-Tinycoaching
Author

Hi @Narsil

You can test transformers-based models with all the API features, but not sentence-transformers models at the moment.

Thank you for the clarification. Do you have an approximate timeline for when sentence-transformers will be available with all the API features?

I ran some load testing on my public model on the model hub. Since I don't have access to accelerated (CPU or GPU) inference for the moment, I am curious which architecture actually served my CPU load tests on my public custom model. Could you specify the physical characteristics/architecture used, and which pricing tier this corresponds to, since I could test it even with the free plan? This would help me compare my benchmark against different cloud service solutions.

But if you pin your model, we will be able to run a few tests and optimize this pipeline so you can test performance.

I have pinned my custom model on both CPU and GPU devices. Thanks in advance for the optimization on your side, so that I can test performance before the end of my Startup Plan trial!

Anticipating a bit: feature-extraction and sentence embeddings are usually very fast, so maybe try batching part of the inputs; it will reduce the HTTP + network overhead of the overall computation. (Simply send a list of strings within inputs instead of a single sentence.)

As highlighted by @osanseviero, is there no batch support at the moment? Is there any practical tutorial on how to easily batch parts of the inputs and retrieve the corresponding outputs when dealing with a real-time application where each input is a request from a different user? For instance, I imagine something like the client-side micro-batching sketched below.
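
Here is a rough, untested sketch of what I mean (helper names are hypothetical; it reuses the query function from my first message and assumes the API accepts a list of strings):

import queue
import threading

# Hypothetical client-side micro-batching: gather requests from many
# users into one API call, then hand each caller back its own result.
request_queue = queue.Queue()

def batch_worker(batch_size=8, linger=0.05):
    while True:
        batch = [request_queue.get()]  # block until at least one request
        try:
            while len(batch) < batch_size:
                batch.append(request_queue.get(timeout=linger))
        except queue.Empty:
            pass  # flush a partial batch once the linger window expires
        texts = [text for text, _ in batch]
        # One API call for the whole batch (assumes list inputs work).
        embeddings, _ = query({"inputs": texts, "options": {"use_cache": False}})
        for (_, result_slot), embedding in zip(batch, embeddings):
            result_slot.put(embedding)  # deliver each user's embedding

def embed(text):
    # Called once per user request; blocks until the batched result arrives.
    result_slot = queue.Queue(maxsize=1)
    request_queue.put((text, result_slot))
    return result_slot.get()

threading.Thread(target=batch_worker, daemon=True).start()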

Thanks for your time!

@LysandreJik LysandreJik transferred this issue from huggingface/huggingface_hub Mar 16, 2022