[Startup Plan] Don't manage to get CPU optimized inference API #31
Maybe of interest to @Narsil
Hi @Matthieu-Tinycoaching, this is linked to: Community images do not implement:
So what you are seeing is quite normal and expected.
Hi @Narsil, thanks for the feedback. However, I don't understand: how can I test the accelerated CPU inference API on my custom public model? What is actually testable on the accelerated inference API, and what benefit should I get from the free Startup Plan trial?
Hi,
You can test
Also
But if you pin your model, we would be able to run a few tests and optimize this pipeline so you can test performance. Anticipating, but
Hi @Narsil.
Please correct me if I'm wrong. There is no batch support at the moment (although it should be almost trivial to add; it was also requested by @Kvit in UKPLab/sentence-transformers#925 (comment)).
Hi @Narsil,
Thank you for shedding light on this. Do you have an approximate timeline for when sentence-transformers will be available with all the API features? I ran some load testing on my public model on the model hub. Since I can't access accelerated (CPU or GPU) inference for the moment, I am curious which architecture allowed me to run CPU load tests against my public custom model. Could you tell me which physical characteristics/architecture are used, and which pricing tier this corresponds to, given that I could test it even on the free plan? This would help me compare my benchmark against different cloud service solutions.
I have pinned my custom model on both CPU and GPU devices. Thanks in advance for the optimization on your side, so that I can test performance before the end of my Startup Plan trial!
As highlighted by @osanseviero, is there no batch support at the moment? Is there a practical tutorial on how to batch inputs and retrieve the corresponding outputs in a real-time application where each input is a request from a different user? Thanks for your time!
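Until the API itself batches requests, one workaround is client-side batching: collect pending requests from different users, send them as one list-valued `inputs` payload, and route each returned embedding back to its user by position. The helper names below are illustrative, not part of any official API:

```python
# Hypothetical client-side batching sketch: gather (user_id, sentence) pairs,
# build a single list-valued payload, and map results back by index.

def batch_requests(pending):
    """Split (user_id, sentence) pairs into parallel lists, preserving order."""
    user_ids = [uid for uid, _ in pending]
    payload = {"inputs": [text for _, text in pending]}
    return user_ids, payload

def dispatch(user_ids, embeddings):
    """Route the i-th embedding in the API response back to the i-th user."""
    return dict(zip(user_ids, embeddings))

# Example: two users, one API round-trip.
uids, payload = batch_requests([("alice", "hello"), ("bob", "bonjour")])
# payload == {"inputs": ["hello", "bonjour"]} would be POSTed once;
# dispatch(uids, response_json) then returns a per-user mapping.
```

The key invariant is that the Inference API returns one embedding per input, in order, so positional mapping is safe.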
Hi community,
I have subscribed to a 7-day free trial of the Startup Plan, and I wish to test the CPU-optimized inference API on this model: https://huggingface.co/Matthieu/stsb-xlm-r-multilingual-custom
However, when using the code below:
I got the sentence embeddings, but the x-compute-type header returned by my request was cpu and not cpu+optimized. Do I have to request something to get CPU-optimized inference? Thanks!
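For reference, a minimal sketch of how such a check can be done, assuming the public feature-extraction endpoint for this model and the third-party `requests` library (the function names here are illustrative):

```python
import requests  # common HTTP client for the HF Inference API

API_URL = ("https://api-inference.huggingface.co/pipeline/feature-extraction/"
           "Matthieu/stsb-xlm-r-multilingual-custom")

def build_request(api_token, sentences):
    """Assemble the auth header and JSON payload for a feature-extraction call."""
    headers = {"Authorization": f"Bearer {api_token}"}
    payload = {"inputs": sentences}
    return headers, payload

def embed(api_token, sentences):
    """POST the sentences and report which backend served the call."""
    headers, payload = build_request(api_token, sentences)
    response = requests.post(API_URL, headers=headers, json=payload)
    # The compute backend is reported in a *response* header:
    # "cpu" for the community image, "cpu+optimized" for the optimized path.
    print("x-compute-type:", response.headers.get("x-compute-type"))
    response.raise_for_status()
    return response.json()
```

If the header keeps reporting `cpu`, the model is being served by the generic community image rather than the optimized pipeline, which matches the maintainers' explanation above.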