Accelerated inference Blogpost. #68
Conversation
Same feedback as for the documentation: if I'm a potential client reading this, I feel like I'll probably end up exploring pipelines, since it's the first thing you redirect to. Also, the optimizations sound fairly doable, so as a reader I'd conclude we should do them ourselves rather than pay for the API. What do you think @jeffboudier?
Co-authored-by: Julien Chaumond <julien@huggingface.co>
accelerated-inference.md (Outdated)

> # **Speeding up inference by 10-100x for 5,000 models**
>
> At HuggingFace we're running a [hosted inference API](https://huggingface.co/pricing) that runs
I wouldn't write "run" twice. -> "At HuggingFace, one of our (SOTA) products is the hosted inference API; it runs …"
accelerated-inference.md (Outdated)

> # **Speeding up inference by 10-100x for 5,000 models**
>
> At HuggingFace we're running a [hosted inference API](https://huggingface.co/pricing) that runs
> all the models from the hub for a small price. One of the perks of using the API besides running
"from the hub" -> "from the HuggingFace model hub"? I don't know the correct product name. I would also add a break after "small price." and start a new line.
accelerated-inference.md (Outdated)

> At HuggingFace we're running a [hosted inference API](https://huggingface.co/pricing) that runs
> all the models from the hub for a small price. One of the perks of using the API besides running
> your own is that we provide an accelerated API that speeds up models by up to 100x (most likely 10x). This blogpost describes how we are achieving it.
I would adjust the whole sentence into two sentences, and you can exaggerate how great we are. I added a version of how I would write it:
"Through our enhancements of the inference API, it is possible for you to host all your custom fine-tuned models, but that's not all. We also provide an accelerated API that speeds up models by up to 100x (most likely 10x). In this blog post, I am going to describe how we achieved this acceleration boost."
> This blogpost will assume some knowledge of the internals of transformers in general
> and will link to relevant parts for everything we don't explain here.
I always like a little agenda of what comes first, second, third... so the reader knows what's ahead of them.
accelerated-inference.md (Outdated)

> ## Running a pipeline (Open-source and already within transformers)
>
> The first speedup you should get is already implemented for you for free directly
> within [transformers](https://github.com/huggingface/transformers). And it's enabled
Remove "And".
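For reference, a minimal sketch of the pipeline usage that section describes (the task and model below are illustrative choices, not ones named in the post):

```python
from transformers import pipeline

# The pipeline loads the model and tokenizer once and reuses them,
# which is the "free" speedup compared to rebuilding them per request.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Subsequent calls skip model loading entirely.
print(classifier("This blog post made inference much faster!"))
```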
accelerated-inference.md (Outdated)

> so you can also get speedups if you do that correctly. Using `pipelines` from above
> will get you all the caching you're going to need.
>
> Overall the Fast tokenizers get a ~10x speedup for the tokenization part, so depending on
"the Fast tokenizers get" -> "the Fast tokenizers achieve a ~10x speedup"
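For context, the fast (Rust-backed) tokenizers can also be requested explicitly; a minimal sketch (the model name is just an example):

```python
from transformers import AutoTokenizer

# use_fast=True selects the Rust-backed tokenizer from the `tokenizers`
# library, which is where the ~10x tokenization speedup comes from.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

encoded = tokenizer("Fast tokenizers encode text much more quickly.")
print(encoded.input_ids)
```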
accelerated-inference.md (Outdated)

> ## Compiling your model (ONNX, onnxruntime, not fully open source yet)
>
> For the last part of the speedup (another ~10x) we're going to need to dive
-> "For the last part of the speedup (another ~10x) we are taking a deep dive into the gory details"
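To make those gory details a bit more concrete, here is a minimal sketch of an export-quantize-run path with ONNX and onnxruntime. The model name, file paths, and opset version are assumptions for illustration, not necessarily the exact setup behind the API:

```python
import torch
import onnxruntime
from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# return_dict=False makes the model output a plain tuple, which is
# easier for the ONNX tracer to handle.
model = AutoModelForSequenceClassification.from_pretrained(model_id, return_dict=False)
model.eval()

inputs = tokenizer("ONNX export example", return_tensors="pt")

# 1. Export the PyTorch graph to ONNX, with dynamic batch/sequence axes
#    so the compiled graph accepts any input shape.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=12,
)

# 2. Dynamically quantize the weights to int8, usually a big CPU win.
quantize_dynamic("model.onnx", "model-quantized.onnx", weight_type=QuantType.QInt8)

# 3. Run with onnxruntime instead of PyTorch.
session = onnxruntime.InferenceSession("model-quantized.onnx")
logits = session.run(
    ["logits"],
    {
        "input_ids": inputs["input_ids"].numpy(),
        "attention_mask": inputs["attention_mask"].numpy(),
    },
)[0]
print(logits)
```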
accelerated-inference.md (Outdated)

> # Conclusion
A calculation with the actual values would be cool here -> showing how the 100x is achieved.
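For example, a rough back-of-envelope version, with made-up latencies purely for illustration (the post would need real measurements):

```python
# Hypothetical per-request costs (ms) for a naive implementation:
# the model is reloaded every time, with slow tokenization and plain PyTorch.
naive = {"model_load": 2000, "tokenization": 10, "forward_pass": 200}

optimized = {
    "model_load": 0,           # cached pipeline: paid once, not per request
    "tokenization": 10 / 10,   # fast tokenizers: ~10x on this component
    "forward_pass": 200 / 10,  # ONNX runtime + quantization: ~10x
}

speedup = sum(naive.values()) / sum(optimized.values())
print(f"~{speedup:.0f}x")  # ~105x with these made-up numbers
```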
accelerated-inference.md (Outdated)

> Overall, we showed how to get ~100x speedup in inference time compared to naïve
> implementations, by using *cached* graphs, *native* tokenizers and *compiling*
> your model with a runtime + quantization optimizations. And that the real trick
Remove "And that".
accelerated-inference.md (Outdated)

> your model with a runtime + quantization optimizations. And that the real trick
> is to actually combine all those methods at the same time in your product.
>
> To test all these optimizations right now, you can check out the results in our [Hosted API inference](https://huggingface.co/pricing). Be mindful that we require a paid subscription for the accelerated part, but we have a 7-day free trial.
I would be a little more aggressive here:
"To use all these optimizations, you can check out our [Inference API]. To give you a head start, we offer a 7-day free trial. After that, you can choose between 3 different subscription models. We are not only optimizing all the models but also scaling the API according to your needs, up to thousands of concurrent requests. Besides, it is possible for you to switch between CPU and GPU with a simple parameter change. If you want to know more, check out our [API product page](don't have a link here -> please add)."
accelerated-inference.md (Outdated)

> </p>
>
> To get a sense of how fast inference becomes, we are getting very close to GPU
> inference speed on a simple CPU (GPU is only 40% faster, but it's 4x the cost).
Maybe mention that GPU is still very attractive for cases where batching can happen?
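For context, batching with pipelines is just passing a list of inputs; a minimal sketch (device=0 assumes a CUDA GPU is available):

```python
from transformers import pipeline

# device=0 puts the model on the first GPU; batched inputs are where
# GPUs pull ahead of CPUs despite the higher cost.
classifier = pipeline("sentiment-analysis", device=0)

texts = ["first request", "second request", "third request"]
print(classifier(texts))
```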
Co-authored-by: Stefan Schweter <stefan@schweter.it>
Merging this, but @jeffboudier I feel like the call to action for subscription and/or the Expert Acceleration program could be even more obvious (maybe a form to input one's email address, and we then get in touch?) cc @clmnt
Yes, I want to create a landing page with a lead capture form for premium support.