
Accelerated inference Blogpost. #68

Merged: 18 commits merged into master from accelerated_inference on Jan 18, 2021

Conversation

@Narsil (Contributor) commented Dec 30, 2020

No description provided.

@Narsil changed the title from "First draft of Accelerated inference Blogpost." to "Accelerated inference Blogpost." on Jan 6, 2021
@clmnt (Member) commented Jan 6, 2021

Same feedback as for the documentation: if I'm a potential client reading this, I feel like I'll probably end up exploring pipelines, since that's the first thing you redirect to. Also, it sounds like the optimizations are fairly doable, so I'm thinking we should rather do them ourselves than pay for the API. What do you think @jeffboudier?

Narsil and others added 8 commits on January 7, 2021 (co-authored by Julien Chaumond <julien@huggingface.co>)
@Narsil requested a review from julien-c on January 7, 2021

> # **Speeding up inference by 10-100x for 5,000 models**
>
> At HuggingFace we're running a [hosted inference API](https://huggingface.co/pricing) that runs

Review comment (Member): I wouldn't write "run" twice -> "At HuggingFace, one of our (SOTA) products is the hosted inference API; it runs ..."

> # **Speeding up inference by 10-100x for 5,000 models**
>
> At HuggingFace we're running a [hosted inference API](https://huggingface.co/pricing) that runs
> all the models from the hub for a small price. One of the perks of using the API besides running

Review comment (Member): Suggest "from the HuggingFace model hub" (I don't know the correct product name).
I would also add a break after "small price." and start a new line.


> At HuggingFace we're running a [hosted inference API](https://huggingface.co/pricing) that runs
> all the models from the hub for a small price. One of the perks of using the API besides running
> your own is that we provide an accelerated API that speeds up models by up to 100x (mostly likely 10x). This blogpost describes how we are achieving it.

Review comment (Member): I would split the whole sentence into two sentences, and you can play up how great we are. I added a version of how I would write it:

"Through our great enhancement of the inference API, it is possible for you to host all your custom fine-tuned models, but that's not all. We also provide an accelerated interface that speeds up models by up to 100x (most likely 10x). In this blog post, I am going to describe how we achieved this acceleration boost."


> This blogpost will assume some knowledge of the internals of transformers in general
> and will link to relevant parts for everything we don't explain here.


Review comment (Member): I always like a little agenda of what comes first, second, third... so the reader knows what's ahead of them.

> ## Running a pipeline (Open-source and already within transformers)
>
> The first speedup you should get is already implemented for you for free directly
> within [transformers](https://github.com/huggingface/transformers). And it's enabled

Review comment (Member): Remove "And".
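
For readers following along, a minimal sketch of the `pipelines` entry point the draft is pointing to; the task and default checkpoint here are illustrative, not necessarily the ones used behind the API:

```python
from transformers import pipeline

# The pipeline loads and caches the model and tokenizer once,
# so repeated calls skip that setup cost entirely.
classifier = pipeline("sentiment-analysis")

print(classifier("The accelerated API answered in a few milliseconds."))
```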

> so you can also get speedups if you do that correctly. Using `pipelines` from above
> will get you all the caching you're going to need.
>
> Overall the Fast tokenizers get a ~10x speedup for the tokenization part, so depending of

Review comment (Member): Suggest "tokenizers achieves a ~10x speedup".
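
To illustrate the fast-tokenizer point above, a minimal sketch; the checkpoint name is just an example:

```python
from transformers import AutoTokenizer

# use_fast=True selects the Rust-backed implementation from the `tokenizers`
# library, which is where the ~10x tokenization speedup comes from.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

batch = tokenizer(
    ["Speeding up inference", "for 5,000 models"],
    padding=True,
    return_tensors="np",
)
print(batch["input_ids"].shape)
```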


> ## Compiling your model (ONNX, onnxruntime, not fully open source yet)
>
> For the last part of the speedup (another ~10x) we're going to need to dive

Review comment (Member): Suggest "For the last part of the speedup (another ~10x) we are taking a deep dive into the gory details".
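
To make the ONNX Runtime step concrete, a hedged sketch of what CPU inference through onnxruntime can look like; the prior export step, the "model.onnx" file name, and the assumption that the graph's input names match the tokenizer's output keys are illustrative, not the exact setup from the post:

```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Assumes the model was already exported to ONNX under the placeholder
# name "model.onnx" (e.g. via the conversion tooling shipped with transformers).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
session = ort.InferenceSession("model.onnx")

inputs = tokenizer("ONNX Runtime on CPU", return_tensors="np")
# onnxruntime takes a dict of numpy arrays keyed by the graph's input names.
outputs = session.run(None, dict(inputs))

print(outputs[0].shape)
```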



> # Conclusion


Review comment (Member): A calculation with the values would be cool here -> how the 100x is achieved.
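
For reference, the rough arithmetic behind the headline number, using the figures quoted elsewhere in this thread: ~10x from the fast tokenizers multiplied by another ~10x from the ONNX Runtime + quantization step comes out to roughly 100x in the best case, with ~10x being the more typical end-to-end gain.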


> Overall, we showed how to get ~100x speedup in inference time compared to naïve
> implementations, by using *cached* graphs, *native* tokenizers and *compiling*
> your model with a runtime + quantization optimizations. And that the real trick

remove "And that"

> your model with a runtime + quantization optimizations. And that the real trick
> is to actually combine all those methods at the same time in your product.
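
For the quantization piece mentioned in that sentence, a minimal sketch using onnxruntime's dynamic quantization tooling; the file names are placeholders and this is one common way to quantize an exported model, not necessarily the exact method behind the API:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization stores the weights as int8 and dequantizes on the fly,
# shrinking the model and speeding up CPU inference with little accuracy loss.
quantize_dynamic(
    "model.onnx",            # placeholder: the exported float32 graph
    "model-quantized.onnx",  # placeholder: the quantized output
    weight_type=QuantType.QInt8,
)
```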

> To test all these optimizations right now, you can check out the results in our [Hosted API inference](https://huggingface.co/pricing). Be mindful that we require paid subscription for the accelerated part, but we have a 7-day free trial.

Review comment (Member): I would be a little more aggressive here.

"To use all these optimizations, you can check out our [Inference API]. To give you a head start, we offer a 7-day free trial. After that, you can choose between 3 different subscription models. We are not only optimizing all the models but also scaling the API according to your needs, up to thousands of concurrent requests. Besides, it is possible for you to switch between CPU and GPU with a simple parameter change. If you want to know more, check out our [API product page](don't have a link here -> pls add)."


> To get a sense of how fast inference becomes, we are getting very close to GPU
> inference speed on a simple CPU (GPU is only 40% faster, but it's 4x times the cost).

Review comment (Member): Maybe mention that GPU is still very attractive for cases where batching can happen?

@julien-c (Member) commented:

Merging this, but @jeffboudier I feel like the call to action for the subscription and/or Expert Acceleration program could be even more obvious

(maybe a form to input one's email address, and we then get in touch?)

cc @clmnt

@julien-c merged commit 93f2b44 into master on Jan 18, 2021
@julien-c deleted the accelerated_inference branch on January 18, 2021 at 09:25
@jeffboudier (Member) commented Jan 18, 2021 via email
