
Caching strategy for tokenizers to help achieve 1.46 ms inference on BERT QA on CPU #13835

Closed
AlexMikhalev opened this issue Oct 1, 2021 · 3 comments


@AlexMikhalev

🚀 Feature request

I managed to achieve 1.46 ms inference for BERT QA (bert-large-uncased-whole-word-masking-finetuned-squad) on a laptop with an Intel i7-10875H, running CPU-only inference on RedisAI, during the RedisLabs RedisConf 2021 hackathon (finished 15 May).
Blog write up
While the current deployment heavily relies on Redis, as you can imagine, since it was built during the RedisConf hackathon, I believe the learnings can be incorporated into the core transformers library to speed up inference for NLP tasks.

Motivation

For NLP tasks like BERT QA inference, it's not trivial to load a GPU above 50%: tokenization is normally CPU-intensive, while inference can be GPU-intensive. Caching summarization responses is easy, but QA depends on free-form user input and needs more fiddling.

Proposal

In summary:
As a starting point, incorporate a cache as a first-class citizen for all tokenisers, probably via decorators. Redis (and Redis Cluster) is a good candidate for the cache, with built-in sharding, and a Redis node in cluster configuration occupies only 20 MB of RAM.
I used the additional modules RedisGears (distributed compute) and RedisAI (storing tensors and running PyTorch inference).
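A minimal sketch of what such a tokeniser-caching decorator could look like, assuming redis-py and a local Redis instance; the names `redis_cache` and `tokenise` are illustrative, not part of transformers or Redis:

```python
import hashlib
import pickle

import redis
from transformers import AutoTokenizer

r = redis.Redis(host="localhost", port=6379)

def redis_cache(prefix: str):
    """Cache a function's pickled result in Redis, keyed by a hash of the input text."""
    def decorator(fn):
        def wrapper(text: str):
            key = prefix + hashlib.sha1(text.encode("utf-8")).hexdigest()
            hit = r.get(key)
            if hit is not None:
                return pickle.loads(hit)      # cache hit: skip tokenisation entirely
            result = fn(text)
            r.set(key, pickle.dumps(result))  # cache miss: store for the next caller
            return result
        return wrapper
    return decorator

tokenizer = AutoTokenizer.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad"
)

@redis_cache("tok:")
def tokenise(text: str) -> dict:
    return dict(tokenizer(text))  # plain dict so the result pickles cleanly
```

On a Redis Cluster the same pattern shards automatically, since keys are distributed by hash slot.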

Implementation details:
It's part of the QA search API of The Pattern, a medium-complexity project.
Full code.

  1. Convert and pre-load BERT models on each shard of the Redis Cluster (code).

  2. Pre-tokenise all potential answers and distribute them across the shards of the Redis Cluster using RedisGears (code for batch and for event-based RedisGears functions).

  3. Amend the calling API to direct the question query to the shard holding the most likely answers. Code. The call uses graph-based ranking and zrangebyscore to find the highest-ranked sentences in response to the question, then gets the relevant hashtag from the sentence key.

  4. Tokenise the question. Code. Tokenisation happens on the shard and uses the RedisGears and RedisAI integration via import redisAI.

  5. Concatenate the user question with the pre-tokenised potential answers (see the sketch after this list). Code.

  6. Run inference using RedisAI. Code. The model runs in async mode without blocking the main Redis thread, so the shard can still serve users.

  7. Select the answer using the max score and convert tokens back to words. Code.

  8. Cache the answer using Redis: the next hit on the API with the same question returns the answer in nanoseconds (see the cache-aside sketch below). Code. This function uses the 'keymiss' event.
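The sketch below stands in for steps 4, 5 and 7, using plain transformers/PyTorch rather than RedisGears/RedisAI: a freshly tokenised question is concatenated with pre-tokenised answer token IDs, and the winning span is decoded back to words. The example text and the `answer()` helper are illustrative only:

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)
model.eval()

# Step 2 equivalent: pre-tokenise a candidate answer once, offline.
# In the original this is stored on a Redis Cluster shard via RedisGears.
context_ids = tokenizer.encode(
    "RedisAI stores tensors and runs PyTorch models inside Redis.",
    add_special_tokens=False,
)

def answer(question: str) -> str:
    # Step 4: tokenise only the short question at query time.
    question_ids = tokenizer.encode(question, add_special_tokens=False)
    # Step 5: concatenate as [CLS] question [SEP] context [SEP].
    input_ids = (
        [tokenizer.cls_token_id] + question_ids
        + [tokenizer.sep_token_id] + context_ids + [tokenizer.sep_token_id]
    )
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    with torch.no_grad():
        out = model(
            input_ids=torch.tensor([input_ids]),
            token_type_ids=torch.tensor([token_type_ids]),
        )
    # Step 7: pick the max-scoring start/end and convert tokens back to words.
    start = int(out.start_logits.argmax())
    end = int(out.end_logits.argmax()) + 1
    return tokenizer.decode(input_ids[start:end])

print(answer("What does RedisAI do?"))
```

The point of the design is that context_ids never has to be recomputed per request; only the short question is tokenised at query time.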

The code was working in May and uses transformers==4.4.2. Even with all the casting into NumPy arrays for RedisAI inference, it was running under 2 ms on a Clevo laptop with an Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz, 64 GB RAM, and an SSD. When an API call hits the cache, the response time is under 600 ns.
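For step 8, a minimal cache-aside sketch, assuming redis-py; the original implementation reacts to Redis 'keymiss' keyspace events inside RedisGears rather than checking inline, and `answer()` is the hypothetical QA helper from the previous sketch:

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)

def cached_answer(question: str) -> str:
    key = "qa:" + hashlib.sha1(question.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:        # repeat question: served straight from Redis
        return hit.decode("utf-8")
    result = answer(question)  # first sighting: run the full QA pipeline
    r.set(key, result)
    return result
```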

@github-actions

github-actions bot commented Nov 1, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@AlexMikhalev (Author)

Numbers for illustration:
Pure transformers BERT QA on the i7 laptop CPU: 10.87 seconds for a single run (inference), versus 398.41 requests per second with the cached RedisAI pipeline.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Dec 4, 2021