
Caching strategy for tokenizers to help achieve 1.46 ms inference on BERT QA on CPU #13835

Closed
AlexMikhalev opened this issue Oct 1, 2021 · 3 comments


@AlexMikhalev

🚀 Feature request

I managed to achieve 1.46 ms inference for BERT QA (bert-large-uncased-whole-word-masking-finetuned-squad) on a laptop with an Intel i7-10875H, running CPU-only inference on RedisAI, during the RedisLabs RedisConf 2021 hackathon (finished 15 May).
Blog write up
While the current deployment heavily relies on Redis, as you can imagine, since it was built during the RedisConf hackathon, I believe the learnings can be incorporated into the core transformers library to speed up inference for NLP tasks.

Motivation

For NLP tasks like BERT QA inference, it's not trivial to load a GPU above 50%: tokenization is normally CPU-intensive, while inference can be GPU-intensive. Caching summarization responses is easy, but QA depends on free-form user input and needs more fiddling.

Proposal

In summary:
As a starting point, incorporate a cache as a first-class citizen for all tokenisers, probably via decorators. Redis (and Redis Cluster) is a good candidate for the cache, with built-in sharding, and a Redis node in cluster configuration occupies only 20 MB of RAM.
I used the additional modules RedisGears (distributed compute) and RedisAI (storing tensors and running PyTorch inference).
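A minimal sketch of what such a tokeniser-caching decorator could look like, assuming redis-py and a local Redis instance; the names `redis_cache` and `tokenise` are illustrative, not part of transformers or Redis:

```python
import hashlib
import pickle

import redis
from transformers import AutoTokenizer

r = redis.Redis(host="localhost", port=6379)

def redis_cache(prefix: str):
    """Cache a function's pickled result in Redis, keyed by a hash of the input text."""
    def decorator(fn):
        def wrapper(text: str):
            key = prefix + hashlib.sha1(text.encode("utf-8")).hexdigest()
            hit = r.get(key)
            if hit is not None:
                return pickle.loads(hit)      # cache hit: skip tokenisation entirely
            result = fn(text)
            r.set(key, pickle.dumps(result))  # cache miss: store for the next caller
            return result
        return wrapper
    return decorator

tokenizer = AutoTokenizer.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad"
)

@redis_cache("tok:")
def tokenise(text: str) -> dict:
    return dict(tokenizer(text))  # plain dict so the result pickles cleanly
```

On a Redis Cluster the same pattern shards automatically, since keys are distributed by hash slot.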

Implementation details:
It's part of the QA search API of The Pattern, a medium-complexity project.
Full code.

  1. Convert and pre-load BERT models on each shard of the Redis Cluster (code).

  2. Pre-tokenise all potential answers and distribute them across the shards of the Redis Cluster using RedisGears (code for batch and for event-based RedisGears functions).

  3. Amend the calling API to direct the question query to the shard holding the most likely answers. Code. The call uses graph-based ranking and zrangebyscore to find the highest-ranked sentences in response to the question, then gets the relevant hashtag from the sentence key.

  4. Tokenise the question. Code. Tokenisation happens on the shard and uses the RedisGears and RedisAI integration via import redisAI.

  5. Concatenate the user question with the pre-tokenised potential answers (see the sketch after this list). Code.

  6. Run inference using RedisAI. Code. The model runs in async mode without blocking the main Redis thread, so the shard can still serve users.

  7. Select the answer using the max score and convert tokens back to words. Code.

  8. Cache the answer using Redis: the next hit on the API with the same question returns the answer in nanoseconds (see the cache-aside sketch below). Code. This function uses the 'keymiss' event.
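The sketch below stands in for steps 4, 5 and 7, using plain transformers/PyTorch rather than RedisGears/RedisAI: a freshly tokenised question is concatenated with pre-tokenised answer token IDs, and the winning span is decoded back to words. The example text and the `answer()` helper are illustrative only:

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)
model.eval()

# Step 2 equivalent: pre-tokenise a candidate answer once, offline.
# In the original this is stored on a Redis Cluster shard via RedisGears.
context_ids = tokenizer.encode(
    "RedisAI stores tensors and runs PyTorch models inside Redis.",
    add_special_tokens=False,
)

def answer(question: str) -> str:
    # Step 4: tokenise only the short question at query time.
    question_ids = tokenizer.encode(question, add_special_tokens=False)
    # Step 5: concatenate as [CLS] question [SEP] context [SEP].
    input_ids = (
        [tokenizer.cls_token_id] + question_ids
        + [tokenizer.sep_token_id] + context_ids + [tokenizer.sep_token_id]
    )
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    with torch.no_grad():
        out = model(
            input_ids=torch.tensor([input_ids]),
            token_type_ids=torch.tensor([token_type_ids]),
        )
    # Step 7: pick the max-scoring start/end and convert tokens back to words.
    start = int(out.start_logits.argmax())
    end = int(out.end_logits.argmax()) + 1
    return tokenizer.decode(input_ids[start:end])

print(answer("What does RedisAI do?"))
```

The point of the design is that context_ids never has to be recomputed per request; only the short question is tokenised at query time.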

The code was working in May and uses transformers==4.4.2. Even with all the casting into NumPy arrays for RedisAI inference, it was running under 2 ms on a Clevo laptop with an Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz, 64 GB RAM, and an SSD. When an API call hits the cache, the response time is under 600 ns.
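For step 8, a minimal cache-aside sketch, assuming redis-py; the original implementation reacts to Redis 'keymiss' keyspace events inside RedisGears rather than checking inline, and `answer()` is the hypothetical QA helper from the previous sketch:

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)

def cached_answer(question: str) -> str:
    key = "qa:" + hashlib.sha1(question.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:        # repeat question: served straight from Redis
        return hit.decode("utf-8")
    result = answer(question)  # first sighting: run the full QA pipeline
    r.set(key, result)
    return result
```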

@github-actions

github-actions bot commented Nov 1, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@AlexMikhalev (Author)

Numbers for illustration:
Pure transformers BERT QA on the i7 laptop CPU: 10.87 seconds for a single run (inference), versus 398.41 requests per second with the cached RedisAI pipeline.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Dec 4, 2021