[OSDI'23]AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
Notes in Chinese: In Zhihu (知乎)
Notes in English: In my Notion
How to read a paper:
Step 1: Keep in mind
- What problem does this paper try to solve?
- Why is this an important and hard problem?
- Why can’t previous work solve this problem?
- What is novel in this paper?
- Does it show good results?
Step 2: Summarize
Summary for high-level ideas
Reduce the size of embeddings needed for the same DLRM accuracy via in-training embedding pruning
Equivalently: for a given embedding size, AdaEmbed scalably identifies and retains the embeddings that are more important to model accuracy at each point during training.
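A minimal sketch of this budget-based retention idea, assuming one importance score per embedding row and a fixed row budget (function and variable names are illustrative, not AdaEmbed's API):

```python
import numpy as np

def select_rows_to_keep(importance: np.ndarray, row_budget: int) -> np.ndarray:
    """Return a boolean mask keeping only the `row_budget` most important rows."""
    if row_budget >= importance.size:
        return np.ones(importance.size, dtype=bool)
    # Importance of the row_budget-th largest score acts as the pruning threshold.
    kth = importance.size - row_budget
    threshold = np.partition(importance, kth)[kth]
    keep = importance >= threshold
    # Drop ties at the threshold until exactly `row_budget` rows survive.
    extra = int(keep.sum()) - row_budget
    if extra > 0:
        tied = np.flatnonzero((importance == threshold) & keep)[:extra]
        keep[tied] = False
    return keep

# Example: 10 embedding rows, memory budget for only 4 of them.
scores = np.array([0.1, 3.0, 0.2, 5.0, 0.05, 2.5, 0.0, 4.1, 0.3, 0.3])
print(select_rows_to_keep(scores, row_budget=4).nonzero()[0])  # -> [1 3 5 7]
```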
Problems/Motivations: what problem does this paper solve?
While more embedding rows typically enable better model accuracy by covering more feature instances, they lead to large deployment costs and slow model execution.
The key insight is that the access patterns and weights of different embeddings are heterogeneous across embedding rows and change dynamically over the training process, implying varying embedding importance with respect to model accuracy.
Challenges: why is this problem hard to solve?
DLRMs often have stringent throughput and latency requirements for (online) training and inference, but gigantic embeddings make computation, communication, and memory optimizations challenging.
To achieve the desired model throughput, practical deployments often have to use hundreds of GPUs just to hold the embeddings.
Designing better embeddings (e.g., the number of per-feature embedding rows and which embedding weights to retain) remains challenging because the exploration space grows with embedding size and requires intensive manual effort.
Methods: what are the key techniques in the paper?
AdaEmbed considers embeddings with higher runtime access frequencies and larger training gradients to be more important, and it dynamically prunes less important embeddings at scale to automatically determine per-feature embeddings.
Challenge 1: Identifying important embeddings out of billions is non-trivial.
Embedding Monitor: Identify Important Embeddings (by access frequency and the L2-norm of gradients); a minimal sketch follows this list.
Challenge 2: Enforcing in-training pruning after identifying important embeddings is not straightforward either.
AdaEmbed Coordinator: Prune at the Right Time (trading off pruning overhead against pruning quality); see the sketch after this list.
Memory Manager: Prune Weights at Scale (a Virtually Hashed Physically Indexed, VHPI, layout avoids costly memory reallocation); see the sketch after this list.
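A minimal sketch of the Embedding Monitor's importance signal, assuming a decayed access counter and a decayed gradient-norm accumulator are multiplied together (the exact scoring rule and all names are assumptions, not the paper's code):

```python
import numpy as np

class EmbeddingMonitor:
    """Track a per-row importance signal from access frequency and gradient L2-norm."""

    def __init__(self, num_rows: int, decay: float = 0.99):
        self.freq = np.zeros(num_rows)        # decayed access counts
        self.grad_norm = np.zeros(num_rows)   # decayed gradient L2-norms
        self.decay = decay

    def record_batch(self, row_ids: np.ndarray, row_grads: np.ndarray) -> None:
        """row_ids: embedding rows touched in this batch; row_grads: their gradient rows."""
        self.freq *= self.decay
        self.grad_norm *= self.decay
        np.add.at(self.freq, row_ids, 1.0)
        np.add.at(self.grad_norm, row_ids, np.linalg.norm(row_grads, axis=1))

    def importance(self) -> np.ndarray:
        """Rows accessed more often and with larger gradients score higher."""
        return self.freq * self.grad_norm
```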
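A minimal sketch of the Coordinator's prune-timing trade-off, assuming a pruning pass is only re-triggered once the set of important rows has drifted past a threshold (the drift metric and threshold value are assumptions):

```python
def should_prune(prev_important: set, now_important: set,
                 drift_threshold: float = 0.1) -> bool:
    """Trigger a pruning pass only when the set of important rows has drifted enough.

    Pruning too often wastes time (overhead); pruning too rarely keeps stale rows
    in memory (quality).
    """
    if not prev_important:
        return True  # never pruned before
    newly_important = len(now_important - prev_important)
    drift = newly_important / len(prev_important)
    return drift >= drift_threshold

# Example: only 3 of 100 kept rows changed -> 3% drift, below threshold, skip pruning.
prev = set(range(100))
now = set(range(3, 103))
print(should_prune(prev, now))  # -> False
```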
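A minimal sketch of the Virtually Hashed Physically Indexed (VHPI) idea, assuming a hash-based indirection table in front of a fixed-size physical weight buffer (all names are illustrative):

```python
import numpy as np

class VHPIEmbedding:
    """Feature IDs hash into a large virtual slot space; an indirection table maps
    virtual slots to rows of a fixed physical weight buffer. Pruning or admitting
    an embedding only edits the indirection table, so the physical buffer is never
    reallocated.
    """

    FREE = -1

    def __init__(self, num_virtual_slots: int, num_physical_rows: int, dim: int):
        self.v2p = np.full(num_virtual_slots, self.FREE, dtype=np.int64)
        self.weights = np.zeros((num_physical_rows, dim), dtype=np.float32)
        self.free_rows = list(range(num_physical_rows))

    def _slot(self, feature_id: int) -> int:
        return hash(feature_id) % len(self.v2p)

    def lookup(self, feature_id: int):
        """Return the embedding row for this feature, or None if it was pruned."""
        row = self.v2p[self._slot(feature_id)]
        return None if row == self.FREE else self.weights[row]

    def admit(self, feature_id: int) -> bool:
        """Assign a free physical row to an important feature, if any row is free."""
        slot = self._slot(feature_id)
        if self.v2p[slot] != self.FREE:
            return True                  # already has a physical row
        if not self.free_rows:
            return False                 # budget exhausted; prune something first
        row = self.free_rows.pop()
        self.weights[row] = 0.0          # re-initialize the reused row
        self.v2p[slot] = row
        return True

    def evict(self, feature_id: int) -> None:
        """Prune a feature: release its physical row without touching the buffer size."""
        slot = self._slot(feature_id)
        row = self.v2p[slot]
        if row != self.FREE:
            self.v2p[slot] = self.FREE
            self.free_rows.append(row)
```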