-What problem does this paper try to solve?
-Why is this an important and hard problem?
-Why can’t previous work solve this problem?
-What is novel in this paper?
-Does it show good results?
Step 2: Summarize
Summary for high-level ideas
Design a distributed ML system that supports adaptation during training.
Problems/Motivations: what problem does this paper solve?
Empirical parameter tuning is dataset-specific, model-specific, and cluster-specific.
Adapting parameters is hard to realise. Problems in previous systems:
1) No built-in mechanisms for adaptation
2) High monitoring overhead
3) Expensive state management under change
Challenges: why is this problem hard to solve?
How to support different types of adaptation? e.g. AutoScaling → supports only one type of adaptation
How to adapt based on a large volume of monitoring data? e.g. MLFlow → computes statistical metrics over this amount of data → consumes substantial compute resources and network bandwidth
How to change parameters of stateful workers? In existing systems, users typically must checkpoint and restore all state when changing configuration parameters → can take hundreds of seconds
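To make the monitoring challenge concrete, here is a minimal sketch of one common way to keep monitoring traffic small: each worker reduces its raw samples to a fixed-size partial summary, and only those summaries are combined. This is an illustration under my own assumptions, not the paper's implementation; the function names (`local_summary`, `combine`, `mean_and_variance`) are hypothetical, and the element-wise sum in `combine` merely stands in for what an all-reduce would compute.

```python
from typing import List, Tuple

Summary = Tuple[float, float, int]  # (sum, sum of squares, count)


def local_summary(values: List[float]) -> Summary:
    """Reduce one worker's raw samples to a fixed-size partial result."""
    return (sum(values), sum(v * v for v in values), len(values))


def combine(summaries: List[Summary]) -> Summary:
    """Element-wise sum of the partial results (stand-in for an all-reduce)."""
    total = sum(s[0] for s in summaries)
    total_sq = sum(s[1] for s in summaries)
    count = sum(s[2] for s in summaries)
    return (total, total_sq, count)


def mean_and_variance(summaries: List[Summary]) -> Tuple[float, float]:
    """Global mean and population variance from the combined summary."""
    total, total_sq, count = combine(summaries)
    mean = total / count
    return mean, total_sq / count - mean * mean


# Hypothetical per-worker samples of some monitored metric.
workers = [[1.0, 2.0], [3.0, 4.0]]
mean, var = mean_and_variance([local_summary(w) for w in workers])
```

Each worker ships three numbers regardless of how many samples it collected, so the cost of the global statistic no longer grows with the volume of raw monitoring data.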
Methods: what are the key techniques in the paper?
Expressing adaptation policies → adapt configuration parameters based on monitored metrics
Embedding monitoring operators inside dataflow → asynchronous collective communication layer + embed its functions as monitoring operators in the dataflow graph + NCCL to accelerate the communication layer
Distributed mechanisms for parameter adaptation → decouple system parameters from dataflow state
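The three techniques above can be sketched together in plain Python. This is a toy illustration, not KungFu's API: `TrainingConfig`, `Policy`, `grow_batch_on_noise`, the metric name `gradient_noise`, and the cap of 4096 are all hypothetical names and values chosen for the example. The point is that a policy maps a monitored metric to a new configuration, while the model state lives in a separate object that a configuration change never touches.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TrainingConfig:
    """System parameters, kept apart from model state (hypothetical fields)."""
    batch_size: int
    num_workers: int


@dataclass
class Policy:
    """One adaptation rule: read a monitored metric, return a (possibly new) config."""
    metric: str
    rule: Callable[[float, TrainingConfig], TrainingConfig]


def grow_batch_on_noise(noise: float, cfg: TrainingConfig) -> TrainingConfig:
    # Illustrative rule only: double the batch size (capped at 4096)
    # whenever the monitored value exceeds the current batch size.
    if noise > cfg.batch_size:
        return TrainingConfig(min(cfg.batch_size * 2, 4096), cfg.num_workers)
    return cfg


def apply_policies(policies: List[Policy], metrics: Dict[str, float],
                   cfg: TrainingConfig) -> TrainingConfig:
    """Apply every policy whose metric was reported in this step."""
    for policy in policies:
        if policy.metric in metrics:
            cfg = policy.rule(metrics[policy.metric], cfg)
    return cfg


def run(metrics_per_step: List[Dict[str, float]], policies: List[Policy],
        cfg: TrainingConfig) -> Tuple[Dict[str, int], List[int]]:
    model_state = {"step": 0}  # stands in for weights and optimizer state
    history = []
    for metrics in metrics_per_step:
        cfg = apply_policies(policies, metrics, cfg)  # config changes, state is untouched
        model_state["step"] += 1                      # training continues without a restore
        history.append(cfg.batch_size)
    return model_state, history


monitored = [{"gradient_noise": 10.0}, {"gradient_noise": 100.0},
             {"gradient_noise": 100.0}, {"gradient_noise": 500.0}]
state, sizes = run(monitored, [Policy("gradient_noise", grow_batch_on_noise)],
                   TrainingConfig(batch_size=64, num_workers=4))
```

Because the configuration is a separate immutable value, changing it does not require checkpointing and restoring `model_state`, which is the cost the notes attribute to existing systems.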
KungFu: Making Training in Distributed Machine Learning Adaptive
Notes in Chinese: In Zhihu(知乎)
Notes in English: In my Notion
How to read a paper:
Step 1: Keep in mind