[OSDI'20]Kungfu #15

Open
Monstertail opened this issue Apr 10, 2023 · 0 comments
Labels
System4ML System for ML

Kungfu: Making Training in Distributed Machine Learning Adaptive

Notes in Chinese: In Zhihu(知乎)

Notes in English: In my Notion

How to read a paper:

Step 1: Keep in mind

-What problem does this paper try to solve?
-Why is this an important and hard problem?
-Why can’t previous work solve this problem?
-What is novel in this paper?
-Does it show good results?

Step 2: Summarize

    • Summary for high-level ideas
      • Design a distributed ML system that supports adapting configuration parameters during training.
    • Problems/Motivations: what problem does this paper solve?
      • Empirical parameter tuning is dataset-specific, model-specific, and cluster-specific.
      • Adapting parameters is hard to realise. Problems in previous systems:
      • 1) No built-in mechanisms for adaptation.
      • 2) High monitoring overhead.
      • 3) Expensive state management under change.
    • Challenges: why is this problem hard to solve?
      • How to support different types of adaptation? e.g. AutoScaling → supports only one type of adaptation.
      • How to adapt based on a large volume of monitoring data? e.g. MLFlow → computes statistical metrics over this amount of data → consumes substantial compute resources and network bandwidth.
      • How to change parameters of stateful workers? In existing systems, users typically must checkpoint and restore all state when changing configuration parameters → can take hundreds of seconds.
    • Methods: what are the key techniques in the paper?
      • Expressing adaptation policies → adapt configuration parameters based on monitored metrics.
      • Embedding monitoring operators inside the dataflow → asynchronous collective communication layer + embed its functions as monitoring operators in the dataflow graph + NCCL to accelerate the communication layer.
      • Distributed mechanisms for parameter adaptation → decouple system parameters from dataflow state.
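
The first and third techniques above can be illustrated with a small sketch. It is not KungFu's actual API: `TrainingConfig`, `MovingAverage`, `BatchSizePolicy`, `after_step`, and `run` are hypothetical names made up for these notes, and the "double the batch size when a smoothed metric crosses a threshold" rule is only an assumed example policy. The sketch shows a policy that reads a cheap running statistic after each step and changes a system parameter that is stored separately from the model state.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical system parameters, kept apart from model/dataflow state."""
    batch_size: int = 32


class MovingAverage:
    """Cheap running statistic; stands in for a monitoring operator (assumption)."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha
        self.value = None

    def update(self, sample: float) -> float:
        # The first sample initialises the average; later samples are blended in.
        if self.value is None:
            self.value = sample
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value


class BatchSizePolicy:
    """Hypothetical policy: double the batch size while the smoothed
    metric stays above a threshold, up to a fixed cap."""

    def __init__(self, threshold: float, max_batch_size: int):
        self.threshold = threshold
        self.max_batch_size = max_batch_size
        self.metric = MovingAverage()

    def after_step(self, config: TrainingConfig, observed: float) -> None:
        smoothed = self.metric.update(observed)
        if smoothed > self.threshold and config.batch_size < self.max_batch_size:
            # Only the config changes; no model state is checkpointed or restored.
            config.batch_size = min(config.batch_size * 2, self.max_batch_size)


def run(metrics, policy, config):
    """Feed one monitored value per training step; record the batch size used."""
    history = []
    for observed in metrics:
        policy.after_step(config, observed)
        history.append(config.batch_size)
    return history


if __name__ == "__main__":
    policy = BatchSizePolicy(threshold=2.0, max_batch_size=128)
    print(run([1.0, 1.0, 5.0, 5.0, 5.0], policy, TrainingConfig()))
```

In a real distributed setting the monitored value would come from a collective operation across workers rather than from a local list, and changing the parameter would also need to be coordinated across workers. Both are left out of this sketch.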
@Monstertail Monstertail added the System4ML System for ML label Apr 10, 2023
@Monstertail Monstertail changed the title [OSDI'20]Kungfu:Making Training in Distributed Machine Learning Adaptive [OSDI'20]Kungfu Apr 10, 2023