[OSDI'20]Kungfu #15

Monstertail · 2023-04-10T02:46:44Z

Kungfu: Making Training in Distributed Machine Learning Adaptive

Notes in Chinese: on Zhihu (知乎)

Notes in English: in my Notion

How to read a paper:

Step 1: Keep in mind

- What problem does this paper try to solve?
- Why is this an important and hard problem?
- Why can’t previous work solve this problem?
- What is novel in this paper?
- Does it show good results?

Step 2: Summarize

- Summary of high-level ideas
  - Design a distributed ML system that supports adaptation.
- Problems/Motivations: what problem does this paper solve?
  - Empirical parameter tuning is dataset-specific, model-specific, and cluster-specific.
  - Adapting parameters is hard to realise. Problems in previous systems:
    - 1) No built-in mechanisms for adaptation
    - 2) High monitoring overhead
    - 3) Expensive state management under change
- Challenges: why is this problem hard to solve?
  - How to support different types of adaptation? E.g., AutoScaling → supports only one type of adaptation.
  - How to adapt based on a large volume of monitoring data? E.g., MLFlow → computes statistical metrics over this amount of data → consumes substantial compute resources and network bandwidth.
  - How to change the parameters of stateful workers? In existing systems, users typically must checkpoint and restore all state when changing configuration parameters → can take hundreds of seconds.
- Methods: what are the key techniques in the paper?
  - Expressing adaptation policies → adapt configuration parameters based on monitored metrics
  - Embedding monitoring operators inside the dataflow → an asynchronous collective communication layer + its functions embedded as monitoring operators in the dataflow graph + NCCL to accelerate the communication layer
  - Distributed mechanisms for parameter adaptation → decouple system parameters from dataflow state
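
To make the three methods above concrete, here is a minimal toy sketch in plain Python. It is not KungFu's API: every name in it (`SystemParams`, `Worker`, `double_batch_when_noisy`, `run`) is hypothetical, and the "noise" metric is a stand-in for a real monitored signal. It only illustrates the shape of the idea: the metric is produced inside the training step, a policy maps metrics to new configuration parameters, and the parameters are kept separate from worker state so that changing them needs no checkpoint and restore.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SystemParams:
    """Configuration a policy may change (hypothetical names)."""
    batch_size: int = 32
    num_workers: int = 4


@dataclass
class Worker:
    """Holds the training state; parameter changes never touch it."""
    weights: List[float] = field(default_factory=lambda: [0.0, 0.0])
    step: int = 0

    def train_step(self, params: SystemParams) -> Dict[str, float]:
        # Stand-in for a real training step. The metric is emitted from
        # inside the step instead of being shipped to an external system.
        self.step += 1
        self.weights = [w + 0.1 for w in self.weights]
        # Toy metric that grows with the step count (assumption for the demo).
        return {"noise": float(self.step)}


# A policy maps (monitored metrics, current params) to new params.
Policy = Callable[[Dict[str, float], SystemParams], SystemParams]


def double_batch_when_noisy(metrics: Dict[str, float],
                            params: SystemParams) -> SystemParams:
    """Toy policy: double the batch size once the metric reaches a threshold."""
    if metrics["noise"] >= 3.0 and params.batch_size < 128:
        return SystemParams(params.batch_size * 2, params.num_workers)
    return params


def run(steps: int, policy: Policy) -> Tuple[Worker, SystemParams, List[int]]:
    params = SystemParams()
    worker = Worker()
    history: List[int] = []
    for _ in range(steps):
        metrics = worker.train_step(params)
        # Swapping params leaves worker.weights and worker.step intact.
        params = policy(metrics, params)
        history.append(params.batch_size)
    return worker, params, history


if __name__ == "__main__":
    worker, params, history = run(5, double_batch_when_noisy)
    print(history, worker.step)
```

In this sketch the worker keeps training across parameter changes because the policy only ever replaces the `SystemParams` value; a real system would additionally have to agree on the new parameters across workers, which the toy loop does not model.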

Monstertail added the System4ML System for ML label Apr 10, 2023

Monstertail changed the title ~~[OSDI'20]Kungfu:Making Training in Distributed Machine Learning Adaptive~~ [OSDI'20]Kungfu Apr 10, 2023
