
Question and answer regarding Decima paper #6

tegg89 opened this issue Nov 6, 2019 · 2 comments

tegg89 commented Nov 6, 2019

Question:

According to the appendix of the paper, you used supervised learning to train the graph neural networks as a sanity check.
I presume the target (label) is the critical path, which is computed in the JobDAGDuration class in job_dag.py.
However, when training the code, this class is ignored.

• So, do the GNNs (GCN & GSN) follow an unsupervised learning scheme?
• In that sense, the GNNs act as preprocessing to capture local/global summaries of the loaded jobs, and I believe the code runs with a fixed number of input jobs. Is there any way to handle a varying number of incoming jobs?

======================================================================
Hongzi's answer:

The appendix experiment is just to make sure the GNN architecture at least has the power to express existing heuristics that use the critical path. In the main paper, the Decima scheduling agent is trained end-to-end with reinforcement learning. This includes the weights of the GNN (since the entire neural network in Figure 6 is trained together). Therefore, as expected, the main training code won't invoke the critical path module during training.
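
For reference, the critical path here is simply the longest chain of cumulative task durations through a job DAG. A minimal, generic sketch of that quantity (not the repo's JobDAGDuration implementation; the data layout and names below are hypothetical):

```python
from functools import lru_cache

def critical_path_length(durations, children):
    """durations: {node: task duration}; children: {node: tuple of downstream nodes}."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        # Longest remaining work starting at `node`, including its own duration.
        downstream = [longest_from(c) for c in children.get(node, ())]
        return durations[node] + (max(downstream) if downstream else 0.0)

    # Critical path = the largest such value over all possible start nodes.
    return max(longest_from(n) for n in durations)

# Toy DAG: 0 -> 1 -> 3 and 0 -> 2 -> 3
durations = {0: 2.0, 1: 4.0, 2: 1.0, 3: 3.0}
children = {0: (1, 2), 1: (3,), 2: (3,)}
print(critical_path_length(durations, children))  # 2 + 4 + 3 = 9.0
```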

Also, Decima's GNN handles a variable number of jobs by design. Please note that the default training in our code uses streaming jobs (jobs keep arriving in the system), with the flag --num_stream_dags 200. Section 5.1 of our paper explains in detail why this design scales to arbitrary DAG shapes and sizes.
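
As a rough illustration of why a per-node message-passing embedding is agnostic to job count and DAG shape, here is a hedged numpy sketch (assumed embedding size, random shared weights, and a simplified aggregation; Decima's actual embedding scheme is the one described in Section 5.1): the same weight matrices are reused for every node, so nothing in the model depends on how many nodes the input graph has.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                            # embedding dimension (assumed)
W_msg = rng.normal(size=(D, D))  # shared per-node message transform
W_upd = rng.normal(size=(D, D))  # shared update transform

def node_embeddings(features, adjacency, steps=2):
    """features: (num_nodes, D) raw node features;
    adjacency: (num_nodes, num_nodes), adjacency[i, j] = 1 if j is a child of i."""
    e = features
    for _ in range(steps):
        messages = np.tanh(e @ W_msg)      # per-node message
        aggregated = adjacency @ messages  # sum messages over each node's children
        e = np.tanh(features + aggregated @ W_upd)
    return e                               # one embedding per node, for any graph size

# The same weights work unchanged for 3 nodes or 300 nodes:
for n in (3, 300):
    x = rng.normal(size=(n, D))
    adj = np.triu(rng.integers(0, 2, size=(n, n)), k=1)  # random DAG adjacency
    print(node_embeddings(x, adj).shape)
```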

hongzimao (Owner) commented

Thanks for sharing!


tegg89 commented Jan 17, 2020

Question:

  1. In what way did you add curriculum learning? How much does this methodology impact performance?
  2. How long does it take to train the Decima agent, and on what hardware?

======================================================================
Hongzi's answer:

The curriculum learning happens through the decay of "reset_prob", via the parameter --reset_prob_decay. In our experiments, this saves quite a bit of training time because we don't have to train over long episodes at the beginning of the training phase. You might want to play with this parameter for your problem to find the fastest convergence that still leads to the same eventual performance.
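
A hedged sketch of the idea (purely illustrative values and a multiplicative decay; the released code's exact handling of --reset_prob_decay may differ): with a per-step reset probability, episode lengths are roughly geometric, so decaying that probability gradually exposes the agent to longer and longer episodes as training progresses.

```python
import numpy as np

# Hypothetical schedule values, for illustration only.
reset_prob = 1e-3      # initial per-step probability of ending the episode early
decay = 0.95           # decay applied to reset_prob after each training iteration
min_reset_prob = 1e-5  # floor so episodes stay bounded in this sketch

rng = np.random.default_rng(0)
for iteration in range(5):
    # With per-step reset probability p, episode length is geometric with mean 1/p,
    # so decaying p is what lengthens the episodes the agent trains on.
    episode_len = rng.geometric(reset_prob)
    print(f"iter {iteration}: reset_prob={reset_prob:.1e}, episode length={episode_len}")
    reset_prob = max(min_reset_prob, reset_prob * decay)
```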

We didn't do much optimization; the last time we ran the released code on CPU, it took 5 days to converge.
