
Question and answer regarding Decima paper #6

tegg89 opened this issue Nov 6, 2019 · 2 comments

tegg89 commented Nov 6, 2019

Question:

According to the appendix of the paper, you used supervised learning to train the graph neural networks as a sanity check.
I presume the target (label) is the critical path, which is computed in the JobDAGDuration class in job_dag.py.
However, when training the code, this class is ignored.

• So, do the GNNs (GCN & GSN) follow an unsupervised learning scheme?
• In that sense, the GNNs act as preprocessing to capture local/global summaries of the loaded jobs, and I believe the code runs with a fixed number of input jobs. Is there any way to handle a varying number of incoming jobs?

======================================================================
Hongzi's answer:

The appendix experiment is just to make sure the GNN architecture at least has the power to express existing heuristics that use the critical path. In the main paper, the Decima scheduling agent is trained end-to-end with reinforcement learning. This includes the weights of the GNN (since the entire neural network in Figure 6 is trained together). Therefore, as expected, the main training code won't invoke the critical path module during training.
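
For reference, the critical path here is simply the longest chain of cumulative task durations through a job DAG. A minimal, generic sketch of that quantity (not the repo's JobDAGDuration implementation; the data layout and names below are hypothetical):

```python
from functools import lru_cache

def critical_path_length(durations, children):
    """durations: {node: task duration}; children: {node: tuple of downstream nodes}."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        # Longest remaining work starting at `node`, including its own duration.
        downstream = [longest_from(c) for c in children.get(node, ())]
        return durations[node] + (max(downstream) if downstream else 0.0)

    # Critical path = the largest such value over all possible start nodes.
    return max(longest_from(n) for n in durations)

# Toy DAG: 0 -> 1 -> 3 and 0 -> 2 -> 3
durations = {0: 2.0, 1: 4.0, 2: 1.0, 3: 3.0}
children = {0: (1, 2), 1: (3,), 2: (3,)}
print(critical_path_length(durations, children))  # 2 + 4 + 3 = 9.0
```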

Also, Decima's GNN handles a variable number of jobs by design. Please note that the default training in our code uses streaming jobs (jobs keep arriving in the system), with the flag --num_stream_dags 200. Section 5.1 of our paper explains in detail why this design scales to arbitrary DAG shapes and sizes.
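
As a rough illustration of why a per-node message-passing embedding is agnostic to job count and DAG shape, here is a hedged numpy sketch (assumed embedding size, random shared weights, and a simplified aggregation; Decima's actual embedding scheme is the one described in Section 5.1): the same weight matrices are reused for every node, so nothing in the model depends on how many nodes the input graph has.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                            # embedding dimension (assumed)
W_msg = rng.normal(size=(D, D))  # shared per-node message transform
W_upd = rng.normal(size=(D, D))  # shared update transform

def node_embeddings(features, adjacency, steps=2):
    """features: (num_nodes, D) raw node features;
    adjacency: (num_nodes, num_nodes), adjacency[i, j] = 1 if j is a child of i."""
    e = features
    for _ in range(steps):
        messages = np.tanh(e @ W_msg)      # per-node message
        aggregated = adjacency @ messages  # sum messages over each node's children
        e = np.tanh(features + aggregated @ W_upd)
    return e                               # one embedding per node, for any graph size

# The same weights work unchanged for 3 nodes or 300 nodes:
for n in (3, 300):
    x = rng.normal(size=(n, D))
    adj = np.triu(rng.integers(0, 2, size=(n, n)), k=1)  # random DAG adjacency
    print(node_embeddings(x, adj).shape)
```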

hongzimao (Owner) commented

Thanks for sharing!


tegg89 commented Jan 17, 2020

Question:

  1. In what way did you add curriculum learning? How much does this methodology impact performance?
  2. How long does it take to train the Decima agent, and on what hardware?

======================================================================
Hongzi's answer:

The curriculum learning happens through the decay of "reset_prob", via the parameter --reset_prob_decay. In our experiments, this saves quite a bit of training time because we don't have to train over long episodes at the beginning of the training phase. You might want to play with this parameter for your problem to find the fastest convergence that still leads to the same eventual performance.
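
A hedged sketch of the idea (purely illustrative values and a multiplicative decay; the released code's exact handling of --reset_prob_decay may differ): with a per-step reset probability, episode lengths are roughly geometric, so decaying that probability gradually exposes the agent to longer and longer episodes as training progresses.

```python
import numpy as np

# Hypothetical schedule values, for illustration only.
reset_prob = 1e-3      # initial per-step probability of ending the episode early
decay = 0.95           # decay applied to reset_prob after each training iteration
min_reset_prob = 1e-5  # floor so episodes stay bounded in this sketch

rng = np.random.default_rng(0)
for iteration in range(5):
    # With per-step reset probability p, episode length is geometric with mean 1/p,
    # so decaying p is what lengthens the episodes the agent trains on.
    episode_len = rng.geometric(reset_prob)
    print(f"iter {iteration}: reset_prob={reset_prob:.1e}, episode length={episode_len}")
    reset_prob = max(min_reset_prob, reset_prob * decay)
```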

We didn't do much optimization; the last time we ran the released code on CPU, it took 5 days to converge.
