According to the appendix of the paper, you used supervised learning to train the graph neural networks as a sanity check.
I presume the target (label) is the critical path, which is computed in the JobDAGDuration class in the job_dag.py file.
When running the training code, this class is ignored.
• So, are the GNNs (GCN & GSN) instead trained with an unsupervised learning scheme?
• In that sense, the GNNs act as preprocessing to capture local/global summaries of the loaded jobs, and I believe the code runs with a fixed number of input jobs. Is there any way to handle a varying number of incoming jobs?
The appendix experiment is just to make sure the GNN architecture at least has the power to express existing heuristics that use the critical path. In the main paper, the Decima scheduling agent is trained end-to-end with reinforcement learning. This includes the weights of the GNN (the entire neural network in figure 6 is trained together). Therefore, as expected, the main training code won’t invoke the critical path module during training.
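To make the supervised sanity check concrete: a critical-path label for each node can be computed with a longest-path pass over the DAG. The sketch below is a hypothetical stand-in for what a class like JobDAGDuration could provide as a supervised target; the node names, durations, and function name are illustrative, not the repo's actual API.

```python
# Sketch: per-node critical-path length (longest remaining duration)
# as a supervised-learning target for the GNN sanity check.
from collections import defaultdict

def critical_path_lengths(durations, edges):
    """For each node, the longest total duration over any path
    starting at that node (the node's own duration included)."""
    children = defaultdict(list)
    indeg = defaultdict(int)
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1
    # Topological order via Kahn's algorithm.
    order = []
    frontier = [n for n in durations if indeg[n] == 0]
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v in children[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    # Process leaves first so children are ready before parents.
    cp = {}
    for u in reversed(order):
        cp[u] = durations[u] + max((cp[v] for v in children[u]), default=0)
    return cp

# A small diamond DAG: a -> b, a -> c, b -> d, c -> d.
durs = {"a": 2, "b": 5, "c": 1, "d": 3}
cp = critical_path_lengths(durs, [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
# cp["a"] == 10: the longest path is a -> b -> d (2 + 5 + 3).
```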
Also, Decima’s GNN handles a variable number of jobs by design. Notice that the default training in our code uses streaming jobs (jobs keep arriving in the system), via the flag --num_stream_dags 200. Section 5.1 of our paper explains in detail why this design scales to arbitrary DAG shapes and sizes.
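The size-independence comes from parameter sharing: the same learned transforms are applied at every node, so the network's parameter count does not depend on the graph. A minimal sketch of this idea, roughly following the aggregation form described in Section 5.1 (here W_f and W_g are random stand-ins for the paper's learned non-linear transforms f and g, and the shapes are arbitrary):

```python
# Sketch: per-node message passing with shared weights, so the same
# network embeds DAGs of any size or shape.
import numpy as np

rng = np.random.default_rng(0)
W_f = rng.standard_normal((8, 8)) * 0.1   # "f": transform child embeddings
W_g = rng.standard_normal((16, 8)) * 0.1  # "g": combine aggregate with own features

def embed(features, children, order):
    """features: {node: np.ndarray of shape (8,)};
    children: {node: [child nodes]};
    order: reverse-topological order (leaves first)."""
    emb = {}
    for v in order:
        # Sum messages from children (zero vector if v is a leaf).
        agg = sum((np.tanh(emb[u] @ W_f) for u in children.get(v, [])),
                  np.zeros(8))
        emb[v] = np.tanh(np.concatenate([agg, features[v]]) @ W_g)
    return emb
```

Because W_f and W_g are reused at every node, adding or removing jobs just adds or removes nodes to iterate over; nothing about the network itself changes.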
The curriculum learning happens through the decay of reset_prob, via the parameter --reset_prob_decay. In our experiments, this saves quite a bit of training time because we don’t have to train over long episodes at the beginning of the training phase. You might want to play with this parameter on your problem to find the fastest convergence that still reaches the same eventual performance.
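The intuition: if each simulator step terminates the episode with probability p, episode length is geometric with mean 1/p, so starting with a larger reset_prob gives short, cheap episodes early on, and decaying it lengthens them. The sketch below uses a hypothetical linear decay with made-up values; the released code's exact schedule and defaults may differ.

```python
# Sketch: shorter episodes early in training via a decaying per-step
# reset probability (a simple curriculum).
def episode_length_expectation(reset_prob):
    """Mean episode length when every step resets with prob p: 1/p."""
    return 1.0 / reset_prob

# Illustrative linear decay of reset_prob over training iterations.
reset_prob, decay, floor = 1e-3, 1e-6, 1e-5
for it in range(3):
    # ... run one training iteration; the simulated episode resets
    # with probability `reset_prob` at every step ...
    reset_prob = max(floor, reset_prob - decay)
```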
We didn’t do much optimization; the last time we ran the released code on CPU, it took 5 days to converge.