Train issues #5

Open
IdanAzuri opened this issue Nov 4, 2019 · 3 comments

IdanAzuri commented Nov 4, 2019

Hi, I'm trying to run your code on 4 Tesla P100 GPUs (the same hardware you mentioned in your paper), but the time per epoch is 22 seconds, and after 200 epochs the GPUs run out of memory. Any idea why this happens? Is there any configuration that would make it run faster?
Thanks,
Idan

hongzimao (Owner) commented

The training batches the graphs within one episode (an episode ends stochastically according to reset_prob). For long episodes with large graphs, this can overwhelm the GPU's memory. Could you check the actual data size in a batch to see if this is the problem? (We tried to keep the graphs as sparse as possible, but there might still be room for improving efficiency.)
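One rough way to check that, as a generic numpy sketch rather than anything taken from this repo (feed_dict here just stands for whatever placeholder-to-array mapping the training loop passes to session.run):

```python
import numpy as np

def feed_dict_size_mb(feed_dict):
    # Rough total size (in MB) of the dense numpy arrays fed into one training step.
    total_bytes = sum(np.asarray(v).nbytes for v in feed_dict.values())
    return total_bytes / (1024.0 ** 2)

# e.g. right before session.run(...):
# print('batch size: %.1f MB' % feed_dict_size_mb(feed_dict))
```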

To try a smaller-scale problem and get results quickly, you can use --num_init_dags 20 --num_stream_dags 0 to make it a batch-arrival training case (Section 7.2, batch arrivals).
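For reference, the invocation might look something like this (assuming the repo's train.py entry point and leaving all other flags at their defaults):

```sh
python train.py --num_init_dags 20 --num_stream_dags 0
```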

Hope this helps!

IdanAzuri (Author) commented

Thank you for the answer, but it still runs slowly / runs out of memory. Can you please add the TensorFlow version you used here, and also the parameters for the GPU: --master_num_gpu, --worker_num_gpu?

hongzimao (Owner) commented

I believe we developed the project with TensorFlow 1.14.0. With the small batch-arrival experiment above, it shouldn't take that much memory, and you can try training it with just the CPU.
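A CPU-only run could then look something like the line below; it assumes the train.py entry point and that setting the two GPU flags to 0 disables GPU use (that reading of the flags is an assumption, not confirmed in this thread):

```sh
python train.py --num_init_dags 20 --num_stream_dags 0 --master_num_gpu 0 --worker_num_gpu 0
```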
