Integrate Horovod into code #21

Closed
RandomDefaultUser opened this issue Mar 31, 2021 · 4 comments
Labels: enhancement (New feature or request)

Comments

@RandomDefaultUser
Member

In GitLab by @RandomDefaultUser on Dec 21, 2020, 09:32

Student project: Integrate Horovod into the workflow. This includes (tasks not necessarily in optimal order):

  • Create and checkout your own branch (if your name is Firstname Lastname: YYMMDD_FL_HorovodIntegration)
  • Install Horovod
  • Add Horovod installation to installation guide
  • Add Horovod commands to Trainer class
    • Make usage of Horovod optional by adding a parameter in ParametersTraining
  • Add an example_*.py file to showcase Horovod integration and test it locally with downsized, preprocessed data
  • Contact @fiedle09 to set up a training session on the hemera cluster and test it
  • Create a merge request, assign @fiedle09 - Done!
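
A minimal sketch of how the optional Horovod switch from the task list could look, assuming the Trainer works with PyTorch models and optimizers. The classes below are hypothetical stand-ins, not the project's actual `ParametersTraining` and `Trainer`, and the flag name `use_horovod` is an assumption. Only the `horovod.torch` calls (`init`, `size`, `rank`, `DistributedOptimizer`, `broadcast_parameters`, `broadcast_optimizer_state`) are part of Horovod itself.

```python
from dataclasses import dataclass


@dataclass
class ParametersTraining:
    """Hypothetical stand-in for the project's training parameters."""

    # Assumed flag name; Horovod stays off unless explicitly requested.
    use_horovod: bool = False


class Trainer:
    """Hypothetical stand-in for the project's Trainer class."""

    def __init__(self, params: ParametersTraining):
        self.params = params
        self.hvd = None
        if params.use_horovod:
            # Imported lazily so Horovod is only required when it is used.
            import horovod.torch as hvd

            hvd.init()
            self.hvd = hvd

    @property
    def world_size(self) -> int:
        """Number of workers taking part in the training run."""
        return self.hvd.size() if self.hvd is not None else 1

    @property
    def is_root(self) -> bool:
        """True on the worker that should handle logging and checkpoints."""
        return self.hvd is None or self.hvd.rank() == 0

    def prepare(self, model, optimizer):
        """Return the model/optimizer pair, wrapped for Horovod if enabled."""
        if self.hvd is None:
            return model, optimizer
        # Average gradients across all workers on every optimizer step.
        optimizer = self.hvd.DistributedOptimizer(
            optimizer, named_parameters=model.named_parameters()
        )
        # Start every worker from rank 0's weights and optimizer state.
        self.hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        self.hvd.broadcast_optimizer_state(optimizer, root_rank=0)
        return model, optimizer
```

With the flag left at its default, `Trainer(ParametersTraining()).prepare(model, optimizer)` returns both objects unchanged, so existing single-process workflows are unaffected. With the flag set, a script built on this would typically be started through Horovod's launcher, e.g. `horovodrun -np 2 python example_horovod.py` (the script name here is a placeholder).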
@RandomDefaultUser RandomDefaultUser added the enhancement New feature or request label Mar 31, 2021
@RandomDefaultUser RandomDefaultUser self-assigned this Mar 31, 2021
@RandomDefaultUser
Member Author

In GitLab by @RandomDefaultUser on Dec 21, 2020, 12:18

Assignees: Sneha, Omar

By Fiedler, Lenz (FWU) - 146409 on 2020-12-21T12:18:01 (imported from GitLab)

@RandomDefaultUser
Member Author

In GitLab by @RandomDefaultUser on Feb 3, 2021, 18:04

I made some minor additions to @faruk92's and @snoozeyouloose's code. I will do one final test on hemera (which will also use the potential bugfix that @faruk92 provided for the weird scaling on a single node), and if that is successful I will close this issue.

By Fiedler, Lenz (FWU) - 146409 on 2021-02-03T18:04:49 (imported from GitLab)

@RandomDefaultUser
Member Author

In GitLab by @RandomDefaultUser on Feb 4, 2021, 11:27

[Figure: results_number_of_nodes]

The Horovod tests have finished. @faruk92's fix for the single-node, multiple-GPU case seems to mitigate the problem compared to previous runs. I am still a little confused as to why there is a slight decrease in performance when using one node with two GPUs, but generally we get the desired behavior and a sizable speedup. I am closing this issue for now and will open a lower-importance issue for further Horovod benchmarks. For now, though, we can run trainings considerably faster, which is all we need. Great work @faruk92 @snoozeyouloose!

By Fiedler, Lenz (FWU) - 146409 on 2021-02-04T11:27:14 (imported from GitLab)

@RandomDefaultUser
Member Author

In GitLab by @RandomDefaultUser on Dec 21, 2020, 11:51

assigned to @faruk92

By Fiedler, Lenz (FWU) - 146409 on 2020-12-21T11:51:06 (imported from GitLab)

This was referenced Mar 31, 2021