GPU Usage #159

Closed
mwcvitkovic opened this issue Sep 5, 2018 · 3 comments · Fixed by #161

Comments

@mwcvitkovic
Contributor

I'm running the dqn_BeamRider-v4 trial in the attached spec file in train mode. (It's basically identical to the dqn_boltzmann_breakout trial in the dqn.json spec file, but with the BeamRider env.)

I'm running 4 sessions at once on a 4-GPU machine. For the first ~5 episodes of each session, all 4 GPUs are used correctly, one per session (GPU 0 has a couple of extra processes on it, which I'm assuming is normal). But gradually, by around ~100 episodes, all the GPU processes disappear, and nothing is running on the GPUs anymore. The training process never crashes or finishes during this; its CPU usage drops to 0 and it just sits there.

Any ideas what's going on?

openai_baseline_reproduction_dqn copy.txt
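
For reference, here is a minimal sketch of how such a spec could be produced: copy the dqn_boltzmann_breakout entry from dqn.json and swap the env name to BeamRider-v4. The file paths and the "env"/"name" keys below are assumptions based on the description above, not a verified SLM Lab spec schema.

```python
import json

# Hypothetical path to the stock spec file; adjust to the actual repo layout.
with open('slm_lab/spec/dqn.json') as f:
    specs = json.load(f)

# Reuse the Breakout trial's settings, changing only the environment name.
beamrider_spec = specs['dqn_boltzmann_breakout']
for env in beamrider_spec.get('env', []):
    env['name'] = 'BeamRider-v4'

# Write the derived spec out under a new trial name.
with open('openai_baseline_reproduction_dqn.json', 'w') as f:
    json.dump({'dqn_BeamRider-v4': beamrider_spec}, f, indent=2)
```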

@mwcvitkovic
Contributor Author

My best guess is that it has to do with saving the episode, but I'm not at all sure.

@kengz
Owner

kengz commented Sep 5, 2018 via email

@kengz
Owner

kengz commented Sep 6, 2018

Identified the root cause: the save method's argument was accidentally removed, so each process actually crashes when it tries to save.
The PR above will fix this, along with some extra improvements to logging verbosity.
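
For illustration, a minimal sketch of that failure mode with hypothetical names (save, run_session), not SLM Lab's actual code: if a parameter is dropped from save() while callers still pass it, each session process dies with a TypeError at its first checkpoint, releasing its GPU, and a parent that waits on a result queue then sits idle with 0 CPU, which matches the symptoms reported above.

```python
import multiprocessing as mp

def save(agent):              # second parameter accidentally removed
    pass

def run_session(session_id, result_queue):
    for episode in range(200):
        ...                   # training steps that keep the GPU busy
        if episode and episode % 100 == 0:
            # caller still uses the old signature -> TypeError kills the process
            save(agent=None, ckpt=f'session_{session_id}')
    result_queue.put(session_id)  # never reached

if __name__ == '__main__':
    queue = mp.Queue()
    procs = [mp.Process(target=run_session, args=(i, queue)) for i in range(4)]
    for p in procs:
        p.start()
    # Blocks forever: every child has crashed before putting a result,
    # so the parent neither crashes nor finishes.
    results = [queue.get() for _ in procs]
```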
