an option to raise exception if oom happens during fairseq.trainer.train_step (facebookresearch#2)

Summary:
Pull Request resolved: fairinternal/fairspeq#2

Pull Request resolved: facebookresearch#689

We found that not raising OOM errors during trainer.train_step causes
various issues, including NCCL hangs and gloo sync errors, because gradients
are not synced properly across workers. Until we find the root cause, let's
give users an option to raise OOMs.

Reviewed By: jmp84

Differential Revision: D15170357

fbshipit-source-id: 3e15e4e111a8380612157955509c39821a216ec4
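
As a rough illustration of the new flag (the `trainer` object, `samples`, and
the surrounding script are assumptions for this sketch, not part of this
commit), a caller that opts in might do:

    import sys

    try:
        trainer.train_step(samples, raise_oom=True)
    except ValueError as e:
        # With raise_oom=True an OOM surfaces here as ValueError; fail the
        # worker fast instead of letting gradient sync silently desynchronize.
        if 'out of memory' in str(e):
            print('| FATAL: OOM during train_step; aborting', file=sys.stderr)
            sys.exit(1)
        raise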
Yongqiang Wang authored and facebook-github-bot committed May 3, 2019
1 parent 5122e30 commit f283f59
fairseq/trainer.py: 14 additions & 2 deletions
@@ -13,6 +13,7 @@
 from itertools import chain
 import math
 import os
+import sys
 
 import torch
 
@@ -174,7 +175,7 @@ def load_checkpoint(self, filename, reset_optimizer=False, reset_lr_scheduler=Fa
 
         return extra_state
 
-    def train_step(self, samples, dummy_batch=False):
+    def train_step(self, samples, dummy_batch=False, raise_oom=False):
         """Do forward, backward and parameter update."""
         self._set_seed()
         self.model.train()
@@ -219,7 +220,18 @@ def train_step(self, samples, dummy_batch=False):
                 sample_sizes.append(sample_size)
             except RuntimeError as e:
                 if 'out of memory' in str(e):
-                    print(('| WARNING: ran out of memory with exception: {};\n Skipping batch').format(str(e)))
+                    msg = (
+                        '| WARNING: ran out of memory with exception: '
+                        + '{};'.format(e)
+                        + '\n Skipping batch'
+                    )
+                    # TODO: this message should really go to a logger; a bare
+                    # print goes to stdout, which is buffered and in many
+                    # cases never flushed if another exception happens
+                    # print(msg)
+                    print(msg, file=sys.stderr)
+                    if raise_oom:
+                        raise ValueError(msg)
                     ooms += 1
                     self.zero_grad()
                 else:
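
A single-process training script could instead use the flag to retry a batch
once after freeing cached memory. This helper is a sketch, not part of
fairseq, and the pattern is only safe outside distributed training, where
divergent control flow across workers is exactly what causes the hangs
described above. Note that train_step raises before calling zero_grad, so the
retry must clear gradients itself:

    import torch

    def train_step_with_retry(trainer, samples):
        # Hypothetical helper (not part of this commit): retry once after
        # clearing the CUDA allocator cache. Single-GPU only; distributed
        # workers must all take the same code path.
        try:
            return trainer.train_step(samples, raise_oom=True)
        except ValueError as e:
            if 'out of memory' not in str(e):
                raise
            trainer.zero_grad()        # train_step skips this when it raises
            torch.cuda.empty_cache()
            return trainer.train_step(samples, raise_oom=True)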
