
Avoid looping when data exhausted #14413

Merged

Conversation

@valentinkoe (Contributor) commented on Nov 16, 2021

What does this PR do?

This fix avoids running into a virtually infinite loop when using a finite iterable dataset.

When using an iterable dataset, num_epochs is set to sys.maxsize to make sure all data is consumed (see #12561).
Likewise, I'd like to set max_steps large enough to consume all data, but still stop once the data is exhausted. If we don't know in advance how many samples there will be and the iterator stops, we can end up in a virtually infinite loop (iterating over the integer range up to sys.maxsize).

See this code snippet to reproduce the behavior:

from torch.utils.data import IterableDataset

from transformers import BertForMaskedLM, BertConfig, TrainingArguments, Trainer

model = BertForMaskedLM(BertConfig())


class FiniteIterableDataset(IterableDataset):
    def __init__(self, num_samples: int):
        self.current_sample = 0
        self.num_samples = num_samples

    def __iter__(self):
        while self.current_sample < self.num_samples:
            yield {"input_ids": [0, 0, 0, self.current_sample], "labels": [0, 0, 0, 1]}
            self.current_sample += 1


batch_size = 1
gradient_accumulation_steps = 1
num_samples = 10

available_steps = num_samples // (batch_size * gradient_accumulation_steps)

data = FiniteIterableDataset(num_samples)
train_args = TrainingArguments(
    "tmp_dir",
    max_steps=available_steps,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
)
trainer = Trainer(model, train_dataset=data, args=train_args)
trainer.train()  # works

data = FiniteIterableDataset(num_samples)
train_args = TrainingArguments(
    "tmp_dir",
    max_steps=available_steps + 1,  # set a higher number than actually available
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
)
trainer = Trainer(model, train_dataset=data, args=train_args)
trainer.train()  # "hangs" at 91% after 10 steps, spinning through empty epochs (up to sys.maxsize)

With this fix, it is checked whether epoch_iterator produced any samples; if it did not, control.should_training_stop is set to True.
I don't know whether changing the control flow this way is acceptable, as control is otherwise only modified through callback handlers; I'm happy to take suggestions on how to do this properly.
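For illustration, here is a minimal, self-contained sketch of the control flow in question. This is not the actual Trainer code; the loop structure and names (train, samples_seen) are simplified stand-ins. It shows why an epoch that yields no samples must stop training:

import sys


class FiniteIterable:
    """Mimics an iterable dataset that keeps its position across epochs."""

    def __init__(self, num_samples):
        self.current_sample = 0
        self.num_samples = num_samples

    def __iter__(self):
        while self.current_sample < self.num_samples:
            yield self.current_sample
            self.current_sample += 1


def train(dataset, max_steps):
    completed_steps = 0
    for epoch in range(sys.maxsize):  # num_epochs when the dataset is iterable
        samples_seen = 0
        for sample in dataset:
            samples_seen += 1
            completed_steps += 1  # one optimizer step per sample here
            if completed_steps >= max_steps:
                return completed_steps
        if samples_seen == 0:
            # The idea of the fix: this epoch produced no samples, so the
            # data is exhausted; stop instead of looping on to sys.maxsize.
            break
    return completed_steps


print(train(FiniteIterable(10), max_steps=11))  # prints 10 instead of hanging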

I tried to come up with a test case that checks the logs for the point where training is stopped in this case. Another option would be to measure how long training takes and time out after a while, but that would not make a good test, as run time can be affected by other circumstances.
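For reference, one possible shape for such a test, reusing the reproduction above; asserting on trainer.state.global_step (rather than parsing logs or timing the run) is my assumption about the cleanest way to express "training stopped when the data ran out":

# Illustrative test sketch reusing FiniteIterableDataset, model, num_samples,
# batch_size, and available_steps from the reproduction above.
data = FiniteIterableDataset(num_samples)
train_args = TrainingArguments(
    "tmp_dir",
    max_steps=available_steps + 1,  # more steps than the data can provide
    per_device_train_batch_size=batch_size,
)
trainer = Trainer(model, train_dataset=data, args=train_args)
trainer.train()
# With the fix, training stops as soon as an epoch yields no samples, so the
# step counter should equal the number of batches the data could provide.
assert trainer.state.global_step == available_steps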


Valentin Deyringer added 3 commits November 16, 2021 11:04
when using an iterable dataset num_epochs is set to
sys.maxsize to make sure all data is consumed
likewise we want to set max_steps high enough
but still stop when all data is consumed

(cherry picked from commit 6f0e1d6)
@valentinkoe force-pushed the avoid-looping-when-data-exhausted branch from 1da4eb6 to 04ed756 on November 16, 2021
@sgugger (Collaborator) left a comment

Thanks for adding this, it's a nice addition!

Make sure you run make style on your branch to fix the quality issue.

Comment on lines 1075 to 1079
batch_size = 1
gradient_accumulation_steps = 1
num_samples = 10

available_steps = num_samples // (batch_size * gradient_accumulation_steps)
@sgugger (Collaborator)
We're not really using gradient_accumulation_steps here, so let's remove it.

Suggested change
-batch_size = 1
-gradient_accumulation_steps = 1
-num_samples = 10
-available_steps = num_samples // (batch_size * gradient_accumulation_steps)
+batch_size = 1
+num_samples = 10
+available_steps = num_samples // batch_size

@valentinkoe (Contributor, Author)

Right, I wrongly assumed it had a default value other than 1. I removed it.

tests/test_trainer.py: outdated review thread, resolved
Valentin Deyringer added 2 commits November 16, 2021 16:25
@sgugger merged commit a33168a into huggingface:master on Nov 16, 2021
@sgugger (Collaborator) commented on Nov 16, 2021

Thanks again for fixing this! :-)
