Fix streaming: catch Timeout error #3050

Merged
merged 2 commits into from
Oct 11, 2021

Conversation

borisdayma
Contributor

Catches Timeout error during streaming.

fix #3049

borisdayma changed the title from "fix(streaming): catch Timeout error" to "[WIP] fix(streaming): catch Timeout error" on Oct 9, 2021
@borisdayma
Contributor Author

I'm running a large test.
Let's see if I get any error within a few days.

@borisdayma
Contributor Author

This time it stopped after 8h but correctly raised ConnectionError: Server Disconnected.

Traceback:

Traceback (most recent call last):
  File "/home/koush/dalle-mini/dev/seq2seq/run_seq2seq_flax.py", line 1027, in <module>
    main()
  File "/home/koush/dalle-mini/dev/seq2seq/run_seq2seq_flax.py", line 991, in main
    for batch in tqdm(
  File "/home/koush/.pyenv/versions/dev/lib/python3.9/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/home/koush/dalle-mini/dev/seq2seq/run_seq2seq_flax.py", line 376, in data_loader_streaming
    for item in dataset:
  File "/home/koush/datasets/src/datasets/iterable_dataset.py", line 341, in __iter__
    for key, example in self._iter():
  File "/home/koush/datasets/src/datasets/iterable_dataset.py", line 338, in _iter
    yield from ex_iterable
  File "/home/koush/datasets/src/datasets/iterable_dataset.py", line 179, in __iter__
    key_examples_list = [(key, example)] + [
  File "/home/koush/datasets/src/datasets/iterable_dataset.py", line 179, in <listcomp>
    key_examples_list = [(key, example)] + [
  File "/home/koush/datasets/src/datasets/iterable_dataset.py", line 176, in __iter__
    for key, example in iterator:
  File "/home/koush/datasets/src/datasets/iterable_dataset.py", line 225, in __iter__
    for x in self.ex_iterable:
  File "/home/koush/datasets/src/datasets/iterable_dataset.py", line 99, in __iter__
    for key, example in self.generate_examples_fn(**kwargs_with_shuffled_shards):
  File "/home/koush/datasets/src/datasets/iterable_dataset.py", line 287, in wrapper
    for key, table in generate_tables_fn(**kwargs):
  File "/home/koush/datasets/src/datasets/packaged_modules/json/json.py", line 107, in _generate_tables
    batch = f.read(self.config.chunksize)
  File "/home/koush/datasets/src/datasets/utils/streaming_download_manager.py", line 136, in read_with_retries
    raise ConnectionError("Server Disconnected")
ConnectionError: Server Disconnected

Right before this error, the warnings were correctly raised:

10/10/2021 06:02:26 - WARNING - datasets.utils.streaming_download_manager - Got disconnected from remote data host. Retrying in 1sec [1/3]
10/10/2021 06:02:27 - WARNING - datasets.utils.streaming_download_manager - Got disconnected from remote data host. Retrying in 1sec [2/3]
10/10/2021 06:02:28 - WARNING - datasets.utils.streaming_download_manager - Got disconnected from remote data host. Retrying in 1sec [3/3]

I'm going to see what happens if I change the max retries to 20 and the interval to 5.
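
For readers following along, the retry behaviour visible in the log above can be sketched roughly as below. This is a minimal illustration and not the actual code in streaming_download_manager.py: the name `read_with_retries`, the warning text, and the final `ConnectionError("Server Disconnected")` come from the traceback and log, while the signature, the `read_fn` callable, the `sleep` hook, and the set of exceptions caught are assumptions.

```python
import logging
import time

logger = logging.getLogger(__name__)


def read_with_retries(read_fn, *args, max_retries=3, retry_interval=1,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call read_fn(*args), retrying on transient network errors.

    Hypothetical sketch: the signature and the exceptions listed in
    `retry_on` are assumptions, not the datasets library's API.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return read_fn(*args)
        except retry_on:
            # Same message shape as the warnings in the log above.
            logger.warning(
                "Got disconnected from remote data host. Retrying in %ssec [%d/%d]",
                retry_interval, attempt, max_retries,
            )
            sleep(retry_interval)
    # Every attempt failed: surface a single, explicit error.
    raise ConnectionError("Server Disconnected")
```

Calling it with `max_retries=20, retry_interval=5` would correspond to the settings being tested here.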

Member

@lhoestq lhoestq left a comment


Thanks for the fix!

Let me know the result of your test with more retries and a higher time interval.
Where is the data hosted?

lhoestq merged commit 93c828b into huggingface:master on Oct 11, 2021
@lhoestq
Member

lhoestq commented Oct 11, 2021

Also, maybe we can raise the Server Disconnected error with more info about what kind of error caused it (client error, timeout, etc.).
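
One possible way to do this, shown only as a sketch, is Python's exception chaining (`raise ... from ...`), which keeps the original error attached as `__cause__` so it appears in the traceback. The helper name `raise_server_disconnected` and the message format are hypothetical and are not part of the datasets library.

```python
def raise_server_disconnected(cause):
    """Raise ConnectionError while keeping the original error attached.

    Hypothetical helper: the name and message format are illustrative only.
    """
    kind = type(cause).__name__
    raise ConnectionError(
        f"Server Disconnected (caused by {kind}: {cause})"
    ) from cause


# Usage: wrap a low-level timeout and inspect what the caller would see.
try:
    try:
        raise TimeoutError("read timed out")
    except TimeoutError as err:
        raise_server_disconnected(err)
except ConnectionError as exc:
    print(exc)                           # message includes the cause
    print(type(exc.__cause__).__name__)  # original exception type
```

With chaining, the traceback also gains a "The above exception was the direct cause of the following exception" section, so a client error can be told apart from a timeout.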

@borisdayma
Contributor Author

borisdayma commented Oct 11, 2021

I have 2 runs:

  • run 1 with this data that I will remove soon because I now use the 2nd one
  • run 2 with this data
  • load_dataset(dataset_repo, data_files={'train':'data/train/*.jsonl', 'validation':'data/valid/*.jsonl'}, streaming=True)

They have now been running for a bit more than a day for one run and 15h for the other.

The error logs are not shown in wandb because the script uses pylogging (not sure why, I should change it), but so far with the new settings I had one timeout in each run, with a successful reconnect afterwards.

So I think it's a good idea to have:

  • STREAMING_READ_RETRY_INTERVAL = 5, since my runs previously got 3 errors in a row with the default 1 second pause
  • STREAMING_READ_MAX_RETRIES should also be increased. This type of error does not happen often, but I would still use a large number (at least 10), because a stopped training run can be a big issue if checkpointing/restart is not well implemented, which is not always trivial
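
As a rough back-of-the-envelope check on these values, the longest outage the reader can ride out is about the number of retries times the pause between them. The sketch below assumes one sleep after each failed attempt (as the [n/3] warnings suggest) and ignores the time spent in each request itself; the function name is hypothetical.

```python
def max_sleep_before_failure(max_retries, retry_interval):
    """Total seconds spent sleeping before the final error is raised.

    Assumes one sleep per failed attempt and ignores per-request time.
    """
    return max_retries * retry_interval


# Settings mentioned in this thread: (max retries, interval in seconds).
for retries, interval in [(3, 1), (10, 5), (20, 5)]:
    total = max_sleep_before_failure(retries, interval)
    print(f"{retries} retries x {interval}s -> about {total}s of tolerated outage")
```

Under these assumptions the defaults tolerate only about 3 seconds of disconnection, while 10 or 20 retries at 5 seconds each tolerate roughly 50 or 100 seconds.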

@lhoestq
Member

lhoestq commented Oct 12, 2021

I agree! Feel free to open a PR to increase both values.

lhoestq changed the title from "[WIP] fix(streaming): catch Timeout error" to "Fix streaming: catch Timeout error" on Oct 12, 2021
Development

Successfully merging this pull request may close these issues.

TimeoutError during streaming
2 participants