
502 Server Errors when streaming large dataset #6577

Closed
sanchit-gandhi opened this issue Jan 10, 2024 · 6 comments · Fixed by huggingface/huggingface_hub#1981

Comments

sanchit-gandhi (Contributor) commented Jan 10, 2024

Describe the bug

When streaming a large ASR dataset (~3 TB) from the Hub, I often encounter seemingly random 502 Server Errors during streaming:

huggingface_hub.utils._errors.HfHubHTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set/resolve/7d2acc5c59de848e456e951a76e805304d6fb350/train/train-00288-of-07135.parquet 

This is despite the parquet file definitely existing on the Hub: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set/blob/main/train/train-00228-of-07135.parquet
and the commit id being correct: 7d2acc5c59de848e456e951a76e805304d6fb350

I’m wondering whether this is coming from datasets or from the Hub side?

Steps to reproduce the bug

Reproducer:

from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

NUM_EPOCHS = 20

# Stream the dataset from the Hub instead of downloading it to disk
dataset = load_dataset("sanchit-gandhi/concatenated-train-set", "train", streaming=True)
dataset = dataset.with_format("torch")
dataloader = DataLoader(dataset["train"], batch_size=256, drop_last=True, pin_memory=True, num_workers=16)

# Iterate over the full dataset for several epochs without doing any work;
# this alone is enough to hit the 502 within about two hours
for epoch in tqdm(range(NUM_EPOCHS), desc="Epoch", position=0):
    for batch in tqdm(dataloader, desc="Batch", position=1):
        continue

Running the above script tends to fail within about 2 hours with a traceback like the following:

Traceback:
    for batch in train_loader:
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1325, in _next_data
    return self._process_data(data)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
huggingface_hub.utils._errors.HfHubHTTPError: Caught HfHubHTTPError in DataLoader worker process 10.
Original Traceback (most recent call last):
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    response.raise_for_status()
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set/resolve/7d2acc5c59de848e456e951a76e805304d6fb350/train/train-00288-of-07135.parquet
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1363, in __iter__
    yield from self._iter_pytorch()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1298, in _iter_pytorch
    for key, example in ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 983, in __iter__
    for x in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 863, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 900, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 679, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 741, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 863, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 900, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1115, in __iter__
    for key, example in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 679, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 741, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1115, in __iter__
    for key, example in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 282, in __iter__
    for key, pa_table in self.generate_tables_fn(**self.kwargs):
  File "/home/sanchitgandhi/datasets/src/datasets/packaged_modules/parquet/parquet.py", line 87, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1367, in iter_batches
  File "pyarrow/types.pxi", line 88, in pyarrow.lib._datatype_to_pep3118
  File "/home/sanchitgandhi/datasets/src/datasets/download/streaming_download_manager.py", line 341, in read_with_retries
    out = read(*args, **kwargs)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/fsspec/spec.py", line 1856, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/fsspec/caching.py", line 189, in _fetch
    self.cache = self.fetcher(start, end)  # new block replaces old
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/huggingface_hub/hf_file_system.py", line 626, in _fetch_range
    hf_raise_for_status(r)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 333, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set/resolve/7d2acc5c59de848e456e951a76e805304d6fb350/train/train-00288-of-07135.parquet

Expected behavior

Should be able to stream the dataset without any 502 errors.

Environment info

  • datasets version: 2.16.2.dev0
  • Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • huggingface_hub version: 0.20.1
  • PyArrow version: 14.0.2
  • Pandas version: 2.0.3
  • fsspec version: 2023.10.0
sanchit-gandhi (Contributor, Author) commented:

cc @mariosasko @lhoestq

mariosasko (Collaborator) commented:

Hi! We should be able to avoid this error by retrying the read when it happens. I'll open a PR in huggingface_hub to address this.
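
For illustration, here is a minimal sketch of the kind of retry logic being proposed, written against a plain requests range request like the one HfFileSystem._fetch_range performs under the hood. The function name, backoff parameters, and retryable-status set are hypothetical, not the actual huggingface_hub change:

import time

import requests

# At this point in the thread only 502s are being discussed;
# the set is widened further down.
RETRYABLE_STATUS_CODES = {502}


def fetch_range_with_retries(url, start, end, max_retries=5, base_wait=1.0):
    """Fetch bytes [start, end) of url, retrying transient gateway errors."""
    headers = {"Range": f"bytes={start}-{end - 1}"}
    for attempt in range(max_retries + 1):
        response = requests.get(url, headers=headers)
        if response.status_code in RETRYABLE_STATUS_CODES and attempt < max_retries:
            # Exponential backoff before re-issuing the same range request
            time.sleep(base_wait * 2**attempt)
            continue
        response.raise_for_status()
        return response.content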

sanchit-gandhi (Contributor, Author) commented:

Thanks for the fix @mariosasko! Just wondering whether 500 errors should also be covered? I got these errors overnight:

huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set-label-length-256/resolve/91e6a0cd0356605b021384ded813cfcf356a221c/train/train-02618-of-04012.parquet (Request ID: Root=1-65b18b81-627f2c2943bbb8ab68d19ee2;129537bd-1934-4257-a4d8-1cb774f8e1f8)

Internal Error - We're working hard to fix this as soon as possible!
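
If the 500s turn out to be transient in the same way, the sketch above would only need its retryable set widened, e.g. (again hypothetical):

# Hypothetical: also treat other transient 5xx responses as retryable
RETRYABLE_STATUS_CODES = {500, 502, 503, 504}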

sanchit-gandhi (Contributor, Author) commented Feb 11, 2024

Gently pinging @mariosasko and @Wauplin - when trying to stream this large dataset from the HF Hub, I'm running into 500 Internal Server Errors as described above. I'd love to be able to use the Hub exclusively to stream data when training, but this error pops up a few times a week, terminating training runs and forcing me to rewind to the last saved checkpoint. Do we reckon there's a way we can protect datasets' streaming against these errors? The same reproducer as in the original comment can be used, but it's somewhat random whether we hit a 500 error. Leaving the full traceback below:

Traceback (most recent call last):
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1367, in __iter__
    yield from self._iter_pytorch()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1302, in _iter_pytorch
    for key, example in ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 987, in __iter__
    for x in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 867, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 904, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 679, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 741, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1119, in __iter__
    for key, example in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 282, in __iter__
    for key, pa_table in self.generate_tables_fn(**self.kwargs):
  File "/home/sanchitgandhi/datasets/src/datasets/packaged_modules/parquet/parquet.py", line 87, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/types.pxi", line 88, in pyarrow.lib._datatype_to_pep3118
  File "/home/sanchitgandhi/datasets/src/datasets/download/streaming_download_manager.py", line 342, in read_with_retries
    out = read(*args, **kwargs)
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/fsspec/spec.py", line 1856, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/fsspec/caching.py", line 189, in _fetch
    self.cache = self.fetcher(start, end)  # new block replaces old
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 629, in _fetch_range
    hf_raise_for_status(r)
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 362, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set-label-length-256-conditioned/resolve/3c3c0cce51df9f9d2e75968bb2a1851894f5040d/train/train-03515-of-04010.parquet (Request ID: Root=1-65c7c4c4-153fe71401558c8c2d272c8a;fec3ec68-4a0a-4bfd-95ba-b0a05684d612)

Internal Error - We're working hard to fix this as soon as possible!
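
Until retries land inside huggingface_hub, one rough caller-side stopgap is to catch the error around the epoch loop and restart it. This is a sketch, not a real fix: restarting an iterable streaming dataloader replays the epoch from the start, so in-epoch progress is lost.

import time

from huggingface_hub.utils import HfHubHTTPError


def iterate_with_restarts(dataloader, max_restarts=3, wait=60.0):
    """Yield batches, restarting the epoch if a transient Hub error escapes."""
    restarts = 0
    while True:
        try:
            for batch in dataloader:
                yield batch
            return  # epoch finished cleanly
        except HfHubHTTPError:
            restarts += 1
            if restarts > max_restarts:
                raise
            time.sleep(wait)  # give the Hub a moment to recover, then restart

Used as `for batch in iterate_with_restarts(dataloader): ...` in the reproducer above, this would at least keep a multi-day run alive through a short outage.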

Wauplin (Contributor) commented Feb 12, 2024

@sanchit-gandhi thanks for the feedback. I've opened huggingface/huggingface_hub#2026 to make the download process more robust. I believe you witnessed this problem on Saturday due to the Hub outage. Hope the PR makes your life easier though :)

sanchit-gandhi (Contributor, Author) commented:

Awesome, thanks @Wauplin! Makes sense re the Hub outage.
