
502 Server Errors when streaming large dataset #6577

Closed
sanchit-gandhi opened this issue Jan 10, 2024 · 6 comments · Fixed by huggingface/huggingface_hub#1981

Comments

sanchit-gandhi (Contributor) commented Jan 10, 2024

Describe the bug

When streaming a large ASR dataset (~3 TB) from the Hub, I often encounter seemingly random 502 Server Errors during streaming:

huggingface_hub.utils._errors.HfHubHTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set/resolve/7d2acc5c59de848e456e951a76e805304d6fb350/train/train-00288-of-07135.parquet 

This is despite the parquet file definitely existing on the Hub: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set/blob/main/train/train-00228-of-07135.parquet
and the commit id being correct: 7d2acc5c59de848e456e951a76e805304d6fb350

I’m wondering whether this is coming from datasets or from the Hub side?

Steps to reproduce the bug

Reproducer:

from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

NUM_EPOCHS = 20

# Stream the dataset from the Hub instead of downloading it to disk
dataset = load_dataset("sanchit-gandhi/concatenated-train-set", "train", streaming=True)
dataset = dataset.with_format("torch")
dataloader = DataLoader(dataset["train"], batch_size=256, drop_last=True, pin_memory=True, num_workers=16)

# Iterate over the full dataset for several epochs without doing any work;
# this alone is enough to hit the 502 within about two hours
for epoch in tqdm(range(NUM_EPOCHS), desc="Epoch", position=0):
    for batch in tqdm(dataloader, desc="Batch", position=1):
        continue

Running the above script tends to fail within about 2 hours with a traceback like the following:

Traceback:
    for batch in train_loader:
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1325, in _next_data
    return self._process_data(data)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
huggingface_hub.utils._errors.HfHubHTTPError: Caught HfHubHTTPError in DataLoader worker process 10.
Original Traceback (most recent call last):
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    response.raise_for_status()
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set/resolve/7d2acc5c59de848e456e951a76e805304d6fb350/train/train-00288-of-07135.parquet
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1363, in __iter__
    yield from self._iter_pytorch()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1298, in _iter_pytorch
    for key, example in ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 983, in __iter__
    for x in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 863, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 900, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 679, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 741, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 863, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 900, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1115, in __iter__
    for key, example in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 679, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 741, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1115, in __iter__
    for key, example in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 282, in __iter__
    for key, pa_table in self.generate_tables_fn(**self.kwargs):
  File "/home/sanchitgandhi/datasets/src/datasets/packaged_modules/parquet/parquet.py", line 87, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1367, in iter_batches
  File "pyarrow/types.pxi", line 88, in pyarrow.lib._datatype_to_pep3118
  File "/home/sanchitgandhi/datasets/src/datasets/download/streaming_download_manager.py", line 341, in read_with_retries
    out = read(*args, **kwargs)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/fsspec/spec.py", line 1856, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/fsspec/caching.py", line 189, in _fetch
    self.cache = self.fetcher(start, end)  # new block replaces old
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/huggingface_hub/hf_file_system.py", line 626, in _fetch_range
    hf_raise_for_status(r)
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 333, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set/resolve/7d2acc5c59de848e456e951a76e805304d6fb350/train/train-00288-of-07135.parquet

Expected behavior

Should be able to stream the dataset without any 502 errors.

Environment info

  • datasets version: 2.16.2.dev0
  • Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • huggingface_hub version: 0.20.1
  • PyArrow version: 14.0.2
  • Pandas version: 2.0.3
  • fsspec version: 2023.10.0
sanchit-gandhi (Contributor, Author) commented:

cc @mariosasko @lhoestq

mariosasko (Collaborator) commented:

Hi! We should be able to avoid this error by retrying the read when it happens. I'll open a PR in huggingface_hub to address this.
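
For illustration, here is a minimal sketch of the kind of retry logic being proposed, written against a plain requests range request like the one HfFileSystem._fetch_range performs under the hood. The function name, backoff parameters, and retryable-status set are hypothetical, not the actual huggingface_hub change:

import time

import requests

# At this point in the thread only 502s are being discussed;
# the set is widened further down.
RETRYABLE_STATUS_CODES = {502}


def fetch_range_with_retries(url, start, end, max_retries=5, base_wait=1.0):
    """Fetch bytes [start, end) of url, retrying transient gateway errors."""
    headers = {"Range": f"bytes={start}-{end - 1}"}
    for attempt in range(max_retries + 1):
        response = requests.get(url, headers=headers)
        if response.status_code in RETRYABLE_STATUS_CODES and attempt < max_retries:
            # Exponential backoff before re-issuing the same range request
            time.sleep(base_wait * 2**attempt)
            continue
        response.raise_for_status()
        return response.content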

sanchit-gandhi (Contributor, Author) commented:

Thanks for the fix @mariosasko! Just wondering whether 500 errors should also be covered? I got these errors overnight:

huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set-label-length-256/resolve/91e6a0cd0356605b021384ded813cfcf356a221c/train/train-02618-of-04012.parquet (Request ID: Root=1-65b18b81-627f2c2943bbb8ab68d19ee2;129537bd-1934-4257-a4d8-1cb774f8e1f8)

Internal Error - We're working hard to fix this as soon as possible!
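
If the 500s turn out to be transient in the same way, the sketch above would only need its retryable set widened, e.g. (again hypothetical):

# Hypothetical: also treat other transient 5xx responses as retryable
RETRYABLE_STATUS_CODES = {500, 502, 503, 504}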

sanchit-gandhi (Contributor, Author) commented Feb 11, 2024

Gently pinging @mariosasko and @Wauplin - when trying to stream this large dataset from the HF Hub, I'm running into 500 Internal Server Errors as described above. I'd love to be able to use the Hub exclusively to stream data when training, but this error pops up a few times a week, terminating training runs and forcing me to rewind to the last saved checkpoint. Do we reckon there's a way we can protect datasets' streaming against these errors? The same reproducer as in the original comment can be used, but it's somewhat random whether we hit a 500 error. Leaving the full traceback below:

Traceback (most recent call last):
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1367, in __iter__
    yield from self._iter_pytorch()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1302, in _iter_pytorch
    for key, example in ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 987, in __iter__
    for x in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 867, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 904, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 679, in __iter__
    yield from self._iter()
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 741, in _iter
    for key, example in iterator:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 1119, in __iter__
    for key, example in self.ex_iterable:
  File "/home/sanchitgandhi/datasets/src/datasets/iterable_dataset.py", line 282, in __iter__
    for key, pa_table in self.generate_tables_fn(**self.kwargs):
  File "/home/sanchitgandhi/datasets/src/datasets/packaged_modules/parquet/parquet.py", line 87, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/types.pxi", line 88, in pyarrow.lib._datatype_to_pep3118
  File "/home/sanchitgandhi/datasets/src/datasets/download/streaming_download_manager.py", line 342, in read_with_retries
    out = read(*args, **kwargs)
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/fsspec/spec.py", line 1856, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/fsspec/caching.py", line 189, in _fetch
    self.cache = self.fetcher(start, end)  # new block replaces old
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 629, in _fetch_range
    hf_raise_for_status(r)
  File "/home/sanchitgandhi/hf/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 362, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/sanchit-gandhi/concatenated-train-set-label-length-256-conditioned/resolve/3c3c0cce51df9f9d2e75968bb2a1851894f5040d/train/train-03515-of-04010.parquet (Request ID: Root=1-65c7c4c4-153fe71401558c8c2d272c8a;fec3ec68-4a0a-4bfd-95ba-b0a05684d612)

Internal Error - We're working hard to fix this as soon as possible!
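
Until retries land inside huggingface_hub, one rough caller-side stopgap is to catch the error around the epoch loop and restart it. This is a sketch, not a real fix: restarting an iterable streaming dataloader replays the epoch from the start, so in-epoch progress is lost.

import time

from huggingface_hub.utils import HfHubHTTPError


def iterate_with_restarts(dataloader, max_restarts=3, wait=60.0):
    """Yield batches, restarting the epoch if a transient Hub error escapes."""
    restarts = 0
    while True:
        try:
            for batch in dataloader:
                yield batch
            return  # epoch finished cleanly
        except HfHubHTTPError:
            restarts += 1
            if restarts > max_restarts:
                raise
            time.sleep(wait)  # give the Hub a moment to recover, then restart

Used as `for batch in iterate_with_restarts(dataloader): ...` in the reproducer above, this would at least keep a multi-day run alive through a short outage.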

Wauplin (Contributor) commented Feb 12, 2024

@sanchit-gandhi thanks for the feedback. I've opened huggingface/huggingface_hub#2026 to make the download process more robust. I believe you witnessed this problem on Saturday due to the Hub outage. Hope the PR makes your life easier though :)

sanchit-gandhi (Contributor, Author) commented:

Awesome, thanks @Wauplin! Makes sense re the Hub outage.
