ConnectionError and SSLError #3973

yanyu2015 · 2022-03-20T06:45:37Z

code

from datasets import load_dataset
dataset = load_dataset('oscar', 'unshuffled_deduplicated_it')

bug report

---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_29788/2615425180.py in <module>
----> 1 dataset = load_dataset('oscar', 'unshuffled_deduplicated_it')

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1658 
   1659     # Create a dataset builder
-> 1660     builder_instance = load_dataset_builder(
   1661         path=path,
   1662         name=name,

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
   1484         download_config = download_config.copy() if download_config else DownloadConfig()
   1485         download_config.use_auth_token = use_auth_token
-> 1486     dataset_module = dataset_module_factory(
   1487         path,
   1488         revision=revision,

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1236                         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1237                     ) from None
-> 1238                 raise e1 from None
   1239     else:
   1240         raise FileNotFoundError(

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1173             if path.count("/") == 0:  # even though the dataset is on the Hub, we get it from GitHub for now
   1174                 # TODO(QL): use a Hub dataset module factory instead of GitHub
-> 1175                 return GithubDatasetModuleFactory(
   1176                     path,
   1177                     revision=revision,

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in get_module(self)
    531         revision = self.revision
    532         try:
--> 533             local_path = self.download_loading_script(revision)
    534         except FileNotFoundError:
    535             if revision is not None or os.getenv("HF_SCRIPTS_VERSION", None) is not None:

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in download_loading_script(self, revision)
    511         if download_config.download_desc is None:
    512             download_config.download_desc = "Downloading builder script"
--> 513         return cached_path(file_path, download_config=download_config)
    514 
    515     def download_dataset_infos_file(self, revision: Optional[str]) -> str:

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\utils\file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\utils\file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/oscar/oscar.py (SSLError(MaxRetryError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/2.0.0/datasets/oscar/oscar.py (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))")))

This may be caused by the SSLError (because I am in China?), since the same code works fine on Google Colab.
So how can I download this dataset manually?


lhoestq · 2022-03-21T10:15:49Z

Hi ! You can download the oscar.py file from this repository at /datasets/oscar/oscar.py.

Then you can load the dataset by passing the local path to oscar.py to load_dataset:

load_dataset("path/to/oscar.py", "unshuffled_deduplicated_it")
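
A minimal sketch of the workaround above, assuming oscar.py has already been saved somewhere on disk. The helper name load_from_local_script is hypothetical and only wraps the load_dataset call shown in this comment; it checks that the script exists first so that a wrong path fails with a clear message instead of a network error.

```python
from pathlib import Path


def load_from_local_script(script_path, config_name, **kwargs):
    """Load a dataset from a manually downloaded loading script.

    Hypothetical helper: it only validates the path and then forwards
    to datasets.load_dataset, as suggested in the comment above.
    """
    script = Path(script_path).expanduser()
    if not script.is_file():
        raise FileNotFoundError(f"Loading script not found: {script}")
    # Imported lazily so the path check runs even if `datasets` is missing.
    from datasets import load_dataset

    return load_dataset(str(script), config_name, **kwargs)


# Example call (the path is a placeholder, not a real location):
# dataset = load_from_local_script("path/to/oscar.py", "unshuffled_deduplicated_it")
```

Note that this only avoids fetching the loading script; the data files themselves are still downloaded over the network.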

yanyu2015 · 2022-03-22T01:12:38Z

It works, but another error occurs.

ConnectionError: Couldn't reach https://s3.amazonaws.com/datasets.huggingface.co/oscar/1.0/unshuffled/deduplicated/it/it_sha256.txt (SSLError(MaxRetryError("HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /datasets.huggingface.co/oscar/1.0/unshuffled/deduplicated/it/it_sha256.txt (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))")))

I can access https://s3.amazonaws.com/datasets.huggingface.co/oscar/1.0/unshuffled/deduplicated/it/it_sha256.txt and https://aws.amazon.com/cn/s3/ directly, so why does it report an SSLError? Do I need to modify the hosts file?

lhoestq · 2022-03-22T09:23:44Z

Could it be an issue with your Python environment or your version of OpenSSL?
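
One way to check this is to print the Python version and the OpenSSL build that the interpreter is linked against. The sketch below uses only the standard library; the function name describe_ssl_environment is an illustrative choice, not part of datasets.

```python
import ssl
import sys


def describe_ssl_environment():
    """Collect the interpreter and OpenSSL details relevant to SSL errors."""
    return {
        "python": sys.version.split()[0],      # interpreter version string
        "executable": sys.executable,          # which interpreter is running
        "openssl": ssl.OPENSSL_VERSION,        # OpenSSL build linked into Python
        "openssl_version_info": ssl.OPENSSL_VERSION_INFO,
    }


for key, value in describe_ssl_environment().items():
    print(f"{key}: {value}")
```

Running it in both the failing and the working environment makes it easy to compare the two setups.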

yanyu2015 · 2022-03-22T11:55:49Z

You are so wise!
It reports a ConnectionError in Python 3.9.7
and works well in Python 3.8.12.

I need your help again: how can I specify the path for the downloaded files?
The data is too large and my C drive does not have enough space.

lhoestq · 2022-03-22T13:23:25Z

Cool! You can specify the path for the downloaded files with the cache_dir parameter:

from datasets import load_dataset
dataset = load_dataset('oscar', 'unshuffled_deduplicated_it', cache_dir='path/to/directory')
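
Since the concern here is disk space, it may help to create the target directory and check its free space before starting a long download. This is a sketch under assumptions: prepare_cache_dir and the min_free_gb threshold are hypothetical names, and the drive path in the example is a placeholder.

```python
import shutil
from pathlib import Path


def prepare_cache_dir(path, min_free_gb=0.0):
    """Create the cache directory and verify the drive has enough free space.

    Hypothetical helper; returns the directory as a string so it can be
    passed to load_dataset(..., cache_dir=...).
    """
    cache_dir = Path(path).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    free_gb = shutil.disk_usage(cache_dir).free / 1024**3
    if free_gb < min_free_gb:
        raise OSError(
            f"Only {free_gb:.1f} GB free at {cache_dir}, need {min_free_gb} GB"
        )
    return str(cache_dir)


# Example call (placeholder path and threshold):
# from datasets import load_dataset
# cache = prepare_cache_dir("D:/hf_datasets_cache", min_free_gb=100)
# dataset = load_dataset('oscar', 'unshuffled_deduplicated_it', cache_dir=cache)
```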

yanyu2015 · 2022-03-24T11:35:57Z

It took me a few days to download the data completely. The error still occurs again sometimes, but changing the Python version is a feasible way to avoid this ConnectionError.
The cache_dir parameter works well. Thanks again for your kindness!
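
For a download that fails intermittently over several days, a small retry wrapper can save manual restarts. The sketch below is an illustration, not something proposed in this thread: it assumes the error surfaces as Python's built-in ConnectionError, as the traceback above suggests, and the name retry_on_connection_error and its parameters are hypothetical.

```python
import time


def retry_on_connection_error(func, attempts=5, initial_delay=1.0,
                              backoff=2.0, sleep=time.sleep):
    """Call func() and retry with exponential backoff on ConnectionError.

    Re-raises the last ConnectionError once all attempts are used up.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)       # wait before the next attempt
            delay *= backoff   # grow the wait each time


# Example call (assumes `datasets` is installed; the path is a placeholder):
# from datasets import load_dataset
# dataset = retry_on_connection_error(
#     lambda: load_dataset('oscar', 'unshuffled_deduplicated_it',
#                          cache_dir='path/to/directory'),
#     attempts=10, initial_delay=30.0,
# )
```

The sleep argument is injectable only so the helper can be exercised without real waiting.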

yanyu2015 added the bug (Something isn't working) label Mar 20, 2022

albertvillanova closed this as completed Mar 30, 2022
