ConnectionError and SSLError #3973

Closed
yanyu2015 opened this issue Mar 20, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@yanyu2015

code

from datasets import load_dataset
dataset = load_dataset('oscar', 'unshuffled_deduplicated_it')

bug report

---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_29788/2615425180.py in <module>
----> 1 dataset = load_dataset('oscar', 'unshuffled_deduplicated_it')

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1658 
   1659     # Create a dataset builder
-> 1660     builder_instance = load_dataset_builder(
   1661         path=path,
   1662         name=name,

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
   1484         download_config = download_config.copy() if download_config else DownloadConfig()
   1485         download_config.use_auth_token = use_auth_token
-> 1486     dataset_module = dataset_module_factory(
   1487         path,
   1488         revision=revision,

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1236                         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1237                     ) from None
-> 1238                 raise e1 from None
   1239     else:
   1240         raise FileNotFoundError(

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1173             if path.count("/") == 0:  # even though the dataset is on the Hub, we get it from GitHub for now
   1174                 # TODO(QL): use a Hub dataset module factory instead of GitHub
-> 1175                 return GithubDatasetModuleFactory(
   1176                     path,
   1177                     revision=revision,

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in get_module(self)
    531         revision = self.revision
    532         try:
--> 533             local_path = self.download_loading_script(revision)
    534         except FileNotFoundError:
    535             if revision is not None or os.getenv("HF_SCRIPTS_VERSION", None) is not None:

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\load.py in download_loading_script(self, revision)
    511         if download_config.download_desc is None:
    512             download_config.download_desc = "Downloading builder script"
--> 513         return cached_path(file_path, download_config=download_config)
    514 
    515     def download_dataset_infos_file(self, revision: Optional[str]) -> str:

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\utils\file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

D:\DataScience\PythonSet\IDES\anaconda\lib\site-packages\datasets\utils\file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/oscar/oscar.py (SSLError(MaxRetryError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/2.0.0/datasets/oscar/oscar.py (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))")))

It may be caused by the SSLError (maybe because I am in China?), since the same code works well on Google Colab.
So how can I download this dataset manually?

yanyu2015 added the bug label Mar 20, 2022
@lhoestq
Member

lhoestq commented Mar 21, 2022

Hi! You can download the oscar.py file from this repository at /datasets/oscar/oscar.py.

Then you can load the dataset by passing the local path to oscar.py to load_dataset:

load_dataset("path/to/oscar.py", "unshuffled_deduplicated_it")
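
A minimal sketch of that workaround, assuming oscar.py has already been saved locally and that the installed datasets release still supports script-based loading, such as the 2.0.0 shown in the traceback. The helper name `load_from_local_script` is hypothetical and not part of the library; it only checks the path before deferring to `load_dataset`.

```python
from pathlib import Path


def load_from_local_script(script_path, config_name, **kwargs):
    """Load a dataset from a manually downloaded loading script.

    Hypothetical helper: it validates the path, then calls
    datasets.load_dataset with the local script path.
    """
    script = Path(script_path)
    if not script.is_file():
        raise FileNotFoundError(f"Loading script not found: {script}")
    # Imported lazily so the path check runs even if datasets is not installed.
    from datasets import load_dataset

    return load_dataset(str(script), config_name, **kwargs)
```

It would be called as `load_from_local_script("path/to/oscar.py", "unshuffled_deduplicated_it")`. Only the script is read from disk; the data files themselves are still fetched over the network.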

@yanyu2015
Author

It works, but another error occurs.

ConnectionError: Couldn't reach https://s3.amazonaws.com/datasets.huggingface.co/oscar/1.0/unshuffled/deduplicated/it/it_sha256.txt (SSLError(MaxRetryError("HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /datasets.huggingface.co/oscar/1.0/unshuffled/deduplicated/it/it_sha256.txt (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))")))

I can access https://s3.amazonaws.com/datasets.huggingface.co/oscar/1.0/unshuffled/deduplicated/it/it_sha256.txt and https://aws.amazon.com/cn/s3/ directly, so why does it report an SSLError? Do I need to modify the hosts file?

@lhoestq
Member

lhoestq commented Mar 22, 2022

Could it be an issue with your Python environment or your version of OpenSSL?
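
One way to check this is to ask the interpreter which OpenSSL build it is linked against and to try a TLS handshake from Python itself, independent of datasets. The sketch below uses only the standard library; the function names `describe_tls_environment` and `probe_tls` are made up for illustration.

```python
import socket
import ssl
import sys


def describe_tls_environment():
    """Return the interpreter version and the OpenSSL build it is linked against."""
    return {
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,
    }


def probe_tls(host, port=443, timeout=10.0):
    """Attempt a TLS handshake with host:port using this interpreter's ssl module.

    Returns (True, negotiated protocol) on success, or (False, error text) on failure.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except OSError as exc:  # ssl.SSLError and socket timeouts are OSError subclasses
        return False, repr(exc)
```

Running `describe_tls_environment()` and `probe_tls("s3.amazonaws.com")` in each Python environment makes it easier to see whether the handshake failure follows the interpreter rather than the network.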

@yanyu2015
Author

You are so wise!
It reports a ConnectionError in Python 3.9.7
and works well in Python 3.8.12.

I need your help again: how can I specify the path for the downloaded files?
The data is too large and my C: drive does not have enough space.

@lhoestq
Member

lhoestq commented Mar 22, 2022

Cool! You can specify the path for the downloaded files with the cache_dir parameter:

from datasets import load_dataset
dataset = load_dataset('oscar', 'unshuffled_deduplicated_it', cache_dir='path/to/directory')
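
Since the motivation here is limited disk space, it may help to check free space before choosing the directory. The following is a sketch only: `pick_cache_dir` is a hypothetical helper, and the candidate paths and size threshold are placeholders to be replaced with real values.

```python
import shutil
from pathlib import Path


def pick_cache_dir(candidates, required_bytes):
    """Return the first candidate directory whose drive has enough free space.

    Each candidate is created if it does not exist yet.
    Raises OSError if no candidate has required_bytes free.
    """
    for candidate in candidates:
        path = Path(candidate)
        path.mkdir(parents=True, exist_ok=True)
        if shutil.disk_usage(path).free >= required_bytes:
            return path
    raise OSError(f"No candidate directory has {required_bytes} bytes free")
```

The result can then be passed on, for example `load_dataset('oscar', 'unshuffled_deduplicated_it', cache_dir=str(pick_cache_dir(['path/to/directory'], required_bytes)))`, where `required_bytes` is an estimate you supply yourself.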

@yanyu2015
Author

yanyu2015 commented Mar 24, 2022

It took me some days to download the data completely. Although the error sometimes occurs again, changing the Python version is a feasible way to avoid this ConnectionError.
The cache_dir parameter works well; thanks again for your kindness!
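
For the occasional recurrence during a multi-day download, a simple retry loop with backoff is one option. This is a generic sketch and not something datasets provides under this name: `retry_on_connection_error` is hypothetical, and it assumes the failure surfaces as Python's built-in ConnectionError, as the traceback suggests. Pass other exception types through `exceptions` if your version raises something different.

```python
import time


def retry_on_connection_error(
    action,
    attempts=5,
    base_delay=2.0,
    exceptions=(ConnectionError,),
    sleep=time.sleep,
):
    """Call action() until it succeeds, retrying on the given exception types.

    Waits base_delay * 2**n seconds after the n-th failure (n starts at 0).
    The exception from the final attempt is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except exceptions:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A call might look like `retry_on_connection_error(lambda: load_dataset('oscar', 'unshuffled_deduplicated_it', cache_dir='path/to/directory'))`. Retrying does not address the underlying SSL problem, so it complements rather than replaces switching the Python version.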
