
(Load dataset failure) ConnectionError: Couldn’t reach https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py #759

Closed
AI678 opened this issue Oct 25, 2020 · 19 comments

Comments


AI678 commented Oct 25, 2020

Hey, I want to load the cnn_dailymail dataset for fine-tuning.
I wrote the code like this:
from datasets import load_dataset

test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

And I got the following error:

Traceback (most recent call last):
File "test.py", line 7, in
test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 589, in load_dataset
module_path, hash = prepare_module(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 268, in prepare_module
local_path = cached_path(file_path, download_config=download_config)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 300, in cached_path
output_path = get_from_cache(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 475, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py

How can I fix this ?


lhoestq commented Oct 25, 2020

Are you running the script on a machine with an internet connection ?


AI678 commented Oct 26, 2020

Yes, I can browse the URL through Google Chrome.


lhoestq commented Oct 26, 2020

Does this HEAD request return 200 on your machine ?

import requests                                                                                                                                                                                                         
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

If it returns 200, could you try again to load the dataset ?
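
For reference, a minimal sketch that chains the connectivity check with the dataset load (same URL and call as above; nothing here is specific to a particular setup):

import requests
from datasets import load_dataset

url = "https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py"
response = requests.head(url)
print(response.status_code)  # expect 200 if the dataset script is reachable

if response.status_code == 200:
    # retry the load only when the script file can actually be fetched
    test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
    print(test_dataset)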


AI678 commented Oct 26, 2020

Thank you very much for your response.
When I run

import requests                                                                                                                                                                                                         
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

It returns 200.

And when I tried to load the dataset again, I got the following error:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 608, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\builder.py", line 475, in download_and_prepare
self._download_and_prepare(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\builder.py", line 531, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "C:\Users\666666.cache\huggingface\modules\datasets_modules\datasets\cnn_dailymail\0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602\cnn_dailymail.py", line 253, in _split_generators
dl_paths = dl_manager.download_and_extract(_DL_URLS)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\download_manager.py", line 254, in download_and_extract
return self.extract(self.download(url_or_urls))
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\download_manager.py", line 175, in download
downloaded_path_or_paths = map_nested(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 224, in map_nested
mapped = [
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 225, in
_single_map_nested((function, obj, types, None, True)) for obj in tqdm(iterable, disable=disable_tqdm)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 163, in _single_map_nested
return function(data_struct)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 300, in cached_path
output_path = get_from_cache(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 475, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ

A connection error happened again, but the URL was different.

I added the following code:

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")

This didn't return 200. It returned the following:

Traceback (most recent call last):
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
conn = connection.create_connection(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
raise err
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
self._validate_conn(conn)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
conn.connect()
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 309, in connect
conn = self._new_conn()
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001F6060618E0>: Failed to establish a new connection: [WinError 10060]


lhoestq commented Oct 26, 2020

Is Google Drive blocked on your network ?
For me

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")

returns 200


AI678 commented Oct 26, 2020

I can browse Google Drive through Google Chrome. It's weird. I can download the dataset from Google Drive manually.


lhoestq commented Oct 26, 2020

Could you try to update requests maybe ?
It works with 2.23.0 on my side
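
For reference, you can check which version is installed with the standard attribute:

import requests
print(requests.__version__)  # 2.23.0 works on my side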


AI678 commented Oct 26, 2020

My requests version is 2.24.0. It still doesn't return 200.


AI678 commented Oct 26, 2020

Is it possible to download the dataset manually from Google Drive and use it for further testing? How can I do this? I want to reproduce the model at https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16, but I can't download the dataset through the load_dataset method. I have tried many times and the connection error always happens.


lhoestq commented Oct 26, 2020

The head request should definitely work, not sure what's going on on your side.
If you find a way to make it work, please post it here since other users might encounter the same issue.

If you don't manage to fix it, you can use load_dataset on Google Colab and then save it using dataset.save_to_disk("path/to/dataset").
Then you can download the directory to your machine and do:

from datasets import load_from_disk
dataset = load_from_disk("path/to/local/dataset")
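
Putting the two steps together, a minimal sketch of that workaround (the paths are placeholders):

# Step 1: on Google Colab (or any machine with working access)
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
dataset.save_to_disk("path/to/dataset")

# Step 2: on the local machine, after copying the saved directory over
from datasets import load_from_disk
dataset = load_from_disk("path/to/local/dataset")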

@smile0925

Hi,
I want to know if this problem has been solved, because I encountered a similar issue. Thanks.

train_data = datasets.load_dataset("xsum", split="train")
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/xsum/xsum.py


lhoestq commented Feb 1, 2021

Hi @smile0925 ! Do you have an internet connection ? Are you using some kind of proxy that may block access to this file ?

Otherwise you can try to update datasets, since we introduced retries for HTTP requests in version 1.2.0:

pip install --upgrade datasets

Let me know if that helps.
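
If a proxy is involved, requests (which datasets uses for downloads) honors the standard proxy environment variables, so a sketch like the following may help; the proxy address below is a placeholder:

import os
from datasets import load_dataset

# hypothetical proxy address; replace with the one your network actually uses
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"

train_data = load_dataset("xsum", split="train")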

@smile0925

Hi @lhoestq
Oh, maybe you are right. I found that my server uses some kind of proxy that blocks access to this file.

@ZhengxiangShi

> Hi @lhoestq
> Oh, maybe you are right. I found that my server uses some kind of proxy that blocks access to this file.

I have the same problem, have you solved it? Many thanks.

@smile0925

Hi @ZhengxiangShi
You can first check whether your network can access these files. I need to use a VPN to access them, so I download the files that cannot be accessed to my local machine in advance and then use them in the code, like this:

train_data = datasets.load_dataset("xsum.py", split="train")


mikechen66 commented Sep 12, 2023

On Ubuntu 20.04, I get the following results.

Google Drive is OK, but raw.githubusercontent.com has a big problem: the certificate served for it doesn't match the hostname, so urllib3 rejects the HTTPS connection.

1. Google Drive

import requests

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")
<Response [200]>

2. raw.githubusercontent.com

import requests
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

........

raise CertificateError(
urllib3.util.ssl_match_hostname.CertificateError: hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
........
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))

........

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

.......

raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))

3. XSUM

from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/EdinburghNLP/XSum/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json (SSLError(MaxRetryError('HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /EdinburghNLP/XSum/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))')))

The following snippet did not solve the underlying SSL error either:

import ssl

# Fall back to an unverified default HTTPS context (disables certificate verification globally).
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context


lhoestq commented Sep 12, 2023

Only the oldest versions of datasets use raw.githubusercontent.com. Can you try updating datasets ?


mikechen66 commented Sep 12, 2023

Thanks lhoestq for the quick response.

I solved the issue from the command line as follows.

1. Open the hosts file (Ubuntu 20.04)

$ sudo gedit /etc/hosts

2. Add the following line to the hosts file

151.101.0.133 raw.githubusercontent.com

3. Save the hosts file

After that, the Jupyter notebook can access the datasets module and download the XSUM dataset from raw.githubusercontent.com.
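
A quick way to check that the hosts entry took effect (a sketch; the printed address should match whatever was added to /etc/hosts):

import socket
import requests

print(socket.gethostbyname("raw.githubusercontent.com"))  # e.g. 151.101.0.133
print(requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py"))  # expect <Response [200]>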

So it is not the users' fault, but most of the suggestions on the web are wrong. Anyway, I finally solved the problem.

By the way, users may need to add other GitHub-related entries as well, such as the following.

199.232.69.194 github.global.ssl.fastly.net

Cheers!!!

