
(Load dataset failure) ConnectionError: Couldn’t reach https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py #759

Closed
AI678 opened this issue Oct 25, 2020 · 19 comments

Comments


AI678 commented Oct 25, 2020

Hey, I want to load the cnn_dailymail dataset for fine-tuning.
I wrote the code like this:
from datasets import load_dataset

test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

And I got the following error:

Traceback (most recent call last):
File "test.py", line 7, in
test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 589, in load_dataset
module_path, hash = prepare_module(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 268, in prepare_module
local_path = cached_path(file_path, download_config=download_config)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 300, in cached_path
output_path = get_from_cache(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 475, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py

How can I fix this ?


lhoestq commented Oct 25, 2020

Are you running the script on a machine with an internet connection ?


AI678 commented Oct 26, 2020

Yes, I can browse the URL through Google Chrome.


lhoestq commented Oct 26, 2020

Does this HEAD request return 200 on your machine ?

import requests                                                                                                                                                                                                         
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

If it returns 200, could you try again to load the dataset ?
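
For reference, a minimal sketch that chains the connectivity check with the dataset load (same URL and call as above; nothing here is specific to a particular setup):

import requests
from datasets import load_dataset

url = "https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py"
response = requests.head(url)
print(response.status_code)  # expect 200 if the dataset script is reachable

if response.status_code == 200:
    # retry the load only when the script file can actually be fetched
    test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
    print(test_dataset)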


AI678 commented Oct 26, 2020

Thank you very much for your response.
When I run

import requests                                                                                                                                                                                                         
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

It returns 200.

And when I tried to load the dataset again, I got the following error:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 608, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\builder.py", line 475, in download_and_prepare
self._download_and_prepare(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\builder.py", line 531, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "C:\Users\666666.cache\huggingface\modules\datasets_modules\datasets\cnn_dailymail\0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602\cnn_dailymail.py", line 253, in _split_generators
dl_paths = dl_manager.download_and_extract(_DL_URLS)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\download_manager.py", line 254, in download_and_extract
return self.extract(self.download(url_or_urls))
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\download_manager.py", line 175, in download
downloaded_path_or_paths = map_nested(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 224, in map_nested
mapped = [
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 225, in
_single_map_nested((function, obj, types, None, True)) for obj in tqdm(iterable, disable=disable_tqdm)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 163, in _single_map_nested
return function(data_struct)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 300, in cached_path
output_path = get_from_cache(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 475, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ

A connection error happened again, but the URL was different.

I added the following code:

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")

This didn't return 200. It returned the following:

Traceback (most recent call last):
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
conn = connection.create_connection(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
raise err
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
self._validate_conn(conn)
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
conn.connect()
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 309, in connect
conn = self._new_conn()
File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001F6060618E0>: Failed to establish a new connection: [WinError 10060]


lhoestq commented Oct 26, 2020

Is Google Drive blocked on your network ?
For me

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")

returns 200


AI678 commented Oct 26, 2020

I can browse Google Drive through Google Chrome. It's weird. I can download the dataset from Google Drive manually.


lhoestq commented Oct 26, 2020

Could you try to update requests maybe ?
It works with 2.23.0 on my side
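
For reference, you can check which version is installed with the standard attribute:

import requests
print(requests.__version__)  # 2.23.0 works on my side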


AI678 commented Oct 26, 2020

My requests version is 2.24.0. It still doesn't return 200.


AI678 commented Oct 26, 2020

Is it possible to download the dataset manually from Google Drive and use it for further testing? How can I do this? I want to reproduce the model at https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16, but I can't download the dataset through the load_dataset method. I have tried many times and the connection error always happens.


lhoestq commented Oct 26, 2020

The head request should definitely work, not sure what's going on on your side.
If you find a way to make it work, please post it here since other users might encounter the same issue.

If you don't manage to fix it, you can use load_dataset on Google Colab and then save it using dataset.save_to_disk("path/to/dataset").
Then you can download the directory to your machine and do:

from datasets import load_from_disk
dataset = load_from_disk("path/to/local/dataset")
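
Putting the two steps together, a minimal sketch of that workaround (the paths are placeholders):

# Step 1: on Google Colab (or any machine with working access)
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
dataset.save_to_disk("path/to/dataset")

# Step 2: on the local machine, after copying the saved directory over
from datasets import load_from_disk
dataset = load_from_disk("path/to/local/dataset")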

@smile0925

Hi,
I want to know if this problem has been solved, because I encountered a similar issue. Thanks.

train_data = datasets.load_dataset("xsum", split="train")
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/xsum/xsum.py


lhoestq commented Feb 1, 2021

Hi @smile0925 ! Do you have an internet connection ? Are you using some kind of proxy that may block access to this file ?

Otherwise you can try to update datasets, since we introduced retries for HTTP requests in version 1.2.0:

pip install --upgrade datasets

Let me know if that helps.
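
If a proxy is involved, requests (which datasets uses for downloads) honors the standard proxy environment variables, so a sketch like the following may help; the proxy address below is a placeholder:

import os
from datasets import load_dataset

# hypothetical proxy address; replace with the one your network actually uses
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"

train_data = load_dataset("xsum", split="train")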

@smile0925

Hi @lhoestq
Oh, maybe you are right. I found that my server uses some kind of proxy that blocks access to this file.

@ZhengxiangShi

> Hi @lhoestq
> Oh, maybe you are right. I found that my server uses some kind of proxy that blocks access to this file.

I have the same problem, have you solved it? Many thanks.

@smile0925

Hi @ZhengxiangShi
You can first check whether your network can access these files. I need to use a VPN to access them, so I download the files that cannot be accessed to my local machine in advance and then use them in the code, like this:

train_data = datasets.load_dataset("xsum.py", split="train")


mikechen66 commented Sep 12, 2023

On Ubuntu 20.04, I get the following results.

Google Drive is OK, but raw.githubusercontent.com has a big problem: the certificate served for it doesn't match the hostname, so urllib3 rejects the HTTPS connection.

1. Google Drive

import requests

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")
<Response [200]>

2. raw.githubusercontent.com

import requests
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

........

raise CertificateError(
urllib3.util.ssl_match_hostname.CertificateError: hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
........
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))

........

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

.......

raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))

3. XSUM

from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/EdinburghNLP/XSum/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json (SSLError(MaxRetryError('HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /EdinburghNLP/XSum/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))')))

The following snippet did not solve the underlying SSL error either:

import ssl

# Fall back to an unverified default HTTPS context (disables certificate verification globally).
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context


lhoestq commented Sep 12, 2023

Only the oldest versions of datasets use raw.githubusercontent.com. Can you try updating datasets ?


mikechen66 commented Sep 12, 2023

Thanks lhoestq for the quick response.

I solved the issue from the command line as follows.

1. Open the hosts file (Ubuntu 20.04)

$ sudo gedit /etc/hosts

2. Add the following line to the hosts file

151.101.0.133 raw.githubusercontent.com

3. Save the hosts file

After that, the Jupyter notebook can access the datasets module and download the XSUM dataset from raw.githubusercontent.com.
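
A quick way to check that the hosts entry took effect (a sketch; the printed address should match whatever was added to /etc/hosts):

import socket
import requests

print(socket.gethostbyname("raw.githubusercontent.com"))  # e.g. 151.101.0.133
print(requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py"))  # expect <Response [200]>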

So it is not the users' fault, but most of the suggestions on the web are wrong. Anyway, I finally solved the problem.

By the way, users may need to add other GitHub-related entries as well, such as the following.

199.232.69.194 github.global.ssl.fastly.net

Cheers!!!

