HTTPError: 404 Client Error: Not Found for url #5086

Closed
km5ar opened this issue Oct 6, 2022 · 3 comments · Fixed by huggingface/course#334
Labels
bug Something isn't working

Comments

km5ar commented Oct 6, 2022

Describe the bug

I was following Chapter 5 of the Hugging Face course: https://huggingface.co/course/chapter5/6?fw=tf

However, I'm not able to download the dataset; the request fails with a 404 error.


Steps to reproduce the bug

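from datasets import load_dataset
from huggingface_hub import hf_hub_url

# Build the direct URL of the JSON Lines file in the dataset repo
data_files = hf_hub_url(
    repo_id="lewtun/github-issues",
    filename="datasets-issues-with-hf-doc-builder.jsonl",
    repo_type="dataset",
)

# Load that file with the generic JSON builder (this is the call that hits the 404)
issues_dataset = load_dataset("json", data_files=data_files, split="train")
issues_dataset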
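
A 404 at this step means the file URL passed to `load_dataset` does not resolve. The sketch below shows one way to check that URL on its own, using only the standard library. It assumes the Hub serves dataset files under a `/datasets/<repo_id>/resolve/<revision>/<filename>` path; `dataset_file_url` and `url_status` are hypothetical helper names for illustration, not part of `huggingface_hub`.

```python
import urllib.error
import urllib.request

# Assumption: this mirrors the public "resolve" URL layout for dataset repos
# on the Hub. It is an illustration, not huggingface_hub's own implementation.
HUB_ENDPOINT = "https://huggingface.co"


def dataset_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct URL of a file stored in a dataset repo."""
    return f"{HUB_ENDPOINT}/datasets/{repo_id}/resolve/{revision}/{filename}"


def url_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for `url` (for example 200, or 404 if missing).

    Network failures other than HTTP errors (DNS, timeouts) are not caught
    here and will raise urllib.error.URLError.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


url = dataset_file_url(
    "lewtun/github-issues",
    "datasets-issues-with-hf-doc-builder.jsonl",
)
print(url)
```

Calling `url_status(url)` with network access tells you whether the file is still there. If it returns 404, `huggingface_hub.list_repo_files("lewtun/github-issues", repo_type="dataset")` should show which filenames the repo currently contains.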

Environment info

  • datasets version: 2.5.2
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.9.12
  • PyArrow version: 9.0.0
  • Pandas version: 1.4.4

km5ar added the bug label on Oct 6, 2022

osanseviero (Member) commented

FYI @lewtun

albertvillanova (Member) commented

Hi @km5ar, thanks for reporting.

This should be fixed in the notebook:

Anyway, depending on your version of datasets, you can now use:

from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues")
issues_dataset

instead of:

from datasets import load_dataset
from huggingface_hub import hf_hub_url

data_files = hf_hub_url(
    repo_id="lewtun/github-issues",
    filename="datasets-issues-with-hf-doc-builder.jsonl",
    repo_type="dataset",
)

issues_dataset = load_dataset("json", data_files=data_files, split="train")
issues_dataset

Output:

In [25]: ds = load_dataset("lewtun/github-issues")
Downloading: 100%|██████████| 10.5k/10.5k [00:00<00:00, 5.75MB/s]
Using custom data configuration lewtun--github-issues-cff5093ecc410ea2
Downloading and preparing dataset json/lewtun--github-issues to .../.cache/huggingface/datasets/lewtun___json/lewtun--github-issues-cff5093ecc410ea2/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab...
Downloading data: 100%|██████████| 12.2M/12.2M [00:00<00:00, 26.5MB/s]
Downloading data files: 100%|██████████| 1/1 [00:02<00:00,  2.70s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1589.96it/s]
Dataset json downloaded and prepared to .../.cache/huggingface/datasets/lewtun___json/lewtun--github-issues-cff5093ecc410ea2/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab. Subsequent calls will reuse this data.
100%|██████████| 1/1 [00:00<00:00, 133.95it/s]

In [26]: ds
Out[26]: 
DatasetDict({
    train: Dataset({
        features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
        num_rows: 3019
    })
})
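
Note that the shorter call returns a `DatasetDict` with a single `train` split, whereas the original snippet passed `split="train"` and got a `Dataset` back directly. A minimal sketch of keeping that behavior with the shorter call, assuming `datasets` is installed and the Hub is reachable (`load_train_split` is a hypothetical helper name, not part of the library):

```python
REPO_ID = "lewtun/github-issues"


def load_train_split(repo_id: str = REPO_ID):
    """Load only the train split, returning a Dataset instead of a DatasetDict."""
    # Imported lazily so that defining this helper needs neither the
    # library nor network access; calling it needs both.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train")
```

With this, `issues_dataset = load_train_split()` can stand in for the `issues_dataset` used in the rest of the course chapter. Alternatively, indexing the `DatasetDict` with `["train"]` gives the same split.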

lewtun (Member) commented Oct 7, 2022

Thanks for reporting, @km5ar, and thank you, @albertvillanova, for the quick solution! I'll post a fix on the source too.
