<a href="https://colab.research.google.com/github/loganathanspr/nlp_course/blob/main/create_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating your own dataset

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [8]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.


You will need to set up Git. Adapt your email and name in the following cell.

In [9]:
!git config --global user.email "loganathanspr@gmail.com"
!git config --global user.name "Loganathan Ramasamy"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [10]:
from huggingface_hub import notebook_login

notebook_login()


In [11]:
!pip install requests



In [12]:
import requests

url = "https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
response = requests.get(url)

In [13]:
response.status_code

200

In [15]:
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/6281',
  'repository_url': 'https://api.github.com/repos/huggingface/datasets',
  'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/6281/labels{/name}',
  'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/6281/comments',
  'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/6281/events',
  'html_url': 'https://github.com/huggingface/datasets/pull/6281',
  'id': 1928456959,
  'node_id': 'PR_kwDODunzps5cBQPd',
  'number': 6281,
  'title': 'Improve documentation of dataset.from_generator',
  'user': {'login': 'hartmans',
   'id': 53510,
   'node_id': 'MDQ6VXNlcjUzNTEw',
   'avatar_url': 'https://avatars.githubusercontent.com/u/53510?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/hartmans',
   'html_url': 'https://github.com/hartmans',
   'followers_url': 'https://api.github.com/users/hartmans/followers',
   'following_url': 'https://api.

In [16]:
GITHUB_TOKEN = "XXX"  # Copy your GitHub token here
headers = {"Authorization": f"token {GITHUB_TOKEN}"}
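The token matters for rate limiting: GitHub's REST API allows 60 unauthenticated requests per hour, versus 5,000 per hour with a personal access token. Each response reports the current quota in its `X-RateLimit-*` headers. A minimal sketch of reading them follows; `rate_limit_status` is a hypothetical helper (not part of `requests` or the GitHub API), and the header values are made up for illustration.

```python
from datetime import datetime, timezone


def rate_limit_status(response_headers):
    """Summarize GitHub's rate-limit headers (hypothetical helper for illustration)."""
    limit = int(response_headers["X-RateLimit-Limit"])
    remaining = int(response_headers["X-RateLimit-Remaining"])
    # The reset time is reported as UTC epoch seconds
    reset_at = datetime.fromtimestamp(
        int(response_headers["X-RateLimit-Reset"]), tz=timezone.utc
    )
    return {"limit": limit, "remaining": remaining, "reset_at": reset_at}


# Made-up header values; a plain dict is case-sensitive, unlike response.headers
example_headers = {
    "X-RateLimit-Limit": "5000",
    "X-RateLimit-Remaining": "4999",
    "X-RateLimit-Reset": "1700000000",
}
status = rate_limit_status(example_headers)
print(status["remaining"], "requests left until", status["reset_at"].isoformat())
```

In the notebook, passing `response.headers` from any authenticated request should work the same way, since `requests` exposes headers through a case-insensitive mapping.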

In [21]:
import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm


def fetch_issues(
    owner="huggingface",
    repo="datasets",
    num_issues=4000,
    rate_limit=5_000,
    issues_path=Path("."),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    # GitHub numbers result pages from 1, so iterate over 1..num_pages
    for page in tqdm(range(1, num_pages + 1)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        # Fail on an HTTP error instead of storing the error payload as issues
        issues.raise_for_status()
        batch.extend(issues.json())

        if len(batch) >= num_issues:
            print("Reached maximum number of issues, Quitting the download...")
            break
        if len(batch) > rate_limit:
            print("Reached GitHub rate limit. Quitting the download...")
            break

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    # Name the file after num_issues so it matches the path read back later
    output_file = f"{issues_path}/{repo}-issues-{num_issues}.jsonl"
    df.to_json(output_file, orient="records", lines=True)
    print(f"Downloaded all the issues for {repo}! Dataset stored at {output_file}")
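GitHub's pagination numbers pages from 1, and each page holds at most `per_page` items, so 4,000 issues at 100 per page means 40 requests. A minimal sketch of that arithmetic, where `build_page_urls` is a hypothetical helper that only builds the URLs and sends no requests:

```python
import math


def build_page_urls(owner, repo, num_issues, per_page=100):
    """Return the paginated issue URLs needed to cover num_issues (hypothetical helper)."""
    base_url = "https://api.github.com/repos"
    num_pages = math.ceil(num_issues / per_page)
    return [
        f"{base_url}/{owner}/{repo}/issues?page={page}&per_page={per_page}&state=all"
        for page in range(1, num_pages + 1)  # pages are 1-indexed
    ]


urls = build_page_urls("huggingface", "datasets", num_issues=4000)
print(len(urls))
print(urls[0])
```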

In [22]:
# Depending on your internet connection, this can take several minutes to run...
fetch_issues()

  0%|          | 0/40 [00:00<?, ?it/s]

Reached maximum number of issues, Quitting the download...
Downloaded all the issues for datasets! Dataset stored at ./datasets-issues-4000.jsonl


In [35]:
import pandas as pd
df = pd.read_json("datasets-issues-4000.jsonl", lines=True)

cols = [
    "id",
    "number",
    "html_url",
    "title",
    "labels",
    "comments",
    "created_at",
    "updated_at",
    "closed_at",
    "draft",
    "pull_request",
    "body"
]
df_cleaned = df[cols]
df_cleaned.to_json("datasets-issues-4000-cleaned.jsonl", orient="records", lines=True)
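Calling `to_json` with `orient="records"` and `lines=True` writes JSON Lines: one JSON object per line, with no enclosing array. A small standard-library sketch of that layout, using made-up records rather than real issues:

```python
import io
import json

# Made-up records that mimic the shape of the issue data
records = [
    {"number": 1, "title": "First issue", "pull_request": None},
    {"number": 2, "title": "First pull request", "pull_request": {"url": "https://example.com/pulls/2"}},
]

# Write one JSON object per line
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the lines back, skipping any blank ones
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded == records)
```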

In [36]:
from datasets import load_dataset
issues_dataset = load_dataset("json", data_files="datasets-issues-4000-cleaned.jsonl", split="train")
issues_dataset

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['id', 'number', 'html_url', 'title', 'labels', 'comments', 'created_at', 'updated_at', 'closed_at', 'draft', 'pull_request', 'body'],
    num_rows: 4000
})

In [37]:
sample = issues_dataset.shuffle(seed=666).select(range(3))

for url, pr in zip(sample["html_url"], sample["pull_request"]):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

>> URL: https://github.com/huggingface/datasets/issues/5315
>> Pull request: None

>> URL: https://github.com/huggingface/datasets/pull/4189
>> Pull request: {'url': 'https://api.github.com/repos/huggingface/datasets/pulls/4189', 'html_url': 'https://github.com/huggingface/datasets/pull/4189', 'diff_url': 'https://github.com/huggingface/datasets/pull/4189.diff', 'patch_url': 'https://github.com/huggingface/datasets/pull/4189.patch', 'merged_at': datetime.datetime(2022, 5, 6, 8, 35, 52)}

>> URL: https://github.com/huggingface/datasets/pull/3866
>> Pull request: {'url': 'https://api.github.com/repos/huggingface/datasets/pulls/3866', 'html_url': 'https://github.com/huggingface/datasets/pull/3866', 'diff_url': 'https://github.com/huggingface/datasets/pull/3866.diff', 'patch_url': 'https://github.com/huggingface/datasets/pull/3866.patch', 'merged_at': datetime.datetime(2022, 3, 8, 17, 37, 1)}



In [38]:
issues_dataset = issues_dataset.map(lambda x: {"is_pull_request": x["pull_request"] is not None})

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]
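The new column reduces to a `None` check: the `pull_request` field is `None` for plain issues and a dictionary for pull requests, as the sample above shows. A sketch on plain Python dictionaries (made-up records, not the real dataset) that applies the same check and counts each kind:

```python
from collections import Counter

# Made-up records: pull_request is None for issues and a dict for pull requests
records = [
    {"number": 10, "pull_request": None},
    {"number": 11, "pull_request": {"url": "https://example.com/pulls/11"}},
    {"number": 12, "pull_request": {"url": "https://example.com/pulls/12"}},
]

flags = [record["pull_request"] is not None for record in records]
counts = Counter(flags)
print(f"Pull requests: {counts[True]}, issues: {counts[False]}")
```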

In [39]:
issue_number = 2792
url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
response = requests.get(url, headers=headers)
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/comments/897594128',
  'html_url': 'https://github.com/huggingface/datasets/pull/2792#issuecomment-897594128',
  'issue_url': 'https://api.github.com/repos/huggingface/datasets/issues/2792',
  'id': 897594128,
  'node_id': 'IC_kwDODunzps41gDMQ',
  'user': {'login': 'bhavitvyamalik',
   'id': 19718818,
   'node_id': 'MDQ6VXNlcjE5NzE4ODE4',
   'avatar_url': 'https://avatars.githubusercontent.com/u/19718818?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/bhavitvyamalik',
   'html_url': 'https://github.com/bhavitvyamalik',
   'followers_url': 'https://api.github.com/users/bhavitvyamalik/followers',
   'following_url': 'https://api.github.com/users/bhavitvyamalik/following{/other_user}',
   'gists_url': 'https://api.github.com/users/bhavitvyamalik/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/bhavitvyamalik/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/

In [40]:
def get_comments(issue_number):
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    return [r["body"] for r in response.json()]

get_comments(2792)

["@albertvillanova my tests are failing here:\r\n```\r\ndataset_name = 'gooaq'\r\n\r\n    def test_load_dataset(self, dataset_name):\r\n        configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)[:1]\r\n>       self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True, use_local_dummy_data=True)\r\n\r\ntests/test_dataset_common.py:234: \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\ntests/test_dataset_common.py:187: in check_load_dataset\r\n    self.parent.assertTrue(len(dataset[split]) > 0)\r\nE   AssertionError: False is not true\r\n```\r\nWhen I try loading dataset on local machine it works fine. Any suggestions on how can I avoid this error?",
 'Thanks for the help, @albertvillanova! All tests are passing now.']
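`get_comments` assumes the response decodes to a list of comment objects. When a request fails (for example, once the rate limit is exhausted), GitHub returns a JSON object with a `message` field instead, and indexing `r["body"]` then raises a confusing `TypeError`. A defensive sketch, assuming a hypothetical helper `extract_comment_bodies` that takes the already decoded payload:

```python
def extract_comment_bodies(payload):
    """Return comment bodies from a decoded comments response (hypothetical helper).

    Raises RuntimeError if the payload is an error object rather than a list.
    """
    if isinstance(payload, dict):
        raise RuntimeError(f"GitHub API error: {payload.get('message', 'unknown error')}")
    return [comment["body"] for comment in payload]


# Made-up payloads for illustration
ok_payload = [{"body": "First comment"}, {"body": "Second comment"}]
bodies = extract_comment_bodies(ok_payload)
print(bodies)

try:
    extract_comment_bodies({"message": "API rate limit exceeded"})
except RuntimeError as err:
    error_text = str(err)
    print(error_text)
```

Inside `get_comments`, this would replace the list comprehension, called as `extract_comment_bodies(response.json())`.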

In [42]:
issues_with_comments_dataset = issues_dataset.select(range(500))
issues_with_comments_dataset

Dataset({
    features: ['id', 'number', 'html_url', 'title', 'labels', 'comments', 'created_at', 'updated_at', 'closed_at', 'draft', 'pull_request', 'body', 'is_pull_request'],
    num_rows: 500
})

In [44]:
issues_with_comments_dataset = issues_with_comments_dataset.map(lambda x: {"comments": get_comments(x["number"])})

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [45]:
from huggingface_hub import notebook_login

notebook_login()


In [46]:
issues_with_comments_dataset.push_to_hub("github-issues")

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

In [47]:
remote_dataset = load_dataset("loganathanspr/github-issues", split="train")
remote_dataset

Downloading readme:   0%|          | 0.00/1.34k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/1.59M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/500 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'number', 'html_url', 'title', 'labels', 'comments', 'created_at', 'updated_at', 'closed_at', 'draft', 'pull_request', 'body', 'is_pull_request'],
    num_rows: 500
})