# Creating your own dataset

Task - Create a corpus of GitHub Issues.

 This corpus could be used for various purposes, including:

- Exploring how long it takes to close open issues or pull requests
- Training a multilabel classifier that can tag issues with metadata based on the issue’s description (e.g., “bug,” “enhancement,” or “question”)
- Creating a semantic search engine to find which issues match a user’s query

In [1]:
%%capture
!pip install datasets transformers[sentencepiece]
!apt install git-lfs

## Getting the data


In [2]:
%pip install requests -qqq

In [3]:
# retrieve the first issue on the first page
import requests

In [4]:
url = "https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
response = requests.get(url)

In [5]:
response.status_code

200

In [6]:
response.json()

[{'active_lock_reason': None,
  'assignee': None,
  'assignees': [],
  'author_association': 'MEMBER',
  'body': 'Add The Pile subsets:\r\n- pubmed\r\n- ubuntu_irc\r\n- europarl\r\n- hacker_news\r\n- nih_exporter\r\n\r\nClose bigscience-workshop/data_tooling#301.',
  'closed_at': None,
  'comments': 0,
  'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/3378/comments',
  'created_at': '2021-12-03T13:14:54Z',
  'draft': False,
  'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/3378/events',
  'html_url': 'https://github.com/huggingface/datasets/pull/3378',
  'id': 1070580126,
  'labels': [],
  'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/3378/labels{/name}',
  'locked': False,
  'milestone': None,
  'node_id': 'PR_kwDODunzps4vXF1D',
  'number': 3378,
  'performed_via_github_app': None,
  'pull_request': {'diff_url': 'https://github.com/huggingface/datasets/pull/3378.diff',
   'html_url': 'https://github.com/huggin

In [7]:
response.json()[0].keys()

dict_keys(['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app'])

In [8]:
GITHUB_TOKEN = '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'
headers = {'Authorization': f'token {GITHUB_TOKEN}'}

In [9]:
import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm

In [10]:
def fetch_issues(owner='huggingface', repo='datasets', num_issues=10_000, rate_limit=5_000, 
                 issues_path=Path('.')):
  if not issues_path.is_dir():
    issues_path.mkdir(exist_ok=True)

  # Define vars
  batch, all_issues, per_page = [], [], 100
  num_pages = math.ceil(num_issues / per_page)
  base_url = "https://api.github.com/repos"

  #https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
  for page in tqdm(range(num_pages)):
    # Query with state=all to get open and closed issues
    query = f'issues?page={page}&per_page={per_page}&state=all'
    issues = requests.get(f'{base_url}/{owner}/{repo}/{query}', headers=headers)
    batch.extend(issues.json())

    if len(batch) > rate_limit and len(all_issues) < num_issues:
      all_issues.extend(batch)
      batch = [] # Flush batch for the next time period
      print(f"Reached GitHub rate limit. Sleeping for one hour ...")
      time.sleep(60 * 60 + 1)
  
  all_issues.extend(batch)
  df = pd.DataFrame.from_records(all_issues)
  df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient='records', lines=True)
  print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
    )

In [11]:
fetch_issues()

  0%|          | 0/100 [00:00<?, ?it/s]

Downloaded all the issues for datasets! Dataset stored at ./datasets-issues.jsonl


In [15]:
from datasets import load_dataset
issues_dataset = load_dataset("json", data_files='datasets-issues.jsonl', split='train')
issues_dataset

Using custom data configuration default-30cdd786bf67e65e
Reusing dataset json (/root/.cache/huggingface/datasets/json/default-30cdd786bf67e65e/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426)


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app'],
    num_rows: 3439
})

The issues contain pull requests as well. The contents of issues and pull requests are quite different.

## Cleaning up the data

In [16]:
sample = issues_dataset.shuffle(seed=666).select(range(3))

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-30cdd786bf67e65e/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426/cache-09894e93e730765b.arrow


In [17]:
# Print out the URL and pull request entries
for url, pr in zip(sample['html_url'],  sample['pull_request']):
  print(f">> URL: {url}")
  print(f">> Pull request: {pr}")

>> URL: https://github.com/huggingface/datasets/issues/708
>> Pull request: None
>> URL: https://github.com/huggingface/datasets/pull/1245
>> Pull request: {'url': 'https://api.github.com/repos/huggingface/datasets/pulls/1245', 'html_url': 'https://github.com/huggingface/datasets/pull/1245', 'diff_url': 'https://github.com/huggingface/datasets/pull/1245.diff', 'patch_url': 'https://github.com/huggingface/datasets/pull/1245.patch', 'merged_at': None}
>> URL: https://github.com/huggingface/datasets/issues/2402
>> Pull request: None


In [18]:
#create a new is_pull_request column that checks whether the pull_request field is None or not
issues_dataset.map(lambda x: {"is_pull_request": False if x['pull_request'] is None else True})

  0%|          | 0/3439 [00:00<?, ?ex/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3439
})

### Try it out
✏️ Try it out! Calculate the average time it takes to close issues in 🤗 Datasets. You may find the Dataset.filter() function useful to filter out the pull requests and open issues, and you can use the Dataset.set_format() function to convert the dataset to a DataFrame so you can easily manipulate the created_at and closed_at timestamps. For bonus points, calculate the average time it takes to close pull requests.

### Tip
 It is generally a good practice to keep the dataset as “raw” as possible at this stage so that it can be easily used in multiple applications.

## Augmenting the dataset


 The comments associated with an issue or pull request provide a rich source of information, especially if we’re interested in building a search engine to answer user queries about the library.

In [19]:
# Using GitHub REST API - Comments endpoint
issue_number = 2792
url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
response = requests.get(url, headers=headers)
response.json()

[{'author_association': 'CONTRIBUTOR',
  'body': "@albertvillanova my tests are failing here:\r\n```\r\ndataset_name = 'gooaq'\r\n\r\n    def test_load_dataset(self, dataset_name):\r\n        configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)[:1]\r\n>       self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True, use_local_dummy_data=True)\r\n\r\ntests/test_dataset_common.py:234: \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\ntests/test_dataset_common.py:187: in check_load_dataset\r\n    self.parent.assertTrue(len(dataset[split]) > 0)\r\nE   AssertionError: False is not true\r\n```\r\nWhen I try loading dataset on local machine it works fine. Any suggestions on how can I avoid this error?",
  'created_at': '2021-08-12T12:21:52Z',
  'html_url': 'https://github.com/huggingface/datasets/pull/2792#issuecomment-897594128',
  'id': 897594128,
  'issue_url': 'https://api.github.com/repos/huggingface/datasets

the comment is stored in the body field

In [20]:
#https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments
def get_comments(issue_number):
  url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
  response = requests.get(url, headers=headers)
  return [r["body"] for r in response.json()]

# Test
get_comments(2792)

["@albertvillanova my tests are failing here:\r\n```\r\ndataset_name = 'gooaq'\r\n\r\n    def test_load_dataset(self, dataset_name):\r\n        configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)[:1]\r\n>       self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True, use_local_dummy_data=True)\r\n\r\ntests/test_dataset_common.py:234: \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\ntests/test_dataset_common.py:187: in check_load_dataset\r\n    self.parent.assertTrue(len(dataset[split]) > 0)\r\nE   AssertionError: False is not true\r\n```\r\nWhen I try loading dataset on local machine it works fine. Any suggestions on how can I avoid this error?",
 'Thanks for the help, @albertvillanova! All tests are passing now.']

In [21]:
issues_with_comments_dataset = issues_dataset.map(lambda x: {"comments": get_comments(x["number"])})

  0%|          | 0/3439 [00:00<?, ?ex/s]

In [22]:
issues_with_comments_dataset.to_json("issues-datasets-with-comments.jsonl")

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

14904039

## Uploading the dataset to the Hugging Face Hub


In [23]:
from huggingface_hub import list_datasets

all_datasets = list_datasets()
print(f"Number of datasets on the Hub: {len(all_datasets)}")
print(all_datasets[0])

Number of datasets on the Hub: 2069
Dataset Name: 0n1xus/codexglue, Tags: []


In [24]:
# create a new dataset reporitory on the Hub
from huggingface_hub import notebook_login

notebook_login()
# This will create a widget where you can enter your username and password, and an API token will be saved in ~/.huggingface/token.

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


In [25]:
# create a new dataset repository with the create_repo() function
from huggingface_hub import create_repo
repo_url = create_repo(name="github-issues", repo_type="dataset")
repo_url

'https://huggingface.co/datasets/msivanes/github-issues'

### Try it out

✏️ Try it out! Use your Hugging Face Hub username and password to obtain a token and create an empty repository called github-issues. Remember to never save your credentials in Colab or any other repository, as this information can be exploited by bad actors.

In [27]:
from huggingface_hub import Repository

repo = Repository(local_dir='github-issues', clone_from=repo_url)
!cp issues-datasets-with-comments.jsonl github-issues/

/content/github-issues is already a clone of https://huggingface.co/datasets/msivanes/github-issues. Make sure you pull the latest changes with `repo.git_pull()`.


In [28]:
repo.lfs_track('*.jsonl')

In [29]:
repo.push_to_hub()

Upload file issues-datasets-with-comments.jsonl:   0%|          | 3.38k/14.2M [00:00<?, ?B/s]

To https://huggingface.co/datasets/msivanes/github-issues
   8c38ecf..3fef78a  main -> main



'https://huggingface.co/datasets/msivanes/github-issues/commit/3fef78afa90aa24d25e937809452e0ab2ce81b9a'

In [30]:
remote_dataset = load_dataset('msivanes/github-issues', split='train')
remote_dataset

Using custom data configuration msivanes___github-issues-670628e082ea0eac


Downloading and preparing dataset json/msivanes___github-issues to /root/.cache/huggingface/datasets/json/msivanes___github-issues-670628e082ea0eac/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426...


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/14.9M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/msivanes___github-issues-670628e082ea0eac/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426. Subsequent calls will reuse this data.


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app'],
    num_rows: 3439
})

### Tip

💡 You can also upload a dataset to the Hugging Face Hub directly from the terminal by using huggingface-cli and a bit of Git magic. See the 🤗 Datasets guide for details on how to do this.

## Creating a dataset card


### Try it out

✏️ Try it out! Use the dataset-tagging application and 🤗 Datasets guide to complete the README.md file for your GitHub issues dataset.

### Try it out

✏️ Try it out! Go through the steps we took in this section to create a dataset of GitHub issues for your favorite open source library (pick something other than 🤗 Datasets, of course!). For bonus points, fine-tune a multilabel classifier to predict the tags present in the labels field.