In [2]:
from datasets import load_dataset
from utils import check_file_or_folder_existence

### TOC
1. [Load custom datasets](#Load-custom-datasets)
2. [Time to slice and dice](#Time-to-slice-and-dice)
3. [Creating new columns](#Creating-new-columns)
4. [The map() method's superpowers](#The-map()-method's-superpowers)
5. [From Datasets to DataFrames and back](#From-Datasets-to-DataFrames-and-back)
6. [Creating a validation set](#Creating-a-validation-set)
7. [Saving a dataset](#Saving-a-dataset)
8. [Big data? Datasets to the rescue!](#Big-data?-Datasets-to-the-rescue!)
9. [What is the Pile?](#What-is-the-Pile?)
10. [The magic of memory mapping](#The-magic-of-memory-mapping)
11. [Streaming datasets](#Streaming-datasets)
12. [Creating your own dataset](#Creating-your-own-dataset)
13. [Getting the data](#Getting-the-data)
14. [Cleaning up the data](#Cleaning-up-the-data)
15. [Augmenting the dataset](#Augmenting-the-dataset)
16. [Creating a dataset card](#Creating-a-dataset-card)

### Load custom datasets

#### Loading a local dataset

In [3]:
# Download each archive only if its decompressed file is missing
if not check_file_or_folder_existence("05_data/SQuAD_it-train.json"):
    !wget -P 05_data https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
if not check_file_or_folder_existence("05_data/SQuAD_it-test.json"):
    !wget -P 05_data https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz
# Decompress (keeping the .gz files) if either JSON file is still missing
if not (
    check_file_or_folder_existence("05_data/SQuAD_it-train.json")
    and check_file_or_folder_existence("05_data/SQuAD_it-test.json")
):
    !gzip -dkv 05_data/SQuAD_it-*.json.gz

In [4]:
squad_dataset = load_dataset('json', data_files='05_data/SQuAD_it-train.json', field='data')

In [5]:
squad_dataset

DatasetDict({
    train: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 442
    })
})

In [6]:
squad_dataset['train'][0]

{'paragraphs': [{'context': "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.",
   'qas': [{'answers': [{'answer_start': 29, 'text': '2008'}],
     'id': '56cdca7862d2951400fa6826',
     'question': 'In quale anno si è verificato il terremoto nel Sichuan?'},
    {'answers': [{'answer_start': 232, 'text': '69.197'}],
     'id': '56cdca7862d2951400fa6828',
     'question': 'Quante persone sono state uccise come risultato?'},
    {'answers': [{'answer_start': 29, 'text': '2008'}],
     'id': '56d4f9902ccc5a1400d833c0',
     'question': 'Quale anno ha avuto luogo il terremoto del Sichuan?'},
    {'answers': [{'answer_start': 78, 'text': '8.0 Ms e 7.9 Mw'}],
     'id': '56d4f9902ccc5a1400d833c1',
     'question': 'Che cosa ha fatto la misura di sisma?'},
    {'answers'

In [7]:
data_files = {"train": "05_data/SQuAD_it-train.json", "test": "05_data/SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 442
    })
    test: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 48
    })
})

We can also skip the manual decompression step: the load_dataset() function from the datasets library reads the gzip-compressed files directly.

In [8]:
data_files = {"train": "05_data/SQuAD_it-train.json.gz", "test": "05_data/SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 442
    })
    test: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 48
    })
})
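To make the two ideas above concrete (reading a compressed file without writing a decompressed copy, and selecting the records under a top-level key as `field="data"` does), here is a minimal standard-library sketch. It is not how load_dataset() is implemented; the toy payload and file name are assumptions made up for illustration.

```python
import gzip
import json
import os
import tempfile

# Hypothetical miniature file with the same top-level layout as SQuAD-it:
# the records of interest sit under a "data" key.
payload = {
    "version": "toy",
    "data": [
        {"title": "a", "paragraphs": []},
        {"title": "b", "paragraphs": []},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read the compressed file directly; no decompressed copy is written.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        records = json.load(f)["data"]  # rough analogue of field="data"

titles = [record["title"] for record in records]
print(len(records), titles)
```

Each element of `records` plays the role of one row, which is why the real dataset above ends up with `title` and `paragraphs` columns.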

#### Loading a remote dataset

In [9]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

In [10]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 442
    })
    test: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 48
    })
})

### Time to slice and dice

In [11]:
if not check_file_or_folder_existence("05_data/drug+review+dataset+druglib+com.zip"):
    !wget -P 05_data "http://archive.ics.uci.edu/static/public/461/drug+review+dataset+druglib+com.zip"
    !unzip -d 05_data 05_data/drug+review+dataset+druglib+com

In [12]:
if not check_file_or_folder_existence("05_data/drug+review+dataset+drugs+com.zip"):
    !wget -P 05_data "http://archive.ics.uci.edu/static/public/462/drug+review+dataset+drugs+com.zip"
    !unzip -d 05_data 05_data/drug+review+dataset+drugs+com

In [13]:
data_files = {"train": "05_data/drugsComTrain_raw.tsv", "test": "05_data/drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

In [14]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [15]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

Note that we’ve fixed the seed in Dataset.shuffle() for reproducibility purposes. Dataset.select() expects an iterable of indices, so we’ve passed range(1000) to grab the first 1,000 examples from the shuffled dataset. From this sample we can already see a few quirks in our dataset:

* The Unnamed: 0 column looks suspiciously like an anonymized ID for each patient.
* The condition column includes a mix of uppercase and lowercase labels.
* The reviews are of varying length and contain a mix of Python line separators (\r\n) as well as HTML character codes like &#039;.

Let’s see how we can use 🤗 Datasets to deal with each of these issues. To test the patient ID hypothesis for the Unnamed: 0 column, we can use the Dataset.unique() function to verify that the number of IDs matches the number of rows in each split:

In [16]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))
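The assertion above boils down to comparing the number of rows with the number of distinct values. A plain-Python sketch of the same check, using a hypothetical helper name and a few IDs taken from the sample output:

```python
def is_unique_key(values):
    """Return True if no value appears more than once."""
    return len(values) == len(set(values))


ids = [87571, 178045, 80482]
ids_with_duplicate = ids + [87571]

print(is_unique_key(ids))                 # every row has its own ID
print(is_unique_key(ids_with_duplicate))  # a repeated ID breaks the check
```

Note that passing this check only shows the column is unique per row; it is consistent with the ID hypothesis rather than proof of it.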

This seems to confirm our hypothesis, so let’s clean up the dataset a bit by renaming the Unnamed: 0 column to something a bit more interpretable. We can use the DatasetDict.rename_column() function to rename the column across both splits in one go:

In [17]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [18]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [19]:
split = "train"
print(len(drug_dataset[split].unique("drugName")))
print(len(drug_dataset[split].unique("condition")))

3436
885


Next, let’s normalize all the condition labels using Dataset.map(). As we did with tokenization in Chapter 3, we can define a simple function that can be applied across all the rows of each split in drug_dataset:

In [20]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

In [21]:
# drug_dataset.map(lowercase_condition)

Oh no, we’ve run into a problem with our map function! From the error we can infer that some of the entries in the condition column are None, which cannot be lowercased as they’re not strings. Let’s drop these rows using Dataset.filter(), which works in a similar way to Dataset.map() and expects a function that receives a single example of the dataset. Instead of writing an explicit function like:

In [22]:
def filter_nones(x):
    return x["condition"] is not None

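and then running drug_dataset.filter(filter_nones), we can do this in one line using a lambda function. In Python, lambda functions are small functions that you can define without explicitly naming them. They take the general form lambda <arguments>: <expression>, where the expression is evaluated and returned whenever the function is called. Applied to our filter, this gives: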

In [23]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
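As a quick sanity check that the named function and the lambda are interchangeable, here is a small sketch on a hand-written list of rows (the rows are made up for illustration, and Python's built-in filter() stands in for Dataset.filter()):

```python
rows = [
    {"condition": "ADHD"},
    {"condition": None},
    {"condition": "Acne"},
]


def filter_nones(x):
    return x["condition"] is not None


# Same predicate, written as lambda <arguments>: <expression>
kept_named = [row for row in rows if filter_nones(row)]
kept_lambda = list(filter(lambda x: x["condition"] is not None, rows))

print(kept_lambda)
```

Both versions keep the two labelled rows and drop the one whose condition is None.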

With the None entries removed, we can normalize our condition column:

In [24]:
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

['left ventricular dysfunction', 'adhd', 'birth control']

It works! Now that we’ve cleaned up the labels, let’s take a look at cleaning up the reviews themselves.

#### Creating new columns

Whenever you’re dealing with customer reviews, a good practice is to check the number of words in each review. A review might be just a single word like “Great!” or a full-blown essay with thousands of words, and depending on the use case you’ll need to handle these extremes differently. To compute the number of words in each review, we’ll use a rough heuristic based on splitting each text by whitespace.

Let’s define a simple function that counts the number of words in each review:

In [25]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

Unlike our lowercase_condition() function, compute_review_length() returns a dictionary whose key does not correspond to one of the column names in the dataset. In this case, when compute_review_length() is passed to Dataset.map(), it will be applied to all the rows in the dataset to create a new review_length column:

In [26]:
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}
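The difference between updating a column and adding one comes down to whether the returned key already exists. The sketch below mimics that behaviour with a hypothetical `apply_map` helper over a list of dictionaries; it is a simplification for illustration, not the library's implementation.

```python
def apply_map(rows, fn):
    """Merge fn's output into each row: existing keys are overwritten,
    unknown keys become new columns."""
    return [{**row, **fn(row)} for row in rows]


def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


def compute_review_length(example):
    return {"review_length": len(example["review"].split())}


rows = [
    {
        "condition": "Left Ventricular Dysfunction",
        "review": '"It has no side effect, I take it in combination of '
        'Bystolic 5 Mg and Fish Oil"',
    }
]

rows = apply_map(rows, lowercase_condition)    # overwrites "condition"
rows = apply_map(rows, compute_review_length)  # adds "review_length"
print(list(rows[0].keys()), rows[0]["review_length"])
```

Splitting on whitespace counts `effect,` and `Oil"` as single words, which is why this is only a rough heuristic.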

As expected, we can see a review_length column has been added to our training set. We can sort this new column with Dataset.sort() to see what the extreme values look like:

In [27]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

🙋 An alternative way to add new columns to a dataset is with the Dataset.add_column() function. This allows you to provide the column as a Python list or NumPy array and can be handy in situations where Dataset.map() is not well suited for your analysis.

Let’s use the Dataset.filter() function to remove reviews that contain 30 words or fewer. Similarly to what we did with the condition column, we can filter out the very short reviews by requiring that the reviews have a length above this threshold:

In [28]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

{'train': 138514, 'test': 46108}
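One detail worth checking is what happens exactly at the threshold. The sketch below applies the same predicate to three made-up lengths; because the comparison is strict, a review of exactly 30 words is dropped as well.

```python
def keep_long_reviews(example):
    # Same predicate as the lambda above
    return example["review_length"] > 30


examples = [{"review_length": n} for n in (29, 30, 31)]
kept = [ex["review_length"] for ex in examples if keep_long_reviews(ex)]
print(kept)
```

If reviews of exactly 30 words should be kept, `>=` would be the comparison to use instead.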


In [29]:
# drug_dataset["train"].sort("review_length", reverse=True)[:3]

The last thing we need to deal with is the presence of HTML character codes in our reviews. We can use Python’s html module to unescape these characters, like so:

In [30]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

We’ll use Dataset.map() to unescape all the HTML characters in our corpus:

In [31]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

As you can see, the Dataset.map() method is quite useful for processing data — and we haven’t even scratched the surface of everything it can do!

#### The map() method's superpowers

The Dataset.map() method takes a batched argument that, if set to True, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000). For instance, the previous map function that unescaped all the HTML took a bit of time to run (you can read the time taken from the progress bars). We can speed this up by processing several elements at the same time using a list comprehension.

When you specify batched=True the function receives a dictionary with the fields of the dataset, but each value is now a list of values, and not just a single value. The return value of Dataset.map() should be the same: a dictionary with the fields we want to update or add to our dataset, and a list of values. For example, here is another way to unescape all HTML characters, but using batched=True:

In [32]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

If you’re running this code in a notebook, you’ll see that this command executes way faster than the previous one. And it’s not because our reviews have already been HTML-unescaped — if you re-execute the instruction from the previous section (without batched=True), it will take the same amount of time as before. This is because list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.
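To see the batched contract in isolation (a dictionary of lists goes in, a dictionary of lists comes out), here is a toy `batched_map` written in plain Python. The helper name, its slicing strategy, and the sample reviews are assumptions for illustration only; the real Dataset.map() does considerably more (caching, Arrow storage, and so on).

```python
import html


def batched_map(columns, fn, batch_size=1000):
    """Apply fn to slices of a non-empty column dictionary, batch by batch."""
    n_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict whose values are lists, as with batched=True
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        merged = {**batch, **fn(batch)}
        for key, values in merged.items():
            out.setdefault(key, []).extend(values)
    return out


columns = {"review": ["I&#039;m fine", "a &amp; b", "plain"]}
cleaned = batched_map(
    columns,
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batch_size=2,
)
print(cleaned["review"])
```

With `batch_size=2` the function is called twice (two rows, then one), yet the output column is stitched back together in the original order.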

Using Dataset.map() with batched=True will be essential to unlock the speed of the “fast” tokenizers that we’ll encounter in Chapter 6, which can quickly tokenize big lists of texts. For instance, to tokenize all the drug reviews with a fast tokenizer, we could use a function like this:

In [33]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

As you saw in Chapter 3, we can pass one or several examples to the tokenizer, so we can use this function with or without batched=True. Let’s take this opportunity to compare the performance of the different options. In a notebook, you can time a one-line instruction by adding %time before the line of code you wish to measure:

In [34]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

CPU times: user 17.9 ms, sys: 38.2 ms, total: 56.1 ms
Wall time: 55.3 ms


You can also time a whole cell by putting %%time at the beginning of the cell. On the hardware we executed this on, it showed 10.8s for this instruction (it’s the number written after “Wall time”).

In [35]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=False)

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 26.3 s, sys: 356 ms, total: 26.7 s
Wall time: 26.6 s


This means that using a fast tokenizer with the batched=True option is 30 times faster than its slow counterpart with no batching — this is truly amazing! That’s the main reason why fast tokenizers are the default when using AutoTokenizer (and why they are called “fast”). They’re able to achieve such a speedup because behind the scenes the tokenization code is executed in Rust, which is a language that makes it easy to parallelize code execution.

Parallelization is also the reason for the nearly 6x speedup the fast tokenizer achieves with batching: you can’t parallelize a single tokenization operation, but when you want to tokenize lots of texts at the same time you can just split the execution across several processes, each responsible for its own texts.

Dataset.map() also has some parallelization capabilities of its own. Since they are not backed by Rust, they won’t let a slow tokenizer catch up with a fast one, but they can still be helpful (especially if you’re using a tokenizer that doesn’t have a fast version). To enable multiprocessing, use the num_proc argument and specify the number of processes to use in your call to Dataset.map():

In [36]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

In [37]:
def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)

In [38]:
import multiprocessing
multiprocessing.cpu_count()

8

In [39]:
tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=6)

You can experiment a little with timing to determine the optimal number of processes to use; in our case 8 seemed to produce the best speed gain. Here are the numbers we got with and without multiprocessing:

Those are much more reasonable results for the slow tokenizer, but the performance of the fast tokenizer was also substantially improved. Note, however, that won’t always be the case — for values of num_proc other than 8, our tests showed that it was faster to use batched=True without that option. In general, we don’t recommend using Python multiprocessing for fast tokenizers with batched=True.

Using num_proc to speed up your processing is usually a great idea, as long as the function you are using is not already doing some kind of multiprocessing of its own.

All of this functionality condensed into a single method is already pretty amazing, but there’s more! With Dataset.map() and batched=True you can change the number of elements in your dataset. This is super useful in many situations where you want to create several training features from one example, and we will need to do this as part of the preprocessing for several of the NLP tasks we’ll undertake in Chapter 7.

💡 In machine learning, an example is usually defined as the set of features that we feed to the model. In some contexts, these features will be the set of columns in a Dataset, but in others (like here and for question answering), multiple features can be extracted from a single example and belong to a single column.

Let’s have a look at how it works! Here we will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return all the chunks of the texts instead of just the first one. This can be done with return_overflowing_tokens=True:

In [40]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

Let’s test this on one example before using Dataset.map() on the whole dataset:

In [41]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

In [42]:
result.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'])

In [43]:
result['input_ids']

[[101, 107, 1422, 1488, 1110, 9079, 1194, 1117, 2223, 1989, 1104, 1130, 19972, 11083, 119, 1284,
  1245, 4264, 1165, 1119, 1310, 1142, 1314, 1989, 117, 1165, 1119, 1408, 1781, 1103, 2439, 13753,
  1119, 1209, 1129, 1113, 119, 1370, 1160, 1552, 117, 1119, 1180, 6374, 1243, 1149, 1104, 1908,
  117, 1108, 1304, 172, 14687, 1183, 117, 1105, 7362, 1111, 2212, 129, 2005, 1113, 170, 2797,
  1313, 1121, 1278, 12020, 113, 1304, 5283, 1111, 1140, 119, 114, 146, 1270, 1117, 3995, 1113,
  6356, 2106, 1105, 1131, 1163, 1106, 6166, 1122, 1149, 170, 1374, 1552, 119, 3969, 1293, 1119,
  1225, 1120, 1278, 117, 1105, 1114, 2033, 1146, 1107, 1103, 2106, 119, 1109, 1314, 1160, 1552,
  1138, 1151, 2463, 1714, 119, 1124, 1110, 150, 21986, 3048, 1167, 5340, 1895, 1190, 1518,

So, our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first one of length 128 and the second one of length 49. Now let’s do this for all elements of the dataset!

In [44]:
# tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

Oh no! That didn’t work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you’ve looked at the Dataset.map() documentation, you may recall that it’s the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.

The problem is that we’re trying to mix two different datasets of different sizes: the drug_dataset columns will have a certain number of examples (the 1,000 in our error), but the tokenized_dataset we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using return_overflowing_tokens=True). That doesn’t work for a Dataset, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the remove_columns argument:

In [45]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

Now this works without error. We can check that our new dataset has many more elements than the original dataset by comparing the lengths:

In [46]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

In [47]:
tokenized_dataset.shape

{'train': (206772, 4), 'test': (68876, 4)}

In [48]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 68876
    })
})

We mentioned that we can also deal with the mismatched length problem by making the old columns the same size as the new ones. To do this, we will need the overflow_to_sample_mapping field the tokenizer returns when we set return_overflowing_tokens=True. It gives us a mapping from a new feature index to the index of the sample it originated from. Using this, we can associate each key present in our original dataset with a list of values of the right size by repeating the values of each example as many times as it generates new features:

In [49]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
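The repetition logic is easier to follow without a real tokenizer in the way. The sketch below replaces the tokenizer with a hypothetical word-level chunker (`chunk_words`) that returns an `overflow_to_sample_mapping` list in the same spirit; it ignores special tokens and subword splitting, so it only illustrates the bookkeeping.

```python
def chunk_words(texts, max_length):
    """Stand-in for a tokenizer with return_overflowing_tokens=True:
    split each text into word chunks of at most max_length."""
    chunks, sample_map = [], []
    for i, text in enumerate(texts):
        words = text.split()
        for start in range(0, len(words), max_length):
            chunks.append(words[start:start + max_length])
            sample_map.append(i)  # chunk came from sample i
    return {"chunks": chunks, "overflow_to_sample_mapping": sample_map}


def chunk_and_repeat(examples, max_length=3):
    result = chunk_words(examples["review"], max_length)
    sample_map = result.pop("overflow_to_sample_mapping")
    # Repeat every original value once per chunk its sample produced
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result


examples = {"review": ["a b c d e", "f g"], "rating": [9.0, 4.0]}
out = chunk_and_repeat(examples)
print(out["chunks"])
print(out["rating"])
```

Two input rows become three output rows, and every column in the result has the same length, which is exactly the property Dataset.map() requires.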

In [50]:
result = tokenize_and_split(drug_dataset["train"][0:3])

In [51]:
result

{'input_ids': [[101, 107, 1422, 1488, 1110, 9079, 1194, 1117, 2223, 1989, 1104, 1130, 19972, 11083, 119, 1284, 1245, 4264, 1165, 1119, 1310, 1142, 1314, 1989, 117, 1165, 1119, 1408, 1781, 1103, 2439, 13753, 1119, 1209, 1129, 1113, 119, 1370, 1160, 1552, 117, 1119, 1180, 6374, 1243, 1149, 1104, 1908, 117, 1108, 1304, 172, 14687, 1183, 117, 1105, 7362, 1111, 2212, 129, 2005, 1113, 170, 2797, 1313, 1121, 1278, 12020, 113, 1304, 5283, 1111, 1140, 119, 114, 146, 1270, 1117, 3995, 1113, 6356, 2106, 1105, 1131, 1163, 1106, 6166, 1122, 1149, 170, 1374, 1552, 119, 3969, 1293, 1119, 1225, 1120, 1278, 117, 1105, 1114, 2033, 1146, 1107, 1103, 2106, 119, 1109, 1314, 1160, 1552, 1138, 1151, 2463, 1714, 119, 1124, 1110, 150, 21986, 3048, 1167, 5340, 1895, 1190, 1518, 102], [101, 119, 1124, 1110, 1750, 6438, 113, 170, 1363, 1645, 114, 117, 1750, 172, 14687, 1183, 119, 1124, 1110, 11566, 1155, 1103, 1614, 1119, 1431, 119, 8007, 1117, 4658, 1110, 1618, 119, 1284, 1138, 1793, 1242, 1472, 23897, 1105, 117

We can see it works with Dataset.map() without us needing to remove the old columns:

In [52]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

We get the same number of training features as before, but here we’ve kept all the old fields. If you need them for some post-processing after applying your model, you might want to use this approach.

You’ve now seen how 🤗 Datasets can be used to preprocess a dataset in various ways. Although the processing functions of 🤗 Datasets will cover most of your model training needs, there may be times when you’ll need to switch to Pandas to access more powerful features, like DataFrame.groupby() or high-level APIs for visualization. Fortunately, 🤗 Datasets is designed to be interoperable with libraries such as Pandas, NumPy, PyTorch, TensorFlow, and JAX. Let’s take a look at how this works.

#### From Datasets to DataFrames and back

To enable the conversion between various third-party libraries, 🤗 Datasets provides a Dataset.set_format() function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let’s convert our dataset to Pandas:

In [53]:
drug_dataset.set_format("pandas")

Now when we access *elements of the dataset* we get a pandas.DataFrame instead of a dictionary:

In [54]:
drug_dataset["train"][:3]

|   | patient_id | drugName | condition | review | rating | date | usefulCount | review_length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 |
| 1 | 92703 | Lybrel | birth control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 | 134 |
| 2 | 138000 | Ortho Evra | birth control | "This is my first time using any form of birth... | 8.0 | November 3, 2015 | 10 | 89 |


Let’s create a pandas.DataFrame for the whole training set by selecting all the elements of drug_dataset["train"]:

In [55]:
train_df = drug_dataset["train"][:]

In [56]:
train_df.head(2)

|   | patient_id | drugName | condition | review | rating | date | usefulCount | review_length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 |
| 1 | 92703 | Lybrel | birth control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 | 134 |


🚨 Under the hood, Dataset.set_format() changes the return format for the dataset’s __getitem__() dunder method. This means that when we want to create a new object like train_df from a Dataset in the "pandas" format, we need to slice the whole dataset to obtain a pandas.DataFrame. You can verify for yourself that the type of drug_dataset["train"] is Dataset, irrespective of the output format.
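The point about `__getitem__()` can be illustrated with a toy class. Everything here is hypothetical (the class, the "columns" and "rows" format names, slice-only indexing) and is not how 🤗 Datasets is implemented; it only shows that switching the output format changes what indexing returns while the object and its stored data stay the same.

```python
class ToyDataset:
    """Toy stand-in for a dataset with a switchable output format."""

    def __init__(self, columns):
        self._columns = columns   # underlying storage never changes
        self._format = "columns"  # hypothetical format names

    def set_format(self, fmt):
        if fmt not in ("columns", "rows"):
            raise ValueError(f"unknown format: {fmt}")
        self._format = fmt

    def reset_format(self):
        self._format = "columns"

    def __getitem__(self, key):
        # Only slices are supported in this sketch
        sliced = {name: values[key] for name, values in self._columns.items()}
        if self._format == "columns":
            return sliced
        return [dict(zip(sliced, row)) for row in zip(*sliced.values())]


ds = ToyDataset({"a": [1, 2, 3], "b": ["x", "y", "z"]})
as_columns = ds[:2]
ds.set_format("rows")
as_rows = ds[:2]
print(type(ds).__name__, as_columns, as_rows)
```

Just as with train_df above, you only get the converted object by indexing or slicing; `ds` itself remains a `ToyDataset` whatever format is set.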

From here we can use all the Pandas functionality that we want. For example, we can do fancy chaining to compute the class distribution among the condition entries:

In [57]:
# Name the index and the counts explicitly so that the column labels
# do not depend on the installed pandas version
frequencies = (
    train_df["condition"]
    .value_counts()
    .rename_axis("condition")
    .reset_index(name="frequency")
)
frequencies.head(20)

|   | condition | frequency |
| --- | --- | --- |
| 0 | birth control | 27655 |
| 1 | depression | 8023 |
| 2 | acne | 5209 |
| 3 | anxiety | 4991 |
| 4 | pain | 4744 |
| 5 | bipolar disorde | 3588 |
| 6 | weight loss | 3333 |
| 7 | obesity | 3258 |
| 8 | insomnia | 2978 |
| 9 | adhd | 2971 |


And once we’re done with our Pandas analysis, we can always create a new Dataset object by using the Dataset.from_pandas() function as follows:

In [58]:
from datasets import Dataset
freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

✏️ Try it out! Compute the average rating per drug and store the result in a new Dataset.

In [59]:
average_rating_df = (
    train_df
    .groupby("drugName")["rating"]
    .agg(["mean", "count"])
    .sort_values(by=["count", "mean"], ascending=False)
    .reset_index()
    .rename(columns={"mean": "average_rating", "count": "number_of_ratings"})
    )
average_rating = Dataset.from_pandas(average_rating_df)
average_rating


Dataset({
    features: ['drugName', 'average_rating', 'number_of_ratings'],
    num_rows: 3052
})

This wraps up our tour of the various preprocessing techniques available in 🤗 Datasets. To round out the section, let’s create a validation set to prepare the dataset for training a classifier on. Before doing so, we’ll reset the output format of drug_dataset from "pandas" to "arrow":

In [60]:
drug_dataset.reset_format()

#### Creating a validation set

Although we have a test set we could use for evaluation, it’s a good practice to leave the test set untouched and create a separate validation set during development. Once you are happy with the performance of your models on the validation set, you can do a final sanity check on the test set. This process helps mitigate the risk that you’ll overfit to the test set and deploy a model that fails on real-world data.

🤗 Datasets provides a Dataset.train_test_split() function that is based on the famous functionality from scikit-learn. Let’s use it to split our training set into train and validation splits (we set the seed argument for reproducibility):

In [61]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})
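
To make the split concrete, here is a minimal pure-Python sketch of what a seeded 80/20 split does to row indices. It illustrates the idea only and is not how 🤗 Datasets implements train_test_split(); the helper name split_indices is hypothetical, and the rows and rounding that the library picks may differ.

```python
import random


def split_indices(num_rows, train_size=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them into two disjoint parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # the same seed always gives the same order
    cut = int(num_rows * train_size)  # number of rows kept for training
    return indices[:cut], indices[cut:]


train_idx, valid_idx = split_indices(num_rows=1_000)
print(len(train_idx), len(valid_idx))
```

Because the shuffle is seeded, calling the helper again with the same arguments returns the same two index lists, which is the reproducibility that the seed argument is there to provide.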

Great, we’ve now prepared a dataset that’s ready for training some models on! In section 5 we’ll show you how to upload datasets to the Hugging Face Hub, but for now let’s cap off our analysis by looking at a few ways you can save datasets on your local machine.



#### Saving a dataset

Although 🤗 Datasets will cache every downloaded dataset and the operations performed on it, there are times when you’ll want to save a dataset to disk (e.g., in case the cache gets deleted). As shown in the table below, 🤗 Datasets provides three main functions to save your dataset in different formats:

| Data format | Function               |
| ----------- | ---------------------- |
| Arrow       | Dataset.save_to_disk() |
| CSV         | Dataset.to_csv()       |
| JSON        | Dataset.to_json()      |

For example, let’s save our cleaned dataset in the Arrow format:

In [62]:
drug_dataset_clean.save_to_disk("05_data/drug-reviews")

Saving the dataset (0/1 shards):   0%|          | 0/110811 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/27703 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/46108 [00:00<?, ? examples/s]

This will create a directory with the following structure:

In [63]:
# drug-reviews/
# ├── dataset_dict.json
# ├── test
# │   ├── dataset.arrow
# │   ├── dataset_info.json
# │   └── state.json
# ├── train
# │   ├── dataset.arrow
# │   ├── dataset_info.json
# │   ├── indices.arrow
# │   └── state.json
# └── validation
#     ├── dataset.arrow
#     ├── dataset_info.json
#     ├── indices.arrow
#     └── state.json

where we can see that each split is associated with its own dataset.arrow table, and some metadata in dataset_info.json and state.json. You can think of the Arrow format as a fancy table of columns and rows that is optimized for building high-performance applications that process and transport large datasets.

Once the dataset is saved, we can load it by using the load_from_disk() function as follows:

In [64]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("05_data/drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

For the CSV and JSON formats, we have to store each split as a separate file. One way to do this is by iterating over the keys and values in the DatasetDict object:

In [65]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"05_data/drug-reviews-{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/111 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/28 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/47 [00:00<?, ?ba/s]

This saves each split in JSON Lines format, where each row in the dataset is stored as a single line of JSON. Here’s what the first example looks like:

In [66]:
!head -n 1 05_data/drug-reviews-train.jsonl

{"patient_id":89879,"drugName":"Cyclosporine","condition":"keratoconjunctivitis sicca","review":"\"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on.\"","rating":2.0,"date":"April 20, 2013","usefulCount":69,"review_length":147}
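
The JSON Lines layout is simple enough to reproduce with the standard library. The sketch below writes a couple of rows, one JSON object per line, and reads them back; the rows are hypothetical, and only the field names are borrowed from the drug review dataset.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows; the field names mirror a few columns of the drug review dataset
rows = [
    {"patient_id": 1, "drugName": "A", "rating": 2.0},
    {"patient_id": 2, "drugName": "B", "rating": 9.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "reviews.jsonl"

    # Write one JSON object per line
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Read the file back line by line; each line parses independently of the others
    with path.open(encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f]

print(reloaded == rows)
```

Since every line is a complete JSON document, a reader can process the file one row at a time instead of parsing it all at once.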


We can then use the techniques from section 2 to load the JSON files as follows:

In [67]:
prefix = "05_data"
data_files = {
    "train": f"{prefix}/drug-reviews-train.jsonl",
    "validation": f"{prefix}/drug-reviews-validation.jsonl",
    "test": f"{prefix}/drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)
drug_dataset_reloaded

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

And that’s it for our excursion into data wrangling with 🤗 Datasets! Now that we have a cleaned dataset for training a model on, here are a few ideas that you could try out:

1. Use the techniques from Chapter 3 to train a classifier that can predict the patient condition based on the drug review.
2. Use the summarization pipeline from Chapter 1 to generate summaries of the reviews.

Next, we’ll take a look at how 🤗 Datasets can enable you to work with huge datasets without blowing up your laptop!

### Big data? Datasets to the rescue!

Nowadays it is not uncommon to find yourself working with multi-gigabyte datasets, especially if you’re planning to pretrain a transformer like BERT or GPT-2 from scratch. In these cases, even loading the data can be a challenge. For example, the WebText corpus used to pretrain GPT-2 consists of over 8 million documents and 40 GB of text — loading this into your laptop’s RAM is likely to give it a heart attack!

Fortunately, 🤗 Datasets has been designed to overcome these limitations. It frees you from memory management problems by treating datasets as memory-mapped files, and from hard drive limits by streaming the entries in a corpus.

In this section we’ll explore these features of 🤗 Datasets with a huge 825 GB corpus known as the Pile. Let’s get started!

#### What is the Pile?

We can load the dataset using the method for remote files that we learned in section 2:

In [68]:
from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
data_files = "https://huggingface.co/datasets/mdroth/PubMed-200k-RTC/resolve/main/data/PubMed-200k-RTC_train.jsonl.zst"
# data_files = "https://huggingface.co/datasets/mdroth/PubMed-200k-RTC/resolve/main/data/PubMed-200k-RTC_train_min.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

Dataset({
    features: ['abstract_text', 'target'],
    num_rows: 2211861
})

We can see that there are 2,211,861 rows and 2 columns in our dataset. That’s a lot!

✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass DownloadConfig(delete_extracted=True) to the download_config argument of load_dataset(). See the documentation for more details.

Let’s inspect the contents of the first example:

In [69]:
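pubmed_dataset[0]

{'abstract_text': 'The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .',
 'target': 'BACKGROUND'}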

#### The magic of memory mapping

`psutil` provides a Process class that allows us to check the memory usage of the current process as follows:

In [70]:
import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 670.03 MB


Here the rss attribute refers to the resident set size, which is the fraction of memory that a process occupies in RAM. This measurement also includes the memory used by the Python interpreter and the libraries we’ve loaded, so the actual amount of memory used to load the dataset is a bit smaller. For comparison, let’s see how large the dataset is on disk, using the dataset_size attribute. Since the result is expressed in bytes like before, we need to manually convert it to gigabytes:

In [71]:
print(f"Dataset size in bytes : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.6f} GB")

Dataset size in bytes : 368440643
Dataset size (cache file) : 0.343137 GB


✏️ Try it out! Pick one of the subsets from the Pile that is larger than your laptop or desktop’s RAM, load it with 🤗 Datasets, and measure the amount of RAM used. Note that to get an accurate measurement, you’ll want to do this in a new process. You can find the decompressed sizes of each subset in Table 1 of the Pile paper.

If you’re familiar with Pandas, this result might come as a surprise because of Wes McKinney’s famous rule of thumb that you typically need 5 to 10 times as much RAM as the size of your dataset. So how does 🤗 Datasets solve this memory management problem? 🤗 Datasets treats each dataset as a memory-mapped file, which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory.
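
Memory mapping itself can be tried out with Python’s built-in mmap module. The following is a minimal sketch of the general mechanism, not of how 🤗 Datasets or Apache Arrow lay out their files; the fixed-width record format and the names record_size and num_records are arbitrary choices for the illustration.

```python
import mmap
import tempfile

record_size = 16  # bytes per fixed-width record (an arbitrary choice for this sketch)
num_records = 100_000

with tempfile.TemporaryFile() as f:
    # Write fixed-width records to disk: b"0000000000000000", b"0000000000000001", ...
    for i in range(num_records):
        f.write(f"{i:0{record_size}d}".encode("ascii"))
    f.flush()

    # Map the file into the process's address space; pages are read from disk on demand
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mapped:
        start = 54_321 * record_size
        record = mapped[start : start + record_size]  # touches only this slice
        total_bytes = len(mapped)

print(int(record), total_bytes)
```

Slicing the mapped object looks like indexing a bytes object, but the operating system only pages in the parts of the file that are actually touched.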

Memory-mapped files can also be shared across multiple processes, which enables methods like Dataset.map() to be parallelized without needing to move or copy the dataset. Under the hood, these capabilities are all realized by the Apache Arrow memory format and pyarrow library, which make the data loading and processing lightning fast. (For more details about Apache Arrow and comparisons to Pandas, check out Dejan Simic’s blog post.) To see this in action, let’s run a little speed test by iterating over all the elements in the PubMed Abstracts dataset:

In [72]:
import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{time:.1f}s, i.e. {size_gb/time:.3f} GB/s"
)

Iterated over 2211861 examples (about 0.3 GB) in 6.8s, i.e. 0.051 GB/s


Here we’ve used Python’s timeit module to measure the execution time taken by code_snippet. You’ll typically be able to iterate over a dataset at a speed of a few tenths of a GB/s to several GB/s. This works great for the vast majority of applications, but sometimes you’ll have to work with a dataset that is too large to even store on your laptop’s hard drive. For example, if we tried to download the Pile in its entirety, we’d need 825 GB of free disk space! To handle these cases, 🤗 Datasets provides a streaming feature that allows us to download and access elements on the fly, without needing to download the whole dataset. Let’s take a look at how this works.

💡 In Jupyter notebooks you can also time cells using the %%timeit magic function.

#### Streaming datasets

To enable dataset streaming you just need to pass the streaming=True argument to the load_dataset() function. For example, let’s load the PubMed Abstracts dataset again, but in streaming mode:

In [73]:
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

Instead of the familiar Dataset that we’ve encountered elsewhere in this chapter, the object returned with streaming=True is an IterableDataset. As the name suggests, to access the elements of an IterableDataset we need to iterate over it. We can access the first element of our streamed dataset as follows:

In [74]:
next(iter(pubmed_dataset_streamed))

{'abstract_text': 'The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .',
 'target': 'BACKGROUND'}

The elements from a streamed dataset can be processed on the fly using IterableDataset.map(), which is useful during training if you need to tokenize the inputs. The process is exactly the same as the one we used to tokenize our dataset in Chapter 3, with the only difference being that outputs are returned one by one:

In [75]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["abstract_text"]))
next(iter(tokenized_dataset))

{'abstract_text': 'The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .',
 'target': 'BACKGROUND',
 'input_ids': [101,
  1996,
  14053,
  1997,
  9820,
  2004,
  1037,
  11888,
  4650,
  2965,
  2008,
  2111,
  2542,
  2007,
  9820,
  2024,
  3223,
  2000,
  2202,
  2062,
  5368,
  2005,
  1996,
  2969,
  1011,
  2968,
  1997,
  2037,
  4650,
  1010,
  2164,
  2437,
  3558,
  1010,
  6832,
  1998,
  2591,
  24081,
  1012,
  102],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}

💡 To speed up tokenization with streaming you can pass batched=True, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the batch_size argument.
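
To see what processing a stream "batch by batch" amounts to, here is a small standard-library sketch that groups any iterable into lists of up to batch_size items. The helper iter_batches is hypothetical and only mimics the idea; it is not the code path that IterableDataset.map() uses.

```python
from itertools import islice


def iter_batches(examples, batch_size=1000):
    """Yield lists of up to `batch_size` examples from any iterable."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator stands in for a streamed dataset: it has no length and no random access
stream = ({"text": f"example {i}"} for i in range(2_500))
batch_sizes = [len(batch) for batch in iter_batches(stream, batch_size=1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

The last batch is smaller than the others because the stream runs out, so a function applied to batches should not assume a fixed batch length.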

You can also shuffle a streamed dataset using IterableDataset.shuffle(), but unlike Dataset.shuffle() this only shuffles the elements in a predefined buffer_size:

In [76]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

{'abstract_text': 'Safety and reactogenicity of a new heptavalent DTPw-HBV/Hib-MenAC ( diphtheria , tetanus , whole cell pertussis-hepatitis B virus/Haemophilus influenzae type b-Neisseria meningitidis serogroups A and C ) vaccine was compared with a widely used pentavalent DTPw-HBV/Hib vaccine .',
 'target': 'OBJECTIVE'}
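
A buffer-based shuffle can be sketched in a few lines of plain Python. This is an illustration of the general technique under our own assumptions (the function name buffered_shuffle and its exact draining behaviour are hypothetical), not the implementation inside IterableDataset.shuffle().

```python
import random


def buffered_shuffle(examples, buffer_size, seed=None):
    """Approximately shuffle a stream while holding at most `buffer_size` items in memory."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            buffer.append(example)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]  # emit a random buffered item ...
        buffer[idx] = example  # ... and refill its slot with the next item
    rng.shuffle(buffer)
    yield from buffer  # drain what is left once the stream ends


shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
print(shuffled[0] < 10, sorted(shuffled) == list(range(100)))
```

The first item emitted can only come from the first buffer_size items of the stream, so a small buffer gives a weak shuffle while a large buffer costs more memory.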

In this example, we selected a random example from the first 10,000 examples in the buffer. Once an example is accessed, its spot in the buffer is filled with the next example in the corpus (i.e., the 10,001st example in the case above). You can also select elements from a streamed dataset using the IterableDataset.take() and IterableDataset.skip() functions, which act in a similar way to Dataset.select(). For example, to select the first 5 examples in the PubMed Abstracts dataset we can do the following:

In [77]:
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'abstract_text': 'The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .',
  'target': 'BACKGROUND'},
 {'abstract_text': 'This paper describes the design and evaluation of Positive Outlook , an online program aiming to enhance the self-management skills of gay men living with HIV .',
  'target': 'BACKGROUND'},
 {'abstract_text': 'This study is designed as a randomised controlled trial in which men living with HIV in Australia will be assigned to either an intervention group or usual care control group .',
  'target': 'METHODS'},
 {'abstract_text': "The intervention group will participate in the online group program ` Positive Outlook ' .",
  'target': 'METHODS'},
 {'abstract_text': 'The program is based on self-efficacy theory and uses a self-management approach to enhance skills , confidence and abilities to manag

Similarly, you can use the IterableDataset.skip() function to create training and validation splits from a shuffled dataset as follows:

In [78]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
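
The semantics of take() and skip() resemble itertools.islice applied to a plain iterator, which the sketch below uses as a stand-in. The make_stream helper is hypothetical and the comparison is an analogy rather than a description of the library internals.

```python
from itertools import islice


def make_stream():
    """Return a fresh iterator each time, standing in for re-reading a streamed dataset."""
    return iter(range(10))


validation = list(islice(make_stream(), 3))  # like take(3): the first 3 items
train = list(islice(make_stream(), 3, None))  # like skip(3): everything after them

print(validation, train)
```

The two parts are disjoint only if both passes over the stream see the items in the same order, which is one reason to fix the shuffle seed before splitting.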

Let’s round out our exploration of dataset streaming with a common application: combining multiple datasets together to create a single corpus. 🤗 Datasets provides an interleave_datasets() function that converts a list of IterableDataset objects into a single IterableDataset, where the elements of the new dataset are obtained by alternating among the source examples. This function is especially useful when you’re trying to combine large datasets, so as an example let’s stream the FreeLaw subset of the Pile, which is a 51 GB dataset of legal opinions from US courts:

In [79]:
law_dataset_streamed = load_dataset(
    "json",
    data_files="https://huggingface.co/datasets/mdroth/PubMed-200k-RTC/resolve/main/data/LegalText-classification_train.jsonl.zst",
    split="train",
    streaming=True,
)
next(iter(law_dataset_streamed))

{'case_outcome': 'cited',
 'case_title': 'Alpine Hardwood (Aust) Pty Ltd v Hardys Pty Ltd (No 2) [2002] FCA 224 ; (2002) 190 ALR 121',
 'case_text': 'Ordinarily that discretion will be exercised so that costs follow the event and are awarded on a party and party basis. A departure from normal practice to award indemnity costs requires some special or unusual feature in the case: Alpine Hardwood (Aust) Pty Ltd v Hardys Pty Ltd (No 2) [2002] FCA 224 ; (2002) 190 ALR 121 at [11] (Weinberg J) citing Colgate Palmolive Co v Cussons Pty Ltd (1993) 46 FCR 225 at 233 (Sheppard J).'}

This dataset is large enough to stress the RAM of most laptops, yet we’ve been able to load and access it without breaking a sweat! Let’s now combine the examples from the FreeLaw and PubMed Abstracts datasets with the interleave_datasets() function:

In [80]:
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

[{'abstract_text': 'The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .',
  'target': 'BACKGROUND',
  'case_outcome': None,
  'case_title': None,
  'case_text': None},
 {'abstract_text': None,
  'target': None,
  'case_outcome': 'cited',
  'case_title': 'Alpine Hardwood (Aust) Pty Ltd v Hardys Pty Ltd (No 2) [2002] FCA 224 ; (2002) 190 ALR 121',
  'case_text': 'Ordinarily that discretion will be exercised so that costs follow the event and are awarded on a party and party basis. A departure from normal practice to award indemnity costs requires some special or unusual feature in the case: Alpine Hardwood (Aust) Pty Ltd v Hardys Pty Ltd (No 2) [2002] FCA 224 ; (2002) 190 ALR 121 at [11] (Weinberg J) citing Colgate Palmolive Co v Cussons Pty Ltd (1993) 46 FCR 225 at 233 (Sheppard J).'}]

Here we’ve used the islice() function from Python’s itertools module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.
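
The alternating behaviour, including the None values for columns that one source lacks, can be imitated with two small generators. This is a sketch under our own assumptions: the helpers interleave and with_all_columns are hypothetical, the rows are made up, and the rule of stopping as soon as one source runs out is our choice for the illustration, so the library’s stopping behaviour may differ.

```python
def interleave(first, second):
    """Alternate between two streams of dicts, stopping when either one runs out."""
    first_iter, second_iter = iter(first), iter(second)
    while True:
        for source in (first_iter, second_iter):
            try:
                yield next(source)
            except StopIteration:
                return


def with_all_columns(examples, columns):
    """Fill in columns that an example lacks with None."""
    for example in examples:
        yield {column: example.get(column) for column in columns}


# Hypothetical rows that only borrow column names from the two datasets
abstracts = [
    {"abstract_text": "a1", "target": "BACKGROUND"},
    {"abstract_text": "a2", "target": "METHODS"},
]
cases = [{"case_title": "c1"}, {"case_title": "c2"}]
columns = ["abstract_text", "target", "case_title"]

combined = list(with_all_columns(interleave(abstracts, cases), columns))
print(combined[:2])
```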

Finally, if you want to stream the Pile in its 825 GB entirety, you can grab all the prepared files as follows:

✏️ Try it out! Use one of the large Common Crawl corpora like mc4 or oscar to create a streaming multilingual dataset that represents the spoken proportions of languages in a country of your choice. For example, the four national languages in Switzerland are German, French, Italian, and Romansh, so you could try creating a Swiss corpus by sampling the Oscar subsets according to their spoken proportion.

You now have all the tools you need to load and process datasets of all shapes and sizes — but unless you’re exceptionally lucky, there will come a point in your NLP journey where you’ll have to actually create a dataset to solve the problem at hand. That’s the topic of the next section!

### Creating your own dataset

Sometimes the dataset that you need to build an NLP application doesn’t exist, so you’ll need to create it yourself. In this section we’ll show you how to create a corpus of GitHub issues, which are commonly used to track bugs or features in GitHub repositories. This corpus could be used for various purposes, including:

1. Exploring how long it takes to close open issues or pull requests
2. Training a multilabel classifier that can tag issues with metadata based on the issue’s description (e.g., “bug,” “enhancement,” or “question”)
3. Creating a semantic search engine to find which issues match a user’s query

Here we’ll focus on creating the corpus, and in the next section we’ll tackle the semantic search application. To keep things meta, we’ll use the GitHub issues associated with a popular open source project: 🤗 Datasets! Let’s take a look at how to get the data and explore the information contained in these issues.

#### Getting the data

You can find all the issues in 🤗 Datasets by navigating to the repository’s Issues tab. As shown in the following screenshot, at the time of writing there were 331 open issues and 668 closed ones.

To download all the repository’s issues, we’ll use the GitHub REST API to poll the Issues endpoint. This endpoint returns a list of JSON objects, with each object containing a large number of fields that include the title and description as well as metadata about the status of the issue and so on.

A convenient way to download the issues is via the requests library, which is the standard way for making HTTP requests in Python. You can install the library by running:

Once the library is installed, you can make GET requests to the Issues endpoint by invoking the requests.get() function. For example, you can run the following command to retrieve the first issue on the first page:

In [1]:
import requests

url = "https://api.github.com/repos/kubeflow/pipelines/issues?page=1&per_page=1"
response = requests.get(url)

In [2]:
url = "https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
response2 = requests.get(url)

In [3]:
response2.text

'[{"url":"https://api.github.com/repos/huggingface/datasets/issues/6203","repository_url":"https://api.github.com/repos/huggingface/datasets","labels_url":"https://api.github.com/repos/huggingface/datasets/issues/6203/labels{/name}","comments_url":"https://api.github.com/repos/huggingface/datasets/issues/6203/comments","events_url":"https://api.github.com/repos/huggingface/datasets/issues/6203/events","html_url":"https://github.com/huggingface/datasets/issues/6203","id":1877491602,"node_id":"I_kwDODunzps5v6D-S","number":6203,"title":"Support loading from a DVC remote repository","user":{"login":"bilelomrani1","id":16692099,"node_id":"MDQ6VXNlcjE2NjkyMDk5","avatar_url":"https://avatars.githubusercontent.com/u/16692099?v=4","gravatar_id":"","url":"https://api.github.com/users/bilelomrani1","html_url":"https://github.com/bilelomrani1","followers_url":"https://api.github.com/users/bilelomrani1/followers","following_url":"https://api.github.com/users/bilelomrani1/following{/other_user}","gi

The response object contains a lot of useful information about the request, including the HTTP status code:

In [4]:
response.status_code

200

where a 200 status means the request was successful (you can find a list of possible HTTP status codes here). What we are really interested in, though, is the payload, which can be accessed in various formats like bytes, strings, or JSON. Since we know our issues are in JSON format, let’s inspect the payload as follows:

In [5]:
response.json()

[{'url': 'https://api.github.com/repos/kubeflow/pipelines/issues/9953',
  'repository_url': 'https://api.github.com/repos/kubeflow/pipelines',
  'labels_url': 'https://api.github.com/repos/kubeflow/pipelines/issues/9953/labels{/name}',
  'comments_url': 'https://api.github.com/repos/kubeflow/pipelines/issues/9953/comments',
  'events_url': 'https://api.github.com/repos/kubeflow/pipelines/issues/9953/events',
  'html_url': 'https://github.com/kubeflow/pipelines/pull/9953',
  'id': 1878199838,
  'node_id': 'PR_kwDOB-71UM5ZYW2g',
  'number': 9953,
  'title': 'chore(frontend): Refactor ExperimentDetails to functional component',
  'user': {'login': 'jlyaoyuli',
   'id': 56132941,
   'node_id': 'MDQ6VXNlcjU2MTMyOTQx',
   'avatar_url': 'https://avatars.githubusercontent.com/u/56132941?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/jlyaoyuli',
   'html_url': 'https://github.com/jlyaoyuli',
   'followers_url': 'https://api.github.com/users/jlyaoyuli/followers',
   'followi

Whoa, that’s a lot of information! We can see useful fields like title, body, and number that describe the issue, as well as information about the GitHub user who opened the issue.
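
Once the payload is parsed, it is ordinary Python data: a list of dicts with nested dicts inside. The sketch below pulls a few fields out of a trimmed-down payload built from values visible in the output above; the summary dict and its key names are our own, and a real response carries many more fields.

```python
import json

# A trimmed-down payload using values from the output above; a real response has many more fields
payload = json.loads(
    """
    [{"number": 9953,
      "title": "chore(frontend): Refactor ExperimentDetails to functional component",
      "html_url": "https://github.com/kubeflow/pipelines/pull/9953",
      "user": {"login": "jlyaoyuli", "id": 56132941}}]
    """
)

issue = payload[0]  # the endpoint returns a list, even with per_page=1
summary = {
    "number": issue["number"],
    "title": issue["title"],
    "author": issue["user"]["login"],  # nested objects become nested dicts
    "url": issue["html_url"],
}
print(summary)
```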

✏️ Try it out! Click on a few of the URLs in the JSON payload above to get a feel for what type of information each GitHub issue is linked to.

As described in the GitHub documentation, unauthenticated requests are limited to 60 requests per hour. Although you can increase the per_page query parameter to reduce the number of requests you make, you will still hit the rate limit on any repository that has more than a few thousand issues. So instead, you should follow GitHub’s instructions on creating a personal access token so that you can boost the rate limit to 5,000 requests per hour. Once you have your token, you can include it as part of the request header:

⚠️ Do not share a notebook with your GITHUB_TOKEN pasted in it. We recommend you delete the last cell once you have executed it to avoid leaking this information accidentally. Even better, store the token in a .env file and use the python-dotenv library to load it automatically for you as an environment variable.

In [6]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())
GITHUB_TOKEN = os.environ["GITHUB_PERSONAL_TOKEN"]
headers = {"Authorization": f"token {GITHUB_TOKEN}"}

Now that we have our access token, let’s create a function that can download all the issues from a GitHub repository:

In [13]:
import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm
from datasets import Dataset, load_dataset


def fetch_issues(
    owner="kubeflow",
    repo="pipelines",
    num_issues=10_000,
    rate_limit=5_000,
    issues_path=Path("05_data"),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    for page in tqdm(range(num_pages)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        batch.extend(issues.json())

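        if len(batch) > rate_limit and len(all_issues) < num_issues:
            # Stop early rather than wait for the rate limit to reset.
            # To keep downloading instead, flush the batch and sleep for an hour:
            # all_issues.extend(batch)
            # batch = []  # Flush batch for next time period
            # print(f"Reached GitHub rate limit. Sleeping for one hour ...")
            # time.sleep(60 * 60 + 1)
            break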

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
    print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
    )
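
The paging arithmetic inside fetch_issues() can be sketched on its own, without any network calls. The helper issue_page_urls below is hypothetical; it assumes that the page parameter is counted from 1, so it starts the range at 1 rather than 0, and it only builds the URLs that would be requested.

```python
import math


def issue_page_urls(owner, repo, num_issues, per_page=100):
    """Build one Issues-endpoint URL per page needed to cover `num_issues`."""
    base_url = "https://api.github.com/repos"
    num_pages = math.ceil(num_issues / per_page)
    return [
        f"{base_url}/{owner}/{repo}/issues?page={page}&per_page={per_page}&state=all"
        for page in range(1, num_pages + 1)  # assumption: pages are numbered from 1
    ]


urls = issue_page_urls("huggingface", "datasets", num_issues=250)
print(len(urls))
print(urls[0])
```

Each URL corresponds to one request, so 10,000 issues at 100 per page cost 100 requests, which is a useful number to compare against the hourly rate limit.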

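Now when we call fetch_issues() it will download the issues page by page and stop once more than rate_limit issues have been collected, which keeps us under GitHub’s limit on the number of requests per hour (the commented-out lines show how to sleep for an hour and continue instead); the result will be stored in a repository_name-issues.jsonl file, where each line is a JSON object that represents an issue. Let’s use this function to grab the issues (by default from the kubeflow/pipelines repository):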

In [14]:
#fetch_issues()

Once the issues are downloaded we can load them locally using our newfound skills from section 2:

In [15]:
issues_dataset = load_dataset("json", data_files="05_data/pipelines-issues.jsonl", split="train")
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason'],
    num_rows: 5100
})

Great, we’ve created our first dataset from scratch! But why are there several thousand issues when the Issues tab of the 🤗 Datasets repository only shows around 1,000 issues in total 🤔? As described in the GitHub documentation, that’s because we’ve downloaded all the pull requests as well:

> GitHub’s REST API v3 considers every pull request an issue, but not every issue is a pull request. For this reason, “Issues” endpoints may return both issues and pull requests in the response. You can identify pull requests by the pull_request key. Be aware that the id of a pull request returned from “Issues” endpoints will be an issue id.

Since the contents of issues and pull requests are quite different, let’s do some minor preprocessing to enable us to distinguish between them.

#### Cleaning up the data

The above snippet from GitHub’s documentation tells us that the pull_request column can be used to differentiate between issues and pull requests. Let’s look at a random sample to see what the difference is. As we did in section 3, we’ll chain Dataset.shuffle() and Dataset.select() to create a random sample and then zip the html_url and pull_request columns so we can compare the various URLs:

In [16]:
sample = issues_dataset.shuffle(seed=666).select(range(3))

# Print out the URL and pull request entries
for url, pr in zip(sample["html_url"], sample["pull_request"]):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

>> URL: https://github.com/kubeflow/pipelines/pull/6210
>> Pull request: {'url': 'https://api.github.com/repos/kubeflow/pipelines/pulls/6210', 'html_url': 'https://github.com/kubeflow/pipelines/pull/6210', 'diff_url': 'https://github.com/kubeflow/pipelines/pull/6210.diff', 'patch_url': 'https://github.com/kubeflow/pipelines/pull/6210.patch', 'merged_at': datetime.datetime(2021, 8, 6, 15, 54, 43)}

>> URL: https://github.com/kubeflow/pipelines/issues/5548
>> Pull request: None

>> URL: https://github.com/kubeflow/pipelines/pull/5072
>> Pull request: {'url': 'https://api.github.com/repos/kubeflow/pipelines/pulls/5072', 'html_url': 'https://github.com/kubeflow/pipelines/pull/5072', 'diff_url': 'https://github.com/kubeflow/pipelines/pull/5072.diff', 'patch_url': 'https://github.com/kubeflow/pipelines/pull/5072.patch', 'merged_at': datetime.datetime(2021, 2, 2, 4, 6, 15)}



Here we can see that each pull request is associated with various URLs, while ordinary issues have a None entry. We can use this distinction to create a new is_pull_request column that checks whether the pull_request field is None or not:

In [17]:
issues_dataset = issues_dataset.map(
    # Pull requests carry a dict of URLs; ordinary issues have None
    lambda x: {"is_pull_request": x["pull_request"] is not None}
)

✏️ Try it out! Calculate the average time it takes to close issues in 🤗 Datasets. You may find the Dataset.filter() function useful to filter out the pull requests and open issues, and you can use the Dataset.set_format() function to convert the dataset to a DataFrame so you can easily manipulate the created_at and closed_at timestamps. For bonus points, calculate the average time it takes to close pull requests.

In [18]:
issues_dataset["is_pull_request"][:3]

[True, True, True]
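
Looking at the first three values only tells us that the column exists. A quick tally is a more informative sanity check. The sketch below illustrates the idea on plain Python dictionaries; the `records` list is a made-up stand-in that mirrors the three sampled entries shown earlier, not the real dataset.

```python
from collections import Counter

# Hypothetical stand-in records: pull requests carry a dict, issues carry None
records = [
    {"number": 6210, "pull_request": {"html_url": "https://github.com/kubeflow/pipelines/pull/6210"}},
    {"number": 5548, "pull_request": None},
    {"number": 5072, "pull_request": {"html_url": "https://github.com/kubeflow/pipelines/pull/5072"}},
]

# Same rule as the is_pull_request column: anything that is not None is a pull request
flags = [record["pull_request"] is not None for record in records]

counts = Counter(flags)
print(f"pull requests: {counts[True]}, issues: {counts[False]}")
```

On the real dataset, the same tally should work by passing the `is_pull_request` column to `Counter` directly.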

In [19]:
# Ordinary issues only
issues_filtered_not_pull = issues_dataset.filter(lambda x: not x["is_pull_request"])
issues_filtered_not_pull.set_format("pandas")
issues_df = issues_filtered_not_pull[:]

# Pull requests only
issues_filtered_pull = issues_dataset.filter(lambda x: x["is_pull_request"])
issues_filtered_pull.set_format("pandas")
pull_df = issues_filtered_pull[:]

In [20]:
issues_df.columns

Index(['url', 'repository_url', 'labels_url', 'comments_url', 'events_url',
       'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels',
       'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments',
       'created_at', 'updated_at', 'closed_at', 'author_association',
       'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions',
       'timeline_url', 'performed_via_github_app', 'state_reason',
       'is_pull_request'],
      dtype='object')

In [21]:
import pandas as pd

# Time to close = closed_at - created_at; entries that are still open give NaT, which mean() skips
issues_df["time_to_close"] = pd.to_datetime(issues_df["closed_at"]) - pd.to_datetime(issues_df["created_at"])
average_close_time_issues = issues_df["time_to_close"].mean()

pull_df["time_to_close"] = pd.to_datetime(pull_df["closed_at"]) - pd.to_datetime(pull_df["created_at"])
average_close_time_pull = pull_df["time_to_close"].mean()


print(average_close_time_issues)
print(average_close_time_pull)

92 days 09:56:52.726688103
10 days 07:15:10.195076347
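
The pandas computation above skips entries with a missing closed_at value when taking the mean. To make that step explicit, here is a minimal sketch of the same calculation using only the standard library. The function name `average_time_to_close` and the sample records are hypothetical, chosen purely for illustration.

```python
from datetime import datetime, timedelta


def average_time_to_close(records):
    """Average closed_at - created_at over records that have been closed."""
    durations = [
        record["closed_at"] - record["created_at"]
        for record in records
        if record["closed_at"] is not None  # open issues have no closing time
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Made-up records: two closed issues and one that is still open
records = [
    {"created_at": datetime(2021, 1, 1), "closed_at": datetime(2021, 1, 3)},
    {"created_at": datetime(2021, 1, 1), "closed_at": datetime(2021, 1, 5)},
    {"created_at": datetime(2021, 1, 2), "closed_at": None},
]

average = average_time_to_close(records)
print(average)  # 3 days, 0:00:00
```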


Although we could clean up the dataset further by dropping or renaming some columns, it is generally good practice to keep the dataset as “raw” as possible at this stage so that it can be easily used in multiple applications.

Before we push our dataset to the Hugging Face Hub, let’s deal with one thing that’s missing from it: the comments associated with each issue and pull request. We’ll add them next with — you guessed it — the GitHub REST API!

#### Augmenting the dataset

The comments associated with an issue or pull request provide a rich source of information, especially if we’re interested in building a search engine to answer user queries about the library.



The GitHub REST API provides a Comments endpoint that returns all the comments associated with an issue number. Let’s test the endpoint to see what it returns:

In [22]:
issue_number = 2792
url = f"https://api.github.com/repos/kubeflow/pipelines/issues/{issue_number}/comments"
response = requests.get(url, headers=headers)
response.json()

[{'url': 'https://api.github.com/repos/kubeflow/pipelines/issues/comments/570736822',
  'html_url': 'https://github.com/kubeflow/pipelines/pull/2792#issuecomment-570736822',
  'issue_url': 'https://api.github.com/repos/kubeflow/pipelines/issues/2792',
  'id': 570736822,
  'node_id': 'MDEyOklzc3VlQ29tbWVudDU3MDczNjgyMg==',
  'user': {'login': 'gaoning777',
   'id': 3826739,
   'node_id': 'MDQ6VXNlcjM4MjY3Mzk=',
   'avatar_url': 'https://avatars.githubusercontent.com/u/3826739?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/gaoning777',
   'html_url': 'https://github.com/gaoning777',
   'followers_url': 'https://api.github.com/users/gaoning777/followers',
   'following_url': 'https://api.github.com/users/gaoning777/following{/other_user}',
   'gists_url': 'https://api.github.com/users/gaoning777/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/gaoning777/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/gaoning777/subscrip

We can see that the comment is stored in the body field, so let’s write a simple function that returns all the comments associated with an issue by picking out the body contents for each element in response.json():

In [28]:
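def get_comments(issue_number):
    # Note: this URL queries huggingface/datasets, which is what the test output below reflects.
    # The issues collected above come from kubeflow/pipelines, so the repository path
    # should match that repository when augmenting that dataset.
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    # Keep the body of each comment; non-dict elements (e.g. from an error payload) become ""
    return [r["body"] if isinstance(r, dict) else "" for r in response.json()]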


# Test our function works as expected
get_comments(3)

["Yes!\r\n- pandas will be a one-liner in `arrow_dataset`: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas\r\n- for Spark I have no idea. let's investigate that at some point",
 'For Spark it looks to be pretty straightforward as well https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html but looks to be having a dependency to Spark is necessary, then nevermind we can skip it',
 'Now Pandas is available.']

This looks good, so let’s use Dataset.map() to add a new comments column to each issue in our dataset:

In [29]:
issues_filtered_not_pull.reset_format()
issues_filtered_not_pull

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'is_pull_request'],
    num_rows: 1574
})

In [30]:
# Depending on your internet connection, this can take a few minutes...
issues_with_comments_dataset = issues_filtered_not_pull.map(
    lambda x: {"comments": get_comments(x["number"])}
)

Map:   0%|          | 0/1574 [00:00<?, ? examples/s]
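
One detail is worth making explicit. When a request fails (for example, because a rate limit is hit), the API responds with a JSON object carrying a message field instead of a list of comments, which is presumably what the isinstance check in get_comments() guards against. The Comments endpoint is also paginated, so a single request returns only the first page of a long thread. The following is a minimal sketch of both points, not the notebook’s implementation: `comments_url()` and `extract_comment_bodies()` are hypothetical helper names, and the sample payloads are made up.

```python
from urllib.parse import urlencode


def comments_url(owner, repo, issue_number, per_page=100, page=1):
    """Build the Comments endpoint URL for a given repository and issue number."""
    query = urlencode({"per_page": per_page, "page": page})
    return (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/issues/{issue_number}/comments?{query}"
    )


def extract_comment_bodies(payload):
    """Return the comment bodies from an already-parsed JSON response."""
    if not isinstance(payload, list):
        # Error responses are JSON objects (dicts), not lists of comments
        return []
    return [c["body"] for c in payload if isinstance(c, dict) and "body" in c]


# Made-up payloads for illustration
ok_payload = [{"body": "First comment"}, {"body": "Second comment"}]
error_payload = {"message": "API rate limit exceeded"}

print(comments_url("kubeflow", "pipelines", 2792))
print(extract_comment_bodies(ok_payload))
print(extract_comment_bodies(error_payload))
```

Passing the owner and repository as arguments also makes it harder to request comments from a different repository than the one the issues came from.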

The final step is to push our augmented dataset to the Hub so we can share it with the community! Uploading a dataset is very simple: just like models and tokenizers from 🤗 Transformers, we can use a push_to_hub() method to push a dataset. To do that we need an authentication token, which can be obtained by first logging into the Hugging Face Hub with the notebook_login() function:

In [31]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [33]:
issues_with_comments_dataset.push_to_hub("github-kubeflow-issues")

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

From here, anyone can download the dataset by simply providing load_dataset() with the repository ID as the path argument:

In [36]:
remote_dataset = load_dataset("hjerpe/github-kubeflow-issues", split="train")
remote_dataset

Downloading readme:   0%|          | 0.00/6.23k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/2.70M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/1574 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'is_pull_request'],
    num_rows: 1574
})

Cool, we’ve pushed our dataset to the Hub and it’s available for others to use! There’s just one important thing left to do: adding a dataset card that explains how the corpus was created and provides other useful information for the community.

💡 You can also upload a dataset to the Hugging Face Hub directly from the terminal by using huggingface-cli and a bit of Git magic. See the 🤗 Datasets guide for details on how to do this.



#### Creating a dataset card

Well-documented datasets are more likely to be useful to others (including your future self!), as they provide the context to enable users to decide whether the dataset is relevant to their task and to evaluate any potential biases in or risks associated with using the dataset. 

On the Hugging Face Hub, this information is stored in each dataset repository’s README.md file. There are two main steps you should take before creating this file:

1. Use the datasets-tagging application to create metadata tags in YAML format. These tags are used for a variety of search features on the Hugging Face Hub and ensure your dataset can be easily found by members of the community. Since we have created a custom dataset here, you’ll need to clone the datasets-tagging repository and run the application locally.

2. Read the 🤗 Datasets guide on creating informative dataset cards and use it as a template.

You can create the README.md file directly on the Hub, and you can find a template dataset card in the lewtun/github-issues dataset repository.
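
To give a feel for the file’s layout, here is a minimal sketch that assembles a README.md skeleton: a YAML metadata block delimited by `---` lines, followed by the Markdown body. The helper `build_dataset_card()` is hypothetical, and the metadata values are placeholders to replace with the tags that actually describe your dataset.

```python
def build_dataset_card(pretty_name, languages, description):
    """Assemble a minimal dataset card: YAML metadata block plus Markdown body."""
    lines = ["---", f"pretty_name: {pretty_name}", "language:"]
    lines += [f"- {code}" for code in languages]
    lines += ["---", "", f"# Dataset Card for {pretty_name}", "", description, ""]
    return "\n".join(lines)


# Placeholder values for illustration only
card = build_dataset_card(
    pretty_name="GitHub Kubeflow Issues",
    languages=["en"],
    description="Issues and comments collected with the GitHub REST API.",
)
print(card)
```

The resulting string can be saved as README.md in the dataset repository, or pasted into the editor on the Hub.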

✏️ Try it out! Use the datasets-tagging application and 🤗 Datasets guide to complete the README.md file for your GitHub issues dataset.

That’s it! We’ve seen in this section that creating a good dataset can be quite involved, but fortunately uploading it and sharing it with the community is not. In the next section we’ll use our new dataset to create a semantic search engine with 🤗 Datasets that can match questions to the most relevant issues and comments.

✏️ Try it out! Go through the steps we took in this section to create a dataset of GitHub issues for your favorite open source library (pick something other than 🤗 Datasets, of course!). For bonus points, fine-tune a multilabel classifier to predict the tags present in the labels field.