## The Datasets Library

Similar to Pandas, Huggingface Datasets provides several functions to manipulate the contents of Dataset and DatasetDict objects. We already encountered the Dataset.map() method in Chapter 3, and in this section we’ll explore some of the other functions at our disposal.

For this example we’ll use the Drug Review Dataset that’s hosted on the UC Irvine Machine Learning Repository, which contains patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient’s satisfaction.

First we need to download and extract the data, which can be done with the wget and unzip commands:

In [1]:
# !wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"

In [2]:
# !unzip drugsCom_raw.zip

In [3]:
# !pip install datasets

In [4]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}

drug_dataset = load_dataset("csv", data_files=data_files, delimiter='\t')

In [5]:
SEED = 42
drug_sample = drug_dataset["train"].shuffle(seed=SEED).select(range(1000))

drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

From this sample we can already see a few quirks in our dataset:

* **The Unnamed**: 0 column looks suspiciously like an anonymized ID for each patient.
* **The condition** column includes a mix of uppercase and lowercase labels.
* **The reviews** are of varying length and contain a mix of Python line separators (\r\n) as well as HTML character codes like &\#039;.

Let’s see how we can use 🤗 Datasets to deal with each of these issues. To test the patient ID hypothesis for the Unnamed: 0 column, we can use the Dataset.unique() function to verify that the number of IDs matches the number of rows in each split:



In [6]:
for split in drug_dataset.keys():
  assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

This seems to confirm our hypothesis, so let’s clean up the dataset a bit by renaming the Unnamed: 0 column to something a bit more interpretable. We can use the `DatasetDict.rename_column()` function to rename the column across both splits in one go:

In [7]:
drug_dataset = drug_dataset.rename_column(
    original_column_name='Unnamed: 0', new_column_name='patient_id'
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

Now, Lets normalize all the condition labels using `Dataset.map()`

In [9]:
def lowercase_condition(example):
    return {'condition': example['condition'].lower()}

In [12]:
drug_dataset = drug_dataset.map(lowercase_condition)

Some of the entries in the condition column are None, which cannot be lowercased as they’re not strings. Let’s drop these rows using Dataset.filter(), which works in a similar way to Dataset.map() and expects a function that receives a single example of the dataset

In [11]:
drug_dataset = drug_dataset.filter(lambda x:x['condition'] is not None)

In [13]:
drug_dataset["train"]["condition"][:3]

['left ventricular dysfunction', 'adhd', 'birth control']

## Creating New Columns

Whenever you’re dealing with customer reviews, a good practice is to check the number of words in each review. A review might be just a single word like “Great!” or a full-blown essay with thousands of words, and depending on the use case you’ll need to handle these extremes differently. To compute the number of words in each review, we’ll use a rough heuristic based on splitting each text by whitespace.

In [14]:
def compute_review_length(example): 
    return {"review_length": len(example["review"].split())}

In [15]:
drug_dataset = drug_dataset.map(compute_review_length)

In [16]:
drug_dataset['train'][0]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

As expected, we can see a review_length column has been added to our training set. We can sort this new column with Dataset.sort() to see what the extreme values look like

In [17]:
drug_dataset["train"].sort('review_length')[0]

{'patient_id': 111469,
 'drugName': 'Ledipasvir / sofosbuvir',
 'condition': 'hepatitis c',
 'review': '"Headache"',
 'rating': 10.0,
 'date': 'February 3, 2015',
 'usefulCount': 41,
 'review_length': 1}

Let’s use the `Dataset.filter()` function to remove reviews that contain fewer than 30 words.

In [18]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)

In [19]:
drug_dataset.num_rows

{'train': 138514, 'test': 46108}

The last thing we need to deal with is the presence of HTML character codes in our reviews. We can use Python’s html module to unescape these characters

In [20]:
import html 

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [23]:
%%time
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x['review'])})

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 6.78 s, sys: 221 ms, total: 7 s
Wall time: 7.15 s


### The map() method’s superpowers

The Dataset.map() method takes a batched argument that, if set to True, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000). For instance, the previous map function that unescaped all the HTML took a bit of time to run (you can read the time taken from the progress bars). We can speed this up by processing several elements at the same time using a list comprehension.

When you specify batched=True the function receives a dictionary with the fields of the dataset, but each value is now a list of values, and not just a single value. The return value of Dataset.map() should be the same: a dictionary with the fields we want to update or add to our dataset, and a list of values. For example, here is another way to unescape all HTML characters, but using batched=True:

In [24]:
%%time
new_drug_dataset = drug_dataset.map(lambda x: {'review': [html.unescape(o) for o in x['review']]}, batched=True)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 248 ms, sys: 93.6 ms, total: 341 ms
Wall time: 590 ms


you’ll see that the above command(with `batched=True`) executes way faster than the previous one. And it’s not because our reviews have already been HTML-unescaped — if you re-execute the instruction from the previous section (without batched=True), it will take the same amount of time as before. This is because list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.

#### Let`s try batched=True concept with tokenizer

In [26]:
from transformers import AutoTokenizer

In [27]:
%%time 
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

Map (num_proc=8):   0%|          | 0/138514 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 819 ms, sys: 208 ms, total: 1.03 s
Wall time: 44.9 s


In [35]:
%%time 
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

CPU times: user 81.1 ms, sys: 46.3 ms, total: 127 ms
Wall time: 650 ms


In [29]:
%%time 
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=False, num_proc=8)

Map (num_proc=8):   0%|          | 0/138514 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 2.2 s, sys: 390 ms, total: 2.59 s
Wall time: 53 s


Here we will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return all the chunks of the texts instead of just the first one

In [36]:
def tokenize_and_split(examples): 
    return tokenizer(
        examples['review'], 
        truncation=True, 
        max_length=128, 
        return_overflowing_tokens=True
    )

Let’s test this on one example before using Dataset.map() on the whole dataset

In [37]:
result = tokenize_and_split(drug_dataset['train'][0])

In [40]:
[len(input) for input in result['input_ids']]

[128, 49]

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True) # this leads to Error - ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000

### ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000

The issue occurs when trying to tokenize text data where some reviews are very long. Here's what's happening:

The original dataset has 1,000 rows
When tokenizing long reviews, some reviews get split into multiple pieces because they're too long to fit in one chunk
After tokenization, we end up with 1,463 rows because some reviews were split into multiple pieces

This creates a problem because:

Original columns (like 'drug', 'condition') have 1,000 rows
New tokenized columns have 1,463 rows
You can't have different numbers of rows in different columns of the same dataset

The solution is to use remove_columns to remove the original columns, keeping only the tokenized data. This way, all columns will have the same 1,463 rows.
It's like if you had a story that's too long for one page - when you split it across multiple pages, you can't keep the original single-page format alongside the multi-page version. You need to commit to one format or the other.

In [42]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True, remove_columns=drug_dataset['train'].column_names)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

Now this works without error. We can check that our new dataset has many more elements than the original dataset by comparing the lengths:

In [43]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

## From Dataset s to DataFrame s and back

To enable the conversion between various third party libraries, HuggingFace provides a `Dataset.set_format()` function. Let`s convert our drug dataset to pandas

In [44]:
drug_dataset.set_format('pandas')

In [52]:
drug_dataset['train'][:5] # This returns a pandas dataframe instead of dict

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
1,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134
2,138000,Ortho Evra,birth control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10,89
3,35696,Buprenorphine / naloxone,opiate dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37,124
4,155963,Cialis,benign prostatic hyperplasia,"""2nd day on 5mg started to work with rock hard...",2.0,"November 28, 2015",43,68


Let’s create a pandas.DataFrame for the whole training set by selecting all the elements of drug_dataset["train"]

In [53]:
train_dataframe = drug_dataset["train"][:]

In [55]:
train_dataframe.head() # now, we can use pandas functions such as head or tail

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
1,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134
2,138000,Ortho Evra,birth control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10,89
3,35696,Buprenorphine / naloxone,opiate dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37,124
4,155963,Cialis,benign prostatic hyperplasia,"""2nd day on 5mg started to work with rock hard...",2.0,"November 28, 2015",43,68


Let`s do chaining to compute the class distribution among the condition entries

In [62]:
frequencies = (
    train_dataframe['condition']
    .value_counts()
    .to_frame()
    .reset_index()
    .rename(columns={'index':'condition', 'count':'frequency'})
)

In [63]:
frequencies.head()

Unnamed: 0,condition,frequency
0,birth control,27655
1,depression,8023
2,acne,5209
3,anxiety,4991
4,pain,4744


And once we’re done with our Pandas analysis, we can always create a new Dataset object by using the `Dataset.from_pandas()` function as follows

In [64]:
from datasets import Dataset

In [65]:
freq_dataset = Dataset.from_pandas(frequencies)

In [70]:
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

In [71]:
drug_dataset.reset_format() # reset the output formar of `drug_dataset` from pandas to arrow

## Creating a Validation set

In [82]:
drug_dataset_clean = drug_dataset['train'].train_test_split(train_size=0.8, seed=42)

drug_dataset_clean['validation'] = drug_dataset_clean.pop('test')

drug_dataset_clean['test'] = drug_dataset['test']

In [87]:
drug_dataset_clean['train'][0]

{'patient_id': 89879,
 'drugName': 'Cyclosporine',
 'condition': 'keratoconjunctivitis sicca',
 'review': '"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I\'ve had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I\'ve talked with my doctor about this and he said it is normal but should go away after some time, but it hasn\'t. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I\'ve been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I\'m ready to move on."',
 'rating': 2.0,
 'date': 'April 20, 2013',
 'usefulCount': 69,
 'review_length': 147}

### Saving a Dataset 

In [88]:
drug_dataset_clean.save_to_disk("./data/drug-reviews")

Saving the dataset (0/1 shards):   0%|          | 0/110811 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/27703 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/46108 [00:00<?, ? examples/s]