# Datasets library

---

## When the dataset isn't on the Hub

### Working with local datasets
Please refer to this [link](https://huggingface.co/learn/nlp-course/chapter5/2?fw=pt#working-with-local-and-remote-datasets).

In [None]:
from datasets import load_dataset

data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

### Working with remote datasets
Please refer to this [link](https://huggingface.co/learn/nlp-course/chapter5/2?fw=pt#loading-a-remote-dataset).

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

## Time to slice and dice

### Slicing and dicing our data
Please refer to this [link](https://huggingface.co/learn/nlp-course/chapter5/3?fw=pt#slicing-and-dicing-our-data).

In [None]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

In [None]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

#### More processing

In [None]:
def filter_dataset(example):
    return example["condition"] is not None

drug_dataset = drug_dataset.filter(filter_dataset)

# The same filter written with a lambda function
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

In [None]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

# map() returns a new dataset, so assign the result back
drug_dataset = drug_dataset.map(lowercase_condition)

# The same map written with a lambda function
drug_dataset = drug_dataset.map(lambda x: {"condition": x["condition"].lower()})

### Creating new columns

In [None]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

drug_dataset = drug_dataset.map(compute_review_length)

# The same map written with a lambda function
drug_dataset = drug_dataset.map(lambda x: {"review_length": len(x["review"].split())})

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)

### Shuffle and split

#### shuffle

In [6]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')
squad[0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

In [3]:
squad_shuffled = squad.shuffle(seed=42)
squad_shuffled[0]

{'id': '573173d8497a881900248f0c',
 'title': 'Egypt',
 'context': 'The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery.',
 'question': 'What percentage of Egyptians polled support death penalty for those leaving Islam?',
 'answers': {'text': ['84%'], 'answer_start': [468]}}

#### split

In [25]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')

dataset = squad.train_test_split(test_size=0.1)
dataset["validation"] = dataset['train'].train_test_split(test_size=0.1).pop("test")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 78839
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 8760
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 7884
    })
})
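
Note that in the cell above the `validation` rows are drawn from `dataset['train']`, which is itself left unchanged, so those 7,884 rows are still part of `train`. A minimal sketch of the arithmetic for three disjoint splits, in plain Python rather than with the library (the `three_way_split` helper is hypothetical, and rounding the held-out size up is an assumption chosen because it reproduces the `test` and `validation` counts shown above):

```python
import math
import random


def three_way_split(n_rows, test_size=0.1, valid_size=0.1, seed=42):
    """Return disjoint (train, validation, test) index lists for n_rows rows."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Hold out the test rows first, rounding the size up.
    n_test = math.ceil(test_size * n_rows)
    test, rest = indices[:n_test], indices[n_test:]

    # Carve the validation rows out of what is left, so they leave `train`.
    n_valid = math.ceil(valid_size * len(rest))
    validation, train = rest[:n_valid], rest[n_valid:]
    return train, validation, test


train, validation, test = three_way_split(87599)
print(len(train), len(validation), len(test))
```

With the library itself, the same effect should come from calling `train_test_split` a second time on `dataset['train']` and assigning both of the resulting splits back, instead of keeping only the popped `test` part.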

### Select and filter

#### select

In [10]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')

indices = [0, 10, 20]
examples = squad.select(indices)
examples

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 3
})

In [23]:
examples[:]

{'id': ['5733be284776f41900661182',
  '5733bed24776f41900661188',
  '5733a70c4776f41900660f64'],
 'title': ['University_of_Notre_Dame',
  'University_of_Notre_Dame',
  'University_of_Notre_Dame'],
 'context': ['Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
  'The university is the major seat of the Congregation of Holy Cross (albeit not its offi

In [11]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')

sample = squad.shuffle().select(range(3))
sample

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 3
})

#### filter

In [14]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')

squad_filtered = squad.filter(lambda x: x["title"].startswith("L"))
squad_filtered[0]

{'id': '56de0fef4396321400ee2583',
 'title': 'Lighting',
 'context': 'Lighting or illumination is the deliberate use of light to achieve a practical or aesthetic effect. Lighting includes the use of both artificial light sources like lamps and light fixtures, as well as natural illumination by capturing daylight. Daylighting (using windows, skylights, or light shelves) is sometimes used as the main source of light during daytime in buildings. This can save energy in place of using artificial lighting, which represents a major component of energy consumption in buildings. Proper lighting can enhance task performance, improve the appearance of an area, or have positive psychological effects on occupants.',
 'question': 'What is used a main source of light for a building during the day?',
 'answers': {'text': ['Daylighting'], 'answer_start': [245]}}

### Rename, remove, and flatten

In [None]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

In [15]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')

squad.rename_column("context", "passages")

Dataset({
    features: ['id', 'title', 'passages', 'question', 'answers'],
    num_rows: 87599
})

In [16]:
squad.remove_columns(["id", "title"])

Dataset({
    features: ['context', 'question', 'answers'],
    num_rows: 87599
})

In [17]:
squad.flatten()

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
    num_rows: 87599
})
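
What `flatten()` does to the column names can be sketched for a single record in plain Python. The `flatten_record` helper is hypothetical and is not the library's code; it only shows how nested keys end up joined to their parent with a dot.

```python
def flatten_record(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested fields, prefixing their names.
            flat.update(flatten_record(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


record = {
    "id": "5733be284776f41900661182",
    "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]},
}
print(list(flatten_record(record)))
```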

### Map

In [19]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')

def lowercase_title(example):
    return {'title': example['title'].lower()}

squad_lowercase = squad.map(lowercase_title)
squad_lowercase.shuffle(seed=42)["title"][:5]

['egypt',
 'ann_arbor,_michigan',
 'rule_of_law',
 'samurai',
 'group_(mathematics)']

You can use a lambda to do the same:

In [21]:
squad_lowercase = squad.map(lambda x: {"title": x["title"].lower()})
squad_lowercase.shuffle(seed=42)["title"][:5]

['egypt',
 'ann_arbor,_michigan',
 'rule_of_law',
 'samurai',
 'group_(mathematics)']

Let's speed this up by using `batched=True`:

In [22]:
squad_lowercase = squad.map(lambda x: {"title": [title.lower() for title in x["title"]]}, batched=True)
squad_lowercase.shuffle(seed=42)["title"][:5]

['egypt',
 'ann_arbor,_michigan',
 'rule_of_law',
 'samurai',
 'group_(mathematics)']
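
The difference lies in how the function is called. A rough sketch in plain Python of the two calling conventions (the helpers `apply_per_example` and `apply_batched` are hypothetical stand-ins, not the library's implementation): without `batched=True` the function receives one example as a dict of values, and with it the function receives a dict of lists and is called far fewer times.

```python
rows = [{"title": "Egypt"}, {"title": "Samurai"}, {"title": "Rule_of_law"}]


def apply_per_example(function, rows):
    # One call per row; each call receives a dict of single values.
    return [{**row, **function(row)} for row in rows]


def apply_batched(function, rows, batch_size=2):
    # One call per batch; each call receives a dict of lists.
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = {**batch, **function(batch)}
        # Turn the dict of lists back into one dict per row.
        n_out = len(next(iter(result.values())))
        out.extend({key: result[key][i] for key in result} for i in range(n_out))
    return out


per_example = apply_per_example(lambda x: {"title": x["title"].lower()}, rows)
batched = apply_batched(lambda x: {"title": [t.lower() for t in x["title"]]}, rows)
print(per_example == batched)
```

Both paths produce the same rows; only the shape of the argument and the number of calls change, which is why the batched lambda above needs a list comprehension.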

#### The map() method's superpowers

In [5]:
from datasets import load_dataset

my_dataset = load_dataset("ai2_arc", "ARC-Challenge")
my_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 1119
    })
    test: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 1172
    })
    validation: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 299
    })
})

> Using `num_proc` to speed up your processing is usually a great idea, as long as the function you are using is not already doing some kind of multiprocessing of its own. Hence, use `batched=True` with fast tokenizers, and `num_proc=os.cpu_count()` for slow tokenizers.

Fast tokenizers are written in Rust and already parallelize internally when given a batch, so `batched=True` is enough for them. Slow tokenizers are written in Python, so for those the parallelism has to come from `num_proc`, for example `num_proc=os.cpu_count()`.

In [10]:
new_dataset = my_dataset.map(lambda x: {"quest_len": [len(quest) for quest in x["question"]]}, batched=True)
new_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'choices', 'answerKey', 'quest_len'],
        num_rows: 1119
    })
    test: Dataset({
        features: ['id', 'question', 'choices', 'answerKey', 'quest_len'],
        num_rows: 1172
    })
    validation: Dataset({
        features: ['id', 'question', 'choices', 'answerKey', 'quest_len'],
        num_rows: 299
    })
})

Here we will tokenize our examples and truncate them to a maximum length of 64, but we will ask the tokenizer to return all the chunks of the texts instead of just the first one. This can be done with `return_overflowing_tokens=True`:

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_split(examples):
    return tokenizer(
        examples["question"],
        truncation=True,
        max_length=64,
        return_overflowing_tokens=True,
    )

In [17]:
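# This raises a length-mismatch error: the tokenizer can return more rows than
# the batch contains, so the new columns no longer line up with the old ones
# tokenized_dataset = new_dataset.map(tokenize_and_split, batched=True)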

In [19]:
# Solution 1: remove old columns
tokenized_dataset = new_dataset.map(
    tokenize_and_split, batched=True, remove_columns=new_dataset["train"].column_names
)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 1154
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 1219
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 310
    })
})

In [27]:
# Solution 2: make the old columns the same size as the new ones.
def tokenize_and_split(examples):
    result = tokenizer(
        examples["question"],
        truncation=True,
        max_length=64,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

tokenized_dataset = new_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'choices', 'answerKey', 'quest_len', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1154
    })
    test: Dataset({
        features: ['id', 'question', 'choices', 'answerKey', 'quest_len', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1219
    })
    validation: Dataset({
        features: ['id', 'question', 'choices', 'answerKey', 'quest_len', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 310
    })
})
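
Why Solution 2 works can be seen with a toy batch in plain Python. The values below are made up for illustration: suppose the second of two questions is long enough to be split into two chunks, so the tokenizer returns three rows and `overflow_to_sample_mapping` is `[0, 1, 1]`.

```python
# Toy stand-ins for one batch of old columns and the tokenizer's mapping.
examples = {"id": ["q0", "q1"], "quest_len": [40, 300]}
sample_map = [0, 1, 1]  # new row i came from old row sample_map[i]

# Repeat each old value once per chunk it produced.
expanded = {key: [values[i] for i in sample_map] for key, values in examples.items()}
print(expanded)
```

Every old column now has three entries, matching the three tokenized rows, which is what lets `map()` keep the old and new columns side by side.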

### From Datasets to DataFrames

In [30]:
from datasets import load_dataset

dataset = load_dataset("swiss_judgment_prediction", "all", split="validation")
dataset[0]

{'id': 48757,
 'year': 2015,
 'text': 'Sachverhalt: A. X._ war bei der Krankenversicherung C._ taggeldversichert. Infolge einer Arbeitsunfähigkeit leistete ihm die C._ vom 30. Juni 2011 bis am 28. Juni 2013 Krankentaggelder, wobei die Leistungen bis am 30. September 2012 auf Grundlage einer Arbeitsunfähigkeit von 100% und danach basierend auf einer Arbeitsunfähigkeit von 55% erbracht wurden. Die Neueinschätzung der Arbeitsfähigkeit erfolgte anhand eines Gutachtens der D._ AG vom 27. August 2012, welches im Auftrag der C._ erstellt wurde. X._ machte daraufhin gegenüber der C._ geltend, er sei entgegen dem Gutachten auch nach dem 30. September 2012 zu 100% arbeitsunfähig gewesen. Ferner verlangte er von der D._ AG zwecks externer Überprüfung des Gutachtens die Herausgabe sämtlicher diesbezüglicher Notizen, Auswertungen und Unterlagen. A._ (als Geschäftsführer der D._ AG) und B._ (als für das Gutachten medizinisch Verantwortliche) antworteten ihm, dass sie alle Unterlagen der C._ zugestel

In [39]:
dataset.set_format("pandas")
dataset[0]

|  | id | year | text | label | language | region | canton | legal area | source_language |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 48757 | 2015 | Sachverhalt: A. X._ war bei der Krankenversich... | 0 | de | Espace Mittelland | be | penal law |  |


In [40]:
df = dataset[:]
df.head()

|  | id | year | text | label | language | region | canton | legal area | source_language |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 48757 | 2015 | Sachverhalt: A. X._ war bei der Krankenversich... | 0 | de | Espace Mittelland | be | penal law |  |
| 1 | 48758 | 2015 | Sachverhalt: A. Der 1961 geborene A._ wurde na... | 0 | de | Zürich | zh | social law |  |
| 2 | 48760 | 2015 | Sachverhalt: A. A.a. Mit Verfügung vom 9. Febr... | 0 | de | Zürich | zh | social law |  |
| 3 | 48761 | 2015 | Sachverhalt: A. A._, geboren 1963, erlitt am 2... | 0 | de | Zürich | zh | social law |  |
| 4 | 48762 | 2015 | Sachverhalt: A. B._ war seit 25. September 199... | 0 | de | Central Switzerland | zg | social law |  |


In [41]:
# Make sure to execute the following when you are done with df analysis
dataset.reset_format()

In [43]:
df = dataset.to_pandas()
df.head()

|  | id | year | text | label | language | region | canton | legal area | source_language |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 48757 | 2015 | Sachverhalt: A. X._ war bei der Krankenversich... | 0 | de | Espace Mittelland | be | penal law |  |
| 1 | 48758 | 2015 | Sachverhalt: A. Der 1961 geborene A._ wurde na... | 0 | de | Zürich | zh | social law |  |
| 2 | 48760 | 2015 | Sachverhalt: A. A.a. Mit Verfügung vom 9. Febr... | 0 | de | Zürich | zh | social law |  |
| 3 | 48761 | 2015 | Sachverhalt: A. A._, geboren 1963, erlitt am 2... | 0 | de | Zürich | zh | social law |  |
| 4 | 48762 | 2015 | Sachverhalt: A. B._ war seit 25. September 199... | 0 | de | Central Switzerland | zg | social law |  |


In [44]:
df.groupby("region")["language"].value_counts()

region                    language
Central Switzerland       de           631
Eastern Switzerland       de           660
                          it            15
Espace Mittelland         de           695
                          fr           471
Federation                de           246
                          fr            31
                          it            11
Northwestern Switzerland  de          1045
                          fr             2
Région lémanique          fr          2272
                          de            40
Ticino                    it           358
Zürich                    de          1262
n/a                       fr           319
                          de           126
                          it            24
Name: count, dtype: int64

### From DataFrames to Datasets

In [45]:
from datasets import Dataset

new_dataset = Dataset.from_pandas(df)
new_dataset

Dataset({
    features: ['id', 'year', 'text', 'label', 'language', 'region', 'canton', 'legal area', 'source_language'],
    num_rows: 8208
})

### Saving and loading saved dataset

#### Saving a dataset

| Data Format | Function |
| --- | --- |
| Arrow | `Dataset.save_to_disk()` |
| CSV | `Dataset.to_csv()` |
| JSON or JSONL | `Dataset.to_json()` |

In [56]:
from datasets import load_dataset

dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [51]:
dataset.save_to_disk("../temp/datasets/squad")

#### Load the saved dataset

In [52]:
from datasets import load_from_disk

loaded_dataset = load_from_disk("../temp/datasets/squad")
loaded_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

#### Tips for saving and loading csv/json files

In [57]:
# Use a separate loop variable so the `dataset` DatasetDict is not overwritten
for split, split_dataset in dataset.items():
    split_dataset.to_json(f"../temp/datasets/squad-json/{split}.jsonl")

In [61]:
from datasets import load_dataset

data_files = {
    "train": "../temp/datasets/squad-json/train.jsonl",
    "validation": "../temp/datasets/squad-json/validation.jsonl",
}
squad_reloaded = load_dataset("json", data_files=data_files)
squad_reloaded

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

## Handle big data

### Streaming datasets

In [74]:
from datasets import load_dataset

dataset = load_dataset("glue", "ax", split="test", streaming=True)

In [75]:
next(iter(dataset))

{'premise': 'The cat sat on the mat.',
 'hypothesis': 'The cat did not sit on the mat.',
 'label': -1,
 'idx': 0}

In [82]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized_dataset = dataset.map(lambda x: tokenizer(x["premise"], x["hypothesis"], truncation=True))
next(iter(tokenized_dataset))

{'premise': 'The cat sat on the mat.',
 'hypothesis': 'The cat did not sit on the mat.',
 'label': -1,
 'idx': 0,
 'input_ids': [101,
  1996,
  4937,
  2938,
  2006,
  1996,
  13523,
  1012,
  102,
  1996,
  4937,
  2106,
  2025,
  4133,
  2006,
  1996,
  13523,
  1012,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [85]:
dataset_head = dataset.take(5)
list(dataset_head)

[{'premise': 'The cat sat on the mat.',
  'hypothesis': 'The cat did not sit on the mat.',
  'label': -1,
  'idx': 0},
 {'premise': 'The cat did not sit on the mat.',
  'hypothesis': 'The cat sat on the mat.',
  'label': -1,
  'idx': 1},
 {'premise': "When you've got no snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.",
  'hypothesis': "When you've got snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.",
  'label': -1,
  'idx': 2},
 {'premise': "When you've got snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.",
  'hypothesis': "When you've got no snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.",
  'label': -1,
  '

In [97]:
shuffled_dataset = dataset.shuffle(buffer_size=4_000, seed=42)
next(iter(shuffled_dataset))

{'premise': 'None of the graduates of my program have moved on to other things because the jobs suck.',
 'hypothesis': 'Most of the graduates of my program have moved on to other things because the jobs suck.',
 'label': -1,
 'idx': 69}

In [109]:
# Skip the first 200 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(200)
# Take the first 200 examples for the validation set
validation_dataset = shuffled_dataset.take(200)
list(validation_dataset)[:3]

[{'premise': 'None of the graduates of my program have moved on to other things because the jobs suck.',
  'hypothesis': 'Most of the graduates of my program have moved on to other things because the jobs suck.',
  'label': -1,
  'idx': 69},
 {'premise': 'After the clingers completely immobilize her, I carry her to the tub or sink.',
  'hypothesis': 'After the clingers completely immobilize her, I carry her to the tub.',
  'label': -1,
  'idx': 651},
 {'premise': 'Three women alleged they were sexually assaulted or raped by male colleagues during that time.',
  'hypothesis': 'Three women alleged they were sexually assaulted by male colleagues during that time.',
  'label': -1,
  'idx': 546}]
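
A streamed dataset is never fully in memory, so it can't be shuffled globally, which is why `shuffle()` takes a `buffer_size`. A minimal sketch of the idea in plain Python (the `buffered_shuffle` helper is hypothetical and only approximates what the library does): keep a fixed-size buffer and, each time a new element arrives, emit a randomly chosen buffered element and put the new one in its place. `take` and `skip` then behave like `itertools.islice` over the shuffled stream, assuming the same seed replays the same order on every pass.

```python
import random
from itertools import islice


def buffered_shuffle(stream, buffer_size, seed):
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Emit a random buffered item and replace it with the new one.
        i = rng.randrange(buffer_size)
        yield buffer[i]
        buffer[i] = item
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def shuffled():
    # Re-creating the generator with the same seed replays the same order.
    return buffered_shuffle(range(1000), buffer_size=100, seed=42)


validation = list(islice(shuffled(), 200))   # like .take(200)
train = list(islice(shuffled(), 200, None))  # like .skip(200)
print(len(validation), len(train), set(validation).isdisjoint(train))
```

Because both passes see the same order, the first 200 items and the remainder never overlap, which is what makes the `skip`/`take` pair above a valid train/validation split.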

### Combine multiple datasets
The Datasets library provides an `interleave_datasets()` function that converts a list of `IterableDataset` objects into a single `IterableDataset`, where the elements of the new dataset are obtained by alternating among the source examples.

In [None]:
from itertools import islice
from datasets import interleave_datasets

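# `pubmed_dataset_streamed` and `law_dataset_streamed` are not defined in this
# notebook; they stand for two datasets loaded with `streaming=True`
combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])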
list(islice(combined_dataset, 2))
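
A minimal sketch of what "alternating among the source examples" means, in plain Python. The `alternate` helper and the toy lists are hypothetical; this version stops after the last complete round, and the library's exact stopping behaviour may differ.

```python
from itertools import islice


def alternate(*sources):
    # zip() pulls one item from every source per round and stops
    # as soon as any source has nothing left.
    for round_items in zip(*sources):
        yield from round_items


pubmed_like = ["pubmed-0", "pubmed-1", "pubmed-2"]
law_like = ["law-0", "law-1"]
print(list(islice(alternate(pubmed_like, law_like), 4)))
```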

## Create your own dataset

### Create dataset

In [None]:
issues_dataset = load_dataset("json", data_files="<local_path>", split="train")
issues_dataset

### Uploading dataset to HF hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
issues_dataset.push_to_hub("github-issues")