# Multitask Datasets in HuggingFace `nlp`


Multitask learning is commonly used in NLP research and has been shown to improve performance across many NLP tasks ([1](https://www.aclweb.org/anthology/P19-1441/), [2](https://arxiv.org/abs/1910.10683)).

## Setup
We'll need to import the `nlp` library:

In [1]:
import nlp

## Loading the datasets

In this walkthrough, we use two datasets: `squad` (a question-answering dataset) and `cnn_dailymail` (a summarization dataset). Let's load them:

In [2]:
squad = nlp.load_dataset("squad", split="train")
cnn_dm = nlp.load_dataset("cnn_dailymail", "3.0.0", split="train")

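To combine multiple datasets, they need matching schemas. We use the `map` function to convert each dataset to a shared `source`/`target` format: each `source` is prefixed with the name of its task, and each `target` is a list of strings: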

In [3]:
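# SQuAD: the source is the context plus the question; the target is the list of answer texts.
squad_mapped = squad.map(lambda example: {
    "source": "squad context: " + example["context"] + " question: " + example["question"],
    "target": example["answers"]["text"],
}, remove_columns=squad.column_names)

# CNN/DailyMail: the source is the article; the single summary is wrapped in a list
# so that `target` has the same type as in SQuAD.
cnn_dm_mapped = cnn_dm.map(lambda example: {
    "source": "cnn_dm: " + example["article"],
    "target": [example["highlights"]],
}, remove_columns=cnn_dm.column_names)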

print(squad_mapped)
print(cnn_dm_mapped)

Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 87599)
Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 287113)


Great! Both our tasks have the same columns so we can now combine them.
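
To see what the two mapping functions produce without downloading anything, here is a minimal sketch that applies the same logic to hand-written records. The records and the helper names `squad_to_text` and `cnn_dm_to_text` are made up for illustration; only the field names come from the cells above.

```python
# Hypothetical helpers mirroring the lambdas passed to `map` above.
def squad_to_text(example):
    return {
        "source": "squad context: " + example["context"] + " question: " + example["question"],
        "target": example["answers"]["text"],
    }


def cnn_dm_to_text(example):
    return {
        "source": "cnn_dm: " + example["article"],
        "target": [example["highlights"]],
    }


# Hand-written toy records (not taken from the real datasets).
squad_example = {
    "context": "The sky is blue.",
    "question": "What color is the sky?",
    "answers": {"text": ["blue"], "answer_start": [11]},
}
cnn_dm_example = {"article": "A long news article.", "highlights": "A short summary."}

squad_row = squad_to_text(squad_example)
cnn_dm_row = cnn_dm_to_text(cnn_dm_example)

# Both rows now share the same keys, and `target` is a list of strings in each.
print(sorted(squad_row) == sorted(cnn_dm_row))
print(squad_row["source"])
```

The task prefix in `source` is what lets a single text-to-text model tell the two tasks apart once the rows are mixed.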

## Building `MultiDataset`

The mapped datasets are combined with the `build_multitask` function:

In [4]:
multitask_ds = nlp.build_multitask(squad_mapped, cnn_dm_mapped)

Let's print out a few examples from our combined dataset:

In [5]:
for i in range(10):
    print(multitask_ds[i])

{'source': 'squad context: Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'target': ['Saint Bernadette Soubirous'], 'task_name': 'squad'}
{'source': 'cnn_dm: It\'s official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in 

`MultiDataset` adds a `task_name` key to each example, indicating which task the example came from.
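
Because each example carries its task name, downstream code can group or route examples by task. Here is a minimal sketch on hand-written dicts shaped like the output above. The example values and the `count_by_task` helper are hypothetical, the `"cnn_dailymail"` task name is an assumption (only `'squad'` is visible in the output), and nothing here calls `MultiDataset` itself.

```python
from collections import Counter


def count_by_task(examples):
    """Count how many examples belong to each task (hypothetical helper)."""
    return Counter(example["task_name"] for example in examples)


# Toy examples in the same shape as the printed rows; values are placeholders.
examples = [
    {"source": "squad context: A. question: B?", "target": ["an answer"], "task_name": "squad"},
    {"source": "cnn_dm: An article.", "target": ["a summary"], "task_name": "cnn_dailymail"},
    {"source": "squad context: C. question: D?", "target": ["another answer"], "task_name": "squad"},
]

counts = count_by_task(examples)
print(dict(counts))
```

The same key could be used to compute per-task metrics or to choose a task-specific loss.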

## Available Methods
Many of the `nlp.Dataset` methods and attributes are also available on the combined dataset:

In [6]:
print(multitask_ds.schema)
print("The number of bytes allocated on the drive is ", multitask_ds.nbytes)
print("The number of rows", multitask_ds.num_rows)
print("The number of columns", multitask_ds.num_columns)
print("The column names", multitask_ds.column_names)
print("The shape (rows, columns)", multitask_ds.shape)
print("Cache files", multitask_ds.cache_files)

source: string not null
target: list<item: string> not null
  child 0, item: string
The number of bytes allocated on the drive is  1329025744
The number of rows 374712
The number of columns 2
The column names ['source', 'target']
The shape (rows, columns) (374712, 2)
Cache files [{'filename': '/home/thomas/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-08cf0218d0c18defac9c8647f3c25ca2.arrow'}, {'filename': '/home/thomas/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/cache-224b8b26406e4e05124224bd76d9934a.arrow'}]


## Construction from splits `dict`
Besides single `Dataset` objects, `build_multitask` also accepts `dict`s that map split names to datasets:

In [7]:
squad = nlp.load_dataset("squad")
cnn_dm = nlp.load_dataset("cnn_dailymail", "3.0.0")

squad_mapped = {}
for split in squad.keys():
    squad_mapped[split] = squad[split].map(lambda example: {
        "source": "squad context: " + example["context"] + " question: " + example["question"],
        "target": example["answers"]["text"],
    }, remove_columns=squad[split].column_names)

# Iterate over SQuAD's splits (train, validation) so that both dicts share the same keys;
# cnn_dailymail also has a test split, which SQuAD lacks.
cnn_dm_mapped = {}
for split in squad.keys():
    cnn_dm_mapped[split] = cnn_dm[split].map(lambda example: {
        "source": "cnn_dm: " + example["article"],
        "target": [example["highlights"]],
    }, remove_columns=cnn_dm[split].column_names)

print(squad_mapped)
print(cnn_dm_mapped)

{'train': Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 87599), 'validation': Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 10570)}
{'train': Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 287113), 'validation': Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 13368)}


In [8]:
nlp.build_multitask(squad_mapped, cnn_dm_mapped)

{'train': MultiDataset(tasks: Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 87599), Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 287113)),
 'validation': MultiDataset(tasks: Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 10570), Dataset(schema: {'source': 'string', 'target': 'list<item: string>'}, num_rows: 13368))}
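
SQuAD and CNN/DailyMail do not expose the same splits, which is why the cell above builds both dicts from `squad.keys()`. Here is a minimal sketch of selecting the shared splits explicitly, using plain dicts of lists as stand-ins for datasets. The helper `shared_splits` is hypothetical, and how `build_multitask` treats mismatched split names is not shown in this notebook.

```python
def shared_splits(*split_dicts):
    """Return the split names present in every dict, in sorted order."""
    common = set(split_dicts[0])
    for splits in split_dicts[1:]:
        common &= set(splits)
    return sorted(common)


# Toy stand-ins: split name -> list of rows (not real datasets).
squad_splits = {"train": ["q1", "q2"], "validation": ["q3"]}
cnn_dm_splits = {"train": ["a1"], "validation": ["a2"], "test": ["a3"]}

common = shared_splits(squad_splits, cnn_dm_splits)
print(common)  # ['train', 'validation']

# Pair up the per-split data for the splits that both tasks provide.
paired = {name: (squad_splits[name], cnn_dm_splits[name]) for name in common}
```

Computing the intersection makes the choice of splits explicit instead of relying on one dataset's keys.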