# THE HUGGINGFACE DATASETS LIBRARY

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install faiss-gpu

# What if my dataset isn't on the Hub?

## Working with local and remote datasets

For `csv`,
```python
load_dataset("csv", data_files="my_files.csv")
```
For `text`,
```python
load_dataset("text", data_files="my_files.txt")
```
For `json`,
```python
load_dataset("json", data_files="my_files.jsonl")
```
For `pandas`,
```python
load_dataset("pandas", data_files="my_dataframe.pkl")
```

## Loading a local dataset

We will use the SQuAD-it dataset for demo purpose:

In [None]:
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

In [None]:
# decompress
!gzip -dkv SQuAD_it-*.json.gz

SQuAD_it-test.json.gz:	 87.5% -- created SQuAD_it-test.json
SQuAD_it-train.json.gz:	 82.3% -- created SQuAD_it-train.json


Load a JSON file with the `load_dataset()` function,

In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")

Generating train split: 0 examples [00:00, ? examples/s]

The `squad_it_dataset` is a `DatasetDict` object with a `train` split:

In [None]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

To view one of the examples by indexing:

In [None]:
squad_it_dataset['train'][0]

{'title': 'Terremoto del Sichuan del 2008',
 'paragraphs': [{'context': "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.",
   'qas': [{'answers': [{'answer_start': 29, 'text': '2008'}],
     'id': '56cdca7862d2951400fa6826',
     'question': 'In quale anno si è verificato il terremoto nel Sichuan?'},
    {'answers': [{'answer_start': 232, 'text': '69.197'}],
     'id': '56cdca7862d2951400fa6828',
     'question': 'Quante persone sono state uccise come risultato?'},
    {'answers': [{'answer_start': 29, 'text': '2008'}],
     'id': '56d4f9902ccc5a1400d833c0',
     'question': 'Quale anno ha avuto luogo il terremoto del Sichuan?'},
    {'answers': [{'answer_start': 78, 'text': '8.0 Ms e 7.9 Mw'}],
     'id': '56d4f9902ccc5a1400d833c1',
     'question': 'Che cosa ha

To include both the `train` and `test` split in a single `DatasetDict` object, we can apply `Dataset.map()` function across both splits at once.

In [None]:
data_files = {
    'train': 'SQuAD_it-train.json',
    'test': 'SQuAD_it-test.json',
}

squad_it_dataset = load_dataset('json', data_files=data_files, field='data')
squad_it_dataset

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

The loading scripts in Datasets actually support automatic decompression of the input files, so we could have skipped the use of `gzip` by pointing the `data_files` argument directly to the compressed files:

In [None]:
data_files = {
    'train': 'SQuAD_it-train.json.gz',
    'test': 'SQuAD_it-test.json.gz',
}

squad_it_dataset = load_dataset('json', data_files=data_files, field='data')
squad_it_dataset

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

## Loading a remote dataset

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"

data_files = {
    'train': url + "SQuAD_it-train.json.gz",
    'test': url + "SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset('json', data_files=data_files, field='data')
squad_it_dataset

Downloading data:   0%|          | 0.00/7.73M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

# Time to slice and dice

## Slicing and dicing our data

Download the Drug Review dataset

In [None]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2024-10-14 13:20:20--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drugsCom_raw.zip’

drugsCom_raw.zip        [               <=>  ]  41.00M  4.07MB/s    in 10s     

2024-10-14 13:20:31 (3.99 MB/s) - ‘drugsCom_raw.zip’ saved [42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


Load the `tsc` files using the `load_dataset()` function,

In [None]:
data_files = {
    "train": "drugsComTrain_raw.tsv",
    "test": "drugsComTest_raw.tsv",
}

# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Take a small random sample to get a quick feel for the type of data we work with. `Dataset.select()` expects an iterable of indices.

In [None]:
# randomly shuffle and take the first 1000
drug_sample = drug_dataset["train"].shuffle(seed=101).select(range(1000))

# peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [1598, 158306, 217491],
 'drugName': ['Eluxadoline', 'Amrix', 'Evekeo'],
 'condition': ['Irritable Bowel Syndrome',
  '23</span> users found this comment helpful.',
  'Narcolepsy'],
 'review': ['"Worked great for the first 3-4 months then nothing. Went right back to same misery I have dealt with for 20 years. Doctor even increased dosage  even though people without a gallbladder are not supposed to take the higher dosage. The higher dosage actually made things worse for me and the cramps last even longer than they normally do which is pretty bad so since it was no longer working I stop taking it. It did seem like a miracle drug at first but unfortunately I&#039;m one of the people that it no longer works for after three or four months."',
  '"This drug has improved my quality of life greatly!  The first week I had drop attacks at my desk at work &amp; I was drowsy.  The key is to take this medication @ 4pm &amp; consecutively.\r\nI do experience dry mouth too.  I have 4 

From this sample,
* `Unnamed: 0` looks like an anonymized ID for each patient.
* `condition` includes a mix of uppercase and lowercase labels.
* `review` has varying length and contain a mix of Python line separators (`\r\n`) as well as HTML character codes like `&\#039;`.

To test the patient ID hypothesis for the `Unnamed: 0`, we can use the `Dataset.unique()` function to verify the number of IDs matches the number of rows in each split:

In [None]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

Use the `DatasetDict.rename_column()` function to rename the column across both splits in one go:

In [None]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0",
    new_column_name="patient_id",
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

Next, normalize all the `condition` labels using `Dataset.map()`. We can define a simple function that can be applied across all the rows of each split in `drug_dataset`:

In [None]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

drug_dataset.map(lowercase_condition)

Some of the entries in the `condition` column are `None`. Let's drop these rows using `Dataset.filter()`, which works similarly to `Dataset.map()` and expects a function that receives a single example of the dataset.

Instead of writing an explicit function:

In [None]:
def filter_nones(x):
    return x['condition'] is not None

and then running `drug_dataset.filter(filter_nones)`, we can use *lambda function*:
```python
lambda <arguments> : <expression>
```
so this can be re-written as

In [None]:
drug_dataset = drug_dataset.filter(
    lambda x: x['condition'] is not None
)

Filter:   0%|          | 0/161297 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53766 [00:00<?, ? examples/s]

Then we can normalize our `condition` column:

In [None]:
drug_dataset = drug_dataset.map(lowercase_condition)

# check
drug_dataset['train']['condition'][:3]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

['left ventricular dysfunction', 'adhd', 'birth control']

## Creating new columns

Whenever dealing with reviews, a good practice is to check the number of words in each review.

Define a simple function that counts the number of words in each review:

In [None]:
def compute_review_length(example):
    return {'review_length': len(example['review'].split())}

The `compute_review_length` function returns a dictionary whose key does not corespond to one of the column names in the dataset. This will create a new `review_length` column:

In [None]:
drug_dataset = drug_dataset.map(compute_review_length)
drug_dataset['train'][0]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

Now we can sort this new column with `Dataset.sort()` to see what the extreme values looks like:

In [None]:
drug_dataset['train'].sort('review_length')[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

Let's use the `Dataset.filter()` function to remove reviews that contain fewer than 30 words:

In [None]:
drug_dataset = drug_dataset.filter(
    lambda x: x['review_length'] > 30
)

drug_dataset.num_rows

Filter:   0%|          | 0/160398 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'train': 138514, 'test': 46108}

To handle the presence of HTML character codes, we can use Python's `html` module to unescape these characters:

In [None]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [None]:
durg_dataset = drug_dataset.map(
    lambda x: {'review': html.unescape(x['review'])}
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

## The `map()` method's superpowers

The `Dataset.map()` takes a `batched` argument that, if set to `True`, causes it to send a batch of examples to the map function at once.

In [None]:
new_drug_dataset = drug_dataset.map(
    lambda x: {'review': [html.unescape(o) for o in x['review']]},
    batched=True,
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

When `batched=True`, the function receives a dictionary with the files of the dataset, but each value is now a *list of values*!

Using `Dataset.map()` with `batch=True` will be essential to unlock the speed of the "fast" tokenizers, which can quickly tokenize big lists of texts.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

def tokenize_function(examples):
    return tokenizer(examples['review'], truncation=True)

In [None]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 2min 26s, sys: 1.37 s, total: 2min 27s
Wall time: 2min 30s


`Dataset.map()` also has some parallelization capabilities of its own. To enable multiprocessing, use the `num_proc` argument and specify the number of processes to use in our call to `Dataset.map()`:

In [None]:
slow_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased', use_fast=False)

def slow_tokenize_function(examples):
    return slow_tokenizer(examples['review'], truncation=True)

tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)



Map (num_proc=8):   0%|          | 0/138514 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/46108 [00:00<?, ? examples/s]

In general, we do NOT recommend using Pythong multiprocessing for fast tokenizers with `batched=True`.

With `Dataset.map()` and `batched=True`, we can change the number of elements in our dataset. Here we will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return *all* the chunks of the texts instead of just the first one.

In [None]:
def tokenize_and_split(examples):
    return tokenizer(
        examples['review'],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

In [None]:
# test on one example
result = tokenize_and_split(drug_dataset['train'][0])

[len(inp) for inp in result['input_ids']]

[128, 49]

Our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first one of length 128 and the second one of length 49.

For all elements of the dataset:

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

ArrowInvalid: Column 8 named input_ids expected length 1000 but got length 1514

We can deal with the mismatched length problem by making the old columns the same size as the new ones. To do this, we will need the `overflow_to_sample_mapping` field the tokenizer returns when we set `return_overflowing_tokens=True`. It gives us a mapping fro a new feature index to the index of the sample it originated from. Using this, we can associate each key present in our original dataset with a list of values of the right size by repeating the values of each example as many times as it generates new features:

In [None]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples['review'],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

    # extract mapping between new and old indices
    sample_map = result.pop('overflow_to_sample_mapping')
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]

    return result

We can see it works with `Dataset.map()` without us needing to remove the old columns:

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 212993
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 70952
    })
})

We get the same number of training features as before, but here we have kept all the old fields.

## From Datasets to Dataframes and back

`Dataset.set_format()` function only changes the *output format* of the dataset, so we can easily switch to another format without affecting the underlying *data format*, which is Apache Arrow.

In [None]:
drug_dataset.set_format('pandas')

Now when we access elements of the dataset we get a `pandas.Dataframe` instead of a dictionary:

In [None]:
drug_dataset['train'][:3]

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
1,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134
2,138000,Ortho Evra,birth control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10,89


To create a `pandas.DataFrame` for the whole training set:

In [None]:
train_df = drug_dataset['train'][:]

`Dataset.set_format()` changes the return format for the dataset's `__getitem__()` method. When we want to create a new object like `train_df` from a `Dataset` in the `pandas` format, we need to slice the whole dataset to obtain a `pandas.DataFrame`.

With Pandas Dataframe, we can do fancy chaining to compute the class distribution among the `condition` entries:

In [None]:
frequencies = train_df['condition'].value_counts().to_frame().reset_index().rename(columns={'index': 'condition', 'condition': 'frequency'})
frequencies.head()

Unnamed: 0,frequency,count
0,birth control,27655
1,depression,8023
2,acne,5209
3,anxiety,4991
4,pain,4744


Once we have done with our Pandas analysis, we can always create a new `Dataset` object by using the `Dataset.from_pandas()` function:

In [None]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['frequency', 'count'],
    num_rows: 819
})

## Creating a validation set

In [None]:
# reset the output format from pandas to arrow
drug_dataset.reset_format()

In [None]:
# similar to scikit-learn
drug_dataset_clean = drug_dataset['train'].train_test_split(train_size=0.8, seed=101)

# rename the default test split to validation
drug_dataset_clean['validation'] = drug_dataset_clean.pop('test')

# add the test set to our DatasetDict
drug_dataset_clean['test'] = drug_dataset['test']

drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

## Saving a dataset

For Arrow format,
```python
Dataset.save_to_disk()
```
For CSV format,
```python
Dataset.to_csv()
```
For JSON format,
```python
Dataset.to_json()
```

For example, to save our cleaned dataset in the Arrow format
```python
drug_dataset_clean.save_to_disk('drug-reviews')
```
This will create a directory with the following structure:
```
drug-reviews/
├── dataset_dict.json
├── test
│   ├── dataset.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── indices.arrow
│   └── state.json
└── validation
    ├── dataset.arrow
    ├── dataset_info.json
    ├── indices.arrow
    └── state.json
```
where each split is associated with its own `dataset.arrow` table, and some metadata in `dataset_info.json` and `state.json`. The Arrow table is a fancy table of columns and rows that is optimized for building high-performance applications that process and transport large datasets.

To load the data back,
```python
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk('drug-reviews')
```

For the CSV and JSON formats, we have to store each split as a separate file.
```python
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-review-{split}.jsonl")
```

To load the JSON files back,
```python
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)
```

# Big data? Datasets to the rescue!

## What is the Pile?

The Pile is an English text corpus that was created by EleutherAI for training large-scale language models. It includes a diverse range of datasets, spanning scientific articles, GitHub code repositories, and filtered web text.

As an example, we will take a look at the PubMed Abstracts dataset.

In [None]:
!pip install zstandard

Collecting zstandard
  Downloading zstandard-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Downloading zstandard-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.4/5.4 MB[0m [31m13.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: zstandard
Successfully installed zstandard-0.23.0


In [None]:
# load the dataset remotely
from datasets import load_dataset

data_files = 'https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst'
pubmed_dataset = load_dataset('json', data_files=data_files, split='train')
pubmed_dataset

In [None]:
from datasets import load_dataset, DownloadConfig

data_files = "https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline/resolve/main/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset(
    "json",
    data_files=data_files,
    split="train",
    download_config=DownloadConfig(delete_extracted=True),  # optional argument
)

By default, Datasets will decompress the files needed to load a dataset. If we want to preserve hard drive space, we can pass `DownloadConfig(delete_extracted=True)` to the `

In [None]:
pubmed_dataset

Dataset({
    features: ['meta', 'text'],
    num_rows: 15518009
})

In [None]:
# check the first example
pubmed_dataset[0]

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that i

## The magic of memory mapping

In [None]:
!pip install psutil



The `Process` class allows us to check the memory usage of the current process:

In [None]:
import psutil

# Process.memory_info is expressed in bytes, need to convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024*1024):.4f} MB")

RAM used: 704.5156 MB


The `rss` refers to the *resident set size*, which is the fraction of memory that a process occupies in RAM. This measurement also includes the memory used by the Python interpreter and the libraries we have loaded, so the actual amount of memory used to load the dataset is a bit smaller.

We can also check how large the dataset is on disk:

In [None]:
print(f"Number of files in dataset: {pubmed_dataset.dataset_size}")

size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file): {size_gb:.4f} GB")

Number of files in dataset: 20978892555
Dataset size (cache file): 19.5381 GB


Datasets treats each dataset as a memory-mapped file, which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory.

Memory-mapped files can be shared across multiple processes, which enables methods like `Dataset.map()` to be parallelized without needing to move or copy the dataset.

In [None]:
import timeit

code_snippet = """
batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx: idx+batch_size]
"""

time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())

print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.2f} GB) in"
    f"{time:.2f}s, i.e., {size_gb/time:.4f} GB/s"
)

Iterated over 15518009 examples (about 19.54 GB) in391.70s, i.e., 0.0499 GB/s


Whe used Python's `timeit` module to measure the execution time taken by `code_snippet`.

## Streaming datasets

To enable dataset streaming,

In [None]:
pubmed_dataset_streamed = load_dataset(
    "json",
    data_files=data_files,
    split="train",
    streaming = True, # add this line
    download_config=DownloadConfig(delete_extracted=True),  # optional argument
)

The object returned with `streaming=True` in this case is an `IterableDataset`.

In [None]:
next(iter(pubmed_dataset_streamed))

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that i

The elements from a streamed dataset can be processed on the fly using `IterableDataset.map()`, which is useful during training if we need to tokenize the inputs.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x['text']))

next(iter(tokenized_dataset))

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that i

To speed up tokenization with streaming we can pass `batched=True`.

In [None]:
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x['text']), batched=True)

We can also shuffle a streamed dataset using `IterableDataset.shuffle()`, but this only shuffles the elements in a predefined `buffer_size`:

In [None]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10000, seed=101)
next(iter(shuffled_dataset))

{'meta': {'pmid': 11414262, 'language': 'eng'},
 'text': 'Antihypertensive medication use in Hispanic adults: a comparison with black adults and white adults.\nVariations in awareness, treatment, and control of hypertension among different racial/ethnic groups have been widely reported. It is unclear whether these differences are explained fully by differences in socioeconomic status, insurance coverage, health status, and health behaviors, or whether these differences indicate that racial/ethnic subgroups have unique barriers to hypertension control. Determine whether there are significant differences between racial/ethnic groups in medication use for hypertension after adjusting for potentially confounding variables. Cross-sectional analysis of the 1992 Health and Retirement Study. 2450 non-Hispanic white, 939 non-Hispanic black, and 345 Hispanic participants, ages 51 to 61, reporting a history of hypertension. Self-reported current antihypertensive medication use. We used logistic r

We selected a random example from the first 10,000 examples in the buffer.

To select the frist 5 examples in the PubMed Abstracts dataset,

In [None]:
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that

We can use the `IterableDataset.skip()` function to create training and validation splits from a shuffled dataset:

In [None]:
# skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)

We can also combine multiple dataset together to create a single corpus. The Datasets provides an `interleave_dataset()` function converting a list of `IterableDataset` objects into a single `IterableDataset`, where the elements of the new dataset are obtained by alternating among the source examples.

In [None]:
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, pubmed_dataset_streamed])
list(islice(combined_dataset, 2))

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that

We used the `islice()` function to select the first two examples from the combined dataset, and we can see that they are the same since we basically duplicate the source dataset.

If we want to stream the Pile in its 825GB entirety, we can grab all the prepared files:

In [None]:
base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
next(iter(pile_dataset["train"]))

# Creating our own dataset

## Getting the data

To download all the repository's issues, we need to use the GitHub REST API to poll the Issues endpoint. This endpoint returns a list of JSON objects, with each object containing a large number of fields that include the title and description as well as metadata about the status of the issue and so on.

In [None]:
!pip install requests



In [None]:
import requests

url = "https://api.github.com/repos/huggingface/datasets/issues?page1&per_page=1"
response = requests.get(url)

In [None]:
response.status_code

200

where a `200` status means the request was successful.

What we are interested in is the *payload*, which can be accessed in various formats like bytes, strings, or JSON.

In [None]:
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/7229',
  'repository_url': 'https://api.github.com/repos/huggingface/datasets',
  'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/7229/labels{/name}',
  'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/7229/comments',
  'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/7229/events',
  'html_url': 'https://github.com/huggingface/datasets/pull/7229',
  'id': 2588847398,
  'node_id': 'PR_kwDODunzps5-rgrx',
  'number': 7229,
  'title': 'handle config_name=None in push_to_hub',
  'user': {'login': 'alex-hh',
   'id': 5719745,
   'node_id': 'MDQ6VXNlcjU3MTk3NDU=',
   'avatar_url': 'https://avatars.githubusercontent.com/u/5719745?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/alex-hh',
   'html_url': 'https://github.com/alex-hh',
   'followers_url': 'https://api.github.com/users/alex-hh/followers',
   'following_url': 'https://api.githu

Prepare GitHub personal access token

```python
GITHUB_TOKEN = 'YOUR_GITHUB_TOKEN'
headers = {"Authorization": f"token {GITHUB_TOKEN}"}
```

With the access token, we can create a function that can download all the issues from a GitHub repository:

In [None]:
import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm

In [None]:
def fetch_issue(
        owner='huggingface',
        repo='datasets',
        num_issues=10000,
        rate_limit=5000,
        issues_path=Path("."),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100 # number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    for page in tqdm(range(num_pages)):
        # query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        batch.extend(issues.json())

        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = [] # flush batch for next time period
            print(f"Reached Github rate limit. Sleeping for one hour...")
            time.sleep(60 * 60 + 1)

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)

    print(f"Downloaded all issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl")

In [None]:
fetch_issue(num_issues=5000)

  0%|          | 0/50 [00:00<?, ?it/s]

Downloaded all issues for datasets! Dataset stored at ./datasets-issues.jsonl


Then we can load them locally

In [None]:
issues_dataset = load_dataset(
    'json',
    data_files='datasets-issues.jsonl',
    split='train',
)
issues_dataset

Generating train split: 0 examples [00:00, ? examples/s]

DatasetGenerationError: An error occurred while generating the dataset

In [None]:
# work-around solution
df=pd.read_json("datasets-issues.jsonl", lines=True)
df.head()

from datasets import Dataset
issues_dataset = Dataset.from_pandas(df)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'closed_by', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason'],
    num_rows: 5000
})

## Cleaning up the data

The `pull_request` column can be used to differentiate between issues and pull requests.

Take a random sample to check

In [None]:
sample = issues_dataset.shuffle(seed=101).select(range(3))

Zip the `html_url` and `pull_request` columns to compare the various URLs:

In [None]:
for url, pr in zip(sample['html_url'], sample['pull_request']):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

>> URL: https://github.com/huggingface/datasets/pull/3529
>> Pull request: {'diff_url': 'https://github.com/huggingface/datasets/pull/3529.diff', 'html_url': 'https://github.com/huggingface/datasets/pull/3529', 'merged_at': '2022-01-05T12:50:14Z', 'patch_url': 'https://github.com/huggingface/datasets/pull/3529.patch', 'url': 'https://api.github.com/repos/huggingface/datasets/pulls/3529'}

>> URL: https://github.com/huggingface/datasets/pull/2219
>> Pull request: {'diff_url': 'https://github.com/huggingface/datasets/pull/2219.diff', 'html_url': 'https://github.com/huggingface/datasets/pull/2219', 'merged_at': '2021-04-16T08:50:44Z', 'patch_url': 'https://github.com/huggingface/datasets/pull/2219.patch', 'url': 'https://api.github.com/repos/huggingface/datasets/pulls/2219'}

>> URL: https://github.com/huggingface/datasets/issues/5021
>> Pull request: None



Each pull request is associated with various URLs, while ordinary issues have a `None` entry. Now we can create a new `is_pull_request` column to check whether the `pull_request` field is `None` or not:

In [None]:
issues_dataset = issues_dataset.map(
    lambda x: {'is_pull_request': False if x['pull_request'] is None else True}
)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

## Augmenting the dataset

The comments associated with an issue or pull request provide a rich source of information.

The GitHub REST API provides a Comments endpoint that returns all the comments associated with an issue number.

In [None]:
issue_number = 2792
url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
response = requests.get(url, headers=headers)
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/comments/897594128',
  'html_url': 'https://github.com/huggingface/datasets/pull/2792#issuecomment-897594128',
  'issue_url': 'https://api.github.com/repos/huggingface/datasets/issues/2792',
  'id': 897594128,
  'node_id': 'IC_kwDODunzps41gDMQ',
  'user': {'login': 'bhavitvyamalik',
   'id': 19718818,
   'node_id': 'MDQ6VXNlcjE5NzE4ODE4',
   'avatar_url': 'https://avatars.githubusercontent.com/u/19718818?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/bhavitvyamalik',
   'html_url': 'https://github.com/bhavitvyamalik',
   'followers_url': 'https://api.github.com/users/bhavitvyamalik/followers',
   'following_url': 'https://api.github.com/users/bhavitvyamalik/following{/other_user}',
   'gists_url': 'https://api.github.com/users/bhavitvyamalik/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/bhavitvyamalik/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/

The comment is stored in the `body` field.

In [None]:
def get_comments(issue_number):
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)

    return [r['body'] for r in response.json()]

In [None]:
get_comments(2792)

["@albertvillanova my tests are failing here:\r\n```\r\ndataset_name = 'gooaq'\r\n\r\n    def test_load_dataset(self, dataset_name):\r\n        configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)[:1]\r\n>       self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True, use_local_dummy_data=True)\r\n\r\ntests/test_dataset_common.py:234: \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\ntests/test_dataset_common.py:187: in check_load_dataset\r\n    self.parent.assertTrue(len(dataset[split]) > 0)\r\nE   AssertionError: False is not true\r\n```\r\nWhen I try loading dataset on local machine it works fine. Any suggestions on how can I avoid this error?",
 'Thanks for the help, @albertvillanova! All tests are passing now.']

Now we can use `Dataset.map()` to add a new `comments` column to each issue in our dataset:

In [None]:
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {'comments': get_comments(x['number'])}
)

# Semantic search with FAISS

## Using embeddings for semantic search

Transformer-based language models represent each token in a span of text as an *embedding vector*. These embeddings can be used to find similar documents in the corpus by computing the dot-product similarity between each embedding and returning the documents with the greatest overlap.

## Loading and preparing the dataset

In [None]:
from datasets import load_dataset, Dataset

issues_dataset = load_dataset(
    'lewtun/github-issues',
    split='train',
)
issues_dataset

We specified the default `train` split in `load_dataset()`, so it returns a `Dataset` instead of a `DatasetDict`.

First thing is to filler out the pull requests. We can use the `Dataset.filter()` function to exclude these rows in our dataset. Also, we can filter our rows with no comments, since these provide no answers to user queries:

In [3]:
issues_dataset = issues_dataset.filter(
    lambda x: (x['is_pull_request'] == False and len(x['comments'])>0)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

From a search perspective, the most informative columns are `title`, `body`, and `comments`, while `html_url` provides us with a link back to the source issue. We can use the `Dataset.remove_columns()` to drop the rest:

In [4]:
columns = issues_dataset.column_names
columns_to_keep = ['title', 'body', 'comments', 'html_url']
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
columns_to_remove

{'active_lock_reason',
 'assignee',
 'assignees',
 'author_association',
 'closed_at',
 'comments_url',
 'created_at',
 'events_url',
 'id',
 'is_pull_request',
 'labels',
 'labels_url',
 'locked',
 'milestone',
 'node_id',
 'number',
 'performed_via_github_app',
 'pull_request',
 'repository_url',
 'state',
 'timeline_url',
 'updated_at',
 'url',
 'user'}

In [5]:
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

To create our embeddings we will augment each comment with the issue's title and body, since these fields include useful contextual information. Because our `comments` column is currently a list of comments for each issue, we need to "explode" the column so taht each row consists an `(html_url, title, body, comment)` tuple. We can achieve this with the `DataFrame.explode()` function once we switch to the Pandas `DataFrame` format:

In [6]:
issues_dataset.set_format('pandas')
df = issues_dataset[:]

In [7]:
df['comments'][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

When we explode `df`, we expect to get one row for each of these comments:

In [8]:
comments_df = df.explode('comments', ignore_index=True)
comments_df.head(5)

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"Cool, I think we can do both :)",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Protect master branch,@lhoestq now the 2 are implemented.\r\n\r\nPle...,After accidental merge commit (91c55355b634d0d...
2,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Hi ! I guess the caching mechanism should have...,## Describe the bug\r\nAfter upgrading to data...
3,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,"If it's easy enough to implement, then yes ple...",## Describe the bug\r\nAfter upgrading to data...
4,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Well it can cause issue with anyone that updat...,## Describe the bug\r\nAfter upgrading to data...


Now the rows have been replicated, with the `comments` column containing the individual comments. We can switch back to a `Dataset`:

In [9]:
comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

Now that we have one comment per row, we can create a new `comments_length` column that contains the number of words per comment:

In [10]:
comments_dataset = comments_dataset.map(
    lambda x: {'comments_length': len(x['comments'].split())}
)

Map:   0%|          | 0/2964 [00:00<?, ? examples/s]

We can use this new column to filter out short comments. Let's try 15 words

In [11]:
comments_dataset = comments_dataset.filter(
    lambda x: x['comments_length'] > 15
)
comments_dataset

Filter:   0%|          | 0/2964 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comments_length'],
    num_rows: 2175
})

Now let's concatenate the issue title, description, and comments together in a new `text` column. As usual, we need a function to pass to `Dataset.map()`:

In [12]:
def concatenate_text(examples):
    return {
        'text': examples['title']
        + ' \n '
        + examples['body']
        + ' \n '
        + examples['comments']
    }

comments_dataset = comments_dataset.map(concatenate_text)

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

## Creating text embeddings

The `sentence-transformers` library is dedicated to creating embeddings. Our use case is an example of *asymmetric semantic search* because we have a short query whose answer we would like to find in a longer document, like an issue comment.

In [13]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)



To speed up the embedding process, use a GPU

In [14]:
import torch

device = torch.device('cuda')
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

We would like to represent each entry in our GitHub issues corpus as a single vector, so we need to "pool" or average our token embeddings in some way. One popular approach is to perform *CLS pooling* on our model's outputs, where we simply collect the last hidden state for the special `[CLS]` token.

In [15]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

Then we create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs:

In [16]:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors='pt'
    )
    encoded_input = {k: v.to(device) for k,v in encoded_input.items()}

    model_output = model(**encoded_input)

    res = cls_pooling(model_output)

    return res

In [17]:
embedding = get_embeddings(comments_dataset['text'][0])
embedding.shape

torch.Size([1, 768])

We have converted the first entry in our corpus into a 768-dimensional vector! We can now use `Dataset.map()` to apply our `get_embeddings()` function to each row in our corpus to create a new `embeddings` column:

In [18]:
embeddings_dataset = comments_dataset.map(
    lambda x: {'embeddings': get_embeddings(x['text']).detach().cpu().numpy()[0]}
)

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

Notice that we have converted the embeddings to NumPy arrays because Datasets requires this format when we try to index them with FAISS.

## Using FAISS for efficient similarity search

FAISS (Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an *index* that allows one to find which embeddings are similar to an input embeding.

We use the `Dataset.add_faiss_index()` function and specify which column of our dataset we would like to index:

In [20]:
embeddings_dataset.add_faiss_index(column='embeddings')

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comments_length', 'text', 'embeddings'],
    num_rows: 2175
})

We can now perform queries on this index by doing a nearst neighbor lookup with the `Dataset.get_nearest_examples()` function.

Check the first embedding a question:

In [21]:
question = "How can I load a dataset offline?"

question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings:

In [22]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings",
    query=question_embedding,
    k=5, # number of examples to retrieve
)

The `Dataset.get_nearest_examples()` function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples (here, the 5 best matches). We can collect the results in a `pandas.DataFrame` so we can easily sort them:

In [23]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df['scores'] = scores

# sort the result
samples_df.sort_values('scores', ascending=False, inplace=True)

In [24]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("="*50)
    print()

COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505016326904297
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
```python
datasets = load_dataset('text', data_files=data_files)
```

We'll do a new release soon
SCORE: 24.555538177490234
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's n