In [1]:
%%capture
!pip install datasets transformers[sentencepiece]

# Big data? 🤗 Datasets to the rescue!

In [2]:
%%capture
%pip install zstandard

In [3]:
from datasets import load_dataset

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

pubmed_dataset = load_dataset("json", data_files=data_files, split='train')
pubmed_dataset

Using custom data configuration default-dfa10a5f6311b3da


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-dfa10a5f6311b3da/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426...


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/6.90G [00:00<?, ?B/s]

KeyboardInterrupt: ignored

## The magic of memory mapping

- Measure memory usage in Python with `psutil`, which reports the resident set size (RSS) of the current process.

In [None]:
%pip install psutil -qqq

In [None]:
import psutil

In [None]:
# RSS (resident set size) of the current process, converted from bytes to MB
psutil.Process().memory_info().rss / (1024 * 1024)

In [None]:
pubmed_dataset.dataset_size

In [None]:
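# Convert the dataset size from bytes to GB and display it
size_gb = pubmed_dataset.dataset_size / (1024 ** 3)
print(f"Dataset size: {size_gb:.2f} GB")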

### Try it out

✏️ Try it out! Pick one of the subsets from the Pile that is larger than your laptop or desktop’s RAM, load it with 🤗 Datasets, and measure the amount of RAM used. Note that to get an accurate measurement, you’ll want to do this in a new process. You can find the decompressed sizes of each subset in Table 1 of the Pile paper.

If you’re familiar with Pandas, this result might come as a surprise because of Wes McKinney’s famous rule of thumb that you typically need 5 to 10 times as much RAM as the size of your dataset. So how does 🤗 Datasets solve this memory management problem?

🤗 Datasets treats each dataset as a [memory-mapped file](https://en.wikipedia.org/wiki/Memory-mapped_file), which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory.

Memory-mapped files can also be shared across multiple processes, which enables methods like `Dataset.map()` to be parallelized without needing to move or copy the dataset. (For more details about Apache Arrow and comparisons to Pandas, check out [Dejan Simic’s blog post](https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a).)
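To make the idea of memory mapping concrete, here is a minimal sketch that uses Python's built-in `mmap` module as a stand-in. It is not how 🤗 Datasets is implemented (the library works with Apache Arrow files); the helper `read_slice_via_mmap` and the demo file are hypothetical and only illustrate reading a slice of a file without loading all of it.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, start, length):
    """Return `length` bytes starting at offset `start` using a memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are only loaded when touched
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[start : start + length]


# Build a small demo file (hypothetical data, just for illustration)
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"0123456789" * 1000)
    demo_path = tmp.name

chunk = read_slice_via_mmap(demo_path, 5000, 10)
print(chunk)

# Clean up the demo file
os.remove(demo_path)
```

The same principle lets a library slice into a file far larger than RAM, because the operating system only pages in the parts that are actually accessed.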

In [None]:
%%timeit
# Speed test: iterate over the PubMed dataset in batches
batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx : idx + batch_size]
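Outside a notebook, the same speed test can be run with the standard `timeit` module. The sketch below is only an illustration: `iterate_in_batches` is a hypothetical helper and `toy_dataset` is a small in-memory list standing in for the real dataset, so the timings say nothing about PubMed itself.

```python
import timeit


def iterate_in_batches(dataset, batch_size=1000):
    """Slice through `dataset` batch by batch and return the number of batches."""
    n_batches = 0
    for idx in range(0, len(dataset), batch_size):
        _ = dataset[idx : idx + batch_size]
        n_batches += 1
    return n_batches


# Hypothetical stand-in: anything that supports len() and slicing works here
toy_dataset = list(range(10_500))

elapsed = timeit.timeit(lambda: iterate_in_batches(toy_dataset), number=10)
print(f"10 passes over {len(toy_dataset)} rows took {elapsed:.4f} s")
```

Since a `Dataset` also supports `len()` and slicing, `pubmed_dataset` could be passed to the helper in place of the toy list.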


## Streaming datasets

In [5]:
from datasets import load_dataset
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

pubmed_dataset_streamed = load_dataset('json', data_files=data_files, split='train', streaming=True)

Using custom data configuration default-dfa10a5f6311b3da


In [7]:
pubmed_dataset_streamed

<datasets.iterable_dataset.IterableDataset at 0x7ff3634a09d0>

In [6]:
# access the first element 

next(iter(pubmed_dataset_streamed))

{'meta': {'language': 'eng', 'pmid': 11409574},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that i

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x['text']), batched=True)
print(next(iter(tokenized_dataset)))

Token indices sequence length is longer than the specified maximum sequence length for this model (561 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': [101, 4958, 5178, 4328, 6779, 1997, 1044, 22571, 11636, 6679, 10092, 1999, 2336, 2007, 11325, 2896, 16464, 8985, 1012, 2000, 5646, 1996, 20272, 1997, 1044, 22571, 11636, 6679, 10092, 1999, 2336, 4793, 2104, 1019, 2086, 6114, 11325, 2896, 16464, 15245, 1006, 2632, 3089, 1007, 1010, 1996, 3891, 5876, 2005, 1044, 22571, 11636, 6679, 10092, 1999, 2336, 2104, 1019, 2086, 1997, 2287, 2007, 2632, 3089, 1010, 1998, 1996, 2523, 1997, 1044, 22571, 11636, 6679, 10092, 2007, 2019, 3445, 3891, 1997, 5996, 1999, 2336, 1997, 1996, 2168, 2287, 1012, 11778, 3319, 1997, 1996, 2405, 3906, 1012, 2041, 1011, 5776, 17865, 1010, 5057, 7640, 1998, 2902, 6648, 11682, 1999, 2603, 2740, 8941, 2013, 2184, 3032, 1012, 2522, 27794, 2913, 7316, 1996, 6075, 1997, 1044, 22571, 11636, 6679, 10092, 1999, 2336, 2104, 1019, 2086, 1997, 2287, 2007, 2632, 3089, 1010, 1998, 1996, 2523, 2090, 1044, 22571, 11636, 6679, 10092, 1998, 1996, 3891, 1997, 5996, 1012, 20272, 1997, 1044, 22571, 11636, 6679, 10092, 7594, 

💡 To speed up tokenization with streaming you can pass `batched=True`, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the `batch_size` argument.
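As a rough mental model of batched processing over a stream, the sketch below groups examples into column-wise batches (a dict of lists) before calling the function. It is an illustration under that assumption, not the library's implementation; `batched_map` and `add_length` are hypothetical names.

```python
from itertools import islice


def batched_map(examples, function, batch_size=1000):
    """Lazily apply `function` to column-wise batches of dict examples."""
    iterator = iter(examples)
    while True:
        rows = list(islice(iterator, batch_size))
        if not rows:
            return
        # Turn a list of rows into a dict of columns: {"text": [...], ...}
        batch = {key: [row[key] for row in rows] for key in rows[0]}
        # Columns returned by the function are added to the existing ones
        merged = {**batch, **function(batch)}
        # Yield the processed examples one at a time again
        for i in range(len(rows)):
            yield {key: values[i] for key, values in merged.items()}


def add_length(batch):
    """Toy batched function: compute the length of each text in the batch."""
    return {"n_chars": [len(text) for text in batch["text"]]}


stream = ({"text": "x" * i} for i in range(1, 6))
processed = list(batched_map(stream, add_length, batch_size=2))
print(processed[0])
```

The function receives lists of values rather than single values, which is why a tokenizer that accepts a list of strings can process a whole batch in one call.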

In [10]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

{'meta': {'language': 'eng', 'pmid': 11410799},
 'text': 'Randomized study of dose or schedule modification of granulocyte colony-stimulating factor in platinum-based chemotherapy for elderly patients with lung cancer.\nIt is generally believed that elderly patients are less able to tolerate aggressive cancer chemotherapy than their younger counterparts. Bone marrow cellularity diminishes with age and elderly patients may have decreased tolerance to myelosuppressive agents. Between November 1995 and October 1999, 68 chemotherapy-naive elderly (70 or more years old) patients with histologically or cytologically proven lung cancer who were to receive platinum-based chemotherapy were enrolled in this study. All patients had adequate cardiac, hematological, liver and renal function to receive chemotherapy. Patients were randomized into 3 groups. Patients in groups 1 and 2 received 2 microg/kg and 4 microg/kg granulocyte colony-stimulating factor (G-CSF, lenograstim), respectively, when gra
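Because a stream cannot be shuffled all at once, `shuffle()` relies on a buffer of `buffer_size` examples. The sketch below shows one common way such a buffer-based shuffle can work, assuming the buffer is filled first and then sampled at random as new examples arrive. `buffer_shuffle` is a hypothetical helper and may differ in detail from what the library does.

```python
import random


def buffer_shuffle(examples, buffer_size, seed=None):
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything
            buffer.append(example)
            continue
        # Yield a random buffered example and replace it with the new one
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = example
    # The stream is exhausted: emit what is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(100), buffer_size=10, seed=42))
print(shuffled[:5])
```

With this scheme the first example returned always comes from the first `buffer_size` elements of the stream, so a larger buffer gives a more thorough shuffle at the cost of more memory.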

In [11]:
# Select elements from a streamed dataset with IterableDataset.take(), which acts in a similar way to Dataset.select()
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'meta': {'language': 'eng', 'pmid': 11409574},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that

In [12]:
# Use IterableDataset.skip() to skip the first 1000 examples and keep the rest as the training set
train_dataset = shuffled_dataset.skip(1000)

# Take the first 1000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
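The split works because both calls see the stream in the same order: `take(n)` keeps the first `n` examples and `skip(n)` keeps everything after them. A minimal sketch of that logic with plain generators follows; the `take`, `skip`, and `make_stream` helpers are hypothetical stand-ins, not the library's methods.

```python
from itertools import islice


def take(examples, n):
    """Yield only the first `n` examples of a stream."""
    return islice(examples, n)


def skip(examples, n):
    """Yield everything after the first `n` examples of a stream."""
    return islice(examples, n, None)


def make_stream():
    """Toy stream that always produces examples in the same order."""
    return ({"id": i} for i in range(10))


validation = list(take(make_stream(), 3))
train = list(skip(make_stream(), 3))

print([ex["id"] for ex in validation])
print([ex["id"] for ex in train])
```

If the two passes saw the examples in different orders, the splits could overlap, which is why a fixed `seed` for the shuffle matters here.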

A common application is combining multiple datasets together to create a single corpus.

Here we combine the examples from the FreeLaw and PubMed Abstracts datasets with the `interleave_datasets()` function.

In [13]:
# Stream the FreeLaw subset of the Pile, a 51 GB dataset of legal opinions from US courts
law_dataset_streamed = load_dataset("json", data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
             split="train", streaming=True)
next(iter(law_dataset_streamed))

Using custom data configuration default-59bf555acf99b763


{'meta': {'case_ID': '110921.json',
  'case_jurisdiction': 'scotus.tar.gz',
  'date_created': '2010-04-28T17:12:49Z'},
 'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General.\nRobert Gilbert Johnston argued the cause for respondent. With him on the brief was Clayton C. Ikei.[*]\n*240 JUSTICE BLACKMUN delivered the opinion of the Court.\nThe issue in this case is whether the transfer of a prisoner from a state prison in Hawaii to one in California implicates a liberty interest within the meaning of the Due Process Clause of the Fourteenth Amendment.\n\nI\n\nA\nRespondent Delbert Kaahanui Wakinekona is serving a sentence of life imprisonment withou

In [14]:
from itertools import islice
from datasets import interleave_datasets

In [15]:
combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])

`islice()` returns an iterator that yields selected values from an iterable. If `start` is specified, all preceding elements are skipped; otherwise `start` defaults to zero. `step` defaults to one; if set to another value, it determines how many values are skipped between successive calls. It works like `slice()` on a list but returns an iterator.

In [18]:
list(islice(combined_dataset, 2))

[{'meta': {'language': 'eng', 'pmid': 11409574},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that

Here we used the `islice()` function from Python’s `itertools` module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.
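The alternating behaviour can be pictured as a simple round-robin over the source streams. The sketch below is a toy version with plain lists, assuming the combined stream stops as soon as one source runs out; `interleave` is a hypothetical helper, and the stopping rule in `interleave_datasets()` may differ in detail.

```python
from itertools import islice


def interleave(*streams):
    """Alternate between streams, stopping when any one of them is exhausted."""
    iterators = [iter(stream) for stream in streams]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return


# Toy stand-ins for the two streamed datasets
pubmed = ["pubmed-0", "pubmed-1", "pubmed-2"]
law = ["law-0", "law-1", "law-2"]

combined = interleave(pubmed, law)
print(list(islice(combined, 2)))  # ['pubmed-0', 'law-0']
```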

In [22]:
# Stream the Pile in its 825 GB entirety
base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}

pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
list(islice(pile_dataset['train'], 1))[0]

Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

Using custom data configuration default-582c1c2fa9625e3e


{'meta': {'pile_set_name': 'Pile-CC'},
 'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on the web works, but you have to simulate multi-touch for table moving and that can be a bit confusing.\n\nThere’s a lot I’d like to talk about. I’ll go through every topic, insted of making the typical what went right/wrong list.\n\nConcept\n\nWorking over the theme was probably one of the hardest tasks I had to face.\n\nOriginally, I had an idea of what kind of game I wanted to develop, gameplay wise – something with lots of enemies/actors, simple graphics, maybe set in space, controlled from a top-down view. I was confident I could fit any theme around it.\n\nIn the end, the problem with a theme like “Evolution” in a game is that evolution is unassisted. It happens through several seemingly random mutations over time, with the most apt permutation surviving. This genetic car simulator is, in my opinion, a great example of actual evo

### Try it out

✏️ Try it out! Use one of the large Common Crawl corpora like mc4 or oscar to create a streaming multilingual dataset that represents the spoken proportions of languages in a country of your choice. For example, the four national languages in Switzerland are German, French, Italian, and Romansh, so you could try creating a Swiss corpus by sampling the Oscar subsets according to their spoken proportion.
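As a starting point for this exercise, note that `interleave_datasets()` accepts `probabilities` and `seed` arguments for sampling from the sources with given weights. The sketch below illustrates the sampling idea with toy generators. The helper `interleave_with_probabilities` is hypothetical, and the weights are made-up placeholders, not real language statistics for Switzerland.

```python
import random
from collections import Counter
from itertools import count, islice


def interleave_with_probabilities(streams, probabilities, seed=None):
    """Draw each next example from a stream chosen at random by weight."""
    rng = random.Random(seed)
    iterators = [iter(stream) for stream in streams]
    indices = list(range(len(iterators)))
    while True:
        (idx,) = rng.choices(indices, weights=probabilities, k=1)
        try:
            yield next(iterators[idx])
        except StopIteration:
            # Stop as soon as the chosen source has run out
            return


def toy_stream(lang):
    """Endless toy stream standing in for one language subset."""
    return ({"lang": lang, "i": i} for i in count())


languages = ["de", "fr", "it", "rm"]
weights = [0.6, 0.2, 0.1, 0.1]  # placeholder weights, NOT real statistics

mixed = interleave_with_probabilities(
    [toy_stream(lang) for lang in languages], weights, seed=42
)
sample = list(islice(mixed, 1000))
print(Counter(example["lang"] for example in sample))
```

With real data you would replace the toy streams with streamed subsets and choose the weights from a source you trust.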