### Time to slice and dice

First we need to download and extract the data, which can be done with the wget and unzip commands:
```bash
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip
```

In [None]:
from datasets import load_dataset

data_files = {"train":"drugComTrain_raw.tsv" , "test":"drugComTest_raw.tsv"}
drug_dataset = load_dataset("csv" , data_files=data_files , delimiter="\t")

In 🤗 Datasets, we can create a random sample by chaining the Dataset.shuffle() and Dataset.select() functions together:

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))

drug_sample[:3]

From this sample we can already see a few quirks in our dataset:

1. The Unnamed: 0 column looks suspiciously like an anonymized ID for each patient.
2. The condition column includes a mix of uppercase and lowercase labels.
3. The reviews are of varying length and contain a mix of Python line separators (`\r\n`) as well as HTML character codes like `&#039;`.

In [None]:
# To test the patient ID hypothesis for the Unnamed: 0 column, we can use the Dataset.unique() function to verify that the number of IDs matches the number of rows in each split:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

In [None]:
# renaming of the column
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

In [None]:
# lowercasing of the condition column
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

def filter_nones(x):
    return x["condition"] is not None

# Drop rows whose condition is None first (and keep the result);
# calling .lower() on None would raise an AttributeError
drug_dataset = drug_dataset.filter(filter_nones)

In [None]:
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

In [None]:
# creating new columns
def compute_review_length(example):
    return {"review_length":len(example["review"].split())}

drug_dataset = drug_dataset.map(compute_review_length)
drug_dataset["train"][0]

In [None]:
drug_dataset["train"].sort("review_length")[:3]

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"]>30)

In [None]:
import html
text = "I&#039;m a transformer called BERT"
html.unescape(text)

In [4]:
a = [{1:"one"}, {2:"two"}, {3:"three"}, {4:"four"}, {5:"five"}, {6:"six"}]

a = list(map(lambda x: {"review": x}, a))

print(a)

[{'review': {1: 'one'}}, {'review': {2: 'two'}}, {'review': {3: 'three'}}, {'review': {4: 'four'}}, {'review': {5: 'five'}}, {'review': {6: 'six'}}]


In [None]:
drug_dataset = drug_dataset.map(lambda x: {"review":html.unescape(x["review"])})

The Dataset.map() method takes a batched argument that, if set to True, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000).

List comprehensions are usually faster than executing the same code in a for loop, so a batched function can process all the examples in a batch with a single comprehension.

In [None]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review" : [html.unescape(o) for o in x["review"]]}, batched=True
)
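
To see what a batched function works with, here is a minimal sketch in plain Python. It assumes that with batched=True the function receives a dict mapping each column name to a list of values; `toy_batch` and `unescape_reviews` are made-up names for illustration and are not part of 🤗 Datasets.

```python
import html


def unescape_reviews(batch):
    # `batch` maps column names to lists, so we clean every review in one comprehension
    return {"review": [html.unescape(text) for text in batch["review"]]}


# Hand-made batch standing in for what a batched map call would pass in (hypothetical data)
toy_batch = {"review": ["I&#039;m a transformer", "It&#039;s &quot;great&quot;"]}

cleaned = unescape_reviews(toy_batch)
print(cleaned["review"])
```

The returned dict has one list per column, with the same number of entries as the incoming batch.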

<img src="image-3.png" alt="Image 3">

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Using num_proc to speed up your processing is usually a great idea, as long as the function you are using is not already doing some kind of multiprocessing of its own.

<img src="image-4.png" alt="Image 4">

In [None]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

In [None]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

result = tokenize_and_split(drug_dataset["train"][0])
print([len(inp) for inp in result["input_ids"]])

# tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True) # gives error

The commented-out call above fails because of how **`Dataset.map()` interacts with tokenization and overflowing tokens**. Let’s go step by step through *why things broke* and *how both fixes solve it*:

### 🔹 Problem Recap
* You tokenize reviews with `return_overflowing_tokens=True`.
* That means **one input row → possibly many tokenized chunks** (e.g., `128` tokens, then `49` tokens).
* When you do this on the whole dataset with `batched=True`, the tokenizer may output **more rows than you input**.
* Hugging Face datasets expect all columns (`condition`, `drugName`, `review`, etc.) to stay the **same length**.
* But after tokenization, your new `input_ids` column has more rows than the old `condition` column → 💥 ArrowInvalid error.
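
The row-count mismatch can be reproduced without a real tokenizer. The sketch below uses a toy chunker, `chunk_with_overflow` (a hypothetical helper, not the tokenizer's actual algorithm: it ignores special tokens and works on fake integer ids), to show how two input rows can turn into three output rows.

```python
def chunk_with_overflow(token_lists, max_length):
    """Toy stand-in for a tokenizer called with return_overflowing_tokens=True."""
    input_ids, sample_map = [], []
    for sample_idx, tokens in enumerate(token_lists):
        # Split each sample into consecutive chunks of at most max_length tokens
        for start in range(0, len(tokens), max_length):
            input_ids.append(tokens[start:start + max_length])
            sample_map.append(sample_idx)  # remember which row the chunk came from
    return {"input_ids": input_ids, "overflow_to_sample_mapping": sample_map}


# Two fake "reviews": one with 177 tokens and one with 50 tokens
batch = {"condition": ["acne", "pain"], "tokens": [list(range(177)), list(range(50))]}
out = chunk_with_overflow(batch["tokens"], max_length=128)

chunk_lengths = [len(ids) for ids in out["input_ids"]]
print(chunk_lengths)
print(len(batch["condition"]), "rows in,", len(out["input_ids"]), "rows out")
```

The `condition` column still has two entries while `input_ids` has three, which is the kind of length mismatch that triggers the error.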

### ✅ Solution 1: Drop old columns
* Here you **throw away the original dataset columns** (`condition`, `drugName`, `review`, etc.).
* You keep only the tokenizer output (`input_ids`, `attention_mask`, etc.).
* This is fine if you **don’t need the original text/metadata anymore**.

In [None]:
# solution 1
tokenized_dataset = drug_dataset.map(
    tokenize_and_split,
    batched=True,
    remove_columns=drug_dataset["train"].column_names
)

### ✅ Solution 2: Keep old columns with `overflow_to_sample_mapping`

* The tokenizer returns an extra field `overflow_to_sample_mapping`.
  Example: `[0, 0, 1, 2, 2, 2, 3]` means the 1st and 2nd tokenized chunks came from sample `0`, one chunk came from sample `1`, three chunks came from sample `2`, and one chunk came from sample `3`.
* You **use that mapping to duplicate the metadata** (`condition`, `drugName`, etc.) so they match the new tokenized chunks.
* Now all columns have equal length → ✅ no Arrow error.
* Best if you want to **retain original fields** for analysis, grouping, or error-checking.
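
As a worked example of the duplication step, the sketch below applies the example mapping `[0, 0, 1, 2, 2, 2, 3]` to a small batch. The column values and the helper name `align_columns` are invented for illustration.

```python
def align_columns(examples, sample_map):
    # Repeat each original value once per chunk that came from its row
    return {key: [values[i] for i in sample_map] for key, values in examples.items()}


# Four made-up rows of metadata
examples = {"drugName": ["A", "B", "C", "D"], "rating": [9, 4, 7, 10]}
sample_map = [0, 0, 1, 2, 2, 2, 3]

aligned = align_columns(examples, sample_map)
print(aligned["drugName"])
print(aligned["rating"])
```

Every column now has seven entries, one per tokenized chunk, so it lines up with the tokenizer output.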

In [None]:
# solution 2
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Mapping: new_index -> old_index
    sample_map = result.pop("overflow_to_sample_mapping")
    
    # Repeat old data to align with new chunks
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

To enable the conversion between various third-party libraries, 🤗 Datasets provides a *Dataset.set_format()* function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let’s convert our dataset to Pandas:

In [None]:
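drug_dataset.set_format("pandas")  # changes the output format in place and returns None
train_df = drug_dataset["train"][:]  # slicing the split now gives a pandas.DataFrame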

In [None]:
drug_dataset["train"][:3]

In [None]:
frequencies = (
    train_df["condition"]
    .value_counts()
    .to_frame()
    .reset_index()
    .rename(columns={"index": "condition", "count": "frequency"})
)
frequencies.head()


In [None]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

In [None]:
drug_dataset.reset_format()

In [None]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

Datasets provides three main functions to save your dataset in different formats:

<img src="image-5.png" alt="image 5">

In [None]:
drug_dataset_clean.save_to_disk("drug-reviews")

In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

In [None]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")
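
For readers unfamiliar with the `.jsonl` suffix, here is a small standard-library sketch of the JSON Lines layout, assuming the exported files follow it (one JSON object per line). The records and the file name are made up.

```python
import json
import tempfile
from pathlib import Path

# Made-up records standing in for dataset rows
records = [{"drugName": "drug-a", "rating": 9}, {"drugName": "drug-b", "rating": 4}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy-reviews.jsonl"

    # Write one JSON object per line
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back line by line
    with path.open(encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f]

print(reloaded == records)
```

Because each line is an independent JSON object, such files can be read incrementally instead of being parsed in one go.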

## Big data? 🤗 Datasets to the rescue!

One of the **killer features** of 🤗 Datasets is:

👉 **Memory mapping (zero-copy datasets) + streaming**

This is why you can load something as huge as **The Pile (825 GB!)** on a normal laptop without instantly running out of RAM.

---

## 🔹 How it works

### 1. **Memory mapping (Arrow format)**

* When you `load_dataset()`, 🤗 Datasets stores the data in **Apache Arrow** format on disk.
* Instead of reading the whole dataset into RAM, it creates a **memory-mapped file**.
* This means:

  * Data stays on disk.
  * Only the portion you actually access (`dataset[0]`, a column, or a batch) gets read into RAM.
  * Huge datasets behave like small ones.

✅ That’s why your `pubmed_dataset` (\~20 GB) only used \~5.6 GB of RAM.
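
The idea of memory mapping can be tried out with Python's built-in `mmap` module. This is only an illustration of the operating-system mechanism on a throwaway 1 MB file; it is not how 🤗 Datasets reads Arrow files internally.

```python
import mmap
import os
import tempfile

# Create a 1 MB temporary file whose byte at position i equals i % 256
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 4096)
    path = f.name

with open(path, "rb") as f:
    # Map the whole file; pages are only read from disk when they are accessed
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        total_size = len(mm)
        window = mm[1000:1010]  # touch just ten bytes in the middle of the file

os.remove(path)
print(total_size, list(window))
```

Slicing the mapped object behaves like slicing a bytes object, yet only the accessed region needs to be brought into memory.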

### 2. **Streaming**

For datasets that are **larger than your disk space** (like the 825 GB Pile), you can also use `streaming=True`:

* The data is **never fully downloaded** or **fully loaded in RAM**.
* Perfect for web-scale corpora (like The Pile, Common Crawl, LAION, etc.).

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/github.jsonl.zst",
    split="train",
    streaming=True
)

# With streaming=True, load_dataset() returns an IterableDataset,
# which you can loop over like a generator:

for sample in dataset.take(5):
    print(sample)

---

## 🔹 Comparing the three modes

| Mode                        | Storage                         | RAM usage                    | Access pattern           | Example use case              |
| --------------------------- | ------------------------------- | ---------------------------- | ------------------------ | ----------------------------- |
| **In-memory**               | Entire dataset in RAM           | High (equals dataset size)   | Random access, very fast | Small datasets (<1–2 GB)      |
| **Memory-mapped (default)** | Stored in Arrow cache on disk   | Small (only accessed chunks) | Random access (fast)     | Medium datasets (10 GB–1 TB)  |
| **Streaming**               | Remote files (HTTP, S3, HF Hub) | Tiny (just current batch)    | Sequential only          | Very large datasets (100 GB+) |

---

## 🔹 Example: Measuring RAM vs disk size

In [None]:
import psutil

# Assumes `pubmed_dataset` has already been loaded with load_dataset() (without streaming)
# memory_info().rss is the resident set size of this process in bytes, converted to MB here
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

print(f"Dataset size in bytes: {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

✅ Result: the dataset is \~20 GB, but RAM usage is only \~5 GB.
If you use `streaming=True`, RAM usage will drop even further (basically near-constant).

---

⚡ **Takeaway**: Hugging Face Datasets gives you Big Data handling out of the box:

* You can work with corpora much larger than your RAM (memory mapping).
* You can even work with datasets larger than your disk (streaming).

In [None]:
# let’s run a little speed test by iterating over all the elements in the PubMed Abstracts dataset:

import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

# Use `elapsed` rather than `time` so the standard `time` module name is not shadowed
elapsed = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{elapsed:.1f}s, i.e. {size_gb/elapsed:.3f} GB/s"
)

Here we’ve used Python’s timeit module to measure the execution time taken by code_snippet. You’ll typically be able to iterate over a dataset at a speed of a few tenths of a GB/s to several GB/s. This works great for the vast majority of applications, but sometimes you’ll have to work with a dataset that is too large to even store on your laptop’s hard drive.

For example, if we tried to download the Pile in its entirety, we’d need 825 GB of free disk space! To handle these cases, 🤗 Datasets provides a streaming feature that allows us to download and access elements on the fly, without needing to download the whole dataset. Let’s take a look at how this works.

In [None]:
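# Note: `data_files` must point at the PubMed Abstracts file here,
# not at the drug-review TSV files assigned to the same name earlier
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)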

Instead of the familiar Dataset that we’ve encountered elsewhere in this chapter, the object returned with streaming=True is an IterableDataset. As the name suggests, to access the elements of an IterableDataset we need to iterate over it. We can access the first element of our streamed dataset as follows:

In [None]:
next(iter(pubmed_dataset_streamed))

The elements from a streamed dataset can be processed on the fly using IterableDataset.map(), which is useful during training if you need to tokenize the inputs.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))

You can also shuffle a streamed dataset using IterableDataset.shuffle(), but unlike Dataset.shuffle() this only shuffles the elements in a predefined buffer_size:

In [None]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

In this example, we selected a random example from the first 10,000 examples in the buffer. Once an example is accessed, its spot in the buffer is filled with the next example in the corpus (i.e., the 10,001st example in the case above).

You can also select elements from a streamed dataset using the IterableDataset.take() and IterableDataset.skip() functions, which act in a similar way to Dataset.select(). For example, to select the first 5 examples in the PubMed Abstracts dataset we can do the following:

In [None]:
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

Similarly, you can use the IterableDataset.skip() function to create training and validation splits from a shuffled dataset as follows:

In [None]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
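
The buffered shuffle and the take/skip split above can be imitated in plain Python. The sketch below is an approximation under stated assumptions: `buffered_shuffle` and `toy_stream` are hypothetical names, the stream is just the integers 0 to 19, and the real IterableDataset implementation may differ in its details.

```python
import random
from itertools import islice


def buffered_shuffle(iterable, buffer_size, seed):
    """Yield items in an order randomized only within a sliding buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before yielding anything
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random element of the buffer
        buffer[idx] = item       # refill its slot with the next stream element
    rng.shuffle(buffer)
    yield from buffer            # drain what is left once the stream ends


def toy_stream():
    # Stand-in for a streamed dataset
    return iter(range(20))


shuffled = list(buffered_shuffle(toy_stream(), buffer_size=5, seed=42))

# Like take(3) and skip(3): the same seed gives the same order, so the splits are disjoint
validation = list(islice(buffered_shuffle(toy_stream(), 5, 42), 3))
train = list(islice(buffered_shuffle(toy_stream(), 5, 42), 3, None))

print(shuffled)
print(validation, train)
```

Note that the first element produced always comes from the first `buffer_size` items of the stream, which is why a small buffer gives only a local shuffle.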

Let’s round out our exploration of dataset streaming with a common application: combining multiple datasets together to create a single corpus. 🤗 Datasets provides an interleave_datasets() function that converts a list of IterableDataset objects into a single IterableDataset, where the elements of the new dataset are obtained by alternating among the source examples.

This function is especially useful when you’re trying to combine large datasets, so as an example let’s stream the FreeLaw subset of the Pile, which is a 51 GB dataset of legal opinions from US courts:

In [None]:
law_dataset_streamed = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)
next(iter(law_dataset_streamed))

This dataset is large enough to stress the RAM of most laptops, yet we’ve been able to load and access it without breaking a sweat! Let’s now combine the examples from the FreeLaw and PubMed Abstracts datasets with the interleave_datasets() function:

In [None]:
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

Here we’ve used the islice() function from Python’s itertools module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.
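
The alternating behaviour can be sketched with ordinary generators. The `interleave` helper and the two toy lists below are hypothetical stand-ins for the streamed datasets; this particular sketch simply stops as soon as any source runs out.

```python
from itertools import islice


def interleave(*iterables):
    """Alternate between the sources, one element at a time."""
    iterators = [iter(source) for source in iterables]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return  # this sketch ends when any source is exhausted


# Toy stand-ins for the two streamed corpora
pubmed_like = ["pubmed-0", "pubmed-1", "pubmed-2"]
law_like = ["law-0", "law-1", "law-2"]

first_four = list(islice(interleave(pubmed_like, law_like), 4))
print(first_four)
```

Because everything is lazy, only as many elements as are requested get pulled from each source.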

Finally, if you want to stream the Pile in its 825 GB entirety, you can grab all the prepared files as follows:

In [None]:
base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
next(iter(pile_dataset["train"]))

## Create your own dataset

Now that we have our access token, let’s create a function that can download all the issues from a GitHub repository:

In [None]:
GITHUB_TOKEN = "xxx"  # Replace the placeholder string with your GitHub token
headers = {"Authorization": f"token {GITHUB_TOKEN}"}

In [None]:
import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm
import requests

def fetch_issues(
    owner="huggingface",
    repo="datasets",
    num_issues=10_000,
    rate_limit=5_000,
    issues_path=Path("."),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    # GitHub numbers result pages starting from 1
    for page in tqdm(range(1, num_pages + 1)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        batch.extend(issues.json())

        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []  # Flush batch for next time period
            print("Reached GitHub rate limit. Sleeping for one hour ...")
            time.sleep(60 * 60 + 1)

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
    print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
    )


Now when we call fetch_issues() it will download all the issues in batches to avoid exceeding GitHub’s limit on the number of requests per hour; the result will be stored in a repository_name-issues.jsonl file, where each line is a JSON object that represents an issue. Let’s use this function to grab all the issues from 🤗 Datasets:

In [None]:
fetch_issues()

Once the issues are downloaded we can load them locally:

In [None]:
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
issues_dataset

But why are there several thousand issues when the Issues tab of the 🤗 Datasets repository only shows around 1,000 issues in total?

GitHub’s REST API v3 considers every pull request an issue, but not every issue is a pull request. For this reason, “Issues” endpoints may return both issues and pull requests in the response. You can identify pull requests by the pull_request key. Be aware that the id of a pull request returned from “Issues” endpoints will be an issue id.

In [None]:
sample = issues_dataset.shuffle(seed=666).select(range(3))

# Print out the URL and pull request entries
for url, pr in zip(sample["html_url"], sample["pull_request"]):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

Here we can see that each pull request is associated with various URLs, while ordinary issues have a None entry. We can use this distinction to create a new is_pull_request column that checks whether the pull_request field is None or not:

In [None]:
issues_dataset = issues_dataset.map(
    lambda x: {"is_pull_request": x["pull_request"] is not None}
)
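
To make the None check concrete, here is a tiny sketch on hand-written records. The issue numbers and the URL are invented and only mimic the shape of the `pull_request` field described above.

```python
# Made-up records: ordinary issues have None, pull requests carry a dict of URLs
issues = [
    {"number": 1, "pull_request": None},
    {"number": 2, "pull_request": {"url": "https://example.invalid/pulls/2"}},
    {"number": 3, "pull_request": None},
]

# Same test as in the map call: an entry is a pull request when the field is not None
flagged = [
    {**issue, "is_pull_request": issue["pull_request"] is not None} for issue in issues
]

only_issues = [issue["number"] for issue in flagged if not issue["is_pull_request"]]
print(only_issues)
```

Once the flag exists, separating real issues from pull requests is a simple filter on that column.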

The GitHub REST API provides a Comments endpoint that returns all the comments associated with an issue number. Let’s test the endpoint to see what it returns:

In [None]:
def get_comments(issue_number):
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    return [r["body"] for r in response.json()]


# Test our function works as expected
get_comments(2792)

Let’s use Dataset.map() to add a new comments column to each issue in our dataset:

In [None]:
# Depending on your internet connection, this can take a few minutes...
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)

#### Upload a dataset

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
issues_with_comments_dataset.push_to_hub("github-issues")