# 📚 Data

This notebook loads, cleans, and summarizes the datasets used in this experiment suite.

## Setup 

In [None]:
import autorootcwd

In [None]:
from typing import List, Dict, Any

import torch
import pandas as pd
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from datasets import DatasetDict, load_dataset 
from src.utils import format_int

## WikiText 2

For now, we will use a tiny dataset: the `wikitext-2-raw-v1` configuration of `Salesforce/wikitext`. It has train, validation, and test splits that consist of roughly 36.7K, 3.8K, and 4.4K examples respectively.

In [None]:
# Load WikiText 2
wiki = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1")
train_wiki, val_wiki, test_wiki = wiki["train"], wiki["validation"], wiki["test"]

print(f"Loaded {len(train_wiki)/1e3:.1f}K training, {len(val_wiki)/1e3:.1f}K validation and {len(test_wiki)/1e3:.1f}K test examples.")

A single example has only a `text` field, which contains a single line of text. The lines are parsed from high-quality Wikipedia articles. We can already see that there are many empty lines and other artifacts such as headlines.

In [None]:
# Examples
for example in train_wiki.take(5):
    print(example)

We are going to remove empty lines, headlines, and trailing whitespace.

In [None]:
def non_empty_text(example: Dict[str, Any]) -> bool:
    """Keep only examples that contain more than whitespace."""
    return example["text"].strip() != ""

def non_headline(example: Dict[str, Any]) -> bool:
    """Drop headlines, which look like ' = Title = ' in the raw data."""
    return not example["text"].startswith(" = ")

def strip_whitespace(example: Dict[str, Any]) -> Dict[str, Any]:
    """Remove leading and trailing whitespace from the text."""
    example["text"] = example["text"].strip()
    return example

In [None]:
# Apply the same filter -> filter -> map pipeline to every split
train_wiki_processed = train_wiki.filter(non_empty_text).filter(non_headline).map(strip_whitespace)
val_wiki_processed = val_wiki.filter(non_empty_text).filter(non_headline).map(strip_whitespace)
test_wiki_processed = test_wiki.filter(non_empty_text).filter(non_headline).map(strip_whitespace)

print(f"Processed {len(train_wiki_processed)/1e3:.1f}K training, {len(val_wiki_processed)/1e3:.1f}K validation and {len(test_wiki_processed)/1e3:.1f}K test examples.")
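To make the effect of the cleaning steps concrete, here is a minimal sketch that applies the same three rules to a handful of hand-written rows. The rows are made up to mimic the raw WikiText layout, and the helper names `keep` and `clean` are hypothetical stand-ins for the functions above. It uses plain Python lists instead of a `datasets.Dataset`, so it runs without downloading anything.

```python
from typing import Dict, List

# Toy rows that imitate raw WikiText lines (invented for illustration).
rows: List[Dict[str, str]] = [
    {"text": ""},                                   # empty line
    {"text": " = Some Article Title = \n"},         # top-level headline
    {"text": " The game is a tactical role-playing game . \n"},
    {"text": " = = Gameplay = = \n"},               # nested headline
]

def keep(row: Dict[str, str]) -> bool:
    """Keep a row only if it is non-empty and not a headline."""
    text = row["text"]
    return text.strip() != "" and not text.startswith(" = ")

def clean(row: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the row with surrounding whitespace removed."""
    return {"text": row["text"].strip()}

processed = [clean(row) for row in rows if keep(row)]
print(processed)
```

Only the third row survives, with its leading space and trailing newline removed. Nested headlines such as ` = = Gameplay = = ` also start with ` = `, so the single prefix check covers them.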

In [None]:
for example in train_wiki_processed.take(5):
    print(example)

Looks good! Let's get some statistics on the processed dataset.

In [None]:
# Dataset statistics
get_num_examples = lambda dataset: len(dataset)
get_num_chars = lambda dataset: sum(len(example['text']) for example in dataset)
get_num_tokens = lambda dataset, tokenizer: sum(len(tokenizer.encode(example['text'])) for example in dataset)

# Llama 2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

stats = pd.DataFrame({
    'Split': ['Train', 'Validation', 'Test'],
    'Examples': map(format_int, [get_num_examples(train_wiki_processed), get_num_examples(val_wiki_processed), get_num_examples(test_wiki_processed)]),
    'Characters': map(format_int, [get_num_chars(train_wiki_processed), get_num_chars(val_wiki_processed), get_num_chars(test_wiki_processed)]),
    'Tokens': map(format_int, [get_num_tokens(train_wiki_processed, tokenizer), get_num_tokens(val_wiki_processed, tokenizer), get_num_tokens(test_wiki_processed, tokenizer)])
}).set_index('Split')

stats
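The three statistics above each iterate over the dataset separately. As a rough sketch, they could also be collected in a single pass. The `WhitespaceTokenizer` below is a hypothetical stand-in that only mimics the `encode` method of a Hugging Face tokenizer, and `split_stats` is an illustrative helper, not part of the notebook.

```python
from typing import Dict, Iterable, List

class WhitespaceTokenizer:
    """Hypothetical stand-in exposing `encode`, like a Hugging Face tokenizer."""

    def encode(self, text: str) -> List[str]:
        return text.split()

def split_stats(dataset: Iterable[Dict[str, str]], tokenizer) -> Dict[str, int]:
    """Count examples, characters, and tokens in one pass over the dataset."""
    examples = chars = tokens = 0
    for example in dataset:
        text = example["text"]
        examples += 1
        chars += len(text)
        tokens += len(tokenizer.encode(text))
    return {"examples": examples, "characters": chars, "tokens": tokens}

toy_split = [{"text": "hello world"}, {"text": "a b c"}]
toy_stats = split_stats(toy_split, WhitespaceTokenizer())
print(toy_stats)
```

One caveat when swapping in a real tokenizer: `encode` adds special tokens by default, so with the Llama 2 tokenizer each example likely contributes an extra beginning-of-sequence token to the total. Passing `add_special_tokens=False` should exclude it if the raw text token count is wanted.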

In [None]:
!du -sh ~/.cache/huggingface/hub

Finally, let's push the processed datasets to the Hugging Face Hub.

In [None]:
# Push to Hugging Face Hub
data = DatasetDict({
    'train': train_wiki_processed,
    'validation': val_wiki_processed,
    'test': test_wiki_processed
})

repo_name = "wikitext-2"
data.push_to_hub(repo_name)

print(f"Pushed to https://huggingface.co/datasets/mikasenghaas/{repo_name}")

## FinewebEdu

The [FinewebEdu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset is a large-scale pre-training dataset developed by the Hugging Face team. The smaller version consists of 1.3T tokens that have been filtered for educational quality by a classifier trained on annotations generated with Llama 3 70B Instruct.

We are going to use the 10BT version of the dataset:
- 9.67M Examples
- 10.1B (GPT-2) Tokens
- 1.3GB Disk Usage

In [None]:
# Load FinewebEdu (10BT)
finewebedu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT", split="train")

print(f"Loaded {len(finewebedu)/1e6:.1f}M training examples.")

In [None]:
# Print a few examples
for example in finewebedu.take(5):
    print(example)

In [None]:
# Dataset statistics
get_num_examples = lambda dataset: len(dataset)
get_num_chars = lambda dataset: sum(len(example['text']) for example in dataset)
get_num_tokens = lambda dataset, tokenizer: sum(len(tokenizer.encode(example['text'])) for example in dataset)

# GPT-2, Llama 2, and Llama 3 tokenizers (the Llama repositories are gated and require access)
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
llama2_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

stats = pd.DataFrame({
    'Split': ['Train'],
    'Examples': [format_int(get_num_examples(finewebedu))],
    'Characters': [format_int(get_num_chars(finewebedu))],
    'GPT-2 Tokens': [format_int(get_num_tokens(finewebedu, gpt_tokenizer))],
    'Llama-2 Tokens': [format_int(get_num_tokens(finewebedu, llama2_tokenizer))],
    'Llama-3 Tokens': [format_int(get_num_tokens(finewebedu, llama3_tokenizer))]
}).set_index('Split')

stats
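Tokenizing every example three times in a Python loop will likely take a long time at this scale. If an approximate figure is enough, one option is to tokenize a random sample and scale the mean up to the full dataset. The sketch below shows the idea on toy data; `estimate_total_tokens` and its parameters are hypothetical names, and the whitespace split stands in for a real tokenizer.

```python
import random
from typing import Callable, Sequence

def estimate_total_tokens(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Estimate the total token count from a uniform random sample of texts."""
    n = len(texts)
    k = min(sample_size, n)
    if k == 0:
        return 0.0
    rng = random.Random(seed)
    indices = rng.sample(range(n), k)  # sample without replacement
    mean_tokens = sum(count_tokens(texts[i]) for i in indices) / k
    return mean_tokens * n

# Toy corpus in which every text has exactly three whitespace tokens.
toy_texts = ["one two three"] * 1000
estimate = estimate_total_tokens(toy_texts, lambda t: len(t.split()), sample_size=50)
print(estimate)
```

In the notebook, the sample could plausibly be drawn with `finewebedu.shuffle(seed=...).select(range(k))` and counted with `len(tokenizer.encode(text))`. The estimate is only as good as the sample, so a larger `sample_size` is advisable when document lengths vary widely.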