# 📚 Data

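This notebook builds the datasets used in this experiment suite and pushes them to the Hugging Face Hub: a single-sentence memorization dataset, a cleaned WikiText 2, and subsampled and pre-tokenized versions of FinewebEdu.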

## Setup 

In [7]:
import autorootcwd

In [8]:
from typing import Dict, Tuple, Any
from tqdm import tqdm

import pandas as pd
from transformers import AutoTokenizer
from datasets import DatasetDict, Dataset, load_dataset
from src.utils import format_int

## Memorizing Dataset

We will create a single-sample dummy dataset to check that a model can overfit one example. The same sentence is used for the train, validation and test splits.


In [9]:
# Create a dataset with a single example
sentence = "I am a large language model and I can memorize this sentence."
dataset = Dataset.from_dict({"text": [sentence]})
memorize = DatasetDict({"train": dataset, "validation": dataset, "test": dataset})

print(f"Created a dataset with a single example in train, validation and test: {sentence}")

Created a dataset with a single example in train, validation and test: I am a large language model and I can memorize this sentence.


In [10]:
# Push to Hugging Face Hub
repo_name = "memorize"
memorize.push_to_hub(repo_name)

print(f"Pushed to https://huggingface.co/datasets/mikasenghaas/{repo_name}")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 2015.52ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  1.67it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 2101.35ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  5.14it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 2211.02ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  5.03it/s]


Pushed to https://huggingface.co/datasets/mikasenghaas/memorize


## WikiText 2

For now, we will use a tiny dataset, `Salesforce/wikitext` with the `wikitext-2-raw-v1` configuration. Its raw train, validation and test splits consist of 36.7K, 3.8K and 4.4K lines respectively (17.6K, 1.8K and 2.2K examples after the cleaning below).

In [11]:
# Load WikiText 2
wiki = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", cache_dir="/workspace/huggingface")
train_wiki, val_wiki, test_wiki = wiki["train"], wiki["validation"], wiki["test"]

print(f"Loaded {len(train_wiki)/1e3:.1f}K training, {len(val_wiki)/1e3:.1f}K validation and {len(test_wiki)/1e3:.1f}K test examples.")

Generating test split: 100%|██████████| 4358/4358 [00:00<00:00, 412445.89 examples/s]
Generating train split: 100%|██████████| 36718/36718 [00:00<00:00, 1126560.51 examples/s]
Generating validation split: 100%|██████████| 3760/3760 [00:00<00:00, 728231.58 examples/s]

Loaded 36.7K training, 3.8K validation and 4.4K test examples.





Each example has only a `text` field, which contains a single line of text. The lines are parsed from high-quality Wikipedia articles. We can already see that there are many empty lines and other artifacts such as headlines.

In [12]:
# Examples
for example in train_wiki.take(5):
    print(example)

{'text': ''}
{'text': ' = Valkyria Chronicles III = \n'}
{'text': ''}
{'text': ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n'}
{'text': " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adju

We are going to remove empty lines and headlines, and strip leading and trailing whitespace from the remaining lines.

In [13]:
def non_empty_text(examples: Dict[str, Any]) -> bool:
    return examples["text"] != ""

def non_headline(examples: Dict[str, Any]) -> bool:
    return not examples["text"].startswith(" = ")

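def strip_headline(examples: Dict[str, Any]) -> Dict[str, Any]:
    # Despite the name, this strips leading and trailing whitespace
    # (including the trailing newline); headlines are already filtered out.
    examples["text"] = examples["text"].strip()
    return examples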

In [14]:
train_wiki_processed = train_wiki.filter(non_empty_text).filter(non_headline).map(strip_headline)
val_wiki_processed = val_wiki.filter(non_empty_text).filter(non_headline).map(strip_headline)
test_wiki_processed = test_wiki.filter(non_empty_text).filter(non_headline).map(strip_headline)

print(f"Processed {len(train_wiki_processed)/1e3:.1f}K training, {len(val_wiki_processed)/1e3:.1f}K validation and {len(test_wiki_processed)/1e3:.1f}K test examples.")

Filter: 100%|██████████| 36718/36718 [00:00<00:00, 402373.52 examples/s]
Filter: 100%|██████████| 23767/23767 [00:00<00:00, 168581.05 examples/s]
Map: 100%|██████████| 17556/17556 [00:00<00:00, 19673.33 examples/s]
Filter: 100%|██████████| 3760/3760 [00:00<00:00, 317187.91 examples/s]
Filter: 100%|██████████| 2461/2461 [00:00<00:00, 147487.14 examples/s]
Map: 100%|██████████| 1841/1841 [00:00<00:00, 18811.06 examples/s]
Filter: 100%|██████████| 4358/4358 [00:00<00:00, 340120.89 examples/s]
Filter: 100%|██████████| 2891/2891 [00:00<00:00, 156855.74 examples/s]
Map: 100%|██████████| 2183/2183 [00:00<00:00, 18823.58 examples/s]

Processed 17.6K training, 1.8K validation and 2.2K test examples.
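
To make the effect of the three steps concrete, here is a minimal plain-Python sketch that applies the same logic to a handful of toy rows. The `clean` helper and the toy rows are illustrative assumptions, not part of the notebook; the real pipeline uses `Dataset.filter` and `Dataset.map` as in the cell above.

```python
from typing import Dict, List


def clean(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop empty rows and headline rows, then strip surrounding whitespace."""
    # Step 1: remove empty lines (same test as non_empty_text)
    kept = [row for row in rows if row["text"] != ""]
    # Step 2: remove headlines, which start with " = " (same test as non_headline)
    kept = [row for row in kept if not row["text"].startswith(" = ")]
    # Step 3: strip leading and trailing whitespace, including the newline
    return [{"text": row["text"].strip()} for row in kept]


# Toy rows shaped like raw WikiText lines (illustrative, not real data)
toy_rows = [
    {"text": ""},
    {"text": " = Valkyria Chronicles III = \n"},
    {"text": " Released in January 2011 in Japan . \n"},
    {"text": " = = Gameplay = = \n"},
    {"text": " The game began development in 2010 . \n"},
]

cleaned = clean(toy_rows)
for row in cleaned:
    print(row)
```

Only the two body lines survive. Note that sub-headlines such as ` = = Gameplay = = ` also start with ` = `, so the same prefix check removes them.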





In [15]:
for example in train_wiki_processed.take(5):
    print(example)

{'text': 'Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " .'}
{'text': "The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Cha

Looks good! Let's get some statistics on the processed dataset.

In [16]:
# Dataset statistics
def get_num_examples(dataset) -> int:
    return len(dataset)

def get_num_tokens(dataset, tokenizer) -> int:
    # Encodes one example at a time, which is slow on large splits
    return sum(len(tokenizer.encode(example["text"])) for example in dataset)

# GPT-2 and Llama 3 tokenizers
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
llama3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

splits = [train_wiki_processed, val_wiki_processed, test_wiki_processed]
stats = pd.DataFrame({
    'Split': ['Train', 'Validation', 'Test'],
    # Build lists explicitly rather than passing lazy map objects to pandas
    'Examples': [format_int(get_num_examples(split)) for split in splits],
    'GPT-2 Tokens': [format_int(get_num_tokens(split, gpt2_tokenizer)) for split in splits],
    'Llama-3 Tokens': [format_int(get_num_tokens(split, llama3_tokenizer)) for split in splits]
}).set_index('Split')

stats

KeyboardInterrupt: 

Finally, let's push the processed datasets to the Hugging Face Hub.

In [None]:
# Push to Hugging Face Hub
data = DatasetDict({
    'train': train_wiki_processed,
    'validation': val_wiki_processed,
    'test': test_wiki_processed
})

repo_name = "wikitext-2"
data.push_to_hub(repo_name)

print(f"Pushed to https://huggingface.co/datasets/mikasenghaas/{repo_name}")

## FinewebEdu

The [FinewebEdu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset is a large-scale pre-training dataset developed by the Hugging Face team. The smaller version consists of 1.3T high-quality tokens, selected with an educational-quality classifier trained on annotations generated by Llama 3 70B Instruct.

We are going to use the 10BT version of the dataset which contains 9.67M samples, corresponding to roughly 10B GPT-2 training tokens.

In [None]:
# Load FinewebEdu (10BT)
finewebedu_10bt = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT", split="train", cache_dir="/workspace/huggingface")

print(f"Loaded {len(finewebedu_10bt)/1e6:.1f}M training examples.")

In [None]:
# Randomly sample 10% of FinewebEdu 10BT (1BT)
finewebedu_1bt = finewebedu_10bt.shuffle(seed=42).select(range(int(len(finewebedu_10bt) * 0.1)))

print(f"Sampled {len(finewebedu_1bt)/1e6:.1f}M training examples (10% of 10BT).")

In [None]:
# Randomly sample 10% of FinewebEdu 1BT (1% of 10BT); reusing the seed makes this a subset of the 1BT sample
finewebedu_100mt = finewebedu_10bt.shuffle(seed=42).select(range(int(len(finewebedu_1bt) * 0.1)))

print(f"Sampled {len(finewebedu_100mt)/1e6:.1f}M training examples (10% of 1BT).")
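
Because both samples shuffle the 10BT dataset with the same seed and then take a prefix, the smaller sample should be contained in the larger one. The sketch below illustrates that idea with Python's `random` module on plain indices. It is a stand-in under that assumption, not the `datasets` implementation, and `shuffled_prefix` is a hypothetical helper.

```python
import random
from typing import List


def shuffled_prefix(n_total: int, n_keep: int, seed: int = 42) -> List[int]:
    """Shuffle the indices 0..n_total-1 with a fixed seed and keep the first n_keep."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_keep]


n_total = 1000                      # stands in for the 10BT dataset
sample_10pct = shuffled_prefix(n_total, n_total // 10)
sample_1pct = shuffled_prefix(n_total, len(sample_10pct) // 10)

# Same seed, same permutation: the smaller prefix is nested in the larger one
is_nested = set(sample_1pct) <= set(sample_10pct)
print(len(sample_10pct), len(sample_1pct), is_nested)
```

Nested samples are convenient for scaling experiments, since a run on the smallest subset sees only data that the larger runs also see.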

In [None]:
# Dataset statistics
get_num_examples = lambda dataset: len(dataset)
get_num_tokens = lambda dataset, tokenizer: sum(len(tokenizer.encode(example['text'])) for example in dataset)

# GPT-2 and Llama 3 tokenizers
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
llama3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Function to calculate average tokens per example
def calc_avg_tokens(dataset, tokenizer, num_samples=10000):
    subset_data = dataset.shuffle(seed=42).select(range(num_samples))
    return get_num_tokens(subset_data, tokenizer) / num_samples

# Calculate average tokens for each dataset
datasets = {
    '10BT': finewebedu_10bt,
    '1BT': finewebedu_1bt,
    '100MT': finewebedu_100mt
}

stats_data = []
for name, dataset in datasets.items():
    num_examples = get_num_examples(dataset)
    avg_gpt2_tokens = calc_avg_tokens(dataset, gpt_tokenizer)
    avg_llama3_tokens = calc_avg_tokens(dataset, llama3_tokenizer)
    
    stats_data.append({
        'Dataset': name,
        'Examples': format_int(num_examples),
        'GPT-2 Tokens': format_int(num_examples * avg_gpt2_tokens),
        'Llama-3 Tokens': format_int(num_examples * avg_llama3_tokens)
    })

stats = pd.DataFrame(stats_data).set_index(['Dataset'])
stats
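
The token counts in the cell above are estimates: the average tokens per example is measured on a random sample and multiplied by the number of examples. Here is a minimal sketch of that extrapolation in plain Python. The whitespace counter is a hypothetical stand-in for a real tokenizer, and `estimate_total_tokens` is an illustrative name, not a function from this repository.

```python
import random
from typing import Callable, List


def estimate_total_tokens(
    texts: List[str],
    count_tokens: Callable[[str], int],
    num_samples: int,
    seed: int = 42,
) -> float:
    """Estimate the total token count from the mean of a random sample."""
    # Guard against corpora smaller than the requested sample size
    k = min(num_samples, len(texts))
    sample = random.Random(seed).sample(texts, k)
    avg_tokens = sum(count_tokens(text) for text in sample) / k
    return avg_tokens * len(texts)


def whitespace_count(text: str) -> int:
    # Hypothetical stand-in for len(tokenizer.encode(text))
    return len(text.split())


corpus = ["a b c", "a b", "a b c d"] * 100
estimate = estimate_total_tokens(corpus, whitespace_count, num_samples=50)
exact = sum(whitespace_count(text) for text in corpus)
print(f"estimated: {estimate:.0f}, exact: {exact}")
```

The estimate converges to the exact count as the sample grows, and equals it when the sample covers the whole corpus.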

Nice, we are getting 100%, 10%, and 1% of the dataset, i.e. roughly 10B, 1B, and 100M GPT-2 training tokens, respectively. Let's upload the processed dataset to the Hugging Face Hub.

In [None]:
# Upload to Hugging Face Hub
repo_name = "fineweb-edu-10bt"
finewebedu_10bt.push_to_hub(repo_name)

print(f"Pushed to https://huggingface.co/datasets/mikasenghaas/{repo_name}")

In [None]:
# Upload to Hugging Face Hub
repo_name = "fineweb-edu-1bt"
finewebedu_1bt.push_to_hub(repo_name)

print(f"Pushed to https://huggingface.co/datasets/mikasenghaas/{repo_name}")

In [None]:
# Upload to Hugging Face Hub
repo_name = "fineweb-edu-100mt"
finewebedu_100mt.push_to_hub(repo_name)

print(f"Pushed to https://huggingface.co/datasets/mikasenghaas/{repo_name}")

Let's also pre-tokenize the dataset and store it in the Hugging Face Hub.

In [22]:
import os
from src.utils import tokenize

# Tokenize
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
fineweb_edu_10bt_tok = finewebedu_10bt.map(
    lambda x: tokenize(x["text"], tokenizer, max_length=1025),
    batched=True,
    num_proc=os.cpu_count(),
    remove_columns=finewebedu_10bt.column_names
)

print(fineweb_edu_10bt_tok)

['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score']


Map (num_proc=104): 100%|██████████| 9672101/9672101 [04:37<00:00, 34826.51 examples/s]


Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 9672101
})
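
The notebook does not say why `max_length` is 1025 rather than GPT-2's context length of 1024. One plausible reason, stated here as an assumption, is that next-token prediction needs one extra token so that inputs and shifted targets both have length 1024. A minimal sketch of that split, with `split_inputs_targets` as a hypothetical helper:

```python
from typing import List, Tuple


def split_inputs_targets(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Split a sequence into model inputs and next-token targets (shifted by one)."""
    return token_ids[:-1], token_ids[1:]


context_length = 1024                          # GPT-2 context window
token_ids = list(range(context_length + 1))    # dummy ids, 1025 tokens long
inputs, targets = split_inputs_targets(token_ids)

# Each target is the token that follows the corresponding input position
print(len(inputs), len(targets))
```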


In [26]:
from datasets import DatasetDict

# Create a DatasetDict with train split
fineweb_edu_10bt_tok_dict = DatasetDict({"train": fineweb_edu_10bt_tok})

# Save the DatasetDict to disk
fineweb_edu_10bt_tok_dict.save_to_disk("/alloc/huggingface/mikasenghaas/fineweb-edu-10bt-tokenized")

Saving the dataset (0/100 shards):   0%|          | 39000/9672101 [00:00<00:26, 369040.31 examples/s]

Saving the dataset (100/100 shards): 100%|██████████| 9672101/9672101 [00:26<00:00, 360586.86 examples/s]


In [23]:
# Push tokenized version to Hugging Face Hub
repo_name = "fineweb-edu-10bt-tokenized"
fineweb_edu_10bt_tok.push_to_hub(repo_name)

print(f"Pushed to https://huggingface.co/datasets/mikasenghaas/{repo_name}")

Creating parquet from Arrow format: 100%|██████████| 97/97 [00:04<00:00, 23.57ba/s]

Pushed to https://huggingface.co/datasets/mikasenghaas/fineweb-edu-10bt-tokenized
