# Basic data cleaning and tokenization

The data cleaning script is taken from the BabyLlama repository (https://github.com/timinar/BabyLlama) by Timiryasov and Tastet (2023).

## Cleaning

Simple regex-based cleaning is applied to the train, dev, and test splits, e.g. to remove HTML tags from Wikipedia articles, to strip non-verbal cues from subtitles, and to correct I’s that OCR misread as l’s in uppercase text.
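
To make the three kinds of cleaning above concrete, here is a minimal sketch of what such regex rules could look like. The function names and patterns are hypothetical illustrations and are not the rules implemented in `baby_llama_clean`; the actual script may use different or more careful patterns.

```python
import re


def strip_html_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag, e.g. <b> or </ref>."""
    return re.sub(r'<[^>]+>', '', text)


def strip_nonverbal_cues(text: str) -> str:
    """Drop bracketed or parenthesised cues such as [laughs] or (MUSIC),
    together with any whitespace directly in front of them."""
    return re.sub(r'\s*[\[(][^\])]*[\])]', '', text)


def fix_ocr_l_in_uppercase(text: str) -> str:
    """Replace a lowercase 'l' with 'I' when both neighbours are uppercase
    letters, which is a typical OCR confusion in all-caps text."""
    return re.sub(r'(?<=[A-Z])l(?=[A-Z])', 'I', text)


if __name__ == '__main__':
    print(strip_html_tags('<b>Paris</b> is a city.'))
    print(strip_nonverbal_cues('Hello [laughs] there (MUSIC)'))
    print(fix_ocr_l_in_uppercase('WHlCH WAY'))
```

Rules this simple can over-match (for example, the cue pattern would also delete ordinary parenthetical remarks), so a real pipeline would likely restrict each rule to the corpus it was written for.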

In [7]:
from pathlib import Path
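# Import only the cleanup functions used below instead of a wildcard import.
from baby_llama_clean import (
    cleanup_aochildes,
    cleanup_bnc_spoken,
    cleanup_gutenberg,
    cleanup_open_subtitles,
    cleanup_simple_wikipedia,
    cleanup_switchboard,
)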

In [10]:
DATA_ROOT = Path('data/text_data')
SEQ_LENGTH = 128
DATA_SPLITS = ['train_100M', 'dev', 'test']

CLEANUP_FUNCTIONS = {
    'childes': cleanup_aochildes,
    'bnc_spoken': cleanup_bnc_spoken,
    'gutenberg': cleanup_gutenberg,
    'open_subtitles': cleanup_open_subtitles,
    'simple_wiki': cleanup_simple_wikipedia,
    'switchboard': cleanup_switchboard,
}
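
The keys of `CLEANUP_FUNCTIONS` are matched against each file's stem, so `childes.train`, `childes.dev`, and `childes.test` all resolve to the same function. The sketch below illustrates that lookup with a fallback for unregistered files. `identity_cleanup`, `DEMO_CLEANUP_FUNCTIONS`, and `pick_cleanup` are hypothetical names introduced only for this illustration; the only thing taken from the notebook is the `(text, seq_length)` call signature.

```python
from pathlib import Path


def identity_cleanup(text: str, seq_length: int) -> str:
    """Hypothetical placeholder with the same (text, seq_length) signature."""
    return text


# Hypothetical stand-in for the real CLEANUP_FUNCTIONS table.
DEMO_CLEANUP_FUNCTIONS = {'childes': identity_cleanup}


def pick_cleanup(path: Path, table: dict):
    """Return the cleanup function registered under the file's stem, or None."""
    return table.get(path.stem)


if __name__ == '__main__':
    for name in ['childes.train', 'childes.dev', 'README.txt']:
        func = pick_cleanup(Path(name), DEMO_CLEANUP_FUNCTIONS)
        print(name, '->', func.__name__ if func else 'no cleanup registered')
```

Using `dict.get` instead of direct indexing lets the caller decide how to treat a file without a registered cleanup function, rather than failing with a `KeyError`.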


In [11]:
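for split in DATA_SPLITS:
    INPUT_DIR = DATA_ROOT / split
    OUTPUT_DIR = DATA_ROOT / f'clean_{split}'
    OUTPUT_DIR.mkdir(exist_ok=True)

    # Collect the corpus files of this split (train, dev, or test).
    split_files = [f for f in INPUT_DIR.iterdir() if f.is_file() and f.suffix in ['.train', '.dev', '.test']]

    for file in split_files:
        # Read and write as UTF-8 explicitly so the result does not depend on the platform default.
        text = file.read_text(encoding='utf-8')
        # The file stem (e.g. 'childes') selects the matching cleanup function.
        cleaned_text = CLEANUP_FUNCTIONS[file.stem](text, SEQ_LENGTH)
        (OUTPUT_DIR / file.name).write_text(cleaned_text, encoding='utf-8')
        print(f"🧹 Cleaned '{file.name}' (size {len(text)} -> {len(cleaned_text)}) in {split}")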


🧹 Cleaned 'open_subtitles.train' (size 106026268 -> 106007989) in train_100M
🧹 Cleaned 'bnc_spoken.train' (size 40351645 -> 40108196) in train_100M
🧹 Cleaned 'gutenberg.train' (size 144429471 -> 144429471) in train_100M
🧹 Cleaned 'childes.train' (size 156960267 -> 156958053) in train_100M
🧹 Cleaned 'simple_wiki.train' (size 85104519 -> 84872441) in train_100M
🧹 Cleaned 'switchboard.train' (size 6586033 -> 6586033) in train_100M
🧹 Cleaned 'simple_wiki.dev' (size 8149513 -> 8128239) in dev
🧹 Cleaned 'childes.dev' (size 14638378 -> 14638168) in dev
🧹 Cleaned 'switchboard.dev' (size 724013 -> 724013) in dev
🧹 Cleaned 'open_subtitles.dev' (size 11016133 -> 11014854) in dev
🧹 Cleaned 'gutenberg.dev' (size 15490473 -> 15490473) in dev
🧹 Cleaned 'bnc_spoken.dev' (size 6538139 -> 6503778) in dev
🧹 Cleaned 'childes.test' (size 14696551 -> 14696436) in test
🧹 Cleaned 'switchboard.test' (size 823158 -> 823158) in test
🧹 Cleaned 'bnc_spoken.test' (size 4888137 -> 4861019) in test
🧹 Cleaned 'gutenbe