# Is Child-Directed Speech Effective Training Data for Language Models?

## Setup

In [1]:
!nvidia-smi

Sun Dec 28 05:23:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   67C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
!pip install -q datasets transformers accelerate sentencepiece evaluate scipy scikit-learn

In [3]:
import os
import random
from datasets import load_dataset

In [4]:
SEED = 42
random.seed(SEED)

In [5]:
!git clone https://github.com/styfeng/TinyDialogues.git
%cd TinyDialogues

fatal: destination path 'TinyDialogues' already exists and is not an empty directory.
/content/TinyDialogues


## Dataset

In [6]:
# Official dataset from the Hugging Face Hub
ds = load_dataset("styfeng/TinyDialogues")

print(ds)


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 110024
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 19708
    })
})


In [7]:
# =================================================
# Convert dataset to paper‑compatible text format
# =================================================
# Paper requirements (Appendix B):
# - One conversation per example
# - Speaker labels surrounded by ** **
# - Double newlines between utterances
# - <|endoftext|> token at end

os.makedirs("TD", exist_ok=True)

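MAX_TOKENS = 5_000_000  # per-split budget in whitespace-separated words; subset for Colab feasibility

def write_split(split, out_path):
    """Write examples from `split` to `out_path` until the word budget is reached."""
    token_count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in ds[split]:
            text = ex["text"].strip()
            # Make sure every conversation ends with the end-of-text marker.
            if not text.endswith("<|endoftext|>"):
                text += "\n<|endoftext|>"
            # Approximate the token count by whitespace splitting (these are words, not BPE tokens).
            tokens = text.split()
            # Stop before the example that would exceed the budget.
            if token_count + len(tokens) > MAX_TOKENS:
                break
            f.write(text + "\n")
            token_count += len(tokens)
    print(f"Wrote {token_count:,} tokens to {out_path}")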

write_split("train", "TD/train.txt")
write_split("validation", "TD/val.txt")

Wrote 4,999,850 tokens to TD/train.txt
Wrote 4,382,089 tokens to TD/val.txt


In [8]:
print(ds["train"][0]["text"][:500])

**Dad**: "Hey sweetie, do you want to paint with Daddy?" \n\n **Child**: "Paint!" \n\n **Dad**: "Yes, we'll use these brushes. But first, let's put on your apron so we don't get paint on your clothes." \n\n **Child**: "Apron!" \n\n **Mom**: "Breakfast is almost ready! Who wants pancakes?" \n\n **Child**: "Pancake!" \n\n **Dad**: "We'll eat first, then paint. Let's wash hands before we eat, okay?" \n\n **Child**: "Wash!" \n\n **Mom**: "Careful, the pancakes are hot. We'll let them cool a little b
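
The printed sample suggests each conversation is a single line in which turns take the form `**Speaker**: "utterance"` and are separated by a literal backslash-n pair rather than real newlines. A minimal sketch for splitting such a line into (speaker, utterance) pairs follows; `parse_dialogue` and `TURN_RE` are hypothetical helper names, not part of the TinyDialogues repository, and the separator handling is an assumption based only on this one sample.

```python
import re

# Assumption: a turn looks like **Speaker**: "utterance" (format inferred from the sample).
TURN_RE = re.compile(r'\*\*(?P<speaker>[^*]+)\*\*:\s*(?P<utterance>.*)', re.DOTALL)

def parse_dialogue(text):
    """Split one conversation string into a list of (speaker, utterance) pairs."""
    text = text.replace("<|endoftext|>", "")
    turns = []
    # Assumption: turns are separated by the four characters \ n \ n, not by real newlines.
    for chunk in text.split("\\n\\n"):
        match = TURN_RE.match(chunk.strip())
        if match is None:
            continue  # skip empty or malformed chunks
        utterance = match.group("utterance").strip().strip('"')
        turns.append((match.group("speaker").strip(), utterance))
    return turns

# Abridged excerpt of the first training example (raw string keeps the literal \n\n).
sample = (
    r'**Dad**: "Hey sweetie, do you want to paint with Daddy?" \n\n '
    r'**Child**: "Paint!" \n\n '
    r'**Mom**: "Breakfast is almost ready! Who wants pancakes?" \n\n '
    r'**Child**: "Pancake!"'
)

turns = parse_dialogue(sample)
for speaker, utterance in turns:
    print(f"{speaker}: {utterance}")
```

A parser like this can be handy for sanity checks such as counting turns per speaker before training, but it does not validate the rest of the paper's format.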


In [9]:
!python scripts/tokenizers/train_GPT2_tokenizer.py \
    TD/train.txt TD/val.txt TD_tokenizer

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']
50129
[00:00:00] Tokenize words                 ██████████████████ 46911    /    46911
[00:00:00] Count pairs                    ██████████████████ 46911    /    46911
[00:00:00] Compute merges                 ██████████████████ 51742    /    51742
All special tokens: ['<|endoftext|>', '<UNK>']
BOS token: <|endoftext|>
EOS token: <|endoftext|>
PAD token: None
UNK token: <|endoftext|>
SEP token: None
CLS token: None
MASK token: None


In [10]:
!python scripts/tokenizers/test_GPT2_tokenizer.py TD_tokenizer



 TD_tokenizer 

['do', 'Ġyou', 'Ġwant', 'Ġto', 'Ġlook', 'Ġat', 'Ġthat', 'Ġit', 'Ġsays', 'Ġlook', 'Ġ?']
do you want to look at that it says look ? 

['The', 'Ġyellow', '-', 'billed', 'Ġshri', 'ke', 'Ġ(', "'", 'C', 'or', 'vin', 'ella', 'Ġcor', 'v', 'ina', "')", 'Ġis', 'Ġa', 'Ġlarge', 'Ġpasser', 'ine', 'Ġbird', 'Ġin', 'Ġthe', 'Ġshri', 'ke', 'Ġfamily', '.', 'ĠIt', 'Ġis', 'Ġsometimes', 'Ġknown', 'Ġas', 'Ġthe', 'Ġlong', '-', 'tailed', 'Ġshri', 'ke', ',', 'Ġbut', 'Ġthis', 'Ġis', 'Ġto', 'Ġbe', 'Ġdiscouraged', ',', 'Ġsince', 'Ġit', 'Ġinvites', 'Ġconfusion', 'Ġwith', 'Ġthe', 'Ġlong', '-', 'tailed', 'Ġshri', 'ke', ',', "Ġ'", 'L', 'an', 'ius', 'Ġsch', 'ach', "',", 'Ġof', 'Ġtropical', 'Ġsouthern', 'ĠAsia', '.', 'ĠThe', 'Ġyellow', '-', 'billed', 'Ġshri', 'ke', 'Ġis', 'Ġa', 'Ġcommon', 'Ġresident', 'Ġbreeding', 'Ġbird', 'Ġin', 'Ġtropical', 'ĠAfrica', 'Ġfrom', 'ĠS', 'ene', 'gal', 'Ġeast', 'wards', 'Ġto', 'ĠU', 'g', 'anda', 'Ġand', 'Ġlocally', 'Ġin', 'Ġwesternmost', 'ĠKeny', 'a', '.', 'ĠIt', 'Ġfrequ',

## Training GPT-2 on a Shoestring

In [11]:
from transformers import GPT2Config, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments
from transformers import AutoTokenizer
from datasets import load_dataset

In [12]:
tokenizer = AutoTokenizer.from_pretrained("TD_tokenizer")
tokenizer.pad_token = tokenizer.eos_token

In [13]:
data_files = {
    "train": "TD/train.txt",
    "validation": "TD/val.txt"
}

raw_datasets = load_dataset("text", data_files=data_files)

In [14]:
def tokenize_fn(examples):
    out = tokenizer(
        examples["text"],
        truncation=True,
        max_length=1024,
    )
    out["labels"] = out["input_ids"].copy()
    return out

tokenized_datasets = raw_datasets.map(
    tokenize_fn,
    batched=True,
    remove_columns=["text"],
)

Map:   0%|          | 0/30421 [00:00<?, ? examples/s]

Map:   0%|          | 0/19708 [00:00<?, ? examples/s]

In [15]:
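# GPT‑2 small config (about 124M parameters, as in the paper).
# GPT2Config defaults supply the rest: 12 layers, 12 heads, 768-dim embeddings.
config = GPT2Config(
    vocab_size=len(tokenizer),  # match the tokenizer trained on TD
    n_positions=1024,           # maximum sequence length
    n_ctx=1024,
)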
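
The "124M" figure in the comment can be reproduced by hand from the GPT-2 small shape (12 layers, 768-dim embeddings, 1024 positions, tied input and output embeddings). The sketch below is a back-of-the-envelope count, not a call into `transformers`; `gpt2_param_count` is a hypothetical helper, and the vocabulary size of 50257 is the standard GPT-2 value used for illustration only, since the actual count in this notebook depends on `len(tokenizer)`.

```python
def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count parameters of a GPT-2 style decoder with a tied LM head."""
    d = n_embd
    token_emb = vocab_size * d          # wte (shared with the output head)
    pos_emb = n_positions * d           # wpe
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: expand to 4*d and project back, each with bias.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d)
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * d
    return token_emb + pos_emb + n_layer * per_layer + final_ln

# Assumption: standard GPT-2 vocabulary size, used here only as an example.
print(f"{gpt2_param_count(50257):,}")  # 124,439,808
```

With a real model object, `sum(p.numel() for p in model.parameters())` should give the same kind of number and is the safer check.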

In [16]:
model = GPT2LMHeadModel(config)

In [17]:
training_args = TrainingArguments(
    output_dir="gpt2_td",
    overwrite_output_dir=True,
    # evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=5,  # paper uses 20; reduced for Colab
    weight_decay=0.0,
    logging_steps=200,
    save_strategy="epoch",
    report_to="none",
    seed=SEED,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 0, 'bos_token_id': 0, 'pad_token_id': 0}.
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
200,3.576
400,2.5315
600,2.2822
800,2.1282
1000,2.0208
1200,1.9329
1400,1.8652
1600,1.7998
1800,1.7345
2000,1.69


TrainOutput(global_step=19015, training_loss=1.1844692800974739, metrics={'train_runtime': 14918.2522, 'train_samples_per_second': 10.196, 'train_steps_per_second': 1.275, 'total_flos': 2.325727204992e+16, 'train_loss': 1.1844692800974739, 'epoch': 5.0})
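
The reported `global_step=19015` is consistent with the settings above: 30,421 training examples, a per-device batch size of 1, 8 gradient accumulation steps, and 5 epochs. The arithmetic can be sketched as below; `optimizer_steps` is a hypothetical helper, and it assumes a single GPU and that a final partial accumulation group still counts as one optimizer step, which matches the logged value here.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the number of optimizer updates for a training run."""
    # Batches the dataloader yields per epoch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # Assumption: a trailing partial accumulation group still triggers an update.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

steps = optimizer_steps(num_examples=30421, per_device_batch=1, grad_accum=8, epochs=5)
print(steps)  # 19015
```

The same helper gives a quick estimate of how long the paper's 20-epoch setting would take at the observed rate of about 1.275 steps per second.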

In [18]:
# =================================================
# Save model
# =================================================

trainer.save_model("gpt2_td_final")
tokenizer.save_pretrained("gpt2_td_final")

print("Training complete. Model saved.")

# =================================================
# NEXT STEPS (not run here):
# -------------------------------------------------
# - Zorro evaluation (BabyLM pipeline)
# - Word Similarity benchmarks
# - Dataset comparisons (Wikipedia, CHILDES)
# - Local / global ordering experiments
# =================================================


Training complete. Model saved.
