# Train your own small GPT-2 model

If you want to experiment with a trained GPT-2 model, you can do so in the `Inference API` panel at

https://huggingface.co/openai-community/gpt2?text=My+name+is+Thomas+and+my+main

Note that we are training a small GPT-2 model on a tiny dataset. Still, we can observe how the model changes with the number of training steps and get some interesting results.

In [25]:
!pip install 'accelerate>=0.26.0'

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting accelerate>=0.26.0
  Downloading accelerate-1.1.1-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.1.1-py3-none-any.whl (333 kB)
Installing collected packages: accelerate
Successfully installed accelerate-1.1.1


In [1]:
!pip install "transformers[torch]"



In [2]:
# `datasets` is imported in the next cell, so install it together with transformers
!pip install datasets transformers



In [3]:
#import wandb  # we will talk about wandb next lecture
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorForLanguageModeling
from transformers import GPT2Config, GPT2LMHeadModel
from transformers import TrainingArguments, Trainer

## Prepare data

Before training, we have to tokenize the data and split it into chunks whose length equals the context size of the model.
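As a minimal sketch of the chunking idea, here is the same operation on plain Python lists. The helper name `chunk_token_ids` and the toy token IDs are made up for illustration; the real notebook works on tokenizer output and uses a chunk size of 512.

```python
from itertools import chain


def chunk_token_ids(token_lists, chunk_size):
    """Concatenate per-document token lists and cut them into equal-sized chunks."""
    flat = list(chain.from_iterable(token_lists))
    # Drop the incomplete tail so that every chunk has exactly `chunk_size` tokens
    usable = len(flat) - len(flat) % chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]


# Toy token IDs (hypothetical), three "documents" of different lengths
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
toy_chunks = chunk_token_ids(docs, chunk_size=4)
print(toy_chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that chunks can span document boundaries, and the leftover token `9` is discarded.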

In [4]:
# Replace with your own dataset
dataset = load_dataset("licmajster/CZ_articles_wiki")

# Make validation split
dataset = dataset['train'].train_test_split(test_size=0.0015)

In [5]:
# load the gpt-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token=tokenizer.eos_token

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'text', 'source'],
        num_rows: 1384
    })
    test: Dataset({
        features: ['title', 'text', 'source'],
        num_rows: 3
    })
})
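The split sizes above can be checked with a little arithmetic. This sketch assumes the number of test rows is rounded up (the convention scikit-learn uses, which `train_test_split` in `datasets` appears to follow); treat that rounding rule as an assumption.

```python
import math

total_rows = 1384 + 3        # train + test rows reported above
test_fraction = 0.0015       # the `test_size` passed to train_test_split

n_test = math.ceil(total_rows * test_fraction)  # assumed rounding rule
n_train = total_rows - n_test
print(n_train, n_test)
```

With only three validation articles, the validation metrics reported later are very noisy, so treat them as a rough signal only.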

In [6]:
# tokenize the dataset
def tokenize_function(example):
    return tokenizer(text=example["text"])
tokenized_ds = dataset.map(tokenize_function, batched=True, remove_columns='text')
tokenized_ds

Map:   0%|          | 0/1384 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1148 > 1024). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/3 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'source', 'input_ids', 'attention_mask'],
        num_rows: 1384
    })
    test: Dataset({
        features: ['title', 'source', 'input_ids', 'attention_mask'],
        num_rows: 3
    })
})

In [19]:
print(tokenized_ds["train"].data["input_ids"])

[
  [
    [
      26705,
      24573,
      89,
      320,
      686,
      ...
      2634,
      3211,
      6557,
      9892,
      13
    ],
    [
      40059,
      375,
      346,
      384,
      410,
      ...
      121,
      354,
      10495,
      1309,
      13
    ],
    ...
    [
      41,
      72,
      129,
      247,
      8836,
      ...
      28026,
      7344,
      3693,
      18,
      60
    ],
    [
      7908
    ]
  ],
  [
    [
      18833,
      11223,
      8873,
      299,
      12022,
      ...
      74,
      280,
      3693,
      20,
      60
    ],
    [
      11528
    ],
    ...
    [
      49,
      11601,
      16450,
      1976,
      1408,
      ...
      13038,
      73,
      3693,
      19,
      60
    ],
    [
      49,
      11601,
      24648,
      416,
      75,
      ...
      249,
      75,
      344,
      76,
      13
    ]
  ]
]


In [7]:
from itertools import chain
from datasets import Dataset, DatasetDict

def concatenate_and_chunk(dataset, chunk_size=512):
    # Flatten all `input_ids` into a single list
    all_input_ids = list(chain(*dataset["input_ids"]))
    
    # Create chunks of `chunk_size`
    chunks = [all_input_ids[i:i + chunk_size] for i in range(0, len(all_input_ids), chunk_size)]
    
    # Only keep chunks that are exactly of length `chunk_size`:
    # drop the last chunk if it is shorter (the `chunks` check guards against an empty split)
    if chunks and len(chunks[-1]) != chunk_size:
        chunks.pop()
    
    # Create a new dataset with only the `input_ids` chunks
    return Dataset.from_dict({"input_ids": chunks})

# Apply this function to each split (train and test) in the DatasetDict
chunked_ds = DatasetDict({
    split: concatenate_and_chunk(split_ds, chunk_size=512)
    for split, split_ds in tokenized_ds.items()
})

print(chunked_ds)


DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 446
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 1
    })
})


In [13]:
print(chunked_ds["train"].data)

InMemoryTable
input_ids: list<item: int32>
  child 0, item: int32
----
input_ids: [[[26705,24573,89,320,686,...,270,1424,74,2634,8873],[279,709,349,21162,8836,...,479,353,2634,384,299],...,[978,32790,709,128,249,...,11,285,1219,315,77],[2634,264,2100,88,257,...,20259,13139,9038,1477,1659]]]


In [8]:
# data collator joins chunks into batches
# see https://huggingface.co/docs/transformers/en/main_classes/data_collator
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
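To make the collator's role more concrete, here is a plain-Python sketch of what it does for causal language modeling, as we understand it: pad the examples to the same length and copy `input_ids` into `labels`, replacing padding positions with `-100` so the loss ignores them. The helper `make_causal_lm_batch` is hypothetical, it is not the library implementation, and it omits the attention mask.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def make_causal_lm_batch(chunks, pad_token_id):
    """Hypothetical helper: pad chunks to equal length and build the labels."""
    max_len = max(len(chunk) for chunk in chunks)
    input_ids = [chunk + [pad_token_id] * (max_len - len(chunk)) for chunk in chunks]
    # Labels are a copy of the inputs; padding positions are masked out
    labels = [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in input_ids
    ]
    return {"input_ids": input_ids, "labels": labels}


batch = make_causal_lm_batch([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

In this notebook every chunk already has 512 tokens, so no padding is added. Because the pad token is set to the EOS token, any EOS tokens in the data would be masked out of the loss in the same way.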

## Model

In [9]:
# Define the model configuration for the smallest GPT-2
config = GPT2Config(
    vocab_size=len(tokenizer),      # Standard GPT-2 vocab size 50257
    n_positions=512,                # Context size (512 is enough for small-scale models)
    n_embd=768,                     # Embedding size
    n_layer=12,                     # Number of transformer layers
    n_head=12,                      # Number of attention heads
)

# Initialize the model with random weights (we train from scratch; the tokenizer was loaded earlier)
model = GPT2LMHeadModel(config)
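As a sanity check on the model size, the parameter count can be estimated by hand from the configuration. This sketch assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, biases everywhere) and that the output head shares its weights with the token embedding; the function name is made up for illustration.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Rough parameter count for a GPT-2 style model with tied input/output embeddings."""
    token_emb = vocab_size * n_embd
    position_emb = n_positions * n_embd
    layer_norm = 2 * n_embd                               # scale + bias
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    return token_emb + position_emb + n_layer * per_layer + layer_norm  # + final layer norm


total_params = gpt2_param_count(vocab_size=50257, n_positions=512, n_embd=768, n_layer=12)
size_mb = total_params * 4 / 1e6  # 4 bytes per float32 parameter
print(total_params, round(size_mb))
```

The result is about 124M parameters, or roughly 496 MB in float32, which matches the size of the `model.safetensors` file uploaded later in this notebook.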

In [10]:
import torch
import math
import numpy as np

# Define the perplexity metric
def compute_metrics(eval_pred):
    # `eval_pred` is a tuple of (logits, labels)
    logits, labels = eval_pred

    # Convert logits and labels to PyTorch tensors if they are NumPy arrays
    if isinstance(logits, np.ndarray):
        logits = torch.tensor(logits)
    if isinstance(labels, np.ndarray):
        labels = torch.tensor(labels)

    # Shift labels so that tokens align for calculating loss
    shift_labels = labels[:, 1:].reshape(-1)
    shift_logits = logits[:, :-1, :].reshape(-1, logits.shape[-1])

    # Calculate the cross-entropy loss
    loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100)  # Ignore padding tokens
    loss = loss_fct(shift_logits, shift_labels)

    # Calculate perplexity
    perplexity = math.exp(loss.item())
    return {"perplexity": perplexity}
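Perplexity is the exponential of the average negative log-likelihood per token. A small worked example in plain Python (the function below is an illustrative stand-in, not the metric used by the `Trainer`): a model that guesses uniformly over the vocabulary has a perplexity equal to the vocabulary size, which is roughly where a randomly initialized model starts.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


vocab_size = 50257
uniform_ppl = perplexity([1 / vocab_size] * 8)   # uniform guessing
coin_flip_ppl = perplexity([0.5, 0.5, 0.5])      # as uncertain as a fair coin
print(round(uniform_ppl), round(coin_flip_ppl, 6))
```

A perplexity of 2 means the model is, on average, as uncertain as if it were choosing between two equally likely tokens.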


## Training

In [39]:
# Set this according to the size of your dataset
# You should train for at least 15 minutes on an A10 GPU to get something reasonable
TRAIN_EPOCHS = 30

SAVE_STEPS = 100
EVAL_STEPS = SAVE_STEPS // 2

# training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-training",  # Directory to save the model checkpoints and other outputs
    eval_strategy="steps",  # Evaluation strategy to use during training ('no', 'steps' or 'epoch')
    eval_steps=EVAL_STEPS,  # Perform evaluation every EVAL_STEPS steps
    num_train_epochs=TRAIN_EPOCHS,  # Total number of training epochs
    per_device_train_batch_size=16,  # Batch size for training on each device
    per_device_eval_batch_size=16,  # Batch size for evaluation on each device
    # learning_rate=2.5e-4,  # Initial learning rate for the optimizer
    learning_rate=1e-4,  # Initial learning rate for the optimizer
    lr_scheduler_type='cosine',  # Learning rate scheduler type. 'cosine' provides a cosine decay schedule.
    warmup_ratio=0.05,  # Proportion of training to perform linear learning rate warmup for
    adam_beta1=0.9,  # Beta1 parameter for the Adam optimizer (first moment decay)
    adam_beta2=0.999,  # Beta2 parameter for the Adam optimizer (second moment decay)
    weight_decay=0.01,  # Weight decay to apply (decoupled from the gradient update, as in AdamW)
    logging_strategy="steps",  # Logging strategy to use. 'steps' logs at specified steps.
    logging_steps=EVAL_STEPS,  # Log training metrics every EVAL_STEPS steps
    save_steps=SAVE_STEPS,  # Save a checkpoint every SAVE_STEPS steps
    save_total_limit=10,  # Maximum number of checkpoints to keep. Older checkpoints are deleted.
    # report_to='wandb',  # Uncomment to report metrics to Weights and Biases (optional)
)

trainer = Trainer(model=model,
                 args = training_args,
                 tokenizer=tokenizer,
                 train_dataset=chunked_ds["train"],
                 eval_dataset=chunked_ds["test"],
                 compute_metrics=compute_metrics,
                 data_collator = data_collator)
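The combination of `warmup_ratio=0.05` and a cosine scheduler can be sketched in plain Python. This is our approximation of the schedule, not the library code: the rounding of the warmup length and the example total of 840 steps (30 epochs of 28 batches for 446 chunks with batch size 16) are assumptions.

```python
import math


def warmup_steps(total_steps, warmup_ratio=0.05):
    """Number of linear warmup steps (assumed to be rounded up)."""
    return math.ceil(total_steps * warmup_ratio)


def lr_at_step(step, total_steps=840, base_lr=1e-4, warmup_ratio=0.05):
    """Learning rate after `step` optimizer steps: linear warmup, then cosine decay to zero."""
    n_warmup = warmup_steps(total_steps, warmup_ratio)
    if step < n_warmup:
        return base_lr * step / max(1, n_warmup)
    progress = (step - n_warmup) / max(1, total_steps - n_warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 20, warmup_steps(840), 420, 840):
    print(step, f"{lr_at_step(step):.2e}")
```

The learning rate rises linearly to `1e-4` during warmup, then follows half a cosine wave down to zero at the last step.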




In [40]:
trainer.train()

| Step | Training Loss | Validation Loss | Perplexity |
|------|---------------|-----------------|------------|
| 50   | 3.1782 | 3.760219 | 42.957417 |
| 100  | 3.0036 | 3.808624 | 45.087732 |
| 150  | 2.8353 | 3.907951 | 49.796083 |
| 200  | 2.6903 | 3.958899 | 52.398754 |
| 250  | 2.5525 | 4.001725 | 54.691355 |
| 300  | 2.3735 | 4.130373 | 62.199963 |
| 350  | 2.1909 | 4.275305 | 71.90068 |
| 400  | 2.0098 | 4.385308 | 80.261507 |
| 450  | 1.8381 | 4.427083 | 83.685311 |
| 500  | 1.6718 | 4.586266 | 98.12533 |


TrainOutput(global_step=840, training_loss=1.9819635663713728, metrics={'train_runtime': 923.4019, 'train_samples_per_second': 14.49, 'train_steps_per_second': 0.91, 'total_flos': 3496087388160000.0, 'train_loss': 1.9819635663713728, 'epoch': 30.0})
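The numbers in this log can be cross-checked with a few lines of arithmetic. The sketch assumes a single device and no gradient accumulation, so one optimizer step corresponds to one batch of 16 chunks.

```python
import math

train_chunks, batch_size, epochs = 446, 16, 30

steps_per_epoch = math.ceil(train_chunks / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

# Perplexity is exp(validation loss); check the first logged row
first_row_ppl = math.exp(3.760219)
print(round(first_row_ppl, 2))

# Throughput: samples seen divided by the reported runtime in seconds
samples_per_second = train_chunks * epochs / 923.4019
print(round(samples_per_second, 2))
```

The total matches `global_step=840`, and the recomputed perplexity and throughput agree with the logged values up to rounding.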

In [42]:
trainer.save_model("./gpt2-small-final") 

In [15]:
YOUR_MODEL_NAME = "my_small_gpt2_cswiki" # change this
HF_TOKEN = "TOKEN"  # change this 

model.push_to_hub(YOUR_MODEL_NAME, token=HF_TOKEN)
tokenizer.push_to_hub(YOUR_MODEL_NAME, token=HF_TOKEN)

model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/licmajster/my_small_gpt2_cswiki/commit/f851ab2d198d96620d8bb33088572f4bce835971', commit_message='Upload tokenizer', commit_description='', oid='f851ab2d198d96620d8bb33088572f4bce835971', pr_url=None, repo_url=RepoUrl('https://huggingface.co/licmajster/my_small_gpt2_cswiki', endpoint='https://huggingface.co', repo_type='model', repo_id='licmajster/my_small_gpt2_cswiki'), pr_revision=None, pr_num=None)

## Evaluation

Now you can switch from GPU to CPU. Try to complete some prompt specific to your dataset.

Does it make sense? Is it at least in Czech/Slovak?

In [43]:
from transformers import  GPT2LMHeadModel, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token=tokenizer.eos_token

In [44]:
model =  GPT2LMHeadModel.from_pretrained("./gpt2-small-final")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [87]:
PROMPT = "Počas války v letech" # Set starting prompt, something specific for your dataset

generator(
    PROMPT,
    max_length=50,       # Maximum length of the generated text
    do_sample=True,
    temperature=0.3,         # Experiment with this
    repetition_penalty=0.9,  # Experiment with this (values > 1.0 discourage repetition, values < 1.0 encourage it)
    num_return_sequences=3,
)

[{'generated_text': 'Počas války v letechze studoval na Akademii v letech 1945–1955–1955–1955–1956 na Akademii výtvarných umělecké'},
 {'generated_text': 'Počas války v letechze studoval na Akademii v ateliéru prof. V letech 1949–1946 byl školu výtvarných uměn'},
 {'generated_text': 'Počas války v letechze studoval na Akademii v ateliéru prof. V letech 1949–1938 studia na Akademii výtvarných uměleckop'}]
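To build some intuition for the two sampling parameters, here is a plain-Python sketch of how they change the next-token scores. The helpers and the toy logits are illustrative; the penalty rule follows the common convention (divide positive scores of already generated tokens by the penalty, multiply negative ones), which we assume matches the library's behavior.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Rescale the scores of tokens that were already generated."""
    out = list(logits)
    for token_id in set(seen_token_ids):
        out[token_id] = out[token_id] / penalty if out[token_id] > 0 else out[token_id] * penalty
    return out


logits = [2.0, 1.0, 0.5]  # toy scores for a 3-token vocabulary
probs_cold = softmax(logits, temperature=0.3)
probs_warm = softmax(logits, temperature=1.0)
boosted = apply_repetition_penalty(logits, seen_token_ids=[0], penalty=0.9)
damped = apply_repetition_penalty(logits, seen_token_ids=[0], penalty=1.2)
print([round(p, 3) for p in probs_cold], [round(p, 3) for p in probs_warm])
print(boosted[0], damped[0])
```

A penalty of 0.9 raises the score of tokens that already appeared, which may contribute to the repeated years in the samples above; try a value such as 1.2 and compare.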

Now go back to your training folder `./gpt2-training/`. Each `checkpoint-N` folder contains the model saved after N steps.

If you experiment with the earlier checkpoints, you can compare how the model's output changes over the course of training.
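To see which checkpoints are available, the training folder can be scanned for `checkpoint-N` directories. A minimal sketch using only the standard library; the helper name `list_checkpoint_steps` is made up, and the demo runs on a temporary directory so it does not depend on an actual training run.

```python
import re
import tempfile
from pathlib import Path


def list_checkpoint_steps(training_dir):
    """Return the step numbers of all `checkpoint-N` folders, sorted numerically."""
    root = Path(training_dir)
    if not root.is_dir():
        return []
    steps = []
    for path in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir():
            steps.append(int(match.group(1)))
    return sorted(steps)


# Demo on a throwaway directory with fake checkpoint folders
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-200", "checkpoint-1000", "checkpoint-100", "runs"):
        (Path(tmp) / name).mkdir()
    demo_steps = list_checkpoint_steps(tmp)

print(demo_steps)  # [100, 200, 1000]
```

In the notebook you would call `list_checkpoint_steps("./gpt2-training")` and pass one of the returned values as `N` when loading a checkpoint.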

In [49]:
def get_sample_after_N_steps(N, prompt, **kwargs):
    model =  GPT2LMHeadModel.from_pretrained(f"./gpt2-training/checkpoint-{N}/")
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

    output = generator(prompt, **kwargs)
    return output  

In [53]:
get_sample_after_N_steps(200, "Pokus", do_sample=True, temperature=0.5)

[{'generated_text': 'Pokus výtvarného umění.[1] V ro'}]

In [54]:
get_sample_after_N_steps(400, "Pokus", do_sample=True, temperature=0.5)

[{'generated_text': 'Pokus, kterém ateliér. V roce 1975 záž'}]

In [55]:
get_sample_after_N_steps(600, "Pokus", do_sample=True, temperature=0.5)

[{'generated_text': 'Pokus, uře vychátce se kdešskal'}]

In [89]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.70.1-py3-none-any.whl.metadata (42 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting wandb>=0.10.32 (from simpletransformers)
  Downloading wandb-0.18.6-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.7 kB)
Collecting streamlit (from simpletransformers)
  Downloading streamlit-1.40.0-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting sentencepiece (from simpletransformers)
  Downloading sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting docker-pycreds>=0.4.0 (from wandb>=0.10.32->simpletransformers)
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting gitpython!=3.1.29,>=1.0.0 (from wandb>=0.10.3

In [90]:
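from simpletransformers.classification import ClassificationModel

# Arguments passed to simpletransformers; only inference is run below
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "silent": True,
}

# Load a pretrained multilingual genre classifier (use_cuda=True requires a GPU)
model = ClassificationModel(
    "xlmroberta",
    "classla/xlm-roberta-base-multilingual-text-genre-classifier",
    use_cuda=True,
    args=model_args,
)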

config.json:   0%|          | 0.00/1.22k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/477 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]



In [92]:
from collections import defaultdict

combined_texts = defaultdict(str)

for row in dataset['train']:
    title = row['title']
    text = row['text']
    combined_texts[title] += text + " "

combined_text_list = list(combined_texts.values())

['Andrej Kuc (27. listopadu 1919, Lechnica, ČSR – 18. října 1998, Spišská Nová Ves, Slovensko) byl slovenský akademický malíř a restaurátor. Ještě v Lechnici absolvoval osm tříd Ľudové školy. Od dětství se zajímal o malířství, ale kvůli chudobě rodičů si nemohl studia dovolit. Ve věku 16 let se proto odešel vyučit malířskému řemeslu. Jako učeň působil od 14. srpna 1935 do 14. srpna 1938 u malířského mistra Františka Crháka ve Spišské Staré Vsi. Po ukončení výuky začal pracovat jako pomocník pod vedením malířského mistra Ernesta Waltera v Kežmarku, později u malířského mistra Štěpána Palubiaka v Popradu. V té době se také úspěšně podrobil zkouškám ze čtyř tříd měšťanské školy. Rodina na Zamaguří ve skromných podmínkách hospodařila na šesti hektarech polí, luk a lesa. Po ukončení studia pracoval jako svobodný umělec. Na rozsáhlejší vlastní malířskou tvorbu Andrej Kuc neměl čas, celý svůj život zasvětil obnově památek. Pracoval doma i v terénu a nejčastěji restauroval nástěnné malby. Andr

In [97]:
predictions, logit_output = model.predict(combined_text_list[:5])
predictions


[model.config.id2label[i] for i in predictions]

['Information/Explanation',
 'Information/Explanation',
 'Information/Explanation',
 'Information/Explanation',
 'Information/Explanation']

The output makes sense, because these are articles from Wikipedia.
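When classifying more than a handful of articles, it is easier to look at the label distribution than at the raw list. A small sketch with the standard library; the list below is copied from the output above, and in the notebook it would be the result of the `id2label` lookup.

```python
from collections import Counter

# Predicted genre labels for the first five articles (copied from the output above)
predicted_labels = ["Information/Explanation"] * 5

label_counts = Counter(predicted_labels)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```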