# Chapter 2 Tutorial: Understanding LLMs and Pre-training

In this tutorial, we will explore the mechanics of LLM architectures, with an emphasis on the differences between masked models and causal models. In the first section, we'll examine some existing pretrained models to understand how they produce their outputs. Once we've demonstrated how LLMs are able to do what they do, we will run an abbreviated training loop to provide a glimpse into the training process.

## Installation and Imports

In [1]:
!pip install datasets transformers[sentencepiece,torch]
!pip install apache_beam


In [2]:
import torch
from datasets import load_dataset, DatasetDict

from transformers import (
    BertTokenizer,
    BertForMaskedLM,
    GPT2Tokenizer,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    AutoConfig,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)



## Understanding Masked LMs

In [4]:
## The first model we will look at is BERT, which is trained with masked tokens. As an example,
## the text below masks the word "box" from a well-known movie quote.

text = "Life is like a [MASK] of chocolates."

In [5]:
## We'll now see how BERT is able to predict the missing word. We can use HuggingFace to load
## a copy of the pretrained model and tokenizer.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
## Next, we'll feed our example text into the tokenizer.

encoded_input = tokenizer(text, return_tensors='pt')
print('input_ids:', encoded_input['input_ids'])
print('attention_mask:', encoded_input['attention_mask'])

input_ids: tensor([[ 101, 2166, 2003, 2066, 1037,  103, 1997, 7967, 2015, 1012,  102]])
attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


In [7]:
## input_ids represents the tokenized output. Each integer can be mapped back to the corresponding string.

print(tokenizer.decode([7967]))

chocolate


In [8]:
## The model will then receive the output of the tokenizer. We can look at the BERT model to see exactly how
## it was constructed and what the outputs will be like.

model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [9]:
## The model starts by embedding each of the 30,522 possible tokens into 768 dimensions. At this point, each
## vector simply represents its token, without any information about its relationships to the other tokens
## in the text. Then the 12 encoder attention blocks are applied, updating the embeddings so that they encode
## each token's interactions with the other tokens in the text. Notably, this includes the [MASK] token as well.
## The final stage is the language model head, which maps the embedding at every position back to 30,522
## dimensions. At the masked position, each entry of this vector is a logit (an unnormalized score) for the
## corresponding vocabulary token; applying softmax turns the logits into probabilities of filling the mask.


model_output = model(**encoded_input)
output = model_output["logits"]

print(output.shape)

torch.Size([1, 11, 30522])


In [10]:
tokens = encoded_input['input_ids'][0].tolist()
masked_index = tokens.index(tokenizer.mask_token_id)
logits = output[0, masked_index, :]

print(logits.shape)

torch.Size([30522])


In [11]:
probs = logits.softmax(dim=-1)
values, predictions = probs.topk(5)
sequence = tokenizer.decode(predictions)

print('Top 5 predictions:', sequence)
print(values)

Top 5 predictions: box bag bowl jar cup
tensor([0.1764, 0.1688, 0.0419, 0.0336, 0.0262], grad_fn=<TopkBackward0>)


Printing the top 5 predictions and their probabilities, we see that BERT correctly ranks "box" as the most likely replacement for the mask token (about 0.18), only narrowly ahead of "bag" (about 0.17).
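
The softmax and top-k steps used above can be sketched in plain Python. This is only an illustration of the arithmetic, not how PyTorch implements it; the six-word vocabulary and the logit values are hypothetical and were chosen for readability, not taken from BERT.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k most likely entries."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits over a toy 6-token vocabulary (not real BERT output).
toy_vocab = ["box", "bag", "bowl", "jar", "cup", "the"]
toy_logits = [4.0, 3.9, 2.5, 2.3, 2.0, -1.0]

toy_probs = softmax(toy_logits)
for idx, p in top_k(toy_probs, 3):
    print(f"{toy_vocab[idx]}: {p:.3f}")
```

Because softmax preserves the ordering of the logits, the top-k tokens are the same whether we rank by logit or by probability; the softmax only makes the scores interpretable as a distribution.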

## Understanding Causal LMs

In [12]:
## We now repeat a similar exercise with the causal LLM GPT-2. This model generates
## text following an input, instead of replacing a mask within the text.

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [13]:
## We can examine the model again, noting the similarities to BERT: an embedding layer, 12 attention blocks,
## and a linear transformation that brings the output back to the vocabulary size. GPT-2 uses a different
## tokenizer from BERT, so the vocabulary is larger this time (50,257 tokens versus 30,522).

model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
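
One important difference between the two models does not show up in the printed modules: which positions each token is allowed to attend to. A masked model like BERT lets every token see the whole input, while a causal model like GPT-2 lets each token see only itself and earlier positions. The sketch below builds both visibility patterns for a four-token input; the function name `attention_mask` is a hypothetical helper for illustration, not part of either library.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid; 1 means the row position may attend to the column position."""
    return [
        [1 if (not causal or col <= row) else 0 for col in range(seq_len)]
        for row in range(seq_len)
    ]

bidirectional = attention_mask(4, causal=False)  # BERT-style: every token sees every token
causal = attention_mask(4, causal=True)          # GPT-2-style: no looking ahead

print("bidirectional:")
for row in bidirectional:
    print(row)

print("causal:")
for row in causal:
    print(row)
```

The lower-triangular causal pattern is what allows GPT-2 to be trained to predict the next token at every position at once without seeing the answer.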

In [14]:
## We'll use a different text example, since this model works by producing tokens sequentially
## rather than filling a mask.

text = "Swimming at the beach is"
model_inputs = tokenizer(text, return_tensors='pt')
model_inputs

{'input_ids': tensor([[10462, 27428,   379,   262, 10481,   318]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [15]:
## The model returns logits for every position. The logits at the last position are the ones that
## score the next token, so we select that vector with the index -1.

output = model(**model_inputs)
next_token_logits = output.logits[:, -1, :]
next_token = torch.argmax(next_token_logits, dim=-1)
print(next_token)

tensor([257])


In [16]:
## Now add the new token to the end of the text, and feed all of it back to the model to continue
## predicting more tokens.

model_inputs['input_ids'] = torch.cat([model_inputs['input_ids'], next_token[:, None]], dim=-1)
model_inputs["attention_mask"] = torch.cat([model_inputs['attention_mask'], torch.tensor([[1]])], dim=-1)
print(model_inputs)

{'input_ids': tensor([[10462, 27428,   379,   262, 10481,   318,   257]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


In [17]:
## Here's what we have so far. The model added the word 'a' to the input text.

print(tokenizer.decode(model_inputs['input_ids'][0]))

Swimming at the beach is a


In [18]:
## Repeating all the previous steps, we then add the word 'great'.

output = model(**model_inputs)
next_token_logits = output.logits[:, -1, :]
next_token = torch.argmax(next_token_logits, dim=-1)
model_inputs['input_ids'] = torch.cat([model_inputs['input_ids'], next_token[:, None]], dim=-1)
model_inputs["attention_mask"] = torch.cat([model_inputs['attention_mask'], torch.tensor([[1]])], dim=-1)
print(tokenizer.decode(model_inputs['input_ids'][0]))

Swimming at the beach is a great


In [19]:
## HuggingFace automates this iterative process. We'll use the quicker approach to finish our sentence.

output_generate = model.generate(**model_inputs, max_length=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_generate[0]))

Swimming at the beach is a great way to get a little extra energy.

The beach
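
The manual steps above (score the next token, take the argmax, append, repeat) amount to greedy decoding. The loop below is a minimal sketch of that idea, not the actual `generate` implementation. To keep it runnable without downloading a model, `toy_next_token_scores` is a hypothetical stand-in for GPT-2 whose scores depend only on the last token.

```python
def toy_next_token_scores(token_ids):
    """Hypothetical stand-in for a language model: scores depend only on the last token."""
    table = {
        0: [0.0, 2.0, 0.5, 0.1],
        1: [0.1, 0.0, 3.0, 0.2],
        2: [0.3, 0.2, 0.0, 1.5],
        3: [0.0, 0.0, 0.0, 0.0],
    }
    return table[token_ids[-1]]

def greedy_generate(token_ids, max_new_tokens, eos_id):
    """Repeatedly append the highest-scoring next token until EOS or the token budget is reached."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

generated = greedy_generate([0], max_new_tokens=5, eos_id=3)
print(generated)
```

Greedy decoding always picks the single most likely token, which is why it tends to produce safe but repetitive text; sampling-based strategies trade some of that determinism for variety.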


## Pre-training a GPT-2 model from scratch

Next we'll train a GPT-2 model from scratch using English Wikipedia data. Note that we're only using a tiny subset of the data to demonstrate that the model is capable of learning. The exact same approach could be followed on the full dataset to train a more functional model, but that would require a lot of compute.

In [20]:
# Load English Wikipedia from the Hub using its full dataset ID, 'wikimedia/wikipedia'
dataset = load_dataset("wikimedia/wikipedia", "20231101.en")

# Adding a seed makes the shuffle operation reproducible
ds_shuffle = dataset['train'].shuffle(seed=42)

# Keep 50 articles for training and a different 50 for validation
raw_datasets = DatasetDict(
    {
        "train": ds_shuffle.select(range(50)),
        "valid": ds_shuffle.select(range(50, 100))
    }
)

# Print the result to see it
print(raw_datasets)

Downloading data: 100%|██████████| 41/41 [06:11<00:00,  9.07s/files]
Generating train split: 100%|██████████| 6407814/6407814 [00:41<00:00, 154757.20 examples/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 50
    })
    valid: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 50
    })
})


In [21]:
print(raw_datasets['train'][0]['text'][:200])

HMP Hull is a Category B men's local prison located in Kingston upon Hull in England. The term 'local' means that this prison holds people on remand to the local courts. The prison is operated by His 


In [22]:
## We'll tokenize the text, setting the context size to 128 and thus breaking each document into chunks of 128 tokens.

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("gpt2")

outputs = tokenizer(
    raw_datasets["train"][:2]["text"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 10
Input chunk lengths: [128, 128, 128, 128, 128, 93, 128, 128, 128, 30]
Chunk mapping: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
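
The chunk lengths above follow from simple slicing: each article's token list is cut into consecutive windows of `context_length` tokens, and the final window holds whatever is left over. Here is a minimal sketch of that behavior in plain Python, assuming no overlap between windows. The helper names are hypothetical, and the 733-token document is a made-up stand-in sized to match the first article (5 × 128 + 93).

```python
def chunk_token_ids(token_ids, context_length):
    """Split a token list into consecutive windows of at most context_length tokens."""
    return [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]

def keep_full_chunks(chunks, context_length):
    """Drop any chunk shorter than context_length (typically only the last one)."""
    return [c for c in chunks if len(c) == context_length]

# Hypothetical document of 733 token IDs, sized to match the first article above.
doc = list(range(733))
chunks = chunk_token_ids(doc, 128)
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)

full = keep_full_chunks(chunks, 128)
print(len(full), "full chunks kept")
```

Dropping the short final chunk means every training example has the same length and no padding is needed, at the cost of discarding a little text from the end of each article (and all of any article shorter than 128 tokens).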


In [23]:
def tokenize(element):
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets

Map: 100%|██████████| 50/50 [00:00<00:00, 1952.33 examples/s]
Map: 100%|██████████| 50/50 [00:00<00:00, 1721.35 examples/s]


DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 223
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 248
    })
})

Now we can set up the HuggingFace Trainer as follows. Since we're using such a small dataset and all of the parameters are randomly initialized at the outset, we'll need lots of epochs for the model to make progress. By contrast, most LLMs are typically pre-trained for only about one epoch, over a much larger and more diverse set of examples.

In [24]:
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

model = GPT2LMHeadModel(config)

In [25]:
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
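
With `mlm=False`, the collator prepares batches for causal language modeling: as far as we understand its behavior, the labels are a copy of the input IDs with padding positions replaced by -100, and the model itself shifts them by one position when computing the loss. The plain-Python sketch below illustrates that idea only; the helper names and the toy token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def causal_lm_labels(input_ids, pad_token_id):
    """Copy the inputs as labels, masking out padding so it does not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in input_ids]

def next_token_pairs(input_ids, labels):
    """Pair each input token with the label one step ahead, skipping ignored labels."""
    return [
        (inp, tgt)
        for inp, tgt in zip(input_ids[:-1], labels[1:])
        if tgt != IGNORE_INDEX
    ]

toy_ids = [11, 22, 33, 44, 0, 0]  # hypothetical IDs; 0 plays the role of the pad token
labels = causal_lm_labels(toy_ids, pad_token_id=0)
pairs = next_token_pairs(toy_ids, labels)

print(labels)
print(pairs)
```

In this tutorial every chunk is exactly 128 tokens long, so no padding is actually added; the pad token still has to be set because the collator expects one.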

In [26]:
import sys
import transformers

# Record the Python environment and library version used for this run
print("Python Executable:", sys.executable)
print("Transformers Version:", transformers.__version__)

Python Executable: /opt/venv/bin/python3
Transformers Version: 4.56.1


In [30]:
# Training configuration (argument names as of transformers 4.56)
args = TrainingArguments(
    output_dir="data/wiki-gpt2",
    # eval_strategy="steps",        # Uncomment to evaluate during training (older releases call this evaluation_strategy)
    eval_steps=500,                 # Only takes effect when eval_strategy="steps" is set
    num_train_epochs=100,
    # Save a checkpoint and log the training loss every 500 steps
    save_strategy="steps",
    save_steps=500,
    logging_steps=500
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"]
)



In [31]:
import sys
import inspect

import transformers
from transformers import TrainingArguments

# Diagnostic: confirm which installation of transformers provides TrainingArguments
print("--- Debugging Information ---")
print("Python Executable:", sys.executable)
print("Transformers Version:", transformers.__version__)
try:
    # Print the file path of the TrainingArguments class being used
    print("TrainingArguments is from file:", inspect.getfile(TrainingArguments))
except TypeError:
    print("Could not get file for TrainingArguments. It might be a built-in or dynamically generated class.")
print("---------------------------\n")

--- Debugging Information ---
Python Executable: /opt/venv/bin/python3
Transformers Version: 4.56.1
TrainingArguments is from file: /opt/venv/lib/python3.12/site-packages/transformers/training_args.py
---------------------------



In [32]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
500,5.5167
1000,2.1565
1500,0.5396
2000,0.1101
2500,0.0421


TrainOutput(global_step=2800, training_loss=1.4966388651302882, metrics={'train_runtime': 508.2876, 'train_samples_per_second': 43.873, 'train_steps_per_second': 5.509, 'total_flos': 1456703078400000.0, 'train_loss': 1.4966388651302882, 'epoch': 100.0})
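
The `global_step=2800` figure can be reproduced with a little arithmetic. The sketch below assumes the Trainer's default per-device batch size of 8, a single device, and no gradient accumulation; none of these is set explicitly in the arguments above, so treat them as assumptions.

```python
import math

num_train_examples = 223   # rows in tokenized_datasets["train"]
num_epochs = 100           # num_train_epochs
batch_size = 8             # assumed: default per-device batch size on one device

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

This also explains why the loss table has only five rows: with `logging_steps=500`, the loss is reported at steps 500 through 2,500, and training ends at step 2,800 before the next report.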

In [33]:
trainer.evaluate()

{'eval_loss': 9.363317489624023,
 'eval_runtime': 1.6761,
 'eval_samples_per_second': 147.961,
 'eval_steps_per_second': 18.495,
 'epoch': 100.0}

The training loss is low by the end (about 0.04 at step 2,500), which means the model should perform very well on training examples it has seen. It does not generalize to the validation set, of course: the evaluation loss is about 9.36, because we deliberately overfit on a small training set.
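
To put the evaluation loss in context, we can convert it to perplexity and compare it with the loss of a model that guesses uniformly over the vocabulary. This is a back-of-the-envelope sketch that assumes the reported loss is the mean per-token cross-entropy in nats, as is usual for this kind of loss.

```python
import math

eval_loss = 9.363317489624023   # from trainer.evaluate() above
vocab_size = 50257              # size of the GPT-2 vocabulary

perplexity = math.exp(eval_loss)       # effective number of equally likely choices per token
uniform_loss = math.log(vocab_size)    # loss of a model that assigns every token equal probability

print(f"validation perplexity: {perplexity:.0f}")
print(f"uniform-guess loss:    {uniform_loss:.2f}")
```

The validation loss sits only a little below the uniform-guess baseline, which suggests the model has learned almost nothing that transfers beyond the 50 training articles.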

We can confirm with a couple of examples that were seen in training.

In [34]:
text = tokenizer.decode(tokenized_datasets["train"][0]['input_ids'][:16])
print(text)

HMP Hull is a Category B men's local prison located in Kingston upon Hull


In [35]:
model_inputs = tokenizer(text, return_tensors='pt')
print(model_inputs['input_ids'].shape)

torch.Size([1, 16])


In [36]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model_inputs['input_ids'] = model_inputs['input_ids'].to(device)
model_inputs['attention_mask'] = model_inputs['attention_mask'].to(device)

output_generate = model.generate(**model_inputs, max_new_tokens=16)
output_generate

tensor([[   39,  7378, 28238,   318,   257, 21743,   347,  1450,   338,  1957,
          3770,  5140,   287, 34612,  2402, 28238,   287,  4492,    13,   383,
          3381,   705, 12001,     6,  1724,   326,   428,  3770,  6622,   661,
           319,   816]], device='cuda:0')

In [37]:
sequence = tokenizer.decode(output_generate[0])
print(sequence)

HMP Hull is a Category B men's local prison located in Kingston upon Hull in England. The term 'local' means that this prison holds people on rem


The model recites the training text verbatim after seeing it so many times. This suggests that the tokenizer, model architecture, and training objective are well-suited to learning Wikipedia data. For comparison, we'll try this model on text from the validation set.

In [38]:
text = tokenizer.decode(tokenized_datasets["valid"][0]['input_ids'][:32])
print(text)

Irma Helena Karvikko (29 September 1909, in Turku – 16 September 1994; surname until 1933 Blomqvist) was a Finnish journalist and


In [39]:
model_inputs = tokenizer(text, return_tensors='pt')

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model_inputs['input_ids'] = model_inputs['input_ids'].to(device)
model_inputs['attention_mask'] = model_inputs['attention_mask'].to(device)

output_generate = model.generate(**model_inputs, max_new_tokens=16)
sequence = tokenizer.decode(output_generate[0])
print(sequence)

Irma Helena Karvikko (29 September 1909, in Turku – 16 September 1994; surname until 1933 Blomqvist) was a Finnish journalist and Political Science (After Charles Ray), Doeringer's photographs each contain himself standing


In [40]:
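## For reference, here is the first raw validation article. Note that it is not the Karvikko article
## used above: it is too short to fill a 128-token chunk, so tokenize() dropped it.
raw_datasets['valid'][0]['text']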

"Nikolai Petrovich Ostroumov (; 1846–1930) was an imperial Russian orientalist, ethnographer and educationalist in Turkestan.\n\nHe studied under Nikolai Il'minskii at the Kazan Theological Seminary, where he studied Arabic and Turkic languages as well as Islam.\n\nHe was editor of Turkistan Wilayatining Gazeti from 1883 to 1917.\n\nReferences\n\nRussian educators\nTurkestan\n1846 births\n1930 deaths"

As expected, our model is completely confused this time. We'd need to train for much longer, and on much more diverse data, before we would have a model that can sensibly complete prompts it has never seen before. This is precisely why pre-training is such an important and powerful technique. If we had to train on all of Wikipedia for every NLP application to achieve optimal performance, it would be prohibitively expensive. But there's no need to do that when we can share and reuse existing pre-trained models as we did in the first part of this tutorial.

In [41]:
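## For comparison, here is the actual text of that validation chunk. Each chunk holds only 128 tokens,
## so the [:256] slice simply returns the whole chunk.
text = tokenizer.decode(tokenized_datasets["valid"][0]['input_ids'][:256])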
print(text)

Irma Helena Karvikko (29 September 1909, in Turku – 16 September 1994; surname until 1933 Blomqvist) was a Finnish journalist and politician. She was Deputy Minister for Social Affairs from 17 November 1953 to 4 May 1954 and Minister for Social Affairs from 27 May to 1 September 1957. She was a member of the Parliament of Finland, representing the National Progressive Party from 1948 to 1951, the People's Party of Finland from 1951 to 1958 and from 1962 to 1965 and the Liberal People's Party from 1965 to 1970.

References

1909 births
1994 deaths
People from Turku
People from Turk
