# Quick Overview Flow of LLM Training

1. Tokenization: text is split into tokens and mapped to token IDs using BPE (Byte Pair Encoding)

2. Token Embeddings: each token ID is mapped to a learned vector (embedding) that represents that token

3. For each transformer layer:
    - RMSNorm
    - QKV projection
    - Apply RoPE to Q and K
    - Causal self-attention (+KV cache at inference time)
    - Output projection + residual
    - RMSNorm
    - MLP (SwiGLU) + residual
4. Final hidden layer -> LM Head logits -> softmax

5. Training: cross-entropy loss against the next token -> backprop -> update weights

6. Inference: sample/greedy/beam -> loop token by token
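Steps 4 to 6 can be sketched on toy numbers. This is only an illustration in plain Python: the 4-token vocabulary, the logits, and the target id are all made up, and a real model does the same thing on tensors over the full vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (subtract the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log-probability the model assigns to the true next token."""
    return -math.log(softmax(logits)[target_id])

# Hypothetical 4-token vocabulary; the true next token is id 2.
toy_logits = [1.0, 0.5, 3.0, -1.0]
probs = softmax(toy_logits)                                   # step 4: logits -> softmax
loss = cross_entropy(toy_logits, target_id=2)                 # step 5: training loss
greedy_id = max(range(len(probs)), key=probs.__getitem__)     # step 6: greedy pick
print(probs, loss, greedy_id)
```

The loss shrinks as the probability of the correct token approaches 1, which is what backprop pushes the weights toward.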

# Post Training

## Common Modern Stack

1. SFT (supervised fine-tuning)

Then either:
- RLHF (PPO)
- DPO (simpler preference tuning)
- GRPO (DeepSeek-ish, no critic model)
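To see why DPO counts as the "simpler" option: its loss is a plain classification-style loss on one (chosen, rejected) pair, with no reward model and no PPO loop. A minimal sketch for a single pair, assuming the four inputs are summed sequence log-probabilities from the policy and from a frozen reference model (the numbers below are made up):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written as log(1 + exp(-z))
    return math.log1p(math.exp(-z))

baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)        # policy identical to reference
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)    # policy now favors the chosen answer
print(baseline, improved)
```

The loss drops below its starting value of log 2 once the policy raises the chosen answer relative to the rejected one more than the reference does.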

# Baseline
### Pick a small instruct model, build a deterministic parsing prompt + JSON Schema validator

Limited to text generation


Libraries used:
- PyTorch
- Transformers

Model Size:
- 1B to 9B parameters

# Models

## Type of LM Models (Language Models)

### Representation Models (Encoder-Only)

Models like BERT are feature extraction models. They don't generate text but are good at understanding context and creating embeddings for tasks like classification or named-entity recognition.

### Generative Models (Decoder-Only)

These models, such as GPT and Llama, are completion machines designed to generate text one token at a time. These are the ones you use to write or generate text.

## Causal Language Models (CLM)

The Goal: The model is trained to predict the next token in a sequence based only on the tokens that came before it.

Autoregressive Nature: Masked self-attention lets each position attend only to earlier tokens, never to future ones. At inference time the model predicts a token, appends it to the sequence, and then predicts the next one.
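A small sketch of the causal mask in plain Python. The attention scores are made-up numbers; a real implementation applies the same idea to tensors, usually by setting masked scores to negative infinity before the softmax.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked positions get weight 0."""
    visible = [s for s, keep in zip(scores, mask_row) if keep]
    m = max(visible)
    exps = [math.exp(s - m) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Hypothetical raw attention scores for position 1 (the second token).
weights = masked_softmax([0.2, 1.0, 3.0, 0.5], mask[1])
print(weights)
```

Even though position 2 has the highest raw score, it receives zero weight because it lies in the future of position 1.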

## Masked Language Modeling (MLM)

- Used by encoder models.
- Instead of predicting the next token, the model predicts masked-out tokens in the middle of the sequence.
- This helps the model learn a deep understanding of how the words relate to each other bidirectionally.

## Sequence to Sequence / Encoder-Decoder

- This architecture uses an encoder to read the input and a decoder to write the output.
- Useful for translation or summarization tasks.

## Contrastive Learning

- Technique used to train embedding models.
- It teaches a model to recognize when two pieces of text are similar and when they are different by comparing pairs of data.
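One common way to write this down is an InfoNCE-style loss. The sketch below is an assumption about the loss shape, not something these notes commit to: it uses cosine similarity, a temperature of 0.1, and tiny hand-made 2-d "embeddings".

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Loss is low when the anchor is closer to the positive than to every negative."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    scaled = [s / temperature for s in sims]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_sum - scaled[0]  # -log softmax of the positive pair

# Hypothetical embeddings: the positive points the same way as the anchor.
good = info_nce([1.0, 0.0], [0.9, 0.1], negatives=[[0.0, 1.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], negatives=[[0.9, 0.1]])
print(good, bad)
```

Training on many such pairs pulls similar texts together and pushes dissimilar ones apart in embedding space.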

 

## [gpt2](https://huggingface.co/openai-community/gpt2)

## [meta-llama/Llama-3.1-8B-Instruct (Will not use, too large)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

**Type**: Instruction-tuned chat model

- 8 billion parameters (small to medium sized)
- Instruct model = pretrained base model that was further fine-tuned to follow instructions
- Context window = 128,000 tokens

**Architecture** = Dense Transformer (not a Mixture of Experts, not a State Space Model, not a Recurrent Neural Network)


**Knowledge Cutoff** = Dec 2023

**Attention Mechanism** = Grouped-Query Attention

## [meta-llama/Llama-3.2-1B-Instruct (we will be using this)](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)

**Type**:
- decoder-only
- autoregressive transformer

**Key Bits:**
- Grouped Query Attention
- RoPE positional embedding
- RMSNorm 
- SwiGLU MLP
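As a concrete picture of one of these pieces, here is a minimal RMSNorm sketch in plain Python. Unlike LayerNorm it does not subtract the mean; it only rescales by the root-mean-square. The `eps` value and the input vector are illustrative assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root-mean-square, then by a learned per-dimension weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

hidden = [3.0, -4.0, 0.0, 0.0]               # hypothetical hidden vector, RMS = 2.5
normed = rms_norm(hidden, weight=[1.0] * 4)  # weight of ones = no learned rescaling
print(normed)
```

After normalization the vector has an RMS of about 1, which keeps activations on a stable scale from layer to layer.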

**Size / Core hyperparameters**
- 16 layers
- 2048 hidden size
- 32 attention heads
- 8 KV heads
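The gap between 32 attention heads and 8 KV heads is what Grouped-Query Attention means in practice: several query heads share one key/value head, which shrinks the KV cache. A rough sketch of the arithmetic follows; the layer count, head dimension, and sequence length in the cache example are toy values chosen for illustration, not this model's config.

```python
def queries_per_kv_head(n_heads, n_kv_heads):
    """How many query heads share each key/value head under GQA."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    return n_heads // n_kv_heads

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

group = queries_per_kv_head(n_heads=32, n_kv_heads=8)
# Toy comparison: full multi-head attention vs. GQA with the same head count.
mha = kv_cache_bytes(n_layers=2, n_kv_heads=32, head_dim=64, seq_len=1000)
gqa = kv_cache_bytes(n_layers=2, n_kv_heads=8, head_dim=64, seq_len=1000)
print(group, mha / gqa)
```

With 4 query heads per KV head, the cache is a quarter of what full multi-head attention would need.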

**Knowledge Cutoff**: Dec 2023

**Context Window:** 128K tokens



Aligned with SFT and RLHF for assistant-style behavior.



In [37]:
import os
from transformers import AutoTokenizer, GPT2Tokenizer, GPT2Model, AutoModelForCausalLM
from huggingface_hub import login
import torch
import yaml

In [None]:
# select model from huggingface
# model_llama3_id = "meta-llama/Llama-3.1-8B-Instruct" # this model is too big to run
model_gpt2_id = "gpt2"

Get the Hugging Face access token and check that it works

In [39]:
# get token from environment variable
token = os.environ.get("HF_TOKEN")
# checking if the token works or not
if token:
    try:
        login(token=token)
    except Exception as e:
        print(f"Error logging in: {e}")
else:
    print("Hugging Face token not found in environment variables.")

Hugging Face token not found in environment variables.


Load the model's tokenizer

In [40]:
# load the tokenizer, which splits text into BPE tokens and maps them to token IDs
tokenizer_gpt2 = AutoTokenizer.from_pretrained(model_gpt2_id, token=token)

Prompt engineering:

In [41]:
os.getcwd()

'/Users/paulpark/SandBox/CalendarProject'

In [42]:
with open("./Notebooks/initial_prompt.yaml") as f:
    initial_prompt = yaml.safe_load(f)["system_prompt"]

gpt 2

In [43]:
# Tokenize the text
# convert the text to token IDs + attention mask
# and return them as PyTorch tensors
encoded_input_gpt2 = tokenizer_gpt2(initial_prompt, return_tensors="pt") 

In [44]:
# Load the pretrained model weights
# downloads the weights (on first use) into a PyTorch model object
model_gpt2 = AutoModelForCausalLM.from_pretrained(model_gpt2_id)

In [45]:
# Forward pass (no training)
# run the input through the network layers once to compute the output
# inside the model, it:
#   looks up embeddings for each token ID (token IDs -> vectors)
#   runs those vectors through all transformer layers (attention + MLP blocks)
#   applies the LM head to produce logits (scores for the next token)
out_gpt2 = model_gpt2(**encoded_input_gpt2)

In [46]:
# Grab the logits
# Logits are the raw scores over the vocabulary, shape (batch, sequence length, vocab size)
# the scores at the last position are the ones for the next token
logits_gpt2 = out_gpt2.logits

Testing out the prompt, making sure prompt is working

In [47]:
prompt1 = "I have an interview tomorrow (jan 2026, 27th) at 12:30pm to 1pm. It's about CAPREIT basic interview phone call."

In [48]:
# tokenization
# converts human text to sequence of token ids
inputs_gpt2 = tokenizer_gpt2(prompt1, return_tensors="pt")

In [49]:
# disabling gradients
# gradients are needed for training (updating weights)
# for inference like generating text, keeping them wastes massive amounts of memory
with torch.no_grad():
    # this is the brain of the operation
    gen = model_gpt2.generate(
        **inputs_gpt2,     # takes input tokens
        max_new_tokens=200, # limits the response to 200 new tokens
        do_sample=False # enables greedy search, model will always pick the single most likely next word
    )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [50]:
# Decoding
# converts the model's output back into human readable string
# skip_special_tokens=True removes special tokens such as <|endoftext|>
print(tokenizer_gpt2.decode(gen[0], skip_special_tokens=True))

I have an interview tomorrow (jan 2026, 27th) at 12:30pm to 1pm. It's about CAPREIT basic interview phone call.

I have an interview tomorrow (jan 2026, 27th) at 12:30pm to 1pm. It's about CAPREIT basic interview phone call. I have an interview tomorrow (jan 2026, 27th) at 12:30pm to 1pm. It's about CAPREIT basic interview phone call. I have an interview tomorrow (jan 2026, 27th) at 12:30pm to 1pm. It's about CAPREIT basic interview phone call. I have an interview tomorrow (jan 2026, 27th) at 12:30pm to 1pm. It's about CAPREIT basic interview phone call. I have an interview tomorrow (jan 2026, 27th) at 12:30pm to 1pm. It's about CAPREIT basic interview phone call. I have an interview tomorrow (jan 2026, 27th) at 12:30pm to 1pm. It's about CAPREIT basic interview phone call.


It just repeated my input. GPT-2 is a base model (not instruction-tuned) and greedy decoding tends to loop, so this needs SFT.

Check the context window before the input and after the output.

Requires few-shot prompting.
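A minimal sketch of both ideas: building a few-shot prompt as chat messages and checking that prompt plus generation fits the window. The helper names and the example pair are hypothetical; 1024 is GPT-2's context length. GPT-2 itself has no chat format, so for it the messages would have to be joined into one plain string.

```python
def build_few_shot_messages(system_prompt, examples, user_text):
    """examples: list of (input_text, expected_json_string) pairs shown before the real request."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_text})
    return messages

def fits_context(prompt_tokens, max_new_tokens, context_window):
    """True when the prompt plus the requested generation fits in the model's window."""
    return prompt_tokens + max_new_tokens <= context_window

few_shot = build_few_shot_messages(
    "You are an event parser. Output ONLY valid JSON.",
    [("Lunch with Sam at noon", '{"title": "Lunch with Sam", "event_time": "12:00"}')],
    "Dentist on Friday at 3pm",
)
print(len(few_shot), fits_context(900, 200, 1024))
```

Each extra example costs prompt tokens, so the window check matters more as the number of shots grows.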


Llama 3.1

In [51]:
# tokenizer_llama3_mini = AutoTokenizer.from_pretrained(model_llama3_mini_id) 

In [52]:
# encoded_input_llama3 = tokenizer_llama3(text, return_tensors="pt")

In [53]:
# model_llama3 = AutoModelForCausalLM.from_pretrained(model_llama3_id, token=token)

In [54]:
# out_llama3 = model_llama3(**encoded_input_llama3)

In [55]:
# logits_llama3 = out_llama3.logits  # vocab-size scores for next token

Llama 3.2 1B Instruct model

In [56]:

model_llama3_mini_id = "meta-llama/Llama-3.2-1B-Instruct"

In [57]:
# get the tokenizer for Llama 3.2 1B Instruct
tokenizer_llama3_mini = AutoTokenizer.from_pretrained(model_llama3_mini_id)

In [58]:
# load the pretrained model and download the weights
model_llama3_mini = AutoModelForCausalLM.from_pretrained(model_llama3_mini_id)

In [82]:
messages = [
  {"role": "system", "content": "You are an event parser. Output ONLY valid JSON with keys: title,event_date,event_time,end_time,description,notifications,invitees. If date is relative, use TODAY provided by user. If you cannot find any information to the fields, just dont fill it out. Leave it blank!"},
  {"role": "user", "content": "TODAY=2026-01-27\nInterview for Company today from 10am to 11am. Interview is technical with basic python and ML knowledge."},
  #{"role": "assistant", "content": "{\"title\":\"Interview for Company\",\"event_date\":\"2026-01-27\",\"event_time\":\"10:00\",\"end_time\":\"11:00\",\"description\":\"Technical interview with basic Python and ML knowledge.\",\"notifications\":[],\"invitees\":[]}"}
]


In [None]:
# Reading input step
inputs_llama3_mini = tokenizer_llama3_mini.apply_chat_template(
    messages, # providing only system and user prompt
    add_generation_prompt=True, # appends the assistant header so the model starts its reply right after the last message
    tokenize=True, # True returns token IDs, False returns the formatted string (better for debugging)
    return_dict=True, # return input_ids and attention_mask in a dictionary
    return_tensors='pt' # return PyTorch tensors
).to(model_llama3_mini.device)

In [92]:
# Generation step
# .generate() runs a loop that predicts one token at a time:
# 1. forward pass -> logits
# 2. choose next token based on decoding strategy
# 3. append token to sequence
# 4. repeat until it hits a stop condition
outputs_llama3_mini = model_llama3_mini.generate( 
                        **inputs_llama3_mini, # unpacks the dictionary into keyword arguments
                        max_new_tokens=400, # how many tokens the model is allowed to add after the prompt
                        do_sample=False, # turns off randomness; the model uses greedy decoding (picks the highest-probability token each step)
                        temperature=0.0, # ignored because do_sample=False; with do_sample=True it controls the randomness of sampling
                        output_scores=True, # store and return the per-step scores over the vocabulary
                        return_dict_in_generate=True, # return an output object (.sequences, .scores) instead of a plain tensor of token IDs
                        )

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
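The four-step loop inside `.generate()` can be sketched without a real model. Here `toy_model` is a hypothetical stand-in that returns scores over a 5-token vocabulary; only the loop structure is meant to carry over.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, eos_id):
    """next_token_logits: function mapping the current token ids to a list of vocab scores."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)                              # 1. forward pass
        next_id = max(range(len(logits)), key=logits.__getitem__)    # 2. greedy pick
        ids.append(next_id)                                          # 3. append
        if next_id == eos_id:                                        # 4. stop condition
            break
    return ids

# Hypothetical stand-in for a model: always prefers (last id + 1), vocab size 5.
def toy_model(ids):
    target = (ids[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]

generated = greedy_generate(toy_model, prompt_ids=[0], max_new_tokens=10, eos_id=4)
print(generated)  # [0, 1, 2, 3, 4]
```

Generation stops either at the end-of-sequence token or when `max_new_tokens` runs out, whichever comes first.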


In [91]:
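# Decode the generated part only
# with return_dict_in_generate=True the token IDs live in .sequences; slice off the prompt tokens
prompt_len = inputs_llama3_mini['input_ids'].shape[-1]
print(tokenizer_llama3_mini.decode(outputs_llama3_mini.sequences[0][prompt_len:], skip_special_tokens=True))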

```json
{
  "title": "Interview for Company",
  "event_date": "2026-01-27",
  "event_time": "10:00:00",
  "end_time": "11:00:00",
  "description": "Technical interview with basic python and ML knowledge",
  "notifications": [],
  "invitees": []
}
```
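Since the model wrapped its answer in a Markdown code fence, the output has to be unwrapped before it can be parsed. A minimal validator sketch using only the standard library; it assumes all seven keys from the system prompt must be present (blank values allowed), which is my reading of the prompt rather than a fixed schema.

```python
import json

FENCE = "`" * 3  # built this way so the example itself contains no literal fence
REQUIRED_KEYS = {"title", "event_date", "event_time", "end_time",
                 "description", "notifications", "invitees"}

def extract_json(text):
    """Strip an optional Markdown code fence and parse what is left."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # drop the opening fence line (the one that may carry a "json" tag)
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # drop the closing fence
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)

def validate_event(text):
    """Return (ok, problems) for one model output."""
    try:
        event = extract_json(text)
    except json.JSONDecodeError as err:
        return False, [f"invalid JSON: {err}"]
    if not isinstance(event, dict):
        return False, ["top-level value is not an object"]
    missing = sorted(REQUIRED_KEYS - event.keys())
    extra = sorted(event.keys() - REQUIRED_KEYS)
    problems = [f"missing key: {k}" for k in missing] + [f"unexpected key: {k}" for k in extra]
    return not problems, problems

print(validate_event('{"title": "Interview for Company"}'))
```

Running this over every output gives the JSON validity rate and shows which keys the model tends to drop.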


We need to do a simple evaluation on how confident the model is in its predictions.

# MODEL EVALUATION

## Confidence / Behavior (model-level)

This metric tells you how confident the model is in its predictions, not how accurate they are.

Confidence metrics are used to diagnose the model and will help me understand whether SFT has improved model confidence.

- Logits (token-level)
- Mean Token Probability
- Mean Entropy
- Length Normalized Log Likelihood

## Accuracy / Task Success

This metric tells you how accurate the model is in its predictions, not how confident it is.

- JSON Validity Rate
- Schema Accuracy
- Exact Match
- Slot F1

## Evaluation Methods

- Log likelihood / perplexity
- Token probabilities (computed from the logits with softmax)
- Entropy (measures the model's uncertainty: high entropy = less confident, which often goes along with hallucination; low entropy = more confident)

- Mean token probability -> average probability of the emitted tokens; measures confidence, higher is better.
- Mean entropy -> average per-token uncertainty of the model; lower is better.
- Length-normalized log likelihood -> the total log likelihood keeps decreasing as the output gets longer, so dividing by the number of tokens lets you compare confidence across outputs of different lengths.
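A sketch of how these numbers come out of per-step probability distributions. In this notebook the distributions could be obtained by applying softmax to each entry of the scores returned by `generate`; here they are hand-made two-token distributions so the arithmetic is easy to check.

```python
import math

def confidence_metrics(step_probs, chosen_ids):
    """step_probs: one probability distribution (list) per generated token.
    chosen_ids: the token id actually emitted at each step."""
    chosen_p = [dist[i] for dist, i in zip(step_probs, chosen_ids)]
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in step_probs]
    n = len(chosen_ids)
    mean_log_likelihood = sum(math.log(p) for p in chosen_p) / n
    return {
        "mean_token_prob": sum(chosen_p) / n,
        "mean_entropy": sum(entropies) / n,
        "mean_log_likelihood": mean_log_likelihood,  # length-normalized
        "perplexity": math.exp(-mean_log_likelihood),
    }

# Hypothetical 2-token vocabulary: one confident step, one coin-flip step.
metrics = confidence_metrics([[0.9, 0.1], [0.5, 0.5]], chosen_ids=[0, 0])
print(metrics)
```

Perplexity is just the exponential of the negative mean log likelihood, so the two always move together.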

## Task-specific automatic metrics

- Exact match
- F1 / Precision / Recall
- Structured output accuracy

## Human Evaluation (final step for chat, assistant, and subjective outputs)

- Human preference ranking (RL)
- Helpfulness / correctness judgements
- Pairwise comparisons

### Using LOGITS (Model's Confidence)

### Using ENTROPY (measuring Uncertainty)

### Using Mean Token Probability

### Mean Entropy

### Mean Length Normalized Log Likelihood

### Schema Accuracy

### Exact Match

### Slot F1

## Model confidence
For now, we will focus on mean entropy and mean log-likelihood on the model-confidence side. The purpose is to get a quick read on how confident the model is.

Mean Log-Likelihood -> how probable were the tokens the model chose? 

Mean entropy -> how certain was the model while generating?

## Model Accuracy
For the model accuracy side, we will focus on Precision, Recall, and F1 Score.

Precision at slot level: "Of all the slots the model filled in, how many match the reference?"

Recall at slot level: "Of all the slots filled in the reference, how many did the model get right?"

F1 score at slot level: the harmonic mean of precision and recall, often used as the single headline accuracy number.

JSON Validity Rate: percentage of outputs that parse as valid JSON. Each output is a binary pass/fail.
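A minimal sketch of these metrics for flat event dicts. It assumes a slot counts as "filled" only when its value is non-empty and as correct only when key and value both match the reference; the example events are made up.

```python
import json

def slot_prf(predicted, gold):
    """Compare two flat dicts slot by slot. Empty values count as 'not filled'."""
    def filled(d):
        return {(k, str(v)) for k, v in d.items() if v not in ("", None, [], {})}
    pred, ref = filled(predicted), filled(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def json_validity_rate(outputs):
    """Share of raw output strings that parse as JSON."""
    def is_valid(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(is_valid(o) for o in outputs) / len(outputs)

gold = {"title": "Interview", "event_date": "2026-01-27", "event_time": "10:00", "invitees": []}
pred = {"title": "Interview", "event_date": "2026-01-27", "event_time": "10:30", "invitees": []}
print(slot_prf(pred, gold), json_validity_rate(['{"a": 1}', "oops"]))
```

Here two of the three filled slots match, so precision, recall, and F1 all come out at 2/3.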

# Mean Length normalized Log likelihood

# Mean Entropy

This output looks a lot better.

Now let's start the SFT!

# SFT (Supervised Fine-Tuning) for synthetic Dataset

## Pretraining

- The model learns general language from large amounts of internet text

## Supervised Fine-Tuning

- Take the base model and train it on a few thousand examples of natural-language text -> JSON pairs
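One way a single text -> JSON training pair could be stored is as a chat-style record, one per line of a JSONL file. The `messages` layout below is a common conversational format, but it is an assumption here; check it against whatever trainer ends up being used.

```python
import json

def make_sft_record(system_prompt, user_text, target_event):
    """One training example in chat-message form; the assistant turn is the gold JSON."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": json.dumps(target_event, ensure_ascii=False)},
        ]
    }

record = make_sft_record(
    "You are an event parser. Output ONLY valid JSON.",
    "TODAY=2026-01-27\nTeam sync today from 2pm to 3pm.",
    {"title": "Team sync", "event_date": "2026-01-27", "event_time": "14:00", "end_time": "15:00"},
)
jsonl_line = json.dumps(record)  # one line of the training file
print(jsonl_line)
```

Keeping the assistant turn as bare JSON, with no code fence and no chatter, teaches the model the exact output shape to reproduce.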

## Preference Tuning (RLHF/DPO)

- After the model learns the JSON format (SFT), you use preference tuning to align it, teaching it to prefer valid, clean JSON over messy or conversational output.


# Evaluate LLM

### JSON validity rate, field-level accuracy, and a business metric: % of events created without manual edits
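A sketch of the last two metrics on already-parsed outputs. The real "no manual edits" number would come from usage data; offline it can be approximated by exact match against the reference, which is the assumption made here. The example events are made up.

```python
def field_accuracy(predictions, golds, fields):
    """Per-field share of examples where the predicted value equals the gold value."""
    return {
        f: sum(p.get(f) == g.get(f) for p, g in zip(predictions, golds)) / len(golds)
        for f in fields
    }

def no_edit_rate(predictions, golds):
    """Share of examples where every field matches, i.e. no manual edit would be needed."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

golds = [{"title": "A", "event_time": "10:00"}, {"title": "B", "event_time": "09:00"}]
preds = [{"title": "A", "event_time": "10:00"}, {"title": "B", "event_time": "09:30"}]
per_field = field_accuracy(preds, golds, ["title", "event_time"])
print(per_field, no_edit_rate(preds, golds))
```

Per-field accuracy shows which field is the weak spot (here the time), while the no-edit rate is the stricter all-or-nothing view.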

# DPO
### Generate pairs (good parse vs. bad parse) from your synthetic edge cases, then train with TRL's DPO pipeline
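A sketch of how such pairs could be produced from a known-good parse. The record uses the `prompt` / `chosen` / `rejected` fields that TRL's DPO trainer works with; the three corruption modes are hypothetical examples of a "bad parse", not a fixed list.

```python
import json

def make_preference_pair(prompt, good_event, corruption="chatty"):
    """Build one {prompt, chosen, rejected} record from a known-good parse."""
    chosen = json.dumps(good_event)
    if corruption == "chatty":
        rejected = "Sure! Here is the event you asked for:\n" + chosen
    elif corruption == "truncated":
        rejected = chosen[: len(chosen) // 2]   # cut mid-object -> invalid JSON
    elif corruption == "missing_field":
        partial = dict(good_event)
        partial.pop("event_date", None)
        rejected = json.dumps(partial)
    else:
        raise ValueError(f"unknown corruption: {corruption}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

event = {"title": "Interview", "event_date": "2026-01-27", "event_time": "10:00"}
pair = make_preference_pair("TODAY=2026-01-27\nInterview today at 10am.", event, "truncated")
print(pair["rejected"])
```

Mixing several corruption types keeps the model from learning to avoid only one kind of mistake.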