---
layout: post
title: LLM Tuning
date:   2025-09-18
categories: [Python]
mermaid: true
typora-root-url: /Users/ojitha/GitHub/ojitha.github.io
typora-copy-images-to: ../../blog/assets/images/${filename}
---

## Encoder and Decoder
A transformer is built on an encoder-decoder architecture: the encoder maps the input sequence to a matrix of contextual representations, and the decoder generates the output iteratively from that representation. The Transformer[^1] relies on self-attention to model the relationships between the tokens in a sentence.

![Transformer](https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Transformer%2C_full_architecture.png/500px-Transformer%2C_full_architecture.png)

- Transformer Encoder:
  - Processes input sequence all at once (parallel processing)
  - Uses self-attention to understand relationships between all words
  - Output: Rich contextual representations of input
- Transformer Decoder:
  - Generates output sequence one token at a time (autoregressive)
  - Uses masked self-attention plus cross-attention over the encoder's output representations
  - Output: Generated sequence (translation, text generation, etc.)
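
To make the two roles concrete, here is a minimal sketch using PyTorch's built-in `nn.Transformer`. The sizes (`d_model=32`, four heads, two layers per stack) and the random tensors standing in for embedded tokens are assumptions chosen for illustration, not values from this post. The encoder handles the whole source in one call, while the decoder receives a causal mask so that each target position only sees earlier positions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 32  # illustrative embedding size (assumption)
model = nn.Transformer(
    d_model=d_model, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=64, dropout=0.0, batch_first=True,
)
model.eval()

# Random vectors stand in for embedded tokens: (batch, length, d_model).
src = torch.randn(1, 6, d_model)  # 6 source tokens
tgt = torch.randn(1, 3, d_model)  # 3 target tokens produced so far

# Causal mask: -inf above the diagonal blocks attention to future positions.
tgt_len = tgt.size(1)
causal_mask = torch.triu(
    torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1
)

with torch.no_grad():
    memory = model.encoder(src)                             # whole input at once
    out = model.decoder(tgt, memory, tgt_mask=causal_mask)  # self- + cross-attention

print(memory.shape)  # one contextual vector per source token
print(out.shape)     # one vector per target position
```

In real generation the decoder is called in a loop, appending one predicted token per step; the single masked call above is how the same computation is done in parallel during training.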

There are two types of models:
1. Auto-Regressive Models:
  - What: Predict next token based on previous tokens
  - Training: Learn $P(x_t \mid x_1, x_2, \dots, x_{t-1})$
  - Direction: Left-to-right (forward prediction)
  - Examples: GPT, Claude, traditional language models
  - Use cases: Text generation, completion, dialogue
```
Input:  "The cat sat on" → Predict: "the"
```
2. Auto-Encoding Models:
  - What: Reconstruct corrupted/masked input
  - Training: Learn to fill in missing pieces from context
  - Direction: Bidirectional (sees both left and right context)
  - Examples: BERT, RoBERTa, DeBERTa
  - Use cases: Text understanding, classification, Q&A  
```
Input:  "The [MASK] sat on mat" → Predict: "cat"
```
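
The difference between the two objectives can be sketched with a toy, count-based stand-in for a neural model. Everything here is an assumption made for illustration: the tiny corpus, the helper names `predict_next` and `fill_mask`, and the simplification of conditioning on only one neighbouring token on each side rather than the full context.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (assumption, not real training data).
corpus = "the cat sat on the mat . the cat sat on the sofa .".split()

# Auto-regressive: count which token follows each token (left context only).
next_counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    next_counts[prev][cur] += 1

def predict_next(prev_token):
    """Most frequent token seen after prev_token."""
    return next_counts[prev_token].most_common(1)[0][0]

# Auto-encoding: count which token appears between a left and a right neighbour.
fill_counts = defaultdict(Counter)
for left, mid, right in zip(corpus, corpus[1:], corpus[2:]):
    fill_counts[(left, right)][mid] += 1

def fill_mask(left_token, right_token):
    """Most frequent token seen between left_token and right_token."""
    return fill_counts[(left_token, right_token)].most_common(1)[0][0]

print(predict_next("on"))       # "... sat on" -> next token
print(fill_mask("the", "sat"))  # "the [MASK] sat" -> masked token
```

The auto-regressive helper never looks to the right of the position it predicts, while the mask-filling helper uses both sides, which mirrors the left-to-right versus bidirectional distinction above.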

Self-Attention: Each word "looks at" all other words to understand context and relationships.  

![Autoregressive Models](/../blog/assets/images/2025_09_18_LLMTuningWithPyTorch/autoregressive_models.svg)

> GPT and BERT are Transformers, but they are different language models.
{:.info-box}

### Attention
Standard attention is calculated from three matrices:

1. query
2. key
3. value
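
These three matrices combine in scaled dot-product attention, $\text{softmax}(QK^\top / \sqrt{d_k})\,V$. The sketch below is a single-head version under stated assumptions: the sizes are arbitrary, and the projection matrices `W_q`, `W_k`, and `W_v` are random placeholders for weights that a real model would learn.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model, d_k = 4, 8, 8    # illustrative sizes (assumption)
x = torch.randn(seq_len, d_model)  # one embedding vector per token

# Placeholder projections; in a trained model these are learned parameters.
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Similarity of every query with every key, scaled by sqrt(d_k).
scores = Q @ K.T / math.sqrt(d_k)        # (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                     # (seq_len, d_k)

# Causal variant: hide future positions before the softmax.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_weights = torch.softmax(
    scores.masked_fill(~causal, float("-inf")), dim=-1
)

print(output.shape)
print(causal_weights)
```

Row `i` of `weights` says how much token `i` attends to every other token. In `causal_weights` the entries above the diagonal are zero, which is the restriction an auto-regressive decoder applies.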

```python
import torch
```

A 1-dimensional tensor:

```python
one_d_tensor = torch.LongTensor([0])
print(f'Shape of {one_d_tensor} is {one_d_tensor.shape} and dimension is {one_d_tensor.dim()}')
```

```
Shape of tensor([0]) is torch.Size([1]) and dimension is 1
```

```python
one_d_tensor = torch.LongTensor([0,1,2])
print(f'Shape of {one_d_tensor} is {one_d_tensor.shape} and dimension is {one_d_tensor.dim()}')
```

```
Shape of tensor([0, 1, 2]) is torch.Size([3]) and dimension is 1
```

A 2-dimensional tensor:

```python
two_d_tensor = torch.LongTensor([[0,1,2],[3,4,5]])
print(two_d_tensor.shape)
print(f'Shape of {two_d_tensor} is {two_d_tensor.shape} and dimension is {two_d_tensor.dim()}')
```

```
torch.Size([2, 3])
Shape of tensor([[0, 1, 2],
        [3, 4, 5]]) is torch.Size([2, 3]) and dimension is 2
```

```python
one_d_tensor = torch.LongTensor([0,1,2])
# unsqueeze(0) adds a new leading axis: shape [3] becomes [1, 3]
two_d_tensor = one_d_tensor.unsqueeze(0)
print(f'Shape of {two_d_tensor} is {two_d_tensor.shape} and dimension is {two_d_tensor.dim()}')
```

```
Shape of tensor([[0, 1, 2]]) is torch.Size([1, 3]) and dimension is 2
```

```python
two_d_tensor
```

```
tensor([[0, 1, 2]])
```
```python
# Convert to a NumPy array; the shape (1, 3) is preserved
two_d_tensor.numpy()
```

```
array([[0, 1, 2]])
```
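
The position passed to `unsqueeze` decides where the new axis goes, which is what separates a row-shaped tensor from a column-shaped one. A short sketch follows; the variable names `row` and `col` are chosen here for illustration.

```python
import torch

values = torch.LongTensor([0, 1, 2])

row = values.unsqueeze(0)  # new axis in front: shape [1, 3]
col = values.unsqueeze(1)  # new axis at the end: shape [3, 1]

print(row.shape, col.shape)
print(col.numpy())           # three rows, one column
print(col.squeeze(1).shape)  # squeeze removes the axis again: shape [3]
```

Adding a leading axis with `unsqueeze(0)` is the usual way to turn a single sequence into a batch of one before passing it to a model.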

## AI Product Development Lifecycle

```mermaid
block-beta
    columns 5
    
    A["1. Prototype"]:1
    B["2. Evals"]:1
    C["3. Maximize"]:1
    D["4. Optimize"]:1
    E["5. Did you
    solve it?"]:1
    
    A1["Where does
    AI fit in?"]:1
    B1["Define evals"]:1
    C1["Prompt engineering"]:1
    D1["Latency
    and/or cost"]:1
    E1["Don't give up,
    keep going!"]:1
    
    A2["Copilot vs agent?"]:1
    B2["Make dataset"]:1
    C2["CoT, self-reflection,
    strong models"]:1
    D2["Move tasks to
    smaller models"]:1
    E2["Otherwise add
    HITL until"]:1
    
    A3["What's the
    MVP experience?"]:1
    B3["Establish
    baseline"]:1
    C3["RAG
    improvements"]:1
    D3["Make
    prompts/outputs
    smaller"]:1
    E3["next-gen models
    or rethink"]:1
    
    A4["How good does
    AI need to be?"]:1
    B4["Set a goal"]:1
    C4["Fine-tuning
    and distillation"]:1
    D4["Fine-tune
    as needed"]:1
    E4["your product"]:1
    
    space:1
    space:1
    C5["Multi-agent"]:1
    space:1
    space:1
    
    space:1
    space:1
    C6["Tools"]:1
    space:1
    space:1
    
    A --> B
    B --> C
    C --> D
    D --> E
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
```
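
The "Evals" column (define evals, make a dataset, establish a baseline, set a goal) can be sketched in a few lines. This is only a rough illustration: the eval set, the `baseline_model` stand-in, the exact-match metric, and the `GOAL` threshold are all hypothetical placeholders, not part of any real product or library.

```python
def exact_match_accuracy(predict, dataset):
    """Fraction of examples where predict(prompt) matches the expected answer."""
    hits = sum(
        predict(prompt).strip().lower() == expected.lower()
        for prompt, expected in dataset
    )
    return hits / len(dataset)

# Hypothetical eval set: (prompt, expected answer) pairs.
eval_set = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]

def baseline_model(prompt):
    """Stand-in for a real model call; returns canned answers."""
    canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "unknown")

GOAL = 0.9  # hypothetical target accuracy
baseline = exact_match_accuracy(baseline_model, eval_set)
print(f"baseline={baseline:.2f}, goal={GOAL}, gap={GOAL - baseline:.2f}")
```

With a baseline and a goal in place, each change made in the "Maximize" and "Optimize" stages can be re-scored against the same dataset.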

[^1]: [Attention Is All You Need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf){:target="_blank"}


{:gtxt: .message color="green"}

{:ytxt: .message color="yellow"}

{:rtxt: .message color="red"}