In [None]:
# MIT Introduction to Deep Learning
# http://introtodeeplearning.com

In [None]:
import os
import json
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import DataLoader

from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
# !pip install lion-pytorch
from lion_pytorch import Lion

import warnings
warnings.filterwarnings("ignore")

# Fine Tuning
[Fine-Tuning](https://huggingface.co/docs/transformers/en/training) is a machine learning and AI technique that adapts a pre-trained model for a specific task by training it further on a specialized dataset. Instead of building a model from scratch, this process starts with an existing model that has broad general knowledge and then modifies its parameters to enhance performance, add new skills, or improve accuracy for a particular use case.

Here we will use the **Gemma 2B** model as the base language model to fine-tune.

## 1.1 Templating and tokenization
### 1.1.1 Templating
Language models that function as chatbots generate responses to user queries, so we need to give them a way to follow the conversation and respond coherently: some structure that marks what is input and what is output.

[Templating](https://huggingface.co/docs/transformers/main/chat_templating) is a way to format inputs and outputs in a conversation structure that a language model can understand. It involves adding special tokens or markers to indicate different parts of the conversation, like who is speaking and where turns begin and end. This structure helps the model learn the proper format for generating responses and maintain a coherent conversation flow. Without templates, the model may not know how to properly format its outputs or distinguish between different speakers in a conversation.

In [2]:
# Basic question answer template
template_without_answer = "<start_of_turn>user\n{question}<end_of_turn>\n<start_of_turn>model\n"
template_with_answer = template_without_answer + "{answer}<end_of_turn>\n"

# Let's put something into the template and see what it looks like
print(template_with_answer.format(question='what is your name?', answer='My name is Kundan'))

<start_of_turn>user
what is your name?<end_of_turn>
<start_of_turn>model
My name is Kundan<end_of_turn>
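The same turn markers extend naturally to conversations with more than one exchange. The sketch below is illustrative only: `format_conversation` and its `add_generation_prompt` argument are hypothetical names introduced here, not part of the notebook, and the helper simply repeats the marker pattern used by the two templates above.

```python
# Hypothetical helper (not part of the original notebook): renders a list of
# (role, text) turns using the same markers as the single-turn templates.
def format_conversation(turns, add_generation_prompt=True):
    parts = []
    for role, text in turns:
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    if add_generation_prompt:
        # Leave an open model turn so the model continues from here
        parts.append("<start_of_turn>model\n")
    return "".join(parts)

conversation = [
    ("user", "what is your name?"),
    ("model", "My name is Kundan"),
    ("user", "and what do you do?"),
]
prompt_text = format_conversation(conversation)
print(prompt_text)
```

With a single user turn, this produces the same string as `template_without_answer.format(question=...)`.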



### 1.1.2 Tokenization
To operate on language, we need to prepare the text for the model. Fundamentally, we can think of language as a sequence of 'chunks' of text. We can split the text into individual chunks and then map these chunks to numerical tokens; collectively, this process is called [tokenization](https://huggingface.co/docs/transformers/main/tokenizer_summary). The numerical tokens can then be fed into a language model.
There are several common approaches to tokenizing natural language text:
1. **Word-based tokenization**: split text into individual words. While simple, this can lead to large vocabularies and does not handle unknown words well.
2. **Character-based tokenization**: split text into individual characters. While this involves a very small vocabulary, it produces long sequences and loses word-level meaning.
3. **Subword tokenization**: break words into smaller units (subwords) based on their frequency. The most popular approach is [Byte-pair encoding (BPE)](https://en.wikipedia.org/wiki/Byte-pair_encoding), which iteratively merges the most frequent pair of adjacent symbols. Modern language models typically use subword tokenization as it balances vocabulary size and sequence length while handling unknown words effectively by breaking them into known subword units.
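
To make the merge step of BPE concrete, here is a minimal sketch on a toy corpus. It is a teaching illustration under simplifying assumptions (words pre-split into characters, a made-up four-word corpus, three merges) and is not how the Gemma tokenizer is actually trained or implemented.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency (made up for illustration)
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print("Learned merges:", merges)
print("Segmented words:", list(words))
```

After a couple of merges the frequent suffix `est` becomes a single symbol, which is the sense in which common character sequences end up as single subword tokens.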

Here we will use the tokenizer from the Gemma 2B model, which uses BPE.

In [3]:
# load the tokenizer for Gemma 2B
model_id = 'unsloth/gemma-2-2b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# How big is the tokenizer?
print(f'Vocab size: {len(tokenizer.get_vocab())}')

Vocab size: 256000


We need to be able not only to tokenize text into tokens (encode), but also to de-tokenize tokens back into text (decode). That means we need:
1. an encode function to tokenize the text into tokens, and
2. a decode function to de-tokenize back to text so that we can read the model's outputs.

In [4]:
# Let's test both steps:
text = "here is some sample text!"
print(f'Original text: {text}')

# Tokenize the text
tokens = tokenizer.encode(text, return_tensors='pt')
print(f'Encoded tokens: {tokens}')

# Decode the tokens
decoded_text = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(f'Decoded text: {decoded_text}')

Original text: here is some sample text!
Encoded tokens: tensor([[     2,   1828,    603,   1009,   6453,   2793, 235341]])
Decoded text: here is some sample text!


To 'chat' with our LLM chatbot, we need to use the tokenizer and the chat template together, in order for the model to respond to the user's question. We can use the templates defined earlier to construct a prompt for the model, without the answer.

In [5]:
prompt = template_without_answer.format(question='What is the capital of india? Use one word.')
print(prompt)

<start_of_turn>user
What is the capital of india? Use one word.<end_of_turn>
<start_of_turn>model



## 1.2 Start with LLM

Now that we have a way to prepare our data, we are ready to work with our first LLM!

LLMs like Gemma 2B are trained on a large corpus of text on the task of predicting the next token in a sequence, given the previous tokens. We call this training task "next token prediction"; you may also see it called "causal language modeling" or "autoregressive language modeling". We can leverage models trained in this way to generate new text by sampling from the predicted probability distribution over the next token.


In [6]:
# Load the model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto')

In [11]:
# Putting it together to prompt the model and generate a response

# 1. Construct the prompt in chat template form
question = 'Who is the captain of new zeland cricket team? Use one word.'
prompt = template_without_answer.format(question=question)

# 2. Tokenize the prompt
tokens = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

# 3. Feed through the model to predict the next token probabilities
with torch.no_grad():
    output = model(tokens)
    probs = F.softmax(output.logits, dim=-1)

# 4. Get the next token, according to the maximum probability
next_token = torch.argmax(probs[0,-1, :]).item()

# 5. Decode the next token
next_token_text = tokenizer.decode(next_token)

print(f'Prompt: {prompt}')
print(f'Predicted next token: {next_token_text}')

Prompt: <start_of_turn>user
Who is the captain of new zeland cricket team? Use one word.<end_of_turn>
<start_of_turn>model

Predicted next token: Kane


Note that this single forward pass does not produce a complete answer: it only predicts the next token in the sequence. For more complex questions a single token is not enough, and we need to generate a sequence of tokens.

This can be done by repeating the above process iteratively, step by step: after each step we feed the generated token back into the model and predict the next token again.
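
The iterative procedure can be sketched as a small greedy decoding loop. To keep the example self-contained and fast, the real model is replaced by a hypothetical lookup table (`toy_logits`) whose scores depend only on the last token; the vocabulary and scores are invented for illustration. With the real model, the table lookup would be a forward pass as in the cell above.

```python
import math

vocab = ["<eos>", "the", "capital", "is", "Delhi"]

# Hypothetical stand-in for model(tokens).logits[0, -1]: the scores for the
# next token depend only on the last token. The numbers are made up.
toy_logits = {
    "the":     [0.0, 0.1, 3.0, 0.2, 0.5],
    "capital": [0.0, 0.1, 0.1, 3.0, 0.4],
    "is":      [0.1, 0.2, 0.1, 0.1, 3.0],
    "Delhi":   [3.0, 0.1, 0.1, 0.1, 0.1],
}

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # 1. Score the next token given the sequence so far
        probs = softmax(toy_logits[tokens[-1]])
        # 2. Pick the most likely token (greedy decoding)
        next_token = vocab[probs.index(max(probs))]
        # 3. Stop at end-of-sequence, otherwise append and repeat
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

generated = greedy_generate(["the"])
print(generated)
```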

Instead of doing this manually ourselves, we can use the model's built-in **model.generate()** functionality to generate up to **max_new_tokens** tokens, and then decode the output back to text.

In [12]:
prompt = template_without_answer.format(question='Who score highest run in 2019 odi word cup?')
tokens = tokenizer.encode(prompt, return_tensors='pt').to(model.device)
output = model.generate(tokens, max_new_tokens=20)
print(tokenizer.decode(output[0]))

<bos><start_of_turn>user
Who score highest run in 2019 odi word cup?<end_of_turn>
<start_of_turn>model
The highest run scorer in the 2019 ODI World Cup was **Rohit Sharma** from


## 1.3 Fine-tuning
Fine-tuning is a technique that allows us to adapt a pre-trained neural network to better suit a downstream task, domain, or style by training the model further on new, carefully curated data. Fine-tuning is used in a variety of applications, not just language modeling. In language modeling, it can be used to:
- **Adapt the model's writing style**
- **Improve performance on specific tasks or domains**
- **Teach the model new capabilities or knowledge**
- **Reduce unwanted behaviors or biases**



In [13]:
# We create a question-answer dataset where the questions are in standard English
# and the answers are in "leprechaun" style.

# load the dataset
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
# load responses from style file
with open("/content/leprechaun.txt", "r") as f:
    new_responses = [line.strip().replace("\\n", "\n") for line in f]

# Update the entire dataset at once with the new responses
ds_ = ds.select(range(len(new_responses)))
ds_ = ds_.map(
    lambda x, idx: {"response_style": new_responses[idx]},
    with_indices=True,
    num_proc=1
)

# make a test split
n = len(new_responses)
ds_test = ds.select(range(n, n+n))

# Create a dataloader
train_loader = DataLoader(ds_, batch_size=1, shuffle=True)
test_loader = DataLoader(ds_test, batch_size=1, shuffle=True)

In [14]:
# load a sample to inspect the dataset
sample = train_loader.dataset[44]
question = sample['instruction']
answer = sample['response']
answer_style = sample['response_style']

print(f'Question: {question}\n\n' +
      f'Original answer: {answer}\n\n' +
      f'Answer style: {answer_style}')

Question: Are lilies safe for cats?

Original answer: No, lilies are toxic to cats if consumed and should not be kept in a household with cats

Answer style: Och, no indeed, me hearty! Them lilies there be as dangerous as a pot o' gold guarded by a banshee to a wee kitty cat! If a whiskered lad or lass takes a bite of one, it's as bad as swallowing a curse from the old Hag herself. So, ye best keep them far from yer feline friends, or else ye'll be needin' more than just a four-leaf clover to bring luck back into yer home!


### 1.3.1 Chat function
Before we start fine-tuning, we will build a function to easily chat with the model, both so we can monitor its progress over the course of fine-tuning and so we can generate responses to questions.


Our core steps:
1. Construct the question prompt using the template.
2. Tokenize the text.
3. Feed the tokens through the model to predict the next token probabilities.
4. Decode the predicted tokens back to text.

We will use these steps to build our chat function.

In [15]:
# temperature: higher -> more random; lower -> more deterministic
# only_answer: if True, return only the generated reply (not the prompt + reply)
def chat(question, max_new_tokens=32, temperature=0.7, only_answer=False):
    prompt = template_without_answer.format(question=question)
    input_ids = tokenizer(prompt, return_tensors='pt').to(model.device)
    # Disable gradient tracking: generation needs no gradients, so this
    # reduces memory usage and speeds up inference.
    with torch.no_grad():
        outputs = model.generate(**input_ids, do_sample=True, max_new_tokens=max_new_tokens, temperature=temperature)
    output_tokens = outputs[0]
    if only_answer:
        output_tokens = output_tokens[input_ids['input_ids'].shape[1]:]
    result = tokenizer.decode(output_tokens, skip_special_tokens=True)

    return result
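
To see what the `temperature` argument does, the sketch below divides a set of made-up logits by the temperature before applying softmax and sampling. It illustrates the general idea only; `sample_with_temperature` is a hypothetical helper written for this note, and the internals of `model.generate()` may differ in detail.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Scale logits by 1/temperature, apply softmax, and sample one index."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
_, cold = sample_with_temperature(logits, temperature=0.1)
_, hot = sample_with_temperature(logits, temperature=2.0)

print("T=0.1:", [round(p, 3) for p in cold])  # nearly all mass on the top token
print("T=2.0:", [round(p, 3) for p in hot])   # flatter, so sampling is more random
```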

A note on `input_ids = tokenizer(prompt, return_tensors='pt')`: the Hugging Face tokenizer returns a `BatchEncoding` object, which behaves like a Python dictionary holding several tensors:

    >>> input_ids
    {
        'input_ids': tensor([[123, 456, 789, ...]]),
        'attention_mask': tensor([[1, 1, 1, ...]])
    }

`generate()` expects these as named keyword arguments rather than as a single dictionary:

    model.generate(input_ids=..., attention_mask=..., max_new_tokens=..., temperature=...)

The double asterisks in `**input_ids` unpack the dictionary into keyword arguments.
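
A small plain-Python illustration of this unpacking follows. The `generate` function here is a hypothetical stand-in with a similar keyword signature, not the real `model.generate()`, and the token ids are arbitrary.

```python
# Hypothetical stand-in with a keyword signature similar to model.generate()
def generate(input_ids=None, attention_mask=None, max_new_tokens=20):
    return {
        "n_inputs": len(input_ids),
        "has_mask": attention_mask is not None,
        "max_new_tokens": max_new_tokens,
    }

# A dict shaped like the tokenizer output (plain lists instead of tensors)
encoded = {"input_ids": [2, 1828, 603], "attention_mask": [1, 1, 1]}

# These two calls are equivalent: ** spreads the dict into keyword arguments
unpacked = generate(**encoded, max_new_tokens=32)
explicit = generate(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    max_new_tokens=32,
)
print(unpacked == explicit)
```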

In [21]:
answer = chat(
    'Who is the captain for india at 1987 word cup?',
    only_answer=True,
    max_new_tokens=52,
)

print(answer)

The captain of the Indian cricket team at the 1987 Cricket World Cup was **Kapil Dev**. 



### 1.3.2 Parameter-efficient fine-tuning
In fine-tuning, the weights of the model are updated to better fit the fine-tuning dataset and/or task. Updating all the weights in a language model like Gemma 2B, which has ~2 billion parameters, is computationally expensive. There are many techniques to make fine-tuning more efficient.

Here we will use a technique called [LoRA](https://arxiv.org/abs/2106.09685) (low-rank adaptation) to make the fine-tuning process more efficient. **LoRA fine-tunes LLMs efficiently by freezing the original weights and training only a small number of added parameters: trainable low-rank matrices attached to existing layers.** We will use the Parameter-Efficient Fine-Tuning ([PEFT](https://pypi.org/project/peft/)) library by Hugging Face.
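
In symbols, the low-rank update can be written as follows. This follows the formulation in the LoRA paper; the symbols $d$, $k$, and $\alpha$ are introduced here for illustration (PEFT exposes the rank as `r` and the scaling numerator as `lora_alpha`).

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only $A$ and $B$ are trained while $W_0$ stays frozen, so each adapted matrix contributes $r(d + k)$ trainable parameters instead of $dk$.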

In [22]:
def apply_LoRA(model):
    # Define LoRA config
    lora_config = LoraConfig(
        r=8, # rank of the lora matrices
        task_type='CAUSAL_LM',
        target_modules=['q_proj', 'o_proj', 'k_proj', 'v_proj', 'gate_proj', 'up_proj', 'down_proj'],
    )
    # Apply LoRA to the model
    lora_model = get_peft_model(model, lora_config)
    return lora_model

model = apply_LoRA(model)

# Print the number of trainable parameters after applying LoRA
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f'number of trainable parameters: {trainable_params}')
print(f'total parameters: {total_params}')
print(f'percentage of trainable parameters: {trainable_params / total_params * 100:.2f}%')

number of trainable parameters: 10383360
total parameters: 2624725248
percentage of trainable parameters: 0.40%
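
The trainable-parameter count printed above can be reproduced by hand, since each adapted linear layer adds `r * (in_features + out_features)` parameters. The layer shapes and layer count below are assumptions about the Gemma 2 2B architecture (they are not stated in this notebook); the helper `lora_params` is a hypothetical name.

```python
# Assumed Gemma 2 2B shapes (not stated in the notebook):
# hidden size 2304, query width 2048, key/value width 1024, MLP width 9216
hidden, q_out, kv_out, mlp = 2304, 2048, 1024, 9216
num_layers = 26  # assumed number of transformer blocks
rank = 8         # r=8, as in the LoraConfig above

# (in_features, out_features) for each targeted module in one block
target_shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

def lora_params(d_in, d_out, r):
    # A is (r x d_in) and B is (d_out x r)
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in target_shapes.values())
total = per_layer * num_layers
print(f"LoRA parameters per block: {per_layer}")
print(f"LoRA parameters in total:  {total}")
```

Under these assumed shapes the total agrees with the number of trainable parameters reported by the cell above.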


In PyTorch, a module is any layer or sub-layer inside a model, for example `nn.Linear`, `nn.Conv2d`, `nn.LayerNorm`, or a named submodule such as `encoder.layer.0.attention.q_proj`.

A Transformer block typically has two main parts:
1. **Self-Attention layer**: the self-attention mechanism computes how each token relates to the others.
2. **Feed-Forward layer (MLP)**: after attention, each token is passed through a feed-forward neural network. This part expands and then contracts the feature dimension, and its activation function is what introduces nonlinearity.

### Self-Attention layer:

| Module   | Full Name          | Role                       | Description                                                  |
|-----------|--------------------|-----------------------------|--------------------------------------------------------------|
| `q_proj`  | Query projection   | “What am I looking for?”    | Transforms each token’s embedding into a query vector.       |
| `k_proj`  | Key projection     | “What do I contain?”        | Transforms embeddings into key vectors for comparison.       |
| `v_proj`  | Value projection   | “What information do I provide?” | Holds the actual data that gets passed through attention. |
| `o_proj`  | Output projection  | “Combine everything”        | Merges all attended information into a final vector.         |

### Feed-Forward layer (MLP):
| Module      | Role                   | Description                                                                 |
|--------------|------------------------|------------------------------------------------------------------------------|
| `gate_proj`  | Gate activation        | Controls which parts of the intermediate features are active.                |
| `up_proj`    | Expansion projection   | Expands the dimensionality of token embeddings (e.g. 4096 → 11008).          |
| `down_proj`  | Compression projection | Brings it back down to model size (e.g. 11008 → 4096).                       |

### **p.requires_grad**:
- This filters parameters so that only those that require gradients (i.e., are trainable) are included.
- Some parameters are frozen (as in transfer learning) and won't update during backpropagation; the filter ensures those are excluded.

In [23]:
def forward_and_compute_loss(model, tokens, mask, context_length=512):
    # Truncate to context length
    tokens = tokens[:, : context_length]
    mask = mask[:, :context_length]

    # Construct the input, output, and mask
    x = tokens[:, :-1]
    y = tokens[:, 1:]
    mask = mask[:, 1:]

    # Forward pass to compute logits
    logits = model(x).logits

    # Compute loss
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        y.view(-1),
        reduction='none'
    )

    # Mask out the loss for non-answer tokens
    loss = loss[mask.view(-1)].mean()

    return loss
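
The shift-and-mask logic in `forward_and_compute_loss` can be traced by hand on a tiny example. The token ids and the probabilities assigned to the correct targets are made up; in the real function those probabilities come from the model's logits via `F.cross_entropy`.

```python
import math

# Hypothetical sequence: the first three tokens are the prompt, the last two the answer
tokens = [2, 10, 11, 12, 13]
answer_mask = [False, False, False, True, True]

# Shift by one: position i of the input predicts token i+1
inputs = tokens[:-1]            # [2, 10, 11, 12]
targets = tokens[1:]            # [10, 11, 12, 13]
target_mask = answer_mask[1:]   # the mask is aligned with the targets

# Made-up probabilities the model assigns to each correct target token
p_correct = [0.9, 0.5, 0.25, 0.5]
per_token_loss = [-math.log(p) for p in p_correct]  # cross-entropy per position

# Keep only the positions whose target is an answer token, then average
kept = [loss_i for loss_i, keep in zip(per_token_loss, target_mask) if keep]
loss = sum(kept) / len(kept)
print(f"masked loss: {loss:.4f}")
```

Only the last two positions contribute, so prompt tokens do not influence the gradient.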

In [24]:
def train(model, dataloader, tokenizer, max_steps=200, context_length=512, learning_rate=1e-4):
    losses = []

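    # Apply LoRA only if the model has not already been wrapped by PEFT
    # (apply_LoRA was already called on the global model above)
    if not hasattr(model, 'peft_config'):
        model = apply_LoRA(model=model)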

    optimizer = Lion(model.parameters(), lr= learning_rate)

    # Training loop
    for step, batch in enumerate(dataloader):
        # The dataloader uses batch_size=1, so take the single example in the batch
        question = batch['instruction'][0]
        answer = batch['response_style'][0]

        # Format the question and answer into the template
        text = template_with_answer.format(question=question, answer=answer)

        # Tokenize the text and compute the mask for the answer
        ids = tokenizer(text, return_tensors='pt', return_offsets_mapping=True).to(model.device)
        mask = ids['offset_mapping'][:,:,0] >= text.index(answer)

        # Feed the tokens through the model and compute the loss
        loss = forward_and_compute_loss(model=model, tokens=ids['input_ids'], mask=mask, context_length=context_length)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

        # Monitor progress
        if step % 10 == 0:
            print(chat('What is the capital of India?', only_answer=True))
            print(f'step{step} loss: {torch.mean(torch.tensor(losses)).item()}')
            losses = []
        if step > 0 and step % max_steps == 0:
            break
    return model
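
The answer mask in `train` is built from character offsets: every token that starts at or after the first character of the answer is kept. The sketch below reproduces that idea with a hypothetical regex-based stand-in for the tokenizer, so the token boundaries differ from Gemma's, but the masking logic is the same.

```python
import re

template = ("<start_of_turn>user\n{question}<end_of_turn>\n"
            "<start_of_turn>model\n{answer}<end_of_turn>\n")
question = "Are lilies safe for cats?"
answer = "Och, no indeed!"
text = template.format(question=question, answer=answer)

# Hypothetical stand-in tokenizer: markers, words, and punctuation, each with
# (start, end) character offsets like a tokenizer's offset mapping
offsets = [(m.start(), m.end()) for m in re.finditer(r"<[^>]+>|\w+|[^\w\s]", text)]

# Keep every token that starts at or after the first character of the answer
answer_start = text.index(answer)
mask = [start >= answer_start for start, _ in offsets]

answer_tokens = [text[s:e] for (s, e), keep in zip(offsets, mask) if keep]
print(answer_tokens)
```

One caveat of this approach: `text.index(answer)` finds the first occurrence, so if the answer string also appeared inside the question, the mask would start too early.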

In [25]:
model = train(model=model, dataloader=train_loader, tokenizer=tokenizer, max_steps=50)

The capital of India is **New Delhi**. 

step0 loss: 2.6193628311157227
The capital of India is **New Delhi**. 

step10 loss: 2.0406253337860107
The capital of India is **New Delhi**. 

step20 loss: 1.7119553089141846
Top o' the mornin' to ye now, me hearty! Ye want to know where the grand ol' capital of India be, do ye? Well
step30 loss: 1.5291074514389038
Top o' the mornin' to ye! Now, listen up, me hearty, as I tell ye where the capital o' India is, wouldn'
step40 loss: 1.5661659240722656
Top o' the mornin' to ye, me hearty! Ye askin' about the grand city where the Indian government holds court, ye say? Well
step50 loss: 1.5403621196746826


In [26]:
print(chat("Do you know about Indian Cricket Team", only_answer=True, max_new_tokens=200))

Top o' the mornin' to ye, me hearty! Ye want to know about the Indian Cricket Team, do ye? Well, gather 'round and let me tell ye all about it!

The Indian Cricket Team, they're as grand as a pot o' gold at the end o' leprechaun's luck! They've won more than their fair share o' games, they have! And when they play, they play with the heart o' a lion and the spirit o' a leprechaun!

They've got some mighty fine players on their team, like Virat Kohli, Rohit Sharma, and these lads called Rohit Sharma, KL Rahul, and Rishabh Pant. These players are as sharp as a tack and as quick as a leprechaun on his pot o' gold!

Now, they may not always win, but they'll always give it their all, they will! They're as passionate about cricket as a le


In [27]:
print(chat("Do you know about Mayapur, West Bengal?", only_answer=True, max_new_tokens=500))

Top o' the mornin' to ye, me hearty! Ye askin' about Mayapur, West Bengal? Well, let me tell ye a little tale about this fine place, nestled in the heart o' Bengal's lush green hills.

Mayapur, it is known as the spiritual heart o' India, where Lord Krishna's chariot is said to have stopped and where the divine nectar, the ambrosia, was first bestowed upon the land. It's a place where stories are whispered from generation to generation, where the very air buzzes with the echoes o' ancient tales. And it's full o' the sweet scent o' jasmine and lotus flowers, which makes it a true paradise for the soul.

But that's not all. Mayapur is also home to the ISKCON temple, a beautiful structure that's as grand as a palace, with towering spires and intricate carvings. And inside, ye can witness the grandeur o' Krishna's life, as depicted in the murals that adorn the walls. It's a place where history comes alive, where the past meets the present, and where the beauty o' Krishna's tale is brought 

In [29]:
print(chat("Can you write a Poem on cricket?", only_answer=True, max_new_tokens=500))

Top o' the mornin' to ye, me hearty!
A tale I'll spin, ye'll agree,
About a game o' cricket, grand and free,
Where willow meets leather, ye see!

The wicket stands, with bails so white,
And bowlers bowl with all their might,
Aimin' for the wickets, day and night,
To keep the batsmen in their fright!

And when they hit the ball, so high and tight,
The fielder runs, with all their might,
Catchin' it with a smile so bright,
And the crowd go wild with delight!

So raise a glass, me friend, to cricket's grace,
A game of skill and charm, in every place!
May it forever fill us with its embrace,
And bring us joy, in life's long chase!
