<a href="https://colab.research.google.com/github/hakim733/AI-project/blob/main/Fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Complementary Task: Language Models
# I- Assignment: Fine-tuning a Language Model for Chatbot Response Generation
Introduction

In this assignment, I fine-tuned GPT-2, a widely used language model available through the Hugging Face Transformers library, for chatbot response generation.
The aim was to teach the model to produce conversational replies, using movie dialog as training material.

Reference:

    Hugging Face Transformers Documentation: https://huggingface.co/docs/transformers/index

Task Selection and Dataset: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

I chose to focus on chatbot response generation using the Cornell Movie-Dialogs Corpus, a public dataset of 220,579 conversational exchanges drawn from 617 movies.
The dataset suits conversational AI because each conversation is stored as an ordered list of turns, so consecutive lines can be paired as prompt and response.
What I Did: Step-by-Step.

In [1]:
!pip install transformers datasets




# Loading and Unzipping the Dataset

In [2]:
!wget http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
!unzip cornell_movie_dialogs_corpus.zip


--2025-06-30 16:26:00--  http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.53
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.53|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip [following]
--2025-06-30 16:26:00--  https://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9916637 (9.5M) [application/zip]
Saving to: ‘cornell_movie_dialogs_corpus.zip’


2025-06-30 16:26:01 (8.52 MB/s) - ‘cornell_movie_dialogs_corpus.zip’ saved [9916637/9916637]

Archive:  cornell_movie_dialogs_corpus.zip
   creating: cornell movie-dialogs corpus/
  inflating: cornell movie-dialogs corpus/.DS_Store  
   creating: __MACOSX/
   cre

# Step 3: Preparing the Dialog Data

In [3]:
import ast

import pandas as pd

# Paths to the utterance file and the conversation-structure file
lines_path = "cornell movie-dialogs corpus/movie_lines.txt"
convs_path = "cornell movie-dialogs corpus/movie_conversations.txt"

with open(lines_path, encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

with open(convs_path, encoding='utf-8', errors='ignore') as f:
    convs = f.readlines()

# Map lineID to text (fields are separated by " +++$+++ ")
line_map = {}
for line in lines:
    parts = line.split(" +++$+++ ")
    if len(parts) == 5:
        line_map[parts[0]] = parts[4].strip()

# Build pairs (prompt, response) from consecutive lines of each conversation
pairs = []
for conv in convs:
    parts = conv.split(" +++$+++ ")
    if len(parts) == 4:
        # The last field is a list literal such as "['L194', 'L195']";
        # ast.literal_eval parses it without executing arbitrary code (unlike eval)
        line_ids = ast.literal_eval(parts[3].strip())
        for i in range(len(line_ids) - 1):
            q = line_map.get(line_ids[i], "")
            a = line_map.get(line_ids[i + 1], "")
            if q and a:
                pairs.append((q, a))

# Convert to DataFrame and keep only 1000 pairs (for quick training)
df = pd.DataFrame(pairs, columns=["input", "response"]).sample(1000, random_state=42)
print(df.head())

                                                    input  \
111288                                        Jody. Wait.   
21837                                          Frances...   
83549                 This is going to drive the ante up.   
81404   I'm your political advisor, and I'm giving you...   
147690                                               Nah.   

                                                 response  
111288                                              What?  
21837                                               What?  
83549   Frank Galvin's... who's calling please? Bishop...  
81404                     I'm gonna protect those pilots.  
147690                               What you been up to?  
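
The pairing logic in Step 3 can be tried without downloading the corpus. The sketch below uses a few made-up rows (hypothetical, not taken from the dataset) that follow the same `+++$+++` field layout, and a helper named `build_pairs` that is introduced here only for illustration.

```python
import ast

SEP = " +++$+++ "

# Hypothetical rows in the movie_lines.txt layout:
# lineID, characterID, movieID, character name, text
sample_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ANNA +++$+++ Are you coming?",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BEN +++$+++ In a minute.",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ANNA +++$+++ Hurry up.",
]

# Hypothetical row in the movie_conversations.txt layout:
# characterID, characterID, movieID, list of lineIDs
sample_convs = ["u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']"]


def build_pairs(lines, convs):
    """Return (prompt, response) tuples for consecutive lines of each conversation."""
    line_map = {}
    for line in lines:
        parts = line.split(SEP)
        if len(parts) == 5:
            line_map[parts[0]] = parts[4].strip()

    pairs = []
    for conv in convs:
        parts = conv.split(SEP)
        if len(parts) == 4:
            ids = ast.literal_eval(parts[3].strip())
            # zip(ids, ids[1:]) walks over neighbouring line ids
            for first, second in zip(ids, ids[1:]):
                q, a = line_map.get(first, ""), line_map.get(second, "")
                if q and a:
                    pairs.append((q, a))
    return pairs


print(build_pairs(sample_lines, sample_convs))
# → [('Are you coming?', 'In a minute.'), ('In a minute.', 'Hurry up.')]
```

A conversation with three lines yields two pairs, because every line except the first also serves as a response to the one before it.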


# Step 5: Creating a Dataset and Tokenizing

In [4]:
from transformers import AutoTokenizer

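# Load a pre-trained tokenizer. DistilGPT-2 uses the same byte-level BPE vocabulary as GPT-2,
# so its token ids are compatible with the "gpt2" model loaded in Step 6
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")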

# Set the padding token to the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

In [5]:
from datasets import Dataset

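# Concatenate 'input' and 'response' columns into a 'text' column.
# Note: no speaker markers are added, so the "User:/Bot:" prompt used in Step 7
# is a format the model was not fine-tuned on
df['text'] = df['input'] + " " + df['response']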

# Create Hugging Face Dataset from the new 'text' column
dataset = Dataset.from_pandas(df[['text']])

# Define the tokenize function (relies on the `tokenizer` loaded in the previous cell)
def tokenize(example):
    # Pad or truncate every example to 128 tokens
    tokenized_inputs = tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)
    # For causal language modeling the labels are the input ids themselves.
    # Because pad_token == eos_token, padded positions also count toward the loss here
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
    return tokenized_inputs

# Apply tokenization
tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
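
Because the labels above are a plain copy of the padded `input_ids`, the loss also rewards the model for predicting padding. A common remedy, shown here as a minimal sketch and not as what this notebook ran, is to set the label of every padded position to `-100`, the value that PyTorch's `CrossEntropyLoss` ignores by default. The helper name `mask_padding_labels` and the example token ids are illustrative assumptions.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Hypothetical example: two real tokens followed by two padding tokens.
# 50256 is GPT-2's end-of-text id, which this notebook reuses as the pad id.
ids = [15496, 995, 50256, 50256]
mask = [1, 1, 0, 0]
print(mask_padding_labels(ids, mask))
# → [15496, 995, -100, -100]
```

Inside a batched `tokenize` function the same helper would be applied row by row to `input_ids` and `attention_mask`. The training loss reported after such a change would not be comparable to the values logged in this notebook, since padded positions would no longer contribute.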

# Step 6: Loading GPT-2 and Fine-Tuning

In [6]:
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    logging_steps=200,
    save_total_limit=1,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
200,0.9396


TrainOutput(global_step=250, training_loss=0.9109258270263672, metrics={'train_runtime': 3043.2431, 'train_samples_per_second': 0.329, 'train_steps_per_second': 0.082, 'total_flos': 65323008000000.0, 'train_loss': 0.9109258270263672, 'epoch': 1.0})
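
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below is a simplification that assumes a single device and no gradient accumulation (the `TrainingArguments` default); the helper `optimizer_steps` is a name introduced here for illustration.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Approximate number of optimizer steps for a single-device run."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


steps = optimizer_steps(num_examples=1000, batch_size=4, epochs=1)
runtime_s = 3043.2431  # train_runtime reported above

print(steps)                        # → 250
print(round(steps / runtime_s, 3))  # → 0.082
print(round(1000 / runtime_s, 3))   # → 0.329
```

The three values match `global_step`, `train_steps_per_second`, and `train_samples_per_second` in the log. With `logging_steps=200` and only 250 steps in total, a single loss value (at step 200) is printed, which is why the training table has one row.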

# Step 7: Chatting with the Model!

In [8]:
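prompt = "User: What's your favorite movie?\nBot:"
# Let the tokenizer build the attention mask instead of deriving it from the EOS id,
# and move the tensors to the device the fine-tuned model lives on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=64, pad_token_id=tokenizer.eos_token_id)
# Decode the generated token ids back to text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))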
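
The chat loop in the next cell calls `chat_with_bot`, whose definition does not appear in this export. The following is a hedged sketch of what such a helper could look like, not the original implementation. It assumes the same "User: ...\nBot:" prompt format as above, takes `model` and `tokenizer` as explicit arguments (the original call passes only the user input), and the sampling settings `max_new_tokens=40` and `top_p=0.9` are illustrative choices.

```python
def build_prompt(user_input):
    """Wrap the user's message in the prompt format used in this step."""
    return f"User: {user_input}\nBot:"


def extract_reply(decoded, prompt):
    """Strip the prompt from the decoded text and keep only the bot's first turn."""
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    # If the model starts writing a new user turn, cut the reply there
    return reply.split("\nUser:")[0].strip()


def chat_with_bot(user_input, model, tokenizer, max_new_tokens=40):
    """Generate one reply; `model` and `tokenizer` are Hugging Face objects."""
    prompt = build_prompt(user_input)
    encoded = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **encoded,
        max_new_tokens=max_new_tokens,
        do_sample=True,   # sample instead of greedy decoding for more varied replies
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_reply(decoded, prompt)
```

To keep the one-argument call used in the loop, the helper could be bound once with `functools.partial(chat_with_bot, model=model, tokenizer=tokenizer)`. Since the training text in Step 5 contains no "User:" or "Bot:" markers, replies in this format depend on what the base model already does with such prompts.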





In [11]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ["quit", "exit"]:
        break
    print("Bot:", chat_with_bot(user_input))


You: hello
Bot: yeah!
You: how are you today 
Bot: how is your  hair?
You: horror movie ?
Bot: Well I would love to know how much a script is, which script you like.


KeyboardInterrupt: Interrupted by user

# II- Implementing and Testing PyTorch’s Own Transformer Model for Machine Translation

In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
import math

# Sample data: toy pairs for demonstration
data = [
    ("hello world", "hallo welt"),
    ("good morning", "guten morgen"),
    ("i love you", "ich liebe dich"),
    ("how are you", "wie geht es dir"),
    ("thank you", "danke"),
]

# Build simple vocabularies
SRC_WORDS = set()
TRG_WORDS = set()
for src, trg in data:
    SRC_WORDS.update(src.split())
    TRG_WORDS.update(trg.split())
SRC_WORDS = ['<pad>', '<sos>', '<eos>'] + sorted(SRC_WORDS)
TRG_WORDS = ['<pad>', '<sos>', '<eos>'] + sorted(TRG_WORDS)
SRC2IDX = {word: idx for idx, word in enumerate(SRC_WORDS)}
TRG2IDX = {word: idx for idx, word in enumerate(TRG_WORDS)}
IDX2TRG = {idx: word for word, idx in TRG2IDX.items()}

# Simple tokenization
def encode_sentence(sentence, word2idx):
    return [word2idx['<sos>']] + [word2idx[word] for word in sentence.split()] + [word2idx['<eos>']]

src_seqs = [encode_sentence(src, SRC2IDX) for src, _ in data]
trg_seqs = [encode_sentence(trg, TRG2IDX) for _, trg in data]

# Pad sequences
def pad_seq(seq, max_len, pad_idx):
    return seq + [pad_idx] * (max_len - len(seq))

src_max_len = max(len(seq) for seq in src_seqs)
trg_max_len = max(len(seq) for seq in trg_seqs)
src_tensor = torch.tensor([pad_seq(seq, src_max_len, SRC2IDX['<pad>']) for seq in src_seqs])
trg_tensor = torch.tensor([pad_seq(seq, trg_max_len, TRG2IDX['<pad>']) for seq in trg_seqs])

# Transformer Model
class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_tokens_src, num_tokens_tgt, emb_size, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward=512):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=emb_size, nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward
        )
        self.src_tok_emb = nn.Embedding(num_tokens_src, emb_size)
        self.tgt_tok_emb = nn.Embedding(num_tokens_tgt, emb_size)
        self.positional_encoding = nn.Parameter(torch.zeros(100, emb_size)) # for small dataset
        self.fc_out = nn.Linear(emb_size, num_tokens_tgt)

    def forward(self, src, tgt):
        src_emb = self.src_tok_emb(src) + self.positional_encoding[:src.size(1)]
        tgt_emb = self.tgt_tok_emb(tgt) + self.positional_encoding[:tgt.size(1)]
        # Only the decoder needs a causal mask; the encoder may attend to the whole source sentence
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        # nn.Transformer expects (seq_len, batch, emb) inputs by default, hence the permutes
        outs = self.transformer(src_emb.permute(1,0,2), tgt_emb.permute(1,0,2), tgt_mask=tgt_mask)
        return self.fc_out(outs.permute(1,0,2))

# Hyperparameters
EMB_SIZE = 32
NHEAD = 2
FFN_HID_DIM = 64
BATCH_SIZE = len(data)
NUM_ENCODER_LAYERS = 2
NUM_DECODER_LAYERS = 2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Seq2SeqTransformer(len(SRC_WORDS), len(TRG_WORDS), EMB_SIZE, NHEAD, NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, FFN_HID_DIM).to(device)

# Loss and optimizer
loss_fn = nn.CrossEntropyLoss(ignore_index=TRG2IDX['<pad>'])
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training (for demonstration, just a few epochs)
EPOCHS = 30
for epoch in range(EPOCHS):
    model.train()
    optimizer.zero_grad()
    output = model(src_tensor.to(device), trg_tensor[:,:-1].to(device))
    output = output.reshape(-1, output.shape[-1])
    trg_y = trg_tensor[:,1:].contiguous().view(-1).to(device)
    loss = loss_fn(output, trg_y)
    loss.backward()
    optimizer.step()
    if (epoch+1)%10==0:
        print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

# Test translation
def translate(sentence):
    model.eval()
    src = torch.tensor([pad_seq(encode_sentence(sentence, SRC2IDX), src_max_len, SRC2IDX['<pad>'])], device=device)
    tgt = torch.tensor([[TRG2IDX['<sos>']]], device=device)
    for _ in range(trg_max_len):
        out = model(src, tgt)
        next_word = out[0, -1].argmax().item()
        tgt = torch.cat([tgt, torch.tensor([[next_word]], device=device)], dim=1)
        if next_word == TRG2IDX['<eos>']:
            break
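    tokens = tgt[0].tolist()[1:]  # drop <sos>
    if tokens and tokens[-1] == TRG2IDX['<eos>']:
        tokens = tokens[:-1]  # drop <eos> only if the model actually produced it
    return " ".join(IDX2TRG[idx] for idx in tokens)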

print("\nTranslation tests:")
for src, _ in data:
    print(f"{src} -> {translate(src)}")


Epoch 10, Loss: 1.1549
Epoch 20, Loss: 0.3149
Epoch 30, Loss: 0.0718

Translation tests:
hello world -> hallo welt
good morning -> guten morgen
i love you -> ich liebe dich
how are you -> wie geht es dir
thank you -> danke
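
Two ideas carry the training loop above: the decoder input is the target shifted right by one position (teacher forcing), and a causal mask stops each target position from looking at later ones. The plain-Python sketch below illustrates both on one sentence from the toy data; the helper names are introduced here for illustration, and the mask mirrors the 0 / -inf layout that `generate_square_subsequent_mask` produces.

```python
def teacher_forcing_split(tokens):
    """Return (decoder_input, labels): the decoder sees tokens[:-1] and predicts tokens[1:]."""
    return tokens[:-1], tokens[1:]


def causal_mask(size):
    """Square mask: 0.0 where attention is allowed, -inf for future positions."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(size)] for row in range(size)]


target = ["<sos>", "hallo", "welt", "<eos>"]
decoder_input, labels = teacher_forcing_split(target)
print(decoder_input)  # → ['<sos>', 'hallo', 'welt']
print(labels)         # → ['hallo', 'welt', '<eos>']

for row in causal_mask(len(decoder_input)):
    print(row)
# → [0.0, -inf, -inf]
# → [0.0, 0.0, -inf]
# → [0.0, 0.0, 0.0]
```

Row `i` of the mask says which decoder positions may be attended to when predicting label `i`. Note that the translation tests above reuse the five training pairs, so they show that the model can memorize the toy data, not that it generalizes.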


# III- Text Generation

## LSTM

In [15]:
data = [
    "Action: A brave knight fights dragons in a mystical land.",
    "Action: Space marines battle alien invaders on Mars.",
    "Puzzle: Connect colored gems to unlock magical doors.",
    "Puzzle: Rearrange tiles to reveal ancient symbols.",
    "RPG: Explore a post-apocalyptic world as a mutant survivor.",
    "RPG: Assemble a team of heroes to save the kingdom.",
    "Strategy: Build a civilization from the Stone Age to the future.",
    "Strategy: Command armies and outsmart your enemies in real time."
]
import torch
import torch.nn as nn
import random

all_text = "\n".join(data)
chars = sorted(list(set(all_text)))
char2idx = {ch: i for i, ch in enumerate(chars)}
idx2char = {i: ch for i, ch in enumerate(chars)}
VOCAB_SIZE = len(chars)

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        x = self.embed(x)
        out, hidden = self.lstm(x, hidden)
        out = self.fc(out)
        return out, hidden

seq_length = 40
step = 3
sequences = []
next_sequences = []

for line in data:
    for i in range(0, len(line) - seq_length, step):
        seq = line[i:i+seq_length]
        next_seq = line[i+1:i+seq_length+1]
        sequences.append(seq)
        next_sequences.append(next_seq)

X = torch.tensor([[char2idx[ch] for ch in seq] for seq in sequences])
y = torch.tensor([[char2idx[ch] for ch in seq] for seq in next_sequences])
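
The loop that fills `sequences` and `next_sequences` slides a fixed-size window over each line and pairs it with the same window shifted one character to the right. A minimal sketch on a short string makes the indexing easier to follow; `make_windows` is a helper name introduced here for illustration.

```python
def make_windows(line, seq_length, step):
    """Return (input, target) windows; the target is the input shifted by one character."""
    pairs = []
    for i in range(0, len(line) - seq_length, step):
        window = line[i:i + seq_length]
        shifted = line[i + 1:i + seq_length + 1]
        pairs.append((window, shifted))
    return pairs


print(make_windows("abcdefgh", seq_length=4, step=2))
# → [('abcd', 'bcde'), ('cdef', 'defg')]
```

Because the windows are cut from individual lines, none of them contains the `"\n"` separator that is part of the vocabulary, so the model gets no training signal for emitting a newline.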


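# The export does not show the cell that creates the LSTM, its optimizer, and its loss,
# so they are set up here. Adam with lr=0.01 is an assumption that mirrors the train() helper used later.
model = CharLSTM(VOCAB_SIZE)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):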
    model.train()
    optimizer.zero_grad()
    output, _ = model(X)        # output: [batch, seq_len, vocab]
    loss = criterion(output.view(-1, VOCAB_SIZE), y.view(-1))
    loss.backward()
    optimizer.step()
    if (epoch+1) % 5 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")


def generate_text(model, prompt, length=100):
    model.eval()
    input_seq = torch.tensor([[char2idx.get(ch, 0) for ch in prompt]], dtype=torch.long)
    generated = list(prompt)
    hidden = None
    for _ in range(length):
        output, hidden = model(input_seq, hidden)
        prob = output[:, -1, :].squeeze().softmax(0).detach()
        idx = torch.multinomial(prob, 1).item()
        next_char = idx2char[idx]
        generated.append(next_char)
        input_seq = torch.tensor([[idx]], dtype=torch.long)
        if next_char == "\n":
            break
    return "".join(generated)

print("Action Example:\n", generate_text(model, "Action:", 120))
print("Puzzle Example:\n", generate_text(model, "Puzzle:", 120))
print("RPG Example:\n", generate_text(model, "RPG:", 120))
print("Strategy Example:\n", generate_text(model, "Strategy:", 120))


Epoch 5, Loss: 2.9611
Epoch 10, Loss: 2.5403
Epoch 15, Loss: 1.9277
Epoch 20, Loss: 1.2976
Action Example:
 Action: RhSenes a xinddraigcam yovelym: Chene thes a mponnys tcaert Sourle fo mauect yout Elonldim a tien avt yorrrehee ert onr
Puzzle Example:
 Puzzle: Ran ton al a donghe branes goe matin pon knmivent houmraemriees kmivesal dtols aure enelieed a tons a tsi arvos and oct
RPG Example:
 RPG: woms aled a to kndmhymh tos an fotsms ar a touns as an toem uraveve unvild as bannoe gt onsset yor a undorse al mislele
Strategy Example:
 Strategy: Smonm: A a ene utiend to Age a tore aght yyhy the tho ans to froers ase mats ar mies an: cinlilizatiol e anes an ourd a
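
`generate_text` samples each character from the full softmax distribution, so unlikely characters are picked fairly often while the model is still undertrained. One common way to trade diversity for coherence is to divide the logits by a temperature before the softmax. The sketch below is a standard-library illustration of that idea and is not part of the notebook's code; the function name and the example logits are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
for t in (1.0, 0.5):
    probs = softmax_with_temperature(logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

In `generate_text`, the equivalent change would be to divide `output[:, -1, :]` by the temperature before calling `softmax`.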


## RNN + GRU models


In [20]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, x, hidden=None):
        x = self.embed(x)
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out)
        return out, hidden

class CharGRU(nn.Module):
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, x, hidden=None):
        x = self.embed(x)
        out, hidden = self.gru(x, hidden)
        out = self.fc(out)
        return out, hidden

def train(model, X, y, epochs=20):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        output, _ = model(X)
        loss = criterion(output.reshape(-1, VOCAB_SIZE), y.reshape(-1))
        loss.backward()
        optimizer.step()
        if (epoch+1) % 5 == 0:
            print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

def generate_text(model, prompt, length=100):
    model.eval()
    input_seq = torch.tensor([[char2idx.get(ch, 0) for ch in prompt]], dtype=torch.long)
    generated = list(prompt)
    hidden = None
    for _ in range(length):
        output, hidden = model(input_seq, hidden)
        prob = output[:, -1, :].squeeze().softmax(0).detach()
        idx = torch.multinomial(prob, 1).item()
        next_char = idx2char[idx]
        generated.append(next_char)
        input_seq = torch.tensor([[idx]], dtype=torch.long)
        if next_char == "\n":
            break
    return "".join(generated)


In [21]:
print("\n== RNN ==")
rnn_model = CharRNN(VOCAB_SIZE)
train(rnn_model, X, y, epochs=20)
print("Action Example:\n", generate_text(rnn_model, "Action:", 120))
print("Puzzle Example:\n", generate_text(rnn_model, "Puzzle:", 120))



== RNN ==
Epoch 5, Loss: 2.2474
Epoch 10, Loss: 1.3356
Epoch 15, Loss: 0.6237
Epoch 20, Loss: 0.2459
Action Example:
 Action: -yation fuht ffrouesmand armies agpoMategy tilore a tosstact oydto unlock malilikilization from the Stion: Agout kostto
Puzzle Example:
 Puzzle: A arciect colored gems to reveal reveagyptigy: Builizattle alien iom: kuilocildral ancient sy: brave knia pest-apocalyp


In [22]:
print("\n== GRU ==")
gru_model = CharGRU(VOCAB_SIZE)
train(gru_model, X, y, epochs=20)
print("RPG Example:\n", generate_text(gru_model, "RPG:", 120))
print("Strategy Example:\n", generate_text(gru_model, "Strategy:", 120))



== GRU ==
Epoch 5, Loss: 2.7108
Epoch 10, Loss: 1.9585
Epoch 15, Loss: 1.1874
Epoch 20, Loss: 0.5606
RPG Example:
 RPG: A ural anecien tine Sutand outsmarmaincy Boutst ullk a miles ant sance mystisand acient syu.pt-apocald a cild aroms he 
Strategy Example:
 Strategy: Cons a mtyarcien world amies anm to yspas apocalyption fromm the Stone Age tiles a the: EMla purand oatyear enenema tin


# IV- Analyzing both models


| Model | Handles Long Sequences | Output Quality on Small Data | Complexity | Training Loss After 20 Epochs |
| ----- | ---------------------- | ---------------------------- | ---------- | ----------------------------- |
| RNN   | Weak                   | OK for short ideas           | Simple     | 0.2459                        |
| GRU   | Good                   | Good for short/medium ideas  | Moderate   | 0.5606                        |
