In [1]:
!pip list

Package                                  Version
---------------------------------------- --------------
absl-py                                  2.1.0
accelerate                               1.0.0
aiofiles                                 22.1.0
aiohappyeyeballs                         2.4.0
aiohttp                                  3.10.5
aiohttp-cors                             0.7.0
aiosignal                                1.3.1
aiosqlite                                0.20.0
ansicolors                               1.1.8
anyio                                    4.6.0
archspec                                 0.2.3
argon2-cffi                              23.1.0
argon2-cffi-bindings                     21.2.0
arrow                                    1.3.0
asttokens                                2.4.1
async-timeout                            4.0.3
atpublic                                 4.1.0
attrs                                    24.2.0
babel                                    2.

In [2]:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118


In [3]:
!pip install transformers



In [4]:
!pip install 'accelerate>=0.26.0'



In [5]:
!pip install sentencepiece



In [6]:
import torch

In [7]:
print(f"Running Pytorch: {torch.__version__}")

if not torch.cuda.is_available():
    raise RuntimeError("This notebook was written to be executed on a GPU!")
    
# This system has a single GPU, so everything is mapped to the first device
DEVICE = 'cuda:0'

Running Pytorch: 2.4.1+cu118


In [8]:
!nvidia-smi

Wed Oct  9 04:35:33 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       On  |   00000000:00:04.0 Off |                    0 |
| N/A   70C    P8             12W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [9]:
import gc
import json
import transformers
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

# Model chosen

![alt text](./assets/t5-diagram.png)

In [10]:
pretrained_model_used = "google-t5/t5-small"

# Load the model directly onto the GPU
model = T5ForConditionalGeneration.from_pretrained(pretrained_model_used, device_map=DEVICE, torch_dtype="auto")
tokenizer = T5Tokenizer.from_pretrained(pretrained_model_used)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [11]:
!nvidia-smi # check how much memory model consumed on GPU

Wed Oct  9 04:35:36 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       On  |   00000000:00:04.0 Off |                    0 |
| N/A   71C    P0             30W /   70W |     341MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# Dataset loading and tokenization

It is important to follow the input format described in the paper; otherwise we would not be fine-tuning the model correctly.

![alt text](./assets/input-out-format.png)

Example of how the SQuAD dataset should be prepared for fine-tuning.

![alt text](./assets/squad-first-pt.png)
![alt text](./assets/squad-second-pt.png)
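
The text-to-text format described above can be illustrated with plain strings. This is only a minimal sketch: `format_squad_example` is a hypothetical helper that does not appear in this notebook, and the sample question and context are made up for illustration.

```python
def format_squad_example(question: str, context: str, answer: str) -> tuple[str, str]:
    """Build the (input, target) text pair in the T5 text-to-text style."""
    # The input carries both prefixes; the target is just the answer text.
    source = f"question: {question} context: {context}"
    target = answer
    return source, target


source, target = format_squad_example(
    question="Where was she born?",
    context="She was born and raised in Houston, Texas.",
    answer="Houston, Texas",
)
print(source)
print(target)
```

Note that the cells below tokenize the question and the context as a pair, so the tokenizer additionally places an EOS token between the two parts.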

In [12]:
class SQuADDataset:
    def __init__(self, ids, questions, contexts, answers) -> None:
        self.ids = ids
        self.questions = questions
        self.contexts = contexts
        self.answers = answers
        
    def tokenize_input(self) -> dict:
        """
        The tokenizer adds an EOS token after each segment, so the input looks like:
        question: .... </s> context: .... </s>
        
        Anything shorter than max_length is padded with the <pad> token.
        """
        return tokenizer(self.questions, self.contexts, padding='max_length', truncation='only_second', max_length=512, return_tensors="pt")
    
    def tokenize_target(self) -> dict:
        return tokenizer(self.answers, padding='max_length', truncation=True, max_length=24, return_tensors="pt")

In [13]:
def load_dataset_and_preprocess(path: str) -> SQuADDataset:
    with open(path) as fp:
        dataset: dict = json.load(fp)
    
    print(f"Loading dataset SQuAD version: {dataset['version']}")
    
    data: list = dataset.get("data")
    
    ids = []
    questions = []
    contexts = []
    answers = []
    
    for chunk in data:
        print(f"Loading paragraph: {chunk.get('title')}")
        
        paragraphs: list[dict] = chunk.get("paragraphs")
        
        for paragraph in paragraphs:
            qas = paragraph.get("qas")
            context = paragraph.get("context")
            
            for example in qas:
                question = example.get("question")
                q_id = example.get("id")
                
                # Keep only the first answer; unanswerable SQuAD v2.0 questions have an empty list and get ""
                list_of_answers = example.get("answers")
                answer = list_of_answers[0].get("text") if len(list_of_answers) > 0 else ""
                
                q = f"question: {question}"
                ctx = f"context: {context}"
                
                ids.append(q_id)
                questions.append(q)
                contexts.append(ctx)
                answers.append(answer)
                
    return SQuADDataset(ids, questions, contexts, answers)
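
The nesting that the loader walks through (articles, then paragraphs, then question/answer entries) can be seen on a tiny in-memory example. The sketch below uses a hypothetical miniature of the SQuAD v2.0 layout, with invented ids and text, and applies the same "first answer or empty string" rule.

```python
import json

# Hypothetical miniature of the SQuAD v2.0 layout: data -> paragraphs -> qas -> answers
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "The sky is blue.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What color is the sky?",
                            "is_impossible": False,
                            "answers": [{"text": "blue", "answer_start": 11}],
                        },
                        {
                            # Unanswerable question: the list of answers is empty
                            "id": "q2",
                            "question": "Who painted the sky?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}

rows = []
for article in sample["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            answers = qa["answers"]
            # Same rule as the loader: first answer if present, otherwise ""
            first_answer = answers[0]["text"] if answers else ""
            rows.append((qa["id"], first_answer))

print(json.dumps(rows))
```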

In [14]:
ds: SQuADDataset = load_dataset_and_preprocess("./train-v2.0.json")

Loading dataset SQuAD version: v2.0
Loading paragraph: Beyoncé
Loading paragraph: Frédéric_Chopin
Loading paragraph: Sino-Tibetan_relations_during_the_Ming_dynasty
Loading paragraph: IPod
Loading paragraph: The_Legend_of_Zelda:_Twilight_Princess
Loading paragraph: Spectre_(2015_film)
Loading paragraph: 2008_Sichuan_earthquake
Loading paragraph: New_York_City
Loading paragraph: To_Kill_a_Mockingbird
Loading paragraph: Solar_energy
Loading paragraph: Kanye_West
Loading paragraph: Buddhism
Loading paragraph: American_Idol
Loading paragraph: Dog
Loading paragraph: 2008_Summer_Olympics_torch_relay
Loading paragraph: Genome
Loading paragraph: Comprehensive_school
Loading paragraph: Republic_of_the_Congo
Loading paragraph: Prime_minister
Loading paragraph: Institute_of_technology
Loading paragraph: Wayback_Machine
Loading paragraph: Dutch_Republic
Loading paragraph: Symbiosis
Loading paragraph: Canadian_Armed_Forces
Loading paragraph: Cardinal_(Catholicism)
Loading paragraph: Iranian_language

In [15]:
tokenized_input = ds.tokenize_input()
tokenized_target = ds.tokenize_target()

In [16]:
# Let's check one input example
tokenizer.decode(tokenized_input["input_ids"][3])

'question: In what city and state did Beyonce grow up?</s> context: Beyoncé Giselle Knowles-Carter (/bi<unk> j<unk> nse<unk> / bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><p

In [17]:
# Let's check target pair for the above example
tokenizer.decode(tokenized_target["input_ids"][3])

'Houston, Texas</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'
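
The decoded target above consists mostly of `<pad>` tokens. The training loop in this notebook passes these ids as `labels` unchanged, so the padded positions are also counted in the loss. A common alternative, which this notebook does not use, is to replace the padding ids in the labels with -100, the value the Transformers loss ignores. The sketch below shows the idea on plain lists; `mask_padding` is a hypothetical helper and the token ids are made up.

```python
PAD_TOKEN_ID = 0     # pad token id of the T5 tokenizer
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Transformers


def mask_padding(label_ids: list[list[int]], pad_id: int = PAD_TOKEN_ID) -> list[list[int]]:
    """Replace every padding id in the labels with IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in label_ids]


# Two made-up label rows, each ending with EOS (id 1) followed by padding (id 0)
batch = [[8, 15, 1, 0, 0], [42, 1, 0, 0, 0]]
masked = mask_padding(batch)
print(masked)
```

On a tensor of labels the same effect is usually obtained with a boolean mask, for example `labels[labels == tokenizer.pad_token_id] = -100`, applied to a copy so the original ids stay available for decoding.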

# Fine-tuning the model

![alt text](./assets/training.png)

In [18]:
tokenized_input.keys()

dict_keys(['input_ids', 'attention_mask'])

In [19]:
# The order of these tensors must match the order in which each batch is unpacked in the training loop
train_data = TensorDataset(
    tokenized_input.get("input_ids"),
    tokenized_input.get("attention_mask"),
    tokenized_target.get("input_ids"),
    tokenized_target.get("attention_mask"),
)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=16)

In [20]:
from transformers.optimization import Adafactor # As was used in the original paper

In [21]:
optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)

In [22]:
!nvidia-smi

Wed Oct  9 04:37:40 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       On  |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             31W /   70W |     341MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [23]:
def fine_tune_t5(epochs: int):
    model.train(True) # Set model into training mode
    step = 0
    running_loss = 0.
    
    for epoch in range(epochs):
        print(f"EPOCH - {epoch}")
        for data in train_dataloader:
            input_ids: torch.Tensor
            input_att_mask: torch.Tensor
            target_ids: torch.Tensor
            target_att_mask: torch.Tensor

            input_ids, input_att_mask, target_ids, target_att_mask = data

            optimizer.zero_grad()

            loss = model(
                input_ids=input_ids.to(DEVICE),
                attention_mask=input_att_mask.to(DEVICE),
                decoder_attention_mask=target_att_mask.to(DEVICE),
                labels=target_ids.to(DEVICE)
            ).loss

            loss.backward()
            optimizer.step()

            running_loss += loss.item() # accumulate loss

            if step % 100 == 0:
                tmp_loss = running_loss / 100 # Mean loss over the last 100 steps (at step 0 only one loss has been accumulated)

                print(f"Loss: {tmp_loss} at STEP: {step}")
                running_loss = 0.

            step += 1

            # Trick: drop the references to the batch tensors so the unneeded batch can be freed from the GPU after each step
            input_ids, input_att_mask, target_ids, target_att_mask = [None] * 4
            gc.collect()
            torch.cuda.empty_cache()
    model.train(False)
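
One detail of the logging above: at step 0 only a single loss has been accumulated, yet it is still divided by 100, so the first logged value is one hundredth of the actual batch loss. A minimal sketch of a meter that divides by the number of values it has actually seen is shown below; `RunningMean` is a hypothetical helper, not part of this notebook, and the loss values are made up.

```python
class RunningMean:
    """Accumulates values and reports their mean since the last reset."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    def pop_mean(self) -> float:
        # Divide by the number of accumulated values, then reset the meter
        mean = self.total / self.count if self.count else 0.0
        self.total, self.count = 0.0, 0
        return mean


losses = [16.0, 0.5, 0.25]  # made-up per-step losses
meter = RunningMean()

meter.update(losses[0])
first = meter.pop_mean()    # a single value: the mean is the value itself

for value in losses[1:]:
    meter.update(value)
second = meter.pop_mean()   # mean of the two remaining values

print(first, second)
```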

In [24]:
fine_tune_t5(epochs=5) # Run the fine-tuning

EPOCH - 0
Loss: 0.16089296340942383 at STEP: 0
Loss: 0.600663215816021 at STEP: 100
Loss: 0.38326781705021856 at STEP: 200
Loss: 0.34955278113484384 at STEP: 300
Loss: 0.3486653026938438 at STEP: 400
Loss: 0.316450661867857 at STEP: 500
Loss: 0.31583479553461075 at STEP: 600
Loss: 0.3136375516653061 at STEP: 700
Loss: 0.30486680924892423 at STEP: 800
Loss: 0.2969446086883545 at STEP: 900
Loss: 0.3022655549645424 at STEP: 1000
Loss: 0.2967348997294903 at STEP: 1100
Loss: 0.3076150432229042 at STEP: 1200
Loss: 0.2881550267338753 at STEP: 1300
Loss: 0.2972696757316589 at STEP: 1400
Loss: 0.2856902155280113 at STEP: 1500
Loss: 0.2977496628463268 at STEP: 1600
Loss: 0.2894183366000652 at STEP: 1700
Loss: 0.2760576483607292 at STEP: 1800
Loss: 0.2877181653678417 at STEP: 1900
Loss: 0.27939183324575423 at STEP: 2000
Loss: 0.27606276467442514 at STEP: 2100
Loss: 0.2808814299106598 at STEP: 2200
Loss: 0.2647849689424038 at STEP: 2300
Loss: 0.27302410259842874 at STEP: 2400
Loss: 0.2696649876236

KeyboardInterrupt: 
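
To put the interrupted run into perspective, the number of optimizer steps per epoch can be estimated from the dataset size and the batch size. The sketch below assumes that the SQuAD v2.0 training set contains 130,319 questions; this figure is an assumption that is not printed anywhere in this notebook, and `len(ds.ids)` would give the exact value.

```python
import math

NUM_EXAMPLES = 130_319   # assumed size of the SQuAD v2.0 training set
BATCH_SIZE = 16          # batch size used by the DataLoader above
LAST_LOGGED_STEP = 2400  # last complete log line before the interrupt

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
fraction_of_first_epoch = LAST_LOGGED_STEP / steps_per_epoch

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Fraction of the first epoch completed: {fraction_of_first_epoch:.2f}")
```

Under that assumption the run was stopped well before the end of the first of the five requested epochs, so the checkpoint saved in the next cell reflects only a partial pass over the training data.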

In [25]:
model.save_pretrained("./fine-tuned-t5-small-squad-multi-pass")

# Evaluation

In [26]:
ds_eval: SQuADDataset = load_dataset_and_preprocess("./dev-v2.0.json")
eval_input = ds_eval.tokenize_input()

Loading dataset SQuAD version: v2.0
Loading paragraph: Normans
Loading paragraph: Computational_complexity_theory
Loading paragraph: Southern_California
Loading paragraph: Sky_(United_Kingdom)
Loading paragraph: Victoria_(Australia)
Loading paragraph: Huguenot
Loading paragraph: Steam_engine
Loading paragraph: Oxygen
Loading paragraph: 1973_oil_crisis
Loading paragraph: European_Union_law
Loading paragraph: Amazon_rainforest
Loading paragraph: Ctenophora
Loading paragraph: Fresno,_California
Loading paragraph: Packet_switching
Loading paragraph: Black_Death
Loading paragraph: Geology
Loading paragraph: Pharmacy
Loading paragraph: Civil_disobedience
Loading paragraph: Construction
Loading paragraph: Private_school
Loading paragraph: Harvard_University
Loading paragraph: Jacksonville,_Florida
Loading paragraph: Economic_inequality
Loading paragraph: University_of_Chicago
Loading paragraph: Yuan_dynasty
Loading paragraph: Immune_system
Loading paragraph: Intergovernmental_Panel_on_Climate

In [27]:
data_for_evaluation = TensorDataset(eval_input["input_ids"])
evaluation_dataloader = DataLoader(data_for_evaluation, batch_size=16)

In [28]:
def evaluate(model: T5ForConditionalGeneration, dataloader: DataLoader, out_file: str, question_ids: list[str]) -> None:
    """
    Instead of computing metrics inside the notebook, we write the generated answers
    to a .json file and use the official SQuAD script to compare the models.
    """
    results = []

    for data in dataloader:
        batch_of_inputs = data[0].to(DEVICE)
        # Inputs are padded to max_length, so mark the <pad> positions for the encoder
        attention_mask = (batch_of_inputs != tokenizer.pad_token_id).long()
        
        outputs = model.generate(input_ids=batch_of_inputs, attention_mask=attention_mask)
        
        # Move the generated ids to the CPU so the results list does not keep GPU memory alive
        results.extend(outputs.cpu())
        
        # Drop the remaining GPU reference to free CUDA memory
        outputs = None
        gc.collect()
        torch.cuda.empty_cache()
    
    print(f"Done with inference, decoding answers and generating output file at {out_file}")
    
    out_file_contents = {}

    for question_id, generated_answer in zip(question_ids, results):
        out_file_contents[question_id] = tokenizer.decode(generated_answer, skip_special_tokens=True)
        
    with open(out_file, "w") as file_pointer:
        json.dump(out_file_contents, file_pointer)
    
    print("You can check the file now!")

## Evaluation before finetuning

In [29]:
pretrained_model_used = "google-t5/t5-small"

# Load the baseline model directly onto the GPU
baseline_model = T5ForConditionalGeneration.from_pretrained(pretrained_model_used, device_map=DEVICE, torch_dtype="auto")

In [30]:
evaluate(model=baseline_model, dataloader=evaluation_dataloader, out_file="baseline.json", question_ids=ds_eval.ids)



Done with inference, decoding answers and generating output file at baseline.json
You can check the file now!


## Evaluation after finetuning

In [31]:
pretrained_model_used = "google-t5/t5-small"

# Load the fine-tuned checkpoint directly onto the GPU
fine_tuned_model = T5ForConditionalGeneration.from_pretrained("./fine-tuned-t5-small-squad-multi-pass", device_map=DEVICE, torch_dtype="auto")

In [32]:
evaluate(model=fine_tuned_model, dataloader=evaluation_dataloader, out_file="finetuned.json", question_ids=ds_eval.ids)

Done with inference, decoding answers and generating output file at finetuned.json
You can check the file now!
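
Both output files map a question id to a predicted answer string, which is the input the official SQuAD evaluation script works with. For a quick sanity check before running that script, exact match and token-level F1 can be approximated in a few lines. The sketch below follows the usual SQuAD answer normalization (lowercasing, stripping punctuation and the articles "a", "an", "the", collapsing whitespace); it is an illustration only, and reported numbers should come from the official script.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    # If either side is empty, the score is 1 only when both are empty
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Houston, Texas", "houston texas"))
print(f1_score("in Houston", "Houston, Texas"))
```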
