<a href="https://colab.research.google.com/github/karou1182001/NLPAssignments/blob/main/Assignment4/NLPAssignment4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Assignment 3

The goal of this assignment is to:
1. Familiarize yourself with how to set up a cloud-based environment (e.g., Colab) to train and evaluate
LLMs using open-source tools.
2. Learn how to fine-tune a pre-trained large language model on a custom text dataset.
3. Conduct inference with the fine-tuned model and evaluate its outputs.



## 1. Set Up Your Environment

Make sure to enable a GPU runtime for faster training (in Colab: Runtime > Change runtime type).

In [1]:
import torch

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

x = torch.rand(1).to(device)
print(f"Tensor in {device}: {x}")

Running on: cuda
GPU: Tesla T4
Tensor in cuda: tensor([0.1689], device='cuda:0')


Install or upgrade the necessary libraries (if needed).

In [3]:
!pip install --quiet transformers datasets

Import packages

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset


print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())


Torch version: 2.6.0+cu124
CUDA available: True


## 2. Choose a Pretrained Model

We will use a smaller model so that fine-tuning can be done within a reasonable time
and within the memory constraints of free GPU instances. The assignment suggests two models;
I used distilgpt2.

In [22]:
model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
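tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS so padding does not raise an error
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))  # keeps the embedding matrix in sync with the tokenizer (no new tokens were added here)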


model.config.pad_token_id = tokenizer.pad_token_id


## 3. Prepare a Text Dataset

I selected option A: load a built-in dataset from Hugging Face (e.g., wikitext).

In [23]:
# Load the raw WikiText-2 dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

dataset

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

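Split the dataset into training and validation sets if it is not already split. This dataset already ships with train, validation, and test splits, so they are used directly.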
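For a dataset that ships without a validation split, one option is to carve one out of the training data. The sketch below assumes the object behaves like a Hugging Face `DatasetDict` whose `train` split supports `train_test_split`. The helper name `ensure_validation_split`, the 10% fraction, and the seed are illustrative choices, not part of the assignment.

```python
def ensure_validation_split(dataset_dict, val_fraction=0.1, seed=42):
    """Return (train, validation), creating a validation split only if missing.

    `dataset_dict` is assumed to behave like a Hugging Face DatasetDict.
    The helper name, fraction, and seed are illustrative choices.
    """
    if "validation" in dataset_dict:
        # Splits already exist (as with wikitext-2-raw-v1): use them as-is.
        return dataset_dict["train"], dataset_dict["validation"]
    # Dataset.train_test_split returns a dict-like with "train" and "test" keys.
    split = dataset_dict["train"].train_test_split(test_size=val_fraction, seed=seed)
    return split["train"], split["test"]
```

With the dataset loaded above, `ensure_validation_split(dataset)` would simply return the existing train and validation splits.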

In [24]:
train_data = dataset["train"]
val_data = dataset["validation"]

# Sanity check: confirm that both splits contain data

# Print the first non-empty training example
for example in train_data:
    if example["text"].strip():  # skip empty lines
        print("Train example:\n", example["text"])
        break

# Print the first non-empty validation example
for example in val_data:
    if example["text"].strip():
        print("\nVal example:\n", example["text"])
        break



Train example:
  = Valkyria Chronicles III = 


Val example:
  = Homarus gammarus = 
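The loops above have to skip blank rows, which suggests the raw splits contain empty lines. A minimal sketch of a predicate that could drop them before tokenization follows. `is_non_empty` is a hypothetical helper, and filtering is a suggestion rather than something this notebook does.

```python
def is_non_empty(example):
    """True when the row's text has non-whitespace content."""
    return bool(example["text"].strip())

# Plain-Python illustration on rows shaped like the dataset's examples.
rows = [{"text": ""}, {"text": " = Valkyria Chronicles III = \n"}, {"text": "   "}]
kept = [row for row in rows if is_non_empty(row)]
print(len(kept))  # 1
```

With the datasets library the same predicate could be passed to `train_data.filter(is_non_empty)`. Filtering changes the number of training rows, and therefore the number of training steps per epoch.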



## 4. Preprocess the Data

Define a preprocessing function to tokenize your text data

In [27]:
def tokenize_function(example):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",  # "longest" is also an option
        max_length=128         # can be raised if GPU memory allows
    )
    # Causal LM labels are a copy of input_ids (the model shifts them internally).
    # Padded positions are not masked here, so they also count toward the loss.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_val = val_data.map(tokenize_function, batched=True)

# Return only these columns, as PyTorch tensors (other columns are hidden, not deleted)
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# Check that it worked:
print(tokenized_train[0])



{'input_ids': tensor([50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 
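The row printed above appears to be an empty line: every token shown is the pad/EOS id 50256. Since `labels` is a plain copy of `input_ids`, such pad positions contribute to the loss. A common alternative is to set the label to -100 wherever the attention mask is 0, because the cross-entropy loss behind the Hugging Face causal LM heads ignores that value. A minimal sketch on plain Python lists follows. `mask_pad_labels` is a hypothetical helper and the token ids in the example are illustrative.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_pad_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with -100."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# One sequence with two real tokens followed by two pad tokens.
batch = {
    "input_ids": [[464, 2106, 50256, 50256]],
    "attention_mask": [[1, 1, 0, 0]],
}
print(mask_pad_labels(batch["input_ids"], batch["attention_mask"]))
# [[464, 2106, -100, -100]]
```

Inside `tokenize_function` this could replace the `.copy()` line with `tokens["labels"] = mask_pad_labels(tokens["input_ids"], tokens["attention_mask"])`. Rows that consist only of padding would then have no valid labels at all, so it is probably worth dropping empty lines first. The transformers library also provides `DataCollatorForLanguageModeling` (with `mlm=False`), which builds masked labels at batch time.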

## 5. Fine-Tune the Model

Use the Trainer API to set up your fine-tuning pipeline:

In [28]:
from transformers import TrainingArguments


training_args = TrainingArguments(
    output_dir="./finetuned_llm",
    eval_strategy="epoch",       # evaluate at the end of each epoch
    save_strategy="epoch",       # save a checkpoint at the end of each epoch
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_steps=500,              # only used when save_strategy="steps"; ignored here
    logging_steps=100,
    load_best_model_at_end=True, # requires matching eval and save strategies
    report_to="none"             # do not report to any external logging service
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)

trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.2256        | 1.380995        |


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=18359, training_loss=1.3763501036237922, metrics={'train_runtime': 1349.2171, 'train_samples_per_second': 27.214, 'train_steps_per_second': 13.607, 'total_flos': 1199286761029632.0, 'train_loss': 1.3763501036237922, 'epoch': 1.0})
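The reported numbers can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and no gradient accumulation (the `TrainingArguments` above set neither), and it treats perplexity as the exponential of the mean cross-entropy loss. Because the loss in this run also counts padded positions, the resulting perplexity is not comparable to published WikiText-2 figures.

```python
import math

train_rows = 36718        # rows in the train split
per_device_batch = 2      # per_device_train_batch_size
epochs = 1                # num_train_epochs

# One optimizer step per batch; a final partial batch would still count.
steps = math.ceil(train_rows / per_device_batch) * epochs
print(steps)  # 18359, matching global_step in TrainOutput

eval_loss = 1.380995      # validation loss after epoch 1
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # 3.98
```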

## 6. Evaluate the Fine-Tuned Model

In [30]:
prompt = "The history of artificial intelligence began in the 1950s when"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)  # follow the model's device instead of assuming CUDA

# Generate text
output_ids = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Prompt:", prompt)
print("Generated text:\n", generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: The history of artificial intelligence began in the 1950s when
Generated text:
 The history of artificial intelligence began in the 1950s when researchers at the University of California , San Diego , and the National Institute of Mental Health ( NIMH ) proposed that AI would be able to solve problems in a way that would allow researchers to
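The warning above appears because only `input_ids` were passed to `generate`. A minimal sketch that also forwards the attention mask and an explicit `pad_token_id` is shown below. `generate_text` is a hypothetical wrapper, and `max_new_tokens=40` is an illustrative value (the cell above uses `max_length=50`, which counts the prompt tokens as well).

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=40):
    """Beam-search generation that passes the attention mask explicitly.

    `model` and `tokenizer` are assumed to be the fine-tuned Hugging Face
    model and its tokenizer; the helper name is illustrative.
    """
    # The tokenizer returns both input_ids and attention_mask.
    encoded = tokenizer(prompt, return_tensors="pt")
    # Move every tensor to the device the model lives on (CPU or GPU).
    encoded = {name: tensor.to(model.device) for name, tensor in encoded.items()}
    output_ids = model.generate(
        **encoded,                            # input_ids and attention_mask
        max_new_tokens=max_new_tokens,        # counts generated tokens only
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True,
        pad_token_id=tokenizer.pad_token_id,  # explicit pad id for generation
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `print(generate_text(model, tokenizer, prompt))` would then replace the generation cell. The generated text may differ from the output shown above, since the stopping length is defined differently.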
