<img src="https://www.rp.edu.sg/images/default-source/default-album/rp-logo.png" width="200" alt="Republic Polytechnic"/>

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/koayst-rplesson/SDGAI_LLMforGenAIApp_Labs/blob/main/L06/L06.ipynb)

# Setup and Installation

You can run this Jupyter notebook either on your local machine or on Google Colab.

* On a local machine, it is recommended to install Anaconda and create a new development environment called `c3669c`.
* Install the libraries listed below with pip or conda as needed.
---

In [None]:
# Install the libraries. This only needs to be done once; afterwards you can comment out these lines.

!pip install transformers torch
!pip install accelerate

In [1]:
## Versions Checking

# accelerate = 1.1.1
# torch = 2.5.1
# transformers = 4.46.3
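
The versions listed above can be compared with what is installed in the current environment. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `version_report` are illustrative and not part of the original notebook.

```python
from importlib import metadata

# Versions recorded in this notebook when it was written.
EXPECTED = {"accelerate": "1.1.1", "torch": "2.5.1", "transformers": "4.46.3"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(expected=EXPECTED):
    """Map each package name to an (expected, installed) pair."""
    return {name: (want, installed_version(name)) for name, want in expected.items()}


for name, (want, have) in version_report().items():
    print(f"{name}: expected {want}, installed {have}")
```

A mismatch does not necessarily mean the notebook will fail; it only indicates that results may differ from the ones recorded here.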

In [2]:
pip install transformers[torch]

Note: you may need to restart the kernel to use updated packages.


## Permission

Set up an account with Hugging Face if you don't have one.

If you get the error message shown below, you must request access to the Llama-3.2 model(s) before you continue:

`Access to model meta-llama/Llama-3.2-1B-Instruct is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct to ask for access.`

## Hugging Face Jupyter Notebook/Lab Login

- Log in to Hugging Face.
- Go to your account settings and create an access token for this exercise.
- Go to the `Write` tab and give the access token a name, for example, C3669C.
- Copy the access token and store it in a safe place for later use.
- If you lose your access token, you have to create a new one.

---

<img src="huggingface-01.png">

---

<img src="huggingface-02.png">

In [3]:
from huggingface_hub import notebook_login

# Paste your own access token into the login widget when prompted.
# Uncheck `Add token as git credential?`
notebook_login()

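
When the notebook is run as a script or in a session without widgets, the token can be read from an environment variable instead of being pasted in. The following is a minimal sketch, assuming the token has been exported as `HF_TOKEN`; the variable name and the helpers `read_token` and `login_from_env` are illustrative choices, while `login` is the function provided by `huggingface_hub`.

```python
import os


def read_token(env_var="HF_TOKEN"):
    """Return the token stored in the environment variable, or None if unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var="HF_TOKEN"):
    """Log in with the token from the environment; return False if no token is set."""
    token = read_token(env_var)
    if token is None:
        return False
    # Imported here so the helper can be defined even before the library is installed.
    from huggingface_hub import login

    login(token=token)
    return True


# Usage (call this in place of notebook_login()):
# login_from_env()
```

Keeping the token in the environment rather than in a notebook cell avoids committing it to version control by accident.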

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [5]:
# Load the model and tokenizer
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

In [6]:
# Set the padding token if it's not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

## Prepare your FAQ Dataset

A small synthetic FAQ dataset is used for this exercise. The sample questions and answers are listed below.

### Questions:
- Explain quantum computing in simple terms.
- What are the benefits of renewable energy?
- Describe the process of photosynthesis.
- What is the significance of the Renaissance?
- How does the human immune system work?

### Answers:
- Quantum computing uses quantum bits, or qubits, to perform calculations. Unlike classical bits that are either 0 or 1, qubits can exist in multiple states simultaneously, allowing quantum computers to solve certain complex problems faster.
- Renewable energy, such as solar and wind, reduces greenhouse gas emissions, decreases air pollution, and conserves natural resources. It also promotes energy independence and sustainability.
- Photosynthesis is the process by which green plants use sunlight to make food from carbon dioxide and water. It occurs in the chloroplasts, producing oxygen as a byproduct.
- The Renaissance was a cultural movement from the 14th to the 17th century, characterized by a renewed interest in classical art, science, and philosophy. It led to significant advancements in many fields and a shift towards humanism.
- The human immune system protects the body from infections and diseases. It consists of physical barriers, immune cells, and proteins that identify and destroy pathogens like bacteria and viruses.

## Tokenize the Data

We need to tokenize the questions and answers so that the model can process them.

In [7]:
# Define prompts and responses
prompts = [
    "Explain quantum computing in simple terms.",
    "What are the benefits of renewable energy?",
    "Describe the process of photosynthesis.",
    "What is the significance of the Renaissance?",
    "How does the human immune system work?"
]
responses = [
    "Quantum computing uses quantum bits, or qubits, to perform calculations. Unlike classical bits that are either 0 or 1, qubits can exist in multiple states simultaneously, allowing quantum computers to solve certain complex problems faster.",
    "Renewable energy, such as solar and wind, reduces greenhouse gas emissions, decreases air pollution, and conserves natural resources. It also promotes energy independence and sustainability.",
    "Photosynthesis is the process by which green plants use sunlight to make food from carbon dioxide and water. It occurs in the chloroplasts, producing oxygen as a byproduct.",
    "The Renaissance was a cultural movement from the 14th to the 17th century, characterized by a renewed interest in classical art, science, and philosophy. It led to significant advancements in many fields and a shift towards humanism.",
    "The human immune system protects the body from infections and diseases. It consists of physical barriers, immune cells, and proteins that identify and destroy pathogens like bacteria and viruses."
]

# Set a consistent max length
max_length = 50

# Tokenize prompts and responses
tokenized_inputs = tokenizer(prompts, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")
tokenized_labels = tokenizer(responses, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")["input_ids"]

# Ensure labels' padding tokens are ignored in loss computation
tokenized_labels[tokenized_labels == tokenizer.pad_token_id] = -100
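
The cell above tokenizes prompts and responses separately, so each label position holds a response token while the same position in `input_ids` holds a prompt token. Causal language models in `transformers` compute the loss by comparing the prediction at each position with the label at the next position of the same sequence, so a commonly used alternative is to concatenate prompt and response into one sequence and set the prompt positions in the labels to `-100`. The following is a minimal sketch of that layout using made-up token ids instead of a real tokenizer; `build_example` and `IGNORE_INDEX` are hypothetical names, not part of the original notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids, response_ids, max_length, pad_id):
    """Concatenate prompt and response ids and supervise only the response tokens."""
    # One sequence: the model reads the prompt, then learns to produce the response.
    input_ids = (prompt_ids + response_ids)[:max_length]
    # Prompt positions are masked out so they do not contribute to the loss.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    attention_mask = [1] * len(input_ids)
    # Right-pad everything to max_length; padding is also ignored in the loss.
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


example = build_example([11, 12, 13], [21, 22], max_length=8, pad_id=0)
print(example)
```

With real data, the id lists would come from the tokenizer, and `max_length` would need to be large enough to hold a prompt together with its response.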

## Create a Custom Dataset

Use PyTorch's `Dataset` class to create a dataset for training.

In [8]:
from torch.utils.data import Dataset

# Create a custom dataset
class CustomDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        return len(self.inputs["input_ids"])

    def __getitem__(self, idx):
        return {
            "input_ids": self.inputs["input_ids"][idx],
            "attention_mask": self.inputs["attention_mask"][idx],
            "labels": self.labels[idx]
        }

# Instantiate the dataset
dataset = CustomDataset(tokenized_inputs, tokenized_labels)

## Configure Training Parameters

Set up the training parameters using the `TrainingArguments` class and initialise the `Trainer`.

In [9]:
from transformers import Trainer, TrainingArguments, DataCollatorForSeq2Seq

# Data collator to handle padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

# Increase the number of training epochs if necessary
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,  # Increase to 5 or more
    weight_decay=0.01
)


# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator
)
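
To get a feel for how short this training run is, the number of optimizer steps can be worked out from the dataset size and the arguments above. The following is a minimal sketch, assuming a single device and no gradient accumulation; `optimizer_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs):
    """Count optimizer steps when the last, smaller batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 5 FAQ pairs, per_device_train_batch_size=2, num_train_epochs=3
total = optimizer_steps(num_examples=5, batch_size=2, num_epochs=3)
print(total)  # 9
```

Because `TrainingArguments` logs every 500 steps by default (`logging_steps=500`), a run of 9 steps is likely to finish before any training loss is reported; passing `logging_steps=1` is one way to see the loss at every step.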

## Train the Model

In [10]:
# Start training
try:
    trainer.train()
except ValueError as e:
    print("\nError during training:")
    print(e)

# Note:
# On a laptop (i7-1165G7 @ 2.8 GHz, 16 GB RAM, Windows 11 Enterprise), training took 21 min 40 s to complete.



## Save the Trained Model

In [11]:
# Save the model and tokenizer
model.save_pretrained("./trained_model")
tokenizer.save_pretrained("./trained_model")

print("Model and tokenizer saved successfully!")

Model and tokenizer saved successfully!


## Test the Model

Test the newly fine-tuned model.

In [12]:
import torch
import textwrap
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

In [13]:
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./trained_model")
tokenizer = AutoTokenizer.from_pretrained("./trained_model")

In [14]:
# Use the first GPU (device 0) if one is available, otherwise the CPU (device -1)
device = 0 if torch.cuda.is_available() else -1
text_generation = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

In [15]:
%%time

# Test the model
prompt = "Describe the process of photosynthesis."
generated_text = text_generation(prompt, max_new_tokens=50, temperature=0.7, top_p=0.9)[0]["generated_text"]

# Format the output for better readability
wrapped_text = textwrap.fill(generated_text, width=80)

print("\nGenerated Text:\n")
print(wrapped_text)


Generated Text:

Describe the process of photosynthesis. plants and green plants. green plants
are plants that use sunlight, water, and carbon dioxide to produce their food.
plants have leaves that are green and green leaves are the food source for
plants. green plants are plants that have leaves that are green.
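
The generated text above begins with the prompt itself, because the text-generation pipeline returns the prompt followed by the continuation by default. The following is a minimal sketch of separating the two before formatting; `strip_prompt` is a hypothetical helper, and the sample string is an abbreviated copy of the output above.

```python
def strip_prompt(generated_text, prompt):
    """Return only the continuation when the text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


prompt = "Describe the process of photosynthesis."
full_text = prompt + " plants and green plants."
answer = strip_prompt(full_text, prompt)
print(answer)
```

The helper should be applied before `textwrap.fill`, since wrapping changes whitespace and can stop the prefix from matching. Alternatively, the pipeline call accepts `return_full_text=False`, which returns only the newly generated text.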
