<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/GenAI/FineTune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Large Language Models: A Practical Guide

This notebook demonstrates how to fine-tune a pre-trained language model on custom data. We'll use a smaller open-source model for demonstration purposes.

## Table of Contents
1. Setup and Dependencies
2. Loading the Pre-trained Model
3. Preparing the Dataset
4. Fine-tuning Configuration
5. Training Process
6. Evaluation
7. Saving and Loading the Fine-tuned Model

In [4]:
# Install required packages
!pip install datasets evaluate


## 1. Setup and Dependencies

We'll use the following libraries:
- `transformers`: Hugging Face's library for working with pre-trained models
- `datasets`: For data handling and preprocessing
- `torch`: Deep learning framework
- `evaluate`: For model evaluation

In [5]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np

## 2. Loading the Pre-trained Model

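We'll use GPT-2 small (the `gpt2` checkpoint on the Hugging Face Hub) as our base model. It has about 124M parameters, which keeps the example cheap to run. GPT-2 ships without a padding token, so we reuse its end-of-sequence token for padding.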

In [6]:
# Load model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

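The 124M figure can be checked by hand. The sketch below tallies the weights of a GPT-2 style decoder, assuming the published GPT-2 small configuration (vocabulary 50,257, context length 1,024, hidden size 768, 12 layers) and an output head tied to the token embeddings. The function name `gpt2_param_count` is made up for this illustration.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count trainable parameters of a GPT-2 style decoder with tied embeddings."""
    # Token and position embedding tables
    embeddings = vocab_size * n_embd + n_positions * n_embd

    # One transformer block
    layer_norms = 2 * (2 * n_embd)                  # two LayerNorms, weight + bias each
    attention = n_embd * 3 * n_embd + 3 * n_embd    # fused query/key/value projection
    attention += n_embd * n_embd + n_embd           # attention output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd          # expand to 4x the hidden size
    mlp += 4 * n_embd * n_embd + n_embd             # project back down
    per_layer = layer_norms + attention + mlp

    final_layer_norm = 2 * n_embd
    # The output head reuses the token embedding matrix, so it adds nothing.
    return embeddings + n_layer * per_layer + final_layer_norm


print(f"{gpt2_param_count():,}")  # 124,439,808
```

With the model from the cell above loaded, `sum(p.numel() for p in model.parameters())` should report the same number.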

## 3. Preparing the Dataset

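For this example, we'll use the first 1,000 lines of the WikiText-2 training split. Each line is tokenized, then padded or truncated to at most 128 tokens, and wrapped in a PyTorch `Dataset` that also supplies the labels needed for fine-tuning.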

In [19]:
# Load dataset (example using a small subset of WikiText)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1000]")

# Prepare the dataset: pull the raw lines out as a plain Python list of strings
texts = list(dataset["text"])

# Tokenize all texts
encodings = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

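# Wrap the encodings in a PyTorch Dataset that the Trainer can consume
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        # For causal language modeling the labels are the inputs themselves;
        # the model shifts them by one position internally.
        item['labels'] = item['input_ids'].clone()
        return item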

    def __len__(self):
        return len(self.encodings.input_ids)

# Create the dataset
train_dataset = TextDataset(encodings)

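One caveat with the dataset above: the pad token is GPT-2's end-of-sequence token and `labels` is a copy of `input_ids`, so padded positions also contribute to the loss. A common remedy is to set the labels at padded positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The pure-Python sketch below shows the idea on toy token IDs; `pad_and_mask` is a hypothetical helper, not part of `transformers`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad one sequence and build its attention mask and labels."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    # Real tokens keep their ID as the label; padded positions are ignored by the loss.
    labels = ids + [IGNORE_INDEX] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = pad_and_mask([464, 2068, 7586], max_length=5, pad_id=50256)
print(example["input_ids"])       # [464, 2068, 7586, 50256, 50256]
print(example["attention_mask"])  # [1, 1, 1, 0, 0]
print(example["labels"])          # [464, 2068, 7586, -100, -100]
```

With the tensors returned by the tokenizer, the equivalent step would be `labels = input_ids.clone()` followed by `labels[attention_mask == 0] = -100`. WikiText also contains empty lines, which become all-padding rows, so filtering them out first is probably worthwhile when masking is used.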

## 4. Fine-tuning Configuration

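We'll set up the training arguments that control the fine-tuning process. Key parameters include:
- Learning rate: not set below, so the `TrainingArguments` default of 5e-5 applies
- Number of epochs: how many passes to make over the training data
- Batch size: examples per device per step
- Training steps: not set directly; determined by the dataset size, batch size, and number of epochs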

In [21]:
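# Training arguments; evaluation during training is left at its default (disabled)
training_args = TrainingArguments(
    output_dir="./results",          # checkpoints are written here
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,                # exceeds this run's ~375 total steps, so the learning rate never reaches its peak
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,                # log the training loss every 10 steps
    save_strategy="epoch",           # save a checkpoint at the end of each epoch
)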

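To see how the training arguments above interact, it helps to work out the number of optimizer steps and the learning rate they imply. The sketch below assumes a single device, no gradient accumulation, the 1,000 training examples loaded earlier, and the `Trainer` defaults of a 5e-5 peak learning rate with a linear schedule. `total_steps` and `linear_lr` are illustrative helpers, not library functions.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run with no gradient accumulation on one device."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step, peak_lr, warmup_steps, num_training_steps):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, num_training_steps - step)
    return peak_lr * remaining / max(1, num_training_steps - warmup_steps)


steps = total_steps(num_examples=1000, batch_size=8, epochs=3)
print(steps)  # 375
print(linear_lr(steps, peak_lr=5e-5, warmup_steps=500, num_training_steps=steps))
```

With `warmup_steps=500` and only 375 steps, the run ends at about 75% of the peak learning rate (roughly 3.75e-5) and never enters the decay phase. A warmup of a few dozen steps would likely suit a run of this length better.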

## 5. Training Process

Now we'll create a Trainer instance and start the fine-tuning process.

In [22]:
# Create trainer using the TextDataset built in section 3
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# Start training
trainer.train()


| Step | Training Loss |
|------|---------------|
| 10   | 7.9371        |
| 20   | 5.9857        |
| 30   | 4.0653        |
| 40   | 3.8174        |
| 50   | 2.7019        |
| 60   | 2.1380        |
| 70   | 1.8547        |
| 80   | 1.6456        |
| 90   | 1.2261        |
| 100  | 1.3752        |



TrainOutput(global_step=375, training_loss=1.7485335413614909, metrics={'train_runtime': 7640.9785, 'train_samples_per_second': 0.393, 'train_steps_per_second': 0.049, 'total_flos': 195969024000000.0, 'train_loss': 1.7485335413614909, 'epoch': 3.0})

## 6. Evaluation

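After training, we'll evaluate the model on the first 100 lines of the WikiText-2 test split and report perplexity, the exponential of the average evaluation loss (lower is better).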

In [23]:
import math

# Load a slice of the test split and tokenize it the same way as the training data
test_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test[:100]")
test_encodings = tokenizer(
    list(test_dataset["text"]),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)
tokenized_test_dataset = TextDataset(test_encodings)

# Evaluate and convert the mean cross-entropy loss to perplexity
eval_results = trainer.evaluate(eval_dataset=tokenized_test_dataset)
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 3.58

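Perplexity is the exponential of the average per-token negative log-likelihood, which is what `eval_loss` reports for a causal language model. A toy calculation with made-up token probabilities (not taken from the model) makes the relationship concrete:

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    mean_nll = sum(nll) / len(nll)  # this is the cross-entropy loss
    return math.exp(mean_nll)


# A model that always gives the right token probability 1/4 is as uncertain
# as a uniform choice among four tokens, so its perplexity is 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```

By the same formula, the perplexity of 3.58 reported above corresponds to an evaluation loss of about 1.28. If padded positions count toward the evaluation loss, as they do for the training set here, the figure will look better than it would with padding masked out.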

## 7. Saving and Loading the Fine-tuned Model

Finally, we'll save our fine-tuned model and show how to load it back.

In [26]:
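# Save the model weights and config
model_path = "./fine_tuned_gpt2"
trainer.save_model(model_path)
# Save the tokenizer alongside the model so both can be reloaded from one directory
tokenizer.save_pretrained(model_path)

# Load the fine-tuned model and tokenizer back from disk
loaded_model = AutoModelForCausalLM.from_pretrained(model_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(model_path)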

## Testing the Fine-tuned Model

Let's test our fine-tuned model with some example prompts.

In [27]:
def generate_text(prompt, max_length=100):
    # Tokenize the prompt and move it to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,  # avoids the missing-mask warning
        pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no dedicated pad token
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,                # never repeat the same bigram
        # Decoding is greedy here, so temperature is omitted: it only takes
        # effect when sampling is enabled with do_sample=True.
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the model
prompt = "The artificial intelligence revolution"
generated_text = generate_text(prompt)
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")


Prompt: The artificial intelligence revolution
Generated: The artificial intelligence revolution has been a major theme of the media coverage of this year , and the focus has largely been on the impact of AI on society . The focus of attention has focused on how humans interact with the world around them , as well as on their motivations and motivations for doing so . This focus is largely due to the fact that humans are increasingly becoming more sophisticated and sophisticated in their interactions with other humans . In addition , the technology has also been shown to be more effective at helping humans learn

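Sampling temperature, one of the decoding settings accepted by `generate`, only matters when tokens are sampled rather than chosen greedily. The logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. A small sketch with made-up logits (the numbers are illustrative, not model outputs):

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after scaling by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (1.0, 0.7):
    probs = softmax_with_temperature(logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

Lowering the temperature from 1.0 to 0.7 raises the top token's probability from roughly 0.67 to roughly 0.77 in this toy case. In `generate`, sampling is enabled with `do_sample=True`; without it, decoding is greedy and the temperature has no effect.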

## Conclusion

In this notebook, we've covered:
1. Setting up the necessary dependencies
2. Loading a pre-trained model
3. Preparing and preprocessing data
4. Configuring and executing the fine-tuning process
5. Evaluating the model's performance
6. Saving and loading the fine-tuned model
7. Testing the model with example prompts

Remember that this is a basic example, and you might need to adjust parameters and configurations based on your specific use case and requirements.