# LLM Exercise Notebook

This notebook is designed to help you practice the core concepts of LLMs (Large Language Models) through a hands-on exercise. The goal is to load a small pre-trained LLM from Hugging Face and fine-tune it on a custom dataset. You will also learn how to evaluate the model's performance and generate text with it.

The exercise is divided into several sections, each focusing on a specific aspect of LLMs. You can run the code snippets in each section to see how they work and modify them to suit your needs. These sections are:

1. **Importing Libraries**: Import the necessary libraries for the exercise.
2. **Loading the Dataset**: Load a custom dataset for training and evaluation.
3. **Preprocessing the Data**: Preprocess the dataset to make it suitable for training.
4. **Loading the Model**: Load a pre-trained LLM from Hugging Face.
5. **Fine-tuning the Model**: Fine-tune the model on the custom dataset.
6. **Evaluating the Model**: Evaluate the model's performance on a test set.
7. **Making Predictions**: Use the fine-tuned model to make predictions on new data.
8. **Conclusion**: Summarize the key takeaways from the exercise.

## Setup

In [1]:
# Setup environment to avoid TensorFlow dependency issues
import os
import sys

# Set environment variables to silence TensorFlow logging and keep Transformers on PyTorch
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Suppress TensorFlow logging
os.environ["USE_TF"] = "0"  # Tell Transformers not to import TensorFlow
os.environ["USE_TORCH"] = "1"  # Tell Transformers to use the PyTorch backend
# Do not set TRANSFORMERS_OFFLINE=1 here: offline mode blocks Hub downloads,
# so loading distilgpt2 would fail unless the model is already cached

# Install required packages
!pip install -q transformers datasets evaluate PyPDF2 torch accelerate  # Trainer needs accelerate with PyTorch
!pip install -q tf-keras

import warnings

# Suppress warnings related to TensorFlow
warnings.filterwarnings('ignore', message='.*tensorflow.*')
warnings.filterwarnings('ignore', message='.*Keras.*')
warnings.filterwarnings('ignore', category=FutureWarning)




## Importing Libraries

In [2]:
# Import required libraries for working with LLMs and datasets
import os
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    TextDataset,  # Deprecated in Transformers in favor of the datasets library; kept here for simplicity
    DataCollatorForLanguageModeling
)
import evaluate
import matplotlib.pyplot as plt
import seaborn as sns

# For PDF processing
from PyPDF2 import PdfReader
import glob
import re

# Set seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)


## Loading the Dataset

In this section, we will load the PDF files from the data folder and extract their text content. Then, we will create a dataset suitable for fine-tuning an LLM.

In [None]:
# Function to extract text from PDF files
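def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of all pages, or "" if the file cannot be read."""
    try:
        reader = PdfReader(pdf_path)
        # Image-only pages may yield no text, so fall back to an empty string
        pages = [page.extract_text() or "" for page in reader.pages]
        return "\n".join(pages)
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
        return ""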

# Get all PDF files in the data directory
pdf_files = sorted(glob.glob("data/*.pdf"))  # Sort so the file order is the same on every run
print(f"Found {len(pdf_files)} PDF files in the data directory.")

# Extract text from all PDF files
texts = []
for pdf_file in pdf_files:
    text = extract_text_from_pdf(pdf_file)
    if text.strip():  # Skip files that yielded no usable text
        texts.append({"source": pdf_file, "text": text})

# Create a pandas DataFrame from the extracted texts
df = pd.DataFrame(texts)
print(f"Extracted text from {len(df)} PDF files successfully.")
df.head()
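
Image-only (scanned) PDFs usually yield little or no text, so it can help to check how much was extracted from each file before training. The sketch below is a hypothetical helper, not part of the original notebook: the name `flag_sparse_documents` and the 200-character threshold are assumptions chosen for illustration. It works on the same list-of-dicts shape as `texts`.

```python
def flag_sparse_documents(records, min_chars=200):
    """Return the sources whose extracted text is shorter than min_chars.

    `records` is a list of {"source": ..., "text": ...} dicts.
    The 200-character default is an arbitrary illustrative threshold.
    """
    return [r["source"] for r in records if len(r["text"].strip()) < min_chars]


# Toy records standing in for real extraction results
sample_records = [
    {"source": "data/report.pdf", "text": "A" * 5000},
    {"source": "data/scanned.pdf", "text": "  \n "},
]
sparse_sources = flag_sparse_documents(sample_records)
print(sparse_sources)
```

Files flagged this way may need OCR or manual review before they contribute anything useful to the training set.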

## Preprocessing the Data

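Now we'll preprocess the extracted text to make it suitable for training an LLM. This involves cleaning the text, splitting the documents into training and validation sets, and writing each split to a plain-text file. Tokenization happens later, when the training datasets are built.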

In [None]:
# Clean the text data
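def clean_text(text):
    # Keep only word characters, whitespace, and basic punctuation
    # (this also drops symbols such as parentheses, %, and /)
    text = re.sub(r'[^\w\s.,!?:;"\'-]', '', text)
    # Collapse whitespace last, so removed characters don't leave double spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()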

# Apply cleaning to the DataFrame
df['clean_text'] = df['text'].apply(clean_text)

# Save processed text to files for model training
os.makedirs('processed_data', exist_ok=True)

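# Split documents into training and validation sets (80/20 split by document, in the
# order the files were found; needs at least two PDFs so that neither set is empty)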
train_size = int(0.8 * len(df))
train_df = df[:train_size]
val_df = df[train_size:]

# Write to files
with open('processed_data/train.txt', 'w', encoding='utf-8') as f:
    for text in train_df['clean_text']:
        f.write(text + '\n\n')

with open('processed_data/validation.txt', 'w', encoding='utf-8') as f:
    for text in val_df['clean_text']:
        f.write(text + '\n\n')

print(f"Saved {len(train_df)} training examples and {len(val_df)} validation examples.")
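
To see what the cleaning step keeps and discards, here is a small standalone sketch. It uses the same two regular expressions as the cell above under a hypothetical name, `clean_text_demo`, and applies the whitespace collapse last so that removed characters do not leave double spaces. The sample string is made up for illustration.

```python
import re


def clean_text_demo(text):
    # Drop everything except word characters, whitespace, and basic punctuation
    text = re.sub(r'[^\w\s.,!?:;"\'-]', '', text)
    # Collapse whitespace last so removed characters do not leave double spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


raw = "Results  (n = 42):\n\taccuracy rose 15% [see Table 2]"
cleaned = clean_text_demo(raw)
print(cleaned)  # Results n 42: accuracy rose 15 see Table 2
```

Note that the character filter also removes parentheses, brackets, `=`, and `%`, which can change the meaning of numeric or technical passages. Whether that trade-off is acceptable depends on the documents being used.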

## Loading the Model

Let's load a small pre-trained text generation model from Hugging Face. We'll use DistilGPT2, a distilled version of GPT-2 that is smaller and faster while maintaining reasonable performance.

In [None]:
# Define model name - using a small version for faster fine-tuning
model_name = "distilgpt2"  # A small GPT-2 model (82M parameters)

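# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 tokenizers define no padding token; reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token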

# Print model parameters to get an idea of its size
total_params = sum(p.numel() for p in model.parameters())
print(f"Loaded {model_name} with {total_params:,} parameters")

# Check if GPU is available (the Trainer moves the model to the GPU automatically when one is present)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
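
As a rough back-of-the-envelope check on what "small" means here, the sketch below estimates memory from the parameter count. It assumes 32-bit floats and an Adam-style optimizer, so training holds roughly four copies per parameter (weights, gradients, and two optimizer states) before counting activations. The function name `estimate_memory_mb` is hypothetical, and real usage will be higher.

```python
def estimate_memory_mb(num_params, bytes_per_param=4, training=True):
    """Rough memory estimate in MB, ignoring activations and framework overhead."""
    weights = num_params * bytes_per_param
    # Training with an Adam-style optimizer: weights + gradients + two optimizer states
    copies = 4 if training else 1
    return weights * copies / 1e6


num_params = 82_000_000  # Approximate size of distilgpt2
print(f"Inference: ~{estimate_memory_mb(num_params, training=False):.0f} MB")
print(f"Training:  ~{estimate_memory_mb(num_params):.0f} MB")
```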

## Fine-tuning the Model

Now we'll fine-tune the model on our custom dataset of PDF text. We'll use the Hugging Face Trainer API for this task.

In [None]:
# Function to create a dataset for the model
def load_dataset(train_path, val_path, tokenizer):
    # Load datasets
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128  # Context size for training (adjust based on your GPU memory)
    )
    
    val_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=val_path,
        block_size=128
    )
    
    # Create data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=False  # We're doing causal language modeling, not masked language modeling
    )
    
    return train_dataset, val_dataset, data_collator

# Load datasets
train_dataset, val_dataset, data_collator = load_dataset(
    'processed_data/train.txt',
    'processed_data/validation.txt',
    tokenizer
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",  # Renamed to eval_strategy in newer Transformers releases
    load_best_model_at_end=True,
    save_total_limit=2  # Limit the number of checkpoints to save disk space
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Fine-tune the model (Note: This may take a while)
print("Starting model fine-tuning...")
trainer.train()
print("Fine-tuning completed!")

# Save the fine-tuned model
model_path = "./fine_tuned_model"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
print(f"Model and tokenizer saved to {model_path}")
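
The step-based settings above (`warmup_steps`, `eval_steps`, `save_steps`) only make sense relative to the total number of optimizer steps, which depends on the corpus size. The sketch below works through that arithmetic under stated assumptions: a single device, no gradient accumulation, and chunking that drops the final partial block. The function name `training_schedule` and the 100,000-token corpus are hypothetical.

```python
import math


def training_schedule(num_tokens, block_size=128, batch_size=4, epochs=3):
    """Return (blocks, steps per epoch, total steps) for a fixed-block dataset."""
    num_blocks = num_tokens // block_size  # The final partial block is dropped
    steps_per_epoch = math.ceil(num_blocks / batch_size)  # The last batch may be smaller
    total_steps = steps_per_epoch * epochs
    return num_blocks, steps_per_epoch, total_steps


num_blocks, steps_per_epoch, total_steps = training_schedule(100_000)
print(f"{num_blocks} blocks, {steps_per_epoch} steps per epoch, {total_steps} steps in total")

# Compare the step-based settings from the training arguments with the total
for name, value in [("warmup_steps", 500), ("eval_steps", 400), ("save_steps", 800)]:
    status = "reached" if value <= total_steps else "never reached"
    print(f"{name}={value}: {status} within {total_steps} steps")
```

For a corpus of this size, warmup would cover most of training, evaluation would run only once, and no step-based checkpoint would be written. With a small PDF collection, these values may need to be scaled down.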

## Evaluating the Model

Let's evaluate our fine-tuned model on the validation set to see how well it performs.

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(f"Perplexity: {np.exp(eval_results['eval_loss']):.2f}")

# Plot training loss
training_logs = trainer.state.log_history

# Extract training and evaluation losses
train_losses = [log['loss'] for log in training_logs if 'loss' in log]
train_steps = [log['step'] for log in training_logs if 'loss' in log]

eval_losses = [log['eval_loss'] for log in training_logs if 'eval_loss' in log]
eval_steps = [log['step'] for log in training_logs if 'eval_loss' in log]

plt.figure(figsize=(10, 6))
plt.plot(train_steps, train_losses, label='Training Loss')
plt.plot(eval_steps, eval_losses, label='Evaluation Loss')
plt.xlabel('Training Steps')
plt.ylabel('Loss')
plt.title('Training and Evaluation Losses')
plt.legend()
plt.grid(True)
plt.show()
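
Perplexity is the exponential of the mean per-token cross-entropy loss, so it can be read as the effective number of equally likely tokens the model is choosing between. The sketch below illustrates the conversion with two made-up loss values. It assumes the loss is measured in nats (natural logarithm) and uses the GPT-2 vocabulary size of 50,257 as a worst-case reference point.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


gpt2_vocab_size = 50257
uniform_loss = math.log(gpt2_vocab_size)  # Loss of a model guessing uniformly at random
print(f"loss={uniform_loss:.2f} -> perplexity={perplexity(uniform_loss):.0f}")
print(f"loss=3.00 -> perplexity={perplexity(3.0):.2f}")
```

A lower perplexity on the validation set after fine-tuning than before suggests the model has adapted to the domain. Values are only comparable between models that share the same tokenizer.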

## Making Predictions

Now that we have a fine-tuned model, let's use it to generate text based on prompts from our domain.

In [None]:
# Load the fine-tuned model for inference
model_path = "./fine_tuned_model"
inference_model = AutoModelForCausalLM.from_pretrained(model_path)
inference_tokenizer = AutoTokenizer.from_pretrained(model_path)

# Function to generate text from prompt
def generate_text(prompt, max_length=200):
    inputs = inference_tokenizer(prompt, return_tensors="pt")

    # Generate text; max_length counts the prompt tokens as well as the new ones
    outputs = inference_model.generate(
        **inputs,  # Pass input_ids together with the attention mask
        max_length=max_length,
        num_return_sequences=1,
        temperature=0.8,
        top_p=0.9,
        do_sample=True,
        no_repeat_ngram_size=2,
        pad_token_id=inference_tokenizer.eos_token_id  # GPT-2 has no pad token by default
    )

    # Decode and return the generated text
    generated_text = inference_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Try generating text with a few different prompts
test_prompts = [
    "The main findings of the research suggest that",
    "The methodology used in this study involves",
    "In conclusion, the results demonstrate that"
]

for prompt in test_prompts:
    generated = generate_text(prompt)
    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated}")
    print("-" * 80)
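
The `temperature` and `top_p` settings control how the next token is sampled. The sketch below is a simplified pure-Python illustration of the idea, not the library's implementation: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of tokens whose cumulative probability reaches `top_p`. The function names and the four toy logits are assumptions made for the example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax_with_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.9)
print("Kept token indices:", sorted(nucleus))
```

With these toy logits, only the two most likely tokens survive the filter, so the two least likely candidates can never be sampled. Raising `temperature` or `top_p` widens the candidate set and makes the output more varied.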

## Conclusion

In this notebook, we've walked through a full workflow for fine-tuning a small language model (DistilGPT2) on a custom dataset of PDF documents. Here's a summary of the steps:

1. **Data Collection**: Extracted text from PDF documents in the data directory
2. **Data Preprocessing**: Cleaned and formatted the data for model training
3. **Model Setup**: Loaded a pre-trained small language model (DistilGPT2)
4. **Fine-tuning**: Adapted the model to our specific domain using the custom dataset
5. **Evaluation**: Measured the model's performance using perplexity
6. **Text Generation**: Used the fine-tuned model to generate domain-specific text

**Key Takeaways:**

- Even small language models can be effectively fine-tuned for specific domains
- PDF text extraction and preprocessing are crucial steps for creating quality training data
- The Hugging Face ecosystem provides powerful tools for working with language models
- Hyperparameter tuning and proper evaluation are essential for good model performance

**Next Steps:**

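- Try fine-tuning other causal language models (e.g., GPT-2), or encoder-decoder models such as BART or T5, which need a sequence-to-sequence setup (`AutoModelForSeq2SeqLM` and input/target pairs) rather than the causal setup used here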
- Experiment with different hyperparameters to improve performance
- Add task-specific evaluation metrics such as BLEU or ROUGE, which require reference texts to compare the generated output against
- Create a simple application that uses the fine-tuned model