# 📓 The GenAI Revolution Cookbook

**Title:** Mastering Fine-Tuning of Large Language Models for Domain-Specific Success

**Description:** Unlock the potential of large language models with our step-by-step guide to fine-tuning for specialized applications, enhancing performance and achieving production-ready solutions.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



# Fine-Tuning Large Language Models: A Step-by-Step Guide

## Introduction

Fine-tuning adapts a pre-trained large language model (LLM) to a specific domain by continuing training on domain data. Prompt engineering leaves the model's weights unchanged, which limits how much specialized knowledge or behavior it can add; fine-tuning updates the weights, so domain terminology, style, and task formats become part of the model itself. This guide walks through dataset preparation, training with Hugging Face Transformers, and evaluation. For more background, see [Privatai's tutorial on LLM fine-tuning](https://www.privatai.co.uk/tutorials/llm-fine-tuning?utm_source=openai) and our article on [customizing LLMs for domain-specific applications](/blog/44830763/mastering-domain-specific-llm-customization-techniques-and-tools-unveiled).

## Installation

To begin, let's install the necessary libraries. These are essential for loading pre-trained models, handling datasets, and performing fine-tuning.

In [None]:
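# Install core libraries for LLM fine-tuning
# transformers: Hugging Face library for working with pre-trained models
# datasets: Library for loading and processing training datasets
# accelerate: Device placement, distributed training, and mixed precision (used by Trainer)
# peft: Parameter-Efficient Fine-Tuning methods like LoRA (recommended)
# scikit-learn: accuracy and F1 metrics used in the evaluation section
# tensorboard: logging backend used by the end-to-end script (report_to="tensorboard")
!pip install transformers datasets accelerate peft scikit-learn tensorboard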

# Optional: Install specific versions for reproducibility
# !pip install transformers==4.35.0 datasets==2.14.0 accelerate==0.24.0 peft==0.6.0
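
Since pinned versions matter for reproducibility, it can help to record what is actually installed before training. The cell below is a minimal sketch using only the standard library; the `installed_versions` helper and the `PACKAGES` list are illustrative names, not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install cell
PACKAGES = ["transformers", "datasets", "accelerate", "peft"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

Saving this output next to your model checkpoints makes it easier to rebuild the same environment later.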

## Project Setup

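This project needs only a few settings: where the training data lives, where model outputs are written, and which base model to start from. Keeping them in environment variables or a small configuration object lets the later cells share the same paths.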
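
One way to set this up is sketched below. The environment variable names (`FT_DATA_DIR`, `FT_OUTPUT_DIR`, `FT_MODEL_NAME`), the `ProjectConfig` class, and the `./data` default are assumptions made for illustration; the model and output defaults mirror the values used later in this guide.

```python
import os
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class ProjectConfig:
    """Paths and base model for one fine-tuning run (hypothetical names)."""
    data_dir: Path
    output_dir: Path
    model_name: str

def load_config(env=None):
    """Build a ProjectConfig from environment variables, with local defaults."""
    env = os.environ if env is None else env
    return ProjectConfig(
        data_dir=Path(env.get("FT_DATA_DIR", "./data")),
        output_dir=Path(env.get("FT_OUTPUT_DIR", "./fine_tuned_model")),
        model_name=env.get("FT_MODEL_NAME", "gpt2"),
    )

def ensure_dirs(cfg):
    """Create the data and output directories if they do not exist yet."""
    cfg.data_dir.mkdir(parents=True, exist_ok=True)
    cfg.output_dir.mkdir(parents=True, exist_ok=True)

config = load_config()
print(config)
```

Call `ensure_dirs(config)` once before training so later cells can write to the output directory without checking for it.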

## Step-by-Step Build

### Dataset Preparation

This step loads a domain-specific dataset, cleans it (whitespace stripping, a minimum-length filter, de-duplication), tokenizes it, and splits it into training and validation sets.

In [None]:
# Load and prepare a domain-specific dataset for fine-tuning
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer
import pandas as pd

# Initialize tokenizer for the base model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set padding token if not already defined (required for batch processing)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def load_and_clean_data(data_path=None):
    """
    Load and clean domain-specific data.
    
    Args:
        data_path (str, optional): Path to custom dataset. If None, uses example data.
    
    Returns:
        Dataset: Cleaned Hugging Face dataset object
    
    Note:
        Adjust cleaning logic based on your domain requirements
    """
    if data_path:
        df = pd.read_csv(data_path)
    else:
        data = {
            "text": [
                "Example domain-specific text 1",
                "Example domain-specific text 2",
                "Example domain-specific text 3"
            ]
        }
        df = pd.DataFrame(data)
    
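    # Drop missing rows, then strip whitespace before filtering so that the
    # length check and de-duplication operate on the final text
    df = df.dropna(subset=['text'])
    df = df.assign(text=df['text'].str.strip())
    df = df[df['text'].str.len() >= 10]
    df = df.drop_duplicates(subset=['text']).reset_index(drop=True)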
    
    dataset = Dataset.from_pandas(df)
    
    return dataset

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=512,
        return_tensors=None
    )

dataset = load_and_clean_data()
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
    desc="Tokenizing dataset"
)

train_test_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")
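
The tokenizer call above pads every example to `max_length=512`, which is simple but can spend most of each batch on padding when the texts are short. The sketch below compares that against padding each batch only to its longest member (in Transformers this is usually done by a data collator rather than in the tokenizer). The token lengths are made-up values and `padded_token_count` is a hypothetical helper.

```python
def padded_token_count(lengths, batch_size, max_length=None):
    """Total tokens after padding, summed over batches.

    With max_length set, every sequence occupies max_length positions.
    Otherwise each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = max_length if max_length is not None else max(batch)
        total += width * len(batch)
    return total

# Hypothetical token counts for eight short domain texts
lengths = [12, 40, 33, 7, 120, 64, 18, 25]
fixed = padded_token_count(lengths, batch_size=4, max_length=512)
dynamic = padded_token_count(lengths, batch_size=4)
print(f"fixed: {fixed} tokens, dynamic: {dynamic} tokens")
```

For these lengths, fixed padding processes 4096 token positions and per-batch padding processes 640, so the choice mainly affects training speed and memory rather than results.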

### Fine-Tuning Techniques

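Fine-tuning methods fall into two broad groups. Full fine-tuning updates every weight in the model. Parameter-efficient fine-tuning (PEFT) freezes most weights and trains a small number of added parameters; LoRA and QLoRA (LoRA applied on top of a quantized base model) are common PEFT methods. Here's a basic full fine-tuning snippet using Hugging Face Transformers: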

In [None]:
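from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no padding token; reuse EOS so the collator can pad batches
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# With mlm=False the collator copies input_ids into labels (padding masked out),
# which the model needs in order to return a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

trainer.train()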

For more detailed techniques, see [Privatai's guide on fine-tuning methods](https://www.privatai.co.uk/tutorials/llm-fine-tuning?utm_source=openai) and our comprehensive guide on [fine-tuning LLMs with Hugging Face Transformers](/blog/44830763/mastering-fine-tuning-of-large-language-models-with-hugging-face).
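
To make the difference between full fine-tuning and LoRA concrete, the arithmetic below counts trainable parameters for one kind of layer. It is a rough sketch: it assumes GPT-2 small's shapes (12 blocks, a fused attention projection of 768 x 2304), ignores biases, applies LoRA only to that projection, and uses rank 8 as an arbitrary illustrative choice. `lora_trainable_params` is a hypothetical helper, not a `peft` function.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed GPT-2 small shapes: 12 blocks, attention projection 768 -> 2304
n_layers, d_in, d_out = 12, 768, 2304
rank = 8  # illustrative choice, not a recommendation

full = n_layers * d_in * d_out                              # weights updated by full fine-tuning
lora = n_layers * lora_trainable_params(d_in, d_out, rank)  # weights trained by LoRA

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions, LoRA trains 294,912 parameters in place of 21,233,664, which is why it fits on much smaller GPUs than full fine-tuning.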

### Full End-to-End Application

The script below integrates all components into a single, runnable pipeline with logging, error handling, and command-line configuration.

In [None]:
# Complete end-to-end fine-tuning pipeline
# This script combines dataset preparation, model selection, training, and evaluation
import os
import argparse
import logging
from pathlib import Path
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)

# Configure logging for the entire pipeline
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class LLMFineTuner:
    """
    End-to-end pipeline for fine-tuning large language models.
    
    This class encapsulates the entire workflow from data loading to model evaluation,
    making it easy to configure and execute fine-tuning jobs.
    """
    
    def __init__(self, config):
        self.config = config
        self.model = None
        self.tokenizer = None
        self.trainer = None
        
        Path(config['output_dir']).mkdir(parents=True, exist_ok=True)
        
        logger.info(f"Initialized LLMFineTuner with config: {config}")
    
    def load_and_prepare_data(self):
        logger.info("Loading and preparing dataset...")
        
        data_path = self.config['data_path']
        
        if data_path.endswith('.csv'):
            dataset = load_dataset('csv', data_files=data_path)['train']
        elif data_path.endswith('.json'):
            dataset = load_dataset('json', data_files=data_path)['train']
        elif data_path.endswith('.txt'):
            dataset = load_dataset('text', data_files=data_path)['train']
        else:
            raise ValueError(f"Unsupported file format: {data_path}")
        
        if len(dataset) == 0:
            raise ValueError("Dataset is empty")
        
        logger.info(f"Loaded {len(dataset)} examples")
        
        self.tokenizer = AutoTokenizer.from_pretrained(self.config['model_name'])
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        def tokenize_function(examples):
            text_column = 'text' if 'text' in examples else list(examples.keys())[0]
            
            return self.tokenizer(
                examples[text_column],
                truncation=True,
                padding='max_length',
                max_length=self.config['max_length'],
                return_tensors=None
            )
        
        tokenized_dataset = dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=dataset.column_names,
            desc="Tokenizing"
        )
        
        split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
        
        logger.info(f"Train size: {len(split_dataset['train'])}, Eval size: {len(split_dataset['test'])}")
        
        return split_dataset['train'], split_dataset['test']
    
    def initialize_model(self):
        logger.info(f"Loading model: {self.config['model_name']}")
        
        # Load weights in full precision: the Trainer's fp16 mixed precision
        # cannot unscale gradients for a model whose weights are already float16.
        # The Trainer also handles device placement, so no device_map is needed.
        self.model = AutoModelForCausalLM.from_pretrained(self.config['model_name'])
        
        # Keep the model config in sync with the tokenizer's padding token
        self.model.config.pad_token_id = self.tokenizer.pad_token_id
        
        logger.info(f"Model loaded: {self.model.num_parameters():,} parameters")
        
        return self.model
    
    def train(self, train_dataset, eval_dataset):
        logger.info("Starting training...")
        
        training_args = TrainingArguments(
            output_dir=self.config['output_dir'],
            num_train_epochs=self.config['num_epochs'],
            per_device_train_batch_size=self.config['batch_size'],
            per_device_eval_batch_size=self.config['batch_size'] * 2,
            learning_rate=self.config['learning_rate'],
            warmup_steps=500,
            weight_decay=0.01,
            logging_dir=f"{self.config['output_dir']}/logs",
            logging_steps=100,
            evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
            eval_steps=500,
            save_strategy="steps",
            save_steps=1000,
            save_total_limit=2,
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            fp16=torch.cuda.is_available(),
            report_to="tensorboard",
            seed=42
        )
        
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=self.tokenizer,
            mlm=False
        )
        
        self.trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            data_collator=data_collator,
            tokenizer=self.tokenizer
        )
        
        train_result = self.trainer.train()
        
        self.trainer.save_model()
        self.tokenizer.save_pretrained(self.config['output_dir'])
        
        metrics = train_result.metrics
        self.trainer.log_metrics("train", metrics)
        self.trainer.save_metrics("train", metrics)
        
        logger.info(f"Training completed. Final loss: {metrics['train_loss']:.4f}")
        
        return metrics
    
    def evaluate(self, eval_dataset):
        logger.info("Evaluating model...")
        
        metrics = self.trainer.evaluate(eval_dataset)
        self.trainer.log_metrics("eval", metrics)
        self.trainer.save_metrics("eval", metrics)
        
        logger.info(f"Evaluation completed. Perplexity: {torch.exp(torch.tensor(metrics['eval_loss'])):.2f}")
        
        return metrics
    
    def run_pipeline(self):
        try:
            train_dataset, eval_dataset = self.load_and_prepare_data()
            self.initialize_model()
            train_metrics = self.train(train_dataset, eval_dataset)
            eval_metrics = self.evaluate(eval_dataset)
            
            results = {
                'train': train_metrics,
                'eval': eval_metrics
            }
            
            logger.info("Pipeline completed successfully!")
            
            return results
            
        except Exception as e:
            logger.error(f"Pipeline failed: {str(e)}")
            raise

def main():
    parser = argparse.ArgumentParser(description="Fine-tune a language model")
    parser.add_argument("--model_name", type=str, default="gpt2", help="Base model name")
    parser.add_argument("--data_path", type=str, required=True, help="Path to training data")
    parser.add_argument("--output_dir", type=str, default="./fine_tuned_model", help="Output directory")
    parser.add_argument("--max_length", type=int, default=512, help="Maximum sequence length")
    parser.add_argument("--num_epochs", type=int, default=3, help="Number of training epochs")
    parser.add_argument("--batch_size", type=int, default=4, help="Training batch size")
    parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate")
    
    args = parser.parse_args()
    
    config = {
        'model_name': args.model_name,
        'data_path': args.data_path,
        'output_dir': args.output_dir,
        'max_length': args.max_length,
        'num_epochs': args.num_epochs,
        'batch_size': args.batch_size,
        'learning_rate': args.learning_rate
    }
    
    fine_tuner = LLMFineTuner(config)
    results = fine_tuner.run_pipeline()
    
    print("\n" + "="*50)
    print("FINE-TUNING COMPLETED")
    print("="*50)
    print(f"Model saved to: {config['output_dir']}")
    print(f"Final training loss: {results['train']['train_loss']:.4f}")
    print(f"Final eval loss: {results['eval']['eval_loss']:.4f}")
    print(f"Perplexity: {torch.exp(torch.tensor(results['eval']['eval_loss'])):.2f}")

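# Example usage (save this cell as fine_tune_llm.py and run it from a shell):
# python fine_tune_llm.py --data_path ./my_data.csv --output_dir ./my_model --num_epochs 5
#
# Inside a notebook, argparse would try to parse the kernel's own arguments; build the
# config dict by hand and call LLMFineTuner(config).run_pipeline() instead.

if __name__ == "__main__":
    main()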
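
The script's `warmup_steps=500`, `eval_steps=500`, and `save_steps=1000` are all counted in optimizer steps, so on a small dataset some of them may never be reached. The sketch below estimates the total step count and flags such settings. It assumes a single device and no gradient accumulation; both helper functions and the 800-example training set are hypothetical.

```python
import math

def total_training_steps(n_train, batch_size, num_epochs):
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    return math.ceil(n_train / batch_size) * num_epochs

def check_schedule(total_steps, warmup_steps=500, eval_steps=500, save_steps=1000):
    """Return warnings for step-based settings that a short run never reaches."""
    warnings = []
    if warmup_steps >= total_steps:
        warnings.append("warmup never finishes")
    if eval_steps > total_steps:
        warnings.append("no mid-training evaluation")
    if save_steps > total_steps:
        warnings.append("no intermediate checkpoint")
    return warnings

# Hypothetical run: 800 training examples with the script's default batch size and epochs
steps = total_training_steps(n_train=800, batch_size=4, num_epochs=3)
print(steps, check_schedule(steps))
```

For this hypothetical run there are 600 steps in total, so 500 of them are spent in warmup and no checkpoint is written before training ends; scaling the step-based settings to the dataset size avoids that.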

## Testing & Validation

The evaluator below combines quantitative metrics (perplexity, plus accuracy and F1 for classification-style tasks) with qualitative inspection of generated text.

In [None]:
# Comprehensive testing and validation for fine-tuned models
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ModelEvaluator:
    """
    Comprehensive evaluation suite for fine-tuned language models.
    
    Provides multiple evaluation strategies:
    - Perplexity on hold-out test set
    - Generation quality assessment
    - Task-specific metrics
    - Human evaluation templates
    """
    
    def __init__(self, model_path, test_data_path=None):
        logger.info(f"Loading model from {model_path}")
        
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device_map="auto" if torch.cuda.is_available() else None
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
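        # The model was already placed on a device by from_pretrained above, so the
        # pipeline gets no device argument (recent transformers versions raise an
        # error when asked to move a model that was loaded with device_map)
        self.generator = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer
        )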
        
        self.test_data = None
        if test_data_path:
            self.test_data = self._load_test_data(test_data_path)
    
    def _load_test_data(self, data_path):
        if data_path.endswith('.csv'):
            dataset = load_dataset('csv', data_files=data_path)['train']
        elif data_path.endswith('.json'):
            dataset = load_dataset('json', data_files=data_path)['train']
        elif data_path.endswith('.txt'):
            dataset = load_dataset('text', data_files=data_path)['train']
        else:
            raise ValueError(f"Unsupported file format: {data_path}")
        
        logger.info(f"Loaded {len(dataset)} test examples")
        return dataset
    
    def calculate_perplexity(self, test_texts=None):
        logger.info("Calculating perplexity...")
        
        if test_texts is None:
            if self.test_data is None:
                raise ValueError("No test data provided")
            text_column = 'text' if 'text' in self.test_data.column_names else self.test_data.column_names[0]
            test_texts = self.test_data[text_column]
        
        total_loss = 0
        total_tokens = 0
        
        self.model.eval()
        with torch.no_grad():
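            for text in test_texts:
                inputs = self.tokenizer(
                    text,
                    return_tensors="pt",
                    truncation=True,
                    max_length=512
                ).to(self.model.device)
                
                # The loss is averaged over predicted tokens; the first token has no
                # preceding context, so a sequence of n tokens yields n - 1 predictions
                num_predicted = inputs["input_ids"].size(1) - 1
                if num_predicted < 1:
                    continue
                
                outputs = self.model(**inputs, labels=inputs["input_ids"])
                
                total_loss += outputs.loss.item() * num_predicted
                total_tokens += num_predicted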
        
        avg_loss = total_loss / total_tokens
        perplexity = np.exp(avg_loss)
        
        logger.info(f"Perplexity: {perplexity:.2f}")
        
        return perplexity
    
    def evaluate_generation_quality(self, prompts, max_length=100, num_return_sequences=3):
        logger.info(f"Generating {num_return_sequences} responses for {len(prompts)} prompts...")
        
        results = []
        
        for prompt in prompts:
            generations = self.generator(
                prompt,
                max_length=max_length,
                num_return_sequences=num_return_sequences,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=self.tokenizer.pad_token_id
            )
            
            generated_texts = [gen['generated_text'] for gen in generations]
            
            results.append({
                'prompt': prompt,
                'generations': generated_texts
            })
            
            logger.info(f"\nPrompt: {prompt}")
            logger.info(f"Generation: {generated_texts[0]}")
        
        return results
    
    def evaluate_task_specific_metrics(self, test_examples, task_type="classification"):
        logger.info(f"Evaluating {task_type} task...")
        
        predictions = []
        ground_truth = []
        
        for example in test_examples:
            input_text = example['input']
            expected = example['expected_output']
            
            output = self.generator(
                input_text,
                max_new_tokens=50,  # budget for the answer only; max_length would also count prompt tokens
                num_return_sequences=1,
                do_sample=False,
                pad_token_id=self.tokenizer.pad_token_id
            )[0]['generated_text']
            
            prediction = output[len(input_text):].strip()
            
            predictions.append(prediction)
            ground_truth.append(expected)
        
        if task_type == "classification":
            accuracy = accuracy_score(ground_truth, predictions)
            f1 = f1_score(ground_truth, predictions, average='weighted')
            
            logger.info(f"Accuracy: {accuracy:.2f}")
            logger.info(f"F1 Score: {f1:.2f}")
            
            return {"accuracy": accuracy, "f1": f1}
        
        return {}

# Example usage
# evaluator = ModelEvaluator(model_path="./fine_tuned_model", test_data_path="./test_data.csv")
# perplexity = evaluator.calculate_perplexity()
# generation_results = evaluator.evaluate_generation_quality(prompts=["Example prompt 1", "Example prompt 2"])
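
Perplexity is the exponential of the average per-token cross-entropy, which is why `calculate_perplexity` weights each text's mean loss by its token count instead of averaging the per-text losses directly. The sketch below shows that aggregation on its own; the loss values and token counts are invented, and `corpus_perplexity` is a hypothetical helper.

```python
import math

def corpus_perplexity(per_text):
    """Perplexity from (mean_loss, n_tokens) pairs, one pair per text."""
    total_loss = sum(loss * n for loss, n in per_text)
    total_tokens = sum(n for _, n in per_text)
    return math.exp(total_loss / total_tokens)

# Hypothetical per-text results: (mean cross-entropy in nats, tokens scored)
results = [(2.0, 10), (4.0, 30)]

weighted = corpus_perplexity(results)
unweighted = math.exp(sum(loss for loss, _ in results) / len(results))
print(f"token-weighted: {weighted:.2f}, unweighted: {unweighted:.2f}")
```

The token-weighted value is exp(3.5) while the unweighted one is exp(3.0), so a few short, easy texts can make a model look better than it is if losses are averaged per text.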

## Conclusion

In this tutorial, we walked through fine-tuning a large language model for a domain-specific application: preparing and tokenizing a dataset, training with the Hugging Face `Trainer`, and evaluating the result with perplexity, generation samples, and task metrics. From here, consider parameter-efficient methods such as LoRA for larger models, or deployment strategies for production environments. For further reading, check out our articles on [advanced fine-tuning techniques](/blog/44830763/advanced-fine-tuning-techniques) and [deploying LLMs in production](/blog/44830763/deploying-llms-in-production).