## Installing Required Packages

This code cell installs the necessary Python packages for this project, optimized for CUDA 12.x.

- `transformers`: Provides access to pre-trained models, tokenizers, and training utilities from Hugging Face.
- `peft`: Parameter-Efficient Fine-Tuning library, used for applying techniques like LoRA.
- `datasets`: Provides tools for loading and processing datasets from Hugging Face.
- `accelerate`: Helps in easily running PyTorch training scripts on various hardware setups (like multiple GPUs) with minimal code changes.
- `matplotlib`: A plotting library used for creating visualizations.
- `tqdm`: A library to display progress bars for loops.

In [1]:
# Install required packages optimized for CUDA 12.x
!pip install transformers
!pip install peft
!pip install datasets
!pip install accelerate
!pip install matplotlib
!pip install tqdm



## Importing Libraries and Setting up Environment

This code cell imports essential libraries for building and training a language model, sets up the device for training (GPU if available, otherwise CPU), enables mixed precision for potential memory savings and faster training, and sets random seeds for reproducibility.

- `torch`, `torch.nn`, `torch.optim`, `torch.amp`, `torch.utils.data`: Core PyTorch libraries for building neural networks, optimizers, automatic mixed precision, and data handling.
- `transformers`: Provides pre-trained models and utilities from Hugging Face.
- `peft`: Parameter-Efficient Fine-Tuning library (used for LoRA later).
- `datasets`: For loading and processing datasets.
- `matplotlib.pyplot`: For plotting (used later for memory tracking visualization).
- `time`: For measuring execution time.
- `gc`: For garbage collection (explicit memory management).
- `tqdm.auto`: For displaying progress bars during training.
- `numpy`: For numerical operations.

The code checks for CUDA availability and prints the device being used. It also initializes `torch.amp.GradScaler` for mixed precision training.

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.amp import autocast, GradScaler
from torch.utils.data import DataLoader

from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    GPT2Config,
    get_linear_schedule_with_warmup,
    set_seed
)

from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset

import matplotlib.pyplot as plt
import time
import gc
from tqdm.auto import tqdm
import numpy as np

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")

# Enable mixed precision
scaler = GradScaler()

# Set random seeds for reproducibility
set_seed(42)

Using device: cuda
GPU: Tesla T4
CUDA Version: 12.6


## Hyperparameters and Memory Optimization Settings

This code cell defines the hyperparameters for the model training process, including those specifically for memory optimization and LoRA. It also sets an environment variable to configure PyTorch's CUDA memory allocator for better memory handling.

- `BATCH_SIZE`: Number of samples processed in each forward/backward pass. Reduced for memory.
- `GRADIENT_ACCUMULATION_STEPS`: Number of batches over which gradients are accumulated before an optimizer step. This effectively increases the batch size without requiring more GPU memory per batch.
- `NUM_BATCHES`: The number of batches to use from the dataset. Reduced for memory.
- `EPOCHS`: The number of times the entire dataset is passed through the training process.
- `MAX_SEQUENCE_LENGTH`: The maximum number of tokens in each input sequence. Reduced for memory.
- `MAX_GENERATION_LENGTH`: The maximum number of tokens to generate during text generation.
- `LEARNING_RATE`: The step size for the optimizer during training.
- `WEIGHT_DECAY`: A regularization parameter to prevent overfitting.
- `GPT2_MODEL_NAME`: The name of the pre-trained GPT-2 model to use.
- `LORA_RANK`: The rank of the low-rank matrices in LoRA. A smaller rank means fewer trainable parameters.
- `LORA_ALPHA`: A scaling factor for the LoRA updates.
- `LORA_DROPOUT`: The dropout rate applied to the LoRA layers.
- `os.environ['PYTORCH_CUDA_ALLOC_CONF']`: Sets an environment variable to configure CUDA memory allocation, potentially improving memory usage and preventing out-of-memory errors.

The code also prints the effective batch size, which is the product of `BATCH_SIZE` and `GRADIENT_ACCUMULATION_STEPS`.

In [3]:
# Memory-optimized hyperparameters
BATCH_SIZE = 4  # Reduced from 32
GRADIENT_ACCUMULATION_STEPS = 8  # This gives effective batch size of 32
NUM_BATCHES = 50  # Reduced for memory
EPOCHS = 10 # Can be set to a higher value for better results
MAX_SEQUENCE_LENGTH = 64  # Reduced from 128
MAX_GENERATION_LENGTH = 700
LEARNING_RATE = 5e-5
WEIGHT_DECAY = 0.01

GPT2_MODEL_NAME = "gpt2"

# LoRA-specific hyperparameters
LORA_RANK = 4
LORA_ALPHA = 32
LORA_DROPOUT = 0.1

# Set memory optimization FIRST
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:128'

print(f"Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")

Effective batch size: 32


## Loading and Preparing the Dataset

This code cell loads the "Reddit-TIFU" dataset from Hugging Face. It includes a fallback mechanism to create a small custom dataset if the download fails.

- `load_dataset("Fredithefish/Reddit-TIFU", split="train")`: Attempts to load the specified dataset from the Hugging Face Hub.
- The `try...except` block handles potential errors during dataset loading.
- The `map_to_documents` function is used to ensure the dataset has a 'documents' column, which is expected for subsequent processing. It maps the 'selftext' or 'title' field to 'documents'.
- `Dataset.from_dict({'documents': texts})`: Creates a simple fallback dataset if the primary dataset cannot be loaded.
- The code prints the total size of the loaded dataset.

In [4]:
# Load Reddit TIFU dataset
try:
    dataset = load_dataset("Fredithefish/Reddit-TIFU", split="train")

    # Map to expected 'documents' field if needed
    if 'documents' not in dataset.column_names:
        def map_to_documents(example):
            return {'documents': example.get('selftext', example.get('title', ''))}
        dataset = dataset.map(map_to_documents)

except:
    # Fallback to custom dataset if download fails
    from datasets import Dataset
    texts = ["TIFU by accidentally sending an embarrassing text to the wrong person."] * (NUM_BATCHES * BATCH_SIZE)
    dataset = Dataset.from_dict({'documents': texts})

print(f"Dataset size: {len(dataset)}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tifu_collection.jsonl: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/619 [00:00<?, ? examples/s]

Map:   0%|          | 0/619 [00:00<?, ? examples/s]

Dataset size: 619


## Examining Dataset Structure

This code cell examines the structure of the loaded dataset by printing a sample document and a sample title. This helps to understand the format and content of the data being used for training.

In [5]:
# Examine dataset structure
sample = dataset[0]
print("Sample document:")
print(sample['documents'][:500] + "...")
print("\nSample title:")
print(sample['title'])

Sample document:
TIFU by raising the flag upside down on a military base and causing local farmers to think the base was in distress....

Sample title:
TIFU by raising the flag upside down on a military base and causing local farmers to think the base was in distress.


## Tokenization and DataLoader Creation

This code cell initializes the GPT-2 tokenizer, preprocesses the dataset by tokenizing the text, and creates a PyTorch DataLoader for efficient batching during training.

- `GPT2Tokenizer.from_pretrained(GPT2_MODEL_NAME)`: Loads the tokenizer associated with the specified GPT-2 model.
- `tokenizer.pad_token = tokenizer.eos_token`: Sets the padding token to be the same as the end-of-sequence token, which is common practice for GPT-2.
- `tokenize_function(examples)`: A helper function that takes a batch of examples and tokenizes the 'documents' field, applying truncation and padding to a fixed `MAX_SEQUENCE_LENGTH`.
- `dataset.select(range(actual_samples))`: Selects a subset of the dataset based on the calculated `actual_samples`.
- `small_dataset.map(...)`: Applies the `tokenize_function` to the selected subset of the dataset, removing the original columns and returning the tokenized data.
- `tokenized_dataset.set_format(...)`: Sets the format of the tokenized dataset to PyTorch tensors, specifying the 'input_ids' and 'attention_mask' columns.
- `DataLoader(...)`: Creates a DataLoader object that will provide batches of data during training. `batch_size` is set to `BATCH_SIZE` and `shuffle=True` is used to randomize the order of data for each epoch.

In [6]:
# Initialize tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(GPT2_MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Preprocess dataset
def tokenize_function(examples):
    # Use only the documents field
    texts = examples['documents']
    return tokenizer(
        texts,
        truncation=True,
        padding='max_length',
        max_length=MAX_SEQUENCE_LENGTH,
        return_tensors='pt'
    )

# Adjust sample size based on available data
available_samples = len(dataset)
total_needed = NUM_BATCHES * BATCH_SIZE
actual_samples = min(available_samples, total_needed)

print(f"Dataset has {available_samples} examples")
print(f"Using {actual_samples} examples")

# Take subset and tokenize
small_dataset = dataset.select(range(actual_samples))
tokenized_dataset = small_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=small_dataset.column_names
)

# Convert to PyTorch dataset
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

# Create DataLoader
train_dataloader = DataLoader(
    tokenized_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True
)

print(f"Training batches: {len(train_dataloader)}")

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Dataset has 619 examples
Using 200 examples


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Training batches: 50


## GPU Memory Tracker Class

This code cell defines a Python class `GPUMemoryTracker` to monitor and record GPU memory usage during training.

- `__init__(self, target_batches, print_stats=False)`: Initializes the tracker with a list of `target_batches` at which to record memory usage and a flag `print_stats` to control whether to print statistics during tracking.
- `_compute_memory_usage(self)`: A helper method that calculates the current and peak GPU memory usage in GB and appends the peak usage to the `memory_usage` list. It also prints the current and peak usage if `print_stats` is True.
- `on_epoch_begin(self, epoch)`: Records memory usage at the beginning of each epoch.
- `on_batch_begin(self, batch)`: Records memory usage at the beginning of specified batches.
- `on_epoch_end(self, epoch)`: Records memory usage at the end of each epoch.

In [7]:
class GPUMemoryTracker:
    def __init__(self, target_batches, print_stats=False):
        self.target_batches = target_batches
        self.print_stats = print_stats
        self.memory_usage = []
        self.labels = []

    def _compute_memory_usage(self):
        if torch.cuda.is_available():
            # Convert bytes to GB
            peak_usage = torch.cuda.max_memory_allocated() / (2**30)
            self.memory_usage.append(round(peak_usage, 3))

            if self.print_stats:
                current_usage = torch.cuda.memory_allocated() / (2**30)
                print(f"Current memory: {current_usage:.3f}GB, Peak memory: {peak_usage:.3f}GB")

    def on_epoch_begin(self, epoch):
        self._compute_memory_usage()
        self.labels.append(f"epoch {epoch} start")

    def on_batch_begin(self, batch):
        if batch in self.target_batches:
            self._compute_memory_usage()
            self.labels.append(f"batch {batch}")

    def on_epoch_end(self, epoch):
        self._compute_memory_usage()
        self.labels.append(f"epoch {epoch} end")

## Text Generation Function

This code cell defines a helper function `generate_text` to generate text using a trained language model.

- `generate_text(model, tokenizer, input_text, max_length=200, device='cuda')`: Takes the trained `model`, `tokenizer`, the `input_text` prompt, the maximum length of generated text, and the `device` as input.
- `model.eval()`: Sets the model to evaluation mode, which disables dropout and batch normalization.
- `tokenizer.encode(input_text, return_tensors='pt').to(device)`: Tokenizes the input text and moves it to the specified device.
- `model.generate(...)`: Uses the model's `generate` method to generate text based on the input. Key arguments include:
    - `max_length`: Maximum number of tokens to generate.
    - `num_return_sequences`: Number of sequences to generate (set to 1 here).
    - `temperature`: Controls the randomness of the generated text.
    - `pad_token_id`: The ID of the padding token.
    - `do_sample`: Enables sampling-based generation.
    - `top_p`: Filters out tokens with a cumulative probability lower than `top_p`.
- `tokenizer.decode(output[0], skip_special_tokens=True)`: Decodes the generated token IDs back into human-readable text, skipping special tokens.
- The function prints the generated text and the time taken for generation.

In [8]:
def generate_text(model, tokenizer, input_text, max_length=200, device='cuda'):
    start = time.time()

    model.eval()

    # Tokenize input
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

    # Generate text
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=1,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
            top_p=0.95
        )

    # Decode output
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    print("\nOutput:")
    print(generated_text)

    end = time.time()
    print(f"Total Time Elapsed: {end - start:.2f}s")

## Optimizer and Scheduler Function

This code cell defines a helper function `get_optimizer_and_scheduler` to create the optimizer and learning rate scheduler for the model training.

- `get_optimizer_and_scheduler(model, num_training_steps)`: Takes the `model` and the total number of training steps as input.
- It separates model parameters into those that should and should not have weight decay applied (e.g., bias and LayerNorm weights are typically excluded).
- `optim.AdamW(...)`: Initializes the AdamW optimizer, which is a popular choice for training Transformer models. It includes parameters for the learning rate (`lr`) and weight decay (`weight_decay`).
- `get_linear_schedule_with_warmup(...)`: Creates a learning rate scheduler that linearly increases the learning rate from 0 to the initial learning rate over a specified number of warm-up steps and then linearly decreases it to 0 over the remaining training steps.
- The function returns the configured optimizer and scheduler.

In [9]:
def get_optimizer_and_scheduler(model, num_training_steps):
    # Separate parameters for weight decay
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay) and p.requires_grad],
            "weight_decay": WEIGHT_DECAY,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay) and p.requires_grad],
            "weight_decay": 0.0,
        },
    ]

    optimizer = optim.AdamW(
        optimizer_grouped_parameters,
        lr=LEARNING_RATE,
        eps=1e-6
    )

    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=100,
        num_training_steps=num_training_steps
    )

    return optimizer, scheduler

## Loading GPT-2 Model

This code cell loads the pre-trained GPT-2 language model from the Hugging Face Hub and moves it to the specified device (GPU if available).

- `GPT2LMHeadModel.from_pretrained(GPT2_MODEL_NAME)`: Loads the weights of the specified GPT-2 model.
- `gpt2_model = gpt2_model.to(device)`: Moves the model to the selected device for computation.
- The code prints the device the model is loaded to and the total and trainable parameters of the model.

In [10]:
# Load GPT-2 model
gpt2_model = GPT2LMHeadModel.from_pretrained(GPT2_MODEL_NAME)
gpt2_model = gpt2_model.to(device)

print(f"Model loaded to {device}")
print(f"Total parameters: {sum(p.numel() for p in gpt2_model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in gpt2_model.parameters() if p.requires_grad):,}")

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Model loaded to cuda
Total parameters: 124,439,808
Trainable parameters: 124,439,808


## Initializing Memory Tracker and Training Steps

This code cell initializes the `GPUMemoryTracker` for monitoring memory usage during training and calculates the total number of training steps based on the number of batches and epochs. It also retrieves the optimizer and scheduler using the previously defined function.

- `gpu_memory_tracker = GPUMemoryTracker(...)`: Creates an instance of the memory tracker.
- `num_training_steps = len(train_dataloader) * EPOCHS`: Calculates the total number of gradient updates that will occur during training.
- `optimizer, scheduler = get_optimizer_and_scheduler(gpt2_model, num_training_steps)`: Calls the helper function to get the configured optimizer and learning rate scheduler for the full GPT-2 model.

In [11]:
# Initialize memory tracker
gpu_memory_tracker = GPUMemoryTracker(
    target_batches=[5, 10, 25, 50, 100, 150, 200, 300, 400, 500],
    print_stats=True,
)

# Calculate total training steps
num_training_steps = len(train_dataloader) * EPOCHS

# Get optimizer and scheduler
optimizer, scheduler = get_optimizer_and_scheduler(gpt2_model, num_training_steps)

print(f"Total training steps: {num_training_steps}")

Total training steps: 500


## Memory-Optimized Training Function

This code cell defines the `train_model_memory_optimized` function, which handles the training loop with considerations for memory efficiency, including gradient accumulation and mixed precision.

- `train_model_memory_optimized(model, dataloader, optimizer, scheduler, memory_tracker, epochs=1)`: Takes the model, data loader, optimizer, scheduler, memory tracker, and number of epochs as input.
- `model.train()`: Sets the model to training mode.
- `model.gradient_checkpointing_enable()`: Enables gradient checkpointing, a memory-saving technique that trades computation time for reduced memory usage during the backward pass.
- The code iterates through epochs and batches, performs forward and backward passes with mixed precision (`autocast`, `GradScaler`), accumulates gradients over `GRADIENT_ACCUMULATION_STEPS`, clips gradients, and updates the model weights and learning rate at the appropriate intervals.
- `torch.cuda.empty_cache()`: Clears the CUDA cache periodically to free up unused memory.
- `memory_tracker.on_epoch_begin(...)`, `memory_tracker.on_batch_begin(...)`, `memory_tracker.on_epoch_end(...)`: Calls the memory tracker methods at different stages of training.
- The function returns the trained model and the recorded memory usage.

In [12]:
def train_model_memory_optimized(model, dataloader, optimizer, scheduler, memory_tracker, epochs=1):
    model.train()

    # Enable gradient checkpointing to save memory
    if hasattr(model, 'gradient_checkpointing_enable'):
        model.gradient_checkpointing_enable()

    for epoch in range(epochs):
        memory_tracker.on_epoch_begin(epoch)
        epoch_loss = 0

        progress_bar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}")

        # Initialize gradient accumulation
        optimizer.zero_grad()

        for batch_idx, batch in enumerate(progress_bar):
            memory_tracker.on_batch_begin(batch_idx)

            # Move batch to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = input_ids.clone()

            # Forward pass with mixed precision
            with autocast('cuda'):
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )
                loss = outputs.loss / GRADIENT_ACCUMULATION_STEPS  # Scale loss

            # Backward pass
            scaler.scale(loss).backward()

            # Only step optimizer every GRADIENT_ACCUMULATION_STEPS
            if (batch_idx + 1) % GRADIENT_ACCUMULATION_STEPS == 0 or (batch_idx + 1) == len(dataloader):
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

                scaler.step(optimizer)
                scaler.update()
                scheduler.step()
                optimizer.zero_grad()

            epoch_loss += loss.item() * GRADIENT_ACCUMULATION_STEPS
            progress_bar.set_postfix({'loss': loss.item() * GRADIENT_ACCUMULATION_STEPS})

            # Clear cache periodically
            if batch_idx % 5 == 0:  # More frequent clearing
                torch.cuda.empty_cache()

        memory_tracker.on_epoch_end(epoch)
        avg_epoch_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch+1} average loss: {avg_epoch_loss:.4f}")

    return model, memory_tracker.memory_usage


## Full GPT-2 Fine-tuning and Saving

This code cell performs the fine-tuning of the full GPT-2 model using the `train_model_memory_optimized` function. It then saves the trained model and tokenizer to a specified directory.

- `gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)`: Loads a fresh instance of the base GPT-2 model.
- `tokenizer = GPT2Tokenizer.from_pretrained("gpt2")`: Loads the corresponding tokenizer.
- `optimizer = optim.AdamW(...)`: Initializes the optimizer for the full model.
- `scheduler = get_linear_schedule_with_warmup(...)`: Initializes the scheduler.
- `train_model_memory_optimized(...)`: Calls the training function to fine-tune the model.
- `save_full_gpt2_model(model, tokenizer, save_path)`: A helper function (defined within the cell) that saves the model and tokenizer to the specified path.
- The code prints messages indicating the start and completion of training and saving.
- Finally, it tests the text generation with the trained model.

In [13]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, get_linear_schedule_with_warmup
import torch.optim as optim

# 1. Load GPT-2 model & tokenizer
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# 2. Define optimizer & scheduler
optimizer = optim.AdamW(
    [p for p in gpt2_model.parameters() if p.requires_grad],
    lr=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
)
num_training_steps = len(train_dataloader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=num_training_steps)

# 3. Train and then save
print("Starting training...")
trained_gpt2, _ = train_model_memory_optimized(
    gpt2_model,
    train_dataloader,
    optimizer,
    scheduler,
    gpu_memory_tracker,
    EPOCHS
)
print("Training completed!")

def save_full_gpt2_model(model, tokenizer, save_path="./my-fine-tuned-gpt2"):
    import os
    os.makedirs(save_path, exist_ok=True)
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    print(f"Saved to {save_path}")

save_full_gpt2_model(trained_gpt2, tokenizer)

print("\nTesting:")
generate_text(trained_gpt2, tokenizer, "I like basketball", max_length=MAX_GENERATION_LENGTH, device=device)

Starting training...
Current memory: 0.476GB, Peak memory: 0.952GB


Epoch 1/10:   0%|          | 0/50 [00:00<?, ?it/s]

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Current memory: 0.985GB, Peak memory: 1.354GB
Current memory: 0.985GB, Peak memory: 1.354GB
Current memory: 0.986GB, Peak memory: 1.358GB
Current memory: 1.447GB, Peak memory: 2.380GB
Epoch 1 average loss: 7.9074
Current memory: 1.447GB, Peak memory: 2.380GB


Epoch 2/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.380GB
Current memory: 1.915GB, Peak memory: 2.380GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 2 average loss: 6.8101
Current memory: 1.447GB, Peak memory: 2.383GB


Epoch 3/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 3 average loss: 3.9802
Current memory: 1.447GB, Peak memory: 2.383GB


Epoch 4/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 4 average loss: 1.8619
Current memory: 1.447GB, Peak memory: 2.383GB


Epoch 5/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 5 average loss: 1.1371
Current memory: 1.447GB, Peak memory: 2.383GB


Epoch 6/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 6 average loss: 0.8726
Current memory: 1.447GB, Peak memory: 2.383GB


Epoch 7/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 7 average loss: 0.7611
Current memory: 1.447GB, Peak memory: 2.383GB


Epoch 8/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 8 average loss: 0.7096
Current memory: 1.447GB, Peak memory: 2.383GB


Epoch 9/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 9 average loss: 0.6581
Current memory: 1.447GB, Peak memory: 2.383GB


Epoch 10/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.915GB, Peak memory: 2.383GB
Current memory: 1.916GB, Peak memory: 2.383GB
Current memory: 1.447GB, Peak memory: 2.383GB
Epoch 10 average loss: 0.6199
Training completed!


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Saved to ./my-fine-tuned-gpt2

Testing:

Output:
I like basketball and I think I'm getting a little too caught up in my own life."
Total Time Elapsed: 0.73s


## Text Generation with Base GPT-2

This code cell demonstrates text generation using the original, pre-trained base GPT-2 model (before any fine-tuning). This is done to compare the output with the fine-tuned models later.

- `generate_text(gpt2_model, tokenizer, "I like basketball", max_length=MAX_GENERATION_LENGTH, device=device)`: Calls the `generate_text` function with the base GPT-2 model and a sample input prompt.
- The generated text and the time taken for generation are printed.

In [14]:
print("Generating text with base GPT-2...")
generate_text(gpt2_model, tokenizer, "I like basketball", max_length=MAX_GENERATION_LENGTH, device=device)
#generate_text(gpt2_model, tokenizer, "That Italian restaurant is", max_length=MAX_GENERATION_LENGTH, device=device)

Generating text with base GPT-2...

Output:
I like basketball.
Total Time Elapsed: 0.03s


## Loading Saved Fine-tuned GPT-2 Model

This code cell loads the fine-tuned GPT-2 model and tokenizer that were previously saved to disk.

- `GPT2LMHeadModel.from_pretrained("./my-fine-tuned-gpt2").to(device)`: Loads the model weights from the specified directory and moves the model to the device.
- `GPT2Tokenizer.from_pretrained("./my-fine-tuned-gpt2")`: Loads the tokenizer from the same directory.
- `tokenizer.pad_token = tokenizer.eos_token`: Sets the padding token again, if necessary.
- The code prints a message confirming that the model is loaded.
- Finally, it tests text generation with the loaded fine-tuned model to verify it's working correctly.

In [15]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load from disk
model = GPT2LMHeadModel.from_pretrained("./my-fine-tuned-gpt2").to(device)
tokenizer = GPT2Tokenizer.from_pretrained("./my-fine-tuned-gpt2")
tokenizer.pad_token = tokenizer.eos_token  # if needed
# Then generate
print("Generating text with loaded fine-tuned GPT-2…")
generate_text(model, tokenizer, "I like Chocolate", max_length=MAX_GENERATION_LENGTH, device=device)

Generating text with loaded fine-tuned GPT-2…

Output:
I like Chocolatey
Total Time Elapsed: 0.02s


## Memory Cleanup Before LoRA Training

This code cell explicitly cleans up GPU memory before starting the LoRA training process. This is important to free up resources used by the full fine-tuning step.

- `del trained_gpt2`, `del optimizer`, `del scheduler`: Deletes the variables associated with the full fine-tuned model and its optimizer/scheduler.
- `gc.collect()`: Forces Python's garbage collector to release memory.
- `torch.cuda.empty_cache()`: Clears the CUDA cache, releasing unused GPU memory.
- The code prints a message indicating that memory has been cleared and that a fresh model will be loaded for LoRA.

In [16]:
# Clean up memory from previous model
del trained_gpt2
del optimizer
del scheduler
gc.collect()
torch.cuda.empty_cache()

print("Memory cleared. Loading fresh GPT-2 for LoRA...")

Memory cleared. Loading fresh GPT-2 for LoRA...


## Loading Fresh GPT-2 for LoRA and Configuring LoRA

This code cell loads a fresh instance of the base GPT-2 model specifically for applying LoRA and then configures and applies the LoRA adapters.

- `base_model = GPT2LMHeadModel.from_pretrained(GPT2_MODEL_NAME)`: Loads a clean version of the base GPT-2 model.
- `lora_config = LoraConfig(...)`: Defines the LoRA configuration with parameters like `task_type`, `inference_mode`, `r` (rank), `lora_alpha`, `lora_dropout`, and `target_modules`. `target_modules` specifies which layers in the base model will have LoRA adapters applied.
- `lora_model = get_peft_model(base_model, lora_config)`: Applies the LoRA configuration to the base model, creating a LoRA-enabled model.
- `lora_model = lora_model.to(device)`: Moves the LoRA model to the specified device.
- `lora_model.print_trainable_parameters()`: Prints a summary of the trainable parameters in the LoRA model, highlighting the significant reduction compared to the full model.

In [17]:
# Load a fresh GPT-2 model for LoRA
base_model = GPT2LMHeadModel.from_pretrained(GPT2_MODEL_NAME)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=LORA_RANK,  # rank
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["c_attn", "c_proj"],  # target attention modules in GPT-2
)

# Create LoRA model
lora_model = get_peft_model(base_model, lora_config)
lora_model = lora_model.to(device)

# Print trainable parameters
lora_model.print_trainable_parameters()

trainable params: 405,504 || all params: 124,845,312 || trainable%: 0.3248




## Test Forward Pass with LoRA Model

This code cell performs a quick forward pass with the LoRA model to ensure it is configured correctly and producing outputs.

- `test_input = tokenizer("LoRA is very useful for quick LLM finetuning", return_tensors='pt')`: Tokenizes a sample input string.
- `test_input = {k: v.to(device) for k, v in test_input.items()}`: Moves the tokenized input to the specified device.
- `with torch.no_grad(): outputs = lora_model(**test_input)`: Performs a forward pass with the LoRA model without calculating gradients (since it's just a test).
- The code prints a confirmation message and the shape of the output logits to verify the forward pass was successful.

In [18]:
# Test forward pass
test_input = tokenizer("LoRA is very useful for quick LLM finetuning", return_tensors='pt')
test_input = {k: v.to(device) for k, v in test_input.items()}

with torch.no_grad():
    outputs = lora_model(**test_input)

print(f"Forward pass successful! Output shape: {outputs.logits.shape}")

Forward pass successful! Output shape: torch.Size([1, 12, 50257])


## LoRA Model Training

This code cell initializes the memory tracker for LoRA training, prepares the LoRA model for training, and then trains it using the memory-optimized training function.

- `lora_memory_tracker = GPUMemoryTracker(...)`: Initializes a new memory tracker specifically for the LoRA training phase.
- `base_model = GPT2LMHeadModel.from_pretrained(GPT2_MODEL_NAME)`: Loads a fresh base model again to ensure a clean start for LoRA application.
- **CRITICAL FIX**: The code includes a step to enable input gradients (`enable_input_require_grads` or a forward hook) which is necessary for LoRA training with certain model architectures like GPT-2.
- `lora_config = LoraConfig(...)`: Defines the LoRA configuration, including `target_modules` like "c_fc" in addition to "c_attn" and "c_proj" for potentially better results. `bias="none"` is also specified.
- `lora_model = get_peft_model(base_model, lora_config)`: Applies the LoRA configuration.
- `lora_model.train()`: Sets the LoRA model to training mode.
- `lora_model.print_trainable_parameters()`: Prints the trainable parameters of the LoRA model.
- The code then gets the optimizer and scheduler specifically for the LoRA model.
- `train_model_memory_optimized(...)`: Calls the training function to fine-tune the LoRA model.
- The code prints messages indicating the start and completion of the LoRA training.

In [19]:
# Initialize memory tracker for LoRA
lora_memory_tracker = GPUMemoryTracker(
    target_batches=[5, 10, 25, 50, 100, 150, 200, 300, 400, 500],
    print_stats=True,
)

print("Creating LoRA model...")

# STEP 1: Load base model
base_model = GPT2LMHeadModel.from_pretrained(GPT2_MODEL_NAME)
base_model = base_model.to(device)

# STEP 2: CRITICAL FIX - Enable input gradients BEFORE applying LoRA
if hasattr(base_model, "enable_input_require_grads"):
    base_model.enable_input_require_grads()
else:
    def make_inputs_require_grad(module, input, output):
        output.requires_grad_(True)
    base_model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

print("Input gradients enabled")

# STEP 3: Create LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["c_attn", "c_proj", "c_fc"],
    bias="none",
)

# STEP 4: Apply LoRA to the model
lora_model = get_peft_model(base_model, lora_config)
lora_model.train()

# STEP 5: Verify trainable parameters
lora_model.print_trainable_parameters()

print(f"LoRA Model created successfully!")
print(f"Total parameters: {sum(p.numel() for p in lora_model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in lora_model.parameters() if p.requires_grad):,}")

# Get optimizer and scheduler for LoRA model
lora_optimizer, lora_scheduler = get_optimizer_and_scheduler(lora_model, num_training_steps)

print("Starting LoRA GPT-2 training...")
trained_lora_model, lora_memory_usage = train_model_memory_optimized(
    lora_model, train_dataloader, lora_optimizer, lora_scheduler, lora_memory_tracker, EPOCHS
)
print("LoRA GPT-2 training completed!")

Creating LoRA model...
Input gradients enabled
trainable params: 589,824 || all params: 125,029,632 || trainable%: 0.4717
LoRA Model created successfully!
Total parameters: 125,029,632
Trainable parameters: 589,824
Starting LoRA GPT-2 training...
Current memory: 1.451GB, Peak memory: 2.383GB


Epoch 1/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.479GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 1 average loss: 7.9038
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 2/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 2 average loss: 7.8646
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 3/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 3 average loss: 7.7983
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 4/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 4 average loss: 7.6180
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 5/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 5 average loss: 7.4613
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 6/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 6 average loss: 7.0424
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 7/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 7 average loss: 6.4736
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 8/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 8 average loss: 5.6643
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 9/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 9 average loss: 4.5419
Current memory: 1.481GB, Peak memory: 2.383GB


Epoch 10/10:   0%|          | 0/50 [00:00<?, ?it/s]

Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.483GB, Peak memory: 2.383GB
Current memory: 1.481GB, Peak memory: 2.383GB
Epoch 10 average loss: 3.4179
LoRA GPT-2 training completed!


## Text Generation with LoRA Fine-tuned GPT-2

This code cell demonstrates text generation using the GPT-2 model fine-tuned with LoRA adapters. This allows for comparing the output with the base and full fine-tuned models.

- `generate_text(trained_lora_model, tokenizer, "I like basketball", max_length=MAX_GENERATION_LENGTH, device=device)`: Calls the `generate_text` function with the LoRA-trained model and a sample input prompt.
- `generate_text(trained_lora_model, tokenizer, "That Italian restaurant is", max_length=MAX_GENERATION_LENGTH, device=device)`: Calls the `generate_text` function with another sample input prompt to see the model's response on a different topic.
- The generated text and the time taken for generation for both prompts are printed.

In [20]:
print("Generating text with LoRA fine-tuned GPT-2...")
generate_text(trained_lora_model, tokenizer, "I like basketball", max_length=MAX_GENERATION_LENGTH, device=device)
generate_text(trained_lora_model, tokenizer, "That Italian restaurant is", max_length=MAX_GENERATION_LENGTH, device=device)

Generating text with LoRA fine-tuned GPT-2...

Output:
I like basketball, too, but I'm not looking for a career."

He said he's always been a basketball fan.

"I'm a big fan of the game, and I think you have to be able to support it," he said. "I don't want to be like the kid that's out of school. I don't want to be the kid that's being bullied and bullied by the other kids. I don't want to be the guy that's bullied and bullied by the other kids. I want to be the guy that's going to get in trouble."
Total Time Elapsed: 1.98s

Output:
That Italian restaurant is named after the Italian cuisine of which the restaurant was founded by the famous Italian chef and restaurateur, Giuseppe Rizzoli, who also served as a chef at the Venetian in Rome.
Total Time Elapsed: 0.68s


## Saving and Calculating LoRA Adapter Size

This code cell saves the trained LoRA adapters to a specified directory and then calculates the size of the saved adapter files.

- `trained_lora_model.save_pretrained("./gpt2-lora-reddit")`: Saves only the LoRA adapter weights (not the full base model) to the specified directory.
- The code then calculates the total size of the files within the saving directory in megabytes (MB).
- It prints the size of the LoRA adapter and compares it to the typical size of the full GPT-2 model to highlight the memory efficiency of LoRA.

In [21]:
# Save LoRA adapters
trained_lora_model.save_pretrained("./gpt2-lora-reddit")
print("LoRA adapters saved to ./gpt2-lora-reddit")

# Calculate adapter size
import os
adapter_size = sum(
    os.path.getsize(os.path.join("./gpt2-lora-reddit", f))
    for f in os.listdir("./gpt2-lora-reddit")
    if os.path.isfile(os.path.join("./gpt2-lora-reddit", f))
) / (1024 * 1024)  # Convert to MB

print(f"LoRA adapter size: {adapter_size:.2f} MB")
print("Compare this to the full GPT-2 model which is ~500MB!")

LoRA adapters saved to ./gpt2-lora-reddit
LoRA adapter size: 2.27 MB
Compare this to the full GPT-2 model which is ~500MB!


## Loading LoRA Adapters for Inference

This code cell demonstrates how to load the base GPT-2 model and then attach the previously saved LoRA adapters for efficient inference.

- `base_model_for_inference = GPT2LMHeadModel.from_pretrained(GPT2_MODEL_NAME)`: Loads a clean instance of the base GPT-2 model.
- `loaded_lora_model = PeftModel.from_pretrained(base_model_for_inference, "./gpt2-lora-reddit")`: Loads the LoRA adapters from the specified directory and attaches them to the base model. The result is a LoRA-enabled model ready for inference.
- `loaded_lora_model = loaded_lora_model.to(device)`: Moves the loaded LoRA model to the specified device.
- The code prints a message confirming that the LoRA model is loaded successfully.
- Finally, it tests text generation with the loaded LoRA model to show that the fine-tuned behavior is present even when loading only the adapters.

In [22]:
from peft import PeftModel

# Load base model
base_model_for_inference = GPT2LMHeadModel.from_pretrained(GPT2_MODEL_NAME)

# Load LoRA adapters
loaded_lora_model = PeftModel.from_pretrained(base_model_for_inference, "./gpt2-lora-reddit")
loaded_lora_model = loaded_lora_model.to(device)

print("LoRA model loaded successfully!")

# Test generation
generate_text(loaded_lora_model, tokenizer, "Today I learned that", max_length=150, device=device)

LoRA model loaded successfully!

Output:
Today I learned that the people who are working with the government, the police, the prosecutors, the lawyers, the judges, the lawyers of the state, they are not going to help us. They will not help us."

In the past, "the people who are working with the government, the police, the prosecutors, the lawyers, the judges, the lawyers of the state, they are not going to help us. They will not help us."
Total Time Elapsed: 1.55s


## Complete GPU Memory Clearance Code Block (Only if needed)

This commented-out code cell provides a comprehensive script to free up GPU memory. It is commented out by default to prevent accidental execution, but can be uncommented and run in a new cell when a full memory reset is needed.

- The code first attempts to delete specific variables that might be holding onto GPU memory (models, optimizers, data loaders, etc.).
- It then explicitly calls Python's garbage collector (`gc.collect()`) multiple times to ensure unused objects are cleaned up.
- If CUDA is available, it clears the CUDA cache (`torch.cuda.empty_cache()`) and resets memory statistics, then prints the current GPU memory status.
- Finally, it sets memory optimization environment variables, which can help with memory allocation in subsequent runs.
- The code prints messages throughout the process and provides recommendations for a clean start and memory-optimized settings.

In [31]:
'''
# COMPLETE GPU MEMORY CLEARANCE CODE BLOCK
# Run this in a new cell to completely clear GPU memory

import gc
import torch
import os

print("Starting GPU memory cleanup...")

# Step 1: Delete all model-related variables
variables_to_delete = [
    'gpt2_model', 'trained_gpt2', 'base_model', 'lora_model', 'loaded_lora_model',
    'optimizer', 'scheduler', 'lora_optimizer', 'lora_scheduler',
    'train_dataloader', 'dataset', 'tokenized_dataset', 'small_dataset',
    'tokenizer', 'scaler', 'gpu_memory_tracker', 'lora_memory_tracker',
    'outputs', 'loss', 'input_ids', 'attention_mask', 'labels'
]

deleted_count = 0
for var_name in variables_to_delete:
    if var_name in locals():
        exec(f"del {var_name}")
        deleted_count += 1
        print(f"   ✓ Deleted {var_name}")
    elif var_name in globals():
        exec(f"del {var_name}")
        deleted_count += 1
        print(f"   ✓ Deleted {var_name} (global)")

print(f"Deleted {deleted_count} variables")

# Step 2: Force garbage collection
print("Running garbage collection...")
for i in range(3):  # Run multiple times for thorough cleanup
    collected = gc.collect()
    print(f"   Cycle {i+1}: Collected {collected} objects")

# Step 3: Clear CUDA cache
if torch.cuda.is_available():
    print("Clearing CUDA cache...")
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

    # Reset CUDA memory stats
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.reset_accumulated_memory_stats()

    print("Current GPU memory status:")
    allocated = torch.cuda.memory_allocated() / (1024**3)  # GB
    reserved = torch.cuda.memory_reserved() / (1024**3)   # GB
    max_allocated = torch.cuda.max_memory_allocated() / (1024**3)  # GB

    print(f"   Allocated: {allocated:.2f} GB")
    print(f"   Reserved: {reserved:.2f} GB")
    print(f"   Max allocated: {max_allocated:.2f} GB")
    print(f"   Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / (1024**3):.2f} GB")

    # Calculate free memory
    free_memory = (torch.cuda.get_device_properties(0).total_memory / (1024**3)) - reserved
    print(f"Free memory: {free_memory:.2f} GB")

else:
    print("CUDA not available")

# Step 4: Set memory optimization environment variables
print("Setting memory optimization flags...")
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:128'
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # For debugging if needed

print("GPU memory cleanup completed!")
print("\n You can now run your training code with a clean GPU state.")
print("\n Recommended next steps:")
print("   1. Restart your kernel for the cleanest start (optional but recommended)")
print("   2. Use the memory-optimized hyperparameters:")
print("      - BATCH_SIZE = 2 or 4")
print("      - MAX_SEQUENCE_LENGTH = 64")
print("      - Use gradient accumulation")
'''

'\n# COMPLETE GPU MEMORY CLEARANCE CODE BLOCK\n# Run this in a new cell to completely clear GPU memory\n\nimport gc\nimport torch\nimport os\n\nprint("Starting GPU memory cleanup...")\n\n# Step 1: Delete all model-related variables\nvariables_to_delete = [\n    \'gpt2_model\', \'trained_gpt2\', \'base_model\', \'lora_model\', \'loaded_lora_model\',\n    \'optimizer\', \'scheduler\', \'lora_optimizer\', \'lora_scheduler\',\n    \'train_dataloader\', \'dataset\', \'tokenized_dataset\', \'small_dataset\',\n    \'tokenizer\', \'scaler\', \'gpu_memory_tracker\', \'lora_memory_tracker\',\n    \'outputs\', \'loss\', \'input_ids\', \'attention_mask\', \'labels\'\n]\n\ndeleted_count = 0\nfor var_name in variables_to_delete:\n    if var_name in locals():\n        exec(f"del {var_name}")\n        deleted_count += 1\n        print(f"   ✓ Deleted {var_name}")\n    elif var_name in globals():\n        exec(f"del {var_name}")\n        deleted_count += 1\n        print(f"   ✓ Deleted {var_name} (globa