# Building and Fine-Tuning a Transformer-Based LLM for Arithmetic

## Overview:
- **Objective**: Develop a custom transformer-based Language Model (LLM) from scratch to predict results of arithmetic expressions.
- **Steps**:
  1. **Custom Model Design**: Implemented a transformer model with positional encoding and dynamic sequence alignment.
  2. **Dataset Creation**: Generated synthetic arithmetic datasets, tokenized, and split for training, validation, and testing.
  3. **Pretraining**: Pretrained with a masked-token objective on the synthetic expressions.
  4. **Fine-Tuning**: Trained on expression-result pairs with learning rate scheduling and early stopping.
  5. **Evaluation**: Tested the LLM on unseen data, comparing predictions to expected results.

## Output:
- Built and fine-tuned a transformer-based LLM for arithmetic tasks.
- Visualized training progress and evaluated model performance with detailed results.


# Setup: Import Libraries and Configure Device

## Libraries Imported:
- **PyTorch**: For building and training neural networks.
- **Transformers**: For optimization (`AdamW`) and learning rate scheduling (`get_scheduler`).
- **Scikit-learn**: For dataset splitting.
- **Random**: For generating synthetic expressions and sampling mask positions (no seed is set, so runs are not reproducible).

## Device Configuration:
- Checks for CUDA availability and sets the device to GPU (`cuda`) if available, otherwise CPU.


In [1]:
# Import libraries
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from sklearn.model_selection import train_test_split
from transformers import AdamW, get_scheduler
import random

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [2]:
# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available!")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

    # Display details of each GPU
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"  Memory Allocated: {torch.cuda.memory_allocated(i) / 1024 ** 2:.2f} MB")
        print(f"  Memory Cached: {torch.cuda.memory_reserved(i) / 1024 ** 2:.2f} MB")

CUDA is available!
Number of GPUs: 1
GPU 0: NVIDIA L40S
  Memory Allocated: 0.00 MB
  Memory Cached: 0.00 MB


## Custom Tokenizer: ArithmeticTokenizer

### Purpose:
- Tokenizes arithmetic expressions into indices and decodes indices back into expressions.

### Key Features:
- **Vocabulary**:
  - Includes digits (`0-9`), operators (`+-*/`), parentheses, period (`.`), equals sign (`=`), space, `<PAD>`, and `<MASK>` (21 entries in total).
- **Methods**:
  - `encode`: Converts a string into a sequence of token indices; characters outside the vocabulary (such as letters) map to `<PAD>`.
  - `decode`: Converts token indices back to a string, ignoring `<PAD>` tokens.

### Initialization:
- Instantiated as `tokenizer`.
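
The encode/decode behaviour described above can be checked with a short round trip. This is a minimal sketch that restates the tokenizer so it runs on its own; the variable names `tok`, `ids`, `round_trip`, and `lossy` are illustrative and not part of the notebook.

```python
# Minimal re-statement of the tokenizer so this snippet runs on its own.
class ArithmeticTokenizer:
    def __init__(self):
        self.vocab = {ch: i for i, ch in enumerate("0123456789+-*/().= ")}
        self.vocab["<PAD>"] = len(self.vocab)
        self.vocab["<MASK>"] = len(self.vocab)
        self.reverse_vocab = {i: ch for ch, i in self.vocab.items()}

    def encode(self, text):
        # Characters outside the vocabulary fall back to <PAD>.
        return [self.vocab.get(ch, self.vocab["<PAD>"]) for ch in text]

    def decode(self, indices):
        pad = self.vocab["<PAD>"]
        return "".join(self.reverse_vocab[i] for i in indices
                       if i in self.reverse_vocab and i != pad)

tok = ArithmeticTokenizer()

# A pure arithmetic string survives the round trip unchanged.
ids = tok.encode("(2 + 3)")
round_trip = tok.decode(ids)
print(ids, repr(round_trip))

# Letters and "?" are not in the vocabulary, so they become <PAD> and vanish on decode.
lossy = tok.decode(tok.encode("What is 7?"))
print(repr(lossy))
```

Because unknown characters collapse to `<PAD>`, prompt words such as "Calculate" or "What is" carry no information after encoding; only digits, operators, and punctuation from the vocabulary are preserved.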


In [3]:
# Define a tokenizer
class ArithmeticTokenizer:
    def __init__(self):
        self.vocab = {char: idx for idx, char in enumerate("0123456789+-*/().= ")}
        self.vocab["<PAD>"] = len(self.vocab)
        self.vocab["<MASK>"] = len(self.vocab)
        self.reverse_vocab = {idx: char for char, idx in self.vocab.items()}

    def encode(self, text):
        return [self.vocab.get(char, self.vocab["<PAD>"]) for char in text]

    def decode(self, indices):
        return "".join([self.reverse_vocab[idx] for idx in indices if idx in self.reverse_vocab and idx != self.vocab["<PAD>"]])

# Instantiate tokenizer
tokenizer = ArithmeticTokenizer()

print("Tokenizer Initialized")


Tokenizer Initialized


## Synthetic Arithmetic Expression Generation

### Overview:
- **Objective**: Create a dataset of synthetic arithmetic expressions for pre-training.

### Key Features:
1. **Operators**: `+`, `-`, `*`, `/`.
2. **Expression Count**: Generated `50,000` expressions.
3. **Value Range**: Operands range between `1` and `100`.

### Methods:
1. **`generate_expression(max_depth)`**:
   - Recursively builds random arithmetic expressions with adjustable complexity (`max_depth`).
   - Base case generates a single random number.
2. **`generate_sample_expression()`**:
   - Wraps expressions in natural-language prompts (e.g., "Calculate (2 + 3).").

### Dataset:
- **Size**: 50,000 expressions.
- **Sample Output**:
  - Example prompts printed for verification.
- **Storage**: Saved expressions to `pretraining_data.txt`.
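
To make the effect of `max_depth` concrete, the sketch below rebuilds the recursive generator and counts operands per depth. The seeded `random.Random` instance and the helper `count_operands` are assumptions added for illustration; the notebook itself uses the unseeded module-level `random`.

```python
import random

OPERATORS = ["+", "-", "*", "/"]

def generate_expression(max_depth, rng):
    """Build a fully parenthesised random expression of the given depth."""
    if max_depth == 0:
        return str(rng.randint(1, 100))
    left = generate_expression(max_depth - 1, rng)
    right = generate_expression(max_depth - 1, rng)
    return f"({left} {rng.choice(OPERATORS)} {right})"

def count_operands(expr):
    """Count integer operands by treating every non-digit as a separator."""
    cleaned = "".join(ch if ch.isdigit() else " " for ch in expr)
    return len(cleaned.split())

rng = random.Random(0)  # seeded generator (assumption; makes the sketch repeatable)
for depth in (0, 1, 2, 3):
    expr = generate_expression(depth, rng)
    print(depth, count_operands(expr), expr)
```

Each extra level doubles the number of operands, so depth `d` yields `2**d` numbers joined by `2**d - 1` operators. With depths drawn from `{1, 2}`, every generated prompt contains either two or four operands.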


In [4]:
# Generate synthetic arithmetic expressions
operators = ['+', '-', '*', '/']
num_expressions = 50000  # Number of expressions to generate
min_value, max_value = 1, 100  # Range for numbers

def generate_expression(max_depth=2):
    """Generate random arithmetic expressions with adjustable complexity."""
    if max_depth == 0:
        # Base case: return a single number
        return str(random.randint(min_value, max_value))
    else:
        left = generate_expression(max_depth - 1)
        right = generate_expression(max_depth - 1)
        operator = random.choice(operators)
        return f"({left} {operator} {right})"

def generate_sample_expression():
    """Generate expressions with sample-like formatting."""
    # Avoid generating single-number queries
    expression = generate_expression(max_depth=random.randint(1, 2))
    
    # Use prompts only for valid arithmetic expressions
    prompt_templates = [
        "Calculate {}.",
        "Evaluate {}.",
        "What is {}?"
    ]
    prompt = random.choice(prompt_templates)
    return prompt.format(expression)

# Generate dataset
pretraining_data = [generate_sample_expression() for _ in range(num_expressions)]
print(f"Generated {len(pretraining_data)} expressions for pre-training.")

# Print a few samples for verification
print("Sample expressions:")
for i in range(4):
    print(pretraining_data[i])

# Save dataset to a file for reference
with open("pretraining_data.txt", "w") as f:
    for expr in pretraining_data:
        f.write(expr + "\n")


Generated 50000 expressions for pre-training.
Sample expressions:
Evaluate (79 - 54).
Calculate ((11 / 42) * (100 - 28)).
What is ((3 + 46) + (80 / 81))?
What is (71 - 28)?


## Transformer Model for Arithmetic Expressions

### Components:
1. **`PositionalEncoding`**:
   - Adds learnable positional encoding to embeddings.
   - Ensures model captures positional relationships in input sequences.

2. **`initialize_weights`**:
   - Initializes model weights:
     - `nn.Linear` and `nn.Embedding`: Xavier uniform initialization.
     - `nn.LayerNorm`: Bias to `0`, weight to `1`.

3. **`TransformerArithmeticModel`**:
   - **Embedding Layer**: Maps token indices to dense vectors (`d_model` size).
   - **Positional Encoding**: Adds positional information to embeddings.
   - **Transformer**: Core transformer architecture with encoder-decoder layers.
   - **Output Layer**: Projects transformer outputs to vocabulary size.
   - **Methods**:
     - `forward`: Processes source and target sequences for sequence-to-sequence training.
     - `pretrain_forward`: Processes source sequences for pretraining tasks.

### Initialization:
- Model instantiated with vocabulary size matching the custom tokenizer.
- Weights initialized using `initialize_weights`.
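
The `forward` method described above passes embeddings to `nn.Transformer` without any masks. As a hedged sketch (not the notebook's implementation), the snippet below shows two properties of `nn.Transformer` that matter for sequence-to-sequence training: source and target lengths may differ, and a causal `tgt_mask` prevents a decoder position from attending to later positions. The small layer sizes are hypothetical values chosen only to keep the check fast.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16  # small hypothetical size for a quick check

transformer = nn.Transformer(
    d_model=d_model, nhead=2, num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32, dropout=0.0, batch_first=True,
)
transformer.eval()

src = torch.randn(2, 7, d_model)   # (batch, source length, d_model)
tgt = torch.randn(2, 4, d_model)   # (batch, target length, d_model)

# Additive mask: -inf above the diagonal blocks attention to later positions.
tgt_len = tgt.size(1)
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

with torch.no_grad():
    out = transformer(src, tgt, tgt_mask=causal_mask)

    # Replace only the last target position and run again.
    tgt_changed = tgt.clone()
    tgt_changed[:, -1, :] = torch.randn(2, d_model)
    out_changed = transformer(src, tgt_changed, tgt_mask=causal_mask)

print(out.shape)
# With the causal mask, outputs at earlier positions ignore the changed last token.
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5))
```

Without `tgt_mask`, each decoder position can attend to every token in the decoder input, including tokens it is later asked to predict under a shifted-target loss.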


In [5]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.encoding = nn.Parameter(torch.zeros(1, max_len, d_model), requires_grad=True)

    def forward(self, x):
        return x + self.encoding[:, :x.size(1), :]

def initialize_weights(model):
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.xavier_uniform_(module.weight)
        elif isinstance(module, nn.LayerNorm):
            nn.init.constant_(module.bias, 0)
            nn.init.constant_(module.weight, 1.0)

class TransformerArithmeticModel(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=6, dropout=0.1):
        super(TransformerArithmeticModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)
        self.transformer = nn.Transformer(
            d_model, nhead, num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dropout=dropout, batch_first=True, norm_first=True
        )
        self.fc_out = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, tgt):
        """
        Handles source and target sequences for training.
        """
        # Embed and encode source and target
        src_emb = self.dropout(self.positional_encoding(self.embedding(src)))
        tgt_emb = self.dropout(self.positional_encoding(self.embedding(tgt)))

        # Pass through transformer
        transformer_output = self.transformer(src_emb, tgt_emb)

        # Project to vocabulary size
        return self.fc_out(transformer_output)

    def pretrain_forward(self, src):
        src_emb = self.dropout(self.positional_encoding(self.embedding(src)))
        output = self.transformer(src_emb, src_emb)
        return self.fc_out(output)
    

model = TransformerArithmeticModel(vocab_size=len(tokenizer.vocab)).to(device)
initialize_weights(model)
print("Transformer Model Initialized")



Transformer Model Initialized


## Pretraining Custom Transformer Model on Arithmetic Expressions

### Steps:

1. **Dataset Preparation**:
   - `PreTrainingDataset`:
     - Converts arithmetic expressions into tokenized sequences.
     - Handles batching and padding using `DataLoader` with `<PAD>` token.
   - Dataset size: 50,000 expressions.

2. **Token Masking**:
   - **`mask_tokens`**:
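     - Replaces token positions with `<MASK>` with probability 15%; each selected position is masked across the whole batch.
     - Ensures every sequence has at least one labeled position, so the loss always has a target.
     - Keeps the original token as the label at masked positions and sets all other labels to `-100`.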

3. **Model Pretraining**:
   - **Optimization**:
     - Uses `AdamW` optimizer with learning rate `5e-4` and gradient clipping.
   - **Loss Function**:
     - `CrossEntropyLoss` with `ignore_index=-100`, so only masked positions contribute to the loss.
   - **Training**:
     - Pretrained for 5 epochs, skipping any batch with `NaN`/`Inf` values in the inputs, model outputs, or loss.

4. **Loss Visualization**:
   - Plots pretraining loss across epochs.

### Outputs:
- **Console**: Logs average loss per epoch.
- **Visualization**: A plot of pretraining loss saved and optionally displayed.
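
The masking step can also be applied per token instead of per batch column. The following is a plain-Python sketch on lists, not the notebook's tensor code; the function name `mask_sequence`, its `pad_id` argument, and the rule of never masking padding are assumptions made for illustration.

```python
import random

IGNORE_INDEX = -100  # label value skipped by CrossEntropyLoss(ignore_index=-100)

def mask_sequence(token_ids, mask_id, pad_id, mask_prob=0.15, rng=random):
    """Return (masked_ids, labels) for one sequence, masking each token independently."""
    masked = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)

    # Padding never becomes a prediction target.
    candidates = [i for i, t in enumerate(token_ids) if t != pad_id]
    chosen = [i for i in candidates if rng.random() < mask_prob]
    if not chosen and candidates:
        # Guarantee at least one prediction target per sequence.
        chosen = [rng.choice(candidates)]

    for i in chosen:
        labels[i] = token_ids[i]   # the model must recover the original token
        masked[i] = mask_id        # the input shows <MASK> instead
    return masked, labels

rng = random.Random(0)
ids = [14, 2, 18, 10, 18, 3, 15, 19, 19]   # "(2 + 3)" followed by two <PAD> (id 19)
masked, labels = mask_sequence(ids, mask_id=20, pad_id=19, rng=rng)
print(masked)
print(labels)
```

Every position ends up in exactly one of two states: unchanged input with label `-100`, or `<MASK>` input with the original token as its label.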


In [6]:
class PreTrainingDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = [torch.tensor(tokenizer.encode(expr)) for expr in data]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        seq = self.data[idx]
        return seq

dataset = PreTrainingDataset(pretraining_data, tokenizer)
pretrain_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=lambda x: pad_sequence(x, batch_first=True, padding_value=tokenizer.vocab["<PAD>"]))
print("Dataset and DataLoader Initialized")



def mask_tokens(input_seq, mask_token, mask_prob=0.15):
    """Masks a percentage of tokens in the input sequence for pre-training."""
    masked_seq = input_seq.clone()
    labels = input_seq.clone()

    # Mask tokens with the given probability
    for i in range(input_seq.size(1)):
        if random.random() < mask_prob:
            masked_seq[:, i] = mask_token
        else:
            labels[:, i] = -100  # Ignore index for loss

    # Ensure at least one masked (labeled) token per sequence
    for i in range(masked_seq.size(0)):
        if torch.all(labels[i] == -100):
            random_idx = random.randint(0, masked_seq.size(1) - 1)
            labels[i, random_idx] = input_seq[i, random_idx]
            masked_seq[i, random_idx] = mask_token

    return masked_seq, labels

def pretrain_model(model, dataloader, tokenizer, epochs=5, lr=5e-4):
    optimizer = AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=-100)
    losses = []

    for epoch in range(epochs):
        model.train()
        total_loss = 0

        for batch_idx, batch in enumerate(dataloader):
            try:
                # Move batch to device
                batch = batch.to(device)

                # Mask tokens (the outputs are created on the same device as `batch`)
                masked_seq, labels = mask_tokens(batch, mask_token=tokenizer.vocab["<MASK>"])

                # Validate inputs
                if torch.isnan(masked_seq).any() or torch.isnan(labels).any():
                    print(f"NaN detected in masked_seq or labels at batch {batch_idx}. Skipping batch.")
                    continue

                # Forward pass
                optimizer.zero_grad()
                output = model.pretrain_forward(masked_seq)
                
                # Validate outputs
                if torch.isnan(output).any() or torch.isinf(output).any():
                    print(f"NaN or Inf detected in model output at batch {batch_idx}. Skipping batch.")
                    continue

                # Compute loss
                loss = criterion(output.view(-1, output.size(-1)), labels.view(-1))
                if torch.isnan(loss) or torch.isinf(loss):
                    print(f"NaN or Inf detected in loss at batch {batch_idx}. Skipping batch.")
                    continue

                # Backward pass and optimization
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping
                optimizer.step()

                total_loss += loss.item()

            except Exception as e:
                print(f"Error in batch {batch_idx}: {e}")
                continue

        avg_loss = total_loss / len(dataloader)
        losses.append(avg_loss)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

    import matplotlib.pyplot as plt  

    # Enhanced plot appearance
    plt.figure(figsize=(12, 8))
    plt.plot(
        range(1, len(losses) + 1), 
        losses, 
        marker='o', 
        linestyle='-', 
        linewidth=2, 
        markersize=10,
        label='Loss'
    )

    # Modern aesthetics
    plt.title("Pretraining Loss", fontsize=20, fontweight='bold', loc='center')
    plt.xlabel("Epochs", fontsize=16)
    plt.ylabel("Loss", fontsize=16)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.legend(fontsize=14, loc='upper right')
    plt.grid(which='both', linestyle='--', linewidth=0.7, alpha=0.7)
    plt.tight_layout()  # Ensure layout fits well

    # Save the plot
    output_path = "/home/lxp334/LLM_Final_Report/visualizations/ArithmeticLLM_Pretraining_Loss_Visualization.png"
    plt.savefig(output_path, dpi=300)

    # Show the plot for immediate feedback (optional)
    plt.show()

    print(f"Pretraining Loss Saved to {output_path}")


pretrain_model(model, pretrain_loader, tokenizer, epochs=5)




Dataset and DataLoader Initialized




Epoch 1/5, Loss: 0.9126
Epoch 2/5, Loss: 0.5239
Epoch 3/5, Loss: 0.4999
Epoch 4/5, Loss: 0.4853
Epoch 5/5, Loss: 0.4889
Pretraining Loss Saved to /home/lxp334/LLM_Final_Report/visualizations/ArithmeticLLM_Pretraining_Loss_Visualization.png




## Loading and Tokenizing Multiple Datasets

### Function: `load_multiple_datasets`
- **Purpose**: 
  - Reads arithmetic datasets from specified file paths.
  - Tokenizes arithmetic expressions and their results.

### Steps:
1. **File Reading**:
   - Reads each dataset line by line.
   - Pairs consecutive lines as `expression` (line `i`) and `result` (line `i+1`).
2. **Tokenization**:
   - Encodes both `expression` and `result` using the custom `ArithmeticTokenizer`.
3. **Dataset Combination**:
   - Combines all datasets into a single list of tokenized pairs.

### Outputs:
- Total number of tokenized samples logged.
- Debug: Prints the first encoded sample for verification.
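
The line-pairing step can be sketched independently of the tokenizer. The helper `read_pairs` and the toy file contents below are hypothetical; the sketch also skips a dangling final line, which is an added safeguard rather than something the description above specifies.

```python
import io

def read_pairs(handle):
    """Yield (expression, result) pairs from alternating lines, skipping a dangling last line."""
    lines = [line.strip() for line in handle]
    for i in range(0, len(lines) - 1, 2):
        yield lines[i], lines[i + 1]

# Toy file contents (hypothetical, not taken from arithmetic__mixed.txt).
sample = io.StringIO("1 + 2\n3\n8/4\n2\n5*5\n")
pairs = list(read_pairs(sample))
print(pairs)  # [('1 + 2', '3'), ('8/4', '2')]
```

Stopping the range at `len(lines) - 1` means a file with an odd number of lines drops its last expression instead of raising an `IndexError`.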



In [7]:
def load_multiple_datasets(file_paths, tokenizer):
    """Loads and tokenizes multiple datasets."""
    all_data = []
    for file_path in file_paths:
        print(f"Loading dataset from: {file_path}")
        with open(file_path, 'r') as f:
            lines = f.readlines()
            # Step through (expression, result) line pairs; a dangling last line is skipped
            for i in range(0, len(lines) - 1, 2):
                expression = lines[i].strip()
                result = lines[i + 1].strip()
                all_data.append((tokenizer.encode(expression), tokenizer.encode(result)))
    print(f"Total samples from all datasets: {len(all_data)}")
    return all_data

# File paths for datasets
file_paths = [
    "/home/lxp334/LLM_Final_Report/arithmetic__mixed.txt",
]

# Load and combine datasets
combined_data = load_multiple_datasets(file_paths, tokenizer)

# Debug: Print dataset stats and a sample
print(f"Total Samples: {len(combined_data)}")
print(f"Sample Data (Encoded): {combined_data[:1]}")

Loading dataset from: /home/lxp334/LLM_Final_Report/arithmetic__mixed.txt
Total samples from all datasets: 666666
Total Samples: 666666
Sample Data (Encoded): [([1, 5, 18, 10, 18, 14, 7, 18, 10, 18, 11, 1, 7, 15, 13, 1], [5])]


## Splitting Dataset into Training, Validation, and Test Sets

### Process:
1. **Dataset Splitting**:
   - **Training Set**: 72% of the dataset.
   - **Validation Set**: 8% of the dataset.
   - **Test Set**: 20% of the dataset.
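
The percentages follow from applying two splits in sequence: 20% is held out first, then 10% of the remaining 80% becomes the validation set (0.8 × 0.1 = 8%). The sketch below works through the arithmetic for the 666,666 loaded samples, assuming the held-out share is rounded up; `split_sizes` is an illustrative helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of a fractional hold-out split, with the held-out part rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 666_666
n_train_val, n_test = split_sizes(n_total, 0.2)     # first split: 80% / 20%
n_train, n_val = split_sizes(n_train_val, 0.1)      # second split: 90% / 10% of the 80%
print(n_train, n_val, n_test)  # 479998 53334 133334
print(round(n_train / n_total, 2), round(n_val / n_total, 2), round(n_test / n_total, 2))
```

Under that rounding assumption the computed sizes agree with the split sizes printed by the notebook.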


In [8]:
# Split the dataset into training, validation, and test sets
train_data, test_data = train_test_split(combined_data, test_size=0.2, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=42)

# Debug: Print split sizes
print(f"Train Samples: {len(train_data)}")
print(f"Validation Samples: {len(val_data)}")
print(f"Test Samples: {len(test_data)}")


Train Samples: 479998
Validation Samples: 53334
Test Samples: 133334


## Fine-Tuning Dataset and DataLoaders

### Components:
1. **`FineTuningDataset` Class**:
   - Wraps the dataset (`train_data`, `val_data`, `test_data`) for use with PyTorch.
   - Implements `__len__` and `__getitem__`.

2. **`collate_fn`**:
   - Pads input expressions and results to the same length within a batch.
   - Uses the `<PAD>` token from the custom tokenizer.

3. **DataLoaders**:
   - Created for training, validation, and testing:
     - **Training DataLoader**: Shuffles batches for randomness.
     - **Validation and Test DataLoaders**: No shuffling.
   - Batch size set to `32`.
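
What the padding in `collate_fn` amounts to can be shown without PyTorch. This is a minimal list-based sketch; `pad_batch` is a hypothetical stand-in for `pad_sequence(..., batch_first=True)`, and the token ids follow the tokenizer's character order.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one in the batch."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (width - len(seq)) for seq in sequences]

PAD_ID = 19  # index of <PAD> in the tokenizer's 21-entry vocabulary

# Encoded results "5", "-2/7", and "1/10".
results = [[5], [11, 2, 13, 7], [1, 13, 1, 0]]
padded = pad_batch(results, PAD_ID)
print(padded)  # [[5, 19, 19, 19], [11, 2, 13, 7], [1, 13, 1, 0]]
```

Expressions and results are padded separately, which is why the two tensors in a batch have different widths.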


In [9]:
from torch.nn.utils.rnn import pad_sequence

# Fine-tuning dataset
class FineTuningDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Collate function for padding
def collate_fn(batch):
    expressions, results = zip(*batch)
    padded_expressions = pad_sequence([torch.tensor(expr) for expr in expressions], batch_first=True, padding_value=tokenizer.vocab["<PAD>"])
    padded_results = pad_sequence([torch.tensor(res) for res in results], batch_first=True, padding_value=tokenizer.vocab["<PAD>"])
    return padded_expressions, padded_results

# Create DataLoaders
batch_size = 32
train_loader = DataLoader(FineTuningDataset(train_data), batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
val_loader = DataLoader(FineTuningDataset(val_data), batch_size=batch_size, collate_fn=collate_fn)
test_loader = DataLoader(FineTuningDataset(test_data), batch_size=batch_size, collate_fn=collate_fn)

# Debug: Check a batch
for src, tgt in train_loader:
    print("Batch Shapes - Expressions:", src.shape, "Results:", tgt.shape)
    break
    
print("FineTuner Initialized")

Batch Shapes - Expressions: torch.Size([32, 51]) Results: torch.Size([32, 5])
FineTuner Initialized


## Fine-Tuning the Transformer Model

### Training and Validation:
1. **Function**: `train_finetune_model`
   - Handles training and validation of the model with early stopping.
   - Tracks training and validation losses for visualization.

2. **Key Features**:
   - **Dynamic Sequence Alignment**: Truncates each source batch (`src`) to the padded length of the target batch (`tgt`).
   - **Criterion**: Uses `CrossEntropyLoss` with `<PAD>` token ignored.
   - **Optimizer**: AdamW optimizer with a learning rate of `5e-4`.
   - **Scheduler**: Linear learning rate scheduler with `100` warm-up steps, stepped once per epoch.
   - **Early Stopping**: Stops training if validation loss does not improve for `patience` (default `5`) consecutive epochs.

3. **Output**:
   - Logs epoch number, training loss, validation loss, and epoch duration.
   - Returns training and validation losses for visualization.
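
The shape of a linear schedule with warm-up can be sketched in a few lines. The function below is written from the usual definition (linear ramp from 0 to the base rate, then linear decay to 0) and is an assumption about the scheduler's behaviour, not code copied from the library; `total_steps = 1000` is a hypothetical value.

```python
def linear_schedule_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given scheduler step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 5e-4
warmup_steps = 100
total_steps = 1000   # hypothetical; the notebook passes len(train_loader) * 50

for step in (0, 1, 5, 100, 550, 1000):
    lr = base_lr * linear_schedule_factor(step, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Under this definition the multiplier is 0 before the first scheduler step and reaches 1 only after 100 steps. A scheduler that advances once per epoch would therefore still be in the warm-up range after five epochs.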

### Visualization:
1. **Function**: `plot_finetuning_loss`
   - Plots training and validation loss trends over epochs.

### Execution:
- Trains the model using `train_loader` and validates with `val_loader`.
- Plots and saves the loss graph for analysis.
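
The patience rule can be isolated from the training loop to see when it would fire. This is a minimal sketch; `epochs_until_stop` is a hypothetical helper, and the `plateau` list is made-up data used only to show the rule triggering.

```python
def epochs_until_stop(val_losses, patience):
    """Return the 1-based epoch at which patience-based early stopping halts, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0      # improvement resets the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# Validation losses logged by the fine-tuning run in this notebook.
logged = [8.0693, 0.2969, 0.1168, 0.1097, 0.1086]
print(epochs_until_stop(logged, patience=5))    # None: the loss improved every epoch

# Hypothetical plateau to show the rule firing.
plateau = [1.0, 0.8, 0.81, 0.82, 0.80]
print(epochs_until_stop(plateau, patience=3))   # 5
```

With `epochs=5` and `patience=5`, the rule cannot trigger at all: the first epoch always counts as an improvement, leaving at most four epochs without one.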



In [10]:
import time

def train_finetune_model(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs=5, patience=5):
    train_losses, val_losses = [], []
    best_val_loss = float('inf')
    patience_counter = 0

    for epoch in range(epochs):
        epoch_start_time = time.time()  # Start timer for the epoch
        model.train()
        total_train_loss = 0

        for src, tgt in train_loader:
            src, tgt = src.to(device), tgt.to(device)

            # Truncate `src` to the padded target length; later source tokens are dropped
            max_tgt_len = tgt.size(1)  # Length of target sequences
            src = src[:, :max_tgt_len]

            optimizer.zero_grad()

            # Forward pass
            output = model(src, tgt[:, :-1])

            # Compute loss
            loss = criterion(output.reshape(-1, output.shape[-1]), tgt[:, 1:].reshape(-1))
            loss.backward()
            optimizer.step()
            total_train_loss += loss.item()

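        # The scheduler advances once per epoch here, while num_warmup_steps and
        # num_training_steps are counted in scheduler steps.
        scheduler.step()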

        train_loss = total_train_loss / len(train_loader)
        train_losses.append(train_loss)

        # Validation loop
        model.eval()
        total_val_loss = 0
        with torch.no_grad():
            for src, tgt in val_loader:
                src, tgt = src.to(device), tgt.to(device)

                # Align `src` length with `tgt` length
                src = src[:, :tgt.size(1)]

                output = model(src, tgt[:, :-1])
                loss = criterion(output.reshape(-1, output.shape[-1]), tgt[:, 1:].reshape(-1))
                total_val_loss += loss.item()

        val_loss = total_val_loss / len(val_loader)
        val_losses.append(val_loss)

        epoch_end_time = time.time()  # End timer for the epoch
        epoch_duration = epoch_end_time - epoch_start_time

        print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, Time: {epoch_duration:.2f} seconds")

        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Stopping early due to no improvement.")
                break

    return train_losses, val_losses


import matplotlib.pyplot as plt

def plot_finetuning_loss(train_losses, val_losses):
    """
    Plots the training and validation loss over epochs with a modern and polished look.
    """
    epochs = list(range(1, len(train_losses) + 1))
    
    # Modern style plot
    plt.figure(figsize=(12, 8))
    plt.plot(
        epochs, train_losses, label="Training Loss", 
        marker='o', linestyle='-', linewidth=2.5, markersize=10, color='#1f77b4'
    )
    plt.plot(
        epochs, val_losses, label="Validation Loss", 
        marker='s', linestyle='--', linewidth=2.5, markersize=10, color='#ff7f0e'
    )

    # Title and labels with enhanced styling
    plt.title("Fine-Tuning Loss Visualization", fontsize=20, fontweight='bold', loc='center')
    plt.xlabel("Epochs", fontsize=16)
    plt.ylabel("Loss", fontsize=16)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)

    # Add legend and grid
    plt.legend(fontsize=14, loc='upper right', frameon=True, shadow=True)
    plt.grid(which='both', linestyle='--', linewidth=0.7, alpha=0.7)

    # Adjust layout for better fit
    plt.tight_layout()

    # Save the plot
    output_path = "/home/lxp334/LLM_Final_Report/visualizations/ArithmeticLLM_Finetuning_Loss_Visualization.png"
    plt.savefig(output_path, dpi=300)

    # Optional: Show the plot (for debugging or immediate visualization)
    plt.show()

    print(f"Loss graph saved as {output_path}")




criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.vocab["<PAD>"])
optimizer = AdamW(model.parameters(), lr=5e-4)
scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=100, num_training_steps=len(train_loader) * 50)

train_losses, val_losses = train_finetune_model(model, train_loader, val_loader, criterion, optimizer, scheduler)

plot_finetuning_loss(train_losses, val_losses)




Epoch 1/5, Train Loss: 7.9884, Val Loss: 8.0693, Time: 282.21 seconds
Epoch 2/5, Train Loss: 1.2155, Val Loss: 0.2969, Time: 281.75 seconds
Epoch 3/5, Train Loss: 0.2137, Val Loss: 0.1168, Time: 281.20 seconds
Epoch 4/5, Train Loss: 0.1253, Val Loss: 0.1097, Time: 281.31 seconds
Epoch 5/5, Train Loss: 0.1129, Val Loss: 0.1086, Time: 281.41 seconds
Loss graph saved as /home/lxp334/LLM_Final_Report/visualizations/ArithmeticLLM_Finetuning_Loss_Visualization.png




## Model Evaluation

- **Function**: `evaluate_model`
  - Runs the model on test data with teacher forcing (the decoder input is `tgt[:, :-1]`), then decodes inputs, targets, and predictions.
  - Stores results as (`Expression`, `Expected`, `Predicted`).

- **Output**:
  - First 15 results printed for verification.
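
Printed samples are easier to interpret alongside a single summary number. The sketch below computes exact-match accuracy over result tuples of the form (`Expression`, `Expected`, `Predicted`); the function `exact_match` and the toy rows are hypothetical and are not results from this model.

```python
def exact_match(results):
    """Fraction of (expression, expected, predicted) rows where the prediction equals the target."""
    if not results:
        return 0.0
    hits = sum(1 for _, expected, predicted in results
               if expected.strip() == predicted.strip())
    return hits / len(results)

# Toy rows (hypothetical, for illustration only).
rows = [
    ("1 + 2", "3", "3"),
    ("8/4", "2", "2"),
    ("5*5", "25", "26"),
    ("7 - 9", "-2", "2"),
]
print(f"Exact match: {exact_match(rows):.2f}")  # Exact match: 0.50
```

Exact match is strict: a prediction that drops the sign or a single digit counts as wrong, which suits arithmetic where partially correct strings have no value.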


In [11]:
def evaluate_model(model, test_loader, tokenizer):
    model.eval()
    results = []
    with torch.no_grad():
        for src, tgt in test_loader:
            src, tgt = src.to(device), tgt.to(device)
            output = model(src, tgt[:, :-1])
            predictions = torch.argmax(output, dim=2)
            for i in range(src.size(0)):
                src_decoded = tokenizer.decode([t for t in src[i].tolist() if t != tokenizer.vocab["<PAD>"]])
                tgt_decoded = tokenizer.decode([t for t in tgt[i].tolist() if t != tokenizer.vocab["<PAD>"]])
                pred_decoded = tokenizer.decode([t for t in predictions[i].tolist() if t != tokenizer.vocab["<PAD>"]])
                results.append((src_decoded, tgt_decoded, pred_decoded))
    return results

test_results = evaluate_model(model, test_loader, tokenizer)
for src, tgt, pred in test_results[:15]:
    print(f"Expression: {src} | Expected: {tgt} | Predicted: {pred}")

Expression:      (-9)/12 + 0 - 38/(-8) | Expected: 4 | Predicted: 444
Expression:   500/(-40) - (-27)/6 | Expected: -8 | Predicted: 8/7
Expression:      1/(-5) + (-152)/190 | Expected: -1 | Predicted: 114
Expression:  (4/2)/(-3*(-42)/(-18)). | Expected: -2/7 | Predicted: 2/5
Expression:  (-12)/40*(102/18 - 5). | Expected: -1/5 | Predicted: 1/3
Expression:   6/(-120)*6 - (-2)/5 | Expected: 1/10 | Predicted: /11
Expression:  8*1/((-9)/((-18)/8)). | Expected: 2 | Predicted: /44
Expression:      (-12)/15 + (-357)/(-315) | Expected: 1/3 | Predicted: /31
Expression:  (288/320)/(6/(-10)) - 1. | Expected: -5/2 | Predicted: 5/3
Expression:      8/3*(18/(-12) - 0) | Expected: -4 | Predicted: 4/5
Expression:  5*3/9*-3. | Expected: -5 | Predicted: 5/5
Expression:      (2 + 2)/(-4 + (-40)/(-12)) | Expected: -6 | Predicted: 617
Expression:  ((-12)/(-27))/(134/603). | Expected: 2 | Predicted: /44
Expression:      (-21)/(-3) + (-46)/8 | Expected: 5/4 | Predicted: /45
Expression:   11 + -7 + 42/16 + -7

## Save Model

In [12]:
# Save the trained model
torch.save(model.state_dict(), "ArithmetiLLM_fineTuned_1DS.pth")
print("Model saved successfully!")

Model saved successfully!
