# Evaluation Pipeline Demo

This notebook demonstrates how to use the `evaluate_model_pipeline` function to:
1. Evaluate a baseline model
2. Fine-tune the model with LoRA
3. Evaluate the fine-tuned model
4. Save all results and adapters to an output directory

## Setup

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import warnings
# Silence all library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:
import sys
# Add the parent directory to the import path so the `src` package resolves
sys.path.append('../')

## Imports

In [None]:
import os
import random
from datasets import Dataset, DatasetDict
from transformers import set_seed

from src.preprocess.deer import DeerToTriplets
from src.preprocess.utils import to_text
from src.models.evaluate import evaluate_model_pipeline
from src.utils.io.read import RawDataReader
from src.settings import Settings
import src.config as cfg

## Configuration

In [None]:
# Set HuggingFace token
os.environ["HUGGINGFACE_HUB_TOKEN"] = cfg.HUGGINGFACE_HUB_TOKEN
HF_TOKEN = os.environ["HUGGINGFACE_HUB_TOKEN"]

# Model and training configuration
MODEL_ID = cfg.MODEL_ID
OUTPUT_DIR = cfg.OUTPUT_DIR
SEED = cfg.SEED
VAL_FRACTION = cfg.VAL_FRACTION
MAX_SEQ_LEN = cfg.MAX_SEQ_LEN

# LoRA Configuration
LORA_R = cfg.LORA_R
LORA_ALPHA = cfg.LORA_ALPHA
LORA_DROPOUT = cfg.LORA_DROPOUT
TARGET_MODULES = cfg.TARGET_MODULES

# Training Configuration
LR = cfg.LR
EPOCHS = cfg.EPOCHS
BATCH_SIZE = cfg.BATCH_SIZE
GRAD_ACCUM = cfg.GRAD_ACCUM
LOG_STEPS = cfg.LOG_STEPS
EVAL_STEPS = cfg.EVAL_STEPS
SAVE_STEPS = cfg.SAVE_STEPS
GEN_MAX_NEW_TOKENS = cfg.GEN_MAX_NEW_TOKENS

## Data Loading and Preprocessing

In [None]:
# Load raw data
rdr = RawDataReader(Settings.paths.RAW_DATA_PATH)
ir_triplets_dataset = rdr.read_ir_triplets()
deer_dataset = rdr.read_deer()

# Convert DEER to triplets
deer_to_triplets_converter = DeerToTriplets()
deer_to_triplets_converter.process(deer_dataset)
od_val_data = deer_to_triplets_converter.triplets

In [None]:
# Set seed for reproducibility
set_seed(SEED)

# Split data into train and in-distribution validation
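# Copy before shuffling: shuffle() works in place, so shuffling the original
# would mutate ir_triplets_dataset and change the split each time this cell is re-run
data = list(ir_triplets_dataset)
random.Random(SEED).shuffle(data)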
split_idx = int(len(data) * (1 - VAL_FRACTION))
train_raw, id_val_raw = data[:split_idx], data[split_idx:]
od_val_raw = od_val_data

print(f"Training examples: {len(train_raw)}")
print(f"In-distribution validation examples: {len(id_val_raw)}")
print(f"Out-of-distribution validation examples: {len(od_val_raw)}")
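
The seeded shuffle-and-split used above can also be packaged as a small helper. This is a minimal sketch, not part of `src`: the name `split_train_val` is hypothetical, and it assumes the raw data is a list-like sequence. It copies the input before shuffling, so the result depends only on the seed and the original order.

```python
import random


def split_train_val(items, val_fraction, seed):
    """Return (train, val) from a seeded shuffle of a copy of `items`.

    Hypothetical helper for illustration; it is not part of the project code.
    """
    if not 0.0 <= val_fraction <= 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(items)  # copy, so the caller's sequence keeps its order
    random.Random(seed).shuffle(shuffled)
    split_idx = int(len(shuffled) * (1 - val_fraction))
    return shuffled[:split_idx], shuffled[split_idx:]


# Toy data standing in for the triplets
toy = list(range(10))
train_part, val_part = split_train_val(toy, val_fraction=0.2, seed=42)
print(len(train_part), len(val_part))
```

Calling the helper twice with the same seed returns the same split, which is the property the notebook relies on for reproducibility.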

In [None]:
# Create datasets
train_ds = Dataset.from_list([to_text(x) for x in train_raw])
id_val_ds = Dataset.from_list([to_text(x) for x in id_val_raw])
id_dataset = DatasetDict({"train": train_ds, "validation": id_val_ds})

od_val_ds = Dataset.from_list([to_text(x) for x in od_val_raw])
od_dataset = DatasetDict({"train": train_ds, "validation": od_val_ds})

print("Datasets created successfully!")

## Run Complete Evaluation Pipeline

The `evaluate_model_pipeline` function will:
1. Load the baseline model and evaluate it
2. Fine-tune the model with LoRA
3. Evaluate the fine-tuned model
4. Save all results, predictions, and adapters to the output directory

In [None]:
results = evaluate_model_pipeline(
    model_id=MODEL_ID,
    id_dataset=id_dataset,
    od_dataset=od_dataset,
    output_dir=OUTPUT_DIR,
    lora_r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=TARGET_MODULES,
    batch_size=BATCH_SIZE,
    grad_accum=GRAD_ACCUM,
    learning_rate=LR,
    epochs=EPOCHS,
    max_seq_len=MAX_SEQ_LEN,
    log_steps=LOG_STEPS,
    save_steps=SAVE_STEPS,
    hf_token=HF_TOKEN,
    skip_finetuning=False  # Set to True to only run baseline evaluation
)

## Access Results

In [None]:
# Access individual metrics
print("\nIn-Distribution Baseline Metrics:")
print(results["id_baseline_metrics"])

print("\nIn-Distribution Fine-tuned Metrics:")
print(results["id_finetuned_metrics"])

if results["od_baseline_metrics"]:
    print("\nOut-of-Distribution Baseline Metrics:")
    print(results["od_baseline_metrics"])
    
    print("\nOut-of-Distribution Fine-tuned Metrics:")
    print(results["od_finetuned_metrics"])

print(f"\nAll results saved to: {results['output_dir']}")
print(f"Model adapters saved to: {results['adapter_dir']}")
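
To compare baseline and fine-tuned metrics side by side, a small helper along these lines may be useful. It is a sketch under an assumption: that each metrics entry is a flat dict mapping metric names to numbers. The actual structure returned by `evaluate_model_pipeline` is not shown in this notebook, and both the name `metric_deltas` and the metric names in the example are made up.

```python
def metric_deltas(baseline, finetuned):
    """Return finetuned - baseline for every numeric metric present in both dicts."""
    deltas = {}
    for name, base_value in baseline.items():
        tuned_value = finetuned.get(name)
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (base_value, tuned_value)
        )
        if both_numeric:
            deltas[name] = tuned_value - base_value
    return deltas


# Made-up numbers, only to show the call shape
example_baseline = {"accuracy": 0.5, "n_examples": 100, "note": "toy"}
example_finetuned = {"accuracy": 0.75, "n_examples": 100}
print(metric_deltas(example_baseline, example_finetuned))
```

If the assumption holds, `metric_deltas(results["id_baseline_metrics"], results["id_finetuned_metrics"])` would give the per-metric change on the in-distribution set.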

## Output Files

The pipeline creates the following files in the output directory:

- `id_val_predictions_baseline.jsonl` - Baseline predictions on in-distribution validation set
- `id_val_predictions_finetuned.jsonl` - Fine-tuned predictions on in-distribution validation set
- `od_val_predictions_baseline.jsonl` - Baseline predictions on out-of-distribution validation set
- `od_val_predictions_finetuned.jsonl` - Fine-tuned predictions on out-of-distribution validation set
- `metrics_summary.json` - Complete metrics summary with improvements
- `adapter/` - Directory containing the fine-tuned LoRA adapters and tokenizer
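
The files listed above can be read back with the standard library. The following is a minimal sketch: the file names come from the list, but the helper names `read_jsonl` and `load_pipeline_outputs` are hypothetical, and nothing is assumed about the fields inside each prediction record or the summary.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_pipeline_outputs(output_dir):
    """Load whichever of the listed output files exist under `output_dir`."""
    output_dir = Path(output_dir)
    outputs = {}
    for split in ("id", "od"):
        for stage in ("baseline", "finetuned"):
            path = output_dir / f"{split}_val_predictions_{stage}.jsonl"
            if path.exists():
                outputs[f"{split}_{stage}"] = read_jsonl(path)
    summary_path = output_dir / "metrics_summary.json"
    if summary_path.exists():
        outputs["metrics_summary"] = json.loads(
            summary_path.read_text(encoding="utf-8")
        )
    return outputs
```

For example, `load_pipeline_outputs(OUTPUT_DIR)` returns a dict keyed by `id_baseline`, `id_finetuned`, `od_baseline`, `od_finetuned`, and `metrics_summary`, with missing files simply omitted (as would happen for the fine-tuned files after a baseline-only run).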

## Optional: Baseline Evaluation Only

If you only want to evaluate the baseline model without fine-tuning:

In [None]:
# Uncomment to run baseline evaluation only
# baseline_results = evaluate_model_pipeline(
#     model_id=MODEL_ID,
#     id_dataset=id_dataset,
#     od_dataset=od_dataset,
#     output_dir="./baseline_eval_only",
#     hf_token=HF_TOKEN,
#     skip_finetuning=True
# )