# Model Evaluation and Analysis

This notebook evaluates the best trained model on the test set: it categorizes predictions by quality, renders example predictions, and breaks errors down by label.

## Overview
- **Purpose**: Evaluate the best trained transformer model on the test set
- **Analysis Types**: Prediction quality categories, per-label false positives and false negatives, edge cases
- **Visualization**: HTML tables comparing true and predicted labels for individual examples
- **Output**: `prediction_analysis_results.json` plus printed summaries


In [1]:
%pip install mlflow


In [2]:
%pip install databricks-sdk



## 1. Load Results and Setup


In [3]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
import warnings
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    confusion_matrix, multilabel_confusion_matrix, classification_report
)
from transformers import AutoTokenizer
import textwrap
import mlflow
import mlflow.pytorch
from IPython.display import display, HTML
from dotenv import load_dotenv

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Define label names
label_names = [
    'Regenerative & Eco-Tourism',
    'Integrated Wellness',
    'Immersive Culinary',
    'Off-the-Beaten-Path Adventure'
]

# Load training results
try:
    with open('training_results.json', 'r') as f:
        results = json.load(f)
    print("Training results loaded successfully")
    print(f"Best model: {results['best_model']['config']['model']}")
    print(f"Best F1-Score: {results['best_model']['metrics']['f1']:.4f}")
except FileNotFoundError:
    print("Warning: training_results.json not found. Please run the training notebook first.")
    results = None

print("Evaluation setup completed")


Training results loaded successfully
Best model: bert-base-uncased
Best F1-Score: 0.9250
Evaluation setup completed
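
The setup cell reads only two fields from `training_results.json`. As a minimal sketch, the snippet below shows the nesting those lookups imply and a tolerant way to read it. The sample document is reconstructed from the keys and printed values above (the real file may hold more fields), and `describe_best_model` is a hypothetical helper, not part of the training notebook.

```python
import json

# Minimal document with only the keys the setup cell reads (assumed shape;
# the real file may contain more fields).
sample = json.loads("""
{
  "best_model": {
    "config": {"model": "bert-base-uncased"},
    "metrics": {"f1": 0.925}
  }
}
""")

def describe_best_model(results):
    """Return a one-line summary, or a notice if the expected keys are missing."""
    try:
        best = results["best_model"]
        return f"{best['config']['model']}: F1={best['metrics']['f1']:.4f}"
    except (KeyError, TypeError):
        return "best_model entry missing or malformed"

summary_line = describe_best_model(sample)
missing_line = describe_best_model(None)  # e.g. when the file was not found
print(summary_line)
print(missing_line)
```

Handling the missing case in one place avoids a `TypeError` later if `results` was set to `None`.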


In [4]:
# Load environment variables
load_dotenv()

# Google Colab integration (if running in Colab)
try:
    from google.colab import drive, userdata
    drive.mount('/content/drive')
    DATABRICKS_HOST = userdata.get("DATABRICKS_HOST")
    DATABRICKS_TOKEN = userdata.get("DATABRICKS_TOKEN")
    print("Google Colab environment detected")
except ImportError:
    # Local environment
    DATABRICKS_HOST = os.getenv("DATABRICKS_HOST")
    DATABRICKS_TOKEN = os.getenv("DATABRICKS_TOKEN")
    print("Local environment detected")

if DATABRICKS_HOST and DATABRICKS_TOKEN:
    os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
    os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN


Mounted at /content/drive
Google Colab environment detected
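
The cell above continues silently when either credential is missing. If failing early is preferable, a stricter check could look like the sketch below. `require_env` is a hypothetical helper, and the demo uses a plain dictionary with placeholder values rather than the real environment.

```python
import os

def require_env(names, env=None):
    """Return the requested variables, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Demo with a stand-in mapping (placeholder values, not real credentials)
demo_env = {"DATABRICKS_HOST": "https://example.invalid", "DATABRICKS_TOKEN": ""}
try:
    require_env(["DATABRICKS_HOST", "DATABRICKS_TOKEN"], env=demo_env)
    status = "ok"
except RuntimeError as exc:
    status = str(exc)
print(status)
```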


## 2. Prediction Visualization Functions


In [11]:
def analyze_prediction_quality(y_true, y_pred, label_names, model_name="Model"):
    """Analyze prediction quality and categorize examples."""
    results = {
        'perfect_correct': [],
        'partially_correct': [],
        'completely_wrong': [],
        'false_positives': [],
        'false_negatives': [],
        'edge_cases': []
    }

    for i in range(len(y_true)):
        true_labels = y_true[i]
        pred_labels = y_pred[i]

        correct_labels = sum(pred_labels[j] == true_labels[j] for j in range(len(pred_labels)))
        total_labels = len(pred_labels)

        if correct_labels == total_labels:
            # Note: exact matches with no positive labels are not recorded in any category
            if sum(true_labels) > 0:
                results['perfect_correct'].append(i)
        elif correct_labels == 0:
            results['completely_wrong'].append(i)
        else:
            results['partially_correct'].append(i)

        for j in range(len(pred_labels)):
            if pred_labels[j] == 1 and true_labels[j] == 0:
                results['false_positives'].append((i, j, label_names[j]))
            elif pred_labels[j] == 0 and true_labels[j] == 1:
                results['false_negatives'].append((i, j, label_names[j]))

        if sum(pred_labels) == 0 and sum(true_labels) > 0:
            results['edge_cases'].append((i, "Missed all positive labels"))
        elif sum(pred_labels) == total_labels and sum(true_labels) < total_labels:
            results['edge_cases'].append((i, "Over-predicted all labels"))

    return results

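def display_prediction_example(idx, y_true, y_pred, label_names, input_text="", model_name="Model", analysis_type="Example"):
    """Display a single prediction example with detailed analysis."""
    from html import escape  # local import: the name `html` is reused for the markup string below

    true_labels = y_true[idx]
    pred_labels = y_pred[idx]
    # Escape interpolated text so characters such as & and < render literally
    input_text = escape(input_text)
    label_names = [escape(name) for name in label_names]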

    html = f"""
    <div style='background-color: #eef2f7; padding: 15px; border-radius: 8px; margin: 15px 0; font-family: sans-serif;'>
        <h3 style='color: #1e3a8a; margin-top: 0; border-bottom: 2px solid #bfdbfe; padding-bottom: 10px;'>{analysis_type} Prediction Example - {model_name}</h3>
        <p style='font-size: 1.1em; color: #374151;'><strong>Index:</strong> {idx}</p>
        <div style='margin-top: 15px; padding: 10px; background-color: #ffffff; border: 1px solid #ddd; border-radius: 5px;'>
            <p style='font-size: 1em; color: #1f2937;'><strong>Input Text:</strong></p>
            <p style='font-size: 1em; color: #374151;'>{input_text}</p>
        </div>
        <table style='width:100%; border-collapse: collapse; margin-top: 20px; box-shadow: 0 2px 5px rgba(0,0,0,0.1);'>
            <tr style='background-color: #1e3a8a; color: white; font-size: 1em;'>
                <th style='padding: 12px; text-align: left; border: 1px solid #ddd;'>Experiential Dimension</th>
                <th style='padding: 12px; text-align: center; border: 1px solid #ddd;'>True Label</th>
                <th style='padding: 12px; text-align: center; border: 1px solid #ddd;'>Predicted Label</th>
                <th style='padding: 12px; text-align: center; border: 1px solid #ddd;'>Correct?</th>
            </tr>"""

    for i, label_name in enumerate(label_names):
        true_val = true_labels[i]
        pred_val = pred_labels[i]
        is_correct = true_val == pred_val

        if is_correct:
            bg_color = "#d1fae5" # Green-ish for correct
            status_color = "#065f46" # Darker green
            status_text = "Correct"
        else:
            bg_color = "#fee2e2" # Red-ish for incorrect
            status_color = "#991b1b" # Darker red
            status_text = "Incorrect"

        true_text = "Yes" if true_val == 1 else "No"
        pred_text = "Yes" if pred_val == 1 else "No"

        html += f"""
            <tr style='background-color: {bg_color}; font-size: 0.95em;'>
                <td style='padding: 10px; border: 1px solid #ddd; font-weight: bold; color: #1f2937;'>{label_name}</td>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd; color: #1f2937;'>{true_text}</td>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd; color: #1f2937;'>{pred_text}</td>
                <td style='padding: 10px; text-align: center; border: 1px solid #ddd; color: {status_color}; font-weight: bold;'>{status_text}</td>
            </tr>"""

    html += """
        </table>
    </div>
    """

    return HTML(html)

print("Prediction visualization functions defined")

Prediction visualization functions defined
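
To make the categories concrete, here is a small self-contained sketch of the same per-example rule applied to toy label vectors. `categorize` is a hypothetical condensed version of the branching in `analyze_prediction_quality`, and the `"uncategorized"` name is introduced here only for illustration: an example with no positive labels that is predicted exactly lands in none of the three main categories.

```python
def categorize(true_labels, pred_labels):
    """Mirror the per-example branching of analyze_prediction_quality."""
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    if correct == len(true_labels):
        # Exact match counts as "perfect" only if at least one label is positive
        return "perfect_correct" if sum(true_labels) > 0 else "uncategorized"
    if correct == 0:
        return "completely_wrong"
    return "partially_correct"

toy_cases = [
    ([0, 0, 1, 1], [0, 0, 1, 1]),  # exact match with positives
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # one label missed
    ([1, 0, 1, 0], [0, 1, 0, 1]),  # every label flipped
    ([0, 0, 0, 0], [0, 0, 0, 0]),  # exact match, no positives
]
categories = [categorize(t, p) for t, p in toy_cases]
print(categories)
```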


## 3. Load Model Predictions


In [6]:
# Load test predictions from training notebook
# Note: This assumes the training notebook has been run and predictions are saved
import torch

try:
    # Load predictions from the best model
    best_predictions = np.load('test_predictions.npy')

    # Load the test dataset; its 'labels' entry is already a NumPy array
    test_data = torch.load('/content/drive/MyDrive/SerendipTravel/data/processed/test_dataset.pt', weights_only=False)
    actual_test_labels = test_data['labels']

    print(f"Loaded predictions: {best_predictions.shape}")
    print(f"Loaded actual test labels: {actual_test_labels.shape}")

except FileNotFoundError:
    print("Predictions or test dataset not found. Please run the training notebook first.")
    print("Creating dummy data for demonstration...")

    # Random labels stand in for real data. test_data is unavailable in this case,
    # so the text-decoding cell in section 4 cannot run.
    test_data = None
    n_samples = 100
    n_labels = len(label_names)
    best_predictions = np.random.randint(0, 2, size=(n_samples, n_labels))
    actual_test_labels = np.random.randint(0, 2, size=(n_samples, n_labels))

    print(f"Created dummy predictions: {best_predictions.shape}")
    print(f"Created dummy test labels: {actual_test_labels.shape}")

# Analyze prediction quality (runs for both real and dummy data,
# so later cells can always rely on `analysis` being defined)
analysis = analyze_prediction_quality(actual_test_labels, best_predictions, label_names, "Best Model")

print("Prediction Quality Analysis:")
print(f"Perfect Correct: {len(analysis['perfect_correct'])} examples")
print(f"Partially Correct: {len(analysis['partially_correct'])} examples")
print(f"Completely Wrong: {len(analysis['completely_wrong'])} examples")
print(f"False Positives: {len(analysis['false_positives'])} instances")
print(f"False Negatives: {len(analysis['false_negatives'])} instances")
print(f"Edge Cases: {len(analysis['edge_cases'])} examples")

Loaded predictions: (3232, 4)
Loaded actual test labels: (3232, 4)
Prediction Quality Analysis:
Perfect Correct: 1234 examples
Partially Correct: 452 examples
Completely Wrong: 0 examples
False Positives: 80 instances
False Negatives: 462 instances
Edge Cases: 197 examples
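
The false-positive and false-negative counts give a label-level error rate directly. As a worked check using only the numbers printed above, the fraction of individual label decisions that are wrong (the Hamming loss) can be computed like this:

```python
# Counts copied from the cell output above
n_examples, n_labels = 3232, 4
false_positives, false_negatives = 80, 462

label_errors = false_positives + false_negatives      # wrong label decisions
label_decisions = n_examples * n_labels               # total label decisions
hamming_loss = label_errors / label_decisions

print(f"{label_errors} of {label_decisions} label decisions are wrong "
      f"(Hamming loss = {hamming_loss:.4f})")
```

False negatives outnumber false positives by almost six to one (462 vs. 80), so the model errs toward missing labels rather than over-predicting them.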


## 4. Prediction Examples Analysis


In [13]:
# Display up to two prediction examples from each quality category
# Load the tokenizer once; assumes 'bert-base-uncased' was used for training
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def show_examples(category_key, analysis_type, max_examples=2):
    """Decode and display the first few examples of one category."""
    indices = analysis[category_key][:max_examples]
    if not indices:
        print(f"No {analysis_type.lower()} examples found")
        return
    for i, idx in enumerate(indices):
        print(f"\n--- Example {i+1} ---")
        # Decode the stored input_ids back to text
        input_ids = test_data['encodings']['input_ids'][idx]
        input_text = tokenizer.decode(input_ids, skip_special_tokens=True)

        display(display_prediction_example(
            idx, actual_test_labels, best_predictions, label_names,
            input_text=input_text,
            model_name="Best Model", analysis_type=analysis_type
        ))

print("Perfect Correct Predictions:")
show_examples('perfect_correct', "Perfect Correct")

print("\nPartially Correct Predictions:")
show_examples('partially_correct', "Partially Correct")

print("\nCompletely Wrong Predictions:")
show_examples('completely_wrong', "Completely Wrong")

Perfect Correct Predictions:

--- Example 1 ---


| Experiential Dimension | True Label | Predicted Label | Correct? |
| --- | --- | --- | --- |
| Regenerative & Eco-Tourism | No | No | Correct |
| Integrated Wellness | No | No | Correct |
| Immersive Culinary | Yes | Yes | Correct |
| Off-the-Beaten-Path Adventure | Yes | Yes | Correct |



--- Example 2 ---


| Experiential Dimension | True Label | Predicted Label | Correct? |
| --- | --- | --- | --- |
| Regenerative & Eco-Tourism | No | No | Correct |
| Integrated Wellness | No | No | Correct |
| Immersive Culinary | No | No | Correct |
| Off-the-Beaten-Path Adventure | Yes | Yes | Correct |



Partially Correct Predictions:

--- Example 1 ---


| Experiential Dimension | True Label | Predicted Label | Correct? |
| --- | --- | --- | --- |
| Regenerative & Eco-Tourism | Yes | Yes | Correct |
| Integrated Wellness | Yes | No | Incorrect |
| Immersive Culinary | No | No | Correct |
| Off-the-Beaten-Path Adventure | No | No | Correct |



--- Example 2 ---


| Experiential Dimension | True Label | Predicted Label | Correct? |
| --- | --- | --- | --- |
| Regenerative & Eco-Tourism | Yes | No | Incorrect |
| Integrated Wellness | Yes | Yes | Correct |
| Immersive Culinary | No | No | Correct |
| Off-the-Beaten-Path Adventure | No | No | Correct |



Completely Wrong Predictions:
No completely wrong examples found


## 5. Error Pattern Analysis


In [8]:
# Analyze error patterns
from collections import Counter

def analyze_error_patterns(analysis_results, model_name):
    """Print per-label false-positive, false-negative, and total error counts."""
    print(f"{model_name} Error Pattern Analysis:")
    print("-" * 40)

    false_positives = analysis_results['false_positives']
    false_negatives = analysis_results['false_negatives']

    # Each error is a tuple of (example index, label index, label name)
    fp_by_label = Counter(label_name for _, _, label_name in false_positives)
    fn_by_label = Counter(label_name for _, _, label_name in false_negatives)

    # False Positive Analysis
    if fp_by_label:
        print("False Positives by Label:")
        for label_name, count in fp_by_label.most_common():
            print(f"   {label_name}: {count} instances")

    # False Negative Analysis
    if fn_by_label:
        print("\nFalse Negatives by Label:")
        for label_name, count in fn_by_label.most_common():
            print(f"   {label_name}: {count} instances")

    # Most problematic labels: combined false positives and false negatives
    error_by_label = fp_by_label + fn_by_label
    if error_by_label:
        print("\nMost Problematic Labels (Total Errors):")
        for label_name, count in error_by_label.most_common():
            print(f"   {label_name}: {count} total errors")

# Analyze error patterns
analyze_error_patterns(analysis, "Best Model")


Best Model Error Pattern Analysis:
----------------------------------------
False Positives by Label:
   Regenerative & Eco-Tourism: 49 instances
   Off-the-Beaten-Path Adventure: 16 instances
   Immersive Culinary: 9 instances
   Integrated Wellness: 6 instances

False Negatives by Label:
   Regenerative & Eco-Tourism: 195 instances
   Integrated Wellness: 98 instances
   Immersive Culinary: 94 instances
   Off-the-Beaten-Path Adventure: 75 instances

Most Problematic Labels (Total Errors):
   Regenerative & Eco-Tourism: 244 total errors
   Integrated Wellness: 104 total errors
   Immersive Culinary: 103 total errors
   Off-the-Beaten-Path Adventure: 91 total errors
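
The per-label error counts above can also be arranged as one 2×2 confusion matrix per label. The sketch below does this in plain Python on toy data (not the notebook's predictions); `per_label_confusion` is a hypothetical helper. `sklearn.metrics.multilabel_confusion_matrix`, imported in the setup cell, reports the same four counts per label laid out as `[[TN, FP], [FN, TP]]`.

```python
def per_label_confusion(y_true, y_pred, label_names):
    """Count TN, FP, FN, TP separately for every label column."""
    counts = {name: {"TN": 0, "FP": 0, "FN": 0, "TP": 0} for name in label_names}
    for true_row, pred_row in zip(y_true, y_pred):
        for name, t, p in zip(label_names, true_row, pred_row):
            if t == 1 and p == 1:
                counts[name]["TP"] += 1
            elif t == 0 and p == 1:
                counts[name]["FP"] += 1
            elif t == 1 and p == 0:
                counts[name]["FN"] += 1
            else:
                counts[name]["TN"] += 1
    return counts

# Toy data: three examples, two labels
toy_labels = ["A", "B"]
toy_true = [[1, 0], [1, 1], [0, 0]]
toy_pred = [[1, 1], [0, 1], [0, 0]]
confusion = per_label_confusion(toy_true, toy_pred, toy_labels)
for name, c in confusion.items():
    print(name, c)
```

With true positives available per label, precision and recall follow directly, which the false-positive and false-negative counts alone cannot provide.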


## 6. Summary and Insights


In [9]:
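# Create summary statistics
def create_prediction_summary(analysis_results, model_name):
    """Summarize prediction quality; rates are relative to all test examples."""
    # Uses the global actual_test_labels loaded in section 3. Exact matches with no
    # positive labels are in no category, so the three rates need not sum to 100%.
    total_examples = len(actual_test_labels)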

    perfect_correct = len(analysis_results['perfect_correct'])
    partially_correct = len(analysis_results['partially_correct'])
    completely_wrong = len(analysis_results['completely_wrong'])

    perfect_rate = (perfect_correct / total_examples) * 100
    partial_rate = (partially_correct / total_examples) * 100
    wrong_rate = (completely_wrong / total_examples) * 100

    print(f"{model_name} Prediction Quality Summary:")
    print(f"   Total Test Examples: {total_examples}")
    print(f"   Perfect Correct: {perfect_correct} ({perfect_rate:.1f}%)")
    print(f"   Partially Correct: {partially_correct} ({partial_rate:.1f}%)")
    print(f"   Completely Wrong: {completely_wrong} ({wrong_rate:.1f}%)")

    return {
        'model': model_name,
        'total_examples': total_examples,
        'perfect_correct': perfect_correct,
        'partially_correct': partially_correct,
        'completely_wrong': completely_wrong,
        'perfect_rate': perfect_rate,
        'partial_rate': partial_rate,
        'wrong_rate': wrong_rate
    }

# Create summary
summary = create_prediction_summary(analysis, "Best Model")

# Save analysis results
analysis_results = {
    'analysis': {
        'perfect_correct': analysis['perfect_correct'][:10],
        'partially_correct': analysis['partially_correct'][:10],
        'completely_wrong': analysis['completely_wrong'][:10],
        'edge_cases': analysis['edge_cases'][:5]
    },
    'summary': summary
}

with open('prediction_analysis_results.json', 'w') as f:
    json.dump(analysis_results, f, indent=2)

print("\nAnalysis results saved to prediction_analysis_results.json")
print("Model evaluation completed successfully")

Best Model Prediction Quality Summary:
   Total Test Examples: 3232
   Perfect Correct: 1234 (38.2%)
   Partially Correct: 452 (14.0%)
   Completely Wrong: 0 (0.0%)

Analysis results saved to prediction_analysis_results.json
Model evaluation completed successfully
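
The three percentages above sum to 52.2%, not 100%. Given the branching in `analyze_prediction_quality`, an exact match is filed under `perfect_correct` only when the example has at least one positive label, so the remainder consists of exact matches with no positive labels. A short check from the printed numbers:

```python
# Counts copied from the summary output above
total = 3232
perfect_correct, partially_correct, completely_wrong = 1234, 452, 0

categorized = perfect_correct + partially_correct + completely_wrong
# Every uncategorized example matched exactly but had no positive labels
all_negative_exact = total - categorized
exact_match_rate = (perfect_correct + all_negative_exact) / total

print(f"Uncategorized exact matches: {all_negative_exact}")
print(f"Exact-match rate over all examples: {exact_match_rate:.1%}")
```

Read this way, the 38.2% "Perfect Correct" figure understates how often the model predicts every label of an example correctly.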


## 7. Final Summary


In [10]:
# Final evaluation summary
print("=" * 60)
print("MODEL EVALUATION SUMMARY")
print("=" * 60)

if 'analysis' in locals():
    print(f"\nPrediction Quality Analysis:")
    print(f"  Perfect Correct: {len(analysis['perfect_correct'])} examples")
    print(f"  Partially Correct: {len(analysis['partially_correct'])} examples")
    print(f"  Completely Wrong: {len(analysis['completely_wrong'])} examples")
    print(f"  False Positives: {len(analysis['false_positives'])} instances")
    print(f"  False Negatives: {len(analysis['false_negatives'])} instances")

print("\nOutput Files Generated:")
print("  - prediction_analysis_results.json: Complete analysis results")
print("  - Interactive prediction examples displayed")

print("\nAnalysis Features:")
print("  - Prediction quality categorization")
print("  - Error pattern analysis")
print("  - Interactive visualization")
print("  - Professional reporting")

print("\nModel evaluation completed successfully!")
print("=" * 60)


MODEL EVALUATION SUMMARY

Prediction Quality Analysis:
  Perfect Correct: 1234 examples
  Partially Correct: 452 examples
  Completely Wrong: 0 examples
  False Positives: 80 instances
  False Negatives: 462 instances

Output Files Generated:
  - prediction_analysis_results.json: Complete analysis results
  - Interactive prediction examples displayed

Analysis Features:
  - Prediction quality categorization
  - Error pattern analysis
  - Interactive visualization
  - Professional reporting

Model evaluation completed successfully!
