# Tobamovirus Classification Model: Results Analysis

This notebook analyzes the results of the machine learning pipeline for tobamovirus classification. The analysis covers four main components:

1. **Model Selection**: Comparison of different algorithms
2. **Model Evaluation**: Performance assessment using different contig prediction methods
3. **Final Model**: Feature importance analysis and model characteristics
4. **Gold Standard Analysis**: Evaluation on real data


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set style for publication-ready figures
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams.update({
    'font.size': 12,
    'axes.titlesize': 14,
    'axes.labelsize': 12,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'legend.fontsize': 10,
    'figure.titlesize': 16,
    'axes.spines.top': False,
    'axes.spines.right': False
})

# Define paths (adjusted for notebook location in notebooks folder)
base_dir = Path('/home/tobamo/analize/project-tobamo/analysis/model')
results_dir = base_dir / 'results'
figures_dir = base_dir / 'figures'
model_selection_dir = results_dir / 'model_selection_test_n5'
evaluation_dir = results_dir / 'evaluation_results'
final_model_dir = results_dir / 'final_model'

# Create figures directory (and any missing parent directories) if it doesn't exist
figures_dir.mkdir(parents=True, exist_ok=True)

## 1. RandomForest Hyperparameter Selection Analysis

The hyperparameter selection phase tuned RandomForest over 5 iterations of 5-fold cross-validation (25 folds in total). The two hyperparameters tuned were `n_estimators` (number of trees) and `max_depth` (maximum tree depth).
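
As a rough illustration of how a 5 x 5 scheme like this yields 25 per-fold records, the sketch below repeats a 5-fold split five times and runs a small grid search inside each fold. It assumes scikit-learn, since the training code is not shown in this notebook. The data are synthetic (`make_classification`), and the grid values and the inner `cv=3` are placeholders, not the settings actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data; the real feature matrix is not part of this notebook.
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

# Hypothetical, deliberately tiny grid so the sketch runs quickly.
param_grid = {"n_estimators": [5, 10], "max_depth": [3, None]}

records = []
for iteration in range(5):
    # A different shuffle per iteration gives 5 distinct 5-fold partitions.
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=iteration)
    for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
        # Tune on the training part of the fold only (inner 3-fold CV).
        search = GridSearchCV(
            RandomForestClassifier(random_state=iteration),
            param_grid,
            cv=3,
            scoring="accuracy",
        )
        search.fit(X[train_idx], y[train_idx])

        # Score the refitted best estimator on the held-out part of the fold.
        y_pred = search.predict(X[test_idx])
        records.append({
            "iteration": iteration,
            "fold": fold,
            "best_params": str(search.best_params_),
            "accuracy": accuracy_score(y[test_idx], y_pred),
        })

print(len(records))  # one record per (iteration, fold) pair
```

Each record carries the winning parameters for that fold as a string, which resembles the `best_params` column loaded from the `iter_{i}_performance_metrics.csv` files below.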

In [2]:
# Load model selection data
model_selection_data = []
for i in range(5):
    df = pd.read_csv(model_selection_dir / f'iter_{i}_performance_metrics.csv')
    model_selection_data.append(df)

model_selection_df = pd.concat(model_selection_data, ignore_index=True)
print("Model Selection Dataset Shape:", model_selection_df.shape)
print("\nModels tested:", model_selection_df['model'].unique())
print(f"\n{model_selection_df['iteration'].nunique()} Iterations x {model_selection_df['fold'].nunique()} Folds")

# Calculate model selection statistics
model_stats = model_selection_df.groupby('model').agg({
    'accuracy': ['mean', 'std', 'count'],
    'auc_roc': ['mean', 'std'],
    'f1_score': ['mean', 'std'],
    'precision': ['mean', 'std'],
    'recall': ['mean', 'std']
}).round(4)

print("Model Selection Statistics:")
print("=" * 50)
for model in model_stats.index:
    acc_mean = model_stats.loc[model, ('accuracy', 'mean')]
    acc_std = model_stats.loc[model, ('accuracy', 'std')]
    count = int(model_stats.loc[model, ('accuracy', 'count')])
    print(f"{model}: accuracy: {acc_mean:.4f} ± {acc_std:.4f} (n={count} folds)")
    
# Count how often each model was selected across all folds
selection_freq = model_selection_df['model'].value_counts()
n_folds = len(model_selection_df)
print("\nModel Selection Frequency:")
for model, freq in selection_freq.items():
    print(f"{model}: {freq}/{n_folds} folds ({freq / n_folds * 100:.1f}%)")

Model Selection Dataset Shape: (25, 12)

Models tested: ['SVM' 'RandomForest']

5 Iterations x 5 Folds
Model Selection Statistics:
RandomForest: accuracy: 0.7671 ± 0.0215 (n=22 folds)
SVM: accuracy: 0.7524 ± 0.0144 (n=3 folds)

Model Selection Frequency:
RandomForest: 22/25 folds (88.0%)
SVM: 3/25 folds (12.0%)
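
The next cell pulls `max_depth` and `n_estimators` out of the `best_params` strings with regular expressions. If those strings are plain Python dict representations, as the regex patterns suggest, `ast.literal_eval` from the standard library may be a more robust way to parse them, since it keeps `None` as `None` and handles any number of keys. The helper name `parse_best_params` and the example strings are made up for illustration.

```python
import ast


def parse_best_params(raw):
    """Parse a dict-repr string into a dict; return {} if it cannot be parsed."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Malformed strings or non-string inputs (e.g. NaN) end up here.
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Illustrative inputs, not taken from the real results files.
examples = [
    "{'max_depth': 50, 'n_estimators': 200}",
    "{'max_depth': None, 'n_estimators': 300}",
    "not a dict",
]

for raw in examples:
    params = parse_best_params(raw)
    print(params.get("max_depth"), params.get("n_estimators"))
```

On a DataFrame, the helper could be applied with `rf_df['best_params'].map(parse_best_params)`, which gives one dict per row.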


In [5]:
# Select RandomForest rows; .copy() avoids modifying a view of model_selection_df
rf_df = model_selection_df[model_selection_df['model'] == 'RandomForest'].copy()

# Extract hyperparameters from the best_params strings (expand=False returns a Series)
rf_df['max_depth'] = rf_df['best_params'].str.extract(r"'max_depth': ([^,}]+)", expand=False)
rf_df['n_estimators'] = rf_df['best_params'].str.extract(r"'n_estimators': ([^,}]+)", expand=False)

# Convert to numeric; non-numeric values such as None become NaN,
# and the groupby calls below drop rows with NaN keys by default
rf_df['max_depth'] = pd.to_numeric(rf_df['max_depth'], errors='coerce')
rf_df['n_estimators'] = pd.to_numeric(rf_df['n_estimators'], errors='coerce')

# Calculate hyperparameter performance statistics
hyperparam_stats = rf_df.groupby(['max_depth', 'n_estimators']).agg({
    'accuracy': ['mean', 'std', 'count'],
    'auc_roc': ['mean', 'std'],
    'f1_score': ['mean', 'std'],
    'precision': ['mean', 'std'],
    'recall': ['mean', 'std']
}).round(4)

print("RandomForest Hyperparameter Performance:")
print("=" * 90)
print(f"{'max_depth':<10} {'n_estimators':<12} {'Accuracy':<14} {'AUC':^14} {'F1 Score':^14} {'Count':>10}")
print("-" * 90)

for (max_depth, n_estimators), group in rf_df.groupby(['max_depth', 'n_estimators']):
    acc_mean = group['accuracy'].mean()
    acc_std = group['accuracy'].std()
    auc_mean = group['auc_roc'].mean()
    auc_std = group['auc_roc'].std()
    f1_mean = group['f1_score'].mean()
    f1_std = group['f1_score'].std()
    count = len(group)
    
    max_depth_str = str(int(max_depth))
    n_estimators_str = str(int(n_estimators))
    
    # Handle NaN standard deviations
    acc_std_str = f"{acc_std:.4f}" if not pd.isna(acc_std) else "nan"
    auc_std_str = f"{auc_std:.4f}" if not pd.isna(auc_std) else "nan"
    f1_std_str = f"{f1_std:.4f}" if not pd.isna(f1_std) else "nan"

    print(f"{max_depth_str:^10} {n_estimators_str:^12} {acc_mean:.4f} ± {acc_std_str:<7} {auc_mean:.4f} ± {auc_std_str:<7} {f1_mean:.4f} ± {f1_std_str:<7} {count:^5}")

# Method 1: Best single performance
best_single = rf_df.loc[rf_df['accuracy'].idxmax()]
print(f"\nBest single performance:")
print(f"  max_depth: {best_single['max_depth']}, n_estimators: {best_single['n_estimators']}")
print(f"  Accuracy: {best_single['accuracy']:.4f}, AUC: {best_single['auc_roc']:.4f}, F1: {best_single['f1_score']:.4f}")

# Method 2: Most robust combination (performance + frequency)
combo_stats = rf_df.groupby(['max_depth', 'n_estimators']).agg({
    'accuracy': ['mean', 'count'],
    'auc_roc': ['mean'],
    'f1_score': ['mean']
}).reset_index()

# Flatten column names
combo_stats.columns = ['max_depth', 'n_estimators', 'acc_mean', 'count', 'auc_mean', 'f1_mean']

# Manually chosen trade-off between performance and selection frequency:
# max_depth=50, n_estimators=200 (mean accuracy 0.7799, selected in 4 folds).
# The combination is hard-coded here rather than derived from combo_stats.
best_robust = combo_stats.loc[
    (combo_stats['max_depth'] == 50) & (combo_stats['n_estimators'] == 200)
].iloc[0]
print(f"\nBest robust combination (performance + frequency):")
print(f"  max_depth: {int(best_robust['max_depth'])}, n_estimators: {int(best_robust['n_estimators'])}")
print(f"  Mean accuracy: {best_robust['acc_mean']:.4f}, Mean AUC: {best_robust['auc_mean']:.4f}, Mean F1: {best_robust['f1_mean']:.4f}")
print(f"  Selected: {int(best_robust['count'])}/25 folds")

print(f"\nFINAL SELECTION:")
print(f"Selected hyperparameters: max_depth=50, n_estimators=200")
print(f"Rationale: Good accuracy ({best_robust['acc_mean']:.4f}) with consistent selection ({int(best_robust['count'])} times)")

print("=" * 90)

RandomForest Hyperparameter Performance:
max_depth  n_estimators Accuracy            AUC          F1 Score         Count
------------------------------------------------------------------------------------------
    40         150      0.7865 ± nan     0.6489 ± nan     0.8664 ± nan       1  
    40         200      0.7794 ± 0.0225  0.6412 ± 0.0031  0.8614 ± 0.0179    2  
    40         300      0.7459 ± nan     0.6400 ± nan     0.8340 ± nan       1  
    50         150      0.7901 ± 0.0014  0.6453 ± 0.0011  0.8698 ± 0.0012    2  
    50         200      0.7799 ± 0.0085  0.6391 ± 0.0061  0.8624 ± 0.0069    4  
    50         300      0.7552 ± 0.0250  0.6306 ± 0.0152  0.8433 ± 0.0186    5  

Best single performance:
  max_depth: 40.0, n_estimators: 200
  Accuracy: 0.7953, AUC: 0.6433, F1: 0.8741

Best robust combination (performance + frequency):
  max_depth: 50, n_estimators: 200
  Mean accuracy: 0.7799, Mean AUC: 0.6391, Mean F1: 0.8624
  Selected: 4/25 folds

FINAL SELECTION:
Selected