# Comparison of Different Taxonomic Classification Approaches

This notebook compares the performance and results of different taxonomic classification methods:

1. **Manual Curation** - Ground truth from expert manual annotation
2. **DIAMOND** - Traditional BLAST-like sequence alignment tool
3. **MEGAN6** - Metagenome analysis tool
4. **Palmprint Results** - Palmprint-based classification from palmscan hits
5. **Model Predictions** - Machine learning model predictions

## Objectives
- Compare precision, recall, and F1 score across methods
- Identify strengths and weaknesses of each approach
- Analyze computational efficiency and scalability
- Generate comprehensive performance metrics

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys
import os
from Bio import SeqIO  # Biopython; used below to parse the palmscan FASTA

# Add utils to path
sys.path.append('../scripts')
from utils import *

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

## Data Loading and Preprocessing

Load results from all classification approaches for comparison.

In [2]:
# Curated contig table; skiprows=[1] drops the row directly below the header
df = pd.read_excel('/home/tobamo/analize/project-tobamo/analysis/data/domain_sci_input/Tobamo - tabela za tobamo kontige - kategorije_2025-09-23.xlsx', header=0, skiprows=[1])

In [3]:
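# Define data paths
# os.makedirs() returns None, so keep the path itself and create the directory separately
results_dir = Path('results/comparison_study')
results_dir.mkdir(parents=True, exist_ok=True)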

ground_truth = pd.read_excel('/home/tobamo/analize/project-tobamo/analysis/data/domain_sci_input/ground_truth_final_added_categories.xlsx')
diamond_megan_results = pd.read_csv('/home/tobamo/analize/project-tobamo/results/megan6_results_combined_add_nr_taxa.csv')
model_predictions = pd.read_csv('/home/tobamo/analize/project-tobamo/analysis/model/results/snakemake/predictions/contig_predictions.csv')
palmprint = SeqIO.to_dict(SeqIO.parse('/home/tobamo/analize/project-tobamo/analysis/palmprint/results/palmscan/palmscan_pp_find0.fa','fasta'))

## Method 0: Manual Curation (Ground Truth)

Expert manual annotations serving as ground truth for comparison.

### Key Features:
- High accuracy from expert knowledge
- Time-intensive process
- Serves as benchmark for other methods

## Method 1: PALMPRINT

In [4]:
palmprint_positive_contigs = ['_'.join(k.split('_')[:-1]) for k in palmprint.keys()]

In [5]:
df['palmprint'] = np.where(df['contig_id'].isin(palmprint_positive_contigs), 1, 0)
tobamo_palmprints = df[(df['palmprint'] == 1) & (df['ground_truth_category'].isin(['tob1','tob2','tob3']))]

# Count curated tobamovirus contigs once instead of recomputing the filter inside the f-string
n_tobamo = df['ground_truth_category'].isin(['tob1','tob2','tob3']).sum()
print(f"{len(tobamo_palmprints)} actual tobamo out of {n_tobamo} assigned tobamo, which is ({len(tobamo_palmprints) / n_tobamo * 100:.1f}%) total assigned tobamo")

40 actual tobamo out of 228 assigned tobamo, which is (17.5%) total assigned tobamo
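
The palmprint cell above derives contig IDs by dropping the last underscore-separated field from each FASTA record ID. A small sketch of that step follows, assuming (the source does not confirm this) that each record ID is the contig ID plus one trailing suffix. The IDs are invented for illustration.

```python
def contig_id_from_record(record_id):
    """Drop the final underscore-separated field, as in the cell above."""
    return "_".join(record_id.split("_")[:-1])

# Hypothetical record IDs; two of them come from the same contig.
record_ids = [
    "SRR000001_contig_12_1",
    "SRR000001_contig_12_2",
    "SRR000002_contig_7_1",
]
contigs = [contig_id_from_record(r) for r in record_ids]
print(contigs)
print(len(contigs), len(set(contigs)))  # 3 2
```

If a contig can yield more than one record, the list holds duplicates, so counting unique contigs calls for `set(contigs)` rather than the list length.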


## Method 2: DIAMOND Classification

Traditional sequence alignment-based classification using DIAMOND.

### Key Features:
- Fast BLAST-like sequence aligner
- Well-established and widely used
- Provides alignment scores and e-values
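
The DIAMOND cell below flags a contig as tobamovirus when the name of its top blastx hit contains the string "tobamovirus". A minimal sketch of how that filter treats capitalization and missing values, using a made-up toy table (the hit names are illustrative, not taken from the project data):

```python
import pandas as pd

# Toy hit names; None stands for a contig with no DIAMOND hit.
toy = pd.DataFrame({
    "contig_id": ["c1", "c2", "c3", "c4"],
    "first_diamond_blastx_hit_name": [
        "replicase [Tobamovirus sp.]",
        "coat protein [Potexvirus sp.]",
        None,
        "movement protein [unclassified TOBAMOVIRUS]",
    ],
})

# case=False ignores capitalization; na=False maps missing names to False
# instead of NaN, so the mask can be used directly for boolean indexing.
is_tobamo_hit = toy["first_diamond_blastx_hit_name"].str.contains(
    "tobamovirus", case=False, na=False
)
print(toy.loc[is_tobamo_hit, "contig_id"].tolist())  # ['c1', 'c4']
```

A substring match of this kind only catches hit names that include the genus name. A hit labelled solely by species name would not match, which may be worth checking against the actual hit names.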

In [6]:
# Build each mask once and reuse it in the summary line
is_tobamo = df['ground_truth_category'].isin(['tob1','tob2','tob3'])
nt_tobamo = df[df['first_diamond_blastx_hit_name'].str.contains('tobamovirus', case=False, na=False) & is_tobamo]
print(f"{len(nt_tobamo)} actual tobamo out of {is_tobamo.sum()} assigned tobamo, which is ({len(nt_tobamo) / is_tobamo.sum() * 100:.1f}%) total assigned tobamo")

186 actual tobamo out of 228 assigned tobamo, which is (81.6%) total assigned tobamo


## Method 3: MEGAN6 Classification

Metagenome analysis using MEGAN6 for taxonomic assignment.

### Key Features:
- Specialized for metagenome analysis
- Uses LCA (Lowest Common Ancestor) algorithm
- Integrates multiple alignment results
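
To illustrate the LCA idea, here is a naive sketch that returns the deepest taxon shared by all hit lineages. It is not MEGAN6's implementation; real tools typically also filter hits by score before taking the LCA. The function name and the simplified lineages are assumptions made for this example.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        return None
    lca = None
    # zip stops at the shortest lineage; walk ranks from the root down.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        lca = taxa_at_rank[0]
    return lca

# Simplified lineages (intermediate ranks omitted for brevity).
hits = [
    ["Viruses", "Riboviria", "Virgaviridae", "Tobamovirus", "Tobacco mosaic virus"],
    ["Viruses", "Riboviria", "Virgaviridae", "Tobamovirus", "Tomato mosaic virus"],
    ["Viruses", "Riboviria", "Virgaviridae", "Tobamovirus"],
]
print(lowest_common_ancestor(hits))  # Tobamovirus

# A hit from another genus in the same family pushes the assignment up a rank.
mixed = hits + [["Viruses", "Riboviria", "Virgaviridae", "Tobravirus"]]
print(lowest_common_ancestor(mixed))  # Virgaviridae
```

The second call shows why an LCA assignment can be correct yet too coarse to contain the genus name that the downstream keyword filter looks for.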

In [7]:
# MEGAN6 results analysis
# Extract MEGAN6 specific results from the combined file

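# drop_duplicates returns a new frame; the bare call was a no-op, so assign the result.
# Without it, to_dict() silently keeps the last row per qseqid instead of the first.
megan_unique = diamond_megan_results.drop_duplicates(subset='qseqid', keep='first')
megan_tax_mapper = megan_unique.set_index('qseqid')['megan_tax'].to_dict()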
df['megan_tax'] = df['contig_id'].map(megan_tax_mapper)

megan_tobamo = df[(df['megan_tax'].str.contains('tobamovirus', case=False, na=False)) & (df['ground_truth_category'].isin(['tob1','tob2','tob3']))]
print(f"{megan_tobamo.shape[0]} actual tobamo out of {df['ground_truth_category'].isin(['tob1','tob2','tob3']).sum()} assigned tobamo, which is ({megan_tobamo.shape[0] / df['ground_truth_category'].isin(['tob1','tob2','tob3']).sum() * 100:.1f}%) total assigned tobamo")

202 actual tobamo out of 228 assigned tobamo, which is (88.6%) total assigned tobamo


## Method 4: Machine Learning Model Predictions

Analysis of ML model-based taxonomic predictions.

### Key Features:
- Trained on curated datasets
- Can learn complex patterns in sequence data
- Provides confidence scores for predictions
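
Because the model reports a score per contig, the positive call can be tightened or loosened by moving the decision threshold. The sketch below uses hypothetical toy data and assumes that `prob_1` is the predicted probability of the tobamovirus class; `is_tobamo` and `precision_recall_at` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data: prob_1 is assumed to be the predicted probability
# of the tobamovirus class; is_tobamo is the curated label.
toy = pd.DataFrame({
    "prob_1":    [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10],
    "is_tobamo": [True, True, True, False, True, False, False, False],
})

def precision_recall_at(frame, threshold):
    """Precision and recall when prob_1 >= threshold is called positive."""
    predicted = frame["prob_1"] >= threshold
    tp = (predicted & frame["is_tobamo"]).sum()
    precision = tp / predicted.sum() if predicted.sum() else 0.0
    recall = tp / frame["is_tobamo"].sum()
    return float(precision), float(recall)

for threshold in (0.5, 0.7):
    p, r = precision_recall_at(toy, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.80, recall=1.00
# threshold=0.7: precision=1.00, recall=0.75
```

On this toy table, raising the threshold trades recall for precision. Whether the same holds for the real predictions would need to be checked on `model_predictions`.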

In [8]:
model_tobamo = df[(df['model_prediction'] == 1) & (df['ground_truth_category'].isin(['tob1','tob2','tob3']))]
print(f"{model_tobamo.shape[0]} actual tobamo out of {df['ground_truth_category'].isin(['tob1','tob2','tob3']).sum()} assigned tobamo, which is ({model_tobamo.shape[0] / df['ground_truth_category'].isin(['tob1','tob2','tob3']).sum() * 100:.1f}%) total assigned tobamo")

212 actual tobamo out of 228 assigned tobamo, which is (93.0%) total assigned tobamo


## Comparative Analysis

Direct comparison of all methods across multiple metrics.

### Performance Metrics
- **Accuracy**: Overall correct classifications
- **Precision**: True positives / (True positives + False positives)
- **Recall**: True positives / (True positives + False negatives)
- **F1 Score**: Harmonic mean of precision and recall
- **Computational Time**: Time required for classification
- **Resource Usage**: Memory and CPU requirements
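
The first four metrics can all be computed from the four confusion-matrix counts. A minimal sketch follows; `classification_metrics` is a helper name introduced here, not part of the notebook's `utils`. The example plugs in the DIAMOND counts reported later in this notebook (186 true positives, 205 positive calls, 228 curated tobamovirus contigs, 510 contigs in total). The false-positive, false-negative, and true-negative counts are derived by subtraction.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 (as fractions) from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# DIAMOND counts from this notebook; fp, fn and tn are derived by subtraction.
tp = 186
fp = 205 - tp             # positive calls that are not curated tobamovirus
fn = 228 - tp             # curated tobamovirus contigs that were missed
tn = 510 - tp - fp - fn   # everything else
diamond = classification_metrics(tp, fp, fn, tn)
for name, value in diamond.items():
    print(f"{name}: {value * 100:.1f}%")
```

This reproduces the precision, recall, and F1 reported for DIAMOND below (90.7%, 81.6%, 85.9%) and adds an accuracy of 88.0% under the same counts.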

In [9]:
# Method-specific Analysis and Performance Comparison

print("\n=== METHOD-SPECIFIC ANALYSIS ===")

# Check available columns first
print(f"\nAvailable columns in datasets:")
print(f"Ground Truth columns: {list(ground_truth.columns)}")
print(f"DIAMOND/MEGAN columns: {list(diamond_megan_results.columns)}")
print(f"Model predictions columns: {list(model_predictions.columns)}")

# 1. Manual Curation Analysis
print("\n1. 📋 MANUAL CURATION (Ground Truth):")
print(f"   Total samples: {len(ground_truth)}")

# Find family column (might be different case)
family_col = None
for col in ground_truth.columns:
    if 'family' in col.lower():
        family_col = col
        break

if family_col:
    family_dist = ground_truth[family_col].value_counts()
    print(f"   Total families identified: {len(family_dist)}")
    print(f"   Most common families:")
    for family, count in family_dist.head(5).items():
        print(f"   - {family}: {count} ({count/len(ground_truth)*100:.1f}%)")
        
    # Check for tobamovirus
    tobamo_mask = ground_truth[family_col].str.contains('Tobamovirus', case=False, na=False)
    tobamo_count = tobamo_mask.sum()
    print(f"   Tobamovirus samples: {tobamo_count} ({tobamo_count/len(ground_truth)*100:.1f}%)")
else:
    print("   No family column found in ground truth data")

# 2. DIAMOND/MEGAN Analysis
print("\n2. 💎 DIAMOND/MEGAN Analysis:")
print(f"   Total samples processed: {len(diamond_megan_results)}")
if 'megan_tobamo' in locals():
    print(f"   Tobamovirus-positive samples: {len(megan_tobamo)}")
    
# 3. Machine Learning Model Analysis
print("\n3. 🤖 MACHINE LEARNING MODEL:")
print(f"   Total predictions: {len(model_predictions)}")
if 'model_tobamo' in locals():
    print(f"   Tobamovirus predictions: {len(model_tobamo)}")
    
# Check for confidence/probability columns
prob_cols = [col for col in model_predictions.columns if any(term in col.lower() for term in ['conf', 'prob', 'score'])]
if prob_cols:
    print(f"   Confidence columns available: {prob_cols}")

# 4. Palmprint Analysis
print("\n4. 🖨️ PALMPRINT Analysis:")
if 'palmprint_positive_contigs' in locals():
    print(f"   Total positive hits: {len(palmprint_positive_contigs)} contigs")
if 'tobamo_palmprints' in locals():
    print(f"   True tobamovirus palmprint hits: {len(tobamo_palmprints)} contigs")
    # Calculate precision (against all palmprint hits) and recall (against ground truth tobamo)
    if 'palmprint_positive_contigs' in locals():
        precision = len(tobamo_palmprints) / len(palmprint_positive_contigs) * 100 if len(palmprint_positive_contigs) > 0 else 0
        print(f"   Palmprint precision: {precision:.1f}% ({len(tobamo_palmprints)}/{len(palmprint_positive_contigs)})")
    
    # Calculate recall against ground truth tobamo, computed from df rather than hardcoded
    ground_truth_tobamo_count = df['ground_truth_category'].isin(['tob1','tob2','tob3']).sum()
    recall = len(tobamo_palmprints) / ground_truth_tobamo_count * 100
    print(f"   Palmprint recall vs ground truth: {recall:.1f}% ({len(tobamo_palmprints)}/{ground_truth_tobamo_count})")
    
if not any(var in locals() for var in ['palmprint_positive_contigs', 'tobamo_palmprints']):
    print("   Palmprint data structure needs investigation")


=== METHOD-SPECIFIC ANALYSIS ===

Available columns in datasets:
Ground Truth columns: ['contig_name', 'ground_truth', 'category_old', 'category']
DIAMOND/MEGAN columns: ['SRR', 'qseqid', 'megan_tax', 'nr_tax', 'tpdb2_sseqid', 'nr_sseqid', 'nr_pident', 'nr_length', 'nr_mismatch', 'nr_gapopen', 'nr_qstart', 'nr_qend', 'nr_sstart', 'nr_send', 'nr_evalue', 'nr_bitscore', 'tpdb2_pident', 'tpdb2_length', 'tpdb2_mismatch', 'tpdb2_gapopen', 'tpdb2_qstart', 'tpdb2_qend', 'tpdb2_sstart', 'tpdb2_send', 'tpdb2_evalue', 'tpdb2_bitscore', 'sequence', 'nr_sseqid_key']
Model predictions columns: ['contig_name', 'predicted_class', 'prob_1']

1. 📋 MANUAL CURATION (Ground Truth):
   Total samples: 510
   No family column found in ground truth data

2. 💎 DIAMOND/MEGAN Analysis:
   Total samples processed: 161333
   Tobamovirus-positive samples: 202

3. 🤖 MACHINE LEARNING MODEL:
   Total predictions: 510
   Tobamovirus predictions: 212
   Confidence columns available: ['prob_1']

4. 🖨️ PALMPRINT Analysis

In [10]:
# Cross-Method Comparison - Part 1: Ground Truth Baseline

print("\n=== CROSS-METHOD COMPARISON FOR TOBAMOVIRUS DETECTION ===")

# Ground Truth Baseline
ground_truth_tobamo_count = df['ground_truth_category'].isin(['tob1','tob2','tob3']).sum()
print(f"\n🎯 GROUND TRUTH BASELINE:")
print(f"   Total tobamovirus samples: {ground_truth_tobamo_count}")
print(f"   Total samples: {len(df)}")
print(f"   Prevalence: {ground_truth_tobamo_count/len(df)*100:.1f}%")


=== CROSS-METHOD COMPARISON FOR TOBAMOVIRUS DETECTION ===

🎯 GROUND TRUTH BASELINE:
   Total tobamovirus samples: 228
   Total samples: 510
   Prevalence: 44.7%


In [11]:
# Cross-Method Comparison - Part 2: Palmprint Method Analysis

print(f"\n📊 TOBAMOVIRUS DETECTION PERFORMANCE BY METHOD:")

# Method 1: Palmprint
print(f"\n1. 🖨️  PALMPRINT METHOD:")
print(f"   Analyzing palmprint results...")

palmprint_tp = len(tobamo_palmprints)  # True positives
palmprint_total_pos = len(palmprint_positive_contigs)  # All palmscan hits; may include duplicates or contigs absent from df
palmprint_precision = palmprint_tp / palmprint_total_pos * 100 if palmprint_total_pos > 0 else 0
palmprint_recall = palmprint_tp / ground_truth_tobamo_count * 100
palmprint_f1 = 2 * (palmprint_precision * palmprint_recall) / (palmprint_precision + palmprint_recall) if (palmprint_precision + palmprint_recall) > 0 else 0

print(f"   True Positives: {palmprint_tp}")
print(f"   Total Positives Detected: {palmprint_total_pos}")
print(f"   Precision: {palmprint_precision:.1f}% ({palmprint_tp}/{palmprint_total_pos})")
print(f"   Recall (Sensitivity): {palmprint_recall:.1f}% ({palmprint_tp}/{ground_truth_tobamo_count})")
print(f"   F1 Score: {palmprint_f1:.1f}")

print("   ✅ Palmprint analysis completed")


📊 TOBAMOVIRUS DETECTION PERFORMANCE BY METHOD:

1. 🖨️  PALMPRINT METHOD:
   Analyzing palmprint results...
   True Positives: 40
   Total Positives Detected: 149
   Precision: 26.8% (40/149)
   Recall (Sensitivity): 17.5% (40/228)
   F1 Score: 21.2
   ✅ Palmprint analysis completed


In [12]:
# Cross-Method Comparison - Part 3: DIAMOND BLASTX Analysis

# Method 2: DIAMOND BLASTX
print(f"\n2. 💎 DIAMOND BLASTX:")
print(f"   Analyzing DIAMOND results...")

diamond_tp = len(nt_tobamo)
print(f"   Calculating total positives detected...")
diamond_total_pos = df['first_diamond_blastx_hit_name'].str.contains('tobamovirus', case=False, na=False).sum()
diamond_precision = diamond_tp / diamond_total_pos * 100 if diamond_total_pos > 0 else 0
diamond_recall = diamond_tp / ground_truth_tobamo_count * 100
diamond_f1 = 2 * (diamond_precision * diamond_recall) / (diamond_precision + diamond_recall) if (diamond_precision + diamond_recall) > 0 else 0

print(f"   True Positives: {diamond_tp}")
print(f"   Total Positives Detected: {diamond_total_pos}")
print(f"   Precision: {diamond_precision:.1f}% ({diamond_tp}/{diamond_total_pos})")
print(f"   Recall (Sensitivity): {diamond_recall:.1f}% ({diamond_tp}/{ground_truth_tobamo_count})")
print(f"   F1 Score: {diamond_f1:.1f}")

print("   ✅ DIAMOND analysis completed")


2. 💎 DIAMOND BLASTX:
   Analyzing DIAMOND results...
   Calculating total positives detected...
   True Positives: 186
   Total Positives Detected: 205
   Precision: 90.7% (186/205)
   Recall (Sensitivity): 81.6% (186/228)
   F1 Score: 85.9
   ✅ DIAMOND analysis completed


In [13]:
# Cross-Method Comparison - Part 4: MEGAN6 Analysis

# Method 3: MEGAN6
print(f"\n3. 🧬 MEGAN6 LCA:")
print(f"   Analyzing MEGAN6 results...")

megan_tp = len(megan_tobamo)
print(f"   Calculating MEGAN6 positives...")
megan_total_pos = df['megan_tax'].str.contains('tobamovirus', case=False, na=False).sum()
megan_precision = megan_tp / megan_total_pos * 100 if megan_total_pos > 0 else 0
megan_recall = megan_tp / ground_truth_tobamo_count * 100
megan_f1 = 2 * (megan_precision * megan_recall) / (megan_precision + megan_recall) if (megan_precision + megan_recall) > 0 else 0

print(f"   True Positives: {megan_tp}")
print(f"   Total Positives Detected: {megan_total_pos}")
print(f"   Precision: {megan_precision:.1f}% ({megan_tp}/{megan_total_pos})")
print(f"   Recall (Sensitivity): {megan_recall:.1f}% ({megan_tp}/{ground_truth_tobamo_count})")
print(f"   F1 Score: {megan_f1:.1f}")

print("   ✅ MEGAN6 analysis completed")


3. 🧬 MEGAN6 LCA:
   Analyzing MEGAN6 results...
   Calculating MEGAN6 positives...
   True Positives: 202
   Total Positives Detected: 215
   Precision: 94.0% (202/215)
   Recall (Sensitivity): 88.6% (202/228)
   F1 Score: 91.2
   ✅ MEGAN6 analysis completed


In [14]:
# Cross-Method Comparison - Part 5: Machine Learning Model Analysis

# Method 4: Machine Learning Model
print(f"\n4. 🤖 MACHINE LEARNING MODEL:")
print(f"   Analyzing ML model results...")

ml_tp = len(model_tobamo)
print(f"   Calculating ML predictions...")
ml_total_pos = (df['model_prediction'] == 1).sum()
ml_precision = ml_tp / ml_total_pos * 100 if ml_total_pos > 0 else 0
ml_recall = ml_tp / ground_truth_tobamo_count * 100
ml_f1 = 2 * (ml_precision * ml_recall) / (ml_precision + ml_recall) if (ml_precision + ml_recall) > 0 else 0

print(f"   True Positives: {ml_tp}")
print(f"   Total Positives Detected: {ml_total_pos}")
print(f"   Precision: {ml_precision:.1f}% ({ml_tp}/{ml_total_pos})")
print(f"   Recall (Sensitivity): {ml_recall:.1f}% ({ml_tp}/{ground_truth_tobamo_count})")
print(f"   F1 Score: {ml_f1:.1f}")

print("   ✅ ML Model analysis completed")


4. 🤖 MACHINE LEARNING MODEL:
   Analyzing ML model results...
   Calculating ML predictions...
   True Positives: 212
   Total Positives Detected: 285
   Precision: 74.4% (212/285)
   Recall (Sensitivity): 93.0% (212/228)
   F1 Score: 82.7
   ✅ ML Model analysis completed


In [15]:
# Cross-Method Comparison - Part 6: Performance Rankings

print(f"\n🏆 PERFORMANCE RANKING:")
print(f"📈 RECALL (Sensitivity) - How many true tobamoviruses detected:")
methods_recall = [
    ("Palmprint", palmprint_recall),
    ("DIAMOND", diamond_recall),
    ("MEGAN6", megan_recall),
    ("ML Model", ml_recall)
]
methods_recall.sort(key=lambda x: x[1], reverse=True)
for i, (method, recall) in enumerate(methods_recall, 1):
    print(f"   {i}. {method}: {recall:.1f}%")

print(f"\n🎯 PRECISION - Accuracy of positive predictions:")
methods_precision = [
    ("Palmprint", palmprint_precision),
    ("DIAMOND", diamond_precision),
    ("MEGAN6", megan_precision),
    ("ML Model", ml_precision)
]
methods_precision.sort(key=lambda x: x[1], reverse=True)
for i, (method, precision) in enumerate(methods_precision, 1):
    print(f"   {i}. {method}: {precision:.1f}%")

print(f"\n⚖️  F1 SCORE - Balanced performance:")
methods_f1 = [
    ("Palmprint", palmprint_f1),
    ("DIAMOND", diamond_f1),
    ("MEGAN6", megan_f1),
    ("ML Model", ml_f1)
]
methods_f1.sort(key=lambda x: x[1], reverse=True)
for i, (method, f1) in enumerate(methods_f1, 1):
    print(f"   {i}. {method}: {f1:.1f}")

print("   ✅ Performance ranking completed")


🏆 PERFORMANCE RANKING:
📈 RECALL (Sensitivity) - How many true tobamoviruses detected:
   1. ML Model: 93.0%
   2. MEGAN6: 88.6%
   3. DIAMOND: 81.6%
   4. Palmprint: 17.5%

🎯 PRECISION - Accuracy of positive predictions:
   1. MEGAN6: 94.0%
   2. DIAMOND: 90.7%
   3. ML Model: 74.4%
   4. Palmprint: 26.8%

⚖️  F1 SCORE - Balanced performance:
   1. MEGAN6: 91.2
   2. DIAMOND: 85.9
   3. ML Model: 82.7
   4. Palmprint: 21.2
   ✅ Performance ranking completed


In [16]:
# Cross-Method Comparison - Part 7: Method Characteristics Summary

print(f"\n🔍 METHOD CHARACTERISTICS SUMMARY:")
print(f"• DIAMOND: High precision ({diamond_precision:.1f}%), lower recall ({diamond_recall:.1f}%) - Conservative approach")
print(f"• MEGAN6: High precision ({megan_precision:.1f}%) and recall ({megan_recall:.1f}%) - LCA consensus")
print(f"• ML Model: High recall ({ml_recall:.1f}%), lower precision ({ml_precision:.1f}%) - Data-driven")
print(f"• Palmprint: Low precision ({palmprint_precision:.1f}%) and recall ({palmprint_recall:.1f}%) - Novel method needing validation")

print("   ✅ Method characteristics summary completed")


🔍 METHOD CHARACTERISTICS SUMMARY:
• DIAMOND: High precision (90.7%), lower recall (81.6%) - Conservative approach
• MEGAN6: High precision (94.0%) and recall (88.6%) - LCA consensus
• ML Model: High recall (93.0%), lower precision (74.4%) - Data-driven
• Palmprint: Low precision (26.8%) and recall (17.5%) - Novel method needing validation
   ✅ Method characteristics summary completed


In [17]:
# Cross-Method Comparison - Part 8: Method Overlap Analysis (Potentially Slow)

# Calculate overall detection overlap
print(f"\n🎭 METHOD OVERLAP ANALYSIS:")
print("   Building contig sets for overlap analysis...")

# Create sets of detected contigs for each method
palmprint_set = set(tobamo_palmprints['contig_id'])
print(f"   Palmprint set: {len(palmprint_set)} contigs")

diamond_set = set(nt_tobamo['contig_id'])
print(f"   DIAMOND set: {len(diamond_set)} contigs")

megan_set = set(megan_tobamo['contig_id'])
print(f"   MEGAN6 set: {len(megan_set)} contigs")

ml_set = set(model_tobamo['contig_id'])
print(f"   ML Model set: {len(ml_set)} contigs")

print("   ✅ Contig sets created successfully")


🎭 METHOD OVERLAP ANALYSIS:
   Building contig sets for overlap analysis...
   Palmprint set: 40 contigs
   DIAMOND set: 186 contigs
   MEGAN6 set: 202 contigs
   ML Model set: 212 contigs
   ✅ Contig sets created successfully


In [18]:
# Cross-Method Comparison - Part 9: Set Intersection Analysis

print(f"\n🔄 CALCULATING SET INTERSECTIONS:")

# Intersection analysis
print("   Computing intersection of all methods...")
all_methods = palmprint_set.intersection(diamond_set).intersection(megan_set).intersection(ml_set)
print(f"   All methods intersection: {len(all_methods)} contigs")

print("   Computing union of all methods...")
any_method = palmprint_set.union(diamond_set).union(megan_set).union(ml_set)
print(f"   Any method union: {len(any_method)} contigs")

print("   Computing consensus (≥3 methods)...")
method_sets = [palmprint_set, diamond_set, megan_set, ml_set]
consensus_3_plus = sum(1 for contig in any_method
                       if sum(contig in s for s in method_sets) >= 3)

print(f"   Detected by all 4 methods: {len(all_methods)} contigs")
print(f"   Detected by ≥3 methods: {consensus_3_plus} contigs")
print(f"   Detected by any method: {len(any_method)} contigs")
print(f"   Consensus confidence: {len(all_methods)/len(any_method)*100:.1f}% agreement")

print("   ✅ Cross-method comparison analysis completed successfully!")


🔄 CALCULATING SET INTERSECTIONS:
   Computing intersection of all methods...
   All methods intersection: 27 contigs
   Computing union of all methods...
   Any method union: 228 contigs
   Computing consensus (≥3 methods)...
   Detected by all 4 methods: 27 contigs
   Detected by ≥3 methods: 170 contigs
   Detected by any method: 228 contigs
   Consensus confidence: 11.8% agreement
   ✅ Cross-method comparison analysis completed successfully!
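
The "consensus confidence" above is the size of the four-way intersection divided by the size of the union (27/228 = 11.8%). It is therefore capped by the least sensitive method, here palmprint with 40 contigs. Since each set holds only true positives, a union of 228 also means every curated tobamovirus contig was detected by at least one method. A finer view is the number of contigs supported by exactly k methods. The sketch below uses hypothetical contig IDs standing in for the real sets.

```python
from collections import Counter

# Hypothetical contig IDs standing in for palmprint_set, diamond_set, etc.
method_sets = {
    "palmprint": {"c1"},
    "diamond": {"c1", "c2", "c3"},
    "megan": {"c1", "c2", "c3", "c4"},
    "ml": {"c1", "c2", "c4", "c5"},
}

# Count how many methods detected each contig.
votes = Counter(contig for s in method_sets.values() for contig in s)

# Histogram: number of contigs supported by exactly k methods.
support = Counter(votes.values())
for k in sorted(support, reverse=True):
    print(f"{k} method(s): {support[k]} contigs")

# Same ratio as the "consensus confidence" line: intersection over union.
all_methods = set.intersection(*method_sets.values())
any_method = set.union(*method_sets.values())
jaccard = len(all_methods) / len(any_method)
print(f"intersection/union = {jaccard:.2f}")  # 0.20
```

Applied to the real sets, the same histogram would show how much of the low agreement comes from palmprint alone rather than from disagreement among the other three methods.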
