# 08. PyOD Algorithms Comparison

Compare multiple anomaly detection algorithms using the PyOD library:
- **IForest** - Isolation Forest (tree-based)
- **KNN** - K-Nearest Neighbors (distance-based)
- **HBOS** - Histogram-based Outlier Score (very fast)
- **ECOD** - Empirical Cumulative Distribution Functions (parameter-free)
- **COPOD** - Copula-based Outlier Detection (parameter-free)
- **OCSVM** - One-Class SVM (boundary-based)

**Note:** LOF is available only at the aggregated level via `AggregatedPyOD`; it is too slow to run on all 13M tenders.
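
To make the "histogram-based" and "assumption of independence" descriptions of HBOS concrete, here is a minimal pure-Python sketch of the idea. It is a simplification written for illustration, not PyOD's implementation: the function name `hbos_scores`, the fixed-width binning, and the toy data are all assumptions. Each feature is binned on its own, and a row's score is the sum over features of the negative log of its bin's relative frequency.

```python
import math

def hbos_scores(rows, n_bins=5):
    """Simplified HBOS-style score: sum over features of -log(bin frequency).

    Illustrative only; features are treated as independent, so each
    column gets its own fixed-width histogram.
    """
    n_features = len(rows[0])
    scores = [0.0] * len(rows)
    for j in range(n_features):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant features
        counts = [0] * n_bins
        bin_index = []
        for value in column:
            # Clamp so the maximum value falls in the last bin
            b = min(int((value - lo) / width), n_bins - 1)
            bin_index.append(b)
            counts[b] += 1
        for i, b in enumerate(bin_index):
            frequency = counts[b] / len(column)  # never zero: row i is in bin b
            scores[i] += -math.log(frequency)
    return scores

# Toy data: five similar rows and one row that is extreme in both features
rows = [(1.0, 10.0), (1.1, 10.5), (0.9, 9.8), (1.2, 10.1), (1.0, 10.2), (8.0, 45.0)]
scores = hbos_scores(rows)
most_anomalous = max(range(len(rows)), key=scores.__getitem__)
print(most_anomalous, [round(s, 2) for s in scores])
```

Because every feature is scored separately, the cost is linear in rows times features, which is why this family of detectors scales to large tender tables. The trade-off is that unusual combinations of individually ordinary values go unnoticed.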

In [2]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from src.data_loader import load_tenders, load_buyers, load_suppliers, load_bids
from src.detectors import PyODDetector, compare_algorithms

pd.set_option('display.max_columns', 50)
plt.style.use('seaborn-v0_8-whitegrid')

print(f"Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Started: 2026-01-31 18:33:22


## 1. Load Data

In [3]:
tenders = load_tenders(years=[2022, 2023, 2024, 2025], sample_frac=0.05)
buyers = load_buyers()
suppliers = load_suppliers()

print(f"Tenders: {len(tenders):,}")
print(f"Buyers: {len(buyers):,}")
print(f"Suppliers: {len(suppliers):,}")

Scanning 2022...
Scanning 2023...
Scanning 2024...
Scanning 2025...
Sampled to 643,898 records (5%)
Loaded buyers: 35,995
Loaded suppliers: 358,376
Tenders: 643,898
Buyers: 35,995
Suppliers: 358,376


## 2. Quick Comparison (3 Algorithms)

In [4]:
# Compare IForest, HBOS, ECOD (fast algorithms)
comparison_fast = compare_algorithms(
    tenders,
    algorithms=["iforest", "hbos", "ecod"],
    contamination=0.05,
    buyers_df=buyers,
    suppliers_df=suppliers,
)

print("\n" + "="*60)
print("COMPARISON RESULTS (Fast Algorithms)")
print("="*60)
display(comparison_fast)


PyOD Detector: IFOREST
  Isolation Forest - isolates anomalies using random trees
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting IFOREST...

IFOREST complete!
  Anomalies: 32,195 (5.00%)

PyOD Detector: HBOS
  Histogram-based Outlier Score - very fast, assumption of independence
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting HBOS...

HBOS complete!
  Anomalies: 32,194 (5.00%)

PyOD Detector: ECOD
  Empirical Cumulative Distribution - unsupervised, parameter-free
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting ECOD...


[Parallel(n_jobs=14)]: Using backend LokyBackend with 14 concurrent workers.
[Parallel(n_jobs=14)]: Done   2 out of  14 | elapsed:    2.1s remaining:   13.4s
[Parallel(n_jobs=14)]: Done  14 out of  14 | elapsed:    2.3s finished



ECOD complete!
  Anomalies: 32,195 (5.00%)

COMPARISON RESULTS (Fast Algorithms)


| | algorithm | anomalies | anomaly_rate | mean_score | max_score |
|---|---|---|---|---|---|
| 0 | iforest | 32195 | 5.000016 | 0.17824 | 1.0 |
| 1 | hbos | 32194 | 4.99986 | 0.164759 | 1.0 |
| 2 | ecod | 32195 | 5.000016 | 0.148359 | 1.0 |
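
All three algorithms report almost exactly 5% anomalies because `contamination=0.05` fixes the share of flagged records; the counts say nothing about which detector is better. The sketch below shows one common way such a cut is made: flag scores strictly above the (1 - contamination) quantile. This is an assumption about how the threshold works, not a description of `PyODDetector`, and `flag_by_contamination` is a hypothetical helper.

```python
import math

def flag_by_contamination(scores, contamination=0.05):
    """Flag scores strictly above the (1 - contamination) quantile.

    Uses a linear-interpolation quantile. Hypothetical helper for
    illustration; the project's detector may threshold differently.
    """
    ordered = sorted(scores)
    position = (len(ordered) - 1) * (1 - contamination)
    lower = math.floor(position)
    upper = math.ceil(position)
    threshold = ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)
    labels = [1 if s > threshold else 0 for s in scores]
    return labels, threshold

# 100 distinct scores: exactly 5 are flagged
distinct = [i / 100 for i in range(100)]
labels, threshold = flag_by_contamination(distinct, 0.05)
print(sum(labels))

# Tied scores at the threshold are not flagged, so the count can fall short
tied = [0.1] * 94 + [0.5] * 2 + [0.9] * 4
tied_labels, tied_threshold = flag_by_contamination(tied, 0.05)
print(sum(tied_labels))
```

Under this assumption, ties at the threshold are one possible reason a detector that produces coarse, discretized scores could flag one record fewer than the others.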


## 3. Full Comparison (All Algorithms)

In [None]:
# Compare all available algorithms (tender-level, excludes LOF)
all_algorithms = ["iforest", "knn", "hbos", "ecod", "copod", "ocsvm"]

comparison_all = compare_algorithms(
    tenders,
    algorithms=all_algorithms,
    contamination=0.05,
    buyers_df=buyers,
    suppliers_df=suppliers,
)

print("\n" + "="*60)
print("COMPARISON RESULTS (All Tender-Level Algorithms)")
print("="*60)
display(comparison_all)


PyOD Detector: IFOREST
  Isolation Forest - isolates anomalies using random trees
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting IFOREST...

IFOREST complete!
  Anomalies: 32,195 (5.00%)

PyOD Detector: KNN
  K-Nearest Neighbors - distance-based anomalies
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting KNN...

KNN complete!
  Anomalies: 32,195 (5.00%)

PyOD Detector: HBOS
  Histogram-based Outlier Score - very fast, assumption of independence
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting HBOS...

HBOS complete!
  Anomalies: 32,194 (5.00%)

PyOD Detector: ECOD
  Empirical Cumulative Distribution - unsupervised, parameter-free
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting ECOD...

[Parallel(n_jobs=14)]: Using backend LokyBackend with 14 concurrent workers.
[Parallel(n_jobs=14)]: Done   2 out of  14 | elapsed:    0.0s remaining:    0.5s
[Parallel(n_jobs=14)]: Done  14 out of  14 | elapsed:    0.1s finished



ECOD complete!
  Anomalies: 32,195 (5.00%)

PyOD Detector: COPOD
  Copula-based Outlier Detection - fast, parameter-free
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting COPOD...


[Parallel(n_jobs=14)]: Using backend LokyBackend with 14 concurrent workers.
[Parallel(n_jobs=14)]: Done   2 out of  14 | elapsed:    0.0s remaining:    0.6s
[Parallel(n_jobs=14)]: Done  14 out of  14 | elapsed:    0.1s finished



COPOD complete!
  Anomalies: 32,195 (5.00%)

PyOD Detector: OCSVM
  One-Class SVM - boundary-based detection
Processing 643,898 tenders...
Step 1/3: Preparing features...
  Features: 14
Step 2/3: Preprocessing...
  Shape: (643898, 14)
Step 3/3: Fitting OCSVM...


In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Anomaly count
ax1 = axes[0]
comparison_all.plot(kind='bar', x='algorithm', y='anomalies', ax=ax1, color='coral')
ax1.set_title('Anomalies Detected by Algorithm')
ax1.set_xlabel('')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=45)
ax1.axhline(y=comparison_all['anomalies'].mean(), color='red', linestyle='--', label='Mean')
ax1.legend()

# Mean score
ax2 = axes[1]
comparison_all.plot(kind='bar', x='algorithm', y='mean_score', ax=ax2, color='steelblue')
ax2.set_title('Mean Anomaly Score by Algorithm')
ax2.set_xlabel('')
ax2.set_ylabel('Score')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('../results/pyod_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

## 4. Detailed Analysis with Best Algorithm

In [None]:
# Use Isolation Forest (a common, scalable default for tabular data; not guaranteed to be the best here)
detector = PyODDetector(
    algorithm="iforest",
    contamination=0.05,
    random_state=42,
)

results = detector.fit_detect(
    tenders,
    buyers_df=buyers,
    suppliers_df=suppliers,
)

print("\nSummary:")
display(detector.summary())

In [None]:
# Risk level distribution
print("Risk Level Distribution:")
print(results['risk_level'].value_counts().sort_index())

In [None]:
# Score distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax1 = axes[0]
results['score'].hist(bins=50, ax=ax1, color='steelblue', edgecolor='white')
ax1.axvline(x=0.5, color='orange', linestyle='--', label='Medium threshold')
ax1.axvline(x=0.75, color='red', linestyle='--', label='High threshold')
ax1.set_title('Anomaly Score Distribution (IForest)')
ax1.set_xlabel('Score')
ax1.set_ylabel('Count')
ax1.legend()

# Risk level pie
ax2 = axes[1]
risk_counts = results['risk_level'].value_counts()
colors = {'low': 'green', 'medium': 'gold', 'high': 'orange', 'critical': 'red'}
risk_counts.plot(kind='pie', ax=ax2, autopct='%1.1f%%', 
                 colors=[colors.get(x, 'gray') for x in risk_counts.index])
ax2.set_title('Risk Level Distribution')
ax2.set_ylabel('')

plt.tight_layout()
plt.savefig('../results/pyod_iforest_scores.png', dpi=150, bbox_inches='tight')
plt.show()
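
The reference lines at 0.5 and 0.75 suggest that `risk_level` is derived by binning the normalized score. A minimal sketch of such a mapping is shown below. Only the 0.5 and 0.75 cut-offs come from the plot above; the `critical` cut-off of 0.9, the inclusive comparisons, and the function `risk_level` itself are hypothetical and may not match what `PyODDetector` does.

```python
def risk_level(score, medium=0.5, high=0.75, critical=0.9):
    """Map a normalized score in [0, 1] to a risk label.

    medium and high mirror the plot's reference lines; critical=0.9 is a
    hypothetical cut-off chosen only for this illustration.
    """
    if score >= critical:
        return "critical"
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"

example_scores = [0.10, 0.50, 0.74, 0.75, 0.95]
levels = [risk_level(s) for s in example_scores]
print(levels)
```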

## 5. High-Risk Tenders Analysis

In [None]:
# Get high-risk anomalies
high_risk = detector.get_anomalies(min_score=0.75)
print(f"High-risk tenders (score >= 0.75): {len(high_risk):,}")

# Merge with tender details
high_risk_details = high_risk.merge(
    tenders[['tender_id', 'tender_value', 'procurement_method', 'is_single_bidder', 'price_change_pct']],
    on='tender_id'
)

print(f"\nTotal value: {high_risk_details['tender_value'].sum()/1e9:.2f} B UAH")
print(f"Single bidder rate: {high_risk_details['is_single_bidder'].mean()*100:.1f}%")
print("\nTop 10 by score:")
display(high_risk_details.nlargest(10, 'score')[['tender_id', 'score', 'risk_level', 'tender_value', 'procurement_method', 'is_single_bidder']])

In [None]:
# Procurement method distribution in anomalies
print("Procurement method in high-risk tenders:")
print(high_risk_details['procurement_method'].value_counts(normalize=True).mul(100).round(1))

## 6. Algorithm Agreement Analysis

In [None]:
# Run multiple algorithms and check overlap
algorithms_to_compare = ["iforest", "hbos", "ecod", "knn"]
all_results = {}

for algo in algorithms_to_compare:
    print(f"\n{'='*60}")
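    # Fix the seed so the agreement numbers are reproducible across runs
    det = PyODDetector(algorithm=algo, contamination=0.05, random_state=42)
    res = det.fit_detect(tenders, buyers_df=buyers, suppliers_df=suppliers)
    # Keep only the IDs of tenders flagged as anomalies
    all_results[algo] = set(res.loc[res['anomaly'] == 1, 'tender_id'])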

In [None]:
# Calculate overlap
from itertools import combinations

print("\nPairwise Agreement (Jaccard similarity):")
print("="*50)

for algo1, algo2 in combinations(algorithms_to_compare, 2):
    set1 = all_results[algo1]
    set2 = all_results[algo2]
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    jaccard = intersection / union if union > 0 else 0
    print(f"{algo1:8} vs {algo2:8}: {jaccard:.3f} ({intersection:,} common anomalies)")
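
As a quick sanity check on how to read these numbers, here is the same Jaccard computation on two small made-up sets of tender IDs (the IDs are placeholders, not real data).

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

flags_a = {"T1", "T2", "T3", "T4"}
flags_b = {"T3", "T4", "T5", "T6"}

similarity = jaccard(flags_a, flags_b)  # 2 shared IDs out of 6 distinct IDs
print(f"{similarity:.3f}")
```

When two algorithms each flag n tenders and share k of them, the Jaccard similarity is k / (2n - k). With a fixed contamination rate the sets are nearly the same size, so a value near 0.333 already means that about half of each algorithm's flags are shared.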

In [None]:
# Consensus anomalies (flagged by 3+ algorithms)
from collections import Counter

all_anomalies = []
for algo, ids in all_results.items():
    all_anomalies.extend(ids)

anomaly_counts = Counter(all_anomalies)
consensus_3plus = [tid for tid, count in anomaly_counts.items() if count >= 3]
consensus_all = [tid for tid, count in anomaly_counts.items() if count == len(algorithms_to_compare)]

print("\nConsensus Analysis:")
print(f"  Flagged by 3+ algorithms: {len(consensus_3plus):,}")
print(f"  Flagged by ALL algorithms: {len(consensus_all):,}")
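
The vote counting above can be checked on a toy example. The flag sets below are made up for illustration; only the algorithm names match this notebook.

```python
from collections import Counter

# Hypothetical flag sets, one per algorithm
flags = {
    "iforest": {"T1", "T2", "T3"},
    "hbos": {"T2", "T3", "T4"},
    "ecod": {"T3", "T4", "T5"},
    "knn": {"T2", "T3", "T6"},
}

# Each tender gets one vote per algorithm that flagged it
votes = Counter(tid for ids in flags.values() for tid in ids)

flagged_by_3_plus = sorted(tid for tid, n in votes.items() if n >= 3)
flagged_by_all = sorted(tid for tid, n in votes.items() if n == len(flags))

print(flagged_by_3_plus)
print(flagged_by_all)
```

Since each algorithm contributes a set, a tender can receive at most one vote per algorithm, so the vote count equals the number of algorithms that flagged it.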

In [None]:
# Analyze consensus anomalies
if consensus_all:
    consensus_df = tenders[tenders['tender_id'].isin(consensus_all)].copy()
    print(f"\nConsensus anomalies (flagged by all {len(algorithms_to_compare)} algorithms):")
    print(f"  Count: {len(consensus_df):,}")
    print(f"  Total value: {consensus_df['tender_value'].sum()/1e9:.2f} B UAH")
    print(f"  Single bidder rate: {consensus_df['is_single_bidder'].mean()*100:.1f}%")
    print("\n  Procurement methods:")
    print(consensus_df['procurement_method'].value_counts())

## 7. Save Results

In [None]:
# Save comparison results
comparison_all.to_csv('../results/pyod_algorithm_comparison.csv', index=False)

# Save high-risk anomalies
high_risk_details.to_csv('../results/pyod_high_risk_tenders.csv', index=False)

# Save consensus anomalies
if consensus_all:
    pd.DataFrame({'tender_id': consensus_all}).to_csv('../results/pyod_consensus_anomalies.csv', index=False)

print("Results saved to ../results/")
print(f"\nCompleted: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")