# Step 4: Isolation Forest Variants

**Learning Objectives:**
1. Understand how Isolation Forest works (unsupervised anomaly detection)
2. Implement Standard Isolation Forest
3. Compare with Random Forest results
4. Understand the `contamination` parameter
5. Analyze sensitivity vs specificity tradeoffs

---

## What is Isolation Forest?

**Key Concept:** Anomalies are "easier to isolate" than normal points.

**How it works:**
- Build random decision trees
- Anomalies get isolated with fewer splits
- Each point gets an "anomaly score" based on how few splits it takes to isolate it
- Higher score = more likely to be an anomaly
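
The bullets above can be illustrated with a minimal sketch on synthetic data. The data, seed, and variable names here are illustrative assumptions and are not part of this notebook's ECG pipeline. One detail worth knowing: scikit-learn's `score_samples` returns the negated score from the original Isolation Forest paper, so in scikit-learn a lower value means more abnormal.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data (assumption): a tight 2-D Gaussian cluster plus one far-away point
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_outlier = np.array([[8.0, 8.0]])
X_demo = np.vstack([X_normal, X_outlier])

demo_forest = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

# score_samples is the NEGATED paper score: lower = isolated in fewer splits
scores = demo_forest.score_samples(X_demo)
outlier_score = scores[-1]
typical_score = np.median(scores[:-1])

print(f"Outlier score: {outlier_score:.3f}")
print(f"Median score of cluster points: {typical_score:.3f}")
```

The far-away point should receive a clearly lower score than a typical cluster point, because random splits separate it from the rest of the data almost immediately.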

**Key Parameter:**
- `contamination`: Expected proportion of anomalies (we use 0.101 = 10.1%)
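
To see what `contamination` does in practice, here is a small sketch on synthetic data (the data and the three rates are assumptions chosen for illustration). In scikit-learn the parameter does not change how the trees are built. It only sets the score threshold (`offset_`) below which `predict` returns -1, so the share of training points flagged should land close to the value passed in.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))  # synthetic data (assumption)

flagged_fraction = {}
for c in (0.05, 0.10, 0.20):
    model = IsolationForest(n_estimators=50, contamination=c, random_state=0).fit(X_demo)
    # predict() returns -1 for points whose score falls below the learned offset
    flagged_fraction[c] = float(np.mean(model.predict(X_demo) == -1))
    print(f"contamination={c:.2f} -> flagged {flagged_fraction[c]:.3f} of training points")
```

On unseen data the flagged share can drift away from `contamination`, since the threshold is fixed from the training scores.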

In [2]:
# load the data and the libraries

import pandas as pd
import numpy as np 
import pickle
from sklearn.ensemble import IsolationForest 
from sklearn.metrics import confusion_matrix, accuracy_score 
import time 

# Seed
random_seed = 123
np.random.seed(random_seed)

# Load preprocessed data
with open('preprocessed_data.pkl', 'rb') as f:
    data = pickle.load(f)

# Extract the needed data 
X_train_scaled = data['X_train_scaled']  # Scaled features (tree splits do not require scaling; we reuse the preprocessed arrays)
X_test_scaled = data['X_test_scaled']
y_train = data['y_train']
y_test = data['y_test']
contamination_rate = data['contamination_rate']

In [3]:
# Train Standard Isolation Forest

iso_forest = IsolationForest(
    n_estimators = 100,            # Number of trees
    max_samples = 1000,            # Sample size per tree
    contamination = contamination_rate, # Expected proportion of anomalies
    random_state = random_seed,
    n_jobs = -1,
    verbose = 1
)

# Train the model 
start_time = time.time()

iso_forest.fit(X_train_scaled)

training_time = time.time()-start_time
print(f"Training model completed in {training_time:.2f} seconds")

[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done   2 out of  16 | elapsed:    0.0s remaining:    0.9s
[Parallel(n_jobs=16)]: Done  16 out of  16 | elapsed:    0.1s finished


Training model completed in 0.39 seconds


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished


In [4]:
# Make predictions

# Predict on the test set
# sklearn convention: predict() returns 1 for normal, -1 for anomaly
y_pred_if_raw = iso_forest.predict(X_test_scaled)

# Convert to our labels: 0 = normal, 1 = anomaly
y_pred_if = np.where(y_pred_if_raw == -1, 1, 0)

# Count predictions
unique, counts = np.unique(y_pred_if, return_counts = True)
pred_dict = dict(zip(unique, counts))

print(f"\n Predictions:")
print(f"  Normal (0):   {pred_dict.get(0, 0):,}")
print(f"  Abnormal (1): {pred_dict.get(1, 0):,}")

# Expected abnormal based on contamination
expected_abnormal = int(len(y_test) * contamination_rate)
print(f"\nExpected abnormal (based on contamination): {expected_abnormal:,}")
print(f"Actual predicted abnormal: {pred_dict.get(1, 0):,}")



 Predictions:
  Normal (0):   39,691
  Abnormal (1): 4,703

Expected abnormal (based on contamination): 4,733
Actual predicted abnormal: 4,703


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished


In [5]:
# Calculate performance metrics
from sklearn.metrics import confusion_matrix, accuracy_score

# Calculate metrics 
accuracy = accuracy_score(y_test, y_pred_if)

# Confusion matrix 
conf_matrix = confusion_matrix(y_test, y_pred_if)
tn, fp, fn, tp = conf_matrix.ravel()

# Calculate sensitivity and specificity 
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# Rows = actual class, columns = predicted class (Normal, Abnormal)
print(f"Actual Normal    {tn:6d}  {fp:6d}")
print(f"       Abnormal  {fn:6d}  {tp:6d}")

print(f"\n{'='*70}")
print("METRICS:")
print(f"{'='*70}")
print(f"Accuracy:    {accuracy*100:.2f}%")
print(f"Sensitivity: {sensitivity*100:.2f}%  (catches {sensitivity*100:.1f}% of abnormal)")
print(f"Specificity: {specificity*100:.2f}%  (catches {specificity*100:.1f}% of normal)")

print(f"\n{'='*70}")
print("COMPARISON:")
print(f"{'='*70}")
print(f"Random Forest:      Acc={98.14:.2f}%, Sens={84.62:.2f}%, Spec={99.75:.2f}%")
print(f"Isolation Forest:   Acc={accuracy*100:.2f}%, Sens={sensitivity*100:.2f}%, Spec={specificity*100:.2f}%")

Actual Normal     36689    2972
       Abnormal    3002    1731

METRICS:
Accuracy:    86.54%
Sensitivity: 36.57%  (catches 36.6% of abnormal)
Specificity: 92.51%  (catches 92.5% of normal)

COMPARISON:
Random Forest:      Acc=98.14%, Sens=84.62%, Spec=99.75%
Isolation Forest:   Acc=86.54%, Sens=36.57%, Spec=92.51%
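
The sensitivity/specificity tradeoff can be explored by thresholding the continuous scores at different rates instead of relying on a single `predict` call. The sketch below uses synthetic stand-in data and a hypothetical helper, `sens_spec_at_rate`. Neither is part of this notebook's pipeline.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix

def sens_spec_at_rate(y_true, anomaly_score, flag_rate):
    """Flag the top `flag_rate` fraction of scores as abnormal (1); return (sensitivity, specificity)."""
    threshold = np.quantile(anomaly_score, 1 - flag_rate)
    y_pred = (anomaly_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Synthetic stand-in data (assumption): 900 normal points, 100 shifted abnormal points
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(900, 4)), rng.normal(3, 1, size=(100, 4))])
y_demo = np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)]

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)
anomaly_score = -forest.score_samples(X_demo)  # negate so that higher = more abnormal

results = {}
for rate in (0.05, 0.10, 0.20, 0.40):
    results[rate] = sens_spec_at_rate(y_demo, anomaly_score, rate)
    sens, spec = results[rate]
    print(f"flag_rate={rate:.2f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```

Flagging more points can only raise sensitivity and lower specificity. With this notebook's objects, the same helper could presumably be called with `-iso_forest.score_samples(X_test_scaled)` and `y_test`.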



## Single-Variable Isolation Forest

**Key Idea:** Instead of one model using all 150 features, train 150 separate models - one per feature!

**How it works:**
1. Train 150 individual Isolation Forests (one for each R-R interval feature)
2. Each model gives an anomaly score
3. Average all 150 scores
4. Higher average score = more likely anomaly
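
One way to read steps 2 and 3 is to average continuous anomaly scores rather than 0/1 votes. The sketch below shows that reading on synthetic data. The helper name `mean_per_feature_score`, its defaults, and the data are assumptions for illustration. The training cell in this notebook stores binary votes from `predict` instead, which is a different aggregation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def mean_per_feature_score(X_train, X_test, n_estimators=20, max_samples=256, seed=0):
    """Average continuous anomaly scores from one IsolationForest per column."""
    n_features = X_train.shape[1]
    scores = np.zeros((X_test.shape[0], n_features))
    for j in range(n_features):
        model = IsolationForest(
            n_estimators=n_estimators, max_samples=max_samples, random_state=seed
        ).fit(X_train[:, [j]])
        # Negate score_samples so that higher = more abnormal
        scores[:, j] = -model.score_samples(X_test[:, [j]])
    return scores.mean(axis=1)

# Synthetic stand-in data (assumption): 6 features, last 20 test rows shifted
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(1000, 6))
X_te = np.vstack([rng.normal(size=(180, 6)), rng.normal(loc=5.0, size=(20, 6))])

avg_score = mean_per_feature_score(X_tr, X_te)
print(f"Mean score, unshifted rows: {avg_score[:180].mean():.3f}")
print(f"Mean score, shifted rows:   {avg_score[180:].mean():.3f}")
```

Averaging continuous scores keeps information about how unusual each feature value is, whereas a 0/1 vote discards everything except which side of the per-feature threshold the value fell on.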


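**Expected from R:** Sensitivity 92.52%, Specificity 31.45%. With roughly 10% abnormal samples, these two figures are consistent with the reported 85.70% accuracy only if "normal" is treated as the positive class, so they may be swapped relative to the convention used in this notebook (abnormal = positive).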

In [6]:
# Train 150 models - One per feature 

n_features = X_train_scaled.shape[1] 
single_models = []
feature_scores = np.zeros((X_test_scaled.shape[0], n_features))

print(f"Training {n_features} individual isolation forest models.")

start_time = time.time()

for i in range(n_features):
    if (i + 1) % 30 == 0:      # Progress update every 30 models
        print(f" Processed {i+1} / {n_features} features.")
    
    # Training IF on single feature
    single_if = IsolationForest(
        n_estimators = 20,           # 20 Trees per model
        max_samples = 500,
        contamination = contamination_rate,
        random_state = random_seed
    )

    # Fit on a single feature
    X_train_single = X_train_scaled[:, i].reshape(-1,1)
    single_if.fit(X_train_single)

    # Predict on test set for this feature
    X_test_single = X_test_scaled[:, i].reshape(-1,1)
    predictions = single_if.predict(X_test_single)

    # Convert to 0/1 and store
    feature_scores[:, i] = np.where(predictions == -1, 1, 0)
    single_models.append(single_if)

training_time = time.time() - start_time

print(f"Training completed in {training_time:.2f} seconds.")
print(f"Total trees: {n_features} x 20 = {n_features * 20}")

Training 150 individual isolation forest models.
 Processed 30 / 150 features.
 Processed 60 / 150 features.
 Processed 90 / 150 features.
 Processed 120 / 150 features.
 Processed 150 / 150 features.
Training completed in 15.59 seconds.
Total trees: 150 x 20 = 3000


In [9]:
# Average the predictions across all 150 features
# Each row = one test sample
# Each column = one feature's vote (0 or 1)
# Average = fraction of features that say 'abnormal'

average_scores = feature_scores.mean(axis=1)

print(f"Average scores for test samples: {len(average_scores):,}")

# Determine threshold based on contamination
# If contamination = 0.101, we want the top 10.1% to be labeled abnormal

threshold = np.quantile(average_scores, 1 - contamination_rate)

# Final prediction
y_pred_single_var = np.where(average_scores >= threshold, 1, 0)

# Count predictions
pred_normal = int(np.sum(y_pred_single_var == 0))
pred_abnormal = int(np.sum(y_pred_single_var == 1))

print(f"\nFinal Predictions:")
print(f"  Normal (0):   {pred_normal:,}")
print(f"  Abnormal (1): {pred_abnormal:,}")



Average scores for test samples: 44,394

Final Predictions:
  Normal (0):   39,645
  Abnormal (1): 4,749


In [11]:
# Calculate metrics 
accuracy_sv = accuracy_score(y_test, y_pred_single_var)

# Confusion matrix
cm_sv = confusion_matrix(y_test, y_pred_single_var)
# ravel() returns the cells in row-major order: tn, fp, fn, tp
tn_sv, fp_sv, fn_sv, tp_sv = cm_sv.ravel()

# Calculate sensitivity and specificity
sensitivity_sv = tp_sv / (tp_sv + fn_sv)
specificity_sv = tn_sv / (tn_sv + fp_sv)

print(f"Actual Normal    {tn_sv:6d}  {fp_sv:6d}")
print(f"       Abnormal  {fn_sv:6d}  {tp_sv:6d}")

print(f"Accuracy:    {accuracy_sv*100:.2f}%")
print(f"Sensitivity: {sensitivity_sv*100:.2f}%  (catches {sensitivity_sv*100:.1f}% of abnormal)")
print(f"Specificity: {specificity_sv*100:.2f}%  (catches {specificity_sv*100:.1f}% of normal)")

print(f"R (from paper):")
print(f"  Single-Variable IF:  Acc=85.70%, Sens=92.52%, Spec=31.45%")
print(f"\nPython (current):")
print(f"  Random Forest:       Acc=98.14%, Sens=84.62%, Spec=99.75%")
print(f"  Standard IF:         Acc=86.54%, Sens=36.57%, Spec=92.51%")
print(f"  Single-Variable IF:  Acc={accuracy_sv*100:.2f}%, Sens={sensitivity_sv*100:.2f}%, Spec={specificity_sv*100:.2f}%")

Actual Normal     36241    3420
       Abnormal    3404    1329
Accuracy:    84.63%
Sensitivity: 28.08%  (catches 28.1% of abnormal)
Specificity: 91.38%  (catches 91.4% of normal)
R (from paper):
  Single-Variable IF:  Acc=85.70%, Sens=92.52%, Spec=31.45%

Python (current):
  Random Forest:       Acc=98.14%, Sens=84.62%, Spec=99.75%
  Standard IF:         Acc=86.54%, Sens=36.57%, Spec=92.51%
  Single-Variable IF:  Acc=84.63%, Sens=28.08%, Spec=91.38%
