# Step 3: Random Forest Implementation

**Learning Objectives:**
1. Load preprocessed data
2. Understand the Random Forest algorithm
3. Train a Random Forest classifier
4. Make predictions on test set
5. Evaluate performance (accuracy, sensitivity, specificity)
6. Compare with your R implementation results

---

In [1]:
import pandas as pd
import numpy as np
import pickle

# Set seed
random_seed = 123
np.random.seed(random_seed)

# Load preprocessed data from Step 2
with open('preprocessed_data.pkl', 'rb') as f:
    data = pickle.load(f)

# Extract what we need
X_train = data['X_train']
X_test = data['X_test']
y_train = data['y_train']
y_test = data['y_test']
contamination_rate = data['contamination_rate']

print("✓ Preprocessed data loaded successfully!")
print(f"\nTraining samples: {X_train.shape[0]:,}")
print(f"Test samples: {X_test.shape[0]:,}")
print(f"Features: {X_train.shape[1]}")
print(f"Contamination rate: {contamination_rate:.4f}")

✓ Preprocessed data loaded successfully!

Training samples: 44,394
Test samples: 44,394
Features: 150
Contamination rate: 0.1066
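
Before training, it can help to confirm that the loaded dictionary has the expected shape. The sketch below is a minimal sanity check, assuming the five keys used in the cell above; the helper name `check_preprocessed` and the toy data are hypothetical and not part of the original notebook. Note also that `pickle.load` can execute arbitrary code, so only load pickle files from a source you trust.

```python
def check_preprocessed(data):
    """Raise an error if the loaded dictionary does not look like the Step 2 output."""
    required = ("X_train", "X_test", "y_train", "y_test", "contamination_rate")
    missing = [key for key in required if key not in data]
    if missing:
        raise KeyError(f"Missing keys in preprocessed data: {missing}")

    # Features and labels must have the same number of rows in each split
    for split in ("train", "test"):
        n_x = len(data[f"X_{split}"])
        n_y = len(data[f"y_{split}"])
        if n_x != n_y:
            raise ValueError(f"{split}: {n_x} feature rows but {n_y} labels")

    # The contamination rate is a proportion, so it must lie in [0, 1]
    rate = data["contamination_rate"]
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"Contamination rate out of range: {rate}")
    return True


# Toy stand-in for the real pickle contents (hypothetical values)
toy_data = {
    "X_train": [[0.1, 0.2], [0.3, 0.4]],
    "X_test": [[0.5, 0.6]],
    "y_train": [0, 1],
    "y_test": [0],
    "contamination_rate": 0.1066,
}
print(check_preprocessed(toy_data))  # True
```

Because the check relies only on `len` and key lookups, it should work equally for lists, NumPy arrays, and pandas objects.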


## Random Forest Algorithm

**What is it?**
- Ensemble of many decision trees (we'll use 500 trees)
- Each tree votes on the classification
- Final prediction = majority vote
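
The voting step can be sketched in a few lines. The example below compares a plain majority vote over the trees' labels with a "soft" vote that averages the trees' class probabilities. scikit-learn documents `RandomForestClassifier` predictions as a vote weighted by the trees' probability estimates, which corresponds to the soft variant. The function names and the three per-tree probabilities are hypothetical, chosen only to show a case where the two schemes disagree.

```python
from collections import Counter
from statistics import mean


def hard_vote(tree_labels):
    """Majority vote: the label predicted by the most trees wins."""
    return Counter(tree_labels).most_common(1)[0][0]


def soft_vote(tree_probs_abnormal, threshold=0.5):
    """Average the trees' probability of 'abnormal' (class 1) and threshold it."""
    return 1 if mean(tree_probs_abnormal) > threshold else 0


# Hypothetical probabilities of class 1 from three trees
tree_probs = [0.9, 0.4, 0.4]
tree_labels = [1 if p > 0.5 else 0 for p in tree_probs]  # [1, 0, 0]

print(hard_vote(tree_labels))  # 0 (two of three trees say normal)
print(soft_vote(tree_probs))   # 1 (mean probability is about 0.57)
```

With fully grown trees, leaves are usually pure, so each tree's probability is close to 0 or 1 and the two schemes tend to agree.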

**Key parameters:**
- `n_estimators`: Number of trees (500)
- `max_features`: Features per split (√150 ≈ 12)
- `random_state`: For reproducibility (123)
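
A quick check of the `max_features` arithmetic, assuming the square root is truncated to an integer (which, to my understanding, is how scikit-learn resolves `'sqrt'`):

```python
import math

n_features = 150  # number of features reported when loading the data

# Truncate the square root to an integer, with a floor of one feature
features_per_split = max(1, int(math.sqrt(n_features)))

print(f"sqrt({n_features}) = {math.sqrt(n_features):.2f}")  # sqrt(150) = 12.25
print(f"Features per split: {features_per_split}")          # Features per split: 12
```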

**Why Random Forest?**
- Baseline supervised learning method
- High specificity (good at reducing false alarms)
- Used in the R implementation

In [2]:
# Training Random Forest Model

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
import time 

print("Training Random Forest model...")

# Create the model
# Parameters match the R model

rf_model = RandomForestClassifier(
    n_estimators = 500,       # 500 trees
    max_features = 'sqrt',    # Square root of 150 ≈ 12 features per split
    random_state = random_seed,
    n_jobs = -1,              # Use all CPU cores
    verbose = 1               # Show progress
)

# Train the model 
print("Training Random Forest with 500 trees....")
start_time = time.time()

rf_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training the model completed in {training_time:.2f} seconds")

Training Random Forest model...
Training Random Forest with 500 trees....


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   11.4s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:   27.7s


Training the model completed in 33.16 seconds


[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   33.0s finished
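
The training cell measures elapsed time with `time.time()`. As an alternative sketch, `time.perf_counter()` is a monotonic, high-resolution clock intended for measuring durations. The helper name `time_call` is hypothetical and not part of the original notebook.

```python
import time


def time_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Toy usage: time a cheap built-in call instead of model training
total, seconds = time_call(sum, range(1_000_000))
print(f"sum = {total}, completed in {seconds:.4f} seconds")
```

In the notebook, the same helper could wrap the training call, for example `time_call(rf_model.fit, X_train, y_train)`.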


In [3]:
# Making predictions 

# Predict on test set
y_pred_rf = rf_model.predict(X_test)

# Count predictions per class
pred_counts = pd.Series(y_pred_rf).value_counts().sort_index()

print(f"Normal: {pred_counts[0]:,}")
print(f"Abnormal: {pred_counts[1]:,}")


[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 168 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 418 tasks      | elapsed:    0.1s
[Parallel(n_jobs=16)]: Done 500 out of 500 | elapsed:    0.1s finished


Normal: 40,291
Abnormal: 4,103
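
Indexing `pred_counts[1]` raises a `KeyError` if the model never predicts that class. A minimal sketch of a more defensive count follows, using `collections.Counter` with a default of zero; the helper name `count_predictions` and the toy predictions are hypothetical. The last lines reuse the counts printed above to compute the share of samples flagged as abnormal, which can be compared with the contamination rate of 0.1066 reported when the data was loaded.

```python
from collections import Counter


def count_predictions(y_pred, labels=(0, 1)):
    """Count predictions per label, returning 0 for labels never predicted."""
    counts = Counter(y_pred)
    return {label: counts.get(label, 0) for label in labels}


# Toy predictions where class 1 never appears
print(count_predictions([0, 0, 0]))  # {0: 3, 1: 0}

# Share of the test set flagged as abnormal, from the counts printed above
n_normal, n_abnormal = 40291, 4103
predicted_abnormal_share = n_abnormal / (n_normal + n_abnormal)
print(f"Predicted abnormal share: {predicted_abnormal_share:.4f}")  # 0.0924
```

For a NumPy array such as `y_pred_rf`, passing `y_pred_rf.tolist()` keeps the keys as plain Python integers.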


In [10]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred_rf)

# Confusion matrix (rows = actual, columns = predicted; labels in sorted order 0, 1)
cm = confusion_matrix(y_test, y_pred_rf)
tn, fp, fn, tp = cm.ravel()

# Sensitivity and specificity
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print("Confusion Matrix")
print(f"Actual Normal      {tn:6d} {fp:6d}")
print(f"       Abnormal    {fn:6d} {tp:6d}")

print(f"Sensitivity: {sensitivity*100:.2f}%  (catches {sensitivity*100:.1f}% of abnormal)")
print(f"Specificity: {specificity*100:.2f}%  (catches {specificity*100:.1f}% of normal)")

print("R results: Accuracy=84.22%, Sensitivity=87.91%, Specificity=54.83%")
print(f"Python results: Accuracy={accuracy*100:.2f}%, Sensitivity={sensitivity*100:.2f}%, Specificity={specificity*100:.2f}%")



Confusion Matrix
Actual Normal       39563     98
       Abnormal       728   4005
Sensitivity: 84.62%  (catches 84.6% of abnormal)
Specificity: 99.75%  (catches 99.8% of normal)
R results: Accuracy=84.22%, Sensitivity=87.91%, Specificity=54.83%
Python results: Accuracy=98.14%, Sensitivity=84.62%, Specificity=99.75%
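
The metrics above can be reproduced by hand from the four confusion-matrix counts. The sketch below does that, and also shows what happens when the other class is treated as positive: sensitivity and specificity swap. When comparing with results from another tool such as R, it is worth confirming which class that tool treats as positive; whether this applies to the R figures quoted here is an assumption the notebook does not confirm. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall of the positive class
        "specificity": tn / (tn + fp),  # recall of the negative class
    }


# Counts from the confusion matrix above (positive class = abnormal)
tn, fp, fn, tp = 39563, 98, 728, 4005
abnormal_positive = binary_metrics(tn, fp, fn, tp)

# Same predictions with 'normal' treated as the positive class:
# true positives and true negatives swap, as do the two error types
normal_positive = binary_metrics(tn=tp, fp=fn, fn=fp, tp=tn)

for name, metrics in [("abnormal", abnormal_positive), ("normal", normal_positive)]:
    print(
        f"Positive = {name}: "
        f"Accuracy={metrics['accuracy'] * 100:.2f}%, "
        f"Sensitivity={metrics['sensitivity'] * 100:.2f}%, "
        f"Specificity={metrics['specificity'] * 100:.2f}%"
    )
# Positive = abnormal: Accuracy=98.14%, Sensitivity=84.62%, Specificity=99.75%
# Positive = normal: Accuracy=98.14%, Sensitivity=99.75%, Specificity=84.62%
```

Accuracy is unchanged by the choice of positive class, so it is the safest figure to compare directly across implementations.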
