# Isolation Forest Model Training for Anomaly Detection

**Objective:**  
Train an Isolation Forest model on the engineered equipment anomaly dataset and evaluate its performance. The model will be saved along with evaluation metrics and visualizations for production use.

---

## Table of Contents
1. [Configuration & Setup](#configuration)
2. [Data Loading & Preprocessing](#data)
3. [Model Training](#training)
4. [Model Evaluation](#evaluation)
5. [Visualization of Results](#visualization)
6. [Performance Tuning & Future Improvements](#tuning)
7. [Conclusion](#conclusion)


In [None]:
# <a id="configuration"></a>
# ## 1. Configuration & Setup

import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, RocCurveDisplay
from sklearn.preprocessing import StandardScaler
from joblib import dump

# Configuration Parameters
SEED = 42
DATA_PATH = r"C:\Users\Ken Ira Talingting\Desktop\anomaly-detection-project\data\processed\equipment_anomaly_data_feature_engineered.csv"
RESULTS_DIR = r"C:\Users\Ken Ira Talingting\Desktop\anomaly-detection-project\data\processed_results"
MODEL_PATH = os.path.join(RESULTS_DIR, "isolation_forest_model.joblib")
METRICS_PATH = os.path.join(RESULTS_DIR, "evaluation_metrics.json")

# Ensure results directory exists
os.makedirs(RESULTS_DIR, exist_ok=True)

# Set random seed for reproducibility
np.random.seed(SEED)


# <a id="data"></a>
## 2. Data Loading & Preprocessing

In this section, we load the dataset, separate the features from the `faulty` target, split the data into stratified training and test sets, and standardize the features using statistics from the training split only.
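
As a rough illustration of what stratification does, the sketch below splits a small synthetic dataset and compares the anomaly rate in each split. The data and the names `X_demo` and `y_demo` are hypothetical stand-ins, not the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical data: 1,000 rows with a 10% "faulty" rate.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:100] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify=y_demo, both splits keep roughly the original anomaly rate.
train_rate = float(y_tr.mean())
test_rate = float(y_te.mean())
print(f"train anomaly rate: {train_rate:.3f}, test anomaly rate: {test_rate:.3f}")
```

Without `stratify`, a rare class can end up over- or under-represented in a small test set, which makes the evaluation metrics less stable.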


In [None]:
def load_and_preprocess_data():
    """
    Load and preprocess the dataset.
    
    Returns:
        X_train, X_test, y_train, y_test: Split and preprocessed data.
    """
    df = pd.read_csv(DATA_PATH)
    
    # Separate features and target ('faulty' column)
    X = df.drop(columns=['faulty'])
    y = df['faulty'].astype(int)
    
    # Split the data with stratification to maintain class distribution
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=SEED
    )
    
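    # Standardize features, fitting the scaler on the training split only to avoid leakage.
    # Isolation Forest is insensitive to per-feature linear scaling, so this step is optional.
    # The original index is kept so the features stay aligned with y_train / y_test.
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)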
    
    return X_train_scaled, X_test_scaled, y_train, y_test

# Load data
X_train, X_test, y_train, y_test = load_and_preprocess_data()
print("Data loaded and preprocessed:")
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")


# <a id="training"></a>
## 3. Model Training

We train an Isolation Forest model. The `contamination` parameter is set to the fraction of faulty samples in the training labels, so the model's decision threshold flags roughly that share of the training data as anomalous.
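
The effect of `contamination` can be checked on a toy example. The sketch below is a minimal illustration on synthetic data (the arrays and the `_demo` names are hypothetical): it sets `contamination` to the labeled anomaly rate and then measures what share of the training data the fitted model flags.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical data: 950 "normal" points plus 50 shifted outliers (5% anomaly rate).
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(950, 4)),
    rng.normal(6.0, 1.0, size=(50, 4)),
])
y_demo = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]

# contamination must lie in (0, 0.5]; here it is the labeled anomaly rate (0.05).
contamination_demo = float(y_demo.mean())
forest = IsolationForest(contamination=contamination_demo, random_state=0).fit(X_demo)

# predict() returns -1 for anomalies; the flagged share of the training data
# should be close to the contamination value.
flagged_rate = float(np.mean(forest.predict(X_demo) == -1))
print(f"contamination={contamination_demo:.3f}, flagged on training data={flagged_rate:.3f}")
```

Note that `contamination` only moves the threshold applied to the anomaly scores. It does not change how the trees are built or how samples are ranked.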


In [None]:
def train_model(X_train, contamination):
    """
    Train and return the Isolation Forest model.
    
    Parameters:
        X_train (DataFrame): Training features.
        contamination (float): Estimated anomaly rate.
        
    Returns:
        model: Trained IsolationForest model.
    """
    model = IsolationForest(
        n_estimators=200,
        max_samples='auto',
        contamination=contamination,
        max_features=1.0,
        random_state=SEED,
        n_jobs=-1,
        verbose=1
    )
    model.fit(X_train)
    return model

# Estimate contamination as the fraction of faulty samples in the training labels.
# IsolationForest requires this value to lie in (0, 0.5].
contamination_rate = float(y_train.mean())
print(f"Estimated contamination rate: {contamination_rate:.4f}")

# Train the model
model = train_model(X_train, contamination_rate)

# Save the model for production use
dump(model, MODEL_PATH)
print(f"Model saved to {MODEL_PATH}")
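
Only the Isolation Forest is saved above. The `StandardScaler` fitted inside `load_and_preprocess_data` is not persisted, so inference code would have to reproduce the scaling separately. One possible approach, sketched here on synthetic data, is to bundle both steps in a scikit-learn `Pipeline` and save that single object. The step names, the temporary path, and `X_demo` are hypothetical and not part of the notebook.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))  # hypothetical stand-in for the training features

# Bundle scaling and the model so one artifact reproduces the full transform.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", IsolationForest(n_estimators=50, random_state=0)),
])
pipeline.fit(X_demo)

# Round-trip through a temporary file (a hypothetical path, not MODEL_PATH).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "pipeline_demo.joblib")
    dump(pipeline, demo_path)
    restored = load(demo_path)

# The restored pipeline should reproduce the original predictions on raw features.
same_predictions = bool(np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo)))
print(f"Restored pipeline reproduces predictions: {same_predictions}")
```

With this layout, callers pass unscaled features to `restored.predict(...)` and the pipeline applies the training-time scaling itself.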


# <a id="evaluation"></a>
## 4. Model Evaluation

Here, we evaluate the model using:
- Binary predictions (0 = normal, 1 = anomaly)
- Classification metrics (report and confusion matrix)
- ROC AUC calculation

We also compute anomaly scores from the decision function.
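
The label and score conventions described above can be sketched on synthetic data. This is an illustrative example only: the arrays are hypothetical, and the model is scored on the data it was fitted on purely to keep the sketch short.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Hypothetical data: 475 normal points and 25 shifted outliers.
X_demo = np.vstack([rng.normal(0, 1, size=(475, 2)), rng.normal(5, 1, size=(25, 2))])
y_demo = np.r_[np.zeros(475, dtype=int), np.ones(25, dtype=int)]

forest = IsolationForest(contamination=0.05, random_state=1).fit(X_demo)

raw_pred = forest.predict(X_demo)                    # +1 = inlier, -1 = outlier
pred_binary = np.where(raw_pred == -1, 1, 0)         # 1 = anomaly, 0 = normal
anomaly_scores = -forest.decision_function(X_demo)   # higher = more anomalous

# predict() thresholds decision_function at 0, so a positive negated score
# corresponds exactly to a predicted anomaly.
consistent = bool(np.all((anomaly_scores > 0) == (pred_binary == 1)))
demo_auc = roc_auc_score(y_demo, anomaly_scores)
print(f"labels consistent with scores: {consistent}, ROC AUC: {demo_auc:.3f}")
```

Negating the decision function matters for ROC AUC: the metric assumes that higher scores indicate the positive class, which here is the anomaly.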


In [None]:
def evaluate_model(model, X_test, y_test):
    """
    Evaluate the model and compute evaluation metrics.
    
    Returns:
        metrics (dict): Evaluation metrics including classification report, confusion matrix, and ROC AUC.
    """
    # Generate predictions and convert -1/1 to 1/0 format
    test_pred = model.predict(X_test)
    test_pred_binary = np.where(test_pred == -1, 1, 0)
    
    # Calculate anomaly scores (negative decision function; higher means more anomalous)
    scores = -model.decision_function(X_test)
    
    # Compute metrics
    report = classification_report(y_test, test_pred_binary, output_dict=True)
    cm = confusion_matrix(y_test, test_pred_binary)
    
    # Compute ROC AUC
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    roc_auc = auc(fpr, tpr)
    
    return {
        'classification_report': report,
        'confusion_matrix': cm.tolist(),
        'roc_auc': roc_auc,
        'fpr': fpr.tolist(),
        'tpr': tpr.tolist(),
        'scores': scores  # Return scores for visualization
    }

metrics = evaluate_model(model, X_test, y_test)
print("Evaluation Metrics:")
print(classification_report(y_test, np.where(model.predict(X_test) == -1, 1, 0)))
print(f"ROC AUC Score: {metrics['roc_auc']:.4f}")

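# Save evaluation metrics to JSON.
# The raw score array is excluded because NumPy arrays are not JSON serializable.
serializable_metrics = {k: v for k, v in metrics.items() if k != 'scores'}
with open(METRICS_PATH, 'w') as f:
    json.dump(serializable_metrics, f, indent=2)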
print(f"Evaluation metrics saved to {METRICS_PATH}")


# <a id="visualization"></a>
## 5. Visualization of Results

We generate visualizations for:
- Confusion Matrix
- ROC Curve
- Anomaly Score Distribution

These plots show where the model makes errors and how well the anomaly scores separate normal from faulty samples.


In [None]:
def visualize_results(metrics, y_test, scores):
    """
    Generate and save evaluation visualizations.
    """
    # Confusion Matrix Visualization
    plt.figure(figsize=(8, 6))
    sns.heatmap(metrics['confusion_matrix'], annot=True, fmt='d',
                cmap='Blues', cbar=False)
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.savefig(os.path.join(RESULTS_DIR, 'confusion_matrix.png'))
    plt.close()
    
    # ROC Curve Visualization
    RocCurveDisplay(fpr=metrics['fpr'], tpr=metrics['tpr'], roc_auc=metrics['roc_auc']).plot()
    plt.title('ROC Curve')
    plt.savefig(os.path.join(RESULTS_DIR, 'roc_curve.png'))
    plt.close()
    
    # Anomaly Score Distribution Visualization
    plt.figure(figsize=(10, 6))
    # Separate scores for normal and anomalous samples based on true labels
    sns.kdeplot(scores[y_test == 0], label='Normal', fill=True)
    sns.kdeplot(scores[y_test == 1], label='Anomaly', fill=True)
    plt.title('Anomaly Score Distribution')
    plt.xlabel('Anomaly Score')
    plt.legend()
    plt.savefig(os.path.join(RESULTS_DIR, 'score_distribution.png'))
    plt.close()

# Visualize and save results
visualize_results(metrics, y_test.values, metrics['scores'])
print("Visualizations saved to the results directory.")


# <a id="tuning"></a>
## 6. Performance Tuning & Future Improvements

**Performance Tuning Suggestions:**
- **Hyperparameter Optimization:**  
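  Consider using GridSearchCV with a custom, label-based scoring function (Isolation Forest has no built-in `score` method) for parameters like:
  - `n_estimators`: [100, 200, 500]
  - `max_samples`: [0.5, 0.8, 'auto']
  - `max_features`: [0.5, 0.8, 1.0]
  - `contamination`: [calculated_rate - 0.01, calculated_rate, calculated_rate + 0.01]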
- **Feature Analysis:**  
  Use SHAP or feature importance methods to understand which features are most critical.
- **Ensemble Methods:**  
  Consider combining Isolation Forest with other algorithms (e.g., LOF, OCSVM) for a voting ensemble.
- **Threshold Optimization:**  
  Optimize the decision threshold based on business requirements, balancing precision and recall.
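
The hyperparameter search suggested above could look roughly like the following. This is a minimal sketch on synthetic data, not a tuned configuration: the scorer `anomaly_auc`, the reduced grid, and the `_demo` arrays are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def anomaly_auc(estimator, X, y):
    """Score a fitted detector by the ROC AUC of its negated decision function."""
    return roc_auc_score(y, -estimator.decision_function(X))


rng = np.random.default_rng(7)
# Hypothetical data: 570 normal points and 30 shifted outliers.
X_demo = np.vstack([rng.normal(0, 1, size=(570, 3)), rng.normal(4, 1, size=(30, 3))])
y_demo = np.r_[np.zeros(570, dtype=int), np.ones(30, dtype=int)]

# A deliberately small grid so the sketch runs quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_samples": [0.5, "auto"],
}
search = GridSearchCV(
    IsolationForest(random_state=7),
    param_grid=param_grid,
    scoring=anomaly_auc,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=7),
)
# y is ignored by IsolationForest.fit but is passed through to the scorer.
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

ROC AUC depends only on how samples are ranked, and `contamination` only shifts the decision threshold, so this scorer cannot distinguish between `contamination` values. Tuning that parameter would need a threshold-dependent metric such as F1 computed on the binary predictions.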

**Future Improvements:**
- Implement real-time monitoring and model drift detection.
- Incorporate contextual or temporal features to enhance detection.
- Explore cost-sensitive learning to minimize false positives/negatives.
- Package the solution as an API endpoint for real-time inference.

---

# <a id="conclusion"></a>
## 7. Conclusion

In this notebook, we trained an Isolation Forest model for anomaly detection on an equipment dataset, covering data preprocessing, model training, evaluation, and result visualization. The trained model, the evaluation metrics, and the plots have been saved to the results directory for future deployment.

**Next Steps:**  
- Perform hyperparameter tuning to further optimize model performance.  
- Extend the pipeline with automated monitoring and periodic retraining for production.

---

*End of Notebook*
