<a href="https://colab.research.google.com/github/princetech89/Anomaly-Detection-in-Energy-Plant-Sensors/blob/main/Anomaly_Detection_in_Energy_Plant_Sensor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

ana_verse_2_0_h_path = kagglehub.competition_download('ana-verse-2-0-h')
princechourasiya_competition_data_path = kagglehub.dataset_download('princechourasiya/competition-data')

print('Data source import complete.')


**1. Problem Statement & Objective**

The objective is to predict whether a given set of sensor readings corresponds to an anomaly (target = 1) or normal operation (target = 0) in an energy manufacturing plant.

This is a binary classification problem using time-series-like tabular data.

**Evaluation Focus**

Models are evaluated using:

* Accuracy
* Precision
* Recall
* F1 Score (primary, due to class imbalance)

**2. Import Libraries**

`!pip install catboost` runs pip from inside the notebook to download and install the CatBoost library, which is imported in the next cell.

In [None]:
!pip install catboost

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings("ignore")

**3. Load Dataset**

In [None]:
train = pd.read_csv("/kaggle/input/competition-data/train.csv")
test = pd.read_csv("/kaggle/input/competition-data/test.csv")

train['Date'] = pd.to_datetime(train['Date'])
test['Date'] = pd.to_datetime(test['Date'])

train.head()

In [None]:
train.info()
train.describe()

**Data Size:** The dataset contains 109,049 entries and 7 columns.

**Missing Values:** One missing value is present in columns X2, X3, X4, X5, and target.

**Data Types:** The 'Date' column is correctly parsed as datetime, while features X1 through X5 and the target are float64.

**Class Imbalance:** The target variable shows a significant class imbalance, with only about 2.1% of the entries representing anomalies (target = 1).

**Potential Outliers/Errors:** Columns X3 and X4 exhibit extremely large maximum values and standard deviations, suggesting the presence of severe outliers or potential data entry errors that warrant further investigation.

**4. Target Distribution (Class Imbalance)**

The dataset is highly imbalanced with anomalies being rare.
Therefore F1-score and recall are emphasized, and class weighting is applied in the models that support it.

In [None]:
train['target'].value_counts()

train['target'].value_counts(normalize=True).plot(kind='bar')
plt.title("Target Class Distribution")
plt.show()

**Class Imbalance:** The plot shows a tall bar for 0.0 and a very short bar for 1.0, confirming that normal operations vastly outnumber anomalies.

**Percentage of Anomalies:** The height of the 1.0 bar gives the anomaly share directly (approximately 2.1%, as noted earlier). With so few positives, a model can score high accuracy while missing most anomalies, which is why F1-score and recall are more appropriate metrics here.
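
To make the point about misleading accuracy concrete, here is a minimal sketch on made-up labels (the arrays `y_true_demo` and `y_pred_demo` are hypothetical and not taken from the competition data). It scores a "model" that always predicts normal operation at roughly the same 2.1% anomaly rate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 1,000 readings, 21 of them anomalies (2.1%).
y_true_demo = np.zeros(1000, dtype=int)
y_true_demo[:21] = 1

# A useless "model" that always predicts normal operation (class 0).
y_pred_demo = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true_demo, y_pred_demo)
rec = recall_score(y_true_demo, y_pred_demo, zero_division=0)
f1 = f1_score(y_true_demo, y_pred_demo, zero_division=0)

print(f"accuracy={acc:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Accuracy comes out at 0.979 even though not a single anomaly is caught, while recall and F1 are both 0.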

**5. Missing Values & Outliers**


*  Missing Values

In [None]:
train.isnull().sum()

**Date and X1:** Both have 0 missing values, indicating they are complete.

**X2, X3, X4, X5, and target:** Each of these columns has exactly 1 missing value. This most likely corresponds to a single incomplete row containing NaN (Not a Number) entries, which needs to be addressed during data preprocessing.


In [None]:
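train = train.dropna(subset=['target'])  # do not impute a missing label
train.fillna(train.median(numeric_only=True), inplace=True)  # median-fill sensor gaps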

*   Outlier Detection

Visualize the sensors:

In [None]:
sns.boxplot(data=train[['X1','X2','X3','X4','X5']])
plt.show()

**The Box:** The central box in each plot represents the interquartile range (IQR), which spans from the 25th percentile (Q1) to the 75th percentile (Q3) of the data. The line inside the box indicates the median (50th percentile).

**The Whiskers:** The lines extending from the box (whiskers) typically show the range of data within 1.5 times the IQR from the Q1 and Q3. Data points outside these whiskers are considered potential outliers.

**The Dots:** Individual points beyond the whiskers are plotted separately and flagged as potential outliers.
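
The whisker rule described above can be sketched in a few lines. This is an illustration on a made-up series (`s_demo` is hypothetical), assuming the common 1.5 × IQR convention; it is not a step of the notebook's pipeline:

```python
import pandas as pd

# Hypothetical sensor readings with two obvious extremes.
s_demo = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 60, -40], dtype=float)

q1, q3 = s_demo.quantile(0.25), s_demo.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences would be drawn as individual dots on a boxplot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = s_demo[(s_demo < lower_fence) | (s_demo > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```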

Optional capping:

In [None]:
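for col in ['X1','X2','X3','X4','X5']:
    # Bounds come from the training data only and are reused for the test set.
    q1 = train[col].quantile(0.01)
    q99 = train[col].quantile(0.99)
    train[col] = train[col].clip(q1, q99)
    test[col] = test[col].clip(q1, q99)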

**This technique is called outlier capping (or Winsorization). We are using it for the following reasons:**

**Mitigating Extreme Outliers:** We previously observed that columns X3 and X4 had extremely large maximum values and standard deviations, indicating severe outliers. This clip() operation directly addresses this by setting a floor and ceiling for the values. Instead of removing these extreme data points (which might lead to loss of information), we are 'pulling them in' to a more reasonable range defined by the 1st and 99th percentiles.

**Model Robustness:** Many machine learning models are sensitive to outliers. Capping extreme values limits the influence that a few anomalous data points can have on the fitted model.

**Preserving Data Quantity:** Unlike simply removing rows with outliers, capping allows us to keep all our data points, which can be beneficial, especially with imbalanced datasets where every data point might be valuable.
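
As a small illustration of the capping idea, the sketch below applies the same 1st/99th percentile clip to a synthetic series (`s_demo` is made up, with one injected extreme value) and checks that no rows are lost:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical sensor: roughly normal readings plus one extreme value.
s_demo = pd.Series(rng.normal(50, 5, size=1000))
s_demo.iloc[0] = 1e6

lo, hi = s_demo.quantile(0.01), s_demo.quantile(0.99)
capped = s_demo.clip(lo, hi)

print("rows before/after:", len(s_demo), len(capped))
print("max before/after:", s_demo.max(), capped.max())
```

The row count is unchanged, while the maximum drops from 1e6 to the 99th percentile of the series.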

**6. Feature Engineering**

* Time-based Features

In [None]:
def add_time_features(df):
    # Extract hour, day, month, and weekday from the 'Date' column.
    # These features can help capture temporal patterns or cyclical trends
    # that might influence anomaly occurrences. For example, certain hours
    # of the day or days of the week might be more prone to anomalies.
    df['hour'] = df['Date'].dt.hour
    df['day'] = df['Date'].dt.day
    df['month'] = df['Date'].dt.month
    df['weekday'] = df['Date'].dt.weekday
    return df

train = add_time_features(train)
test = add_time_features(test)

In [None]:
display(train.head())

*  Sensor Interactions

Sensor interaction features aim to capture relationships between sensor readings that may indicate an anomaly. For example, a sudden change in the ratio or difference between two sensors can be a stronger signal than a change in either reading alone.

In [None]:
# Create a ratio feature: X1 divided by X2.
# Ratios can reveal proportional relationships between sensors that might indicate abnormal conditions.
# A small constant (1e-5) is added to the denominator to prevent division by zero errors.
train['X1_X2_ratio'] = train['X1'] / (train['X2'] + 1e-5)

# Create a difference feature: X3 minus X4.
# Differences can highlight discrepancies or deviations between sensor readings.
# These might signal a malfunction or an unusual operational state.
train['X3_minus_X4'] = train['X3'] - train['X4']

# Apply the same feature engineering to the test dataset to maintain consistency.
test['X1_X2_ratio'] = test['X1'] / (test['X2'] + 1e-5)
test['X3_minus_X4'] = test['X3'] - test['X4']

*  Drop Date

**Reasoning:** We previously extracted new, more directly useful time-based features (like 'hour', 'day', 'month', and 'weekday') from the 'Date' column. Once these features are created, the original 'Date' column itself, being a datetime object, is typically not directly fed into most machine learning models. Removing it helps to avoid potential issues with model interpretation, reduces the dimensionality of our datasets, and simplifies our feature set.

In [None]:
train.drop(columns=['Date'], inplace=True)
test.drop(columns=['Date'], inplace=True)

**7. Advanced Temporal Features**

In [None]:

# 'Date' has already been dropped; the time-based features (hour, day,
# month, weekday) were created in section 6.

# Rolling statistics (window of 5 rows) for every column whose name starts with "X".
# Note: this also picks up the engineered X1_X2_ratio and X3_minus_X4 columns.
sensor_cols = [c for c in train.columns if c.startswith("X")]

for col in sensor_cols:
    train[f"{col}_rolling_mean"] = train[col].rolling(5).mean()
    train[f"{col}_rolling_std"] = train[col].rolling(5).std()

    test[f"{col}_rolling_mean"] = test[col].rolling(5).mean()
    test[f"{col}_rolling_std"] = test[col].rolling(5).std()

train.fillna(0, inplace=True)
test.fillna(0, inplace=True)

**INTERACTION FEATURES**

In [None]:
# ================================
# SENSOR INTERACTIONS
# ================================

for i in range(len(sensor_cols)):
    for j in range(i+1, min(i+4, len(sensor_cols))):
        c1, c2 = sensor_cols[i], sensor_cols[j]

        train[f"{c1}_ratio_{c2}"] = train[c1] / (train[c2] + 1e-6)
        test[f"{c1}_ratio_{c2}"] = test[c1] / (test[c2] + 1e-6)

**8. Correlation Analysis**

**Identify relationships between features:** See which features move together (positively or negatively correlated), which can be useful for understanding the underlying processes or for identifying multicollinearity.

**Understand feature importance with respect to the target:** Discover which features have a strong relationship with the target variable. This can provide insights into which sensor readings are most indicative of an anomaly.

**Guide further feature selection:** If two features are highly correlated, one might be redundant, and we might consider dropping one to simplify the model and prevent issues like multicollinearity.

**Inform model building:** Understanding these relationships can help in choosing appropriate models or interpreting model results.

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(train.corr(), cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()

The Feature Correlation Matrix is a visual tool (heatmap) that shows the linear relationship between every pair of features in your dataset. It tells you which features tend to move together (positive correlation), move in opposite directions (negative correlation), or have no linear relationship (near zero correlation). This helps in understanding data structure, identifying important features, and detecting redundancy.
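
One caveat is worth a quick sketch: the default `DataFrame.corr()` computes Pearson correlation, which only measures linear association. In the toy frame below (`df_demo` and its columns are hypothetical), `quadratic` is fully determined by `x`, yet its correlation with `x` is essentially zero:

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)
df_demo = pd.DataFrame({
    "x": x,
    "linear": 2 * x + 1,   # perfectly linear in x
    "quadratic": x ** 2,   # depends on x, but not linearly
})

corr_demo = df_demo.corr()  # Pearson correlation by default
print(corr_demo.round(3))
```

A near-zero cell in the heatmap therefore does not rule out a nonlinear relationship, which tree-based models may still exploit.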

**Which sensors correlate with anomalies?**

To identify which sensors correlate with anomalies, the next cell lists each feature's correlation with the `target` variable, sorted from most positive to most negative. The values show the strength and direction of each linear relationship.

In [None]:
correlation_with_target = train.corr()['target'].sort_values(ascending=False)
display(correlation_with_target)

**9. Train / Validation Split**

In [None]:
X = train.drop(columns=['target'])
y = train['target']

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

print(X_train.shape, X_val.shape)

Holding out a stratified validation set serves two purposes:

1.  **Detect Overfitting:** Scoring on held-out data reveals whether the model has learned general patterns or merely memorized the training set, which would lead to poor performance on new, unseen data.
2.  **Evaluate Generalization:** It allows us to objectively assess how well the trained model predicts anomalies on data it has never seen, giving a more realistic measure of real-world performance.
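
The `stratify=y` argument matters with a class this rare. The sketch below uses made-up labels (`X_demo` and `y_demo` are hypothetical) at a 2.1% positive rate and checks that both splits keep that rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10,000 rows, 210 positives (2.1%).
y_demo = np.zeros(10_000, dtype=int)
y_demo[:210] = 1
X_demo = np.arange(10_000).reshape(-1, 1)

X_tr_demo, X_va_demo, y_tr_demo, y_va_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("overall:", y_demo.mean())
print("train:  ", y_tr_demo.mean())
print("val:    ", y_va_demo.mean())
```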

**10. LightGBM Model**

In [None]:
lgbm = LGBMClassifier(
    n_estimators=1200,
    learning_rate=0.03,
    num_leaves=128,
    max_depth=-1,
    subsample=0.85,
    colsample_bytree=0.85,
    class_weight="balanced",
    random_state=42
)

lgbm.fit(X_train, y_train)

This LightGBM output shows the model trained on a large, imbalanced dataset (many more normal operations than anomalies) using 39 features. It utilized multi-threading for efficiency and confirms the hyperparameters set for the LGBMClassifier, including class_weight='balanced' to handle the imbalance.
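
For intuition about what `class_weight="balanced"` does, the following sketch computes balanced weights on made-up labels (`y_demo` is hypothetical) using scikit-learn's `compute_class_weight`, and compares them with the formula `n_samples / (n_classes * np.bincount(y))`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 979 normal readings and 21 anomalies.
y_demo = np.array([0] * 979 + [1] * 21)
classes_demo = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes_demo, y=y_demo)

# The same weights computed by hand.
manual = len(y_demo) / (len(classes_demo) * np.bincount(y_demo))

print(dict(zip(classes_demo.tolist(), weights.round(3).tolist())))
```

Each anomaly ends up weighted far more heavily than each normal reading, so errors on the rare class cost more during training.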

**11. Feature Scaling**

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

Feature scaling ensures that no single sensor's measurement range dominates distance- or gradient-based models such as SVM, KNN, and Logistic Regression, which matters here because some raw sensor readings had very large values. The scaler is fit on the training split only and then applied to the validation split, so no validation statistics leak into training.
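
A minimal sketch of the effect, using two synthetic features on very different scales (`train_demo` and `val_demo` are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: one around 5, one around 1,000,000.
train_demo = np.column_stack([rng.normal(5, 1, 500), rng.normal(1e6, 1e5, 500)])
val_demo = np.column_stack([rng.normal(5, 1, 100), rng.normal(1e6, 1e5, 100)])

# Fit on the training rows only, then reuse the same statistics for validation.
sc_demo = StandardScaler().fit(train_demo)
train_scaled_demo = sc_demo.transform(train_demo)
val_scaled_demo = sc_demo.transform(val_demo)

print("train mean:", train_scaled_demo.mean(axis=0).round(6))
print("train std: ", train_scaled_demo.std(axis=0).round(6))
print("val mean:  ", val_scaled_demo.mean(axis=0).round(3))
```

After scaling, both training columns have mean 0 and standard deviation 1. The validation columns are close to, but not exactly, that, because they are transformed with the training statistics.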



**12. Classical Models**

* Logistic Regression

In [None]:
lr = LogisticRegression(class_weight='balanced', max_iter=500)
lr.fit(X_train_scaled, y_train)

pred = lr.predict(X_val_scaled)

print(classification_report(y_val, pred))

For anomalies (class 1), the Logistic Regression model reaches a precision of 0.14, a recall of 0.96, and an F1-score of 0.24, with an overall accuracy of 0.95. It catches almost every anomaly but raises many false alarms.

* KNN

In [None]:
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train_scaled, y_train)

pred = knn.predict(X_val_scaled)
print(classification_report(y_val, pred))

The KNN model shows strong overall accuracy (0.99). For anomalies, it has 75% precision, 40% recall, and an F1-score of 0.52.

* Decision Tree

In [None]:
dt = DecisionTreeClassifier(
    max_depth=8,
    class_weight='balanced',
    random_state=42
)

dt.fit(X_train, y_train)
pred = dt.predict(X_val)

print(classification_report(y_val, pred))


The Decision Tree model accurately identifies most normal operations but has very low precision for anomalies, leading to many false positives, despite a high recall for anomalies. Its F1-score for anomalies is 0.25.

* SVM

In [None]:
svm = SVC(kernel='rbf', class_weight='balanced')
# Slice features and labels by position so the two stay aligned.
svm.fit(X_train_scaled[:100000], y_train.iloc[:100000])

pred = svm.predict(X_val_scaled)
print(classification_report(y_val, pred))

The SVM model is very good at identifying normal operations. For anomalies, it has a good recall (0.79), meaning it catches many, but its precision is low (0.25), resulting in a moderate F1-score of 0.38. It still creates a fair number of false alarms.

**13. Advanced Models**

* Random Forest

In [None]:
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=12,
    class_weight='balanced',
    n_jobs=-1,
    random_state=42
)

rf.fit(X_train, y_train)
pred = rf.predict(X_val)
print(classification_report(y_val, pred))

* XGBoost

In [None]:
xgb = XGBClassifier(
    n_estimators=800,
    max_depth=7,
    learning_rate=0.03,
    subsample=0.85,
    colsample_bytree=0.85,
    eval_metric="logloss",
    scale_pos_weight=y_train.value_counts()[0] / y_train.value_counts()[1],
    random_state=42
)

xgb.fit(X_train, y_train)

* CatBoost

CatBoost is a gradient-boosting library with native support for categorical features; it often gives strong results on tabular data with little tuning.

In [None]:
cat = CatBoostClassifier(
    iterations=800,
    depth=8,
    learning_rate=0.04,
    loss_function="Logloss",
    auto_class_weights="Balanced",
    verbose=0,
    random_state=42
)

cat.fit(X_train, y_train)

**14. ENSEMBLE**

**Threshold Optimization for F1 Score**

In [None]:
from sklearn.metrics import f1_score
import numpy as np

models = {
    "lgbm": lgbm,
    "xgb": xgb,
    "cat": cat
}

val_probs = {}

for name, model in models.items():
    val_probs[name] = model.predict_proba(X_val)[:,1]

avg_probs = np.mean(list(val_probs.values()), axis=0)

thresholds = np.linspace(0.05, 0.95, 60)

best_f1 = 0
best_thresh = 0.5

for t in thresholds:
    preds = (avg_probs >= t).astype(int)
    f1 = f1_score(y_val, preds)

    if f1 > best_f1:
        best_f1 = f1
        best_thresh = t

print("Best F1:", best_f1)
print("Best threshold:", best_thresh)


In [None]:
cat_probs = cat.predict_proba(X_val)[:, 1]
xgb_probs = xgb.predict_proba(X_val)[:, 1]
lgb_probs = lgbm.predict_proba(X_val)[:, 1]

avg_probs = (cat_probs + xgb_probs + lgb_probs) / 3

ensemble_preds = (avg_probs >= best_thresh).astype(int)

print(classification_report(y_val, ensemble_preds))

**15. Hyperparameter Tuning**

In [None]:
param_grid = {
    "depth":[5,7],
    "learning_rate":[0.03,0.1],
    "iterations":[300,500]
}

grid = GridSearchCV(
    CatBoostClassifier(auto_class_weights="Balanced", verbose=0),
    param_grid,
    scoring="f1",
    cv=3
)

grid.fit(X_train, y_train)

grid.best_params_

This output lists the CatBoost hyperparameters (depth, iterations, and learning_rate) that achieved the best mean cross-validated F1-score in the grid search. Among the eight combinations tested, a depth of 7, 500 iterations, and a learning_rate of 0.1 performed best at identifying anomalies.
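
To get a sense of the search cost, the sketch below counts the candidates in the same grid with scikit-learn's `ParameterGrid` (the names `param_grid_demo`, `n_candidates`, and `n_fits` are illustrative). The final refit assumes `GridSearchCV`'s default `refit=True`:

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    "depth": [5, 7],
    "learning_rate": [0.03, 0.1],
    "iterations": [300, 500],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # every combination of values
cv_folds = 3

# One fit per candidate per fold, plus one final refit on all training data.
n_fits = n_candidates * cv_folds + 1

print(n_candidates, n_fits)
```

Adding one more value to any list grows the number of fits multiplicatively, so it helps to keep the grid small for slower models.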

**16. Cross-Validation**

In [None]:
skf = StratifiedKFold(5)

scores = []

for tr_idx, va_idx in skf.split(X, y):
    X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
    y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]

    model = LGBMClassifier(class_weight="balanced")
    model.fit(X_tr, y_tr)

    preds = model.predict(X_va)
    scores.append(f1_score(y_va, preds))

np.mean(scores)

The cross-validation process, using a LightGBM model, resulted in an average F1-score of approximately 0.409. This score indicates the model's performance in identifying anomalies across different splits of your data, providing a more robust evaluation than a single train-validation split. The output logs from LightGBM detail the setup for each training fold.
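
Alongside the mean, the spread of the fold scores is worth reporting. The sketch below shows one way to do that with `cross_val_score`. It is an illustration, not the notebook's pipeline: it uses synthetic data from `make_classification` and swaps in Logistic Regression to stay lightweight.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf_demo = LogisticRegression(class_weight="balanced", max_iter=1000)

fold_f1 = cross_val_score(clf_demo, X_demo, y_demo, scoring="f1", cv=cv_demo)

print("F1 per fold:", fold_f1.round(3))
print(f"mean={fold_f1.mean():.3f} std={fold_f1.std():.3f}")
```

A large standard deviation relative to the mean would suggest that the score depends heavily on which rows land in each fold.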

**17. Robustness Checks**

* Confusion matrix:

A confusion matrix is useful because it provides a detailed breakdown of a model's correct and incorrect predictions across all classes, revealing exactly where it is succeeding or failing beyond simple accuracy.

In [None]:
cm = confusion_matrix(y_val, pred)
sns.heatmap(cm, annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.show()

Here `pred` holds the most recent validation predictions assigned to that name, which in this notebook's cell order are the Random Forest's from section 13. The confusion matrix shows that the model correctly identified 20,630 normal operations (True Negatives) and 429 anomalies (True Positives). It raised 721 false alarms (False Positives), predicting an anomaly for a normal reading, and missed 30 actual anomalies (False Negatives).
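
The headline metrics follow directly from these four counts. As a worked check (variable names are illustrative):

```python
# Counts read off the confusion matrix above.
tn, fp, fn, tp = 20630, 721, 30, 429

precision = tp / (tp + fp)   # share of predicted anomalies that were real
recall = tp / (tp + fn)      # share of real anomalies that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

This gives a precision of about 0.37, a recall of about 0.93, an F1-score of about 0.53, and an accuracy of about 0.97. The gap between accuracy and F1 again shows how accuracy hides the false alarms.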


* Residual analysis:

In [None]:
residuals = y_val - pred
sns.histplot(residuals)
plt.title("Residual Distribution")
plt.show()

The Residual Distribution plot visually represents the model's errors. A large bar at zero indicates correct predictions (where y_val equals pred). Bars at -1.0 signify False Positives (model predicted 1, actual was 0), and bars at 1.0 signify False Negatives (model predicted 0, actual was 1). This plot helps quickly identify the frequency and type of prediction errors made by the model.
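
The same information can be read off as counts instead of bars. A minimal sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical):

```python
import numpy as np

y_true_demo = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred_demo = np.array([0, 1, 1, 0, 0, 1, 0, 1])

residuals_demo = y_true_demo - y_pred_demo

# -1 = false positive, 0 = correct prediction, 1 = false negative
values, counts = np.unique(residuals_demo, return_counts=True)
summary = dict(zip(values.tolist(), counts.tolist()))

print(summary)
```

For binary labels the residual can only take the values -1, 0, and 1, so this summary carries the same error counts as the off-diagonal cells of the confusion matrix.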

**18. Train Final Model & Submission**

Before predicting on the test set, the three ensemble members are refit on the full training data:

In [None]:
X = train.drop(columns=["target"])
y = train["target"]

lgbm.fit(X, y)
xgb.fit(X, y)
cat.fit(X, y)

In [None]:
X_test = test.drop(columns=["ID"])
test_ids = test["ID"]

test_probs = (
    lgbm.predict_proba(X_test)[:,1] +
    xgb.predict_proba(X_test)[:,1] +
    cat.predict_proba(X_test)[:,1]
) / 3

test_preds = (test_probs >= best_thresh).astype(int)

submission = pd.DataFrame({
    "ID": test_ids,
    "target": test_preds
})

print(submission.shape)
submission.to_csv("submission.csv", index=False)

In [None]:
submission.to_csv("submission.csv", index=False)
print("Saved submission.csv")


**In summary, we took the following steps:**

**Data Preprocessing:** Handling missing values by imputation and outliers by capping, followed by feature scaling.

**Feature Engineering:** Creating time-based features (hour, day, month, weekday) and new sensor interaction features (ratios, differences, rolling statistics).

**Exploratory Data Analysis:** Analyzing target distribution (class imbalance) and feature correlations.

**Model Training:** Evaluating various classical (Logistic Regression, KNN, Decision Tree, SVM) and advanced (Random Forest, XGBoost, CatBoost, LightGBM) machine learning models, with a focus on F1-score due to class imbalance.

**Optimization:** Performing hyperparameter tuning (GridSearchCV) for CatBoost and ensemble modeling with threshold optimization for improved F1-score.

**Evaluation & Robustness:** Using cross-validation, confusion matrices, and residual analysis for robust model assessment.

**Submission:** Training the final ensemble model and generating predictions for the test set.