
# Gas Pipeline Anomaly Detection

This notebook uses `gas_pipeline_dataset.csv` to detect attacks on a gas pipeline SCADA system with a machine learning classifier.
The `result` column labels each record as normal operation or as one of seven attack categories. The notebook performs the following steps:
1. Data loading and label preprocessing
2. Train-test split and feature standardization
3. Binary classification (Normal vs. Attack) with logistic regression
4. Model evaluation
5. Hyperparameter tuning of the regularization strength `C`

The `result` column uses the following category codes; for binary classification, codes 1 to 7 are merged into a single Attack class:
- **Normal (0)**: Standard operations.
- **Naïve Malicious Response Injection (NMRI) (1)**
- **Complex Malicious Response Injection (CMRI) (2)**
- **Malicious State Command Injection (MSCI) (3)**
- **Malicious Parameter Command Injection (MPCI) (4)**
- **Malicious Function Code Injection (MFCI) (5)**
- **Denial of Service (DoS) (6)**
- **Reconnaissance (Recon) (7)**
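
The category codes can be kept in a small lookup table, which is useful for reporting per-category counts before the labels are collapsed to binary. The sketch below is illustrative only: `ATTACK_CATEGORIES` and `decode_label` are hypothetical names, and it assumes raw labels arrive either as plain integers or as byte-string literals such as `b'3'`.

```python
import re

# Category codes for the `result` column (hypothetical helper names).
ATTACK_CATEGORIES = {
    0: "Normal",
    1: "NMRI",
    2: "CMRI",
    3: "MSCI",
    4: "MPCI",
    5: "MFCI",
    6: "DoS",
    7: "Recon",
}


def decode_label(raw):
    """Return (code, name, is_attack) for a raw label such as 3 or "b'3'"."""
    match = re.search(r"\d+", str(raw))
    if match is None:
        raise ValueError(f"Unrecognized label: {raw!r}")
    code = int(match.group())
    # Any non-zero code counts as an attack in the binary setting.
    return code, ATTACK_CATEGORIES[code], int(code != 0)


for raw in ["b'0'", "b'6'", 3]:
    print(raw, "->", decode_label(raw))
```

Applied to a DataFrame column, the same function could be used via `df['result'].map(decode_label)` to inspect the category mix before binarizing.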


In [None]:

import pandas as pd
import numpy as np

# Load dataset
file_path = 'gas_pipeline_dataset.csv'
df = pd.read_csv(file_path)

# Inspect the data
print("First five rows of the dataset:")
print(df.head())

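# Convert 'result' column to binary (Normal = 0, Attack = 1).
# Depending on how the CSV was exported, labels may appear as integers (0)
# or as byte-string literals ("b'0'"), so extract the digits before comparing.
category = df['result'].astype(str).str.extract(r'(\d+)', expand=False).astype(int)
df['result'] = (category != 0).astype(int)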
print("\nValue counts for the 'result' column:")
print(df['result'].value_counts())


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split features and labels
features = df.drop(columns=['result'])
labels = df['result']

# Train-test split, stratified so both sets keep the same normal/attack ratio
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
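
Standardization rescales each feature to zero mean and unit variance. The statistics are computed on the training set only and then reused unchanged on the test set, which is why the cell above calls `fit_transform` on `X_train` but only `transform` on `X_test`. A minimal pure-Python sketch of that fit/transform split for a single feature follows; the values are made up and the helper names are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Compute mean and population standard deviation from training data."""
    mean = fmean(train_values)
    std = pstdev(train_values)  # ddof=0, the convention StandardScaler uses
    if std == 0:
        std = 1.0  # constant feature: avoid division by zero
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any set of values."""
    return [(v - mean) / std for v in values]


train_feature = [2.0, 4.0, 6.0, 8.0]  # illustrative values only
test_feature = [5.0, 10.0]

mean, std = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mean, std)
test_scaled = standardize(test_feature, mean, std)  # reuses training statistics

print("mean:", mean, "std:", round(std, 3))
print("scaled test values:", [round(v, 3) for v in test_scaled])
```

Fitting the scaler on the full dataset instead would let test-set statistics influence training, which tends to make evaluation results look better than they should.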


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Train logistic regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Predict on test data
y_pred = logreg.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
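
For a binary problem, scikit-learn's `confusion_matrix` places true classes in rows and predicted classes in columns, so with Attack as the positive class the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives precision, detection rate (recall), and false alarm rate from such a matrix. The counts are made-up placeholders for illustration, not results from this dataset, and `detection_metrics` is a hypothetical helper.

```python
def detection_metrics(cm):
    """Derive attack-detection metrics from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm

    def ratio(numerator, denominator):
        # Return 0.0 rather than failing when a class or prediction is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),         # flagged records that are attacks
        "recall": ratio(tp, tp + fn),            # attacks that were detected
        "false_alarm_rate": ratio(fp, fp + tn),  # normal records flagged as attacks
    }


example_cm = [[90, 10], [20, 80]]  # placeholder counts, not real results
metrics = detection_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The same function should also accept the array returned by `confusion_matrix(y_test, y_pred)`, since a 2x2 NumPy array unpacks row by row. In intrusion detection the false alarm rate is often as important as accuracy, because normal traffic usually dominates.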


In [None]:

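import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

# Hyperparameter tuning for Logistic Regression (C parameter).
# Score each C with 5-fold cross-validation on the training set so the
# test set is not used for model selection.
C_values = np.logspace(-3, 3, 7)
accuracies = []

for C in C_values:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    accuracies.append(scores.mean())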

# Plot accuracies
plt.figure(figsize=(8, 6))
plt.semilogx(C_values, accuracies, marker='o', linestyle='--')
plt.xlabel('C (Inverse Regularization Strength)')
plt.ylabel('Accuracy')
plt.title('Hyperparameter Tuning: Logistic Regression')
plt.grid(True)
plt.show()
