<a href="https://colab.research.google.com/github/mukhammadjontursunaliev/Project/blob/main/iotid20mukhammadjon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IoTID20 Data Cleaning and Preprocessing

This notebook cleans and preprocesses the IoTID20 dataset so it can be used
to train an anomaly detection model. The steps are:

1. Mount Google Drive  
2. Load the dataset  
3. Inspect data structure and quality  
4. Clean missing or invalid records  
5. Engineer features and split data  
6. Scale features and save artifacts  



In [None]:
# 1. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# 2. Imports and Configuration
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import joblib
import os


In [None]:
# 3. Load IoTID20 Dataset
data_path = '/content/drive/MyDrive/Datasets/IoT Network Intrusion Dataset.csv'
df = pd.read_csv(data_path)
print(f'Dataset loaded with shape: {df.shape}')


Dataset loaded with shape: (625783, 86)


## Initial Inspection


In [None]:
# Inspect basic structure and missing values
print(df.dtypes)
print(df.head())
print('Missing values per column:')
print(df.isnull().sum())
print('Duplicate rows:', df.duplicated().sum())


Flow_ID      object
Src_IP       object
Src_Port      int64
Dst_IP       object
Dst_Port      int64
             ...   
Idle_Max    float64
Idle_Min    float64
Label        object
Cat          object
Sub_Cat      object
Length: 86, dtype: object
                                     Flow_ID           Src_IP  Src_Port  \
0   192.168.0.13-192.168.0.16-10000-10101-17     192.168.0.13     10000   
1    192.168.0.13-222.160.179.132-554-2179-6  222.160.179.132      2179   
2     192.168.0.13-192.168.0.16-9020-52727-6     192.168.0.16     52727   
3     192.168.0.13-192.168.0.16-9020-52964-6     192.168.0.16     52964   
4  192.168.0.1-239.255.255.250-36763-1900-17      192.168.0.1     36763   

            Dst_IP  Dst_Port  Protocol               Timestamp  Flow_Duration  \
0     192.168.0.16     10101        17  25/07/2019 03:25:53 AM             75   
1     192.168.0.13       554         6  26/05/2019 10:11:06 PM           5310   
2     192.168.0.13      9020         6  11/07/2019 01:24:48 

In [None]:
print('Label distribution:')
print(df['Label'].value_counts())

Label distribution:
Label
Anomaly    585710
Normal      40073
Name: count, dtype: int64


## Data Cleaning


In [None]:
# Drop missing values (if any)
df_clean = df.dropna()
# Normalize column names
df_clean.columns = (
    df_clean.columns
      .str.strip()
      .str.lower()
      .str.replace(' ', '_')
)
print(f'After cleaning: {df_clean.shape}')


After cleaning: (625783, 86)


In [None]:
df_clean.columns

Index(['flow_id', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'protocol',
       'timestamp', 'flow_duration', 'tot_fwd_pkts', 'tot_bwd_pkts',
       'totlen_fwd_pkts', 'totlen_bwd_pkts', 'fwd_pkt_len_max',
       'fwd_pkt_len_min', 'fwd_pkt_len_mean', 'fwd_pkt_len_std',
       'bwd_pkt_len_max', 'bwd_pkt_len_min', 'bwd_pkt_len_mean',
       'bwd_pkt_len_std', 'flow_byts/s', 'flow_pkts/s', 'flow_iat_mean',
       'flow_iat_std', 'flow_iat_max', 'flow_iat_min', 'fwd_iat_tot',
       'fwd_iat_mean', 'fwd_iat_std', 'fwd_iat_max', 'fwd_iat_min',
       'bwd_iat_tot', 'bwd_iat_mean', 'bwd_iat_std', 'bwd_iat_max',
       'bwd_iat_min', 'fwd_psh_flags', 'bwd_psh_flags', 'fwd_urg_flags',
       'bwd_urg_flags', 'fwd_header_len', 'bwd_header_len', 'fwd_pkts/s',
       'bwd_pkts/s', 'pkt_len_min', 'pkt_len_max', 'pkt_len_mean',
       'pkt_len_std', 'pkt_len_var', 'fin_flag_cnt', 'syn_flag_cnt',
       'rst_flag_cnt', 'psh_flag_cnt', 'ack_flag_cnt', 'urg_flag_cnt',
       'cwe_flag_count', 'ece_

In [None]:
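# Remove invalid flows: keep only rows with positive duration and forward packet count.
# .copy() makes df_clean an independent frame, so the column assignments in the
# following cells do not trigger pandas' SettingWithCopyWarning.
df_clean = df_clean[(df_clean['flow_duration'] > 0) & (df_clean['tot_fwd_pkts'] > 0)].copy()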

print("Columns after filtering:", df_clean.columns.tolist())


Columns after filtering: ['flow_id', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'protocol', 'timestamp', 'flow_duration', 'tot_fwd_pkts', 'tot_bwd_pkts', 'totlen_fwd_pkts', 'totlen_bwd_pkts', 'fwd_pkt_len_max', 'fwd_pkt_len_min', 'fwd_pkt_len_mean', 'fwd_pkt_len_std', 'bwd_pkt_len_max', 'bwd_pkt_len_min', 'bwd_pkt_len_mean', 'bwd_pkt_len_std', 'flow_byts/s', 'flow_pkts/s', 'flow_iat_mean', 'flow_iat_std', 'flow_iat_max', 'flow_iat_min', 'fwd_iat_tot', 'fwd_iat_mean', 'fwd_iat_std', 'fwd_iat_max', 'fwd_iat_min', 'bwd_iat_tot', 'bwd_iat_mean', 'bwd_iat_std', 'bwd_iat_max', 'bwd_iat_min', 'fwd_psh_flags', 'bwd_psh_flags', 'fwd_urg_flags', 'bwd_urg_flags', 'fwd_header_len', 'bwd_header_len', 'fwd_pkts/s', 'bwd_pkts/s', 'pkt_len_min', 'pkt_len_max', 'pkt_len_mean', 'pkt_len_std', 'pkt_len_var', 'fin_flag_cnt', 'syn_flag_cnt', 'rst_flag_cnt', 'psh_flag_cnt', 'ack_flag_cnt', 'urg_flag_cnt', 'cwe_flag_count', 'ece_flag_cnt', 'down/up_ratio', 'pkt_size_avg', 'fwd_seg_size_avg', 'bwd_seg_size_a

In [None]:
# Clip extreme values at 1st and 99th percentiles
for col in ['flow_duration', 'totlen_fwd_pkts', 'totlen_bwd_pkts']:
    low, high = df_clean[col].quantile([0.01, 0.99])
    df_clean[col] = df_clean[col].clip(lower=low, upper=high)
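
To make the effect of the clipping step concrete, here is a minimal sketch on synthetic data. The helper name `clip_to_percentiles` and the toy values are assumptions for illustration only; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

def clip_to_percentiles(series, lower_q=0.01, upper_q=0.99):
    """Clip a numeric Series to its own lower/upper quantiles (hypothetical helper)."""
    low, high = series.quantile([lower_q, upper_q])
    return series.clip(lower=low, upper=high)

# Synthetic stand-in for a heavy-tailed column such as flow_duration:
# 999 moderate values plus one extreme value.
rng = np.random.default_rng(0)
toy = pd.Series(np.append(rng.integers(1, 1000, size=999), 10_000_000)).astype(float)

clipped = clip_to_percentiles(toy)
print("max before:", toy.max(), "max after:", clipped.max())
```

The extreme value is pulled down to the 99th percentile while the row itself is kept. In the notebook the bounds are computed on the full cleaned frame before the train/test split; computing them on the training split only would be an alternative if test-set information must stay out of preprocessing.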

## Feature Engineering & Preprocessing


In [None]:
feature_cols = [
    'flow_duration',
    'totlen_fwd_pkts',    # or 'total_length'
    'totlen_bwd_pkts',    # or drop if using total_length
    'src_port',
    'dst_port'
]


In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Encode labels
le = LabelEncoder()
df_clean['label_enc'] = le.fit_transform(df_clean['label'])

# Select X and y
X = df_clean[feature_cols]
y = df_clean['label_enc']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print('Train:', X_train.shape, 'Test:', X_test.shape)


Train: (321466, 5) Test: (80367, 5)
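
As a quick illustration of what `stratify=y` buys, the sketch below splits synthetic labels and compares the minority share in each part. The data is made up; the roughly 6% minority share only mirrors the raw label counts printed earlier (40,073 Normal out of 625,783).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an imbalance similar to the raw dataset (about 6% class 1)
rng = np.random.default_rng(42)
y_toy = (rng.random(10_000) < 0.064).astype(int)
X_toy = rng.normal(size=(10_000, 5))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification the class share is nearly identical in both splits
print(f"minority share: full={y_toy.mean():.4f} "
      f"train={y_tr.mean():.4f} test={y_te.mean():.4f}")
```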


## Scaling & Saving Artifacts


In [None]:
# data_path is repeated here so the cell can run on its own;
# the CSV itself is already loaded, so it is not read a second time.
data_path = '/content/drive/MyDrive/Datasets/IoT Network Intrusion Dataset.csv'

# Derive artifact_dir from the data file's parent folder:
base_dir     = os.path.dirname(data_path)              # '/content/drive/MyDrive/Datasets'
artifact_dir = os.path.join(base_dir, 'artifacts')     # '/content/drive/MyDrive/Datasets/artifacts'
os.makedirs(artifact_dir, exist_ok=True)

print("Artifacts will be saved to:", artifact_dir)


Artifacts will be saved to: /content/drive/MyDrive/Datasets/artifacts


In [None]:
# 1. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# 2. Persist scaler and label encoder (artifact_dir was created in the previous cell)
joblib.dump(scaler, os.path.join(artifact_dir, 'scaler.joblib'))
joblib.dump(le,     os.path.join(artifact_dir, 'label_encoder.joblib'))

# 3. Save processed datasets
#    Reuse feature_cols so the CSV headers always match the columns in X_train/X_test.
pd.DataFrame(X_train_scaled, columns=feature_cols).to_csv(
    os.path.join(artifact_dir, 'X_train.csv'), index=False
)
pd.Series(y_train, name='label').to_csv(
    os.path.join(artifact_dir, 'y_train.csv'), index=False
)
pd.DataFrame(X_test_scaled, columns=feature_cols).to_csv(
    os.path.join(artifact_dir, 'X_test.csv'), index=False
)
pd.Series(y_test, name='label').to_csv(
    os.path.join(artifact_dir, 'y_test.csv'), index=False
)

print('Artifacts saved to', artifact_dir)


Artifacts saved to /content/drive/MyDrive/Datasets/artifacts
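
A saved scaler is only useful if it reproduces the same transform after reloading. The following sketch checks that round trip with `joblib` in a temporary directory; the toy matrix and the names `scaler_toy` and `reloaded` are illustrative assumptions, not the notebook's artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for X_train
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=100.0, scale=25.0, size=(200, 5))

scaler_toy = StandardScaler().fit(X_toy)

# Dump to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")
    joblib.dump(scaler_toy, path)
    reloaded = joblib.load(path)

same_output = np.allclose(scaler_toy.transform(X_toy), reloaded.transform(X_toy))
print("Reloaded scaler reproduces the original transform:", same_output)
```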


In [None]:
import joblib, pandas as pd

X_train = pd.read_csv(f"{artifact_dir}/X_train.csv")
y_train = pd.read_csv(f"{artifact_dir}/y_train.csv")['label']


Instantiate and fit the Isolation Forest, then save it:

In [None]:
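from sklearn.ensemble import IsolationForest
# contamination=0.01 places the decision threshold so that about 1% of the
# training flows are flagged as anomalies; y_train is not used (unsupervised fit).
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(X_train)
joblib.dump(model, f"{artifact_dir}/model.joblib")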


['/content/drive/MyDrive/Datasets/artifacts/model.joblib']

In [None]:
X_test = pd.read_csv(f"{artifact_dir}/X_test.csv")
y_test = pd.read_csv(f"{artifact_dir}/y_test.csv")['label']
y_pred = model.predict(X_test)           # IsolationForest returns 1 (inlier) or -1 (outlier)
# y_test holds LabelEncoder codes (0 = Anomaly, 1 = Normal), so y_pred must be remapped before comparing
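
The cell above produces `y_pred` but does not compare it with `y_test`. One possible way to do that is sketched below on synthetic data: remap the Isolation Forest output (`-1`/`1`) to the `LabelEncoder` codes, which sort alphabetically (`Anomaly` = 0, `Normal` = 1), and then use the usual classification metrics. The toy data and the names `toy_model` and `y_pred_enc` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic stand-in data: code 1 = "Normal" cluster, code 0 = scattered "Anomaly" points
rng = np.random.default_rng(42)
X_normal = rng.normal(0.0, 1.0, size=(950, 5))
X_outlier = rng.uniform(-12.0, 12.0, size=(50, 5))
X_toy = np.vstack([X_normal, X_outlier])
y_toy = np.array([1] * 950 + [0] * 50)

toy_model = IsolationForest(contamination=0.05, random_state=42).fit(X_toy)
raw_pred = toy_model.predict(X_toy)          # +1 = inlier, -1 = outlier

# Remap to the LabelEncoder convention: -1 -> 0 ("Anomaly"), +1 -> 1 ("Normal")
y_pred_enc = np.where(raw_pred == -1, 0, 1)

print(confusion_matrix(y_toy, y_pred_enc))
print(classification_report(y_toy, y_pred_enc, target_names=["Anomaly", "Normal"]))
```

In the notebook the same remapping would apply to `y_pred` and `y_test`. Keep in mind that the raw labels mark most flows as `Anomaly`, whereas `contamination=0.01` flags only about 1% of flows, so the two notions of "anomaly" should not be expected to line up closely.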


## Testing

In [None]:
# Colab Cell 1: Mount Drive and Imports
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import joblib


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Colab Cell 2: Load Preprocessing & Model Artifacts
artifact_dir = '/content/drive/MyDrive/Datasets/artifacts'  # adjust if needed

# These should match what you saved earlier
scaler = joblib.load(f"{artifact_dir}/scaler.joblib")
le     = joblib.load(f"{artifact_dir}/label_encoder.joblib")
model  = joblib.load(f"{artifact_dir}/model.joblib")       # IsolationForest or classifier


In [None]:
# Colab Cell 3: Simulate 2–3 “Realistic” Flow Samples
# Make sure your keys match feature_cols exactly
feature_cols = ['flow_duration', 'totlen_fwd_pkts', 'totlen_bwd_pkts', 'src_port', 'dst_port']

samples = [
    {'flow_duration': 1200, 'totlen_fwd_pkts': 1500, 'totlen_bwd_pkts': 1400, 'src_port': 34567, 'dst_port': 80},
    {'flow_duration': 50,   'totlen_fwd_pkts': 30,   'totlen_bwd_pkts': 25,   'src_port': 5678,  'dst_port': 443},
    {'flow_duration': 300,  'totlen_fwd_pkts': 300,  'totlen_bwd_pkts': 310,  'src_port': 12345, 'dst_port': 22}
]

df_samples = pd.DataFrame(samples, columns=feature_cols)
print("Raw samples:\n", df_samples)


Raw samples:
    flow_duration  totlen_fwd_pkts  totlen_bwd_pkts  src_port  dst_port
0           1200             1500             1400     34567        80
1             50               30               25      5678       443
2            300              300              310     12345        22


In [None]:
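# Colab Cell 4: Scale & Predict
from sklearn.ensemble import IsolationForest

# Keep the column names: the model was fitted on a DataFrame, and scikit-learn
# warns about missing feature names when it is given a bare array.
X_sim_scaled = pd.DataFrame(scaler.transform(df_samples), columns=feature_cols)

# Check the model type directly: classifiers such as LogisticRegression also
# expose decision_function, so hasattr() would not tell the two cases apart.
if isinstance(model, IsolationForest):
    scores   = model.decision_function(X_sim_scaled)      # higher = more “normal”
    preds    = model.predict(X_sim_scaled)                # 1 = normal, -1 = anomaly
    df_results = df_samples.copy()
    df_results['score']   = scores
    df_results['anomaly'] = preds == -1
else:
    # For a supervised classifier
    raw_preds   = model.predict(X_sim_scaled)
    # decode back to original labels
    decoded     = le.inverse_transform(raw_preds)
    df_results = df_samples.copy()
    df_results['predicted_label'] = decoded

print("\nResults:\n", df_results)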



Results:
    flow_duration  totlen_fwd_pkts  totlen_bwd_pkts  src_port  dst_port  \
0           1200             1500             1400     34567        80   
1             50               30               25      5678       443   
2            300              300              310     12345        22   

      score  anomaly  
0 -0.019066     True  
1  0.116947    False  
2  0.058294    False  




In [None]:
# Final Cell: Readable Anomaly Detection Output (Fixed)

import pandas as pd
import numpy as np
from IPython.display import display

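# 1. Compute normalized scores (0–1) from raw decision_function outputs
#    (higher = more normal, despite the 'AnomalyScore' column name below).
#    Caveat: min-max scaling is relative to this batch, so the lowest-scoring
#    sample always gets 0% and the highest 100%, whatever their raw scores.
raw_scores = model.decision_function(X_sim_scaled)
min_s, max_s = raw_scores.min(), raw_scores.max()
norm_scores = (raw_scores - min_s) / (max_s - min_s)

# 2. Map to human-readable labels
labels = ['Normal' if s >= 0.5 else 'Anomalous' for s in norm_scores]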

# 3. Create simple explanations for each sample
reasons = []
for idx, row in df_samples.iterrows():
    if labels[idx] == 'Anomalous':
        reasons.append('Duration or packet counts outside normal range')
    else:
        reasons.append('Within expected network behavior')

# 4. Assemble display DataFrame
df_display = df_samples.copy()
df_display['AnomalyScore'] = [f"{s:.1%}" for s in norm_scores]
df_display['Label']        = labels
df_display['Reason']       = reasons

# 5. Style and display the table, highlighting anomalies
def highlight_anomaly(row):
    return ['background-color: #faa' if row.Label=='Anomalous' else '' for _ in row]

styled = (
    df_display.style
      .apply(highlight_anomaly, axis=1)
      .set_caption("Anomaly Detection Results")
      .hide(axis="index")                # ← replaces .hide_index()
)

display(styled)




flow_duration,totlen_fwd_pkts,totlen_bwd_pkts,src_port,dst_port,AnomalyScore,Label,Reason
1200,1500,1400,34567,80,0.0%,Anomalous,Duration or packet counts outside normal range
50,30,25,5678,443,100.0%,Normal,Within expected network behavior
300,300,310,12345,22,56.9%,Normal,Within expected network behavior
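
Because the 0–1 score is min-max scaled within the batch, one sample always ends up at 0% and is labeled `Anomalous`, even when every sample is ordinary. The sketch below shows this on synthetic data and contrasts it with a batch-independent rule based on the sign of `decision_function`, which is the cut-off `predict` itself uses. All data and names here (`toy_model`, `X_new`, `labels_fixed`) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic training data standing in for the scaled training features
rng = np.random.default_rng(42)
X_fit = rng.normal(size=(2000, 5))
toy_model = IsolationForest(contamination=0.01, random_state=42).fit(X_fit)

# Three ordinary samples, all well inside the bulk of the training data
X_new = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, -0.5, 0.3, 0.2, -0.4],
    [1.0, 0.8, -0.9, 0.5, 0.7],
])
raw = toy_model.decision_function(X_new)

# Batch min-max scaling: the lowest-scoring sample always lands at 0.0
batch_norm = (raw - raw.min()) / (raw.max() - raw.min())
labels_batch = ["Normal" if s >= 0.5 else "Anomalous" for s in batch_norm]

# Batch-independent rule: a negative decision_function value is the model's own cut-off
labels_fixed = ["Anomalous" if s < 0 else "Normal" for s in raw]

print("batch-normalized:", labels_batch)
print("fixed threshold: ", labels_fixed)
```

With the fixed threshold a sample's label no longer changes depending on which other samples are scored alongside it.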


## More Tests

In [None]:
# Final Cell: Readable Anomaly Detection Output with Multiple Examples

import pandas as pd
import numpy as np
from IPython.display import display

# -- 1. Define 5 sample flows covering different scenarios --
feature_cols = ['flow_duration', 'totlen_fwd_pkts', 'totlen_bwd_pkts', 'src_port', 'dst_port']
sample_data = [
    # 0. Typical Web Browsing (normal)
    {'flow_duration': 100,  'totlen_fwd_pkts': 80,  'totlen_bwd_pkts': 75,  'src_port': 49152, 'dst_port': 80},
    # 1. Small IoT Heartbeat (normal, low traffic)
    {'flow_duration': 20,   'totlen_fwd_pkts': 5,   'totlen_bwd_pkts': 5,   'src_port': 37000, 'dst_port': 5683},
    # 2. Borderline: longer than usual, moderate packets
    {'flow_duration': 800,  'totlen_fwd_pkts': 200, 'totlen_bwd_pkts': 195, 'src_port': 40000, 'dst_port': 443},
    # 3. High-volume anomaly: extremely large byte totals in both directions
    {'flow_duration': 1500, 'totlen_fwd_pkts': 5000,'totlen_bwd_pkts': 4800,'src_port': 34567, 'dst_port': 22},
    # 4. Very short but high-volume flow (possible flooding)
    {'flow_duration': 5,    'totlen_fwd_pkts': 1000,'totlen_bwd_pkts': 950, 'src_port': 12345, 'dst_port': 8080},
]
df_samples = pd.DataFrame(sample_data, columns=feature_cols)

# -- 2. Scale samples --
X_sim_scaled = scaler.transform(df_samples)

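# -- 3. Compute and normalize raw anomaly scores --
#    Min-max scaling is relative to this batch: the lowest raw score maps to 0%
#    and the highest to 100%, so labels depend on which samples are scored together.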
raw_scores = model.decision_function(X_sim_scaled)
min_s, max_s = raw_scores.min(), raw_scores.max()
norm_scores = (raw_scores - min_s) / (max_s - min_s)

# -- 4. Map to human-readable labels --
labels = ['Normal' if s >= 0.5 else 'Anomalous' for s in norm_scores]

# -- 5. Generate explanations based on simple thresholds --
reasons = []
for i, row in df_samples.iterrows():
    if labels[i] == 'Anomalous':
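        # choose reason based on which feature is extreme
        # (totlen_fwd_pkts is the total forward packet length in bytes, so
        #  'packet count' in the messages below is loose wording)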
        if row['totlen_fwd_pkts'] > 1000:
            reasons.append('High packet count → potential flood')
        elif row['flow_duration'] < 10 and row['totlen_fwd_pkts'] > 500:
            reasons.append('Very short, high-volume → likely scan/flood')
        else:
            reasons.append('Unusually long or voluminous flow')
    else:
        reasons.append('Within expected IoT/network behavior')

# -- 6. Assemble display DataFrame --
df_display = df_samples.copy()
df_display['AnomalyScore'] = [f"{s:.1%}" for s in norm_scores]
df_display['Label']        = labels
df_display['Reason']       = reasons

# -- 7. Style & display --
def highlight_anomaly(row):
    return ['background-color: #faa' if row.Label == 'Anomalous' else '' for _ in row]

styled = (
    df_display.style
      .apply(highlight_anomaly, axis=1)
      .set_caption("Anomaly Detection Results: Sample Scenarios")
      .hide(axis="index")
)

display(styled)




flow_duration,totlen_fwd_pkts,totlen_bwd_pkts,src_port,dst_port,AnomalyScore,Label,Reason
100,80,75,49152,80,81.2%,Normal,Within expected IoT/network behavior
20,5,5,37000,5683,100.0%,Normal,Within expected IoT/network behavior
800,200,195,40000,443,31.4%,Anomalous,Unusually long or voluminous flow
1500,5000,4800,34567,22,0.0%,Anomalous,High packet count → potential flood
5,1000,950,12345,8080,53.9%,Normal,Within expected IoT/network behavior


In [None]:
# Supervised example: train and evaluate a classifier on the scaled features

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Train a supervised classification model (e.g., Logistic Regression)
# Note: This is an example. You might use other classifiers like RandomForest, GradientBoosting, etc.
classifier_model = LogisticRegression(max_iter=1000, random_state=42)
classifier_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_classifier = classifier_model.predict(X_test_scaled)

# Evaluate the classifier model
print("\nClassifier Model Evaluation:")
print(confusion_matrix(y_test, y_pred_classifier))
print(classification_report(y_test, y_pred_classifier, target_names=le.classes_))

# Save the trained classifier model
joblib.dump(classifier_model, os.path.join(artifact_dir, 'classifier_model.joblib'))
print('Classifier model saved to', os.path.join(artifact_dir, 'classifier_model.joblib'))


## Project Summary

This Colab notebook builds an anomaly detection pipeline for IoT network intrusion detection on the IoTID20 dataset.

**Project Type:** Anomaly Detection

**Algorithm Used for Training:** The anomaly detection model is an **Isolation Forest**. A **Logistic Regression** classifier is also trained as a supervised example.

**How the Project Works:**

1.  **Data Loading:** The notebook mounts Google Drive and loads the IoTID20 CSV (625,783 flows, 86 columns). Each flow carries a `Label` of `Anomaly` or `Normal`, along with `Cat` and `Sub_Cat` columns.
2.  **Data Cleaning and Preprocessing:**
    *   It performs initial inspection of the data types, missing values, duplicates, and label distribution.
    *   Missing values are dropped.
    *   Column names are cleaned and standardized.
    *   Invalid flows (with non-positive duration or forward packet counts) are removed.
    *   Extreme values in certain numerical columns (`flow_duration`, `totlen_fwd_pkts`, `totlen_bwd_pkts`) are clipped to the 1st and 99th percentiles to handle outliers.
3.  **Feature Engineering and Preprocessing:**
    *   A subset of features (`flow_duration`, `totlen_fwd_pkts`, `totlen_bwd_pkts`, `src_port`, `dst_port`) is selected for the model.
    *   The `label` column (`Anomaly` or `Normal`) is encoded into integer codes using `LabelEncoder`.
    *   The data is split into training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`) using `train_test_split`, ensuring the label distribution is maintained (stratified split).
4.  **Scaling and Artifact Saving:**
    *   The selected features are standardized with `StandardScaler` (zero mean, unit variance). The scaler is fitted on `X_train` only and then applied to `X_test`.
    *   The fitted `scaler` and `label_encoder` objects are saved to disk using `joblib` as artifacts.
    *   The processed, scaled training and testing datasets (`X_train.csv`, `y_train.csv`, `X_test.csv`, `y_test.csv`) are saved as CSV files.
5.  **Model Training:**
    *   An `IsolationForest` model is instantiated. This is an unsupervised learning algorithm specifically designed for anomaly detection. It works by isolating observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Anomalies are expected to be isolated in fewer splitting steps than normal instances.
    *   The model is trained (`fit`) on the scaled training features (`X_train`).
    *   The trained `IsolationForest` model is saved to disk using `joblib`.
6.  **Testing and Prediction:**
    *   The processed test data (`X_test`, `y_test`) is loaded from the saved CSV files.
    *   The trained model is used to make predictions (`predict`) on the scaled test features (`X_test`). Isolation Forest predicts `1` for normal instances and `-1` for anomalies. The notebook does not compare these predictions with `y_test`.
7.  **Simulating and Evaluating New Samples:**
    *   The notebook includes sections to simulate new, realistic network flow samples.
    *   These new samples are scaled using the previously saved `scaler`.
    *   The trained model is used to predict anomaly scores (`decision_function`) or labels (`predict`) for these samples.
    *   The raw scores are min-max normalized to 0-1 within each batch of samples, so the normalized value is relative to the other samples in the batch, and a higher value means more normal.
    *   Human-readable labels ('Normal' or 'Anomalous') are assigned based on a threshold (0.5 on the normalized score).
    *   Simple explanations for why a sample might be flagged as anomalous are generated from fixed thresholds on `totlen_fwd_pkts` and `flow_duration` (a large forward volume, or a very short duration combined with a large volume).
    *   The results, including the raw sample data, normalized anomaly score, label, and reason, are displayed in a styled pandas DataFrame, with anomalous rows highlighted.
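
The isolation idea described in step 5 can be sketched in a few lines. This is a simplified toy, not scikit-learn's implementation: it repeatedly picks a random feature and a random split value, keeps the side that contains the point of interest, and counts splits until the point is alone. The function names `isolation_depth` and `mean_depth` and the data are assumptions for illustration.

```python
import numpy as np

def isolation_depth(data, point, rng, max_depth=50):
    """Count random splits needed until `point` is alone in its partition."""
    current = data
    depth = 0
    while len(current) > 1 and depth < max_depth:
        feature = rng.integers(current.shape[1])            # random feature
        lo, hi = current[:, feature].min(), current[:, feature].max()
        if lo == hi:                                         # no spread left on this feature
            break
        split = rng.uniform(lo, hi)                          # random split value
        # Keep only the side of the split that contains the point
        if point[feature] < split:
            current = current[current[:, feature] < split]
        else:
            current = current[current[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(255, 2))   # dense "normal" cluster
outlier = np.array([[10.0, 10.0]])              # one far-away point
data = np.vstack([cluster, outlier])

def mean_depth(point, n_trees=200):
    """Average isolation depth over many random partitionings."""
    return np.mean([isolation_depth(data, point, rng) for _ in range(n_trees)])

depth_outlier = mean_depth(outlier[0])
depth_inlier = mean_depth(cluster[0])
print(f"mean depth  outlier: {depth_outlier:.1f}  inlier: {depth_inlier:.1f}")
```

The far-away point is separated after far fewer splits on average than a point inside the cluster, and that shorter path is what Isolation Forest turns into a lower (more anomalous) score.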

**In summary,** the project builds and tests an anomaly detection system for network traffic using the Isolation Forest algorithm. It covers data preprocessing, model training, artifact saving, and a qualitative check of the model on hand-crafted sample flows. Isolation Forest is an unsupervised method, so it needs no labels to fit; here it is trained on five flow features.