# 📘 Cyber Security Attack Type Detection – Logistic Regression Pipeline

## 🟦 1️⃣ Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import FunctionTransformer

## 🟦 2️⃣ Load Dataset

In [3]:
df = pd.read_csv(r"C:\Users\USER\Desktop\as.csv")
df.head()

| Unnamed: 0 | Timestamp | Source IP Address | Destination IP Address | Source Port | Destination Port | Protocol | Packet Length | Packet Type | Traffic Type | Payload Data | ... | Action Taken | Severity Level | User Information | Device Information | Network Segment | Geo-location Data | Proxy Information | Firewall Logs | IDS/IPS Alerts | Log Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5/30/2023 6:33 | 103.216.15.12 | 84.9.164.252 | 31225 | 17616 | ICMP | 503 | Data | HTTP | Qui natus odio asperiores nam. Optio nobis ius... | ... | Logged | Low | Reyansh Dugal | Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ... | Segment A | Jamshedpur, Sikkim | 150.9.97.135 | Log Data |  | Server |
| 1 | 8/26/2020 7:08 | 78.199.217.198 | 66.191.137.154 | 17245 | 48166 | ICMP | 1174 | Data | HTTP | Aperiam quos modi officiis veritatis rem. Omni... | ... | Blocked | Low | Sumer Rana | Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ... | Segment B | Bilaspur, Nagaland |  | Log Data |  | Firewall |
| 2 | 11/13/2022 8:23 | 63.79.210.48 | 198.219.82.17 | 16811 | 53600 | UDP | 306 | Control | HTTP | Perferendis sapiente vitae soluta. Hic delectu... | ... | Ignored | Low | Himmat Karpe | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ... | Segment C | Bokaro, Rajasthan | 114.133.48.179 | Log Data | Alert Data | Firewall |
| 3 | 7/2/2023 10:38 | 163.42.196.10 | 101.228.192.255 | 20018 | 32534 | UDP | 385 | Data | HTTP | Totam maxime beatae expedita explicabo porro l... | ... | Blocked | Medium | Fateh Kibe | Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ... | Segment B | Jaunpur, Rajasthan |  |  | Alert Data | Firewall |
| 4 | 7/16/2023 13:11 | 71.166.185.76 | 189.243.174.238 | 6131 | 26646 | TCP | 1462 | Data | DNS | Odit nesciunt dolorem nisi iste iusto. Animi v... | ... | Blocked | Low | Dhanush Chad | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... | Segment C | Anantapur, Tripura | 149.6.110.119 |  | Alert Data | Firewall |


## 🟩 3️⃣ Feature Engineering

### 📌 Explanation

The dataset contains mixed data types: numerical, categorical, timestamp, IP addresses, and textual logs.  
Machine learning models require numerical input, so several transformations are applied.

**Transformations performed:**

- **Timestamp decomposition** → hour, day of week, month
- **IP splitting** → numerical octets
- **Categorical encoding** → One-Hot Encoding
- **Text vectorization** → TF-IDF
- **Feature scaling** → StandardScaler

These transformations allow the model to capture structured, temporal, categorical, and textual patterns.

### 🟦 Feature Engineering Functions

In [5]:
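def split_ip(ip):
    """Split a dotted-quad IPv4 string into four integer octets.

    Missing or malformed addresses map to [0, 0, 0, 0].
    """
    if pd.isna(ip):
        return [0, 0, 0, 0]
    parts = str(ip).split(".")
    if len(parts) != 4:
        return [0, 0, 0, 0]
    # Non-numeric parts fall back to 0 rather than raising.
    return [int(p) if p.isdigit() else 0 for p in parts]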


def build_features(df):
    """Return a copy of df with timestamp, IP-octet, and combined-text columns added."""
    df = df.copy()

    # Timestamp features
    df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
    df["ts_hour"]  = df["Timestamp"].dt.hour.fillna(0)
    df["ts_day"]   = df["Timestamp"].dt.dayofweek.fillna(0)
    df["ts_month"] = df["Timestamp"].dt.month.fillna(0)

    # IP split
    src = df["Source IP Address"].apply(split_ip).tolist()
    dst = df["Destination IP Address"].apply(split_ip).tolist()

    for i in range(4):
        df[f"src_ip_{i}"] = [row[i] for row in src]
        df[f"dst_ip_{i}"] = [row[i] for row in dst]

    # Combine text columns
    text_cols = [
        "Payload Data",
        "Alerts/Warnings",
        "Attack Signature",
        "Firewall Logs",
        "IDS/IPS Alerts",
        "Malware Indicators",
        "Device Information",
        "User Information",
    ]

    df["combined_text"] = df[text_cols].fillna("").agg(" ".join, axis=1)

    return df

### 🟦 Apply Feature Engineering

In [6]:
df = build_features(df)
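
To make the fallback behaviour of these helpers concrete, here is a minimal sketch on two hypothetical rows (the `sample` frame is an assumption for illustration, not data from the notebook). `split_ip` is repeated so the snippet runs on its own.

```python
import pandas as pd


def split_ip(ip):
    """Same logic as the notebook helper, repeated so this snippet is self-contained."""
    if pd.isna(ip):
        return [0, 0, 0, 0]
    parts = str(ip).split(".")
    if len(parts) != 4:
        return [0, 0, 0, 0]
    return [int(p) if p.isdigit() else 0 for p in parts]


# Hypothetical rows: the first is well formed, the second is deliberately broken.
sample = pd.DataFrame({
    "Timestamp": ["5/30/2023 6:33", "not a date"],
    "Source IP Address": ["103.216.15.12", None],
})

# Unparseable timestamps become NaT, and the derived fields are then filled with 0.
ts = pd.to_datetime(sample["Timestamp"], errors="coerce")
sample["ts_hour"] = ts.dt.hour.fillna(0).astype(int)
sample["ts_day"] = ts.dt.dayofweek.fillna(0).astype(int)  # Monday = 0

# Missing IPs become four zero octets.
octets = sample["Source IP Address"].apply(split_ip).tolist()

print(sample[["ts_hour", "ts_day"]])
print(octets)
```

One consequence of this design is that a missing timestamp is indistinguishable from a midnight event on a Monday, and a missing IP is indistinguishable from `0.0.0.0`. If that matters, a separate missing-value indicator column would be one option.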

## 🟩 4️⃣ Feature Selection

### 📌 Explanation

After preprocessing, the dataset may contain:

- Many one-hot encoded variables
- Thousands of TF-IDF features

To reduce noise and improve generalization, feature selection is applied.

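We use **SelectFromModel** with Logistic Regression: an auxiliary logistic regression is fitted first, and only the features whose coefficient magnitudes reach a threshold are kept. With the default settings used here, that threshold is the mean coefficient magnitude across all features.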

**This can help:**
- Reduce dimensionality
- Improve model stability
- Reduce the risk of overfitting
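
As a rough illustration of how this selection step behaves, the sketch below runs `SelectFromModel` on synthetic data from `make_classification`. The data and the names `X_demo`, `y_demo`, and `selector` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem: 4 informative features plus 16 noise features.
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=20,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Same construction as in the notebook pipeline: default threshold, L2 penalty.
selector = SelectFromModel(LogisticRegression(max_iter=1000))
selector.fit(X_demo, y_demo)

# For a multi-class model, coef_ has one row per class; the per-feature
# importance is the summed absolute coefficient over classes.
importances = np.abs(selector.estimator_.coef_).sum(axis=0)
mask = selector.get_support()

print("threshold used :", selector.threshold_)
print("mean importance:", importances.mean())
print("features kept  :", int(mask.sum()), "of", mask.size)
```

Because the default threshold is relative (the mean), some features are always kept even when none of them carries real signal. Selection by itself therefore says nothing about whether the retained features are predictive.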

### 🟦 Define Feature Groups

In [7]:
NUM_COLS = [
    "Source Port",
    "Destination Port",
    "Packet Length",
    "Anomaly Scores",
    "ts_hour",
    "ts_day",
    "ts_month",
    "src_ip_0", "src_ip_1", "src_ip_2", "src_ip_3",
    "dst_ip_0", "dst_ip_1", "dst_ip_2", "dst_ip_3"
]

CAT_COLS = [
    "Protocol",
    "Packet Type",
    "Traffic Type",
    "Action Taken",
    "Severity Level",
    "Network Segment",
    "Log Source"
]

## 🟩 5️⃣ Preprocessing Pipeline

In [8]:
text_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=3000))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num",  StandardScaler(),                  NUM_COLS),
        ("cat",  OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
        ("text", text_pipeline,                    "combined_text"),
    ]
)

## 🟩 6️⃣ Train-Test Split

In [9]:
X = df.drop(columns=["Attack Type"])
y = df["Attack Type"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

print(f"Training samples : {X_train.shape[0]:,}")
print(f"Test samples     : {X_test.shape[0]:,}")
print(f"Class distribution:\n{y.value_counts()}")

Training samples : 32,000
Test samples     : 8,000
Class distribution:
Attack Type
DDoS         13428
Malware      13307
Intrusion    13265
Name: count, dtype: int64
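
The effect of `stratify=y` can be checked with a short sketch. It rebuilds a label series with the class counts printed above; the `rows` frame is a hypothetical stand-in for the real features. It then compares class shares in the two partitions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Labels reconstructed from the class counts shown in the output above.
labels = pd.Series(
    ["DDoS"] * 13428 + ["Malware"] * 13307 + ["Intrusion"] * 13265,
    name="Attack Type",
)
# Placeholder features: only the row count matters for this check.
rows = pd.DataFrame({"row_id": range(len(labels))})

_, _, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.2, stratify=labels, random_state=42
)

train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()

# With stratification the two columns should agree to within rounding.
print(pd.DataFrame({"train": train_share, "test": test_share}).round(4))
```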


## 🟩 7️⃣ Model Training (Logistic Regression)

### 📌 Why Logistic Regression?

Logistic Regression was chosen as a baseline model because:

- It is **computationally efficient**.
- It handles **multi-class classification** directly.
- It allows **interpretation through coefficients**.
- It provides a **benchmark** for comparison with more complex models.

In [10]:
model = Pipeline([
    ("preprocess",       preprocessor),
    ("feature_selection", SelectFromModel(
        LogisticRegression(max_iter=1000)
    )),
    ("classifier",       LogisticRegression(max_iter=1000))
])

model.fit(X_train, y_train)
print("Model training complete.")

Model training complete.


## 🟩 8️⃣ Model Evaluation

In [11]:
preds = model.predict(X_test)

print("Classification Report:\n")
print(classification_report(y_test, preds))

print("Confusion Matrix:\n")
print(confusion_matrix(y_test, preds))

Classification Report:

              precision    recall  f1-score   support

        DDoS       0.34      0.36      0.35      2686
   Intrusion       0.33      0.31      0.32      2653
     Malware       0.34      0.33      0.33      2661

    accuracy                           0.34      8000
   macro avg       0.34      0.34      0.34      8000
weighted avg       0.34      0.34      0.34      8000

Confusion Matrix:

[[975 838 873]
 [965 833 855]
 [937 847 877]]


## 🟩 9️⃣ Interpretation

### 📌 Explanation of Results

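The model achieves approximately **34% accuracy**: the diagonal of the confusion matrix sums to 2,685 correct predictions out of 8,000 (33.6%).

Given that:
- The dataset contains **three nearly balanced classes**,
- Random guessing would yield approximately **33%**,
- Always predicting the largest class (DDoS) would be correct on 2,686 of the 8,000 test rows,

the logistic regression baseline is **effectively indistinguishable from random classification**.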
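
The comparison with chance can be made explicit from the confusion matrix printed above. The following sketch uses only those nine numbers, with a simple binomial standard error as a rough yardstick (the variable names are illustrative).

```python
import numpy as np

# Confusion matrix copied from the evaluation output (rows = true class).
cm = np.array([
    [975, 838, 873],
    [965, 833, 855],
    [937, 847, 877],
])

n_total = int(cm.sum())
model_correct = int(np.trace(cm))               # correct predictions on the diagonal
majority_correct = int(cm.sum(axis=1).max())    # always predicting the largest class

accuracy = model_correct / n_total
chance = 1 / 3

# Standard error of an accuracy estimate if the true rate were exactly 1/3.
std_err = np.sqrt(chance * (1 - chance) / n_total)
z_score = (accuracy - chance) / std_err

print("model correct   :", model_correct, "of", n_total)
print("majority correct:", majority_correct, "of", n_total)
print("accuracy        :", round(accuracy, 4))
print("z vs. 1/3 chance:", round(z_score, 2))
```

The model's accuracy sits well under one standard error above 1/3, and the always-DDoS baseline gets one more test row right than the model does.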

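A result this close to chance suggests that the engineered features offer little **linear separability** between classes.

This indicates that:
- Individual features do not strongly discriminate attack types.
- **Non-linear models** may capture more complex feature interactions, although this remains untested here; if the labels are unrelated to the features, no model will improve on chance.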
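
To illustrate what "feature interactions" means in this context, the sketch below builds a synthetic two-feature problem whose label depends only on the product of the features. It then compares a logistic regression with a random forest. The data and names are assumptions for illustration, and the result says nothing about whether a non-linear model would help on the attack dataset itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# XOR-like synthetic data: the label depends on the sign of x0 * x1,
# so neither feature is informative on its own.
X_xor = rng.uniform(-1, 1, size=(2000, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

X_a, X_b, y_a, y_b = train_test_split(
    X_xor, y_xor, test_size=0.25, random_state=0
)

linear_acc = LogisticRegression(max_iter=1000).fit(X_a, y_a).score(X_b, y_b)
forest_acc = (
    RandomForestClassifier(n_estimators=100, random_state=0)
    .fit(X_a, y_a)
    .score(X_b, y_b)
)

print("logistic regression:", round(linear_acc, 3))
print("random forest      :", round(forest_acc, 3))
```

Here the linear model stays near 50% while the forest recovers the pattern, because the signal exists but is purely an interaction. Swapping the `"classifier"` step of the notebook's pipeline for a tree-based estimator would be one way to test whether something similar holds for the attack data.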

## 🟩 1️⃣0️⃣ Final Justification of Feature Engineering Choices

### 📌 Explanation

Feature engineering was designed to:

1. **Convert non-numerical data** into numerical form.
2. **Extract informative patterns** from timestamps and IP addresses.
3. **Represent textual logs** as weighted term frequencies using TF-IDF.
4. **Normalize numeric ranges** for stable optimization.
5. **Reduce high-dimensional noise** using feature selection.

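This structured transformation ensures compatibility with supervised learning algorithms and aims to expose as much signal as possible from the raw cybersecurity logs. The evaluation above shows that, for this dataset, that signal was not enough for a linear model to beat chance.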

---

### ✅ What This Notebook Includes

| Component | Status |
|---|---|
| Feature Engineering | ✔ |
| Feature Selection | ✔ |
| Logistic Regression Training | ✔ |
| Evaluation | ✔ |
| Justification Paragraphs | ✔ |
| Clean Pipeline Structure | ✔ |