# Module 6 – Preprocessing Lab Notebook
## Preparing Data for Machine Learning

### Objectives:
- Clean a cybersecurity dataset
- Encode categorical variables
- Scale numerical features
- Perform train-test split
- Document each preprocessing step


In [None]:
import pandas as pd
import numpy as np
import textwrap
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split


## Step 1: Load Dataset

In [3]:
df = pd.read_csv(r'C:\Users\kayro\jupyter\assignments\Module6\module6_cyber_ml_dataset.csv')
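# Only a cell's last expression is displayed automatically, so print the preview explicitly
print(df.head())
df.info()  # prints its summary directly
df.describe()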

NameError: name 'pd' is not defined

## Step 2: Check for Missing Values

In [None]:
summary = "dataset is complete with no missing values"
print(textwrap.fill(summary, width=80))
df.isnull().sum()



dataset is complete with no missing values


Timestamp         0
SourceIP          0
DestinationIP     0
Protocol          0
BytesSent         0
BytesReceived     0
LoginAttempts     0
Severity Level    0
AttackType        0
dtype: int64
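
The cell above prints its conclusion before the null counts are computed. A minimal sketch of deriving the message from the counts themselves is shown here; the `toy` frame and the `missing_report` helper are hypothetical stand-ins, not part of the lab dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the lab dataset (values are made up)
toy = pd.DataFrame({
    "Protocol": ["TCP", "UDP", None, "TCP"],
    "BytesSent": [120, 300, 50, None],
    "LoginAttempts": [1, 4, 2, 7],
})


def missing_report(frame: pd.DataFrame) -> str:
    """Return a one-line summary computed from the actual null counts."""
    null_counts = frame.isnull().sum()
    total_missing = int(null_counts.sum())
    if total_missing == 0:
        return "Dataset is complete: no missing values."
    # Keep only the columns that actually contain nulls
    affected = null_counts[null_counts > 0]
    details = ", ".join(f"{col}={int(n)}" for col, n in affected.items())
    return f"Found {total_missing} missing values ({details})."


report = missing_report(toy)
print(report)
```

Calling `missing_report(df)` on the real frame would keep the printed statement in sync with the data if the file ever changes.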

## Step 3: Encode Categorical Variables

In [1]:
summary = "Categorical values are encoded as integers because most machine learning models operate on numbers and cannot compute on raw text."
print(textwrap.fill(summary, width=80))

# Use a separate encoder per column so each label mapping can be recovered later
protocol_encoder = LabelEncoder()
df['Protocol'] = protocol_encoder.fit_transform(df['Protocol'])
print(dict(zip(protocol_encoder.classes_, range(len(protocol_encoder.classes_)))))

attack_encoder = LabelEncoder()
df['AttackType'] = attack_encoder.fit_transform(df['AttackType'])
print(dict(zip(attack_encoder.classes_, range(len(attack_encoder.classes_)))))
df.head()


NameError: name 'textwrap' is not defined
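
As a rough illustration of the encoding step, the sketch below keeps one `LabelEncoder` per column in a dictionary so that integer codes can be mapped back to their labels later. The category values are made up for the example; the real dataset's categories may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values; the real dataset's categories may differ
toy = pd.DataFrame({
    "Protocol": ["TCP", "UDP", "ICMP", "TCP"],
    "AttackType": ["DDoS", "Normal", "BruteForce", "DDoS"],
})

# One encoder per column, stored by name so each mapping stays recoverable
encoders = {}
for col in ["Protocol", "AttackType"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, and a label's position in it is its integer code
protocol_map = {
    label: code for code, label in enumerate(encoders["Protocol"].classes_)
}
print(protocol_map)

# Decode integer codes back to the original labels
decoded = encoders["AttackType"].inverse_transform(toy["AttackType"])
print(list(decoded))
```

Note that integer codes imply an ordering (for example, ICMP < TCP < UDP) that has no real meaning for a nominal feature such as `Protocol`. One-hot encoding is a common alternative for input features; `LabelEncoder` is primarily intended for the target column.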

## Step 4: Feature Engineering

Feature engineering creates new columns by transforming or combining existing variables. This gives the model stronger, more informative signals without collecting any new data. The two features below are built before scaling so they get normalized along with the rest of the numeric columns.

### Feature 1: TotalBytes

**What it represents:**
`TotalBytes` is the sum of `BytesSent` and `BytesReceived` for a single network event. It captures the full volume of data exchanged in both directions during one connection.

**Why it improves predictive signal:**
`BytesSent` and `BytesReceived` individually only tell half the story. A model using both raw columns has to learn their combined effect on its own. By pre-computing the total, we give the model a direct measure of overall traffic size, a single number that tends to be large for bandwidth-heavy attacks, which reduces the work the model has to do.

**How it supports cybersecurity analysis:**
Attacks like DDoS and data exfiltration generate unusually high total traffic. A large `TotalBytes` value is a quick flag that something abnormal may be happening on a connection, making it one of the most practical features for network anomaly detection.

In [None]:
# Feature 1: TotalBytes — total traffic volume per event
df['TotalBytes'] = df['BytesSent'] + df['BytesReceived']

print('TotalBytes sample values:')
print(df['TotalBytes'].describe())


### Feature 2: ByteRatio

**What it represents:**
`ByteRatio` is the ratio of bytes sent to bytes received (`BytesSent / (BytesReceived + 1)`). The `+1` prevents division by zero on rows where nothing was received. A value near 1.0 means traffic was balanced; a value well above 1 means far more was sent than received, and a value near 0 means far more was received than sent.

**Why it improves predictive signal:**
Raw byte counts alone do not capture the *direction* imbalance of a connection. Two events can have the same `TotalBytes` but opposite traffic patterns. `ByteRatio` encodes that asymmetry as a single interaction term, giving the model a feature that neither `BytesSent` nor `BytesReceived` can express on its own.

**How it supports cybersecurity analysis:**
Traffic direction imbalance is a hallmark of specific attack types. Brute-force login attempts send many small requests (high `ByteRatio`), while data exfiltration uploads large payloads outbound (also high `ByteRatio`). A DDoS flood may show the opposite pattern — massive inbound traffic with little sent. `ByteRatio` helps the model separate these attack signatures.

In [None]:
# Feature 2: ByteRatio — traffic direction imbalance (+1 avoids division by zero)
df['ByteRatio'] = df['BytesSent'] / (df['BytesReceived'] + 1)

print('ByteRatio sample values:')
print(df['ByteRatio'].describe())
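
To make the asymmetry argument concrete, here is a small sketch with two hypothetical events that exchange the same total volume in opposite directions. The byte counts are invented for illustration.

```python
import pandas as pd

# Two hypothetical events with identical total volume but opposite direction
events = pd.DataFrame(
    {"BytesSent": [9000, 1000], "BytesReceived": [1000, 9000]},
    index=["mostly_outbound", "mostly_inbound"],
)

# Same formulas as the notebook cells above
events["TotalBytes"] = events["BytesSent"] + events["BytesReceived"]
events["ByteRatio"] = events["BytesSent"] / (events["BytesReceived"] + 1)

print(events)
```

Both rows share the same `TotalBytes`, while `ByteRatio` lands well above 1 for the outbound-heavy event and well below 1 for the inbound-heavy one.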


In [None]:
# Verify engineered features — check for missing values or infinite numbers
print('Null values in engineered features:')
print(df[['TotalBytes', 'ByteRatio']].isnull().sum())

print('\nInfinite values in ByteRatio:', np.isinf(df['ByteRatio']).sum())

print('\nEngineered features preview:')
df[['BytesSent', 'BytesReceived', 'TotalBytes', 'ByteRatio']].head(10)

## Step 5: Feature Scaling

In [ ]:
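summary = """StandardScaler computes each numeric column's mean and standard deviation,
then subtracts the mean and divides by the standard deviation. This standardizes
the columns to mean 0 and unit variance so that no feature dominates because of
its units. The two engineered features (TotalBytes and ByteRatio) are included
so they are scaled consistently with the rest of the dataset."""
print(textwrap.fill(summary, width=80))

# Caution: fitting on the full dataset lets test-set statistics influence the
# scaling (data leakage). Fitting on the training split only avoids this.
scaler = StandardScaler()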
numeric_cols = ['BytesSent', 'BytesReceived', 'LoginAttempts', 'Severity Level',
                'TotalBytes', 'ByteRatio']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
df.head()
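
The transformation applied by `StandardScaler` can be checked by hand. The sketch below, using a made-up single column, compares the scaler's output with the formula `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature column
values = np.array([[10.0], [20.0], [30.0], [40.0]])

scaled = StandardScaler().fit_transform(values)

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, the column has a mean of approximately 0 and a standard deviation of 1, regardless of its original units.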

## Step 6: Train-Test Split

In [None]:
summary = """A model should never be evaluated on data it was trained on. This split holds out
20% of the rows as unseen test data and trains the model on the remaining 80%."""
print(textwrap.fill(summary, width=80))

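# Note: Timestamp, SourceIP, and DestinationIP have not been encoded and are likely
# still text; they would need to be encoded or dropped before fitting most models
X = df.drop('AttackType', axis=1)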
y = df['AttackType']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Training set size:', X_train.shape)
print('Testing set size:', X_test.shape)
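
Reflection question 2 asks about scaling after the split. A minimal sketch of that ordering follows, assuming a synthetic two-feature dataset in place of the real one: the scaler is fitted on the training rows only and then reused on the test rows. The `stratify=y` argument is an optional addition that keeps class proportions similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab dataset (values are random, labels alternate)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "BytesSent": rng.integers(100, 10_000, size=50),
    "BytesReceived": rng.integers(100, 10_000, size=50),
    "AttackType": np.tile([0, 1], 25),
})

X = toy.drop(columns="AttackType")
y = toy["AttackType"]

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```

Because the test rows are transformed with training statistics, their scaled mean will generally not be exactly 0. That is expected and mirrors how the model would see new data in deployment.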


## Reflection Questions

1. Why is encoding necessary before modeling?
Your Answer: Most machine learning models operate on numeric arrays, so text categories such as `Protocol` and `AttackType` must be converted to numbers before the model can compute on them.

2. Why should scaling occur after splitting data?
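Your Answer: Fitting the scaler on the training set only, then applying it to the test set, keeps test-set statistics (the mean and standard deviation) from leaking into training, which would make the evaluation optimistically biased. Scaling puts features on a comparable scale; it does not remove outliers.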

3. How does preprocessing impact model performance?
Your Answer: Preprocessing makes tabular data computable, puts features on comparable scales so that none dominates by magnitude alone, and handles missing values that could otherwise distort or break model training.