# Module 6 – Preprocessing Lab Notebook
## Preparing Data for Machine Learning

### Objectives:
- Clean a cybersecurity dataset
- Encode categorical variables
- Scale numerical features
- Perform train-test split
- Document each preprocessing step


In [20]:
import pandas as pd
import numpy as np
import textwrap
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split


## Step 1: Load Dataset

In [21]:
df = pd.read_csv(r'./module6_cyber_ml_dataset.csv')
df.head()
df.info()
df.describe()

<class 'pandas.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Timestamp       300 non-null    str  
 1   SourceIP        300 non-null    str  
 2   DestinationIP   300 non-null    str  
 3   Protocol        300 non-null    str  
 4   BytesSent       300 non-null    int64
 5   BytesReceived   300 non-null    int64
 6   LoginAttempts   300 non-null    int64
 7   Severity Level  300 non-null    int64
 8   AttackType      300 non-null    str  
dtypes: int64(4), str(5)
memory usage: 21.2 KB


| | BytesSent | BytesReceived | LoginAttempts | Severity Level |
|---|---|---|---|---|
| count | 300.0 | 300.0 | 300.0 | 300.0 |
| mean | 2639.9 | 2676.48 | 7.926667 | 4.873333 |
| std | 1430.601453 | 1382.127899 | 4.045058 | 2.597392 |
| min | 103.0 | 140.0 | 1.0 | 1.0 |
| 25% | 1534.5 | 1486.75 | 5.0 | 2.75 |
| 50% | 2495.0 | 2815.0 | 8.0 | 5.0 |
| 75% | 3919.25 | 3745.75 | 11.0 | 7.0 |
| max | 4992.0 | 4995.0 | 14.0 | 9.0 |


## Step 2: Check for Missing Values

In [22]:
summary = "The dataset is complete, with no missing values."
print(textwrap.fill(summary, width=80))
df.isnull().sum()



The dataset is complete, with no missing values.


Timestamp         0
SourceIP          0
DestinationIP     0
Protocol          0
BytesSent         0
BytesReceived     0
LoginAttempts     0
Severity Level    0
AttackType        0
dtype: int64
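
The check above found no gaps, so nothing has to be filled or dropped in this lab. As a minimal sketch of what this step could look like when gaps do exist, the example below uses a small made-up frame (not the lab dataset). The fill strategies, median for the numeric column and most frequent value for the categorical column, are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; NOT the lab dataset.
toy = pd.DataFrame({
    "BytesSent": [100.0, np.nan, 300.0, 500.0],
    "Protocol": ["TCP", "UDP", None, "TCP"],
})

# Count missing values per column, as in the cell above.
missing_before = toy.isnull().sum()
print(missing_before)

# Assumed strategy: median for numeric data, most frequent value for categories.
cleaned = toy.copy()
cleaned["BytesSent"] = cleaned["BytesSent"].fillna(cleaned["BytesSent"].median())
cleaned["Protocol"] = cleaned["Protocol"].fillna(cleaned["Protocol"].mode()[0])

missing_after = cleaned.isnull().sum()
print(missing_after)
```

Dropping incomplete rows with `dropna()` is another option; which one fits depends on how many rows are affected and why the values are missing.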

## Step 3: Encode Categorical Variables

In [23]:
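summary = "values are encoded into numbers because machine learning models cannot do math on text"
print(textwrap.fill(summary, width=80))

# Use one encoder per column so each fitted mapping stays available later
# (for example, to turn predicted class numbers back into attack names).
protocol_encoder = LabelEncoder()
df['Protocol'] = protocol_encoder.fit_transform(df['Protocol'])
print(dict(zip(protocol_encoder.classes_, range(len(protocol_encoder.classes_)))))

attack_encoder = LabelEncoder()
df['AttackType'] = attack_encoder.fit_transform(df['AttackType'])
print(dict(zip(attack_encoder.classes_, range(len(attack_encoder.classes_)))))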
df.head()


values are encoded into numbers because machine learning models cannot do math
on text
{'ICMP': 0, 'TCP': 1, 'UDP': 2}
{'BruteForce': 0, 'DDoS': 1, 'Normal': 2, 'Phishing': 3}


| | Timestamp | SourceIP | DestinationIP | Protocol | BytesSent | BytesReceived | LoginAttempts | Severity Level | AttackType |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/2024 0:00 | 192.168.1.103 | 10.0.0.126 | 1 | 4585 | 2135 | 3 | 1 | 2 |
| 1 | 1/1/2024 0:05 | 192.168.1.180 | 10.0.0.130 | 2 | 355 | 577 | 8 | 9 | 0 |
| 2 | 1/1/2024 0:10 | 192.168.1.93 | 10.0.0.53 | 2 | 4904 | 1354 | 5 | 1 | 1 |
| 3 | 1/1/2024 0:15 | 192.168.1.15 | 10.0.0.172 | 2 | 1938 | 1022 | 1 | 8 | 2 |
| 4 | 1/1/2024 0:20 | 192.168.1.107 | 10.0.0.218 | 0 | 2044 | 519 | 7 | 2 | 3 |
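
Label encoding assigns integers in alphabetical order, so for a feature such as `Protocol` the codes imply an ordering (ICMP < TCP < UDP) that has no real meaning. Some models, linear ones in particular, may treat that ordering as signal. One common alternative for nominal features is one-hot encoding. The sketch below shows the idea on a few made-up protocol values (not the lab dataset) using `pd.get_dummies`; it is an illustration, not a change to the lab's approach.

```python
import pandas as pd

# Hypothetical sample of raw protocol values; NOT the lab dataset.
sample = pd.DataFrame({"Protocol": ["TCP", "UDP", "ICMP", "UDP"]})

# One-hot encode: one 0/1 column per category, so no order is implied.
one_hot = pd.get_dummies(sample, columns=["Protocol"], dtype=int)
print(one_hot)
```

Each row has exactly one `1` across the new columns. Label encoding remains a natural fit for the target column `AttackType`, where the integers are only class identifiers.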


## Step 4: Feature Engineering

Feature engineering creates new columns by transforming or combining existing variables. This gives the model stronger, more informative signals without collecting any new data. The two features below are built before scaling so they get normalized along with the rest of the numeric columns.

### Feature 1: TotalBytes

**What it represents:**
`TotalBytes` is the sum of `BytesSent` and `BytesReceived` for a single network event. It captures the full volume of data exchanged in both directions during one connection.

**Why it improves predictive signal:**
`BytesSent` and `BytesReceived` each tell only half the story. A model given both raw columns has to learn their combined effect on its own. Pre-computing the total gives the model a direct measure of overall traffic size, a single number that is likely to be informative for bandwidth-heavy attacks, which reduces what the model has to learn.

**How it supports cybersecurity analysis:**
Attacks like DDoS and data exfiltration generate unusually high total traffic. A large `TotalBytes` value is a quick flag that something abnormal may be happening on a connection, making it one of the most practical features for network anomaly detection.

In [24]:
# Feature 1: TotalBytes — total traffic volume per event
df['TotalBytes'] = df['BytesSent'] + df['BytesReceived']

print('TotalBytes sample values:')
print(df['TotalBytes'].describe())


TotalBytes sample values:
count     300.000000
mean     5316.380000
std      1954.198566
min       873.000000
25%      3874.000000
50%      5282.000000
75%      6687.000000
max      9355.000000
Name: TotalBytes, dtype: float64


### Feature 2: BytesPerLogin

**What it represents:**
`BytesPerLogin` is the ratio of total traffic volume to login attempts, computed as `TotalBytes / (LoginAttempts + 1)`. It approximates how much data is transferred per login attempt on a given connection; the `+ 1` keeps the denominator non-zero if an event ever records zero attempts.

**Why it improves predictive signal:**
This is a rate feature that combines two different domains, network traffic and authentication behavior, into one value. Many models cannot easily derive this relationship from `TotalBytes` and `LoginAttempts` separately, because the meaningful signal lies in their proportion, not their individual magnitudes.

**How it supports cybersecurity analysis:**
Different attack types can produce very different bytes-per-login ratios. Brute-force and credential-stuffing attacks generate many login attempts with small payloads, producing a low `BytesPerLogin`. A successful intrusion followed by data exfiltration involves few logins but large data transfers, giving a high `BytesPerLogin`. This makes the ratio a potentially useful discriminator across attack categories.

In [25]:
# Feature 2: BytesPerLogin, data transferred per login attempt (rate feature)
df['BytesPerLogin'] = df['TotalBytes'] / (df['LoginAttempts'] + 1)

print('BytesPerLogin sample values:')
print(df['BytesPerLogin'].describe())

BytesPerLogin sample values:
count     300.000000
mean      833.460350
std       678.279029
min        78.133333
25%       377.892857
50%       596.333333
75%      1046.000000
max      4500.500000
Name: BytesPerLogin, dtype: float64
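
To make the ratio concrete, here is a small worked example. The two events are invented for illustration (they are not rows from the lab dataset) and share the same total traffic, so `TotalBytes` alone could not tell them apart; the helper name `bytes_per_login` is likewise hypothetical.

```python
# Two made-up events with the same total traffic but different login counts.
events = {
    "many_logins_small_payload": {"total_bytes": 6000, "login_attempts": 14},
    "few_logins_large_payload": {"total_bytes": 6000, "login_attempts": 1},
}


def bytes_per_login(total_bytes, login_attempts):
    """Same formula as the notebook: the +1 keeps the denominator non-zero."""
    return total_bytes / (login_attempts + 1)


ratios = {name: bytes_per_login(**event) for name, event in events.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.1f}")
# many_logins_small_payload: 400.0
# few_logins_large_payload: 3000.0
```

The same 6000 bytes map to 400 or 3000 bytes per login depending on how many attempts accompanied them, which is the contrast the feature is meant to expose.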


## Step 5: Feature Scaling

In [26]:
summary = """StandardScaler computes the mean and standard deviation of each numeric column,
then rescales the column to zero mean and unit variance.
This puts all numeric columns on the same scale.
The four original numeric columns and the two engineered features
are scaled together so they are treated consistently."""
print(textwrap.fill(summary, width=80))

scaler = StandardScaler()
numeric_cols = ['BytesSent', 'BytesReceived', 'LoginAttempts', 'Severity Level', 'TotalBytes', 'BytesPerLogin']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
df.head()

StandardScaler computes the mean and standard deviation of each numeric column,
then rescales the column to zero mean and unit variance. This puts all numeric
columns on the same scale. The four original numeric columns and the two
engineered features are scaled together so they are treated consistently.


| | Timestamp | SourceIP | DestinationIP | Protocol | BytesSent | BytesReceived | LoginAttempts | Severity Level | AttackType | TotalBytes | BytesPerLogin |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/2024 0:00 | 192.168.1.103 | 10.0.0.126 | 1 | 1.36191 | -0.392427 | -1.219982 | -1.493731 | 2 | 0.719459 | 1.250155 |
| 1 | 1/1/2024 0:05 | 192.168.1.180 | 10.0.0.130 | 2 | -1.599829 | -1.521558 | 0.018159 | 1.591428 | 0 | -2.247318 | -1.077911 |
| 2 | 1/1/2024 0:10 | 192.168.1.93 | 10.0.0.53 | 2 | 1.585265 | -0.958442 | -0.724725 | -1.493731 | 1 | 0.48265 | 0.309445 |
| 3 | 1/1/2024 0:15 | 192.168.1.15 | 10.0.0.172 | 2 | -0.491453 | -1.199053 | -1.715239 | 1.205783 | 2 | -1.207818 | 0.954799 |
| 4 | 1/1/2024 0:20 | 192.168.1.107 | 10.0.0.218 | 0 | -0.417234 | -1.563592 | -0.229469 | -1.108086 | 3 | -1.41131 | -0.757716 |
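
The transformation applied to each column is the z-score, `(x - mean) / std`. As a quick check of that claim, the sketch below computes it by hand on a few made-up byte counts (not the lab dataset) and compares the result with `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical byte counts; NOT taken from the lab dataset.
values = np.array([[100.0], [200.0], [300.0], [400.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# StandardScaler applies the same formula column by column.
scaled = StandardScaler().fit_transform(values)

print(scaled.ravel())
print(np.allclose(manual, scaled))
```

After scaling, the column has mean 0 and standard deviation 1, which is why the scaled table above mixes positive and negative values.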


## Step 6: Train-Test Split

In [27]:
# Separate the features from the target. Timestamp and the two IP columns are
# left out because they are still raw strings at this point.
feature_cols = ['Protocol', 'BytesSent', 'BytesReceived', 'LoginAttempts',
                'Severity Level', 'TotalBytes', 'BytesPerLogin']
X = df[feature_cols]
y = df['AttackType']

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print('Training set:', X_train.shape)
print('Test set:', X_test.shape)

Training set: (240, 7)
Test set: (60, 7)
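
Because the scaler in Step 5 was fit on all 300 rows, any rows later held out for testing have already influenced the column means and standard deviations. A leakage-free ordering splits first and fits the scaler on the training rows only. The sketch below illustrates that ordering on randomly generated stand-in data (not the lab dataset); the 80/20 ratio, `random_state=42`, and the use of `stratify` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; NOT the lab dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "BytesSent": rng.integers(100, 5000, size=100),
    "LoginAttempts": rng.integers(1, 15, size=100),
    "AttackType": np.tile([0, 1, 2, 3], 25),
})

X = toy[["BytesSent", "LoginAttempts"]]
y = toy["AttackType"]

# Split first; stratify keeps the class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The test rows are transformed with the training set's statistics, so their scaled mean is generally close to, but not exactly, zero. That is expected and mirrors how unseen data would be handled in production.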


## Reflection Questions
1. Why is encoding necessary before modeling?
2. Why should scaling occur after splitting data?
3. How does preprocessing impact model performance?

1. Why is encoding necessary before modeling?
Your Answer: Most machine learning models operate on numeric arrays, so text categories such as `TCP` or `DDoS` must be mapped to numbers before the model can use them.

2. Why should scaling occur after splitting data?
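Your Answer: The scaler's mean and standard deviation should be computed from the training set only and then applied to the test set. If the scaler is fit before splitting, statistics from the test rows leak into training, and the test score no longer reflects performance on unseen data. Scaling puts features on a similar scale; it does not remove or prevent outliers.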

3. How does preprocessing impact model performance?
Your Answer: Preprocessing converts tabular data into a numeric form that models can compute on, puts features on comparable scales so that none dominates because of its units, and handles missing values that could otherwise distort training.