# Feature Engineering for Network Intrusion Detection

This notebook performs feature engineering on the BCCC-CSE-CIC-IDS2018 dataset.

## Objectives:
1. Load and preprocess raw network flow data
2. Handle missing values and outliers
3. Create derived features
4. Encode categorical variables
5. Scale numerical features
6. Handle class imbalance
7. Save processed features for model training

In [1]:
# Import libraries
import os
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

%matplotlib inline

## 1. Load Raw Data

In [2]:
# Load data
project_root = Path().resolve()
data_path = project_root / 'data' / 'raw' / 'friday_02_03_2018_combined_sample.csv'

df = pd.read_csv(data_path)

print(f"Loaded {len(df):,} records")
print(f"Number of features: {len(df.columns)}")
print(f"\nColumns: {list(df.columns)}")

Loaded 289,799 records
Number of features: 323

Columns: ['flow_id', 'timestamp', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'protocol', 'duration', 'packets_count', 'fwd_packets_count', 'bwd_packets_count', 'total_payload_bytes', 'fwd_total_payload_bytes', 'bwd_total_payload_bytes', 'payload_bytes_max', 'payload_bytes_min', 'payload_bytes_mean', 'payload_bytes_std', 'payload_bytes_variance', 'payload_bytes_median', 'payload_bytes_skewness', 'payload_bytes_cov', 'payload_bytes_mode', 'fwd_payload_bytes_max', 'fwd_payload_bytes_min', 'fwd_payload_bytes_mean', 'fwd_payload_bytes_std', 'fwd_payload_bytes_variance', 'fwd_payload_bytes_median', 'fwd_payload_bytes_skewness', 'fwd_payload_bytes_cov', 'fwd_payload_bytes_mode', 'bwd_payload_bytes_max', 'bwd_payload_bytes_min', 'bwd_payload_bytes_mean', 'bwd_payload_bytes_std', 'bwd_payload_bytes_variance', 'bwd_payload_bytes_median', 'bwd_payload_bytes_skewness', 'bwd_payload_bytes_cov', 'bwd_payload_bytes_mode', 'total_header_bytes', 'max_hea

In [3]:
# Check data types and missing values
print("Data Info:")
print(df.info())
print("\nMissing values:")
print(df.isnull().sum().sort_values(ascending=False).head(10))

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289799 entries, 0 to 289798
Columns: 323 entries, flow_id to label
dtypes: float64(259), int64(56), object(8)
memory usage: 714.2+ MB
None

Missing values:
payload_bytes_cov                  108358
fwd_payload_bytes_cov               75401
bwd_payload_bytes_cov               59023
bwd_packets_IAT_skewness            49868
bwd_packets_IAT_cov                 49868
fwd_packets_IAT_cov                 34082
fwd_packets_IAT_skewness            34027
cov_payload_bytes_delta_len         28349
cov_fwd_payload_bytes_delta_len     18905
cov_bwd_payload_bytes_delta_len     15945
dtype: int64


## 2. Data Cleaning

In [4]:
# Separate features and target
# Identify the label column (could be 'label', 'Label', etc.)
label_col = None
for col in ['label', 'Label', 'label_name']:
    if col in df.columns:
        label_col = col
        break

if label_col is None:
    raise ValueError("No label column found in dataset")

print(f"Using '{label_col}' as target variable")

# Separate features and labels
X = df.drop(columns=[label_col])
y = df[label_col]

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nClass distribution:\n{y.value_counts()}")

Using 'label' as target variable

Feature matrix shape: (289799, 322)
Target shape: (289799,)

Class distribution:
label
Benign    278864
Bot        10935
Name: count, dtype: int64

Feature matrix shape: (289799, 322)
Target shape: (289799,)

Class distribution:
label
Benign    278864
Bot        10935
Name: count, dtype: int64


In [5]:
# Handle missing values
print("Handling missing values...")

# Check for missing values
missing_counts = X.isnull().sum()
missing_cols = missing_counts[missing_counts > 0]

if len(missing_cols) > 0:
    print(f"\nColumns with missing values:\n{missing_cols}")
    
    # Strategy: Fill numeric columns with median, drop columns with >50% missing
    threshold = 0.5
    high_missing = missing_cols[missing_cols / len(X) > threshold]
    
    if len(high_missing) > 0:
        print(f"\nDropping columns with >{threshold*100}% missing: {list(high_missing.index)}")
        X = X.drop(columns=high_missing.index)
    
    # Fill remaining missing values with median for numeric columns
    numeric_cols = X.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if X[col].isnull().any():
            X[col] = X[col].fillna(X[col].median())
    
    print(f"\nRemaining missing values: {X.isnull().sum().sum()}")
else:
    print("No missing values found!")

print(f"\nFinal feature matrix shape: {X.shape}")

Handling missing values...

Columns with missing values:
payload_bytes_cov                  108358
fwd_payload_bytes_cov               75401
bwd_payload_bytes_cov               59023
packets_IAT_cov                       437
fwd_packets_IAT_skewness            34027
fwd_packets_IAT_cov                 34082
bwd_packets_IAT_skewness            49868
bwd_packets_IAT_cov                 49868
cov_packets_delta_time                437
cov_bwd_packets_delta_time              1
cov_fwd_packets_delta_time             55
cov_packets_delta_len                5696
cov_bwd_packets_delta_len            1398
cov_fwd_packets_delta_len            3102
cov_header_bytes_delta_len           7780
cov_bwd_header_bytes_delta_len       1464
cov_fwd_header_bytes_delta_len       4140
cov_payload_bytes_delta_len         28349
cov_bwd_payload_bytes_delta_len     15945
cov_fwd_payload_bytes_delta_len     18905
dtype: int64

Columns with missing values:
payload_bytes_cov                  108358
fwd_payload_bytes_

In [6]:
# Handle infinite values
print("Checking for infinite values...")

numeric_cols = X.select_dtypes(include=[np.number]).columns
inf_counts = {}

for col in numeric_cols:
    inf_count = np.isinf(X[col]).sum()
    if inf_count > 0:
        inf_counts[col] = inf_count
        # Replace inf with NaN, then fill with column median
        X[col] = X[col].replace([np.inf, -np.inf], np.nan)
        X[col] = X[col].fillna(X[col].median())

if inf_counts:
    print(f"\nReplaced infinite values in {len(inf_counts)} columns")
    for col, count in list(inf_counts.items())[:10]:
        print(f"  {col}: {count} infinite values")
else:
    print("No infinite values found!")

Checking for infinite values...

Replaced infinite values in 9 columns
  cov_packets_delta_len: 998 infinite values
  cov_bwd_packets_delta_len: 1093 infinite values
  cov_fwd_packets_delta_len: 141 infinite values
  cov_header_bytes_delta_len: 488 infinite values
  cov_bwd_header_bytes_delta_len: 1102 infinite values
  cov_fwd_header_bytes_delta_len: 63 infinite values
  cov_payload_bytes_delta_len: 204435 infinite values
  cov_bwd_payload_bytes_delta_len: 94260 infinite values
  cov_fwd_payload_bytes_delta_len: 183487 infinite values

Replaced infinite values in 9 columns
  cov_packets_delta_len: 998 infinite values
  cov_bwd_packets_delta_len: 1093 infinite values
  cov_fwd_packets_delta_len: 141 infinite values
  cov_header_bytes_delta_len: 488 infinite values
  cov_bwd_header_bytes_delta_len: 1102 infinite values
  cov_fwd_header_bytes_delta_len: 63 infinite values
  cov_payload_bytes_delta_len: 204435 infinite values
  cov_bwd_payload_bytes_delta_len: 94260 infinite values
  cov_

## 3. Feature Engineering

In [7]:
# ============================================================
# IMPROVED FEATURE ENGINEERING
# ============================================================
# Issues Fixed:
# 1. Port numbers treated as continuous (WRONG - they're categorical!)
# 2. Timestamp dropped (WRONG - temporal patterns are valuable!)
# ============================================================

print("="*80)
print("STEP 1: TEMPORAL FEATURE EXTRACTION")
print("="*80)

# Extract temporal features BEFORE dropping timestamp
if 'timestamp' in X.columns:
    print("Converting timestamp to datetime...")
    X['timestamp'] = pd.to_datetime(X['timestamp'])
    
    # Basic time features
    X['hour'] = X['timestamp'].dt.hour
    X['day_of_week'] = X['timestamp'].dt.dayofweek  # 0=Monday, 6=Sunday
    X['day_of_month'] = X['timestamp'].dt.day
    X['month'] = X['timestamp'].dt.month
    
    # Binary indicators for suspicious times
    X['is_business_hours'] = ((X['hour'] >= 9) & (X['hour'] <= 17)).astype(int)
    X['is_night'] = ((X['hour'] >= 0) & (X['hour'] <= 5)).astype(int)
    X['is_weekend'] = (X['day_of_week'] >= 5).astype(int)
    X['is_late_night'] = ((X['hour'] >= 22) | (X['hour'] <= 4)).astype(int)
    
    # Cyclical encoding (preserves continuity: 23:59 is close to 00:01)
    X['hour_sin'] = np.sin(2 * np.pi * X['hour'] / 24)
    X['hour_cos'] = np.cos(2 * np.pi * X['hour'] / 24)
    X['day_of_week_sin'] = np.sin(2 * np.pi * X['day_of_week'] / 7)
    X['day_of_week_cos'] = np.cos(2 * np.pi * X['day_of_week'] / 7)
    
    print(f"✓ Created temporal features:")
    print(f"  - hour (0-23), day_of_week (0-6), day_of_month, month")
    print(f"  - is_business_hours, is_night, is_weekend, is_late_night")
    print(f"  - hour_sin/cos, day_of_week_sin/cos (cyclical encoding)")
    
    # NOW drop the original timestamp
    X = X.drop(columns=['timestamp'])
    print(f"  ✓ Dropped original timestamp column")
else:
    print("⚠️  No timestamp column found")

print("="*80 + "\n")

# ============================================================
print("="*80)
print("STEP 2: PORT FEATURE ENGINEERING")
print("="*80)

if 'dst_port' in X.columns:
    print("Encoding destination port (target service)...")
    
    # Well-known ports (these are the ones attackers often target)
    X['dst_port_http'] = (X['dst_port'].isin([80, 8080, 8000, 8888])).astype(int)
    X['dst_port_https'] = (X['dst_port'] == 443).astype(int)
    X['dst_port_ssh'] = (X['dst_port'] == 22).astype(int)
    X['dst_port_ftp'] = (X['dst_port'].isin([20, 21])).astype(int)
    X['dst_port_smtp'] = (X['dst_port'].isin([25, 587, 465])).astype(int)
    X['dst_port_dns'] = (X['dst_port'] == 53).astype(int)
    X['dst_port_telnet'] = (X['dst_port'] == 23).astype(int)
    X['dst_port_smb'] = (X['dst_port'].isin([139, 445])).astype(int)
    X['dst_port_rdp'] = (X['dst_port'] == 3389).astype(int)
    X['dst_port_mysql'] = (X['dst_port'] == 3306).astype(int)
    X['dst_port_postgres'] = (X['dst_port'] == 5432).astype(int)
    
    # Port range categories
    def categorize_dst_port(port):
        if port < 1024:
            return 'well_known'
        elif port < 49152:
            return 'registered'
        else:
            return 'ephemeral'
    
    X['dst_port_category'] = X['dst_port'].apply(categorize_dst_port)
    
    # One-hot encode the category
    dst_port_dummies = pd.get_dummies(X['dst_port_category'], prefix='dst_port_cat')
    X = pd.concat([X, dst_port_dummies], axis=1)
    X = X.drop(columns=['dst_port_category'])
    
    print(f"✓ Created destination port features:")
    print(f"  - Binary flags for common services: http, https, ssh, ftp, smtp, dns, etc.")
    print(f"  - Port range categories: well_known, registered, ephemeral")
    
    # Keep original dst_port for model to learn additional patterns
    print(f"  ✓ Keeping original dst_port column (will be scaled later)")
    
else:
    print("⚠️  No dst_port column found")

if 'src_port' in X.columns:
    print("\nHandling source port...")
    
    # Source port is usually ephemeral (random), less predictive
    # But we can create some useful features
    X['src_port_is_privileged'] = (X['src_port'] < 1024).astype(int)
    X['src_port_is_ephemeral'] = (X['src_port'] >= 49152).astype(int)
    
    print(f"✓ Created source port features:")
    print(f"  - src_port_is_privileged (<1024)")
    print(f"  - src_port_is_ephemeral (>=49152)")
    print(f"  ✓ Keeping original src_port column (will be scaled later)")

print("="*80 + "\n")

# ============================================================
print("="*80)
print("STEP 3: PROTOCOL ENCODING")
print("="*80)

if 'protocol' in X.columns:
    print(f"Protocol values: {X['protocol'].unique()}")
    
    # Common protocols
    X['protocol_tcp'] = (X['protocol'].str.upper() == 'TCP').astype(int)
    X['protocol_udp'] = (X['protocol'].str.upper() == 'UDP').astype(int)
    X['protocol_icmp'] = (X['protocol'].str.upper() == 'ICMP').astype(int)
    
    # Drop original protocol column
    X = X.drop(columns=['protocol'])
    
    print(f"✓ Created protocol features:")
    print(f"  - protocol_tcp, protocol_udp, protocol_icmp")
    print(f"  ✓ Dropped original protocol column")
else:
    print("⚠️  No protocol column found")

print("="*80 + "\n")

# ============================================================
print("="*80)
print("STEP 4: DROP IDENTIFIER COLUMNS")
print("="*80)

identifier_cols = ['flow_id', 'src_ip', 'dst_ip']
to_drop = [col for col in identifier_cols if col in X.columns]

if to_drop:
    print(f"Dropping identifier columns: {to_drop}")
    X = X.drop(columns=to_drop)
else:
    print("No identifier columns to drop")

# Drop any remaining non-numeric columns
non_numeric_cols = X.select_dtypes(exclude=[np.number]).columns
if len(non_numeric_cols) > 0:
    print(f"\n⚠️  Dropping remaining non-numeric columns: {list(non_numeric_cols)}")
    X = X.drop(columns=non_numeric_cols)

print(f"\n✓ Final feature count: {X.shape[1]}")
print(f"✓ All features are numeric: {X.select_dtypes(include=[np.number]).shape[1] == X.shape[1]}")
print("="*80 + "\n")

# ============================================================
print("="*80)
print("FEATURE ENGINEERING SUMMARY")
print("="*80)

# Count feature types
temporal_features = [col for col in X.columns if any(t in col.lower() for t in ['hour', 'day', 'night', 'weekend', 'business'])]
port_features = [col for col in X.columns if 'port' in col.lower()]
protocol_features = [col for col in X.columns if 'protocol' in col.lower()]
original_features = [col for col in X.columns if col not in temporal_features + port_features + protocol_features]

print(f"Total features: {X.shape[1]}")
print(f"  - Temporal features: {len(temporal_features)}")
print(f"  - Port features: {len(port_features)}")
print(f"  - Protocol features: {len(protocol_features)}")
print(f"  - Original/derived features: {len(original_features)}")
print("="*80 + "\n")

STEP 1: TEMPORAL FEATURE EXTRACTION
Converting timestamp to datetime...
✓ Created temporal features:
  - hour (0-23), day_of_week (0-6), day_of_month, month
  - is_business_hours, is_night, is_weekend, is_late_night
  - hour_sin/cos, day_of_week_sin/cos (cyclical encoding)
  ✓ Dropped original timestamp column

STEP 2: PORT FEATURE ENGINEERING
Encoding destination port (target service)...
✓ Created destination port features:
  - Binary flags for common services: http, https, ssh, ftp, smtp, dns, etc.
  - Port range categories: well_known, registered, ephemeral
  ✓ Keeping original dst_port column (will be scaled later)

Handling source port...
✓ Created source port features:
  - src_port_is_privileged (<1024)
  - src_port_is_ephemeral (>=49152)
  ✓ Keeping original src_port column (will be scaled later)

STEP 3: PROTOCOL ENCODING
Protocol values: ['TCP']
✓ Created protocol features:
  - protocol_tcp, protocol_udp, protocol_icmp
  ✓ Dropped original protocol column

STEP 4: DROP IDENTIF

## 4. Feature Scaling

In [8]:
# Encode target labels
print("Encoding target labels...")
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(f"\nLabel mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {label}: {i}")

# Convert to binary if needed (benign vs attack)
y_binary = (y != 'Benign').astype(int)
print(f"\nBinary distribution (0=Benign, 1=Attack):")
print(pd.Series(y_binary).value_counts())

Encoding target labels...

Label mapping:
  Benign: 0
  Bot: 1

Binary distribution (0=Benign, 1=Attack):
label
0    278864
1     10935
Name: count, dtype: int64


In [9]:
# Split data before scaling to prevent data leakage
print("Splitting data into train/test sets...")

X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTrain class distribution:\n{pd.Series(y_train).value_counts()}")
print(f"\nTest class distribution:\n{pd.Series(y_test).value_counts()}")

Splitting data into train/test sets...
Training set: (231839, 343)
Test set: (57960, 343)

Train class distribution:
label
0    223091
1      8748
Name: count, dtype: int64

Test class distribution:
label
0    55773
1     2187
Name: count, dtype: int64
Training set: (231839, 343)
Test set: (57960, 343)

Train class distribution:
label
0    223091
1      8748
Name: count, dtype: int64

Test class distribution:
label
0    55773
1     2187
Name: count, dtype: int64


In [10]:
# ============================================================
# SMART FEATURE SCALING
# ============================================================
# Don't scale binary/categorical features (already 0/1)
# Only scale continuous features
# ============================================================

print("="*80)
print("IDENTIFYING FEATURES TO SCALE")
print("="*80)

# Identify binary features that should NOT be scaled
binary_feature_patterns = [
    'is_business_hours', 'is_night', 'is_weekend', 'is_late_night',
    'dst_port_http', 'dst_port_https', 'dst_port_ssh', 'dst_port_ftp',
    'dst_port_smtp', 'dst_port_dns', 'dst_port_telnet', 'dst_port_smb',
    'dst_port_rdp', 'dst_port_mysql', 'dst_port_postgres',
    'protocol_tcp', 'protocol_udp', 'protocol_icmp',
    'src_port_is_privileged', 'src_port_is_ephemeral',
]

# Also don't scale one-hot encoded features
one_hot_patterns = ['dst_port_cat_', 'protocol_']

# Find features that should NOT be scaled
no_scale_features = []
for col in X_train.columns:
    # Check if it matches any binary pattern
    if col in binary_feature_patterns:
        no_scale_features.append(col)
    # Check if it's a one-hot encoded feature
    elif any(pattern in col for pattern in one_hot_patterns):
        no_scale_features.append(col)

# Features TO scale (continuous variables)
scale_features = [col for col in X_train.columns if col not in no_scale_features]

print(f"Features TO SCALE (continuous): {len(scale_features)}")
print(f"Features NOT to scale (binary/categorical): {len(no_scale_features)}")

if len(scale_features) > 0:
    print(f"\nSample continuous features to scale:")
    for feat in scale_features[:10]:
        print(f"  - {feat}")

if len(no_scale_features) > 0:
    print(f"\nBinary/categorical features (keeping as 0/1):")
    for feat in no_scale_features[:15]:
        print(f"  - {feat}")
    if len(no_scale_features) > 15:
        print(f"  ... and {len(no_scale_features) - 15} more")

print("="*80 + "\n")

# ============================================================
print("="*80)
print("APPLYING STANDARDSCALER")
print("="*80)

# Initialize scaler
scaler = StandardScaler()

# Create copies to avoid modifying original
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Scale ONLY the continuous features
if len(scale_features) > 0:
    print(f"Scaling {len(scale_features)} continuous features...")
    X_train_scaled[scale_features] = scaler.fit_transform(X_train[scale_features])
    X_test_scaled[scale_features] = scaler.transform(X_test[scale_features])
    print("✓ Scaling complete")
else:
    print("⚠️  No continuous features to scale")

# Binary features remain unchanged (already 0/1)
if len(no_scale_features) > 0:
    print(f"✓ {len(no_scale_features)} binary/categorical features kept as 0/1")

print(f"\nScaled training set: {X_train_scaled.shape}")
print(f"Scaled test set: {X_test_scaled.shape}")

print("="*80 + "\n")

# ============================================================
print("="*80)
print("VERIFICATION: Sample of scaled vs unscaled features")
print("="*80)

# Show example of a scaled continuous feature
if len(scale_features) > 0:
    example_continuous = scale_features[0]
    print(f"\nContinuous feature '{example_continuous}':")
    print(f"  Original range: [{X_train[example_continuous].min():.3f}, {X_train[example_continuous].max():.3f}]")
    print(f"  Scaled range: [{X_train_scaled[example_continuous].min():.3f}, {X_train_scaled[example_continuous].max():.3f}]")
    print(f"  Scaled mean: {X_train_scaled[example_continuous].mean():.3f} (should be ~0)")
    print(f"  Scaled std: {X_train_scaled[example_continuous].std():.3f} (should be ~1)")

# Show example of a binary feature
if len(no_scale_features) > 0:
    example_binary = no_scale_features[0]
    print(f"\nBinary feature '{example_binary}':")
    print(f"  Original values: {X_train[example_binary].unique()}")
    print(f"  Scaled values: {X_train_scaled[example_binary].unique()}")
    print(f"  ✓ Binary features remain unchanged!")

print("="*80 + "\n")

# Show sample of final scaled data
print("Sample of scaled features (first 5 rows, first 10 columns):")
print(X_train_scaled.iloc[:5, :10])

IDENTIFYING FEATURES TO SCALE
Features TO SCALE (continuous): 323
Features NOT to scale (binary/categorical): 20

Sample continuous features to scale:
  - src_port
  - dst_port
  - duration
  - packets_count
  - fwd_packets_count
  - bwd_packets_count
  - total_payload_bytes
  - fwd_total_payload_bytes
  - bwd_total_payload_bytes
  - payload_bytes_max

Binary/categorical features (keeping as 0/1):
  - is_business_hours
  - is_night
  - is_weekend
  - is_late_night
  - dst_port_http
  - dst_port_https
  - dst_port_ssh
  - dst_port_ftp
  - dst_port_smtp
  - dst_port_dns
  - dst_port_telnet
  - dst_port_smb
  - dst_port_rdp
  - dst_port_mysql
  - dst_port_postgres
  ... and 5 more

APPLYING STANDARDSCALER
Scaling 323 continuous features...
Scaling 323 continuous features...
✓ Scaling complete
✓ 20 binary/categorical features kept as 0/1

Scaled training set: (231839, 343)
Scaled test set: (57960, 343)

VERIFICATION: Sample of scaled vs unscaled features

Continuous feature 'src_port':
  O

In [11]:
# ============================================================
# FEATURE VALIDATION
# ============================================================
# Quick sanity checks on the engineered features
# ============================================================

print("="*80)
print("VALIDATING ENGINEERED FEATURES")
print("="*80)

# Check for any NaN or infinite values
nan_count = X_train_scaled.isna().sum().sum()
inf_count = np.isinf(X_train_scaled.select_dtypes(include=[np.number])).sum().sum()

print(f"\n✓ NaN values: {nan_count}")
print(f"✓ Infinite values: {inf_count}")

if nan_count > 0 or inf_count > 0:
    print("⚠️  WARNING: Found NaN or infinite values!")
else:
    print("✓ All features are valid!")

# Show distribution of some key features
print("\n" + "="*80)
print("SAMPLE FEATURE DISTRIBUTIONS")
print("="*80)

# Temporal features
temporal_features = [col for col in X_train_scaled.columns if any(t in col.lower() for t in ['hour', 'day', 'night', 'weekend'])]
if temporal_features:
    print(f"\nTemporal feature distributions:")
    sample_temporal = temporal_features[:3]
    for feat in sample_temporal:
        print(f"\n{feat}:")
        print(X_train_scaled[feat].value_counts().head())

# Port features
port_features = [col for col in X_train_scaled.columns if 'dst_port' in col.lower() and any(s in col for s in ['http', 'ssh', 'https'])]
if port_features:
    print(f"\nPort feature distributions:")
    for feat in port_features[:3]:
        print(f"\n{feat}:")
        print(X_train_scaled[feat].value_counts())

print("\n" + "="*80)
print("✓ FEATURE ENGINEERING COMPLETE!")
print("="*80)
print(f"\nFinal dataset ready for training:")
print(f"  Training samples: {X_train_scaled.shape[0]:,}")
print(f"  Test samples: {X_test_scaled.shape[0]:,}")
print(f"  Total features: {X_train_scaled.shape[1]}")
print(f"  Class distribution (train): {pd.Series(y_train).value_counts().to_dict()}")
print("="*80)

VALIDATING ENGINEERED FEATURES

✓ NaN values: 0
✓ Infinite values: 0
✓ All features are valid!

SAMPLE FEATURE DISTRIBUTIONS

Temporal feature distributions:

hour:
hour
0.0    231839
Name: count, dtype: int64

day_of_week:
day_of_week
0.0    231839
Name: count, dtype: int64

day_of_month:
day_of_month
0.0    231839
Name: count, dtype: int64

Port feature distributions:

dst_port_http:
dst_port_http
0    198839
1     33000
Name: count, dtype: int64

dst_port_https:
dst_port_https
0    180625
1     51214
Name: count, dtype: int64

dst_port_ssh:
dst_port_ssh
0    230976
1       863
Name: count, dtype: int64

✓ FEATURE ENGINEERING COMPLETE!

Final dataset ready for training:
  Training samples: 231,839
  Test samples: 57,960
  Total features: 343
  Class distribution (train): {0: 223091, 1: 8748}

✓ NaN values: 0
✓ Infinite values: 0
✓ All features are valid!

SAMPLE FEATURE DISTRIBUTIONS

Temporal feature distributions:

hour:
hour
0.0    231839
Name: count, dtype: int64

day_of_week:
da

## 5. Save Processed Data

In [12]:
# Save processed data
print("Saving processed data...")

processed_dir = project_root / 'data' / 'processed'
processed_dir.mkdir(parents=True, exist_ok=True)

# Save train/test splits
X_train_scaled.to_csv(processed_dir / 'X_train.csv', index=False)
X_test_scaled.to_csv(processed_dir / 'X_test.csv', index=False)
pd.Series(y_train, name='label').to_csv(processed_dir / 'y_train.csv', index=False)
pd.Series(y_test, name='label').to_csv(processed_dir / 'y_test.csv', index=False)

# Save scaler and label encoder for later use
import joblib
joblib.dump(scaler, processed_dir / 'scaler.pkl')
joblib.dump(label_encoder, processed_dir / 'label_encoder.pkl')

print(f"\nProcessed data saved to: {processed_dir}")
print("Files created:")
print("  - X_train.csv")
print("  - X_test.csv")
print("  - y_train.csv")
print("  - y_test.csv")
print("  - scaler.pkl")
print("  - label_encoder.pkl")

Saving processed data...

Processed data saved to: /Users/matthewweaver/Repositories/nidstream/data/processed
Files created:
  - X_train.csv
  - X_test.csv
  - y_train.csv
  - y_test.csv
  - scaler.pkl
  - label_encoder.pkl

Processed data saved to: /Users/matthewweaver/Repositories/nidstream/data/processed
Files created:
  - X_train.csv
  - X_test.csv
  - y_train.csv
  - y_test.csv
  - scaler.pkl
  - label_encoder.pkl


## 6. Summary Statistics

In [13]:
# Summary of feature engineering process
print("=" * 60)
print("FEATURE ENGINEERING SUMMARY")
print("=" * 60)
print(f"\nOriginal dataset: {len(df):,} samples, {len(df.columns)} features")
print(f"Final feature count: {X_train_scaled.shape[1]}")
print(f"\nTraining set: {len(X_train_scaled):,} samples")
print(f"Test set: {len(X_test_scaled):,} samples")
print(f"\nClass distribution (train):")
print(f"  Benign: {(y_train == 0).sum():,} ({(y_train == 0).sum() / len(y_train) * 100:.1f}%)")
print(f"  Attack: {(y_train == 1).sum():,} ({(y_train == 1).sum() / len(y_train) * 100:.1f}%)")
print(f"\nData ready for model training!")
print("=" * 60)

FEATURE ENGINEERING SUMMARY

Original dataset: 289,799 samples, 323 features
Final feature count: 343

Training set: 231,839 samples
Test set: 57,960 samples

Class distribution (train):
  Benign: 223,091 (96.2%)
  Attack: 8,748 (3.8%)

Data ready for model training!


## Next Steps

The processed data is now ready for:
1. Model training in `03_model_training.ipynb`
2. Hyperparameter tuning
3. Model evaluation and comparison

**Note:** You may want to:
- Apply SMOTE or other techniques for class imbalance
- Perform feature selection to reduce dimensionality
- Experiment with different scaling methods
- Create more domain-specific features based on network traffic analysis