# Feature Engineering for Network Intrusion Detection

This notebook performs feature engineering on the BCCC-CSE-CIC-IDS2018 dataset.

## Objectives:
1. Load and preprocess raw network flow data
2. Handle missing values and outliers
3. Create derived features
4. Encode the target labels and drop non-numeric identifier columns
5. Scale numerical features
6. Review class imbalance (resampling is left to model training)
7. Save processed features for model training

In [1]:
# Import libraries
import os
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

%matplotlib inline

## 1. Load Raw Data

In [2]:
# Load data
project_root = Path().resolve()
data_path = project_root / 'data' / 'raw' / 'friday_02_03_2018_combined_sample.csv'

df = pd.read_csv(data_path)

print(f"Loaded {len(df):,} records")
print(f"Number of features: {len(df.columns)}")
print(f"\nColumns: {list(df.columns)}")

Loaded 238,700 records
Number of features: 323

Columns: ['flow_id', 'timestamp', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'protocol', 'duration', 'packets_count', 'fwd_packets_count', 'bwd_packets_count', 'total_payload_bytes', 'fwd_total_payload_bytes', 'bwd_total_payload_bytes', 'payload_bytes_max', 'payload_bytes_min', 'payload_bytes_mean', 'payload_bytes_std', 'payload_bytes_variance', 'payload_bytes_median', 'payload_bytes_skewness', 'payload_bytes_cov', 'payload_bytes_mode', 'fwd_payload_bytes_max', 'fwd_payload_bytes_min', 'fwd_payload_bytes_mean', 'fwd_payload_bytes_std', 'fwd_payload_bytes_variance', 'fwd_payload_bytes_median', 'fwd_payload_bytes_skewness', 'fwd_payload_bytes_cov', 'fwd_payload_bytes_mode', 'bwd_payload_bytes_max', 'bwd_payload_bytes_min', 'bwd_payload_bytes_mean', 'bwd_payload_bytes_std', 'bwd_payload_bytes_variance', 'bwd_payload_bytes_median', 'bwd_payload_bytes_skewness', 'bwd_payload_bytes_cov', 'bwd_payload_bytes_mode', 'total_header_bytes', 'max_hea

In [3]:
# Check data types and missing values
print("Data Info:")
print(df.info())
print("\nMissing values:")
print(df.isnull().sum().sort_values(ascending=False).head(10))

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238700 entries, 0 to 238699
Columns: 323 entries, flow_id to label
dtypes: float64(259), int64(56), object(8)
memory usage: 588.2+ MB
None

Missing values:
payload_bytes_cov                  93970
fwd_payload_bytes_cov              67430
bwd_payload_bytes_cov              49274
bwd_packets_IAT_cov                45170
bwd_packets_IAT_skewness           45170
fwd_packets_IAT_cov                27632
fwd_packets_IAT_skewness           27580
cov_payload_bytes_delta_len        24915
cov_fwd_payload_bytes_delta_len    15871
cov_bwd_payload_bytes_delta_len    13026
dtype: int64

## 2. Data Cleaning

In [4]:
# Separate features and target
# Identify the label column (could be 'label', 'Label', etc.)
label_col = None
for col in ['label', 'Label', 'label_name']:
    if col in df.columns:
        label_col = col
        break

if label_col is None:
    raise ValueError("No label column found in dataset")

print(f"Using '{label_col}' as target variable")

# Separate features and labels
X = df.drop(columns=[label_col])
y = df[label_col]

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nClass distribution:\n{y.value_counts()}")

Using 'label' as target variable

Feature matrix shape: (238700, 322)
Target shape: (238700,)

Class distribution:
label
Benign    227764
Bot        10935
Name: count, dtype: int64


In [5]:
# Handle missing values
print("Handling missing values...")

# Check for missing values
missing_counts = X.isnull().sum()
missing_cols = missing_counts[missing_counts > 0]

if len(missing_cols) > 0:
    print(f"\nColumns with missing values:\n{missing_cols}")
    
    # Strategy: drop columns with >50% missing, then fill remaining numeric gaps with the median
    threshold = 0.5
    high_missing = missing_cols[missing_cols / len(X) > threshold]
    
    if len(high_missing) > 0:
        print(f"\nDropping columns with >{threshold*100}% missing: {list(high_missing.index)}")
        X = X.drop(columns=high_missing.index)
    
    # Fill remaining missing values with median for numeric columns
    numeric_cols = X.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if X[col].isnull().any():
            X[col] = X[col].fillna(X[col].median())
    
    print(f"\nRemaining missing values: {X.isnull().sum().sum()}")
else:
    print("No missing values found!")

print(f"\nFinal feature matrix shape: {X.shape}")

Handling missing values...

Columns with missing values:
payload_bytes_cov                       93970
fwd_payload_bytes_cov                   67430
bwd_payload_bytes_cov                   49274
packets_IAT_cov                           420
fwd_packets_IAT_skewness                27580
                                        ...  
variance_fwd_payload_bytes_delta_len        1
std_fwd_payload_bytes_delta_len             1
median_fwd_payload_bytes_delta_len          1
skewness_fwd_payload_bytes_delta_len        1
cov_fwd_payload_bytes_delta_len         15871
Length: 65, dtype: int64

Remaining missing values: 0

Final feature matrix shape: (238700, 322)
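
The cell above computes each median over the full dataset, before the train/test split in section 4, so test rows influence the fill values. The following is a minimal sketch of one alternative, assuming scikit-learn's `SimpleImputer` is acceptable here. The helper name `impute_train_test` and the toy frames are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_train_test(X_train, X_test):
    """Fill NaNs with medians learned from the training split only (hypothetical helper).

    Assumes no column is entirely NaN in the training split, because
    SimpleImputer discards such columns by default.
    """
    imputer = SimpleImputer(strategy="median")
    train_filled = pd.DataFrame(
        imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
    )
    test_filled = pd.DataFrame(
        imputer.transform(X_test), columns=X_test.columns, index=X_test.index
    )
    return train_filled, test_filled, imputer


# Toy data standing in for the real feature matrix
demo_train = pd.DataFrame({"payload_bytes_cov": [1.0, np.nan, 3.0]})
demo_test = pd.DataFrame({"payload_bytes_cov": [np.nan, 10.0]})
demo_train_filled, demo_test_filled, demo_imputer = impute_train_test(demo_train, demo_test)
```

With this approach the split would have to happen before imputation, and the fitted imputer could be saved alongside the scaler.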


In [6]:
# Handle infinite values
print("Checking for infinite values...")

numeric_cols = X.select_dtypes(include=[np.number]).columns
inf_counts = {}

for col in numeric_cols:
    inf_count = np.isinf(X[col]).sum()
    if inf_count > 0:
        inf_counts[col] = inf_count
        # Replace inf with NaN, then fill with column median
        X[col] = X[col].replace([np.inf, -np.inf], np.nan)
        X[col] = X[col].fillna(X[col].median())

if inf_counts:
    print(f"\nReplaced infinite values in {len(inf_counts)} columns")
    for col, count in list(inf_counts.items())[:10]:
        print(f"  {col}: {count} infinite values")
else:
    print("No infinite values found!")

Checking for infinite values...

Replaced infinite values in 9 columns
  cov_packets_delta_len: 890 infinite values
  cov_bwd_packets_delta_len: 980 infinite values
  cov_fwd_packets_delta_len: 135 infinite values
  cov_header_bytes_delta_len: 402 infinite values
  cov_bwd_header_bytes_delta_len: 990 infinite values
  cov_fwd_header_bytes_delta_len: 60 infinite values
  cov_payload_bytes_delta_len: 164473 infinite values
  cov_bwd_payload_bytes_delta_len: 72263 infinite values
  cov_fwd_payload_bytes_delta_len: 147095 infinite values
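
Because infinities are replaced only after the missing-value cell has run, the 50% drop rule never sees them. In the output above, `cov_payload_bytes_delta_len` has 164,473 infinite values on top of the 24,915 NaNs reported earlier, roughly 79% of the 238,700 rows. A minimal sketch, assuming infinities should count as missing when deciding which columns to drop. `drop_sparse_columns` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(X, threshold=0.5):
    """Treat +/-inf as missing, then drop columns whose missing share exceeds threshold."""
    X = X.replace([np.inf, -np.inf], np.nan)
    missing_share = X.isna().mean()  # fraction of NaN per column
    to_drop = missing_share[missing_share > threshold].index.tolist()
    return X.drop(columns=to_drop), to_drop


# Toy frame: the first column is 75% inf/NaN, the second 25%
demo = pd.DataFrame({
    "cov_payload_bytes_delta_len": [np.inf, np.inf, np.nan, 0.4],
    "duration": [1.0, 2.0, np.inf, 4.0],
})
demo_clean, demo_dropped = drop_sparse_columns(demo)
```

The surviving NaNs can then be imputed in a single pass instead of two separate median fills.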


## 3. Feature Engineering

In [7]:
# Create derived features if relevant columns exist
print("Creating derived features...")

# Check if we have forward/backward packet columns
fwd_cols = [c for c in X.columns if 'fwd' in c.lower()]
bwd_cols = [c for c in X.columns if 'bwd' in c.lower()]

print(f"Found {len(fwd_cols)} forward features and {len(bwd_cols)} backward features")

# Example derived features (customize based on your data)
derived_features = []

# Add timestamp if you have it (for temporal features)
if 'timestamp' in X.columns:
    X['timestamp'] = pd.to_datetime(X['timestamp'])
    X['hour'] = X['timestamp'].dt.hour
    X['day_of_week'] = X['timestamp'].dt.dayofweek
    derived_features.extend(['hour', 'day_of_week'])
    X = X.drop(columns=['timestamp'])

print(f"\nCreated {len(derived_features)} derived features")

# Drop non-numeric columns (like flow IDs, IP addresses, etc.)
non_numeric_cols = X.select_dtypes(exclude=[np.number]).columns
if len(non_numeric_cols) > 0:
    print(f"\nDropping {len(non_numeric_cols)} non-numeric columns:")
    for col in non_numeric_cols:
        print(f"  - {col}")
    X = X.drop(columns=non_numeric_cols)

print(f"\nFinal feature count: {X.shape[1]}")
print(f"All features are numeric: {X.select_dtypes(include=[np.number]).shape[1] == X.shape[1]}")

Creating derived features...
Found 104 forward features and 104 backward features

Created 2 derived features

Dropping 6 non-numeric columns:
  - flow_id
  - src_ip
  - dst_ip
  - protocol
  - delta_start
  - handshake_duration

Final feature count: 317
All features are numeric: True
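
The cell above counts forward and backward columns but derives only the two timestamp features. As an illustration of the kind of domain feature it leaves room for, here is a minimal sketch of direction ratios built from columns that appear in the column listing (`fwd_packets_count`, `bwd_packets_count`, `fwd_total_payload_bytes`, `bwd_total_payload_bytes`). The helper `add_direction_ratios` and the new column names are hypothetical, and mapping a zero denominator to 0 is an assumption rather than the notebook's choice.

```python
import numpy as np
import pandas as pd


def add_direction_ratios(X):
    """Add forward/backward ratio features; a zero denominator yields 0."""
    X = X.copy()
    pairs = {
        "fwd_bwd_packet_ratio": ("fwd_packets_count", "bwd_packets_count"),
        "fwd_bwd_payload_ratio": ("fwd_total_payload_bytes", "bwd_total_payload_bytes"),
    }
    for new_col, (num, den) in pairs.items():
        if num in X.columns and den in X.columns:
            # Replace 0 with NaN so the division never produces inf
            denominator = X[den].replace(0, np.nan)
            X[new_col] = (X[num] / denominator).fillna(0.0)
    return X


demo = pd.DataFrame({
    "fwd_packets_count": [4, 3],
    "bwd_packets_count": [2, 0],
    "fwd_total_payload_bytes": [100, 50],
    "bwd_total_payload_bytes": [400, 0],
})
demo_out = add_direction_ratios(demo)
```

Such features would need to be added before the train/test split and scaling so that the scaler sees them.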


## 4. Target Encoding, Train/Test Split, and Scaling

In [8]:
# Encode target labels
print("Encoding target labels...")
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(f"\nLabel mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {label}: {i}")

# Convert to binary (benign vs. attack).
# Caution: a row with a missing label compares unequal to 'Benign',
# so it is counted as an attack (1).
y_binary = (y != 'Benign').astype(int)
print(f"\nBinary distribution (0=Benign, 1=Attack):")
print(pd.Series(y_binary).value_counts())

Encoding target labels...

Label mapping:
  Benign: 0
  Bot: 1
  nan: 2

Binary distribution (0=Benign, 1=Attack):
label
0    227764
1     10936
Name: count, dtype: int64
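
The label mapping above lists a `nan` class, and the binary count of 10,936 attacks is one higher than the 10,935 `Bot` rows in the class distribution, because a missing label compares unequal to `'Benign'`. A minimal sketch of one way to handle this, assuming unlabeled rows should simply be dropped before encoding. `drop_unlabeled` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def drop_unlabeled(X, y):
    """Remove rows whose label is missing, keeping X and y aligned."""
    mask = y.notna()
    return X.loc[mask], y.loc[mask]


# Toy data: the middle row has no label
demo_X = pd.DataFrame({"duration": [0.1, 0.2, 0.3]})
demo_y = pd.Series(["Benign", np.nan, "Bot"], name="label")

demo_X_clean, demo_y_clean = drop_unlabeled(demo_X, demo_y)
demo_binary = (demo_y_clean != "Benign").astype(int)
```

Applied right after `X` and `y` are separated, this would also keep the `nan` class out of the fitted `LabelEncoder`.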


In [9]:
# Split data before scaling to prevent data leakage
print("Splitting data into train/test sets...")

X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTrain class distribution:\n{pd.Series(y_train).value_counts()}")
print(f"\nTest class distribution:\n{pd.Series(y_test).value_counts()}")

Splitting data into train/test sets...
Training set: (190960, 317)
Test set: (47740, 317)

Train class distribution:
label
0    182211
1      8749
Name: count, dtype: int64

Test class distribution:
label
0    45553
1     2187
Name: count, dtype: int64


In [10]:
# Scale features using StandardScaler
print("Scaling features...")

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame to preserve column names
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print(f"Scaled training set: {X_train_scaled.shape}")
print(f"Scaled test set: {X_test_scaled.shape}")

# Show sample of scaled data
print("\nSample of scaled features:")
print(X_train_scaled.head())

Scaling features...
Scaled training set: (190960, 317)
Scaled test set: (47740, 317)

Sample of scaled features:
       src_port  dst_port  duration  packets_count  fwd_packets_count  \
88969  1.142225 -0.323891 -0.067872      -0.003076          -0.001120   
92666  0.423515 -0.494147 -0.138858      -0.028874          -0.035488   
71125  0.441839 -0.494147  0.453939       0.021365           0.025611   
58109  0.428884 -0.494147 -0.138858      -0.028874          -0.035488   
12249  0.449523 -0.494147 -0.138858      -0.028874          -0.035488   

       bwd_packets_count  total_payload_bytes  fwd_total_payload_bytes  \
88969          -0.004014            -0.009909                 0.022107   
92666          -0.024367            -0.013745                -0.080952   
71125           0.018375            -0.001605                 0.025878   
58109          -0.024367            -0.013745                -0.080952   
12249          -0.024367            -0.013745                -0.080952   

   

## 5. Save Processed Data

In [11]:
# Save processed data
print("Saving processed data...")

processed_dir = project_root / 'data' / 'processed'
processed_dir.mkdir(parents=True, exist_ok=True)

# Save train/test splits
X_train_scaled.to_csv(processed_dir / 'X_train.csv', index=False)
X_test_scaled.to_csv(processed_dir / 'X_test.csv', index=False)
pd.Series(y_train, name='label').to_csv(processed_dir / 'y_train.csv', index=False)
pd.Series(y_test, name='label').to_csv(processed_dir / 'y_test.csv', index=False)

# Save scaler and label encoder for later use
import joblib
joblib.dump(scaler, processed_dir / 'scaler.pkl')
joblib.dump(label_encoder, processed_dir / 'label_encoder.pkl')

print(f"\nProcessed data saved to: {processed_dir}")
print("Files created:")
print("  - X_train.csv")
print("  - X_test.csv")
print("  - y_train.csv")
print("  - y_test.csv")
print("  - scaler.pkl")
print("  - label_encoder.pkl")

Saving processed data...

Processed data saved to: /home/mweaver/nidstream/data/processed
Files created:
  - X_train.csv
  - X_test.csv
  - y_train.csv
  - y_test.csv
  - scaler.pkl
  - label_encoder.pkl



## 6. Summary Statistics

In [12]:
# Summary of feature engineering process
print("=" * 60)
print("FEATURE ENGINEERING SUMMARY")
print("=" * 60)
print(f"\nOriginal dataset: {len(df):,} samples, {len(df.columns)} features")
print(f"Final feature count: {X_train_scaled.shape[1]}")
print(f"\nTraining set: {len(X_train_scaled):,} samples")
print(f"Test set: {len(X_test_scaled):,} samples")
print(f"\nClass distribution (train):")
print(f"  Benign: {(y_train == 0).sum():,} ({(y_train == 0).sum() / len(y_train) * 100:.1f}%)")
print(f"  Attack: {(y_train == 1).sum():,} ({(y_train == 1).sum() / len(y_train) * 100:.1f}%)")
print(f"\nData ready for model training!")
print("=" * 60)

FEATURE ENGINEERING SUMMARY

Original dataset: 238,700 samples, 323 features
Final feature count: 317

Training set: 190,960 samples
Test set: 47,740 samples

Class distribution (train):
  Benign: 182,211 (95.4%)
  Attack: 8,749 (4.6%)

Data ready for model training!


## Next Steps

The processed data is now ready for:
1. Model training in `03_model_training.ipynb`
2. Hyperparameter tuning
3. Model evaluation and comparison

**Note:** You may want to:
- Apply SMOTE or other techniques for class imbalance
- Perform feature selection to reduce dimensionality
- Experiment with different scaling methods
- Create more domain-specific features based on network traffic analysis
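
As a lighter alternative to SMOTE for the first point, many scikit-learn classifiers accept per-class weights. The sketch below is not something this notebook does. It uses `compute_class_weight` on labels rebuilt from the training counts printed in the summary (182,211 benign, 8,749 attack), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training labels rebuilt from the counts printed in the summary
y_train_demo = np.concatenate([
    np.zeros(182_211, dtype=int),  # benign
    np.ones(8_749, dtype=int),     # attack
])

classes = np.array([0, 1])
# "balanced" computes n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train_demo)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
```

The resulting dictionary can be passed as `class_weight=class_weight` to estimators that support it, such as `LogisticRegression` or `RandomForestClassifier`.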