# CSE-CIC-IDS2018 Exploratory Data Analysis

This notebook explores the CSE-CIC-IDS2018 network intrusion detection dataset.

## Objectives:
1. Load data from S3 using PySpark
2. Understand data quality and missing values
3. Analyze label distribution and class imbalance
4. Explore feature correlations
5. Identify temporal patterns in attacks
6. Characterize different attack types

In [None]:
# Import libraries
import os
import sys
# Make the parent directory importable so the `src` package resolves
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import SparkSession

from src.feature_pipeline.load import create_spark_session, load_from_s3

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

%matplotlib inline

## 1. Load Data from S3

In [None]:
# Create Spark session
spark = create_spark_session("NIDStream-EDA")

# Load sample data from S3
bucket = os.getenv('S3_BUCKET', 'your-nids-data-bucket')
s3_path = f"s3a://{bucket}/raw/CSE-CIC-IDS2018/*.csv"

# Load 10% sample for EDA
df = load_from_s3(s3_path, spark, sample_fraction=0.1)

print(f"Loaded {df.count():,} records")
print(f"Number of features: {len(df.columns)}")

In [None]:
# Show schema
df.printSchema()

In [None]:
# Convert a 1% subsample of the Spark sample to pandas for plotting
# (fixed seed so reruns draw the same rows; lower the fraction if memory is tight)
pdf = df.sample(fraction=0.01, seed=42).toPandas()
print(f"Pandas DataFrame shape: {pdf.shape}")
pdf.head()

## 2. Data Quality Analysis

In [None]:
# Check missing values
missing_pct = (pdf.isnull().sum() / len(pdf) * 100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]

if len(missing_pct) > 0:
    plt.figure(figsize=(10, 6))
    missing_pct.head(20).plot(kind='barh')
    plt.xlabel('Missing Percentage')
    plt.title('Top 20 Features with Missing Values')
    plt.tight_layout()
    plt.show()
else:
    print("No missing values found!")
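`isnull()` does not flag infinite values, which rate-style columns such as `Flow Byts/s` can contain. The following is a minimal sketch of a check that reports both, assuming a pandas frame like `pdf`. The helper name `summarize_non_finite` and the demo values are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd


def summarize_non_finite(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column percentages of NaN and +/-inf for numeric columns."""
    numeric = frame.select_dtypes(include=[np.number])
    n_rows = max(len(numeric), 1)  # avoid division by zero on an empty frame
    summary = pd.DataFrame({
        "nan_pct": numeric.isna().sum() / n_rows * 100,
        "inf_pct": numeric.isin([np.inf, -np.inf]).sum() / n_rows * 100,
    })
    # Keep only the columns that actually have a problem
    return summary[(summary["nan_pct"] > 0) | (summary["inf_pct"] > 0)]


# Tiny synthetic frame for illustration only (not real dataset values)
demo = pd.DataFrame({
    "Flow Byts/s": [10.0, np.inf, np.nan, 5.0],
    "Tot Fwd Pkts": [1, 2, 3, 4],
})
non_finite = summarize_non_finite(demo)
print(non_finite)
```

In the notebook, `summarize_non_finite(pdf)` would list only the columns that need cleaning before correlation or histogram plots.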

## 3. Label Distribution

In [None]:
# Label distribution
label_counts = pdf['Label'].value_counts()
print("\nLabel Distribution:")
print(label_counts)
print(f"\nClass imbalance ratio: {label_counts.max() / label_counts.min():.2f}")
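The max/min ratio condenses the imbalance into one number. One common way to act on it is per-class weighting with the "balanced" heuristic `n_samples / (n_classes * class_count)`. The sketch below assumes labels arrive as a pandas Series; `balanced_class_weights` is a hypothetical helper and the demo labels are synthetic.

```python
import pandas as pd


def balanced_class_weights(labels: pd.Series) -> pd.Series:
    """Weight each class by n_samples / (n_classes * class_count)."""
    labels = labels.dropna()          # ignore rows without a label
    counts = labels.value_counts()    # one entry per class
    return len(labels) / (len(counts) * counts)


# Synthetic labels for illustration: 8 benign flows, 2 attack flows
demo_labels = pd.Series(["Benign"] * 8 + ["Bot"] * 2)
weights = balanced_class_weights(demo_labels)
print(weights)
```

Rare classes receive weights above 1 and the majority class a weight below 1, so `balanced_class_weights(pdf['Label'])` could feed a model's class-weight option in a later notebook.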

In [None]:
# Visualize label distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Bar plot
label_counts.plot(kind='bar', ax=ax1, color='steelblue')
ax1.set_title('Attack Type Distribution')
ax1.set_xlabel('Attack Type')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=45)

# Pie chart
label_counts.plot(kind='pie', ax=ax2, autopct='%1.1f%%')
ax2.set_ylabel('')
ax2.set_title('Attack Type Proportion')

plt.tight_layout()
plt.show()

## 4. Feature Correlation Analysis

In [None]:
# Select numeric features for correlation
numeric_cols = pdf.select_dtypes(include=[np.number]).columns.tolist()

# Restrict the heatmap to a few key flow features that are present and numeric
key_features = [
    'Flow Duration', 'Tot Fwd Pkts', 'Tot Bwd Pkts',
    'Flow Byts/s', 'Flow Pkts/s', 'Flow IAT Mean',
    'Fwd IAT Mean', 'Bwd IAT Mean'
]
key_features = [c for c in key_features if c in numeric_cols]

# Rate features such as Flow Byts/s may contain +/-inf; treat those as missing
corr_matrix = pdf[key_features].replace([np.inf, -np.inf], np.nan).corr()

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
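A heatmap is easy to read for eight features but harder to scan for many. As a complement, the sketch below lists the feature pairs whose absolute correlation exceeds a threshold. The helper name `high_corr_pairs`, the 0.9 default, and the demo data are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """List feature pairs with |correlation| >= threshold, strongest first."""
    # Strict upper triangle: each pair appears once and the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries compare as False, so they are dropped here as well
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic data: "b" is an exact multiple of "a", "c" is only weakly related
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
pairs = high_corr_pairs(demo.corr(), threshold=0.9)
print(pairs)
```

Applied to `corr_matrix`, the result gives a short list of candidate redundant features to review before any dimensionality reduction.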

## 5. Attack Characteristics

In [None]:
# Compare benign vs attack flows
pdf['is_attack'] = (pdf['Label'] != 'Benign').astype(int)

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

features_to_plot = [
    ('Flow Duration', axes[0, 0]),
    ('Tot Fwd Pkts', axes[0, 1]),
    ('Flow Byts/s', axes[1, 0]),
    ('Flow Pkts/s', axes[1, 1])
]

for feature, ax in features_to_plot:
    if feature in pdf.columns:
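        # Drop NaN and +/-inf so the histogram bin range stays finite
        benign = pdf.loc[pdf['is_attack'] == 0, feature].replace([np.inf, -np.inf], np.nan).dropna()
        attack = pdf.loc[pdf['is_attack'] == 1, feature].replace([np.inf, -np.inf], np.nan).dropna()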
        
        ax.hist([benign, attack], label=['Benign', 'Attack'], bins=50, alpha=0.7)
        ax.set_xlabel(feature)
        ax.set_ylabel('Frequency')
        ax.set_title(f'{feature} Distribution')
        ax.legend()
        ax.set_yscale('log')

plt.tight_layout()
plt.show()
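The objectives also list temporal patterns, which the histograms do not show. The following is a rough sketch of an hourly attack count. It assumes the data has a `Timestamp` column formatted as day/month/year with a 24-hour time and that benign flows are labelled `Benign`. Both assumptions, along with the helper name `attacks_per_hour`, should be checked against the schema printed earlier.

```python
import pandas as pd


def attacks_per_hour(frame: pd.DataFrame,
                     ts_col: str = "Timestamp",
                     label_col: str = "Label",
                     ts_format: str = "%d/%m/%Y %H:%M:%S") -> pd.DataFrame:
    """Count benign and attack flows in hour-long buckets."""
    # Unparseable timestamps become NaT and are dropped below
    ts = pd.to_datetime(frame[ts_col], format=ts_format, errors="coerce")
    kind = frame[label_col].ne("Benign").map({False: "benign", True: "attack"})
    return (
        pd.DataFrame({"hour": ts.dt.floor("60min"), "kind": kind})
        .dropna(subset=["hour"])
        .groupby(["hour", "kind"])
        .size()
        .unstack(fill_value=0)
    )


# Synthetic rows for illustration only
demo = pd.DataFrame({
    "Timestamp": ["14/02/2018 08:05:00", "14/02/2018 08:45:00",
                  "14/02/2018 09:10:00", "not a timestamp"],
    "Label": ["Benign", "FTP-BruteForce", "Benign", "Benign"],
})
hourly = attacks_per_hour(demo)
print(hourly)
```

In the notebook, `attacks_per_hour(pdf).plot()` would draw one line per class over time, which makes attack windows visible at a glance.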

## 6. Next Steps

Based on this EDA:
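1. Class imbalance needs explicit handling during training, for example class weights or resampling such as SMOTE
2. Strongly correlated features are candidates for removal or dimensionality reduction
3. Features whose distributions differ between benign and attack flows are starting points for feature engineering
4. Temporal patterns (objective 5) still need a dedicated analysis before deciding which time-based features to engineer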

Proceed to:
- `02_feature_engineering.ipynb` for advanced feature creation
- `03_model_training.ipynb` for model development