# Case Study: Edge AI and Cybersecurity in Action
## Module 1: Network Traffic Analysis - **HANDS-ON VERSION**

### Introduction

Welcome to the first module of our Edge AI and Cybersecurity workshop! In this notebook, we'll explore and analyze CICIDS2017-style network traffic data to understand the characteristics of both benign and malicious network flows.

**Objectives:**
- Explore labeled network traffic data modeled on the CICIDS2017 dataset
- Understand the key features that distinguish normal traffic from cyber attacks
- Preprocess the data for machine learning models
- Prepare a clean dataset for edge-based anomaly detection in future modules

**Why Network Traffic Analysis?**
Network traffic analysis is fundamental to cybersecurity. By understanding patterns in network flows, we can:
- Detect anomalous behavior that might indicate attacks
- Build lightweight ML models suitable for edge devices
- Create real-time intrusion detection systems

**Dataset Source:**
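The CICIDS2017 dataset contains labeled network flows covering several attack types (DDoS, Port Scan, Brute Force, etc.) alongside normal traffic. To stay self-contained, this hands-on version generates a synthetic stand-in that uses the same column names; in a real scenario you would load a preprocessed subset of CICIDS2017 instead.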

---
**🔥 HANDS-ON PRACTICE**: This notebook contains code completion exercises marked with `# TODO:` comments. Fill in the missing code to complete the network traffic analysis workflow!

Let's begin our journey into cybersecurity data analysis!

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
import warnings
warnings.filterwarnings('ignore')

# Configure visualization settings
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print("All libraries imported successfully!")
print("Visualization settings configured")

In [None]:
# Create sample network traffic dataset (simulating CICIDS2017-like data)
# In a real scenario, you would load from 'data/clean_traffic.csv'

np.random.seed(42)
n_samples = 8000

# Generate synthetic network traffic features
data = {
    'Flow Duration': np.random.exponential(1000000, n_samples),
    'Total Fwd Packets': np.random.poisson(50, n_samples),
    'Total Bwd Packets': np.random.poisson(30, n_samples),
    'Total Length of Fwd Packets': np.random.exponential(2000, n_samples),
    'Total Length of Bwd Packets': np.random.exponential(1500, n_samples),
    'Fwd Packet Length Max': np.random.exponential(500, n_samples),
    'Fwd Packet Length Min': np.random.exponential(50, n_samples),
    'Fwd Packet Length Mean': np.random.exponential(200, n_samples),
    'Bwd Packet Length Max': np.random.exponential(400, n_samples),
    'Bwd Packet Length Min': np.random.exponential(40, n_samples),
    'Bwd Packet Length Mean': np.random.exponential(150, n_samples),
    'Flow Bytes/s': np.random.exponential(10000, n_samples),
    'Flow Packets/s': np.random.exponential(100, n_samples),
    'Flow IAT Mean': np.random.exponential(50000, n_samples),
    'Flow IAT Max': np.random.exponential(100000, n_samples),
    'Flow IAT Min': np.random.exponential(1000, n_samples),
    'Fwd IAT Mean': np.random.exponential(40000, n_samples),
    'Bwd IAT Mean': np.random.exponential(60000, n_samples),
    'Fwd PSH Flags': np.random.binomial(1, 0.3, n_samples),
    'Bwd PSH Flags': np.random.binomial(1, 0.2, n_samples),
    'Fwd URG Flags': np.random.binomial(1, 0.05, n_samples),
    'Bwd URG Flags': np.random.binomial(1, 0.03, n_samples),
    'Fwd Header Length': np.random.normal(20, 5, n_samples),
    'Bwd Header Length': np.random.normal(20, 5, n_samples),
    'Fwd Packets/s': np.random.exponential(50, n_samples),
    'Bwd Packets/s': np.random.exponential(30, n_samples),
    'Packet Length Min': np.random.exponential(40, n_samples),
    'Packet Length Max': np.random.exponential(500, n_samples),
    'Packet Length Mean': np.random.exponential(200, n_samples),
    'Packet Length Std': np.random.exponential(100, n_samples),
    'Packet Length Variance': np.random.exponential(10000, n_samples),
}

# Create labels (20% attacks, 80% benign)
attack_indices = np.random.choice(n_samples, size=int(0.2 * n_samples), replace=False)
labels = ['BENIGN'] * n_samples
attack_types = ['DDoS', 'PortScan', 'FTP-Patator', 'SSH-Patator', 'DoS Hulk']

for idx in attack_indices:
    labels[idx] = np.random.choice(attack_types)
    # Make attack traffic slightly different
    data['Flow Duration'][idx] *= 0.5  # Shorter duration
    data['Total Fwd Packets'][idx] *= 2  # More packets
    data['Flow Bytes/s'][idx] *= 3  # Higher throughput

data['Label'] = labels

# Create DataFrame
df = pd.DataFrame(data)

print("Generated synthetic network traffic dataset:")
print(f"   • Total samples: {len(df):,}")
print(f"   • Features: {len(df.columns)-1}")
print(f"   • Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

## Step 1: Initial Data Exploration

Let's examine our network traffic dataset to understand its structure and characteristics. Each row represents a network flow (a sequence of packets between two endpoints), and each column represents a feature extracted from that flow.
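
To make the idea of a "flow feature" concrete, here is a minimal sketch of how a few of the columns could be derived from the packets of a single flow. The packet timestamps and sizes are made up for illustration, packet direction is ignored (so `Total Packets` is a simplification of the separate Fwd/Bwd columns), and this is not the exact procedure used to build CICIDS2017.

```python
import numpy as np

# Hypothetical packets of one flow: arrival time (seconds) and size (bytes).
timestamps_s = np.array([0.00, 0.01, 0.03, 0.06, 0.10])
sizes_bytes = np.array([60, 1500, 1500, 400, 60])

duration_s = timestamps_s[-1] - timestamps_s[0]
iat_s = np.diff(timestamps_s)  # inter-arrival times between consecutive packets

flow = {
    "Flow Duration": duration_s * 1e6,                 # microseconds
    "Total Packets": len(sizes_bytes),                 # direction ignored here
    "Flow Bytes/s": sizes_bytes.sum() / duration_s,
    "Flow Packets/s": len(sizes_bytes) / duration_s,
    "Flow IAT Mean": iat_s.mean() * 1e6,               # microseconds
}

for name, value in flow.items():
    print(f"{name:15}: {value:,.1f}")
```

Note that the rate features divide by the flow duration, so a flow with a single packet (duration zero) would produce an infinite or undefined rate. That is one reason the cleaning step later checks for infinite values.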

**🔥 HANDS-ON PRACTICE**: Complete the missing code sections marked with `# TODO:` to explore the dataset!

In [None]:
# TODO: Complete the basic dataset information display
# HINT: Use df.shape to get dimensions, df.columns to get column names, and df.info() for detailed info

print("Dataset Overview:")
print("=" * 50)
# TODO: Print the shape of the dataset using f-string formatting
print(f"Shape: {}")
# TODO: Print the list of features (all columns except 'Label')
print(f"Features: {}")
# TODO: Print the target column name (the last column)
print(f"Target: {}")

print("\nDataset Info:")
# TODO: Display detailed dataset information using the info() method

In [None]:
# TODO: Display sample data and feature descriptions
# HINT: Use display() for nice table formatting and create a dictionary for feature descriptions

print("\nSample Data (First 5 rows):")
print("=" * 80)
# TODO: Display the first 5 rows of the dataset using head() method
display()

print("\nFeature Descriptions:")
print("=" * 50)
feature_descriptions = {
    'Flow Duration': 'Duration of the flow in microseconds',
    'Total Fwd Packets': 'Total packets in forward direction',
    'Total Bwd Packets': 'Total packets in backward direction',
    'Total Length of Fwd Packets': 'Total size of packets in the forward direction (bytes)',
    'Total Length of Bwd Packets': 'Total size of packets in the backward direction (bytes)',
    'Flow Bytes/s': 'Number of flow bytes per second',
    'Flow Packets/s': 'Number of flow packets per second',
    'Flow IAT Mean': 'Mean time between two packets sent in the flow',
    'Fwd PSH Flags': 'Number of times PSH flag was set in forward direction',
    'Bwd PSH Flags': 'Number of times PSH flag was set in backward direction',
    'Label': 'Traffic type (BENIGN or attack type)'
}

# TODO: Display the first 6 feature descriptions using a loop
for feature, description in list(feature_descriptions.items())[:6]:
    print(f"• {feature:25}: {description}")
print("  ... and more network flow features")

In [None]:
# TODO: Analyze class distribution and create visualizations
# HINT: Use value_counts() for counting, normalize=True for percentages

print("\nClass Distribution:")
print("=" * 40)
# TODO: Count the occurrences of each label using value_counts()
label_counts = 
print(label_counts)
print(f"\nTotal samples: {len(df):,}")

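# TODO: Calculate percentages using value_counts(normalize=True) * 100
# NOTE: normalize=True alone returns fractions in [0, 1], so multiply by 100 before printing with '%'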
label_percentages = 
print("\nPercentage Distribution:")
for label, percentage in label_percentages.items():
    print(f"• {label:15}: {percentage:.1f}%")

# TODO: Create visualizations for class distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# TODO: Create a bar plot using label_counts.plot()
# HINT: Use kind='bar', specify colors and edge colors
label_counts.plot(kind=, ax=ax1, color=, edgecolor=)
ax1.set_title('Class Distribution (Count)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Traffic Type')
ax1.set_ylabel('Number of Samples')
ax1.tick_params(axis='x', rotation=45)

# TODO: Create a pie chart using ax2.pie()
# HINT: Use label_counts.values and label_counts.index for data and labels
ax2.pie(, labels=, autopct='%1.1f%%', 
        colors=plt.cm.Set3.colors, startangle=90)
ax2.set_title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# TODO: Calculate class imbalance ratio
benign_count = label_counts['BENIGN']
attack_count = label_counts.sum() - benign_count
# TODO: Calculate the ratio of benign to attack samples
imbalance_ratio = 
print(f"\nClass Imbalance Ratio (Benign:Attack): {imbalance_ratio:.2f}:1")

In [None]:
# TODO: Create statistical summary and visualizations for key features
# HINT: Use describe() method and create comparison plots between benign and attack traffic

print("Statistical Summary of Key Features:")
print("=" * 60)
# TODO: Define a list of key features to analyze
key_features = ['Flow Duration', 'Total Fwd Packets', 'Total Bwd Packets', 
                'Flow Bytes/s', 'Flow Packets/s']
# TODO: Display statistical summary using describe() method
display()

# TODO: Create histograms comparing benign vs attack traffic
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, feature in enumerate(key_features):
    # TODO: Filter data for benign and attack traffic
    # HINT: Use boolean indexing with df['Label'] == 'BENIGN' and df['Label'] != 'BENIGN'
    benign_data = 
    attack_data = 
    
    # TODO: Create overlapping histograms
    # HINT: Use axes[i].hist() with alpha=0.7, different colors, and density=True
    axes[i].hist(, bins=50, alpha=0.7, label='Benign', color='lightblue', density=True)
    axes[i].hist(, bins=50, alpha=0.7, label='Attack', color='salmon', density=True)
    axes[i].set_title(f'Distribution: {feature}', fontweight='bold')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Density')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

# Remove empty subplot
axes[-1].remove()
plt.tight_layout()
plt.show()

# TODO: Create box plots for detailed comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

# TODO: Create a copy of dataframe and add binary traffic type column
# (done once, before the loop, so the DataFrame is not copied for every feature)
df_viz = df.copy()
# TODO: Create 'Traffic_Type' column using lambda function
# HINT: Use df['Label'].apply() with lambda x: 'Benign' if x == 'BENIGN' else 'Attack'
df_viz['Traffic_Type'] = 

for i, feature in enumerate(key_features):
    # TODO: Create box plot using seaborn
    # HINT: Use sns.boxplot() with data=df_viz, x='Traffic_Type', y=feature, ax=axes[i]
    # (without ax=axes[i], every box plot is drawn on the same, current axes)
    sns.boxplot()
    axes[i].set_title(f'Box Plot: {feature}', fontweight='bold')
    axes[i].grid(True, alpha=0.3)

# Remove empty subplot
axes[-1].remove()
plt.tight_layout()
plt.show()

## Step 2: Data Cleaning and Feature Selection

Now let's prepare our data for machine learning by:
1. Handling missing or infinite values
2. Removing highly correlated features
3. Selecting the most relevant features
4. Normalizing the data
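
As a rough illustration of steps 1, 2, and 4 (feature selection is covered in Step 4 below), the following sketch runs them on a small toy table rather than on the workshop dataset. All names (`toy`, `bytes_x8`, `bytes_per_s`, and so on) are hypothetical, and the correlation filter uses an upper-triangle mask as one possible alternative to the nested loops used in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
toy_duration = rng.exponential(1.0, n)
toy_duration[:3] = 0.0                      # a few zero-length flows
toy_bytes = rng.exponential(2000.0, n) + 1.0

toy = pd.DataFrame({
    "duration": toy_duration,
    "bytes": toy_bytes,
    "bytes_x8": toy_bytes * 8.0,            # redundant: perfectly correlated with "bytes"
})
toy["bytes_per_s"] = toy["bytes"] / toy["duration"]  # division by zero yields inf

# 1. Replace +/-inf with NaN, then impute with the column median.
toy_clean = toy.replace([np.inf, -np.inf], np.nan)
toy_clean = toy_clean.fillna(toy_clean.median())

# 2. Keep only the upper triangle of |corr| so each pair is inspected once,
#    then drop the later column of every pair above the threshold.
corr = toy_clean.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
toy_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
toy_reduced = toy_clean.drop(columns=toy_drop)

# 4. Standardize each remaining column to zero mean and unit variance.
toy_scaled = StandardScaler().fit_transform(toy_reduced)

print("inf values before cleaning:", int(np.isinf(toy.to_numpy()).sum()))
print("dropped columns:", toy_drop)
print("shape after filtering:", toy_reduced.shape)
```

Dropping the later column of each correlated pair is an arbitrary but simple tie-break; which column survives depends only on column order, not on which one is more informative.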

In [None]:
# TODO: Implement data quality checks and cleaning
# HINT: Use isnull().sum() for missing values, apply() with np.isinf() for infinite values

print("Data Quality Check:")
print("=" * 40)

# TODO: Separate features and labels from the dataset
# HINT: Use drop() method to remove 'Label' column for features
features = 
labels = 

print("Missing values per column:")
# TODO: Check for missing values in each column
# HINT: Use isnull().sum() and filter where count > 0
missing_counts = 
print(missing_counts[missing_counts > 0] if missing_counts.sum() > 0 else "No missing values found")

print("\nInfinite values per column:")
# TODO: Check for infinite values using apply() and np.isinf()
# HINT: Use lambda x: np.isinf(x).sum() inside apply()
inf_counts = 
print(inf_counts[inf_counts > 0] if inf_counts.sum() > 0 else "No infinite values found")

# TODO: Handle infinite values by replacing them with NaN, then fill with median
# HINT: Use replace() method with [np.inf, -np.inf] and np.nan
features_clean = 
if features_clean.isnull().sum().sum() > 0:
    # TODO: Fill NaN values with column median
    # HINT: Use fillna() with median() method
    features_clean = 
    print("Infinite values replaced with column medians")

# TODO: Check data types distribution
print("\nData types:")
print()

print(f"\nCleaned dataset shape: {features_clean.shape}")

In [None]:
# TODO: Implement correlation analysis and feature selection
# HINT: Use corr() method and nested loops to find highly correlated pairs

print("Correlation Analysis:")
print("=" * 30)

# TODO: Calculate correlation matrix
# HINT: Use .corr() method on features_clean
correlation_matrix = 

# TODO: Find highly correlated feature pairs (threshold > 0.95)
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        # TODO: Check if absolute correlation is greater than 0.95
        # HINT: Use abs() and correlation_matrix.iloc[i, j]
        if abs() > 0.95:
            # TODO: Append tuple with (feature1, feature2, correlation_value)
            high_corr_pairs.append((
                ,  # feature 1 name
                ,  # feature 2 name
                   # correlation value
            ))

print(f"Found {len(high_corr_pairs)} highly correlated feature pairs (|r| > 0.95):")
for feat1, feat2, corr in high_corr_pairs[:5]:  # Show first 5
    print(f"• {feat1} ↔ {feat2}: {corr:.3f}")

# TODO: Remove highly correlated features (keep one from each pair)
features_to_remove = set()
for feat1, feat2, corr in high_corr_pairs:
    # TODO: Add the second feature to removal set
    # HINT: Use .add() method to add feat2
    

# TODO: Drop the highly correlated features
# HINT: Use drop() method with columns parameter
features_selected = 
print(f"\nRemoved {len(features_to_remove)} highly correlated features")
print(f"Features after correlation filtering: {features_selected.shape[1]}")

# TODO: Create correlation heatmap visualization
plt.figure(figsize=(14, 12))
if features_selected.shape[1] > 20:
    # Sample features for visualization
    sample_features = features_selected.sample(n=15, axis=1, random_state=42)
    # TODO: Create heatmap using seaborn
    # HINT: Use sns.heatmap() with sample_features.corr()
    sns.heatmap(, annot=True, cmap='coolwarm', center=0, 
                square=True, fmt='.2f', cbar_kws={'shrink': .8})
    plt.title('Correlation Heatmap (Sample of Features)', fontsize=16, fontweight='bold')
else:
    # TODO: Create heatmap for all features if <= 20
    sns.heatmap(, annot=True, cmap='coolwarm', center=0, 
                square=True, fmt='.2f', cbar_kws={'shrink': .8})
    plt.title('Correlation Heatmap (All Features)', fontsize=16, fontweight='bold')

plt.tight_layout()
plt.show()

## Step 3: Label Encoding

Convert our categorical labels into binary format suitable for machine learning models.
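
The mapping can be sanity-checked on a handful of made-up labels before applying it to the full dataset. In this small sketch, `toy_labels` is a hypothetical Series, and the cross-tabulation is just one way to confirm that every attack type lands in class 1 and `BENIGN` in class 0.

```python
import pandas as pd

# Hypothetical labels; the real column holds one entry per flow.
toy_labels = pd.Series(["BENIGN", "DDoS", "BENIGN", "PortScan", "BENIGN"])

# Anything that is not BENIGN counts as an attack (1).
toy_binary = toy_labels.ne("BENIGN").astype(int)

print(toy_binary.tolist())
print(f"Attack share: {toy_binary.mean() * 100:.1f}%")

# Rows: original label, columns: binary label.
print(pd.crosstab(toy_labels, toy_binary))
```

Collapsing all attack types into one class suits anomaly detection, but it discards the attack type, so keep the original labels if later analysis needs them.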

In [None]:
# TODO: Convert categorical labels to binary format for machine learning
# HINT: Use a boolean comparison and .astype(int) to create binary labels

print("Converting Labels to Binary Format:")
print("=" * 45)

# TODO: Create binary labels: 0 = BENIGN, 1 = ATTACK
# HINT: Use (labels != 'BENIGN') and convert to int
binary_labels = 

print("Original label distribution:")
print(labels.value_counts())

print("\nBinary label distribution:")
# TODO: Count benign (0) and attack (1) samples
print("0 (Benign):", )
print("1 (Attack):", )

# TODO: Calculate percentages for binary labels
# HINT: Use .mean() * 100 for percentage calculation
benign_pct = 
attack_pct = 

print("\nPercentages:")
print(f"• Benign: {benign_pct:.1f}%")
print(f"• Attack: {attack_pct:.1f}%")

# TODO: Create visualizations for binary distribution
plt.figure(figsize=(10, 6))
# TODO: Create count list and labels list for visualization
counts = [, ]
labels_viz = ['Benign (0)', 'Attack (1)']
colors = ['lightgreen', 'coral']

plt.subplot(1, 2, 1)
# TODO: Create bar plot
# HINT: Use plt.bar() with labels_viz, counts, colors, and edgecolor
plt.bar(, , color=, edgecolor='black')
plt.title('Binary Label Distribution', fontweight='bold')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
# TODO: Create pie chart
# HINT: Use plt.pie() with counts, labels_viz, autopct, colors, and startangle
plt.pie(, labels=, autopct='%1.1f%%', colors=, startangle=90)
plt.title('Binary Label Percentage', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nLabels successfully converted to binary format")
print(f"Final dataset shape: {features_selected.shape[0]} samples × {features_selected.shape[1]} features")

## Step 4: Feature Importance Analysis

Let's identify which features are most important for distinguishing between benign and malicious traffic using Random Forest feature importance and mutual information.
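
To see what the two rankings measure, it can help to try them on data where the answer is known in advance. The sketch below builds a toy problem (all names are hypothetical) in which the label depends only on the `signal` column, so both methods should rank it above the two noise columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X = np.column_stack([signal, rng.normal(size=n), rng.normal(size=n)])
y = (signal > 0).astype(int)        # the label depends on "signal" only
names = ["signal", "noise_a", "noise_b"]

# Impurity-based importances from a small forest (they sum to 1).
toy_rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mutual information between each feature and the label.
toy_mi = mutual_info_classif(X, y, random_state=0)

for name, imp, score in zip(names, toy_rf.feature_importances_, toy_mi):
    print(f"{name:8} rf_importance={imp:.3f}  mi_score={score:.3f}")
```

The two scores are on different scales, so compare the rankings rather than the raw values. Taking the union of the top features from both methods, as the exercise does, is one simple way to combine them.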

In [None]:
# TODO: Implement feature importance analysis using Random Forest and Mutual Information
# HINT: Use RandomForestClassifier and mutual_info_classif for feature ranking

print("Random Forest Feature Importance Analysis:")
print("=" * 50)

# TODO: Create and train Random Forest for feature importance
# HINT: Use RandomForestClassifier with n_estimators=100, random_state=42
rf = RandomForestClassifier()
# TODO: Fit the model on features_selected and binary_labels
rf.fit(, )

# TODO: Create feature importance DataFrame
# HINT: Use pd.DataFrame with feature names and importances, then sort by importance
rf_importance = pd.DataFrame({
    'feature': ,
    'importance': 
}).sort_values('importance', ascending=False)

print("Top 10 most important features (Random Forest):")
print(rf_importance.head(10))

# TODO: Implement Mutual Information analysis
print("\nMutual Information Analysis:")
print("=" * 35)

# TODO: Calculate mutual information scores
# HINT: Use mutual_info_classif() with features_selected, binary_labels, and random_state=42
mi_scores = 

# TODO: Create mutual information DataFrame
mi_importance = pd.DataFrame({
    'feature': ,
    'mi_score': 
}).sort_values('mi_score', ascending=False)

print("Top 10 most important features (Mutual Information):")
print(mi_importance.head(10))

# TODO: Create feature importance visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# TODO: Plot Random Forest importance (top 15 features)
top_rf = rf_importance.head(15)
# HINT: Use ax1.barh() with range(len(top_rf)), importance values, colors
ax1.barh(, , color='lightblue', edgecolor='navy')
ax1.set_yticks(range(len(top_rf)))
ax1.set_yticklabels(top_rf['feature'])
ax1.set_xlabel('Feature Importance')
ax1.set_title('Top 15 Features - Random Forest Importance', fontweight='bold')
ax1.grid(axis='x', alpha=0.3)

# TODO: Plot Mutual Information scores (top 15 features)
top_mi = mi_importance.head(15)
# HINT: Use ax2.barh() with range(len(top_mi)), mi_score values, colors
ax2.barh(, , color='lightcoral', edgecolor='darkred')
ax2.set_yticks(range(len(top_mi)))
ax2.set_yticklabels(top_mi['feature'])
ax2.set_xlabel('Mutual Information Score')
ax2.set_title('Top 15 Features - Mutual Information', fontweight='bold')
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

# TODO: Select top features based on both methods
# HINT: Use set() to get unique features from both methods, then union them
top_rf_features = set(rf_importance.head(15)['feature'])
top_mi_features = set(mi_importance.head(15)['feature'])
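# TODO: Combine both feature sets using union
# NOTE: set order can differ between Python sessions; wrapping the union in sorted() instead of list() gives a reproducible column order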
top_features = list()

print(f"\nSelected {len(top_features)} top features for final dataset")
print("Selected features:", top_features[:10], "..." if len(top_features) > 10 else "")

In [None]:
# TODO: Implement feature normalization and create before/after comparison
# HINT: Use StandardScaler to normalize features and create comparison plots

print("Data Normalization:")
print("=" * 25)

# TODO: Select final features from features_selected using top_features list
# HINT: Use features_selected[top_features]
final_features = 

# TODO: Create and apply StandardScaler
# HINT: Initialize StandardScaler(), then use fit_transform()
scaler = StandardScaler()
features_normalized = 
# TODO: Create DataFrame with normalized features
# HINT: Use pd.DataFrame() with features_normalized and final_features.columns
features_normalized_df = 

print(f"Normalized {features_normalized_df.shape[1]} features")
print("Final feature statistics:")
# TODO: Display descriptive statistics of normalized features
print()

# TODO: Create before/after normalization comparison
# HINT: Select first 3 features for visualization
sample_features = 
fig, axes = plt.subplots(2, len(sample_features), figsize=(15, 8))

for i, feature in enumerate(sample_features):
    # TODO: Create histogram for original features (before normalization)
    # HINT: Use axes[0, i].hist() with final_features[feature]
    axes[0, i].hist(, bins=50, alpha=0.7, color='lightblue', edgecolor='navy')
    axes[0, i].set_title(f'Before: {feature}', fontweight='bold')
    axes[0, i].set_ylabel('Frequency')
    
    # TODO: Create histogram for normalized features (after normalization)
    # HINT: Use axes[1, i].hist() with features_normalized_df[feature]
    axes[1, i].hist(, bins=50, alpha=0.7, color='lightgreen', edgecolor='darkgreen')
    axes[1, i].set_title(f'After: {feature}', fontweight='bold')
    axes[1, i].set_ylabel('Frequency')
    axes[1, i].set_xlabel('Value')

plt.suptitle('Feature Normalization Comparison', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nData preprocessing completed successfully!")

## Step 5: Data Export

Save our cleaned and preprocessed dataset for use in the next module where we'll build lightweight ML models for edge-based anomaly detection.
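
The reason for saving the fitted scaler alongside the CSV is that new traffic has to be transformed with the same means and standard deviations that were learned here. The following sketch checks that round trip on a toy table; the file names and the `toy_` variables are hypothetical, and a temporary directory is used so nothing in `data/` is touched.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_raw = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0],
                        "f2": [10.0, 20.0, 30.0, 40.0]})
toy_scaler = StandardScaler().fit(toy_raw)

toy_export = pd.DataFrame(toy_scaler.transform(toy_raw), columns=toy_raw.columns)
toy_export["Label"] = [0, 0, 1, 1]

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Save the dataset and the fitted scaler.
    toy_export.to_csv(tmp / "toy.csv", index=False)
    with open(tmp / "toy_scaler.pkl", "wb") as f:
        pickle.dump(toy_scaler, f)

    # Reload both, as a later module would.
    toy_reloaded = pd.read_csv(tmp / "toy.csv")
    with open(tmp / "toy_scaler.pkl", "rb") as f:
        toy_restored = pickle.load(f)

# A new, unseen flow must be scaled identically by both scaler objects.
new_flow = pd.DataFrame({"f1": [2.5], "f2": [25.0]})
same_transform = np.allclose(toy_scaler.transform(new_flow),
                             toy_restored.transform(new_flow))
print("columns:", list(toy_reloaded.columns))
print("restored scaler matches:", same_transform)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn object is best reloaded with the same library version that created it.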

In [None]:
# TODO: Create final dataset and save for future use
# HINT: Combine normalized features with binary labels and save to CSV

# TODO: Create final dataset by copying normalized features
final_dataset = 
# TODO: Add binary labels to the final dataset
# HINT: Use .values to get numpy array from binary_labels
final_dataset['Label'] = 

print("Final Dataset Summary:")
print("=" * 30)
print(f"Shape: {final_dataset.shape}")
print(f"Features: {final_dataset.shape[1] - 1}")
print(f"Samples: {final_dataset.shape[0]:,}")
# TODO: Calculate memory usage in MB
# HINT: Use memory_usage(deep=True).sum() / 1024**2
print(f"\nMemory usage: { / 1024**2:.2f} MB")

print("\nClass distribution in final dataset:")
# TODO: Calculate and display class distribution percentages
# HINT: Use (final_dataset['Label'] == 0).sum() and .mean()*100
print(f"• Benign (0): {:,} ({:.1f}%)")
print(f"• Attack (1): {:,} ({:.1f}%)")

# Display sample of final dataset
print("\nSample of Final Dataset:")
# TODO: Display first 5 rows of final dataset
display()

# TODO: Save dataset to CSV file
# HINT: Create 'data' directory and save using to_csv()
import os
os.makedirs('data', exist_ok=True)
output_file = 'data/cleaned_network_traffic.csv'
# TODO: Save final_dataset to CSV without index
final_dataset.to_csv(, index=False)

print(f"\nDataset saved to: {output_file}")
print("Ready for edge AI model development!")

# TODO: Save feature names and scaler for future deployment
# HINT: Use pickle to save feature names and scaler objects
import pickle
with open('data/feature_names.pkl', 'wb') as f:
    # TODO: Save list of normalized feature column names
    pickle.dump(, f)
    
with open('data/scaler.pkl', 'wb') as f:
    # TODO: Save the fitted scaler object
    pickle.dump(, f)

print("Saved feature names and scaler for future model deployment")