# Module 5.4: Data Fusion and Preprocessing - **HANDS-ON VERSION**

## Combined Case Study: Cybersecurity, Edge AI and Autonomous Driving

---

## Objective

Demonstrate how to combine and preprocess multimodal data from autonomous driving and cybersecurity contexts, preparing it for Edge AI modeling.

**Key Learning Goals:**
- Understand data fusion challenges in Connected and Autonomous Vehicles (CAVs)
- Learn to synchronize asynchronous multimodal data streams
- Master preprocessing techniques for heterogeneous data types
- Prepare integrated datasets for edge AI deployment

---

## Why Data Fusion Matters in CAVs

Connected and Autonomous Vehicles (CAVs) operate in complex environments where:

1. **Physical Safety**: Vehicle sensors (LiDAR, cameras, GPS) monitor road conditions and obstacles
2. **Cybersecurity**: Network monitoring detects malicious attacks on vehicle communication systems
3. **Edge Processing**: Real-time decisions require fast, local processing with limited computational resources

**Challenge**: Physical sensor data and network traffic data have different:
- Sampling rates (telemetry: fixed rate, e.g. 10 Hz; network: variable, event-driven)
- Data formats (numerical vs. categorical)
- Timestamp precision (milliseconds vs. seconds)

**Solution**: Data fusion techniques that align, normalize, and combine these streams for unified anomaly detection.
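
As a rough illustration of the align-and-combine idea, the sketch below averages a hypothetical 10 Hz telemetry stream into 1-second bins and joins it with a hypothetical second-resolution network log. The column names, timestamps, and values are invented for this example and are not the notebook's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical 10 Hz telemetry: 3 seconds of speed readings, millisecond timestamps
telemetry = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01 00:00:00", periods=30, freq="100ms"),
    "speed": np.linspace(50.0, 52.9, 30),
})

# Hypothetical network log: second-level timestamps and a categorical field
network = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:00:02"]),
    "protocol": ["TCP", "UDP"],
})

# Align: average the fast stream into 1-second bins
telemetry_1s = (
    telemetry.set_index("timestamp")["speed"].resample("1s").mean().reset_index()
)

# Combine: a left join keeps every telemetry bin; bins without a log entry get NaN
fused = telemetry_1s.merge(network, on="timestamp", how="left")
print(fused)
```

Binning is only one possible alignment strategy; it trades temporal detail for a shared time base.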

---
**🔥 HANDS-ON PRACTICE**: This notebook contains code completion exercises marked with `# TODO:` comments. Fill in the missing code to complete the data fusion and preprocessing workflow!

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
import random

# Configuration
warnings.filterwarnings('ignore')
np.random.seed(42)
random.seed(42)

# Set plotting style
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette("husl")

## Step 1: Generate Synthetic Datasets

We'll create two realistic datasets representing:
1. **Vehicle Telemetry**: Physical sensor data from autonomous driving systems
2. **Network Traffic Logs**: Cybersecurity monitoring data from vehicle communication systems
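
The telemetry generator spaces its timestamps evenly with `np.linspace`, so the sampling interval follows directly from the sample count and the duration. The small helper below (`sampling_interval_seconds` is a hypothetical name used only for this sketch) computes that interval for the function defaults and for the 3000-sample, 90-minute call used in this notebook.

```python
import numpy as np

def sampling_interval_seconds(n_samples, duration_minutes):
    """Spacing between consecutive points of np.linspace(0, duration_in_seconds, n_samples)."""
    offsets = np.linspace(0, duration_minutes * 60, n_samples)
    return float(offsets[1] - offsets[0])

default_interval = sampling_interval_seconds(500, 30)    # function defaults
notebook_interval = sampling_interval_seconds(3000, 90)  # values passed in this notebook

print(f"Defaults: {default_interval:.2f} s between samples")
print(f"Notebook call: {notebook_interval:.2f} s between samples")
```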

In [None]:
def generate_telemetry_data(n_samples=500, duration_minutes=30):
    """
    Generate realistic vehicle telemetry data
    
    Parameters:
    - n_samples: Number of telemetry records
    - duration_minutes: Time span of the simulation
    
    Returns:
    - DataFrame with vehicle sensor data
    """
    
    print(f"Generating {n_samples} vehicle telemetry records...")
    
    # Generate evenly spaced timestamps (~3.6 s apart with the defaults: 500 samples over 30 minutes)
    start_time = datetime.now() - timedelta(minutes=duration_minutes)
    time_intervals = np.linspace(0, duration_minutes*60, n_samples)
    timestamps = [start_time + timedelta(seconds=t) for t in time_intervals]
    
    # Simulate realistic driving patterns
    # Speed varies between 0-100 km/h with traffic patterns
    base_speed = 50 + 30 * np.sin(np.linspace(0, 4*np.pi, n_samples))  # Traffic patterns
    speed = np.maximum(0, base_speed + np.random.normal(0, 5, n_samples))
    
    # Acceleration correlates with speed changes
    acceleration = np.diff(np.concatenate([[speed[0]], speed])) + np.random.normal(0, 0.5, n_samples)
    
    # Distance to obstacle (closer in urban areas, farther on highways)
    urban_factor = 0.5 + 0.5 * np.sin(np.linspace(0, 6*np.pi, n_samples))
    distance_to_obstacle = 10 + 40 * urban_factor + np.random.exponential(5, n_samples)
    
    # GPS coordinates (simulated route)
    base_lat, base_lon = 40.7128, -74.0060  # NYC area
    gps_lat = base_lat + np.cumsum(np.random.normal(0, 0.0001, n_samples))
    gps_lon = base_lon + np.cumsum(np.random.normal(0, 0.0001, n_samples))
    
    # Brake status (categorical: 0=off, 1=light, 2=heavy)
    brake_prob = np.where(acceleration < -2, 0.8, 0.1)  # Brake when decelerating
    brake_status = np.random.choice([0, 1, 2], n_samples, p=[0.7, 0.25, 0.05])
    brake_status = np.where(np.random.random(n_samples) < brake_prob, 
                           np.random.choice([1, 2], n_samples), brake_status)
    
    # Create DataFrame
    telemetry_df = pd.DataFrame({
        'timestamp': timestamps,
        'speed': np.round(speed, 2),
        'acceleration': np.round(acceleration, 3),
        'distance_to_obstacle': np.round(distance_to_obstacle, 2),
        'gps_lat': np.round(gps_lat, 6),
        'gps_lon': np.round(gps_lon, 6),
        'brake_status': brake_status
    })
    
    print(f"Generated telemetry data: {len(telemetry_df)} records")
    print(f"   Time range: {telemetry_df['timestamp'].min()} to {telemetry_df['timestamp'].max()}")
    
    return telemetry_df

# Generate larger vehicle telemetry dataset for balanced training
telemetry_data = generate_telemetry_data(n_samples=3000, duration_minutes=90)

# Display sample data
print("\nSample Vehicle Telemetry Data:")
print(telemetry_data.head(10))
print(f"\nTelemetry Data Shape: {telemetry_data.shape}")

In [None]:
def generate_network_data(n_samples=500, duration_minutes=30):
    """
    Generate realistic network traffic logs for vehicle communication systems
    
    Parameters:
    - n_samples: Number of network log entries
    - duration_minutes: Time span of the simulation
    
    Returns:
    - DataFrame with network monitoring data
    """
    
    print(f"Generating {n_samples} network traffic records...")
    
    # Generate timestamps with slight offset and variation from telemetry
    start_time = datetime.now() - timedelta(minutes=duration_minutes)
    # Network logs arrive at irregular, event-driven times
    time_intervals = np.sort(np.random.uniform(0, duration_minutes*60, n_samples))
    timestamps = [start_time + timedelta(seconds=t) for t in time_intervals]
    
    # Source and destination IP addresses (vehicle communication)
    vehicle_ips = ['192.168.1.10', '192.168.1.11', '192.168.1.12']  # Vehicle internal network
    external_ips = ['8.8.8.8', '1.1.1.1', '10.0.0.1', '172.16.0.1']  # Simulated external services (10.x and 172.16.x are private ranges used as stand-ins)
    
    src_ips = np.random.choice(vehicle_ips + external_ips, n_samples)
    dst_ips = np.random.choice(vehicle_ips + external_ips, n_samples)
    
    # Protocol types (TCP, UDP, ICMP for vehicle communications)
    protocols = np.random.choice(['TCP', 'UDP', 'ICMP'], n_samples, p=[0.6, 0.35, 0.05])
    
    # Packet counts (bursts during active communication)
    base_packets = np.random.poisson(lam=10, size=n_samples)
    burst_factor = np.random.choice([1, 5, 20], n_samples, p=[0.8, 0.15, 0.05])  # Occasional bursts
    packet_count = base_packets * burst_factor
    
    # Bytes transferred (correlates with packet count)
    bytes_per_packet = np.random.normal(1500, 500, n_samples)  # Centered on the 1500-byte Ethernet MTU
    bytes_transferred = np.maximum(64, packet_count * np.maximum(64, bytes_per_packet))
    
    # Port numbers (common vehicle communication ports)
    common_ports = [80, 443, 53, 22, 8080, 1883, 5683]  # HTTP, HTTPS, DNS, SSH, HTTP-alt, MQTT, CoAP
    # Simplified port selection: 60% common ports, 40% random high ports
    port = []
    for _ in range(n_samples):
        if np.random.random() < 0.6:
            port.append(np.random.choice(common_ports))
        else:
            port.append(np.random.randint(1024, 65536))
    port = np.array(port)
    
    # Create DataFrame
    network_df = pd.DataFrame({
        'timestamp': timestamps,
        'src_ip': src_ips,
        'dst_ip': dst_ips,
        'protocol': protocols,
        'packet_count': packet_count,
        'bytes_transferred': np.round(bytes_transferred).astype(int),
        'port': port
    })
    
    # Sort by timestamp
    network_df = network_df.sort_values('timestamp').reset_index(drop=True)
    
    print(f"Generated network data: {len(network_df)} records")
    print(f"   Time range: {network_df['timestamp'].min()} to {network_df['timestamp'].max()}")
    
    return network_df

# Generate larger network traffic dataset for balanced training  
network_data = generate_network_data(n_samples=3000, duration_minutes=90)

# Display sample data
print("\nSample Network Traffic Data:")
print(network_data.head(10))
print(f"\nNetwork Data Shape: {network_data.shape}")

## Step 2: Data Preprocessing and Cleaning

Before fusion, we need to:
1. **Standardize timestamps** to enable proper synchronization
2. **Handle missing values** realistically
3. **Encode categorical variables** for ML compatibility
4. **Normalize numerical features** for consistent scaling
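
Two of these steps have details that are easy to miss. The toy sketch below (values invented for illustration) shows how linear interpolation fills a gap between two known readings, and why a ratio feature needs explicit handling of both `inf` and `NaN`: dividing a non-zero value by zero gives `inf`, `0/0` gives `NaN`, and `fillna` alone leaves `inf` in place.

```python
import numpy as np
import pandas as pd

# Toy sensor series with one dropped reading
speed = pd.Series([10.0, np.nan, 14.0, 16.0])
filled = speed.interpolate(method="linear")
print(filled.tolist())

# Toy ratio feature: x/0 becomes inf, 0/0 becomes NaN
bytes_sent = pd.Series([1500, 64, 0])
packets = pd.Series([1, 0, 0])
ratio = bytes_sent / packets

# Map inf to NaN first, then fill every undefined ratio with 0
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned.tolist())
```

Replacing undefined ratios with 0 is a modeling choice made for this sketch; a sentinel value or a separate indicator column are alternatives.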

**🔥 HANDS-ON PRACTICE**: Complete the preprocessing functions below by filling in the missing code sections!

In [None]:
def preprocess_telemetry_data(df):
    """
    Preprocess vehicle telemetry data
    """
    print("Preprocessing telemetry data...")
    
    df_processed = df.copy()
    
    # TODO: 1. Standardize timestamp format
    # HINT: Use pd.to_datetime() to convert timestamp column to datetime format
    df_processed['timestamp'] = # TODO: Convert timestamp to datetime format
    
    # 2. Introduce realistic missing values (sensor failures)
    missing_rate = 0.02  # 2% missing data
    for col in ['speed', 'acceleration', 'distance_to_obstacle']:
        missing_indices = np.random.choice(len(df_processed), 
                                         int(len(df_processed) * missing_rate), 
                                         replace=False)
        df_processed.loc[missing_indices, col] = np.nan
    
    # TODO: 3. Handle missing values with interpolation (realistic for continuous sensor data)
    # HINT: Use the interpolate() method with 'linear' method for each numeric column
    numeric_cols = ['speed', 'acceleration', 'distance_to_obstacle', 'gps_lat', 'gps_lon']
    for col in numeric_cols:
        # TODO: Apply linear interpolation to handle missing values
        df_processed[col] = df_processed[col].# TODO: Add interpolation method
    
    # 4. Add prefix to column names
    rename_dict = {col: f'veh_{col}' for col in df_processed.columns if col != 'timestamp'}
    df_processed = df_processed.rename(columns=rename_dict)
    
    print(f"   Processed {len(df_processed)} telemetry records")
    print(f"   Missing values handled via interpolation")
    
    return df_processed

def preprocess_network_data(df):
    """
    Preprocess network traffic data for cybersecurity analysis
    """
    print("Preprocessing network data...")
    
    df_processed = df.copy()
    
    # TODO: 1. Standardize timestamp format
    # HINT: Convert timestamp column to datetime format like in telemetry preprocessing
    df_processed['timestamp'] = # TODO: Convert timestamp to datetime format
    
    # TODO: 2. Encode categorical variables
    # HINT: Use LabelEncoder for protocol column
    label_encoder = LabelEncoder()
    # TODO: Apply label encoding to the 'protocol' column
    df_processed['protocol_encoded'] = # TODO: Fit and transform protocol column
    
    # TODO: 3. Create derived features for network analysis
    # HINT: Calculate bytes per packet by dividing bytes_transferred by packet_count
    df_processed['bytes_per_packet'] = # TODO: Calculate bytes_transferred / packet_count
    
    # TODO: 4. Handle potential division by zero
    # HINT: Replace infinite values with 0, e.g. .replace([np.inf, -np.inf], 0) or np.where with np.isinf (fillna alone does not catch inf)
    df_processed['bytes_per_packet'] = # TODO: Replace inf values with 0
    
    # 5. Add prefix to column names (except timestamp)
    rename_dict = {col: f'net_{col}' for col in df_processed.columns if col != 'timestamp'}
    df_processed = df_processed.rename(columns=rename_dict)
    
    print(f"   Processed {len(df_processed)} network records")
    print(f"   Categorical encoding and feature engineering completed")
    
    return df_processed

# TODO: Apply preprocessing functions to our datasets
# HINT: Call the preprocessing functions on telemetry_data and network_data

print("Step 2: Preprocessing datasets...")
print("=" * 50)

# TODO: Preprocess telemetry data
telemetry_processed = # TODO: Call preprocess_telemetry_data function

# TODO: Preprocess network data  
network_processed = # TODO: Call preprocess_network_data function

print("\nPreprocessing completed!")
print(f"Telemetry processed shape: {telemetry_processed.shape}")
print(f"Network processed shape: {network_processed.shape}")

## Step 3: Temporal Synchronization

The key challenge is synchronizing data from different sensors with varying sampling rates:
- **Vehicle telemetry**: Regular intervals (~1.8 seconds for the 3000 samples over 90 minutes generated in Step 1)
- **Network logs**: Irregular, event-driven timestamps

We'll use **nearest neighbor matching** to align the datasets.
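
For comparison, pandas also offers a vectorized form of nearest-timestamp matching through `pd.merge_asof`. The sketch below applies it to two small invented frames; it illustrates the idea and is not the loop-based exercise that follows. The `net_timestamp` helper column is an assumption added here so the time difference can be computed after the merge.

```python
import pandas as pd

# Toy reference stream at regular intervals (stand-in for telemetry)
left = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:10", "2024-01-01 00:00:20"]
    ),
    "veh_speed": [50.0, 52.0, 54.0],
})

# Toy irregular event stream (stand-in for network logs)
right = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:01", "2024-01-01 00:00:12", "2024-01-01 00:00:40"]
    ),
    "net_packet_count": [10, 200, 15],
})
right["net_timestamp"] = right["timestamp"]  # keep the matched time for inspection

# Both frames must be sorted on the key; rows with no match within 5 s get NaN
matched = pd.merge_asof(
    left.sort_values("timestamp"),
    right.sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta(seconds=5),
)

# Drop unmatched rows and record how far apart each matched pair is
matched = matched.dropna(subset=["net_packet_count"]).copy()
matched["time_diff_seconds"] = (
    (matched["net_timestamp"] - matched["timestamp"]).dt.total_seconds().abs()
)
print(matched)
```

In this toy data the third telemetry row has no network record within 5 seconds, so it is dropped, mirroring the tolerance check in the exercise.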

**🔥 HANDS-ON PRACTICE**: Implement the temporal synchronization algorithm to align asynchronous data streams!

In [None]:
def synchronize_datasets(telemetry_df, network_df, time_tolerance_seconds=5):
    """
    Synchronize telemetry and network data using nearest neighbor timestamp matching
    
    Parameters:
    - telemetry_df: Preprocessed vehicle telemetry data
    - network_df: Preprocessed network traffic data
    - time_tolerance_seconds: Maximum allowed time difference for matching
    
    Returns:
    - Synchronized and merged dataset
    """
    print(f"Synchronizing datasets with {time_tolerance_seconds}s tolerance...")
    
    # Use telemetry timestamps as the reference (more regular)
    merged_data = []
    
    for idx, telem_row in telemetry_df.iterrows():
        telem_time = telem_row['timestamp']
        
        # TODO: Find closest network record within tolerance
        # HINT: Calculate absolute time differences using (network_df['timestamp'] - telem_time).dt.total_seconds().abs()
        time_diffs = # TODO: Calculate time differences between network timestamps and current telemetry timestamp
        
        # TODO: Find the index of the minimum time difference
        # HINT: Use idxmin() method on time_diffs
        closest_idx = # TODO: Get index of minimum time difference
        min_diff = time_diffs.loc[closest_idx]  # idxmin() returns an index label, so look it up with .loc
        
        # TODO: Only merge if within tolerance
        # HINT: Check if min_diff <= time_tolerance_seconds
        if # TODO: Add condition to check time tolerance:
            # Merge the records
            network_row = network_df.loc[closest_idx]  # closest_idx is an index label from idxmin()
            
            merged_row = {
                'timestamp': telem_time,
                'time_diff_seconds': min_diff
            }
            
            # TODO: Add telemetry features (excluding timestamp)
            # HINT: Loop through telem_row.index and add columns that are not 'timestamp'
            for col in telem_row.index:
                if col != 'timestamp':
                    # TODO: Add telemetry column to merged_row
                    merged_row[col] = # TODO: Get value from telem_row
            
            # TODO: Add network features (excluding timestamp) 
            # HINT: Similar to telemetry, loop through network_row.index
            for col in network_row.index:
                if col != 'timestamp':
                    # TODO: Add network column to merged_row
                    merged_row[col] = # TODO: Get value from network_row
            
            merged_data.append(merged_row)
    
    # TODO: Convert merged_data list to DataFrame
    # HINT: Use pd.DataFrame() constructor
    merged_df = # TODO: Create DataFrame from merged_data list
    
    print(f"   Successfully synchronized {len(merged_df)} record pairs")
    print(f"   Average time difference: {merged_df['time_diff_seconds'].mean():.2f}s")
    print(f"   Max time difference: {merged_df['time_diff_seconds'].max():.2f}s")
    
    return merged_df

# TODO: Synchronize the datasets
# HINT: Call synchronize_datasets function with the processed datasets
print("Step 3: Temporal Synchronization")
print("=" * 50)

synchronized_data = # TODO: Call synchronization function

print(f"\nSynchronized Dataset Shape: {synchronized_data.shape}")
print(f"Sample Synchronized Data:")
print(synchronized_data.head(3))

In [None]:
def add_anomaly_labels(df, anomaly_rate=0.40, balance_classes=True):
    """
    Add BALANCED anomaly labels to the synchronized dataset for fair model training
    
    Labels:
    - 0: Normal operation
    - 1: Physical anomaly (vehicle sensor/behavior issue)  
    - 2: Network anomaly (cybersecurity threat)
    
    Parameters:
    - df: Synchronized dataset
    - anomaly_rate: Nominal fraction of anomalous data (only printed; the balanced branch uses the fixed 60/20/20 split below)
    - balance_classes: Whether to create balanced class distribution
    
    Returns:
    - Dataset with balanced anomaly labels and injected anomalies
    """
    print(f"Adding BALANCED anomaly labels (target anomaly rate: {anomaly_rate:.1%})...")
    
    df_labeled = df.copy()
    
    # TODO: Initialize all labels as normal (0)
    # HINT: Create a new column 'label' and set all values to 0
    df_labeled['label'] = # TODO: Set all labels to 0 (normal)
    
    if balance_classes:
        # BALANCED APPROACH: Reduce class skew so each anomaly type is well represented in training
        total_samples = len(df_labeled)
        
        # Target distribution: 60% Normal, 20% Physical, 20% Network
        # This creates a much more balanced dataset for better learning
        normal_target = int(total_samples * 0.60)      # 60% normal
        physical_target = int(total_samples * 0.20)    # 20% physical anomalies  
        network_target = total_samples - normal_target - physical_target  # 20% network anomalies
        
        print(f"   BALANCED class targets:")
        print(f"      Normal: {normal_target} samples (60%)")
        print(f"      Physical: {physical_target} samples (20%)")
        print(f"      Network: {network_target} samples (20%)")
        
        # TODO: Randomly select indices for physical anomalies
        # HINT: Use np.random.choice() to select indices without replacement
        physical_indices = # TODO: Randomly choose physical_target indices
        
        # TODO: Randomly select indices for network anomalies (from remaining samples)
        # HINT: Select from indices not already chosen for physical anomalies
        remaining_indices = # TODO: Get indices that are not in physical_indices
        network_indices = # TODO: Randomly choose network_target indices from remaining
        
        # TODO: Assign labels for physical anomalies
        # HINT: Set df_labeled.loc[physical_indices, 'label'] = 1
        df_labeled.loc[physical_indices, 'label'] = # TODO: Set to 1 for physical anomalies
        
        # TODO: Assign labels for network anomalies  
        # HINT: Set df_labeled.loc[network_indices, 'label'] = 2
        df_labeled.loc[network_indices, 'label'] = # TODO: Set to 2 for network anomalies
        
        # TODO: Inject realistic anomaly patterns
        # For physical anomalies: Modify vehicle features
        for idx in physical_indices:
            # TODO: Simulate erratic vehicle behavior
            # HINT: Multiply speed by a random factor between 0.3 and 2.0
            df_labeled.loc[idx, 'veh_speed'] *= # TODO: Random factor for erratic speed
            # TODO: Add random acceleration anomaly
            df_labeled.loc[idx, 'veh_acceleration'] += # TODO: Add random value between -5 and 5
        
        # For network anomalies: Modify network features  
        for idx in network_indices:
            # TODO: Simulate suspicious network activity
            # HINT: Multiply packet_count by an integer factor between 5 and 20 (e.g. np.random.randint(5, 21)) so the integer column keeps its dtype
            df_labeled.loc[idx, 'net_packet_count'] *= # TODO: Random factor for packet bursts
            # TODO: Modify bytes transferred proportionally
            df_labeled.loc[idx, 'net_bytes_transferred'] *= # TODO: Same random factor as packets
    
    # TODO: Calculate final label distribution
    # HINT: Use value_counts() method on the 'label' column
    label_counts = # TODO: Get counts for each label value
    
    print(f"\n   Final label distribution:")
    print(f"      Normal (0): {label_counts.get(0, 0):4d} ({label_counts.get(0, 0)/len(df_labeled)*100:.1f}%)")
    print(f"      Physical (1): {label_counts.get(1, 0):4d} ({label_counts.get(1, 0)/len(df_labeled)*100:.1f}%)")
    print(f"      Network (2): {label_counts.get(2, 0):4d} ({label_counts.get(2, 0)/len(df_labeled)*100:.1f}%)")
    
    if balance_classes:
        print(f"   BALANCED dataset created for fair model training!")
        print(f"   Much better class representation vs original 88%/6%/6%")
    
    return df_labeled

# TODO: Generate BALANCED dataset for fair model training
print("CREATING BALANCED DATASET FOR FAIR MODEL TRAINING")
print("="*60)
# TODO: Call add_anomaly_labels function
final_dataset = # TODO: Add anomaly labels to synchronized data

print(f"\nFinal Dataset Shape: {final_dataset.shape}")
print(f"Sample Final Dataset:")
print(final_dataset.head(5))

## Step 4: Exploratory Data Analysis (EDA)

Let's analyze the fused dataset to understand:
1. **Data distribution and quality**
2. **Correlation between vehicle and network features**
3. **Anomaly patterns and characteristics**
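
A compact way to preview these three questions is a missing-value count, a cross-modal correlation, and a per-class mean table. The sketch below runs on an invented toy frame whose network class is deliberately exaggerated; the column names mirror the `veh_`/`net_` prefixes used in this notebook, but the data is not the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "veh_speed": rng.normal(50, 10, n),
    "net_packet_count": rng.poisson(10, n).astype(float),
    "label": rng.choice([0, 1, 2], size=n, p=[0.6, 0.2, 0.2]),
})
# Exaggerated effect so the network class (label 2) stands out
toy.loc[toy["label"] == 2, "net_packet_count"] *= 10

# 1. Data quality: missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# 2. Correlation between a vehicle feature and a network feature
cross_corr = toy[["veh_speed", "net_packet_count"]].corr()
print(cross_corr.round(3))

# 3. Anomaly characteristics: per-class feature means
per_class = toy.groupby("label")[["veh_speed", "net_packet_count"]].mean()
print(per_class.round(2))
```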

**🔥 HANDS-ON PRACTICE**: Complete the data analysis functions to explore patterns in the fused dataset!

In [None]:
def analyze_dataset_statistics(df):
    """
    Provide comprehensive statistics about the fused dataset
    """
    print("DATASET STATISTICS SUMMARY")
    print("=" * 60)
    
    # TODO: Basic information
    # HINT: Use df.shape for dimensions, df['timestamp'].min()/max() for time range
    print(f"Dataset Shape: {# TODO: Get dataset shape}")
    print(f"Time Range: {# TODO: Get min timestamp} to {# TODO: Get max timestamp}")
    
    # TODO: Calculate duration in minutes
    # HINT: Subtract min timestamp from max timestamp, then use .total_seconds()/60
    duration_minutes = # TODO: Calculate duration between max and min timestamps in minutes
    print(f"Duration: {duration_minutes:.1f} minutes")
    
    # TODO: Data quality metrics
    # HINT: Use df.isnull().sum().sum() to count all missing values
    missing_data = # TODO: Count total missing values in dataset
    print(f"\nData Quality:")
    print(f"   Total missing values: {missing_data}")
    
    # TODO: Calculate data completeness percentage
    # HINT: (1 - missing_data/(df.shape[0]*df.shape[1]))*100
    completeness = # TODO: Calculate data completeness percentage
    print(f"   Data completeness: {completeness:.2f}%")
    
    # TODO: Feature categories
    # HINT: Use list comprehension to filter columns that start with 'veh_' and 'net_'
    vehicle_features = # TODO: Get columns that start with 'veh_'
    network_features = # TODO: Get columns that start with 'net_'
    
    print(f"\nVehicle Features ({len(vehicle_features)}): {vehicle_features}")
    print(f"Network Features ({len(network_features)}): {network_features}")
    
    # TODO: Label distribution
    # HINT: Use value_counts() on the 'label' column and sort_index()
    label_dist = # TODO: Get label distribution using value_counts
    
    print(f"\nLabel Distribution:")
    labels = ['Normal', 'Physical Anomaly', 'Network Anomaly']
    for count, label in zip(label_dist, labels):
        print(f"   {label}: {count} ({count/len(df)*100:.1f}%)")
    
    return {
        'vehicle_features': vehicle_features,
        'network_features': network_features,
        'label_distribution': label_dist
    }

# TODO: Analyze the dataset
# HINT: Call analyze_dataset_statistics function with final_dataset
analysis_results = # TODO: Call analysis function

# TODO: Basic descriptive statistics
print("\nNumerical Features Statistics:")
# TODO: Select only numerical columns (exclude 'label' column)
# HINT: Use select_dtypes(include=[np.number]).columns.tolist() so the result is a list, then remove 'label' if present
numerical_cols = # TODO: Get numerical columns excluding 'label'
if 'label' in numerical_cols:
    numerical_cols.remove('label')
    
# TODO: Display descriptive statistics
# HINT: Use df[numerical_cols].describe().round(3)
# TODO: Show descriptive statistics for numerical columns

In [None]:
# Correlation analysis and visualization
def visualize_correlations_and_anomalies(df):
    """
    Create comprehensive visualizations of the fused dataset
    """
    print("Creating comprehensive visualizations...")
    
    # TODO: Set up subplot configuration
    # HINT: Create a 2x2 subplot grid using plt.subplots()
    fig, ((ax1, ax2), (ax3, ax4)) = # TODO: Create 2x2 subplot grid with figsize=(15, 12)
    
    # Plot 1: Correlation heatmap of numerical features
    # TODO: Select numerical columns excluding non-feature columns
    # HINT: Get numerical columns and exclude 'label', 'timestamp', 'time_diff_seconds'
    numerical_features = # TODO: Get numerical columns for correlation analysis
    exclude_cols = ['label', 'timestamp', 'time_diff_seconds']
    correlation_cols = [col for col in numerical_features if col not in exclude_cols]
    
    # TODO: Calculate correlation matrix
    # HINT: Use df[correlation_cols].corr()
    correlation_matrix = # TODO: Calculate correlation matrix
    
    # TODO: Create correlation heatmap
    # HINT: Use sns.heatmap() with annot=True, cmap='coolwarm', center=0
    sns.heatmap(# TODO: Add correlation matrix and parameters for heatmap, ax=ax1)
    ax1.set_title('Feature Correlation Matrix')
    
    # Plot 2: Label distribution
    # TODO: Create label distribution plot
    # HINT: Use df['label'].value_counts().sort_index().plot(kind='bar', ax=ax2) so bars match the tick labels and land on ax2
    label_counts = # TODO: Get label counts and create bar plot
    ax2.set_title('Class Distribution')
    ax2.set_xlabel('Anomaly Type')
    ax2.set_ylabel('Count')
    ax2.set_xticklabels(['Normal', 'Physical', 'Network'], rotation=0)
    
    # Plot 3: Vehicle vs Network features scatter
    # TODO: Create scatter plot comparing vehicle speed vs network packet count
    # HINT: Use different colors for different labels using c=df['label']
    scatter = ax3.scatter(# TODO: Add x=vehicle speed, y=network packet count, c=labels, parameters)
    ax3.set_xlabel('Vehicle Speed (veh_speed)')
    ax3.set_ylabel('Network Packet Count (net_packet_count)')
    ax3.set_title('Vehicle vs Network Features by Anomaly Type')
    
    # Plot 4: Feature distribution by anomaly type
    # TODO: Create distribution plots for a key feature across different labels
    feature_to_analyze = 'veh_acceleration'
    for label in [0, 1, 2]:
        label_names = {0: 'Normal', 1: 'Physical', 2: 'Network'}
        # TODO: Plot density curve for the feature filtered by label
        # HINT: Use df[df['label']==label][feature_to_analyze].plot(kind='density', ax=ax4, label=label_names[label])
        #       (pandas density plots require SciPy)
        # TODO: Plot density for current label on ax4
    
    ax4.set_title(f'Distribution: {feature_to_analyze}')
    ax4.set_xlabel('Acceleration (m/s²)')
    ax4.set_ylabel('Density')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# TODO: Create comprehensive visualizations
# HINT: Call visualize_correlations_and_anomalies function with final_dataset
# TODO: Call visualization function

print("\nExploratory Data Analysis Complete!")

## Step 5: Feature Normalization and Final Dataset Preparation

Before saving the final dataset, we'll normalize numerical features to ensure they're ready for machine learning models.
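
`StandardScaler` standardizes each column as `(x - mean) / std`, using the population standard deviation (`ddof=0`). The toy sketch below (invented values) checks that against a manual z-score and shows why a later check with pandas' default `.std()` (`ddof=1`) lands slightly above 1 rather than exactly at 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "veh_speed": [40.0, 50.0, 60.0, 70.0],
    "net_packet_count": [5.0, 10.0, 200.0, 15.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Manual z-score with the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print("Matches manual z-score:", np.allclose(scaled.values, manual.values))
print("Population std (ddof=0):", scaled.std(ddof=0).round(3).tolist())
print("Sample std (pandas default, ddof=1):", scaled.std().round(3).tolist())
```

The gap between the two standard deviations shrinks as the number of rows grows, so on a dataset with thousands of rows the difference is negligible.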

**🔥 HANDS-ON PRACTICE**: Complete the feature normalization and dataset export functions!

In [None]:
def normalize_features(df, exclude_cols=('timestamp', 'label', 'time_diff_seconds')):
    """
    Normalize numerical features using StandardScaler
    
    Parameters:
    - df: Input dataset
    - exclude_cols: Columns to exclude from normalization
    
    Returns:
    - Normalized dataset and fitted scaler
    """
    print("Normalizing numerical features...")
    
    df_normalized = df.copy()
    
    # TODO: Identify numerical columns to normalize
    # HINT: Use select_dtypes(include=[np.number]) to get numerical columns
    numerical_cols = # TODO: Get all numerical columns from dataframe
    
    # TODO: Filter out columns that should not be normalized
    # HINT: Use list comprehension to exclude cols in exclude_cols
    cols_to_normalize = # TODO: Get numerical columns that are not in exclude_cols
    
    print(f"   Features to normalize: {cols_to_normalize}")
    
    # TODO: Apply StandardScaler
    # HINT: Create StandardScaler instance and use fit_transform()
    scaler = # TODO: Create StandardScaler instance
    df_normalized[cols_to_normalize] = # TODO: Apply fit_transform to selected columns
    
    print(f"   Normalized {len(cols_to_normalize)} features")
    print(f"   Features now have mean≈0 and std≈1")
    
    return df_normalized, scaler

# TODO: Apply normalization to the final dataset
print("Step 5: Feature Normalization and Final Preparation")
print("=" * 60)

# TODO: Call normalize_features function
final_normalized_dataset, feature_scaler = # TODO: Normalize final_dataset

print(f"\nNormalized Dataset Shape: {final_normalized_dataset.shape}")

# TODO: Display normalization verification
print("\nNormalization Verification (mean should ≈ 0, std should ≈ 1):")
# TODO: Show mean and std of normalized numerical features
# HINT: Select numerical columns, calculate mean() and std(), and round to 3 decimals
numerical_cols = final_normalized_dataset.select_dtypes(include=[np.number]).columns.tolist()
exclude_verify = ['label', 'time_diff_seconds'] 
verify_cols = [col for col in numerical_cols if col not in exclude_verify]

print("Means:")
print(# TODO: Show means of normalized columns)
print("\nStandard Deviations:")  
print(# TODO: Show standard deviations of normalized columns)

## Step 6: Save Final Dataset

Save the preprocessed, fused, and normalized dataset for use in subsequent machine learning workflows.
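
CSV stores timestamps as plain text, so a later notebook has to parse them again on load. The sketch below writes an invented toy frame to a temporary file, reads the on-disk size with `os.path.getsize`, and restores the datetime column with `parse_dates`. The file name and columns are assumptions for this example only.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:00:02"]),
    "veh_speed": [0.12, -0.34],
    "label": [0, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.csv")
    toy.to_csv(path, index=False)            # index=False avoids an extra unnamed column
    file_size_bytes = os.path.getsize(path)  # size on disk, not in-memory usage
    restored = pd.read_csv(path, parse_dates=["timestamp"])

print(f"File size: {file_size_bytes} bytes")
print(restored.dtypes)
```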

In [None]:
# TODO: Export the final processed dataset
def export_processed_dataset(df, filename='combined_dataset.csv'):
    """
    Export the final processed and normalized dataset
    
    Parameters:
    - df: Final processed dataset
    - filename: Output filename
    
    Returns:
    - Success message and dataset summary
    """
    print(f"Exporting processed dataset to {filename}...")
    
    # TODO: Convert timestamp to string format for CSV compatibility
    # HINT: Use pd.to_datetime() and dt.strftime() to format timestamps
    df_export = df.copy()
    df_export['timestamp'] = # TODO: Convert timestamp to string format 'YYYY-MM-DD HH:MM:SS'
    
    # TODO: Save to CSV file
    # HINT: Use df.to_csv() with index=False
    # TODO: Export dataframe to CSV file
    
    # TODO: Calculate and display summary statistics
    print(f"\nDataset successfully exported!")
    print(f"   File: {filename}")
    print(f"   Shape: {df_export.shape}")
    print(f"   In-memory size: {# TODO: Calculate in-memory size - use df_export.memory_usage(deep=True).sum() / 1024**2} MB")
    
    return df_export

# TODO: Create final export dataset 
print("Final Dataset Export")
print("=" * 50)

# TODO: Call export function with normalized dataset
final_export_dataset = # TODO: Export final_normalized_dataset

# TODO: Display final summary
print(f"\nFINAL DATASET SUMMARY:")
print(f"   Total Records: {# TODO: Get total number of records}")
print(f"   Total Features: {# TODO: Get total number of features}")
print(f"   Vehicle Features: {# TODO: Count columns starting with 'veh_'}")
print(f"   Network Features: {# TODO: Count columns starting with 'net_'}")

# TODO: Show label distribution
# HINT: Use value_counts().sort_index() on 'label' column
label_counts = # TODO: Get final label distribution

print(f"\nLabel Distribution:")
print(f"   Normal (0): {label_counts.get(0, 0):,} records ({label_counts.get(0, 0)/len(final_export_dataset)*100:.1f}%)")
print(f"   Physical Anomaly (1): {label_counts.get(1, 0):,} records ({label_counts.get(1, 0)/len(final_export_dataset)*100:.1f}%)")
print(f"   Network Anomaly (2): {label_counts.get(2, 0):,} records ({label_counts.get(2, 0)/len(final_export_dataset)*100:.1f}%)")

print(f"\nDataset ready for Edge AI model training in next notebook!")