## Data Exploration Notebook

## Library Imports

In [1]:
## Import necessary libraries here
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker

In [2]:
# set the filepath to the parquet cleaned dataset
PATH = "../Data/Cleaned/Cleaned_Dataset.parquet"

# scan the parquet file with polars
scan = pl.scan_parquet(PATH)

# get the dataset schema
schema = scan.collect_schema()

## 1. General Descriptive Statistics

This section examines missing values, row and column counts, numeric descriptive statistics, and categorical descriptive statistics for the cleaned dataset.

### 1.1 Get row and column counts of the cleaned dataset

In [3]:
# get row counts
row_count = scan.select(pl.len()).collect().item()

# get column counts
col_names = schema.names()
col_count = len(col_names)

# output the row and column counts
print(f"The cleaned dataset has {row_count} rows and {col_count} columns.")

The cleaned dataset has 21005240 rows and 42 columns.


### 1.2 Column names and their respective data types

In [4]:
dtype_df = pd.DataFrame({
    "column": col_names,
    "dtype": [str(schema[name]) for name in col_names]
})

dtype_df

| | column | dtype |
|---|---|---|
| 0 | Header_Length | Float32 |
| 1 | Protocol_Type | Categorical |
| 2 | Time_To_Live | Float32 |
| 3 | Rate | Float32 |
| 4 | fin_flag_number | Float32 |
| 5 | syn_flag_number | Float32 |
| 6 | rst_flag_number | Float32 |
| 7 | psh_flag_number | Float32 |
| 8 | ack_flag_number | Float32 |
| 9 | ece_flag_number | Float32 |


### 1.3 Missing Value Analysis

In [5]:
# Missing counts
missing_count_df = (
    scan
    .select([
        pl.col(c).null_count().alias(c)
        for c in col_names
    ])
    .collect()
    .transpose(include_header=True, header_name="feature")
)

# Rename the second column (e.g. 'column_0') to 'missing_count'
missing_count_df = missing_count_df.rename({
    missing_count_df.columns[1]: "missing_count"
})


# Missing percentages
missing_pct_df = (
    scan
    .select([
        (pl.col(c).null_count() / pl.len() * 100).alias(c)
        for c in col_names
    ])
    .collect()
    .transpose(include_header=True, header_name="feature")
)

# Rename the second column to 'missing_pct'
missing_pct_df = missing_pct_df.rename({
    missing_pct_df.columns[1]: "missing_pct"
})


# Combine into a single DataFrame and sort
missing_pl = (
    missing_count_df
    .join(missing_pct_df, on="feature", how="inner")
    .sort("missing_pct", descending=True)
)

missing_df = missing_pl.to_pandas()
missing_df

| | feature | missing_count | missing_pct |
|---|---|---|---|
| 0 | Protocol_Type | 3168935 | 15.086402 |
| 1 | Header_Length | 0 | 0.0 |
| 2 | Time_To_Live | 0 | 0.0 |
| 3 | Rate | 0 | 0.0 |
| 4 | fin_flag_number | 0 | 0.0 |
| 5 | syn_flag_number | 0 | 0.0 |
| 6 | rst_flag_number | 0 | 0.0 |
| 7 | psh_flag_number | 0 | 0.0 |
| 8 | ack_flag_number | 0 | 0.0 |
| 9 | ece_flag_number | 0 | 0.0 |


In [None]:
# visualization for missing features
nonzero_missing = missing_df[missing_df["missing_count"] > 0].copy()

if not nonzero_missing.empty:
    plt.figure(figsize=(8, max(3, 0.3 * len(nonzero_missing))))
    sns.barplot(
        data=nonzero_missing,
        y="feature",
        x="missing_pct",
        orient="h"
    )
    plt.xlabel("Missing (%)")
    plt.ylabel("Feature")
    plt.title("Missingness by Feature (Full Dataset)")
    plt.tight_layout()
    plt.show()

### 1.4 Numeric Descriptive Statistics

In [None]:
# establish the numeric columns using Int32 or Float32 data types
numeric_cols = [
    name for name in col_names
    if schema[name] in {pl.Int32, pl.Float32}]

# build aggregate expressions for each numeric column
# initialize the empty list
agg_exprs = []

# for loop to append the aggregate expressions for each numeric column
for c in numeric_cols:
    agg_exprs.extend([
        pl.col(c).mean().alias(f"{c}_mean"),
        pl.col(c).std().alias(f"{c}_std"),
        pl.col(c).min().alias(f"{c}_min"),
        pl.col(c).quantile(0.25).alias(f"{c}_q1"),
        pl.col(c).median().alias(f"{c}_median"),
        pl.col(c).quantile(0.75).alias(f"{c}_q3"),
        pl.col(c).max().alias(f"{c}_max"),
    ])

# run a single lazy pass over the full dataset
stats_pl = scan.select(agg_exprs).collect()

# convert to pandas and reshape to a nicer format
stats_df = stats_pl.to_pandas().T
stats_df.columns = ["value"]

# reshape to a wider format
rows = []
for idx, val in stats_df["value"].items():
    feature, stat = idx.rsplit("_", 1)
    rows.append([feature, stat, val])

wide = pd.DataFrame(rows, columns=["feature", "stat", "value"])
wide_df = wide.pivot(index="feature", columns="stat", values="value")

wide_df

The numeric descriptive statistics above reveal that the feature `IAT` has a minimum value of -0.01781797967851162. A negative inter-arrival time would mean that a packet arrived before its predecessor, so this value is an artifact rather than a property of actual network traffic. It occurs as a result of packet capture timestamp jitter, which is a known limitation of CICFlowMeter. We will need to decide whether to clamp negative values to zero or leave them as is.
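The clamping option mentioned above can be sketched as follows. This is a minimal illustration on a small made-up pandas Series, not the full dataset; the values in `iat` and the helper name `clamp_negative_iat` are hypothetical.

```python
import pandas as pd


def clamp_negative_iat(values: pd.Series) -> tuple[pd.Series, int]:
    """Return a copy with negative values set to 0, plus the number of values changed."""
    n_negative = int((values < 0).sum())
    return values.clip(lower=0), n_negative


# Toy values for illustration only (not taken from the dataset)
iat = pd.Series([0.0, 0.004, -0.0178, 0.12, -0.0001])

clamped, n_changed = clamp_negative_iat(iat)
print(f"Clamped {n_changed} negative value(s); new minimum = {clamped.min()}")
```

Counting the negative values before clamping shows how many rows would be affected, which can help when choosing between clamping and leaving the values as is.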

### 1.5 Categorical Descriptive Statistics


In [None]:
categorical_cols = ["Protocol_Type", "Label"]

# initialize empty categorical summary dictionary
categorical_summary = {}

# for loop to compute value counts for each categorical column
for col in categorical_cols:
    # Compute full value counts for the column using lazy evaluation
    value_counts_df = (
        scan
        .group_by(col)
        .len()
        .sort("len", descending=True)
        .collect()
        .to_pandas()
        .rename(columns={"len": "count"})
    )
    
    categorical_summary[col] = value_counts_df

# display the value counts for each categorical column
for col in categorical_cols:
    display(categorical_summary[col])

The categorical descriptive statistics above, combined with the missing-value analysis in Section 1.3, show that `Protocol_Type` has 3,168,935 missing values, roughly 15% of the cleaned dataset. We may consider dropping these rows to further clean the dataset for modeling.
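Before dropping rows, it may be worth checking whether the missing `Protocol_Type` values are spread evenly across labels, since dropping them could otherwise shrink some classes more than others. The sketch below shows one way to do that check on a toy pandas frame; the rows and the label values `"Benign"` and `"DDoS"` are made up for illustration.

```python
import pandas as pd

# Toy frame for illustration only; the real data lives in the parquet file
toy = pd.DataFrame({
    "Protocol_Type": ["TCP", None, "UDP", "TCP", None, "ICMP"],
    "Label": ["Benign", "Benign", "DDoS", "DDoS", "DDoS", "Benign"],
})

# Percentage of rows with a missing Protocol_Type, per label
missing_share = (
    toy["Protocol_Type"].isna()
    .groupby(toy["Label"])
    .mean()
    .mul(100)
    .rename("pct_missing")
)
print(missing_share)

# Drop the rows with a missing Protocol_Type
dropped = toy.dropna(subset=["Protocol_Type"])
print(f"Rows before: {len(toy)}, rows after: {len(dropped)}")
```

If the missing share differs strongly between labels, dropping the rows changes the class balance, and that trade-off should be weighed before modeling.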

In [None]:
for col in categorical_cols:
    value_counts_df = categorical_summary[col]
    plt.figure(figsize=(14, 8))
    ax = sns.barplot(data=value_counts_df, x=col, y="count")
    # Rotate x labels vertically for readability
    plt.xticks(rotation=90)
    # Format y-axis with comma separators
    ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    plt.title(f"Distribution of {col} (Full Dataset)")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()

## 2. Univariate Analysis

In [None]:
# Build our helper histogram function leveraging polars lazy computation
def full_histogram(scan, column: str, bins: int = 50):
    """
    Compute histogram bins for a numeric column using the full dataset 
    (lazy Polars computation), returning a small pandas DataFrame with:
    - bin_mid
    - count
    """
    # Compute min/max for the full column
    min_max = scan.select([
        pl.col(column).min().alias("min"),
        pl.col(column).max().alias("max")
    ]).collect()

    col_min = float(min_max["min"][0])
    col_max = float(min_max["max"][0])

    # avoid invalid bins
    if not np.isfinite(col_min) or not np.isfinite(col_max) or col_min == col_max:
        return pd.DataFrame({"bin_mid": [], "count": []})

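    # Build bin edges
    edges = np.linspace(col_min, col_max, bins + 1)

    # Cut into bins & count frequencies.
    # Only the interior edges are passed to `cut`: it always adds open-ended
    # outer intervals, so passing col_min/col_max would create extra bins.
    hist_df = (
        scan
        .filter(pl.col(column).is_not_null())
        .with_columns([
            pl.col(column).cut(breaks=edges[1:-1].tolist()).alias("bin")
        ])
        .group_by("bin")
        .len()
        .collect()
        .to_pandas()
        .rename(columns={"len": "count"})
    )

    # Compute bin midpoints from the interval labels, e.g. "(a, b]".
    # The outermost intervals are open-ended ("-inf" / "inf"), so clamp
    # them to the observed column minimum and maximum.
    mids = []
    for b in hist_df["bin"]:
        left, right = str(b).strip("[]()").split(",")
        left = max(float(left), col_min)
        right = min(float(right), col_max)
        mids.append((left + right) / 2)

    hist_df["bin_mid"] = mids

    # Sort numerically by midpoint rather than by the categorical bin label
    hist_df = hist_df.sort_values("bin_mid").reset_index(drop=True)

    return hist_df[["bin_mid", "count"]]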

def plot_univariate_numeric(column: str, bins: int = 50):
    """
    Plot the univariate distribution of a numeric feature using a fully 
    memory-safe histogram computed over the entire dataset.

    Parameters
    ----------
    column : str
        Name of the numeric column to visualize.
    bins : int, default=50
        Number of histogram bins used when aggregating the feature.

    Description
    -----------
    This function visualizes the distribution of a numeric variable by using 
    `full_histogram()`, which computes bin counts lazily with Polars. Unlike 
    traditional histogram plotting approaches that require loading the entire 
    column into memory, this method performs aggregation at the Polars 
    LazyFrame level, making it safe to use on very large datasets (e.g., 
    20+ million rows).

    The returned histogram is a small, aggregated DataFrame containing:
        - bin_mid : midpoint of each histogram bin
        - count   : number of observations falling within each bin

    These aggregated results are then plotted using seaborn for a clean, 
    readable visualization of the full distribution.
    """
    hist_df = full_histogram(scan, column, bins=bins)

    plt.figure(figsize=(10, 5))
    sns.barplot(data=hist_df, x="bin_mid", y="count", color="steelblue")

    plt.title(f"Distribution of {column} (Full Dataset)")
    plt.xlabel(column)
    plt.ylabel("Count")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

### 2.1 Skewness of Numeric Features

In [None]:
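# Build a 200,000-observation sample for skewness analysis.
# Note: chunks are read sequentially from the start of the file and 30% of each
# chunk is kept, so the sample is drawn from roughly the first 700,000 rows
# rather than uniformly from the full dataset.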
SKEW_SAMPLE_N = 200_000
# chunk this to keep memory small
CHUNK_SIZE = 100_000

rng = np.random.default_rng(42)
collected = []

start = 0

# Loop until we collect 200k total rows
while sum(len(df) for df in collected) < SKEW_SAMPLE_N:
    chunk = (
        scan
        .slice(start, CHUNK_SIZE)
        .select(numeric_cols)
        .collect()
        .to_pandas()
    )
    
    # If no more rows, break (failsafe)
    if chunk.empty:
        break

    # Randomly choose ~30% of each chunk (integer seed drawn from the shared RNG)
    sampled = chunk.sample(frac=0.3, random_state=int(rng.integers(0, 1_000_000_000)))
    collected.append(sampled)

    start += CHUNK_SIZE

# Concatenate all chunk samples
sample_df = pd.concat(collected, ignore_index=True)

# Final trim down to perfect 200,000
if len(sample_df) > SKEW_SAMPLE_N:
    sample_df = sample_df.sample(n=SKEW_SAMPLE_N, random_state=42)

print("Final sample size:", len(sample_df))
sample_df.head()

In [None]:
# compute skewness dataframe

skewness_series = sample_df.skew(numeric_only=True)

# build skewness dataframe
skewness_df = (
    skewness_series
    .sort_values(ascending=False)
    .to_frame(name="skewness")
)

# compute absolute skewness for ordering
skewness_df["abs_skew"] = skewness_df["skewness"].abs()

# sort by absolute skew (most skewed first)
skewness_df = skewness_df.sort_values("abs_skew", ascending=False)

skewness_df.head(15)

In [None]:
# Select the top 6 most skewed features
top_skewed_features = skewness_df.index[:6].tolist()

for col in top_skewed_features:
    print(f"\nProcessing feature: {col}")

    # Use the existing 200,000 sample_df
    raw_vals = sample_df[col].dropna()

    # Log1p transform
    log_vals = np.log1p(raw_vals)

    fig, axes = plt.subplots(1, 2, figsize=(14, 4))

    # RAW
    sns.histplot(raw_vals, bins=60, ax=axes[0], color="steelblue")
    axes[0].set_title(f"{col}: Raw (Sampled 200k)")
    axes[0].set_xlabel(col)

    # LOG
    sns.histplot(log_vals, bins=60, ax=axes[1], color="darkorange")
    axes[1].set_title(f"{col}: log1p (Sampled 200k)")
    axes[1].set_xlabel(f"log1p({col})")

    plt.tight_layout()
    plt.show()

#### Interpreting the Univariate Distributions for Highly Skewed Features

The raw and log-transformed histograms for highly skewed numeric features (e.g., `IAT`, `Telnet`, `SMTP`, `IRC`, `ece_flag_number`, `cwr_flag_number`) appear dominated by a single bar. Although not visually rich, these plots reveal **critical characteristics** of the dataset.

---

##### 1. Extreme Sparsity in Network Flow Features
Across all highly skewed variables, **over 99% of values are zero or extremely close to zero**. This is typical in IoT network telemetry:

- Most flows never trigger certain protocols or flags.  
- Inter-arrival times (`IAT`) are usually extremely close to 0, even when they are not exactly 0.  
- Protocol counters (`SMTP`, `Telnet`, `IRC`, etc.) are rarely activated.  

As a result, the first histogram bin (containing near-zero values) captures almost the entire dataset.

---

##### 2. The Dominant Zero Bin
Because nearly all values lie at or near zero, the leftmost bin completely dominates the histogram. This causes:

- All other bins to appear empty  
- The distribution to visually collapse into a single spike  
- Very large differences between the dominant bin and the tail  

This is a **true reflection of the data**, not a plotting issue.

---
    
##### 3. Why Log Transform Doesn't Change the Shape Much
The log transform helps with long-tailed data, but:

- `log1p(0) = 0`  
- When 99%+ of values are zero, log1p still leaves the same dominant zero spike  
- Only the very small fraction of high outliers shift position  

Thus, the log1p histogram remains dominated by a single bar.
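The effect described above can be reproduced on synthetic data. The sketch below builds an artificial feature that is 99.5% zeros (an assumption chosen for illustration, not a value measured from the dataset) and compares the share of observations in the leftmost histogram bin before and after `log1p`. The helper name `first_bin_share` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sparse feature: 99.5% zeros plus a few large positive values
values = np.zeros(100_000)
nonzero_idx = rng.choice(values.size, size=500, replace=False)
values[nonzero_idx] = rng.lognormal(mean=5.0, sigma=2.0, size=500)


def first_bin_share(x: np.ndarray, bins: int = 60) -> float:
    """Fraction of observations that fall in the leftmost histogram bin."""
    counts, _ = np.histogram(x, bins=bins)
    return counts[0] / counts.sum()


raw_share = first_bin_share(values)
log_share = first_bin_share(np.log1p(values))

print(f"Leftmost bin share, raw:   {raw_share:.4f}")
print(f"Leftmost bin share, log1p: {log_share:.4f}")
```

Because every zero maps to zero under `log1p`, the leftmost bin keeps at least the 99.5% of observations that are exactly zero in both versions of the histogram.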

---

##### 4. These Features Behave Like Sparse Event Counters
Instead of behaving like continuous numeric variables, these columns function more like:

- **event indicators** (presence vs absence)  
- **anomaly spikes**  
- **rare protocol activations**  

This makes traditional histograms less informative.

---

##### 5. Better Approaches for Understanding These Features
Given the extreme sparsity, more meaningful summaries include:

- Zero vs non-zero counts  
- Percentile tables (p50, p90, p95, p99, p99.9, max)  
- Log-scale boxplots  
- Distributions of **non-zero** values only  
- Comparing distributions across the `Label` column (malicious vs benign)

These provide clearer insight into how these features behave in relation to attack detection.
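Two of the summaries listed above, the non-zero share and the distribution of non-zero values only, can be combined in a small helper. This is a sketch on a toy pandas Series; the helper name `nonzero_summary`, the chosen percentiles, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def nonzero_summary(values: pd.Series) -> pd.Series:
    """Percentage of non-zero values plus percentiles of the non-zero values only."""
    nonzero = values[values != 0]
    out = {"pct_nonzero": 100 * len(nonzero) / len(values)}
    for q in (0.50, 0.90, 0.99):
        label = f"nonzero_p{int(round(q * 100))}"
        # Percentiles are undefined when there are no non-zero values
        out[label] = nonzero.quantile(q) if len(nonzero) else np.nan
    return pd.Series(out)


# Toy feature: 96 zeros and 4 non-zero values (illustration only)
toy_feature = pd.Series([0] * 96 + [1, 2, 3, 10])
summary = nonzero_summary(toy_feature)
print(summary)
```

Restricting the percentiles to non-zero values keeps the dominant zero spike from hiding the shape of the tail.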

---

##### Conclusion for skewness analysis
Although the histograms appear visually simple, they accurately reflect that these features are **highly sparse, zero-dominated, and spike-driven**—a common pattern in IoT network traffic data. Understanding this structure is essential for guiding appropriate feature engineering and model selection in downstream analysis.


### 2.2 Zero vs. Non-Zero Value Analysis

In [None]:
zero_nonzero_stats = []

for col in numeric_cols:
    # Compute zero count and non-zero count lazily
    result = (
        scan
        .select([
            (pl.col(col) == 0).sum().alias("zero_count"),
            (pl.col(col) != 0).sum().alias("nonzero_count")
        ])
        .collect()
    )

    zero_count = int(result["zero_count"][0])
    nonzero_count = int(result["nonzero_count"][0])
    total = zero_count + nonzero_count

    zero_nonzero_stats.append({
        "feature": col,
        "zero_count": zero_count,
        "nonzero_count": nonzero_count,
        "pct_nonzero": nonzero_count / total * 100
    })

# Convert to DataFrame
zero_nonzero_df = pd.DataFrame(zero_nonzero_stats)

# Sort by non-zero percentage (descending)
zero_nonzero_df = zero_nonzero_df.sort_values("pct_nonzero", ascending=False)

zero_nonzero_df

In [None]:
# visualization of zero vs. non-zero features
plt.figure(figsize=(14, 6))
sns.barplot(
    data=zero_nonzero_df,
    x="feature",
    y="pct_nonzero",
    color="steelblue"
)

plt.xticks(rotation=90)
plt.ylabel("Percentage of Non-Zero Values (%)")
plt.xlabel("Feature")
plt.title("Non-Zero Frequency Across Numeric Features")
plt.tight_layout()
plt.show()

#### Zero vs Non-Zero Analysis Summary

To better understand sparsity patterns in the dataset, we evaluated the percentage of zero and non-zero values for every numeric feature. This revealed several important structural properties of the network traffic:

---

##### 1. Dense Features (≈ 100% Non-Zero)
Features such as `Rate`, `Number`, `IAT`, `Tot_size`, `AVG`, `Max`, `Min`, and `Tot_sum` show **no zero values at all**.  
These represent continuous traffic characteristics (packet counts, sizes, timing aggregates) and exhibit normal numeric behavior suitable for standard scaling and transformation techniques.

---

##### 2. Moderately Sparse Features (10%–60% Non-Zero)
Features like `TCP`, `UDP`, `Std`, `Variance`, `ack_count`, `syn_count`, `psh_flag_number`, and `ICMP` contain a mix of zero and non-zero values.  
These features likely capture protocol activity or flow behavior that occurs intermittently.  
They may exhibit long-tailed or bursty patterns that require log-scale analysis or special consideration during feature engineering.

---

##### 3. Highly Sparse Features (< 5% Non-Zero)
Protocol-specific and flag-specific counters (e.g., `SSH`, `IRC`, `Telnet`, `SMTP`, `IGMP`, `ece_flag_number`, `cwr_flag_number`) are **almost always zero**, with non-zero rates below 1%.  
This sparsity is expected in IoT network traffic: most flows do not activate these protocols or flags.  
These features behave more like **binary indicators** of rare events rather than continuous numeric variables.

---

##### Key Insight
The presence of both dense and highly sparse numeric features suggests a mixture of:

- **continuous traffic descriptors**  
- **event-driven anomaly counters**  
- **rare protocol activations**

This explains why raw and log-transformed histograms often collapsed into a single bar: the distributions are dominated by near-zero values.  
Understanding this sparsity structure is essential for selecting appropriate transformations, feature encodings, and downstream modeling strategies.

---

##### Next Steps
To complete the univariate numeric analysis, we will generate **percentile summary tables (p50–p99.9)** for each feature.  
This will help characterize tail behavior, scale differences, and outlier severity across the dataset.


### 2.3 Percentile Summary Table

In [None]:
percentiles = [0.50, 0.90, 0.95, 0.99, 0.999]

summary_rows = []

for col in numeric_cols:
    # Compute stats lazily
    result = (
        scan
        .select([
            pl.col(col).min().alias("min"),
            pl.col(col).max().alias("max"),
            *[
                pl.col(col).quantile(q, "nearest").alias(f"p{int(q*1000)/10}")
                for q in percentiles
            ]
        ])
        .collect()
    )

    row = {"feature": col}
    for key in result.columns:
        row[key] = float(result[key][0])
    
    summary_rows.append(row)

percentile_df = pd.DataFrame(summary_rows)

# Order columns nicely
ordered_cols = ["feature", "min", "p50.0", "p90.0", "p95.0", "p99.0", "p99.9", "max"]
percentile_df = percentile_df[ordered_cols]

# Sort by tail severity (p99.9 - median)
percentile_df["tail_spread"] = percentile_df["p99.9"] - percentile_df["p50.0"]
percentile_df = percentile_df.sort_values("tail_spread", ascending=False)

percentile_df

#### Percentile Summary Table: Interpretation of Numeric Feature Distributions

To better understand the distributional structure of numeric features—especially those exhibiting heavy tails or extreme sparsity—we computed detailed percentile statistics (p50 → p99.9) for every numeric column. This analysis provides critical insight into feature scale, tail behavior, and the presence of rare but extreme values commonly found in network intrusion datasets.

---

##### 1. Strong Long-Tail Behavior in Continuous Traffic Features
Features such as `Variance`, `Rate`, `Std`, `Tot_sum`, `Min`, and `Max` exhibit **very large differences** between the median (p50) and extreme percentiles (p99 and p99.9).  

For example:

- **Variance**: p50 ≈ 1.96 → p99.9 ≈ 1,253,775  
- **Rate**: p50 ≈ 12,682 → p99.9 ≈ 303,935  
- **Std**: p50 ≈ 1.40 → p99.9 ≈ 2,643  

These extremely heavy-tailed distributions are expected in high-volume IoT network flows and indicate that a small fraction of connections exhibit dramatically different behavior compared to the majority.  
Such variables may benefit from **log transformation**, **robust scaling**, or **winsorization** when used in models sensitive to outliers.
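The three options named above can be sketched with NumPy on a synthetic heavy-tailed feature. The data is randomly generated for illustration, and the 1st/99th percentile caps used for winsorization are an assumed choice, not a recommendation derived from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heavy-tailed, non-negative feature (illustration only)
x = rng.lognormal(mean=1.0, sigma=2.0, size=10_000)

# 1. Log transformation: compresses the right tail (requires x > -1)
x_log = np.log1p(x)

# 2. Robust scaling: center on the median, scale by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
x_robust = (x - median) / (q3 - q1)

# 3. Winsorization: cap values at the 1st and 99th percentiles
low, high = np.percentile(x, [1, 99])
x_wins = np.clip(x, low, high)

print(f"max raw = {x.max():.1f}")
print(f"max log1p = {x_log.max():.2f}")
print(f"max winsorized = {x_wins.max():.1f}")
```

Robust scaling leaves the extreme values in place but stops them from determining the scale, whereas winsorization changes the extreme values themselves. Which one is appropriate depends on whether the tail carries signal for the model.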

---

##### 2. Dense but Highly Variable Aggregation Features
Aggregated size and timing metrics such as `Tot_size`, `AVG`, `Time_To_Live`, and `Header_Length` show:

- high percentages of non-zero values  
- moderate to strong increases from p50 to p99.9  
- typical behavior for continuous packet-level metrics  

These features are likely **important predictors** due to their broad distribution and variability across flows.

---

##### 3. Sparse Event Counters and Protocol Indicators
Many protocol or flag-based features (e.g., `ack_count`, `syn_count`, `psh_flag_number`, `UDP`, `ICMP`) show:

- **min = p50 = 0**  
- small increases by p90/p95  
- sharp jumps at p99 or p99.9  

This pattern indicates **rare but meaningful spikes**, which may correspond to anomalous or malicious behavior.  
For example:

- `ack_count`: p50 = 0 → p99.9 = 100  
- `UDP`: p50 = 0 → p99.9 = 2  

Even small non-zero values may signal specific types of attacks or protocol misuse.

---

##### 4. Nearly Binary Features (0 Almost Everywhere)
Some features (e.g., `SSH`, `IRC`, `Telnet`, `SMTP`, `IGMP`, `cwr_flag_number`, `ece_flag_number`) remain zero through almost all percentiles, only rising at p99 or p99.9.

These essentially behave as **binary indicators**:

- 0 = no activity  
- >0 = rare protocol activation (often associated with attack traffic)

Given their extremely low frequency of non-zero values, these features may be more effective when converted to:

- `"is_nonzero"` binary flags  
- rare-event indicators  
- or categorical representations

rather than treated as continuous numeric variables.
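The `"is_nonzero"` conversion mentioned above could look like the following sketch. It uses a toy pandas frame with made-up values; the `_is_nonzero` suffix is an assumed naming convention, and the code assumes the columns contain no missing values (a missing value would compare as non-zero).

```python
import pandas as pd

# Toy frame for illustration only; column names follow the features discussed above
toy = pd.DataFrame({
    "SSH": [0.0, 0.0, 1.0, 0.0],
    "Telnet": [0.0, 0.0, 0.0, 0.0],
    "IRC": [0.0, 2.0, 0.0, 0.0],
})

sparse_cols = ["SSH", "Telnet", "IRC"]

# One int8 flag per sparse column: 1 where the original value is non-zero
flags = (toy[sparse_cols] != 0).astype("int8").add_suffix("_is_nonzero")

toy_with_flags = pd.concat([toy, flags], axis=1)
print(toy_with_flags)
```

Keeping the original columns next to the flags makes it possible to compare both representations later instead of committing to one up front.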

---

##### Key Insights
The percentile analysis confirms that the dataset contains a mix of:

- **dense continuous features** with long-tailed variability  
- **sparse but informative event counters**  
- **protocol usage indicators** where even small non-zero spikes carry semantic meaning  

This combination reflects the heterogeneous nature of IoT network flows and highlights the importance of:

- robust scaling methods  
- careful handling of sparse features  
- binary feature engineering  
- and awareness of extreme values in downstream modeling.

---

##### Next Steps
With univariate numeric analysis complete, we can now transition into **Bivariate Analysis**, examining how these features relate to the target `Label` and to each other. This will deepen our understanding of which numeric signals most strongly differentiate benign and malicious network activity.


## 3. Bivariate Analysis

The goal of our Bivariate Analysis section is two-fold:

1. Assess relationships between numeric features and the target `Label` (different attack types vs benign) to identify strong predictors.
2. Examine how numeric and categorical features relate to one another.

### 3.1 Numeric Features vs. Target Label

#### 3.1.1 Percent Non-Zero Activation by Label

Does each numeric feature activate more often in attack traffic than in benign traffic?

In [None]:
rows = []

for col in numeric_cols:
    df = (
        scan
        .group_by("Label")
        .agg([
            (pl.col(col) == 0).sum().alias("zero_count"),
            (pl.col(col) != 0).sum().alias("nonzero_count")
        ])
        .collect()
        .to_pandas()
    )

    df["total"] = df["zero_count"] + df["nonzero_count"]
    df["pct_nonzero"] = df["nonzero_count"] / df["total"] * 100
    df["feature"] = col

    rows.append(df[["Label", "feature", "pct_nonzero"]])

# Combine into one big long-form table
pct_nonzero_multiclass_df = pd.concat(rows, ignore_index=True)

pct_nonzero_multiclass_df

In [None]:
# heatmap visualization of non-zero percentages by label
heatmap_df = pct_nonzero_multiclass_df.pivot(
    index="Label",
    columns="feature",
    values="pct_nonzero"
)

plt.figure(figsize=(20, 10))
sns.heatmap(heatmap_df, cmap="viridis")
plt.title("Percent Non-Zero per Feature Across All Labels")
plt.ylabel("Label")
plt.xlabel("Feature")
plt.tight_layout()
plt.show()

#### Interpretation: Percent Non-Zero by Attack Category

The heatmap above visualizes the percentage of non-zero values for every numeric feature across all 34 traffic categories (33 attack types + benign). This provides an activation “fingerprint” for each attack type and illustrates which features are triggered more or less frequently under specific malicious behaviors.

---

##### 1. Clear Separation Between Benign and Attack Traffic
The benign class shows a distinct non-zero activation pattern compared to nearly all attack types.  
In particular:

- Many protocol counters and flag-based features activate **more frequently in attacks**.  
- Benign traffic tends to have **lower activation** across sparse features (e.g., `ack_flag_number`, `syn_flag_number`, `fin_flag_number`, `psh_flag_number`, etc.).  
- Several continuous features (e.g., `Rate`, `Tot_size`, `Variance`) show elevated activation in multiple attack types.

This validates that network attacks produce characteristic shifts in feature activation patterns.

---

##### 2. Attack-Type Fingerprints Are Strongly Visible
Different attack categories exhibit distinct activation signatures:

- **Flooding attacks** (e.g., SYN flood, UDP flood, TCP flood) activate connection state flags (`syn_count`, `ack_count`, `rst_count`, etc.) at extremely high rates.
- **Reconnaissance attacks** (e.g., portscan, pingsweep) show increased activation in features such as `Time_To_Live`, `Header_Length`, and certain protocol counters.
- **Application-layer attacks** (e.g., slowloris, http flood) activate high-level protocol features (`HTTP`, `HTTPS`, `DNS`) more consistently.
- **Backdoor and malware-related attacks** activate highly specific protocol counters that remain nearly unused in benign traffic.

This demonstrates that each attack type produces a measurable and distinct pattern of protocol or flag activity.

---

##### 3. Sparse Features Become Highly Informative
Some features (e.g., `SSH`, `IRC`, `Telnet`, `SMTP`, `IGMP`, and several flag counters) show:

- **Very low activation in benign traffic**
- **High activation in certain attack types**

These features behave like **rare-event binary indicators**, where even a small number of non-zero values strongly indicates malicious activity.

This is valuable for downstream modeling, especially for tree-based classifiers and anomaly-oriented pipelines.

---

##### 4. Dense Features Also Discriminate Across Attack Families
Continuous features such as:

- `Rate`
- `Std`
- `Variance`
- `Tot_size`

exhibit noticeable differences in activation percentages across different attack categories.  
These patterns suggest differences in throughput, packet timing, and burst behavior among various types of attacks.

---

##### Key Takeaways
This heatmap demonstrates that **both sparse and dense numeric features vary significantly across attack types**, creating unique activation profiles for each class. These patterns will be highly valuable for:

- feature engineering  
- attack classification  
- model explainability  
- identifying discriminative signals for each class  

This analysis sets a strong foundation for deeper bivariate analysis, including boxplots, violin plots, and class-conditional distribution comparisons.

---

##### Next Step
Next, we plot log-scale boxplots of numeric features across attack categories.


#### 3.1.2 Boxplots of Numeric Features by Attack Category 