# Dataset Reduction
## Overview
This notebook processes a very large intrusion-detection dataset that covers many types of network attacks, including DDoS, DoS, brute-force attempts, and SQL injection. The full dataset is over 7 GB in size and contains more than 16 million records, so training complex models on all of it is impractical without specialized hardware. The goal of this notebook is therefore to reduce the dataset in a targeted way and to focus exclusively on DoS and DDoS attacks, because these categories are both well represented and highly relevant for security research. The sections below explain, cell by cell, how this reduction is implemented.

In [2]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import os
import kagglehub


## Data Loading Strategy
Next, the full IDS dataset is loaded. If a locally stored file named data.csv already exists, it is read directly to avoid long loading times. If not, the dataset is downloaded once via kagglehub. Since the original dataset is split into many individual CSV files, each file is read, concatenated into a single large DataFrame, and then saved locally. This ensures that subsequent notebook runs do not require a re-download or slow file merging operations.

In [2]:
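# Reuse the merged file from an earlier run if it exists
if os.path.exists('data.csv'):
    full_df = pd.read_csv('data.csv')
else:
    # First run: download the dataset and list the files it contains
    path = kagglehub.dataset_download("solarmainframe/ids-intrusion-csv")
    print("Files:", os.listdir(path))
    csv_files = [f for f in os.listdir(path) if f.endswith(".csv")]
    # Read every CSV file, then merge them into a single DataFrame
    dfs = {file: pd.read_csv(os.path.join(path, file), low_memory=False) for file in csv_files}
    full_df = pd.concat(dfs.values(), ignore_index=True)
    # Cache the merged result so later runs skip the download and merge
    full_df.to_csv('data.csv', index=False)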
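
The load-or-build logic above can also be isolated in a small helper, which makes it possible to try the caching behaviour without downloading anything. The following is only a sketch: the function name load_or_build, its parameters, and the two toy CSV files are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_or_build(cache_path, source_dir):
    """Return the merged DataFrame, writing the cache file on first use.

    Hypothetical helper: mirrors the cell above, minus the download step.
    """
    if os.path.exists(cache_path):
        # Cache hit: read the merged file directly
        return pd.read_csv(cache_path, low_memory=False)
    # Cache miss: read every CSV in source_dir and merge them
    csv_files = sorted(f for f in os.listdir(source_dir) if f.endswith(".csv"))
    frames = [pd.read_csv(os.path.join(source_dir, f), low_memory=False)
              for f in csv_files]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(cache_path, index=False)
    return merged


# Toy demonstration with two tiny CSV files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.mkdir(src)
    pd.DataFrame({"Label": ["Benign", "Bot"]}).to_csv(
        os.path.join(src, "a.csv"), index=False)
    pd.DataFrame({"Label": ["Benign"]}).to_csv(
        os.path.join(src, "b.csv"), index=False)
    cache = os.path.join(tmp, "data.csv")
    first = load_or_build(cache, src)   # merges the files and writes the cache
    second = load_or_build(cache, src)  # served from the cache file
    print(len(first), len(second))
```

Keeping the cache file outside the source directory, as in the toy example, avoids merging the cache into itself if the cache is ever deleted and rebuilt.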

## Class Distribution Analysis
To understand which attack types the dataset contains, the distribution of the Label column is inspected. The result shows a strong class imbalance. The “Benign” class contains about 13.5 million entries, while some categories, such as SQL Injection (87) or XSS (230), contain only a few hundred samples or fewer. The 59 rows whose label is the literal string “Label” are not a real class; they appear to be header lines repeated inside the source CSV files. At the same time, DoS and DDoS attacks together account for roughly 1.9 million rows, which makes them suitable candidates for constructing robust reduced datasets. This observation motivates the decision to focus on these two attack types.

In [3]:
full_df['Label'].value_counts()

Label
Benign                      13484708
DDOS attack-HOIC              686012
DDoS attacks-LOIC-HTTP        576191
DoS attacks-Hulk              461912
Bot                           286191
FTP-BruteForce                193360
SSH-Bruteforce                187589
Infilteration                 161934
DoS attacks-SlowHTTPTest      139890
DoS attacks-GoldenEye          41508
DoS attacks-Slowloris          10990
DDOS attack-LOIC-UDP            1730
Brute Force -Web                 611
Brute Force -XSS                 230
SQL Injection                     87
Label                             59
Name: count, dtype: int64
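
The 59 entries counted under “Label” are most likely header lines that were read as data rows. A minimal sketch of how such rows could be dropped before computing class shares is shown below. It runs on a toy DataFrame; the column "Flow Duration" and all values are illustrative assumptions, and the original notebook does not perform this step.

```python
import pandas as pd

# Toy frame that imitates header lines repeated inside the data
toy = pd.DataFrame({
    "Label": ["Benign", "Label", "DoS attacks-Hulk", "Benign", "Label"],
    "Flow Duration": ["10", "Flow Duration", "25", "7", "Flow Duration"],
})

# Rows whose label equals the column name are treated as repeated headers
header_rows = toy["Label"] == "Label"
cleaned = toy[~header_rows].reset_index(drop=True)

# Relative class frequencies after the cleanup
shares = cleaned["Label"].value_counts(normalize=True)
print(int(header_rows.sum()), "header rows dropped")
print(shares)
```

For the reduction in this notebook the stray rows are harmless, because they match neither the benign filter nor the DoS keywords.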

## Dataset Reduction
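The reduction begins by converting all labels to lowercase, which simplifies matching and filtering. Every row whose label contains the keyword “dos” or “ddos” is extracted as attack traffic (“dos” is a substring of “ddos”, so it alone would already match both families). Benign samples are collected separately, which yields two clearly defined categories: attack traffic and normal network traffic. A binary target column, label_binar, marks attack entries as 1 and benign entries as 0. Finally, an equal number of rows is sampled from each category with a fixed random seed, the combined rows are shuffled, and the result is written to a CSV file whose name contains the target size. The output below reports two such files, with 100,000 and 1,000,000 rows.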

In [4]:
full_df['label_lower'] = full_df['Label'].str.lower()
dos_keywords = ['dos', 'ddos']

# Attack subset: rows whose lowercased label contains one of the keywords
df_attack = full_df[full_df['label_lower'].str.contains('|'.join(dos_keywords), na=False)].copy()
df_attack['label_binar'] = 1  # Attack

# Benign subset
df_benign = full_df[full_df['Label'] == 'Benign'].copy()
df_benign['label_binar'] = 0  # Normal

def create_balanced_datasets(benign_df, attack_df, prefix, total_size):
    # Half of the rows come from each class
    size_per_class = total_size // 2

    # Fall back to the largest balanced size if one class is too small
    max_possible = min(len(benign_df), len(attack_df))
    if size_per_class > max_possible:
        print(f"Not enough samples for total size {total_size}. Max possible per class: {max_possible}")
        size_per_class = max_possible

    # Fixed seed so the samples are reproducible
    benign_sample = benign_df.sample(n=size_per_class, random_state=42)
    attack_sample = attack_df.sample(n=size_per_class, random_state=42)

    # Combine both classes and shuffle the rows
    df_balanced = pd.concat([benign_sample, attack_sample])
    df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

    filename = f"{prefix}_{total_size}.csv"
    df_balanced.to_csv(filename, index=False)

    print(f"✔️ Created {filename} with {size_per_class} benign + {size_per_class} attack samples.")

# One call per target size, matching the two files reported in the output
for total_size in [100000, 1000000]:
    create_balanced_datasets(df_benign, df_attack, "dos_vs_benign", total_size)

✔️ Created dos_vs_benign_100000.csv with 50000 benign + 50000 attack samples.
✔️ Created dos_vs_benign_1000000.csv with 500000 benign + 500000 attack samples.
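
As a quick plausibility check, the keyword filter can be applied to the label names and counts from the class distribution shown earlier. The sketch below is not part of the original notebook; it only rebuilds the counts as a pandas Series to show which labels the pattern selects and how many attack rows are available for sampling.

```python
import pandas as pd

# Label counts copied from the value_counts() output earlier in the notebook
counts = pd.Series({
    "Benign": 13484708,
    "DDOS attack-HOIC": 686012,
    "DDoS attacks-LOIC-HTTP": 576191,
    "DoS attacks-Hulk": 461912,
    "Bot": 286191,
    "FTP-BruteForce": 193360,
    "SSH-Bruteforce": 187589,
    "Infilteration": 161934,
    "DoS attacks-SlowHTTPTest": 139890,
    "DoS attacks-GoldenEye": 41508,
    "DoS attacks-Slowloris": 10990,
    "DDOS attack-LOIC-UDP": 1730,
    "Brute Force -Web": 611,
    "Brute Force -XSS": 230,
    "SQL Injection": 87,
    "Label": 59,
})

dos_keywords = ["dos", "ddos"]
labels = pd.Series(counts.index)

# Same matching logic as in the reduction cell
mask = labels.str.lower().str.contains("|".join(dos_keywords), na=False)
selected = labels[mask].tolist()
attack_total = int(counts[selected].sum())

# The single keyword "dos" selects exactly the same labels
mask_dos_only = labels.str.lower().str.contains("dos", na=False)

print(selected)
print("attack rows available:", attack_total)
```

The filter picks up the seven DoS and DDoS labels and nothing else, and the attack pool is large enough to supply 500,000 rows per class without triggering the fallback in create_balanced_datasets.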
