# Dataset Preprocessing

This script preprocesses all standardized NetFlow datasets (version 2) to make them compatible with analysis by Large Language Models. Each dataset is stored locally in the efficient, streamable *arrow* format and published to Hugging Face as a private dataset.

The resulting datasets share a uniform structure with the following features:

| Feature Name | Description |
|--------------|-------------|
| *input* | A tabular NetFlow entry encoded as text: each feature is written as a `name: value` pair, and the pairs are separated by commas. For instance, a network flow originally stored as a row of a CSV table becomes: ```IPV4_SRC_ADDR: 149.171.126.0 [...] TCP_FLAGS: 25, FLOW_DURATION_MILLISECONDS: 15``` |
| *output* | Label associated with the network flow: 0 for benign, 1 for malicious |
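
The encoding described in the table can be sketched with the standard library alone. This is a minimal illustration, not the code of `dataset_gen_helper.py` (which is not shown here): the function name `encode_flow`, the sample CSV excerpt, and the label column name `Label` are assumptions made for the example.

```python
import csv
import io


def encode_flow(row: dict) -> str:
    """Join every feature of one flow as 'NAME: value', separated by ', '."""
    return ", ".join(f"{name}: {value}" for name, value in row.items())


# Hypothetical one-flow CSV excerpt; real NetFlow v2 files contain many more columns.
SAMPLE_CSV = """IPV4_SRC_ADDR,TCP_FLAGS,FLOW_DURATION_MILLISECONDS,Label
149.171.126.0,25,15,0
"""

LABEL_COLUMN = "Label"  # assumed name of the binary label column

examples = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    # The label becomes the 'output' feature and is kept out of the text.
    label = int(row.pop(LABEL_COLUMN))
    examples.append({"input": encode_flow(row), "output": label})

print(examples[0]["input"])
# IPV4_SRC_ADDR: 149.171.126.0, TCP_FLAGS: 25, FLOW_DURATION_MILLISECONDS: 15
```

Removing the label before encoding keeps the target out of the text that the model reads.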

In [6]:
from dotenv import load_dotenv
from os import getenv
import pandas as pd
from datasets import load_from_disk

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)

load_dotenv()
HUGGING_FACE_WRITE_TOKEN = getenv("HUGGING_FACE_WRITE_TOKEN")

In [7]:
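def push_dataset_to_hub(dataset, dataset_name):
    """Push the dataset to the Hugging Face Hub as a private repository."""
    repo_id = f"Jetlime/{dataset_name}"
    dataset.push_to_hub(repo_id, private=True, token=HUGGING_FACE_WRITE_TOKEN)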

## NF-UNSW-NB15-v2

Building a training and testing set with a 95-5% ratio.
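
As a rough illustration of a 95-5% split, the sketch below shuffles a list of rows with a fixed seed and cuts off the first 5% as the test set. It is an assumption about how such a split can be done, not the helper script's implementation; the function name `split_rows` and the seed are hypothetical. With the test size rounded up, 5% of the 2,390,275 rows of this dataset gives 119,514 test rows and 2,270,761 training rows, which agrees with the row counts reported when the dataset is saved.

```python
import math
import random


def split_rows(rows, test_fraction=0.05, seed=0):
    """Shuffle a copy of `rows` and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = math.ceil(len(shuffled) * test_fraction)  # round the test size up
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(1000))
print(len(train), len(test))  # 950 50

# Size check for the full dataset, without materializing any rows.
n_rows = 2_390_275
n_test = math.ceil(n_rows * 0.05)
print(n_rows - n_test, n_test)  # 2270761 119514
```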

In [17]:
DATASET_NAME = "NF-UNSW-NB15-v2"

In [24]:
!python3 -W ignore ./dataset_gen_helper.py {DATASET_NAME}

Opening ../data_raw/NF-UNSW-NB15-v2.csv
Merging all columns in parralel.
[########################################] | 100% Completed | 286.02 s
[########################################] | 100% Completed | 286.12 s
Generating train split: 2390275 examples [00:02, 1039752.50 examples/s]
Dataset({
    features: ['input', 'output', 'Attack', '__null_dask_index__'],
    num_rows: 2390275
})
Stringifying the column: 100%|█| 2390275/2390275 [00:02<00:00, 802575.98 example
Casting to class labels: 100%|█| 2390275/2390275 [00:03<00:00, 637079.11 example
Stringifying the column: 100%|█| 2390275/2390275 [00:03<00:00, 735588.16 example
Casting to class labels: 100%|█| 2390275/2390275 [00:04<00:00, 568828.13 example
Saving the dataset (5/5 shards): 100%|█| 2270761/2270761 [00:08<00:00, 278014.94
Saving the dataset (1/1 shards): 100%|█| 119514/119514 [00:00<00:00, 284103.75 e

In [25]:
dataset = load_from_disk(f"./{DATASET_NAME}", keep_in_memory=True)
push_dataset_to_hub(dataset, DATASET_NAME)

Uploading the dataset shards:   0%|          | 0/5 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/455 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/120 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/8.46k [00:00<?, ?B/s]

## NF-CSE-CIC-IDS2018-v2

Building a training and testing set with a 95-5% ratio.
Building solely a testing set made of 5% of the dataset.

In [26]:
DATASET_NAME = "NF-CSE-CIC-IDS2018-v2"

In [27]:
!python3 -W ignore ./dataset_gen_helper.py {DATASET_NAME}

Opening ../data_raw/NF-CSE-CIC-IDS2018-v2.csv
Merging all columns in parralel.
[############                            ] | 31% Completed | 16m 0s

In [10]:
dataset = load_from_disk(f"./{DATASET_NAME}", keep_in_memory=True)
push_dataset_to_hub(dataset, DATASET_NAME)

FileNotFoundError: Directory ./NF-CSE-CIC-IDS2018-v2 not found

## NF-UQ-NIDS-v2


Building solely a testing set made of 5% of the dataset.
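
One way to draw a test-only subset of roughly 5% from a file too large to hold in memory is to keep each row independently with probability 0.05 while streaming through it. The sketch below shows that idea with the standard library; it is an assumption for illustration and not the Dask-based split that the helper script's log reports. The function name `sample_rows` and the seed are hypothetical.

```python
import random


def sample_rows(rows, fraction=0.05, seed=0):
    """Yield each row independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    for row in rows:
        if rng.random() < fraction:
            yield row


# Stand-in for a stream of CSV rows; only the kept rows are stored.
test_rows = list(sample_rows(range(100_000)))
print(len(test_rows))  # close to 5,000, but not exactly 5% of the input
```

Because each row is kept or dropped at random, the subset size is only approximately 5% of the input.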

In [None]:
DATASET_NAME = "NF-UQ-NIDS-v2"

In [None]:
!python3 -W ignore ./dataset_gen_helper.py {DATASET_NAME}

Opening ./data_raw/NF-UQ-NIDS-v2.csv
(<dask_expr.expr.Scalar: expr=ReadCSV(d4d2cec).size() // 46, dtype=int64>, 46)
(<dask_expr.expr.Scalar: expr=(SplitTake(frame=Split(frame=ReadCSV(d4d2cec), frac=[0.050000000000000044, 0.95], random_state=1608637542, shuffle=False), i=0, ndim=2)).size() // 46, dtype=int64>, 46)
Merging all columns in parralel.
[########################################] | 100% Completed | 452.50 s
Stringifying the column: 100%|█| 3798653/3798653 [00:04<00:00, 865015.58 example
Casting to class labels: 100%|█| 3798653/3798653 [00:05<00:00, 655034.99 example
Casting to class labels: 100%|█| 3798653/3798653 [00:06<00:00, 566633.90 example
Saving the dataset (8/8 shards): 100%|█| 3798653/3798653 [00:02<00:00, 1496135.8

In [None]:
dataset = load_from_disk(f"./{DATASET_NAME}", keep_in_memory=True)
push_dataset_to_hub(dataset, DATASET_NAME)

Uploading the dataset shards:   0%|          | 0/8 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/475 [00:00<?, ?ba/s]
