# Dataset Preprocessing

This script preprocesses all standardized NetFlow datasets (version 2) to make them compatible with analysis by Large Language Models. Each dataset is stored locally in the streaming-friendly *Arrow* format and published to Hugging Face as a private dataset.

The resulting datasets share a uniform structure with the following features:

| Feature Name | Description |
|--------------|-------------|
| *input* | A tabular NetFlow entry encoded as text: each feature is written as a `name: value` pair, and the pairs are separated by commas. For instance, a network flow originally stored as a row of a CSV table becomes: ```IPV4_SRC_ADDR: 149.171.126.0 [...] TCP_FLAGS: 25, FLOW_DURATION_MILLISECONDS: 15``` |
| *output* | Label associated with the network flow: 0 for benign, 1 for malicious |
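
As a rough illustration of the *input* encoding described above, the sketch below turns one flow record into comma-separated `name: value` text. The function name `flow_to_text` and the sample record are hypothetical; the actual conversion happens in `dataset_gen_helper.py`, which is not shown here and may differ in detail.

```python
def flow_to_text(record: dict) -> str:
    """Encode one flow record as 'NAME: value' pairs joined by ', '.

    Hypothetical helper for illustration only; not the code used by
    dataset_gen_helper.py.
    """
    return ", ".join(f"{name}: {value}" for name, value in record.items())


# Sample record built from the field values quoted in the table above.
sample_flow = {
    "IPV4_SRC_ADDR": "149.171.126.0",
    "TCP_FLAGS": 25,
    "FLOW_DURATION_MILLISECONDS": 15,
}

print(flow_to_text(sample_flow))
# IPV4_SRC_ADDR: 149.171.126.0, TCP_FLAGS: 25, FLOW_DURATION_MILLISECONDS: 15
```

Python dictionaries preserve insertion order, so the features appear in the text in the same order as the columns of the source record.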

In [28]:
from dotenv import load_dotenv
from os import getenv, remove
import pandas as pd
from datasets import load_from_disk

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)

load_dotenv()
HUGGING_FACE_WRITE_TOKEN = getenv("HUGGING_FACE_WRITE_TOKEN")

In [6]:
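def push_dataset_to_hub(dataset, dataset_name):
    """Upload `dataset` to the Hugging Face Hub under the Jetlime namespace as a private dataset."""
    # private=True only takes effect when the repository is first created.
    dataset.push_to_hub(f"Jetlime/{dataset_name}", token=HUGGING_FACE_WRITE_TOKEN, private=True)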

## NF-UNSW-NB15-v2

In [13]:
DATASET_NAME = "NF-UNSW-NB15-v2"

In [25]:
!python3 -W ignore ./dataset_gen_helper.py {DATASET_NAME}

Dask Apply: 100%|███████████████████████████████| 33/33 [01:47<00:00,  3.25s/it]
Stringifying the column: 100%|█| 2390275/2390275 [00:03<00:00, 789112.42 example
Casting to class labels: 100%|█| 2390275/2390275 [00:04<00:00, 594792.72 example
Saving the dataset (4/4 shards): 100%|█| 1673192/1673192 [00:04<00:00, 336940.47
Saving the dataset (2/2 shards): 100%|█| 717083/717083 [00:02<00:00, 349803.22 e


In [29]:
dataset = load_from_disk(f"./{DATASET_NAME}", keep_in_memory=True)
push_dataset_to_hub(dataset, DATASET_NAME)

Creating parquet from Arrow format: 100%|██████████| 419/419 [00:00<00:00, 947.16ba/s]
Creating parquet from Arrow format: 100%|██████████| 419/419 [00:00<00:00, 1045.00ba/s]
Creating parquet from Arrow format: 100%|██████████| 419/419 [00:00<00:00, 1068.09ba/s]
Creating parquet from Arrow format: 100%|██████████| 419/419 [00:00<00:00, 1081.11ba/s]
Uploading the dataset shards: 100%|██████████| 4/4 [00:03<00:00,  1.04it/s]
Creating parquet from Arrow format: 100%|██████████| 359/359 [00:00<00:00, 1043.45ba/s]
Creating parquet from Arrow format: 100%|██████████| 359/359 [00:00<00:00, 1041.47ba/s]
Uploading the dataset shards: 100%|██████████| 2/2 [00:01<00:00,  1.16it/s]


## NF-CSE-CIC-IDS2018-v2

In [32]:
DATASET_NAME = "NF-CSE-CIC-IDS2018-v2"

In [48]:
!python3 -W ignore ./dataset_gen_helper.py {DATASET_NAME}

In [33]:
dataset = load_from_disk(f"./{DATASET_NAME}", keep_in_memory=True)
push_dataset_to_hub(dataset, DATASET_NAME)

FileNotFoundError: Directory ./NF-CSE-CIC-IDS2018-v2 not found
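
The error above indicates that the helper produced no output directory for this dataset before the load was attempted. A minimal sketch of a guard that could be run before `load_from_disk` follows; the function name `dataset_dir_ready` and its `root` parameter are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path


def dataset_dir_ready(dataset_name: str, root: str = ".") -> bool:
    """Return True if the dataset directory exists and contains at least one entry.

    Hypothetical check: it only inspects the filesystem and does not
    validate that the contents are a well-formed Arrow dataset.
    """
    path = Path(root) / dataset_name
    return path.is_dir() and any(path.iterdir())
```

In the notebook this could be used as `if dataset_dir_ready(DATASET_NAME): ...` around the load and push calls, so that a failed or skipped helper run is reported before any upload is attempted.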

## NF-UQ-NIDS-v2 - Ignore for now

In [None]:
DATASET_NAME = "NF-UQ-NIDS-v2"

In [37]:
!python3 -W ignore ./dataset_gen_helper.py {DATASET_NAME}

Opening ./data_raw/NF-CSE-CIC-IDS2018-v2.csv
Merging all columns in parralel.
