# ETL of the CIC IoT 2023 Dataset for Cybersecurity Research

[University of New Brunswick - Canadian Institute for Cybersecurity](https://www.unb.ca/cic/datasets/index.html)

# Imports

In [1]:
import os
import operator as op
from functools import reduce

import pandas as pd
import yaml

## Definitions

In [2]:
with open(os.path.join(os.getcwd(), '..', 'constants.yaml')) as f:
    definitions = yaml.safe_load(f)

path = definitions['path']
csv_path = os.path.join(path, 'unb_cic_csv')
parquet_path = os.path.join(path, definitions['parquet_name'])

attack_category = definitions['attack_category']
protocol_layer = definitions['protocol_layer']
features = definitions['features']
features['protocol'] = reduce(op.concat, protocol_layer.values())

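# Invert {category: [labels]} into {label: category}.
# Distinct loop names avoid shadowing the outer `attack_category` dict.
attack_category_map = {
    label: category
    for category, labels in attack_category.items()
    for label in labels
}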
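
To make the inversion above concrete, here is a minimal sketch using a hypothetical stand-in for the `attack_category` section of `constants.yaml`. The real file is not shown in this notebook, so the category and label names below are illustrative assumptions only.

```python
# Hypothetical stand-in for definitions['attack_category'];
# the real category and label names live in constants.yaml.
example_attack_category = {
    'DDoS': ['DDoS-SYN_Flood', 'DDoS-UDP_Flood'],
    'Recon': ['Recon-PortScan'],
    'Benign': ['BenignTraffic'],
}

# Invert {category: [labels]} into {label: category},
# the shape expected by pandas' Series.map.
example_label_to_category = {
    label: category
    for category, labels in example_attack_category.items()
    for label in labels
}

print(example_label_to_category['Recon-PortScan'])  # Recon
```

A label that is absent from the mapping comes out of `Series.map` as `NaN`, so checking `df['general_label'].isna().any()` after the load is a cheap sanity test.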

# Ingestion

## Download dataset

In [3]:
# !wget -P {path} http://205.174.165.80/IOTDataset/CIC_IOT_Dataset2023/Dataset/CSV/CICIoT2023.zip
!du -sh {os.path.join(path, 'CICIoT2023.zip')}

2,7G	/var/fasttmp/bruno_dsn/CICIoT2023.zip


In [4]:
# !unzip {os.path.join(path, 'CICIoT2023.zip')} -d {csv_path}
!du -sh {csv_path}

13G	/var/fasttmp/bruno_dsn/unb_cic_csv


## Load and Transform

In [5]:
columns_dtype = {
    'category': [
        'label'
    ],
    'bool': [
        *features['protocol'],
        *features['flag']
    ],
    'float32': [
        *features['flag_counts'],
        *features['flow'],
        *features['packet'],
    ],
}

column_dtype_map = {
    col: dtype
    for dtype, column_list in columns_dtype.items()
    for col in column_list
}

Most of the floating-point features are simple averages computed over pre-defined windows inside a flow, so double precision adds little here. We therefore use `float32` for all floating-point values, which halves the memory those columns need compared with the default `float64`.
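
As a rough illustration of what is given up, the sketch below rounds a value to `float32` using only the standard library and measures the relative error. The sample value is a made-up window average, not a number taken from the dataset. A `float32` has a 24-bit significand, which is about 7 significant decimal digits.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]


# Hypothetical window average, e.g. a mean packet size in bytes.
value = 554.123456789
rounded = to_float32(value)
relative_error = abs(rounded - value) / value

print(f'float64: {value!r}')
print(f'float32: {rounded!r}')
print(f'relative error: {relative_error:.2e}')

# Storage per value in bytes: float32 versus float64.
print(struct.calcsize('f'), struct.calcsize('d'))
```

For averaged features the rounding error is far below the noise in the measurements themselves, while the per-value storage drops from 8 bytes to 4.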

In [6]:
%%time
csv_files = (
    filename
    for filename in os.listdir(csv_path)
    if filename.endswith('.csv')
)

df = pd.concat(
    pd.read_csv(
        os.path.join(csv_path, csv_file),
        index_col=None,
        header=0,
        dtype=column_dtype_map,
    )[list(column_dtype_map)]  # keep the columns in the order defined above
    for csv_file in csv_files
)

df['general_label'] = df['label'].map(attack_category_map).astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 46686579 entries, 0 to 451497
Data columns (total 48 columns):
 #   Column           Dtype   
---  ------           -----   
 0   label            category
 1   HTTP             bool    
 2   HTTPS            bool    
 3   DNS              bool    
 4   Telnet           bool    
 5   SMTP             bool    
 6   SSH              bool    
 7   IRC              bool    
 8   DHCP             bool    
 9   TCP              bool    
 10  UDP              bool    
 11  ICMP             bool    
 12  IPv              bool    
 13  ARP              bool    
 14  LLC              bool    
 15  fin_flag_number  bool    
 16  syn_flag_number  bool    
 17  rst_flag_number  bool    
 18  psh_flag_number  bool    
 19  ack_flag_number  bool    
 20  ece_flag_number  bool    
 21  cwr_flag_number  bool    
 22  ack_count        float32 
 23  syn_count        float32 
 24  fin_count        float32 
 25  urg_count        float32 
 26  rst_count        float32 
 ...

## Save

In [7]:
%%time
sort_columns = ['general_label', 'label', 'Protocol Type', 'Tot size', 'Header_Length']

df.sort_values(sort_columns).to_parquet(parquet_path, index=False)

!du -sh {parquet_path}

960M	/var/fasttmp/bruno_dsn/unb_cic_ds_parquet
CPU times: user 1min 54s, sys: 3.87 s, total: 1min 58s
Wall time: 1min 44s


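Here we convert the CSVs to Parquet to make the data easier to work with: the 13G of CSV files shrink to 960M, and the column dtypes are stored alongside the data.

The schema is defined up front at load time, and the rows are sorted before saving so that similar values sit next to each other, which helps Parquet's encodings and compression reduce disk usage.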
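
One reason sorting helps with disk usage is that columnar formats can store a long run of identical values very compactly. The toy sketch below is not Parquet's actual encoder; it only counts runs of equal consecutive values in a hypothetical low-cardinality column (the label names are made up) before and after sorting.

```python
import random
from itertools import groupby


def count_runs(values):
    """Number of maximal runs of consecutive equal values."""
    return sum(1 for _ in groupby(values))


random.seed(0)
# Hypothetical low-cardinality column, similar in spirit to 'general_label'.
labels = random.choices(['Benign', 'DDoS', 'DoS', 'Recon'], k=10_000)

print('runs unsorted:', count_runs(labels))
print('runs sorted:  ', count_runs(sorted(labels)))
```

After sorting, each distinct value forms a single run, so a run-length style encoding needs one entry per value instead of thousands. Putting the lowest-cardinality columns first in `sort_columns` follows the same idea.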

In [8]:
# del df

In [9]:
# !rm -r {csv_path}