# Feature Engineering - CIC IoT 2023 Dataset for Cybersecurity Research

[University of New Brunswick - Canadian Institute for Cybersecurity](https://www.unb.ca/cic/datasets/index.html)

This notebook identifies feature engineering steps we can apply to the dataset before training our models.

# Imports

In [10]:
import os
import sys

import numpy as np
import pandas as pd

from IPython.display import HTML, display

sys.path.append('../')

In [2]:
from sklearn.feature_selection import VarianceThreshold

## Definitions

In [54]:
import utils
import model_utils
import visualization as viz

constants = utils.get_constants()
seed = constants['seed']

parquet_path = constants['parquet_path']
refined_parquet_path = constants['refined_parquet_path']

features = constants['features']
target_columns = constants['target_columns']
protocol_layer = constants['protocol_layer']
protocol_layer_map = constants['protocol_layer_map']
attack_category = constants['attack_category']
attack_category_map = constants['attack_category_map']

# Feature Engineering

In [4]:
df = pd.read_parquet(parquet_path)

features_list = utils.get_features_list(df)

original_columns_set = set(features_list)
drop_columns_set = set()

In [5]:
balanced_weights = model_utils.get_balanced_weights(df['general_label'])

df_sample = df.sample(1_000_000, weights=df.general_label.map(balanced_weights), random_state=seed)
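The implementation of `model_utils.get_balanced_weights` is not shown in this notebook. As a minimal sketch, assuming it returns one weight per class that is inversely proportional to the class frequency, class-balanced sampling could look like the following. The `balanced_weights` function and the toy data are hypothetical stand-ins, not the project's code.

```python
import pandas as pd

def balanced_weights(labels: pd.Series) -> dict:
    """Hypothetical stand-in: weight each class by the inverse of its frequency."""
    counts = labels.value_counts()
    # Each class ends up with the same total weight: count * 1 / (n_classes * count)
    return (1.0 / (len(counts) * counts)).to_dict()

# Toy, heavily imbalanced data (assumption, for illustration only)
toy = pd.DataFrame({
    "general_label": ["DDoS"] * 900 + ["Benign"] * 90 + ["Recon"] * 10,
    "value": range(1000),
})

weights = balanced_weights(toy["general_label"])

# Total sampling weight carried by each class
class_mass = toy["general_label"].map(weights).groupby(toy["general_label"]).sum()
print(class_mass)

# Same call pattern as the cell above: per-row weights derived from the label
sample = toy.sample(30, weights=toy["general_label"].map(weights), random_state=0)
print(sample["general_label"].value_counts())
```

Because the sample is drawn without replacement, the class proportions are only approximately balanced, and they drift further from balance when a rare class is close to being exhausted.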

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46686579 entries, 0 to 46686578
Data columns (total 48 columns):
 #   Column           Dtype   
---  ------           -----   
 0   flow_duration    float32 
 1   Header_Length    float32 
 2   Protocol Type    float32 
 3   Duration         float32 
 4   Rate             float32 
 5   Srate            float32 
 6   Drate            float32 
 7   fin_flag_number  bool    
 8   syn_flag_number  bool    
 9   rst_flag_number  bool    
 10  psh_flag_number  bool    
 11  ack_flag_number  bool    
 12  ece_flag_number  bool    
 13  cwr_flag_number  bool    
 14  ack_count        float32 
 15  syn_count        float32 
 16  fin_count        float32 
 17  urg_count        float32 
 18  rst_count        float32 
 19  HTTP             bool    
 20  HTTPS            bool    
 21  DNS              bool    
 22  Telnet           bool    
 23  SMTP             bool    
 24  SSH              bool    
 25  IRC              bool    
 26  TCP         

## Low Variance Features

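Here we try to identify features with very low variance and drop them, simplifying the dataset by lowering its dimensionality. For a boolean feature, the variance is p(1 - p), where p is the fraction of records in which the feature is set, so low variance means the feature is present in almost all records or in almost none of them. With a threshold of 0.001, that corresponds to p below roughly 0.1% or above roughly 99.9%.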

In [7]:
%%time
sel = VarianceThreshold(threshold=0.001)
sel.fit(df_sample[features_list])

drop_features_variance = {
    feature
    for feature, is_relevant in zip(features_list, sel.get_support())
    if not is_relevant
}

drop_columns_set |= drop_features_variance

drop_features_variance

CPU times: user 645 ms, sys: 345 ms, total: 990 ms
Wall time: 989 ms


{'ARP',
 'DHCP',
 'Drate',
 'IRC',
 'SMTP',
 'Telnet',
 'cwr_flag_number',
 'ece_flag_number'}

As we saw in the EDA, a few protocols appear in only a tiny fraction of the 46.7M records we have, so it is natural that even a simple feature selection algorithm flags them here.
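To make the threshold concrete, here is a small sketch on synthetic data (the column names and prevalences are assumptions chosen for illustration, not values from the dataset). It compares the variance of a rare boolean flag with p(1 - p) and shows which columns `VarianceThreshold` would discard.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 100_000

# Synthetic columns (assumption): one rare flag, one common flag, one constant
toy = pd.DataFrame({
    "rare_flag": rng.random(n) < 0.0005,   # set in roughly 0.05% of rows
    "common_flag": rng.random(n) < 0.30,   # set in roughly 30% of rows
    "constant": np.zeros(n, dtype=np.float32),
})

# For a 0/1 column with prevalence p, the population variance is p * (1 - p)
p = toy["rare_flag"].mean()
analytic_var = p * (1 - p)
empirical_var = toy["rare_flag"].astype(float).var(ddof=0)
print(analytic_var, empirical_var)

# Same selector and threshold as in the cell above
sel = VarianceThreshold(threshold=0.001)
sel.fit(toy.astype(float))

# get_support() is True for kept columns, so invert it to list the dropped ones
dropped = set(toy.columns[~sel.get_support()])
print(dropped)
```

In this sketch the rare flag and the constant column should fall below the threshold, while the common flag (variance around 0.21) is kept.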

## Correlated Features

In [49]:
%%time
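corr = df_sample.corr(method='spearman', numeric_only=True)
# Zero the diagonal so a feature is never reported as correlated with itself
corr = corr.mask(np.eye(len(corr), dtype=bool), 0)

# corr is symmetric, so iterating over its columns or its rows gives the same result
correlated_features = corr.apply(lambda column: [
    col_name
    for col_name, value
    in column.items()
    if value >= .999
])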

correlated_features[correlated_features.map(len) >= 1].to_frame(name='Correlated Features')

CPU times: user 1min 8s, sys: 399 ms, total: 1min 8s
Wall time: 1min 8s


| | Correlated Features |
|---|---|
| Rate | [Srate] |
| Srate | [Rate] |
| IPv | [LLC] |
| LLC | [IPv] |
| AVG | [Magnitue] |
| Number | [Weight] |
| Magnitue | [AVG] |
| Weight | [Number] |
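The table lists every pair twice, once from each side. As an alternative sketch (not the approach used in this notebook), each pair can be listed once by looking only at the cells strictly above the diagonal of the correlation matrix. The helper name `correlated_pairs` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr: pd.DataFrame, threshold: float = 0.999) -> list:
    """Return (feature_a, feature_b, value) once per pair at or above the threshold."""
    # Boolean mask that is True only strictly above the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Cells outside the mask become NaN and therefore fail the comparison below
    stacked = corr.where(upper).stack()
    return [(a, b, float(v)) for (a, b), v in stacked[stacked >= threshold].items()]

# Toy data (assumption): x_squared is a monotone function of x, noise is unrelated
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["x_squared"] = toy["x"] ** 2
toy["noise"] = [3.0, 1.0, 4.0, 1.5, 2.0]

pairs = correlated_pairs(toy.corr(method="spearman"))
print(pairs)
```

Only the `x` / `x_squared` pair should be reported, and only once.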


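A Spearman correlation this close to 1 shows that two features rank the records in the same order, not that they hold the same values. We can double-check on the whole DataFrame whether `Rate` and `Srate`, and `IPv` and `LLC`, are equal in every record: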

In [76]:
def show_different_records(df, feature1, feature2):
    return (
        f"Records where <code>{feature1}</code> is different from <code>{feature2}</code>: "
        f"<strong>{viz.eng_formatter_full((df[feature1] != df[feature2]).sum(), len(df))}</strong>."
        "<br>"
    )

display(HTML(f"""
<p>
    {show_different_records(df, 'Rate', 'Srate')}
    {show_different_records(df, 'IPv', 'LLC')}
</p>
"""))

In [64]:
drop_correlated_features = {
    'Srate',
    'LLC',
}

drop_columns_set |= drop_correlated_features

drop_correlated_features

{'LLC', 'Srate'}
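The equality check matters because a rank correlation of 1 does not guarantee identical values. The sketch below, on toy data with hypothetical column names, counts differing rows for an exact copy and for a rescaled copy of the same column. Unlike the plain `!=` comparison used above, this variant treats rows where both values are missing as equal, which is a design choice of the sketch.

```python
import numpy as np
import pandas as pd

def count_different(df: pd.DataFrame, a: str, b: str) -> int:
    """Count rows where columns a and b differ; rows with NaN in both count as equal."""
    both_nan = df[a].isna() & df[b].isna()
    return int(((df[a] != df[b]) & ~both_nan).sum())

# Toy data (assumption, for illustration only)
toy = pd.DataFrame({"Rate": [0.5, 1.0, 2.0, 4.0, np.nan]})
toy["exact_copy"] = toy["Rate"]      # identical values
toy["scaled"] = toy["Rate"] * 10     # same ranking, different values

diff_copy = count_different(toy, "Rate", "exact_copy")
diff_scaled = count_different(toy, "Rate", "scaled")
print(diff_copy, diff_scaled)
```

Both derived columns have a Spearman correlation of 1 with `Rate`, but only the exact copy could be dropped without losing information about the values themselves.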

# Save

In [65]:
added_features = set(df.columns) - original_columns_set

HTML(f"""
<p>
    In this process, we have <strong>dropped {len(drop_columns_set)} features</strong>:
</p>
<ul>
    {''.join(f"<li>{feature_name}</li>" for feature_name in sorted(drop_columns_set))}
</ul>
""")

In [66]:
refined_df = df.drop(columns=drop_columns_set)

refined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46686579 entries, 0 to 46686578
Data columns (total 38 columns):
 #   Column           Dtype   
---  ------           -----   
 0   flow_duration    float32 
 1   Header_Length    float32 
 2   Protocol Type    float32 
 3   Duration         float32 
 4   Rate             float32 
 5   fin_flag_number  bool    
 6   syn_flag_number  bool    
 7   rst_flag_number  bool    
 8   psh_flag_number  bool    
 9   ack_flag_number  bool    
 10  ack_count        float32 
 11  syn_count        float32 
 12  fin_count        float32 
 13  urg_count        float32 
 14  rst_count        float32 
 15  HTTP             bool    
 16  HTTPS            bool    
 17  DNS              bool    
 18  SSH              bool    
 19  TCP              bool    
 20  UDP              bool    
 21  ICMP             bool    
 22  IPv              bool    
 23  Tot sum          float32 
 24  Min              float32 
 25  Max              float32 
 26  AVG         

In [68]:
HTML(f"""
<p>
    We finish the Feature Engineering process with 
    <strong>{len(utils.get_features_list(refined_df))} features</strong>.
</p>
""")

In [73]:
%%time

refined_df.to_parquet(refined_parquet_path)

!du -sh {refined_parquet_path}

773M	/var/fasttmp/dsn/unb_cic_ds_refined.parquet
CPU times: user 1min 5s, sys: 5.86 s, total: 1min 11s
Wall time: 1min 2s
