This notebook identifies feature engineering steps we can apply to the dataset before training our models.

# Imports

In [1]:
import os
import sys

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

from IPython.display import HTML

sys.path.append('../')

## Definitions

In [2]:
import utils
import model_utils
import visualization as viz

constants = utils.get_constants()
seed = constants['seed']

parquet_path = constants['parquet_path']
train_parquet_path = constants['train_parquet_path']
test_parquet_path = constants['test_parquet_path']

features = constants['features']
target_columns = constants['target_columns']
protocol_layer = constants['protocol_layer']
protocol_layer_map = constants['protocol_layer_map']
attack_category = constants['attack_category']
attack_category_map = constants['attack_category_map']

# Feature Engineering

In [3]:
df = pd.read_parquet(parquet_path)

features_list = utils.get_features_list(df)

original_columns_set = set(features_list) | set(target_columns)
drop_columns_set = set()
add_columns_set = set()

In [4]:
balanced_weights = model_utils.get_balanced_weights(df['general_label'])

df_sample = df.sample(1_000_000, weights=df.general_label.map(balanced_weights), random_state=seed)
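
The helper `model_utils.get_balanced_weights` is not shown in this notebook. As a rough sketch of what balanced sampling weights could look like, assuming plain inverse class frequency weighting (the function `balanced_weights_sketch` and the toy labels are hypothetical, not the project's code), each class below ends up with the same total sampling mass regardless of its size:

```python
import pandas as pd


def balanced_weights_sketch(labels: pd.Series) -> dict:
    """Hypothetical stand-in: per-record weight = 1 / (n_classes * class_count)."""
    counts = labels.value_counts()
    return (1.0 / (len(counts) * counts)).to_dict()


# Toy, heavily imbalanced labels (illustrative only).
toy_labels = pd.Series(['Attack'] * 90 + ['Benign'] * 10)
toy_weights = balanced_weights_sketch(toy_labels)

# Summing the per-record weights by class shows that each class gets equal mass.
class_mass = toy_labels.map(toy_weights).groupby(toy_labels).sum()
print(class_mass)
```

Passing such per-record weights to `DataFrame.sample(weights=...)` makes records from rare classes proportionally more likely to be drawn.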

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46686579 entries, 0 to 46686578
Data columns (total 48 columns):
 #   Column           Dtype   
---  ------           -----   
 0   flow_duration    float32 
 1   Header_Length    float32 
 2   Protocol Type    float32 
 3   Duration         float32 
 4   Rate             float32 
 5   Srate            float32 
 6   Drate            float32 
 7   fin_flag_number  bool    
 8   syn_flag_number  bool    
 9   rst_flag_number  bool    
 10  psh_flag_number  bool    
 11  ack_flag_number  bool    
 12  ece_flag_number  bool    
 13  cwr_flag_number  bool    
 14  ack_count        float32 
 15  syn_count        float32 
 16  fin_count        float32 
 17  urg_count        float32 
 18  rst_count        float32 
 19  HTTP             bool    
 20  HTTPS            bool    
 21  DNS              bool    
 22  Telnet           bool    
 23  SMTP             bool    
 24  SSH              bool    
 25  IRC              bool    
 26  TCP         

## Low Variance Features

Here we try to identify features with very low variance so we can drop them and reduce the dimensionality of the dataset. For the boolean features, very low variance means the flag is set in almost all records or in almost none of them.

In [6]:
%%time
sel = VarianceThreshold(threshold=0.001)
sel.fit(df_sample[features_list])

drop_features_variance = {
    feature
    for feature, is_relevant in zip(features_list, sel.get_support())
    if not is_relevant
}

drop_columns_set |= drop_features_variance

drop_features_variance

CPU times: user 758 ms, sys: 546 ms, total: 1.3 s
Wall time: 1.3 s


{'ARP',
 'DHCP',
 'Drate',
 'IRC',
 'SMTP',
 'Telnet',
 'cwr_flag_number',
 'ece_flag_number'}

As we saw in the EDA, a few protocols appear in only a tiny fraction of our 46.7M records, so it is natural that even a simple feature selection algorithm flags them here.
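
To make the threshold concrete: a boolean feature that is true in a fraction `p` of the records has variance `p * (1 - p)`, so a threshold of `0.001` roughly corresponds to flags set in about 0.1% of the records or fewer (or unset in about 0.1% or fewer). The sketch below reproduces the rule with pandas only, on hypothetical toy columns, assuming the selector compares the population variance against the threshold:

```python
import numpy as np
import pandas as pd

n = 10_000
threshold = 0.001


def flag_column(n_true: int) -> np.ndarray:
    """Build a 0/1 column of length n with exactly n_true ones."""
    col = np.zeros(n)
    col[:n_true] = 1.0
    return col


# Hypothetical flags with different prevalence.
toy = pd.DataFrame({
    'rare_flag': flag_column(5),          # p = 0.0005
    'borderline_flag': flag_column(20),   # p = 0.002
    'common_flag': flag_column(5_000),    # p = 0.5
})

p = toy.mean()
variances = toy.var(ddof=0)  # population variance, equal to p * (1 - p) for 0/1 data

# Features at or below the threshold would be candidates for dropping.
low_variance = set(variances[variances <= threshold].index)
print(low_variance)
```

Only `rare_flag` falls below the threshold here; `borderline_flag`, present in 0.2% of the rows, is kept.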

## Correlated Features

In [7]:
%%time
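corr = df_sample.corr(method='spearman', numeric_only=True)

# Keep only the strict upper triangle so that each pair is reported once.
# `where` avoids writing into `corr.values`, which is read-only under pandas Copy-on-Write.
superior_triangle_mask = np.arange(len(corr)) > np.arange(len(corr)).reshape(-1, 1)
corr = corr.where(superior_triangle_mask, 0)

# `apply` iterates over columns by default, so each feature is compared
# with the features that precede it in the column order.
correlated_features = corr.apply(lambda column: [
    row_name
    for row_name, value
    in column.items()
    if abs(value) >= .999
])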

correlated_features[correlated_features.map(len) >= 1].to_frame(name='Correlated Features')

CPU times: user 41.2 s, sys: 331 ms, total: 41.6 s
Wall time: 41.6 s


| | Correlated Features |
|---|---|
| Srate | [Rate] |
| LLC | [IPv] |
| Magnitue | [AVG] |
| Weight | [Number] |


In [8]:
drop_correlated_features = set(correlated_features[correlated_features.map(len) >= 1].index)

drop_columns_set |= drop_correlated_features

drop_correlated_features

{'LLC', 'Magnitue', 'Srate', 'Weight'}

We can double-check, on the whole DataFrame, whether the two features in each pair always hold the same value:

In [9]:
def show_different_records(df, feature1, feature2):
    return (
        f"Records where <code>{feature1}</code> is different from <code>{feature2}</code>: "
        f"<strong>{viz.eng_formatter_full((df[feature1] != df[feature2]).sum(), len(df))}</strong>."
        "<br>"
    )

display(HTML(f"""
<p>
{
    ''.join(
        show_different_records(df, feature, correlated_features[feature][0])
        for feature in drop_correlated_features
    )
}
</p>
"""))

We see that although `Magnitue` and `AVG`, and `Weight` and `Number`, differ in value, each pair has a rank correlation of at least 0.999 in absolute value on the sample, which means that for a tree-based model they carry practically the same information. Since we plan to improve metrics for tree-based algorithms, we chose to drop `LLC`, `Magnitue`, `Srate` and `Weight` and keep their counterparts.
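
The reasoning about rank correlation can be illustrated on toy data. In this minimal sketch (the series `x` and `y` are hypothetical and unrelated to the dataset), `y` is a strictly increasing transform of `x`: the values differ everywhere, yet the ranks are identical, so any threshold split on one feature can be reproduced by a split on the other.

```python
import numpy as np
import pandas as pd

x = pd.Series([0.5, 1.0, 2.0, 3.5, 7.0])
y = np.exp(x)  # strictly increasing transform: different values, same ordering

# Spearman correlation is the Pearson correlation computed on the ranks.
spearman = x.rank().corr(y.rank())

# A tree split "x <= t" selects exactly the same records as "y <= exp(t)".
t = 2.5
same_partition = bool(((x <= t) == (y <= np.exp(t))).all())

print(f"values differ everywhere: {bool((x != y).all())}")
print(f"spearman: {spearman:.3f}, same partition: {same_partition}")
```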

# Interpretable Features

## Dropping Protocol Type

In the original dataset, the `Protocol Type` feature summarizes the protocols in a flow as a single number. This raises two problems. First, since the features in each record are averaged over the windows of a flow, `Protocol Type` ends up with values that no longer map back to a protocol. Second, as a numeric variable it may be interpreted as ordinal, as if one protocol had a reason to rank higher or lower than another, which is not the case.

For these reasons, and since the boolean protocol flags already identify the protocol, we opted to drop this feature to favor the interpretability of the model without sacrificing the expected results. We recognize that, by exchanging a single consolidated feature for 8 booleans, we may end up with a more complex model.
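
A small worked example of the first problem, assuming for illustration that the per-window values are IANA protocol numbers (TCP = 6, UDP = 17, ICMP = 1) and that a record is the plain mean over the windows; the window values below are hypothetical:

```python
import numpy as np

# Assumed IANA protocol numbers, for illustration only.
PROTOCOL_NUMBERS = {1: 'ICMP', 6: 'TCP', 17: 'UDP'}

# Hypothetical per-window protocol numbers for one flow: two TCP and two UDP windows.
window_protocols = np.array([6, 6, 17, 17])

# Averaging the windows produces a value that is neither TCP nor UDP.
averaged = float(window_protocols.mean())
maps_to_protocol = averaged in PROTOCOL_NUMBERS

print(f"averaged Protocol Type: {averaged}, maps to a known protocol: {maps_to_protocol}")
```

The averaged value sits between the two protocol numbers and carries no ordinal meaning, whereas the boolean flags would simply mark both `TCP` and `UDP` as present.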

In [10]:
# Only the last expression of a cell is rendered, so display the counts explicitly.
display(df['Protocol Type'].value_counts())

drop_columns_set |= {'Protocol Type'}

## Adding highest TCP/IP Layer

Even though we have flags indicating the presence of each protocol, for some attacks the relevant signal may be the presence of *any* protocol at a given layer rather than which specific protocol was seen. Such a feature can help simplify the resulting model and make the results easier to interpret.

In [12]:
# %%time

# df['highest_layer'] = 2

# for i, (layer, protocols) in enumerate(reversed(protocol_layer.items())):
#     df['highest_layer'] = np.where(df[protocols].sum(axis=1), 2 + i, df['highest_layer']).astype('uint8')

# df['highest_layer'].value_counts().to_frame().sort_index()

# add_columns_set |= {'highest_layer'}
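
The idea of a highest-layer feature can be sketched on a toy frame. The mapping `toy_protocol_layer`, the layer numbers and the default of 2 are assumptions made for this illustration; the real `protocol_layer` constant is not shown in this notebook and may be organized differently.

```python
import numpy as np
import pandas as pd

# Hypothetical layer -> protocol-flag mapping (not the project's constant).
toy_protocol_layer = {
    3: ['ICMP'],
    4: ['TCP', 'UDP'],
    5: ['HTTP', 'DNS'],
}

toy = pd.DataFrame({
    'ICMP': [True, False, False, False],
    'TCP':  [False, True, True, False],
    'UDP':  [False, False, False, False],
    'HTTP': [False, False, True, False],
    'DNS':  [False, False, False, False],
})

# Start from an assumed default layer of 2 and visit layers in ascending order,
# so that a higher layer overwrites a lower one whenever any of its flags is set.
highest_layer = np.full(len(toy), 2, dtype='uint8')
for layer, protocols in sorted(toy_protocol_layer.items()):
    has_layer = toy[protocols].any(axis=1).to_numpy()
    highest_layer = np.where(has_layer, layer, highest_layer).astype('uint8')

toy['highest_layer'] = highest_layer
print(toy['highest_layer'].tolist())
```

The third record has both `TCP` and `HTTP` set and is assigned the application layer, while the last record has no flags and keeps the default.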

## Adding Header Overhead

In [12]:
# df['header_overhead'] = df['Header_Length'] / (df['Tot sum'] + df['Header_Length'])

# add_columns_set |= {'header_overhead'}

## Adding TCP Control Rate

In [13]:
# df['tcp_control_rate'] = (df[features['tcp_flag_counts']].sum(axis=1) / (df['flow_duration'] + 1e-6))

# add_columns_set |= {'tcp_control_rate'}

# Save

In [13]:
HTML(f"""
<p>
    In this process, we have <strong>dropped {len(drop_columns_set)} features</strong>:
    <ul>
        {''.join(f"<li>{feature_name}</li>" for feature_name in sorted(drop_columns_set))}
    </ul>
</p>
<p>
    And we have <strong>created {len(add_columns_set)} features</strong>:
    <ul>
        {''.join(f"<li>{feature_name}</li>" for feature_name in sorted(add_columns_set))}
    </ul>
</p>
""")

In [14]:
refined_column = [
    col
    for col in df.columns
    if (col in original_columns_set and col not in drop_columns_set)
    or col in add_columns_set
]

df_refined = df[refined_column]

df_refined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46686579 entries, 0 to 46686578
Data columns (total 35 columns):
 #   Column           Dtype   
---  ------           -----   
 0   flow_duration    float32 
 1   Header_Length    float32 
 2   Duration         float32 
 3   Rate             float32 
 4   fin_flag_number  bool    
 5   syn_flag_number  bool    
 6   rst_flag_number  bool    
 7   psh_flag_number  bool    
 8   ack_flag_number  bool    
 9   ack_count        float32 
 10  syn_count        float32 
 11  fin_count        float32 
 12  urg_count        float32 
 13  rst_count        float32 
 14  HTTP             bool    
 15  HTTPS            bool    
 16  DNS              bool    
 17  SSH              bool    
 18  TCP              bool    
 19  UDP              bool    
 20  ICMP             bool    
 21  IPv              bool    
 22  Tot sum          float32 
 23  Min              float32 
 24  Max              float32 
 25  AVG              float32 
 26  Std         

In [15]:
HTML(f"""
<p>
    We finish the Feature Engineering process with 
    <strong>{len(utils.get_features_list(df_refined))} features</strong>.
</p>
""")

In [16]:
%%time

df_refined_train, df_refined_test = train_test_split(df_refined, train_size=0.80, random_state=seed)

print(f"Train size: {viz.eng_formatter_full(len(df_refined_train), len(df))}")
print(f"Test size: {viz.eng_formatter_full(len(df_refined_test), len(df))}")

df_refined_train.sort_index().reset_index(drop=True).to_parquet(train_parquet_path)
df_refined_test.sort_index().reset_index(drop=True).to_parquet(test_parquet_path)

!du -sh {train_parquet_path}
!du -sh {test_parquet_path}

Train size: 37.3M (80.0%)
Test size: 9.3M (20.0%)
580M	/var/fasttmp/dsn/unb_cic_ds_train.parquet
160M	/var/fasttmp/dsn/unb_cic_ds_test.parquet
CPU times: user 2min 22s, sys: 10.6 s, total: 2min 32s
Wall time: 2min 27s
