# Process, Balance, and Select Features from the CIC-DDoS2019 Dataset

Here we load data from the CIC-DDoS2019 dataset sequentially in batches and process it for our experiments.

After loading the data, we do a cursory analysis of the feature space to determine which features we want to discard.

The datasets have their features reduced and are saved as "processed" data.

From there we extract the benign samples from the malicious samples in order to pool them together.

We then prepare balanced datasets from the processed malicious and benign data seeking 50/50 ratios.

We can also prepare data in other ratios to see whether our results improve on, or at least match, the initial results.

# Process and Select Features from the CIC-DDoS2019 Dataset

First we import all relevant libraries, set a random seed, and print the Python and library versions for reproducibility.

In [1]:
import os, platform, pprint, sys
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

seed: int = 14

# set up pretty printer for easier data evaluation
pretty = pprint.PrettyPrinter(indent=4, width=30).pprint

print(
    f'''
    python:\t{platform.python_version()}

    \tmatplotlib:\t{mpl.__version__}
    \tnumpy:\t\t{np.__version__}
    \tpandas:\t\t{pd.__version__}
    '''
)


    python:	3.7.10

    	matplotlib:	3.3.4
    	numpy:		1.20.3
    	pandas:		1.2.5
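
The cell above stores `seed` but does not show it being passed to a random number generator. As a minimal sketch, assuming later sampling steps can use NumPy's `Generator` API (an assumption, not something this notebook shows), a fixed seed makes random draws repeatable:

```python
import numpy as np

seed = 14  # same value as the notebook's seed

# Two generators built from the same seed produce identical draws,
# which is what makes a sampling step repeatable between runs.
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

sample_a = rng_a.integers(0, 100, size=5)
sample_b = rng_b.integers(0, 100, size=5)
print(sample_a, sample_b)
```

Passing the generator explicitly to any function that samples keeps the randomness in one place.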
    


Next we define some helper functions for processing the data.

In [2]:
def get_file_path(directory: str):
    '''
        Closure that will return a function. 
        Function will return the filepath to the directory given to the closure
    '''

    def func(file: str) -> str:
        return os.path.join(directory, file)

    return func



def load_data(filePath):
    '''
        Loads the Dataset from the given filepath and caches it for quick access in the future
        Function will only work when filepath is a .csv file
    '''

    # slice off the leading './original/' (11 characters) from the filePath
    if filePath[0] == '.' and filePath[1] == '/':
        filePathClean: str = filePath[11::]
        pickleDump: str = f'./cache/{filePathClean}.pickle'
    else:
        pickleDump: str = f'./cache/{filePath}.pickle'
    
    print(f'Loading Dataset: {filePath}')
    print(f'\tTo Dataset Cache: {pickleDump}\n')
    
    # check if data already exists within cache
    if os.path.exists(pickleDump):
        df = pd.read_pickle(pickleDump)
        
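    # if not, load the CSV and cache it
    # (the cache directory is created first so that to_pickle does not fail)
    else:
        df = pd.read_csv(filePath, low_memory=True)
        os.makedirs(os.path.dirname(pickleDump), exist_ok=True)
        df.to_pickle(pickleDump)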
    
    return df



def features_with_bad_values(df: pd.DataFrame, datasetName: str) -> pd.DataFrame:
    '''
        Function will scan the dataframe for features with Inf, NaN, or Zero values.
        Returns a new dataframe describing the distribution of these values in the original dataframe
    '''

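    # Inf and NaN values can take different forms so we screen for every one of them.
    # Caveat: an equality test against np.nan is always False, and pd.read_csv parses
    # 'NaN'/'nan' strings into float NaN by default, so the NaN row below may undercount.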
    invalid_values: list = [ np.inf, np.nan, 'Infinity', 'inf', 'NaN', 'nan', 0 ]
    infs          : list = [ np.inf, 'Infinity', 'inf' ]
    NaNs          : list = [ np.nan, 'NaN', 'nan' ]

    # We will collect stats on the dataset, specifically how many instances of Infs, NaNs, and 0s are present.
    # using a dictionary that will be converted into a (3, 2+88) dataframe
    stats: dict = {
        'Dataset':[ datasetName, datasetName, datasetName ],
        'Value'  :['Inf', 'NaN', 'Zero']
    }

    for col in df.columns:
        # feature[0], feature[1], feature[2] hold the Inf, NaN, and zero counts for this column
        feature = np.zeros(3)
        
        for value in invalid_values:
            if value in infs:
                j = 0
            elif value in NaNs:
                j = 1
            else:
                j = 2
            indexNames = df[df[col] == value].index
            if not indexNames.empty:
                feature[j] += len(indexNames)
                
        stats[col] = feature

    return pd.DataFrame(stats)
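
Because `NaN` never compares equal to anything, including itself, an equality-based scan cannot detect it. The following is a minimal sketch of a NaN-safe count, not the helper used in this notebook; `count_bad_values` and the toy frame are hypothetical names introduced only for illustration:

```python
import numpy as np
import pandas as pd

def count_bad_values(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count Inf, NaN, and zero entries per numeric column."""
    numeric = df.select_dtypes(include=[np.number])
    counts = {
        'Inf':  numeric.isin([np.inf, -np.inf]).sum(),  # positive and negative infinity
        'NaN':  numeric.isna().sum(),                   # isna() is needed because NaN != NaN
        'Zero': (numeric == 0).sum(),
    }
    # rows: Inf / NaN / Zero, columns: one per numeric feature
    return pd.DataFrame(counts).T

toy = pd.DataFrame({
    'Flow Bytes/s':  [np.inf, np.nan, 0.0, 12.5],
    'Flow Duration': [0, 0, 3, 7],
})
bad = count_bad_values(toy)
print(bad)

# The equality test finds no NaN even though one is present
print((toy['Flow Bytes/s'] == np.nan).sum())
```

This version only looks at numeric columns, so string-valued features would need to be handled separately.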


Before we do any processing on the data, we need to list out all their filepaths. If trying to reproduce the process carried out here, place files in the same location relative to the notebook.

In [3]:
data_path_1: str = './original/01-12/'
data_path_2: str = './original/03-11/'
    
data_set_1: list = [
    'DrDoS_DNS.csv',
    'DrDoS_LDAP.csv',
    'DrDoS_MSSQL.csv',
    'DrDoS_NetBIOS.csv',
    'DrDoS_NTP.csv',
    'DrDoS_SNMP.csv',
    'DrDoS_SSDP.csv',
    'DrDoS_UDP.csv',
    'Syn.csv',
    'TFTP.csv',
    'UDPLag.csv',    
]
    
data_set_2: list = [
    'LDAP.csv',
    'MSSQL.csv',
    'NetBIOS.csv',
    'Portmap.csv',   
    'Syn.csv',
    'UDP.csv',
    'UDPLag.csv',
]

data_set: list = data_set_1 + data_set_2


file_path_1 = get_file_path(data_path_1)
file_path_2 = get_file_path(data_path_2)


file_set: list = list(map(file_path_1, data_set_1))
file_set.extend(list(map(file_path_2, data_set_2)))

This gives us a set of file locations. Let's look at the set of files that make up the CIC-DDoS2019 dataset.

In [4]:
print(f'We will be cleaning {len(file_set)} files:')
print(f'Benign samples will be grabbed from each dataset and saved separately\n')
pretty(file_set)

We will be cleaning 18 files:
Benign samples will be grabbed from each dataset and saved separately

[   './original/01-12/DrDoS_DNS.csv',
    './original/01-12/DrDoS_LDAP.csv',
    './original/01-12/DrDoS_MSSQL.csv',
    './original/01-12/DrDoS_NetBIOS.csv',
    './original/01-12/DrDoS_NTP.csv',
    './original/01-12/DrDoS_SNMP.csv',
    './original/01-12/DrDoS_SSDP.csv',
    './original/01-12/DrDoS_UDP.csv',
    './original/01-12/Syn.csv',
    './original/01-12/TFTP.csv',
    './original/01-12/UDPLag.csv',
    './original/03-11/LDAP.csv',
    './original/03-11/MSSQL.csv',
    './original/03-11/NetBIOS.csv',
    './original/03-11/Portmap.csv',
    './original/03-11/Syn.csv',
    './original/03-11/UDP.csv',
    './original/03-11/UDPLag.csv']


Here we create a dictionary that maps the raw CSV column labels to more meaningful, human-interpretable labels. Extra whitespace is stripped, underscores are replaced with spaces, and superfluous information is eliminated.

In [5]:
new_column_names = {
    'Unnamed: 0'                :'Unnamed'                  , 'Flow ID'                     :'Flow ID'                      ,
    ' Source IP'                :'Source IP'                , ' Source Port'                :'Source Port'                  ,
    ' Destination IP'           :'Destination IP'           , ' Destination Port'           :'Destination Port'             ,
    ' Protocol'                 :'Protocol'                 , ' Total Length of Bwd Packets':'Total Length of Bwd Packets'  ,     
    ' Flow Duration'            :'Flow Duration'            , ' Total Fwd Packets'          :'Total Fwd Packets'            , 
    ' Total Backward Packets'   :'Total Backward Packets'   , 'Total Length of Fwd Packets' :'Total Length of Fwd Packets'  ,
    ' Timestamp'                :'Timestamp'                , ' Init_Win_bytes_backward'    :'Init Win bytes backward'      ,
    ' Fwd Packet Length Max'    :'Fwd Packet Length Max'    , ' Fwd Packet Length Min'      :'Fwd Packet Length Min'        ,
    ' Fwd Packet Length Mean'   :'Fwd Packet Length Mean'   , ' Fwd Packet Length Std'      :'Fwd Packet Length Std'        ,
    'Bwd Packet Length Max'     :'Bwd Packet Length Max'    , ' Bwd Packet Length Min'      :'Bwd Packet Length Min'        ,
    ' Bwd Packet Length Mean'   :'Bwd Packet Length Mean'   , ' Bwd Packet Length Std'      :'Bwd Packet Length Std'        ,
    'Flow Bytes/s'              :'Flow Bytes/s'             , ' Flow Packets/s'             :'Flow Packets/s'               ,
    ' Flow IAT Mean'            :'Flow IAT Mean'            , ' Flow IAT Std'               :'Flow IAT Std'                 ,
    ' Flow IAT Max'             :'Flow IAT Max'             , ' Flow IAT Min'               :'Flow IAT Min'                 ,
    'Fwd IAT Total'             :'Fwd IAT Total'            , ' Fwd IAT Mean'               :'Fwd IAT Mean'                 ,
    ' Fwd IAT Std'              :'Fwd IAT Std'              , ' Fwd IAT Max'                :'Fwd IAT Max'                  ,
    ' Fwd IAT Min'              :'Fwd IAT Min'              , 'Bwd IAT Total'               :'Bwd IAT Total'                ,    
    ' Bwd IAT Mean'             :'Bwd IAT Mean'             , ' Bwd IAT Std'                :'Bwd IAT Std'                  ,
    ' Bwd IAT Max'              :'Bwd IAT Max'              , ' Bwd IAT Min'                :'Bwd IAT Min'                  ,
    'Fwd PSH Flags'             :'Fwd PSH Flags'            , ' Bwd PSH Flags'              :'Bwd PSH Flags'                , 
    ' Fwd URG Flags'            :'Fwd URG Flags'            , ' Bwd URG Flags'              :'Bwd URG Flags'                ,
    ' Fwd Header Length'        :'Fwd Header Length'        , ' Bwd Header Length'          :'Bwd Header Length'            , 
    'Fwd Packets/s'             :'Fwd Packets/s'            , ' Bwd Packets/s'              :'Bwd Packets/s'                , 
    ' Min Packet Length'        :'Min Packet Length'        , ' Max Packet Length'          :'Max Packet Length'            , 
    ' Packet Length Mean'       :'Packet Length Mean'       , ' Packet Length Std'          :'Packet Length Std'            , 
    ' Packet Length Variance'   :'Packet Length Variance'   , 'FIN Flag Count'              :'FIN Flag Count'               ,
    ' SYN Flag Count'           :'SYN Flag Count'           , ' RST Flag Count'             :'RST Flag Count'               ,
    ' PSH Flag Count'           :'PSH Flag Count'           , ' ACK Flag Count'             :'ACK Flag Count'               , 
    ' URG Flag Count'           :'URG Flag Count'           , ' CWE Flag Count'             :'CWE Flag Count'               , 
    ' ECE Flag Count'           :'ECE Flag Count'           , ' Down/Up Ratio'              :'Down/Up Ratio'                ,
    ' Average Packet Size'      :'Average Packet Size'      , ' Avg Fwd Segment Size'       :'Avg Fwd Segment Size'         ,
    ' Avg Bwd Segment Size'     :'Avg Bwd Segment Size'     , ' Fwd Header Length.1'        :'Fwd Header Length.1'          , 
    'Fwd Avg Bytes/Bulk'        :'Fwd Avg Bytes/Bulk'       , ' Inbound'                    :'Inbound'                      , 
    ' Fwd Avg Packets/Bulk'     :'Fwd Avg Packets/Bulk'     , ' Fwd Avg Bulk Rate'          :'Fwd Avg Bulk Rate'            , 
    ' Bwd Avg Bytes/Bulk'       :'Bwd Avg Bytes/Bulk'       , ' Bwd Avg Packets/Bulk'       :'Bwd Avg Packets/Bulk'         ,
    'Bwd Avg Bulk Rate'         :'Bwd Avg Bulk Rate'        , 'Subflow Fwd Packets'         :'Subflow Fwd Packets'          ,
    ' Subflow Fwd Bytes'        :'Subflow Fwd Bytes'        , ' Subflow Bwd Packets'        :'Subflow Bwd Packets'          ,
    ' Subflow Bwd Bytes'        :'Subflow Bwd Bytes'        , 'Init_Win_bytes_forward'      :'Init Win bytes forward'       ,
    ' act_data_pkt_fwd'         :'act data pkt fwd'         , ' min_seg_size_forward'       :'min seg size forward'         ,     
    'Active Mean'               :'Active Mean'              , ' Active Std'                 :'Active Std'                   ,
    ' Active Max'               :'Active Max'               , ' Active Min'                 :'Active Min'                   , 
    'Idle Mean'                 :'Idle Mean'                , ' Idle Std'                   :'Idle Std'                     ,
    ' Idle Max'                 :'Idle Max'                 , ' Idle Min'                   :'Idle Min'                     ,
    'SimillarHTTP'              :'SimillarHTTP'             , ' Label'                      :'Label'                        ,
}
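
The mapping above appears to follow one simple rule with a single special case. As a sketch under that assumption, the same labels could be derived programmatically; `clean_label` is a hypothetical helper, not part of the notebook:

```python
def clean_label(raw: str) -> str:
    """Hypothetical helper: strip whitespace and replace underscores with spaces."""
    if raw == 'Unnamed: 0':  # the one special case in the mapping above
        return 'Unnamed'
    return raw.strip().replace('_', ' ')

raw_labels = [
    'Unnamed: 0', ' Source IP', ' Init_Win_bytes_backward',
    'Flow Bytes/s', ' Fwd Header Length.1', ' Label',
]
cleaned = {raw: clean_label(raw) for raw in raw_labels}
print(cleaned)
```

An explicit dictionary has the advantage of documenting every expected column, so the hand-written mapping is a reasonable choice even if the rule is simple.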

It will also come in handy to record some statistics about the data as it is being processed.

In [6]:
composition_columns = ['File', 'Benign', 'Malicious', 'Total', 'Percent Benign']
data_composition = pd.DataFrame(columns = composition_columns)

In [7]:
current_job = 0
print(f'''
    Dataset {current_job+1}/{len(data_set)}: We now look at {file_set[current_job]}
''')

df          = load_data(file_set[current_job])
df          = df.rename(columns=new_column_names)
benign_df   = df[df['Label'] == 'BENIGN']

data_composition = data_composition.append(pd.DataFrame([
    [file_set[current_job][11:], benign_df.shape[0], df.shape[0]-benign_df.shape[0], df.shape[0], 100*benign_df.shape[0]/df.shape[0]]
], columns = composition_columns))


print(f"""
File:\t\t\t\t{file_set[current_job]}  
Job Number:\t\t\t{current_job+1}
Shape:\t\t\t\t{df.shape}
Samples:\t\t\t{df.shape[0]} 
Features:\t\t\t{df.shape[1]}
Benign Samples:\t\t\t{benign_df.shape[0]}
Malicious Samples:\t\t{df.shape[0]-benign_df.shape[0]}
Benign-to-Malicious Ratio:\t{benign_df.shape[0]/(df.shape[0]-benign_df.shape[0])}
""")


    Dataset 1/18: We now look at ./original/01-12/DrDoS_DNS.csv

Loading Dataset: ./original/01-12/DrDoS_DNS.csv
	To Dataset Cache: ./cache/01-12/DrDoS_DNS.csv.pickle


File:				./original/01-12/DrDoS_DNS.csv  
Job Number:			1
Shape:				(5074413, 88)
Samples:			5074413 
Features:			88
Benign Samples:			3402
Malicious Samples:		5071011
Benign-to-Malicious Ratio:	0.0006708721396975869
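
`DataFrame.append` works with the pandas 1.2.5 used here, but it was deprecated in pandas 1.4 and removed in 2.0. As a sketch for anyone reproducing this on a newer pandas, the same composition record can be built with `pd.concat`; `composition_row` is a hypothetical helper introduced for illustration:

```python
import pandas as pd

composition_columns = ['File', 'Benign', 'Malicious', 'Total', 'Percent Benign']

def composition_row(file: str, benign: int, total: int) -> pd.DataFrame:
    """Hypothetical helper: build one row of composition statistics."""
    malicious = total - benign
    return pd.DataFrame(
        [[file, benign, malicious, total, 100 * benign / total]],
        columns=composition_columns,
    )

# Counts taken from the output above for DrDoS_DNS.csv
rows = [composition_row('01-12/DrDoS_DNS.csv', 3402, 5074413)]
data_composition = pd.concat(rows, ignore_index=True)
print(data_composition)
```

Collecting the rows in a list and concatenating once also avoids copying the growing frame on every file.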



Now that we have a dataset loaded, let's explore the features and find which ones we want to eliminate, creating a 'pruning' list to reduce the size of the dataset. We will use a few simple heuristics to eliminate features before examining particular methodologies. One of those heuristics is to eliminate non-numerical data. We could encode these values, but at this stage the goal is dimension reduction. If we meet poor performance, we can come back and re-examine our heuristics.

In [8]:
prune: list = [] # prune is a list of all features we know we don't want to use
clip : list = [] # clip is a list of all values we do not want to use

# we extract the data from the benign_df and use it to layout our features
# we use the benign_df because it is smaller and will process faster
# if the feature is string valued, we add it to our pruning list
values = benign_df.values
columns = benign_df.columns
for i in range(benign_df.shape[1]):
    if type(values[0][i]) == str and columns[i] != 'Label':
        prune.append(columns[i]) 
    print(f"Column: {i}\tType: {type(values[0][i])}\tLabel: {columns[i]}")

Column: 0	Type: <class 'int'>	Label: Unnamed
Column: 1	Type: <class 'str'>	Label: Flow ID
Column: 2	Type: <class 'str'>	Label: Source IP
Column: 3	Type: <class 'int'>	Label: Source Port
Column: 4	Type: <class 'str'>	Label: Destination IP
Column: 5	Type: <class 'int'>	Label: Destination Port
Column: 6	Type: <class 'int'>	Label: Protocol
Column: 7	Type: <class 'str'>	Label: Timestamp
Column: 8	Type: <class 'int'>	Label: Flow Duration
Column: 9	Type: <class 'int'>	Label: Total Fwd Packets
Column: 10	Type: <class 'int'>	Label: Total Backward Packets
Column: 11	Type: <class 'float'>	Label: Total Length of Fwd Packets
Column: 12	Type: <class 'float'>	Label: Total Length of Bwd Packets
Column: 13	Type: <class 'float'>	Label: Fwd Packet Length Max
Column: 14	Type: <class 'float'>	Label: Fwd Packet Length Min
Column: 15	Type: <class 'float'>	Label: Fwd Packet Length Mean
Column: 16	Type: <class 'float'>	Label: Fwd Packet Length Std
Column: 17	Type: <class 'float'>	Label: Bwd Packet Length Max
...
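
The loop above inspects only the first benign row to decide whether a feature is string-valued. As an alternative sketch, assuming the column dtypes assigned by pandas are trustworthy for this purpose, the non-numeric columns can be selected by dtype instead; the toy frame is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'Flow ID': ['a-b-1', 'a-b-2'],
    'Source Port': [53, 443],
    'Flow Bytes/s': [1.5, 2.5],
    'Label': ['BENIGN', 'DrDoS_DNS'],
})

# Columns pandas stores as non-numeric, excluding the target column
non_numeric = [c for c in toy.select_dtypes(exclude='number').columns if c != 'Label']
print(non_numeric)
```

A dtype check considers the whole column, so a feature whose first row happens to look numeric but which contains strings elsewhere would still be caught.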

Next, we use our previously defined function to examine the dataset and see whether any features have problematic values mixed in with the real-valued data. These include infinite and NaN (Not a Number) values that could interfere with our model's ability to process the data.

In [9]:
feature_stats = features_with_bad_values(df, file_set[current_job])

Now that we have compiled the stats on the undesirable values in the dataset, we inspect the data to find out what features we should get rid of.

Our stats take the form of a dataframe with the dataset location, the value being looked for, and the count of that value for each feature in the dataset.

In [10]:
feature_stats

Unnamed: 0,Dataset,Value,Unnamed,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,./original/01-12/DrDoS_DNS.csv,Inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,./original/01-12/DrDoS_DNS.csv,NaN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,./original/01-12/DrDoS_DNS.csv,Zero,188.0,0.0,0.0,382.0,0.0,382.0,382.0,0.0,...,5074144.0,5073990.0,5073990.0,5073989.0,5074137.0,5073989.0,5073989.0,0.0,4735.0,0.0


We can see that plenty of features have a large number of 0 values, but this truncated view tells us little about the distribution of inf and NaN values. Let's take a closer look at the stats.

In [11]:
f = feature_stats[feature_stats['Value'] == 'Inf'].T
f[f[0] != 0]

Unnamed: 0,0
Dataset,./original/01-12/DrDoS_DNS.csv
Value,Inf
Flow Bytes/s,162363.0
Flow Packets/s,162394.0


Flow Bytes/s and Flow Packets/s each have over 162 thousand inf values. This makes these features candidates for pruning, but 162 thousand is only about 3.2% of the 5,074,413 samples, which may not justify pruning the entire feature. We may instead remove just the samples with inf values.

In [12]:
f = feature_stats[feature_stats['Value'] == 'NaN'].T
f[f[1] != 0]

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_DNS.csv
Value,NaN


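No NaN values are reported in our set so far. This is pretty surprising, because NaN values cropped up a lot when cleaning the TOR dataset created with the same tool. Part of the explanation may be the counting method: an equality comparison against np.nan never matches, so features_with_bad_values can report zero NaN values even when some are present. Either way, let's add Inf and NaN values to our clip list, since they account for only a small fraction of the samples in the dataset. Our clip list just specifies which values cause a sample to be removed.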

In [13]:
toClip = [ np.inf, np.nan, 'Infinity', 'inf', 'NaN', 'nan' ]
for i in toClip:
    if i not in clip:
        clip.append(i)
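
This excerpt does not show how the clip list is eventually applied. One common way to drop samples containing Inf or NaN values, offered here as a sketch rather than the notebook's actual method, is to convert infinities to NaN and then drop incomplete rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Flow Bytes/s':   [10.0, np.inf, np.nan, 4.0],
    'Flow Packets/s': [1.0, np.inf, 2.0, 3.0],
})

# Treat +/-inf as missing, then drop every row that has a missing value
clipped = toy.replace([np.inf, -np.inf], np.nan).dropna()
print(clipped)
```

Rows 1 and 2 of the toy frame are removed, leaving only the samples whose features are all finite.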

Now we investigate the distribution of 0-valued features in the dataset. Unlike Inf and NaN values, we don't necessarily have to remove them. However, if a feature is overwhelmingly populated with 0 values, it would be pointless to include the feature in our experiments.

In [14]:
f = feature_stats[feature_stats['Value'] == 'Zero'].T
f[f[2] != 0]

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_DNS.csv
Value,Zero
Unnamed,188.0
Source Port,382.0
Destination Port,382.0
...,...
Idle Mean,5073989.0
Idle Std,5074137.0
Idle Max,5073989.0
Idle Min,5073989.0


In [15]:
f_top = f[:2]
f_bottom = f[2:]
f_bottom[f_bottom[2] > 0]

Unnamed: 0,2
Unnamed,188.0
Source Port,382.0
Destination Port,382.0
Protocol,382.0
Flow Duration,162394.0
...,...
Idle Mean,5073989.0
Idle Std,5074137.0
Idle Max,5073989.0
Idle Min,5073989.0


When it comes to 0 values, 79 of our 88 features contain at least one. This isn't necessarily bad, since we expect a fair number of 0 values in any numeric distribution, but features with >99% 0 values are obvious candidates for pruning.

In [16]:
f_bottom[f_bottom[2] > 5000]

Unnamed: 0,2
Flow Duration,162394.0
Total Backward Packets,5070677.0
Total Length of Bwd Packets,5072319.0
Fwd Packet Length Std,5068906.0
Bwd Packet Length Max,5072319.0
Bwd Packet Length Min,5073164.0
Bwd Packet Length Mean,5072319.0
Bwd Packet Length Std,5073556.0
Flow IAT Mean,162394.0
Flow IAT Std,5055499.0


In [17]:
f_bottom[f_bottom[2] > 5000].shape

(60, 1)

Filtering for zero counts greater than 5,000 still gave us 60 features. The 5,000 cutoff is rather arbitrary, but filtering on it helps us see all of the large counts. We can split the features into 4 partitions according to their number of 0 values:
    
    0-5,000

    5,000-200,000

    200,000-1,000,000

    1,000,000-5,074,413
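
The partitioning above can be sketched with `pd.cut`, which assigns each zero count to a bin. The counts below are copied from the outputs shown earlier for a handful of features; the bin labels are our own shorthand:

```python
import pandas as pd

# Zero counts copied from the outputs above for a few features
zero_counts = pd.Series({
    'Unnamed': 188,
    'Source Port': 382,
    'Flow Duration': 162394,
    'Flow IAT Std': 5055499,
    'Idle Std': 5074137,
})

total = 5074413  # samples in DrDoS_DNS.csv
bins = [0, 5_000, 200_000, 1_000_000, total]
labels = ['0-5,000', '5,000-200,000', '200,000-1,000,000', '1,000,000+']

partition = pd.cut(zero_counts, bins=bins, labels=labels, include_lowest=True)
print(partition.value_counts(sort=False))
```

Applied to the full set of zero counts, this would give the size of every partition in one step instead of one filter per threshold.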


Zero counts in the 0-200,000 range seem reasonable for a dataset of this size, but features with more than 200,000 are questionable. So next we filter for features with more than 200,000 0 values.

In [18]:
f_bottom[f_bottom[2] > 200000]

Unnamed: 0,2
Total Backward Packets,5070677.0
Total Length of Bwd Packets,5072319.0
Fwd Packet Length Std,5068906.0
Bwd Packet Length Max,5072319.0
Bwd Packet Length Min,5073164.0
Bwd Packet Length Mean,5072319.0
Bwd Packet Length Std,5073556.0
Flow IAT Std,5055499.0
Fwd IAT Std,5057615.0
Bwd IAT Total,5070975.0


In [19]:
f_bottom[f_bottom[2] > 200000].shape

(51, 1)

This still leaves us with 51 of our original 88 features. Narrowing our search to counts greater than 1,000,000 then shows:

In [20]:
f_bottom[f_bottom[2] > 1000000]

Unnamed: 0,2
Total Backward Packets,5070677.0
Total Length of Bwd Packets,5072319.0
Fwd Packet Length Std,5068906.0
Bwd Packet Length Max,5072319.0
Bwd Packet Length Min,5073164.0
Bwd Packet Length Mean,5072319.0
Bwd Packet Length Std,5073556.0
Flow IAT Std,5055499.0
Fwd IAT Std,5057615.0
Bwd IAT Total,5070975.0


In [21]:
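# NOTE: this repeats the 200,000 threshold from In [19];
# the 1,000,000 cutoff would be f_bottom[f_bottom[2] > 1000000].shape
f_bottom[f_bottom[2] > 200000].shape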

(51, 1)

Which shows no change. Filtering for instances greater than 5,000,000, we find

In [22]:
f_bottom[f_bottom[2] > 5000000]

Unnamed: 0,2
Total Backward Packets,5070677.0
Total Length of Bwd Packets,5072319.0
Fwd Packet Length Std,5068906.0
Bwd Packet Length Max,5072319.0
Bwd Packet Length Min,5073164.0
Bwd Packet Length Mean,5072319.0
Bwd Packet Length Std,5073556.0
Flow IAT Std,5055499.0
Fwd IAT Std,5057615.0
Bwd IAT Total,5070975.0


In [23]:
f_bottom[f_bottom[2] > 5000000].shape

(48, 1)

So we have 48 features with almost nothing but 0 values, 3 features with between 1,000,000 and 5,000,000 0 values, 9 features with between 5,000 and 200,000 0 values, and 18 features with fewer than 5,000 0 values.

In [24]:
pruneCandidates: list = list(f_bottom[f_bottom[2] > 5000000].T.columns)

In [25]:
pruneCandidates

['Total Backward Packets',
 'Total Length of Bwd Packets',
 'Fwd Packet Length Std',
 'Bwd Packet Length Max',
 'Bwd Packet Length Min',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Std',
 'Flow IAT Std',
 'Fwd IAT Std',
 'Bwd IAT Total',
 'Bwd IAT Mean',
 'Bwd IAT Std',
 'Bwd IAT Max',
 'Bwd IAT Min',
 'Fwd PSH Flags',
 'Bwd PSH Flags',
 'Fwd URG Flags',
 'Bwd URG Flags',
 'Bwd Header Length',
 'Bwd Packets/s',
 'Packet Length Std',
 'Packet Length Variance',
 'FIN Flag Count',
 'SYN Flag Count',
 'RST Flag Count',
 'PSH Flag Count',
 'ACK Flag Count',
 'URG Flag Count',
 'CWE Flag Count',
 'ECE Flag Count',
 'Down/Up Ratio',
 'Avg Bwd Segment Size',
 'Fwd Avg Bytes/Bulk',
 'Fwd Avg Packets/Bulk',
 'Fwd Avg Bulk Rate',
 'Bwd Avg Bytes/Bulk',
 'Bwd Avg Packets/Bulk',
 'Bwd Avg Bulk Rate',
 'Subflow Bwd Packets',
 'Subflow Bwd Bytes',
 'Active Mean',
 'Active Std',
 'Active Max',
 'Active Min',
 'Idle Mean',
 'Idle Std',
 'Idle Max',
 'Idle Min']

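Adding every feature with more than 5 million 0 values to the prune list would give us a preliminary list of 53/88 features to remove (the 48 candidates above plus the 5 string-valued features). That step is left commented out below, so for now the candidates stay in pruneCandidates and the prune list holds only the string-valued features.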

In [26]:
# toPrune = f_bottom[f_bottom[2] > 5000000].T.columns
# for i in toPrune:
#     if i not in prune:
#         prune.append(i)
# len(prune) 

In [27]:
prune

['Flow ID', 'Source IP', 'Destination IP', 'Timestamp', 'SimillarHTTP']

We will also add the Unnamed feature to this list, because we cannot identify what characteristic of the dataset it represents, as well as Fwd Header Length.1, because it is a duplicate.

In [28]:
toPrune = ['Fwd Header Length.1', 'Unnamed']

for i in toPrune:
    if i not in prune:
        prune.append(i)
len(prune)

7
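
This excerpt stops before the prune list is applied. As a sketch of how such a list is typically used, assuming the features are simply dropped as columns, `DataFrame.drop` can remove them in one call; the toy frame is made up for illustration:

```python
import pandas as pd

prune = ['Flow ID', 'Source IP', 'Destination IP', 'Timestamp',
         'SimillarHTTP', 'Fwd Header Length.1', 'Unnamed']

toy = pd.DataFrame({
    'Flow ID': ['x'], 'Source IP': ['10.0.0.1'], 'Flow Duration': [42],
    'Fwd Header Length.1': [20], 'Label': ['BENIGN'],
})

# errors='ignore' skips names in prune that are absent from this frame
reduced = toy.drop(columns=prune, errors='ignore')
print(list(reduced.columns))
```

Using `errors='ignore'` lets the same prune list be applied to every file even if a column is missing from one of them.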

### Now, let's make a few functions to do everything we did above so we can evaluate the features of the other 17 collections of data in the CIC-DDoS2019 dataset

In [29]:
def examine_dataset(job_id: int) -> dict:
    '''
        Function will return a dictionary containing dataframe of the job_id passed in as well as that dataframe's
        feature stats, data composition, and file name.
    '''

    job_id = job_id - 1  # adjusts for indexing while enumerating jobs from 1
    print(f'Dataset {job_id+1}/{len(data_set)}: We now look at {file_set[job_id]}\n\n')

    # Load the dataset
    df: pd.DataFrame = load_data(file_set[job_id])
    df = df.rename(columns=new_column_names)
    benign_df: pd.DataFrame = df[df['Label'] == 'BENIGN']

    # Record the data composition of the dataset
    composition = data_composition.append(
        pd.DataFrame([
            [file_set[job_id][11:], benign_df.shape[0], df.shape[0] - benign_df.shape[0], df.shape[0], 100*benign_df.shape[0]/df.shape[0]]
        ], columns = composition_columns)
    )

    # print the data composition
    print(f'''
        File:\t\t\t\t{file_set[job_id]}  
        Job Number:\t\t\t{job_id+1}
        Shape:\t\t\t\t{df.shape}
        Samples:\t\t\t{df.shape[0]} 
        Features:\t\t\t{df.shape[1]}
        Benign Samples:\t\t\t{benign_df.shape[0]}
        Malicious Samples:\t\t{df.shape[0]-benign_df.shape[0]}
        Benign-to-Malicious Ratio:\t{benign_df.shape[0]/(df.shape[0]-benign_df.shape[0])}
    ''')
    
    # return the dataframe and the feature stats
    data_summary =  {'File':file_set[job_id] , 'Dataset':df, 'Feature_stats':features_with_bad_values(df, file_set[job_id]), 'Data_composition':composition}
    return data_summary


def check_infs(data_summary: dict) -> pd.DataFrame:
    '''
        Function will return a dataframe of features with a value of Inf.
    '''

    
    vals: pd.DataFrame = data_summary['Feature_stats']
    inf_df = vals[vals['Value'] == 'Inf'].T

    return inf_df[inf_df[0] != 0]


def check_nans(data_summary: dict) -> pd.DataFrame:
    '''
        Function will return a dataframe of features with a value of NaN.
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    nan_df = vals[vals['Value'] == 'NaN'].T

    return nan_df[nan_df[1] != 0]


def check_zeros(data_summary: dict) -> pd.DataFrame:
    '''
        Function will return a dataframe of features with a value of 0.
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    zero_df = vals[vals['Value'] == 'Zero'].T

    return zero_df[zero_df[2] != 0]


def check_zeros_over_threshold(data_summary: dict, threshold: int) -> pd.DataFrame:
    '''
        Function will return a dataframe of features whose count of 0 values exceeds the threshold.
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    zero_df = vals[vals['Value'] == 'Zero'].T
    zero_df_bottom = zero_df[2:]

    return zero_df_bottom[zero_df_bottom[2] > threshold]


def check_zeros_over_threshold_percentage(data_summary: dict, threshold: float) -> pd.DataFrame:
    '''
        Function will return a dataframe of features with all features with
        a frequency of 0 values greater than the threshold
    '''

    vals: pd.DataFrame = data_summary['Feature_stats']
    size: int = data_summary['Dataset'].shape[0]
    zero_df = vals[vals['Value'] == 'Zero'].T
    zero_df_bottom = zero_df[2:]

    return zero_df_bottom[zero_df_bottom[2] > threshold*size]


def create_new_prune_candidates(zeros_df: pd.DataFrame) -> list:
    '''
        Function creates a list of prune candidates from a dataframe of features with a high frequency of 0 values
    '''

    return list(zeros_df.T.columns)


def intersection_of_prune_candidates(pruneCandidates: list, newPruneCandidates: list) -> list:
    '''
        Function will return a list of features that are in both pruneCandidates and newPruneCandidates
    '''

    return list(set(pruneCandidates).intersection(newPruneCandidates))
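
Since the plan is to intersect candidates across all 18 files, the pairwise helper above can also be folded over a whole list of candidate lists. A minimal sketch, where `intersect_all` and the sample lists are hypothetical:

```python
from functools import reduce

def intersect_all(candidate_lists: list) -> list:
    """Hypothetical helper: features present in every candidate list, sorted for stable output."""
    common = reduce(
        lambda acc, nxt: acc & set(nxt),
        candidate_lists[1:],
        set(candidate_lists[0]),
    )
    return sorted(common)

per_file_candidates = [
    ['Idle Std', 'Bwd IAT Min', 'FIN Flag Count'],
    ['Idle Std', 'FIN Flag Count', 'SimillarHTTP'],
    ['FIN Flag Count', 'Idle Std'],
]
print(intersect_all(per_file_candidates))
```

Sorting the result is a design choice: a plain `set` has no guaranteed order, so the printed candidate list can otherwise change between runs.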

### First, we test out our new functions on the first collection of data we evaluated above

## Data Collection #1

In [30]:
dataset_1 = examine_dataset(1)

Dataset 1/18: We now look at ./original/01-12/DrDoS_DNS.csv


Loading Dataset: ./original/01-12/DrDoS_DNS.csv
	To Dataset Cache: ./cache/01-12/DrDoS_DNS.csv.pickle


        File:				./original/01-12/DrDoS_DNS.csv  
        Job Number:			1
        Shape:				(5074413, 88)
        Samples:			5074413 
        Features:			88
        Benign Samples:			3402
        Malicious Samples:		5071011
        Benign-to-Malicious Ratio:	0.0006708721396975869
    


In [31]:
check_infs(dataset_1)

Unnamed: 0,0
Dataset,./original/01-12/DrDoS_DNS.csv
Value,Inf
Flow Bytes/s,162363.0
Flow Packets/s,162394.0


In [32]:
check_nans(dataset_1)

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_DNS.csv
Value,NaN


In [33]:
check_zeros(dataset_1)

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_DNS.csv
Value,Zero
Unnamed,188.0
Source Port,382.0
Destination Port,382.0
...,...
Idle Mean,5073989.0
Idle Std,5074137.0
Idle Max,5073989.0
Idle Min,5073989.0


In [34]:
check_zeros_over_threshold(dataset_1, 5000000)

Unnamed: 0,2
Total Backward Packets,5070677.0
Total Length of Bwd Packets,5072319.0
Fwd Packet Length Std,5068906.0
Bwd Packet Length Max,5072319.0
Bwd Packet Length Min,5073164.0
Bwd Packet Length Mean,5072319.0
Bwd Packet Length Std,5073556.0
Flow IAT Std,5055499.0
Fwd IAT Std,5057615.0
Bwd IAT Total,5070975.0


In [35]:
check_zeros_over_threshold_percentage(dataset_1, .95).shape

(48, 1)
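
For this collection the count cutoff of 5,000,000 and the 95% cutoff select the same 48 features. The sketch below shows how the two kinds of cutoff relate, and how the zero fraction can be computed directly from a frame; `zero_fraction` and the toy frame are hypothetical:

```python
import pandas as pd

def zero_fraction(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: fraction of entries equal to 0 in each column."""
    return (df == 0).mean()

toy = pd.DataFrame({
    'Idle Mean':     [0, 0, 0, 0, 5],
    'Flow Duration': [1, 0, 3, 4, 5],
})
fractions = zero_fraction(toy)
mostly_zero = fractions[fractions > 0.75].index.tolist()
print(mostly_zero)

# Converting between the two kinds of cutoff for DrDoS_DNS.csv
total = 5_074_413
count_cutoff_as_fraction = 5_000_000 / total  # share of samples implied by the count cutoff
fraction_cutoff_as_count = 0.95 * total       # count implied by the 95% rule
print(f'{count_cutoff_as_fraction:.4f}', round(fraction_cutoff_as_count))
```

A fractional cutoff carries over to the other files, which differ widely in size, whereas a fixed count does not.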

So let's add the features that are made up of more than 95% 0 values to a pruneCandidates list. We will go through each collection of data within CIC-DDoS2019, and the intersection of all the pruneCandidates will be added to our prune list for preliminary feature selection.

In [36]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_1, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)
pretty(pruneCandidates)

[   'URG Flag Count',
    'Fwd PSH Flags',
    'Idle Std',
    'Bwd Packet Length Std',
    'Fwd URG Flags',
    'Packet Length Std',
    'PSH Flag Count',
    'Fwd Avg Packets/Bulk',
    'Bwd IAT Std',
    'Bwd IAT Min',
    'Bwd Avg Bulk Rate',
    'FIN Flag Count',
    'CWE Flag Count',
    'Avg Bwd Segment Size',
    'ECE Flag Count',
    'Bwd URG Flags',
    'Bwd IAT Mean',
    'Bwd IAT Max',
    'Total Length of Bwd '
    'Packets',
    'Fwd Avg Bytes/Bulk',
    'Idle Mean',
    'SYN Flag Count',
    'RST Flag Count',
    'Down/Up Ratio',
    'Flow IAT Std',
    'Active Std',
    'ACK Flag Count',
    'Bwd Packets/s',
    'Fwd Avg Bulk Rate',
    'Bwd IAT Total',
    'Bwd Packet Length Max',
    'Idle Max',
    'Packet Length Variance',
    'Bwd PSH Flags',
    'Bwd Packet Length Mean',
    'Fwd Packet Length Std',
    'Bwd Avg Bytes/Bulk',
    'Bwd Header Length',
    'Active Mean',
    'Active Max',
    'Active Min',
    'Bwd Packet Length Min',
    'Bwd Avg Packets/Bulk',
    

We skipped testing the add_to_comp_stats function because this data collection's stats are already in the data_composition dataframe

In [37]:
data_composition

Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671


## Data Collection #2

Now, let's examine the next collection of data

In [38]:
dataset_2 = examine_dataset(2)

Dataset 2/18: We now look at ./original/01-12/DrDoS_LDAP.csv


Loading Dataset: ./original/01-12/DrDoS_LDAP.csv
	To Dataset Cache: ./cache/01-12/DrDoS_LDAP.csv.pickle


        File:				./original/01-12/DrDoS_LDAP.csv  
        Job Number:			2
        Shape:				(2181542, 88)
        Samples:			2181542 
        Features:			88
        Benign Samples:			1612
        Malicious Samples:		2179930
        Benign-to-Malicious Ratio:	0.0007394732858394536
    


Here we see that the ratio of benign to malicious samples in this data collection is similar to the first (about 0.00074 versus 0.00067). This collection is a little under half the size of the first and has roughly a quarter as many inf values.

In [39]:
check_infs(dataset_2)

Unnamed: 0,0
Dataset,./original/01-12/DrDoS_LDAP.csv
Value,Inf
Flow Bytes/s,38638.0
Flow Packets/s,38650.0


We can see that this collection also reports no NaN-valued entries.

In [40]:
check_nans(dataset_2)

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_LDAP.csv
Value,NaN


Checking the second collection for 0 values reveals a situation mirroring that of the first collection. Let's go through and count the features whose number of 0 values exceeds a particular threshold.

In [41]:
check_zeros(dataset_2)

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_LDAP.csv
Value,Zero
Unnamed,64.0
Source Port,207.0
Destination Port,207.0
...,...
Idle Std,2181542.0
Idle Max,2181530.0
Idle Min,2181530.0
SimillarHTTP,1870246.0


In [42]:
print(f'''
Features with a frequency of 0 values greater than
    2,000,000: {check_zeros_over_threshold(dataset_2, 2000000).shape[0]}
    1,000,000: {check_zeros_over_threshold(dataset_2, 1000000).shape[0]}
    500,000  : {check_zeros_over_threshold(dataset_2, 500000).shape[0]}
    200,000  : {check_zeros_over_threshold(dataset_2, 200000).shape[0]}
    50,000   : {check_zeros_over_threshold(dataset_2, 50000).shape[0]}
    5,000    : {check_zeros_over_threshold(dataset_2, 5000).shape[0]}
    0        : {check_zeros_over_threshold(dataset_2, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    2,000,000: 48
    1,000,000: 49
    500,000  : 52
    200,000  : 52
    50,000   : 52
    5,000    : 61
    0        : 80



We can see that the distribution of 0 values in this data collection is similar to that of the first. Just as in the first, 48 features consist of at least 95% 0 values, so we intersect them with our pruneCandidates list, keeping only the features flagged in both collections.

In [43]:
check_zeros_over_threshold_percentage(dataset_2, .95).shape

(48, 1)
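
As a rough illustration of the 95% check, the sketch below flags the columns whose share of 0 values meets a given fraction. The function name `zero_fraction_over`, the `>=` comparison, and the toy dataframe are assumptions; the notebook's own `check_zeros_over_threshold_percentage` may differ in its details and return type.

```python
import pandas as pd


def zero_fraction_over(df: pd.DataFrame, fraction: float) -> pd.Series:
    '''
        Hypothetical helper: return the zero counts of the columns whose
        share of 0 values is at least the given fraction
    '''
    zero_counts = (df == 0).sum()
    return zero_counts[zero_counts >= fraction * len(df)]


# toy data: only the first column is at least 95% zeros
toy = pd.DataFrame({
    'mostly_zero': [0] * 19 + [1],
    'half_zero': [0, 1] * 10,
    'no_zero': [1] * 20,
})
flagged = zero_fraction_over(toy, 0.95)
```

On the toy data only `mostly_zero` is flagged, because 19 of its 20 entries are 0.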

In [44]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_2, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)
pretty(pruneCandidates)

[   'URG Flag Count',
    'Fwd PSH Flags',
    'Idle Std',
    'Bwd Packet Length Std',
    'Fwd URG Flags',
    'Packet Length Std',
    'PSH Flag Count',
    'Fwd Avg Packets/Bulk',
    'Bwd IAT Std',
    'Bwd IAT Min',
    'Bwd Avg Bulk Rate',
    'FIN Flag Count',
    'CWE Flag Count',
    'Avg Bwd Segment Size',
    'ECE Flag Count',
    'Bwd URG Flags',
    'Bwd IAT Mean',
    'Bwd IAT Max',
    'Total Length of Bwd '
    'Packets',
    'Fwd Avg Bytes/Bulk',
    'Idle Mean',
    'SYN Flag Count',
    'RST Flag Count',
    'Down/Up Ratio',
    'Flow IAT Std',
    'Active Std',
    'ACK Flag Count',
    'Bwd Packets/s',
    'Fwd Avg Bulk Rate',
    'Bwd IAT Total',
    'Bwd Packet Length Max',
    'Idle Max',
    'Packet Length Variance',
    'Bwd PSH Flags',
    'Bwd Packet Length Mean',
    'Fwd Packet Length Std',
    'Bwd Avg Bytes/Bulk',
    'Bwd Header Length',
    'Active Mean',
    'Active Max',
    'Active Min',
    'Bwd Packet Length Min',
    'Bwd Avg Packets/Bulk',
    

In [45]:
data_composition

Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671


In [46]:
data_composition = dataset_2['Data_composition']
data_composition

Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739


## Data Collection #3

In [47]:
dataset_3 = examine_dataset(3)

Dataset 3/18: We now look at ./original/01-12/DrDoS_MSSQL.csv


Loading Dataset: ./original/01-12/DrDoS_MSSQL.csv
	To Dataset Cache: ./cache/01-12/DrDoS_MSSQL.csv.pickle


        File:				./original/01-12/DrDoS_MSSQL.csv  
        Job Number:			3
        Shape:				(4524498, 88)
        Samples:			4524498 
        Features:			88
        Benign Samples:			2006
        Malicious Samples:		4522492
        Benign-to-Malicious Ratio:	0.0004435607625176562
    


In [48]:
check_infs(dataset_3)

Unnamed: 0,0
Dataset,./original/01-12/DrDoS_MSSQL.csv
Value,Inf
Flow Bytes/s,126452.0
Flow Packets/s,126466.0


In [49]:
check_nans(dataset_3)

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_MSSQL.csv
Value,


Again we see that the third collection of data has a distribution of inf and NaN values similar to that of the first two collections: inf values appear only in Flow Bytes/s and Flow Packets/s, and there are no NaN values.
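
To make the inf and NaN checks concrete, here is a minimal sketch of how such counts could be gathered per feature. The helper name `count_special_values` and the toy values are assumptions for illustration; this is not the notebook's `check_infs` or `check_nans`.

```python
import numpy as np
import pandas as pd


def count_special_values(df: pd.DataFrame) -> pd.DataFrame:
    '''
        Hypothetical helper: count inf and NaN entries per numeric column
        and keep only the columns that contain at least one of them
    '''
    numeric = df.select_dtypes(include=[np.number])
    counts = pd.DataFrame({
        'Inf': numeric.isin([np.inf, -np.inf]).sum(),
        'NaN': numeric.isna().sum(),
    })
    return counts[(counts['Inf'] > 0) | (counts['NaN'] > 0)]


# toy data (assumed values) with infs in the two rate features
toy = pd.DataFrame({
    'Flow Bytes/s': [1.0, np.inf, np.inf],
    'Flow Packets/s': [np.inf, 2.0, np.nan],
    'Source Port': [80, 443, 53],
})
special = count_special_values(toy)
```

Columns with neither inf nor NaN entries, such as `Source Port` in the toy data, are left out of the result.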

In [50]:
check_zeros(dataset_3)

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_MSSQL.csv
Value,Zero
Unnamed,31.0
Source Port,121.0
Destination Port,121.0
...,...
Idle Std,4524429.0
Idle Max,4521932.0
Idle Min,4521932.0
SimillarHTTP,4046848.0


In [51]:
print(f'''
Features with a frequency of 0 values greater than
    4,000,000: {check_zeros_over_threshold(dataset_3, 4000000).shape[0]}
    1,000,000: {check_zeros_over_threshold(dataset_3, 1000000).shape[0]}
    500,000  : {check_zeros_over_threshold(dataset_3, 500000).shape[0]}
    200,000  : {check_zeros_over_threshold(dataset_3, 200000).shape[0]}
    50,000   : {check_zeros_over_threshold(dataset_3, 50000).shape[0]}
    5,000    : {check_zeros_over_threshold(dataset_3, 5000).shape[0]}
    0        : {check_zeros_over_threshold(dataset_3, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    4,000,000: 49
    1,000,000: 52
    500,000  : 52
    200,000  : 52
    50,000   : 61
    5,000    : 61
    0        : 80



This shows us that the third collection of data has a distribution of 0 values similar to that of the first two collections.

In [52]:
check_zeros_over_threshold_percentage(dataset_3, .95).shape

(48, 1)

In [53]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_3, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)
pretty(pruneCandidates)

[   'URG Flag Count',
    'Fwd PSH Flags',
    'Idle Std',
    'Bwd Packet Length Std',
    'Fwd URG Flags',
    'Packet Length Std',
    'PSH Flag Count',
    'Fwd Avg Packets/Bulk',
    'Bwd IAT Std',
    'Bwd IAT Min',
    'Bwd Avg Bulk Rate',
    'FIN Flag Count',
    'CWE Flag Count',
    'Avg Bwd Segment Size',
    'ECE Flag Count',
    'Bwd URG Flags',
    'Bwd IAT Mean',
    'Bwd IAT Max',
    'Total Length of Bwd '
    'Packets',
    'Fwd Avg Bytes/Bulk',
    'Idle Mean',
    'SYN Flag Count',
    'RST Flag Count',
    'Down/Up Ratio',
    'Flow IAT Std',
    'Active Std',
    'ACK Flag Count',
    'Bwd Packets/s',
    'Fwd Avg Bulk Rate',
    'Bwd IAT Total',
    'Bwd Packet Length Max',
    'Idle Max',
    'Packet Length Variance',
    'Bwd PSH Flags',
    'Bwd Packet Length Mean',
    'Fwd Packet Length Std',
    'Bwd Avg Bytes/Bulk',
    'Bwd Header Length',
    'Active Mean',
    'Active Max',
    'Active Min',
    'Bwd Packet Length Min',
    'Bwd Avg Packets/Bulk',
    

In [54]:
data_composition = dataset_3['Data_composition']
data_composition

Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444


## Data Collection #4

In [55]:
dataset_4 = examine_dataset(4)

Dataset 4/18: We now look at ./original/01-12/DrDoS_NetBIOS.csv


Loading Dataset: ./original/01-12/DrDoS_NetBIOS.csv
	To Dataset Cache: ./cache/01-12/DrDoS_NetBIOS.csv.pickle


        File:				./original/01-12/DrDoS_NetBIOS.csv  
        Job Number:			4
        Shape:				(4094986, 88)
        Samples:			4094986 
        Features:			88
        Benign Samples:			1707
        Malicious Samples:		4093279
        Benign-to-Malicious Ratio:	0.0004170250794045556
    


In [56]:
data_composition = dataset_4['Data_composition']
data_composition

Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417


In [57]:
check_infs(dataset_4)

Unnamed: 0,0
Dataset,./original/01-12/DrDoS_NetBIOS.csv
Value,Inf
Flow Bytes/s,129845.0
Flow Packets/s,129853.0


In [58]:
check_nans(dataset_4)

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_NetBIOS.csv
Value,


In [59]:
check_zeros(dataset_4)

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_NetBIOS.csv
Value,Zero
Unnamed,12.0
Source Port,54.0
Destination Port,54.0
...,...
Idle Std,4094802.0
Idle Max,4092133.0
Idle Min,4092133.0
SimillarHTTP,3611658.0


In [60]:
print(f'''
Features with a frequency of 0 values greater than
    4,000,000: {check_zeros_over_threshold(dataset_4, 4000000).shape[0]}
    1,000,000: {check_zeros_over_threshold(dataset_4, 1000000).shape[0]}
    500,000  : {check_zeros_over_threshold(dataset_4, 500000).shape[0]}
    200,000  : {check_zeros_over_threshold(dataset_4, 200000).shape[0]}
    50,000   : {check_zeros_over_threshold(dataset_4, 50000).shape[0]}
    5,000    : {check_zeros_over_threshold(dataset_4, 5000).shape[0]}
    0        : {check_zeros_over_threshold(dataset_4, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    4,000,000: 48
    1,000,000: 49
    500,000  : 52
    200,000  : 52
    50,000   : 61
    5,000    : 61
    0        : 80



In [61]:
check_zeros_over_threshold_percentage(dataset_4, .95).shape

(48, 1)

In [62]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_4, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)
pretty(pruneCandidates)

[   'URG Flag Count',
    'Fwd PSH Flags',
    'Idle Std',
    'Bwd Packet Length Std',
    'Fwd URG Flags',
    'Packet Length Std',
    'PSH Flag Count',
    'Fwd Avg Packets/Bulk',
    'Bwd IAT Std',
    'Bwd IAT Min',
    'Bwd Avg Bulk Rate',
    'FIN Flag Count',
    'CWE Flag Count',
    'Avg Bwd Segment Size',
    'ECE Flag Count',
    'Bwd URG Flags',
    'Bwd IAT Mean',
    'Bwd IAT Max',
    'Total Length of Bwd '
    'Packets',
    'Fwd Avg Bytes/Bulk',
    'Idle Mean',
    'SYN Flag Count',
    'RST Flag Count',
    'Down/Up Ratio',
    'Flow IAT Std',
    'Active Std',
    'ACK Flag Count',
    'Bwd Packets/s',
    'Fwd Avg Bulk Rate',
    'Bwd IAT Total',
    'Bwd Packet Length Max',
    'Idle Max',
    'Packet Length Variance',
    'Bwd PSH Flags',
    'Bwd Packet Length Mean',
    'Fwd Packet Length Std',
    'Bwd Avg Bytes/Bulk',
    'Bwd Header Length',
    'Active Mean',
    'Active Max',
    'Active Min',
    'Bwd Packet Length Min',
    'Bwd Avg Packets/Bulk',
    

## Data Collection #5

In [63]:
dataset_5 = examine_dataset(5)
data_composition = dataset_5['Data_composition']
data_composition

Dataset 5/18: We now look at ./original/01-12/DrDoS_NTP.csv


Loading Dataset: ./original/01-12/DrDoS_NTP.csv
	To Dataset Cache: ./cache/01-12/DrDoS_NTP.csv.pickle


        File:				./original/01-12/DrDoS_NTP.csv  
        Job Number:			5
        Shape:				(1217007, 88)
        Samples:			1217007 
        Features:			88
        Benign Samples:			14365
        Malicious Samples:		1202642
        Benign-to-Malicious Ratio:	0.011944535447789117
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945


In [64]:
check_infs(dataset_5)

Unnamed: 0,0
Dataset,./original/01-12/DrDoS_NTP.csv
Value,Inf
Flow Bytes/s,7015.0
Flow Packets/s,7046.0


In [65]:
check_nans(dataset_5)

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_NTP.csv
Value,


In [66]:
check_zeros(dataset_5)

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_NTP.csv
Value,Zero
Unnamed,191.0
Source Port,408.0
Destination Port,408.0
...,...
Idle Std,1215725.0
Idle Max,1215340.0
Idle Min,1215340.0
SimillarHTTP,1015808.0


In [67]:
print(f'''
Features with a frequency of 0 values greater than
    1,000,000: {check_zeros_over_threshold(dataset_5, 1000000).shape[0]}
    750,000  : {check_zeros_over_threshold(dataset_5, 750000).shape[0]}
    500,000  : {check_zeros_over_threshold(dataset_5, 500000).shape[0]}
    200,000  : {check_zeros_over_threshold(dataset_5, 200000).shape[0]}
    50,000   : {check_zeros_over_threshold(dataset_5, 50000).shape[0]}
    5,000    : {check_zeros_over_threshold(dataset_5, 5000).shape[0]}
    0        : {check_zeros_over_threshold(dataset_5, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    1,000,000: 44
    750,000  : 49
    500,000  : 49
    200,000  : 49
    50,000   : 54
    5,000    : 74
    0        : 80



In [68]:
check_zeros_over_threshold_percentage(dataset_5, .95).shape

(43, 1)

In the 5th collection of data, we finally see a change from the pattern of the previous collections: only 43 features are 95% or more zeros. This means that we would have been wrong to simply remove all 48 features before examining the rest of the data.
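
The effect of intersecting candidate lists can be sketched as follows. The function `intersect_candidates` and the short toy lists are assumptions for illustration; the notebook's `intersection_of_prune_candidates` may be implemented differently.

```python
def intersect_candidates(current: list, new: list) -> list:
    '''
        Hypothetical helper: keep only the features present in both lists,
        preserving the order of the current list
    '''
    new_set = set(new)
    return [feature for feature in current if feature in new_set]


# toy lists: two features are mostly zero in the earlier collections only
previous = ['Idle Std', 'Packet Length Std', 'Flow IAT Std', 'Idle Max']
latest = ['Idle Max', 'Idle Std', 'Idle Min']
survivors = intersect_candidates(previous, latest)
```

A feature survives only if every collection examined so far flags it, so the candidate list can shrink but never grow.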

In [69]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_5, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)
pretty(pruneCandidates)

[   'URG Flag Count',
    'Fwd PSH Flags',
    'Idle Std',
    'Bwd Packet Length Std',
    'Fwd URG Flags',
    'PSH Flag Count',
    'Fwd Avg Packets/Bulk',
    'Bwd IAT Std',
    'Bwd IAT Min',
    'Bwd Avg Bulk Rate',
    'FIN Flag Count',
    'CWE Flag Count',
    'Avg Bwd Segment Size',
    'ECE Flag Count',
    'Bwd URG Flags',
    'Bwd IAT Mean',
    'Bwd IAT Max',
    'Total Length of Bwd '
    'Packets',
    'Fwd Avg Bytes/Bulk',
    'Idle Mean',
    'SYN Flag Count',
    'RST Flag Count',
    'Down/Up Ratio',
    'Active Std',
    'ACK Flag Count',
    'Bwd Packets/s',
    'Fwd Avg Bulk Rate',
    'Bwd IAT Total',
    'Bwd Packet Length Max',
    'Idle Max',
    'Bwd PSH Flags',
    'Bwd Packet Length Mean',
    'Bwd Avg Bytes/Bulk',
    'Active Mean',
    'Bwd Header Length',
    'Active Max',
    'Active Min',
    'Bwd Packet Length Min',
    'Bwd Avg Packets/Bulk',
    'Subflow Bwd Bytes',
    'Subflow Bwd Packets',
    'Idle Min',
    'Total Backward Packets']


## Data Collection #6

In [70]:
dataset_6 = examine_dataset(6)
data_composition = dataset_6['Data_composition']
data_composition

Dataset 6/18: We now look at ./original/01-12/DrDoS_SNMP.csv


Loading Dataset: ./original/01-12/DrDoS_SNMP.csv
	To Dataset Cache: ./cache/01-12/DrDoS_SNMP.csv.pickle


        File:				./original/01-12/DrDoS_SNMP.csv  
        Job Number:			6
        Shape:				(5161377, 88)
        Samples:			5161377 
        Features:			88
        Benign Samples:			1507
        Malicious Samples:		5159870
        Benign-to-Malicious Ratio:	0.00029206162170752366
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292


In [71]:
check_infs(dataset_6)

Unnamed: 0,0
Dataset,./original/01-12/DrDoS_SNMP.csv
Value,Inf
Flow Bytes/s,10611.0
Flow Packets/s,10623.0


In [72]:
check_nans(dataset_6)

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_SNMP.csv
Value,


In [73]:
check_zeros(dataset_6)

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_SNMP.csv
Value,Zero
Unnamed,85.0
Source Port,246.0
Destination Port,246.0
...,...
Idle Std,5161248.0
Idle Max,5161183.0
Idle Min,5161183.0
SimillarHTTP,4612513.0


In [74]:
print(f'''
Features with a frequency of 0 values greater than
    5,000,000: {check_zeros_over_threshold(dataset_6, 5000000).shape[0]}
    1,000,000: {check_zeros_over_threshold(dataset_6, 1000000).shape[0]}
    500,000  : {check_zeros_over_threshold(dataset_6, 500000).shape[0]}
    200,000  : {check_zeros_over_threshold(dataset_6, 200000).shape[0]}
    50,000   : {check_zeros_over_threshold(dataset_6, 50000).shape[0]}
    5,000    : {check_zeros_over_threshold(dataset_6, 5000).shape[0]}
    0        : {check_zeros_over_threshold(dataset_6, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    5,000,000: 48
    1,000,000: 52
    500,000  : 52
    200,000  : 52
    50,000   : 52
    5,000    : 61
    0        : 80



In [75]:
check_zeros_over_threshold_percentage(dataset_6, .95).shape

(48, 1)

In [76]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_6, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)
pretty(pruneCandidates)

[   'URG Flag Count',
    'Fwd PSH Flags',
    'Idle Std',
    'Bwd Packet Length Std',
    'Fwd URG Flags',
    'PSH Flag Count',
    'Fwd Avg Packets/Bulk',
    'Bwd IAT Std',
    'Bwd IAT Min',
    'Bwd Avg Bulk Rate',
    'FIN Flag Count',
    'CWE Flag Count',
    'Avg Bwd Segment Size',
    'ECE Flag Count',
    'Bwd URG Flags',
    'Bwd IAT Mean',
    'Bwd IAT Max',
    'Total Length of Bwd '
    'Packets',
    'Fwd Avg Bytes/Bulk',
    'Idle Mean',
    'SYN Flag Count',
    'RST Flag Count',
    'Down/Up Ratio',
    'Active Std',
    'ACK Flag Count',
    'Bwd Packets/s',
    'Fwd Avg Bulk Rate',
    'Bwd IAT Total',
    'Bwd Packet Length Max',
    'Idle Max',
    'Bwd PSH Flags',
    'Bwd Packet Length Mean',
    'Bwd Avg Bytes/Bulk',
    'Active Mean',
    'Bwd Header Length',
    'Active Max',
    'Active Min',
    'Bwd Packet Length Min',
    'Bwd Avg Packets/Bulk',
    'Subflow Bwd Bytes',
    'Subflow Bwd Packets',
    'Idle Min',
    'Total Backward Packets']


## Data Collection #7

In [77]:
dataset_7 = examine_dataset(7)
data_composition = dataset_7['Data_composition']
data_composition

Dataset 7/18: We now look at ./original/01-12/DrDoS_SSDP.csv


Loading Dataset: ./original/01-12/DrDoS_SSDP.csv
	To Dataset Cache: ./cache/01-12/DrDoS_SSDP.csv.pickle


        File:				./original/01-12/DrDoS_SSDP.csv  
        Job Number:			7
        Shape:				(2611374, 88)
        Samples:			2611374 
        Features:			88
        Benign Samples:			763
        Malicious Samples:		2610611
        Benign-to-Malicious Ratio:	0.0002922687447497923
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292


In [78]:
check_infs(dataset_7)


In [79]:
check_nans(dataset_7)

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_SSDP.csv
Value,


In [80]:
check_zeros(dataset_7)

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_SSDP.csv
Value,Zero
Unnamed,21.0
Source Port,79.0
Destination Port,79.0
...,...
Idle Std,2611247.0
Idle Max,2611172.0
Idle Min,2611172.0
SimillarHTTP,2291886.0


In [81]:
print(f'''
Features with a frequency of 0 values greater than
    2,000,000: {check_zeros_over_threshold(dataset_7, 2000000).shape[0]}
    1,000,000: {check_zeros_over_threshold(dataset_7, 1000000).shape[0]}
    500,000  : {check_zeros_over_threshold(dataset_7, 500000).shape[0]}
    200,000  : {check_zeros_over_threshold(dataset_7, 200000).shape[0]}
    50,000   : {check_zeros_over_threshold(dataset_7, 50000).shape[0]}
    5,000    : {check_zeros_over_threshold(dataset_7, 5000).shape[0]}
    0        : {check_zeros_over_threshold(dataset_7, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    2,000,000: 44
    1,000,000: 49
    500,000  : 52
    200,000  : 54
    50,000   : 54
    5,000    : 61
    0        : 80



In [82]:
check_zeros_over_threshold_percentage(dataset_7, .95).shape

(43, 1)

It turns out that the 7th collection, like the 5th, has only 43 features that consist of 95% or more 0 values, instead of 48.

In [83]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_7, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)

## Data Collection #8

In [84]:
dataset_8 = examine_dataset(8)
data_composition = dataset_8['Data_composition']
data_composition

Dataset 8/18: We now look at ./original/01-12/DrDoS_UDP.csv


Loading Dataset: ./original/01-12/DrDoS_UDP.csv
	To Dataset Cache: ./cache/01-12/DrDoS_UDP.csv.pickle


        File:				./original/01-12/DrDoS_UDP.csv  
        Job Number:			8
        Shape:				(3136802, 88)
        Samples:			3136802 
        Features:			88
        Benign Samples:			2157
        Malicious Samples:		3134645
        Benign-to-Malicious Ratio:	0.0006881161981659805
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688


In [85]:
check_infs(dataset_8)

Unnamed: 0,0
Dataset,./original/01-12/DrDoS_UDP.csv
Value,Inf
Flow Bytes/s,40665.0
Flow Packets/s,40673.0


In [86]:
check_nans(dataset_8)

Unnamed: 0,1
Dataset,./original/01-12/DrDoS_UDP.csv
Value,


In [87]:
check_zeros(dataset_8)

Unnamed: 0,2
Dataset,./original/01-12/DrDoS_UDP.csv
Value,Zero
Unnamed,25.0
Source Port,126.0
Destination Port,126.0
...,...
Idle Std,3136569.0
Idle Max,3136358.0
Idle Min,3136358.0
SimillarHTTP,2694434.0


In [88]:
print(f'''
Features with a frequency of 0 values greater than
    2,000,000: {check_zeros_over_threshold(dataset_8, 2000000).shape[0]}
    1,000,000: {check_zeros_over_threshold(dataset_8, 1000000).shape[0]}
    500,000  : {check_zeros_over_threshold(dataset_8, 500000).shape[0]}
    200,000  : {check_zeros_over_threshold(dataset_8, 200000).shape[0]}
    50,000   : {check_zeros_over_threshold(dataset_8, 50000).shape[0]}
    5,000    : {check_zeros_over_threshold(dataset_8, 5000).shape[0]}
    0        : {check_zeros_over_threshold(dataset_8, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    2,000,000: 44
    1,000,000: 49
    500,000  : 52
    200,000  : 54
    50,000   : 54
    5,000    : 61
    0        : 80



In [89]:
check_zeros_over_threshold_percentage(dataset_8, .95).shape

(43, 1)

In [90]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_8, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)

## Data Collection #9

In [91]:
dataset_9 = examine_dataset(9)
data_composition = dataset_9['Data_composition']
data_composition

Dataset 9/18: We now look at ./original/01-12/Syn.csv


Loading Dataset: ./original/01-12/Syn.csv
	To Dataset Cache: ./cache/01-12/Syn.csv.pickle


        File:				./original/01-12/Syn.csv  
        Job Number:			9
        Shape:				(1582681, 88)
        Samples:			1582681 
        Features:			88
        Benign Samples:			392
        Malicious Samples:		1582289
        Benign-to-Malicious Ratio:	0.0002477423530088372
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248


In [92]:
check_infs(dataset_9)

Unnamed: 0,0
Dataset,./original/01-12/Syn.csv
Value,Inf
Flow Bytes/s,40.0
Flow Packets/s,202317.0


In [93]:
check_nans(dataset_9)

Unnamed: 0,1
Dataset,./original/01-12/Syn.csv
Value,


In [94]:
check_zeros(dataset_9)

Unnamed: 0,2
Dataset,./original/01-12/Syn.csv
Value,Zero
Unnamed,2.0
Source Port,9.0
Destination Port,9.0
...,...
Idle Std,1451786.0
Idle Max,1446760.0
Idle Min,1446760.0
SimillarHTTP,1451609.0


In [95]:
print(f'''
Features with a frequency of 0 values greater than
    1,500,000: {check_zeros_over_threshold(dataset_9, 1500000).shape[0]}
    1,000,000: {check_zeros_over_threshold(dataset_9, 1000000).shape[0]}
    500,000  : {check_zeros_over_threshold(dataset_9, 500000).shape[0]}
    200,000  : {check_zeros_over_threshold(dataset_9, 200000).shape[0]}
    50,000   : {check_zeros_over_threshold(dataset_9, 50000).shape[0]}
    5,000    : {check_zeros_over_threshold(dataset_9, 5000).shape[0]}
    0        : {check_zeros_over_threshold(dataset_9, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    1,500,000: 39
    1,000,000: 60
    500,000  : 60
    200,000  : 70
    50,000   : 70
    5,000    : 70
    0        : 80



In [96]:
check_zeros_over_threshold_percentage(dataset_9, .95).shape

(39, 1)

We see another deviation: in this collection, only 39 features consist of 95% or more 0 values.

In [97]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_9, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)

## Data Collection #10

In [98]:
dataset_10 = examine_dataset(10)
data_composition = dataset_10['Data_composition']
data_composition

Dataset 10/18: We now look at ./original/01-12/TFTP.csv


Loading Dataset: ./original/01-12/TFTP.csv
	To Dataset Cache: ./cache/01-12/TFTP.csv.pickle


        File:				./original/01-12/TFTP.csv  
        Job Number:			10
        Shape:				(20107827, 88)
        Samples:			20107827 
        Features:			88
        Benign Samples:			25247
        Malicious Samples:		20082580
        Benign-to-Malicious Ratio:	0.0012571591897057052
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [99]:
check_infs(dataset_10)

Unnamed: 0,0
Dataset,./original/01-12/TFTP.csv
Value,Inf
Flow Bytes/s,556264.0
Flow Packets/s,566761.0


In [100]:
check_nans(dataset_10)

Unnamed: 0,1
Dataset,./original/01-12/TFTP.csv
Value,


In [101]:
check_zeros(dataset_10)

Unnamed: 0,2
Dataset,./original/01-12/TFTP.csv
Value,Zero
Unnamed,199.0
Source Port,865.0
Destination Port,865.0
...,...
Idle Std,20105566.0
Idle Max,20087473.0
Idle Min,20087473.0
SimillarHTTP,18726912.0


In [102]:
print(f'''
Features with a frequency of 0 values greater than
    20,000,000: {check_zeros_over_threshold(dataset_10, 20000000).shape[0]}
    17,500,000: {check_zeros_over_threshold(dataset_10, 17500000).shape[0]}
    15,000,000: {check_zeros_over_threshold(dataset_10, 15000000).shape[0]}
    10,000,000: {check_zeros_over_threshold(dataset_10, 10000000).shape[0]}
    1,000,000 : {check_zeros_over_threshold(dataset_10, 1000000).shape[0]}
    500,000   : {check_zeros_over_threshold(dataset_10, 500000).shape[0]}
    200,000   : {check_zeros_over_threshold(dataset_10, 200000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_10, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_10, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_10, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    20,000,000: 45
    17,500,000: 47
    15,000,000: 47
    10,000,000: 47
    1,000,000 : 54
    500,000   : 61
    200,000   : 61
    50,000    : 73
    5,000     : 75
    0         : 80



In [103]:
check_zeros_over_threshold_percentage(dataset_10, .95).shape

(46, 1)

In [104]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_10, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)

## Data Collection #11

In [105]:
dataset_11 = examine_dataset(11)
data_composition = dataset_11['Data_composition']
data_composition

Dataset 11/18: We now look at ./original/01-12/UDPLag.csv


Loading Dataset: ./original/01-12/UDPLag.csv
	To Dataset Cache: ./cache/01-12/UDPLag.csv.pickle


        File:				./original/01-12/UDPLag.csv  
        Job Number:			11
        Shape:				(370605, 88)
        Samples:			370605 
        Features:			88
        Benign Samples:			3705
        Malicious Samples:		366900
        Benign-to-Malicious Ratio:	0.01009811937857727
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [106]:
check_infs(dataset_11)

Unnamed: 0,0
Dataset,./original/01-12/UDPLag.csv
Value,Inf
Flow Bytes/s,271.0
Flow Packets/s,36403.0


In [107]:
check_nans(dataset_11)

Unnamed: 0,1
Dataset,./original/01-12/UDPLag.csv
Value,


In [108]:
check_zeros(dataset_11)

Unnamed: 0,2
Dataset,./original/01-12/UDPLag.csv
Value,Zero
Unnamed,1.0
Source Port,49.0
Destination Port,49.0
...,...
Idle Std,327740.0
Idle Max,308084.0
Idle Min,308084.0
SimillarHTTP,280493.0


In [109]:
print(f'''
Features with a frequency of 0 values greater than
    350,000   : {check_zeros_over_threshold(dataset_11, 350000).shape[0]}
    300,000   : {check_zeros_over_threshold(dataset_11, 300000).shape[0]}
    200,000   : {check_zeros_over_threshold(dataset_11, 200000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_11, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_11, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_11, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    350,000   : 26
    300,000   : 56
    200,000   : 60
    50,000    : 63
    5,000     : 74
    0         : 80



This data collection is a very small slice of the whole dataset (370,605 samples), and only 26 of its features have a 0 value frequency of 95% or higher. Because it is such a small slice, we won't reduce the prune candidates to 26 unless other data collections reflect the same distribution.

In [110]:
check_zeros_over_threshold_percentage(dataset_11, .95).shape

(26, 1)
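
To relate the percentage check to the absolute thresholds printed earlier, the fraction can be converted into a zero-count cut-off. This is a small worked example; the function name `zero_count_cutoff` is an assumption, and the sample count is taken from the summary for this collection.

```python
def zero_count_cutoff(samples: int, fraction: float) -> float:
    '''
        Hypothetical helper: number of 0 values a feature needs in order
        to reach the given fraction of all samples
    '''
    return samples * fraction


# UDPLag.csv has 370,605 samples
cutoff = zero_count_cutoff(370_605, 0.95)
print(f'{cutoff:,.2f}')
```

The cut-off works out to roughly 352,075 zeros, slightly above the 350,000 row in the threshold listing, which is consistent with both checks reporting 26 features.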

## Data Collection #12

With this collection, we start to examine the data collections generated on 03-11 instead of 01-12.

In [111]:
dataset_12 = examine_dataset(12)
data_composition = dataset_12['Data_composition']
data_composition

Dataset 12/18: We now look at ./original/03-11/LDAP.csv


Loading Dataset: ./original/03-11/LDAP.csv
	To Dataset Cache: ./cache/03-11/LDAP.csv.pickle


        File:				./original/03-11/LDAP.csv  
        Job Number:			12
        Shape:				(2113234, 88)
        Samples:			2113234 
        Features:			88
        Benign Samples:			5124
        Malicious Samples:		2108110
        Benign-to-Malicious Ratio:	0.002430613203295843
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [112]:
check_infs(dataset_12)

Unnamed: 0,0
Dataset,./original/03-11/LDAP.csv
Value,Inf
Flow Bytes/s,54099.0
Flow Packets/s,54112.0


In [113]:
check_nans(dataset_12)

Unnamed: 0,1
Dataset,./original/03-11/LDAP.csv
Value,


In [114]:
check_zeros(dataset_12)

Unnamed: 0,2
Dataset,./original/03-11/LDAP.csv
Value,Zero
Unnamed,57.0
Source Port,198.0
Destination Port,198.0
...,...
Idle Std,2112881.0
Idle Max,2112595.0
Idle Min,2112595.0
SimillarHTTP,1957586.0


In [115]:
print(f'''
Features with a frequency of 0 values greater than
    2,000,000 : {check_zeros_over_threshold(dataset_12, 2000000).shape[0]}
    1,000,000 : {check_zeros_over_threshold(dataset_12, 1000000).shape[0]}
    500,000   : {check_zeros_over_threshold(dataset_12, 500000).shape[0]}
    350,000   : {check_zeros_over_threshold(dataset_12, 350000).shape[0]}
    300,000   : {check_zeros_over_threshold(dataset_12, 300000).shape[0]}
    200,000   : {check_zeros_over_threshold(dataset_12, 200000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_12, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_12, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_12, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    2,000,000 : 48
    1,000,000 : 49
    500,000   : 52
    350,000   : 52
    300,000   : 52
    200,000   : 52
    50,000    : 61
    5,000     : 61
    0         : 80



In [116]:
check_zeros_over_threshold_percentage(dataset_12, .95).shape

(48, 1)

In [117]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_12, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)
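
The two helpers used above are defined earlier in the notebook. The sketch below only illustrates the general idea of keeping a running intersection, so that a feature stays a prune candidate only if every examined collection nominates it. The function `intersect_candidates` and the candidate lists are hypothetical.

```python
def intersect_candidates(current: list, new: list) -> list:
    '''Keep only the features that appear in both lists, preserving the order of current.'''
    new_set = set(new)
    return [feature for feature in current if feature in new_set]

# hypothetical candidate lists from three data collections
collection_a = ['Bwd URG Flags', 'Fwd URG Flags', 'Idle Std', 'Idle Max']
collection_b = ['Fwd URG Flags', 'Idle Std', 'Bwd URG Flags']
collection_c = ['Bwd URG Flags', 'Fwd URG Flags', 'Active Std']

running = collection_a
for candidates in (collection_b, collection_c):
    running = intersect_candidates(running, candidates)

print(running)
```

Because the intersection can only shrink, a single collection with few mostly-zero features would remove many candidates at once.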

## Data Collection #13

In [118]:
dataset_13 = examine_dataset(13)
data_composition = dataset_13['Data_composition']
data_composition

Dataset 13/18: We now look at ./original/03-11/MSSQL.csv


Loading Dataset: ./original/03-11/MSSQL.csv
	To Dataset Cache: ./cache/03-11/MSSQL.csv.pickle


        File:				./original/03-11/MSSQL.csv  
        Job Number:			13
        Shape:				(5775786, 88)
        Samples:			5775786 
        Features:			88
        Benign Samples:			2794
        Malicious Samples:		5772992
        Benign-to-Malicious Ratio:	0.0004839778056162212
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [119]:
check_infs(dataset_13)

Unnamed: 0,0
Dataset,./original/03-11/MSSQL.csv
Value,Inf
Flow Bytes/s,202210.0
Flow Packets/s,202217.0


In [120]:
check_nans(dataset_13)

Unnamed: 0,1
Dataset,./original/03-11/MSSQL.csv
Value,


In [121]:
check_zeros(dataset_13)

Unnamed: 0,2
Dataset,./original/03-11/MSSQL.csv
Value,Zero
Unnamed,38.0
Source Port,137.0
Destination Port,137.0
...,...
Idle Std,5775727.0
Idle Max,5773833.0
Idle Min,5773833.0
SimillarHTTP,5620138.0


In [122]:
print(f'''
Features with a frequency of 0 values greater than
    5,000,000 : {check_zeros_over_threshold(dataset_13, 5000000).shape[0]}
    1,000,000 : {check_zeros_over_threshold(dataset_13, 1000000).shape[0]}
    500,000   : {check_zeros_over_threshold(dataset_13, 500000).shape[0]}
    300,000   : {check_zeros_over_threshold(dataset_13, 300000).shape[0]}
    200,000   : {check_zeros_over_threshold(dataset_13, 200000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_13, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_13, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_13, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    5,000,000 : 49
    1,000,000 : 52
    500,000   : 52
    300,000   : 52
    200,000   : 61
    50,000    : 61
    5,000     : 61
    0         : 80



In [123]:
check_zeros_over_threshold_percentage(dataset_13, .95).shape

(49, 1)

In [124]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_13, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)

## Data Collection #14

In [125]:
dataset_14 = examine_dataset(14)
data_composition = dataset_14['Data_composition']
data_composition

Dataset 14/18: We now look at ./original/03-11/NetBIOS.csv


Loading Dataset: ./original/03-11/NetBIOS.csv
	To Dataset Cache: ./cache/03-11/NetBIOS.csv.pickle


        File:				./original/03-11/NetBIOS.csv  
        Job Number:			14
        Shape:				(3455899, 88)
        Samples:			3455899 
        Features:			88
        Benign Samples:			1321
        Malicious Samples:		3454578
        Benign-to-Malicious Ratio:	0.00038239113431510303
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [126]:
check_infs(dataset_14)

Unnamed: 0,0
Dataset,./original/03-11/NetBIOS.csv
Value,Inf
Flow Bytes/s,130554.0
Flow Packets/s,130560.0


In [127]:
check_nans(dataset_14)

Unnamed: 0,1
Dataset,./original/03-11/NetBIOS.csv
Value,


In [128]:
check_zeros(dataset_14)

Unnamed: 0,2
Dataset,./original/03-11/NetBIOS.csv
Value,Zero
Unnamed,10.0
Source Port,38.0
Destination Port,38.0
...,...
Idle Std,3455762.0
Idle Max,3451770.0
Idle Min,3451770.0
SimillarHTTP,3324827.0


In [129]:
print(f'''
Features with a frequency of 0 values greater than
    3,000,000 : {check_zeros_over_threshold(dataset_14, 3000000).shape[0]}
    1,000,000 : {check_zeros_over_threshold(dataset_14, 1000000).shape[0]}
    500,000   : {check_zeros_over_threshold(dataset_14, 500000).shape[0]}
    300,000   : {check_zeros_over_threshold(dataset_14, 300000).shape[0]}
    200,000   : {check_zeros_over_threshold(dataset_14, 200000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_14, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_14, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_14, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    3,000,000 : 49
    1,000,000 : 49
    500,000   : 49
    300,000   : 49
    200,000   : 49
    50,000    : 61
    5,000     : 61
    0         : 80



In [130]:
check_zeros_over_threshold_percentage(dataset_14, .95).shape

(49, 1)

In [131]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_14, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)

## Data Collection #15

In [132]:
dataset_15 = examine_dataset(15)
data_composition = dataset_15['Data_composition']
data_composition

Dataset 15/18: We now look at ./original/03-11/Portmap.csv


Loading Dataset: ./original/03-11/Portmap.csv
	To Dataset Cache: ./cache/03-11/Portmap.csv.pickle


        File:				./original/03-11/Portmap.csv  
        Job Number:			15
        Shape:				(191694, 88)
        Samples:			191694 
        Features:			88
        Benign Samples:			4734
        Malicious Samples:		186960
        Benign-to-Malicious Ratio:	0.025320924261874198
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [133]:
check_infs(dataset_15)

Unnamed: 0,0
Dataset,./original/03-11/Portmap.csv
Value,Inf
Flow Bytes/s,9799.0
Flow Packets/s,9800.0


In [134]:
check_nans(dataset_15)

Unnamed: 0,1
Dataset,./original/03-11/Portmap.csv
Value,


In [135]:
check_zeros(dataset_15)

Unnamed: 0,2
Dataset,./original/03-11/Portmap.csv
Value,Zero
Unnamed,1.0
Source Port,84.0
Destination Port,84.0
...,...
Idle Std,191124.0
Idle Max,190955.0
Idle Min,190955.0
SimillarHTTP,183502.0


In [136]:
print(f'''
Features with a frequency of 0 values greater than
    175,000   : {check_zeros_over_threshold(dataset_15, 175000).shape[0]}
    150,000   : {check_zeros_over_threshold(dataset_15, 150000).shape[0]}
    100,000   : {check_zeros_over_threshold(dataset_15, 100000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_15, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_15, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_15, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    175,000   : 49
    150,000   : 49
    100,000   : 49
    50,000    : 49
    5,000     : 58
    0         : 80



In [137]:
check_zeros_over_threshold_percentage(dataset_15, .95).shape

(49, 1)

In [138]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_15, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)

## Data Collection #16

In [139]:
dataset_16 = examine_dataset(16)
data_composition = dataset_16['Data_composition']
data_composition

Dataset 16/18: We now look at ./original/03-11/Syn.csv


Loading Dataset: ./original/03-11/Syn.csv
	To Dataset Cache: ./cache/03-11/Syn.csv.pickle


        File:				./original/03-11/Syn.csv  
        Job Number:			16
        Shape:				(4320541, 88)
        Samples:			4320541 
        Features:			88
        Benign Samples:			35790
        Malicious Samples:		4284751
        Benign-to-Malicious Ratio:	0.008352877448421156
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [140]:
check_infs(dataset_16)

Unnamed: 0,0
Dataset,./original/03-11/Syn.csv
Value,Inf
Flow Bytes/s,282982.0
Flow Packets/s,283076.0


In [141]:
check_nans(dataset_16)

Unnamed: 0,1
Dataset,./original/03-11/Syn.csv
Value,


In [142]:
check_zeros(dataset_16)

Unnamed: 0,2
Dataset,./original/03-11/Syn.csv
Value,Zero
Unnamed,9.0
Source Port,816.0
Destination Port,816.0
...,...
Idle Std,3879812.0
Idle Max,3846648.0
Idle Min,3846648.0
SimillarHTTP,4153344.0


In [143]:
print(f'''
Features with a frequency of 0 values greater than
    4,000,000 : {check_zeros_over_threshold(dataset_16, 4000000).shape[0]}
    1,000,000 : {check_zeros_over_threshold(dataset_16, 1000000).shape[0]}
    500,000   : {check_zeros_over_threshold(dataset_16, 500000).shape[0]}
    150,000   : {check_zeros_over_threshold(dataset_16, 150000).shape[0]}
    100,000   : {check_zeros_over_threshold(dataset_16, 100000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_16, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_16, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_16, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    4,000,000 : 23
    1,000,000 : 49
    500,000   : 51
    150,000   : 58
    100,000   : 58
    50,000    : 58
    5,000     : 72
    0         : 80



In [144]:
check_zeros_over_threshold_percentage(dataset_16, .95).shape

(22, 1)

## Data Collection #17

In [145]:
dataset_17 = examine_dataset(17)
data_composition = dataset_17['Data_composition']
data_composition

Dataset 17/18: We now look at ./original/03-11/UDP.csv


Loading Dataset: ./original/03-11/UDP.csv
	To Dataset Cache: ./cache/03-11/UDP.csv.pickle


        File:				./original/03-11/UDP.csv  
        Job Number:			17
        Shape:				(3782206, 88)
        Samples:			3782206 
        Features:			88
        Benign Samples:			3134
        Malicious Samples:		3779072
        Benign-to-Malicious Ratio:	0.0008293041254572552
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [146]:
check_infs(dataset_17)

Unnamed: 0,0
Dataset,./original/03-11/UDP.csv
Value,Inf
Flow Bytes/s,77043.0
Flow Packets/s,77047.0


In [147]:
check_nans(dataset_17)

Unnamed: 0,1
Dataset,./original/03-11/UDP.csv
Value,


In [148]:
check_zeros(dataset_17)

Unnamed: 0,2
Dataset,./original/03-11/UDP.csv
Value,Zero
Unnamed,30.0
Source Port,121.0
Destination Port,121.0
...,...
Idle Std,3782011.0
Idle Max,3781727.0
Idle Min,3781727.0
SimillarHTTP,3642942.0


In [149]:
print(f'''
Features with a frequency of 0 values greater than
    3,500,000 : {check_zeros_over_threshold(dataset_17, 3500000).shape[0]}
    3,000,000 : {check_zeros_over_threshold(dataset_17, 3000000).shape[0]}
    1,000,000 : {check_zeros_over_threshold(dataset_17, 1000000).shape[0]}
    500,000   : {check_zeros_over_threshold(dataset_17, 500000).shape[0]}
    150,000   : {check_zeros_over_threshold(dataset_17, 150000).shape[0]}
    100,000   : {check_zeros_over_threshold(dataset_17, 100000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_17, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_17, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_17, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    3,500,000 : 44
    3,000,000 : 44
    1,000,000 : 49
    500,000   : 49
    150,000   : 54
    100,000   : 54
    50,000    : 61
    5,000     : 61
    0         : 80



In [150]:
check_zeros_over_threshold_percentage(dataset_17, .95).shape

(44, 1)

In [151]:
newPruneCandidates: list = create_new_prune_candidates(check_zeros_over_threshold_percentage(dataset_17, .95))
pruneCandidates   : list = intersection_of_prune_candidates(pruneCandidates, newPruneCandidates)

## Data Collection #18

Finally, we reach the last data collection.

In [152]:
dataset_18 = examine_dataset(18)
data_composition = dataset_18['Data_composition']
data_composition

Dataset 18/18: We now look at ./original/03-11/UDPLag.csv


Loading Dataset: ./original/03-11/UDPLag.csv
	To Dataset Cache: ./cache/03-11/UDPLag.csv.pickle


        File:				./original/03-11/UDPLag.csv  
        Job Number:			18
        Shape:				(725165, 88)
        Samples:			725165 
        Features:			88
        Benign Samples:			4068
        Malicious Samples:		721097
        Benign-to-Malicious Ratio:	0.005641404693127277
    


Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
0,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
0,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
0,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
0,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
0,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
0,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
0,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
0,01-12/Syn.csv,392,1582289,1582681,0.000248
0,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [153]:
check_infs(dataset_18)

Unnamed: 0,0
Dataset,./original/03-11/UDPLag.csv
Value,Inf
Flow Bytes/s,50699.0
Flow Packets/s,50702.0


In [154]:
check_nans(dataset_18)

Unnamed: 0,1
Dataset,./original/03-11/UDPLag.csv
Value,


In [155]:
check_zeros(dataset_18)

Unnamed: 0,2
Dataset,./original/03-11/UDPLag.csv
Value,Zero
Unnamed,2.0
Source Port,58.0
Destination Port,58.0
...,...
Idle Std,660222.0
Idle Max,657489.0
Idle Min,657489.0
SimillarHTTP,692397.0


In [156]:
print(f'''
Features with a frequency of 0 values greater than
    600,000   : {check_zeros_over_threshold(dataset_18, 600000).shape[0]}
    500,000   : {check_zeros_over_threshold(dataset_18, 500000).shape[0]}
    150,000   : {check_zeros_over_threshold(dataset_18, 150000).shape[0]}
    100,000   : {check_zeros_over_threshold(dataset_18, 100000).shape[0]}
    50,000    : {check_zeros_over_threshold(dataset_18, 50000).shape[0]}
    5,000     : {check_zeros_over_threshold(dataset_18, 5000).shape[0]}
    0         : {check_zeros_over_threshold(dataset_18, 0).shape[0]}
''')


Features with a frequency of 0 values greater than
    600,000   : 31
    500,000   : 34
    150,000   : 49
    100,000   : 52
    50,000    : 59
    5,000     : 62
    0         : 80



In [157]:
check_zeros_over_threshold_percentage(dataset_18, .95).shape

(19, 1)

## Breakdown

In [158]:
datasets: list = [
    dataset_1,
    dataset_2,
    dataset_3,
    dataset_4,
    dataset_5,
    dataset_6,
    dataset_7,
    dataset_8,
    dataset_9,
    dataset_10,
    dataset_11,
    dataset_12,
    dataset_13,
    dataset_14,
    dataset_15,
    dataset_16,
    dataset_17,
    dataset_18
]

features = list(dataset_1['Feature_stats'].columns)[2:]

sumStats = pd.DataFrame(
    np.zeros((3, len(features))),
    columns = features,
    index = dataset_1['Feature_stats'].index,
)



In [159]:
for collection in datasets:
    sumStats += collection['Feature_stats'][features]

In [160]:
sumStats

Unnamed: 0,Unnamed,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,966.0,0.0,0.0,3998.0,0.0,3998.0,3998.0,0.0,2170750.0,0.0,...,69863731.0,69661383.0,69661383.0,69650971.0,69742330.0,69650971.0,69650971.0,60177143.0,116846.0,0.0


In [162]:
dataset_1.keys()

dict_keys(['File', 'Dataset', 'Feature_stats', 'Data_composition'])

In [163]:
dataset_1['Feature_stats']

Unnamed: 0,Dataset,Value,Unnamed,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,./original/01-12/DrDoS_DNS.csv,Inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,./original/01-12/DrDoS_DNS.csv,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,./original/01-12/DrDoS_DNS.csv,Zero,188.0,0.0,0.0,382.0,0.0,382.0,382.0,0.0,...,5074144.0,5073990.0,5073990.0,5073989.0,5074137.0,5073989.0,5073989.0,0.0,4735.0,0.0


In [164]:
f = dataset_1['Feature_stats']
g = dataset_2['Feature_stats']

In [165]:
f[features] + g[features]

Unnamed: 0,Unnamed,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,252.0,0.0,0.0,589.0,0.0,589.0,589.0,0.0,201044.0,0.0,...,7255686.0,7255521.0,7255521.0,7255519.0,7255679.0,7255519.0,7255519.0,1870246.0,6847.0,0.0


In [170]:
benign_samples = data_composition['Benign'].sum()

In [171]:
ddos_samples = data_composition['Malicious'].sum()

In [172]:
total_samples = data_composition['Total'].sum()

In [175]:
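# summary row for the whole dataset
# note: its Ratio entry is the percentage of benign samples in the total,
# not the benign-to-malicious ratio used in the per-file rows
summary_row = pd.DataFrame([
        ['CIC_DDoS2019', benign_samples, ddos_samples, total_samples, 100*benign_samples/total_samples]
    ], columns = composition_columns)

pd.concat([data_composition, summary_row], ignore_index=True)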

Unnamed: 0,File,Benign,Malicious,Total,Ratio
0,01-12/DrDoS_DNS.csv,3402,5071011,5074413,0.000671
1,01-12/DrDoS_LDAP.csv,1612,2179930,2181542,0.000739
2,01-12/DrDoS_MSSQL.csv,2006,4522492,4524498,0.000444
3,01-12/DrDoS_NetBIOS.csv,1707,4093279,4094986,0.000417
4,01-12/DrDoS_NTP.csv,14365,1202642,1217007,0.011945
5,01-12/DrDoS_SNMP.csv,1507,5159870,5161377,0.000292
6,01-12/DrDoS_SSDP.csv,763,2610611,2611374,0.000292
7,01-12/DrDoS_UDP.csv,2157,3134645,3136802,0.000688
8,01-12/Syn.csv,392,1582289,1582681,0.000248
9,01-12/TFTP.csv,25247,20082580,20107827,0.001257


In [167]:
# intentionally stop here so that the cells below are not executed automatically
assert False

In [None]:
print('Hello')

# Dataset Generation and Balancing

Here we take the data from all of the data collections and combine it into two collections of datasets: a lightly cleaned collection that establishes the baseline for our analysis, and a collection containing only time-based features. These datasets are then balanced for binary classification, either benign traffic against DDoS attacks or one specific type of DDoS attack against a basket of other DDoS attacks.
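
As a minimal sketch of what a 50/50 balance could look like, the example below downsamples the larger class to the size of the smaller one. Downsampling is an assumption made for illustration; the function `balance_binary` and the toy dataframes are hypothetical.

```python
import pandas as pd

seed: int = 14  # same seed value that is set at the top of the notebook

def balance_binary(benign: pd.DataFrame, malicious: pd.DataFrame, seed: int) -> pd.DataFrame:
    '''Downsample the larger class so that both classes have the same number of rows.'''
    n = min(len(benign), len(malicious))
    balanced = pd.concat([
        benign.sample(n=n, random_state=seed),
        malicious.sample(n=n, random_state=seed),
    ], ignore_index=True)
    # shuffle so that the two classes are interleaved
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# toy data: 5 benign samples against 100 malicious samples
benign    = pd.DataFrame({'Flow Duration': range(5),   'Label': 'BENIGN'})
malicious = pd.DataFrame({'Flow Duration': range(100), 'Label': 'DNS'})

balanced = balance_binary(benign, malicious, seed)
print(balanced['Label'].value_counts())
```

With benign-to-malicious ratios far below 1% in most collections, the benign pool is the limiting factor for the size of any balanced dataset built this way.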

We will be performing many-to-one multi-class classification for an efficient implementation in the field. In this implementation, traffic is first screened to determine whether it is benign or a DDoS attack. If it is benign, the traffic is classified as such. If it matches the profile of a DDoS attack, it can either be saved for further analysis when more resources are available, or be further classified by type of DDoS attack. To classify by type, the traffic is screened against one specific type of DDoS attack at a time, with the remaining known DDoS attacks pooled together as the alternative, sequentially eliminating the attack types that do not match. This continues until either a match is found or there are no more DDoS attacks to screen.
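
The screening order described above can be sketched as a simple cascade. In this sketch, toy threshold rules stand in for trained binary classifiers; the function `classify_cascade`, the rules, and the `UNKNOWN_DDOS` fallback label are hypothetical.

```python
from typing import Callable, Dict

Sample = Dict[str, float]
Screen = Callable[[Sample], bool]

def classify_cascade(sample: Sample, is_ddos: Screen, attack_screens: Dict[str, Screen]) -> str:
    '''
        Stage 1: benign vs. DDoS.
        Stage 2: screen against one attack type at a time until one matches.
    '''
    if not is_ddos(sample):
        return 'BENIGN'

    for attack_type, matches in attack_screens.items():
        if matches(sample):
            return attack_type

    # no attack type matched, so the sample is kept for later analysis
    return 'UNKNOWN_DDOS'

# toy rules standing in for trained one-vs-rest classifiers
is_ddos = lambda s: s['Flow Packets/s'] > 1000
attack_screens = {
    'Syn': lambda s: s['Protocol'] == 6,
    'UDP': lambda s: s['Protocol'] == 17,
}

results = [
    classify_cascade({'Flow Packets/s': 10,   'Protocol': 6},  is_ddos, attack_screens),
    classify_cascade({'Flow Packets/s': 5000, 'Protocol': 17}, is_ddos, attack_screens),
    classify_cascade({'Flow Packets/s': 5000, 'Protocol': 1},  is_ddos, attack_screens),
]
print(results)
```

The second stage only runs for traffic flagged in the first stage, which is what keeps the common benign case cheap.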

### Data Cleaning

Here we create a small list of features that will not contribute to classification in either our baseline or our model.

In [None]:
# prune is a list of all features we know we don't want to use
# Unnamed is eliminated because it is unlabeled and we cannot verify which qualities of the data it describes
# Fwd Header Length.1 is eliminated because it is a duplicate
prune: list = ['Fwd Header Length.1', 'Unnamed', 'Source Port', 'Destination Port'] 

# if the feature is string valued, we add it to our pruning list because it cannot be used for classification
# note: only the first row is inspected, and benign_df must already be populated (it is built further down)
first_row = benign_df.iloc[0]
for column in benign_df.columns:
    if isinstance(first_row[column], str) and column != 'Label':
        prune.append(column)


In [None]:
prune

Maranhao et al. found in their study 'Tensor based framework for Distributed Denial of Service attack detection' that nine features contain only 0 values in every data collection of the dataset. Since a column of zeros will not contribute to the model's performance, we will remove those columns.

In [None]:
# toPrune is a list of features with empty columns of 0s
toPrune: list = [
    'Fwd URG Flags',
    'Bwd URG Flags',
    'Fwd PSH Flags',
    'Fwd Avg Bytes/Bulk',
    'Fwd Avg Packets/Bulk',
    'Fwd Avg Bulk Rate',
    'Bwd Avg Bytes/Bulk',
    'Bwd Avg Packets/Bulk',
    'Bwd Avg Bulk Rate'
]

for i in toPrune:
    if i not in prune:
        prune.append(i)

print(f'We will be pruning {len(prune)} features')
for i, x in enumerate(prune):
    print(f'\t{i+1}:\t{x}')
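
The all-zero claim can be checked directly on any loaded data collection. The sketch below shows one way to do this on a toy dataframe; the function `all_zero_columns` and the toy data are hypothetical.

```python
import pandas as pd

def all_zero_columns(df: pd.DataFrame) -> list:
    '''Names of the numeric columns in which every value is exactly 0.'''
    numeric = df.select_dtypes(include='number')
    return [col for col in numeric.columns if (numeric[col] == 0).all()]

toy = pd.DataFrame({
    'Fwd URG Flags': [0, 0, 0],
    'Flow Duration': [12, 0, 7],
    'Label':         ['BENIGN', 'DNS', 'DNS'],
})

empty_features = all_zero_columns(toy)
print(empty_features)
```

Running such a check on each collection and intersecting the results would confirm whether the nine features listed above are empty everywhere.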

In [None]:

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    '''
        Function will take a dataframe and remove the features listed in prune.
        Rows with NaN or infinite values, which are known to occur in
        Flow Bytes/s and Flow Packets/s, are removed as well, and the
        contents of the Label column are standardized.
        The cleaned dataframe is returned; the dataframe passed in is not modified.
    '''

    # remove the features in the prune list
    df = df.drop(columns=[col for col in prune if col in df.columns])

    rows_before = df.shape[0]

    # Infinite or NaN values may also be stored as strings, so convert every form to NaN first
    # (an equality test against np.nan never matches, so NaN has to be handled by dropna)
    df = df.replace(['Infinity', 'inf', 'NaN', 'nan'], np.nan)
    df = df.replace([np.inf, -np.inf], np.nan)

    # drop missing values/NaN etc.
    df = df.dropna()
    print(f'deleted {rows_before - df.shape[0]} rows with NaN or Infinity values')

    # Standardize the contents of the Label column
    df = df.replace({'Label': {
        'DrDoS_DNS'    : 'DNS',
        'DrDoS_LDAP'   : 'LDAP',
        'DrDoS_MSSQL'  : 'MSSQL',
        'DrDoS_NetBIOS': 'NetBIOS',
        'DrDoS_NTP'    : 'NTP',
        'DrDoS_SNMP'   : 'SNMP',
        'DrDoS_SSDP'   : 'SSDP',
        'DrDoS_UDP'    : 'UDP',
    }})

    return df

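
As a small illustration of the Inf and NaN handling, the sketch below shows that an equality test never finds NaN, while converting infinite values to NaN and calling `dropna` removes both kinds of rows. The toy dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Flow Bytes/s':   [1.5, np.inf, np.nan, 3.0],
    'Flow Packets/s': [2.0, 4.0, 6.0, -np.inf],
})

# NaN is not equal to itself, so an equality test matches no rows
nan_matches = int((toy['Flow Bytes/s'] == np.nan).sum())

# isna does find the NaN row
nan_found = int(toy['Flow Bytes/s'].isna().sum())

# treat +Inf and -Inf as missing, then drop every row with a missing value
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(nan_matches, nan_found, cleaned.shape)
```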


In [None]:
dataset_1.keys()

In [None]:
dataset_1['File']

In [None]:
dataset_1['Dataset']

In [None]:
test_df = clean_data(dataset_1['Dataset'])

In [None]:
test_df.shape

In [None]:
dataset_1['Dataset'].shape

In [None]:
datasets: list = [
    dataset_1['Dataset'],
    dataset_2['Dataset'],
    dataset_3['Dataset'],
    dataset_4['Dataset'],
    dataset_5['Dataset'],
    dataset_6['Dataset'],
    dataset_7['Dataset'],
    dataset_8['Dataset'],
    dataset_9['Dataset'],
    dataset_10['Dataset'],
    dataset_11['Dataset'],
    dataset_12['Dataset'],
    dataset_13['Dataset'],
    dataset_14['Dataset'],
    dataset_15['Dataset'],
    dataset_16['Dataset'],
    dataset_17['Dataset'],
    dataset_18['Dataset']
]

clean_datasets: list = list(map(clean_data, datasets))

In [None]:
len(clean_datasets)

In [None]:
clean_datasets[1].head()

In [None]:
data_location: list = [ 
    'DNS' , 'LDAP'  , 'MSSQL', 'NetBIOS', 'NTP'    , 'SNMP'   , 'SSDP', 'UDP', 'Syn'   , 
    'TFTP', 'UDPLag', 'LDAP' , 'MSSQL'  , 'NetBIOS', 'Portmap', 'Syn' , 'UDP', 'UDPLag',
]

attack_type_datasets: dict = {}

benign_df = None

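def process_datasets_into_attack_type(dataset: int): 
    '''
        Splits clean_datasets[dataset] into benign and malicious samples.
        Benign samples are pooled in benign_df, malicious samples are pooled
        by attack type in attack_type_datasets.
    '''
    global benign_df

    print(dataset)
    attack_type = data_location[dataset]

    # a data collection is marked as None once processed so it is never pooled twice
    if attack_type is None:
        return

    df = clean_datasets[dataset]

    benign    = df[df['Label'] == 'BENIGN']
    malicious = df[df['Label'] == attack_type]

    # the index label 0 may have been filtered out, so use positional access
    if not malicious.empty:
        print(f"{attack_type}: {malicious['Label'].iloc[0]}")

    # save benign samples in benign_df
    if benign_df is None:
        benign_df = benign
    else:
        benign_df = pd.concat([benign_df, benign], ignore_index=True)

    # save DDoS attacks in our prepared dictionary
    if attack_type in attack_type_datasets:
        attack_type_datasets[attack_type] = pd.concat([attack_type_datasets[attack_type], malicious], ignore_index=True)
    else:
        attack_type_datasets[attack_type] = malicious

    data_location[dataset] = None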
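
For comparison, the same kind of pooling can be sketched on toy data with `groupby`. This is only an illustration: unlike the function above, it keeps every label that occurs in a frame rather than only `BENIGN` and the expected attack type, and it concatenates all frames first, which may not be practical at the size of the real collections. The toy frames are hypothetical.

```python
import pandas as pd

frames = [
    pd.DataFrame({'Flow Duration': [1, 2, 3], 'Label': ['BENIGN', 'DNS', 'DNS']}),
    pd.DataFrame({'Flow Duration': [4, 5],    'Label': ['LDAP', 'BENIGN']}),
]

combined = pd.concat(frames, ignore_index=True)

# one dataframe per label value, e.g. pooled['BENIGN'] or pooled['DNS']
pooled = {label: group.reset_index(drop=True) for label, group in combined.groupby('Label')}

counts = {label: len(group) for label, group in pooled.items()}
print(counts)
```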

In [None]:
for i in range(18):
    process_datasets_into_attack_type(i)

In [None]:
%whos

In [None]:
len(clean_datasets)

Since one of our research directions is investigating whether time-based features can be used to detect and classify DDoS traffic, as they have been used to detect and classify Tor traffic, let's examine the properties of the time-based features in the CIC-DDoS2019 dataset.

![Feature descriptions used by Lashkari et al, 2017 in their conference paper -- Characterization of Tor Traffic using Time based Features](./assets/CIC_feature_descriptions.png "Feature descriptions used by Lashkari et al, 2017 in their conference paper -- Characterization of Tor Traffic using Time based Features")

In [None]:
timeFeatures = [
    'Fwd IAT Mean', 
    'Fwd IAT Std', 
    'Fwd IAT Max', 
    'Fwd IAT Min', 
    'Bwd IAT Mean', 
    'Bwd IAT Std', 
    'Bwd IAT Max', 
    'Bwd IAT Min', 
    'Flow IAT Mean', 
    'Flow IAT Std', 
    'Flow IAT Max', 
    'Flow IAT Min', 
    'Active Mean', 
    'Active Std', 
    'Active Max', 
    'Active Min', 
    'Idle Mean', 
    'Idle Std', 
    'Idle Max', 
    'Idle Min', 
    'Flow Bytes/s', 
    'Flow Packets/s', 
    'Flow Duration'
]

otherTimeFeatures = [
    'Fwd IAT Total',
    'Bwd IAT Total',
]
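
Many of the features listed above are statistics over inter-arrival times (IAT), the gaps between consecutive packets of a flow. The sketch below shows how such statistics could be derived from packet timestamps. How the dataset's feature extractor computes them is not shown in this notebook, so the use of the population standard deviation, the function `iat_stats`, and the timestamps are assumptions.

```python
from statistics import mean, pstdev

def iat_stats(timestamps: list) -> dict:
    '''Inter-arrival time (IAT) statistics for a sorted list of packet timestamps.'''
    iats = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return {
        'IAT Mean': mean(iats),
        'IAT Std':  pstdev(iats),
        'IAT Max':  max(iats),
        'IAT Min':  min(iats),
    }

# hypothetical packet arrival times in microseconds
stats = iat_stats([0, 100, 300, 600])
print(stats)
```

A flow with a single packet has no inter-arrival times, so this sketch would need a guard for that case.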

In [None]:
feature_time_stats_head = feature_stats[['Dataset', 'Value']].T
feature_time_stats = feature_stats[timeFeatures].T
feature_extra_time_stats = feature_stats[otherTimeFeatures].T

In [None]:
feature_time_stats[feature_time_stats[2] < 200000]

In [None]:
feature_extra_time_stats[feature_extra_time_stats[2] < 200000]

In [None]:
remaining = feature_time_stats[feature_time_stats[2] < 200000]
remaining

In [None]:
remaining = remaining[remaining[0] == 0]

In [None]:
remaining

In [None]:
clip

In [None]:
feature_stats.shape

In [None]:
remaining = [n for n in df.columns if n not in prune]

In [None]:
remaining

In [None]:
len(remaining)

In [None]:
remdf = df[remaining]

In [None]:
remdf[remdf['Init Win bytes backward'] != -1]

In [None]:
remdf

In [None]:
remdf[remdf['Inbound'] != 1]