# Introduction

### Step 1: Downloading data from Kaggle
The first step is to get the data from Kaggle. Kaggle provides a large variety of datasets that can be used for different use cases. The dataset can be downloaded manually from the Kaggle website ([UNSW-NB15](https://www.kaggle.com/datasets/mrwellsdavid/unsw-nb15/download?datasetVersionNumber=1)) or with the Kaggle API.

To use the Kaggle API, follow these instructions:
1. Create a Kaggle API key from your Kaggle account: **Your Account** -> **Create New API Token**.
2. Save the generated *kaggle.json* file in the directory where you want to store your datasets.
3. Set the environment variable that tells the kaggle tool where to find *kaggle.json*.
```bash
export KAGGLE_CONFIG_DIR='/preferred/path'
```
4. Open a terminal in the directory where *kaggle.json* was saved and download the dataset.
```bash
kaggle datasets download -d mrwellsdavid/unsw-nb15
```
5. Unzip the dataset.
```bash
unzip unsw-nb15.zip
```

In [None]:
%cd /content/drive/MyDrive/Colab\ Notebooks/kaggle-datasets/

/content/drive/MyDrive/Colab Notebooks/kaggle-datasets


In [None]:
import os
os.environ["KAGGLE_CONFIG_DIR"] = "/content/drive/MyDrive/Colab Notebooks/kaggle-datasets/"

In [None]:
!kaggle datasets download -d mrwellsdavid/unsw-nb15

Downloading unsw-nb15.zip to /content/drive/MyDrive/Colab Notebooks/kaggle-datasets
 95% 142M/149M [00:00<00:00, 155MB/s]
100% 149M/149M [00:01<00:00, 155MB/s]


In [None]:
!ls

kaggle.json  unsw-nb15.zip


In [None]:
!unzip unsw-nb15.zip

Archive:  unsw-nb15.zip
  inflating: NUSW-NB15_features.csv  
  inflating: UNSW-NB15_1.csv         
  inflating: UNSW-NB15_2.csv         
  inflating: UNSW-NB15_3.csv         
  inflating: UNSW-NB15_4.csv         
  inflating: UNSW-NB15_LIST_EVENTS.csv  
  inflating: UNSW_NB15_testing-set.csv  
  inflating: UNSW_NB15_training-set.csv  
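
The unzip step can also be done from Python with the standard-library `zipfile` module. The following is a minimal sketch; the helper name `extract_archive` and the destination directory are assumptions made for illustration. Extracting into a subdirectory such as `unsw-nb15` would match the paths used when the CSV files are read in Step 3.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example call (paths are assumptions, adjust them to your setup):
# extract_archive("unsw-nb15.zip", "unsw-nb15")
```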


### Step 2: Import necessary libraries

In [52]:
import numpy as np
import pandas as pd

### Step 3: Explore the Dataset

In [53]:
training = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/kaggle-datasets/unsw-nb15/UNSW_NB15_training-set.csv')
testing  = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/kaggle-datasets/unsw-nb15/UNSW_NB15_testing-set.csv')

df = pd.concat([training,testing]).reset_index(drop=True)

After loading the training and testing sets with pandas and concatenating them, we can print a short summary that lists the features in the dataset. Luckily, the dataset contains no null values. It has 43 columns, not counting the `id` and `label` columns.
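
Besides reading the `Non-Null Count` column of `df.info()`, missing values can be counted directly. The sketch below uses a small made-up frame (the column names mirror the dataset, the values are invented) so that the counts are easy to follow; on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "dur": [0.1, np.nan, 0.3],
    "proto": ["tcp", "udp", None],
    "label": [0, 1, 1],
})

# Count missing values per column; a clean dataset gives zero everywhere.
null_counts = toy.isnull().sum()
print(null_counts)

# Total number of missing cells in the whole frame.
total_missing = int(null_counts.sum())
print(total_missing)
```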

The column descriptions are provided in the table below.

| Field Name  | Description |
| ----------- | ----------- |
| id          | Unique identifier for each record |
| dur         | Record total duration         |
| proto       | Transaction protocol      |
| service     | http, ftp, ssh, dns ..,else (-) |
| state       | The state and its dependent protocol, e.g. ACC, CLO, else (-) |
| spkts       | Source to destination packet count |
| dpkts       | Destination to source packet count |
| sbytes      | Source to destination bytes         |
| dbytes      | Destination to source bytes|
| rate        | The average attack rate           |
| sttl        | Source to destination time to live         |
| dttl        | Destination to source time to live     |
| sload       | Source bits per second      |
| dload       | Destination bits per second      |
| sloss       | Source packets retransmitted or dropped |
| dloss       | Destination packets retransmitted or dropped     |
| sinpkt      | Source inter-packet arrival time (mSec)         |
| dinpkt      | Destination inter-packet arrival time (mSec)    |
| sjit        | Source jitter (mSec)                            |
| djit        | Destination jitter (mSec)                     |
| swin        | Source TCP window advertisement               |
| dwin        | Destination TCP window advertisement          |
| stcpb       | Source TCP sequence number                    |
| dtcpb       | Destination TCP sequence number               |
| tcprtt      | TCP connection setup round-trip time, the sum of ’synack’ and ’ackdat’ |
| synack      | TCP connection setup time, the time between the SYN and the SYN_ACK packets |
| ackdat      | TCP connection setup time, the time between the SYN_ACK and the ACK packets |
| smean       | Mean of the flow packet size transmitted by the src         |
| dmean       | Mean of the flow packet size transmitted by the dst         |
| trans_depth | The depth into the connection of http request/response transaction |
| response_body_len | The content size of the data transferred from the server’s http service |
| ct_srv_src         | No. of connections that contain the same service and source address in 100 connections according to the last time |
| ct_state_ttl       | No. for each state according to specific range of values for source/destination time to live       |
| ct_dst_ltm         | No. of connections of the same destination address in 100 connections according to the last time        |
| ct_src_dport_ltm   | No. of connections of the same source address and the destination port in 100 connections according to the last time    |
| ct_dst_sport_ltm   | No. of connections of the same destination address and the source port in 100 connections according to the last time    |
| ct_dst_src_ltm     | No. of connections of the same source and the destination address in 100 connections according to the last time   |
| is_ftp_login       | If the ftp session is accessed by user and password then 1 else 0     |
| ct_ftp_cmd         | No. of flows that have a command in ftp session |
| ct_flw_http_mthd   | No. of flows that have methods such as Get and Post in http service        |
| ct_src_ltm         | No. of connections of the same source address in 100 connections according to the last time     |
| ct_srv_dst         | No. of connections that contain the same service and destination address in 100 connections according to the last time        |
| is_sm_ips_ports    | If source and destination IP addresses are equal and port numbers are equal, this variable takes value 1 else 0        |
| attack_cat | The name of each attack category. This data set has nine attack categories (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms); normal records are labeled Normal |
| label | 0 for normal and 1 for attack records |

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257673 entries, 0 to 257672
Data columns (total 45 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 257673 non-null  int64  
 1   dur                257673 non-null  float64
 2   proto              257673 non-null  object 
 3   service            257673 non-null  object 
 4   state              257673 non-null  object 
 5   spkts              257673 non-null  int64  
 6   dpkts              257673 non-null  int64  
 7   sbytes             257673 non-null  int64  
 8   dbytes             257673 non-null  int64  
 9   rate               257673 non-null  float64
 10  sttl               257673 non-null  int64  
 11  dttl               257673 non-null  int64  
 12  sload              257673 non-null  float64
 13  dload              257673 non-null  float64
 14  sloss              257673 non-null  int64  
 15  dloss              257673 non-null  int64  
 16  si

The following command shows the first 10 rows of the dataset to give us a clearer idea of its contents.

In [12]:
df.head(10)

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,...,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,...,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,...,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,...,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,...,1,3,0,0,0,2,3,0,Normal,0
5,6,3e-06,udp,-,INT,2,0,784,0,333333.3215,...,1,2,0,0,0,2,2,0,Normal,0
6,7,6e-06,udp,-,INT,2,0,1960,0,166666.6608,...,1,2,0,0,0,2,2,0,Normal,0
7,8,2.8e-05,udp,-,INT,2,0,1384,0,35714.28522,...,1,3,0,0,0,1,3,0,Normal,0
8,9,0.0,arp,-,INT,1,0,46,0,0.0,...,2,2,0,0,0,2,2,1,Normal,0
9,10,0.0,arp,-,INT,1,0,46,0,0.0,...,2,2,0,0,0,2,2,1,Normal,0


In [13]:
df.describe(include='all')

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
count,257673.0,257673.0,257673,257673,257673,257673.0,257673.0,257673.0,257673.0,257673.0,...,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673,257673.0
unique,,,133,13,11,,,,,,...,,,,,,,,,10,
top,,,tcp,-,FIN,,,,,,...,,,,,,,,,Normal,
freq,,,123041,141321,117164,,,,,,...,,,,,,,,,93000,
mean,72811.823858,1.246715,,,,19.777144,18.514703,8572.952,14387.29,91253.91,...,4.032677,8.322964,0.012819,0.01285,0.132005,6.800045,9.121049,0.014274,,0.639077
std,48929.917641,5.974305,,,,135.947152,111.985965,173773.9,146199.3,160344.6,...,5.831515,11.120754,0.116091,0.116421,0.681854,8.396266,10.874752,0.118618,,0.480269
min,1.0,0.0,,,,1.0,0.0,24.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,,0.0
25%,32210.0,8e-06,,,,2.0,0.0,114.0,0.0,30.78928,...,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,,0.0
50%,64419.0,0.004285,,,,4.0,2.0,528.0,178.0,2955.665,...,1.0,3.0,0.0,0.0,0.0,3.0,4.0,0.0,,1.0
75%,110923.0,0.685777,,,,12.0,10.0,1362.0,1064.0,125000.0,...,3.0,8.0,0.0,0.0,0.0,8.0,11.0,0.0,,1.0


### Step 4: Data Preprocessing and Feature Selection
Data preprocessing is the process of cleaning the data, normalizing it, encoding discrete values, and more. Feature selection, on the other hand, is the process of selecting the most helpful columns. However, if the dataset is good, all the features can be used without any problem.

The first thing we do is drop the `id` and `attack_cat` columns. The `id` column carries no real information, so it is not helpful. The `attack_cat` column is the attack type, which is not known at inference time, so training the model with it would not help.

In [54]:
df.drop(['id', 'attack_cat'],axis=1,inplace=True)

The following cell confirms that the `id` and `attack_cat` columns were dropped from the dataset.

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257673 entries, 0 to 257672
Data columns (total 43 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   dur                257673 non-null  float64
 1   proto              257673 non-null  object 
 2   service            257673 non-null  object 
 3   state              257673 non-null  object 
 4   spkts              257673 non-null  int64  
 5   dpkts              257673 non-null  int64  
 6   sbytes             257673 non-null  int64  
 7   dbytes             257673 non-null  int64  
 8   rate               257673 non-null  float64
 9   sttl               257673 non-null  int64  
 10  dttl               257673 non-null  int64  
 11  sload              257673 non-null  float64
 12  dload              257673 non-null  float64
 13  sloss              257673 non-null  int64  
 14  dloss              257673 non-null  int64  
 15  sinpkt             257673 non-null  float64
 16  di

In [56]:
df.describe()

Unnamed: 0,dur,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,dload,...,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,label
count,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,...,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0,257673.0
mean,1.246715,19.777144,18.514703,8572.952,14387.29,91253.91,180.000931,84.754957,70608690.0,658214.3,...,5.238271,4.032677,8.322964,0.012819,0.01285,0.132005,6.800045,9.121049,0.014274,0.639077
std,5.974305,135.947152,111.985965,173773.9,146199.3,160344.6,102.488268,112.762131,185731300.0,2412372.0,...,8.160822,5.831515,11.120754,0.116091,0.116421,0.681854,8.396266,10.874752,0.118618,0.480269
min,0.0,1.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,8e-06,2.0,0.0,114.0,0.0,30.78928,62.0,0.0,12318.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0
50%,0.004285,4.0,2.0,528.0,178.0,2955.665,254.0,29.0,743942.3,1747.441,...,1.0,1.0,3.0,0.0,0.0,0.0,3.0,4.0,0.0,1.0
75%,0.685777,12.0,10.0,1362.0,1064.0,125000.0,254.0,252.0,80000000.0,22105.38,...,4.0,3.0,8.0,0.0,0.0,0.0,8.0,11.0,0.0,1.0
max,59.999989,10646.0,11018.0,14355770.0,14657530.0,1000000.0,255.0,254.0,5988000000.0,22422730.0,...,59.0,46.0,65.0,4.0,4.0,30.0,60.0,62.0,1.0,1.0


The dataset has three categorical features (columns): `state`, `proto`, and `service`. The cells below inspect these three features and convert their categorical values into numerical values so that we can use the dataset to train the model.

The `service` feature uses `-` as a value, which indicates that the service is unknown.

Applying one-hot encoding to categorical features with many distinct values makes the number of features explode, a problem known as the curse of dimensionality. To prevent it, we will process the categorical features so that each has at most 6 distinct values.

In [57]:
df_cat = df.select_dtypes(exclude=[np.number])
df_cat.describe(include='all')

Unnamed: 0,proto,service,state
count,257673,257673,257673
unique,133,13,11
top,tcp,-,FIN
freq,123041,141321,117164


In [58]:
df.proto.unique()

array(['udp', 'arp', 'tcp', 'igmp', 'ospf', 'sctp', 'gre', 'ggp', 'ip',
       'ipnip', 'st2', 'argus', 'chaos', 'egp', 'emcon', 'nvp', 'pup',
       'xnet', 'mux', 'dcn', 'hmp', 'prm', 'trunk-1', 'trunk-2',
       'xns-idp', 'leaf-1', 'leaf-2', 'irtp', 'rdp', 'netblt', 'mfe-nsp',
       'merit-inp', '3pc', 'idpr', 'ddp', 'idpr-cmtp', 'tp++', 'ipv6',
       'sdrp', 'ipv6-frag', 'ipv6-route', 'idrp', 'mhrp', 'i-nlsp', 'rvd',
       'mobile', 'narp', 'skip', 'tlsp', 'ipv6-no', 'any', 'ipv6-opts',
       'cftp', 'sat-expak', 'ippc', 'kryptolan', 'sat-mon', 'cpnx', 'wsn',
       'pvp', 'br-sat-mon', 'sun-nd', 'wb-mon', 'vmtp', 'ttp', 'vines',
       'nsfnet-igp', 'dgp', 'eigrp', 'tcf', 'sprite-rpc', 'larp', 'mtp',
       'ax.25', 'ipip', 'aes-sp3-d', 'micp', 'encap', 'pri-enc', 'gmtp',
       'ifmp', 'pnni', 'qnx', 'scps', 'cbt', 'bbn-rcc', 'igp', 'bna',
       'swipe', 'visa', 'ipcv', 'cphb', 'iso-tp4', 'wb-expak', 'sep',
       'secure-vmtp', 'xtp', 'il', 'rsvp', 'unas', 'fc', 'iso-ip',


In [59]:
print(f'The most repeated protocols = ', df['proto'].value_counts().head().index)

The most repeated protocols =  Index(['tcp', 'udp', 'unas', 'arp', 'ospf'], dtype='object')


In [60]:
df.state.unique()

array(['INT', 'FIN', 'REQ', 'ACC', 'CON', 'RST', 'CLO', 'ECO', 'PAR',
       'URN', 'no'], dtype=object)

In [61]:
print(f'The most repeated states = ', df['state'].value_counts().head().index)

The most repeated states =  Index(['FIN', 'INT', 'CON', 'REQ', 'RST'], dtype='object')


In [62]:
df.service.unique()

array(['-', 'http', 'ftp', 'ftp-data', 'smtp', 'pop3', 'dns', 'snmp',
       'ssl', 'dhcp', 'irc', 'radius', 'ssh'], dtype=object)

Replace `-` in the categorical `service` feature with `None`.

In [63]:
df['service'] = np.where(df['service'] == '-', 'None', df['service'])

In [64]:
df.service.unique()

array(['None', 'http', 'ftp', 'ftp-data', 'smtp', 'pop3', 'dns', 'snmp',
       'ssl', 'dhcp', 'irc', 'radius', 'ssh'], dtype=object)

In [65]:
print(f'The most repeated services = ', df['service'].value_counts().head().index)

The most repeated services =  Index(['None', 'dns', 'http', 'smtp', 'ftp-data'], dtype='object')


In [66]:
for feature in ['state', 'proto', 'service']:
    if df[feature].nunique() > 6:
        df[feature] = np.where(df[feature].isin(df[feature].value_counts().head().index), df[feature], 'None')
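
The capping logic in the loop above keeps the five most frequent values (the default of `head()`) and maps everything else to `'None'`, which gives at most six distinct values per feature. As a sketch, the same idea can be wrapped in a small helper; the name `cap_categories` and its parameters are assumptions for illustration, and the toy series is made up.

```python
import pandas as pd


def cap_categories(series, top_n=5, other="None"):
    """Keep the top_n most frequent values and replace all others with `other`."""
    top_values = series.value_counts().head(top_n).index
    # Series.where keeps entries where the condition holds and substitutes `other` elsewhere.
    return series.where(series.isin(top_values), other)


# Toy example: 'tcp' and 'udp' are the two most frequent values.
protocols = pd.Series(["tcp", "tcp", "tcp", "udp", "udp", "arp", "gre"])
capped = cap_categories(protocols, top_n=2)
print(capped.tolist())
```

With `top_n=2`, the rare values `arp` and `gre` are both folded into the `'None'` bucket.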

In [67]:
df_cat = df.select_dtypes(exclude=[np.number])
df_cat.describe(include='all')

Unnamed: 0,proto,service,state
count,257673,257673.0,257673
unique,6,5.0,6
top,tcp,,FIN
freq,123041,149701.0,117164


In [68]:
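X = df.iloc[:,:-1].copy()  # features: every column except the last; copy so later column assignments act on an independent frame
y = df.iloc[:,-1]          # target: the last column (label)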

In [69]:
X.shape

(257673, 42)

In [70]:
y.shape

(257673,)

Data normalization rescales the feature values to a common scale for better model training.
Here `StandardScaler` standardizes each numeric feature to zero mean and unit variance.
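
To make the transformation concrete, the sketch below standardizes a small made-up array by hand with NumPy: subtract the mean, then divide by the population standard deviation (`ddof=0`), which is the formula `StandardScaler` applies to each column.

```python
import numpy as np

# Made-up values chosen so the mean and standard deviation are round numbers.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()   # 5.0
std = values.std()     # population standard deviation (ddof=0): 2.0

# z-score: distance from the mean, measured in standard deviations.
z_scores = (values - mean) / std
print(z_scores)
```

After the transformation the values have mean 0 and standard deviation 1, regardless of the original units.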

In [71]:
from sklearn import preprocessing
cols_numeric = X.select_dtypes(include=[np.number]).columns

In [72]:
normalizer = preprocessing.StandardScaler().fit(X[cols_numeric])

In [73]:
X[cols_numeric] = normalizer.transform(X[cols_numeric])

In [74]:
X.shape

(257673, 42)

In [75]:
X = pd.get_dummies(X,columns=['state', 'proto', 'service'])

In [76]:
X.shape

(257673, 56)

In [77]:
'Different values in categorical features = ',len(df.service.unique()) + len(df.proto.unique()) + len(df.state.unique())

('Different values in categorical features = ', 17)

The number of features increased because the three categorical features now have 17 distinct values in total, and one-hot encoding gives each value its own column. The 39 numeric features plus these 17 columns give a total of `39 + 17 = 56` features.
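
The same arithmetic can be checked on a toy frame. In this sketch (made-up values, column names borrowed from the dataset), one numeric column plus two categorical columns with 3 and 2 distinct values should give `1 + 3 + 2 = 6` columns after `pd.get_dummies`.

```python
import pandas as pd

# Toy frame: one numeric column and two categorical columns.
toy = pd.DataFrame({
    "dur": [0.1, 0.2, 0.3, 0.4],
    "proto": ["tcp", "udp", "tcp", "arp"],
    "state": ["FIN", "INT", "FIN", "FIN"],
})

# Each distinct categorical value becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["proto", "state"])

numeric_columns = 1
distinct_values = toy["proto"].nunique() + toy["state"].nunique()
print(encoded.shape[1], numeric_columns + distinct_values)
```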

### Step 5: Modeling
The training and testing datasets were combined for easier preprocessing. Before training the models, we need to split the combined dataset back into training and testing sets.

In [78]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 0,
                                                    stratify=y)
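
The `stratify=y` argument keeps the class proportions the same in both splits. A minimal sketch on made-up labels with a 60/40 balance illustrates the effect; the arrays `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up labels: 60 positives and 40 negatives.
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Both splits keep the same share of positives as the full label vector.
print(y_tr.mean(), y_te.mean())
```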

In [79]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
import time
model_performance = pd.DataFrame(columns=['Accuracy','time to train','time to predict','total time'])

#### 1. Logistic Regression

In [80]:
%%time
from sklearn.linear_model import LogisticRegression
start = time.time()
lr = LogisticRegression().fit(X_train,y_train)
end_train = time.time()
y_predictions = lr.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

CPU times: user 4.93 s, sys: 1.55 s, total: 6.47 s
Wall time: 5.92 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
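
The warning above means the solver stopped at its iteration limit before converging. One option the message itself suggests is raising `max_iter`. The sketch below shows this on synthetic data from `make_classification`; the value `1000` is an arbitrary assumption, and applying it to the notebook's model would change the accuracy and timings reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the UNSW-NB15 features.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# Allow more iterations than the default before the solver gives up.
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used; a value below max_iter means the solver converged.
print(clf.n_iter_[0], clf.max_iter)
```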


In [81]:
accuracy = accuracy_score(y_test, y_predictions)

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Logistic'] = [accuracy,end_train-start,end_predict-end_train,end_predict-start]

Accuracy: 90.15%
time to train: 5.90 s
time to predict: 0.02 s
total: 5.92 s
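
Each model section in this step repeats the same fit, predict, and timing pattern. As a sketch, that pattern could be factored into a helper such as the hypothetical `fit_and_time` below, shown on synthetic data rather than the UNSW-NB15 split. It uses `time.perf_counter` instead of `time.time`, a swapped-in choice because it is intended for measuring short durations.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_and_time(model, X_train, y_train, X_test, y_test):
    """Fit model, predict on the test split, and return accuracy plus timings in seconds."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    end_train = time.perf_counter()
    predictions = model.predict(X_test)
    end_predict = time.perf_counter()
    return {
        "Accuracy": accuracy_score(y_test, predictions),
        "time to train": end_train - start,
        "time to predict": end_predict - end_train,
        "total time": end_predict - start,
    }


# Synthetic stand-in data for the demonstration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

result = fit_and_time(DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te, y_te)
print(result)
```

The returned keys match the columns of `model_performance`, so a row could be filled with `list(result.values())`.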


#### 2. K-Nearest Neighbor

In [82]:
%%time
from sklearn.neighbors import KNeighborsClassifier
start = time.time()
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train,y_train)
end_train = time.time()
y_predictions = knn.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

CPU times: user 2min 11s, sys: 227 ms, total: 2min 11s
Wall time: 1min 38s


In [83]:
accuracy = accuracy_score(y_test, y_predictions)

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['kNN'] = [accuracy,end_train-start,end_predict-end_train,end_predict-start]

Accuracy: 91.48%
time to train: 0.19 s
time to predict: 98.55 s
total: 98.75 s


#### 3. Decision Tree

In [84]:
%%time
from sklearn.tree import DecisionTreeClassifier
start = time.time()
dt = DecisionTreeClassifier().fit(X_train,y_train)
end_train = time.time()
y_predictions = dt.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

CPU times: user 3.93 s, sys: 29 ms, total: 3.95 s
Wall time: 3.96 s


In [85]:
accuracy = accuracy_score(y_test, y_predictions)

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Decision Tree'] = [accuracy,end_train-start,end_predict-end_train,end_predict-start]

Accuracy: 93.88%
time to train: 3.94 s
time to predict: 0.02 s
total: 3.96 s


#### 4. Random Forest

In [86]:
%%time
from sklearn.ensemble import RandomForestClassifier
start = time.time()
rf = RandomForestClassifier(n_estimators = 100,n_jobs=-1,random_state=0,bootstrap=True,).fit(X_train,y_train)
end_train = time.time()
y_predictions = rf.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

CPU times: user 1min 7s, sys: 257 ms, total: 1min 7s
Wall time: 51.8 s


In [87]:
accuracy = accuracy_score(y_test, y_predictions)

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Random Forest'] = [accuracy,end_train-start,end_predict-end_train,end_predict-start]

Accuracy: 95.10%
time to train: 51.04 s
time to predict: 0.76 s
total: 51.80 s


#### 5. Gradient Boosting Classifier

In [88]:
%%time
from sklearn.ensemble import GradientBoostingClassifier
start = time.time()
gbc = GradientBoostingClassifier().fit(X_train,y_train)
end_train = time.time()
y_predictions = gbc.predict(X_test) # These are the predictions from the test data.
end_predict = time.time()

CPU times: user 2min 16s, sys: 153 ms, total: 2min 16s
Wall time: 2min 18s


In [89]:
accuracy = accuracy_score(y_test, y_predictions)

print("Accuracy: "+ "{:.2%}".format(accuracy))
print("time to train: "+ "{:.2f}".format(end_train-start)+" s")
print("time to predict: "+"{:.2f}".format(end_predict-end_train)+" s")
print("total: "+"{:.2f}".format(end_predict-start)+" s")
model_performance.loc['Gradient Boosting Classifier'] = [accuracy,end_train-start,end_predict-end_train,end_predict-start]

Accuracy: 93.27%
time to train: 138.09 s
time to predict: 0.19 s
total: 138.28 s


### Conclusion

In [90]:
model_performance.style.background_gradient(cmap='coolwarm').format({'Accuracy': '{:.2%}',
                                                                     'time to train':'{:.1f}',
                                                                     'time to predict':'{:.1f}',
                                                                     'total time':'{:.1f}',
                                                                     })

Unnamed: 0,Accuracy,time to train,time to predict,total time
Logistic,90.15%,5.9,0.0,5.9
kNN,91.48%,0.2,98.6,98.7
Decision Tree,93.88%,3.9,0.0,4.0
Random Forest,95.10%,51.0,0.8,51.8
Gradient Boosting Classifier,93.27%,138.1,0.2,138.3
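
The comparison above reports accuracy only, although `f1_score`, `precision_score`, and `recall_score` were imported in Step 5. Since roughly 64% of the records are attacks (the mean of `label` is 0.639), these metrics can give a fuller picture. The sketch below computes them on a tiny made-up example so the numbers can be verified by hand; it does not reproduce any result from the notebook.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = attack, 0 = normal.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # 3 of 5 predictions are correct
precision = precision_score(y_true, y_pred)  # 2 true positives / 3 predicted positives
recall = recall_score(y_true, y_pred)        # 2 true positives / 3 actual positives
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```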


### Save the Best Accuracy Model

In [None]:
import pickle

filename = '/content/drive/MyDrive/Colab Notebooks/network-anomaly-detection/trained-models/RandomForest.sav'
# Use a context manager so the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(rf, f)

### Load the model

In [None]:
# Reopen the saved file and restore the trained model
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
accuracy = loaded_model.score(X_test, y_test)

print("Accuracy: "+ "{:.2%}".format(accuracy))

Accuracy: 95.10%
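
The save and load round trip can be tried without a trained model. In the sketch below a plain dict stands in for the model (an assumption for illustration; any picklable object behaves the same way) and a temporary directory replaces the Drive path. Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

# A plain dict stands in for the trained model here.
stand_in_model = {"name": "RandomForest", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.sav"

    # Write the object to disk.
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back into a new object.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```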
