**Introduction:**

This is a project predicting Attack for Network. The dataset used for this analysis was taken from kaggle dataset. The raw network packets of the UNSW-NB15 is a comprehensive dataset for network intrusion detection systems which was created by the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) for generating a hybrid of real modern normal activities and synthetic contemporary attack behaviours. It was published in 2015. This dataset has nine types of attacks, namely: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. It has features with the class label. I use a partition from this dataset is configured as a training set, namely: UNSW_NB15_training-set.csv. The number of records in the dataset is 175,341 records from the different types of attack and normal.

-The objective of this project is:

- Exploring  data for  analysing of cyber security data
- Perform anomaly detection using some algorithms and evaluate its learning profile and predict the anomaly detection.


This notebook is centered around 6 different questions:

- What are the most common types of Attack?
- What are the most common protocol,service and state for Attack?
- What are the effect of Attack?

- NOTE:
 
The features of dataset are described in UNSW-NB15_features.csv file which says that:
- In 'state' column: '-' means that 'Not used stste'.
- In 'service' column: '-' means that ' Not much used service'

In [1]:
# Import Liberaries and Packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import struct
%matplotlib inline
from wordcloud import WordCloud
import scipy.stats as stats
from scipy.stats import boxcox
from scipy.stats import jarque_bera
from scipy.stats import normaltest
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,log_loss
from sklearn.svm import SVC
from scipy.stats import mannwhitneyu


import warnings
warnings.filterwarnings(action="ignore")

In [2]:
# Concat all dataset's files:

df = pd.read_csv(r'C:\Users\mebra.DESKTOP-L12LJA6\Thinkful Works\PythonThinkful\capstonbotdataset\UNSW_NB15_training-set.csv')


In [3]:
# Look at the dataset:

df.head()

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,...,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,...,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,...,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,...,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,...,1,3,0,0,0,2,3,0,Normal,0


In [4]:
# look at the shape of dataset:

df.shape

(82332, 45)

In [5]:
# Clean dataset by droping duplicates: 

df.drop_duplicates(inplace=True)

In [6]:
# Look at the length of dataset after removing duplicate:

len(df)

82332

In [7]:
# Look at the type of columns:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82332 entries, 0 to 82331
Data columns (total 45 columns):
id                   82332 non-null int64
dur                  82332 non-null float64
proto                82332 non-null object
service              82332 non-null object
state                82332 non-null object
spkts                82332 non-null int64
dpkts                82332 non-null int64
sbytes               82332 non-null int64
dbytes               82332 non-null int64
rate                 82332 non-null float64
sttl                 82332 non-null int64
dttl                 82332 non-null int64
sload                82332 non-null float64
dload                82332 non-null float64
sloss                82332 non-null int64
dloss                82332 non-null int64
sinpkt               82332 non-null float64
dinpkt               82332 non-null float64
sjit                 82332 non-null float64
djit                 82332 non-null float64
swin                 82332 non-n

In [8]:
# Findout number of unique value in each column:

for col in df.columns:
    print("Number of Unique Values in Column {} are: {}".format(col, df[col].nunique()))

Number of Unique Values in Column id are: 82332
Number of Unique Values in Column dur are: 39888
Number of Unique Values in Column proto are: 131
Number of Unique Values in Column service are: 13
Number of Unique Values in Column state are: 7
Number of Unique Values in Column spkts are: 420
Number of Unique Values in Column dpkts are: 436
Number of Unique Values in Column sbytes are: 4489
Number of Unique Values in Column dbytes are: 4034
Number of Unique Values in Column rate are: 40616
Number of Unique Values in Column sttl are: 11
Number of Unique Values in Column dttl are: 8
Number of Unique Values in Column sload are: 42873
Number of Unique Values in Column dload are: 40614
Number of Unique Values in Column sloss are: 253
Number of Unique Values in Column dloss are: 311
Number of Unique Values in Column sinpkt are: 39970
Number of Unique Values in Column dinpkt are: 37617
Number of Unique Values in Column sjit are: 39944
Number of Unique Values in Column djit are: 38381
Number of 

In [9]:
# Findout percentage of missing values in each columns:

null_count = round(df.isnull().sum()*100/df.isnull().count(),2)
null_count[null_count>0]

Series([], dtype: float64)

In [10]:
# Findout object columns:

object_columns = df.select_dtypes('object')
object_columns.head()

Unnamed: 0,proto,service,state,attack_cat
0,udp,-,INT,Normal
1,udp,-,INT,Normal
2,udp,-,INT,Normal
3,udp,-,INT,Normal
4,udp,-,INT,Normal


In [11]:
# Findout unique values in each object columns:

for col in object_columns:
    print("Unique values in column {} are: {}, {}".format(col, df[col].nunique(), df[col].unique()))

Unique values in column proto are: 131, ['udp' 'arp' 'tcp' 'igmp' 'ospf' 'sctp' 'gre' 'ggp' 'ip' 'ipnip' 'st2'
 'argus' 'chaos' 'egp' 'emcon' 'nvp' 'pup' 'xnet' 'mux' 'dcn' 'hmp' 'prm'
 'trunk-1' 'trunk-2' 'xns-idp' 'leaf-1' 'leaf-2' 'irtp' 'rdp' 'netblt'
 'mfe-nsp' 'merit-inp' '3pc' 'idpr' 'ddp' 'idpr-cmtp' 'tp++' 'ipv6' 'sdrp'
 'ipv6-frag' 'ipv6-route' 'idrp' 'mhrp' 'i-nlsp' 'rvd' 'mobile' 'narp'
 'skip' 'tlsp' 'ipv6-no' 'any' 'ipv6-opts' 'cftp' 'sat-expak' 'ippc'
 'kryptolan' 'sat-mon' 'cpnx' 'wsn' 'pvp' 'br-sat-mon' 'sun-nd' 'wb-mon'
 'vmtp' 'ttp' 'vines' 'nsfnet-igp' 'dgp' 'eigrp' 'tcf' 'sprite-rpc' 'larp'
 'mtp' 'ax.25' 'ipip' 'aes-sp3-d' 'micp' 'encap' 'pri-enc' 'gmtp' 'ifmp'
 'pnni' 'qnx' 'scps' 'cbt' 'bbn-rcc' 'igp' 'bna' 'swipe' 'visa' 'ipcv'
 'cphb' 'iso-tp4' 'wb-expak' 'sep' 'secure-vmtp' 'xtp' 'il' 'rsvp' 'unas'
 'fc' 'iso-ip' 'etherip' 'pim' 'aris' 'a/n' 'ipcomp' 'snp' 'compaq-peer'
 'ipx-n-ip' 'pgm' 'vrrp' 'l2tp' 'zero' 'ddx' 'iatp' 'stp' 'srp' 'uti' 'sm'
 'smp' 'isis' '

In [12]:
# Reolace the '-' to null:

df['service'] = df['service'].replace('-', 'else')

In [13]:
# Look at the unique value of service column:
df["service"].unique()

array(['else', 'http', 'ftp', 'ftp-data', 'smtp', 'pop3', 'dns', 'snmp',
       'ssl', 'dhcp', 'irc', 'radius', 'ssh'], dtype=object)

In [14]:
# Look at the unique value of state column:
df["state"].unique()

array(['INT', 'FIN', 'REQ', 'ACC', 'CON', 'RST', 'CLO'], dtype=object)

In [15]:
# Descriptive statistics for object variables:

df.describe(include=['O'])

Unnamed: 0,proto,service,state,attack_cat
count,82332,82332,82332,82332
unique,131,13,7,10
top,tcp,else,FIN,Normal
freq,43095,47153,39339,37000


In [16]:
# Get univariate statistics for numeric columns:

df.describe()

Unnamed: 0,id,dur,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,...,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,label
count,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,...,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0
mean,41166.5,1.006756,18.666472,17.545936,7993.908,13233.79,82410.89,180.967667,95.713003,64549020.0,...,4.928898,3.663011,7.45636,0.008284,0.008381,0.129743,6.46836,9.164262,0.011126,0.5506
std,23767.345519,4.710444,133.916353,115.574086,171642.3,151471.5,148620.4,101.513358,116.667722,179861800.0,...,8.389545,5.915386,11.415191,0.091171,0.092485,0.638683,8.543927,11.121413,0.104891,0.497436
min,1.0,0.0,1.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,20583.75,8e-06,2.0,0.0,114.0,0.0,28.60611,62.0,0.0,11202.47,...,1.0,1.0,1.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0
50%,41166.5,0.014138,6.0,2.0,534.0,178.0,2650.177,254.0,29.0,577003.2,...,1.0,1.0,3.0,0.0,0.0,0.0,3.0,5.0,0.0,1.0
75%,61749.25,0.71936,12.0,10.0,1280.0,956.0,111111.1,254.0,252.0,65142860.0,...,4.0,3.0,6.0,0.0,0.0,0.0,7.0,11.0,0.0,1.0
max,82332.0,59.999989,10646.0,11018.0,14355770.0,14657530.0,1000000.0,255.0,253.0,5268000000.0,...,59.0,38.0,63.0,2.0,2.0,16.0,60.0,62.0,1.0,1.0


In [19]:
# Copy of dataset :
df_main = df.copy()  

In [None]:
# look at the correlation between columns:

plt.figure(figsize=(25,25))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt= '.2f')

In [None]:
# look at the correlation between Attack and other columns:
np.abs(df.corr())[['label']].sort_values(by='label', ascending=False).head(15)

In [None]:
# Findout numeric columns:

numeric_columns = df.select_dtypes(exclude=['object']).columns
print('Number of numeric columns is {}'.format(len(numeric_columns)))

In [None]:
# Univariate visualization of continuous variables by using hist:

plt.figure(figsize=(25,50))
for i,col in enumerate(numeric_columns.drop('id')):
    plt.subplot(11, 4, i+1)
    plt.boxplot(df[col])
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=90)
plt.subplots_adjust(hspace = 0.8, top = 0.7)
plt.show()

In [None]:
# Findout non_numeric columns:

nonnumeric_columns = df.select_dtypes(['object']).columns
print('Number of non_numeric columns is {}'.format(len(nonnumeric_columns)))

In [None]:
# Univariate visualization of categorical variables by barplot:

plt.figure(figsize=(15,20))
for i,col in enumerate(nonnumeric_columns):
    plt.subplot(4, 1, i+1)
    sns.countplot(df[col])
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=90, fontsize=8)
plt.subplots_adjust(hspace = 0.8, top = 0.7)
plt.show()

**Most Common:** 

- TCP and UDP are most common protocols were used in the dataset, also TCP was used more than UDP. 

- DNS and other services are most common services were used, also other services used more than DNS.

- INT and FIN are most common states were used and INT was used more.

- The number of Normal records is greater than Attack records. 

- The most common Attack was occured in the dataset is Genericand then Exploits. 



In [None]:
# Bivariate analysis of continuous variables by scatter plot:

plt.figure(figsize=(25,50))
for i,col in enumerate(numeric_columns.drop('id')):
    plt.subplot(11, 4, i+1)
    sns.boxplot(x=df[col], hue=df['label'], data=df)
    plt.title(f'Distribution of label by {col}')
    plt.xticks(rotation=90)
plt.subplots_adjust(hspace = 0.8, top = 0.7)
plt.show()

In [None]:
# Keep top 6 protocols:
df['proto'].value_counts()[:6].sum()/df.shape[0]*100
proto_other_lst = list(df['proto'].value_counts()[7:].index)
df['proto'] = df['proto'].apply(lambda x: x if x not in proto_other_lst else 'other')
df['proto'].value_counts()

In [None]:
# Bivariate analysis of a continuous-categorical pair:

plt.figure(figsize=(15,30))
for i,col in enumerate(nonnumeric_columns):
    plt.subplot(4,1,i+1)
    sns.barplot(df[col], df['rate'])
    plt.title(f'Distribution of {col} by rate')
    plt.xticks(rotation=90, fontsize=8)
plt.subplots_adjust(hspace = 0.8, top = 0.7)
plt.show()

In [None]:
# Bivariate analysis of a continuous-categorical pair:

plt.figure(figsize=(25,40))
for i,col in enumerate(nonnumeric_columns):
    plt.subplot(4,1,i+1)
    df.groupby(col).label.value_counts().plot(kind='bar')
    plt.title(f'Distribution of {col} and label')
    plt.xticks(rotation=90, fontsize=10)
plt.subplots_adjust(hspace = 0.8, top = 0.7)
plt.show()


**Most common protocol, service, state:**
- In TCP and UDP protocol, number of Normal records is  greater thann Attack records.
- Most common protocol for Attack is UDP.
- Most common servic for Attack is DNS.
- Most common state for Attack is INT.
- The number of Normal records in other services is greater than Attack records. But in DNS the numbr of Attack records is greater than normal.
- The number of Attack records in INT is greater then Normal. But in FIN the number of Normal records is greater than Attack records. 


In [None]:
# Bivariate analysis of a continuous-categorical pair:
plt.figure(figsize=(30,50))
for i,col in enumerate(numeric_columns.drop('id')):
    plt.subplot(12,4,i+1)
    sns.barplot(df['attack_cat'], df[col])
    plt.title(f'Distribution of Attac_cat by {col}')
    plt.xticks(rotation=90, fontsize=8)
plt.subplots_adjust(hspace = 1.5, top = 0.7)
plt.show()

-Attack:

- Dos and Fuzzers Attacks have most total duration of records.
- Exploits Attack has maximum number of packets (bytes) from source to destination.
- Worms Attack has maximum number of packets (bytes) from destination to sourc.
- Generic Attack has most total transaction packets per second.
- Normal records have lowest time to live value from Source to destination.
- Worms Aattack has longest time to live value from destination to source.
- Normal records have minimum source bits per second, but for destination have maximum.
- Exploits Attack has maximum number of source packets retransmitted or dropped. 
- Worms Attack has maximum number of destination packets retransmitted or dropped. 
- Fuzzers and Exploits Attacks have maximum mean of the flow packet size transmitted by the src, for destination Worms Attack is maximum.

In [None]:
# Using Piechart to see distribution of source and destination in each attack_cat:

cols=['spkts','sload','sbytes','dpkts','dload','dbytes','sttl','dttl','sloss','dloss','smean','dmean','swin', 'dwin','dur','rate']
df_attack = df.groupby('attack_cat')
plt.figure(figsize=(40,50))

for i,col in enumerate(cols):
    plt.subplot(7,3,i+1)
    df_attack[col].sum().plot(kind='pie',  title=(f'Ditribution of {col} in each attack_cat'), autopct='%1.0f%%')
    labels=df_attack[col].sum().index
    plt.legend(labels=labels, loc="upper left", prop={'size': 7}, bbox_to_anchor=(1,1))
    
plt.show()

- Normal records have higher total number of packets transmitted from source to destination and conversely.
- Normal records have higher total average of the flow packet size transmitted by the src  and dst.
- Normal records have higher total destination packets retransmitted or dropped and for source both Normal records and Exploits Attack are high.
- Total number of bit seconds is high in Generic attack records for source and in destination, Normal records are high.
- Total number of bytes transaction from src to dst is high in Exploits Attack records and from dst to src is high in Normal records.
- Total value of time to live from src to dst is high in Generic Attack and Normal records, but from dst to src is high for Normal records.
- Normal records have higher total duration.
- Generic Attacks have higher total rate of packets per second in transaction.

In [None]:
# Using Piechart to see distribution of source and destination in each attack_cat:

cols=['spkts','sload','sbytes','dpkts','dload','dbytes','sttl','dttl','sloss','dloss','smean','dmean','swin', 'dwin', 'dur', 'rate']
df_label = df.groupby('label')
plt.figure(figsize=(40,50))

for i,col in enumerate(cols):
    plt.subplot(7,3,i+1)
    df_label[col].sum().plot(kind='bar',  title=(f'Ditribution of {col} in each attack_cat'))
    
    
plt.show()

- The Normal label is higher in spkts, dpkts, dloss, dload, dbytes, dttl, dmean.
- The Attack lable is higher in sload, sbytes, sttl, sloss, dur, rate.
- The Normal and Attack labe have almost the same value for smean.

In [None]:
# Bivariate analysis of a continuous-categorical pair:
plt.figure(figsize=(30,50))
for i,col in enumerate(numeric_columns.drop('id')):
    plt.subplot(12,4,i+1)
    sns.barplot(df['proto'], df[col])
    plt.title(f'Distribution of proto by {col}')
    plt.xticks(rotation=90, fontsize=8)
plt.subplots_adjust(hspace = 1.2, top = 0.7)
plt.show()

-Protocol:

- OSPF protocol has most total duration of records.
- OSPF protocol has maximum number of packets and SCTP protocol has maximum number of bytes from source to destination.
- TCP protocol has maximum number of packets (bytes) from destination to sourc.
- OSPF protocol has lowest total transaction packets per second and longest is any.
- TCp protocol has lowest time to live value from Source to destination.
- UDP protocol has lowest time to live value from destination to source.
- OSPF protocol has lowest source bits per second and SCTP protocol has the maximum.
- TCP protocol has maximum destination bits per secondbut.
- TCP  protocol has maximum number of source and destination packets retransmitted or dropped. 
- SCTP protocol has maximum mean of the flow packet size transmitted by the src, and for destination TCP protocol is the maximum.
- TCP protocol has maximum number of Normal records.
- UDP protocol has maximum number of Attack records and SCTP protocol has minimum number of Attack records.

In [None]:
# Look at the distribution of protocol and label:

df.groupby('proto').label.value_counts().plot(kind='barh', title='Distribution of protocol and label')

In [None]:
# Bivariate analysis of a continuous-categorical pair:
plt.figure(figsize=(30,50))
for i,col in enumerate(numeric_columns.drop('id')):
    plt.subplot(12,4,i+1)
    sns.barplot(df['state'], df[col])
    plt.title(f'Distribution of state by {col}')
    plt.xticks(rotation=90, fontsize=8)
plt.subplots_adjust(hspace = 1, top = 0.7)
plt.show()

-Transaction State:

- REQ state has most total duration of records.
- FIN state has maximum number of packets and CON state has maximum number of bytes from source to destination.
- FIN state has maximum number of packets (bytes) from destination to sourc.
- INT state has Maximum total transaction packets per second and longest is any.
- CON state has lowest time to live value from Source to destination.
- CLO state has longest time to live value from destination to source and INT , REQ lowest.
- INT state has maximum source bits per second.
- FIN state has maximum destination bits per secondbut.
- CON and FIN  states have maximum number of source packets retransmitted or dropped and FIN for destination. 
- SCTP state has maximum mean of the flow packet size transmitted by the src, and for destination TCP protocol is the maximum.
- FIN state has maximum number of Normal recordsbut less than Attack records.
- INT state has maximum number of Attack records and CON, REQ, RST and ACC states have minimum number of Attack records.

In [None]:
# Look at the distribution of transaction state and label:
df.groupby('state').label.value_counts().plot(kind='bar', color='pink', title='Distribution of transaction state and label')

In [None]:
# Bivariate analysis of a continuous-categorical pair:
plt.figure(figsize=(30,50))
for i,col in enumerate(numeric_columns.drop('id')):
    plt.subplot(12,4,i+1)
    sns.barplot(df['service'], df[col])
    plt.title(f'Distribution of service by {col}')
    plt.xticks(rotation=90, fontsize=10)
plt.subplots_adjust(hspace = 1, top = 0.7)
plt.show()

- Service:

- SSL service has most total duration of records.
- SMTP service has maximum number of packets (bytes) from source to destination.
- POP3 service has maximum number of packets (bytes) from destination to source.
- SMTP service has maximum total transaction packets per second.
- SNMP and Radius services have longest time to live value from Source to destination and SSH service is the lowest.
- POP3, SSL and IRC services have longest time to live value from destination to source and SNMP is the lowest.
- DHCP service has maximum source bits per second.
- FTP_data service has maximum destination bits per secondbut.
- SMTP service has maximum number of source packets retransmitted or dropped. 
- POP3 service has maximum number of destination packets retransmitted or dropped.
- SMTP service has maximum mean of the flow packet size transmitted by the src, and for destination POP3 service is the maximum.
- DNS service has maximum number of Attack records.

In [None]:
# Look at the distribution of service and label:
df.groupby('service').label.value_counts().plot(kind='bar', color='skyblue', title='Distribution of service and label')

In [None]:
# Look at the distribution of some source and destination columns:
cols=['spkts','sload','sbytes','sttl','dpkts','dload','dbytes','dttl','sloss','dloss','smean','dmean','swin', 'dwin', 'dur','rate']

plt.figure(figsize=(30,50))
for i,col in enumerate(cols):
    plt.subplot(5,4,i+1)
    sns.distplot(df[col])
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=90, fontsize=10)
plt.subplots_adjust(hspace = 0.5, top = 0.7)
plt.show()

As you can see the distribution of source and destination almost the same, in sload,smean abit higher than dload,dmean and dttl higher than sttl. 

In [None]:
# Using violinplot to distribution of rate and attack_cat by label: 

plt.figure(figsize=(25,10))

sns.catplot(x="attack_cat", y='rate', hue="label", kind="violin", split=False, data=df)
plt.title('Distribution of attack_cat and rate by label')
plt.xticks(rotation=90, fontsize=10)

plt.show()

In [None]:
# Using violinplot to distribution of duration and attack_cat by label: 

plt.figure(figsize=(25,10))

sns.catplot(x="attack_cat", y="dur", hue="label", kind="violin", split=False, data=df)
plt.title('Distribution of attack_cat and duration by label')
plt.xticks(rotation=90, fontsize=10)

plt.show()

In [None]:
# Using boxen plot to distribution of rate and attack_cat by state:

plt.figure(figsize=(25,10))

sns.catplot(x="attack_cat", y="rate", hue="state", kind="boxen", data=df)
plt.title('Distribution of attack_cat and rate by state')
plt.xticks(rotation=90, fontsize=10)

plt.show()

In [None]:
# Using boxen plot to distribution of duration and attack_cat by state:

plt.figure(figsize=(25,10))

sns.catplot(x="attack_cat", y="dur", hue="state", kind="boxen", data=df)
plt.title('Distribution of attack_cat and duration by state')
plt.xticks(rotation=90, fontsize=10)

plt.show()

In [None]:
# Using barplot to distribution of rate and attack_cat by protocol:

plt.figure(figsize=(25,10))

sns.catplot(x="attack_cat", y="rate", hue="proto", kind="bar", data=df)
plt.title('Distribution of attack_cat and rate by protocol')
plt.xticks(rotation=90, fontsize=10)

plt.show()

In [None]:
# Using barplot to distribution of duration and attack_cat by protocol:


plt.figure(figsize=(25,10))

sns.catplot(x="attack_cat", y="dur", hue="proto", kind="bar", data=df)
plt.title('Distribution of attack_cat and duration by protocol')
plt.xticks(rotation=90, fontsize=10)

plt.show()

In [None]:
# Using barplot to distribution of rate and attack_cat by service:

plt.figure(figsize=(25,10))

sns.catplot(x="attack_cat", y="rate", hue="service", kind="bar", data=df)
plt.title('Distribution of attack_cat and rate by service')
plt.xticks(rotation=90, fontsize=10)

plt.show()

In [None]:
# Using barplot to distribution of duration and attack_cat by service:

plt.figure(figsize=(25,10))

sns.catplot(x="attack_cat", y="dur", hue="service", kind="bar", data=df)
plt.title('Distribution of attack_cat and duration by service')
plt.xticks(rotation=90, fontsize=10)

plt.show()

In [None]:
# Using boxplot to display range of average packets size transmittd by source in each category attack and label:

plt.figure(figsize=(10,5))
sns.set(style="whitegrid")

ax = sns.boxplot(x='attack_cat',y='smean',hue='label',data=df)  
plt.title('Distribution of attack_cat and smean by label')

sns.despine(offset=10, trim=True)
ax.set(xlabel='attack_cat', ylabel='smean')
plt.xticks(rotation=90, fontsize=10)
plt.legend(loc="upper right")

plt.show()

In [None]:
# Using boxplot to display range of average packets size transmittd by destination in each category attack and label:

plt.figure(figsize=(10,5))
sns.set(style="whitegrid")

ax = sns.boxplot(x='attack_cat',y='dmean',hue='label',data=df)  
plt.title('Distribution of Attack_cat and dmean by label')

sns.despine(offset=10, trim=True)
ax.set(xlabel='attack_cat', ylabel='dmean')
plt.xticks(rotation=90, fontsize=10)
plt.legend(loc="upper right")

plt.show()

In [None]:
# Using boxplot to display Source TCP window advertisement value in each category attack and label:

plt.figure(figsize=(10,5))
sns.set(style="whitegrid")

ax = sns.boxplot(x='attack_cat',y='swin',hue='label',data=df)  
plt.title('Distribution of Attack_cat and swin by label')

sns.despine(offset=10, trim=True)
ax.set(xlabel='attack_cat', ylabel='swin')
plt.xticks(rotation=90, fontsize=10)
plt.legend(loc="upper right")

plt.show()

In [None]:
# Using boxplot to display destination TCP window advertisement value in each category attack and label:

plt.figure(figsize=(10,5))
sns.set(style="whitegrid")

ax = sns.boxplot(x='attack_cat',y='dwin',hue='label',data=df)  
plt.title('Distribution of Attack_cat and dwin by label')

sns.despine(offset=10, trim=True)
ax.set(xlabel='attack_cat', ylabel='dwin')
plt.xticks(rotation=90, fontsize=10)
plt.legend(loc="upper right")

plt.show()

In [None]:
plt.figure(figsize=(25,5))

plt.subplot(2,1,1)
sns.boxplot(x='attack_cat', y='dur', data=df)
plt.title('Duration of attack_cat')
plt.xticks(rotation=45)

plt.subplot(2,1,2)
sns.boxplot(x='attack_cat', y='rate', data=df)
plt.title('Rating of attack_cat')
plt.xticks(rotation=45)

plt.subplots_adjust(hspace = 1.2, top = 0.9)
plt.show()

In [None]:
# Distribution of attack_cat with service, state, protocol by rate: 
plt.figure(figsize=(25,10))


df.groupby(['attack_cat', 'service']).rate.mean().plot(kind = 'line', color = 'green', label = 'service', linewidth=1, alpha = 0.5, grid = True)

df.groupby(['attack_cat', 'state']).rate.mean().plot(kind = 'line', color = 'blue', label = 'state', linewidth=1, alpha = 0.5, grid = True)

df.groupby(['attack_cat', 'proto']).rate.mean().plot(kind = 'line', color = 'purple', label = 'protocol', linewidth=1, alpha = 0.5, grid = True)

plt.show()

In [None]:
# Distribution of attack_cat with service, state, protocol by duration: 
plt.figure(figsize=(25,10))

df.groupby(['attack_cat', 'service']).dur.mean().plot(kind = 'line', color = 'blue', label = 'service', linewidth=1, alpha = 0.5, grid = True)

df.groupby(['attack_cat', 'state']).dur.mean().plot(kind = 'line', color = 'purple', label = 'state', linewidth=1, alpha = 0.5, grid = True)

df.groupby(['attack_cat', 'proto']).dur.mean().plot(kind = 'line' , color = 'green', label = 'protocol', linewidth=1, alpha = 0.5, grid = True)

plt.show()

In [None]:
# Distribution of service , state, protocol with label by rate: 

plt.figure(figsize=(25,10))

plt.subplot(3,1,1)
df.groupby(['service', 'label']).rate.mean().plot(kind = 'bar', color = 'skyblue', label = 'service', linewidth=1, alpha = 0.5, grid = True)
plt.xticks(rotation=90, fontsize=10)


plt.subplot(3,1,2)
df.groupby(['state', 'label']).rate.mean().plot(kind = 'bar', color = 'purple', label = 'state', linewidth=1, alpha = 0.5, grid = True)
plt.xticks(rotation=90, fontsize=10)

plt.subplot(3,1,3)
df.groupby(['proto', 'label']).rate.mean().plot(kind = 'bar', color = 'pink', label = 'protocol', linewidth=1, alpha = 0.5, grid = True)
plt.xticks(rotation=90, fontsize=10)

plt.subplots_adjust(hspace = 1.2, top = 0.9)

plt.show()

In [None]:
# Distribution of service , state, protocol with label by duration: 

plt.figure(figsize=(25,10))

plt.subplot(3,1,1)
df.groupby(['service', 'label']).dur.mean().plot(kind = 'bar', color = 'skyblue', label = 'service', linewidth=1, alpha = 0.5, grid = True)
plt.xticks(rotation=90, fontsize=10)

plt.subplot(3,1,2)
df.groupby(['state', 'label']).dur.mean().plot(kind = 'bar', color = 'purple', label = 'state', linewidth=1, alpha = 0.5, grid = True)
plt.xticks(rotation=90, fontsize=10)

plt.subplot(3,1,3)
df.groupby(['proto', 'label']).dur.mean().plot(kind = 'bar', color = 'pink', label = 'protocol', linewidth=1, alpha = 0.5, grid = True)
plt.xticks(rotation=90, fontsize=10)

plt.subplots_adjust(hspace = 1.2, top = 0.9)

plt.show()

In [None]:
# Look at the distribution of target variable because target variable is binary use boxplot instead of hist plot:
plt.figure(figsize=(15,5))###???? im not sure choose label or attack_cat as a target??????
plt.subplot(1,2,1)
plt.boxplot(df['label'])
plt.title('Distribution of label Attack')
plt.xlabel("Attack")
plt.ylabel("Number of Occurrence")

plt.subplot(1,2,2)
plt.hist(df['label'])
plt.title('Distribution of label Attack')
plt.xlabel("Attack")
plt.ylabel("Number of Occurrence")
plt.show()

In [None]:
# Look at the distribution of target variable.#####???? can we choose both(label, attack_cat) as target variable??????

sns.countplot(df['attack_cat'])         ###? can use hist for object type???type of plot is correct??????
plt.title('Distribution of categories Attack')
plt.xlabel("Categories of Attack")
plt.ylabel("Number of Occurrence")
plt.xticks(rotation =90)
plt.show()

In [None]:
# Using T_test to determine if there is a significant difference between the Normal and Attack records in rate:

# Use reset_index because i want to change panda series to df nead to be old index:

normal_record=df[df.label== 0].groupby('attack_cat').rate.sum()
normal_record=np.array(normal_record)

attack_record= df[df.label==1].groupby('attack_cat').rate.sum()
attack_record=np.array(attack_record)

scipy.stats.ttest_ind(normal_record, attack_record, equal_var=False)

####?????for ttest choose correct column?????why get nan?????what does t test is parametric test means???
###????what is reset_index use for????????????????
### ???? what does mannwhitneyu means and use for what?????/

**Preparing data for modeling:** 

- For modeling,need all columns to be numeric. To convert nonnumeric to numeric values, I can either use dummy variables or encode them. By using dummy, we can make  

In [None]:
# Convert nonnumeric column to numeric by using encoding:
                     #####?????what is different between cat.codes and label encoding???????
categorical = df.select_dtypes(include=['object']).drop('attack_cat', axis=1)
dummies = pd.get_dummies(categorical, drop_first=True)
dummies.head()

In [None]:
# Drop nonnumeric columns variables after converting to dummies: 
df = df.drop(list(categorical.columns), axis=1)
df.head()

In [None]:
# Concat dummies variables with dataset:
df = pd.concat([df, dummies], axis=1)
df.head()

In [None]:
# Findout label assigned to attack_cat:

#c = df_main['attack_cat'].astype('category')
#dic = dict(enumerate(c.cat.categories))
#df['code'] = df_main.attack_cat.astype('category').cat.codes
#df['attack_name'] = df['code'].map(dic)


#dummies_attack_cat=[col for col in df if col.startswith('attack_cat')]

In [None]:
# Use train_test_split to create the necessary training and test groups:
x = df.drop(['attack_cat', 'id'], axis=1)
y = df['attack_cat']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=20)

**Applying Models:**

In [None]:
def get_scores(model, model_name):
    model.fit(X_train, y_train)
    
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    
    train_probs = model.predict_proba(X_train)
    test_probs = model.predict_proba(X_test)
    
    print('{} has training accuracy of: {}'.format(model_name, accuracy_score(y_train, train_preds)))
    print('{} has test accuracy of: {}\n'.format(model_name, accuracy_score(y_test, test_preds)))
    
    print('{} has training log loss of: {}'.format(model_name, log_loss(y_train, train_probs)))
    print('{} has test log loss of: {}\n'.format(model_name, log_loss(y_test, test_probs)))
    
    return train_preds, test_preds, train_probs, test_probs


**1- Preliminary Logistic Regression:**

In [None]:
# Applying logistic Regression model:

lr_initial = LogisticRegression( n_jobs=-1)

initial_lr_train_preds, initial_lr_test_preds, initial_lr_train_probs, initial_lr_test_probs = get_scores(lr_initial, ' Preliminary Logistic Regression model')


**2- Preliminary K Neighbors Classifier:**

In [None]:
# Applying KNeighbors Classifier model:


knn_initial = KNeighborsClassifier(n_jobs=-1)

initial_knn_train_preds, initial_knn_test_preds, initial_knn_train_probs, initial_knn_test_probs = get_scores(knn_initial, 'Preliminary knn model')

**3- Preliminary Random Forest Classifier:**

In [None]:
# Applying Random Forest Classifier model:

rfc_initial = RandomForestClassifier(n_jobs=-1)

initial_rfc_train_preds, initial_rfc_test_preds, initial_rfc_train_probs, initial_rfc_test_probs = get_scores(rfc_initial,'Preliminary Random Forest model')

**4- Preliminary Support Vector Classifier:**

In [None]:
# Applying Support Vector Classifier: why took long time ???????

#svc_initial = SVC(gamma='auto', probability=True)

#initial_svc_train_preds, initial_svc_test_preds, initial_svc_train_probs, initial_svc_test_probs = get_scores(svc_initial,'Preliminary Support Vector Classification model')

**5- Preliminary Gradiant Boosting Classifier:**

In [None]:
# Applying Gradoant Boosting Classifier:

gbc_initial = GradientBoostingClassifier()

initial_gbc_train_preds, initial_gbc_test_preds, initial_gbc_train_probs, initial_gbc_test_probs = get_scores(gbc_initial, 'Preliminary Gradient Boosting model')

In [None]:
# Create a dataFrame with accuracy of different models using dictionary:
preliminary_model_accuracy=pd.DataFrame({"Models":['Initial Logistic Regression', 'Initial knn', 'Initial Random Forest', 'Initial Gradient Boosting'], 
                 "Training Accuracy":[0.63,0.77,0.94,0.9],
                 "Test Accuracy":[0.63,0.69,0.89, 0.9],
                 "Training Log Loss":[1.38,1.04,0.16,0.31],
                 "Test Log Loss":[1.37,4.6,0.74,0.32]}) 
 
preliminary_model_accuracy###>??? is there any way to get value??????

As shown in above summary, training accuracies range (Number of correct predictions /Total number of predictions) from 63% to 94% and test accuracies range between 63% to 90%. The knn model has the most overfitting because the value of test accuracy is much lower than training accuracy compare with other models. These overfitting trends are similar in the log loss scoring(log loss:uncertainly). So, the best test accuracy and test log loss score goes to Gradient Boosting model. should be noted that these initial models are not optimized; I only use the default hyperparameters.

**Accuracy by Attack:**

In [None]:
# Findout models predict a certain type of attack particularly well or terribly:

def get_accuracies(predict, y_true):
    y_true = y_true.reset_index()###????? im not sure is it correct????
    accuracy_lst = []
    for attack in df['attack_cat'].unique(): 
        count = 0
        for i in y_true[y_true==attack].index:
            if predict[i] == attack:
                count += 1
        accuracy_lst.append(count/y_true[y_true==attack].shape[0]*100)
    return accuracy_lst


In [None]:
# Findout the accuracy of each model for attack type:

lr_train_accuracies = get_accuracies(initial_lr_train_preds, y_train)
lr_test_accuracies = get_accuracies(initial_lr_test_preds, y_test)

knn_train_accuracies = get_accuracies(initial_knn_train_preds, y_train)
knn_test_accuracies = get_accuracies(initial_knn_test_preds, y_test) 

rfc_train_accuracies = get_accuracies(initial_rfc_train_preds, y_train)
rfc_test_accuracies = get_accuracies(initial_rfc_test_preds, y_test) 

#svc_train_accuracies = get_accuracies(initial_svc_train_preds, y_train)
#svc_test_accuracies = get_accuracies(initial_svc_test_preds, y_test)

gbc_train_accuracies = get_accuracies(initial_gbc_train_preds, y_train)
gbc_test_accuracies = get_accuracies(initial_gbc_test_preds, y_test)


In [None]:
# Look at the test accuracies for each model and attack type by using heatmap:

models_accuracies = {'lr': lr_test_accuracies, 'knn': knn_test_accuracies, 'rfc': rfc_test_accuracies, 'gbc': gbc_test_accuracies}
initial_df_test_accuracy = pd.DataFrame(models_accuracies, index = sorted(df['attack_cat'].unique()), columns = ['lr', 'knn', 'rfc', 'gbc'])

fig, ax = plt.subplots(figsize=(13, 5))

sns.heatmap(initial_df_test_accuracy.T, cmap = 'coolwarm', square = True, linewidths=0.1, annot=True)
plt.title('preliminary Test Accuracies', fontsize = 16)
plt.xlabel('Attack', fontsize = 13)
plt.ylabel('Model', fontsize = 13)
plt.tick_params(axis='both', which='major', labelsize=11)

As shown in heatmap plot to look at the accuracies for each model and attack type to predict, the highest accuracy came from Logistic Regression model with an accuracy of 61% for Analysis. Overall, Analysis Attack has highest accuracy in all model and models have very low accuracy to predict other attack types.


**_ Improving Scores**

- Feature Engineering: 

  I've already done a bit of feature engineering by converting nonnumeric columns to numeric. 

   - Using PCA for dimentional reduction.

In [None]:
# Applying PCA for feature reduction: 
X = df.drop(['attack_cat', 'id'], axis = 1)
Y = df['attack_cat']      

x = StandardScaler().fit_transform(X)
pca = PCA(0.90)
principalComponents = pca.fit_transform(x)

In [None]:
# Look at the pca components:
print(abs( pca.components_ )) 

In [None]:
# Findout number of components explained 90% of variance in the dataset:
pca_number = pca.n_components_
print(pca_number)

In [None]:
# print the percentage of total variance in the dataset explained by each components:
print(
    'The percentage of total variance in the dataset explained by each, component from Sklearn PCA.\n',
    pca.explained_variance_ratio_ ,pca.explained_variance_ratio_.sum() 
)

In [None]:
# Convert PCA to dataframe:
principalDf = pd.DataFrame(data = principalComponents, columns = ['pca' + str(i) for i in range (1, pca_number+1)])
principalDf.head()

In [None]:
# Concat PCA with target variable:
principalDf['attack_cat'] = df['attack_cat']
principalDf.dropna(inplace=True)


In [None]:
x = principalDf.drop('attack_cat', 1)
y = principalDf['attack_cat']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=20)

**1.1 Logistic Regression:**

In [None]:
# Applying logistic Regression model After applying PCA:

lr_initial = LogisticRegression( n_jobs=-1)

initial_lr_train_preds, initial_lr_test_preds, initial_lr_train_probs, initial_lr_test_probs = get_scores(lr_initial, ' Preliminary Logistic Regression model')


**2.1 K Neighbors Classifier:**

In [None]:
# Applying KNeighbors Classifier model:


knn_initial = KNeighborsClassifier(n_jobs=-1)

initial_knn_train_preds, initial_knn_test_preds, initial_knn_train_probs, initial_knn_test_probs = get_scores(knn_initial, 'Preliminary knn model')

**3.1 Random Forest Classifier:**

In [None]:
# Applying Random Forest Classifier model:

rfc_initial = RandomForestClassifier(n_jobs=-1)

initial_rfc_train_preds, initial_rfc_test_preds, initial_rfc_train_probs, initial_rfc_test_probs = get_scores(rfc_initial,'Preliminary Random Forest model')

**4.1 Support Vector Classifier:**

In [None]:
# Applying Support Vector Classifier: why took long time ???????

#svc_initial = SVC(gamma='auto', probability=True)

#initial_svc_train_preds, initial_svc_test_preds, initial_svc_train_probs, initial_svc_test_probs = get_scores(svc_initial,'Preliminary Support Vector Classification model')

**5.1 Preliminary Gradiant Boosting Classifier:**

In [None]:
# Applying Gradoant Boosting Classifier:

gbc_initial = GradientBoostingClassifier()

initial_gbc_train_preds, initial_gbc_test_preds, initial_gbc_train_probs, initial_gbc_test_probs = get_scores(gbc_initial, 'Preliminary Gradient Boosting model')

In [None]:
# Create a dataFrame with accuracy of different models using dictionary:
applying_pca_model_accuracy=pd.DataFrame({"Models":['Initial Logistic Regression', 'Initial knn', 'Initial Random Forest', 'Initial Gradient Boosting'], 
                 "Training Accuracy PCA":[0.86,0.89,0.94,0.89],
                 "Test Accuracy PCA":[0.86,0.87,0.87, 0.87],
                 "Training Log Loss PCA":[0.4,0.74,0.17,0.31],
                 "Test Log Loss PCA":[0.4,1.75,1.00,0.36]}) 
 
applying_pca_model_accuracy

As shown in above summary, training accuracies after applying PCA is in range (Number of correct predictions /Total number of predictions) from 86% to 94% and test accuracies range between 66% to 87%. The Random Forest model has the most overfitting because the value of test accuracy is much lower than training accuracy compare with other models. These overfitting trends are similar in the log loss scoring(log loss:uncertainly). So, the best test accuracy and test log loss score goes to Gradient Boosting model. should be noted that these initial models are not optimized; I only use the default hyperparameters.

In [None]:
# Findout models predict a certain type of attack particularly well or terribly:

def get_accuracies(predict, y_true):
    y_true = y_true.reset_index()###????? im not sure is it correct????
    accuracy_lst = []
    for attack in df['attack_cat'].unique(): 
        count = 0
        for i in y_true[y_true==attack].index:
            if predict[i] == attack:
                count += 1
        accuracy_lst.append(count/y_true[y_true==attack].shape[0]*100)
    return accuracy_lst

In [None]:
# Findout the accuracy of each model for attack type:

lr_train_accuracies = get_accuracies(initial_lr_train_preds, y_train)
lr_test_accuracies = get_accuracies(initial_lr_test_preds, y_test)

knn_train_accuracies = get_accuracies(initial_knn_train_preds, y_train)
knn_test_accuracies = get_accuracies(initial_knn_test_preds, y_test) 

rfc_train_accuracies = get_accuracies(initial_rfc_train_preds, y_train)
rfc_test_accuracies = get_accuracies(initial_rfc_test_preds, y_test) 

#svc_train_accuracies = get_accuracies(initial_svc_train_preds, y_train)
#svc_test_accuracies = get_accuracies(initial_svc_test_preds, y_test)

gbc_train_accuracies = get_accuracies(initial_gbc_train_preds, y_train)
gbc_test_accuracies = get_accuracies(initial_gbc_test_preds, y_test)

In [None]:
# Look at the test accuracies for each model and attack type by using heatmap:

models_accuracies = {'lr': lr_test_accuracies, 'knn': knn_test_accuracies, 'rfc': rfc_test_accuracies, 'gbc': gbc_test_accuracies}
initial_df_test_accuracy = pd.DataFrame(models_accuracies, index = sorted(df['attack_cat'].unique()), columns = ['lr', 'knn', 'rfc', 'gbc'])

fig, ax = plt.subplots(figsize=(13, 5))

sns.heatmap(initial_df_test_accuracy.T, cmap = 'coolwarm', square = True, linewidths=0.1, annot=True)
plt.title('preliminary Test Accuracies', fontsize = 16)
plt.xlabel('Attack', fontsize = 13)
plt.ylabel('Model', fontsize = 13)
plt.tick_params(axis='both', which='major', labelsize=11)

As you can see, after applying PCA,the accuracy of all models get a bit high but not much.

- Adding External Sources:

I add BoT-IoT  dataset which is new approaches of authors for developing, Intrusion Detection and threat intelligence approaches in different systems, such as Network Systems.

The BoT-IoT dataset was created by designing a realistic network environment in the Cyber Range Lab of The center of UNSW Canberra Cyber. The environment incorporates a combination of normal and botnet traffic. The dataset’s source files are provided in csv files. The files were separated, based on attack category and subcategory, to better assist in labeling process. The dataset includes DDoS, DoS, OS and Service Scan, Keylogging and Data exfiltration attacks, with the DDoS and DoS attacks further organized, based on the protocol used. To ease the handling of the dataset, I used top 10 features of  5% of dataset which is configured as a training set. namely: UNSW_2018_IoT_Botnet_Final_10_best_Training.csv. 

In [17]:
# Load new dataset:
df_new = pd.read_csv(r'C:\Users\mebra.DESKTOP-L12LJA6\Thinkful Works\PythonThinkful\capstonbotdataset\UNSW_2018_IoT_Botnet_Final_10_best_Training.csv')


In [None]:
# Look at the new dataset:
df_new.head()

In [None]:
# Look at the shape of new dataset:
df_new.shape

In [None]:
# Look at the type of new dataset:
df_new.info()

In [None]:
# Findout object columns:

df_new_object_columns = df_new.select_dtypes('object')
df_new_object_columns.head()

In [None]:
df_main.head()

In [None]:
df_new.head()

In [None]:
df_new = df_new.rename(columns={"Attack" : "label"})

In [None]:
type(df_main['proto'])

In [21]:
# Merge the dataset with new one:

df_main = df_main.merge(df_new, on ='proto')

TypeError: object of type 'NoneType' has no len()

In [None]:
# Look at the new dataset:
df_final.head()