![GMV](https://upload.wikimedia.org/wikipedia/en/3/31/GMV_logo_small_gif.gif) 
![Apache Spark](http://spark.apache.org/images/spark-logo.png)

# Intrusion Detection System

Abstract 
--------

The security and privacy of a system are compromised when an intrusion happens. An Intrusion Detection System (IDS) plays a vital role in network security because it detects various types of attacks in a network. Here we propose an Intrusion Detection System based on machine learning algorithms. We build it through a series of experiments on the [NSL-KDD](http://nsl.cs.unb.ca/NSL-KDD/) data set, an improved version of the [KDD Cup’99 data set](http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data).

Data Mining Process
------------------- 
The [Cross Industry Standard Process for Data Mining](https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) (CRISP-DM) is a process model for data mining, introduced in 2000, that has become widely adopted.

<a href="https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining"> 
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/479px-CRISP-DM_Process_Diagram.png" title="Cross Industry Standard Process for Data Mining" alt="CRISP-DM_Process_Diagram"/></a>



The model emphasizes the ***iterative*** nature of the data mining process, distinguishing several stages that are regularly revisited in the course of developing and deploying data-driven solutions to business problems:
* Business understanding
* Data understanding
* Data preparation
* Modeling
* Evaluation
* Deployment

# Business Understanding

Cyber security threats are rising and putting a great number of organizations at risk. Any organization can become a target for attackers, who can cause it serious damage.

For these reasons it is necessary to provide a system that helps to detect intruders in a network. This intrusion detection system should be non-invasive to other systems, so that it can be deployed quickly and deliver results as soon as possible.

It is also important to balance the number of false positives (which increase the maintenance cost of the system) against the number of false negatives (which would allow intruders to achieve their objectives).
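The trade-off between false positives and false negatives can be made concrete with a few rates derived from a confusion matrix. The sketch below is only an illustration: the function name `detection_rates` and the counts passed to it are hypothetical and do not come from this data set.

```python
def detection_rates(tp, fp, tn, fn):
    """Return the rates that matter when balancing alerts against misses."""
    return {
        "false_positive_rate": fp / (fp + tn),  # normal traffic flagged as attack
        "false_negative_rate": fn / (fn + tp),  # attacks that went undetected
        "precision": tp / (tp + fp),            # share of alerts that are real attacks
        "recall": tp / (tp + fn),               # share of attacks that raised an alert
    }


# Hypothetical confusion-matrix counts, for illustration only
rates = detection_rates(tp=90, fp=30, tn=870, fn=10)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Lowering the alert threshold of a detector usually reduces the false negative rate at the price of a higher false positive rate, so both numbers should be tracked together.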

# Data Understanding

The **KDD Cup'99 data set** was used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between **"bad" connections**, called intrusions or attacks, and **"good" normal connections**. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

**NSL-KDD** is a data set proposed to solve some of the inherent problems of the KDD'99 data set. Although this new version still suffers from some of those problems and may not be a perfect representative of existing real networks, the lack of public data sets for network-based IDSs means it can still be applied as an effective benchmark to help researchers compare different intrusion detection methods. Furthermore, the number of records in the NSL-KDD train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, the evaluation results of different research works will be consistent and comparable.

A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol.  Each connection is labeled as either normal, or as an attack, with exactly one specific attack type.  Each connection record consists of about 100 bytes.

Attacks fall into four main categories:

* DOS: denial-of-service, e.g. syn flood;
* R2L: unauthorized access from a remote machine, e.g. guessing password;
* U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
* probing: surveillance and other probing, e.g., port scanning.
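For analysis it is often convenient to collapse the individual attack labels into these four categories. The following is a minimal sketch of such a lookup. The label spellings are an assumption based on the labels commonly found in the KDD'99 training data, and the helper `categorize` is hypothetical; check the labels against your own files before relying on it.

```python
# Mapping from attack label to category.
# Assumption: label spellings follow the KDD'99 training data.
ATTACK_CATEGORY = {
    "back": "DOS", "land": "DOS", "neptune": "DOS",
    "pod": "DOS", "smurf": "DOS", "teardrop": "DOS",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
    "ipsweep": "probing", "nmap": "probing", "portsweep": "probing", "satan": "probing",
}


def categorize(label):
    """Return the category of a connection label, or 'unknown' if it is not mapped."""
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")


for label in ["normal", "neptune", "rootkit", "some_new_attack"]:
    print(label, "->", categorize(label))
```

Returning `'unknown'` instead of raising an error keeps the lookup usable on test data, which contains attack types that never appear in the training data.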

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data.  This makes the task more realistic.  Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants.  The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only. 
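A quick way to see which attack types are new in the test data is a set difference between the labels of both files. This sketch uses small toy label sets rather than the real files, so the values shown are illustrative only.

```python
# Toy label sets, for illustration only (not read from the real files)
train_labels = {"normal", "neptune", "smurf", "satan"}
test_labels = {"normal", "neptune", "satan", "mscan", "apache2"}

# Labels that a model trained on the training set has never seen
unseen = sorted(test_labels - train_labels)
print(unseen)
```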

### Derived Features ###

Higher-level features have been defined that help in distinguishing *normal* connections from *attacks*.  There are several categories of derived features.

**Intrinsic attributes**

These attributes are extracted from the headers' area of the network packets.

Col|Feature name  | description |	type
---|--------------|-------------|------------
1  |duration 	  |length (number of seconds) of the connection |continuous
2  |protocol_type |type of the protocol, e.g. tcp, udp, etc. |discrete
3  |service 	  |network service on the destination, e.g., http, telnet, etc. |discrete
4  |flag 	      |normal or error status of the connection. The possible statuses are: SF, S0, S1, S2, S3, OTH, REJ, RSTO, RSTOS0, SH, RSTRH, SHR 	|discrete
5  |src_bytes 	  |number of data bytes from source to destination 	|continuous
6  |dst_bytes 	  |number of data bytes from destination to source 	|continuous
7  |land 	      |1 if connection is from/to the same host/port; 0 otherwise 	|discrete
8  |wrong_fragment|sum of bad checksum packets in a connection 	|continuous
9  |urgent 	      |number of urgent packets. Urgent packets are packets with the urgent bit activated 	|continuous



**Content attributes** 

These attributes are extracted from the contents area of the network packets based on expert knowledge.

Col|Feature name	      |description 	|type
---|----------------------|-------------|-----
10 |hot 	              |sum of hot actions in a connection such as: entering a system directory, creating programs and executing programs	|continuous
11 |num_failed_logins 	  |number of failed login attempts 	|continuous
12 |logged_in 	          |1 if successfully logged in; 0 otherwise 	|discrete
13 |num_compromised 	  |number of "compromised" conditions 	|continuous
14 |root_shell 	          |1 if root shell is obtained; 0 otherwise 	|discrete
15 |su_attempted 	      |1 if "su root" command attempted; 0 otherwise 	|discrete
16 |num_root 	          |number of "root" accesses 	|continuous
17 |num_file_creations 	  |number of file creation operations 	|continuous
18 |num_shells 	          |number of shell prompts 	|continuous
19 |num_access_files 	  |number of operations on access control files 	|continuous
20 |num_outbound_cmds	  |number of outbound commands in an ftp session 	|continuous
21 |is_hot_login 	      |1 if the login belongs to the "hot" list; 0 otherwise 	|discrete
22 |is_guest_login 	      |1 if the login is a "guest" login; 0 otherwise 	|discrete

**Traffic attributes**

These attributes are calculated taking previous connections into account. The 19 attributes (9 + 10) are divided into two groups: (1) time traffic features and (2) machine traffic features. The two groups differ in how the previous connections are selected.


*Time traffic attributes*

To calculate these attributes we considered the connections that occurred in the past 2 seconds.

Col|Feature name	      |description 	|type
---|----------------------|-------------|-----
23 |count 	              |sum of connections to the same destination IP address |continuous
24 |srv_count 	          |sum of connections to the same destination port number |continuous
25 |serror_rate 	      |the percentage of connections that have activated the flag (4) s0, s1, s2 or s3, among the connections aggregated in count (23)|continuous
26 |srv_serror_rate 	  |the percentage of connections that have activated the flag (4) s0, s1, s2 or s3, among the connections aggregated in srv_count (24)|continuous
27 |rerror_rate 	      |the percentage of connections that have activated the flag (4) REJ, among the connections aggregated in count (23)|continuous
28 |srv_rerror_rate 	  |the percentage of connections that have activated the flag (4) REJ, among the connections aggregated in srv_count (24)|continuous
29 |same_srv_rate 	      |the percentage of connections that were to the same service, among the connections aggregated in count (23)|continuous
30 |diff_srv_rate 	      |the percentage of connections that were to different services, among the connections aggregated in count (23)|continuous
31 |srv_diff_host_rate 	  |the percentage of connections that were to different destination machines among the connections aggregated in srv_count (24)|continuous
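The time-window aggregation behind these attributes can be sketched in a few lines. This is a simplified reconstruction, not the original feature extractor: the `Conn` record, the function name and the sample connections are hypothetical, the service name stands in for the destination port, and the current connection is assumed not to count itself.

```python
from collections import namedtuple

# Hypothetical minimal connection record
Conn = namedtuple("Conn", "time dst_host service flag")

SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}


def time_traffic_features(history, current, window=2.0):
    """Compute count, srv_count and serror_rate from the connections
    seen in the `window` seconds before `current`."""
    recent = [c for c in history if 0 <= current.time - c.time <= window]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]
    syn_errors = [c for c in same_host if c.flag in SYN_ERROR_FLAGS]
    count = len(same_host)
    return {
        "count": count,
        "srv_count": len(same_srv),
        "serror_rate": len(syn_errors) / count if count else 0.0,
    }


history = [
    Conn(0.0, "10.0.0.5", "http", "SF"),  # older than 2 seconds, ignored
    Conn(1.5, "10.0.0.5", "http", "S0"),
    Conn(2.0, "10.0.0.5", "ftp", "SF"),
    Conn(2.5, "10.0.0.9", "http", "SF"),
]
current = Conn(3.0, "10.0.0.5", "http", "SF")
features = time_traffic_features(history, current)
print(features)
```

The remaining rates in the table follow the same pattern: filter the recent connections by host or by service, then divide the number of matching connections by the corresponding count.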

*Machine traffic attributes*

To calculate these attributes we took into account the previous 100 connections.

Col|Feature name	      |description 	|type
---|----------------------|-------------|-----
32 |dst_host_count        |sum of connections to the same destination IP address   |continuous
33 |dst_host_srv_count    |sum of connections to the same destination port number  |continuous
34 |dst_host_same_srv_rate|the percentage of connections that were to the same service, among the connections aggregated in dst_host_count (32)|continuous 
35 |dst_host_diff_srv_rate|the percentage of connections that were to different services, among the connections aggregated in dst_host_count (32)|continuous 
36 |dst_host_same_src_port_rate|the percentage of connections that were to the same source port, among the connections aggregated in dst_host_srv_count (33)|continuous 
37 |dst_host_srv_diff_host_rate|the percentage of connections that were to different destination ma- chines, among the connections aggregated in dst_host_srv_count (33)|continuous 
38 |dst_host_serror_rate  |the percentage of connections that have activated the flag (4) s0, s1, s2 or s3, among the connections aggregated in dst_host_count (32)|continuous
39 |dst_host_srv_serror_rate|the percent of connections that have activated the flag (4) s0, s1, s2 or s3, among the connections aggregated in dst_host_srv_count (33)|continuous 
40 |dst_host_rerror_rate  |the percentage of connections that have activated the flag (4) REJ, among the connections aggregated in dst_host_count (32)|continuous
41 |dst_host_srv_rerror_rate|the percentage of connections that have activated the flag (4) REJ, among the connections aggregated in dst_host_srv_count (33)|continuous 
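Unlike the time-based group, these attributes use a window defined by a number of connections. A bounded `deque` is a natural fit for that. The class below is a hypothetical sketch under simplifying assumptions (only two of the ten attributes, and a window of 3 in the usage example so that the effect is easy to follow); it is not the original extraction code.

```python
from collections import deque


class HostWindow:
    """Keep the last `size` connections and derive host-based features from them."""

    def __init__(self, size=100):
        # deque(maxlen=...) silently drops the oldest entry when it is full
        self.recent = deque(maxlen=size)

    def add(self, dst_host, service):
        self.recent.append((dst_host, service))

    def features(self, dst_host, service):
        # Services of the remembered connections that went to the same host
        services = [s for h, s in self.recent if h == dst_host]
        count = len(services)
        same_srv = services.count(service)
        return {
            "dst_host_count": count,
            "dst_host_same_srv_rate": same_srv / count if count else 0.0,
        }


window = HostWindow(size=3)  # small window for illustration; the text uses 100
for host, service in [("a", "http"), ("a", "ftp"), ("b", "http"), ("a", "http")]:
    window.add(host, service)

result = window.features("a", "http")
print(result)
```

In the usage example the first connection has already been pushed out of the window, so only two connections to host `a` remain and one of them uses `http`.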



**Class attribute**

The 42nd attribute is the ***class*** attribute. It indicates the type of connection of each instance: normal, or a specific attack. The values it can take are the following: *anomaly, dict, dict_simple, eject, eject-fail, ffb, ffb_clear, format, format_clear, format-fail, ftp-write, guest, imap, land, load_clear, loadmodule, multihop, perl_clear, perlmagic, phf, rootkit, spy, syslog, teardrop, warez, warezclient, warezmaster, pod, back, ipsweep, neptune, nmap, portsweep, satan, smurf* and *normal*.

# Apache Spark Initialization

In [None]:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="SecurityDataScience")

## Data Exploration

In [None]:
%matplotlib inline

### Libraries

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

### Seaborn palette setting

In [None]:
# sns.set() resets the whole seaborn theme, so apply it first and tune the palette afterwards
sns.set(context="notebook", style="whitegrid", font='serif', font_scale=0.75)
sns.set_palette("deep", desat=.6)

### Reading Data File

In [None]:
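from pyspark.sql import SQLContext
# Import only the names used below: a star import from pyspark.sql.functions
# would shadow Python built-ins such as sum, min and max
from pyspark.sql.types import StructField, StructType, FloatType, StringType
from pyspark.sql.functions import log
sqlContext = SQLContext(sc)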

In [None]:
textFileConn = sc.textFile('./data/KDD/KDDTrain+.txt')

In [None]:
textFileConn.count()

In [None]:
textFileConn.take(3)

In [None]:
#Creating the schema

#we define the name of the columns
columnNames = ["duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
              "wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised",
              "root_shell","su_attempted","num_root","num_file_creations","num_shells",
              "num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count",
              "srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate",
              "same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count",
              "dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate",
              "dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate",
              "dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate",
              "attack"
              ]

In [None]:
#quick initialization of all fields as FloatType
connFields = [StructField(colName, FloatType(), True) for colName in columnNames]

In [None]:
#we proceed to modify the respective fields so that they reflect the correct data type:
connFields[1].dataType = StringType()
connFields[2].dataType = StringType()
connFields[3].dataType = StringType()
connFields[6].dataType = StringType()
connFields[11].dataType = StringType()
connFields[13].dataType = StringType()
connFields[14].dataType = StringType()
connFields[20].dataType = StringType()
connFields[21].dataType = StringType()
connFields[41].dataType = StringType()

In [None]:
connFields

In [None]:
# we can construct our schema, which we will use later below for building the data frame
connSchema = StructType(connFields)

In [None]:
connSchema

In [None]:
#Parsing the file
def parseReg(p):
    return ( float(p[0])
            ,p[1], p[2], p[3] 
            ,float(p[4])
            ,float(p[5])
            ,p[6]
            ,float(p[7])
            ,float(p[8])
            ,float(p[9])
            ,float(p[10])
            ,p[11]
            ,float(p[12])
            ,p[13], p[14]
            ,float(p[15])
            ,float(p[16])
            ,float(p[17])
            ,float(p[18])
            ,float(p[19])
            ,p[20], p[21]
            ,float(p[22])
            ,float(p[23])
            ,float(p[24])
            ,float(p[25])
            ,float(p[26])
            ,float(p[27])
            ,float(p[28])
            ,float(p[29])
            ,float(p[30])
            ,float(p[31])
            ,float(p[32])
            ,float(p[33])
            ,float(p[34])
            ,float(p[35])
            ,float(p[36])
            ,float(p[37])
            ,float(p[38])
            ,float(p[39])
            ,float(p[40])
            ,p[41])

In [None]:
connParsedFile = (textFileConn.map(lambda line: line.split(','))
                              .map(parseReg))
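The same parsing logic can be written more compactly by listing the positions of the string columns once. The sketch below is a plain-Python equivalent that can be tried without Spark. The names `STRING_COLS` and `parse_record` are hypothetical, and the sample line is a synthetic record in the file's format, with an assumed trailing 43rd field that is ignored, just as `parseReg` ignores it.

```python
# Positions of the columns kept as strings (same indices as in the schema)
STRING_COLS = {1, 2, 3, 6, 11, 13, 14, 20, 21, 41}
N_COLS = 42  # 41 features plus the attack label


def parse_record(line):
    """Split one CSV line and cast every non-string column to float."""
    fields = line.strip().split(",")
    return tuple(
        fields[i] if i in STRING_COLS else float(fields[i])
        for i in range(N_COLS)
    )


# Synthetic record, for illustration only
sample = ",".join(
    ["0", "tcp", "ftp_data", "SF", "491"] + ["0"] * 17 + ["2", "2"]
    + ["0.00"] * 4 + ["1.00", "0.00", "0.00", "150", "25"]
    + ["0.17", "0.03", "0.17", "0.00", "0.00", "0.00", "0.05", "0.00", "normal", "20"]
)

record = parse_record(sample)
print(len(record), record[1], record[4], record[41])
```

Keeping the indices in one set makes it harder for the schema and the parser to drift apart, since both can be derived from the same definition.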

In [None]:
# We are now ready to build our data frame, using the connParsedFile RDD computed above and the schema 
# variable already calculated:
conn = sqlContext.createDataFrame(connParsedFile, connSchema)
conn.cache()

In [None]:
# Register the DataFrame as a temporary table so that it can be queried with SQL.
conn.registerTempTable("connection")

### Auxiliary functions

In [None]:
def percentageOf(df, colName=''):
    """Calculate the percentage of each categorical value of colName from the Spark DataFrame df
    
    Keyword arguments:
    df -- the DataFrame
    colName -- the name of the column
    """
    rows = df.groupBy(colName).count().collect()
    total = 0
    for r in rows:
        total += r.asDict()["count"]
    dictResult = {r.asDict()[colName]: 1.0*r.asDict()["count"]/total for r in rows}
    return sorted(dictResult.items(), key=lambda x: x[1], reverse= True)

In [None]:
def numberOf(df, colName=''):
    """Calculate the number of each categorical value of colName from the Spark DataFrame df
    
    Keyword arguments:
    df -- the DataFrame
    colName -- the name of the column
    """
    rows = df.groupBy(colName).count().collect()
    dictResult = {r.asDict()[colName]: r.asDict()['count'] for r in rows}
    return sorted(dictResult.items(), key=lambda x: x[1], reverse= True)
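To make clear what these two helpers return, here is a plain-Python sketch of the same computation on an in-memory list. The function names and the toy values are hypothetical; the real helpers run the grouping in Spark and only collect the aggregated rows.

```python
from collections import Counter


def number_of(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def percentage_of(values):
    """Return (value, fraction) pairs, most frequent first."""
    total = len(values)
    return [(value, n / total) for value, n in number_of(values)]


protocols = ["tcp", "udp", "tcp", "tcp"]  # toy values
print(number_of(protocols))
print(percentage_of(protocols))
```

Note that, like `percentageOf`, this returns fractions between 0 and 1 rather than values between 0 and 100.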

### Univariate Analysis

* **name:** duration
* **description:** length (number of seconds) of the connection
* **type:** continuous

In [None]:
#Statistics with all data
conn.describe('duration').toPandas()

##### The data set is big --> Getting only a sample of the data set for the analysis

In [None]:
connSample = conn.sample(False, 0.05).toPandas()  # sample without replacement to avoid duplicated rows

In [None]:
connSample.duration.describe()

In [None]:
sns.distplot(connSample.duration, bins = 10);

##### Skewed distribution -> log transformation

In [None]:
np.log(connSample.duration+1).describe()

In [None]:
sns.distplot(np.log(connSample.duration+1), bins=10);
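The effect of the log(x+1) transformation on a right-skewed variable can be checked numerically with the moment-based skewness coefficient. The sketch below uses toy values, not the real `duration` column, and the helper `skewness` is hypothetical. It relies on `math.log1p`, which computes log(1 + x) directly; NumPy offers the equivalent `np.log1p`.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = math.fsum(values) / n
    m2 = math.fsum((v - mean) ** 2 for v in values) / n
    m3 = math.fsum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


durations = [0, 0, 0, 0, 1, 2, 5, 40, 300, 12000]  # toy values
logged = [math.log1p(d) for d in durations]

print("skewness before:", skewness(durations))
print("skewness after: ", skewness(logged))
```

The transformation compresses the long right tail, so the skewness drops, while zeros stay at zero because log(0 + 1) = 0.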

##### Let's inspect how many values are equal to 0:

In [None]:
print("Numbers of 0's in duration feature: ", len(connSample.duration[connSample.duration == 0]))
print("The percentage of 0's in duration feature:", 100.0*len(connSample.duration[connSample.duration == 0])/len(connSample.duration))

##### Inspecting the distribution without 0's

In [None]:
np.log(connSample.duration[connSample.duration>0]+1).describe()

In [None]:
sns.distplot(np.log(connSample.duration[connSample.duration > 0]+1), bins=10);

---

* **name:** protocol_type
* **description:** type of the protocol, e.g. tcp, udp, etc.
* **type:** discrete

In [None]:
#Statistics with all data
conn.describe('protocol_type').toPandas()

In [None]:
#Get the percentage of the values
percentageOf(conn, 'protocol_type')

In [None]:
#Statistics with the sample data
connSample.protocol_type.describe()

In [None]:
sns.countplot(x='protocol_type', data = connSample, palette='Paired');

In [None]:
print("Percentages of protocol_type: ")
print(1.0*connSample.protocol_type.value_counts()/len(connSample.protocol_type))

------------------------------

* **name:** service
* **description:** network service on the destination, e.g., http, telnet, etc.
* **type:** discrete

In [None]:
#Statistics with all data
conn.describe('service').toPandas()

In [None]:
#Get the frequent values (freqItems is approximate: the default support is 1%, so rare values can be missing)
conn.freqItems(['service']).collect()[0]

In [None]:
#Get the percentage of the values
percentageOf(conn, 'service')

In [None]:
connSample.service.describe()

In [None]:
sns.countplot(x='service', data=connSample, palette='Paired');

In [None]:
sns.set(font= 'serif', font_scale= 0.65)
sns.countplot(y='service', data=connSample, palette='Paired');

In [None]:
print("Percentages of service's value:")
print(1.0*connSample.service.value_counts()/len(connSample.service))

---

* **name:** flag
* **description:** normal or error status of the connection. The possible statuses are: SF, S0, S1, S2, S3, OTH, REJ, RSTO, RSTOS0, SH, RSTRH, SHR
* **type:** discrete

In [None]:
#Statistics with all data
conn.describe('flag').toPandas()

In [None]:
#Get the frequent values (freqItems is approximate: the default support is 1%, so rare values can be missing)
conn.freqItems(['flag']).collect()[0]

In [None]:
#Get the percentage of the values
percentageOf(conn, 'flag')

In [None]:
#Get stats with the Sample
connSample.flag.describe()

In [None]:
sns.countplot(x='flag', data=connSample, palette='Paired');

In [None]:
connSample.flag.value_counts()

In [None]:
print("Percentages of flag's value:")
print(1.0*connSample.flag.value_counts()/len(connSample.flag))

---

* **name:** src_bytes
* **description:** number of data bytes from source to destination
* **type:** continuous

In [None]:
#Statistics with all data
conn.describe('src_bytes').toPandas()

In [None]:
#Stats with the sample
connSample.src_bytes.describe()

In [None]:
sns.distplot(connSample.src_bytes, bins=10);

##### Skewed distribution -> log transformation

In [None]:
np.log(connSample.src_bytes+1).describe()

In [None]:
sns.distplot(np.log(connSample.src_bytes+1), bins = 10);

---

* **name:** dst_bytes
* **description:** number of data bytes from destination to source
* **type:** continuous

In [None]:
#Statistics with all data
conn.describe('dst_bytes').toPandas()

In [None]:
#Log transformation with all data
conn.select(log(conn['dst_bytes']+1)).describe().toPandas()

In [None]:
#Statistics with sample data
np.log(connSample.dst_bytes+1).describe()

In [None]:
sns.distplot(np.log(connSample.dst_bytes+1), bins = 10);

---

* **name:** land
* **description:** 1 if connection is from/to the same host/port; 0 otherwise
* **type:** discrete

In [None]:
#Statistics with all data
conn.describe('land').toPandas()

In [None]:
numberOf(conn, 'land')

In [None]:
connSample.land.describe()

In [None]:
sns.countplot(x='land', data=connSample, palette='Paired');

In [None]:
connSample.land.value_counts()

In [None]:
print("Percentages of land's value:")
print(1.0*connSample.land.value_counts()/len(connSample.land))

---

* **name:** wrong_fragment	
* **description:** sum of bad checksum packets in a connection
* **type:** continuous

In [None]:
#Statistics with all data
conn.describe('wrong_fragment').toPandas()

In [None]:
#Get the number of instances of each value
numberOf(conn, 'wrong_fragment')

In [None]:
#Stat from sample data
connSample.wrong_fragment.describe()

In [None]:
sns.countplot(x= 'wrong_fragment', data= connSample, palette='Paired');

In [None]:
connSample.wrong_fragment.value_counts()

In [None]:
print("Percentages of wrong_fragment's value:")
print(1.0*connSample.wrong_fragment.value_counts()/len(connSample.wrong_fragment))

---

* **name:** urgent	
* **description:** number of urgent packets. Urgent packets are packets with the urgent bit activated
* **type:** continuous

In [None]:
#Statistics with all data
conn.describe('urgent').toPandas()

In [None]:
#Get the number of instances of each value
numberOf(conn, 'urgent')

In [None]:
connSample.urgent.describe()

In [None]:
connSample.urgent.value_counts()

In [None]:
print("Percentages of urgent's value:")
print(1.0*connSample.urgent.value_counts()/len(connSample.urgent))

In [None]:
sns.countplot(x = 'urgent', data = connSample, palette="Paired");

#### Conclusions: Univariate Analysis

The most relevant conclusions of the univariate analysis are:
* **duration:** is highly right-skewed. In fact, 92% of the values are 0. After the log transformation, the non-zero values seem to follow a bimodal distribution
* **protocol_type:** most of the instances have the value *tcp* (81%)
* **service:** only two of the 70 values (*http* and *private*) account for 50% of the instances
* **flag:** only three of the 11 values (*SF*, *S0* and *REJ*) account for 95% of the instances
* **src_bytes:** is right-skewed. The log transformation seems to have a bimodal distribution
* **dst_bytes:** is right-skewed. The log transformation seems to have a bimodal distribution
* **land:** is actually a discrete feature with two values: 0 and 1. 99.98% of the instances have the value 0
* **wrong_fragment:** is a continuous variable with only three values: 0, 1 and 3. 99.13% of the instances have the value 0
* **urgent:** is a continuous variable with only four values: 0, 1, 2 and 3. 99.99% of the instances have the value 0


----------

### Bivariate Analysis

### duration vs protocol_type

* **name:** duration
* **description:** length (number of seconds) of the connection
* **type:** continuous


* **name:** protocol_type
* **description:** type of the protocol, e.g. tcp, udp, etc.
* **type:** discrete

In [None]:
sns.boxplot(x = 'protocol_type', y='duration', data = connSample, palette= 'Paired');

In [None]:
connSample['duration_log'] = np.log(connSample.duration+1)

In [None]:
sns.boxplot(x = 'protocol_type', y = 'duration_log', data = connSample, palette= 'Paired');

##### Inspecting the distribution with duration_log more than 0

In [None]:
sns.boxplot(x= 'protocol_type', y= 'duration_log'
            , data = connSample[connSample.duration_log > 0]
            , palette = 'Paired');

##### Inspecting the distribution with duration_log == 0

In [None]:
sns.countplot( x='protocol_type' 
              , data = connSample[connSample.duration_log == 0]
              , palette = 'Paired');

#### Conclusions duration vs protocol_type analysis
* Regardless of the *protocol_type* value, most of the *duration* values are 0
* When *protocol_type* is *'icmp'*, all the *duration* values are 0
* After applying the log(x+1) transformation to *duration* and removing the 0 values, the distribution of *duration* is very different between *'tcp'* and *'udp'*




----------------------------------------------

### Exercise 1
Make the *duration* vs *service* analysis

----------

In [None]:
connSample[connSample.duration_log>0].boxplot('duration_log', 
                                              by='service', 
                                              rot = 45, 
                                              figsize = (15, 7));

In [None]:
ax = sns.countplot(y='service', data = connSample[connSample.duration_log == 0], palette= 'Paired')
# rotate the tick labels after plotting, on the axes that countplot returns
plt.setp(ax.get_xticklabels(), rotation=90);

### duration vs flag

* **name:** duration
* **description:** length (number of seconds) of the connection
* **type:** continuous


* **name:** flag
* **description:** normal or error status of the connection. The possible statuses are: SF, S0, S1, S2, S3, OTH, REJ, RSTO, RSTOS0, SH, RSTRH, SHR
* **type:** discrete

In [None]:
connSample[connSample.duration > 0].boxplot('duration_log', by='flag');

##### Inspecting the distribution with duration_log == 0

In [None]:
sns.countplot(x = 'flag', data = connSample[connSample.duration_log == 0], palette= 'Paired');

In [None]:
print("Percentages of flag's value:")
print(1.0*connSample[connSample.duration_log == 0].flag.value_counts()/len(connSample[connSample.duration_log == 0].flag))

#### Conclusions duration vs flag analysis
* Regardless of the *flag* value, most of the *duration* values are 0
* After applying the log(x+1) transformation to *duration* and removing the 0 values, the distribution of *duration* is very different in the *'RSTR'* and *'RSTOS0'* categories
* If *duration_log* is 0, only three of the 11 values (*SF*, *S0* and *REJ*) account for 96% of the instances


----------------------------------------------

### duration vs src_bytes

* **name:** duration
* **description:** length (number of seconds) of the connection
* **type:** continuous


* **name:** src_bytes
* **description:** number of data bytes from source to destination
* **type:** continuous

In [None]:
connSample['src_bytes_log'] = np.log(connSample.src_bytes+1)

**Inspecting with duration > 0**

In [None]:
color = sns.color_palette()[2]
sns.jointplot("duration_log", "src_bytes_log", data=connSample[connSample.duration>0], 
              kind="reg", color = color, size=7);

In [None]:
#with kde (Kernel density estimation)
color = sns.color_palette()[3]
sns.jointplot("duration_log", "src_bytes_log", 
              data=connSample[connSample.duration>0], 
              kind="kde", color = color, size=7);

**Inspecting with duration == 0**

In [None]:
sns.distplot(connSample[connSample.duration_log == 0].src_bytes_log, bins = 10);

**Inspecting with src_bytes_log == 0**

In [None]:
sns.distplot(connSample[connSample.src_bytes_log == 0].duration_log, bins = 10);

#### Conclusions duration vs src_bytes analysis
The analysis was split into three cases:
* Case 1: *duration_log* and *src_bytes_log* are greater than 0:
    - There is no strong correlation between *duration_log* and *src_bytes_log*
    - On the other hand, the two variables seem to form some clusters
* Case 2: *duration_log* is equal to 0:
    - *src_bytes* seems to have a multimodal distribution
* Case 3: *src_bytes_log* is equal to 0:
    - Most of the values of *duration_log* are equal to 0
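The visual impression from Case 1 can be backed by a number: the Pearson correlation coefficient of the two log-transformed variables, restricted to rows where both are greater than 0. The sketch below implements the coefficient in plain Python on toy pairs; the helper `pearson` and the values are hypothetical and only show the mechanics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = math.fsum(xs) / n
    my = math.fsum(ys) / n
    cov = math.fsum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(math.fsum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(math.fsum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy (duration, src_bytes) pairs, for illustration only
pairs = [(0, 100), (3, 0), (12, 1500), (40, 220), (300, 90), (900, 40000)]

# Case 1: keep only the rows where both values are greater than 0
case1 = [(math.log1p(d), math.log1p(b)) for d, b in pairs if d > 0 and b > 0]
r = pearson([d for d, _ in case1], [b for _, b in case1])
print("Pearson r on the log scale:", r)
```

On the pandas sample, the same figure can be obtained by applying `DataFrame.corr()` to the two log columns after filtering the rows.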




----------------------------------------------

### duration vs dst_bytes

* **name:** duration
* **description:** length (number of seconds) of the connection
* **type:** continuous


* **name:** dst_bytes
* **description:** number of data bytes from destination to source
* **type:** continuous

In [None]:
connSample['dst_bytes_log'] = np.log(connSample.dst_bytes+1)

In [None]:
sns.jointplot("duration_log", "dst_bytes_log", 
              data= connSample[connSample.duration>0], 
              kind="reg", size=7);

In [None]:
#with kde (Kernel density estimation)
color = sns.color_palette()[2]
sns.jointplot("duration_log", "dst_bytes_log", 
              data= connSample[connSample.duration>0], 
              color = color, kind="kde", size=7);

**Inspecting with duration == 0**

In [None]:
sns.distplot(connSample[connSample.duration_log == 0].dst_bytes_log, bins = 10);

**Inspecting with dst_bytes_log == 0**

In [None]:
sns.distplot(connSample[connSample.dst_bytes_log == 0].duration_log, bins = 10);

#### Conclusions duration vs dst_bytes analysis
The analysis was split into three cases:
* Case 1: *duration_log* and *dst_bytes_log* are greater than 0:
    - There is no strong correlation between *duration_log* and *dst_bytes_log*
    - On the other hand, the two variables seem to form some clusters
* Case 2: *duration_log* is equal to 0:
    - *dst_bytes* seems to have a bimodal distribution
* Case 3: *dst_bytes_log* is equal to 0:
    - Most of the values of *duration_log* are equal to 0




----------------------------------------------

### duration vs land

* **name:** duration
* **description:** length (number of seconds) of the connection
* **type:** continuous


* **name:** land
* **description:** 1 if connection is from/to the same host/port; 0 otherwise
* **type:** discrete

In [None]:
sns.boxplot(x = "land", y = "duration_log", data= connSample, palette= 'Paired');

In [None]:
connSample.pivot_table("duration_log","land", aggfunc=np.average)

**Inspecting with duration > 0**

In [None]:
sns.boxplot(x = "land", y = "duration_log", 
            data= connSample[connSample.duration > 0], 
            palette= 'Paired');

#### Conclusions duration vs land analysis
* If *duration_log* is greater than 0 then we have only values with *land* == 0 and its distribution is right-skewed




----------------------------------------------

### duration vs wrong_fragment

* **name:** duration
* **description:** length (number of seconds) of the connection
* **type:** continuous


* **name:** wrong_fragment	
* **description:** sum of bad checksum packets in a connection
* **type:** continuous

In [None]:
sns.boxplot(x= 'wrong_fragment', y= 'duration_log', 
           data = connSample, 
           palette = 'Paired');

In [None]:
connSample.pivot_table("duration_log","wrong_fragment", aggfunc=np.average)

**Inspecting with duration > 0**

In [None]:
sns.boxplot(x= 'wrong_fragment', y= 'duration_log', 
           data = connSample[connSample.duration_log > 0], 
           palette = 'Paired');

#### Conclusions duration vs wrong_fragment

* If *duration_log* is greater than 0 then we have only values with *wrong_fragment* == 0 and its distribution is right-skewed




----------------------------------------------

### Exercise 2: duration vs urgent

* **name:** duration
* **description:** length (number of seconds) of the connection
* **type:** continuous


* **name:** urgent	
* **description:** number of urgent packets. Urgent packets are packets with the urgent bit activated
* **type:** continuous

#### Conclusions duration vs urgent




----------------------------------------------

### protocol_type vs service

* **name:** protocol_type
* **description:** type of the protocol, e.g. tcp, udp, etc.
* **type:** discrete

* **name:** service
* **description:** network service on the destination, e.g., http, telnet, etc.
* **type:** discrete

In [None]:
connSample.pivot_table("attack", "service","protocol_type", aggfunc=len)

In [None]:
sns.set(font= 'serif', font_scale=0.65, rc={"figure.figsize": (3, 10)})
sns.heatmap(connSample.pivot_table("attack", "service","protocol_type", aggfunc=len), 
            square= False,
            annot=True, annot_kws={"size": 7}, fmt=".0f", linewidths= .5);

#### Conclusions protocol_type vs service analysis
* Except for two general services (*'other'* and *'private'*), every *service* category belongs to a single *protocol_type* category
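Reading this conclusion off a heatmap is error-prone when there are about 70 services, so it helps to check it programmatically. The following sketch groups the protocols seen for each service and reports the services that appear under more than one. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def services_with_multiple_protocols(records):
    """Return {service: [protocols]} for services seen with more than one protocol."""
    protocols = defaultdict(set)
    for protocol_type, service in records:
        protocols[service].add(protocol_type)
    return {s: sorted(p) for s, p in protocols.items() if len(p) > 1}


# Toy (protocol_type, service) pairs, for illustration only
records = [
    ("tcp", "http"), ("tcp", "private"), ("udp", "private"),
    ("udp", "domain_u"), ("icmp", "eco_i"), ("tcp", "other"), ("udp", "other"),
]
shared = services_with_multiple_protocols(records)
print(shared)
```

On the real sample, the pairs can be built with `zip(connSample.protocol_type, connSample.service)`.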




----------------------------------------------

### protocol_type vs flag

* **name:** protocol_type
* **description:** type of the protocol, e.g. tcp, udp, etc.
* **type:** discrete

* **name:** flag
* **description:** normal or error status of the connection. The possible statuses are: SF, S0, S1, S2, S3, OTH, REJ, RSTO, RSTOS0, SH, RSTRH, SHR
* **type:** discrete

In [None]:
sns.set(font= 'serif', font_scale=0.65, rc={"figure.figsize": (3, 6)})
sns.heatmap(connSample.pivot_table("attack", "flag","protocol_type", aggfunc=len), 
            annot=True, fmt=".0f");

#### Conclusions protocol_type vs flag analysis
* All *flag* categories belong only to the *'tcp'* *protocol_type*, except the *'SF'* flag, which appears in all three *protocol_type*s



----------------------------------------------

### Exercise 3: protocol_type vs src_bytes
Make the *protocol_type* vs *src_bytes* analysis

----------

### Exercise 4: protocol_type vs dst_bytes
Make the *protocol_type* vs *dst_bytes* analysis

----------

### Multivariate Analysis

### src_bytes & dst_bytes & wrong_fragment & urgent vs class

* **name:** urgent	
* **description:** number of urgent packets. Urgent packets are packets with the urgent bit activated
* **type:** continuous

* **name:** src_bytes
* **description:** number of data bytes from source to destination
* **type:** continuous

* **name:** dst_bytes
* **description:** number of data bytes from destination to source
* **type:** continuous

* **name:** wrong_fragment	
* **description:** sum of bad checksum packets in a connection
* **type:** continuous

* **name:** attack	
* **description:** is the label attribute and indicates the type of connection of each instance: normal, or a specific attack. The values it can take are the following: *anomaly, dict, dict_simple, eject, eject-fail, ffb, ffb_clear, format, format_clear, format-fail, ftp-write, guest, imap, land, load_clear, loadmodule, multihop, perl_clear, perlmagic, phf, rootkit, spy, syslog, teardrop, warez, warezclient, warezmaster, pod, back, ipsweep, neptune, nmap, portsweep, satan, smurf* and *normal*
* **type:** discrete

In [None]:
#creation of "is_attack" attribute to differentiate the normal connections vs attack connections
connSample["is_attack"] = connSample["attack"].map(lambda x: int(x != "normal"))

In [None]:
sns.pairplot(connSample, 
             vars=['src_bytes_log','dst_bytes_log','wrong_fragment','urgent'], 
             hue='is_attack');