# Capstone Deliverable 2
## Problem Statement
Our main goal is to build a basic Intrusion Detection System (IDS) that classifies network traffic as malicious or normal. This can later be extended into a multiclass classifier that identifies what _kind_ of malicious activity is occurring.

## Methods and models
We intend to run the gamut of classification algorithms, gradually escalating in complexity. We will start with a basic logistic regression, escalate to kNN, SVM, and tree models, and finally crown it all with a neural network.
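
As a rough sketch of this escalation, the loop below scores one scikit-learn estimator per model family with 5-fold cross-validation. The synthetic data from `make_classification`, the specific estimators, and all hyperparameters are placeholder assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real traffic features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# One placeholder estimator per model family, in escalating complexity.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

# Mean 5-fold cross-validated recall (sensitivity) for each model.
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="recall").mean()
    for name, model in models.items()
}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```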

One paper we came across suggested that a good IDS achieves about 95% sensitivity (recall), so that is the benchmark we will aim for. At the same time, (ideally) only a small fraction of network traffic is malicious, so false alarms can easily outnumber true detections; we therefore want our precision to be fairly high as well.
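
For reference, sensitivity is TP / (TP + FN) and precision is TP / (TP + FP), where TP, FN, and FP count true positives, false negatives, and false positives. The minimal sketch below computes both on made-up labels; the helper names and the toy data are illustrative assumptions.

```python
def sensitivity(tp, fn):
    """Share of truly malicious connections that the model flags."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of flagged connections that are truly malicious."""
    return tp / (tp + fp)


# Toy labels (1 = malicious, 0 = normal); not project data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

print(sensitivity(tp, fn), precision(tp, fp))
```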

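Our first-effort naive logistic regression model achieved $\approx$ 99.8% sensitivity but only $\approx$ 80.5% precision (reading the confusion matrix with true labels as rows), because it flags almost all test traffic as malicious. So we have a ways to go!
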
## Data Sources
We found a labelled network traffic dataset (KDD Cup 1999) with a [data dictionary](http://kdd.ics.uci.edu/databases/kddcup99/task.html). This dataset comes with pre-built features suggested by domain experts, which makes it particularly nice for first efforts. We may attempt to extend our modeling to other datasets, but since most of them are unlabeled, it may be difficult to define a success metric for them.

## EDA

In [1]:
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix

In [2]:
train_attack_types = pd.read_csv(
    "datasets/training_attack_types.txt",
    delimiter=" ",
    header=None,
    names=["attack_type", "attack_category"],
)

In [3]:
target = "back"
train_attack_types.loc[train_attack_types["attack_type"]==target, "attack_category"].values[0]

'dos'

In [4]:
cols = list(pd.read_csv("datasets/kddcup.names.txt", skiprows=1, header=None)[0].map(lambda x: str(x).split(":")[0]).values)
cols.append("label")
cols

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate',
 'label']
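
The column-name extraction above relies on each line of the names file having the form `name: type.`. Here is a minimal pure-Python sketch of the same parsing, run on a few sample lines. The sample text (including the type annotations) is an assumption about the file layout, not a copy of the file.

```python
# Sample lines in the assumed "name: type." layout (illustrative only).
sample_names = """duration: continuous.
protocol_type: symbolic.
service: symbolic.
"""

# Keep the text before the first colon on each non-empty line.
demo_cols = [line.split(":")[0] for line in sample_names.splitlines() if line.strip()]
demo_cols.append("label")  # the target column is appended separately, as above
print(demo_cols)
```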

In [5]:
train = pd.read_csv("datasets/kddcup.data.corrected.txt", header=None, names=cols)

In [6]:
test = pd.read_csv("datasets/corrected.txt", header=None, names=cols)

In [7]:
train.head()

,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,http,SF,215,45076,0,0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,162,4528,0,0,0,0,...,1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,236,1228,0,0,0,0,...,2,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,233,2032,0,0,0,0,...,3,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,239,486,0,0,0,0,...,4,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,normal.


In [8]:
test.head()

,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,udp,private,SF,105,146,0,0,0,0,...,254,1.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,normal.
1,0,udp,private,SF,105,146,0,0,0,0,...,254,1.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,normal.
2,0,udp,private,SF,105,146,0,0,0,0,...,254,1.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,normal.
3,0,udp,private,SF,105,146,0,0,0,0,...,254,1.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,snmpgetattack.
4,0,udp,private,SF,105,146,0,0,0,0,...,254,1.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,snmpgetattack.


### Process the label column
Each raw label ends with a trailing period (e.g. `normal.`), which we strip first. We then derive three label columns: binary, coarse multiclass, and fine-grained multiclass.

In [9]:
train["label"] = train["label"].map(lambda x: x.split(".")[0])
test["label"] = test["label"].map(lambda x: x.split(".")[0])

#### Binary label (normal/malicious)

In [10]:
train["label_binary"] = train["label"].map(lambda x: 0 if x=="normal" else 1)
test["label_binary"] = test["label"].map(lambda x: 0 if x=="normal" else 1)
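
An equivalent vectorised form compares the whole column at once instead of calling a lambda per row, which may be noticeably faster on several million rows. The toy Series is an assumption for illustration.

```python
import pandas as pd

labels = pd.Series(["normal", "back", "normal", "neptune"])  # toy data

# True where the connection is anything other than "normal", cast to 0/1.
label_binary = (labels != "normal").astype(int)
print(label_binary.tolist())
```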

#### Coarse multiclass label (normal, probe, u2r, r2l, dos)

In [11]:
train_attack_types

,attack_type,attack_category
0,back,dos
1,buffer_overflow,u2r
2,ftp_write,r2l
3,guess_passwd,r2l
4,imap,r2l
5,ipsweep,probe
6,land,dos
7,loadmodule,u2r
8,multihop,r2l
9,neptune,dos


In [12]:
attack_dict_coarse = {
    i:j for i,j in zip(train_attack_types["attack_type"], train_attack_types["attack_category"])
} 
attack_dict_coarse["normal"] = "normal"

In [13]:
train["label_coarse"] = train["label"].map(attack_dict_coarse).map({
    "normal":0,
    "dos": 1,
    "probe": 2,
    "r2l": 3,
    "u2r": 4
})
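
`Series.map` with a dictionary returns `NaN` for any key the dictionary lacks, so a label missing from `attack_dict_coarse` would silently become a missing value. The test file shown earlier contains `snmpgetattack`, which does not appear among the attack types displayed above, so a check along these lines may be worth running before mapping the test labels. The small dictionary and Series here are toy stand-ins.

```python
import pandas as pd

# Toy stand-ins for attack_dict_coarse and the label column.
demo_map = {"normal": "normal", "back": "dos", "ipsweep": "probe"}
demo_labels = pd.Series(["normal", "back", "snmpgetattack", "ipsweep"])

mapped = demo_labels.map(demo_map)

# Labels with no dictionary entry come back as NaN.
unmapped = sorted(demo_labels[mapped.isna()].unique())
print(unmapped)
```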

In [14]:
train["label_coarse"].value_counts()

1    3883370
0     972781
2      41102
3       1126
4         52
Name: label_coarse, dtype: int64
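
These counts imply a heavy class imbalance. The short calculation below converts them to shares of the training set, using the counts printed above and the code-to-category mapping from the previous cell.

```python
# Counts copied from the value_counts() output above, keyed by category.
counts = {"dos": 3883370, "normal": 972781, "probe": 41102, "r2l": 1126, "u2r": 52}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.4%}")
```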

#### Fine-grained multiclass label (all)

In [15]:
train.describe()

,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label_binary,label_coarse
count,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,...,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0
mean,48.34243,1834.621,1093.623,5.716116e-06,0.0006487792,7.961733e-06,0.01243766,3.205108e-05,0.143529,0.008088304,...,0.7537132,0.03071111,0.605052,0.006464107,0.1780911,0.1778859,0.0579278,0.05765941,0.8014097,0.8102921
std,723.3298,941431.1,645012.3,0.002390833,0.04285434,0.007215084,0.4689782,0.007299408,0.3506116,3.856481,...,0.411186,0.1085432,0.4809877,0.04125978,0.3818382,0.3821774,0.2309428,0.2309777,0.3989389,0.4147374
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,0.0,520.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
75%,0.0,1032.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.04,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
max,58329.0,1379964000.0,1309937000.0,1.0,3.0,14.0,77.0,5.0,1.0,7479.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0


In [16]:
X = pd.get_dummies(train, columns=["protocol_type", "service", "flag"], drop_first=True).drop(columns=["label", "label_binary", "label_coarse"])

## Logistic Regression
Our baseline model will always be logistic regression.

In [17]:
y = train["label_binary"]

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
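
`train_test_split` shuffles at random and, by default, holds out 25% of the rows. Given the class imbalance, a stratified and seeded split may make results easier to reproduce. The sketch below uses toy arrays; `random_state=42` is an arbitrary choice, not a project setting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 malicious (1) and 20 normal (0) rows.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 80 + [0] * 20)

# stratify keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())
```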

In [19]:
logreg = LogisticRegression(penalty="none", solver="sag")

In [20]:
cross_val_score(logreg, X_train, y_train)



array([0.86986588, 0.80863814, 0.8081128 , 0.80884883, 0.80871273])
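
The scikit-learn documentation notes that the `sag` solver's fast convergence is only guaranteed when features are on roughly the same scale, and the features here range from 0/1 flags to byte counts in the billions. One option we might try is standardising inside a pipeline, as sketched below on synthetic data. The deliberately mis-scaled column, `max_iter=1000`, and the default L2 penalty are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one column blown up to mimic byte counts.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 1e6

# Scaling is fit inside each CV fold, so no information leaks from held-out rows.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=1000),
)
scores = cross_val_score(scaled_logreg, X_demo, y_demo, cv=5)
print(scores.mean())
```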

In [21]:
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=None, solver='sag', tol=0.0001, verbose=0,
                   warm_start=False)

In [22]:
logreg.score(X_train, y_train)

0.8086829441701464

In [23]:
logreg.score(X_test, y_test)

0.8085460816849147

In [24]:
y_actual = test["label"].map(lambda x: 0 if x == "normal" else 1)
y_actual

0         0
1         0
2         0
3         1
4         1
         ..
311024    0
311025    0
311026    0
311027    0
311028    0
Name: label, Length: 311029, dtype: int64

In [25]:
X_big_test = pd.get_dummies(test, columns=["protocol_type", "service", "flag"], drop_first=True).drop(columns=["label", "label_binary"])

In [26]:
cols_to_zero = [col for col in X_test.columns if col not in X_big_test.columns]
for col in cols_to_zero:
    X_big_test[col] = 0
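
The same alignment can be written with `DataFrame.reindex`, which adds missing columns, drops extra ones, and fixes the column order in one step. The two small frames are toy stand-ins for the dummy-encoded training and test features.

```python
import pandas as pd

# Toy dummy-encoded frames whose columns only partly overlap.
train_X = pd.DataFrame({"duration": [0, 1], "service_http": [1, 0], "service_smtp": [0, 1]})
test_X = pd.DataFrame({"duration": [2, 3], "service_http": [1, 1], "service_icmp": [0, 1]})

# Match the training layout: absent columns are filled with 0, unseen ones dropped.
aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(aligned)
```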

In [27]:
logreg.score(X_big_test[X_test.columns], y_actual)

0.8036967613952397

In [30]:
# confusion_matrix expects (y_true, y_pred); with that order ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_actual, logreg.predict(X_big_test[X_test.columns])).ravel()

In [31]:
tn

52

In [32]:
fp

60541

In [33]:
fn

515

In [34]:
tp

249921

In [35]:
# Our sensitivity (recall) is
(tp) / (tp + fn)

0.997943586385344

In [36]:
# Our precision is
(tp) / (tp + fp)

0.8049970688844368
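
scikit-learn's `confusion_matrix` takes the true labels first and the predictions second; swapping them transposes the matrix and so exchanges the false-positive and false-negative counts. The toy example below shows the effect on made-up labels.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1]  # toy ground truth
y_pred = [0, 1, 1, 1, 1]  # toy predictions

# Documented order: (y_true, y_pred).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

# Swapped order: the fp and fn positions are exchanged.
swapped = confusion_matrix(y_pred, y_true).ravel()
print(swapped)
```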