# Lesson 02 Assignment

## Background

    You are involved in a project where you are tasked with building a machine learning algorithm that distinguishes between "bad" connections (called intrusions or attacks) and "good" (normal) connections. Note that normal connections outnumber bad ones.

## Instructions

    (1) Read data
    (2) Build a classifier
    (3) Determine your model accuracy
    (4) Modify data by handling class imbalance
    (5) Use the same model on updated data
    (6) What is the accuracy?
    (7) Describe your findings

In [None]:
# Import packages

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from collections import Counter
from imblearn.over_sampling import SMOTE

### (1) Read Data

In [None]:
# Read the CSV file from a local path

data = pd.read_csv("/Users/matt.denko/Downloads/Intrusion Detection.csv")
data.columns = ["duration",
"protocol_type",
"service",
"flag",
"src_bytes",
"dst_bytes",
"land",
"wrong_fragment",
"urgent",
"hot",
"num_failed_logins",
"logged_in",
"num_compromised",
"root_shell",
"su_attempted",
"num_root",
"num_file_creations",
"num_shells",
"num_access_files",
"num_outbound_cmds",
"is_host_login",
"is_guest_login",
"count",
"srv_count",
"serror_rate",
"srv_serror_rate",
"rerror_rate",
"srv_rerror_rate",
"same_srv_rate",
"diff_srv_rate",
"srv_diff_host_rate",
"dst_host_count",
"dst_host_srv_count",
"dst_host_same_srv_rate",
"dst_host_diff_srv_rate",
"dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate",
"dst_host_serror_rate",
"dst_host_srv_serror_rate",
"dst_host_rerror_rate",
"dst_host_srv_rerror_rate",
"connection_type"]
print(data.columns)
data.describe()
data.head()

In [None]:
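# Check for missing data ("?" marks a missing value)

data = data.replace(to_replace="?", value=np.nan)
data_null = data.isnull().sum()
print(data_null)
# Count the columns that actually contain missing values instead of hard-coding the result
print("There are {} columns with missing data".format((data_null > 0).sum()))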

### (2) Build a classifier

### Comments:

    I am going to build a classifier that predicts whether a connection's serror_rate (the percentage of same-host connections with SYN errors) is greater than the average. To do this, I first determine the average and then create a dummy variable to use as the target label.

In [None]:
# Determine the mean

serror_rate = data.loc[:,"serror_rate"]
mean = serror_rate.mean()
print(mean)

In [None]:
# Create target label: 1 if serror_rate is above the mean computed above, else 0

data.loc[:,'serror_rate'] = (data.loc[:,'serror_rate'] > mean).astype(int)
print(data.loc[:,'serror_rate'])

In [None]:
# Define the target and features:

target_label = 'serror_rate'
non_features = ['protocol_type', 'service', 'flag']  # categorical columns, excluded from the model
feature_labels = [x for x in data.columns if x not in [target_label] + non_features]

# One-hot encode inputs (computed for inspection only; the model below uses the original columns)

data_expanded = pd.get_dummies(data, drop_first=True)
print('DataFrame one-hot-expanded shape: {}'.format(data_expanded.shape))

# Get target and original x-matrix
# (DataFrame.as_matrix was removed in pandas 1.0; to_numpy() is the replacement)

y = data[target_label]
x = data[feature_labels].to_numpy()

In [None]:
# Split dataset into training set and test set

X_train, X_test, y_train, y_test = train_test_split(x, y, 
                                  test_size=0.3,random_state=42) # 70% training and 30% test

In [None]:
# model

gnb = GaussianNB()

# train the model on the training sets only

gnb_model = gnb.fit(X_train, y_train)

### (3) Determine your model accuracy

In [None]:
#Predict the response for test dataset

y_pred = gnb.predict(X_test)

#Accuracy

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
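
With imbalanced classes, accuracy alone can hide poor performance on the minority class. The sketch below uses hypothetical toy labels (not the intrusion data) and two illustrative helper functions, `accuracy` and `recall_per_class`, to show how a classifier that always predicts the majority class still scores high accuracy. In the notebook itself, `metrics.confusion_matrix`, `metrics.recall_score`, and `metrics.balanced_accuracy_score` from scikit-learn would likely serve the same purpose.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred):
    """Recall per class: correct predictions of a class / true members of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical labels: 95 majority-class (0) and 5 minority-class (1) examples
y_true = [0] * 95 + [1] * 5
# A trivial "classifier" that always predicts the majority class
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
recalls = recall_per_class(y_true, y_pred)
balanced_acc = sum(recalls.values()) / len(recalls)

print("Accuracy:", acc)                    # 0.95
print("Recall per class:", recalls)        # {0: 1.0, 1: 0.0}
print("Balanced accuracy:", balanced_acc)  # 0.5
```

Reporting per-class recall or balanced accuracy next to plain accuracy makes the before and after comparison easier to interpret.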

### (4) Modify data by handling class imbalance

In [None]:
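# Oversample the minority class with SMOTE
# (fit_sample was renamed to fit_resample in newer imbalanced-learn releases)

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(x, y)
print('Resampled dataset shape {}'.format(Counter(y_res)))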
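
One point to consider: resampling the full dataset before splitting puts synthetic rows in the test set and lets information from test rows influence the training data. A common alternative is to split first and resample only the training portion. The sketch below illustrates that order with hypothetical toy data and a simple random-duplication helper, `random_oversample`, which stands in for SMOTE here and is not part of any library. In the notebook, the analogous change would be to call the resampler on `X_train, y_train` rather than on `x, y`.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy data: 16 majority rows (label 0) and 4 minority rows (label 1)
X = [[float(i)] for i in range(20)]
y = [0] * 16 + [1] * 4

# 1) Split first: hold out every 4th row so both classes appear in the test set
test_idx = set(range(3, 20, 4))
X_train = [row for i, row in enumerate(X) if i not in test_idx]
y_train = [lab for i, lab in enumerate(y) if i not in test_idx]
X_test = [row for i, row in enumerate(X) if i in test_idx]
y_test = [lab for i, lab in enumerate(y) if i in test_idx]

# 2) Resample the training split only; the test split keeps its original balance
X_train_res, y_train_res = random_oversample(X_train, y_train)

print("Train before:", Counter(y_train))
print("Train after: ", Counter(y_train_res))
print("Test (untouched):", Counter(y_test))
```

Scores measured on the untouched test split then reflect the class balance the model would see in practice.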

### (5) Use the same model on updated data

In [None]:
# Split dataset into training set and test set

X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, 
                                  test_size=0.3,random_state=42) # 70% training and 30% test

In [None]:
# model

gnb = GaussianNB()

# train the model on the training sets only

gnb_model = gnb.fit(X_train, y_train)

### (6) What is the accuracy?

In [None]:
#Predict the response for test dataset

y_pred = gnb.predict(X_test)

#Accuracy

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

### (7) Describe your findings

#### Comments:
    
    The accuracy increased significantly after using the SMOTE method to address the class imbalance. SMOTE stands for Synthetic Minority Oversampling Technique. It oversamples the minority class by creating synthetic samples, each interpolated between a minority sample and one of its nearest minority-class neighbors. It does not undersample the majority class by itself, although it can be combined with undersampling. In this case, my original data had extreme class imbalance. The SMOTE method increased my model accuracy to 0.53.
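
To make the interpolation idea concrete, here is a simplified sketch of how SMOTE-style synthetic points can be generated. It is an illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_like_samples`, the toy points, and the choice of `k=2` are all hypothetical.

```python
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbors (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to `base`
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points in two dimensions
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like_samples(minority, n_new=6)

print(len(new_points))
for point in new_points:
    print(point)
```

Because each synthetic point lies on a line segment between two existing minority points, the new samples stay inside the region the minority class already occupies rather than being exact duplicates.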