# Lesson 2 Assignment
# Houda Aynaou

## Workplace Scenario
You are involved in a project where you are tasked with building a machine learning algorithm that distinguishes between "bad" connections (called intrusions or attacks) and "good" (normal) connections. Note that normal connections far outnumber bad ones.

The dataset you will use in this assignment originated with the [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html).

## To Do: 

1. Read data
2. Build a classifier
3. Determine your model accuracy
4. Modify data by handling class imbalance
5. Use the same model on updated data
6. What is the accuracy?
7. Describe your findings

In [1]:
# imports
import numpy as np 
import pandas as pd 
import imblearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# ignore warnings 
import warnings
warnings.filterwarnings('ignore')

# display all columns
pd.options.display.max_columns = 42

## 1. Read Data

In [2]:
LINK = 'https://raw.githubusercontent.com/houdaaynaou/DS-Certificate-UW/master/Course%203%20Machine%20Learning%20Techniques/Data/Intrusion%20Detection.csv'

data = pd.read_csv(LINK)
data.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,Class
0,0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,9,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,0
1,0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,19,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0
2,0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0
3,0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,39,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0
4,0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,49,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0


In [3]:
# Proportion of each connection class:
print('Proportion of the classes in the data:')
print(data['Class'].value_counts() / len(data))

Proportion of the classes in the data:
0    0.999692
1    0.000308
Name: Class, dtype: float64


Normal connections make up about 99.97% of the dataset and bad connections only about 0.03%, so the classes are severely imbalanced.
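With an imbalance this strong, a useful reference point is the accuracy of a model that always predicts the majority class. The sketch below computes that baseline on made-up labels with roughly the same imbalance; the helper name `majority_baseline_accuracy` and the toy labels are illustrative assumptions, not part of the assignment data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical labels: 3 bad connections in 10,000, close to the ratio in the data.
toy_labels = [0] * 9997 + [1] * 3
baseline = majority_baseline_accuracy(toy_labels)
print(f'Majority-class baseline accuracy: {baseline:.4f}')
```

Any accuracy figure for a real model should be read against this baseline rather than against 50%.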

## 2. Build a Classifier

To build a logistic regression model, we first need to preprocess the data by encoding the categorical variables and standardizing the features.

In [4]:
data['protocol_type'].unique()

array(['tcp', 'udp', 'icmp'], dtype=object)

### 2.1 Preprocessing Data

In [5]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler


# one-hot encode protocol_type
# (.toarray() yields a dense array on both older and newer scikit-learn versions)
enc1 = OneHotEncoder(handle_unknown='ignore')
encoded_protocol = enc1.fit_transform(data['protocol_type'].values.reshape(-1,1)).toarray()
enc_p = pd.DataFrame(encoded_protocol, columns=['icmp','tcp', 'udp'])

# one-hot encode service
enc2 = OneHotEncoder(handle_unknown='ignore')
enc_service = enc2.fit_transform(data['service'].values.reshape(-1,1)).toarray()
enc_s = pd.DataFrame(enc_service, columns=['service_' + str(c) for c in enc2.categories_[0]])

# one-hot encode flag
enc3 = OneHotEncoder(handle_unknown='ignore')
enc_flag = enc3.fit_transform(data['flag'].values.reshape(-1,1)).toarray()
enc_f = pd.DataFrame(enc_flag, columns=['flag_' + str(c) for c in enc3.categories_[0]])

# drop the target and the raw categorical columns, then append the encoded ones
X = pd.concat([data.drop(['Class', 'protocol_type','service', 'flag'], axis= 1),enc_p, enc_s,enc_f], axis=1)

# normalize features
X_standarized = StandardScaler().fit_transform(X)
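One caveat about the cell above: the encoders and the scaler are fitted on the full dataset before it is split, so statistics from the test rows influence the preprocessing. A common way to avoid this is to wrap the preprocessing and the model in a scikit-learn `Pipeline` and fit it on the training split only. The following is a minimal sketch on a tiny made-up frame; the two columns, their values, and the variable names are assumptions for illustration and do not reproduce the results in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'protocol_type': ['tcp', 'udp', 'icmp', 'tcp', 'udp', 'tcp', 'icmp', 'tcp'],
    'src_bytes': [181, 239, 0, 217, 45, 300, 10, 520],
    'Class': [0, 0, 1, 0, 0, 0, 1, 0],
})
categorical = ['protocol_type']
numeric = ['src_bytes']

# Encode categorical columns and standardize numeric ones in a single step.
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
    ('num', StandardScaler(), numeric),
])
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])

# Split first, then fit: the encoder and scaler only ever see training rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[categorical + numeric], toy['Class'],
    test_size=0.25, random_state=0, stratify=toy['Class'])
model.fit(X_tr, y_tr)
toy_pred = model.predict(X_te)
```

Because the whole chain is one estimator, the same object can later be passed to cross-validation without re-introducing the leak.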

### 2.2 Simple Logistic Regression classifier without SMOTE

In [6]:
# split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_standarized, 
                                                    data['Class'], 
                                                    test_size = 0.20, 
                                                    random_state = 0)

# Build and train the model
lr = LogisticRegression()
lr.fit(x_train, y_train)

# predictions
predictions = lr.predict(x_test)

## 3. Determine your model accuracy

In [7]:
from sklearn.metrics import accuracy_score

# accuracy_score expects (y_true, y_pred)
score = accuracy_score(y_test, predictions)

print('Accuracy score on testing data', score)

Accuracy score on testing data 0.999794471277361


High accuracy score! It seems the model has performed exceptionally well, but we need to examine the confusion matrix for our predictions.


In [8]:
print(pd.crosstab(y_test.ravel(), predictions, rownames = ['True'], colnames = ['Predicted'], margins = True))

Predicted      0  1    All
True                      
0          19452  1  19453
1              3  6      9
All        19455  7  19462


In [9]:
print('False negative rate (bad connections classified as good):', 3/9)

False negative rate (bad connections classified as good): 0.3333333333333333


3 of the 9 instances that belong to class 1 have been classified as class 0, so the model misses about 33% of the bad connections. This is going to cause serious issues for the company.

The high accuracy does not reflect good detection of attacks. The model predicts the majority class for almost every example, and since about 99.97% of the examples actually belong to that class, accuracy stays high even when a third of the attacks are missed.
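Per-class rates make this clearer than accuracy does. The sketch below derives them from the four cells of the confusion matrix above, treating class 1 (bad connection) as the positive class; the helper name `rates_from_confusion` is a hypothetical name chosen for illustration.

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Per-class rates from confusion-matrix counts (positive class = bad connection)."""
    return {
        'accuracy': (tp + tn) / (tn + fp + fn + tp),
        'recall': tp / (tp + fn),            # share of bad connections caught
        'miss_rate': fn / (tp + fn),         # share of bad connections missed
        'precision': tp / (tp + fp),         # share of alarms that were real
        'false_alarm_rate': fp / (fp + tn),  # share of good connections flagged
    }


# Counts read from the confusion matrix of the model trained on imbalanced data.
baseline_rates = rates_from_confusion(tn=19452, fp=1, fn=3, tp=6)
for name, value in baseline_rates.items():
    print(f'{name}: {value:.4f}')
```

Accuracy is close to 1 while recall is only two thirds, which is the gap the paragraph above describes.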

## 4. Modify data by handling class imbalance: SMOTE

Balancing the classes often leads to better classification models on imbalanced data. We will try balancing our training data using SMOTE (Synthetic Minority Over-sampling Technique).

In [10]:
from imblearn.over_sampling import SMOTE


# Oversample the minority class in the training split only
# (fit_resample replaces the older fit_sample method, which recent imblearn versions no longer provide)
sm = SMOTE(random_state = 42)
X_train_new, y_train_new = sm.fit_resample(x_train, y_train.ravel())

# Balanced Classes 
np.unique(y_train_new, return_counts= True)

(array([0, 1]), array([77825, 77825]))
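The extra class-1 rows are not copies of existing rows. SMOTE creates each synthetic point by picking a minority sample, choosing one of its nearest minority neighbours, and interpolating between the two. The sketch below illustrates that single step with NumPy; it is a simplified illustration under assumed names (`smote_like_sample`, `minority_points`), not the implementation used by `imblearn`.

```python
import numpy as np


def smote_like_sample(minority, k=2, seed=0):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = np.random.default_rng(seed)
    base = minority[rng.integers(len(minority))]
    # Distances from the chosen sample to every minority sample.
    dists = np.linalg.norm(minority - base, axis=1)
    # Skip position 0 of the sorted order, which is the sample itself.
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    # Random position on the segment between the sample and its neighbour.
    gap = rng.random()
    return base + gap * (neighbour - base)


# Hypothetical minority-class points in a 2-D feature space.
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_points)
print(synthetic)
```

Because each new point lies on a segment between two real minority samples, the synthetic rows stay inside the region the minority class already occupies.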

## 5. Use the same model on updated data

In [11]:
# refit the same model on the balanced (SMOTE-resampled) training data
lr.fit(X_train_new, y_train_new)

# prediction for Testing data
test_pred_sm = lr.predict(x_test)


## 6. What is the accuracy?

In [12]:
# accuracy 
sm_score = accuracy_score(y_test, test_pred_sm)

print('Accuracy score on testing data after balancing classes:', sm_score)

Accuracy score on testing data after balancing classes: 0.9996403247353818


Accuracy has dropped slightly (from about 0.99979 to 0.99964), but the model has improved at the task that matters. Compare the confusion matrix below with the earlier one.

In [13]:
print('Confusion Matrix - Testing Dataset')
print(pd.crosstab(y_test.ravel(), test_pred_sm, rownames = ['True'], colnames = ['Predicted'], margins = True))


Confusion Matrix - Testing Dataset
Predicted      0   1    All
True                       
0          19446   7  19453
1              0   9      9
All        19446  16  19462


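A vast improvement! All 9 bad connections in the test set are now classified as bad. The price is a few more false alarms: 7 good connections are flagged as bad, compared with 1 before.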



## 7. Describe your findings

One might argue that the reduced accuracy indicates lower model performance. That is not the case here.

A prediction error can be made in two ways:

1. Classifying a good connection as a bad connection (false positive).
2. Classifying a bad connection as a good connection (false negative).

In our case the second error is costlier than the first: a false alarm costs some investigation time, while a missed attack goes undetected.
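The asymmetry between the two errors can be made concrete by attaching a cost to each. The sketch below uses the error counts from the two confusion matrices above; the cost values (1 per false alarm, 100 per missed attack) are made-up assumptions for illustration, not figures from the assignment.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's errors given a cost per error type."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical costs: a missed attack is assumed to be 100x worse than a false alarm.
COST_FALSE_ALARM = 1
COST_MISSED_ATTACK = 100

# Error counts read from the two test-set confusion matrices.
cost_before_smote = total_cost(fp=1, fn=3, cost_fp=COST_FALSE_ALARM, cost_fn=COST_MISSED_ATTACK)
cost_after_smote = total_cost(fp=7, fn=0, cost_fp=COST_FALSE_ALARM, cost_fn=COST_MISSED_ATTACK)

print('Cost before SMOTE:', cost_before_smote)
print('Cost after SMOTE:', cost_after_smote)
```

Under these assumed costs the model with the lower accuracy is the cheaper one. The ranking would only flip if a false alarm were considered roughly half as costly as a missed attack or more.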

We must note that the objective of each classification problem is different. That's why we need to evaluate each model with respect to its own objective instead of merely judging it on its accuracy.