# Ensemble Algorithm

### Data description

XYZ Bank is a large and profitable bank in Saint Louis, Missouri. Like any large corporation, XYZ Bank has a very large and intricate infrastructure that supports its networking system. A Network Analyst recently discovered unusual network activity. Then, poring over a year’s worth of logs, their team of analysts discovered many instances of anomalous network activity that resulted in significant sums of money being siphoned from bank accounts. The Chief Networking Officer has come to your group for help in developing a system that can automatically detect and warn of such known, as well as other unknown, anomalous network activities.

The network_traffic.csv file is a synopsis of logged network activity. It contains labeled examples of benign network sessions as well as examples of sessions involving intrusions. The data likely contain many different intrusion types, but we will treat all intrusions as a single class. The data_description.txt file provides explanations of each of the attributes found in the network_traffic dataset.


### Objective

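The objective of this study is to classify network sessions as benign or intrusive, and to compare a single decision tree against several ensemble algorithms (bagging, AdaBoost, gradient boosting, and random forest) using 10-fold cross-validated accuracy.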

### Load data

In [6]:
import pandas as pd
import numpy as np

In [8]:
data = pd.read_csv("intrusion.csv")
data.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,is_intrusion
0,190.048316,udp,private,SF,105,146,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0.0,udp,private,SF,105,105,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0.0,udp,private,unknown,105,146,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0.0,udp,private,SF,105,146,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0.0,udp,private,SF,105,147,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [39]:
data.describe()

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,is_intrusion
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,190.048316,18032.052933,1806.351931,0.0,0.0,0.0,0.151645,0.0,0.590844,1.264664,0.001431,0.002861,1.429185,0.0,0.0,0.007153,0.0,0.0,0.051502,0.429185
std,814.87387,59040.018323,8271.114218,0.0,0.0,0.0,1.071863,0.0,0.49203,33.435951,0.037823,0.075647,36.879601,0.0,0.0,0.189117,0.0,0.0,0.221178,0.495314
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,105.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,217.0,147.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,330.5,760.5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,15122.0,283618.0,176690.0,0.0,0.0,0.0,25.0,0.0,1.0,884.0,1.0,2.0,975.0,0.0,0.0,5.0,0.0,0.0,1.0,1.0


In [40]:
data.shape

(699, 23)
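Before reading any accuracy figure, it helps to know what a trivial model would score. The short calculation below is only a sketch that reuses two numbers from the `describe()` output above (699 rows, and a mean of 0.429185 for the binary `is_intrusion` column). The variable names are illustrative and not part of the original notebook.

```python
# Values copied from the describe() output above
n_rows = 699
intrusion_rate = 0.429185  # mean of the binary is_intrusion column

# Recover the class counts from the rate
n_intrusions = round(n_rows * intrusion_rate)
n_benign = n_rows - n_intrusions

# Accuracy of a model that always predicts the more common class
baseline_accuracy = max(n_intrusions, n_benign) / n_rows

print(n_intrusions, n_benign)        # 300 399
print(round(baseline_accuracy, 4))   # 0.5708
```

Any classifier in this notebook should therefore be judged against roughly 57% accuracy rather than against 50%.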

### Preprocessing data

In [10]:
X = data[['duration', 'src_bytes', 'dst_bytes']].values
y = data[['is_intrusion']].values.ravel()

In [11]:
y = np.array(y).astype(int)

In [12]:
from sklearn import preprocessing

In [13]:
X = preprocessing.scale(X)
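Scaling the full matrix once, as above, means the mean and standard deviation are computed from rows that later land in validation folds. One possible alternative, sketched here on synthetic data rather than the bank data, is to wrap the scaler and the model in a pipeline so that the scaler is refit on the training portion of each fold. `X_demo`, `y_demo`, `scaled_tree`, and `demo_fold` are hypothetical names used only for this illustration. Tree-based models are insensitive to monotonic rescaling of individual features, so the practical effect on the models in this notebook is probably small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three numeric features (not the bank data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# The scaler is refit on the training part of every fold
scaled_tree = make_pipeline(
    StandardScaler(), DecisionTreeClassifier(max_depth=3, random_state=0)
)

demo_fold = KFold(n_splits=10)
demo_scores = cross_val_score(
    scaled_tree, X_demo, y_demo, scoring="accuracy", cv=demo_fold
)
print(demo_scores.mean())
```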

### Implementing a decision tree

In [18]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import model_selection

In [20]:
param_grid = {'max_depth': range(3,10)}
fold = model_selection.KFold(n_splits = 10)

In [21]:
DT = GridSearchCV(tree.DecisionTreeClassifier(), param_grid, verbose = 1)

In [22]:
DT.fit(X,y)

Fitting 5 folds for each of 7 candidates, totalling 35 fits


GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': range(3, 10)}, verbose=1)

### Cross validation

In [23]:
DT_accuracy = cross_val_score(DT, X, y, scoring ='accuracy', cv = fold).mean()
DT_accuracy

Fitting 5 folds for each of 7 candidates, totalling 35 fits


0.9226708074534162
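Passing a `GridSearchCV` object to `cross_val_score` produces a nested procedure: each of the 10 outer folds runs its own 5-fold search over the 7 depth candidates, which is why the "Fitting 5 folds" message is printed once per outer fold. The sketch below reproduces that structure on synthetic data (the `*_demo` names and `search` are assumptions for illustration). It also shows how the selected depth could be inspected, which the notebook does not print.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

param_grid = {"max_depth": range(3, 10)}

# Inner loop: GridSearchCV uses 5-fold cross-validation by default
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid)

# Outer loop: the whole search is repeated inside each of the 10 folds
outer_scores = cross_val_score(
    search, X_demo, y_demo, scoring="accuracy", cv=KFold(n_splits=10)
)

# Search fits performed: 10 outer folds x 7 candidates x 5 inner folds
n_search_fits = 10 * len(param_grid["max_depth"]) * 5
print(n_search_fits)  # 350

# Fitting the search directly exposes the depth it would select
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

The outer mean is an estimate for the tuning procedure as a whole, not for one fixed depth, since different outer folds may select different values of `max_depth`.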

### Implementing bagging

In [27]:
from sklearn.ensemble import BaggingClassifier
num_trees = 500
DT2 = DecisionTreeClassifier(random_state = 0)
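# Pass the base tree positionally: the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2
bag = BaggingClassifier(DT2, n_estimators = num_trees, random_state = 0)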
bag_accuracy = cross_val_score(bag, X, y, scoring ='accuracy', cv =fold).mean()

In [28]:
#accuracy statistics
bag_accuracy

0.935527950310559
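To make the idea behind bagging concrete, here is a minimal hand-rolled sketch on synthetic data: each tree is fit on a bootstrap sample (rows drawn with replacement), and the ensemble prediction is a majority vote. This is a simplification and not the library's implementation. scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, and it falls back to voting otherwise. All names below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

rng = np.random.default_rng(0)
n_estimators = 25
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(
        DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    )

# One row of 0/1 predictions per tree: shape (n_estimators, n_samples)
votes = np.stack([t.predict(X_demo) for t in trees])

# Hard majority vote across the trees
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print((majority == y_demo).mean())  # training accuracy of the vote
```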

### Implementing AdaBoost

In [29]:
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(n_estimators = num_trees, learning_rate=0.1, random_state =0)
adaboost_accuracy = cross_val_score(adaboost, X, y, scoring ='accuracy', cv =fold).mean()

In [30]:
adaboost_accuracy

0.9040993788819875

### Implementing gradient boosting

In [31]:
from sklearn.ensemble import GradientBoostingClassifier
gboost = GradientBoostingClassifier(n_estimators = num_trees, learning_rate=0.1, random_state =0)
gboost_accuracy = cross_val_score(gboost, X, y, scoring ='accuracy', cv =fold).mean()

In [32]:
gboost_accuracy

0.9169565217391306

### Implementing random forest

In [33]:
from sklearn.ensemble import RandomForestClassifier

In [34]:
rf = RandomForestClassifier(n_estimators = num_trees, random_state = 0)
rf_accuracy = cross_val_score(rf, X, y, scoring ='accuracy', cv = fold).mean()

In [35]:
rf_accuracy

0.9412422360248447
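A fitted random forest also reports how much each feature contributed to its splits, which could help explain the score above. The sketch below uses synthetic data labeled with the notebook's three column names purely as stand-ins, so the printed importances say nothing about the real bank data. `rf_demo` and `importances` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names borrowed from the notebook; the data itself is synthetic
feature_names = ["duration", "src_bytes", "dst_bytes"]
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```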

### Classifiers comparison

In [37]:
comparison = pd.DataFrame({'Decision Tree': DT_accuracy,
                          'Bagging': bag_accuracy,
                          'Random Forest': rf_accuracy,
                          'Gradient Boosting': gboost_accuracy,
                          'AdaBoost': adaboost_accuracy},
                         index = ["Accuracy"])

In [38]:
comparison

Unnamed: 0,Decision Tree,Bagging,Random Forest,Gradient Boosting,AdaBoost
Accuracy,0.922671,0.935528,0.941242,0.916957,0.904099


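### Discussion

Based on the results, random forest achieved the highest mean 10-fold cross-validated accuracy at 94.12%, followed by bagging at 93.55%. Both outperformed the tuned decision tree, whose accuracy was 92.27%. The two boosting methods did not improve on the single tree in this experiment: gradient boosting reached 91.70% and AdaBoost 90.41%.

Ensemble learning combines the predictions of many models to improve predictive performance. Bagging and random forest average many trees fit on bootstrap samples, which mainly reduces variance, whereas AdaBoost and gradient boosting fit models sequentially to correct earlier errors, which mainly reduces bias. On this dataset the variance-reducing ensembles helped, while the boosting models, which were run with fixed settings (500 estimators, learning rate 0.1) rather than tuned, fell slightly short of the decision tree.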
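The accuracy gaps between models are only a few percentage points, and the notebook reports fold means without their spread, so it is hard to say how much of the ranking is noise. In addition, the first rows shown by `data.head()` all have `is_intrusion` equal to 1. If the file happens to be ordered by label (an assumption, not something the notebook confirms), unshuffled `KFold` splits would have uneven class mixes. The sketch below, on synthetic data with illustrative names, shows one way to use shuffled stratified folds and report the mean together with the standard deviation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Shuffled folds that preserve the class ratio in every split
strat_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_demo, y_demo, scoring="accuracy", cv=strat_fold
    )
    # Keep both the mean and the fold-to-fold spread
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the standard deviations are comparable to the differences between the means, the ranking of the models should be read with caution.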