<a href="https://colab.research.google.com/github/inspire-lab/CyberAI-labs/blob/main/Intrusion-detection/intrusion_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 2
## Intrusion detection and prevention
In this lab we will use different methods to detect intrusions, which in turn can be used to prevent them. We will experiment with numeric data, traces and time information rather than text. The goal of this lab is to learn to handle datasets containing a large number of features. We will use techniques to automatically combine, rank and select features. We will then evaluate the performance of a model trained on the selected features and compare it with a model that uses all features.

The dataset used in this lab is adapted from the CICIDS2017 dataset, available <a href="https://www.unb.ca/cic/datasets/ids-2017.html">here.</a>

Let us first set up a simple filter to ignore the warnings (such as future warnings and deprecation warnings) we may encounter during the execution of this lab.

In [None]:
# Silence warnings for the rest of the lab
import sys
import os
import warnings

# A narrower alternative that ignores only specific categories:
# from warnings import simplefilter
# simplefilter(action='ignore', category=FutureWarning)
# simplefilter(action='ignore', category=DeprecationWarning)
# simplefilter(action='ignore', category=ConvergenceWarning)

# Unless warning options were given on the command line, ignore all warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore"  # also affects subprocesses


The first step in training any machine learning model is to load the data. The data needed for this lab is in the `Friday-WorkingHours-Morning.pcap_ISCX.csv` file, where each row is an instance of either BENIGN or Bot traffic in a network. Read more about the dataset <a href="https://www.unb.ca/cic/datasets/ids-2017.html">here.</a>

After loading the dataset, we assign the features and labels to the variables X and Y, respectively. Features are the characteristics recorded for each instance (in this case BENIGN or Bot activity) when the dataset was created.

The `' Label'` column in the dataset is assigned to the Y variable, and the features in the remaining columns are transformed using an imputer, which fills in any missing values in the dataset.

In [None]:
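import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import RFE

df = pd.read_csv("data/Friday-WorkingHours-Morning.pcap_ISCX.csv")

# Separate the labels from the features (the column name has a leading space)
Y = df[' Label']
del df[' Label']

# Treat infinite values as missing, then drop columns that are entirely empty
df.replace([np.inf, -np.inf], np.nan, inplace=True)
X = df.dropna(axis=1, how='all')
columns = X.columns.values.tolist()

# Replace the remaining missing values with the column mean
# (SimpleImputer has no `axis` argument; it always imputes column-wise)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X)
X = imputer.transform(X)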
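
To see what the imputation step does in isolation, here is a minimal sketch on a tiny made-up array. The values are purely illustrative and are not taken from the lab's dataset; the names `toy` and `toy_imputed` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (hypothetical values): one infinite and one missing entry
toy = np.array([[1.0, 10.0],
                [np.inf, 20.0],
                [3.0, np.nan]])

# The imputer only fills NaN, so infinite values are marked as missing first
toy[np.isinf(toy)] = np.nan

# Each missing entry is replaced by the mean of the observed values in its column
toy_imputed = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(toy)
print(toy_imputed)  # the gaps become 2.0 (first column) and 15.0 (second column)
```

Mean imputation keeps every row, but it also pulls the filled-in values toward the column average, which is worth keeping in mind when a column has many missing entries.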

## Question 1:
Print the number of features present after transformation using the imputer.

In [None]:
# solution

## Question 2:
Create an 80/20 train/test split using X as features and Y as labels.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
# solution




Let us now print the number of labels in each category in the training set.

In [None]:
BENIGN_counts, bot_counts = y_train.value_counts()
print(BENIGN_counts)
print(bot_counts)


We can clearly see that the BENIGN labels outnumber the Bot labels. This indicates that the dataset is imbalanced. Let us now balance our dataset using undersampling. This technique randomly discards samples from the majority class (BENIGN in our case) until the classes are balanced.

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

rus = RandomUnderSampler(random_state=89)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
print(sorted(Counter(y_resampled).items()))
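
The idea behind random undersampling can be sketched with the standard library alone. The label list below is made up for illustration, and the code is only an approximation of the concept, not a description of how `RandomUnderSampler` works internally.

```python
import random
from collections import Counter

# Hypothetical imbalanced label list: 95 BENIGN rows and 5 Bot rows
labels = ['BENIGN'] * 95 + ['Bot'] * 5

# Group row indices by class
indices_by_class = {}
for i, label in enumerate(labels):
    indices_by_class.setdefault(label, []).append(i)

# Keep as many rows per class as the smallest class contains
minority_size = min(len(idx) for idx in indices_by_class.values())
rng = random.Random(89)  # fixed seed so the sketch is repeatable
kept = []
for idx in indices_by_class.values():
    kept.extend(rng.sample(idx, minority_size))

balanced_labels = [labels[i] for i in sorted(kept)]
print(sorted(Counter(balanced_labels).items()))  # [('BENIGN', 5), ('Bot', 5)]
```

Undersampling throws data away, so it is applied to the training set only. The test set keeps its original class distribution so that the evaluation reflects realistic traffic.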

We can now see that the training set is balanced. From here on, we use the undersampled data as our actual training set.

In [None]:
X_train = X_resampled
y_train = y_resampled

Let us now build a simple logistic regression model.
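
As a rough illustration of the workflow, the sketch below fits a logistic regression model on synthetic stand-in data rather than on the lab's dataset. All names prefixed with `demo_` or suffixed with `_demo` are hypothetical, and raising `max_iter` is an assumption made here to reduce the chance of convergence warnings. In the lab, a model named `logmodel` would presumably be created the same way and fitted on `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Fit the model on the training portion only
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_demo_train, y_demo_train)

# Evaluate on the held-out portion
demo_predictions = demo_model.predict(X_demo_test)
print(classification_report(y_demo_test, demo_predictions))
```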

## Question 3:
Using `logmodel`, predict the labels for X_test and print the classification report comparing y_test with the predictions.

In [None]:
# solution


## KBest:
The kbest algorithm picks the k best features from a given dataset using a scoring function that measures the relationship between each feature and the labels. The features are ranked by their score, and all features ranked below the top k are removed. Read more on sklearn's kbest implementation <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html">here.</a>

The code block below prints the values of the k chosen features for the first 5 rows of the dataset.

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

test = SelectKBest(score_func=f_classif, k=25)
kbest_fit = test.fit(X, Y)
# summarize scores
np.set_printoptions(precision=3)
#print(kbest_fit.scores_)
kbest_features = kbest_fit.transform(X)
# summarize selected features
print(kbest_features[0:5, :])
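
The transformed array no longer shows which columns were kept. The sketch below, on made-up data in which only the first feature depends on the label, shows one way to recover the selected column positions with `get_support`. The names `X_toy`, `y_toy` and `selector` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_toy = np.array([0] * 50 + [1] * 50)

# Four hypothetical features: only the first one depends on the label
X_toy = rng.normal(size=(100, 4))
X_toy[:, 0] += 5 * y_toy

selector = SelectKBest(score_func=f_classif, k=2).fit(X_toy, y_toy)

# get_support(indices=True) returns the positions of the columns that were kept
selected = selector.get_support(indices=True)
print(selected)
print(selector.scores_)  # the informative feature receives by far the highest score
```

In the lab, something like `np.array(columns)[kbest_fit.get_support()]` should give the names of the chosen features, assuming the `columns` list lines up with the columns of `X`.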


Let us now build a logistic regression model only with the chosen features from the kbest algorithm.

In [None]:
X_train_kbest, X_test_kbest, y_train_kbest, y_test_kbest = train_test_split(
    kbest_features, Y, test_size=0.30, random_state=101)
logmodel.fit(X_train_kbest, y_train_kbest)
predictions = logmodel.predict(X_test_kbest)
print(classification_report(y_test_kbest, predictions))


## Question 4:
Comment on the performance of the model built using only the k best features versus the model built using all the features. Why do you think there is a difference in the performance?

## Answer:

## RFE:
RFE stands for recursive feature elimination. RFE repeatedly fits the model and removes the weakest features until the requested number of features is reached. Learn more about sklearn's RFE implementation <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html">here.</a>

In [None]:
rfe = RFE(logmodel, n_features_to_select=10)
rfe_fit = rfe.fit(X, Y)
# print the number of features chosen
print(rfe_fit.n_features_)
# print whether each feature from the dataset was chosen or not
print(rfe_fit.support_)
# print the rank of each feature (1 means the feature was chosen)
print(rfe_fit.ranking_)
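
To make the meaning of `support_` and `ranking_` concrete, here is a small sketch on made-up data in which only one of five features carries signal. The names `X_toy`, `y_toy` and `rfe_demo` are hypothetical, and `max_iter=1000` is an assumption made to avoid convergence warnings.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
y_toy = np.array([0] * 50 + [1] * 50)

# Five hypothetical features: only column 2 depends on the label
X_toy = rng.normal(size=(100, 5))
X_toy[:, 2] += 4 * y_toy

rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe_demo.fit(X_toy, y_toy)

print(rfe_demo.support_)  # True for the 2 features that were kept
print(rfe_demo.ranking_)  # 1 = kept; larger numbers were eliminated earlier
```

With the default step of one feature per round, the feature ranked highest (here 4) was dropped first, and the two surviving features share rank 1.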

## Question 5:
Build a logistic regression classifier with the features obtained from RFE. Print the classification report to evaluate the performance.

In [None]:
# solution


## Question 6:
Evaluate the performance if the number of chosen features is decreased to 3. What happens if the number of chosen features is 25? What happens if the number of chosen features is 50? Does increasing the number of features affect the performance of the model?

## Answer:

## PCA:
Principal Component Analysis is a statistical procedure that takes possibly correlated observations and transforms them into uncorrelated values called the principal components. Read more on sklearn's PCA implementation <a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html">here.</a>

The code block below fits PCA on the entire dataset and prints the requested number of principal components.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
pca_fit = pca.fit(X)
print(pca_fit.components_)
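
Note that `components_` holds the directions of the principal components, while a model is trained on the data projected onto those directions. The following sketch uses made-up data with two strongly correlated features to illustrate the projection; the names `X_toy`, `pca_demo` and `X_reduced` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 1))

# Three hypothetical features: the first two are strongly correlated,
# the third is independent noise
X_toy = np.hstack([base,
                   2 * base + 0.1 * rng.normal(size=(200, 1)),
                   rng.normal(size=(200, 1))])

pca_demo = PCA(n_components=2)
X_reduced = pca_demo.fit_transform(X_toy)  # rows projected onto the components

print(X_reduced.shape)                     # (200, 2)
print(pca_demo.explained_variance_ratio_)  # share of variance per component
component_corr = np.corrcoef(X_reduced, rowvar=False)
print(component_corr)                      # off-diagonal entries are close to 0
```

PCA is sensitive to the scale of the features, so on real data with mixed units it is common to standardize the columns first. Whether that helps on this dataset is something to check experimentally.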

## Question 7:
Try 10, 25 and 50 principal components as input and evaluate the performance of each setting by looking at its classification report. Print the classification report for the setting that achieves the best performance.

In [None]:
# solution



## Decision trees:
Decision trees are non-parametric algorithms that use the features in a dataset to generate a tree-based model. Let us build a decision tree for our dataset and evaluate its performance. Read more on sklearn's decision tree implementation <a href="https://scikit-learn.org/stable/modules/tree.html">here.</a>

In [None]:
from sklearn.tree import DecisionTreeClassifier  # Import Decision Tree Classifier

clf = DecisionTreeClassifier()

# Train the decision tree classifier
clf = clf.fit(X_train, y_train)

# Predict the labels for the test dataset
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))


Let us now build a decision tree where we restrict the maximum depth of the tree. A shallow tree can split on only a few features, so it leaves out features that are less important for the classification task, and its size stays under control.
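
The effect of a depth limit on tree size and on the number of features actually used can be observed on synthetic stand-in data, as in the sketch below. The data and the names `X_demo`, `full_tree` and `shallow_tree` are hypothetical, and the numbers it prints say nothing about the lab's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20 features, 5 of which are informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    # A feature with zero importance is never used in any split
    used = int((tree.feature_importances_ > 0).sum())
    print(name, "depth:", tree.get_depth(),
          "leaves:", tree.get_n_leaves(), "features used:", used)
```

A tree of depth 3 has at most 7 internal nodes, so it can split on at most 7 distinct features regardless of how many the dataset contains.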

## Question 8:
Using sklearn's implementation of decision trees, pass the parameter that specifies the maximum depth of the tree. Set the maximum depth to 3 and evaluate the model's performance by printing a classification report.


In [None]:
# solution


## Question 9:
Using sklearn's implementation of decision trees, pass the parameter that specifies the minimum number of samples required at a leaf node. Set this minimum to 3 and evaluate the model's performance by printing a classification report.

In [None]:
# solution


In this lab we learned to automatically select and combine features from a large feature set. From the results, we can see that the performance of models built on the selected features is comparable with that of the model built on all the features.