# Case Study 5 - SGD & SVM

__Team Members:__ Amber Clark, Andrew Leppla, Jorge Olmos, Paritosh Rai

# Content
* [Business Understanding](#business-understanding)
    - [Scope](#scope)
    - [Introduction](#introduction)
    - [Methods](#methods)
    - [Results](#results)
* [Data Evaluation](#data-evaluation)
    - [Loading Data](#loading-data) 
    - [Data Summary](#data-summary)
    - [Missing Values](#missing-values)
    - [Feature Removal](#feature-removal)
    - [Exploratory Data Analysis (EDA)](#eda)
    - [Assumptions](#assumptions)
* [Model Preparations](#model-preparations)
    - [Sampling & Scaling Data](#sampling-scaling-data)
    - [Proposed Method](#proposed-metrics)
    - [Evaluation Metrics](#evaluation-metrics)
    - [Feature Selection](#feature-selection)
* [Model Building & Evaluations](#model-building)
    - [Sampling Methodology](#sampling-methodology)
    - [Model](#model)
    - [Performance Analysis](#performance-analysis)
* [Model Interpretability & Explainability](#model-explanation)
    - [Examining Feature Importance](#examining-feature-importance)
* [Conclusion](#conclusion)
    - [Final Model Proposal](#final-model-proposal)
    - [Future Considerations and Model Enhancements](#model-enhancements)
    - [Alternative Modeling Approaches](#alternative-modeling-approaches)

# Business Understanding & Executive Summary <a id='business-understanding'/>

## Objective:
The goal of this case study is to build a firewall classification model that automatically allows or denies access requests in real time with a high level of accuracy.

## Introduction:
This case study is about the management of firewall traffic. Cyber security consists of all the technologies and practices that keep computer systems and electronic data safe. A firewall is one of the most critical components in helping large corporations protect their data. A firewall is a hardware or software filter that authorizes access based on a set of pre-established rules. These rules are based on multiple aspects of packet data such as source, destination, content, protocol, and other data characteristics. The challenge offered to the team was to build a classification model that can automatically allow or deny an access request based on the features of the incoming request. When the network is large and the policies are complicated, manual cross-checking may be insufficient, inefficient, and ineffective in detecting anomalies. An automated model based on machine learning and high-performance computing methods can detect anomalies and strengthen the firewall. To achieve this, firewall logs are analyzed, and the extracted features are fed to a set of machine learning classification algorithms.

Cybercrimes are increasing every day, which makes it critical that organizations build a robust system to ensure CPI (Customer Personal Information) and other confidential information is kept secure. Firewalls are one of the most critical tools to safeguard the network. Firewalls are like fencing to keep trespassers away and not allow unauthorized access to the system.


<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_5/Firewall.png" />   

The efficiency and effectiveness of a firewall are judged by its accuracy and speed in identifying malware and other suspicious activity. A firewall can be implemented in hardware, in software, or in the cloud. It carries out its function by filtering information at the application layer (proxy server) or by permitting (or blocking) traffic based on state, port, or protocol. Next-Generation Firewalls conduct deep packet inspection (beyond port/protocol inspection). The most advanced firewalls are Unified Threat Management (UTM) systems, which integrate multiple methodologies like stateful inspection, deep packet inspection, and antivirus. These firewalls can provide advanced threat detection and mitigation by correlating ports, protocols, and/or suspicious behaviors. [1]



## Model:
The team used SVM (Support Vector Machine) and SGD (Stochastic Gradient Descent) classification algorithms to categorize the requests into Allowed or Denied.

# Data Evaluation <a id='data-evaluation'>
    

## Dataset:
The provided dataset contains 65,532 rows and 12 columns. It includes port, byte, and packet information along with elapsed time. The dataset needs no imputation as there are no missing values. The ‘Action’ response column identifies the requests that were Allowed or Denied. An ‘Action’ of "deny", "drop", or "reset-both" was categorized as Denied and assigned a value of "0" (27,892 requests), and Allowed requests were assigned a value of "1" (37,640 requests). The dataset is relatively balanced per the bar chart below. However, out of an abundance of caution, the team used stratified splitting for the training and test sets. The dataset was also scaled to bring the values into comparable ranges for faster, more efficient modeling.

<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_5/Action Count.png" />
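The binary encoding described above can be sketched as follows. This is a minimal illustration on a made-up six-row log, not the notebook's actual code: the label `"allow"` for permitted requests and the column name `Allowed` are assumptions, while the three denied labels and the class counts come from the dataset description.

```python
import pandas as pd

# Hypothetical miniature log; the real dataset has 65,532 rows.
log = pd.DataFrame({
    "Action": ["allow", "deny", "drop", "allow", "reset-both", "allow"],
})

# Assumption: "allow" is the only permitted label.
# "deny", "drop", and "reset-both" all collapse to 0 (Denied).
log["Allowed"] = (log["Action"] == "allow").astype(int)
class_counts = log["Allowed"].value_counts().sort_index()

# Class balance implied by the counts reported for the full dataset.
full_counts = {0: 27892, 1: 37640}
allowed_share = full_counts[1] / sum(full_counts.values())
print(class_counts.to_dict(), f"allowed share: {allowed_share:.3f}")
```

With roughly 57% of requests Allowed, the classes are close enough to balanced that accuracy remains a meaningful metric.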

All of the features (other than the target) are numeric, but not all of them are truly continuous variables. As discussed in the introduction, ports represent a sort of address and are technically categorical features. However, many models can handle these port variables as continuous features, which is discussed further below.

The Port features were explored for any relationships to the response variable. Both NAT Source and NAT Destination Ports showed clear separation of the class variable at Port = 0 vs. Port > 0.

<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_5/EDA_All_Ports.png" />

<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_5/EDA_Ports_0.png" />
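One way to check a zero-versus-nonzero separation like the one in the plots above is a simple cross-tabulation. The sketch below uses invented values that mimic the pattern; the `Allowed` column and the `nat_port_is_zero` flag are hypothetical names, and the numbers are not taken from the firewall log.

```python
import pandas as pd

# Toy data built to mimic the pattern in the plots (not real log rows).
toy = pd.DataFrame({
    "NAT Source Port": [0, 0, 0, 51234, 40311, 0, 8080, 61022],
    "Allowed":         [0, 0, 0, 1,     1,     0, 1,    1],
})

# Flag whether the port is zero, then count each class per flag value.
toy["nat_port_is_zero"] = toy["NAT Source Port"] == 0
table = pd.crosstab(toy["nat_port_is_zero"], toy["Allowed"])
print(table)
```

A table with all of its counts on one diagonal, as in this toy case, indicates that the flag alone separates the two classes.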

The other numeric features were also examined with histograms.  This was done to explore the scale and distribution of the features, as well as any relationships to the response variable.  Note: features are plotted on a log scale to better visualize the highly skewed data.  Elapsed Time (sec) showed the biggest separation of the target variable with Elapsed Time (sec) = 0 vs. Elapsed Time (sec) > 0.

<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_5/EDA_Num_Vars.png" />

In [1]:
# standard libraries
import pandas as pd
import numpy as np
import re
import os
from IPython.display import Image
import sklearn
import time
# email
from email import policy
from email.parser import BytesParser

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from tabulate import tabulate

# data pre-processing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# clustering
from sklearn.cluster import DBSCAN
from statistics import stdev

# prediction models
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

# evaluation metrics
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# warnings filter (disabled; uncomment to silence warnings)
# import warnings
# warnings.filterwarnings('ignore')
# from warnings import simplefilter
# simplefilter(action='ignore', category=FutureWarning)



## Loading Data <a id='loading-data'>

## Data Summary <a id='data-summary'>

## Missing Values <a id='missing-values'>



## Feature Removal <a id='feature-removal'>

## Exploratory Data Analysis (EDA) <a id='eda'>

## Assumptions <a id='assumptions'>

# Model Preparations <a id='model-preparations'/>


SVM creates decision boundaries with a margin of separation for classification. Points on one side of the boundary are classified as Allowed, and points on the other side are classified as Denied. Only the points near the decision boundary and the misclassified points (together, the support vectors) determine where the boundary is ultimately placed. This makes SVM well suited to the “gray area” of classification where the two classes are most similar. Depending on the kernel, the boundary can be linear or nonlinear (e.g., linear, poly, rbf, sigmoid). SVM is also resistant to outliers as long as they fall on the correct side of the boundary.
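To make the role of kernels and support vectors concrete, here is a small sketch using scikit-learn's `SVC` on synthetic two-class blobs. The data, the `C=1.0` setting, and the `summary` dictionary are illustrative assumptions and do not reproduce the models fitted in this case study.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data standing in for Allowed vs. Denied requests.
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.60, random_state=0)

summary = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    summary[kernel] = {
        # Only these points determine where the boundary sits.
        "n_support_vectors": int(clf.n_support_.sum()),
        "train_accuracy": clf.score(X, y),
    }

for kernel, stats in summary.items():
    print(kernel, stats)
```

Comparing the number of support vectors with the 50 training points shows how few observations actually define the boundary for a given kernel.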

The Stochastic Gradient Descent (SGD) classifier is a simple and efficient approach to fitting linear classifiers and regressors such as (linear) Support Vector Machines and Logistic Regression. SGD should take much less time to train because it minimizes the loss “stochastically”: the weights are updated from one sample (or a small batch) at a time rather than from the full dataset at every step.
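As a rough sketch of what one stochastic step looks like, the update below is the generic SGD rule for a regularized linear model. The symbols ($w$ for the weights, $b$ for the intercept, $\eta$ for the learning rate, $\alpha$ for the regularization strength, $L$ for the loss, and $R$ for the penalty) are introduced here for illustration and are not taken from the notebook.

$$
w \leftarrow w - \eta \left( \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial L(w^{T} x_i + b,\; y_i)}{\partial w} \right)
$$

Each step uses a single training example $(x_i, y_i)$, so the cost of an update does not depend on the number of rows in the dataset.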

Due to the large number of rows in the dataset, the SGD implementation uses a “partial fit” training approach that loops through the dataset in smaller batches to improve training efficiency (both time and memory). This approach reduces memory use on datasets large enough to overwhelm the available computing resources.
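The batch-wise training loop can be sketched with `SGDClassifier.partial_fit`. The synthetic data, the batch size of 250, and the hinge loss are assumptions chosen for illustration; the notebook's actual settings may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the scaled firewall features.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

clf = SGDClassifier(loss="hinge", random_state=0)
classes = np.unique(y)  # partial_fit needs every label on the first call

batch_size = 250  # hypothetical batch size
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # Each call updates the existing weights using only this batch.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

train_accuracy = clf.score(X, y)
print(f"accuracy after one pass: {train_accuracy:.3f}")
```

Because only one batch is in use at a time, the same loop could read batches from disk when the full dataset does not fit in memory.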

For classification, the SGD classifier fits a linear model, much like logistic regression. It requires regularization to prevent overfitting, as well as hyperparameter tuning to get the best performance from the algorithm.
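A small grid search is one common way to tune the regularization settings mentioned above. This is a sketch under assumed values: the grid of `alpha` and `penalty` options, the 3-fold cross-validation, and the synthetic data are all illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hypothetical grid: regularization strength and penalty type.
param_grid = {"alpha": [1e-4, 1e-3, 1e-2], "penalty": ["l1", "l2"]}

search = GridSearchCV(
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Larger `alpha` values regularize more strongly, and the `l1` penalty additionally drives some coefficients to zero.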

Training and test sets were created using a stratified split to maintain the ratio of allowed versus denied requests. 30% of the data was withheld for the test set, and the relevant continuous features were standardized using StandardScaler.
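The split-then-scale step can be sketched as below. The skewed synthetic features and the 430/570 class mix are made up for illustration, and fitting the scaler on the training set only is an assumption about good practice rather than something the notebook states.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.exponential(scale=1000.0, size=(1000, 3))  # skewed, like byte counts
y = np.array([0] * 430 + [1] * 570)                # toy class mix

# stratify=y keeps the Denied/Allowed ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Assumption: learn the scaling from the training set, then reuse it.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())  # both 0.57
```

Reusing the training-set mean and standard deviation on the test set keeps information about the held-out rows from leaking into the model.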

The “port” categorical variables were considered for one-hot encoding as dummy variables. The team attempted this and found that the data would grow from 12 columns to over 27,000. The team dropped the one-hot encoding proposal because SVM and SGD took a very long time to process the encoded data, and processing time is a critical measure along with accuracy. The team also found that the SVM and SGD algorithms can treat these categorical variables as continuous variables and still achieve consistent classification with high accuracy.

A range of SVM and SGD classifiers was explored to tune the model hyperparameters. Maximum accuracy and minimum compute time were the criteria used to identify the best model.
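The column growth from one-hot encoding the port variables, described above, can be illustrated on synthetic data. The port values here are random draws, so the resulting column count is only indicative and will not match the 27,000 figure from the real log.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Random port numbers standing in for the real port columns.
ports = pd.DataFrame({
    "Source Port": rng.integers(0, 65536, size=2000),
    "Destination Port": rng.integers(0, 65536, size=2000),
})

# One dummy column is created per distinct port value.
encoded = pd.get_dummies(ports, columns=["Source Port", "Destination Port"])
n_distinct = int(ports.nunique().sum())

print(f"{ports.shape[1]} columns -> {encoded.shape[1]} columns")
```

The encoded width equals the number of distinct port values, which is why high-cardinality features like ports become expensive for SVM and SGD.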


## Sampling & Scaling Data <a id='sampling-scaling-data' />

## Proposed Method <a id='proposed-metrics' />

## Evaluation Metrics <a id='evaluation-metrics' />

### Baseline Model

## Feature Selection <a id='feature-selection' />

# Model Building & Evaluations <a id='model-building'/>

## Split into training and test

## Modeling

## Sampling Methodology <a id='sampling-methodology'/>

## Model's Performance Analysis <a id='performance-analysis'/>

The SVM classifier achieved 99.98% accuracy. It took XXX sec. to train the model and less than XX sec. to make predictions.

The final SGDClassifier achieved 99.898% accuracy. Using the partial fit approach, it took XXX sec. to train the model and less than XX sec. to make predictions.
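Training and prediction times like those reported above can be measured with `time.perf_counter`. The helper `time_model`, the synthetic data, and the model settings below are hypothetical; absolute timings depend on the hardware and on the size of the real dataset.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

def time_model(model, X, y):
    """Return (train_seconds, predict_seconds) for one fit/predict cycle."""
    start = time.perf_counter()
    model.fit(X, y)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X)
    predict_seconds = time.perf_counter() - start
    return train_seconds, predict_seconds

timings = {
    "SVC": time_model(SVC(kernel="rbf"), X, y),
    "SGDClassifier": time_model(SGDClassifier(random_state=0), X, y),
}
for name, (train_s, predict_s) in timings.items():
    print(f"{name}: train {train_s:.4f}s, predict {predict_s:.4f}s")
```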

### Final SGD Classifier

<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_5/XXXX.png" />


In the tables below, rows are the actual classes and columns are the predicted classes. The row totals across both tables (27,892 denied and 37,640 allowed) match the class counts in the dataset.

SGD Train - Confusion Matrix:

|                | Predicted Denied | Predicted Allowed |
|----------------|------------------|-------------------|
| Actual Denied  | 19,507           | 17                |
| Actual Allowed | 30               | 26,318            |

SGD Test - Confusion Matrix:

|                | Predicted Denied | Predicted Allowed |
|----------------|------------------|-------------------|
| Actual Denied  | 8,361            | 7                 |
| Actual Allowed | 13               | 11,279            |
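As a quick check, the reported accuracy can be recomputed from the confusion matrix counts above. The helper name `accuracy_from_confusion` is introduced here for illustration.

```python
import numpy as np

# Counts copied from the SGD train and test confusion matrices above.
train_cm = np.array([[19507, 17], [30, 26318]])
test_cm = np.array([[8361, 7], [13, 11279]])

def accuracy_from_confusion(cm):
    """Correct predictions sit on the diagonal; divide by all predictions."""
    return np.trace(cm) / cm.sum()

train_accuracy = accuracy_from_confusion(train_cm)
test_accuracy = accuracy_from_confusion(test_cm)
print(f"train: {train_accuracy:.3%}, test: {test_accuracy:.3%}")
# prints "train: 99.898%, test: 99.898%"
```

The test matrix covers 19,660 requests, which is 30% of the 65,532 rows, and 20 of them are misclassified.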



The final SGD Classifier coefficients are given in the graph below.  The L1 regularization eliminated all but 4 variables, with NAT Destination Port and Elapsed Time (sec) having the largest coefficients by far.
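The way L1 regularization removes features can be sketched on synthetic data with only two informative features. The `alpha=0.01` value and the data are assumptions, so the number of surviving coefficients here will not necessarily match the four reported for the final model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# 10 features, of which only 2 carry signal; the rest are noise.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=2, n_redundant=0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

# The L1 penalty pushes coefficients of unhelpful features to exactly zero.
clf = SGDClassifier(loss="hinge", penalty="l1", alpha=0.01, random_state=0)
clf.fit(X, y)

coefficients = clf.coef_.ravel()
kept = np.flatnonzero(coefficients)  # indices of surviving features
print(f"{len(kept)} of {len(coefficients)} coefficients are nonzero")
```

Because the features are standardized first, the magnitudes of the surviving coefficients can be compared directly as a rough measure of importance.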

# Model Interpretability & Explainability <a id='model-explanation'>

## Examining Feature Importance <a id='examining-feature-importance'/>

# Conclusion <a id='conclusion'>

The accuracy of roughly 99.9% on both the training and test sets is very high. However, depending on how many requests the firewall handles, that may not be good enough for real-world application. At 1,000 requests per day, roughly one request would be misclassified each day, either blocking an important request or letting a malicious one through the firewall. The model could be further tuned so that more bad requests are blocked, but this would also block more legitimate requests.
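The trade-off between blocking more bad requests and blocking more legitimate ones can be sketched by moving the decision threshold on the classifier's score. The synthetic data and the stricter threshold of 1.0 are illustrative assumptions; class 1 stands for Allowed, as in the dataset encoding.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)

# Signed distance from the boundary; positive scores lean toward Allowed (1).
scores = clf.decision_function(X)

def error_counts(threshold):
    """Count both error types when allowing only scores above threshold."""
    allowed = scores > threshold
    bad_let_through = int(np.sum(allowed & (y == 0)))
    good_blocked = int(np.sum(~allowed & (y == 1)))
    return bad_let_through, good_blocked

default_errors = error_counts(0.0)  # same rule as clf.predict
strict_errors = error_counts(1.0)   # hypothetical stricter threshold
print("default:", default_errors, "strict:", strict_errors)
```

Raising the threshold can only shrink the set of allowed requests, so fewer bad requests get through at the cost of more legitimate requests being blocked.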

The methods used in this case study could be run on High-Performance Computing (HPC) systems with multi-core processors to produce more comprehensive and accurate models. More effort could also go into tuning stochastic gradient descent to reach a lower loss. A larger volume of data would also improve learning, leading to better firewall performance.

The quick convergence of these models in training could indicate that only a few important features and values contribute to the classification. Exploring the data showed that a few consistent variable values always led to a denied request, so if these are identified quickly in training, the rest of the data likely contributes minimal benefit to classification.

Reviewing other classification algorithms, including Naive Bayes, kNN, Decision Table, and HyperPipes, may enhance the model. In the article by Erdem Ucar & Erkan Ozhan, “The Analysis of Firewall Policy Through Machine Learning and Data Mining,” published on May 17, 2017, kNN showed some outstanding results [3]. However, kNN would likely be slow and inefficient for real-time prediction.

It is equally vital for companies to update firewall policies regularly and to review the firewall logs continuously.


### Final Model Proposal <a id='final-model-proposal'/>

### Future Considerations and Model Enhancements <a id='model-enhancements'/>

### Alternative Modeling Approaches <a id='alternative-modeling-approaches'>

## References

[1] "What Is Firewall: Types, How Does It Work & Advantages." Simplilearn.

[2] "Difference between IP address and Port Number." GeeksforGeeks. https://www.geeksforgeeks.org/difference-between-ip-address-and-port-number/

[3] Ucar, E. and Ozhan, E. "The Analysis of Firewall Policy Through Machine Learning and Data Mining." SpringerLink, May 17, 2017.