## Logistic Regression

Logistic regression is closely related to linear regression, except that each point's $y$-value can only be $0$ or $1$. This makes it useful for predicting whether something does or does not belong to a particular class. Instead of fitting a line (as in linear regression), logistic regression fits an S-shaped probability curve that maps each input to an estimated probability of belonging to the class.
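To make the idea of a probability curve concrete, here is a minimal sketch of the logistic (sigmoid) function applied to a linear score. The coefficients `w` and `b` and the packet lengths are made-up values chosen for illustration; they are not fitted to any data in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration (not fitted).
w, b = 0.05, -4.0

# Made-up packet lengths in bytes.
lengths = np.array([40, 80, 120, 200])

# Estimated probability that each packet belongs to class 1.
# The linear score w * length + b is squashed into (0, 1).
probs = sigmoid(w * lengths + b)
print(probs)
```

With a positive `w`, longer packets get a higher estimated probability of being in class 1, and the probability crosses 0.5 where the linear score is zero.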

For example, using our device traffic, let's see whether we can predict if a DNS packet is a request or a response from its length.

First, let's import the data, extract only the DNS packets, and view the first few packets.

In [1]:
# Pandas, Numpy
import numpy as np
import pandas as pd

import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

# Machine Learning
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

import sys
sys.path.insert(1,"/Users/feamster/research/netml/src/")
import netml
from netml.pparser.parser import PCAP

In [15]:
#hpcap = PCAP('/Users/feamster/Documents/teaching/ml-networking/activities/pcaps/google_home.pcap', flow_ptks_thres=2, verbose=10)
hpcap = PCAP('data/http.pcap', flow_ptks_thres=2, verbose=10)

hpcap.pcap_to_pandas()
pcap = hpcap.df

'_pcap_to_pandas()' starts at 2022-10-28 19:31:14
'_pcap_to_pandas()' ends at 2022-10-28 19:31:21 and takes 0.1198 mins.


Each row in the printed data is a packet and each column is a feature of the packet.

Next, let's divide the DNS packets into requests and responses, and convert them into points where the $x$-value is the length of the packet and the $y$-value is $0$ for requests and $1$ for responses. This will allow us to fit the data to a logistic regression curve.

Let's see how many data points we have.

Next we will convert the DNS response column into a 0/1 value so that it is amenable to logistic regression.
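One way the counting and conversion steps might look is sketched below on a small toy DataFrame. The column names `length` and `is_response`, and all of the values, are hypothetical stand-ins; the actual columns produced by `pcap_to_pandas()` may be named differently.

```python
import pandas as pd

# Toy stand-in for the DNS packets. The column names `length` and
# `is_response` are hypothetical, not the real netml column names.
dns_toy = pd.DataFrame({
    "length": [74, 90, 78, 150, 82, 210],
    "is_response": [False, True, False, True, False, True],
})

# How many data points do we have, and how many fall in each class?
n_points = len(dns_toy)
class_counts = dns_toy["is_response"].value_counts()

# x: packet length; y: 0 for requests, 1 for responses.
x_toy = dns_toy["length"].to_numpy()
y_toy = dns_toy["is_response"].astype(int).to_numpy()

print(n_points)
print(class_counts)
print(x_toy, y_toy)
```

Casting a boolean column with `astype(int)` yields exactly the 0/1 labels that logistic regression expects.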

In [21]:
regr = LogisticRegression(solver='lbfgs', C=1)
regr.fit(x.reshape(-1,1),y)
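Once a model is fitted, the probability curve can be read off with `predict_proba`. The sketch below is self-contained and uses synthetic packet lengths rather than the pcap data; the names `x_toy`, `y_toy`, and `toy_model` are illustrative, and `max_iter=1000` is an added assumption to give the solver room to converge on unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not from the pcap): short packets are labeled
# 0 (request) and long packets are labeled 1 (response).
x_toy = np.array([60, 65, 70, 75, 80, 150, 160, 170, 180, 190])
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Same estimator settings as above, plus a larger iteration budget.
toy_model = LogisticRegression(solver="lbfgs", C=1, max_iter=1000)
toy_model.fit(x_toy.reshape(-1, 1), y_toy)

# Evaluate the fitted curve on a grid of packet lengths.
grid = np.linspace(50, 200, 7).reshape(-1, 1)
p_response = toy_model.predict_proba(grid)[:, 1]  # P(y = 1 | length)
labels = toy_model.predict(grid)                  # class with the higher probability

print(p_response)
print(labels)
```

Column 1 of `predict_proba` holds the estimated probability of class 1; `predict` returns whichever class is more probable.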

---

## Model Selection/Cross Validation

How do we know we have a good model? How do we know that the machine learning algorithm is "good"?

We can perform K-fold cross-validation on the data:
* holding out 1/K of the data for testing,
* training on the remaining data,
* repeating this K times, once for each fold,
* averaging the resulting accuracy/score of the model.

In [10]:
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5)

cv_results = cross_val_score(regr,
                             x.reshape(-1,1),
                             y,
                             cv=kf,
                             scoring="accuracy")

In [11]:
cv_results

array([0.86446886, 0.93014706, 0.9375    , 0.86029412, 0.70220588])

In [12]:
cv_results.mean()

0.8589231846584788
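To show what `cross_val_score` does with these folds, the procedure can be written out by hand. This is a sketch on synthetic data, not the pcap data; the names `X_toy`, `y_toy`, and `toy_model` are illustrative, and `shuffle=True` is an added assumption because the toy data is sorted by class.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not from the pcap), for illustration only.
rng = np.random.default_rng(0)
x_toy = np.concatenate([rng.normal(80, 15, 100), rng.normal(160, 30, 100)])
y_toy = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
X_toy = x_toy.reshape(-1, 1)

toy_model = LogisticRegression(solver="lbfgs", C=1, max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in kf.split(X_toy):
    fold_model = clone(toy_model)                        # fresh, unfitted copy
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])   # train on the other K-1 folds
    # Accuracy on the held-out fold.
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper should give the same per-fold accuracies.
library_scores = cross_val_score(toy_model, X_toy, y_toy, cv=kf, scoring="accuracy")

print(manual_scores, manual_scores.mean())
print(library_scores)
```

Each fold trains a fresh copy of the model, so no information from a held-out fold leaks into the model that is scored on it.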

### Hyper-Parameter Tuning: Grid Search

Grid search tunes hyper-parameters using cross-validation, such as the k-fold cross-validation that we just performed. The search repeats the evaluation above for every combination of parameter values defined in the grid.

Calling `fit` on the search object runs the search on the training data. The fitted search object then holds the cross-validation scores for each parameter setting and, by default, a model refit on all of the training data with the best-scoring parameters.

In [13]:
from sklearn.model_selection import GridSearchCV

k = 5
C_range = np.arange(1,5,1)
params = {
    'C': C_range,
}

grid_model = GridSearchCV(estimator=regr,
                          param_grid=params,
                          cv=k,
                          return_train_score=True,
                          scoring='accuracy')

grid_model_result = grid_model.fit(x.reshape(-1,1),y)

In [14]:
cv_results = pd.DataFrame(grid_model.cv_results_)
cv_results

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004232 | 0.000996 | 0.000261 | 6.9e-05 | 1 | {'C': 1} | 0.857143 | 0.922794 | 0.963235 | 0.834559 | ... | 0.855987 | 0.089526 | 1 | 0.859375 | 0.853076 | 0.836547 | 0.86685 | 0.890725 | 0.861315 | 0.01778 |
| 1 | 0.003454 | 0.000193 | 0.000209 | 2.5e-05 | 2 | {'C': 2} | 0.857143 | 0.922794 | 0.963235 | 0.834559 | ... | 0.855987 | 0.089526 | 1 | 0.859375 | 0.853076 | 0.836547 | 0.86685 | 0.890725 | 0.861315 | 0.01778 |
| 2 | 0.003394 | 0.000158 | 0.000191 | 2.1e-05 | 3 | {'C': 3} | 0.857143 | 0.922794 | 0.963235 | 0.834559 | ... | 0.855987 | 0.089526 | 1 | 0.859375 | 0.853076 | 0.836547 | 0.86685 | 0.890725 | 0.861315 | 0.01778 |
| 3 | 0.003095 | 8e-05 | 0.000145 | 4e-06 | 4 | {'C': 4} | 0.857143 | 0.922794 | 0.963235 | 0.834559 | ... | 0.855987 | 0.089526 | 1 | 0.859375 | 0.853076 | 0.836547 | 0.86685 | 0.890725 | 0.861315 | 0.01778 |
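Beyond the full `cv_results_` table, a fitted `GridSearchCV` object also exposes the winning setting directly. The sketch below is self-contained and uses synthetic data rather than the pcap data; the names `toy_search` and `X_toy` and the grid of `C` values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not from the pcap), for illustration only.
rng = np.random.default_rng(0)
x_toy = np.concatenate([rng.normal(80, 15, 100), rng.normal(160, 30, 100)])
y_toy = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
X_toy = x_toy.reshape(-1, 1)

# Illustrative grid of regularization strengths.
toy_search = GridSearchCV(
    estimator=LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
toy_search.fit(X_toy, y_toy)

best_C = toy_search.best_params_["C"]    # best-scoring value of C
best_score = toy_search.best_score_      # its mean cross-validated accuracy
best_model = toy_search.best_estimator_  # refit on all data (refit=True is the default)

# Use the refit model to classify two made-up packet lengths.
preds = best_model.predict(np.array([[70], [180]]))
print(best_C, best_score, preds)
```

When several settings tie on the mean test score, as all four values of `C` do in the table above, they share the top rank and `best_params_` reports just one of them.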
