# Logistic Regression

The logistic regression model is similar to the linear regression model. However, in the logistic model the response variable $Y_{i}$ is binary. A binary variable takes two values, for example, $Y_{i} = 0$ and $Y_{i} = 1$, called "failure" and "success", respectively. In this case, "success" is the event of interest. 

We can't use an ordinary linear regression model for a binary response: the fitted line is unbounded, so it can produce values below 0 or above 1 that cannot be read as probabilities. Instead, we pass the linear predictor through the sigmoid (also called logistic) function, which maps any real value into the interval between 0 and 1.

$$\begin{align}
\phi(z)=\frac{1}{1+e^{-z}}
\end{align}
$$

This means we can take the output of a linear regression and pass it through the sigmoid function.

$$\begin{align}
y=b_{0}+b_{1}x \\ \\ 
p=\frac{1}{1+e^{-(b_{0}+b_{1}x)}}
\end{align}
$$

This gives a probability between 0 and 1 of belonging to class 1. We can then set a cutoff, for example 0.5, so that any observation with a probability below the cutoff is assigned to class 0 and any observation above it is assigned to class 1.
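The mapping from a linear predictor to a class label can be sketched in a few lines of NumPy. This is only an illustration: the coefficients `b0` and `b1` and the inputs `x` are hypothetical values, not quantities estimated from the data used later in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and inputs, chosen only for illustration
b0, b1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 1.0, 3.0])

p = sigmoid(b0 + b1 * x)         # estimated probability of class 1
labels = (p > 0.5).astype(int)   # apply the 0.5 cutoff

print(p.round(3))
print(labels)
```

Moving the cutoff away from 0.5 changes which observations are labelled as class 1 without changing the probabilities themselves.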

For further information, please see Sections 4-4.3 of **An Introduction to Statistical Learning** by Gareth James, et al.

http://faculty.marshall.usc.edu/gareth-james/

## Libraries

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import time
from sklearn.linear_model import LogisticRegression
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN, SMOTENC
from imblearn.combine import SMOTETomek, SMOTEENN 
from numpy import where

## Read the data from csv

In [35]:
df_train = pd.read_csv('../data/df_train.csv')
df_test = pd.read_csv('../data/df_test.csv')

X_train = df_train.drop('kill', axis=1)
y_train = df_train['kill']
X_test = df_test.drop(['kill'], axis=1)
y_test = df_test['kill']

X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values

In [36]:
X_train.shape

(130565, 11)

## Train Test Split

The train-test split is a technique for evaluating the performance of a machine learning algorithm.

It can be used for classification or regression problems and can be used for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

- **Train Dataset:** Used to fit the machine learning model.
- **Test Dataset:** Used to evaluate the fit machine learning model.

The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.

The train-test procedure is appropriate when there is a sufficiently large dataset available. The train-test procedure is not appropriate when the dataset available is small. The reason is that when the dataset is split into train and test sets, there will not be enough data in the training dataset for the model to learn an effective mapping of inputs to outputs. There will also not be enough data in the test set to effectively evaluate the model performance. The estimated performance could be overly optimistic (good) or overly pessimistic (bad).

If you have insufficient data, then a suitable alternate model evaluation procedure would be the **k-fold cross-validation procedure**.
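As a rough sketch of what k-fold cross-validation looks like in scikit-learn, the example below scores a logistic regression on five folds. The synthetic dataset (`X_demo`, `y_demo`) and the choice of five folds are assumptions made for illustration; they are not part of this notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic dataset, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Five stratified folds: each fold serves once as the evaluation set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

# One accuracy value per fold, summarized by mean and standard deviation
print(scores.mean(), scores.std())
```

Every row is used for both fitting and evaluation across the folds, which is why this procedure tends to suit small datasets better than a single split.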

In addition to dataset size, another reason to use the train-test split evaluation procedure is computational efficiency.

Samples from the original training dataset are split into the two subsets using random selection. This is to ensure that the train and test datasets are representative of the original dataset.

### Choosing the split

The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 for either the train or test datasets.

There is no optimal split percentage.

You must choose a split percentage that meets your project’s objectives with considerations that include:

- Computational cost in training the model.
- Computational cost in evaluating the model.
- Training set representativeness.
- Test set representativeness.

Nevertheless, common split percentages include:

- Train: 80%, Test: 20%
- Train: 67%, Test: 33%
- Train: 50%, Test: 50%


### Splitting the data into training set and testing set using train_test_split

The scikit-learn Python machine learning library provides an implementation of the train-test split evaluation procedure via the train_test_split() function.

The size of the split can be specified via the “test_size” argument, which takes either a number of rows (integer) or a proportion of the dataset (float between 0 and 1).

Another important consideration is that rows are assigned to the train and test sets randomly.

This is done to ensure that datasets are a representative sample (e.g. random sample) of the original dataset, which in turn, should be a representative sample of observations from the problem domain.

When comparing machine learning algorithms, it is desirable (perhaps required) that they are fit and evaluated on the same subsets of the dataset.

This can be achieved by fixing the seed of the pseudo-random number generator used when splitting the dataset, which is done by setting “random_state” to an integer value. Any value will do; it is not a tunable hyperparameter.

Some classification problems do not have a balanced number of examples for each class label. As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.

This is called a stratified train-test split.

We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.
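The three arguments discussed above can be combined in a single call. The sketch below uses synthetic data (`X_demo`, `y_demo`, with 10% of labels in class 1), which is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (900 rows of class 0, 100 of class 1), illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 900 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # fixes the shuffle so the split is reproducible
    stratify=y_demo,   # keep the class proportions in both subsets
)

# Share of class 1 in each subset; both should stay close to 0.10
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of class 1 in the two subsets would vary from one random split to the next.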

**We already have the train and test datasets, so we skip the train-test split step!**

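**Standardization isn't strictly required for logistic regression. Its main benefit is helping the optimization routine converge. Because scikit-learn's LogisticRegression applies L2 regularization by default, feature scale can also influence the fitted coefficients.**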
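If convergence does become a problem, one common option is to standardize inside a pipeline so that the scaler is fitted on the training data only. The sketch below is not what this notebook does; it uses synthetic data with one deliberately inflated feature, and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, illustration only; the first feature is put on a much larger scale
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_demo[:, 0] *= 1000.0

# The scaler is fitted together with the model, so the same transformation
# is applied automatically at prediction time
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

# Accuracy on the same data the pipeline was fitted on (an optimistic figure)
print(scaled_model.score(X_demo, y_demo))
```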

### Training and fitting a logistic regression model on the training set.

In [37]:
model = LogisticRegression()

In [38]:
model.fit(X_train,y_train)

LogisticRegression()

## Predictions
Predicting values for the testing data.

In [39]:
predictions = model.predict(X_test)

## Evaluations

The metrics that you choose to evaluate your machine learning algorithms are very important.

Choice of metrics influences how the performance of machine learning algorithms is measured and compared. They influence how you weight the importance of different characteristics in the results and your ultimate choice of which algorithm to choose.

Classification problems are perhaps the most common type of machine learning problem and as such there are a myriad of metrics that can be used to evaluate predictions for these problems.

1. Classification Accuracy.
2. Confusion Matrix.
3. Precision-Recall Curves.
4. Classification Report.

### Classification Accuracy

Classification accuracy is the number of correct predictions made as a ratio of all predictions made.

This is the most common evaluation metric for classification problems, and it is also the most misused. It is really only suitable when there are an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important (which is often not the case).

Classification accuracy is the total number of correct predictions divided by the total number of predictions made for a dataset.

As a performance measure, accuracy is inappropriate for imbalanced classification problems.

The main reason is that the overwhelming number of examples from the majority class (or classes) will overwhelm the number of examples in the minority class, meaning that even unskillful models can achieve accuracy scores of 90 percent, or 99 percent, depending on how severe the class imbalance happens to be. An alternative to using classification accuracy is to use precision and recall metrics.
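A small, made-up example shows how this happens. Assume 90 examples of class 0 and 10 of class 1, and a "model" that always predicts class 0; the arrays below are hypothetical and unrelated to the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 negatives and 10 positives
y_true_demo = np.array([0] * 90 + [1] * 10)

# An unskillful "model" that always predicts the majority class
y_pred_demo = np.zeros_like(y_true_demo)

print(accuracy_score(y_true_demo, y_pred_demo))  # 0.9
print(recall_score(y_true_demo, y_pred_demo))    # 0.0
```

The accuracy looks respectable even though not a single minority-class example is found, which is what recall exposes.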

In [40]:
# calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.3f}')

Accuracy: 0.877


### Confusion Matrix

The confusion matrix is a handy presentation of the accuracy of a model with two or more classes.

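The table presents predicted classes along one axis and actual classes along the other. In the output of scikit-learn's confusion_matrix(), rows are the actual classes and columns are the predicted classes. Each cell holds the number of predictions that fall into that combination.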

For example, a machine learning algorithm can predict 0 or 1 and each prediction may actually have been a 0 or 1. Predictions for 0 that were actually 0 appear in the cell for prediction=0 and actual=0, whereas predictions for 0 that were actually 1 appear in the cell for prediction = 0 and actual=1. And so on.

For imbalanced classification problems, the majority class is typically referred to as the negative outcome (e.g. “no change” or “negative test result”), and the minority class is typically referred to as the positive outcome (e.g. “change” or “positive test result”).

The confusion matrix provides more insight into not only the performance of a predictive model, but also which classes are being predicted correctly, which incorrectly, and what type of errors are being made.

The simplest confusion matrix is for a two-class classification problem, with negative (class 0) and positive (class 1) classes.

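In this type of confusion matrix, each cell in the table has a specific and well-understood name, summarized as follows:

| | Predicted negative (0) | Predicted positive (1) |
|---|---|---|
| **Actual negative (0)** | True Negative (TN) | False Positive (FP) |
| **Actual positive (1)** | False Negative (FN) | True Positive (TP) |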

In [41]:
print(confusion_matrix(y_test,predictions))

[[20210    11]
 [ 2812     8]]


The precision and recall metrics are defined in terms of the cells in the confusion matrix, specifically terms like true positives and false negatives.
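To make the link between the cells and the metrics concrete, the sketch below takes the matrix printed above and recomputes the scores by hand. It relies on scikit-learn's layout for binary problems, where flattening the matrix yields TN, FP, FN, TP in that order; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix reported above (rows: actual class, columns: predicted class)
cm = np.array([[20210, 11],
               [2812, 8]])

# For a binary problem, flattening gives TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy_manual = (tp + tn) / cm.sum()
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print(f'Accuracy:  {accuracy_manual:.3f}')
print(f'Precision: {precision_manual:.3f}')
print(f'Recall:    {recall_manual:.3f}')
```

Only 8 of the 2820 actual positives are recovered, which the overall accuracy hides.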

### Precision and Recall Curves

**Precision for Imbalanced Classification**

Precision is a metric that quantifies the number of correct positive predictions made.

Precision, therefore, calculates the accuracy for the minority class.

It is calculated as the ratio of correctly predicted positive examples divided by the total number of positive examples that were predicted.

Precision evaluates the fraction of correct classified instances among the ones classified as positive … — Page 52, Learning from Imbalanced Data Sets, 2018.

In an imbalanced classification problem with two classes, precision is calculated as the number of true positives divided by the total number of true positives and false positives.

The result is a value between 0.0 for no precision and 1.0 for full or perfect precision.

You can see that precision is simply the ratio of correct positive predictions out of all positive predictions made, or the accuracy of minority class predictions.

Consider a dataset where a model predicts 50 examples as belonging to the minority class, 45 of which are true positives and five of which are false positives. The precision for this model is 45 / (45 + 5) = 0.90.

This highlights that although precision is useful, it does not tell the whole story. It does not comment on how many real positive class examples were predicted as belonging to the negative class, so-called false negatives.

**The precision score can be calculated using the *precision_score()* scikit-learn function.**

In [42]:
# calculate precision
precision = precision_score(y_test,predictions, average='binary')
print(f'Precision: {precision:.3f}')

Precision: 0.421


**Recall for Imbalanced Classification**

Recall is a metric that quantifies the number of correct positive predictions made out of all positive predictions that could have been made.

Unlike precision that only comments on the correct positive predictions out of all positive predictions, recall provides an indication of missed positive predictions.

In this way, recall provides some notion of the coverage of the positive class.

For imbalanced learning, recall is typically used to measure the coverage of the minority class. — Page 27, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

In an imbalanced classification problem with two classes, recall is calculated as the number of true positives divided by the total number of true positives and false negatives.

The result is a value between 0.0 for no recall and 1.0 for full or perfect recall.

**The recall score can be calculated using the *recall_score()* scikit-learn function.**

In [43]:
# calculate recall
recall = recall_score(y_test,predictions, average='binary')
print(f'Recall: {recall:.3f}')

Recall: 0.003


**Precision vs. Recall for Imbalanced Classification**

You may decide to use precision or recall on your imbalanced classification problem.

Maximizing precision will minimize the number of false positives, whereas maximizing recall will minimize the number of false negatives.

- Precision: Appropriate when minimizing false positives is the focus.
- Recall: Appropriate when minimizing false negatives is the focus.

Sometimes, we want excellent predictions of the positive class. We want high precision and high recall.

This can be challenging, as increases in recall often come at the expense of decreases in precision.

In imbalanced datasets, the goal is to improve recall without hurting precision. These goals, however, are often conflicting, since in order to increase the TP for the minority class, the number of FP is also often increased, resulting in reduced precision. — Page 55, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
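One way to see this trade-off is to vary the probability cutoff of a fitted model and recompute both metrics. The sketch below does this on synthetic imbalanced data; the dataset, the three cutoffs, and the variable names are assumptions for illustration, and the scores are computed on the training data purely to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic data with roughly 10% positives, illustration only
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)[:, 1]   # estimated probability of class 1

results = []
for cutoff in (0.2, 0.5, 0.8):
    pred = (proba >= cutoff).astype(int)
    p = precision_score(y_demo, pred, zero_division=0)
    r = recall_score(y_demo, pred)
    results.append((cutoff, p, r))
    print(f'cutoff={cutoff:.1f}  precision={p:.3f}  recall={r:.3f}')
```

Lowering the cutoff can only add predicted positives, so recall never decreases as the cutoff drops; precision is free to move in either direction and often falls.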

Nevertheless, instead of picking one measure or the other, we can choose a new metric that combines both precision and recall into one score.

**F-Measure for Imbalanced Classification**

Classification accuracy is widely used because it is one single measure used to summarize model performance.

F-Measure provides a way to combine both precision and recall into a single measure that captures both properties.

Alone, neither precision nor recall tells the whole story. We can have excellent precision with terrible recall, or alternately, terrible precision with excellent recall. F-measure provides a way to express both concerns with a single score.

Once precision and recall have been calculated for a binary or multiclass classification problem, the two scores can be combined into the calculation of the F-Measure.

The traditional F measure is calculated as follows:

* F-Measure = (2 * Precision * Recall) / (Precision + Recall)

This is the harmonic mean of the two fractions. This is sometimes called the F-Score or the F1-Score and might be the most common metric used on imbalanced classification problems.

"*... the F1-measure, which weights precision and recall equally, is the variant most often used when learning from imbalanced data.*" — Page 27, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

Like precision and recall, the worst possible F-Measure score is 0.0 and the best, or perfect, F-Measure score is 1.0.
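The formula is simple enough to write out directly. The helper `f_measure` below is a hypothetical function written for this illustration (it is not a scikit-learn API); it is applied to the precision and recall implied by the confusion matrix shown earlier (TP = 8, FP = 11, FN = 2812).

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall implied by the confusion matrix above
precision_demo = 8 / (8 + 11)
recall_demo = 8 / (8 + 2812)

print(f'F-Measure:       {f_measure(precision_demo, recall_demo):.3f}')
print(f'Arithmetic mean: {(precision_demo + recall_demo) / 2:.3f}')
```

The harmonic mean stays close to the smaller of the two values, so a model cannot earn a good F-Measure from precision alone. A plain average would report about 0.21 here.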

**The F-measure score can be calculated using the *f1_score()* scikit-learn function.**

In [44]:
# calculate score
f1 = f1_score(y_test, predictions, average='binary')
print(f'F-Measure: {f1:.3f}')

F-Measure: 0.006


### Classification Report

Scikit-learn does provide a convenience report when working on classification problems to give you a quick idea of the accuracy of a model using a number of measures.

The classification_report() function displays the precision, recall, f1-score and support for each class.

In [45]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.88      1.00      0.93     20221
           1       0.42      0.00      0.01      2820

    accuracy                           0.88     23041
   macro avg       0.65      0.50      0.47     23041
weighted avg       0.82      0.88      0.82     23041



## GridSearch

In [46]:
param_grid = {'C':np.logspace(-3,3,7), 'max_iter':[1000]}

In [47]:
grid = GridSearchCV(LogisticRegression(),param_grid,refit=True,verbose=2, cv = 5, n_jobs = -1)

In [48]:
# May take awhile!
grid.fit(X_train,y_train)

Fitting 5 folds for each of 7 candidates, totalling 35 fits


GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'max_iter': [1000]},
             verbose=2)

In [49]:
grid.best_params_

{'C': 100.0, 'max_iter': 1000}

In [50]:
best_grid = grid.best_estimator_
best_grid

LogisticRegression(C=100.0, max_iter=1000)

In [51]:
grid.best_score_

0.8806188488492322
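When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's default scorer, which for classifiers is accuracy. On imbalanced data it may be worth ranking by another metric instead. The sketch below is one possible variation, not what was run above: it uses synthetic data and passes `scoring='f1'`, with the names `X_demo`, `y_demo`, and `grid_f1` chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with roughly 10% positives, illustration only
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

param_grid_demo = {'C': np.logspace(-3, 3, 7), 'max_iter': [1000]}

# Same kind of search as above, but candidates are ranked by F1 on the positive class
grid_f1 = GridSearchCV(LogisticRegression(), param_grid_demo, scoring='f1', cv=5)
grid_f1.fit(X_demo, y_demo)

print(grid_f1.best_params_, grid_f1.best_score_)
```

With this setting, `best_score_` is the mean cross-validated F1 rather than accuracy, so it is not directly comparable to the value reported above.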

Calculate the predictions and the inference time

In [52]:
def calculate_pred_and_inf_time(best_grid, X_test):
    # record the start time (perf_counter is a monotonic, high-resolution clock)
    st_wall_inf = time.perf_counter()

    # run inference on the test set; only the elapsed time is reported here
    best_grid.predict(X_test)

    # record the end time
    et_wall_inf = time.perf_counter()

    # elapsed wall-clock time, reported in milliseconds
    wall_time_inf = et_wall_inf - st_wall_inf
    print(f'Inference Time: {1000*wall_time_inf:.3f} milliseconds')

calculate_pred_and_inf_time(best_grid, X_test)

Inference Time: 1.066 milliseconds


In [53]:
# Generate generalization metrics
grid_predictions = best_grid.predict(X_test)
print(confusion_matrix(y_test,grid_predictions))

[[20210    11]
 [ 2812     8]]


In [54]:
print(classification_report(y_test,grid_predictions))

              precision    recall  f1-score   support

           0       0.88      1.00      0.93     20221
           1       0.42      0.00      0.01      2820

    accuracy                           0.88     23041
   macro avg       0.65      0.50      0.47     23041
weighted avg       0.82      0.88      0.82     23041



In [55]:
def fit_and_print(model, X_train, y_train):
    # Refit the model on the given training data, then evaluate it on the
    # held-out test set (X_test and y_test are read from the notebook's globals)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("\n")
    print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
    print("Classification Report: \n", classification_report(y_test, y_pred))
    print("Accuracy: ", round(accuracy_score(y_test, y_pred),3))
    print("Precision:", round(precision_score(y_test, y_pred),3))
    print("Recall:", round(recall_score(y_test, y_pred),3))
    print("f1: ", round(f1_score(y_test, y_pred),3))
    print("\n")

In [56]:
fit_and_print(best_grid,X_train,y_train)



Confusion Matrix: 
 [[20210    11]
 [ 2812     8]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.88      1.00      0.93     20221
           1       0.42      0.00      0.01      2820

    accuracy                           0.88     23041
   macro avg       0.65      0.50      0.47     23041
weighted avg       0.82      0.88      0.82     23041

Accuracy:  0.877
Precision: 0.421
Recall: 0.003
f1:  0.006




## Resampling

### SMOTE

In [57]:
# Oversample the imbalanced training set with SMOTE

# summarize class distribution
counter = Counter(y_train)
print(counter)
# transform the dataset
oversample = SMOTE(random_state=42)
X_train_rel, y_train_rel = oversample.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train_rel)
print(counter)

fit_and_print(best_grid, X_train_rel, y_train_rel)

calculate_pred_and_inf_time(best_grid, X_test)

Counter({0: 114988, 1: 15577})
Counter({0: 114988, 1: 114988})


Confusion Matrix: 
 [[13168  7053]
 [  889  1931]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.94      0.65      0.77     20221
           1       0.21      0.68      0.33      2820

    accuracy                           0.66     23041
   macro avg       0.58      0.67      0.55     23041
weighted avg       0.85      0.66      0.71     23041

Accuracy:  0.655
Precision: 0.215
Recall: 0.685
f1:  0.327


Inference Time: 1.126 milliseconds


### ADASYN

In [59]:
# Oversample the imbalanced training set with ADASYN

# summarize class distribution
counter = Counter(y_train)
print(counter)
# transform the dataset
oversample = ADASYN(random_state=42)
X_train_rel, y_train_rel = oversample.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train_rel)
print(counter)

fit_and_print(best_grid, X_train_rel, y_train_rel)

calculate_pred_and_inf_time(best_grid, X_test)

Counter({0: 114988, 1: 15577})
Counter({0: 114988, 1: 112138})


Confusion Matrix: 
 [[12937  7284]
 [  863  1957]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.94      0.64      0.76     20221
           1       0.21      0.69      0.32      2820

    accuracy                           0.65     23041
   macro avg       0.57      0.67      0.54     23041
weighted avg       0.85      0.65      0.71     23041

Accuracy:  0.646
Precision: 0.212
Recall: 0.694
f1:  0.325


Inference Time: 1.097 milliseconds


### SMOTE and Tomek Links (TL)

In [60]:
# Resample the imbalanced training set with SMOTE followed by Tomek links (TL) cleaning

# summarize class distribution
counter = Counter(y_train)
print(counter)
# transform the dataset
oversample = SMOTETomek(random_state=42)
X_train_rel, y_train_rel = oversample.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train_rel)
print(counter)

fit_and_print(best_grid, X_train_rel, y_train_rel)

calculate_pred_and_inf_time(best_grid, X_test)

Counter({0: 114988, 1: 15577})
Counter({0: 106266, 1: 106266})


Confusion Matrix: 
 [[13327  6894]
 [  915  1905]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.94      0.66      0.77     20221
           1       0.22      0.68      0.33      2820

    accuracy                           0.66     23041
   macro avg       0.58      0.67      0.55     23041
weighted avg       0.85      0.66      0.72     23041

Accuracy:  0.661
Precision: 0.217
Recall: 0.676
f1:  0.328


Inference Time: 1.083 milliseconds


### SMOTE and Edited Nearest Neighbours (ENN)

In [61]:
# Resample the imbalanced training set with SMOTE followed by Edited Nearest Neighbours (ENN) cleaning

# summarize class distribution
counter = Counter(y_train)
print(counter)
# transform the dataset
oversample = SMOTEENN(random_state=42)
X_train_rel, y_train_rel = oversample.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train_rel)
print(counter)

fit_and_print(best_grid, X_train_rel, y_train_rel)

calculate_pred_and_inf_time(best_grid, X_test)

Counter({0: 114988, 1: 15577})
Counter({1: 79510, 0: 69097})


Confusion Matrix: 
 [[12388  7833]
 [  814  2006]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.94      0.61      0.74     20221
           1       0.20      0.71      0.32      2820

    accuracy                           0.62     23041
   macro avg       0.57      0.66      0.53     23041
weighted avg       0.85      0.62      0.69     23041

Accuracy:  0.625
Precision: 0.204
Recall: 0.711
f1:  0.317


Inference Time: 1.106 milliseconds
