# Logistic Regression

## What will you learn in this course? 🧐🧐

This course's goal is to teach you the principles of logistic regression and how to apply this model to binary classification problems. You will also learn how to protect yourself from overfitting thanks to a very useful method in supervised machine learning: cross-validation. You have actually already used it, without really knowing it, when you used the ```GridSearchCV``` function to look for the optimal penalty parameter for ridge and LASSO.

* Definition
    * Equation
    * Linear decision boundary
    * Interpretation
* Classification in practice
* How to monitor errors made by the model
     * False positive
     * False Negative
     * Watch out for false positives and ESPECIALLY false negatives.
* Metrics for performance assessment
     * Accuracy
     * Recall
     * Precision
     * F1-score
     * Confusion matrix
     * ROC curve, AUC and GINI coefficient



## Definition

Unlike linear regressions, which predict a number, classification models predict a category, or rather the probability distribution across all possible categories of the target variable. For example, if you are trying to predict whether someone will buy a product from you based on certain independent variables, you get into a classification problem because the categories you are trying to predict are "yes, the person will buy the product" or "no, the person won't buy my product".

Logistic regressions are only one type of classification model: there are many others, such as decision trees, SVM (_support vector machine_) or Naive Bayes.

### Model Equation 👩‍🔬

When building a logistic regression model, we assume that there is a function $f$ that links the target variable $Y$ to the explanatory variables represented in the matrix $X$ as follows:

$$
P(Y=1)=f(X)+\epsilon
$$

Where $\epsilon$ is the residual.

The specific function we choose to estimate $f$ in the context of logistic regression is none other than the logistic function (also known as the "sigmoid function" because of its shape)!

$$
\begin{aligned}
f(X) &= \mathrm{sigmoid}(z)  \\
 &= \frac{1}{1+\exp(-z)} \\
 &= \frac{1}{1+\exp(-(\beta_{0}+X_{1}\beta_{1}+...+X_{p}\beta_{p}))}
\end{aligned}
$$

Where :

$$
\mathrm{sigmoid}(z) = \frac{1}{1+\exp(-z)}
$$

and $z$ is the linear combination of the explanatory variables $X_1, ..., X_p$:

$$
z = \beta_{0}+X_{1}\beta_{1}+...+X_{p}\beta_{p}
$$

The parameters $\beta_0, \beta_1, ..., \beta_p$ are the coefficients to be determined at the training step.

In other words, the logistic regression model consists in finding a linear combination of the features that makes it possible to distinguish between $Y=0$ and $Y=1$. The probability $P(Y=1)$ is computed as the logistic (sigmoid) function applied to that linear combination.

![logistic_function](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/logistic_function.png)

In the graph above, $Y$ represents whether a person will buy a product ($Y = 1$) or will not buy a product ($Y = 0$). The blue curve represents the probability that a person will buy ($Y = 1$) depending on the value of $X$. This function constrains the values of $P(Y=1)$ to stay in the interval $[0,1]$, which is the set of values that a probability can take. Based on the estimated probability, the algorithm decides in which category to classify each individual.
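To make the equation more concrete, here is a minimal sketch of how a probability is obtained from a linear combination of the features. The coefficient values and the two features are hypothetical, chosen only for illustration; in practice the coefficients are learned during training.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, intercept, coefs):
    """P(Y=1) for one observation: sigmoid of the linear combination."""
    z = intercept + sum(b * x for b, x in zip(coefs, features))
    return sigmoid(z)


# Hypothetical coefficients (beta_0, beta_1, beta_2), for illustration only
beta_0 = -3.0
betas = [0.05, 1.2]

# One hypothetical observation with two features
p = predict_proba([40.0, 1.0], beta_0, betas)
print(f"P(Y=1) = {p:.3f}")
```

Whatever the value of the linear combination, the result always stays strictly between 0 and 1, which is what allows us to read it as a probability.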

### Cost Function - Log Loss 📉

As you know, in Machine Learning, we always try to minimize what we call a cost function. In the case of logistic regression, we use the *Log Loss* function, which looks like this:

$$ \mathrm{Log\,Loss} = \sum_{i=1}^n \left[ -y_i\log(\hat{y}_i) - (1-y_i)\log(1-\hat{y}_i) \right] $$

Where:

* $y_i$ is your actual target value (0 or 1)
* $\hat{y}_i$ is your model's prediction, that is the estimated probability that $y_i = 1$

This formula looks a little more complicated, and you don't need to master every detail of it. The whole idea behind it is to penalize predictions that are far from the actual target value, so that minimizing it brings the algorithm's predictions closer to the actual targets.
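If you want to see the Log Loss at work, the sketch below computes it by hand for two sets of made-up predicted probabilities: one close to the targets and one far from them. The data is hypothetical, and the function assumes every predicted probability is strictly between 0 and 1.

```python
import math


def log_loss_sum(y_true, y_pred):
    """Sum of the per-observation log losses (predictions must be in (0, 1))."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total


# Hypothetical targets and predicted probabilities
y_true = [1, 0, 1, 0]
good_predictions = [0.9, 0.1, 0.8, 0.2]  # close to the targets
bad_predictions = [0.4, 0.6, 0.3, 0.7]   # far from the targets

good_loss = log_loss_sum(y_true, good_predictions)
bad_loss = log_loss_sum(y_true, bad_predictions)
print(f"good: {good_loss:.3f}  bad: {bad_loss:.3f}")
```

The predictions that are closer to the targets get the smaller loss. Note that sklearn's `log_loss` function returns the average over the observations by default rather than the sum, which does not change which model is best.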


### Linear decision boundary

When we studied linear regression, we saw that our predictor could be seen as a line drawn by our model. With logistic regression, the line is now a boundary that separates the space of explanatory variables into two areas, one for each category of the target variable. The figure below represents the model's decision boundary in a 2-dimensional feature space:

![decision_boundary](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/decision_boundary.png)

That's the reason why logistic regression is said to be a "linear model". Other classification models, like Decision Trees, can create non-linear decision boundaries.
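To see why the boundary is a straight line, here is a small sketch with two features and hypothetical coefficients (not taken from any real model). The predicted probability equals exactly 0.5 where the linear combination equals 0, and that condition is the equation of a line.

```python
import math

# Hypothetical coefficients for a model with two features x1 and x2
beta_0, beta_1, beta_2 = -4.0, 1.0, 2.0


def proba(x1, x2):
    """P(Y=1) for a point (x1, x2)."""
    z = beta_0 + beta_1 * x1 + beta_2 * x2
    return 1.0 / (1.0 + math.exp(-z))


def boundary_x2(x1):
    """x2 coordinate of the decision boundary, where the linear combination is 0."""
    return -(beta_0 + beta_1 * x1) / beta_2


x1 = 1.0
x2_on_line = boundary_x2(x1)

p_on_line = proba(x1, x2_on_line)      # exactly on the boundary
p_above = proba(x1, x2_on_line + 1.0)  # one side of the line
p_below = proba(x1, x2_on_line - 1.0)  # the other side

print(p_on_line, round(p_above, 3), round(p_below, 3))
```

Points on one side of the line get a probability above 0.5, and points on the other side get a probability below 0.5.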

### Interpretation

Logistic regression is a transposition of multiple linear regression into the interval [0,1]. Just like linear regression, it has coefficients, and you can analyse their p-values in order to determine which variables have a significant impact on the target variable.
	

## Classification in practice

Now that we have drawn the classification line, we can begin our interpretations. Since our model now produces probabilities, the points with a probability greater than 50% will belong to category A, while the points with a probability less than 50% will belong to category B. This 50% threshold is the one chosen by default for classification, but another threshold may be chosen depending on the problem. For banking fraud, for example, data scientists may be tempted to set a much lower threshold in order to catch the highest possible proportion of fraudsters, regardless of the number of genuine transactions they have to block along with them. One fraud costs them so much trouble, time and money that they would rather delay many genuine payments than let one fraud go through.

Let's take an example, based on some independent variables: we discovered that person A has a 60% chance of buying the product. They will therefore be considered a "buyer" for our model. On the other hand, if this time we have a person B who has only a 45% chance of buying the product, they will be considered as a "non-buyer".
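The thresholding step can be sketched in a few lines. The probabilities below are the ones from the example (60% and 45%); the `classify` helper and the alternative threshold of 0.4 are hypothetical, introduced only to show the effect of moving the threshold.

```python
def classify(probabilities, threshold=0.5):
    """Return 1 (buyer) when the probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# Person A has a 60% chance of buying, person B a 45% chance
probabilities = [0.60, 0.45]

default_labels = classify(probabilities)                # threshold of 50%
lower_labels = classify(probabilities, threshold=0.40)  # lower threshold

print(default_labels)  # A is a buyer, B is a non-buyer
print(lower_labels)    # both are classified as buyers
```

Lowering the threshold turns person B into a "buyer": the model itself has not changed, only the decision rule applied to its probabilities.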

## How to monitor errors made by the model

Our model can be wrong sometimes. False positives and false negatives represent errors made by our classification model.

### False positive

Let's continue with the example from above. If our model categorizes person A as a "buyer" and that person, in reality, does not buy the product then we are dealing with a false positive. The model expected a positive result, but the true value of the target was negative.

### False Negative

Let's consider the opposite situation: if person B, whom the model predicted as a non-buyer, ends up buying the product, this is a false negative. We predicted a negative but the true value of the target was positive.

### Watch out for false positives and ESPECIALLY false negatives.

Be vigilant about false positives and negatives because a prediction error can have more or less serious consequences depending on what you are trying to predict. For example, if you are a scientist trying to predict an earthquake and you come across a false negative (i.e. you predicted that the earthquake would not happen and eventually it did), no one was actually prepared for the event and several people are in danger. If the opposite situation took place, that is you predicted that an earthquake would occur and in the end it did not, the only thing you have lost is some time and stress in order to evacuate everyone to safety, but you did not cause any casualties.

Always remember, whenever using a model for prediction, to ask yourself what is at stake and align the quality of your model with the objectives you are trying to achieve. Ethics in the field of data science is becoming a very important matter. Data and predictive models are all around us, influencing or assisting many of our actions and decisions. It is therefore very important to build these models in a fair and mindful way.

## Metrics for performance assessment

### Confusion matrix

![crack](https://essentials-assets.s3.eu-west-3.amazonaws.com/M04-Machine-learning/confusion_matrix.png)

Confusion matrices allow you to quickly and easily measure the performance of your model. The idea is to see which predictions are correct, as well as the number of false positives and false negatives.
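As an illustration, here is a minimal sketch that builds a confusion matrix with sklearn's `confusion_matrix` function. The labels are made up for the example. Keep in mind that sklearn puts the actual classes on the rows and the predicted classes on the columns, which may differ from the layout of the picture above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```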

A simple and relevant way to assess your model's performance is to compare its accuracy with the proportion of the most represented class in the data. Indeed, the simplest model in the case of a classification problem is one that classifies all individuals in the most represented class. For this trivial model, the accuracy is equal to the proportion that the majority group represents in the data. Let us assume that we have a dataset that gives bachelor's degree results for a sample of the population. If 70% of the individuals in the sample have passed and our model predicts that everyone passes, it will be right 70% of the time. In fact, it is only worth building a more complex model if its accuracy can be higher than 70%.

This reasoning is what we call comparing your model to a baseline, meaning comparing your model to a model that is very simple. Remember that in the words of Leonardo da Vinci "simplicity is the ultimate sophistication". This principle also holds in statistics, where it is known as Occam's razor: a simple model is always preferable to a more complex model if performances are comparable.
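The baseline from the bachelor's degree example can be reproduced with sklearn's `DummyClassifier`, which ignores the features and, with `strategy="most_frequent"`, always predicts the majority class. The data below is synthetic and only mirrors the 70% / 30% split of the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic target: 70 individuals passed (1), 30 failed (0)
y = np.array([1] * 70 + [0] * 30)
# Placeholder features: the dummy model does not use them
X = np.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any real model you train on this kind of data should be compared against this number before you decide it is worth keeping.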

### Accuracy

The main evaluation metric, the one sklearn computes for any classification model when you call ```model.score(X,y)```, is the accuracy. The accuracy is calculated as the number of correct predictions made by your model divided by the total number of predictions. In other words, the accuracy is the proportion of correct predictions produced by your model.

$$Accuracy = \frac{TN+TP}{TN+TP+FN+FP}$$

### True Positive Rate - (Aka Recall or Sensitivity)

Other evaluation metrics can be calculated for classification problems. The recall is a metric that can be calculated relatively to a single class of the target variable at a time. Let's call this class A for our example.

The recall is the number of A detected as A by the model divided by the total number of A in the whole dataset. In other words, the recall is the proportion of A that were detected by the model.

$$TPR = \frac{TP}{TP+FN}$$


In plain English, this is the ability of your model to detect positive events among all the real positive events in your dataset.

### False Positive Rate - (Aka Fallout)


False positive rate is the "False Alarm" rate. It simply determines the ratio of negative events that the model classified as positive. 

$$FPR = \frac{FP}{FP+TN}$$



### True Negative Rate - (Aka Specificity)

True Negative Rate is the ability of your model to correctly classify real negative events as negatives.

$$TNR = \frac{TN}{TN+FP}$$
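The three rates above can be computed directly from the four cells of a confusion matrix. The counts in this sketch are hypothetical, and the `rates` helper is only an illustration of the formulas.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, TNR) computed from confusion matrix counts."""
    tpr = tp / (tp + fn)  # recall / sensitivity
    fpr = fp / (fp + tn)  # fall-out
    tnr = tn / (tn + fp)  # specificity
    return tpr, fpr, tnr


# Hypothetical counts: 50 real positives and 50 real negatives
tpr, fpr, tnr = rates(tp=40, fn=10, fp=20, tn=30)

print(f"TPR={tpr:.2f} FPR={fpr:.2f} TNR={tnr:.2f}")
```

Since the false positive rate and the true negative rate share the same denominator (all the real negatives), they always add up to 1.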

### Precision

The precision is equal to the number of A that were predicted as A by the model divided by the total number of observations that were predicted as A by the model. It is the proportion of predicted A that are actually A.

$$Precision = \frac{TP}{TP+FP}$$

### F1-score

The F1 score is a very useful metric because, contrary to the accuracy, it takes into account the potential imbalance between the different classes of the target variable in the dataset. The F1 score is equal to the harmonic mean of the recall and the precision for a given class of the target variable.

$$
F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}
$$
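Here is a short sketch of why the F1 score is more informative than the accuracy on imbalanced data. It uses the harmonic mean of precision and recall, and the counts are hypothetical: 1000 observations of which only 10 are positive. The helper does not handle the case where a denominator is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical imbalanced dataset: 10 positives out of 1000 observations
tp, fn, fp, tn = 5, 5, 5, 985

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The accuracy looks excellent because negatives dominate the dataset, while the F1 score reveals that the model only finds half of the positives and is wrong half of the time when it predicts one.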



### ROC curve and AUC

The ROC curve (receiver operating characteristic curve) is used to visualize the performance of a binary classification model overall given all different possible classification thresholds.

This curve is obtained by plotting the true positive rate (_sensitivity_) as a function of the false positive rate (_fall-out_, or 1 - _specificity_) for all possible values of the threshold.

![courbe_roc](https://drive.google.com/uc?export=view&id=1uno8_q_YU183T7xwDRoRggr_Zy0i13uW)

A ROC curve generally looks similar to the illustration above. A ROC curve that always stays close to or below the diagonal is the sign of a terrible model: the higher above the diagonal the ROC curve is, the better. The ROC curve describes the model's performance in terms of detecting positive observations across all possible thresholds, and it also allows a general assessment of the model. The means by which the curve evaluates the general performance of the model is a numerical indicator called AUC (_Area Under the Curve_). The AUC is literally the area bounded by the ROC curve and the sides of the unit square.

The AUC is interpreted as the probability that the model will give a higher score to a randomly selected positive observation than to a randomly selected negative observation.
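The sketch below computes a ROC curve and its AUC with sklearn's `roc_curve` and `roc_auc_score`, then checks the probabilistic interpretation by comparing every (positive, negative) pair of observations by hand. The labels and scores are a small made-up example.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and model scores (e.g. predicted probabilities)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve, one per threshold tried by sklearn
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Manual check: share of (positive, negative) pairs ranked correctly,
# counting ties as half
positives = [s for s, y in zip(scores, y_true) if y == 1]
negatives = [s for s, y in zip(scores, y_true) if y == 0]
pairs = [(p, n) for p in positives for n in negatives]
pairwise_auc = sum(
    1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs
) / len(pairs)

print(f"AUC (sklearn) = {auc:.2f}, AUC (pairwise) = {pairwise_auc:.2f}")
```

Out of the four possible pairs, three are ranked correctly, so both computations give the same value.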
 
### GINI coefficient 

The AUC is also related to the GINI index, which describes the statistical dispersion of the population and is widely used in economics to quantify inequalities.

$$
GINI = 2AUC -1
$$

The AUC varies between 0 and 1 in theory, but models with an AUC of less than 0.5 (50%) should be excluded immediately, because this means that the model performs worse than purely random predictions.
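The conversion between the two indicators is a one-liner. The `gini_from_auc` helper below is hypothetical and simply applies the formula above to a few AUC values.

```python
def gini_from_auc(auc):
    """GINI coefficient derived from the AUC: GINI = 2 * AUC - 1."""
    return 2 * auc - 1


random_model = gini_from_auc(0.5)    # random predictions
decent_model = gini_from_auc(0.75)
perfect_model = gini_from_auc(1.0)   # perfect ranking

print(random_model, decent_model, perfect_model)
```

A model that predicts at random has a GINI of 0, and a model that ranks every positive above every negative has a GINI of 1.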

## Resources 📚📚

- Implementing Logistic Regression - [https://bit.ly/2FFUjAn](https://bit.ly/2FFUjAn)
- Logistic Regression - [http://bit.ly/2bdDELb](http://bit.ly/2bdDELb)
- Summary of Probability - [http://bit.ly/2m8YgDR](http://bit.ly/2m8YgDR)
- Confusion Matrix - [http://bit.ly/2xApsRz](http://bit.ly/2xApsRz)
- False Positives & False Negatives - [http://bit.ly/2FmhMql](http://bit.ly/2FmhMql)
- Understanding AUC - ROC Curve - [https://bit.ly/20czs3S](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5)
- Sensitivity and Specificity - [https://bit.ly/02CAscas](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)