# PC7 -  Logistic Regression case study: Default

Course [MAP 535](https://moodle.polytechnique.fr/course/view.php?id=14763): Regression, by Karim Lounici

Author of this Python notebook: Marc Lelarge

In [None]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from patsy import dmatrices
from scipy import stats


import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(palette='colorblind',style='darkgrid')

#import sys
#sys.path.append('../')
from model_selection_python import backwardSelection, forwardSelection, bothSelection, anova_glm, anova_onemodel

## 1  Logistic Regression

We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance (levels of debt).

### 1.1 Import this data set and look at/play with the data.

In [None]:
df = sm.datasets.get_rdataset("Default",'ISLR').data

df = pd.get_dummies(df,drop_first=True)

In [None]:
df.head()

### 1.2 Before implementing a logistic regression, look at the following estimator. Comment on its performance.

In [None]:
dummy_classifier = lambda x: 'No' if (np.mean(df['default_Yes']==1)<0.5) else 'Yes'

This classifier outputs the most frequent value (“No” in this case) for every new observation. In other words, it always predicts the absence of payment default. It is a naive estimator, since it uses no variable of the data set other than “default” to predict. It is nevertheless useful as a baseline for the error: we would like our logistic regression to achieve a lower error than this classifier.
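The baseline error can be checked numerically. The sketch below is not the notebook's solution: it uses a synthetic 0/1 vector as a stand-in for `df['default_Yes']` (the 3% default rate is an assumption chosen for illustration) and shows that the error of the majority-class rule equals the frequency of the minority class.

```python
import numpy as np

# Synthetic stand-in for df['default_Yes'] (assumption: roughly 3% of defaults).
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.03).astype(int)

# The majority-class rule predicts the same label for every observation.
majority_label = int(y.mean() >= 0.5)
baseline_pred = np.full_like(y, majority_label)

# Its error rate is the proportion of observations in the minority class.
baseline_error = np.mean(baseline_pred != y)
print(f"majority label: {majority_label}, baseline error: {baseline_error:.4f}")
```

A low baseline error on an imbalanced data set does not mean the rule is useful: it never detects a single default.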

### 1.3 First visualize the relation between each variable and the default payment.

### 1.4 We are now implementing a logistic regression using the function “GLM”. 

Indeed, logistic regression is a particular case of a generalized linear model (GLM). As a first step, we try to explain the probability of default with the balance variable only. Implement a logistic regression with this variable.

Hint: use [sm.families.Binomial](https://www.statsmodels.org/dev/generated/statsmodels.genmod.families.family.Binomial.html)

Use the `anova_glm` function to compare nested models.

Recover by hand the values it reports (the test statistic, the residual deviances of the two models, 2920.7 and 1596.5, the p-value, and the 1 degree of freedom), then decide which model to keep. Hint: the log-likelihood of a model is given by the [loglike](https://www.statsmodels.org/dev/dev/generated/statsmodels.base.model.LikelihoodModel.loglike.html#statsmodels.base.model.LikelihoodModel.loglike) function, called on the GLM model object with the fitted parameters as argument.

### 1.5 Write mathematically the logistic model you just implemented. 

Once the coefficients have been estimated, how do you predict the default for a new data point?
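One possible way to write the model with balance as the single explanatory variable, where $\beta_0$ (intercept) and $\beta_1$ (slope) are notation assumed here, $Y \in \{0, 1\}$ is the default indicator and $x$ is the balance:

$$
\mathbb{P}(Y = 1 \mid X = x) = p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad \text{equivalently} \qquad
\log \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x .
$$

Replacing the coefficients by their estimates $\hat{\beta}_0, \hat{\beta}_1$ gives the estimated probability $\hat{p}(x)$, and a natural prediction rule is $\hat{y} = 1$ if $\hat{p}(x) > 0.5$ and $\hat{y} = 0$ otherwise.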

### 1.6  Using logistic regression, predict the default for a person whose balance is equal to 1600.

The estimated probability of default is computed from the logistic model of the previous question, with the true coefficients replaced by their estimated values. To obtain a prediction, we compare this estimated probability with 0.5: if it is greater than 0.5, the algorithm predicts 1 (default), otherwise 0.

To obtain the values predicted by the model, we first gather the new data in an array with the same structure as the initial design matrix (same order of variables, including the constant).

### 1.7  Compute the confusion matrix associated to the logistic regression predictions.

That is, the matrix comparing the observed values of “default” with the predicted ones. You can use the function [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) from sklearn.metrics.

What is the classifier error? What are the true positive and false positive rates?
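The quantities asked for can be read off the confusion matrix. The sketch below uses a small hand-made example (ten hypothetical observations, not the Default data) so that every number can be checked by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical observed labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_hat = np.array([0.10, 0.20, 0.30, 0.60, 0.05, 0.40, 0.70, 0.35, 0.90, 0.55])
y_pred = (p_hat > 0.5).astype(int)

# sklearn convention: rows are observed classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

error = (fp + fn) / cm.sum()  # misclassification rate
tpr = tp / (tp + fn)          # true positive rate
fpr = fp / (fp + tn)          # false positive rate
print(cm, error, tpr, fpr)
```

Here the matrix is `[[5, 1], [1, 3]]`, so the error is 0.2, the true positive rate is 0.75 and the false positive rate is 1/6.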

In [None]:
from sklearn.metrics import confusion_matrix

The error rate obtained in this way is generally optimistic, since the same sample is used both to fit the model and to estimate the misclassification rate. Cross-validation gives a more reliable estimate. To do so, we use KFold from sklearn.model_selection, together with a cost function that takes the observed values of Y and the predicted probabilities as input.

In [None]:
from sklearn.model_selection import KFold

### 1.8 We can change the threshold $0.5$ in the logistic regression procedure.

The estimated coefficients are unchanged, and so are the estimated probabilities, but we now predict 1 if $\hat{p}(x) > 0.2$. Compute the confusion matrix and the error. What are the true positive and false positive rates?
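The effect of the threshold can be illustrated on a small hand-made example (ten hypothetical observations, not the Default data). The helper `rates` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical observed labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_hat = np.array([0.10, 0.20, 0.30, 0.60, 0.05, 0.40, 0.70, 0.35, 0.90, 0.55])

def rates(y_true, p_hat, threshold):
    """Return (error, TPR, FPR) for the rule 'predict 1 if p_hat > threshold'."""
    y_pred = (p_hat > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (fp + fn) / len(y_true), tp / (tp + fn), fp / (fp + tn)

for t in (0.5, 0.2):
    err, tpr, fpr = rates(y_true, p_hat, t)
    print(f"threshold {t}: error={err:.3f}, TPR={tpr:.3f}, FPR={fpr:.3f}")
```

On this example, lowering the threshold from 0.5 to 0.2 raises the true positive rate from 0.75 to 1, at the price of a false positive rate going from 1/6 to 0.5 and an error going from 0.2 to 0.3.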

### 1.9  ROC

The ROC (Receiver Operating Characteristic) curve is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Using the function roc_curve from sklearn.metrics, plot the ROC curve. Mark on this curve the two points corresponding to the previous thresholds 0.5 and 0.2.
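A sketch of such a plot is given below. The labels and scores are synthetic (drawn from Beta distributions, an assumption made only to obtain a plausible curve), and `operating_point` is a helper name introduced here. The two marked points are computed directly from the thresholds, since `roc_curve` does not necessarily return 0.5 and 0.2 among its thresholds.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and scores: defaults (1) tend to receive higher scores.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.2, size=2000)
p_hat = rng.beta(2 + 3 * y_true, 5 - 3 * y_true)

fpr, tpr, thresholds = roc_curve(y_true, p_hat)

def operating_point(threshold):
    """Return (FPR, TPR) of the rule 'predict 1 if p_hat > threshold'."""
    y_pred = p_hat > threshold
    return np.mean(y_pred[y_true == 0]), np.mean(y_pred[y_true == 1])

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label="ROC curve")
ax.plot([0, 1], [0, 1], linestyle="--", label="random guess")
for t in (0.5, 0.2):
    ax.scatter(*operating_point(t), label=f"threshold {t}")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```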

In [None]:
from sklearn.metrics import roc_curve

### 1.10  AUC

The AUC (area under the curve), a standard performance measure for a binary classifier, is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (the ideal value) than to 0.5. Use the function roc_auc_score from sklearn.metrics to compute the AUC.
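The AUC can be computed in several equivalent ways. The sketch below uses a small hand-made example (ten hypothetical observations without tied scores) and compares `roc_auc_score` with the area under the curve returned by `roc_curve` and with the proportion of (default, non-default) pairs that are ranked correctly.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical observed labels and predicted probabilities (no ties).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_hat = np.array([0.10, 0.20, 0.30, 0.60, 0.05, 0.40, 0.70, 0.35, 0.90, 0.55])

# Direct computation.
auc_direct = roc_auc_score(y_true, p_hat)

# Area under the ROC curve by the trapezoidal rule.
fpr, tpr, _ = roc_curve(y_true, p_hat)
auc_trapezoid = auc(fpr, tpr)

# Proportion of (positive, negative) pairs where the positive has the higher score.
pos, neg = p_hat[y_true == 1], p_hat[y_true == 0]
auc_pairs = np.mean(pos[:, None] > neg[None, :])

print(auc_direct, auc_trapezoid, auc_pairs)
```

All three values equal 21/24 = 0.875 here: out of the 24 pairs, 21 are ranked correctly.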

In [None]:
from sklearn.metrics import roc_auc_score

## 2  Logistic regression with multiple explanatory variables

First of all, we randomly split the data set into:
- a training sample of size 8000, used to estimate the logistic model(s);
- a test sample of size 2000, used to measure the performance of the models.

In [None]:
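np.random.seed(5678)
# Draw 8000 distinct row positions for the training sample
perm = np.random.choice(len(df), size=8000, replace=False)
# Select rows by position so that the split does not depend on the index labels
app = df.iloc[perm]
# The remaining 2000 rows form the test sample
test = df.drop(df.index[perm])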

### 2.1  Implement a logistic regression using all variables in the data set on the training set. 

Comment on the Python output. How do you interpret each coefficient?

### 2.2 Compute the confusion matrix of the model and its error. 

Plot the ROC curve and compute the AUC. Comment on these results.

Predictions on test data set.