In [None]:
%matplotlib inline
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import statsmodels.api as sm

**Note**: in this notebook I am just practising the concepts of logistic regression. I am not considering some aspects related to machine learning, such as the imputation of missing values or the normalisation of the predictor variables.

## Loading and processing the data

The dataset used in this notebook is an example dataset on graduate school admissions, obtained from https://stats.idre.ucla.edu/r/dae/logit-regression/. The `admit` column is the two-level categorical response variable. The variables `gre` and `gpa`, which contain the candidate's scores, are numerical, whereas the variable `rank`, which indicates the prestige of the school, is categorical.

In [None]:
df = pd.read_csv('data/binary.csv')
df.head()

Let's examine the distribution of the predictor variables:

In [None]:
fig, ax = plt.subplots(1,3)

ax[0].hist(df['gre'])
ax[0].set_title('gre')

ax[1].hist(df['gpa'])
ax[1].set_title('gpa')

ax[2].hist(df['rank'])
ax[2].set_title('rank')

fig.set_figwidth(12)
fig.set_figheight(2)

Is there collinearity between any pair of variables?

In [None]:
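def compute_R(df, col1, col2):
    """Pearson correlation coefficient between two columns of df."""
    var1 = df[col1]
    var2 = df[col2]
    # Series.std() is the sample standard deviation (ddof=1), matching the 1/(n-1) factor
    z1 = (var1 - var1.mean()) / var1.std()
    z2 = (var2 - var2.mean()) / var2.std()
    return np.sum(z1 * z2) / (len(df) - 1)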

fig, ax = plt.subplots(1,3)

ax[0].plot(df['gre'], df['gpa'], 'o')
ax[0].set_title('gre vs gpa\nR = ' + str(compute_R(df, 'gre', 'gpa')))

ax[1].plot(df['gre'], df['rank'], 'o')
ax[1].set_title('gre vs rank\nR = ' + str(compute_R(df, 'gre', 'rank')))

ax[2].plot(df['gpa'], df['rank'], 'o')
ax[2].set_title('gpa vs rank\nR = ' + str(compute_R(df, 'gpa', 'rank')))

fig.set_figwidth(12)
fig.set_figheight(2)
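
As a sanity check on the hand-written `compute_R`, the same formula can be compared against NumPy's built-in correlation. This is only a sketch: `pearson_r` is a hypothetical helper, and the data are synthetic stand-ins rather than the admissions dataset. Note that NumPy's `std` defaults to `ddof=0`, so `ddof=1` has to be passed explicitly to reproduce what pandas does.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from z-scores, using sample standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (len(x) - 1)

# Synthetic data (an assumption for illustration, not the notebook's data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.5 * a + rng.normal(size=200)

r_manual = pearson_r(a, b)
r_numpy = np.corrcoef(a, b)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r_manual, r_numpy)
```

On the notebook's data, `df[['gre', 'gpa', 'rank']].corr()` should give all three pairwise values at once.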

Building indicator variables to replace the categorical variable `rank`. The highest level is left out and acts as the reference category:

In [None]:
values = np.unique(df['rank'])[0:-1]
for v in values:
    df['rank_' + str(v)] = (df['rank'] == v).astype(int)
del df['rank']
df.head()
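
The same encoding could probably be obtained with `pd.get_dummies`. The sketch below uses a made-up toy frame (`toy` and its values are assumptions, not the admissions data) and drops the last level so that, as in the loop above, it serves as the reference category.

```python
import pandas as pd

# Toy frame standing in for the admissions data (values are made up)
toy = pd.DataFrame({'admit': [0, 1, 1, 0], 'rank': [1, 2, 3, 4]})

# One indicator column per level of rank: rank_1, rank_2, rank_3, rank_4
dummies = pd.get_dummies(toy['rank'], prefix='rank').astype(int)

# Drop the last level so that rank 4 becomes the reference category
dummies = dummies.drop(columns='rank_4')

toy_encoded = pd.concat([toy.drop(columns='rank'), dummies], axis=1)
print(toy_encoded)
```

Keeping all four indicators together with an intercept would make the design matrix rank-deficient, which is why one level has to be left out.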

## Logistic regression

Logistic regression is a type of generalised linear model in which the response variable is a two-level categorical variable that, for each observation, takes the value Yi = 1 with probability pi and the value Yi = 0 with probability 1 - pi.

A generalised linear model is a generalisation of linear regression in which the response variable is allowed to follow a non-normal distribution. This is achieved by linking the expected value of the response variable to a multiple regression model by means of a link function, which for logistic regression is usually the logit function:

In [None]:
fig, ax = plt.subplots()
p = np.arange(0.01, 0.99, 0.01)
logit = np.log(p/(1-p))
ax.plot(p, logit)
ax.set_xlabel('pi')
ax.set_ylabel('logit')
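
The logit maps a probability in (0, 1) to the whole real line, and its inverse (the logistic or sigmoid function) maps back. A minimal sketch of the pair, where the function names `logit` and `inv_logit` are my own choices for illustration:

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Inverse of the logit: maps a real number back to a probability."""
    return 1 / (1 + np.exp(-z))

p = np.arange(0.01, 0.99, 0.01)
round_trip = inv_logit(logit(p))  # should recover p up to floating point error
print(np.allclose(round_trip, p))
```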

The logistic regression model has the following form:
```
logit(pi) = b0 + b1*x1i + b2*x2i + ... + bk*xki
```
In order to fit a logistic regression model, a routine based on Newton's method for numerical optimisation is commonly used:

In [None]:
Y = df['admit']
# sm.Logit does not add an intercept, so prepend a column of ones
X = df[['gre', 'gpa', 'rank_1', 'rank_2', 'rank_3']]
X = np.append(np.ones((X.shape[0], 1)), X, axis=1)

logit_model = sm.Logit(Y, X)
result = logit_model.fit()
# The following line is a workaround to make summary work
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
print(result.summary())
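
To make the Newton iteration mentioned above more concrete, here is a minimal sketch of it for the logistic log-likelihood. It is not the statsmodels implementation: `fit_logistic_newton` is a hypothetical helper, and the data are simulated with made-up coefficients purely for illustration.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Maximise the logistic log-likelihood with Newton's method.

    X must already contain a column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))        # current predicted probabilities
        gradient = X.T @ (y - p)               # score vector
        weights = p * (1 - p)                  # Bernoulli variances
        neg_hessian = X.T @ (X * weights[:, None])
        step = np.linalg.solve(neg_hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data (assumed coefficients, not the admissions dataset)
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X_sim = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 1.0])
y_sim = (rng.random(n) < 1 / (1 + np.exp(-X_sim @ true_beta))).astype(float)

beta_hat = fit_logistic_newton(X_sim, y_sim)
print(beta_hat)
```

At the maximum the score vector `X.T @ (y - p)` is zero, so checking that it is close to zero is a simple way to confirm convergence.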

When we plot the predicted probabilities of the observations against the linear predictor of the model (the logit values stored in `result.fittedvalues`), we can see the shape of the logit function:

In [None]:
fig, ax = plt.subplots()
# predict returns the probabilities; fittedvalues holds the logit values
ax.plot(result.predict(X), result.fittedvalues, 'o')
ax.set_xlabel('predicted probability')
ax.set_ylabel('logit')

The following plot shows the residuals, computed as the difference between the value of the response variable and the probability returned by the logistic regression model. We cannot see any pattern along the order of the observations, which is consistent with the observations being independent. The residuals are split into two groups because the response variable is a two-level categorical variable.

In [None]:
fig, ax = plt.subplots()
ax.plot(df['admit'] - result.predict(X), 'o')
ax.set_xlabel('observation')
ax.set_ylabel('residuals')

Another condition for applying logistic regression, aside from the observations being independent of each other, is that there is a linear relationship between logit(pi) and each predictor variable when the rest of the predictor variables are held constant. We can check this condition by plotting the residuals against the values of each predictor variable:

In [None]:
# Define the predictors first: the number of subplots depends on them
predictors = ['gre', 'gpa', 'rank_1', 'rank_2', 'rank_3']
residuals = df['admit'] - result.predict(X)

fig, ax = plt.subplots(1, len(predictors))

for i in range(len(predictors)):
    ax[i].plot(df[predictors[i]], residuals, 'o')
    ax[i].set_xlabel(predictors[i])
    ax[i].set_ylabel('residuals')
    
fig.set_figwidth(16)
fig.set_figheight(3)
fig.tight_layout()

Linearity seems to be fine in most cases, except maybe for the variables `rank_2` and `rank_3`, whose residuals show different variability between groups.

In [None]:
result.params.values

In [None]:
result.pvalues

In [None]:
result.fittedvalues # Logit value

Let's compute the probability values from the logit predictions of the model for the training data: