In [None]:
%matplotlib inline
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import statsmodels.api as sm

**Note**: in this notebook I am just practising the concepts of logistic regression. I am not considering some aspects related to machine learning, such as the imputation of missing values or the normalisation of the predictor variables.

## Loading and processing the data

The dataset used in this notebook is an example dataset on graduate school admissions, obtained from https://stats.idre.ucla.edu/r/dae/logit-regression/. The `admit` column is the two-level categorical response variable. The `gre` and `gpa` variables, which hold the candidate's scores, are numerical, whereas the `rank` variable, which indicates the prestige of the school, is categorical.

In [None]:
df = pd.read_csv('data/binary.csv')
df.head()

Building indicator variables to replace the categorical `rank` variable (the last level is left out and acts as the reference category):

In [None]:
values = np.unique(df['rank'])[0:-1]
for v in values:
    df['rank_' + str(v)] = (df['rank'] == v).astype(int)
del df['rank']
df.head()
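
The same encoding can probably be obtained more compactly with `pd.get_dummies`. The sketch below is only an illustration on a toy frame (the `toy` data is hypothetical, not the admissions dataset); it drops the last level so that the indicator columns are not perfectly collinear with the intercept.

```python
import pandas as pd

# Toy stand-in for the `rank` column (hypothetical values, not the real data)
toy = pd.DataFrame({'rank': [1, 2, 3, 4, 2, 3]})

# One 0/1 indicator column per level of `rank`
dummies = pd.get_dummies(toy['rank'], prefix='rank', dtype=int)
# Drop the last level so that it becomes the reference category
dummies = dummies.drop(columns='rank_4')

# Replace the original categorical column with the indicator columns
toy = pd.concat([toy.drop(columns='rank'), dummies], axis=1)
print(toy)
```

A row with all three indicators equal to 0 corresponds to the reference level (rank 4).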

## Logistic regression

Logistic regression is a type of generalised linear model in which the response variable is a two-level categorical variable that, for each observation i, takes the value Yi = 1 with probability pi and the value Yi = 0 with probability 1 - pi.

A generalised linear model is a generalisation of linear regression in which the response variable does not need to be normally distributed. This is achieved by relating the mean of the response variable to a multiple regression model (the linear predictor) through a link function, which for logistic regression is usually the logit function:

In [None]:
fig, ax = plt.subplots()
p = np.arange(0.01, 0.99, 0.01)
logit = np.log(p/(1-p))
ax.plot(p, logit)
ax.set_xlabel('pi')
ax.set_ylabel('logit')
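
The logit maps probabilities in (0, 1) onto the whole real line, which is why a linear predictor can be equated to it. Its inverse, the logistic (sigmoid) function, maps the linear predictor back to a probability. A minimal sketch of that round trip follows; the helper names `logit_fn` and `inv_logit` are illustrative and not part of the original notebook.

```python
import numpy as np

def logit_fn(prob):
    """Log-odds of a probability in (0, 1)."""
    return np.log(prob / (1 - prob))

def inv_logit(eta):
    """Logistic function: inverse of the logit, maps any real number to (0, 1)."""
    return 1 / (1 + np.exp(-eta))

probs = np.arange(0.01, 0.99, 0.01)
log_odds = logit_fn(probs)
# Applying the inverse recovers the original probabilities
probs_back = inv_logit(log_odds)
print(np.allclose(probs, probs_back))
```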

To fit a logistic regression model, a numerical optimisation routine based on Newton's method is commonly used:

In [None]:
# Adapted from https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression
from sklearn.linear_model import LogisticRegression


class LogisticRegressionStats(LogisticRegression):
    """
    LogisticRegression class after sklearn's, but calculates Wald z-statistics
    and p-values for the model coefficients (betas).
    Additional attributes available after .fit() are `z` and `p`,
    both of shape (n_coefs,).
    The standard errors assume a binary response, that the intercept is
    included as a column of ones in X (so pass fit_intercept=False), and that
    the fit is effectively unpenalised (sklearn applies L2 regularisation
    by default, so use a large C).
    """

    # No __init__ override: scikit-learn estimators must declare their
    # parameters explicitly, so the parent's constructor is inherited as is.

    def fit(self, X, y):
        super().fit(X, y)

        X = np.asarray(X, dtype=float)
        # Fitted probabilities of the positive class
        prob = self.predict_proba(X)[:, 1]
        # Covariance of the estimates: inverse of the Fisher information X'WX,
        # where W is diagonal with entries prob * (1 - prob)
        cov = np.linalg.inv(X.T @ (X * (prob * (1 - prob))[:, np.newaxis]))
        se = np.sqrt(np.diagonal(cov))

        self.z = self.coef_.ravel() / se
        self.p = 2 * stats.norm.sf(np.abs(self.z))
        return self
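
The Newton iterations mentioned above can be sketched directly in NumPy. This is only an illustration under stated assumptions, not the routine that sklearn or statsmodels actually run: the function name `fit_logistic_newton` and the synthetic data are hypothetical, the design matrix is assumed to already contain a column of ones, and no safeguards (step halving, separation checks) are included.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Maximum-likelihood logistic regression via Newton-Raphson.

    X must already contain a column of ones for the intercept.
    Returns the coefficient estimates and their standard errors.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-X @ beta))          # fitted probabilities
        score = X.T @ (y - prob)                    # gradient of the log-likelihood
        weights = prob * (1 - prob)                 # Bernoulli variances
        information = X.T @ (X * weights[:, None])  # Fisher information X'WX
        step = np.linalg.solve(information, score)  # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the inverse information at the final estimate
    prob = 1 / (1 + np.exp(-X @ beta))
    information = X.T @ (X * (prob * (1 - prob))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(information)))
    return beta, se

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 2000
X_demo = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.2])
y_demo = rng.binomial(1, 1 / (1 + np.exp(-X_demo @ true_beta)))

beta_hat, se_hat = fit_logistic_newton(X_demo, y_demo)
print(beta_hat, se_hat)
```

Each iteration solves a weighted least-squares problem, which is why this scheme is also known as iteratively reweighted least squares.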

In [None]:
Y = df['admit']
X = df[['gre', 'gpa', 'rank_1', 'rank_2', 'rank_3']]
# Intercept is not included by default, so add a constant column
# (add_constant keeps the column names, which then appear in the summary)
X = sm.add_constant(X)

logit_model = sm.Logit(Y, X)
result = logit_model.fit()
# The following line is a workaround to make summary work with older
# statsmodels versions that still call the removed scipy.stats.chisqprob
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
print(result.summary())