In [None]:
%matplotlib inline
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
from sklearn.linear_model import LogisticRegression

**Note**: in this notebook I am just practising the concepts of logistic regression. I am not considering some aspects related to machine learning, such as the imputation of missing values or the normalisation of the predictor variables. 

## Loading and processing the data

The dataset used in this notebook is an example dataset on graduate school admissions, obtained from https://stats.idre.ucla.edu/r/dae/logit-regression/. The `admit` column is the two-level categorical response variable. The `gre` and `gpa` variables, which hold the candidate's scores, are numerical, whereas `rank`, which indicates the prestige of the school, is categorical. 

In [None]:
df = pd.read_csv('data/binary.csv')
df.head()

## Logistic regression

Logistic regression is a type of generalised linear model in which the response variable is a two-level categorical variable that, for each observation $i$, takes the value $Y_i = 1$ with probability $p_i$ and the value $Y_i = 0$ with probability $1 - p_i$.

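A generalised linear model is a generalisation of linear regression in which the response variable need not be normally distributed. This is achieved by relating the expected value of the response (here the probability $p_i$) to a linear combination of the predictors through a link function. For logistic regression the link is the logit function, $\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right)$: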

In [None]:
fig, ax = plt.subplots()
p = np.arange(0.01, 0.99, 0.01)
logit = np.log(p/(1-p))
ax.plot(p, logit)
ax.set_xlabel('pi')
ax.set_ylabel('logit')
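
The logit maps a probability to the whole real line, and its inverse (the logistic function) maps a linear predictor back to a probability. The following is a minimal sketch of that round trip; the helper names `logit` and `inv_logit` are my own and are not part of the notebook or of any library used here.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit (the logistic function): maps a real number to (0, 1)."""
    eta = np.asarray(eta, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))

p = np.array([0.1, 0.5, 0.9])
eta = logit(p)            # log-odds; a probability of 0.5 maps to 0
p_back = inv_logit(eta)   # recovers the original probabilities
```

The curve is symmetric around $p = 0.5$: probabilities of 0.1 and 0.9 give log-odds of equal size and opposite sign.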

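Unlike ordinary linear regression, a logistic regression model has no closed-form solution, so it is fitted by numerical optimisation of the likelihood, commonly with Newton's method or a variant of it. Here the model is fitted with scikit-learn's `LogisticRegression`, whose default solver in recent versions (`lbfgs`) is a quasi-Newton method: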

In [None]:
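# Note: scikit-learn applies an L2 penalty by default (C=1.0),
# and `rank` is passed here as a plain numeric column.
lg = LogisticRegression()
lg.fit(df[['gre', 'gpa', 'rank']], df['admit'])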
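
To make the idea of Newton's method concrete, here is a minimal sketch of an unpenalised Newton-Raphson fit written with NumPy. It is not what scikit-learn runs internally; the function name `fit_logistic_newton` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Unpenalised logistic regression fitted by Newton-Raphson.

    X: (n, k) array of predictors without an intercept column.
    y: (n,) array of 0/1 responses.
    Returns the coefficient vector with the intercept first.
    """
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))     # current fitted probabilities
        grad = X1.T @ (y - p)                    # gradient of the log-likelihood
        w = p * (1.0 - p)                        # Bernoulli variances
        hess = X1.T @ (X1 * w[:, None])          # negative Hessian, X^T W X
        step = np.linalg.solve(hess, grad)       # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data (not the admissions dataset) with known coefficients.
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(2000, 2))
true_beta = np.array([-0.5, 1.0, -2.0])          # intercept, then two slopes
p_sim = 1.0 / (1.0 + np.exp(-(true_beta[0] + X_sim @ true_beta[1:])))
y_sim = (rng.random(2000) < p_sim).astype(float)

beta_hat = fit_logistic_newton(X_sim, y_sim)     # should land close to true_beta
```

Because this sketch maximises the plain likelihood, its estimates will generally differ a little from those of `LogisticRegression`, which penalises the coefficients by default.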

In [None]:
# Mean accuracy on the same data used for fitting (not a held-out set)
lg.score(df[['gre', 'gpa', 'rank']], df['admit'])
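
Since `rank` is described above as categorical, one option is to expand it into indicator columns before fitting instead of passing it as a number. The sketch below shows this with `pd.get_dummies` on a small made-up frame that only borrows the dataset's column names; the values and the names `demo_df`, `X_rank` and `y_rank` are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same column names as the dataset (values are made up).
demo_df = pd.DataFrame({
    'admit': [0, 1, 1, 0, 1, 0],
    'gre':   [380, 660, 800, 640, 520, 760],
    'gpa':   [3.61, 3.67, 4.00, 3.19, 2.93, 3.00],
    'rank':  [3, 3, 1, 4, 4, 2],
})

# One indicator column per level of rank; drop_first=True drops the first
# level (rank 1), which then acts as the reference category.
rank_dummies = pd.get_dummies(demo_df['rank'], prefix='rank', drop_first=True)

# Design matrix: numeric predictors plus the rank indicators.
X_rank = pd.concat([demo_df[['gre', 'gpa']], rank_dummies], axis=1)
y_rank = demo_df['admit']
```

A frame built this way from `df` could be passed to `lg.fit` in place of the three raw columns, so that each rank level gets its own coefficient relative to rank 1.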

In [None]:
# Hyperparameters of the estimator; the fitted coefficients are in lg.coef_ and lg.intercept_
lg.get_params()