# Logistic Regression

Suppose we want to predict the class of the target variable Y. Rather than modelling Y directly, we model the probability that Y belongs to a particular class.

This probability is a conditional probability, for example Pr(default = Yes|balance)

We can then apply a threshold, for example predicting default = Yes whenever this probability is greater than 0.5

A first attempt is to model this probability as linear in the betas, \begin{equation*} P(x) = \beta_0 + \beta_1 x \end{equation*}

Used directly, this model can predict a probability of less than 0 for values of x close to 0 (and greater than 1 for very large x), which is not valid. So, as a trick, we use the logistic function, which only gives values between 0 and 1

\begin{equation*} P(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}  \end{equation*}
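To make the difference concrete, here is a minimal sketch comparing the two formulas on a few values of x. The coefficients `beta_0` and `beta_1` are hypothetical values picked only for illustration; they are not fitted to any data in this notebook.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration (not fitted).
beta_0, beta_1 = -0.5, 0.001

# A few example values of x (think of them as balances).
x = np.array([0.0, 500.0, 1000.0, 2000.0])

# Straight-line model: nothing keeps it inside [0, 1].
linear = beta_0 + beta_1 * x

# Logistic model: the same linear term passed through the logistic function.
z = beta_0 + beta_1 * x
logistic = np.exp(z) / (1 + np.exp(z))

print("linear  :", linear)
print("logistic:", logistic)
```

With these made-up coefficients the straight line goes below 0 at x = 0 and above 1 at x = 2000, while the logistic values stay strictly between 0 and 1.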


In [2]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import seaborn as sns
import matplotlib.pyplot as plt 
df = pd.read_csv("C:\\Users\\Manjit\\Desktop\\Masters\\1.Self-Courses\\1.ML\\Classification\\iris.data",sep= ',',names=['sepallength','sepalwidth','petallength','petalwidth','class'])


###### We use the maximum likelihood method to estimate the betas and fit the model; the same method is used for many non-linear models

p(x)/(1 - p(x)) is called the odds. The log of the odds (the logit) is linear in x: \begin{equation*} \log\left(\frac{P(x)}{1 - P(x)}\right) = \beta_0 + \beta_1 x \end{equation*}

We choose the parameters that maximize the likelihood of the observed labels, i.e. the parameters under which the predicted probabilities agree best with the actual observations
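The two ideas above (the log-odds are linear in x, and better parameters give a higher likelihood) can be checked numerically. This is only a sketch: the helper names `logistic` and `log_likelihood`, the toy data, and the coefficient values are all assumptions made for illustration.

```python
import numpy as np

def logistic(x, beta_0, beta_1):
    """P(x) for a single feature x."""
    z = beta_0 + beta_1 * x
    return 1 / (1 + np.exp(-z))

def log_likelihood(x, y, beta_0, beta_1):
    """Log of the probability the model assigns to the observed 0/1 labels."""
    p = logistic(x, beta_0, beta_1)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (made up): larger x tends to go with y = 1.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1, 1])

# The log of the odds recovers the linear term beta_0 + beta_1 * x.
p = logistic(x, 0.5, 1.5)
log_odds = np.log(p / (1 - p))

# Parameters that agree with the data get a higher log-likelihood
# than parameters with the slope pointing the wrong way.
ll_good = log_likelihood(x, y, 0.5, 1.5)
ll_bad = log_likelihood(x, y, 0.5, -1.5)
print(log_odds, ll_good, ll_bad)
```

Maximum likelihood estimation searches over the betas for the pair with the highest value of this log-likelihood.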

In [3]:
df.head()

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


###### Implementation

In [None]:
#a few problems with this scheme: label encoding assumes an order between
#the classes, which does not exist in this case; and secondly the original
#problem has 3 classes, but it has been turned into a two-class problem

In [None]:
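iris = datasets.load_iris()   #datasets comes from "from sklearn import datasets"
x = iris.data[:,:2]           #keep only sepal length and sepal width
y = (iris.target != 0)*1      #0 = setosa, 1 = versicolor or virginica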

In [None]:
#original data 

plt.figure(figsize=(10, 6))
plt.scatter(x[y == 0][:, 0], x[y == 0][:, 1], color='b', label='0')
plt.scatter(x[y == 1][:, 0], x[y == 1][:, 1], color='r', label='1')
#abline(12,-13)
plt.legend();

In [None]:
#logistic function is used to generate the probabilities given the value of x

def sigmoid(z):
    return 1/(1 + np.exp(-z))

In [None]:
#this is the cross-entropy (log loss); it is a convex function of theta

def cost_function(h,y):    
    return (-y * np.log(h) - (1 - y) * np.log(1-h)).mean()

In [None]:
#this is the gradient of the cost function with respect to theta;
#the cost is the negative mean log-likelihood, so minimizing it gives the MLE

def gradient(x,y,h):
    return np.dot(x.T, (h - y)) / y.size
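One way to gain confidence in this gradient formula is to compare it with a finite-difference approximation of the cost. The sketch below repeats the three functions so that it runs on its own; the toy data, `theta_toy` and the step size `eps` are assumptions chosen only for the check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

def gradient(x, y, h):
    return np.dot(x.T, (h - y)) / y.size

# Toy data (random, for the check only).
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(50, 2))
y_toy = (rng.random(50) < 0.5).astype(float)
theta_toy = np.array([0.3, -0.2])

# Analytic gradient from the formula above.
analytic = gradient(x_toy, y_toy, sigmoid(x_toy @ theta_toy))

# Central finite differences of the cost, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(theta_toy)
for j in range(theta_toy.size):
    step = np.zeros_like(theta_toy)
    step[j] = eps
    cost_plus = cost_function(sigmoid(x_toy @ (theta_toy + step)), y_toy)
    cost_minus = cost_function(sigmoid(x_toy @ (theta_toy - step)), y_toy)
    numeric[j] = (cost_plus - cost_minus) / (2 * eps)

print(analytic, numeric)
```

If the formula and the cost are consistent, the two vectors agree to several decimal places.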

In [None]:
#this is batch gradient descent with a fixed learning rate
#(Newton-Raphson would additionally use the Hessian of the cost)
loss = []  #loss after every update, filled in by optimize
def optimize(x,y,theta,learning_rate,iteration):
    #start from a float copy of the initial theta that was passed in

    theta = np.array(theta, dtype=float)

    for i in range(iteration):
        z = np.dot(x,theta)
        h_x = sigmoid(z)
        gradient_v = gradient(x,y,h_x)
        theta -= learning_rate * gradient_v
        #find the loss after the update
        z = np.dot(x,theta)
        h_x = sigmoid(z)
        loss.append(cost_function(h_x,y))
    return theta
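The update above subtracts the learning rate times the gradient, which is plain gradient descent. For comparison, a Newton-Raphson step replaces the fixed learning rate with the inverse Hessian of the cost. The sketch below is an assumption-laden illustration, not part of the original notebook: `newton_raphson` is a hypothetical helper, and the data is synthetic with a column of ones added as an intercept.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def newton_raphson(x, y, iterations=10):
    """Hypothetical helper: minimize the mean log loss with Newton-Raphson steps."""
    theta = np.zeros(x.shape[1])
    for _ in range(iterations):
        h = sigmoid(x @ theta)
        grad = x.T @ (h - y) / y.size
        # Hessian of the mean log loss: X^T diag(h * (1 - h)) X / n
        hessian = (x * (h * (1 - h))[:, None]).T @ x / y.size
        # Newton step: solve hessian @ step = grad instead of scaling by a learning rate.
        theta -= np.linalg.solve(hessian, grad)
    return theta

# Synthetic, non-separable toy data with an intercept column of ones.
rng = np.random.default_rng(0)
feature = rng.normal(size=500)
x_toy = np.column_stack([np.ones(500), feature])
y_toy = (rng.random(500) < sigmoid(-0.5 + 1.0 * feature)).astype(float)

theta_newton = newton_raphson(x_toy, y_toy)
grad_at_solution = x_toy.T @ (sigmoid(x_toy @ theta_newton) - y_toy) / y_toy.size
print(theta_newton, grad_at_solution)
```

On well-behaved data like this, Newton-Raphson typically needs only a handful of iterations, where gradient descent may need hundreds. The price is building and solving with the Hessian at every step.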

In [None]:
theta_v = optimize(x,y,np.zeros(x.shape[1]),0.1,1000)  #start from theta = 0
theta_v

In [None]:
plt.plot(loss)

In [None]:
def predicted_prob(x,theta):
    return sigmoid(np.dot(x,theta))

In [None]:
def predict(x,theta,threshold):
    return predicted_prob(x,theta) >= threshold
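The threshold decides how a probability turns into a class label, so changing it changes the predictions. A small sketch with made-up probabilities and labels (the helper `accuracy_at` is hypothetical):

```python
import numpy as np

# Made-up predicted probabilities and true labels, for illustration only.
probabilities = np.array([0.10, 0.40, 0.55, 0.80])
labels = np.array([0, 1, 1, 1])

def accuracy_at(threshold):
    """Share of correct predictions when predicting 1 for probability >= threshold."""
    predictions = probabilities >= threshold
    return (predictions == labels).mean()

for threshold in (0.5, 0.3):
    print(threshold, accuracy_at(threshold))
```

Here the second observation has probability 0.40, so it is labelled 0 at a threshold of 0.5 and 1 at a threshold of 0.3.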

In [None]:
#plot a line given slope and intercept

def abline(slope,intercept):
    axes = plt.gca()
    x_vals = np.array(axes.get_xlim())
    y_vals = intercept + slope*x_vals
    plt.plot(x_vals, y_vals, '--')
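To draw a decision boundary with `abline`, the slope and intercept have to be derived from the fitted parameters. The sketch below assumes a model with an intercept, theta = [theta_0, theta_1, theta_2], which is an assumption: the model fitted in this notebook uses the two features without a column of ones. `boundary_line` is a hypothetical helper.

```python
import numpy as np

def boundary_line(theta, threshold=0.5):
    """Hypothetical helper: slope and intercept of the line in the (x1, x2) plane
    where the predicted probability equals the threshold.

    Solves theta_0 + theta_1 * x1 + theta_2 * x2 = log(threshold / (1 - threshold))
    for x2; theta_2 must be non-zero.
    """
    theta_0, theta_1, theta_2 = theta
    target_logit = np.log(threshold / (1 - threshold))
    slope = -theta_1 / theta_2
    intercept = (target_logit - theta_0) / theta_2
    return slope, intercept

# Example with made-up parameters.
slope, intercept = boundary_line(np.array([-1.0, 2.0, -4.0]))
print(slope, intercept)
```

The result can be passed to the plotting helper as `abline(slope, intercept)`; note the argument order, slope first.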

In [None]:
df_x = pd.DataFrame(x)
df_x.rename(columns={0:"Sepal_length",1:"Sepal_Width"},inplace=True)
#sns.scatterplot(data=df_x)
#abline(theta_v[0],theta_v[1])

In [None]:
predicted_prob(x,theta_v)

##### How to make an inference using the available data 

In [5]:
#use the student data 

student = pd.read_csv("C:\\Users\\Manjit\\Desktop\\Masters\\1.Self-Courses\\1.ML\\Classification\\Default.csv",sep= ',')

In [6]:
student.columns

Index(['Unnamed: 0', 'default', 'student', 'balance', 'income'], dtype='object')

In [7]:
student.head()

Unnamed: 0.1,Unnamed: 0,default,student,balance,income
0,1,No,No,729.526495,44361.625074
1,2,No,Yes,817.180407,12106.1347
2,3,No,No,1073.549164,31767.138947
3,4,No,No,529.250605,35704.493935
4,5,No,No,785.655883,38463.495879


In [8]:
student.describe()

Unnamed: 0.1,Unnamed: 0,balance,income
count,10000.0,10000.0,10000.0
mean,5000.5,835.374886,33516.981876
std,2886.89568,483.714985,13336.639563
min,1.0,0.0,771.967729
25%,2500.75,481.731105,21340.462903
50%,5000.5,823.636973,34552.644802
75%,7500.25,1166.308386,43807.729272
max,10000.0,2654.322576,73554.233495


In [33]:
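#change default and student to binary variables manually, without the use of encoders

student.loc[student['default'] == "Yes",'default'] = 1
student.loc[student['default'] == "No",'default'] = 0

student.loc[student['student'] == "Yes",'student'] = 1
student.loc[student['student'] == "No",'student'] = 0

#the columns still have dtype object after the assignments above; make them numeric
student[['default','student']] = student[['default','student']].astype(int)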
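The same Yes/No to 1/0 conversion can also be written with `Series.map`, which handles both values in one pass. This is a sketch on a small made-up frame, not on the Default data; the names `toy` and `yes_no` are assumptions.

```python
import pandas as pd

# Toy frame with the same Yes/No coding as the data above (values are made up).
toy = pd.DataFrame({
    "default": ["No", "Yes", "No"],
    "student": ["Yes", "No", "No"],
})

# Map both labels at once and store the result as integers.
yes_no = {"No": 0, "Yes": 1}
for column in ["default", "student"]:
    toy[column] = toy[column].map(yes_no).astype(int)

print(toy)
print(toy.dtypes)
```

Any value that is not a key of the mapping becomes NaN, so the final `astype(int)` fails loudly if the column contains an unexpected label.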


In [34]:
student.head()

Unnamed: 0.1,Unnamed: 0,default,student,balance,income
0,1,0,0,729.526495,44361.625074
1,2,0,1,817.180407,12106.1347
2,3,0,0,1073.549164,31767.138947
3,4,0,0,529.250605,35704.493935
4,5,0,0,785.655883,38463.495879


In [43]:
#Run the Logistic Regression model

from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

In [45]:
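#using the student, balance and income features

y = student.iloc[:,1]
X = student.iloc[:,2:5] 

#LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X, y)

#note: sm.OLS fits a least-squares (linear probability) model and adds no intercept;
#sm.Logit(y, sm.add_constant(X)) would fit the logistic model described above
LR1 = sm.OLS(y, X)
LR2 = LR1.fit()
print(LR2.summary())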

                                 OLS Regression Results                                
Dep. Variable:                default   R-squared (uncentered):                   0.145
Model:                            OLS   Adj. R-squared (uncentered):              0.145
Method:                 Least Squares   F-statistic:                              566.2
Date:                Thu, 30 Apr 2020   Prob (F-statistic):                        0.00
Time:                        15:25:58   Log-Likelihood:                          3606.3
No. Observations:               10000   AIC:                                     -7207.
Df Residuals:                    9997   BIC:                                     -7185.
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [47]:
#using only the student feature

y2 = student.iloc[:,1]
X2 = student.iloc[:,2:3] 

LR3 = sm.OLS(y2, X2)
LR4 = LR3.fit()
print(LR4.summary())

                                 OLS Regression Results                                
Dep. Variable:                default   R-squared (uncentered):                   0.016
Model:                            OLS   Adj. R-squared (uncentered):              0.016
Method:                 Least Squares   F-statistic:                              167.3
Date:                Thu, 30 Apr 2020   Prob (F-statistic):                    5.92e-38
Time:                        15:27:52   Log-Likelihood:                          2904.5
No. Observations:               10000   AIC:                                     -5807.
Df Residuals:                    9999   BIC:                                     -5800.
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

What we understand from the above is this: when student status is the only predictor, students show a higher probability of default, but once additional data such as income and balance is included, students show a lower probability of default than comparable non-students
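This reversal is what happens when student status is correlated with another predictor such as balance. The sketch below reproduces the effect on synthetic data. Every number in it (the share of students, the shift in balance, the coefficients of the log-odds) is made up for illustration and is not estimated from the Default data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Synthetic data: students tend to carry a higher (standardized) balance.
student = rng.random(n) < 0.3
balance = rng.normal(size=n) + 1.0 * student

# Assumed true model: balance raises the log-odds of default, and
# at a fixed balance being a student lowers them.
log_odds = -3.0 + 2.5 * balance - 0.7 * student
default = rng.random(n) < 1 / (1 + np.exp(-log_odds))

# Looking at student status alone: students default more often overall.
rate_student = default[student].mean()
rate_non_student = default[~student].mean()

# Comparing people with a similar balance: students default less often.
band = np.abs(balance - 1.0) < 0.25
rate_student_band = default[band & student].mean()
rate_non_student_band = default[band & ~student].mean()

print("overall       :", rate_student, rate_non_student)
print("similar balance:", rate_student_band, rate_non_student_band)
```

The overall comparison mixes the effect of being a student with the effect of a higher balance; restricting to a narrow balance band separates the two.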

###### sklearn does not provide summary of the fit 

so you have to write custom code for a summary. To make life easier we can use statsmodels, which also fits models but uses a different approach (see the link below for more details)

https://stats.stackexchange.com/questions/146804/difference-between-statsmodel-ols-and-scikit-linear-regression

how to create a summary for sklearn

https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression#27928411

###### Python ideas

`import sklearn` imports only the top-level package with its own functions, classes and variables, but not its sub-packages

`from sklearn import datasets` imports the sub-package

https://stackoverflow.com/questions/41467570/sklearn-doesnt-have-attribute-datasets

plotting a line with slope and intercept

https://stackoverflow.com/questions/7941226/how-to-add-line-based-on-slope-and-intercept-in-matplotlib

how to conditionally update a column

https://stackoverflow.com/questions/18196203/how-to-conditionally-update-dataframe-column-in-pandas

run logistic regression using sklearn

https://stackabuse.com/classification-in-python-with-scikit-learn-and-pandas/