In this program, we use logistic regression to examine how the features affect the probability that the outcome is 1 (i.e., the customer exits), and we also fit scikit-learn's logistic regression and use it for prediction.

In [1]:
#Importing necessary Libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

In [2]:
#Loading Dataset
data = pd.read_csv("Churn_Modelling.csv")

In [3]:
#Generating the dependent variable vector Y and the feature matrix X
Y = data.iloc[:,-1].values
X = data.iloc[:,3:13]
X.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,France,Female,42,2,0.0,1,1,1,101348.88
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58
2,502,France,Female,42,8,159660.8,3,1,0,113931.57
3,699,France,Female,39,1,0.0,2,0,0,93826.63
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1


In [4]:
#Generating the dependent variable vector Y and the feature matrix X
Y = data.iloc[:,-1].values
X = data.iloc[:,3:13].copy()  ### copy, so the assignment below does not write into a slice of data
X['Gender']=X['Gender'].map({'Female':0,'Male':1})
### the mapping above is used instead of the more involved sklearn.preprocessing.LabelEncoder
### converts Female -- 0, Male -- 1, i.e. a binary (label) encoding of a two-category variable
print (X['Gender'])

0       0
1       0
2       0
3       0
4       0
       ..
9995    1
9996    1
9997    0
9998    1
9999    0
Name: Gender, Length: 10000, dtype: int64


In [5]:
#Encoding Categorical variable Geography
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct =ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[1])],remainder="passthrough")
X = np.array(ct.fit_transform(X))
### Geography is one-hot encoded, with the new columns in alphabetical order (France, Germany, Spain):
### France -- 1,0,0; Germany -- 0,1,0; Spain -- 0,0,1.
### Moreover -- this encoded vector of ones-zeros is now put in first 3 cols. Credit Score pushed to 4th col.
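
The order of the encoded Geography columns matters when the columns are named later. The following is a minimal plain-Python sketch (not scikit-learn's implementation) that mimics one-hot encoding under the assumption that categories are sorted alphabetically, which is how OneHotEncoder orders them by default as far as I know. The helper `one_hot` and the four sample values are hypothetical.

```python
# Plain-Python illustration of one-hot encoding.
# `one_hot` is a hypothetical helper, not a scikit-learn function.
def one_hot(values):
    # Assumption: categories are the sorted unique values (alphabetical order).
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

# Made-up sample of the Geography column.
geography = ["France", "Spain", "France", "Germany"]
categories, encoded = one_hot(geography)

print(categories)
for name, row in zip(geography, encoded):
    print(name, row)
```

Under this assumption the first encoded column corresponds to France, the second to Germany, and the third to Spain.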

In [6]:
### convert X to dataframe X1
X1 = pd.DataFrame(X)
### Note there are 12 features, including the 3 one-hot columns for the Geography feature
### (one-hot encoding is also known as ‘one-of-K’ or ‘dummy’ encoding)
### Renaming columns so they appear as variable names in regression output table
### OneHotEncoder orders the Geography categories alphabetically, so the first 3 cols are France, Germany, Spain
X1.columns = ['France','Germany','Spain','CrScore','Gender','Age','Tenure','Balance','Products','CrCard','Active','Salary']
X1.head()

,France,Germany,Spain,CrScore,Gender,Age,Tenure,Balance,Products,CrCard,Active,Salary
0,1.0,0.0,0.0,619.0,0.0,42.0,2.0,0.0,1.0,1.0,1.0,101348.88
1,0.0,0.0,1.0,608.0,0.0,41.0,1.0,83807.86,1.0,0.0,1.0,112542.58
2,1.0,0.0,0.0,502.0,0.0,42.0,8.0,159660.8,3.0,1.0,0.0,113931.57
3,1.0,0.0,0.0,699.0,0.0,39.0,1.0,0.0,2.0,0.0,0.0,93826.63
4,0.0,0.0,1.0,850.0,0.0,43.0,2.0,125510.82,1.0,1.0,1.0,79084.1


In [7]:
#Splitting dataset into training and testing dataset
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=1)

In [8]:
#Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

We call the fit_transform() method on the training data and the transform() method on the test data. Each feature in the training set is scaled to mean 0 and variance 1; in sklearn.preprocessing.StandardScaler, centering and scaling happen independently for each feature. The fit step computes the mean and variance of each feature in X_train. The transform step then standardizes each feature using those training-set statistics, so X_test is scaled with the means and variances computed from X_train rather than its own.
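
To make the fit/transform distinction concrete, here is a small plain-Python sketch of the same idea for a single feature. It is not StandardScaler itself: the functions `fit` and `transform` and the sample ages are hypothetical, and the sketch assumes the population standard deviation (dividing by n), which is what StandardScaler uses as far as I know.

```python
import statistics

def fit(column):
    # Learn the scaling statistics from the training column only.
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (divides by n)
    return mean, std

def transform(column, mean, std):
    # Standardize any column with previously learned statistics.
    return [(x - mean) / std for x in column]

# Made-up values for one feature.
train_age = [42.0, 41.0, 39.0, 43.0]
test_age = [50.0, 30.0]

mean, std = fit(train_age)                       # statistics come from the training data
train_scaled = transform(train_age, mean, std)   # has mean 0 and variance 1
test_scaled = transform(test_age, mean, std)     # reuses the training mean and std

print(statistics.fmean(train_scaled), statistics.pstdev(train_scaled))
print(test_scaled)
```

The scaled test values generally do not have mean 0 and variance 1, because they are standardized with the training statistics.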

Logit regression finds the coefficient vector $b$ that maximizes the log likelihood

$$\log L = \sum_i \left\{ Y_i \ln F(b'X_i) + (1-Y_i) \ln\left(1-F(b'X_i)\right) \right\},$$

where $F(b'X_i) = 1/(1+\exp(-b'X_i))$ is $\mathrm{Prob}(Y_i=1)$. Note that as $b'X_i$ increases, $\mathrm{Prob}(Y_i=1)$ increases toward 1, while as $b'X_i$ decreases, $\mathrm{Prob}(Y_i=1)$ decreases toward 0. The logit (logistic) regression below shows the coefficient estimates of $b$. Note also that maximizing $\log L$ is the same as minimizing the loss function $-\sum_i \left\{ Y_i \ln F(b'X_i) + (1-Y_i) \ln\left(1-F(b'X_i)\right) \right\}$.
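
The log likelihood above can be evaluated directly for any candidate b. The following is a minimal plain-Python sketch on made-up data; the function names `sigmoid` and `log_likelihood`, the four observations, and the two candidate coefficient vectors are all hypothetical and are not taken from the churn dataset.

```python
import math

def sigmoid(z):
    # F(z) = 1 / (1 + exp(-z)), interpreted as Prob(Y = 1).
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b, X, Y):
    # Sum over observations of Y_i * ln F(b'X_i) + (1 - Y_i) * ln(1 - F(b'X_i)).
    total = 0.0
    for x_i, y_i in zip(X, Y):
        p = sigmoid(sum(bj * xj for bj, xj in zip(b, x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total

# Made-up data: the first column is the constant, the second a single feature.
X = [[1.0, -1.2], [1.0, -0.3], [1.0, 0.4], [1.0, 1.5]]
Y = [0, 0, 1, 1]

ll_zero = log_likelihood([0.0, 0.0], X, Y)    # every probability equals 0.5
ll_better = log_likelihood([0.0, 2.0], X, Y)  # positive slope fits this data better

print(ll_zero, ll_better)
```

With all coefficients at zero every predicted probability is 0.5, so the log likelihood equals 4 x ln(0.5); a coefficient vector that fits the data better gives a higher (less negative) value, which is what the maximization searches for.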

In [9]:
import statsmodels.api as sm
logit_model=sm.Logit(Y_train,sm.add_constant(X_train))
result=logit_model.fit()
#print(result.summary())
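### names follow the column order of X_train: OneHotEncoder puts Geography in alphabetical order (France, Germany, Spain)
print(result.summary(xname=['Constant','France','Germany','Spain','CrScore','Gender','Age','Tenure','Balance','Products','CrCard','Active','Salary']))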

Optimization terminated successfully.
         Current function value: 0.429076
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                 8000
Model:                          Logit   Df Residuals:                     7988
Method:                           MLE   Df Model:                           11
Date:                Fri, 21 Oct 2022   Pseudo R-squ.:                  0.1490
Time:                        10:57:49   Log-Likelihood:                -3432.6
converged:                       True   LL-Null:                       -4033.5
Covariance Type:            nonrobust   LLR p-value:                6.481e-251
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Constant      -1.6549      0.035    -47.918      0.000      -1.723      -1.587
France        -0.1232   1.73e

In [10]:
from sklearn.linear_model import LogisticRegression
LogReg=LogisticRegression()

In [11]:
Lresult = LogReg.fit(X_train, Y_train)
print(Lresult.coef_, Lresult.intercept_)
### scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0),
### so the estimates differ slightly from the unpenalized statsmodels Logit results above

[[-0.1231805   0.21764563 -0.07679625 -0.05350094 -0.26540571  0.74686475
  -0.02786673  0.16927536 -0.04035424 -0.03663908 -0.54912598  0.01607395]] [-1.6543875]
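
The effect of regularization mentioned in the comment of cell 11 can be illustrated with a small plain-Python sketch. It assumes the penalized objective has the form C x (negative log likelihood) + 0.5 x ||w||^2 with the intercept left unpenalized, which is my reading of the scikit-learn documentation for the default L2 penalty; treat that form as an assumption. The function names and the toy data are hypothetical.

```python
import math

def neg_log_likelihood(w, c, X, Y):
    # Unpenalized logistic loss: minus the log likelihood.
    total = 0.0
    for x_i, y_i in zip(X, Y):
        z = c + sum(wj * xj for wj, xj in zip(w, x_i))
        p = 1.0 / (1.0 + math.exp(-z))
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total

def l2_penalized_objective(w, c, X, Y, C=1.0):
    # Assumed form: C * loss + 0.5 * ||w||^2; the intercept c is not penalized.
    penalty = 0.5 * sum(wj * wj for wj in w)
    return C * neg_log_likelihood(w, c, X, Y) + penalty

# Made-up, perfectly separable data with a single feature.
X = [[-1.2], [-0.3], [0.4], [1.5]]
Y = [0, 0, 1, 1]

loss_small = neg_log_likelihood([1.0], 0.0, X, Y)
loss_large = neg_log_likelihood([5.0], 0.0, X, Y)
obj_small = l2_penalized_objective([1.0], 0.0, X, Y)
obj_large = l2_penalized_objective([5.0], 0.0, X, Y)

print(loss_small, loss_large)  # unpenalized loss prefers the larger coefficient
print(obj_small, obj_large)    # penalized objective prefers the smaller one
```

On this toy data the penalty pulls the preferred coefficient toward zero. With 8000 training rows the penalty term is small relative to the summed loss, which would be consistent with the two packages giving similar estimates.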


In [12]:
Lpredict=LogReg.predict(X_test)

In [13]:
### Using Score method to obtain accuracy (% correct prediction) of model
score = LogReg.score(X_test,Y_test)
print(score)

0.8125


In [14]:
print(Lpredict) ### shows the predicted class for each test row: 1 (exit) or 0 (no exit)

[0 0 0 ... 0 0 0]


In [15]:
Lpredict.shape

(2000,)

In [16]:
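### Manual check of the accuracy: recode 0 as -1, so a product > 0 means prediction and label agree
Lpredict1=Lpredict.copy()  ### copy, so Lpredict itself is not overwritten
Lpredict1[Lpredict1==0]=-1
Y_test1=Y_test.copy()  ### copy, so Y_test itself is not overwritten
Y_test1[Y_test1==0]=-1
J1=np.multiply(Y_test1,Lpredict1)  ### element by element multiplication
c1=np.count_nonzero(J1 > 0)
print(c1,c1/len(Y_test1))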

1625 0.8125
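
The sign trick in cell 16 counts the same thing as a direct comparison of labels and predictions. The following plain-Python sketch shows the equivalence on made-up labels; the lists `y_true` and `y_pred` and the helper `to_plus_minus` are hypothetical.

```python
# Made-up labels and predictions, coded 0/1 as in the notebook.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]

# Direct count of agreements.
direct = sum(t == p for t, p in zip(y_true, y_pred))

def to_plus_minus(values):
    # Recode 0 as -1 in a new list, leaving the input untouched.
    return [1 if v == 1 else -1 for v in values]

# Sign trick: the product of two +/-1 codes is +1 exactly when they agree.
products = [t * p for t, p in zip(to_plus_minus(y_true), to_plus_minus(y_pred))]
matches = sum(v > 0 for v in products)

accuracy = matches / len(y_true)
print(direct, matches, accuracy)
```

Both counts agree, and dividing by the number of test rows gives the accuracy, the quantity that LogReg.score reports in cell 13.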
