<table align="center" width=100%>
    <tr>
        <td width="15%">
            <img src="homework.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                    <b> Take-Home <br>(Session 0)
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

#### Import the required libraries

In [1]:
# import 'Pandas' 
import pandas as pd 

# import 'Numpy' 
import numpy as np

# import subpackage of Matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# import 'Seaborn' 
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None
 
# to display the float values up to 6 decimal places
pd.options.display.float_format = '{:.6f}'.format

# import train-test split 
from sklearn.model_selection import train_test_split

# import various functions from statsmodels
import statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# import StandardScaler to perform scaling
from sklearn.preprocessing import StandardScaler 

# import various functions from sklearn 
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve

# import function to perform feature selection
from sklearn.feature_selection import RFE

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

#### Read the data

Load the csv file and print the first five observations.

In [3]:
# read the data
df_bank = pd.read_csv("bank_churn.csv")

# display the first five rows of the data
df_bank.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,NumOfYrsWithBank,Balance,NumOfProducts,HasCrCard,Closed_Acc
0,619,France,Female,42,2,0.0,1,1,1
1,608,Spain,Female,41,1,83807.86,1,0,0
2,502,France,Female,42,8,159660.8,3,1,1
3,699,France,Female,39,1,0.0,2,0,0
4,850,Spain,Female,43,2,125510.82,1,1,0


Our objective is to predict whether the customer has closed the bank account or not.

**The data definition is as follows:** <br>

**CreditScore**: Credit score of the customer 

**Geography**: Resident country of the customer

**Gender**: Gender of the customer

**Age**: Age of the customer

**NumOfYrsWithBank**: Years for which the customer has been with the bank

**Balance**: Bank balance of the customer in euros

**NumOfProducts**: Number of bank products the customer has opted for

**HasCrCard**: Whether the customer has a credit card or not (1 = Yes, 0 = No)

**Closed_Acc**: Whether the customer has closed the bank account or not (1 = Yes, 0 = No) (target/dependent variable)
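
Geography and Gender hold text values, so they have to be turned into 0/1 indicator columns before a logistic model can use them. The following is a minimal sketch of n-1 dummy encoding; the rows in `sample` are made up for illustration and are not taken from `bank_churn.csv`.

```python
import pandas as pd

# made-up rows for illustration only; they are not taken from bank_churn.csv
sample = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# n-1 dummy encoding: the first category of each column (France, Female)
# is dropped and acts as the reference level
dummies = pd.get_dummies(sample, drop_first=True, dtype=int)

print(dummies)
```

A row with zeros in both Geography columns therefore stands for the reference country, France.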

### Let's begin with some hands-on practice exercises

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left" style="font-size:120%">
                <font color="#21618C">
                    <b>1. Build a full logistic model and calculate the odds for each variable </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [5]:
# we can ignore the probability threshold as it is not required to calculate the odds
# consider the independent variables
# select_dtypes: selects the variable having specified datatype
# include: includes the variables with specified datatype
# drop(): drops specified column(s)/row(s) from the dataframe
# axis: specifies whether to drop labels from index or columns; use 1 for columns and 0 for index
df_num = df_bank.select_dtypes(include=np.number).drop(["Closed_Acc"],axis=1)

# scale all the numeric independent variables
# initialize the standard scaler
X_scaler = StandardScaler()

# standardize all the columns of the dataframe 'df_num'
num_scaled = X_scaler.fit_transform(df_num)

# create a dataframe of scaled numerical variables
# pass the required column names to the parameter 'columns'
df_num_scaled = pd.DataFrame(num_scaled, columns = df_num.columns)

# consider all the categorical variables in the data
# select_dtypes: selects the variable having specified datatype
# include: includes the variables with specified datatype
df_cat = df_bank.select_dtypes(include="object")

# convert the categorical variables to dummy variables
# get_dummies(): converts each categorical variable into 0/1 indicator (dummy) columns
# drop_first=True: uses n-1 dummy encoding; if set to False, uses one-hot encoding
dummy_variables = pd.get_dummies(df_cat, drop_first=True)

# concatenate the scaled numerical and dummy variables
# axis: specifies the axis along which to concatenate; use 1 to join column-wise and 0 to stack rows
X = pd.concat([df_num_scaled, dummy_variables],axis=1)

# add a constant column to the dataframe
# the 'Logit' method in the Statsmodels library does not include an intercept by default
# we can add the intercept to the set of independent variables using 'add_constant()'
X = sm.add_constant(X)

# consider the dependent variable
y = df_bank["Closed_Acc"]

# split data into train subset and test subset
# set 'random_state' to get the same split each time you run the code
# 'test_size' sets the proportion of data to be included in the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1, test_size = 0.2)

# build the model on train data 
# use fit() to fit the logistic regression model
logreg_full = sm.Logit(y_train, X_train.astype(float)).fit()

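# exponentiate each coefficient to get the odds ratio for a one-unit increase in that variable
# (the numeric variables were standardized, so one unit is one standard deviation)
# 'params' returns the coefficients of the intercept and all the independent variables
# pass the required column name to the parameter 'columns'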
df_odds = pd.DataFrame(np.exp(logreg_full.params), columns= ['Odds']) 

# print the dataframe
df_odds

Optimization terminated successfully.
         Current function value: 0.441849
         Iterations 6


Unnamed: 0,Odds
const,0.230789
CreditScore,0.944647
Age,2.042779
NumOfYrsWithBank,0.986998
Balance,1.183638
NumOfProducts,0.939485
HasCrCard,0.981084
Geography_Germany,2.102382
Geography_Spain,0.988354
Gender_Male,0.578571
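
One way to read the table above: each value is an odds ratio, so subtracting 1 gives the proportional change in the odds of closing the account for a one-unit increase in that variable (one standard deviation for the scaled numeric variables, or a switch from the reference category for the dummy variables). The sketch below copies three values from the table; the dictionary and variable names are illustrative.

```python
import numpy as np

# odds ratios copied from the 'Odds' table above
odds_ratios = {"Age": 2.042779, "Geography_Germany": 2.102382, "Gender_Male": 0.578571}

# percentage change in the odds of closing the account for a one-unit increase
pct_change = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

# the natural log of an odds ratio recovers the model coefficient (log-odds scale)
log_odds = {name: float(np.log(ratio)) for name, ratio in odds_ratios.items()}

for name in odds_ratios:
    print(f"{name}: {pct_change[name]:+.1f}% change in odds, coefficient = {log_odds[name]:.4f}")
```

A value above 1 (Age, Geography_Germany) means higher odds of closing the account, while a value below 1 (Gender_Male) means lower odds relative to the reference category.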


In [7]:
logreg_full.summary()

0,1,2,3
Dep. Variable:,Closed_Acc,No. Observations:,7936.0
Model:,Logit,Df Residuals:,7926.0
Method:,MLE,Df Model:,9.0
Date:,"Mon, 15 Apr 2024",Pseudo R-squ.:,0.1239
Time:,08:57:38,Log-Likelihood:,-3506.5
converged:,True,LL-Null:,-4002.6
Covariance Type:,nonrobust,LLR p-value:,8.439e-208
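
The model-level figures above are related to each other, and a short check can make that visible. The sketch below assumes that the "Current function value" printed during fitting is the average negative log-likelihood per observation, and that the reported pseudo R-squared is McFadden's; all numbers are copied from the output above, and the variable names are illustrative.

```python
# values copied from the fit output and the summary above
n_obs = 7936                 # No. Observations
mean_neg_loglik = 0.441849   # "Current function value" printed by fit()
ll_null = -4002.6            # LL-Null (intercept-only model)

# total log-likelihood of the fitted model
ll_model = -mean_neg_loglik * n_obs

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# likelihood-ratio statistic that underlies the LLR p-value
llr_stat = 2 * (ll_model - ll_null)

print(round(ll_model, 1), round(pseudo_r2, 4), round(llr_stat, 1))
```

The first two printed values should land close to the Log-Likelihood and Pseudo R-squ. entries in the summary. The likelihood-ratio statistic is compared against a chi-square distribution with Df Model (9) degrees of freedom to obtain the LLR p-value.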

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.4663,0.054,-26.940,0.000,-1.573,-1.360
CreditScore,-0.0569,0.030,-1.892,0.059,-0.116,0.002
Age,0.7143,0.029,24.469,0.000,0.657,0.772
NumOfYrsWithBank,-0.0131,0.030,-0.439,0.661,-0.071,0.045
Balance,0.1686,0.036,4.738,0.000,0.099,0.238
NumOfProducts,-0.0624,0.030,-2.072,0.038,-0.121,-0.003
HasCrCard,-0.0191,0.030,-0.640,0.522,-0.078,0.039
Geography_Germany,0.7431,0.074,9.999,0.000,0.597,0.889
Geography_Spain,-0.0117,0.079,-0.149,0.882,-0.166,0.143
