## Problems

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

**1. Personal Loan Acceptance.** The file `UniversalBank.csv` contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (=9.6%) accepted the personal loan that was offered to them in the earlier campaign. In this exercise, we focus on two predictors: Online (whether or not the customer is an active user of online banking services) and Credit Card (abbreviated CC below; whether or not the customer holds a credit card issued by the bank), and the outcome Personal Loan (abbreviated Loan below).

Partition the data into training (60%) and validation (40%) sets.

**a.** Create a pivot table for the training data with Online as a column variable, CC as a row variable, and Loan as a secondary row variable. The values inside the table should convey the count. Use the pandas dataframe methods `melt()` and `pivot()`.

In [2]:
bank_df = pd.read_csv("../datasets/UniversalBank.csv")
bank_df.head()

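|   | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |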


In [3]:
# data preparation for the exercise
predictors = ["Online", "CreditCard"]
outcome = "Personal Loan"

X_train, X_valid, y_train, y_valid = train_test_split(bank_df[predictors], bank_df[outcome],
                                                      test_size=0.4, random_state=1)

# data preparation to generate the pivot tables
train_df, valid_df = train_test_split(bank_df[predictors+[outcome]],
                                      test_size=0.4, random_state=1)

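# count training records per (CreditCard, Personal Loan) pair via melt();
# var_name is passed as a scalar column name
melt_table = pd.melt(train_df,
                     id_vars=["CreditCard", "Personal Loan"],
                     var_name="Online")\
               .groupby(["CreditCard", "Personal Loan"])\
               .count()["Online"]\
               .reset_index()
# same counts via pivot_table(); the "Online" column holds the record count,
# so neither table is split by the value of Online
pivot_table = pd.pivot_table(train_df,
                             values=["Online"],
                             index=["CreditCard", "Personal Loan"],
                             aggfunc='count')\
                .reset_index()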

print(melt_table)
print()
print(pivot_table)

   CreditCard  Personal Loan  Online
0           0              0    1909
1           0              1     199
2           1              0     804
3           1              1      88

   CreditCard  Personal Loan  Online
0           0              0    1909
1           0              1     199
2           1              0     804
3           1              1      88
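
The counts above are grouped by CreditCard and Personal Loan only. As a minimal sketch of the layout requested in (a), with Online as the column variable, the snippet below runs `pd.pivot_table()` on a small made-up data frame. The rows of `toy_df` and the helper column `n` are illustrative assumptions, not the bank data, so the counts it prints say nothing about `UniversalBank.csv`.

```python
import pandas as pd

# Hypothetical toy records (NOT the UniversalBank data); columns mirror the exercise
toy_df = pd.DataFrame({
    "CreditCard":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    "Online":        [0, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "Personal Loan": [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
})

# Helper column of ones: summing it counts the records that fall in each cell
three_way = pd.pivot_table(toy_df.assign(n=1),
                           values="n",
                           index=["CreditCard", "Personal Loan"],  # row variables
                           columns="Online",                       # column variable
                           aggfunc="sum",
                           fill_value=0)                           # empty cells become 0
print(three_way)
```

Applied to `train_df` instead of `toy_df`, the same call would give one count per (CreditCard, Personal Loan, Online) combination.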


**b.** Consider the task of classifying a customer who owns a bank credit card and is actively using online banking services. Looking at the pivot table, what is the probability that this customer will accept the loan offer? (This is the probability of loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and being an active user of online banking services (Online = 1)).

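Here we want the probability that a record belongs to class `Personal Loan` = 1 given the predictor values `CC` = 1 and `Online` = 1. Note that the tables printed in (a) count records per (`CreditCard`, `Personal Loan`) pair only and are not split by `Online`, so the ratio below is taken over all `CC` = 1 customers in the training set; with `Online` as a column variable, both counts would be read from the `Online` = 1 column instead: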

<p style="text-align:center">
    $P(\text{Loan}=1 ∣ \text{CC}=1, \text{Online}=1) = \frac{88}{804+88} \approx 0.099 = 9.9\%$
</p>
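
An exact conditional probability of this kind is simply a ratio of record counts: among the records that match the given predictor values, the fraction with Loan = 1. The sketch below shows the counting on a handful of hypothetical `(credit_card, online, loan)` tuples; the `records` list and the function name `exact_conditional` are made up for illustration and are not the bank data.

```python
# Hypothetical toy records as (credit_card, online, loan) tuples; NOT the bank data
records = [
    (0, 0, 0), (0, 1, 0), (0, 1, 1), (0, 0, 0), (1, 1, 0),
    (1, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 0), (0, 1, 0),
]


def exact_conditional(records, credit_card, online):
    """Estimate P(loan = 1 | credit_card, online) as a ratio of record counts."""
    # Keep the loan outcome of every record that matches both predictor values
    matching = [loan for cc, on, loan in records
                if cc == credit_card and on == online]
    if not matching:
        raise ValueError("no records match the given predictor values")
    # loan is coded 0/1, so the sum is the number of acceptors
    return sum(matching) / len(matching)


p_exact = exact_conditional(records, credit_card=1, online=1)
print(f"P(Loan=1 | CC=1, Online=1) = {p_exact:.3f}")
```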

**c.** Create two separate pivot tables for the training data. One will have Loan (rows) as a function of Online (columns) and the other will have Loan (rows) as a function of CC.

In [4]:
pd.set_option("display.precision", 4)
# probability of loan acceptance
print(train_df["Personal Loan"].value_counts() / len(train_df))
print()

for predictor in predictors:
    # construct the frequency table (counts of Loan by predictor value)
    df = train_df[["Personal Loan", predictor]]
    freq_table = df.pivot_table(index="Personal Loan", columns=predictor, aggfunc=len)

    # print the raw counts; part (d) divides each count by its row sum
    # to obtain the conditional probabilities
    print(freq_table)
    print()

pd.reset_option("display.precision")

0    0.9043
1    0.0957
Name: Personal Loan, dtype: float64

Online            0     1
Personal Loan            
0              1119  1594
1               112   175

CreditCard        0    1
Personal Loan           
0              1909  804
1               199   88



**d.** Compute the following quantities [P(A ∣ B) means "the probability of A given B"]:

    i. P(CC = 1 ∣ Loan = 1) (the proportion of credit card holders among the loan acceptors)

<p style="text-align:center">
    $P(\text{CC}=1 ∣ \text{Loan}=1) = \frac{88}{199+88} \approx 0.307 = 30.7\%$
</p>

    ii. P(Online = 1 ∣ Loan = 1)

<p style="text-align:center">
    $P(\text{Online}=1 ∣ \text{Loan}=1) = \frac{175}{175+112} \approx 0.610 = 61.0\%$
</p>

    iii. P(Loan = 1) (the proportion of loan acceptors)

<p style="text-align:center">
    $P(\text{Loan}=1) = \frac{287}{3000} \approx 0.0957 = 9.57\%$
</p>

    iv. P(CC = 1 ∣ Loan = 0)

<p style="text-align:center">
    $P(\text{CC}=1 ∣ \text{Loan}=0) = \frac{804}{1909+804} \approx 0.296 = 29.6\%$
</p>

    v. P(Online = 1 ∣ Loan = 0)

<p style="text-align:center">
    $P(\text{Online}=1 ∣ \text{Loan}=0) = \frac{1594}{1119+1594} \approx 0.588 = 58.8\%$
</p>

    vi. P(Loan = 0)

<p style="text-align:center">
    $P(\text{Loan}=0) = 1 - P(\text{Loan}=1) = 1 - 0.0957 = 0.9043 \approx 90.4\%$
</p>

**e.** Use the quantities computed above to compute the naive Bayes probability $P(\text{Loan} = 1 ∣ \text{CC} = 1, \text{Online} = 1)$.

The naive Bayes probability for this case is given by:

<p style="text-align:center">
    $P_{nb}(\text{Loan} = 1 ∣ \text{CC} = 1, \text{Online} = 1) = \frac{P(\text{Loan}=1) P(\text{CC}=1 ∣ \text{Loan}=1) P(\text{Online}=1 ∣ \text{Loan}=1)}{P(\text{Loan}=1) P(\text{CC}=1 ∣ \text{Loan}=1) P(\text{Online}=1 ∣ \text{Loan}=1) + P(\text{Loan}=0) P(\text{CC}=1 ∣ \text{Loan}=0) P(\text{Online}=1 ∣ \text{Loan}=0)}$
    <br>$ = \frac{(0.0957)(0.307)(0.610)}{(0.0957)(0.307)(0.610) + (0.9043)(0.296)(0.588)}$
    <br>$ \approx \frac{0.01792}{0.01792 + 0.1574} \approx 0.102 = 10.2\%$
   
</p>
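
The hand computation in (d) and (e) can be cross-checked directly from the training-set counts in the two pivot tables of (c). The sketch below works with unrounded fractions, so its result may differ from the hand calculation in the last digit; the variable names are illustrative.

```python
# Training-set counts taken from the two pivot tables in (c)
loan1_total, loan0_total = 199 + 88, 1909 + 804   # 287 acceptors, 2713 non-acceptors
cc1_loan1, cc1_loan0 = 88, 804                    # credit card holders per class
online1_loan1, online1_loan0 = 175, 1594          # online users per class
n = loan1_total + loan0_total                     # 3000 training records

# Class priors and class-conditional probabilities from (d)
p_loan1 = loan1_total / n
p_loan0 = loan0_total / n
p_cc1_given_loan1 = cc1_loan1 / loan1_total
p_cc1_given_loan0 = cc1_loan0 / loan0_total
p_online1_given_loan1 = online1_loan1 / loan1_total
p_online1_given_loan0 = online1_loan0 / loan0_total

# Naive Bayes: multiply the prior by the per-predictor conditionals, then normalize
score_loan1 = p_loan1 * p_cc1_given_loan1 * p_online1_given_loan1
score_loan0 = p_loan0 * p_cc1_given_loan0 * p_online1_given_loan0
p_nb = score_loan1 / (score_loan1 + score_loan0)
print(f"P_nb(Loan=1 | CC=1, Online=1) = {p_nb:.4f}")
```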

**f.** Compare this value with the one obtained from the pivot table in (b). Which is a more accurate estimate?

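The naive Bayes probability is very close to the probability obtained from the pivot table in (b). The pivot-table (exact Bayes) estimate is the more accurate of the two in principle, because it conditions directly on the predictor values, whereas naive Bayes assumes that CC and Online are independent within each class. Although the two values are not equal, both lead to the same classification for a cutoff of 0.5 (and for many other cutoffs). It is often the case that the rank ordering of naive Bayes probabilities is even closer to that of the exact Bayes method than the probabilities themselves, and for classification purposes it is the rank ordering that matters.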

**g.** Which of the entries in this table are needed for computing P(Loan = 1 ∣ CC = 1, Online = 1)? In Python, run naive Bayes on the data. Examine the model output on training data, and find the entry that corresponds to P(Loan = 1 ∣ CC = 1, Online = 1). Compare this to the number you obtained in (e).
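
One way to approach the software part of (g) is sketched below, under several assumptions. The bank file is not read here; instead, a synthetic data set is built that reproduces only the per-class counts from (c) (88 of 287 acceptors and 804 of 2713 non-acceptors hold a credit card; 175 and 1594, respectively, use online banking). How CC and Online are paired within a class is arbitrary in this synthetic data, which does not matter for naive Bayes because the model uses only the per-class marginals. The choice of scikit-learn's `CategoricalNB`, the smoothing value `alpha=0.01`, and the helper `make_class` are assumptions for illustration, not necessarily the setup intended by the exercise.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB


def make_class(n_total, n_cc, n_online, label):
    """Build n_total synthetic rows with n_cc card holders and n_online online users."""
    cc = np.r_[np.ones(n_cc, dtype=int), np.zeros(n_total - n_cc, dtype=int)]
    online = np.r_[np.ones(n_online, dtype=int), np.zeros(n_total - n_online, dtype=int)]
    X = np.column_stack([cc, online])       # columns: CreditCard, Online
    y = np.full(n_total, label)
    return X, y


# Synthetic training data matching the per-class counts in (c); NOT the real records
X1, y1 = make_class(287, 88, 175, label=1)      # loan acceptors
X0, y0 = make_class(2713, 804, 1594, label=0)   # non-acceptors
X = np.vstack([X0, X1])
y = np.concatenate([y0, y1])

# Small alpha keeps the smoothed estimates close to the raw proportions
nb = CategoricalNB(alpha=0.01)
nb.fit(X, y)

# Predicted probability of Loan = 1 for a customer with CC = 1 and Online = 1
loan_col = list(nb.classes_).index(1)
p_sklearn = nb.predict_proba(np.array([[1, 1]]))[0, loan_col]
print(f"P(Loan=1 | CC=1, Online=1) = {p_sklearn:.4f}")
```

Because of the smoothing, the fitted probability can differ slightly from a hand computation based on raw proportions.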