# Problems

In [1]:
import pandas as pd

from sklearn import preprocessing
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics import accuracy_score

**1. Calculating Distance with Categorical Predictors.**

This exercise with a tiny dataset illustrates the calculation of Euclidean distance and the creation of binary dummies. The online education company Statistics.com segments its customers and prospects into three main categories: IT professionals (IT), statisticians (Stat), and other (Other). It also tracks, for each customer, the number of years since first contact (years). Consider the following customers; information about whether they have taken a course or not (the outcome to be predicted) is included:

    Customer 1: Stat, 1 year, did not take course
    Customer 2: Other, 1.1 year, took course

**a.** Consider now the following new prospect:

    Prospect 1: IT, 1 year

Using the above information on the two customers and one prospect, create one dataset for all three with the categorical predictor variable transformed into 2 binaries, and a similar dataset with the categorical predictor variable transformed into 3 binaries.

In [2]:
# dataset for all three customers with the categorical predictor (category)
# transformed into 2 binaries
tiny_two_cat_dummies_df = pd.DataFrame({"IT": [0, 0, 1], "Stat": [1, 0, 0],
                                        "years_since_first_contact": [1, 1.1, 1],
                                        "course": [0, 1, None]})
tiny_two_cat_dummies_df

|   | IT | Stat | years_since_first_contact | course |
|---|---|---|---|---|
| 0 | 0 | 1 | 1.0 | 0.0 |
| 1 | 0 | 0 | 1.1 | 1.0 |
| 2 | 1 | 0 | 1.0 | NaN |


In [3]:
# dataset for all three customers with the categorical predictor (category)
# transformed into 3 binaries
tiny_all_cat_dummies_df = pd.DataFrame({"IT": [0, 0, 1], "Stat": [1, 0, 0], 
                                        "Other": [0, 1, 0], "years_since_first_contact": [1, 1.1, 1],
                                        "course": [0, 1, None]})
tiny_all_cat_dummies_df

|   | IT | Stat | Other | years_since_first_contact | course |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1.0 | 0.0 |
| 1 | 0 | 0 | 1 | 1.1 | 1.0 |
| 2 | 1 | 0 | 0 | 1.0 | NaN |


**b.** For each derived dataset, calculate the Euclidean distance between the prospect and each of the other two customers. (Note: While it is typical to normalize data for k-NN, this is not an iron-clad rule and you may proceed here without normalization.)

- Two categorical dummies (IT/Stat):

In [4]:
predictors = ["IT", "Stat", "years_since_first_contact"]
pd.DataFrame(euclidean_distances(tiny_two_cat_dummies_df[predictors],
                                 tiny_two_cat_dummies_df[predictors]),
             columns=["customer_1", "customer_2", "customer_3"],
             index=["customer_1", "customer_2", "customer_3"])

|   | customer_1 | customer_2 | customer_3 |
|---|---|---|---|
| customer_1 | 0.0 | 1.004988 | 1.414214 |
| customer_2 | 1.004988 | 0.0 | 1.004988 |
| customer_3 | 1.414214 | 1.004988 | 0.0 |


- Three categorical dummies (IT/Stat/Other):

In [5]:
predictors = ["IT", "Stat", "Other", "years_since_first_contact"]

pd.DataFrame(euclidean_distances(tiny_all_cat_dummies_df[predictors],
                                 tiny_all_cat_dummies_df[predictors]),
             columns=["customer_1", "customer_2", "customer_3"],
             index=["customer_1", "customer_2", "customer_3"])

|   | customer_1 | customer_2 | customer_3 |
|---|---|---|---|
| customer_1 | 0.0 | 1.417745 | 1.414214 |
| customer_2 | 1.417745 | 0.0 | 1.417745 |
| customer_3 | 1.414214 | 1.417745 | 0.0 |


The choice between two and three dummy variables already changes the distances. With two dummies, `customer_3` (the prospect) is nearer to `customer_2` (1.005) than to `customer_1` (1.414). This happens because `Other` is encoded as all zeros, so the prospect differs from `customer_2` in only one dummy (`IT`) but from `customer_1` in two (`IT` and `Stat`); the 0.1-year gap in `years_since_first_contact` is too small to offset this. With three dummies, every pair of distinct categories differs in exactly two dummies, so the ordering reverses: `customer_3` is now slightly nearer to `customer_1` (1.414) than to `customer_2` (1.418), and the difference comes only from `years_since_first_contact`.
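The prospect's distances in both tables can be checked by hand. As a worked example, with coordinates ordered as in the `predictors` lists above (the labels $d_2$ and $d_3$ for the two-dummy and three-dummy encodings are introduced here only for illustration):

```latex
\begin{align*}
% Two dummies: (IT, Stat, years)
d_2(\text{prospect}, \text{customer}_1) &= \sqrt{(1-0)^2 + (0-1)^2 + (1-1)^2} = \sqrt{2} \approx 1.414 \\
d_2(\text{prospect}, \text{customer}_2) &= \sqrt{(1-0)^2 + (0-0)^2 + (1-1.1)^2} = \sqrt{1.01} \approx 1.005 \\
% Three dummies: (IT, Stat, Other, years)
d_3(\text{prospect}, \text{customer}_1) &= \sqrt{(1-0)^2 + (0-1)^2 + (0-0)^2 + (1-1)^2} = \sqrt{2} \approx 1.414 \\
d_3(\text{prospect}, \text{customer}_2) &= \sqrt{(1-0)^2 + (0-0)^2 + (0-1)^2 + (1-1.1)^2} = \sqrt{2.01} \approx 1.418
\end{align*}
```

Adding the `Other` dummy leaves the distance to `customer_1` unchanged and adds one squared unit to the distance to `customer_2`.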

In contrast to the situation with statistical models such as regression, all *m* binaries should be created and used with *k*-NN. While mathematically this is redundant, since *m* - 1 dummies contain the same information as *m* dummies, this redundant information does not create the multicollinearity problems that it does for linear models. Moreover, in *k*-NN the use of *m* - 1 dummies can yield different classifications than the use of *m* dummies, and lead to an imbalance in the contribution of the different categories to the model.

**c.** Using k-NN with k = 1, classify the prospect as taking or not taking a course using each of the two derived datasets. Does it make a difference whether you use two or three dummies?

- Two dummy variables (IT/Stat):

In [6]:
predictors = ["IT", "Stat", "years_since_first_contact"]

# use NearestNeighbors from scikit-learn to compute knn
knn = NearestNeighbors(n_neighbors=1)
knn.fit(tiny_two_cat_dummies_df.loc[:1, predictors])

new_customer = pd.DataFrame({"IT": [1], "Stat": [0],
                             "years_since_first_contact": [1]})

distances, indices = knn.kneighbors(new_customer)

# indices is a list of lists, we are only interested in the first element
tiny_two_cat_dummies_df.iloc[indices[0], :]

|   | IT | Stat | years_since_first_contact | course |
|---|---|---|---|---|
| 1 | 0 | 0 | 1.1 | 1.0 |


- Three dummy variables (IT/Stat/Other):

In [7]:
predictors = ["IT", "Stat", "Other", "years_since_first_contact"]

# use NearestNeighbors from scikit-learn to compute knn
knn = NearestNeighbors(n_neighbors=1)
knn.fit(tiny_all_cat_dummies_df.loc[:1, predictors])

# the prospect is an IT professional, so "Other" must be 0
new_customer = pd.DataFrame({"IT": [1], "Stat": [0], "Other": [0],
                             "years_since_first_contact": [1]})

distances, indices = knn.kneighbors(new_customer)

# indices is a list of lists, we are only interested in the first element
tiny_all_cat_dummies_df.iloc[indices[0], :]

|   | IT | Stat | Other | years_since_first_contact | course |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1.0 | 0.0 |


With *k* = 1, the two encodings give different answers. Using two dummies, the nearest neighbor is customer 2, who took the course, so the prospect is classified as taking a course. Using three dummies, the nearest neighbor is customer 1 (distance 1.414 versus 1.418 to customer 2, as in the distance table of item **b**), who did not take the course, so the prospect is classified as not taking a course. This illustrates the point made in item (**b**): the redundant dummy does not create the multicollinearity problems that it does for linear models, whereas using *m* - 1 dummies can yield a different classification than using *m* dummies and unbalances the contribution of the different categories to the model.
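The 1-NN step can also be checked without scikit-learn. The sketch below is an illustrative re-implementation (the helper names `euclidean` and `one_nn` are not part of the original notebook) that encodes the prospect as IT under both schemes and reports the label and distance of its nearest neighbor.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_nn(train, labels, query):
    """Return (label, distance) of the training record closest to query."""
    dists = [euclidean(row, query) for row in train]
    best = min(range(len(train)), key=dists.__getitem__)
    return labels[best], dists[best]


# customer 1 did not take a course (0), customer 2 did (1)
labels = [0, 1]

# Two dummies: (IT, Stat, years); "Other" is the all-zero baseline
train_two = [(0, 1, 1.0), (0, 0, 1.1)]
prospect_two = (1, 0, 1.0)

# Three dummies: (IT, Stat, Other, years)
train_three = [(0, 1, 0, 1.0), (0, 0, 1, 1.1)]
prospect_three = (1, 0, 0, 1.0)

label_two, dist_two = one_nn(train_two, labels, prospect_two)
label_three, dist_three = one_nn(train_three, labels, prospect_three)

print("two dummies:  ", label_two, round(dist_two, 6))
print("three dummies:", label_three, round(dist_three, 6))
```

The printed distances can be compared against the prospect's row in each distance table of item **b**.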

**2. Personal Loan Acceptance.** Universal Bank is a relatively young bank growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers (depositors) with varying sizes of relationship with the bank. The customer base of asset customers (borrowers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business. In particular, it wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal is to use *k*-NN to predict whether a new customer will accept a loan offer. This will serve as the basis for the design of a new campaign.

The file `UniversalBank.csv` contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (=9.6%) accepted the personal loan that was offered to them in the earlier campaign.

Partition the data into training (60%) and validation (40%) sets.

**a.** Consider the following customer:
    
    Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0,
    Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0,
    Online = 1, and Credit Card = 1.
    
Perform a *k*-NN classification with all predictors except ID and ZIP code using k = 1. Remember to transform categorical predictors with more than two categories into dummy variables first. Specify the success class as 1 (loan acceptance), and use the default cutoff value of 0.5. How would this customer be classified?

In [8]:
customer_df = pd.read_csv("../datasets/UniversalBank.csv")
customer_df.head()

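|   | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |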


In [9]:
# define predictors and the outcome for this problem
predictors = ["Age", "Experience", "Income", "Family", "CCAvg", "Education", "Mortgage",
              "Securities Account", "CD Account", "Online", "CreditCard"]
outcome = "Personal Loan"

# before k-NN, we will convert 'Education' to binary dummies.
# 'Family' remains unchanged
customer_df = pd.get_dummies(customer_df, columns=["Education"], prefix_sep="_")

# update predictors to include the new dummy variables
predictors = ["Age", "Experience", "Income", "Family", "CCAvg", "Education_1",
              "Education_2", "Education_3", "Mortgage",
              "Securities Account", "CD Account", "Online", "CreditCard"]

# partition the data into training 60% and validation 40% sets
train_data, valid_data = train_test_split(customer_df, test_size=0.4,
                                          random_state=26)

# standardize the predictors so they share a common scale (fit on training data only)
scaler = preprocessing.StandardScaler()
scaler.fit(train_data[predictors])

# transform the full dataset
customer_norm = pd.concat([pd.DataFrame(scaler.transform(customer_df[predictors]),
                                        columns=["z"+col for col in predictors]),
                           customer_df[outcome]], axis=1)

train_norm = customer_norm.iloc[train_data.index]
valid_norm = customer_norm.iloc[valid_data.index]

# new customer (column names must match the predictors used to fit the scaler)
new_customer = pd.DataFrame({"Age": [40], "Experience": [10], "Income": [84], "Family": [2],
                             "CCAvg": [2], "Education_1": [0], "Education_2": [1],
                             "Education_3": [0], "Mortgage": [0], "Securities Account": [0],
                             "CD Account": [0], "Online": [1], "CreditCard": [1]})
new_customer_norm = pd.DataFrame(scaler.transform(new_customer[predictors]),
                                 columns=["z"+col for col in predictors])

# use NearestNeighbors from scikit-learn to compute knn
knn = NearestNeighbors(n_neighbors=1)
knn.fit(train_norm.iloc[:, 0:-1])

distances, indices = knn.kneighbors(new_customer_norm)

# indices is a list of lists, we are only interested in the first element
customer_norm.iloc[indices[0], :]

|   | zAge | zExperience | zIncome | zFamily | zCCAvg | zEducation_1 | zEducation_2 | zEducation_3 | zMortgage | zSecurities Account | zCD Account | zOnline | zCreditCard | Personal Loan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 253 | 0.147952 | 0.081042 | 1.42996 | -1.210912 | -1.098605 | 1.167135 | -0.6298 | -0.643242 | -0.547625 | -0.346151 | -0.248891 | -1.24019 | -0.645314 | 0 |


Since the closest customer did not accept the loan (=0), we estimate for the new customer a probability of 1 of being a non-borrower (and 0 of being a borrower). Using a simple majority rule is equivalent to setting the cutoff value to 0.5. In the above results, we see that the software assigned the class non-borrower to this record.
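The pipeline used above (standardize with training-set statistics, find the nearest training record, apply the 0.5 cutoff) can be sketched in plain Python. The data, row ids, and helper names below are hypothetical and only illustrate the mechanics. Note that the position returned by the search refers to the rows of the training set; scikit-learn's `NearestNeighbors.kneighbors` likewise returns positions relative to the data passed to `fit`, so a neighbor found this way is looked up in the training frame (for example `train_norm.iloc[...]`) rather than in the full dataset.

```python
import math
import statistics


def fit_zscore(rows):
    """Column means and population standard deviations (ddof=0, as StandardScaler uses)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return means, stds


def to_zscore(row, means, stds):
    """Standardize one record with previously fitted means and standard deviations."""
    return tuple((v - m) / s for v, m, s in zip(row, means, stds))


def nearest_position(train_z, query_z):
    """Position (within the training rows) of the record closest to query_z."""
    return min(range(len(train_z)), key=lambda i: math.dist(train_z[i], query_z))


# Hypothetical toy training partition: (Income, CCAvg) with made-up original row ids
train_ids = [7, 2, 9, 4]
train_rows = [(40, 1.0), (80, 2.0), (150, 6.0), (60, 1.5)]
train_labels = [0, 0, 1, 0]

# Fit the scaling on the training rows only, then apply it to the new record
means, stds = fit_zscore(train_rows)
train_z = [to_zscore(r, means, stds) for r in train_rows]
query_z = to_zscore((84, 2.0), means, stds)

pos = nearest_position(train_z, query_z)        # position within the training rows
neighbor_id = train_ids[pos]                    # map back to the original row id
neighbor_label = train_labels[pos]

prob_borrower = float(neighbor_label)           # with k = 1 the estimate is 0 or 1
predicted = int(prob_borrower >= 0.5)           # default cutoff of 0.5
print(neighbor_id, neighbor_label, predicted)
```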

**b.** What is a choice of *k* that balances between overfitting and ignoring the predictor information?

First, we need to remember that a balanced choice depends greatly on the nature of the data. The more complex and irregular the structure of the data, the lower the optimum value of *k*. Typically, values of *k* fall in the range of 1-20, and odd values are often preferred because they avoid ties in a two-class vote.

If we choose *k* = 1, we classify in a way that is very sensitive to the local characteristics of the training data. On the other hand, if we choose a very large value of *k* (in the extreme, *k* equal to the number of training records), we simply predict the most frequent class in the dataset in all cases.

To find a balance, we examine the accuracy (of predictions in the validation set) that results from different choices of *k* between 1 and 14.

In [10]:
train_X = train_norm[["z"+col for col in predictors]]
train_y = train_norm[outcome]
valid_X = valid_norm[["z"+col for col in predictors]]
valid_y = valid_norm[outcome]

# Train a classifier for different values of k
results = []
for k in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
    results.append({"k": k,
                    "accuracy": accuracy_score(valid_y, knn.predict(valid_X))})

# Convert results to a pandas data frame
results = pd.DataFrame(results)
results

|   | k | accuracy |
|---|---|---|
| 0 | 1 | 0.955 |
| 1 | 2 | 0.946 |
| 2 | 3 | 0.9555 |
| 3 | 4 | 0.9445 |
| 4 | 5 | 0.9525 |
| 5 | 6 | 0.9445 |
| 6 | 7 | 0.9495 |
| 7 | 8 | 0.9425 |
| 8 | 9 | 0.946 |
| 9 | 10 | 0.943 |


Based on the above table, we would choose **k = 3**, which gives the highest accuracy on the validation set (0.9555). *k* = 1 is nearly as accurate (0.955) but is the most sensitive to noise in the training data, and **k = 5** (0.9525) is another reasonable option. Note, however, that the validation set is now used as part of the training process (to set *k*) and no longer serves as a true holdout set. Ideally, we would use a third, test set to evaluate the performance of the method on data that it did not see.

**c.** Show the confusion matrix for the validation data that results from using the best *k*.
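A confusion matrix cross-tabulates actual against predicted classes. The sketch below shows the computation on hypothetical labels (not results from `UniversalBank.csv`); the function is an illustrative stand-in written for this example. In the notebook, `sklearn.metrics.confusion_matrix(valid_y, knn.predict(valid_X))`, with `knn` refit for the chosen *k*, should produce the same layout: rows are actual classes and columns are predicted classes.

```python
def confusion_matrix(actual, predicted):
    """2x2 counts: rows are the actual class (0, 1), columns the predicted class (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Hypothetical labels, for illustration only
actual = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm[0][0] + cm[1][1]) / len(actual)

print("          pred 0  pred 1")
print(f"actual 0  {cm[0][0]:6d}  {cm[0][1]:6d}")
print(f"actual 1  {cm[1][0]:6d}  {cm[1][1]:6d}")
print("accuracy:", accuracy)
```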


**d.** Consider the following customer:

    Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0,
    Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0,
    Online = 1 and Credit Card = 1.

Classify the customer using the best *k*.
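With *k* > 1, the class is decided by a vote among the *k* nearest training records: the estimated probability of class 1 is the fraction of those neighbors that belong to class 1, compared against the cutoff. The sketch below illustrates this with *k* = 3 on hypothetical standardized records; `knn_classify` is a helper written for this example, and treating a probability exactly equal to the cutoff as class 1 is an assumption of the sketch.

```python
import math


def knn_classify(train_X, train_y, query, k, cutoff=0.5):
    """Return (predicted class, estimated P(class 1)) from the k nearest records."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    prob_one = sum(train_y[i] for i in order[:k]) / k
    return int(prob_one >= cutoff), prob_one


# Hypothetical standardized records, for illustration only
train_X = [(-1.0, -0.8), (-0.2, -0.1), (0.1, 0.0), (1.5, 1.7), (0.3, 0.4)]
train_y = [0, 0, 0, 1, 1]

pred, prob = knn_classify(train_X, train_y, query=(0.2, 0.2), k=3)
print(pred, prob)
```

In the notebook, the equivalent step would be to fit `KNeighborsClassifier(n_neighbors=3)` on `train_X` and `train_y` and call `predict` on `new_customer_norm`.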

**e.** Repartition the data, this time into training, validation, and test sets (50%:30%:20%). Apply the *k*-NN method with the *k* chosen above. Compare the confusion matrix of the test set with that of the training and validation sets. Comment on the differences and their reason.
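One way to obtain a 50%:30%:20% partition is to shuffle the row positions once and cut the shuffled list in two places. The sketch below does this in plain Python; the function name and the seed are arbitrary choices made for this example. With scikit-learn, two successive calls to `train_test_split` are a common alternative.

```python
import random


def three_way_split(n_rows, fractions=(0.5, 0.3, 0.2), seed=1):
    """Shuffle row positions and cut them into training/validation/test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_train = int(round(fractions[0] * n_rows))
    n_valid = int(round(fractions[1] * n_rows))
    train = positions[:n_train]
    valid = positions[n_train:n_train + n_valid]
    test = positions[n_train + n_valid:]    # the remainder forms the test set
    return train, valid, test


# 5000 rows, matching the size of the dataset described in the problem
train_idx, valid_idx, test_idx = three_way_split(5000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The resulting position lists can be used with `DataFrame.iloc` to select the three partitions; the scaler should then be refit on the new training partition only.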