# Problems

In [1]:
import pandas as pd

from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import euclidean_distances

**1. Calculating Distance with Categorical Predictors.**

This exercise with a tiny dataset illustrates the calculation of Euclidean distance and the creation of binary dummies. The online education company Statistics.com segments its customers and prospects into three main categories: IT professionals (IT), statisticians (Stat), and other (Other). It also tracks, for each customer, the number of years since first contact (years). Consider the following customers; information about whether they have taken a course or not (the outcome to be predicted) is included:

    Customer 1: Stat, 1 year, did not take course
    Customer 2: Other, 1.1 year, took course

**a.** Consider now the following new prospect:

    Prospect 1: IT, 1 year

Using the above information on the two customers and one prospect, create one dataset for all three with the categorical predictor variable transformed into 2 binaries, and a similar dataset with the categorical predictor variable transformed into 3 binaries.

In [2]:
# dataset for all three customers with the categorical predictor (category)
# transformed into 2 binaries
tiny_two_cat_dummies_df = pd.DataFrame({"IT": [0, 0, 1], "Stat": [1, 0, 0],
                                        "years_since_first_contact": [1, 1.1, 1],
                                        "course": [0, 1, None]})
tiny_two_cat_dummies_df

Unnamed: 0,IT,Stat,years_since_first_contact,course
0,0,1,1.0,0.0
1,0,0,1.1,1.0
2,1,0,1.0,


In [3]:
# dataset for all three customers with the categorical predictor (category)
# transformed into 3 binaries
tiny_all_cat_dummies_df = pd.DataFrame({"IT": [0, 0, 1], "Stat": [1, 0, 0], 
                                        "Other": [0, 1, 0], "years_since_first_contact": [1, 1.1, 1],
                                        "course": [0, 1, None]})
tiny_all_cat_dummies_df

Unnamed: 0,IT,Stat,Other,years_since_first_contact,course
0,0,1,0,1.0,0.0
1,0,0,1,1.1,1.0
2,1,0,0,1.0,
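
The two tables above were typed in by hand. As a sketch of how the same encodings could be derived from the raw records, the snippet below uses `pd.get_dummies`. The `raw_df` frame, its `category` column, and the names `three_dummies_df` and `two_dummies_df` are assumptions made for illustration and do not appear in the original exercise.

```python
import pandas as pd

# Hypothetical raw records: the two customers followed by the prospect.
raw_df = pd.DataFrame({
    "category": ["Stat", "Other", "IT"],
    "years_since_first_contact": [1, 1.1, 1],
    "course": [0, 1, None],
})

# Three binaries: one indicator column per category level,
# reordered to match the column order used in this notebook.
dummies = pd.get_dummies(raw_df["category"], dtype=int)[["IT", "Stat", "Other"]]
three_dummies_df = pd.concat(
    [dummies, raw_df[["years_since_first_contact", "course"]]], axis=1
)

# Two binaries: drop "Other", which becomes the all-zeros reference level.
two_dummies_df = three_dummies_df.drop(columns="Other")

print(three_dummies_df)
print(two_dummies_df)
```

Dropping `Other` explicitly, rather than relying on `drop_first=True`, keeps the same reference category as the hand-built table.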


**b.** For each derived dataset, calculate the Euclidean distance between the prospect and each of the other two customers. (Note: While it is typical to normalize data for k-NN, this is not an iron-clad rule and you may proceed here without normalization.)

- Two categorical dummies (IT/Stat):

In [4]:
predictors = ["IT", "Stat", "years_since_first_contact"]
pd.DataFrame(euclidean_distances(tiny_two_cat_dummies_df[predictors],
                                 tiny_two_cat_dummies_df[predictors]),
             columns=["customer_1", "customer_2", "customer_3"],
             index=["customer_1", "customer_2", "customer_3"])

Unnamed: 0,customer_1,customer_2,customer_3
customer_1,0.0,1.004988,1.414214
customer_2,1.004988,0.0,1.004988
customer_3,1.414214,1.004988,0.0


- Three categorical dummies (IT/Stat/Other):

In [5]:
predictors = ["IT", "Stat", "Other", "years_since_first_contact"]

pd.DataFrame(euclidean_distances(tiny_all_cat_dummies_df[predictors],
                                 tiny_all_cat_dummies_df[predictors]),
             columns=["customer_1", "customer_2", "customer_3"],
             index=["customer_1", "customer_2", "customer_3"])

Unnamed: 0,customer_1,customer_2,customer_3
customer_1,0.0,1.417745,1.414214
customer_2,1.417745,0.0,1.417745
customer_3,1.414214,1.417745,0.0


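We can already see the effect of using two versus three dummy variables. In the two-dummy dataset, `customer_3` (the prospect) is nearer to `customer_2` (1.004988) than to `customer_1` (1.414214). This happens because `Other` is the dropped reference category: `customer_2` is encoded as all zeros, so it differs from the prospect in only one dummy (`IT`) plus 0.1 years, whereas `customer_1` differs in two dummies (`IT` and `Stat`). In the three-dummy dataset the ordering flips: `customer_3` is now slightly nearer to `customer_1` (1.414214) than to `customer_2` (1.417745), although all three distances are very close. The added `Other` column puts every pair of categories two dummies apart, so the categorical part no longer favors any customer and the small difference in `years_since_first_contact` decides the ordering.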
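
To make the comparison concrete, the distances from the prospect (`customer_3` in the matrices above) can be reproduced by hand with NumPy. This is a minimal sketch; the helper `euclidean` and the dictionaries `two` and `three` are illustrative names, not part of the original notebook.

```python
import numpy as np

def euclidean(a, b):
    """Square root of the summed squared differences between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Columns: (IT, Stat, years_since_first_contact)
two = {
    "customer_1": [0, 1, 1.0],
    "customer_2": [0, 0, 1.1],
    "prospect": [1, 0, 1.0],
}
# Columns: (IT, Stat, Other, years_since_first_contact)
three = {
    "customer_1": [0, 1, 0, 1.0],
    "customer_2": [0, 0, 1, 1.1],
    "prospect": [1, 0, 0, 1.0],
}

for label, rows in (("two dummies", two), ("three dummies", three)):
    d1 = euclidean(rows["prospect"], rows["customer_1"])
    d2 = euclidean(rows["prospect"], rows["customer_2"])
    print(f"{label}: to customer_1 = {d1:.6f}, to customer_2 = {d2:.6f}")
```

The only term that changes between the two encodings is the `Other` column, which adds 1 to the squared distance between the prospect and `customer_2` (from 1.01 to 2.01) and leaves the distance to `customer_1` unchanged.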

In contrast to the situation with statistical models such as regression, all *m* binaries should be created and
used with *k*-NN. While mathematically this is redundant, since *m* - 1 dummies contain the same information as *m* dummies, this redundant information does not create the multicollinearity problems that it does for linear models. Moreover, in *k*-NN the use of *m* - 1 dummies can yield different classifications than the use of *m* dummies, and lead to an imbalance in the contribution of the different categories to the model.
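
The imbalance can be seen by looking only at the dummy columns. In the sketch below (all variable names are illustrative), each category is encoded both ways and the pairwise distances between categories are compared.

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

categories = ["IT", "Stat", "Other"]

# m - 1 = 2 dummies: "Other" is the all-zeros reference level.
two_dummies = pd.DataFrame({"IT": [1, 0, 0], "Stat": [0, 1, 0]},
                           index=categories)

# m = 3 dummies: one indicator column per level.
three_dummies = pd.DataFrame({"IT": [1, 0, 0], "Stat": [0, 1, 0],
                              "Other": [0, 0, 1]},
                             index=categories)

# Pairwise distances between the encoded categories.
dist_two = pd.DataFrame(euclidean_distances(two_dummies),
                        index=categories, columns=categories)
dist_three = pd.DataFrame(euclidean_distances(three_dummies),
                          index=categories, columns=categories)

print(dist_two.round(3))
print(dist_three.round(3))
```

With two dummies, `Other` is at distance 1 from both `IT` and `Stat`, while `IT` and `Stat` are at distance √2 from each other, so the reference category is artificially close to everything. With three dummies, every pair of distinct categories is at distance √2.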

**c.** Using k-NN with k = 1, classify the prospect as taking or not taking a course using each of the two derived datasets. Does it make a difference whether you use two or three dummies?

- Two dummy variables (IT/Stat)

In [6]:
predictors = ["IT", "Stat", "years_since_first_contact"]

# use NearestNeighbors from scikit-learn to compute knn
knn = NearestNeighbors(n_neighbors=1)
knn.fit(tiny_two_cat_dummies_df.loc[:1, predictors])

new_customer = pd.DataFrame({"IT": [1], "Stat": [0],
                             "years_since_first_contact": [1]})

distances, indices = knn.kneighbors(new_customer)

# indices is a 2D array (one row per query point); we only need the first row
tiny_two_cat_dummies_df.iloc[indices[0], :]

Unnamed: 0,IT,Stat,years_since_first_contact,course
1,0,0,1.1,1.0


- Three dummy variables (IT/Stat/Other)

In [7]:
predictors = ["IT", "Stat", "Other", "years_since_first_contact"]

# use NearestNeighbors from scikit-learn to compute knn
knn = NearestNeighbors(n_neighbors=1)
knn.fit(tiny_all_cat_dummies_df.loc[:1, predictors])

# the prospect is an IT professional, so only the IT dummy is 1
new_customer = pd.DataFrame({"IT": [1], "Stat": [0], "Other": [0],
                             "years_since_first_contact": [1]})

distances, indices = knn.kneighbors(new_customer)

# indices is a 2D array (one row per query point); we only need the first row
tiny_all_cat_dummies_df.iloc[indices[0], :]

Unnamed: 0,IT,Stat,Other,years_since_first_contact,course
0,0,1,0,1.0,0.0


With *k* = 1 and two dummies, the nearest neighbor is `customer_2`, who took the course, so the prospect is classified as taking a course. With three dummies, the nearest neighbor is `customer_1`, who did not take the course, so the prospect is classified as not taking a course. For this example it therefore does make a difference whether two or three dummies are used. This is consistent with the discussion in item **b**: the redundant *m*-th dummy does not create the multicollinearity problems that it does for linear models, but in *k*-NN the use of *m* - 1 dummies can yield different classifications than the use of *m* dummies and lead to an imbalance in the contribution of the different categories to the model.
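
As a cross-check, the classification itself can be obtained with scikit-learn's `KNeighborsClassifier`, which returns the predicted label directly instead of the neighbor's row. This is a sketch rather than part of the original solution; the helper `classify_prospect` and the frames `customers` and `prospect` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def classify_prospect(train_df, prospect_df, predictors):
    """Fit 1-NN on the labelled customers and predict the prospect's label."""
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(train_df[predictors], train_df["course"])
    return int(model.predict(prospect_df[predictors])[0])

# The two labelled customers (course: 0 = did not take, 1 = took).
customers = pd.DataFrame({"IT": [0, 0], "Stat": [1, 0], "Other": [0, 1],
                          "years_since_first_contact": [1.0, 1.1],
                          "course": [0, 1]})

# The prospect is an IT professional with 1 year since first contact.
prospect = pd.DataFrame({"IT": [1], "Stat": [0], "Other": [0],
                         "years_since_first_contact": [1.0]})

pred_two = classify_prospect(
    customers, prospect, ["IT", "Stat", "years_since_first_contact"])
pred_three = classify_prospect(
    customers, prospect, ["IT", "Stat", "Other", "years_since_first_contact"])

print("two dummies:", pred_two)
print("three dummies:", pred_three)
```

With the prospect encoded as `IT` = 1 and every other dummy 0, the two-dummy model predicts 1 (took course) and the three-dummy model predicts 0 (did not take course), matching the nearest distances in the matrices from item **b**.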