# Data Description
- The dataset has 85 predictors measuring demographic characteristics for 5,822 individuals. The response variable `Purchase` indicates whether an individual bought caravan insurance; about 6% did.
- There are 5822 rows.
- There are 86 variables:
    - Sociodemographic data (variables 1-43)
      - Sociodemographic data is based on zip codes, and all individuals in the same area share these attributes.
    - Product ownership (variables 44-86)
    - Variable 86 (`Purchase`) shows if the customer bought caravan insurance.

# Load Packages and Data

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import \
     (LinearDiscriminantAnalysis as LDA,
      QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
Caravan = load_data('Caravan')
Caravan.head()

,MOSTYPE,MAANTHUI,MGEMOMV,MGEMLEEF,MOSHOOFD,MGODRK,MGODPR,MGODOV,MGODGE,MRELGE,...,APERSONG,AGEZONG,AWAOREG,ABRAND,AZEILPL,APLEZIER,AFIETS,AINBOED,ABYSTAND,Purchase
0,33,1,3,2,8,0,5,1,3,7,...,0,0,0,1,0,0,0,0,0,No
1,37,1,2,2,8,1,4,1,4,6,...,0,0,0,1,0,0,0,0,0,No
2,37,1,2,2,8,0,4,2,4,3,...,0,0,0,1,0,0,0,0,0,No
3,9,1,3,3,3,2,3,2,4,5,...,0,0,0,1,0,0,0,0,0,No
4,40,1,4,2,10,1,4,1,4,7,...,0,0,0,1,0,0,0,0,0,No


In [3]:
Caravan.describe().round(1)

,MOSTYPE,MAANTHUI,MGEMOMV,MGEMLEEF,MOSHOOFD,MGODRK,MGODPR,MGODOV,MGODGE,MRELGE,...,ALEVEN,APERSONG,AGEZONG,AWAOREG,ABRAND,AZEILPL,APLEZIER,AFIETS,AINBOED,ABYSTAND
count,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,...,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0
mean,24.3,1.1,2.7,3.0,5.8,0.7,4.6,1.1,3.3,6.2,...,0.1,0.0,0.0,0.0,0.6,0.0,0.0,0.0,0.0,0.0
std,12.8,0.4,0.8,0.8,2.9,1.0,1.7,1.0,1.6,1.9,...,0.4,0.1,0.1,0.1,0.6,0.0,0.1,0.2,0.1,0.1
min,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,10.0,1.0,2.0,2.0,3.0,0.0,4.0,0.0,2.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,30.0,1.0,3.0,3.0,7.0,0.0,5.0,1.0,3.0,6.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,35.0,1.0,3.0,3.0,8.0,1.0,6.0,2.0,4.0,7.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,41.0,10.0,5.0,6.0,10.0,9.0,9.0,5.0,9.0,9.0,...,8.0,1.0,1.0,2.0,7.0,1.0,2.0,3.0,2.0,2.0


In [4]:
Purchase = Caravan.Purchase
Purchase.value_counts(normalize=True)

Purchase
No     0.940227
Yes    0.059773
Name: proportion, dtype: float64

# K-Nearest Neighbors

- Our features will include all columns except `Purchase`.

In [5]:
feature_df = Caravan.drop(columns=['Purchase'])

- KNN predicts from the nearest observations, so its performance depends on the scale of the variables: variables measured on a large scale dominate the distance calculation. For example, on raw values a salary difference of 1,000 USD outweighs an age difference of 50 years, even though the age difference is arguably the more meaningful one. To fix this, standardize the data so that every variable has a mean of zero and a standard deviation of one, using `StandardScaler()`.
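The salary and age example can be checked with a few lines of arithmetic. This is only a sketch: the three customers and the standard deviations used for rescaling (20,000 USD and 15 years) are hypothetical values chosen for illustration, not taken from the Caravan data.

```python
import math

# Hypothetical customers as (salary in USD, age in years).
a = (50_000.0, 25.0)
b = (51_000.0, 25.0)   # differs from a by 1,000 USD of salary
c = (50_000.0, 75.0)   # differs from a by 50 years of age

# Raw Euclidean distances: the salary gap dominates.
d_ab_raw = math.dist(a, b)
d_ac_raw = math.dist(a, c)

# Assumed (hypothetical) standard deviations for salary and age.
sd = (20_000.0, 15.0)

def rescale(point):
    """Divide each coordinate by its assumed standard deviation."""
    return tuple(value / s for value, s in zip(point, sd))

# Distances after rescaling: the age gap now dominates.
d_ab_std = math.dist(rescale(a), rescale(b))
d_ac_std = math.dist(rescale(a), rescale(c))

print(f"raw:      d(a,b)={d_ab_raw:.1f}, d(a,c)={d_ac_raw:.1f}")
print(f"rescaled: d(a,b)={d_ab_std:.2f}, d(a,c)={d_ac_std:.2f}")
```

On the raw scale `c` looks closer to `a` than `b` does; after rescaling the ordering flips. Centering is omitted here because subtracting a common mean does not change distances between points.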

In [6]:
scaler = StandardScaler(with_mean=True, with_std=True, copy=True)

- `with_mean` controls whether the mean is subtracted, `with_std` controls whether columns are scaled to a standard deviation of 1, and `copy=True` ensures the data is copied before any computation.
- The transformation is first fit to compute its parameters and then applied. In the next cell, the first line stores the parameters in `scaler`, and the second line constructs the standardized features.
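As a rough picture of the fit-then-transform pattern, the two steps can be imitated by hand with NumPy. This is a sketch on a small made-up matrix, not the scaler's actual implementation; the names `mean` and `scale` loosely mirror the `mean_` and `scale_` attributes that a fitted `StandardScaler` exposes.

```python
import numpy as np

# Toy feature matrix (made-up values): 3 observations, 2 columns on different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# "fit": compute and store the per-column parameters.
mean = X.mean(axis=0)
scale = X.std(axis=0)      # NumPy default ddof=0, i.e. the 1/n convention

# "transform": apply the stored parameters to the data.
X_scaled = (X - mean) / scale

print(X_scaled.mean(axis=0))   # approximately zero in each column
print(X_scaled.std(axis=0))    # one in each column
```

Separating the two steps matters because the parameters computed on one data set can be reused to transform another.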

In [7]:
scaler.fit(feature_df)
X_std = scaler.transform(feature_df)

In [8]:
X_std[:1].round(1)

array([[ 0.7, -0.3,  0.4, -1.2,  0.8, -0.7,  0.2, -0.1, -0.2,  0.4, -0.9,
        -0.2, -0.5, -0.8,  0.8, -0.3, -0.8,  1.1, -0.5, -0.5,  0.5, -0.5,
         1.6, -0.2, -0.4, -0.5, -0.1,  1.2, -0.1, -1. ,  1. ,  1.3, -1.1,
        -0.6,  0.9, -0.9, -1.2,  0.2,  1.2, -0.7, -0.4,  0.2, -0.6, -0.8,
        -0.1, -0.1,  1. , -0.1, -0.2, -0. , -0.1, -0.2, -0.1, -0.3, -0.2,
        -0.1, -0.1, -0.1,  1.7, -0. , -0.1, -0.2, -0.1, -0.1, -0.8, -0.1,
        -0.1,  0.7, -0.1, -0.2, -0. , -0.1, -0.1, -0. , -0.3, -0.2, -0.1,
        -0.1, -0.1,  0.8, -0. , -0.1, -0.2, -0.1, -0.1]])

- Now each column of `feature_std` has a mean of zero and a standard deviation of one.

In [9]:
feature_std = pd.DataFrame(X_std, columns=feature_df.columns)
feature_std.std()

MOSTYPE     1.000086
MAANTHUI    1.000086
MGEMOMV     1.000086
MGEMLEEF    1.000086
MOSHOOFD    1.000086
              ...   
AZEILPL     1.000086
APLEZIER    1.000086
AFIETS      1.000086
AINBOED     1.000086
ABYSTAND    1.000086
Length: 85, dtype: float64
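The value 1.000086 is what one would expect if the data were scaled with a $1/n$ standard deviation and then measured with a $1/(n-1)$ standard deviation, since the ratio of the two is $\sqrt{n/(n-1)}$. A quick check with $n = 5822$, using only the standard library (the short list `x` is made up for illustration):

```python
import math
import statistics

n = 5822                          # number of rows in Caravan
factor = math.sqrt(n / (n - 1))   # ratio of the 1/(n-1) std to the 1/n std
print(round(factor, 6))

# The same ratio on a small made-up sample:
x = [1.0, 2.0, 4.0, 7.0]
ratio = statistics.stdev(x) / statistics.pstdev(x)   # sample std / population std
print(math.isclose(ratio, math.sqrt(len(x) / (len(x) - 1))))
```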

- The standard deviations are not exactly 1 because the two tools use different conventions: `StandardScaler` uses $1/n$ when computing the standard deviation, whereas the pandas `std()` method uses $1/(n-1)$. This is fine as long as all variables are on the same scale.
- We use `train_test_split()` to divide the data into a test set of 1000 observations and a training set with the rest. `random_state=0` ensures consistent splits.

In [10]:
(X_train, X_test,  y_train, y_test) = train_test_split(
    feature_std, Purchase, test_size=1000, random_state=0)

- `?train_test_split` shows that the non-keyword arguments can be `lists`, `arrays`, or `pandas dataframes` of the same length (i.e., indexable). Here, they are `feature_std` and `Purchase`.
- Note: converting `feature_std` to an `ndarray` is a workaround for a `sklearn` bug; the call above passes the `DataFrame` directly.
- We fit a KNN model with $K=1$ on the training data and evaluate it on the test data.
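To make the $K=1$ rule concrete, here is a minimal NumPy sketch of nearest-neighbor prediction on made-up points. It illustrates the idea only; `KNeighborsClassifier` uses its own search structures, and the arrays below are hypothetical.

```python
import numpy as np

# Made-up training points and labels.
X_toy_train = np.array([[0.0, 0.0],
                        [1.0, 1.0],
                        [3.0, 3.0]])
y_toy_train = np.array(['No', 'No', 'Yes'])

# Made-up points to classify.
X_toy_new = np.array([[2.6, 2.9],
                      [0.2, 0.1]])

# Squared Euclidean distance from every new point to every training point,
# computed by broadcasting; the result has shape (2, 3).
diff = X_toy_new[:, None, :] - X_toy_train[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# With K=1, each new point takes the label of its single closest training point.
nearest = sq_dist.argmin(axis=1)
toy_pred = y_toy_train[nearest]
print(toy_pred)
```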

In [11]:
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1_pred = knn1.fit(X_train, y_train).predict(X_test)
np.mean(y_test != knn1_pred), np.mean(y_test == "Yes")

(0.111, 0.067)

- The KNN error rate on the 1,000 test observations is about 11%. However, since only about 6% of customers purchased insurance, always predicting `No` would give an error rate of just over 6% (6.7% on this test set), known as the *null rate*. On overall error, KNN with $K=1$ therefore does worse than this trivial baseline.
- If trying to sell insurance to a customer incurs a cost, a success rate of 6% from approaching customers at random may be too low. Instead, the company would like to target customers who are likely to buy. What matters is then not the overall error rate but the fraction of predicted buyers who actually purchase insurance.

In [12]:
confusion_table(knn1_pred, y_test)

Predicted,Truth: No,Truth: Yes
No,880,58
Yes,53,9


In [13]:
(880+9)/1000

0.889

- KNN with $K=1$ does better than random guessing among the customers it predicts will buy insurance. Of the 62 predicted buyers, 9 (14.5%) actually purchase insurance, more than double the rate from random guessing.
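The 14.5% figure comes from the `Yes` row of the confusion table. A short check, using the counts shown above and the test-set purchase rate of 0.067 as the random-guessing baseline (the variable names are illustrative):

```python
# Counts from the 'Yes' (predicted) row of the K=1 confusion table.
true_buyers_found = 9     # predicted Yes, truly Yes
false_alarms = 53         # predicted Yes, truly No

predicted_buyers = true_buyers_found + false_alarms
hit_rate = true_buyers_found / predicted_buyers

# Fraction of buyers in the test set, i.e. the success rate of random targeting.
base_rate = 0.067
improvement = hit_rate / base_rate

print(f"{predicted_buyers} predicted buyers, hit rate {hit_rate:.1%}, "
      f"{improvement:.1f}x the base rate")
```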

## Tuning Parameters
- The number of neighbors in KNN is a *tuning parameter*, also called a *hyperparameter*. Its optimal value is not known in advance, so we vary it and evaluate performance on the test data.
- Using a `for` loop, we look at the accuracy of the classifier among customers predicted to buy insurance, for 1 to 5 neighbors:

In [14]:
for K in range(1,6):
    knn = KNeighborsClassifier(n_neighbors=K)
    knn_pred = knn.fit(X_train, y_train).predict(X_test)
    C = confusion_table(knn_pred, y_test)
    templ = ('K={0:d}: # predicted to purchase: {1:>2},' +
            '  # who did purchase {2:d}, accuracy {3:.1%}')
    pred = C.loc['Yes'].sum()           # all customers predicted 'Yes'
    did_purchase = C.loc['Yes','Yes']   # predicted 'Yes' and truly 'Yes'
    print(templ.format(
          K,
          pred,
          did_purchase,
          did_purchase / pred))

K=1: # predicted to purchase: 62,  # who did purchase 9, accuracy 14.5%
K=2: # predicted to purchase:  6,  # who did purchase 1, accuracy 16.7%
K=3: # predicted to purchase: 20,  # who did purchase 3, accuracy 15.0%
K=4: # predicted to purchase:  4,  # who did purchase 0, accuracy 0.0%
K=5: # predicted to purchase:  7,  # who did purchase 1, accuracy 14.3%


We see some variability: the numbers for `K=4` are very different from the rest.

# Comparison to Logistic Regression
- We can also fit a logistic regression model with `sklearn`, which by default fits a ridge-penalized version of logistic regression. `C` is the inverse of the regularization strength, so setting it to a very large number effectively reproduces the usual (unpenalized) logistic regression.
- Unlike `statsmodels`, `sklearn` focuses on classification rather than inference, so it lacks `summary` methods for detailed inference.
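The effect of `C` can be seen on a small synthetic problem. This is a sketch under stated assumptions: the data are randomly generated (not Caravan), and the two values of `C` are chosen only to contrast a strong penalty with an almost absent one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, overlapping two-class data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(500, 3))
true_logit = 1.5 * X_sim[:, 0] - 1.0 * X_sim[:, 1]
y_sim = (rng.random(500) < 1 / (1 + np.exp(-true_logit))).astype(int)

# Small C = strong ridge penalty; huge C = essentially no penalty.
penalized = LogisticRegression(C=0.01, solver='liblinear').fit(X_sim, y_sim)
unpenalized = LogisticRegression(C=1e10, solver='liblinear').fit(X_sim, y_sim)

penalized_norm = np.linalg.norm(penalized.coef_)
unpenalized_norm = np.linalg.norm(unpenalized.coef_)
print(f"coefficient norm with C=0.01: {penalized_norm:.2f}")
print(f"coefficient norm with C=1e10: {unpenalized_norm:.2f}")
```

The penalized coefficients are shrunk toward zero, while the large-`C` fit should be close to the ordinary maximum likelihood estimate.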

In [15]:
logit = LogisticRegression(C=1e10, solver='liblinear')
logit.fit(X_train, y_train)
logit_pred = logit.predict_proba(X_test)
logit_labels = np.where(logit_pred[:,1] > .5, 'Yes', 'No')
confusion_table(logit_labels, y_test)

Predicted,Truth: No,Truth: Yes
No,931,67
Yes,2,0


- We used `solver='liblinear'` to avoid convergence warnings with the default solver.
- With a predicted-probability cut-off of 0.5, only two test observations are predicted to purchase, and both predictions are wrong. Lowering the cut-off to 0.25 gives 29 predicted purchases, of which 9 (about 31%) are correct, nearly five times better than random guessing.

In [16]:
logit_labels = np.where(logit_pred[:,1]>0.25, 'Yes', 'No')
confusion_table(logit_labels, y_test)

Predicted,Truth: No,Truth: Yes
No,913,58
Yes,20,9


In [17]:
9/(20+9)

0.3103448275862069
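The two cut-offs can be compared side by side from the confusion tables. The helper below is a hypothetical convenience function, not part of `ISLP` or `sklearn`. Besides the fraction of predicted buyers who purchase (precision), it also reports the fraction of actual buyers that were found (recall), which the text above does not discuss but which follows from the same counts.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-table counts.

    tp: predicted Yes, truly Yes; fp: predicted Yes, truly No;
    fn: predicted No, truly Yes. Empty denominators give 0.0.
    """
    predicted_pos = tp + fp
    actual_pos = tp + fn
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

# (tp, fp, fn) read from the logistic regression confusion tables above.
counts_by_cutoff = {0.5: (0, 2, 67), 0.25: (9, 20, 58)}

results = {cutoff: precision_recall(*counts)
           for cutoff, counts in counts_by_cutoff.items()}

for cutoff, (precision, recall) in results.items():
    print(f"cut-off {cutoff}: precision {precision:.1%}, recall {recall:.1%}")
```

Lowering the cut-off raises precision from 0% to about 31%, but most of the 67 actual buyers in the test set are still missed.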