# 4.7.6 K-Nearest Neighbors

We will now perform KNN using the `knn()` function, which is part of the `class` library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then use the model to make predictions, `knn()` forms predictions using a single command. The function requires four inputs.
1. A matrix containing the predictors associated with the training data, labeled `train.X` below.
2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled `test.X` below.
3. A vector containing the class labels for the training observations, labeled `train.Direction` below.
4. A value for $K$, the number of nearest neighbors to be used by the classifier.

We use the `cbind()` function, short for _column bind_, to bind the `Lag1` and `Lag2` variables together into two matrices, one for the training set and the other for the test set.

In [1]:
library(ISLR2)
library(class)
attach(Smarket)
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train,]
Direction.2005 <- Direction[!train]

In [2]:
train.X <- cbind(Lag1, Lag2)[train,]
test.X <- cbind(Lag1, Lag2)[!train,]
train.Direction <- Direction[train]
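
As a minimal sketch of what `cbind()` and the logical row index are doing here, the toy vectors below are bound into a two-column matrix and then split by row. The names `lag1`, `lag2`, `is_train`, `toy_train_X`, and `toy_test_X` and all of the values are made up for illustration; they are not the `Smarket` columns.

```r
# Toy stand-ins for two predictors and a training indicator (hypothetical values)
lag1 <- c(0.4, -0.2, 1.1, -0.7)
lag2 <- c(-0.1, 0.4, -0.2, 1.1)
is_train <- c(TRUE, TRUE, TRUE, FALSE)

# cbind() places the vectors side by side as the columns of a matrix
X <- cbind(lag1, lag2)

# A logical row index keeps the rows where the index is TRUE;
# drop = FALSE keeps a one-row result as a matrix rather than a vector
toy_train_X <- X[is_train, , drop = FALSE]
toy_test_X <- X[!is_train, , drop = FALSE]

dim(toy_train_X)
dim(toy_test_X)
```

The `drop = FALSE` argument is a defensive habit rather than something the lab code needs: with many test rows, the subset is a matrix either way.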

Now the `knn()` function can be used to predict the market's movement for the dates in 2005. We set a random seed before we apply `knn()` because if several observations are tied as nearest neighbors, then `R` will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.
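
The role of the seed can be illustrated with a small sketch on made-up data. Two training points sit at exactly the same location but carry different labels, so the nearest neighbor of the test point is tied. All object names and values below are hypothetical. Which label wins is not asserted here; the point is only that the two calls should agree when the same seed precedes each of them.

```r
library(class)

# Hypothetical toy data: two training points share the same location
# but carry different labels, so the single nearest neighbor is tied
toy_train <- rbind(c(0, 0), c(0, 0), c(3, 3))
toy_labels <- factor(c("Down", "Up", "Up"))
toy_test <- matrix(c(0, 0), nrow = 1)

# Same seed before each call, so any random tie-breaking is repeated
set.seed(1)
pred_a <- knn(toy_train, toy_test, toy_labels, k = 1)
set.seed(1)
pred_b <- knn(toy_train, toy_test, toy_labels, k = 1)

# Compare the two predictions
identical(pred_a, pred_b)
```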

In [3]:
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2005)

        Direction.2005
knn.pred Down Up
    Down   43 58
    Up     68 83

In [4]:
(83 + 43) / 252
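
The same fraction can be computed from the table itself, since the correct predictions lie on its diagonal. The sketch below re-enters the counts from the $K=1$ table by hand; the object names `conf_k1` and `accuracy_k1` are illustrative choices.

```r
# Counts copied from the K = 1 table above (rows: predicted, columns: actual)
conf_k1 <- matrix(c(43, 68, 58, 83), nrow = 2,
                  dimnames = list(pred = c("Down", "Up"),
                                  actual = c("Down", "Up")))

# Correct predictions sit on the diagonal of the table
accuracy_k1 <- sum(diag(conf_k1)) / sum(conf_k1)
accuracy_k1
```

With the actual objects in the session, `mean(knn.pred == Direction.2005)` gives the same quantity without retyping any counts.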

The results using $K=1$ are not very good, since only $50\%$ of the observations are correctly predicted. Of course, it may be that $K=1$ results in an overly flexible fit to the data. Below, we repeat the analysis using $K=3$.

In [5]:
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)

        Direction.2005
knn.pred Down Up
    Down   48 54
    Up     63 87

In [6]:
mean(knn.pred == Direction.2005)

The results have improved slightly. However, increasing $K$ further turns out to provide no additional improvement. It appears that for these data, QDA provides the best results of the methods that we have examined so far.

KNN does not perform well on the `Smarket` data, but it does often provide impressive results. As an example, we will apply the KNN approach to the `Caravan` data set, which is part of the `ISLR2` library. This data set includes $85$ predictors that measure demographic characteristics for $5,822$ individuals. The response variable is `Purchase`, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only $6\%$ of people purchased caravan insurance.

In [7]:
dim(Caravan)

In [8]:
attach(Caravan)
summary(Purchase)
348 / 5822

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Variables that are on a large scale will have a much larger effect on the _distance_ between the observations, and hence on the KNN classifier, than variables that are on a small scale. For instance, imagine a data set that contains two variables, `salary` and `age` (measured in dollars and years, respectively). As far as KNN is concerned, a difference of $\$1,000$ in salary is enormous compared to a difference of $50$ years in age. Consequently, `salary` will drive the KNN classification results, and `age` will have almost no effect. This is contrary to our intuition that a salary difference of $\$1,000$ is quite small compared to an age difference of $50$ years. Furthermore, the importance of scale to the KNN classifier leads to another issue: if we measured `salary` in Japanese yen, or if we measured `age` in minutes, then we'd get quite different classification results from what we get if these two variables are measured in dollars and years.
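
The salary and age example can be made concrete with a small sketch. The three people below are invented, and the standard deviations used for rescaling ($\$20,000$ for salary and $15$ years for age) are assumptions chosen purely for illustration.

```r
# Hypothetical people: B is 50 years older than A, C earns $1,000 more than A
people <- rbind(A = c(50000, 25),
                B = c(50000, 75),
                C = c(51000, 25))
colnames(people) <- c("salary", "age")

# Euclidean distances on the raw dollar and year scales
raw_dist <- as.matrix(dist(people))

# Assumed standard deviations (illustrative only): $20,000 and 15 years.
# Dividing each column by its standard deviation puts the two variables on a
# comparable scale; centering is skipped because shifting a column does not
# change distances between rows
scaled_people <- scale(people, center = FALSE, scale = c(20000, 15))
scaled_dist <- as.matrix(dist(scaled_people))

raw_dist["A", c("B", "C")]     # raw scales: B looks far closer to A than C does
scaled_dist["A", c("B", "C")]  # rescaled: C is now the closer of the two
```

On the raw scales, the person who is $50$ years older counts as the nearer neighbor of A simply because years are small numbers and dollars are large ones. After rescaling, the ordering reverses.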

A good way to handle this problem is to _standardize_ the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The `scale()` function does just this. In standardizing the data, we exclude column 86, because that is the qualitative `Purchase` variable.

In [9]:
standardized.X <- scale(Caravan[,-86])
var(Caravan[,1])

In [10]:
var(Caravan[,2])

In [11]:
var(standardized.X[,1])

In [12]:
var(standardized.X[,2])

Now every column of `standardized.X` has a standard deviation of one and a mean of zero.  
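
Rather than inspecting columns one at a time, the means and standard deviations of all columns can be checked at once. The sketch below does this on a small synthetic matrix (`toy` and its values are made up) instead of the `Caravan` data.

```r
# Synthetic matrix with columns on very different scales (made-up data)
set.seed(1)
toy <- cbind(small = rnorm(50, mean = 2, sd = 0.5),
             large = rnorm(50, mean = 1000, sd = 300))

toy_std <- scale(toy)

# Column means should be zero and column standard deviations one,
# up to floating-point rounding
round(colMeans(toy_std), 10)
apply(toy_std, 2, sd)
```

The same two calls applied to `standardized.X` would check all of its columns in one step.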

We now split the observations into a test set, containing the first $1,000$ observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using $K=1$, and evaluate its performance on the test data.

In [13]:
test <- 1:1000
train.X <- standardized.X[-test,]
test.X <- standardized.X[test,]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != knn.pred)

In [14]:
mean(test.Y != "No")

The vector `test` is numeric, with values from $1$ through $1,000$. Typing `standardized.X[test,]` yields the submatrix of the data containing the observations whose indices range from $1$ to $1,000$, whereas typing `standardized.X[-test,]` yields the submatrix containing the observations whose indices do _not_ range from $1$ to $1,000$. The KNN error rate on the $1,000$ test observations is just under $12\%$. At first glance, this may appear to be fairly good. However, since only $6\%$ of customers purchased insurance, we could get the error rate down to $6\%$ by always predicting `No` regardless of the values of the predictors!
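
A toy version of the two ideas in this paragraph, positive versus negative numeric indexing and the error rate of a rule that always predicts `No`, might look as follows. The matrix `toy_X`, the index `idx`, and the labels `toy_Y` are hypothetical.

```r
# Toy matrix with six rows (made-up values) and a numeric index vector
toy_X <- matrix(1:12, nrow = 6)
idx <- 1:2

toy_X[idx, ]    # rows 1 and 2
toy_X[-idx, ]   # every row except 1 and 2

# Error rate of a rule that always predicts "No", on hypothetical labels:
# it is simply the fraction of observations whose true label is not "No"
toy_Y <- factor(c("No", "No", "Yes", "No", "No", "No"))
mean(toy_Y != "No")
```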

Suppose that there is some non-trivial cost to trying to sell insurance to a given individual. For instance, perhaps a salesperson must visit each potential customer. If the company tries to sell insurance to a random selection of customers, then the success rate will be only $6\%$, which may be far too low given the costs involved. Instead, the company would like to try to sell insurance only to customers who are likely to buy it. So the overall error rate is not of interest. Instead, the fraction of individuals that are correctly predicted to buy insurance is of interest.

It turns out that KNN with $K=1$ does far better than random guessing among the customers that are predicted to buy insurance. Among $77$ such customers, $9$, or $11.7\%$, actually do purchase insurance. This is double the rate that one would obtain from random guessing.

In [15]:
table(knn.pred, test.Y)

        test.Y
knn.pred  No Yes
     No  873  50
     Yes  68   9

In [16]:
9 / (68 + 9)

Using $K=3$, the success rate increases to $19\%$, and with $K=5$ the rate is $26.7\%$. The $K=5$ rate is over four times the rate that results from random guessing. It appears that KNN is finding some real patterns in a difficult data set!

In [17]:
knn.pred <- knn(train.X, test.X, train.Y, k = 3)
table(knn.pred, test.Y)

        test.Y
knn.pred  No Yes
     No  920  54
     Yes  21   5

In [18]:
5 / 26

In [19]:
knn.pred <- knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, test.Y)

        test.Y
knn.pred  No Yes
     No  930  55
     Yes  11   4

In [20]:
4 / 15
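
The three success rates can be collected side by side and compared with the base rate of purchasers in the test set. The sketch below re-enters the `Yes` rows of the three tables by hand; the data frame `pred_yes` and its column names are illustrative choices.

```r
# Counts copied from the three tables above: customers predicted "Yes",
# split by whether they actually purchased
pred_yes <- data.frame(
  K = c(1, 3, 5),
  actual_no = c(68, 21, 11),
  actual_yes = c(9, 5, 4)
)

# Fraction of predicted purchasers who actually purchase
pred_yes$success_rate <- pred_yes$actual_yes /
  (pred_yes$actual_no + pred_yes$actual_yes)

# Baseline: share of purchasers among all 1,000 test observations
# (the "Yes" column of the K = 1 table)
baseline <- (50 + 9) / 1000

# How many times better than approaching customers at random
pred_yes$ratio_to_baseline <- pred_yes$success_rate / baseline
pred_yes
```

The `actual_no + actual_yes` totals also show how quickly the number of customers flagged as likely buyers shrinks as $K$ grows.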

However, while this strategy is cost-effective, it is worth noting that only $15$ customers are predicted to purchase insurance using KNN with $K=5$. In practice, the insurance company may wish to expend resources on convincing more than just $15$ potential customers to buy insurance.

As a comparison, we can also fit a logistic regression model to the data. If we use $0.5$ as the predicted probability cut-off for the classifier, then we have a problem: only seven of the test observations are predicted to purchase insurance. Even worse, we are wrong about all of these! However, we are not required to use a cut-off of $0.5$. If we instead predict a purchase any time the predicted probability of purchase exceeds $0.25$, we get much better results: we predict that $33$ people will purchase insurance, and we are correct for about $33\%$ of these people. This is over five times better than random guessing!

In [21]:
glm.fits <- glm(Purchase ~ ., data = Caravan, family = binomial, subset = -test)

“glm.fit: fitted probabilities numerically 0 or 1 occurred”


In [22]:
glm.probs <- predict(glm.fits, Caravan[test,], type = "response")

In [23]:
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .5] <- "Yes"
table(glm.pred, test.Y)

        test.Y
glm.pred  No Yes
     No  934  59
     Yes   7   0

In [24]:
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .25] <- "Yes"
table(glm.pred, test.Y)

        test.Y
glm.pred  No Yes
     No  919  48
     Yes  22  11

In [25]:
11 / (22 + 11)
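
The cut-off comparison can be wrapped in a small helper so that several thresholds are tried in one call. This is a sketch on invented probabilities and labels, not the `Caravan` fit; `summarize_cutoff`, `toy_probs`, and `toy_truth` are hypothetical names.

```r
# Hypothetical predicted probabilities and true labels (not the Caravan results)
toy_probs <- c(0.05, 0.10, 0.30, 0.60, 0.28, 0.02)
toy_truth <- c("No", "No", "Yes", "No", "No", "Yes")

# For a given cut-off, count the predicted purchasers and the fraction of
# them that are correct; the fraction is NaN when nobody is predicted "Yes"
summarize_cutoff <- function(probs, truth, cutoff) {
  pred_yes <- probs > cutoff
  c(cutoff = cutoff,
    n_predicted = sum(pred_yes),
    success_rate = mean(truth[pred_yes] == "Yes"))
}

# One row per cut-off
t(sapply(c(0.5, 0.25), summarize_cutoff,
         probs = toy_probs, truth = toy_truth))
```

Applied to the lab's objects, the call would presumably take the form `summarize_cutoff(glm.probs, test.Y, 0.25)`. Lowering the cut-off flags more customers, and whether the success rate rises or falls depends on the data.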