# k-Nearest Neighbors

Hello again, and welcome back to the machine learning book with scikit-learn. This is the last lesson on classification models.

Another way to classify elements is the k-Nearest Neighbors (k-NN) algorithm. It classifies a sample based on the labels of the training samples closest to it in the feature space: for each new sample, the "k" nearest neighbors are found, and the class of the new sample is decided from their labels, typically by majority vote.
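
To make the idea concrete, here is a minimal from-scratch sketch of that procedure. It assumes Euclidean distance and a plain majority vote; the function name `knn_predict` and the toy data are made up for illustration and are not part of scikit-learn.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one sample by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small, well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15])))  # query near the first cluster
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0])))   # query near the second cluster
```

Note that there is no real training step: the "model" is just the stored training data, and all the work happens at prediction time.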

Using it in scikit-learn is as simple as with any other classifier. We import it from the `sklearn.neighbors` module:

In [None]:
from sklearn.neighbors import KNeighborsClassifier


And it has the usual `fit` and `predict` methods:

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, random_state=42, noise=0.1)

X_train, X_test, y_train, y_test = train_test_split(X, y)


In [None]:
knn = KNeighborsClassifier()

knn.fit(X_train, y_train)

print(knn.predict(X_test))
print(knn.score(X_test, y_test))


It also has the `predict_proba` method. With the default uniform weights, the "probability" it returns for each class is simply the fraction of the k nearest neighbors that belong to that class:

In [None]:
print(knn.predict_proba(X_test)[:20])
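
One way to check where those numbers come from is to ask the model for the neighbors themselves with `kneighbors` and count their labels. This is a sketch under the assumption of the default uniform weights and binary 0/1 labels; the `random_state=0` in the split is only there to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=1000, random_state=42, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Indices (into X_train) of the 5 nearest neighbors of each test sample
_, neighbor_idx = knn.kneighbors(X_test)
# Fraction of those neighbors that are labeled 1
fraction_class_1 = y_train[neighbor_idx].mean(axis=1)

proba = knn.predict_proba(X_test)
# Compare the second column (class 1) with the fraction counted by hand
print(np.allclose(proba[:, 1], fraction_class_1))
```

With `n_neighbors=5`, this also explains why the probabilities only take the values 0, 0.2, 0.4, 0.6, 0.8 and 1.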


Anyway, the interesting part lies in the arguments, the hyperparameters of the class.

## Arguments

Like many other machine learning models, the `KNeighborsClassifier` class has some arguments to modify its behavior:

 - `n_neighbors`: This hyperparameter determines the number of neighbors to be used in the classification. If the value of `n_neighbors` is too low, the model may overfit the data, while if the value is too high, the model may underfit the data. The default value is 5.
 - `weights`: This hyperparameter determines how much each neighbor's vote counts. The options are 'uniform', where all neighbors have the same weight in the classification, and 'distance', where closer neighbors have a greater weight (each vote is weighted by the inverse of its distance). The default option is 'uniform'.
 - `metric`: This hyperparameter determines the distance metric used to calculate the distances between samples. Some common options are 'euclidean', 'manhattan', and 'minkowski'. The default is 'minkowski' with `p=2`, which is equivalent to the Euclidean distance.
 - `algorithm`: This hyperparameter determines the algorithm used to find the nearest neighbors. The options are 'brute', which computes the distance from the query to every training sample; 'kd_tree' and 'ball_tree', which use data structures to search for the nearest neighbors more efficiently; and 'auto' (the default), which tries to pick the most appropriate one based on the data passed to `fit`.
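
The effect of `weights` is easiest to see on a tiny example. The sketch below uses made-up one-dimensional data in which a single class-0 point sits very close to the query and two class-1 points sit farther away, so the two weighting schemes disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: one class-0 point near the query, two class-1 points farther away
X_demo = np.array([[1.0], [3.0], [3.5]])
y_demo = np.array([0, 1, 1])
query = np.array([[1.2]])

predictions = {}
for weights in ("uniform", "distance"):
    model = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X_demo, y_demo)
    predictions[weights] = int(model.predict(query)[0])
    print(f"{weights}\t{predictions[weights]}")
```

With 'uniform', the two class-1 neighbors outvote the single class-0 neighbor. With 'distance', the class-0 neighbor at distance 0.2 carries a weight of 1/0.2 = 5, which is more than the two class-1 neighbors combined (1/1.8 + 1/2.3, roughly 0.99), so the prediction flips.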

Let's see how the model behaves when we modify them, starting with what may be the most important one.

First, let's create an example dataset:

In [None]:
from utils import plot_boundaries

X, y = make_moons(n_samples=1000, random_state=42, noise=0.15)


## `n_neighbors`

In [None]:
plot_boundaries(
    X, y, 
    [
        ('n_neighbors = 1', KNeighborsClassifier(n_neighbors=1)),
        ('n_neighbors = 10', KNeighborsClassifier(n_neighbors=10)),
        ('n_neighbors = 100', KNeighborsClassifier(n_neighbors=100)),
        ('n_neighbors = 999', KNeighborsClassifier(n_neighbors=999)),
    ]
)
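
The boundaries can be complemented with numbers. The following sketch compares training and test accuracy for several values of `n_neighbors` on the same kind of dataset. The split parameters are assumptions made for this example, and the largest value is 600 rather than 999 because `n_neighbors` cannot exceed the number of training samples (700 here).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=1000, random_state=42, noise=0.15)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scores = {}
for k in (1, 10, 100, 600):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each k
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:>3}\ttrain={scores[k][0]:.3f}\ttest={scores[k][1]:.3f}")
```

A training accuracy of exactly 1.0 with `n_neighbors=1` is the typical signature of overfitting: every training point is its own nearest neighbor, so the model simply memorizes the data.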


## The importance of scaling features

k-Nearest Neighbors is an algorithm based entirely on distances between samples, so it is vitally important that the features are scaled before passing them to the model. Otherwise, features with larger ranges dominate the distance computation and the predictions suffer.
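
A quick numeric sketch of the effect, using three made-up points: stretching a single feature is enough to change which point is the nearest neighbor.

```python
import numpy as np


def dist(p, q):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(p - q))


a = np.array([0.0, 0.0])  # query point
b = np.array([1.0, 0.0])  # differs from a only in the first feature
c = np.array([0.0, 0.5])  # differs from a only in the second feature

# On the original scale, c is the nearest neighbor of a
print(dist(a, b), dist(a, c))

# Stretch the second feature by 5: now b is the nearest neighbor of a
scale = np.array([1.0, 5.0])
print(dist(a * scale, b * scale), dist(a * scale, c * scale))
```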

To demonstrate this, here I am creating a dataset and taking one of its features out of scale by multiplying it by 5:

In [None]:
X, y = make_moons(n_samples=100, random_state=42, noise=.1)
X[:,1] = X[:,1] * 5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Then I train a model with the unscaled data:

In [None]:
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
accuracy_unscaled = knn_unscaled.score(X_test, y_test)


And I train one by scaling the features beforehand:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
accuracy_scaled = knn_scaled.score(X_test_scaled, y_test)


From the outset, we can see the difference in performance between one and the other:

In [None]:
print(f"Unscaled\t{accuracy_unscaled:.4f}")
print(f"Scaled\t{accuracy_scaled:.4f}")


But it can be better appreciated with a two-dimensional graph:

In [None]:
from utils import plot_knn_boundaries

plot_knn_boundaries(knn_unscaled,knn_scaled, X_train, X_train_scaled, y_train)
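
The manual scaling steps above can also be bundled with the classifier. This sketch uses scikit-learn's `make_pipeline`, so that the scaler is always fitted on the training data only and applied automatically at prediction time. It recreates the same out-of-scale dataset, and the names `knn_pipeline` and `accuracy_pipeline` are just for this example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same dataset as before, with the second feature taken out of scale
X, y = make_moons(n_samples=100, random_state=42, noise=0.1)
X[:, 1] = X[:, 1] * 5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The pipeline scales the features and then applies k-NN
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_train, y_train)

accuracy_pipeline = knn_pipeline.score(X_test, y_test)
print(f"Pipeline\t{accuracy_pipeline:.4f}")
```

Because the pipeline performs the same two steps as the manual version, the accuracy should match the scaled model, with less room for mistakes such as fitting the scaler on the test set.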


## Size

A fitted k-NN model stores its training data, so its size on disk and in memory grows with the size of the training dataset:

In [None]:
import joblib
import os
from sklearn.datasets import make_classification

n_samples = [100, 1000, 10000, 100000]

for n in n_samples:
    X, y = make_classification(n_samples=n, n_features=20)
    knn = KNeighborsClassifier()
    knn.fit(X, y)
    joblib.dump(knn, f"/tmp/knn_model_{n}.joblib")
    model_size = os.path.getsize(f"/tmp/knn_model_{n}.joblib")

    print(f"Model size (n={n}):\t{model_size:>10} bytes")
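
To see where that size comes from, the following sketch compares the raw size of the training array with the size of the serialized model. It uses `pickle.dumps` in memory instead of writing files, and `algorithm="brute"` so that no search tree is stored alongside the data; both choices are assumptions made to keep the comparison simple.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

sizes = {}
for n in (100, 1000, 10000):
    X_size, y_size = make_classification(n_samples=n, n_features=20, random_state=0)
    model = KNeighborsClassifier(algorithm="brute").fit(X_size, y_size)
    # Bytes of the serialized model versus bytes of the raw feature array
    serialized = len(pickle.dumps(model))
    sizes[n] = (X_size.nbytes, serialized)
    print(f"n={n}:\traw X = {X_size.nbytes:>9} bytes\tpickled model = {serialized:>9} bytes")
```

The serialized model is never smaller than the training array, because the array itself is part of what gets saved.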


## In conclusion

The k-NN model is a good option for classification problems, especially with small or moderate-sized datasets, multiple classes, noisy data or missing values, and low to moderate dimensionality.

Consider avoiding it, however, with large datasets, high-dimensional or very sparse data, when prediction speed is critical, or when accuracy matters more than simplicity and interpretability. In those cases, other machine learning algorithms may be better suited to the specific problem.

We'll see you in the next chapter, where we'll discuss another distance-based model.