# KNeighborsClassifier from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X`: an array of shape `(N,d)` whose rows are samples and columns are features
- `y`: the labels of shape `(N,)`
- `k`: the number of neighbors (including the point itself) that vote
- `algorithm`: `'brute'`, `'ball_tree'`, or `'kd_tree'`

**Output:**  
A tuple `(predict, k_nearest_neighbors)`.  
- `predict`: a function that takes data `X_sample` and returns their predicted labels
- `k_nearest_neighbors`: a function that takes data `X_sample` and returns an array of shape `(X_sample_height, k)` that stores the indices of the nearest neighbors in `X` for each row in `X_sample`

**Steps:**
1. If `algorithm=="brute"`, create the function `k_nearest_neighbors` using the distance matrix.  
2. If `algorithm=="ball_tree"` or `algorithm=="kd_tree"`, create the function `k_nearest_neighbors` using `sklearn.neighbors.NearestNeighbors` with the corresponding algorithm.
3. Create the function `predict` that executes the following steps:
    1. Input `X_sample` .
    2. Let `nbrhoods = k_nearest_neighbors(X_sample)` .  
    3. Let `votes = y[nbrhoods]` .
    4. Calculate the most frequent label in each row of `votes` and store the results in `y_new` .
    5. Return `y_new` .
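
Steps 3 and 4 inside `predict` can be illustrated on a toy example. This is only a minimal sketch: the labels and neighbor indices below are made up for illustration, and it assumes the labels are non-negative integers so that `np.bincount` applies.

```python
import numpy as np

# Hypothetical labels for 5 training points
y = np.array([0, 0, 1, 1, 1])

# Hypothetical neighbor indices for 2 query points with k = 3
nbrhoods = np.array([[0, 1, 2],
                     [2, 3, 0]])

# Indexing y with an integer array turns neighbor indices into neighbor labels
votes = y[nbrhoods]          # shape (2, 3)

# Majority vote in each row
y_new = np.array([np.bincount(row).argmax() for row in votes])
print(votes)
print(y_new)
```

Each row of `votes` lists the labels of one query point's neighbors, and `y_new` keeps the most frequent label in each row.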

## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the parts that you don't yet know how to do.  

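    1. Define a function named knn that takes four arguments: X, y, algorithm, and k.
    2. If algorithm is "brute", compute the distance matrix with dist_mtx and find the k nearest neighbors of each sample with get_k_nearest_indices.
    3. If algorithm is "ball_tree" or "kd_tree", create the k_nearest_neighbors function with the sklearn.neighbors.NearestNeighbors class.
    4. knn also defines an inner function named predict: it takes samples X_sample, finds their k nearest neighbors with k_nearest_neighbors, computes the most frequent label among the corresponding entries of y, and returns the predicted labels.
    5. Finally, knn returns the tuple of predict and k_nearest_neighbors. Here dist_mtx computes the Euclidean distance matrix between two arrays, get_k_nearest_indices returns the indices of the k nearest neighbors of each sample, and calculate_most_frequent_label computes the most frequent label for each sample.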
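
The brute-force step in the pseudocode relies on a distance matrix built by broadcasting. The following is a minimal sketch on made-up points (the arrays `A` and `B` are illustrative, not part of the notebook's data) showing the shapes involved and how `argpartition` picks the `k` nearest columns.

```python
import numpy as np

A = np.array([[0.0, 0.0],
              [3.0, 4.0]])          # 2 query points (hypothetical)
B = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [3.0, 0.0]])          # 3 training points (hypothetical)

# (2, 1, 2) - (1, 3, 2) broadcasts to (2, 3, 2)
diff = A[:, np.newaxis, :] - B[np.newaxis, :, :]
dist = np.linalg.norm(diff, axis=-1)   # shape (2, 3)

k = 2
# The first k columns hold the indices of the k smallest distances, in no particular order
nearest = dist.argpartition(k - 1, axis=1)[:, :k]
print(dist)
print(np.sort(nearest, axis=1))
```

Because `argpartition` does not order the first `k` columns, the rows are sorted here only to make the printed result easier to read.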

## Code

In [None]:
from sklearn.neighbors import NearestNeighbors
from scipy import stats

def dist_mtx(X, Y):     # Euclidean distance between each row of X and each row of Y
    X_col = X[:,np.newaxis,:]       # shape (n_X, 1, d)
    Y_row = Y[np.newaxis,:,:]       # shape (1, n_Y, d)
    diff = X_col - Y_row            # broadcasts to (n_X, n_Y, d)
    dist = np.linalg.norm(diff, axis=-1)
    return dist

def get_y(kneighbors, y):   # most frequent label among the neighbors of each sample
    return stats.mode(y[kneighbors], axis=1).mode.reshape(kneighbors.shape[0])

def knn(X, y, algorithm, k):
    if algorithm == 'brute':
        def k_nearest_neighbors(X_sample):
            dist = dist_mtx(X_sample, X)
            # the first k columns hold the k smallest distances (in no particular order)
            argp = dist.argpartition(k-1, axis=1)
            return argp[:,:k]

    elif algorithm in ['ball_tree', 'kd_tree']:
        # fit the tree once instead of refitting on every query
        nbr = NearestNeighbors(n_neighbors=k, algorithm=algorithm)
        nbr.fit(X)

        def k_nearest_neighbors(X_sample):
            return nbr.kneighbors(X_sample, return_distance=False)

    else:
        raise ValueError("algorithm must be 'brute', 'ball_tree', or 'kd_tree'")

    # predict is the same for every algorithm
    def predict(X_sample):
        k_neighbors = k_nearest_neighbors(X_sample)
        return get_y(k_neighbors, y)

    return predict, k_nearest_neighbors

# Another solution:
# def knn(X, y, k, algorithm):
#     if algorithm == "brute":
#         # create k_nearest_neighbors function using distance matrix
#         def k_nearest_neighbors(X_sample):
#             # calculate distance matrix between X and X_sample
#             dist_matrix = calculate_distance_matrix(X, X_sample)
#             # get indices of k nearest neighbors for each sample in X_sample
#             k_nearest_indices = get_k_nearest_indices(dist_matrix, k)
#             return k_nearest_indices

#     elif algorithm == "ball_tree" or algorithm == "kd_tree":
#         # create k_nearest_neighbors function using sklearn.neighbors.NearestNeighbors
#         from sklearn.neighbors import NearestNeighbors
#         nbrs = NearestNeighbors(n_neighbors=k, algorithm=algorithm).fit(X)
#         def k_nearest_neighbors(X_sample):
#             # get indices of k nearest neighbors for each sample in X_sample
#             _, k_nearest_indices = nbrs.kneighbors(X_sample)
#             return k_nearest_indices

#     # create predict function
#     def predict(X_sample):
#         # get indices of k nearest neighbors for each sample in X_sample
#         nbrhoods = k_nearest_neighbors(X_sample)
#         # get labels of k nearest neighbors for each sample in X_sample
#         votes = y[nbrhoods]
#         # calculate most frequent label for each row in votes
#         y_new = calculate_most_frequent_label(votes)
#         return y_new

#     return predict, k_nearest_neighbors

## Test
Take some sample data from [KNeighborsClassifier-with-scikit-learn](KNeighborsClassifier-with-scikit-learn.ipynb) and check whether your code produces outputs similar to those of the existing packages.

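##### Two Gaussian clusters
Two classes of 100 points each, sampled from 2D normal distributions centered at `(2.5, 0)` and `(-2.5, 0)`, together with 1000 query points drawn uniformly from the square `[-5, 5] x [-5, 5]`.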

In [None]:
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])

mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])

X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
y = np.array([0]*100 + [1]*100)

X_sample = np.random.uniform(low=-5, high=5, size=(1000,2))
plt.scatter(*X.T, c=y)

In [None]:
predict, k_nearest_neighbors  = knn(X, y, 'brute', 2)
y_new = predict(X_sample)
plt.figure(figsize=(6, 6)) 
plt.subplot(311).set_title('brute')
plt.scatter(*X_sample.T, c=y_new)

predict, k_nearest_neighbors  = knn(X, y, 'ball_tree', 2)
y_new = predict(X_sample)
plt.subplot(312).set_title('ball_tree')
plt.scatter(*X_sample.T, c=y_new)

predict, k_nearest_neighbors  = knn(X, y, 'kd_tree', 2)
y_new = predict(X_sample)
plt.subplot(313).set_title('kd_tree')
plt.scatter(*X_sample.T, c=y_new)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=2)
model.fit(X, y)

y_new = model.predict(X_sample)
plt.figure(figsize=(6, 2))
plt.suptitle('sklearn')
plt.scatter(*X_sample.T, c=y_new)

## Comparison

##### Exercise 1
Let  
```python
t = np.arange(20)
angle = 2 * np.pi / 20 * t
X1 = np.vstack([np.cos(angle), np.sin(angle)]).T
X2 = 5 * X1
X = np.vstack([X1, X2])
y = np.array([0]*20 + [1]*20)
X_sample = 10 * np.random.rand(1000,2) - np.array([5,5])
```

###### 1(a)
Train a $k$-nearest neighbors classification model by `X` and `y` .  
Make a prediction of `X_sample` by:  
1. your code with different algorithm settings
2. `sklearn.neighbors.KNeighborsClassifier`

The results should be the same.  
Check if this is true.

In [None]:
t = np.arange(20)
angle = 2 * np.pi / 20 * t
X1 = np.vstack([np.cos(angle), np.sin(angle)]).T
X2 = 5 * X1
X = np.vstack([X1, X2])
y = np.array([0]*20 + [1]*20)
X_sample = 10 * np.random.rand(1000,2) - np.array([5,5])

In [None]:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_nearest_neighbors(X, y, k, algorithm):
    if algorithm == 'brute':
        def knn(X_sample):
            distances = np.sqrt(((X - X_sample[:, np.newaxis])**2).sum(axis=2))
            k_nearest = np.argsort(distances, axis=1)[:, :k]
            return k_nearest
        
    elif algorithm == 'ball_tree' or algorithm == 'kd_tree':
        nn = NearestNeighbors(n_neighbors=k, algorithm=algorithm)
        nn.fit(X)
        
        def knn(X_sample):
            return nn.kneighbors(X_sample, return_distance=False)
        
    return knn

def predict(X_sample, X, y, k, algorithm):
    knn_func = k_nearest_neighbors(X, y, k, algorithm)
    nbrhoods = knn_func(X_sample)
    votes = y[nbrhoods]
    
    y_new = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=1, arr=votes)
    return y_new, nbrhoods

y_new, nbrhoods = predict(X_sample, X, y, k=3, algorithm='brute')
print(y_new)
print(nbrhoods)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, algorithm='brute')
knn.fit(X, y)

y_new = knn.predict(X_sample)
nbrhoods = knn.kneighbors(X_sample, return_distance=False)
print(y_new)
print(nbrhoods)

=> Comparing the two printed outputs shows that the predictions and the neighbor indices are the same, so the statement is true.

##### Veronica  

I recommend the following code, since it checks whether the two outputs agree without printing every value.

Another method:

```python
y_new, nbrhoods = predict(X_sample, X, y, k=3, algorithm='brute')
y_new_knn = knn.predict(X_sample)
nbrhoods_knn = knn.kneighbors(X_sample, return_distance=False)
# compare elementwise; `y_new.all() == y_new_knn.all()` would only compare two booleans
print((y_new == y_new_knn).all())
```
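
A pitfall worth keeping in mind when comparing arrays this way: calling `.all()` on each array before `==` reduces each one to a single boolean, so two different arrays can still compare as equal. A minimal sketch with made-up arrays:

```python
import numpy as np

a = np.array([1, 1, 0, 1])
b = np.array([1, 0, 1, 1])   # differs from a in two positions

# Each .all() collapses its array to one bool first, so this compares False == False
print(a.all() == b.all())        # True, even though the arrays differ

# Elementwise comparison, then a single reduction
print((a == b).all())            # False
print(np.array_equal(a, b))      # False
```

`np.array_equal` also returns `False` when the shapes differ, which makes it a convenient one-line check.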

###### 1(b)
Let `y_new` be the prediction of `X_sample` in the previous question. 
Plot the points (rows) in `X` with `c=y` .  
Plot the points (rows) in `X_sample` with `c=y_new` and `alpha=0.1` .

In [None]:
import matplotlib.pyplot as plt

# Plot points in X with c=y
plt.scatter(X[:, 0], X[:, 1], c=y)

# Plot points in X_sample with c=y_new and alpha=0.1
plt.scatter(X_sample[:, 0], X_sample[:, 1], c=y_new, alpha=0.1)

plt.show()

In [None]:
plt.scatter(*X.T, c=y, label="Training data", marker="*", s=60)
plt.scatter(*X_sample.T, c=y_new, alpha=0.1, label="prediction")
plt.legend(loc="upper left");

Please continue with the previous question

###### 1(c)
Let  
```python
model = KNeighborsClassifier()
model.fit(X, y)
```  
and let `k_nearest_neighbors` be one of the outputs of your function.  
The results of `k_nearest_neighbors(X_sample)` should be the same as `model.kneighbors(X_sample, return_distance=False)` .  
(The corresponding rows contain the same collection of elements, but might be in a different order.)  
Check if this is true.
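
Since the rows may list the same neighbors in a different order, one way to run this check is to sort each row before comparing. The helper name `same_neighbors` and the arrays below are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def same_neighbors(nbrs_a, nbrs_b):
    """Return True if corresponding rows hold the same indices, ignoring order."""
    return np.array_equal(np.sort(nbrs_a, axis=1), np.sort(nbrs_b, axis=1))

a = np.array([[3, 1, 2], [0, 4, 5]])
b = np.array([[1, 2, 3], [5, 0, 4]])   # same rows as a, reordered
c = np.array([[1, 2, 3], [5, 0, 6]])   # second row differs from a

print(same_neighbors(a, b))
print(same_neighbors(a, c))
```

Note that `KNeighborsClassifier()` uses `n_neighbors=5` by default, so your function needs `k=5` for the two arrays to have the same shape.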

In [None]:
import matplotlib.pyplot as plt

# Define X_sample, X, and y
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([0, 0, 1, 1])
X_sample = np.array([[0.5, 0.5], [0.2, 0.2]])

#Define k_nearest_neighbors and predict functions
def k_nearest_neighbors(X, y, k, algorithm):
    if algorithm == 'brute':
        def knn(X_sample):
            distances = np.sqrt(((X - X_sample[:, np.newaxis])**2).sum(axis=2))
            k_nearest = np.argsort(distances, axis=1)[:, :k]
            return k_nearest
    elif algorithm == 'ball_tree' or algorithm == 'kd_tree':
        nn = NearestNeighbors(n_neighbors=k, algorithm=algorithm)
        nn.fit(X)
        def knn(X_sample):
            return nn.kneighbors(X_sample, return_distance=False)
    return knn

def predict(X_sample, X, y, k, algorithm):
    knn_func = k_nearest_neighbors(X, y, k, algorithm)
    nbrhoods = knn_func(X_sample)
    votes = y[nbrhoods]
    y_new = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=1, arr=votes)
    return y_new, nbrhoods

# Get predictions with brute-force algorithm
y_new, nbrhoods = predict(X_sample, X, y, k=3, algorithm='brute')

# Plot points in X with c=y
plt.scatter(X[:, 0], X[:, 1], c=y)

# Plot points in X_sample with c=y_new and alpha=0.1
plt.scatter(X_sample[:, 0], X_sample[:, 1], c=y_new, alpha=0.1)

plt.show()

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
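# Sort each row before comparing, since the neighbors may be listed in a different order.
# Note: (0.5, 0.5) is equidistant from all four training points, so that row is a tie
# and the chosen neighbors may differ between implementations.
nbrhoods_model = model.kneighbors(X_sample, return_distance=False)
print('Check if this is true:', (np.sort(nbrhoods, axis=1) == np.sort(nbrhoods_model, axis=1)).all())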

=> The check above confirms that this is true.

##### Veronica  

In my opinion, you don't have to define `k_nearest_neighbors` again, since you have already defined it once.

Redefining it is redundant.

##### Exercise 2
Let  
```python
m,n = 8,8
frames = (m-2) * (n-2)

o = np.array([[1,1,1],
              [1,0,1],
              [1,1,1]])
x = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]])
oo = np.zeros((frames, m, n))
xx = np.zeros((frames, m, n))
count =  0
for i in range(m-2):
    for j in range(n-2):
        oo[count, i:i+3, j:j+3] = o
        xx[count, i:i+3, j:j+3] = x
        count += 1


X = np.vstack([oo, xx]).reshape(2*frames, -1)
y = np.array([0]*frames + [1]*frames)
```

###### 2(a)
Run  
```python
plt.imshow(oo[i], cmap="Greys")
```
with different `i` .  
Guess what is the meaning of `oo` and `xx` .

In [None]:
m,n = 8,8
frames = (m-2) * (n-2)

o = np.array([[1,1,1],
              [1,0,1],
              [1,1,1]])
x = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]])
oo = np.zeros((frames, m, n))
xx = np.zeros((frames, m, n))
count =  0

for i in range(m-2):
    for j in range(n-2):
        oo[count, i:i+3, j:j+3] = o
        xx[count, i:i+3, j:j+3] = x
        count += 1

X = np.vstack([oo, xx]).reshape(2*frames, -1)
y = np.array([0]*frames + [1]*frames)

In [None]:
for i in range(6):
    for j in range(6):
        ax = plt.subplot2grid((6,6), (i,j))
        ax.imshow(oo[i*6+j], cmap="Greys")
        
print('oo is a set of 36 images; each contains an o-like pattern whose position shifts one step at a time, left to right and top to bottom')
print('xx is a set of 36 images; each contains an x-like pattern whose position shifts one step at a time, left to right and top to bottom')

###### 2(b)
Train a $k$-nearest neighbors classification model by `X` and `y` .  
Make a prediction `y_new` for the training data `X` .  
What is the outcome?  
Can you give a reason for this phenomenon?

In [None]:
def knn(X, y, k, algorithm):
    def k_nearest_neighbors(X_sample):
        if algorithm == "brute":
            dist = dist_mtx(X_sample, X)
            argp = dist.argpartition(k-1, axis=1)
            return argp[:,:k]
        
        elif algorithm == "ball_tree" or algorithm == "kd_tree":
            nbr = NearestNeighbors(n_neighbors=k, algorithm=algorithm)
            nbr.fit(X)
            return nbr.kneighbors(X_sample, return_distance=False)

    def predict(X_sample):
        k_neighbors = k_nearest_neighbors(X_sample)
        return get_y(k_neighbors, y)

    return predict, k_nearest_neighbors

for k in range(1, 10):
    predict, k_nearest_neighbors = knn(X, y, k, 'brute')
    y_new = predict(X) #(72,)
    
    print("k = ",k,"\n",y_new,"\n") 

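=> When $k = 1$, the prediction must be correct: the nearest neighbor of each training point is the point itself (at distance 0), so it votes with its own label and `y_new` equals `y`.

=> Different values of $k$ give different predictions, because the neighborhood of each point then contains a different set of training points, which changes the outcome of the vote.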
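
The claim for $k = 1$ can be checked numerically. This is a minimal sketch on hypothetical random data (`X_demo` is not the notebook's `X`), assuming all training points are distinct so that only the point itself is at distance 0.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))           # 10 distinct points (hypothetical data)

# Pairwise distances between the training points themselves
diff = X_demo[:, np.newaxis, :] - X_demo[np.newaxis, :, :]
dist = np.linalg.norm(diff, axis=-1)

# With k = 1 the single nearest neighbor is the column of the row minimum
nearest = dist.argmin(axis=1)
print(np.array_equal(nearest, np.arange(10)))
```

Every point is its own nearest neighbor, so a 1-nearest-neighbor model always reproduces the training labels on the training data.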