# KNeighborsClassifier from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X`: an array of shape `(N,d)` whose rows are samples and columns are features
- `y`: the labels of shape `(N,)`
- `k`: the number of neighbors (including the point itself) that vote
- `algorithm`: `'brute'`, `'ball_tree'`, or `'kd_tree'`

**Output:**  
A tuple `(predict, k_nearest_neighbors)`.  
- `predict`: a function that takes data `X_sample` and outputs the predicted labels
- `k_nearest_neighbors`: a function that takes data `X_sample` and returns an array of shape `(X_sample_height, k)` that stores the indices of the nearest neighbors in `X` for each row in `X_sample`

**Steps:**
1. If `algorithm=="brute"`, create the function `k_nearest_neighbors` using the distance matrix.  
2. If `algorithm=="ball_tree"` or `algorithm=="kd_tree"`, create the function `k_nearest_neighbors` using `sklearn.neighbors.NearestNeighbors` with the corresponding algorithm.
3. Create the function `predict` that executes the following steps:
    1. Input `X_sample` .
    2. Let `nbrhoods = k_nearest_neighbors(X_sample)` .  
    3. Let `votes = y[nbrhoods]` .
    4. Calculate the most frequent label in each row of `votes` and store the results in `y_new` .
    5. Return `y_new` .
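
Steps 3.3 and 3.4 rely on NumPy integer-array indexing followed by a per-row majority vote. Below is a minimal sketch on toy data. The arrays `y_toy` and `nbrhoods_toy` are made up for illustration, and the vote uses `np.bincount`, which assumes non-negative integer labels, as one possible alternative to `scipy.stats.mode`.

```python
import numpy as np

# Hypothetical toy data: labels of 6 training points
y_toy = np.array([0, 0, 1, 1, 1, 0])
# Hypothetical neighbor indices for 2 query points with k = 3
nbrhoods_toy = np.array([[0, 1, 2],
                         [2, 3, 5]])

# Step 3.3: look up the label of every neighbor; the result has shape (2, 3)
votes = y_toy[nbrhoods_toy]

# Step 3.4: majority vote in each row (ties go to the smallest label)
y_new_toy = np.array([np.bincount(row).argmax() for row in votes])

print(votes)
print(y_new_toy)
```

Indexing `y_toy` with a 2D integer array returns an array of the same shape as the index array, so each row of `votes` lists the labels of one query point's neighbors.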

## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the parts that you do not yet know how to implement.  

    1. 
    2. 
    3. ...

## Code

In [None]:
from sklearn.neighbors import NearestNeighbors
from scipy import stats  # statistical functions, including the mode

def dist_mtx(X, Y):
    # pairwise Euclidean distances between the rows of X and the rows of Y
    X_col = X[:,np.newaxis,:]
    Y_row = Y[np.newaxis,:,:]
    diff = X_col - Y_row
    dist = np.linalg.norm(diff, axis=-1)
    return dist

def get_y(kneighbors, y):
    # majority vote: return the most frequent label among each row of neighbors
    # y[kneighbors] picks the label of every neighbor by its index;
    # scipy.stats.mode finds the most frequent label in each row,
    # and the result is reshaped to a vector, which is the prediction y_new
    return stats.mode(y[kneighbors], axis=1).mode.reshape(kneighbors.shape[0])

def knn(X, y, algorithm, k):
    # build a classifier that labels each input point by a majority vote
    # among its k nearest training points, found with the given algorithm
    if algorithm == 'brute':
        def k_nearest_neighbors(X_sample): 
            dist = dist_mtx(X_sample, X)  # distances from each row of X_sample to each row of X
            # after partitioning, the first k columns hold the indices
            # of the k smallest distances (in no particular order)
            argp = dist.argpartition(k-1, axis=1)
            return argp[:,:k] 
    elif algorithm in ['ball_tree', 'kd_tree']:
        nbr = NearestNeighbors(n_neighbors=k, algorithm=algorithm)
        nbr.fit(X)  # build the tree once instead of on every query
        def k_nearest_neighbors(X_sample):
            return nbr.kneighbors(X_sample, return_distance=False)
    else:
        raise ValueError("algorithm must be 'brute', 'ball_tree', or 'kd_tree'")

    def predict(X_sample):
        k_neighbors = k_nearest_neighbors(X_sample)
        return get_y(k_neighbors, y)
    return predict, k_nearest_neighbors

## Test
Take some sample data from [KNeighborsClassifier-with-scikit-learn](KNeighborsClassifier-with-scikit-learn.ipynb) and check whether your code produces outputs similar to those of existing packages.

##### Two Gaussian clusters
Two classes of 100 points each, drawn from 2D normal distributions centered at `(2.5, 0)` and `(-2.5, 0)`.

In [None]:
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
y = np.array([0]*100 + [1]*100)

X_test = np.random.rand(1000,2)*10-5  # 1000 random points with both coordinates in [-5, 5)

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X, y)
y_new = model.predict(X_test)


plt.scatter(*X.T,c=y)
plt.scatter(*X_test.T,c=y_new,s=10,alpha=0.3)

In [None]:
# brute-force algorithm
predict, k_nearest_neighbors  = knn(X, y, 'brute', 5)
y_new = predict(X_test)
plt.figure(figsize=(6, 6)) 
plt.subplot(311).set_title('brute')
plt.scatter(*X_test.T, c=y_new)
# ball_tree algorithm
predict, k_nearest_neighbors  = knn(X, y, 'ball_tree', 5)
y_new = predict(X_test)
plt.subplot(312).set_title('ball_tree')
plt.scatter(*X_test.T, c=y_new)
# kd_tree algorithm
predict, k_nearest_neighbors  = knn(X, y, 'kd_tree', 5)
y_new = predict(X_test)
plt.subplot(313).set_title('kd_tree')
plt.scatter(*X_test.T, c=y_new)

## Comparison

##### Exercise 1
Let  
```python
t = np.arange(20)
angle = 2 * np.pi / 20 * t
X1 = np.vstack([np.cos(angle), np.sin(angle)]).T
X2 = 5 * X1
X = np.vstack([X1, X2])
y = np.array([0]*20 + [1]*20)
X_sample = 10 * np.random.rand(1000,2) - np.array([5,5])
```

###### 1(a)
Train a $k$-nearest neighbors classification model by `X` and `y` .  
Make a prediction of `X_sample` by:  
1. your code with different algorithm settings
2. `sklearn.neighbors.KNeighborsClassifier`

The results should be the same.  
Check if this is true.

In [None]:
t = np.arange(20)
angle = 2 * np.pi / 20 * t
X1 = np.vstack([np.cos(angle), np.sin(angle)]).T
X2 = 5 * X1
X = np.vstack([X1, X2])
y = np.array([0]*20 + [1]*20)
X_sample = 10 * np.random.rand(1000,2) - np.array([5,5])

# as above, classify with each of the three algorithms and verify that the answers match scikit-learn

# brute-force algorithm
predict, k_nearest_neighbors  = knn(X, y, 'brute', 5)
y_new_scratch_brute = predict(X_sample)
plt.figure(figsize=(6, 6)) 
plt.subplot(311).set_title('brute')
plt.scatter(*X_sample.T, c=y_new_scratch_brute)
# ball_tree algorithm
predict, k_nearest_neighbors  = knn(X, y, 'ball_tree', 5)
y_new_scratch_ball_tree = predict(X_sample)
plt.subplot(312).set_title('ball_tree')
plt.scatter(*X_sample.T, c=y_new_scratch_ball_tree)
# kd_tree algorithm
predict, k_nearest_neighbors  = knn(X, y, 'kd_tree', 5)
y_new_scratch_kd_tree = predict(X_sample)
plt.subplot(313).set_title('kd_tree')
plt.scatter(*X_sample.T, c=y_new_scratch_kd_tree)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)
y_new_scikit = model.predict(X_sample)

# check that our three variants and scikit-learn all give the same predictions
print(((y_new_scratch_brute==y_new_scratch_ball_tree) &
       (y_new_scratch_ball_tree==y_new_scratch_kd_tree) & 
       (y_new_scratch_kd_tree==y_new_scikit)).all())

###### 1(b)
Let `y_new` be the prediction of `X_sample` in the previous question. 
Plot the points (rows) in `X` with `c=y` .  
Plot the points (rows) in `X_sample` with `c=y_new` and `alpha=0.1` .

In [None]:
# plot using the brute-force prediction as an example
plt.scatter(*X.T, c=y, label="Training data", marker="^", s=60) 
plt.scatter(*X_sample.T, c=y_new_scratch_brute, alpha=0.1, label="prediction")  # y_new = y_new_scratch_brute
plt.legend(loc="upper left")  # place the legend in the upper-left corner

###### 1(c)
Let  
```python
model = KNeighborsClassifier()
model.fit(X, y)
```  
and let `k_nearest_neighbors` be one of the outputs of your function.  
The results of `k_nearest_neighbors(X_sample)` should be the same as `model.kneighbors(X_sample, return_distance=False)` .  
(Corresponding rows contain the same collection of elements, but possibly in a different order.)  
Check if this is true.
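
Because the order within a row may differ, an elementwise `==` can report a mismatch even when both rows hold the same neighbors. One possible order-insensitive check, sketched here on made-up index arrays (`a` and `b` are hypothetical), sorts each row before comparing.

```python
import numpy as np

# Hypothetical neighbor indices from two methods: same sets, different order
a = np.array([[3, 1, 2],
              [0, 4, 5]])
b = np.array([[1, 2, 3],
              [5, 0, 4]])

# Naive elementwise comparison is sensitive to the order within each row
elementwise_equal = (a == b).all()

# Sorting each row first compares the rows as collections of indices
same_sets = np.array_equal(np.sort(a, axis=1), np.sort(b, axis=1))

print(elementwise_equal, same_sets)
```

This sketch assumes there are no exact ties in distance; with ties, two methods may legitimately pick different neighbors.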

In [None]:
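# as above, find the neighbors with the brute-force variant and compare with KNeighborsClassifier

# brute force
predict, k_nearest_neighbors = knn(X, y, 'brute',3)
neighbors = k_nearest_neighbors(X_sample)
# KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
sk_neighbors = model.kneighbors(X_sample, return_distance=False)
# sort each row first, since the two methods may list the same neighbors in a different order
print((np.sort(neighbors, axis=1)==np.sort(sk_neighbors, axis=1)).all())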

##### Exercise 2
Let  
```python
m,n = 8,8
frames = (m-2) * (n-2)

o = np.array([[1,1,1],
              [1,0,1],
              [1,1,1]])
x = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]])
oo = np.zeros((frames, m, n))
xx = np.zeros((frames, m, n))
count =  0
for i in range(m-2):
    for j in range(n-2):
        oo[count, i:i+3, j:j+3] = o
        xx[count, i:i+3, j:j+3] = x
        count += 1


X = np.vstack([oo, xx]).reshape(2*frames, -1)
y = np.array([0]*frames + [1]*frames)
```

###### 2(a)
Run  
```python
plt.imshow(oo[i], cmap="Greys")
```
with different `i` .  
Guess what is the meaning of `oo` and `xx` .

In [None]:
m,n = 8,8
frames = (m-2) * (n-2) #6x6

o = np.array([[1,1,1],
              [1,0,1],
              [1,1,1]])
x = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]])
oo = np.zeros((frames, m, n)) #(36,8,8)
xx = np.zeros((frames, m, n)) #(36,8,8)
count =  0
for i in range(m-2): #range(6)
    for j in range(n-2): #range(6)
        oo[count, i:i+3, j:j+3] = o
        xx[count, i:i+3, j:j+3] = x
        count += 1

X = np.vstack([oo, xx]).reshape(2*frames, -1)
y = np.array([0]*frames + [1]*frames)

In [None]:
for i in range(6):
    for j in range(6):
        ax = plt.subplot2grid((6,6), (i,j))  # a 6x6 grid of plots; this one sits at position (i,j)
        ax.imshow(oo[i*6+j], cmap="Greys")
print('oo is a set of 36 images; each shows an o-shaped pattern, shifted one step at a time from left to right and top to bottom')

In [None]:
for i in range(6):
    for j in range(6):
        ax = plt.subplot2grid((6,6), (i,j))  # a 6x6 grid of plots; this one sits at position (i,j)
        ax.imshow(xx[i*6+j], cmap="Greys")
print('xx is a set of 36 images; each shows an x-shaped pattern, shifted one step at a time from left to right and top to bottom')

###### 2(b)
Train a $k$-nearest neighbors classification model by `X` and `y` .  
Make a prediction `y_new` for the training data `X` .  
What is the outcome?  
Can you give a reason for this phenomenon?
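
One way to probe the case $k=1$ is to look at the nearest neighbor of each training point within the training set itself. Below is a small sketch on made-up points (`X_toy` is hypothetical), assuming all rows are distinct.

```python
import numpy as np

# Hypothetical toy training set with 4 distinct points
X_toy = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 2.0],
                  [3.0, 3.0]])

# Pairwise Euclidean distances between training points, shape (4, 4)
diff = X_toy[:, np.newaxis, :] - X_toy[np.newaxis, :, :]
dist = np.linalg.norm(diff, axis=-1)

# Index of the single nearest training point for each training point
nearest = dist.argmin(axis=1)

print(nearest)
```

Each point is at distance 0 from itself, so when the training points are distinct and `k=1`, the only vote is the point's own label. For larger `k`, other training points join the vote and can change the outcome.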

In [None]:
#X = np.vstack([oo, xx]).reshape(2*frames, -1) #(72, 64)
#y = np.array([0]*frames + [1]*frames) #(72,)

for k in range(1,10):
    predict, k_nearest_neighbors = knn(X, y, 'brute', k)
    y_new = predict(X) #(72,)
    print("k = ",k,"\n",y_new,"\n") 
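# k = 1: the prediction is always correct because each point is its own nearest neighbor, so y_new equals y
# different k give different predictions because different sets of neighbors vote, so the decision can change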

#### TA:
Well done!