# KNeighborsClassifier with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

## Code
```python
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(<parameters>)
model.fit(X, y)
y_new = model.predict(X_test)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

## Parameters
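- `n_neighbors`: number of neighbors that vote (when predicting on a training point, the point itself counts as one of them)
- `algorithm`: `'auto'`, `'ball_tree'`, `'kd_tree'`, or `'brute'`  
It affects only the speed, not the outcome (apart from tie-breaking among equidistant neighbors).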
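
As a minimal sketch of the claim about `algorithm`, the snippet below fits the same model with three search algorithms and compares the predictions. The synthetic data and the names `X_demo`, `y_demo`, and `X_query` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data: two classes split by the line x + y = 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_query = rng.normal(size=(50, 2))

# Fit one model per search algorithm and store its predictions
predictions = {}
for algorithm in ['brute', 'kd_tree', 'ball_tree']:
    model = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    model.fit(X_demo, y_demo)
    predictions[algorithm] = model.predict(X_query)

# Compare every result against the brute-force result
same = all(np.array_equal(predictions['brute'], p)
           for p in predictions.values())
print('all algorithms agree:', same)
```

With continuous random data, exact ties in distance are very unlikely, so the three algorithms should return the same labels.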

## Attributes
- `classes_`: an array of shape `(n_classes,)`  
(Usually `0, ..., n_classes-1`)
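
The labels do not have to be `0, ..., n_classes-1`. The following sketch, on a made-up one-dimensional dataset, shows that `classes_` holds the sorted distinct labels seen during `fit` and that the columns of `predict_proba` follow the same order. The data and the label values `3`, `5`, `7` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data with labels that are not 0, 1, 2
X_demo = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y_demo = np.array([7, 7, 3, 3, 5, 5])

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_demo, y_demo)

# classes_ lists the distinct labels in sorted order
print(model.classes_)

# Each column of predict_proba corresponds to one entry of classes_
proba = model.predict_proba(np.array([[0.9]]))
print(proba)
```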

## Sample data

##### Exercise 1
Let  
```python
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
y = np.array([0]*100 + [1]*100)
```

###### 1(a)
Plot the points (rows) in `X` with `c=y` .  

In [None]:
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
y = np.array([0]*100 + [1]*100)

In [None]:
plt.axis('equal')
plt.scatter(X[:,0],X[:,1],c=y)

###### 1(b)
Use `np.random.rand` to make 1000 random points in the region $-5\leq x\leq 5$, $-5\leq y\leq 5$.  
Predict their classes and plot them on top of your previous figure.

In [None]:
X_test = np.random.rand(1000,2)*10-5
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)
y_new = model.predict(X_test)

In [None]:
plt.axis('equal')
plt.scatter(X[:,0],X[:,1],c=y)
plt.scatter(X_test[:,0],X_test[:,1],c=y_new,s=10,alpha=0.3)

##### Exercise 2
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```

###### 2(a)
Use `X` and `y` to train a $k$-nearest neighbors classification model with `n_neighbors=5` .  
Let `y_new` be the prediction.  
Calculate the accuracy score between `y` and `y_new` ,  
that is, the number of correct answers divided by the number of samples.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

In [None]:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)
y_new = model.predict(X)

In [None]:
score = np.sum(y==y_new)/y.shape[0]
print('score =',score)

###### 2(b)
Run  
```python
from sklearn.metrics import accuracy_score
accuracy_score(y_new, y)
```
Check if the output is the same as your previous answer.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_new, y)

The outputs are the same.

###### 2(c)
Change the model to the setting `n_neighbors=1` .  
Does the accuracy increase or decrease?  Why?

In [None]:
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
y_new = model.predict(X)

In [None]:
score = np.sum(y==y_new)/y.shape[0]
print('score =',score)

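The accuracy increases. When the number of neighbors is one and the testing data is the same as the training data, the nearest neighbor of each point is the point itself, so the accuracy is 100% (provided that no two identical points carry different labels).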
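
To illustrate why this score says little about the quality of the model, here is a small sketch with labels assigned at random, so there is nothing to learn. The data and the names `X_demo`, `y_demo`, `X_fresh`, and `y_fresh` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical data: random points with random labels (no structure)
X_demo = rng.normal(size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_demo, y_demo)

# On the training data, every point is its own nearest neighbor
train_acc = accuracy_score(y_demo, model.predict(X_demo))

# On fresh points drawn the same way, the model has nothing to rely on
X_fresh = rng.normal(size=(100, 2))
y_fresh = rng.integers(0, 2, size=100)
fresh_acc = accuracy_score(y_fresh, model.predict(X_fresh))

print('training accuracy =', train_acc)
print('fresh-data accuracy =', fresh_acc)
```

The training accuracy is perfect even though the labels are pure noise, which is one reason to evaluate on data the model has not seen.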

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X = digits.data[mask]
y = digits.target[mask]
```

###### 3(a)
Train a $k$-nearest neighbors classification model.  
What is its accuracy score?

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X = digits.data[mask]
y = digits.target[mask]

In [None]:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)
y_new = model.predict(X)

In [None]:
score = np.sum(y==y_new)/y.shape[0]
print('score =',score)

###### 3(b)
Use any software or online app to draw a picture of 0 or 1.  
Save it as a file, e.g., `my_digit.png` .  
Use the following code to load it.  
```python
from PIL import Image
img = Image.open("my_digit.png").resize((8,8))
```
Does the model give you the right answer?  
Each of you can do 5 pictures.  
Let's see what the accuracy score is.

In [None]:
from PIL import Image
img = []
for i in range(1,6):
    img.append(Image.open("digit"+str(i)+".png").convert('L').resize((8,8)))
X_test = (255-np.vstack([np.array(image).ravel() for image in img]))/16
ans = np.array([1,0,0,1,1])

In [None]:
for i in range(5):
    plt.subplot(2,3,i+1)
    plt.title('image'+str(i+1))
    plt.imshow(X_test[i].reshape(8,8))
plt.show()

In [None]:
y_new = model.predict(X_test)
score = np.sum(ans==y_new)/5
print('score =',score)
print(y_new)

## Experiments

##### Exercise 4
For a supervised learning model, you have to partition your data into a training set and a testing set.  
You may do it easily by  
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
```
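
By default `train_test_split` shuffles the data and holds out 25% of it for testing. The sketch below shows three optional arguments that are often useful: `test_size`, `random_state`, and `stratify`. The particular values (30 test samples, seed 0) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size=30     -> hold out exactly 30 samples for testing
# random_state=0   -> make the shuffle reproducible
# stratify=...     -> keep the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=30, random_state=0, stratify=iris.target)

print(X_train.shape, X_test.shape)
print('test samples per class:', np.bincount(y_test))
```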

###### 4(a)
Split the data in Exercise 2.  
Use the training set to train the model.  
Apply the trained model to the testing set.  
What is the accuracy?  
Run it several times and take the average.

In [None]:
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split

In [None]:
score = np.array([])
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    y_new = model.predict(X_test)
    
    scr = np.sum(y_new==y_test)/y_test.shape[0]
    score = np.append(score,scr)

In [None]:
for i in range(10):
    print("test",i+1,": score =",score[i])
print("average score =",np.mean(score))
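
The repeated split-train-test loop can also be delegated to scikit-learn. The following is a sketch of an alternative, not the method used in this notebook: `ShuffleSplit` generates the random splits and `cross_val_score` fits and scores the model on each one. The choices `n_splits=10`, `test_size=0.25`, and `random_state=0` are assumptions made to mirror the loop above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

iris = load_iris()
model = KNeighborsClassifier(n_neighbors=5)

# Ten independent random splits, each holding out 25% for testing
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

# One accuracy score per split (accuracy is the default for classifiers)
scores = cross_val_score(model, iris.data, iris.target, cv=splitter)

print('scores =', scores)
print('average score =', np.mean(scores))
```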

###### 4(b)
Split the data in Exercise 3.  
Use the training set to train the model.  
Apply the trained model to the testing set.  
What is the accuracy?  
Run it several times and take the average.

In [None]:
digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X = digits.data[mask]
y = digits.target[mask]

In [None]:
score = np.array([])
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    y_new = model.predict(X_test)
    
    scr = np.sum(y_new==y_test)/y_test.shape[0]
    score = np.append(score,scr)

In [None]:
for i in range(10):
    print("test",i+1,": score =",score[i])
print("average score =",np.mean(score))

##### Exercise 5
Let  
```python 
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = np.vstack([X[band1], X[band2]])
y = np.array([0]*np.sum(band1) + [1]*np.sum(band2))
```

###### 5(a)
Go through the split-train-test process.  
What is the accuracy score?

In [None]:
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = np.vstack([X[band1],X[band2]])
y = np.array([0]*np.sum(band1) + [1]*np.sum(band2))

In [None]:
score = np.array([])
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    y_new = model.predict(X_test)
    
    scr = np.sum(y_new==y_test)/y_test.shape[0]
    score = np.append(score,scr)

In [None]:
for i in range(10):
    print("test",i+1,": score =",score[i])
print("average score =",np.mean(score))

###### 5(b)
Use some random points to plot the regions for each class.  
(Just as we did in Exercise 1.)

In [None]:
X_test = np.random.rand(1000,2)*10-5
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X,y)
y_new = model.predict(X_test)

In [None]:
plt.axis('equal')
plt.scatter(X[:,0],X[:,1],c=y)
plt.scatter(X_test[:,0],X_test[:,1],c=y_new,s=10,alpha=0.3)