# K Nearest Neighbors

Course example code and data files can be downloaded from: https://www.superdatascience.com/machine-learning/

## Importing the Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

## Importing the Dataset

Use `pd.read_csv` to load the data. The dataset is a Social Network table with 400 rows and 5 columns, where `Purchased` indicates whether the user made a purchase (1) or not (0).

In [2]:
path = '/Users/hsinyu/Desktop/K_Nearest_Neighbors/'
dataset = pd.read_csv( path+'Social_Network_Ads.csv' )

In [3]:
dataset

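| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |
| 5 | 15728773 | Male | 27 | 58000 | 0 |
| 6 | 15598044 | Female | 27 | 84000 | 0 |
| 7 | 15694829 | Female | 32 | 150000 | 1 |
| 8 | 15600575 | Male | 25 | 33000 | 0 |
| 9 | 15727311 | Female | 35 | 65000 | 0 |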


## Dependent & independent variables

In [4]:
X = dataset.iloc[:, [2, 3]].values  # independent variables: Age, EstimatedSalary
y = dataset.iloc[:, 4].values       # dependent variable: Purchased
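
Positional indices such as `[2, 3]` silently select the wrong columns if the column order ever changes. As a minimal sketch, the same selection can be written by column name. The small `demo` frame below is a hand-built stand-in (three rows copied from the preview above), not the real CSV, and the variable names are only illustrative.

```python
import pandas as pd

# Tiny stand-in for Social_Network_Ads.csv (first three rows of the preview).
demo = pd.DataFrame({
    "User ID": [15624510, 15810944, 15668575],
    "Gender": ["Male", "Male", "Female"],
    "Age": [19, 35, 26],
    "EstimatedSalary": [19000, 20000, 43000],
    "Purchased": [0, 0, 0],
})

# Select features and target by name instead of by position.
X_by_name = demo[["Age", "EstimatedSalary"]].values
y_by_name = demo["Purchased"].values

# The positional version used in the cell above, for comparison.
X_by_pos = demo.iloc[:, [2, 3]].values
y_by_pos = demo.iloc[:, 4].values
```

Both versions return the same arrays as long as the column order matches the preview.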

## K Nearest Neighbors Intuition

How does K-NN decide whether a new data point belongs to Category 1 or Category 2?

![](plot_3_2_1.png)

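#### How K Nearest Neighbors works:
+ **Step1.** Choose the number of neighbors <span style="color:blue">k</span>. If it is not specified, scikit-learn's `KNeighborsClassifier` defaults to k=5
+ **Step2.** Use the Euclidean distance to measure how far the new data point is from every other data point, and pick the <span style="color:blue">k</span> nearest neighbors
+ **Step3.** Look at the categories of these <span style="color:blue">k</span> neighbors and count how many fall into each category
+ **Step4.** Assign the new data point to the category with the highest count in **Step3.**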
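
The four steps above can be sketched from scratch in a few lines of NumPy. This is an illustrative sketch, not how scikit-learn implements K-NN internally. The function name `knn_predict` and the toy data are made up for the example, and ties between categories are broken arbitrarily.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: count the categories among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    # Step 4: the most common category wins.
    return votes.most_common(1)[0][0]

# Toy data: category 0 clustered near the origin, category 1 near (5, 5).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Step 1: choose k (3 here, because the toy set is small).
label_a = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
label_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
```

A point near the origin is assigned to category 0 and a point near (5, 5) to category 1, because all three of its nearest neighbors share that category.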

![](plot_3_2_2.png)

> [Note] <br>
> 1. The Euclidean distance is the most common choice, but other distance metrics can be used as well <br>
> 2. Euclidean n-space: given $x=(x_1,x_2,...,x_n)$ and $y=(y_1,y_2,...,y_n)$, the distance $\overline{xy}$ between x and y is

> $$
d(x,y) = d(y,x) = \sqrt{(y_1-x_1)^2+(y_2-x_2)^2+...+(y_n-x_n)^2} = \sqrt{\sum_{i=1}^{n}{(y_i-x_i)^2}}
$$
> ![](plot_3_2_3.png)
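
As a quick check of the formula, here is a minimal sketch that computes the distance by hand and compares it with `np.linalg.norm`. The helper name `euclidean_distance` is only illustrative.

```python
import numpy as np

def euclidean_distance(x, y):
    """d(x, y) = sqrt(sum_i (y_i - x_i)^2) for two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.sum((y - x) ** 2))

# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
d_xy = euclidean_distance([0, 0], [3, 4])
d_yx = euclidean_distance([3, 4], [0, 0])  # symmetric: d(x, y) = d(y, x)

# NumPy's built-in vector norm gives the same value.
d_norm = np.linalg.norm(np.array([3, 4]) - np.array([0, 0]))
```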

## Splitting the dataset into the Training set and Test set

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
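
With 400 rows and `test_size = 0.25`, the split should leave 300 rows for training and 100 for testing. The sketch below confirms this on placeholder arrays of the same shape. `X_demo` and `y_demo` are made-up stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the real data (400 rows, 2 features).
X_demo = np.arange(800).reshape(400, 2)
y_demo = np.arange(400) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)  # training rows vs. test rows
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs produce the same split.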

## Feature Scaling

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
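
Scaling matters for K-NN because the method is distance based. `EstimatedSalary` is in the tens of thousands while `Age` is in the tens, so without scaling the salary would dominate the Euclidean distance. The sketch below shows what `StandardScaler` computes, using made-up rows. Note that the test data is transformed with the mean and standard deviation learned from the training data, which is why the cell above calls `fit_transform` on the training set and only `transform` on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (Age, EstimatedSalary) rows, for illustration only.
train = np.array([[20.0, 20000.0], [30.0, 60000.0], [40.0, 100000.0]])
test = np.array([[30.0, 60000.0], [50.0, 20000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same transformation written out by hand.
mean = train.mean(axis=0)
std = train.std(axis=0)                     # population std (ddof=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std
```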



## Fitting K-NN to the Training set

In [7]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

>[Note] <br>
>With metric = 'minkowski' and p = 2, the classifier uses the Euclidean distance <br>
>With metric = 'minkowski' and p = 1, the classifier uses the Manhattan distance <br>
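
The Minkowski distance generalizes both cases: $d(x,y) = \left(\sum_{i=1}^{n}{|x_i-y_i|^p}\right)^{1/p}$. A minimal sketch, with an illustrative helper name, shows how the choice of `p` changes the result for the same pair of points.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """(sum_i |x_i - y_i|^p)^(1/p) for two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a, b = [0, 0], [3, 4]
d_manhattan = minkowski_distance(a, b, p=1)  # |3| + |4| = 7
d_euclidean = minkowski_distance(a, b, p=2)  # sqrt(9 + 16) = 5
```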

## Predicting the Test set results

In [8]:
y_pred = classifier.predict(X_test)
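
The classifier was trained on scaled features, so any new observation has to pass through the same fitted scaler before it is given to `predict`. The sketch below illustrates this on made-up data. `demo_sc`, `demo_clf`, and the sample rows are hypothetical and stand in for `sc`, `classifier`, and the real dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Made-up (Age, EstimatedSalary) rows and labels, for illustration only.
X_raw = np.array([[20, 20000], [22, 25000], [25, 30000],
                  [45, 90000], [48, 95000], [50, 100000]], dtype=float)
y_raw = np.array([0, 0, 0, 1, 1, 1])

demo_sc = StandardScaler()
demo_clf = KNeighborsClassifier(n_neighbors=3)
demo_clf.fit(demo_sc.fit_transform(X_raw), y_raw)

# A new user must go through the same scaler before prediction.
new_user = np.array([[47.0, 92000.0]])
new_user_scaled = demo_sc.transform(new_user)
pred = demo_clf.predict(new_user_scaled)         # predicted class label
proba = demo_clf.predict_proba(new_user_scaled)  # share of neighbors per class
```

With the default uniform weights, `predict_proba` returns the fraction of the k neighbors that belong to each class.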

## Making the Confusion Matrix

In [9]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [10]:
cm

array([[64,  4],
       [ 3, 29]])
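
In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes. The matrix above therefore reads as 64 true negatives, 4 false positives, 3 false negatives, and 29 true positives. A short sketch derives the usual summary metrics from those four numbers. The matrix is typed in by hand here so that the block runs on its own.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted.
cm_demo = np.array([[64, 4],
                    [3, 29]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()  # (29 + 64) / 100
precision = tp / (tp + fp)            # 29 / 33
recall = tp / (tp + fn)               # 29 / 32
```

On this test set the classifier gets 93 of 100 predictions right.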

## Visualising the Training set results

In [11]:
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
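for i, j in enumerate(np.unique(y_set)):
    # Use `color` rather than `c`: a single RGBA tuple passed as `c` is ambiguous to scatter.
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Training set)')
plt.xlabel('Age (standardized)')
plt.ylabel('Estimated Salary (standardized)')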
plt.legend()
plt.show()


## Visualising the Test set results

In [12]:
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
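for i, j in enumerate(np.unique(y_set)):
    # Use `color` rather than `c`: a single RGBA tuple passed as `c` is ambiguous to scatter.
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Test set)')
plt.xlabel('Age (standardized)')
plt.ylabel('Estimated Salary (standardized)')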
plt.legend()
plt.show()
