# K-Nearest Neighbors (KNN)

* The value of the dependent variable for an observation is estimated from the observations that are most similar to it in terms of the independent variables.



* KNN is not efficient on very large data sets, because every prediction requires searching the training data for the nearest neighbors.
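The idea in the notes above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: the helper `knn_predict` and the toy data are made up for this example, and it assumes Euclidean distance with an unweighted average of the neighbors' targets.

```python
import math

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict a target for x_new as the mean target of its k nearest training rows."""
    # Euclidean distance from x_new to every training observation
    distances = [math.dist(row, x_new) for row in X_train]
    # Indices of the k closest observations
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # Unweighted average of the neighbors' targets
    return sum(y_train[i] for i in nearest) / k

# Toy data (made up for illustration): one feature, five observations
X_toy = [[1.0], [2.0], [3.0], [10.0], [11.0]]
y_toy = [10.0, 20.0, 30.0, 100.0, 110.0]

# The two nearest neighbors of 2.5 are x=2 and x=3
print(knn_predict(X_toy, y_toy, [2.5], k=2))
```

Because the distance to every training row is computed for each new point, the cost of a prediction grows with the size of the training set, which is why a naive search like this becomes slow on large data.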

## 1-) Data Preprocessing

In [1]:
import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split

In [2]:
hit = pd.read_csv("Hitters.csv")
df = hit.copy()
df = df.dropna()
dms = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
y = df["Salary"]
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
X = pd.concat([X_, dms[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42)

## 2-) Model

In [3]:
from sklearn.neighbors import KNeighborsRegressor

In [4]:
knn_model = KNeighborsRegressor().fit(X_train, y_train)

In [5]:
knn_model

KNeighborsRegressor()

In [6]:
knn_model.n_neighbors  # number of neighbors

5

In [7]:
knn_model.effective_metric_  # metric used to compute the distance between points

'euclidean'

## 3-) Prediction

In [8]:
y_pred = knn_model.predict(X_test)
y_pred[0:10]

array([ 510.3334,  808.3334,  772.5   ,  125.5   , 1005.    ,  325.5   ,
        216.5   ,  101.5   ,  982.    ,  886.6666])

In [9]:
from sklearn.metrics import mean_squared_error

In [10]:
test_error_before = np.sqrt(mean_squared_error(y_test, y_pred))
test_error_before  # test error (RMSE) before model tuning

426.6570764525201
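The test error above is a root mean squared error (RMSE): the square root of the average squared difference between the true and predicted values. As a small sketch of what `np.sqrt(mean_squared_error(...))` computes, here is the same calculation by hand. The `rmse` helper and the toy numbers are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values (made up): errors are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3
y_true_toy = [3.0, 5.0, 7.0]
y_pred_toy = [2.0, 5.0, 9.0]

print(rmse(y_true_toy, y_pred_toy))
```

RMSE is expressed in the same units as the target, so here it can be read directly in units of `Salary`.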

## 4-) Model Tuning

* In this section, we determine the optimal number of neighbors (**n_neighbors**) with the GridSearchCV method.

* GridSearchCV: grid search with cross-validation. Every candidate value is scored by cross-validation on the training set.

* We then build the final model using this number of neighbors.



In [11]:
from sklearn.model_selection import GridSearchCV


In [12]:
k=np.arange(1,30,1)
k

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [13]:
knn_params = {'n_neighbors': k}

In [14]:
knn = KNeighborsRegressor()

In [15]:
knn_cv_model = GridSearchCV(knn, knn_params, cv = 10)

* knn ===> the estimator (algorithm) to tune

* knn_params ===> the grid of candidate values for n_neighbors

* cv ===> the number of cross-validation folds
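To make the three arguments concrete, the sketch below reproduces the search by hand: for each candidate `n_neighbors`, it averages the cross-validated score and keeps the best one, then runs the same search through `GridSearchCV`. It uses synthetic data from `make_regression` (an assumption, so that the example runs without `Hitters.csv`) and a smaller grid and fold count than the notebook. When no `scoring` argument is given, the estimator's own `score` method is used, which for regressors is R².

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption, used only so the sketch is self-contained)
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

candidate_k = np.arange(1, 11)

# Manual version: mean cross-validated score for each candidate k
mean_scores = [
    cross_val_score(KNeighborsRegressor(n_neighbors=int(k)),
                    X_demo, y_demo, cv=5).mean()
    for k in candidate_k
]
manual_best_k = int(candidate_k[int(np.argmax(mean_scores))])
manual_best_score = max(mean_scores)

# Same search through GridSearchCV
search = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": candidate_k}, cv=5)
search.fit(X_demo, y_demo)

print(manual_best_k, search.best_params_["n_neighbors"])
print(manual_best_score, search.best_score_)
```

Note that the selection criterion here is the mean R² across folds, not the RMSE reported elsewhere in this notebook; passing something like `scoring="neg_root_mean_squared_error"` would select on RMSE instead.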

In [16]:
knn_cv_model.fit(X_train, y_train)  # run the cross-validated grid search on the training set

GridSearchCV(cv=10, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])})

In [17]:
knn_cv_model.best_params_["n_neighbors"]

8

* The optimal number of neighbors (**n_neighbors**) is 8.

* We build the final (tuned) model using this number of neighbors.

### 4.1) Tuned model

In [18]:
knn_tuned = KNeighborsRegressor(n_neighbors = knn_cv_model.best_params_["n_neighbors"])

In [19]:
knn_tuned.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=8)

In [27]:
y_pred_tuned = knn_tuned.predict(X_test)
y_pred_tuned[0:5]

array([624.583375, 812.083375, 846.25    , 155.3125  , 850.      ])

In [21]:
test_error_after = np.sqrt(mean_squared_error(y_test, y_pred_tuned))
test_error_after  # test error (RMSE) after model tuning

413.7094731463598
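To put the two test errors side by side, the short calculation below computes the absolute and relative change in RMSE. The two numbers are copied from the outputs above; the variable names `rmse_before` and `rmse_after` are introduced only for this example.

```python
rmse_before = 426.6570764525201  # test RMSE with the default n_neighbors=5
rmse_after = 413.7094731463598   # test RMSE with the tuned n_neighbors=8

reduction = rmse_before - rmse_after
reduction_pct = 100 * reduction / rmse_before

print(f"RMSE reduction: {reduction:.2f} ({reduction_pct:.2f}%)")
```

The improvement from tuning is roughly 3% on this particular train/test split; a different `random_state` could give a somewhat different picture.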