### K-nearest neighbours with scikit-learn
In this notebook we will use the scikit-learn interface to build a KNN classifier on the iris dataset, including parameter tuning.

The notebook contains the following parts:
1. Loading the dataset from a .csv file
2. Splitting the data into train & test sets
3. Creating a base k-nn model on the data and checking its accuracy
4. Searching for the best model with the grid search method
5. Analyzing test error for different model parameters

### 1. Loading dataset from .csv file
We can use the pandas package to load the iris dataset, this time in CSV format. Use the *pandas.read_csv(path_to_file)* function to read the dataset and save it to the *iris* variable.

In [43]:
import pandas as pd

Load dataset:

In [44]:
iris = pd.read_csv('iris.csv')

Check the first five rows:

In [45]:
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
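
If iris.csv is not at hand, a similar DataFrame can probably be built from the copy of the dataset bundled with scikit-learn. The sketch below assumes a scikit-learn version that supports `load_iris(as_frame=True)`; the variable name `iris_alt` and the renamed columns are illustrative choices, and individual values may differ slightly from the CSV used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: iris.csv is unavailable, so the bundled dataset is used instead.
bunch = load_iris(as_frame=True)
iris_alt = bunch.data.copy()

# Rename the columns to the names used in this notebook (same column order).
iris_alt.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Map the integer class codes 0/1/2 to species names.
iris_alt['species'] = bunch.target.map(lambda code: bunch.target_names[code])

print(iris_alt.shape)
print(iris_alt.head())
```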


### 2. Splitting data into train & test sets
Now that the dataset is imported, we should divide it into a training set, which will be used for fitting the model and tuning hyperparameters, and a test set, which will be used for the final evaluation. To do this, we first need to store the features and the target variable in separate Python variables. Then we can use the train_test_split function from sklearn's model_selection submodule.

[http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

It returns four objects, in this order:
* X training
* X testing
* y training
* y testing



Create a variable X, a DataFrame with the 4 feature columns.

Create a variable y, the column of the iris dataset that holds the species label.

In [46]:
X = iris.iloc[:, 0:4]
y = iris.iloc[:, 4]

In [47]:
print(X.head())
print(y.head())

   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object


In [48]:
from sklearn.model_selection import train_test_split
TEST_FRACTION = 0.2
RANDOM_STATE = 123

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_FRACTION, random_state=RANDOM_STATE)

In [49]:
print(X_train.shape)
print(X_test.shape)

(120, 4)
(30, 4)
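
With only 150 rows, a purely random split can leave the classes slightly unbalanced between the two parts. A minimal sketch of one way to avoid that is the `stratify` argument of train_test_split. It uses the dataset bundled with scikit-learn so that it runs on its own; the `_demo`-style names are illustrative and not part of the notebook.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(Counter(y_tr.tolist()))
print(Counter(y_te.tolist()))
```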


### 3. Creating base k-nn model on data with accuracy checking
Now we can train (fit) a base KNeighborsClassifier from sklearn's implementation.
* First, we need to import the KNeighborsClassifier class
* Second, we should create an instance of that class (the cell below sets the parameter n_neighbors to 1)
* Third, we can use the .fit() method to create our first model ;) - note: to fit the model, you should use the training data (X_train as features and y_train as the target variable)
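
To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of k-nearest-neighbours prediction: compute the distance to every training point, take the k closest, and return the majority label. The function `knn_predict` and the toy data are hypothetical and only illustrate the principle; sklearn's implementation is more elaborate (for example, it can use tree-based neighbour search).

```python
from collections import Counter

import numpy as np


def knn_predict(X_fit, y_fit, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row.
    distances = np.sqrt(((X_fit - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours.
    votes = Counter(y_fit[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Tiny hypothetical data set: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3))
```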

In [73]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [74]:
print(neigh)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')


We can use accuracy_score from the metrics submodule to assess the model's performance. To get the score we need the true target values for the test set (stored in the *y_test* variable) and the predictions made by the model on the test features. To get the predictions we can use the .predict() method of the KNeighborsClassifier instance.

In [75]:
from sklearn.metrics import accuracy_score
predictions = neigh.predict(X_test)

In [76]:
print('Predictions on test set \n: {}'.format(predictions))
print('Truth values \n: {}'.format(y_test))

Predictions on test set 
: ['virginica' 'virginica' 'virginica' 'versicolor' 'setosa' 'versicolor'
 'versicolor' 'setosa' 'setosa' 'versicolor' 'virginica' 'setosa'
 'versicolor' 'virginica' 'virginica' 'virginica' 'setosa' 'setosa'
 'versicolor' 'setosa' 'setosa' 'versicolor' 'setosa' 'virginica' 'setosa'
 'setosa' 'setosa' 'virginica' 'virginica' 'setosa']
Truth values 
: 72     versicolor
112     virginica
132     virginica
88     versicolor
37         setosa
138     virginica
87     versicolor
42         setosa
8          setosa
90     versicolor
141     virginica
33         setosa
59     versicolor
116     virginica
135     virginica
104     virginica
36         setosa
13         setosa
63     versicolor
45         setosa
28         setosa
133     virginica
24         setosa
127     virginica
46         setosa
20         setosa
31         setosa
121     virginica
117     virginica
4          setosa
Name: species, dtype: object


In [77]:
accuracy_score(y_test, predictions)

0.90000000000000002

You could also use the .score() method of the KNeighborsClassifier instance - the returned score will be the same

In [78]:
neigh.score(X_test, y_test)

0.90000000000000002
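
Accuracy is simply the fraction of predictions that match the true labels, which is why both calls above agree. A small sketch with hypothetical labels (not taken from the notebook's test set) shows the calculation by hand next to accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, used only to illustrate the calculation.
y_true = np.array(['setosa', 'versicolor', 'virginica', 'virginica', 'setosa'])
y_pred = np.array(['setosa', 'virginica', 'virginica', 'virginica', 'setosa'])

# Element-wise comparison gives booleans; their mean is the share of hits.
manual_accuracy = np.mean(y_true == y_pred)

print(manual_accuracy)
print(accuracy_score(y_true, y_pred))
```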

### 4. Searching for best model with gridsearch method

So, we reached a reasonable accuracy (0.90) with the k-NN model (k=1) - but can we do better? We can try multiple parameter sets and compare their scores to check which set of parameters is best for classification. We could write multiple for loops, but instead of implementing this on our own, we can use the *GridSearchCV* class from the model_selection submodule.
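
For comparison, the hand-written loop mentioned above might look roughly like the sketch below. It scores each candidate k with cross_val_score on the training data only and keeps the best one. It loads the bundled dataset so that it runs on its own; the names `best_k` and `best_score` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=123)

best_k, best_score = None, -1.0
for k in [1, 2, 3, 4]:
    # Mean accuracy over 3 cross-validation folds of the training data.
    score = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=3).mean()
    print(k, round(score, 4))
    if score > best_score:
        best_k, best_score = k, score

print('best k:', best_k)
```

The loop grows by one nesting level for every additional parameter, which is the bookkeeping that a grid search object takes over.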

We feed the GridSearchCV object with:
* base model (in our case that will be a KNeighborsClassifier() instance)
* param_grid - a dictionary of the form 'parameter_name': [list of all values we want to try]
* cv - cross-validation strategy

First we import the relevant class:

In [56]:
from sklearn.model_selection import GridSearchCV

Then we define the parameter space:

In [57]:
parameters = {'n_neighbors': [1, 2, 3, 4]}

Then we create an instance of the base classifier and a GridSearchCV object with 3-fold cross-validation (cv=3; GridSearchCV handles the splitting internally, so we don't need to implement it ourselves), and fit the search on the training data.

In [58]:
knn = KNeighborsClassifier()
clf = GridSearchCV(knn, parameters, cv=3)
clf.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)
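
According to the scikit-learn documentation, passing an integer as cv together with a classifier and a multiclass target uses StratifiedKFold, so every fold keeps roughly the same class balance. The sketch below only illustrates what such a split looks like on the bundled iris data; it is not the exact code GridSearchCV runs, and the list `fold_summaries` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X_all, y_all = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=3)
fold_summaries = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_all, y_all)):
    # bincount shows how many samples of each class are in the validation fold.
    class_counts = np.bincount(y_all[val_idx])
    fold_summaries.append((len(train_idx), len(val_idx), class_counts))
    print(fold, len(train_idx), len(val_idx), class_counts)
```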

To show all the values computed internally we can use the .cv_results_ attribute:

In [59]:
clf.cv_results_



{'mean_fit_time': array([ 0.00200025,  0.00166678,  0.00100009,  0.00100001]),
 'mean_score_time': array([ 0.00166678,  0.00133348,  0.00100009,  0.0006667 ]),
 'mean_test_score': array([ 0.96666667,  0.95      ,  0.95      ,  0.95      ]),
 'mean_train_score': array([ 1.        ,  0.97478903,  0.97484177,  0.97067511]),
 'param_n_neighbors': masked_array(data = [1 2 3 4],
              mask = [False False False False],
        fill_value = ?),
 'params': [{'n_neighbors': 1},
  {'n_neighbors': 2},
  {'n_neighbors': 3},
  {'n_neighbors': 4}],
 'rank_test_score': array([1, 2, 2, 2]),
 'split0_test_score': array([ 0.97560976,  0.95121951,  0.97560976,  0.97560976]),
 'split0_train_score': array([ 1.        ,  0.94936709,  0.96202532,  0.96202532]),
 'split1_test_score': array([ 1.   ,  0.975,  0.975,  0.975]),
 'split1_train_score': array([ 1.    ,  0.975 ,  0.9625,  0.95  ]),
 'split2_test_score': array([ 0.92307692,  0.92307692,  0.8974359 ,  0.8974359 ]),
 'split2_train_score': array([

We can convert it to a pd.DataFrame to view it in a more readable form:

In [63]:
pd.DataFrame(clf.cv_results_)



| | mean_fit_time | mean_score_time | mean_test_score | mean_train_score | param_n_neighbors | params | rank_test_score | split0_test_score | split0_train_score | split1_test_score | split1_train_score | split2_test_score | split2_train_score | std_fit_time | std_score_time | std_test_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002 | 0.001667 | 0.966667 | 1.0 | 1 | {'n_neighbors': 1} | 1 | 0.97561 | 1.0 | 1.0 | 1.0 | 0.923077 | 1.0 | 0.0008165347 | 0.0004713704 | 0.031862 | 0.0 |
| 1 | 0.001667 | 0.001333 | 0.95 | 0.974789 | 2 | {'n_neighbors': 2} | 2 | 0.95122 | 0.949367 | 0.975 | 0.975 | 0.923077 | 1.0 | 0.0004713704 | 0.000471539 | 0.021081 | 0.020671 |
| 2 | 0.001 | 0.001 | 0.95 | 0.974842 | 3 | {'n_neighbors': 3} | 2 | 0.97561 | 0.962025 | 0.975 | 0.9625 | 0.897436 | 1.0 | 1.123916e-07 | 2.247832e-07 | 0.036474 | 0.017791 |
| 3 | 0.001 | 0.000667 | 0.95 | 0.970675 | 4 | {'n_neighbors': 4} | 2 | 0.97561 | 0.962025 | 0.975 | 0.95 | 0.897436 | 1.0 | 1.123916e-07 | 0.0004714266 | 0.036474 | 0.021309 |
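
Instead of reading the winner off the table, the fitted search object also exposes it directly through the best_params_, best_score_ and best_estimator_ attributes. The sketch below repeats the search on the bundled iris data so that it runs on its own; the variable name `search` is illustrative, and the numbers it prints need not match the table above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=123)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 2, 3, 4]}, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)     # parameter set ranked first
print(search.best_score_)      # its mean cross-validated accuracy
print(search.best_estimator_)  # model refit on all of X_tr (refit=True by default)
```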


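With the clf object we can predict values on our earlier test set and check the accuracy. Because refit=True, clf.predict() uses the best parameter set refit on the whole training set. Here the best setting is n_neighbors=1, the same as the base model from section 3, so the test accuracy is the same 0.90. Note that the score has to be computed from predictions_grid_search, not from the earlier predictions variable:

In [62]:
predictions_grid_search = clf.predict(X_test)
accuracy_score(y_test, predictions_grid_search)

0.90000000000000002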

### 5. Analyzing test error for different model parameters

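Your task: find the best model parameters for k = 1, ..., 29 and p = 1, 2, 3

Hint: range() will be helpful in defining the list of candidate k values. Note that range(1, 30) stops at 29; use range(1, 31) if k = 30 should be included as well.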
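
The parameter p is the order of the Minkowski distance used to find neighbours (the model printouts above show metric='minkowski'): p=1 is the Manhattan distance, p=2 the Euclidean distance. A minimal sketch of the formula follows, using the first two rows of the iris.head() output; the helper `minkowski` is written here for illustration and is not an sklearn function.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p between two feature vectors."""
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)


# First two rows of the iris.head() output shown earlier.
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([4.9, 3.0, 1.4, 0.2])

for p in [1, 2, 3]:
    print(p, minkowski(a, b, p))
```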

In [64]:
parameters = {'n_neighbors': range(1, 30), 'p': [1, 2, 3]}
knn = KNeighborsClassifier()
clf = GridSearchCV(knn, parameters, cv=3)
clf.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': range(1, 30), 'p': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [69]:
pd.DataFrame(clf.cv_results_).sort_values(by='rank_test_score')



| | mean_fit_time | mean_score_time | mean_test_score | mean_train_score | param_n_neighbors | param_p | params | rank_test_score | split0_test_score | split0_train_score | split1_test_score | split1_train_score | split2_test_score | split2_train_score | std_fit_time | std_score_time | std_test_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | 0.001333 | 0.001333 | 0.975000 | 0.983228 | 9 | 3 | {'n_neighbors': 9, 'p': 3} | 1 | 1.000000 | 0.974684 | 0.975 | 0.9750 | 0.948718 | 1.000000 | 4.713704e-04 | 4.714266e-04 | 0.020929 | 0.011860 |
| 2 | 0.001333 | 0.001333 | 0.975000 | 1.000000 | 1 | 3 | {'n_neighbors': 1, 'p': 3} | 1 | 1.000000 | 1.000000 | 1.000 | 1.0000 | 0.923077 | 1.000000 | 4.713704e-04 | 4.714827e-04 | 0.036029 | 0.000000 |
| 23 | 0.001000 | 0.001667 | 0.975000 | 0.979008 | 8 | 3 | {'n_neighbors': 8, 'p': 3} | 1 | 1.000000 | 0.962025 | 0.975 | 0.9750 | 0.948718 | 1.000000 | 2.247832e-07 | 4.714827e-04 | 0.020929 | 0.015760 |
| 20 | 0.001000 | 0.001000 | 0.975000 | 0.983228 | 7 | 3 | {'n_neighbors': 7, 'p': 3} | 1 | 1.000000 | 0.974684 | 0.975 | 0.9750 | 0.948718 | 1.000000 | 2.247832e-07 | 1.123916e-07 | 0.020929 | 0.011860 |
| 19 | 0.001000 | 0.001000 | 0.975000 | 0.983228 | 7 | 2 | {'n_neighbors': 7, 'p': 2} | 1 | 1.000000 | 0.974684 | 0.975 | 0.9750 | 0.948718 | 1.000000 | 1.123916e-07 | 1.946680e-07 | 0.020929 | 0.011860 |
| 29 | 0.001000 | 0.001667 | 0.966667 | 0.983228 | 10 | 3 | {'n_neighbors': 10, 'p': 3} | 6 | 1.000000 | 0.974684 | 0.975 | 0.9750 | 0.923077 | 1.000000 | 1.123916e-07 | 4.713704e-04 | 0.031942 | 0.011860 |
| 27 | 0.001333 | 0.001000 | 0.966667 | 0.974842 | 10 | 1 | {'n_neighbors': 10, 'p': 1} | 6 | 1.000000 | 0.962025 | 0.975 | 0.9625 | 0.923077 | 1.000000 | 4.713704e-04 | 1.123916e-07 | 0.031942 | 0.017791 |
| 25 | 0.001000 | 0.001000 | 0.966667 | 0.983228 | 9 | 2 | {'n_neighbors': 9, 'p': 2} | 6 | 1.000000 | 0.974684 | 0.975 | 0.9750 | 0.923077 | 1.000000 | 1.123916e-07 | 0.000000e+00 | 0.031942 | 0.011860 |
| 24 | 0.001333 | 0.000667 | 0.966667 | 0.979008 | 9 | 1 | {'n_neighbors': 9, 'p': 1} | 6 | 1.000000 | 0.962025 | 0.950 | 0.9750 | 0.948718 | 1.000000 | 4.714827e-04 | 4.714266e-04 | 0.024019 | 0.015760 |
| 22 | 0.001000 | 0.001000 | 0.966667 | 0.974842 | 8 | 2 | {'n_neighbors': 8, 'p': 2} | 6 | 1.000000 | 0.962025 | 0.975 | 0.9625 | 0.923077 | 1.000000 | 1.123916e-07 | 1.123916e-07 | 0.031942 | 0.017791 |


In [72]:
predictions_grid_search = clf.predict(X_test)
accuracy_score(y_test, predictions_grid_search)

0.96666666666666667
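
To look at the test error itself across different values of k, one option is to fit one model per k and record 1 - accuracy on the test set. This is a minimal sketch on the bundled iris data with the default p=2; the dictionary `test_errors` is an illustrative name, and the printed values need not match the notebook's CSV-based results.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=123)

test_errors = {}
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # The error rate is the complement of the accuracy.
    test_errors[k] = 1.0 - model.score(X_te, y_te)

for k, err in test_errors.items():
    print(k, round(err, 4))
```

Such a table is useful for analysis, but picking k from it would tune the model on the test set; the selection itself is better left to the cross-validated search above.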