<img src="AV_Logo.png" style="width: 200px;height: 75px"/>

Table of Contents:
---------
* [What is kNN Algorithm?](#What-is-kNN-Algorithm?)
* [How does the KNN algorithm work?](#How-does-the-KNN-algorithm-work?)
* [How do we choose the factor K?](#How-do-we-choose-the-factor-K?)
* [kNN Algorithm – Pros and Cons](#kNN-Algorithm-–-Pros-and-Cons)
* [Implementation of kNN](#Implementation-of-kNN)

## What is kNN Algorithm?

Let’s assume we have several groups of labeled samples. The items present in the groups are homogeneous in nature. Now, suppose we have an unlabeled example which needs to be classified into one of the several labeled groups. 

How do you do that? Without a doubt, you can use the kNN algorithm.

k nearest neighbors is a simple algorithm that stores all available cases and classifies each new case by a majority vote of its k nearest neighbors. This algorithm segregates unlabeled data points into well-defined groups.

KNN can be used for both classification and regression problems. However, it is more widely used for classification problems in industry.

## How does the KNN algorithm work?

Let’s take a simple case to understand this algorithm. The following is a spread of red circles (RC) and green squares (GS).

<img src="knn1.png" style="width: 650px;height: 300px">

You intend to find out the class of the blue star (BS). BS can belong to only one of two classes: GS or RC. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let’s say K = 3. We now draw a circle with BS as its center, just big enough to enclose exactly three data points on the plane. Refer to the following diagram for more details:

<img src="kkn2.png" style="width: 650px;height: 300px">

The three closest points to BS are all RC. Hence, with a good level of confidence, we can say that BS belongs to the class RC. Here the choice is obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm. Next, we will look at the factors to consider when choosing the best K.
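The voting procedure described above can be sketched in a few lines of plain Python. The coordinates are made up for illustration (they are not read off the figure), and `knn_predict` is a hypothetical helper name, not a library function.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up coordinates: three red circles (RC) and three green squares (GS)
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (5.0, 5.0), (5.5, 4.5), (6.0, 5.5)]
labels = ["RC", "RC", "RC", "GS", "GS", "GS"]
blue_star = (1.8, 1.6)

prediction = knn_predict(points, labels, blue_star, k=3)
print(prediction)  # RC: the three nearest points are all red circles
```

Note that nothing is "learned" here: the function simply keeps the training points around and does all of its work at prediction time.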

### How do we choose the factor K?

First, let us try to understand what exactly K influences in the algorithm. In the last example, with all 6 training observations held fixed, any given value of K defines a boundary for each class. These boundaries segregate RC from GS. In the same way, let’s look at the effect of the value of K on the class boundaries. The following are the boundaries separating the two classes for different values of K.

<img src="knn3.png" style="width: 450px;height: 200px">

<img src="knn4.png" style="width: 450px;height: 200px">

If you look carefully, you can see that the boundary becomes smoother as K increases. As K grows toward the number of training points, the whole region finally becomes all blue or all red, depending on the overall majority. The training error rate and the validation error rate are the two quantities we need to assess for different values of K. The following is the curve for the training error rate with varying values of K:

<img src="knn5.png" style="width: 450px;height: 300px">

As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself. Hence, the prediction is always accurate with K=1. If the validation error curve had been similar, our choice of K would have been 1. The following is the validation error curve with varying values of K:
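A quick way to convince yourself of this is to classify every training point with K=1 against the training set itself. The sketch below uses made-up points and a hypothetical helper `nearest_label`. It assumes no two training points share the same coordinates with different labels; otherwise the training error at K=1 need not be zero.

```python
import math

def nearest_label(train_points, train_labels, query):
    """Return the label of the single closest training point (K=1)."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(train_points[i], query))
    return train_labels[best]

# Made-up training set with distinct points and deliberately mixed labels
points = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.1), (2.9, 3.0), (2.0, 2.0)]
labels = ["RC", "GS", "GS", "RC", "RC"]

# Each training point is at distance 0 from itself, so it is its own nearest neighbor
errors = sum(nearest_label(points, labels, p) != y for p, y in zip(points, labels))
training_error_rate = errors / len(points)
print(training_error_rate)  # 0.0
```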

<img src="knn6.png" style="width: 450px;height: 300px">

This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the validation error rate initially decreases and reaches a minimum. After the minimum, it increases with increasing K. To find the optimal value of K, split the initial dataset into training and validation sets. Then plot the validation error curve and pick the K with the lowest validation error. This value of K should be used for all predictions.
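The procedure above could look roughly like this with scikit-learn. This is a sketch under assumptions: the data is synthetic (`make_classification`), and the 70/30 split and the range of K values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real dataset (an assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)

# Hold out part of the data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

k_values = list(range(1, 31))
train_errors, val_errors = [], []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Error rate = 1 - accuracy
    train_errors.append(1 - model.score(X_train, y_train))
    val_errors.append(1 - model.score(X_val, y_val))

# K with the lowest validation error (ties go to the smallest K)
best_k = k_values[val_errors.index(min(val_errors))]
print("best K:", best_k)
```

Plotting `train_errors` and `val_errors` against `k_values` gives the two kinds of curves discussed in this section.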

## kNN Algorithm – Pros and Cons

**Pros:** The algorithm is non-parametric: it makes no prior assumptions about the distribution of the underlying data. Being simple and effective, it is easy to implement and has become popular.

**Cons:** kNN is simple, but it has drawn a lot of flak for being extremely simple! If we take a deeper look, it doesn’t build a model, since there’s no abstraction process involved. The training process is really fast because the data is stored verbatim (hence the name lazy learner), but prediction is slow, and useful insights are sometimes missing. Building a robust kNN model therefore requires investing time in data preparation (especially treating missing data and categorical features).
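As a rough illustration of that data preparation, the sketch below imputes a missing numeric value and one-hot encodes a categorical column before fitting kNN. The tiny frame and its column names (`alcohol`, `colour`, `label`) are invented for this example; the imputation strategy and the encoder are one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame with a missing numeric value and a categorical column
df = pd.DataFrame({
    "alcohol": [8.8, 9.5, np.nan, 9.9, 10.1, 11.0],
    "colour": ["white", "white", "red", "red", "white", "red"],
    "label": [0, 0, 1, 1, 0, 1],
})
X, y = df[["alcohol", "colour"]], df["label"]

prep = ColumnTransformer([
    # Replace missing numeric values with the column mean
    ("num", SimpleImputer(strategy="mean"), ["alcohol"]),
    # Turn each category into 0/1 indicator columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
])

# Preprocessing and kNN chained so the same steps run at fit and predict time
model = Pipeline([("prep", prep), ("knn", KNeighborsClassifier(n_neighbors=3))])
model.fit(X, y)
pred = model.predict(X)
print(pred)
```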

## Implementation of kNN

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('winequality.csv')

In [3]:
data.head()

| | ID | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | W0001 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 2 |
| 1 | W0002 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | NaN | 9.5 | 2 |
| 2 | W0003 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | NaN | 10.1 | 2 |
| 3 | W0004 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 2 |
| 4 | W0005 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 2 |


In [4]:
# first separate dependent and independent variables
X = data.drop(['ID', 'quality'], axis=1)
y = data.quality

In [5]:
# fill missing values
X.fillna(X.mean(), inplace=True)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.270 | 0.360000 | 20.70 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.000000 | 0.450000 | 8.800000 |
| 1 | 6.3 | 0.300 | 0.340000 | 1.60 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.300000 | 0.490158 | 9.500000 |
| 2 | 8.1 | 0.280 | 0.400000 | 6.90 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.260000 | 0.490158 | 10.100000 |
| 3 | 7.2 | 0.230 | 0.320000 | 8.50 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.190000 | 0.400000 | 9.900000 |
| 4 | 7.2 | 0.230 | 0.320000 | 8.50 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.190000 | 0.400000 | 9.900000 |
| 5 | 8.1 | 0.280 | 0.400000 | 6.90 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.260000 | 0.440000 | 10.100000 |
| 6 | 6.2 | 0.320 | 0.334031 | 7.00 | 0.045 | 30.0 | 136.0 | 0.99490 | 3.188762 | 0.470000 | 9.600000 |
| 7 | 7.0 | 0.270 | 0.360000 | 20.70 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.000000 | 0.450000 | 8.800000 |
| 8 | 6.3 | 0.300 | 0.340000 | 1.60 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.300000 | 0.490158 | 9.500000 |
| 9 | 8.1 | 0.220 | 0.430000 | 1.50 | 0.044 | 28.0 | 129.0 | 0.99380 | 3.220000 | 0.450000 | 11.000000 |


In [6]:
from sklearn.neighbors import KNeighborsClassifier

In [7]:
knn_class = KNeighborsClassifier(n_neighbors=10, metric='l1')

In [8]:
from sklearn.metrics import accuracy_score

In [9]:
# train model
knn_class.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='l1',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [10]:
# get accuracy on the training data (the same rows the model was fitted on)
accuracy_score(y, knn_class.predict(X))

0.76521028991425066
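The score above is measured on the same rows the model was fitted on, so it tends to be optimistic. One way to get a less biased estimate is cross-validation. The sketch below uses synthetic data as a stand-in for the wine features (an assumption, since the CSV is not reproduced here); with the real data you would pass `X` and `y` instead of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the wine data (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=300, n_features=11,
                                     n_informative=6, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=10, metric='l1')

# 5-fold cross-validation: each fold is scored on rows the model did not see
cv_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=5, scoring='accuracy')
print(cv_scores.mean())
```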

### Hyperparameters of kNN
    
#### Value of k
[Discussed above](#How-do-we-choose-the-factor-K?)

#### Distance metric
Defines how the distance between two points is calculated, for example "l1" (Manhattan) or "l2" (Euclidean). The best choice depends on the dataset.
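To make the difference concrete, here is a small sketch of the two distances in plain Python. The helper names `l1_distance` and `l2_distance` are made up for this example.

```python
import math

def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p, q = (0.0, 0.0), (3.0, 4.0)
print(l1_distance(p, q))  # 7.0
print(l2_distance(p, q))  # 5.0
```

Because the two metrics rank neighbors differently, the same K can produce different predictions depending on which one is used.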

In [11]:
knn_class = KNeighborsClassifier(n_neighbors=5, metric='l1')

In [12]:
knn_class.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='l1',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

By changing the value of k from 10 to 5, the accuracy increases by 5%.

---------------
Now let's implement kNN on our [practice hackathon](https://datahack.analyticsvidhya.com/contest/datahack-hour-bike-sharing/).

In [13]:
train = pd.read_csv('train_ysMSKmQ.csv')
test = pd.read_csv('test_uLBXQQR.csv')

In [14]:
train.head()

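| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01/01/11 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 16 |
| 1 | 2 | 01/01/11 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 40 |
| 2 | 3 | 01/01/11 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 32 |
| 3 | 4 | 01/01/11 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 13 |
| 4 | 5 | 01/01/11 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 1 |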


In [15]:
X = train.drop(['instant', 'dteday', 'cnt'], axis=1)
y = train.cnt

X_test = test.drop(['instant', 'dteday'], axis=1)

In [16]:
from sklearn.neighbors import KNeighborsRegressor

In [17]:
knn_reg = KNeighborsRegressor(n_neighbors=10, metric='l1')

In [18]:
knn_reg.fit(X, y)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='l1',
          metric_params=None, n_jobs=1, n_neighbors=10, p=2,
          weights='uniform')

In [19]:
pred = knn_reg.predict(X_test)
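For intuition about what the regressor computes: with `weights='uniform'`, each prediction is the plain average of the targets of the k nearest training rows. The sketch below reproduces that idea in plain Python with the L1 distance. The rows, counts, and the helper `knn_regress` are made up for illustration and are not part of scikit-learn.

```python
def knn_regress(train_points, train_targets, query, k):
    """Predict the mean target of the k training points closest to `query`."""
    # Manhattan (L1) distance, the same metric as metric='l1'
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    nearest = sorted(range(len(train_points)),
                     key=lambda i: l1(train_points[i], query))[:k]
    # Uniform weights: every neighbor contributes equally to the average
    return sum(train_targets[i] for i in nearest) / k

# Made-up (hour, temperature) rows and rental counts
rows = [(0, 0.24), (1, 0.22), (2, 0.22), (8, 0.30), (9, 0.32)]
counts = [16, 40, 32, 120, 150]

estimate = knn_regress(rows, counts, query=(1, 0.23), k=3)
print(estimate)  # mean of 40, 16 and 32
```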

In [20]:
# create submission file
submission = pd.DataFrame({'instant': test.instant, 'cnt': pred})

submission.to_csv('submission.csv', index=False)

submission.head()

| | instant | cnt |
|---|---|---|
| 0 | 13036 | 398.0 |
| 1 | 13037 | 208.3 |
| 2 | 13038 | 165.5 |
| 3 | 13039 | 167.2 |
| 4 | 13040 | 192.9 |


**Exercise**:

Q1. Get the best possible score in the practice problem by tuning hyperparameters of kNN.
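One possible starting point for the exercise is a cross-validated grid search. The sketch below runs on synthetic regression data rather than the hackathon files, and both the grid values and the scoring metric (`neg_mean_squared_error`) are assumptions; check the contest page for the metric actually used.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the bike-sharing features (assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=300, n_features=12, noise=10.0, random_state=0)

# Candidate hyperparameters: k, distance metric, and neighbor weighting
param_grid = {
    "n_neighbors": [3, 5, 10, 20],
    "metric": ["l1", "l2"],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validated search; scikit-learn maximises the score, hence the negated MSE
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

After fitting, `search.best_estimator_` is a regressor refitted on all of the data with the best combination found, so it can be used directly for `predict`.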

That's all for today!
----------------
-------------------------------
<img src="AV_Datafest_logo.png" style="width: 200px;height: 200px"/>
[www.analyticsvidhya.com](www.analyticsvidhya.com)

DATAFEST 2017