### KNN (K-Nearest Neighbors) Lab

In this lab we will classify whether a wine is HIGH or LOW quality. This is a classification task because we are discriminating between two discrete options. As input, we will use the features of each wine together with previously assigned HIGH or LOW quality labels.

In [37]:
import pandas as pd
import seaborn as sns
%matplotlib inline


# Load in the dataset
df = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,is_red,high_quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,1.0,0.0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,1.0,0.0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,1.0,0.0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,1.0,0.0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,1.0,0.0


#### Classifying high quality wines
- In this dataset, `quality` is a numeric column; if we were performing regression, we could predict this value directly.
- Since we are performing classification, we will predict `high_quality`, a binary label that is either 1 or 0.
- What is the baseline accuracy that we should attempt to beat?
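
A common choice of baseline is the majority-class rate: the accuracy obtained by always predicting the most frequent label. The sketch below computes it for a small made-up list of labels. The helper name `baseline_accuracy` and the toy labels are illustrative assumptions, not part of the lab.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the most frequent label
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Toy labels (hypothetical, not taken from the wine dataset)
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0, 0, 1]
print(baseline_accuracy(toy_labels))
```

On the real data, the same quantity could be obtained with `df['high_quality'].value_counts(normalize=True).max()`. A classifier is only useful if its cross-validated accuracy exceeds this number.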

In [2]:
# ???



In [3]:
# TODO

#### What features are important to predict high quality wines?
- Let's explore the dataset

In [11]:
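# Correlation matrix of the numeric columns (the non-numeric `color` column is excluded)
wine_plot = df.select_dtypes(include='number')
wine_plot.corr()
# Candidate features based on correlation with high_quality: alcohol, density, chlorides, volatile acidity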

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,is_red,high_quality
fixed_acidity,1.0,0.219008,0.324436,-0.111981,0.298195,-0.282735,-0.329054,0.45891,-0.2527,0.299568,-0.095452,-0.076743,0.48674,-0.049447
volatile_acidity,0.219008,1.0,-0.377981,-0.196011,0.377124,-0.352557,-0.414476,0.271296,0.261454,0.225984,-0.03764,-0.265699,0.653036,-0.151714
citric_acid,0.324436,-0.377981,1.0,0.142451,0.038998,0.133126,0.195242,0.096154,-0.329808,0.056197,-0.010493,0.085532,-0.187397,0.054444
residual_sugar,-0.111981,-0.196011,0.142451,1.0,-0.12894,0.402871,0.495482,0.552517,-0.26732,-0.185927,-0.359415,-0.03698,-0.348821,-0.063992
chlorides,0.298195,0.377124,0.038998,-0.12894,1.0,-0.195045,-0.27963,0.362615,0.044708,0.395593,-0.256916,-0.200666,0.512678,-0.161781
free_sulfur_dioxide,-0.282735,-0.352557,0.133126,0.402871,-0.195045,1.0,0.720934,0.025717,-0.145854,-0.188457,-0.179838,0.055463,-0.471644,0.014767
total_sulfur_dioxide,-0.329054,-0.414476,0.195242,0.495482,-0.27963,0.720934,1.0,0.032395,-0.238413,-0.275727,-0.26574,-0.041385,-0.700357,-0.051226
density,0.45891,0.271296,0.096154,0.552517,0.362615,0.025717,0.032395,1.0,0.011686,0.259478,-0.686745,-0.305858,0.390645,-0.275441
pH,-0.2527,0.261454,-0.329808,-0.26732,0.044708,-0.145854,-0.238413,0.011686,1.0,0.192123,0.121248,0.019506,0.329129,0.028149
sulphates,0.299568,0.225984,0.056197,-0.185927,0.395593,-0.188457,-0.275727,0.259478,0.192123,1.0,-0.003029,0.038485,0.487218,0.033971


#### Build K-Nearest Neighbors model to predict whether or not a wine is high quality
- Select features you think will be predictive of high quality wines
- Scale the dataset (remember, this is so that each variable contributes equally to the distance computation)
- Evaluate the accuracy of your model using cross-validation
- Evaluate different values of `n` to see how the number of neighbors affects the classification accuracy
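
To see why scaling matters for the distance computation, here is a minimal pure-Python sketch of min-max scaling followed by a Euclidean distance. The helpers `min_max_scale` and `euclidean` and the four (alcohol, density) pairs are hypothetical and chosen only for illustration; in the lab itself `MinMaxScaler` plays this role.

```python
import math

def min_max_scale(rows):
    """Rescale each column of `rows` (a list of equal-length lists) to [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance (Minkowski distance with p=2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical (alcohol, density) pairs, for illustration only
wines = [[9.4, 0.9978], [9.8, 0.9968], [9.4, 0.9900], [12.0, 0.9978]]
scaled = min_max_scale(wines)

# Unscaled: alcohol dominates, so wine 2 (same alcohol) looks nearest to wine 0
print(euclidean(wines[0], wines[1]), euclidean(wines[0], wines[2]))
# Scaled: the density gap now counts, so wine 1 becomes nearest to wine 0
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

Before scaling, density differences are numerically tiny compared with alcohol differences and are effectively ignored. After scaling, both features span the same range, and the nearest neighbor of the first wine changes.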

In [50]:
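from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold  # replaces the removed sklearn.cross_validation module
import numpy as np
from sklearn import preprocessing


scaler = preprocessing.MinMaxScaler()

def accuracy_crossvalidator(X, Y, knn, cv_indices):
    """Fit `knn` on each training fold and print per-fold and mean test accuracy."""
    scores = []
    for train_i, test_i in cv_indices:
        X_train = X[train_i, :]
        X_test = X[test_i, :]

        Y_train = Y[train_i]
        Y_test = Y[test_i]

        knn.fit(X_train, Y_train)

        acc = knn.score(X_test, Y_test)
        scores.append(acc)
        print('Fold accuracy:', acc)

    print('Mean CV accuracy:', np.mean(scores))

# The target is already a 0/1 label, so only the features are scaled.
Y = df['high_quality'].values
X = scaler.fit_transform(df[['alcohol','density','volatile_acidity','chlorides']].values)

# Build the folds after X and Y exist; storing them in a list lets later cells reuse the same folds.
cv_indices = list(StratifiedKFold(n_splits=5).split(X, Y))

# `n` was undefined in the original cell; 10 is an assumed value, not necessarily the one that produced the output below.
n = 10
knn = KNeighborsClassifier(n_neighbors=n,
                           weights='uniform',
                           p=2,
                           metric='minkowski')

accuracy_crossvalidator(X, Y, knn, cv_indices)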

# Note: `quality` maps exactly onto `high_quality` (every wine with quality 7 has high_quality 1.0), so `quality` should not be used as a feature.

('Fold accuracy:', 0.8107692307692308)
('Fold accuracy:', 0.80153846153846153)
('Fold accuracy:', 0.798306389530408)
('Fold accuracy:', 0.79676674364896072)
('Fold accuracy:', 0.77983063895304083)
('Mean CV accuracy:', 0.79744229288802049)




In [51]:
# The target is already a 0/1 label, so only the features are scaled.
Y = df['high_quality'].values
X = scaler.fit_transform(df[['alcohol','density','volatile_acidity','chlorides']].values)


# Compare cross-validated accuracy for 1 to 10 neighbors
for n in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=n,
                               weights='uniform',
                               p=2,
                               metric='minkowski')
    print("Neighbors: %s" % n)
    accuracy_crossvalidator(X, Y, knn, cv_indices)

Neighbors: 1
('Fold accuracy:', 0.8092307692307692)
('Fold accuracy:', 0.76923076923076927)
('Fold accuracy:', 0.76135488837567356)
('Fold accuracy:', 0.73672055427251737)
('Fold accuracy:', 0.69668976135488836)
('Mean CV accuracy:', 0.75464534849292342)
Neighbors: 2
('Fold accuracy:', 0.80615384615384611)
('Fold accuracy:', 0.80461538461538462)
('Fold accuracy:', 0.79522709776751344)
('Fold accuracy:', 0.77906081601231714)
('Fold accuracy:', 0.75673595073133182)
('Mean CV accuracy:', 0.78835861905607862)
Neighbors: 3
('Fold accuracy:', 0.80692307692307697)
('Fold accuracy:', 0.78538461538461535)
('Fold accuracy:', 0.78906851424172442)
('Fold accuracy:', 0.76212471131639725)
('Fold accuracy:', 0.72902232486528096)
('Mean CV accuracy:', 0.77450464854621903)
Neighbors: 4
('Fold accuracy:', 0.80538461538461537)
('Fold accuracy:', 0.79692307692307696)
('Fold accuracy:', 0.815242494226328)
('Fold accuracy:', 0.7906081601231717)
('Fold accuracy:', 0.7575057736720554)
('Mean CV accuracy:', 0.

In [None]:
# It is unclear why even numbers of neighbors seem to do better than odd ones. In any case, 2-NN is almost as good
# as 10-NN, so I would use 2.
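
One factor that could contribute to an even/odd difference is how tied votes are resolved: with an even number of neighbors and two classes, the vote can split evenly. The sketch below assumes ties go to the smallest label, as a mode-style vote would. The function `vote` is hypothetical, and whether this accounts for the results above is not established here.

```python
from collections import Counter

def vote(neighbor_labels):
    """Majority vote over neighbor labels; ties go to the smallest label (an assumption)."""
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

print(vote([0, 1]))        # even k, tied vote
print(vote([1, 1, 0]))     # odd k, no tie is possible with two classes
print(vote([1, 0, 1, 0]))  # even k, tied vote
```

Under that assumption, every tie is resolved as class 0. If class 0 is also the more common label, tied votes would lean toward the majority class, which may raise accuracy without the model discriminating any better.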