In this lab, we will fit k-nearest neighbors (kNN) classifiers and look for the 'k' value that best predicts whether a wine is of high quality.

We will use the following features: 'density', 'sulphates', 'residual sugar'.

Please check 'k' values from 1 to 50 in order to determine the best 'k' value.

Below is the starter code, which loads the demo data hosted on Amazon Web Services (S3):

In [1]:
import numpy as np
import pandas as pd
import pylab as pl
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")

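# Random boolean mask that is True for roughly 30% of the rows.
# Note: the rows where the mask is True (about 30%) form the training set here,
# and the remaining rows (about 70%) form the test set.
train_mask = np.random.uniform(0, 1, len(df)) <= 0.3
train = df[train_mask]
test = df[~train_mask]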

features = ['density', 'sulphates', 'residual_sugar']
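
Because the random mask in the cell above is drawn without a seed, every run produces a different split and therefore slightly different accuracies. A minimal sketch of a seeded version follows. The seed value 42, the stand-in row count, and the helper name `split_mask` are assumptions for illustration and are not part of the lab.

```python
import numpy as np

def split_mask(n_rows, fraction=0.3, seed=42):
    """Return a boolean mask that is True for roughly `fraction` of the rows.

    Hypothetical helper: seeding the generator makes the split repeatable.
    """
    rng = np.random.RandomState(seed)
    return rng.uniform(0, 1, n_rows) <= fraction

n_rows = 1000  # stand-in for len(df)
mask = split_mask(n_rows)
same_mask = split_mask(n_rows)

print("rows where mask is True:", mask.sum())
print("rows where mask is False:", (~mask).sum())
print("identical across calls:", np.array_equal(mask, same_mask))
```

With a fixed seed, the same rows land in each subset on every run, which makes the accuracy numbers comparable between runs.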



In [7]:
#Filtering the train and test sets
train_features = train[features]

test_features = test[features]


In [10]:
#high quality is our target, and is binary.
train_target = train.high_quality

test_target = test.high_quality

In [19]:
#Fit one KNeighborsClassifier for every k from 1 to 50, keyed by k.
neigh = {}
for k in range(1, 51):
    neigh[k] = KNeighborsClassifier(n_neighbors=k).fit(train_features, train_target)
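
To make clear what each fitted classifier does at prediction time, here is a minimal pure-Python sketch of the kNN idea: find the k closest training points and take a majority vote. The function name `knn_predict` and the toy data are made up for illustration. scikit-learn's implementation is more sophisticated (faster neighbor search, its own tie handling), so this only shows the principle.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # most_common(1) returns the label with the highest count.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy, made-up data: two features per point, binary label.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.15, 0.15), k=3))
print(knn_predict(points, labels, (0.95, 1.0), k=3))
```

A small k follows the nearest points closely and can be noisy, while a large k averages over a wider neighborhood, which is why the lab scans a range of values.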

In [25]:
#Score each fitted model on the test set and record its accuracy by k.
scores = {}
for k in range(1, 51):
    scores[k] = neigh[k].score(test_features, test_target)
    print("The accuracy of KNN classification of %i neighbors on this test set is: %f" % (k, scores[k]))
    print("")


The accuracy of KNN classification of 1 neighbors on this test set is: 0.738943

The accuracy of KNN classification of 2 neighbors on this test set is: 0.787702

The accuracy of KNN classification of 3 neighbors on this test set is: 0.752751

The accuracy of KNN classification of 4 neighbors on this test set is: 0.779072

The accuracy of KNN classification of 5 neighbors on this test set is: 0.769579

The accuracy of KNN classification of 6 neighbors on this test set is: 0.792017

The accuracy of KNN classification of 7 neighbors on this test set is: 0.780798

The accuracy of KNN classification of 8 neighbors on this test set is: 0.793096

The accuracy of KNN classification of 9 neighbors on this test set is: 0.789213

The accuracy of KNN classification of 10 neighbors on this test set is: 0.796764

The accuracy of KNN classification of 11 neighbors on this test set is: 0.789213

The accuracy of KNN classification of 12 neighbors on this test set is: 0.793528

[output truncated]

In [30]:

best_k = max(scores, key=scores.get)
print("The number of K nearest neighbors with the best accuracy score is:", best_k)
print("The score itself is", scores[best_k])

The number of K nearest neighbors with the best accuracy score is: 38
The score itself is 0.800862998921
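
The expression `max(scores, key=scores.get)` iterates over the dictionary's keys and returns the one whose value is largest. The sketch below applies it to only the twelve accuracies visible in the printed output above, so its answer covers k = 1 to 12 and is not the overall best k of 38. The names `scores_shown` and `ranked` are introduced here for illustration.

```python
# Accuracies for k = 1..12, copied from the printed output above.
scores_shown = {
    1: 0.738943, 2: 0.787702, 3: 0.752751, 4: 0.779072,
    5: 0.769579, 6: 0.792017, 7: 0.780798, 8: 0.793096,
    9: 0.789213, 10: 0.796764, 11: 0.789213, 12: 0.793528,
}

# max() iterates over the keys and compares them by their score.
best_k_shown = max(scores_shown, key=scores_shown.get)

# Rank all k values from best to worst to see how close the runners-up are.
ranked = sorted(scores_shown.items(), key=lambda item: item[1], reverse=True)

print("best k among those shown:", best_k_shown)
for k, acc in ranked[:3]:
    print(k, acc)
```

Looking at the runners-up is useful here because the top accuracies differ only in the third decimal place, so the "best" k may change with a different random split.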


### Logistic Regression.

Now that we've run k-nearest neighbors and found that 38 is the best 'k' value, let's look at logistic regression. We will start with sklearn, whose LogisticRegression applies L2 regularization by default (penalty='l2', C=1.0).
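
For reference, the L2-penalized objective that the scikit-learn documentation gives for binary logistic regression can be written as below. This assumes labels coded as $y_i \in \{-1, 1\}$, a weight vector $w$, an intercept $c$, and feature vectors $x_i$. Solver-specific details, such as how the intercept is treated, are left out of this sketch.

```latex
\min_{w,\, c} \;\; \frac{1}{2} w^{\top} w
  \;+\; C \sum_{i=1}^{n} \log\!\left( 1 + \exp\!\left( -y_i \left( x_i^{\top} w + c \right) \right) \right)
```

The parameter $C$ multiplies the data-fit term, so a smaller $C$ means stronger regularization and $C = 1.0$ is the default used here.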

In [32]:
from sklearn import linear_model

logit = linear_model.LogisticRegression()

In [33]:
#Setting the model.
model = logit.fit(train_features, train_target)
predict_logit = model.predict(test_features)

In [84]:
#What's the accuracy of logistic with regularization?

logit_accuracy = logit.score(test_features, test_target)
print("The accuracy of the logit is:", logit_accuracy)
print("")
print("The difference in accuracy between this logistic and the k nearest neighbors is:", logit_accuracy - max(scores.values()))

The accuracy of the logit is: 0.800647249191

The difference in accuracy between this logistic and the k nearest neighbors is: -0.000215749730313


In [49]:
model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Logistic Regression continued...

So we looked at logistic regression using sklearn. I am actually interested in seeing the regression output, so I am going to fit the same kind of model with statsmodels.

I will fit the plain logistic regression first and then try to regularize it.

In [46]:
import statsmodels.api as sm

In [47]:
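#Setting the logistic model. The target is binary, so Logit is sufficient.
#Note: sm.Logit does not add an intercept on its own, so this model has no constant term.

logit_quality = sm.Logit(train_target, train_features)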

In [48]:
logit_quality_results = logit_quality.fit()
logit_quality_results.summary()

Optimization terminated successfully.
         Current function value: 0.479912
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | high_quality | No. Observations: | 1862 |
| Model: | Logit | Df Residuals: | 1859 |
| Method: | MLE | Df Model: | 2 |
| Date: | Wed, 04 May 2016 | Pseudo R-squ.: | 0.01174 |
| Time: | 00:18:50 | Log-Likelihood: | -893.6 |
| converged: | True | LL-Null: | -904.22 |
| | | LLR p-value: | 2.443e-05 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| density | -1.9022 | 0.249 | -7.647 | 0.000 | -2.390 -1.415 |
| sulphates | 1.1641 | 0.399 | 2.921 | 0.003 | 0.383 1.945 |
| residual_sugar | -0.0366 | 0.014 | -2.674 | 0.007 | -0.063 -0.010 |
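
To show how these coefficients turn into a predicted probability, the sketch below plugs them into the logistic function by hand. The coefficients are copied from the summary output above. The feature values for the single wine are hypothetical and chosen only for illustration.

```python
import math

# Coefficients copied from the summary output above (no intercept in this model).
coef = {"density": -1.9022, "sulphates": 1.1641, "residual_sugar": -0.0366}

# Hypothetical feature values for a single wine, chosen for illustration only.
wine = {"density": 0.9968, "sulphates": 0.68, "residual_sugar": 2.6}

# Linear predictor: sum of coefficient * feature value.
linear = sum(coef[name] * wine[name] for name in coef)

# The logistic (sigmoid) function maps the linear predictor to a probability.
probability = 1.0 / (1.0 + math.exp(-linear))

print("linear predictor:", round(linear, 4))
print("predicted probability of high quality:", round(probability, 4))
```

A negative coefficient, such as the one on density, lowers the predicted probability as the feature grows. A positive one, such as sulphates, raises it.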


In [75]:
#How does this model do on the test data?

predictions = pd.Series(logit_quality_results.predict(test_features), index = test_target.index)
combined = pd.concat([test_target, predictions], axis = 1, ignore_index = False)
combined.columns = ['high quality actual', 'predictions']

In [77]:
combined['difference'] = combined['high quality actual'] - combined.predictions

In [80]:
combined

| | high quality actual | predictions | difference |
|---|---|---|---|
| 1 | 0.0 | 0.231533 | -0.231533 |
| 2 | 0.0 | 0.227233 | -0.227233 |
| 3 | 0.0 | 0.215387 | -0.215387 |
| 4 | 0.0 | 0.211542 | -0.211542 |
| 5 | 0.0 | 0.212153 | -0.212153 |
| 7 | 1.0 | 0.199611 | 0.800389 |
| 8 | 1.0 | 0.213195 | 0.786805 |
| 12 | 0.0 | 0.206757 | -0.206757 |
| 14 | 0.0 | 0.266163 | -0.266163 |
| 16 | 1.0 | 0.251785 | 0.748215 |
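
The statsmodels `predict` call returns probabilities, not class labels, so comparing it with sklearn's accuracy requires a cutoff first. The sketch below uses only the ten rows displayed above, not the full test set. The cutoff of 0.5 is an assumption, since the lab does not specify one.

```python
# The ten rows displayed above: (actual label, predicted probability).
rows = [
    (0, 0.231533), (0, 0.227233), (0, 0.215387), (0, 0.211542),
    (0, 0.212153), (1, 0.199611), (1, 0.213195), (0, 0.206757),
    (0, 0.266163), (1, 0.251785),
]

threshold = 0.5  # assumed cutoff; the lab does not specify one

predicted = [1 if p >= threshold else 0 for _, p in rows]
actual = [y for y, _ in rows]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(rows)
# Share of the majority class, i.e. the accuracy of always predicting it.
majority_share = max(actual.count(0), actual.count(1)) / len(actual)

print("predicted labels:", predicted)
print("accuracy on these rows:", accuracy)
print("majority-class share:", majority_share)
```

On these ten rows every probability is below the cutoff, so the model predicts "not high quality" each time and its accuracy equals the majority-class share. Whether the same holds for the full test set would need to be checked on the complete data.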


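The fit of this logistic model looks poor: the pseudo R-squared is only 0.01174, and the predicted probabilities shown above all fall between roughly 0.20 and 0.27, whatever the actual class.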
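
The pseudo R-squared and the LLR p-value in the summary can be reproduced from the two log-likelihoods it reports. The sketch below assumes the reported "Pseudo R-squ." is McFadden's pseudo R-squared and that the p-value comes from a likelihood-ratio test with Df Model degrees of freedom. The inputs are the rounded values from the summary, so the results agree only up to rounding.

```python
import math

# Values copied from the statsmodels summary output above.
log_likelihood = -893.6
ll_null = -904.22
df_model = 2

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null).
pseudo_r2 = 1.0 - log_likelihood / ll_null

# Likelihood-ratio statistic, compared against a chi-squared distribution
# with df_model degrees of freedom.
lr_stat = 2.0 * (log_likelihood - ll_null)

# For exactly 2 degrees of freedom the chi-squared survival function
# reduces to exp(-x / 2), so no extra library is needed here.
llr_p_value = math.exp(-lr_stat / 2.0)

print("pseudo R-squared:", round(pseudo_r2, 5))
print("LR statistic:", round(lr_stat, 2))
print("LLR p-value:", llr_p_value)
```

A pseudo R-squared this close to zero means the three features improve the log-likelihood only slightly over a model with no predictors, even though the small p-value says the improvement is statistically detectable.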

I am going to attempt regularization in statsmodels.

### Regularization in Statsmodels

In [81]:
#Regularization...

logit_reg_quality = logit_quality.fit_regularized()

Optimization terminated successfully.    (Exit mode 0)
            Current function value: 0.479911524195
            Iterations: 26
            Function evaluations: 28
            Gradient evaluations: 26


Note that 26 iterations occurred for the regularized fit, compared to six for the unregularized fit.

In [83]:
logit_reg_quality.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | high_quality | No. Observations: | 1862 |
| Model: | Logit | Df Residuals: | 1859 |
| Method: | MLE | Df Model: | 2 |
| Date: | Wed, 04 May 2016 | Pseudo R-squ.: | 0.01174 |
| Time: | 00:41:47 | Log-Likelihood: | -893.6 |
| converged: | True | LL-Null: | -904.22 |
| | | LLR p-value: | 2.443e-05 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| density | -1.9022 | 0.249 | -7.647 | 0.000 | -2.390 -1.415 |
| sulphates | 1.1641 | 0.399 | 2.921 | 0.003 | 0.383 1.945 |
| residual_sugar | -0.0366 | 0.014 | -2.674 | 0.007 | -0.063 -0.010 |


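The fit does not look good here either. The coefficients and the pseudo R-squared are identical to the unregularized fit, which is expected: fit_regularized was called with its default alpha of 0, so no penalty was actually applied.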

### Conclusion

Logistic regression with L2 regularization in sklearn and k-nearest neighbors both do a similarly good job of classifying whether a wine is high quality based on the three features passed, with test accuracies of about 80% (0.8006 for logistic regression and 0.8009 for kNN with k = 38).