# DAT19 Class 5 - Model Evaluation


# Cross Validation with KNN

A big part of this lab is understanding general sklearn syntax. Each family of classification algorithms has its own knobs and levers for tuning, but all sklearn models share a common structure that will help you as you move forward.
1. All models need to be trained. Sklearn models have a `.fit` method for doing so.
2. We then use the trained model to make a guess. The `.predict` method takes data and returns the model's guess for each value. The details of what it accepts and returns depend on the specific model.
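
The two steps above can be sketched in a few lines. This is only an illustration: the names `demo_model` and `demo_guess` and the choice of 3 neighbors are arbitrary assumptions, not part of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# Step 1: train the model with .fit (features first, labels second)
demo_model = KNeighborsClassifier(n_neighbors=3)  # 3 is an arbitrary choice
demo_model.fit(X_demo, y_demo)

# Step 2: ask the trained model for its guesses with .predict
demo_guess = demo_model.predict(X_demo[:5])  # one guess per row passed in
print(demo_guess)
```

Other sklearn classifiers follow the same `.fit` then `.predict` pattern, so swapping in a different model usually changes only the line that constructs it.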

Last time, we imported our data from the UCI Machine Learning repository using pandas. Scikit-learn also includes some well-known datasets. So, for convenience, we will import the iris data set from sklearn this time.

In [None]:
import numpy as np

In [None]:
# from the datasets load the iris data into a variable called iris
from sklearn import datasets

sk_iris = datasets.load_iris()

In [None]:
type(sk_iris)

In [None]:
print(sk_iris)

In [None]:
help(sk_iris)

That's interesting:
```Container object for datasets: dictionary-like object that exposes its keys as attributes.```
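
To see what "exposes its keys as attributes" means in practice, here is a small sketch. The variable names `bunch_demo` and `same_data` are made up for this example.

```python
from sklearn import datasets

bunch_demo = datasets.load_iris()

# Dictionary-style access and attribute-style access reach the same array
same_data = bunch_demo['data'] is bunch_demo.data
print(same_data)

# The available keys can be listed just like a dictionary's
print(sorted(bunch_demo.keys()))
```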

In [None]:
print(sk_iris['DESCR'])

In [None]:
sk_iris['feature_names']

In [None]:
sk_iris['target_names']

Remember last time when we put all the features in a matrix and the labels (what we are trying to predict) into a vector?

Let's re-assign the data to standard named variables. Sklearn makes this very easy.

In [None]:
X = sk_iris.data
y = sk_iris.target
Names = sk_iris.target_names

In [None]:
#print(type(X))
#print(np.shape(X))
#print(X)

In [None]:
#print(type(Names))
#print(np.shape(Names))
#print(Names)

Now we get into cross validation! The first step is to split the data into a training set and a test set.

In [None]:
# is there a function to do that in sklearn?
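# (in current scikit-learn releases this function lives in sklearn.model_selection;
#  the older sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split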

In [None]:
ind = list(range(150)) #What data structure is ind? What is its shape?
np.random.shuffle(ind) #Why must we randomly shuffle the (indices for the) training data before splitting it?
test_ind = ind[:150//5] #Would this work if 20% of the number of records were not an integer?
train_ind = ind[150//5:]
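
One way to think about the last question: slice positions must be integers, so the test-set size has to be rounded somehow. The sketch below assumes a hypothetical record count of 153 (not divisible by 5) and rounds down with floor division. All names here are made up for illustration.

```python
import numpy as np

n_records = 153  # hypothetical size that is not divisible by 5
demo_ind = np.random.permutation(n_records)  # shuffled copy of 0..n_records-1

n_test = n_records // 5  # floor division always yields an integer
demo_test_ind = demo_ind[:n_test]
demo_train_ind = demo_ind[n_test:]

print(len(demo_test_ind), len(demo_train_ind))
```

Rounding down means the test set is slightly smaller than 20% here. Every record still lands in exactly one of the two sets.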

In [None]:
print(test_ind)
print('length of test index is ' + str(len(test_ind)))
print('\n')
print(train_ind)
print('length of training index is ' + str(len(train_ind)))

In [None]:
X_train = []
y_train = []

X_test = []
y_test = []

# use i as the loop variable so we don't overwrite the list named ind
for i in test_ind:
    X_test.append(X[i])
    y_test.append(y[i])

for i in train_ind:
    X_train.append(X[i])
    y_train.append(y[i])
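
The loops above build plain Python lists one row at a time. Since the iris features and labels are NumPy arrays, the same selection can be done in one step by indexing with an array of integers. This is a sketch of that alternative. The fixed seed and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn import datasets

iris_demo = datasets.load_iris()
X_all, y_all = iris_demo.data, iris_demo.target

rng = np.random.RandomState(0)  # fixed seed so the split is repeatable
shuffled = rng.permutation(len(X_all))
n_holdout = len(X_all) // 5
holdout_idx, fit_idx = shuffled[:n_holdout], shuffled[n_holdout:]

# Indexing with an integer array selects all of those rows at once
X_holdout, y_holdout = X_all[holdout_idx], y_all[holdout_idx]
X_fit, y_fit = X_all[fit_idx], y_all[fit_idx]

print(X_fit.shape, X_holdout.shape)
```

The results stay NumPy arrays, so `.shape` is available for checking that the split has the sizes you expect.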

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.20, random_state=0)

Wait a minute, what's going on with the syntax above? Does anything about it look unusual to you?
Let's take a look at the [function documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html) and the [user guide](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

In [None]:
tts_return = train_test_split( X, y, test_size=0.20, random_state=0)
print(len(tts_return))
print(type(tts_return))
#tts_return

In [None]:
X_train

In [None]:
y_train

In [None]:
y_test

In [None]:
#Quick Question: How can we double check that we got the number of features and labels that we expected?


Now, we'll train our model and use it to make predictions, following the steps we outlined last time.

In [None]:
# Import the KNN classifier that we will train on the training data
from sklearn.neighbors import KNeighborsClassifier

In [None]:
myknn = KNeighborsClassifier(2).fit(X_train,y_train)

Let's figure out how good our model is. The traditional score is the percentage of labels the model identified correctly. This is called **accuracy**. (It is not the same as **precision**, which measures, for a single class, what fraction of the predictions of that class were correct.) There are other types of statistical scores, but we will start here. We'll ask our model to predict the labels for our test set, then generate a score.

In [None]:
predictions = myknn.predict(X_test)
predictions

In [None]:
correct = 0

# compare each true label with the model's prediction
for a,b in zip(y_test,myknn.predict(X_test)):
    if a == b:
        correct += 1

print("Number correct: " + str(correct))
print("Score: " + str(float(correct)/len(y_test)))

That was easy enough. Sklearn also has an easy method for generating a score. 

In [None]:
myknn.score(X_test, y_test)
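
For a classifier, `.score` reports the same accuracy we computed by hand. The sketch below rebuilds the split and model so that it runs on its own, then compares three ways of getting that number. Names such as `knn_demo` and `by_hand` are made up for this example, and `train_test_split` is imported from `sklearn.model_selection`, which is where current scikit-learn releases keep it.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_demo = datasets.load_iris()
Xa, Xb, ya, yb = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.20, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=2).fit(Xa, ya)
pred_demo = knn_demo.predict(Xb)

by_hand = np.mean(pred_demo == yb)                  # fraction of matching labels
by_metrics = metrics.accuracy_score(yb, pred_demo)  # same quantity from sklearn.metrics
by_score = knn_demo.score(Xb, yb)                   # same quantity from the model itself

print(by_hand, by_metrics, by_score)
```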

Sklearn can also show more information about the predictions. Here, we use `sklearn.metrics.classification_report` to generate a more informative picture: precision, recall, f1-score, and support for each class. The Wikipedia page on precision and recall is a good place to start if you want to understand these in more depth.

https://en.wikipedia.org/wiki/Precision_and_recall

In [None]:
from sklearn import metrics

print(metrics.classification_report([sk_iris['target_names'][label] for label in y_test],
                                    [sk_iris['target_names'][label] for label in myknn.predict(X_test)]))
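
To see where the columns of the report come from, the sketch below computes per-class precision, recall, and support directly from a confusion matrix. It rebuilds the split and model so that it runs on its own, and all variable names are made up for illustration. It assumes every class is predicted at least once, since otherwise the precision division would hit a zero.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_demo = datasets.load_iris()
Xa, Xb, ya, yb = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.20, random_state=0)
pred_demo = KNeighborsClassifier(n_neighbors=2).fit(Xa, ya).predict(Xb)

# Rows are the true classes, columns are the predicted classes
cm = metrics.confusion_matrix(yb, pred_demo)
true_pos = np.diag(cm).astype(float)

precision = true_pos / cm.sum(axis=0)  # correct predictions of a class / all predictions of it
recall = true_pos / cm.sum(axis=1)     # correct predictions of a class / all true members of it
support = cm.sum(axis=1)               # number of true members of each class

for name, p, r, s in zip(iris_demo.target_names, precision, recall, support):
    print(name, round(p, 2), round(r, 2), s)
```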

## Exercise

#### 1. How does the model perform as we increase the number of neighbors?  To answer this, plot the score as a function of the number of neighbors.

In [None]:
# Create a list of the various numbers of neighbors to use to build models
# Create training and test sets
# Iterate through that list and for each number of neighbors:
#    Build a KNN model
#    Evaluate it
#    Record the score with the number of neighbors for that model
# Plot results

#### 2. Do different train/test splits affect our score (accuracy)? How much do the scores vary each time you shuffle and split?