# Validation 


### Why Use Training & Testing Data? 

- Gives estimate of performance on an independent dataset
 
- Serves as check on overfitting 


## sklearn train test split 
Here is an example of how to split the data into training and test sets and score a classifier on the held-out test set:
``` python 
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets, holding out 40% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Check the shapes of the resulting splits
X_train.shape, y_train.shape  # ((90, 4), (90,))
X_test.shape, y_test.shape    # ((60, 4), (60,))

# Fit a linear SVM classifier on the training set
clf = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)

# Evaluate accuracy on the held-out test set
clf.score(X_test, y_test)  # 0.96 . . .
```

## Training, Transforms, Predicting 

Suppose your overall workflow looks something like this:

$$ \textbf{Train/test split} \Rightarrow PCA \Rightarrow SVM $$

Remember that each step has its own methods:

- $PCA$: ```pca.fit``` and ```pca.transform```
- $SVC$: ```svc.fit``` and ```svc.predict```


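- ```pca.fit(training_features)```: fit PCA on the training data only, because we only want to look for patterns in the training data.
- ```pca.transform(training_features)```: to use the principal components, we need to transform the training features into the new representation.
- ```svc.fit(transformed_training_features, training_labels)```: train the classifier on the transformed training features.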

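The classifier is now trained, so we move on to the test features:

- ```pca.transform(test_features)```: apply the PCA that was fitted on the training data; do not call ```pca.fit``` again on the test data.
- ```svc.predict(transformed_test_features)```: predict using the transformed test features.

Making predictions on the test dataset is the whole point: it is how you estimate performance on data the model has not seen.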
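The order of operations above can be sketched end to end on the iris data. This is a minimal sketch, not part of the original notes: it assumes a scikit-learn version that provides ```sklearn.model_selection```, and the choice of ```n_components=2``` and the variable names are illustrative assumptions.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Fit PCA on the training features only, then transform them.
# n_components=2 is an arbitrary choice for illustration.
pca = PCA(n_components=2)
pca.fit(features_train)
features_train_pca = pca.transform(features_train)

# Train the classifier on the transformed training features.
svc = SVC(kernel="linear", C=1)
svc.fit(features_train_pca, labels_train)

# Transform the test features with the already-fitted PCA (no refit),
# then predict on the transformed test features.
features_test_pca = pca.transform(features_test)
predictions = svc.predict(features_test_pca)

print(predictions.shape)
print(svc.score(features_test_pca, labels_test))
```

Note that ```pca.fit``` is called exactly once, on the training features; the test features only ever pass through ```pca.transform```.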


## Cross-Validation 

**Problems with splitting into training and testing data:** 

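Suppose this is your data:

<img src= "v_images/example_1.png" style="width: 500px;"/> 

You would like to maximize both sets: as many points as possible in the training set to get the best learning result, and as many as possible in the test set to get the best validation. There is an inherent trade-off, because every data point placed in the test set is lost from the training set.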

### K-fold cross validation 

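Partition the data into $k$ folds, then run $k$ separate learning experiments. In each experiment:

- pick one fold as the testing set
- use the remaining $k-1$ folds as the training set
- train on the training set and test on the testing set

Average the testing results from the $k$ experiments.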
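The procedure above might look like the following on the iris data. This is a sketch under assumptions: it uses ```KFold``` from ```sklearn.model_selection```, picks $k = 5$ arbitrarily, and shuffles the data because the iris samples are stored sorted by class.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.svm import SVC

iris = datasets.load_iris()
features, labels = iris.data, iris.target

k = 5  # number of folds; an arbitrary choice for illustration
kf = KFold(n_splits=k, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(features):
    # Train on the k-1 training folds, test on the held-out fold.
    clf = SVC(kernel="linear", C=1)
    clf.fit(features[train_idx], labels[train_idx])
    scores.append(clf.score(features[test_idx], labels[test_idx]))

# Average the testing results from the k experiments.
mean_score = np.mean(scores)
print(scores)
print(mean_score)
```

Every data point is used for testing exactly once and for training $k-1$ times, which is how k-fold sidesteps the training/testing trade-off at the cost of $k$ times the training work.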

### Practical advice for k-fold in Sklearn 

If the original data comes in some sorted order, we will want to shuffle the data points before splitting them into folds, or otherwise randomly assign data points to each fold. With ```KFold()```, this is done by passing ```shuffle=True``` when setting up the cross-validation object.

If we have concerns about class imbalance, we can use the ```StratifiedKFold()``` class instead. Where ```KFold()``` assigns points to folds without attention to output class, ```StratifiedKFold()``` assigns data points to folds so that each fold has approximately the same proportion of each output class as the full dataset. This is most useful when the outcome classes have imbalanced numbers of data points (e.g. one is rare compared to the others). This class also accepts ```shuffle=True``` to shuffle the data points before splitting into folds.
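To see the difference, the sketch below counts how many points of each class land in each test fold. It assumes the ```sklearn.model_selection``` versions of both classes; the choice of 3 folds and the variable names are illustrative. The iris data is sorted by class (50 points per class), which makes the effect of unshuffled ```KFold()``` easy to see.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold

iris = datasets.load_iris()
features, labels = iris.data, iris.target

# Plain KFold without shuffling: on class-sorted data, each test fold
# ends up containing a single class.
plain = KFold(n_splits=3)
plain_counts = [np.bincount(labels[test_idx], minlength=3)
                for _, test_idx in plain.split(features)]

# StratifiedKFold needs the labels so it can balance classes across folds.
strat = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
strat_counts = [np.bincount(labels[test_idx], minlength=3)
                for _, test_idx in strat.split(features, labels)]

print("KFold, no shuffle:", [c.tolist() for c in plain_counts])
print("StratifiedKFold:  ", [c.tolist() for c in strat_counts])
```

With unshuffled ```KFold()```, each classifier would be tested on a class it never saw in training, so the resulting scores would be meaningless; shuffling or stratifying avoids this.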


## GridSearch CV in sklearn 

GridSearchCV is a way of systematically working through multiple combinations of parameter tunes, cross-validating as it goes to determine which tune gives the best performance. The beauty is that it can work through many combinations in only a couple extra lines of code.

Here's an example from the sklearn documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html

    from sklearn import datasets, svm
    from sklearn.model_selection import GridSearchCV

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC()
    clf = GridSearchCV(svr, parameters)
    clf.fit(iris.data, iris.target)

In scikit-learn 0.18 and later, ```GridSearchCV``` lives in ```sklearn.model_selection```; the older ```sklearn.grid_search``` module was removed in 0.20.

Let's break this down line by line.

    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

A dictionary of the parameters and the possible values they may take. In this case, they're playing around with the kernel (possible choices are 'linear' and 'rbf') and C (possible choices are 1 and 10).

A 'grid' of all the following combinations of values for (kernel, C) is then generated automatically:

    ('rbf', 1)      ('rbf', 10)
    ('linear', 1)   ('linear', 10)

Each is used to train an SVM, and the performance is then assessed using cross-validation.

    svr = svm.SVC()

This looks kind of like creating a classifier, just like we've been doing since the first lesson. But note that the "clf" isn't made until the next line; this is just saying what kind of algorithm to use. Another way to think about this is that the "classifier" isn't just the algorithm in this case, it's the algorithm plus parameter values. Note that there's no monkeying around with the kernel or C; all that is handled in the next line.

    clf = GridSearchCV(svr, parameters)

This is where the first bit of magic happens; the classifier is being created. We pass the algorithm (svr) and the dictionary of parameters to try (parameters), and it generates a grid of parameter combinations to try.

    clf.fit(iris.data, iris.target)

And the second bit of magic. The fit function now tries all the parameter combinations and returns a fitted classifier that's automatically tuned to the best-performing combination in the grid. You can now access the chosen parameter values via ```clf.best_params_```.
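After fitting, the search object can be inspected to see which combination won and how each one scored. The sketch below is an illustration rather than part of the sklearn example; it assumes ```GridSearchCV``` from ```sklearn.model_selection``` and reads the ```best_params_```, ```best_score_``` and ```cv_results_``` attributes of the fitted object.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)

# Best parameter combination and its mean cross-validated score.
print(clf.best_params_)
print(clf.best_score_)

# Mean cross-validated score for every combination in the grid.
for params, mean_score in zip(clf.cv_results_['params'],
                              clf.cv_results_['mean_test_score']):
    print(params, mean_score)
```

The fitted search object can also be used directly as a classifier: by default, ```clf.predict``` uses the best estimator refit on all of the data passed to ```fit```.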


The following cell puts the train/test split and the classifier together on the iris data:

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Load the dataset
    iris = datasets.load_iris()
    features = iris.data
    labels = iris.target

    # Print the shapes of the features and labels
    print(features.shape)
    print(labels.shape)

    # Split into training and test sets (the default holds out 25% for testing)
    features_train, features_test, labels_train, labels_test = train_test_split(
        features, labels, random_state=0)

    # Fit a linear SVM on the training set
    clf = SVC(kernel="linear", C=1.)
    clf.fit(features_train, labels_train)

    # Accuracy on the held-out test set
    print(clf.score(features_test, labels_test))

    def submitAcc():
        return clf.score(features_test, labels_test)

    print(submitAcc())

Output as recorded in the original run:

    (150, 4)
    (150,)
    0.973684210526
    0.973684210526

