# DSCI 6003 3.3 Lecture


## Introduction to Validation

### By the End of This Lecture You Will:

1. Distinguish Validation from Evaluation
2. Distinguish a classifier from a model
3. Be able to describe the three different types of Cross-Validation
4. Give case examples of where we might use the different types
5. Be able to use the sklearn regression API


# Validation vs. Evaluation

These two concepts are often entangled in the mind of an inexperienced investigator. When we speak of models in machine learning, **evaluation** refers to the use of established metrics or measurements to characterize the expected future performance of a model.

**Validation** has a specific scientific meaning. It refers to the verification of a prediction using other methods of observation (within established criteria of acceptance). For example, if we obtain a result with one experiment, we must verify our conclusion from that experiment with other experiments. To make sure that our result is **valid**, the additional studies of this prediction must come from a representative sample of potential observers.

In traditional science this is done with the process of **peer review.** Historically, scientists would publish results in **letters,** and other scientists of similar skill would repeat these experiments. Eventually the community would come to a consensus, and these final validated conclusions would be printed as an **article**. 

There are two types of validation: cross-validation, wherein the same experiment (or method or model) is repeated multiple times, and orthogonal validation, where an entirely different technique capable of reaching approximately the same conclusion is used and compared against the first experiment.

In machine learning, we use **cross-validation every time a new model is constructed** in order to verify the model's consistency and performance over a fixed set of data.

**Orthogonal validation** should be used if you are not certain about results or are using a new classifier that has not yet been tested. The word "orthogonal" suggests that the method being used is entirely different from the original. [Benchmark data sets and standard classifiers](http://scikit-learn.org/stable/modules/clustering.html) are typically used to perform the comparison, with an appropriate visualization technique. We teach you most of the standard classifiers in this class. When you implement your own classifier, you can use orthogonal methods.



## Cross Validation


*We use cross validation as a means to get a sense of the error. Our final model will be built on all of the data so that we can have the best model possible.*


#### Validation Set

A validation (or hold out) set is a random sample of our data that we reserve for testing. We don't use this part of our data for building our model, just for assessing how well it did.

* A typical breakdown is:
    - 80% of our data in the training set (which we use to build the model)
    - 20% of our data in the test set (which we use to evaluate the model)
* Make sure that you randomize your data! It's really dangerous to pick the first 80% of the data to train and the last 20% to test since data is often sorted by a feature or the target! It would cause trouble if all the expensive houses were in your test set and never in the training set!
* Concerns:
    - *Variable:* Depending on which random sample we get, we will get different values.
    - *Underestimating:* We are actually underestimating performance a little bit, since we are testing a model built on just 80% of the dataset instead of the whole 100%.
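
The danger of an unshuffled split can be seen in a small sketch. The house prices here are randomly generated for illustration, and `holdout_split` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 house prices, sorted from cheapest to most expensive.
prices = np.sort(rng.uniform(100_000, 900_000, size=100))

def holdout_split(n_samples, test_fraction=0.2, shuffle=True, rng=None):
    """Return (train_idx, test_idx) index arrays for a single hold-out split."""
    if rng is None:
        rng = np.random.default_rng()
    indices = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_train = n_samples - n_test
    return indices[:n_train], indices[n_train:]

# Unshuffled: the test set is exactly the 20 most expensive houses,
# so the model never sees prices in that range during training.
train_idx, test_idx = holdout_split(len(prices), shuffle=False)
print("every test price above every training price:",
      prices[train_idx].max() <= prices[test_idx].min())

# Shuffled: training and test sets both cover the whole price range.
train_idx, test_idx = holdout_split(len(prices), shuffle=True, rng=rng)
print("train mean:", prices[train_idx].mean())
print("test mean: ", prices[test_idx].mean())
```

Rerunning the shuffled split with a different seed changes the two means slightly, which is the *Variable* concern listed above.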


#### KFold Cross Validation

In K-fold cross validation, the data is split into **k** groups. One group
out of the k groups will be the test set, the rest (**k-1**) groups will
be the training set. In the next iteration, another group will be the test set,
and the rest will be the training set. The process repeats for k iterations (k-fold).
In each fold, a metric for accuracy will be calculated and
an overall average of that metric will be calculated over k-folds. 

![KFold Cross Validation](images/kfold.jpeg)
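
The mechanics of the splitting can be sketched with plain numpy. This is a minimal illustration, not how sklearn implements it: `kfold_indices` is a hypothetical helper, the data is synthetic, and mean squared error on a straight-line fit stands in for "a metric".

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # k groups of (nearly) equal size
    for i in range(k):
        test_idx = folds[i]
        # The other k-1 groups together form the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Synthetic data for illustration: a noisy straight line
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3 * x + 2 + rng.normal(0, 1, size=50)

fold_mse = []
for train_idx, test_idx in kfold_indices(len(x), k=5):
    # Fit on k-1 groups, evaluate on the held-out group
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    predictions = slope * x[test_idx] + intercept
    fold_mse.append(np.mean((y[test_idx] - predictions) ** 2))

print("MSE per fold:", np.round(fold_mse, 3))
print("average MSE:", np.mean(fold_mse))
```

Every sample lands in the test set exactly once, so the averaged metric uses all of the data for evaluation without ever scoring a model on points it was trained on.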


#### Stratified KFold Cross Validation

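Stratified KFold works just like KFold, except that each fold is built to preserve (approximately) the same proportion of each target class as the full dataset.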
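
To see what stratification does in practice, here is a minimal sketch comparing `KFold` and `StratifiedKFold` from `sklearn.model_selection`. The imbalanced label array is made up for illustration, and the feature matrix is a placeholder since only the labels affect how the folds are built.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features

plain = KFold(n_splits=5)
stratified = StratifiedKFold(n_splits=5)

# Count how many positive examples end up in each test fold
plain_counts = [int(y[test].sum()) for _, test in plain.split(X)]
strat_counts = [int(y[test].sum()) for _, test in stratified.split(X, y)]

print("positives per test fold, KFold:          ", plain_counts)
print("positives per test fold, StratifiedKFold:", strat_counts)
```

Without shuffling, plain `KFold` puts all of the positives into a single test fold here, while `StratifiedKFold` spreads them evenly. This matters most for classification problems with rare classes.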




#### Leave P Out Cross Validation




## Regression with sklearn

There are several good modules with implementations of regression. We've
used [statsmodels](http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.html).
Today we will be using [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
[numpy](http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html) and
[scipy](http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.linregress.html)
also have implementations.

Resources:
* [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* [sklearn example](http://scikit-learn.org/0.11/auto_examples/linear_model/plot_ols.html)

For all `sklearn` estimators, the `fit` method is used to train and the `score`
method is used to test. You can also use the `predict` method to see the
predicted y values.

#### Example

This is the general workflow for working with sklearn. Any algorithm we use from sklearn will have the same form.

```
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load data from csv file
df = pd.read_csv('data/housing_prices.csv')
X = df[['square_feet', 'num_rooms']].values
y = df['price'].values

# Split into training and test sets (rows are shuffled by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

# Fit a linear regression on the training set
regr = LinearRegression()
regr.fit(X_train, y_train)
print("Intercept:", regr.intercept_)
print("Coefficients:", regr.coef_)

# score returns R^2 (higher is better), computed here on the held-out test set
print("R^2 score:", regr.score(X_test, y_test))
predicted_y = regr.predict(X_test)
```

`LinearRegression` is a class and you have to create an instance of it.
If there are any parameters to the model, you should set them when you instantiate the object.
For example, with `LinearRegression`, you can choose whether to fit an intercept term:

```
regr = LinearRegression(fit_intercept=False)    # the default is True
```


You should call the `fit` method once. Here you give it the training data and it will train your model. Once you have that, you can get the coefficients for the equation (`intercept_` and `coef_`) and also get the score for your test set (`score` method). You can also get the predicted values for any new data you would like to give to the model (`predict` method).

### Cross Validation

Here's an example of cross-validation using kfold:


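```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# X and y are the feature matrix and target array from the example above
kf = KFold(n_splits=5, shuffle=True)
results = []
for train_index, test_index in kf.split(X):
    # Fit a fresh model on the training folds, score it on the held-out fold
    regr = LinearRegression()
    regr.fit(X[train_index], y[train_index])
    results.append(regr.score(X[test_index], y[test_index]))
print("average score:", np.mean(results))
```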

The next example cross-validates a Lasso regression on the diabetes dataset over a range of `alpha` values, then uses `LassoCV` to check how stable the selected `alpha` is across folds:

```
# In a Jupyter notebook, run `%matplotlib inline` first to display plots inline
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets, linear_model
from sklearn.model_selection import KFold, cross_val_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]

lasso = linear_model.Lasso()
alphas = np.logspace(-4, -.5, 30)
n_folds = 5

scores = list()
scores_std = list()

for alpha in alphas:  # for each one of the selected reg. parameters, CV the new model
    lasso.alpha = alpha
    this_scores = cross_val_score(lasso, X, y, cv=KFold(n_splits=n_folds), n_jobs=1)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

print(scores)
print(scores_std)


plt.figure(figsize=(4, 3))
plt.semilogx(alphas, scores)

# plot error lines showing +/- std. errors of the scores
# (standard error = std. of the fold scores / sqrt(number of folds))
std_error = np.array(scores_std) / np.sqrt(n_folds)
plt.semilogx(alphas, np.array(scores) + std_error, 'b--')
plt.semilogx(alphas, np.array(scores) - std_error, 'b--')
plt.ylabel('CV score')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')

##############################################################################
# Bonus: how much can you trust the selection of alpha?

# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = linear_model.LassoCV(alphas=alphas)
k_fold = KFold(n_splits=10)

print("Answer to the bonus question:",
      "how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")

plt.show()
```

## Bias Variance Trade Off

We assume that our dataset is a random sample of the population. The question we want to answer is: does the model I'm building represent the whole population well (i.e., not just the sample dataset that I have)?

* Bias: If we built a model on each of many different samples, would the average of those models be close to the true model?

    * A biased model would center around the incorrect solution. How you collect data can lead to bias (for example, you only get user data from San Francisco and try to use your model for your whole user base).
    * High bias can also come from underfitting, i.e., not fully representing the data you are given.

* Variance: Would the models built on different samples all be close to each other?

    * The main contributors to high variance are insufficient data and a target that isn't actually correlated with your features.
    * High variance is also a result of overfitting to your sample dataset.
    
Note that both high bias and high variance are bad. High variance is worse than it sounds: you will only construct the model once, so with high variance there's a low probability that your model will be near the optimal one.

Looking at this from the perspective of the number of features:

* Increasing the number of features means:
    * Increase in variance
    * Decrease in bias
    
A graph can make this more clear:
    
![bias variance](images/bias_variance_graph.png)
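
The trade-off can also be estimated by simulation. The sketch below is an illustration under stated assumptions, not part of the lecture's data: the true function (`sin(pi * x)`), the noise level, and the use of polynomial degree as a stand-in for "number of features" (the features being `x`, `x^2`, ..., `x^degree`) are all choices made for this example. It builds many models from different noisy samples, then measures how far their average is from the truth (bias) and how much they disagree with each other (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """The (normally unknown) true relationship, assumed for this simulation."""
    return np.sin(np.pi * x)

x_train = np.linspace(-1, 1, 20)     # fixed inputs; only the noise changes
x_eval = np.linspace(-0.9, 0.9, 50)  # points where the models are compared
n_datasets, noise_sd = 200, 0.3

def bias_variance(degree):
    """Estimate squared bias and variance of polynomial fits of a given degree."""
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        # Draw a fresh noisy sample and fit one model to it
        y_train = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_eval)
    # Bias: distance between the average model and the true function
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_eval)) ** 2)
    # Variance: how much the individual models scatter around their average
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

In this setup the straight line (degree 1) underfits, giving high bias and low variance, while the degree 9 polynomial follows the noise in each sample, giving low bias and higher variance.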