# There's No Such Thing as a Free Lunch

In machine learning, the "no free lunch" theorem "state[s] that any two optimization algorithms are equivalent when their performance is averaged across all possible problems" (see [1] and [2] for references). In other words, you cannot know in advance which machine learning algorithm will best solve the problem in front of you. With experience you will develop an intuition for which algorithms suit which problems, but a priori you cannot say which one is best.

So we need a way to measure whether machine learning algorithm (A) is better than machine learning algorithm (B) or, for that matter, whether (A) with parameters {l,m,n} does better than (A) with parameters {p,q,r}. This is where **cross-validation** comes in.

[1] Wolpert, D.H. (1996) "The Lack of A Priori Distinctions between Learning Algorithms", Neural Computation, 8(7): 1341-1390

[2] Wolpert, D.H., and Macready, W.G. (2005) "Coevolutionary free lunches", IEEE Transactions on Evolutionary Computation, 9(6): 721-735

# Cross-Validation

Cross-validation is a way to measure how well a machine learning algorithm will generalize to data it has not seen (i.e., the "real" dataset). For prediction problems, the steps typically go like this:
1. Prepare a set of data where the **target** (the value you want to predict) is already known. 
2. Split this set into a **training set** and a **validation (or test) set**
3. Fit your model to the **training set**, and use the fit to predict values for the **validation set**
4. See how well the model predicted the **validation set**
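
The four steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's own example: the synthetic linear data, the choice of `LinearRegression`, and the 80/20 split are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Step 1: data whose target is already known (an assumed toy relationship)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=200)

# Step 2: hold out 20% of the rows as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 3: fit on the training set, then predict the validation set
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)

# Step 4: score the predictions against the known validation targets
val_score = r2_score(y_val, y_pred)
print(val_score)
```

The key point is that the model never sees `X_val` or `y_val` while fitting, so `val_score` estimates performance on new data.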

A typical measure is the **coefficient of determination, R^2**, which is defined as below:

![R^2 definition](img/R2def.png)
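
Assuming the image shows the usual definition, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean, the score can be computed by hand and checked against `sklearn.metrics.r2_score`. The four data points below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# Sum of squared residuals: how far the predictions are from the truth
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A perfect prediction gives R^2 = 1, always predicting the mean gives 0, and a model that does worse than the mean gives a negative score.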



# Let's See Why We Need Cross Validation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn as skl
import sklearn.metrics  #makes skl.metrics available explicitly

import scripts.load_data as load

import seaborn as sns

%matplotlib inline

In [None]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline

#set the seed so you get the same data each time you run
np.random.seed(0)

#define a model to generate fake data
def data_model(x, sigma=0.1):
    base_model = np.log(x)
    noise = sigma * np.random.randn(len(x))
    return base_model + noise

#Generate some fake data and keep a subset of them
#for training
#***Try different values for N_split***
N_split = 10
x_plot = np.linspace(1,100,100)
x_train = x_plot[:N_split]
y_train = data_model(x_train)

#also make validation data
x_val = x_plot[N_split:]
y_val = data_model(x_val)
# create matrix versions of these arrays
X_train = x_train[:, np.newaxis]
X_val = x_val[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

colors = ['teal', 'yellowgreen', 'gold', 'red']
lw = 0
plt.figure(figsize = (8,6))
plt.scatter(x_train, y_train, color='navy', s=30, marker='o', label="training points", linewidth=lw)
plt.xlabel('x')
plt.ylabel('y')

plt.scatter(x_val, y_val, marker='o', color='cornflowerblue', alpha=0.5,linewidth=lw, label="validation data")

plt.ylim([0,5])
lw =2
for count, degree in enumerate([2, 4, 5, N_split-1]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    #fit to training data
    model.fit(X_train, y_train)
    #apply to validation data
    y_predicted = model.predict(X_val)
    #get R^2 score
    score = skl.metrics.r2_score(y_val, y_predicted)
    #apply again, just to show plot
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw, alpha=0.5,
             label="degree %d with score %.02f" %(degree, score))

plt.legend(loc='lower right')


# Picking a Good Training Set

The example above is meant to illustrate two things:
1. Training on too little data yields poor results. (A typical split is 80% training, 20% validation.)
2. Even a good split ratio won't save you if your training set is not representative of the data. A badly chosen training set will bias your model.

The example below shows the importance of shuffling.

In [None]:
#this is the same code as above,
#except that x_train is chosen differently

#set the seed so you get the same data each time you run
np.random.seed(0)

#define a model to generate fake data
def data_model(x, sigma=0.1):
    base_model = np.log(x)
    noise = sigma * np.random.randn(len(x))
    return base_model + noise

#Generate some fake data and keep a subset of them
#for training
#***Try different values for N_split***
N_split = 70
x_plot = np.linspace(1,100,num=100)
x = np.copy(x_plot)

rng = np.random.RandomState(0)
rng.shuffle(x)
x_train = np.sort(x[:N_split])
y_train = data_model(x_train)

#also make validation data
x_val = np.sort(x[N_split:])
y_val = data_model(x_val)
# create matrix versions of these arrays
X_train = x_train[:, np.newaxis]
X_val = x_val[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

colors = ['teal', 'yellowgreen', 'gold', 'red']
lw = 0
plt.figure(figsize = (8,6))
plt.scatter(x_train, y_train, color='navy', s=30, marker='o', label="training points", linewidth=lw)
plt.xlabel('x')
plt.ylabel('y')

plt.scatter(x_val, y_val, marker='o', color='cornflowerblue', alpha=0.5,linewidth=lw, label="validation data")

plt.ylim([0,7])
lw =2
for count, degree in enumerate([2, 4, 5, N_split-1]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    #fit to training data
    model.fit(X_train, y_train)
    y_train_fit = model.predict(X_train)
    nscore = skl.metrics.r2_score(y_train, y_train_fit)
    #apply to validation data
    y_predicted = model.predict(X_val)
    #get R^2 score
    score = skl.metrics.r2_score(y_val, y_predicted)
    #apply again, just to show plot
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw, alpha=0.5,
             label="degree %d with score %0.02f \n naive score %0.02f" %(degree, score, nscore))

plt.legend(loc='lower right')
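
The difference between the two splits can also be seen without fitting anything. The sketch below uses `train_test_split` from scikit-learn (not used in the cells above, so treat it as an alternative to the manual `rng.shuffle` approach) to compare a contiguous 80/20 split with a shuffled one on the same `x` grid.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.linspace(1, 100, 100)

# Contiguous split: the first 80 points train, the last 20 validate
x_tr_c, x_val_c = train_test_split(x, test_size=0.2, shuffle=False)

# Shuffled split: training points are drawn from the whole range
x_tr_s, x_val_s = train_test_split(
    x, test_size=0.2, shuffle=True, random_state=0)

print("contiguous train range:", x_tr_c.min(), x_tr_c.max())
print("shuffled train range:  ", x_tr_s.min(), x_tr_s.max())
```

With the contiguous split every validation point lies beyond the largest training point, so the model is forced to extrapolate. Shuffling spreads the training points over the full range, which turns the task into interpolation.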




# Using k-Fold Cross-Validation to Avoid Overfitting

Notice that while the 9-degree polynomial goes through every data point in the training set perfectly, it does not generalize to new data very well. If you took the performance of the fit to the training data as the measure of the model's success, you would c


In [None]:
from sklearn.model_selection import cross_val_score
#cv=5 requests five-fold cross-validation explicitly
#(the default is 5 since scikit-learn 0.22; it was 3 before)
score = cross_val_score(model, x.reshape(-1,1), data_model(x).reshape(-1,1), scoring='r2', cv=5)
print(score)

In [None]:
from scripts.load_data import load_training_spectra
data, targets = load_training_spectra()

In [None]:
data.head()

In [None]:
targets.head()

In [None]:
targets = targets.pH #as an example, predict just one target column (pH)

In [None]:
data_chunks = np.array_split(data,5) #split data into 5 chunks of (nearly) equal size

In [None]:
data_chunks[4].head()

In [None]:
target_chunks = np.array_split(targets,5)

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
for i in range(5):
    #train on every chunk except chunk i
    data_t = pd.concat([data_chunks[j] for j in range(5) if j != i])
    target_t = pd.concat([target_chunks[j] for j in range(5) if j != i])
    #validate on the held-out chunk i
    data_v = data_chunks[i]
    target_v = target_chunks[i]
    rf = RandomForestRegressor()
    rf.fit(data_t, target_t)
    print(rf.score(data_v, target_v))
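
The manual loop above is five-fold cross-validation written out by hand. As a sanity check, the sketch below shows that the same kind of loop, driven by scikit-learn's `KFold` splitter, reproduces what `cross_val_score` reports. It uses assumed synthetic data and a `Ridge` model (which is deterministic) rather than the spectra and random forest, so the two sets of scores can be compared exactly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Assumed synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Five contiguous folds, no shuffling (like np.array_split)
kf = KFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in kf.split(X):
    model = Ridge()
    model.fit(X[train_idx], y[train_idx])           # fit on four folds
    manual_scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 on the fifth

# The same procedure in one call
library_scores = cross_val_score(Ridge(), X, y, cv=kf, scoring='r2')

print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

With a random forest the two sets of scores would differ slightly from run to run unless `random_state` is fixed, because each fit is itself randomized.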

In [None]:
from sklearn.model_selection import cross_val_score
rf2 = RandomForestRegressor()
score = cross_val_score(rf2, data, targets, scoring='r2', cv=5) #five-fold, matching the manual loop above
print(score)

In [None]:
#mean score and standard error of the mean across the folds
np.mean(score), np.std(score)/np.sqrt(len(score))