# 1. Introduction
In an earlier mission, we learned about train/test validation, a simple technique for testing a machine learning model's accuracy on new data that the model wasn't trained on. In this mission, we'll focus on more robust techniques.

To start, we'll focus on the holdout validation technique, which involves:

- splitting the full dataset into 2 partitions:
 - a training set
 - a test set
- training the model on the training set,
- using the trained model to predict labels on the test set,
- computing an error metric to understand the model's effectiveness,
- switching the training and test sets and repeating,
- averaging the errors.

In holdout validation, we usually use a 50/50 split instead of the 75/25 split from train/test validation. This way, we remove the number of observations as a potential source of variation in model performance.

![Title](holdout_validation.png)

Let's start by splitting the data into 2 nearly equivalent halves.

In [1]:
import pandas as pd

dc_listings = pd.read_csv("dc_airbnb5.csv")
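# regex=False treats ',' and '$' as literal characters ('$' is otherwise a regex anchor)
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)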
dc_listings['price'] = stripped_dollars.astype('float')
dc_listings.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,160.0,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,350.0,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,50.0,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,95.0,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,50.0,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


In [2]:
import numpy as np
np.random.seed(1)
shuffled_index = np.random.permutation(dc_listings.index) 
dc_listings = dc_listings.reindex(shuffled_index)
split_one = dc_listings.iloc[0:1862]
split_two = dc_listings.iloc[1862:]

# 2. Holdout Validation
Now that we've split our data set into 2 dataframes, let's:

- train a k-nearest neighbors model on the first half,
- test this model on the second half,
- train a k-nearest neighbors model on the second half,
- test this model on the first half.

In [3]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Copy the halves used as test sets so that adding a prediction column
# doesn't write to a slice of dc_listings (avoids SettingWithCopyWarning).
train_one = split_one
test_one = split_two.copy()
train_two = split_two
test_two = split_one.copy()

# First half
model = KNeighborsRegressor()
model.fit(train_one[["accommodates"]], train_one["price"])
test_one["predicted_price"] = model.predict(test_one[["accommodates"]])
iteration_one_rmse = mean_squared_error(test_one["price"], test_one["predicted_price"])**(1/2)

# Second half
model.fit(train_two[["accommodates"]], train_two["price"])
test_two["predicted_price"] = model.predict(test_two[["accommodates"]])
iteration_two_rmse = mean_squared_error(test_two["price"], test_two["predicted_price"])**(1/2)

avg_rmse = np.mean([iteration_two_rmse, iteration_one_rmse])

print(iteration_one_rmse, iteration_two_rmse, avg_rmse)

131.702947472 126.222147187 128.962547329


# 3. K-Fold Cross Validation
If we average the two RMSE values from the last step, we get an RMSE value of approximately 128.96. Holdout validation is actually a specific example of a larger class of validation techniques called k-fold cross-validation. Holdout validation is better than train/test validation because the model isn't repeatedly biased towards a specific subset of the data, but each of the two models is trained on only half of the available data. K-fold cross validation, on the other hand, takes advantage of a larger proportion of the data during training while still rotating through different subsets of the data to avoid the issues of train/test validation.

Here's the algorithm for k-fold cross validation:

- splitting the full dataset into k equal-length partitions,
    - selecting k-1 partitions as the training set and
    - selecting the remaining partition as the test set
- training the model on the training set,
- using the trained model to predict labels on the test fold,
- computing the test fold's error metric,
- repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration,
- calculating the mean of the k error values.

Holdout validation is essentially a version of k-fold cross validation where k is equal to 2. Generally, 5 or 10 folds are used for k-fold cross-validation. Here's a diagram describing each iteration of 5-fold cross validation:

![Title](kfold_cross_validation.png)

As you increase the number of folds, the number of observations in each fold decreases and the variance of the fold-by-fold errors increases. Let's start by manually partitioning the data set into 5 folds. Instead of splitting into 5 dataframes, let's add a column that specifies which fold each row belongs to.

In [4]:
# Label each row with its fold, using .loc for label-based assignment
# (DataFrame.set_value was deprecated and later removed from pandas).
dc_listings.loc[dc_listings.index[0:744], "fold"] = 1
dc_listings.loc[dc_listings.index[744:1488], "fold"] = 2
dc_listings.loc[dc_listings.index[1488:2232], "fold"] = 3
dc_listings.loc[dc_listings.index[2232:2976], "fold"] = 4
dc_listings.loc[dc_listings.index[2976:3723], "fold"] = 5
dc_listings.head(10)

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state,fold
574,100%,100%,1,2,Private room,1.0,1.0,1.0,125.0,,$300.00,1,4,149,38.913548,-77.031981,Washington,20009,DC,1.0
1593,87%,100%,2,2,Private room,1.0,1.5,1.0,85.0,$15.00,,1,30,49,38.953431,-77.030695,Washington,20011,DC,1.0
3091,100%,,1,1,Private room,1.0,0.5,1.0,50.0,,,1,1125,1,38.933491,-77.029679,Washington,20010,DC,1.0
420,58%,51%,480,2,Entire home/apt,1.0,1.0,1.0,209.0,$150.00,,4,730,2,38.904054,-77.051991,Washington,20037,DC,1.0
808,100%,95%,3,12,Entire home/apt,5.0,2.0,5.0,215.0,$135.00,$100.00,2,1825,34,38.906118,-76.988873,Washington,20002,DC,1.0
3492,100%,,1,8,Entire home/apt,4.0,2.5,5.0,350.0,,,4,1125,1,38.879581,-76.983600,"Washington, D.C.",20003,DC,1.0
364,100%,100%,2,3,Entire home/apt,0.0,1.0,2.0,115.0,$60.00,,2,1125,63,38.902220,-77.054803,Washington,20037,DC,1.0
1412,43%,100%,1,2,Private room,1.0,1.0,1.0,110.0,$20.00,$250.00,2,1125,5,38.915845,-77.025168,Washington,20001,DC,1.0
3219,100%,100%,1,3,Entire home/apt,0.0,1.0,1.0,99.0,$25.00,,2,14,45,38.929955,-76.973854,Washington,20018,DC,1.0
756,100%,100%,1,2,Private room,1.0,1.0,1.0,49.0,,,1,1125,3,38.906857,-76.983764,Washington,20002,DC,1.0
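
The slice boundaries above are hard-coded for a 3,723-row dataframe. As a rough sketch of one way to derive the labels from the row count instead, the hypothetical helper `assign_folds` below (not part of the original mission) uses `np.array_split`, which also copes with lengths that aren't divisible by k. Note that it spreads the leftover rows over the first folds rather than putting them all in the last fold, so the fold sizes differ slightly from the manual version. The `toy` dataframe is made-up stand-in data.

```python
import numpy as np
import pandas as pd

def assign_folds(df, k):
    """Return a copy of df with a 'fold' column holding labels 1..k.

    Rows are labeled in their current order, so shuffle beforehand.
    """
    labeled = df.copy()
    # Split the row positions 0..n-1 into k nearly equal chunks; the first
    # (n % k) chunks receive one extra row each.
    chunks = np.array_split(np.arange(len(labeled)), k)
    fold = np.empty(len(labeled), dtype=int)
    for label, positions in enumerate(chunks, start=1):
        fold[positions] = label
    labeled["fold"] = fold
    return labeled

# Hypothetical toy frame standing in for dc_listings.
toy = pd.DataFrame({"accommodates": range(13), "price": range(100, 113)})
toy = assign_folds(toy, 5)
fold_sizes = toy["fold"].value_counts().sort_index()
print(fold_sizes)
```
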


# 4. First Iteration
Let's start by performing the first iteration of k-fold cross validation on a simple, univariate model.

In [5]:
from sklearn.neighbors import KNeighborsRegressor as KNR
from sklearn.metrics import mean_squared_error as MSE
train = dc_listings[dc_listings['fold']!=1]
test = dc_listings[dc_listings['fold']==1]

model = KNR()
model.fit(train[['accommodates']],train['price'])
predictions = model.predict(test[['accommodates']])

iteration_one_rmse = MSE(test['price'],predictions)**(1/2)
iteration_one_rmse

105.06498501055819

# 5. Function For Training Models
From the first iteration, we achieved an RMSE value of 105.06. While this is one of the lowest RMSE values we achieved in the last few missions, let's calculate the RMSE values for the remaining iterations. To make the iteration process easier, let's wrap the code we wrote in the previous screen in a function.

In [7]:
from sklearn.neighbors import KNeighborsRegressor as KNR
from sklearn.metrics import mean_squared_error as MSE
import numpy as np

def train_and_validate(df,folds):
    rmses = []
    for i in folds:
        train = df[df['fold']!=i]
        test = df[df['fold']==i]

        model = KNR()
        model.fit(train[['accommodates']],train['price'])
        predictions = model.predict(test[['accommodates']])
        rmses.append(MSE(test['price'],predictions)**(1/2))
    avg_rmse = np.mean(rmses)
    return rmses, avg_rmse

fold_ids = [1,2,3,4,5]
rmses, avg_rmse = train_and_validate(dc_listings,fold_ids)

print(rmses, avg_rmse)

[105.06498501055819, 140.31953569184228, 153.25636421383291, 108.30319349137453, 171.39722950247619] 135.668261582


# 6. Performing K-Fold Cross Validation Using Scikit-Learn
While the average RMSE value was approximately 135.67, the RMSE values ranged from 105.06 all the way to 171.40. This large amount of variability between the RMSE values means that we're either using a poor model or a poor evaluation criterion (or a bit of both!). By implementing your own k-fold cross-validation function, you hopefully acquired a good understanding of the inner workings of the technique. The function we wrote, however, has many limitations. If we now want to change the number of folds, we need to make the function more general so it can also handle randomizing the ordering of the rows in the dataframe and splitting them into folds.
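
To make that limitation concrete, here is a minimal sketch of what a more general version might look like. The function name `kfold_rmse`, its parameters, and the synthetic `toy` data are all assumptions made for illustration; it shuffles row positions, splits them into k folds, and trains one k-nearest neighbors model per fold.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

def kfold_rmse(df, features, target, k=5, seed=1):
    """Shuffle row positions, split them into k folds, return per-fold RMSEs."""
    rng = np.random.RandomState(seed)
    positions = rng.permutation(len(df))      # randomized row order
    folds = np.array_split(positions, k)      # k nearly equal groups of positions
    rmses = []
    for i, test_pos in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_pos = np.concatenate([f for j, f in enumerate(folds) if j != i])
        train, test = df.iloc[train_pos], df.iloc[test_pos]
        model = KNeighborsRegressor()
        model.fit(train[features], train[target])
        predictions = model.predict(test[features])
        rmses.append(mean_squared_error(test[target], predictions) ** 0.5)
    return rmses, float(np.mean(rmses))

# Synthetic stand-in for dc_listings (made-up numbers, not real listings).
data_rng = np.random.RandomState(0)
accommodates = data_rng.randint(1, 9, size=200)
toy = pd.DataFrame({
    "accommodates": accommodates,
    "price": 40 * accommodates + data_rng.normal(0, 20, size=200),
})

rmses, avg_rmse = kfold_rmse(toy, ["accommodates"], "price", k=4)
print(rmses, avg_rmse)
```

Even this version hard-codes the model and the error metric, which is part of why handing the bookkeeping to scikit-learn is attractive.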

In machine learning, we're interested in building a good model and accurately understanding how well it will perform. **To build a better k-nearest neighbors model, we can change the features it uses or tweak the number of neighbors (a hyperparameter)**. **To accurately understand a model's performance, we can perform k-fold cross validation and select the proper number of folds.** We've learned how scikit-learn makes it easy for us to quickly experiment with these different knobs when it comes to building a better model. Let's now dive into how we can use scikit-learn to handle cross-validation as well.

First, we create an instance of the KFold class from sklearn.model_selection:


    from sklearn.model_selection import KFold
    kf = KFold(n_splits, shuffle=False, random_state=None)

where:

- n_splits is the number of folds you want to use,
- shuffle is used to toggle shuffling of the ordering of the observations in the dataset,
- random_state is used to specify the random seed value if shuffle is set to True.

You'll notice here that no parameters depend on the data set at all. This is because a **KFold instance is a splitter object that only generates train/test row indices once it's handed data. We use it in conjunction with the cross_val_score() function, also from sklearn.model_selection. Together, the class and the function allow us to compactly train and test using k-fold cross validation.**
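
As a small illustration of why KFold needs no data up front, the sketch below calls its `split()` method directly on a made-up array `X` of 10 observations. Each iteration yields a pair of integer position arrays (training rows, test rows); `cross_val_score()` consumes these same pairs internally.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 hypothetical observations with a single feature.
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=1)

# split() only looks at the number of rows in X and yields index arrays.
splits = list(kf.split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```
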

Here are the relevant parameters for the cross_val_score function:

    from sklearn.model_selection import cross_val_score
    cross_val_score(estimator, X, y, scoring=None, cv=None)

where:

- estimator is a sklearn model that implements the fit method (e.g. instance of KNeighborsRegressor),
- X is the list or 2D array containing the features you want to train on,
- y is a list containing the values you want to predict (target column),
- scoring is a string describing the scoring criterion (the accepted values are listed in the scikit-learn documentation),
- cv describes the number of folds. Here are some examples of accepted values:
  - an instance of the KFold class,
  - an integer representing the number of folds.

The cross_val_score() function returns an array of scores, one value for each fold. Here's the general workflow for performing k-fold cross-validation using the classes we just described:

- instantiate the scikit-learn model class you want to fit,
- instantiate the KFold class and use the parameters to specify the k-fold cross-validation attributes you want,
- use the cross_val_score() function to return the scoring metric you're interested in.

In [39]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor as KNR
from sklearn.metrics import mean_squared_error

features=['accommodates']
var_to_predict=['price']

kf = KFold(5, shuffle = True, random_state = 1)
knn = KNR(n_neighbors = 5)
mses = cross_val_score(knn,dc_listings[features],dc_listings[var_to_predict],scoring='neg_mean_squared_error',cv=kf)
rmses = [abs(x)**(1/2) for x in list(mses)]
avg_rmse = np.mean(rmses)
avg_rmse

134.66322485283825
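
The parameter list above notes that cv also accepts a plain integer. As a hedged sketch on synthetic data (the arrays `X` and `y` are made up for illustration), the snippet below checks that, for a regressor, passing `cv=5` gives the same scores as an unshuffled `KFold(n_splits=5)`. This means an integer alone does not shuffle the rows, so pass a KFold instance with `shuffle=True` when the row order might be meaningful.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: one feature and a noisy linear target.
rng = np.random.RandomState(0)
X = rng.randint(1, 9, size=(200, 1)).astype(float)
y = 40 * X[:, 0] + rng.normal(0, 20, size=200)

knn = KNeighborsRegressor(n_neighbors=5)

# An integer cv and an unshuffled KFold produce the same folds here.
scores_int = cross_val_score(knn, X, y, scoring="neg_mean_squared_error", cv=5)
scores_kf = cross_val_score(knn, X, y, scoring="neg_mean_squared_error",
                            cv=KFold(n_splits=5))

# Scores are negated MSEs, so flip the sign before taking the square root.
rmses = np.sqrt(-scores_int)
print(rmses.mean(), np.allclose(scores_int, scores_kf))
```
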

# 7. Exploring Different K Values
Choosing the right k value when performing k-fold cross validation is more of an art and less of a science. As we discussed earlier in the mission, a k value of 2 is really just holdout validation. On the other end, setting k equal to n (the number of observations in the data set) is known as leave-one-out cross validation, or LOOCV for short. Through lots of trial and error, data scientists have converged on 10 as the standard k value.
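
For a sense of what the LOOCV extreme looks like in code, here is a minimal sketch using scikit-learn's `LeaveOneOut` splitter on a small synthetic dataset (the arrays `X` and `y` are made up for illustration). Because every fold holds a single test row, each score is just one squared error, so this sketch pools the squared errors before taking the square root; that pooling is one reasonable choice, not the only one. LOOCV fits one model per observation, so it gets expensive quickly on larger data sets.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic stand-in dataset: 60 observations, one feature.
rng = np.random.RandomState(0)
X = rng.randint(1, 9, size=(60, 1)).astype(float)
y = 40 * X[:, 0] + rng.normal(0, 20, size=60)

loo = LeaveOneOut()  # equivalent to k-fold with k equal to the number of rows
neg_mses = cross_val_score(KNeighborsRegressor(), X, y,
                           scoring="neg_mean_squared_error", cv=loo)

# One score per observation; pool the squared errors into a single RMSE.
loocv_rmse = np.sqrt(np.mean(-neg_mses))
print(len(neg_mses), loocv_rmse)
```
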

In the following code block, we display the results of varying k from 3 to 23. For each k value, we calculate and display the average RMSE value across all of the folds and the standard deviation of the RMSE values. Across the many different k values, it seems like the average RMSE value is around 128. You'll notice that the standard deviation of the RMSE increases from approximately 1.1 to 37.3 as we increase the number of folds.

In [43]:
from sklearn.model_selection import cross_val_score, KFold

num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = [np.sqrt(np.absolute(mse)) for mse in mses]
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  126.192234742 std RMSE:  1.10699117395
5 folds:  avg RMSE:  134.663224853 std RMSE:  17.0639664505
7 folds:  avg RMSE:  128.761029407 std RMSE:  15.1130364721
9 folds:  avg RMSE:  130.97403955 std RMSE:  17.5527278118
10 folds:  avg RMSE:  129.382430039 std RMSE:  22.8580286103
11 folds:  avg RMSE:  128.520917482 std RMSE:  21.0393369592
13 folds:  avg RMSE:  128.665369279 std RMSE:  29.931738536
15 folds:  avg RMSE:  127.74903938 std RMSE:  30.2252052501
17 folds:  avg RMSE:  125.086899801 std RMSE:  34.7037432777
19 folds:  avg RMSE:  123.249524373 std RMSE:  37.9325864603
21 folds:  avg RMSE:  129.742921534 std RMSE:  36.3109063433
23 folds:  avg RMSE:  129.910388399 std RMSE:  37.3486155888


# 8. Bias-Variance Tradeoff

So far, we've been working under the assumption that a lower RMSE always means that a model is more accurate. This isn't the complete picture, unfortunately. A model has two sources of error, bias and variance.

**Bias describes error that results from bad assumptions about the learning algorithm.** For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

**Variance describes error that occurs because of the variability of a model's predicted values.** If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance. In an ideal world, we want low bias and low variance, but in reality there's always a tradeoff.

1. **The standard deviation of the RMSE** values can be a proxy for a model's variance;
2. **The average RMSE** is a proxy for a model's bias. 

Bias and variance are the 2 observable sources of error in a model that we can indirectly control.
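
As a rough sketch of how the two proxies above could be compared across models, the snippet below runs 10-fold cross validation for k-nearest neighbors models with different numbers of neighbors and records the average and standard deviation of the fold RMSEs. The synthetic data, the neighbor counts, and the `summary` dictionary are assumptions made for illustration, and these numbers are only informal proxies, not a formal bias-variance decomposition.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: one feature and a noisy linear target.
rng = np.random.RandomState(0)
X = rng.uniform(1, 9, size=(300, 1))
y = 40 * X[:, 0] + rng.normal(0, 20, size=300)

kf = KFold(n_splits=10, shuffle=True, random_state=1)

summary = {}
for n_neighbors in [1, 5, 50]:
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    neg_mses = cross_val_score(model, X, y,
                               scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(-neg_mses)
    # Average RMSE: proxy for bias. Std of RMSE: proxy for variance.
    summary[n_neighbors] = (rmses.mean(), rmses.std())
    print(n_neighbors, "neighbors: avg RMSE", rmses.mean(), "std RMSE", rmses.std())
```
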

![](bias_variance.png)


**While k-nearest neighbors can make predictions, IT IS NOT A MATHEMATICAL MODEL.** A mathematical model is usually an equation that can exist without the original data, which isn't true with k-nearest neighbors. In the next two courses, we'll learn about a mathematical model called linear regression. We'll explore the bias-variance tradeoff in greater depth in these next 2 courses because of its importance when working with mathematical models in particular.