We'll focus on more robust techniques.
To start, we'll focus on the holdout validation technique, which involves:

1. splitting the full dataset into 2 partitions:
   - a training set
   - a test set
2. training the model on the training set,
3. using the trained model to predict labels on the test set,
4. computing an error metric to understand the model's effectiveness,
5. switching the training and test sets and repeating,
6. averaging the errors.

In holdout validation, we usually use a 50/50 split instead of the 75/25 split from train/test validation. This way, we remove the number of observations as a potential source of variation in our model performance.
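The steps above can be sketched as one small function. This is a minimal illustration on synthetic data, not the exercise solution: `holdout_rmse`, `X_demo`, and `y_demo` are hypothetical names introduced here, and the synthetic prices are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def holdout_rmse(X, y, n_neighbors=5, seed=1):
    """Hypothetical helper: 50/50 holdout validation, returns both RMSEs and their mean."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))      # shuffled row positions
    half = (len(X) + 1) // 2             # first half gets the extra row if the count is odd
    idx_a, idx_b = order[:half], order[half:]

    rmses = []
    # Train on one half and test on the other, then swap the roles
    for train_idx, test_idx in [(idx_a, idx_b), (idx_b, idx_a)]:
        knn = KNeighborsRegressor(n_neighbors=n_neighbors)
        knn.fit(X[train_idx], y[train_idx])
        predictions = knn.predict(X[test_idx])
        rmses.append(mean_squared_error(y[test_idx], predictions) ** 0.5)
    return rmses, float(np.mean(rmses))

# Synthetic stand-in for an "accommodates" feature and a "price" target (assumed values)
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 9, size=(200, 1)).astype(float)
y_demo = 40.0 * X_demo[:, 0] + rng.normal(0, 10, size=200)

demo_rmses, demo_avg = holdout_rmse(X_demo, y_demo)
print(demo_rmses, demo_avg)
```

Shuffling once and then swapping the two halves means every row is used for testing exactly once.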

### Instructions
1. Use the numpy.random.permutation() function to shuffle the ordering of the rows in dc_listings.
2. Select the first 1862 rows and assign to split_one.
3. Select the remaining 1861 rows and assign to split_two.

In [27]:
import numpy as np
import pandas as pd

dc_listings = pd.read_csv("dc_airbnb.csv")
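# regex=False treats ',' and '$' as literal characters ('$' is otherwise a regex end-of-string anchor)
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')

# Shuffle the row order, then split the shuffled dataframe into two halves
shuffled_index = np.random.permutation(len(dc_listings))
dc_listings = dc_listings.reindex(shuffled_index)
split_one = dc_listings[0:1862].copy()
split_two = dc_listings[1862:].copy()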
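
The price cleanup in this cell only works if `'$'` is treated as a literal character rather than as a regular-expression anchor. As a small sketch on made-up values (`raw_prices` is a hypothetical sample, not data from dc_airbnb.csv), passing `regex=False` makes that explicit regardless of the pandas version:

```python
import pandas as pd

# Made-up sample values in the same "$1,200.00" style as the price column
raw_prices = pd.Series(['$1,200.00', '$85.00', '$160.00'])

cleaned = (
    raw_prices
    .str.replace(',', '', regex=False)   # drop thousands separators
    .str.replace('$', '', regex=False)   # drop the literal dollar sign
    .astype('float')
)
print(cleaned.tolist())
```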


### Holdout Validation

Now that we've split our data set into 2 dataframes, let's:

1. train a k-nearest neighbors model on the first half,
2. test this model on the second half,
3. train a k-nearest neighbors model on the second half,
4. test this model on the first half.

### Instructions
1. Train a k-nearest neighbors model using the default algorithm (auto) and the default number of neighbors (5) that:
   - uses the accommodates column from train_one for training and
   - tests it on test_one.
2. Assign the resulting RMSE value to iteration_one_rmse.
3. Train a k-nearest neighbors model using the default algorithm (auto) and the default number of neighbors (5) that:
   - uses the accommodates column from train_two for training and
   - tests it on test_two.
4. Assign the resulting RMSE value to iteration_two_rmse.
5. Use numpy.mean() to calculate the average of the 2 RMSE values and assign to avg_rmse.

In [29]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

features = ['accommodates']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')
train_features = train_one[features] 
train_target = train_one['price']
knn.fit(train_features, train_target)
predictions = knn.predict(test_one[features])
MSE = mean_squared_error(test_one['price'] , predictions)
iteration_one_rmse = MSE**0.5 

knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')
train_features = train_two[features] 
train_target = train_two['price']
knn.fit(train_features, train_target)
predictions = knn.predict(test_two[features])
MSE_1 = mean_squared_error(test_two['price'] , predictions)
iteration_two_rmse = MSE_1**0.5

avg_rmse = np.mean([iteration_one_rmse, iteration_two_rmse]) 

In [30]:
avg_rmse

129.37494585774908

### K-Fold Cross Validation

If we average the two RMSE values from the last step, we get an RMSE value of approximately 129.37. Holdout validation is actually a specific example of a larger class of validation techniques called k-fold cross-validation. Holdout validation is better than train/test validation because the model isn't repeatedly biased towards a specific subset of the data, but each of the two models is trained on only half of the available data. K-fold cross validation, on the other hand, takes advantage of a larger proportion of the data during training while still rotating through different subsets of the data to avoid the issues of train/test validation.

Here's the algorithm for k-fold cross validation:

1. split the full dataset into k equal-length partitions,
2. select k-1 partitions as the training set and the remaining partition as the test set,
3. train the model on the training set,
4. use the trained model to predict labels on the test fold,
5. compute the test fold's error metric,
6. repeat steps 2 through 5 k-1 more times, until each partition has been used as the test set for one iteration,
7. calculate the mean of the k error values.

Holdout validation is essentially a version of k-fold cross validation when k is equal to 2. Generally, 5 or 10 folds are used for k-fold cross-validation.
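
To make the rotation concrete, here is a minimal NumPy-only sketch of how the train and test row positions change from one iteration to the next. `kfold_indices` is a hypothetical helper written for illustration; it does not shuffle, and it uses `numpy.array_split` so the partitions may differ in size by one row when the row count is not divisible by k.

```python
import numpy as np

def kfold_indices(n_rows, k):
    """Hypothetical helper: yield (train_idx, test_idx) for each of the k iterations."""
    partitions = np.array_split(np.arange(n_rows), k)
    for i in range(k):
        test_idx = partitions[i]
        # Every partition except the i-th one forms the training set
        train_idx = np.concatenate([p for j, p in enumerate(partitions) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, 5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

With k set to 2 the same helper produces the two iterations of holdout validation.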

As you increase the number of folds, the number of observations in each fold decreases and the variance of the fold-by-fold errors increases. Let's start by manually partitioning the data set into 5 folds. Instead of splitting into 5 dataframes, let's add a column that specifies which fold the row belongs to. This way, we can easily select our training set and testing set.

### Instructions
Add a new column to dc_listings named fold that contains the fold number each row belongs to:
1. Fold 1 should have rows from index 0 up to 745, not including 745.
2. Fold 2 should have rows from index 745 up to 1490, not including 1490.
3. Fold 3 should have rows from index 1490 up to 2234, not including 2234.
4. Fold 4 should have rows from index 2234 up to 2978, not including 2978.
5. Fold 5 should have rows from index 2978 up to 3723, not including 3723.

Display the unique value counts for the fold column to confirm that each fold has roughly the same number of elements.

Display the number of missing values in the fold column to confirm we didn't miss any rows.

In [31]:
dc_listings.loc[(dc_listings.index[0:745], "fold")] = 1
dc_listings.loc[(dc_listings.index[745:1490], "fold")] = 2
dc_listings.loc[(dc_listings.index[1490:2234], "fold")] = 3
dc_listings.loc[(dc_listings.index[2234:2978], "fold")] = 4
dc_listings.loc[(dc_listings.index[2978:3723], "fold")] = 5
dc_listings['fold'].value_counts()
dc_listings['fold'].isnull().sum()

0

In [32]:
dc_listings['fold'].value_counts() 

5.0    745
2.0    745
1.0    745
4.0    744
3.0    744
Name: fold, dtype: int64

In [33]:
dc_listings['fold'].isnull().sum()

0
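
The slice boundaries used above are specific to a dataset with 3,723 rows. As a hedged sketch, fold labels for any row count and any number of folds could be derived like this; `assign_folds` is a hypothetical helper, and because `numpy.array_split` gives the extra rows to the first folds, its boundaries differ slightly from the ones in the exercise.

```python
import numpy as np

def assign_folds(n_rows, k):
    """Hypothetical helper: return an array of fold numbers (1..k), one per row position."""
    folds = np.empty(n_rows, dtype=int)
    for fold_number, positions in enumerate(np.array_split(np.arange(n_rows), k), start=1):
        folds[positions] = fold_number
    return folds

fold_labels = assign_folds(3723, 5)
fold_sizes = np.bincount(fold_labels)[1:]   # counts for folds 1 through 5
print(fold_sizes)
```

The resulting array could be assigned to a dataframe column in place of the five hard-coded slices, provided the dataframe has already been shuffled.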

### First iteration

### Instructions

Train a k-nearest neighbors model using the accommodates column as the sole feature from folds 2 to 5 as the training set.

Use the model to make predictions on the test set (accommodates column from fold 1) and assign the predicted labels to labels.

Calculate the RMSE value by comparing the price column with the predicted labels.

Assign the RMSE value to iteration_one_rmse.

In [9]:
train_one = dc_listings[dc_listings['fold'] != 1] 

In [10]:
test_one = dc_listings[dc_listings ['fold'] == 1] 

In [11]:
features = ['accommodates']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')
train_features = train_one[features] 
train_target = train_one['price']
knn.fit(train_features, train_target)
predictions = knn.predict(test_one[features])
MSE = mean_squared_error(test_one['price'] , predictions)
iteration_one_rmse = MSE**0.5 

In [12]:
iteration_one_rmse 

133.1107549141898

### Function for training models

### Instructions

Write a function named train_and_validate that takes in a dataframe as the first parameter (df) and a list of fold values (1 to 5 in our case) as the second parameter (folds). This function should:

1. Train n models (where n is the number of folds) and perform k-fold cross validation (using n folds). Use the default k value for the KNeighborsRegressor class.
2. Return a list of RMSE values, where the first element is the RMSE for when fold 1 was the test set, the second element is the RMSE for when fold 2 was the test set, and so on.

Use the train_and_validate function to return the list of RMSE values for the dc_listings DataFrame and assign to rmses.

Calculate the mean of these values and assign to avg_rmse.

Display both rmses and avg_rmse.

In [34]:
def train_and_validate(df, folds):
    fold_rmses = []
    for fold in folds:
        # Hold out the current fold for testing and train on all the others
        train = df[df['fold'] != fold]
        test = df[df['fold'] == fold]
        knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')
        features = ['accommodates']
        knn.fit(train[features], train['price'])
        predictions = knn.predict(test[features])
        mse = mean_squared_error(test['price'], predictions)
        fold_rmses.append(mse ** 0.5)
    return fold_rmses

In [35]:
fold_i = [1,2,3,4,5] 
rmse_iteration = train_and_validate(dc_listings, fold_i)
rmse_iteration 

[132.10511149543564,
 149.59109778630798,
 109.77740576823767,
 130.12078466621134,
 117.80601162860462]

In [36]:
avvg = np.mean(rmse_iteration) 
avvg 

127.88008226895946

In [37]:
# Reuse train_and_validate() defined above, then use np.mean to calculate the mean.
fold_ids = [1,2,3,4,5]
rmse_iteration = train_and_validate(dc_listings, fold_ids)
avg_rmse = np.mean(rmse_iteration)

In [38]:
avg_rmse 

127.88008226895946

### Performing K-Fold Cross Validation Using Scikit-Learn

While the average RMSE value was approximately 128, the RMSE values ranged from roughly 110 to 150. This large amount of variability between the RMSE values means that we're either using a poor model or a poor evaluation criterion (or a bit of both!). By implementing your own k-fold cross-validation function, you hopefully acquired a good understanding of the inner workings of the technique. The function we wrote, however, has many limitations. For example, if we want to change the number of folds, we need to make the function more general so it can also handle randomizing the ordering of the rows in the dataframe and splitting them into folds.

In machine learning, we're interested in building a good model and accurately understanding how well it will perform. To build a better k-nearest neighbors model, we can change the features it uses or tweak the number of neighbors (a hyperparameter). To accurately understand a model's performance, we can perform k-fold cross validation and select the proper number of folds. We've learned how scikit-learn makes it easy for us to quickly experiment with these different knobs when it comes to building a better model. Let's now dive into how we can use scikit-learn to handle cross-validation as well.

First, we instantiate the KFold class from sklearn.model_selection. Its main parameters are:

1. n_splits is the number of folds you want to use,
2. shuffle is used to toggle shuffling of the ordering of the observations in the dataset,
3. random_state is used to specify the random seed value if shuffle is set to True.

You'll notice here that none of these parameters depend on the data set at all.

This is because a KFold instance only describes how to split. It generates the train and test row indices for each fold once it is given data, which happens when we pass it to the cross_val_score() function, also from sklearn.model_selection.
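
As a brief sketch of what this looks like in practice, the snippet below creates a KFold instance and then calls its `split()` method directly on a small placeholder array (`X_demo` is made up for illustration). Normally `cross_val_score()` makes this call for us.

```python
import numpy as np
from sklearn.model_selection import KFold

# The constructor only records the splitting strategy; no data is involved yet
kf = KFold(n_splits=5, shuffle=True, random_state=1)

X_demo = np.arange(20).reshape(10, 2)   # 10 placeholder rows, 2 features

# split() yields one (train indices, test indices) pair per fold
for fold_number, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    print(fold_number, "train:", train_idx, "test:", test_idx)

fold_sizes = [len(test_idx) for _, test_idx in kf.split(X_demo)]
```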

Together, the KFold class and the cross_val_score() function allow us to compactly train and test using k-fold cross validation. Here are the relevant parameters for the cross_val_score() function:

1. estimator is a scikit-learn model that implements the fit method (e.g. an instance of KNeighborsRegressor),
2. X is the list or 2D array containing the features you want to train on,
3. y is a list containing the values you want to predict (target column),
4. scoring is a string describing the scoring criterion (the accepted values are listed in the scikit-learn documentation),
5. cv describes the number of folds. Here are some examples of accepted values:
   - an instance of the KFold class,
   - an integer representing the number of folds.

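Depending on the scoring criterion you specify, a single score is returned for each fold. With 'neg_mean_squared_error', each score is the negated MSE for that fold, so the returned values are negative.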
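
The two accepted forms of cv can be compared directly. The sketch below uses synthetic data (`X_demo` and `y_demo` are made-up stand-ins, not the Airbnb data) and relies on the fact that, for a regressor, an integer cv is handled as an unshuffled KFold with that many folds.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic feature and target (assumed values for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 9, size=(100, 1)).astype(float)
y_demo = 40.0 * X_demo[:, 0] + rng.normal(0, 10, size=100)

knn = KNeighborsRegressor()

# cv given as an integer
scores_int = cross_val_score(knn, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
# cv given as an equivalent KFold instance (no shuffling)
scores_kfold = cross_val_score(knn, X_demo, y_demo,
                               scoring="neg_mean_squared_error", cv=KFold(n_splits=5))

# Scores are negated MSEs, so flip the sign before taking the square root
rmses = np.sqrt(np.absolute(scores_int))
print(scores_int, rmses)
```

Passing a KFold instance is the form to use when the rows should be shuffled or a fixed random seed is needed.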

Here's the general workflow for performing k-fold cross-validation using the classes we just described:

1. instantiate the scikit-learn model class you want to fit,
2. instantiate the KFold class and use its parameters to specify the k-fold cross-validation attributes you want,
3. use the cross_val_score() function to return the scoring metric you're interested in.

### Instructions

1. Create a new instance of the KFold class with the following properties:
   - 5 folds,
   - shuffle set to True,
   - random seed set to 1 (so the answer can be checked using the same seed),
   - assigned to the variable kf.
2. Create a new instance of the KNeighborsRegressor class and assign to knn.
3. Use the cross_val_score() function to perform k-fold cross-validation:
   - using the KNeighborsRegressor instance knn,
   - using the accommodates column for training,
   - using the price column as the target column,
   - returning an array of MSE values (one value for each fold).
4. Assign the resulting list of MSE values to mses. Then, take the absolute value followed by the square root of each MSE value. Then, calculate the average of the resulting RMSE values and assign to avg_rmse.

In [44]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True , random_state= 1)

In [45]:
knn = KNeighborsRegressor()


In [50]:
from sklearn.model_selection import cross_val_score
mses = cross_val_score(estimator = knn , X = dc_listings[['accommodates'] ] , y = dc_listings['price'], scoring='neg_mean_squared_error', cv=kf)

In [51]:
mses

array([-11063.43779866, -18518.44746309, -15614.78679195, -23355.82586022,
       -16877.44806452])

In [52]:
rmses = np.sqrt(np.absolute(mses)) 

In [53]:
rmses

array([105.18287788, 136.08250241, 124.95914049, 152.82612951,
       129.91323283])

In [54]:
avg = np.mean(rmses) 

In [55]:
avg

129.7927766238104

### Exploring Different K Values

Choosing the right k value when performing k-fold cross validation is more of an art and less of a science. As we discussed earlier in the mission, a k value of 2 is really just holdout validation. On the other end, setting k equal to n (the number of observations in the data set) is known as leave-one-out cross validation, or LOOCV for short. Through lots of trial and error, data scientists have converged on 10 as the standard k value.
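
As a small sketch of the leave-one-out end of that range, scikit-learn also provides a LeaveOneOut splitter. On a tiny placeholder array (`X_demo` is made up), it produces the same test folds as an unshuffled KFold whose number of splits equals the number of rows:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(6).reshape(6, 1)   # 6 placeholder rows

# k equal to n: every fold contains exactly one observation
kfold_tests = [test_idx.tolist() for _, test_idx in KFold(n_splits=len(X_demo)).split(X_demo)]
loo_tests = [test_idx.tolist() for _, test_idx in LeaveOneOut().split(X_demo)]

print(kfold_tests)
print(loo_tests)
```

Because each test fold holds a single row, LOOCV trains as many models as there are observations, which becomes expensive on larger data sets.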


In the following code block, we display the results of varying the number of folds from 3 to 23. For each number of folds, we calculate and display the average RMSE value across all of the folds and the standard deviation of the RMSE values. Across the different fold counts, the average RMSE value stays around 130 (between about 124 and 138). You'll notice that the standard deviation of the RMSE increases from approximately 14 to about 40 as we increase the number of folds.

In [56]:
from sklearn.model_selection import cross_val_score, KFold
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]
for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  131.65344624453658 std RMSE:  14.030793158965047
5 folds:  avg RMSE:  129.7927766238104 std RMSE:  15.486262685559295
7 folds:  avg RMSE:  137.81589969287887 std RMSE:  19.56967264436837
9 folds:  avg RMSE:  130.56099493414956 std RMSE:  26.876759979969403
10 folds:  avg RMSE:  130.8187363715133 std RMSE:  31.510552928995477
11 folds:  avg RMSE:  131.3996497334271 std RMSE:  33.46527564364315
13 folds:  avg RMSE:  133.25442826716528 std RMSE:  26.55922419112415
15 folds:  avg RMSE:  128.64593528731592 std RMSE:  39.082756287644585
17 folds:  avg RMSE:  129.0621601271626 std RMSE:  40.78829266438868
19 folds:  avg RMSE:  127.18252582906854 std RMSE:  38.90858988084232
21 folds:  avg RMSE:  124.29272528142299 std RMSE:  38.68360177447985
23 folds:  avg RMSE:  127.51758683533754 std RMSE:  39.9913413136974


### Bias-Variance Tradeoff

So far, we've been working under the assumption that a lower RMSE always means that a model is more accurate. This isn't the complete picture, unfortunately.

A model has two sources of error, bias and variance.

Bias describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

Variance describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we will have low bias but high variance. In an ideal world, we want low bias and low variance but in reality, there's always a tradeoff.

The standard deviation of the RMSE values can be a proxy for a model's variance while the average RMSE is a proxy for a model's bias. Bias and variance are the 2 observable sources of error in a model that we can indirectly control.
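
As a quick illustration of these two proxies, the sketch below recomputes them from the five fold-by-fold RMSE values returned earlier by train_and_validate (the numbers are copied from that output; `fold_rmses` is just a local name for them).

```python
import numpy as np

# Fold-by-fold RMSE values copied from the 5-fold run above
fold_rmses = np.array([132.10511149543564, 149.59109778630798, 109.77740576823767,
                       130.12078466621134, 117.80601162860462])

avg_rmse = np.mean(fold_rmses)   # rough proxy for bias
std_rmse = np.std(fold_rmses)    # rough proxy for variance
print(avg_rmse, std_rmse)
```

Comparing both numbers across candidate models gives a fuller picture than the average RMSE alone.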

While k-nearest neighbors can make predictions, it isn't a mathematical model. A mathematical model is usually an equation that can exist without the original data, which isn't true with k-nearest neighbors. In the next two courses, we'll learn about a mathematical model called linear regression. We'll explore the bias-variance tradeoff in greater depth in these next 2 courses because of its importance when working with mathematical models in particular.

### Syntax