Let's start by splitting the data set into two nearly equal halves.

When splitting the data set, make a copy of each half with .copy() so that later assignments don't produce unexpected results. If you run the code locally in Jupyter Notebook or JupyterLab without .copy(), you may see what is known as a SettingWithCopyWarning.
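
As a small illustration of why the copy matters, the sketch below uses a made-up four-row DataFrame (the values are hypothetical and not taken from dc_airbnb.csv). Adding a column to the copied slice leaves the original frame untouched:

```python
import pandas as pd

# Hypothetical toy data, standing in for the real listings.
listings = pd.DataFrame({
    "accommodates": [2, 4, 6, 8],
    "price": [80.0, 120.0, 200.0, 260.0],
})

# .copy() gives the slice its own data, independent of `listings`.
first_half = listings.iloc[0:2].copy()
first_half["predicted_price"] = [90.0, 110.0]

print(list(listings.columns))    # the original has no new column
print(list(first_half.columns))  # the copy does
```

Whether the warning appears without .copy() depends on the pandas version, so treat this as a demonstration of the safe pattern rather than a way to reproduce the warning.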

In [2]:
import numpy as np
import pandas as pd

dc_listings = pd.read_csv("dc_airbnb.csv")
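# Strip thousands separators and dollar signs, then convert to float.
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
# regex=False treats '$' as a literal character rather than a regex anchor.
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')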

shuffled_index = np.random.permutation(dc_listings.index)
dc_listings = dc_listings.reindex(shuffled_index)

split_one = dc_listings.iloc[0:1862].copy()
split_two = dc_listings.iloc[1862:].copy()

Now we can:

 - train a k-nearest neighbors model on the first half,
 - test this model on the second half,
 - train a k-nearest neighbors model on the second half,
 - test this model on the first half.

In [3]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

# First half
model = KNeighborsRegressor()
model.fit(train_one[["accommodates"]], train_one["price"])
test_one["predicted_price"] = model.predict(test_one[["accommodates"]])
iteration_one_rmse = mean_squared_error(test_one["price"], test_one["predicted_price"])**(1/2)

# Second half
model.fit(train_two[["accommodates"]], train_two["price"])
test_two["predicted_price"] = model.predict(test_two[["accommodates"]])
iteration_two_rmse = mean_squared_error(test_two["price"], test_two["predicted_price"])**(1/2)

avg_rmse = np.mean([iteration_two_rmse, iteration_one_rmse])

print(iteration_one_rmse, iteration_two_rmse, avg_rmse)

163.2408585628993 131.22316287438863 147.23201071864395


### K-Fold Cross Validation

If we average the two RMSE values from the last step, we get an RMSE value of approximately 147.23. Holdout validation is a specific case of a larger class of validation techniques called k-fold cross validation. Holdout validation is better than train/test validation because the model isn't repeatedly biased towards a specific subset of the data, but each of the two models is trained on only half the available data. K-fold cross validation, on the other hand, uses a larger proportion of the data during training while still rotating through different subsets of the data to avoid the issues of train/test validation.

Here's the algorithm for k-fold cross validation:

 - splitting the full dataset into k equal-length partitions.
 - selecting k-1 partitions as the training set and
 - selecting the remaining partition as the test set
 - training the model on the training set.
 - using the trained model to predict labels on the test fold.
 - computing the test fold's error metric.
 - repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration.
 - calculating the mean of the k error values.
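
The steps above can be sketched without scikit-learn. This is a minimal illustration under stated assumptions, not the method used in this notebook: the data is synthetic, the helper name k_fold_rmses is hypothetical, and a straight-line least-squares fit stands in for the k-nearest neighbors model. When the row count is not divisible by k, np.array_split makes the partitions nearly equal rather than exactly equal.

```python
import numpy as np

def k_fold_rmses(x, y, k, seed=1):
    """Return one RMSE per fold (hypothetical helper, line fit as the model)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(x))
    # Split the shuffled row positions into k (nearly) equal partitions.
    folds = np.array_split(shuffled, k)
    rmses = []
    for i, test_idx in enumerate(folds):
        # The other k-1 partitions form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Train: fit y = slope * x + intercept on the training rows.
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        # Predict on the held-out fold and compute its error metric.
        predictions = slope * x[test_idx] + intercept
        rmses.append(np.sqrt(np.mean((y[test_idx] - predictions) ** 2)))
    return rmses

# Synthetic data: a noisy linear relationship.
data_rng = np.random.default_rng(0)
x = data_rng.uniform(1, 10, size=100)
y = 50 * x + data_rng.normal(0, 20, size=100)

fold_rmses = k_fold_rmses(x, y, k=5)
print(fold_rmses)
print(np.mean(fold_rmses))  # mean of the k error values
```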

Holdout validation is essentially a version of k-fold cross validation when k is equal to 2. Generally, 5 or 10 folds are used for k-fold cross validation.

In [4]:
dc_listings.loc[dc_listings.index[0:745], "fold"] = 1
dc_listings.loc[dc_listings.index[745:1490], "fold"] = 2
dc_listings.loc[dc_listings.index[1490:2234], "fold"] = 3
dc_listings.loc[dc_listings.index[2234:2978], "fold"] = 4
dc_listings.loc[dc_listings.index[2978:3723], "fold"] = 5

print(dc_listings['fold'].value_counts())
print("\n Num of missing values: ", dc_listings['fold'].isnull().sum())

5.0    745
2.0    745
1.0    745
4.0    744
3.0    744
Name: fold, dtype: int64

 Num of missing values:  0
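
The boundaries above are hard-coded for 3,723 rows. As a sketch of one way to avoid that, the hypothetical helper assign_folds below derives the boundaries from the length of the DataFrame with np.array_split. It assumes the rows have already been shuffled, labels them by position rather than by index value, and may place the slightly larger folds differently from the manual split.

```python
import numpy as np
import pandas as pd

def assign_folds(df, k):
    """Return a copy of df with a 'fold' column numbered 1..k (hypothetical helper)."""
    labelled = df.copy()
    fold_labels = np.empty(len(labelled), dtype=int)
    # array_split yields k contiguous blocks of row positions, as equal as possible.
    blocks = np.array_split(np.arange(len(labelled)), k)
    for fold_id, positions in enumerate(blocks, start=1):
        fold_labels[positions] = fold_id
    labelled["fold"] = fold_labels
    return labelled

# Toy frame with 11 rows, so the folds cannot all be the same size.
demo = assign_folds(pd.DataFrame({"price": range(11)}), k=3)
print(demo["fold"].value_counts().sort_index())
```

Assigning the whole column at once also keeps the fold labels as integers instead of floats.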


In [5]:
# from sklearn.neighbors import KNeighborsRegressor
# from sklearn.metrics import mean_squared_error

# Training
model = KNeighborsRegressor()
train_iteration_one = dc_listings[dc_listings["fold"] != 1]
test_iteration_one = dc_listings[dc_listings["fold"] == 1].copy()
model.fit(train_iteration_one[["accommodates"]], train_iteration_one["price"])

# Predicting
labels = model.predict(test_iteration_one[["accommodates"]])
test_iteration_one["predicted_price"] = labels
iteration_one_mse = mean_squared_error(test_iteration_one["price"], test_iteration_one["predicted_price"])
iteration_one_rmse = iteration_one_mse ** (1/2)

print(iteration_one_rmse)

137.74291626020474


From the first iteration, we achieved an RMSE value of roughly 138. Let's calculate the RMSE values for the remaining iterations. To make the iteration process easier, let's wrap the code we wrote in the previous step in a function.

In [6]:
fold_ids = [1,2,3,4,5]
def train_and_validate(df, folds):
    fold_rmses = []
    for fold in folds:
        # Train
        model = KNeighborsRegressor()
        train = df[df["fold"] != fold]
        test = df[df["fold"] == fold].copy()
        model.fit(train[["accommodates"]], train["price"])
        # Predict
        labels = model.predict(test[["accommodates"]])
        test["predicted_price"] = labels
        mse = mean_squared_error(test["price"], test["predicted_price"])
        rmse = mse**(1/2)
        fold_rmses.append(rmse)
    return fold_rmses

rmses = train_and_validate(dc_listings, fold_ids)
print(rmses)
avg_rmse = np.mean(rmses)
print(avg_rmse)

[137.74291626020474, 120.65465299187844, 155.25353977575915, 139.64903454540016, 86.51907746330096]
127.96384420730867

scikit-learn can handle the fold assignment and scoring for us through KFold and cross_val_score. Because cross_val_score follows a "higher is better" convention, the "neg_mean_squared_error" scorer returns negated MSE values, which is why the code takes the absolute value before the square root.


In [7]:
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(5, shuffle=True, random_state=1)
model = KNeighborsRegressor()
mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)

print(rmses)
print(avg_rmse)

[136.13196696 121.82150811 155.53378001 162.76730372  99.306953  ]
135.11230236229338


Choosing the right k value when performing k-fold cross validation is more of an art and less of a science. 

As we discussed earlier in the mission, a k value of 2 is really just holdout validation. 

On the other end, setting k equal to n (the number of observations in the data set) is known as leave-one-out cross validation, or LOOCV for short. 
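
To make the k = n case concrete, here is a minimal leave-one-out sketch on six made-up prices. As an assumption for illustration only, the "model" simply predicts the mean of the training prices rather than using k-nearest neighbors. Each test fold holds a single observation, so the fold's RMSE reduces to the absolute error of that one prediction.

```python
import numpy as np

# Hypothetical prices, not taken from the dc_airbnb data.
prices = np.array([80.0, 120.0, 200.0, 260.0, 95.0, 150.0])
n = len(prices)

fold_errors = []
for i in range(n):
    # Train on every observation except the i-th one.
    train = np.delete(prices, i)
    prediction = train.mean()
    # RMSE over a single held-out point is just the absolute error.
    fold_errors.append(abs(prices[i] - prediction))

print(len(fold_errors))      # one error value per observation
print(np.mean(fold_errors))  # mean of the n error values
```

The loop fits n separate models, which is why leave-one-out becomes expensive on larger data sets.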

Through lots of trial and error, data scientists have converged on 10 as the standard k value.

In [8]:

num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  127.06965345727298 std RMSE:  15.846517938155053
5 folds:  avg RMSE:  135.11230236229338 std RMSE:  22.984968219940907
7 folds:  avg RMSE:  128.7310175398072 std RMSE:  27.221295979788373
9 folds:  avg RMSE:  134.61697548752926 std RMSE:  36.9612947796052
10 folds:  avg RMSE:  126.40912746156773 std RMSE:  37.14777750653743
11 folds:  avg RMSE:  133.0868875025243 std RMSE:  32.81227878239363
13 folds:  avg RMSE:  129.41850536672487 std RMSE:  37.0369322754536
15 folds:  avg RMSE:  132.96768055999738 std RMSE:  36.432734176987644
17 folds:  avg RMSE:  130.22270090831245 std RMSE:  42.9675639583121
19 folds:  avg RMSE:  125.54727462548044 std RMSE:  40.971624884633066
21 folds:  avg RMSE:  124.9508658320053 std RMSE:  42.760538966024555
23 folds:  avg RMSE:  126.71180034548591 std RMSE:  44.13210992839968
