## Introduction

In [24]:
import numpy as np
import pandas as pd

dc_listings = pd.read_csv("dc_airbnb.csv")
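# regex=False treats ',' and '$' as literal characters ('$' is an end-of-string anchor in a regex)
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)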
dc_listings['price'] = stripped_dollars.astype('float')
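
As a quick check of the cleaning step, here is a minimal sketch on a few made-up price strings (the sample values are an assumption, not rows from `dc_airbnb.csv`). It chains the same `str.replace` calls and converts the result to floats.

```python
import pandas as pd

# Hypothetical sample values in the same format as the 'price' column.
raw_prices = pd.Series(["$1,250.00", "$99.00", "$80.00"])

# regex=False makes both patterns literal, so '$' is not read as a regex anchor.
cleaned = (
    raw_prices
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .astype("float")
)
print(cleaned.tolist())
```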

In [25]:
cpy_dc_listings = dc_listings.copy()
# Shuffle the rows. iloc expects integer positions, so permute positions rather than index labels.
cpy_dc_listings = cpy_dc_listings.iloc[np.random.permutation(len(cpy_dc_listings))]

In [26]:
cpy_dc_listings

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
3393,100%,92%,2,4,Entire home/apt,1.0,1.0,2.0,99.0,$15.00,$300.00,1,1125,2,38.934368,-77.059080,Washington,20008,DC
3359,,,1,1,Private room,1.0,1.0,1.0,80.0,$70.00,$450.00,1,300,0,38.937169,-77.074239,Washington,20016,DC
3667,90%,100%,2,3,Private room,1.0,1.0,2.0,90.0,,$100.00,1,24,167,38.885347,-76.981833,Washington,20003,DC
287,99%,96%,32,4,Entire home/apt,2.0,1.0,2.0,189.0,$65.00,$200.00,2,365,47,38.907245,-77.048975,Washington,20037,DC
1783,90%,88%,1,3,Private room,1.0,1.5,2.0,128.0,$50.00,,1,365,8,38.950600,-77.085841,Washington,20016,DC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1560,100%,100%,1,2,Entire home/apt,1.0,1.0,1.0,125.0,$35.00,,2,1125,9,38.938868,-77.019719,Washington,20011,DC
959,83%,0%,2,4,Entire home/apt,2.0,1.0,2.0,127.0,$30.00,,2,1125,0,38.917604,-77.037196,Washington,20009,DC
2805,100%,100%,1,4,Entire home/apt,1.0,1.0,2.0,168.0,$25.00,$200.00,3,14,31,38.928605,-77.029234,Washington,20010,DC
3217,,,1,2,Private room,1.0,1.0,1.0,70.0,,,1,1125,0,38.935041,-76.982252,Washington,20017,DC


In [27]:
split_one = cpy_dc_listings.iloc[:1862,:]
split_two = cpy_dc_listings.iloc[1862:,:]

## Holdout Validation

In [28]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

In [29]:
x_train_one = train_one['accommodates'].to_frame()  # scikit-learn expects 2D features, so convert the Series to a DataFrame
y_train_one = train_one['price']
x_test_one = test_one['accommodates'].to_frame()
y_test_one = test_one['price']

x_train_two = train_two['accommodates'].to_frame()
y_train_two = train_two['price']
x_test_two = test_two['accommodates'].to_frame()
y_test_two = test_two['price']

In [32]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')

knn.fit(x_train_one, y_train_one)
predictions = knn.predict(x_test_one)
iteration_one_rmse = mean_squared_error(y_test_one, predictions)**0.5

In [33]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')

knn.fit(x_train_two, y_train_two)
predictions = knn.predict(x_test_two)
iteration_two_rmse = mean_squared_error(y_test_two, predictions)**0.5

In [34]:
avg_rmse = np.mean([iteration_one_rmse, iteration_two_rmse])

In [35]:
print(iteration_one_rmse, iteration_two_rmse, avg_rmse)

131.68732919536032 125.61653427877448 128.6519317370674
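
The two iterations above repeat the same fit, predict, and score steps with the halves swapped. A minimal sketch of that pattern as one helper follows. The function name `rmse_for_split` and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def rmse_for_split(x_train, y_train, x_test, y_test, k=5):
    """Fit a k-nearest neighbors regressor and return its RMSE on the test set."""
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    return mean_squared_error(y_test, predictions) ** 0.5

# Synthetic stand-in data (an assumption, not the Airbnb listings).
rng = np.random.default_rng(0)
x = rng.integers(1, 9, size=(200, 1)).astype(float)
y = 40.0 * x[:, 0] + rng.normal(0, 10, size=200)

# Each half takes one turn as the training set and one turn as the test set.
half = len(x) // 2
rmse_one = rmse_for_split(x[:half], y[:half], x[half:], y[half:])
rmse_two = rmse_for_split(x[half:], y[half:], x[:half], y[:half])
holdout_avg_rmse = np.mean([rmse_one, rmse_two])
print(rmse_one, rmse_two, holdout_avg_rmse)
```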


## K-Fold Cross Validation

**Holdout validation** as performed above, where each half takes one turn as the test set, is essentially **k-fold cross-validation** with `k` equal to `2`. Generally, `5` or `10` folds are used for k-fold cross-validation. Here's a diagram describing each iteration of 5-fold cross-validation:

![Jupyter](./kfold_cross_validation.png)

In [48]:
dc_listings.loc[0:744, 'fold']  = 1

In [49]:
dc_listings.loc[745:1489, 'fold']  = 2

In [50]:
dc_listings.loc[1490:2233, 'fold']  = 3

In [51]:
dc_listings.loc[2234:2977, 'fold']  = 4

In [52]:
dc_listings.loc[2978:3722, 'fold']  = 5

In [53]:
dc_listings['fold'].value_counts()

5.0    745
2.0    745
1.0    745
4.0    744
3.0    744
Name: fold, dtype: int64

In [55]:
dc_listings['fold'].isnull().sum()

0
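
The index ranges above are typed by hand for this dataset's 3,723 rows and follow the original row order. As a hedged alternative, the sketch below assigns shuffled fold labels for any row count. The helper name `assign_folds` and the toy DataFrame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def assign_folds(df, n_folds=5, seed=1):
    """Return a copy of df with a 'fold' column holding labels 1..n_folds."""
    rng = np.random.default_rng(seed)
    shuffled_positions = rng.permutation(len(df))
    folds = np.empty(len(df), dtype=int)
    # np.array_split yields chunks whose sizes differ by at most one row.
    chunks = np.array_split(shuffled_positions, n_folds)
    for fold_id, chunk in enumerate(chunks, start=1):
        folds[chunk] = fold_id
    result = df.copy()
    result["fold"] = folds
    return result

# Toy data (hypothetical): 23 rows do not divide evenly into 5 folds.
toy = pd.DataFrame({"accommodates": np.arange(23)})
toy_with_folds = assign_folds(toy)
print(toy_with_folds["fold"].value_counts().sort_index())
```

Because the labels are integers, the `fold` column can be filtered with `isin()` and `==` exactly as in the cells that follow.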

## First Iteration

In [63]:
x_train_first_fold = dc_listings[dc_listings['fold'].isin(range(2,6))][['accommodates']]
y_train_first_fold = dc_listings[dc_listings['fold'].isin(range(2,6))]['price']

In [64]:
x_test_first_fold = dc_listings[dc_listings['fold'] == 1][['accommodates']]
y_test_first_fold = dc_listings[dc_listings['fold'] == 1]['price']

In [69]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')
knn.fit(x_train_first_fold, y_train_first_fold)
predictions = knn.predict(x_test_first_fold)

iteration_one_rmse = mean_squared_error(y_test_first_fold, predictions)**0.5

In [70]:
iteration_one_rmse

123.64816897663778

## Function for Training Models

In [78]:
import numpy as np
fold_ids = [1,2,3,4,5]

def train_and_validate(df, folds):
    """Return one RMSE per fold, using each fold in turn as the test set."""
    rmses = []
    for test_fold in folds:
        # Train on every fold except the one held out for testing.
        train_folds = list(set(folds) - set([test_fold]))

        # Use the df and folds arguments rather than the global variables.
        x_train = df[df['fold'].isin(train_folds)][['accommodates']]
        y_train = df[df['fold'].isin(train_folds)]['price']
        x_test = df[df['fold'] == test_fold][['accommodates']]
        y_test = df[df['fold'] == test_fold]['price']

        knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')
        knn.fit(x_train, y_train)
        predictions = knn.predict(x_test)

        rmse = mean_squared_error(y_test, predictions)**0.5
        rmses.append(rmse)

    return rmses

In [79]:
rmses = train_and_validate(dc_listings, fold_ids)
avg_rmse = np.mean(rmses)

In [80]:
print(rmses)
print(avg_rmse)

[123.64816897663778, 104.90933995950148, 164.72575286188246, 102.32103626510822, 148.42036980986353]
128.8049335745987


In [77]:
fold_ids = [1,2,3,4,5]
a = 1
set(fold_ids) - set([a])

{2, 3, 4, 5}

## Performing K-Fold Cross Validation Using Scikit-Learn

In machine learning, we're interested in building a good model and accurately understanding how well it will perform. 
* To build a better k-nearest neighbors model, we can change the features it uses or tweak the number of neighbors (a hyperparameter). 
* To accurately understand a model's performance, we can perform k-fold cross validation and select the proper number of folds. 

Here's the general workflow for performing k-fold cross-validation with scikit-learn's `KFold` class and `cross_val_score()` function:

* instantiate the scikit-learn model class you want to fit,
* instantiate the `KFold` class, using its parameters to specify the k-fold cross-validation attributes you want,
* use the `cross_val_score()` function to return the scoring metric you're interested in.

In [86]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=1)

knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')

# Score with negative MSE; the square root is taken afterwards to get the RMSE.
mses = cross_val_score(knn, dc_listings[['accommodates']], dc_listings['price'], scoring='neg_mean_squared_error', cv=kf)

np.abs(mses)

array([13779.82990604, 18843.18802685, 21335.28257718, 11314.62688172,
       21244.80607527])

In [95]:
avg_rmse = np.mean(np.sqrt(np.abs(mses)))

In [96]:
avg_rmse

130.57004998596955

## Exploring Different K Values

In [98]:
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    neg_rmses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_root_mean_squared_error", cv=kf)
    rmses = np.absolute(neg_rmses)
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  127.19146799819767 std RMSE:  7.80114274447321
5 folds:  avg RMSE:  130.57004998596955 std RMSE:  15.968993082617418
7 folds:  avg RMSE:  124.74000565490935 std RMSE:  23.009326104623764
9 folds:  avg RMSE:  133.85427296864364 std RMSE:  20.275996691809862
10 folds:  avg RMSE:  134.50358073016668 std RMSE:  30.83892745302988
11 folds:  avg RMSE:  129.58548991863123 std RMSE:  22.39316430178567
13 folds:  avg RMSE:  133.05101345639838 std RMSE:  27.88932598342725
15 folds:  avg RMSE:  124.86715246014936 std RMSE:  37.03384132069149
17 folds:  avg RMSE:  131.3786960290144 std RMSE:  40.043451719093724
19 folds:  avg RMSE:  129.0143524209374 std RMSE:  44.3383982741942
21 folds:  avg RMSE:  125.49498964946545 std RMSE:  41.03033829748872
23 folds:  avg RMSE:  125.27939162120605 std RMSE:  41.668089858618046


## Bias-Variance Tradeoff

A model has two sources of error, **bias** and **variance**.

* **Bias** describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

* **Variance** describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance. In an ideal world, we want low bias and low variance, but in reality there's always a tradeoff.

The standard deviation of the RMSE values can be a proxy for a model's variance, while the average RMSE is a proxy for a model's bias. Bias and variance are the 2 observable sources of error in a model that we can indirectly control.

![Jupyter](./bias_variance.png)
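
To make the two proxies concrete, the sketch below computes the average and the standard deviation of the per-fold RMSE values for a few `n_neighbors` settings. The helper name `rmse_summary` and the synthetic data are assumptions for illustration, and the numbers it prints say nothing about the Airbnb listings.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

def rmse_summary(model, features, target, n_splits=5, seed=1):
    """Return (mean, standard deviation) of the per-fold RMSE values."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    neg_mses = cross_val_score(model, features, target,
                               scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(-neg_mses)  # scores are negated MSEs, so flip the sign first
    return rmses.mean(), rmses.std()

# Synthetic stand-in data (an assumption, not the Airbnb listings).
rng = np.random.default_rng(0)
features = rng.integers(1, 9, size=(300, 1)).astype(float)
target = 40.0 * features[:, 0] + rng.normal(0, 10, size=300)

for k in (1, 5, 25):
    mean_rmse, std_rmse = rmse_summary(KNeighborsRegressor(n_neighbors=k), features, target)
    print(f"n_neighbors={k}: avg RMSE {mean_rmse:.2f}, std RMSE {std_rmse:.2f}")
```

Reading the two columns side by side is one way to compare models: a lower average RMSE suggests lower bias, and a lower standard deviation suggests lower variance.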