## Testing quality of predictions

In [1]:
import pandas as pd
import numpy as np
dc_listings = pd.read_csv("dc_airbnb.csv")
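# regex=False treats ',' and '$' as literal characters ('$' is an end-of-string anchor in a regex)
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)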
dc_listings['price'] = stripped_dollars.astype('float')
train_df = dc_listings.iloc[0:2792]
test_df = dc_listings.iloc[2792:]


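Question: are predictions for the test set made directly from the training set?
Answer: yes. If the test set computed distances against its own features, it would simply repeat what was already done on the training set.

The training data is used directly to predict the test set.
With KNN, however, the amount of computation is the same at training time and at prediction time.
To reduce computation, the results can be saved, but they have to be saved separately for each combination of K and features. This seems to be similar to other model algorithms.

**Compute the Euclidean distance over the features, then take the K samples with the smallest distances**
* For classification, take the majority label among the K neighbors
* For regression, take the mean of the K neighbors' labels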
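
The procedure in the notes above can be sketched in a few lines of NumPy. This is only an illustration, not the code used in the rest of this notebook: `knn_predict`, its parameters, and the toy data are all hypothetical names and values chosen for the example.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=5, task="regression"):
    """Hypothetical helper: predict for one query point from its k nearest training rows."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances (stable sort keeps row order on ties)
    nearest = np.argsort(distances, kind="stable")[:k]
    neighbor_labels = train_y[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        values, counts = np.unique(neighbor_labels, return_counts=True)
        return values[np.argmax(counts)]
    # Regression: mean of the k neighbors' labels
    return neighbor_labels.astype(float).mean()

# Toy data, made up for illustration: one feature, five training rows
X = [[1], [2], [3], [10], [11]]
prices = [50.0, 60.0, 70.0, 200.0, 220.0]
kinds = ["small", "small", "small", "large", "large"]

reg = knn_predict(X, prices, [2], k=3)
cls = knn_predict(X, kinds, [2], k=3, task="classification")
print(reg, cls)
```

With `k=3` and a query of `2`, the three nearest rows are the ones with feature values 1, 2 and 3, so the regression result is the mean of their prices and the classification result is their shared label.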

In [2]:
def predict_price(new_listing):
    # Work on a copy so train_df is not modified (DataFrame.copy() is deep by default)
    temp_df = train_df.copy()
    # Distance = absolute difference in the number of guests accommodated
    temp_df['distance'] = (temp_df['accommodates'] - new_listing).abs()
    temp_df = temp_df.sort_values('distance')
    # Average the prices of the 5 nearest neighbors
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price


In [4]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the iloc slice
test_df = test_df.assign(predicted_price=test_df['accommodates'].apply(predict_price))


In [6]:
test_df['predicted_price'].value_counts()

104.0    397
145.8    178
177.4    103
187.2     85
89.0      73
201.4     47
297.4     22
394.8     10
324.0      6
334.4      5
259.2      3
836.0      1
431.4      1
Name: predicted_price, dtype: int64

In [7]:
test_df['accommodates'].value_counts()

2     397
4     178
3     103
6      85
1      73
5      47
8      22
7      10
10      6
9       5
12      3
13      1
11      1
Name: accommodates, dtype: int64

## Error Metrics

In [11]:
mae = sum(np.absolute(test_df['price'] - test_df['predicted_price']))/len(test_df)

In [12]:
mae

56.29001074113876
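
For reference, the value above is the mean absolute error (MAE). The notation is an assumption introduced here, not something the original notebook uses: $n$ is the number of test rows, $y_i$ the actual price, and $\hat{y}_i$ the predicted price.

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

An MAE of about 56.29 means the predictions are off by roughly $56 per listing on average.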

## Mean Squared Error

In [16]:
mse = sum(np.square(test_df['price'] - test_df['predicted_price']))/len(test_df)
mse

18646.525370569325
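
The same metrics can also be written as small vectorized helpers. The sketch below is one possible way to do that with NumPy. `mae_score` and `mse_score` are hypothetical names, and the three sample prices are made up for illustration.

```python
import numpy as np

def mae_score(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted))

def mse_score(actual, predicted):
    """Mean squared error: average of (actual - predicted)^2."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

# Made-up sample values; the absolute errors are 10, 10 and 30
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 230.0]

mae_demo = mae_score(actual, predicted)  # (10 + 10 + 30) / 3
mse_demo = mse_score(actual, predicted)  # (100 + 100 + 900) / 3
print(mae_demo, mse_demo)
```

Assuming the columns contain no missing values, `mae_score(test_df['price'], test_df['predicted_price'])` should give the same number as the `sum(...)/len(test_df)` expression used in the cells.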

## Training another model

In [17]:
train_df = dc_listings.iloc[0:2792]
test_df = dc_listings.iloc[2792:]

In [18]:
def predict_price(new_listing):
    # Work on a copy so train_df is not modified
    temp_df = train_df.copy()
    # Distance = absolute difference in the number of bathrooms
    temp_df['distance'] = (temp_df['bathrooms'] - new_listing).abs()
    temp_df = temp_df.sort_values('distance')
    # Average the prices of the 5 nearest neighbors
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return predicted_price

In [19]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the iloc slice
test_df = test_df.assign(predicted_price=test_df['bathrooms'].apply(predict_price))


In [21]:
mse = sum(np.square(test_df['predicted_price'] - test_df['price']))/len(test_df)
mse

18405.444081632548

## Root Mean Squared Error

In [22]:
rmse = np.sqrt(mse)
rmse

135.6666653295221
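
Written out as a formula, with assumed notation ($n$ test rows, actual price $y_i$, predicted price $\hat{y}_i$), the root mean squared error is:

$$
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

Taking the square root returns the metric to the units of the target. Here $\sqrt{18405.44} \approx 135.67$, so the RMSE is in dollars, whereas the MSE is in squared dollars.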

## Comparing MAE and RMSE

In [23]:
errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])

In [41]:
mae_one = np.mean(errors_one)
mse_one = sum(np.square(errors_one))/len(errors_one)
rmse_one = np.sqrt(mse_one)

mae_two = np.mean(errors_two)
mse_two = sum(np.square(errors_two))/len(errors_two)
rmse_two = np.sqrt(mse_two)

In [42]:
print(mae_one, mae_two)
print(rmse_one, rmse_two)

7.5 62.5
7.905694150420948 235.82302686548658
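
The two series differ only in their last entry (10 versus 1000). A rough way to see how much more strongly RMSE reacts to that single outlier is to compare the growth of each metric. The sketch below rebuilds the same two error series as NumPy arrays. The `mae` and `rmse` helpers are hypothetical names used only for this example.

```python
import numpy as np

# Same values as errors_one / errors_two above: nine (5, 10) pairs,
# with the last entry of the second series replaced by an outlier
errors_one = np.array([5, 10] * 9)
errors_two = errors_one.copy()
errors_two[-1] = 1000

def mae(errors):
    return np.mean(np.abs(errors))

def rmse(errors):
    return np.sqrt(np.mean(np.square(errors)))

# How many times larger each metric becomes once the outlier is present
mae_ratio = mae(errors_two) / mae(errors_one)
rmse_ratio = rmse(errors_two) / rmse(errors_one)
print(mae_ratio, rmse_ratio)
```

The outlier multiplies the MAE by roughly 8 but the RMSE by roughly 30. Errors are squared before averaging, so large errors dominate the RMSE.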
