
#  Knn
## Airbnb 房价预测任务

<img src="1.png" style="width:700px;height:400px;float:left">

### 读取数据
---

In [1]:
import pandas as pd

features = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','number_of_reviews']
dc_listings = pd.read_csv('listings.csv')

dc_listings = dc_listings[features]
print(dc_listings.shape)
dc_listings.head()

(3723, 7)


Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,number_of_reviews
0,4,1.0,1.0,2.0,$160.00,1,0
1,6,3.0,3.0,3.0,$350.00,2,65
2,1,1.0,2.0,1.0,$50.00,2,1
3,2,1.0,1.0,1.0,$95.00,1,0
4,4,1.0,1.0,1.0,$50.00,7,0


#### 数据特征：
- accommodate:可以容纳的旅客的数量
- bedroom
- bathroom
- ...

### 如果我有一个三个房间的房子，我能租多少钱？


<img src="2.png" style="width:600px;height:230px;float:left">

#### k代表我们的候选的对象个数

## K近邻原理

<img src="3.png" style="width:600;height:230;float:left">

<img src="4.png" style="width:600;height:230;float:left">

<img src="5.png" style="width:600;height:230;float:left">

### 距离的定义
如何才能知道哪些数据样本跟我最近

<img src="6.png" style="width:600;height:230;float:left">

 假设我有三个房间的房子

In [2]:
import numpy as np

our_acc_value = 3
dc_listings['distance'] = np.abs(dc_listings.accommodates - our_acc_value)
dc_listings.distance.value_counts().sort_index()

0      461
1     2294
2      503
3      279
4       35
5       73
6       17
7       22
8        7
9       12
10       2
11       4
12       6
13       8
Name: distance, dtype: int64

这里我们用了绝对值来计算，和我们距离为0的样本有 461条

sample操作可以得到洗牌后的数据

In [3]:
dc_listings = dc_listings.sample(frac=1,random_state=0)
dc_listings = dc_listings.sort_values('distance')
dc_listings.price.head()

2645     $75.00
2825    $120.00
2145     $90.00
2541     $50.00
3349    $105.00
Name: price, dtype: object

现在的问题是，这里面的价格是字符串类型，需要装换一下

In [4]:
dc_listings['price'] = dc_listings.price.str.replace("\$|,",'').astype(float)
mean_price = dc_listings.price.iloc[:5].mean()
print(mean_price)

88.0


得到了平均价格，也就是我们房子的大致价格了



# 模型的评估
---

<img src="7.png" style="width:600;height:230;float:left">

首先制定好训练集和测试卷

In [6]:
dc_listings.drop('distance',axis=1)
train_df = dc_listings.copy().iloc[:2792]
test_df = dc_listings.copy().iloc[2782:]

#### 基于单变量预测价格

In [11]:
def predict_price(new_listing_value,feature_column):
    temp_df = train_df
    temp_df['distance'] = np.abs(temp_df[feature_column]- new_listing_value)
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]
    predict_price = knn_5.mean()
    return predict_price

In [12]:
test_df.accommodates[:5]

418     1
1680    1
585     1
1849    1
1813    1
Name: accommodates, dtype: int64

In [13]:
test_df['predict_price'] = test_df.accommodates.apply(predict_price,feature_column='accommodates')

这样我们就得到了测试集中，所以房子的价格了

root mean squared error (RMSE) 均方根误差


<img src="8.png" style="width:600;height:230;float:left">

In [14]:
test_df['squared_error'] = (test_df['predict_price'] - test_df['price']) ** (2)
mse = test_df['squared_error'].mean()
rmse = mse ** (1/2)
print(rmse)

211.86474130412552


现在我们得到了对与一个变量的模型评估得分

---

### 不同的变量效果会不会不一样呢

In [16]:
for feature in ['accommodates','bedrooms','bathrooms','number_of_reviews']:
    test_df['predict_price'] = test_df[feature].apply(predict_price,feature_column=feature)
    test_df['squared_error'] = (test_df['predict_price'] - test_df['price']) ** 2
    mse = test_df['squared_error'].mean()
    rmse = mse ** (1/2)
    print("RMSE for {} column:{}".format(feature,rmse))

RMSE for accommodates column:211.86474130412552
RMSE for bedrooms column:198.75523158621627
RMSE for bathrooms column:229.07053018712944
RMSE for number_of_reviews column:234.8428230560047


看起来结果差异还是比较大的，接下来我们要做的就是综合利用所有变量来计算评分

In [24]:
import pandas as pd
from sklearn.preprocessing import  StandardScaler

features = ['accommodates','bedrooms','bathrooms','beds','price','number_of_reviews','minimum_nights','maximum_nights']

dc_listing = pd.read_csv('./listings.csv')
dc_listing = dc_listing[features]
dc_listing['price'] = dc_listing.price.str.replace("\$|,","").astype(float)
dc_listing = dc_listing.dropna()

dc_listing[features] = StandardScaler().fit_transform(dc_listing[features])

normlized_listing = dc_listing

print(dc_listing.shape)

normlized_listing.head()

(3671, 8)


Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,number_of_reviews,minimum_nights,maximum_nights
0,0.40142,-0.249501,-0.439211,0.297386,0.081119,-0.516779,-0.341421,-0.016575
1,1.399466,2.129508,2.969551,1.141704,1.462622,1.706767,-0.065047,-0.016606
2,-1.095648,-0.249501,1.26517,-0.546933,-0.718699,-0.482571,-0.065047,-0.016575
3,-0.596625,-0.249501,-0.439211,-0.546933,-0.391501,-0.516779,-0.341421,-0.016575
4,0.40142,-0.249501,-0.439211,-0.546933,-0.718699,-0.516779,1.316824,-0.016575


In [25]:
norm_train_df = normlized_listing.iloc[:2792]
norm_test_df = normlized_listing.iloc[2792:]

多变量距离的计算

<img src="9.png" style="width:700px;height:400px;float:left">

In [26]:
from scipy.spatial import distance

first_listing = normlized_listing.iloc[0][['accommodates','bathrooms']]
fifth_listing = normlized_listing.iloc[20][['accommodates','bathrooms']]
first_fifth_dictance = distance.euclidean(first_listing,fifth_listing)
first_fifth_dictance

3.723019604017032

### 多变量 KNN模型

In [38]:
def prediec_price_mulvariate(new_listing_value,feature_columns):
        temp_df = norm_train_df
        temp_df['distance'] = distance.cdist(temp_df[feature_columns],[new_listing_value[feature_columns]])
        temp_df = temp_df.sort_values('distance')
        knn_5 = temp_df.price.iloc[:5]
        predict_price = knn_5.mean()
        return predict_price

cols = ['accommodates','bathrooms']
norm_test_df['predicted_price'] = norm_test_df[cols].apply(prediec_price_mulvariate,feature_columns=cols,axis=1)
norm_test_df['squared_error'] = (norm_test_df['predicted_price'] - norm_test_df['price']) ** 2
mse = norm_test_df['squared_error'].mean()
rmse = mse ** (1/2)
print(rmse)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


0.7894063922577537


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()


### 使用 Sklearn 来完成 knn

In [39]:
from sklearn.neighbors import KNeighborsRegressor
cols = ['accommodates','bathrooms']
knn = KNeighborsRegressor()
knn.fit(norm_train_df[cols],norm_train_df['price'])
two_feature_predictions = knn.predict(norm_test_df[cols])

In [40]:
from sklearn.metrics import mean_squared_error
two_features_mse = mean_squared_error(norm_test_df['price'],two_feature_predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_rmse)

0.857101359198754


#### 加入更多的特征

In [45]:
cols = ['accommodates','bedrooms','number_of_reviews','bathrooms','maximum_nights','minimum_nights'] 
knn = KNeighborsRegressor()

knn.fit(norm_train_df[cols],norm_train_df['price'])
y_hat = knn.predict(norm_test_df[cols])

mse = mean_squared_error(norm_test_df['price'],y_hat)
rmse = mse ** (1/2)
print(rmse)

0.7912808461922323
