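# 1. Classification and Regression
###     A quantitative output is called regression, i.e. predicting a continuous variable. Predicting tomorrow's temperature in degrees is a regression task.
###     A qualitative output is called classification, i.e. predicting a discrete variable. Predicting whether tomorrow will be overcast, sunny, or rainy is a classification task.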


# 2. Machine Learning: K-Nearest Neighbors

### House price prediction task

<img src="民宿.png" style="width:800px;height:480px;float:left">

## Reading the data

In [1]:
import pandas as pd

features = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']
# the 8 feature names (column names of the data), collected in a list
dc_listings = pd.read_csv('listings.csv')  # pd.read_csv reads a CSV file into a DataFrame

dc_listings = dc_listings[features]  # keep only the listed feature columns
print(dc_listings.shape)

dc_listings.head()  # show the first 5 rows

(3723, 8)


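| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1.0 | 1.0 | 2.0 | $160.00 | 1 | 1125 | 0 |
| 1 | 6 | 3.0 | 3.0 | 3.0 | $350.00 | 2 | 30 | 65 |
| 2 | 1 | 1.0 | 2.0 | 1.0 | $50.00 | 2 | 1125 | 1 |
| 3 | 2 | 1.0 | 1.0 | 1.0 | $95.00 | 1 | 1125 | 0 |
| 4 | 4 | 1.0 | 1.0 | 1.0 | $50.00 | 7 | 1125 | 0 |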


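Features in the data:

* accommodates: number of guests the listing can accommodate, treated here as the number of rooms
* bedrooms: number of bedrooms
* bathrooms: number of bathrooms
* beds: number of beds
* price: price per night
* minimum_nights: minimum number of nights a guest must book
* maximum_nights: maximum number of nights a guest may book
* number_of_reviews: number of reviews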

## I have a place with 3 bedrooms. How much should I charge?
If you don't know, go and see what other people charge for their 3-bedroom places!

<img src="2.png" style="width:600px;height:230px;float:left">

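K is the number of candidates we consider: we look up the prices of the K listings whose room count is closest to mine.

## How K-nearest neighbors works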

<img src="3.png" style="width:600px;height:330px;float:left">

Suppose our data source contains only 5 records, and I want to set a price for my place (which has just one room).

<img src="4.png" style="width:600px;height:330px;float:left">

Here we choose K=3, that is, we pick the 3 listings most similar to mine.

<img src="5.png" style="width:600px;height:330px;float:left">

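Combining the prices of these three listings gives a rough estimate of what my place is worth.

## Defining distance
How do we know which samples are closest to mine?

### Euclidean distance formula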

<img src="距离.png" style="width:800px;height:180px;float:left">

Here p1 through pn are all the feature values of one record, and q1 through qn are all the feature values of the other record.
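
Because the formula itself is only available as an image, it may help to have it written out. Using the p and q notation from the sentence above, the standard Euclidean distance between two records with n features is:

```latex
d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}
        = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
```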

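# KNN algorithm steps (regression):

### 1. Compute the distance between the sample to predict and every sample in the training set
### 2. Sort the training samples by distance in ascending order
### 3. Take the K training samples with the smallest distances and use the mean of their labels as the prediction
### (For a KNN classification task, step 3 becomes: take the K training samples with the smallest distances and return the most frequent class among them as the predicted class)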
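
The three regression steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the function name `knn_regress` and the toy data are made up for the example, and Euclidean distance is assumed.

```python
import math

def knn_regress(train_features, train_labels, query, k=3):
    """Predict a value for `query` as the mean label of its k nearest training samples."""
    # Step 1: distance from the query to every training sample (Euclidean)
    distances = [math.dist(row, query) for row in train_features]
    # Step 2: sort the training indices by distance, ascending
    order = sorted(range(len(train_features)), key=lambda i: distances[i])
    # Step 3: average the labels of the k closest samples
    nearest = order[:k]
    return sum(train_labels[i] for i in nearest) / len(nearest)

# Toy data (made up for illustration): one feature per listing, label is a nightly price
toy_features = [(1,), (2,), (3,), (3,), (6,)]
toy_prices = [50.0, 95.0, 75.0, 120.0, 350.0]
toy_prediction = knn_regress(toy_features, toy_prices, query=(3,), k=3)
print(toy_prediction)  # mean of 75, 120 and 95
```

For classification, only step 3 would change: instead of averaging, return the most common label among the `k` nearest samples.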

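Suppose our place accommodates 3 guests (the accommodates column, which stands in for the number of rooms)

### Predicting the price from a single variable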

<img src="单距离公式.png" style="width:400px;height:100px;float:left">

In [2]:
import numpy as np

our_acc_value = 3

dc_listings['distance'] = np.abs(dc_listings.accommodates - our_acc_value)
# np.abs takes the absolute value
# dc_listings.accommodates - our_acc_value subtracts a scalar from every element of the column
dc_listings.distance.value_counts().sort_index()
# value_counts() counts how often each value occurs; sort_index() sorts by the index, which here is the distance

0      461
1     2294
2      503
3      279
4       35
5       73
6       17
7       22
8        7
9       12
10       2
11       4
12       6
13       8
Name: distance, dtype: int64

In [3]:
dc_listings.head()

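| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews | distance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1.0 | 1.0 | 2.0 | $160.00 | 1 | 1125 | 0 | 1 |
| 1 | 6 | 3.0 | 3.0 | 3.0 | $350.00 | 2 | 30 | 65 | 3 |
| 2 | 1 | 1.0 | 2.0 | 1.0 | $50.00 | 2 | 1125 | 1 | 2 |
| 3 | 2 | 1.0 | 1.0 | 1.0 | $95.00 | 1 | 1125 | 0 | 1 |
| 4 | 4 | 1.0 | 1.0 | 1.0 | $50.00 | 7 | 1125 | 0 | 1 |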


In [4]:
dc_listings.accommodates[:5]
dc_listings['accommodates'][:5]

0    4
1    6
2    1
3    2
4    4
Name: accommodates, dtype: int64

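Here the distance is just the absolute difference; 461 listings have distance 0, meaning the same accommodates value as ours.

The sample operation returns the rows in shuffled order. Shuffling before sorting keeps listings tied at the same distance from simply being taken in their original file order.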

In [5]:
dc_listings = dc_listings.sample(frac=1,random_state=0)
# sample(frac=1, random_state=0) shuffles the rows by drawing a random sample from dc_listings
# frac is the fraction to draw; frac=1 selects 100% of the samples
# random_state sets the random seed, so the shuffle is the same on every rerun
dc_listings = dc_listings.sort_values('distance')  # sort the samples by distance, ascending
print(dc_listings.price.head())
dc_listings.head()

2645     $75.00
2825    $120.00
2145     $90.00
2541     $50.00
3349    $105.00
Name: price, dtype: object


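| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews | distance |
|---|---|---|---|---|---|---|---|---|---|
| 2645 | 3 | 1.0 | 1.0 | 1.0 | $75.00 | 7 | 180 | 24 | 0 |
| 2825 | 3 | 3.0 | 2.0 | 2.0 | $120.00 | 1 | 1125 | 0 | 0 |
| 2145 | 3 | 1.0 | 2.0 | 2.0 | $90.00 | 1 | 1125 | 55 | 0 |
| 2541 | 3 | 1.0 | 1.0 | 1.0 | $50.00 | 1 | 1125 | 1 | 0 |
| 3349 | 3 | 1.0 | 1.0 | 1.0 | $105.00 | 1 | 1125 | 7 | 0 |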


In [6]:
dc_listings.head()

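| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews | distance |
|---|---|---|---|---|---|---|---|---|---|
| 2645 | 3 | 1.0 | 1.0 | 1.0 | $75.00 | 7 | 180 | 24 | 0 |
| 2825 | 3 | 3.0 | 2.0 | 2.0 | $120.00 | 1 | 1125 | 0 | 0 |
| 2145 | 3 | 1.0 | 2.0 | 2.0 | $90.00 | 1 | 1125 | 55 | 0 |
| 2541 | 3 | 1.0 | 1.0 | 1.0 | $50.00 | 1 | 1125 | 1 | 0 |
| 3349 | 3 | 1.0 | 1.0 | 1.0 | $105.00 | 1 | 1125 | 7 | 0 |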


In [7]:
print(dc_listings['price'].head() )

2645     $75.00
2825    $120.00
2145     $90.00
2541     $50.00
3349    $105.00
Name: price, dtype: object


The problem now is that the prices are strings (dtype object), so they need to be converted to numbers first!

In [8]:
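dc_listings['price'] = dc_listings.price.str.replace(r"[\$,]", "", regex=True).astype(float)
# str.replace() substitutes characters and astype() changes the dtype; the pattern [\$,] matches a dollar sign or a comma, and regex=True is needed because recent pandas versions treat the pattern as a literal string by default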
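
To see what the pattern matches, here is the same clean-up on individual strings using only the standard library. It is a small sketch: `parse_price` is a hypothetical helper for illustration, and the third sample string is made up to show the comma case.

```python
import re

def parse_price(text):
    """Turn a price string such as '$1,250.00' into a float."""
    # [\$,] is a character class matching either a dollar sign or a comma
    return float(re.sub(r"[\$,]", "", text))

cleaned = [parse_price(p) for p in ["$75.00", "$120.00", "$1,250.00"]]
print(cleaned)  # → [75.0, 120.0, 1250.0]
```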

In [9]:
mean_price = dc_listings.price.iloc[:5].mean()
mean_price

88.0

This mean price is our rough estimate of what our place should cost

## Evaluating the model

### Training set and test set 

<img src="7.png" style="width:600px;height:250px;float:left">

In [10]:
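dc_listings = dc_listings.drop('distance',axis=1)  # drop the 'distance' column (axis=1 drops columns, axis=0 drops rows); drop returns a new DataFrame, so the result must be assigned
# note: the rows are still in distance-sorted order here, so this split is not random

train_df = dc_listings.copy().iloc[:2792]  # iloc indexes by row position, loc indexes by index label
test_df = dc_listings.copy().iloc[2792:]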
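
The cut point 2792 is about 75% of the 3723 rows. Since the rows were last sorted by distance (In [5]), a shuffled split may give a more representative test set. The following is a plain-Python sketch of that idea, not what the notebook does; `shuffled_split` and its parameters are hypothetical names.

```python
import random

def shuffled_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of `rows`, then cut it into (train, test) lists."""
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Row positions 0..3722 stand in for the 3723 listings
train_rows, test_rows = shuffled_split(range(3723))
print(len(train_rows), len(test_rows))  # → 2792 931
```

With pandas, the same effect could be had by calling `sample(frac=1, random_state=0)` again right before the `iloc` split.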

### Using a single variable only

In [11]:
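def predict_price(new_listing_value, feature_column):  # new_listing_value: feature value of the sample to predict
    temp_df = train_df.copy()  # work on a copy so the training set itself is not modified
    temp_df['distance'] = np.abs(train_df[feature_column] - new_listing_value)  # absolute difference as the distance
    temp_df = temp_df.sort_values('distance')  # sort by distance; ascending=True is the default
    knn_5 = temp_df.price.iloc[:5]  # prices of the 5 nearest training samples
    predicted_price = knn_5.mean()  # their mean is the prediction
    return predicted_price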

In [12]:
print (test_df.accommodates.head())  # accommodates values of the first five test samples

2850    1
2279    1
2771    5
910     5
2434    5
Name: accommodates, dtype: int64


In [13]:
print (predict_price(1,feature_column='accommodates'))  # predicted price
print (test_df.head(1).price)  # actual price of the first test sample

83.6
2850    40.0
Name: price, dtype: float64


In [14]:
test_df['predicted_price'] = test_df.accommodates.apply(predict_price,feature_column='accommodates')
# Series.apply() has no axis parameter; it passes each element to predict_price and collects the results into a new Series
print (test_df[['predicted_price','price']])

      predicted_price   price
2850             83.6    40.0
2279             83.6    45.0
2771            340.4   217.0
910             340.4   415.0
2434            340.4   275.0
965             340.4   145.0
1305             83.6   100.0
2513             83.6    80.0
2118            340.4   115.0
345             340.4   324.0
725              83.6    50.0
1172             83.6   116.0
1409             83.6    90.0
1943             83.6    49.0
2842            340.4   136.0
2967            340.4   125.0
3295             83.6    50.0
3558             83.6    90.0
1698            340.4   165.0
3151             83.6    95.0
2025            340.4   179.0
754              83.6    38.0
252              83.6   135.0
1767             83.6   120.0
1241             83.6    75.0
1149             83.6   100.0
1931             83.6    50.0
3535             83.6   165.0
3422            340.4   250.0
2969             83.6   100.0
...               ...     ...
542             340.4   340.0
529       

## Measuring the error

Root mean squared error (RMSE)

<img src="RMSE.png" style="width:700px;height:200px;float:left">
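
Written out, the usual definition of RMSE over n test samples is shown below. The symbols are chosen here for illustration: \hat{y}_i is the predicted price and y_i the actual price of sample i.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
```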

### RMSE over the whole test set 

In [15]:
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
rmse = mse ** (1/2)
rmse

212.98927967051543

We now have an evaluation score for the single-variable model
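
To make the calculation concrete, here is a small standard-library sketch that applies it to the first five rows of the predicted_price / price output shown earlier. The helper name `rmse` is chosen for this example; the full test set gives the 212.99 above, so five rows alone will differ.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First five rows of the predicted_price / price output
predicted = [83.6, 83.6, 340.4, 340.4, 340.4]
actual = [40.0, 45.0, 217.0, 415.0, 275.0]
five_row_rmse = rmse(predicted, actual)
print(five_row_rmse)  # roughly 75.4 for these five rows alone
```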

## Do different variables perform differently?

In [16]:
for feature in ['accommodates','bedrooms','bathrooms','number_of_reviews']:
    test_df['predicted_price'] = test_df[feature].apply(predict_price,feature_column=feature)
    test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
    mse = test_df['squared_error'].mean()
    rmse = mse ** (1/2)
    print("RMSE for the {} column: {}".format(feature,rmse))

RMSE for the accommodates column: 212.98927967051543
RMSE for the bedrooms column: 199.80935328065033
RMSE for the bathrooms column: 230.24716705684227
RMSE for the number_of_reviews column: 235.91327066995507


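### A single feature is not reliable, so let's make full use of several features.

# Z-score standardization and normalization of data

## Z-score standardization: rescales each feature so that it has mean μ = 0 and standard deviation σ = 1

## Normalization: rescales each feature into the range [0, 1]

## Z-score standardization

### Standard deviation formula:
<img src="标准差.png" style="width:400px;height:200px;float:left">

### Z-score standardization formula:
<img src="标准化.png" style="width:400px;height:200px;float:left">

### Normalization formula: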
<img src="归一化.png" style="width:500px;height:200px;float:left">
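
Both transformations are short enough to write by hand, which can help when checking what the sklearn scalers return. The sketch below assumes the usual definitions, z = (x − μ) / σ with the population standard deviation, and x' = (x − min) / (max − min). The function names and sample values are made up for illustration.

```python
import statistics

def z_score(values):
    """Standardize: subtract the mean, divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Normalize: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

sample = [1.0, 2.0, 3.0, 4.0, 5.0]
print(min_max(sample))  # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score(sample))  # values centred on 0 with standard deviation 1
```

Unlike min-max normalization, z-scores are not confined to a fixed interval; outliers simply land further from 0.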

In [17]:
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler  # min-max normalization
from sklearn.preprocessing import StandardScaler  # z-score standardization

features = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']

dc_listings = pd.read_csv('listings.csv')

dc_listings = dc_listings[features]

dc_listings['price'] = dc_listings.price.str.replace(r"[\$,]", "", regex=True).astype(float)

dc_listings = dc_listings.dropna()  # drop samples that contain missing values

min_max_scaler = MinMaxScaler()  # instantiate the normalizer
dc_listings[features] = min_max_scaler.fit_transform(dc_listings[features])  # fit_transform learns each column's min and max, then scales the column
# dc_listings[features] = StandardScaler().fit_transform(dc_listings[features])  # for standardization, use sklearn.preprocessing.StandardScaler
# dc_listings[features] = MinMaxScaler().fit_transform(dc_listings[features])  # for normalization, use sklearn.preprocessing.MinMaxScaler

normalized_listings = dc_listings

print(dc_listings.shape)

normalized_listings.head()

(3671, 8)


| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.2 | 0.1 | 0.125 | 0.066667 | 0.053343 | 0.0 | 5.234033e-07 | 0.0 |
| 1 | 0.333333 | 0.3 | 0.375 | 0.133333 | 0.12091 | 0.005587 | 1.350418e-08 | 0.179558 |
| 2 | 0.0 | 0.1 | 0.25 | 0.0 | 0.014225 | 0.005587 | 5.234033e-07 | 0.002762 |
| 3 | 0.066667 | 0.1 | 0.125 | 0.0 | 0.030228 | 0.0 | 5.234033e-07 | 0.0 |
| 4 | 0.2 | 0.1 | 0.125 | 0.0 | 0.014225 | 0.03352 | 5.234033e-07 | 0.0 |


In [18]:
norm_train_df = normalized_listings.copy().iloc[0:2792]
norm_test_df = normalized_listings.copy().iloc[2792:]
#norm_test_df.shape
#dc_listings['price'].iloc[2792:].shape

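### Computing distance with multiple variables

<img src="多变量计算3.png" style="width:600px;height:220px;float:left">

### With two features, the distance is simply the distance between two points in the plane
<img src="多变量计算2.png" style="width:500px;height:120px;float:left">

SciPy already provides ready-made tools for computing distances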

In [19]:
from scipy.spatial import distance

first_listing = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth_listing = normalized_listings.iloc[20][['accommodates', 'bathrooms']]  # note: iloc[20] is the 21st row, despite the variable name
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
# distance.euclidean computes the distance between exactly two points, each of which may be n-dimensional, and returns a single number
first_fifth_distance

0.32015621187164245
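
As a cross-check of what a two-feature Euclidean distance means, the sketch below computes it by hand for rows 0 and 1 of the normalized table shown earlier and compares the result with the standard library's `math.dist`. This is a different pair of rows from the cell above, so the number differs.

```python
import math

# (accommodates, bathrooms) for rows 0 and 1 of the normalized table
row_0 = (0.2, 0.125)
row_1 = (0.333333, 0.375)

# Written out: square the per-feature differences, sum them, take the root
manual = math.sqrt((row_0[0] - row_1[0]) ** 2 + (row_0[1] - row_1[1]) ** 2)
# math.dist (Python 3.8+) does the same for points of any dimension
builtin = math.dist(row_0, row_1)
print(manual, builtin)  # both roughly 0.283
```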

## Multivariate KNN model

In [20]:
def predict_price_multivariate(new_listing_value, feature_columns):
    temp_df = norm_train_df.copy()  # work on a copy so the training set itself is not modified
    temp_df['distance'] = distance.cdist(temp_df[feature_columns].values, [new_listing_value[feature_columns].values]).ravel()
    # distance.cdist() computes pairwise distances between two sets of points; wrapping the new sample in a list makes it a 2-D input with one row, and ravel() flattens the (n, 1) result into one column
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return predicted_price

In [21]:
feature_columns = ['accommodates', 'bathrooms']
new_listing_value = norm_test_df.head(1)[feature_columns]
temp_df = norm_train_df
distance.cdist(temp_df[feature_columns].values,new_listing_value[feature_columns].values)

array([[0.2       ],
       [0.41666667],
       [0.125     ],
       ...,
       [0.44176493],
       [0.06666667],
       [0.06666667]])

In [22]:
norm_test_df.head(1)[feature_columns]

| | accommodates | bathrooms |
|---|---|---|
| 2839 | 0.0 | 0.125 |


In [23]:
cols = ['accommodates', 'bathrooms']
predicted_price_head1 = norm_test_df.head(1)[cols].apply(predict_price_multivariate,feature_columns=cols,axis=1)
print(type(predicted_price_head1))
# axis=1 passes each row (one sample) to the function
predicted_price_head1

<class 'pandas.core.series.Series'>


2839    0.026885
dtype: float64

In [24]:
print(distance.cdist(norm_test_df.head()[['accommodates', 'bathrooms']],norm_test_df.tail(3)[['accommodates', 'bathrooms']]))
distance.euclidean(norm_test_df[['accommodates', 'bathrooms']].iloc[0], norm_test_df[['accommodates', 'bathrooms']].iloc[-4])

[[0.33333333 0.06666667 0.13333333]
 [0.26666667 0.         0.06666667]
 [0.13333333 0.13333333 0.06666667]
 [0.14166667 0.23584953 0.18276427]
 [0.26666667 0.         0.06666667]]


0.14166666666666666

In [25]:
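data = pd.read_csv('listings.csv')
data = data[features]
data['price'] = data.price.str.replace(r"[\$,]", "", regex=True).astype(float)
data = data.dropna()
data1 = data['price'].values.reshape(-1,1)  # take the values out of the DataFrame as a NumPy array and reshape it into a single column
print (data1)
# to undo the normalization of price, instantiate a fresh MinMaxScaler(),
# fit it on the raw price column alone, then invert using that fit
price_scaler = MinMaxScaler()
norm_data = price_scaler.fit_transform(data1)  
origin_data = price_scaler.inverse_transform(predicted_price_head1.values.reshape(-1,1))
# inverse_transform() undoes the scaling; the array passed in must have exactly one column here
print('origin predicted price is ',origin_data.ravel())
# caution: test_df comes from the earlier distance-sorted split, so its first row (index 2850) is not the listing predicted above (index 2839); data['price'].iloc[2792] would be the matching real price
print('real price is ',test_df['price'].head(1).values)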

[[160.]
 [350.]
 [ 50.]
 ...
 [275.]
 [179.]
 [110.]]
origin predicted price is  [85.6]
real price is  [40.]
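
Min-max scaling maps x to (x − min) / (max − min), so undoing it is x' · (max − min) + min. The sketch below shows that round trip in plain Python. The price range of 10 to 2822 is an assumption: it is not printed anywhere in the notebook, but was inferred from matching raw and scaled prices in the tables (160 ↔ 0.053343, 350 ↔ 0.12091). The function names are hypothetical.

```python
def min_max_scale(x, lo, hi):
    """Forward transform: map [lo, hi] onto [0, 1]."""
    return (x - lo) / (hi - lo)

def min_max_inverse(x_scaled, lo, hi):
    """Inverse transform: map [0, 1] back onto [lo, hi]."""
    return x_scaled * (hi - lo) + lo

# Assumed price range, inferred from the tables rather than read from the data
PRICE_MIN, PRICE_MAX = 10.0, 2822.0

scaled_160 = round(min_max_scale(160.0, PRICE_MIN, PRICE_MAX), 6)
restored = round(min_max_inverse(0.026885, PRICE_MIN, PRICE_MAX), 1)
print(scaled_160)  # → 0.053343
print(restored)    # → 85.6
```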


In [26]:
cols = ['accommodates', 'bathrooms']
norm_test_df['predicted_price'] = norm_test_df[cols].apply(predict_price_multivariate,feature_columns=cols,axis=1)    
norm_test_df['squared_error'] = (norm_test_df['predicted_price'] - norm_test_df['price'])**(2)
mse = norm_test_df['squared_error'].mean()
rmse = mse ** (1/2)
print(rmse)

0.03998816336418803
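
This RMSE is in normalized units, which is why it looks so small next to the single-variable results. Because un-scaling stretches every error by (max − min) while the shift by min cancels out, the RMSE in dollars is the normalized RMSE times the price range. The range used below (10 to 2822) is an assumption inferred from the raw and scaled prices shown in the tables, not a value the notebook states.

```python
# Assumed price range, inferred from the tables rather than read from the data
PRICE_MIN, PRICE_MAX = 10.0, 2822.0

normalized_rmse = 0.03998816336418803  # value printed by the cell above
# Each error (prediction - price) is multiplied by (max - min) when un-scaled
dollar_rmse = normalized_rmse * (PRICE_MAX - PRICE_MIN)
print(round(dollar_rmse, 2))  # roughly 112.45
```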


In [27]:
predicted_price_values = norm_test_df['predicted_price'].values.reshape(-1,1)
scaler = MinMaxScaler()
norm_data = scaler.fit_transform(data1)  
norm_test_df['origin_predicted_price'] = scaler.inverse_transform(predicted_price_values)
norm_test_df['origin_predicted_price'].head()

2839     85.6
2840    115.8
2841    125.0
2842    366.2
2843    115.8
Name: origin_predicted_price, dtype: float64

## KNN with Sklearn

In [28]:
from sklearn.neighbors import KNeighborsRegressor
# 1. import the KNN regressor (the classifier is KNeighborsClassifier)
cols = ['accommodates','bedrooms']
knn = KNeighborsRegressor(n_neighbors=5)  # n_neighbors defaults to 5: use the 5 nearest samples
# 2. instantiate the KNN regressor and set its parameters
knn.fit(norm_train_df[cols], norm_train_df['price'])  # pass the training features and the training labels
# 3. .fit trains the model
two_features_predictions = knn.predict(norm_test_df[cols])
# 4. .predict returns predictions for the test-set features passed in
#print(two_features_predictions)

In [29]:
from sklearn.metrics import mean_squared_error

two_features_mse = mean_squared_error(norm_test_df['price'], two_features_predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_rmse)

0.04193612857354859


In [30]:
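# caution: dc_listings['price'] was already min-max scaled in In [17], so this scaler is fit on values in [0, 1] and inverse_transform leaves the predictions unchanged; fit it on the raw prices (data1 from In [25]) to recover dollar values
minmax_scaler=MinMaxScaler()
minmax_price_values=minmax_scaler.fit_transform(dc_listings.price.values.reshape(-1,1))
minmax_price_r_values=minmax_scaler.inverse_transform(two_features_predictions.reshape(-1,1))
dc_listings.price.values.reshape(-1,1)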

array([[0.05334282],
       [0.12091038],
       [0.01422475],
       ...,
       [0.09423898],
       [0.06009957],
       [0.03556188]])

Adding more features

In [31]:
knn = KNeighborsRegressor(n_neighbors=5)

cols = ['accommodates','bedrooms','bathrooms','beds','minimum_nights','maximum_nights','number_of_reviews']

knn.fit(norm_train_df[cols], norm_train_df['price'])
seven_features_predictions = knn.predict(norm_test_df[cols])

In [32]:
seven_features_mse = mean_squared_error(norm_test_df['price'], seven_features_predictions)
seven_features_rmse = seven_features_mse ** (1/2)
print(seven_features_rmse)

0.041388747758587266


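## Exercises:
## 1. Undo the normalization on the predictions from the Sklearn KNN model and compute the RMSE in the original units
## 2. Change the n_neighbors parameter and the choice of features; does the model improve?
## 3. Preprocess the data with standardization instead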