# **[Python] Data Science Competition: Rental Price Prediction**
### Author: 星少
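To carry out the spirit of President Xi Jinping's remarks in the report to the 19th National Congress on "promoting the deep integration of the internet, big data, and artificial intelligence with the real economy" and "making good use of internet technology and information tools in our work", and to guide university students to learn computer and internet knowledge and improve their applied computing skills, the Training Center of the China Software Industry Association will hold the National College Student Computer Skills Application Competition. The competition aims to strengthen the applied IT skills of university students. It supports universities' goal of training application-oriented talent, fostering innovation and entrepreneurship, and promoting the integration of industry, academia, and research.

Today, rent is determined jointly by many factors: decoration, location, layout, transport convenience, market supply and demand, and more. In the relatively traditional rental industry, severe information asymmetry has long existed. On the one hand, landlords do not know the true market price and end up leaving overpriced homes vacant. On the other hand, tenants cannot find good-value homes that meet their needs. The result is a great waste of rental resources.

The big data task in this competition targets this pain point of the rental market and provides anonymized real rental market data. Participants use historical data labeled with monthly rent to build a model that predicts monthly rent from basic housing information, giving the city's rental market an objective benchmark.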
![Image Name](https://cdn.kesci.com/upload/image/pjwsoi2vnr.png?imageView2/0/w/960/h/960)



In [1]:
# Import packages
import numpy as np
import pandas as pd

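## Data loading

- `time`: time the listing information was collected
- `apartment_name`: residential complex the home belongs to (anonymized)
- `apartment_rent_num`: number of homes for rent in the complex (relative order preserved)
- `floor`: floor level, high / middle / low (anonymized)
- `total_floor`: total number of floors in the building (anonymized)
- `house_area`: floor area (anonymized) (*)
- `house_towards`: orientation of the home
- `house_state`: occupancy status, for rent or currently occupied (anonymized) (*)
- `bedrooms`: number of bedrooms (*)
- `livingrooms`: number of living rooms
- `bathrooms`: number of bathrooms
- `rent_method`: rental type, indicating whether the whole unit is rented (*)
- `district`: district-level administrative unit of the home
- `bs_region`: business district where the complex is located (*)
- `sub_route`: subway line (*)
- `sub_stop`: nearest subway station (*)
- `sub_distance`: distance to the nearest subway station (*)
- `decoration_state`: decoration grade of the home
- `monthly_rent`: monthly rent, the label (anonymized)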

In [2]:
# To avoid encoding problems later, rename all columns to English
columns = [
    'time',
    'apartment_name',
    'apartment_rent_num',
    'floor',
    'total_floor',
    'house_area',
    'house_towards',
    'house_state',
    'bedrooms',
    'livingrooms',
    'bathrooms',
    'rent_method',
    'district',
    'bs_region',
    'sub_route',
    'sub_stop',
    'sub_distance',
    'decoration_state',
    'monthly_rent']
columns_test = [
    'id',
    'time',
    'apartment_name',
    'apartment_rent_num',
    'floor',
    'total_floor',
    'house_area',
    'house_towards',
    'house_state',
    'bedrooms',
    'livingrooms',
    'bathrooms',
    'rent_method',
    'district',
    'bs_region',
    'sub_route',
    'sub_stop',
    'sub_distance',
    'decoration_state']
# Use pandas read_csv() to load train.csv and test.csv into DataFrames
train = pd.read_csv(
    '/home/kesci/input/ABC7410/train.csv',
    header=0,
    names=columns)
test = pd.read_csv(
    '/home/kesci/input/ABC7410/test.csv',
    header=0,
    names=columns_test)
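# X_train and X_test are aliases of train and test (no copy is made),
# so pop() below also removes the column from train / test.
X_train = train
X_test = test
y_train = X_train.pop('monthly_rent')
test_id = X_test.pop('id')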

* Basic data exploration

In [3]:
# Take a quick look at the data
print(X_train.head(5))
X_train.sample(5)
X_test.head(5)

   time  apartment_name  apartment_rent_num  floor  total_floor  house_area  \
0     1            3072            0.128906      2     0.236364    0.008628   
1     1            3152            0.132812      1     0.381818    0.017046   
2     1            5575            0.042969      0     0.290909    0.010593   
3     1            3103            0.085938      2     0.581818    0.019199   
4     1            5182            0.214844      0     0.545455    0.010427   

  house_towards  house_state  bedrooms  livingrooms  bathrooms  rent_method  \
0            东南          NaN         1            1          1          NaN   
1             东          NaN         1            0          0          NaN   
2            东南          NaN         2            1          2          NaN   
3             南          NaN         3            2          2          NaN   
4            东北          NaN         2            1          1          NaN   

   district  bs_region  sub_route  sub_stop  sub_d

Unnamed: 0,time,apartment_name,apartment_rent_num,floor,total_floor,house_area,house_towards,house_state,bedrooms,livingrooms,bathrooms,rent_method,district,bs_region,sub_route,sub_stop,sub_distance,decoration_state
0,4,6011,0.382812,1,0.6,0.007117,东,3.0,2,1,1,1.0,10.0,5.0,,,,6.0
1,4,1697,0.152344,1,0.472727,0.007448,东,,2,1,1,,3.0,0.0,,,,
2,4,754,0.207031,2,0.709091,0.014068,东南,,3,2,2,,10.0,9.0,4.0,74.0,0.400833,
3,4,1285,0.011719,0,0.090909,0.008937,南,,2,1,1,,6.0,96.0,5.0,17.0,0.384167,
4,4,4984,0.035156,1,0.218182,0.008606,东南,,2,1,1,,6.0,61.0,3.0,114.0,0.598333,


In [4]:
# Check the dimensions of the data
print(X_train.shape)
print(X_test.shape)

(196539, 18)
(56279, 18)


In [5]:
# The train and test sets are usually merged for preprocessing and exploratory analysis; they can be split again by row position later
frame = [X_train,X_test]
data = pd.concat(frame,axis=0)
print(data.shape)

(252818, 18)
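
The merge-then-split pattern can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's own code: `toy_train`, `toy_test`, and their values are made up, and `n_train` is a hypothetical helper variable.

```python
import pandas as pd

# Toy stand-ins for X_train and X_test (values are made up for illustration)
toy_train = pd.DataFrame({"bedrooms": [1, 2, 3], "house_area": [0.01, 0.02, 0.03]})
toy_test = pd.DataFrame({"bedrooms": [2, 1], "house_area": [0.015, 0.012]})

n_train = len(toy_train)  # remember where the training rows end

# Stack train on top of test; row order is preserved
combined = pd.concat([toy_train, toy_test], axis=0)

# ... shared preprocessing on `combined` would go here ...

# A positional split restores the two parts
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
print(train_part.shape, test_part.shape)
```

Note that `pd.concat` keeps the original index labels, so the combined index contains duplicates. Positional slicing with `iloc` is unaffected by that, which is why the split is done by row count.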


In [6]:
# Summary statistics of the data
data.describe()

Unnamed: 0,time,apartment_name,apartment_rent_num,floor,total_floor,house_area,house_state,bedrooms,livingrooms,bathrooms,rent_method,district,bs_region,sub_route,sub_stop,sub_distance,decoration_state
count,252818.0,252818.0,251795.0,252818.0,252818.0,252818.0,24621.0,252818.0,252818.0,252818.0,29201.0,252777.0,252777.0,118272.0,118272.0,118272.0,22699.0
mean,2.534792,3225.269419,0.123107,0.956854,0.41024,0.013162,2.727834,2.245422,1.297249,1.22602,0.893839,7.935825,68.189677,3.287346,57.308602,0.55035,3.595973
std,1.047004,2027.332289,0.132124,0.8511,0.182808,0.0077,0.665599,0.897825,0.611892,0.489941,0.308049,4.032691,43.666676,1.483973,35.226873,0.248115,1.995365
min,1.0,0.0,0.007812,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.001667,1.0
25%,2.0,1385.0,0.039062,0.0,0.290909,0.009268,3.0,2.0,1.0,1.0,1.0,4.0,33.0,2.0,23.0,0.356667,2.0
50%,3.0,3086.0,0.082031,1.0,0.436364,0.01291,3.0,2.0,1.0,1.0,1.0,9.0,61.0,4.0,59.0,0.554167,2.0
75%,3.0,5199.0,0.15625,2.0,0.563636,0.014992,3.0,3.0,2.0,1.0,1.0,11.0,104.0,5.0,87.0,0.746667,6.0
max,4.0,6627.0,1.0,2.0,1.0,1.0,3.0,11.0,8.0,8.0,1.0,14.0,152.0,5.0,119.0,1.0,6.0


In [7]:
# Check missing values
print('Missing values in the training set:\n',X_train.isnull().sum().sort_values(ascending = False))
print('Missing values in the test set:\n',X_test.isnull().sum().sort_values(ascending = False))
print('Missing values in the full dataset:\n',data.isnull().sum().sort_values(ascending = False))

Missing values in the training set:
 decoration_state      178047
house_state           176401
rent_method           172309
sub_stop              104761
sub_route             104761
sub_distance          104761
apartment_rent_num      1001
bs_region                 31
district                  31
total_floor                0
apartment_name             0
floor                      0
bedrooms                   0
house_area                 0
house_towards              0
livingrooms                0
bathrooms                  0
time                       0
dtype: int64
Missing values in the test set:
 decoration_state      52072
house_state           51796
rent_method           51308
sub_stop              29785
sub_route             29785
sub_distance          29785
apartment_rent_num       22
bs_region                10
district                 10
total_floor               0
apartment_name            0
floor                     0
bedrooms                  0
house_area                0
house_towards             0
livingrooms              
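
Raw counts are easier to judge as shares of the total. A minimal sketch on made-up toy data (the frame `toy` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "decoration_state": [np.nan, np.nan, np.nan, 6.0],
    "sub_distance": [0.4, np.nan, 0.6, 0.5],
    "bedrooms": [1, 2, 3, 2],
})

# isnull() yields booleans, so the column mean is the fraction of missing rows
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the training output above, `decoration_state` is missing in 178047 of 196539 rows, roughly 91%, which suggests the column carries little information on its own.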

# Feature engineering

* Outlier handling

In [None]:
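# Optional outlier filters, left disabled. Column names updated to the English names used above;
# the filters on monthly_rent need the label column to be present in train.
#train.drop(train[train['house_area']>0.2].index,inplace=True)
#train.drop(train[(train['decoration_state']==2)&(train['monthly_rent']>80)].index,inplace=True)
#train.drop(train[train['bedrooms']>10].index,inplace=True)
#train.drop(train[train['livingrooms']>7].index,inplace=True)
#train.drop(train[(train['bathrooms']>5)&(train['monthly_rent']>60)].index,inplace=True)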

* Missing-value handling

In [8]:
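# From domain knowledge: homes in the same complex share the number of listings, district, business district, nearest subway station, and distance to that station
# So group by apartment_name and fill each feature's missing values with the median of its group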
data['apartment_rent_num'] = data.groupby('apartment_name')[
    'apartment_rent_num'].transform(lambda x: x.fillna(x.median()))
data['district'] = data.groupby('apartment_name')[
    'district'].transform(lambda x: x.fillna(x.median()))
data['bs_region'] = data.groupby('apartment_name')[
    'bs_region'].transform(lambda x: x.fillna(x.median()))
data['sub_stop'] = data.groupby('apartment_name')[
    'sub_stop'].transform(lambda x: x.fillna(x.median()))
data['sub_distance'] = data.groupby('apartment_name')[
    'sub_distance'].transform(lambda x: x.fillna(x.median()))
data['sub_route'] = data.groupby('apartment_name')[
    'sub_route'].transform(lambda x: x.fillna(x.median()))


In [9]:
print('Missing values in the full dataset:\n',data.isnull().sum().sort_values(ascending = False))

Missing values in the full dataset:
 decoration_state      230119
house_state           228197
rent_method           223617
sub_stop              115341
sub_route             115341
sub_distance          115341
bs_region                 41
district                  41
apartment_rent_num        10
total_floor                0
apartment_name             0
floor                      0
bedrooms                   0
house_area                 0
house_towards              0
livingrooms                0
bathrooms                  0
time                       0
dtype: int64
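
The group-wise fill repeats the same pattern for each column, so it could also be written as a small helper. The sketch below is an assumption-laden illustration, not the notebook's code: `fill_by_group_median` is a hypothetical function name and `toy` holds made-up values.

```python
import numpy as np
import pandas as pd

def fill_by_group_median(df, group_col, target_cols):
    """Return a copy of df where NaNs in each target column are
    replaced by the median of that column within the row's group."""
    out = df.copy()
    for col in target_cols:
        # transform("median") returns one value per row, aligned with out
        group_median = out.groupby(group_col)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out

# Toy data: complex 1 has two known distances, complex 2 has none
toy = pd.DataFrame({
    "apartment_name": [1, 1, 1, 2, 2],
    "sub_distance": [0.4, np.nan, 0.6, np.nan, np.nan],
})
filled = fill_by_group_median(toy, "apartment_name", ["sub_distance"])
print(filled)
```

In the toy data, the gap in complex 1 is filled from its neighbours, while complex 2 stays empty because a group with no observed values has no median. That is consistent with the output above, where the subway columns remain largely unfilled after grouping by `apartment_name`.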


In [10]:
# bs_region has very few missing values, so fill them with the mode
data['bs_region'] = data['bs_region'].fillna(data['bs_region'].mode()[0])

In [11]:
data['apartment_rent_num'] = data.groupby(
    'bs_region')['apartment_rent_num'].transform(lambda x: x.fillna(x.median()))
data['district'] = data.groupby('bs_region')['district'].transform(
    lambda x: x.fillna(x.median()))
#data['bs_region'] = data.groupby('bs_region')['bs_region'].transform(lambda x: x.fillna(x.median()))
data['sub_stop'] = data.groupby('bs_region')['sub_stop'].transform(
    lambda x: x.fillna(x.median()))
data['sub_distance'] = data.groupby(
    'bs_region')['sub_distance'].transform(lambda x: x.fillna(x.median()))
data['sub_route'] = data.groupby('bs_region')['sub_route'].transform(
    lambda x: x.fillna(x.median()))

In [12]:
# The missing values in sub_stop and sub_distance have dropped sharply
print('Missing values in the full dataset:\n',data.isnull().sum().sort_values(ascending = False))

Missing values in the full dataset:
 decoration_state      230119
house_state           228197
rent_method           223617
sub_stop               25117
sub_route              25117
sub_distance           25117
house_area                 0
apartment_name             0
apartment_rent_num         0
floor                      0
total_floor                0
bedrooms                   0
house_towards              0
livingrooms                0
bathrooms                  0
district                   0
bs_region                  0
time                       0
dtype: int64


* Feature construction

In [13]:
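# Feature construction: the mean monthly rent of each complex ('ave'), computed from the training labels.
# Complexes that do not appear in the training set get the median 'ave' of their business district.
# This feature should correlate strongly with the target, which a correlation coefficient can confirm.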
xiaoqu = pd.DataFrame()

train['monthly_rent'] = y_train 
for i in train['apartment_name'].unique():

    tem = pd.DataFrame([train.loc[train['apartment_name'] == i]['monthly_rent'].mean(), i]).T

    frame = [xiaoqu, tem]
    xiaoqu = pd.concat(frame, axis=0)
xiaoqu.index = xiaoqu.iloc[:, 1]
for i in xiaoqu.iloc[:, 1]:
    data.loc[data['apartment_name'] == i, 'ave'] = xiaoqu.loc[xiaoqu.index == i].iloc[0, 0]
data['ave'] = data.groupby('bs_region')['ave'].transform(
    lambda x: x.fillna(x.median()))
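
The loops above can be slow with several thousand complexes. A vectorized alternative, sketched here on made-up toy data, uses `groupby(...).mean()` plus `Series.map`. This is not the author's code; `toy_train`, `toy_all`, `complex_mean`, and `region_median` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy training rows with labels (values are made up)
toy_train = pd.DataFrame({
    "apartment_name": [10, 10, 20],
    "monthly_rent": [4.0, 6.0, 9.0],
})
# Toy combined train + test rows; complex 30 never appears in training
toy_all = pd.DataFrame({
    "apartment_name": [10, 10, 20, 30],
    "bs_region": [1, 1, 2, 2],
})

# Mean rent per complex, indexed by complex id
complex_mean = toy_train.groupby("apartment_name")["monthly_rent"].mean()

# Look up each row's complex mean; unseen complexes become NaN
toy_all["ave"] = toy_all["apartment_name"].map(complex_mean)

# Fall back to the median 'ave' of the row's business district
region_median = toy_all.groupby("bs_region")["ave"].transform("median")
toy_all["ave"] = toy_all["ave"].fillna(region_median)
print(toy_all)
```

One caveat worth keeping in mind: each training row's own label contributes to its complex mean, so validation scores based on this feature may look optimistic. Computing the means out of fold is a common way to reduce that effect.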

In [14]:
data.shape

(252818, 19)

In [16]:
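# LabelEncoder assigns integer codes to non-contiguous numbers or text; here it turns the house orientation into discrete codes
# Continuous variables in tree-based models can lead to overfitting (the author suggests checking this yourself), so feed categorical variables to LightGBM for the regression where possible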
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(data['house_towards'])
data['house_towards'] = le.transform(data['house_towards'])
data['house_towards'] = data['house_towards'].astype('category')

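When categorical values are represented by simple integer codes, model training can run into a problem: the numeric values themselves may influence the result. For example, decoration levels 1, 2, and 3 have no real ordering, yet during training the different codes can change the weight the same feature carries across samples. If a level were encoded as 1000, would it affect the model more than if it were encoded as 1? To keep the representation of categories from hurting training, one-hot encoding is introduced for categorical features.
* **One-hot encoding, or the dummy variables (dummies) below, would normally be used here.** Given the chosen model, we can handle this directly with astype instead.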

# Disabled one-hot alternative (column names updated to the English names used above)
#living_standard=pd.get_dummies(data['house_state'])
#forward = pd.get_dummies(data['house_towards'])
#qu = pd.get_dummies(data['district'])
#location = pd.get_dummies(data['bs_region'])
#decor = pd.get_dummies(data['decoration_state'])
#ways = pd.get_dummies(data['rent_method'])
#subway = pd.get_dummies(data['sub_stop'])

#new_data = data.drop(['sub_stop','apartment_name','house_towards','district','bs_region','house_state','decoration_state','rent_method'],axis=1)
#frame = [new_data,forward,qu,location,subway,living_standard,decor,ways]
#new_data = pd.concat(frame,axis=1)
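
To make the difference between the two representations concrete, here is a minimal sketch on a made-up column (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy column with three decoration levels (values are made up)
toy = pd.DataFrame({"decoration_state": [1, 2, 3, 2]})

# One-hot: one indicator column per level, exactly one of them set per row
one_hot = pd.get_dummies(toy["decoration_state"], prefix="decoration_state")

# Category dtype: still a single column, with levels stored as unordered categories
as_category = toy["decoration_state"].astype("category")

print(one_hot.columns.tolist())
print(as_category.cat.categories.tolist(), as_category.cat.ordered)
```

One-hot encoding widens the table by one column per level, which adds up quickly for features such as `sub_stop` or `bs_region` with many levels. The category dtype keeps one column and marks the values as unordered labels.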

In [17]:
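# LightGBM was chosen for the regression from the start. It can recognize categorical (category-dtype) columns automatically and applies its own encoding, which is more efficient than one-hot encoding (the author suggests checking this yourself)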
data['district'] = data['district'].astype('category')
data['bs_region'] = data['bs_region'].astype('category')
data['decoration_state'] = data['decoration_state'].astype('category')
data['rent_method'] = data['rent_method'].astype('category')
data['sub_stop'] = data['sub_stop'].astype('category')
data['house_towards'] = data['house_towards'].astype('category')
data['floor'] = data['floor'].astype('category')
data['time'] = data['time'].astype('category')
data['apartment_name'] = data['apartment_name'].astype('category')
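
The per-column conversions can also be expressed in one call by passing a column-to-dtype mapping to `DataFrame.astype`. A minimal sketch on made-up data (`toy` and `categorical_cols` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "district": [3.0, 10.0, 3.0],
    "floor": [0, 1, 2],
    "house_area": [0.01, 0.02, 0.03],
})

# Columns to treat as categorical; everything else keeps its dtype
categorical_cols = ["district", "floor"]
toy = toy.astype({col: "category" for col in categorical_cols})
print(toy.dtypes)
```

Keeping the column list in one place makes it easier to see which features are handed to the model as categorical and to adjust that choice later.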

In [20]:
X_train.shape

(196539, 19)

In [22]:
X_train = data[:X_train.shape[0]]
X_test = data[X_train.shape[0]:]
