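# Project goal

- Describe the characteristics of Beijing's second-hand housing market and identify the factors that affect second-hand home prices in Beijing.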

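# Data description
- Direction: orientation of the home
- District: neighborhood where the home is located
- Elevator: whether the building has an elevator
- Floor: floor the home is on
- Garden: residential complex the home belongs to
- Id: listing ID
- Layout: room layout
- Price: price
- Region: administrative district
- Renovation: renovation status
- Size: floor area
- Year: year built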

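# Analysis approach
- Examine the distribution of each attribute, then group by that attribute and compare against price to infer how it relates to second-hand home prices. On that basis, pick the attributes most strongly correlated with price as model inputs and fit a model.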
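
As a rough sketch of the "find the attributes most correlated with price" step, the snippet below ranks numeric attributes by the absolute value of their Pearson correlation with Price. The `sample` frame is a stand-in built from the five rows that `df.head()` prints in this notebook, and `rank_by_price_corr` is a hypothetical helper, not part of the original analysis. With only five rows the numbers are purely illustrative.

```python
import pandas as pd

# Stand-in data: the numeric columns of the five rows shown by df.head()
sample = pd.DataFrame({
    "Floor": [6, 6, 16, 7, 19],
    "Size": [75.0, 60.0, 210.0, 39.0, 90.0],
    "Year": [1988, 1988, 1996, 2004, 2010],
    "Price": [780.0, 705.0, 1400.0, 420.0, 998.0],
})

def rank_by_price_corr(frame, target="Price"):
    """Hypothetical helper: numeric columns ordered by |Pearson r| with target."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranking = rank_by_price_corr(sample)
print(ranking)
```

Correlation only captures linear relationships between numeric columns; categorical attributes such as Region or Renovation still need the group-and-compare treatment described above.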

# 0. Import packages

In [1]:
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures  # feature-construction module
from sklearn.tree import DecisionTreeRegressor
from wordcloud import WordCloud
from pyecharts import options as opts
# Note: the next import rebinds the name WordCloud to the pyecharts chart class,
# shadowing the wordcloud.WordCloud imported above.
from pyecharts.charts import Page,WordCloud
from pyecharts.globals import SymbolType
import warnings
warnings.filterwarnings('ignore')  # silence all warnings

# 1. Load the data

In [2]:
df=pd.read_csv('./lianjia.csv')
df.head()

,Direction,District,Elevator,Floor,Garden,Id,Layout,Price,Region,Renovation,Size,Year
0,东西,灯市口,,6,锡拉胡同21号院,101102647043,3室1厅,780.0,东城,精装,75.0,1988
1,南北,东单,无电梯,6,东华门大街,101102650978,2室1厅,705.0,东城,精装,60.0,1988
2,南西,崇文门,有电梯,16,新世界中心,101102672743,3室1厅,1400.0,东城,其他,210.0,1996
3,南,崇文门,,7,兴隆都市馨园,101102577410,1室1厅,420.0,东城,精装,39.0,2004
4,南,陶然亭,有电梯,19,中海紫御公馆,101102574696,2室2厅,998.0,东城,精装,90.0,2010


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23677 entries, 0 to 23676
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Direction   23677 non-null  object 
 1   District    23677 non-null  object 
 2   Elevator    15440 non-null  object 
 3   Floor       23677 non-null  int64  
 4   Garden      23677 non-null  object 
 5   Id          23677 non-null  int64  
 6   Layout      23677 non-null  object 
 7   Price       23677 non-null  float64
 8   Region      23677 non-null  object 
 9   Renovation  23677 non-null  object 
 10  Size        23677 non-null  float64
 11  Year        23677 non-null  int64  
dtypes: float64(2), int64(3), object(7)
memory usage: 2.2+ MB
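
The `df.info()` output above lists 15440 non-null values for Elevator out of 23677 rows, so 8237 entries are missing there while every other column is complete. A per-column count makes this explicit. The sketch below is a minimal illustration on a stand-in frame (three columns of the five `df.head()` rows), not the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in data: three columns from the five rows shown by df.head()
sample = pd.DataFrame({
    "Elevator": [np.nan, "无电梯", "有电梯", np.nan, "有电梯"],
    "Floor": [6, 6, 16, 7, 19],
    "Price": [780.0, 705.0, 1400.0, 420.0, 998.0],
})

missing = sample.isnull().sum()         # number of missing values per column
missing_share = sample.isnull().mean()  # fraction of rows missing per column
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```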


In [4]:
from scipy import stats
stats.mode(df['Floor'])

ModeResult(mode=array([6]), count=array([7662]))

In [6]:
df['Floor'].mode()

0    6
dtype: int64

### Handling duplicate rows

In [9]:
df.drop_duplicates(inplace=True)
df.reset_index(drop=True,inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22468 entries, 0 to 22467
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Direction   22468 non-null  object 
 1   District    22468 non-null  object 
 2   Elevator    14569 non-null  object 
 3   Floor       22468 non-null  int64  
 4   Garden      22468 non-null  object 
 5   Id          22468 non-null  int64  
 6   Layout      22468 non-null  object 
 7   Price       22468 non-null  float64
 8   Region      22468 non-null  object 
 9   Renovation  22468 non-null  object 
 10  Size        22468 non-null  float64
 11  Year        22468 non-null  int64  
dtypes: float64(2), int64(3), object(7)
memory usage: 2.1+ MB
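
Comparing the two `df.info()` outputs, the row count falls from 23677 to 22468, so `drop_duplicates` removed 1209 rows. Counting duplicates before dropping them is a cheap sanity check. The sketch below shows the idea on a made-up three-row frame (the `toy` data is an assumption for illustration only).

```python
import pandas as pd

# Made-up data: the first two rows are exact duplicates
toy = pd.DataFrame({
    "Garden": ["A", "A", "B"],
    "Price": [780.0, 780.0, 705.0],
    "Size": [75.0, 75.0, 60.0],
})

n_dupes = int(toy.duplicated().sum())  # rows that repeat an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

# The number of rows removed should equal the number of flagged duplicates
assert len(toy) - len(deduped) == n_dupes
print(n_dupes, len(deduped))
```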


In [10]:
df.describe()

,Floor,Id,Price,Size,Year
count,22468.0,22468.0,22468.0,22468.0,22468.0
mean,12.719512,101102400000.0,608.365849,99.204736,2001.372085
std,7.625692,568261.3,412.399905,51.465747,9.04065
min,1.0,101088600000.0,60.0,2.0,1950.0
25%,6.0,101102200000.0,360.0,66.0,1997.0
50%,11.0,101102500000.0,495.0,88.0,2003.0
75%,18.0,101102700000.0,710.0,118.0,2008.0
max,57.0,101102800000.0,6000.0,1019.0,2017.0


### Add a new feature: price per unit area

In [12]:
new_data=df.copy()
new_data['Perprice']=new_data['Price']/new_data['Size']
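
Perprice is Price divided by Size, so it is expressed in the unit of Price per unit of Size. The sketch below recomputes it for the five rows printed by `df.head()` and flags implausibly small areas, since `df.describe()` reports a minimum Size of 2.0 and such rows would inflate Perprice. The cutoff `SIZE_FLOOR` is an arbitrary assumption for illustration, not a value from the original analysis.

```python
import pandas as pd

# Price and Size of the five rows shown by df.head()
head = pd.DataFrame({
    "Price": [780.0, 705.0, 1400.0, 420.0, 998.0],
    "Size": [75.0, 60.0, 210.0, 39.0, 90.0],
})
head["Perprice"] = head["Price"] / head["Size"]
print(head.round(6))

# Hypothetical cutoff: treat anything below this area as suspicious
SIZE_FLOOR = 10.0
suspicious = head[head["Size"] < SIZE_FLOOR]
print(len(suspicious), "rows below the assumed size cutoff")
```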

### Reorder the columns

In [13]:
new_data.columns

Index(['Direction', 'District', 'Elevator', 'Floor', 'Garden', 'Id', 'Layout',
       'Price', 'Region', 'Renovation', 'Size', 'Year', 'Perprice'],
      dtype='object')

In [14]:
columns=['Region','District','Garden','Layout','Floor','Year','Size','Elevator','Direction','Renovation','Perprice','Price']
new_data=new_data[columns]  # reorder the columns; 'Id' is not listed, so it is dropped

In [15]:
new_data

,Region,District,Garden,Layout,Floor,Year,Size,Elevator,Direction,Renovation,Perprice,Price
0,东城,灯市口,锡拉胡同21号院,3室1厅,6,1988,75.0,,东西,精装,10.400000,780.0
1,东城,东单,东华门大街,2室1厅,6,1988,60.0,无电梯,南北,精装,11.750000,705.0
2,东城,崇文门,新世界中心,3室1厅,16,1996,210.0,有电梯,南西,其他,6.666667,1400.0
3,东城,崇文门,兴隆都市馨园,1室1厅,7,2004,39.0,,南,精装,10.769231,420.0
4,东城,陶然亭,中海紫御公馆,2室2厅,19,2010,90.0,有电梯,南,精装,11.088889,998.0
...,...,...,...,...,...,...,...,...,...,...,...,...
22463,东城,广渠门,保利蔷薇,2室1厅,16,2008,97.0,,南北,简装,10.412371,1010.0
22464,东城,永定门,郭庄北里,2室1厅,6,1995,66.0,,南北,简装,7.090909,468.0
22465,东城,和平里,康鸿家园,3室2厅,6,2000,155.0,,南北,简装,9.032258,1400.0
22466,东城,前门,台基厂头条10号院,4室1厅,6,1990,107.0,,南北,简装,10.280374,1100.0


## 2. Data visualization

### 1. Region analysis
- For the Region feature, compare home prices and listing counts across regions.

In [16]:
# Group listings by Region to compare listing counts and price per square meter; this cell counts listings per region
new_data.groupby('Region')['Price'].count().sort_values(ascending=False).to_frame().reset_index()


Region,Price
丰台,2757
海淀,2726
朝阳,2673
昌平,2660
西城,2059
大兴,2028
通州,1569
东城,1485
房山,1411
顺义,1201
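
The comment in the cell above mentions both listing counts and price per square meter, while the cell itself only counts rows (the `Price` column in its output holds counts). One possible way to get both in a single pass is pandas named aggregation. This is a minimal sketch on a made-up frame; the output column names `listings` and `mean_perprice` are illustrative choices, not names used elsewhere in the notebook.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Region": ["东城", "东城", "海淀", "海淀", "海淀"],
    "Perprice": [10.0, 12.0, 8.0, 9.0, 10.0],
})

summary = (
    toy.groupby("Region")
       .agg(listings=("Perprice", "size"),       # rows per region
            mean_perprice=("Perprice", "mean"))  # average price per unit area
       .sort_values("listings", ascending=False)
       .reset_index()
)
print(summary)
```

Naming the count column explicitly avoids the ambiguity of a column called `Price` that actually contains listing counts.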
