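# Project: Assessing and Cleaning New York City Airbnb Listing Data

## Analysis Goal

The goal of this analysis is to use market rental data to identify the most popular listings, so that more effective marketing strategies can be developed to increase revenue.

The purpose of this hands-on project is to practice assessing how clean and tidy a dataset is and, based on that assessment, to clean the data so that it is ready for the next stage of analysis.

## Introduction

This dataset describes the Airbnb listings available in New York City in 2019. Airbnb is a travel accommodation rental community where users can publish and search for vacation rental listings, and book them online, through the website or the mobile app.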

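Variable definitions:
- id: listing ID
- name: listing name
- host_id: host ID
- host_name: host name
- neighbourhood_group: area (borough)
- neighbourhood: neighbourhood
- latitude: latitude coordinate
- longitude: longitude coordinate
- room_type: room type
- price: price (USD)
- minimum_nights: minimum number of nights per booking
- number_of_reviews: number of reviews
- last_review: date of the most recent review
- reviews_per_month: number of reviews per month
- calculated_host_listings_count: number of listings the host has
- availability_365: number of days (out of 365) on which the listing can be booked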

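## Reading the Data

Import the libraries needed for the analysis, then use Pandas' `read_csv` function to parse the contents of the raw data file "airbnb_NYC_2019.csv" into a DataFrame and assign it to the variable `original_data`.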

In [1]:
import pandas as pd

In [4]:
original_data = pd.read_csv("./airbnb_NYC_2019.csv")
original_data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
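
As a minimal sketch of what `read_csv` does, the example below parses CSV text held in memory instead of a file on disk. The `csv_text` string is a hypothetical excerpt with only a few of the columns, and `df` is an illustrative name that does not appear in the notebook.

```python
import io

import pandas as pd

# Hypothetical excerpt: a header line plus two rows with a subset of the columns.
csv_text = (
    "id,name,neighbourhood_group,room_type,price\n"
    "2539,Clean & quiet apt home by the park,Brooklyn,Private room,149\n"
    "2595,Skylit Midtown Castle,Manhattan,Entire home/apt,225\n"
)

# read_csv accepts any file-like object, so StringIO stands in for the real file.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # the data type pandas inferred for each column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed as expected.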


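## Assessing the Data

In this section, I assess the data contained in the `original_data` DataFrame created in the previous section. The assessment covers two aspects: structure and content, that is, tidiness and cleanliness. A structural problem means the data violates one of the three criteria "each column is a variable, each row is an observation, each cell is a single value". Content problems include missing data, duplicate data, invalid data, and so on.

### Assessing Tidiness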

In [5]:
original_data.sample(15)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
22692,18362628,Sunny Lower East Side bedroom,22866591,Caroline,Manhattan,Lower East Side,40.7183,-73.98639,Private room,75,2,4,2017-07-19,0.16,1,0
24688,19830008,Charming Private Room in the heart of East Vil...,139942077,Sahar,Manhattan,East Village,40.72304,-73.98089,Private room,125,1,80,2019-06-28,3.35,1,340
13685,10257476,Private Room in Artsy Home - Heart of Manhattan,5903619,Kimberly,Manhattan,Hell's Kitchen,40.76464,-73.99403,Private room,125,1,45,2019-06-29,1.13,1,322
4568,3185829,2BR Private Apt Guest Homestay,16151285,Carol,Bronx,Williamsbridge,40.88075,-73.84845,Entire home/apt,95,3,58,2019-06-10,0.96,4,358
3240,1934804,3 Story Brooklyn House - Sleeps 10!,10012421,Deborah,Brooklyn,Clinton Hill,40.69356,-73.96744,Entire home/apt,750,3,32,2019-05-19,1.66,1,141
42419,32919980,CHARMING BEDROOM❤️HEART OF BROOKLYN,247673988,Adrian,Brooklyn,Bushwick,40.69929,-73.93602,Private room,55,1,16,2019-07-02,4.36,2,169
16906,13461123,Cozy Brooklyn apartment in Brownstone,406987,Lau,Brooklyn,Bedford-Stuyvesant,40.69261,-73.95069,Entire home/apt,100,4,3,2016-09-23,0.08,1,0
36516,29035162,Winter Escape in Classic Brooklyn Brownstone,125576446,Roz,Brooklyn,Prospect Heights,40.67861,-73.96999,Entire home/apt,400,32,1,2018-12-30,0.16,1,0
43223,33524941,The New Harlemites,252607014,Eric,Manhattan,Harlem,40.81325,-73.9398,Entire home/apt,150,3,6,2019-05-21,2.31,1,25
48144,36108534,Luxury 1 bed in UWS Finest Building with Gym #...,116305897,Laura,Manhattan,Upper West Side,40.78935,-73.97389,Entire home/apt,180,30,0,,,9,311


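Judging from the 15 sampled rows, the data satisfies "each column is a variable, each row is an observation, each cell is a single value". Specifically, each row holds the information for one listing, and each column is one variable describing that listing, so there are no structural problems.

### Assessing Cleanliness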

In [6]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

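The output shows that the data contains 48,895 observations, and that the `last_review` and `reviews_per_month` variables have missing values. The `name` and `host_name` columns are also missing a few values (16 and 21 respectively). In addition, the data type of `price` should be floating point, so a type conversion is needed.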
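
As a complement to `info()`, missing values can also be counted per column directly. The sketch below uses a hypothetical three-row `sample` DataFrame with illustrative values rather than the real dataset.

```python
import pandas as pd

# Hypothetical mini-sample; the values are illustrative, not real listings.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Loft A", None, "Studio C"],
        "number_of_reviews": [9, 45, 0],
        "last_review": ["2018-10-19", "2019-05-21", None],
        "reviews_per_month": [0.21, 0.38, None],
    }
)

# isnull() marks each missing cell; sum() then counts them per column.
missing_counts = sample.isnull().sum()

# Show only the columns that actually have missing values.
print(missing_counts[missing_counts > 0])
```

Applied to `original_data`, the same two calls would list every column with missing values together with its count.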

### Assessing Missing Data

Having learned that `last_review` has missing values, we extract the observations in which it is missing.

In [7]:
original_data[original_data["last_review"].isnull()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
19,7750,Huge 2 BR Upper East Cental Park,17985,Sing,Manhattan,East Harlem,40.79685,-73.94872,Entire home/apt,190,7,0,,,2,249
26,8700,Magnifique Suite au N de Manhattan - vue Cloitres,26394,Claude & Sophie,Manhattan,Inwood,40.86754,-73.92639,Private room,80,4,0,,,1,0
36,11452,Clean and Quiet in Brooklyn,7355,Vt,Brooklyn,Bedford-Stuyvesant,40.68876,-73.94312,Private room,35,60,0,,,1,365
38,11943,Country space in the city,45445,Harriet,Brooklyn,Flatbush,40.63702,-73.96327,Private room,150,1,0,,,1,365
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


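There are 10,052 listings with a missing `last_review` value. In the output, these listings have a `number_of_reviews` of 0 and a missing `reviews_per_month`. To verify this guess, we add a filter condition and check whether any listing has a missing `reviews_per_month` while its `number_of_reviews` is not 0.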

In [8]:
original_data[(original_data["number_of_reviews"] != 0) & (original_data["reviews_per_month"].isnull())]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
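
The same idea can be checked in one pass by comparing boolean masks. This is a minimal sketch on a hypothetical four-row `sample`, and the mask names are illustrative.

```python
import pandas as pd

# Hypothetical mini-sample with the three review-related columns.
sample = pd.DataFrame(
    {
        "number_of_reviews": [9, 0, 45, 0],
        "last_review": ["2018-10-19", None, "2019-05-21", None],
        "reviews_per_month": [0.21, None, 0.38, None],
    }
)

no_reviews = sample["number_of_reviews"] == 0
missing_last = sample["last_review"].isnull()
missing_rpm = sample["reviews_per_month"].isnull()

# True only when all three masks select exactly the same rows.
same_rows = bool((no_reviews == missing_last).all() and (no_reviews == missing_rpm).all())
print(same_rows)
```

If `same_rows` is true for the full dataset, the missing review fields are fully explained by listings that have never been reviewed.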


Having learned that `reviews_per_month` has missing values, we extract the observations in which it is missing.

In [13]:
original_data[original_data["reviews_per_month"].isnull()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
19,7750,Huge 2 BR Upper East Cental Park,17985,Sing,Manhattan,East Harlem,40.79685,-73.94872,Entire home/apt,190,7,0,,,2,249
26,8700,Magnifique Suite au N de Manhattan - vue Cloitres,26394,Claude & Sophie,Manhattan,Inwood,40.86754,-73.92639,Private room,80,4,0,,,1,0
36,11452,Clean and Quiet in Brooklyn,7355,Vt,Brooklyn,Bedford-Stuyvesant,40.68876,-73.94312,Private room,35,60,0,,,1,365
38,11943,Country space in the city,45445,Harriet,Brooklyn,Flatbush,40.63702,-73.96327,Private room,150,1,0,,,1,365
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


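The earlier filter for a nonzero `number_of_reviews` returned 0 rows, so the listings with missing review fields do not have a valid `number_of_reviews` either. `last_review` is the date of the most recent review, `number_of_reviews` is the number of reviews, and `reviews_per_month` is the number of reviews per month; all three are important variables for the subsequent analysis. When they are all missing or zero at the same time, we consider that the observation cannot provide meaningful information, so these observations can be removed later.

### Assessing Duplicate Data

Based on the meaning of the variables, `id` and `host_id` are both identifiers, but only the listing `id` must be unique. A host can own more than one listing, so `host_id`, and therefore `host_name`, may legitimately repeat. For this dataset, we therefore only need to check `id` for duplicates.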

In [9]:
original_data[original_data["id"].duplicated()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365


The check finds no duplicate `id` values.
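
When only the count matters, the duplicate check can be reduced to a single number. The sketch below uses a hypothetical `sample` in which one host owns two listings, which illustrates why a repeated `host_id` is not treated as a problem.

```python
import pandas as pd

# Hypothetical mini-sample: listing ids are unique, one host appears twice.
sample = pd.DataFrame(
    {
        "id": [2539, 2595, 3647, 3831],
        "host_id": [2787, 2845, 2787, 4869],
    }
)

# duplicated() flags every repeat of an earlier value; sum() counts the flags.
duplicate_id_count = int(sample["id"].duplicated().sum())

# is_unique is a shortcut for "no value appears more than once".
ids_are_unique = sample["id"].is_unique

# Repeated host ids simply mean that a host has several listings.
listings_per_host = sample["host_id"].value_counts()

print(duplicate_id_count, ids_are_unique)
```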

### Assessing Inconsistent Data

Inconsistent data may exist in the `neighbourhood_group` variable, so we check whether several different values refer to the same area.

In [10]:
original_data["neighbourhood_group"].value_counts()

neighbourhood_group
Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: count, dtype: int64

No two different values refer to the same area.
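
For columns with many categories, reading `value_counts()` by eye becomes unreliable. One possible approach, sketched here on a hypothetical `boroughs` Series that deliberately contains an inconsistent label, is to compare the number of distinct values before and after normalizing case and whitespace.

```python
import pandas as pd

# Hypothetical labels; "manhattan " is a made-up variant for illustration.
boroughs = pd.Series(["Manhattan", "manhattan ", "Brooklyn", "Queens", "Brooklyn"])

# Strip surrounding whitespace and lowercase so that trivial variants collapse.
normalized = boroughs.str.strip().str.lower()

raw_unique = boroughs.nunique()
normalized_unique = normalized.nunique()

# Fewer distinct values after normalization means some labels were variants.
has_variants = normalized_unique < raw_unique
print(raw_unique, normalized_unique, has_variants)
```

This only catches differences in case and surrounding whitespace. Abbreviations or misspellings would still need a manual look.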

### Assessing Invalid or Erroneous Data

The DataFrame's `describe` method gives a quick overview of the summary statistics of the numeric columns.

In [11]:
original_data.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


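No invalid or erroneous data was identified. Note, however, that the minimum `price` is 0 and the maximum `minimum_nights` is 1250, which are extreme values that may deserve a closer look.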
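
The `describe()` output above shows a minimum `price` of 0.0 and a maximum `minimum_nights` of 1250. One way to inspect such extremes is with explicit range filters. This is a sketch on a hypothetical `sample`, and the thresholds are assumptions rather than rules stated in the notebook.

```python
import pandas as pd

# Hypothetical mini-sample that includes the extremes seen in describe().
sample = pd.DataFrame(
    {
        "price": [149, 0, 10000, 80],
        "minimum_nights": [1, 3, 1250, 10],
        "availability_365": [365, 0, 194, 355],
    }
)

# Assumed rule: a listing priced at 0 or below is suspicious.
zero_price = sample[sample["price"] <= 0]

# Assumed rule: a minimum stay longer than one year is suspicious.
long_stay = sample[sample["minimum_nights"] > 365]

# availability_365 should fall between 0 and 365 inclusive.
bad_availability = sample[~sample["availability_365"].between(0, 365)]

print(len(zero_price), len(long_stay), len(bad_availability))
```

Whether the flagged rows are errors or legitimate outliers is a judgment call that depends on the analysis goal.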

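## Cleaning the Data

Based on the conclusions of the assessment above, the cleaning steps we need to perform are:
- Convert the data type of the `price` variable to floating point
- Drop the observations in which `last_review` is missing
- Drop the observations in which `reviews_per_month` is missing

To keep the cleaned data separate from the raw data, we create a new variable `cleaned_data` as a copy of `original_data`. All subsequent cleaning steps are applied to `cleaned_data`.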

In [14]:
cleaned_data = original_data.copy()
cleaned_data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


Convert the data type of the `price` variable to floating point:

In [15]:
cleaned_data["price"] = cleaned_data["price"].astype(float)
cleaned_data["price"]

0        149.0
1        225.0
2        150.0
3         89.0
4         80.0
         ...  
48890     70.0
48891     40.0
48892    115.0
48893     55.0
48894     90.0
Name: price, Length: 48895, dtype: float64
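
A small sketch of what `astype(float)` changes, using a hypothetical `prices` Series: the values stay the same and only the dtype differs.

```python
import pandas as pd

# Hypothetical integer prices, as read_csv would infer them from whole numbers.
prices = pd.Series([149, 225, 150], name="price")

# astype returns a new Series; the original is left unchanged.
prices_float = prices.astype(float)

print(prices.dtype, prices_float.dtype)
```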

Drop the observations in which `last_review` is missing, then check the number of missing values left in that column:

In [16]:
cleaned_data = cleaned_data.dropna(subset=["last_review"])
cleaned_data

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149.0,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225.0,1,45,2019-05-21,0.38,2,355
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89.0,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80.0,10,9,2018-11-19,0.10,1,0
5,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.97500,Entire home/apt,200.0,3,74,2019-06-22,0.59,1,129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48782,36425863,Lovely Privet Bedroom with Privet Restroom,83554966,Rusaa,Manhattan,Upper East Side,40.78099,-73.95366,Private room,129.0,1,1,2019-07-07,1.00,1,147
48790,36427429,No.2 with queen size bed,257683179,H Ai,Queens,Flushing,40.75104,-73.81459,Private room,45.0,1,1,2019-07-07,1.00,6,339
48799,36438336,Seas The Moment,211644523,Ben,Staten Island,Great Kills,40.54179,-74.14275,Private room,235.0,1,1,2019-07-07,1.00,1,87
48805,36442252,1B-1B apartment near by Metro,273841667,Blaine,Bronx,Mott Haven,40.80787,-73.92400,Entire home/apt,100.0,1,2,2019-07-07,2.00,1,40


In [17]:
cleaned_data["last_review"].isnull().sum()

np.int64(0)

Drop the observations in which `reviews_per_month` is missing, then check the number of missing values left in that column:

In [18]:
cleaned_data = cleaned_data.dropna(subset=['reviews_per_month'])
cleaned_data

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149.0,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225.0,1,45,2019-05-21,0.38,2,355
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89.0,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80.0,10,9,2018-11-19,0.10,1,0
5,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.97500,Entire home/apt,200.0,3,74,2019-06-22,0.59,1,129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48782,36425863,Lovely Privet Bedroom with Privet Restroom,83554966,Rusaa,Manhattan,Upper East Side,40.78099,-73.95366,Private room,129.0,1,1,2019-07-07,1.00,1,147
48790,36427429,No.2 with queen size bed,257683179,H Ai,Queens,Flushing,40.75104,-73.81459,Private room,45.0,1,1,2019-07-07,1.00,6,339
48799,36438336,Seas The Moment,211644523,Ben,Staten Island,Great Kills,40.54179,-74.14275,Private room,235.0,1,1,2019-07-07,1.00,1,87
48805,36442252,1B-1B apartment near by Metro,273841667,Blaine,Bronx,Mott Haven,40.80787,-73.92400,Entire home/apt,100.0,1,2,2019-07-07,2.00,1,40


In [19]:
cleaned_data['reviews_per_month'].isnull().sum()

np.int64(0)
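
The two `dropna` calls could also be combined by passing both column names to `subset`. The sketch below shows this on a hypothetical `sample` and compares row counts before and after, which is a useful sanity check after any deletion.

```python
import pandas as pd

# Hypothetical mini-sample in which two rows lack both review fields.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "last_review": ["2018-10-19", None, "2019-05-21", None],
        "reviews_per_month": [0.21, None, 0.38, None],
    }
)

rows_before = len(sample)

# By default, a row is dropped if ANY of the listed columns is missing.
cleaned = sample.dropna(subset=["last_review", "reviews_per_month"])

rows_after = len(cleaned)
remaining_missing = int(cleaned[["last_review", "reviews_per_month"]].isnull().sum().sum())

print(rows_before - rows_after, remaining_missing)
```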

## Saving the Cleaned Data

In [21]:
cleaned_data.to_csv("C_Airbnb_NYC_2019.csv", index=False)

In [22]:
pd.read_csv("./C_Airbnb_NYC_2019.csv")

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149.0,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225.0,1,45,2019-05-21,0.38,2,355
2,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89.0,1,270,2019-07-05,4.64,1,194
3,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80.0,10,9,2018-11-19,0.10,1,0
4,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.97500,Entire home/apt,200.0,3,74,2019-06-22,0.59,1,129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38838,36425863,Lovely Privet Bedroom with Privet Restroom,83554966,Rusaa,Manhattan,Upper East Side,40.78099,-73.95366,Private room,129.0,1,1,2019-07-07,1.00,1,147
38839,36427429,No.2 with queen size bed,257683179,H Ai,Queens,Flushing,40.75104,-73.81459,Private room,45.0,1,1,2019-07-07,1.00,6,339
38840,36438336,Seas The Moment,211644523,Ben,Staten Island,Great Kills,40.54179,-74.14275,Private room,235.0,1,1,2019-07-07,1.00,1,87
38841,36442252,1B-1B apartment near by Metro,273841667,Blaine,Bronx,Mott Haven,40.80787,-73.92400,Entire home/apt,100.0,1,2,2019-07-07,2.00,1,40
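
Saving with `index=False` leaves out the original row labels, so the reloaded data is renumbered from 0, as the output above shows. The sketch below reproduces this round trip in memory. The `cleaned` frame and the `buffer` are hypothetical stand-ins for `cleaned_data` and the CSV file.

```python
import io

import pandas as pd

# Hypothetical cleaned frame whose index has a gap, as it would after dropna.
cleaned = pd.DataFrame({"id": [2539, 3831], "price": [149.0, 89.0]}, index=[0, 3])

# Write to an in-memory buffer instead of a file; index=False omits the row labels.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back in.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(cleaned.index), list(reloaded.index))
print(reloaded.dtypes)
```

Comparing the shape and dtypes of the reloaded frame with the frame that was saved is a simple way to confirm that nothing was lost on the way to disk.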
