# Analyzing the Factors That Affect Lodging Prices Through Airbnb Price Prediction
https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

# 1. Importing Libraries and Modules

In [3]:
import pandas as pd
from os.path import join

# 2. Loading the Data

In [7]:
data_path = join("data", "AB_NYC_2019.csv")
airbnb_df = pd.read_csv(data_path)

In [10]:
print(airbnb_df.shape)
airbnb_df.head(3)

(48895, 16)


| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |


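# 3. Handling Missing Values

## 3.1 Handling missing values in name and host_name

**<div style="font-size: 16px">In this analysis, rows with a missing name or host_name are removed, for the following two reasons.</div>**
- Whatever the importance of the listing name and host name, rows in which they are missing cannot be used.
- Only 37 of the 48,895 rows are affected (about 0.08%), so removing them is unlikely to have a meaningful effect on the dataset as a whole.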

In [9]:
airbnb_df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [40]:
airbnb_df = airbnb_df.loc[~(airbnb_df["name"].isnull() | airbnb_df["host_name"].isnull()), :]
airbnb_df = airbnb_df.reset_index(drop=True)
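
The same row filter can also be written with `DataFrame.dropna` and its `subset` parameter. The following is a minimal sketch on a small hypothetical frame (`toy_df` and its values are invented for illustration and are not part of the dataset); it also computes the share of rows removed, which is the quantity the second reason above relies on.

```python
import pandas as pd

# Hypothetical toy data standing in for airbnb_df (not the real dataset).
toy_df = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "host_name": ["Ann", "Bob", None, "Cho"],
    "price": [100, 80, 120, 60],
})

# Drop rows where either column is missing, then renumber the index from 0.
cleaned_df = toy_df.dropna(subset=["name", "host_name"]).reset_index(drop=True)

# Share of rows that were removed by the filter.
dropped_share = 1 - len(cleaned_df) / len(toy_df)
print(len(cleaned_df), dropped_share)
```

On the real data, the equivalent call would presumably be `airbnb_df.dropna(subset=["name", "host_name"])`, which selects the same rows as the boolean mask used in the cell above.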

In [41]:
# Verify that the rows with missing values were removed.
airbnb_df.isnull().sum()

id                                    0
name                                  0
host_id                               0
host_name                             0
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10037
reviews_per_month                 10037
calculated_host_listings_count        0
availability_365                      0
dtype: int64

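## 3.2 Handling missing values in last_review and reviews_per_month

- Why would a listing have no last-review date? Most likely because it has never been reviewed at all. <br>
If so, its reviews-per-month value would naturally be missing as well. <br>
The next two cells check this hypothesis.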

In [47]:
airbnb_df[airbnb_df["last_review"].isnull()]["number_of_reviews"].max()

0

In [59]:
# Rows with no last_review are exactly the rows with no reviews_per_month.
list(airbnb_df[airbnb_df["last_review"].isnull()].index) == list(airbnb_df[airbnb_df["reviews_per_month"].isnull()].index)

True
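
The two checks above can be combined into a single comparison of boolean masks. This is a sketch on a hypothetical frame (`toy_df` and its values are invented for illustration). Unlike the first cell above, which only confirms that rows without a last review have zero reviews, it also checks the converse.

```python
import pandas as pd

# Hypothetical toy data with the same three columns as airbnb_df.
toy_df = pd.DataFrame({
    "number_of_reviews": [9, 0, 45, 0],
    "last_review": ["2018-10-19", None, "2019-05-21", None],
    "reviews_per_month": [0.21, None, 0.38, None],
})

no_last_review = toy_df["last_review"].isnull()
no_monthly_rate = toy_df["reviews_per_month"].isnull()
has_no_reviews = toy_df["number_of_reviews"] == 0

# The hypothesis holds if all three masks select exactly the same rows.
masks_agree = bool(
    (no_last_review == no_monthly_rate).all()
    and (no_last_review == has_no_reviews).all()
)
print(masks_agree)
```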

- Next, the last-review date is split into year, month, and day columns, and all remaining missing values are then filled with 0 in a single step.

In [70]:
# Parse the date strings; missing values become NaT.
airbnb_df["last_review"]       = pd.to_datetime(airbnb_df["last_review"])
airbnb_df["last_review_year"]  = airbnb_df["last_review"].dt.year
airbnb_df["last_review_month"] = airbnb_df["last_review"].dt.month
airbnb_df["last_review_day"]   = airbnb_df["last_review"].dt.day

airbnb_df = airbnb_df.drop(["last_review"], axis=1)
airbnb_df.head(3)

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | last_review_year | last_review_month | last_review_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 0.21 | 6 | 365 | 2018.0 | 10.0 | 19.0 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 0.38 | 2 | 355 | 2019.0 | 5.0 | 21.0 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | 1 | 365 | NaN | NaN | NaN |


In [72]:
airbnb_df.isnull().sum()

id                                    0
name                                  0
host_id                               0
host_name                             0
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10037
calculated_host_listings_count        0
availability_365                      0
last_review_year                  10037
last_review_month                 10037
last_review_day                   10037
dtype: int64

In [73]:
airbnb_df = airbnb_df.fillna(0)
airbnb_df.isnull().sum()

id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
last_review_year                  0
last_review_month                 0
last_review_day                   0
dtype: int64
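
The date split and zero fill can be sketched end to end on a small hypothetical frame (`toy_df` and its values are invented for illustration). Because a column containing missing values is stored as floats, the parts show up as `2018.0`, `10.0`, and so on after filling. The sketch adds an integer cast after the fill, which is an optional extra step and not something the notebook itself does.

```python
import pandas as pd

# Hypothetical toy data: two reviewed listings and one never reviewed.
toy_df = pd.DataFrame({"last_review": ["2018-10-19", "2019-05-21", None]})

# Parse the strings into datetimes; the missing entry becomes NaT.
review_dates = pd.to_datetime(toy_df["last_review"])

# Extract each date part, fill the missing entries with 0, and cast to int.
for part in ("year", "month", "day"):
    toy_df[f"last_review_{part}"] = getattr(review_dates.dt, part).fillna(0).astype(int)

# The original date column is no longer needed once the parts exist.
toy_df = toy_df.drop(columns=["last_review"])
print(toy_df)
```

Here 0 acts as a sentinel meaning "never reviewed", so any later step that treats the year, month, or day as a real date component would need to account for it.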

In [77]:
airbnb_df.head(3)

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | last_review_year | last_review_month | last_review_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 0.21 | 6 | 365 | 2018.0 | 10.0 | 19.0 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 0.38 | 2 | 355 | 2019.0 | 5.0 | 21.0 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | 0.0 | 1 | 365 | 0.0 | 0.0 | 0.0 |


In [76]:
airbnb_df["number_of_reviews"]

0          9
1         45
2          0
3        270
4          9
        ... 
48853      0
48854      0
48855      0
48856      0
48857      0
Name: number_of_reviews, Length: 48858, dtype: int64