# **Exploratory Data Analysis of Airbnb Data**

### Imad Ahmad, Ibtassam Racheed, Yip Chi Man

# Introduction

# Guiding Questions

# Dataset

In [2]:
import pandas as pd
import numpy as np
from datetime import date
import geopandas as gpd
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.express as px
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
import warnings
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
warnings.filterwarnings("ignore")
airbnb_raw = pd.read_csv('DataSet/Listings.csv', encoding='iso-8859-1')
airbnb_raw.head()

Unnamed: 0,listing_id,name,host_id,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_total_listings_count,...,minimum_nights,maximum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable
0,281420,"Beautiful Flat in le Village Montmartre, Paris",1466919,2011-12-03,"Paris, Ile-de-France, France",,,,f,1.0,...,2,1125,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f
1,3705183,39 m² Paris (Sacre Cœur),10328771,2013-11-29,"Paris, Ile-de-France, France",,,,f,1.0,...,2,1125,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f
2,4082273,"Lovely apartment with Terrace, 60m2",19252768,2014-07-31,"Paris, Ile-de-France, France",,,,f,1.0,...,2,1125,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f
3,4797344,Cosy studio (close to Eiffel tower),10668311,2013-12-17,"Paris, Ile-de-France, France",,,,f,1.0,...,2,1125,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f
4,4823489,Close to Eiffel Tower - Beautiful flat : 2 rooms,24837558,2014-12-14,"Paris, Ile-de-France, France",,,,f,1.0,...,2,1125,100.0,10.0,10.0,10.0,10.0,10.0,10.0,f


## Data Cleaning

The output above shows the head of our listings dataset, which contains 33 columns. Before analyzing the data, we had to do some cleaning and wrangling.

We first assigned proper data types to our columns. Many of the columns containing strings were stored with the generic object dtype. Although pandas stores strings as objects by default, we explicitly converted these columns to strings for good measure. All the columns containing boolean values had their values listed as 't' or 'f'. We mapped these to the Python boolean values True and False and converted the columns from objects to booleans.

We also had to deal with the date values in the host_since column. We created a new column that contained the host_since values in datetime format, as well as a column that contained only the year of each host_since value, stored as a string. This preliminary wrangling step proved useful in many parts of our analysis, as you will soon see.

In [5]:
# Correcting datatype for string columns (note: pandas stores strings as objects by default).
# The nullable 'string' dtype keeps missing values as <NA>; astype('str') would turn them into the text 'nan'.
string_cols = ['name', 'host_location', 'host_response_time',
               'neighbourhood', 'district', 'city',
               'property_type', 'room_type']
airbnb_raw[string_cols] = airbnb_raw[string_cols].astype('string')

# Correcting labelling and datatype for boolean columns.
# The nullable 'boolean' dtype keeps missing flags as <NA>; astype(bool) would turn NaN into True.
bool_cols = ['host_is_superhost', 'host_has_profile_pic',
             'host_identity_verified', 'instant_bookable']
for col in bool_cols:
    airbnb_raw[col] = airbnb_raw[col].map({'t': True, 'f': False}).astype('boolean')

# Creating 2 columns: host_since in datetime format, and the host_since year stored as a string
airbnb_raw['host_since_dt'] = pd.to_datetime(airbnb_raw['host_since'])
airbnb_raw['host_since_dt_year'] = airbnb_raw['host_since_dt'].apply(lambda x: str(x.year))
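
One thing to watch for when mapping 't'/'f' flags and parsing dates is how missing values are handled. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative assumptions, not part of our dataset) to show how a plain bool cast, pandas' nullable boolean dtype, and `pd.to_datetime` each treat a missing entry.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings data; the values are made up for illustration.
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", np.nan],
    "host_since": ["2011-12-03", "2013-11-29", None],
})

# Map the 't'/'f' labels to Python booleans; the missing entry stays NaN.
flag = toy["host_is_superhost"].map({"t": True, "f": False})

# A plain bool column cannot hold a missing value, so NaN is coerced to True.
as_plain_bool = flag.astype(bool)

# The nullable "boolean" dtype keeps the missing entry as <NA>.
as_nullable = flag.astype("boolean")

# Missing dates become NaT, and values that cannot be parsed do too with errors="coerce".
host_since_dt = pd.to_datetime(toy["host_since"], errors="coerce")
host_since_year = host_since_dt.dt.year
```

Whether a missing flag should count as False, True, or unknown depends on the analysis, so it may be worth checking `isna().sum()` on each flag column before choosing a cast.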

We then read in our second dataset, which contains the reviews. A truncated view of it can be seen below:

In [14]:
airbnb_review_raw = pd.read_csv('DataSet/Reviews.csv')
airbnb_review_raw

Unnamed: 0,listing_id,review_id,date,reviewer_id
0,11798,330265172,2018-09-30,11863072
1,15383,330103585,2018-09-30,39147453
2,16455,329985788,2018-09-30,1125378
3,17919,330016899,2018-09-30,172717984
4,26827,329995638,2018-09-30,17542859
...,...,...,...,...
5373138,47779342,726766332,2021-01-25,283094516
5373139,47823964,727963021,2021-01-31,76411977
5373140,47896175,728548625,2021-02-02,71370946
5373141,47900451,727399287,2021-01-29,109011160


With only four columns, we expected this dataset to need little cleaning and wrangling. However, we underestimated how important it would be to our analysis, and the different ways in which we had to wrangle the data to merge the two datasets proved to be a welcome challenge.

We created a new column containing the review dates in datetime format, as well as a column containing the number of days since the previous review of the same listing. Creating this second column was an interesting...

In [15]:
airbnb_review_raw['review_date_dt'] = pd.to_datetime(airbnb_review_raw['date'])
# Days between each review and the previous review of the same listing
# (NaN for a listing's first review); the result aligns back on the original index.
airbnb_review_raw['day_since_last_review'] = (
    airbnb_review_raw.sort_values('review_date_dt')
    .groupby('listing_id')['review_date_dt']
    .diff()
    .dt.days
)
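
To make the gap calculation easier to follow, here is a minimal sketch on a made-up reviews frame (`toy_reviews` and its values are assumptions for illustration). It sorts by date, takes the difference between consecutive review dates within each listing, and relies on index alignment to write the result back in the original row order.

```python
import pandas as pd

# Toy reviews, deliberately out of date order; the values are made up.
toy_reviews = pd.DataFrame({
    "listing_id": [1, 2, 1, 1, 2],
    "date": ["2020-01-10", "2020-03-01", "2020-01-01", "2020-02-09", "2020-03-04"],
})
toy_reviews["review_date_dt"] = pd.to_datetime(toy_reviews["date"])

# Sort by date, then take the difference between consecutive reviews per listing.
# The first review of each listing has no predecessor, so its gap is NaN.
toy_reviews["day_since_last_review"] = (
    toy_reviews.sort_values("review_date_dt")
    .groupby("listing_id")["review_date_dt"]
    .diff()
    .dt.days
)

print(toy_reviews)
```

In this toy frame, listing 1 has reviews on 1 January, 10 January and 9 February, giving gaps of 9 and 30 days, and listing 2 has a single gap of 3 days.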

Our next step was binning the reviews for each listing by month. We first created a year_month column containing each review's date in the format YYYY_MM. We then built a separate dataframe with one row per listing and one column per year_month bin, where each cell holds the number of reviews the listing received in that month. The head of this new dataframe can be seen below:

Note: The columns are abbreviated by ellipses as we are analyzing over 147 separate months, each with its own column!

In [16]:
airbnb_review_raw['year_month'] = airbnb_review_raw['review_date_dt'].dt.strftime('%Y_%m')
airbnb_year_month = airbnb_review_raw.groupby(['listing_id','year_month'])['review_id'].count()
airbnb_year_month = airbnb_year_month.unstack(level=-1).fillna(0)
airbnb_year_month.head()

listing_id,2008_11,2009_01,2009_02,2009_04,2009_05,2009_06,2009_07,2009_08,2009_09,2009_10,...,2020_06,2020_07,2020_08,2020_09,2020_10,2020_11,2020_12,2021_01,2021_02,2021_03
2577,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2903,2.0,1.0,1.0,1.0,0.0,1.0,4.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
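
The monthly binning can be sketched on a small made-up frame as follows (`toy_reviews`, `monthly_counts` and `monthly_counts_alt` are illustrative names, not objects from our notebook). It shows the groupby, count and unstack pattern, plus `pd.crosstab` as one possible alternative that should give the same counts.

```python
import pandas as pd

# Toy reviews for two listings; the values are made up for illustration.
toy_reviews = pd.DataFrame({
    "listing_id": [1, 1, 1, 2],
    "review_id": [101, 102, 103, 104],
    "date": ["2020-01-05", "2020-01-20", "2020-03-02", "2020-03-15"],
})
toy_reviews["year_month"] = pd.to_datetime(toy_reviews["date"]).dt.strftime("%Y_%m")

# Count reviews per (listing, month), then pivot the months into columns.
# fill_value=0 keeps the counts as integers instead of introducing NaN.
monthly_counts = (
    toy_reviews.groupby(["listing_id", "year_month"])["review_id"]
    .count()
    .unstack(fill_value=0)
)

# pd.crosstab builds the same table of counts in a single call.
monthly_counts_alt = pd.crosstab(toy_reviews["listing_id"], toy_reviews["year_month"])

print(monthly_counts)
```

Note that a month in which no listing received a review gets no column at all (2020_02 in this toy data), which is presumably why a month such as 2008_12 does not appear in the header above. If a complete monthly grid is needed, reindexing the columns against a full range of months is one option.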


# Analysis

# Conclusion

## References