# EDA and data cleaning

In this notebook, we go over the data we have and perform exploratory data analysis (EDA) and data cleaning.

In [1]:
# imports
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns 
import re
from collections import Counter


## Dropping unnecessary columns
We will start our cleaning process by dropping the columns we are sure we don't need.

In [2]:
# read the preprocessed data
df = pd.read_csv('./../data/austin_listings_processed.csv')
print(f'the size of our data is {df.shape}')
df.head(2)

the size of our data is (47037, 81)




Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,occ_rate_calendar,active_duration_days,occ_rate_70,occ_rate_50,occ_rate_30,time_quarter
0,5456,https://www.airbnb.com/rooms/5456,20231215200307,2023-12-16,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,,My neighborhood is ideally located if you want...,https://a0.muscache.com/pictures/14084884/b5a3...,8028,...,1,0,0,3.71,0.3,5390.0,0.7,0.7,0.7,Q4
1,5769,https://www.airbnb.com/rooms/5769,20231215200307,2023-12-16,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,https://a0.muscache.com/pictures/23822033/ac94...,8186,...,0,1,0,1.76,0.7,5404.0,0.388601,0.544041,0.7,Q4


In [3]:
pd.set_option('display.max_columns', None)

In [4]:
# print list of the columns
print(list(df.columns))


['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availabil

Let us retain the columns we might want to use later and only drop the columns that definitely won't be used.

In [5]:

columns_to_keep = ['id', 'source', 'name', 'description','neighborhood_overview',
                   'host_is_superhost', 'neighbourhood_cleansed', 'property_type', 'room_type', 
                   'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
                   'minimum_nights', 'maximum_nights', 'number_of_reviews',
                   'review_scores_rating','occ_rate_50', 'time_quarter',
                   # host analysis
                   'host_id'
                   ]
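# keep only the selected columns and rename two of them for readability
# (chaining avoids calling an inplace rename on a slice of the original frame)
df = df[columns_to_keep].rename(columns={'neighbourhood_cleansed': 'zipcode',
                                         'occ_rate_50': 'occupancy_rate'})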
df.head(3)


Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,property_type,room_type,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter,host_id
0,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,,My neighborhood is ideally located if you want...,t,78702,Entire guesthouse,Entire home/apt,1 bath,,2.0,[],$101.00,2,90,668,4.84,0.7,Q4,8028
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,,1.0,[],,1,14,294,4.91,0.544041,Q4,8186
2,6413,previous scrape,Guesthouse in Austin · ★4.97 · Studio · 1 bed ...,,Travis Heights is one of the oldest neighborho...,f,78704,Entire guesthouse,Entire home/apt,1 bath,,1.0,[],,30,90,120,4.97,0.7,Q4,13879


Let us start our analysis by looking at the number of NaNs in each column.

In [6]:
df.isna().sum().sort_values(ascending=False)

bedrooms                 17307
neighborhood_overview    16070
description              12515
price                     2135
host_is_superhost         2058
beds                       363
bathrooms_text              34
time_quarter                 0
occupancy_rate               0
review_scores_rating         0
number_of_reviews            0
maximum_nights               0
minimum_nights               0
id                           0
amenities                    0
source                       0
room_type                    0
property_type                0
zipcode                      0
name                         0
host_id                      0
dtype: int64

Since price is a very important feature in our data, let us dive deeper into why it has 2,135 missing values.

## Missing prices

In [7]:
df[df['price'].isna()].head(5)

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,property_type,room_type,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter,host_id
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,,1.0,[],,1,14,294,4.91,0.544041,Q4,8186
2,6413,previous scrape,Guesthouse in Austin · ★4.97 · Studio · 1 bed ...,,Travis Heights is one of the oldest neighborho...,f,78704,Entire guesthouse,Entire home/apt,1 bath,,1.0,[],,30,90,120,4.97,0.7,Q4,13879
14,69810,previous scrape,Guesthouse in Austin · ★4.98 · 1 bedroom · 1 b...,,"Located in the cool Dawson area, about two mil...",f,78704,Entire guesthouse,Entire home/apt,1 bath,,1.0,[],,2,730,445,4.98,0.7,Q4,82762
18,219168,previous scrape,Rental unit in Austin · ★4.88 · Studio · 1 bed...,,Despite its proximity to South Congress and th...,f,78704,Entire rental unit,Entire home/apt,1 bath,,1.0,[],,2,60,8,4.88,0.04308,Q4,1134580
19,72833,previous scrape,Guesthouse in Austin · ★4.91 · 1 bedroom · 1 b...,,Peaceful neighborhood street in central Austin...,t,78731,Entire guesthouse,Entire home/apt,1 bath,,1.0,[],,3,60,413,4.91,0.7,Q4,378744


Let us see if we can gain further insight into which listings have missing prices.

In [8]:
df[df['price'].isna()].describe(include='object')

Unnamed: 0,source,name,description,neighborhood_overview,host_is_superhost,property_type,room_type,bathrooms_text,amenities,price,time_quarter
count,2135,2135,0.0,1333,2134,2135,2135,2132,2135,0.0,2135
unique,2,1151,0.0,1276,2,40,4,18,1,0.0,1
top,previous scrape,Rental unit in Austin · 1 bedroom · 1 bed · 1 ...,,"Great location, close to everything Austin has...",f,Entire home,Entire home/apt,1 bath,[],,Q4
freq,2115,110,,11,1828,631,1555,970,2135,,2135


In [9]:
df[df['price'].isna()].describe()


Unnamed: 0,id,zipcode,bedrooms,beds,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,host_id
count,2135.0,2135.0,0.0,2122.0,2135.0,2135.0,2135.0,2135.0,2135.0,2135.0
mean,1.153142e+17,78723.140984,,2.052309,5.290867,542.600937,23.096956,4.794993,0.189055,86278700.0
std,2.717539e+17,20.665209,,1.447688,29.952152,527.063859,50.841213,0.444607,0.230274,111028100.0
min,5769.0,78701.0,,1.0,1.0,1.0,1.0,0.0,0.00185,5985.0
25%,13169790.0,78704.0,,1.0,1.0,15.0,2.0,4.775,0.024379,11395700.0
50%,23680260.0,78722.0,,2.0,2.0,365.0,6.0,4.97,0.076225,42120140.0
75%,47038120.0,78744.0,,3.0,3.0,1125.0,21.5,5.0,0.266056,116698900.0
max,1.029898e+18,78759.0,,14.0,1100.0,1125.0,687.0,5.0,0.7,535876900.0


In [10]:
df[df['price'].isna()]['id'].nunique()

2135

As we can see, every missing price comes from the Q4 scrape, so something may have gone wrong with the scrape at that time (each affected row also has a unique id, so each listing is affected only once). None of the listings with a missing price has a description or a bedroom count either, but this may not be significant: both columns have many missing values beyond the rows with no price.
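The two observations above (missing prices concentrated in one quarter, and overlapping with missing descriptions) can be checked directly with a group count and a cross-tabulation. The sketch below uses a small made-up frame named `toy` as a stand-in for `df`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration).
toy = pd.DataFrame({
    'id': [1, 2, 1, 2, 3],
    'time_quarter': ['Q4', 'Q4', 'Q3', 'Q3', 'Q3'],
    'price': [np.nan, np.nan, 45.0, 120.0, 80.0],
    'description': [np.nan, np.nan, 'cozy room', np.nan, 'big house'],
})

# How many prices are missing in each quarter?
missing_by_quarter = toy['price'].isna().groupby(toy['time_quarter']).sum()

# Does a missing price go together with a missing description?
labels = {True: 'missing', False: 'present'}
price_state = toy['price'].isna().map(labels)
description_state = toy['description'].isna().map(labels)
overlap = pd.crosstab(price_state, description_state)

print(missing_by_quarter)
print(overlap)
```

On the real data, the same two calls would be applied to `df` instead of `toy`.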

Let us see how many unique ids our listings have in total.

In [11]:
df['id'].nunique()

15175

With 15,175 unique ids across roughly 47,000 rows, many listings appear in several quarterly scrapes. We might be able to fill in the price of a property by looking at the prices it was listed for at other times.

In [12]:
missing_price_ids = list(df[df['price'].isna()]['id'].unique())

In [13]:
df[df['id'] == missing_price_ids[0]]

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,property_type,room_type,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter,host_id
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,,1.0,[],,1,14,294,4.91,0.544041,Q4,8186
12265,5769,previous scrape,Home in Austin · ★4.90 · 1 bedroom · 1 bed · 1...,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,,1.0,"[""Wifi \u2013 47 Mbps"", ""Lock on bedroom door""...",$45.00,1,14,290,4.9,0.549346,Q3,8186
24012,5769,previous scrape,Home in Austin · ★4.90 · 1 bedroom · 1 bed · 1...,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,,78729,Private room in home,Private room,1 shared bath,,1.0,"[""Host greets you"", ""Essentials"", ""Shampoo"", ""...",$42.00,1,14,285,4.9,0.546291,Q2,8186
35716,5769,previous scrape,NW Austin Room,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,1.0,1.0,"[""Private backyard"", ""Free parking on premises...",$42.00,1,14,275,4.9,0.539957,Q1,8186


Let us replace the missing price values with the price of the same listing in other time periods.

In [14]:
# convert currency strings to float where available
# (raw string avoids an invalid escape sequence warning for '\$')
df['price'] = df['price'].str.replace(r'[\$,]', '', regex=True).astype(float)
df['price'].dtype

dtype('float64')

In [15]:
# make a dictionary where the key is the listing id of a row with a missing price and the value is
# the average price of that listing in the other quarters (NaN if no other price exists)
avrg_price = {}
for i in missing_price_ids:
    avrg_price[i] = df[df['id'] == i]['price'].mean()

# replace the missings with average from other dates
df.loc[df['price'].isna(),'price'] = df[df['price'].isna()].apply(lambda x: avrg_price[x['id']], axis=1)
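The loop above can also be expressed as a grouped transform, which avoids filtering the frame once per id. This is a minimal sketch on a made-up frame named `toy`, not the notebook's actual data; `price_filled` is a hypothetical column name used only here.

```python
import numpy as np
import pandas as pd

# Toy data: listing 10 has two known prices, listing 20 has one, listing 30 has none.
toy = pd.DataFrame({
    'id':    [10, 10, 10, 20, 20, 30],
    'price': [np.nan, 45.0, 42.0, 100.0, np.nan, np.nan],
})

# Mean price per listing id, broadcast back to every row of that listing.
# NaN values are skipped by the mean; a listing with no price at all stays NaN.
per_listing_mean = toy.groupby('id')['price'].transform('mean')

# Only the missing entries are replaced; known prices are left untouched.
toy['price_filled'] = toy['price'].fillna(per_listing_mean)
print(toy)
```

Listings that never report a price remain NaN after the fill, which matches the handful of rows that still have to be dropped afterwards.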

In [16]:
df['price'].isna().sum()

33

As we can see, we filled most of the missing prices (2,135 down to 33) with the average price of the same listing in other quarters. This was possible because the missingness is tied to the Q4 scrape rather than being random. Now, we can go ahead and drop the remaining NaNs in the price column.

In [17]:
df = df.dropna(subset=['price'])
df.isna().sum().sort_values(ascending=False)

bedrooms                 17274
neighborhood_overview    16056
description              12482
host_is_superhost         2058
beds                       363
bathrooms_text              34
price                        0
time_quarter                 0
occupancy_rate               0
review_scores_rating         0
number_of_reviews            0
maximum_nights               0
minimum_nights               0
id                           0
amenities                    0
source                       0
room_type                    0
property_type                0
zipcode                      0
name                         0
host_id                      0
dtype: int64

## Missing values in bedrooms

We know that all Airbnb listings should have information about the number of bedrooms, yet many rows in our data have no value in the bedrooms column. Similar to what we did with price, let us see if we can recover the number of bedrooms from other scrapes of the same property and use that in our data.


In [18]:
missing_bedroom_id = df[df['bedrooms'].isna()]['id'].unique()
missing_bedroom = {}
# go over listings with missing bedroom info and see if the same listing has bedrooms reported earlier or later in time
for i in missing_bedroom_id:
    # if different numbers of bedrooms are reported for the same property, pick the smaller one (arbitrary decision)
    min_bd = df[df['id'] == i ]['bedrooms'].min()
    missing_bedroom[i] = min_bd

# now replace the nan's in bedrooms if the same listing has bedrooms reported somewhere else
df.loc[df['bedrooms'].isna(),'bedrooms'] = df[df['bedrooms'].isna()].apply(lambda x: missing_bedroom[x['id']], axis=1)
     

In [19]:
df['bedrooms'].isna().sum()

4431

This reduces the number of missing bedrooms to 4431.

When we look at the listing name column, we can see that it provides useful information about the home: the number of bedrooms, beds, and bathrooms.

In [20]:
df.iloc[0]['name']

'Guesthouse in Austin · ★4.84 · 1 bedroom · 2 beds · 1 bath'

From this pattern, we can see that the string can be split on the '·' (middle dot) character. Let us first make sure that all the names in our listings follow the same convention.

In [21]:
df['name'].apply(lambda x: len(x.split('·'))).value_counts()

5    29971
1    11317
4     5620
3       90
2        6
Name: name, dtype: int64

As we can see, there are a lot of listings that don't follow this naming convention. Let's take a look at them.

In [22]:
df[df['name'].apply(lambda x: len(x.split('·'))) == 1].head(2)

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,property_type,room_type,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter,host_id
35715,5456,city scrape,"Walk to 6th, Rainey St and Convention Ctr",Great central location for walking to Convent...,My neighborhood is ideally located if you want...,t,78702,Entire guesthouse,Entire home/apt,1 bath,1.0,2.0,"[""Heating"", ""Backyard"", ""Bed linens"", ""Hot wat...",176.0,2,90,630,4.84,0.7,Q1,8028
35716,5769,previous scrape,NW Austin Room,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,1.0,1.0,"[""Private backyard"", ""Free parking on premises...",42.0,1,14,275,4.9,0.539957,Q1,8186


Let us look at the first listing from the list above.

In [23]:
df[df['id'] ==5456]

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,property_type,room_type,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter,host_id
0,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,,My neighborhood is ideally located if you want...,t,78702,Entire guesthouse,Entire home/apt,1 bath,1.0,2.0,[],101.0,2,90,668,4.84,0.7,Q4,8028
12264,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,Week of July 31 - Aug 10 require min of 5 nig...,My neighborhood is ideally located if you want...,t,78702,Entire guesthouse,Entire home/apt,1 bath,1.0,2.0,"[""Air conditioning"", ""Hot water"", ""Dishes and ...",126.0,2,90,657,4.84,0.7,Q3,8028
24007,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,Great central location for walking to Convent...,My neighborhood is ideally located if you want...,t,78702,Entire guesthouse,Entire home/apt,1 bath,1.0,2.0,"[""HDTV with Amazon Prime Video, HBO Max, Hulu,...",107.0,2,90,648,4.84,0.7,Q2,8028
35715,5456,city scrape,"Walk to 6th, Rainey St and Convention Ctr",Great central location for walking to Convent...,My neighborhood is ideally located if you want...,t,78702,Entire guesthouse,Entire home/apt,1 bath,1.0,2.0,"[""Heating"", ""Backyard"", ""Bed linens"", ""Hot wat...",176.0,2,90,630,4.84,0.7,Q1,8028


As we can see, this property appears in other quarters as well, and it looks like after Q1 Airbnb started using the '·'-separated naming convention. Let us check another listing.

In [24]:
df[df['id'] ==5769]

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,property_type,room_type,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter,host_id
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,1.0,1.0,[],43.0,1,14,294,4.91,0.544041,Q4,8186
12265,5769,previous scrape,Home in Austin · ★4.90 · 1 bedroom · 1 bed · 1...,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,1.0,1.0,"[""Wifi \u2013 47 Mbps"", ""Lock on bedroom door""...",45.0,1,14,290,4.9,0.549346,Q3,8186
24012,5769,previous scrape,Home in Austin · ★4.90 · 1 bedroom · 1 bed · 1...,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,,78729,Private room in home,Private room,1 shared bath,1.0,1.0,"[""Host greets you"", ""Essentials"", ""Shampoo"", ""...",42.0,1,14,285,4.9,0.546291,Q2,8186
35716,5769,previous scrape,NW Austin Room,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,Private room in home,Private room,1 shared bath,1.0,1.0,"[""Private backyard"", ""Free parking on premises...",42.0,1,14,275,4.9,0.539957,Q1,8186


Let us see how many missing bedroom values we have for listings whose name uses the new '·' convention and indicates the number of beds and baths.

In [25]:
print('number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (5 dots)')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 5]['bedrooms'].isna().sum())

number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (5 dots)
2840


In [26]:
print('number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (4 dots)')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 4]['bedrooms'].isna().sum())

number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (4 dots)
981


In [27]:
print('number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (3 dots)')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 3]['bedrooms'].isna().sum())

number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (3 dots)
24


In [28]:
print('number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (2 dots)')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 2]['bedrooms'].isna().sum())

number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (2 dots)
0


In [29]:
print('number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (1 dots)')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 1]['bedrooms'].isna().sum())

number of listing with missing bedrooms that use "." to separate the number of rooms in the listing name (1 dots)
586
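The five cells above can be collapsed into a single pass by grouping the missing-bedroom flag on the number of '·'-separated parts in the name. The sketch below runs on a small made-up frame named `toy`; the names and values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': [
        'Guesthouse in Austin · ★4.84 · 1 bedroom · 2 beds · 1 bath',
        'Rental unit in Austin · ★4.88 · Studio · 1 bed · 1 bath',
        'Home in Austin · 1 bedroom · 1 bed · 1 bath',
        'NW Austin Room',
    ],
    'bedrooms': [1.0, np.nan, np.nan, 1.0],
})

# Number of '·'-separated parts in each name (1 means no separator at all).
n_parts = toy['name'].str.split('·').str.len()

# Missing-bedroom count for every part count, computed in one pass.
missing_by_parts = toy['bedrooms'].isna().groupby(n_parts).sum()
print(missing_by_parts)
```

Note that the grouping key is the number of parts, so a name with five parts contains four '·' characters.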


It looks like most of the listings with a missing number of bedrooms should have some indication of it in their name. We will use a regex to extract the bedroom information from the name. The following regex extracts the number that most closely precedes the word 'bedroom', 'Bedroom', 'bedrooms', or 'Bedrooms'.

In [30]:
df['bedrooms_extracted'] = df['name'].apply(lambda x: re.search(r'\D*(\d+\.\d+|\d+)\D*(?:bedroom|bedrooms|Bedroom|Bedrooms)',x).group(1) if 
                                            re.search(r'\D*(\d+\.\d+|\d+)\D*(?:bedroom|bedrooms|Bedroom|Bedrooms)',x) else np.nan)                           


In [31]:
# convert the string into int if they are not nan
df['bedrooms_extracted'] = pd.to_numeric(df['bedrooms_extracted'], errors='coerce').astype('Int64')


In [32]:
df['bedrooms_extracted'].value_counts()

1        14505
2         8923
3         6456
4         3030
5          949
6          335
7          146
8           60
9           30
10          23
13          10
12           8
23           4
14           4
15           3
1940         2
11           1
78704        1
33           1
Name: bedrooms_extracted, dtype: Int64
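The counts above include values such as 1940 and 78704, which look more like a year and a zip code picked up from free-text names than bedroom counts. One possible guard is a stricter pattern that only accepts a one- or two-digit number directly followed by "bedroom". The pattern below is an assumption for illustration, not the one used in this notebook, and the names are made up.

```python
import re
import pandas as pd

names = pd.Series([
    'Guesthouse in Austin · ★4.84 · 1 bedroom · 2 beds · 1 bath',
    'Charming 1940s bedroom bungalow',
    'Spacious 3 Bedrooms near downtown',
    'NW Austin Room',
])

# One or two digits at a word boundary, optional whitespace, then 'bedroom'.
# The word boundary keeps the tail of a longer number (e.g. '40' in '1940') from matching.
strict_pattern = r'\b(\d{1,2})\s*bedroom'
strict = names.str.extract(strict_pattern, flags=re.IGNORECASE, expand=False)

# Convert to nullable integers; names without a match become <NA>.
bedrooms_strict = pd.to_numeric(strict, errors='coerce').astype('Int64')
print(bedrooms_strict)
```

The trade-off is that a stricter pattern may miss unusual but legitimate phrasings, so it is worth comparing its output with the original extraction before switching.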

In [33]:
df['bedrooms_extracted'].isna().sum()

12513

Now, let us take a look at the bedrooms and bedrooms_extracted columns together and see how many listings don't have a value in either of them.

In [34]:
df[df['bedrooms'].isna() & df['bedrooms_extracted'].isna()].shape

(2346, 22)

In [35]:
df[df['bedrooms'].isna() & df['bedrooms_extracted'].isna()].head(5)

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,property_type,room_type,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter,host_id,bedrooms_extracted
2,6413,previous scrape,Guesthouse in Austin · ★4.97 · Studio · 1 bed ...,,Travis Heights is one of the oldest neighborho...,f,78704,Entire guesthouse,Entire home/apt,1 bath,,1.0,[],91.666667,30,90,120,4.97,0.7,Q4,13879,
18,219168,previous scrape,Rental unit in Austin · ★4.88 · Studio · 1 bed...,,Despite its proximity to South Congress and th...,f,78704,Entire rental unit,Entire home/apt,1 bath,,1.0,[],250.0,2,60,8,4.88,0.04308,Q4,1134580,
22,224603,city scrape,Guesthouse in Austin · ★4.94 · Studio · 1 bed ...,,,t,78751,Entire guesthouse,Entire home/apt,1 bath,,1.0,[],110.0,60,365,116,4.94,0.7,Q4,1169129,
43,78884,city scrape,Rental unit in Austin · ★4.78 · Studio · 2 bed...,,The Barton Hills neighborhood is located in Au...,t,78704,Entire rental unit,Entire home/apt,1 bath,,2.0,[],101.0,2,90,199,4.78,0.473922,Q4,2157243,
107,343889,city scrape,Rental unit in Austin · ★4.75 · Studio · 1 bed...,,"Friendly, walkable residential neighborhood wi...",f,78702,Entire rental unit,Entire home/apt,1 bath,,1.0,[],60.0,30,1125,36,4.75,0.505027,Q4,1744639,


It seems like a lot of these places could be studios. What we can do now is find the listings that have the word studio in their name and set their bedrooms_extracted value to 0.

In [36]:
df.loc[(df['bedrooms_extracted'].isna() &
       df['name'].str.contains('Studio|studio|STUDIO')),'bedrooms_extracted'] = 0
df[df['bedrooms_extracted'].isna() &
   df['name'].str.contains('Studio|studio|STUDIO')
   ].shape

(0, 22)

Again, let us go back to the listings with no bedrooms reported in either the bedrooms or the bedrooms_extracted column.

In [37]:
df[df['bedrooms'].isna() & df['bedrooms_extracted'].isna()].shape

(478, 22)

If we exclude the listings that are private or shared rooms (we will not be modeling those), we have:

In [38]:
df[df['bedrooms_extracted'].isna() & 
   df['bedrooms'].isna() & 
   df['room_type'].str.contains('Entire home/apt')].shape

(455, 22)

This leaves only 455 entire-home listings with no bedroom information (later, we will have to drop these listings). At this point, let us combine the data from the bedrooms and bedrooms_extracted columns. To do that, we first check whether the bedrooms column has a value; if not, we use the value from bedrooms_extracted.

In [39]:
df['bedrooms'] = df.apply(lambda x: x['bedrooms'] if not np.isnan(x['bedrooms']) else x['bedrooms_extracted'], axis=1)
df.drop(columns=['bedrooms_extracted'], inplace=True)
df['bedrooms'].isna().sum()

478
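The row-wise combination above can also be written as a column-level fill, which tends to be faster on large frames. This is a sketch on made-up series; the nullable `Int64` dtype mirrors what the extraction step produces, and the explicit cast to `float64` is there so that both columns share one dtype before filling.

```python
import numpy as np
import pandas as pd

# Made-up values: the reported column and the name-derived fallback.
bedrooms = pd.Series([1.0, np.nan, np.nan])
bedrooms_extracted = pd.Series([1, 0, pd.NA], dtype='Int64')

# Keep the reported value where present, otherwise fall back to the extracted one.
combined = bedrooms.fillna(bedrooms_extracted.astype('float64'))
print(combined)
```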

## Missing superhost information

Let us take a look at the count of missing values again.

In [40]:
df.isna().sum().sort_values(ascending=False)

neighborhood_overview    16056
description              12482
host_is_superhost         2058
bedrooms                   478
beds                       363
bathrooms_text              34
price                        0
time_quarter                 0
occupancy_rate               0
review_scores_rating         0
number_of_reviews            0
maximum_nights               0
minimum_nights               0
id                           0
amenities                    0
source                       0
room_type                    0
property_type                0
zipcode                      0
name                         0
host_id                      0
dtype: int64

A lot of data is still missing in the host_is_superhost column. Since a host can have multiple rows (different properties and/or the same property scraped at different times of the year), we will see if we can find the superhost status of that host in other rows and use that.

In [41]:
# find the unique host ids for listings that are missing superhost info.
missing_superhost_hostid = df[df['host_is_superhost'].isna()]['host_id'].unique()
missing_superhost = {}
# count the 't' and 'f' superhost tags across all rows of that host
# and assign the majority tag as the 'host_is_superhost' value where it is missing
for i in missing_superhost_hostid:
    num_t = df[df['host_id'] == i]['host_is_superhost'].str.count('t').sum()
    num_f = df[df['host_id'] == i]['host_is_superhost'].str.count('f').sum()
    # assign the majority tag to a host that has rows missing the superhost tag (ties go to 't')
    if num_f + num_t > 0:
        if num_t >= num_f:
            missing_superhost[i] = 't'
        else: 
            missing_superhost[i] = 'f'
    else:
        missing_superhost[i] = np.nan

df.loc[df['host_is_superhost'].isna(),'host_is_superhost'] = df[df['host_is_superhost'].isna()].apply(lambda x: missing_superhost[x['host_id']], axis=1)
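The same majority vote can be sketched without an explicit loop by counting the 't' and 'f' tags per host with grouped transforms. The frame `toy` and the column `superhost_filled` below are made up for illustration; the tie rule (ties go to 't') follows the loop above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'host_id': [1, 1, 1, 2, 2, 3],
    'host_is_superhost': ['t', 't', np.nan, 'f', np.nan, np.nan],
})

flags = toy['host_is_superhost']

# Per-host counts of each tag, broadcast back to every row of that host.
n_t = flags.eq('t').groupby(toy['host_id']).transform('sum')
n_f = flags.eq('f').groupby(toy['host_id']).transform('sum')

# Majority tag per row; hosts with no known tag at all get NaN.
majority = pd.Series(np.where(n_t >= n_f, 't', 'f'), index=toy.index)
majority = majority.where((n_t + n_f) > 0)

# Fill only the missing tags.
toy['superhost_filled'] = flags.fillna(majority)
print(toy)
```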


In [42]:
df['host_is_superhost'].isna().sum()

30

We were able to reduce the number of missing superhost values substantially, from 2,058 to 30.

## Important amenities 

The amenities in our dataset are stored as a string that contains the list of items in the amenity section. Since there is no structure or template for which amenities a listing might have, we will start by looking at some of the values to see what to expect in the amenities section. Let us first look at the data itself.

In [43]:
df['amenities'].value_counts()

[]

We can see that a large portion of the listings have '[]' in the amenities column. Let us see if we can find any pattern for these listings that have no amenities information.

In [44]:
df[df['amenities'] == '[]']['time_quarter'].value_counts()

Q4    12231
Q1       13
Q3       12
Q2       12
Name: time_quarter, dtype: int64

In [45]:
# set amenities with [] in them as nan as they have no info available
df.loc[df['amenities'] == '[]', 'amenities'] = np.nan
df['amenities'].isna().sum()

12268

It is clear that almost all of the missing amenities are from the latest data scrape (Q4), so something may have gone wrong with the scrape in that period. Similar to what we did in previous sections, we will see if we can recover the amenities for a listing from other scrapes of the same property. In this case, we use the amenities from the closest period that has them available.

In [46]:
# find all the listings that have missing amenities
missing_amen_id = df[df['amenities'].isna() ]['id'].unique()
missing_amen = {}
# iterate over the ids with missing amenities and see if the same listing has some 
# information about amenities in its other rows
for i in missing_amen_id:
    # set up a mask to find other rows with the same id that have amenities mentioned
    mask = (df['id'] == i) & (df['amenities'].notnull())
    # if that id has other rows with amenities mentioned, use the first one (most recent)
    if not df[mask].empty:
        missing_amen[i] = df[mask]['amenities'].values[0]     
    # otherwise return nan
    else:
        missing_amen[i] = np.nan
# set the missing amenities
df.loc[df['amenities'].isna(), 'amenities'] = df[df['amenities'].isna()].apply(lambda x: missing_amen[x['id']], axis=1)
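Taking the first available amenities string per listing can also be sketched with a grouped `first`, which skips missing values and keeps row order. This assumes, as in the data shown earlier, that rows are ordered from the most recent quarter to the oldest; the frame `toy` and the column `amenities_filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rows ordered from most recent (Q4) to older (Q3), as in the data shown above.
toy = pd.DataFrame({
    'id': [1, 2, 3, 1, 2],
    'time_quarter': ['Q4', 'Q4', 'Q4', 'Q3', 'Q3'],
    'amenities': [np.nan, np.nan, np.nan, '["Wifi"]', '["Kitchen"]'],
})

# First non-missing amenities string per listing id, broadcast to all its rows.
first_known = toy.groupby('id')['amenities'].transform('first')

# Fill only the rows where amenities are missing.
toy['amenities_filled'] = toy['amenities'].fillna(first_known)
print(toy)
```

If the row order were different, sorting by quarter before grouping would be needed to keep the "closest period" behaviour.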


In [47]:
df['amenities'].isna().sum()

1445

As we can see, we were able to recover much of the amenities information missing from the most recent scrape (12,268 missing values down to 1,445) by using earlier scrapes of the same listing. Now, let us list the top 50 amenities in the listings.

In [48]:
# get all amenities and join them together with ','
amenities = df[df['amenities'].notnull()][['amenities']]
all_amenities = ','.join(amenities['amenities'])
all_amenities = all_amenities.lower()
# remove extra characters
all_amenities = all_amenities.replace(']', '')
all_amenities = all_amenities.replace('[', '')
# make a list of all amenities by using comma as a separator
list_all_amenities = all_amenities.split(',')
# count the occurrences of each amenity
element_counts = Counter(list_all_amenities)
top_50_elements = element_counts.most_common(50)
print('top 50 amenities and their occurrences')
for i in top_50_elements:
    print(i)


top 50 amenities and their occurrences
(' "smoke alarm"', 44105)
(' "kitchen"', 42413)
(' "wifi"', 39754)
(' "hangers"', 37007)
(' "hair dryer"', 36644)
(' "dishes and silverware"', 35577)
(' "essentials"', 35175)
(' "fire extinguisher"', 35161)
(' "iron"', 34652)
(' "carbon monoxide alarm"', 34481)
(' "shampoo"', 34109)
(' "cooking basics"', 33650)
(' "refrigerator"', 33422)
(' "hot water"', 33183)
(' "self check-in"', 32884)
(' "microwave"', 32619)
(' "bed linens"', 30868)
(' "free parking on premises"', 30572)
(' "dedicated workspace"', 27544)
(' "dishwasher"', 27293)
(' "heating"', 27200)
(' "first aid kit"', 27196)
(' "private entrance"', 25024)
(' "extra pillows and blankets"', 24257)
(' "free street parking"', 24095)
(' "washer"', 22840)
(' "oven"', 22216)
(' "long term stays allowed"', 22064)
(' "coffee maker"', 22058)
(' "tv"', 21969)
(' "bathtub"', 21518)
(' "freezer"', 20394)
(' "ceiling fan"', 20167)
(' "cleaning products"', 19766)
(' "dryer"', 18911)
(' "outdoor furniture"'
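Splitting on commas leaves quotes and leading spaces in the keys (so ' "pool"' and '"pool"' are counted separately) and breaks amenities that contain a comma themselves. If each amenities string is a valid JSON array, which the sample rows suggest but which is an assumption here, parsing it with `json.loads` avoids both issues. The strings below are made up.

```python
import json
from collections import Counter

import pandas as pd

amenities = pd.Series([
    '["Wifi", "Kitchen", "HDTV with Amazon Prime Video, HBO Max"]',
    '["Kitchen", "Pool"]',
    '["wifi"]',
])

# Parse each string as a JSON list, then normalize case and whitespace per item.
counts = Counter(
    item.strip().lower()
    for raw in amenities.dropna()
    for item in json.loads(raw)
)
print(counts.most_common(3))
```

With this approach the keys no longer carry quote characters, so any identifier list that relies on quotes (such as matching the exact token '"pool"') would need to be adapted to exact string comparison.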

The features we are most interested in are a pool, pet friendliness, a work space, free parking, and a gym. These features were selected after consulting multiple online resources and data. Please note that the relevant features could differ from city to city, depending on where the city is located and the activities there. Let us start with pool and see what has been mentioned about pools in the listings.

In [49]:
list_pool = [i for i in list_all_amenities if 'pool' in i]
element_counts = Counter(list_pool)
top_30_elements = element_counts.most_common(30)
print('top 30 pool occurrences in the amenities')
for i in top_30_elements:
    print(i)


top 30 pool occurrences in the amenities
(' "pool"', 5423)
(' "shared pool"', 2468)
(' "pool view"', 1336)
(' "private outdoor pool - available all year', 1098)
(' "shared outdoor pool - available all year', 1067)
(' "pool table"', 1065)
(' "private pool"', 839)
(' pool toys"', 471)
(' "shared outdoor pool - available seasonally', 364)
('"pool"', 361)
(' lap pool"', 338)
(' "shared outdoor pool"', 283)
(' "whirlpool stainless steel gas stove"', 227)
(' "private outdoor pool - available seasonally', 177)
(' "whirlpool refrigerator"', 143)
(' "whirlpool stainless steel oven"', 139)
(' "shared pool - available all year"', 130)
(' "shared pool - available all year', 123)
('"whirlpool refrigerator"', 122)
(' "shared outdoor pool - available all year"', 108)
(' pool cover"', 102)
(' "whirlpool gas stove"', 100)
(' pool cover', 85)
(' "whirlpool stainless steel electric stove"', 78)
(' "private outdoor pool"', 77)
(' "whirlpool  refrigerator"', 63)
(' "shared pool - available seasonally"', 58)

Let us build a list of substrings that can be used to identify pool occurrences in a listing.

In [50]:
pool_identifiers = ['"pool"', 'shared pool', 'private pool', 'outdoor pool']
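
# The quoted '"pool"' token is what keeps entries such as "pool table", "pool view",
# and "whirlpool refrigerator" from counting as a pool. A small check of the
# substring matching, using made-up amenity strings in the format assumed above
# (the helper name mentions_pool is hypothetical):

```python
pool_identifiers = ['"pool"', 'shared pool', 'private pool', 'outdoor pool']


def mentions_pool(amenities_text):
    """True if any pool identifier occurs in the lower-cased amenities string."""
    lowered = amenities_text.lower()
    return any(token in lowered for token in pool_identifiers)


# Made-up examples built from amenity names seen in the counts above.
examples = [
    '["Pool", "Wifi"]',
    '["Shared outdoor pool - available all year", "Kitchen"]',
    '["Pool table", "Whirlpool refrigerator"]',
    '["Pool view"]',
]
results = [mentions_pool(text) for text in examples]
for text, flag in zip(examples, results):
    print(flag, text)
```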

Now, let us do a similar analysis for pets.

In [51]:
list_pet = [i for i in list_all_amenities if 'pet' in i]
element_counts = Counter(list_pet)
top_30_elements = element_counts.most_common(30)
print('top 30 pet occurrences in the amenities')
for i in top_30_elements:
    print(i)

top 30 pet occurrences in the amenities
(' "pets allowed"', 14879)
('"pets allowed"', 150)
(' "le petit marseillais body soap"', 4)
(' "not petroleum based', 2)


In [52]:
pet_identifiers = ['"pets allowed"']

And the same for a workspace, which is relevant for working from home or remotely.

In [53]:
list_work = [i for i in list_all_amenities if 'work' in i]
element_counts = Counter(list_work)
top_30_elements = element_counts.most_common(30)
print('top 30 work occurrences in the amenities')
for i in top_30_elements:
    print(i)

top 30 work occurrences in the amenities
(' "dedicated workspace"', 27544)
('"dedicated workspace"', 95)
(' "bath and body works shampoo"', 12)
(' "bath and body works conditioner"', 8)
(' "bath & body works body soap"', 6)
(' "bath and body works body soap"', 3)
(' "bath and bodyworks body soap"', 2)
(' "bath and bodyworks shampoo"', 2)
(' "bath and body works  body soap"', 1)
(' "bath and body works; dove body soap"', 1)
('"bath and body works; dove body soap"', 1)


In [54]:
wfh_identifiers = ['"dedicated workspace"']

And for free parking.

In [55]:
list_parking = [i for i in list_all_amenities if 'parking' in i]
element_counts = Counter(list_parking)
top_30_elements = element_counts.most_common(30)
print('top 30 parking occurrences in the amenities')
for i in top_30_elements:
    print(i)

top 30 parking occurrences in the amenities
(' "free parking on premises"', 30572)
(' "free street parking"', 24095)
('"free parking on premises"', 5375)
(' "paid parking on premises"', 1770)
(' "paid parking off premises"', 959)
(' "free driveway parking on premises \\u2013 2 spaces"', 828)
(' "free driveway parking on premises \\u2013 1 space"', 606)
(' "free driveway parking on premises"', 596)
(' "free driveway parking on premises \\u2013 4 spaces"', 236)
(' "paid street parking off premises"', 217)
('"paid street parking off premises"', 187)
(' "free parking garage on premises \\u2013 1 space"', 155)
(' "free parking on premises \\u2013 1 space"', 152)
(' "free driveway parking on premises \\u2013 3 spaces"', 150)
(' "paid parking garage off premises"', 137)
(' "paid parking garage on premises"', 110)
(' "paid valet parking on premises"', 110)
(' "free parking on premises \\u2013 2 spaces"', 95)
(' "free parking garage on premises"', 81)
(' "paid parking garage on premises \\u2013 

In [56]:
parking_identifiers = ['free parking']

And lastly for gym access.

In [57]:
list_gym = [i for i in list_all_amenities if 'gym' in i]
element_counts = Counter(list_gym)
top_30_elements = element_counts.most_common(30)
print('top 30 gym occurrences in the amenities')
for i in top_30_elements:
    print(i)

top 30 gym occurrences in the amenities
(' "gym"', 4318)
(' "shared gym in building"', 1177)
(' "private gym in building"', 519)
(' "shared gym nearby"', 405)
(' "shared gym"', 96)
(' "gym in building"', 61)
(' "private gym"', 49)
(' "private gym nearby"', 27)
(' "gym nearby"', 11)
('"gym"', 3)


In [58]:
# note: the first entry, '"gym', has no closing quote, so it also matches "gym nearby"
gym_identifiers = ['"gym', '"shared gym in building"', '"private gym in building"', 
                   '"shared gym"', '"gym in building"', '"private gym"']

We will now build some new columns and indicate whether the listings have any of the amenities mentioned above. This could be used for our EDA and modeling when needed. 

In [59]:
# pool
# True/False when amenity info is available; NaN when the amenities value is missing.
df['has_pool'] = df['amenities'].apply(lambda x: any(element in x.lower() for element in pool_identifiers) if x == x else np.nan)
print('distribution of has_pool:')
df['has_pool'].value_counts(normalize=True)

distribution of has_pool:


False    0.707478
True     0.292522
Name: has_pool, dtype: float64

In [60]:
# pet friendly
# True/False when amenity info is available; NaN when the amenities value is missing.
df['is_petfriendly'] = df['amenities'].apply(lambda x: any(element in x.lower() for element in pet_identifiers) if x == x else np.nan)
print('distribution of is_petfriendly:')
df['is_petfriendly'].value_counts(normalize=True)

distribution of is_petfriendly:


False    0.67012
True     0.32988
Name: is_petfriendly, dtype: float64

In [61]:
# workspace
# True/False when amenity info is available; NaN when the amenities value is missing.
df['has_workspace'] = df['amenities'].apply(lambda x: any(element in x.lower() for element in wfh_identifiers) if x == x else np.nan)
print('distribution of has_workspace:')
df['has_workspace'].value_counts(normalize=True)

distribution of has_workspace:


True     0.606664
False    0.393336
Name: has_workspace, dtype: float64

In [62]:
# freeparking
# True/False when amenity info is available; NaN when the amenities value is missing.
df['has_freeparking'] = df['amenities'].apply(lambda x: any(element in x.lower() for element in parking_identifiers) if x == x else np.nan)
print('distribution of has_freeparking:')
df['has_freeparking'].value_counts(normalize=True)

distribution of has_freeparking:


True     0.803683
False    0.196317
Name: has_freeparking, dtype: float64

In [63]:
# gym
# True/False when amenity info is available; NaN when the amenities value is missing.
df['has_gym'] = df['amenities'].apply(lambda x: any(element in x.lower() for element in gym_identifiers) if x == x else np.nan)
print('distribution of has_gym:')
df['has_gym'].value_counts(normalize=True)

distribution of has_gym:


False    0.863166
True     0.136834
Name: has_gym, dtype: float64
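
The five cells above repeat the same pattern with a different identifier list each time. As a sketch only, the logic could be factored into one helper. The function name `flag_amenity` and the toy frame are hypothetical, and `pd.isna` is used here in place of the `x == x` check.

```python
import numpy as np
import pandas as pd


def flag_amenity(amenities, identifiers):
    """Flag rows whose amenities text contains any identifier; keep NaN for missing text."""
    def has_any(text):
        if pd.isna(text):
            return np.nan
        lowered = text.lower()
        return any(token in lowered for token in identifiers)

    return amenities.apply(has_any)


# Toy data in the assumed amenities format, not taken from the real listings.
toy = pd.DataFrame({'amenities': ['["Pool", "Wifi"]', '["Pool table"]', np.nan]})
toy['has_pool'] = flag_amenity(toy['amenities'], ['"pool"', 'shared pool'])
print(toy)
```

In the notebook this could then be called once per identifier list, for example `df['has_gym'] = flag_amenity(df['amenities'], gym_identifiers)`.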

## Neighborhood and listing description

These two columns could provide information about the neighborhood and the listing itself. However, one of the primary goals of this study is to help users identify good investment opportunities, and these free-text features are not as straightforward to quantify or assign a value to as features like the number of bedrooms. We will therefore ignore them in this study.

In [64]:
df = df.drop(columns=['neighborhood_overview', 'description'])

## Average home price in Austin

One factor that could affect Airbnb pricing is how much the host pays for the mortgage on the house. Houses in more expensive neighborhoods are worth more, and their owners may be paying a higher mortgage, which could affect the listing price of the property on Airbnb. Finding an accurate estimate of the house price would require a detailed study and might be out of the scope of this project. Also, Airbnb listings don't provide important information such as the square footage of the property, which is required to estimate its price. In this work, we take a rather simple approach to account for the higher mortgages in pricier neighborhoods and possibly capture their effect on the Airbnb price: we extracted the average house price for different zip codes in Austin (data from Zillow.com) and used it as a feature to predict the Airbnb listing price.

In [65]:
home_price = {78702: 613000, 78729: 457000, 78704: 802000, 78741: 397000, 78745: 456000, 78703: 1324000, 78731: 1019000, 
              78758: 398000, 78705: 353000, 78754: 385000, 78727: 474000, 78751: 636000, 78722: 595000, 78701: 727000,
              78723: 502000, 78752: 422000, 78757: 612000, 78724: 384000, 78746: 1721000, 78736: 529000, 78759: 661000,
              78721: 464000, 78756: 728000, 78733: 1120000, 78737: 786000, 78744: 358000, 78726: 714000, 78748: 440000,
              78749: 568000, 78738: 883000, 78734: 648000, 78735: 823000, 78732: 831000, 78753: 364000, 78725: 313000,
              78728: 432000, 78717: 583000, 78750: 603000, 78747: 398000, 78730: 1125000, 78739: 794000, 78719: 354000,
              78742: 527000}

In [66]:
df['home_price_aprx'] = df['zipcode'].apply(lambda x: home_price[x])
df['home_price_aprx'].describe()

count    4.700400e+04
mean     6.360820e+05
std      2.618294e+05
min      3.130000e+05
25%      4.400000e+05
50%      6.130000e+05
75%      7.940000e+05
max      1.721000e+06
Name: home_price_aprx, dtype: float64
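
Note that the dictionary lookup inside `apply` raises a `KeyError` for any zip code that is not in `home_price`. A possible alternative, shown here as a sketch on toy data, is `Series.map`, which leaves unknown zip codes as `NaN` so they can be listed and handled explicitly. The two prices are copied from the dictionary above, while zip code 78799 is a made-up value used only to show the missing case.

```python
import pandas as pd

# Subset of the home_price dictionary above.
home_price_sample = {78702: 613000, 78704: 802000}

# Toy frame; 78799 is a hypothetical zip code with no price entry.
toy = pd.DataFrame({'zipcode': [78702, 78704, 78799]})

# map() returns NaN for keys that are absent instead of raising KeyError.
toy['home_price_aprx'] = toy['zipcode'].map(home_price_sample)

# List the zip codes that did not get a price.
missing = sorted(toy.loc[toy['home_price_aprx'].isna(), 'zipcode'].unique())
print(missing)
```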

## Shared vs. private listings

If we look at the room type column in our data, we can see that there are different types of listings there. 

In [67]:
df['room_type'].value_counts()

Entire home/apt    39647
Private room        6984
Shared room          357
Hotel room            16
Name: room_type, dtype: int64

The focus of this study is on Airbnb investment opportunities, not on cases where people sublet part of their current residence. For this reason, we will only consider listings that are tagged as 'Entire home/apt'. Also, since renting an entire home is priced differently from renting a room, it might be a good idea to fit separate models if we were also interested in modeling the price distribution of rooms.

In [68]:
df = df[df['room_type'] == 'Entire home/apt']
df.shape

(39647, 25)

## Cleaning bathroom data

Bathroom data in our original dataset is stored as text, since it is a mix of numbers and letters.

In [69]:
df['bathrooms_text'].isna().sum()

9

In [70]:
df['bathrooms_text'].value_counts()

1 bath        18981
2 baths       10134
2.5 baths      4054
3 baths        2283
1.5 baths      1706
3.5 baths       921
4 baths         676
4.5 baths       283
5 baths         163
0 baths         135
5.5 baths       109
6 baths          73
6.5 baths        39
7 baths          27
8 baths          21
7.5 baths        12
Half-bath         5
10.5 baths        4
17 baths          4
11.5 baths        4
12.5 baths        2
10 baths          1
8.5 baths         1
Name: bathrooms_text, dtype: int64

We need to replace Half-bath with 0.5 baths.

In [71]:
# assign the result back instead of using inplace=True on a column selection
df['bathrooms_text'] = df['bathrooms_text'].replace('Half-bath', '0.5 baths')

In [72]:
df['bathrooms_text'].value_counts()

1 bath        18981
2 baths       10134
2.5 baths      4054
3 baths        2283
1.5 baths      1706
3.5 baths       921
4 baths         676
4.5 baths       283
5 baths         163
0 baths         135
5.5 baths       109
6 baths          73
6.5 baths        39
7 baths          27
8 baths          21
7.5 baths        12
0.5 baths         5
10.5 baths        4
17 baths          4
11.5 baths        4
12.5 baths        2
10 baths          1
8.5 baths         1
Name: bathrooms_text, dtype: int64

In [117]:
df['bath'] = df['bathrooms_text'].apply(lambda x: re.findall(r'\d*\.?\d+', x)[0] if x==x else np.nan)
df['bath'].value_counts()

1       18981
2       10134
2.5      4054
3        2283
1.5      1706
3.5       921
4         676
4.5       283
5         163
0         135
5.5       109
6          73
6.5        39
7          27
8          21
7.5        12
0.5         5
10.5        4
17          4
11.5        4
12.5        2
10          1
8.5         1
Name: bath, dtype: int64
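
The `bath` values above are still strings, because `re.findall` returns text. If a numeric column is needed for modeling, a conversion along the following lines is one option. This is a sketch under assumptions: the helper name `parse_bath_count` is hypothetical, and treating any text containing "half-bath" as 0.5 is a simplification that folds in the replacement step above.

```python
import re

import numpy as np
import pandas as pd


def parse_bath_count(text):
    """Convert a bathrooms_text value such as '2.5 baths' to a float."""
    if pd.isna(text):
        return np.nan
    if 'half-bath' in text.lower():
        return 0.5
    # First integer or decimal number in the text.
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else np.nan


# Toy values in the same format as bathrooms_text.
sample = pd.Series(['1 bath', '2.5 baths', 'Half-bath', np.nan])
bath = sample.apply(parse_bath_count)
print(bath)
```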

## Listing price 

Here we will start looking at the listing price to get a sense of how it is distributed.

In [None]:
df['price'].info()

In [None]:
df['price'].describe()

The average listing price in Austin is $280, with a maximum of 65k and a minimum of 1. These extremes look implausible and need further investigation.

In [None]:
df[df['price'] < 20].shape

In [None]:
df[df['price'] < 20].head(5)
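
To look at both tails at once, the same kind of filter can be applied with a lower and an upper cut-off. The sketch below uses toy prices. The lower threshold of 20 mirrors the cell above, while the upper threshold of 10,000 is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Toy prices, not taken from the real listings.
toy = pd.DataFrame({'price': [1, 15, 120, 280, 450, 65000]})

low_threshold = 20        # same cut-off as the cell above
high_threshold = 10_000   # hypothetical upper cut-off

# Rows whose price falls outside the [low, high] range.
suspicious = toy[(toy['price'] < low_threshold) | (toy['price'] > high_threshold)]
share = len(suspicious) / len(toy)
print(suspicious)
print(f'share of suspicious prices: {share:.2f}')
```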

## Focus on primary homes and not shared ones. 