# Testing the Efficacy of Airbnb's 90 Day Limit on Entire Home Listings

**Research Question:** Did the introduction of the 90-day limit reduce growth in the proportion of entire home listings occupied for more than 90 days a year, compared with single rooms?

**H0:** there is no difference in the change in proportion of entire home listings exceeding 90 days compared to single room listings after the imposition of the 90-day limit.

**HA:** there is a difference in the change in proportion of entire home listings exceeding 90 days compared to single room listings after the imposition of the 90-day limit.
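
The two hypotheses can be written more compactly. The notation here is my own shorthand rather than anything taken from the literature: $p_{g,t}$ is the proportion of listings of room type $g$ (EH for entire home, SR for single room) estimated to exceed 90 days in year $t$.

$$
H_0:\ (p_{EH,2017} - p_{EH,2016}) - (p_{SR,2017} - p_{SR,2016}) = 0
$$

$$
H_A:\ (p_{EH,2017} - p_{EH,2016}) - (p_{SR,2017} - p_{SR,2016}) \neq 0
$$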

The policy was introduced in Jan 2017, so I will test for differences in proportions between 2016 and 2017. There is, however, a possibility of a time lag, which means different years may yield different results.

As this notebook aims to be reproducible and we are getting close to submission time, I'm not going to keep a record of any QA or data exploration. I'll only keep things which I think might make it into the final document.

In [55]:
#Loading packages
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt 
import os
import datetime as dt
import seaborn as sns
import duckdb as db
import statsmodels.api as sm
print(os.getcwd())

/home/jovyan/work/CASA0013 - Foundations of Spatial Data Science/CASA0013_FSDS_Airbnb-data-analytics/Documentation


## 1. Data Processing

### 1.1 Review Data

In [14]:
reviews = pd.read_csv("data/reviews.csv.gz")
reviews.head() #Loaded successfully

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,13913,80770,2010-08-18,177109,Michael,My girlfriend and I hadn't known Alina before ...
1,13913,367568,2011-07-11,19835707,Mathias,Alina was a really good host. The flat is clea...
2,13913,529579,2011-09-13,1110304,Kristin,Alina is an amazing host. She made me feel rig...
3,13913,595481,2011-10-03,1216358,Camilla,"Alina's place is so nice, the room is big and ..."
4,13913,612947,2011-10-09,490840,Jorik,"Nice location in Islington area, good for shor..."


In [15]:
#Changing date data type
reviews["date"] = pd.to_datetime(reviews["date"], format="%Y-%m-%d")

In [16]:
#Filtering to only 2016 and 2017 data
reviews['year'] = reviews['date'].dt.year
reviews = reviews[
        (reviews.year==2016) |
        (reviews.year==2017)]

In [17]:
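#Dropping unnecessary columns
#Reassigning rather than using inplace=True, as reviews is now a filtered slice and inplace edits can trigger SettingWithCopyWarning
reviews = reviews.drop(columns=["id", "date", "reviewer_name", "reviewer_id", "comments"])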

### 1.2 Listing Data

In [25]:
listings = pd.read_csv("data/listings.csv.gz")
listings.head() #Loaded successfully

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,13913,https://www.airbnb.com/rooms/13913,20240906025501,2024-09-06,city scrape,Holiday London DB Room Let-on going,My bright double bedroom with a large window h...,Finsbury Park is a friendly melting pot commun...,https://a0.muscache.com/pictures/miso/Hosting-...,54730,...,4.84,4.72,4.72,,f,3,2,1,0,0.26
1,15400,https://www.airbnb.com/rooms/15400,20240906025501,2024-09-07,city scrape,Bright Chelsea Apartment. Chelsea!,Lots of windows and light. St Luke's Gardens ...,It is Chelsea.,https://a0.muscache.com/pictures/428392/462d26...,60302,...,4.84,4.93,4.75,,f,1,1,0,0,0.54
2,17402,https://www.airbnb.com/rooms/17402,20240906025501,2024-09-07,city scrape,Fab 3-Bed/2 Bath & Wifi: Trendy W1,"You'll have a great time in this beautiful, cl...","Fitzrovia is a very desirable trendy, arty and...",https://a0.muscache.com/pictures/39d5309d-fba7...,67564,...,4.72,4.89,4.61,,f,6,6,0,0,0.34
3,24328,https://www.airbnb.com/rooms/24328,20240906025501,2024-09-07,city scrape,"Battersea live/work artist house, garden & par...","Artist house, bright high ceiling rooms for bo...","- Battersea is a quiet family area, easy acces...",https://a0.muscache.com/pictures/9194b40f-c627...,41759,...,4.93,4.59,4.65,,f,1,1,0,0,0.56
4,33332,https://www.airbnb.com/rooms/33332,20240906025501,2024-09-06,city scrape,Beautiful Ensuite Richmond-upon-Thames borough,"Walking distance to Twickenham Stadium, 35 min...",Peaceful and friendly.,https://a0.muscache.com/pictures/miso/Hosting-...,144444,...,4.5,4.67,4.22,,f,2,0,2,0,0.11


In [27]:
listings = listings[["id", "host_id", "room_type", "minimum_nights"]]

In [28]:
listings.head()

Unnamed: 0,id,host_id,room_type,minimum_nights
0,13913,54730,Private room,1
1,15400,60302,Entire home/apt,4
2,17402,67564,Entire home/apt,3
3,24328,41759,Entire home/apt,2
4,33332,144444,Private room,2


### 1.3 Join Tables and Calculate Occupancy Metric

Using Wang et al. (2024) occupancy estimation:
1. Count reviews per listing per year
2. Divide by 0.5 (assume that 1 in 2 stays results in a review)
3. Join to the listings dataset
4. Estimate stay length: either 3 days or minimum nights, whichever is larger
5. Multiply the estimated stay length by the adjusted review count to get estimated occupied nights
6. Cap at 21 nights per month (252 nights per year) - *although technically 2016 was a leap year*

Finally, work out whether each listing's estimated occupancy was over 90 nights or not
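
As a quick sanity check on the steps above, here is a minimal sketch that wraps steps 2 to 6 in one function and runs it on three toy rows. The function name, its parameter names and the `toy` table are mine, for illustration only. The review counts and minimum nights are the 2016 values for listings 13913, 15400 and 36660 shown further down, and "over 90" is read here as strictly more than 90 nights, which is an assumption.

```python
import numpy as np
import pandas as pd

def estimate_occupied_nights(n_reviews, minimum_nights,
                             review_rate=0.5, default_stay=3, cap=21 * 12):
    """Estimate occupied nights per year from an annual review count."""
    stays = np.asarray(n_reviews) / review_rate              # step 2: reviews -> estimated stays
    stay_length = np.maximum(default_stay, minimum_nights)   # step 4: at least 3 nights per stay
    nights = stays * stay_length                             # step 5: stays x nights per stay
    return np.minimum(cap, nights)                           # step 6: cap at 252 nights per year

# Hypothetical toy table, reusing three rows from the real data
toy = pd.DataFrame({"reviews": [1, 15, 52], "minimum_nights": [1, 4, 2]})
toy["estimated_nights"] = estimate_occupied_nights(toy["reviews"], toy["minimum_nights"])
toy["over_90"] = toy["estimated_nights"] > 90  # assumption: strictly more than 90
print(toy)
```

The third row shows the cap at work: 52 reviews become 104 stays of 3 nights, i.e. 312 nights, which is cut back to 252.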

In [29]:
#Step 1: count reviews per listing per year
reviews_annual = reviews.groupby(['listing_id', 'year']).size().unstack(fill_value=0)
reviews_annual.rename(columns={2016: 'reviews_2016', 2017: 'reviews_2017'}, inplace=True)
reviews_annual = reviews_annual.reset_index()
reviews_annual.columns.name = None

reviews_annual.head()

Unnamed: 0,listing_id,reviews_2016,reviews_2017
0,13913,1,1
1,15400,15,7
2,24328,12,0
3,36299,8,7
4,36660,52,63


In [31]:
#Step 2: divide each year by 0.5 (assume that 1 in 2 stays results in a review)
reviews_annual['reviews_2016_adjusted'] = reviews_annual.reviews_2016/0.5
reviews_annual['reviews_2017_adjusted'] = reviews_annual.reviews_2017/0.5

reviews_annual.head()

Unnamed: 0,listing_id,reviews_2016,reviews_2017,reviews_2016_adjusted,reviews_2017_adjusted
0,13913,1,1,2.0,2.0
1,15400,15,7,30.0,14.0
2,24328,12,0,24.0,0.0
3,36299,8,7,16.0,14.0
4,36660,52,63,104.0,126.0


In [32]:
#Step 3: join to the listings dataset
reviews_annual = reviews_annual.merge(listings, how='left', left_on='listing_id', right_on='id').drop(columns = ['id'])
reviews_annual.head()

Unnamed: 0,listing_id,reviews_2016,reviews_2017,reviews_2016_adjusted,reviews_2017_adjusted,host_id,room_type,minimum_nights
0,13913,1,1,2.0,2.0,54730,Private room,1
1,15400,15,7,30.0,14.0,60302,Entire home/apt,4
2,24328,12,0,24.0,0.0,41759,Entire home/apt,2
3,36299,8,7,16.0,14.0,155938,Entire home/apt,3
4,36660,52,63,104.0,126.0,157884,Private room,2


In [37]:
#Checking for nulls
cols_to_check = ["host_id", "room_type", "minimum_nights"]

for col in cols_to_check:
    if reviews_annual[col].isna().sum() == 0:
        print(f'{col}: No nulls')
    else:
        print(f'{col}: Contains nulls')

host_id: No nulls
room_type: No nulls
minimum_nights: No nulls
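
The null check works because a left join leaves `NaN` in the listing columns wherever a review has no matching listing. A slightly stricter version of the same check, sketched below on made-up toy tables (`reviews_toy` and `listings_toy` are hypothetical), uses the `validate` and `indicator` arguments of `merge` to confirm that listing ids are unique and to count unmatched rows directly.

```python
import pandas as pd

# Hypothetical toy tables: listing 3 has reviews but no row in the listings table
reviews_toy = pd.DataFrame({"listing_id": [1, 2, 3], "reviews_2016": [4, 0, 7]})
listings_toy = pd.DataFrame({"id": [1, 2], "room_type": ["Private room", "Entire home/apt"]})

merged = reviews_toy.merge(
    listings_toy,
    how="left",
    left_on="listing_id",
    right_on="id",
    validate="many_to_one",  # raises MergeError if a listing id is duplicated on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Review rows with no matching listing: {unmatched}")
```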


In [38]:
#Step 4: calculating estimated stay length column: greater of either 3 nights or minimum nights
reviews_annual['estimated_stay'] = np.maximum(3, reviews_annual.minimum_nights)

reviews_annual.head()

Unnamed: 0,listing_id,reviews_2016,reviews_2017,reviews_2016_adjusted,reviews_2017_adjusted,host_id,room_type,minimum_nights,estimated_stay
0,13913,1,1,2.0,2.0,54730,Private room,1,3
1,15400,15,7,30.0,14.0,60302,Entire home/apt,4,4
2,24328,12,0,24.0,0.0,41759,Entire home/apt,2,3
3,36299,8,7,16.0,14.0,155938,Entire home/apt,3,3
4,36660,52,63,104.0,126.0,157884,Private room,2,3


In [39]:
#Step 5: estimate occupied nights for each year by multiplying the adjusted review rate by the estimated stay duration
#n.b. this assumes that the minimum nights has not changed over time

reviews_annual['estimated_nights2016'] = reviews_annual['reviews_2016_adjusted'] * reviews_annual.estimated_stay
reviews_annual['estimated_nights2017'] = reviews_annual['reviews_2017_adjusted'] * reviews_annual.estimated_stay

reviews_annual.head()

Unnamed: 0,listing_id,reviews_2016,reviews_2017,reviews_2016_adjusted,reviews_2017_adjusted,host_id,room_type,minimum_nights,estimated_stay,estimated_nights2016,estimated_nights2017
0,13913,1,1,2.0,2.0,54730,Private room,1,3,6.0,6.0
1,15400,15,7,30.0,14.0,60302,Entire home/apt,4,4,120.0,56.0
2,24328,12,0,24.0,0.0,41759,Entire home/apt,2,3,72.0,0.0
3,36299,8,7,16.0,14.0,155938,Entire home/apt,3,3,48.0,42.0
4,36660,52,63,104.0,126.0,157884,Private room,2,3,312.0,378.0


In [41]:
#Step 6: cap at 21 days per month
cap_nights = 12*21 #not changing for a leap year as 1/365 is only 0.002...
reviews_annual['estimated_nights2016_capped'] = np.minimum(cap_nights, reviews_annual['estimated_nights2016'])
reviews_annual['estimated_nights2017_capped'] = np.minimum(cap_nights, reviews_annual['estimated_nights2017'])

reviews_annual.head()

Unnamed: 0,listing_id,reviews_2016,reviews_2017,reviews_2016_adjusted,reviews_2017_adjusted,host_id,room_type,minimum_nights,estimated_stay,estimated_nights2016,estimated_nights2017,estimated_nights2016_capped,estimated_nights2017_capped
0,13913,1,1,2.0,2.0,54730,Private room,1,3,6.0,6.0,6.0,6.0
1,15400,15,7,30.0,14.0,60302,Entire home/apt,4,4,120.0,56.0,120.0,56.0
2,24328,12,0,24.0,0.0,41759,Entire home/apt,2,3,72.0,0.0,72.0,0.0
3,36299,8,7,16.0,14.0,155938,Entire home/apt,3,3,48.0,42.0,48.0,42.0
4,36660,52,63,104.0,126.0,157884,Private room,2,3,312.0,378.0,252.0,252.0


In [65]:
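#Calculating final table: flag whether each listing reached 90 estimated nights in each year, then aggregate by room type
#Only grouping by room type for now, as that is what the research question needs; host_id stays in the table in case I need to look at superhosts later
#Listings with 0 estimated nights in a year are left out of that year's total

#Getting number of over 90s and totals for each category, as this is what the statistical test requires
#n.b. the over90 flag uses >= 90 and BETWEEN is inclusive, so a listing at exactly 90 nights lands in both buckets and is counted twice in the total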

db.register('reviews_annual', reviews_annual)

query = '''
       WITH listings_90 AS (
            SELECT 
            listing_id,
            CASE WHEN room_type = 'Entire home/apt' THEN 'Entire Home' ELSE 'Other' END AS room_type,
            CASE WHEN estimated_nights2016_capped >= 90 THEN 1 ELSE 0 END AS over90_2016,
            CASE WHEN estimated_nights2016_capped BETWEEN 1 AND 90 THEN 1 ELSE 0 END AS under90_2016,
            CASE WHEN estimated_nights2017_capped >= 90 THEN 1 ELSE 0 END AS over90_2017,
            CASE WHEN estimated_nights2017_capped BETWEEN 1 AND 90 THEN 1 ELSE 0 END AS under90_2017
        FROM reviews_annual)
    SELECT
        room_type,
        SUM(over90_2016) AS over90_2016,
        SUM(over90_2016) + SUM(under90_2016) AS total_2016,
        SUM(over90_2017) AS over90_2017,
        SUM(over90_2017)+SUM(under90_2017) AS total_2017
    FROM listings_90
    GROUP BY 1
'''

proportions = db.sql(query).to_df()
proportions.head()

Unnamed: 0,room_type,over90_2016,total_2016,over90_2017,total_2017
0,Entire Home,1173.0,4039.0,1606.0,6072.0
1,Other,934.0,3021.0,1477.0,4441.0
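
One thing to watch in the query above is the boundary at exactly 90 nights, since both `>= 90` and `BETWEEN 1 AND 90` include it. The sketch below shows the same aggregation in pandas with buckets that cannot overlap. The `toy` table is made up, and treating "over 90" as strictly more than 90 is an assumption that differs from the `>= 90` used in the query.

```python
import pandas as pd

# Hypothetical toy data: one inactive listing (0 nights) and one at exactly 90 nights
toy = pd.DataFrame({
    "room_type": ["Entire Home", "Entire Home", "Entire Home", "Other", "Other"],
    "nights": [0, 90, 120, 30, 252],
})

# Keep only listings with at least one estimated night in the year
active = toy[toy["nights"] >= 1]

summary = (
    active.assign(over90=active["nights"] > 90)  # assumption: strictly more than 90
    .groupby("room_type")
    .agg(over90=("over90", "sum"), total=("over90", "count"))
    .reset_index()
)
print(summary)
```

Because the total is a count of active rows rather than the sum of two flags, each listing is counted once regardless of where the threshold sits.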


## 2. Statistical Tests

In [66]:
#Conducting a two proportion z-test for each room type, comparing the proportion of listings at 90+ estimated nights in 2016 against 2017
#Assumes the 2016 and 2017 samples are independent; sample size is well over 10 for each category
#https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html

#H0: for a given room type, the proportion of properties estimated at 90+ nights is the same in 2016 and 2017.
#H1: for a given room type, the proportion of properties estimated at 90+ nights differs between 2016 and 2017.

#Entire Home z-test
count_eh = [proportions[proportions.room_type=='Entire Home'].over90_2016,
               proportions[proportions.room_type=='Entire Home'].over90_2017]
nobs_eh = [proportions[proportions.room_type=='Entire Home'].total_2016,
              proportions[proportions.room_type=='Entire Home'].total_2017]

z_eh, p_eh = sm.stats.proportions_ztest(count_eh, nobs_eh)

#Other z-test
count_other = [proportions[proportions.room_type=='Other'].over90_2016,
               proportions[proportions.room_type=='Other'].over90_2017]
nobs_other = [proportions[proportions.room_type=='Other'].total_2016,
              proportions[proportions.room_type=='Other'].total_2017]

z_other, p_other = sm.stats.proportions_ztest(count_other, nobs_other)

print(f"Z-statistic for 'Other' room type: {z_other}, P-value: {p_other}")
print(f"Z-statistic for 'Entire Home' room type: {z_eh}, P-value: {p_eh}")

#Reject H0 for both at the 5% level: the proportion fell for Entire Home (z > 0) and rose for Other (z < 0)
#But need to 

Z-statistic for 'Other' room type: [-2.1228745], P-value: [0.03376437]
Z-statistic for 'Entire Home' room type: [2.86005442], P-value: [0.00423568]
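
To check what the test is doing, here is a minimal hand-rolled version of the pooled two-proportion z-test, fed with the counts from the proportions table. The function name and the `_manual` variable names are mine. It uses only the standard library and should reproduce the statsmodels figures above.

```python
from math import erfc, sqrt

def two_proportion_ztest(count1, n1, count2, n2):
    """Pooled two-proportion z-test with a two-sided p-value."""
    p1, p2 = count1 / n1, count2 / n2
    pooled = (count1 + count2) / (n1 + n2)                # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # standard error of p1 - p2
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))                      # equals 2 * (1 - Phi(|z|))
    return z, p_value

# 2016 counts first, then 2017, matching the order used in the cell above
z_eh_manual, p_eh_manual = two_proportion_ztest(1173, 4039, 1606, 6072)
z_other_manual, p_other_manual = two_proportion_ztest(934, 3021, 1477, 4441)

print(f"Entire Home: z = {z_eh_manual:.4f}, p = {p_eh_manual:.4f}")
print(f"Other:       z = {z_other_manual:.4f}, p = {p_other_manual:.4f}")
```

Since 2016 is passed first, a positive z means the proportion was higher in 2016 than in 2017.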


## 3. Difference in Differences
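
A first rough sketch of how this could work on the aggregated counts, not a finished method. It takes the change in proportion for Entire Home and subtracts the change for Other, then builds a z-statistic from unpooled variances. The big assumption is that the four groups are independent samples, which probably does not hold as many listings appear in both years. The helper name and variable names are mine.

```python
from math import erfc, sqrt

def prop_and_var(over90, total):
    """Return a proportion and the variance of its estimate."""
    p = over90 / total
    return p, p * (1 - p) / total

# Counts copied from the proportions table in section 1.3
p_eh16, v_eh16 = prop_and_var(1173, 4039)
p_eh17, v_eh17 = prop_and_var(1606, 6072)
p_ot16, v_ot16 = prop_and_var(934, 3021)
p_ot17, v_ot17 = prop_and_var(1477, 4441)

# Difference in differences: change for Entire Home minus change for Other
did = (p_eh17 - p_eh16) - (p_ot17 - p_ot16)

# Assumption: four independent samples, so the variances simply add up
se_did = sqrt(v_eh16 + v_eh17 + v_ot16 + v_ot17)
z_did = did / se_did
p_did = erfc(abs(z_did) / sqrt(2))  # two-sided p-value

print(f"DiD estimate: {did:.4f}, z = {z_did:.3f}, p = {p_did:.4f}")
```

A negative estimate would mean the over-90 proportion for entire homes fell relative to other room types. A regression with a room type x year interaction term would be the more standard route and would also allow clustering by listing.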

## 4. Identifying Duplicates?

Might need to go back and change the listings processing to work out which listings were actually present in 2016 and 2017.