# Change in Occupancy Analysis
This notebook follows the method detailed in Wang et al. (2024), using review counts to estimate changes in Airbnb occupancy between 2022 and 2023. The estimates are then analysed spatially.

**Stages to the workflow:**
1. Data processing
2. Occupancy metric calculation
3. Spatial autocorrelation and identification of clusters (if any)
4. Regression analysis (depending on the results of stage 3)

In [30]:
#Loading packages
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt 
import os
print(os.getcwd())

/home/jovyan/work/CASA0013 - Foundations of Spatial Data Science/CASA0013_FSDS_Airbnb-data-analytics/Documentation


## 1) Data Processing

In [31]:
#Load in both datasets
listings_url = "data/clean/listings_provisonal.csv"
reviews_url = "data/clean/reviews_provisional.csv"

listings = pd.read_csv(listings_url)
reviews = pd.read_csv(reviews_url)

In [32]:
print(listings.columns.to_list())

['id', 'listing_url', 'last_scraped', 'name', 'host_id', 'host_name', 'host_since', 'host_is_superhost', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_identity_verified', 'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review', 'last_review', 'review_scores_rating', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'reviews_per_month']


In [33]:
#Keep id, first review, last review, reviews per month (though I'll probably recalculate this), minimum nights
listings = listings[["id", "first_review", "last_review", "reviews_per_month", "minimum_nights"]]
listings.head()

      id first_review last_review  reviews_per_month  minimum_nights
0  13913   2010-08-18  2024-07-10               0.26               1
1  15400   2009-12-21  2024-04-28               0.54               4
2  17402   2011-03-21  2024-02-19               0.34               3
3  24328   2010-11-15  2022-07-19               0.56               2
4  33332   2010-10-16  2022-08-01               0.11               2


In [34]:
#Filter out listings with implausibly high review counts

#While processing the data I saw that some listings have more than 1 review per night
#The counts look genuine, as these are hotels rather than short-term lets (STLs)
#But they probably need to be excluded here, as they can't be compared with other listings in the same way

reviews.listing_id.value_counts()

listing_id
47408549               890
43120947               785
30760930               615
47438714               497
51738677               437
                      ... 
1228824289813508892      1
1228835271290751437      1
1232581373083249751      1
1232626920155165386      1
1232764003769397209      1
Name: count, Length: 55598, dtype: int64

In [35]:
#Check for listings with more than 365 reviews in total
#(over 1 review every 2 days, assuming the reviews span the two years 2022-2023)
reviews['listing_id'].value_counts()[lambda x: x > 365] #Only affects 8 listings

listing_id
47408549    890
43120947    785
30760930    615
47438714    497
51738677    437
45006692    429
51995470    427
26766870    400
Name: count, dtype: int64
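
Whether 365 reviews really corresponds to one review every two days depends on how long a period the reviews cover. A minimal sketch of a rate-based version of the same check, on toy data: the listing IDs, dates, and the 0.5 reviews-per-day threshold are made up for illustration, and the `date` column is assumed to hold ISO-formatted strings.

```python
import pandas as pd

# Toy stand-in for the reviews table; IDs and dates are made up
toy_reviews = pd.DataFrame({
    "listing_id": [1, 1, 1, 1, 2, 2],
    "date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04",
             "2022-01-01", "2022-01-04"],
})
toy_reviews["date"] = pd.to_datetime(toy_reviews["date"])

# Length of the observation window in days (inclusive of both end dates)
window_days = (toy_reviews["date"].max() - toy_reviews["date"].min()).days + 1

# Average reviews per day for each listing over the whole window
reviews_per_day = toy_reviews["listing_id"].value_counts() / window_days

# ASSUMPTION: flag listings averaging more than one review every two days
high_volume = reviews_per_day[reviews_per_day > 0.5]
print(high_volume)
```

Expressing the cutoff as a rate keeps it meaningful if the review window changes, for example when moving from calendar years to 12-month intervals.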

In [36]:
#Remove these listings for now; the 365-review cutoff is an assumption that may need double-checking
reviews = reviews[reviews.groupby('listing_id')['listing_id'].transform('count') <= 365]
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 946039 entries, 0 to 950518
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   listing_id   946039 non-null  int64 
 1   date         946039 non-null  object
 2   reviewer_id  946039 non-null  int64 
 3   year         946039 non-null  object
dtypes: int64(2), object(2)
memory usage: 36.1+ MB
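
The `info()` output shows `date` and `year` stored as `object` columns. Counting reviews per year or per period is easier once `date` is a proper datetime. A minimal sketch on toy data, assuming the dates are `YYYY-MM-DD` strings (the toy rows are made up, and the format is an assumption rather than something checked against the reviews file):

```python
import pandas as pd

# Toy stand-in for the reviews table; values are made up
toy_reviews = pd.DataFrame({
    "listing_id": [13913, 13913, 15400],
    "date": ["2022-03-01", "2023-07-15", "2023-01-20"],
})

# Parse the strings into datetime64; malformed dates raise an error by default
toy_reviews["date"] = pd.to_datetime(toy_reviews["date"], format="%Y-%m-%d")

# Derive an integer year from the parsed date
toy_reviews["year"] = toy_reviews["date"].dt.year

print(toy_reviews.dtypes)
```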


In [3]:
#To do:
#Work out how to operationalise the Wang et al. (2024) equation
#Join minimum stay length from the listings data
#Find total occupancy per period per listing
#Aggregate by LSOA? Or average by LSOA?
#Summary stats

#If time: calculate this over 12-month intervals rather than calendar years, as using the most recent data seems more sensible
#Not a priority if the others aren't convinced
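
The steps listed above might be sketched as follows on toy data. This is not the Wang et al. (2024) equation, which is not reproduced in this notebook. It is a generic review-based proxy in which each review stands for one stay. `REVIEW_RATE`, the use of `minimum_nights` as the stay length, the 365-night cap, and all toy values are illustrative assumptions to be replaced once the actual equation is pinned down.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are made up
toy_listings = pd.DataFrame({"id": [1, 2], "minimum_nights": [2, 3]})
toy_reviews = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2, 2],
    "year": [2022, 2023, 2023, 2022, 2022, 2022, 2023],
})

REVIEW_RATE = 0.5    # ASSUMPTION: share of stays that leave a review
DAYS_PER_YEAR = 365

# Count reviews per listing per year, with one column per year
counts = toy_reviews.groupby(["listing_id", "year"]).size().unstack(fill_value=0)
counts.columns = [f"reviews_{year}" for year in counts.columns]

# Join minimum stay length from the listings table (index-on-index join)
counts = counts.join(toy_listings.set_index("id")["minimum_nights"])

# Estimated nights = reviews / review rate * stay length, capped at a full year
for year in (2022, 2023):
    nights = counts[f"reviews_{year}"] / REVIEW_RATE * counts["minimum_nights"]
    counts[f"occ_{year}"] = nights.clip(upper=DAYS_PER_YEAR) / DAYS_PER_YEAR

# Change in estimated occupancy between the two years
counts["occ_change"] = counts["occ_2023"] - counts["occ_2022"]
print(counts[["occ_2022", "occ_2023", "occ_change"]])
```

Aggregating to LSOA would then be a `groupby` on whichever area code ends up joined to the listings, using either a sum or a mean depending on the answer to the aggregation question above.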