# Data Processing

## Reviews
**To note:**
- I have saved the output to reviews_provisional.csv
- The columns kept are based on my own judgement of which variables I expect we'll need: we can always add others back in (the exact date, for example)
- Note the listings with unrealistic review counts
    - Large outliers seem to be things like hotels, but we don't know whether they are all like this
    - Similarly, we'll probably need a threshold of minimum reviews to measure genuine listings
- Note some weird reviewing activity
    - e.g. people reviewing the same place many times (up to 38 in 2 years!)
    - e.g. people leaving the same (often identical) reviews on multiple listings on the same day
    - When examining individual cases they actually did look realistic, but we don't know whether this is the case for all (e.g. people deliberately reviewing their properties to improve their score)
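
A minimal sketch of how the minimum-review threshold (and an upper bound for hotel-like outliers) could be applied later on. The table `reviews_demo` and the values of `MIN_REVIEWS` and `MAX_REVIEWS` are made up for illustration; we haven't agreed actual thresholds yet.

```python
import pandas as pd

# Toy stand-in for the reviews table (hypothetical values)
reviews_demo = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 3, 3, 3, 3, 3, 3],
    "reviewer_id": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})

MIN_REVIEWS = 2  # hypothetical lower bound for a "genuine" listing
MAX_REVIEWS = 5  # hypothetical upper bound to flag hotel-like outliers

# Count reviews per listing once, then keep listings inside the band
counts = reviews_demo.groupby("listing_id").size()
kept_ids = counts[(counts >= MIN_REVIEWS) & (counts <= MAX_REVIEWS)].index
filtered = reviews_demo[reviews_demo["listing_id"].isin(kept_ids)]

print(f"Listings kept: {list(kept_ids)}")
print(f"Reviews kept: {len(filtered)} of {len(reviews_demo)}")
```

Keeping the thresholds as named constants should make it easy to rerun the filter once we settle on values.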

In [75]:
#Loading packages
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.cm as cm
import matplotlib.pyplot as plt 
import os
print(os.getcwd())

/home/jovyan/work/FSDS/CASA0013_FSDS_Airbnb-data-analytics/Documentation


In [2]:
reviews = pd.read_csv("data/reviews.csv.gz")
reviews.head() #Loaded successfully

BadGzipFile: Not a gzipped file (b've')

In [None]:
#Sorting data type
print("Data types before transformation:")
print(reviews.info())
#Looks like the date column is a string (e.g. "2010-08-18")- let's turn this into a datetime64 series
reviews["date"] = pd.to_datetime(reviews["date"], format="%Y-%m-%d")
#Other data types look okay

print("\nData types after transformation:")
reviews.info()

In [None]:
#Splitting dates for last 24 months
print(f"Latest review date: {reviews.date.max()}.")
max_date = reviews.date.max()
cutoff_2023 = max_date.replace(year=max_date.year - 1)
cutoff_date = max_date.replace(year=max_date.year - 2)
print(f"This means the cutoff date for reviews is: {cutoff_date}.")

#Filter for only reviews from the last 24 months
reviews = reviews[reviews["date"] > cutoff_date].copy() #.copy() so the new column below is added to an independent DataFrame, not a slice

#Add column with year data (ChatGPT helped)
reviews["year"] = pd.cut(reviews["date"],
                        bins=[cutoff_date, cutoff_2023, max_date],
                        labels=["2022-2023", "2023-2024"],
                        right=True)
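
One edge case worth noting: `Timestamp.replace(year=...)` raises a `ValueError` if the latest review date happens to be 29 February, because the target year has no such day. Below is a minimal sketch of an alternative using `pd.DateOffset`, which clips to the last valid day of the month instead. `max_date_demo` is a made-up date for illustration, not the value in our data.

```python
import pandas as pd

# Hypothetical latest review date that falls on a leap day
max_date_demo = pd.Timestamp("2024-02-29")

# The replace() approach fails here because 2023-02-29 does not exist
try:
    max_date_demo.replace(year=max_date_demo.year - 1)
    replace_failed = False
except ValueError:
    replace_failed = True

# DateOffset clips to the last valid day of the month instead of raising
cutoff_1y = max_date_demo - pd.DateOffset(years=1)
cutoff_2y = max_date_demo - pd.DateOffset(years=2)

print(f"replace() raised an error: {replace_failed}")
print(f"One-year cutoff: {cutoff_1y.date()}, two-year cutoff: {cutoff_2y.date()}")
```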

In [None]:
#Checking this
reviews.sample(10) #Looks good

In [None]:
#Do any have null values?
reviews[reviews.year.isna()] #No - looks like transformation went well

In [None]:
#Look for nulls
reviews.count() #some have no comments, one has no reviewer name

In [None]:
#Row with no reviewer name:
reviews[reviews.reviewer_name.isna()] #Looks fine

In [None]:
#Rows with no comments
reviews[reviews.comments.isna()] #presume these are also fine to keep as people will leave scores; so it's not weird behaviour
#we will be looking at counts anyway, rather than review content

In [None]:
#Check for any listings with unrealistic review counts
print("Review count per listing (over 731 days):")
print(reviews.groupby('listing_id').size().describe())
#some listings with only 1, some with over 890 - more than 1 per day!!

#Not going to remove these for now as we'll consider this more robustly when calculating occupancy metrics
print(f"\nNumber of listings with more than 731 reviews: {(reviews.groupby('listing_id').size() > 731).sum()} - this is impossible!")
print(f"Number of listings with more than 365 reviews: {(reviews.groupby('listing_id').size() > 365).sum()} - apparently 1 every 2 days")
print(f"Number of listings with more than 243 reviews: {(reviews.groupby('listing_id').size() > 243).sum()} - apparently 1 every 3 days")

print("\nListing with most reviews:")
reviews[reviews.listing_id==reviews.groupby('listing_id').size().idxmax()]
#Looks like a big hotel - though odd that it has the same listing_id
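
The threshold checks above can be sketched more compactly by counting once and deriving each threshold from the window length (for our window, 731 // 2 = 365 and 731 // 3 = 243). The toy table and `WINDOW_DAYS = 8` below are made up so the result is easy to check by hand.

```python
import pandas as pd

# Toy stand-in for the filtered reviews table (hypothetical values)
reviews_demo = pd.DataFrame({"listing_id": [1] * 8 + [2] * 4 + [3] * 2 + [4]})

WINDOW_DAYS = 8  # hypothetical window length; the notebook's window is 731 days

# Count reviews per listing once and reuse the result
counts = reviews_demo.groupby("listing_id").size()

summary = {}
for days_per_review in (1, 2, 4):
    # A listing reviewed once every N days can have at most WINDOW_DAYS // N reviews
    threshold = WINDOW_DAYS // days_per_review
    summary[days_per_review] = int((counts > threshold).sum())
    print(f"More than {threshold} reviews (1 every {days_per_review} days): {summary[days_per_review]}")
```

Deriving the thresholds from one constant avoids the number in the message and the number in the comparison drifting apart.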

The listing with the most reviews over this period is a [hotel](https://www.airbnb.co.uk/rooms/47408549?source_impression_id=p3_1732634604_P3_JbW_THpmpjHlw): the same listing can be rented out by multiple people at once.

In [None]:
#Check for duplicates
reviews[reviews.duplicated(keep=False)] #no exact duplicates

In [None]:
#Have any individuals commented on the same listing twice?
print("Some individuals have reviewed the same listing more than once:")
multiple_reviews = reviews.groupby(['reviewer_id', 'listing_id']).size().reset_index(name='reviews_per_commenter').query('reviews_per_commenter > 1').sort_values(by='reviews_per_commenter', ascending=False)
print(f"\nMean number of multiple reviews: {multiple_reviews.reviews_per_commenter.mean():.2f}\n")
multiple_reviews #Lots of situations where the same reviewer has commented on the same listing

#Have looked into this manually and actually it looks fairly realistic: friendships forming between hosts and holidaymakers
#One individual stayed at the same property 38 times in 2 years which seems excessive
#But they do genuinely seem to be friends

#And the mean is fairly low - not too many multiples

In [None]:
#Are different people (e.g. business owners) deliberately doing this? Or is it the same people consistently doing this?
multiple_reviews.groupby('reviewer_id').size().sort_values(ascending=False)
#Looked into some of these manually and they look okay - probably just people who travel a lot

In [None]:
#Has the same person left multiple reviews on the same day?
unlikely_reviews = reviews.groupby(['reviewer_id', 'date']).size().reset_index(name='reviews_on_day').query('reviews_on_day > 1').sort_values(by='reviews_on_day', ascending=False)
unlikely_reviews
#This has happened 1142 times!
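
To separate the "identical review text on several listings on the same day" cases from ordinary same-day reviews, something like the following might work. It is only a sketch on made-up data (`reviews_demo`), and it has to run before the `comments` column is dropped.

```python
import pandas as pd

# Toy stand-in for the reviews table (hypothetical values)
reviews_demo = pd.DataFrame({
    "listing_id": [1, 2, 3, 4, 5],
    "reviewer_id": [10, 10, 10, 11, 11],
    "date": pd.to_datetime(
        ["2023-05-01", "2023-05-01", "2023-05-01", "2023-05-01", "2023-05-02"]
    ),
    "comments": ["Great stay!", "Great stay!", "Lovely host", "Great stay!", "Great stay!"],
})

# Ignore reviews without text so that empty comments are not matched to each other
with_text = reviews_demo.dropna(subset=["comments"])

# Flag rows where the same reviewer left the same text on the same day
same_text = with_text[
    with_text.duplicated(subset=["reviewer_id", "date", "comments"], keep=False)
]

print(same_text[["reviewer_id", "date", "listing_id", "comments"]])
```

Here only reviewer 10's two "Great stay!" reviews are flagged; reviewer 11 used the same text but on different days.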

In [None]:
#Some manual checking
pd.set_option('display.max_colwidth', None)
#Checked the top two and bottom 1 manually - they all look realistic actually
#reviews[
#    (reviews.reviewer_id==113858886) &
#    (reviews.date=="2022-11-13")
#]
pd.reset_option('display.max_colwidth')

In [None]:
#Drop unnecessary columns: date, ...
reviews.drop(columns = ["id", "date", "reviewer_name", "comments"], inplace=True)
reviews.head()

In [None]:
#Output it to data folder
reviews.to_csv("data/reviews_provisional.csv", index=False)

## Moving on to cleaning listings.csv

In [109]:
# Reading in listings csv
listings_raw = pd.read_csv("data/listings.csv.gz")
print(f"The listings dataset is {listings_raw.shape[0]} rows x {listings_raw.shape[1]} columns.")

# Taking a peek
listings_raw.head(n=5)

The listings dataset is 96182 rows x 75 columns.


| | id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13913 | https://www.airbnb.com/rooms/13913 | 20240906025501 | 2024-09-06 | city scrape | Holiday London DB Room Let-on going | My bright double bedroom with a large window h... | Finsbury Park is a friendly melting pot commun... | https://a0.muscache.com/pictures/miso/Hosting-... | 54730 | ... | 4.84 | 4.72 | 4.72 | | f | 3 | 2 | 1 | 0 | 0.26 |
| 1 | 15400 | https://www.airbnb.com/rooms/15400 | 20240906025501 | 2024-09-07 | city scrape | Bright Chelsea Apartment. Chelsea! | Lots of windows and light. St Luke's Gardens ... | It is Chelsea. | https://a0.muscache.com/pictures/428392/462d26... | 60302 | ... | 4.84 | 4.93 | 4.75 | | f | 1 | 1 | 0 | 0 | 0.54 |
| 2 | 17402 | https://www.airbnb.com/rooms/17402 | 20240906025501 | 2024-09-07 | city scrape | Fab 3-Bed/2 Bath & Wifi: Trendy W1 | You'll have a great time in this beautiful, cl... | Fitzrovia is a very desirable trendy, arty and... | https://a0.muscache.com/pictures/39d5309d-fba7... | 67564 | ... | 4.72 | 4.89 | 4.61 | | f | 6 | 6 | 0 | 0 | 0.34 |
| 3 | 24328 | https://www.airbnb.com/rooms/24328 | 20240906025501 | 2024-09-07 | city scrape | Battersea live/work artist house, garden & par... | Artist house, bright high ceiling rooms for bo... | - Battersea is a quiet family area, easy acces... | https://a0.muscache.com/pictures/9194b40f-c627... | 41759 | ... | 4.93 | 4.59 | 4.65 | | f | 1 | 1 | 0 | 0 | 0.56 |
| 4 | 33332 | https://www.airbnb.com/rooms/33332 | 20240906025501 | 2024-09-06 | city scrape | Beautiful Ensuite Richmond-upon-Thames borough | Walking distance to Twickenham Stadium, 35 min... | Peaceful and friendly. | https://a0.muscache.com/pictures/miso/Hosting-... | 144444 | ... | 4.5 | 4.67 | 4.22 | | f | 2 | 0 | 2 | 0 | 0.11 |


In [4]:
# Looking at column names
print(listings_raw.columns.to_list())

['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availabil

In [118]:
# Setting aside some columns that don't seem useful for analysis
# Including NaN-only columns like 'license' and 'neighbourhood_group_cleansed'.
exclude_cols = ['scrape_id', 'source', 'picture_url', 'host_url', 
                'host_response_time', 'host_response_rate', 'host_thumbnail_url', 
                'host_picture_url', 'host_has_profile_pic', 'neighbourhood', 
                'neighbourhood_group_cleansed', 'calendar_last_scraped', 
                'calendar_updated', 'has_availability', 'minimum_minimum_nights', 
                'maximum_maximum_nights', 'minimum_maximum_nights', 
                'maximum_minimum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
                'license', 'instant_bookable', ]

# Also checking for duplicate rows
duplicate_count = listings_raw.duplicated().sum()
print(duplicate_count) # Great, there are none

# Dropping these columns from listings_raw
listings = listings_raw.drop(columns = exclude_cols)
print(f"After dropping {len(exclude_cols)} less relevant columns, the listings dataset is {listings.shape[0]} rows x {listings.shape[1]} columns.")

0
After dropping 22 less relevant columns, the listings dataset is 96182 rows x 53 columns.


In [6]:
# Looking at listings columns and data types
listings.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96182 entries, 0 to 96181
Data columns (total 53 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            96182 non-null  int64  
 1   listing_url                                   96182 non-null  object 
 2   last_scraped                                  96182 non-null  object 
 3   name                                          96182 non-null  object 
 4   description                                   92763 non-null  object 
 5   neighborhood_overview                         45661 non-null  object 
 6   host_id                                       96182 non-null  int64  
 7   host_name                                     96177 non-null  object 
 8   host_since                                    96177 non-null  object 
 9   host_location                                 73913 non-null 

In [19]:
# Viewing null counts by column
listings.isnull().sum(axis=0).sort_values(ascending=False)

neighborhood_overview                           50521
host_neighbourhood                              48557
host_about                                      47147
beds                                            33141
bathrooms                                       33037
price                                           32977
host_acceptance_rate                            27226
review_scores_location                          24584
review_scores_value                             24583
review_scores_checkin                           24583
review_scores_communication                     24557
review_scores_accuracy                          24551
review_scores_cleanliness                       24543
first_review                                    24533
reviews_per_month                               24533
last_review                                     24533
review_scores_rating                            24533
host_location                                   22269
bedrooms                    
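
The raw null counts are easier to compare as a share of rows. A small sketch, using a made-up `listings_demo` table and a hypothetical `NULL_SHARE_LIMIT`, of how mostly-empty columns could be shortlisted for review (the columns actually dropped below were chosen by hand, not by this rule):

```python
import pandas as pd

# Toy stand-in for the listings table (hypothetical values)
listings_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$50.00", None, "$72.00", None],
    "host_about": [None, None, None, "Hi!"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Hotel room"],
})

NULL_SHARE_LIMIT = 0.6  # hypothetical cut-off for "mostly empty"

# isna().mean() gives the fraction of missing values in each column
null_share = listings_demo.isna().mean().sort_values(ascending=False)
mostly_empty = null_share[null_share > NULL_SHARE_LIMIT].index.tolist()

print(null_share)
print(f"Columns above the limit: {mostly_empty}")
```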

In [130]:
# I've used this cell as a tester cell to see what some of these variables look like.
# Based on this, I'm dropping more columns that don't seem useful.
listings['calculated_host_listings_count_entire_homes'].sample(3)

85179    1
79759    0
42857    7
Name: calculated_host_listings_count_entire_homes, dtype: int64

In [136]:
listings.drop(columns = ['description','neighborhood_overview', 'host_neighbourhood', 'host_about',
                      'beds', 'bathrooms', 'host_acceptance_rate', 'review_scores_location',
                      'review_scores_value', 'review_scores_checkin', 'review_scores_communication',
                      'review_scores_accuracy', 'review_scores_cleanliness', 'host_location',
                      'bedrooms', 'bathrooms_text'], inplace = True)
print(f"listings now has {listings.shape[0]} rows x {listings.shape[1]} columns.")

listings now has 96182 rows x 37 columns.


### Converting data types appropriately

In [99]:
# Converting true/false-response columns from string to Boolean
bools = ['host_is_superhost', 'host_identity_verified']
print(f"Before converting, 'host_is_superhost' is of type {listings['host_is_superhost'].dtype}.")


for b in bools:
    print(f"Converting {b}")
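    # Caution: astype('bool') treats missing values (NaN) as True, so hosts with no value recorded come out as True
    listings[b] = listings[b].replace({'f':False, 't':True}).astype('bool')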


Before converting, 'host_is_superhost' is of type object.
Converting host_is_superhost
Converting host_identity_verified


In [100]:
# Checking to see if it worked
print(f"After converting, 'host_is_superhost' is of type {listings['host_is_superhost'].dtype}.")
listings['host_identity_verified'].sample(3)

After converting, 'host_is_superhost' is of type bool.


83338    True
28887    True
7621     True
Name: host_identity_verified, dtype: bool
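
One thing to be aware of with this conversion: `astype('bool')` turns missing values into `True`. If we want to keep them as missing, pandas' nullable `'boolean'` dtype is one option. A minimal sketch on a made-up series (`flags`), not a change to the cell above:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical values)
flags = pd.Series(["t", "f", np.nan, "t"], dtype="object")

# Map the strings to Python booleans; the missing value stays NaN
mapped = flags.map({"t": True, "f": False})

# Plain bool: the missing value silently becomes True
as_bool = mapped.astype("bool")

# Nullable boolean: the missing value is kept as <NA>
as_nullable = mapped.astype("boolean")

print(as_bool.tolist())
print(as_nullable)
```

The trade-off is that the output dtype reads `boolean` rather than `bool`, and any later filtering has to decide explicitly what to do with `<NA>`.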

In [102]:
# Converting date-response columns to date type
dates = ['host_since','first_review','last_review']
print(f"Before converting, 'host_since' is of type {listings['host_since'].dtype}.\n")

for d in dates:
    print("Converting " + d)
    listings[d] = pd.to_datetime(listings[d])
    

Before converting, 'host_since' is of type datetime64[ns].

Converting host_since
Converting first_review
Converting last_review


In [103]:
# Checking to see if it worked
print(f"After converting, 'host_since' is of type {listings['host_since'].dtype}.\n")
listings['first_review'].sample(3)

After converting, 'host_since' is of type datetime64[ns].



19095   2018-01-02
87017          NaT
19368   2017-11-22
Name: first_review, dtype: datetime64[ns]

In [111]:
# Converting categorical variables
categories = ['property_type', 'room_type', 'neighbourhood_cleansed',]
print(f"Before converting, 'room_type' is of type {listings['room_type'].dtype}.")

for c in categories:
    print("Converting " + c)
    listings[c] = listings[c].astype('category')

print(f"After converting, 'room_type' is of type {listings['room_type'].dtype}.\n")
listings['room_type'].sample(3)

Before converting, 'room_type' is of type object.
Converting property_type
Converting room_type
Converting neighbourhood_cleansed
After converting, 'room_type' is of type category.



28392    Entire home/apt
44961    Entire home/apt
14152       Private room
Name: room_type, dtype: category
Categories (4, object): ['Entire home/apt', 'Hotel room', 'Private room', 'Shared room']

In [119]:
# Converting float variables
# Looks like just 'price' for now, which is originally a string (object dtype)
floats = ['price']
print(f"Before converting, 'price' is of type {listings['price'].dtype}.")

# Also removing '$' so we're left with the float
for f in floats:
    print("Converting " + f)
    listings[f] = listings[f].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('float').round(2)
print(f"After converting, 'price' is of type {listings['price'].dtype}.\n")
listings['price'].sample(3)

Before converting, 'price' is of type object.
Converting price
After converting, 'price' is of type float64.



8168      NaN
34217    53.0
56730    72.0
Name: price, dtype: float64
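
If any price strings turn out not to follow the `$1,234.00` pattern, `astype('float')` will raise an error. A slightly more forgiving sketch uses `pd.to_numeric(..., errors="coerce")`, which turns anything unparseable into `NaN`. The series `price_raw` is made up, including the `"n/a"` entry, which I haven't seen in our data.

```python
import pandas as pd

# Toy price column (hypothetical values, including one malformed entry)
price_raw = pd.Series(["$1,250.00", "$53.00", None, "n/a"])

# Strip '$' and ',' in one pass, then coerce anything unparseable to NaN
price = pd.to_numeric(
    price_raw.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

print(price)
```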

In [134]:
# Converting integer variables
# Looks like just these for now, which are originally floats
ints = ['host_listings_count', 'host_total_listings_count']
print(f"Before converting, 'host_listings_count' is of type {listings['host_listings_count'].dtype}.")

# Note: I'm also putting in 0 where the cell is empty (so a missing count is recorded as zero)
for i in ints:
    print(f"Converting {i}")
    listings[i] = listings[i].fillna(0).astype('int')

print(f"After converting, 'host_listings_count' is of type {listings['host_listings_count'].dtype}.\n")
listings['host_listings_count'].sample(3)    

Before converting, 'host_listings_count' is of type float64.
Converting host_listings_count
Converting host_total_listings_count
After converting, 'host_listings_count' is of type int64.



40357    4
45208    5
53308    6
Name: host_listings_count, dtype: int64
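
Filling with 0 makes a missing count look like a host with no listings. If that distinction matters later, pandas' nullable `Int64` dtype keeps the gap visible. A small sketch on a made-up series (`counts_raw`):

```python
import numpy as np
import pandas as pd

# Toy float column with one missing value (hypothetical values)
counts_raw = pd.Series([4.0, np.nan, 6.0])

# The approach used above: the missing value becomes 0
filled = counts_raw.fillna(0).astype("int")

# Nullable integer dtype: the missing value stays <NA>
nullable = counts_raw.astype("Int64")

print(filled.tolist())
print(nullable)
```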

### Saving a copy of the listings dataset as it's been cleaned thus far...

In [137]:
path = os.path.join('data','clean')

if not os.path.exists(path):
    print(f"Creating {path} under {os.getcwd()}")
    os.makedirs(path)

else: 
    print("Folder for clean data found.")
    
listings.to_csv(os.path.join(path, "listings_provisonal.csv"), index=False)
print("Provisionally cleaned listings dataset added to data/clean.")

Folder for clean data found.
Provisionally cleaned listings dataset added to data/clean.
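
The folder check above could also be written with `pathlib`, where `mkdir(parents=True, exist_ok=True)` does nothing if the folder already exists. A sketch that runs inside a temporary directory so it doesn't touch our real data folder; `example.csv` is a placeholder file name.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    clean_dir = Path(tmp) / "data" / "clean"

    # Creates data/ and data/clean/ if needed
    clean_dir.mkdir(parents=True, exist_ok=True)
    # Calling it again is harmless because of exist_ok=True
    clean_dir.mkdir(parents=True, exist_ok=True)

    out_path = clean_dir / "example.csv"  # placeholder file name
    out_path.write_text("id,price\n1,53.0\n")
    created = out_path.exists()

print(f"File written: {created}")
```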


### Creating a GDF from listings 

In [138]:
geolistings = gpd.GeoDataFrame(listings, 
                               geometry = gpd.points_from_xy(listings.longitude, 
                                                             listings.latitude,
                                                             crs ='epsg:4326'))

### Investigating hosts with multiple listings


In [18]:
# First, taking a look at the variables in our dataset that look relevant
listings[["calculated_host_listings_count", "host_listings_count", "host_total_listings_count"]].sample(3)

| | calculated_host_listings_count | host_listings_count | host_total_listings_count |
|---|---|---|---|
| 74311 | 259 | 268.0 | 697.0 |
| 50620 | 2 | 2.0 | 2.0 |
| 874 | 2 | 2.0 | 2.0 |


We can see that these numbers aren't always the same across "calculated_host_listings_count", "host_listings_count", and "host_total_listings_count"...

calculated_host_listings_count and host_listings_count are often not too far apart, but host_total_listings_count sometimes displays numbers that are quite different. I'm not sure why - data dictionary variable descriptors don't provide much clarity on this.
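
One way to probe this would be to count how many listings each host actually has in our dataset and see which column agrees with that count most often. This is a sketch on made-up data (`listings_demo`); my working assumption, not confirmed by the data dictionary, is that `calculated_host_listings_count` reflects listings within this scrape.

```python
import pandas as pd

# Toy stand-in for the listings table (hypothetical values)
listings_demo = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "host_id": [1, 1, 2, 3],
    "calculated_host_listings_count": [2, 2, 1, 1],
    "host_listings_count": [2, 2, 1, 5],
    "host_total_listings_count": [2, 2, 3, 9],
})

# Number of rows per host in this dataset, repeated on each of the host's rows
listings_demo["listings_in_data"] = (
    listings_demo.groupby("host_id")["id"].transform("size")
)

count_cols = [
    "calculated_host_listings_count",
    "host_listings_count",
    "host_total_listings_count",
]

# Share of rows where each column matches the count observed in the data
agreement = {
    col: float((listings_demo[col] == listings_demo["listings_in_data"]).mean())
    for col in count_cols
}
print(agreement)
```

On the real data, a high agreement share for one column would suggest it is the one derived from the scrape itself.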