# Data Processing

## Reviews
**To note:**
- I have saved this to reviews_provisional.csv
- This is just using my own judgement, and based on what variables I expect we'll need: we can always add the exact date back in, for example
- Note the listings with unrealistic review counts
    - Large outliers seem to be things like hotels, but we don't know whether they are all like this
    - Similarly, we'll probably need a threshold of minimum reviews to measure genuine listings
- Note some weird reviewing activity
    - e.g. people reviewing the same place many times (up to 39 since the start of 2022!)
    - e.g. people leaving the same (often identical) reviews on multiple listings on the same day
    - When examining individual cases they did look realistic, but we don't know whether this holds for all of them (some could be, e.g., people deliberately reviewing their own properties to improve their score)
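
The review-count thresholds mentioned in the notes above could be applied along these lines. This is only a sketch: `min_reviews` and `max_reviews` are hypothetical cut-offs picked for illustration, not values we have agreed on, and `toy_reviews` is made-up data standing in for the cleaned reviews table.

```python
import pandas as pd

# Toy stand-in for the cleaned reviews table (one row per review)
toy_reviews = pd.DataFrame({
    "listing_id": [1] * 2 + [2] * 5 + [3] * 40,
    "date": pd.to_datetime(["2024-01-01"] * 47),
})

# Hypothetical thresholds (assumptions, still to be agreed on)
min_reviews = 3
max_reviews = 30

# Count reviews per listing, then keep only listings inside the band
counts = toy_reviews.groupby("listing_id").size()
in_band = counts[(counts >= min_reviews) & (counts <= max_reviews)]
kept = toy_reviews[toy_reviews["listing_id"].isin(in_band.index)]
```

Here listing 1 falls below the minimum and listing 3 is treated as a hotel-like outlier, so only listing 2 is kept.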

In [2]:
#Loading packages for data loading
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.cm as cm
import matplotlib.pyplot as plt 
import os
import datetime as dt
print(os.getcwd())

/home/jovyan/work/Documentation


In [15]:
reviews = pd.read_csv("data/reviews.csv.gz")
reviews.head() #Loaded successfully

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,13913,80770,2010-08-18,177109,Michael,My girlfriend and I hadn't known Alina before ...
1,13913,367568,2011-07-11,19835707,Mathias,Alina was a really good host. The flat is clea...
2,13913,529579,2011-09-13,1110304,Kristin,Alina is an amazing host. She made me feel rig...
3,13913,595481,2011-10-03,1216358,Camilla,"Alina's place is so nice, the room is big and ..."
4,13913,612947,2011-10-09,490840,Jorik,"Nice location in Islington area, good for shor..."


In [16]:
#Sorting data type
print("Data types before transformation:")
print(reviews.info())
#Looks like the date column is a string (e.g. "2010-08-18")- let's turn this into a datetime64 series
reviews["date"] = pd.to_datetime(reviews["date"], format="%Y-%m-%d")
#Other data types look okay

print("\nData types after transformation:")
reviews.info()

Data types before transformation:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1887519 entries, 0 to 1887518
Data columns (total 6 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   listing_id     int64 
 1   id             int64 
 2   date           object
 3   reviewer_id    int64 
 4   reviewer_name  object
 5   comments       object
dtypes: int64(3), object(3)
memory usage: 86.4+ MB
None

Data types after transformation:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1887519 entries, 0 to 1887518
Data columns (total 6 columns):
 #   Column         Dtype         
---  ------         -----         
 0   listing_id     int64         
 1   id             int64         
 2   date           datetime64[ns]
 3   reviewer_id    int64         
 4   reviewer_name  object        
 5   comments       object        
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 86.4+ MB


In [21]:
#Splitting dates for last 24 months
print(f"Latest review date: {reviews.date.max()}.")
max_date = reviews.date.max()
cutoff_2023 = max_date.replace(year=max_date.year - 1)
cutoff_date = max_date.replace(year=max_date.year - 2)
print(f"This means the cutoff date for reviews is: {cutoff_date}.")

#Filter for only reviews from the last 24 months
#Update: changing this to include all reviews from 2022 onwards (if we are doing 2022/2023 rather than 12 month series)
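reviews = reviews[reviews["date"] >= "2022-01-01"].copy() #ISO date string is unambiguous; .copy() avoids a SettingWithCopyWarning when adding the column below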
min_date = reviews.date.min()- dt.timedelta(days=1)

#Add column with year data (ChatGPT helped)
reviews["year_category"] = pd.cut(reviews["date"],
                        bins=[min_date, cutoff_date, cutoff_2023, max_date],
                        labels=["Pre-Cutoff", "2022-2023", "2023-2024"],
                        right=True)

Latest review date: 2024-09-10 00:00:00.
This means the cutoff date for reviews is: 2022-09-10 00:00:00.
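
To see how the bin edges behave, here is a small sketch of the same `pd.cut` call on made-up dates (the five dates are assumptions chosen to sit on or next to the bin edges). With `right=True` each bin includes its right edge and excludes its left edge, which is why the earliest date has to be pushed back by a day to stay inside the first bin.

```python
import datetime as dt

import pandas as pd

# Made-up dates sitting on or next to the bin edges
dates = pd.Series(pd.to_datetime(
    ["2022-01-01", "2022-09-10", "2022-09-11", "2023-09-10", "2024-09-10"]
))

max_date = dates.max()
cutoff_1y = max_date.replace(year=max_date.year - 1)  # 2023-09-10
cutoff_2y = max_date.replace(year=max_date.year - 2)  # 2022-09-10
# Left edges are exclusive, so start one day before the earliest date
min_date = dates.min() - dt.timedelta(days=1)

labels = pd.cut(
    dates,
    bins=[min_date, cutoff_2y, cutoff_1y, max_date],
    labels=["Pre-Cutoff", "2022-2023", "2023-2024"],
    right=True,
)
result = labels.astype(str).tolist()
```

A review dated exactly on a cutoff (e.g. 2022-09-10) lands in the earlier category, and the day after it starts the next one.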


In [22]:
#Checking this
reviews.sample(10) #Looks good

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year_category
1442535,654406936802958269,1183275811634430383,2024-06-20,114970537,Maurício,"Great place to stay in London. Quiet street, c...",2023-2024
1693490,919231276913874768,946227559221177019,2023-07-29,524209916,Prateek,"Place was good, locality was good, good connec...",2022-2023
1582368,807365802241567358,955643655868829847,2023-08-11,184360910,Noémie,Le logement est situé dans un quartier tranqui...,2022-2023
1560978,776725720927697827,889049491944284471,2023-05-11,297317558,Martine,"Nice place, mainly student accommodation, but ...",2022-2023
1778378,1014329261632130179,1156487404264422366,2024-05-14,9262023,Blair,Great space. As advertised. Lovely area!! Quit...,2023-2024
1403172,612354678270062038,682394449624218653,2022-07-30,114274888,Angela,"So nice, had to stay here twice! Thanks again!",Pre-Cutoff
1524377,735330823980526934,1174615061736719258,2024-06-08,40050143,Christof,Small modern and clean studio with everything ...,2023-2024
1280774,51365543,632432182927729014,2022-05-22,259606152,Beatriz,muy bien,Pre-Cutoff
1881005,1189900530175491366,1200605508353124268,2024-07-14,44970965,Kim,awesome little place in a great location.,2023-2024
972473,31941670,815884374297340326,2023-01-30,24117071,Claudia,"Good hostel, nice staff.",2022-2023


In [23]:
#Do any have null values?
reviews[reviews.year_category.isna()] #No - looks like transformation went well

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year_category


In [24]:
#Look for nulls
reviews.count() #some have no comments, one has no reviewer name

listing_id       1130616
id               1130616
date             1130616
reviewer_id      1130616
reviewer_name    1130615
comments         1130486
year_category    1130616
dtype: int64

In [25]:
#Row with no reviewer name:
reviews[reviews.reviewer_name.isna()] #Looks fine

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year_category
1092876,38639282,1200623816721820465,2024-07-14,190102481,,Perfect and I would love her again. Hilary is ...,2023-2024


In [26]:
#Rows with no comments
reviews[reviews.comments.isna()] #presume these are also fine to keep as people will leave scores; so it's not weird behaviour
#we will be looking at counts anyway, rather than review content

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year_category
35852,406777,871657238998972445,2023-04-17,507527775,Andreas,,2022-2023
50554,530395,1011524014255735246,2023-10-27,194523055,Edinson,,2023-2024
291876,6950763,849182377415080078,2023-03-17,421094279,Adegoke,,2022-2023
357119,9257162,1212918942981929485,2024-07-31,533627049,Rena,,2023-2024
405938,11665555,687530246941623779,2022-08-06,199658540,Kelly,,Pre-Cutoff
...,...,...,...,...,...,...,...
1839251,1107781028947673818,1172406240781123578,2024-06-05,303002557,Linda,,2023-2024
1873472,1171037862684642296,1209332929199379359,2024-07-26,71053688,Shailesh,,2023-2024
1878047,1180953314752991914,1186154929632305939,2024-06-24,571743776,James,,2023-2024
1886825,1217437614161504190,1226062601864080735,2024-08-18,516650786,Vanessa,,2023-2024


In [27]:
#Check for any listings with unrealistic review counts
review_counts = reviews.groupby('listing_id').size() #computed once and reused below
print("Review count per listing (over 731 days):")
print(review_counts.describe())
#some listings with only 1, one with 1263 - more than 1 per day!!
#NB: the filter now keeps everything from 2022-01-01, so the window is longer than 731 days and these thresholds are only a rough guide

#Not removing these for now, as we'll consider this more robustly when calculating occupancy metrics
print(f"\nNumber of listings with more than 731 reviews: {(review_counts > 731).sum()} - this is impossible!")
print(f"Number of listings with more than 365 reviews: {(review_counts > 365).sum()} - apparently 1 every 2 days")
#Threshold corrected to 243: the output recorded below was produced with > 365 here, hence the repeated 20
print(f"Number of listings with more than 243 reviews: {(review_counts > 243).sum()} - apparently 1 every 3 days")

print("\nListing with most reviews:")
reviews[reviews.listing_id==review_counts.idxmax()]
#Looks like a big hotel - though odd that it has the same listing_id

Review count per listing (over 731 days):
count    57330.000000
mean        19.721193
std         32.463202
min          1.000000
25%          3.000000
50%          9.000000
75%         23.000000
max       1263.000000
dtype: float64

Number of listings with more than 731 reviews: 5 - this is impossible!
Number of listings with more than 365 reviews: 20 - apparently 1 every 2 days
Number of listings with more than 243 reviews: 20 - apparently 1 every 3 days

Listing with most reviews:


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year_category
1220378,47408549,530233932046441020,2022-01-01,414735089,Karan,Value for money,Pre-Cutoff
1220379,47408549,530940130068732337,2022-01-02,33347920,Salman,"bad bed, mattress is stained, springs poking t...",Pre-Cutoff
1220649,47408549,530948711881412794,2022-01-02,341085874,Jonny,Nice room in a great location,Pre-Cutoff
1220650,47408549,530986525479442483,2022-01-02,126530393,丽,"The location is pretty good, convenient to go ...",Pre-Cutoff
1220651,47408549,530993720776910753,2022-01-02,410214121,Dom,Fantastic location.,Pre-Cutoff
...,...,...,...,...,...,...,...
1223750,47408549,1237577199134430219,2024-09-03,461445821,Andrew,Great staff and location,2023-2024
1223751,47408549,1237662215098725353,2024-09-03,130410905,Alexandra,"Fantastic location, great instructions and sug...",2023-2024
1223752,47408549,1238299422218584817,2024-09-04,525670987,Finnley,"Great location, friendly staff and lovely room",2023-2024
1223753,47408549,1239100574078005860,2024-09-05,211088838,Marie,Patrick was very informative and gave us all t...,2023-2024


The listing with the most reviews over this period is a [hotel](https://www.airbnb.co.uk/rooms/47408549?source_impression_id=p3_1732634604_P3_JbW_THpmpjHlw), where the same listing can be rented out by multiple people at once.
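
One way to make the "more than one review per day" check independent of a hard-coded number of days is to divide each listing's count by the length of the observation window. This is a rough sketch on made-up data; `window_days` and `multi_unit_suspects` are names introduced here for illustration.

```python
import pandas as pd

# Made-up reviews: listing 10 gets several reviews per day, listing 20 does not
toy_reviews = pd.DataFrame({
    "listing_id": [10, 10, 10, 10, 20],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-02"]
    ),
})

# Length of the observation window in days, counting both end dates
window_days = (toy_reviews["date"].max() - toy_reviews["date"].min()).days + 1

review_counts = toy_reviews.groupby("listing_id").size()
reviews_per_day = review_counts / window_days

# Listings averaging more than one review per day are unlikely to be a single bookable unit
multi_unit_suspects = reviews_per_day[reviews_per_day > 1].index.tolist()
```

On the real data this would flag hotel-style listings without relying on a fixed 731-day assumption.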

In [28]:
#Check for duplicates
reviews[reviews.duplicated(keep=False)] #no exact duplicates

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year_category


In [29]:
#Have any individuals commented on the same listing twice?
print("Some individuals have reviewed the same listing more than once...")
multiple_reviews = (
    reviews.groupby(['reviewer_id', 'listing_id'])
    .size()
    .reset_index(name='reviews_per_commenter')
    .query('reviews_per_commenter > 1')
    .sort_values(by='reviews_per_commenter', ascending=False)
)
print(f"\nMean number of multiple reviews: {multiple_reviews.reviews_per_commenter.mean():.2f}\n")
multiple_reviews #Lots of situations where the same reviewer has commented on the same listing

#Have looked into this manually and actually it looks fairly realistic: friendships forming between hosts and holidaymakers
#One individual reviewed the same listing 39 times over this period, which seems excessive
#But they do genuinely seem to be friends

#And the mean is fairly low - not too many multiples

Some individuals have reviewed the same listing more than once...

Mean number of multiple reviews: 2.35



Unnamed: 0,reviewer_id,listing_id,reviews_per_commenter
220457,42114923,29054773,39
453678,123490325,782361090804736131,27
560492,177109196,17344226,25
684967,258926680,35284127,22
991349,496343909,804133784391047309,22
...,...,...,...
395147,97969298,22962647,2
395180,97987362,38977803,2
395353,98078839,908081576338338413,2
395360,98081950,52858934,2


In [30]:
#Are different people (e.g. business owners) deliberately doing this? Or is it the same people consistently doing this?
multiple_reviews.groupby('reviewer_id').size().sort_values(ascending=False)
#Looked into some of these manually and they look okay - probably just people who travel a lot

reviewer_id
143140436    18
550200826    14
326508280     9
74419692      8
465622175     8
             ..
89065731      1
89116506      1
89128605      1
89194193      1
88539022      1
Length: 12860, dtype: int64

In [31]:
#Has the same person left multiple reviews on the same day?
unlikely_reviews = (
    reviews.groupby(['reviewer_id', 'date'])
    .size()
    .reset_index(name='reviews_on_day')
    .query('reviews_on_day > 1')
    .sort_values(by='reviews_on_day', ascending=False)
)
unlikely_reviews
#This has happened 1142 times!

Unnamed: 0,reviewer_id,date,reviews_on_day
319038,68076514,2023-05-08,5
205893,37458539,2024-07-14,5
810513,371058682,2024-04-15,5
834651,396592646,2023-05-05,4
1123106,583434683,2024-08-23,4
...,...,...,...
416703,104336852,2022-05-30,2
416661,104314095,2024-05-30,2
416566,104275033,2023-05-18,2
414399,103363269,2024-09-01,2
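
The check above counts reviews per reviewer per day, but it does not tell us whether the text was identical. A possible follow-up, sketched on made-up data, is to group on the comment text as well. It would need to run while the `comments` column is still present; `listings_with_same_text` is a column name invented for this example.

```python
import pandas as pd

# Made-up reviews: reviewer 1 leaves the same text on two listings on one day
toy_reviews = pd.DataFrame({
    "reviewer_id": [1, 1, 1, 2, 2],
    "listing_id": [100, 200, 300, 100, 400],
    "date": pd.to_datetime(
        ["2024-05-01", "2024-05-01", "2024-05-01", "2024-05-01", "2024-05-02"]
    ),
    "comments": ["Great stay!", "Great stay!", "Lovely flat", "Nice", "Nice"],
})

# Distinct listings that got the same text from the same reviewer on the same day
same_text = (
    toy_reviews.dropna(subset=["comments"])
    .groupby(["reviewer_id", "date", "comments"])["listing_id"]
    .nunique()
    .reset_index(name="listings_with_same_text")
    .query("listings_with_same_text > 1")
)
```

Reviewer 2 repeats a comment too, but on different days, so only reviewer 1 is flagged.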


In [16]:
#Some manual checking
pd.set_option('display.max_colwidth', None)
#Checked the top two and bottom 1 manually - they all look realistic actually
#reviews[
#    (reviews.reviewer_id==113858886) &
#    (reviews.date=="2022-11-13")
#]
pd.reset_option('display.max_colwidth')

In [32]:
#Drop unnecessary columns
reviews.drop(columns = ["id", "reviewer_name", "comments"], inplace=True)
reviews.head()

Unnamed: 0,listing_id,date,reviewer_id,year_category
21,13913,2022-04-04,2941217,Pre-Cutoff
22,13913,2022-05-17,32486646,Pre-Cutoff
23,13913,2022-05-20,203803703,Pre-Cutoff
24,13913,2022-05-22,172968198,Pre-Cutoff
25,13913,2022-06-10,27083732,Pre-Cutoff


In [33]:
#Output it to data folder
reviews.to_csv("data/clean/reviews_provisional.csv", index=False)

## Moving on to cleaning listings.csv

In [79]:
# Reading in listings csv
listings_raw = pd.read_csv("data/raw/2021_listings.csv")
print(f"The listings dataset is {listings_raw.shape[0]} rows x {listings_raw.shape[1]} columns.")

# Taking a peek
listings_raw.head(n=5)

The listings dataset is 70617 rows x 74 columns.


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,13913,https://www.airbnb.com/rooms/13913,20210909211636,2021-09-11,Holiday London DB Room Let-on going,My bright double bedroom with a large window h...,Finsbury Park is a friendly melting pot commun...,https://a0.muscache.com/pictures/miso/Hosting-...,54730,https://www.airbnb.com/users/show/54730,...,4.89,4.63,4.74,,f,2,1,1,0,0.17
1,15400,https://www.airbnb.com/rooms/15400,20210909211636,2021-09-11,Bright Chelsea Apartment. Chelsea!,Lots of windows and light. St Luke's Gardens ...,It is Chelsea.,https://a0.muscache.com/pictures/428392/462d26...,60302,https://www.airbnb.com/users/show/60302,...,4.82,4.93,4.73,,f,1,1,0,0,0.8
2,17402,https://www.airbnb.com/rooms/17402,20210909211636,2021-09-11,Superb 3-Bed/2 Bath & Wifi: Trendy W1,You'll have a wonderful stay in this superb mo...,"Location, location, location! You won't find b...",https://a0.muscache.com/pictures/39d5309d-fba7...,67564,https://www.airbnb.com/users/show/67564,...,4.66,4.85,4.59,,f,15,15,0,0,0.35
3,17506,https://www.airbnb.com/rooms/17506,20210909211636,2021-09-11,Boutique Chelsea/Fulham Double bed 5-star ensuite,Enjoy a chic stay in this elegant but fully mo...,Fulham is 'villagey' and residential – a real ...,https://a0.muscache.com/pictures/11901327/e63d...,67915,https://www.airbnb.com/users/show/67915,...,,,,,f,2,0,2,0,
4,24328,https://www.airbnb.com/rooms/24328,20210909211636,2021-09-11,"Battersea bedroom & studio, parking, shared ga...","Artist house, bright high ceiling rooms, priva...","- Battersea is a quiet family area, easy acces...",https://a0.muscache.com/pictures/c4ffa398-ebfd...,41759,https://www.airbnb.com/users/show/41759,...,4.93,4.58,4.64,,f,1,1,0,0,0.77


In [80]:
# Looking at column names
print(listings_raw.columns.to_list())

['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 

In [82]:
# Setting aside some columns that don't seem useful for analysis
# Including NaN-only columns like 'license' and 'neighbourhood_group_cleansed'.
exclude_cols = ['scrape_id', 'picture_url', 'host_url', 
                'host_response_time', 'host_response_rate', 'host_thumbnail_url', 
                'host_picture_url', 'host_has_profile_pic', 'neighbourhood', 
                'neighbourhood_group_cleansed', 'calendar_last_scraped', 
                'calendar_updated', 'has_availability', 'minimum_minimum_nights', 
                'maximum_maximum_nights', 'minimum_maximum_nights', 
                'maximum_minimum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
                'license', 'instant_bookable', ]

# Also checking for duplicate rows
duplicate_count = listings_raw.duplicated().sum()
print(duplicate_count) # Great, there are none

# Dropping these columns from listings_raw
listings = listings_raw.drop(columns = exclude_cols)
print(f"After dropping {len(exclude_cols)} less relevant columns, the listings dataset is {listings.shape[0]} rows x {listings.shape[1]} columns.")

0
After dropping 21 less relevant columns, the listings dataset is 70617 rows x 53 columns.


In [83]:
# Looking at listings columns and data types
listings.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70617 entries, 0 to 70616
Data columns (total 53 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            70617 non-null  int64  
 1   listing_url                                   70617 non-null  object 
 2   last_scraped                                  70617 non-null  object 
 3   name                                          70593 non-null  object 
 4   description                                   67718 non-null  object 
 5   neighborhood_overview                         42021 non-null  object 
 6   host_id                                       70617 non-null  int64  
 7   host_name                                     69942 non-null  object 
 8   host_since                                    69942 non-null  object 
 9   host_location                                 69784 non-null 

In [84]:
# Viewing null counts by column
listings.isnull().sum(axis=0).sort_values(ascending=False)

bathrooms                                       70617
host_acceptance_rate                            36041
host_about                                      31976
neighborhood_overview                           28596
review_scores_value                             22624
review_scores_location                          22620
review_scores_checkin                           22619
review_scores_communication                     22582
review_scores_accuracy                          22578
review_scores_cleanliness                       22566
reviews_per_month                               21087
review_scores_rating                            21087
last_review                                     21087
first_review                                    21087
host_neighbourhood                              19688
bedrooms                                         4581
description                                      2899
beds                                              945
host_location               
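
Raw null counts are easier to compare as a share of rows. A minimal sketch on made-up data (`toy_listings` and its values are assumptions) that ranks columns by their fraction of missing values and picks out the completely empty ones:

```python
import numpy as np
import pandas as pd

# Made-up listings: 'bathrooms' is entirely empty, 'bedrooms' is partly missing
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "bathrooms": [np.nan, np.nan, np.nan, np.nan],
    "bedrooms": [1.0, np.nan, 2.0, 1.0],
})

# Share of missing values per column, largest first
null_share = toy_listings.isna().mean().sort_values(ascending=False)

# Columns with no values at all
empty_cols = null_share[null_share == 1.0].index.tolist()
```

In the real dataset `bathrooms` is null in all 70617 rows, so it would show up in `empty_cols`.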

In [85]:
# I've used this cell as a tester cell to see what some of these variables look like.
# Based on this, I'm dropping more columns that don't seem useful.
listings['calculated_host_listings_count_entire_homes'].sample(3)

50283    4
34308    1
33536    1
Name: calculated_host_listings_count_entire_homes, dtype: int64

In [86]:
listings.drop(columns = ['description','neighborhood_overview', 'host_neighbourhood', 'host_about',
                      'beds', 'bathrooms', 'host_acceptance_rate', 'review_scores_location',
                      'review_scores_value', 'review_scores_checkin', 'review_scores_communication',
                      'review_scores_accuracy', 'review_scores_cleanliness', 'host_location',
                      'bedrooms', 'bathrooms_text'], inplace = True)
print(f"listings now has {listings.shape[0]} rows x {listings.shape[1]} columns.")

listings now has 70617 rows x 37 columns.


### Converting data types appropriately

In [87]:
# Converting true/false-response columns from string to Boolean
bools = ['host_is_superhost', 'host_identity_verified']
print(f"Before converting, 'host_is_superhost' is of type {listings['host_is_superhost'].dtype}.")


for b in bools:
    print(f"Converting {b}")
    #Compare against 't' directly: casting NaN with .astype('bool') would silently turn missing values into True
    listings[b] = listings[b].eq('t')


Before converting, 'host_is_superhost' is of type object.
Converting host_is_superhost
Converting host_identity_verified


In [88]:
# Checking to see if it worked
print(f"After converting, 'host_is_superhost' is of type {listings['host_is_superhost'].dtype}.")
listings['host_identity_verified'].sample(3)

After converting, 'host_is_superhost' is of type bool.


40345     True
40125    False
19674    False
Name: host_identity_verified, dtype: bool
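
Missing values need some care when turning 't'/'f' strings into booleans, because NaN counts as truthy in Python. The sketch below, on a made-up series, shows two ways of handling this: a strict comparison that treats missing as False, and pandas' nullable `boolean` dtype that keeps missing values missing. Which one is appropriate for our analysis is an open choice.

```python
import numpy as np
import pandas as pd

# Made-up flags as they would come out of the CSV, including a missing value
flags = pd.Series(["t", "f", np.nan], dtype="object")

# NaN is truthy, which is why a plain cast to bool is risky here
nan_is_truthy = bool(float("nan"))

# Option 1: strict comparison, so missing values become False
as_bool = flags.eq("t")

# Option 2: nullable boolean dtype, so missing values stay missing (<NA>)
as_nullable = flags.map({"t": True, "f": False}).astype("boolean")
```

Option 1 keeps the plain `bool` dtype; option 2 changes the dtype to `boolean` but does not hide the gaps.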

In [89]:
# Converting date-response columns to date type
dates = ['host_since','first_review','last_review']
print(f"Before converting, 'host_since' is of type {listings['host_since'].dtype}.\n")

for d in dates:
    print("Converting " + d)
    listings[d] = pd.to_datetime(listings[d])
    

Before converting, 'host_since' is of type object.

Converting host_since
Converting first_review
Converting last_review


In [90]:
# Checking to see if it worked
print(f"After converting, 'host_since' is of type {listings['host_since'].dtype}.\n")
listings['first_review'].sample(3)

After converting, 'host_since' is of type datetime64[ns].



16475   2017-01-01
63030   2021-07-22
34998   2018-10-25
Name: first_review, dtype: datetime64[ns]

In [91]:
# Converting categorical variables
categories = ['property_type', 'room_type', 'neighbourhood_cleansed',]
print(f"Before converting, 'room_type' is of type {listings['room_type'].dtype}.")

for c in categories:
    print("Converting " + c)
    listings[c] = listings[c].astype('category')

print(f"After converting, 'room_type' is of type {listings['room_type'].dtype}.\n")
listings['room_type'].sample(3)

Before converting, 'room_type' is of type object.
Converting property_type
Converting room_type
Converting neighbourhood_cleansed
After converting, 'room_type' is of type category.



69566    Entire home/apt
34475       Private room
63576    Entire home/apt
Name: room_type, dtype: category
Categories (4, object): ['Entire home/apt', 'Hotel room', 'Private room', 'Shared room']

In [92]:
# Converting float variables
# Looks like just 'price' for now, which is originally a string (object dtype)
floats = ['price']
print(f"Before converting, 'price' is of type {listings['price'].dtype}.")

# Also removing '$' and thousands separators (',') so we're left with the float
for f in floats:
    print("Converting " + f)
    listings[f] = listings[f].str.replace('$','', regex=False).str.replace(',', '', regex=False).astype('float').round(2)
print(f"After converting, 'price' is of type {listings['price'].dtype}.\n")
listings['price'].sample(3)

Before converting, 'price' is of type object.
Converting price
After converting, 'price' is of type float64.



53998     69.0
35379     73.0
29503    400.0
Name: price, dtype: float64
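
If the price column ever contains something other than a dollar amount, `.astype('float')` will raise an error. A more forgiving variant, sketched here on made-up strings (the exact raw format, e.g. "$1,250.00", is an assumption), uses `pd.to_numeric` with `errors="coerce"` so unparseable entries become NaN instead.

```python
import pandas as pd

# Made-up raw prices, including a missing value and an unparseable entry
raw_price = pd.Series(["$69.00", "$1,250.00", None, "n/a"])

# Strip the currency symbol and thousands separators
cleaned = (
    raw_price.str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Anything that still isn't a number becomes NaN rather than raising
price = pd.to_numeric(cleaned, errors="coerce")
```

Counting the NaNs before and after conversion would show how many entries failed to parse.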

In [93]:
# Converting integer variables
# Looks like just these for now, which are originally floats
ints = ['host_listings_count', 'host_total_listings_count']
print(f"Before converting, 'host_listings_count' is of type {listings['host_listings_count'].dtype}.")

# Note: I'm also putting in '0' where the cell is empty
for i in ints:
    print(f"Converting {i}")
    listings[i] = listings[i].fillna(0).astype('int')

print(f"After converting, 'host_listings_count' is of type {listings['host_listings_count'].dtype}.\n")
listings['host_listings_count'].sample(3)    

Before converting, 'host_listings_count' is of type float64.
Converting host_listings_count
Converting host_total_listings_count
After converting, 'host_listings_count' is of type int64.



61634    118
46556      2
58035      0
Name: host_listings_count, dtype: int64
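
Filling empty cells with 0 makes a host with unknown listing counts look the same as a host with zero listings. If that distinction matters later, pandas' nullable `Int64` dtype is one alternative; the sketch below compares the two on a made-up series.

```python
import numpy as np
import pandas as pd

# Made-up counts as read from the CSV (floats because of the missing value)
host_listings_count = pd.Series([118.0, 2.0, np.nan])

# Approach used above: missing becomes 0
filled = host_listings_count.fillna(0).astype("int")

# Alternative: nullable integers keep the missing value as <NA>
nullable = host_listings_count.astype("Int64")
```

Note the capital "I" in `Int64`, which is what selects the nullable dtype.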

### Saving a copy of the listings dataset as it's been cleaned thus far...

In [94]:

folder_path = os.path.join('data', 'clean')
file_path = os.path.join(folder_path, "2021_listings_provisonal.csv")

# Check if the folder exists; create it if it doesn't
if not os.path.exists(folder_path):
    print(f"Creating folder {folder_path} under {os.getcwd()}")
    os.makedirs(folder_path)

# Only write the file if it isn't there already
if not os.path.exists(file_path):
    print(f"File {file_path} does not exist. Creating the file.")
    listings.to_csv(file_path, index=False)
    print(f"Provisionally cleaned listings dataset added to {file_path}.")
else:
    print(f"File {file_path} already exists.")


File data/clean/2021_listings_provisonal.csv does not exist. Creating the file.
Provisionally cleaned listings dataset added to data/clean/2021_listings_provisonal.csv.
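
The same "create the folder, write only if missing" logic can be wrapped in a small helper. This is a sketch using `pathlib`; `save_if_missing` is a hypothetical function name, and a temporary folder stands in for `data/clean` so nothing real is overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_if_missing(df: pd.DataFrame, file_path: Path) -> bool:
    """Write df to file_path unless it already exists; return True if written."""
    # Creates any missing parent folders; does nothing if they already exist
    file_path.parent.mkdir(parents=True, exist_ok=True)
    if file_path.exists():
        return False
    df.to_csv(file_path, index=False)
    return True


# Usage with a temporary folder standing in for data/clean
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "clean" / "toy_listings.csv"
    toy = pd.DataFrame({"id": [1, 2], "price": [69.0, 73.0]})
    first_write = save_if_missing(toy, target)   # file is created
    second_write = save_if_missing(toy, target)  # file is left untouched
```

Returning a flag makes it easy to print the same "created" or "already exists" messages as in the cell above.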


### Creating a GDF from listings 

In [138]:
geolistings = gpd.GeoDataFrame(listings, 
                               geometry = gpd.points_from_xy(listings.longitude, 
                                                             listings.latitude,
                                                             crs ='epsg:4326'))

### Investigating hosts with multiple listings


In [18]:
# First, taking a look at the variables in our dataset that look relevant
listings[["calculated_host_listings_count", "host_listings_count", "host_total_listings_count"]].sample(3)

Unnamed: 0,calculated_host_listings_count,host_listings_count,host_total_listings_count
74311,259,268.0,697.0
50620,2,2.0,2.0
874,2,2.0,2.0


We can see that these numbers aren't always the same across "calculated_host_listings_count", "host_listings_count", and "host_total_listings_count"...

calculated_host_listings_count and host_listings_count are often not too far apart, but host_total_listings_count sometimes displays numbers that are quite different. I'm not sure why: the data dictionary's variable descriptions don't provide much clarity on this.
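
One way to investigate is to count the listings per host that actually appear in our file and compare that with the three reported columns. The sketch below uses made-up data; `observed_host_listings` is a column invented for this example, and whether it should match any of the reported counts is an assumption still to be checked.

```python
import pandas as pd

# Made-up listings: host 30 reports more listings than appear in the file
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_id": [10, 10, 20, 30],
    "calculated_host_listings_count": [2, 2, 1, 1],
    "host_listings_count": [2, 2, 1, 5],
    "host_total_listings_count": [2, 2, 1, 12],
})

# Listings per host as observed in this file
toy_listings["observed_host_listings"] = (
    toy_listings.groupby("host_id")["id"].transform("count")
)

count_cols = [
    "calculated_host_listings_count",
    "host_listings_count",
    "host_total_listings_count",
]
# Rows where the three reported counts are not all equal
disagree = toy_listings[toy_listings[count_cols].nunique(axis=1) > 1]
```

Looking at the `disagree` rows on the real data should show whether the gaps are concentrated among large multi-listing hosts.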