# Data Processing

## Reviews
**To note:**
- I have saved the output to data/reviews_provisional.csv
- The choice of columns to keep is just my own judgement, based on which variables I expect we'll need: we can always add others back in (the exact date, for example)
- Note the listings with unrealistic review counts
    - Large outliers seem to be things like hotels, but we don't know whether they are all like this
    - Similarly, we'll probably need a threshold of minimum reviews to measure genuine listings
- Note some weird reviewing activity
    - e.g. people reviewing the same place many times (up to 38 in 2 years!)
    - e.g. people leaving the same (often identical) reviews on multiple listings on the same day
    - The individual cases I examined did look realistic, but we don't know whether that holds for all of them (e.g. some people could be deliberately reviewing their own properties to improve their score)
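
The minimum-review threshold mentioned above could be applied roughly as follows. This is only a sketch on toy data: `MIN_REVIEWS` and its value are placeholders (we haven't agreed a threshold yet), and `toy` stands in for the real reviews table.

```python
import pandas as pd

# Toy stand-in for the reviews table (illustrative values only)
toy = pd.DataFrame({"listing_id": [1, 1, 1, 2, 3, 3]})

MIN_REVIEWS = 2  # hypothetical threshold, still to be agreed

# Number of reviews per listing
counts = toy.groupby("listing_id").size()

# Keep only the reviews belonging to listings that meet the minimum
keep_ids = counts[counts >= MIN_REVIEWS].index
filtered = toy[toy["listing_id"].isin(keep_ids)]
```

An upper bound for the hotel-like outliers could be added in the same way with a second condition on `counts`.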

In [75]:
#Importing packages for data loading
import numpy as np
import pandas as pd
import os
print(os.getcwd())

/home/jovyan/work/CASA0013 - Foundations of Spatial Data Science/CASA0013_FSDS_Airbnb-data-analytics/Documentation


In [76]:
reviews = pd.read_csv("data/reviews.csv.gz")
reviews.head() #Loaded successfully

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,13913,80770,2010-08-18,177109,Michael,My girlfriend and I hadn't known Alina before ...
1,13913,367568,2011-07-11,19835707,Mathias,Alina was a really good host. The flat is clea...
2,13913,529579,2011-09-13,1110304,Kristin,Alina is an amazing host. She made me feel rig...
3,13913,595481,2011-10-03,1216358,Camilla,"Alina's place is so nice, the room is big and ..."
4,13913,612947,2011-10-09,490840,Jorik,"Nice location in Islington area, good for shor..."


In [77]:
#Sorting data type
print("Data types before transformation:")
print(reviews.info())
#Looks like the date column is a string (e.g. "2010-08-18") - let's turn this into a datetime64 series
reviews["date"] = pd.to_datetime(reviews["date"], format="%Y-%m-%d")
#Other data types look okay

print("\nData types after transformation:")
reviews.info()

Data types before transformation:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1887519 entries, 0 to 1887518
Data columns (total 6 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   listing_id     int64 
 1   id             int64 
 2   date           object
 3   reviewer_id    int64 
 4   reviewer_name  object
 5   comments       object
dtypes: int64(3), object(3)
memory usage: 86.4+ MB
None

Data types after transformation:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1887519 entries, 0 to 1887518
Data columns (total 6 columns):
 #   Column         Dtype         
---  ------         -----         
 0   listing_id     int64         
 1   id             int64         
 2   date           datetime64[ns]
 3   reviewer_id    int64         
 4   reviewer_name  object        
 5   comments       object        
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 86.4+ MB


In [78]:
#Splitting dates for last 24 months
print(f"Latest review date: {reviews.date.max()}.")
max_date = reviews.date.max()
cutoff_2023 = max_date.replace(year=max_date.year - 1)
cutoff_date = max_date.replace(year=max_date.year - 2)
print(f"This means the cutoff date for reviews is: {cutoff_date}.")

#Filter for only reviews from the last 24 months
#(.copy() so that adding the year column below doesn't trigger a SettingWithCopyWarning)
reviews = reviews[reviews["date"] > cutoff_date].copy()

#Add column with year data (ChatGPT helped)
reviews["year"] = pd.cut(reviews["date"],
                        bins=[cutoff_date, cutoff_2023, max_date],
                        labels=["2022-2023", "2023-2024"],
                        right=True)

Latest review date: 2024-09-10 00:00:00.
This means the cutoff date for reviews is: 2022-09-10 00:00:00.
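
One caveat on the cutoff calculation: `Timestamp.replace(year=...)` works for 2024-09-10, but it would raise an error if the latest review fell on 29 February. A possible alternative, sketched here with a made-up `max_date`, is to subtract a `pd.DateOffset`, which rolls back to the last valid day of the month.

```python
import pandas as pd

# Hypothetical latest review date that falls on a leap day
max_date = pd.Timestamp("2024-02-29")

# replace(year=...) fails here because 2023-02-29 does not exist
try:
    max_date.replace(year=max_date.year - 1)
    replace_ok = True
except ValueError:
    replace_ok = False

# DateOffset clips to the last valid day of the month instead
cutoff_1y = max_date - pd.DateOffset(years=1)
cutoff_2y = max_date - pd.DateOffset(years=2)
```

For dates that exist in every year, both approaches give the same cutoffs.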


In [79]:
#Checking this
reviews.sample(10) #Looks good

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year
1648958,877511611506016890,1007136086878719743,2023-10-21,236719406,Ezio,"Great location, 2~3min to a tube station with ...",2023-2024
948304,30850801,850566567931635399,2023-03-19,60460266,Helene,"John is very nice , and reactive <br/>Nothing ...",2022-2023
1676987,903567264272854720,1114456098461527291,2024-03-17,83446052,Hitesh,Great stay and good location near welling stat...,2023-2024
76925,874463,773047243745275036,2022-12-02,68908112,Luciano,Really nice apartment with lovely living room ...,2022-2023
375084,10401320,807843704610109423,2023-01-19,112215678,Ben,"Wonderful host, very helpful & the accommodati...",2022-2023
1674595,901199795426890458,937593138427113976,2023-07-17,100679715,Jason,Great location close to Heathrow. We arrived l...,2022-2023
1269986,50803136,962140151427568683,2023-08-20,4087446,John,Good deal and walkable from Paddington Station...,2022-2023
1651762,880297939413562209,1107935971729612395,2024-03-08,40894127,Hayley,"Beautiful accommodation, really comfortable ro...",2023-2024
1452125,661210464768252099,862172377331863417,2023-04-04,163852564,Jerome,Marijana and Duncan were great hosts and made ...,2022-2023
904081,28838351,736841579237486280,2022-10-13,219648419,Lynette,"The flat is in a great location , and the area...",2022-2023


In [80]:
#Do any have null values?
reviews[reviews.year.isna()] #No - looks like transformation went well

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year


In [81]:
#Look for nulls
reviews.count() #some have no comments, one has no reviewer name

listing_id       950519
id               950519
date             950519
reviewer_id      950519
reviewer_name    950518
comments         950412
year             950519
dtype: int64
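
As an aside, `count()` shows the non-null totals, so the gaps have to be worked out by subtraction. A small sketch on toy data (not the real table) of getting the missing counts directly:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing name and one missing comment (illustrative only)
toy = pd.DataFrame({
    "reviewer_name": ["Ana", None, "Ben"],
    "comments": ["Great", "Fine", np.nan],
})

# Number of missing values per column
missing = toy.isna().sum()
```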

In [82]:
#Row with no reviewer name:
reviews[reviews.reviewer_name.isna()] #Looks fine

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year
1092876,38639282,1200623816721820465,2024-07-14,190102481,,Perfect and I would love her again. Hilary is ...,2023-2024


In [83]:
#Rows with no comments
reviews[reviews.comments.isna()] #presume these are also fine to keep as people will leave scores; so it's not weird behaviour
#we will be looking at counts anyway, rather than review content

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year
35852,406777,871657238998972445,2023-04-17,507527775,Andreas,,2022-2023
50554,530395,1011524014255735246,2023-10-27,194523055,Edinson,,2023-2024
291876,6950763,849182377415080078,2023-03-17,421094279,Adegoke,,2022-2023
357119,9257162,1212918942981929485,2024-07-31,533627049,Rena,,2023-2024
587037,17380483,828858707902223357,2023-02-17,151267357,Patricia,,2022-2023
...,...,...,...,...,...,...,...
1839251,1107781028947673818,1172406240781123578,2024-06-05,303002557,Linda,,2023-2024
1873472,1171037862684642296,1209332929199379359,2024-07-26,71053688,Shailesh,,2023-2024
1878047,1180953314752991914,1186154929632305939,2024-06-24,571743776,James,,2023-2024
1886825,1217437614161504190,1226062601864080735,2024-08-18,516650786,Vanessa,,2023-2024


In [89]:
#Check for any listings with unrealistic review counts
print("Review count per listing (over 731 days):")
print(reviews.groupby('listing_id').size().describe())
#some listings with only 1, one with 890 - more than 1 per day!!

#Not going to remove these for now as we'll consider this more robustly when calculating occupancy metrics
print(f"\nNumber of listings with more than 731 reviews: {(reviews.groupby('listing_id').size() > 731).sum()} - this is impossible!")
print(f"Number of listings with more than 365 reviews: {(reviews.groupby('listing_id').size() > 365).sum()} - apparently 1 every 2 days")
#NB: this comparison should be against 243, not 365; the count of 8 in the saved output below came from comparing against 365, so it needs re-running
print(f"Number of listings with more than 243 reviews: {(reviews.groupby('listing_id').size() > 243).sum()} - apparently 1 every 3 days")

print("\nListing with most reviews:")
reviews[reviews.listing_id==reviews.groupby('listing_id').size().idxmax()]
#Looks like a big hotel - though odd that it has the same listing_id

Review count per listing (over 731 days):
count    55598.000000
mean        17.096280
std         26.220929
min          1.000000
25%          3.000000
50%          8.000000
75%         20.000000
max        890.000000
dtype: float64

Number of listings with more than 731 reviews: 2 - this is impossible!
Number of listings with more than 365 reviews: 8 - apparently 1 every 2 days
Number of listings with more than 243 reviews: 8 - apparently 1 every 3 days

Listing with most reviews:


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year
1221390,47408549,713648060004407628,2022-09-11,91650464,Isabel,"Good location, very central, close to King's C...",2022-2023
1221391,47408549,714267096650985030,2022-09-12,180339071,Arynne,the Melville was perfect for our 6-night stay ...,2022-2023
1221392,47408549,715038835351763365,2022-09-13,296551199,Stefanie,We had a great stay here! The room was clean a...,2022-2023
1221393,47408549,715083215765305587,2022-09-13,244039567,Radka,"Room is very tiny. Without kitchen, just kettl...",2022-2023
1221394,47408549,715734325784951820,2022-09-14,126932827,Lesley,Clean and convenient - an economical stay for ...,2022-2023
...,...,...,...,...,...,...,...
1223750,47408549,1237577199134430219,2024-09-03,461445821,Andrew,Great staff and location,2023-2024
1223751,47408549,1237662215098725353,2024-09-03,130410905,Alexandra,"Fantastic location, great instructions and sug...",2023-2024
1223752,47408549,1238299422218584817,2024-09-04,525670987,Finnley,"Great location, friendly staff and lovely room",2023-2024
1223753,47408549,1239100574078005860,2024-09-05,211088838,Marie,Patrick was very informative and gave us all t...,2023-2024


The listing with the most reviews over this period is a [hotel](https://www.airbnb.co.uk/rooms/47408549?source_impression_id=p3_1732634604_P3_JbW_THpmpjHlw): the same listing can be rented out by multiple people at once, which explains how it can collect more than one review per day.
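
The review-rate thresholds used in the check above can be derived from the window length rather than typed in by hand, which makes it harder for the printed label and the comparison to drift apart. A sketch on made-up counts; `WINDOW_DAYS`, `per_listing` and `exceed` are illustrative names.

```python
import pandas as pd

WINDOW_DAYS = 731  # length of the review window used in this notebook

# Toy review counts per listing (illustrative values only)
per_listing = pd.Series({101: 5, 102: 250, 103: 400, 104: 890})

# More than WINDOW_DAYS // k reviews means more than one review every k days
exceed = {}
for k in (1, 2, 3):
    threshold = WINDOW_DAYS // k
    exceed[threshold] = int((per_listing > threshold).sum())
    print(f"Listings with more than {threshold} reviews: {exceed[threshold]}")
```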

In [110]:
#Check for duplicates
reviews[reviews.duplicated(keep=False)] #no exact duplicates

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,year


In [130]:
#Have any individuals commented on the same listing twice?
print("Some individuals have reviewed the same listing more than once:")
multiple_reviews = (
    reviews.groupby(['reviewer_id', 'listing_id'])
    .size()
    .reset_index(name='reviews_per_commenter')
    .query('reviews_per_commenter > 1')
    .sort_values(by='reviews_per_commenter', ascending=False)
)
print(f"\nMean number of multiple reviews: {multiple_reviews.reviews_per_commenter.mean():.2f}\n")
multiple_reviews #Lots of situations where the same reviewer has commented on the same listing

#Have looked into this manually and it actually looks fairly realistic: friendships forming between hosts and holidaymakers
#One individual reviewed the same property 38 times in 2 years, which seems excessive
#But they do genuinely seem to be friends

#And the mean is fairly low - most repeat reviewers only reviewed the same listing 2 or 3 times

Some individuals have reviewed the same listing more than once:

Mean number of multiple reviews: 2.36



Unnamed: 0,reviewer_id,listing_id,reviews_per_commenter
179995,42114923,29054773,38
372355,123490325,782361090804736131,27
815074,496343909,804133784391047309,22
564007,258926680,35284127,22
620734,326508280,38212604,19
...,...,...,...
335791,104100218,1048717494574103473,2
335884,104147888,1045061160556613146,2
335898,104156735,25147229,2
336466,104463265,73125,2


In [139]:
#Are different people (e.g. business owners) deliberately doing this? Or is it the same people consistently doing this?
multiple_reviews.groupby('reviewer_id').size().sort_values(ascending=False)
#Looked into some of these manually and they look okay - probably just people who travel a lot

reviewer_id
550200826    14
143140436    12
326508280     8
465622175     8
513321810     7
             ..
95880203      1
95906485      1
95972936      1
95975506      1
95545840      1
Length: 10350, dtype: int64

In [142]:
#Has the same person left multiple reviews on the same day?
unlikely_reviews = (
    reviews.groupby(['reviewer_id', 'date'])
    .size()
    .reset_index(name='reviews_on_day')
    .query('reviews_on_day > 1')
    .sort_values(by='reviews_on_day', ascending=False)
)
unlikely_reviews
#This has happened 1142 times!

Unnamed: 0,reviewer_id,date,reviews_on_day
260779,68076514,2023-05-08,5
167759,37458539,2024-07-14,5
667728,371058682,2024-04-15,5
943685,584235248,2024-08-11,4
943153,583434683,2024-08-23,4
...,...,...,...
356399,112126593,2024-02-02,2
353228,110578574,2023-05-01,2
351924,109946402,2024-07-24,2
351161,109559068,2023-07-28,2
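
The count above doesn't say whether the same-day reviews had identical text. A possible follow-up, sketched on toy data, flags rows where the same reviewer left the same comment on the same day and counts how many listings were involved. It would need to run before the `comments` column is dropped, and `same_text` and `n_listings` are just illustrative names.

```python
import pandas as pd

# Toy reviews (illustrative values only)
toy = pd.DataFrame({
    "reviewer_id": [1, 1, 1, 2, 2],
    "listing_id": [10, 11, 12, 10, 11],
    "date": pd.to_datetime([
        "2024-01-01", "2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02",
    ]),
    "comments": ["Great stay", "Great stay", "Lovely", "Great stay", "Great stay"],
})

# Rows sharing reviewer, date and comment text with at least one other row
same_text = toy[toy.duplicated(subset=["reviewer_id", "date", "comments"], keep=False)]

# Number of distinct listings behind each repeated comment
n_listings = (
    same_text.groupby(["reviewer_id", "date", "comments"])["listing_id"].nunique()
)
```

Note that `duplicated` treats missing values as equal to each other, so reviews with no comment text would also be flagged unless they are filtered out first.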


In [157]:
#Some manual checking
pd.set_option('display.max_colwidth', None)
#Checked the top two and bottom 1 manually - they all look realistic actually
#reviews[
#    (reviews.reviewer_id==113858886) &
#    (reviews.date=="2022-11-13")
#]
pd.reset_option('display.max_colwidth')

In [162]:
#Drop unnecessary columns: date, ...
reviews = reviews.drop(columns=["id", "date", "reviewer_name", "comments"])
reviews.head()

Unnamed: 0,listing_id,reviewer_id,year
29,13913,139088109,2022-2023
30,13913,217491570,2022-2023
31,13913,63837960,2022-2023
32,13913,27083732,2022-2023
33,13913,30527774,2022-2023


In [165]:
#Output it to data folder
reviews.to_csv("data/reviews_provisional.csv", index=False)
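
One thing to keep in mind when reading the file back: CSV has no categorical type, so the `year` column comes back as plain strings unless a dtype is given. A small round-trip sketch on toy rows, using an in-memory buffer in place of the real file:

```python
import io

import pandas as pd

# Toy rows in the same shape as the saved table (illustrative only)
toy = pd.DataFrame({
    "listing_id": [13913, 13913],
    "reviewer_id": [139088109, 217491570],
    "year": pd.Categorical(["2022-2023", "2023-2024"]),
})

# Write to an in-memory buffer instead of data/reviews_provisional.csv
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Ask for the categorical dtype explicitly on reload
reloaded = pd.read_csv(buffer, dtype={"year": "category"})
```

The category ordering produced by `pd.cut` is not stored in the CSV either, so it would have to be set again after loading if it matters.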