# Airbnb Sentiment Analysis Test
This notebook is a rough test of how we might do sentiment analysis on Airbnb reviews. It will naturally be short and a bit choppy: I am only giving myself 1 hour to have a brief look, as we'll be covering this further in FSDS in Week 8.

**To-Do List for any Future Analysis:**
- Need to look into processing of different languages!
- And other forms of sentiment analysis, as it looks like almost everything English was classified as positive
- Further check that encoding is all okay
- Investigate Alsudais (2021) review ID issue
- Split the reviews into 2022-23 and 2023-24
- Ultimately: much more work on NLP evaluation!

## Set Up

In [78]:
#Loading packages for data loading
import numpy as np
import pandas as pd
import os
print(os.getcwd())

/home/jovyan/work/CASA0013 - Foundations of Spatial Data Science/CASA0013_FSDS_Airbnb-data-analytics/Documentation


In [53]:
#Loading in review data
reviews = pd.read_csv("data/reviews.csv.gz")
reviews.head() #Loaded successfully

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,13913,80770,2010-08-18,177109,Michael,My girlfriend and I hadn't known Alina before ...
1,13913,367568,2011-07-11,19835707,Mathias,Alina was a really good host. The flat is clea...
2,13913,529579,2011-09-13,1110304,Kristin,Alina is an amazing host. She made me feel rig...
3,13913,595481,2011-10-03,1216358,Camilla,"Alina's place is so nice, the room is big and ..."
4,13913,612947,2011-10-09,490840,Jorik,"Nice location in Islington area, good for shor..."


### Data Cleaning
We will want relatively recent data, as we're investigating recent changes in satisfaction associated with the professionalisation and saturation of Airbnb.

In [54]:
#Brief overview
reviews.count()

listing_id       1887519
id               1887519
date             1887519
reviewer_id      1887519
reviewer_name    1887518
comments         1887331
dtype: int64

In [55]:
#Looks like some reviews don't have comments:
print("Null values by data type:")
print(reviews.isnull().sum(axis=0).sort_values(ascending=False))

#Let's drop the reviews with no comments
reviews.dropna(subset=["comments"], inplace=True)

Null values by data type:
comments         188
reviewer_name      1
id                 0
listing_id         0
reviewer_id        0
date               0
dtype: int64


In [56]:
#Let's briefly look into the review with no reviewer_name
reviews[reviews['reviewer_name'].isnull()]
#This looks fine to keep

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
1092876,38639282,1200623816721820465,2024-07-14,190102481,,Perfect and I would love her again. Hilary is ...


In [57]:
#Checking data type:
print("Original series types:")
print(reviews.info())

#Looks like the date column is a string (e.g. "2010-08-18")- let's turn this into a datetime64 series
reviews["date"] = pd.to_datetime(reviews["date"], format="%Y-%m-%d")
#Other data types look okay

#Drop reviewer name (unlikely to need the other ID columns, but we might, so I'm leaving just in case):
reviews.drop(columns = ["reviewer_name"], inplace=True)

#New data frame format:
print("Cleaned data frame format:")
print(reviews.info())

Original series types:
<class 'pandas.core.frame.DataFrame'>
Index: 1887331 entries, 0 to 1887518
Data columns (total 6 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   listing_id     int64 
 1   id             int64 
 2   date           object
 3   reviewer_id    int64 
 4   reviewer_name  object
 5   comments       object
dtypes: int64(3), object(3)
memory usage: 100.8+ MB
None
Cleaned data frame format:
<class 'pandas.core.frame.DataFrame'>
Index: 1887331 entries, 0 to 1887518
Data columns (total 5 columns):
 #   Column       Dtype         
---  ------       -----         
 0   listing_id   int64         
 1   id           int64         
 2   date         datetime64[ns]
 3   reviewer_id  int64         
 4   comments     object        
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 86.4+ MB
None


In [65]:
#Find the latest review date, then keep the two years up to it (the latest 12 months plus the 12 months prior)
print(f"Latest review date: {reviews.date.max()}.")
max_date = reviews.date.max()
cutoff_date = max_date - pd.DateOffset(years=2) #Unlike .replace(year=...), this also works if max_date is 29 February
print(f"This means the cutoff date for reviews is: {cutoff_date}.")

#Creating a new data frame with only reviews from the previous two years
reviews_2224 = reviews[reviews["date"] >= cutoff_date]
#This should later be split into reviews_2223 and reviews_2324, but for QAing, it makes sense to do this together

#Dropping original data to save memory
del reviews

Latest review date: 2024-09-10 00:00:00.
This means the cutoff date for reviews is: 2022-09-10 00:00:00.
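
The to-do item on splitting the reviews into two 12-month windows could look something like the sketch below. It runs on a small made-up frame (`toy`) rather than the real data, and the names `midpoint`, `earlier` and `later` are hypothetical placeholders, not anything defined elsewhere in this notebook. The windows are half-open so that a review dated exactly on the midpoint is only counted once.

```python
import pandas as pd

# Toy stand-in for the filtered reviews frame (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date": pd.to_datetime(["2022-09-10", "2023-09-09", "2023-09-10", "2024-09-10"]),
})

max_date = toy["date"].max()
midpoint = max_date - pd.DateOffset(years=1)  # start of the most recent 12 months
cutoff = max_date - pd.DateOffset(years=2)    # start of the 12 months before that

# Half-open windows: [cutoff, midpoint) and [midpoint, max_date]
earlier = toy[(toy["date"] >= cutoff) & (toy["date"] < midpoint)]
later = toy[toy["date"] >= midpoint]

print(earlier["id"].tolist(), later["id"].tolist())
```

On the real data, the same two masks applied to `reviews_2224` would give the 2022-23 and 2023-24 frames.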


### Brief QA

In [124]:
#We have already checked for nulls

#Now need to check for poor encoding
#This code is from ChatGPT:
poorly_encoded_comments = reviews_2224['comments'].str.contains(r'[ÃÂâ�]', na=False)
# Filter rows with poorly encoded strings
poorly_encoded_rows = reviews_2224[poorly_encoded_comments]
#poorly_encoded_rows.comments.to_csv("data/poorly_encoded_rows.csv") (I've commented this out as it's just for me to have a brief look)
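#Lots of the flagged text is French/Portuguese - how will we deal with this?
#Note that the pattern also matches legitimate accented letters (e.g. 'â' in French, 'Ã' in Portuguese), so not every flagged row is actually mis-encoded
#Further investigation into encoding is needed, as a few odd characters are popping up
#Not a task for today, but it should certainly be a focus in future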

#More substantive QA connected to review problem in Alsudais (2021) is needed
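
One possible way to make the encoding check less sensitive to genuine accented text is a round-trip test: re-encode each string as Windows-1252 and see whether the bytes read back as valid UTF-8. This is only a heuristic sketch. `looks_like_mojibake` is a hypothetical helper, it assumes the damage came from UTF-8 being decoded as cp1252, and it will miss other kinds of corruption (or strings that also contain characters cp1252 cannot represent, such as emoji).

```python
def looks_like_mojibake(text: str) -> bool:
    """Heuristic check for UTF-8 text that was wrongly decoded as Windows-1252.

    Hypothetical helper: re-encode the string as cp1252 and try to read the
    bytes back as UTF-8. Genuine accented text rarely survives that round
    trip, whereas mojibake such as 'cafÃ©' decodes cleanly to 'café'.
    """
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 once re-encoded
        return False
    return repaired != text


# Made-up examples: only the first one is mis-encoded
examples = ["cafÃ© trÃ¨s sympa", "café très sympa", "château", "Great stay!"]
flags = [looks_like_mojibake(s) for s in examples]
print(flags)
```

On the real data this could be applied with `reviews_2224['comments'].apply(looks_like_mojibake)` in place of the character-class regex.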

## NLP Test
The analysis below mainly follows [this tutorial](https://www.datacamp.com/tutorial/text-analytics-beginners-nltk) from DataCamp. Hopefully we can refine this a bit once we've covered the Week 8 content on textual data and further evaluated the metrics we're using for analysis.

In [108]:
#Import packages as recommended by DataCamp
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import confusion_matrix
#NLTK doesn't bundle its corpora and models with the package, so each resource has to be downloaded once before first use:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [97]:
#Pre-process text:
#Tokenising words, removing stop words, and lemmatising filtered tokens
#Build the stop word set and lemmatiser once, rather than on every call (much faster on large data)
english_stopwords = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [token for token in tokens if token not in english_stopwords]
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    processed_text = ' '.join(lemmatized_tokens)
    return processed_text

In [100]:
#Let's apply this on a subset of the data (I'm slightly worried about the df size!)
reviews_small = reviews_2224.sample(1000)
reviews_small["updated_comments"] = reviews_small.comments.apply(preprocess_text)
reviews_small

Unnamed: 0,listing_id,id,date,reviewer_id,comments,updated_comments
1797480,1038399411348146842,1127480297418250947,2024-04-04,5337024,"Die Wohnung ist sehr schön im obersten, 2. OG ...","die wohnung ist sehr schön im obersten , 2. og..."
915047,29237506,1224616522650857420,2024-08-16,114554419,Andrew is the best !,andrew best !
1376950,575401390295945330,1215914369138780527,2024-08-04,137070369,"Amazing host, would definitely recommend","amazing host , would definitely recommend"
1575345,797240240735534060,974463919342114982,2023-09-06,82736012,Evgenia was an amazing host. As a female trave...,evgenia amazing host . female traveler felt sa...
1775673,1009286480282730687,1152814906891321945,2024-05-09,533982053,Vi and Francesca were such a great hosts! My h...,vi francesca great host ! husband really felt ...
...,...,...,...,...,...,...
1620501,849418908398661520,1076797540907750634,2024-01-25,202500934,I had a fantastic stay at Christopher’s place....,fantastic stay christopher ’ place . really ni...
812973,24343912,978881365353105661,2023-09-12,207312128,We enjoyed our stay close to many good things-...,enjoyed stay close many good things- great fan...
1139651,41239740,1217325253588403750,2024-08-06,442930934,Logement correspondant à la description mais l...,logement correspondant à la description mais l...
271411,6228511,1165175998307719034,2024-05-26,4296373,I can recommend her place to stay.,recommend place stay .


In [105]:
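#NLTK Sentiment Analyser (VADER)
analyser = SentimentIntensityAnalyzer()
def get_sentiment(text):
    scores = analyser.polarity_scores(text)
    #1 if any share of the text scores as positive, otherwise 0 (so 0 means "no positive words found", not necessarily negative)
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment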

#Applying this to text
reviews_small["sentiment"] = reviews_small["updated_comments"].apply(get_sentiment)
reviews_small

Unnamed: 0,listing_id,id,date,reviewer_id,comments,updated_comments,sentiment
1797480,1038399411348146842,1127480297418250947,2024-04-04,5337024,"Die Wohnung ist sehr schön im obersten, 2. OG ...","die wohnung ist sehr schön im obersten , 2. og...",0
915047,29237506,1224616522650857420,2024-08-16,114554419,Andrew is the best !,andrew best !,1
1376950,575401390295945330,1215914369138780527,2024-08-04,137070369,"Amazing host, would definitely recommend","amazing host , would definitely recommend",1
1575345,797240240735534060,974463919342114982,2023-09-06,82736012,Evgenia was an amazing host. As a female trave...,evgenia amazing host . female traveler felt sa...,1
1775673,1009286480282730687,1152814906891321945,2024-05-09,533982053,Vi and Francesca were such a great hosts! My h...,vi francesca great host ! husband really felt ...,1
...,...,...,...,...,...,...,...
1620501,849418908398661520,1076797540907750634,2024-01-25,202500934,I had a fantastic stay at Christopher’s place....,fantastic stay christopher ’ place . really ni...,1
812973,24343912,978881365353105661,2023-09-12,207312128,We enjoyed our stay close to many good things-...,enjoyed stay close many good things- great fan...,1
1139651,41239740,1217325253588403750,2024-08-06,442930934,Logement correspondant à la description mais l...,logement correspondant à la description mais l...,1
271411,6228511,1165175998307719034,2024-05-26,4296373,I can recommend her place to stay.,recommend place stay .,1
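
An alternative to thresholding on `pos` would be the `compound` score that `polarity_scores` also returns. The sketch below maps a compound score to a three-way label. `label_from_compound` is a hypothetical helper, and the ±0.05 cut-offs are the ones commonly suggested for VADER, so treat them as an assumption to be tuned rather than a fixed rule. Text with no lexicon hits scores a compound of 0.0, so I'd expect most non-English reviews to land in "neutral" instead of being lumped in with negatives.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score (between -1 and 1) to a three-way label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores, purely for illustration
for score in (0.84, 0.0, -0.42):
    print(score, label_from_compound(score))
```

With the analyser above, this could be applied as `reviews_small["comments"].apply(lambda t: label_from_compound(analyser.polarity_scores(t)["compound"]))`. It may also be worth scoring the raw `comments` rather than `updated_comments`: NLTK's English stop word list includes negations such as "not", so removing stop words can change the meaning of a review before it is scored.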


In [110]:
#Evaluation with a confusion matrix: how? Could we classify a sample ourselves?
#print(confusion_matrix(reviews_small['pre_classified'], reviews_small['sentiment']))
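
If we did hand-label a sample, the evaluation could be as simple as the sketch below. It uses made-up labels and a hypothetical `confusion_counts` helper written in plain Python so that the arithmetic is visible. The layout (rows are true labels, columns are predictions) mirrors what `sklearn.metrics.confusion_matrix` returns for 0/1 labels.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return a 2x2 confusion matrix [[TN, FP], [FN, TP]] for 0/1 labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(0, 0)], counts[(0, 1)]],
            [counts[(1, 0)], counts[(1, 1)]]]


# Made-up labels: hand_labels stands in for a manually classified sample
hand_labels = [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 1]

matrix = confusion_counts(hand_labels, predicted)
# Accuracy is the share of predictions on the diagonal (TN + TP)
accuracy = (matrix[0][0] + matrix[1][1]) / len(hand_labels)
print(matrix, accuracy)
```

Even a hand-labelled sample of a hundred or so reviews would show whether the "almost everything is positive" result reflects the reviews themselves or the threshold.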

In [123]:
print("Summary:")
print(reviews_small.sentiment.describe())

print("\nUnique Counts:")
print(reviews_small.sentiment.value_counts())

#Look into some rows classified as negative:
negative = reviews_small[reviews_small['sentiment'] == 0]
#negative.to_csv("data/negative_comments.csv") (commented this out - was just for me to look)
#These are all just reviews in other languages: VADER's lexicon is English, so it finds no positive words in them. Much further work to be done

Summary:
count    1000.000000
mean        0.856000
std         0.351265
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: sentiment, dtype: float64

Unique Counts:
sentiment
1    856
0    144
Name: count, dtype: int64
