In [42]:
import pandas as pd
from textblob import TextBlob
from transformers import pipeline
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## Task 2 - Establishing Ground Truth

First we load the cleaned reviews dataset from task 1.

In [43]:
reviews = pd.read_csv("cleaned_reviews.csv")
reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5


### Task 2.1. Using TextBlob

We can use TextBlob to analyze the sentiment of the reviews.

In [44]:
reviews['text_blob_sentiment'] = reviews['review'].apply(lambda text: TextBlob(text).sentiment.polarity)

In [45]:
reviews['text_blob_sentiment'].describe()

count    5186.000000
mean        0.367365
std         0.230712
min        -0.825000
25%         0.265239
50%         0.405107
75%         0.514089
max         1.000000
Name: text_blob_sentiment, dtype: float64

TextBlob polarity is defined on the interval [-1, 1], where -1 is very negative and 1 is very positive; in this dataset the scores range from -0.825 to 1.0. The mean is 0.367 and the median 0.405, indicating generally positive sentiment across the reviews, with a standard deviation of 0.231.

In [46]:
reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,0.333333
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,0.470238
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,0.386667
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,0.279339
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,0.519048


Here, we can see that the sentiment scores are continuous values. To convert these into binary sentiment labels, we can use a threshold of 0, where scores above 0 are considered positive and scores below or equal to 0 are considered negative.

In [47]:
reviews['text_blob_sentiment'] = reviews['text_blob_sentiment'].apply(lambda x: 1 if x > 0 else -1)
reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1


We can spot-check the labels by looking at some 5-star and some 1-star reviews.

In [48]:
five_star_reviews = reviews[reviews['rating'] == 5]
five_star_reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1
5,1015273964,11953119,Nh Collection Colombo,Colombo,outstanding spotless immaculate premises room ...,5,1


In [49]:
one_star_reviews = reviews[reviews['rating'] == 1]
one_star_reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1
22,1013561310,11953119,Nh Collection Colombo,Colombo,dont complaint restaurant food cold otherwise ...,1,-1
76,568472844,11899031,De Colombo Boutique Hotel,Colombo,filthy towers probably worst hotel ive stayed ...,1,-1
77,568472643,11899031,De Colombo Boutique Hotel,Colombo,stay peril total disappointing start holiday s...,1,-1
80,565729487,11899031,De Colombo Boutique Hotel,Colombo,absolutely disgusting stay bug infested overpr...,1,-1


In these samples the TextBlob labels mostly agree with the ratings: all five 5-star reviews shown are labelled positive, and four of the five 1-star reviews shown are labelled negative.

The exception is the 1-star review at index 0, which is labelled positive. Its text opens with "good stay", which may explain the positive score.
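
Looking at `head()` only covers five rows per rating. A fuller check would tabulate the binary label against the star rating for every review. The sketch below does this in plain Python on a few made-up `(rating, label)` pairs; the `rows` list, the helper names, and the cut-offs `low=2` and `high=4` are assumptions for illustration, not part of the dataset or the original analysis. On the real DataFrame, `pd.crosstab(reviews['rating'], reviews['text_blob_sentiment'])` should produce the same kind of table.

```python
from collections import Counter

# Hypothetical (rating, binary sentiment) pairs standing in for the real columns.
rows = [(5, 1), (5, 1), (4, 1), (1, -1), (1, 1), (2, -1), (1, -1)]

def tabulate(pairs):
    """Count how often each (rating, label) combination occurs."""
    return Counter(pairs)

def disagreement_rate(pairs, low=2, high=4):
    """Share of clearly rated reviews whose label contradicts the rating.

    Ratings <= low are expected to be negative (-1) and ratings >= high
    positive (1); ratings in between are ignored as ambiguous.
    """
    clear = [(r, s) for r, s in pairs if r <= low or r >= high]
    if not clear:
        return 0.0
    wrong = sum(1 for r, s in clear if (r <= low) != (s == -1))
    return wrong / len(clear)

counts = tabulate(rows)
rate = disagreement_rate(rows)
print(counts[(1, 1)], round(rate, 3))
```

A low disagreement rate on the full dataset would support the impression from the samples; a high one would suggest that the threshold or the tool needs another look.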

### Task 2.2. Using VADER

Now, we can use VADER (Valence Aware Dictionary and sEntiment Reasoner) to analyze the sentiment of the reviews. VADER is particularly effective for social media texts and short reviews.

In [50]:
sentiment = SentimentIntensityAnalyzer()
reviews['vader_sentiment'] = reviews['review'].apply(lambda text: sentiment.polarity_scores(text))

In [51]:
reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1,"{'neg': 0.0, 'neu': 0.847, 'pos': 0.153, 'comp..."
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1,"{'neg': 0.0, 'neu': 0.316, 'pos': 0.684, 'comp..."
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1,"{'neg': 0.0, 'neu': 0.586, 'pos': 0.414, 'comp..."
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1,"{'neg': 0.0, 'neu': 0.716, 'pos': 0.284, 'comp..."
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1,"{'neg': 0.0, 'neu': 0.597, 'pos': 0.403, 'comp..."


Here, we are given a dictionary with four keys: 'neg', 'neu', 'pos', and 'compound'. The 'compound' score is a normalized score that ranges from -1 (most negative) to +1 (most positive). We will use this score for our binary sentiment classification.
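
For intuition about the 'compound' value: in the reference vaderSentiment implementation it is obtained by summing the valence of the words in the text and squashing that sum into [-1, 1] with x / sqrt(x^2 + alpha), using alpha = 15. The sketch below re-implements only that normalization step. `normalize_compound` is a name chosen here, not the library function, and the lexicon lookup and VADER's heuristic rules are left out entirely.

```python
import math

def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, score))

# Illustrative valence sums, not taken from the dataset.
for total in (0.0, 1.9, 5.0, -5.0, 20.0):
    print(total, round(normalize_compound(total), 4))
```

Because the sum grows with every additional sentiment-bearing word, longer reviews with many positive words tend to end up close to +1, which is worth keeping in mind when reading the summary statistics.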

In [52]:
reviews["vader_sentiment"] = reviews["vader_sentiment"].apply(lambda x: x["compound"])
reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1,0.4404
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1,0.9666
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1,0.872
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1,0.9493
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1,0.9451


In [53]:
reviews["vader_sentiment"].describe()

count    5186.000000
mean        0.811598
std         0.450619
min        -0.994600
25%         0.925750
50%         0.969800
75%         0.985500
max         0.999300
Name: vader_sentiment, dtype: float64

The mean compound score is 0.812 and the median 0.970, so most reviews score strongly positive. The standard deviation of 0.451 and the minimum of -0.995 show that a minority of reviews score strongly negative.

We can once again convert these continuous sentiment scores into binary labels using a threshold of 0, where scores above 0 are considered positive and scores below or equal to 0 are considered negative.

In [54]:
reviews['vader_sentiment'] = reviews['vader_sentiment'].apply(lambda x: 1 if x > 0 else -1)
reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1,1
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1,1
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1,1
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1,1
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1,1


We then spot-check some 5-star and 1-star reviews against the labels.

In [55]:
five_star_reviews = reviews[reviews['rating'] == 5]
five_star_reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1,1
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1,1
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1,1
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1,1
5,1015273964,11953119,Nh Collection Colombo,Colombo,outstanding spotless immaculate premises room ...,5,1,1


In [56]:
one_star_reviews = reviews[reviews['rating'] == 1]
one_star_reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1,1
22,1013561310,11953119,Nh Collection Colombo,Colombo,dont complaint restaurant food cold otherwise ...,1,-1,-1
76,568472844,11899031,De Colombo Boutique Hotel,Colombo,filthy towers probably worst hotel ive stayed ...,1,-1,-1
77,568472643,11899031,De Colombo Boutique Hotel,Colombo,stay peril total disappointing start holiday s...,1,-1,-1
80,565729487,11899031,De Colombo Boutique Hotel,Colombo,absolutely disgusting stay bug infested overpr...,1,-1,-1


As with TextBlob, the VADER labels mostly agree with the ratings in these samples: all five 5-star reviews shown are labelled positive, and four of the five 1-star reviews shown are labelled negative.

The same 1-star review (index 0) is again labelled positive.
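
A threshold of exactly 0 forces every score, however close to zero, into one of the two classes. The VADER documentation suggests a small neutral band instead: compound >= 0.05 is positive, compound <= -0.05 is negative, and anything in between is neutral. The sketch below applies that rule to raw compound scores, so it would have to run before the column is overwritten with binary labels. The helper name `label_with_neutral_band` is hypothetical, and only some of the sample scores come from the outputs above.

```python
def label_with_neutral_band(compound, band=0.05):
    """Return 1, 0 or -1 for positive, neutral or negative compound scores."""
    if compound >= band:
        return 1
    if compound <= -band:
        return -1
    return 0

# 0.4404, 0.9666 and -0.9946 appear in the outputs above; the rest are made up.
scores = [0.4404, 0.9666, 0.0, 0.03, -0.0499, -0.9946]
labels = [label_with_neutral_band(s) for s in scores]
print(labels)
```

Counting the zeros on the full dataset would show how many reviews the binary rule decides on very weak evidence.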

### Task 2.3. Using Transformers

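The transformer model accepts at most 512 tokens per input. As a simple safeguard, we cut each review to its first 512 characters. Note that this limits characters, not tokens, so it typically discards more text than the model requires.

In [57]:
def limit_tokens(text, max_length=512):
    # Despite its name, this slices characters, not tokens.
    return text[:max_length]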

reviews['transformer_review'] = reviews['review'].apply(limit_tokens)
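
To truncate by tokens rather than by characters, one option is to let the tokenizer do it by passing `truncation=True` and `max_length` when calling the pipeline. Hugging Face text-classification pipelines forward such keyword arguments to the tokenizer in the releases we are aware of, but this should be checked against the installed `transformers` version. The helper name `classify_truncated` and the stub classifier are hypothetical; the stub exists only so the sketch runs without downloading a model.

```python
def classify_truncated(classifier, texts, max_length=512):
    """Classify texts, truncating each one to max_length tokens.

    `classifier` is expected to behave like a transformers
    text-classification pipeline: callable on a list of strings and
    accepting tokenizer keyword arguments.
    """
    return classifier(list(texts), truncation=True, max_length=max_length)

def stub_classifier(texts, **tokenizer_kwargs):
    """Stand-in for a real pipeline: labels by a keyword, echoes the kwargs."""
    return [
        {"label": "NEGATIVE" if "worst" in t else "POSITIVE",
         "score": 1.0,
         "kwargs": tokenizer_kwargs}
        for t in texts
    ]

results = classify_truncated(stub_classifier, ["excellent food", "worst hotel"])
print([r["label"] for r in results])

# With the real model (requires downloading the weights):
# from transformers import pipeline
# clf = pipeline("sentiment-analysis",
#                model="distilbert-base-uncased-finetuned-sst-2-english")
# results = classify_truncated(clf, reviews["review"].tolist())
```

Passing the whole list in one call also returns one result dictionary per review, which avoids invoking the pipeline separately for every row.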

Now, we can use a pre-trained transformer model for sentiment analysis.

This provides two outputs, `score` and `label`, where `score` is the confidence score of the sentiment label and `label` is either 'POSITIVE' or 'NEGATIVE'.

In [58]:
sentiment_classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
reviews['transformer_sentiment'] = reviews['transformer_review'].apply(lambda text: sentiment_classifier(text)[0])

reviews['transformer_sentiment_score'] = reviews['transformer_sentiment'].apply(lambda x: x['score'])
reviews['transformer_sentiment_label'] = reviews['transformer_sentiment'].apply(lambda x: x['label'])
reviews.head()

Device set to use cpu


Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment,transformer_review,transformer_sentiment,transformer_sentiment_score,transformer_sentiment_label
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1,1,good stay found lighters toilet paper rolls no...,"{'label': 'POSITIVE', 'score': 0.9806791543960...",0.980679,POSITIVE
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1,1,definitely recommend hotel excellent food good...,"{'label': 'POSITIVE', 'score': 0.999824583530426}",0.999825,POSITIVE
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1,1,wonderful stay comfortable staycooperative sta...,"{'label': 'POSITIVE', 'score': 0.9993433356285...",0.999343,POSITIVE
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1,1,favorite 4 star hotel colombo live new york ar...,"{'label': 'POSITIVE', 'score': 0.9975016713142...",0.997502,POSITIVE
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1,1,excellent food stay excellent food especially ...,"{'label': 'POSITIVE', 'score': 0.9997791647911...",0.999779,POSITIVE


We can fold the label and the confidence into a single signed score by multiplying the score by -1 when the label is 'NEGATIVE' and leaving it unchanged when the label is 'POSITIVE'. The result is a continuous value in [-1, 1] whose sign encodes the predicted class, which makes it comparable to the TextBlob and VADER scores.

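In [61]:
# Sign the confidence score by its label. The guard keeps the cell safe to re-run:
# once the helper columns have been dropped, running it again would otherwise
# raise KeyError: 'transformer_sentiment_label'.
if 'transformer_sentiment_label' in reviews.columns:
    reviews['transformer_sentiment'] = reviews.apply(
        lambda row: row['transformer_sentiment_score'] * -1 if row['transformer_sentiment_label'] == 'NEGATIVE' else row['transformer_sentiment_score'],
        axis=1
    )
    reviews = reviews.drop(columns=["transformer_review", "transformer_sentiment_score", "transformer_sentiment_label"])
reviews.head()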

In [60]:
reviews['transformer_sentiment'].describe()

count    5186.000000
mean        0.623263
std         0.763787
min        -0.999799
25%         0.982515
50%         0.999153
75%         0.999708
max         0.999882
Name: transformer_sentiment, dtype: float64

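The mean signed score is 0.623, again indicating generally positive sentiment across the reviews. The standard deviation of 0.764 is large because the model's confidence is usually close to 1, so the signed scores cluster near -1 and +1 rather than around the mean (all three quartiles lie above 0.98).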
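
As a rough plausibility check on these two numbers, suppose every signed score were exactly +1 or -1. This is an idealization, since the real magnitudes are slightly below 1. Under it, the mean alone fixes the share of negative predictions and gives an upper bound for the standard deviation. The variable names are chosen for illustration; the two constants are copied from the `describe()` output above.

```python
import math

mean_signed = 0.623263   # mean reported by describe()
std_observed = 0.763787  # standard deviation reported by describe()

# If every score were exactly +1 or -1, then mean = 1 - 2 * p_negative.
p_negative = (1 - mean_signed) / 2

# For such a two-point variable the standard deviation is sqrt(1 - mean^2).
std_two_point = math.sqrt(1 - mean_signed ** 2)

print(round(p_negative, 3), round(std_two_point, 3))
```

Under that idealization roughly 19% of the reviews would be predicted negative, and the bound of about 0.78 sits just above the observed 0.764, as expected when the confidences are a little below 1. The exact share should be read off the binary labels rather than from this estimate.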

Once again, we can convert these continuous sentiment scores into binary labels using a threshold of 0, where scores above 0 are considered positive and scores below or equal to 0 are considered negative.

In [63]:
reviews['transformer_sentiment'] = reviews['transformer_sentiment'].apply(lambda x: 1 if x > 0 else -1)
reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment,transformer_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1,1,1
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1,1,1
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1,1,1
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1,1,1
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1,1,1


We then spot-check some 5-star and 1-star reviews against the labels.

In [64]:
five_star_reviews = reviews[reviews['rating'] == 5]
five_star_reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment,transformer_sentiment
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1,1,1
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1,1,1
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1,1,1
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1,1,1
5,1015273964,11953119,Nh Collection Colombo,Colombo,outstanding spotless immaculate premises room ...,5,1,1,1


In [65]:
one_star_reviews = reviews[reviews['rating'] == 1]
one_star_reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment,transformer_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1,1,1
22,1013561310,11953119,Nh Collection Colombo,Colombo,dont complaint restaurant food cold otherwise ...,1,-1,-1,-1
76,568472844,11899031,De Colombo Boutique Hotel,Colombo,filthy towers probably worst hotel ive stayed ...,1,-1,-1,-1
77,568472643,11899031,De Colombo Boutique Hotel,Colombo,stay peril total disappointing start holiday s...,1,-1,-1,-1
80,565729487,11899031,De Colombo Boutique Hotel,Colombo,absolutely disgusting stay bug infested overpr...,1,-1,-1,-1


As with TextBlob and VADER, the transformer labels mostly agree with the ratings in these samples: all five 5-star reviews shown are labelled positive, and four of the five 1-star reviews shown are labelled negative.

Once again, the 1-star review at index 0 is labelled positive, so all three tools disagree with its rating.

### Task 2.4. Establishing Ground Truth with Majority Voting

Here, we simply use majority voting to establish the ground truth sentiment for each review. We will consider the sentiment labels from TextBlob, VADER, and the transformer model.

In [66]:
# majority voting to establish ground truth
def majority_vote(row):
    votes = [row['text_blob_sentiment'], row['vader_sentiment'], row['transformer_sentiment']]
    return max(set(votes), key=votes.count)

reviews['ground_truth_sentiment'] = reviews.apply(majority_vote, axis=1)
reviews.head()

Unnamed: 0,review_id,location_id,hotel_name,city,review,rating,text_blob_sentiment,vader_sentiment,transformer_sentiment,ground_truth_sentiment
0,1016464488,11953119,Nh Collection Colombo,Colombo,good stay found lighters toilet paper rolls no...,1,1,1,1,1
1,1016435128,11953119,Nh Collection Colombo,Colombo,definitely recommend hotel excellent food good...,5,1,1,1,1
2,1016307864,11953119,Nh Collection Colombo,Colombo,wonderful stay comfortable staycooperative sta...,5,1,1,1,1
3,1016165618,11953119,Nh Collection Colombo,Colombo,favorite 4 star hotel colombo live new york ar...,5,1,1,1,1
4,1015472232,11953119,Nh Collection Colombo,Colombo,excellent food stay excellent food especially ...,5,1,1,1,1
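
With three voters and two possible labels a tie cannot occur, so the majority label is also the sign of the sum of the three votes. The sketch below checks that equivalence over all possible vote combinations and additionally flags unanimous rows, which may help in judging how much the three tools actually agree. The names `majority_label` and `is_unanimous` are illustrative and not part of the notebook.

```python
from collections import Counter
from itertools import product

def majority_label(votes):
    """Most common label among the votes (ties broken by first occurrence)."""
    return Counter(votes).most_common(1)[0][0]

def is_unanimous(votes):
    """True when every voter returned the same label."""
    return len(set(votes)) == 1

# All eight possible combinations of three +1/-1 votes.
for votes in product([1, -1], repeat=3):
    by_sum = 1 if sum(votes) > 0 else -1
    assert majority_label(votes) == by_sum
    print(votes, majority_label(votes), is_unanimous(votes))
```

If an even number of labelers were used, or a neutral class were added, ties would become possible and the tie-breaking rule would need to be chosen deliberately.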


Finally, we remove the unnecessary columns and save the reviews with the ground truth sentiment to a new CSV file.

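In [70]:
# Select into a new variable so the cell can be re-run without depending on
# columns that an earlier run may already have removed from `reviews`.
output_columns = ["review_id", "location_id", "hotel_name", "city", "review", "rating", "ground_truth_sentiment"]
ground_truth_reviews = reviews[output_columns]
ground_truth_reviews.to_csv("ground_truth_reviews.csv", index=False)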