# Sentiment Analysis of Yelp reviews: Word Embeddings and LSTM

In this notebook, we implement a sentiment analysis to associate stars (from 1 to 5) to the reviews of the [YelpReviewFull dataset](https://pytorch.org/text/stable/datasets.html#yelpreviewfull). 
This dataset consists of reviews from Yelp, and is extracted from the Yelp Dataset Challenge 2015 data. 
A data point of this dataset comprises a review's text and the corresponding label (1 to 5 stars).

For this, we will first use ..WORD EMBEDDINGS... TO...    [ResNet](https://arxiv.org/abs/1512.03385)

Then, we will use XXX... to...    [ResNet](https://arxiv.org/abs/1512.03385)

We achieve a XX% ...

To build this notebook, the hints given by the [Udacity](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/sentiment-rnn/Sentiment_RNN_Solution.ipynb) team have been very useful.



We decide to group **4 and 5 stars** reviews as **"good"**, **3 stars** as **"neutral"**, and **1 and 2 stars** as **"bad"**.

In the following, we will try to **predict** if a review is good, neutral or bad.



To begin with, we install the libraries that are necessary to run the code on [Google Colab](https://colab.research.google.com/).

In [71]:
%%capture
%%bash
# The %%capture magic above hides the output of this cell.
pip install torch==1.11.0
pip install folium==0.2.1
pip install torchdata
pip install datasets


In [72]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import torch
import torchdata
from torch import nn, optim
import torch.nn.functional as F
from torchtext import transforms, utils, models#, datasets 

## Data pre-processing:

### Data loading and reduction:

We download the data with the [Hugging Face](https://huggingface.co/) library **datasets**. 

(Downloading the data directly from torchtext.datasets is not possible for now, because of a bug that should be corrected in the next version of PyTorch.)

In [73]:
from datasets import load_dataset
train_data = load_dataset("yelp_review_full", split="train")
test_data = load_dataset("yelp_review_full", split="test")

reviews = train_data['text'] + test_data['text']
stars = train_data['label'] + test_data['label']

Reusing dataset yelp_review_full (/Users/louis/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
Reusing dataset yelp_review_full (/Users/louis/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)


We look at one example to check that everything is OK.

In [74]:
print(f"A review example:   \"{reviews[2]}\"")
print()
# Labels are stored as 0-4, so we add 1 to get the number of stars
print(f"The related label (number of stars):  {stars[2] + 1} stars")

A review example:   "Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doesn't judge and asks all the right questions. Very thorough and wants to be kept in the loop on every aspect of your medical health and your life."

The related label (number of stars):  4 stars


The full dataset is large: it contains 700,000 reviews.

In [75]:
print(f"Number of reviews of the full dataset:  {len(reviews)}")

Number of reviews of the full dataset:  700000


Let us look at the full distribution of stars given to the reviews.

In [76]:
from collections import Counter

# Labels are stored as 0-4: label 0 means 1 star, label 4 means 5 stars
print(f"Percentage of reviews with 1 STAR of the INITIAL dataset:  {100*stars.count(0)/len(stars)} %")
print(f"Percentage of reviews with 2 STARS of the INITIAL dataset:  {100*stars.count(1)/len(stars)} %")
print(f"Percentage of reviews with 3 STARS of the INITIAL dataset:  {100*stars.count(2)/len(stars)} %")
print(f"Percentage of reviews with 4 STARS of the INITIAL dataset:  {100*stars.count(3)/len(stars)} %")
print(f"Percentage of reviews with 5 STARS of the INITIAL dataset:  {100*stars.count(4)/len(stars)} %")


Percentage of reviews with 1 STAR of the INITIAL dataset:  20.0 %
Percentage of reviews with 2 STARS of the INITIAL dataset:  20.0 %
Percentage of reviews with 3 STARS of the INITIAL dataset:  20.0 %
Percentage of reviews with 4 STARS of the INITIAL dataset:  20.0 %
Percentage of reviews with 5 STARS of the INITIAL dataset:  20.0 %


We decide to keep only **20%** of the original YelpReviewFull dataset for our analysis (140,000 reviews).
We carry out the selection separately for each number of stars, which **preserves the equal distribution of stars**.

In [77]:
# We choose the indices that will be used
indices_1_Star = [index for index in range(len(reviews)) if stars[index]==0]
indices_2_Star = [index for index in range(len(reviews)) if stars[index]==1]
indices_3_Star = [index for index in range(len(reviews)) if stars[index]==2]
indices_4_Star = [index for index in range(len(reviews)) if stars[index]==3]
indices_5_Star = [index for index in range(len(reviews)) if stars[index]==4]

# We shuffle these indices and randomly select 20% of them
np.random.seed(123)
np.random.shuffle(indices_1_Star)
np.random.shuffle(indices_2_Star)
np.random.shuffle(indices_3_Star)
np.random.shuffle(indices_4_Star)
np.random.shuffle(indices_5_Star)
selected_indices_1_Star = indices_1_Star[:len(indices_1_Star)//5]
selected_indices_2_Star = indices_2_Star[:len(indices_2_Star)//5]
selected_indices_3_Star = indices_3_Star[:len(indices_3_Star)//5]
selected_indices_4_Star = indices_4_Star[:len(indices_4_Star)//5]
selected_indices_5_Star = indices_5_Star[:len(indices_5_Star)//5]

selected_indices = selected_indices_1_Star + selected_indices_2_Star + selected_indices_3_Star + selected_indices_4_Star + selected_indices_5_Star
reviews = [reviews[index] for index in selected_indices]
stars = [stars[index] for index in selected_indices]

print(f"Number of reviews kept for our analysis:  {len(reviews)}")

Number of reviews kept for our analysis:  140000
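
The per-class selection above can also be written as a loop over the labels. The following is a minimal sketch on toy labels, not the code used in this notebook: `stratified_subsample` is a hypothetical helper, and because it uses its own random generator it would not select the same reviews as the cell above.

```python
import numpy as np

def stratified_subsample(labels, fraction, seed=123):
    """Return indices that keep `fraction` of each class, chosen at random."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    selected = []
    for label in np.unique(labels):
        # Indices of all the reviews that carry this label
        class_indices = np.flatnonzero(labels == label)
        rng.shuffle(class_indices)
        n_keep = int(len(class_indices) * fraction)
        selected.extend(class_indices[:n_keep].tolist())
    return selected

toy_labels = [0, 1, 2, 3, 4] * 10   # 10 toy reviews per class
toy_selection = stratified_subsample(toy_labels, fraction=0.2)
print(f"Number of toy reviews kept: {len(toy_selection)}")
```

Each class contributes the same share of its reviews, so a balanced dataset stays balanced.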


We check that the distribution of different stars is still the same.

In [78]:
print(f"Percentage of reviews with 1 STAR of the SELECTED dataset:  {100*stars.count(0)/len(stars)} %")
print(f"Percentage of reviews with 2 STARS of the SELECTED dataset:  {100*stars.count(1)/len(stars)} %")
print(f"Percentage of reviews with 3 STARS of the SELECTED dataset:  {100*stars.count(2)/len(stars)} %")
print(f"Percentage of reviews with 4 STARS of the SELECTED dataset:  {100*stars.count(3)/len(stars)} %")
print(f"Percentage of reviews with 5 STARS of the SELECTED dataset:  {100*stars.count(4)/len(stars)} %")

Percentage of reviews with 1 STAR of the SELECTED dataset:  20.0 %
Percentage of reviews with 2 STARS of the SELECTED dataset:  20.0 %
Percentage of reviews with 3 STARS of the SELECTED dataset:  20.0 %
Percentage of reviews with 4 STARS of the SELECTED dataset:  20.0 %
Percentage of reviews with 5 STARS of the SELECTED dataset:  20.0 %


### Data tokenization:

We now **tokenize** the reviews so that our neural network can read them. For this we carry out the following steps:

- First, we **remove punctuation** and convert **upper case to lower case**.

- Second, we **tokenize** the words by assigning an **integer** to each of them.

- Finally, we **translate** the reviews into tokenized reviews (by replacing each word with its corresponding integer).

Let us begin with punctuation and upper case.

In [79]:
from string import punctuation

# Remove punctuation and upper cases
all_reviews = ' new_review_xyz '.join([review.translate(str.maketrans('', '', punctuation)) for review in reviews]) 
all_reviews = all_reviews.lower()

# Split again the different reviews
reviews_split = all_reviews.split(' new_review_xyz ')
all_reviews = ' '.join(reviews_split)

# Create a list containing all the words
all_words = all_reviews.split()

# We delete variables that are no longer needed to free RAM
del all_reviews

# Check if all is OK with one review
print(reviews_split[0])

food is good  service at the bar is excellent   franco dropping f bombs at the the bar is a total turn off  tell him to stay in the kitchennnwe came back 6 months later  and had the most bizarre experience  the owner chef came out of the kitchen and verbally attacked a table of patronsnhe said that he was not happy with their disappointment with one of their dinners  he challenged them to the parking lot where he was going to punch the fbomb out of them the customers told him to stand back and even asked if the police could be called  the food is still good yet over pricedngo only at your own risk
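
The cleaned review above contains fused words such as `kitchennnwe` and `patronsnhe`. This suggests that the raw text holds literal backslash-n sequences: the backslash is removed with the punctuation, but the `n` stays and glues the neighbouring words together. A minimal sketch of a cleaning step that replaces these sequences with spaces first is shown below. `clean_review` is a hypothetical helper and the sample string is made up. Using it would change the vocabulary size and the length statistics reported in this notebook.

```python
from string import punctuation

PUNCTUATION_TABLE = str.maketrans('', '', punctuation)

def clean_review(text):
    """Lower-case a review and strip punctuation without fusing words."""
    # Replace literal backslash-n sequences (two characters) and real newlines by spaces
    text = text.replace('\\n', ' ').replace('\n', ' ')
    return text.translate(PUNCTUATION_TABLE).lower()

# Made-up sample that contains two literal backslash-n sequences
sample = 'Stay in the kitchen.\\n\\nWe came back 6 months later!'
print(clean_review(sample).split())
```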


Now, let us **encode the words as integers**: we build a **dictionary** that maps every word present in the reviews to an integer, the most frequent word receiving the value 1. This lets us **convert each review into a list of integers**. We **reserve the value 0 for padding**, which is applied in the next section so that all reviews have the same size.

In [80]:
# Count occurrences of each word
words_occurences = Counter(all_words)
# Sort the words from the most to the least frequent
words_sorted = sorted(words_occurences, key=words_occurences.get, reverse=True)
# Build the dictionary that maps words to integers, starting at 1: "0" is reserved for the padding
word_to_integer = {word: integer for integer, word in enumerate(words_sorted, start=1)}


## Tokenize each review in reviews_split and store the tokenized reviews in reviews_tokenized
reviews_tokenized = []
for review in reviews_split:
    reviews_tokenized.append([word_to_integer[word] for word in review.split()])
    

In [81]:
# Number of unique words
print('Number of unique words: ', len(word_to_integer))
print()

print('Example of tokenized review: \n', reviews_tokenized[1])
print()

print('The same review before tokenization: \n', reviews_split[1])


# We delete variables that are no longer needed to free RAM
del all_words
del reviews_split
del words_occurences
del words_sorted

Number of unique words:  279087

Example of tokenized review: 
 [4, 6, 10, 482, 133, 5, 3545, 58405, 3330, 296, 2, 4644, 206, 6, 44, 7, 1, 206, 3875, 75, 1, 296, 4, 76, 3, 120, 25, 32535, 10600, 3331, 312, 63, 23, 3, 52, 618, 14, 120, 9070, 134, 35816, 33, 19, 74, 144, 20, 1412, 6, 669, 16, 856, 1486, 1782, 6, 10, 537, 7, 625, 375, 7, 4548, 4420, 374, 44, 429, 3742, 16, 146, 1082, 24, 605, 32, 9, 1279, 388, 1840, 553, 255, 2, 62, 85374, 100, 59, 56, 32]

The same review before tokenization: 
 i was in las vegas to attend asd trade show and ti hotel was one of the hotel referred by the show  i got a room at 33rd flood facing strip  which had a great view but room smelt bad furnitures as you can see on picture was dirty with dark spots nobody was in charge of taking care of hallways leftover dishes  one ice bucket with few glasses were stayed there for 24 hours  towels looks old and some dirtynnever never go back there
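
To make the mapping concrete, here is the same procedure on three made-up reviews. The names prefixed with `toy_` are illustrative and not part of the notebook. The reverse dictionary is an addition that can help to check a tokenized review by decoding it.

```python
from collections import Counter

toy_reviews = ["good food good service", "good food", "good"]

# Rank the words by frequency: the most frequent word receives 1 (0 is left for padding)
toy_counts = Counter(" ".join(toy_reviews).split())
toy_sorted = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_word_to_integer = {word: integer for integer, word in enumerate(toy_sorted, start=1)}

# Reverse mapping, used only to decode tokenized reviews for checking
toy_integer_to_word = {integer: word for word, integer in toy_word_to_integer.items()}

toy_tokenized = [[toy_word_to_integer[word] for word in review.split()] for review in toy_reviews]
toy_decoded = [" ".join(toy_integer_to_word[i] for i in tokens) for tokens in toy_tokenized]

print(toy_word_to_integer)
print(toy_tokenized)
```

Note that a plain dictionary lookup raises a `KeyError` for a word that was not seen when the dictionary was built, so new reviews would need an explicit rule for unknown words.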


### Preparation of the features matrix (containing tokenized customized reviews)

We prepare the review matrix by **standardizing** the length of the reviews to **300** words, which keeps the neural network computations reasonable:

- We **check** for reviews with no text.

- We  **add zeros** to the reviews that are **too short**.

- We **truncate** reviews that are **too long**.


Let us first **check** whether there are **reviews with no text**.

In [83]:
# Size of the reviews
reviews_lenghts = [len(x) for x in reviews_tokenized]

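print("Minimal number of words in a review: {}".format(min(reviews_lenghts)))


Minimal number of words in a review: 0


The minimal length is **0**, which means that **at least one review has no words** left after cleaning. Such a review becomes a row of zeros (padding only) in the matrix of features built below.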
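
Note that `min(reviews_lenghts)` returns the length of the shortest review, not a number of reviews. To count the empty reviews explicitly, and to drop them together with their labels if needed, one possible sketch on toy data is the following (all names are illustrative):

```python
toy_tokenized = [[4, 6, 10], [], [1, 2], []]
toy_labels = [3, 0, 4, 2]

# Count the reviews that have no words at all
toy_lengths = [len(review) for review in toy_tokenized]
n_empty = sum(1 for length in toy_lengths if length == 0)
print(f"Shortest review: {min(toy_lengths)} words, empty reviews: {n_empty}")

# Keep only the non-empty reviews, and keep the labels aligned with them
kept = [(review, label) for review, label in zip(toy_tokenized, toy_labels) if review]
kept_reviews = [review for review, _ in kept]
kept_labels = [label for _, label in kept]
```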

Now, let us check the **maximal, median, and average number of words in a review**.

In [84]:
print("Maximal number of words in a review: {}".format(max(reviews_lenghts)))
print("Median number of words in a review: {}".format(np.median(reviews_lenghts)))
print("Average number of words in a review: {}".format(np.mean(reviews_lenghts)))
print("Average number of words in a review + 2 standard deviations: {}".format(np.mean(reviews_lenghts) + 2*np.std(reviews_lenghts)))

Maximal number of words in a review: 1025
Median number of words in a review: 99.0
Average number of words in a review: 133.38750714285715
Average number of words in a review + 2 standard deviations: 374.6552661842528
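
Another way to choose the cutoff is to look at the share of reviews that fit entirely within a candidate length. The sketch below uses made-up lengths, so its numbers say nothing about the Yelp data. On the notebook data, the same computation could be applied to `reviews_lenghts`.

```python
import numpy as np

# Made-up review lengths (in words), for illustration only
toy_lengths = np.array([20, 45, 99, 120, 150, 280, 310, 500, 1025, 80])

# Share of reviews that are not truncated for each candidate cutoff
coverage = {cutoff: float(np.mean(toy_lengths <= cutoff)) for cutoff in (100, 300, 500)}
for cutoff, share in coverage.items():
    print(f"Cutoff {cutoff}: {100 * share:.0f} % of the toy reviews fit entirely")
```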


Based on these statistics, we fix the length of our reviews at **300 words**: this is well above the median (99) and the average (about 133), and below the average plus two standard deviations (about 375).

Thus, let us **truncate** reviews that are too long, and **add zeros** to reviews that are too short, to create the **matrix of features** that will be used as **predictor** in our neural network.

In [113]:
# Create the matrix of features that will be fed to the neural network
# Its shape must be (number_of_reviews, maximum_number_of_words_per_review)
max_words_per_review = 300
matrix_features = np.zeros((len(reviews_tokenized), max_words_per_review), dtype=int)

# We put the tokenized words in the matrix, for each review (and truncate at the maximum review length)
for review_number, words_in_review in enumerate(reviews_tokenized):
    matrix_features[review_number, :len(words_in_review)] = np.array(words_in_review)[:max_words_per_review]
    
# We look at the last columns of the matrix of features (tokenized and standardized reviews), to check if all is OK
print(matrix_features[:5, -5:])
    

[[  0   0   0   0   0]
 [  0   0   0   0   0]
 [  1 227  55  79  72]
 [  0   0   0   0   0]
 [  0   0   0   0   0]]
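
The padding and truncation logic is easier to check on a small example. The sketch below wraps the same idea in a hypothetical helper, `pad_or_truncate`, and applies it to three toy reviews with a maximum length of 4. Like the cell above, it pads with zeros on the right.

```python
import numpy as np

def pad_or_truncate(tokenized_reviews, max_len):
    """Build a (number_of_reviews, max_len) matrix, zero-padded on the right."""
    features = np.zeros((len(tokenized_reviews), max_len), dtype=int)
    for row, tokens in enumerate(tokenized_reviews):
        kept = tokens[:max_len]              # truncate reviews that are too long
        features[row, :len(kept)] = kept     # the remaining columns stay at 0
    return features

toy_features = pad_or_truncate([[5, 3], [7, 1, 9, 4, 2, 8], []], max_len=4)
print(toy_features)
```

The short review is padded, the long one is cut after 4 words, and the empty one is a row of zeros.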


### Transforming stars into "good", "neutral", and "bad" reviews

We decide to group **4 and 5 stars** reviews as **"good"**, **3 stars** as **"neutral"**, and **1 and 2 stars** as **"bad"**.

In the following, we will try to **predict** if a review is good, neutral or bad.

In [118]:
# Labels are stored as 0-4 (number of stars minus 1)
# 5 or 4 stars => "good"
good = [1 if star in (3, 4) else 0 for star in stars]
# 3 stars => "neutral"
neutral = [1 if star == 2 else 0 for star in stars]
# 2 or 1 star => "bad"
bad = [1 if star in (0, 1) else 0 for star in stars]

# 3 variables put together in a matrix
goodness = np.array([good, neutral, bad], dtype=int)
goodness = goodness.T
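
The same grouping can be expressed with NumPy indexing. The sketch below is an alternative on toy labels, not the code used in this notebook. The lookup array `label_to_column` is an illustrative name, and the columns follow the same order as `goodness` (good, neutral, bad).

```python
import numpy as np

toy_stars = np.array([0, 1, 2, 3, 4, 4])   # labels 0-4, i.e. 1 to 5 stars

# Column of the one-hot matrix for each label: 0 = good, 1 = neutral, 2 = bad
label_to_column = np.array([2, 2, 1, 0, 0])

# Picking rows of the identity matrix yields one-hot vectors
toy_goodness = np.eye(3, dtype=int)[label_to_column[toy_stars]]
print(toy_goodness)
```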

### Preparation of the training, test, and validation sets

We now divide our prepared data into the **training set** (**80% of the data**), the **test set** (**10% of the data**), and the **validation set** (**10% of the data**).

In [122]:
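train_prop = 0.8
valid_prop = 0.1
test_prop = 1 - (train_prop+valid_prop)

train_len = int(len(reviews_tokenized)*train_prop)
valid_len = int(len(reviews_tokenized)*valid_prop)

# The selected reviews are still grouped by number of stars, so we shuffle
# the features and the labels with the same permutation before splitting
permutation = np.random.permutation(len(matrix_features))
matrix_features, goodness = matrix_features[permutation], goodness[permutation]

train_reviews, valid_reviews, test_reviews = matrix_features[:train_len], matrix_features[train_len:train_len+valid_len], matrix_features[train_len+valid_len:]
train_goodness, valid_goodness, test_goodness = goodness[:train_len], goodness[train_len:train_len+valid_len], goodness[train_len+valid_len:]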

## We check the number of reviews in each dataset
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_reviews.shape), 
      "\nValidation set: \t{}".format(valid_reviews.shape),
      "\nTest set: \t\t{}".format(test_reviews.shape))

			Feature Shapes:
Train set: 		(112000, 300) 
Validation set: 	(14000, 300) 
Test set: 		(14000, 300)
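
The selected reviews were concatenated star by star (see `selected_indices`), so the rows are grouped by class. A sequential split is only representative if the rows are in random order, which makes it worth checking the share of each class in every split. The sketch below illustrates this on toy labels. `class_shares` is a hypothetical helper, and on the notebook data it could be applied to `train_goodness`, `valid_goodness` and `test_goodness`.

```python
import numpy as np

def class_shares(one_hot_labels):
    """Share of each class (columns: good, neutral, bad) in a split."""
    return one_hot_labels.mean(axis=0)

# Toy labels grouped by class: 4 "bad", then 2 "neutral", then 4 "good"
toy_goodness = np.repeat(np.eye(3, dtype=int)[::-1], [4, 2, 4], axis=0)
split_at = int(len(toy_goodness) * 0.8)

# Sequential split of grouped data: the held-out tail contains a single class
print("Sequential tail:", class_shares(toy_goodness[split_at:]))

# Shuffling the rows first mixes the classes across the splits
rng = np.random.default_rng(123)
shuffled = toy_goodness[rng.permutation(len(toy_goodness))]
print("Shuffled tail:  ", class_shares(shuffled[split_at:]))
```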


## Building the Neural Network