# Sentiment Analysis of Yelp reviews: Word Embeddings and LSTM

In this notebook, we perform sentiment analysis on the reviews of the [YelpReviewFull dataset](https://pytorch.org/text/stable/datasets.html#yelpreviewfull), predicting the number of stars (from 1 to 5) given to each review.
This dataset consists of reviews from Yelp and is extracted from the Yelp Dataset Challenge 2015 data.
Each data point comprises the text of a review and the corresponding label (1 to 5 stars).

For this, we will first use ..WORD EMBEDDINGS... TO...    [ResNet](https://arxiv.org/abs/1512.03385)

Then, we will use XXX... to...    [ResNet](https://arxiv.org/abs/1512.03385)

We achieve a XX% ...

To build this notebook, the hints given by the [Udacity](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/sentiment-rnn/Sentiment_RNN_Solution.ipynb) team have been very useful.



We decide to group **4 and 5 stars** reviews as **"good"**, **3 stars** as **"neutral"**, and **1 and 2 stars** as **"bad"**.

In the following, we will try to **predict** if a review is good, neutral or bad.



To begin with, we install the libraries that are necessary to run the code on [Google Colab](https://colab.research.google.com/).

In [114]:
%%capture
%%bash
# The %%capture magic above hides the output of the cell.
# (A comment on the %%capture line itself is parsed as an argument and raises a UsageError.)
pip install torch==1.11.0
pip install folium==0.2.1
pip install torchdata
pip install datasets


In [115]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import torch
import torchdata
from torch import nn, optim
import torch.nn.functional as F
from torchtext import transforms, utils, models#, datasets 

## Data pre-processing:

### Data loading and reduction:

We download the data with the [Hugging Face](https://huggingface.co/) library **datasets**. 

(Downloading the data directly from torchtext.datasets is not possible for now, because of a bug that should be fixed in the next version of PyTorch.)

In [116]:
from datasets import load_dataset
train_data = load_dataset("yelp_review_full", split="train")
test_data = load_dataset("yelp_review_full", split="test")

reviews = train_data['text'] + test_data['text']
stars = train_data['label'] + test_data['label']

Reusing dataset yelp_review_full (/Users/louis/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
Reusing dataset yelp_review_full (/Users/louis/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)


We inspect one data point to check that everything is OK.

In [117]:
print(f"A review example:   \"{reviews[2]}\"")
print()
# The labels are 0-indexed (label k means k+1 stars), hence the "+ 1"
print(f"The related label (number of stars):  {stars[2] + 1} stars")

A review example:   "Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doesn't judge and asks all the right questions. Very thorough and wants to be kept in the loop on every aspect of your medical health and your life."

The related label (number of stars):  4 stars


The full dataset contains a large number of reviews: 700,000.

In [118]:
print(f"Number of reviews of the full dataset:  {len(reviews)}")

Number of reviews of the full dataset:  700000


Let us look at the full distribution of stars given to the reviews.

In [119]:
from collections import Counter

# Count each label once. The labels are 0-indexed: label k means k+1 stars.
star_counts = Counter(stars)
for label in range(5):
    unit = "STAR" if label == 0 else "STARS"
    print(f"Number of reviews with {label + 1} {unit} of the INITIAL dataset:  {100*star_counts[label]/len(stars)} %")


Number of reviews with 1 STAR of the INITIAL dataset:  20.0 %
Number of reviews with 2 STARS of the INITIAL dataset:  20.0 %
Number of reviews with 3 STARS of the INITIAL dataset:  20.0 %
Number of reviews with 4 STARS of the INITIAL dataset:  20.0 %
Number of reviews with 5 STARS of the INITIAL dataset:  20.0 %


We keep only **20%** of the original YelpReviewFull dataset for our analysis (140,000 reviews).
We select the same fraction of reviews for each number of stars, **so that the equal distribution of stars is preserved**.

In [120]:
# We choose the indices that will be used
indices_1_Star = [index for index in range(len(reviews)) if stars[index]==0]
indices_2_Star = [index for index in range(len(reviews)) if stars[index]==1]
indices_3_Star = [index for index in range(len(reviews)) if stars[index]==2]
indices_4_Star = [index for index in range(len(reviews)) if stars[index]==3]
indices_5_Star = [index for index in range(len(reviews)) if stars[index]==4]

# We shuffle these indices and randomly select 20% of them
np.random.seed(123)
np.random.shuffle(indices_1_Star)
np.random.shuffle(indices_2_Star)
np.random.shuffle(indices_3_Star)
np.random.shuffle(indices_4_Star)
np.random.shuffle(indices_5_Star)
selected_indices_1_Star = indices_1_Star[:len(indices_1_Star)//5]
selected_indices_2_Star = indices_2_Star[:len(indices_2_Star)//5]
selected_indices_3_Star = indices_3_Star[:len(indices_3_Star)//5]
selected_indices_4_Star = indices_4_Star[:len(indices_4_Star)//5]
selected_indices_5_Star = indices_5_Star[:len(indices_5_Star)//5]

selected_indices = selected_indices_1_Star + selected_indices_2_Star + selected_indices_3_Star + selected_indices_4_Star + selected_indices_5_Star
reviews = [reviews[index] for index in selected_indices]
stars = [stars[index] for index in selected_indices]

print(f"Number of reviews kept for our analysis:  {len(reviews)}")

Number of reviews kept for our analysis:  140000


We check that the distribution of different stars is still the same.

In [121]:
print(f"Number of reviews with 1 STAR of the SELECTED dataset:  {100*stars.count(0)/len(stars)} %")
print(f"Number of reviews with 2 STARS of the SELECTED dataset:  {100*stars.count(1)/len(stars)} %")
print(f"Number of reviews with 3 STARS of the SELECTED dataset:  {100*stars.count(2)/len(stars)} %")
print(f"Number of reviews with 4 STARS of the SELECTED dataset:  {100*stars.count(3)/len(stars)} %")
print(f"Number of reviews with 5 STARS of the SELECTED dataset:  {100*stars.count(4)/len(stars)} %")

Number of reviews with 1 STAR of the SELECTED dataset:  20.0 %
Number of reviews with 2 STARS of the SELECTED dataset:  20.0 %
Number of reviews with 3 STARS of the SELECTED dataset:  20.0 %
Number of reviews with 4 STARS of the SELECTED dataset:  20.0 %
Number of reviews with 5 STARS of the SELECTED dataset:  20.0 %


### Data tokenization:

We now **tokenize** the reviews in order to make them readable by our neural network. For this, we carry out the following steps:

- First, we **remove punctuation**.

- Second, we **tokenize** the words, by assigning to each of them an **integer**.

- Finally, we **translate** the reviews into tokenized reviews (by replacing each word by its corresponding integer).
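
The three steps above can be sketched on a toy corpus as follows. This is only a minimal illustration, not the pipeline used in the rest of the notebook: the helper names (`clean`, `build_vocab`, `encode`) and the two toy reviews are assumptions made for the example. Ties between equally frequent words are broken alphabetically so that the result is reproducible.

```python
from collections import Counter
from string import punctuation


def clean(review):
    """Lower-case a review and strip every punctuation character."""
    return review.lower().translate(str.maketrans('', '', punctuation))


def build_vocab(reviews):
    """Map each word to an integer, most frequent word first (integers start at 0)."""
    counts = Counter(word for review in reviews for word in clean(review).split())
    # Sort by decreasing count, then alphabetically, so that ties are deterministic
    sorted_words = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(sorted_words)}


def encode(review, vocab):
    """Replace each word of a review by its integer."""
    return [vocab[word] for word in clean(review).split()]


toy_reviews = ["Great food, great service!", "Bad food."]  # hypothetical examples
vocab = build_vocab(toy_reviews)
encoded_reviews = [encode(review, vocab) for review in toy_reviews]
print(vocab)
print(encoded_reviews)
```

Stripping punctuation character by character (rather than dropping whole tokens) ensures that a word such as "food." is mapped to the same integer as "food".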

Let us begin with the XXXX

### Transforming stars into "good", "neutral", and "bad" reviews

We decide to group **4 and 5 stars** reviews as **"good"**, **3 stars** as **"neutral"**, and **1 and 2 stars** as **"bad"**.

In the following, we will try to **predict** if a review is good, neutral or bad.
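
A minimal sketch of this grouping is given here. It assumes that the labels are 0-indexed, as in the cells above (label 0 is 1 star, label 4 is 5 stars). The function name `stars_to_sentiment` and the integer encoding of the classes (0 = bad, 1 = neutral, 2 = good) are illustrative choices, not something fixed by this notebook.

```python
SENTIMENT_NAMES = ["bad", "neutral", "good"]  # assumed class encoding: 0, 1, 2


def stars_to_sentiment(label):
    """Map a 0-indexed star label (0..4) to a sentiment class (0, 1 or 2)."""
    if label not in range(5):
        raise ValueError(f"Unexpected label: {label}")
    if label <= 1:   # 1 or 2 stars
        return 0     # bad
    if label == 2:   # 3 stars
        return 1     # neutral
    return 2         # 4 or 5 stars: good


example_labels = [0, 1, 2, 3, 4]  # hypothetical labels, one per star rating
sentiments = [stars_to_sentiment(label) for label in example_labels]
print([SENTIMENT_NAMES[sentiment] for sentiment in sentiments])
```

Since each number of stars covers 20% of the reviews, this grouping gives 40% bad, 20% neutral, and 40% good reviews, so the three classes are not balanced.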

We transform the datasets into **iterators**.

In [28]:
train_iter = iter(train_data)
test_iter = iter(test_data)

# Example to see if all OK
next(train_iter)

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [29]:
train_text = train_data['text']
train_label = train_data['label']


We then **split** the text into individual words (**tokens**).
We also convert the text to lower case and remove punctuation.

In [None]:
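# Split the lower-cased reviews into a list of words (tokens).
# We loop over the list train_text: the iterator train_iter cannot be indexed with 'text'.
all_reviews_tokens = []
for review in train_text:
    all_reviews_tokens += review.lower().split()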
    


In [None]:
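# Remove punctuation
from string import punctuation

# Strip the punctuation characters attached to each token (e.g. 'practitioner.' -> 'practitioner'),
# then drop the tokens that become empty
remove_punctuation = str.maketrans('', '', punctuation)
all_reviews_tokens = [token.translate(remove_punctuation) for token in all_reviews_tokens]
all_reviews_tokens = [token for token in all_reviews_tokens if token]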

We then **encode** the words into integers that can be fed into the neural network.

In [None]:
## We write a dictionary mapping the words (tokens) to integers
from collections import Counter

occurences = Counter(all_reviews_tokens)  # number of occurrences of each word (token)
sorted_tokens = sorted(occurences, key=occurences.get, reverse=True)  # sort words from most to least frequent
dictionary_tokens_to_int = {token: ii for ii, token in enumerate(sorted_tokens)}

In [None]:

    
# Remove punctuation
from string import punctuation
all_tokens = ' '.join(tokens)  # `tokens` is the list of raw words, assumed to be defined in an earlier cell
all_tokens = ''.join([letter for letter in all_tokens if letter not in punctuation])

# Tokens without punctuation.
# all_tokens is a single string: we split it once instead of looping over its characters.
full_text_tokens = all_tokens.split()

In [None]:
print(all_tokens[0:53])

dr goldberg offers everything i look for in a general


In [None]:
from collections import Counter

## We write a dictionary mapping the words (tokens) to integers
occurences = Counter(full_text_tokens)  # number of occurrences of each word (token)
sorted_tokens = sorted(occurences, key=occurences.get, reverse=True)  # sort words from most to least frequent
sorted_tokens_to_int = {token: ii for ii, token in enumerate(sorted_tokens)}

## Use the dictionary to replace each token by its integer
## and store the result as a flat list of integers in full_text_tokens_ints
full_text_tokens_ints = [sorted_tokens_to_int[token] for token in full_text_tokens]

KeyboardInterrupt: 

In [None]:
tokens

['dr.',
 'goldberg',
 'offers',
 'everything',
 'i',
 'look',
 'for',
 'in',
 'a',
 'general',
 'practitioner.',
 "he's",
 'nice',
 'and',
 'easy',
 'to',
 'talk',
 'to',
 'without',
 'being',
 'patronizing;',
 "he's",
 'always',
 'on',
 'time',
 'in',
 'seeing',
 'his',
 'patients;',
 "he's",
 'affiliated',
 'with',
 'a',
 'top-notch',
 'hospital',
 '(nyu)',
 'which',
 'my',
 'parents',
 'have',
 'explained',
 'to',
 'me',
 'is',
 'very',
 'important',
 'in',
 'case',
 'something',
 'happens',
 'and',
 'you',
 'need',
 'surgery;',
 'and',
 'you',
 'can',
 'get',
 'referrals',
 'to',
 'see',
 'specialists',
 'without',
 'having',
 'to',
 'see',
 'him',
 'first.',
 'really,',
 'what',
 'more',
 'do',
 'you',
 'need?',
 "i'm",
 'sitting',
 'here',
 'trying',
 'to',
 'think',
 'of',
 'any',
 'complaints',
 'i',
 'have',
 'about',
 'him,',
 'but',
 "i'm",
 'really',
 'drawing',
 'a',
 'blank.',
 'Unfortunately,',
 'the',
 'frustration',
 'of',
 'being',
 'Dr.',
 "Goldberg's",
 'patient',
 