## 1. Project Description
### 1.1 Project Objectives
This project has two main objectives:
1. Build an RNN model that predicts (classifies) whether a customer review of a restaurant is positive or negative.
2. Build an unsupervised machine learning model that extracts key topics/themes from reviews.

<br>

### 1.2 Project Background
To stay competitive, a business needs to understand its customer feedback. What are customers saying about us? Are they happy with our products and services? Where do we need to improve?

One important source of customer feedback is the reviews posted on specialized customer review platforms such as Google Reviews and Tripadvisor. These sites allow customers to post a text review together with a rating (usually 1-5 stars). Companies can segment these reviews by rating, or simply into positive and negative reviews. They can then analyze the negative reviews to understand their shortcomings and study the positive reviews to build on their strengths.

However, with the advent of smart devices, more and more people are turning to social media to post their experiences with a brand. These posts matter a great deal (because of word of mouth), but they are much harder for companies to analyze efficiently, because:
- there are no ratings (1-5)
- someone has to go through each post manually and sort it into positive or negative, which is very resource intensive

It would therefore be highly valuable to train an automatic review classifier on reviews that have a rating (e.g. data from specialized review platforms such as Tripadvisor), and then use it to predict whether a social media post about a company (such as a Tweet) is positive or negative. It would be even more valuable to develop a model that automatically extracts keywords/themes from the reviews, as this gives human analysts a more efficient way to understand and analyze the data.

<br>

### 1.3 Project Evaluation
Part 1 Classification:

Review data tends to have an imbalanced class distribution:
- either many negative reviews and complaints vs. positive reviews, or
- an overwhelming proportion of positive reviews/high ratings vs. negative reviews

Accuracy is therefore not a suitable evaluation metric: a naive strategy of always predicting the majority class would give a (misleadingly) high accuracy.

The classification model results will be evaluated using the F1 score. The F1 score is the harmonic mean of precision and recall and it is calculated as:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where

$$\text{precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

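Unlike accuracy, the F1 score ignores true negatives, so always predicting the majority class no longer earns a high score when the minority class is treated as the positive class. The trade-off is that the F1 score is less interpretable than accuracy.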
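To make the contrast between accuracy and F1 concrete, here is a minimal sketch in plain Python. The 90/10 class split, the label encoding (1 = minority class) and the helper names `precision_recall_f1` and `accuracy` are illustrative assumptions, not part of the project data or code.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Convention assumed here: a ratio with a zero denominator is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels: 90 majority-class reviews (0) and 10 minority-class reviews (1)
y_true = [0] * 90 + [1] * 10
# Naive strategy: always predict the majority class
y_naive = [0] * 100

naive_accuracy = accuracy(y_true, y_naive)
_, _, minority_f1 = precision_recall_f1(y_true, y_naive, positive=1)
_, _, majority_f1 = precision_recall_f1(y_true, y_naive, positive=0)

print(f"accuracy={naive_accuracy:.3f}")
print(f"F1 (minority as positive)={minority_f1:.3f}")
print(f"F1 (majority as positive)={majority_f1:.3f}")
```

In this toy setting the naive strategy reaches 90% accuracy but an F1 of zero on the minority class. Note that the F1 computed with the majority class as positive stays high, so the choice of positive class (or an average over both classes) matters when reporting the score.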

Part 2 Topic Modelling:

Unlike supervised methods, unsupervised methods don't have a "ground truth" to evaluate against. There are quantitative methods, such as various distance metrics, that can be used to evaluate topic models, but in this project we feel it is more important to apply human logic and judgement. The topic model will therefore be evaluated on whether it provides meaningful and interpretable results.

-------------

## 2. Data Description
### 2.1 Data Understanding
The dataset for this project contains English reviews for restaurants on Tripadvisor. The full dataset contains 6 CSVs, one for each of the following cities: Barcelona, London, Madrid, New Delhi, New York and Paris. Our project will focus on the Barcelona data.

The Barcelona data contains:
- more than 416k observations (reviews)
- 12 documented columns/features (the raw CSV also contains an unnamed index column, `Unnamed: 0`)

Feature definition:
1. parse_count: numerical (integer), sequence number of the review as extracted by the web scraper (auto-incremental, starts from 1)
2. author_id: categorical (string), unique, incremental and anonymous identifier of the user (UID_XXXXXXXXXX)
3. restaurant_name: categorical (string), name of the restaurant matching the review
4. rating_review: numerical (integer), review score in the range 1-5
5. sample: categorical (string), indicating “positive” sample for scores [4-5] and “negative” for scores [1-3]
6. review_id: categorical (string), unique internal identifier of the review (review_XXXXXXXXX)
7. title_review: text, review title
8. review_preview: text, preview of the review, truncated on the website when the text is very long
9. review_full: text, complete review
10. date: timestamp, publication date of the review, stored as text in the form month day, year (e.g. October 12, 2020)
11. city: categorical (string), city of the restaurant which the review was written for
12. url_restaurant: text, restaurant url
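The rule behind the `sample` column (item 5) can be sketched as a small function. `label_from_rating` is a hypothetical helper written for illustration, not code from the dataset authors, and the capitalised labels are an assumption based on the values that appear in the data itself.

```python
def label_from_rating(rating: int) -> str:
    """Map a 1-5 star rating to the binary label described for `sample`."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating!r}")
    # Scores 4-5 are treated as positive, scores 1-3 as negative
    return "Positive" if rating >= 4 else "Negative"


# Show the label assigned to every possible rating
for stars in range(1, 6):
    print(stars, label_from_rating(stars))
```

Grouping 3-star reviews with the negative class is a choice inherited from the dataset definition; a different threshold would change the class balance.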

### 2.2 Data Source and Acknowledgements
The dataset used for this project was found on [Kaggle](https://www.kaggle.com/datasets/inigolopezrioboo/a-tripadvisor-dataset-for-nlp-tasks?select=Barcelona_reviews.csv).

If you use these data, please cite the dataset using the associated [Zenodo DOI](https://doi.org/10.5281/zenodo.6583422), as is done here, and cite the [related paper](https://arxiv.org/abs/2205.01759):

> Botana, Iñigo López-Riobóo, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas, and Amparo Alonso-Betanzos. "Explain and Conquer: Personalised Text-based Reviews to Achieve Transparency." arXiv preprint arXiv:2205.01759 (2022).

Please note that these datasets are licensed under CC BY-NC 4.0 International. You must NOT use the material for commercial purposes.


-----------

## 3. Importing the Libraries and Data
### 3.1 Importing the Libraries

In [1]:
# 3.1.1 Importing the common libraries
import pandas as pd
import numpy as np
import os
from collections import defaultdict

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Settings
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

import warnings
# Raw string so that "\." is passed to the regex engine instead of being an invalid escape
warnings.filterwarnings("ignore", module=r"matplotlib\..*")

In [2]:
# 3.1.2 Natural Language libraries
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from keras_preprocessing.text import text_to_word_sequence
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# 3.1.3 Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


In [None]:
# 3.1.4 Deep Learning libraries
import tensorflow as tf
import keras # high-level deep learning API
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import (
    Embedding,
    LSTM,
    Dense,
    SpatialDropout1D,
    Bidirectional,
    BatchNormalization,
    TimeDistributed,
    Dropout,
    Flatten,
    GlobalMaxPool1D,
)
from keras.initializers import Constant
from keras.optimizers import adam_v2
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
import keras.backend as K

### 3.2 Importing the Data

In [4]:
fulldata = pd.read_csv(r'../input/a-tripadvisor-dataset-for-nlp-tasks/Barcelona_reviews.csv')
fulldata.head()



|  | Unnamed: 0 | parse_count | restaurant_name | rating_review | sample | review_id | title_review | review_preview | review_full | date | city | url_restaurant | author_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Chalito_Rambla | 1 | Negative | review_774086112 | Terrible food Terrible service | Ok, this place is terrible! Came here bc we’ve... | Ok, this place is terrible! Came here bc we’ve... | October 12, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_0 |
| 1 | 1 | 2 | Chalito_Rambla | 5 | Positive | review_739142140 | The best milanesa in central Barcelona | This place was a great surprise. The food is d... | This place was a great surprise. The food is d... | January 14, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_1 |
| 2 | 2 | 3 | Chalito_Rambla | 5 | Positive | review_749758638 | Family bonding | The food is excellent.....the ambiance is very... | The food is excellent.....the ambiance is very... | March 7, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_2 |
| 3 | 3 | 4 | Chalito_Rambla | 5 | Positive | review_749732001 | Best food | The food is execellent ,affortable price for p... | The food is execellent ,affortable price for p... | March 7, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_3 |
| 4 | 4 | 5 | Chalito_Rambla | 5 | Positive | review_749691057 | Amazing Food and Fantastic Service | Mr Suarez,The food at your restaurant was abso... | Mr Suarez,The food at your restaurant was abso... | March 7, 2020 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_4 |


#### 3.2.1 Initial Data Understanding
When we first receive a dataset, it is often helpful to run a quick check on its key statistics and metadata. This tells us whether the data is of decent quality (values of the expected types and within reasonable numerical ranges), and lets us drop clearly irrelevant features to save memory, which matters on a large dataset.

In [6]:
fulldata.describe(include = 'all')

|  | Unnamed: 0 | parse_count | restaurant_name | rating_review | sample | review_id | title_review | review_preview | review_full | date | city | url_restaurant | author_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 416356 | 416356 | 416356 | 416356 | 416356 | 416356 | 416355 | 416355 | 416354 | 416354 | 416354 | 416354 | 416354 |
| unique | 416356 | 416356 | 6622 | 11 | 3 | 416356 | 286450 | 416152 | 416185 | 4496 | 5 | 45571 | 232020 |
| top | 0 | 1 | Cerveceria_Catalana | 5 | Positive | review_774086112 | Excellent | Great tapas, a variety of tastes, all tasted e... | Great tapas, a variety of tastes, all tasted e... | October 18, 2016 | Barcelona_Catalonia | https://www.tripadvisor.com/Restaurant_Review-... | UID_7046 |
| freq | 1 | 1 | 5628 | 203206 | 338779 | 1 | 1495 | 4 | 4 | 413 | 416325 | 10 | 390 |


In [5]:
fulldata.info()
# memory usage of 41.3+ MB and all features are of "object" type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416356 entries, 0 to 416355
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Unnamed: 0       416356 non-null  object
 1   parse_count      416356 non-null  object
 2   restaurant_name  416356 non-null  object
 3   rating_review    416356 non-null  object
 4   sample           416356 non-null  object
 5   review_id        416356 non-null  object
 6   title_review     416355 non-null  object
 7   review_preview   416355 non-null  object
 8   review_full      416354 non-null  object
 9   date             416354 non-null  object
 10  city             416354 non-null  object
 11  url_restaurant   416354 non-null  object
 12  author_id        416354 non-null  object
dtypes: object(13)
memory usage: 41.3+ MB
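
The `describe` and `info` outputs hint at a type problem: every column is stored as `object`, and `rating_review` reports 11 unique values even though ratings range from 1 to 5. One plausible cause, which is an assumption and not verified here, is that the column mixes integers with their string forms plus a few stray values. A minimal sketch of how such a column could be coerced, using hypothetical toy data rather than the real file:

```python
import pandas as pd

# Hypothetical toy column mimicking a mixed-type rating column (not the real data)
toy = pd.DataFrame({"rating_review": [5, "5", "4", 1, "not_a_rating", None]})

# In an object column, the integer 5 and the string "5" count as different values
n_unique_raw = toy["rating_review"].nunique()

# Coerce to numbers; anything that cannot be parsed becomes NaN instead of raising
ratings = pd.to_numeric(toy["rating_review"], errors="coerce")

# Drop the rows that could not be parsed and store the rest as integers
clean = ratings.dropna().astype("int64")
n_unique_clean = clean.nunique()

print("unique before:", n_unique_raw)
print("unique after:", n_unique_clean)
print(clean.tolist())
```

Whether dropping unparseable rows is acceptable depends on how many there are; on the real data it would be worth inspecting them before discarding anything.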


-----------------

## 4. Exploratory Data Analysis (EDA)