<a href="https://colab.research.google.com/github/keresztesbeata/Intelligent-Systems-Lab/blob/main/Restaurant-Reviews-NLP/RestaurantReviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Restaurant Reviews

## Context
Analyzing the positive and negative reviews that customers or guests leave for a restaurant (or any other service) can help to estimate its success and popularity, forecast its future profits, or highlight the areas that need to change in order to meet the customers' needs and expectations.

Automatically categorizing these reviews as positive or negative, without having to read and interpret them one by one (nearly impossible for millions of reviews), could make such a review system more efficient by speeding up the evaluation of customer feedback and the adjustment of the services provided.

The task of categorizing such reviews belongs to the domain of Natural Language Processing, more specifically Sentiment Analysis based on a text fragment.

## Dataset
The dataset is a single file with 2 columns: the first contains the actual customer reviews and the second a label indicating whether the customer liked the food offered by the restaurant (1) or not (0).

## Task
Decide if a given review is positive or negative.

# Data preparation

In [63]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.base import clone
import csv
import tensorflow as tf
from tensorflow import keras
%tensorflow_version 2.x
from sklearn.model_selection import train_test_split
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # the stop-word corpus must be downloaded once before use

## Reading the data

In [57]:
data_url = "https://raw.githubusercontent.com/keresztesbeata/Intelligent-Systems-Lab/main/Restaurant-Reviews-NLP/restaurant_reviews_data.csv"

raw_data = pd.read_csv(data_url, on_bad_lines='skip')

In [58]:
raw_data.head()

|   | Review | Liked |
|---|--------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [26]:
raw_data.groupby('Liked').describe()

| Liked | Review count | Review unique | Review top | Review freq |
|-------|--------------|---------------|------------|-------------|
| 0 | 500 | 497 | The food was terrible. | 2 |
| 1 | 500 | 499 | I love this place. | 2 |


As we can see, the dataset is balanced: there are as many positive as negative reviews (500 each). This is helpful when training the model, as there is less danger of it becoming biased toward a majority class.
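The class balance can also be checked directly as a fraction per label. The snippet below is a minimal sketch on a small hypothetical label series (an assumption standing in for the `Liked` column), not the notebook's own cell.

```python
import pandas as pd

# Hypothetical labels standing in for the `Liked` column.
labels = pd.Series([1, 0, 0, 1, 1, 0], name="Liked")

# Fraction of reviews per class; values near 0.5 indicate a balanced set.
balance = labels.value_counts(normalize=True).sort_index()
print(balance)
```

On the real data the same call would be `raw_data['Liked'].value_counts(normalize=True)`.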

In [59]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  1000 non-null   object
 1   Liked   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


Check if there are any null values:

In [19]:
raw_data.isnull().sum()

 Review    0
Liked      0
dtype: int64

Remove the duplicates:

In [53]:
raw_data = raw_data.drop_duplicates()
raw_data.groupby('Liked').describe()

| Liked | Review count | Review unique | Review top | Review freq |
|-------|--------------|---------------|------------|-------------|
| 0 | 497 | 497 | Crust is not good. | 1 |
| 1 | 499 | 499 | Wow... Loved this place. | 1 |


## Splitting the dataset into train and test sets

To keep the evaluation unbiased, the data is split into train and test sets before any data preparation. This way no information leaks from the evaluation data into training; for example, words that appear only in the test reviews are not added to the knowledge base used for training the model.

The dataset is split in a 9:1 ratio: 90% of the available data is used for training, and the remaining 10% is held out for evaluation (the test set).

In [54]:
raw_data.columns

Index([' Review', 'Liked'], dtype='object')
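The column listing above shows a leading space in `' Review'`. If the header really contains that space, attribute access such as `raw_data.Review` would fail. A minimal sketch for normalizing header whitespace, shown on a small hypothetical frame (`df` is an assumption, not the notebook's data):

```python
import pandas as pd

# Hypothetical frame mimicking a header with a stray leading space.
df = pd.DataFrame({" Review": ["Crust is not good."], "Liked": [0]})

# Strip surrounding whitespace from every column name.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

After this step the column can be addressed as `df.Review` or `df['Review']`.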

In [60]:
X = raw_data.Review   # reviews
y = raw_data.Liked    # output label which is to be predicted

split_size = int(len(raw_data) * 0.9)
train_X = X[:split_size]
train_y = y[:split_size]

test_X = X[split_size:]
test_y = y[split_size:]
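The slicing above takes the first 90% of the rows in file order. As an alternative sketch (not the notebook's method), `train_test_split` from scikit-learn, which is already imported, can shuffle the rows and keep the class ratio the same in both parts. The toy reviews, the variable names and the `random_state` value below are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 reviews, half positive (1) and half negative (0).
reviews = [f"review {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

# Shuffle and hold out 10% for testing; stratify keeps the class ratio equal
# in both parts. random_state is an arbitrary seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.1, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
```

Shuffling matters when the file is ordered in some way, for example by date or by label, because a plain slice would then give the test set a different distribution from the train set.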

## Clean the dataset with the following steps:
- split the text into words (tokens) based on the white space characters
- remove punctuation marks, as they are not relevant in this case
- remove the words which contain numerical or special characters and keep only the ones which contain exclusively alphabetical characters
- remove 'stop words', the most common words in English such as articles and pronouns, e.g. a, an, the
- remove words with a length of 0 or just 1 character

In [62]:
def clean_data(data):
  # `data` is a single review string; for a whole column use train_X.apply(clean_data)
  # split the text into words (= tokens delimited by white space)
  words = data.split()
  # remove all punctuation marks from each token
  table = str.maketrans('', '', string.punctuation)
  words = [word.translate(table) for word in words]
  # keep only the words which contain only alphabetic characters
  words = [word for word in words if word.isalpha()]
  # similarly, filter stop words (NLTK's list is lowercase, so compare case-insensitively)
  stop_words = set(stopwords.words('english'))
  words = [word for word in words if word.lower() not in stop_words]
  # finally, filter the words by a min length, which should be at least 2 characters
  words = [word for word in words if len(word) > 1]
  return words
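
To see what the cleaning steps do to a single review, here is a minimal self-contained sketch. It uses a small hand-picked stop-word set (`STOP_WORDS`, an assumption standing in for NLTK's English list) so that it runs without downloading any corpus; `clean_review` is a hypothetical helper name.

```python
import string

# Hypothetical, hand-picked stop words standing in for NLTK's English list.
STOP_WORDS = {"a", "an", "the", "is", "and", "was", "this"}

def clean_review(text, stop_words=STOP_WORDS):
    """Return the cleaned tokens of one review string."""
    # Split on white space, then strip punctuation from each token.
    table = str.maketrans("", "", string.punctuation)
    tokens = [word.translate(table) for word in text.split()]
    # Keep purely alphabetic tokens only (drops numbers and empty strings).
    tokens = [word for word in tokens if word.isalpha()]
    # Drop stop words, comparing in lowercase.
    tokens = [word for word in tokens if word.lower() not in stop_words]
    # Drop tokens shorter than 2 characters.
    return [word for word in tokens if len(word) > 1]

print(clean_review("Wow... Loved this place."))  # ['Wow', 'Loved', 'place']
print(clean_review("I paid 20 dollars!"))        # ['paid', 'dollars']
```

NLTK's full English list also contains negations such as "not", so with the real list a review like "Crust is not good." loses the word that carries its sentiment. Whether to keep negations is a design choice worth revisiting for sentiment analysis.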