# TripAdvisor Restaurant Reviews - Data cleaning and pre-processing
by Liyena Yusoff

## Background

In recent years, Singapore has experienced a surge in fusion cuisine, blending international flavors with local delicacies. While this culinary trend has gained popularity, it raises concerns about preserving the authenticity and traditional essence of local dishes.

Amidst this dynamic gastronomic landscape, there's a delicate balance to be struck. Fusion cuisine caters to those who appreciate international twists, drawing attention to Singapore's culinary scene globally. However, this shift also prompts reflection on the preservation of authentic local experiences.

The ongoing dialogue between fusion and tradition reflects diverse culinary preferences. While some gravitate towards the exciting blend of international flavors, others seek authentic tasting food true to the essence of their cultural roots. Navigating this intricate interplay between innovation and tradition becomes crucial in defining Singapore's culinary identity.

## Problem statement

With the growing challenge of finding truly authentic food, this project aims to recommend authentic dining experiences based on consumer preferences and to support restaurants in enhancing the authenticity and quality of their offerings through customer feedback analysis, including keywords and Net Promoter Scores (NPS).

This addresses the difficulty customers face in navigating a culinary landscape where traditional flavors may be diluted or overshadowed by modern interpretations.

## Objectives

- Analyze TripAdvisor restaurant reviews to identify reviewer sentiment about the authenticity of the dishes.
- Develop an NPS-like score to rank restaurants.
- Develop a model that classifies each TripAdvisor review as coming from a detractor or a promoter of the restaurant.
- Provide actionable insights for both consumers seeking authentic experiences and restaurants aiming to highlight their authenticity.
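
To make the NPS-like score concrete, here is a minimal sketch of one way it could be computed per restaurant. The rating thresholds (`PROMOTER_MIN`, `DETRACTOR_MAX`), the function name `nps_like` and the toy data are assumptions for illustration only; the scoring actually used is defined in the later notebooks.

```python
import pandas as pd

# Hypothetical thresholds on the 1-5 rating scale (assumption, not from the project).
PROMOTER_MIN = 4.5
DETRACTOR_MAX = 3.0

def nps_like(ratings: pd.Series) -> float:
    """Share of promoters minus share of detractors, scaled to [-100, 100]."""
    promoters = (ratings >= PROMOTER_MIN).mean()
    detractors = (ratings <= DETRACTOR_MAX).mean()
    return 100 * (promoters - detractors)

# Toy data for illustration only.
reviews = pd.DataFrame({
    "restaurant": ["A", "A", "A", "A", "B", "B"],
    "review_rating": [5.0, 4.5, 4.0, 2.0, 3.0, 5.0],
})

scores = reviews.groupby("restaurant")["review_rating"].apply(nps_like)
print(scores.sort_values(ascending=False))
```

Ratings between the two thresholds count as passives: they lower the promoter share without adding to the detractor share.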

## Success metrics

- F1-score
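
As a reminder of what the metric measures, the sketch below computes the F1-score for a binary promoter/detractor classifier from first principles. The label encoding (1 for promoter, 0 for detractor) and the function name are assumptions made for this example.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 1 = promoter, 0 = detractor (hypothetical encoding).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
f1 = f1_score_binary(y_true, y_pred)
print(round(f1, 3))
```

F1 suits this task better than plain accuracy if promoters heavily outnumber detractors, because it penalizes a model that ignores the minority class.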

### Notebooks
[Part II - Exploratory Data Analysis](eda.ipynb)

[Part III - Modeling](preprocess_model.ipynb)

[Part IV - App data training](../streamlit/rag_finetuning.ipynb)

### Contents
- [Data cleaning](#Clean-data-using-DataCleaner)
- [Text Cleaning](#Clean-and-lemmatize-text-data-using-TextCleaner)

In [1]:
# import libraries
import pandas as pd
import numpy as np
import time

from data_cleaner import DataCleaner
from text_cleaner import TextCleaner

from bs4 import BeautifulSoup
import spacy
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag
import re 

## Import data

In [12]:
# import data

data = pd.read_csv("../data/raw_reviews.csv")

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40956 entries, 0 to 40955
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      40956 non-null  int64 
 1   restaurant      40956 non-null  object
 2   review_heading  40954 non-null  object
 3   review_text     40956 non-null  object
 4   review_rating   40956 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 1.6+ MB


In [14]:
data.drop(columns=['Unnamed: 0'], inplace=True)
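
The `Unnamed: 0` column is typically what a saved DataFrame index looks like when it is read back. The sketch below reproduces that round trip on toy data and shows that writing with `index=False` avoids the extra column; the toy frame and variable names are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({"restaurant": ["A", "B"], "review_rating": [50, 45]})

# Default to_csv writes the index as a first column with an empty header.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False leaves the index out of the file.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```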

In [15]:
data

Unnamed: 0,restaurant,review_heading,review_text,review_rating
0,Entre Nous creperie,Wonderfully consistent,I’ve come to entre nous periodically over the ...,50
1,Entre Nous creperie,Don’t miss the French Galettes and Crepes,Absolutely delicious…\nThe menu is lovely and ...,50
2,Entre Nous creperie,Lovely French restaurant - excellent for glute...,Thank you so much for choosing entre nous crep...,50
3,Entre Nous creperie,A simple yet delicious & authentic brunch.,"Lovely little French restaurant, really authen...",45
4,Entre Nous creperie,A trip to brittany,"Dear Rebecca,\n\nThank you very much for dinin...",40
...,...,...,...,...
40951,Chao San Cuisine,Chao Shan serves authentic Teochew cuisine,Chao Shan serves one of the best authentic Teo...,40
40952,Chao San Cuisine,Super duper yummy and authentic Teochew dishes,"We loved all the dishes we had! If you can, or...",40
40953,Chao San Cuisine,chinese style,This country is not famous for its cuisines be...,40
40954,Chao San Cuisine,Cosy restaurant with authentic Teochew cuisine,"Making a reservation is highly recommended, as...",40


In [17]:
data[data['review_heading'].isnull()]

Unnamed: 0,restaurant,review_heading,review_text,review_rating
2218,Anjappar Authentic Chettinaad Restaurant,,We went to this restaurant for lunch with our ...,50
24732,Malaysian Food Street,,"Tried the chicken rice balls, lor bak and curr...",35


## Clean data using DataCleaner

In [19]:
cleaner = DataCleaner(data)

cleaner.get_ratings('review_rating')

cleaner.drop_duplicates()

cleaned_reviews = cleaner.df
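
The source of `DataCleaner` is not shown in this notebook. Judging from the outputs (ratings go from `50` to `5.0`, and the index is renumbered after duplicates are removed), it plausibly behaves like the sketch below. The class name `SketchDataCleaner` and its internals are assumptions, not the project's implementation.

```python
import pandas as pd

class SketchDataCleaner:
    """Hypothetical stand-in for DataCleaner, inferred from the notebook outputs."""

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()

    def get_ratings(self, column: str) -> None:
        # Scraped ratings look like 50, 45, 40; rescale to 5.0, 4.5, 4.0.
        self.df[column] = self.df[column] / 10

    def drop_duplicates(self) -> None:
        # Remove exact duplicate rows and renumber the index from 0.
        self.df = self.df.drop_duplicates().reset_index(drop=True)

# Toy data for illustration only.
raw = pd.DataFrame({
    "restaurant": ["A", "A", "B"],
    "review_text": ["great", "great", "fine"],
    "review_rating": [50, 50, 45],
})

sketch = SketchDataCleaner(raw)
sketch.get_ratings("review_rating")
sketch.drop_duplicates()
print(sketch.df)
```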

In [20]:
cleaned_reviews.columns

Index(['restaurant', 'review_heading', 'review_text', 'review_rating'], dtype='object')

In [21]:
cleaned_reviews[cleaned_reviews.duplicated()]

Unnamed: 0,restaurant,review_heading,review_text,review_rating


In [22]:
cleaned_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40947 entries, 0 to 40946
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   restaurant      40947 non-null  object 
 1   review_heading  40945 non-null  object 
 2   review_text     40947 non-null  object 
 3   review_rating   40947 non-null  float64
dtypes: float64(1), object(3)
memory usage: 1.2+ MB


In [23]:
%%time
# 'cleaned_reviews' holds the restaurant names in its 'restaurant' column
# Create a unique list of restaurant names
unique_restaurants = cleaned_reviews['restaurant'].unique()

# Create a mapping between restaurant names and numbers
restaurant_to_number = {restaurant: i for i, restaurant in enumerate(unique_restaurants)}

# Add a new column 'restaurant_label' to your DataFrame
cleaned_reviews['restaurant_label'] = cleaned_reviews['restaurant'].map(restaurant_to_number)

# Now, each restaurant name is associated with a unique number in the 'restaurant_label' column


CPU times: user 5.8 ms, sys: 3.52 ms, total: 9.31 ms
Wall time: 6.71 ms
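
The same labelling can be obtained with `pd.factorize`, which numbers values in order of first appearance. The sketch below compares it with the dictionary approach on toy data; the toy frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "restaurant": [
        "Entre Nous creperie",
        "Chao San Cuisine",
        "Entre Nous creperie",
        "Malaysian Food Street",
    ]
})

# Dictionary approach, as in the cell above.
mapping = {name: i for i, name in enumerate(df["restaurant"].unique())}
via_map = df["restaurant"].map(mapping)

# factorize returns the integer codes and the unique values in one call.
codes, uniques = pd.factorize(df["restaurant"])
df["restaurant_label"] = codes

print(df)
```

Keeping `uniques` (or `mapping`) around makes it possible to translate labels back to restaurant names later.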


In [24]:
cleaned_reviews.head()

Unnamed: 0,restaurant,review_heading,review_text,review_rating,restaurant_label
0,Entre Nous creperie,Wonderfully consistent,I’ve come to entre nous periodically over the ...,5.0,0
1,Entre Nous creperie,Don’t miss the French Galettes and Crepes,Absolutely delicious…\nThe menu is lovely and ...,5.0,0
2,Entre Nous creperie,Lovely French restaurant - excellent for glute...,Thank you so much for choosing entre nous crep...,5.0,0
3,Entre Nous creperie,A simple yet delicious & authentic brunch.,"Lovely little French restaurant, really authen...",4.5,0
4,Entre Nous creperie,A trip to brittany,"Dear Rebecca,\n\nThank you very much for dinin...",4.0,0


In [25]:
cleaned_reviews[cleaned_reviews['review_heading'].isnull()]

Unnamed: 0,restaurant,review_heading,review_text,review_rating,restaurant_label
2218,Anjappar Authentic Chettinaad Restaurant,,We went to this restaurant for lunch with our ...,5.0,4
24725,Malaysian Food Street,,"Tried the chicken rice balls, lor bak and curr...",3.5,70


## Clean and lemmatize text data using TextCleaner

In [26]:
%%time

cleaner = TextCleaner(cleaned_reviews, "review_text")

# Clean and lemmatize text
cleaner.clean_lemmatize_text()

cleaner.clean_lemmatize_header()

# Access the DataFrame with cleaned and lemmatized text
cleaned_reviews = cleaner.df



CPU times: user 4min 46s, sys: 1.39 s, total: 4min 48s
Wall time: 4min 48s


In [27]:
# a preview of the cleaned reviews

cleaned_reviews.head()

Unnamed: 0,restaurant,review_heading,review_text,review_rating,restaurant_label,cleaned_text,cleaned_lem_text,word_count,cleaned_heading,cleaned_lem_heading
0,Entre Nous creperie,Wonderfully consistent,I’ve come to entre nous periodically over the ...,5.0,0,come entre nous periodically past years seems ...,come entre nous periodically past year seem se...,47,wonderfully consistent,wonderfully consistent
1,Entre Nous creperie,Don’t miss the French Galettes and Crepes,Absolutely delicious…\nThe menu is lovely and ...,5.0,0,absolutely delicious menu lovely offers except...,absolutely delicious menu lovely offer excepti...,42,miss french galettes crepes,miss french galette crepe
2,Entre Nous creperie,Lovely French restaurant - excellent for glute...,Thank you so much for choosing entre nous crep...,5.0,0,thank much choosing entre nous creperie recent...,thank much choose entre nous creperie recently...,45,lovely french restaurant excellent gluten free,lovely french restaurant excellent gluten free
3,Entre Nous creperie,A simple yet delicious & authentic brunch.,"Lovely little French restaurant, really authen...",4.5,0,lovely little french restaurant really authent...,lovely little french restaurant really authent...,45,simple yet delicious authentic brunch,simple yet delicious authentic brunch
4,Entre Nous creperie,A trip to brittany,"Dear Rebecca,\n\nThank you very much for dinin...",4.0,0,dear rebecca thank much dining entre nous crep...,dear rebecca thank much dining entre nous crep...,45,trip brittany,trip brittany
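
The notebook does not say how `word_count` is computed. The value of 47 for the first review is consistent with a whitespace split of the raw `review_text`, so the sketch below works under that assumption; the real `TextCleaner` may count differently.

```python
import pandas as pd

first_review = (
    "I’ve come to entre nous periodically over the past 10 years it seems and "
    "the service and food have always been consistently great. Geraldine and "
    "her husband are clearly invested in maintaining the highest standards and "
    "I’m so happy the restaurant is still going strong...even through COVID.More"
)

df = pd.DataFrame({"review_text": [first_review]})

# Assumed definition: number of whitespace-separated tokens in the raw text.
df["word_count"] = df["review_text"].str.split().str.len()
print(df["word_count"].iloc[0])
```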


To better understand what happens when the raw text is cleaned and then lemmatized, we compare a review before and after cleaning.

In [28]:
cleaned_reviews[['review_text', 'cleaned_text', 'cleaned_lem_text']]

Unnamed: 0,review_text,cleaned_text,cleaned_lem_text
0,I’ve come to entre nous periodically over the ...,come entre nous periodically past years seems ...,come entre nous periodically past year seem se...
1,Absolutely delicious…\nThe menu is lovely and ...,absolutely delicious menu lovely offers except...,absolutely delicious menu lovely offer excepti...
2,Thank you so much for choosing entre nous crep...,thank much choosing entre nous creperie recent...,thank much choose entre nous creperie recently...
3,"Lovely little French restaurant, really authen...",lovely little french restaurant really authent...,lovely little french restaurant really authent...
4,"Dear Rebecca,\n\nThank you very much for dinin...",dear rebecca thank much dining entre nous crep...,dear rebecca thank much dining entre nous crep...
...,...,...,...
40942,Chao Shan serves one of the best authentic Teo...,chao serves one best authentic teochew cuisine...,chao serve one good authentic teochew cuisine ...
40943,"We loved all the dishes we had! If you can, or...",loved dishes order suckling pig advance regret...,love dish order suckle pig advance regret also...
40944,This country is not famous for its cuisines be...,country famous cuisines melting nation opp res...,country famous cuisine melt nation opp restaur...
40945,"Making a reservation is highly recommended, as...",making reservation highly recommended inside s...,make reservation highly recommend inside seat ...


In [29]:
# by taking the first review

first_index_texts = cleaned_reviews.loc[0, ['review_text', 'cleaned_text', 'cleaned_lem_text']]

# Print the texts
for col, text in first_index_texts.items():
    print(f"{col}: {text}")
    print(f"\n")

review_text: I’ve come to entre nous periodically over the past 10 years it seems and the service and food have always been consistently great. Geraldine and her husband are clearly invested in maintaining the highest standards and I’m so happy the restaurant is still going strong...even through COVID.More


cleaned_text: come entre nous periodically past years seems service food always consistently great geraldine husband clearly invested maintaining highest standards happy restaurant still going strong even covid


cleaned_lem_text: come entre nous periodically past year seem service food always consistently great geraldine husband clearly invest maintain high standard happy restaurant still go strong even covid




First, comparing review_text and cleaned_text: the text was lowercased and punctuation was stripped, contractions such as 'I’ve' and 'I’m' were removed, and so were numbers (such as '10') and stopwords such as 'over' and 'it'.
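
The cleaning effect on the start of this review can be reproduced with a small stand-in. The function `clean_text` and the tiny `ILLUSTRATIVE_STOPWORDS` set are hypothetical; the notebook imports BeautifulSoup and NLTK's stopword list, which `TextCleaner` presumably uses instead of the regex and hand-written set below.

```python
import html
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is much larger.
ILLUSTRATIVE_STOPWORDS = {"i", "ve", "m", "to", "over", "the", "it", "and"}

def clean_text(raw_text, stop_words=ILLUSTRATIVE_STOPWORDS):
    """Hypothetical stand-in for the cleaning step of TextCleaner."""
    text = re.sub(r"<[^>]+>", " ", str(raw_text))  # drop HTML tags
    text = html.unescape(text)                     # decode entities such as &amp;
    text = re.sub(r"[^a-z]+", " ", text.lower())   # lowercase, keep letters only
    return " ".join(t for t in text.split() if t not in stop_words)

sample = "I’ve come to entre nous periodically over the past 10 years"
cleaned = clean_text(sample)
print(cleaned)  # come entre nous periodically past years
```

Replacing every non-letter with a space is what splits 'I’ve' into 'i' and 've', both of which are then dropped as stopwords.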

Next, comparing cleaned_text and cleaned_lem_text, here are some of the words that differ before and after lemmatization:

|pre-lemmatization|after lemmatization|
|-----------------|-------------------|
| years | year |
| seems | seem |
| invested | invest |
| maintaining | maintain |
| highest | high |
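
Results like 'highest' to 'high' and 'invested' to 'invest' suggest the lemmatizer is told each word's part of speech, since WordNet lemmatization treats words as nouns by default. A common pattern, sketched below, maps Penn Treebank tags (as produced by `nltk.pos_tag`) to WordNet part-of-speech codes. The helper name and the hand-assigned tags are assumptions; whether `TextCleaner` does exactly this is not shown in the notebook.

```python
def treebank_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun

# Tags assigned by hand for illustration, not produced by a tagger here.
examples = {
    "years": "NNS",
    "seems": "VBZ",
    "invested": "VBN",
    "maintaining": "VBG",
    "highest": "JJS",
}

wordnet_pos = {word: treebank_to_wordnet(tag) for word, tag in examples.items()}
print(wordnet_pos)
```

With NLTK and its WordNet corpus installed, the code returned here can be passed as the `pos` argument of `WordNetLemmatizer().lemmatize(word, pos=...)`.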

Next, the lemmatized review text and the lemmatized heading are combined into a single column, cleaned.

In [30]:
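# join text and heading with a space so the last word of the text and the first word of the heading do not fuse
cleaned_reviews['cleaned'] = cleaned_reviews['cleaned_lem_text'] + ' ' + cleaned_reviews['cleaned_lem_heading']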

In [31]:
cleaned_reviews.head()

Unnamed: 0,restaurant,review_heading,review_text,review_rating,restaurant_label,cleaned_text,cleaned_lem_text,word_count,cleaned_heading,cleaned_lem_heading,cleaned
0,Entre Nous creperie,Wonderfully consistent,I’ve come to entre nous periodically over the ...,5.0,0,come entre nous periodically past years seems ...,come entre nous periodically past year seem se...,47,wonderfully consistent,wonderfully consistent,come entre nous periodically past year seem se...
1,Entre Nous creperie,Don’t miss the French Galettes and Crepes,Absolutely delicious…\nThe menu is lovely and ...,5.0,0,absolutely delicious menu lovely offers except...,absolutely delicious menu lovely offer excepti...,42,miss french galettes crepes,miss french galette crepe,absolutely delicious menu lovely offer excepti...
2,Entre Nous creperie,Lovely French restaurant - excellent for glute...,Thank you so much for choosing entre nous crep...,5.0,0,thank much choosing entre nous creperie recent...,thank much choose entre nous creperie recently...,45,lovely french restaurant excellent gluten free,lovely french restaurant excellent gluten free,thank much choose entre nous creperie recently...
3,Entre Nous creperie,A simple yet delicious & authentic brunch.,"Lovely little French restaurant, really authen...",4.5,0,lovely little french restaurant really authent...,lovely little french restaurant really authent...,45,simple yet delicious authentic brunch,simple yet delicious authentic brunch,lovely little french restaurant really authent...
4,Entre Nous creperie,A trip to brittany,"Dear Rebecca,\n\nThank you very much for dinin...",4.0,0,dear rebecca thank much dining entre nous crep...,dear rebecca thank much dining entre nous crep...,45,trip brittany,trip brittany,dear rebecca thank much dining entre nous crep...


In [32]:
cleaned_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40947 entries, 0 to 40946
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   restaurant           40947 non-null  object 
 1   review_heading       40945 non-null  object 
 2   review_text          40947 non-null  object 
 3   review_rating        40947 non-null  float64
 4   restaurant_label     40947 non-null  int64  
 5   cleaned_text         40947 non-null  object 
 6   cleaned_lem_text     40947 non-null  object 
 7   word_count           40947 non-null  int64  
 8   cleaned_heading      40945 non-null  object 
 9   cleaned_lem_heading  40945 non-null  object 
 10  cleaned              40945 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 3.4+ MB


In [36]:
cleaned_reviews[cleaned_reviews['review_heading'].isnull()]

Unnamed: 0,restaurant,review_heading,review_text,review_rating,restaurant_label,cleaned_text,cleaned_lem_text,word_count,cleaned_heading,cleaned_lem_heading,cleaned
2218,Anjappar Authentic Chettinaad Restaurant,,We went to this restaurant for lunch with our ...,5.0,4,went restaurant lunch office friends enjoyed f...,go restaurant lunch office friend enjoy food e...,31,,,
24725,Malaysian Food Street,,"Tried the chicken rice balls, lor bak and curr...",3.5,70,tried chicken rice balls lor bak curry mee thr...,try chicken rice ball lor bak curry mee three ...,45,,,


In [None]:
# since the review heading is null, the cleaned column for these rows will be the 'cleaned_lem_text'

In [40]:
no_heading = cleaned_reviews['review_heading'].isnull()
cleaned_reviews.loc[no_heading, 'cleaned'] = cleaned_reviews.loc[no_heading, 'cleaned_lem_text']
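
The null values appear in `cleaned` because string concatenation with a missing value yields a missing value. A sketch of an alternative that handles missing headings in one step is shown below, on toy data; the space separator and the variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cleaned_lem_text": ["go restaurant lunch", "come entre nous"],
    "cleaned_lem_heading": [np.nan, "wonderfully consistent"],
})

# Plain concatenation propagates the missing heading into the result.
naive = df["cleaned_lem_text"] + " " + df["cleaned_lem_heading"]

# Filling missing headings with an empty string first keeps the review text.
combined = (
    df["cleaned_lem_text"] + " " + df["cleaned_lem_heading"].fillna("")
).str.strip()

print(naive.isna().tolist())
print(combined.tolist())
```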

In [41]:
cleaned_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40947 entries, 0 to 40946
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   restaurant           40947 non-null  object 
 1   review_heading       40945 non-null  object 
 2   review_text          40947 non-null  object 
 3   review_rating        40947 non-null  float64
 4   restaurant_label     40947 non-null  int64  
 5   cleaned_text         40947 non-null  object 
 6   cleaned_lem_text     40947 non-null  object 
 7   word_count           40947 non-null  int64  
 8   cleaned_heading      40945 non-null  object 
 9   cleaned_lem_heading  40945 non-null  object 
 10  cleaned              40947 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 3.4+ MB


In [42]:
cleaned_reviews[cleaned_reviews['review_heading'].isnull()]

Unnamed: 0,restaurant,review_heading,review_text,review_rating,restaurant_label,cleaned_text,cleaned_lem_text,word_count,cleaned_heading,cleaned_lem_heading,cleaned
2218,Anjappar Authentic Chettinaad Restaurant,,We went to this restaurant for lunch with our ...,5.0,4,went restaurant lunch office friends enjoyed f...,go restaurant lunch office friend enjoy food e...,31,,,go restaurant lunch office friend enjoy food e...
24725,Malaysian Food Street,,"Tried the chicken rice balls, lor bak and curr...",3.5,70,tried chicken rice balls lor bak curry mee thr...,try chicken rice ball lor bak curry mee three ...,45,,,try chicken rice ball lor bak curry mee three ...


In [43]:
cols_to_keep = ['restaurant', 'cleaned', 'word_count', 'review_rating', 'restaurant_label']

In [44]:
latest_data = cleaned_reviews[cols_to_keep]

In [45]:
latest_data

Unnamed: 0,restaurant,cleaned,word_count,review_rating,restaurant_label
0,Entre Nous creperie,come entre nous periodically past year seem se...,47,5.0,0
1,Entre Nous creperie,absolutely delicious menu lovely offer excepti...,42,5.0,0
2,Entre Nous creperie,thank much choose entre nous creperie recently...,45,5.0,0
3,Entre Nous creperie,lovely little french restaurant really authent...,45,4.5,0
4,Entre Nous creperie,dear rebecca thank much dining entre nous crep...,45,4.0,0
...,...,...,...,...,...
40942,Chao San Cuisine,chao serve one good authentic teochew cuisine ...,45,4.0,146
40943,Chao San Cuisine,love dish order suckle pig advance regret also...,45,4.0,146
40944,Chao San Cuisine,country famous cuisine melt nation opp restaur...,33,4.0,146
40945,Chao San Cuisine,make reservation highly recommend inside seat ...,45,4.0,146


In [46]:
# latest_data.to_csv("../data/cleaned_reviews.csv")

## Next part

[Part II - Exploratory Data Analysis](eda.ipynb)