# SENTIMENT ANALYSIS: RESTAURANT REVIEWS



# PART 1: BUILD SENTIMENT LEXICON

#Import Library and Upload File

In [1]:
import pandas as pd
import numpy as np
!pip install deep_translator
from deep_translator import GoogleTranslator

#Text Preprocessing
import re
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.corpus import stopwords
from nltk import pos_tag
nltk.download('words')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

#Model
import collections
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import utils
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import classification_report
import gensim
import multiprocessing
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!



# 1.1 Data Preparation & Preprocessing

In [2]:
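#use fully qualified option names; short forms such as 'max_rows' can be ambiguous in recent pandas
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)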

Retrieve the Excel file containing the restaurant reviews, each labeled as either positive or negative.

In [3]:
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/GitHub_SHZ/GitHub-SA-restaurant-review/restaurant_review.xlsx', dtype={'Review':'str'})
df.head()

Unnamed: 0,Review,Label
0,Bad service at The Mines when not many customers. The staff have no mood to do their work at all. One of the staff even reluctantly serves the food. The table is also isn't set up properly. Sorry Kenny's Rogers you should hand picked your staff properly.,Negative
1,very very poor service in mines. very unprofessional. unbelievable.,Negative
2,"Branch Senawang. The waiting time is ok. Food and drinks are ok but a little bit more variety again ok. Parking problem because the limited bit. Ambience is ok. Just be more FLUENT mengorder procedure. Now this customer select the menu near the counter, waiting in line so long that terhegeh-hegeh next customer. A family many who lined up since then just about to ask each what they eat. I proposed park near each table menu. Place order kopitiam style sheet like that. Ask waiter to fill customer orders InForm sheet, then just go pay close counters",Negative
3,Jammed like Damn ... good luck once in a while .. if you can do the day Tiap2 union ... syibal .. huhuhu,Negative
4,"Cashier taking order at Subang Empire Gallery, on 3rd February 2014 at 1pm, Mr Prakash is VERY unfriendly. Took orders from few of my friends as well. Same attitude. Somewhat ruin our holiday morning mood.",Negative


In [4]:
#drop every review that occurs more than once (keep=False removes all copies, not just the repeats)
df.drop_duplicates(subset ='Review', keep = False, inplace = True)
df.reset_index(drop=True,inplace=True)


Preprocessing steps in this section include:

1.1 Data manipulation


> 1.1.1 Change *Label* input for easier data manipulation:

> Negative = 0

> Positive = 1

> 1.1.2 Translate *Review* to English

1.2 Create new DataFrames that separate the positive and negative reviews.

> df_pos = DataFrame containing positive reviews only

> df_neg = DataFrame containing negative reviews only

## 1.1.1 Change *Label* to integer

In [5]:
df['Label'] = np.where(df['Label'] == 'Positive', 1, 0)
df.head()

Unnamed: 0,Review,Label
0,Bad service at The Mines when not many customers. The staff have no mood to do their work at all. One of the staff even reluctantly serves the food. The table is also isn't set up properly. Sorry Kenny's Rogers you should hand picked your staff properly.,0
1,very very poor service in mines. very unprofessional. unbelievable.,0
2,"Branch Senawang. The waiting time is ok. Food and drinks are ok but a little bit more variety again ok. Parking problem because the limited bit. Ambience is ok. Just be more FLUENT mengorder procedure. Now this customer select the menu near the counter, waiting in line so long that terhegeh-hegeh next customer. A family many who lined up since then just about to ask each what they eat. I proposed park near each table menu. Place order kopitiam style sheet like that. Ask waiter to fill customer orders InForm sheet, then just go pay close counters",0
3,Jammed like Damn ... good luck once in a while .. if you can do the day Tiap2 union ... syibal .. huhuhu,0
4,"The worst meal experience ever in my life!\nEgg benetic is so salty and sour!\nThe mushroom soup comes with free real human long hair!Anyway, the chef is the key person who contribute us such a ""terrific"" experience, improvement need to be made!",0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  130 non-null    object
 1   Label   130 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 2.2+ KB


In [7]:
def fx_clean_word(text):
    #Make text lowercase 
    text = text.lower()
    #remove text in brackets
    text = re.sub(r'\(.*?\)', ' ', text)
    #remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    #remove words containing numbers
    text = re.sub(r'\w*\d\w*', '' , text)
    #replace ellipsis characters and newlines with a space so adjacent words stay separate
    text = re.sub('[…]', ' ', text)
    text = re.sub('\n', ' ', text)
    return text

#build the lookup sets once instead of on every call
ENGLISH_WORDS = set(nltk.corpus.words.words())
STOP_WORDS = set(stopwords.words('english'))

def fx_get_meaningful_word(text):
    #keep english words only (non-alphabetic tokens pass through)
    text = " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in ENGLISH_WORDS or not w.isalpha())
    #lemmatize (note: called on the whole string, not word by word, so multi-word reviews pass through effectively unchanged)
    lemmatizer = WordNetLemmatizer()
    text = lemmatizer.lemmatize(text)
    #tokenize
    text = nltk.word_tokenize(text)
    #remove stopwords
    text = [x for x in text if x not in STOP_WORDS]
    #remove words with only one letter
    text = [t for t in text if len(t) > 1]
    return text

def fx_pos_tagging(text):
    #POS tag
    text = pos_tag(text)
    #extract adjectives and adverbs only
    text = [x for x, pos in text if pos.startswith('JJ') or pos.startswith('RB')]  
    return text

def fx_text_preprocessing(text):
    text = fx_clean_word(text)
    text = fx_get_meaningful_word(text)
    text = fx_pos_tagging(text)
    return text

#short aliases used with DataFrame.apply in the cells that follow
clean_word = fx_clean_word
get_meaningful_word = fx_get_meaningful_word
pos_tagging = fx_pos_tagging
text_preprocessing = fx_text_preprocessing
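
As a quick check of the cleaning step, the sketch below applies the same regular expressions as `fx_clean_word` to a made-up review. `clean_review` and the sample text are hypothetical stand-ins, not part of the notebook. Newlines and ellipsis characters are replaced with a space here so that neighbouring words cannot fuse.

```python
import re
import string

def clean_review(text):
    """Hypothetical stand-in that mirrors the cleaning steps of fx_clean_word."""
    text = text.lower()
    text = re.sub(r'\(.*?\)', ' ', text)                              # drop bracketed asides
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation -> space
    text = re.sub(r'\w*\d\w*', '', text)                              # drop tokens containing digits
    text = re.sub(r'[…\n]', ' ', text)                                # ellipsis/newline -> space
    return text

sample = "Great food (really)! Paid RM25 for 2 sets…\nWill return."
tokens = clean_review(sample).split()
print(tokens)  # ['great', 'food', 'paid', 'for', 'sets', 'will', 'return']
```

The bracketed aside, the price `RM25` and the bare digit are all removed, which is the behaviour to keep in mind when reviews mention prices or dates.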

## 1.1.2 Translate Review to English

Based on the preview above, the Review column mixes English and Malay. The reviews are therefore translated into English during preprocessing, since most NLP tools are built for English text.


In [8]:
df_clean = pd.DataFrame(df.Review.apply(clean_word))
translator = GoogleTranslator(source='ms',target='en')
df_clean['Review'] = df_clean['Review'].apply(lambda x: translator.translate(x))
df_clean['Label'] = df.Label.copy()
df_clean

Unnamed: 0,Review,Label
0,bad service at the mines when not many customers the staff have no mood to do their work at all one of the staff even reluctantly serves the food the table is also isn t set up properly sorry kenny s rogers you should hand picked your staff properly,0
1,very very poor service in mines very unprofessional unbelievable,0
2,branch senawang the waiting time is ok food and drinks are ok but a little bit more variety again ok parking problem because the limited bit ambience is ok just be more fluent mengorder procedure now this customer select the menu near the counter waiting in line so long that terhegeh hegeh next customer a family many who lined up since then just about to ask each what they eat i proposed park near each table menu place order kopitiam style sheet like that ask waiter to fill customer orders inform sheet then just go pay close counters,0
3,jammed like damn good luck once in a while if you can do the day union syibal huhuhu,0
4,the worst meal experience ever in my life egg benetic is so salty and sour the mushroom soup comes with free real human long hair anyway the chef is the key person who contribute us such a terrific experience improvement need to be made,0
5,dear family friends my husband i patronized delicious on his birthday i ordered baked beans clearly marked as vegetarian on the menu halfway through the meal he discovered meat in the baked beans on questioning the manager we found out that the baked beans was prepared with beef stock beef bacon when we asked who the chef was we were told that these items were prepared in the main kitchen and were sent out to all their restaurants in klang valley we were shocked deeply disappointed as this is a favorite restaurant we have been there in good faith we wrote an official complaint it has been a week we haven t heard from them please inform all your hindus buddhists vegetarians people who don t consume beef of this ask them not to patronize this chain of restaurants we have escalated this to the consumer association thank you,0
6,sushi zanmai the garden bad and slow service then server look at me i was asking for service the server try to avoid me some more the sushi was come after cook food is the sushi very hard to wrap no refill drink at all i ask to check order for twice no response after all the most terrible service ever,0
7,today at approximately at sunway pyramid branch a rat jumped out of the kitchen and towards the customers people were screaming and yelling yet none of the staff gave afk about it not even the manager just like how you don t give afk about your customers giving feedbacks on this page,0
8,yesterday we have gone to ioi seoul garden let to say it was not my first time to go there and i have been there for few times i want to share some experience with you guys first they increase the prices now it is rm per person then the service was very bad for example when the lemon finished i asked them to re fill they said sorry finished also their shrimps same as their crab was to smelly and bad taste after grilling i told to manager at least for today i think your seafoods are not fresh and smelly he answered oo i will check and then he didn t do any thing don t go,0
9,slow service and not customer friendly should be a fight night terok,0


# 1.2 Separate positive and negative reviews

In [9]:
df_pos = df_clean[df_clean.Label == 1].reset_index(drop=True)
df_pos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  82 non-null     object
 1   Label   82 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.4+ KB


In [10]:
df_neg = df_clean[df_clean.Label == 0].reset_index(drop=True)
df_neg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  48 non-null     object
 1   Label   48 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 896.0+ bytes


## 1.2.1 Generate positive lexicons list

Apply the text preprocessing function to the positive reviews in df_pos and keep only the words tagged as adjectives or adverbs in each review. These words are stored in the pos_clean DataFrame.

In [11]:
pos_clean = pd.DataFrame(df_pos.Review.apply(text_preprocessing))
pos_clean

Unnamed: 0,Review
0,"[best, indeed, jumbo, prime, best, best, best]"
1,[]
2,"[good, reasonable, new]"
3,"[best, far, also, big, well, willing, social]"
4,"[great, back, overall, friendly, efficient, atmosphere, initiative, constantly, interactive, attentive, social, great]"
5,[]
6,"[best, chicken]"
7,"[best, cheese, quite, generous, cheese, heavenly, best, ever, nasi, quite, good, also, chicken]"
8,"[best, cheese, already, peanut, sauce, delicious]"
9,"[wonderful, friendly, especially, lady, good]"


Collect the unique positive words in a new list called pos_.

In [12]:
pos_ = []
for index, row in pos_clean.iterrows():
  rv = row["Review"]
  for word in rv:
    if word not in pos_:
      pos_.append(word)
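
The loop above keeps the first occurrence of every word. A compact equivalent, sketched here on made-up rows rather than the notebook's data, relies on `dict.fromkeys` preserving insertion order (Python 3.7+). The names `rows` and `unique_words` are illustrative only.

```python
from itertools import chain

# Toy stand-in for pos_clean["Review"]: one list of tagged words per review.
rows = [
    ["best", "indeed", "best"],
    [],
    ["good", "best", "new"],
]

# dict keys are unique and keep first-seen order, so this matches the nested loop.
unique_words = list(dict.fromkeys(chain.from_iterable(rows)))
print(unique_words)  # ['best', 'indeed', 'good', 'new']
```

Besides being shorter, this avoids rescanning the growing list for every word, which the `word not in pos_` test does.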

The list is converted into a pandas DataFrame.
The 'Score' column is set to 1 to indicate positive sentiment.

In [13]:
pos_list = pd.DataFrame(pos_)
pos_list.rename(columns={0:"Lexicon"}, inplace=True)
pos_list["Score"]= 1
pos_list

Unnamed: 0,Lexicon,Score
0,best,1
1,indeed,1
2,jumbo,1
3,prime,1
4,good,1
5,reasonable,1
6,new,1
7,far,1
8,also,1
9,big,1


In [14]:
print("Total lexicon in positive file: " + str(len(pos_list.Lexicon)))

Total lexicon in positive file: 115


## 1.2.2 Generate negative lexicons list

Apply the text preprocessing function to the negative reviews in df_neg and keep only the words tagged as adjectives or adverbs in each review. These words are stored in the neg_clean DataFrame.


In [15]:
neg_clean = pd.DataFrame(df_neg.Review.apply(text_preprocessing))
neg_clean

Unnamed: 0,Review
0,"[bad, many, even, reluctantly, also, properly, sorry, properly]"
1,"[poor, unprofessional, unbelievable]"
2,"[little, ambience, fluent, long, next, many, table, close]"
3,"[damn, good]"
4,"[worst, ever, sour, free, real, human, long, anyway, key, terrific]"
5,"[dear, delicious, clearly, vegetarian, baked, prepared, prepared, main, deeply, disappointed, favorite, good, official, inform]"
6,"[bad, slow, server, server, come, hard, ask, twice, terrible, ever]"
7,"[approximately, sunway, pyramid, yet, none, even]"
8,"[garden, first, first, bad, also, smelly, bad, least, fresh, smelly]"
9,"[slow, friendly]"


Collect the unique negative words in a new list called neg_.

In [16]:
neg_ = []
for index, row in neg_clean.iterrows():
  rv = row["Review"]
  for word in rv:
    if word not in neg_:
      neg_.append(word)

The list is converted into a pandas DataFrame.
The 'Score' column is set to -1 to indicate negative sentiment.

In [17]:
neg_list = pd.DataFrame(neg_)
neg_list.rename(columns={0:"Lexicon"}, inplace=True)
neg_list["Score"]= -1
neg_list

Unnamed: 0,Lexicon,Score
0,bad,-1
1,many,-1
2,even,-1
3,reluctantly,-1
4,also,-1
5,properly,-1
6,sorry,-1
7,poor,-1
8,unprofessional,-1
9,unbelievable,-1


In [18]:
print("Total lexicon in negative file: " + str(len(neg_list.Lexicon)))

Total lexicon in negative file: 174


## 1.2.3 Combine both pos and neg lists into one dataframe

restaurant_lex is a DataFrame containing the positive and negative lexicons for the restaurant domain. Part 2 uses this lexicon to calculate the polarity of each review in the file.

In [19]:
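#DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
restaurant_lex = pd.concat([pos_list, neg_list], ignore_index=True)
#drop words that occur in both lists (keep=False removes every copy); the index is left as-is
restaurant_lex.drop_duplicates(['Lexicon'], keep=False, inplace=True)
restaurant_lex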

Unnamed: 0,Lexicon,Score
1,indeed,1
2,jumbo,1
3,prime,1
5,reasonable,1
7,far,1
11,willing,1
12,social,1
15,overall,1
17,efficient,1
18,atmosphere,1


In [20]:
print("Total Lexicon after combined: " + str(len(restaurant_lex.Lexicon)))

Total Lexicon after combined: 215
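
The counts fit together once `keep=False` is taken into account. The two lists hold 115 + 174 = 289 rows, and every word found in both lists removes two rows, so (289 - 215) / 2 = 37 words were shared and dropped. This assumes each list contains a word only once, which the de-duplicating loops ensure. The sketch below shows the same rule on toy word lists with plain sets; the words are illustrative, not the full lexicon.

```python
# Toy lexicons; the words are illustrative, not the notebook's full lists.
pos_words = ["best", "good", "also", "friendly", "reasonable"]
neg_words = ["bad", "also", "good", "poor", "friendly"]

shared = set(pos_words) & set(neg_words)   # words seen under both labels

# Keep only the words unique to one polarity, as drop_duplicates(keep=False) does.
lexicon = {w: 1 for w in pos_words if w not in shared}
lexicon.update({w: -1 for w in neg_words if w not in shared})

print(sorted(shared))   # ['also', 'friendly', 'good']
print(lexicon)          # {'best': 1, 'reasonable': 1, 'bad': -1, 'poor': -1}

# Row arithmetic from the notebook's own counts.
shared_in_notebook = (115 + 174 - 215) // 2
print(shared_in_notebook)  # 37
```

A consequence of this design is that common sentiment words such as "good" carry no score at all if they appear in even one review of the opposite class.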


In [21]:
restaurant_lex_i = restaurant_lex.set_index('Lexicon')
restaurant_lex_i.head()

Unnamed: 0_level_0,Score
Lexicon,Unnamed: 1_level_1
indeed,1
jumbo,1
prime,1
reasonable,1
far,1


# PART 2: PREDICT POLARITY

## 2.1 Calculate polarity




In [22]:
df_clean['Review'] = pd.DataFrame(df_clean.Review.apply(get_meaningful_word))
df_clean

Unnamed: 0,Review,Label
0,"[bad, service, mines, many, staff, mood, work, one, staff, even, reluctantly, food, table, also, set, properly, sorry, hand, picked, staff, properly]",0
1,"[poor, service, mines, unprofessional, unbelievable]",0
2,"[branch, waiting, time, food, little, bit, variety, parking, problem, limited, bit, ambience, fluent, procedure, customer, select, menu, near, counter, waiting, line, long, next, customer, family, many, lined, since, ask, eat, park, near, table, menu, place, order, style, sheet, like, ask, waiter, fill, customer, inform, sheet, go, pay, close]",0
3,"[like, damn, good, luck, day, union]",0
4,"[worst, meal, experience, ever, life, egg, salty, sour, mushroom, soup, comes, free, real, human, long, hair, anyway, chef, key, person, contribute, us, terrific, experience, improvement, need, made]",0
5,"[dear, family, husband, delicious, birthday, ordered, baked, clearly, marked, vegetarian, menu, halfway, meal, discovered, meat, baked, manager, found, baked, prepared, beef, stock, beef, bacon, chef, told, prepared, main, kitchen, sent, valley, deeply, disappointed, favorite, restaurant, good, faith, wrote, official, complaint, week, please, inform, people, consume, beef, ask, patronize, chain, consumer, association, thank]",0
6,"[garden, bad, slow, service, server, look, service, server, try, avoid, come, cook, food, hard, wrap, refill, drink, ask, check, order, twice, response, terrible, service, ever]",0
7,"[today, approximately, sunway, pyramid, branch, rat, kitchen, towards, people, screaming, yelling, yet, none, staff, gave, even, manager, like, give, giving, page]",0
8,"[yesterday, gone, garden, let, say, first, time, go, times, want, share, experience, first, increase, per, person, service, bad, example, lemon, finished, fill, said, sorry, finished, also, crab, smelly, bad, taste, told, manager, least, today, think, fresh, smelly, check, thing, go]",0
9,"[slow, service, customer, friendly, fight, night]",0


In [23]:
def total_score(text):
    score=0
    for word in text:
      if word in restaurant_lex.Lexicon.values:
        score = score + restaurant_lex_i.loc[word].values
    return score

calc = lambda x: total_score(x)
df_clean['Score']=pd.DataFrame(df_clean.Review.apply(calc))
df_clean

Unnamed: 0,Review,Label,Score
0,"[bad, service, mines, many, staff, mood, work, one, staff, even, reluctantly, food, table, also, set, properly, sorry, hand, picked, staff, properly]",0,[-6]
1,"[poor, service, mines, unprofessional, unbelievable]",0,[-3]
2,"[branch, waiting, time, food, little, bit, variety, parking, problem, limited, bit, ambience, fluent, procedure, customer, select, menu, near, counter, waiting, line, long, next, customer, family, many, lined, since, ask, eat, park, near, table, menu, place, order, style, sheet, like, ask, waiter, fill, customer, inform, sheet, go, pay, close]",0,[-11]
3,"[like, damn, good, luck, day, union]",0,[-1]
4,"[worst, meal, experience, ever, life, egg, salty, sour, mushroom, soup, comes, free, real, human, long, hair, anyway, chef, key, person, contribute, us, terrific, experience, improvement, need, made]",0,[-9]
5,"[dear, family, husband, delicious, birthday, ordered, baked, clearly, marked, vegetarian, menu, halfway, meal, discovered, meat, baked, manager, found, baked, prepared, beef, stock, beef, bacon, chef, told, prepared, main, kitchen, sent, valley, deeply, disappointed, favorite, restaurant, good, faith, wrote, official, complaint, week, please, inform, people, consume, beef, ask, patronize, chain, consumer, association, thank]",0,[-13]
6,"[garden, bad, slow, service, server, look, service, server, try, avoid, come, cook, food, hard, wrap, refill, drink, ask, check, order, twice, response, terrible, service, ever]",0,[-10]
7,"[today, approximately, sunway, pyramid, branch, rat, kitchen, towards, people, screaming, yelling, yet, none, staff, gave, even, manager, like, give, giving, page]",0,[-6]
8,"[yesterday, gone, garden, let, say, first, time, go, times, want, share, experience, first, increase, per, person, service, bad, example, lemon, finished, fill, said, sorry, finished, also, crab, smelly, bad, taste, told, manager, least, today, think, fresh, smelly, check, thing, go]",0,[-9]
9,"[slow, service, customer, friendly, fight, night]",0,[-1]
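
The bracketed scores such as `[-6]` appear because `.loc[word].values` returns a one-element array rather than a number. One possible alternative, sketched below, looks the words up in a plain dict and returns an ordinary integer. The `lexicon` dict is a hypothetical five-word slice of the lexicon, and `total_score_int` and `to_sentiment` are illustrative names, not functions from the notebook.

```python
# Hypothetical slice of the lexicon as a plain dict: word -> score.
lexicon = {"poor": -1, "unprofessional": -1, "unbelievable": -1,
           "indeed": 1, "reasonable": 1}

def total_score_int(words, lexicon):
    """Sum the lexicon scores of the words; unknown words count as 0."""
    return sum(lexicon.get(word, 0) for word in words)

def to_sentiment(score):
    """Threshold used in this notebook: a score of 0 or more maps to 1, otherwise 0."""
    return 1 if score >= 0 else 0

review_1 = ["poor", "service", "mines", "unprofessional", "unbelievable"]
score = total_score_int(review_1, lexicon)
print(score, to_sentiment(score))   # -3 0

# A review without any lexicon word scores 0 and is therefore labeled positive.
print(to_sentiment(total_score_int(["service", "mines"], lexicon)))  # 1
```

With the real data, the dict could be built with `restaurant_lex_i['Score'].to_dict()`, after which no dtype conversion of the Score column should be needed.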


Convert 'Score' to an integer dtype; total_score() returns one-element arrays, so the column is otherwise stored as object.

In [24]:
df_clean['Score'] = df_clean.Score.astype('int64')
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  130 non-null    object
 1   Label   130 non-null    int64 
 2   Score   130 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 3.2+ KB


Map each score to a sentiment: a score of 0 or above is positive (1), and a score below 0 is negative (0).

In [25]:
def senti(total_score):
  if (total_score >= 0):
      total_score = 1
  else: 
      total_score = 0 
  return total_score

snt = lambda x: senti(x)
df_clean['Sentiment']=pd.DataFrame(df_clean.Score.apply(snt))
df_clean

Unnamed: 0,Review,Label,Score,Sentiment
0,"[bad, service, mines, many, staff, mood, work, one, staff, even, reluctantly, food, table, also, set, properly, sorry, hand, picked, staff, properly]",0,-6,0
1,"[poor, service, mines, unprofessional, unbelievable]",0,-3,0
2,"[branch, waiting, time, food, little, bit, variety, parking, problem, limited, bit, ambience, fluent, procedure, customer, select, menu, near, counter, waiting, line, long, next, customer, family, many, lined, since, ask, eat, park, near, table, menu, place, order, style, sheet, like, ask, waiter, fill, customer, inform, sheet, go, pay, close]",0,-11,0
3,"[like, damn, good, luck, day, union]",0,-1,0
4,"[worst, meal, experience, ever, life, egg, salty, sour, mushroom, soup, comes, free, real, human, long, hair, anyway, chef, key, person, contribute, us, terrific, experience, improvement, need, made]",0,-9,0
5,"[dear, family, husband, delicious, birthday, ordered, baked, clearly, marked, vegetarian, menu, halfway, meal, discovered, meat, baked, manager, found, baked, prepared, beef, stock, beef, bacon, chef, told, prepared, main, kitchen, sent, valley, deeply, disappointed, favorite, restaurant, good, faith, wrote, official, complaint, week, please, inform, people, consume, beef, ask, patronize, chain, consumer, association, thank]",0,-13,0
6,"[garden, bad, slow, service, server, look, service, server, try, avoid, come, cook, food, hard, wrap, refill, drink, ask, check, order, twice, response, terrible, service, ever]",0,-10,0
7,"[today, approximately, sunway, pyramid, branch, rat, kitchen, towards, people, screaming, yelling, yet, none, staff, gave, even, manager, like, give, giving, page]",0,-6,0
8,"[yesterday, gone, garden, let, say, first, time, go, times, want, share, experience, first, increase, per, person, service, bad, example, lemon, finished, fill, said, sorry, finished, also, crab, smelly, bad, taste, told, manager, least, today, think, fresh, smelly, check, thing, go]",0,-9,0
9,"[slow, service, customer, friendly, fight, night]",0,-1,0


### 2.1.1 Row-by-row calculation

In [26]:
for i,col in df_clean.iterrows():
  score = 0
  print('Review ', i , ': ')
  print(df.Review.iloc[i],'\n')
  r=col['Review']
  for word in r:
    if word in restaurant_lex.Lexicon.values:
      print('  ' , word , restaurant_lex_i.loc[word].values)
      score = score + restaurant_lex_i.loc[word].values
  #classify once per review, after all words are scored (also covers reviews with no words left)
  if (score >= 0):
    Pscore = 'Positive'
  else: 
    Pscore = 'Negative'

  print('\nTotal: ' , score)
  print('Sentiment: ', Pscore)
  print('------------------------------------------------------------------------------------')
    

Review  0 : 
Bad service at The Mines when not many customers. The staff have no mood to do their work at all. One of the staff even reluctantly serves the food. The table is also isn't set up properly. Sorry Kenny's Rogers you should hand picked your staff properly. 

   bad [-1]
   reluctantly [-1]
   table [-1]
   properly [-1]
   sorry [-1]
   properly [-1]

Total:  [-6]
Sentiment:  Negative
------------------------------------------------------------------------------------
Review  1 : 
very very poor service in mines. very unprofessional. unbelievable. 

   poor [-1]
   unprofessional [-1]
   unbelievable [-1]

Total:  [-3]
Sentiment:  Negative
------------------------------------------------------------------------------------
Review  2 : 
Branch Senawang. The waiting time is ok. Food and drinks are ok but a little bit more variety again ok. Parking problem because the limited bit. Ambience is ok. Just be more FLUENT mengorder procedure. Now this customer select the menu near th

## 2.2 Prediction Performance

In [77]:
def get_accuracy(frame):
  #share of reviews whose predicted Sentiment equals the true Label
  return (frame["Label"] == frame["Sentiment"]).mean()

accuracy_lexicon = get_accuracy(df_clean)
print('Accuracy:' , str(round((accuracy_lexicon*100),2)), ' %')

Accuracy: 95.38  %


In [28]:
y_test = df_clean["Label"]
y_pred = df_clean["Sentiment"]

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
print (confusion_matrix)

Predicted   0   1
Actual           
0          43   5
1           1  81


In [29]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.90      0.93        48
           1       0.94      0.99      0.96        82

    accuracy                           0.95       130
   macro avg       0.96      0.94      0.95       130
weighted avg       0.95      0.95      0.95       130
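
The figures in the report can be traced back to the four confusion-matrix counts. The sketch below recomputes them by hand with plain Python; the helper names are illustrative. For example, precision for class 0 is 43 / (43 + 1) and recall is 43 / (43 + 5).

```python
# Confusion-matrix counts as printed: key = (actual, predicted).
cm = {(0, 0): 43, (0, 1): 5,
      (1, 0): 1,  (1, 1): 81}

def precision(label):
    """Correct predictions of `label` divided by all predictions of `label`."""
    predicted = sum(n for (actual, pred), n in cm.items() if pred == label)
    return cm[(label, label)] / predicted

def recall(label):
    """Correct predictions of `label` divided by all true members of `label`."""
    actual_total = sum(n for (actual, pred), n in cm.items() if actual == label)
    return cm[(label, label)] / actual_total

def f1(label):
    p, r = precision(label), recall(label)
    return 2 * p * r / (p + r)

accuracy = (cm[(0, 0)] + cm[(1, 1)]) / sum(cm.values())

for label in (0, 1):
    print(label, round(precision(label), 2), round(recall(label), 2), round(f1(label), 2))
# 0 0.98 0.9 0.93
# 1 0.94 0.99 0.96
print(round(accuracy, 4))  # 0.9538
```

Note that these scores are measured on the same 130 reviews from which the lexicon was built, so they describe fit to this data rather than performance on unseen reviews.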



# PART 3: COMPARISON OF MODEL PERFORMANCES

In [68]:
def show_percentage(x):
  return "{0:.2f}".format(round(x, 2) * 100)

def run_ML(feature_name, xtrain, ytrain, xtest, ytest):

  print(feature_name)

  #GaussianNB for the dense Doc2Vec vectors (which can be negative),
  #MultinomialNB for the non-negative BOW and TF-IDF matrices
  naive_bayes = GaussianNB() if feature_name == "Doc2Vec" else MultinomialNB()

  model_dict = {
      'GradientBoostingClassifier': GradientBoostingClassifier(n_estimators=100),
      'RandomForestClassifier': RandomForestClassifier(n_estimators=100, random_state=0),
      'NaiveBayes': naive_bayes,
      'NeuralNetwork': MLPClassifier(solver='adam', hidden_layer_sizes=(10,5), random_state=2, activation='relu', max_iter=5000, learning_rate='invscaling'),
      'SupportVector': svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=2000, decision_function_shape='ovr', random_state=2),
      'LinearSVC': LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=30000),
      'LogisticRegression': LogisticRegression(random_state=42, max_iter=8000, multi_class='auto', solver='saga'),
  }
  
  #index_list = ['GradientBoostingClassifier', 'NaiveBayes', 'NeuralNetwork', 'RandomForestClassifier', 'SupportVector', 'LinearSVC', 'LogisticRegression']
  cols = ['Accuracy', 'F1', 'Precision', 'Recall']
  df_report = pd.DataFrame(columns=cols)        
                           
  for name, algo in model_dict.items():
    algo.fit(xtrain, ytrain) 
    pred = algo.predict(xtest)
    df_report.loc[name, 'Accuracy'] = show_percentage(accuracy_score(ytest, pred))
    df_report.loc[name, 'F1'] = show_percentage(f1_score(ytest, pred, average='macro'))
    df_report.loc[name, 'Precision'] = show_percentage(precision_score(ytest, pred, average='macro'))
    df_report.loc[name, 'Recall'] = show_percentage(recall_score(ytest, pred, average='macro'))
    df_report['Feature'] = feature_name
  
  cols = ['Feature', 'Accuracy', 'F1', 'Precision', 'Recall']
  df_report = df_report.reindex(columns=cols)
    
  return df_report.sort_values('F1', ascending=False)

In [31]:
#convert list of words into sentences
df_clean['Review'] = df_clean.Review.apply(lambda x: ' '.join(x))

In [69]:
feature = df_clean['Review'].values
label = df_clean['Label'].values

x_train, x_test, y_train, y_test = train_test_split(feature, label, test_size=0.2, random_state=1, shuffle=True)

print(len(x_train), "training instances +", len(x_test), "test instances")

104 training instances + 26 test instances


# 3.1 Features comparison

## 3.1.1 Bag-of-Words

In [70]:
BOW = CountVectorizer()
BOW.fit(x_train)
XX_train = BOW.transform(x_train)
XX_test  = BOW.transform(x_test)
bow_summ = run_ML("BOW", XX_train, y_train, XX_test, y_test)
bow_summ

BOW


Classifier,Feature,Accuracy,F1,Precision,Recall
GradientBoostingClassifier,BOW,92.0,90.0,95.0,88.0
LogisticRegression,BOW,88.0,85.0,93.0,81.0
NaiveBayes,BOW,85.0,78.0,91.0,75.0
LinearSVC,BOW,85.0,78.0,91.0,75.0
NeuralNetwork,BOW,81.0,71.0,89.0,69.0
RandomForestClassifier,BOW,77.0,68.0,76.0,66.0
SupportVector,BOW,77.0,63.0,88.0,62.0
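
The bag-of-words step learns its vocabulary from the training reviews only and then counts those words in every document. The sketch below imitates that fit/transform split with `collections.Counter` on made-up sentences. It is a simplification: it tokenizes by splitting on whitespace, whereas `CountVectorizer` applies its own token pattern and lowercasing.

```python
from collections import Counter

train_docs = ["bad slow service", "good food good service"]
test_docs = ["good service terrible food"]

# "fit": the vocabulary comes from the training documents only.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def transform(doc):
    counts = Counter(doc.split())
    # One column per vocabulary word; words never seen in training are dropped.
    return [counts[word] for word in vocabulary]

print(vocabulary)                          # ['bad', 'food', 'good', 'service', 'slow']
print([transform(d) for d in train_docs])  # [[1, 0, 0, 1, 1], [0, 1, 2, 1, 0]]
print(transform(test_docs[0]))             # [0, 1, 1, 1, 0]
```

The word "terrible" has no column because it never occurred in training, which is why a small training set of 104 reviews limits what the classifiers can see in the test reviews.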


## 3.1.2 TF-IDF

In [71]:
TFIDF = TfidfVectorizer()
TFIDF.fit(x_train)
XXX_train = TFIDF.transform(x_train)
XXX_test  = TFIDF.transform(x_test)

tfidf_summ = run_ML("TF-IDF", XXX_train, y_train, XXX_test, y_test)
tfidf_summ

TF-IDF


Classifier,Feature,Accuracy,F1,Precision,Recall
GradientBoostingClassifier,TF-IDF,88.0,85.0,93.0,81.0
LinearSVC,TF-IDF,85.0,78.0,91.0,75.0
NaiveBayes,TF-IDF,81.0,71.0,89.0,69.0
SupportVector,TF-IDF,81.0,71.0,89.0,69.0
LogisticRegression,TF-IDF,81.0,71.0,89.0,69.0
NeuralNetwork,TF-IDF,77.0,68.0,76.0,66.0
RandomForestClassifier,TF-IDF,77.0,63.0,88.0,62.0
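
TF-IDF rescales the raw counts so that words occurring in many reviews weigh less. The sketch below works through the weighting on two made-up sentences, assuming the defaults of `TfidfVectorizer` (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). Tokenization is again simplified to whitespace splitting.

```python
import math
from collections import Counter

docs = ["bad slow service", "good food good service"]
vocabulary = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = Counter(w for d in docs for w in set(d.split()))

# Smoothed idf, assuming the default smooth_idf=True.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

def tfidf(doc):
    counts = Counter(doc.split())
    raw = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))   # L2 normalization of the row
    return [v / norm for v in raw]

weights = dict(zip(vocabulary, tfidf(docs[0])))
print({w: round(v, 3) for w, v in weights.items()})
```

In the first sentence, "service" occurs in both documents and ends up with a lower weight than "bad" or "slow", which occur in only one.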


## 3.1.3 Doc2vec

In [35]:
cores = multiprocessing.cpu_count()
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

In [36]:
# SPLIT TO 9:1
doc2vec_train, doc2vec_test = train_test_split(df_clean, test_size=0.1, random_state=42, shuffle=True)

# CONVERT TO DOC TO VEC FORMAT
train_tagged = doc2vec_train.apply(lambda r: TaggedDocument(words=tokenize_text(r['Review']), tags=[r.Label]), axis=1)
test_tagged = doc2vec_test.apply(lambda r: TaggedDocument(words=tokenize_text(r['Review']), tags=[r.Label]), axis=1)


# dm :: 1 = 'distributed memory' (PV-DM), 0 = 'distributed bag of words' (PV-DBOW)
# vector_size :: Dimensionality of the feature vectors
# negative :: Number of noise words drawn for negative sampling; 0 = no negative sampling is used.
# hs :: If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
# min_count (int, optional) – Ignores all words with total frequency lower than this.
# sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
# workers (int, optional) – Use this many worker threads to train the model (=faster training with multicore machines).

model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample=0, workers=cores)
model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])


100%|██████████| 117/117 [00:00<00:00, 529949.86it/s]


In [37]:
%%time
for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs=100)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

100%|██████████| 117/117 [00:00<00:00, 420508.63it/s]

CPU times: user 12.8 s, sys: 1.33 s, total: 14.2 s
Wall time: 15 s
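The learning-rate schedule implied by the loop above can be checked with plain arithmetic. The sketch below rests on two assumptions: the model starts from gensim's documented default `alpha` of 0.025 (no `alpha` is passed to the constructor), and each `train` call starts from the current value of `model_dbow.alpha`. The constant names are illustrative.

```python
DEFAULT_ALPHA = 0.025   # assumed: gensim's default initial learning rate
STEP = 0.002            # manual decrement applied after every train() call
OUTER_PASSES = 30       # iterations of the outer for-loop
EPOCHS_PER_CALL = 100   # epochs requested inside each train() call

alpha = DEFAULT_ALPHA
start_alphas = []       # learning rate in effect at the start of each train() call
for _ in range(OUTER_PASSES):
    start_alphas.append(alpha)
    alpha -= STEP

total_epochs = OUTER_PASSES * EPOCHS_PER_CALL
first_negative = next(i for i, a in enumerate(start_alphas) if a < 0)

print(total_epochs)                 # total passes over the 117 training documents
print(first_negative)               # zero-based index of the first call with alpha < 0
print(round(start_alphas[-1], 3))   # alpha at the start of the last call
```

Under these assumptions the corpus is traversed 3000 times, and the learning rate turns negative from the fourteenth call onward. A single `train` call with the desired number of epochs, leaving the decay to gensim, may be worth trying before the Doc2Vec scores are compared with the other features.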


In [72]:
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

doc2vec_summ = run_ML("Doc2Vec", X_train, y_train, X_test, y_test)
doc2vec_summ

Doc2Vec


  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,Classifier,Accuracy,F1,Precision,Recall
NaiveBayes,Doc2Vec,69.0,69.0,78.0,75.0
RandomForestClassifier,Doc2Vec,54.0,46.0,47.0,48.0
GradientBoostingClassifier,Doc2Vec,46.0,41.0,40.0,41.0
NeuralNetwork,Doc2Vec,62.0,38.0,31.0,50.0
SupportVector,Doc2Vec,62.0,38.0,31.0,50.0
LinearSVC,Doc2Vec,62.0,38.0,31.0,50.0
LogisticRegression,Doc2Vec,62.0,38.0,31.0,50.0
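Four classifiers share the same row of scores (62 / 38 / 31 / 50). That pattern is what macro-averaged metrics produce when a model predicts a single class for every test review. The sketch below reproduces the numbers under stated assumptions: a 13-review test set (10% of 130), a hypothetical 8-to-5 class split, macro averaging inside `run_ML` (not shown in this section), and placeholder label values 1 and 0.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1, as rounded percentages."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0  # undefined case scored as 0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    n = len(labels)
    return {
        "accuracy": round(100 * accuracy),
        "precision": round(100 * sum(precisions) / n),
        "recall": round(100 * sum(recalls) / n),
        "f1": round(100 * sum(f1s) / n),
    }

# Hypothetical test split: 8 reviews of the majority class, 5 of the minority class.
y_true = [1] * 8 + [0] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [1] * 13

scores = macro_scores(y_true, y_pred)
print(scores)
```

If this reading is right, the Doc2Vec vectors gave those four classifiers no usable signal, and their accuracy of 62 reflects only the class balance of the test set.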


#3.2 Summary

In [78]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three summaries
report = pd.concat([bow_summ, tfidf_summ, doc2vec_summ])
report = report.sort_values(['F1','Accuracy'], ascending=False)
report

Unnamed: 0,Classifier,Accuracy,F1,Precision,Recall
GradientBoostingClassifier,BOW,92.0,90.0,95.0,88.0
LogisticRegression,BOW,88.0,85.0,93.0,81.0
GradientBoostingClassifier,TF-IDF,88.0,85.0,93.0,81.0
NaiveBayes,BOW,85.0,78.0,91.0,75.0
LinearSVC,BOW,85.0,78.0,91.0,75.0
LinearSVC,TF-IDF,85.0,78.0,91.0,75.0
NeuralNetwork,BOW,81.0,71.0,89.0,69.0
NaiveBayes,TF-IDF,81.0,71.0,89.0,69.0
SupportVector,TF-IDF,81.0,71.0,89.0,69.0
LogisticRegression,TF-IDF,81.0,71.0,89.0,69.0


In [75]:
report[['Accuracy', 'F1', 'Precision', 'Recall']] = report[['Accuracy', 'F1', 'Precision', 'Recall']].astype(float)
report.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21 entries, GradientBoostingClassifier to LogisticRegression
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Classifier  21 non-null     object 
 1   Accuracy    21 non-null     float64
 2   F1          21 non-null     float64
 3   Precision   21 non-null     float64
 4   Recall      21 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1008.0+ bytes


In [76]:
report.describe()

Unnamed: 0,Accuracy,F1,Precision,Recall
count,21.0,21.0,21.0,21.0
mean,74.86,64.19,72.71,65.29
std,12.41,17.26,25.09,12.82
min,46.0,38.0,31.0,41.0
25%,62.0,46.0,47.0,50.0
50%,77.0,69.0,88.0,69.0
75%,85.0,78.0,91.0,75.0
max,92.0,90.0,95.0,88.0


Based on the table above, the best model for restaurant review text classification is **Gradient Boosting** with **Bag-of-Words** features. Its accuracy (92.0), F1 score (90.0), precision (95.0) and recall (88.0) are the highest among all the models compared.

#CONCLUSION

In this project, two sentiment analysis approaches are used: a lexicon-based approach and a machine learning approach. The lexicon-based approach is covered in PART 1 and PART 2, while PART 3 covers the machine learning approach.



In [87]:
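# Best ML row (report is sorted by F1, then Accuracy) versus the lexicon accuracy;
# no F1 is available for the lexicon approach, so it is filled with NaN explicitly
compare = [('Machine Learning', report.Accuracy.iloc[0], report.F1.iloc[0]),
           ('Lexicon Based', show_percentage(accuracy_lexicon), np.nan)]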

comparison = pd.DataFrame(compare, columns = ['Approach', 'Accuracy', 'F1'])

comparison

Unnamed: 0,Approach,Accuracy,F1
0,Machine Learning,92.0,90.0
1,Lexicon Based,95.0,


Based on the table above, the best machine learning model reaches 92.0 accuracy and 90.0 F1, while the lexicon-based approach reaches 95.0 accuracy and has no F1 score, so the two approaches can only be compared on accuracy.
Machine learning algorithms usually take a statistical approach, so they are sensitive to changes in the data and tend to attain good predictive accuracy. Where the lexicon approach falls short in this assignment, the likely cause is that the lexicon is not yet developed enough to cover the affective words needed for sentiment analysis.