# Reaching Beyond the Stars In Recommending Thai Restaurants in Las Vegas: A Sentiment Detecting Approach to Rating Reviews as a Complement to User Ratings

## ABSTRACT

In line with career prospects in social networking applications, this classification project deals with recommender systems and focuses on the problem of matching a user to a new item. Using the business and review academic datasets from the 2015 Yelp Dataset Challenge, the task is to classify a Thai restaurant in Las Vegas that is new to a user as one the user should try or not. Performance is measured as classification accuracy: whether a restaurant rating is predicted correctly for a user from user ratings and from lexicon-extracted sentiment scores in review text. The merged working dataset has 990,627 reviews, of which 405,760 target one of the 4,960 restaurants in Las Vegas. In line with the “Yelp Restaurant Lexicon” (Kiritchenko, Zhu, Cherry, & Mohammad, 2014b), whose units of analysis are Yelp reviews of restaurants in Phoenix (AZ), we narrow the reviews to those written by users who have reviewed restaurants in Phoenix, although not exclusively there. We further narrow the reviews to Thai restaurants to accommodate R and RAM limits. Using the restaurant ratings given by reviewers and the sentiment scores, we select a user-based collaborative filtering approach to predict missing ratings and to deliver top recommendations to the user.

## CODE AND RESULTS FOR THE SENTIMENT DETECTION EXERCISE

The objectives of this kernel are to extract sentiment scores from Thai restaurant reviews using lexicons and to make a preliminary comparison of those scores with the ratings that individual reviewers gave to the restaurants.

In this kernel, we rely on a working dataset that was created in a previous phase of the project by merging a slightly transformed version of the Yelp_academic_dataset_business into the Yelp_academic_dataset_review (key: business_id, method: left, left dataset: Yelp_academic_dataset_review). For lexicon compatibility, we filtered the working dataset to users who have rated restaurants in Phoenix (AZ); because of RAM limitations, we then filtered it to the "Thai" category of restaurants. Before filtering the working dataset by Thai restaurants, we combined review text by unique users and averaged review_ratings by unique users. The restaurant review data files in this kernel were derived from this working dataset. The lexicon data files were downloaded from the internet.

In the next few code cells, we will be reading four files: 

1- ThaiTextByUniqueUserBiz.csv, which has three separate columns of interest: user_id, business_id and text 

2- Yelp-restaurant-reviews-AFFLEX-NEGLEX-unigrams.txt, which was downloaded from http://saifmohammad.com/Lexicons/Yelp-restaurant-reviews.zip

3- AFINN-emoticons-en-165.txt, which is a lexicon derived from the content of the AFINN_emoticon-8 downloaded from https://github.com/fnielsen/afinn/blob/master/afinn/data/AFINN-emoticon-8.txt and manually copied to the AFINN-en-165 lexicon downloaded from https://github.com/fnielsen/afinn/blob/master/afinn/data/AFINN-en-165.txt

4- ThaiReviewRatingsByUserBiz.pkl, which has three columns of interest: user_id, business_id and review_ratings.

The files in items 1 and 4 were derived from the same working dataset. We made a decision to re-merge the two with the keys 'user_id' and 'business_id' at the end of this kernel to prevent confusion and to compare the various sentiment scores with the review_ratings.

### Import Utility Packages

We acknowledge that some of these packages are overkill. You may wish to select what you need to decrease the burden on your memory.

In [1]:
import sys
import re
import os
import shutil
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import itertools
import nltk
nltk.download('all')

[nltk_data] Downloading collection u'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Luc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Luc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\Luc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Luc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\Luc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     C:\Users\Luc\AppDat

True

### Read the Thai Restaurant Review File

We choose a csv file rather than a pkl file and encode with utf-8 in order to avoid complications with accented characters, such as 'ê' in crêpe.

In [2]:
ThaiTextByUniqueUserBiz = pd.read_csv("ThaiTextByUniqueUserBiz.csv", encoding='utf-8')
ThaiTextByUniqueUserBiz

Unnamed: 0.1,Unnamed: 0,user_id,business_id,text
0,0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,I'm a big fan of this place and have dropped i...
1,1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,Food was ok but the service was less than exce...
2,2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,I got the Penang curry and have to say the foo...
3,3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,"To sum up in one sentence: ""I only go to Thai ..."
4,4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,I LOVE THIS PLACE!\r\r\n\r\r\nIt's a cute mom-...
5,5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,The food was delicious and the service was ama...
6,6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,My Girlfriend and I eat here occasionally and ...
7,7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,Very friendly family business. We had the pad...
8,8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,Yellow curry w/tofu is my favorite!
9,9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,"Excellent food, reasonable price and great atm..."


In [3]:
ThaiTextByUniqueUserBiz.drop(ThaiTextByUniqueUserBiz.columns[[0]],axis=1, inplace=True)
ThaiTextByUniqueUserBiz

Unnamed: 0,user_id,business_id,text
0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,I'm a big fan of this place and have dropped i...
1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,Food was ok but the service was less than exce...
2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,I got the Penang curry and have to say the foo...
3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,"To sum up in one sentence: ""I only go to Thai ..."
4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,I LOVE THIS PLACE!\r\r\n\r\r\nIt's a cute mom-...
5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,The food was delicious and the service was ama...
6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,My Girlfriend and I eat here occasionally and ...
7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,Very friendly family business. We had the pad...
8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,Yellow curry w/tofu is my favorite!
9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,"Excellent food, reasonable price and great atm..."
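The `Unnamed: 0` columns above typically appear when a dataframe is written to CSV together with its index and then read back. The sketch below, which uses a hypothetical two-row dataframe and an in-memory buffer instead of the project's files, shows one way to avoid them by writing without the index. It is an illustration only, not the way the original files were produced.

```python
import io

import pandas as pd

# Hypothetical stand-in for the review dataframe (not the project's data).
reviews = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "business_id": ["b1", "b2"],
    "text": ["Great crêpe", "Yellow curry w/tofu is my favorite!"],
})

# Write without the index so no extra "Unnamed: 0" column is created.
buffer = io.StringIO()
reviews.to_csv(buffer, index=False)

# Read the CSV back from the start of the buffer.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```

When a file is written to disk instead of a buffer, passing `encoding='utf-8'` to both `to_csv` and `read_csv`, as this kernel does, keeps accented characters intact.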


### Read the Yelp Restaurant Review Lexicon File


Although a bigram-based lexicon is also available in the Yelp Restaurant Review Lexicon download, we choose the unigram-based lexicon for this exercise for ease of comparison with the AFINN lexicon. We first assign column names to the four types of values found in the unigram-based text file, although we will use only the 'word' and 'score' values in this exercise.

In [4]:
colnames = ['word', 'score', 'pos', 'neg']
YelpSentiment = pd.read_table("Yelp-restaurant-reviews-AFFLEX-NEGLEX-unigrams.txt", names=colnames)
YelpSentiment

Unnamed: 0,word,score,pos,neg
0,overpowering_NEGFIRST,3.798,139,0
1,yumm,3.716,128,0
2,faves,3.306,256,2
3,yummmm,3.250,80,0
4,satisfies,3.238,79,0
5,disappoints_NEGFIRST,3.100,208,2
6,bosnian,3.061,66,0
7,combines,3.046,65,0
8,vinegars,2.967,60,0
9,exquisite,2.955,240,3


### Filter out the NEGLEX elements from the Yelp Restaurant Review Lexicon  

The unigram lexicon for Yelp restaurant reviews has two scales, AFFLEX and NEGLEX, each with its own words (some shared with the other scale) and its own sentiment scores. For the purpose of this exercise, we filter out the NEGLEX scale, as it provides positive and negative values for some words with negative connotations. We are seeking alignment with the AFINN lexicons for comparison purposes. The NEGLEX lexicon (the YelpSentimentNEGFLEX dataframe below) is provided for reference purposes only.

In [5]:
YelpSentimentAFFLEX = YelpSentiment[YelpSentiment.word.str.contains("_NEG") == False]
YelpSentimentAFFLEX = YelpSentimentAFFLEX.reset_index(drop=True)
YelpSentimentAFFLEX

Unnamed: 0,word,score,pos,neg
0,yumm,3.716,128,0
1,faves,3.306,256,2
2,yummmm,3.250,80,0
3,satisfies,3.238,79,0
4,bosnian,3.061,66,0
5,combines,3.046,65,0
6,vinegars,2.967,60,0
7,exquisite,2.955,240,3
8,yummm,2.942,118,1
9,seasonally,2.899,56,0


In [6]:
YelpSentimentNEGFLEX = YelpSentiment[YelpSentiment.word.str.contains("_NEG") == True]
YelpSentimentNEGFLEX

Unnamed: 0,word,score,pos,neg
0,overpowering_NEGFIRST,3.798,139,0
5,disappoints_NEGFIRST,3.100,208,2
10,regret_NEGFIRST,2.953,360,5
24,disappoint_NEGFIRST,2.802,982,18
28,pushy_NEGFIRST,2.727,47,0
30,pretentious_NEGFIRST,2.692,138,2
31,overpower_NEGFIRST,2.685,45,0
38,affordable_NEG,2.640,43,0
46,rushed_NEGFIRST,2.606,84,1
51,skimping_NEGFIRST,2.570,40,0
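The split performed in cells 5 and 6 can also be written with a single boolean mask. The sketch below uses a four-row toy lexicon built from entries shown in the outputs above; the variable names are illustrative. Passing `regex=False` treats `_NEG` as a literal substring, and `na=False` treats missing words as non-matches, which is an assumption about how such rows should be handled.

```python
import pandas as pd

# Toy lexicon with a few rows copied from the outputs above.
lexicon = pd.DataFrame({
    "word": ["yumm", "disappoints_NEGFIRST", "affordable_NEG", "exquisite"],
    "score": [3.716, 3.100, 2.640, 2.955],
})

# True for entries of the NEGLEX scale (suffix _NEG or _NEGFIRST).
is_neglex = lexicon["word"].str.contains("_NEG", regex=False, na=False)

afflex = lexicon[~is_neglex].reset_index(drop=True)  # kept for scoring
neglex = lexicon[is_neglex]                          # kept for reference only

print(afflex)
print(neglex)
```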


### Create Two Python Dictionaries with Sentiment Scores for the Sentiment Detection Algorithm

In this section, the objectives of the code cells are (1) to identify the dictionary format used with the AFINN lexicon for the sentiment detection algorithm and (2) to create a Python dictionary in the same format from the YelpSentimentAFFLEX dataframe of the Yelp Restaurant Review Lexicon. The Python dictionaries will be used in sentiment detection algorithms in the next section.

#### Create the AFINN dictionary

In [7]:
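# Build a {token: integer score} dictionary from the tab-separated AFINN file.
sentiment_dictionary = {}
with open('AFINN-emoticons-en-165.txt') as afinn_file:  # closes the file when done
    for line in afinn_file:
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)
sentiment_dictionary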

{'unimaginative': -2,
 'limited': -1,
 'unscientific': -2,
 'suicidal': -2,
 'pardon': 2,
 'desirable': 2,
 'foul': -3,
 'obstruction': -2,
 'protest': -2,
 'lurking': -1,
 'controversial': -2,
 'hating': -3,
 'ridiculous': -3,
 'hate': -3,
 '\\o/': 3,
 'aggression': -2,
 'poorly': -2,
 'stinks': -2,
 'infuriates': -2,
 'regretted': -2,
 'violate': -2,
 'granting': 1,
 'attracted': 1,
 'tremors': -2,
 'stinky': -2,
 'poorest': -2,
 'disability': -2,
 'condemns': -2,
 'sorry': -1,
 'regrets': -2,
 'struck': -1,
 'misreporting': -2,
 'compassion': 2,
 'misreports': -2,
 'hilarious': 2,
 'lurk': -1,
 'misunderstanding': -2,
 'distort': -2,
 'lololol': 4,
 'stolen': -2,
 'gratification': 2,
 'uncertain': -1,
 'stabbed': -2,
 'screaming': -2,
 'courageous': 2,
 'disturb': -2,
 'exaggerate': -2,
 'harried': -2,
 'solution': 1,
 'nigger': -5,
 'honor': 2,
 'pardons': 2,
 'delightfully': 3,
 'monopolized': -2,
 'illiteracy': -2,
 'triumph': 4,
 'enjoy': 2,
 'shithead': -4,
 'diverting': -1,
 '

#### Create the Yelp Sentiment dictionary

In [8]:
Yelp_sentiment_dictionary = YelpSentimentAFFLEX.set_index('word')['score'].to_dict()
Yelp_sentiment_dictionary

{'gai': 1.6890000000000001,
 'mid-week': 0.16800000000000001,
 'woods': 0.90200000000000002,
 'hanging': 0.13300000000000001,
 'woody': 0.59899999999999998,
 'northsight': -0.33299999999999996,
 'comically': -0.67400000000000004,
 'frou-frou': 0.35999999999999999,
 'cake-': 0.871,
 'originality': -0.098000000000000004,
 'calpico': 1.0529999999999999,
 'fattiness': 0.24199999999999999,
 'rawhide': -0.45100000000000001,
 'bringing': 0.089999999999999997,
 'tcby': -0.16300000000000001,
 'revelers': -0.45100000000000001,
 'caramels': 0.89300000000000002,
 'grueling': 0.155,
 'broiler': -0.90300000000000002,
 'caramely': 0.80200000000000005,
 'condessa': -0.29699999999999999,
 'wednesday': 0.56799999999999995,
 'broiled': 0.54600000000000004,
 'crotch': -0.044999999999999998,
 'stereotypical': -0.64800000000000002,
 'caramelo': 1.159,
 'bbqs': 0.35999999999999999,
 'chimichuri': 0.24199999999999999,
 "roscoe's": 0.92299999999999993,
 "tom's": -0.33799999999999997,
 'scrapes': 0.648000000000

#### Observations

We note that the two dictionaries have emoticons, that their word elements are lower-cased and that verbs are conjugated in various tenses. We also note that the Yelp Sentiment dictionary has both correctly spelled and misspelled words.

### Run the Sentiment Detection Algorithm with Dictionaries and NLTK's TweetTokenizer

As the restaurant reviews in our dataset have emoticons, we choose NLTK's TweetTokenizer to tokenize the reviews so as not to sacrifice the emoticons, which carry key review sentiment. We acknowledge, however, that sarcasm may be difficult to detect with a simple dictionary method. We also lower-case all words and emoticons, as both dictionaries have exclusively lower-case words and the AFINN dictionary has both upper-case emoticons and their lower-case equivalents. For ease of comprehension, we choose not to combine the sentiment score summing lines in a single algorithm. We acknowledge that combining them may save significant computational resources with larger datasets.

#### Extract Sentiment Scores from Reviews Using Algorithm, TweetTokenizer and Yelp Restaurant Review Sentiment Dictionary

In [9]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
for row in range(len(ThaiTextByUniqueUserBiz)):
    n = ThaiTextByUniqueUserBiz.loc[row, 'text']
    words = tknzr.tokenize(n.lower())
    ThaiTextByUniqueUserBiz.loc[row,"YRsentiment"] = sum(Yelp_sentiment_dictionary.get(word, 0) for word in words)
ThaiTextByUniqueUserBiz  

Unnamed: 0,user_id,business_id,text,YRsentiment
0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,I'm a big fan of this place and have dropped i...,43.539
1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,Food was ok but the service was less than exce...,-9.641
2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,I got the Penang curry and have to say the foo...,0.105
3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,"To sum up in one sentence: ""I only go to Thai ...",-16.096
4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,I LOVE THIS PLACE!\r\r\n\r\r\nIt's a cute mom-...,5.802
5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,The food was delicious and the service was ama...,12.067
6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,My Girlfriend and I eat here occasionally and ...,14.353
7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,Very friendly family business. We had the pad...,3.156
8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,Yellow curry w/tofu is my favorite!,4.719
9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,"Excellent food, reasonable price and great atm...",15.490


#### Convert Yelp Review Sentiment Scores to a 1-to-5 Scale Using the Round Function without Decimals

In [10]:
OldMax = max(ThaiTextByUniqueUserBiz['YRsentiment'])
OldMin = min(ThaiTextByUniqueUserBiz['YRsentiment'])
NewMax = 5
NewMin = 1
OldRange = (OldMax - OldMin)
NewRange = (NewMax - NewMin)
for row in range(len(ThaiTextByUniqueUserBiz)):
    n = ThaiTextByUniqueUserBiz.loc[row, 'YRsentiment']
    ThaiTextByUniqueUserBiz.loc[row,"YRSentScore"] = (((n - OldMin) * NewRange / OldRange) + NewMin).round(decimals=0, out=None)
ThaiTextByUniqueUserBiz  

Unnamed: 0,user_id,business_id,text,YRsentiment,YRSentScore
0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,I'm a big fan of this place and have dropped i...,43.539,3
1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,Food was ok but the service was less than exce...,-9.641,3
2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,I got the Penang curry and have to say the foo...,0.105,3
3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,"To sum up in one sentence: ""I only go to Thai ...",-16.096,3
4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,I LOVE THIS PLACE!\r\r\n\r\r\nIt's a cute mom-...,5.802,3
5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,The food was delicious and the service was ama...,12.067,3
6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,My Girlfriend and I eat here occasionally and ...,14.353,3
7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,Very friendly family business. We had the pad...,3.156,3
8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,Yellow curry w/tofu is my favorite!,4.719,3
9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,"Excellent food, reasonable price and great atm...",15.490,3


#### Convert Yelp Review Sentiment Scores to a 1-to-5 Scale with 0.5 Increments

In [11]:
OldMax = max(ThaiTextByUniqueUserBiz['YRsentiment'])
OldMin = min(ThaiTextByUniqueUserBiz['YRsentiment'])
NewMax = 5
NewMin = 1
OldRange = (OldMax - OldMin)
NewRange = (NewMax - NewMin)
for row in range(len(ThaiTextByUniqueUserBiz)):
    n = ThaiTextByUniqueUserBiz.loc[row, 'YRsentiment']
    ThaiTextByUniqueUserBiz.loc[row,"YRSentScore2"] = 0.5 * np.ceil(2*(((n - OldMin) * NewRange / OldRange) + NewMin))
ThaiTextByUniqueUserBiz  

Unnamed: 0,user_id,business_id,text,YRsentiment,YRSentScore,YRSentScore2
0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,I'm a big fan of this place and have dropped i...,43.539,3,3.5
1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,Food was ok but the service was less than exce...,-9.641,3,3.0
2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,I got the Penang curry and have to say the foo...,0.105,3,3.0
3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,"To sum up in one sentence: ""I only go to Thai ...",-16.096,3,3.0
4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,I LOVE THIS PLACE!\r\r\n\r\r\nIt's a cute mom-...,5.802,3,3.0
5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,The food was delicious and the service was ama...,12.067,3,3.5
6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,My Girlfriend and I eat here occasionally and ...,14.353,3,3.5
7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,Very friendly family business. We had the pad...,3.156,3,3.0
8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,Yellow curry w/tofu is my favorite!,4.719,3,3.0
9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,"Excellent food, reasonable price and great atm...",15.490,3,3.5
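The two conversions in cells 10 and 11 apply the same min-max formula, scaled = (x - OldMin) * NewRange / OldRange + NewMin, and differ only in the final rounding. The sketch below restates that arithmetic in plain Python for three of the raw scores shown above. The function names are illustrative, and the minimum and maximum are the YRsentiment values reported by `describe()` later in this kernel.

```python
import math

def rescale(x, old_min, old_max, new_min=1.0, new_max=5.0):
    """Min-max rescaling of x from [old_min, old_max] to [new_min, new_max]."""
    return (x - old_min) * (new_max - new_min) / (old_max - old_min) + new_min

def to_half_steps(x):
    """Round x up to the next multiple of 0.5, as cell 11 does with np.ceil."""
    return 0.5 * math.ceil(2 * x)

# Minimum and maximum of YRsentiment over the full dataframe.
OLD_MIN, OLD_MAX = -233.857, 249.665

for raw in (43.539, 0.105, -16.096):
    scaled = rescale(raw, OLD_MIN, OLD_MAX)
    # raw score, scaled value, no-decimal version, 0.5-increment version
    print(raw, round(scaled, 3), round(scaled), to_half_steps(scaled))
```

Because the 0.5-increment version rounds up while the no-decimal version rounds to the nearest integer, the same scaled value can land on 3 in one column and 3.5 in the other, as in row 0 above.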


#### Extract Sentiment Scores from Reviews Using Algorithm, TweetTokenizer and AFINN Dictionary

In [12]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
for row in range(len(ThaiTextByUniqueUserBiz)):
    n = ThaiTextByUniqueUserBiz.loc[row, 'text']
    words = tknzr.tokenize(n.lower())
    ThaiTextByUniqueUserBiz.loc[row,"AFINNSentScore"] = sum(sentiment_dictionary.get(word, 0) for word in words)
ThaiTextByUniqueUserBiz

Unnamed: 0,user_id,business_id,text,YRsentiment,YRSentScore,YRSentScore2,AFINNSentScore
0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,I'm a big fan of this place and have dropped i...,43.539,3,3.5,22
1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,Food was ok but the service was less than exce...,-9.641,3,3.0,-1
2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,I got the Penang curry and have to say the foo...,0.105,3,3.0,5
3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,"To sum up in one sentence: ""I only go to Thai ...",-16.096,3,3.0,7
4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,I LOVE THIS PLACE!\r\r\n\r\r\nIt's a cute mom-...,5.802,3,3.0,21
5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,The food was delicious and the service was ama...,12.067,3,3.5,15
6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,My Girlfriend and I eat here occasionally and ...,14.353,3,3.5,13
7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,Very friendly family business. We had the pad...,3.156,3,3.0,6
8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,Yellow curry w/tofu is my favorite!,4.719,3,3.0,2
9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,"Excellent food, reasonable price and great atm...",15.490,3,3.5,13
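Cells 9 and 12 tokenize every review twice, once per lexicon. A minimal sketch of the combined variant mentioned earlier, in which each review is tokenized once and scored against every dictionary, could look as follows. The function name, the toy lexicons (a few entries copied from the dictionary outputs above) and the default whitespace tokenizer are assumptions made to keep the example self-contained; in this kernel one would pass `tknzr.tokenize` instead.

```python
def score_review(text, lexicons, tokenize=str.split):
    """Return one summed sentiment score per lexicon, tokenizing only once."""
    tokens = tokenize(text.lower())
    return {
        name: sum(lexicon.get(token, 0) for token in tokens)
        for name, lexicon in lexicons.items()
    }

# Toy lexicons with entries taken from the dictionary outputs above.
toy_lexicons = {
    "yelp": {"yumm": 3.716, "exquisite": 2.955},
    "afinn": {"enjoy": 2, "foul": -3},
}

scores = score_review("Yumm I enjoy the exquisite curry", toy_lexicons)
print(scores)
```

Tokens missing from a lexicon contribute 0, which matches the `.get(word, 0)` behaviour of the original cells.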


#### Convert AFINN Sentiment Scores to a 1-to-5 Scale with 0.5 Increments

In [13]:
OldMax = max(ThaiTextByUniqueUserBiz['AFINNSentScore'])
OldMin = min(ThaiTextByUniqueUserBiz['AFINNSentScore'])
NewMax = 5
NewMin = 1
OldRange = (OldMax - OldMin)
NewRange = (NewMax - NewMin)
for row in range(len(ThaiTextByUniqueUserBiz)):
    n = ThaiTextByUniqueUserBiz.loc[row, 'AFINNSentScore']
    ThaiTextByUniqueUserBiz.loc[row,"AFINNSentScore2"] = 0.5 * np.ceil(2*(((n - OldMin) * NewRange / OldRange) + NewMin))
ThaiTextByUniqueUserBiz  

Unnamed: 0,user_id,business_id,text,YRsentiment,YRSentScore,YRSentScore2,AFINNSentScore,AFINNSentScore2
0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,I'm a big fan of this place and have dropped i...,43.539,3,3.5,22,2.0
1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,Food was ok but the service was less than exce...,-9.641,3,3.0,-1,2.0
2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,I got the Penang curry and have to say the foo...,0.105,3,3.0,5,2.0
3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,"To sum up in one sentence: ""I only go to Thai ...",-16.096,3,3.0,7,2.0
4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,I LOVE THIS PLACE!\r\r\n\r\r\nIt's a cute mom-...,5.802,3,3.0,21,2.0
5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,The food was delicious and the service was ama...,12.067,3,3.5,15,2.0
6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,My Girlfriend and I eat here occasionally and ...,14.353,3,3.5,13,2.0
7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,Very friendly family business. We had the pad...,3.156,3,3.0,6,2.0
8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,Yellow curry w/tofu is my favorite!,4.719,3,3.0,2,2.0
9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,"Excellent food, reasonable price and great atm...",15.490,3,3.5,13,2.0


### Read and Merge the Thai Review Ratings file to the Sentiment Score Dataframe (ThaiTextByUniqueUserBiz)

#### Read the Thai Review Ratings File

In [14]:
ThaiReviewRatingsByUserBiz = pd.read_pickle('ThaiReviewRatingsByUserBiz.pkl')
ThaiReviewRatingsByUserBiz

Unnamed: 0,user_id,business_id,review_ratings
0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,5
1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,1
2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,3
3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,3
4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,5
5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,4
6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,4
7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,3
8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,5
9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,5


#### Merge Ratings to Sentiment Scores in the ThaiTextByUniqueUserBiz Dataframe

In [15]:
ThaiTextByUniqueUserBiz = pd.merge(ThaiTextByUniqueUserBiz, ThaiReviewRatingsByUserBiz, how='left', on=['user_id', 'business_id'])
ThaiTextByUniqueUserBiz

Unnamed: 0,user_id,business_id,text,YRsentiment,YRSentScore,YRSentScore2,AFINNSentScore,AFINNSentScore2,review_ratings
0,--65q1FpAL_UQtVZ2PTGew,JiLK9QPjd53pOBEAaY83lw,I'm a big fan of this place and have dropped i...,43.539,3,3.5,22,2.0,5
1,--ijvARuRJhZrBdS9_jF2A,ApUCpJ9aa6yVgsde16gYrg,Food was ok but the service was less than exce...,-9.641,3,3.0,-1,2.0,1
2,--ohLoec6PU9_yxhbIlVWg,2XXwiASSS6685OhWWnIt_A,I got the Penang curry and have to say the foo...,0.105,3,3.0,5,2.0,3
3,--qEXbk-cA0HmbPyhcffdA,CVos739DJ06t8-dNiRMyeQ,"To sum up in one sentence: ""I only go to Thai ...",-16.096,3,3.0,7,2.0,3
4,--qEXbk-cA0HmbPyhcffdA,jQST5lkLGX9L52-A10TGTQ,I LOVE THIS PLACE!\r\r\n\r\r\nIt's a cute mom-...,5.802,3,3.0,21,2.0,5
5,-0fMBkX7QvWKQrtOp7H-GQ,3rqoxOasrRKxNubxjLSElA,The food was delicious and the service was ama...,12.067,3,3.5,15,2.0,4
6,-2EuoueswhqEERWezJY8gw,cInzGnaFZ3EIItvFXl1MvQ,My Girlfriend and I eat here occasionally and ...,14.353,3,3.5,13,2.0,4
7,-2Ig3GSBkj8JQT8eETmDPg,d-YNxMKL6ZhkiRhfUPxKHg,Very friendly family business. We had the pad...,3.156,3,3.0,6,2.0,3
8,-3WzrbWjnaKg2QWAsouy_g,jQST5lkLGX9L52-A10TGTQ,Yellow curry w/tofu is my favorite!,4.719,3,3.0,2,2.0,5
9,-45GJdo8Ye8A1AStuUZp9Q,-SNpLwJNup8N96yq7sBJyw,"Excellent food, reasonable price and great atm...",15.490,3,3.5,13,2.0,5
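Since the two files are re-merged on the composite key (`user_id`, `business_id`), it can be worth checking that the key is unique on both sides and that every review found a rating. The sketch below does this on hypothetical toy frames with the `validate` and `indicator` options of `pd.merge`; it is not part of the original kernel.

```python
import pandas as pd

# Hypothetical toy frames; the third review has no matching rating.
scores = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "business_id": ["b1", "b2", "b1"],
    "AFINNSentScore": [22, -1, 5],
})
ratings = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "business_id": ["b1", "b2"],
    "review_ratings": [5, 1],
})

merged = pd.merge(
    scores, ratings,
    how="left", on=["user_id", "business_id"],
    validate="one_to_one",  # raises MergeError if a key pair is duplicated
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Reviews that did not find a rating keep NaN in review_ratings.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```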


#### Save the Working Dataframe for Future Reference and for Recommender Activities

In [16]:
ThaiTextByUniqueUserBiz.to_pickle('ThaiTextByUniqueUserBizWithSentimentScoresRatings.pkl')

In [17]:
ThaiTextByUniqueUserBiz.to_csv('ThaiTextByUniqueUserBizWithSentimentScoresRatings.csv', encoding='utf-8')

### Preliminary Descriptive Analysis of Sentiment Scores and Ratings

In this section of the kernel, we first summarize the score and rating dataframe created in the previous section. We then add a count of values for each column, along with its median, to assess the distribution.

#### Draw up a Summary for the Score and Rating Working Dataframe

In [18]:
ThaiTextByUniqueUserBiz.describe()

Unnamed: 0,YRsentiment,YRSentScore,YRSentScore2,AFINNSentScore,AFINNSentScore2,review_ratings
count,10911.0,10911.0,10911.0,10911.0,10911.0,10911.0
mean,13.940278,3.009898,3.332692,13.196957,2.034002,3.869112
std,19.472534,0.142946,0.260143,11.963478,0.185616,1.175408
min,-233.857,1.0,1.0,-36.0,1.0,1.0
25%,4.8345,3.0,3.0,6.0,2.0,3.0
50%,12.261,3.0,3.5,11.0,2.0,4.0
75%,21.616,3.0,3.5,18.0,2.0,5.0
max,249.665,5.0,5.0,221.0,5.0,5.0


#### Observations 1: 

(1) The sentiment score results from the algorithm and the Yelp Restaurant Review Sentiment Lexicon range from -234 to 250.

(2) The sentiment score results from the algorithm and the two AFINN Lexicons combined range from -36 to 221.

#### Observations 2: 

(1) Like the review_ratings, all sets of converted sentiment scores range from 1 to 5, but the means differ: 2.0 for the AFINN scores on the 0.5-increment scale, 3.0 for the Yelp scores converted with no decimals, 3.3 for the Yelp scores on the 0.5-increment scale, and 3.9 for the review ratings.

#### Perform a Count of Yelp Review Sentiment Scores Scaled to 1-to-5 with No Decimals and Compute the Median

In [19]:
pd.value_counts(ThaiTextByUniqueUserBiz.YRSentScore.ravel())

3    10699
4      156
2       52
5        3
1        1
dtype: int64

In [20]:
ThaiTextByUniqueUserBiz.YRSentScore.median()

3.0

#### Observations 3:

(1) With the Yelp Restaurant Review Lexicon, 10,699 of the 10,911 scaled values fall on 3, the middle of the scale, which is also the median. A raw score of 0 maps to this same value; in other words, the median corresponds to an evenly balanced or neutral sentiment, which is neither positive nor negative.

(2) This leaves us with 156 + 3 Thai restaurant reviews in the Yelp dataset (given reviewers of Phoenix restaurants) for recommendation purposes that have sentiments that are either positive or very positive. A pertinent question might be how many of these 159 positive-sentiment elements are reviews for Thai restaurants in Las Vegas.
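As a quick check of the figures quoted in these observations, the sketch below recomputes the total, the median and the share of each scaled score from the counts printed in cell 19. It uses only the standard library; on the dataframe itself, the counts would come from `value_counts()` on the YRSentScore column.

```python
from statistics import median

# Counts of YRSentScore values as printed by cell 19.
counts = {1: 1, 2: 52, 3: 10699, 4: 156, 5: 3}

total = sum(counts.values())

# Expand the counts back into individual values to take the median.
values = [score for score, n in counts.items() for _ in range(n)]
middle = median(values)

# Share of reviews at each scaled score.
shares = {score: n / total for score, n in counts.items()}

print(total, middle, round(shares[3], 3))
print(counts[4] + counts[5])  # reviews scored above neutral
```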

#### Perform a Count of Yelp Review Sentiment Scores Scaled to 1-to-5 with 0.5 Increments and Compute the Median

In [21]:
pd.value_counts(ThaiTextByUniqueUserBiz.YRSentScore2.ravel())

3.5    6987
3.0    3712
4.0     147
2.5      48
4.5       9
2.0       4
5.0       3
1.0       1
dtype: int64

In [22]:
ThaiTextByUniqueUserBiz.YRSentScore2.median()

3.5

In [23]:
ThaiTextByUniqueUserBiz[ThaiTextByUniqueUserBiz['YRsentiment']==0]

Unnamed: 0,user_id,business_id,text,YRsentiment,YRSentScore,YRSentScore2,AFINNSentScore,AFINNSentScore2,review_ratings
10722,ywf8LhV2jCvbksPx6wc7aw,ApUCpJ9aa6yVgsde16gYrg,Closed,0,3,3,0,2,1


#### Observations 4:

(1) This version of the Yelp Review Sentiment Scores, scaled 1-to-5 with 0.5 increments, offers more potential for a recommender system, as 6,987 + 147 + 9 + 3 reviews fare positively. 3,712 reviews are neutral, whereas 48 + 4 + 1 reviews carry a negative sentiment.

(2) The distribution is skewed to positive, although not quite as dramatically as the review_ratings (see below).

(3) As there is no 1.5 score, we acknowledge that the one review with a score of 1 may be an outlier.

(4) Filtering the unscaled scores on 0 (the neutral value in that column) suggests that the neutral value is 3 on the scale without decimals and also 3 on the scale with 0.5 increments.

(5) The median of 3.5 differs from the neutral value of 3 on the 0.5-increment scale noted in item 4.

#### Perform a Count of AFINN Sentiment Scores Scaled to 1-to-5 with 0.5 Increments and Compute the Median

In [24]:
pd.value_counts(ThaiTextByUniqueUserBiz.AFINNSentScore2.ravel())

2.0    9573
2.5     952
1.5     328
3.0      54
3.5       2
1.0       1
5.0       1
dtype: int64

In [25]:
ThaiTextByUniqueUserBiz.AFINNSentScore2.median()

2.0

In [26]:
ThaiTextByUniqueUserBiz[ThaiTextByUniqueUserBiz['AFINNSentScore']==0]

Unnamed: 0,user_id,business_id,text,YRsentiment,YRSentScore,YRSentScore2,AFINNSentScore,AFINNSentScore2,review_ratings
172,08PXVzu6ysM93s-HaUPOIQ,90AXjqb4O-wrTHDKDoDUzg,"Very nicely appointed place, nicely done. The ...",3.434,3,3.0,0,2,3
192,0GYLQqgu7v_qMZGgflv5Hg,4xpACaa99_KFokYvNLXMBA,Spring roll was chewy and oily. Soup flavor pr...,-11.058,3,3.0,0,2,1
206,0ItqxKwFSTgkifR6Fv13dQ,shlAd7PLzWlQrkQ0uYcBBg,Been here a couple times and it continues to s...,1.855,3,3.0,0,2,4
233,0U6YfdxR2aw-_4vctk9eKA,2bdKR3l4o-S1CscLqqnvVw,I used to come here all the time for take out ...,1.156,3,3.0,0,2,1
242,0WPwr3Jr4a_Ywg_hZxm4KQ,1621ir5mjVgbHwxCbMAEjg,When you ask for HOT it means hot. Not take it...,0.166,3,3.0,0,2,2
328,0npnrzUAhaiOX2awq3dPUw,4nnMgD9X62YrMqkQKhx-Pg,Standard wok place. My experience was that the...,-1.213,3,3.0,0,2,2
351,0zcc8klD5N3kUvHnpA5ZlA,TqVjy0dxvNh51BF9KePCoQ,Below average Chinese food. Chow mein was very...,-0.873,3,3.0,0,2,2
390,1AM1mfGPGIlQi78p_OtnPQ,CVos739DJ06t8-dNiRMyeQ,It's been a while since we've been here. Must ...,2.051,3,3.0,0,2,3
396,1CpyQ1VkgmNwA6RxHoq2bA,KPoTixdjoJxSqRSEApSAGg,"I've been here several times, but have to admi...",1.654,3,3.0,0,2,3
410,1J3EW_WdsDoOeQv5j1Ws8w,17DI33J8TkcfzyoiIYLQIw,Ok,-1.151,3,3.0,0,2,3


#### Observations 5:

(1) The median is 2, not 3 as in the Yelp Review Sentiment Score cells above. The distribution is skewed to positive, although 9,573 out of 10,911 reviews are neutral in sentiment.

(2) As there are no 4.0 and 4.5 scores, we acknowledge that the one review with a score of 5 may be an outlier.

(3) There are more neutral reviews from the computations with the AFINN Lexicon than there are with the Yelp Restaurant Review Lexicon in a 0.5 incremental scale.

(4) Only 952 + 54 + 2 + 1 reviews raise a positive sentiment.

(5) Filtering the unscaled AFINN scores on 0 (the neutral value in that column) suggests that the corresponding neutral value on the 0.5-increment scale is 2. There are 252 reviews in a neutral position according to the unscaled AFINN scores, whereas only one review is in a neutral position according to the unscaled Yelp Restaurant Review Lexicon scores. That review also has the neutral value of 2 in the scaled AFINN column. Interestingly, the rating given by the user is "1", which is "strongly dislike". The only word in that review is "closed", which likely is either absent from the lexicons or rated as neutral.
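To see why a raw score of 0 lands on 3 for the Yelp lexicon but on 2 for AFINN, the scaling can be inverted to find which raw scores fall into a given 0.5-increment bin. The sketch below does this under the assumption that the ceil-based conversion of cells 11 and 13 is used; the function name is illustrative, and the minima and maxima are those reported by `describe()` above.

```python
def raw_interval_for_half_step(step, old_min, old_max, new_min=1.0, new_max=5.0):
    """Raw-score interval (low, high] that ceil-based scaling maps to `step`."""
    def to_raw(scaled):
        # Inverse of the min-max formula used in the conversion cells.
        return (scaled - new_min) * (old_max - old_min) / (new_max - new_min) + old_min
    return to_raw(step - 0.5), to_raw(step)

# Yelp lexicon: bin 3.0, with YRsentiment min/max from describe().
yelp_low, yelp_high = raw_interval_for_half_step(3.0, -233.857, 249.665)

# AFINN lexicon: bin 2.0, with AFINNSentScore min/max from describe().
afinn_low, afinn_high = raw_interval_for_half_step(2.0, -36.0, 221.0)

print(round(yelp_low, 3), round(yelp_high, 3))
print(round(afinn_low, 3), round(afinn_high, 3))
```

Under these assumptions, the Yelp 3.0 bin covers raw scores from about -52.5 up to 7.9, and the AFINN 2.0 bin covers raw scores from about -3.9 up to 28.25. Each bin labelled neutral therefore contains the raw score 0 together with many non-zero raw scores, which is consistent with the large counts observed for those bins.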

#### Perform a Count of Ratings Given by Individual Users and Compute the Median

In [27]:
pd.value_counts(ThaiTextByUniqueUserBiz.review_ratings.astype(int).ravel())

5    3949
4    3775
3    1586
2     918
1     683
dtype: int64

In [28]:
ThaiTextByUniqueUserBiz.review_ratings.astype(int).median()

4.0

#### Observations 6: 

(1) As the median is 4 (not 2 or 3 as with the Yelp and AFINN lexicons), the distribution is clearly skewed to "liking", with 3,949 + 3,775 reviews leaning to "liking", 1,586 reviews that are satisfactory, and 918 + 683 reviews leaning towards "disliking".

(2) Although Yelp did not use 0.5 increments (we cast the ratings to integer to prevent function-produced decimals), the ratings are spread well across the scale. This spread does not match the one produced by the sentiment detection methods above.

### CONCLUSIONS

The limited number of positively charged reviews under either lexicon may improve the odds of predicting top restaurants for a user. The very high number of neutral reviews under either lexicon is a little disappointing: we hoped to see a better range of sentiment scores. We anticipate that a normalizing step may be necessary before the scaling phase in this sentiment detection exercise for better recommendation results; otherwise, the high number of neutral reviews may remain when it is time to normalize the distribution in the recommender's set of activities.
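The kernel does not specify what the normalizing step would be. One possible reading, sketched below purely as an assumption, is to replace the summed score with the mean score per matched token, so that long reviews do not dominate the range that is later rescaled to 1-to-5. The function name and toy lexicon are illustrative; the lexicon entries are taken from the AFINN output above.

```python
def mean_token_score(tokens, lexicon):
    """Average lexicon score over the tokens found in the lexicon (0.0 if none)."""
    matched = [lexicon[token] for token in tokens if token in lexicon]
    return sum(matched) / len(matched) if matched else 0.0

# Toy lexicon with entries from the AFINN dictionary output above.
toy_lexicon = {"enjoy": 2, "hilarious": 2, "foul": -3}

short_review = ["enjoy"]
long_review = ["enjoy", "the", "hilarious", "staff", "enjoy", "enjoy"]

# Summed scores differ by review length (2 versus 8); mean scores do not.
print(mean_token_score(short_review, toy_lexicon))
print(mean_token_score(long_review, toy_lexicon))
```

Other choices, such as clipping extreme scores before scaling, would address the same problem differently; which one suits the recommender is left open here.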