# Sentiment Analysis on Yelp Reviews
Let's explore the Yelp reviews and perform a sentiment analysis:

1. Load reviews from the data file; they are in JSON format.
2. Convert the JSON records to Python tuples, one per row, extracting only what we need.
3. Optionally, look at the star ratings.
4. Create a list of words (bag-of-words) for each review.
5. Load the sentiment dictionary file and convert it into a useful format.
6. Assign sentiment values (positive and negative) to the words of each review.
7. Aggregate over reviews and report the sentiment analysis.
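
The steps above can be sketched in plain Python, without Spark, on a couple of made-up records. Everything in this sketch is hypothetical: the two review records, the tiny `toy_dict` lexicon, and the helper name `score_reviews`. The notebook itself works on RDDs and on the SentiWordNet file.

```python
import json
import re

# Hypothetical records in the same JSON-lines layout as the Yelp review file.
toy_lines = [
    '{"business_id": "b1", "review_id": "r1", "stars": 5, "text": "Great staff, great food!"}',
    '{"business_id": "b1", "review_id": "r2", "stars": 1, "text": "Terrible service."}',
]

# Hypothetical lexicon: word -> (positive score, negative score).
toy_dict = {"great": (0.75, 0.0), "terrible": (0.0, 0.625)}

def score_reviews(lines, lexicon):
    """Return {review key: (sum of positive scores, sum of negative scores)}."""
    scores = {}
    for line in lines:
        rec = json.loads(line)  # parse the record, keep only what we need
        key = rec["business_id"] + "^" + rec["review_id"]
        # Bag of words: lower-case, replace punctuation with spaces, split.
        words = re.sub(r"[.;:,!'\"]", " ", rec["text"].lower()).split()
        # Look up each word; words missing from the lexicon contribute nothing.
        pos = sum(lexicon.get(w, (0.0, 0.0))[0] for w in words)
        neg = sum(lexicon.get(w, (0.0, 0.0))[1] for w in words)
        scores[key] = (pos, neg)  # one aggregated pair per review
    return scores

toy_scores = score_reviews(toy_lines, toy_dict)
print(toy_scores)
```

The review key joins `business_id` and `review_id` with `^`, the same convention the notebook uses later, so that scores can be rolled up per business.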

In [3]:
# %load pyspark_init_arc.py
#
# This configuration targets the HDP Spark 2 installation referenced in SPARK_HOME below
#
import os, sys
# set OS environment variable
os.environ["SPARK_HOME"] = '/usr/hdp/2.6.3.0-235/spark2'
# add Spark library to Python
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], 'python'))

# import package
import pyspark
from pyspark.context import SparkContext, SparkConf

import atexit
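def stop_my_spark():
    # `sc` is a module-level name; declare it global so that the `del`
    # below does not turn it into an unbound local variable.
    global sc
    sc.stop()
    del sc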

# Register exit    
atexit.register(stop_my_spark)

# Configure and start Spark ... but only once.
if 'sc' not in globals():
    conf = SparkConf()
    conf.setAppName('MyFirstSpark') ## you may want to change this
    conf.setMaster('yarn-client')
    sc = SparkContext(conf=conf)
    print "Launched Spark version %s with ID %s" % (sc.version, sc.applicationId)
    print "http://arc.insight.gsu.edu:8088/cluster/app/%s"% (sc.applicationId)


Launched Spark version 2.2.0.2.6.3.0-235 with ID application_1514672663667_0448
http://arc.insight.gsu.edu:8088/cluster/app/application_1514672663667_0448


In [4]:
print "http://arc.insight.gsu.edu:8088/cluster/app/%s"% (sc.applicationId)

http://arc.insight.gsu.edu:8088/cluster/app/application_1508160140652_0005


In [4]:
%%sh
hdfs dfs -ls /data/yelp/review

ls: `/data/yelp/review': No such file or directory


In [7]:
DATADIR='/data/yelp/'

In [8]:
review_rdd = sc.textFile(os.path.join(DATADIR, 'review/review_ab.json.gz')).sample(False, 0.01, 42)

# First Glance at Reviews

In [9]:
review_rdd.first()

u'{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "pGxyI2QhPpyn3RT61sbfXQ", "review_id": "rNExnQ7fMtXajA4Xx57BTg", "stars": 5, "date": "2016-03-23", "text": "This is my first time ever writing a review. But Martha is such a great doctor I thought I should share what I know.\\n\\nAa a scared pregnant teenager i met Martha for the first time 17 years old. 20 years later I am still a patient of Martha\'s. This is one of those rare doctors that I have met that actually care about her patients well-being. It\'s treacherous having to go to the doctor so often. When I walk in the office it\'s comfortable and I know I am in good hands. This makes all the difference and I am not kept waiting all day.\\n \\nI encourage every Woman to have a personal relationship with their doctor. Alternative for women is a great place to try. Once you go you will understand the importance of a excellent doctor and a great staff. I love Martha. Not only did she see me through the birth of my big healt

In [None]:
review_rdd.count()

In [11]:
text =  "Mr Hoagie is an institution. Walking in, it does seem like a throwback to 30 years ago, old fashioned menu board, booths out of the 70s, and a large selection of food. Their speciality is the Italian Hoagie, and it is voted the best in the area year after year. I usually order the burger, while the patties are obviously cooked from frozen, all of the other ingredients are very fresh. Overall, its a good alternative to Subway, which is down the road."

In [10]:
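def text2words(text):
    import re
    def clean_text(text):
        # Lower-case the text and replace common punctuation with spaces.
        return re.sub(r'[.;:,!\'"]', ' ', unicode(text).lower())
    # Split on single spaces and drop the empty strings left by repeated spaces.
    return filter(lambda x: x!='', clean_text(text).split(' '))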

In [13]:
text2words(text)[:10]

[u'mr',
 u'hoagie',
 u'is',
 u'an',
 u'institution',
 u'walking',
 u'in',
 u'it',
 u'does',
 u'seem']
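
The character class in `text2words` covers only a few punctuation marks, and the split is on single spaces, so tokens can keep characters such as `?`, parentheses, or embedded newlines (the reviews contain `\n\n`). One possible alternative, sketched here under the hypothetical name `text2words_alt`, extracts runs of ASCII letters and digits instead. This is an assumption about what a cleaner tokenizer might look like, not the notebook's method, and it drops non-ASCII letters.

```python
import re

def text2words_alt(text):
    """Lower-case the text and return maximal runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Made-up sample with newlines, parentheses, and a question mark.
sample = "I know.\n\nIs it (really) good? Yes!"
tokens = text2words_alt(sample)
print(tokens)
```

Because the pattern matches the tokens themselves rather than the separators, no empty strings are produced and no extra filtering step is needed.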

In [86]:
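def json_review(s):
    import json
    r = json.loads(s.strip())
    # Key each review as business_id^review_id so that scores can later be
    # aggregated per review and then per business.
    return (r['business_id']+'^'+r['review_id'], r['text'])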

In [87]:
review_rdd.map(json_review).take(4)

[(u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg',
  u"This is my first time ever writing a review. But Martha is such a great doctor I thought I should share what I know.\n\nAa a scared pregnant teenager i met Martha for the first time 17 years old. 20 years later I am still a patient of Martha's. This is one of those rare doctors that I have met that actually care about her patients well-being. It's treacherous having to go to the doctor so often. When I walk in the office it's comfortable and I know I am in good hands. This makes all the difference and I am not kept waiting all day.\n \nI encourage every Woman to have a personal relationship with their doctor. Alternative for women is a great place to try. Once you go you will understand the importance of a excellent doctor and a great staff. I love Martha. Not only did she see me through the birth of my big healthy baby. She is still seeing me through all my health problems and making it alot easier on me in the process. Check her

In [88]:
##word_train_rdd = rtrain_rdd.flatMap(lambda r: [(r[0], w) for w in text2words(r[1])])
words_rdd = review_rdd.map(json_review).flatMap(lambda r: [(r[0], w) for w in text2words(r[1])])

In [89]:
words_rdd.take(10) ## .groupByKey().take(10)

[(u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'this'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'is'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'my'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'first'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'time'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'ever'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'writing'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'a'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'review'),
 (u'se11kpNHxkw59O_7wwWhaQ^rNExnQ7fMtXajA4Xx57BTg', u'but')]

# Sentiment Dictionary

In [24]:
%%sh
hdfs dfs -ls /data/yelp/

Found 7 items
-rw-r--r--   3 pmolnar hdfs     13590777 2017-10-19 14:41 /data/yelp/SentiWordNet_3.0.0_20130122.txt
drwxr-xr-x   - pmolnar hadoop          0 2017-02-18 10:50 /data/yelp/als_distance_mat.parquet
drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 12:12 /data/yelp/business
drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 12:14 /data/yelp/checkin
dr-xr-xr-x   - pmolnar hadoop          0 2017-01-15 13:02 /data/yelp/review
drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 12:23 /data/yelp/tip
drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 12:24 /data/yelp/user


In [26]:
sentidict = sc.textFile(os.path.join(DATADIR, 'SentiWordNet_3.0.0_20130122.txt'))
print sentidict.count()
sentidict.take(20)

117687


[u'# SentiWordNet v3.0.0 (1 June 2010)',
 u'# Copyright 2010 ISTI-CNR.',
 u'# All right reserved.',
 u'#',
 u'# SentiWordNet is distributed under the Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.',
 u'# http://creativecommons.org/licenses/by-sa/3.0/',
 u'#',
 u'# For any information about SentiWordNet:',
 u'# Web: http://sentiwordnet.isti.cnr.it',
 u'# -------',
 u'#',
 u'# Data format.',
 u'#',
 u'# SentiWordNet v3.0 is based on WordNet version 3.0.',
 u'# WordNet website: http://wordnet.princeton.edu/',
 u'#',
 u'# The pair (POS,ID) uniquely identifies a WordNet (3.0) synset.',
 u'# The values PosScore and NegScore are the positivity and negativity',
 u'# score assigned by SentiWordNet to the synset.',
 u'# The objectivity score can be calculated as:']

In [39]:
def split_senti_words(x):
    xl = x.split(' ')
    return map(lambda r: r.split('#')[0], xl)

In [40]:
split_senti_words(u'dorsal#2 abaxial#1')

[u'dorsal', u'abaxial']

In [59]:
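def proc_senti_recs(s):
    # Tab-separated fields: POS tag, synset ID, PosScore, NegScore, SynsetTerms, Gloss
    try:
        sl = s.split('\t')
        pos_tag = sl[0]   # part of speech (unused)
        wid = sl[1]       # synset ID (unused)
        pos = sl[2]       # positivity score
        neg = sl[3]       # negativity score
        wrds = split_senti_words(sl[4])
        return [(w, (float(pos), float(neg))) for w in wrds]
    except (IndexError, ValueError):
        # Skip malformed lines (too few fields or non-numeric scores).
        return []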

In [None]:
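# Illustrative record in the SentiWordNet tab-separated layout (the gloss is a placeholder)
proc_senti_recs(u'a\t00001740\t0.125\t0\table#1\texample gloss')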

In [60]:
sdict_rdd = sentidict.filter(lambda s: not s.startswith('#')).flatMap(proc_senti_recs)
sdict_rdd.take(4)

[(u'able', (0.125, 0.0)),
 (u'unable', (0.0, 0.75)),
 (u'dorsal', (0.0, 0.0)),
 (u'abaxial', (0.0, 0.0))]

# Apply Sentiment Analysis

In [90]:
jnt_rdd = words_rdd.map(lambda r: (r[1], r[0])).join(sdict_rdd)
jnt_rdd.take(20)

[(u'fawn', (u'hfpt_mEBm1ZLI2zrqfXwaA^5D0CLsIzQuL7wHWAQFHV3Q', (0.0, 0.0))),
 (u'fawn', (u'hfpt_mEBm1ZLI2zrqfXwaA^5D0CLsIzQuL7wHWAQFHV3Q', (0.0, 0.0))),
 (u'fawn', (u'hfpt_mEBm1ZLI2zrqfXwaA^5D0CLsIzQuL7wHWAQFHV3Q', (0.0, 0.0))),
 (u'fawn', (u'hfpt_mEBm1ZLI2zrqfXwaA^5D0CLsIzQuL7wHWAQFHV3Q', (0.125, 0.0))),
 (u'fawn', (u'hfpt_mEBm1ZLI2zrqfXwaA^5D0CLsIzQuL7wHWAQFHV3Q', (0.125, 0.125))),
 (u'wooden', (u'5GpvSL1tlAjpgdJKZ5eLpg^d9PnYF_koOYnSQl3sNf9cg', (0.25, 0.25))),
 (u'wooden', (u'5GpvSL1tlAjpgdJKZ5eLpg^d9PnYF_koOYnSQl3sNf9cg', (0.0, 0.0))),
 (u'wooden', (u'cjjZt2oOkk0F152RkQMfQw^vd0nB9PbB3n1nNyXVXAvyA', (0.25, 0.25))),
 (u'wooden', (u'cjjZt2oOkk0F152RkQMfQw^vd0nB9PbB3n1nNyXVXAvyA', (0.0, 0.0))),
 (u'wooden', (u'sxRI0je6hAR-MeBDxdyhug^P-MBU1hPKH9_Ga5-NqUHlA', (0.25, 0.25))),
 (u'wooden', (u'sxRI0je6hAR-MeBDxdyhug^P-MBU1hPKH9_Ga5-NqUHlA', (0.0, 0.0))),
 (u'wooden', (u'rS-GnEEtuzfaUn56dkNd1w^pbpgOOUgB9gb-LNanzKdIQ', (0.25, 0.25))),
 (u'wooden', (u'rS-GnEEtuzfaUn56dkNd1w^pbpgOOUgB9gb-LNanzKdI
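
The join output above lists `fawn` several times for the same review with different scores, most likely because SentiWordNet holds one row per synset, so a word with several senses matches several dictionary entries. Each of those rows is later added into the review total. One possible way to weight every word once, sketched below in plain Python, is to average the scores over all senses before joining. The helper name `average_senses` and the choice of averaging are assumptions, not what the notebook does.

```python
from collections import defaultdict

def average_senses(pairs):
    """Collapse (word, (pos, neg)) pairs into one averaged (pos, neg) per word."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # word -> [sum_pos, sum_neg, count]
    for word, (pos, neg) in pairs:
        t = totals[word]
        t[0] += pos
        t[1] += neg
        t[2] += 1
    return {w: (t[0] / t[2], t[1] / t[2]) for w, t in totals.items()}

# Scores shaped like the five `fawn` rows in the join output.
senses = [("fawn", (0.0, 0.0)), ("fawn", (0.0, 0.0)), ("fawn", (0.0, 0.0)),
          ("fawn", (0.125, 0.0)), ("fawn", (0.125, 0.125))]
avg = average_senses(senses)
print(avg)
```

On the RDD, the same aggregation could be expressed with `mapValues` and `reduceByKey` on `sdict_rdd` before the join.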

In [91]:
# (u'fawn', (u'hfpt_mEBm1ZLI2zrqfXwaA', (0.0, 0.0)))
#jnt_rdd.map(lambda r: (r[1][0], r[1][1][0], r[1][1][1])).
res_rdd = jnt_rdd.map(lambda r: r[1]).reduceByKey(lambda a, b: (a[0]+b[0], a[1]+b[1]))
res_rdd.take(4)

[(u'D_v-FRvDS23YFAo3dvPDnw^9xKxie6NzYYKYZdJ2_DyGA', (41.375, 7.625)),
 (u'URAkdLzbFwbhvp087RlvzA^IBO0tXcEV_6N-hMD3EJRuA', (4.0, 5.125)),
 (u'2OY8xs4aqOt8eTnYokdrww^9MneBUC7WZ6W5iT9Ns4cTQ', (12.375, 5.875)),
 (u'0uq7x1_pxU1hajaIDhv3MQ^oWQ6NUpC0KmwiHD3uDPnTg', (1.75, 2.125))]

# Summary Statistics

Using the shortcut formulas for the mean and the sample variance:

$$\mathrm{mean} = \frac{1}{n} \sum_{i=1}^n x_i$$

$$\mathrm{var} = \frac{1}{n-1}\left( \sum_{i=1}^n x_i^2 - \frac{ \left(\sum_{i=1}^n x_i\right)^2}{n}\right)$$
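
As a quick check of the two formulas, the sketch below computes the mean and the sample variance from the running sums $n$, $\sum x_i$, and $\sum x_i^2$, and compares the variance with the usual two-pass definition. The function names and the sample values are made up for illustration.

```python
def shortcut_mean_var(xs):
    """Mean and sample variance from n, sum(x), and sum(x^2) only."""
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    mean = s / float(n)
    var = (ss - s * s / float(n)) / (n - 1)
    return mean, var

def two_pass_var(xs):
    """Sample variance from squared deviations around the mean."""
    m = sum(xs) / float(len(xs))
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [1.0, 4.0, 5.0]
mean, var = shortcut_mean_var(data)
print("mean=%s var=%s two-pass var=%s" % (mean, var, two_pass_var(data)))
```

The shortcut form is convenient with `reduceByKey` because the three sums can be accumulated in a single pass. It can lose precision through cancellation when the mean is large relative to the spread of the values.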


Suggestion from Stack Overflow: https://stackoverflow.com/questions/39981312/spark-rdd-how-to-calculate-statistics-most-efficiently


You can try reduceByKey. It's pretty straightforward if we only want to compute the min():

    rdd.reduceByKey(lambda x,y: min(x,y)).collect()
    # Out[84]: [('key3', 2.0), ('key2', 3.0), ('key1', 1.0)]

To calculate the mean, you'll first need to create (value, 1) tuples, which we use to calculate both the sum and the count in the reduceByKey operation. Lastly, we divide the sum by the count to arrive at the mean:

    meanRDD = (rdd
               .mapValues(lambda x: (x, 1))
               .reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
               .mapValues(lambda x: x[0]/x[1]))

    meanRDD.collect()
    # Out[85]: [('key3', 5.5), ('key2', 5.0), ('key1', 3.3333333333333335)]

For the variance, you can use the formula (sumOfSquares/count) - (sum/count)^2, which we translate in the following way. Note that this is the population variance (dividing by n), whereas the formula given earlier divides by n-1:

    varRDD = (rdd
              .mapValues(lambda x: (1, x, x*x))
              .reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1], x[2]+y[2]))
              .mapValues(lambda x: (x[2]/x[0] - (x[1]/x[0])**2)))

    varRDD.collect()
    # Out[106]: [('key3', 12.25), ('key2', 4.0), ('key1', 2.8888888888888875)]

I used values of type double instead of int in the dummy data to accurately illustrate computing the average and variance:

    rdd = sc.parallelize([("key1", 1.0),
                          ("key3", 9.0),
                          ("key2", 3.0),
                          ("key1", 4.0),
                          ("key1", 5.0),
                          ("key3", 2.0),
                          ("key2", 7.0)])


In [102]:
res2_rdd = res_rdd.map(lambda x: (x[0].split('^')[0], (1.0, x[1][0], x[1][1], x[1][0]*x[1][0], x[1][1]*x[1][1] )))
res2_rdd.take(4)

[(u'D_v-FRvDS23YFAo3dvPDnw', (1.0, 41.375, 7.625, 1711.890625, 58.140625)),
 (u'URAkdLzbFwbhvp087RlvzA', (1.0, 4.0, 5.125, 16.0, 26.265625)),
 (u'2OY8xs4aqOt8eTnYokdrww', (1.0, 12.375, 5.875, 153.140625, 34.515625)),
 (u'0uq7x1_pxU1hajaIDhv3MQ', (1.0, 1.75, 2.125, 3.0625, 4.515625))]

In [104]:
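# Sum each field with its counterpart: (n, sum_pos, sum_neg, sum_pos_sq, sum_neg_sq).
# The output recorded below came from a run that used a[3]+b[4] and a[1]+b[4],
# so the last two fields of rows with n > 1 should be regenerated.
res2_rdd.reduceByKey(lambda a,b: (a[0]+b[0], a[1]+b[1], a[2]+b[2], a[3]+b[3], a[4]+b[4])).take(4)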

[(u'8buIr1zBCO7OEcAQSZko7w',
  (8.0, 388.29200000000003, 211.208, 3851.01075, 629.2295)),
 (u'SfBHIShLosHKWWLFuDFUhQ', (2.0, 50.0, 22.25, 1620.90625, 90.390625)),
 (u'3P6IQFwMyH2FE4msc5WM7w', (1.0, 45.375, 29.125, 2058.890625, 848.265625)),
 (u'oPhx1YGHyLmtl6sfo3Vr8A',
  (1.0, 45.986000000000004, 22.764, 2114.7121960000004, 518.199696))]
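
A possible last step, sketched here in plain Python, turns each five-field tuple (review count, sum of positive scores, sum of negative scores, and the two sums of squares) into a mean and a sample variance per business, using the shortcut formulas from the start of this section. The function name `summarize` and the two-review example are hypothetical. The single-review tuple is taken from the `res2_rdd` output above.

```python
def summarize(agg):
    """agg = (n, sum_pos, sum_neg, sum_pos_sq, sum_neg_sq) -> (mean_pos, var_pos, mean_neg, var_neg)."""
    n, s_pos, s_neg, ss_pos, ss_neg = agg
    mean_pos = s_pos / n
    mean_neg = s_neg / n
    if n > 1:
        var_pos = (ss_pos - s_pos * s_pos / n) / (n - 1)
        var_neg = (ss_neg - s_neg * s_neg / n) / (n - 1)
    else:
        # The sample variance is undefined for a single review.
        var_pos = var_neg = None
    return (mean_pos, var_pos, mean_neg, var_neg)

# Hypothetical business with two reviews scoring (pos, neg) = (2.0, 1.0) and (4.0, 3.0).
two_reviews = summarize((2.0, 6.0, 4.0, 20.0, 10.0))
# Business with a single review, as in the res2_rdd output.
one_review = summarize((1.0, 41.375, 7.625, 1711.890625, 58.140625))
print(two_reviews)
print(one_review)
```

With Spark, the same function could be applied to the reduced RDD through `mapValues(summarize)`.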