# Sentiment Analysis on Yelp Reviews
Let's explore the Yelp reviews and perform a sentiment analysis:

1. Load reviews from the data file; they are in JSON format
2. Convert JSON records to Python tuples, one per row, extracting only what we need
3. Optionally look at the star ratings
4. Create a list of words (bag-of-words)
5. Load the sentiment dictionary file and convert it into a useful format
6. Assign sentiment values (pos and neg) to the words of each review
7. Aggregate over reviews and report the sentiment analysis

In [1]:
# %load pyspark_init_arc.py
#
# This configuration points at the cluster's Spark installation (see SPARK_HOME below)
#
import os, sys
# set OS environment variable
os.environ["SPARK_HOME"] = '/usr/hdp/2.4.2.0-258/spark'
# add Spark library to Python
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], 'python'))

# import package
import pyspark
from pyspark.context import SparkContext, SparkConf

import atexit
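def stop_my_spark():
    # `global` is required: without it, `del sc` makes `sc` a local name
    # and `sc.stop()` raises UnboundLocalError
    global sc
    sc.stop()
    del sc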

# Register exit    
atexit.register(stop_my_spark)

# Configure and start Spark ... but only once.
if 'sc' not in globals():
    conf = SparkConf()
    conf.setAppName('SentimentAnalysis') ## you may want to change this
    conf.setMaster('yarn-client')
    conf.set('spark.ui.port', '63340')
    sc = SparkContext(conf=conf)
    print "Launched Spark version %s with ID %s" % (sc.version, sc.applicationId)
    print "http://arc.insight.gsu.edu:8088/cluster/app/%s"% (sc.applicationId)


Launched Spark version 1.6.1 with ID application_1508160140652_0054
http://arc.insight.gsu.edu:8088/cluster/app/application_1508160140652_0054


In [None]:
print "http://arc.insight.gsu.edu:8088/cluster/app/%s"% (sc.applicationId)

In [None]:
%%sh
hdfs dfs -ls /data/yelp/review

In [None]:
DATADIR='/data/yelp/'

In [None]:
review_rdd = sc.textFile(os.path.join(DATADIR, 'review/review_ab.json.gz')).sample(False, 0.01, 42)

# First Glance at Reviews

In [None]:
review_rdd.first()

In [None]:
review_rdd.count()

In [None]:
# how many elements do we actually have?


In [None]:
text =  "Mr Hoagie is an institution. Walking in, it does seem like a throwback to 30 years ago, old fashioned menu board, booths out of the 70s, and a large selection of food. Their speciality is the Italian Hoagie, and it is voted the best in the area year after year. I usually order the burger, while the patties are obviously cooked from frozen, all of the other ingredients are very fresh. Overall, its a good alternative to Subway, which is down the road."

In [None]:
def text2words(text):
    import re
    def clean_text(text):
        return re.sub(r'[.;:,!\'"]', ' ', unicode(text).lower())
    return filter(lambda x: x!='', clean_text(text).split(' '))

In [None]:
text2words(text)[:10]

In [None]:
def json_review(s):
    import json
    r = json.loads(s.strip())
    return (r['business_id']+'^'+r['review_id'], r['text'])

In [None]:
js = u'{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "pGxyI2QhPpyn3RT61sbfXQ", "review_id": "rNExnQ7fMtXajA4Xx57BTg", "stars": 5, "date": "2016-03-23", "text": "This is my first time ever writing a review. But Martha is such a great doctor I thought I should share what I know.\\n\\nAa a scared pregnant teenager i met Martha for the first time 17 years old. 20 years later I am still a patient of Martha\'s. This is one of those rare doctors that I have met that actually care about her patients well-being. It\'s treacherous having to go to the doctor so often. When I walk in the office it\'s comfortable and I know I am in good hands. This makes all the difference and I am not kept waiting all day.\\n \\nI encourage every Woman to have a personal relationship with their doctor. Alternative for women is a great place to try. Once you go you will understand the importance of a excellent doctor and a great staff. I love Martha. Not only did she see me through the birth of my big healthy baby. She is still seeing me through all my health problems and making it alot easier on me in the process. Check her out.", "type": "review", "business_id": "se11kpNHxkw59O_7wwWhaQ"}'
#print js
import json
print json.loads(js).keys()
jsdict = json.loads(js)
for k in jsdict.keys():
    print k, jsdict[k]

In [None]:
# OK, let's create a `review_rdd` with only the elements we care about


In [None]:
# next thing create a word list `words_rdd`
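
One possible way to approach the two steps above is sketched below. It assumes that the word should come first in each pair so that it can later serve as the join key, and it uses `split()` rather than `split(' ')` so that newlines inside a review also separate words. The helper `review_words` and the sample record are illustrative, not part of the original notebook.

```python
import json
import re

def json_review(s):
    """Parse one JSON line into (business_id^review_id, text)."""
    r = json.loads(s.strip())
    return (r['business_id'] + '^' + r['review_id'], r['text'])

def review_words(rec):
    """Turn (key, text) into a list of (word, key) pairs (word first, for joining)."""
    key, text = rec
    cleaned = re.sub(r'[.;:,!\'"]', ' ', text.lower())
    return [(w, key) for w in cleaned.split() if w]

# Hypothetical usage on the cluster (not executed here):
#   review_rdd = review_rdd.map(json_review)
#   words_rdd = review_rdd.flatMap(review_words)

# Local check with a made-up record
sample = '{"business_id": "b1", "review_id": "r1", "text": "Great food, great staff!"}'
pairs = review_words(json_review(sample))
```

Using `flatMap` rather than `map` would yield one record per word instead of one list per review, which is the shape a later join on the word expects.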



# Sentiment Dictionary

In [None]:
%%sh
hdfs dfs -cat /data/yelp/SentiWordNet_3.0.0_20130122.txt | head -30

In [None]:
sentidict = sc.textFile(os.path.join(DATADIR, 'SentiWordNet_3.0.0_20130122.txt'))
print sentidict.count()
sentidict.take(20)

In [None]:
def split_senti_words(x):
    xl = x.split(' ')
    return map(lambda r: r.split('#')[0], xl)

In [None]:
split_senti_words(u'dorsal#2 abaxial#1')

In [None]:
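def proc_senti_recs(s):
    # SentiWordNet columns: POS tag, ID, PosScore, NegScore, SynsetTerms, Gloss
    try:
        sl = s.split('\t')
        tag = sl[0]   # part-of-speech tag (unused below)
        wid = sl[1]   # synset ID (unused below)
        pos = sl[2]   # positive score
        neg = sl[3]   # negative score
        wrds = split_senti_words(sl[4])
        return [(w, (float(pos), float(neg))) for w in wrds]
    except (IndexError, ValueError):
        # comment lines and malformed records yield no words
        return []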

In [None]:
# now let's create a dictionary that we can join to the word list `sdict_rdd`
# What are we going to do with the synonyms in the dictionary?
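
A minimal sketch of one possible answer: every synonym on a line gets that line's scores, and a word that appears on several lines (several senses) gets the average of its scores. Averaging is an assumption here; taking only the first sense or the maximum would be equally defensible. The helpers `add_with_count` and `mean_scores` and the sample lines are made up for illustration.

```python
def proc_senti_recs(s):
    """Parse one SentiWordNet line into [(word, (pos_score, neg_score)), ...]."""
    try:
        fields = s.split('\t')
        pos, neg = float(fields[2]), float(fields[3])
        return [(t.split('#')[0], (pos, neg)) for t in fields[4].split(' ')]
    except (IndexError, ValueError):
        return []  # comment or malformed line

def add_with_count(a, b):
    """Combine two (pos_sum, neg_sum, count) triples."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_scores(t):
    """Turn (pos_sum, neg_sum, count) into (mean_pos, mean_neg)."""
    return (t[0] / float(t[2]), t[1] / float(t[2]))

# Hypothetical usage on the cluster (not executed here):
#   sdict_rdd = (sentidict.flatMap(proc_senti_recs)
#                .mapValues(lambda v: (v[0], v[1], 1))
#                .reduceByKey(add_with_count)
#                .mapValues(mean_scores))

# Local check: made-up records in which "good" occurs in two synsets
lines = ['a\t001\t0.75\t0\tgood#1\tgloss text',
         'a\t002\t0.25\t0.5\tgood#2 well#3\tgloss text',
         '# comment line']
totals = {}
for line in lines:
    for word, (p, n) in proc_senti_recs(line):
        prev = totals.get(word, (0.0, 0.0, 0))
        totals[word] = add_with_count(prev, (p, n, 1))
sdict = dict((w, mean_scores(t)) for w, t in totals.items())
```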



# Apply Sentiment Analysis

In [None]:
# let's join and create a `jnt_rdd`
# Hint: the joining key is the first element in the tuple



jnt_rdd.take(20)

In [None]:
# OK, after joining add up the positive and negative scores, create `res_rdd`
# Hint: use reduceByKey


res_rdd.take(4)
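
The join and the reduction hinted at above could look roughly like the sketch below. It assumes `words_rdd` holds `(word, review_key)` pairs and `sdict_rdd` holds `(word, (pos, neg))` pairs, so that a PySpark `join` produces `(word, (review_key, (pos, neg)))`. The helpers `rekey` and `add_scores` and the sample records are hypothetical.

```python
def rekey(rec):
    """(word, (review_key, (pos, neg))) -> (review_key, (pos, neg))."""
    word, (review_key, scores) = rec
    return (review_key, scores)

def add_scores(a, b):
    """Element-wise sum of two (pos, neg) pairs."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical usage on the cluster (not executed here):
#   jnt_rdd = words_rdd.join(sdict_rdd)
#   res_rdd = jnt_rdd.map(rekey).reduceByKey(add_scores)

# Local check with made-up joined records
joined = [('good', ('b1^r1', (0.5, 0.0))),
          ('bad', ('b1^r1', (0.0, 0.75))),
          ('good', ('b1^r2', (0.5, 0.0)))]
totals = {}
for key, scores in map(rekey, joined):
    totals[key] = add_scores(totals.get(key, (0.0, 0.0)), scores)
```

Words without a dictionary entry drop out of an inner join, so under this approach they contribute nothing to a review's totals.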

# Summary Statistics

Using the shortcut formulas, which need only the count, the sum, and the sum of squares:

$$mean = \frac{1}{n} \sum_{i=1}^n x_i$$

$$var = \frac{1}{n-1}( \sum_{i=1}^n x_i^2 - \frac{ (\sum_{i=1}^n x_i)^2}{n})$$
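
As a quick sanity check of the shortcut formula, the sketch below computes the mean and sample variance from the running sums alone and compares the result with the direct two-pass definition. The function names and sample values are illustrative only.

```python
def shortcut_mean_var(xs):
    """Mean and sample variance from n, sum(x) and sum(x^2) only."""
    n = len(xs)
    s = float(sum(xs))
    ss = float(sum(x * x for x in xs))
    mean = s / n
    var = (ss - s * s / n) / (n - 1)
    return mean, var

def two_pass_var(xs):
    """Sample variance from the definition: sum of squared deviations / (n - 1)."""
    m = sum(xs) / float(len(xs))
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

xs = [1.0, 4.0, 5.0]
mean, var = shortcut_mean_var(xs)
direct = two_pass_var(xs)
```

The shortcut form suits `reduceByKey` because the three sums can be accumulated in a single pass. When the mean is large relative to the spread, however, the subtraction can lose floating-point precision.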


Suggestion (quoted from Stack Overflow): https://stackoverflow.com/questions/39981312/spark-rdd-how-to-calculate-statistics-most-efficiently


You can try reduceByKey. It's pretty straightforward if we only want to compute the min():

    rdd.reduceByKey(lambda x,y: min(x,y)).collect()
    #Out[84]: [('key3', 2.0), ('key2', 3.0), ('key1', 1.0)]

To calculate the mean, you'll first need to create (value, 1) tuples, which we use to calculate both the sum and the count in the reduceByKey operation. Lastly, we divide the sum by the count to arrive at the mean:

    meanRDD = (rdd
               .mapValues(lambda x: (x, 1))
               .reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
               .mapValues(lambda x: x[0]/x[1]))

    meanRDD.collect()
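    #Out[85]: [('key3', 5.5), ('key2', 5.0), ('key1', 3.3333333333333335)]

For the variance, you can use the formula (sumOfSquares/count) - (sum/count)^2. Note that this is the population variance (it divides by n), whereas the shortcut formula above divides by n-1. We translate it in the following way: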

    varRDD = (rdd
              .mapValues(lambda x: (1, x, x*x))
              .reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1], x[2]+y[2]))
              .mapValues(lambda x: (x[2]/x[0] - (x[1]/x[0])**2)))

    varRDD.collect()
    #Out[106]: [('key3', 12.25), ('key2', 4.0), ('key1', 2.8888888888888875)]

I used values of type double instead of int in the dummy data to accurately illustrate computing the average and variance:

    rdd = sc.parallelize([("key1", 1.0),
                          ("key3", 9.0),
                          ("key2", 3.0),
                          ("key1", 4.0),
                          ("key1", 5.0),
                          ("key3", 2.0),
                          ("key2", 7.0)])


In [None]:
# We need to create a dataset to produce N=number of samples, mean(pos), var(pos), mean(neg), var(neg)
# once we have that in place we can aggregate
# Note: we are interested in mean and var per business...


res2_rdd.take(4)

In [None]:
# Now aggregate: element-wise sum of the 5-tuples
res2_rdd.reduceByKey(lambda a,b: (a[0]+b[0], a[1]+b[1], a[2]+b[2], a[3]+b[3], a[4]+b[4])).take(4)
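
A sketch of how the per-business statistics might be built and finalized. It assumes the per-review records have the shape `('business_id^review_id', (pos, neg))`, matching the key built by `json_review`, and that each is mapped to the 5-tuple `(1, pos, pos^2, neg, neg^2)` before the element-wise sum. The names `to_stats` and `finalize` are hypothetical.

```python
def to_stats(rec):
    """('business^review', (pos, neg)) -> (business, (1, pos, pos^2, neg, neg^2))."""
    key, (pos, neg) = rec
    business_id = key.split('^', 1)[0]
    return (business_id, (1, pos, pos * pos, neg, neg * neg))

def finalize(t):
    """(n, sum_pos, sumsq_pos, sum_neg, sumsq_neg) -> (n, mean_pos, var_pos, mean_neg, var_neg)."""
    n, sp, spp, sn, snn = t

    def mean_var(s, ss):
        mean = s / float(n)
        # the sample variance needs at least two reviews
        var = (ss - s * s / float(n)) / (n - 1) if n > 1 else float('nan')
        return mean, var

    mean_pos, var_pos = mean_var(sp, spp)
    mean_neg, var_neg = mean_var(sn, snn)
    return (n, mean_pos, var_pos, mean_neg, var_neg)

# Hypothetical usage on the cluster (not executed here):
#   res2_rdd = res_rdd.map(to_stats)
#   stats_rdd = res2_rdd.reduceByKey(<element-wise sum>).mapValues(finalize)

# Local check: one review record, and the summed tuple of two made-up reviews
# with pos scores 0.5 and 1.0 and neg scores 0.5 and 0.0
stats = to_stats(('b1^r1', (0.5, 0.25)))
summary = finalize((2, 1.5, 1.25, 0.5, 0.25))
```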