### Berkeley MIDS W251 Final Project

Developers: 
* [Dhaval Bhatt](https://github.com/dhavalbhatt)
* [James Gray](https://github.com/jamesgray007)
* [Tuhin Mahmud](https://github.com/tuhinmahmud)
 

In [1]:
# Import standard libraries
import json
import time
import sys

# Import third-party libraries
import dask
import dask.bag as db
from boto.s3.connection import S3Connection # Python API to AWS; http://docs.pythonboto.org/en/latest/index.html
from boto.s3.key import Key
import bokeh # http://bokeh.pydata.org/en/latest/
#import pyspark # https://spark.apache.org/docs/0.9.0/python-programming-guide.html



In [11]:
# Constants
AWS_ACCESS = "XXXXXXXXXXXXX"
AWS_SECRET = "XXXXXXXXXXXXX"
REDDITS3 = "blaze-data" # Continuum Analytics S3 bucket; Reddit data is stored as reddit/json/YYYY/RC_YYYY-MM.json
REDDIT_MONTH_KEY = 'reddit/json/2007/RC_2007-11.json'

### Questions and Insights 

This project analyzes Reddit posts to answer the following questions:

1. What is the month-over-month volume growth (trend) of the subreddit "r/datascience" between 2007 and 2015?
2. What are the top ten words in "r/datascience" for each year between 2007 and 2015?
3. What is the sentiment about "r/datascience" for each year between 2007 and 2015?
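
As a rough illustration of question 1, month-over-month growth can be computed from a chronological list of monthly counts. This is a minimal sketch, not the project's implementation: the function name `month_over_month_growth` and the example counts are made up for illustration.

```python
def month_over_month_growth(monthly_counts):
    """Return [(month, pct_growth)] for every month after the first.

    monthly_counts: list of (month_label, count) pairs in chronological order.
    pct_growth is None when the previous month's count is zero.
    """
    growth = []
    for (_, prev), (month, curr) in zip(monthly_counts, monthly_counts[1:]):
        if prev == 0:
            growth.append((month, None))  # growth is undefined from a zero base
        else:
            growth.append((month, 100.0 * (curr - prev) / prev))
    return growth


# Made-up counts for illustration only (not results from the Reddit data)
example_counts = [("2015-01", 200), ("2015-02", 250), ("2015-03", 200)]
growth = month_over_month_growth(example_counts)
```

Returning `None` for a zero base keeps months with no prior activity from raising a division error, which matters for a subreddit that did not exist in the early years.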

### Technology

This project explores two technologies for processing large datasets at scale on cloud platforms:

* Spark
* Dask

### Download one month of Reddit data from Amazon S3

Reddit JSON schema details (22 fields)


In [3]:
# Use boto to download one month of data from S3 (skipped if already cached locally)
import os
if not os.path.exists("reddit.json"):
    S3conn = S3Connection(AWS_ACCESS, AWS_SECRET)
    mybucket = S3conn.get_bucket(REDDITS3)
    # Look the key up directly instead of listing every key in the bucket
    key = mybucket.get_key(REDDIT_MONTH_KEY)
    if key is not None:
        key.get_contents_to_filename("reddit.json")
        print "downloaded json"
    else:
        print "key not found: " + REDDIT_MONTH_KEY
else:
    print "reddit.json file already downloaded"


reddit.json file already downloaded


### Computations Using Dask

In [4]:
# load JSON file into dask bag
#data = db.from_filenames("reddit.json", chunkbytes=100000).map(json.loads)
data = db.from_filenames("reddit.json").map(json.loads)

In [5]:
data.take(1)

({u'archived': True,
  u'author': u'BraveSirRobin',
  u'author_flair_css_class': None,
  u'author_flair_text': None,
  u'body': u'Some of the linux distros, as well as BSD, make this really easy. You don\'t need to tweak anything, it\'s "ready to compile". I don\'t bother with it myself.',
  u'controversiality': 0,
  u'created_utc': u'1193875218',
  u'distinguished': None,
  u'downs': 0,
  u'edited': False,
  u'gilded': 0,
  u'id': u'c02chew',
  u'link_id': u't3_5zjl1',
  u'name': u't1_c02chew',
  u'parent_id': u't1_c02ch4f',
  u'retrieved_on': 1427424835,
  u'score': 2,
  u'score_hidden': False,
  u'subreddit': u'reddit.com',
  u'subreddit_id': u't5_6',
  u'ups': 2},)
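
The `created_utc` field in the record above is an epoch timestamp stored as a string, so it has to be converted before comments can be grouped by month. The sketch below shows one way to do that with the standard library only; the helper names `month_key` and `monthly_counts` are hypothetical, and it assumes one JSON record per line, as in the downloaded file.

```python
import json
import time
from collections import Counter


def month_key(created_utc):
    """Convert epoch seconds (string or int) to a 'YYYY-MM' label in UTC."""
    return time.strftime("%Y-%m", time.gmtime(int(created_utc)))


def monthly_counts(json_lines, subreddit):
    """Count comments per month for one subreddit from newline-delimited JSON."""
    counts = Counter()
    for line in json_lines:
        record = json.loads(line)
        if record["subreddit"] == subreddit:
            counts[month_key(record["created_utc"])] += 1
    return sorted(counts.items())  # chronological, since labels sort as text


# A cut-down version of the record shown above
sample = ['{"subreddit": "reddit.com", "created_utc": "1193875218"}']
counts = monthly_counts(sample, "reddit.com")
```

The sorted `(month, count)` pairs are the kind of input a growth calculation for question 1 would need.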

In [6]:
start = time.time()
print "Monthly posts: " + str(data.count().compute())
end = time.time()
executionTime = end - start
print "Computation time: " + str(executionTime) + " second"

Monthly posts: 372983
Computation time: 10.0100028515 second


In [8]:
type(data)

dask.bag.core.Bag
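
Question 2 (top ten words) can be prototyped on plain Python strings before scaling it out. The following is a sketch under assumptions: `top_words`, the tokenizing regex, and the optional stopword set are illustrative choices, not part of the original notebook. On the bag itself, `pluck`, `frequencies`, and `topk` look like the natural building blocks, though that is untested here.

```python
import re
from collections import Counter

# Lowercase words, keeping apostrophes so "don't" stays one token (an assumption)
WORD_RE = re.compile(r"[a-z']+")


def top_words(bodies, n=10, stopwords=frozenset()):
    """Return the n most common words across an iterable of comment bodies."""
    counts = Counter()
    for body in bodies:
        for word in WORD_RE.findall(body.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


# Tiny made-up input for illustration
example_bodies = ["Linux is easy", "linux distros"]
top = top_words(example_bodies, n=1)
```

Without a stopword list the result will be dominated by words such as "the" and "a", so a real run would pass one in.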

### Using AlchemyAPI for sentiment analysis of Reddit comments
http://www.alchemyapi.com/developers/getting-started-guide

In [9]:
!python --version
#prepare using alchemy api for sentiment analysis
!python /Users/tuhinm/berkeley/alchemyapi_python/alchemyapi.py XXXXXXXXXXXXX

sys.path.insert(1,'/Users/tuhinm/berkeley/alchemyapi_python')

# Create the AlchemyAPI object (json is already imported in the first cell)
from alchemyapi import AlchemyAPI
alchemyapi = AlchemyAPI()

def sentimentAnalysis(mytext, target):
    """Return (sentiment type, score) for target within mytext, or None on error."""
    response = alchemyapi.sentiment_targeted('text', mytext, target)
    if response['status'] == 'OK':
        # Uncomment to inspect the full response:
        #print(json.dumps(response, indent=4))
        # Named sentiment_type to avoid shadowing the built-in type()
        sentiment_type = response['docSentiment']['type']
        # 'score' can be missing from the response, in which case it stays None
        score = response['docSentiment'].get('score')
        return (sentiment_type, score)
    # Format the message so Python 2 print does not emit a tuple
    print('Error in targeted sentiment analysis call: %s' % response['statusInfo'])
    return None
 

Python 2.7.10 :: Anaconda 2.5.0 (x86_64)
Key: XXXXXXXXXXXXX was written to api_key.txt
You are now ready to start using AlchemyAPI. For an example, run: python example.py
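
`sentimentAnalysis` returns a `(type, score)` tuple, or `None` when the call fails, so answering question 3 needs a step that rolls many such results into one summary. The sketch below is one possible aggregation. `summarize_sentiment` is a hypothetical helper, and it assumes the score may arrive as a numeric string, which is why it is passed through `float()`.

```python
from collections import Counter


def summarize_sentiment(results):
    """Summarize (type, score) tuples; None entries (failed calls) are skipped.

    Returns (counts per sentiment type, mean score or None if no scores).
    """
    type_counts = Counter()
    scores = []
    for result in results:
        if result is None:
            continue
        sentiment_type, score = result
        type_counts[sentiment_type] += 1
        if score is not None:
            scores.append(float(score))  # assumption: score may be a string
    mean_score = sum(scores) / len(scores) if scores else None
    return dict(type_counts), mean_score


# Made-up results for illustration only
example = [("neutral", None), ("positive", "0.5"), ("negative", "-0.25"), None]
summary = summarize_sentiment(example)
```

Results without a score still count toward the type totals but are left out of the mean, so they do not pull it toward zero.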


In [10]:
#subreddit = data.filter(lambda x: x['subreddit'] == 'movies')
reviews = data
findstr = ' linux '  # space-padded to match 'linux' as a standalone word
for i, review in enumerate(reviews):
    reviewBody = review['body']
    if findstr not in reviewBody:
        continue
    print i, reviewBody
    print "###################################################"
    sentiment = sentimentAnalysis(reviewBody, findstr)
    print "Sentiment for word: %s on review#:%d is:%s" % (findstr, i, sentiment)
    print "###################################################"
    # i indexes all comments, not only matches, so the loop stops after the
    # first matching comment whose index exceeds 100
    if i > 100:
        break


0 Some of the linux distros, as well as BSD, make this really easy. You don't need to tweak anything, it's "ready to compile". I don't bother with it myself.
###################################################
Sentiment for word:  linux  on review#:0 is:(u'neutral', None)
###################################################
1383 &gt; That seemed to show an awareness of open source, and of its importance for the future of computing, that was surprising coming from such high-level – and usually tech-averse – politicians.

You should have started worrying at this point.

There's a dystopic internet future, here: today's young generation of Europe will grow up to take on the mantle of hyperstatist, central-planning governance -- and won't have the humility of this generation's leadership when it comes to technical matters.  "Oh," this raised-on-Windows, totally-coded-a-webpage-this-one-time generation will say, "open source?  Whatever, man-- linux had shitty games."
########################