### Berkeley MIDS W251 Final Project

Developers: 
* [Dhaval Bhatt](https://github.com/dhavalbhatt)
* [James Gray](https://github.com/jamesgray007)
* [Tuhin Mahmud](https://github.com/tuhinmahmud)
 

In [21]:
# Import standard libraries
import json
import time
import re

# Import third-party libraries
import dask
import dask.bag as db
from boto.s3.connection import S3Connection # Python API to AWS; http://docs.pythonboto.org/en/latest/index.html
from boto.s3.key import Key
import bokeh # http://bokeh.pydata.org/en/latest/
#import pyspark # https://spark.apache.org/docs/0.9.0/python-programming-guide.html
import nltk
from nltk.corpus import stopwords

In [16]:
# Constants
AWS_ACCESS = ""
AWS_SECRET = ""
REDDITS3 = "blaze-data" # Continuum Analytics S3 bucket; Reddit data is stored as reddit/json/YYYY/RC_YYYY-MM.json
REDDIT_MONTH_KEY = 'reddit/json/2007/RC_2007-11.json'

### Questions and Insights 

This project analyzes Reddit posts to answer the following questions:

1. What is the month-over-month volume growth (trend) of the subreddit "r/datascience" from 2007 to 2015?
2. What are the top ten words in "r/datascience" for each year from 2007 to 2015?
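
As a minimal sketch of how the first question could be computed once per-month counts are available: the helper below turns a chronological list of counts into percent changes. The function name `month_over_month_growth` and the example counts are hypothetical and are not results from this project.

```python
def month_over_month_growth(monthly_counts):
    """Return (month, percent_change) pairs for consecutive months.

    monthly_counts: list of (month_label, count) tuples in chronological order.
    A month whose predecessor has a zero count gets None (growth is undefined).
    """
    growth = []
    for (_, prev_count), (month, count) in zip(monthly_counts, monthly_counts[1:]):
        if prev_count == 0:
            growth.append((month, None))
        else:
            growth.append((month, 100.0 * (count - prev_count) / prev_count))
    return growth


# Hypothetical counts, for illustration only (not measured data)
example_counts = [("2014-01", 200), ("2014-02", 250), ("2014-03", 225)]
example_growth = month_over_month_growth(example_counts)
# example_growth == [("2014-02", 25.0), ("2014-03", -10.0)]
```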

### Technology

This project explores two technologies for processing large datasets at scale on cloud platforms:

* Spark
* Dask

### Download one month of Reddit data from Amazon S3

Reddit JSON schema details (22 fields)


In [8]:
# use Boto to access S3
S3conn = S3Connection(AWS_ACCESS, AWS_SECRET)
mybucket = S3conn.get_bucket(REDDITS3)
for key in mybucket.list():
    print(key.name.encode('utf-8'))
    if key.key == REDDIT_MONTH_KEY:  # get one month of data
        key.get_contents_to_filename("reddit.json")
        print("downloaded json")


b'dogs-cats-img/'
b'dogs-cats-img/all.zip'
b'dogs-cats-img/images/'
b'dogs-cats-img/images/cat.0.jpg'
b'dogs-cats-img/images/cat.1.jpg'
b'dogs-cats-img/images/cat.10.jpg'
b'dogs-cats-img/images/cat.100.jpg'
b'dogs-cats-img/images/cat.1000.jpg'
b'dogs-cats-img/images/cat.10000.jpg'
b'dogs-cats-img/images/cat.10001.jpg'
b'dogs-cats-img/images/cat.10002.jpg'
b'dogs-cats-img/images/cat.10003.jpg'
b'dogs-cats-img/images/cat.10004.jpg'
b'dogs-cats-img/images/cat.10005.jpg'
b'dogs-cats-img/images/cat.10006.jpg'
b'dogs-cats-img/images/cat.10007.jpg'
b'dogs-cats-img/images/cat.10008.jpg'
b'dogs-cats-img/images/cat.10009.jpg'
b'dogs-cats-img/images/cat.1001.jpg'
b'dogs-cats-img/images/cat.10010.jpg'
b'dogs-cats-img/images/cat.10011.jpg'
b'dogs-cats-img/images/cat.10012.jpg'
b'dogs-cats-img/images/cat.10013.jpg'
b'dogs-cats-img/images/cat.10014.jpg'
b'dogs-cats-img/images/cat.10015.jpg'
b'dogs-cats-img/images/cat.10016.jpg'
b'dogs-cats-img/images/cat.10017.jpg'
b'dogs-cats-img/images/cat.10018.jp
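
The loop above walks the entire bucket listing just to find one object. A possible alternative, sketched here under the assumption that `bucket` behaves like a boto `Bucket` (whose `get_key` returns `None` for a missing object), is to request the key by name. The helper name `download_key` is hypothetical.

```python
def download_key(bucket, key_name, dest_path):
    """Fetch a single object by name instead of scanning the bucket listing.

    Returns True if the object was found and written to dest_path,
    False if the bucket has no key with that name.
    """
    key = bucket.get_key(key_name)  # boto returns None when the key is absent
    if key is None:
        return False
    key.get_contents_to_filename(dest_path)
    return True


# Intended usage with the objects defined in this notebook:
# download_key(mybucket, REDDIT_MONTH_KEY, "reddit.json")
```

This avoids printing thousands of unrelated key names and stops as soon as the one file is retrieved.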

### Computations Using Dask

In [22]:
# load JSON file into dask bag
# Note: later Dask releases replaced db.from_filenames with db.read_text
#data = db.from_filenames("reddit.json", chunkbytes=100000).map(json.loads)
data = db.from_filenames("reddit.json").map(json.loads)

In [25]:
data.take(10)

({'archived': True,
  'author': 'BraveSirRobin',
  'author_flair_css_class': None,
  'author_flair_text': None,
  'body': 'Some of the linux distros, as well as BSD, make this really easy. You don\'t need to tweak anything, it\'s "ready to compile". I don\'t bother with it myself.',
  'controversiality': 0,
  'created_utc': '1193875218',
  'distinguished': None,
  'downs': 0,
  'edited': False,
  'gilded': 0,
  'id': 'c02chew',
  'link_id': 't3_5zjl1',
  'name': 't1_c02chew',
  'parent_id': 't1_c02ch4f',
  'retrieved_on': 1427424835,
  'score': 2,
  'score_hidden': False,
  'subreddit': 'reddit.com',
  'subreddit_id': 't5_6',
  'ups': 2},
 {'archived': True,
  'author': 'kobes',
  'author_flair_css_class': None,
  'author_flair_text': None,
  'body': "Don't you think there's a line (sometimes blurry) between free speech and harassment?  Would it be ok for me to express my opinions through a megaphone outside your window at 3 in the morning?",
  'controversiality': 0,
  'created_utc': '
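
In the records above, `created_utc` is a string of epoch seconds rather than a number. A small sketch of how a record could be mapped to a year-month label, which the monthly trend question would need, follows. The helper name `comment_month` is hypothetical, and the abbreviated record reuses only field values shown in the output above.

```python
import json
from datetime import datetime, timezone

# Abbreviated record; the values are copied from the first record shown above
raw_line = '{"subreddit": "reddit.com", "created_utc": "1193875218", "score": 2, "id": "c02chew"}'


def comment_month(record):
    """Return the UTC year-month ("YYYY-MM") in which a record was created."""
    timestamp = int(record["created_utc"])  # stored as a string in this dataset
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")


record = json.loads(raw_line)
month = comment_month(record)  # "2007-11"
```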

In [19]:
start = time.time()
print("Monthly posts: " + str(data.count().compute()))
end = time.time()
executionTime = end - start
print("Computation time: " + str(executionTime) + " seconds")

Monthly posts: 372983
Computation time: 4.588881015777588 seconds


In [49]:
type(data)

dask.bag.core.Bag

### Get the most popular words for a subreddit

In [26]:
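# Build the stopword set once; calling stopwords.words() for every token is very slow
english_stopwords = set(stopwords.words('english'))
no_stopwords = lambda x: x not in english_stopwords
is_word = lambda x: re.search("^[0-9a-zA-Z]+$", x) is not None

# 'politics' is used here as the example subreddit
subreddit = data.filter(lambda x: x['subreddit'] == 'politics')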
bodies = subreddit.pluck('body')
words = bodies.map(nltk.word_tokenize).concat()
words2 = words.map(lambda x: x.lower())
words3 = words2.filter(no_stopwords)
words4 = words3.filter(is_word)
counts = words4.frequencies()

start_time = time.time()
values = counts.compute()
elapsed_time = time.time() - start_time
elapsed_time  # seconds

len(values)
sort = sorted(values, key=lambda x: x[1], reverse=True)
sort[:10]

[('deleted', 15765),
 ('would', 14706),
 ('people', 14683),
 ('gt', 12470),
 ('like', 11120),
 ('think', 9658),
 ('one', 9138),
 ('paul', 9065),
 ('get', 7574),
 ('http', 6920)]
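
The top entries `deleted`, `gt`, and `http` look like residue rather than vocabulary: most likely `[deleted]` placeholder bodies, the HTML entity `&gt;` used for quoting, and URLs. The sketch below shows one way to clean bodies before counting. It is a plain-Python illustration, not the Dask pipeline above: it swaps in a simple regex tokenizer for `nltk.word_tokenize`, and the names `clean_body`, `top_words`, and the sample data are all hypothetical.

```python
import html
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[0-9a-z]+")


def clean_body(body):
    """Drop placeholder bodies, decode HTML entities, and strip URLs."""
    if body.strip() == "[deleted]":
        return ""
    text = html.unescape(body)         # "&gt;" becomes ">"
    text = URL_PATTERN.sub(" ", text)  # remove links before tokenizing
    return text.lower()


def top_words(bodies, stop_words, n=10):
    """Count alphanumeric tokens that are not stopwords; return the n most common."""
    counts = Counter()
    for body in bodies:
        for word in WORD_PATTERN.findall(clean_body(body)):
            if word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)


# Made-up sample data, for illustration only
sample_bodies = [
    "[deleted]",
    "&gt; quoted text\nSee http://example.com for the data",
    "The data looks fine",
]
sample_stop_words = {"the", "for", "see"}
sample_top = top_words(sample_bodies, sample_stop_words, n=3)
```

The same `clean_body` step could be applied with `bodies.map(clean_body)` ahead of tokenization in the Dask version.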

### Computations Using Spark