## UC Berkeley School of Information | MIDS W251 Final Project
### The Rise of Donald Trump in Politics

Project Team:
* Dhaval Bhatt
* Tuhin Mahumad
* James Gray


## Mining Reddit to gain insight into the rise of Donald Trump

1. How many Reddit posts in the r/politics subreddit reference Donald Trump each year (2007-2015)? How does that compare to other candidates such as Hillary Clinton and Bernie Sanders?
2. How does the number of unique Reddit users posting about Donald Trump trend from 2007 to 2015? Is there an increase in general interest in him as a political candidate?
3. What are the top 20 keywords each year (2007-2015) in the Reddit posts that mention Donald Trump? These may give us insight into key themes or areas of interest.
4. How does the sentiment of the Donald Trump Reddit posts change over time?

## Initialize environment and SparkContext

In [1]:
# import Python libraries
import os
import re
import sys
from operator import add
import nltk
import pandas as pd
from nltk.corpus import stopwords

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.8.2.1-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

In [2]:
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('reddit-json')
sc = SparkContext(conf=conf)

## Load JSON data from HDFS

In [61]:
#lines = sc.textFile("/user/root/RC_2007-10.json")  # 1 GB
lines = sc.textFile("/user/root/RC_2015-08.json")  # 28 GB
#lines = sc.textFile("/user/root/*.json")  # 242 GB

In [77]:
lines

MapPartitionsRDD[47] at textFile at NativeMethodAccessorImpl.java:-2

In [78]:
import json
data = lines.map(json.loads)
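
A single malformed line makes `json.loads` raise and fails the whole Spark job. The sketch below shows one way to guard against that. `safe_loads` and the sample lines are hypothetical names introduced here for illustration, and the list comprehension stands in for the RDD so the example runs locally on Python 3.

```python
import json

def safe_loads(line):
    """Return the parsed record, or None if the line is not valid JSON."""
    try:
        return json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

# Local stand-in for the RDD. With Spark this would be roughly:
#   data = lines.map(safe_loads).filter(lambda r: r is not None)
sample_lines = ['{"subreddit": "politics", "body": "hello"}', '{"truncated": ']
records = [r for r in map(safe_loads, sample_lines) if r is not None]
print(len(records))
```

Dropping bad lines silently hides data problems, so counting the `None` results before filtering may be worthwhile.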

## Reduce data to the "r/politics" subreddit

In [79]:
politics = data.filter(lambda x: x['subreddit'] == 'politics')
# politics.persist()  # uncomment to cache the filtered RDD for reuse

In [80]:
%time print politics.count()

382764
CPU times: user 50 ms, sys: 12.9 ms, total: 62.9 ms
Wall time: 3min 37s


In [81]:
%time politics.take(1)

CPU times: user 11.1 ms, sys: 0 ns, total: 11.1 ms
Wall time: 8.44 s


[{u'author': u'psdsnutz',
  u'author_flair_css_class': None,
  u'author_flair_text': None,
  u'body': u"I get it. When politicians campaign making certain promises, and then it turns out they lied- we can't compare lies. Especially when both involve the tax code. \n\nGood job. Got it. ",
  u'controversiality': 0,
  u'created_utc': u'1438387202',
  u'distinguished': None,
  u'edited': False,
  u'gilded': 0,
  u'id': u'ctnf6hb',
  u'link_id': u't3_3f7knx',
  u'parent_id': u't1_ctmsga4',
  u'retrieved_on': 1440210815,
  u'score': 1,
  u'subreddit': u'politics',
  u'subreddit_id': u't5_2cneq',
  u'ups': 1}]
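
Questions 1 and 2 above both reduce to grouping comments by year, using the `created_utc` and `author` fields shown in this record. The following is a minimal local sketch, assuming `created_utc` is epoch seconds stored as a string (as in the sample). `year_of` and `yearly_stats` are hypothetical helpers, and the third sample record is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

def year_of(record):
    """UTC year in which a comment was created."""
    return datetime.fromtimestamp(int(record['created_utc']), tz=timezone.utc).year

def yearly_stats(records):
    """Map year -> (number of comments, number of distinct authors)."""
    counts = defaultdict(int)
    authors = defaultdict(set)
    for rec in records:
        year = year_of(rec)
        counts[year] += 1
        authors[year].add(rec['author'])
    return {year: (counts[year], len(authors[year])) for year in counts}

sample = [
    {'author': 'psdsnutz', 'created_utc': '1438387202'},
    {'author': 'sandmonkey', 'created_utc': '1193067884'},
    {'author': 'psdsnutz', 'created_utc': '1438387300'},  # hypothetical extra comment
]
print(yearly_stats(sample))
```

On the cluster the same grouping could be expressed as `rdd.map(lambda r: (year_of(r), 1)).reduceByKey(add)` for the counts. The version of `year_of` above relies on `datetime.timezone`, which is Python 3 only.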

## Query records in Subreddit "r/politics" AND "Donald Trump"

In [86]:
query = politics.filter(lambda d: 'trump' in d['body'].lower())
%time print "The number of subreddit posts with Trump: "  + str(query.count()) 

The number of subreddit posts with Trump: 23757
CPU times: user 127 ms, sys: 39.8 ms, total: 167 ms
Wall time: 10min 46s
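
The substring test `'trump' in body.lower()` also matches words such as "trumpet" or "trumped", which may inflate the count. A whole-word match is one possible refinement. `TRUMP_RE` and `mentions_trump` are hypothetical names for this sketch, not part of the notebook.

```python
import re

# \b anchors the match at word boundaries, so "trumpet" and "trumped" are skipped
TRUMP_RE = re.compile(r'\btrump\b', re.IGNORECASE)

def mentions_trump(body):
    """True if the comment body contains "trump" as a whole word."""
    return TRUMP_RE.search(body) is not None

print(mentions_trump("Donald Trump's face"))  # possessive still matches
print(mentions_trump("He trumped up the charges"))
```

It would plug into the filter as `politics.filter(lambda d: mentions_trump(d['body']))`. The trade-off is that plurals such as "Trumps" are no longer matched.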


In [53]:
query.take(2)

[{u'archived': True,
  u'author': u'sandmonkey',
  u'author_flair_css_class': None,
  u'author_flair_text': None,
  u'body': u"HOW THE FUCK, is Giuliani going to defend that clip of him dressed as a lady pressing Donald Trump's face in his fake boobies..??!!?!!?",
  u'controversiality': 0,
  u'created_utc': u'1193067884',
  u'distinguished': None,
  u'downs': 0,
  u'edited': False,
  u'gilded': 0,
  u'id': u'c02akis',
  u'link_id': u't3_5yuip',
  u'name': u't1_c02akis',
  u'parent_id': u't3_5yuip',
  u'retrieved_on': 1427425797,
  u'score': 2,
  u'score_hidden': False,
  u'subreddit': u'politics',
  u'subreddit_id': u't5_2cneq',
  u'ups': 2}]

In [87]:
N=10000
# take(N) returns only the first N matches; 'label' is a placeholder column filled in later
%time df = pd.DataFrame({'body':[x['body'] for x in query.take(N)],'label':0})


CPU times: user 290 ms, sys: 168 ms, total: 458 ms
Wall time: 3min 40s


In [278]:
df[:100]

In [90]:
df.to_csv("trump1000.csv",encoding='utf-8')

In [93]:
df1000=pd.DataFrame.from_csv("trump1000.csv")

In [279]:
df1000[:2]

| | body | label |
|---|---|---|
| 0 | Trump is not even remotely smart. Dumbass some... | 0 |
| 1 | Trump's final solution is to expel the bad Mex... | 0 |


In [238]:

    
import requests

def get_sentiment_score(mytext):
    """Return (sentiment type, score) for the target "trump", or (None, None) on failure."""
    apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxxxx"  # put your apiKey for alchemy here
    url = "http://gateway-a.watsonplatform.net/calls/text/TextGetTargetedSentiment"
    # pass the query via `params` so that requests URL-encodes the comment text
    params = {
        "apikey": apiKey,
        "text": mytext,
        "targets": "trump",
        "outputMode": "json",
    }
    try:
        response = requests.get(url, params=params).json()
    except (requests.RequestException, ValueError):
        # network failure or a reply that is not JSON
        return (None, None)
    if response and response.get('status') == 'OK':
        sentiment = response['results'][0]['sentiment']
        # 'score' may be absent from the reply, in which case it stays None
        return (sentiment['type'], sentiment.get('score'))
    # the API reported an error (details are in response['statusInfo'])
    return (None, None)


In [229]:
index=0
text=df1000['body'][index]
print text
(sentiment_type,score)=get_sentiment_score(text)
# assign through .loc so the write goes to df1000 itself, not to a copy
if sentiment_type == "positive":
    df1000.loc[index, 'label'] = 1
elif sentiment_type == "negative":
    df1000.loc[index, 'label'] = 0
else:
    df1000.loc[index, 'label'] = pd.np.nan

Trump is not even remotely smart. Dumbass somehow managed to bankrupt a *casino.*


In [205]:
df1000[:2]

| | body | label |
|---|---|---|
| 0 | Trump is not even remotely smart. Dumbass some... | 1 |
| 1 | Trump's final solution is to expel the bad Mex... | 0 |


In [206]:
df1000['body'][0]

'Trump is not even remotely smart. Dumbass somehow managed to bankrupt a *casino.*'

In [236]:
import time
import datetime
def get_time_stamped(nameStr):
    ts = time.time()
    #print ts
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    fileName=nameStr+st
    #print fileName
    return fileName
fileName=get_time_stamped("sentimentsNNN")+".csv"
print fileName

sentimentsNNN2016-04-05 14:01:07.csv
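
The generated name contains a space and colons, which some filesystems and shell commands handle poorly. A compact timestamp avoids that. The sketch below is a hypothetical variant, not the notebook's function: `get_time_stamped_safe` and its optional `now` argument are assumptions added so the result can be reproduced.

```python
from datetime import datetime

def get_time_stamped_safe(name, now=None):
    """Append a compact timestamp (no spaces or colons) to name."""
    now = now or datetime.now()
    return name + now.strftime('%Y%m%d-%H%M%S')

# Fixed time used here only to make the example deterministic
print(get_time_stamped_safe('sentimentsNNN', datetime(2016, 4, 5, 14, 1, 7)) + '.csv')
```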


In [280]:
N=500
startindex=500
# NOTE: this slice selects N-1 rows (startindex .. startindex+N-2)
texts=df1000['body'][startindex:startindex+N-1]
labels=[]  # sentiment label for each comment
data=[]
for text in texts:
    (sentiment_type,score)=get_sentiment_score(text)
    print sentiment_type,score,text
    print "-------------------------------------------------"
    if sentiment_type == "positive":
        label =1
    elif sentiment_type == "negative":
        label =0
    else:
        label = pd.np.nan
    labels.append(label)
    data.append((label,text))
df= pd.DataFrame(data)
fileName=get_time_stamped("sentimentsNNN")+".csv"
print fileName
df.to_csv(fileName,mode = 'w', index=False)

positive 0.849087 I find it so fascinating that so much of Reddit and political blogs and networks are focused on Trump.  There are de facto making him the front runner and person to beat.  Every other of the 20 GOP candidates must be loving the fact that no one is digging through their dirty laundry.  Hillary is perhaps the other candidate on the other side taking some heat for the email fiasco.  At a minimum the debate season is shaping up to be super fun.
-------------------------------------------------
negative -0.236867 LOL what does Bernie Sanders have to do with Trump's statements about Syrian christian refugees being denied entry into the U.S.? 
-------------------------------------------------
positive 0.747396 Yeah nobody is paying him to be the idiot. I think that's just a rational person trying to make sense of Trump's success. The big prediction is that he'll leave the GOP and run as an independent, which will divide the republican vote and put another democrat in office.
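
The if/elif chain that turns a sentiment type into a label appears in more than one cell. It could be factored into a small helper, sketched here under the assumption that anything other than "positive" or "negative" (including `None` from a failed call) should become NaN. `label_from_sentiment` is a hypothetical name.

```python
def label_from_sentiment(sentiment_type):
    """1 for 'positive', 0 for 'negative', NaN for anything else."""
    return {'positive': 1, 'negative': 0}.get(sentiment_type, float('nan'))

print(label_from_sentiment('positive'), label_from_sentiment('negative'))
```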

In [281]:
#print mydf.head()

df = pd.read_csv(fileName)
df.columns = ['label', 'text']
df.head()
print len(df)

499


In [282]:
import numpy as np
df= df[np.isfinite(df['label'])]

In [283]:
df

| | label | text |
|---|---|---|
| 0 | 1.0 | I find it so fascinating that so much of Reddi... |
| 1 | 0.0 | LOL what does Bernie Sanders have to do with T... |
| 2 | 1.0 | Yeah nobody is paying him to be the idiot. I t... |
| 3 | 1.0 | To the kind of people who would vote for Donal... |
| 5 | 1.0 | Those are Trump's calculations.\n\nOn a a diff... |
| 6 | 1.0 | I think it's a bit ridiculous to think that th... |
| 7 | 0.0 | So? What matters is how they got the money.\n\... |
| 9 | 0.0 | Donald Trump is certainly not dumb. Ignorant m... |
| 10 | 0.0 | Thats not what trump said. Trump said we shou... |
| 11 | 0.0 | I'm going to have to patently disagree with yo... |


## Example Code for sentiment analysis with NLTK 
http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/

In [256]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
 
def word_feats(words):
    return dict([(word, True) for word in words])

 
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
 
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
 
# integer division keeps the cutoffs usable as slice indices
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4
 
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
 
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()


train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
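
The two building blocks of the cell above are the bag-of-words presence features and the 3/4 train/test split. The sketch below isolates them without NLTK so they can be checked on their own. `split_three_quarters` is a hypothetical helper, and the token list and the 1000-item list are made-up stand-ins for one class of reviews.

```python
def word_feats(words):
    """Presence features: every distinct token maps to True."""
    return {word: True for word in words}

def split_three_quarters(items):
    """First 3/4 of items for training, the remainder for testing."""
    cutoff = len(items) * 3 // 4  # integer division, valid as a slice index
    return items[:cutoff], items[cutoff:]

feats = word_feats(['a', 'magnificent', 'film', 'a'])
train_part, held_out = split_three_quarters(list(range(1000)))
print(sorted(feats), len(train_part), len(held_out))
```

Because only the presence of a token is recorded, repeated words within one document carry no extra weight.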


## Create a corpus from text files (example)
http://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk

In [303]:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Let's create a corpus with 2 texts in different textfile.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1,txt2]

# Make new dir for the corpus.
corpusdir = '/root/newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory.
filename = 0
for text in corpus:
    filename+=1
    fullpathName=corpusdir+str(filename)+'.txt'
    print fullpathName
    with open(fullpathName,'w') as fout:
        print>>fout, text
        print text

# Check that the corpus directory exists and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)),corpus):
    assert open(corpusdir+infile,'r').read().strip() == text.strip()
    print text


# Create a new corpus by specifying the parameters
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
print "after PlaintextCorpusReader()"
print "-----------------------------"
print newcorpus
print newcorpus.fileids()

print "words of the corpus"
print newcorpus.words()
print "fieldids of the corpus"
print newcorpus.fileids()

# Access each file in the corpus.
#for infile in sorted(newcorpus.fileids()):
#   print infile # The fileids of each file.
#   with newcorpus.open(infile) as fin: # Opens the file.
#       print fin.read().strip() # Prints the content of the file
#print

# Access the plaintext; outputs pure string/basestring.
print "output from newcorpus.raw()"
print "-----------------------------"
print newcorpus.raw().strip()
print 

# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: PlaintextCorpusReader tokenizes automatically; by default it uses a
#       Punkt sentence tokenizer and WordPunctTokenizer for words.
#
# Each element in the outermost list is a paragraph, and
# Each paragraph contains sentence(s), and
# Each sentence contains token(s)
print newcorpus.paras()
print

# To access paragraphs of a specific fileid.
print newcorpus.paras(newcorpus.fileids()[0])

# Access sentences in the corpus. (list of list of strings)
# NOTE: the texts are flattened into sentences that contain tokens.
print newcorpus.sents()
print

# To access sentences of a specific fileid.
print newcorpus.sents(newcorpus.fileids()[0])

# Access just tokens/words in the corpus. (list of strings)
print newcorpus.words()

# To access tokens of a specific fileid.
print newcorpus.words(newcorpus.fileids()[0])

/root/newcorpus/1.txt
This is a foo bar sentence.
And this is the first txtfile in the corpus.
/root/newcorpus/2.txt
Are you a foo bar? Yes I am. Possibly, everyone is.

This is a foo bar sentence.
And this is the first txtfile in the corpus.
Are you a foo bar? Yes I am. Possibly, everyone is.

after PlaintextCorpusReader()
-----------------------------
<PlaintextCorpusReader in u'/root/newcorpus'>
['1.txt', '2.txt']
words of the corpus
[u'This', u'is', u'a', u'foo', u'bar', u'sentence', ...]
fieldids of the corpus
['1.txt', '2.txt']
output from newcorpus.raw()
-----------------------------
This is a foo bar sentence.
And this is the first txtfile in the corpus.
Are you a foo bar? Yes I am. Possibly, everyone is.

[[[u'This', u'is', u'a', u'foo', u'bar', u'sentence', u'.'], [u'And', u'this', u'is', u'the', u'first', u'txtfile', u'in', u'the', u'corpus', u'.']], [[u'Are', u'you', u'a', u'foo', u'bar', u'?'], [u'Yes', u'I', u'am', u'.'], [u'Possibly', u',', u'everyone', u'is', u'.']]]

[

In [289]:
fname="/root/newcorpus/1.txt"
with open(fname) as f:
    content = f.readlines()

## NLTK Helper Functions for Data Cleansing

In [13]:
import re
english_stopwords = set(stopwords.words('english'))  # build the set once, not on every call
no_stopwords = lambda x: x not in english_stopwords
is_word = lambda x: re.search("^[0-9a-zA-Z]+$", x) is not None

### Determine the Top N keywords associated with subreddit posts that contain Donald Trump

In [None]:
#subreddit = data.filter(lambda x: x['subreddit'] == 'politics')
#bodies = subreddit.pluck('body')
#words = query.map(nltk.word_tokenize)
#words2 = words.map(lambda x: x.lower())
#words3 = words2.filter(no_stopwords)
#words4 = words3.filter(is_word)

#counts = words4.map(lambda word: (word,1)).reduceByKey(add)

#start_time = time.time()
#values = counts.collect()
#elapsed_time = time.time() - start_time
#print str(elapsed_time)
#print len(values)

# sort the keywords in descending order
#sort = sorted(values, key=lambda x: x[1], reverse=True)
#sort[:20]
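
The commented-out pipeline above amounts to: tokenize each body, lowercase, drop stopwords and non-words, count, and keep the top N. A local sketch of that logic follows. It swaps in a regular-expression tokenizer and a tiny made-up stopword list in place of NLTK, and `keywords`, `top_keywords`, and the sample bodies are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the notebook uses NLTK's English list instead
STOPWORDS = frozenset(['the', 'is', 'a', 'to', 'and', 'of', 'in', 'on', 'that', 'it'])
TOKEN_RE = re.compile(r'[0-9a-zA-Z]+')

def keywords(body):
    """Lowercased alphanumeric tokens of a comment body, minus stopwords."""
    return [w for w in TOKEN_RE.findall(body.lower()) if w not in STOPWORDS]

def top_keywords(bodies, n=20):
    """The n most frequent keywords across all bodies as (word, count) pairs."""
    counts = Counter()
    for body in bodies:
        counts.update(keywords(body))
    return counts.most_common(n)

# On the cluster the same idea would be roughly:
#   query.flatMap(lambda d: keywords(d['body'])).map(lambda w: (w, 1)) \
#        .reduceByKey(add).takeOrdered(20, key=lambda kv: -kv[1])
bodies = ["Trump is the front runner", "The debate: Trump and Sanders", "Sanders on the tax code"]
print(top_keywords(bodies, 2))
```

Note the use of `flatMap` in the Spark outline: each body yields many words, so a plain `map` would produce an RDD of lists instead of an RDD of words.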



## Distributed language processing with NLTK: Part of speech tagging

In [20]:
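def parse(record):
    # imported here so the import runs on the Spark workers where this function executes
    import nltk
    tokens = nltk.word_tokenize(record["body"])
    record["n_words"] = len(tokens)       # number of tokens in the comment
    record["pos"] = nltk.pos_tag(tokens)  # list of (token, part-of-speech tag) pairs
    return record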

In [21]:
#politics2 = politics.map(parse)
politics2 = query.map(parse)


In [23]:
politics2.take(1)[0]['body']

u"HOW THE FUCK, is Giuliani going to defend that clip of him dressed as a lady pressing Donald Trump's face in his fake boobies..??!!?!!?"
