**Summary**

This notebook deals exclusively with the acquisition, munging, and cleaning of Twitter data about Donald Trump from November 28th to 30th, 2015. If you want to see the final product, you can go [here](twitter_plot.html).

**Grabbing Tweets**

I used the Twitter API to grab tweets from November 28th to November 30th; the code can be seen in `streaming.py`. I essentially left it streaming overnight to collect enough tweets for analysis. After that, I created a list, read the `.txt` files (there were interruptions, so the tweets ended up in separate files), and appended the tweets to the list accordingly. All in all, I got around 285,000 tweets. Good enough.

In [3]:
import json
import os
import random
import re
import unicodedata
from dateutil import parser

import numpy as np
import pandas as pd
import unirest

import plotly.tools as tls
import plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *

import processing

In [12]:

tweets_data = []

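for i in range(1,6):
    data = 'data/data' + str(i) + '.txt'
    #the with block closes each file once it has been read
    with open(data, "r") as tweets_file:
        for line in tweets_file:
            try:
                tweet = json.loads(line)
                tweets_data.append(tweet)
            except ValueError:
                #skip lines that are not valid JSON
                continue

print(len(tweets_data))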

285270
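
Because the stream was interrupted a few times, the files may contain truncated lines, and a streaming endpoint can also emit messages that aren't tweets at all. Here is a minimal sketch of a loader that reports how much it skips, so the amount of dropped data is visible. `load_tweets` and the sample lines are hypothetical and not part of `streaming.py`; treating "has a `text` field" as the test for a tweet is an assumption.

```python
import json

def load_tweets(lines):
    """Parse line-delimited JSON, returning (tweets, skipped_count)."""
    tweets, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, not counted as errors
        try:
            record = json.loads(line)
        except ValueError:
            skipped += 1  # truncated or otherwise malformed JSON
            continue
        # assumption: a record only counts as a tweet if it carries "text"
        if isinstance(record, dict) and "text" in record:
            tweets.append(record)
        else:
            skipped += 1
    return tweets, skipped

# made-up sample lines for illustration
sample = [
    '{"text": "trump2016", "created_at": "Sat Nov 28 18:46:24 +0000 2015"}',
    '{"text": "cut off mid-wri',   # interrupted write
    '',                             # blank line
    '{"limit": {"track": 12}}',     # non-tweet stream message
]
tweets, skipped = load_tweets(sample)
print(len(tweets), skipped)
```

Filtering non-tweets at load time also keeps the later lists of texts and timestamps the same length.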


In [None]:
#get raw tweet text from twitter
data = []
for item in tweets_data:
    data.append(item.get("text"))

**Cleaning the Data**

The next step is to take the tweets and clean them up. I borrowed [code](http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/#motivation) that defines a function, `processTweet`, which strips retweet markers, mentions of other Twitter users, and various extra characters. Then I used a regular expression again to remove `rt`, `AT_USER`, and `URL`, which are the by-products of the `processTweet` function.

The reason I kept it in two stages is that I wanted to make sure the function worked correctly. I also pulled the timestamps out of the tweets so I can use them for plotting later.

Finally, I combined the tweets and the times into a `pandas` dataframe.
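
The actual `processing.processTweet` isn't shown in this notebook, so the following is only a rough sketch of what the two stages might look like. `process_tweet` and `strip_placeholders` are hypothetical stand-ins, the regular expressions in the first stage are assumptions, and the sample tweet is made up.

```python
import re

def process_tweet(tweet):
    """Hypothetical stand-in for processing.processTweet (stage one)."""
    tweet = tweet.lower()
    tweet = re.sub(r'(www\.\S+)|(https?://\S+)', 'URL', tweet)  # links -> URL
    tweet = re.sub(r'@\S+', 'AT_USER', tweet)                   # mentions -> AT_USER
    tweet = re.sub(r'#(\S+)', r'\1', tweet)                     # "#tag" -> "tag"
    return re.sub(r'\s+', ' ', tweet).strip()

def strip_placeholders(tweet):
    """Stage two: drop a leading 'rt' and the placeholder tokens."""
    tweet = re.sub(r'^rt|AT_USER|URL', '', tweet)
    return re.sub(r'\s+', ' ', tweet).strip()

raw = "RT @someone: #Trump2016 they love Donald! http://t.co/abc"
stage_one = process_tweet(raw)
stage_two = strip_placeholders(stage_one)
print(stage_one)
print(stage_two)
```

Keeping the stages separate makes it easy to inspect `stage_one` and confirm the placeholders landed where expected before they are removed. Note that `^rt` has no word boundary, so it would also clip the first two letters of a tweet that merely starts with "rt".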

In [None]:
#process the tweets to remove extra characters; the time of each tweet is
#collected in the same loop so the two lists stay aligned when a tweet is skipped
processed = []
times = []
for item, tweet in zip(tweets_data, data):
    try:
        processed.append(processing.processTweet(tweet))
    except Exception:
        continue
    times.append(item.get("created_at"))

#format the tweet to remove AT_USER and URL and format from unicode text
formatted = [re.sub("^rt|AT_USER|URL", "", unicodedata.normalize('NFKD', item).encode('ascii','ignore')) for item in processed]

In [None]:
#convert the lists to a dataframe
data = pd.DataFrame([times, formatted])
data = data.transpose()
data.columns = ['times','tweet']
data['tweet'] = data['tweet'].map(lambda x: str(x).strip())
data.to_csv('tweets.csv', index = False)

**Sentiment Analysis**

I used another [API](https://market.mashape.com/twinword/sentiment-analysis-free) that provides sentiment analysis for free. After extracting the tweets, I ran a loop that has the API score each tweet as `negative`, `neutral`, or `positive`.

I know that sentiment analysis has its flaws, such as being unable to detect sarcasm or recognize the tone of certain phrases (for instance, "not bad" can be construed as negative, even though it may actually be positive). However, this is good enough to get a general idea of the sentiment of our tweets (the idea, after all, is to build something).

After getting the sentiments of all the tweets, I joined the list with my data frame and saved it, since the scoring took a day or so (I ran into some more hiccups)!
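
To make the "not bad" problem concrete, here is a toy word-counting scorer. It is not how the Twinword API works internally (that isn't described here); the word lists and `naive_sentiment` are made up purely to show how a context-blind approach misreads negation.

```python
# toy word lists, invented for illustration only
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "hate", "terrible"}

def naive_sentiment(text):
    """Score by counting positive and negative words, ignoring context."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(naive_sentiment("they love donald"))
print(naive_sentiment("not bad"))
```

The second call comes out as `negative` because the scorer sees "bad" and has no notion that "not" flips its meaning.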
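
Since a run this long is likely to get interrupted, one option is to checkpoint results as they arrive so that a rerun picks up where the last one stopped. The sketch below is an assumption about how that could be done, not what this notebook actually ran: `score_all`, the checkpoint file, and `flaky_scorer` are all hypothetical, and in practice `score_fn` would wrap the API request.

```python
import json
import os
import tempfile

def score_all(texts, score_fn, checkpoint_path, retries=3):
    """Score texts in order, appending each result to a checkpoint file."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = [json.loads(line) for line in f if line.strip()]
    with open(checkpoint_path, "a") as f:
        for text in texts[len(done):]:  # skip what earlier runs already scored
            label = None
            for _ in range(retries):
                try:
                    label = score_fn(text)
                    break
                except IOError:
                    pass  # transient failure: try again
            done.append(label)  # None marks a tweet that never got scored
            f.write(json.dumps(label) + "\n")
            f.flush()
    return done

# fake scorer that fails on its second call, standing in for a network hiccup
calls = {"n": 0}
def flaky_scorer(text):
    calls["n"] += 1
    if calls["n"] == 2:
        raise IOError("simulated network hiccup")
    return "positive" if "love" in text else "neutral"

path = os.path.join(tempfile.mkdtemp(), "sentiment.jsonl")
first = score_all(["they love donald"], flaky_scorer, path)
second = score_all(["they love donald", "how trump comes up with ideas"], flaky_scorer, path)
print(second)
```

The second call does not rescore the first tweet; it reads the label back from the checkpoint and only sends the new text to the scorer.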

In [None]:
#create a list to store sentiment of each text, looping through the dataframe, and appending it on.
sentiment = []
for text in data['tweet']:
    response = unirest.post("https://twinword-sentiment-analysis.p.mashape.com/analyze/",
      headers={
        "X-Mashape-Key": os.environ.get('NLP_KEY'),
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/json"
      },
      params={
        "text": text
      }
    )
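    #items()[6] picks the sentiment (key, label) pair by position, so it depends on the key order of the response
    sentiment.append(response.body.items()[6])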

In [6]:
#some cleaning up of the final dataframe
withSentiment = data.join(pd.DataFrame(sentiment))
withSentiment = withSentiment.iloc[:,[0,1,3]]
withSentiment.columns = ['times','tweet','sentiment']
withSentiment.to_csv('withSentiment.csv', index = False)

#sneak peek at the resulting data.
withSentiment.head()

Unnamed: 0,times,tweet,sentiment
0,Sat Nov 28 18:46:24 +0000 2015,trump2016 they love donald!,positive
1,Sat Nov 28 18:46:25 +0000 2015,trump: we have to accept migrants here because...,negative
2,Sat Nov 28 18:46:25 +0000 2015,media accuse trump of mocking disabled reporte...,negative
3,Sat Nov 28 18:46:25 +0000 2015,how donald trump comes up with his ideas,neutral
4,Sat Nov 28 18:46:26 +0000 2015,me: i hate stamps donald trump: i hate stamps me:,negative


**Reshaping the Data**

Now that I have the data, I have to reshape it in a way that's meaningful for plotting. Knowing how each individual tweet is classified doesn't help much; what I'm interested in is how many tweets of each classification each time period had. To do this, I invoked the `pivot_table` method and reshaped the data to count the number of classifications for each second. After doing that, I used the `parse` function from `dateutil`'s `parser` module to convert the string text to the proper `datetime` format (converting "Sat Nov 28 18:46:24 +0000 2015" to "2015-11-28 18:46:24+00:00").
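
For readers who want to see the counting and parsing steps without `pandas` or `dateutil`, here is a small standard-library sketch of the same idea. It assumes Python 3 (for the `%z` directive in `strptime`), `parse_time` is a hypothetical helper, and the rows are taken from the sample output above.

```python
from collections import Counter
from datetime import datetime

rows = [
    ("Sat Nov 28 18:46:24 +0000 2015", "positive"),
    ("Sat Nov 28 18:46:25 +0000 2015", "negative"),
    ("Sat Nov 28 18:46:25 +0000 2015", "negative"),
    ("Sat Nov 28 18:46:25 +0000 2015", "neutral"),
]

def parse_time(raw):
    """Parse Twitter's created_at format into a timezone-aware datetime."""
    return datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")

# one count per (second, sentiment) pair, similar in spirit to a pivot with aggfunc=len
counts = Counter((parse_time(raw), label) for raw, label in rows)

for (when, label), n in sorted(counts.items()):
    print(when, label, n)
```

A pair that never occurs simply has no entry in the `Counter`, which corresponds to the cells that `fillna(0)` fills in the pivoted table.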

In [12]:
#pivot table to aggregate tweet sentiment
pivotedTable = withSentiment.pivot_table(index = 'times',  columns='sentiment', aggfunc=len)
pivotedTable = pivotedTable.fillna(0)
data = pivotedTable['tweet'] #extracting part of the dataframe so I don't have to deal with multi-level indexing as much

#create a custom function to parse times
def parseTimes(time):
    try:
        return parser.parse(time)
    except:
        return None

#list comprehension to parse times
listOfTimes = [parseTimes(time) for time in data.index]

After converting the data to the proper format, I realized that the time interval was too granular, so I used the `resample` method to aggregate the data into 5-minute intervals. I figured this is a good trade-off between granularity and cleanliness. I also removed any 5-minute periods that did not have any tweets so as not to skew the chart.
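
The bucketing itself is simple enough to sketch by hand. The following standard-library version (Python 3 assumed) floors each timestamp to the start of its 5-minute window and sums the counts per window. `floor_to_bucket` is a hypothetical helper, and the last input row is made up to show a second bucket.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def floor_to_bucket(when, minutes=5):
    """Round a datetime down to the start of its N-minute bucket."""
    return when - timedelta(minutes=when.minute % minutes,
                            seconds=when.second,
                            microseconds=when.microsecond)

# (timestamp, sentiment, count) rows as they might look after the per-second pivot
per_second = [
    (datetime(2015, 11, 28, 18, 46, 24, tzinfo=timezone.utc), "positive", 1),
    (datetime(2015, 11, 28, 18, 46, 25, tzinfo=timezone.utc), "negative", 2),
    (datetime(2015, 11, 28, 18, 46, 25, tzinfo=timezone.utc), "neutral", 1),
    (datetime(2015, 11, 28, 18, 51, 2, tzinfo=timezone.utc), "negative", 1),  # made-up row
]

buckets = defaultdict(lambda: {"negative": 0, "neutral": 0, "positive": 0})
for when, label, n in per_second:
    buckets[floor_to_bucket(when)][label] += n

for start in sorted(buckets):
    print(start, buckets[start])
```

Because a bucket is only created when a tweet falls into it, windows without tweets never show up here, which has the same effect as filtering out the empty 5-minute periods.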

In [13]:
#changing the index of dataframe to match datetime class
data = data.set_index(pd.DatetimeIndex(listOfTimes))

#using resampling method to aggregate tweets again, in 5 minute increment this time.
#(on pandas older than 0.18, use data.resample('5Min', how = 'sum') instead)
data = data.resample('5Min').sum()

#remove period of times where there are no tweets at all
data['sum'] = data[['negative','neutral','positive']].sum(axis=1)
data = data.loc[data['sum'] > 0, ['negative','neutral','positive']]
#timestamps have to be converted to strings before the labels can be shortened
data.index = [str(dates).replace(":00+00:00","").replace("2015-11","Nov") for dates in data.index]
data.to_csv('finaldata.csv')

#quick look at our cleaned dataset
data.head()

sentiment,negative,neutral,positive
2015-11-28 18:45:00+00:00,499,191,373
2015-11-28 18:50:00+00:00,700,372,508
2015-11-28 18:55:00+00:00,722,352,660
2015-11-28 19:00:00+00:00,641,335,520
2015-11-28 19:05:00+00:00,637,323,498


**Plotting**

After acquiring, munging, and cleaning the data, we can plot to see the results. Plotting is done using `Plot.ly`, which is super nifty. Since GitHub doesn't allow large files, I had to embed the code. Below is the code that I used to create the plot.

In [None]:
#Code to create the plot

#required to plot offline 
init_notebook_mode()

#loading in file
data = pd.read_csv('https://raw.githubusercontent.com/minh5/nlp-sentiment/master/csv/finaldata.csv', index_col=0)

#plotting
iplot({
    'data': [
        Scatter(x=data.index,
                y=data[col],
                name=col) for col in data.columns],
    'layout': Layout(title=('Sentiment Analysis of Donald Trumps Tweets from Nov 28 to 30th'), 
                     font=dict(family='Arial',
                               size=10),
                     yaxis=YAxis(title='Number of Tweets')),
}, show_link=True)

Then, using IPython's super [functions](http://blog.fperez.org/2012/09/blogging-with-ipython-notebook.html), I can embed HTML directly into the IPython Notebook. You can click on the plot to view the entire thing, see the mouse-over effects that make the data easier to read, and look at the source code.

In [9]:
%%html

<div>
    <a href="https://plot.ly/~minh5/9/" target="_blank" title="Sentiment Analysis of Donald Trumps Tweets from Nov 28 to 30th" style="display: block; text-align: center;"><img src="https://plot.ly/~minh5/9.png" alt="Sentiment Analysis of Donald Trumps Tweets from Nov 28 to 30th" style="max-width: 100%;width: 1080px;"  width="1080" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
    <script data-plotly="minh5:9"  src="https://plot.ly/embed.js" async></script>
</div>
