# Assignment 5

__Table of contents__

1. [Module 7 walkthrough](#Module-7-walkthrough)
1. [Module 8 walkthrough](#Module-8-walkthrough)
1. [Assignment 5](#assignment)
    1. [Acquire tweets](#Acquire-tweets)
    1. [Load tweets](#Load-tweets)
    1. [HTML Parser](#HTML-Parser)
    1. [Remove username, URL](#Remove-username-URL)
    1. [Remove punctuation](#Remove-punctuation)
    1. [Remove apostrophes](#Remove-apostrophes)
    1. [Word pattern formatting](#Word-pattern-formatting)
    1. [Remove hashtags](#Remove-hashtags)
    1. [Polarity analysis](#Polarity-analysis)

In [1]:
import os
import sys
import jsonpickle
import json
import tweepy
import html.parser as HTMLParser
import re

import nltk
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import stopwords


modulePath = os.path.abspath(os.path.join('../../..'))
if modulePath not in sys.path:
    sys.path.append(modulePath)
import config

# Standard tweepy API setup

auth = tweepy.OAuthHandler(config.apiKey, config.apiSec)
auth.set_access_token(config.accessToken, config.accessSec)

api = tweepy.API(auth)

# Application authentication tweepy setup
# Use application-only authentication for higher Twitter API rate limit
# Twitter API returns a max of 100 tweets per query
# Allows for 450 queries every 15 minutes
# So we can gather 45,000 tweets every 15 minutes

#Switching to application authentication
auth = tweepy.AppAuthHandler(config.apiKey, config.apiSec)

#Setting up new api wrapper, using authentication only
api = tweepy.API(auth, wait_on_rate_limit = True
                 ,wait_on_rate_limit_notify = True)
 
# View rate limit status

api.rate_limit_status()['resources']['search']


[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\petersont\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


AttributeError: module 'config' has no attribute 'apiKey'

<a id = 'Module-7-walkthrough'></a>

# Module 7 walkthrough

In [4]:
# Unescape HTML entities (e.g. '&amp;' becomes '&').
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it.
import html

tweet = "@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like http://lifeisgreat.com ."
parsedTweet = html.unescape(tweet)
print(parsedTweet)


@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like http://lifeisgreat.com .


  


In [3]:
# Remove URLs: 'http' followed by any run of non-whitespace characters

urlPattern = re.compile(r'http\S+')
tweet_v1 = re.sub(urlPattern, '', parsedTweet)
print(tweet_v1)


@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like  .


In [4]:
# Remove usernames: '@' followed by any run of non-whitespace characters

usernamePattern = re.compile(r'@\S+')
tweet_v2 = re.sub(usernamePattern, '', tweet_v1)
print(tweet_v2)


 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like  .


In [5]:
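# Collapse an elongated 'so' (e.g. 'sooooo') to 'so'.
# Note: this pattern is not anchored to word boundaries, so it also turns 'soon' into 'son'.
wordPattern = re.compile(r's[o]+')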
tweet_v3 = re.sub(wordPattern, 'so', tweet_v2)
print(tweet_v3)


 Life is great & I like it so much. It's whatis life. #life #great#like  .
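The substitutions above can be chained into a single helper. The following is a minimal sketch, not the walkthrough's own code: the name `clean_tweet` is hypothetical, the sample string is shortened, and two details are assumptions added here (word boundaries around the elongated "so" so that words like "soon" are left alone, and a final whitespace collapse).

```python
import html
import re

URL_PATTERN = re.compile(r'http\S+')
USERNAME_PATTERN = re.compile(r'@\S+')
# Assumption: word boundaries keep 'soon' from being rewritten to 'son'.
ELONGATED_SO_PATTERN = re.compile(r'\bso+\b')


def clean_tweet(text):
    """Hypothetical helper: apply the walkthrough's steps in order."""
    text = html.unescape(text)                   # '&amp;' -> '&'
    text = URL_PATTERN.sub('', text)             # drop links
    text = USERNAME_PATTERN.sub('', text)        # drop @mentions
    text = ELONGATED_SO_PATTERN.sub('so', text)  # 'sooooo' -> 'so'
    return re.sub(r'\s+', ' ', text).strip()     # tidy leftover spaces


sample = "@user_@34 Life is great &amp; I like it sooooooooo much. http://lifeisgreat.com ."
cleaned = clean_tweet(sample)
print(cleaned)
```

Compiling the patterns once at module level avoids rebuilding them for every tweet, which matters when the same helper is applied to a few hundred thousand strings.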


<a id = 'Module-8-walkthrough'></a>

# Module 8 walkthrough

In [8]:
# Look up the SentiWordNet scores for the first adjective ('a') sense of 'happy'

nltk.download('wordnet')

print('positive score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].pos_score()))
print('negative score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].neg_score()))
print('neutral score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].obj_score()))


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/petersontylerd/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
positive score for the word "happy": 0.875
negative score for the word "happy": 0.0
neutral score for the word "happy": 0.125


In [14]:
# Split a sentence into word tokens

nltk.download('punkt')

sentence = 'i am happy'
tokens = nltk.tokenize.word_tokenize(sentence)
print('Tokens: {0}'.format(tokens))
    

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/petersontylerd/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Tokens: ['i', 'am', 'happy']


In [17]:
# Tag each token with its part of speech

from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')

pos_tag(tokens)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/petersontylerd/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('i', 'NN'), ('am', 'VBP'), ('happy', 'JJ')]

NN - noun, singular
VBP - verb, non-3rd-person singular present
JJ - adjective

In [18]:
# Remove English stopwords from the tokenized sentence

stop = stopwords.words('english')
sentence = 'i am happy'
newSentence = []
for word in tokens:
    if word not in stop:
        newSentence.append(word)

print('The sentence has been reduced from \'{0}\' \n to \'{1}\''.format(sentence, newSentence))


The sentence has been reduced from 'i am happy' 
 to '['happy']'


<a id = 'assignment'></a>

# Assignment 5

* Clean the tweets that you extracted in the previous chapter. Apply the rules above, plus the following:
    * Remove punctuation. Punctuation often carries no weight, so it can be removed. Try writing a regular expression to remove commas (,) from sentences. Do not remove question marks ("?") or exclamation marks ("!"), as they affect the meaning of a sentence.
    * Remove apostrophes and expand the words. For example, in the sentence "It's a great time to code!" the first word, "It's", can be expanded to "it is". You can do this with regular expressions.
    * Create a list of word patterns for word formatting. For example, 'gud' should be substituted with 'good'.

* Calculate the polarity of a sentence, then write a program to calculate the polarity of all the tweets that you extracted and preprocessed in the previous questions. Your program should also include the following features:

    * Tweets have hashtags. Remove the hashtags and then find the polarity of each tweet.

    * There might be words that are not present in the SentiWordNet lexicon.
    * The program should handle these cases by giving a zero score for such words.
    * Depending on the question, file uploads or screenshots are necessary to show your work.

<a id = 'Acquire-tweets'></a>

## Acquire tweets

In [None]:
# Find up to 500,000 tweets from the last week containing the word 'trump'.
# Store them in a JSON file, one tweet per line

maxTweets = 500000
tweetCount = 0
with open('trumpTweets.json','w') as f:
    for tweet in tweepy.Cursor(api.search, q = 'trump', tweet_mode = 'extended', lang = 'en').items(maxTweets):
        f.write(jsonpickle.encode(tweet._json, unpicklable = False) + '\n')
        tweetCount += 1
    print('Downloaded {0} tweets'.format(tweetCount))
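The rate-limit figures quoted in the setup cell (100 tweets per query, 450 queries per 15-minute window) give a rough lower bound on how long a download of this size takes. The sketch below only does that arithmetic; the constants are copied from those comments and the variable names are illustrative. Actual limits depend on the API version and access tier.

```python
import math

TWEETS_PER_QUERY = 100      # maximum tweets returned per search request
QUERIES_PER_WINDOW = 450    # app-auth search requests allowed per window
WINDOW_MINUTES = 15

max_tweets = 500000

queries_needed = math.ceil(max_tweets / TWEETS_PER_QUERY)
windows_needed = math.ceil(queries_needed / QUERIES_PER_WINDOW)
# The first window starts immediately; every later one costs a full wait.
min_wait_minutes = (windows_needed - 1) * WINDOW_MINUTES

print(queries_needed, windows_needed, min_wait_minutes)
```

Under these assumptions the request for 500,000 tweets needs 5,000 queries spread over 12 windows, so the cell will block on the rate limit for at least 165 minutes if that many matching tweets exist.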


<a id = 'Load-tweets'></a>

## Load tweets

In [2]:
# Load the downloaded tweets into memory (one JSON object per line)

data = []
with open('./trumpTweets.json', 'r') as jsonFile:
    for line in jsonFile:
        data.append(json.loads(line))
print('Total number of tweets loaded: {0}'.format(len(data)))


Total number of tweets loaded: 221072


In [3]:
# Unpack all tweets in data

tweets = []
for item in data:
    if 'full_text' in item.keys():
        tweet = item['full_text']
        tweets.append(tweet)
print('Total number of tweets extracted from json: {0}'.format(len(tweets)))


Total number of tweets extracted from json: 221072


In [4]:
tweets[:5]

['RT @pettyasamug: Trump hates pics of his hair circulating on social media #Retweet ♻️\n\n#NotMyPresident https://t.co/4uvEKDaING',
 'RT @maggieNYT: Ted OLSON, who Trump praised in one of his Fla tweets and who Trump tried repeatedly to hire for his own personal legal team…',
 '@realDonaldTrump Stock goes up, credit to trump , goes down? Blame the Democrats. Got it 👌🏾',
 'This. The crisis is here; it can’t be avoided, only mitigated. When I look at Trump’s GOP &amp; their supporters, I see people knowingly and gleefully poisoning my grandchildren and my planet. This is why bipartisanship is BS. I won’t compromise with murderers. https://t.co/mEYs7IszjB',
 'RT @politico: Despite Democrats’ massive House gains — the party’s biggest since 1974, after Richard Nixon’s resignation — redistricting cl…']

In [5]:
# I only want to look at original tweets, not retweets

tweets = [x for x in tweets if not x.startswith('RT ')]


In [6]:
tweets[:5]

['@realDonaldTrump Stock goes up, credit to trump , goes down? Blame the Democrats. Got it 👌🏾',
 'This. The crisis is here; it can’t be avoided, only mitigated. When I look at Trump’s GOP &amp; their supporters, I see people knowingly and gleefully poisoning my grandchildren and my planet. This is why bipartisanship is BS. I won’t compromise with murderers. https://t.co/mEYs7IszjB',
 '@PrincessBravato Suburban white women who do not have a hard on for Trump like @lindseygraham',
 '@speakerRyan I am so weary of how often you lie about this. It is early but still not going to work. Trump’s Tax Cut Was Supposed to Change Corporate Behavior. Here’s What Happened. https://t.co/9X9LR1aXJ1',
 'Ummmmmmm...no. \n\nTrump claims he tried to salvage trip to French cemetery for U.S. troops - POLITICO https://t.co/G9cRGSv06s']

In [None]:
# Review first 3 tweets

for i in range(3):
    print(tweets[i])
    print('')

<a id = 'HTML-Parser'></a>

## HTML Parser

In [7]:
# Unescape HTML entities in every tweet (e.g. '&amp;' becomes '&')
import html

for ix, tweet in enumerate(tweets):
    parsedTweet = html.unescape(tweet)
    tweets[ix] = parsedTweet


In [8]:
tweets[:5]

['@realDonaldTrump Stock goes up, credit to trump , goes down? Blame the Democrats. Got it 👌🏾',
 'This. The crisis is here; it can’t be avoided, only mitigated. When I look at Trump’s GOP & their supporters, I see people knowingly and gleefully poisoning my grandchildren and my planet. This is why bipartisanship is BS. I won’t compromise with murderers. https://t.co/mEYs7IszjB',
 '@PrincessBravato Suburban white women who do not have a hard on for Trump like @lindseygraham',
 '@speakerRyan I am so weary of how often you lie about this. It is early but still not going to work. Trump’s Tax Cut Was Supposed to Change Corporate Behavior. Here’s What Happened. https://t.co/9X9LR1aXJ1',
 'Ummmmmmm...no. \n\nTrump claims he tried to salvage trip to French cemetery for U.S. troops - POLITICO https://t.co/G9cRGSv06s']

<a id = 'Remove-username-URL'></a>

## Remove username, URL

In [9]:
# Remove URLs and usernames

urlPattern = re.compile(r'(?:\@|https?\://)\S+')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(urlPattern, '', tweet)
    tweets[ix] = parsedTweet


In [10]:
tweets[:5]

[' Stock goes up, credit to trump , goes down? Blame the Democrats. Got it 👌🏾',
 'This. The crisis is here; it can’t be avoided, only mitigated. When I look at Trump’s GOP & their supporters, I see people knowingly and gleefully poisoning my grandchildren and my planet. This is why bipartisanship is BS. I won’t compromise with murderers. ',
 ' Suburban white women who do not have a hard on for Trump like ',
 ' I am so weary of how often you lie about this. It is early but still not going to work. Trump’s Tax Cut Was Supposed to Change Corporate Behavior. Here’s What Happened. ',
 'Ummmmmmm...no. \n\nTrump claims he tried to salvage trip to French cemetery for U.S. troops - POLITICO ']
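One side effect of matching `@` anywhere is that the tail of an email-like token is removed as well. The sketch below contrasts the pattern used in this section with a stricter variant; `strict_pattern` is an assumption introduced here for comparison, not part of the assignment, and the sample text is made up.

```python
import re

# Pattern from this section: '@' or a URL scheme, then non-whitespace.
loose_pattern = re.compile(r'(?:\@|https?\://)\S+')
# Assumed stricter variant: a mention must not be preceded by a word character.
strict_pattern = re.compile(r'(?<!\w)@\w+|https?://\S+')

text = 'mail name@example.com or @bob https://t.co/x'

loose = loose_pattern.sub('', text).strip()
strict = strict_pattern.sub('', text).strip()
print(repr(loose))
print(repr(strict))
```

For sentiment scoring the difference is usually harmless, since neither an email address nor a handle carries polarity, but the stricter form is safer if the cleaned text is reused elsewhere.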

<a id = 'Remove-punctuation'></a>

In [11]:
# Remove unnecessary white space, newlines and tabs

stripPattern = re.compile(r'\s+')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(stripPattern, ' ', tweet).strip()
    tweets[ix] = parsedTweet


In [12]:
tweets[:10]

['Stock goes up, credit to trump , goes down? Blame the Democrats. Got it 👌🏾',
 'This. The crisis is here; it can’t be avoided, only mitigated. When I look at Trump’s GOP & their supporters, I see people knowingly and gleefully poisoning my grandchildren and my planet. This is why bipartisanship is BS. I won’t compromise with murderers.',
 'Suburban white women who do not have a hard on for Trump like',
 'I am so weary of how often you lie about this. It is early but still not going to work. Trump’s Tax Cut Was Supposed to Change Corporate Behavior. Here’s What Happened.',
 'Ummmmmmm...no. Trump claims he tried to salvage trip to French cemetery for U.S. troops - POLITICO',
 'We couldn’t look more closely if we tried. Every single story MSM gives us proves they are united with us against Trump.',
 "Putin will send a Putin bear to Trump in his new prison surrounding's. And will make sure Ivan is his daddy I mean cellmate",
 'Donald Trump gets back to the United States, and someone expla

<a id = 'Remove-apostrophes'></a>

In [13]:
# Replace the typographic right single quote (U+2019), which tweets use as an apostrophe, with a standard apostrophe

apostropheFix = re.compile(u'\u2019')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(apostropheFix, "'", tweet)
    tweets[ix] = parsedTweet


In [14]:
tweets[:10]

['Stock goes up, credit to trump , goes down? Blame the Democrats. Got it 👌🏾',
 "This. The crisis is here; it can't be avoided, only mitigated. When I look at Trump's GOP & their supporters, I see people knowingly and gleefully poisoning my grandchildren and my planet. This is why bipartisanship is BS. I won't compromise with murderers.",
 'Suburban white women who do not have a hard on for Trump like',
 "I am so weary of how often you lie about this. It is early but still not going to work. Trump's Tax Cut Was Supposed to Change Corporate Behavior. Here's What Happened.",
 'Ummmmmmm...no. Trump claims he tried to salvage trip to French cemetery for U.S. troops - POLITICO',
 "We couldn't look more closely if we tried. Every single story MSM gives us proves they are united with us against Trump.",
 "Putin will send a Putin bear to Trump in his new prison surrounding's. And will make sure Ivan is his daddy I mean cellmate",
 'Donald Trump gets back to the United States, and someone expla

## Remove apostrophes

- Remove apostrophes and expand words
    - "It's" becomes "It is", however "Trump's" stays "Trump's"

In [15]:
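# Expand contractions using a lookup table; possessives such as "trump's" are not in the table and stay as-is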
"""
I am copying this approach from this stackoverflow post http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
"""

cList = {
          "ain't": "am not",
          "aren't": "are not",
          "can't've": "cannot have",
          "can't": "cannot",
          "'cause": "because",
          "couldn't've": "could not have",
          "could've": "could have",
          "couldn't": "could not",
          "didn't": "did not",
          "doesn't": "does not",
          "don't": "do not",
          "hadn't've": "had not have",
          "hadn't": "had not",
          "hasn't": "has not",
          "haven't": "have not",
          "he'd've": "he would have",
          "he'd": "he would",
          "he'll've": "he will have",
          "he'll": "he will",
          "he's": "he is",
          "how'd'y": "how do you",
          "how'd": "how did",
          "how'll": "how will",
          "how's": "how is",
          "i'd've": "i would have",
          "i'd": "i would",
          "i'll've": "i will have",
          "i'll": "i will",
          "i'm": "i am",
          "i've": "i have",
          "isn't": "is not",
          "it'd've": "it would have",
          "it'd": "it had",
          "it'll've": "it will have",
          "it'll": "it will",
          "it's": "it is",
          "let's": "let us",
          "ma'am": "madam",
          "mayn't": "may not",
          "might've": "might have",
          "mightn't've": "might not have",
          "mightn't": "might not",
          "must've": "must have",
          "mustn't've": "must not have",
          "mustn't": "must not",
          "needn't've": "need not have",
          "needn't": "need not",
          "o'clock": "of the clock",
          "oughtn't've": "ought not have",
          "oughtn't": "ought not",
          "shan't": "shall not",
          "shan't've": "shall not have",
          "sha'n't": "shall not",
          "she'd've": "she would have",
          "she'd": "she would",
          "she'll've": "she will have",
          "she'll": "she will",
          "she's": "she is",
          "shouldn't've": "should not have",
          "should've": "should have",
          "shouldn't": "should not",
          "so've": "so have",
          "so's": "so is",
          "that'd've": "that would have",
          "that'd": "that would",
          "that's": "that is",
          "there'd've": "there would have",
          "there'd": "there had",
          "there's": "there is",
          "they'd've": "they would have",
          "they'd": "they would",
          "they'll've": "they will have",
          "they'll": "they will",
          "they're": "they are",
          "they've": "they have",
          "to've": "to have",
          "wasn't": "was not",
          "we'd've": "we would have",
          "we'd": "we had",
          "we'll've": "we will have",
          "we'll": "we will",
          "we're": "we are",
          "we've": "we have",
          "weren't": "were not",
          "what'll": "what will",
          "what'll've": "what will have",
          "what're": "what are",
          "what's": "what is",
          "what've": "what have",
          "when's": "when is",
          "when've": "when have",
          "where'd": "where did",
          "where's": "where is",
          "where've": "where have",
          "who'll": "who will",
          "who'll've": "who will have",
          "who's": "who is",
          "who've": "who have",
          "why's": "why is",
          "why've": "why have",
          "will've": "will have",
          "won't": "will not",
          "won't've": "will not have",
          "would've": "would have",
          "wouldn't": "would not",
          "wouldn't've": "would not have",
          "y'all'd've": "you all would have",
          "y'all'd": "you all would",
          "y'all're": "you all are",
          "y'all've": "you all have",
          "y'all": "you all",
          "y'alls": "you alls",
          "you'd've": "you would have",
          "you'd": "you had",
          "you'll've": "you you will have",
          "you'll": "you you will",
          "you're": "you are",
          "you've": "you have"
}

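# Correct two entries in the table above that repeat "you"
cList["you'll've"] = "you will have"
cList["you'll"] = "you will"

# Try longer contractions first so that "won't've" is not cut short by "won't"
contractionPatterns = re.compile('(%s)' % '|'.join(sorted(map(re.escape, cList), key=len, reverse=True)))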

def expandContractions(text, c_re = contractionPatterns):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text.lower())


for ix, tweet in enumerate(tweets):
    parsedTweet = expandContractions(tweet)
    tweets[ix] = parsedTweet



In [16]:
tweets[:5]

['stock goes up, credit to trump , goes down? blame the democrats. got it 👌🏾',
 "this. the crisis is here; it cannot be avoided, only mitigated. when i look at trump's gop & their supporters, i see people knowingly and gleefully poisoning my grandchildren and my planet. this is why bipartisanship is bs. i will not compromise with murderers.",
 'suburban white women who do not have a hard on for trump like',
 "i am so weary of how often you lie about this. it is early but still not going to work. trump's tax cut was supposed to change corporate behavior. here's what happened.",
 'ummmmmmm...no. trump claims he tried to salvage trip to french cemetery for u.s. troops - politico']
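Regex alternation takes the first alternative that matches at a given position, so the order in which the contractions are joined matters whenever one key is a prefix of another. The sketch below illustrates this with a three-entry table; the table, the `expand` helper, and the sample sentences are illustrative assumptions, not the full `cList`.

```python
import re

# Tiny illustrative subset of a contraction table (not the full cList).
contractions = {
    "won't": "will not",
    "won't've": "will not have",
    "it's": "it is",
}

# Keys joined in insertion order: "won't" is tried before "won't've".
naive = re.compile('(%s)' % '|'.join(contractions.keys()))
# Keys joined longest first, so the longer contraction wins.
longest_first = re.compile(
    '(%s)' % '|'.join(sorted(map(re.escape, contractions), key=len, reverse=True))
)


def expand(text, pattern):
    """Lowercase the text and replace every matched contraction."""
    return pattern.sub(lambda m: contractions[m.group(0)], text.lower())


naive_result = expand("It won't've mattered", naive)
ordered_result = expand("It won't've mattered", longest_first)
possessive_result = expand("It's Trump's call", longest_first)
print(naive_result)
print(ordered_result)
print(possessive_result)
```

The last call shows why a lookup table is preferred over a blanket rule for `'s`: "it's" is expanded, while the possessive "trump's" is not in the table and is left alone.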

## Remove punctuation

- Remove ','
- Keep '?','!'
- Remove newline (\n)

In [17]:
# Replace ellipses with a single space

ellipsesPattern = re.compile(r'\.{3,}')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(ellipsesPattern, ' ', tweet)
    tweets[ix] = parsedTweet


In [18]:
tweets[:5]

['stock goes up, credit to trump , goes down? blame the democrats. got it 👌🏾',
 "this. the crisis is here; it cannot be avoided, only mitigated. when i look at trump's gop & their supporters, i see people knowingly and gleefully poisoning my grandchildren and my planet. this is why bipartisanship is bs. i will not compromise with murderers.",
 'suburban white women who do not have a hard on for trump like',
 "i am so weary of how often you lie about this. it is early but still not going to work. trump's tax cut was supposed to change corporate behavior. here's what happened.",
 'ummmmmmm no. trump claims he tried to salvage trip to french cemetery for u.s. troops - politico']

In [21]:
# Remove all punctuation except '?', '!', '#', and apostrophes
# (this also strips emoji and other non-word symbols)
punctuationPattern = re.compile(r"[^\w\s?!#']+")
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(punctuationPattern, '', tweet)
    tweets[ix] = parsedTweet


In [22]:
tweets[:5]

['stock goes up credit to trump  goes down? blame the democrats got it ',
 "this the crisis is here it cannot be avoided only mitigated when i look at trump's gop  their supporters i see people knowingly and gleefully poisoning my grandchildren and my planet this is why bipartisanship is bs i will not compromise with murderers",
 'suburban white women who do not have a hard on for trump like',
 "i am so weary of how often you lie about this it is early but still not going to work trump's tax cut was supposed to change corporate behavior here's what happened",
 'ummmmmmm no trump claims he tried to salvage trip to french cemetery for us troops  politico']
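Removing a comma or dash that stands between two spaces leaves a double space behind, as in "trump  goes" above. A minimal sketch of the same character-class approach followed by a whitespace collapse is shown here; the sample sentence is made up and the second step is an addition, not part of the cell above.

```python
import re

# Keep word characters, whitespace, '?', '!', '#', and apostrophes.
punctuation_pattern = re.compile(r"[^\w\s?!#']+")

sample = "stock goes up, credit to trump , goes down? blame the democrats. got it! #maga"
stripped = punctuation_pattern.sub('', sample)
# A standalone comma leaves two adjacent spaces, so collapse whitespace after.
tidy = re.sub(r'\s+', ' ', stripped).strip()
print(tidy)
```

Extra spaces do not change tokenization, so the collapse is cosmetic for polarity scoring, but it makes the printed tweets easier to inspect.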

<a id = 'Word-pattern-formatting'></a>

## Word pattern formatting

- Condense extended strings of vowels and consonants down to form correctly spelled word
    - "Gooooooood" becomes "Good"
    - "Realllllly" becomes "Really"

In [36]:
# Collapse any run of a repeated character down to two occurrences (e.g. 'gooooood' becomes 'good')

repetitionPattern = re.compile(r'(.)\1+')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(repetitionPattern, r'\1\1', tweet)
    tweets[ix] = parsedTweet
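Collapsing every run to exactly two characters restores words that legitimately contain a double letter, but it cannot recover words that should end up with a single letter. The sketch below, with the hypothetical helper name `squeeze`, shows both cases.

```python
import re

# A captured character followed by one or more copies of itself.
repetition_pattern = re.compile(r'(.)\1+')


def squeeze(text):
    """Hypothetical helper: shorten every repeated run to two characters."""
    return repetition_pattern.sub(r'\1\1', text)


for word in ['Gooooooood', 'Realllllly', 'sooooo', 'Ummmmmmm', 'good']:
    print(word, '->', squeeze(word))
```

"sooooo" comes out as "soo" and "Ummmmmmm" as "Umm", neither of which is likely to be found in a lexicon, so such words would need an extra mapping step (for example a word-pattern list like 'gud' to 'good') if they matter for scoring.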


<a id = 'Remove-hashtags'></a>

## Remove hashtags

In [42]:
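# Remove all hashtags, including # and the associated word.
# ('?' is deliberately left out of the pattern: inside a character class it would be a literal '?', so '[#?]' would also delete words that follow a question mark.)
hashtagPattern = re.compile(r'#\w+')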
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(hashtagPattern, '', tweet)
    tweets[ix] = parsedTweet
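Inside a character class, `?` is a literal question mark rather than a quantifier, so a class such as `[#?]` matches either symbol. The sketch below compares such a pattern with one that matches only `#`; the sample text and variable names are made up for illustration.

```python
import re

class_pattern = re.compile(r'([#?])(\w+)\b')   # '#' OR '?' followed by a word
hashtag_only = re.compile(r'#\w+')             # '#' followed by a word


def strip_and_tidy(pattern, text):
    """Remove matches, then collapse the leftover whitespace."""
    return re.sub(r'\s+', ' ', pattern.sub('', text)).strip()


sample = 'why?because #life #great#like is good'
with_class = strip_and_tidy(class_pattern, sample)
with_hash = strip_and_tidy(hashtag_only, sample)
print(with_class)
print(with_hash)
```

With the character class, the word "because" disappears along with the hashtags because it directly follows a question mark; the hash-only pattern keeps it.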


<a id = 'Polarity-analysis'></a>

## Polarity analysis

In [43]:
# Look up the SentiWordNet scores for the first adjective ('a') sense of 'happy'

nltk.download('wordnet')

print('positive score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].pos_score()))
print('negative score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].neg_score()))
print('neutral score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].obj_score()))


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\petersont\AppData\Roaming\nltk_data...


KeyboardInterrupt: 

In [None]:
# Remove English stopwords from the tokenized sentence

stop = stopwords.words('english')
sentence = 'i am happy'
newSentence = []
for word in tokens:
    if word not in stop:
        newSentence.append(word)

print('The sentence has been reduced from \'{0}\' \n to \'{1}\''.format(sentence, newSentence))


In [44]:
sentence = 'i am happy'
tokens = nltk.tokenize.word_tokenize(sentence)
print('Tokens: {0}'.format(tokens))


LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  Searched in:
    - 'C:\\Users\\petersont/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\petersont\\Miniconda\\nltk_data'
    - 'C:\\Users\\petersont\\Miniconda\\share\\nltk_data'
    - 'C:\\Users\\petersont\\Miniconda\\lib\\nltk_data'
    - 'C:\\Users\\petersont\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************


In [None]:
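# Score every tweet by summing word scores over its non-stopword tokens.
# Polarity per word is taken here as pos_score - neg_score of the first
# adjective sense (an assumed convention); unknown words contribute 0.
stop = set(stopwords.words('english'))
sentimentDict = {}
for ix, tweet in enumerate(tweets):
    sentenceSentiment = 0.
    for word in nltk.tokenize.word_tokenize(tweet):
        if word not in stop:
            synsets = list(swn.senti_synsets(word, 'a'))
            if synsets:
                sentenceSentiment += synsets[0].pos_score() - synsets[0].neg_score()
    sentimentDict[ix] = sentenceSentiment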
    

In [None]:
# fix typos
# handle words not in sentiwordnet by giving 0

synsets = list(swn.senti_synsets(word, 'a'))
synsets[0].neg_score() if synsets else 0.
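The zero-score fallback can be checked without downloading any corpora. The sketch below is a self-contained stand-in, not the SentiWordNet API: `toy_lexicon`, `stop_words`, and `polarity` are hypothetical, the scores for "sad" are invented for illustration (only the "happy" values match the output shown earlier), and polarity is assumed to be the positive score minus the negative score.

```python
# Hypothetical toy lexicon standing in for SentiWordNet: word -> (pos, neg).
toy_lexicon = {
    'happy': (0.875, 0.0),   # values from the walkthrough output
    'sad': (0.125, 0.75),    # made-up values for illustration
}
stop_words = {'i', 'am', 'but', 'not'}   # assumed tiny stopword list


def polarity(sentence):
    """Sum pos - neg over known, non-stopword tokens; unknown words add 0."""
    score = 0.0
    for word in sentence.lower().split():
        if word in stop_words:
            continue
        # dict.get supplies the zero score for words missing from the lexicon.
        pos, neg = toy_lexicon.get(word, (0.0, 0.0))
        score += pos - neg
    return score


for text in ['i am happy', 'i am sad', 'i am flabbergasted']:
    print(text, '->', polarity(text))
```

With the real lexicon, the same shape applies: look the word up, and fall back to zero when the lookup returns nothing, so that a single unknown token never raises an error partway through a few hundred thousand tweets.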