# Assignment 5

__Table of contents__

1. [Module 7 walkthrough](#Module-7-walkthrough)
1. [Module 8 walkthrough](#Module-8-walkthrough)
1. [Assignment 5](#assignment)
    1. [Acquire tweets](#Acquire-tweets)
    1. [Load tweets](#Load-tweets)
    1. [HTML Parser](#HTML-Parser)
    1. [Remove username, URL](#Remove-username-URL)
    1. [Remove punctuation](#Remove-punctuation)
    1. [Remove apostrophes](#Remove-apostrophes)
    1. [Word pattern formatting](#Word-pattern-formatting)
    1. [Remove hashtags](#Remove-hashtags)
    1. [Polarity analysis](#Polarity-analysis)

In [1]:
import os
import sys
import jsonpickle
import json
import tweepy
import html.parser as HTMLParser
import re

import nltk
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import stopwords


modulePath = os.path.abspath(os.path.join('../../..'))
if modulePath not in sys.path:
    sys.path.append(modulePath)
import config

# Standard tweepy API setup

auth = tweepy.OAuthHandler(config.apiKey, config.apiSec)
auth.set_access_token(config.accessToken, config.accessSec)

api = tweepy.API(auth)

# Application authentication tweepy setup
# Use application-only authentication for higher Twitter API rate limit
# Twitter API returns a max of 100 tweets per query
# Allows for 450 queries every 15 minutes
# So we can gather 45,000 tweets every 15 minutes

#Switching to application authentication
auth = tweepy.AppAuthHandler(config.apiKey, config.apiSec)

#Setting up new api wrapper, using authentication only
api = tweepy.API(auth, wait_on_rate_limit = True
                 ,wait_on_rate_limit_notify = True)
 
# View rate limit status

api.rate_limit_status()['resources']['search']


[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/petersontylerd/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1542583710}}

<a id = 'Module-7-walkthrough'></a>

# Module 7 walkthrough

In [2]:
# Unescape HTML entities (e.g. "&amp;" becomes "&").
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it.
import html

tweet = "@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like http://lifeisgreat.com ."
parsedTweet = html.unescape(tweet)
print(parsedTweet)


@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like http://lifeisgreat.com .


  


In [3]:
# Remove URLs (the raw string avoids an invalid-escape warning for \S)

urlPattern = re.compile(r'http\S+')
tweet_v1 = re.sub(urlPattern, '', parsedTweet)
print(tweet_v1)


@user_@34 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like  .


In [4]:
# Remove usernames (the raw string avoids an invalid-escape warning for \S)

usernamePattern = re.compile(r'@\S+')
tweet_v2 = re.sub(usernamePattern, '', tweet_v1)
print(tweet_v2)


 Life is great & I like it sooooooooo much. It's whatis life. #life #great#like  .


In [5]:
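# Collapse an elongated "so" (e.g. "sooooo") to "so".
# Word boundaries keep words such as "soon" from being rewritten.

wordPattern = re.compile(r'\bso+\b')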
tweet_v3 = re.sub(wordPattern, 'so', tweet_v2)
print(tweet_v3)


 Life is great & I like it so much. It's whatis life. #life #great#like  .
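
The four walkthrough steps can be chained into a single function. The following is a minimal sketch rather than a definitive implementation: the helper name `clean_walkthrough_tweet` and the constant names are hypothetical, and it uses `html.unescape` from the standard library together with a word-bounded pattern for the elongated "so".

```python
import html
import re

# Patterns for the walkthrough steps (constant names are illustrative).
URL_PATTERN = re.compile(r'http\S+')
USERNAME_PATTERN = re.compile(r'@\S+')
ELONGATED_SO_PATTERN = re.compile(r'\bso+\b')


def clean_walkthrough_tweet(tweet):
    """Apply the walkthrough steps in order and return the cleaned text."""
    text = html.unescape(tweet)                # "&amp;" becomes "&"
    text = URL_PATTERN.sub('', text)           # drop URLs
    text = USERNAME_PATTERN.sub('', text)      # drop @usernames
    text = ELONGATED_SO_PATTERN.sub('so', text)  # "sooooo" becomes "so"
    return text


sample = ("@user_@34 Life is great &amp; I like it sooooooooo much. "
          "It's whatis life. #life #great#like http://lifeisgreat.com .")
print(clean_walkthrough_tweet(sample))
```

Keeping the steps in one function makes it easier to apply the same cleaning to every tweet in a list with a comprehension.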


<a id = 'Module-8-walkthrough'></a>

# Module 8 walkthrough

In [8]:
# Look up the SentiWordNet scores for the first adjective ('a') sense of "happy"

nltk.download('wordnet')

print('positive score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].pos_score()))
print('negative score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].neg_score()))
print('neutral score for the word "happy": {0}'.format(list(swn.senti_synsets('happy','a'))[0].obj_score()))


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/petersontylerd/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
positive score for the word "happy": 0.875
negative score for the word "happy": 0.0
neutral score for the word "happy": 0.125


In [14]:
# Split a sentence into word tokens

nltk.download('punkt')

sentence = 'i am happy'
tokens = nltk.tokenize.word_tokenize(sentence)
print('Tokens: {0}'.format(tokens))
    

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/petersontylerd/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Tokens: ['i', 'am', 'happy']


In [17]:
# Tag each token with its part of speech

from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')

pos_tag(tokens)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/petersontylerd/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('i', 'NN'), ('am', 'VBP'), ('happy', 'JJ')]

- NN: noun, singular
- VBP: verb, present tense, not third-person singular
- JJ: adjective
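
The tags above are Penn Treebank tags, while `swn.senti_synsets` in the earlier cell takes a one-letter WordNet part of speech such as `'a'`. The following is a minimal sketch of a bridge between the two. It assumes a hypothetical helper named `penn_to_wordnet` (not part of NLTK) and that only nouns, verbs, adjectives, and adverbs need to be scored.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('NN'):
        return 'n'  # noun
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return None     # determiners, pronouns, punctuation, and so on


# Tagged tokens copied from the pos_tag output above.
tagged = [('i', 'NN'), ('am', 'VBP'), ('happy', 'JJ')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Words whose tag maps to `None` could simply be skipped or scored as zero during polarity analysis.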

In [18]:
# Drop English stop words from the token list

stop = stopwords.words('english')
sentence = 'i am happy'
newSentence = []
for word in tokens:
    if word not in stop:
        newSentence.append(word)

print('The sentence has been reduced from \'{0}\' \n to \'{1}\''.format(sentence, newSentence))


The sentence has been reduced from 'i am happy' 
 to '['happy']'


<a id = 'assignment'></a>

# Assignment 5

* Try cleaning the tweets that you extracted in the previous chapter. Apply the rules above, and in addition apply the rules below:
    * Remove punctuation. Punctuation marks sometimes carry no weight, so you can remove them. Try writing a regular expression to remove commas (,) from sentences. Don't remove question marks (?) or exclamation marks (!), as they affect the meaning of a sentence.
    * Remove apostrophes and expand the words. For example, in the sentence "It's a great time to code!" the first word, "It's", can be expanded to "it is". You can do this with regular expressions.
    * Create a list of word patterns for word formatting. For example, 'gud' should be substituted with 'good'.

* Calculate the polarity of a sentence, and write a program to calculate the polarity of all the tweets that you extracted and preprocessed in the previous questions. Your program should also include the features below:

    * Tweets have hashtags. Remove the hashtags and then find the polarity of each tweet.

    * There might be words that are not present in the SentiWordNet lexicon.
    * The program should handle these cases by giving such words a score of zero.
    * Depending on the question, file uploads or screenshots are necessary to show your work.

<a id = 'Acquire-tweets'></a>

## Acquire tweets

In [None]:
# Find up to 500,000 tweets from the last week containing the word 'trump'.
# Store one JSON object per line in trumpTweets.json

maxTweets = 500000
tweetCount = 0
with open('trumpTweets.json','w') as f:
    for tweet in tweepy.Cursor(api.search, q = 'trump', tweet_mode = 'extended', lang = 'en').items(maxTweets):
        f.write(jsonpickle.encode(tweet._json, unpicklable = False) + '\n')
        tweetCount += 1
    print('Downloaded {0} tweets'.format(tweetCount))


<a id = 'Load-tweets'></a>

## Load tweets

In [151]:
# Load the downloaded tweets from trumpTweets.json into memory

data = []
with open('./trumpTweets.json', 'r') as jsonFile:
    for line in jsonFile:
        data.append(json.loads(line))
print('Total number of tweets loaded: {0}'.format(len(data)))


Total number of tweets loaded: 221072


In [235]:
# Unpack all tweets in data

tweets = []
for item in data:
    if 'full_text' in item:
        tweet = item['full_text']
        tweets.append(tweet)
print('Total number of tweets extracted from json: {0}'.format(len(tweets)))


Total number of tweets extracted from json: 221072


In [236]:
# I only want to look at original tweets, not retweets

tweets = [x for x in tweets if not x.startswith('RT ')]


In [237]:
# Review first 3 tweets

for i in range(3):
    print(tweets[i])
    print('')

@realDonaldTrump Stock goes up, credit to trump , goes down? Blame the Democrats. Got it 👌🏾

This. The crisis is here; it can’t be avoided, only mitigated. When I look at Trump’s GOP &amp; their supporters, I see people knowingly and gleefully poisoning my grandchildren and my planet. This is why bipartisanship is BS. I won’t compromise with murderers. https://t.co/mEYs7IszjB

@PrincessBravato Suburban white women who do not have a hard on for Trump like @lindseygraham



<a id = 'HTML-Parser'></a>

## HTML Parser

In [238]:
# Unescape HTML entities in every tweet (e.g. "&amp;" becomes "&")
import html

for ix, tweet in enumerate(tweets):
    parsedTweet = html.unescape(tweet)
    tweets[ix] = parsedTweet


<a id = 'Remove-username-URL'></a>

## Remove username, URL

In [239]:
# Remove URLs and usernames

urlPattern = re.compile(r'(?:\@|https?\://)\S+')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(urlPattern, '', tweet)
    tweets[ix] = parsedTweet


<a id = 'Remove-punctuation'></a>

## Remove punctuation

- Remove ','
- Keep '?','!'
- Remove newline (\n)

In [240]:
# Replace ellipses with a single space

ellipsesPattern = re.compile(r'\.{3,}')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(ellipsesPattern, ' ', tweet)
    tweets[ix] = parsedTweet


In [241]:
# Remove all punctuation except '?', '!', '#', and straight apostrophes
# (curly apostrophes such as ’ are not in the keep list, so they are removed)

punctuationPattern = re.compile(r'[^\w\d\s?!\'#]+')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(punctuationPattern, '', tweet)
    tweets[ix] = parsedTweet


In [242]:
# Remove unnecessary white space, newlines and tabs

stripPattern = re.compile(r'\s+')
for ix, tweet in enumerate(tweets):
    parsedTweet = re.sub(stripPattern, ' ', tweet).strip()
    tweets[ix] = parsedTweet


In [243]:
tweets[:100]

['Stock goes up credit to trump goes down? Blame the Democrats Got it',
 'This The crisis is here it cant be avoided only mitigated When I look at Trumps GOP their supporters I see people knowingly and gleefully poisoning my grandchildren and my planet This is why bipartisanship is BS I wont compromise with murderers',
 'Suburban white women who do not have a hard on for Trump like',
 'I am so weary of how often you lie about this It is early but still not going to work Trumps Tax Cut Was Supposed to Change Corporate Behavior Heres What Happened',
 'Ummmmmmm no Trump claims he tried to salvage trip to French cemetery for US troops POLITICO',
 'We couldnt look more closely if we tried Every single story MSM gives us proves they are united with us against Trump',
 "Putin will send a Putin bear to Trump in his new prison surrounding's And will make sure Ivan is his daddy I mean cellmate",
 'Donald Trump gets back to the United States and someone explains what they were saying in Europe',


<a id = 'Remove-apostrophes'></a>

## Remove apostrophes

- Remove apostrophes and expand words
    - "It's" becomes "It is", however "Trump's" stays "Trump's"

- Contractions to expand first: it's, what's, whats, that's, thats
- Search the corpus for the remaining contractions
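
The notes above can be sketched as a small lookup-table expansion. This is a minimal sketch under stated assumptions, not a definitive implementation: the `CONTRACTIONS` table holds only the forms listed above, the name `expand_contractions` is hypothetical, and curly apostrophes are normalized to straight ones first (which only helps if this step runs before punctuation removal strips them). Possessives such as "Trump's" and the possessive "its" are left alone because they are not in the table.

```python
import re

# Starter table taken from the notes above; extend after searching the corpus.
CONTRACTIONS = {
    "it's": "it is",
    "what's": "what is",
    "whats": "what is",
    "that's": "that is",
    "thats": "that is",
}

# Match any table key as a whole word, ignoring case; longest keys first.
_KEYS = sorted(CONTRACTIONS, key=len, reverse=True)
CONTRACTION_PATTERN = re.compile(
    r'\b(' + '|'.join(re.escape(key) for key in _KEYS) + r')\b',
    re.IGNORECASE,
)


def _expand(match):
    word = match.group(0)
    expanded = CONTRACTIONS[word.lower()]
    # Keep a leading capital: "It's" becomes "It is".
    return expanded.capitalize() if word[0].isupper() else expanded


def expand_contractions(text):
    """Expand the contractions in CONTRACTIONS; leave everything else as is."""
    text = text.replace('\u2019', "'")  # curly apostrophe to straight
    return CONTRACTION_PATTERN.sub(_expand, text)


print(expand_contractions("It's what's best, but Trump's plan thats fine"))
```

A table keeps the expansion explicit, which avoids a blanket `'s` rule that would also rewrite possessives.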

<a id = 'Word-pattern-formatting'></a>

## Word pattern formatting

- Condense extended strings of vowels and consonants down to form a correctly spelled word
    - "Gooooooood" becomes "Good"
    - "Realllllly" becomes "Really"

In [5]:
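# Collapse an elongated "so" (e.g. "sooooo") to "so".
# Word boundaries keep words such as "soon" from being rewritten.

wordPattern = re.compile(r'\bso+\b')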
tweet_v3 = re.sub(wordPattern, 'so', tweet_v2)
print(tweet_v3)


 Life is great & I like it so much. It's whatis life. #life #great#like  .
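
A more general version of this step could collapse any run of repeated letters and then apply a substitution table, as the assignment's 'gud' to 'good' example suggests. The following is a minimal sketch under assumptions: runs of three or more identical letters are collapsed to two, the `WORD_FIXES` table is a hypothetical starter list, and the name `format_words` is illustrative.

```python
import re

# Three or more of the same letter in a row, e.g. "ooooo" (digits are ignored).
REPEATED_LETTER = re.compile(r'([A-Za-z])\1{2,}')

# Hypothetical starter table of informal spellings; extend from the corpus.
WORD_FIXES = {'gud': 'good', 'soo': 'so'}


def _fix_word(match):
    word = match.group(0)
    fixed = WORD_FIXES.get(word.lower())
    if fixed is None:
        return word
    # Keep a leading capital: "Gud" becomes "Good".
    return fixed.capitalize() if word[0].isupper() else fixed


def format_words(text):
    """Collapse long letter runs to two letters, then apply WORD_FIXES."""
    collapsed = REPEATED_LETTER.sub(r'\1\1', text)
    return re.sub(r'[A-Za-z]+', _fix_word, collapsed)


for sample in ['Gooooooood', 'Realllllly', 'it is sooooooooo gud']:
    print(format_words(sample))
```

Collapsing to two letters preserves legitimate doubles such as "good" and "really"; words that need a single letter, such as "soo", are then handled by the table.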


<a id = 'Remove-hashtags'></a>

## Remove hashtags
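
The assignment asks for hashtags to be removed before polarity is calculated. The following is a minimal sketch, assuming a hashtag is a `#` followed by word characters and that the whole tag (not only the `#` sign) should be dropped. The name `remove_hashtags` is hypothetical.

```python
import re

# A '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r'#\w+')
WHITESPACE_PATTERN = re.compile(r'\s+')


def remove_hashtags(text):
    """Delete hashtags, then collapse the whitespace they leave behind."""
    without_tags = HASHTAG_PATTERN.sub('', text)
    return WHITESPACE_PATTERN.sub(' ', without_tags).strip()


print(remove_hashtags('Life is great #life #great#like'))
```

Applied to the list from the earlier steps, this could be written as `tweets = [remove_hashtags(t) for t in tweets]`.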

<a id = 'Polarity-analysis'></a>

## Polarity analysis
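
The assignment asks for a per-tweet polarity in which words missing from the SentiWordNet lexicon score zero. The following is a minimal sketch, not a definitive implementation. It assumes that polarity is the sum, over tokens, of the positive minus negative score of each word's first sense, and that splitting on whitespace is an adequate tokenizer for cleaned tweets. The names `swn_word_score`, `toy_word_score`, and `tweet_polarity` are hypothetical, and the `TOY_LEXICON` value for "terrible" is made up for the demonstration. `swn_word_score` needs the `sentiwordnet` and `wordnet` corpora downloaded, as in the walkthrough.

```python
def swn_word_score(word):
    """Positive minus negative score of the word's first SentiWordNet sense.

    Returns 0.0 when the word is not in the lexicon, as the assignment asks.
    """
    # Imported lazily so the rest of the sketch runs without the NLTK corpora.
    from nltk.corpus import sentiwordnet as swn
    senses = list(swn.senti_synsets(word))
    if not senses:
        return 0.0
    return senses[0].pos_score() - senses[0].neg_score()


# Demonstration lexicon: 'happy' reuses the walkthrough score (0.875 - 0.0);
# the 'terrible' score is invented for illustration.
TOY_LEXICON = {'happy': 0.875, 'terrible': -0.625}


def toy_word_score(word):
    """Stand-in scorer that also gives unknown words a score of zero."""
    return TOY_LEXICON.get(word, 0.0)


def tweet_polarity(text, score_word=swn_word_score):
    """Sum the word scores of a cleaned tweet."""
    tokens = text.lower().split()
    return sum(score_word(token) for token in tokens)


for text in ['i am happy', 'terrible terrible day', 'asdfgh']:
    print(text, tweet_polarity(text, score_word=toy_word_score))
```

With the corpora installed, `[tweet_polarity(t) for t in tweets]` would score the cleaned tweets with SentiWordNet. Tokens that still carry "?" or "!" would need those marks split off before lookup, for example with `nltk.tokenize.word_tokenize`.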