## Introduction

This is a comparative analysis of [Hillary Clinton's](https://twitter.com/HillaryClinton) and [Donald Trump's](https://twitter.com/realDonaldTrump) tweets. The repo contains the code and a couple of images used for the data visualization. 

We will use the Twitter API to download the data. Because the API only returns a user's 3,200 most recent tweets, your results might differ slightly from mine.

Running this analysis requires Python 3 and the following packages:

* yaml (provided by the PyYAML package)
* numpy
* matplotlib
* pandas
* tweepy
* bokeh
* nltk

If you're willing to tweak the code a bit, then Python 2 should work too.
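
Before starting, it can help to check which of these packages are already installed. The snippet below is a minimal sketch that uses only the standard library; `REQUIRED` and `missing_packages` are names chosen for this example, and the list holds the import names from above.

```python
from importlib.util import find_spec

# Import names of the packages listed above.
REQUIRED = ["yaml", "numpy", "matplotlib", "pandas", "tweepy", "bokeh", "nltk"]


def missing_packages(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```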

Downloading the data requires a Twitter account and a mobile phone number associated with the account. The phone number can be added in `Profile and settings -> Settings -> Mobile`: https://twitter.com/settings/add_phone?edit_phone=true

The tutorial consists of two parts. The first part should be easy to complete. The second part is slightly more difficult and includes several exercises, with solutions provided.

The second part requires the Stanford Named Entity Recognition Tagger. This blog post explains how to set it up: http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages

The second part also requires an emotion lexicon. I used the one here: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
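
As a rough idea of how such a lexicon can be loaded, here is a minimal sketch. It assumes the word-level file has one tab-separated row per word and emotion, in the form `word<TAB>emotion<TAB>flag`, where the flag is `1` when the word is associated with the emotion; check this against the file you download. The function name `parse_emotion_lexicon` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def parse_emotion_lexicon(lines):
    """Map each word to the set of emotions it is flagged with."""
    word_emotions = defaultdict(set)
    for line in lines:
        fields = line.strip().split("\t")
        # Skip blank lines and any header text that is not a 3-column row.
        if len(fields) != 3:
            continue
        word, emotion, flag = fields
        if flag == "1":
            word_emotions[word].add(emotion)
    return dict(word_emotions)


# Made-up rows in the assumed format, for illustration only.
sample = [
    "happy\tjoy\t1",
    "happy\tsadness\t0",
    "happy\tpositive\t1",
    "storm\tfear\t1",
    "storm\tjoy\t0",
]
lexicon = parse_emotion_lexicon(sample)
print(sorted(lexicon["happy"]))
```

Because the function accepts any iterable of lines, an open file handle for the downloaded lexicon can be passed in directly instead of `sample`.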

1. [Download Clinton's and Trump's Tweets](Download%20Tweets.ipynb)
2. [Vocabulary Size and Sentiment Analysis](Vocabulary%20Size%20and%20Sentiment%20Analysis.ipynb)

NLTK is needed for the second part. At the end of this notebook, there is a brief introduction to the NLTK methods used in the tutorial. It is very basic, and those who have worked with NLTK before can probably skip it.

I suggest the following directory structure inside your top level directory:

```
.
├── code
├── data
│   └── NRC-Emotion-Lexicon-v0.92
├── figs
└── tutorial
    ├── img
    └── notebooks
```

If you clone the Git repository, most of these directories will be created automatically. The exceptions are the `figs` directory (because all of its files can be generated by running the notebooks) and the CSV files in the `data` directory that contain the tweet data (because, per the Twitter Developer Agreement, one is not allowed to share their Twitter data).
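
If you would rather create the directories from Python, the following is a minimal sketch. `LAYOUT` and `create_layout` are names chosen for this example; the demonstration runs in a temporary directory so that nothing is written to your project.

```python
import tempfile
from pathlib import Path

# Directories from the suggested layout, relative to the top-level directory.
LAYOUT = [
    "code",
    "data/NRC-Emotion-Lexicon-v0.92",
    "figs",
    "tutorial/img",
    "tutorial/notebooks",
]


def create_layout(root):
    """Create any missing directories under `root` and return their paths."""
    paths = [Path(root) / name for name in LAYOUT]
    for path in paths:
        # parents=True also creates `data` and `tutorial`;
        # exist_ok=True leaves directories that already exist untouched.
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Demonstrated in a scratch directory; pass your own top-level directory instead.
with tempfile.TemporaryDirectory() as scratch:
    created = create_layout(scratch)
    all_present = all(path.is_dir() for path in created)
    print(all_present)
```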

---

## NLTK

[NLTK](http://www.nltk.org/) is a Python package for natural language processing. The main NLTK tools that we will use in this tutorial are the functions `word_tokenize` and `pos_tag`, and the `stopwords` corpus. We will also use the tagging, stemming, and lemmatization classes `StanfordNERTagger`, `PorterStemmer`, and `WordNetLemmatizer`, respectively.

The examples below show simple uses of these tools, which should be enough to write and understand the code in the tutorial.

In [7]:
# Get a list of all stopwords.
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no
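
A typical use of this list is to drop very common words before counting vocabulary. Here is a minimal sketch; `remove_stopwords` is a hypothetical helper, and the small hand-picked set below stands in for the full list only to keep the example self-contained.

```python
def remove_stopwords(tokens, stopword_set):
    """Return the tokens whose lowercase form is not in `stopword_set`."""
    return [tok for tok in tokens if tok.lower() not in stopword_set]


# A tiny subset of the list printed above; in the tutorial you would
# pass set(stopwords.words('english')) instead.
small_stopword_set = {"this", "is", "a", "very", "for", "the"}

tokens = ["This", "is", "a", "very", "boring", "sentence",
          "for", "the", "PyLadies", "tutorial", "."]
filtered = remove_stopwords(tokens, small_stopword_set)
print(filtered)
```

The stopwords are all lowercase, so the tokens are lowercased for the comparison. Using a set rather than a list keeps each lookup fast.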

In [16]:
# Tokenize a text.
from nltk.tokenize import word_tokenize

text1 = "This is a very boring sentence for the PyLadies tutorial."
tokenized_text1 = word_tokenize(text1)
print(tokenized_text1, "\n")

# from http://examples.yourdictionary.com/20-examples-of-slang-language.html
text2 = """“Gangsta” is hardly a new word; in fact, it's at least two decades old. 
           But a new take on someone who aspires to the gangsta style, but fails miserably, 
           is a “wanksta.”"""
tokenized_text2 = word_tokenize(text2)
print(tokenized_text2)

['This', 'is', 'a', 'very', 'boring', 'sentence', 'for', 'the', 'PyLadies', 'tutorial', '.'] 

['“Gangsta”', 'is', 'hardly', 'a', 'new', 'word', ';', 'in', 'fact', ',', 'it', "'s", 'at', 'least', 'two', 'decades', 'old', '.', 'But', 'a', 'new', 'take', 'on', 'someone', 'who', 'aspires', 'to', 'the', 'gangsta', 'style', ',', 'but', 'fails', 'miserably', ',', 'is', 'a', '“wanksta.”']
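
Note that in the second output the curly quotation marks stay attached to the words, as in `'“Gangsta”'` and `'“wanksta.”'`. One possible workaround, assuming straight quotes are acceptable for your analysis, is to replace the curly quotes before tokenizing. This is a sketch; `normalize_quotes` is a made-up helper name, and how the tokenizer then handles the straight quotes may depend on your NLTK version.

```python
# Map curly quotation marks to their straight ASCII equivalents.
QUOTE_TABLE = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(text):
    """Return `text` with curly quotes replaced by straight quotes."""
    return text.translate(QUOTE_TABLE)


normalized = normalize_quotes("\u201cGangsta\u201d is hardly a new word")
print(normalized)
```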


### Parts of Speech

In [18]:
# Tag parts of speech in a text.
#
# See this list for possible tags:
# http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

import nltk

print(nltk.pos_tag(tokenized_text1), "\n")
print(nltk.pos_tag(tokenized_text2))

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('very', 'RB'), ('boring', 'JJ'), ('sentence', 'NN'), ('for', 'IN'), ('the', 'DT'), ('PyLadies', 'NNP'), ('tutorial', 'NN'), ('.', '.')] 

[('“Gangsta”', 'NN'), ('is', 'VBZ'), ('hardly', 'RB'), ('a', 'DT'), ('new', 'JJ'), ('word', 'NN'), (';', ':'), ('in', 'IN'), ('fact', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('at', 'IN'), ('least', 'JJS'), ('two', 'CD'), ('decades', 'NNS'), ('old', 'JJ'), ('.', '.'), ('But', 'CC'), ('a', 'DT'), ('new', 'JJ'), ('take', 'NN'), ('on', 'IN'), ('someone', 'NN'), ('who', 'WP'), ('aspires', 'VBZ'), ('to', 'TO'), ('the', 'DT'), ('gangsta', 'NN'), ('style', 'NN'), (',', ','), ('but', 'CC'), ('fails', 'VBZ'), ('miserably', 'RB'), (',', ','), ('is', 'VBZ'), ('a', 'DT'), ('“wanksta.”', 'NN')]
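
Since `pos_tag` returns a plain list of `(word, tag)` tuples, the result is easy to process with standard Python. The sketch below counts the tags and picks out the nouns from the first output above; `words_with_tag_prefix` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

# Tagged tokens copied from the first pos_tag output above.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('very', 'RB'),
          ('boring', 'JJ'), ('sentence', 'NN'), ('for', 'IN'), ('the', 'DT'),
          ('PyLadies', 'NNP'), ('tutorial', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with `prefix`."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


tag_counts = Counter(tag for _, tag in tagged)
nouns = words_with_tag_prefix(tagged, 'NN')
print(tag_counts['DT'], nouns)
```

Matching on the prefix `'NN'` catches all the noun tags (`NN`, `NNS`, `NNP`, `NNPS`) at once.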


### Proper Nouns

In [25]:
# Tag a text using the Stanford NER tagger.
import os

from nltk.tag import StanfordNERTagger

# Change the paths to point to the directory where you downloaded the Stanford tagger.
os.environ['STANFORD_MODELS'] = "/Users/gogrean/code/stanford-ner-2014-06-16/classifiers"
os.environ['CLASSPATH'] = "/Users/gogrean/code/stanford-ner-2014-06-16"
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

# from Trump's tweets
text = """Crazy Maureen Dowd, the wacky columnist for the failing 
          @nytimes, pretends she knows me well--wrong!"""
tokenized_text = word_tokenize(text)

print(st.tag(tokenized_text))

[('Crazy', 'O'), ('Maureen', 'PERSON'), ('Dowd', 'PERSON'), (',', 'O'), ('the', 'O'), ('wacky', 'O'), ('columnist', 'O'), ('for', 'O'), ('the', 'O'), ('failing', 'O'), ('@', 'O'), ('nytimes', 'O'), (',', 'O'), ('pretends', 'O'), ('she', 'O'), ('knows', 'O'), ('me', 'O'), ('well', 'O'), ('--', 'O'), ('wrong', 'O'), ('!', 'O')]
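
The tagger labels each token separately, so "Maureen" and "Dowd" come back as two `PERSON` tokens. Here is a minimal sketch for merging consecutive tokens with the same label into one entity. `extract_entities` is a hypothetical helper, not part of NLTK, and this simple approach would also merge two different entities of the same type if they happen to be adjacent.

```python
from itertools import groupby

# Tagged tokens copied from the output above.
tagged = [('Crazy', 'O'), ('Maureen', 'PERSON'), ('Dowd', 'PERSON'),
          (',', 'O'), ('the', 'O'), ('wacky', 'O'), ('columnist', 'O'),
          ('for', 'O'), ('the', 'O'), ('failing', 'O'), ('@', 'O'),
          ('nytimes', 'O'), (',', 'O'), ('pretends', 'O'), ('she', 'O'),
          ('knows', 'O'), ('me', 'O'), ('well', 'O'), ('--', 'O'),
          ('wrong', 'O'), ('!', 'O')]


def extract_entities(tagged_tokens):
    """Join runs of tokens that share a non-'O' label into (text, label) pairs."""
    entities = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != 'O':
            entities.append((' '.join(word for word, _ in group), label))
    return entities


entities = extract_entities(tagged)
print(entities)
```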


### Stemming and Lemmatization

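Stemming and lemmatization are two ways of reducing inflected words, and words with a common stem, to a common form. Stemming strips suffixes using heuristic rules, so the result is not always a real word; lemmatization looks words up in a vocabulary (here WordNet) and returns a dictionary form. The examples below give a general idea of how they work.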

#### Stemming

In [39]:
# some word examples from Wikipedia
# https://en.wikipedia.org/wiki/Stemming#Examples

from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# sometimes stemming returns actual words
print( porter_stemmer.stem('fish'), porter_stemmer.stem('fishing'), porter_stemmer.stem('fishes') )

fish fish fish


In [40]:
# other times it doesn't
print( porter_stemmer.stem('argue'), porter_stemmer.stem('argued'), porter_stemmer.stem('arguing') )

argu argu argu


In [41]:
# sometimes the results are not necessarily what we want
print(porter_stemmer.stem('dog'), porter_stemmer.stem('dogs'), porter_stemmer.stem('dogged'))

dog dog dog


#### Lemmatization

In [43]:
from nltk.stem import WordNetLemmatizer

wn_lemmatizer = WordNetLemmatizer()

print( wn_lemmatizer.lemmatize('fish'), wn_lemmatizer.lemmatize('fishing'), wn_lemmatizer.lemmatize('fishes') )

fish fishing fish


In [44]:
# this doesn't work well: without a part of speech, each word is treated as a noun
print(wn_lemmatizer.lemmatize('argue'), wn_lemmatizer.lemmatize('argued'), wn_lemmatizer.lemmatize('arguing'))

argue argued arguing


In [45]:
# it works much better if we provide the part of speech
print( wn_lemmatizer.lemmatize('argue', 'v'), wn_lemmatizer.lemmatize('argued', 'v'), 
       wn_lemmatizer.lemmatize('arguing', 'v') )

argue argue argue
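
The lemmatizer expects WordNet part-of-speech letters (`'n'`, `'v'`, `'a'`, `'r'`), while `pos_tag` returns Penn Treebank tags such as `VBZ` or `JJ`. A minimal sketch of one way to bridge the two is shown below; `penn_to_wordnet` is a hypothetical helper, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, used as the fallback for all other tags


converted = [penn_to_wordnet(tag) for tag in ('VBZ', 'JJ', 'RB', 'NN', 'DT')]
print(converted)
```

With this in place, each `(word, tag)` pair from `pos_tag` can be lemmatized with `wn_lemmatizer.lemmatize(word, penn_to_wordnet(tag))`.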


In [46]:
# unlike the stemmer, the lemmatizer does not reduce 'dogged' to 'dog'
print(wn_lemmatizer.lemmatize('dog'), wn_lemmatizer.lemmatize('dogs'), wn_lemmatizer.lemmatize('dogged'))

dog dog dogged
