# Analyzing Twitter Sentiment 

## Part 1: Analyzing Twitter Sentiment Using NLP

In social media, trends move at incredible speed.  A hashtag can start trending, become popular, and then die in a matter of days, or even hours.   At the forefront of social media trends is Twitter, an online social media site that allows people to write short, 140-character comments on anything from politics to sports to video games.  

The sheer volume of Twitter data makes analysis challenging, however.  Roughly 6,000 tweets are sent every second, which means that finding the latest trends is akin to looking for a needle in a haystack while getting sprayed by a firehose.   

Fortunately, there are some good libraries for dealing with Twitter data that allow you to extract meaning from this information firehose.   In this blog post, I will show you how to set up a Twitter sentiment analyzer that lets you see the sentiment and location of the latest trends in the US and around the world.   

## Table of Contents
  1. [Introduction](#1)
    1. [Necessary Libraries](#1.1)
    2. [Accessing labeled Twitter Data](#1.2)
  2. [Preprocessing the Data](#2)
    1. [Removing HTML formatting](#2.1)
    2. [Removing usernames/websites/emojis](#2.2)
    3. [Stemming and tokenizing](#2.3)
  3. [Sentiment Analysis Models](#3)
    1. [Naive Bayes](#3.1)
    2. [Logistic Regression](#3.2)
    3. [Stochastic Gradient Descent](#3.3)
  4. [Conclusions/Look Ahead](#4)


### Necessary Libraries <a id=1></a>

For this part of the tutorial we will need the following libraries:
  - [scikit-learn](http://scikit-learn.org/): popular machine learning library
  - [NLTK](http://www.nltk.org): natural language processing library
  - [re](https://docs.python.org/3/library/re.html): regular expression library
  - [pandas](https://pandas.pydata.org/): popular data analysis library
  - [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/): HTML parsing library, used later to strip HTML formatting

Apart from re, which is part of the Python standard library, these can all be installed via pip, and most of them come preinstalled with Anaconda.   

In [3]:
import sklearn
import nltk
import re
import pandas as pd

### Accessing labeled Twitter data <a id="1.2"></a>

There are several sources of labeled Twitter data. For example, Kaggle hosts a dataset of labeled tweets, and various other hand-labeled tweet datasets can be found elsewhere.   However, they all suffer from a serious flaw: all of the tweets have an easily identifiable sentiment.   This might sound like a good thing, but when you try to classify real-world data you will quickly run into the problem that most tweets don't have an easily identifiable sentiment.  Your training data will not adequately reflect your actual data.   

A better way to get both more data and data that is closer to what you will see in the real world is to scrape tweets that contain emoticons, remove the emoticons, and then label each tweet according to whether its emoticon was positive or negative.   A dataset of 1.6M tweets created using this method can be found [here](http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip).   
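The labeling idea described above can be sketched in a few lines. This is only an illustration of the technique, not the code used to build the linked dataset: the emoticon lists, the function name `label_by_emoticon`, and the choice to skip tweets with mixed or missing emoticons are all assumptions made for the example.

```python
import re

# Hypothetical emoticon lists; a real scraper would likely use larger ones.
POSITIVE = [":)", ":-)", ":D", "=)"]
NEGATIVE = [":(", ":-(", "=("]


def _pattern(emoticons):
    # Escape each emoticon so characters like ")" are matched literally.
    return re.compile("|".join(re.escape(e) for e in emoticons))


POSITIVE_RE = _pattern(POSITIVE)
NEGATIVE_RE = _pattern(NEGATIVE)


def label_by_emoticon(tweet):
    """Return (text_without_emoticons, label), or None if the tweet is ambiguous."""
    has_pos = bool(POSITIVE_RE.search(tweet))
    has_neg = bool(NEGATIVE_RE.search(tweet))
    if has_pos == has_neg:
        # Either no emoticon at all or mixed signals: skip the tweet.
        return None
    # Strip every emoticon, then collapse the leftover whitespace.
    text = NEGATIVE_RE.sub("", POSITIVE_RE.sub("", tweet))
    text = " ".join(text.split())
    return text, ("positive" if has_pos else "negative")


print(label_by_emoticon("Great game today :)"))
print(label_by_emoticon("Missed my bus :("))
```

Removing the emoticon after using it as the label matters: otherwise a classifier could simply learn to look for the emoticon itself.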

Download the corpus to your machine and extract it.  We will use the file training.1600000.processed.noemoticon.csv to train our model.  To load it into Python, we will use the pandas read_csv function.   We can print out the first rows of the dataframe by calling the .head() method.

In [5]:
dataframe = pd.read_csv('data/training.1600000.processed.noemoticon.csv', encoding='iso8859', 
                        header=None, names=['sentiment', 'id', 'time', 'query', 'user', 'text'])
dataframe.head()

|   | sentiment | id | time | query | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


## Preprocessing the data <a id=2></a>


### Removing usernames/weblinks/hashtags <a id=2.2></a>
From the dataframe, we can see that the text tends to be quite ugly.   There are username mentions, links, and HTML formatting.   None of these are useful for predicting sentiment, so we will need to remove them from the tweets.  

To do this we will use BeautifulSoup and re, the Python regular expressions library.   The regular expression used to match URLs looks somewhat complicated, so a short explanation follows it.
```
(?:(https?://)|(www\.))(?:\S+)?
```
Parentheses in regular expressions are used for grouping, and a group that starts with ```?:``` is non-capturing: it groups its contents without saving the matched text for later use.   The ```|``` operator means or, so the first part ```(?:(https?://)|(www\.))``` tells the regular expression engine to match either ```(https?://)``` or ```(www\.)```.   A question mark that does not directly follow an opening parenthesis makes the preceding element optional.  So in the first case it says match ```http://``` or ```https://```.  The alternative we want to match is ```www.```.  The ```.``` is escaped since ```.``` is a metacharacter in regular expressions.   The next part says that after matching the ```http://```, ```https://``` or ```www.``` prefix, the engine should then match ```(?:\S+)?```.   The ```\S``` character class matches any character that is not whitespace, the ```+``` operator means it must occur one or more times, and the trailing ```?``` makes the whole group optional.  Together this consumes the rest of the URL up to the next whitespace character.   
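To see the URL pattern in action, here is a small check on a few made-up strings (the example sentences and the name `URL_PATTERN` are only for illustration). Note that removing a URL leaves the surrounding spaces in place, which is harmless because the tweet is tokenized afterwards.

```python
import re

# The URL pattern discussed above, compiled once for reuse.
URL_PATTERN = re.compile(r'(?:(https?://)|(www\.))(?:\S+)?')

examples = [
    "check this out http://twitpic.com/2y1zl - Awww",
    "see www.example.com for details",
    "no links here",
]

# Replace every URL with the empty string.
cleaned = [URL_PATTERN.sub('', text) for text in examples]

for before, after in zip(examples, cleaned):
    print(repr(before), '->', repr(after))
```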

In [None]:
from bs4 import BeautifulSoup

def clean_text(tweet_text):
    """Removes URLs, usernames, hashtag symbols, and HTML formatting from a tweet
    
    Parameters:
    -----------
    tweet_text: str
    
    Returns:
    --------
    cleaned text
    """
    # remove html formatting
    cleaned_text = BeautifulSoup(tweet_text, 'html.parser').text
    # remove URLs
    cleaned_text = re.sub(r'(?:(https?://)|(www\.))(?:\S+)?', '', cleaned_text)
    # remove usernames
    cleaned_text = re.sub(r'@\w{1,15}', '', cleaned_text)
    # strip the '#' from hashtags but keep the tag text
    cleaned_text = re.sub(r'#(\S+)', r'\1', cleaned_text)
    return cleaned_text

### Stemming and tokenizing <a id=2.3></a>

To make predictions from a sentence, each sentence will first need to be split into a list of words.  Again, I will use the Python regular expression library to accomplish this task.   To match words, I will use the regular expression 

```
(?u)\b\w[\w']+\b
```   

The ```(?u)``` prefix is an inline flag that turns on Unicode matching (already the default for Python 3 strings), so ```\w``` also matches accented and non-Latin letters.    The first ```\b``` matches a word boundary, so a match can only start at the beginning of a word.   The ```\w``` character class matches a word character, which in regular expressions means any letter, any digit, or an underscore.  After this, the next part ```[\w']+``` matches one or more word characters or apostrophes (note that this setup ignores one-letter words, since every match is at least two characters long).   Finally, the closing ```\b``` specifies that the match should end at a word boundary.   

We will also want to stem the words, which means that words like run, running, and runs will all represent the same word, so they will all be stemmed to run.    This is useful in that it reduces the total number of features and instead captures the essence of what a word represents. 

To do this we will use NLTK's Porter stemmer, which provides out-of-the-box stemming.   We will package up all of these transformations in one function that will turn a sentence into a list of word stems.   
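A quick way to check what the word pattern does is to run it on a short, made-up sentence (the sentence and the name `TOKEN_PATTERN` are just for illustration):

```python
import re

# The word pattern discussed above.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w']+\b")

tokens = TOKEN_PATTERN.findall("I don't like a rainy Monday".lower())
print(tokens)  # ["don't", 'like', 'rainy', 'monday']
```

The one-letter words "i" and "a" are dropped because the pattern needs at least two characters, while the apostrophe in "don't" is kept inside the token.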

In [None]:
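from nltk.stem import PorterStemmer

def tokenize_and_stem(tweet):
    """tokenizes and stems a tweet preserving
    emoticons

    Parameters:
    -----------
    tweet: str
        contents of a given tweet

    Returns:
    --------
    list of stemmed tokens
    """
    tweet = clean_text(tweet)
    # _remove_emoticons is a helper that is not shown in this post; it is
    # expected to return the tweet without emoticons and the emoticons found
    tweet, emoticons = _remove_emoticons(tweet)
    # lowercase the tweet and split it into words
    words = re.findall(r"(?u)\b\w[\w']+\b", tweet.lower())
    porter_stemmer = PorterStemmer()
    words = [porter_stemmer.stem(word) for word in words]
    # keep the emoticons as unstemmed tokens at the end of the list
    words.extend(emoticons)
    return words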


## Sentiment analysis models <a id=3></a>

There are several methods for sentiment analysis, ranging from simple to incredibly complex.   Perhaps the simplest method is Naive Bayes.  

### Naive Bayes: <a id=3.1></a>

Naive Bayes uses Bayes' rule in conjunction with a "naive" assumption that, given the sentiment, the words in a tweet occur independently of each other.   

Assume we have a corpus where the sentiment is either positive or negative.   The method starts by first breaking a sentence up into a bag of words.  For example, the sentence "the quick brown fox jumps over the lazy dog" would become {"the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"}.   

To come up with a probability, we use Bayes' rule $$ P\left(\text{sentiment} \mid \{w_i\}\right) \propto P\left(\{w_i\} \mid \text{sentiment}\right)P(\text{sentiment}) $$ where the proportionality sign hides the normalizing factor $P(\{w_i\})$, which is the same for every sentiment.

The naive part then comes from assuming that, given the sentiment, the probability of one word occurring is completely independent of any other word occurring, which allows us to write $$ P\left(\text{sentiment} \mid \{w_i\}\right) \propto \left(\prod_iP\left(w_i \mid \text{sentiment}\right)\right)P(\text{sentiment}) $$

$P(\text{sentiment})$ is then just the fraction of tweets with the given sentiment (i.e. if 10% of our tweets are negative and 90% are positive, then $P(\text{negative})=.1$ and $P(\text{positive})=.9$).  

The simplest way of finding the probability of a word $w_i$ given a sentiment is just the frequency of that word within tweets of that sentiment.   For example, if "horrible" occurs .1% of the time in negative tweets and .005% of the time in positive tweets, then $$P(\text{"horrible"}\mid \text{negative}) = .001$$ and $$P(\text{"horrible"}\mid\text{positive}) = .00005$$

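However, this can lead to errors in our analysis.  For example, if a word never occurs in the tweets of a given sentiment, its estimated probability is zero, which forces the whole product for that sentiment to zero no matter what the other words are.  To avoid this, we can add a smoothing constant $\alpha > 0$ to every count and redefine the probability of a word given a sentiment as $$P(w \mid\text{sentiment}) = \frac{ \text{count}(w,\text{sentiment})+\alpha}{\sum_{w'}\text{count}(w', \text{sentiment})+\alpha |V|} \ , $$   where $\text{count}(w, \text{sentiment})$ is the number of times a word appears in all of the documents with the given sentiment and $|V|$ is the number of distinct words in the vocabulary.  The $\alpha |V|$ term in the denominator keeps the smoothed probabilities summing to one over the vocabulary.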
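To make the formulas concrete, here is a small from-scratch sketch of Naive Bayes on a toy corpus. It uses the standard add-$\alpha$ smoothing, in which the denominator gains $\alpha$ times the vocabulary size so that the word probabilities still sum to one. The function names and the four toy tweets are made up for the example, and the scikit-learn classifier used below is what we will actually train.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: one class label per document."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    # P(sentiment): fraction of documents carrying each label.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # P(w | sentiment) with add-alpha smoothing.
    log_likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood


def predict_proba(tokens, log_prior, log_likelihood):
    # Sum logs instead of multiplying probabilities to avoid underflow.
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c])
    # Normalize so that the class probabilities sum to one.
    max_score = max(scores.values())
    exp_scores = {c: math.exp(s - max_score) for c, s in scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


docs = [["horrible", "day"], ["great", "day"],
        ["horrible", "traffic"], ["great", "game"]]
labels = ["negative", "positive", "negative", "positive"]
log_prior, log_likelihood = train_naive_bayes(docs, labels)
probs = predict_proba(["horrible", "game"], log_prior, log_likelihood)
print(probs)
```

With $\alpha = 1$ the word "game" never appears in a negative tweet, yet its smoothed probability is 1/9 rather than zero, so the negative class is still considered.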


We can implement Naive Bayes using a CountVectorizer from sklearn to create the bag of words and sklearn's MultinomialNB classifier.   These will be chained together using a pipeline.   

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# bag of words built with our own tokenizer and stemmer
vectorizer = CountVectorizer(stop_words='english',
                             tokenizer=tokenize_and_stem)
predictor = MultinomialNB()
pipeline = Pipeline([('vectorizer', vectorizer), ('predictor', predictor)])


There are several hyperparameters we can tune to get the best results.  For one, we can require that a word occur in a minimum number (or fraction) of tweets before we consider it.  This is implemented in the CountVectorizer by the parameter min_df.  We can also exclude words which occur very often and are unlikely to increase the predictive power of our algorithm (for example, "the") using the parameter max_df.  To evaluate which combination of hyperparameters works best, we will use a grid search with 5-fold cross validation.   

Our metric for evaluation will be the negative log loss.   Log loss penalizes confident wrong predictions heavily, so maximizing the negative log loss pushes the model to output accurate probabilities of a tweet being positive or negative, rather than just the right label.    
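The effect of these two thresholds can be illustrated without scikit-learn. The sketch below treats min_df and max_df as fractions of documents, which is my understanding of how CountVectorizer interprets float values; the function name `filter_vocabulary` and the toy documents are made up for the example.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=0.0, max_df=1.0):
    """Keep terms whose document frequency (as a fraction) lies in [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, not once per occurrence.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df / n_docs <= max_df)


docs = [["the", "game", "was", "great"],
        ["the", "traffic", "was", "horrible"],
        ["the", "game", "is", "on"],
        ["the", "great", "game"]]

vocab = filter_vocabulary(docs, min_df=0.5, max_df=0.75)
print(vocab)  # ['game', 'great', 'was']
```

Here "the" is dropped for appearing in every document, and words that appear in only one document are dropped for being too rare.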
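To get a feel for the metric, here is a small hand-rolled version of binary log loss (a sketch for illustration; the function and variable names are my own, and the grid search below relies on scikit-learn's built-in 'neg_log_loss' scorer instead). Lower log loss is better, which is why the scorer negates it so that higher is better.

```python
import math


def log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood for binary labels (1 = positive, 0 = negative)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]
confident_right = log_loss(y_true, [0.9, 0.1, 0.9, 0.1])
hedged = log_loss(y_true, [0.6, 0.4, 0.6, 0.4])
confident_wrong = log_loss(y_true, [0.1, 0.9, 0.1, 0.9])

print(confident_right, hedged, confident_wrong)
```

A model that is confidently wrong is punished far more than one that hedges, which is exactly the behavior we want when we care about the predicted probabilities.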

In [3]:
# parameter names are prefixed with the pipeline step they belong to
param_grid = {'vectorizer__min_df': [0.005, .01],
              'vectorizer__max_df': [.7]}
grid_search = GridSearchCV(pipeline, param_grid,
                           scoring='neg_log_loss',
                           cv=5, n_jobs=-1)
# use the tweet text as input and the sentiment column as the label
tweets, labels = dataframe['text'], dataframe['sentiment']
grid_search.fit(tweets, labels)
cv_res = grid_search.cv_results_
results = 'params : mean score +/- std'
for score, std, params in zip(cv_res['mean_test_score'],
                              cv_res['std_test_score'],
                              cv_res['params']):
    results += '\n{}'.format(params)
    results += '\n\t{:.3f}+/-{:.3f}'.format(score, std)
    print(results.split('\n')[-1])
results += '\n\ncv_best results:'
results += '\n{}'.format(grid_search.best_params_)
results += '\n{}'.format(grid_search.best_score_)
print(grid_search.best_params_, grid_search.best_score_)


### Logistic Regression <a id=3.2></a>

We can also try another simple model for estimating probabilities: logistic regression on top of TF-IDF vectorization.   
