# Analyzing Twitter Sentiment 

## Part 1: Analyzing Twitter Sentiment Using NLP

In social media, trends move at incredible speed.  A hashtag can start trending, peak, and die out in a matter of days, or even hours.   At the forefront of social media trends is Twitter, an online social media site that lets people write short, 140-character comments on anything from politics to sports to video games.  

The sheer volume of Twitter data makes analysis challenging, however.  Roughly 6,000 tweets are sent every second, which means that finding the latest trends is akin to looking for a needle in a haystack while being sprayed by a firehose.   

Fortunately, there are some good libraries for working with Twitter data that let you extract meaning from this information firehose.   In this blog post, I will show you how to set up a Twitter sentiment analyzer that lets you see the sentiment and location of the latest trends in the US and around the world.   

## Table of Contents
  1. [Introduction](#1)
    1. [Necessary Libraries](#1.1)
    2. [Accessing labeled Twitter Data](#1.2)
  2. [Preprocessing the Data](#2)
    1. [Removing HTML formatting](#2.1)
    2. [Removing usernames/websites/emojis](#2.2)
    3. [Stemming and tokenizing](#2.3)
  3. [Sentiment Analysis Models](#3)
    1. [Naive Bayes](#3.1)
    2. [Logistic Regression](#3.2)
    3. [Stochastic Gradient Descent](#3.3)
  4. [Conclusions/Look Ahead](#4)


### Necessary Libraries <a id=1></a>

For this part of the tutorial we will need the following libraries:
  - [scikit-learn](http://scikit-learn.org/): popular machine learning library
  - [NLTK](http://www.nltk.org): natural language processing library
  - re: Python's built-in regular expression library
  - pandas: popular data analysis library
  - BeautifulSoup (the bs4 package): HTML parsing library, used later to strip HTML formatting



In [3]:
import sklearn
import nltk
import re
import pandas as pd

### Accessing labeled Twitter data <a id="1.2"></a>

There are several sources of labeled Twitter data.  For example, Kaggle hosts a dataset of labeled tweets, and various other hand-labeled tweet datasets can be found elsewhere.   However, they all suffer from a serious flaw: every tweet has an easily identifiable sentiment.   This might sound like a good thing, but when you try to classify real-world data you will quickly run into the problem that most tweets don't have an easily identifiable sentiment, so your training data will not adequately reflect your actual data.   

A better way to get both more data and data that is closer to real-world data is to scrape tweets that contain emoticons, remove the emoticons, and then label each tweet according to whether its emoticon was positive or negative.   A dataset of 1.6M tweets created using this method can be found [here](http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip).   
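The labeling idea can be sketched in a few lines. This is only an illustration, not the procedure used to build the linked dataset: the emoticon lists, the function name `label_by_emoticon`, and the 1/0 label encoding are all assumptions made for the example.

```python
import re

# Hypothetical emoticon lists, chosen only for illustration
POSITIVE_EMOTICONS = [":)", ":-)", ":D", "=)"]
NEGATIVE_EMOTICONS = [":(", ":-(", "=("]

# One pattern that matches any of the emoticons above
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS)
)


def label_by_emoticon(tweet):
    """Return (text_without_emoticons, label), or None if the tweet is unusable.

    label is 1 for positive and 0 for negative (an arbitrary encoding).
    """
    has_positive = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_negative = any(e in tweet for e in NEGATIVE_EMOTICONS)
    # Skip tweets with no emoticon at all, or with conflicting emoticons
    if has_positive == has_negative:
        return None
    # Strip the emoticons so a model cannot simply memorize them
    text = EMOTICON_PATTERN.sub("", tweet)
    # Collapse the whitespace left behind
    text = " ".join(text.split())
    return text, (1 if has_positive else 0)


print(label_by_emoticon("Finally got my tickets :)"))
print(label_by_emoticon("Missed the bus again :("))
print(label_by_emoticon("Just a plain tweet"))
```

Tweets without an emoticon are dropped rather than labeled neutral, which is why a corpus built this way contains only positive and negative examples.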

Download the corpus to your machine and extract it.  We will use the file training.1600000.processed.noemoticon.csv to train our model.  To load it into Python, we will use the pandas read_csv function.   We can print the first rows of the dataframe by calling the .head() method.

In [5]:
dataframe = pd.read_csv('data/training.1600000.processed.noemoticon.csv', encoding='iso8859', 
                        header=None, names=['sentiment', 'id', 'time', 'query', 'user', 'text'])
dataframe.head()

| | sentiment | id | time | query | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


## Preprocessing the data <a id=2></a>

From the dataframe, we can see that the text tends to be quite messy.   There are username mentions, links, and HTML formatting.   None of these are useful for predicting sentiment, so we will need to remove them from the tweets.  

To do this we will use BeautifulSoup and re, the Python regular expressions library.   The following regular expression looks somewhat complicated, so a short explanation follows.
```
(?:(https?://)|(www\.))(?:\S+)?
```
Parentheses in regular expressions are used for grouping, and a group that starts with ```?:``` is non-capturing: it groups its contents without saving the matched text for later use.   The ```|``` operator means or, so the first part ```(?:(https?://)|(www\.))``` tells the regular expression engine to match either ```(https?://)``` or ```(www\.)```.   A question mark that does not come right after an opening parenthesis makes the preceding element optional.  So the first alternative matches ```http://``` or ```https://```.  The second alternative matches ```www.```.  The ```.``` is escaped since ```.``` is a metacharacter in regular expressions.   The last part, ```(?:\S+)?```, matches the rest of the URL.   The ```\S``` character class matches any non-whitespace character, the ```+``` operator means it must occur one or more times, and the trailing ```?``` makes the whole group optional.   
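To see the pattern in action, here is a small check that uses only the standard library. The sample string is made up for illustration (the twitpic link is borrowed from the first tweet in the dataframe, and `www.example.com` is a placeholder).

```python
import re

# The URL pattern discussed above, compiled once for reuse
URL_PATTERN = re.compile(r'(?:(https?://)|(www\.))(?:\S+)?')

# Made-up sample text containing both kinds of links
sample = "Check this out http://twitpic.com/2y1zl and www.example.com today"

# group(0) is the full match, regardless of the capturing groups inside
matches = [m.group(0) for m in URL_PATTERN.finditer(sample)]
print(matches)

# Replace every match with the empty string
stripped = URL_PATTERN.sub('', sample)
print(stripped)
```

Note that removing a URL leaves the spaces on both sides of it in place, so the stripped text contains double spaces. That is harmless if the text is tokenized on whitespace afterwards.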

In [None]:
from bs4 import BeautifulSoup

def clean_text(tweet_text):
    """Removes URLs, usernames, hashtag symbols, and HTML formatting from a tweet
    
    Parameters:
    -----------
    tweet_text: str
    
    Returns:
    --------
    cleaned text
    """
    # remove html formatting
    cleaned_text = BeautifulSoup(tweet_text, 'html.parser').text
    # remove URLs
    cleaned_text = re.sub(r'(?:(https?://)|(www\.))(?:\S+)?', '', cleaned_text)
    # remove usernames
    cleaned_text = re.sub(r'@\w{1,15}', '', cleaned_text)
    # strip the leading '#' from hashtags but keep the tag text
    cleaned_text = re.sub(r'#(\S+)', r'\1', cleaned_text)
    return cleaned_text
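
The username and hashtag substitutions from `clean_text` can also be checked in isolation. The sketch below uses only `re`, and the sample tweet is invented for illustration (only the `@switchfoot` mention comes from the dataset).

```python
import re

# Invented sample tweet with one mention and two hashtags
sample = "@switchfoot loving the new album #music #live"

# Usernames: '@' followed by 1 to 15 word characters
no_users = re.sub(r'@\w{1,15}', '', sample)

# Hashtags: drop the '#' but keep the tag text via the \1 backreference
no_hash = re.sub(r'#(\S+)', r'\1', no_users)

print(no_hash.strip())
```

Keeping the hashtag text is a deliberate choice here: a tag such as `#love` or `#fail` often carries sentiment of its own, whereas a username rarely does.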