# Analyzing Twitter Sentiment 

## Part 1: Analyzing Twitter Sentiment Using NLP

In social media, trends move at incredible speed. A hashtag can start trending, peak, and die out in a matter of days, or even hours. At the forefront of these trends is Twitter, a social media site that lets people post short, 140-character comments on anything from politics to sports to video games.

The sheer volume of Twitter data makes analysis challenging, however. Roughly 6,000 tweets are sent every second, which means that finding the latest trends is akin to looking for a needle in a haystack while getting sprayed by a firehose.

Fortunately, there are some good libraries for dealing with Twitter data that let you extract meaning from this information firehose. In this blog post, I will show you how to set up a Twitter sentiment analyzer that lets you see the sentiment and location of the latest trends in the US and around the world.

## Table of Contents
  1. [Introduction](#1)
    1. [Necessary Libraries](#1.1)
    2. [Accessing labeled Twitter Data](#1.2)
  2. [Preprocessing the Data](#2)
    1. [removing html formatting](#2.1)
    2. [removing usernames/websites/emojis](#2.2)
    3. [stemming and tokenizing](#2.3)
  3. [Sentiment Analysis Models](#3)
    1. [Naive Bayes](#3.1)
    2. [Logistic Regression](#3.2)
    3. [Stochastic Gradient Descent](#3.3)
  4. [Conclusions/Look Ahead](#4)


### Necessary Libraries <a id=1></a>

For this part of the tutorial, we will need the following libraries:
  - [scikit-learn](http://scikit-learn.org/): popular machine learning library
  - [NLTK](http://www.nltk.org): natural language processing library
  - re: Python's built-in regular expression library
  - pandas: popular data analysis library



```python
import sklearn
import nltk
import re
import pandas
```
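If any of the imports above fails, it can help to see which library is missing before going further. The following is a minimal sketch of such a check, not part of the original tutorial: the helper name `check_libraries` and the `REQUIRED` mapping are hypothetical, and the pip distribution names are assumed to be `scikit-learn`, `nltk`, and `pandas`. It relies only on the standard library, so it runs even when nothing else is installed.

```python
import importlib.util
from importlib import metadata

# Hypothetical mapping: import name -> pip distribution name.
# None marks a module that ships with Python itself.
REQUIRED = {
    "sklearn": "scikit-learn",
    "nltk": "nltk",
    "re": None,
    "pandas": "pandas",
}


def check_libraries(required=REQUIRED):
    """Return {import_name: status} for each required module.

    status is a version string, "stdlib" for built-in modules,
    "unknown" if the version cannot be determined, or None if the
    module cannot be found at all.
    """
    status = {}
    for module_name, dist_name in required.items():
        if importlib.util.find_spec(module_name) is None:
            # The module is not importable in this environment.
            status[module_name] = None
        elif dist_name is None:
            status[module_name] = "stdlib"
        else:
            try:
                status[module_name] = metadata.version(dist_name)
            except metadata.PackageNotFoundError:
                status[module_name] = "unknown"
    return status


if __name__ == "__main__":
    for name, version in check_libraries().items():
        print(f"{name}: {version or 'MISSING'}")
```

Any library reported as `MISSING` needs to be installed before the rest of the tutorial will run.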

### Accessing labeled Twitter Data <a id="1.2"></a>

There are several sources of labeled Twitter data: Kaggle, for example, hosts a dataset of labeled tweets, and various other hand-labeled tweet datasets can be found elsewhere. However, they all share a serious flaw: every tweet in them has an easily identifiable sentiment. This might sound like a good thing, but when you try to classify real-world data you quickly run into the problem that most tweets don't have an easily identifiable sentiment, so your training data will not adequately reflect your actual data.

A better idea is to get more data, and data that more closely resembles what you will see in the real world.