### Twitter sentiment analysis using Python and NLTK
#### The purpose of this implementation is to automatically classify a tweet's sentiment as positive or negative.

The classifier is trained on a list of manually classified tweets: 5 positive and 5 negative.

Positive tweets:

I love this car.
This view is amazing.
I feel great this morning.
I am so excited about the concert.
He is my best friend.

Negative tweets:

I do not like this car.
This view is horrible.
I feel tired this morning.
I am not looking forward to the concert.
He is my enemy.

In the full implementation, 600 positive tweets and 600 negative tweets are used to train the classifier. 
The tweets are stored in a Redis DB. 
Even with those numbers, it is quite a small sample. You should use a much larger set if you want good results.

Next is a test set used to assess the accuracy of the trained classifier.

Test tweets:

I feel happy this morning. -positive.
Larry is my friend. -positive.
I do not like that man. -negative.
My house is not great. -negative.
Your song is annoying. -negative.

### Implementation

In [4]:
# Create a list containing the positive tweets:
pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

In [5]:
# Create a list containing the negative tweets:
neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative')]

In [6]:
# Combine both lists into a single list of tuples, each containing two elements.
# The first element is a list of the words and the second is the sentiment label.
# Words shorter than 3 characters are dropped and everything is lowercased.

tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))

In [7]:
# The list of tweets now looks like this:
tweets

[(['love', 'this', 'car'], 'positive'),
 (['this', 'view', 'amazing'], 'positive'),
 (['feel', 'great', 'this', 'morning'], 'positive'),
 (['excited', 'about', 'the', 'concert'], 'positive'),
 (['best', 'friend'], 'positive'),
 (['not', 'like', 'this', 'car'], 'negative'),
 (['this', 'view', 'horrible'], 'negative'),
 (['feel', 'tired', 'this', 'morning'], 'negative'),
 (['not', 'looking', 'forward', 'the', 'concert'], 'negative'),
 (['enemy'], 'negative')]

In [8]:
# Finally, we add the list with the test tweets:
test = [('I feel happy this morning', 'positive'),
        ('Larry is my friend', 'positive'),
        ('I do not like that man', 'negative'),
        ('My house is not great', 'negative'),
        ('Your song is annoying', 'negative')]

In [107]:
test_tweets = []
for (words, sentiment) in test:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    test_tweets.append((words_filtered, sentiment))

In [106]:
test_tweets

[(['feel', 'happy', 'this', 'morning'], 'positive'),
 (['larry', 'friend'], 'positive'),
 (['not', 'like', 'that', 'man'], 'negative'),
 (['house', 'not', 'great'], 'negative'),
 (['your', 'song', 'annoying'], 'negative')]
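
The training and test lists go through the same filtering step, so the loop could be factored into one helper. The following is a minimal sketch; the helper name `tokenize_tweets` and its `min_length` parameter are assumptions made for illustration and do not appear in the original notebook.

```python
def tokenize_tweets(labeled_tweets, min_length=3):
    """Lowercase each tweet, split on whitespace, and drop short words.

    `tokenize_tweets` and `min_length` are illustrative names, not part of NLTK.
    """
    tokenized = []
    for text, sentiment in labeled_tweets:
        words = [w.lower() for w in text.split() if len(w) >= min_length]
        tokenized.append((words, sentiment))
    return tokenized


sample = [('I love this car', 'positive'), ('He is my enemy', 'negative')]
print(tokenize_tweets(sample))
```

With such a helper, `tweets` and `test_tweets` would both come from a single call each, which keeps the filtering rule in one place.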

### Classifier
The list of word features needs to be extracted from the tweets. It is a list of every distinct word in the training tweets; with NLTK 3 on Python 3, FreqDist.keys() returns the words in order of first appearance rather than by frequency. The list is built with the two helper functions below.

In [101]:
import nltk

In [103]:
def get_word_in_tweets(tweets):
    # Flatten the tokenized tweets into a single list of words
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

In [104]:
def get_word_features(wordlist):
    # FreqDist counts each word; its keys are the distinct words
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

# Call the helpers only after both are defined
word_features = get_word_features(get_word_in_tweets(tweets))

In [105]:
word_features

dict_keys(['love', 'this', 'car', 'view', 'amazing', 'feel', 'great', 'morning', 'excited', 'about', 'the', 'concert', 'best', 'friend', 'not', 'like', 'horrible', 'tired', 'looking', 'forward', 'enemy'])
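
To rank the words by frequency instead, `most_common()` can be used. The sketch below relies on `collections.Counter` from the standard library, which `nltk.FreqDist` subclasses in NLTK 3, so the same call should work on a `FreqDist`. The shortened word list and the name `features_by_frequency` are assumptions made for illustration.

```python
from collections import Counter

# Words from the first three training tweets, already filtered and lowercased
words = ['love', 'this', 'car',
         'this', 'view', 'amazing',
         'feel', 'great', 'this', 'morning']

counts = Counter(words)

# most_common() sorts by descending count; ties keep first-seen order
features_by_frequency = [word for word, _ in counts.most_common()]
print(features_by_frequency[0], counts['this'])
```

For the feature extractor used here the order does not matter, since every word becomes a key in a dictionary. Ordering matters only if the vocabulary is later truncated to the most frequent words.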

To create a classifier, we need to decide what features are relevant. To do that, we first need a feature extractor. The one we are going to use returns a dictionary indicating which of the known words are contained in the input. Here, the input is the tweet. We use the word features list defined above along with the input to create the dictionary.

In [56]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = word in document_words
    return features

In [79]:
extract_features(['love', 'this', 'car'])

{'contains(love)': True,
 'contains(this)': True,
 'contains(car)': True,
 'contains(view)': False,
 'contains(amazing)': False,
 'contains(feel)': False,
 'contains(great)': False,
 'contains(morning)': False,
 'contains(excited)': False,
 'contains(about)': False,
 'contains(the)': False,
 'contains(concert)': False,
 'contains(best)': False,
 'contains(friend)': False,
 'contains(not)': False,
 'contains(like)': False,
 'contains(horrible)': False,
 'contains(tired)': False,
 'contains(looking)': False,
 'contains(forward)': False,
 'contains(enemy)': False}

With our feature extractor, we can build the training set using the method apply_features. We pass the feature extractor along with the tweets list defined above.

In [99]:
training_set = nltk.classify.apply_features(extract_features, tweets)

The variable ‘training_set’ contains the labeled feature sets. It is a list of tuples, with each tuple containing the feature dictionary and the sentiment string for one tweet. The sentiment string is also called the ‘label’.

In [112]:
training_set

[({'contains(love)': True, 'contains(this)': True, 'contains(car)': True, 'contains(view)': False, 'contains(amazing)': False, 'contains(feel)': False, 'contains(great)': False, 'contains(morning)': False, 'contains(excited)': False, 'contains(about)': False, 'contains(the)': False, 'contains(concert)': False, 'contains(best)': False, 'contains(friend)': False, 'contains(not)': False, 'contains(like)': False, 'contains(horrible)': False, 'contains(tired)': False, 'contains(looking)': False, 'contains(forward)': False, 'contains(enemy)': False}, 'positive'), ({'contains(love)': False, 'contains(this)': True, 'contains(car)': False, 'contains(view)': True, 'contains(amazing)': True, 'contains(feel)': False, 'contains(great)': False, 'contains(morning)': False, 'contains(excited)': False, 'contains(about)': False, 'contains(the)': False, 'contains(concert)': False, 'contains(best)': False, 'contains(friend)': False, 'contains(not)': False, 'contains(like)': False, 'contains(horrible)': Fa
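
`apply_features` builds this list lazily, computing each feature dictionary only when it is accessed, which saves memory on large corpora. For a set this small, an eager list comprehension yields the same (features, label) pairs. The sketch below restates a reduced `word_features` list and `extract_features` so that it runs on its own; the two-tweet data set and the name `training_set_eager` are illustrative assumptions.

```python
# Reduced vocabulary and data, for illustration only
word_features = ['love', 'this', 'car', 'not', 'like']
tweets = [(['love', 'this', 'car'], 'positive'),
          (['not', 'like', 'this', 'car'], 'negative')]


def extract_features(document):
    document_words = set(document)
    return {'contains(%s)' % word: word in document_words
            for word in word_features}


# Eager equivalent of nltk.classify.apply_features(extract_features, tweets)
training_set_eager = [(extract_features(words), label)
                      for words, label in tweets]

for features, label in training_set_eager:
    print(label, features)
```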

Now that we have our training set, we can train our classifier.

In [119]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

The Naive Bayes classifier uses the prior probability of each label, which is the frequency of each label in the training set, and the contribution from each feature. In our case, the frequency of each label is the same for ‘positive’ and ‘negative’. The word ‘amazing’ appears in 1 of 5 of the positive tweets and none of the negative tweets. This means that the likelihood of the ‘positive’ label will be multiplied by roughly 0.2 when this word is seen as part of the input; the exact factor differs slightly because the default estimator smooths the counts.
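
The arithmetic can be checked by hand. The sketch below computes the prior and the likelihood of `contains(amazing)` being True from raw counts, first unsmoothed and then with expected likelihood estimation, which adds 0.5 to every count and is what `ELEProbDist` does. The helper `ele_prob` is a hypothetical name written for this example, and it assumes the feature has exactly two possible values, True and False.

```python
def ele_prob(count, total, bins, gamma=0.5):
    """Expected likelihood estimate: add `gamma` to each of `bins` outcomes."""
    return (count + gamma) / (total + gamma * bins)


n_pos, n_neg = 5, 5                 # training tweets per label
amazing_pos, amazing_neg = 1, 0     # tweets containing 'amazing' per label

# Prior: both labels are equally frequent
prior_pos = n_pos / (n_pos + n_neg)

# Unsmoothed relative frequency, the 0.2 quoted in the text
raw_pos = amazing_pos / n_pos

# Smoothed estimates over the two outcomes True / False
smoothed_pos = ele_prob(amazing_pos, n_pos, bins=2)
smoothed_neg = ele_prob(amazing_neg, n_neg, bins=2)

print(prior_pos, raw_pos, smoothed_pos, round(smoothed_neg, 4))
```

Under these assumptions the smoothed likelihood for the ‘positive’ label comes out at 0.25 rather than 0.2, and the ‘negative’ label gets a small non-zero value instead of 0, so a single unseen word cannot rule a label out entirely.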

Let’s take a look inside the classifier train method in the source code of the NLTK library. ‘label_probdist’ is the prior probability of each label and ‘feature_probdist’ is the feature/value probability dictionary. Those two probability objects are used to create the classifier.

In [153]:
from nltk.probability import ELEProbDist, FreqDist, DictionaryProbDist

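In [186]:
# Simplified excerpt of NaiveBayesClassifier.train; most of the counting code is omitted
def train(labeled_featuresets, estimator=ELEProbDist):
    # Count how often each label occurs in the training set
    label_freqdist = FreqDist(label for (featureset, label) in labeled_featuresets)
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution (the loop that fills it is omitted)
    feature_probdist = {}
    return nltk.NaiveBayesClassifier(label_probdist, feature_probdist)

# The classifier trained above keeps both distributions as private attributes
print(classifier._label_probdist.prob('positive'))
print(classifier._label_probdist.prob('negative'))
print(classifier._feature_probdist[('negative', 'contains(best)')].prob(True))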

In our case, the probability of each label is 0.5 as we can see below. label_probdist is of type ELEProbDist.