In [1]:
import os
import csv
import nltk
import math
import tweepy
from nltk.tokenize import RegexpTokenizer

import sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))

import main
import tweet_processing
import classifier

api = "Dummy" # We don't need the actual API here so just pass a dummy value in

sa = main.SentimentAnalysis(False, api, 'bayes', 1, 10)

## Training and using a Naive Bayesian Classifier

The next step in the sentiment analysis process is to train a Naive Bayesian Classifier using the chosen training corpus. First we'll go into a bit of detail about what exactly a Naive Bayesian Classifier is. If you want more information, see the [Wikipedia article](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

### What is a Naive Bayesian Classifier?
Naive Bayesian Classifiers are a group of classification algorithms based on [Bayes' theorem](http://en.wikipedia.org/wiki/Bayes%27_theorem). The main idea behind this classifier is the assumption that, given the class, the presence of each feature is completely independent of the presence of any other feature.
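
Written out, the independence assumption lets the classifier score each class as a simple product. The notation here is an assumption made for illustration ($C$ for the class, $f_1, \dots, f_n$ for the features); it does not appear elsewhere in this notebook:

$$
P(C \mid f_1, \dots, f_n) \;\propto\; P(C) \prod_{i=1}^{n} P(f_i \mid C)
$$

The class with the largest value on the right-hand side is the prediction.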

The best way to try and explain how this classifier works is by way of an example.
If we have a piece of fruit, we can consider that this piece of fruit is an orange if it has the following 3 properties:
* Orange in colour
* Round in shape
* Around 8cm in diameter

A Naive Bayesian Classifier aiming to determine if a piece of fruit is an orange will consider that each of these 3 properties or "features" contributes independently to the probability that the piece of fruit is an orange.
Like most things, Naive Bayes has its pros and cons. It is fast and easily trained. However, the assumption that every feature is independent doesn't always hold, which is where the "naive" part of its name comes from.
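
To make the fruit example concrete, here is a minimal sketch of the calculation. Every probability in it is a made-up value chosen purely for illustration, and the names `priors`, `likelihoods` and `posterior` are hypothetical; none of them come from this project.

```python
# Hypothetical numbers, chosen only to illustrate the calculation.
priors = {"orange": 0.5, "other": 0.5}
likelihoods = {
    "orange": {"orange_colour": 0.9, "round": 0.95, "about_8cm": 0.7},
    "other": {"orange_colour": 0.1, "round": 0.4, "about_8cm": 0.2},
}

def posterior(observed):
    """Multiply the prior by each feature's likelihood, then normalise."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            # Each feature contributes independently (the "naive" assumption)
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = posterior(["orange_colour", "round", "about_8cm"])
print(result)
```

With these invented numbers, the three features each push the score towards "orange", so that label ends up with by far the larger share of the probability.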

### Training the classifier
Now that we have a better understanding of how the Naive Bayesian Classifier works, it is time to train it using a training corpus of preclassified data.

The first step is to transform each entry in the training corpus into a set of features by separating each entry word by word. Below you can see an example of this step using a small sample corpus, along with the output it gives.

In [2]:
positive_tweets = [["happy :)", "positive"],["i am very excited!", "positive"],["yay lol, exciting!", "positive"], ["happy time, i have cake!", "positive"], ["this is amazing! :)", "positive"]]
negative_tweets = [["no i am so sad", "negative"], ["my cat just died :(", "negative"], ["Just crashed my car whoops!", "negative"], ["Late for my new job arghh", "negative"], ["I feel like crying!", "negative"]]        

tweets = positive_tweets + negative_tweets

def format_sentence(sent): 
    tokenizer = RegexpTokenizer(r'\w+')
    return {word: True for word in set(tokenizer.tokenize(sent))}

count = 0
for entry in tweets:
    text = entry[0]
    sentiment = entry[1]
    entry = [format_sentence(text), sentiment]
    tweets[count] = entry
    count += 1
    print(entry)

[{'happy': True}, 'positive']
[{'excited': True, 'i': True, 'am': True, 'very': True}, 'positive']
[{'lol': True, 'yay': True, 'exciting': True}, 'positive']
[{'happy': True, 'time': True, 'have': True, 'cake': True, 'i': True}, 'positive']
[{'amazing': True, 'this': True, 'is': True}, 'positive']
[{'i': True, 'sad': True, 'am': True, 'no': True, 'so': True}, 'negative']
[{'my': True, 'died': True, 'cat': True, 'just': True}, 'negative']
[{'Just': True, 'my': True, 'car': True, 'crashed': True, 'whoops': True}, 'negative']
[{'Late': True, 'job': True, 'my': True, 'new': True, 'for': True, 'arghh': True}, 'negative']
[{'crying': True, 'like': True, 'I': True, 'feel': True}, 'negative']
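
Note that 'Just' and 'just' come out as two different features in the output above, because the tokenizer is case-sensitive. One optional variant, sketched here as an assumption rather than as what this notebook does, is to lowercase the text before tokenizing. The name `format_sentence_lower` is hypothetical, and `re.findall` is used in place of `RegexpTokenizer` only to keep the sketch free of dependencies.

```python
import re

def format_sentence_lower(sent):
    """Like format_sentence, but case-insensitive (hypothetical variant)."""
    # Lowercase first so that 'Just' and 'just' map to the same feature
    return {word: True for word in re.findall(r"\w+", sent.lower())}

print(format_sentence_lower("Just crashed my car whoops!"))
```

If a variant like this were used, it would have to be applied both when training and when classifying, and the probabilities shown in the rest of this notebook would change.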


Once we have generated the features for each entry, all that is left is to use them to train the classifier. In code terms this is as straightforward as calling the `train` method with our tweets.

In [3]:
classifier = nltk.NaiveBayesClassifier.train(tweets)

We can represent the overall process in terms of a diagram:


![Bayes diagram](https://www.laurentluce.com/images/blog/nltk/overview.png)
<center>[(Image Source)](https://www.tweetsentiment.co.uk/static/images/bayes.png)</center>


In our example above, the word features and feature extraction steps are combined into a single step, which in this case is the `format_sentence` function defined earlier.

The classifier uses the prior probability of each label, which is the relative frequency of that label in the training set, together with the contribution that each feature provides.
In our case the prior is the same for 'positive' and 'negative'. The word 'amazing' appears in 1 of the 5 positive tweets and in none of the negative tweets, a raw frequency of 0.2 for the 'positive' label. NLTK smooths this estimate (the `ELEProbDist` objects printed further down add 0.5 to each count), so the likelihood of the 'positive' label is actually multiplied by (1 + 0.5) / (5 + 1) = 0.25 when this word is seen in the analysis input.
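
The numbers involved can be checked by hand. The sketch below assumes that NLTK's default estimator is expected likelihood estimation (add 0.5 to every count) and that each word feature has two possible values (present or absent). The helper `ele_estimate` is a hypothetical name written for this illustration, not an NLTK function.

```python
def ele_estimate(count, total, bins=2, gamma=0.5):
    """Expected likelihood estimate: add gamma to every bin's count."""
    return (count + gamma) / (total + gamma * bins)

n_positive, n_negative = 5, 5

# Prior: relative frequency of the label in the training set
prior_positive = n_positive / (n_positive + n_negative)

# 'amazing' appears in 1 of the 5 positive tweets and 0 of the 5 negative ones
raw_positive = 1 / n_positive
smoothed_positive = ele_estimate(1, n_positive)
smoothed_negative = ele_estimate(0, n_negative)

print("Prior:", prior_positive)
print("Raw frequency:", raw_positive)
print("Smoothed, positive:", smoothed_positive)
print("Smoothed, negative:", smoothed_negative)
```

Smoothing keeps a word that was never seen with a label from forcing that label's probability to exactly zero.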

### Classifying tweets
Now that the classifier is trained, we can use it to classify tweets pulled from Twitter. Let's try it on a simple example tweet, "My hamster just died":

In [4]:
tweet = "My hamster just died"
print('"' + tweet + '"')
predict = classifier.classify(tweet_processing.format_sentence(tweet))
print("Sentiment:", predict)

"My hamster just died"
Sentiment: negative


We can also get the positive and negative sentiment probabilities from the classifier for each classification call.

If the difference between the two values is small (less than 10 percentage points), we say the tweet has neutral sentiment instead.

In [5]:
dist = classifier.prob_classify(tweet_processing.format_sentence(tweet))

# Look up the probability of each label directly
positive = dist.prob('positive')
negative = dist.prob('negative')

# Treat the tweet as neutral when the two values are within 10 points
if abs(positive - negative) < 0.1:
    predict = "neutral"

# Format as whole-number percentages
positive = '{:.0%}'.format(positive)
negative = '{:.0%}'.format(negative)
print("Positive:", positive)
print("Negative:", negative)

Positive: 10%
Negative: 90%
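
The neutral rule can also be wrapped in a small helper so that it is written only once. This is a sketch under the assumption that there are exactly two labels; the function name `sentiment_with_neutral` and its `margin` parameter are hypothetical, not part of this project.

```python
def sentiment_with_neutral(positive, negative, margin=0.1):
    """Return 'neutral' when the two probabilities are within `margin`."""
    if abs(positive - negative) < margin:
        return "neutral"
    return "positive" if positive > negative else "negative"

print(sentiment_with_neutral(0.1, 0.9))
print(sentiment_with_neutral(0.5, 0.5))
```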


When it comes to classifying a tweet, the first step is to break the tweet down into its features in the same way that we did for our training corpus earlier. This gives us the features of the tweet, which we can then pass to the classifier.

The next step is to take the logarithm of each label's prior probability. In our case the prior of each label (positive and negative) is 0.5, and $\log_{2}(0.5) = -1$, so after this step our set of log probabilities looks like:

```python
{'positive': -1.0, 'negative': -1.0}
```
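
A likely reason for working with logarithms is numerical stability: multiplying many small probabilities quickly underflows to zero, while adding their logarithms does not. The sketch below illustrates this with invented values (100 probabilities of 1e-5 each) that are not taken from the classifier.

```python
import math

# The log of the prior used above
log_prior = math.log2(0.5)
print("log2(0.5) =", log_prior)

# Invented example: 100 small probabilities
probs = [1e-5] * 100

# Multiplying them directly underflows to 0.0
product = 1.0
for p in probs:
    product *= p

# Summing their logs stays a finite, comparable number
log_sum = sum(math.log2(p) for p in probs)

print("Product:", product)
print("Sum of logs:", log_sum)
```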

In [6]:
print("Positive:", classifier._label_probdist.prob('positive'))
print("Negative:", classifier._label_probdist.prob('negative'))

Positive: 0.5
Negative: 0.5


Next we add the log probability of each feature given the label to this set. For each label, we go through the tweet's feature set and add the log probability of each feature to that label's running total. For example, for the feature name 'died' with the feature value 'True', the probability given the label 'negative' in our classifier is 0.25, which is -2 as a base-2 logarithm.

In [7]:
print(str(classifier._feature_probdist)[0:225] + "...\n") # Print a small section from the start for demo purposes

print("Probability of feature 'died' given label 'negative':", classifier._feature_probdist[('negative', 'died')].prob(True))

print("...converted to a log value: ", math.log(classifier._feature_probdist[('negative', 'died')].prob(True), 2))

{('positive', 'lol'): <ELEProbDist based on 5 samples>, ('negative', 'whoops'): <ELEProbDist based on 5 samples>, ('positive', 'died'): <ELEProbDist based on 5 samples>, ('negative', 'just'): <ELEProbDist based on 5 samples>,...

Probability of feature 'died' given label 'negative': 0.25
...converted to a log value:  -2.0


Once we have done this for every label, we have a dictionary of log probabilities, one per label, and the label with the greatest value is the prediction. In this case it is 'negative', indicating that the classifier thinks the tweet has negative sentiment.
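
The whole calculation for "My hamster just died" can be reproduced by hand. This sketch rests on three assumptions: only 'just' and 'died' match features seen in training ('My' differs in case from 'my', and 'hamster' never appears), unseen features are ignored, and the probabilities are smoothed by adding 0.5 to each count. The helper `ele` and the `counts` table are hypothetical names written for this illustration.

```python
import math

def ele(count, total, bins=2, gamma=0.5):
    """Smoothed estimate: add gamma to every bin's count (assumed estimator)."""
    return (count + gamma) / (total + gamma * bins)

# Number of training tweets per label that contain each known feature
counts = {
    "positive": {"just": 0, "died": 0},
    "negative": {"just": 1, "died": 1},
}
tweets_per_label = 5
log_prior = math.log2(0.5)

# Start from the log prior and add each feature's log probability
log_scores = {}
for label, feature_counts in counts.items():
    score = log_prior
    for feature, count in feature_counts.items():
        score += math.log2(ele(count, tweets_per_label))
    log_scores[label] = score

# Convert back from logs and normalise so the values sum to 1
total = sum(2 ** score for score in log_scores.values())
posterior = {label: 2 ** score / total for label, score in log_scores.items()}

print(log_scores)
print(posterior)
```

Under these assumptions the result agrees with the 10% / 90% split printed by the classifier earlier.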


If we then use this classifier on the tweet "You're annoying", it returns 'neutral' even though the tweet is clearly negative. This is because none of the words in the tweet, including 'annoying', appear in our training set, so the classifier has nothing to go on except the prior probabilities of 50% each. For this reason a larger, well-chosen training set is highly desirable for accurate sentiment analysis.

In [8]:
tweet = "You're annoying"
print('"' + tweet + '"')
dist = classifier.prob_classify(tweet_processing.format_sentence(tweet))
# Look up the probability of each label directly
positive = dist.prob('positive')
negative = dist.prob('negative')

predict = classifier.classify(tweet_processing.format_sentence(tweet))

# Fall back to neutral when the two values are within 10 points
if abs(positive - negative) < 0.1:
    predict = "neutral"

# Format as whole-number percentages
positive = '{:.0%}'.format(positive)
negative = '{:.0%}'.format(negative)
print("Sentiment: " + predict)
print("Positive:", positive)
print("Negative:", negative)

"You're annoying"
Sentiment: neutral
Positive: 50%
Negative: 50%
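
One way to see why this happens is to check which words of a tweet the training corpus has seen at all. The sketch below rebuilds the vocabulary from the sample tweets used earlier. It assumes case-sensitive `\w+` tokenization, and the helper `known_tokens` is a hypothetical name, not part of this project.

```python
import re

training_texts = [
    "happy :)", "i am very excited!", "yay lol, exciting!",
    "happy time, i have cake!", "this is amazing! :)",
    "no i am so sad", "my cat just died :(", "Just crashed my car whoops!",
    "Late for my new job arghh", "I feel like crying!",
]

# Every distinct word seen in training (case-sensitive)
vocabulary = set()
for text in training_texts:
    vocabulary.update(re.findall(r"\w+", text))

def known_tokens(tweet):
    """Return the words of the tweet that also occur in the training corpus."""
    return [token for token in re.findall(r"\w+", tweet) if token in vocabulary]

print(known_tokens("My hamster just died"))
print(known_tokens("You're annoying"))
```

An empty list means the classifier has no evidence beyond the priors, which is the situation for "You're annoying".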
