The twitter_samples corpus contains 3 files.

1. negative_tweets.json: contains 5k negative tweets
2. positive_tweets.json: contains 5k positive tweets
3. tweets.20150430-223406.json: contains 20k positive and negative tweets

In [None]:
from nltk.corpus import twitter_samples
print (twitter_samples.fileids())

In [None]:
pos_tweets = twitter_samples.strings('positive_tweets.json')
print (len(pos_tweets),"positive tweets present")

In [None]:
neg_tweets = twitter_samples.strings('negative_tweets.json')
print (len(neg_tweets),"negative tweets present")

In [None]:
all_tweets = twitter_samples.strings('tweets.20150430-223406.json')
print (len(all_tweets),"positive and negative tweets present") 

In [None]:
for tweet in pos_tweets[:5]:
    print (tweet)

**Tokenize Tweets**

NLTK provides a TweetTokenizer class that does a good job of tokenizing tweets (splitting the text into a list of tokens).
Three parameters can be passed when constructing a TweetTokenizer:
1. preserve_case: if False, the tweet is converted to lowercase; if True, the original case is kept.
2. strip_handles: if True, Twitter handles (e.g. @user) are removed from the tweet; if False, they are kept.
3. reduce_len: if True, runs of three or more repeated characters are shortened to three, e.g. hurrayyyy becomes hurrayyy; if False, words are left unchanged.
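
As a rough illustration of what the three options do, the sketch below approximates them with plain string operations. It is not NLTK's implementation: `approximate_normalize` is a hypothetical helper, the handle pattern `@\w+` is a simplifying assumption, and the text is split on whitespace only, whereas the real tokenizer also separates punctuation and emoticons.

```python
import re

def approximate_normalize(text, preserve_case=True, strip_handles=False, reduce_len=False):
    """Hypothetical stand-in that mimics the three TweetTokenizer options."""
    if strip_handles:
        # assumption: a handle is '@' followed by word characters
        text = re.sub(r'@\w+', '', text)
    if reduce_len:
        # cap any run of three or more identical characters at three
        text = re.sub(r'(.)\1{2,}', r'\1\1\1', text)
    if not preserve_case:
        text = text.lower()
    # whitespace split only; a real tweet tokenizer is more careful
    return text.split()

tokens = approximate_normalize(
    "@paresh Hurrayyyyy what a day",
    preserve_case=False, strip_handles=True, reduce_len=True,
)
print(tokens)  # ['hurrayyy', 'what', 'a', 'day']
```

With all three options left at the defaults used in this sketch, the text is only split on whitespace and otherwise unchanged.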

In [None]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
for tweet in pos_tweets[:5]:
    print (tweet_tokenizer.tokenize(tweet))

**Cleaning Tweet**

In the tweet cleaning process, we will do the following:

1. Remove stock market tickers like $GE
2. Remove retweet text “RT”
3. Remove hyperlinks
4. Remove hashtags (only the hashtag # and not the word)
5. Remove stop words like a, and, the, is, are, etc.
6. Remove emoticons like :), :D, :(, :-), etc.
7. Remove punctuation like full-stop, comma, exclamation sign, etc.
8. Convert words to Stem/Base words using Porter Stemming Algorithm. E.g. words like ‘working’, ‘works’, and ‘worked’ will be converted to their base/stem word “work”.

We will define a function named clean_tweets which returns a list of cleaned (by removing the above-mentioned things) words for any given tweet.
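
To see what the regex-based steps (1 to 4) do on their own, here is a small sketch that applies them one at a time to a made-up tweet and prints each intermediate string. The sample text and the helper name `strip_noise` are illustrative assumptions, and the URL pattern used here stops at the first whitespace character.

```python
import re

def strip_noise(tweet):
    """Apply the four regex steps in order and return every intermediate string."""
    stages = [tweet]
    # 1. stock market tickers such as $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    stages.append(tweet)
    # 2. old-style retweet marker at the start of the tweet
    tweet = re.sub(r'^RT\s+', '', tweet)
    stages.append(tweet)
    # 3. hyperlinks (assumption: a URL ends at the next whitespace)
    tweet = re.sub(r'https?://\S+', '', tweet)
    stages.append(tweet)
    # 4. the '#' sign only; the hashtag word is kept
    tweet = re.sub(r'#', '', tweet)
    stages.append(tweet)
    return stages

sample = "RT @user $GE is up today! #stocks http://example.com/x more text"
stages = strip_noise(sample)
for stage in stages:
    print(repr(stage))
```

The removals leave extra spaces behind, which is harmless because the tokenizer splits on whitespace afterwards.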

In [None]:
import string
import re

from nltk.corpus import stopwords 
stopwords_english = stopwords.words('english')

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

from nltk.tokenize import TweetTokenizer

# Happy Emoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])

# Sad Emoticons
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
    ])


# all emoticons (happy + sad)
emoticons = emoticons_happy.union(emoticons_sad)

def clean_tweets(tweet):
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
 
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
 
    # remove hyperlinks
    # match only the URL itself (up to the next whitespace) so that any
    # text following the link is kept
    tweet = re.sub(r'https?://\S+', '', tweet)
    
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
 
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
 
    tweets_clean = []    
    for word in tweet_tokens:
        if (word not in stopwords_english and # remove stopwords
              word not in emoticons and # remove emoticons
                word not in string.punctuation): # remove punctuation
            #tweets_clean.append(word)
            stem_word = stemmer.stem(word) # stemming word
            tweets_clean.append(stem_word)
 
    return tweets_clean

custom_tweet = "RT @Twitter @paresh Hello There! Have a great day. :) #good #morning http://google.co.in"

# print cleaned tweet
print (clean_tweets(custom_tweet))

In [None]:
print (pos_tweets[5])

In [None]:
print (clean_tweets(pos_tweets[5]))

**Feature Extraction**

We define a simple bag_of_words function that extracts unigram features from the tweets.

In [None]:
# feature extractor function
def bag_of_words(tweet):
    words = clean_tweets(tweet)
    words_dictionary = {word: True for word in words}
    return words_dictionary

custom_tweet = "RT @Twitter @paresh Hello There! Have a great day. :) #good #morning http://google.co.in"
print (bag_of_words(custom_tweet))
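
The features are stored as a dictionary that maps each word to True. The sketch below applies the same construction to an already-cleaned token list so that it runs without the corpus; `bag_of_words_from_tokens` and the token list are illustrative assumptions. Repeated words collapse into a single key, so the features record whether a word is present, not how often it occurs.

```python
def bag_of_words_from_tokens(tokens):
    """Map every distinct token to True (presence, not frequency)."""
    return {word: True for word in tokens}

tokens = ['good', 'morn', 'good', 'day']
features = bag_of_words_from_tokens(tokens)
print(features)  # {'good': True, 'morn': True, 'day': True}
```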

In [None]:
# positive tweets feature set
pos_tweets_set = []
for tweet in pos_tweets:
    pos_tweets_set.append((bag_of_words(tweet), 'pos'))    
 
# negative tweets feature set
neg_tweets_set = []
for tweet in neg_tweets:
    neg_tweets_set.append((bag_of_words(tweet), 'neg'))
 
print (len(pos_tweets_set), len(neg_tweets_set)) 

**Create Train and Test Set**

There are 5000 positive tweets and 5000 negative tweets in the feature sets. We take 20% (i.e. 1000) of the positive tweets and 20% (i.e. 1000) of the negative tweets as the test set. The remaining 4000 positive and 4000 negative tweets form the training set.
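
If a repeatable split is preferred, one option is to shuffle with a fixed seed. The sketch below is an assumption layered on top of this tutorial: `split_by_fraction`, the seed value, and the use of `range(5000)` as stand-in data are all illustrative.

```python
import random

def split_by_fraction(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private generator, global state untouched
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]  # (train, test)

train_part, test_part = split_by_fraction(range(5000))
print(len(train_part), len(test_part))  # 4000 1000
```

Calling the function twice with the same seed returns the same split, so accuracy figures can be compared between runs.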



In [None]:
# randomize pos_tweets_set and neg_tweets_set
# doing so will output a different accuracy result every time we run the program
from random import shuffle 
shuffle(pos_tweets_set)
shuffle(neg_tweets_set)
 
test_set = pos_tweets_set[:1000] + neg_tweets_set[:1000]
train_set = pos_tweets_set[1000:] + neg_tweets_set[1000:]
 
print("Size of the test set is", len(test_set))
print("Size of the training set is", len(train_set))

**Training Classifier and Calculating Accuracy**

We train a Naive Bayes classifier on the training set and calculate its classification accuracy on the test set.
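
To give an intuition for what training and accuracy involve, here is a minimal pure-Python sketch of a Naive Bayes classifier over word-presence features. It is a simplified illustration and not NLTK's implementation: the function names, the toy data, and the add-one smoothing are assumptions made for this sketch, and unseen words are simply ignored.

```python
import math
from collections import Counter, defaultdict

def train_counts(labeled_featuresets):
    """Count how often each label and each (label, word) pair occurs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict_label(model, features):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log P(label)
        for word in features:
            if word not in vocabulary:
                continue  # assumption: words never seen in training are skipped
            # P(word present | label) with add-one smoothing
            score += math.log((word_counts[label][word] + 1) / (count + 2))
        scores[label] = score
    return max(scores, key=scores.get)

def sketch_accuracy(model, labeled_featuresets):
    """Fraction of examples whose predicted label equals the true label."""
    hits = sum(predict_label(model, f) == label for f, label in labeled_featuresets)
    return hits / len(labeled_featuresets)

toy_train = [
    ({'love': True, 'great': True}, 'pos'),
    ({'happi': True, 'great': True}, 'pos'),
    ({'hate': True, 'bad': True}, 'neg'),
    ({'sad': True, 'bad': True}, 'neg'),
]
toy_test = [({'great': True, 'film': True}, 'pos'), ({'bad': True}, 'neg')]

model = train_counts(toy_train)
print(predict_label(model, {'great': True, 'film': True}))  # pos
print(sketch_accuracy(model, toy_test))  # 1.0
```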

In [None]:
from nltk import classify
from nltk import NaiveBayesClassifier
 
classifier = NaiveBayesClassifier.train(train_set)
 
accuracy = classify.accuracy(classifier, test_set)
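print("The accuracy is", accuracy) # e.g. 0.753; the value changes between runs because the data is shuffled
 
# show_most_informative_features prints its table directly and returns None,
# so it is not wrapped in print()
classifier.show_most_informative_features(10)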

**Testing Classifier with Custom Tweet**

We provide a custom negative tweet and a custom positive tweet and check the classification output of the trained classifier. The classifier correctly predicts both tweets provided.

In [None]:
custom_tweet = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_tweet_set = bag_of_words(custom_tweet)
print ("The custom tweet is ",classifier.classify(custom_tweet_set)) # Output: neg
# Negative tweet correctly classified as negative
 
# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("neg")) 
print (prob_result.prob("pos")) 

In [None]:
custom_tweet = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_tweet_set = bag_of_words(custom_tweet)
 
print ("The custom tweet is ",classifier.classify(custom_tweet_set)) # Output: pos
# Positive tweet correctly classified as positive
 
# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: pos
print (prob_result.prob("neg")) 
print (prob_result.prob("pos")) 

**Precision, Recall & F1-Score**

**Accuracy** is (correctly predicted observations) / (total observations).

**Precision** is about being precise.
– It shows what fraction of the predictions made for a class were correct.
– For example, out of 100 questions, if you answered only 1 question and answered it correctly, then you have 100% precision.
– It’s about checking how often the classifier is right when it predicts “yes”.

**Recall** (as opposed to precision)
– is about answering all questions that have the answer “true” with the answer “true”.
– It’s about checking how often the classifier predicts “yes” when the result is actually “yes”.

**F1 Score or F-measure**: the harmonic mean of precision and recall.

We need both “true” answers and “false” answers to calculate precision and recall.

For mathematical representation of precision and recall, we need to understand the following:

**True Positive (TP)**: e.g. the number of patients who did have cancer whom we correctly diagnosed as having cancer

**True Negative (TN)**: e.g. the number of patients who did not have cancer whom we correctly diagnosed as not having cancer

**False Positive (FP)**: e.g. the number of patients who did not have cancer whom we incorrectly diagnosed as having cancer (also known as Type I error)

**False Negative (FN)**: e.g. the number of patients who did have cancer whom we incorrectly diagnosed as not having cancer (also known as Type II error)

**Accuracy** = (TP + TN) / (TP + TN + FP + FN)

**Precision** = (TP) / (TP + FP)

**Recall** = (TP) / (TP + FN)

**F1 Score** = 2 * (precision * recall) / (precision + recall)
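
As a worked example of these four formulas, the sketch below plugs in the counts reported in the Confusion Matrix section of this notebook (TP = 737, TN = 727, FP = 273, FN = 263). The function name `classification_metrics` is an illustrative assumption, and the counts come from a single run, so other runs will give slightly different values.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from the four confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * (prec * rec) / (prec + rec)
    return acc, prec, rec, f1

acc, prec, rec, f1 = classification_metrics(tp=737, tn=727, fp=273, fn=263)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# accuracy=0.732 precision=0.730 recall=0.737 f1=0.733
```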

In [None]:
from collections import defaultdict
 
actual_set = defaultdict(set)
predicted_set = defaultdict(set)
 
actual_set_cm = []
predicted_set_cm = []
 
for index, (feature, actual_label) in enumerate(test_set):
    actual_set[actual_label].add(index)
    actual_set_cm.append(actual_label)
 
    predicted_label = classifier.classify(feature)
 
    predicted_set[predicted_label].add(index)
    predicted_set_cm.append(predicted_label)
    
from nltk.metrics import precision, recall, f_measure, ConfusionMatrix
 
print ('pos precision:', precision(actual_set['pos'], predicted_set['pos'])) 
print ('pos recall:', recall(actual_set['pos'], predicted_set['pos'])) 
print ('pos F-measure:', f_measure(actual_set['pos'], predicted_set['pos'])) 
 
print ('neg precision:', precision(actual_set['neg'], predicted_set['neg'])) 
print ('neg recall:', recall(actual_set['neg'], predicted_set['neg'])) 
print ('neg F-measure:', f_measure(actual_set['neg'], predicted_set['neg'])) 
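
The cell above passes sets of test-set indices, not label lists, to the metric functions. The sketch below shows the set arithmetic behind that idea: precision divides the overlap by the size of the predicted set, and recall divides it by the size of the actual set. The helper names and the tiny index sets are illustrative assumptions, and the empty-set case is not handled here.

```python
def set_precision(reference, test):
    """Share of predicted items that are also in the reference set."""
    return len(reference & test) / len(test)

def set_recall(reference, test):
    """Share of reference items that were also predicted."""
    return len(reference & test) / len(reference)

actual_pos = {0, 1, 2, 3}      # indices whose true label is 'pos'
predicted_pos = {2, 3, 4}      # indices the classifier labeled 'pos'

print(set_precision(actual_pos, predicted_pos))  # 2 of 3 predictions are right
print(set_recall(actual_pos, predicted_pos))     # 0.5
```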

**Confusion Matrix**

A confusion matrix is a table that describes the performance of a classifier by counting each combination of actual and predicted label.
It is laid out in the following format:


In [None]:
'''
           |   Predicted NO      |   Predicted YES     |
-----------+---------------------+---------------------+
Actual NO  | True Negative (TN)  | False Positive (FP) |
Actual YES | False Negative (FN) | True Positive (TP)  |
-----------+---------------------+---------------------+
'''

The confusion matrix output shows the following performance of our trained classifier (the exact counts change between runs because the data is shuffled):

1. 727 negative tweets were correctly classified as negative (TN)
2. 273 negative tweets were incorrectly classified as positive (FP)
3. 263 positive tweets were incorrectly classified as negative (FN)
4. 737 positive tweets were correctly classified as positive (TP)
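
Counts like these can be tallied by pairing each actual label with its predicted label. Here is a minimal sketch using collections.Counter; the helper name `confusion_counts` and the five-item label lists are illustrative assumptions, not data from the corpus.

```python
from collections import Counter

def confusion_counts(actual_labels, predicted_labels):
    """Count every (actual, predicted) label pair."""
    return Counter(zip(actual_labels, predicted_labels))

actual = ['neg', 'neg', 'pos', 'pos', 'pos']
predicted = ['neg', 'pos', 'pos', 'neg', 'pos']
counts = confusion_counts(actual, predicted)

print("TN:", counts[('neg', 'neg')])  # TN: 1
print("FP:", counts[('neg', 'pos')])  # FP: 1
print("FN:", counts[('pos', 'neg')])  # FN: 1
print("TP:", counts[('pos', 'pos')])  # TP: 2
```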

In [None]:
# Confusion Matrix for the test set
# 
# Output: 
# row = actual_set_cm 
# column = predicted_set_cm
cm = ConfusionMatrix(actual_set_cm, predicted_set_cm)
print (cm)

In [None]:
print (cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

**Thank you**