# Tweets Classification using Naive Bayes

## Naive Bayes Classifier
This is a simple (naive) classification method based on Bayes' rule. It relies on a very simple representation of the document, called the bag-of-words representation.
Imagine we have two classes (positive and negative), and our input is the text of a movie review. We want to know whether the review is positive or negative. So we may have a bag of positive words (e.g. love, amazing, hilarious, great) and a bag of negative words (e.g. hate, terrible).


We may then count the number of times each of those words appears in the document, in order to classify the document as positive or negative.
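The counting idea above can be sketched in a few lines. This is only an illustration: the two word bags are the examples from the text, the helper name `count_sentiment_words` is made up for this sketch, and the review is split on whitespace rather than properly tokenized.

```python
# Hypothetical word bags, taken from the examples in the text above.
POSITIVE_WORDS = {"love", "amazing", "hilarious", "great"}
NEGATIVE_WORDS = {"hate", "terrible"}

def count_sentiment_words(document):
    """Return (positive_count, negative_count) for a whitespace-split document."""
    words = document.lower().split()
    pos = sum(word in POSITIVE_WORDS for word in words)
    neg = sum(word in NEGATIVE_WORDS for word in words)
    return pos, neg

review = "I love this movie the cast is great but the ending is terrible"
pos_count, neg_count = count_sentiment_words(review)
print(pos_count, neg_count)  # 2 positive words, 1 negative word
```

A review with more positive than negative hits would be labeled positive under this crude rule. Naive Bayes refines it by weighting each word with a probability instead of a fixed bag membership.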

This technique works well for topic classification; say we have a set of academic papers, and we want to classify them into different topics (computer science, biology, mathematics).

### Bayes’ Rule applied to Tweets and Classes

* For a tweet $d$ and a class $c$, and using Bayes’ rule,

$$P( c | d ) = \frac{P( d | c )  P( c )} {P( d )}$$


**What do we mean by the term $P( d | c )$?**

Let’s represent the tweet as a set of features (words or tokens) $\{x_1, x_2, x_3, \ldots \}$

We can then rewrite $P( d | c )$ as:
$$P( d | c ) = P( x_1, x_2, x_3, \ldots , x_n | c )$$

**What about $P( c )$? How do you calculate it?**

$P( c )$ is the prior probability of a class: how often does this class occur overall?


E.g., in the case of classes positive and negative, we would be calculating the probability that any given review is positive or negative without actually analyzing the current input document.

**Do you need to calculate $P( d )$?**  
Since $P( d )$ is the same for every class, it does not affect which class has the highest probability. We can drop the denominator and simply compare the numerators:

$$P( c | d ) \propto P( d | c )  P( c ) $$

### Maximum a Posteriori (MAP) Hypothesis
$$c = \arg\max_{c\in C} P( d | c )  P( c ) $$
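As a small numeric illustration of the MAP rule, the sketch below picks the class with the largest $P( d | c ) P( c )$ and checks that dividing by $P( d )$ does not change the winner. All numbers are invented for this example.

```python
# Toy numbers, invented for illustration only.
priors = {"positive": 0.7, "negative": 0.3}         # P(c)
likelihoods = {"positive": 0.02, "negative": 0.04}  # P(d | c) for one fixed tweet d

# Unnormalized posterior P(d | c) * P(c) for each class
scores = {c: likelihoods[c] * priors[c] for c in priors}

# P(d) is the sum of the numerators; dividing by it gives the true posterior
p_d = sum(scores.values())
posterior = {c: scores[c] / p_d for c in scores}

c_map = max(scores, key=scores.get)
print(c_map, posterior)
```

Normalizing rescales every score by the same constant, so the arg max is unchanged.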

Under what conditions is the Maximum Likelihood (ML) detection rule the same as the MAP rule?

## Sentiment Analysis
In this exercise, you will be using Naive Bayes for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one.

1. Train a Naive Bayes model on a sentiment analysis task
2. Test using your model
3. Compute ratios of positive words to negative words
4. Do some error analysis
5. Predict on your own tweet

In [None]:
import numpy as np

## Natural Language Toolkit
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

In [None]:
!pip install nltk

### Example
Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. These tokens are very useful for finding patterns and are considered as a base step for stemming and lemmatization.

In [None]:
import nltk
nltk.download('punkt') # This tokenizer divides a text into a list of sentences


In [None]:
sentence = """At eight o'clock on Thursday morning ... Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
tokens

### NLTK Stopword List
Stopwords are words that are very common in a language, such as “the”, “of”, and “to”, and that generally carry little information for classification.

In [None]:
nltk.download('stopwords')
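A minimal sketch of how a stopword list is typically applied follows. The helper `remove_stopwords` and the tiny hand-written list are assumptions made for this example so that it runs without any download. In the notebook, the list would more likely come from `nltk.corpus.stopwords`.

```python
def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stop word collection."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Tiny hand-written stand-in for the NLTK list (an assumption for this sketch).
# After nltk.download('stopwords'), one could instead use:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words('english'))
stop_words = {"the", "of", "to", "a", "on", "is", "and"}

tokens = ["The", "cat", "sat", "on", "the", "mat"]
filtered = remove_stopwords(tokens, stop_words)
print(filtered)
```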

## Import the Data
Download the sample tweets from the NLTK package:

In [None]:
nltk.download('twitter_samples')

In [None]:
from nltk.corpus import twitter_samples

This will import three datasets from NLTK that contain various tweets to train and test the model:

* negative_tweets.json: 5000 tweets with negative sentiments
* positive_tweets.json: 5000 tweets with positive sentiments
* tweets.20150430-223406.json: 20000 tweets with no sentiment labels

Next, create variables for positive_tweets and negative_tweets:

In [None]:
# get the sets of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

In [None]:
# avoid assumptions about the length of all_positive_tweets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

## Process the Data
For any machine learning project, once you've gathered the data, the first step is to process it to make useful inputs to your model.

1. Eliminate handles and URLs
2. Tokenize the string into words. 
3. Remove stop words like "and, is, a, on, etc."
4. Stem each word, i.e. convert it to its stem. For example, dancer, dancing, and danced all become 'danc'. You can use the Porter stemmer to take care of this. 
5. Convert all your words to lower case. 
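To make the steps concrete, here is a rough standard-library sketch of steps 1, 2, 3 and 5. It is not the `process_tweet` function used in this notebook: the name `simple_process_tweet`, the regular expressions, and the short stop word set are all assumptions, stemming (step 4) is left out, and emoticons are discarded by the simple tokenizer.

```python
import re

# Illustrative subset only; a real list would be much longer.
STOP_WORDS = {"and", "is", "a", "on", "the", "to", "of", "i", "am"}

def simple_process_tweet(tweet):
    """Rough stand-in for steps 1, 2, 3 and 5; stemming (step 4) is omitted."""
    tweet = re.sub(r"https?://\S+", "", tweet)      # step 1: remove URLs
    tweet = re.sub(r"@\w+", "", tweet)              # step 1: remove handles
    tokens = re.findall(r"[a-z']+", tweet.lower())  # steps 5 and 2: lowercase, then tokenize
    return [tok for tok in tokens if tok not in STOP_WORDS]  # step 3: drop stop words

cleaned = simple_process_tweet(
    "@user I am LOVING the new phone https://example.com and the camera"
)
print(cleaned)
```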

In [None]:
custom_tweet = test_pos[5]
# print tweet
print(custom_tweet)

The function `process_tweet()` does this for you.

In [None]:
from utils import process_tweet
# print cleaned tweet
print(process_tweet(custom_tweet))

## Feature Extraction

**Feature extraction** refers to the process of transforming raw data into numerical features that can be processed while preserving the information in the original data set. It often yields better results than applying machine learning directly to the raw data.

What would be your guess as to which features are suitable to represent text documents? 

* Assign a real number to each word in the English dictionary and replace each text with the corresponding number. 
* Create a list of possible words and compare it with the words in each of your texts. You will end up with a feature vector with zeros and ones whose size corresponds to the number of possible words.
* Count how many times each word from the texts occurs in each category (positive and negative), and then add these numbers for each of your texts in each category.  
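The second option in the list (a vector of zeros and ones over a fixed word list) could look roughly like this. The function name `binary_features` and the four-word vocabulary are invented for the sketch.

```python
def binary_features(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocabulary = ["happy", "sad", "tired", "great"]  # illustrative word list
vector = binary_features("I am happy and not tired", vocabulary)
print(vector)
```

The vector length equals the vocabulary size, so with a realistic vocabulary it becomes long and mostly zeros. That is one motivation for the frequency-based approach described in the third option.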



### Feature Extraction with Frequencies
You have to encode each tweet as a 3-dimensional vector. To do so, you create a dictionary that maps each pair of a word and the class it appeared in (positive or negative) to the number of times that word appeared in that class.

#### Example

In [None]:
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]

| Vocabulary  | PosFreq  | NegFreq  |
|---|---|---|
| happi   | 1  | 0  |
| trick | 0  | 1  |
|  sad |  0 |  1 |
|  tire | 0  | 2  |

In [None]:
def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that will be used to map each pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            # define the key, which is the word and label tuple
            pair = (word,y)

            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1

            # else, if the key is new, add it to the dictionary and set the count to 1
            else:
                result[pair] = 1
    return result

In [None]:
result = {}
freqs = count_tweets(result, tweets, ys)
print(freqs)

Define a `lookup` function to get the positive or negative frequency of a specific word.

In [None]:
def lookup(freqs, word, label):
    '''
    Input:
        freqs: a dictionary with the frequency of each pair (or tuple)
        word: the word to look up
        label: the label corresponding to the word
    Output:
        n: the number of times the word with its corresponding label appears.
    '''
    # dict.get returns 0 when the (word, label) pair was never counted
    return freqs.get((word, label), 0)

In [None]:
word = 'happi'
label = 0
lookup(freqs, word, label)

## Naive Bayes Classification
$$D_{NB} = \arg \max _{D_j \in \{D_{neg}, D_{pos} \}} P(D_{j}) \prod_{i=1}^m P(W_{i}|D_{j})\tag{3}$$

To do inference, you can check whether the following ratio exceeds 1, in which case the tweet is classified as positive: 
$$\frac {P(D_{pos})}{P(D_{neg})} \prod_{i=1}^m \frac {P(W_{i}|D_{pos})}{ P(W_{i}|D_{neg})} > 1 $$

As $m$ gets larger, the product of many small probabilities can underflow numerically, so we take the $\log$, which gives the following equation: 

$$\log \frac {P(D_{pos})}{P(D_{neg})} + \sum_{i=1}^m  \log \frac {P(W_{i}|D_{pos})}{ P(W_{i}|D_{neg})} > 0$$
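The numerical problem can be seen directly with a small experiment. The per-word probabilities below are hypothetical values chosen for illustration. With 100 words, both class products collapse to exactly `0.0` in double precision, while the log sums remain finite and comparable.

```python
import math

# Hypothetical per-word probabilities P(W_i | D_j) for a 100-word document.
probs_pos = [1e-5] * 100
probs_neg = [2e-5] * 100

def product(values):
    """Multiply the values together the naive way."""
    result = 1.0
    for v in values:
        result *= v
    return result

prod_pos = product(probs_pos)  # true value is about 1e-500, below the float range
prod_neg = product(probs_neg)
print(prod_pos, prod_neg)      # both print 0.0, so they cannot be compared

log_pos = sum(math.log(v) for v in probs_pos)
log_neg = sum(math.log(v) for v in probs_neg)
print(log_pos, log_neg)        # finite values that can still be compared
```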

#### Prior and Logprior:
The prior probability represents the underlying probability in the target population that a tweet is positive versus negative. In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".


To train a Naive Bayes classifier:
- The first step in training a Naive Bayes classifier is to identify the number of classes that you have.
- You will then estimate a probability for each class:
  $P(D_{pos})$ is the probability that the document is positive, and
  $P(D_{neg})$ is the probability that the document is negative.

Use the following formulas and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.

In [None]:
# Calculate D, the number of documents
D = len(train_y)

In [None]:
# Calculate D_pos, the number of positive documents 
D_pos = int(np.sum(train_y > 0))
print("a priori P(Dpos) = ", D_pos/D)

In [None]:
# Calculate D_neg, the number of negative documents
D_neg = int(np.sum(train_y <= 0))
print("a priori P(Dneg) = ", D_neg/D)

$$  \text{Logprior} =  \log \frac {P(D_{pos})}{P(D_{neg})} = \log(P(D_{pos})) - \log(P(D_{neg}))  $$

In [None]:
# log(D_pos / D) - log(D_neg / D) simplifies to log(D_pos) - log(D_neg) because D cancels
logprior = np.log(D_pos) - np.log(D_neg)
print("logprior = %0.2f" % logprior)
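The cell above works with raw counts rather than probabilities. A quick check with hypothetical, deliberately unbalanced counts shows that the two forms agree. The variable names with the `_demo` suffix are made up for this sketch.

```python
import math

# Hypothetical, deliberately unbalanced counts for illustration.
D_pos_demo, D_neg_demo = 3000, 1000
D_demo = D_pos_demo + D_neg_demo

from_probabilities = math.log(D_pos_demo / D_demo) - math.log(D_neg_demo / D_demo)
from_counts = math.log(D_pos_demo) - math.log(D_neg_demo)

print(from_probabilities, from_counts)  # both equal log(3)
```

With the balanced split used in this notebook (4000 positive and 4000 negative training tweets), the logprior is 0, so the prior does not favor either class.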

#### Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of words (counting repeats) across all positive tweets and across all negative tweets, respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$P(W|D_{pos}) =  P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$P(W|D_{neg}) = P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$
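Applied to the small example table from the Feature Extraction section, the smoothed formulas give the values below. The `toy_` names are introduced only for this sketch. The `+ 1` in the numerator keeps unseen words from getting probability zero, and the `+ V` in the denominator makes the probabilities within each class sum to 1.

```python
# Frequencies from the small example table earlier in this notebook:
# word -> (count in positive tweets, count in negative tweets)
toy_freqs = {"happi": (1, 0), "trick": (0, 1), "sad": (0, 1), "tire": (0, 2)}

V_toy = len(toy_freqs)                                 # 4 unique words
N_pos_toy = sum(pos for pos, _ in toy_freqs.values())  # 1 positive word in total
N_neg_toy = sum(neg for _, neg in toy_freqs.values())  # 4 negative words in total

# Equations (4) and (5) applied to every word
p_pos = {w: (pos + 1) / (N_pos_toy + V_toy) for w, (pos, _) in toy_freqs.items()}
p_neg = {w: (neg + 1) / (N_neg_toy + V_toy) for w, (_, neg) in toy_freqs.items()}

print(p_pos["happi"], p_neg["happi"])  # 0.4 0.125
print(sum(p_pos.values()), sum(p_neg.values()))
```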

##### Create `freqs` dictionary
- Given your `count_tweets` function, you can compute a dictionary called `freqs` that contains all the frequencies.
- In this `freqs` dictionary, the key is the tuple (word, label)
- The value is the number of times it has appeared.

In [None]:
# Build the freqs dictionary for later uses
freqs = count_tweets({}, train_x, train_y)

You can compute the number of unique words that appear in the `freqs` dictionary to get your $V$.

In [None]:
# calculate V, the number of unique words in the vocabulary
vocab = set([pair[0] for pair in freqs.keys()])
V = len(vocab)
print("The number of unique words in the vocabulary =", V)

Using the `freqs` dictionary, you can also compute the total number of positive words and the total number of negative words, $N_{pos}$ and $N_{neg}$.


In [None]:
# calculate N_pos and N_neg
N_pos = N_neg = 0
for pair in freqs.keys():
    # if the label is positive (greater than zero)
    if pair[1] > 0:

        # Increment the number of positive words by the count for this (word, label) pair
        N_pos += freqs[pair]

    # else, the label is negative
    else:

        # increment the number of negative words by the count for this (word,label) pair
        N_neg += freqs[pair]

print("N_pos = ", N_pos)
print("N_neg = ", N_neg)

You can iterate over each word in the vocabulary and use your `lookup` function to get the positive frequency, $freq_{pos}$, and the negative frequency, $freq_{neg}$, for that specific word.
- Compute the positive probability of each word, $P(W_{pos})$, and the negative probability of each word, $P(W_{neg})$, using equations 4 and 5.

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

You can then compute the log likelihood of a specific word:

$$\text{loglikelihood} = \log \left( \frac{P(W_{pos})}{P(W_{neg})} \right) \tag{6}$$
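As a worked instance, take the small example table from the Feature Extraction section, where $V = 4$, $N_{pos} = 1$ and $N_{neg} = 4$. For the word *happi* ($freq_{pos} = 1$, $freq_{neg} = 0$):

$$P(W_{pos}) = \frac{1 + 1}{1 + 4} = 0.4, \qquad P(W_{neg}) = \frac{0 + 1}{4 + 4} = 0.125, \qquad \log \left( \frac{0.4}{0.125} \right) = \log 3.2 \approx 1.16$$

For the word *tire* ($freq_{pos} = 0$, $freq_{neg} = 2$):

$$P(W_{pos}) = \frac{0 + 1}{1 + 4} = 0.2, \qquad P(W_{neg}) = \frac{2 + 1}{4 + 4} = 0.375, \qquad \log \left( \frac{0.2}{0.375} \right) \approx -0.63$$

A positive value pushes the prediction toward the positive class, and a negative value pushes it toward the negative class.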

In [None]:
# the log likelihood term of your Naive Bayes equation, one entry per word
loglikelihood = {}
# For each word in the vocabulary...
for word in vocab:
    # get the positive and negative frequency of the word
    freq_pos = lookup(freqs,word,1)
    freq_neg = lookup(freqs,word,0)

    # calculate the probability that each word is positive, and negative
    p_w_pos = (freq_pos + 1) / (N_pos + V)
    p_w_neg = (freq_neg + 1) / (N_neg + V)
    
    # calculate the log likelihood of the word
    loglikelihood[word] = np.log(p_w_pos/p_w_neg)

In [None]:
print(len(loglikelihood))

## Test your naive bayes
We can test the Naive Bayes classifier by making predictions on some tweets!

In [None]:
my_tweet = 'she smiled and was happy.'
word_l = process_tweet(my_tweet)
print(word_l)

In [None]:
# initialize probability to logprior
p = logprior

for word in word_l:
    # check if the word exists in the loglikelihood dictionary
    if word in loglikelihood:
        # add the log likelihood of that word to the probability
        p += loglikelihood[word]

print('The predicted score is', p)

In [None]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number, the log of the ratio of the class priors
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of the log likelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)
    '''
    # process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # initialize probability to logprior
    p = logprior

    for word in word_l:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]

    return p

In [None]:
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    p = naive_bayes_predict(tweet, logprior, loglikelihood)
    print(f'{tweet} -> {p:.2f}')

In [None]:
your_tweet = 'you are sad and not happy :('
naive_bayes_predict(your_tweet,logprior, loglikelihood)

In [None]:
your_tweet = 'you are sad :('
naive_bayes_predict(your_tweet,logprior, loglikelihood)
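To turn these scores into a test-set accuracy, one possible approach is to threshold each score at 0 and compare the result with the labels. The helper `accuracy_from_scores` is a hypothetical name, not part of this notebook, and the demo scores are invented stand-ins for `naive_bayes_predict` outputs.

```python
def accuracy_from_scores(scores, labels):
    """Fraction of examples where (score > 0) matches the 0/1 label."""
    predictions = [1 if s > 0 else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical scores, standing in for naive_bayes_predict outputs.
demo_scores = [2.3, -1.1, 0.4, -0.2]
demo_labels = [1, 0, 0, 0]
demo_accuracy = accuracy_from_scores(demo_scores, demo_labels)
print(demo_accuracy)  # 0.75: three of the four predictions match
```

In the notebook, the scores could be collected with `[naive_bayes_predict(t, logprior, loglikelihood) for t in test_x]` and compared against `test_y`.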

## Filter words by Ratio of positive to negative counts
Some words have more positive counts than others, and can be considered "more positive". Likewise, some words can be considered more negative than others.

In [None]:
word = "bad"
pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}

In [None]:
# use lookup() to find positive counts for the word (denoted by the integer 1)
pos_neg_ratio['positive'] = lookup(freqs,word,1)

In [None]:
# use lookup() to find negative counts for the word (denoted by integer 0)
pos_neg_ratio['negative'] = lookup(freqs,word,0)

In [None]:
pos_neg_ratio['ratio'] = (pos_neg_ratio['positive'] + 1)/(pos_neg_ratio['negative'] + 1)

In [None]:
print(pos_neg_ratio)
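The same steps can be wrapped into helpers that compute the ratio for any word and keep only the words beyond a threshold. This is a sketch under assumptions: the function names `get_ratio` and `get_words_by_threshold`, the threshold semantics, and the toy counts are all invented for illustration.

```python
def get_ratio(freqs, word):
    """Positive count, negative count, and smoothed ratio for one word."""
    pos = freqs.get((word, 1), 0)
    neg = freqs.get((word, 0), 0)
    return {'positive': pos, 'negative': neg, 'ratio': (pos + 1) / (neg + 1)}

def get_words_by_threshold(freqs, label, threshold):
    """label=1 keeps words with ratio >= threshold; label=0 keeps ratio <= threshold."""
    selected = {}
    for word in {w for w, _ in freqs}:
        stats = get_ratio(freqs, word)
        if label == 1 and stats['ratio'] >= threshold:
            selected[word] = stats
        elif label == 0 and stats['ratio'] <= threshold:
            selected[word] = stats
    return selected

# Invented counts keyed by (word, label), in the same shape as `freqs`.
toy_freqs = {('happi', 1): 9, ('happi', 0): 1,
             ('sad', 1): 0, ('sad', 0): 7,
             ('movi', 1): 3, ('movi', 0): 3}

very_positive = get_words_by_threshold(toy_freqs, label=1, threshold=2)
very_negative = get_words_by_threshold(toy_freqs, label=0, threshold=0.5)
print(very_positive)
print(very_negative)
```

Here `happi` has ratio 5.0 and `sad` has ratio 0.125, while `movi` sits at 1.0 and is filtered out in both directions.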