## Week 2: Naive Bayes

The first part of this is on your CVP computer.



### Conditional Probability

It's easier to predict the weather given that it is California and winter than to predict it without any other information.

The conditional probability that a tweet is positive given that it contains the word "happy" is equal to the probability that a tweet is both positive and contains "happy" (the intersection), divided by the probability that a tweet contains "happy".

$$ P(Positive | ``happy") = \frac{P(Positive \cap ``happy")}{P(``happy")} $$

![](vennConditionalProb.PNG)

Similarly, we could write the probability of a tweet containing the word "happy" given that the tweet is positive as:

$$ P(``happy" | Positive) = \frac{P(``happy" \cap Positive)}{P(Positive)} $$

If we want to write the probability that a tweet is positive given that it contains "happy" in terms of the probability that a tweet contains "happy" given that it is positive, we can take advantage of the fact that $ P(``happy" \cap Positive) $ is the same as $ P(Positive \cap ``happy") $. Solving the second equation for the intersection and substituting it into the first gives:

$$ P(Positive | ``happy") = P(``happy" | Positive) * \frac{P(Positive)}{P(``happy")} $$

This is **Bayes' Rule**:

$$ P(X|Y) = P(Y|X) * \frac{P(X)}{P(Y)} $$
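As a quick numerical check of Bayes' Rule, the sketch below computes $P(Positive | ``happy")$ two ways: directly from the definition of conditional probability, and through the rule. The tweet counts and variable names are made up for illustration and do not come from the course material.

```python
# Hypothetical corpus counts, chosen only for illustration.
n_tweets = 20               # total tweets
n_positive = 13             # tweets labeled positive
n_happy = 4                 # tweets containing the word "happy"
n_positive_and_happy = 3    # tweets that are positive AND contain "happy"

# Basic probabilities estimated from the counts.
p_positive = n_positive / n_tweets
p_happy = n_happy / n_tweets
p_both = n_positive_and_happy / n_tweets

# Direct definition: P(Positive | happy) = P(Positive and happy) / P(happy)
p_positive_given_happy = p_both / p_happy

# Bayes' Rule: P(Positive | happy) = P(happy | Positive) * P(Positive) / P(happy)
p_happy_given_positive = p_both / p_positive
p_via_bayes = p_happy_given_positive * p_positive / p_happy

print(p_positive_given_happy, p_via_bayes)
```

Both routes should agree up to floating-point rounding, since the rule is only a rearrangement of the two conditional probability definitions.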

## Naive Bayes For Sentiment Analysis: Introduction

It's called naive because it assumes that the features used for classification are all independent of one another, which is rarely the case in reality. To build a classifier, we start by computing the conditional probability of each word in the vocabulary given each class:

![](CondProbNaiveBayes.PNG)



Dividing the number of occurrences of each word in a class by the total number of words in that class gives the following table of probabilities:

![](condProbs.PNG)
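A minimal sketch of how such a table could be computed from raw counts. The counts below are hypothetical and may differ from the figures above; `counts` and `cond_probs` are names introduced only for this example.

```python
# Hypothetical word counts per class: word -> (count in Positive, count in Negative).
# Illustrative only; these may differ from the figures above.
counts = {
    "I": (3, 3), "am": (3, 3), "happy": (2, 1), "because": (1, 0),
    "learning": (1, 1), "NLP": (1, 1), "sad": (1, 2), "not": (1, 2),
}

# Total number of words in each class.
n_pos = sum(pos for pos, _ in counts.values())
n_neg = sum(neg for _, neg in counts.values())

# P(word | class) = freq(word, class) / N_class
cond_probs = {
    word: (pos / n_pos, neg / n_neg) for word, (pos, neg) in counts.items()
}

for word, (p_pos, p_neg) in cond_probs.items():
    print(f"{word:10s} {p_pos:.2f} {p_neg:.2f}")
```

With these counts, a word that never occurs in one class (here "because" in the Negative class) ends up with a probability of exactly zero for that class.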


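You can see that some words appear about equally often in both classes (e.g. I, am, learning, NLP). These don't have much predictive power. Other words appear in only one class, so their probability in the other class is zero. These can't be used as they are, because a zero probability makes the score either zero or undefined. The words happy, sad and not are the most influential.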

Given the above probabilities, we can compute the **likelihood score** of a particular tweet as follows:

![](likelihoodScore.PNG)

A score greater than 1 indicates that the tweet is positive; otherwise it is classified as negative.
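The sketch below assumes the likelihood score is the product, over the words of the tweet, of $P(w | Pos) / P(w | Neg)$. The probability table is hypothetical, and skipping words that are missing from the table is an assumption made for this example.

```python
# Hypothetical table: word -> (P(word | Positive), P(word | Negative)).
probs = {
    "I": (0.20, 0.20), "am": (0.20, 0.20), "happy": (0.14, 0.10),
    "learning": (0.10, 0.10), "sad": (0.10, 0.15), "not": (0.10, 0.15),
}

def likelihood_score(tweet, probs):
    """Multiply P(w | Positive) / P(w | Negative) over the words of a tweet."""
    score = 1.0
    for word in tweet.split():
        if word in probs:  # assumption: words outside the table are skipped
            p_pos, p_neg = probs[word]
            score *= p_pos / p_neg
    return score

score = likelihood_score("I am happy today I am learning", probs)
label = "positive" if score > 1 else "negative"
print(score, label)
```

Only "happy" has a ratio different from 1 in this table, so it alone decides the label for this tweet.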

## Laplacian Smoothing

If a word does not appear in a particular class in the training data (like *because* above), it automatically gets a probability of zero for that class. Laplacian smoothing fixes this.

**V**: the number of unique words in your vocabulary

Without smoothing, the probability of word $w_{i}$ given a class is:

$$ P(w_{i} | class) = \frac{freq(w_{i}, class)}{N_{class}}  $$

class $ \in $ {Positive, Negative}

$N_{class}$ = total count of all words in the class

Laplacian smoothing adds 1 to the numerator and adds V, the number of unique words in the entire vocabulary (across Positive and Negative tweets), to the denominator.

$$ P(w_{i} | class) = \frac{freq(w_{i}, class) + 1}{N_{class} + V}  $$

Then the new $ P(because | Negative) $ is 0.05 instead of 0, which makes the likelihood score computable again.
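The smoothed formula can be sketched as a small function. The Negative-class counts are hypothetical, and the example assumes the vocabulary consists of exactly the eight words listed.

```python
def smoothed_prob(freq, n_class, vocab_size):
    """Laplacian-smoothed estimate of P(word | class)."""
    return (freq + 1) / (n_class + vocab_size)

# Hypothetical counts in the Negative class (illustrative only).
neg_counts = {"I": 3, "am": 3, "happy": 1, "because": 0,
              "learning": 1, "NLP": 1, "sad": 2, "not": 2}

n_neg = sum(neg_counts.values())   # N_class: total words in the class
V = len(neg_counts)                # V: unique words in the vocabulary

neg_probs = {w: smoothed_prob(c, n_neg, V) for w, c in neg_counts.items()}
print(neg_probs["because"])
```

Adding V to the denominator keeps the smoothed probabilities of a class summing to 1, while no word is left with a probability of zero.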

## Log Likelihood, Part 1

The ratio of conditional probabilities for word $w_{i}$ is:

$$ ratio(w_{i}) = \frac{P(w_{i} | Pos)}{P(w_{i} | Neg)} $$

This ratio is greater than 1 for positive words, equal to 1 for neutral words, and less than 1 for negative words.

For all words in a tweet, the product of the ratios above is called the **likelihood** that the tweet is positive. Multiplying it by the **prior ratio**, the probability that a tweet is positive divided by the probability that a tweet is negative, gives the full **Naive Bayes** equation, where $m$ is the number of words in the tweet. If the sample is balanced, the prior ratio is 1.

$$ \frac{P(pos)}{P(neg)} \prod_{i=1}^{m}\frac{P(w_{i} | Pos)}{P(w_{i} | Neg)} $$
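To see the effect of the prior ratio, the sketch below multiplies it into the product of word ratios. The probability table, the tweet counts, and the function name `naive_bayes_ratio` are all hypothetical, and the prior ratio is estimated here as the number of positive tweets divided by the number of negative tweets.

```python
def naive_bayes_ratio(words, probs, n_pos_tweets, n_neg_tweets):
    """Prior ratio times the product of per-word probability ratios."""
    prior_ratio = n_pos_tweets / n_neg_tweets  # estimates P(pos) / P(neg)
    likelihood = 1.0
    for w in words:
        if w in probs:  # assumption: unknown words are ignored
            p_pos, p_neg = probs[w]
            likelihood *= p_pos / p_neg
    return prior_ratio * likelihood

# Hypothetical smoothed probabilities: word -> (P(w | Pos), P(w | Neg)).
probs = {"I": (0.20, 0.20), "am": (0.20, 0.20),
         "happy": (0.15, 0.10), "sad": (0.10, 0.15)}

words = ["I", "am", "happy"]
balanced = naive_bayes_ratio(words, probs, 100, 100)    # prior ratio is 1
unbalanced = naive_bayes_ratio(words, probs, 50, 150)   # prior ratio is 1/3
print(balanced, unbalanced)
```

With a balanced sample the score stays above 1, while a prior ratio of 1/3 pulls the same tweet below 1, which is why the prior matters for unbalanced data.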


## Log Likelihood

Take the log of the likelihood so that you aren't multiplying many small numbers together, which risks numerical underflow on a computer.

$$ \log(a * b) = \log(a) + \log(b) $$

$$ \log\left(\frac{P(pos)}{P(neg)} \prod_{i=1}^{m}\frac{P(w_{i} | Pos)}{P(w_{i} | Neg)}\right) = \log\frac{P(pos)}{P(neg)} + \sum_{i=1}^{m} \log \frac{P(w_{i} | Pos)}{P(w_{i} | Neg)} $$

Here $m$ is the number of words in the tweet. The first term on the right is the **log prior** and the second term is the **log likelihood**.
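The underflow problem is easy to reproduce. The sketch below uses 200 made-up probabilities of 0.01 each (an arbitrary choice for illustration) and compares the direct product with the sum of logs.

```python
import math

# 200 hypothetical per-word probabilities, each 0.01 (illustrative only).
word_probs = [0.01] * 200

# Direct product: the true value, 1e-400, is smaller than any positive
# 64-bit float, so the running product collapses to zero.
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: the same information, kept in a comfortable numeric range.
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Once the product has collapsed to zero, every class looks equally impossible, whereas the log sum can still be compared across classes.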

So we calculate $\lambda$ for each word in the tweet as the log of its ratio, so that summing the $\lambda$ values gives the log likelihood:

$$ \lambda(w) = \log\frac{P(w | Pos)}{P(w | Neg)} $$

![](lambda.PNG)

So, if our tweet is: <span style="color: lightgreen">I am happy because I am learning</span> then we can calculate the log likelihood by summing up the $\lambda$ values for each word in the tweet:

log likelihood = 0 + 0 + 2.2 + 0 + 0 + 0 + 1.1 = 3.3

A value less than zero indicates that the tweet is negative, zero indicates that it is neutral, and a value greater than zero indicates that it is positive.
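The sum and the sign rule can be sketched as follows, using the $\lambda$ values from the example tweet above. The function names are hypothetical, words without a known $\lambda$ are assumed to contribute 0, and the log prior is taken to be 0 (a balanced sample).

```python
# Lambda values for the words of the example tweet, as used in the sum above.
lambdas = {"I": 0.0, "am": 0.0, "happy": 2.2, "because": 0.0, "learning": 1.1}

def log_likelihood(tweet, lambdas):
    """Sum lambda(w) over the words of the tweet; unknown words contribute 0."""
    return sum(lambdas.get(word, 0.0) for word in tweet.split())

def classify(score):
    """Map a log likelihood score to a label using its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

score = log_likelihood("I am happy because I am learning", lambdas)
print(round(score, 1), classify(score))
```

For an unbalanced sample, the log prior would be added to `score` before checking its sign.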

