# EECS 487 Project: Naive Bayes Classifier for Sentiment Analysis of Contraceptives

This notebook contains the code for our project. In the second problem, you will build Naive Bayes classifiers to distinguish between legitimate news headlines and clickbait.

In [12]:
!pip install nltk --upgrade  # after running this line once, you can comment this line out
import nltk
print(nltk.__version__) # this should print out 3.8.1 if you have installed the latest version of NLTK properly

3.8.1


Before we get started, run the following cell to load the autoreload extension so that functions in ```language_model.py``` and ```naive_bayes.py``` will be re-imported into the notebook every time we run them. We also need to import all necessary packages.

In [13]:
%load_ext autoreload
%autoreload 2

import pickle

import pandas as pd
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize

from naive_bayes import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## C.2 Naive Bayes for Text Classification [26 points]
In this problem, you will build Naive Bayes classifiers to do text classification. You will use the clickbait headlines dataset, which contains examples of legitimate news headlines and clickbait news headlines. The original dataset can be found in [this GitHub repository](https://github.com/bhargaviparanjape/clickbait) and [this paper](https://arxiv.org/abs/1610.09786).
### C.2.1 Load dataset [4 points]
To get started, **fill in** the function ```load_headlines``` to load the clickbait dataset into pandas dataframes. The file ```clickbait_data.csv``` contains a partially processed subset of the data. It contains two columns: (1) ```is_clickbait``` is 1 when the row contains a clickbait headline and 0 when it doesn't and (2) ```text```, which contains the headline itself.

Concretely, ```load_headlines``` needs to do the following:

1. Read in the ```text``` and ```is_clickbait``` columns.
2. Rename the ```is_clickbait``` column to ```label```
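
The two steps above can be sketched as follows. This is a minimal illustration, assuming pandas' `read_csv` with `usecols`; the name `load_headlines_sketch` and the in-memory CSV rows are hypothetical and are not the project's actual implementation or data.

```python
import io

import pandas as pd


def load_headlines_sketch(source):
    """Read the text and is_clickbait columns, then rename is_clickbait to label."""
    # usecols keeps only the two columns named in the instructions.
    df = pd.read_csv(source, usecols=["text", "is_clickbait"])
    return df.rename(columns={"is_clickbait": "label"})


# Tiny in-memory CSV standing in for the real file (made-up rows).
demo_csv = io.StringIO(
    "id,is_clickbait,text\n"
    "0,1,You will not believe this\n"
    "1,0,Parliament passes budget bill\n"
)
demo_df = load_headlines_sketch(demo_csv)
print(demo_df)
```

Any extra columns in the file (here the made-up `id`) are dropped by `usecols`, so the returned dataframe has only `label` and `text`.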

In [50]:
from naive_bayes import *

all_data = load_headlines('reviews.csv')

(train, test) = train_test_split(all_data, train_size=0.7)

display(train)
display(test)

Unnamed: 0,ratings,reviews
10174,3,I was on this med in my twenties and it worked...
7997,5,I've been taking this pill for over 5 months n...
3180,5,I actually got on this site to see hwta other ...
3218,1,I was only on this pill 2 weeks when I woke on...
831,5,
...,...,...
14315,5,I took Demulen for many years at the referral ...
3631,1,After having been on cyclessa (a triphasic pil...
10246,5,I'm taking Apri since 1 1/2 year and I never ...
13757,5,I have been on this BC pill for about 6 months...


Unnamed: 0,ratings,reviews
5391,4,I was placed on this medication for the follow...
7142,4,abnormal bleeding.
7324,5,I guess I may be a bit different from the othe...
264,5,I almost backed out of getting Mirena because ...
12528,5,I received a supply of this pill from Planned ...
...,...,...
653,5,I had the Mirena put in on my 6 week check up ...
10460,1,This medication caused massive and sub-massive...
9609,5,I've been taking aviane for two years now. I h...
5882,5,I love the drug. However. Ive gained twenty f...


### C.2.2 Dataset statistics [3 points]
Before you start training classifiers, you need to calculate some basic statistics of the dataset. **Fill in** the function ```get_basic_stats``` to print out the following statistics of the training data:
- Average number of tokens per headline
- Standard deviation of the number of tokens per headline
- Total number of legitimate headlines
- Total number of clickbait headlines

Note: you can use any tokenization method you like.
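
One possible way to compute these statistics is sketched below. It assumes whitespace tokenization and the population standard deviation; `token_stats_sketch` and the toy inputs are hypothetical, and the project's `get_basic_stats` may tokenize differently (for example with `word_tokenize`).

```python
import statistics


def token_stats_sketch(texts, labels):
    """Return mean/std of token counts and the number of examples per label."""
    # Whitespace splitting is the simplest tokenizer; word_tokenize would also work.
    lengths = [len(text.split()) for text in texts]
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # pstdev is the population standard deviation (divides by N, not N - 1).
    return statistics.mean(lengths), statistics.pstdev(lengths), counts


# Made-up inputs: token counts are 3, 5 and 1.
demo_texts = ["a b c", "a b c d e", "a"]
demo_labels = [0, 1, 1]
mean_len, std_len, label_counts = token_stats_sketch(demo_texts, demo_labels)
print(mean_len, std_len, label_counts)
```

The choice of tokenizer changes the averages, so numbers from this sketch are not directly comparable with numbers produced by a different tokenization.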

In [52]:
get_basic_stats(all_data)

Average number of tokens per headline: 110.13079584775086
Standard deviation: 77.16307487574683
Number of negative/positive headlines: {0: 3537, 1: 10913}


{0: 3537, 1: 10913}

### C.2.3 Data processing and ngram calculation [6 points]
Now you need to calculate the ngram counts. **Fill in** the function ```fit``` that, given a dataframe of training data, calculates the ngram counts in each category and the prior probability for each category. Concretely:

- **Store** the total occurrence of each ngram in each category in a list called ```self.ngram_count``` so that ```self.ngram_count[0]``` contains $count(w, c_0)$ for all $w$ in the vocabulary, and ```self.ngram_count[1]``` contains $count(w, c_1)$, etc. ```self.ngram_count[i]``` should be an array of shape $(1,|V|)$, where $V$ is the vocabulary (total vocabulary across both classes).
- **Store** the total occurrence of all ngrams in each category in a list called ```self.total_count``` so that ```self.total_count[0]``` $=\sum_{w\in V}count(w, c_0)$, and ```self.total_count[1]``` $=\sum_{w\in V}count(w, c_1)$, etc.
- **Store** the prior probability for each category in ```self.category_prob```.

You need to follow these rules when calculating the counts:
- convert all letters to lowercase;
- include both unigrams and bigrams;
- ignore terms that appear in more than 80\% of the headlines;
- ignore terms that appear in fewer than 3 headlines.

Hint: use ```CountVectorizer``` in sklearn and store it as ```self.vectorizer```. You need to use **both legitimate and clickbait headlines** to get the vocabulary.
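
The rules above map onto `CountVectorizer` parameters roughly as in the sketch below. This is an illustration under assumptions, not the project's `fit`: the function `count_ngrams_sketch`, its return layout, and the toy corpus are hypothetical, labels are assumed to be 0 and 1, and the per-class counts are flattened to length $|V|$.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def count_ngrams_sketch(texts, labels, min_df=3, max_df=0.8):
    """Return (vectorizer, ngram_count, total_count, category_prob) for labels 0 and 1."""
    labels = np.asarray(labels)
    # One vectorizer fitted on all documents, so both classes share a vocabulary.
    vectorizer = CountVectorizer(
        lowercase=True, ngram_range=(1, 2), min_df=min_df, max_df=max_df
    )
    X = vectorizer.fit_transform(texts)
    # Sum the rows of each class to get one count vector of length |V| per class.
    ngram_count = [np.asarray(X[labels == c].sum(axis=0)).ravel() for c in (0, 1)]
    total_count = np.array([counts.sum() for counts in ngram_count])
    category_prob = np.array([(labels == c).mean() for c in (0, 1)])
    return vectorizer, ngram_count, total_count, category_prob


# Made-up toy corpus. The default thresholds (min_df=3, max_df=0.8) would drop
# almost every term in a corpus this small, so the demo relaxes them.
demo_texts = [
    "good pill no issues",
    "good pill works well",
    "bad pill many issues",
    "bad pill bad mood",
]
demo_labels = [1, 1, 0, 0]
vectorizer, ngram_count, total_count, category_prob = count_ngrams_sketch(
    demo_texts, demo_labels, min_df=1, max_df=1.0
)
print(total_count, category_prob)
```

Fitting a single vectorizer on all documents is what makes the two count vectors index the same vocabulary, which the later probability calculation relies on.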

In [53]:
naive_bayes = NaiveBayes()
naive_bayes.fit(train)
print(f"Probability for each category: {naive_bayes.category_prob}")
print(f"Length of self.ngram_count: {len(naive_bayes.ngram_count)}")
print(f"Shape of the counts for 1st category: {naive_bayes.ngram_count[0].shape}")
print(f"Number of non-zero terms for 1st category: {(naive_bayes.ngram_count[0] > 0).sum()}")
print(f"Maximum count of the 1st category: {naive_bayes.ngram_count[0].max()}")
print(f"Minimum count of the 1st category: {naive_bayes.ngram_count[0].min()}")
print(f"Sum of ngram count for 1st category: {naive_bayes.ngram_count[0].sum()}")
print(f"Total count for each category: {naive_bayes.total_count}")

Probability for each category: [0.24290657 0.75709343]
Length of self.ngram_count: 2
Shape of the counts for 1st category: (46931,)
Number of non-zero terms for 1st category: 35660
Maximum count of the 1st category: 6931.0
Minimum count of the 1st category: 0.0
Sum of ngram count for 1st category: 365251.0
Total count for each category: [ 365251. 1203534.]


### C.2.4 Calculate posterior probability for a category [4 points]
Next, you will use the vectorizer and ngram counts to calculate the posterior probability of a category. In this homework, we have two categories: legitimate and clickbait. **Fill in** the function ```calculate_prob``` that, given a list of articles $docs$ and a category index $i$, returns $\log\left(p(c_i)p(d|c_i)\right)=\log\left(p(c_i)\prod_{x\in X}p(x|c_i)\right)$ for each article $d$ in $docs$, where $X$ is the set of unigrams and bigrams in **both** article $d$ and vocabulary $V$.

- Use **add-one smoothing** in your calculation.
- Simply discard unseen unigrams/bigrams (do not use add-one smoothing to account for them).
- Calculate the **sum of logarithms** to avoid issues with underflow.
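
The three rules above can be illustrated with a small pure-Python sketch. It assumes the usual add-one estimate $p(x|c_i) = \frac{count(x, c_i) + 1}{\sum_{w\in V}count(w, c_i) + |V|}$ and assumes that every occurrence of an ngram contributes one factor (deduplicate the list first if $X$ should be treated as a strict set). The function `log_joint_sketch` and the toy counts are hypothetical; the project's `calculate_prob` works on vectorizer output instead of dictionaries.

```python
import math


def log_joint_sketch(doc_ngrams, ngram_count, total_count, prior):
    """Return log(p(c) * prod p(x|c)) for one document, using add-one smoothing.

    ngram_count maps every vocabulary ngram to its count in the category
    (0 if it never occurred there); ngrams outside the vocabulary are skipped.
    """
    vocab_size = len(ngram_count)
    log_prob = math.log(prior)
    for ngram in doc_ngrams:
        if ngram not in ngram_count:
            continue  # unseen ngram: discarded, not smoothed
        smoothed = (ngram_count[ngram] + 1) / (total_count + vocab_size)
        # Adding logs instead of multiplying probabilities avoids underflow.
        log_prob += math.log(smoothed)
    return log_prob


# Made-up counts for one category: |V| = 3 and the counts sum to 7.
demo_counts = {"good": 3, "pill": 4, "bad": 0}
demo_score = log_joint_sketch(["good", "pill", "unknown"], demo_counts, 7, 0.5)
print(demo_score)
```

In this toy case the score is $\log(0.5 \cdot \tfrac{4}{10} \cdot \tfrac{5}{10}) = \log(0.1)$; the out-of-vocabulary token `unknown` contributes nothing.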

In [54]:
test_docs = ["United Kingdom officially exits the European Union",
 "How to Lose a Guy in 10 Days"]
prob1 = naive_bayes.calculate_prob(test_docs, 0)
prob2 = naive_bayes.calculate_prob(test_docs, 1)
print(f"Probability for category 0: {prob1}")
print(f"Probability for category 1: {prob2}")

Probability for category 0: [-17.33100299 -96.63015648]
Probability for category 1: [-16.00840133 -94.23799538]


### C.2.5 Predict labels for new headlines [2 points]
With the posterior probability of each category, you can predict the label for new headlines. **Fill in** the function ```predict``` that, given a list of headlines, returns the predicted categories of the headlines.
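
Prediction then reduces to picking, for each headline, the category with the larger log score. A minimal sketch, assuming the scores are arranged as one list per category; `predict_sketch` is a hypothetical helper, not the project's `predict`.

```python
def predict_sketch(log_probs_per_category):
    """Pick, for each document, the category with the highest log score.

    log_probs_per_category[i][j] is the score of document j under category i.
    """
    num_docs = len(log_probs_per_category[0])
    predictions = []
    for j in range(num_docs):
        scores = [category_scores[j] for category_scores in log_probs_per_category]
        # Index of the largest score; ties go to the lower category index.
        predictions.append(max(range(len(scores)), key=scores.__getitem__))
    return predictions


# Scores copied from the C.2.4 output above.
demo_scores = [
    [-17.33100299, -96.63015648],  # category 0
    [-16.00840133, -94.23799538],  # category 1
]
print(predict_sketch(demo_scores))  # [1, 1]
```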

In [55]:
preds = naive_bayes.predict(test_docs)
print(f"Prediction: {preds}")

Prediction: [1, 1]


### C.2.6 Calculate evaluation metrics [5 points]
To evaluate a classifier, you need to calculate some evaluation metrics. **Fill in** the function ```evaluate``` that, given a list of predictions and a list of true labels, returns the accuracy, macro f1-score, and micro f1-score. You can **NOT** use functions in sklearn.
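
For reference, the three metrics can be computed from per-class true positives, false positives, and false negatives. The sketch below is one way to do this without sklearn; `evaluate_sketch` is a hypothetical name, and the convention of scoring a class as 0 when its denominator is 0 is an assumption.

```python
def evaluate_sketch(predictions, labels):
    """Return (accuracy, macro F1, micro F1) without using sklearn."""
    classes = sorted(set(labels) | set(predictions))
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

    f1_scores = []
    total_tp = total_fp = total_fn = 0
    for c in classes:
        tp = sum(p == c and y == c for p, y in zip(predictions, labels))
        fp = sum(p == c and y != c for p, y in zip(predictions, labels))
        fn = sum(p != c and y == c for p, y in zip(predictions, labels))
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the denominator is 0.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn

    # Macro F1 averages the per-class scores; micro F1 pools the counts first.
    macro_f1 = sum(f1_scores) / len(f1_scores)
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro_f1 = 2 * total_tp / micro_denom if micro_denom else 0.0
    return accuracy, macro_f1, micro_f1


demo_metrics = evaluate_sketch([1, 1, 0, 1, 0, 0, 1], [1, 0, 0, 1, 0, 1, 1])
print(demo_metrics)
```

When every example has exactly one true label and one predicted label, micro F1 equals accuracy, because each error counts once as a false positive and once as a false negative.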

In [8]:
predictions = [1,1,0,1,0,0,1]
labels = [1,0,0,1,0,1,1]
accuracy, mac_f1, mic_f1 = evaluate(predictions, labels)
print(f"Accuracy: {accuracy}")
print(f"Macro f1: {mac_f1}")
print(f"Micro f1: {mic_f1}")

Accuracy: 0.7142857142857143
Macro f1: 0.7083333333333333
Micro f1: 0.7142857142857143


### C.2.7 Test classifier on test data [2 points]
Finally, you are ready to evaluate your classifier on the test data! Run the following cell to make predictions and print out performance.

In [20]:
predictions = naive_bayes.predict(test.text.tolist())
labels = test.label.tolist()
accuracy, mac_f1, mic_f1 = evaluate(predictions, labels)
print(f"Accuracy: {accuracy}")
print(f"Macro f1: {mac_f1}")
print(f"Micro f1: {mic_f1}")

Accuracy: 0.959
Macro f1: 0.9589999589999589
Micro f1: 0.959
