<a href="https://colab.research.google.com/github/parsa-abbasi/intro-to-nlp/blob/main/NLP_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naïve Bayes Implementation

Naïve Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

In text classification, the features are words (or tokens) and the classes are the predefined categories. The classifier uses the probability of each feature given each class to make predictions on new data. For example, consider the sentence "Studio 99 makes Ronnie O'Sullivan feature doc" and the classes "Sports" and "Politics". The classifier calculates the probability of each word in the sentence given each class, multiplies these by the class prior, and then predicts the class with the highest resulting score.

<br>

$$ P(\text{Sports} | \text{"Studio 99 makes Ronnie O'Sullivan feature doc"}) \propto $$

$$ P(\text{"Studio 99 makes Ronnie O'Sullivan feature doc"} | \text{Sports}) \times P(\text{Sports}) = $$

$$ P(\text{"Studio"} | \text{Sports}) \times ... \times P(\text{"doc"} | \text{Sports}) \times P(\text{Sports}) $$

<br>

$$ P(\text{Politics} | \text{"Studio 99 makes Ronnie O'Sullivan feature doc"}) \propto $$

$$ P(\text{"Studio 99 makes Ronnie O'Sullivan feature doc"} | \text{Politics}) \times P(\text{Politics}) = $$

$$ P(\text{"Studio"} | \text{Politics}) \times ... \times P(\text{"doc"} | \text{Politics}) \times P(\text{Politics}) $$

<br>

Using the training data, we can estimate the probability of each word given a class by counting the number of times the word appears in that class and dividing by the total number of words in that class. For example, if the word "Studio" appears 10 times in the "Sports" class and the total number of words in the "Sports" class is 1000, then the probability of "Studio" given the "Sports" class is 0.01.

Also, we can compute the probability of each class by dividing the number of documents in each class by the total number of documents. For example, if there are 100 documents in the "Sports" class and 200 documents in the "Politics" class, then the probability of the "Sports" class is approximately 0.33 and the probability of the "Politics" class is approximately 0.67.
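The two counting rules above can be sketched with the numbers from the examples. The counts below are the hypothetical ones used in the text, not values measured on a real dataset, and the variable names are chosen only for illustration.

```python
# Hypothetical counts taken from the examples in the text
studio_count_sports = 10      # occurrences of "Studio" in Sports documents
total_tokens_sports = 1000    # all tokens in Sports documents
docs_per_class = {"Sports": 100, "Politics": 200}

# Likelihood: relative frequency of the word within the class
p_studio_given_sports = studio_count_sports / total_tokens_sports

# Prior: share of training documents that belong to each class
total_docs = sum(docs_per_class.values())
priors = {cls: n / total_docs for cls, n in docs_per_class.items()}

print(p_studio_given_sports)          # 0.01
print(round(priors["Sports"], 2))     # 0.33
print(round(priors["Politics"], 2))   # 0.67
```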

<br>

Now we should check which class has the highest probability and assign the sentence to that class.

$$ \text{Prediction} = \underset{c \in \text{Classes}}{\operatorname{argmax}} P(c) \prod_{i=1}^{n} P(x_i | c) $$
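As a minimal sketch of this decision rule, the snippet below scores a two-token document against two classes and picks the larger score. The priors, likelihood values, and the `score` helper are all made up for illustration and are not estimated from any dataset.

```python
from math import prod

# Hypothetical priors and per-class word likelihoods (illustrative values only)
priors = {"Sports": 0.5, "Politics": 0.5}
likelihoods = {
    "Sports":   {"snooker": 0.020, "feature": 0.010, "vote": 0.001},
    "Politics": {"snooker": 0.001, "feature": 0.010, "vote": 0.030},
}

def score(tokens, cls):
    """Unnormalized posterior: P(c) times the product of P(x_i | c)."""
    return priors[cls] * prod(likelihoods[cls][t] for t in tokens)

tokens = ["snooker", "feature"]
scores = {cls: score(tokens, cls) for cls in priors}

# argmax over classes
prediction = max(scores, key=scores.get)
print(prediction)  # Sports
```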

## Libraries

In [None]:
import pandas as pd
import numpy as np
import time
from tqdm import tqdm
import plotly.express as px
from sklearn.model_selection import train_test_split

## Dataset

WELFake is a dataset of 72,134 news articles, of which `35,028` are real and `37,106` are fake.   
The authors merged four popular news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political) to prevent classifiers from over-fitting and to provide more text data for better ML training.

The dataset contains four columns: a serial number (starting from 0), the title (the news headline), the text (the news content), and the label (`0 = fake` and `1 = real`).

The CSV file contains `78098` data entries, of which only `72134` are accessible through the data frame.

You can find the dataset [here](https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/).

In [None]:
!gdown 1_0K9UPrdpR83OtCExceq_KMlqlSKzTkk

Downloading...
From: https://drive.google.com/uc?id=1_0K9UPrdpR83OtCExceq_KMlqlSKzTkk
To: /content/WELFake_Dataset.csv
100% 245M/245M [00:03<00:00, 80.0MB/s]


In [None]:
df = pd.read_csv('WELFake_Dataset.csv', index_col=0)
df

Unnamed: 0,title,text,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,,Did they post their votes for Hillary already?,1
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
...,...,...,...
72129,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0
72130,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1
72131,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0
72132,Trump tussle gives unpopular Mexican leader mu...,MEXICO CITY (Reuters) - Donald Trump’s combati...,0


### Label Distribution

In [None]:
label_dist = df['label'].value_counts()
fig = px.pie(values=label_dist.values, names=label_dist.index, title='Label Distribution')
fig.show()

## Data Preprocessing

### Missing Values

In [None]:
# count missing values
df.isna().sum()

title    558
text      39
label      0
dtype: int64

In [None]:
# check for null values in title and text
df[df['title'].isna() & df['text'].isna()]

Unnamed: 0,title,text,label


We can combine the title and text columns into a single column and use it as the input to the model.

In [None]:
df['full_text'] = df['title'].fillna('') + ' ' + df['text'].fillna('')
df.head()

Unnamed: 0,title,text,label,full_text
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1,LAW ENFORCEMENT ON HIGH ALERT Following Threat...
1,,Did they post their votes for Hillary already?,1,Did they post their votes for Hillary already?
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0,"Bobby Jindal, raised Hindu, uses story of Chri..."
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,SATAN 2: Russia unvelis an image of its terrif...


### Data Splitting

We keep 20% of the data for validation and the rest for training.

In [None]:
import plotly.graph_objects as go

# Before splitting, compare the character-length distribution of the two classes
fig = go.Figure()
fig.add_trace(go.Histogram(x=df['full_text'][df['label']==1].str.len(), name='1 - Real'))
fig.add_trace(go.Histogram(x=df['full_text'][df['label']==0].str.len(), name='0 - Fake'))
fig.update_traces(opacity=0.7)
fig.update_layout(barmode='overlay', title='Text Length Distribution')
fig.show()

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df['full_text'], df['label'], test_size=0.2, random_state=42)

print('X_train shape:', X_train.shape)
print('X_val shape:', X_val.shape)

X_train shape: (57707,)
X_val shape: (14427,)


### Preprocessing Pipeline

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
def preprocess_text(text, return_tokens=True):
    tokens = nltk.word_tokenize(text)
    if return_tokens:
        return tokens
    else:
        return ' '.join(tokens)

In [None]:
X_train_pos = X_train[y_train == 1]
X_train_neg = X_train[y_train == 0]
print('Number of positive samples:', len(X_train_pos))
print('Number of negative samples:', len(X_train_neg))

Number of positive samples: 29768
Number of negative samples: 27939


## Naïve Bayes Algorithm

$$P(class|t_1, t_2, ..., t_n)\propto P(t_1, t_2, ..., t_n|class)\times P(class)=P(t_1|class)\times P(t_2|class)\times ...\times P(t_n|class)\times P(class)$$

### Prior Probability

Prior probability is the probability of each class before seeing any data.

We can compute the prior probability of each class by dividing the number of documents in each class by the total number of documents.

$$ P(class) = \frac{\text{Number of documents in class}}{\text{Total number of documents}} $$

In [None]:
prior_probability = {0: (y_train == 0).sum() / len(y_train),
                     1: (y_train == 1).sum() / len(y_train)}
prior_probability

{0: 0.48415270244511066, 1: 0.5158472975548893}

In [None]:
# We could also use the value_counts method to get the prior probability
prior_probability = y_train.value_counts(normalize=True)
prior_probability = prior_probability.to_dict()
prior_probability

{1: 0.5158472975548893, 0: 0.48415270244511066}

### Likelihood

Likelihood is the probability of each feature (word) given each class.

We can compute the likelihood of each feature by dividing the number of times each feature appears in each class by the total number of words in that class.

$$ \large P(t_i|class)=\frac{count(t_i, class)}{\sum_{t \in V}{count(t, class)}} $$

#### Laplace Smoothing

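Laplace smoothing (also known as additive smoothing) solves the zero-probability problem: a token that never appears in a class during training would get a likelihood of 0, which would force the whole product for that class to 0. Adding 1 to every count, and the vocabulary size $|V|$ to the denominator, gives every token a small non-zero probability.

$$ \large P(t_i|class)=\frac{count(t_i, class) + 1}{(\sum_{t \in V}{count(t, class)}) + |V|} $$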
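The effect of the smoothed formula can be sketched on a tiny, made-up class. The counts, the vocabulary, and the helper names `mle` and `laplace` are assumptions for illustration only.

```python
# Hypothetical token counts for one class (illustrative values only)
counts = {"match": 3, "goal": 1}
vocab = {"match", "goal", "vote"}   # "vote" never occurs in this class
total = sum(counts.values())        # 4 tokens in the class

def mle(token):
    """Unsmoothed estimate: count / total."""
    return counts.get(token, 0) / total

def laplace(token):
    """Add-one estimate: (count + 1) / (total + |V|)."""
    return (counts.get(token, 0) + 1) / (total + len(vocab))

print(mle("vote"))      # 0.0
print(laplace("vote"))  # 1/7, about 0.143
```

The smoothed estimates still sum to 1 over the vocabulary, so they remain a valid probability distribution.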

In [None]:
def token_counter(texts):
    count_dict = {}
    for text in tqdm(texts):
        preprocessed = preprocess_text(text)
        for token in preprocessed:
            if token in count_dict:
                count_dict[token] += 1
            else:
                count_dict[token] = 1
    return count_dict
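The same counting could also be written with `collections.Counter` from the standard library. This is only a sketch: `token_counter_sketch` is a hypothetical name, and it defaults to whitespace splitting so that it runs without the NLTK tokenizer data.

```python
from collections import Counter

def token_counter_sketch(texts, tokenize=str.split):
    """Count token occurrences across all texts.

    `tokenize` defaults to whitespace splitting (an assumption made so the
    sketch runs without NLTK); pass `preprocess_text` to match the notebook.
    """
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

counts = token_counter_sketch(["the cat sat", "the dog"])
print(counts["the"])   # 2
print(counts["bird"])  # 0 (a Counter returns 0 for missing keys)
```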

In [None]:
class_count_neg = token_counter(X_train_neg)
print(f'Negative class - Vocab size: {len(class_count_neg)}, Total count: {sum(class_count_neg.values())}')

100%|██████████| 27939/27939 [02:49<00:00, 165.10it/s]

Negative class - Vocab size: 171014, Total count: 18914776





In [None]:
class_count_pos = token_counter(X_train_pos)
print(f'Positive class - Vocab size: {len(class_count_pos)}, Total count: {sum(class_count_pos.values())}')

100%|██████████| 29768/29768 [02:31<00:00, 195.98it/s]

Positive class - Vocab size: 318809, Total count: 17514814





In [None]:
class_based_count = [class_count_neg, class_count_pos]
vocab_size = len(set(list(class_count_neg.keys()) + list(class_count_pos.keys())))
total_count = [sum(class_count_neg.values()), sum(class_count_pos.values())]

In [None]:
# mega_doc_neg = ' '.join(X_train_neg)
# class_count_neg_mega = token_counter([mega_doc_neg])

100%|██████████| 1/1 [02:18<00:00, 138.96s/it]


### Posterior Probability

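Posterior probability is the probability of each class after seeing the data. It is proportional to the product of the prior probability and the likelihood; the normalizing constant is the same for every class, so it can be ignored when comparing classes.

$$ \large P(class|t_1, t_2, ..., t_n)\propto P(t_1, t_2, ..., t_n|class)\times P(class) $$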

In [None]:
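def compute_probability(doc, cls):
    # Sum log10 probabilities instead of multiplying raw probabilities,
    # so that long documents do not underflow to zero
    log_probability = 0
    preprocessed = preprocess_text(doc)
    for token in preprocessed:
        # Tokens never seen in this class get a count of 0
        word_count = class_based_count[cls].get(token, 0)
        # Laplace smoothing; the extra +1 in the denominator leaves room
        # for tokens that are outside the training vocabulary
        word_prob = (word_count + 1) / (total_count[cls] + vocab_size + 1)
        log_probability = log_probability + np.log10(word_prob)
    # Add the log prior of the class
    log_probability = log_probability + np.log10(prior_probability[cls])
    return log_probability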
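The function above adds logarithms rather than multiplying probabilities. A minimal sketch of why this matters, assuming a hypothetical 500-token document in which every token has likelihood `1e-5`:

```python
import math

# Hypothetical: a 500-token document where every token has likelihood 1e-5
word_prob = 1e-5
n_tokens = 500

# Direct product underflows: 1e-2500 is far below the smallest positive float
direct = 1.0
for _ in range(n_tokens):
    direct *= word_prob
print(direct)  # 0.0

# The sum of logs stays representable
log_score = sum(math.log10(word_prob) for _ in range(n_tokens))
print(round(log_score))  # -2500
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product, so the argmax is unchanged.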

### Prediction

We can predict the class of a document by computing the posterior probability of each class and then choosing the class with the highest probability.

$$ \large \text{Prediction} = \underset{c \in \text{Classes}}{\operatorname{argmax}} P(c) \prod_{i=1}^{n} P(x_i | c) $$

In [None]:
def predict(test):
    predictions = []
    for text in tqdm(test):
        neg_prob = compute_probability(text, 0)
        pos_prob = compute_probability(text, 1)
        if neg_prob > pos_prob:
            predictions.append(0)
        else:
            predictions.append(1)
    return np.array(predictions)

In [None]:
X_val.iloc[0]

'ARNOLD SCHWARZENEGGER Sends A Message To Liberals Whining About Trump [Video]  '

In [None]:
y_val.iloc[0]

1

In [None]:
predict([X_val.iloc[0]])

100%|██████████| 1/1 [00:00<00:00, 314.23it/s]


array([1])

## Model Evaluation

In [None]:
predictions = predict(X_val)
print('Accuracy:', (predictions == y_val).mean())

100%|██████████| 14427/14427 [02:54<00:00, 82.73it/s]

Accuracy: 0.9426769252096763
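Accuracy alone does not show which kind of error the classifier makes. As a sketch, the confusion-matrix counts, precision, and recall can be computed from label/prediction pairs as below. The `y_true` and `y_pred` lists are made-up values for illustration, not outputs of the model above.

```python
# Hypothetical labels and predictions (illustrative values only)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn)               # 3 1 1 3
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```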





## Scikit-learn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

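# min_df=1 keeps every token that occurs in at least one document
# (an integer min_df of 0 is rejected by recent scikit-learn versions)
model = Pipeline([('vectorizer', CountVectorizer(min_df=1, lowercase=False)),
                  ('nb', MultinomialNB())])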
model.fit(X_train, y_train)
print('Accuracy:', model.score(X_val, y_val))

Accuracy: 0.9210508075136896
