# Sentiment Analysis with Naive Bayes

Create a generic text classifier and use it to predict the sentiment of IMDB movie reviews.

## Dataset
### Context
The IMDB dataset contains 50K movie reviews for natural language processing or text analytics.<br>
This is a dataset for binary sentiment classification.
### Content
<li>review: the text of the movie review</li>
<li>sentiment: the sentiment label, either positive or negative</li>

## What is Naive Bayes?

Naive Bayes is a simple and effective algorithm for classification tasks.<br>
<br>
It calculates probabilities based on prior knowledge and observed evidence to predict the class of new examples.<br>
<br>
It assumes that features are independent of each other given the class label. It works well for text data and is useful when training data is limited.<br>
<br>
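
Under that independence assumption, Bayes' theorem takes a particularly simple form. Writing $C$ for a class and $E = (x_1, \dots, x_n)$ for the observed features (notation chosen here for illustration, not taken from a specific library), the posterior can be sketched as:

$$
P(C \mid E) = \frac{P(C) \, \prod_{i=1}^{n} P(x_i \mid C)}{P(E)}
$$

Here $P(C)$ is the prior, each $P(x_i \mid C)$ is a per-feature likelihood, and $P(E)$ is the same for every class.
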
Here's how the Naive Bayes algorithm works for classification:<br>
<li>Given a dataset with labeled examples, it calculates the prior probabilities of each class based on the proportion of each class in the dataset.</li>
<li>For each class, it estimates the likelihood probabilities of observing each feature given that class. This is usually done using the frequency of each feature within the class.</li>
<li>To classify a new example with unknown class, the algorithm calculates the posterior probabilities for each class using Bayes' theorem. The class with the highest posterior probability is considered the predicted class for the new example.</li>
<li>During prediction, the algorithm ignores the denominator of Bayes' theorem, P(E), the probability of the observed evidence. It is the same for every class, so it acts only as a normalization constant and does not affect the relative ranking of the classes.</li>
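
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation used later in this notebook: the toy documents, the function names `train_naive_bayes` and `predict`, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate class priors and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    # Prior of each class: its share of the training examples.
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    # Word frequencies within each class, used for the likelihoods.
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return priors, word_counts, vocab

def predict(tokens, priors, word_counts, vocab):
    """Return the class with the highest unnormalized log-posterior."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        # Work in log space; the denominator P(E) is omitted on purpose.
        score = math.log(prior)
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Toy, made-up training data (not taken from the IMDB dataset).
docs = [["great", "movie"], ["wonderful", "film"],
        ["boring", "movie"], ["terrible", "film"]]
labels = ["positive", "positive", "negative", "negative"]

priors, word_counts, vocab = train_naive_bayes(docs, labels)
print(predict(["great", "film"], priors, word_counts, vocab))
```

Working with sums of logarithms instead of products of probabilities is a common choice because long reviews would otherwise multiply many small numbers and underflow.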

## Libraries

Please install the following dependencies using pip install:
<li>nltk</li>
<li>scikit-learn</li>
<li>pandas</li>
<li>beautifulsoup4</li>

In [1]:
import nltk
import pandas as pd
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [2]:
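# Download the WordNet dataset from NLTK (used by the lemmatizer).
nltk.download('wordnet')

# Download the Open Multilingual Wordnet (OMW) dataset from NLTK.
nltk.download('omw-1.4')

# Download the stop-word lists required by stopwords.words("english") during preprocessing.
nltk.download('stopwords')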

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\VelzM\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\VelzM\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## ML Data Engineering

In [3]:
# Please download the dataset from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
data = pd.read_csv('IMDB Dataset.csv', encoding = 'Latin-1')
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
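# Text preprocessing.
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def strip_html(text):
    """Remove HTML markup such as <br /> tags and return the plain text."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def clean_text(text):
    """Normalize a raw review into a string of lowercased, lemmatized tokens."""
    text = strip_html(text)
    # Replace every run of non-alphanumeric characters with a single space.
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
    text = text.lower()
    # split() without arguments drops the empty tokens left by leading/trailing spaces.
    tokens = text.split()
    # Lemmatize each token as a noun (the default), then as a verb.
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token, "v") for token in tokens]
    # Drop English stop words.
    tokens = [token for token in tokens if token not in stop_words]
    return " ".join(tokens)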

In [5]:
# Clean up the reviews.
data['review_cleaned'] = data['review'].apply(clean_text)



In [6]:
# Split the data into training (80%) and test (20%) sets.
X_train, X_test, y_train, y_test = train_test_split(data['review_cleaned'], data['sentiment'], test_size = 0.2, random_state = 0)

In [7]:
# Feature extraction: convert the text into numerical feature vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) representation.
# The vectorizer is fitted on the training set only and then applied unchanged to the test set.
vectorizer = TfidfVectorizer()
X_train_vector = vectorizer.fit_transform(X_train)
X_test_vector = vectorizer.transform(X_test)
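
To make the TF-IDF step more concrete, the following sketch computes the weights by hand for a toy corpus. It assumes the defaults that scikit-learn documents for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization) and uses simple whitespace tokenization; the helper `tfidf_vectors` and the example documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute L2-normalized TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed to match scikit-learn's default).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Scale each document vector to unit length; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Made-up documents for illustration only.
docs = ["great movie", "boring movie", "great great acting"]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

In the second document, "boring" receives a higher weight than "movie" because "movie" appears in more documents and is therefore less distinctive.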

## ML Model Engineering

In [8]:
# Train the model using the Multinomial Naive Bayes classifier, a variant of Naive Bayes designed for discrete features such as word counts.
# It is commonly used for text classification; according to the scikit-learn documentation, fractional counts such as TF-IDF weights also work in practice.
clf = MultinomialNB()
clf.fit(X_train_vector, y_train)
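
As a hedged illustration of how a fitted vectorizer and classifier could be applied to unseen text, the sketch below repeats the same two steps on a tiny, made-up corpus. The training sentences, the new reviews, and the names `toy_vectorizer` and `toy_clf` are invented for the example; on the real data, `vectorizer` and `clf` from the cells above would play the same roles, and new reviews would first need to pass through `clean_text`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data (not taken from the IMDB dataset).
train_texts = [
    "great wonderful movie",
    "loved this wonderful film",
    "terrible boring movie",
    "hated this boring film",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Fit the vectorizer and the classifier on the training texts only.
toy_vectorizer = TfidfVectorizer()
toy_clf = MultinomialNB()
toy_clf.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

# Unseen texts must be transformed with the already fitted vectorizer.
new_reviews = ["wonderful great", "boring terrible"]
predictions = toy_clf.predict(toy_vectorizer.transform(new_reviews))
print(list(predictions))
```

Calling `transform` rather than `fit_transform` on new text keeps the vocabulary and IDF weights identical to those seen during training.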

## ML Model Evaluation

In [9]:
# Evaluate the performance on the test set.
y_pred = clf.predict(X_test_vector)

# Accuracy is the fraction of all test instances that the classifier labels correctly (true positives and true negatives).
print("Accuracy: ", accuracy_score(y_test, y_pred))

# Precision is the fraction of instances predicted as a class that actually belong to it.
# average='weighted' computes it per class and averages the results weighted by class support.
print("Precision: ", precision_score(y_test, y_pred, average='weighted'))

# Recall is the fraction of instances of a class that the classifier identifies correctly.
# With average='weighted' it is numerically equal to accuracy.
print("Recall: ", recall_score(y_test, y_pred, average='weighted'))

# F1-score is the harmonic mean of precision and recall; it balances the two when improving one lowers the other.
# Unlike the two metrics above, it is computed here for the 'positive' class only (default average='binary').
print("F1-score: ", f1_score(y_test, y_pred, pos_label='positive'))

Accuracy:  0.8609
Precision:  0.861122470471853
Recall:  0.8609
F1-score:  0.8580467394632105
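
The metric definitions in the comments above can be checked by hand on a small example. The sketch below uses hypothetical label lists (not this notebook's predictions) and illustrates that support-weighted recall coincides with accuracy, which is consistent with the two identical values printed above. The helper name `per_class_metrics` is an assumption for the example.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels for illustration only.
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "positive", "negative",
          "positive", "negative", "negative", "positive"]

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Weighted recall: per-class recall weighted by each class's share of y_true.
weighted_recall = sum(
    per_class_metrics(y_true, y_pred, label)[1] * y_true.count(label) / len(y_true)
    for label in set(y_true)
)

precision, recall, f1 = per_class_metrics(y_true, y_pred, "positive")
print("accuracy:", accuracy, "weighted recall:", weighted_recall)
print("positive class:", precision, recall, f1)
```

The equality holds in general: weighting each class's recall by its support sums the correctly classified instances over all classes and divides by the total, which is the definition of accuracy.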


## Conclusions

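86.09% of the test reviews are classified correctly (accuracy)<br>
86.11% is the precision averaged over both classes and weighted by class size: the share of predicted labels that are correct<br>
86.09% is the weighted recall: the share of actual instances of each class that are identified, which for a weighted average equals the accuracy<br>
85.80% is the F1-score of the positive class, a reasonable result that takes both false positives and false negatives into account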

## References

https://chat.openai.com/<br>
https://levelup.gitconnected.com/movie-review-sentiment-analysis-with-naive-bayes-machine-learning-from-scratch-part-v-7bb869391bab<br>
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews<br>
https://towardsdatascience.com/sentiment-analysis-on-a-imdb-movie-review-dataset-with-a-support-vector-machines-model-in-python-50c1d487327e