John Park (UBID: 50285417)

MTH 448

Apr. 23, 2023

# Classifying Texts with Naive Bayes

## Introduction

In the realm of natural language processing, text classification is a fundamental task with a wide range of applications, including sentiment analysis, topic identification, and spam detection. Naive Bayes is a simple and efficient probabilistic machine learning algorithm. It plays significant role for text classification tasks with its ease of implementation and strong performance in many real-world scenarios. This project aims to investigate the Naive Bayes classifier by appling it to two distinct datasets for text classification: (1) newsgroup posts and (2) movie reviews.

The main goal is to assess naive bayes classifier's ability to predict the newsgroup to which a post belongs and the sentiment of a movie review. This study will not only evaluate the classification accuracy but also delve into the finer aspects of the algorithm, such as the impact of stopwords on the performance and the role of the training set size. The most frequent words in each class of texts and misclassified examples will be analyzed as well. To ensure a thorough understanding of the Naive Bayes classifier, this project implement it from scratch, without relying on pre-built machine learning libraries.

Overall the whole project aims to provide valuable insights into the capabilities and limitations of the Naive Bayes classifier for text classification alongside, practical recommendations to improve its performance for various applications.

The first approach is importing essential libraries.

In [773]:
import pandas as pd
import numpy as np
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from collections import defaultdict, Counter
from tqdm import tqdm
import re
%pylab inline

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


Then, in order to analyze, both datasets should be cleared so that they only consist of proper words, which affect the meanings and contexts of the news posts and reviews. Unnecessary strings may hinder the classifier.

In [1053]:
# Clearing Strings of Text
def clean_string(string):    
# Unnecessary strings from the text
    bad_strings = ['<br','\n','.',',',';','>','/']
    for bad in bad_strings:
# Removing unncessary strings by changing it to empty string ''
        string = string.replace(bad,'')
# Remove '-' and Double Space into ' '  
    string = string.replace('-',' ').replace('  ',' ')
    return string

## Part 1. News Group

The first part of the project is investigating Newsgroup. The first step is Load and Preprocess the Data.

### Load and Preprocess the Data

In [959]:
# Load newsgroups.zip
with ZipFile("newsgroups.zip", 'r') as zipped:
    txt = zipped.read('newsgroups.txt').decode(encoding='utf8', errors='ignore')

In [1058]:
# Split txt files by each post
entries = txt.split('Newsgroup: ')[1:]
# Make empty list to store new Data
data = []
# Loop through all posts
for post in entries:
# Split the post into each line
    lines = post.strip().split('\n')
# Name of Newsgroup of post is first item in array
    newsgroup = lines[0].split(' ')[0]
# Main post
    post_text = '\n'.join(lines[5:])
# Append to data and each list contains two dictionaries in it
    data.append({'Newsgroup': newsgroup, 'Body': post_text})
news_df = pd.DataFrame(data)
news_df['Body']

0       >sure sounds like they got a ringer.  the 325i...
1       I have been hearing bad thing about amalgam de...
2       >DATE:   Tue, 6 Apr 1993 00:11:49 GMT\n>FROM: ...
3       In article <1993Apr16.174843.28111@cabell.vcu....
4       In article <visser.735284180@convex.convex.com...
                              ...                        
7373    L(>  levin@bbn.com (Joel B Levin) writes:\nL(>...
7374    In article <1r3sbbINN8e0@hp-col.col.hp.com>, t...
7375    \nWell, since someone probably wanted to know,...
7376    \nHello Hockey fans.\nBonjour tout le monde!\n...
7377    If the Islanders beat the Devils tonight, they...
Name: Body, Length: 7378, dtype: object

In [1060]:
# Preprocess the data
def preprocess(text):
# Make all lowercase
    text = text.lower()
# Clean bad strings of the text
    text = clean_string(text)
# Remove non-alphanumeric characters and replace them with ' '
    text = re.sub(r'\W+', ' ', text)
# Split them into a list of individual words
    words = text.split()
# Return new list
    return words

After preporcessing, the next step is spliting data into training and testing.

### Split Data into Training and Testing Sets

In [962]:
# Split data into training and testing sets
data_train, data_test = train_test_split(news_df, test_size=0.1, random_state=1)

In [963]:
# Name of Newsgroups and Number of Posts from each group in Training Set
data_train['Newsgroup'].value_counts()

sci.med               905
rec.sport.hockey      895
rec.sport.baseball    893
rec.motorcycles       893
sci.electronics       887
rec.autos             885
alt.atheism           714
talk.religion.misc    568
Name: Newsgroup, dtype: int64

In [964]:
# 'Body' before going into preprocess
data_train['Body']

7321    \n\tThis comes indirectly from Al Morgani who ...
3420    Tall Cool One (rky57514@uxa.cso.uiuc.edu) wrot...
3531    Andy Collins (acollins@uclink.berkeley.edu) wr...
2812    \n    I hope that this comes off as a somewhat...
3913    Reading all you folks things to do to illegall...
                              ...                        
905     \nThey detect the oscillator operating in the ...
5192    \nIn article <1993Apr12.201056.20753@ns1.cc.le...
3980    In article <Apr16.215151.28035@engr.washington...
235     In article <1993Apr6.170330.12314@is.morgan.co...
5157    Anybody got any good/bad experience with selli...
Name: Body, Length: 6640, dtype: object

In [965]:
# Preprocessing 'Body'
data_train['Body'] = data_train['Body'].apply(preprocess)

In [966]:
# 'Body' after preprocess
data_train['Body']

7321    [this, comes, indirectly, from, al, morgani, w...
3420    [tall, cool, one, rky57514, uxacsouiucedu, wro...
3531    [andy, collins, acollins, uclinkberkeleyedu, w...
2812    [i, hope, that, this, comes, off, as, a, somew...
3913    [reading, all, you, folks, things, to, do, to,...
                              ...                        
905     [they, detect, the, oscillator, operating, in,...
5192    [in, article, 1993apr1220105620753, ns1cclehig...
3980    [in, article, apr1621515128035, engrwashington...
235     [in, article, 1993apr617033012314, ismorgancom...
5157    [anybody, got, any, goodbad, experience, with,...
Name: Body, Length: 6640, dtype: object

Using preprocessed trained data, this is the main part, implementing the Naive Bayes classifier.

### Implement the Naive Bayes Classifier

The first step of implementing the Naive Bayes classifier is calculating prior probabilities.

The prior probabilities represent the general frequency of each newsgroup in the dataset. This information helps the classifier make an initial guess about the newsgroup of a given post. It plays an essential role by providing a baseline for comparison when evaluating the likelihoods of words in a given post.

In [1062]:
# Calculate prior probabilities
# Measure the count of each newsgroup in the training dataset
newsgroup_counts = data_train['Newsgroup'].value_counts()
# Divide each count by the length of the training dataset
prior_probs = newsgroup_counts / len(data_train)
prior_probs

sci.med               0.136295
rec.sport.hockey      0.134789
rec.sport.baseball    0.134488
rec.motorcycles       0.134488
sci.electronics       0.133584
rec.autos             0.133283
alt.atheism           0.107530
talk.religion.misc    0.085542
Name: Newsgroup, dtype: float64

Next step is calculating likelihoods of each word in each newsgroup.

The likelihood of each word in each newsgroup helps the classifier understand the relationship between words and newsgroups. It measures the probability of observing a particular word in a given newsgroup, which is essential for determining the probability of a post belonging to a specific newsgroup based on its content.

In [1063]:
# Calculate word counts for each newsgroup
# Initialize a dictionary with a default value of a Counter
word_like_news = defaultdict(Counter)
# Loop through all posts of each group in training dataset 
for newsgroup, texts in data_train.groupby('Newsgroup'):
# Loop through each post
    for text in texts['Body']:
# Loop through each word of the post
        for word in text:
# The dictionary stores the word frequency for each newsgroup
            word_like_news[newsgroup][word] += 1

Now, this step is classifying the newsgroup of posts of testing dataset using likelihoods of words in each newsgroup. 

In [1067]:
# Classify a new post
def classify_post(post):
# preprocess new post
    post_words = preprocess(post)
# Empty dictionary for storing 
    newsgroup_probs = {}
    for newsgroup, word_count in word_like_news.items():
        prob = np.log(prior_probs[newsgroup])
        total_wc = sum(word_count.values())
        for word in post_words:
# Add 1 to word count in case that word count is 0
# Since log(0) is Mathematically undefined, 1 is added to every word count
            prob += np.log((word_count[word] + 1) / (len(total_wc) + len(word_count)))
# Add to dictionary; key is Newsgroup, values is prob calculated
        newsgroup_probs[newsgroup] = prob
# return max(zip(newsgroup_probs.values(), newsgroup_probs.keys()))[1]
    return max(newsgroup_probs, key=newsgroup_probs.get)

In [970]:
# Apply classifying method on posts of testing dataset
news_pred = data_test['Body'].apply(classify_post)

In [1064]:
# The result predicted by classifier
news_pred

5748       rec.motorcycles
4875           alt.atheism
4537       sci.electronics
5857           alt.atheism
6721    rec.sport.baseball
               ...        
4980       rec.motorcycles
6556               sci.med
1282       sci.electronics
2921               sci.med
3979       sci.electronics
Name: Body, Length: 738, dtype: object

### Evaluate the Classifier

#### Does the Size of Training Set Matters?

The code below calculates the accuracy of the classifier when the training size is 0.9

In [1065]:
# Compare the predicted result to the original
newsTF_result = (news_pred == data_test['Newsgroup'])
# Calculate the accuracy
news_acc_09 = newsTF_result.sum() / len(data_test['Newsgroup'])
news_acc_09

0.9186991869918699

Using the codes above, by adjusting the size of testing set from the part of preprocessing the data, the investigation is led in order to study if the sizes of training set and testing set matter.

|Train Size|Test Size|Accuracy|
|-|-|-|
|0.1|0.9|0.7067|
|0.3|0.7|0.8426|
|0.5|0.5|0.8702|
|0.7|0.3|0.9033|
|0.9|0.1|0.9187|

From the table above, the training size and the accuracy are directly proportional.

#### Misclassified Posts

The result with the training size of 0.9 represents that there are still about 30 percent of misclassified texts. In every experiment, examining the misclassified data is essential for further inquiry. Thus, this step is for finding a few misclassified texts, determining if the predicted probabilities are excessively far off, and analyzing what make them to be incorrectly sorted.

In [1068]:
# Show 10 misclassified post
misclassified_post = []
for i, tf in enumerate(newsTF_result):
    if tf != True:
        misclassified_post.append((data_test.values[i][0], news_pred.values[i], i))
# The returned list represents (Original, Predicted, post number)
misclassified_post[:10]

[('talk.religion.misc', 'alt.atheism', 3),
 ('talk.religion.misc', 'alt.atheism', 6),
 ('talk.religion.misc', 'alt.atheism', 12),
 ('talk.religion.misc', 'alt.atheism', 14),
 ('rec.autos', 'alt.atheism', 25),
 ('rec.autos', 'sci.electronics', 88),
 ('talk.religion.misc', 'alt.atheism', 102),
 ('rec.motorcycles', 'rec.autos', 123),
 ('rec.autos', 'alt.atheism', 127),
 ('talk.religion.misc', 'alt.atheism', 138)]

##### Any Reason?

From 10 misclassified examples above, the classifier is confuesed of religion and atheism. Atheism is built upon disbelief in religious figures, which is the state between religious and agnostic. Even though atheism is against the belief in God, it still describes religious and philosophical concepts, which is certainly similar to religious post. Since the prior probabilites epxresses that the frequency of atheism is larger than that of religion, in the beginning of the classifying porcess, the classifier would have misunderstood the post and chosen atheism over religion.

### Stopwords Removal

In [1047]:
# Drop the stopwords of Newsgroup Posts
with open('stopwords.txt') as f:
    stops = f.read()
# Split the stopwords into a list of individual stopword
stops = stops.split(',')

In [1074]:
# Remove Stopwords of each post of training set
for data in tqdm(data_train['Body'].values):
    data = [ele for ele in data if ele not in stops]

100%|█████████████████████████████████████| 6640/6640 [00:01<00:00, 4384.63it/s]


In [1073]:
# Preprocessed 'Body' after Stopwords Removal
data_train['Body']

7321    [comes, indirectly, from, al, morgani, who, wo...
3420    [tall, cool, one, rky57514, uxacsouiucedu, wro...
3531    [andy, collins, acollins, uclinkberkeleyedu, w...
2812    [i, hope, that, this, comes, off, as, a, somew...
3913    [reading, all, you, folks, things, to, do, to,...
                              ...                        
905     [they, detect, the, oscillator, operating, in,...
5192    [in, article, 1993apr1220105620753, ns1cclehig...
3980    [in, article, apr1621515128035, engrwashington...
235     [in, article, 1993apr617033012314, ismorgancom...
5157    [anybody, got, any, goodbad, experience, with,...
Name: Body, Length: 6640, dtype: object

In [1050]:
# Apply classifying method on posts of testing dataset after Removing Stopwords
news_pred_stop = data_test['Body'].apply(classify_post_stop)

In [1051]:
# The result predicted by classifier after Removing Stopwords
news_pred_stop

5748       rec.motorcycles
4875           alt.atheism
4537       sci.electronics
5857           alt.atheism
6721    rec.sport.baseball
               ...        
4980       rec.motorcycles
6556               sci.med
1282       sci.electronics
2921               sci.med
3979       sci.electronics
Name: Body, Length: 738, dtype: object

In [1076]:
# Compare the predicted result after stopwrods removal to the original
newsTF_result_stop = (news_pred_stop == data_test['Newsgroup'])
# Calculate the accuracy
news_acc_09_stop = newsTF_result_stop.sum() / len(data_test['Newsgroup'])
news_acc_09_stop

0.9186991869918699

In [1075]:
# List for misclassified post after stopwords removal
misclassified_post_stop = []
for i, tf in enumerate(newsTF_result_stop):
    if tf != True:
        misclassified_post_stop.append((data_test.values[i][0], news_pred_stop.values[i], i))
# (Original Newsgroup, Predicted Newsgroup, misclassified post number)
misclassified_post_stop[:10]

[('talk.religion.misc', 'alt.atheism', 3),
 ('talk.religion.misc', 'alt.atheism', 6),
 ('talk.religion.misc', 'alt.atheism', 12),
 ('talk.religion.misc', 'alt.atheism', 14),
 ('rec.autos', 'alt.atheism', 25),
 ('rec.autos', 'sci.electronics', 88),
 ('talk.religion.misc', 'alt.atheism', 102),
 ('rec.motorcycles', 'rec.autos', 123),
 ('rec.autos', 'alt.atheism', 127),
 ('talk.religion.misc', 'alt.atheism', 138)]

Based on the results above, the accuracy has not changed, impling that the stopwords removal does not affect much on the subjects of the posts.

Since the newsgroup posts usually consist of their specific words that are more related to their own topics, the stopwords would take large portion of the words the posts have in common. Hence, the stopwords removal might have affected the accuracy.

In addition to the investigation on newsgroup, the second part of this project is analyzing Naive Bayes Classifier on movie reviews dataset.

## Part 2. Movie Reviews

The first step is Load the Data.

### Load the Data

In [119]:
# Load movie_reviews.zip
movies = pd.read_csv("movie_reviews.zip")

For movie revies, steps of preporcessing the dataset and spliting the dataset into training and testing are bundeled together.

### Preprocess the Data and Split into Training and Testing Sets

Training and testing sets are divided, and the data is preprocessed by removing useless strings and spaces. After that, the arrays are created to distinguish negative and positive reviews.

In [1077]:
# List for negative words
neg = []
# List for positive words
pos = []
# Spliting the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(movies['review'], 
                                                    movies['sentiment'], 
                                                    test_size=0.1, 
                                                    random_state=1)
# The length of training dataset
n = len(X_train)
# Loop through the length of training dataset
for i in tqdm(range(len(X_train))):
# Lowercase all reviews
    words = movies.loc[i,'review'].lower()
# Remove the bad strings of reviews
    words = clean_string(words)   
# Split the reviews into a list of individual words
    words = words.split(' ')
# If the sentiment of review is "negative"
    if movies.loc[i,'sentiment'] == "negative":
# Add word to neg
        neg += words
# If "positive"
    else:
# Add word to pos
        pos += words

100%|██████████████████████████████████| 22500/22500 [00:01<00:00, 19042.11it/s]


With the sorted arrays from the above code, new dataframe is created with two columns of "Negative" and "Positive". Then, for each word, the dataframe has the record of its word counts in each type of review. Now, to find the probability of each review and classify its setiment, naive bayes is calculated; however, instead of $P(y|x)$, $log(P(y|x))$ is used in order to avoid underflow errors. Then, for the calculation of $log$, $1$ is added to all word counts since the word with word count $0$ would have $log(0)$, and $log(0)$ is mathematically undefined.

In [336]:
# Change neg to Counter to count the frequency of each word in the negative reviews
negCount = Counter(neg)
# Change pos to Counter to count the frequency of each word in the positive reviews
posCount = Counter(pos)
# make new dictionary containing negative counts and positive counts
movie_words_data = {"Negative":negCount,"Positive":posCount}
# change the dictionary to dataframe; fill missing values with 0, and convert all values to int
wc = pd.DataFrame(movie_words_data).fillna(0).astype(int)
wc = wc + 1

### Implement the Naive Bayes Classifier

After preprocessing the data, the negative and positive probability of each review is calculated, and the larger number determines the sentiment of the review.

In [1085]:
# New empty array to represent no stopwords
stops = []
def rev_probs(review_text):
# List to store the probability values for negative
    neg_list = []
# List to store the probability values for positive
    pos_list = []
# Preprocess the review text by cleaning it, converting it to lowercase, and splitting it into words
    words = clean_string(review_text).lower().split(' ')
# Loop through each word in the list
    for word in words:
# If each word is in the stopwords
        if word in stops:
# If it is, skip to the next word
            None
# If not
        else:
# Try to calculate the probabilities for the current word
            try:
# Calculate the probability of the word given each class (negative and positive)
                p = wc.loc[word]/wc.sum()
# Append the calculated probabilities to the corresponding lists
                neg_list.append(p["Negative"])
                pos_list.append(p["Positive"])
# If an exception occurs, skip to the next word
            except:
                None
# Total count of words in the DataFrame
    total = np.array(wc.sum()).sum()
# Initialize an array of ones with length of 2
    prod = np.ones(2)
# Calculate the prior probabilities for each class
    prod[0] = wc['Negative'].sum()/(total)
    prod[1] = wc['Positive'].sum()/(total)
# Take the log10 of the probabilities
    prod = np.log10(prod)
# Add the log10 of the word probabilities for each class to the corresponding prior probabilities
    prod[0] += np.log10(neg_list).sum()
    prod[1] += np.log10(pos_list).sum()
# Final log10 probabilities for the negative and positive classes
    return prod

def pre_sentiment(review):
# Calculate probabilities of the review being negative or positive
    prod = rev_probs(review)
# Compare probabilities and return negative or positive
    return 'negative' if prod[0] > prod[1] else 'positive'

In [1086]:
# Empty list to store sentiment predictions for each review in the test dataset
sentiment_predictions = []
# Loop through each review in the test dataset
for review in tqdm(X_test.values):
# Predict the sentiment of current review using pre_sentiment function
    prediction = pre_sentiment(review)
# Append the predicted sentiment to the sentiment_predictions list
    sentiment_predictions.append(prediction)
# Return the list of sentiment predictions for all reviews in the test dataset
sentiment_predictions

100%|███████████████████████████████████████| 2500/2500 [02:42<00:00, 15.41it/s]


['negative',
 'positive',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',

Using the methods above, the array of predicted sentiments are created.

### Evaluate the Classifier

The code below calculates the accuracy of the classifier when the training size is 0.9

In [1087]:
# Compare the predicted result to the original
movieTF_result = (sentiment_predictions == y_test.values)
# Calculate the accuracy
acc_09 = (movieTF_result).sum() / len(y_test.values)
acc_09

0.9228

#### Does the Size of Training Set Matters?

Using the codes above, by adjusting the size of testing set from the part of preprocessing the data, the investigation is led in order to study if the sizes of training set and testing set matter.

|Train Size|Test Size|Accuracy|
|-|-|-|
|0.1|0.9|0.8379|
|0.3|0.7|0.8764|
|0.5|0.5|0.8962|
|0.7|0.3|0.9059|
|0.9|0.1|0.9084|

From the table above, the training size and the accuracy are directly proportional. 

#### Misclassified Texts

The result with the training size of 0.9 represents that there are still about 0.27 amount of misclassified texts. In every experiment, examining the misclassified data is essential for further inquiry. Thus, this step is for finding a few misclassified texts, determining if the predicted probabilities are excessively far off, and analyzing what make them to be incorrectly sorted.

In [1088]:
# Show 5 misclassified reviews
misclassified = []
# Loop through the compared result
for i, tf in enumerate(movieTF_result):
# If not True
    if tf != True:
        misclassified.append((X_test.values[i], y_test.values[i], sentiment_predictions[i], i))
# The returned list represents (Movie Review, Original Sentiment, Predicted Sentiment, Review Number)
misclassified[:5]

[("In my years of attending film festivals, I have seen many little films like this that never get theatrical distribution, and they end up in the $3 bins at WalMart. I just found DVD of Yank Tanks there, great doc, but how sad for it to end up as a rock-bottom remainder.<br /><br />I loved this film, wish I'd seen it at the cinema in it's everything. I'd have preferred that New Yorker Films had translated the title directly. It's good for Americans to stretch a little. If the film's title helps the US audience to explore random chaos, all the better. Cinema imitates life & visa versa.<br /><br />Also, I found it distracting that the subtitles put prices in dollars. Come on! The euro is not hard to figure out, make the gringo audiences do the math. Seeing a film, especially one shot in Paris, the viewer should not have the effect spoiled by being reminded: I am an American watching a movie and they are translating the Euros into dollars for me. <br /><br />Looking forward to seeing mor

##### Probabilites of Misclassified Texts

The previous part produces the 2D array of misclassified texts and their original setiments. From the 2D array, 5 examples are chosen for the calculation of probabilites of them.

In [1089]:
# 5 misclassified texts
misclassified_ex = misclassified[:5]

In [1090]:
# The probabilities of misclassified text
mis_prob0 = rev_probs(misclassified_ex[0][0])
# Print Review Number
print(misclassified_ex[0][3])
mis_prob0

10


array([-388.90915935, -389.16376262])

For the first example, the predicted setiment is negative, but the original is positive.

In [1091]:
mis_prob1 = rev_probs(misclassified_ex[1][0])
print(misclassified_ex[1][3])
mis_prob1

11


array([-568.57661636, -564.51750241])

For the second example, the predicted setiment is negative, but the original is positive.

In [1092]:
mis_prob2 = rev_probs(misclassified_ex[2][0])
print(misclassified_ex[2][3])
mis_prob2

29


array([-481.48140245, -483.79963026])

For the third example, the predicted setiment is positive, but the original is negative.

In [1093]:
mis_prob3 = rev_probs(misclassified_ex[3][0])
print(misclassified_ex[3][3])
mis_prob3

32


array([-397.69439777, -397.03839223])

For the forth example, the predicted setiment is negative, but the original is positive.

In [1094]:
mis_prob4 = rev_probs(misclassified_ex[4][0])
print(misclassified_ex[4][3])
mis_prob4

36


array([-579.13246099, -577.66518058])

For the last example, the predicted setiment is negative, but the original is positive.

##### Any Reason?

For all 5 missorted examples, the negative and positive probabilites are exremely close. In order to reducdes the misclassified texts with those reasons, the classifier needs to be analyzed. Analyzing can help identify potential issues or areas for improvement in the classifier, so the first approach is removing the stopwords.

### Stopwords Removal

The first part of stopwords removal is loading the data of stopwords.

In [1095]:
# Drop the stopwords of Movie Reviews
with open('stopwords.txt') as f:
    stops = f.read()
# items to be removed
stops = stops.split(',')
wc = wc.drop(stops, errors='ignore')

Then, the naive bayes classifier is implemented, and this process includes removing the stopwords in each review.

In [1096]:
# Empty list to store sentiment predictions after stopwords removal for each review in the test dataset
sentiment_pred_stopwords = []
# Loop through each review in the test dataset
for review in tqdm(X_test.values):
# Predict the sentiment of the current review after removing stopwords
    prediction = pre_sentiment(review)
# Append the predicted sentiment after stopwords removal to the sentiment_predictions list
    sentiment_pred_stopwords.append(prediction)
# List of sentiment predictions after stopwords removal for all reviews in the test dataset
sentiment_pred_stopwords

100%|███████████████████████████████████████| 2500/2500 [02:44<00:00, 15.22it/s]


['negative',
 'positive',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',

Following code calculates the accuracy of the classifier after removing stopwords.

In [1097]:
movieTF_stopwords_result = (sentiment_pred_stopwords == y_test.values)
acc_09_stopwords = (movieTF_stopwords_result).sum() / len(y_test.values)
acc_09_stopwords

0.9228

After removing stopwords, the classifier was run, and the accuracy increased by approximately 0.0144.

In [350]:
# Show 5 misclassified reviews after stopwords removal
misclassified_stopwords = []
# Loop through the results after removing stopwors
for i, tf in enumerate(movieTF_stopwords_result):
# If not True
    if tf != True:
        misclassified_stopwords.append((X_test.values[i], y_test.values[i], sentiment_pred_stopwords[i], i))
# The returned list represents (Movie Review, Original Sentiment, Predicted Sentiment, Review Number)
misclassified_stopwords[:5]

[("In my years of attending film festivals, I have seen many little films like this that never get theatrical distribution, and they end up in the $3 bins at WalMart. I just found DVD of Yank Tanks there, great doc, but how sad for it to end up as a rock-bottom remainder.<br /><br />I loved this film, wish I'd seen it at the cinema in it's everything. I'd have preferred that New Yorker Films had translated the title directly. It's good for Americans to stretch a little. If the film's title helps the US audience to explore random chaos, all the better. Cinema imitates life & visa versa.<br /><br />Also, I found it distracting that the subtitles put prices in dollars. Come on! The euro is not hard to figure out, make the gringo audiences do the math. Seeing a film, especially one shot in Paris, the viewer should not have the effect spoiled by being reminded: I am an American watching a movie and they are translating the Euros into dollars for me. <br /><br />Looking forward to seeing mor

In [351]:
misclassified_stopwords_ex = misclassified_stopwords[:5]

In [352]:
mis_prob0_stop = rev_probs(misclassified_stopwords_ex[0][0])
print(misclassified_stopwords_ex[0][3])
mis_prob0_stop

10


array([-388.90915935, -389.16376262])

In [353]:
mis_prob1_stop = rev_probs(misclassified_stopwords_ex[1][0])
print(misclassified_stopwords_ex[1][3])
mis_prob1_stop

11


array([-568.57661636, -564.51750241])

In [354]:
mis_prob2_stop = rev_probs(misclassified_stopwords_ex[2][0])
print(misclassified_stopwords_ex[2][3])
mis_prob2_stop

29


array([-481.48140245, -483.79963026])

In [355]:
mis_prob3_stop = rev_probs(misclassified_stopwords_ex[3][0])
print(misclassified_stopwords_ex[3][3])
mis_prob3_stop

32


array([-397.69439777, -397.03839223])

In [356]:
mis_prob4_stop = rev_probs(misclassified_stopwords_ex[4][0])
print(misclassified_stopwords_ex[4][3])
mis_prob4_stop

36


array([-579.13246099, -577.66518058])

With the stopwords removal, a few of the misclassified texts have been correctly sorted. In case of 10th and 11th texts, they are still misclassified; however, the differences between their negative and positive probabilites have been decreased, which means that the classifier has been improved. The next approach for improvement is using negations. The negations can change the meaning of words and phrases, and it is significant for the classifier to understand them. Thus, by adding prefix "NOT_" in front of the negations words, such as "not" and "isn't", the classifier may be improved.

### Text Samples with Negations

#### How Negations Can Change the Meaning of Words and Phrases? (Add "NOT_" in front of Negation Words)

In [370]:
# Representing negation words using Regular Expressions
negation_words = r"\b\w*(?:not|n't)\b"
# Add prefix "NOT_" in front of all the negation words
def add_NOT(match):
    return "NOT_" + match.group()

In [371]:
# Empty list to store sentiment predictions with negating negation words
sentiment_pred_NOT = []
for review in tqdm(X_test.values):
    newReview = re.sub(negation_words, add_NOT, review, flags=re.IGNORECASE)
    prediction = pre_sentiment(newReview)
    sentiment_pred_NOT.append(prediction)
sentiment_pred_NOT

100%|███████████████████████████████████████| 2500/2500 [02:36<00:00, 15.95it/s]


['negative',
 'positive',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',

In [372]:
movieTF_NOT_result = (sentiment_pred_NOT == y_test.values)
acc_09_NOT = (movieTF_NOT_result).sum() / len(y_test.values)
acc_09_NOT

0.9264

The result of adding prefix "NOT_" seems successful; the accuracy is increased by approximately 0.0036.

In addition to the result above, this adding prefix method is applied to new random texts containing several negation words in order to accurately clarify that adding "NOT_" is unsuccessful method.

### Applying Classifier to New Texts

In [373]:
# Random Positive text with Negation words in it
text_NOT = "This is not a bad process. There are't any bugs in my house."

In [374]:
# Result without using negations
pre_sentiment(text_NOT)

'negative'

At the first time, even though the context of the "text_NOT" is not negative, the classifier labels the text as negative.

In [375]:
new_text_NOT = re.sub(negation_words, add_NOT, review, flags=re.IGNORECASE)
new_text_pred_NOT = pre_sentiment(new_text_NOT)
# Result with using negations
new_text_pred_NOT

'positive'

After going through adding "NOT_" prefix in front of all negation words, now the classifier correctly distinguishes the text as positive. Thus, it is proved that this method is appropriate.

### Most Frequent Words

As the last experiment on Movie Reviews, the code below produces five most frequent words for negative and positive reviews.

In [325]:
# Most Frequent words in the negative sentiment
wc.sort_values(by=['Negative'], ascending=[False])[:6]

Unnamed: 0,Negative,Positive
movie,21367,16355
film,16273,17780
one,11435,11777
,8896,8594
out,7985,7088
it's,7549,7538


In [326]:
# Most Frequent words in the positiive sentiment
wc.sort_values(by=['Positive'], ascending=[False])[:6]

Unnamed: 0,Negative,Positive
film,16273,17780
movie,21367,16355
one,11435,11777
,8896,8594
it's,7549,7538
very,5129,7370


Negative and positive reviews have a few common most frequent words. Since they are all movie reveiws, they both contain the word "movie" and "film" the most, and as the word "one" specifies certain movies or characters the reviews talk about, all reviews contain the word "one" as the third most frequent word. Then, there is one difference between negative and positive reviews. While negative reviews have the word "out" the fourth most frequent word, positive reviews have the word "very" on its fifth most frequent word. Since the word "very" often has positive expressions, the negative reviews have less amounts of the word "very".

## Conclusion

Overlal, this project have investigated the performance of the Naive Bayes classifier for text classification on two distinct datasets: (1) newsgroup posts and (2) movie reviews. This comprehensive study displays valuable insights into the strengths and limitations of the Naive Bayes classifier for various text classification tasks.

The analysis revealed that the classifier could effectively predict the sentiment of a movie review, and identify the newsgroup to which a post belongs. The impact of various factors on the classification accuracy, such as the size of the training set and the inclusion of stopwords are also explored. It was found that removing stopwords improved the performance of the classifier only on movie reviews, while increasing the size of the training set generally led to better classification accuracy for both datasets.

By examining misclassified examples, we have gained insights into the limitations of the Naive Bayes classifier and the challenges it faces in certain cases.

In conclusion, the Naive Bayes classifier has proven to be a valuable tool for text classification tasks. This study provides a strong foundation for future research and improvements. Exploring more sophisticated techniques to handle negations or investigating the performance of the Naive Bayes classifier will be supportive tool in combination with other machine learning algorithms. Ultimately, the findings from this project will contribute to the ongoing development of efficient and effective text classification systems for various applications.

## References

1.9. naive Bayes. scikit. (n.d.). Retrieved April 23, 2023, from https://scikit-learn.org/stable/modules/naive_bayes.html#:~:text=Naive%20Bayes%20methods%20are%20a,value%20of%20the%20class%20variable 