## Naive Bayes Model

The given data is a subset of [the IMDB movie review dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
#from sklearn.model_selection import cross_val_score

In [20]:
# Import the dataset
import pandas as pd
from google.colab import files
file = files.upload()
df = pd.read_csv("IMDB Dataset_subset.csv")
df.head()

Saving IMDB Dataset_subset.csv to IMDB Dataset_subset (2).csv


Unnamed: 0,review,sentiment
0,I really liked this Summerslam due to the look...,positive
1,Not many television shows appeal to quite as m...,positive
2,The film quickly gets to a major chase scene w...,negative
3,Jane Austen would definitely approve of this o...,positive
4,Expectations were somewhat high for me when I ...,negative


In [21]:
# Packages required for preprocessing #
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer #for lemmatization
import re #regular expression package
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [22]:
X = [row for row in df['review']] #list of reviews
classes = df['sentiment'] #list of true classes

In [23]:
# Pre-process the data
reviews = []
lemmatizer = WordNetLemmatizer()
for raw_review in X:
    # part 1
    review = re.sub(r'[\W_]', ' ', str(raw_review))  # special characters and underscores -> space
    review = re.sub(r'\s+[a-zA-Z]\s+', ' ', review)  # single letters surrounded by whitespace
    review = re.sub(r'^[a-zA-Z]\s+', ' ', review)  # single letter at the start of the text
    review = re.sub(r'\s+', ' ', review)  # collapse repeated whitespace
    review = re.sub(r'^b\s+', '', review)  # leading 'b' left over when a review was stored as bytes
    review = review.lower()
    review = re.sub(r'[0-9]+', '', review)  # digits
    # part 2
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review]
    review = ' '.join(review)
    reviews.append(review)

Part 1: We clean the raw text so that it can be used for analysis. The first line replaces special characters, punctuation, and underscores with spaces. The second and third lines remove single-letter words (in the middle of the text and at its start, respectively) and replace them with a space. The fourth line collapses runs of whitespace into a single space. The fifth line removes a leading 'b', which appears when a review record was stored as bytes. The sixth line converts the text to lower case, and the seventh removes digits.

Part 2: We split the cleaned text into individual words, lemmatize each one, and join them back together. The `split()` method separates the text on whitespace and returns a list of words. `review = [lemmatizer.lemmatize(word) for word in review]` reduces each word to its dictionary form (lemma). Because `lemmatize` is called without a part-of-speech argument, every word is treated as a noun, so this step mainly maps plurals to singulars and leaves verb tenses unchanged (for example, both 'abandon' and 'abandoned' remain in the vocabulary shown further down). `review = ' '.join(review)` combines the lemmatized words back into a single string separated by spaces.

Both parts are pre-processing steps that run before we build the model.
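To make the cleaning steps concrete, here is a minimal sketch that applies the part 1 substitutions to one made-up sentence. The helper name `clean_text` and the sample string are assumptions for illustration, not part of the notebook. The sketch anchors the third pattern at the start of the string, as the description above intends, and it skips lemmatization so that it runs without downloading NLTK data.

```python
import re

def clean_text(text):
    """Apply the part 1 cleaning steps to a single string (hypothetical helper)."""
    text = re.sub(r'[\W_]', ' ', str(text))      # special characters and underscores -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # single letters surrounded by whitespace
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)    # single letter at the start of the text
    text = re.sub(r'\s+', ' ', text)             # collapse repeated whitespace
    text = re.sub(r'^b\s+', '', text)            # leading 'b' from a bytes literal
    text = text.lower()
    text = re.sub(r'[0-9]+', '', text)           # digits
    # Part 2 without lemmatization: split on whitespace and re-join with single spaces
    return ' '.join(text.split())

# Made-up review text, not taken from the dataset
sample = "B movie?? Not at all! Rated 10/10 by a friend<br />truly_amazing."
print(clean_text(sample))
# movie not at all rated by friend br truly amazing
```

Note that HTML remnants such as `br` survive the cleaning as ordinary words; whether they reach the model depends on the vectorizer settings used afterwards.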

In [24]:
# Continue with pre-processing
vectorizer = CountVectorizer(stop_words = "english", max_df=0.7, min_df=5)
texts = vectorizer.fit_transform(reviews).toarray()
vocab = vectorizer.vocabulary_
vocab = sorted(vocab.items(), key = lambda x: x[1])
vocab = [v[0] for v in vocab]

Texts: a matrix of word counts with one row per review and one column per vocabulary word. Each entry is the number of times that word appears in that review.

Vocab: the list of distinct words kept by the vectorizer, each listed once and ordered by column index, so vocab[j] names column j of texts. English stop words, words that appear in more than 70% of the reviews (max_df=0.7), and words that appear in fewer than 5 reviews (min_df=5) are excluded.

"Texts" and "vocab" work together: "texts" records how often each word from "vocab" appears in each review, which turns the text into a numerical representation that a model can be trained on. Most entries are zero, because any single review uses only a small fraction of the vocabulary.
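The relationship between the count matrix and the vocabulary can be sketched on a toy corpus. This example uses only the standard library as a stand-in for `CountVectorizer`, so it applies no stop-word, `max_df`, or `min_df` filtering; the three short documents and the names `toy_vocab` and `toy_texts` are made up for illustration.

```python
from collections import Counter

# Made-up documents, not taken from the dataset
docs = [
    "great movie great cast",
    "boring movie",
    "great story",
]

# Vocabulary: every distinct word, sorted alphabetically so each word gets a fixed column
toy_vocab = sorted({word for doc in docs for word in doc.split()})

# Count matrix: one row per document, one column per vocabulary word
toy_texts = []
for doc in docs:
    counts = Counter(doc.split())
    toy_texts.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_texts:
    print(row)
# ['boring', 'cast', 'great', 'movie', 'story']
# [0, 1, 2, 1, 0]
# [1, 0, 0, 1, 0]
# [0, 0, 1, 0, 1]
```

Column 2 corresponds to 'great', so the 2 in the first row says that 'great' occurs twice in the first document.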

In [25]:
vocab[:10]

['aamir',
 'aaron',
 'abandon',
 'abandoned',
 'abbas',
 'abbott',
 'abc',
 'abducted',
 'abe',
 'abel']

In [26]:
texts

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [27]:
X_train, X_test, y_train, y_test = train_test_split(texts, classes, test_size=0.20, random_state=42)

In [28]:
X_train.shape

(4000, 9768)

In [29]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train,y_train)
pred_text = clf.predict(X_test)

In [30]:
from sklearn.metrics import accuracy_score, classification_report

#on the training set
train_predictions = clf.predict(X_train)
training_accuracy = accuracy_score(y_train, train_predictions)

# held-out test set (labelled "validation" in the report)
validation_predictions = clf.predict(X_test)
validation_accuracy = accuracy_score(y_test, validation_predictions)


In [31]:
# Print the results
print("Train data Accuracy:", training_accuracy)
print("Test data Accuracy:", validation_accuracy)

#classification report for the validation set
neg_po = ["Negative", "Positive"]
validation_report = classification_report(y_test, pred_text, target_names=neg_po)
print("\nValidation Classification Report:\n", validation_report)

Train data Accuracy: 0.9215
Test data Accuracy: 0.831

Validation Classification Report:
               precision    recall  f1-score   support

    Negative       0.83      0.84      0.83       506
    Positive       0.84      0.82      0.83       494

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000



An accuracy of 83.1% tells us that the model performs reasonably well on the test data, although it is less accurate than on the training data (92.15%).

This gap is expected: the test data is new to the model, whereas the training accuracy is measured on the same reviews the model was fitted on. A drop of about 9 percentage points suggests some overfitting to the training data.

The F1-scores (0.83 for both classes) combine precision and recall into a single number, and show that the model performs about equally well on negative and positive reviews.

Recall is roughly 84% for "Negative" and 82% for "Positive", which means the model correctly identifies about 84% of the actual negative reviews and 82% of the actual positive reviews.
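As a quick check on how the F1-score relates to precision and recall, the sketch below recomputes it as their harmonic mean from the rounded values in the classification report. The helper name `f1_from` is an assumption for illustration; because the inputs are already rounded to two decimals, the results are approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) pairs copied from the classification report
reported = {
    "Negative": (0.83, 0.84),
    "Positive": (0.84, 0.82),
}

f1_scores = {label: round(f1_from(p, r), 2) for label, (p, r) in reported.items()}
print(f1_scores)
# {'Negative': 0.83, 'Positive': 0.83}
```

Both values match the f1-score column of the report, which is what we expect when precision and recall are close to each other.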