In [79]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [80]:
import zipfile

with zipfile.ZipFile("imdb-dataset-of-50k-movie-reviews.zip", 'r') as zip_ref:
    zip_ref.extractall("imdb_data")

In [81]:
import numpy as np 
import pandas as pd

# Read the CSV from the folder the archive was extracted to above ("imdb_data").
df = pd.read_csv("imdb_data/IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [82]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

### Text Cleaning

* Sample 1000 rows.
* Remove HTML tags.
* Remove special characters.
* Convert everything to lowercase.
* Remove stop words.
* Apply stemming.
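
The first cleaning steps listed above can be sketched as a single function. This is only a minimal illustration, not the notebook's actual pipeline: `STOP_WORDS` is a tiny hypothetical subset (the notebook uses NLTK's full English list), `clean_review` is a made-up helper name, and stemming is left out because it needs NLTK.

```python
import re

# Hypothetical, tiny stop-word subset for illustration only;
# the notebook itself uses NLTK's full English stop-word list.
STOP_WORDS = {"i", "a", "the", "this", "was", "of", "is", "it", "and", "to"}

TAG_RE = re.compile(r"<.*?>")  # non-greedy: matches one tag at a time


def clean_review(text):
    """Strip HTML tags, lowercase, blank out non-alphanumerics, drop stop words."""
    text = TAG_RE.sub("", text)
    text = text.lower()
    text = "".join(ch if ch.isalnum() else " " for ch in text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


sample = "A wonderful little production. <br /><br />The filming technique is GREAT!"
tokens = clean_review(sample)
print(tokens)
```

Returning a list of tokens mirrors the notebook, where the tokens are stemmed and then joined back into a string.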

In [83]:
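# Optional: uncomment to work on a random 1000-row sample.
# The outputs from the stop-word step onward were produced with such a sample.
# df = df.sample(1000)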

In [84]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [85]:
df.shape

(50000, 2)

In [86]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [87]:
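# Encode the labels manually: positive -> 1, negative -> 0.
# Assigning the result back avoids a chained in-place replace, which newer pandas versions warn about.
df['sentiment'] = df['sentiment'].map({"positive": 1, "negative": 0})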

In [88]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [89]:
df.tail()

Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0
49999,No one expects the Star Trek movies to be high...,0


In [90]:
# Remove the HTML tags from a single row first to check that the pattern works.
import re 

clean = re.compile("<.*?>")
re.sub(clean, '', df.iloc[2].review)

'I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.This was the most I\'d laughed at one of Woody\'s comedies in years (dare I say a decade?). While I\'ve never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.'

In [91]:
# Now apply the same pattern to the whole column.
# Function to strip HTML tags from one review.
def clean_html(text):
    clean = re.compile("<.*?>")
    return re.sub(clean, '', text)

df['review'] = df['review'].apply(clean_html)
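
The `?` in `<.*?>` matters. A short sketch on a made-up string (the `html` value is an assumption for illustration) compares the non-greedy pattern used above with its greedy counterpart:

```python
import re

# Hypothetical sample text, not taken from the dataset.
html = "Great film.<br /><br />Would <b>watch</b> again."

# Non-greedy: each match stops at the first ">", so only the tags disappear.
non_greedy = re.sub(r"<.*?>", "", html)

# Greedy: one match runs from the first "<" to the last ">", eating real text.
greedy = re.sub(r"<.*>", "", html)

print(non_greedy)  # Great film.Would watch again.
print(greedy)      # Great film. again.
```

Replacing tags with an empty string can fuse neighbouring words ("film.Would"). In this notebook the later special-character step turns the period into a space, so the words separate again.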

In [92]:
# Converting to lower case. 
def convert_to_lower(text):
    return text.lower()

df['review'] = df['review'].apply(convert_to_lower)

In [93]:
# Function to remove special characters:
# every non-alphanumeric character is replaced with a space.
def remove_special(text):
    return "".join(ch if ch.isalnum() else " " for ch in text)

remove_special('th%e @ classic use of the word.it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare. forget pretty pictures painted for mainstream audiences, f')

'th e   classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary  it focuses mainly on emerald city  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  em city is home to many  aryans  muslims  gangstas  latinos  christians  italians  irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare  forget pretty pictures painted for mainstream audiences  f'

In [94]:
# Now, Applying on the whole text. 
df['review'] = df['review'].apply(remove_special)

In [95]:
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production  the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there s a family where a little boy ...
4        petter mattei s  love in the time of money  is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot  bad dialogue  bad acting  idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i m going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [96]:
# Remove the stop words.
# If the corpus is missing locally, run nltk.download("stopwords") once first.
import nltk
from nltk.corpus import stopwords

stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [97]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production the filming tec...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically there s a family where a little boy ...,0
4,petter mattei s love in the time of money is...,1
...,...,...
49995,i thought this movie did a down right good job...,1
49996,bad plot bad dialogue bad acting idiotic di...,0
49997,i am a catholic taught in parochial elementary...,0
49998,i m going to have to disagree with the previou...,0


In [None]:
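# Build the stop-word set once; calling stopwords.words() for every token is very slow.
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    # Split on whitespace and keep only the tokens that are not stop words.
    return [word for word in text.split() if word not in stop_words]

df['review'] = df['review'].apply(remove_stopwords)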

In [None]:
df

Unnamed: 0,review,sentiment
37954,"[genie, played, shaq, name, kazaam, whack, rhy...",0
28433,"[barney, idiot, dinosaur, unfortunaltely, go, ...",0
25957,"[apology, movie, absolutely, nothing, rachel, ...",0
29087,"[firstly, good, things, film, cliche, slasher,...",0
36502,"[cannot, get, awful, movie, eyes, want, jump, ...",0
...,...,...
5074,"[loved, show, waiting, come, dvd, never, anyon...",1
33042,"[canadian, filmmaker, mary, harron, cultural, ...",1
9719,"[film, version, sandra, bernhard, one, woman, ...",0
23892,"[one, thing, sure, watch, film, bad, day, stor...",1


In [None]:
# Stemming is a text normalization technique that reduces words to a root form by stripping suffixes.
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

def stem_word(tokens):
    # Build the result locally so the function does not depend on a global list named `y`.
    return [porter_stemmer.stem(token) for token in tokens]

# Testing.
# stem_word(["I", "Loved", "Loving", "it"])
df["review"] = df["review"].apply(stem_word)

In [None]:
df

Unnamed: 0,review,sentiment
37954,"[geni, play, shaq, name, kazaam, whack, rhyme,...",0
28433,"[barney, idiot, dinosaur, unfortunalt, go, ext...",0
25957,"[apolog, movi, absolut, noth, rachel, griffith...",0
29087,"[firstli, good, thing, film, clich, slasher, s...",0
36502,"[cannot, get, aw, movi, eye, want, jump, head,...",0
...,...,...
5074,"[love, show, wait, come, dvd, never, anyon, kn...",1
33042,"[canadian, filmmak, mari, harron, cultur, gadf...",1
9719,"[film, version, sandra, bernhard, one, woman, ...",0
23892,"[one, thing, sure, watch, film, bad, day, stor...",1


In [None]:
# Join back. 
def join_back(list_input):
    return " ".join(list_input)

df['review'] = df["review"].apply(join_back)
df['review']

37954    geni play shaq name kazaam whack rhyme corni l...
28433    barney idiot dinosaur unfortunalt go extinct d...
25957    apolog movi absolut noth rachel griffith must ...
29087    firstli good thing film clich slasher stuff co...
36502    cannot get aw movi eye want jump head ear gush...
                               ...                        
5074     love show wait come dvd never anyon know get s...
33042    canadian filmmak mari harron cultur gadfli who...
9719     film version sandra bernhard one woman broadwa...
23892    one thing sure watch film bad day stori base a...
41220    extraordinari film music made feel aw rodrigu ...
Name: review, Length: 1000, dtype: object

In [None]:
# Separate the data into X and y.
# X is still the raw review text here, not numeric features.
X = df.iloc[:, 0:1].values
y = df.iloc[:, -1].values

X.shape, y.shape

((1000, 1), (1000,))

In [None]:
# Implementing the CountVectorizer. 
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer(max_features=2500)

X=cv.fit_transform(df['review']).toarray()
X.shape

(1000, 2500)
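
To make the `(1000, 2500)` shape concrete, here is a plain-Python sketch of the bag-of-words idea with a vocabulary cap. It is not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, tokens are split on whitespace only (`CountVectorizer` by default also drops single-character tokens), and ties at the vocabulary cut-off may be resolved differently.

```python
from collections import Counter


def bag_of_words(docs, max_features):
    """Return (vocabulary, count rows), keeping only the most frequent tokens."""
    tokenized = [doc.split() for doc in docs]
    # Total frequency of each token across the whole corpus.
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Keep the max_features most frequent tokens, then order columns alphabetically.
    vocab = sorted(tok for tok, _ in totals.most_common(max_features))
    index = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            j = index.get(tok)
            if j is not None:  # tokens outside the vocabulary are ignored
                row[j] += 1
        rows.append(row)
    return vocab, rows


# Toy, already-stemmed documents (made up for illustration).
docs = ["love movi love stori", "bad movi bad plot bad act", "love act"]
vocab, X_demo = bag_of_words(docs, max_features=4)
print(vocab)   # ['act', 'bad', 'love', 'movi']
print(X_demo)  # [[0, 0, 2, 1], [1, 3, 0, 1], [1, 0, 1, 0]]
```

Each row has one column per kept token, which is why the matrix above has one row per review and `max_features` columns.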

In [None]:
X[0].mean()

0.0228

In [None]:
# train test split. 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((800, 1), (800,), (200, 1), (200,))

In [None]:
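# Naive Bayes needs numeric features. The ValueError below (and the (800, 1) split shapes above)
# show that X still held raw review strings when this cell ran; re-run the CountVectorizer
# and train/test split cells in order before fitting.
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB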

clf1=GaussianNB()
clf2=MultinomialNB()
clf3=BernoulliNB()

clf1.fit(X_train,y_train)
clf2.fit(X_train,y_train)
clf3.fit(X_train,y_train)

ValueError: could not convert string to float: 'sit home flip channel ran across potenti sound like interest film like destruct type movi decid watch know end watch whole 2 hour seen type movi know mani time back 1998 2000 dozen film dealt global destruct sort best one list far deep impact believ one problem film 1 cheap special effect like someth old comput 2 background inform explan weather pattern go make movi weather least decenc entertain viewer technic detail 3 come 2 3 peopl figur storm converg chicago expert left field 4 interest charact truli care anyon except mayb pregnant woman felt charact develop 5 thought provok moment ever factual incorrect theme first part film bet conclus show us destruct scene search rescu oper like done mani time judg special effect first part movi imagin expect cours end main charact surviv life go origin'

In [None]:
y_pred1=clf1.predict(X_test)
y_pred2=clf2.predict(X_test)
y_pred3=clf3.predict(X_test)

ValueError: could not convert string to float: 'jefferey dahmer one sick guy much say alreadi said except mani documentari film made probabl better one ridicul cheesi cheesi guy post whole film youtub ad annot make viewer laugh carl crew star serial killer jeffrey dahmer kill spree began 1978 young guy dahmer want friend final 1991 man wish sex eat bother watch whole film basic documentari show attack dahmer pull got caught sinc film made 1993 one year dahmer bludgeon death fellow inmat death dahmer shown probabl would cheesi chees fest 1 10'

In [None]:
from sklearn.metrics import accuracy_score
print("Gaussian",accuracy_score(y_test,y_pred1))
print("Multinomial",accuracy_score(y_test,y_pred2))
print("Bernoulli",accuracy_score(y_test,y_pred3))

NameError: name 'y_pred1' is not defined
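
For intuition about why these classifiers expect count features rather than strings, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is a simplified illustration, not scikit-learn's code: the function names, the toy matrix and the three-word vocabulary are all assumptions made for this example.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word over all documents of class c.
        counts = [sum(col) for col in zip(*rows)]
        total = sum(counts) + alpha * n_features  # Laplace-smoothed denominator
        log_likelihood[c] = [math.log((n + alpha) / total) for n in counts]
    return classes, log_prior, log_likelihood


def predict_multinomial_nb(model, x):
    """Pick the class with the highest log posterior for one count vector."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)


# Toy count matrix; columns stand for the hypothetical words ["bad", "good", "love"].
X_toy = [[0, 2, 1], [0, 1, 2], [3, 0, 0], [2, 1, 0]]
y_toy = [1, 1, 0, 0]

model = fit_multinomial_nb(X_toy, y_toy)
pred_a = predict_multinomial_nb(model, [0, 1, 1])  # "good love"
pred_b = predict_multinomial_nb(model, [2, 0, 0])  # "bad bad"
print(pred_a, pred_b)  # 1 0
```

The arithmetic only works on numeric counts, which is consistent with the `could not convert string to float` errors above when raw text is passed in.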