# NLP Movie Review

In [1]:
import pandas as pd
import numpy as np
df=pd.read_csv('IMDB Dataset.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [2]:
df['review'][100].lower()

"this short film that inspired the soon-to-be full length feature - spatula madness - is a hilarious piece that contends against similar cartoons yielding multiple writers. the short film stars edward the spatula who after being fired from his job, joins in the fight against the evil spoons. this premise allows for some funny content near the beginning, but is barely present for the remainder of the feature. this film's 15-minute running time is absorbed by some odd-ball comedy and a small musical number. unfortunately not much else lies below it. the plot that is set up doesn't really have time to show. but it's surely follows it plot better than many high-budget hollywood films. this film is worth watching at least a few times. take it for what it is, and don't expect a deep story."

In [3]:
df['review']=df['review'].str.lower()
df.head(5)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


## Preprocessing

Lower Case and Strip Punctuation

In [4]:
import re

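def preprocess(text):
    # Lowercase, then drop every character that is not a word character or whitespace.
    # Note: this also strips '<', '/' and '>', so '<br />' is reduced to 'br'.
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text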

In [5]:
df['review']=df['review'].apply(preprocess)
df.head(5)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


Remove HTML Tags using Regex

In [6]:
def remove_html_tags(text):
    clean_pattern = re.compile('<.*?>')
    return clean_pattern.sub('', text)

In [7]:
df['review']=df['review'].apply(remove_html_tags)
df.head(5)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
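The table above still contains `br br` because the earlier `preprocess` step had already stripped `<`, `/` and `>`, leaving nothing for the tag pattern to match. A minimal sketch of one possible ordering, in which tags are removed before punctuation; the `clean_review` helper and the whitespace collapsing are assumptions, not part of the notebook:

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')

def clean_review(text):
    # Hypothetical helper: strip HTML tags first, while '<' and '>' still exist.
    text = TAG_PATTERN.sub(' ', text)
    # Then lowercase and drop punctuation, as preprocess() does.
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Collapse the extra whitespace left behind by the removed tags.
    return ' '.join(text.split())

sample = 'A wonderful little production. <br /><br />The filming'
print(clean_review(sample))
```

Replacing each tag with a space rather than an empty string keeps words on either side of a `<br />` from fusing together.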


Removing URLs

In [8]:
def remove_urls(text):
    clean_pattern = re.compile(r'https?://\S+|www\.\S+')
    return clean_pattern.sub('', text)

In [9]:
df['review']=df['review'].apply(remove_urls)
df.head(5)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
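The URL pattern relies on `://` and `.` still being present, so it can only match if it runs before punctuation is stripped. A small check on a made-up string (the sample text is invented, not taken from the dataset):

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_urls(text):
    return URL_PATTERN.sub('', text)

# Made-up sample text for illustration only.
sample = 'trailer at https://example.com/watch?v=1 or www.example.org today'

# Case 1: URLs removed while punctuation is still intact.
cleaned = remove_urls(sample)
print(cleaned.split())

# Case 2: punctuation stripped first, so the pattern no longer matches anything.
stripped = re.sub(r'[^\w\s]', '', sample)
print(remove_urls(stripped) == stripped)
```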


Removing Punctuation

In [10]:
import string
exclude=string.punctuation

def remove_punctuation(text):
    for char in exclude:
        text=text.replace(char,'')
    return text

In [11]:
df['review']=df['review'].apply(remove_punctuation)
df.head(5)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
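The loop above calls `str.replace` once per punctuation character. One common alternative is a single pass with `str.translate`; the sketch below assumes the same `string.punctuation` set, and the name `remove_punctuation_fast` is hypothetical:

```python
import string

# Translation table that maps every ASCII punctuation character to None.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_fast(text):
    # One pass over the string instead of one replace() call per character.
    return text.translate(PUNCT_TABLE)

def remove_punctuation_loop(text):
    # Same logic as the notebook's version, kept here for comparison.
    for char in string.punctuation:
        text = text.replace(char, '')
    return text

sample = "it's great!!! (really)"
print(remove_punctuation_fast(sample))
```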


Removing stopwords

In [12]:
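from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus: nltk.download('stopwords')
# Use a separate name so the imported stopwords module is not overwritten.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    clean_text = ' '.join([word for word in text.split() if word not in stop_words])
    return clean_text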

In [13]:
df['review']=df['review'].apply(remove_stopwords)
df.head(5)

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production br br filming tech...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive
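For sentiment analysis it may be worth checking which words the stopword list removes: NLTK's English list appears to include negations such as `not` and `no`, and dropping them turns "not good" into "good". A sketch of one way to keep them, using a tiny stand-in list so it runs without NLTK; `NEGATIONS` and `remove_stopwords_keep_negations` are hypothetical names:

```python
# Tiny illustrative stopword list; the notebook uses NLTK's full English list.
STOP_WORDS = {'this', 'is', 'a', 'not', 'no', 'was', 'the'}

# Assumed set of negation words to preserve.
NEGATIONS = {'not', 'no', 'nor', 'never'}

def remove_stopwords_keep_negations(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    # Remove stopwords, except the negations listed in `keep`.
    drop = set(stop_words) - set(keep)
    return ' '.join(word for word in text.split() if word not in drop)

print(remove_stopwords_keep_negations('this is not a good movie'))
```

Whether keeping negations helps depends on the downstream model; averaged word vectors, as used later, lose word order either way.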


Handling Emojis

In [14]:
def remove_emoji(text):
    emoji_pattern = re.compile("["
                            u"\U0001F600-\U0001F64F"  # emoticons
                            u"\U0001F300-\U0001F5FF"  # symbols and pictographs
                            u"\U0001F680-\U0001F6FF"  # transport and map symbols
                            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                            u"\U00002702-\U000027B0"  # dingbats
                            u"\U00002FC2-\U0001F251"  # broad range; also covers CJK text
                            "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [15]:
df['review']=df['review'].apply(remove_emoji)
df.head(5)

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production br br filming tech...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive
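The last range in the pattern above (U+2FC2 to U+1F251) spans far more than emoji, including the CJK ideograph blocks, so non-English text would be deleted as well. The sketch below compares it with a narrower set of ranges; the exact selection of ranges is an assumption and will not cover every emoji:

```python
import re

# The broad range from the notebook's pattern, isolated for comparison.
BROAD = re.compile('[\U00002FC2-\U0001F251]+')

# Narrower ranges; the exact selection is an assumption.
EMOJI_PATTERN = re.compile(
    '['
    '\U0001F600-\U0001F64F'  # emoticons
    '\U0001F300-\U0001F5FF'  # symbols and pictographs
    '\U0001F680-\U0001F6FF'  # transport and map symbols
    '\U0001F1E0-\U0001F1FF'  # flags
    '\U00002702-\U000027B0'  # dingbats
    '\U0001F900-\U0001F9FF'  # supplemental symbols and pictographs
    ']+'
)

def remove_emoji_narrow(text):
    return EMOJI_PATTERN.sub('', text)

# A grinning-face emoji followed by two CJK characters.
sample = 'great movie \U0001F600 \u6620\u753b'
print(remove_emoji_narrow(sample))
print(repr(BROAD.sub('', '\u6620\u753b')))
```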


Tokenization & POS Tagging

In [16]:
import spacy
nlp=spacy.load('en_core_web_sm')

text = df['review'][0]
doc = nlp(text)

for token in doc:
    print(token.text)

for token in doc:
    print(f"{token.text:{15}} {token.pos_:{8}} {token.dep_}")

one
reviewers
mentioned
watching
1
oz
episode
you
ll
hooked
right
exactly
happened
mebr
br
first
thing
struck
oz
brutality
unflinching
scenes
violence
set
right
word
go
trust
show
faint
hearted
timid
show
pulls
punches
regards
drugs
sex
violence
hardcore
classic
use
wordbr
br
called
oz
nickname
given
oswald
maximum
security
state
penitentary
focuses
mainly
emerald
city
experimental
section
prison
cells
glass
fronts
face
inwards
privacy
high
agenda
em
city
home
manyaryans
muslims
gangstas
latinos
christians
italians
irish
moreso
scuffles
death
stares
dodgy
dealings
shady
agreements
never
far
awaybr
br
would
say
main
appeal
show
due
fact
goes
shows
would
nt
dare
forget
pretty
pictures
painted
mainstream
audiences
forget
charm
forget
romanceoz
does
nt
mess
around
first
episode
ever
saw
struck
nasty
surreal
could
nt
say
ready
watched
developed
taste
oz
got
accustomed
high
levels
graphic
violence
violence
injustice
crooked
guards
who
ll
sold
nickel
inmates
who
ll
kill
order
get
away
well
ma

Stemming

In [17]:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

def stemming(text):
    stem_text=' '.join([ps.stem(word) for word in text.split()])
    return stem_text

In [18]:
df['review']=df['review'].apply(stemming)
df.head(5)

Unnamed: 0,review,sentiment
0,one review mention watch 1 oz episod youll hoo...,positive
1,wonder littl product br br film techniqu unass...,positive
2,thought wonder way spend time hot summer weeke...,positive
3,basic there famili littl boy jake think there ...,negative
4,petter mattei love time money visual stun film...,positive


Bag of Words

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer(max_features=5000)
x=cv.fit_transform(df['review']).toarray()
x.shape

(50000, 5000)

TF-IDF

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf=TfidfVectorizer(max_features=5000)
x=tfidf.fit_transform(df['review']).toarray()
x.shape

(50000, 5000)
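With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), each raw count is multiplied by `idf(t) = ln((1 + n) / (1 + df(t))) + 1` and every row is then scaled to unit length. The sketch below reproduces that by hand on an invented toy corpus and compares it with `TfidfVectorizer`; it assumes the defaults have not been changed:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration.
docs = ['good movie', 'bad movie', 'good good plot']

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then L2-normalise each row.
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```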

Word2Vec

In [21]:
from gensim.models import Word2Vec

w2v_sentences = [text.split() for text in df['review']]
w2v_model = Word2Vec(sentences=w2v_sentences, vector_size=100, window=5, min_count=2)

w2v_model.save('w2vec_model')
w2v_model = Word2Vec.load('w2vec_model')

print(f'vector of word similar to "good" : {w2v_model.wv.most_similar("good")}')
print(f"vector of word doesnt match:{w2v_model.wv.doesnt_match(['good', 'amazing', 'great', 'wonderful', 'bad', 'terrible'])}")

vector of word similar to "good" : [('decent', 0.7662745714187622), ('great', 0.7098829746246338), ('bad', 0.6865018606185913), ('cool', 0.6199120879173279), ('alright', 0.6183957457542419), ('nice', 0.6104507446289062), ('excel', 0.5941542387008667), ('fine', 0.5864112377166748), ('okay', 0.5726799964904785), ('halfbad', 0.5655313730239868)]
vector of word doesnt match:wonderful


Convert Each Review to a Single Vector

In [22]:
def get_review_vector(review, w2v_model):
    if not review:
        return np.zeros(w2v_model.vector_size)
    
    words = (review).split()
    valid_words = [word for word in words if word in w2v_model.wv.key_to_index]

    if len(valid_words) == 0:
        return np.zeros(w2v_model.vector_size)

    vector = np.mean([w2v_model.wv[word] for word in valid_words], axis=0)
    return vector / np.linalg.norm(vector) # normalize vector

# apply to all reviews
df['review_vector'] = df['review'].apply(lambda x: get_review_vector(x, w2v_model))
print(df['review_vector'].head())
    

0    [-0.0015930779, -0.031668793, 0.035055637, -0....
1    [-0.027537, 0.04877297, -0.08130531, -0.006986...
2    [0.067163, 0.027719958, -0.057168644, -0.00580...
3    [-0.017885853, -0.045905422, 0.054175206, -0.0...
4    [0.02839853, 0.011918786, 0.0023428812, -0.018...
Name: review_vector, dtype: object
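The averaging step can be illustrated without a trained model. The sketch below uses hypothetical 3-dimensional embeddings in place of `w2v_model.wv`, and adds a guard for a zero-length mean, which the function above would divide by:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for w2v_model.wv.
toy_vectors = {
    'good': np.array([1.0, 0.0, 0.0]),
    'movi': np.array([0.0, 1.0, 0.0]),
}

def mean_pool(tokens, vectors, dim=3):
    # Keep only tokens that have an embedding.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    mean = np.mean(known, axis=0)
    norm = np.linalg.norm(mean)
    # Guard against a zero-length mean before normalising.
    return mean / norm if norm > 0 else mean

vec = mean_pool('good movi unseenword'.split(), toy_vectors)
print(vec, np.linalg.norm(vec))
```

Out-of-vocabulary tokens are skipped, so a review made up entirely of unseen words maps to the zero vector.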


Train Test Split

In [23]:
from sklearn.model_selection import train_test_split

X = np.array([get_review_vector(review, w2v_model) for review in df['review']])

y = df['sentiment'].map({'positive':1, 'negative':0}).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Test labels shape: {y_test.shape}")

Training data shape: (40000, 100)
Test data shape: (10000, 100)
Training labels shape: (40000,)
Test labels shape: (10000,)
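The split above is random, so class proportions in the test set can drift slightly from those of the full dataset. Passing `stratify=y` keeps them fixed. A sketch on synthetic stand-in data (the arrays below are invented, not the review vectors):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the review vectors: 100 samples, balanced labels.
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep the 50/50 class balance of y_demo.
print(y_tr.mean(), y_te.mean())
```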


## ML models
* Logistic Regression
* SVM
* Random Forest
* Gradient Boosting

Logistic Regression

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

model_lore = LogisticRegression()

model_lore.fit(X_train, y_train)
lore_pred = model_lore.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, lore_pred)}") # Accuracy report
print(f"Classification report: {classification_report(y_test, lore_pred)}") # Classification report
print(f"Confusion matrix: {confusion_matrix(y_test, lore_pred)}") # Confusion matrix

lore_score = cross_val_score(model_lore, X, y, cv=5)
print(f"Cross validation score: {lore_score.mean():.4f} (+/- {lore_score.std() * 2:.4f})")

Accuracy: 0.8561
Classification report:               precision    recall  f1-score   support

           0       0.86      0.85      0.85      4961
           1       0.86      0.86      0.86      5039

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000

Confusion matrix: [[4228  733]
 [ 706 4333]]
Cross validation score: 0.8592 (+/- 0.0076)
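The reported metrics can be recomputed by hand from the confusion matrix above, which helps when reading the report. scikit-learn puts true labels on the rows and predictions on the columns, so the matrix reads `[[TN, FP], [FN, TP]]` for labels 0 and 1:

```python
# Confusion matrix reported above; rows are true labels, columns are predictions.
tn, fp = 4228, 733
fn, tp = 706, 4333

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Metrics for the positive class (label 1).
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f'accuracy={accuracy:.4f} precision={precision_pos:.2f} '
      f'recall={recall_pos:.2f} f1={f1_pos:.2f}')
```

The row sums (4961 and 5039) match the `support` column of the classification report.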


SVM

In [25]:
from sklearn.svm import SVC

model_svc = SVC(kernel='rbf', gamma='auto', random_state=42)

model_svc.fit(X_train, y_train)
svc_pred = model_svc.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, svc_pred)}") # Accuracy report
print(f"Classification report: {classification_report(y_test, svc_pred)}") # Classification report
print(f"Confusion matrix: {confusion_matrix(y_test, svc_pred)}") # Confusion matrix

svc_score = cross_val_score(model_svc, X, y, cv=5)
print(f"Cross validation score: {svc_score.mean():.4f} (+/- {svc_score.std() * 2:.4f})")

Accuracy: 0.8457
Classification report:               precision    recall  f1-score   support

           0       0.85      0.84      0.84      4961
           1       0.84      0.85      0.85      5039

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000

Confusion matrix: [[4170  791]
 [ 752 4287]]
Cross validation score: 0.8482 (+/- 0.0093)


Random Forest

In [26]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)

model_rf.fit(X_train, y_train)
rf_pred = model_rf.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, rf_pred)}") # Accuracy report
print(f"Classification report: {classification_report(y_test, rf_pred)}") # Classification report
print(f"Confusion matrix: {confusion_matrix(y_test, rf_pred)}") # Confusion matrix

rf_score = cross_val_score(model_rf, X, y, cv=5)
print(f"Cross validation score: {rf_score.mean():.4f} (+/- {rf_score.std() * 2:.4f})")

Accuracy: 0.8239
Classification report:               precision    recall  f1-score   support

           0       0.83      0.80      0.82      4961
           1       0.81      0.84      0.83      5039

    accuracy                           0.82     10000
   macro avg       0.82      0.82      0.82     10000
weighted avg       0.82      0.82      0.82     10000

Confusion matrix: [[3993  968]
 [ 793 4246]]
Cross validation score: 0.8300 (+/- 0.0016)


Gradient Boosting

In [27]:
from sklearn.ensemble import GradientBoostingClassifier

model_gb = GradientBoostingClassifier(n_estimators=100, random_state=42, max_depth=5, min_samples_leaf=3, learning_rate=0.5, min_samples_split=3)

model_gb.fit(X_train, y_train)
gb_pred = model_gb.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, gb_pred)}") # Accuracy report
print(f"Classification report: {classification_report(y_test, gb_pred)}") # Classification report
print(f"Confusion matrix: {confusion_matrix(y_test, gb_pred)}") # Confusion matrix

gb_score = cross_val_score(model_gb, X, y, cv=5)
print(f"Cross validation score: {gb_score.mean():.4f} (+/- {gb_score.std() * 2:.4f})")

Accuracy: 0.8435
Classification report:               precision    recall  f1-score   support

           0       0.85      0.84      0.84      4961
           1       0.84      0.85      0.85      5039

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000

Confusion matrix: [[4152  809]
 [ 756 4283]]
Cross validation score: 0.8444 (+/- 0.0037)


Hyperparameter Tuning

GridSearch

In [28]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 50],
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [100, 500, 750],
    'l1_ratio': [0.1, 0.5, 0.9]
}

lore_model = LogisticRegression(random_state=42)

grid_search = GridSearchCV(estimator=lore_model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

2160 fits failed out of a total of 3600.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
180 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\WAHYU\Documents\VS Code Project\movie review\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\WAHYU\Documents\VS Code Project\movie review\.venv\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\WAHYU\Documents\VS Code Project\movie review\.venv\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1194, in fit
    solver

Best parameters: {'C': 1, 'l1_ratio': 0.1, 'max_iter': 100, 'penalty': 'l1', 'solver': 'saga'}
Best score: 0.8607
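The grid above expands to 4 x 4 x 5 x 3 x 3 = 720 candidates, or 3600 fits with 5 folds, and most of them fail because not every solver accepts every penalty. `GridSearchCV` also accepts a list of dicts, so each solver can be paired only with penalties it supports. A sketch on small synthetic data, written against the scikit-learn version implied by the traceback, where `penalty` is a regular parameter; the grid values are illustrative, not tuned:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic data so the sketch runs quickly; not the review vectors.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# One dict per solver, listing only the penalties that solver accepts.
valid_grid = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1, 10]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},
    {'solver': ['saga'], 'penalty': ['elasticnet'],
     'l1_ratio': [0.1, 0.5, 0.9], 'C': [0.1, 1, 10]},
]

search = GridSearchCV(LogisticRegression(max_iter=2000), valid_grid, cv=3)
search.fit(X_demo, y_demo)

# Every candidate should have a score; NaN would indicate a failed fit.
scores = search.cv_results_['mean_test_score']
print(len(scores), np.isnan(scores).any())
print(search.best_params_)
```

`l1_ratio` appears only in the `elasticnet` dict because the other penalties ignore it.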


RandomSearch

In [29]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform,randint

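param_dist = {
    'C': uniform(0.1, 10),               # uniform(loc, scale): samples from [0.1, 10.1]
    'max_iter': randint(100, 500, 750),  # third argument is loc: samples integers from 850 to 1249
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
}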

lore_model = LogisticRegression(random_state=42)

random_search = RandomizedSearchCV(estimator=lore_model, param_distributions=param_dist, cv=5, n_iter=100)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")

335 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\WAHYU\Documents\VS Code Project\movie review\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\WAHYU\Documents\VS Code Project\movie review\.venv\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\WAHYU\Documents\VS Code Project\movie review\.venv\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1194, in fit
    solver = 

Best parameters: {'C': 0.8341648222713359, 'max_iter': 993, 'penalty': 'l1', 'solver': 'saga'}
Best score: 0.8606
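The best `max_iter` of 993 lies outside the 100 to 750 range used in the grid search because `scipy.stats.randint(low, high, loc)` treats its third argument as a shift, not as another candidate value. The sketch below samples from that distribution, from one assumed intended range, and from a log-uniform alternative for `C`; the ranges are illustrative:

```python
from scipy.stats import loguniform, randint

# randint(low, high, loc) shifts the range by loc, so this covers 850..1249.
shifted = randint(100, 500, 750).rvs(size=1000, random_state=0)

# Assumed intent: integers from 100 to 750 inclusive (high is exclusive).
intended = randint(100, 751).rvs(size=1000, random_state=0)

# Log-uniform sampling spreads C evenly across orders of magnitude.
c_values = loguniform(1e-1, 1e1).rvs(size=1000, random_state=0)

print(shifted.min(), shifted.max())
print(intended.min(), intended.max())
print(c_values.min(), c_values.max())
```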


Bayesian Optimization

In [30]:
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

param_space = {
    'C': Real(1e-3, 1e3, prior='log-uniform'),
    'penalty': Categorical(['l2', None]),
    'solver': Categorical(['newton-cg', 'lbfgs', 'sag', 'saga']),
    'max_iter': Integer(100, 1000),
    'l1_ratio': Real(0, 1, prior='uniform'),
}

lore_model = LogisticRegression(random_state=42)

bayes_search = BayesSearchCV(estimator=lore_model, search_spaces=param_space, n_iter=100, cv=5)
bayes_search.fit(X_train, y_train)

print(f"Best parameters: {bayes_search.best_params_}")
print(f"Best score: {bayes_search.best_score_:.4f}")



Best parameters: OrderedDict({'C': 1.2828352659826752, 'l1_ratio': 0.02871970320489526, 'max_iter': 240, 'penalty': 'l2', 'solver': 'newton-cg'})
Best score: 0.8601


Evaluation