Project 3 – IMDB MOVIE REVIEW

Context: The IMDB dataset contains 25K movie reviews for natural language processing or text analytics.
This is a dataset for binary sentiment classification containing substantially more data than previous 
benchmark datasets. We provide a set of 12,500 highly polar movie reviews for training and 12,500 for 
testing. Please use less data, e.g. 6K reviews, if you are facing memory issues, but make sure to use an 
equal number of positive and negative sentiment reviews. Mention clearly in the notebook if you have 
used a reduced dataset.

For more dataset information, please go through the following link,
http://ai.stanford.edu/~amaas/data/sentiment/

Dataset Source: given

Task: The goal of this project is to predict the number of positive and negative reviews using classification.

Implementation:
- Preprocess text data (remove punctuation, perform tokenization, remove stopwords and 
lemmatize/stem)
- Perform TF-IDF vectorization
- Explore parameter settings using GridSearchCV on Random Forest and Gradient Boosting 
classifiers. Use XGBoost instead of Gradient Boosting if it is taking a very long time in 
GridSearchCV
- Perform Final evaluation of models on the best parameter settings using the evaluation metrics
- Report the best performing model


In [67]:
#Importing Libraries
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [68]:
#importing dataset
df = pd.read_csv("C:/Users/Manju/Documents/Assignments/data set/IMDB_dataset.csv")
df

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative
...,...,...
24995,"When I first tuned in on this morning news, I ...",negative
24996,I got this one a few weeks ago and love it! It...,positive
24997,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
24998,I am a Catholic taught in parochial elementary...,negative


In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     25000 non-null  object
 1   sentiment  25000 non-null  object
dtypes: object(2)
memory usage: 390.8+ KB


In [70]:
df.duplicated().sum()

np.int64(102)

In [71]:
df.drop_duplicates()

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative
...,...,...
24995,"When I first tuned in on this morning news, I ...",negative
24996,I got this one a few weeks ago and love it! It...,positive
24997,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
24998,I am a Catholic taught in parochial elementary...,negative


In [72]:
# Drop duplicate rows and rebuild a clean 0..n-1 index in one step
data = df.drop_duplicates().reset_index(drop=True)

data

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative
...,...,...
24893,"When I first tuned in on this morning news, I ...",negative
24894,I got this one a few weeks ago and love it! It...,positive
24895,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
24896,I am a Catholic taught in parochial elementary...,negative


In [73]:
data.duplicated().sum()

np.int64(0)

In [74]:
data.isnull().sum()

review       0
sentiment    0
dtype: int64

In [75]:
print("Input data has {} rows and {} columns".format(len(df), len(df.columns)))
print("Cleaned data has {} rows and {} columns".format(len(data), len(data.columns)))

Input data has 25000 rows and 2 columns
Cleaned data has 24898 rows and 2 columns


In [76]:
# Using a reduced dataset: a balanced sample of 3,000 positive and 3,000 negative reviews (6,000 in total),
# drawn before preprocessing to limit memory use

positive_reviews = data[data['sentiment'] == 'positive']
negative_reviews = data[data['sentiment'] == 'negative']

num_samples = 3000

positive_sampled = positive_reviews.sample(num_samples, random_state=42)
negative_sampled = negative_reviews.sample(num_samples, random_state=42)

data = pd.concat([positive_sampled, negative_sampled], ignore_index=True)
data

Unnamed: 0,review,sentiment
0,This probably ranks in my Top-5 list of the fu...,positive
1,I saw this film at the 2002 Toronto Internatio...,positive
2,This is an exceptional film. It is part comedy...,positive
3,Krajobraz po bitwie like many films of Wajda i...,positive
4,"Personally, I LOVED TRIS MOVIE! My best friend...",positive
...,...,...
5995,How does a Scotsman in a kilt make love in the...,negative
5996,Look at the all the positive user comments of ...,negative
5997,"This is a very strange film, with a no-name ca...",negative
5998,"As far as Asian horror goes, I have seen my sh...",negative


In [77]:
data['sentiment'].value_counts()

sentiment
positive    3000
negative    3000
Name: count, dtype: int64

In [78]:
import string
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

data['body_text_clean'] = data['review'].apply(lambda x: remove_punct(x))

data.head()

Unnamed: 0,review,sentiment,body_text_clean
0,This probably ranks in my Top-5 list of the fu...,positive,This probably ranks in my Top5 list of the fun...
1,I saw this film at the 2002 Toronto Internatio...,positive,I saw this film at the 2002 Toronto Internatio...
2,This is an exceptional film. It is part comedy...,positive,This is an exceptional film It is part comedy ...
3,Krajobraz po bitwie like many films of Wajda i...,positive,Krajobraz po bitwie like many films of Wajda i...
4,"Personally, I LOVED TRIS MOVIE! My best friend...",positive,Personally I LOVED TRIS MOVIE My best friend t...


In [79]:
import re

def tokenize(text):
    # Raw string so that \W is passed to the regex engine and not treated as a string escape
    tokens = re.split(r'\W+', text)
    return tokens

data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))

data.head()

Unnamed: 0,review,sentiment,body_text_clean,body_text_tokenized
0,This probably ranks in my Top-5 list of the fu...,positive,This probably ranks in my Top5 list of the fun...,"[this, probably, ranks, in, my, top5, list, of..."
1,I saw this film at the 2002 Toronto Internatio...,positive,I saw this film at the 2002 Toronto Internatio...,"[i, saw, this, film, at, the, 2002, toronto, i..."
2,This is an exceptional film. It is part comedy...,positive,This is an exceptional film It is part comedy ...,"[this, is, an, exceptional, film, it, is, part..."
3,Krajobraz po bitwie like many films of Wajda i...,positive,Krajobraz po bitwie like many films of Wajda i...,"[krajobraz, po, bitwie, like, many, films, of,..."
4,"Personally, I LOVED TRIS MOVIE! My best friend...",positive,Personally I LOVED TRIS MOVIE My best friend t...,"[personally, i, loved, tris, movie, my, best, ..."
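
Splitting on `\W+` returns an empty string whenever the text starts or ends with a non-word character, such as leading or trailing whitespace left behind after punctuation removal. The sketch below shows the effect on a made-up sentence and one possible way to filter those tokens out. The function names `tokenize_raw` and `tokenize_no_empty` are illustrative and are not part of the notebook.

```python
import re

def tokenize_raw(text):
    # Same splitting rule as the notebook's tokenize(): split on runs of non-word characters
    return re.split(r'\W+', text)

def tokenize_no_empty(text):
    # Hypothetical variant: drop the empty strings produced at the edges of the text
    return [tok for tok in re.split(r'\W+', text) if tok]

sample = " great movie  loved it "
print(tokenize_raw(sample))       # ['', 'great', 'movie', 'loved', 'it', '']
print(tokenize_no_empty(sample))  # ['great', 'movie', 'loved', 'it']
```

Filtering empty tokens would change the vocabulary size reported later in the notebook, so it is shown here only as an option.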


In [80]:
# nltk.corpus is a submodule of the nltk package, not a separate distribution, so install nltk itself
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [81]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Manju\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Manju\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Manju\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [82]:
stopword = nltk.corpus.stopwords.words('english')

In [83]:
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text

data['body_text_nostop'] = data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

data.head()

Unnamed: 0,review,sentiment,body_text_clean,body_text_tokenized,body_text_nostop
0,This probably ranks in my Top-5 list of the fu...,positive,This probably ranks in my Top5 list of the fun...,"[this, probably, ranks, in, my, top5, list, of...","[probably, ranks, top5, list, funniest, movies..."
1,I saw this film at the 2002 Toronto Internatio...,positive,I saw this film at the 2002 Toronto Internatio...,"[i, saw, this, film, at, the, 2002, toronto, i...","[saw, film, 2002, toronto, international, film..."
2,This is an exceptional film. It is part comedy...,positive,This is an exceptional film It is part comedy ...,"[this, is, an, exceptional, film, it, is, part...","[exceptional, film, part, comedy, part, drama,..."
3,Krajobraz po bitwie like many films of Wajda i...,positive,Krajobraz po bitwie like many films of Wajda i...,"[krajobraz, po, bitwie, like, many, films, of,...","[krajobraz, po, bitwie, like, many, films, waj..."
4,"Personally, I LOVED TRIS MOVIE! My best friend...",positive,Personally I LOVED TRIS MOVIE My best friend t...,"[personally, i, loved, tris, movie, my, best, ...","[personally, loved, tris, movie, best, friend,..."


In [84]:
ps = nltk.PorterStemmer()
def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text]
    return text

data['body_text_stemmed'] = data['body_text_nostop'].apply(lambda x: stemming(x))

data.head()

Unnamed: 0,review,sentiment,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed
0,This probably ranks in my Top-5 list of the fu...,positive,This probably ranks in my Top5 list of the fun...,"[this, probably, ranks, in, my, top5, list, of...","[probably, ranks, top5, list, funniest, movies...","[probabl, rank, top5, list, funniest, movi, iv..."
1,I saw this film at the 2002 Toronto Internatio...,positive,I saw this film at the 2002 Toronto Internatio...,"[i, saw, this, film, at, the, 2002, toronto, i...","[saw, film, 2002, toronto, international, film...","[saw, film, 2002, toronto, intern, film, festi..."
2,This is an exceptional film. It is part comedy...,positive,This is an exceptional film It is part comedy ...,"[this, is, an, exceptional, film, it, is, part...","[exceptional, film, part, comedy, part, drama,...","[except, film, part, comedi, part, drama, part..."
3,Krajobraz po bitwie like many films of Wajda i...,positive,Krajobraz po bitwie like many films of Wajda i...,"[krajobraz, po, bitwie, like, many, films, of,...","[krajobraz, po, bitwie, like, many, films, waj...","[krajobraz, po, bitwi, like, mani, film, wajda..."
4,"Personally, I LOVED TRIS MOVIE! My best friend...",positive,Personally I LOVED TRIS MOVIE My best friend t...,"[personally, i, loved, tris, movie, my, best, ...","[personally, loved, tris, movie, best, friend,...","[person, love, tri, movi, best, friend, told, ..."


In [85]:
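def clean_text(text):
    # Lowercase the text and drop punctuation characters
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Raw string avoids the invalid-escape warning for \W; note that re.split can
    # still return empty strings at the start or end of the token list
    tokens = re.split(r'\W+', text)
    # Remove stopwords, then stem what remains
    text = [ps.stem(word) for word in tokens if word not in stopword]
    return text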

In [86]:
# Performing TFIDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['review'])
print(X_tfidf.shape)
print(tfidf_vect.get_feature_names_out())

(6000, 41265)
['' '0' '00' ... 'œoliverâ' 'œoompahpahâ' 'œyouâ']
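
To make the values in the TF-IDF matrix less opaque, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three toy documents and the helper name `tfidf_row` are made up for illustration and do not come from the dataset.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned, stemmed tokens (illustrative only)
docs = [["good", "movi"], ["bad", "movi"], ["good", "good", "act"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    counts = Counter(doc)
    # Raw count times smoothed inverse document frequency
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    # L2-normalise so that every document vector has unit length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

for doc in docs:
    print({t: round(w, 3) for t, w in tfidf_row(doc).items()})
```

In the second toy document, "bad" appears in fewer documents than "movi", so it receives the larger weight. This is the same reason rare terms stand out in the real matrix.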


In [87]:
X_tfidf_df = pd.DataFrame(X_tfidf.toarray())
X_tfidf_df.columns = tfidf_vect.get_feature_names_out()
X_tfidf_df.head()

Unnamed: 0,Unnamed: 1,0,00,0000000000001,00000110,0001,001,0010,007,0079,...,ã¼bertrash,ãœberbab,œconsid,œfamilyâ,œfood,œgolden,œiâ,œoliverâ,œoompahpahâ,œyouâ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.049522,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [88]:
### Exploring parameter settings for Random Forest with a manual grid search scored on a held-out test split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_tfidf_df, data['sentiment'], test_size=0.2)


In [89]:
def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='negative', average='binary')
    print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        n_est, depth, round(precision, 3), round(recall, 3),
        round((y_pred==y_test).sum() / len(y_pred), 3)))
    
for n_est in [10, 50, 100]:
    for depth in [10, 20, 30, None]:
        train_RF(n_est, depth)

Est: 10 / Depth: 10 ---- Precision: 0.755 / Recall: 0.654 / Accuracy: 0.709
Est: 10 / Depth: 20 ---- Precision: 0.766 / Recall: 0.686 / Accuracy: 0.728
Est: 10 / Depth: 30 ---- Precision: 0.771 / Recall: 0.738 / Accuracy: 0.749
Est: 10 / Depth: None ---- Precision: 0.714 / Recall: 0.818 / Accuracy: 0.734
Est: 50 / Depth: 10 ---- Precision: 0.86 / Recall: 0.709 / Accuracy: 0.788
Est: 50 / Depth: 20 ---- Precision: 0.848 / Recall: 0.768 / Accuracy: 0.808
Est: 50 / Depth: 30 ---- Precision: 0.845 / Recall: 0.81 / Accuracy: 0.823
Est: 50 / Depth: None ---- Precision: 0.823 / Recall: 0.818 / Accuracy: 0.813
Est: 100 / Depth: 10 ---- Precision: 0.874 / Recall: 0.746 / Accuracy: 0.812
Est: 100 / Depth: 20 ---- Precision: 0.867 / Recall: 0.781 / Accuracy: 0.823
Est: 100 / Depth: 30 ---- Precision: 0.849 / Recall: 0.818 / Accuracy: 0.829
Est: 100 / Depth: None ---- Precision: 0.842 / Recall: 0.838 / Accuracy: 0.834
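
The loop above scores every setting directly on the test split, so the chosen setting is tuned to that particular split. The project brief asks for GridSearchCV, which instead cross-validates each setting on the training data only. A minimal sketch of that approach follows. It assumes a small synthetic dataset from `make_classification` standing in for the TF-IDF matrix, and the names `X_demo`, `y_demo` and `grid` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF features and sentiment labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# A reduced version of the grid explored in the notebook
param_grid = {"n_estimators": [10, 50], "max_depth": [10, None]}

# 3-fold cross-validation on the training data only; the test split stays untouched
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=3, scoring="accuracy")
grid.fit(X_tr, y_tr)

print("Best parameters:", grid.best_params_)
print("Mean CV accuracy:", round(grid.best_score_, 3))
# The best estimator is refit on the full training set and scored once on the test split
print("Test accuracy:", round(grid.score(X_te, y_te), 3))
```

With the real data, `X_tr` and `y_tr` would be replaced by `X_train` and `y_train`, and the grid can be widened to match the values tried in the loop, at the cost of a longer run time.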


In [91]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-2.1.1-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-2.1.1-py3-none-win_amd64.whl (124.9 MB)
 

In [92]:
### Exploring parameter settings using XGBoost
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

print(dir(xgb.XGBClassifier))
print(xgb.XGBClassifier())

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__sklearn_clone__', '__sklearn_is_fitted__', '__str__', '__subclasshook__', '__weakref__', '_build_request_for_signature', '_can_use_inplace_predict', '_check_feature_names', '_check_n_features', '_configure_fit', '_create_dmatrix', '_doc_link_module', '_doc_link_template', '_doc_link_url_param_generator', '_estimator_type', '_get_default_requests', '_get_doc_link', '_get_iteration_range', '_get_metadata_request', '_get_param_names', '_get_tags', '_get_type', '_load_model_attributes', '_more_tags', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_set_evaluation_result', '_validate_data', '_validate_params', 'apply', 'best_iteration', 'best_score', 'cl

In [93]:
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

def train_xgboost(n_est, max_depth, lr):
    xgb_model = xgb.XGBClassifier(n_estimators=n_est, max_depth=max_depth, learning_rate=lr)
    xgb_model.fit(X_train, y_train_encoded)
    y_pred_encoded = xgb_model.predict(X_test)
    y_pred = label_encoder.inverse_transform(y_pred_encoded)

    precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='negative', average='binary')
    accuracy = (y_pred == y_test).sum() / len(y_pred)
    print('Est: {} / Depth: {} / LR: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        n_est, max_depth, lr, round(precision, 3), round(recall, 3), round(accuracy, 3)))

for n_est in [50, 100, 150]:
    for max_depth in [3, 7, 11, 15]:
        for lr in [0.01, 0.1, 1]:
            train_xgboost(n_est, max_depth, lr)


Est: 50 / Depth: 3 / LR: 0.01 ---- Precision: 0.809 / Recall: 0.523 / Accuracy: 0.688
Est: 50 / Depth: 3 / LR: 0.1 ---- Precision: 0.831 / Recall: 0.683 / Accuracy: 0.762
Est: 50 / Depth: 3 / LR: 1 ---- Precision: 0.786 / Recall: 0.778 / Accuracy: 0.774
Est: 50 / Depth: 7 / LR: 0.01 ---- Precision: 0.812 / Recall: 0.562 / Accuracy: 0.704
Est: 50 / Depth: 7 / LR: 0.1 ---- Precision: 0.826 / Recall: 0.744 / Accuracy: 0.785
Est: 50 / Depth: 7 / LR: 1 ---- Precision: 0.805 / Recall: 0.787 / Accuracy: 0.79
Est: 50 / Depth: 11 / LR: 0.01 ---- Precision: 0.804 / Recall: 0.645 / Accuracy: 0.733
Est: 50 / Depth: 11 / LR: 0.1 ---- Precision: 0.822 / Recall: 0.776 / Accuracy: 0.796
Est: 50 / Depth: 11 / LR: 1 ---- Precision: 0.805 / Recall: 0.8 / Accuracy: 0.795
Est: 50 / Depth: 15 / LR: 0.01 ---- Precision: 0.798 / Recall: 0.67 / Accuracy: 0.74
Est: 50 / Depth: 15 / LR: 0.1 ---- Precision: 0.828 / Recall: 0.794 / Accuracy: 0.807
Est: 50 / Depth: 15 / LR: 1 ---- Precision: 0.798 / Recall: 0.782 /

In [95]:
import xgboost as xgb
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

best_n_est = 150
best_max_depth = 15
best_lr = 0.1

def train_final_xgboost():
    xgb_model = xgb.XGBClassifier(n_estimators=best_n_est, max_depth=best_max_depth, learning_rate=best_lr)
    xgb_model.fit(X_train, y_train_encoded)

    y_pred_encoded = xgb_model.predict(X_test)
    y_pred = label_encoder.inverse_transform(y_pred_encoded)

    precision = precision_score(y_test, y_pred, pos_label='negative')
    recall = recall_score(y_test, y_pred, pos_label='negative')
    accuracy = accuracy_score(y_test, y_pred)

    print("Final Evaluation Results:")
    print("Precision (Negative):", round(precision, 3))
    print("Recall (Negative):", round(recall, 3))
    print("Accuracy:", round(accuracy, 3))

train_final_xgboost()


Final Evaluation Results:
Precision (Negative): 0.842
Recall (Negative): 0.818
Accuracy: 0.825


Analysis:

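For training, the data was cleaned by removing 102 duplicate entries. A reduced, balanced sample of 6,000 reviews (3,000 positive and 3,000 negative) was then drawn. The text was preprocessed by removing punctuation, tokenizing, removing stopwords and stemming, and was then vectorized with TF-IDF.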

For the Random Forest classifier, the best overall performer was Est: 100 / Depth: None, with a precision of 0.842, recall of 0.838 and accuracy of 0.834.

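For the XGBoost classifier (used in place of Gradient Boosting), Est: 150 / Depth: 15 / LR: 0.1 shows the best balance across all metrics, with a precision of 0.842, recall of 0.818 and accuracy of 0.825. Precision and recall are reported for the negative class: a precision of 84.2% means that 84.2% of the reviews predicted as negative are truly negative (few false positives), and a recall of 81.8% means that 81.8% of the truly negative reviews are identified (few false negatives). The accuracy shows that the model is correct 82.5% of the time on the test split.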

Comparing the two models, the Random Forest classifier is preferable: it has higher recall (0.838 vs 0.818) and accuracy (0.834 vs 0.825), while precision is the same for both (0.842). The higher recall and accuracy suggest that it is better at identifying negative reviews and performs better overall.
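
The precision, recall and accuracy figures compared above all derive from the same four confusion-matrix counts, with 'negative' treated as the class of interest, as in the notebook's `pos_label='negative'`. The sketch below computes them by hand on a tiny made-up set of labels. The helper name `binary_metrics` and the example labels are illustrative and are not results from the notebook.

```python
def binary_metrics(y_true, y_pred, pos_label="negative"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # negatives correctly found
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # positives flagged as negative
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # negatives that were missed
    tn = sum(t != pos_label and p != pos_label for t, p in pairs)  # positives correctly left alone
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Made-up labels: 4 negative and 4 positive reviews
y_true = ["negative"] * 4 + ["positive"] * 4
y_pred = ["negative", "negative", "negative", "positive",
          "negative", "negative", "positive", "positive"]

precision, recall, accuracy = binary_metrics(y_true, y_pred)
print(precision, recall, accuracy)  # 0.6 0.75 0.625
```

Here three of the five reviews predicted as negative are truly negative (precision 0.6), and three of the four truly negative reviews are found (recall 0.75).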