# Project 3 – IMDB MOVIE REVIEW

### Preprocess Text Data (Remove Punctuation, Tokenize, Remove Stopwords, and Lemmatize/Stem)

### Import all the necessary libraries

In [8]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nikipatel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/nikipatel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nikipatel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Read the dataset

In [9]:
ds = pd.read_csv('IMDB_dataset.csv')
ds

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative
...,...,...
24995,"When I first tuned in on this morning news, I ...",negative
24996,I got this one a few weeks ago and love it! It...,positive
24997,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
24998,I am a Catholic taught in parochial elementary...,negative


### Checking for null values

In [10]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     25000 non-null  object
 1   sentiment  25000 non-null  object
dtypes: object(2)
memory usage: 390.8+ KB


### Define the text preprocessing function: remove punctuation, tokenize, remove stopwords, and lemmatize

In [11]:
def preprocessing_text(review):
    # Remove punctuation
    review = review.translate(str.maketrans('', '', string.punctuation))

    # Lowercase the text and tokenize it into words
    words = nltk.word_tokenize(review.lower())

    # Remove stopwords (build the set once instead of reloading the list for every word)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Lemmatize each word (WordNetLemmatizer treats words as nouns by default)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    # Join the words back into a single string
    review = ' '.join(words)

    return review
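
One side effect of the punctuation step is worth seeing in isolation: `str.translate` with a deletion table removes each punctuation character without leaving a space, which is why "all-time" appears as "alltime" in the preprocessed output. The sketch below uses only the standard library. The helper names `strip_punctuation` and `punctuation_to_space` are hypothetical, and the second variant is an alternative shown for comparison, not what this notebook does.

```python
import string

# Deletion table: every punctuation character is dropped (the notebook's approach).
DELETE_TABLE = str.maketrans('', '', string.punctuation)

# Alternative table (an assumption, not used in the notebook):
# every punctuation character is replaced by a single space.
SPACE_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def strip_punctuation(text):
    """Delete punctuation, joining the characters on either side."""
    return text.translate(DELETE_TABLE)


def punctuation_to_space(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return ' '.join(text.translate(SPACE_TABLE).split())


sample = "Probably my all-time favorite movie, a story"
print(strip_punctuation(sample))
print(punctuation_to_space(sample))
```

Whether "alltime" or "all time" is the better token depends on the vocabulary you want the vectorizer to learn; either choice should be applied consistently to all reviews.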

### Apply the preprocessing function and store the result in a new column

In [13]:
ds['preprocessed_review'] = ds['review'].apply(preprocessing_text)
ds

Unnamed: 0,review,sentiment,preprocessed_review
0,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
1,"Probably my all-time favorite movie, a story o...",positive,probably alltime favorite movie story selfless...
2,I sure would like to see a resurrection of a u...,positive,sure would like see resurrection dated seahunt...
3,"This show was an amazing, fresh & innovative i...",negative,show amazing fresh innovative idea 70 first ai...
4,Encouraged by the positive comments about this...,negative,encouraged positive comment film looking forwa...
...,...,...,...
24995,"When I first tuned in on this morning news, I ...",negative,first tuned morning news thought wow finally e...
24996,I got this one a few weeks ago and love it! It...,positive,got one week ago love modern light filled true...
24997,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,bad plot bad dialogue bad acting idiotic direc...
24998,I am a Catholic taught in parochial elementary...,negative,catholic taught parochial elementary school nu...


### Convert the text to TF-IDF vectors and map 'sentiment' to 1 ('positive') and 0 ('negative')

In [18]:
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(ds['preprocessed_review'])
y = ds['sentiment'].map({'positive': 1, 'negative': 0})
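
To make the vectorization step concrete, the following minimal sketch runs the same two operations on a three-review toy corpus. The `toy_reviews` and `toy_labels` values are invented for illustration and are not taken from the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed reviews (not from the dataset).
toy_reviews = [
    "wonderful movie great story",
    "bad plot bad acting",
    "great acting wonderful plot",
]
toy_labels = ["positive", "negative", "positive"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix.
X = vectorizer.fit_transform(toy_reviews)

# Same label encoding as in the notebook: positive -> 1, negative -> 0.
label_map = {"positive": 1, "negative": 0}
y = [label_map[label] for label in toy_labels]

print(X.shape)                         # (number of reviews, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned terms
print(y)
```

With the pandas `.map` call used in the notebook, any label missing from the dictionary becomes `NaN`, so a quick `y.isna().sum()` before training is a cheap sanity check.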

### Run a grid search with the Random Forest classifier and print the best parameters and accuracy score

In [23]:
# Defining the parameter grid for the Random Forest model
paramrf = {'n_estimators': [100, 200],
           'max_depth': [5, None]}

# Defining the Random Forest model
modelrf = RandomForestClassifier()

# Defining the GridSearchCV object (uses the model and grid defined above)
gridrf = GridSearchCV(modelrf, param_grid=paramrf, cv=5)

# Fit the GridSearchCV object to the data
gridrf.fit(x, y)

# Print the best parameter settings and accuracy score
print('Best Parameters:', gridrf.best_params_)
print('Accuracy:', gridrf.best_score_)

Best Parameters: {'max_depth': None, 'n_estimators': 200}
Accuracy: 0.85548
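
Because `x` was produced by fitting `TfidfVectorizer` on all reviews before the search, the vocabulary and IDF weights used in each fold also reflect that fold's validation reviews. One possible way to keep vectorization inside each training fold is to wrap both steps in a scikit-learn `Pipeline`. The sketch below shows this on a small invented corpus; the step names `tfidf` and `rf`, the reduced grid, `cv=2`, and `random_state=0` are all assumptions chosen for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the notebook would pass ds['preprocessed_review'] and y.
texts = [
    "wonderful movie great story", "great acting wonderful plot",
    "loved film great cast", "wonderful film loved story",
    "bad plot bad acting", "terrible film bad story",
    "awful acting terrible plot", "bad movie awful cast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                     # refit on each training fold
    ("rf", RandomForestClassifier(random_state=0)),  # fixed seed for repeatability
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {
    "rf__n_estimators": [10, 20],
    "rf__max_depth": [5, None],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
```

On the full dataset this refits the vectorizer for every fold and parameter combination, so it is slower than vectorizing once; the trade-off is a score that does not depend on the validation text.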


### Run a grid search with XGBoost and print the best parameters and accuracy score

In [24]:
# Defining the parameter grid for the XGBoost model
paramxg = {'max_depth': [3, 6],
           'n_estimators': [50, 100]}

# Defining the XGBoost model
modelxg = XGBClassifier()

# Defining the GridSearchCV object for XGBoost (uses the model and grid defined above)
gridxg = GridSearchCV(modelxg, param_grid=paramxg, cv=5)

# Fit the GridSearchCV object to the data for XGBoost
gridxg.fit(x, y)

# Print the best parameter settings and accuracy score
print('XGBoost Best Parameters:', gridxg.best_params_)
print('XGBoost Accuracy:', gridxg.best_score_)

XGBoost Best Parameters: {'max_depth': 6, 'n_estimators': 100}
XGBoost Accuracy: 0.8532399999999999


#### Conclusion

To conclude, I preprocessed the review text to clean it and make it suitable for prediction. To find the best parameters, I used GridSearchCV with 5-fold cross-validation on two classifiers. Random Forest reached a mean cross-validated accuracy of 85.55%, and XGBoost reached 85.32%. Comparing the two, Random Forest scored slightly higher than XGBoost, by about 0.22 percentage points, with best parameters `max_depth=None` and `n_estimators=200`.
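
The `best_score_` values printed by the two searches can be converted to percentages and compared directly. A minimal check of the rounding and of the gap between the models, using only the numbers from the outputs in this notebook:

```python
# Mean cross-validated accuracies copied from the grid-search outputs.
rf_accuracy = 0.85548
xgb_accuracy = 0.8532399999999999

print(f"Random Forest: {rf_accuracy * 100:.2f}%")
print(f"XGBoost:       {xgb_accuracy * 100:.2f}%")

# Difference expressed in percentage points.
gap = (rf_accuracy - xgb_accuracy) * 100
print(f"Gap: {gap:.2f} percentage points")
```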