## Analyzing IMDb Reviews

In this code demonstration, we will:
- Apply some **text preprocessing steps** (stopword removal and stemming).
- Create features using the **Bag of Words model** and the **TF-IDF model**.
- Build a **logistic regression model** on each feature set.

## Session: Text Analytics I

# 1. Import libraries

In [1]:
#Load the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import nltk


import os
import warnings
warnings.filterwarnings('ignore')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# 2. Load the data



In [47]:
# Mount Google Drive (Colab only)
from google.colab import drive
drive.mount('/content/drive')

# Read the full IMDb reviews dataset into a DataFrame
df = pd.read_csv('/content/IMDB_reviews.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [48]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


In [49]:
#shape of the dataset
df.shape

(50000, 2)

In [50]:
# Keep only the first 10,000 rows for the rest of the analysis to reduce computation time
imdb = df.head(10000).copy()

In [51]:
imdb

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
...,...,...
9995,fun entertaining movie about wwii german spy j...,positive
9996,give me a break how can anyone say that this i...,negative
9997,this movie is a bad movie but after watching a...,negative
9998,this is a movie that was probably made to ente...,negative


# 3. Removing stopwords and tokenization

In [52]:
from nltk.corpus import stopwords

# Load the English stopword list
stopword_list = stopwords.words('english')

In [53]:
from nltk.tokenize import word_tokenize,sent_tokenize

# Remove stopwords from a single review.
# The function takes the text of one review, tokenizes it, drops the stopwords,
# and returns the cleaned review text.
def remove_stopwords(text):
    tokens = word_tokenize(text)
    # Strip any surrounding whitespace from each token
    tokens = [token.strip() for token in tokens]

    # Keep a token only if its lowercase form is not in the stopword list
    filtered_tokens = []
    for token in tokens:
        if token.lower() not in stopword_list:
            filtered_tokens.append(token)

    # Join the remaining tokens with single spaces to rebuild the review text
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text




In [54]:
len(imdb['review'][678])

4202

The original length of this review is 4202 characters. After tokenization and stopword removal, it should shrink to 2942 characters.

In [55]:
len(remove_stopwords(imdb['review'][678]))

2942
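
To see why the character count drops, here is a minimal sketch of the same filtering idea using only the standard library. The tiny `toy_stopwords` set and the whitespace `split()` tokenizer are illustrative assumptions; they stand in for NLTK's English stopword list and `word_tokenize`, which behave differently on real text.

```python
# Toy stand-in for NLTK's English stopword list (hypothetical, far smaller)
toy_stopwords = {"this", "was", "a", "the", "of", "to"}

def remove_toy_stopwords(text):
    # Whitespace split is a simplification of word_tokenize
    tokens = text.split()
    kept = [t for t in tokens if t.lower() not in toy_stopwords]
    return " ".join(kept)

sample = "This was a wonderful way to spend time"
cleaned = remove_toy_stopwords(sample)

print(cleaned)                    # wonderful way spend time
print(len(sample), len(cleaned))  # character counts before and after
```

Because `len` on a string counts characters, every removed stopword (and the space that followed it) shortens the review.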

In [56]:
#Apply function on review column. Removing stopwords from each review in the dataframe
imdb['review']=imdb['review'].apply(remove_stopwords)

In [57]:
imdb.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive


# 4. Stemming

- Stemming is rule-based: it strips common suffixes such as 'ing', 'ed', and 'es'. It is fast but may produce strings that are not real words.
- Lemmatizing is dictionary-based: it maps each word to its root form, for example 'went' to 'go' and 'going' to 'go'. Lemmatizing is generally preferred, but it can be slow on large datasets.
- For ease of computation, we will use stemming in this analysis.
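
The contrast between the two approaches can be sketched with a toy example. The suffix list and the lookup table below are hypothetical and deliberately tiny; this is not the Porter algorithm or a real lemmatizer, only an illustration of "rules" versus "dictionary".

```python
# Hypothetical suffix rules, checked in order; NOT the Porter rule set
TOY_SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lookup table standing in for a lemmatizer's dictionary
TOY_LEMMAS = {"went": "go", "going": "go", "movies": "movie"}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged
    return TOY_LEMMAS.get(word, word)

for w in ["watching", "movies", "went", "going"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

In this sketch the rule-based stemmer turns 'movies' into the non-word 'movi' and cannot relate 'went' to 'go', while the dictionary lookup handles both but only for words it already knows.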

In [58]:
# Import the Porter stemmer
from nltk.stem import PorterStemmer

# Stem every token of a review and join the stems back into one string
def simple_stemmer(text):
    porter = PorterStemmer()
    stemmed_words = []
    for word in word_tokenize(text):
        stem_word = porter.stem(word)
        stemmed_words.append(stem_word)
    return ' '.join(stemmed_words)

In [59]:
len(imdb['review'][456])

2509

In [60]:
len(simple_stemmer(imdb['review'][456]))

2225

In [61]:
# Applying stemming to all reviews in the dataframe column 'review'
imdb['review'] = imdb['review'].apply(simple_stemmer)

In [62]:
#Example of randomly selected review text from the dataframe after stemming
imdb['review'][456]

'oh good would never thought possibl see thriller wors domest disturb soon arm rotten plot terribl edit stilt act headacheinduc style sorri word sanctimoni kind movi almost forc reevalu entir genr film bad even thriller condemn complet failur seem littl betternow sanctimoni terribl film also succe difficult task rip better movi pathet job right main titl noth blatant attempt reproduc one se7en impress someth didnt smell quit right soon movi start seri corni wannab hip quickcut full gori imag bombast color knew smell come fromit turn two policemen rather policeman jim renart michael par policewoman dorothi smith jennif rubin investig murder spree vancouv serial killer known monkey killer menac chill nicknam uh work method kill quit lot peopl see nut appar work follow proverb see evil hear evil speak evil cut eye ear tongu victim far six eye six ear three tongu ingeni fashion renart smith figur monkey killer probabl go kill three peopl well probabl want complet number 666 suddenli film f

## Session: Document Clustering

### Step - 1 Data Preprocessing

#### 1.1 - Converting the labels to 1 and 0

In [63]:
imdb.head()

Unnamed: 0,review,sentiment
0,one review mention watch 1 oz episod youll hoo...,positive
1,wonder littl product film techniqu unassum old...,positive
2,thought wonder way spend time hot summer weeke...,positive
3,basic there famili littl boy jake think there ...,negative
4,petter mattei love time money visual stun film...,positive


In [64]:
# Encode the sentiment labels as integers.

# LabelBinarizer converts categorical labels into a NumPy array of 0s and 1s.
# Classes are sorted alphabetically, so 'negative' becomes 0 and 'positive' becomes 1.

from sklearn.preprocessing import LabelBinarizer

lb=LabelBinarizer()

#Transformed sentiment data
imdb['sentiment']=lb.fit_transform(imdb['sentiment'])

In [65]:
### Check the DataFrame again
imdb.head()

Unnamed: 0,review,sentiment
0,one review mention watch 1 oz episod youll hoo...,1
1,wonder littl product film techniqu unassum old...,1
2,thought wonder way spend time hot summer weeke...,1
3,basic there famili littl boy jake think there ...,0
4,petter mattei love time money visual stun film...,1
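
The mapping shown in the table can be reproduced with a small standard-library sketch. The helper `binarize_labels` is a hypothetical name used for illustration; it mimics the two-class ordering behaviour only and is not scikit-learn's implementation.

```python
def binarize_labels(labels):
    # Sort the distinct class names; the first maps to 0 and the second to 1
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    mapping = {classes[0]: 0, classes[1]: 1}
    return classes, [mapping[label] for label in labels]

# Sentiments of the first five reviews shown above
sentiments = ["positive", "positive", "positive", "negative", "positive"]
classes, encoded = binarize_labels(sentiments)

print(classes)  # ['negative', 'positive']
print(encoded)  # [1, 1, 1, 0, 1]
```

Keeping this order in mind matters later: any report that lists classes in sorted order shows the negative class (0) first.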


# Additional Reading - Building Logistic Regression Model using Bag of Words Model

In [66]:
norm_reviews=imdb.review

In [67]:
norm_reviews

0       one review mention watch 1 oz episod youll hoo...
1       wonder littl product film techniqu unassum old...
2       thought wonder way spend time hot summer weeke...
3       basic there famili littl boy jake think there ...
4       petter mattei love time money visual stun film...
                              ...                        
9995    fun entertain movi wwii german spi juli andrew...
9996    give break anyon say good hockey movi know mov...
9997    movi bad movi watch endless seri bad horror mo...
9998    movi probabl made entertain middl school earli...
9999    smash film filmmak show intens strang relation...
Name: review, Length: 10000, dtype: object

In [68]:
from sklearn.feature_extraction.text import CountVectorizer

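# Build a document-term matrix: one row per review, one column per unique word,
# and the word's count in that review as the value.
# CountVectorizer implements the bag-of-words model
cv=CountVectorizer()

# Learn the vocabulary from all reviews (note: this happens before the train/test split,
# so words that appear only in test reviews are also included in the vocabulary)
cv_fit = cv.fit(norm_reviews)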
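
As a rough illustration of what a bag-of-words matrix contains, the sketch below builds one by hand for three made-up, already-stemmed documents. It uses a plain whitespace split, so it only approximates `CountVectorizer`, whose own tokenization differs in its details.

```python
from collections import Counter

# Three hypothetical, already-stemmed documents
docs = ["wonder littl product", "movi bad movi", "wonder movi"]

# Vocabulary: every distinct token, in sorted order (one column per token)
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_count_vector(doc):
    # One slot per vocabulary word, filled with that word's count in the document
    counts = Counter(doc.split())
    return [counts.get(token, 0) for token in vocabulary]

matrix = [to_count_vector(doc) for doc in docs]

print(vocabulary)  # ['bad', 'littl', 'movi', 'product', 'wonder']
for row in matrix:
    print(row)
```

Each row has as many entries as there are vocabulary words, and most entries are zero, which is why the real matrix is stored in sparse form.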

In [69]:
# Count the reviews per class. imdb['sentiment'] now holds the binary labels (1 = positive, 0 = negative)
imdb['sentiment'].value_counts()

1    5028
0    4972
Name: sentiment, dtype: int64

In [70]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_cv_fit_train, X_cv_fit_test, y_train, y_test = train_test_split(norm_reviews, imdb['sentiment'], test_size = 0.3, random_state = 1)
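
Conceptually, a random split shuffles the row indices with a fixed seed and cuts off a share for testing. The sketch below shows that idea with the standard library; `toy_train_test_split` is a hypothetical helper and will not reproduce the exact rows that scikit-learn selects for the same seed.

```python
import random

def toy_train_test_split(items, labels, test_size, seed):
    # Shuffle the row indices reproducibly, then cut off the test portion
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Items and labels are selected with the same indices so they stay aligned
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

reviews = [f"review {i}" for i in range(10)]
targets = [i % 2 for i in range(10)]

train_docs, test_docs, train_labels, test_labels = toy_train_test_split(
    reviews, targets, test_size=0.3, seed=1)

print(len(train_docs), len(test_docs))  # 7 3
```

With `test_size = 0.3` on 10,000 reviews, the same proportion gives 7,000 training rows and 3,000 test rows.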

In [71]:
# Transform the training reviews into bag-of-words count vectors
X_cv_fit_train = cv_fit.transform(X_cv_fit_train)

# Transform the test reviews into bag-of-words count vectors
X_cv_fit_test = cv_fit.transform(X_cv_fit_test)

In [72]:
print(X_cv_fit_train.shape)
print(X_cv_fit_test.shape)

(7000, 62938)
(3000, 62938)


In [73]:
# Build the logistic regression model for the data
from sklearn.linear_model import LogisticRegression

# Define the model
lr = LogisticRegression(penalty = 'l2', max_iter = 500, C=1, random_state=42)

# Fit the model on the bag-of-words features
lr_bow=lr.fit(X_cv_fit_train,y_train)
print(lr_bow)

LogisticRegression(C=1, max_iter=500, random_state=42)


In [74]:
y_pred_bow = lr_bow.predict(X_cv_fit_test)

In [75]:
## Model Evaluation
from sklearn.metrics import accuracy_score

print('Validation set Accuracy: %.3f' % accuracy_score(y_test,y_pred_bow))

Validation set Accuracy: 0.854


In [76]:
#Classification report for bag of words
# Labels are reported in sorted order (0 = negative, 1 = positive), so target_names must follow that order
from sklearn.metrics import classification_report
lr_bow_report=classification_report(y_test, y_pred_bow,target_names=['Negative','Positive'])
print(lr_bow_report)

              precision    recall  f1-score   support

    Negative       0.85      0.85      0.85      1478
    Positive       0.86      0.86      0.86      1522

    accuracy                           0.85      3000
   macro avg       0.85      0.85      0.85      3000
weighted avg       0.85      0.85      0.85      3000
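
To make the columns of the report concrete, the sketch below computes accuracy, precision and recall by hand on a small made-up set of labels. The data and the helper `precision_recall` are illustrative assumptions, not values from the notebook.

```python
def precision_recall(y_true, y_pred, positive_label):
    # Count true positives, false positives and false negatives for one class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true labels and predictions for eight reviews
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 0, 1, 0, 1, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
print(accuracy)  # 0.75

# One (precision, recall) pair per class, in sorted label order
for label in (0, 1):
    print(label, precision_recall(y_true, y_hat, label))
```

Precision answers "of the reviews predicted as this class, how many were right", while recall answers "of the reviews truly in this class, how many were found".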



#### 1.2 Create a Tf-Idf object and fit the reviews

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

# Select the preprocessed reviews for fitting the tf-idf model
norm_reviews=imdb.review

# Fit the tfidf_vectorizer on all reviews (as with the bag-of-words model, before the train/test split)
tv_fit = tfidf_vectorizer.fit(norm_reviews)
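
For intuition about the weights a tf-idf model produces, here is a hand-rolled sketch on three made-up documents. It assumes the `TfidfVectorizer` defaults (smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row) and uses a plain whitespace split, so treat it as an approximation rather than the library's code.

```python
import math
from collections import Counter

# Three hypothetical, already-stemmed documents
docs = ["wonder movi", "bad movi", "wonder wonder film"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {t: sum(1 for tokens in tokenized if t in tokens) for t in vocabulary}

# Smoothed inverse document frequency (rarer terms get larger weights)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    # L2-normalize so every non-empty document vector has unit length
    return [v / norm if norm else 0.0 for v in raw]

print(vocabulary)  # ['bad', 'film', 'movi', 'wonder']
for tokens in tokenized:
    print([round(v, 3) for v in tfidf_vector(tokens)])
```

A term that appears in only one document ('bad', 'film') receives a higher idf than a term that appears in two ('movi', 'wonder'), which is the main difference from raw counts.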

In [78]:
## Split the data into training and test sets
# Note: test_size is 0.2 here versus 0.3 for the bag-of-words model, so the two test sets differ
from sklearn.model_selection import train_test_split
X_tv_fit_train, X_tv_fit_test, y_train, y_test = train_test_split(norm_reviews, imdb['sentiment'], test_size = 0.2, random_state = 1)

In [79]:
# Transform the training reviews into tf-idf vectors
X_tv_fit_train = tv_fit.transform(X_tv_fit_train)

In [80]:
# Transform the test reviews into tf-idf vectors
X_tv_fit_test = tv_fit.transform(X_tv_fit_test)

In [81]:
print(X_tv_fit_train.shape)
print(X_tv_fit_test.shape)

(8000, 62938)
(2000, 62938)


In [82]:
## You can also inspect the transformed features as a dense array (this materializes the full 8000 x 62938 matrix in memory)
X_tv_fit_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
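
A quick back-of-the-envelope calculation shows what the dense conversion costs. The sketch assumes 64-bit floats (8 bytes per value); the shape is the one printed for the tf-idf training matrix.

```python
rows, cols = 8000, 62938   # shape of the tf-idf training matrix
bytes_per_value = 8        # assumes 64-bit floats

dense_bytes = rows * cols * bytes_per_value
print(dense_bytes)                      # 4028032000
print(round(dense_bytes / 1024**3, 2))  # size in GiB
```

At roughly 4 GB for the dense form, it is usually safer to keep the matrix sparse and inspect only a few rows at a time.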

In [83]:
# Build the logistic regression model for the data
from sklearn.linear_model import LogisticRegression

#Training the model
lr = LogisticRegression(penalty = 'l2', max_iter = 500, C=1, random_state=42).fit(X_tv_fit_train, y_train)

In [84]:
## Model Evaluation
from sklearn.metrics import accuracy_score

# Predictions at threshold = 0.5
y_pred = lr.predict(X_tv_fit_test)

print('Validation set Accuracy: %.3f' % accuracy_score(y_test,y_pred))

Validation set Accuracy: 0.881
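
The "threshold = 0.5" in the comment refers to how a logistic regression score becomes a class label. A minimal sketch, with hypothetical weights and a hypothetical three-word feature vector (not the fitted model's coefficients):

```python
import math

def sigmoid(z):
    # Map any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    # Linear score, squashed into a probability of the positive class
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return (1 if probability > threshold else 0), probability

# Hypothetical weights for three features, e.g. 'wonder', 'bad', 'movi'
weights = [1.2, -1.5, 0.1]
bias = -0.2

print(predict_label([1, 0, 1], weights, bias))  # label 1: 'wonder' pushes the score up
print(predict_label([0, 2, 1], weights, bias))  # label 0: 'bad' twice pushes it down
```

Raising the threshold above 0.5 would make the model more conservative about predicting the positive class, trading recall for precision.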


In [85]:
#Classification report for tf-idf
# Labels are reported in sorted order (0 = negative, 1 = positive), so target_names must follow that order
from sklearn.metrics import classification_report
lr_tfidf_report=classification_report(y_test, y_pred,target_names=['Negative','Positive'])
print(lr_tfidf_report)

              precision    recall  f1-score   support

    Negative       0.88      0.88      0.88       978
    Positive       0.88      0.89      0.88      1022

    accuracy                           0.88      2000
   macro avg       0.88      0.88      0.88      2000
weighted avg       0.88      0.88      0.88      2000

