<a href="https://colab.research.google.com/github/netgvarun2012/MovieReviewSentimentAnalysis/blob/main/LSTMMovieSentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing necessary Libraries

In [45]:
pip install contractions 

Collecting contractions
  Downloading contractions-0.1.72-py2.py3-none-any.whl (8.3 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
[K     |████████████████████████████████| 287 kB 8.5 MB/s 
[?25hCollecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
[K     |████████████████████████████████| 106 kB 54.4 MB/s 
[?25hInstalling collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.72 pyahocorasick-1.4.4 textsearch-0.0.21


In [20]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import math
import nltk
import re
import os
from bs4 import BeautifulSoup
import unicodedata
from contractions import contractions_dict
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

# Loading the IMDB Movie review dataset!

In [8]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [12]:
%cp /content/gdrive/MyDrive/kaggle/kaggle.json /root/.kaggle

In [13]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 19% 5.00M/25.7M [00:00<00:00, 40.3MB/s]
100% 25.7M/25.7M [00:00<00:00, 137MB/s] 


In [14]:
!unzip /content/imdb-dataset-of-50k-movie-reviews.zip

Archive:  /content/imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


In [6]:
data = pd.read_csv('/content/IMDB Dataset.csv')
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


# Data Preprocessing


## Removing HTML tags
HTML tags typically don’t add much value towards understanding and analyzing text.



In [7]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [8]:
data['review'] = data['review'].apply(lambda cw: strip_html_tags(cw))

## Removing accented characters
Converted and standardised accented characters into ASCII characters. A simple example — converting é to e.

In [9]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [10]:
data['review'] = data['review'].apply(lambda cw: remove_accented_chars(cw))

In [11]:
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


# Expanding Contractions

Contractions are shortened version of words or syllables. In case of English contractions are often created by removing one of the vowels from the word. Examples would be, do not to don’t and I would to I’d. Converting each contraction to its expanded, original form helps with text standardization.

In [15]:
def expand_contractions(text, contraction_mapping=contractions_dict):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        try:
          expanded_contraction = contraction_mapping.get(match)\
                                  if contraction_mapping.get(match)\
                                  else contraction_mapping.get(match.lower())                       
          expanded_contraction = first_char+expanded_contraction[1:]
          return expanded_contraction
        except:
          return text
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

In [16]:
data['review'] = data['review'].apply(lambda cw: expand_contractions(cw))

In [17]:
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there is a family where a little boy...,negative
4,"Petter Matteis ""Love in the Time of Money"" is ...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic tI am a Catholic taught in par...,negative
49998,I am going to have to disagree with the previo...,negative


# Removing Special characters

In [18]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-z0-9\s]' if not remove_digits else r'[^a-zA-z\s]'
    text = re.sub(pattern, '', text)
    return text

In [19]:
data['review'] = data['review'].apply(lambda cw: remove_special_characters(cw))

# Removing Stop words
We also need to remove stopwords from the corpus. Stopwords are commonly used words like ‘and’, ‘the’, ‘at’ that do not add any special meaning or significance to a sentence. A list of stopwords are available with nltk, and they can be removed from the corpus using the following code :

In [21]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [22]:
data['review'] = data['review'].apply(lambda cw: remove_stopwords(cw))

# Remving few other things

In [23]:
# Removing “@mention”
def removeothers(text):
  text = re.sub(r'@[A-Za-z0-9]+','',text)
  text = re.sub('https?://[A-Za-z0-9./]+','',text)
  text = re.sub("[^a-zA-Z]", " ", text)
  return text

In [24]:
data['review'] = data['review'].apply(lambda cw: removeothers(cw))

In [25]:
data

Unnamed: 0,review,sentiment
0,One reviewers mentioned watching Oz episode ...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,Basically family little boy Jake thinks zombie...,negative
4,Petter Matteis Love Time Money visually stunni...,positive
...,...,...
49995,thought movie right good job not creative orig...,positive
49996,Bad plot bad dialogue bad acting idiotic direc...,negative
49997,Catholic tI Catholic taught parochial elementa...,negative
49998,going disagree previous comment side Maltin on...,negative


We now perform lemmatization on the text. Lemmatization is a useful technique in NLP to obtain the root form of words, known as lemmas. For example, the lemma of the words reading, reads, read is read. This helps save unnecessary computational overhead in trying to decipher entire words, as the meanings of most words are well-expressed by their separate lemmas.



In [28]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [26]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

In [27]:
def lemmatize_text(text):
    st = ""
    for w in w_tokenizer.tokenize(text):
        st = st + lemmatizer.lemmatize(w) + " "
    return st

In [28]:
data['review'] = data.review.apply(lemmatize_text)
data

Unnamed: 0,review,sentiment
0,One reviewer mentioned watching Oz episode hoo...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,Basically family little boy Jake think zombie ...,negative
4,Petter Matteis Love Time Money visually stunni...,positive
...,...,...
49995,thought movie right good job not creative orig...,positive
49996,Bad plot bad dialogue bad acting idiotic direc...,negative
49997,Catholic tI Catholic taught parochial elementa...,negative
49998,going disagree previous comment side Maltin on...,negative


# EDA
Next, we print some basic statistics about the dataset and check if the dataset is balanced or not (equal number of all labels). Ideally, the dataset should be balanced because a severely imbalanced dataset can be challenging to model and require specialized techniqu

In [29]:
s = 0.0
for i in data['review']:
    word_list = i.split()
    s = s + len(word_list)
print("Average length of each review : ",s/data.shape[0])
pos = 0
for i in range(data.shape[0]):
    if data.iloc[i]['sentiment'] == 'positive':
        pos = pos + 1
neg = data.shape[0]-pos
print("Percentage of reviews with positive sentiment is "+str(pos/data.shape[0]*100)+"%")
print("Percentage of reviews with negative sentiment is "+str(neg/data.shape[0]*100)+"%")

Average length of each review :  331.49262
Percentage of reviews with positive sentiment is 50.0%
Percentage of reviews with negative sentiment is 50.0%


# Encoding Labels and Making Train-Test Splits
We use the LabelEncoder() from sklearn.preprocessing to convert the labels (‘positive’, ‘negative’) into 1’s and 0’s respectively.

In [30]:
reviews = data['review'].values
labels = data['sentiment'].values
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

Finally, we split the dataset into train and test parts using train_test_split from sklearn.model_selection. We use 80% of the dataset for training and 20% for testing.

In [31]:
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels, stratify = encoded_labels)

Before being fed into the LSTM model, the data needs to be padded and tokenized:

- **Tokenizing**: Keras’ inbuilt tokenizer API has fit the dataset, which splits the sentences into words and creates a dictionary of all unique words found and their uniquely assigned integers. Each sentence is converted into an array of integers representing all the individual words present in it.

- **Sequence Padding**: The array representing each sentence in the dataset is filled with zeroes to the left to make the size of the array ten and bring all collections to the same length.

In [32]:
# Hyperparameters of the model
vocab_size = 3000 # choose based on statistics
oov_tok = ''
embedding_dim = 100
max_length = 200 # choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post'
# tokenize sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding='post', maxlen=max_length)
# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

# Building the Model
A Keras sequential model is built. It is a linear stack of the following layers :

- An embedding layer of dimension 100 converts each word in the sentence into a fixed-length dense vector of size 100. The input dimension is set as the vocabulary size, and the output dimension is 100. Each word in the input will hence get represented by a vector of size 100.
- A bidirectional LSTM layer of 64 units.
- A dense (fully connected) layer of 24 units with relu activation.
- A dense layer of 1 unit and sigmoid activation outputs the probability of the review is positive, i.e. if the label is 1.

The code for building the model :

In [33]:
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 100)          300000    
                                                                 
 bidirectional (Bidirectiona  (None, 128)              84480     
 l)                                                              
                                                                 
 dense (Dense)               (None, 24)                3096      
                                                                 
 dense_1 (Dense)             (None, 1)                 25        
                                                                 
Total params: 387,601
Trainable params: 387,601
Non-trainable params: 0
_________________________________________________________________


The model is compiled with binary cross-entropy loss and adam optimizer. Since we have a binary classification problem, binary cross-entropy loss is used. The Adam optimizer uses stochastic gradient descent to train deep learning models, and it compares each of the predicted probabilities to the actual class label (0 or 1). Accuracy is used as the primary performance metric

In [34]:
num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [35]:
prediction = model.predict(test_padded)
# Get labels based on probability 1 if p>= 0.5 else 0
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
print("Accuracy of prediction on test set : ", accuracy_score(test_labels,pred_labels))

Accuracy of prediction on test set :  0.85032


# Using the model to determine the sentiment of unseen movie reviews

In [70]:
def predict_sentiments(sentence):          
  # convert to a sequence
  sequences = tokenizer.texts_to_sequences(sentence)
  # pad the sequence
  padded = pad_sequences(sequences, padding='post', maxlen=max_length)
  # Get labels based on probability 1 if p>= 0.5 else 0
  prediction = model.predict(padded)
  pred_labels = []
  for i in prediction:
      if i >= 0.5:
          pred_labels.append(1)
      else:
          pred_labels.append(0)
  for i in range(len(sentence)):
      print(sentence[i])
      if pred_labels[i] == 1:
          s = 'Positive'
      else:
          s = 'Negative'
      print("Predicted sentiment : ",s)
      print()
  return ("Predicted sentiment : "+s)

In [71]:
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]
predict_sentiments(sentence)

The movie was very touching and heart whelming
Predicted sentiment :  Positive

I have never seen a terrible movie like this
Predicted sentiment :  Negative

the movie plot is terrible but it had good acting
Predicted sentiment :  Negative



'Predicted sentiment : Negative'

In [58]:
# reviews on which we need to predict
sentence = ["No one expects Star Trek movie high art fan expect movie good best episode Unfortunately movie muddled implausible plot left cringing far worst nine far movie Even chance watch well known character interact another movie cannot save movie including goofy scene Kirk Spock McCoy YosehemiteI would say movie not worth rental hardly worth watching however True Fan need see movie renting movie way see even cable channel avoid movie "]
predict_sentiments(sentence)

No one expects Star Trek movie high art fan expect movie good best episode Unfortunately movie muddled implausible plot left cringing far worst nine far movie Even chance watch well known character interact another movie cannot save movie including goofy scene Kirk Spock McCoy YosehemiteI would say movie not worth rental hardly worth watching however True Fan need see movie renting movie way see even cable channel avoid movie 
Predicted sentiment :  Negative



In [57]:
data['review'][49999]

'No one expects Star Trek movie high art fan expect movie good best episode Unfortunately movie muddled implausible plot left cringing far worst nine far movie Even chance watch well known character interact another movie cannot save movie including goofy scene Kirk Spock McCoy YosehemiteI would say movie not worth rental hardly worth watching however True Fan need see movie renting movie way see even cable channel avoid movie '

In [72]:
# reviews on which we need to predict
sentence = ["I have seen the movie myself, its not that good. The story line is pathetic and also the caste seems to be doing a very shabby job. I don't think that this movie will do well on the box office"]
predict_sentiments(sentence)

I have seen the movie myself, its not that good. The story line is pathetic and also the caste seems to be doing a very shabby job. I don't think that this movie will do well on the box office
Predicted sentiment :  Negative



'Predicted sentiment : Negative'

In [63]:
# reviews on which we need to predict
sentence = ["I have seen the movie myself, its very good. The story line is solid and also the caste seems to be doing a very nice job. I think that this movie will do well on the box office"]
predict_sentiments(sentence)

I have seen the movie myself, its very good. The story line is solid and also the caste seems to be doing a very nice job. I think that this movie will do well on the box office
Predicted sentiment :  Positive



# Deploying via Flask!

In [64]:
pip install flask_ngrok

Collecting flask_ngrok
  Downloading flask_ngrok-0.0.25-py3-none-any.whl (3.1 kB)
Installing collected packages: flask-ngrok
Successfully installed flask-ngrok-0.0.25


In [65]:
pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-5.1.0.tar.gz (745 kB)
[?25l[K     |▍                               | 10 kB 25.5 MB/s eta 0:00:01[K     |▉                               | 20 kB 29.5 MB/s eta 0:00:01[K     |█▎                              | 30 kB 32.3 MB/s eta 0:00:01[K     |█▊                              | 40 kB 22.6 MB/s eta 0:00:01[K     |██▏                             | 51 kB 7.9 MB/s eta 0:00:01[K     |██▋                             | 61 kB 9.1 MB/s eta 0:00:01[K     |███                             | 71 kB 9.9 MB/s eta 0:00:01[K     |███▌                            | 81 kB 10.9 MB/s eta 0:00:01[K     |████                            | 92 kB 12.0 MB/s eta 0:00:01[K     |████▍                           | 102 kB 10.2 MB/s eta 0:00:01[K     |████▉                           | 112 kB 10.2 MB/s eta 0:00:01[K     |█████▎                          | 122 kB 10.2 MB/s eta 0:00:01[K     |█████▊                          | 133 kB 10.2 MB/s eta 0:00:01[K   

In [66]:
from pyngrok import ngrok
!ngrok authtoken 22PJMsI1m7iGI76gpmxR4Z2yyHm_HzeA6PQme8h54syeC9wJ

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [73]:
import logging
from flask import Flask, request, jsonify, url_for, render_template
import uuid
from flask_ngrok import run_with_ngrok
from flask import Flask
app = Flask(__name__,template_folder='/content/gdrive/MyDrive/IMDBPredictions/templates',
            static_folder='/content/gdrive/MyDrive/IMDBPredictions/static')
run_with_ngrok(app)   #starts ngrok when the app is run
gunicorn_logger = logging.getLogger('gunicorn.error')
app.logger.handlers = gunicorn_logger.handlers
app.logger.setLevel(gunicorn_logger.level)

Expected = {
    "Review":{"min":1,"max":2000}
}

@app.route('/')
def indexes():
  
  return render_template('MovieReview.html')


@app.route('/submitted', methods=['POST'])
def submitted():
  content = request.form['text']
  errors = []

  sample_example = [content]
  r1 = predict_sentiments(sample_example)

  return render_template('MovieReview.html',prediction=  r1 )

app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
[2m   Use a production WSGI server instead.[0m
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)


 * Running on http://6e89-35-203-184-68.ngrok.io
 * Traffic stats available on http://127.0.0.1:4040


127.0.0.1 - - [05/May/2022 13:59:00] "[37mGET / HTTP/1.1[0m" 200 -
127.0.0.1 - - [05/May/2022 13:59:00] "[37mGET /static/css/style.css HTTP/1.1[0m" 200 -
127.0.0.1 - - [05/May/2022 13:59:01] "[33mGET /favicon.ico HTTP/1.1[0m" 404 -
127.0.0.1 - - [05/May/2022 13:59:01] "[37mGET / HTTP/1.1[0m" 200 -
127.0.0.1 - - [05/May/2022 13:59:03] "[37mGET /static/css/style.css HTTP/1.1[0m" 200 -
127.0.0.1 - - [05/May/2022 13:59:42] "[37mPOST /submitted HTTP/1.1[0m" 200 -


I have myself seen the movie and i really liked it. I think the caste has done a fine job. The story line is refreshing and the cinematography is excellent! Do watch if you get a chance!
Predicted sentiment :  Positive



127.0.0.1 - - [05/May/2022 13:59:46] "[31m[1mGET /submitted HTTP/1.1[0m" 405 -
