<a href="https://colab.research.google.com/github/itslokeshrawat/Sentiment-Analysis/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentiment Analysis on IMDB Reviews

In [None]:
# Import all libraries
import pandas as pd
import matplotlib.pyplot as plt

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, SpatialDropout1D, Embedding

df = pd.read_csv("/content/MovieReviewTrainingDatabase.csv")

In [None]:
df.head()

Unnamed: 0,sentiment,review
0,Positive,With all this stuff going down at the moment w...
1,Positive,'The Classic War of the Worlds' by Timothy Hin...
2,Negative,The film starts with a manager (Nicholas Bell)...
3,Negative,It must be assumed that those who praised this...
4,Positive,Superbly trashy and wondrously unpretentious 8...


In [None]:
# This is a binary classification problem, so drop any rows labelled neutral.
# The data was loaded as `df` above; the filtered frame is named `reviews_df` from here on.
reviews_df = df[df['sentiment'] != 'neutral']
print(reviews_df.shape)
reviews_df.head(5)

In [None]:
# Check how many reviews fall into each sentiment class
reviews_df["sentiment"].value_counts()

Positive    12500
Negative    12500
Name: sentiment, dtype: int64

In [None]:
# The model needs numeric targets, so convert the categorical labels to integer codes with factorize().
# It returns a (codes, uniques) tuple, where each code is a position in uniques.
sentiment_label = reviews_df.sentiment.factorize()
sentiment_label

(array([0, 0, 1, ..., 1, 1, 0]),
 Index(['Positive', 'Negative'], dtype='object'))
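The integer codes above follow the order in which each label first appears, which is why `Positive` becomes 0 and `Negative` becomes 1. A minimal pure-Python sketch of that idea is shown below; `factorize_sketch` is a hypothetical helper written for illustration, not the pandas implementation.

```python
def factorize_sketch(labels):
    """Return (codes, uniques), assigning codes in order of first appearance."""
    uniques = []
    position = {}
    codes = []
    for label in labels:
        if label not in position:
            # A label seen for the first time gets the next free code
            position[label] = len(uniques)
            uniques.append(label)
        codes.append(position[label])
    return codes, uniques


# The five labels shown in the head() output above
codes, uniques = factorize_sketch(
    ["Positive", "Positive", "Negative", "Negative", "Positive"]
)
print(codes)    # [0, 0, 1, 1, 0]
print(uniques)  # ['Positive', 'Negative']
```

Because the mapping depends on row order, the same code can mean a different label on a differently ordered dataset, so it is worth keeping `uniques` alongside the codes.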

In [None]:
# Retrieve all review texts from the dataset
reviews = reviews_df.review.values

# Tokenization: num_words=5000 limits the encoded output to the most frequent words
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(reviews)
# word_index still contains every word seen; +1 accounts for index 0, which is reserved for padding
vocab_size = len(tokenizer.word_index) + 1

# Now, replace the words with their assigned numbers using the texts_to_sequences() method.
encoded_docs = tokenizer.texts_to_sequences(reviews)

# The reviews differ in length, so pad or truncate every sequence to exactly 200 tokens.
padded_sequence = pad_sequences(encoded_docs, maxlen=200)
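To make the three steps in the cell above concrete, here is a simplified pure-Python sketch of frequency-ranked indexing, encoding, and padding. The helper names (`build_word_index`, `encode`, `pad_pre`) and the toy texts are hypothetical. The sketch splits on whitespace only, whereas the Keras `Tokenizer` also strips punctuation. It pads and truncates at the front of the sequence, which matches the `pad_sequences` defaults.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency: 1 is the most frequent, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count, keeping first-seen order for ties
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index, num_words=None):
    """Replace words with ranks, skipping unknown words and ranks >= num_words."""
    ranks = (word_index.get(word) for word in text.lower().split())
    return [r for r in ranks if r is not None and (num_words is None or r < num_words)]


def pad_pre(seq, maxlen):
    """Keep the last maxlen ids, then left-pad with zeros up to maxlen."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


toy_texts = ["good movie", "bad movie", "really good movie"]
word_index = build_word_index(toy_texts)
encoded = [encode(text, word_index) for text in toy_texts]
padded = [pad_pre(seq, maxlen=4) for seq in encoded]

print(word_index)  # {'movie': 1, 'good': 2, 'bad': 3, 'really': 4}
print(padded)      # [[0, 0, 2, 1], [0, 0, 3, 1], [0, 4, 2, 1]]
```

Front padding keeps the real tokens next to the end of the sequence, which is the last thing the LSTM reads before it produces its output.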

In [None]:
print(tokenizer.word_index)



In [None]:
print(reviews[0])
print(encoded_docs[0])

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.  The actual feature film bit when it finally starts is only on for 20

In [None]:
print(padded_sequence[0])

[ 135   25  488  375   34   78    6  725   69   82   23 2425  936  106
   11   25  466   84    5  120    8    6   25   33    6 1633  515   34
    9  277   25   39 4147  225  765    4  673  178    7   10   36 1569
   79    3  517    2    3 2522    2    1  221 2175 2808  710   77    1
  163  211   24   66    1    4    3   51    8  380    5 1400    1   78
  710   13  623  895  778  774   15   27  561  389  589    3  221  757
    4   95 3426    3 1296  861  133 1312  344   10   16    6   14   84
   33   36   19   27  637   38  156   60    9  101    6   87   84   45
   20   92  797  242    8  124  345    2  200  122    3  761    2 3574
 2112    7   10   16    6    3  250  487 1866    6  366   27    4    1
   87 1004   84  123    5 1752   10 1280   17    6   25 2543   70   15
   28    1  684  202  513   10  865   70    9   89  120   82   84   67
   26  274  499 4505 3669    9  120   10   14    3  188   25    6  340
   31  570  323   17  379  237   38   27    4    1   87    9  439   25
    6 

In [None]:
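embedding_vector_length = 32
model = Sequential()
# Map each word index to a trainable 32-dimensional vector
model.add(Embedding(vocab_size, embedding_vector_length, input_length=200))
model.add(SpatialDropout1D(0.25))
model.add(LSTM(50, dropout=0.5, recurrent_dropout=0.5))
model.add(Dropout(0.2))
# A single sigmoid unit outputs a value between 0 and 1 for the binary label
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()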

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 200, 32)           3298080   
                                                                 
 spatial_dropout1d_2 (Spatia  (None, 200, 32)          0         
 lDropout1D)                                                     
                                                                 
 lstm_2 (LSTM)               (None, 50)                16600     
                                                                 
 dropout_2 (Dropout)         (None, 50)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 51        
                                                                 
Total params: 3,314,731
Trainable params: 3,314,731
Non-trainable params: 0
____________________________________________
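The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes; note that `vocab_size` is inferred here from the embedding row of the summary (3,298,080 / 32) rather than read from the tokenizer, and the 5,000-row embedding at the end is a hypothetical alternative, not what the notebook does.

```python
embedding_dim = 32
lstm_units = 50

# Assumption: vocabulary size inferred from the summary, not from the tokenizer
vocab_size = 3_298_080 // embedding_dim

# One embedding_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# An LSTM has four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# One weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print(vocab_size)        # 103065
print(embedding_params)  # 3298080
print(lstm_params)       # 16600
print(dense_params)      # 51
print(total_params)      # 3314731

# Hypothetical alternative: size the embedding to num_words=5000 instead
capped_embedding_params = 5000 * embedding_dim
print(capped_embedding_params)  # 160000
```

Almost all of the parameters sit in the embedding layer. Since the tokenizer only emits indices below 5000, an embedding with 5000 rows would likely be sufficient and far smaller, though that change is not tested here.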

In [None]:
# Train the sentiment analysis model; sentiment_label[0] holds the integer label codes
history = model.fit(padded_sequence, sentiment_label[0], validation_split=0.2, epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
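The `validation_split` and `batch_size` arguments determine how many reviews the model trains on and how many batches make up each epoch. The following back-of-the-envelope sketch assumes the 25,000 rows reported by `value_counts()` all reach `fit()`, and that Keras holds out the final fraction of the rows as documented for `validation_split`.

```python
import math

n_samples = 12_500 + 12_500  # Positive + Negative, from value_counts()
validation_split = 0.2
batch_size = 32

# The training portion is the first 80% of the rows; the rest is held out
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train

# Number of gradient updates per epoch
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train)          # 20000
print(n_val)            # 5000
print(steps_per_epoch)  # 625
```

Because the held-out rows come from the end of the arrays before any shuffling, a file sorted by label could give an unbalanced validation set. Shuffling the DataFrame before training would guard against that.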


In [None]:
# Define a function that takes a text as input and prints its predicted label.
def predict_sentiment(text):
    # Encode and pad the text the same way the training data was prepared
    tw = tokenizer.texts_to_sequences([text])
    tw = pad_sequences(tw, maxlen=200)
    # Round the sigmoid output to 0 or 1, then look up the matching label name
    prediction = int(model.predict(tw).round().item())
    print("Predicted label: ", sentiment_label[1][prediction])
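Rounding the sigmoid output discards how confident the model was. As a sketch of the same thresholding step without TensorFlow, the hypothetical helper `label_from_probability` below turns a probability into a label plus a confidence score. The probabilities passed in are made-up illustrative values, not outputs of the trained model.

```python
def label_from_probability(probability, labels, threshold=0.5):
    """Return (label, confidence) for a sigmoid output between 0 and 1."""
    # Outputs above the threshold map to class 1, everything else to class 0
    index = 1 if probability > threshold else 0
    confidence = probability if index == 1 else 1 - probability
    return labels[index], confidence


# Same order as the uniques returned by factorize(): code 0, then code 1
labels = ["Positive", "Negative"]

# Illustrative inputs only
print(label_from_probability(0.91, labels))     # ('Negative', 0.91)
print(label_from_probability(0.08, labels)[0])  # Positive
```

Returning the confidence makes borderline cases visible, for example an output near 0.5 on a short or ambiguous review.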

In [None]:
test_sentence1 = "This movie is really good."
predict_sentiment(test_sentence1)

test_sentence2 = "This movie is really bad."
predict_sentiment(test_sentence2)

Predicted label:  Positive
Predicted label:  Negative
