# Sentiment Classification


The objective of this project is to learn word embeddings on a classification task and to retrieve the output of each layer of the trained Keras model. Word embeddings are a type of word representation in which words with similar meanings have similar representations.
They are a distributed representation of text, and are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.


## Dataset

We will use the IMDb dataset and learn word embeddings as we train our model. As loaded through Keras, the dataset contains 25,000 training and 25,000 test movie reviews from IMDb, each labeled with a sentiment (0 for negative, 1 for positive).

## Loading the dataset

We will load the dataset, which comes already split into training and test data, and analyze it to decide on the parameters for the model.


In [1]:
from tensorflow.keras.datasets import imdb
import numpy as np

(x_train, y_train), (x_test, y_test) = imdb.load_data()

X = np.concatenate((x_train, x_test), axis=0)

In [2]:
print("Number of words: ")
print(len(np.unique(np.hstack(X))))

Number of words: 
88585


As we can see, the reviews contain 88,585 different words. We will only consider the 7,500 most frequently used ones.

In [3]:
vocab_size = 7500 #vocab size

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
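
To make the effect of `num_words` concrete, here is a minimal pure-Python sketch of the filtering that `imdb.load_data` applies with its default settings (`oov_char=2`, `skip_top=0`): every index at or above the cutoff is replaced by the out-of-vocabulary marker. The helper name `limit_vocab` and the toy review are made up for illustration.

```python
def limit_vocab(sequence, num_words, oov_char=2):
    """Replace every index >= num_words with the out-of-vocabulary marker."""
    return [idx if idx < num_words else oov_char for idx in sequence]

toy_review = [1, 14, 22, 9000, 43, 88000]   # hypothetical encoded review
limited = limit_vocab(toy_review, num_words=7500)
print(limited)  # [1, 14, 22, 2, 43, 2]
```

Rare words are therefore not dropped from a review; they all collapse onto the single id 2.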

In [4]:
print(y_test[0:100])

[0 1 1 0 1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1
 1 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 1 1 1 1 0 0 0 0 1 0 0 1 0 1 1 1 1 1 1 0
 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 1 1 0 0]


We will assume that 500 words per review are enough to tell whether the review is positive or negative, and cap every review at that length when building our model.

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
maxlen = 500  # number of words used from each review

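Let's pad reviews shorter than 500 words with 0 and truncate longer ones to 500 words, so that all elements of x_train and x_test have the same length. By default, `pad_sequences` both pads and truncates at the start of a sequence, so long reviews keep their last 500 words.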

In [6]:
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)
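
As a rough illustration of what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'`, the following pure-Python sketch handles a single sequence. The helper `pad_pre` is hypothetical and is not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short_seq = pad_pre([5, 6, 7], maxlen=5)
long_seq = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_seq)  # [0, 0, 5, 6, 7]
print(long_seq)   # [3, 4, 5, 6, 7]
```

This is why the decoded reviews further down start with a long run of `<PAD>` tokens.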

In [7]:
print(x_train.shape)


(25000, 500)


In [8]:
import numpy as np

#Unique Class- Binary classification
np.unique(y_train)


array([0, 1])

In [9]:
#understand the IMDB dataset 
word_index = imdb.get_word_index()

print(sorted(word_index.items(), key = 
             lambda kv:(kv[1], kv[0])))  


key_list= list(word_index.keys())
value_list = list(word_index.values())

value_list.index(1)



58318

In [10]:
word_to_id = imdb.get_word_index()

#print(word_to_id)
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
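
The offset of 3 is easier to follow on a toy vocabulary. The sketch below uses a made-up three-word index (`toy_word_index` is not the real IMDb index) and decodes a padded sequence that contains one unknown word.

```python
toy_word_index = {"the": 1, "movie": 2, "great": 3}   # hypothetical raw index

# Shift by 3 so that ids 0, 1 and 2 are free for the special tokens.
toy_word_to_id = {word: idx + 3 for word, idx in toy_word_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

encoded = [0, 0, 1, 4, 5, 2, 6]    # padded review with one unknown word
decoded = " ".join(toy_id_to_word[i] for i in encoded)
print(decoded)  # <PAD> <PAD> <START> the movie <UNK> great
```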


In [11]:
id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in x_test[15] ))
print('The sentiment is:', y_test[15])

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

In [12]:
id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in x_test[16] ))
print('The sentiment is:', y_test[16])

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* It can be used at the start of a larger deep learning model.
* It can be initialized with pre-trained word embeddings when we create our model.
* It can be used to train our own word2vec-style embeddings.

The Keras Embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer ID. For the IMDb dataset we've loaded this has already been done, but if it hadn't been, we could use scikit-learn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
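
The "dictionary" view can be sketched with plain NumPy: an embedding is a matrix with one row per word ID, and the layer's forward pass is a row lookup. The sizes and the random matrix below are illustrative assumptions; in Keras the rows are weights learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, embedding_dim = 10, 4
# One row per word id; in Keras these weights are learned during training.
embedding_matrix = rng.normal(size=(toy_vocab_size, embedding_dim))

word_ids = np.array([[0, 0, 3, 7]])          # batch of one padded sequence
vectors = embedding_matrix[word_ids]         # row lookup, no one-hot encoding needed
print(vectors.shape)  # (1, 4, 4)

# The same result via an explicit one-hot product, which the lookup avoids.
one_hot = np.eye(toy_vocab_size)[word_ids]
assert np.allclose(one_hot @ embedding_matrix, vectors)
```

Identical IDs always map to identical vectors, which is why the repeated padding positions in the layer output share the same values.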

In [13]:
import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding
from tensorflow.keras.layers import LSTM, TimeDistributed


In [20]:
model = Sequential()
model.add(Embedding(vocab_size, 128, input_length=maxlen))
#We will add dropout to avoid overfitting
model.add(LSTM(64, dropout=0.3, activation='tanh', recurrent_dropout=0.3, return_sequences=True))
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 128)          960000    
_________________________________________________________________
lstm_3 (LSTM)                (None, 500, 64)           49408     
_________________________________________________________________
time_distributed_3 (TimeDist (None, 500, 100)          6500      
_________________________________________________________________
flatten_3 (Flatten)          (None, 50000)             0         
_________________________________________________________________
dense_12 (Dense)             (None, 250)               12500250  
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 251       
Total params: 13,516,409
Trainable params: 13,516,409
Non-trainable params: 0
__________________________________________
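
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the standard formulas for embedding, LSTM, and dense layers; the variable names are chosen here for illustration.

```python
vocab_size, embed_dim, lstm_units = 7500, 128, 64
maxlen, td_units, dense_units = 500, 100, 250

embedding = vocab_size * embed_dim                       # one vector per word id
# 4 gates, each with input weights, recurrent weights and a bias
lstm = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
time_distributed = lstm_units * td_units + td_units      # one Dense shared across time steps
dense_hidden = (maxlen * td_units) * dense_units + dense_units
dense_out = dense_units * 1 + 1

total = embedding + lstm + time_distributed + dense_hidden + dense_out
print(embedding, lstm, time_distributed, dense_hidden, dense_out)
# 960000 49408 6500 12500250 251
print(total)  # 13516409
```

Most of the parameters sit in the first dense layer, because `Flatten` turns the 500 x 100 sequence output into a 50,000-dimensional vector.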

In [21]:
start = time.process_time()
history = model.fit(x_train, y_train, epochs=2, batch_size=128, validation_data=(x_test, y_test))
end = time.process_time()
print('Time spent:', end-start)


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2
Time spent: 1757.5492920000002


In [22]:

score = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (score[1]*100))


Accuracy: 88.33%


## Retrieve the output of each layer in Keras for a single test sample from the trained model

In [23]:
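import tensorflow.keras.backend as K
inp = model.input                                           # input placeholder
test = np.array([x_test[16]])                               # batch holding a single test sample


for layer in model.layers:
    # build a function that maps the model input to this layer's output
    functor = K.function([inp, K.learning_phase()], [layer.output])
    # second argument is the learning phase: 1 = training mode (dropout active), 0 = inference mode
    layer_out = functor([test,1.])
    print(layer.name)
    print(layer_out)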


embedding_3
[array([[[ 0.02554358, -0.01956453,  0.02783529, ..., -0.00535768,
         -0.00465509,  0.03165284],
        [ 0.02554358, -0.01956453,  0.02783529, ..., -0.00535768,
         -0.00465509,  0.03165284],
        [ 0.02554358, -0.01956453,  0.02783529, ..., -0.00535768,
         -0.00465509,  0.03165284],
        ...,
        [-0.03044875,  0.00326135,  0.04702074, ...,  0.02906072,
         -0.01716134,  0.02893757],
        [-0.02496897,  0.09021177,  0.0822059 , ...,  0.04503163,
         -0.01114521,  0.06210981],
        [-0.02496897,  0.09021177,  0.0822059 , ...,  0.04503163,
         -0.01114521,  0.06210981]]], dtype=float32)]
lstm_3
[array([[[ 0.00181776, -0.00572887,  0.00426353, ..., -0.00403602,
          0.00603666, -0.0009559 ],
        [ 0.00338093, -0.00917423,  0.00755088, ..., -0.00674243,
          0.01007202, -0.0012634 ],
        [ 0.00470417, -0.01119786,  0.010029  , ..., -0.0085908 ,
          0.01284369, -0.00137813],
        ...,
        [ 0.01161
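
Recent TensorFlow releases no longer provide `K.learning_phase()`, so the loop above may not run there. One alternative, sketched below under the assumption that a current `tensorflow.keras` is installed, is to build a second model whose outputs are the outputs of every layer. The tiny `toy_model` and its sizes are stand-ins invented for this example, not the model trained above.

```python
import numpy as np
from tensorflow import keras

# Tiny stand-in model (hypothetical sizes) with the same kinds of layers.
toy_model = keras.Sequential([
    keras.Input(shape=(5,), dtype="int32"),
    keras.layers.Embedding(10, 4),
    keras.layers.LSTM(3, return_sequences=True),
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation="sigmoid"),
])

# One model whose outputs are the outputs of every layer.
extractor = keras.Model(
    inputs=toy_model.inputs,
    outputs=[layer.output for layer in toy_model.layers],
)

sample = np.array([[0, 0, 3, 7, 2]])          # batch with a single sequence
layer_outputs = extractor.predict(sample, verbose=0)
for layer, out in zip(toy_model.layers, layer_outputs):
    print(layer.name, out.shape)
```

Because `predict` runs in inference mode, dropout is inactive in the retrieved outputs.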

In [92]:
print(y_test[16])

1


As we can see above, the output of the final layer of the model is close to 1, which matches the value in y_test. With a test accuracy of 88.33%, the model performs reasonably well. We could tune it further by changing the hyperparameters or adding more hidden layers.

## Prediction
Let us now create functions that use the IMDb word index to encode a review string and predict its sentiment.

In [93]:
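def create_word_list(reviewString):
    """Encode a review string with the IMDb word index (unknown and rare words are dropped)."""
    reviewString = reviewString.lower()
    wordsInReviewString = reviewString.split(' ')
    wordVecArr = [1]  # 1 is the <START> token that load_data() puts at the start of each review
    for word in wordsInReviewString:
        # strip basic punctuation
        for ch in '.,?!':
            word = word.replace(ch, '')
        if word in word_index:
            # load_data() shifts every index by 3 to reserve 0, 1 and 2 for special tokens
            numMap = word_index[word] + 3
            if numMap < vocab_size:
                wordVecArr.append(numMap)

    wordVecArr = pad_sequences([wordVecArr], maxlen=maxlen)

    return wordVecArr

def predict_review_category(reviewString):
    # predict_classes() was removed in recent TensorFlow releases; threshold the sigmoid output instead
    probability = model.predict(create_word_list(reviewString))[0][0]
    # in the IMDb labels, 1 is positive and 0 is negative
    if probability > 0.5:
        return "Positive"
    else:
        return "Negative"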
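
The encoder strips only four punctuation marks, so a token such as `(alaskaLand)` keeps its parentheses and will not be found in the word index. A regular expression is one way to tokenize more robustly. The sketch below is an illustration only: `tokenize` and the entries in `toy_word_index` are hypothetical, and the pattern matches straight apostrophes but not the curly ones used in the reviews below.

```python
import re

toy_word_index = {"great": 84, "acting": 113, "isn't": 5000}   # hypothetical entries

def tokenize(review):
    """Lowercase the review and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", review.lower())

tokens = tokenize("Great acting, isn't it? (Truly!)")
known = [t for t in tokens if t in toy_word_index]
print(tokens)  # ['great', 'acting', "isn't", 'it', 'truly']
print(known)   # ['great', 'acting', "isn't"]
```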

## Analyze positive review
We will now feed in a positive movie review from https://www.rollingstone.com/movies/movie-reviews/clemency-review-alfre-woodard-930603/ and see whether the model correctly predicts that the review is positive.


In [96]:
print(predict_review_category("If you want to see what great acting is, watch Alfre Woodard deliver a master class in Clemency. In this shattering second feature from writer-director Chinonye Chukwu (alaskaLand) — which earlier this year made her the first black woman to win the Grand Jury Prize at Sundance — Woodard plays Bernadine Williams, an emotionally restrained prison warden who is about to oversee her twelfth execution by lethal injection. The last one, as Chukwu shows us, damn near wrecked her. The film opens with the gut-wrenching sight of state-sanctioned murder. The paramedic can’t find a vein. The condemned man suffers convulsions. Blood spurts. His mother cries in horror behind a glass wall, until Williams draws a curtain to block her and the assembled journalists from the horror. Behind the curtain, even the prison staff is traumatized. The warden has done everything to control each step of the execution, but it takes an agonizingly long time until the prisoner’s heart monitor stops beeping. So brutal is the process, which Chukwu presents in granular detail, that you might feel as if your heart has stopped, too."))

Positive


## Analyze negative review
We will now feed in a negative movie review from https://www.rollingstone.com/movies/movie-reviews/cats-movie-review-taylor-swift-927486/ and see whether the model correctly predicts that the review is negative.


In [97]:
print(predict_review_category("Attention, moviegoers searching for the worst movie of the year: We have a late-breaking winner. Cats slips in right under the radar and easily scores as the bottom of the 2019 barrel — and arguably of the decade. Even Michael Bay’s trash trilogy of soul-destroying Transformers movies can’t hold a candle. What happened? Wasn’t the stage production of Cats — music by Andrew Lloyd Webber and lyrics by poet T.S. Eliot — an award-winning smash from Broadway to Tokyo? It was. But in this all-star, all-awful screen version, directed by Tom Hooper (The King’s Speech’s, Les Miserables), everything that should work goes calamitously wrong. The first trailer earned hisses on social media. The full movie, inert and as indigestible as a hairball, is much, much worse. Shot on a soundstage to suggest a bad feline-themed Halloween party, the film — like the show — is based on Eliot’s beloved 1939 poetry collection, Old Possum’s Book of Practical Cats. That means that over a single night in London, a tribe of junkyard cats called Jellicles run a talent show to prove their worthiness to the chief judge, Old Deuteronomy (Judi Dench, a Dame who deserved better). The prize? The chosen feline will ascend to cat heaven, known as the Heaviside Layer, and be reborn into a better life where presumably no one will ever be forced to sit through this movie."))

Negative


## Conclusion
Both reviews were classified correctly by the model. The model can be tuned further for higher accuracy, taking care not to overfit.