## Imports

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist



In [0]:
# Import from the public tf.keras API; tensorflow.python.* is a private namespace.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


## Data

In [0]:
!wget "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

--2018-12-22 10:56:23--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.1’


2018-12-22 10:56:30 (12.5 MB/s) - ‘aclImdb_v1.tar.gz.1’ saved [84125825/84125825]



In [0]:
!ls

aclImdb  aclImdb_v1.tar.gz  aclImdb_v1.tar.gz.1  sample_data


In [0]:
!tar -xvzf aclImdb_v1.tar.gz

Output has been cleared due to large size

In [0]:
import os
import glob

def get_data(data_dir, train=True):
    """Load IMDB reviews from data_dir/aclImdb/{train,test}/{pos,neg}.

    Returns the review texts and their labels (1.0 = positive, 0.0 = negative).
    """

    def _read_text_file(path):
        # Read as UTF-8 rather than relying on the platform default encoding.
        with open(path, 'rt', encoding='utf-8') as file:
            lines = file.readlines()
            text = " ".join(lines)
        return text

    train_test_path = "train" if train else "test"
    dir_base = os.path.join(data_dir, "aclImdb", train_test_path)

    path_pattern_pos = os.path.join(dir_base, "pos", "*.txt")
    path_pattern_neg = os.path.join(dir_base, "neg", "*.txt")

    paths_pos = glob.glob(path_pattern_pos)
    paths_neg = glob.glob(path_pattern_neg)

    data_pos = [_read_text_file(path) for path in paths_pos]
    data_neg = [_read_text_file(path) for path in paths_neg]

    # Positive reviews come first, followed by the negative ones.
    x = data_pos + data_neg
    y = [1.0] * len(data_pos) + [0.0] * len(data_neg)
    return x, y

In [0]:
x_train_text, y_train = get_data("/content", train=True)
x_test_text, y_test = get_data("/content", train=False)

In [0]:
x_train_text[-1]

'Omen IV (1991) was a bad made-for-T.V. movie. Since the 80\'s were over, I guess the executives were experimenting in meth (the drug of choice during the 90\'s) because there is no other reason to explain this travesty. Why did they even bother making this? A t.v. movie? What were they mulling over when this one came up on the idea board? Did they even think for a second that this movie would catch on as. Perhaps they thought it could make it as a series? We\'ll never know. But I know one thing. This movie was the major reason why I never bought the Omen trilogy. They should have knocked off a couple of bucks instead of putting out this "extra" disc.<br /><br />Omen IV is basically a average American family remake of the first film. Instead of a snot nosed punk kid, we get the spooky girl who\'s a total brat to everyone around her. If the family had stronger parenting skills, then none of the demonic events that have transpired in the past films would have never occurred. These parent

In [0]:
data_text= x_train_text + x_test_text

In [0]:

len( y_test )

25000

## Tokenizing

In [0]:
tokenizer= Tokenizer( num_words=10000 )

In [0]:
%%time
tokenizer.fit_on_texts(data_text)

CPU times: user 11.1 s, sys: 54.8 ms, total: 11.2 s
Wall time: 11.2 s


In [65]:
tokenizer.word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 'p

In [0]:
tokens_train=  tokenizer.texts_to_sequences( x_train_text )

In [0]:
x_train_text[0]

'For those that were interested in knowing how exactly humanity came to be encased in big red pods that make me crave pomegranate, there is the duo of the "Second Renaissance" shorts. I\'m not exactly sure why they are split into two parts, especially since they\'re credited as one on the DVD (and are these shorts viewed on any other format but the DVD?), but they\'re informative even if they have a few gaps.<br /><br />What really makes this first part stand out, from the second part and the rest of the animations as well, is the parallels it shows between robot uprising and civil rights. Graphic homages to slavery, fascism, concentration camps, and mass graves are mixed with verbal references to the Million Man March and humanity\'s God-complex. In fact, "God" is never really referenced by these shorts, instead replaced by "Man\'s own image".<br /><br />As far as the shorts go in the collection, "The Second Renaissance: Part I" is by far the most effective in bringing out emotion. It

In [0]:
np.array( tokens_train[0])

array([  15,  143,   12,   70,  918,    8, 1332,   85,  621, 1909,  384,
          5,   26,    8,  191,  830,   12,   94,   68,   46,    6,    1,
       4086,    4,    1,  336, 7951, 2761,  145,   21,  621,  247,  134,
         33,   23, 3407,   82,  105,  516,  261,  234,  501, 5147,   14,
         27,   20,    1,  266,    2,   23,  132, 2761, 2232,   20,   99,
         79, 2844,   18,    1,  266,   18,  501, 6274,   57,   43,   33,
         25,    3,  171, 6768,    7,    7,   48,   62,  162,   11,   86,
        173,  778,   41,   36,    1,  336,  173,    2,    1,  370,    4,
          1, 9935,   14,   69,    6,    1, 6514,    9,  276,  201, 2363,
          2, 2717, 2650, 2150,    5, 8089, 8602, 7769,    2, 3202, 7241,
         23, 1919,   16, 7017, 1825,    5,    1, 1432,  128, 4446,    2,
        545, 1300,    8,  192,  545,    6,  110,   62,   31,  132, 2761,
        298, 3014,   31, 1592,  199, 1428,    7,    7,   14,  225,   14,
          1, 2761,  139,    8,    1, 1520,    1,  3

In [0]:
tokens_test= tokenizer.texts_to_sequences( x_test_text )
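The integer sequences above come from a frequency-ranked vocabulary. As a rough illustration of the idea (a simplified pure-Python sketch, not the Keras implementation; the helper names `split_words`, `build_word_index` and `to_sequence` are made up for this example), words are ranked by how often they occur, and any word whose rank is not below `num_words` is dropped from the sequence. This is why rare words such as "encased" or "pomegranate" disappear when the tokens are converted back to text.

```python
from collections import Counter

# Characters treated as separators; the apostrophe is kept so "it's" stays one word.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACE = str.maketrans({ch: " " for ch in FILTERS})

def split_words(text):
    """Lowercase the text, turn filtered characters into spaces, split on whitespace."""
    return text.lower().translate(_TO_SPACE).split()

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in split_words(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words; every other word is dropped."""
    return [word_index[w] for w in split_words(text)
            if w in word_index and word_index[w] < num_words]

toy_texts = ["The movie was great, the cast was great!", "The plot was thin."]
toy_index = build_word_index(toy_texts)

# With num_words=4 only the three most frequent words survive; "plot" is dropped.
toy_seq = to_sequence("The plot was great", toy_index, num_words=4)
print(toy_index)
print(toy_seq)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.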

## Padding & Truncating


In [0]:


num_tokens = [len(t) for t in tokens_train+ tokens_test ]
num_tokens = np.array(num_tokens)



In [0]:
np.mean(num_tokens)

221.27716

In [0]:
np.max(num_tokens)

2209

In [0]:
# Cut-off length: mean plus two standard deviations of the token counts


max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

544

In [0]:
np.sum(num_tokens < max_tokens) / len(num_tokens)

0.94532

That is, about 94.5% of the reviews are shorter than 544 tokens, so padding or truncating to that length leaves most of them intact.
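The same cut-off rule can be checked by hand on a toy list of lengths. This is a small sketch using only the standard library; `length_cutoff` and the toy numbers are invented for illustration, and `statistics.pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
import statistics

def length_cutoff(lengths, num_std=2):
    """Return int(mean + num_std * population standard deviation)."""
    mean = statistics.mean(lengths)
    std = statistics.pstdev(lengths)
    return int(mean + num_std * std)

# Nine short sequences and one very long outlier.
toy_lengths = [100] * 9 + [1000]

toy_cutoff = length_cutoff(toy_lengths)  # mean 190, std 270
toy_coverage = sum(n < toy_cutoff for n in toy_lengths) / len(toy_lengths)

print(toy_cutoff, toy_coverage)
```

The outlier pulls the cut-off well above the typical length but is still truncated itself, which mirrors the 2209-token maximum against the 544-token cut-off in the notebook.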

In [0]:
x_train_pad= pad_sequences( tokens_train , maxlen= max_tokens , padding='pre', truncating='pre' )

In [0]:
x_test_pad= pad_sequences( tokens_test, maxlen= max_tokens , padding= 'pre', truncating='pre' )
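With `padding='pre'` and `truncating='pre'`, zeros are added at the front of short sequences and tokens are removed from the front of long ones, so the end of each review is always kept. A minimal pure-Python sketch of that behaviour for a single sequence (the helper `pad_pre` is hypothetical and assumes `maxlen` is positive):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or drop leading tokens, so the result has length maxlen."""
    kept = list(seq)[-maxlen:]                    # keep the last maxlen tokens
    return [value] * (maxlen - len(kept)) + kept  # fill the front with the pad value

short_padded = pad_pre([15, 143, 12], maxlen=5)
long_truncated = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)

print(short_padded)
print(long_truncated)
```

Padding at the front places the real tokens right before the final time step of the recurrent layers, which is presumably why 'pre' was chosen here.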

In [0]:
x_train_pad[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [0]:
inverse_tokens_train= tokenizer.sequences_to_texts( tokens_train )

In [0]:
inverse_tokens_train[0]

"for those that were interested in knowing how exactly humanity came to be in big red that make me there is the duo of the second renaissance shorts i'm not exactly sure why they are split into two parts especially since they're credited as one on the dvd and are these shorts viewed on any other format but the dvd but they're informative even if they have a few gaps br br what really makes this first part stand out from the second part and the rest of the animations as well is the parallels it shows between robot and civil rights graphic to slavery concentration camps and mass graves are mixed with verbal references to the million man march and god complex in fact god is never really by these shorts instead replaced by man's own image br br as far as the shorts go in the collection the second renaissance part i is by far the most effective in bringing out emotion it's a and disturbing view of the potential of humanity to become the architect of its own destruction some may be turned of

In [0]:
x_train_text[0]

'For those that were interested in knowing how exactly humanity came to be encased in big red pods that make me crave pomegranate, there is the duo of the "Second Renaissance" shorts. I\'m not exactly sure why they are split into two parts, especially since they\'re credited as one on the DVD (and are these shorts viewed on any other format but the DVD?), but they\'re informative even if they have a few gaps.<br /><br />What really makes this first part stand out, from the second part and the rest of the animations as well, is the parallels it shows between robot uprising and civil rights. Graphic homages to slavery, fascism, concentration camps, and mass graves are mixed with verbal references to the Million Man March and humanity\'s God-complex. In fact, "God" is never really referenced by these shorts, instead replaced by "Man\'s own image".<br /><br />As far as the shorts go in the collection, "The Second Renaissance: Part I" is by far the most effective in bringing out emotion. It

## RNN

In [0]:
model = Sequential()

In [0]:
# Embedding layer

In [0]:
embd_size = 8
# A small embedding size, chosen for sentiment analysis

In [0]:
model.add(Embedding(input_dim=10000, output_dim=embd_size, input_length=max_tokens, name='layer_embd'))
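Conceptually, an embedding layer is a trainable lookup table with one row per token id. The sketch below imitates only the lookup with plain Python lists; the table values are random placeholders and the toy vocabulary of 10 entries is an assumption made for readability (the notebook uses 10000 rows of size 8, which gives 10000 × 8 = 80000 weights).

```python
import random

vocab_size, embd_size = 10, 8  # toy vocabulary; the notebook uses 10000 and 8
rng = random.Random(0)

# One row of embd_size numbers per token id. In a real layer these are learned.
table = [[rng.uniform(-0.05, 0.05) for _ in range(embd_size)]
         for _ in range(vocab_size)]

def embed(token_ids, table):
    """Replace every token id with its row of the table."""
    return [table[t] for t in token_ids]

# A padded toy sequence: two padding zeros followed by two real tokens.
vectors = embed([0, 0, 3, 7], table)
print(len(vectors), len(vectors[0]))
```

Identical token ids always map to the same vector, so the two padding positions above share one row.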

In [0]:
model.add(GRU( units=16, return_sequences= True ))

In [0]:
model.add( GRU( units=8 , return_sequences= True ) )

In [0]:
model.add( GRU( units=4 ) )

In [0]:
model.add( Dense(1, activation='sigmoid' ) )

In [0]:
optimizer = Adam(lr=1e-3)  # newer TensorFlow/Keras releases name this argument learning_rate

In [0]:
model.compile( optimizer= optimizer , loss= 'binary_crossentropy' , metrics=['accuracy'] )
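The final `Dense` layer squashes its input into a probability with a sigmoid, and `binary_crossentropy` penalises that probability according to the true label. A small worked sketch of both formulas in plain Python (the function names and the `eps` clipping value are choices made for this example, not taken from the Keras source):

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one sample: -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

# A confident correct prediction costs less than a hesitant one.
loss_confident = binary_crossentropy(1.0, 0.9)
loss_hesitant = binary_crossentropy(1.0, 0.6)
loss_coin_flip = binary_crossentropy(1.0, sigmoid(0.0))  # p = 0.5, loss = ln 2

print(loss_confident, loss_hesitant, loss_coin_flip)
```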

In [0]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embd (Embedding)       (None, 544, 8)            80000     
_________________________________________________________________
gru (GRU)                    (None, 544, 16)           1200      
_________________________________________________________________
gru_1 (GRU)                  (None, 544, 8)            600       
_________________________________________________________________
gru_2 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense (Dense)                (None, 1)                 5         
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
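The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU formulation with a single bias vector per gate, which is the one that matches the numbers printed above; GRU variants that keep a separate recurrent bias report larger counts.

```python
def embedding_params(vocab_size, dim):
    """One vector of size dim per vocabulary entry."""
    return vocab_size * dim

def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights and one bias vector."""
    return 3 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """A weight per input plus one bias, for every output unit."""
    return units * (input_dim + 1)

counts = {
    "layer_embd": embedding_params(10000, 8),  # 80000
    "gru": gru_params(8, 16),                  # 3 * (16 * 24 + 16) = 1200
    "gru_1": gru_params(16, 8),                # 3 * (8 * 24 + 8)   = 600
    "gru_2": gru_params(8, 4),                 # 3 * (4 * 12 + 4)   = 156
    "dense": dense_params(4, 1),               # 4 + 1              = 5
}
total_params = sum(counts.values())
print(counts, total_params)
```

Almost all of the weights sit in the embedding table; the three recurrent layers together contribute fewer than 2000.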


In [0]:
# model2 is a second name for the same object, not a copy: training model2 also trains model.
model2 = model

## Training the Model

In [0]:
model2.fit(x= x_train_pad , y= y_train, validation_split= 0.05 , epochs= 3 , batch_size=64 )

Train on 23750 samples, validate on 1250 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f668c4a7978>
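One thing worth checking here: Keras documents that `validation_split` takes the validation samples from the end of the arrays, before any shuffling. Because `get_data` returns all positive reviews followed by all negative ones, the last 5% would then contain only negative reviews. The sketch below reproduces the split on labels alone, assuming the usual 12,500 positive and 12,500 negative training reviews, and shows that shuffling first gives a mixed validation set. The variable names are illustrative; in the notebook the same permutation would have to be applied to both `x_train_pad` and `y_train`.

```python
import random

n_pos, n_neg = 12500, 12500  # assumed class sizes of the IMDB training set
labels = [1.0] * n_pos + [0.0] * n_neg

val_fraction = 0.05
n_val = round(len(labels) * val_fraction)
split_at = len(labels) - n_val

# Unshuffled: the validation part is the tail of the list.
val_labels = labels[split_at:]

# Shuffled with one fixed permutation before splitting.
order = list(range(len(labels)))
random.Random(0).shuffle(order)
shuffled = [labels[i] for i in order]
val_shuffled = shuffled[split_at:]

print(split_at, n_val)
print(set(val_labels), set(val_shuffled))
```

The split sizes match the "Train on 23750 samples, validate on 1250 samples" line in the log.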

## Predictions

In [0]:
%%time
result = model.evaluate(x_test_pad, y_test)

CPU times: user 28min 35s, sys: 4min 47s, total: 33min 22s
Wall time: 20min 34s


In [0]:
print("Loss:", result[0] ," Accuracy: ", result[1]*100 )

Loss: 0.34190794646501543  Accuracy:  86.528


## Test on Our Own Reviews

In [0]:
text1 = "This movie is Great! "
text2 = "Lovely movie!"
text3 = "Rubbish.."
text4 = "It was okayish!"
text5 = "Amazing Work"
text6 = "I really enjoyed my money being wasted"
text7 = "Master Piece"
text = [text1, text2, text3, text4, text5, text6, text7]

In [0]:

tokens_self = tokenizer.texts_to_sequences(text)

In [0]:
tokens_self_pad = pad_sequences( tokens_self , maxlen=max_tokens,
                           padding='pre', truncating='pre')

In [152]:
model2.predict(tokens_self_pad)

array([[0.8166793 ],
       [0.6521951 ],
       [0.4538961 ],
       [0.66541696],
       [0.87422496],
       [0.5617218 ],
       [0.7258598 ]], dtype=float32)

In [0]:
# A review counts as positive only above 0.6, a stricter cut-off than the usual 0.5.
ans = ['Positive' if m > 0.6 else 'Negative' for m in model2.predict(tokens_self_pad)]

In [154]:
ans

['Positive',
 'Positive',
 'Negative',
 'Positive',
 'Positive',
 'Negative',
 'Positive']
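The choice of threshold matters for borderline scores. Using the probabilities printed above, this short sketch compares the 0.6 cut-off with the more common 0.5 (the helper `to_labels` is made up for the comparison):

```python
# Probabilities copied from the model2.predict output above.
probs = [0.8166793, 0.6521951, 0.4538961, 0.66541696,
         0.87422496, 0.5617218, 0.7258598]

def to_labels(probs, threshold):
    """Label a score as Positive only if it is strictly above the threshold."""
    return ["Positive" if p > threshold else "Negative" for p in probs]

labels_06 = to_labels(probs, 0.6)
labels_05 = to_labels(probs, 0.5)

# Positions where the two thresholds disagree.
changed = [i for i, (a, b) in enumerate(zip(labels_06, labels_05)) if a != b]
print(labels_06)
print(labels_05)
print(changed)
```

Only the sarcastic sixth review ("I really enjoyed my money being wasted", score 0.56) changes label, so the 0.6 threshold is what makes the model classify it as negative.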