# IMDB Movie Reviews Sentiment Classification

* Word Embeddings with Keras

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDB dataset to learn word embeddings as we train our model. This dataset contains 25,000 training reviews and 25,000 test reviews from IMDB, each labeled with sentiment (positive or negative).

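For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. To speed up training, we will use at most 300 words from each review and a maximum vocabulary size of 10,000. Note that `pad_sequences` pads shorter reviews with zeros at the front and, by default, keeps the last 300 words of longer ones.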

In [1]:
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size: only the most frequent words are kept
maxlen = 300        # number of words used from each review
# load the dataset as lists of integer word ids
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
# pad or truncate every review to the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

Using TensorFlow backend.


In [2]:
x_train.shape  # (number of reviews, number of words in each review)

(25000, 300)
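
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (zeros added at the front, the earliest words dropped from long reviews). The helper `pad_or_truncate` and the toy reviews are hypothetical names introduced for illustration; this is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a list of exactly `maxlen` ids (illustrative helper, not a Keras function)."""
    if len(seq) >= maxlen:
        # "pre" truncation: drop the earliest ids and keep the last `maxlen`
        return list(seq[len(seq) - maxlen:])
    # "pre" padding: fill the front with the padding id
    return [pad_value] * (maxlen - len(seq)) + list(seq)


short_review = [1, 14, 22]                    # toy review with 3 word ids
long_review = [1, 14, 22, 16, 43, 530, 973]   # toy review with 7 word ids

print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 1, 14, 22]
print(pad_or_truncate(long_review, maxlen=5))   # [22, 16, 43, 530, 973]
```

With `maxlen=300`, every review ends up with exactly 300 ids, which is why `x_train.shape` is `(25000, 300)`.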

To take a look at a review and its sentiment:

In [3]:
(training_data, training_labels), (test_data, test_labels) = imdb.load_data(num_words=vocab_size, index_from=3)

In [4]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
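
The shift by 3 is needed because `load_data` reserves the lowest ids for special tokens (0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words) and, with `index_from=3`, stores every word as its frequency rank plus 3. A toy sketch of the same bookkeeping, using a hypothetical three-word index in place of `imdb.get_word_index()`:

```python
# Hypothetical toy ranks standing in for imdb.get_word_index() (rank 1 = most frequent)
toy_word_index = {"the": 1, "and": 2, "a": 3}
INDEX_FROM = 3  # same offset as the index_from argument used above

# Shift every rank by the offset, then add the reserved special tokens
toy_word_to_id = {word: rank + INDEX_FROM for word, rank in toy_word_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})

# Invert the mapping to decode a sequence of ids back into words
toy_id_to_word = {word_id: word for word, word_id in toy_word_to_id.items()}

encoded = [1, 4, 2, 6, 0]
decoded = " ".join(toy_id_to_word[i] for i in encoded)
print(decoded)  # <START> the <UNK> a <PAD>
```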

In [5]:
id_to_word = {value: key for key, value in word_to_id.items()}
# decode review number 6 back into words (avoid shadowing the built-in `id`)
print(' '.join(id_to_word[word_id] for word_id in training_data[6]))
print('The sentiment is:', training_labels[6])

<START> lavish production values and solid performances in this straightforward adaption of jane <UNK> satirical classic about the marriage game within and between the classes in <UNK> 18th century england northam and paltrow are a <UNK> mixture as friends who must pass through <UNK> and lies to discover that they love each other good humor is a <UNK> virtue which goes a long way towards explaining the <UNK> of the aged source material which has been toned down a bit in its harsh <UNK> i liked the look of the film and how shots were set up and i thought it didn't rely too much on <UNK> of head shots like most other films of the 80s and 90s do very good results
The sentiment is: 1


Review number 6 from the training set is a positive one. From the word *jane* we can guess that it is an adaptation of one of Jane Austen's novels; the cast names *northam* and *paltrow* suggest *Emma*.

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model, where it is trained together with the rest of the network.
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* We can use the embedding layer to learn our own word embeddings, similar in spirit to word2vec.
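
The "dictionary" view of the embedding layer can be sketched in plain Python: the layer holds one row of weights per word id, and embedding a review is just a row lookup. The names `embedding_table` and `embed`, the toy sizes, and the random initial values are assumptions made for illustration; in the real layer the rows are learned during training.

```python
import random

random.seed(0)
toy_vocab_size, toy_embedding_dim = 10, 4  # toy sizes; the model below uses 10000 and 8

# One row of `toy_embedding_dim` weights per word id (random here, learned in practice)
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]


def embed(word_ids):
    """Look up the vector of each word id (illustrative helper)."""
    return [embedding_table[i] for i in word_ids]


review = [0, 0, 1, 7, 3]   # a padded toy review of 5 word ids
vectors = embed(review)

print(len(vectors), len(vectors[0]))   # 5 4
print(vectors[0] == vectors[1])        # True: the same id always maps to the same vector
```

A review of length 5 becomes a 5 x 4 matrix here; in the same way, a padded review of 300 ids becomes a 300 x 8 matrix in the model built in this section.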

The Keras embedding layer doesn't require us to one-hot encode our words. Instead, we have to give each word a unique integer id. For the IMDB dataset we've loaded this has already been done; if that weren't the case, we could use scikit-learn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
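
For raw text, the integer ids could also be built by hand. The sketch below uses only the standard library and ranks words by frequency, mirroring the convention of the IMDB dataset; it is an alternative to `LabelEncoder`, not what the dataset authors did. The toy documents and the names `toy_docs`, `toy_vocab`, and `toy_encoded` are hypothetical.

```python
from collections import Counter

toy_docs = ["the film was good", "the plot was thin", "the cast was good"]

# Count how often each word occurs across all documents
counts = Counter(word for doc in toy_docs for word in doc.split())

# Rank by descending frequency, breaking ties alphabetically so the result is deterministic
ranked_words = sorted(counts, key=lambda word: (-counts[word], word))

# Start ids at 1 so that 0 stays free for padding
toy_vocab = {word: word_id for word_id, word in enumerate(ranked_words, start=1)}

toy_encoded = [[toy_vocab[word] for word in doc.split()] for doc in toy_docs]
print(toy_vocab["the"])   # 1
print(toy_encoded[0])     # [1, 5, 2, 3]
```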

In [6]:
import time
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding, Conv1D, MaxPooling1D
model = Sequential()
# vocab_size possible word ids, each mapped to an 8-dimensional embedding
model.add(Embedding(vocab_size, 8, input_length=maxlen))
# Keras 2 argument names (filters, kernel_size, padding) replace the deprecated Keras 1 ones
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
# pool over the whole sequence, leaving one value per filter
model.add(MaxPooling1D(pool_size=maxlen))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 8)            80000     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 300, 32)           800       
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 1, 32)             0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               8250      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
Total params: 89,301
Trainable params: 89,301
Non-trainable params: 0
_________________________________________________________________
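
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic from the layer sizes used above; the variable names are introduced here for illustration only.

```python
vocab_size, embedding_dim = 10000, 8
filters, kernel_size = 32, 3
dense_units = 250

# Embedding: one 8-dimensional vector per word id
embedding_params = vocab_size * embedding_dim                   # 80000

# Conv1D: each filter spans 3 time steps x 8 channels, plus one bias per filter
conv_params = kernel_size * embedding_dim * filters + filters   # 800

# Max pooling over all 300 positions leaves 1 x 32, which flattens to 32 features
flat_features = filters

# Dense layers: weights plus one bias per unit
dense1_params = flat_features * dense_units + dense_units       # 8250
dense2_params = dense_units * 1 + 1                             # 251

total_params = embedding_params + conv_params + dense1_params + dense2_params
print(total_params)  # 89301
```

Most of the parameters sit in the embedding table, which is typical for small text models with a large vocabulary.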




In [7]:
x_train.shape  # (number of examples, number of words)

(25000, 300)

In [8]:
x_train[1]  # words are represented by numbers

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    1,  194, 1153,  194, 8255,   78,  228,    5,    6, 1463,
       4369, 5012,  134,   26,    4,  715,    8,  118, 1634,   14,  394,
         20,   13,  119,  954,  189,  102,    5,  207,  110, 3103,   21,
         14,   69,  188,    8,   30,   23,    7,   

In [9]:
y_train.shape 

(25000,)

In [10]:
y_test.shape

(25000,)

In [11]:
start = time.perf_counter()  # time.clock() was removed in Python 3.8
history = model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))
end = time.perf_counter()
print('Time spent:', end - start)

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Time spent: 60.513205000000006


In [12]:
model.layers  # get all layers from the model

[<keras.layers.embeddings.Embedding at 0x117ae0908>,
 <keras.layers.convolutional.Conv1D at 0x117ae09b0>,
 <keras.layers.pooling.MaxPooling1D at 0x117ae0dd8>,
 <keras.layers.core.Flatten at 0x117ae0b38>,
 <keras.layers.core.Dense at 0x117b0feb8>,
 <keras.layers.core.Dense at 0x1178dc5f8>]

In [13]:
score = model.evaluate(x_test, y_test)
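
Because the model was compiled with `loss='binary_crossentropy'` and `metrics=['acc']`, `model.evaluate` returns a list whose first entry is the loss and whose second entry is the accuracy. As a rough illustration of what these two numbers measure, here is a pure-Python sketch that computes both from predicted probabilities. The helper names and the toy labels and predictions are hypothetical, and this is not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels (illustrative helper)."""
    losses = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int((p > threshold) == bool(y)) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]   # toy sigmoid outputs

toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = binary_accuracy(toy_labels, toy_probs)
print('loss: %.4f' % toy_loss)
print('acc:', toy_acc)   # acc: 0.5
```

A lower loss is better and it is not a percentage, so only the second entry of `score` should be reported as an accuracy.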



In [14]:
# score[0] is the test loss, score[1] is the test accuracy
print('Test loss: %.4f' % score[0])
print('Test acc: %.2f%%' % (score[1]*100))

Test loss: 0.3772
Test acc: 86.42%


### Hola! We got 86.42% accuracy on the test set! Not so bad for a first try, right?