# Sentiment analysis on IMDB movie reviews

### There are two main approaches to sentiment analysis: using a lexicon of pre-recorded sentiment, or using state-of-the-art but more computationally expensive deep learning to learn generalized vector representations of words.
### A feedforward net accepts fixed-size inputs, such as binary numbers, whereas recurrent neural nets help us learn from sequence data such as text.
### You can use AWS with a pre-built AMI (Bitfusion) to host the Jupyter notebook in the cloud.
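
For contrast with the deep learning route taken in this notebook, the lexicon approach can be sketched in a few lines. This is only a toy illustration: the `LEXICON` dictionary, its scores, and the helper names are made up for the example and do not come from any real sentiment lexicon.

```python
# Toy lexicon: word -> sentiment score (hypothetical values for illustration).
LEXICON = {"great": 1.0, "good": 0.5, "boring": -0.5, "awful": -1.0}

def lexicon_score(review, lexicon=LEXICON):
    """Sum the scores of known words; unknown words contribute 0."""
    words = review.lower().split()
    return sum(lexicon.get(w.strip(".,!?"), 0.0) for w in words)

def classify(review):
    """Label a review positive if its total score is above zero, else negative."""
    return "positive" if lexicon_score(review) > 0 else "negative"

print(classify("A great film with a good cast."))
print(classify("Boring plot and awful acting!"))
```

A scorer like this is cheap to run but ignores word order and context, which is the gap the recurrent model is meant to close.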

In [3]:
import tflearn
from tflearn.data_utils import to_categorical, pad_sequences
from tflearn.datasets import imdb

In [6]:
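# IMDB dataset loading
# Keep a 10,000-word vocabulary and hold out 10% of the training data.
# Note: load_data returns (train, valid, test); the split named `test` below
# is that held-out validation portion.
train, test, _ = imdb.load_data(path='imdb.pkl', n_words=10000,
                                valid_portion=0.1)
trainX, trainY = train
testX, testY = test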

In [7]:
# Data preprocessing
# Sequence padding
# We can't feed text strings into a neural network directly;
# we have to vectorize our inputs first.
# Neural networks are algorithms that essentially
# just apply a series of computations to matrices,
# so converting the inputs to numerical representations
# (vectors) is necessary.

# Pad or truncate each review to 100 entries (pad_sequences),
# using 0 as the padding value
trainX = pad_sequences(trainX, maxlen=100, value=0.)
testX = pad_sequences(testX, maxlen=100, value=0.)

# Convert labels to binary (one-hot) vectors
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)
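
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of what padding and one-hot encoding do to the data. The helpers `pad_sequence` and `one_hot` are hypothetical stand-ins, not the TFLearn functions, and the sketch assumes padding and truncation happen at the end of the sequence; check the TFLearn documentation for the library's actual defaults.

```python
def pad_sequence(seq, maxlen, value=0):
    """Truncate to maxlen, then pad at the end with `value` (assumed behaviour)."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))

def one_hot(label, nb_classes):
    """Turn an integer class label into a binary vector."""
    vec = [0.0] * nb_classes
    vec[label] = 1.0
    return vec

# Two made-up reviews, already encoded as word indices.
reviews = [[12, 7, 356], [4, 981, 23, 9, 55, 2]]
padded = [pad_sequence(r, maxlen=5) for r in reviews]
labels = [one_hot(y, nb_classes=2) for y in [0, 1]]

print(padded)   # every row now has exactly 5 entries
print(labels)   # one binary vector per label
```

After this step every review has the same length, so a batch of reviews forms a rectangular matrix the network can consume.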

### LSTM: This layer allows our network to remember data from the beginning of the sequence, which will improve our prediction.
### We will set dropout to 0.8 (TFLearn's lstm layer treats this value as the probability of keeping a unit). Dropout is a technique that helps prevent overfitting by randomly turning different pathways in our network on and off during training.
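
The idea behind dropout can be sketched without any framework. The function below is a simplified "inverted dropout" on a plain list, written under the assumption that the dropout value is a keep probability; `dropout` and `keep_prob` are names chosen for this sketch, and TFLearn's internal implementation may differ.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and zero out the rest.

    Surviving values are scaled by 1 / keep_prob so the expected sum
    of the layer's output stays the same.
    """
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)      # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, keep_prob=0.8, rng=rng)
print(dropped)
```

Because a different random subset is silenced on every training step, the network cannot rely on any single pathway, which is what discourages overfitting.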

### Our next layer is fully connected, which means that every neuron in the previous layer is connected to every neuron in this layer.
### We have a set of learned feature vectors from the previous layers, and adding a fully connected layer is a computationally cheap way of learning non-linear combinations of them.
### It has 2 units and uses the softmax function as its activation. Softmax takes in a vector of values and squashes it into a vector of output probabilities, each between 0 and 1, that sum to 1.
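
As a small worked example of the softmax step, the sketch below converts two raw scores into probabilities. The input values are arbitrary and the function is a plain-Python illustration, not TFLearn's implementation.

```python
import math

def softmax(logits):
    """Exponentiate each score and normalise so the outputs sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5])   # arbitrary scores for the two classes
print(probs)
print(sum(probs))
```

The larger score receives the larger probability, and the two outputs can be read as the model's confidence in each class.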

### We'll use those values in our last layer, which is our regression layer. It applies a regression operation to the input.
### We specify an optimizer that will minimize a given loss function, as well as the learning rate, which specifies how fast we want our network to train. The optimizer we'll use is Adam, a variant of gradient descent, and our loss is categorical cross-entropy, which measures the difference between our predicted output and the expected output.
### After building our neural network we can initialize it using tflearn's deep neural net (DNN) function. Then we can call the model's fit function, which launches the training process for the given training and validation data. We'll also set show_metric to True so we can view the log of accuracy during training.
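
To illustrate what the loss measures, here is a minimal sketch of categorical cross-entropy for a single example. The function name mirrors the loss string used in the network, but this is a hand-written illustration with made-up predictions; the `eps` clamp is an assumption added to avoid taking the log of zero.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-10):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# The true class is the second one in both cases.
confident_right = categorical_crossentropy([0.0, 1.0], [0.1, 0.9])
confident_wrong = categorical_crossentropy([0.0, 1.0], [0.9, 0.1])
print(confident_right, confident_wrong)
```

The loss is small when the model puts high probability on the correct class and grows quickly when it is confidently wrong, which is the signal the optimizer works to reduce.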

In [10]:
# Network building
# Input layer
net = tflearn.input_data([None, 100])  # shape: [batch_size, sequence_length]
# Embedding layer: maps each of the 10,000 word indices to a 128-dimensional vector
net = tflearn.embedding(net, input_dim=10000, output_dim=128)
# LSTM (long short-term memory) layer with 128 units
net = tflearn.lstm(net, 128, dropout=0.8)
# Fully connected output layer: 2 classes with softmax activation
net = tflearn.fully_connected(net, 2, activation='softmax')
# Regression layer: attaches the optimizer, learning rate, and loss
net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy')
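
The embedding layer is essentially a lookup table from word index to vector. The sketch below shows that lookup with a tiny, randomly initialised table; the sizes and the `embed` helper are hypothetical (the network uses 10000 and 128), and in the real layer the table values are learned during training rather than left random.

```python
import random

rng = random.Random(0)
vocab_size, embed_dim = 10, 4   # tiny sizes for illustration only

# One row of embed_dim numbers per word index.
table = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed(sequence, table):
    """Replace each word index with its row from the embedding table."""
    return [table[idx] for idx in sequence]

padded_review = [3, 7, 0, 0]    # made-up review, padded with index 0
vectors = embed(padded_review, table)
print(len(vectors), len(vectors[0]))
```

Each padded review therefore becomes a sequence of dense vectors, which is the input the LSTM layer reads one step at a time.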


In [11]:
# Training
# tflearn.DNN wraps the network in a trainable deep neural net model
model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
          batch_size=32)

Training Step: 7039  | total loss: 0.09564 | time: 141.084s
| Adam | epoch: 010 | loss: 0.09564 - acc: 0.9807 -- iter: 22496/22500
Training Step: 7040  | total loss: 0.08712 | time: 146.327s
| Adam | epoch: 010 | loss: 0.08712 - acc: 0.9826 | val_loss: 0.73833 - val_acc: 0.8064 -- iter: 22500/22500
--
