# Learning TFLearn and Sentiment Analysis

## Really just me playing around with TFLearn and IMDB data

The first step is to import the TFLearn libraries, pretty easy right?

In [1]:
from __future__ import division, print_function, absolute_import

import tflearn
from tflearn.data_utils import to_categorical, pad_sequences
from tflearn.datasets import imdb

In the next step we grab TFLearn's built-in IMDB data set. n_words=10000 limits the vocabulary to 10000 words, and valid_portion=0.1 holds back 10% of the training reviews, giving a 90/10 split.

That held-back 10% is really a validation set, used for checking our hyperparameters. As far as I can tell load_data returns train, validation and test in that order, so the variable called test below is actually the validation split and the real test set is thrown away as _.

The IMDB data set is a pkl (pickle) file; the built-in function just references a URL, which is then downloaded for use.

In [2]:
# IMDB Dataset loading
train, test, _ = imdb.load_data(path='imdb.pkl', n_words=10000,
                                valid_portion=0.1)

print("train data length: ", len(train))
print("test data length: ", len(test))

train data length:  2
test data length:  2


At first I thought these variables would show the total count, then I realised each one is a pair, the sequences (X) and their labels (Y), which is why 2 is returned.
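To make that concrete, here is a tiny stand-in for one of those pairs. The values and the names toy_train, toy_x and toy_y are made up for illustration; only the shape of the thing (a pair of sequences and labels) is taken from the notebook.

```python
# Toy stand-in for one split returned by the loader: (sequences, labels).
# All values here are invented for illustration only.
toy_train = (
    [[17, 25, 10], [6, 694, 7, 19], [3, 5]],  # X: one list of word ids per review
    [0, 1, 0],                                # Y: one label per review
)

print(len(toy_train))  # 2 -> the (X, Y) pair, not the number of reviews

toy_x, toy_y = toy_train
print(len(toy_x))      # 3 -> the number of reviews
```

So len on the pair always gives 2, and you only get the review count once the pair has been unpacked.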

In [3]:
trainX, trainY = train
testX, testY = test

print("train data length: ", len(trainX))
print("sample train data: ", trainX[0])
print("test data length: ", len(testX))
print("sample test data: ", testX[0])

train data length:  22500
sample train data:  [17, 25, 10, 406, 26, 14, 56, 61, 62, 323, 4]
test data length:  2500
sample test data:  [6, 694, 7, 19, 360, 19, 139, 33, 893, 8, 2567, 102, 760, 3, 2237, 5, 6803, 96, 17, 25, 10, 4]


Now we can see that len returns the total number of reviews (22500 and 2500), and that each review is a vector of integer word ids.
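As a quick sanity check, the two counts line up with a 10% hold-out. This is only arithmetic on the printed numbers; the total is assumed to be their sum, and the rounding here is my own choice rather than anything taken from the TFLearn source.

```python
# Counts copied from the output above; the total is assumed to be their sum.
n_train_printed = 22500
n_held_out_printed = 2500
n_total = n_train_printed + n_held_out_printed

valid_portion = 0.1
n_held_out = int(round(n_total * valid_portion))  # reviews held back
n_train = n_total - n_held_out                    # reviews left for training

print(n_train, n_held_out)  # 22500 2500
```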

In [4]:
# Data preprocessing
# Sequence padding
trainX = pad_sequences(trainX, maxlen=100, value=0.)
testX = pad_sequences(testX, maxlen=100, value=0.)

print("updated trainX[0]: ", trainX[0])
print("updated testX[0]: ", testX[0])

updated trainX[0]:  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  17
  25  10 406  26  14  56  61  62 323   4]
updated testX[0]:  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    6  694    7   19  360   19  139   33  893    8 2567  102
  760    3 2237    5 6803   96   17   25   10    4]


Above we have padded our trainX/testX input vectors to a uniform length of 100, filling with 0 (at the front, as the output shows), so every item is now the same length.
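A rough pure-Python sketch of the same idea is below. pad_left is a hypothetical helper written just for this note, not the TFLearn function, and how it cuts over-long sequences is my own assumption.

```python
def pad_left(seq, maxlen, value=0):
    """Hypothetical helper: left-pad seq with value until it has maxlen items."""
    seq = list(seq)[:maxlen]  # assumption: over-long sequences are simply cut
    return [value] * (maxlen - len(seq)) + seq


sample = [17, 25, 10, 406, 26, 14, 56, 61, 62, 323, 4]  # trainX[0] from earlier
padded = pad_left(sample, maxlen=100)

print(len(padded))   # 100
print(padded[-11:])  # the original 11 word ids sit at the end
```

The 89 leading zeros plus the 11 original ids match the shape of the updated trainX[0] printed above.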

In [5]:
print("first entry trainY: ", trainY[:1])
print("first entry testY: ", testY[:1])

# Converting labels to binary vectors
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)

print("updated trainY[0]: ", trainY[0])
print("updated testY[0]: ", testY[0])

first entry trainY:  [0]
first entry testY:  [0]
updated trainY[0]:  [ 1.  0.]
updated testY[0]:  [ 1.  0.]


Above we set up our labels for both the train and test data. We end up with a binary class matrix for use with categorical_crossentropy: a label of 0 becomes [1. 0.] and a label of 1 would become [0. 1.]. Each row represents a positive or negative outcome of the associated review (no idea which is which at this point).
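The conversion itself is simple enough to sketch by hand. one_hot below is a hypothetical stand-in for illustration, not the TFLearn implementation.

```python
def one_hot(labels, nb_classes):
    """Hypothetical helper: turn integer labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * nb_classes
        row[label] = 1.0  # switch on the slot for this class
        rows.append(row)
    return rows


encoded = one_hot([0, 1], nb_classes=2)
print(encoded[0])  # [1.0, 0.0]
print(encoded[1])  # [0.0, 1.0]
```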

In [6]:
# Network building
net = tflearn.input_data([None, 100])
net = tflearn.embedding(net, input_dim=10000, output_dim=128)
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy')

Now we can build our network

net = tflearn.input_data([None, 100]) sets up the input layer, where we give the shape of the incoming data. The 100 matches the length we padded trainX to earlier. The first element is the batch size; setting it to None leaves it unspecified, so the network accepts batches of any size.

net = tflearn.embedding(net, input_dim=10000, output_dim=128) is our embedding layer. Its input is the output of the previous layer (net in this case). input_dim is the number of words we loaded from IMDB (we would change this based on how many words we have), and output_dim=128 is the number of dimensions in the resulting embeddings.
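One way to picture the embedding layer is as a big lookup table with one row of 128 numbers per word id. The sketch below uses a random table and made-up names (table, embed); in the real layer the rows are learned during training.

```python
import random

random.seed(0)
VOCAB_SIZE = 10000  # matches input_dim
EMBED_DIM = 128     # matches output_dim

# One row of EMBED_DIM numbers per word id. Random here, learned in a real layer.
table = [[random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]


def embed(word_ids):
    """Hypothetical helper: swap each word id for its row in the table."""
    return [table[i] for i in word_ids]


vectors = embed([17, 25, 10, 4])
print(len(vectors), len(vectors[0]))  # 4 128
```

So a padded review of 100 word ids would come out of this step as 100 vectors of 128 numbers each.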

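net = tflearn.lstm(net, 128, dropout=0.8) sets up a long short-term memory layer with 128 units, fed with the output values of our embedding layer. We also set up dropout, which randomly turns pathways off during training to help ensure we don't overfit our data. As far as I can tell from the TFLearn docs, the 0.8 here is the probability of keeping a value rather than of dropping it.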
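Here is a minimal sketch of dropout in general, treating 0.8 as a keep probability (my reading of the docs, so take it as an assumption). This is the common "inverted dropout" trick and is not meant to mirror what TFLearn does inside the LSTM.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob, otherwise zero it."""
    out = []
    for v in values:
        if rng.random() < keep_prob:
            out.append(v / keep_prob)  # rescale so the expected value is unchanged
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, keep_prob=0.8, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
print(round(kept_fraction, 2))  # roughly 0.8
```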

net = tflearn.fully_connected(net, 2, activation='softmax') sets up a fully connected layer, which means every neuron in the previous layer is connected to every neuron in this layer, with 2 outputs (one per class) and softmax as the activation function. Adding a fully connected layer is a computationally cheap way of learning non-linear combinations of the features from the previous layer.

softmax squashes the output values into probabilities between 0 and 1 that sum to 1
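A small hand-rolled softmax shows the squashing. The input scores are made up, and subtracting the max first is a standard stability trick rather than anything specific to TFLearn.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, -1.0])  # made-up scores for the two classes
print(probs)
print(sum(probs))
```

The larger score ends up with the larger probability, and the two values add up to 1 (give or take floating point error).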

net = tflearn.regression(net, optimizer='adam', learning_rate=0.001, loss='categorical_crossentropy') creates our regression layer, which applies a regression operation to the input. We specify an optimiser (in this case adam) which will minimise the given loss function, and we set our learning rate. categorical_crossentropy is our loss function, which measures the difference between our predicted output and the expected output.
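To get a feel for the loss, here is a bare-bones categorical crossentropy for a single example. The predictions are invented and the eps guard is my own addition to avoid log(0); the library version will differ in the details.

```python
import math


def categorical_crossentropy(predicted, target, eps=1e-10):
    """Loss for one example: -sum(target * log(predicted))."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))


target = [1.0, 0.0]  # one-hot label, same form as trainY[0]
good = categorical_crossentropy([0.9, 0.1], target)  # confident and right
bad = categorical_crossentropy([0.1, 0.9], target)   # confident and wrong
print(good, bad)
```

The confident wrong prediction is punished much harder than the confident right one, which is what the optimiser pushes against.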


In [7]:
# Training
model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
          batch_size=32)

Training Step: 7040  | total loss: 0.05335
| Adam | epoch: 010 | loss: 0.05335 - acc: 0.9913 | val_loss: 0.54796 - val_acc: 0.8456 -- iter: 22500/22500


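With ten epochs the result was as above: a very low training loss (acc 99.13%!). Validation accuracy was lower at 84.56% with a val_loss of 0.54796, so the model looks to be overfitting the training data.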
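The numbers in the log can be checked with a bit of arithmetic. This assumes the last, smaller batch of each epoch still counts as a step, which is my guess at how the step counter works, though it does line up with the 7040 shown.

```python
import math

n_samples = 22500  # training reviews
batch_size = 32
epochs = 10        # from "epoch: 010" in the log

steps_per_epoch = math.ceil(n_samples / batch_size)  # last batch is smaller
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 704 7040

train_acc, val_acc = 0.9913, 0.8456  # from the final log line
gap = train_acc - val_acc
print(round(gap, 4))  # 0.1457
```

A gap of roughly 14.6 points between training and validation accuracy is what I am reading as overfitting.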