### Applying a deep neural network to sentiment training data
We have two files, <i>pos</i> and <i>neg</i>, that contain positive and negative sentences: the pos file has ~5,000 positive-sentiment statements and the neg file has ~5,000 negative-sentiment statements. We'd like to train an NN classifier for sentiment analysis similar in structure to our <a href="https://github.com/omidrohanian/learning_tensorflow/blob/master/deep_net.ipynb">deep_net</a> example from before. Our inputs here are strings of characters, so we need to convert them to fixed-length numeric vectors before feeding them into a network. One way to do this is to build an indexed lexicon (a list of words) and construct feature vectors using the bag-of-words model: each position in the vector corresponds to one lexicon word and holds the number of times that word occurs in the sentence. The lexicon needs to be created from the whole corpus.

In [104]:
# Example:
# lexicon = [chair, table, spoon, television]
# text = I pulled the chair up to the table
# constructed vector = [1 1 0 0]
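
As a minimal sketch, the example above can be reproduced in a few lines of Python. The function name `bag_of_words` is hypothetical, and plain whitespace splitting stands in for the NLTK tokenizer and lemmatizer used later, so this illustrates the idea rather than the exact preprocessing.

```python
def bag_of_words(text, lexicon):
    """Count how often each lexicon word occurs in text (hypothetical helper)."""
    # Map each lexicon word to its position in the feature vector
    index = {word: i for i, word in enumerate(lexicon)}
    vector = [0] * len(lexicon)
    # Whitespace splitting is an assumption; the real pipeline uses word_tokenize
    for word in text.lower().split():
        if word in index:
            vector[index[word]] += 1
    return vector


lexicon = ['chair', 'table', 'spoon', 'television']
vector = bag_of_words('I pulled the chair up to the table', lexicon)
print(vector)  # [1, 1, 0, 0]
```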

In [105]:
# Reference: https://www.youtube.com/watch?v=YFxVHD2TNII

import nltk, random, pickle
# tokenizer separates all the words in the text
from nltk.tokenize import word_tokenize
# lemmatizer removes inflectional endings and returns the base or dictionary form of a word
from nltk.stem import WordNetLemmatizer


import numpy as np 
from collections import Counter 

lemmatizer = WordNetLemmatizer()
# maximum number of lines to read from each file
hm_lines = 10000000

def create_lexicon(pos, neg):
    lexicon = []
    for fi in [pos, neg]:
        with open(fi, 'r') as f:
            contents = f.readlines()
            for l in contents[:hm_lines]:
                all_words = word_tokenize(l.lower())
                lexicon += list(all_words)       
    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    # now we count all the words
    w_counts = Counter(lexicon)
    l2 = []
    # Keep only mid-frequency words: this weeds out super-common words like
    # 'the', 'and' (1000 or more occurrences) as well as rare ones (50 or fewer)
    for w in w_counts:
        if 1000 > w_counts[w] > 50:
            l2.append(w)
    print('size of the lexicon=', len(l2))
    return l2 

Now that we can construct the lexicon, we can use it to turn every sentence into a feature vector paired with its class label.
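
The frequency filter in `create_lexicon` can be illustrated on a toy corpus. This is a sketch only: the helper name `filter_by_frequency` and the tiny thresholds are assumptions chosen so the effect is visible on a handful of words, whereas the code above keeps words with counts strictly between 50 and 1000.

```python
from collections import Counter


def filter_by_frequency(words, low, high):
    """Keep words whose count lies strictly between low and high (hypothetical helper)."""
    counts = Counter(words)
    return [w for w, c in counts.items() if low < c < high]


# Toy corpus: 'the' is very common, 'superb' is rare, the others are in between
words = ['the'] * 10 + ['good'] * 3 + ['bad'] * 4 + ['superb']
kept = filter_by_frequency(words, low=1, high=10)
print(kept)  # ['good', 'bad']
```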

In [114]:
def sample_handling(sample, lexicon, classification):
    featureset = []
    with open(sample, 'r') as f:
        contents = f.readlines()
        for l in contents[:hm_lines]:
            current_words = word_tokenize(l.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]
            features = np.zeros(len(lexicon))
            for word in current_words:
                if word.lower() in lexicon:
                    index_value = lexicon.index(word.lower())
                    features[index_value] += 1
            features = list(features)
            featureset.append([features, classification])
    return featureset

def create_featuresets_and_labels(pos, neg, test_size=0.1):
    lexicon = create_lexicon(pos, neg)
    features=[]
    # use the file names passed in rather than hard-coded paths
    features += sample_handling(pos, lexicon, [1,0])
    features += sample_handling(neg, lexicon, [0,1])
    random.shuffle(features)
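    # features need to be a numpy array; each row holds a feature vector and a
    # label of different lengths, so recent NumPy versions need dtype=object
    features = np.array(features, dtype=object)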
    testing_size = int(test_size * len(features))
    # features[:,0] selects the 0th element (the feature vector) of every row,
    # e.g. [[5, 8], [7, 9]] becomes [5, 7]; features[:,1] selects the labels.
    # Everything except the last 10% is used for training
    train_x = list(features[:,0][:-testing_size])
    train_y = list(features[:,1][:-testing_size])
    # we need the last 10% for testing 
    test_x = list(features[:,0][-testing_size:])
    test_y = list(features[:,1][-testing_size:])
    return train_x,train_y,test_x,test_y


train_x,train_y,test_x,test_y = \
    create_featuresets_and_labels('pos.txt', 'neg.txt', test_size=0.1)

# if you want to pickle this data:
# with open('sentiment_set.pickle','wb') as f:
#     pickle.dump([train_x,train_y,test_x,test_y],f)

size of the lexicon= 423
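
The shuffle-and-split step can also be sketched without NumPy object arrays by keeping each feature vector paired with its label. The helper name `shuffle_and_split` and the `seed` parameter are assumptions for illustration and are not part of the original code.

```python
import random


def shuffle_and_split(samples, test_size=0.1, seed=None):
    """Shuffle (features, label) pairs and split off the last test_size fraction."""
    samples = list(samples)          # copy so the caller's list is not modified
    random.Random(seed).shuffle(samples)
    n_test = int(test_size * len(samples))
    split = len(samples) - n_test    # avoids the [:-0] pitfall when n_test == 0
    train, test = samples[:split], samples[split:]
    train_x = [features for features, _ in train]
    train_y = [label for _, label in train]
    test_x = [features for features, _ in test]
    test_y = [label for _, label in test]
    return train_x, train_y, test_x, test_y


# Twenty dummy samples: a 3-element feature vector and a one-hot label each
samples = [([i, 0, 1], [1, 0] if i % 2 == 0 else [0, 1]) for i in range(20)]
train_x, train_y, test_x, test_y = shuffle_and_split(samples, test_size=0.1, seed=0)
print(len(train_x), len(test_x))  # 18 2
```

Splitting by an explicit index instead of a negative slice keeps the training set intact even when the test fraction rounds down to zero.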


Our preprocessing is now done, and we have created the collection of feature sets and labels (if you uncommented the pickling lines above, 'sentiment_set.pickle' will also have been created in the same directory). Now we are ready to run this data through a deep neural network. We can reuse the code from deep_net with some changes.
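
If you do pickle the data, a context manager makes sure the file is closed after writing and reading. The sketch below round-trips a small stand-in dataset through a temporary directory; the toy values and the temporary path are assumptions, and only the file name matches the one used here.

```python
import os
import pickle
import tempfile

# Small stand-in for [train_x, train_y, test_x, test_y]
dataset = [[[1.0, 0.0]], [[1, 0]], [[0.0, 2.0]], [[0, 1]]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sentiment_set.pickle')
    # Write the four lists as a single pickled object
    with open(path, 'wb') as f:
        pickle.dump(dataset, f)
    # Read them back and unpack in the same order
    with open(path, 'rb') as f:
        train_x, train_y, test_x, test_y = pickle.load(f)

print(train_x, train_y)  # [[1.0, 0.0]] [[1, 0]]
```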

In [119]:
import tensorflow as tf
import numpy as np
# Load the pickled data (requires the pickling lines above to have been run)
with open("sentiment_set.pickle", "rb") as f:
    train_x, train_y, test_x, test_y = pickle.load(f)

n_nodes_hl1 = 800
n_nodes_hl2 = 800
n_nodes_hl3 = 800
n_nodes_hl4 = 800

# here we only have two classes (pos/neg)
n_classes = 2
batch_size = 100

# size of the placeholder should be identical to a training vector 
x = tf.placeholder('float', [None, len(train_x[0])])
y = tf.placeholder('float', shape=None)

def neural_network_model(data):
    hidden_1_layer = {'weights':tf.Variable(tf.random_normal([len(train_x[0]), n_nodes_hl1])),
                      'biases':tf.Variable(tf.random_normal([n_nodes_hl1]))}
    hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                      'biases':tf.Variable(tf.random_normal([n_nodes_hl2]))}
    hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3])),
                      'biases':tf.Variable(tf.random_normal([n_nodes_hl3]))}
    hidden_4_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_nodes_hl4])),
                      'biases':tf.Variable(tf.random_normal([n_nodes_hl4]))}
    output_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl4, n_classes])),
                    'biases':tf.Variable(tf.random_normal([n_classes])),}

    l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)

    l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)

    l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)

    l4 = tf.add(tf.matmul(l3,hidden_4_layer['weights']), hidden_4_layer['biases'])
    l4 = tf.nn.relu(l4)

    output = tf.matmul(l4,output_layer['weights']) + output_layer['biases']
    return output

def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y) )
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    
    hm_epochs = 15
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        for epoch in range(hm_epochs):
            epoch_loss = 0
            # The batching in deep_net was MNIST-specific, so here we slice
            # mini-batches manually, starting from the first example
            i = 0
            while i < len(train_x):
                start = i
                end = i + batch_size
                
                batch_x = np.array(train_x[start:end])
                batch_y = np.array(train_y[start:end])
                _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
                epoch_loss += c
                i += batch_size

            print('Epoch', epoch+1, 'completed out of',hm_epochs,'loss:',epoch_loss)

        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))

        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        print('Accuracy:',accuracy.eval({x:test_x, y:test_y}))
        

train_neural_network(x)

Epoch 1 completed out of 15 loss: 8223562.07812
Epoch 2 completed out of 15 loss: 3063549.93945
Epoch 3 completed out of 15 loss: 1485081.6333
Epoch 4 completed out of 15 loss: 1464338.87354
Epoch 5 completed out of 15 loss: 1483004.90771
Epoch 6 completed out of 15 loss: 839093.965576
Epoch 7 completed out of 15 loss: 178526.384323
Epoch 8 completed out of 15 loss: 96525.4767385
Epoch 9 completed out of 15 loss: 52809.3573151
Epoch 10 completed out of 15 loss: 49911.8045807
Epoch 11 completed out of 15 loss: 50802.9235554
Epoch 12 completed out of 15 loss: 53283.3651581
Epoch 13 completed out of 15 loss: 56075.9269714
Epoch 14 completed out of 15 loss: 55895.8740344
Epoch 15 completed out of 15 loss: 63538.2344646
Accuracy: 0.601313
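
The manual batching loop in `train_neural_network` can be expressed as a small generator, which makes the slice boundaries easier to check. The name `iterate_minibatches` is hypothetical; this is a sketch of the slicing logic only, with no TensorFlow involved.

```python
def iterate_minibatches(xs, ys, batch_size):
    """Yield consecutive (batch_x, batch_y) slices; the last batch may be smaller."""
    for start in range(0, len(xs), batch_size):
        end = start + batch_size
        yield xs[start:end], ys[start:end]


xs = list(range(10))
ys = [v % 2 for v in xs]
batches = list(iterate_minibatches(xs, ys, batch_size=4))
print([len(bx) for bx, _ in batches])  # [4, 4, 2]
```

Starting the range at index 0 means every training example, including the first, is visited once per epoch.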


The accuracy is not very impressive, but note that the dataset is small: only about 10,000 sentences. We will have to compare this result with one obtained from a much larger dataset.
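
For reference, the two quantities printed during training can be sketched in plain Python: the softmax cross-entropy of one example's logits against its one-hot label, and accuracy as the fraction of examples whose arg-max prediction matches the label. The function names and toy numbers below are assumptions for illustration; TensorFlow computes the same quantities over whole batches.

```python
import math


def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    # -sum(label * log_softmax), where log_softmax = logit - log_sum
    return -sum(t * (z - log_sum) for z, t in zip(logits, one_hot))


def accuracy(all_logits, all_labels):
    """Fraction of rows where the arg-max of the logits matches the label."""
    def argmax(values):
        return max(range(len(values)), key=lambda i: values[i])
    hits = sum(argmax(row) == argmax(target)
               for row, target in zip(all_logits, all_labels))
    return hits / len(all_logits)


logits = [[2.0, 0.5], [0.1, 1.2], [3.0, 3.5]]
labels = [[1, 0], [0, 1], [1, 0]]
print(accuracy(logits, labels))                   # 2 of the 3 rows are correct
print(softmax_cross_entropy([0.0, 0.0], [1, 0]))  # log(2), roughly 0.693
```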