# Sentiment analysis with TFLearn

In this notebook, we'll continue Andrew Trask's work by building a network for sentiment analysis on the movie review data. Instead of a network written with Numpy, we'll be using [TFLearn](http://tflearn.org/), a high-level library built on top of TensorFlow. TFLearn makes it simpler to build networks just by defining the layers. It takes care of most of the details for you.

We'll start off by importing all the modules we'll need, then load and prepare the data.

In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical
from sklearn.model_selection import train_test_split

## Preparing the data

Following along with Andrew, our goal here is to convert our reviews into word vectors. The word vectors will have elements representing words in the total vocabulary. If the second position represents the word 'the', for each review we'll count up the number of times 'the' appears in the text and set the second position to that count. I'll show you examples as we build the input data from the reviews data. Check out Andrew's notebook and video for more about this.

In [4]:
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)

### Counting word frequency

To start off we'll need to count how often each word appears in the data. We'll use this count to create a vocabulary we'll use to encode the review data. This resulting count is known as a [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model). We'll use it to select our vocabulary and build the word vectors. You should have seen how to do this in Andrew's lesson. Try to implement it here using the [Counter class](https://docs.python.org/2/library/collections.html#collections.Counter).

> **Exercise:** Create the bag of words from the reviews data and assign it to `total_counts`. The reviews are stores in the `reviews` [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). If you want the reviews as a Numpy array, use `reviews.values`. You can iterate through the rows in the DataFrame with `for idx, row in reviews.iterrows():` ([documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html)).

In [64]:
from collections import Counter

def count_words_in_corpus(corpus, labels):
    counts = Counter()
    positive_counts = Counter()
    negative_counts = Counter()
    total_counts = Counter()
    
    for i in range(len(reviews)):
        if(labels[0][i] == 'positive'):
            for word in reviews[0][i].split(" "):
                positive_counts[word] += 1
                total_counts[word] += 1
        else:
            for word in reviews[0][i].split(" "):
                negative_counts[word] += 1
                total_counts[word] += 1
    
#     for row_index, row in corpus.iterrows():
#         for word in row[0].split(" "):
#             if word not in stopwords:
#                 counts[word] += 1
            
    return positive_counts, negative_counts, total_counts
    
positive_counts, negative_counts, total_counts = count_words_in_corpus(reviews, labels)

print("Total words in data set: ", len(total_counts))

Total words in data set:  74074


In [66]:
pos_neg_ratios = Counter()

for term,cnt in list(total_counts.most_common()):
    if(cnt > 20):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio

for word,ratio in pos_neg_ratios.most_common():
    if(ratio > 1):
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = np.log((1 / (ratio+0.01)))

In [69]:
pos_neg_ratios.most_common(n=10000)

[('edie', 4.6913478822291435),
 ('morty', 4.6051701859880918),
 ('btk', 4.6051701859880918),
 ('trancers', 4.6051701859880918),
 ('delia', 4.6051701859880918),
 ('savini', 4.6051701859880918),
 ('kornbluth', 4.6051701859880918),
 ('stirba', 4.6051701859880918),
 ('dangerfield', 4.6051701859880918),
 ('saif', 4.6051701859880918),
 ('manos', 4.6051701859880918),
 ('dunaway', 4.6051701859880918),
 ('sarne', 4.6051701859880918),
 ('hasselhoff', 4.6051701859880918),
 ('weisz', 4.6051701859880918),
 ('dyer', 4.6051701859880918),
 ('hackenstein', 4.6051701859880918),
 ('darkman', 4.6051701859880918),
 ('palermo', 4.6051701859880918),
 ('tashan', 4.6051701859880918),
 ('kazaam', 4.6051701859880918),
 ('tremors', 4.6051701859880918),
 ('dorff', 4.6051701859880918),
 ('mraovich', 4.6051701859880918),
 ('hobgoblins', 4.6051701859880918),
 ('gram', 4.6051701859880918),
 ('chomsky', 4.6051701859880918),
 ('kibbutz', 4.6051701859880918),
 ('shaq', 4.6051701859880918),
 ('rosanna', 4.6051701859880918

Let's keep the first 10000 most frequent words. As Andrew noted, most of the words in the vocabulary are rarely used so they will have little effect on our predictions. Below, we'll sort `vocab` by the count value and keep the 10000 most frequent words.

In [87]:
ratios = dict(pos_neg_ratios.most_common())
vocab = sorted(ratios, key=ratios.get, reverse=True)[:1000]
vocab_size = len(vocab)
print(vocab[:60])

['edie', 'morty', 'btk', 'trancers', 'delia', 'rosanna', 'slater', 'savini', 'kornbluth', 'stirba', 'dangerfield', 'berkowitz', 'manos', 'sarne', 'hasselhoff', 'weisz', 'dyer', 'darkman', 'palermo', 'dunaway', 'tashan', 'hackenstein', 'kazaam', 'tremors', 'dorff', 'mraovich', 'rochon', 'saif', 'hobgoblins', 'gram', 'chomsky', 'kibbutz', 'shaq', 'orca', 'kareena', 'hallam', 'domergue', 'lordi', 'gymnast', 'zenia', 'louque', 'antwone', 'din', 'gunga', 'goldsworthy', 'yokai', 'gypo', 'boll', 'paulie', 'flavia', 'visconti', 'uwe', 'kells', 'blandings', 'brashear', 'gino', 'deathtrap', 'panahi', 'harilal', 'ossessione']


What's the last word in our vocabulary? We can use this to judge if 10000 is too few. If the last word is pretty common, we probably need to keep more words.

In [71]:
print(vocab[-1], ': ', total_counts[vocab[-1]])

version :  2157


The last word in our vocabulary shows up in 30 reviews out of 25000. I think it's fair to say this is a tiny proportion of reviews. We are probably fine with this number of words.

Now for each review in the data, we'll make a word vector. First we need to make a mapping of word to index, pretty easy to do with a dictionary comprehension.

> **Exercise:** Create a dictionary called `word2idx` that maps each word in the vocabulary to an index. The first word in `vocab` has index `0`, the second word has index `1`, and so on.

In [88]:
word2idx = {word: index for index,word in enumerate(vocab)}
word2idx["terrific"]

980

In [89]:
word2idx

{'mutant': 856,
 'avoid': 921,
 'melody': 922,
 'benoit': 420,
 'yoda': 737,
 'ella': 430,
 'firefighters': 534,
 'elliott': 964,
 'iturbi': 100,
 'foch': 667,
 'reda': 142,
 'antwone': 41,
 'merrill': 907,
 'abu': 231,
 'vomit': 874,
 'unwatchable': 147,
 'kareena': 34,
 'miserably': 517,
 'strathairn': 502,
 'rosanna': 5,
 'myrna': 702,
 'pickford': 447,
 'wtf': 287,
 'ronny': 95,
 'gino': 55,
 'hams': 263,
 'soccer': 669,
 'heather': 710,
 'incoherent': 190,
 'lighten': 527,
 'conflicted': 936,
 'raines': 372,
 'hasselhoff': 14,
 'feinstone': 111,
 'cringed': 690,
 'arquette': 353,
 'paycheck': 246,
 'dustin': 852,
 'vincenzo': 619,
 'filth': 878,
 'badly': 730,
 'freedom': 953,
 'hallam': 35,
 'croc': 369,
 'plods': 240,
 'seagal': 96,
 'heartwarming': 894,
 'carla': 334,
 'giamatti': 156,
 'kelso': 108,
 'trinity': 696,
 'colman': 318,
 'trelkovsky': 79,
 'shia': 761,
 'amicus': 886,
 'antidote': 813,
 'steaming': 97,
 'wai': 810,
 'hackman': 687,
 'understated': 902,
 'deathstalk

### Text to vector function

Now we can write a function that converts some text to a word vector. The function will take a string of words as input and return a vector with the words counted up. Here's the general algorithm to do this:

* Initialize the word vector with [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html), it should be the length of the vocabulary.
* Split the input string of text into a list of words with `.split(' ')`.
* For each word in that list, increment the element in the index associated with that word, which you get from `word2idx`.

**Note:** Since all words aren't in the `vocab` dictionary, you'll get a key error if you run into one of those words. You can use the `.get` method of the `word2idx` dictionary to specify a default returned value when you make a key error. For example, `word2idx.get(word, None)` returns `None` if `word` doesn't exist in the dictionary.

In [74]:
def text_to_vector(text):
    word_vector = np.zeros(vocab_size)
    
    for word in text.split(" "):
        if word in word2idx:
            word_vector[word2idx[word]] += 1
    
    return word_vector

Now, run through our entire review data set and convert each review to a word vector.

In [90]:
word_vectors = np.zeros((len(reviews), len(vocab)), dtype=np.int_)

for index, (_, text) in enumerate(reviews.iterrows()):
    word_vectors[index] = text_to_vector(text[0])

In [77]:
# Printing out the first 5 word vectors
word_vectors[:5, :23]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

### Train, Validation, Test sets

Now that we have the word_vectors, we're ready to split our data into train, validation, and test sets. Remember that we train on the train data, use the validation data to set the hyperparameters, and at the very end measure the network performance on the test data. Here we're using the function `to_categorical` from TFLearn to reshape the target data so that we'll have two output units and can classify with a softmax activation function. We actually won't be creating the validation set here, TFLearn will do that for us later.

In [91]:
Y = (labels == 'positive').astype(np.int_)
records = len(labels)

# shuffle = np.arange(records)
# np.random.shuffle(shuffle)
# train_fraction = 0.9

trainX, testX, trainY, testY = train_test_split(word_vectors, to_categorical(Y.values, 2), test_size = 0.1)


# train_split, test_split = shuffle[:int(records*train_fraction)], shuffle[int(records*train_fraction):]
# trainX, trainY = word_vectors[train_split,:], to_categorical(Y.values[train_split], 2)
# testX, testY = word_vectors[test_split,:], to_categorical(Y.values[test_split], 2)

In [92]:
trainY

array([[ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       ..., 
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.]])

## Building the network

[TFLearn](http://tflearn.org/) lets you build the network by [defining the layers](http://tflearn.org/layers/core/). 

### Input layer

For the input layer, you just need to tell it how many units you have. For example, 

```
net = tflearn.input_data([None, 100])
```

would create a network with 100 input units. The first element in the list, `None` in this case, sets the batch size. Setting it to `None` here leaves it at the default batch size.

The number of inputs to your network needs to match the size of your data. For this example, we're using 10000 element long vectors to encode our input data, so we need 10000 input units.


### Adding layers

To add new hidden layers, you use 

```
net = tflearn.fully_connected(net, n_units, activation='ReLU')
```

This adds a fully connected layer where every unit in the previous layer is connected to every unit in this layer. The first argument `net` is the network you created in the `tflearn.input_data` call. It's telling the network to use the output of the previous layer as the input to this layer. You can set the number of units in the layer with `n_hidden`, and set the activation function with the `activation` keyword. You can keep adding layers to your network by repeated calling `net = tflearn.fully_connected(net, n_units)`.

### Output layer

The last layer you add is used as the output layer. There for, you need to set the number of units to match the target data. In this case we are predicting two classes, positive or negative sentiment. You also need to set the activation function so it's appropriate for your model. Again, we're trying to predict if some input data belongs to one of two classes, so we should use softmax.

```
net = tflearn.fully_connected(net, 2, activation='softmax')
```

### Training
To set how you train the network, use 

```
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
```

Again, this is passing in the network you've been building. The keywords: 

* `optimizer` sets the training method, here stochastic gradient descent
* `learning_rate` is the learning rate
* `loss` determines how the network error is calculated. In this example, with the categorical cross-entropy.

Finally you put all this together to create the model with `tflearn.DNN(net)`. So it ends up looking something like 

```
net = tflearn.input_data([None, 10])                          # Input
net = tflearn.fully_connected(net, 5, activation='ReLU')      # Hidden
net = tflearn.fully_connected(net, 2, activation='softmax')   # Output
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
model = tflearn.DNN(net)
```

> **Exercise:** Below in the `build_model()` function, you'll put together the network using TFLearn. You get to choose how many layers to use, how many hidden units, etc.

In [95]:
# Network building
def build_model(learning_rate = 0.1):
    # This resets all parameters and variables, leave this here
    tf.reset_default_graph()
    
    #### Your code ####
    net = tflearn.input_data([None, vocab_size])
    
    net = tflearn.fully_connected(net, 250, activation='ReLU')
    net = tflearn.embedding(net, input_dim=1000, output_dim = 128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate, loss='categorical_crossentropy')
    
    model = tflearn.DNN(net, tensorboard_verbose=0)
    return model

## Intializing the model

Next we need to call the `build_model()` function to actually build the model. In my solution I haven't included any arguments to the function, but you can add arguments so you can change parameters in the model if you want.

> **Note:** You might get a bunch of warnings here. TFLearn uses a lot of deprecated code in TensorFlow. Hopefully it gets updated to the new TensorFlow version soon.

In [101]:
model = build_model(learning_rate=0.001)

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
Instructions for updating:
Please switch to tf.summary.merge.
Instructions for updating:
Use `tf.global_variables_initializer` instead.


## Training the network

Now that we've constructed the network, saved as the variable `model`, we can fit it to the data. Here we use the `model.fit` method. You pass in the training features `trainX` and the training targets `trainY`. Below I set `validation_set=0.1` which reserves 10% of the data set as the validation set. You can also set the batch size and number of epochs with the `batch_size` and `n_epoch` keywords, respectively. Below is the code to fit our the network to our word vectors.

You can rerun `model.fit` to train the network further if you think you can increase the validation accuracy. Remember, all hyperparameter adjustments must be done using the validation set. **Only use the test set after you're completely done training the network.**

In [102]:
# Training
model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True, batch_size=32)

Training Step: 7040  | total loss: [1m[32m0.69337[0m[0m
| Adam | epoch: 010 | loss: 0.69337 - acc: 0.4936 | val_loss: 0.69332 - val_acc: 0.4868 -- iter: 22500/22500
Training Step: 7040  | total loss: [1m[32m0.69337[0m[0m
| Adam | epoch: 010 | loss: 0.69337 - acc: 0.4936 | val_loss: 0.69332 - val_acc: 0.4868 -- iter: 22500/22500
--


## Testing

After you're satisified with your hyperparameters, you can run the network on the test set to measure it's performance. Remember, *only do this after finalizing the hyperparameters*.

In [43]:
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == testY[:,0], axis=0)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.7684


## Try out your own text!

In [44]:
text = "This movie is so bad. It was awful and the worst"
positive_prob = model.predict([text_to_vector(text.lower())])[0][1]
print('P(positive) = {:.3f} :'.format(positive_prob), 
      'Positive' if positive_prob > 0.5 else 'Negative')

P(positive) = 0.207 : Negative
