# Tweet sentiment with a Deep Neural Network
 
## Outline
- [Importing the data](#2)
- [Defining classes](#3)
- [Training](#4)
- [Evaluation](#5)


In [3]:
import os
import numpy as np
import trax
from trax import layers as tl

## Data

In [6]:
train_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=True)()
eval_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=False)()

In [7]:
print(next(train_stream))  # See one example.
print(next(train_stream))  # one more

(b'Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.<br /><br />Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust.', 0)
(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting c

In [15]:
data_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_file='en_8k.subword', keys=[0]),
    trax.data.Shuffle(),
    trax.data.FilterByLength(max_length=2048, length_keys=[0]),
    trax.data.BucketByLength(boundaries=[  32, 128, 512, 2048],
                             batch_sizes=[512, 128,  32,    8, 1],
                             length_keys=[0]),
    trax.data.AddLossWeights()
)
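
To make the bucketing step concrete, here is a minimal pure-Python sketch of how a sequence length could map to a bucket and a batch size, given the `boundaries` and `batch_sizes` above. It assumes an example goes into the first bucket whose boundary is at least its length, and that the final batch size applies to anything longer than the last boundary. `bucket_for_length` is a hypothetical helper written for this illustration, not a Trax function.

```python
from bisect import bisect_left

# Values copied from the pipeline above.
BOUNDARIES = [32, 128, 512, 2048]
BATCH_SIZES = [512, 128, 32, 8, 1]


def bucket_for_length(length, boundaries=BOUNDARIES, batch_sizes=BATCH_SIZES):
    """Return (bucket_index, batch_size) for a sequence of the given length.

    Hypothetical helper; assumes `length <= boundary` selects the bucket.
    """
    idx = bisect_left(boundaries, length)  # index of first boundary >= length
    return idx, batch_sizes[idx]


for n in (20, 100, 600, 2048):
    print(n, bucket_for_length(n))
```

Under this assumption, reviews of 513 to 2048 tokens are batched 8 at a time, which is consistent with the `(8, 2048)` batch shape reported later in this section. Short sequences get large batches and long sequences get small ones, so each batch holds a roughly comparable number of tokens.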


In [9]:
train_batches_stream = data_pipeline(train_stream)
eval_batches_stream = data_pipeline(eval_stream)

Take one batch from the training stream and check its shapes.

In [10]:
example_batch = next(train_batches_stream)
print(f'shapes = {[x.shape for x in example_batch]}')  # Check the shapes.

shapes = [(8, 2048), (8,), (8,)]


Each batch is a tuple `(x, y, weights)`: padded token IDs, labels, and per-example loss weights.

In [14]:
example_batch

(array([[1786, 2090,   55, ...,    0,    0,    0],
        [ 139, 4072,  114, ...,    0,    0,    0],
        [8180,    2,   28, ...,    0,    0,    0],
        ...,
        [ 139,  907, 2047, ...,    0,    0,    0],
        [ 433,   25,  258, ...,    0,    0,    0],
        [ 506,  668,  140, ...,    0,    0,    0]]),
 array([0, 0, 1, 0, 1, 0, 1, 0]),
 array([1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32))
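
The trailing zeros in `x` are padding, added so that every review in a batch has the same length. The NumPy sketch below uses a made-up toy batch rather than real data, and assumes token ID 0 is reserved for padding. It shows one way to recover the true lengths and the share of the batch that is padding.

```python
import numpy as np

# Toy batch (made up for illustration): 3 sequences padded with 0 to length 6.
toy_x = np.array([
    [1786, 2090, 55, 0, 0, 0],
    [139, 4072, 114, 7, 0, 0],
    [8180, 2, 28, 31, 12, 9],
])

pad_id = 0  # assumption: 0 marks padding positions
lengths = (toy_x != pad_id).sum(axis=1)          # non-padding tokens per row
pad_fraction = 1.0 - lengths.sum() / toy_x.size  # share of padding in the batch

print(lengths)
print(pad_fraction)
```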

## Model

In [4]:
model = tl.Serial(
    tl.Embedding(vocab_size=8192, d_feature=256),
    tl.Mean(axis=1),  # Average on axis 1 (length of sentence).
    tl.Dense(2),      # Classify 2 classes.
    tl.LogSoftmax()   # Produce log-probabilities.
)

# Print the model structure.
print(model)

Serial[
  Embedding_8192_256
  Mean
  Dense_2
  LogSoftmax
]
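
The four layers can be mirrored in plain NumPy to make the tensor shapes explicit. This is a sketch with randomly initialised, untrained weights and small hypothetical sizes. It illustrates the computation only and is not how Trax implements these layers internally.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_feature, n_classes = 50, 8, 2  # small illustrative sizes

# Randomly initialised parameters (stand-ins for trained weights).
embedding = rng.normal(size=(vocab_size, d_feature))
w = rng.normal(size=(d_feature, n_classes))
b = np.zeros(n_classes)


def forward(token_ids):
    """Map int token IDs of shape (batch, length) to log-probs (batch, 2)."""
    vectors = embedding[token_ids]   # (batch, length, d_feature)
    pooled = vectors.mean(axis=1)    # (batch, d_feature): average over tokens
    logits = pooled @ w + b          # (batch, n_classes)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


tokens = rng.integers(0, vocab_size, size=(4, 10))
log_probs = forward(tokens)
print(log_probs.shape)
print(np.exp(log_probs).sum(axis=-1))  # each row sums to ~1
```

Because the mean over axis 1 discards word order, this architecture behaves as a bag-of-embeddings classifier.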


## Training

In [16]:
from trax.supervised import training

# Training task.
train_task = training.TrainTask(
    labeled_data=train_batches_stream,
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.01),
    n_steps_per_checkpoint=500,
)

# Evaluation task.
eval_task = training.EvalTask(
    labeled_data=eval_batches_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    n_eval_batches=20  # For less variance in eval numbers.
)

# Training loop saves checkpoints to output_dir.
output_dir = os.path.expanduser('~/output_dir/')
!rm -rf {output_dir}
training_loop = training.Loop(model,
                              train_task,
                              eval_tasks=[eval_task],
                              output_dir=output_dir)

# Run 2000 steps (batches).
training_loop.run(2000)




Step      1: Ran 1 train steps in 1.38 secs
Step      1: train CrossEntropyLoss |  0.50513411
Step      1: eval  CrossEntropyLoss |  0.81346474
Step      1: eval          Accuracy |  0.45937500

Step    500: Ran 499 train steps in 23.68 secs
Step    500: train CrossEntropyLoss |  0.58927298
Step    500: eval  CrossEntropyLoss |  0.46201409
Step    500: eval          Accuracy |  0.77812500

Step   1000: Ran 500 train steps in 20.22 secs
Step   1000: train CrossEntropyLoss |  0.39455706
Step   1000: eval  CrossEntropyLoss |  0.33334947
Step   1000: eval          Accuracy |  0.84218750

Step   1500: Ran 500 train steps in 20.20 secs
Step   1500: train CrossEntropyLoss |  0.36336109
Step   1500: eval  CrossEntropyLoss |  0.34425066
Step   1500: eval          Accuracy |  0.81601563

Step   2000: Ran 500 train steps in 20.06 secs
Step   2000: train CrossEntropyLoss |  0.31726089
Step   2000: eval  CrossEntropyLoss |  0.37587366
Step   2000: eval          Accuracy |  0.85781250
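
For reference, the two reported metrics can be sketched in NumPy as weighted averages over a batch. The sketch assumes the inputs are log-probabilities, as produced by the `LogSoftmax` layer, and uses made-up toy values. It illustrates the definitions and is not Trax's implementation.

```python
import numpy as np


def cross_entropy(log_probs, targets, weights):
    """Weighted mean of -log p(target) over the batch."""
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-(picked * weights).sum() / weights.sum())


def accuracy(log_probs, targets, weights):
    """Weighted fraction of examples whose argmax matches the target."""
    correct = (log_probs.argmax(axis=-1) == targets).astype(np.float32)
    return float((correct * weights).sum() / weights.sum())


# Toy batch (made up): 4 examples, 2 classes.
log_probs = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]))
targets = np.array([0, 1, 1, 1])
weights = np.ones(4, dtype=np.float32)

print(cross_entropy(log_probs, targets, weights))
print(accuracy(log_probs, targets, weights))
```

With all weights equal to 1, as in the example batch shown earlier, both metrics reduce to plain averages over the batch.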


## Evaluation

In [17]:
example_input = next(eval_batches_stream)[0][0]
example_input_str = trax.data.detokenize(example_input, vocab_file='en_8k.subword')
print(f'example input_str: {example_input_str}')
sentiment_log_probs = model(example_input[None, :])  # Add batch dimension.
print(f'Model returned sentiment probabilities: {np.exp(sentiment_log_probs)}')

example input_str: Snakes on a Plane was such a well hyped film that it was both inevitable and a little crazy to try to release another movie with almost the same title in the same year let alone the same week. Reading the other comments here I see the results. A lot of people are mad. Mad because it doesn't have the best special effects. Mad because it doesn't have a star cast. Mad because they wanted to see Samuel Jackson say "I'm sick of these M^*&*&%-Er F*^(^%-Ing Snakes on this M^*&*&%-Er F*^(^%-Ing Train"! <br /><br />Well, this sure ain't the Samuel Jackson version. And maybe that's good.<br /><br />Snakes on a Plane was lost between cop film and horror, a family action film and a bloody gory movie of death. Saturday Night Live performers got laughs while Jackson swore enough to make a grandmother cover her ears, and as far as kids go, they would be traumatized by the violence.<br /><br />Snakes on a Train however knew exactly what it was. This was a cheaply made horror movie o

Model returned sentiment probabilities: [[0.3830032 0.6169968]]
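
To turn this output into a label, take the argmax over the two classes. The sketch below reuses the probabilities printed above and assumes the `imdb_reviews` convention that label 0 is negative and label 1 is positive. `LABEL_NAMES` is a name introduced here for illustration.

```python
import numpy as np

LABEL_NAMES = ['negative', 'positive']  # assumption: 0 = negative, 1 = positive

# Probabilities copied from the model output above.
probs = np.array([[0.3830032, 0.6169968]])

predicted = int(probs.argmax(axis=-1)[0])  # index of the most likely class
confidence = float(probs[0, predicted])    # probability of that class
print(f'{LABEL_NAMES[predicted]} ({confidence:.1%})')
```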
