
#### Copyright 2018 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sequence Models

**Learning Objectives:**
* Parse and prepare movie review string data for sentiment analysis
* Create a custom Estimator for a simple dynamic Recurrent Neural Network (RNN)
* Train and test an RNN model

In this Colab, we'll work with embeddings using text data from movie reviews (from the [ACL 2011 IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/)). This data has already been processed into the `tf.Example` format.

Please **make a copy** of this Colab notebook before starting this lab. To do so, choose **File**->**Save a copy in Drive**.

## Setup

Let's import our dependencies and download the training and test data. [`tf.keras`](https://www.tensorflow.org/api_docs/python/tf/keras) includes a file download and caching tool that we can use to retrieve the data sets.

In [0]:
import collections
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython import display
from sklearn import metrics

tf.logging.set_verbosity(tf.logging.ERROR)
train_url = 'https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/train.tfrecord'
train_path = tf.keras.utils.get_file(train_url.split('/')[-1], train_url)
test_url = 'https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/test.tfrecord'
test_path = tf.keras.utils.get_file(test_url.split('/')[-1], test_url)

## Sentiment Analysis and Movie Review Classification

In this lab, we'll build a classifier that assigns a sentiment polarity label to each short IMDB movie review: 1 if the review is positive overall, 0 if it is negative overall.

The dataset consists of 50K movie reviews, split equally into test and train sets. The data is also split equally between positives and negatives in both train and test sets. It has already been pre-processed, and each entry contains an ordered list of "terms" from the review, along with a label.

## Building the Input Pipeline

First, let's configure the input pipeline to import our data into a TensorFlow model. We can use the following function to parse the training and test data (which is in [TFRecord](https://www.tensorflow.org/programmers_guide/datasets) format) and return a dict of the features and the corresponding labels.

In [0]:
def _parse_function(record):
  """Extracts features and labels.
  
  Args:
    record: A scalar string tensor holding one serialized `tf.Example` proto.
  Returns:
    A `tuple` `(features, labels)`:
      features: A dict of tensors representing the features
      labels: A tensor with the corresponding labels.
  """
  features = {
    "terms": tf.VarLenFeature(dtype=tf.string), # terms are strings of varying lengths
    "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32) # labels are 0 or 1
  }
  
  parsed_features = tf.parse_single_example(record, features)
  
  terms = parsed_features['terms'].values
  labels = parsed_features['labels']

  return {'terms': terms}, labels

To confirm our function is working as expected, let's construct a `TFRecordDataset` for the training data, and map the data to features and labels using the function above.

In [0]:
with tf.Graph().as_default():
  # Create the Dataset object.
  ds = tf.data.TFRecordDataset(train_path)
  # Map features and labels with the parse function.
  ds = ds.map(_parse_function)
  nxt = ds.make_one_shot_iterator().get_next()
  sess = tf.Session()

print('Dataset : {}\n'.format(ds))
print('Example : {}'.format(sess.run(nxt)))

Now, let's build a formal input function that we can pass to the `train()` method of a TensorFlow Estimator object.

In [0]:
# Create an input_fn that parses the tf.Examples from the given files
# and splits them into features and targets.

batch_size = 32

def my_input_fn(input_filenames, num_epochs=None, shuffle=True):
  
  # Same code as above; create a dataset and map features and labels.
  ds = tf.data.TFRecordDataset(input_filenames)
  ds = ds.map(_parse_function)
  ds = ds.repeat(num_epochs)

  if shuffle:
    ds = ds.shuffle(10 * batch_size)

  # Our feature data is variable-length, so we pad and batch
  # each field of the dataset structure to whatever size is necessary.
  ds = ds.padded_batch(batch_size, ds.output_shapes)
  ds = ds.prefetch(8)
  
  # Return the next batch of data.
  features, labels = ds.make_one_shot_iterator().get_next()
  return features, labels
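
To make the padding step concrete, here is a minimal pure-Python sketch of what `padded_batch` does to a batch of variable-length term lists. It assumes that string fields are padded with the empty string (the `tf.data` default for string tensors). The helper `pad_batch` and the sample reviews are hypothetical and exist only for illustration.

```python
def pad_batch(sequences, pad_value=""):
    """Pads each sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Hypothetical mini-batch of three tokenized reviews with different lengths.
batch = [
    ["great", "movie"],
    ["terrible"],
    ["not", "bad", "at", "all"],
]

padded = pad_batch(batch)
for row in padded:
    print(row)
```

Every row now has the length of the longest review in the batch, so the batch can be stacked into one rectangular tensor. The padding slots still hold a token (the empty string), which matters once a model reads the last time step of each sequence.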

## Task 1: Create a custom Estimator for a Dynamic Recurrent Neural Network (RNN)

Next, we'll build a custom Estimator for a Dynamic RNN. We'll start with a simple 1-layer LSTM with 32 hidden units.

Please consult the documentation available on [Creating Custom Estimators](https://www.tensorflow.org/get_started/custom_estimators) in TensorFlow for more information on how to set up your model_fn with the model definition (including [input](https://www.tensorflow.org/get_started/custom_estimators#define_the_input_layer), [hidden](https://www.tensorflow.org/get_started/custom_estimators#hidden_layers) and [output](https://www.tensorflow.org/get_started/custom_estimators#output_layer) layers), as well as the branching code implementing [prediction](https://www.tensorflow.org/get_started/custom_estimators#predict), [evaluation](https://www.tensorflow.org/get_started/custom_estimators#evaluate) and [training](https://www.tensorflow.org/get_started/custom_estimators#train).

For additional helpful references on RNNs, please check the tutorial on [Recurrent Neural Networks](https://www.tensorflow.org/tutorials/recurrent). A variety of cell types is available; we recommend starting with the [Long Short-Term Memory unit](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LSTMCell) recurrent network cell.
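
As a rough illustration of what an LSTM cell computes at each time step, the sketch below implements the standard LSTM equations in plain Python and unrolls them over a short sequence. It is not TensorFlow's implementation and omits details of the library cells. The names `lstm_step`, `weights`, and `biases`, the tiny sizes, and the random inputs are all assumptions made for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights, biases):
    """Runs one LSTM time step and returns the new (h, c) state."""
    concat = x + h_prev  # Concatenate the input with the previous output.

    def affine(gate):
        # One pre-activation per hidden unit for the given gate.
        return [sum(w * v for w, v in zip(row, concat)) + b
                for row, b in zip(weights[gate], biases[gate])]

    i = [sigmoid(z) for z in affine("input")]        # How much new info to write.
    f = [sigmoid(z) for z in affine("forget")]       # How much old memory to keep.
    o = [sigmoid(z) for z in affine("output")]       # How much memory to expose.
    g = [math.tanh(z) for z in affine("candidate")]  # Proposed new memory content.

    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Tiny hypothetical sizes so the example is easy to inspect.
input_dim, num_units, num_steps = 3, 2, 4
rng = random.Random(0)
gates = ("input", "forget", "output", "candidate")
weights = {gate: [[rng.uniform(-0.5, 0.5) for _ in range(input_dim + num_units)]
                  for _ in range(num_units)] for gate in gates}
biases = {gate: [0.0] * num_units for gate in gates}

# Unroll the cell over a random input sequence, carrying the state forward.
h, c = [0.0] * num_units, [0.0] * num_units
outputs = []
for _ in range(num_steps):
    x = [rng.uniform(-1.0, 1.0) for _ in range(input_dim)]
    h, c = lstm_step(x, h, c, weights, biases)
    outputs.append(h)

print(outputs[-1])  # Output at the last time step, one value per hidden unit.
```

A "dynamic" RNN applies this same step once per token, so the number of steps follows the length of the input instead of being fixed in advance.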

In [0]:
embedding_hash_buckets = 1000
embedding_dim = 32
rnn_cell_num_units = 32

def my_model_fn(features, labels, mode, params):
  """Custom estimator for Dynamic RNN.

  Args:
    features: This is batch_features from input_fn.
    labels: This is batch_labels from input_fn.
    mode: An instance of tf.estimator.ModeKeys.
    params: Additional parameters.
  Returns:
    A tf.estimator.EstimatorSpec
  """
  tf.logging.info('my_model_fn: {}'.format(mode))
    
  tokens = features['terms']
  token_ids = tf.string_to_hash_bucket_fast(tokens, embedding_hash_buckets)
  
  embeddings = tf.get_variable("word_embeddings",
                               [embedding_hash_buckets, embedding_dim])
  
  token_embeddings = tf.nn.embedding_lookup(embeddings, token_ids)

  # Define the model here.
  # WRITE YOUR CODE HERE

  # Prediction mode.
  if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(mode,
                                      predictions={
                                          'probabilities': probabilities,
                                          'predictions': predictions})

  # Evaluation and training modes.
  # Calculate loss.
  loss = tf.losses.sigmoid_cross_entropy(labels, logits=logits)
  
  # Calculate the accuracy between the true labels and our predictions.
  accuracy = tf.metrics.accuracy(tf.to_int32(labels), predictions)

  # Evaluation mode.
  if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(
        mode,
        loss=loss,
        eval_metric_ops={'my_accuracy': accuracy})

  # If mode is neither PREDICT nor EVAL, then it must be TRAIN.
  assert mode == tf.estimator.ModeKeys.TRAIN, 'TRAIN is the only ModeKey left.'

  # Training mode.
  optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
  # Optional gradient clipping (commented out):
  # optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, 5.0)
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())


  # Return training operations: loss and train_op.
  return tf.estimator.EstimatorSpec(
      mode,
      loss=loss,
      train_op=train_op)
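
The first lines of the model function turn each term into a bucket id and then into an embedding vector. The following plain-Python sketch mirrors that idea. It uses `zlib.crc32` as a stand-in hash, which is an assumption: TensorFlow uses its own hash function, so the actual ids will differ. The helper `to_hash_bucket`, the sample tokens, and the reduced `embedding_dim` are hypothetical.

```python
import random
import zlib


def to_hash_bucket(token, num_buckets):
    """Maps a string to a bucket id in [0, num_buckets)."""
    # Assumption: CRC32 stands in for TensorFlow's own hash function, so the
    # ids differ from those of tf.string_to_hash_bucket_fast.
    return zlib.crc32(token.encode("utf-8")) % num_buckets


embedding_hash_buckets = 1000
embedding_dim = 4  # Smaller than the notebook's 32 to keep the printout short.

# One randomly initialized vector per bucket, like the "word_embeddings" variable.
rng = random.Random(0)
embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
              for _ in range(embedding_hash_buckets)]

tokens = ["a", "great", "movie", "great"]
token_ids = [to_hash_bucket(t, embedding_hash_buckets) for t in tokens]
token_embeddings = [embeddings[i] for i in token_ids]  # Row lookup per token.

print(token_ids)
print(len(token_embeddings), len(token_embeddings[0]))
```

The same term always lands in the same bucket, so it always receives the same vector. With only 1000 buckets, distinct terms can collide and share an embedding, which is the trade-off for not needing a vocabulary file.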

### Possible solution

In [0]:
embedding_hash_buckets = 1000
embedding_dim = 32
rnn_cell_num_units = 32

def my_model_fn(features, labels, mode, params):
  """Custom estimator for Dynamic RNN.

  Args:
    features: This is batch_features from input_fn.
    labels: This is batch_labels from input_fn.
    mode: An instance of tf.estimator.ModeKeys.
    params: Additional parameters.
  Returns:
    A tf.estimator.EstimatorSpec
  """
  tf.logging.info('my_model_fn: {}'.format(mode))
    
  tokens = features['terms']
  token_ids = tf.string_to_hash_bucket_fast(tokens, embedding_hash_buckets)
  
  embeddings = tf.get_variable("word_embeddings",
                               [embedding_hash_buckets, embedding_dim])
  
  token_embeddings = tf.nn.embedding_lookup(embeddings, token_ids)

  # Define the model here.
  rnn_cell = tf.contrib.rnn.LSTMBlockCell(rnn_cell_num_units)
  outputs, _ = tf.nn.dynamic_rnn(rnn_cell,
                                 token_embeddings,
                                 dtype=tf.float32)

  # Keep only the output at the final time step of each padded sequence.
  last_output = outputs[:, -1, :]
  logits = tf.layers.dense(last_output, 1)
  probabilities = tf.sigmoid(logits)
  predictions = tf.to_int32(tf.round(probabilities))

  # Prediction mode.
  if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(mode,
                                      predictions={
                                          'probabilities': probabilities,
                                          'predictions': predictions})

  # Evaluation and training modes.
  # Calculate loss.
  loss = tf.losses.sigmoid_cross_entropy(labels, logits=logits)
  
  # Calculate the accuracy between the true labels and our predictions.
  accuracy = tf.metrics.accuracy(tf.to_int32(labels), predictions)

  # Evaluation mode.
  if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(
        mode,
        loss=loss,
        eval_metric_ops={'my_accuracy': accuracy})

  # If mode is neither PREDICT nor EVAL, then it must be TRAIN.
  assert mode == tf.estimator.ModeKeys.TRAIN, 'TRAIN is the only ModeKey left.'

  # Training mode.
  optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
  # Optional gradient clipping (commented out):
  # optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, 5.0)
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())


  # Return training operations: loss and train_op.
  return tf.estimator.EstimatorSpec(
      mode,
      loss=loss,
      train_op=train_op)
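
The loss and prediction steps in the model function can be sketched in plain Python as follows. The function uses the numerically stable form of sigmoid cross-entropy, `max(x, 0) - x * z + log(1 + exp(-|x|))`, and averages it over the batch, which is how I understand the default reduction of `tf.losses.sigmoid_cross_entropy` to behave. The `(label, logit)` pairs are made-up values for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_cross_entropy(label, logit):
    """Numerically stable cross-entropy between a 0/1 label and a logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Hypothetical (label, logit) pairs for a batch of four reviews.
batch = [(1.0, 2.0), (0.0, -1.5), (1.0, -0.5), (0.0, 3.0)]

losses = [sigmoid_cross_entropy(label, logit) for label, logit in batch]
mean_loss = sum(losses) / len(losses)

# Round the probability to get the predicted class, as in the model_fn.
predictions = [int(round(sigmoid(logit))) for _, logit in batch]
num_correct = sum(int(p == int(label)) for p, (label, _) in zip(predictions, batch))
accuracy = num_correct / float(len(batch))

print(mean_loss, predictions, accuracy)
```

Confidently wrong logits (such as the last pair) dominate the mean loss, while accuracy only counts whether the rounded probability matches the label.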

## Task 2: Use the Dynamic Recurrent Neural Network (RNN) for sentiment analysis

Next, we'll construct the Dynamic RNN classifier, train it on the training set, and evaluate it on both the training and test sets. After you read through the code, run it and see how you do.

In [0]:
classifier = tf.estimator.Estimator(
    model_fn=my_model_fn,
)

num_train_steps = 100
num_eval_steps = 100

classifier.train(
    input_fn=lambda: my_input_fn([train_path]),
    steps=num_train_steps)

evaluation_metrics = classifier.evaluate(
    input_fn=lambda: my_input_fn([train_path]),
    steps=num_eval_steps)

print('Training set metrics:')
for m in evaluation_metrics:
  print('{} {}'.format(m, evaluation_metrics[m]))
print('---')

evaluation_metrics = classifier.evaluate(
    input_fn=lambda: my_input_fn([test_path]),
    steps=num_eval_steps)

print('Test set metrics:')
for m in evaluation_metrics:
  print('{} {}'.format(m, evaluation_metrics[m]))
print('---')

## Task 3: Try to improve the model's performance

Now see if you can refine the model to improve performance! Here is a short list of a few things you can try to do:

* **Adjusting hyperparameters** such as cell type, cell size, number of layers, embedding dimension, optimizer type, or learning rate.
* **Using a vocabulary list** to keep only a limited set of most informative terms.
* **Updating the call to tf.nn.dynamic_rnn in the custom estimator** to take sequence length as an input argument.
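
To see why the sequence length matters, consider the sketch below. With padded batches, reading the output at the final time step picks up a padding position for every review shorter than the longest one. `tf.nn.dynamic_rnn` accepts a `sequence_length` argument for this purpose; the plain-Python helpers here (`sequence_lengths`, `last_relevant_outputs`) are hypothetical and only illustrate the selection logic. They assume that the empty string marks padding and that every review has at least one term.

```python
def sequence_lengths(padded_tokens, pad_value=""):
    """Counts the non-padding tokens in each padded row."""
    return [sum(1 for token in row if token != pad_value) for row in padded_tokens]


def last_relevant_outputs(outputs, lengths):
    """Picks, for each example, the output at its last real time step."""
    # Assumes every length is at least 1; a length of 0 would wrap around.
    return [example[length - 1] for example, length in zip(outputs, lengths)]


# Hypothetical padded batch (empty strings are padding).
padded_tokens = [
    ["great", "movie", "", ""],
    ["not", "bad", "at", "all"],
]

# Toy stand-in for RNN outputs with shape [batch][time][units]; each value
# encodes its own (example, time step) so the selection is easy to follow.
outputs = [
    [[0.1], [0.2], [0.3], [0.4]],
    [[1.1], [1.2], [1.3], [1.4]],
]

lengths = sequence_lengths(padded_tokens)
naive_last = [example[-1] for example in outputs]  # Mirrors outputs[:, -1, :].
length_aware = last_relevant_outputs(outputs, lengths)

print(lengths)
print(naive_last)
print(length_aware)
```

For the first review, the naive selection returns the output computed at a padding position, while the length-aware selection returns the output at the last real term.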

## Task 4: Use the canned Estimator: RNNClassifier

Now that we've written a custom Estimator, let's try the new canned Estimator, `RNNClassifier`, and compare the two.

RNNClassifier is defined [here](https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/RNNClassifier).

Start by defining some parameters:

In [0]:
embedding_hash_buckets = 1000
embedding_dim = 32
batch_size = 32

When using the canned `RNNClassifier`, use the sequence feature columns defined in [sequence_feature_column.py](https://cs.corp.google.com/piper///depot/google3/third_party/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py) in `tf.contrib`. Start by defining a function that returns the feature columns.

In [0]:
def get_feature_columns():
  # WRITE YOUR CODE HERE
  pass

Next, create your input function, which will parse the tf.Examples from the given files and split them into features and targets.

In [0]:
def my_input_fn(input_filenames, num_epochs=None, shuffle=True):
  # WRITE YOUR CODE HERE
  pass

### Solution

In [0]:
def get_feature_columns():
  terms = tf.contrib.feature_column.sequence_categorical_column_with_hash_bucket(
      "terms", embedding_hash_buckets)
  terms_embedding = tf.feature_column.embedding_column(terms, embedding_dim)
  return [terms_embedding]


def my_input_fn(input_filenames, num_epochs=None, shuffle=True):
  
  # As above, create a dataset from the TFRecord files.
  ds = tf.data.TFRecordDataset(input_filenames)
  
  # Use FeatureColumns to generate the parse_example spec.
  feature_spec = tf.estimator.classifier_parse_example_spec(
      get_feature_columns(),
      label_key="labels",
      label_dtype=tf.float32)
  ds = ds.map(lambda x: tf.parse_single_example(x, feature_spec))

  if shuffle:
    ds = ds.shuffle(10000)

  # The canned Estimator handles padding internally and will skip computation
  # on padded terms, so we don't need to pad our data in the input_fn.
  ds = ds.batch(batch_size)
  ds = ds.repeat(num_epochs)

  # Return the next batch of data.
  features = ds.make_one_shot_iterator().get_next()
  labels = features.pop("labels")
  return features, labels

### Training and evaluating the model

Now we can construct the `RNNClassifier`, train it on the training set, evaluate it on both the training and test sets, and print out some metrics.

In [0]:
classifier = tf.contrib.estimator.RNNClassifier(
    sequence_feature_columns=get_feature_columns(),
    num_units=[32],
    cell_type='lstm',
    n_classes=2
)

num_train_steps = 100
num_eval_steps = 100

classifier.train(
    input_fn=lambda: my_input_fn([train_path]),
    steps=num_train_steps)

evaluation_metrics = classifier.evaluate(
    input_fn=lambda: my_input_fn([train_path]),
    steps=num_eval_steps)

print('Training set metrics:')
for m in evaluation_metrics:
  print('{} {}'.format(m, evaluation_metrics[m]))
print('---')

evaluation_metrics = classifier.evaluate(
    input_fn=lambda: my_input_fn([test_path]),
    steps=num_eval_steps)

print('Test set metrics:')
for m in evaluation_metrics:
  print('{} {}'.format(m, evaluation_metrics[m]))
print('---')