<a href="https://colab.research.google.com/github/nianlonggu/Tensorflow-Notebooks/blob/master/Tensorflow_TPU_in_Google_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A simple tutorial on using TensorFlow TPUs in Google Colab

In this document, I provide a full example of how to use a TensorFlow TPU in Colab. The common iris species recognition example in the [official tutorials](https://cloud.google.com/tpu/docs/tutorials/migrating-to-tpuestimator-api) contains a lot of extra detail, such as handling feature columns and managing the name of each dimension of the feature vector. This can make it hard for readers to pick out the key steps needed to use a TPU in their own code. In this example, I adapt the official tutorial to handwritten digit recognition on the MNIST dataset.

The main topics include: 
1. **TensorFlow Dataset API**
2. **TensorFlow TPUEstimator API**
3. **Google Cloud Storage**

## Task Description
Given a handwritten image, like:

![MNIST](https://github.com/nianlonggu/Tensorflow-Notebooks/blob/master/figures/5.png?raw=true)

the task is to train a CNN to correctly recognize the handwritten digit (5 in this case). A typical approach is to train the model on a large number of (**X**, **y**) pairs, where **X** is the pixel matrix of an image and **y** is the corresponding label. This is a fully supervised training scheme.

There are multiple ways to use a TPU for this task. One is to first design the model using Keras and then use **tf.contrib.tpu.keras_to_tpu_model** to convert it into a TPU-specific model. Click [here](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb) to see an introduction.

Another way is to write a custom TPUEstimator and then call its train, evaluate and predict functions. This method provides a unified framework for general supervised training problems and is easier to transfer across computation platforms (CPU, GPU, TPU). It is the main topic of this document.

**Note**: before running, click "Edit" -> "Notebook settings" in Colab and choose TPU as the accelerator!

## Start of Coding

### Import Python libraries

In [0]:
import tensorflow as tf
import numpy as np
import keras.datasets.mnist as mnist
import math
import os
## plt is used to plot the results
import matplotlib.pyplot as plt

Unlike keras_to_tpu_model, if we want to use TPUEstimator in Colab, the checkpoint files cannot be stored locally in Colab. You will get an error like:

***file system scheme '[local]' not implemented***

As far as I know, the only choice is to store the checkpoint files in **Google Cloud Storage**. To do this without any additional cost, you can visit https://cloud.google.com/storage/ , apply for a 1-year free trial account, and activate the billing service. After this, follow the website instructions to create a new Google Cloud Storage bucket with a name like "awesome_gcs_bucket". You then get an address for accessing the bucket: **gs://awesome_gcs_bucket**

After getting the GCS address, you need to authorize Google Colab to access your Google Cloud Storage bucket using the following commands:

In [2]:
from google.colab import auth
auth.authenticate_user()
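
Since a local `model_dir` only fails once the estimator starts running, it can help to check the path up front. The following is a minimal sketch in plain Python; `require_gcs_path` is a hypothetical helper of my own, not part of TensorFlow or Colab, and it only checks the shape of the string, not whether the bucket exists or is writable.

```python
from urllib.parse import urlparse


def require_gcs_path(path):
    """Return (bucket, prefix) for a gs:// path, or raise ValueError.

    Hypothetical helper: it only inspects the string and never
    contacts Google Cloud Storage.
    """
    parsed = urlparse(path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(
            "model_dir must look like gs://<bucket>[/<prefix>], got %r" % path)
    return parsed.netloc, parsed.path.lstrip("/")


print(require_gcs_path("gs://awesome_gcs_bucket/mnist"))
```

Calling it with a local path such as `"./checkpoints"` raises a `ValueError` immediately instead of failing later with the file system scheme error.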





### Create a Flags class to hold all the global hyperparameters
This can also be achieved by using tf.flags. Note that there are some bugs related to tf.flags in Colab; simply add "flags.DEFINE_string('f', '', 'kernel')" to avoid them.

In [0]:
class Flags:
  def __init__(self):
    ## tpu stores the address of the TPU; it is passed to the TPUClusterResolver
    self.tpu = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    ## model_dir specifies the path where the model parameters are stored. Here we need to use the address of the cloud storage bucket.
    ## model_dir will be passed to the tpu.RunConfig() function
    self.model_dir = "gs://awesome_gcs_bucket"
    ## batch_size: the batch size used for training, evaluation and prediction. It should be a multiple of the number of TPU cores.
    ## Normally the number of TPU cores is 8, so the batch_size is usually 128. The reason is that, when running on a TPU, each batch
    ## of data is split across the TPU cores, so the batch size should be divisible by the number of TPU cores.
    self.batch_size = 128
    ## train_steps defines the number of steps of the whole training procedure. In each step one batch of data is fed into the TPU.
    ## Note that the TPUEstimator train function has two parameters: steps and max_steps. "steps" is the number of further steps
    ## to train from now; "max_steps" is the total maximum number of steps.
    self.train_steps = 10000
    ## eval_steps: the number of batches used in each evaluation (same value as in the tf.flags cell below)
    self.eval_steps = 4
    ## iterations: the number of training steps per TPU training loop, passed to TPUConfig (same value as in the tf.flags cell below)
    self.iterations = 500
    ## the bool flag for using tpu or not
    self.use_tpu = True


In [16]:
FLAGS = Flags()
FLAGS.tpu

'grpc://10.103.181.170:8470'
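
The divisibility requirement on `batch_size` described in the class comments can be checked with a few lines of plain Python. This is only a sketch: `per_core_batch_size` is a hypothetical helper, and the default of 8 cores is the usual Colab value mentioned above, not something queried from the TPU.

```python
def per_core_batch_size(global_batch_size, num_cores=8):
    """Return the number of examples each TPU core receives per step.

    num_cores=8 is an assumption taken from the text above; it is not
    read from the actual TPU.
    """
    if global_batch_size % num_cores != 0:
        raise ValueError(
            "batch size %d is not divisible by %d TPU cores"
            % (global_batch_size, num_cores))
    return global_batch_size // num_cores


print(per_core_batch_size(128))  # 16 examples per core
```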

In [9]:


from tensorflow import flags

# flags.DEFINE_string('f', '', 'kernel')

## TPU Cluster Resolver flags
flags.DEFINE_string("tpu", default= 'grpc://' + os.environ['COLAB_TPU_ADDR'], help= "used for the TPU resolver, contains the address of the TPU" )

## Model specific parameters
flags.DEFINE_string("model_dir", "gs://carlos-vic.appspot.com", "define the path to store the checkpoint files"  )
flags.DEFINE_integer('batch_size', 128, 'batch_size is usually 2^n')
flags.DEFINE_integer('train_steps', 2000, 'total number of train steps')
# eval_steps: the number of batches consumed by each call to evaluate
flags.DEFINE_integer('eval_steps', 4, 'total number of eval steps, skipped if 0')

## TPU specific parameters
flags.DEFINE_bool( 'use_tpu', True, 'True for using TPU' )
# iterations: the number of training steps run on the TPU in one loop before control returns to the host (passed to TPUConfig)
flags.DEFINE_integer('iterations', 500, 'number of iterations per TPU training loop' )


DuplicateFlagError: ignored
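
To make the relation between `train_steps` and `iterations` concrete, here is a small sketch in plain Python. It assumes, as I understand the TPUEstimator behavior, that training proceeds in loops of at most `iterations` steps, with a shorter final loop if the total is not a multiple; `tpu_training_loops` is a hypothetical name used only for illustration.

```python
def tpu_training_loops(train_steps, iterations_per_loop):
    """Return the number of steps executed in each TPU training loop.

    Assumption: every loop runs iterations_per_loop steps except
    possibly the last one, which runs whatever remains.
    """
    full_loops, remainder = divmod(train_steps, iterations_per_loop)
    loops = [iterations_per_loop] * full_loops
    if remainder:
        loops.append(remainder)
    return loops


print(tpu_training_loops(2000, 500))  # [500, 500, 500, 500]
print(tpu_training_loops(1200, 500))  # [500, 500, 200]
```

With the values used in this notebook (2000 training steps, 500 iterations), the host hands control to the TPU four times.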

In [0]:
# FLAGS.model_dir = "."

Preprocess the data and convert it to ndarrays. Other data formats may also be supported by tf.contrib.estimator.

In [0]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_val = x_train[-5000:]/255.0
x_val = np.expand_dims(x_val, -1).astype(np.float32)
y_val = y_train[-5000:].astype(np.int32)
x_train = x_train[ :-5000]/255.0
x_train = np.expand_dims(x_train, -1).astype(np.float32)
y_train = y_train[ :-5000].astype(np.int32)  ## by default y_train is not one-hot coded
x_test = x_test/255.0
x_test = np.expand_dims(x_test, -1).astype(np.float32)
y_test = y_test.astype(np.int32)

print( x_val.shape )
print( x_train.shape )
print( x_test.shape )

(5000, 28, 28, 1)
(55000, 28, 28, 1)
(10000, 28, 28, 1)
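
The same preprocessing steps can be tried without downloading MNIST. The sketch below uses random data as a stand-in for the real images; `prepare_images` and the `*_demo` names are hypothetical and exist only for this illustration.

```python
import numpy as np


def prepare_images(images):
    """Scale uint8 images to [0, 1] and add a trailing channel axis."""
    return np.expand_dims(images / 255.0, -1).astype(np.float32)


# Synthetic stand-in for MNIST: 12 random 28x28 uint8 images.
rng = np.random.RandomState(0)
fake_images = rng.randint(0, 256, size=(12, 28, 28)).astype(np.uint8)

n_val = 5  # the notebook holds out the last 5000 training images
x_train_demo = prepare_images(fake_images[:-n_val])
x_val_demo = prepare_images(fake_images[-n_val:])

print(x_train_demo.shape, x_val_demo.shape, x_train_demo.dtype)
```

The trailing axis of size 1 is the channel dimension that the convolutional layers expect for grayscale images.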


Define a helper function for plotting images.

In [0]:
def plot_images( images, labels=None, preds= None  ):
  images = np.squeeze(images )
  num = images.shape[0]
  n_rows = math.floor( math.sqrt( num ))
  n_columns = math.ceil(math.sqrt(num))

  fig, axes = plt.subplots( n_rows, n_columns )
  
  if n_rows * n_columns == 1:
    axes = [axes]
  else:
    axes = axes.flat
  
  for i, ax in enumerate( axes ):
    if i < num:
      ax.imshow( images[i] )
      xlabel= ""
      if labels is not None:
        xlabel += "True: %d "%(labels[i])
      if preds is not None:
        xlabel += "Pred: %d"%(preds[i])
      ax.set_xlabel(xlabel)
    ax.set_xticks([])
    ax.set_yticks([])
  plt.show()

# Define the input functions for the Estimator

In [0]:
def train_input_fn( features, labels, batch_size ):

    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # Shuffle and repeat.
    dataset = dataset.shuffle(1000).repeat()
    # TPU specific: batch with drop_remainder=True so that every batch has the same size, which is divisible by the number of TPU cores
    dataset = dataset.batch( batch_size, drop_remainder = True )
    # Return the dataset.
    return dataset
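
The effect of `drop_remainder` and of calling `repeat()` before `batch()` can be illustrated with a plain Python emulation. This is a sketch of the batching semantics only, not of tf.data itself; `batches` is a hypothetical generator, and shuffling is left out to keep the output predictable.

```python
from itertools import cycle, islice


def batches(examples, batch_size, drop_remainder):
    """Group an iterable into lists of batch_size items."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A final, smaller batch is only emitted when drop_remainder is False.
    if batch and not drop_remainder:
        yield batch


# One pass over 10 examples with batch size 4.
print(list(batches(range(10), 4, drop_remainder=True)))
print(list(batches(range(10), 4, drop_remainder=False)))

# Repeating before batching: batches run across the epoch boundary,
# so every batch is full and nothing is dropped.
print(list(islice(batches(cycle(range(10)), 4, drop_remainder=True), 3)))
```

The last line prints `[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 1]]`: the third batch mixes the end of one epoch with the start of the next.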

In [0]:
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.shuffle(1000).repeat()
    dataset = dataset.batch(batch_size, drop_remainder = True )
    # Return the dataset.
    return dataset

In [0]:
def predict_input_fn(features, labels,batch_size):
    """An input function for prediction"""
    # Convert the inputs to a Dataset.
    # For predict use_tpu should be False, since drop_remainder is False
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.batch(batch_size)
    # Return the dataset.
    return dataset

In [0]:
def metric_fn(labels, logits):
    """Function to return metrics for evaluation."""

    predicted_classes = tf.argmax(logits, 1)
    accuracy = tf.metrics.accuracy(labels=labels,
                                   predictions=predicted_classes,
                                   name="acc_op")
    return {"accuracy": accuracy}
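
What `metric_fn` computes for a single batch can be reproduced with NumPy. This is only a sketch on made-up numbers: `accuracy_from_logits` is a hypothetical helper, and unlike `tf.metrics.accuracy` it does not keep a running average over several batches.

```python
import numpy as np


def accuracy_from_logits(labels, logits):
    """Fraction of rows whose largest logit sits at the label index."""
    predicted_classes = np.argmax(logits, axis=1)
    return float(np.mean(predicted_classes == labels))


# Made-up logits for 4 examples and 3 classes.
demo_logits = np.array([[0.1, 2.0, -1.0],
                        [3.0, 0.0, 0.5],
                        [0.2, 0.1, 0.9],
                        [1.5, 1.0, 0.0]])
demo_labels = np.array([1, 0, 2, 1])

print(accuracy_from_logits(demo_labels, demo_logits))  # 0.75
```

The first three rows are classified correctly and the last one is not, which gives 3 out of 4.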

# Self Defined Estimator

In [0]:

def model_fn(features, labels, mode, params):
    # Args:
    #
    # features: This is the x-arg from the input_fn.
    # labels:   This is the y-arg from the input_fn,
    #           see e.g. train_input_fn for these two.
    # mode:     Either TRAIN, EVAL, or PREDICT
    # params:   User-defined hyper-parameters, e.g. learning-rate.
    
    # Reference to the tensor named "x" in the input-function.
#     x = features["x"]

    ## create the model networks
    x = features
    # First convolutional layer.
    net = tf.layers.conv2d(inputs=x, name='layer_conv1',
                           filters=16, kernel_size=5,
                           padding='same', activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=2, strides=2)

    # Second convolutional layer.
    net = tf.layers.conv2d(inputs=net, name='layer_conv2',
                           filters=36, kernel_size=5,
                           padding='same', activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=2, strides=2)    

    # Flatten to a 2-rank tensor.
    net = tf.contrib.layers.flatten(net)
    # Eventually this should be replaced with:
    # net = tf.layers.flatten(net)

    # First fully-connected / dense layer.
    # This uses the ReLU activation function.
    net = tf.layers.dense(inputs=net, name='layer_fc1',
                          units=128, activation=tf.nn.relu)    

    # Second fully-connected / dense layer.
    # This is the last layer so it does not use an activation function.
    net = tf.layers.dense(inputs=net, name='layer_fc2',units=10)

    ## compute logits
    # Logits output of the neural network.
    logits = net
    
    ## for training and evaluation
    ## compute loss
    loss =  tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                                       logits=logits))
    # Softmax output of the neural network.
    y_pred_probabilities = tf.nn.softmax(logits=logits)
    # Classification output of the neural network.
    y_pred_classes = tf.argmax(y_pred_probabilities, axis=1)[:, tf.newaxis]
    # Prediction Accuracy for evaluation
    pred_accuracy = tf.metrics.accuracy( labels = labels, predictions = y_pred_classes , name = "op_acc" )

    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
              "logits": logits,
              "class_ids": y_pred_classes,
              "probabilities": y_pred_probabilities,
        }
        
        spec = tf.contrib.tpu.TPUEstimatorSpec(mode=mode, predictions=predictions)
    elif mode == tf.estimator.ModeKeys.EVAL:
        spec = tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, eval_metrics=(metric_fn, [labels, logits] ) )

    elif mode == tf.estimator.ModeKeys.TRAIN:
  
        # Define the optimizer for improving the neural network.
        optimizer = tf.train.AdamOptimizer(learning_rate=params["learning_rate"])
        
        if FLAGS.use_tpu:
            optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
        train_op = optimizer.minimize(
            loss=loss, global_step=tf.train.get_global_step())

#         # Define the evaluation metrics,
#         # in this case the classification accuracy.
#         metrics = \
#         {
#             "accuracy": tf.metrics.accuracy(labels, y_pred_cls)
#         }

#         # Wrap all of this in an EstimatorSpec.
# #         spec = tf.estimator.EstimatorSpec(
# #             mode=mode,
# #             loss=loss,
# #             train_op=train_op,
# #             eval_metric_ops=metrics)
        
        spec= tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss,train_op= train_op, eval_metrics=(metric_fn, [labels, logits] ) )
        
    return spec
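
To see how the tensor shapes change inside `model_fn`, the layer arithmetic can be traced in plain Python. This is a sketch under the assumptions that the convolutions use 'same' padding with stride 1 (as set above) and that the pooling layers use their default 'valid' padding; the helper names are hypothetical.

```python
def conv_same(shape, filters):
    """'same' padding with stride 1 keeps height and width."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, pool_size=2, strides=2):
    """'valid' max pooling over height and width."""
    height, width, channels = shape
    return ((height - pool_size) // strides + 1,
            (width - pool_size) // strides + 1,
            channels)


def conv_params(kernel_size, in_channels, filters):
    return kernel_size * kernel_size * in_channels * filters + filters


def dense_params(in_units, units):
    return in_units * units + units


shape = (28, 28, 1)
shape = conv_same(shape, 16)   # layer_conv1
shape = max_pool(shape)
shape = conv_same(shape, 36)   # layer_conv2
shape = max_pool(shape)
flat_units = shape[0] * shape[1] * shape[2]

total_params = (conv_params(5, 1, 16) + conv_params(5, 16, 36)
                + dense_params(flat_units, 128) + dense_params(128, 10))

print(shape, flat_units, total_params)
```

Under these assumptions the flattened vector has 7 * 7 * 36 = 1764 entries, and most of the trainable parameters sit in the first dense layer.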

In [0]:
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    FLAGS.tpu)

# tpu_config = os.environ.get('TF_CONFIG')
run_config = tf.contrib.tpu.RunConfig(
    model_dir=FLAGS.model_dir,
    cluster=tpu_cluster_resolver,
#     save_checkpoints_steps = 0,
    session_config=tf.ConfigProto(
        allow_soft_placement=True, log_device_placement=True),
    tpu_config=tf.contrib.tpu.TPUConfig(FLAGS.iterations),
)

In [0]:
# model = tf.estimator.Estimator(model_fn=model_fn,
#                                params={"learning_rate": 1e-4},
#                                model_dir="./checkpoints_tutorial17-2/")

model = tf.contrib.tpu.TPUEstimator(
                               model_fn=model_fn,
                               params = {"learning_rate": 1e-4 },
                               config = run_config,
                               use_tpu= FLAGS.use_tpu,
                               train_batch_size=FLAGS.batch_size,
                               eval_batch_size=FLAGS.batch_size,
                               predict_batch_size=FLAGS.batch_size,
  
                               
                                )



INFO:tensorflow:Using config: {'_model_dir': 'gs://carlos-vic.appspot.com', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
log_device_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.11.28.90:8470"
    }
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fcb18704b00>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.11.28.90:8470', '_evaluation_master': 'grpc://10.11.28.90:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=500, num_shards=

# Training

In [0]:
!gsutil mb gs://carlos-vic.appspot.com

Creating gs://carlos-vic.appspot.com/...
You are attempting to perform an operation that requires a project id, with none configured. Please re-run gsutil config and make sure to follow the instructions for finding and entering your default project id.


In [0]:
model.train( input_fn = lambda params: train_input_fn( x_train, y_train, params["batch_size"] ), steps=2000  )

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:TPU job name worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://carlos-vic.appspot.com/model.ckpt-100000
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 100000 into gs://carlos-vic.appspot.com/model.ckpt.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Installing graceful shutdown hook.
INFO:tensorflow:Creating heartbeat manager for ['/job:worker/replica:0/task:0/device:CPU:0']
INFO:tensorflow:Configuring worker heartbeat: shutdown_mode: WAIT_FOR_COORDINATOR

INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 7 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorf

<tensorflow.contrib.tpu.python.tpu.tpu_estimator.TPUEstimator at 0x7fcb20e39f98>
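
The training log above restores from `model.ckpt-100000`, and the evaluation log further down restores from `model.ckpt-102000`. This matches the meaning of `steps` as "further steps from now". The difference between `steps` and `max_steps` can be sketched as follows; `final_global_step` is a hypothetical helper, and it simplifies the real behavior by rejecting the case where neither argument is given.

```python
def final_global_step(current_step, steps=None, max_steps=None):
    """Global step after a train call, given the step it starts from."""
    if steps is not None and max_steps is not None:
        raise ValueError("pass either steps or max_steps, not both")
    if steps is not None:
        # steps: train this many additional steps.
        return current_step + steps
    if max_steps is not None:
        # max_steps: stop at this total; nothing happens if already past it.
        return max(current_step, max_steps)
    raise ValueError("this sketch requires steps or max_steps")


print(final_global_step(100000, steps=2000))       # 102000
print(final_global_step(100000, max_steps=10000))  # 100000
```

The second call shows why `max_steps` can appear to do nothing when a checkpoint with a larger global step already exists in `model_dir`.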

In [0]:
eval_result = model.evaluate(
      input_fn=lambda params: eval_input_fn(
          x_val, y_val, params["batch_size"]),
      steps=FLAGS.eval_steps)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-04-08T12:43:37Z
INFO:tensorflow:TPU job name worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://carlos-vic.appspot.com/model.ckpt-102000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 9 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Enqueue next (4) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (4) batch(es) of data from outfeed.
INFO:tensorflow:Evaluation [4/4]
INFO:tensorflow:Stop infeed thread controller
INFO:tensorflow:Shutting down InfeedController thread.
INFO:tensorflow:InfeedController received shutdown signal, stopping.
INFO:tensorflow:Infeed thread finished, shutting down.
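
With `eval_steps` set to 4 and a batch size of 128, only a small part of the 5000 validation images is used. The arithmetic is sketched below; `eval_coverage` is a hypothetical helper, and the step count it suggests only counts full batches because the input function drops the remainder.

```python
def eval_coverage(num_examples, batch_size, eval_steps):
    """Return (examples consumed, steps needed for one pass of full batches)."""
    examples_consumed = batch_size * eval_steps
    steps_for_one_pass = num_examples // batch_size
    return examples_consumed, steps_for_one_pass


print(eval_coverage(5000, 128, 4))  # (512, 39)
```

So the evaluation above consumes 512 examples, while 39 steps would be needed to go through roughly the whole validation set once.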

In [0]:
predict_result = model.predict(
      input_fn=lambda params: predict_input_fn(
          x_test,y_test, params["batch_size"]))


next(predict_result)

# for pred_dict, expec in zip( predict_result, y_test  ):
#   class_id = pred_dict["class_ids"][0]
#   prob_pred = pred_dict["probabilities"][class_id]
#   print( "Prediction is %d, probability %.1f, expected %d" %( class_id, prob_pred, expec )  )

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:TPU job name worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://carlos-vic.appspot.com/model.ckpt-102000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 7 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.


{'class_ids': array([7]),
 'logits': array([-18.951256  ,   1.2203839 ,  -0.69914997,   2.895075  ,
        -35.774868  , -12.417238  , -79.63498   ,  50.539818  ,
          4.910502  ,   6.9981675 ], dtype=float32),
 'probabilities': array([6.6131789e-31, 3.8092727e-22, 5.5872563e-23, 2.0330702e-21,
        3.2659478e-38, 4.5509267e-28, 0.0000000e+00, 1.0000000e+00,
        1.5256014e-20, 1.2305580e-19], dtype=float32)}
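
The relation between the `logits`, `probabilities` and `class_ids` entries can be checked by hand. The sketch below applies a softmax in plain Python to the logits printed above; the `softmax` helper is my own and uses double precision, so the tiny probabilities differ slightly from the float32 output.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Logits copied from the prediction output above.
logits = [-18.951256, 1.2203839, -0.69914997, 2.895075,
          -35.774868, -12.417238, -79.63498, 50.539818,
          4.910502, 6.9981675]

probabilities = softmax(logits)
class_id = max(range(len(probabilities)), key=probabilities.__getitem__)

print(class_id)                 # 7
print(probabilities[class_id])
```

The logit for class 7 is more than 40 larger than any other, which is why its probability is indistinguishable from 1.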

In [0]:
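# Note: the second argument of next() is a default value returned once the generator is exhausted, not an index.
next(predict_result, 10000)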

{'class_ids': array([5]),
 'logits': array([-24.94518 , -49.249718, -20.357456, -28.440752, -11.425112,
         35.63763 ,  17.286167, -20.439623,  10.581571, -12.081831],
       dtype=float32),
 'probabilities': array([4.88899473e-27, 1.36111573e-37, 4.80442570e-25, 1.48289985e-28,
        3.63839229e-21, 1.00000000e+00, 1.07167075e-08, 4.42545347e-25,
        1.31307925e-11, 1.88669861e-21], dtype=float32)}

In [0]:
predict_result

<generator object TPUEstimator.predict at 0x7f05c0106990>
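
Since `predict` returns a generator, results are consumed one at a time and cannot be indexed. The sketch below shows this with a small stand-in generator; `fake_predictions` and `nth` are hypothetical names, and no TPU is involved.

```python
from itertools import islice


def fake_predictions():
    """Stand-in for model.predict(...): yields one dict per example."""
    for class_id in [7, 2, 1, 0, 4]:
        yield {"class_ids": [class_id]}


def nth(generator, index, default=None):
    """Return the item at position index, consuming the generator up to it."""
    return next(islice(generator, index, None), default)


preds = fake_predictions()
print(next(preds))          # {'class_ids': [7]}
print(next(preds, 10000))   # {'class_ids': [2]}, the default is not used

print(nth(fake_predictions(), 3))  # {'class_ids': [0]}

exhausted = fake_predictions()
list(exhausted)                    # consume everything
print(next(exhausted, 10000))      # 10000, the default is returned
```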

In [0]:
#  Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

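"""Module to generate Iris dataset for using in custom TPUEstimator."""
# Reference copy of the input module from the official Iris example.
# Note: running this cell redefines train_input_fn, eval_input_fn and predict_input_fn,
# which shadows the MNIST versions defined earlier in this notebook.
import numpy as np
import pandas as pd
import tensorflow as tf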

TRAIN_URL = 'http://download.tensorflow.org/data/iris_training.csv'
TEST_URL = 'http://download.tensorflow.org/data/iris_test.csv'

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth',
                    'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

PREDICTION_INPUT_DATA = {
    'SepalLength': [6.9, 5.1, 5.9],
    'SepalWidth': [3.1, 3.3, 3.0],
    'PetalLength': [5.4, 1.7, 4.2],
    'PetalWidth': [2.1, 0.5, 1.5],
}

PREDICTION_OUTPUT_DATA = ['Virginica', 'Setosa', 'Versicolor']


def maybe_download():
  train_path = tf.keras.utils.get_file(TRAIN_URL.split('/')[-1], TRAIN_URL)
  test_path = tf.keras.utils.get_file(TEST_URL.split('/')[-1], TEST_URL)

  return train_path, test_path


def load_data(y_name='Species'):
  """Returns the iris dataset as (train_x, train_y), (test_x, test_y)."""
  train_path, test_path = maybe_download()

  # np.float32 / np.int32 are used directly: the pd.np alias is deprecated
  # and no longer available in recent pandas releases.
  train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0,
                      dtype={'SepalLength': np.float32,
                             'SepalWidth': np.float32,
                             'PetalLength': np.float32,
                             'PetalWidth': np.float32,
                             'Species': np.int32})
  train_x, train_y = train, train.pop(y_name)

  test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0,
                     dtype={'SepalLength': np.float32,
                            'SepalWidth': np.float32,
                            'PetalLength': np.float32,
                            'PetalWidth': np.float32,
                            'Species': np.int32})
  test_x, test_y = test, test.pop(y_name)

  return (train_x, train_y), (test_x, test_y)


def train_input_fn(features, labels, batch_size):
  """An input function for training."""

  # Convert the inputs to a Dataset.
  dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

  # Shuffle, repeat, and batch the examples.
  dataset = dataset.shuffle(1000).repeat()

  dataset = dataset.batch(batch_size, drop_remainder=True)

  # Return the dataset.
  return dataset


def eval_input_fn(features, labels, batch_size):
  """An input function for evaluation."""
  features = dict(features)
  inputs = (features, labels)

  # Convert the inputs to a Dataset.
  dataset = tf.data.Dataset.from_tensor_slices(inputs)
  dataset = dataset.shuffle(1000).repeat()

  dataset = dataset.batch(batch_size, drop_remainder=True)

  # Return the dataset.
  return dataset


def predict_input_fn(features, batch_size):
  """An input function for prediction."""

  dataset = tf.data.Dataset.from_tensor_slices(features)
  dataset = dataset.batch(batch_size)
  return dataset

# Evaluation and Prediction

In [0]:
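# test_input_fn was never defined, so reuse the MNIST eval_input_fn on the test set instead
test_results = model.evaluate( input_fn = lambda params: eval_input_fn( x_test, y_test, params["batch_size"] ), steps = len(x_test) // FLAGS.batch_size )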

In [0]:
test_results

{'accuracy': 0.977,
 'average_loss': 0.083602235,
 'global_step': 2000,
 'loss': 10.5825615}

In [0]:
print("Classification accuracy: %.2f%%"%(test_results["accuracy"] * 100))

Classification accuracy: 97.70%
