# Train Fashion MNIST Image Classification with Distributed TensorFlow on Cloud Machine Learning Engine (Cloud MLE)

This notebook demonstrates how to use Cloud ML Engine to train a convolutional neural network model for image classification. In the upcoming lab, you will deploy the trained model as a web service (an application programming interface, or API) for online predictions.

In [0]:
import os
#@markdown Enter your GCP Project ID:
PROJECT = "" #@param {type: "string"}
#@markdown Enter your GCP Storage Bucket ID:
BUCKET = "" #@param {type: "string"}
#@markdown OPTIONAL: Replace with your GCP Storage Bucket Region:
REGION = "us-central1" #@param {type:"string"}

MODEL_TYPE='cnn_batch_norm'       # convolutional neural network

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['MODEL_TYPE'] = MODEL_TYPE
os.environ['TFVERSION'] = '1.10'  # TensorFlow version

def start_tensorboard(logdir, url_file):
  get_ipython().system_raw('tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'.format(logdir))
  get_ipython().system_raw('lt --port 6006 >> {} 2>&1 &'.format(url_file))
  get_ipython().system('cat {}'.format(url_file))

def stop_tensorboard(url_file):
  get_ipython().system_raw("ps -Af  | grep -E 'tensorboard|lt --port' | awk '{print $2}' | xargs -I % kill -9 %")
  get_ipython().system_raw("rm {}".format(url_file))

try:
  from google.colab import auth
  auth.authenticate_user()
  print("Authenticated")
except Exception as e:  # e.g. not running in Colab, or authentication failed
  print("Failed to authenticate: {}".format(e))
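
The setup cell leaves `PROJECT` and `BUCKET` blank by default, and the later cells build Cloud Storage paths from them. A quick check like the one below can catch an empty value before any `gsutil` or `gcloud` command runs. This is only a minimal sketch: the helper name `missing_settings` and the example values are hypothetical and not part of the original notebook.

```python
def missing_settings(settings):
    """Return the names of settings that are empty or only whitespace."""
    return [name for name, value in settings.items() if not str(value).strip()]

# Placeholder values for illustration; in the notebook you would pass
# {"PROJECT": PROJECT, "BUCKET": BUCKET, "REGION": REGION} instead.
example = {"PROJECT": "", "BUCKET": "my-bucket", "REGION": "us-central1"}
print(missing_settings(example))  # → ['PROJECT']
```

If the returned list is not empty, fill in the missing fields and re-run the setup cell before continuing.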

In [0]:
%%bash
git clone https://github.com/osipov/training-data-analyst.git
cp -r training-data-analyst/bootcamps/imagereco/fashionmodel .

gcloud config set project $PROJECT
gcloud config set compute/region $REGION
gsutil -m rm -rf gs://${BUCKET}/fashion/trained_${MODEL_TYPE}

## Train as a Python module on Cloud ML Engine

Because the code will run on Cloud ML Engine, it is packaged as a Python module.

The `model.py` and `task.py` files containing the model code are in <a href="https://github.com/osipov/training-data-analyst/tree/master/bootcamps/imagereco/fashionmodel/trainer">fashionmodel/trainer</a>.

**Next, use Cloud ML Engine to train on a GPU-equipped worker with** `--scale-tier=BASIC_GPU`

Note that the GPU speedup depends on the model type: more complex models train substantially faster on GPUs. For simple models that take just seconds to minutes to train on a single node, keep in mind that Cloud ML Engine adds a few minutes of overhead for training job setup and teardown.

In [0]:
%%bash
OUTDIR=gs://${BUCKET}/fashion/trained_${MODEL_TYPE}
JOBNAME=fashion_${MODEL_TYPE}_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/fashionmodel/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --runtime-version=$TFVERSION \
   -- \
   --output_dir=$OUTDIR \
   --train_steps=2395 --learning_rate=0.0029 --train_batch_size=663 \
   --dprob=0.39 --ksize1=5 --nfil1=15 --ksize2=7 --nfil2=20 \
   --model=$MODEL_TYPE
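
Everything after the bare `--` in the command above is handed to `trainer.task` as ordinary command-line flags. The actual `task.py` is not shown in this notebook, so the following is only a minimal sketch of how such an entry point might read them with `argparse`. The function name `parse_trainer_args` and all default values are assumptions. Cloud ML Engine also forwards `--job-dir` to the trainer when it is set on the `gcloud` command, which is why the sketch tolerates flags it does not declare.

```python
import argparse

def parse_trainer_args(argv):
    """Parse a subset of the trainer flags (hypothetical sketch, not the real task.py)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--train_steps", type=int, default=1000)         # assumed default
    parser.add_argument("--learning_rate", type=float, default=0.01)     # assumed default
    parser.add_argument("--train_batch_size", type=int, default=512)     # assumed default
    parser.add_argument("--model", default="cnn_batch_norm")             # assumed default
    # parse_known_args returns undeclared flags (such as --job-dir) separately
    # instead of exiting with an error.
    args, unknown = parser.parse_known_args(argv)
    return vars(args), unknown

hparams, extra = parse_trainer_args([
    "--output_dir=gs://my-bucket/fashion/trained_cnn_batch_norm",  # placeholder path
    "--train_steps=2395",
    "--learning_rate=0.0029",
    "--job-dir=gs://my-bucket/fashion/trained_cnn_batch_norm",
])
print(hparams["train_steps"], hparams["learning_rate"], extra)
```

Declaring explicit types on each flag means a mistyped value, for example a non-numeric `--train_steps`, fails at startup instead of partway through training.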

Once the job is queued for execution on Cloud ML Engine, you should see output similar to the following:
<pre>
state: QUEUED
CommandException: 1 files/objects could not be removed.
Job [fashion_cnn_181125_182110] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe fashion_cnn_181125_182110

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs fashion_cnn_181125_182110
</pre>

Don't worry if you see a message about files/objects that could not be removed. This message occurs because the `gsutil rm` command tries to remove the output directory for trained model checkpoint files, and that directory does not exist yet if no earlier job has written to it.
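
The job name in the sample output follows the pattern set in the submit cell: a `fashion_` prefix, the model type, and a UTC timestamp, so that each submission gets a distinct name. The same pattern can be sketched in Python as follows; the helper name `make_job_name` is hypothetical, and the format string mirrors `date -u +%y%m%d_%H%M%S` from the cell.

```python
from datetime import datetime, timezone

def make_job_name(model_type, now=None):
    """Build a job name like fashion_<model>_<yymmdd>_<HHMMSS> using UTC time."""
    now = now or datetime.now(timezone.utc)
    return "fashion_{}_{}".format(model_type, now.strftime("%y%m%d_%H%M%S"))

# A fixed timestamp reproduces the name from the sample output.
print(make_job_name("cnn", datetime(2018, 11, 25, 18, 21, 10)))
# → fashion_cnn_181125_182110
```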

To monitor the progress of the job from the GCP user interface, navigate to the [Jobs](https://console.cloud.google.com/mlengine/jobs) page of the Cloud ML Engine service. Use the "View Logs" link to see the job details, or monitor training using TensorBoard.
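
As an alternative to the console, the job state can be polled from the notebook by wrapping the `gcloud ml-engine jobs describe` command shown in the sample output. The sketch below is one hedged way to do that: the helper names (`describe_command`, `fetch_state`, `wait_for_job`) are hypothetical, and the set of terminal states is an assumption based on the states Cloud ML Engine reports for training jobs.

```python
import subprocess
import time

# Assumed terminal states; any other state (QUEUED, PREPARING, RUNNING, ...)
# is treated as "still in progress".
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def describe_command(job_name):
    """Build the gcloud command that prints only the job's state."""
    return ["gcloud", "ml-engine", "jobs", "describe", job_name,
            "--format=value(state)"]

def fetch_state(job_name):
    """Run gcloud and return the state string (requires an authenticated gcloud)."""
    return subprocess.check_output(describe_command(job_name)).decode().strip()

def wait_for_job(job_name, get_state=fetch_state, poll_seconds=60, sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return that state."""
    while True:
        state = get_state(job_name)
        print(job_name, state)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

Calling `wait_for_job` with the job name printed by the submit cell blocks until the job finishes. The `get_state` and `sleep` parameters exist so the loop can be exercised without calling `gcloud`.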

## Monitoring training with TensorBoard
Notice that TensorBoard is now configured to look in Google Cloud Storage for the model checkpoint files. Run the next cell to launch TensorBoard. The cell also installs localtunnel and prints a public URL that forwards to the TensorBoard port.

In [0]:
!npm install -g localtunnel
start_tensorboard('gs://{}/fashion/trained_{}'.format(BUCKET, MODEL_TYPE), 'url')

In [0]:
%sx read -p 'Press Enter in the input box to stop TensorBoard '
stop_tensorboard('url')
print("Stopped")

<pre>
# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
</pre>