# Enable Virtual Environment For This Notebook

First, change to the directory that contains the virtual environment.

<b>`$ cd "/media/mujahid7292/Data/GoogleDriveSandCorp2014/ML_With_TensorFlow_On_GCP/06. End_To_End_ML_With_TensorFlow_On_GCP/Week_3/Lab_5_Training_On_Cloud_AI_Platform/Practice"`</b>

### Deactivate conda environment

<b>`$ conda deactivate`</b>

### Activate newly created virtual environment

<b>`$ source Venv/bin/activate`</b>

## <p style='color:red'>Then we will open this notebook from inside our virtual environment.</p>

# Notebook <a href="https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/06_structured/5_train.ipynb">Link</a>

# Import the necessary Python packages

In [1]:
import os
import shutil
import argparse
import json
import tensorflow as tf
print('Tensorflow Version: {}'.format(tf.__version__))


Tensorflow Version: 1.8.0


# Python Variables

In [2]:
# change these to try this notebook out
ACCOUNT = 'student-04-972b7cf5493d@qwiklabs.net'
SAC = 'jupyter-notebook-sac-e'
SAC_KEY_DESTINATION = '/media/mujahid7292/Data/Gcloud_Tem_SAC'
BUCKET = 'bucket-qwiklabs-gcp-04-e6818aa80ab4'
PROJECT = 'qwiklabs-gcp-04-e6818aa80ab4'
REGION = 'us-central1'
TFVERSION = '1.8'  # AI Platform runtime version; matches the TensorFlow 1.8.0 printed above

# Bash Variables

In [3]:
os.environ['ACCOUNT'] = ACCOUNT
os.environ['SAC'] = SAC
os.environ['SAC_KEY_DESTINATION'] = SAC_KEY_DESTINATION
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = TFVERSION

# Set Google Application Credentials

In [4]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='{}/{}.json'.format(SAC_KEY_DESTINATION,SAC)
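The cells above rely on the fact that values written to `os.environ` are inherited by child processes, which is how the `%%bash` cells can see `$BUCKET` or `GOOGLE_APPLICATION_CREDENTIALS`. Here is a minimal sketch of that behaviour. The variable name `DEMO_BUCKET` is made up for this illustration, and a child Python process stands in for the bash cell.

```python
import os
import subprocess
import sys


def read_in_child(name):
    """Return the value of environment variable `name` as seen by a child process."""
    child_code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))"
    result = subprocess.run(
        [sys.executable, "-c", child_code, name],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode()


# DEMO_BUCKET is a hypothetical name used only for this illustration.
os.environ["DEMO_BUCKET"] = "my-demo-bucket"
print(read_in_child("DEMO_BUCKET"))
```

Anything set this way lasts only for the life of the notebook kernel, so the variable cells have to be re-run after a kernel restart.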

Check Whether Google Application Credential Was Set Successfully Outside Virtual Environment

In [5]:
%%bash
set | grep GOOGLE_APPLICATION_CREDENTIALS 

GOOGLE_APPLICATION_CREDENTIALS=/media/mujahid7292/Data/Gcloud_Tem_SAC/jupyter-notebook-sac-e.json


# Set Default Project And Region

In [6]:
%%bash
gcloud config set account $ACCOUNT
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/account].
Updated property [core/project].
Updated property [compute/region].


## Create GCS bucket & copy training & evaluation data to bucket

In [None]:
%%bash
# Create the bucket only if it does not exist yet
if ! gsutil ls -b gs://${BUCKET} > /dev/null 2>&1; then
  gsutil mb -l ${REGION} -p ${PROJECT} gs://${BUCKET}
fi
# Copy the canonical set of preprocessed files if you didn't do the previous notebook
if ! gsutil ls gs://${BUCKET}/babyweight/preproc > /dev/null 2>&1; then
  gsutil -m cp -R gs://cloud-training-demos/babyweight gs://${BUCKET}
fi

In [7]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/preproc/*-00000*

Process is terminated.


Now that we have the TensorFlow code working on a subset of the data, we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.
<p>
<h2> Train on Cloud AI Platform</h2>
<p>
Training on Cloud AI Platform requires:
<ol>
<li> Making the code a Python package
<li> Using gcloud to submit the training code to Cloud AI Platform
</ol>

Ensure that the AI Platform API is enabled by going to this [link](https://console.developers.google.com/apis/library/ml.googleapis.com).

## Lab Task 1

The following code writes babyweight/trainer/task.py.

In [8]:
%%bash
rm -rf babyweight
mkdir babyweight
mkdir babyweight/trainer
touch babyweight/trainer/__init__.py

In [9]:
%%writefile babyweight/trainer/task.py
import argparse
import json
import os

from . import model
import tensorflow as tf

if __name__ == '__main__':
    # Create an argument parser object
    parser = argparse.ArgumentParser()
    
    # Now add our arguments one by one
    parser.add_argument(
        '--data_path',
        help = "GCS or Local path to training & evaluation data.",
        required = True
    )
    
    parser.add_argument(
        '--output_dir',
        help = "GCS or Local location to write checkpoints and export models",
        required = True
    )
    
    parser.add_argument(
        '--batch_size',
        help = "Number of examples to compute gradient over",
        type = int,
        default = 512
    )
    
    parser.add_argument(
        '--job-dir',
        help = "This model ignores this field, but it is required by gcloud",
        default = "junk"
    )
    
    parser.add_argument(
        '--nn_size',
        help = "Hidden layer size for DNN",
        nargs='+',
        type = int,
        default=[128,32,4]
    )
    
    parser.add_argument(
        '--nembeds',
        help = "Embedding size of a cross of n key real-valued parameters",
        type = int,
        default = 3
    )
    
    parser.add_argument(
        '--train_examples',
        help = "Number of examples (In thousands) to run the training job over. If this is more than actual # of examples available, it cycles through them. So specifying 1000 here when you have only 100k examples makes this 10 epochs.",
        type = int,
        default = 5000
    )
    
    parser.add_argument(
        '--pattern',
        help = "Specify a pattern that has to be in the input file names. For example 00001-of will process only one shard.",
        default = "of"
    )
    
    parser.add_argument(
        '--eval_steps',
        help = "Positive number of steps for which to evaluate the model. Defaults to None, which means to evaluate until input_fn raises an end-of-input exception",
        type = int,
        default = None
    )
    
    ## Parse all arguments
    args = parser.parse_args()
    arguments = args.__dict__
    
    ## Unused args provided by service
    arguments.pop('job-dir', None)
    arguments.pop('job_dir', None)
    
    ## Assign the arguments to the model variable
    output_dir = arguments.pop('output_dir')
    model.DATA_PATH = arguments.pop('data_path')
    model.BATCH_SIZE = arguments.pop('batch_size')
    # Integer division: the step count must be a whole number
    model.TRAIN_STEPS = (arguments.pop('train_examples') * 1000) // model.BATCH_SIZE
    model.EVAL_STEPS = arguments.pop('eval_steps')
    print("We will train for {} steps using batch size {}".format(model.TRAIN_STEPS, model.BATCH_SIZE))
    model.PATTERN = arguments.pop('pattern')
    model.NEMBEDS = arguments.pop('nembeds')
    model.NN_SIZE = arguments.pop('nn_size')
    print("We will use DNN size of {}".format(model.NN_SIZE))
    
    
    ## Append trial id to path if we are doing hyperparameter tuning
    ## This code can be removed if you are not doing hyperparameter tuning
    output_dir = os.path.join(
        output_dir,
        json.loads(
            os.environ.get('TF_CONFIG', '{}')
        ).get('task', {}).get('trial', '')
    )
    
    ## Run the training job
    model.train_and_evaluate(output_dir)

Writing babyweight/trainer/task.py
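The `--train_examples` help text packs a bit of arithmetic into one sentence, so here is a small sketch of it. The function names are my own and not part of the trainer package. It uses integer division so that the result can be used as a step count.

```python
def training_steps(train_examples_thousands, batch_size):
    """Number of whole gradient steps needed to see the requested examples."""
    return (train_examples_thousands * 1000) // batch_size


def approx_epochs(train_examples_thousands, dataset_size):
    """How many passes over a dataset of `dataset_size` rows this amounts to."""
    return (train_examples_thousands * 1000) / dataset_size


print(training_steps(5000, 512))      # the parser defaults
print(approx_epochs(1000, 100000))    # the example from the help text
```

With the defaults this gives 9765 steps, and the help-text example (1000 thousand examples over 100k rows) works out to 10 epochs.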


## Lab Task 2

The following code writes babyweight/trainer/model.py.

In [10]:
%%writefile babyweight/trainer/model.py
import shutil
import numpy as np
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

DATA_PATH = None # Set from task.py file
PATTERN = 'of' # gets all files

# Determine CSV, Labels & key column
CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks,key'.split(',')
LABEL_COLUMN = 'weight_pounds'
KEY_COLUMN = 'key'

# Set defaults value for each CSV columns
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], ['nokey']]

# Define some hyper parameter
TRAIN_STEPS = 10000
EVAL_STEPS = None
BATCH_SIZE = 512
NEMBEDS = 3
NN_SIZE = [64, 16, 4]

# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
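def read_dataset(prefix, mode, batch_size):
    """Return an input function that reads the CSV shards for `prefix`.

    prefix is 'train' or 'eval'; mode is a tf.estimator.ModeKeys value.
    """
    def _input_fn():
        """Build the dataset and return the next (features, label) batch."""
        def decode_csv(value_column):
            """Parse one CSV line into a feature dict and its label."""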
            columns = tf.decode_csv(records=value_column, record_defaults=DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))
            label = features.pop(LABEL_COLUMN)
            return features, label
        
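        # Use prefix to create file path. The wildcards let PATTERN match anywhere
        # in the file name (assumes the shards sit in a per-split directory,
        # as in data/babyweight/preproc/train).
        file_path = '{}/babyweight/preproc/{}/*{}*'.format(DATA_PATH, prefix, PATTERN)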
        
        # Create list of files, that matches pattern.
        file_list = tf.gfile.Glob(file_path)
        
        # Create dataset from file list
        dataset = (tf.data.TextLineDataset(filenames=file_list) # Read text file
                   .map(decode_csv)) # Transform each element by applying decode_csv
        
        if mode == tf.estimator.ModeKeys.TRAIN:
            num_epochs = None # Run indefinitely
            dataset = dataset.shuffle(buffer_size=10*batch_size)
        else:
            num_epochs = 1 # end-of-input after this
        
        dataset = dataset.repeat(num_epochs).batch(batch_size)
        return dataset.make_one_shot_iterator().get_next()
    return _input_fn

# Define features column
def get_wide_deep():
    """Return the (wide, deep) feature column lists for the model."""
    # Define column type
    is_male,mother_age,plurality,gestation_weeks= \
    [\
     tf.feature_column.categorical_column_with_vocabulary_list(
        key='is_male',
        vocabulary_list=['True', 'False', 'Unknown']
    ),
     tf.feature_column.numeric_column('mother_age'),
     tf.feature_column.categorical_column_with_vocabulary_list(
         key='plurality',
         vocabulary_list=['Single(1)', 'Twins(2)', 'Triplets(3)',
                         'Quadruplets(4)', 'Quintuplets(5)','Multiple(2+)']
     ),
     tf.feature_column.numeric_column('gestation_weeks')
    ]
    
    # Discretize the age and gestation-weeks columns so that they can also
    # be used as wide (sparse) columns.
    age_buckets = tf.feature_column.bucketized_column(
        source_column=mother_age,
        boundaries=np.arange(start=15, stop=45, step=1).tolist()
    )
    gestation_buckets = tf.feature_column.bucketized_column(
        source_column=gestation_weeks,
        boundaries=np.arange(start=17, stop=47, step=1).tolist()
    )
    
    # Sparse columns are wide, have a linear relationship with the output
    wide = [
        is_male,
        plurality,
        age_buckets,
        gestation_buckets
    ]
    
    # Feature cross all the wide column and embed into lower dimension.
    crossed = tf.feature_column.crossed_column(
        keys=wide,
        hash_bucket_size=20000
    )
    embed = tf.feature_column.embedding_column(
        categorical_column=crossed,
        dimension=NEMBEDS
    )
    
    # Continuous columns are deep and have a complex relationship with the output.
    deep = [
        mother_age,
        gestation_weeks,
        embed
    ]
    
    return wide, deep

# Create serving input function to serve prediction later
def serving_input_fn():
    """Describe the inputs the exported model accepts at prediction time."""
    feature_placeholders = {
        'is_male': tf.placeholder(dtype=tf.string, shape=[None]),
        'mother_age': tf.placeholder(dtype=tf.float32, shape=[None]),
        'plurality': tf.placeholder(dtype=tf.string, shape=[None]),
        'gestation_weeks': tf.placeholder(dtype=tf.float32, shape=[None]),
        KEY_COLUMN: tf.placeholder_with_default(input=tf.constant(['nokey']), shape=[None])
    }
    
    features = {
        key: tf.expand_dims(input=tensor, axis=-1)
        for key, tensor in feature_placeholders.items()
    }
    
    return tf.estimator.export.ServingInputReceiver(
        features=features,
        receiver_tensors=feature_placeholders
    )

# Create metric for hyperparameter tuning
def my_rmse(labels, predictions):
    """Return the RMSE as an extra evaluation metric."""
    pred_values = predictions['predictions']
    return {
        'rmse': tf.metrics.root_mean_squared_error(
            labels=labels,
            predictions=pred_values
        )
    }

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
    """Train the wide-and-deep regressor, evaluating and exporting as it goes."""
    # Ensure Filewriter Cache is clear for Tensorboard events file.
    tf.summary.FileWriterCache.clear()
    wide, deep = get_wide_deep()
    EVAL_INTERVAl = 300 # Seconds
    
    run_config = tf.estimator.RunConfig(
        save_checkpoints_secs=EVAL_INTERVAl,
        keep_checkpoint_max=3
    )
    
    estimator = tf.estimator.DNNLinearCombinedRegressor(
        model_dir=output_dir,
        linear_feature_columns=wide,
        dnn_feature_columns=deep,
        dnn_hidden_units=NN_SIZE,
        config=run_config
    )
    
    # Attach custom metric to the estimator.
    estimator = tf.contrib.estimator.add_metrics(
        estimator=estimator, 
        metric_fn=my_rmse
    )
    
    # For batch prediction, you need a key associated with each instance
    estimator = tf.contrib.estimator.forward_features(
        estimator=estimator,
        keys=KEY_COLUMN
    )
    
    train_spec = tf.estimator.TrainSpec(
        input_fn = read_dataset('train',tf.estimator.ModeKeys.TRAIN, BATCH_SIZE),
        max_steps = TRAIN_STEPS
    )
    
    exporter = tf.estimator.LatestExporter(
        name='exporter',
        serving_input_receiver_fn=serving_input_fn,
        exports_to_keep=None
    )
    
    eval_spec = tf.estimator.EvalSpec(
        input_fn=read_dataset('eval', tf.estimator.ModeKeys.EVAL, 2**15), # No need to batch in eval
        steps = EVAL_STEPS,
        start_delay_secs=60, # Start evaluating after N seconds
        throttle_secs=EVAL_INTERVAl, # Evaluate every N seconds
        exporters=exporter
    )
    
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Writing babyweight/trainer/model.py
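To make the data flow in model.py easier to follow without TensorFlow, here is a plain-Python sketch of two of its steps: splitting a CSV line into features and a label (as `decode_csv` does) and finding the bucket index for a bucketized column. It is only an illustration under assumptions. The sample row is made up, handling of missing values via `DEFAULTS` is not reproduced, and the bucket rule (a value equal to a boundary falls into the higher bucket) is my reading of how `bucketized_column` behaves.

```python
import bisect
import csv

CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks,key'.split(',')
LABEL_COLUMN = 'weight_pounds'
NUMERIC_COLUMNS = {'weight_pounds', 'mother_age', 'gestation_weeks'}


def decode_line(line):
    """Split one CSV line into a feature dict and a label, like decode_csv."""
    values = next(csv.reader([line]))
    features = dict(zip(CSV_COLUMNS, values))
    for name in NUMERIC_COLUMNS:
        features[name] = float(features[name])
    label = features.pop(LABEL_COLUMN)
    return features, label


def bucket_index(value, boundaries):
    """Index of the bucket that `value` falls into, given sorted boundaries."""
    return bisect.bisect_right(boundaries, value)


# A made-up row, used only for illustration.
features, label = decode_line('7.5,True,26,Single(1),39,b1')
age_boundaries = list(range(15, 45))  # same values as np.arange(15, 45)
print(label, features['is_male'], bucket_index(features['mother_age'], age_boundaries))
```

Thirty boundaries give 31 buckets: index 0 holds everything below 15 and index 30 holds everything from 44 upwards.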


## Lab Task 3

After moving the code to a package, make sure it works standalone. (Note the --pattern and --train_examples lines, which restrict the run to one shard and 1,000 examples so that I am not trying to boil the ocean on my laptop.) Even then, this takes about <b>3 minutes</b>, during which you won't see any output ...

In [11]:
%%bash
ls data/babyweight/preproc/

eval
train


In [None]:
%%bash
rm -rf babyweight_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python -m trainer.task \
    --data_path=data \
    --output_dir=babyweight_trained \
    --job-dir=./tmp \
    --pattern='00000-of-' \
    --train_examples=1 \
    --eval_steps=1
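The local run works because the parent directory of `trainer` is on `PYTHONPATH` and the module is started with `python -m`, which is what allows the relative import `from . import model` in task.py. Here is a self-contained sketch of the same mechanism. The package name `demo_trainer` and its contents are hypothetical and exist only in a temporary directory.

```python
import os
import subprocess
import sys
import tempfile
import textwrap


def run_demo_package():
    """Build a throwaway package and run its task module with `python -m`."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = os.path.join(root, "demo_trainer")  # hypothetical package name
        os.makedirs(pkg_dir)
        # An empty __init__.py marks the directory as a package.
        with open(os.path.join(pkg_dir, "__init__.py"), "w"):
            pass
        with open(os.path.join(pkg_dir, "model.py"), "w") as f:
            f.write("GREETING = 'hello from model'\n")
        with open(os.path.join(pkg_dir, "task.py"), "w") as f:
            f.write(textwrap.dedent("""\
                from . import model

                if __name__ == '__main__':
                    print(model.GREETING)
            """))
        # Put the parent directory on PYTHONPATH, then run the module by name.
        env = dict(os.environ, PYTHONPATH=root)
        result = subprocess.run(
            [sys.executable, "-m", "demo_trainer.task"],
            env=env, stdout=subprocess.PIPE, check=True,
        )
        return result.stdout.decode().strip()


print(run_demo_package())
```

Running the file directly as a script instead of with `-m` would make the relative import fail, because the file would no longer be loaded as part of a package.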

## Lab Task 4

The JSON below represents an input into your prediction model. Write the inputs.json file with the next cell, then run the prediction locally to check that it produces predictions correctly.

In [None]:
%%writefile inputs.json
{"key": "b1", "is_male": "True", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39}
{"key": "g1", "is_male": "False", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39}

In [None]:
%%bash
sudo find "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine" -name '*.pyc' -delete

In [None]:
%%bash
MODEL_LOCATION=$(ls -d $(pwd)/babyweight_trained/export/exporter/* | tail -1)
echo $MODEL_LOCATION
gcloud ai-platform local predict --model-dir=$MODEL_LOCATION --json-instances=inputs.json
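If the local prediction complains about its input, a quick check of the instances file can save some time. The sketch below is one way to do it. The helper name `check_instances` is my own, and the required feature names are taken from the placeholders in `serving_input_fn` (the key has a default there, so I treat it as optional).

```python
import json

# Feature names taken from the serving_input_fn placeholders; "key" is optional there.
REQUIRED = {"is_male", "mother_age", "plurality", "gestation_weeks"}


def check_instances(text):
    """Parse newline-delimited JSON and return the instances, or raise ValueError."""
    instances = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        instance = json.loads(line)
        missing = REQUIRED - set(instance)
        if missing:
            raise ValueError("line {}: missing {}".format(number, sorted(missing)))
        instances.append(instance)
    return instances


sample = '''{"key": "b1", "is_male": "True", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39}
{"key": "g1", "is_male": "False", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39}'''
print(len(check_instances(sample)))
```

To check the real file, pass `open('inputs.json').read()` instead of `sample`.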

## Lab Task 5

Once the code works in standalone mode, you can run it on Cloud AI Platform. Because this is on the entire dataset, it will take a while. The training run took about <b> two hours </b> for me. You can monitor the job from the GCP console in the Cloud AI Platform section.

In [12]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/babyweight/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  -- \
  --data_path=gs://${BUCKET} \
  --output_dir=${OUTDIR} \
  --train_examples=500

Process is interrupted.


When I ran it, I used train_examples=2000000. When training finished, I filtered in the Stackdriver log on the word "dict" and saw that the last line was:
<pre>
Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186
</pre>
The final RMSE was 1.03 pounds.
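The two numbers in that log line are consistent with each other, assuming `average_loss` is the mean squared error per example (which is my understanding of the default regression loss here): the RMSE is simply its square root.

```python
import math

average_loss = 1.06473   # taken from the log line above
rmse = math.sqrt(average_loss)
print(round(rmse, 3))
```

This is a handy sanity check when a custom metric such as `rmse` is reported next to the built-in loss.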

<h2> Hyperparameter tuning </h2>
<p>
All of these are command-line parameters to my program.  To do hyperparameter tuning, create hyperparam.yaml and pass it as --config.
This step will take <b>up to 2 hours</b> -- you can increase maxParallelTrials or reduce maxTrials to get it done faster.  Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.
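During tuning, each trial should write to its own directory, which is why task.py appends the trial id found in the `TF_CONFIG` environment variable to `output_dir`. Here is a small sketch of that logic as a standalone function. The function name and the example `TF_CONFIG` value are made up for illustration, and the real value is set by the service.

```python
import json
import os


def trial_output_dir(output_dir, tf_config_json):
    """Append the trial id from a TF_CONFIG-style JSON string, if there is one."""
    trial = json.loads(tf_config_json or "{}").get("task", {}).get("trial", "")
    return os.path.join(output_dir, trial)


# Made-up TF_CONFIG value for illustration only.
example_config = json.dumps({"task": {"type": "master", "index": 0, "trial": "7"}})
print(trial_output_dir("gs://my-bucket/babyweight/hyperparam", example_config))
print(trial_output_dir("babyweight_trained", ""))
```

Outside of tuning there is no trial id, so the function falls back to the plain output directory (with a trailing separator).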


In [None]:
%%writefile hyperparam.yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    hyperparameterMetricTag: rmse
    goal: MINIMIZE
    maxTrials: 20
    maxParallelTrials: 5
    enableTrialEarlyStopping: True
    params:
    - parameterName: batch_size
      type: INTEGER
      minValue: 8
      maxValue: 512
      scaleType: UNIT_LOG_SCALE
    - parameterName: nembeds
      type: INTEGER
      minValue: 3
      maxValue: 30
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: nn_size
      type: INTEGER
      minValue: 64
      maxValue: 512
      scaleType: UNIT_LOG_SCALE

In [None]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hyperparam
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/babyweight/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --config=hyperparam.yaml \
  --runtime-version=$TFVERSION \
  -- \
  --data_path=gs://${BUCKET} \
  --output_dir=${OUTDIR} \
  --eval_steps=10 \
  --train_examples=20000

<h2> Repeat training </h2>
<p>
This time with tuned parameters (note last line)

In [None]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/babyweight/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  -- \
  --data_path=gs://${BUCKET} \
  --output_dir=${OUTDIR} \
  --train_examples=20000 --batch_size=35 --nembeds=16 --nn_size=281