# Cloud ML Engine Notes

This notebook goes over the steps needed to set up the cloud, some tutorials, and how to submit your own trainer and job.

Please note that these notes mostly follow the how-to guides at https://cloud.google.com/ml-engine/docs/how-tos/, with some additional material. Those guides go into more detail on some of the concepts.


Development considerations: https://cloud.google.com/ml-engine/docs/concepts/environment-overview

## 1. Setting up the Cloud SDK

The first step is to set up the Cloud SDK, which can be downloaded for Mac, Linux or Windows from https://cloud.google.com/sdk/.

Please note that gcloud requires Python 2.7. However, TensorFlow on Windows requires Python 3.5, so the environment should have this Python version configured. If the newer Python 3.6 is used, installing TensorFlow produces an incompatible wheel error. One option is to write code that runs on both Python 2.7 and Python 3.5.

An alternative is to use the Cloud Shell on the Google Cloud Platform, which has the SDK preinstalled. However, it has limited capabilities, such as only 5GB of space. Another option is to set up and work on a VM in the cloud, but that comes at a cost.

Once the SDK has been set up, four steps follow:

1. Initializing gcloud
2. Installing tensorflow
3. Authentication
4. Test tensorflow

Instructions for these steps can be found at https://cloud.google.com/ml-engine/docs/quickstarts/command-line.
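As a rough sketch, the four steps above could look like the following. The commands are wrapped in a function (the name `setup_cloud_sdk` is made up for these notes) so that nothing interactive starts until it is called. The exact install command for TensorFlow may differ on your platform, so treat this as an outline and follow the quickstart for details.

```shell
setup_cloud_sdk() {
  gcloud init &&                              # 1. initialize gcloud (interactive)
  pip install tensorflow &&                   # 2. install TensorFlow
  gcloud auth application-default login &&    # 3. authenticate (opens a browser)
  python -c "import tensorflow as tf; print(tf.__version__)"   # 4. test TensorFlow
}
```

Calling `setup_cloud_sdk` runs the steps in order and stops at the first one that fails.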

On Windows, Bash on Ubuntu is needed to run the shell commands used in these notes.

## 2. Training a sample dataset

The dataset considered is the United States Census Income dataset. The task is to construct a model that predicts the income category. The tutorial is found at https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction.

The task here is not the construction of the model. The model is already prebuilt and provided by the `DNNLinearCombinedClassifier` class. The point is to get used to running jobs in the cloud. Therefore, we only pass the dataset inputs to this prebuilt combined linear and DNN model.

As explained in the tutorial, this step will show you how to:

1. Create a tensorflow trainer and validate it locally.
2. Run the trainer on a single cloud worker.
3. Run it on a cloud distributed system.
4. Deploy a model to support prediction.
5. Request an online prediction and see the response.
6. Request a batch prediction.

The dataset files can be found at https://github.com/GoogleCloudPlatform/cloudml-samples/archive/master.zip. These should be downloaded and extracted, and the current directory changed to cloudml-samples-master/census/estimator.

The training data can be copied from the cloud to a *data* folder in the *estimator* directory:
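A minimal sketch of the copy step, assuming the data lives in the public bucket `gs://cloudml-public/census/data/` used by the tutorial (check the tutorial if the location has changed). The function name `copy_census_data` is made up for these notes.

```shell
# Create the local data folder inside the estimator directory.
mkdir -p data

# Copy every file from the (assumed) public census bucket into data/.
copy_census_data() {
  gsutil -m cp "gs://cloudml-public/census/data/*" data/
}
```

Run `copy_census_data` from the *estimator* directory to download the files.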

Furthermore, paths need to be set to these directories:
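For example, assuming the copied files are named `adult.data.csv` and `adult.test.csv` (the names used by the census sample; adjust them if your copy differs):

```shell
# Absolute paths to the local training and evaluation files.
TRAIN_DATA="$(pwd)/data/adult.data.csv"
EVAL_DATA="$(pwd)/data/adult.test.csv"
```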

### 2.1 Running a local trainer

The first step prior to submitting a job to the cloud is to test the job locally. This avoids the additional costs of running the job multiple times in the cloud just to debug it. The `gcloud ml-engine` tool provides a local process that emulates the cloud and hence is a good testing platform.

The outputs of the job need to be saved to a specific folder. It is generally good practice that this folder is empty as well:
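A small sketch, assuming the folder is called `output` (any name works):

```shell
# Output folder for the local job.
MODEL_DIR=output
mkdir -p "$MODEL_DIR"

# Empty it; the :? guard aborts instead of expanding to /* if MODEL_DIR is unset.
rm -rf "${MODEL_DIR:?}"/*
```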

Finally, the job can be submitted:
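A sketch of the local submission, assuming `TRAIN_DATA`, `EVAL_DATA` and `MODEL_DIR` are already set. The flags after the bare `--` are passed to the sample trainer itself, and their names and values here are assumptions based on the census sample. The function name `train_local` is made up for these notes.

```shell
train_local() {
  gcloud ml-engine local train \
    --module-name trainer.task \
    --package-path trainer/ \
    -- \
    --train-files "${TRAIN_DATA:?set TRAIN_DATA first}" \
    --eval-files "${EVAL_DATA:?set EVAL_DATA first}" \
    --train-steps 1000 \
    --eval-steps 100 \
    --job-dir "${MODEL_DIR:?set MODEL_DIR first}"
}
```

Calling `train_local` from the *estimator* directory starts the job.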

The key word *local* denotes that the job is performed on the host and not in the cloud. The module name gives the location of the task, and the package path gives the location of the package.

Finally the output can be visualised using tensorboard:
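For instance, with `MODEL_DIR` still pointing at the output folder (the function name `view_local_results` is made up for these notes; TensorBoard keeps running until it is interrupted):

```shell
view_local_results() {
  tensorboard --logdir="${MODEL_DIR:?set MODEL_DIR first}"
}
```

After calling `view_local_results`, open the URL that TensorBoard prints in a browser.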

### 2.2 Running a distributed local trainer

Similar to the local trainer, the engine provides a process to emulate workers in a distributed fashion. Once again this can be used as a testing platform for the cloud.

A new output directory can be specified:
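For example, assuming the name `output-dist` to keep it separate from the single-worker output:

```shell
# Separate output folder for the distributed local run.
MODEL_DIR=output-dist
```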

Finally, the job in distributed mode can be submitted:
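A sketch of the distributed local run, assuming `TRAIN_DATA`, `EVAL_DATA` and `MODEL_DIR` are set. As before, the trainer flags after the bare `--` are assumptions based on the census sample, and the function name `train_local_distributed` is made up for these notes.

```shell
train_local_distributed() {
  gcloud ml-engine local train \
    --module-name trainer.task \
    --package-path trainer/ \
    --distributed \
    -- \
    --train-files "${TRAIN_DATA:?set TRAIN_DATA first}" \
    --eval-files "${EVAL_DATA:?set EVAL_DATA first}" \
    --train-steps 1000 \
    --eval-steps 100 \
    --job-dir "${MODEL_DIR:?set MODEL_DIR first}"
}
```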

The arguments here are similar to those of the single-worker local process. However, the flag `--distributed` is added to instruct the engine to perform the job in distributed mode.

Once again the output can be found in $MODEL_DIR and can be analysed using tensorboard.

### 2.3 Setting up for the Cloud

Running the job in the cloud is similar to running locally, with some minor differences. One of the differences is that the data needs to be stored in Google Cloud Storage.

Hence, a bucket in the same location as the configured project needs to be created and the data uploaded to it.

First, the project ID is defined together with a bucket name:
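One possible way to do this is to ask gcloud for the active project and derive the bucket name from it. The `-mlengine` suffix is only a naming convention assumed here, and the function name `set_bucket_name` is made up for these notes. Bucket names must be globally unique, so pick another name if this one is taken.

```shell
set_bucket_name() {
  # Read the project ID from the active gcloud configuration.
  PROJECT_ID=$(gcloud config list project --format "value(core.project)") || return 1
  # Derive a bucket name from it (assumed convention).
  BUCKET_NAME="${PROJECT_ID}-mlengine"
}
```

After `set_bucket_name`, both `PROJECT_ID` and `BUCKET_NAME` are available in the current shell.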

The region is specified:
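For example (`us-central1` is only an assumed choice; use a region where the service is available to you):

```shell
REGION=us-central1
```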

The new bucket is created by:
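A sketch using `gsutil mb` (make bucket), assuming `REGION` and `BUCKET_NAME` are set. The function name `create_bucket` is made up for these notes.

```shell
create_bucket() {
  gsutil mb -l "${REGION:?set REGION first}" "gs://${BUCKET_NAME:?set BUCKET_NAME first}"
}
```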

Finally, the data can be uploaded to the bucket and the `TRAIN_DATA` and `EVAL_DATA` updated appropriately.
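A sketch of the upload, assuming the local *data* folder from earlier and the file names `adult.data.csv` and `adult.test.csv`. The function name `upload_census_data` is made up for these notes.

```shell
upload_census_data() {
  # Copy the local data folder into the bucket.
  gsutil cp -r data "gs://${BUCKET_NAME:?set BUCKET_NAME first}/data" || return 1
  # Point the data variables at the copies in Cloud Storage.
  TRAIN_DATA="gs://$BUCKET_NAME/data/adult.data.csv"
  EVAL_DATA="gs://$BUCKET_NAME/data/adult.test.csv"
}
```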

Furthermore, it is important to also upload the JSON test file containing the test data, and to set a variable `TEST_JSON` that points at the file.
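A sketch, assuming the file is called `test.json` and sits one level above the *estimator* directory in the extracted samples (verify the path in your copy). The function name `upload_test_json` is made up for these notes.

```shell
upload_test_json() {
  gsutil cp ../test.json "gs://${BUCKET_NAME:?set BUCKET_NAME first}/data/test.json" || return 1
  TEST_JSON="gs://$BUCKET_NAME/data/test.json"
}
```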

### 2.4 Training a single worker in the cloud

Once the files are uploaded to the cloud, a training job can be submitted to the cloud. Please note that computation now comes at a cost. One can choose the amount of resources used by selecting a scale tier. In this tutorial the `BASIC` tier is used to avoid expenses.

Since the `TRAIN_DATA`, `EVAL_DATA`, `REGION` and `BUCKET_NAME` variables are still defined from before, only the `JOB_NAME` and `OUTPUT_PATH` need to be defined. If you're starting afresh, initialize the earlier variables too, as they are needed.
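For example, with an assumed job name of `census_single_1` (job names must be unique within a project, so change the suffix on each run):

```shell
# Reuse the bucket from section 2.3, or fall back to a placeholder when
# starting afresh ("your-bucket-name" is hypothetical; replace it).
BUCKET_NAME="${BUCKET_NAME:-your-bucket-name}"

JOB_NAME=census_single_1
OUTPUT_PATH="gs://$BUCKET_NAME/$JOB_NAME"
```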

Finally, the job can be run in the cloud by:
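A sketch of the submission, assuming all variables above are set. The runtime version `1.2` is an assumed value, the trainer flags after the bare `--` are assumptions based on the census sample, and the function name `submit_single_worker_job` is made up for these notes. Wrapping the command in a function means nothing is billed until it is called.

```shell
RUNTIME_VERSION=1.2   # assumed; use the runtime version your trainer targets

submit_single_worker_job() {
  gcloud ml-engine jobs submit training "${JOB_NAME:?set JOB_NAME first}" \
    --job-dir "${OUTPUT_PATH:?set OUTPUT_PATH first}" \
    --runtime-version "$RUNTIME_VERSION" \
    --scale-tier BASIC \
    --module-name trainer.task \
    --package-path trainer/ \
    --region "${REGION:?set REGION first}" \
    -- \
    --train-files "${TRAIN_DATA:?set TRAIN_DATA first}" \
    --eval-files "${EVAL_DATA:?set EVAL_DATA first}" \
    --train-steps 1000 \
    --verbosity DEBUG
}
```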

Please note that this time there is no *local* keyword; instead there is *submit training $JOB_NAME*. The region and runtime version now need to be specified.

Once again, the data can be analysed using tensorboard:
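For instance, TensorBoard can be pointed at the job's output path in Cloud Storage (the function name `view_single_worker_results` is made up for these notes):

```shell
view_single_worker_results() {
  tensorboard --logdir="${OUTPUT_PATH:?set OUTPUT_PATH first}"
}
```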

### 2.5 Training in distributed mode in the cloud

Training in distributed mode in the cloud is similar to running distributed mode locally, with the same changes as for the single-instance cloud process. Once again, a tier needs to be chosen appropriately in order to balance the time to execute against the cost per unit.

Please note that according to the scale tier documentation, any tier above `BASIC` is distributed. The only other scale tier with a single worker is `BASIC_GPU`, which is a single instance with a GPU.

https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#scaletier

Each job once again has a name and an output path, defined as:
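For example, with an assumed job name of `census_dist_1`:

```shell
# Reuse the existing bucket, or fall back to a placeholder when starting
# afresh ("your-bucket-name" is hypothetical; replace it).
BUCKET_NAME="${BUCKET_NAME:-your-bucket-name}"

JOB_NAME=census_dist_1
OUTPUT_PATH="gs://$BUCKET_NAME/$JOB_NAME"
```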

Finally, it can be run using:
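A sketch of the distributed submission, assuming `JOB_NAME`, `OUTPUT_PATH`, `REGION`, `TRAIN_DATA` and `EVAL_DATA` are set. The runtime version and the trainer flags after the bare `--` are assumed values based on the census sample, and the function name `submit_distributed_job` is made up for these notes.

```shell
RUNTIME_VERSION=1.2   # assumed; use the runtime version your trainer targets

submit_distributed_job() {
  gcloud ml-engine jobs submit training "${JOB_NAME:?set JOB_NAME first}" \
    --job-dir "${OUTPUT_PATH:?set OUTPUT_PATH first}" \
    --runtime-version "$RUNTIME_VERSION" \
    --scale-tier STANDARD_1 \
    --module-name trainer.task \
    --package-path trainer/ \
    --region "${REGION:?set REGION first}" \
    -- \
    --train-files "${TRAIN_DATA:?set TRAIN_DATA first}" \
    --eval-files "${EVAL_DATA:?set EVAL_DATA first}" \
    --train-steps 1000 \
    --verbosity DEBUG
}
```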

In fact, the command is exactly the same as for the single instance, but with the scale tier set to `STANDARD_1`.

Again, results can be analysed using tensorboard:
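As a sketch, with `OUTPUT_PATH` now pointing at the distributed job's output (the function name `view_distributed_results` is made up for these notes):

```shell
view_distributed_results() {
  tensorboard --logdir="${OUTPUT_PATH:?set OUTPUT_PATH first}"
}
```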