# Day 4 - Deploy to Production with Google Cloud Platform

During the previous exercises, you built a robust machine learning pipeline to train a model that predicts NYC taxi fares.
However, the training set is large and training models locally takes a long time. Because you want to iterate quickly, it makes no sense to spend hours waiting for a training run to complete.

**But there is a solution!**

<img src="https://www.gstatic.com/devrel-devsite/prod/vcd1bbe5dda31d2b800805cc4c730b0229f847f2d108be33386b6e78644e79178/cloud/images/cloud-logo.svg" width="200px"/>



In this series of exercises, you will learn how to deploy your code to [Google Cloud Platform](https://cloud.google.com/) (**GCP**), and in particular how to use **[AI Platform](https://cloud.google.com/ai-platform/)** to leverage the power of distributed computing and speed up your ML experimentation.

Beyond training models, you will see how to make your models available to the world, manage different versions, and serve predictions at scale.

<img src="https://cloud.google.com/ai-platform/images/ml-workflow.svg" />

### Summary

1. [GCP Setup](#part1)
2. [Deploy and train a simple model](#part2)
3. [Make predictions from simple model](#part3)
4. [Deploy and train a model with a pipeline and dependencies](#part4)
5. [Make predictions from pipeline model](#part5)
6. [Train at scale](#part6)

## 1. GCP Setup <a id="part1" />

### Setup Project

- Go to [Google Cloud](https://console.cloud.google.com/) and create an account if you do not already have one.
- In the Cloud Console, on the project selector page, select or create a Cloud project. You can name it `WagonBootcamp`, for example.
- Make sure that billing is enabled for your Google Cloud project. Don't worry: as a first-time user, you have a **$300 credit** to use for Google Cloud resources, which will be more than enough for this project.
- [Enable the AI Platform Training & Prediction and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.269215094.662509797.1580849510-2071889129.1567861089&_gac=1.154971594.1580849512.CjwKCAiAyeTxBRBvEiwAuM8dnbZ6uMwizbZW44J2mBCX6ncEjwjwpgF8S8QsvhYAXLkJ8awDnIRTNRoCJ_0QAvD_BwE)

### Install Cloud SDK

You need to install a new CLI called `gcloud`.

- On Mac:
```bash
brew install --cask google-cloud-sdk
```
- On Windows, follow the link below:
[Install and initialize the Cloud SDK](https://cloud.google.com/sdk/docs/)

### Create a service account key
- Go to the [Service Account key page](https://console.cloud.google.com/apis/credentials/serviceaccountkey)
- Create a new Service Account key:
  - Give the account any name you want
  - Set the role to `project > owner`
  
- Download the JSON key and store it under `~/[FILE_NAME].json`
- Then add `export GOOGLE_APPLICATION_CREDENTIALS="$HOME/[FILE_NAME].json"` (use `$HOME` rather than `~`, which the shell does not expand inside double quotes):  
   - to your `~/.aliases` for mac & linux
   - to your `.bash_profile` for windows
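
Once the variable is exported, a quick check from Python can confirm that it points to a readable key file before any Google client library is involved. The sketch below is only an illustration: `check_credentials` is a hypothetical helper, not part of any Google SDK, and it relies on the `client_email` field that service account key files contain.

```python
import json
import os


def check_credentials(env=None):
    """Return the service account e-mail if the key file is usable.

    `env` defaults to the process environment; passing a dict makes the
    function easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        # os.path.isfile does not expand a literal "~", so an unexpanded
        # tilde in the variable ends up here.
        raise FileNotFoundError(f"No key file found at {path!r}")
    with open(path) as key_file:
        key = json.load(key_file)
    return key["client_email"]
```

Calling `check_credentials()` in a fresh terminal should return the e-mail address of the service account you created; an exception usually means the export line was not picked up by your shell.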

### Troubleshooting

`AccessDeniedException: 403 The project to be billed is associated with an absent billing account.`

- Make sure that billing is enabled for your Google Cloud Platform project.
https://cloud.google.com/billing/docs/how-to/modify-project

### Create a bucket
You will need a Google Cloud Storage bucket to store data, code and trained models.
There are two ways to create one:
- From the UI
    - Go to [Storage](https://console.cloud.google.com/storage) and create a bucket from there
- Programmatically (**recommended**)
    - Use the `gsutil mb` command. See the documentation [here](https://cloud.google.com/storage/docs/creating-buckets#storage-create-bucket-gsutil)
    
##### Exercise
Let's use the command line tool to automate the workflow as much as possible.  
To make things easier, you are going to create a `Makefile` to run the commands.

- Create a `Makefile` 
```bash 
touch Makefile
```

- Implement a command that creates the bucket you need in your project.  
**IMPORTANT**: Bucket names must be **globally unique**, so please follow the convention `wagon-ml-[YOUR_LAST_NAME]-XX`


- Run this Makefile (see the example below)
- Check on [Storage](https://console.cloud.google.com/storage) that your bucket is correctly created
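
Because a rejected bucket name only shows up when `gsutil mb` fails, it can help to build and check the name beforehand. The sketch below applies the naming convention above together with the basic Cloud Storage rules for names without dots (3 to 63 characters; lowercase letters, digits, dashes and underscores; starting and ending with a letter or digit). `make_bucket_name` is a hypothetical helper, and it cannot tell you whether the name is already taken by someone else.

```python
import re

# Basic Cloud Storage naming rules for bucket names that contain no dots.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def make_bucket_name(last_name, suffix):
    """Build `wagon-ml-<last_name>-<suffix>` and check that it is well formed."""
    name = f"wagon-ml-{last_name.strip().lower()}-{suffix}"
    # Replace anything outside the allowed character set with a dash.
    name = re.sub(r"[^a-z0-9_-]", "-", name)
    if not _BUCKET_RE.match(name):
        raise ValueError(f"{name!r} is not a valid bucket name")
    return name
```

For instance, `make_bucket_name("Dupont", "01")` gives `wagon-ml-dupont-01`, which can then be copied into `BUCKET_NAME` in the Makefile.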

##### Makefile

```bash
PROJECT_ID=XXXX
BUCKET_NAME=wagon-ml-[YOUR_LAST_NAME]-XX
REGION=europe-west1

set_project:
	-@gcloud config set project ${PROJECT_ID}

create_bucket:
	-@gsutil mb -l ${REGION} -p ${PROJECT_ID} gs://${BUCKET_NAME}

all: set_project create_bucket
```

### Upload data to your bucket
Now let's upload the data to your bucket.
- Make sure you have the data on your local disk
- Add an `upload_data` command to the Makefile that uploads `train.csv` and `test.csv` to the `/data` folder in your bucket  
**HINT**: use the `gsutil cp` command and get help from the [GCP documentation](https://cloud.google.com/storage/docs/uploading-objects)
- Run the command:
```bash
make upload_data
```
- Check on [Storage](https://console.cloud.google.com/storage) that your files were correctly uploaded

### Troubleshooting

`wagon-XX@wagon-bootcamp-XXXXX.iam.gserviceaccount.com does not have storage.objects.list access to wagon-ml-[YOUR_NAME]-XX`

- Make sure that the `GOOGLE_APPLICATION_CREDENTIALS` export for your secret key is correctly added to your `~/.aliases`

## 2. Deploy and train a simple model <a id="part2" />

To get familiar with Google AI Platform, we will first deploy and train a simple model for the TaxiFare Challenge.

This model will be a linear model **`fare_amount ~ C * distance`**
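
The text does not say how `distance` is computed. One common choice for taxi trips is the haversine (great-circle) distance between pickup and dropoff coordinates; the sketch below assumes that choice and uses only the standard library. The function name and the kilometre unit are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

In `trainer.py` the same formula would typically be applied to whole DataFrame columns at once (for example with NumPy functions) rather than row by row.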


To start, you need to create a Python package. For this, create the following structure:
- `TaxiFareModelSimple`
  - `__init__.py`
  - `trainer.py`
- `Makefile`

#### trainer.py
All the code goes in the `trainer.py` file.
- Write a `get_data` function that loads data from your bucket (never forget that **simple is smart** and **pandas is great**)
- Write a `save_model` function that uploads the joblib file to the `/models` folder. Give your file a unique name, for example by appending a timestamp  
It is good practice to check the [GCP Documentation](https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python)

```python
# trainer.py

import datetime
import os

import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
import pandas as pd
from google.cloud import storage
from sklearn import linear_model


def get_data():
    """Get the training data (or a portion of it) from the Google Cloud bucket."""
    pass


def compute_distance(df):
    """Compute the trip distance for each row of df."""
    pass


def preprocess(df):
    """Pre-process the data and return the features and the target."""
    df["distance"] = compute_distance(df)
    X_train = df[["distance"]]
    y_train = df["fare_amount"]
    return X_train, y_train


def train_model(X_train, y_train):
    """Train the model."""
    clf = linear_model.Lasso(alpha=0.1)
    clf.fit(X_train, y_train)
    return clf


def save_model(clf):
    """Save the model into a .joblib file and upload it to the /models folder on Google Storage.
    HINTS: use the joblib and google-cloud-storage libraries."""
    pass


# if __name__ == '__main__':
#     df = get_data()
#     X_train, y_train = preprocess(df)
#     clf = train_model(X_train, y_train)
#     save_model(clf)
```
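
As a starting point for `save_model`, here is one possible way to build the timestamped object name and upload the file. It is a sketch under assumptions: `timestamped_blob_name` and `upload_model` are hypothetical helpers, and `bucket` is expected to behave like a `google.cloud.storage` bucket (for example `storage.Client().bucket(BUCKET_NAME)`), whose `blob(...).upload_from_filename(...)` call performs the upload.

```python
import datetime


def timestamped_blob_name(now=None, folder="models", stem="model"):
    """Return an object name such as models/model_20200204_101530.joblib."""
    now = now or datetime.datetime.now()
    return f"{folder}/{stem}_{now:%Y%m%d_%H%M%S}.joblib"


def upload_model(bucket, local_path, now=None):
    """Upload a local .joblib file to the bucket under a timestamped name."""
    blob_name = timestamped_blob_name(now)
    # `bucket` is assumed to be a google.cloud.storage Bucket (or a stand-in
    # exposing the same blob(...).upload_from_filename(...) interface).
    bucket.blob(blob_name).upload_from_filename(local_path)
    return blob_name
```

In `save_model`, you would first write the model locally with `joblib.dump(clf, "model.joblib")` and then pass that path to `upload_model`. Keep in mind that AI Platform's built-in scikit-learn serving looks for a file named `model.joblib` in the directory given when creating a version, so you may prefer to put the timestamp in the folder name instead.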



##### Makefile

Then create the `Makefile` that will have a command to submit `trainer.py` to AI Platform for training.  
Before anything else, run `trainer.py` locally and check on GCP Storage that your model is correctly uploaded.
- Write the command for `submit_training`
- Tip: https://cloud.google.com/sdk/gcloud/reference/ai-platform/jobs/submit/training
- Do not spend more than 20 minutes on this task; feel free to call a TA for help

```
PACKAGE_NAME=XXXX
FILENAME=XXX
BUCKET_NAME=XXX
PROJECT_ID=XXX
JOB_NAME=XXXX
REGION=XXX
PYTHON_VERSION=XXX
RUNTIME_VERSION=XXX
FRAMEWORK=XXX
MODEL_NAME=XXX

set_project:
	-@gcloud config set project ${PROJECT_ID}

submit_training:
    ## write command line to submit training

all: set_project submit_training
```
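
If you get stuck on `submit_training`, it may help to see how the Makefile variables map onto the flags of `gcloud ai-platform jobs submit training`. The sketch below assembles the argument list in Python so that each flag is visible; the helper names and the `trainings` staging folder are assumptions, and every value passed in is a placeholder for your own settings.

```python
import datetime


def unique_job_name(prefix, now=None):
    """Job names must be unique and use only letters, digits and underscores."""
    now = now or datetime.datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}"


def build_submit_command(job_name, bucket_name, package_name, filename,
                         region, python_version, runtime_version):
    """Return the gcloud command (as an argument list) that submits a training job."""
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        # Folder in the bucket where the packaged code is staged (hypothetical name).
        "--job-dir", f"gs://{bucket_name}/trainings",
        "--package-path", package_name,
        "--module-name", f"{package_name}.{filename}",
        "--python-version", python_version,
        "--runtime-version", runtime_version,
        "--region", region,
        "--stream-logs",
    ]
```

The resulting list can be run with `subprocess.run(cmd, check=True)`, or used as a checklist when writing the equivalent Makefile recipe.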

##### Then:
- Run the Makefile
- Go to the [Jobs page](https://console.cloud.google.com/ai-platform/jobs) and check that your training task has been submitted and completes successfully
- Once the training task is complete, go to the models folder (`/models`) and verify that the model file has been uploaded correctly.

### Create a model and a version

Before you can use your model, you need to create a model resource on AI Platform and link a version to the model file.
- Go to https://console.cloud.google.com/ai-platform/models and create a model
- Then on the model page, create a new version and specify the Google Cloud Storage path where you uploaded the model file.

## 3. Make predictions from simple model <a id="part3" />

Now that you have trained your model on the cloud, let's use it to make predictions on the test set!

We are going to write a `predict.py` script that will make predictions on a sample test set. 

#### Exercise
- Fill in the missing methods of `predict.py`
- Run `predict.py` to generate predictions on the test set
- Submit these predictions to Kaggle


Documentation: https://cloud.google.com/ai-platform/training/docs/python-client-library?hl=en

```python
## predict.py

import googleapiclient.discovery
import numpy
import pandas as pd

BUCKET_NAME = "XXXX"
VERSION = "XXXX"
PROJECT_ID = 'XXX'
MODEL = 'XXXX'

def predict_json(project, model, instances, version=None):
    """Send json data to a deployed model for prediction. """
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(project, model)
    if version is not None:
        name += '/versions/{}'.format(version)
    response = service.projects().predict(
        name=name,
        body={'instances': instances}
    ).execute()
    if 'error' in response:
        raise RuntimeError(response['error'])
    return response['predictions']


def get_test_data():
    """ load test data from google cloud bucket"""
    pass


def preprocess(df):
    """
    preprocess method. This should be identical to the preprocess method
    that was used for training.
    """
    X_test, y_test = None, None
    return X_test, y_test


def convert_to_json_instances(X_test):
    return X_test.values.tolist()


# df = get_test_data().head(100) # only predict for first 100 rows
# X_test, y_test = preprocess(df)
# instances = convert_to_json_instances(X_test)
# results = predict_json(project=PROJECT_ID, model=MODEL, instances=instances, version=VERSION)
# df["fare_amount"] = results
# df[["key", "fare_amount"]].to_csv("predictions.csv", index=False)
```

## 4. Deploy and train a model with a pipeline and dependencies <a id="part4" />

Now that you have successfully trained a model on GCP and used it to make predictions, let's see how you can do the same thing with a more complex model.
One limitation of the previous approach is that you need to duplicate the pre-processing logic between training and scoring.

**With pipelines, the pre-processing steps are saved inside the trained model, so you do not have to duplicate code when making predictions**

In this exercise, you are going to use the code we have from day 2-3 (with Sklearn pipeline).

The main difficulty here is that you will have to upload the code dependencies that are needed to run the pipeline.

#### Exercise
- Create a folder structure like this one:
- `TaxiFareModel`
  - `__init__.py`
  - `trainer.py`
  - `data.py`
  - `encoders.py`
  - `utils.py`
- Reuse the code from day 2-3 to train the model using an sklearn `Pipeline` and our custom encoders (`DistanceTransformer` and `TimeFeaturesEncoder`).
- `data.py` needs to be updated to load data from your Google Cloud bucket (instead of a local file)


```python
## trainer.py

from sklearn.pipeline import Pipeline, make_pipeline
from TaxiFareModel.data import Data
from TaxiFareModel.encoders import TimeFeaturesEncoder, DistanceTransformer

class Trainer(object):
    """trainer class that trains the model"""

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def get_pipeline(self):
        """builds pipeline"""
        pass

    def train(self):
        """trains the model"""
        pass

    def save_model(self):
        """saves model to google cloud bucket"""
        pass

if __name__ == '__main__':
    t = Trainer()
    t.train()
    t.save_model()
```

#### Makefile
Now we need a Makefile to submit the training code to GCP.

#### Exercise:
- Write this new Makefile, making sure you also upload the code dependencies. Read this [doc](https://cloud.google.com/ai-platform/training/docs/packaging-trainer) to help you out.

## 5. Make predictions from pipeline model <a id="part5" />

Finally, once your new model has been submitted and trained, you need to:
1. Create a new version for this model with a custom predictions routine
2. Call the model to make predictions on the test set

### 1. Custom predictions routine
Because your input data is a Pandas DataFrame, you need to create a custom `Predictor` class to tell the Google Cloud prediction service how to handle the input data.

Read this doc for more details [Custom Prediction Routines](https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines?hl=en)

- create a `Predictor` class with:
  - `predict(self, instances, **kwargs)` method 
  - `from_path(cls, model_dir)` classmethod
- Then write a new command in your `Makefile` called `create_model_version` that will create a new model version with your custom `Predictor` class. See [https://cloud.google.com/sdk/gcloud/reference/beta/ai-platform/versions/create](https://cloud.google.com/sdk/gcloud/reference/beta/ai-platform/versions/create)

```python
## predictor.py

class Predictor(object):

    def predict(self, instances, **kwargs):
        pass

    @classmethod
    def from_path(cls, model_dir):
        pass
```
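
To make the two methods more concrete, here is one possible shape for the class. It is a sketch, not the reference solution: it assumes the pipeline was saved as `model.joblib` inside `model_dir`, that instances arrive as JSON objects (one dict per row, keyed by column name), and that `_to_frame` is a helper introduced only for this example.

```python
import os


class Predictor:
    """Minimal custom prediction routine wrapping a fitted sklearn pipeline."""

    def __init__(self, model):
        self._model = model

    def _to_frame(self, instances):
        # Imported lazily so the class can be loaded without pandas installed.
        import pandas as pd
        return pd.DataFrame(instances)

    def predict(self, instances, **kwargs):
        """Turn JSON instances into a DataFrame and return plain Python floats."""
        frame = self._to_frame(instances)
        predictions = self._model.predict(frame)
        # JSON responses cannot carry NumPy types, hence the float conversion.
        return [float(p) for p in predictions]

    @classmethod
    def from_path(cls, model_dir):
        """Load the pipeline from the directory the prediction service provides."""
        import joblib
        # Assumption: the trained pipeline was exported under this file name.
        model = joblib.load(os.path.join(model_dir, "model.joblib"))
        return cls(model)
```

The service calls `from_path` once when the version starts and `predict` for every request, so anything expensive (such as loading the model) belongs in `from_path`.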

### 2. Make predictions

Now that you have your new version with a custom prediction routine, you can call it to make predictions on the test set.

- Write a script `predict.py` that will call this new version for the entire test set.
- Submit these predictions to Kaggle

```python
## predict.py

BUCKET_NAME = "XXXX"
VERSION = "XXX"
MODEL = "XXX"
PROJECT = "XXX"

def predict_json(project, model, instances, version=None):
    """Send json data to the deployed model for prediction. """
    pass


def get_data():
    """ load test data """
    pass


def convert_to_json_instances(X_test):
    pass

# if __name__ == '__main__':
#     df = get_data()
#     instances = convert_to_json_instances(df)
#     results = predict_json(project=PROJECT, model=MODEL, instances=instances, version=VERSION)
#     df["fare_amount"] = results
#     df[["key", "fare_amount"]].to_csv("predictions.csv", index=False)
```
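
Online prediction requests are limited in size, so sending the entire test set in a single call may be rejected. One way around this is to send the instances in batches and concatenate the results. The helpers below are a sketch: `chunked` and `predict_in_batches` are hypothetical names, and the default batch size is an arbitrary value to tune.

```python
def chunked(instances, size):
    """Yield consecutive slices of `instances` holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(instances), size):
        yield instances[start:start + size]


def predict_in_batches(predict_fn, instances, batch_size=500):
    """Call `predict_fn` on each batch and return all predictions in order."""
    results = []
    for batch in chunked(instances, batch_size):
        results.extend(predict_fn(batch))
    return results
```

With the script above, `predict_fn` could be `lambda batch: predict_json(PROJECT, MODEL, batch, VERSION)`.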

## 6. Train at scale <a id="part6" />

Now it is time to go back to our original goal which is building the best model. 
- Try changing machine types and scale tiers to train your model faster on the entire dataset: [https://cloud.google.com/ai-platform/training/docs/machine-types](https://cloud.google.com/ai-platform/training/docs/machine-types)
- Perform extensive hyperparameter tuning to fine-tune your model
- Submit your predictions to Kaggle
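
Scale tiers and machine types are passed to `gcloud ai-platform jobs submit training` through the `--scale-tier` and `--master-machine-type` flags, and a machine type can only be chosen with the `CUSTOM` tier. The sketch below builds those extra flags and checks that combination; `scale_flags` is a hypothetical helper, and the machine type in the usage note is only an example.

```python
# Scale tiers accepted by AI Platform Training.
SCALE_TIERS = {"BASIC", "STANDARD_1", "PREMIUM_1", "BASIC_GPU", "BASIC_TPU", "CUSTOM"}


def scale_flags(scale_tier="BASIC", master_machine_type=None):
    """Return the extra gcloud flags for the requested scale tier."""
    if scale_tier not in SCALE_TIERS:
        raise ValueError(f"Unknown scale tier: {scale_tier!r}")
    flags = ["--scale-tier", scale_tier]
    if scale_tier == "CUSTOM":
        if not master_machine_type:
            raise ValueError("CUSTOM requires a master machine type")
        flags += ["--master-machine-type", master_machine_type]
    elif master_machine_type:
        raise ValueError("A machine type can only be set with the CUSTOM tier")
    return flags
```

For example, `scale_flags("CUSTOM", "n1-standard-8")` returns the flags to append to your submit command so the job runs on a single larger machine.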