# Vertex Tabular Binary Classification with CustomJob.from_local_script

**Use case: predict whether a customer will buy on a return visit.**

The ecommerce dataset has the following training features:
- latest_ecommerce_progress
- bounces
- time_on_site
- pageviews
- source
- medium
- channel_grouping
- device_category
- country

The label: will_buy_on_return_visit

The data is imbalanced, so accuracy alone can overstate model quality.
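
One common way to handle an imbalanced label is to weight the loss by inverse class frequency. The sketch below uses a hypothetical list of labels (not the real dataset) to compute such weights; in PyTorch they could be passed as the `weight` argument of `F.cross_entropy`. The training script in this notebook does not do this, so treat it as an optional variation rather than part of the original method.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}

# Hypothetical labels: 9 negatives for every positive
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```

The rare class receives the larger weight, so mistakes on it cost more during training.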

## Set Constants

In [55]:
PROJECT_ID = 'jchavezar-demo'
REGION = 'us-central1'
DATASET_URI = 'gs://vtx-datasets-public/ecommerce/datasets.csv'
MODEL_URI = 'gs://vtx-models/pytorch/ecommerce'
MODEL_DISPLAY_NAME = 'pytorch-ecommerce'
STAGING_URI = 'gs://vtx-staging/pytorch/ecommerce/'
TRAIN_IMAGE_URI = 'us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-11:latest'
PREDICTION_IMAGE_URI = 'us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-9:latest'

## Create Folder Structure

```
source
     └─── trainer
          |  train.py
          |

```

In [2]:
!rm -fr source
!mkdir -p source/trainer

## Intro

Below is the training code. It uses PyTorch to build a neural network with these components:

- Two types of features: categorical and numerical.
- Embedding layers for the categorical features, with sizes derived from the number of categories.
- Dropout to reduce overfitting during training.
- Batch normalization to standardize the data.
- 1 input layer, shape: 114x32: 
  - 114 is the number of total features (categorical and numerical) after the embedding.
  - 32 is the number of the neurons.
- Activation function applied to the output of the input layer to introduce non-linearity.
- 1 output layer, shape: 32x2.

The following diagram shows the neural network and the order of the steps used in the model class, ShelterOutcomeModel.

<center><img src="../../../images/04-pytorch-nn.png"/></center>
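
The embedding width for each categorical column in the training code follows the rule `min(50, (n_categories + 1) // 2)`, and the first linear layer receives all embedding outputs concatenated with the numerical features. Here is a small sketch of that arithmetic, using hypothetical category counts rather than the real dataset's:

```python
def embedding_dim(n_categories):
    """Embedding width rule used in train.py: half the cardinality, capped at 50."""
    return min(50, (n_categories + 1) // 2)

# Hypothetical cardinalities for three categorical columns
cardinalities = {"source": 30, "medium": 7, "country": 180}
n_numerical = 4  # latest_ecommerce_progress, bounces, time_on_site, pageviews

dims = {name: embedding_dim(n) for name, n in cardinalities.items()}
first_layer_inputs = sum(dims.values()) + n_numerical
print(dims, first_layer_inputs)
```

With the real data, the same sum determines the input width of the first linear layer.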

In [3]:
%%writefile source/trainer/train.py
import os
import torch
import pickle
import argparse
import numpy as np
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as torch_optim
from google.cloud import storage
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

parser = argparse.ArgumentParser()

parser.add_argument(
    '--dataset_uri',
    type = str,
    help = 'Dataset uri in the format gs://[BUCKET]/*suffix*/file_name.extension')
parser.add_argument(
    '--project',
    type = str,
    help = 'This is the tenant or the Google Cloud project id name')

args = parser.parse_args()

## Prepare Data

def preprocessing_data(df):
    target = 'will_buy_on_return_visit'
    cat_columns = [i for i in df.columns if df[i].dtypes == 'object']
    num_columns = [i for i in df.columns if df[i].dtypes == 'int64' or df[i].dtypes == 'float']
    num_columns.remove(target)

    cat_train_df = df[cat_columns].copy()  # copy so the assignments below do not write to a view of df
    num_train_df = df[num_columns]
    label = df[target].to_numpy()
    
    labelencoder = defaultdict(LabelEncoder)
    cat_train_df[cat_columns] = cat_train_df[cat_columns].apply(lambda x: labelencoder[x.name].fit_transform(x))
    cat_train_df[cat_columns] = cat_train_df[cat_columns].astype('category')
    
    train_df = pd.concat([cat_train_df,num_train_df], axis=1)
    X_train, X_val, y_train, y_val = train_test_split(train_df, label, test_size=0.10, random_state=0)
    
    ## Numerical columns standardization
    scaler = StandardScaler()
    X_train[num_columns] = scaler.fit_transform(X_train[num_columns])
    X_val[num_columns] = scaler.transform(X_val[num_columns])
    
    # Categorical Embedding
    embedded_cols = {n: len(col.cat.categories) for n,col in X_train[cat_columns].items() if len(col.cat.categories) > 2}
    embedded_col_names = list(embedded_cols.keys())  # plain list is safer for pandas indexing
    embedding_sizes = [(n_categories, min(50, (n_categories+1)//2)) for _,n_categories in embedded_cols.items()]
    embedding_sizes = nn.ModuleList([nn.Embedding(categories, size) for categories,size in embedding_sizes])
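    # Persist the fitted preprocessing objects so the predictor can reuse them
    with open('label.pkl', 'wb') as f:
        pickle.dump(labelencoder, f)
    with open('std_scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
    with open('emb.pkl', 'wb') as f:
        pickle.dump(embedding_sizes, f)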
    
    return X_train, X_val, y_train, y_val, embedded_col_names, embedding_sizes

df = pd.read_csv(args.dataset_uri)
X_train, X_val, y_train, y_val, embedded_col_names, embedding_sizes = preprocessing_data(df)

## PyTorch Dataset

class ShelterOutcomeDataset(Dataset):
    def __init__(self, X, Y, embedded_col_names):
        X = X.copy()
        self.X1 = X.loc[:,embedded_col_names].copy().values.astype(np.int64) #categorical columns
        self.X2 = X.drop(columns=embedded_col_names).copy().values.astype(np.float32) #numerical columns
        self.y = Y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return self.X1[idx], self.X2[idx], self.y[idx]
    
## Train and Valid datasets

train_ds = ShelterOutcomeDataset(X_train, y_train, embedded_col_names)
valid_ds = ShelterOutcomeDataset(X_val, y_val, embedded_col_names)

## CPU or GPU selection

def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')
    

def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list,tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)


class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device
        
    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl: 
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)

device = get_default_device()


## Model

class ShelterOutcomeModel(nn.Module):
    def __init__(self, embedding_sizes, n_cont):
        super().__init__()
        self.embeddings = embedding_sizes
        n_emb = sum(e.embedding_dim for e in self.embeddings) #length of all embeddings combined
        self.n_emb, self.n_cont = n_emb, n_cont
        self.lin1 = nn.Linear(self.n_emb + self.n_cont, 200)
        self.lin2 = nn.Linear(200, 70)
        self.lin3 = nn.Linear(70, 2)
        self.bn1 = nn.BatchNorm1d(self.n_cont)
        self.bn2 = nn.BatchNorm1d(200)
        self.bn3 = nn.BatchNorm1d(70)
        self.emb_drop = nn.Dropout(0.6)
        self.drops = nn.Dropout(0.3)
        

    def forward(self, x_cat, x_cont):
        x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
        x = torch.cat(x, 1)
        x = self.emb_drop(x)
        x2 = self.bn1(x_cont)
        x = torch.cat([x, x2], 1)
        x = F.relu(self.lin1(x))
        x = self.drops(x)
        x = self.bn2(x)
        x = F.relu(self.lin2(x))
        x = self.drops(x)
        x = self.bn3(x)
        x = self.lin3(x)
        return x
    
model = ShelterOutcomeModel(embedding_sizes, 4)
to_device(model, device)

## Define Optimizer

def get_optimizer(model, lr = 0.001, wd = 0.0):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optim = torch_optim.Adam(parameters, lr=lr, weight_decay=wd)
    return optim

## Train Model

def train_model(model, optim, train_dl):
    model.train()
    total = 0
    sum_loss = 0
    for x1, x2, y in train_dl:
        batch = y.shape[0]
        output = model(x1, x2)
        loss = F.cross_entropy(output, y)   
        optim.zero_grad()
        loss.backward()
        optim.step()
        total += batch
        sum_loss += batch*(loss.item())
    return sum_loss/total


def val_loss(model, valid_dl):
    model.eval()
    total = 0
    sum_loss = 0
    correct = 0
    # Gradients are not needed for validation
    with torch.no_grad():
        for x1, x2, y in valid_dl:
            current_batch_size = y.shape[0]
            out = model(x1, x2)
            loss = F.cross_entropy(out, y)
            sum_loss += current_batch_size*(loss.item())
            total += current_batch_size
            pred = torch.max(out, 1)[1]
            correct += (pred == y).float().sum().item()
    print("valid loss %.3f and accuracy %.3f" % (sum_loss/total, correct/total))
    return sum_loss/total, correct/total

def train_loop(model, epochs, lr=0.01, wd=0.0):
    optim = get_optimizer(model, lr = lr, wd = wd)
    for i in range(epochs): 
        loss = train_model(model, optim, train_dl)
        print("training loss: ", loss)
        val_loss(model, valid_dl)
        
        
batch_size = 1000
train_dl = DataLoader(train_ds, batch_size=batch_size,shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size,shuffle=True)

train_dl = DeviceDataLoader(train_dl, device)
valid_dl = DeviceDataLoader(valid_dl, device)


train_loop(model, epochs=8, lr=0.05, wd=0.00001)
torch.save(model.state_dict(), "state_model.pt")

bucket = os.environ['AIP_MODEL_DIR'].split('/')[2]
blob_name = '/'.join(os.environ['AIP_MODEL_DIR'].split('/')[3:])

storage_client = storage.Client(project=args.project)
bucket = storage_client.bucket(bucket)

files_to_upload = ["label.pkl", "std_scaler.pkl", "state_model.pt", "emb.pkl"]

for i in files_to_upload:
    blob = bucket.blob(blob_name+i)
    blob.upload_from_filename(i)

Writing source/trainer/train.py
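
The upload step at the end of the script splits `AIP_MODEL_DIR` by hand. An equivalent sketch using `urllib.parse` from the standard library is shown here. The URI value is only an example of the `gs://bucket/prefix/` shape, and `split_gcs_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def split_gcs_uri(uri):
    """Split 'gs://bucket/some/prefix/' into (bucket, 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs":
        raise ValueError(f"Not a GCS URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket_name, prefix = split_gcs_uri("gs://vtx-models/pytorch/ecommerce/model/")
blob_names = [prefix + name for name in ["label.pkl", "state_model.pt"]]
print(bucket_name, blob_names)
```

Note that the blob names are built by plain concatenation, as in the script, so the prefix must end with a slash.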


## Training Job (CustomJob.from_local_script)

To speed up training, an NVIDIA Tesla T4 GPU is used; the job should take around 2 minutes to finish.

In [None]:
from google.cloud import aiplatform as aip

customJob = aip.CustomJob.from_local_script(
    display_name = 'pytorch_tab_sa_ecommerce',
    script_path = 'source/trainer/train.py',
    container_uri = TRAIN_IMAGE_URI,
    requirements = ['scikit-learn'],
    args = [
        '--dataset_uri', 
        DATASET_URI,
        '--project',
        PROJECT_ID
    ],
    replica_count = 1,
    machine_type = 'n1-standard-4',
    accelerator_type = 'NVIDIA_TESLA_T4',
    accelerator_count = 1,
    staging_bucket = STAGING_URI,
    base_output_dir = MODEL_URI
)

customJob.run()
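
The `args` list given to `CustomJob.from_local_script` is handed to the script as command-line arguments. To check the flag names before submitting a job, the same parser can be exercised locally. This sketch mirrors the two arguments defined in `train.py`; the values are placeholders, not real resources.

```python
import argparse

def build_parser():
    """Recreate the two arguments that train.py expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset_uri', type=str)
    parser.add_argument('--project', type=str)
    return parser

# Placeholder values in the same flag/value layout as the job's args list
job_args = ['--dataset_uri', 'gs://my-bucket/data.csv', '--project', 'my-project']
parsed = build_parser().parse_args(job_args)
print(parsed.dataset_uri, parsed.project)
```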

## Creating Custom Container for Prediction

#### The method I'm using is called Custom Prediction Routines: we specify the load, preprocess, predict and postprocess methods, and Vertex AI does the rest for us.
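
To make that division of labor concrete, here is a toy sketch of the order in which those methods are called for one request. `ToyPredictor` and `handle_request` are hypothetical stand-ins written for illustration; they are not the Vertex AI SDK classes, and the real serving container adds HTTP handling around these steps.

```python
class ToyPredictor:
    """Mimics the load -> preprocess -> predict -> postprocess flow."""

    def load(self, artifacts_uri):
        # A real predictor would download and deserialize model artifacts here
        self.threshold = 0.5

    def preprocess(self, prediction_input):
        return prediction_input["instances"]

    def predict(self, instances):
        return [1 if score > self.threshold else 0 for score in instances]

    def postprocess(self, prediction_results):
        return {"predictions": prediction_results}


def handle_request(predictor, body):
    """Run the per-request steps in order."""
    return predictor.postprocess(predictor.predict(predictor.preprocess(body)))


predictor = ToyPredictor()
predictor.load("gs://example-bucket/model")  # called once, before any request
response = handle_request(predictor, {"instances": [0.2, 0.9]})
print(response)
```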

In [143]:
USER_SRC_DIR = "src_dir_pytorch"  # @param {type:"string"}
IMAGE_URI = "us-central1-docker.pkg.dev/jchavezar-demo/custom-predictions/pytorch-ecommerce:latest"

In [6]:
!rm -fr $USER_SRC_DIR
!mkdir $USER_SRC_DIR

In [7]:
%%writefile $USER_SRC_DIR/requirements.txt
fastapi
uvicorn==0.17.6
pandas
torch
scikit-learn
google-cloud-storage>=1.26.0,<2.0.0dev
google-cloud-aiplatform[prediction]>=1.16.0

Writing src_dir_pytorch/requirements.txt


In [None]:
!pip install -r $USER_SRC_DIR/requirements.txt

In [45]:
## Copy all the Artifacts from Vertex Custom Training
!gsutil cp $MODEL_URI/model/* $USER_SRC_DIR

Copying gs://vtx-models/pytorch/ecommerce/model/emb.pkl...
Copying gs://vtx-models/pytorch/ecommerce/model/label.pkl...                    
Copying gs://vtx-models/pytorch/ecommerce/model/state_model.pt...               
Copying gs://vtx-models/pytorch/ecommerce/model/std_scaler.pkl...               
/ [4 files][382.8 KiB/382.8 KiB]                                                
Operation completed over 4 objects/382.8 KiB.                                    


#### PyTorch can run into library conflicts, so I highly recommend installing its packages with conda:

$ conda install pytorch torchvision torchaudio cpuonly -c pytorch

In [39]:
%%writefile $USER_SRC_DIR/predictor.py

import os
import torch
import pickle
import numpy as np
import pandas as pd
import torch.nn as nn
from typing import Dict
import torch.nn.functional as F
import torch.optim as torch_optim
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader
from google.cloud.aiplatform.utils import prediction_utils
from google.cloud.aiplatform.prediction.predictor import Predictor


class CustomPyTorchPredictor(Predictor):
    
    def __init__(self):
        self.embedded_col_names = ['source', 'medium', 'channelGrouping', 'deviceCategory', 'country']
        self.columns = [ "latest_ecommerce_progress" , "bounces", "time_on_site", "pageviews", "source", "medium", "channelGrouping", "deviceCategory", "country"]

            
    def preprocess(self, prediction_input: Dict) -> torch.utils.data.dataloader.DataLoader:
        instances = prediction_input["instances"]
        data = pd.DataFrame(instances, columns = self.columns)
        ## Prepare Data        
        embedded_col_names = ['source', 'medium', 'channelGrouping', 'deviceCategory', 'country']
        
        def preprocessing_data(df):
            # Load the scaler and label encoders fitted during training
            with open("std_scaler.pkl", "rb") as f:
                standarization = pickle.load(f)
            with open("label.pkl", "rb") as f:
                labelencoder = pickle.load(f)

            cat_columns = [i for i in df.columns if df[i].dtypes == 'object']
            num_columns = [i for i in df.columns if df[i].dtypes == 'int64' or df[i].dtypes == 'float']

            cat_df = df[cat_columns]
            num_df = df[num_columns]
    
            cat_df = cat_df.apply(lambda x: labelencoder[x.name].transform(x))
            cat_df = cat_df.astype('category')
    
            df = pd.concat([cat_df, num_df], axis=1)
            df[num_columns] = standarization.transform(df[num_columns])
            
            return df

        class PredictData(Dataset):
            def __init__(self, X):
                embedded_col_names = ['source', 'medium', 'channelGrouping', 'deviceCategory', 'country']
                self.X1 = X.loc[:,embedded_col_names].copy().values.astype(np.int64)
                self.X2 = X.drop(columns=embedded_col_names).copy().values.astype(np.float32)

            def __getitem__(self, index):
                return self.X1[index], self.X2[index]

            def __len__ (self):
                return len(self.X1)
        
        prep_df = DataLoader(PredictData(preprocessing_data(data)))
        return prep_df
    
    def load(self, artifacts_uri: str):
        """Loads the model artifacts."""
        prediction_utils.download_model_artifacts(artifacts_uri)
        self.embeddings = pickle.load(open('emb.pkl', 'rb'))
        class ShelterOutcomeModel(nn.Module):
            def __init__(self, embedding_sizes, n_cont):
                super().__init__()
                self.embeddings = embedding_sizes
                n_emb = sum(e.embedding_dim for e in self.embeddings) #length of all embeddings combined
                self.n_emb, self.n_cont = n_emb, 4
                self.lin1 = nn.Linear(self.n_emb + self.n_cont, 200)
                self.lin2 = nn.Linear(200, 70)
                self.lin3 = nn.Linear(70, 2)
                self.bn1 = nn.BatchNorm1d(self.n_cont)
                self.bn2 = nn.BatchNorm1d(200)
                self.bn3 = nn.BatchNorm1d(70)
                self.emb_drop = nn.Dropout(0.6)
                self.drops = nn.Dropout(0.3)


            def forward(self, x_cat, x_cont):
                x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
                x = torch.cat(x, 1)
                x = self.emb_drop(x)
                x2 = self.bn1(x_cont)
                x = torch.cat([x, x2], 1)
                x = F.relu(self.lin1(x))
                x = self.drops(x)
                x = self.bn2(x)
                x = F.relu(self.lin2(x))
                x = self.drops(x)
                x = self.bn3(x)
                x = self.lin3(x)
                return x
            
        device = torch.device('cpu')
        self._model = ShelterOutcomeModel(self.embeddings, 4)
        self._model.load_state_dict(torch.load("state_model.pt", map_location=device))
        
    @torch.inference_mode()
    def predict(self, instances: torch.utils.data.dataloader.DataLoader) -> list:
        """Performs prediction."""
        preds = []
        self._model.eval()
        with torch.no_grad():
            for x1,x2 in instances:
                out = self._model(x1,x2)
                prob = F.softmax(out, dim=1)
                preds.append(prob)
        final_probs = [item for sublist in preds for item in sublist]
        predicted = [0 if t[0] > 0.5 else 1 for t in final_probs]
        print(predicted)
        return predicted

    def postprocess(self, prediction_results: list) -> Dict:
        return {"predictions": prediction_results}

Overwriting src_dir_pytorch/predictor.py
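
The `predict` method turns the two output logits into probabilities with softmax and returns class 0 when its probability exceeds 0.5, otherwise class 1. The same arithmetic in plain Python, with made-up logits, is sketched here:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_class(logits):
    """Same decision rule as the predictor: class 0 only if P(class 0) > 0.5."""
    probs = softmax(logits)
    return 0 if probs[0] > 0.5 else 1

# Made-up logits for three rows
rows = [[2.0, -1.0], [-0.5, 0.5], [0.0, 0.0]]
classes = [to_class(r) for r in rows]
print(classes)
```

With this rule an exact tie goes to class 1, because the comparison is strict.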


## Authentication

The easiest way to handle AuthN/AuthZ for the next steps is to log in with application default credentials. This stores a JSON credentials file locally at: **/home/jupyter/.config/gcloud/application_default_credentials.json**

!gcloud auth application-default login

In [40]:
CREDENTIALS_FILE = "/home/jupyter/.config/gcloud/application_default_credentials.json"

In [41]:
import os

from google.cloud.aiplatform.prediction import LocalModel
from src_dir_pytorch.predictor import \
    CustomPyTorchPredictor  # Update this path as the variable $USER_SRC_DIR to import the custom predictor.

local_model = LocalModel.build_cpr_model(
    USER_SRC_DIR,
    IMAGE_URI,
    predictor=CustomPyTorchPredictor,
    requirements_path=os.path.join(USER_SRC_DIR, "requirements.txt"),
)



In [42]:
local_model.get_serving_container_spec()

image_uri: "us-central1-docker.pkg.dev/jchavezar-demo/custom-predictions/pytorch-ecommerce:latest"
predict_route: "/predict"
health_route: "/health"

## Test Model Locally

In [43]:
INPUT_FILE = "instances.json"

In [46]:
%%writefile $INPUT_FILE
{
    "instances": [
        [0, 0, 142, 5.0, "(direct)", "(none)", "Direct", "mobile", "Argentina"]
    ]
}

Overwriting instances.json
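
Each instance is a positional list, so its order must match the `columns` list in the predictor. Below is a hedged sketch of a quick sanity check to run before sending a request. `check_instances` is a hypothetical helper, and the column list is copied from `predictor.py`.

```python
import json

COLUMNS = ["latest_ecommerce_progress", "bounces", "time_on_site", "pageviews",
           "source", "medium", "channelGrouping", "deviceCategory", "country"]
N_NUMERICAL = 4  # the first four columns are numeric

def check_instances(payload):
    """Return a list of problems found in a {'instances': [...]} payload."""
    problems = []
    for i, row in enumerate(payload["instances"]):
        if len(row) != len(COLUMNS):
            problems.append(f"row {i}: expected {len(COLUMNS)} values, got {len(row)}")
            continue
        for name, value in zip(COLUMNS[:N_NUMERICAL], row):
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"row {i}: {name} should be numeric")
        for name, value in zip(COLUMNS[N_NUMERICAL:], row[N_NUMERICAL:]):
            if not isinstance(value, str):
                problems.append(f"row {i}: {name} should be a string")
    return problems

payload = json.loads(
    '{"instances": [[0, 0, 142, 5.0, "(direct)", "(none)", "Direct", "mobile", "Argentina"]]}'
)
problems = check_instances(payload)
print(problems)
```

This only checks shape and types. A categorical value that was not seen during training would still be rejected by the fitted `LabelEncoder` at prediction time.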


In [51]:
with local_model.deploy_to_local_endpoint(
    artifact_uri=f"{MODEL_URI}/model",
    credential_path = CREDENTIALS_FILE,
) as local_endpoint:
    predict_response = local_endpoint.predict(
        request_file = INPUT_FILE,
        headers={"Content-Type": "application/json"},
    )

## Deploy to Vertex AI

In [52]:
local_model.push_image()



## Upload Model to Vertex Model Registry

In [56]:
model = aip.Model.upload(
    local_model=local_model,
    display_name=MODEL_DISPLAY_NAME,
    artifact_uri=f"{MODEL_URI}/model",
)

Creating Model
Create Model backing LRO: projects/569083142710/locations/us-central1/models/275372687176499200/operations/1071499163876720640
Model created. Resource name: projects/569083142710/locations/us-central1/models/275372687176499200@1
To use this Model in another session:
model = aiplatform.Model('projects/569083142710/locations/us-central1/models/275372687176499200@1')


## Deploy Model using Vertex Endpoints

In [57]:
endpoint = model.deploy(machine_type="n1-standard-4")

Creating Endpoint
Create Endpoint backing LRO: projects/569083142710/locations/us-central1/endpoints/2482173887983386624/operations/4245129526289367040
Endpoint created. Resource name: projects/569083142710/locations/us-central1/endpoints/2482173887983386624
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/569083142710/locations/us-central1/endpoints/2482173887983386624')
Deploying model to Endpoint : projects/569083142710/locations/us-central1/endpoints/2482173887983386624
Deploy Endpoint model backing LRO: projects/569083142710/locations/us-central1/endpoints/2482173887983386624/operations/4801324080269623296
Endpoint model deployed. Resource name: projects/569083142710/locations/us-central1/endpoints/2482173887983386624


## Test

In [121]:
! curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" https://us-central1-aiplatform.googleapis.com/v1/$endpoint.gca_resource.name:predict -d "@$INPUT_FILE"

{
  "predictions": [
    0
  ],
  "deployedModelId": "5720939319225483264",
  "model": "projects/569083142710/locations/us-central1/models/275372687176499200",
  "modelDisplayName": "pytorch-ecommerce",
  "modelVersionId": "1"
}
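
The request URL in the `curl` command is the regional API host followed by the endpoint resource name and the `:predict` suffix. Here is a small sketch of that composition, using the endpoint resource name printed earlier in this notebook; `predict_url` is a hypothetical helper.

```python
def predict_url(region, endpoint_resource_name):
    """Build the REST predict URL for a Vertex AI endpoint."""
    return (f"https://{region}-aiplatform.googleapis.com/v1/"
            f"{endpoint_resource_name}:predict")

url = predict_url(
    "us-central1",
    "projects/569083142710/locations/us-central1/endpoints/2482173887983386624",
)
print(url)
```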


## Destroy Endpoint

In [137]:
endpoint.undeploy(deployed_model_id=endpoint.gca_resource.deployed_models[0].id)

Undeploying Endpoint model: projects/569083142710/locations/us-central1/endpoints/2482173887983386624
Undeploy Endpoint model backing LRO: projects/569083142710/locations/us-central1/endpoints/2482173887983386624/operations/2089031204685742080
Endpoint model undeployed. Resource name: projects/569083142710/locations/us-central1/endpoints/2482173887983386624


In [138]:
endpoint.delete()

Deleting Endpoint : projects/569083142710/locations/us-central1/endpoints/2482173887983386624
Delete Endpoint  backing LRO: projects/569083142710/locations/us-central1/operations/8368737935100477440
Endpoint deleted. . Resource name: projects/569083142710/locations/us-central1/endpoints/2482173887983386624


In [139]:
model.delete()

Deleting Model : projects/569083142710/locations/us-central1/models/275372687176499200
Delete Model  backing LRO: projects/569083142710/locations/us-central1/operations/7686442591553847296
Model deleted. . Resource name: projects/569083142710/locations/us-central1/models/275372687176499200


In [146]:
!rm -fr $USER_SRC_DIR
!rm -fr instances.json
!rm -fr source