# Part 3: Vertex AI AutoML Embedding
In this notebook, we'll use the awesome Vertex AI AutoML to train two models: one on the baseline features alone, and one on the baseline features plus node embeddings.

## Outline
1. Vertex AI Setup
2. Configure and Run Baseline Model
3. Configure and Run Graph Embedding Based Model
4. Results & Conclusions

In [17]:
%%capture
%pip install google-cloud-storage google-cloud-aiplatform python-dotenv

In [18]:
from google.cloud import aiplatform
from dotenv import load_dotenv
import os
import pandas as pd

## Vertex AI Setup

Let's define the variables we need to connect to Vertex AI.

In [19]:
load_dotenv('config.env', override=True)
REGION = os.getenv('GCLOUD_REGION')
shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID = shell_output[0]

STORAGE_BUCKET = PROJECT_ID + '-fsi'

os.environ["GCLOUD_PROJECT"] = PROJECT_ID

baseline_data = os.path.join("gs://", STORAGE_BUCKET, 'insurance_fraud', 'baseline.csv')
embedding_data = os.path.join("gs://", STORAGE_BUCKET, 'insurance_fraud', 'embedding.csv')
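
The `os.path.join` calls above produce valid `gs://` URIs on Linux and macOS, but on Windows the same calls would insert backslashes. If you run this notebook outside a Linux environment, a minimal sketch like the one below keeps the separators as forward slashes. The helper name `gcs_uri` and the project ID `my-project` are hypothetical and used only for illustration.

```python
import posixpath


def gcs_uri(bucket: str, *parts: str) -> str:
    """Build a gs:// URI with forward slashes on every operating system."""
    # posixpath.join always uses "/" regardless of platform, unlike os.path.join.
    return "gs://" + posixpath.join(bucket.strip("/"), *parts)


# Hypothetical project ID, used only for illustration.
example_bucket = "my-project" + "-fsi"
baseline_uri = gcs_uri(example_bucket, "insurance_fraud", "baseline.csv")
print(baseline_uri)  # gs://my-project-fsi/insurance_fraud/baseline.csv
```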

Below, we create TabularDataset objects for the baseline and embedding data. These datasets point to the Cloud Storage CSV files we uploaded in the previous notebooks.

In [20]:
aiplatform.init(project=PROJECT_ID, location=REGION)

baseline_dataset = aiplatform.TabularDataset.create(
    display_name="claims-raw",
    gcs_source=baseline_data,
)
baseline_dataset.wait()

print(f'\tDataset: "{baseline_dataset.display_name}"')
print(f'\tname: "{baseline_dataset.resource_name}"')

embedding_dataset = aiplatform.TabularDataset.create(
    display_name="claims-embedding",
    gcs_source=embedding_data,
)
embedding_dataset.wait()

print(f'\tDataset: "{embedding_dataset.display_name}"')
print(f'\tname: "{embedding_dataset.resource_name}"')

Creating TabularDataset
Create TabularDataset backing LRO: projects/803648085855/locations/us-west1/datasets/133770982681739264/operations/1347999056630120448
TabularDataset created. Resource name: projects/803648085855/locations/us-west1/datasets/133770982681739264
To use this TabularDataset in another session:
ds = aiplatform.TabularDataset('projects/803648085855/locations/us-west1/datasets/133770982681739264')
	Dataset: "claims-raw"
	name: "projects/803648085855/locations/us-west1/datasets/133770982681739264"
Creating TabularDataset
Create TabularDataset backing LRO: projects/803648085855/locations/us-west1/datasets/6893674023364853760/operations/8175456091723792384
TabularDataset created. Resource name: projects/803648085855/locations/us-west1/datasets/6893674023364853760
To use this TabularDataset in another session:
ds = aiplatform.TabularDataset('projects/803648085855/locations/us-west1/datasets/6893674023364853760')
	Dataset: "claims-embedding"
	name: "projects/803648085855/loc

## Configure and Run Baseline Model

Now, let's define the numeric columns in our baseline dataset and the training job that will help us classify fraudulent claims.

In [21]:
baseline_cols = list(pd.read_csv(baseline_data).columns)
baseline_cols

['provider',
 'potentialFraudInd',
 'renalDiseaseIndicatorEnc',
 'chronicCondAlzheimerEnc',
 'chronicCondHeartfailureEnc',
 'chronicCondKidneyDiseaseEnc',
 'chronicCondCancerEnc',
 'chronicCondObstrPulmonaryEnc',
 'chronicCondDepressionEnc',
 'chronicCondDiabetesEnc',
 'chronicCondIschemicHeartEnc',
 'chronicCondOsteoporasisEnc',
 'chronicCondrheumatoidarthritisEnc',
 'chronicCondstrokeEnc',
 'claimCount']

In [22]:
num_baseline_cols = [i for i in baseline_cols if i not in ['provider', 'providerId']]
baseline_column_specs = {column: "numeric" for column in num_baseline_cols}

raw_job = aiplatform.AutoMLTabularTrainingJob(
    display_name="train-fraud-baseline-automl",
    optimization_prediction_type="classification",
    column_specs=baseline_column_specs,
)
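
The column filter above excludes `providerId` even though that column does not appear in the baseline header. That is harmless, and it means the same logic can serve both datasets. As a rough sketch, the pattern can be wrapped in a small helper; the name `build_column_specs` is our own and not part of the Vertex AI SDK.

```python
from typing import Dict, Iterable


def build_column_specs(columns: Iterable[str], exclude: Iterable[str]) -> Dict[str, str]:
    """Map every column not in `exclude` to the "numeric" type."""
    excluded = set(exclude)
    # Names in `exclude` that are absent from `columns` are simply ignored.
    return {col: "numeric" for col in columns if col not in excluded}


# A shortened, illustrative column list (not the full dataset header).
sample_cols = ["provider", "potentialFraudInd", "claimCount"]
specs = build_column_specs(sample_cols, exclude=["provider", "providerId"])
print(specs)  # {'potentialFraudInd': 'numeric', 'claimCount': 'numeric'}
```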

After that, we can run the training job asynchronously.

In [23]:
raw_model = raw_job.run(
    dataset=baseline_dataset,
    target_column="potentialFraudInd",
    training_fraction_split=0.6,
    validation_fraction_split=0.2,
    test_fraction_split=0.2,
    model_display_name="train-fraud-baseline-automl",
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
    sync=False
)
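
The run above uses a 60/20/20 split. Since a training job takes hours, it can be worth checking the fractions locally before launching. The sketch below is a local sanity check of our own, not a rule taken from the Vertex AI SDK, and the function name is hypothetical.

```python
import math


def check_split_fractions(training: float, validation: float, test: float) -> None:
    """Raise ValueError unless each fraction is in [0, 1] and they sum to 1."""
    fractions = {"training": training, "validation": validation, "test": test}
    for name, value in fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} fraction {value} is outside [0, 1]")
    total = sum(fractions.values())
    # Compare with a tolerance because float sums can carry rounding error.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"split fractions sum to {total}, expected 1.0")


check_split_fractions(0.6, 0.2, 0.2)  # passes silently
```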

## Configure and Run Graph Embedding Based Model

Similarly, let's define the classifier job for the embedding dataset.

In [24]:
embedding_cols = list(pd.read_csv(embedding_data).columns)
embedding_cols

['provider',
 'potentialFraudInd',
 'renalDiseaseIndicatorEnc',
 'chronicCondAlzheimerEnc',
 'chronicCondHeartfailureEnc',
 'chronicCondKidneyDiseaseEnc',
 'chronicCondCancerEnc',
 'chronicCondObstrPulmonaryEnc',
 'chronicCondDepressionEnc',
 'chronicCondDiabetesEnc',
 'chronicCondIschemicHeartEnc',
 'chronicCondOsteoporasisEnc',
 'chronicCondrheumatoidarthritisEnc',
 'chronicCondstrokeEnc',
 'claimCount',
 'providerId',
 'groupCodeEmb_0',
 'groupCodeEmb_1',
 'groupCodeEmb_2',
 'groupCodeEmb_3',
 'groupCodeEmb_4',
 'groupCodeEmb_5',
 'groupCodeEmb_6',
 'groupCodeEmb_7',
 'groupCodeEmb_8',
 'groupCodeEmb_9',
 'groupCodeEmb_10',
 'groupCodeEmb_11',
 'groupCodeEmb_12',
 'groupCodeEmb_13',
 'groupCodeEmb_14',
 'groupCodeEmb_15',
 'groupCodeEmb_16',
 'groupCodeEmb_17',
 'groupCodeEmb_18',
 'groupCodeEmb_19',
 'groupCodeEmb_20',
 'groupCodeEmb_21',
 'groupCodeEmb_22',
 'groupCodeEmb_23',
 'groupCodeEmb_24',
 'groupCodeEmb_25',
 'groupCodeEmb_26',
 'groupCodeEmb_27',
 'groupCodeEmb_28',
 'gro

In [25]:
num_embedding_cols = [i for i in embedding_cols if i not in ['provider', 'providerId']]
embedding_column_specs = {column: "numeric" for column in num_embedding_cols}

embedding_job = aiplatform.AutoMLTabularTrainingJob(
    display_name="train-fraud-embeddings-automl",
    optimization_prediction_type="classification",
    column_specs=embedding_column_specs
)
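
With this many embedding columns, a quick sanity check is to count how many dimensions each embedding contributes. The sketch below assumes embedding columns follow the `<prefix>Emb_<index>` naming seen in `groupCodeEmb_0`; the column listing above is truncated, so any other prefixes are an assumption. The helper name is hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumes embedding columns are named "<prefix>Emb_<index>", as in groupCodeEmb_0.
_EMB_PATTERN = re.compile(r"^(?P<prefix>.+Emb)_(?P<index>\d+)$")


def embedding_dimensions(columns: Iterable[str]) -> Dict[str, int]:
    """Count how many columns belong to each embedding prefix."""
    counts = Counter()
    for col in columns:
        match = _EMB_PATTERN.match(col)
        if match:
            counts[match.group("prefix")] += 1
    return dict(counts)


# Shortened, illustrative column list.
sample_cols = ["provider", "claimCount", "groupCodeEmb_0", "groupCodeEmb_1", "groupCodeEmb_2"]
dims = embedding_dimensions(sample_cols)
print(dims)  # {'groupCodeEmb': 3}
```

Running it on `embedding_cols` should show one entry per embedding, each with the dimension you configured when generating it.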

Now run the training job.

In [26]:
embedding_model = embedding_job.run(
    dataset=embedding_dataset,
    target_column="potentialFraudInd",
    training_fraction_split=0.6,
    validation_fraction_split=0.2,
    test_fraction_split=0.2,
    model_display_name="train-fraud-embeddings-automl",
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
    sync=False
)

View Training:
https://console.cloud.google.com/ai/platform/locations/us-west1/training/5303439360996147200?project=803648085855
View Training:
https://console.cloud.google.com/ai/platform/locations/us-west1/training/7645311167228805120?project=803648085855
AutoMLTabularTrainingJob projects/803648085855/locations/us-west1/trainingPipelines/5303439360996147200 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/803648085855/locations/us-west1/trainingPipelines/7645311167228805120 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/803648085855/locations/us-west1/trainingPipelines/5303439360996147200 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/803648085855/locations/us-west1/trainingPipelines/7645311167228805120 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/803648085855/locations/us-west1/trainingPipelines/1119595307168956416 current state:

1000 milli node hours, or one node hour, is the minimum budget that Vertex AI allows. These jobs will probably run for two to two and a half hours.
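
Because the budget is given in thousandths of a node hour, a tiny conversion helper can make larger budgets easier to read. This is only a sketch, and the function name is our own.

```python
def node_hours_to_milli(node_hours: float) -> int:
    """Convert node hours to the milli node hours used by budget_milli_node_hours."""
    return int(round(node_hours * 1000))


print(node_hours_to_milli(1))    # 1000, the budget used in this notebook
print(node_hours_to_milli(2.5))  # 2500
```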

We're going to move on while that runs. You can check on the job later in the Google Cloud Console to see the results. There's a link to the specific job in the output of the cells above.
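
If you would rather wait in the notebook than check the console, a small polling loop is one option. The sketch below is a generic helper of our own, not an SDK function: it polls any callable that returns a state name and stops at a terminal state. The terminal names come from the Vertex AI `PipelineState` enum, the same enum that prints `PIPELINE_STATE_RUNNING` in the output above. A fake job stands in for the real one here so the example runs without cloud access.

```python
import time
from typing import Callable

# Terminal state names from the PipelineState enum (PIPELINE_STATE_* values).
TERMINAL_STATES = {
    "PIPELINE_STATE_SUCCEEDED",
    "PIPELINE_STATE_FAILED",
    "PIPELINE_STATE_CANCELLED",
}


def wait_for_terminal_state(
    get_state: Callable[[], str],
    poll_seconds: float = 60.0,
    max_polls: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns a terminal state name, then return it."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


# Demonstration with a fake job that finishes on the third poll.
fake_states = iter(
    ["PIPELINE_STATE_RUNNING", "PIPELINE_STATE_RUNNING", "PIPELINE_STATE_SUCCEEDED"]
)
result = wait_for_terminal_state(lambda: next(fake_states), sleep=lambda _: None)
print(result)  # PIPELINE_STATE_SUCCEEDED
```

With the real jobs, something like `wait_for_terminal_state(lambda: raw_job.state.name)` may work, assuming the job's `state` property is readable once the pipeline has been created in your SDK version.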

## Results

Once both jobs complete, you can compare the training results of the two models. My results are shown below.


### Results for Baseline data (without embeddings)
Here are the results for our Baseline data. 
![Baseline Metrics](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/baseline-metrics.png)

![Baseline Confusion Matrix](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/baseline-confusion.png)

These are the features ranked by importance:
![Baseline Features](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/baseline-features.png)

Let's see what value our embeddings add:
![Embedding Metrics](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/embedding-metrics.png)

![Embedding Confusion Matrix](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/embedding-confusion.png)

![Embedding Features](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/embedding-features.png)

As you can see, the embedding model fares much better, with a PR AUC of 65% compared to the baseline's 50%. Also, in terms of feature importance, many of the group code and diagnosis code embeddings rank above the raw features.
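
To put the PR AUC gap in perspective, it helps to separate the absolute gain (in percentage points) from the relative gain. A minimal sketch using the two values quoted above; the function name is our own.

```python
def compare_metric(baseline: float, candidate: float) -> dict:
    """Return the absolute and relative improvement of candidate over baseline."""
    absolute = candidate - baseline
    relative = absolute / baseline
    # Round to hide floating-point noise such as 0.15000000000000002.
    return {"absolute": round(absolute, 4), "relative": round(relative, 4)}


# PR AUC values quoted in the text above.
gain = compare_metric(baseline=0.50, candidate=0.65)
print(gain)  # {'absolute': 0.15, 'relative': 0.3}
```

So the embeddings add 15 percentage points of PR AUC, which is a 30% relative improvement over the baseline.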

## Conclusion
Vertex AI made our job simpler by taking care of a lot of overhead, such as hyperparameter tuning and computing feature importance. Once you find your best model using Vertex AI, you can also export features such as the embeddings generated with GDS to Vertex AI Feature Store, deploy your model to an endpoint, and start making predictions.

Neo4j GDS has more than 70 algorithms in its toolbox, which can help you do graph data science on a memory-optimised platform. While we covered only the FastRP embedding algorithm here, there are a few more, such as GraphSAGE, Node2Vec, and HashGNN. The models and theories we tested could be improved further. We will leave it to you to try it out!