# Vertex AI AutoML Embedding
In this notebook, we'll use Vertex AI AutoML to train a classifier that uses our graph embedding as an additional feature.

In [1]:
!pip install --quiet google-cloud-storage google-cloud-aiplatform python-dotenv

### Import the required libraries

In [2]:
from google.cloud import aiplatform
from dotenv import load_dotenv
import os

## Vertex AI Setup

### Workspace details
Let's define the variables needed to connect to Vertex AI.

In [3]:
load_dotenv('config.env', override=True)
REGION = os.getenv('GCLOUD_REGION')
shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID = shell_output[0]

STORAGE_BUCKET = PROJECT_ID + '-fsi'

os.environ["GCLOUD_PROJECT"] = PROJECT_ID
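
The cell above assumes that `config.env` defines `GCLOUD_REGION` and that `gcloud` has an active project. If either is missing, the failure only surfaces later as a less obvious error. A minimal sketch of a guard is shown here; the helper name `require_setting` is hypothetical and not part of any Google library.

```python
import os


def require_setting(name, environ=None):
    """Return a non-empty environment value, or raise a clear error.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to config.env before continuing")
    return value


# Example with an explicit mapping instead of the real environment.
region = require_setting("GCLOUD_REGION", {"GCLOUD_REGION": " us-central1 "})
print(region)
```

In the notebook, this could replace the bare `os.getenv('GCLOUD_REGION')` call so that a missing setting fails immediately.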

Below, we create Tabular dataset objects for our raw and embedding data. These datasets refer to the Cloud Storage CSV files we uploaded in the previous notebook.

In [4]:
aiplatform.init(project=PROJECT_ID, location=REGION)

baseline_dataset = aiplatform.TabularDataset.create(
    display_name="claims-raw",
    gcs_source=f"gs://{STORAGE_BUCKET}/insurance_fraud/baseline.csv",
)
baseline_dataset.wait()

print(f'\tDataset: "{baseline_dataset.display_name}"')
print(f'\tname: "{baseline_dataset.resource_name}"')

embedding_dataset = aiplatform.TabularDataset.create(
    display_name="claims-embedding",
    gcs_source=f"gs://{STORAGE_BUCKET}/insurance_fraud/embedding.csv",
)
embedding_dataset.wait()

print(f'\tDataset: "{embedding_dataset.display_name}"')
print(f'\tname: "{embedding_dataset.resource_name}"')

Creating TabularDataset
Create TabularDataset backing LRO: projects/934983306258/locations/us-central1/datasets/6969661271960453120/operations/6406713311403966464
TabularDataset created. Resource name: projects/934983306258/locations/us-central1/datasets/6969661271960453120
To use this TabularDataset in another session:
ds = aiplatform.TabularDataset('projects/934983306258/locations/us-central1/datasets/6969661271960453120')
	Dataset: "claims-raw"
	name: "projects/934983306258/locations/us-central1/datasets/6969661271960453120"
Creating TabularDataset
Create TabularDataset backing LRO: projects/934983306258/locations/us-central1/datasets/7059733264507863040/operations/5963108748107972608
TabularDataset created. Resource name: projects/934983306258/locations/us-central1/datasets/7059733264507863040
To use this TabularDataset in another session:
ds = aiplatform.TabularDataset('projects/934983306258/locations/us-central1/datasets/7059733264507863040')
	Dataset: "claims-embedding"
	name: "

## Setup and Run Baseline

Now, let's define the numeric columns in our baseline dataset and the training job that will classify fraudulent claims.

In [5]:
import pandas as pd
DATA_DIR = 'data/'
baseline_cols = list(pd.read_csv(DATA_DIR + 'baseline.csv').columns)
baseline_cols

['provider',
 'potentialFraudInd',
 'inpationFraction',
 'renalDiseaseIndicatorNumEnc',
 'chronicCondAlzheimerEnc',
 'chronicCondHeartfailureEnc',
 'chronicCondKidneyDiseaseEnc',
 'chronicCondCancerEnc',
 'chronicCondObstrPulmonaryEnc',
 'chronicCondDepressionEnc',
 'chronicCondDiabetesEnc',
 'chronicCondIschemicHeartEnc',
 'chronicCondOsteoporasisEnc',
 'chronicCondrheumatoidarthritisEnc',
 'chronicCondstrokeEnc',
 'chronicCondAlzheimerIndEnc',
 'chronicCondHeartfailureIndEnc',
 'chronicCondKidneyDiseaseIndEnc',
 'chronicCondCancerIndEnc',
 'chronicCondObstrPulmonaryIndEnc',
 'chronicCondDepressionIndEnc',
 'chronicCondDiabetesIndEnc',
 'chronicCondIschemicHeartIndEnc',
 'chronicCondOsteoporasisIndEnc',
 'chronicCondrheumatoidarthritisIndEnc',
 'chronicCondstrokeIndEnc',
 'claimCount',
 'avgClaimAmtReimbursed',
 'providerId']

In [6]:
num_baseline_cols = [i for i in baseline_cols if i not in ['provider', 'providerId']]
baseline_column_specs = {column: "numeric" for column in num_baseline_cols}

raw_job = aiplatform.AutoMLTabularTrainingJob(
    display_name="train-fraud-baseline-automl",
    optimization_prediction_type="classification",
    column_specs=baseline_column_specs,
)
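
To make the shape of `column_specs` concrete, here is a small standalone sketch of the same dictionary construction. The helper name `numeric_column_specs` and the three-column sample are illustrative assumptions; the real notebook uses the full column list read from `baseline.csv`.

```python
def numeric_column_specs(columns, id_columns):
    """Map every non-identifier column name to the "numeric" type."""
    excluded = set(id_columns)
    return {name: "numeric" for name in columns if name not in excluded}


# Hypothetical three-column slice of the baseline schema.
sample_columns = ["provider", "claimCount", "avgClaimAmtReimbursed"]
specs = numeric_column_specs(sample_columns, id_columns=["provider", "providerId"])
print(specs)
```

The identifier columns (`provider`, `providerId`) are left out so that the model does not treat them as features.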

In [7]:
raw_model = raw_job.run(
    dataset=baseline_dataset,
    target_column="potentialFraudInd",
    training_fraction_split=0.6,
    validation_fraction_split=0.2,
    test_fraction_split=0.2,
    model_display_name="train-fraud-baseline-automl",
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
    sync=False,
)

View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/8759625932366413824?project=934983306258
AutoMLTabularTrainingJob projects/934983306258/locations/us-central1/trainingPipelines/8759625932366413824 current state:
PipelineState.PIPELINE_STATE_RUNNING


## Setup and Run with Embedding Features

Similarly, let's define the classifier job for the embedding dataset.

In [8]:
embedding_cols = list(pd.read_csv(DATA_DIR + 'embedding.csv').columns)
embedding_cols

['provider',
 'potentialFraudInd',
 'inpationFraction',
 'renalDiseaseIndicatorNumEnc',
 'chronicCondAlzheimerEnc',
 'chronicCondHeartfailureEnc',
 'chronicCondKidneyDiseaseEnc',
 'chronicCondCancerEnc',
 'chronicCondObstrPulmonaryEnc',
 'chronicCondDepressionEnc',
 'chronicCondDiabetesEnc',
 'chronicCondIschemicHeartEnc',
 'chronicCondOsteoporasisEnc',
 'chronicCondrheumatoidarthritisEnc',
 'chronicCondstrokeEnc',
 'chronicCondAlzheimerIndEnc',
 'chronicCondHeartfailureIndEnc',
 'chronicCondKidneyDiseaseIndEnc',
 'chronicCondCancerIndEnc',
 'chronicCondObstrPulmonaryIndEnc',
 'chronicCondDepressionIndEnc',
 'chronicCondDiabetesIndEnc',
 'chronicCondIschemicHeartIndEnc',
 'chronicCondOsteoporasisIndEnc',
 'chronicCondrheumatoidarthritisIndEnc',
 'chronicCondstrokeIndEnc',
 'claimCount',
 'avgClaimAmtReimbursed',
 'providerId',
 'groupCodeEmb_0',
 'groupCodeEmb_1',
 'groupCodeEmb_2',
 'groupCodeEmb_3',
 'groupCodeEmb_4',
 'groupCodeEmb_5',
 'groupCodeEmb_6',
 'groupCodeEmb_7',
 'groupCo



In [9]:
num_embedding_cols = [i for i in embedding_cols if i not in ['provider', 'providerId']]
embedding_column_specs = {column: "numeric" for column in num_embedding_cols}

embedding_job = aiplatform.AutoMLTabularTrainingJob(
    display_name="train-fraud-embeddings-automl",
    optimization_prediction_type="classification",
    optimization_objective="minimize-log-loss",
    column_specs=embedding_column_specs
)



In [10]:
embedding_model = embedding_job.run(
    dataset=embedding_dataset,
    target_column="potentialFraudInd",
    training_fraction_split=0.6,
    validation_fraction_split=0.2,
    test_fraction_split=0.2,
    model_display_name="train-fraud-embeddings-automl",
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
    sync=False,
)

View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/7088579364379426816?project=934983306258
AutoMLTabularTrainingJob projects/934983306258/locations/us-central1/trainingPipelines/7088579364379426816 current state:
PipelineState.PIPELINE_STATE_RUNNING

1000 milli node hours, or one node hour, is the minimum budget that Vertex AI allows. In practice, however, Vertex AI isn't currently respecting that budget: expect this job to run for about two and a half hours.
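
The unit of `budget_milli_node_hours` is easy to misread, so here is the conversion spelled out. The function name `to_node_hours` is a hypothetical helper for illustration only.

```python
MILLI_NODE_HOURS_PER_NODE_HOUR = 1000


def to_node_hours(budget_milli_node_hours):
    """Convert a training budget in milli node hours to node hours."""
    return budget_milli_node_hours / MILLI_NODE_HOURS_PER_NODE_HOUR


# The budget used in both training jobs in this notebook.
print(to_node_hours(1000))  # 1.0
```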

We're going to move on while that runs. You can check on the job later in the Google Cloud Console to see the results. There's a link to the specific job in the output of the cell above.

# Results

Once both jobs have completed, you can compare the training results of the two models. My results are shown below.
The embedding model achieved roughly 10% better F1 score and precision than the raw model.

![Metrics](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/emb_vs_raw-metrics.png)

![Metrics1](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/emb_vs_raw-metrics1.png)

This is the feature importance for both models. As you can see, the embeddings capture more meaningful, relationship-based features than the raw columns do.

![Feature Comparison](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/emb_vs_raw-features.png)

Finally, this is the confusion matrix of the two models. The embedding model produces fewer false positives.

![Confusion Matrix](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/emb_vs_raw-confusion.png)
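
Precision, recall, and F1 can all be derived from the counts in a confusion matrix. The sketch below shows the standard formulas; the counts are made-up numbers for illustration and are not the results of the models trained above.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from confusion matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 frauds caught, 2 false alarms, 2 frauds missed.
precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because precision is `tp / (tp + fp)`, reducing false positives raises precision directly, which is consistent with the embedding model scoring better on both measures.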


# Conclusion
Vertex AI made our job simpler by taking care of a lot of overhead, such as hyperparameter tuning and feature importance. Once you find your best model using Vertex AI, you can also export features, such as the embeddings generated using GDS, to Vertex AI Feature Store, deploy your model to an endpoint, and start making predictions.

Neo4j GDS has more than 70 algorithms in its toolbox, which can help you do graph data science on a memory-optimized platform. While we covered only the FastRP embedding algorithm here, there are a few more, such as GraphSAGE, Node2Vec, and HashGNN. The models we tested could be improved further and can include both raw and embedding features. We will leave it to you to try it out!