# Vertex AI AutoML Embedding
In this notebook, we'll use the awesome Vertex AI AutoML to train two fraud classifiers, one on the raw claims data and one on our graph embeddings, so we can compare them.

In [None]:
!pip install --quiet google-cloud-storage
!pip install --quiet google-cloud-aiplatform

## Vertex AI Setup

### Import the required libraries

In [None]:
from google.cloud import aiplatform
import os

### Workspace details
Let's define the variables we need to connect to Vertex AI.

In [None]:
# Edit this variable!
REGION = 'us-west1'
shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID = shell_output[0]

STORAGE_BUCKET = PROJECT_ID + '-fsi'

os.environ["GCLOUD_PROJECT"] = PROJECT_ID

Below, we create Tabular dataset objects for our raw and embedding data. These datasets point to the CSV files we uploaded to Cloud Storage in the previous notebook.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION)

raw_dataset = aiplatform.TabularDataset.create(
    display_name="claims-raw",
    gcs_source=os.path.join("gs://", STORAGE_BUCKET, 'insurance_fraud', 'raw.csv'),
)
raw_dataset.wait()

print(f'\tDataset: "{raw_dataset.display_name}"')
print(f'\tname: "{raw_dataset.resource_name}"')

embedding_dataset = aiplatform.TabularDataset.create(
    display_name="claims-embedding",
    gcs_source=os.path.join("gs://", STORAGE_BUCKET, 'insurance_fraud', 'embedding.csv'),
)
embedding_dataset.wait()

print(f'\tDataset: "{embedding_dataset.display_name}"')
print(f'\tname: "{embedding_dataset.resource_name}"')
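The cell above builds the `gs://` paths with `os.path.join`, which works on Linux-based notebook runtimes but uses the host OS path separator, so it may produce backslashes on Windows. As a minimal sketch, a small helper can build the URI with forward slashes only. The helper name `gcs_uri` and the bucket name in the example are hypothetical and not part of the original notebook.

```python
def gcs_uri(bucket, *parts):
    """Build a gs:// URI with forward slashes, regardless of the host OS.

    `gcs_uri` is a hypothetical helper for illustration only.
    """
    # Drop empty parts and stray slashes so we never emit "//" inside the path
    cleaned = [part.strip("/") for part in parts if part]
    return "gs://" + "/".join([bucket.strip("/")] + cleaned)


# Example with a made-up bucket name
print(gcs_uri("my-project-fsi", "insurance_fraud", "raw.csv"))
```

The result could then be passed as `gcs_source` in place of the `os.path.join(...)` call.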

Now, let's define the numeric columns in our raw dataset and the training job that will classify fraudulent claims.

In [None]:
num_column_names = [
       'claim_amt_reimbursed', 'claim_diag_code_1', 'claim_diag_code_2',
       'claim_diag_code_3', 'claim_diag_code_4', 'claim_diag_code_5',
       'claim_diag_code_6', 'claim_diag_code_7', 'claim_diag_code_8',
       'claim_diag_code_9', 'claim_diag_code_10', 'claim_procedure_code_1',
       'claim_procedure_code_2', 'claim_procedure_code_3',
       'claim_procedure_code_4', 'claim_procedure_code_5',
       'claim_procedure_code_6', 'deductible_amt_paid',
       'claim_admit_diagnosis_code', 
       'diag_group_code', 'gender', 'race',
       'renal_disease_indicator', 'state', 'county',
       'num_of_months_part_a_cov', 'num_of_months_part_b_Cov',
       'chronic_cond_alzheimer', 'chronic_cond_heartfailure',
       'chronic_cond_kidneydisease', 'chronic_cond_cancer',
       'chronic_cond_obstrpulmonary', 'chronic_cond_depression',
       'chronic_cond_diabetes', 'chronic_cond_ischemicheart',
       'chronic_cond_osteoporasis', 'chronic_cond_rheumatoidarthritis',
       'chronic_cond_stroke', 'ip_annual_reimbursement_amt',
       'ip_annual_deductible_amt', 'op_annual_reimbursement_amt',
       'op_annual_deductible_amt', 'target']
column_specs = {column: "numeric" for column in num_column_names}

raw_job = aiplatform.AutoMLTabularTrainingJob(
    display_name="train-fraud-raw-automl",
    optimization_prediction_type="classification",
    column_specs=column_specs,
)

Similarly, let's define the classifier job for the embedding dataset

In [None]:
EMBEDDING_DIMENSION = 32
embedding_column_names = ["embedding_{}".format(i) for i in range(EMBEDDING_DIMENSION)]
# 'id' identifies a row and is not a feature, so it is left out of the column specs
other_column_names = ['id']
all_columns = embedding_column_names
column_specs = {column: "numeric" for column in all_columns}

embedding_job = aiplatform.AutoMLTabularTrainingJob(
    display_name="train-fraud-embeddings-automl",
    optimization_prediction_type="classification",
    column_specs=column_specs,
)
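A misspelled column name in `column_specs` is easy to miss until the training job is already queued. As a rough sketch, you could compare the keys of the spec against the CSV header before creating the job. The helper `missing_columns` and the tiny in-memory CSV are assumptions made for illustration; in practice you would read the header of the file you uploaded.

```python
import csv
import io


def missing_columns(column_specs, csv_text):
    """Return the spec columns that do not appear in the CSV header.

    `missing_columns` is a hypothetical helper, not part of the Vertex AI SDK.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    return sorted(set(column_specs) - set(header))


# Small dimension and made-up rows, purely for illustration
EXAMPLE_DIMENSION = 4
example_specs = {f"embedding_{i}": "numeric" for i in range(EXAMPLE_DIMENSION)}
example_csv = "id,embedding_0,embedding_1,embedding_2,target\n1,0.1,0.2,0.3,0\n"

print(missing_columns(example_specs, example_csv))  # ['embedding_3']
```

An empty list means every column named in the spec exists in the file.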

Let's start both jobs now.

In [None]:
raw_model = raw_job.run(
    dataset=raw_dataset,
    target_column="target",
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    model_display_name="insurance-fraud-prediction-model-raw",
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
    sync=False,
)
embedding_model = embedding_job.run(
    dataset=embedding_dataset,
    target_column="target",
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    model_display_name="insurance-fraud-prediction-model-embedding",
    disable_early_stopping=False,
    budget_milli_node_hours=1000,
    sync=False,
)

1000 milli node hours, or one node hour, is the minimum budget that Vertex AI allows. However, at the time of writing, Vertex AI doesn't strictly respect that budget. Expect the jobs to run for about two and a half hours.
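The unit conversion behind `budget_milli_node_hours` can be sketched as below. The function name is hypothetical; the only fact used is that 1000 milli node hours equal one node hour.

```python
def milli_node_hours_to_node_hours(budget_milli_node_hours):
    """Convert a budget in milli node hours to node hours (1000 milli = 1)."""
    return budget_milli_node_hours / 1000


per_job = milli_node_hours_to_node_hours(1000)
print(per_job)      # budget of a single job, in node hours
print(2 * per_job)  # combined budget of the raw and embedding jobs
```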

We're going to move on while that runs. You can check on the job later in the Google Cloud Console to see the results. There's a link to the specific job in the output of the cell above.

# Results

Once the jobs complete, you can compare the training results of the two models. My results are shown below.
The embedding model's F1 score and precision were roughly 10% higher than the raw model's.

![Metrics](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/emb_vs_raw-metrics.png)

![Metrics1](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/emb_vs_raw-metrics1.png)

This is the feature importance for both models. As you can see, the embeddings, which are derived from relationships in the graph, capture more meaningful features than the raw columns.

![Feature Comparison](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/emb_vs_raw-features.png)

Finally, this is the confusion matrix of the two models. The embedding model has fewer false positives.

![Confusion Matrix](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/emb_vs_raw-confusion.png)
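To see how fewer false positives translate into higher precision and F1, here is a minimal sketch that derives both metrics from confusion-matrix counts. The counts are made up for illustration and are not the results shown in the screenshots; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: both models find the same frauds (tp, fn),
# but the second one raises fewer false alarms (fp)
raw_like = classification_metrics(tp=60, fp=40, fn=40, tn=860)
embedding_like = classification_metrics(tp=60, fp=20, fn=40, tn=880)

print(raw_like)
print(embedding_like)
```

With recall held constant, halving the false positives raises precision, and F1 rises with it because F1 is the harmonic mean of the two.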


# Conclusion
Vertex AI made our job simpler by taking care of a lot of overhead, such as hyperparameter tuning and feature importance. Once you find your best model using Vertex AI, you can also export features like the embeddings generated with GDS to Vertex AI Feature Store, deploy your model endpoints and start making predictions.

Neo4j GDS has more than 70 algorithms in its toolbox, which can help you do graph data science on a memory-optimised platform. While we covered only the FastRP embedding algorithm here, there are a few more, such as GraphSAGE, Node2Vec and HashGNN. The models we tested could be improved further, for example by including both raw and embedding features. We'll leave that for you to try!
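As a starting point for combining the two feature sets, here is a rough sketch that joins raw and embedding rows on a shared key using only the standard library. It assumes both CSV files carry the same `id` column, which this notebook does not confirm for the raw file; the helper `join_on_id` and the sample rows are hypothetical.

```python
import csv
import io


def join_on_id(raw_text, embedding_text, key="id"):
    """Inner-join two CSV texts on `key`, keeping raw values on name clashes.

    Hypothetical helper; assumes both files contain the `key` column.
    """
    embedding_rows = {
        row[key]: row for row in csv.DictReader(io.StringIO(embedding_text))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        embedding = embedding_rows.get(row[key])
        if embedding is None:
            continue  # skip raw rows that have no embedding
        combined = dict(row)
        for name, value in embedding.items():
            # Avoid duplicating shared columns such as the key or the target
            if name not in combined:
                combined[name] = value
        merged.append(combined)
    return merged


# Made-up sample data for illustration
raw_csv = "id,claim_amt_reimbursed,target\n1,100,0\n2,250,1\n"
embedding_csv = "id,embedding_0,embedding_1,target\n1,0.1,0.2,0\n3,0.5,0.6,1\n"

for row in join_on_id(raw_csv, embedding_csv):
    print(row)
```

The merged rows could be written back to a CSV in Cloud Storage and used to create a third Tabular dataset.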