# Create Embedding
In this notebook, we'll connect to a Neo4j instance, load data according to a schema, and compute graph embeddings.  The notebook then exports the embeddings to pandas and writes them to Cloud Storage as CSV files.

## Using the Neo4j API
Let's connect to our Neo4j deployment.  First, install the Neo4j Graph Data Science client, along with the Google Cloud packages we'll need later.

In [28]:
!pip install graphdatascience --quiet
!pip install --quiet google-cloud-storage
!pip install --quiet google.cloud.aiplatform

Now, you're going to need the connection string and credentials from the deployment you created above.

In [31]:
# Edit these variables! 
DB_URL = 'neo4j+s://b9579464.databases.neo4j.io' #'neo4j+s://URL.databases.neo4j.io'
DB_PASS = 'YOUR_PASSWORD'  # the password for your deployment; avoid committing real credentials

# You can leave this default
DB_USER = 'neo4j'

Let's create a GDS connection object using the variables defined above.

In [32]:
from graphdatascience import GraphDataScience
gds = GraphDataScience(DB_URL, auth=(DB_USER, DB_PASS))
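
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One option, sketched below, is to read them from environment variables instead. The variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` and the helper `load_neo4j_settings` are assumptions made for this example; the GDS client does not require them.

```python
import os


def load_neo4j_settings(env=None):
    """Read connection settings from environment variables.

    The variable names (NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD) are
    hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(k)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "url": env["NEO4J_URI"],
        "user": env.get("NEO4J_USER", "neo4j"),  # same default as DB_USER
        "password": env["NEO4J_PASSWORD"],
    }


# Usage with the GDS client (needs a reachable database):
# settings = load_neo4j_settings()
# gds = GraphDataScience(settings["url"], auth=(settings["user"], settings["password"]))
```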

# Explore & Load Data

The dataset we are going to use is a publicly available [Kaggle dataset](https://www.kaggle.com/datasets/rohitrox/healthcare-provider-fraud-detection-analysis).  It contains healthcare expense claims with anonymised beneficiaries, claims and providers.  We've filtered and cleaned the data; the cleaned version can be downloaded [here](https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv).

We will predict potentially fraudulent providers based on the claims they file. We will use a GDS embedding algorithm to capture fraudulent patterns in providers' claims, which can then help anticipate the future behaviour of providers.

The dataset has

- **Inpatient Data**: 
Contains claims filed for patients who were admitted to hospital. It also provides additional details such as admission and discharge dates and admit and diagnosis codes.
- **Outpatient Data**
- **Beneficiary Details Data**: 
Contains beneficiary KYC details such as health conditions, the region they belong to, etc.


Before loading data into any database, we usually have to come up with a schema and implement it in the database. Graph data modelling is an important step with Neo4j, and you define the model based on the questions you would like to ask of the graph.

If you are interested, you can explore the data using Pandas

In [7]:
import pandas as pd
import numpy as np

# Load the cleaned dataset, reading every column as a string
raw_df = pd.read_csv('https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv', 
                     index_col=False, dtype='unicode')
raw_df.head()

Unnamed: 0,bene_id,claim_id,claim_start_dt,claim_end_dt,provider,claim_amt_reimbursed,attending_physician,operating_physician,other_physician,claim_diag_code_1,...,chronic_cond_diabetes,chronic_cond_ischemicheart,chronic_cond_osteoporasis,chronic_cond_rheumatoidarthritis,chronic_cond_stroke,ip_annual_reimbursement_amt,ip_annual_deductible_amt,op_annual_reimbursement_amt,op_annual_deductible_amt,is_fraud
0,BENE11002,CLM624349,2009-10-11,2009-10-11,PRV56011,30,PHY326117,,,78943,...,1,1,1,1,1,0,0,30,50,1
1,BENE11004,CLM121801,2009-01-06,2009-01-06,PRV56011,40,PHY334319,,,71988,...,0,0,0,0,1,0,0,1810,760,1
2,BENE11004,CLM150998,2009-01-22,2009-01-22,PRV56011,200,PHY403831,,,82382,...,0,0,0,0,1,0,0,1810,760,1
3,BENE11004,CLM173224,2009-02-03,2009-02-03,PRV56011,20,PHY339887,,,20381,...,0,0,0,0,1,0,0,1810,760,1
4,BENE11004,CLM224741,2009-03-03,2009-03-03,PRV56011,40,PHY345721,,,V6546,...,0,0,0,0,1,0,0,1810,760,1
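
Since the goal is to flag providers, one quick exploration is to count claims per provider and look at the share flagged as fraud. The sketch below uses a tiny made-up frame so that it runs on its own; only the column names (`claim_id`, `provider`, `is_fraud`) come from the dataset, and the rows are purely illustrative.

```python
import pandas as pd

# Made-up rows for illustration; values are strings, as with dtype='unicode'
sample = pd.DataFrame(
    {
        "claim_id": ["C1", "C2", "C3", "C4"],
        "provider": ["P1", "P1", "P2", "P2"],
        "is_fraud": ["1", "1", "0", "0"],
    }
)

# Claims per provider and the fraction of those claims flagged as fraud
summary = (
    sample.assign(is_fraud=sample["is_fraud"].astype(int))
    .groupby("provider")
    .agg(claims=("claim_id", "count"), fraud_share=("is_fraud", "mean"))
    .sort_values("claims", ascending=False)
)
print(summary)
```

Swapping `sample` for `raw_df` should give the same summary over the real data.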


Let's go with the following schema, as our questions focus on the relationships between claims, providers, physicians and diagnoses.

![image info](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/schema.png)

### Create Constraints
To ensure uniqueness of nodes, let's create some constraints. These ensure that no duplicate nodes are created, and the indexes that back them speed up the CSV loading process, especially for `MERGE` statements, since `MERGE` creates a new node only if a matching one doesn't already exist.

In [33]:
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Beneficiary) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:RenalDisease) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:IschemicHeartDisease) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Osteoporosis) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:RheumatoidArthritis) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Stroke) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Diabetes) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Depression) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:ObstructivePulmonaryDisease) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Cancer) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:KidneyDisease) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:HeartFailure) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Alzheimer) REQUIRE (n.name) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:County) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:State) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Claim) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:DiagnosisGroup) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:AdmitDiagnosis) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Physician) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Provider) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Diagnosis) REQUIRE (n.id) IS NODE KEY;')
gds.run_cypher('CREATE CONSTRAINT IF NOT EXISTS FOR (n:Procedure) REQUIRE (n.id) IS NODE KEY;')
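
The statements above differ only in the label and the key property, so they could also be generated from two lists. This is just a sketch of that idea; the names `NAME_KEYED`, `ID_KEYED` and `constraint_statements` are made up for the example.

```python
# Labels keyed on `name` and labels keyed on `id`, mirroring the constraints above
NAME_KEYED = [
    "RenalDisease", "IschemicHeartDisease", "Osteoporosis", "RheumatoidArthritis",
    "Stroke", "Diabetes", "Depression", "ObstructivePulmonaryDisease",
    "Cancer", "KidneyDisease", "HeartFailure", "Alzheimer",
]
ID_KEYED = [
    "Beneficiary", "County", "State", "Claim", "DiagnosisGroup",
    "AdmitDiagnosis", "Physician", "Provider", "Diagnosis", "Procedure",
]


def constraint_statements():
    """Build one CREATE CONSTRAINT statement per label."""
    keys = {label: "name" for label in NAME_KEYED}
    keys.update({label: "id" for label in ID_KEYED})
    return [
        f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{label}) REQUIRE (n.{prop}) IS NODE KEY;"
        for label, prop in keys.items()
    ]


statements = constraint_statements()

# With a live connection you could then run:
# for stmt in statements:
#     gds.run_cypher(stmt)
```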


### Load Data

Now, we're going to take data from the Google Cloud Storage bucket and import it into Neo4j.  There are a few different ways to do this.  We'll use plain `LOAD CSV` statements, run via the GDS Python API.

The Neo4j [Data Importer](https://data-importer.neo4j.io/) is another option.  It's a great graphical way to import data.  However, the LOAD CSV option we're using makes it really easy to pull directly from Cloud Storage, so is probably a better choice for what we need.

Let's start by creating the `Beneficiary`, `Claim`, `Provider`, `County` and `State` nodes.

In [34]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS FROM 'https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv' AS row
    
    MERGE (bene:Beneficiary {id: row.bene_id})
    ON CREATE SET
        bene.dob = row.dob, 
        bene.gender = row.gender, 
        bene.race = row.race, 
        bene.ipAnnualReimbursementAmt = row.ip_annual_reimbursement_amt, 
        bene.opAnnualReimbursementAmt = row.op_annual_reimbursement_amt,
        bene.ipAnnualDeductibleAmt = row.ip_annual_deductible_amt, 
        bene.opAnnualDeductibleAmt = row.op_annual_deductible_amt, 
        bene.partACovMonths = row.num_of_months_part_a_cov, 
        bene.partBCovMonths = row.num_of_months_part_b_Cov,
        bene.dod = row.dod
  
    MERGE (provider:Provider {id: row.provider})
    MERGE (county:County {id: row.county})
    MERGE (state:State {id: row.state})
    
    CREATE (claim:Claim {id: row.claim_id, startDate: row.claim_start_dt, 
        endDate: row.claim_end_dt, reimbursedAmt: row.claim_amt_reimbursed,
      isFraud: row.is_fraud})
    
    FOREACH(_ IN CASE WHEN trim(row.discharge_dt) <> "" THEN [1] ELSE [] END | SET claim.dischargeDate = row.discharge_dt)
    FOREACH(_ IN CASE WHEN trim(row.admission_dt) <> "" THEN [1] ELSE [] END | SET claim.admitDate = row.admission_dt)
    FOREACH(_ IN CASE WHEN trim(row.deductible_amt_paid) <> "" THEN [1] ELSE [] END | SET claim.deductibleAmtPaid = row.deductible_amt_paid)

  """
)

As per the schema we agreed upon earlier, let's connect the nodes we just created.

In [35]:
result = gds.run_cypher(
  """
  LOAD CSV WITH HEADERS FROM 'https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv' AS row
    MATCH (bene:Beneficiary {id: row.bene_id})
    MATCH (claim:Claim {id: row.claim_id})
    MATCH (provider:Provider {id: row.provider})
    MATCH (county:County {id: row.county})
    MATCH (state:State {id: row.state})

    MERGE (county)<-[:LOCATED_AT]-(bene)
    CREATE (bene)-[:FILED_CLAIM]->(claim)-[:PROVIDED_BY]->(provider)
    MERGE (state)<-[:PART_OF]-(county)
  """
)

Time to create the `Physician`, `AdmitDiagnosis` and `DiagnosisGroup` nodes and relate them to claims and providers.

In [36]:
result = gds.run_cypher(
  """
  LOAD CSV WITH HEADERS FROM 'https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv' AS row
    MATCH (provider:Provider {id: row.provider})
    MATCH (claim:Claim {id: row.claim_id})

    FOREACH(_ IN CASE WHEN trim(row.attending_physician) <> "" THEN [1] ELSE [] END |
      MERGE (attPhysician:Physician {id: row.attending_physician}) 
      MERGE (provider)<-[:WORKS_FOR]-(attPhysician)
      MERGE (attPhysician)<-[:ATTENDED_BY]-(claim)
    )
    FOREACH(_ IN CASE WHEN trim(row.operating_physician) <> "" THEN [1] ELSE [] END |
      MERGE (opPhysician:Physician {id: row.operating_physician}) 
      MERGE (provider)<-[:WORKS_FOR]-(opPhysician)
      MERGE (opPhysician)<-[:OPERATED_BY]-(claim)
    )
    FOREACH(_ IN CASE WHEN trim(row.other_physician) <> "" THEN [1] ELSE [] END |
      MERGE (othPhysician:Physician {id: row.other_physician}) 
      MERGE (provider)<-[:WORKS_FOR]-(othPhysician)
      MERGE (othPhysician)<-[:ALSO_ATTENDED_BY]-(claim)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_admit_diagnosis_code) <> "" THEN [1] ELSE [] END |
      MERGE (admitDiagnosis:AdmitDiagnosis {id: row.claim_admit_diagnosis_code})
      MERGE (claim)-[:HAS_ADMIT_DIAGNOSIS]->(admitDiagnosis)
    )
    FOREACH(_ IN CASE WHEN trim(row.diag_group_code) <> "" THEN [1] ELSE [] END |
      MERGE (diagnosisGroup:DiagnosisGroup {id: row.diag_group_code})
      MERGE (claim)-[:HAS_DIAGNOSIS_GROUP]->(diagnosisGroup)
    )
  """)

As per the schema, claims are related to diagnoses and procedures. Let's deal with those nodes and relationships now.

In [37]:
result = gds.run_cypher(
  """
  LOAD CSV WITH HEADERS FROM 'https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv' AS row
    MATCH (claim:Claim {id: row.claim_id})

    FOREACH(_ IN CASE WHEN trim(row.claim_procedure_code_1) <> "" THEN [1] ELSE [] END |
      MERGE (proc1:Procedure {id: row.claim_procedure_code_1})
      MERGE (claim)-[:HAS_PROCEDURE]->(proc1)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_procedure_code_2) <> "" THEN [1] ELSE [] END |
      MERGE (proc2:Procedure {id: row.claim_procedure_code_2})
      MERGE (claim)-[:HAS_PROCEDURE]->(proc2)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_procedure_code_3) <> "" THEN [1] ELSE [] END |
      MERGE (proc3:Procedure {id: row.claim_procedure_code_3})
      MERGE (claim)-[:HAS_PROCEDURE]->(proc3)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_procedure_code_4) <> "" THEN [1] ELSE [] END |
      MERGE (proc4:Procedure {id: row.claim_procedure_code_4})
      MERGE (claim)-[:HAS_PROCEDURE]->(proc4)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_procedure_code_5) <> "" THEN [1] ELSE [] END |
      MERGE (proc5:Procedure {id: row.claim_procedure_code_5})
      MERGE (claim)-[:HAS_PROCEDURE]->(proc5)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_procedure_code_6) <> "" THEN [1] ELSE [] END |
      MERGE (proc6:Procedure {id: row.claim_procedure_code_6})
      MERGE (claim)-[:HAS_PROCEDURE]->(proc6)
    )
    
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_1) <> "" THEN [1] ELSE [] END |
      MERGE (diag1:Diagnosis {id: row.claim_diag_code_1})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag1)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_2) <> "" THEN [1] ELSE [] END |
      MERGE (diag2:Diagnosis {id: row.claim_diag_code_2})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag2)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_3) <> "" THEN [1] ELSE [] END |
      MERGE (diag3:Diagnosis {id: row.claim_diag_code_3})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag3)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_4) <> "" THEN [1] ELSE [] END |
      MERGE (diag4:Diagnosis {id: row.claim_diag_code_4})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag4)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_5) <> "" THEN [1] ELSE [] END |
      MERGE (diag5:Diagnosis {id: row.claim_diag_code_5})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag5)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_6) <> "" THEN [1] ELSE [] END |
      MERGE (diag6:Diagnosis {id: row.claim_diag_code_6})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag6)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_7) <> "" THEN [1] ELSE [] END |
      MERGE (diag7:Diagnosis {id: row.claim_diag_code_7})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag7)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_8) <> "" THEN [1] ELSE [] END |
      MERGE (diag8:Diagnosis {id: row.claim_diag_code_8})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag8)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_9) <> "" THEN [1] ELSE [] END |
      MERGE (diag9:Diagnosis {id: row.claim_diag_code_9})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag9)
    )
    FOREACH(_ IN CASE WHEN trim(row.claim_diag_code_10) <> "" THEN [1] ELSE [] END |
      MERGE (diag10:Diagnosis {id: row.claim_diag_code_10})
      MERGE (claim)-[:HAS_DIAGNOSIS]->(diag10)
    )
  """)

Finally, let's connect our beneficiaries (the patients) with the diseases they suffer from.

In [38]:
result = gds.run_cypher(
  """
  LOAD CSV WITH HEADERS FROM 'https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv' AS row
    MATCH (bene:Beneficiary {id: row.bene_id})

    FOREACH(_ IN CASE WHEN row.chronic_cond_alzheimer = '1' THEN [1] ELSE [] END |
      MERGE (n:Alzheimer {name:'Alzheimer'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_kidneydisease = '1' THEN [1] ELSE [] END |
      MERGE (n:KidneyDisease {name:'Kidney Disease'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_heartfailure = '1' THEN [1] ELSE [] END |
      MERGE (n:HeartFailure {name:'Heart Failure'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_cancer = '1' THEN [1] ELSE [] END |
      MERGE (n:Cancer {name:'Cancer'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_obstrpulmonary = '1' THEN [1] ELSE [] END |
      MERGE (n:ObstructivePulmonaryDisease {name:'Obstructive Pulmonary Disease'}) 
      MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_depression = '1' THEN [1] ELSE [] END |
      MERGE (n:Depression {name:'Depression'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_diabetes = '1' THEN [1] ELSE [] END |
      MERGE (n:Diabetes {name:'Diabetes'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_ischemicheart = '1' THEN [1] ELSE [] END |
      MERGE (n:IschemicHeartDisease {name:'Ischemic Heart Disease'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_osteoporasis = '1' THEN [1] ELSE [] END |
      MERGE (n:Osteoporosis {name:'Osteoporosis'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_rheumatoidarthritis = '1' THEN [1] ELSE [] END |
      MERGE (n:RheumatoidArthritis {name:'Rheumatoid Arthritis'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.chronic_cond_stroke = '1' THEN [1] ELSE [] END |
      MERGE (n:Stroke {name:'Stroke'}) MERGE (bene)-[:HAS_CHRONIC]->(n)
    )
    FOREACH(_ IN CASE WHEN row.renal_disease_indicator = '1' THEN [1] ELSE [] END |
      MERGE (n:RenalDisease {name:'Renal Disease'}) MERGE (bene)-[:HAS_DISEASE]->(n)
    )
  """
)
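
A note on the `FOREACH(_ IN CASE WHEN ... THEN [1] ELSE [] END | ...)` idiom used in the load cells: the `CASE` expression yields a one-element list when the condition holds and an empty list otherwise, so the body runs either once or not at all for each row. Because the procedure and diagnosis blocks differ only in the column name, the query text could be generated rather than typed out. The following is a sketch of that approach; the helpers `conditional_merge` and `build_claim_codes_query` are hypothetical names, not part of any library.

```python
def conditional_merge(column, label, rel_type):
    """Return a FOREACH block that MERGEs a node only when `column` is non-empty."""
    return (
        f'FOREACH(_ IN CASE WHEN trim(row.{column}) <> "" THEN [1] ELSE [] END |\n'
        f"  MERGE (n:{label} {{id: row.{column}}})\n"
        f"  MERGE (claim)-[:{rel_type}]->(n)\n"
        f")"
    )


def build_claim_codes_query(csv_url):
    """Assemble the LOAD CSV query for procedure codes 1-6 and diagnosis codes 1-10."""
    blocks = [
        conditional_merge(f"claim_procedure_code_{i}", "Procedure", "HAS_PROCEDURE")
        for i in range(1, 7)
    ]
    blocks += [
        conditional_merge(f"claim_diag_code_{i}", "Diagnosis", "HAS_DIAGNOSIS")
        for i in range(1, 11)
    ]
    header = (
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row\n"
        "MATCH (claim:Claim {id: row.claim_id})\n"
    )
    return header + "\n".join(blocks)


query = build_claim_codes_query(
    "https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv"
)
# With a live connection: gds.run_cypher(query)
```

Reusing the variable `n` in every block works because variables introduced inside a `FOREACH` are scoped to it, as the chronic-condition cell above already relies on.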

## Graph Data Science
We've got the data inside our database! Let's do some graph data science. This is what a typical GDS workflow looks like inside Neo4j:

![GDS Workflow](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/gds_workflow.png)

As a first step, we're going to use Neo4j Graph Data Science to create an in-memory graph representation of the data.  We'll enhance that representation with features we engineer using a graph embedding.

In [39]:
result = gds.run_cypher(
  """
    CALL gds.graph.project(
      'projection',
      ['Beneficiary','RenalDisease','IschemicHeartDisease','Osteoporosis',
      'RheumatoidArthritis','Stroke','Diabetes','Depression','ObstructivePulmonaryDisease',
      'Cancer','KidneyDisease','HeartFailure','Alzheimer','County','State','Claim',
      'DiagnosisGroup','AdmitDiagnosis','Physician','Provider','Diagnosis','Procedure'],
      {
        LOCATED_AT: {orientation: 'UNDIRECTED'},
        FILED_CLAIM: {orientation: 'UNDIRECTED'},
        PROVIDED_BY: {orientation: 'UNDIRECTED'},
        PART_OF: {orientation: 'UNDIRECTED'},
        WORKS_FOR: {orientation: 'UNDIRECTED'},
        ATTENDED_BY: {orientation: 'UNDIRECTED'},
        OPERATED_BY: {orientation: 'UNDIRECTED'},
        ALSO_ATTENDED_BY: {orientation: 'UNDIRECTED'},
        HAS_ADMIT_DIAGNOSIS: {orientation: 'UNDIRECTED'},
        HAS_DIAGNOSIS_GROUP: {orientation: 'UNDIRECTED'},
        HAS_PROCEDURE: {orientation: 'UNDIRECTED'},
        HAS_DIAGNOSIS: {orientation: 'UNDIRECTED'},
        HAS_CHRONIC: {orientation: 'UNDIRECTED'},
        HAS_DISEASE: {orientation: 'UNDIRECTED'}
      }
    )
    YIELD
      graphName AS graph,
      relationshipProjection AS readProjection,
      nodeCount AS nodes,
      relationshipCount AS rels
  """
)
display(result)

Unnamed: 0,graph,readProjection,nodes,rels
0,projection,{'HAS_DIAGNOSIS': {'orientation': 'UNDIRECTED'...,167271,2105376
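
Every relationship type in the projection gets the same `UNDIRECTED` orientation, so the relationship map can be built from a list of type names. The sketch below does that in plain Python; `REL_TYPES` and `relationship_projection` are names chosen for this example.

```python
REL_TYPES = [
    "LOCATED_AT", "FILED_CLAIM", "PROVIDED_BY", "PART_OF", "WORKS_FOR",
    "ATTENDED_BY", "OPERATED_BY", "ALSO_ATTENDED_BY", "HAS_ADMIT_DIAGNOSIS",
    "HAS_DIAGNOSIS_GROUP", "HAS_PROCEDURE", "HAS_DIAGNOSIS", "HAS_CHRONIC",
    "HAS_DISEASE",
]

# Same shape as the relationship map passed to gds.graph.project in the Cypher call
relationship_projection = {rel: {"orientation": "UNDIRECTED"} for rel in REL_TYPES}

# With a live connection, the Python client accepts this dict directly,
# where node_labels is the list of labels used in the Cypher call:
# G, result = gds.graph.project("projection", node_labels, relationship_projection)
```

Projecting relationships as undirected lets the embedding algorithm pass information in both directions, for example from a claim to its provider and back.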


If you get an error saying the graph already exists, that's probably because you ran this code before. You can destroy it using this command:

In [43]:
# result = gds.run_cypher(
#   """
#     CALL gds.graph.drop('projection')
#   """
# )
# display(result)

Unnamed: 0,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema,schemaWithOrientation
0,projection,neo4j,,-1,175858,2278006,{'relationshipProjection': {'LOCATED_AT': {'or...,7.4e-05,2023-04-11T02:55:17.931693467+00:00,2023-04-11T02:55:31.944540335+00:00,"{'graphProperties': {}, 'relationships': {'LOC...","{'graphProperties': {}, 'relationships': {'LOC..."


Now, let's list the details of the graph to make sure the projection was created as we want.

In [40]:
result = gds.run_cypher(
  """
    CALL gds.graph.list()
  """
)
display(result)

Unnamed: 0,degreeDistribution,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema,schemaWithOrientation
0,"{'p99': 58, 'min': 1, 'max': 41743, 'mean': 12...",projection,neo4j,66 MiB,69493423,167271,2105376,{'relationshipProjection': {'HAS_DIAGNOSIS': {...,7.5e-05,2023-04-13T05:21:31.308880807+00:00,2023-04-13T05:21:31.916404731+00:00,"{'graphProperties': {}, 'relationships': {'HAS...","{'graphProperties': {}, 'relationships': {'HAS..."


Now we can generate an embedding from that graph. This is a new feature we can use in our predictions. We're using FastRP (Fast Random Projection), a node embedding algorithm based on random projections that is typically much faster than Node2Vec. You can learn more about it [here](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).

There are a bunch of parameters we could adjust in this.  One of the most obvious is the embeddingDimension.  The documentation covers many more.
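
To build some intuition for what FastRP does, here is a heavily simplified pure-Python sketch of the idea: give each node a sparse random vector, repeatedly average the vectors of its neighbours, and sum the normalised results with per-iteration weights. This is not the GDS implementation (which, among other things, handles degree normalisation and node properties); the function `toy_fastrp` and the tiny graph are made up for illustration.

```python
import math
import random


def toy_fastrp(adjacency, dim=8, iteration_weights=(0.0, 1.0, 1.0), seed=1):
    """Very simplified random-projection embedding for an undirected graph.

    adjacency: dict mapping node -> list of neighbour nodes.
    Returns a dict mapping node -> list of `dim` floats.
    """
    rng = random.Random(seed)
    scale = math.sqrt(3.0)

    def normalise(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm > 0 else vec

    # Step 1: one sparse random vector per node, entries in {-sqrt(3), 0, +sqrt(3)}
    current = {
        node: [rng.choice((-scale, 0.0, 0.0, 0.0, 0.0, scale)) for _ in range(dim)]
        for node in sorted(adjacency)
    }
    embedding = {node: [0.0] * dim for node in adjacency}

    # Step 2: average neighbour vectors, then add the weighted, normalised result
    for weight in iteration_weights:
        nxt = {}
        for node, neighbours in adjacency.items():
            if neighbours:
                summed = [sum(current[nb][d] for nb in neighbours) for d in range(dim)]
                nxt[node] = [x / len(neighbours) for x in summed]
            else:
                nxt[node] = [0.0] * dim
        current = nxt
        for node in adjacency:
            unit = normalise(current[node])
            embedding[node] = [e + weight * u for e, u in zip(embedding[node], unit)]
    return embedding


# Two claims filed with the same provider, one claim with another provider
graph = {
    "claim_a": ["provider_1"],
    "claim_b": ["provider_1"],
    "claim_c": ["provider_2"],
    "provider_1": ["claim_a", "claim_b"],
    "provider_2": ["claim_c"],
}
vectors = toy_fastrp(graph, dim=8, seed=1)
```

In this toy graph, `claim_a` and `claim_b` have identical neighbourhoods and so end up with identical vectors, which is the property that makes embeddings useful as features: nodes in similar graph contexts get similar vectors.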

In [41]:
result = gds.run_cypher(
  """
  CALL gds.fastRP.mutate('projection',{
    embeddingDimension: 32,
    randomSeed: 1,
    mutateProperty:'embedding'
  })
  """
)
display(result)

Unnamed: 0,nodePropertiesWritten,mutateMillis,nodeCount,preProcessingMillis,computeMillis,configuration
0,167271,1,167271,2,409,"{'nodeSelfInfluence': 0, 'propertyRatio': 0.0,..."


That creates an embedding for every node in the projection.  However, we only want the embedding on the `Claim` nodes.

We're going to take the embedding from our projection and write it to the `Claim` nodes in the underlying database.

In [42]:
result = gds.run_cypher(
  """
    CALL gds.graph.writeNodeProperties('projection', ['embedding'], 
    ['Claim'])
    YIELD writeMillis
  """
)
display(result)

Unnamed: 0,writeMillis
0,902


In [43]:
result = gds.run_cypher(
  """ 
    MATCH (claim:Claim)
    RETURN claim.id as id, claim.embedding as embedding, claim.isFraud as target
    
  """
)
display(result)

ERROR:neo4j:Failed to write data to connection ResolvedIPv4Address(('35.240.86.240', 7687)) (IPv4Address(('35.240.86.240', 7687)))
ERROR:neo4j:Failed to write data to connection IPv4Address(('bc42e675.databases.neo4j.io', 7687)) (IPv4Address(('35.240.86.240', 7687)))
ERROR:neo4j:Failed to write data to connection ResolvedIPv4Address(('35.240.86.240', 7687)) (IPv4Address(('35.240.86.240', 7687)))
ERROR:neo4j:Failed to write data to connection ResolvedIPv4Address(('35.240.86.240', 7687)) (IPv4Address(('35.240.86.240', 7687)))
ERROR:neo4j:Failed to write data to connection ResolvedIPv4Address(('35.240.86.240', 7687)) (IPv4Address(('35.240.86.240', 7687)))


Unnamed: 0,id,embedding,target
0,CLM110011,"[0.25475338101387024, -0.06622820347547531, -0...",1
1,CLM110030,"[0.23718515038490295, -0.3272676467895508, -0....",1
2,CLM110031,"[-0.04657674953341484, -0.26664191484451294, -...",1
3,CLM110038,"[-0.20030340552330017, -0.3031255304813385, 0....",0
4,CLM110040,"[-0.10362769663333893, 0.39278534054756165, -0...",1
...,...,...,...
100075,CLM82009,"[0.09063015133142471, -0.4597131609916687, -0....",1
100076,CLM82013,"[-0.10496556758880615, -0.28348422050476074, -...",1
100077,CLM82218,"[-0.18778499960899353, -0.1842784881591797, 0....",1
100078,CLM82303,"[-0.12101850658655167, 0.09562279284000397, 0....",1


This is what we just did

![embeddings](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/what_are_embeddings.png)

# Export Embeddings
Now we're going to reformat the query output so that the embeddings can be fed into a Vertex AI AutoML pipeline. Note that we are exporting only the embeddings; all the other features are intentionally left out. This is to showcase how powerful these vectors are!

In [44]:
df = result
df

Unnamed: 0,id,embedding,target
0,CLM110011,"[0.25475338101387024, -0.06622820347547531, -0...",1
1,CLM110030,"[0.23718515038490295, -0.3272676467895508, -0....",1
2,CLM110031,"[-0.04657674953341484, -0.26664191484451294, -...",1
3,CLM110038,"[-0.20030340552330017, -0.3031255304813385, 0....",0
4,CLM110040,"[-0.10362769663333893, 0.39278534054756165, -0...",1
...,...,...,...
100075,CLM82009,"[0.09063015133142471, -0.4597131609916687, -0....",1
100076,CLM82013,"[-0.10496556758880615, -0.28348422050476074, -...",1
100077,CLM82218,"[-0.18778499960899353, -0.1842784881591797, 0....",1
100078,CLM82303,"[-0.12101850658655167, 0.09562279284000397, 0....",1


Note that each value in the embedding column is an array. To make this dataset more consumable, we should flatten it into individual features: embedding_0, embedding_1, ... embedding_n.


In [45]:
embeddings = pd.DataFrame(df['embedding'].values.tolist()).add_prefix("embedding_")
merged = df.drop(columns=['embedding']).merge(embeddings, left_index=True, right_index=True)
merged

Unnamed: 0,id,target,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4,embedding_5,embedding_6,embedding_7,...,embedding_22,embedding_23,embedding_24,embedding_25,embedding_26,embedding_27,embedding_28,embedding_29,embedding_30,embedding_31
0,CLM110011,1,0.254753,-0.066228,-0.024775,0.163831,-0.148595,-0.342255,0.038257,0.216315,...,-0.581566,0.196882,-0.163435,-0.278253,-0.017337,0.189501,-0.050509,-0.203961,0.106913,0.248206
1,CLM110030,1,0.237185,-0.327268,-0.296732,0.629316,-0.000570,-0.490610,-0.408764,0.116625,...,-0.110048,0.188176,0.225552,0.046880,-0.099424,-0.002114,0.629702,0.002217,0.173656,-0.096110
2,CLM110031,1,-0.046577,-0.266642,-0.080029,0.199684,0.253483,-0.039590,0.138160,0.136227,...,0.005766,0.671487,0.206718,-0.307127,0.093026,0.166204,0.820539,0.381333,0.241134,0.115846
3,CLM110038,0,-0.200303,-0.303126,0.032465,0.343753,0.260205,0.229180,0.014260,-0.082209,...,0.268068,0.230118,0.376665,-0.203866,-0.064994,0.299583,0.366296,0.697823,0.307896,0.145842
4,CLM110040,1,-0.103628,0.392785,-0.174991,0.163666,-0.326548,-0.669855,-0.055932,-0.428920,...,-0.279547,0.243947,-0.040868,0.043690,-0.281771,0.385158,-0.177652,0.196085,0.290687,0.073680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100075,CLM82009,1,0.090630,-0.459713,-0.277713,0.413355,0.147438,-0.042238,-0.256380,-0.201239,...,0.243564,-0.007511,0.005761,-0.205602,-0.044075,-0.079467,0.395546,0.769303,0.056248,0.313104
100076,CLM82013,1,-0.104966,-0.283484,-0.172737,0.295921,0.385614,-0.201597,-0.282823,0.280774,...,0.348946,0.266918,0.052059,-0.213536,0.183162,-0.187067,0.317850,-0.100454,-0.053826,0.310971
100077,CLM82218,1,-0.187785,-0.184278,0.147906,0.382442,0.140461,0.000992,-0.134462,0.241466,...,0.105916,0.309607,-0.046990,-0.274648,0.064452,-0.004789,0.386445,0.063947,0.319438,0.178781
100078,CLM82303,1,-0.121019,0.095623,0.023514,0.169223,0.133172,-0.387692,0.229661,0.213616,...,0.228377,0.016648,0.235559,0.285635,-0.080166,0.245624,0.640240,0.177711,-0.048835,0.286514


If you are curious, visualize the embeddings as a t-SNE plot. It can look something like this:
![embedding_viz](https://storage.googleapis.com/neo4j-datasets/insurance-claim/img/embeddings-tsne.png)

Now that we have the data formatted properly, let's write it as CSV

In [47]:
import os

df = merged

# Create the output directory if it doesn't already exist
outdir = './data'
os.makedirs(outdir, exist_ok=True)

# Shuffle the rows, then write them out without the index column
data = df.sample(frac=1).reset_index(drop=True)
data.to_csv(os.path.join(outdir, 'embedding.csv'), index=False)

## Upload to Google Cloud Storage
Now let's write the file to Google Cloud Storage so we can use it in our model.  To do so, we must set a few environment variables.

Edit the REGION variable below.  You'll want to be sure it matches the region where your notebook is running.

The STORAGE_BUCKET is the name of a new bucket.  It must be globally unique.  It also needs to be all lower case.

In [48]:
import os

# Edit this variable!
REGION = 'us-west1'
shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID = shell_output[0]

STORAGE_BUCKET = PROJECT_ID + '-fsi'
STORAGE_BUCKET

os.environ["GCLOUD_PROJECT"] = PROJECT_ID
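
Since the bucket name is derived from the project ID, it can be worth checking its shape before trying to create it. The helper below is a rough sketch that covers only the basic naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; starting and ending with a letter or digit). Cloud Storage enforces further rules that it does not check, and the function name is made up for this example.

```python
import re

# Basic shape only: first and last characters alphanumeric, 3-63 characters overall
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if `name` passes the basic bucket naming checks."""
    return _BUCKET_RE.fullmatch(name) is not None


# Example: looks_like_valid_bucket_name(STORAGE_BUCKET)
```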

In [49]:
from google.cloud import storage
client = storage.Client()
bucket = client.bucket(STORAGE_BUCKET)
if not bucket.exists():
    bucket.create(location=REGION)

blob = bucket.blob(os.path.join('insurance_fraud', 'embedding.csv'))
blob.upload_from_filename(os.path.join(outdir, 'embedding.csv'))

We will also factorize the raw data, without embeddings, and export it so we can run a similar AutoML pipeline on Vertex AI. Then, let's compare the accuracy between the two!

In [52]:
raw_df = pd.read_csv("https://storage.googleapis.com/neo4j-datasets/insurance-claim/data.csv")
raw_df.rename(columns={'claim_id': 'id',  
                       'is_fraud': 'target'}, inplace=True)
# Encode each code column as integers: codes start at 1 and missing values become 0
code_columns = (
    [f'claim_diag_code_{i}' for i in range(1, 11)]
    + [f'claim_procedure_code_{i}' for i in range(1, 7)]
    + ['claim_admit_diagnosis_code', 'diag_group_code']
)
for col in code_columns:
    raw_df[col] = pd.factorize(raw_df[col])[0] + 1

raw_data = raw_df.sample(frac=1).reset_index(drop=True)
raw_data.to_csv(os.path.join(outdir, 'raw.csv'), index=False)

blob = bucket.blob(os.path.join('insurance_fraud', 'raw.csv'))
blob.upload_from_filename(os.path.join(outdir, 'raw.csv'))
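
To make the encoding concrete, here is a pure-Python sketch that mimics what `pd.factorize(col)[0] + 1` produces for a single column: distinct values get the codes 1, 2, 3, ... in order of first appearance, and missing values get 0. The helper `factorize_plus_one` is written for this illustration only and uses `None` to stand in for a missing value.

```python
def factorize_plus_one(values):
    """Mimic `pd.factorize(values)[0] + 1` for a plain list.

    Distinct values get codes 1, 2, 3, ... in order of first appearance;
    missing values (None here) get 0.
    """
    codes = {}
    out = []
    for v in values:
        if v is None:
            out.append(0)
            continue
        if v not in codes:
            codes[v] = len(codes) + 1
        out.append(codes[v])
    return out


print(factorize_plus_one(["78943", "71988", "78943", None, "V6546"]))
# [1, 2, 1, 0, 3]
```

Keep in mind that these integers are arbitrary labels: a model that treats them as numeric sees an ordering between diagnosis codes that isn't really there.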