# Ingestion


This notebook parses data from XXX using Google Vertex AI Generative AI.  It then uses Generative AI to create Neo4j Cypher queries which write the data to a Neo4j database.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/neo4j-partners/intelligent-app-google-generativeai-neo4j/blob/main/ingestion/ingestion.ipynb?token=GHSAT0AAAAAAB5TWSXKZODTVRMD2NB7DCQMZDNK3FA" target="_blank">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
</td>
</table>

## Setup
This notebook should be run within Colab.  You cannot currently use Google Vertex AI Workbench because of auth issues.  The "run in colab" button won't work because the repo is private.

First, install the preview library for Generative AI, which is a new version of the AI Platform library. The wheel is stored in a private bucket, so your user account and project must be part of the preview to download it.

By default, a Vertex AI Workbench notebook uses a service account, and that account doesn't have access to the bucket that holds the preview binary, so you need to authenticate with your own user account. In a managed notebook, open a terminal window and run `gcloud auth login`. In Colab, the cell below authenticates your user account. With that complete, you can install the preview library.

In [None]:
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [None]:
!gsutil cp gs://vertex_sdk_llm_private_releases/SDK/google_cloud_aiplatform-1.25.dev20230413+language.models-py2.py3-none-any.whl .

In [None]:
%pip install --user ./google_cloud_aiplatform-1.25.dev20230413+language.models-py2.py3-none-any.whl "shapely<2.0.0" --force-reinstall

Important -- You will now need to restart the kernel.

The training below shows how to instruction-tune a text-bison model. The chat-bison model which we are going to use in the ingestion process is currently not tunable. The below code is meant to show an example of fine-tuning.

Clarification --- what is the tuning job doing if this isn't tunable?

In [None]:
# Note, you will need to set these variables
project_id = 'neo4jbusinessdev'
location = 'us-central1'

In [None]:
from google.cloud.aiplatform.private_preview.language_models import TextGenerationModel
from google.cloud import aiplatform

Question -- is eng2cypher a parsed copy of data.  How did it end up that way?
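
For reference, supervised tuning of Vertex AI text models reads a JSONL file in which each line is a JSON object with an `input_text` and an `output_text` field. The contents of `eng2cypher.jsonl` are not shown in this notebook, so the records below are made up purely for illustration; this is a sketch of the file shape, not a copy of the real data.

```python
import json

# Hypothetical English-to-Cypher pairs, for illustration only
records = [
    {
        "input_text": "How many patients are in the graph?",
        "output_text": "MATCH (p:Person) RETURN count(p) AS patients",
    },
    {
        "input_text": "List all diseases.",
        "output_text": "MATCH (d:Disease) RETURN d.name AS disease",
    },
]


def to_jsonl(rows):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


jsonl_text = to_jsonl(records)
print(jsonl_text)
```

The resulting text could be saved to a file and uploaded to a Cloud Storage bucket before being passed as `training_data`.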

In [None]:
training_data = 'gs://gs_vertex_ai/eng2cypher/eng2cypher.jsonl'
train_steps = 10

aiplatform.init(project=project_id, location=location)
model = TextGenerationModel.from_pretrained("text-bison-001")

model.tune_model(
  training_data=training_data,
  train_steps=train_steps,
  tuning_job_location="europe-west4",
  tuned_model_location="us-central1",
)

# Test the tuned model
print(model.predict("Tell me some ideas combining VR and fitness:"))

Creating PipelineJob
PipelineJob created. Resource name: projects/803648085855/locations/europe-west4/pipelineJobs/tune-large-model-20230523211814
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/803648085855/locations/europe-west4/pipelineJobs/tune-large-model-20230523211814')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/europe-west4/pipelines/runs/tune-large-model-20230523211814?project=803648085855
PipelineJob projects/803648085855/locations/europe-west4/pipelineJobs/tune-large-model-20230523211814 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/803648085855/locations/europe-west4/pipelineJobs/tune-large-model-20230523211814 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/803648085855/locations/europe-west4/pipelineJobs/tune-large-model-20230523211814 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/803648085855/locations/europe-west4/pi

KeyboardInterrupt: 

In [None]:
from google.cloud.aiplatform.private_preview.language_models import TextGenerationModel
from google.cloud import aiplatform

aiplatform.init(project=project_id, location=location)
model = TextGenerationModel.from_pretrained("text-bison-001")
tuned_model_names = model.list_tuned_model_names()
print(tuned_model_names)

['projects/803648085855/locations/us-central1/models/7947351409425383424']


## Data Cleansing

Now, let's define a function that can help clean the input data. The case sheets refer to figures, such as scanned images, that we don't have, so we remove any such references.

In [None]:
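import re

def clean_text(text):
  """Remove parenthesized figure references such as "(Fig.1)" from the text."""
  # The figures are not available, so drop every "(fig...)" mention, ignoring case
  return re.sub(r'\(fig[^)]*\)', '', text, flags=re.IGNORECASE)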
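
As a quick check of the cleaning step, the sketch below applies the same regular expression to one sentence taken from the sample case sheet. The function is repeated here only so the snippet runs on its own.

```python
import re


def clean_text(text):
    # Drop parenthesized figure references, ignoring case
    return re.sub(r'\(fig[^)]*\)', '', text, flags=re.IGNORECASE)


sample = "The tumor cells formed small aggregations around the bronchioles (Fig.1)."
cleaned = clean_text(sample)
print(cleaned)
# The tumor cells formed small aggregations around the bronchioles .
```

Note that only the parenthesized reference is removed, so the space that preceded it stays in the text.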

Let's take this case sheet and extract entities and relationships using an LLM.

In [None]:
sample_que = """The patient was a 34-yr-old man who presented with complaints of fever and a chronic cough.
He was a smoker and had a history of pulmonary tuberculosis that had been treated and cured.
A computed tomographic (CT) scan revealed multiple tiny nodules in both lungs.
A thoracoscopic lung biopsy was taken from the right upper lobe.
The microscopic examination revealed a typical LCH.
The tumor cells had vesicular and grooved nuclei, and they formed small aggregations around the bronchioles (Fig.1).
The tumor cells were strongly positive for S-100 protein, vimentin, CD68 and CD1a.
There were infiltrations of lymphocytes and eosinophils around the tumor cells.
With performing additional radiologic examinations, no other organs were thought to be involved.
He quit smoking, but he received no other specific treatment.
He was well for the following one year.
After this, a follow-up CT scan was performed and it showed a 4 cm-sized mass in the left lower lobe, in addition to the multiple tiny nodules in both lungs (Fig.2).
A needle biopsy specimen revealed the possibility of a sarcoma; therefore, a lobectomy was performed.
Grossly, a 4 cm-sized poorly-circumscribed lobulated gray-white mass was found (Fig.3), and there were a few small satellite nodules around the main mass.
Microscopically, the tumor cells were aggregated in large sheets and they showed an infiltrative growth.
The cytologic features of some of the tumor cells were similar to those seen in a typical LCH.
However, many tumor cells showed overtly malignant cytologic features such as pleomorphic/hyperchromatic nuclei and prominent nucleoli (Fig.4), and multinucleated tumor giant cells were also found.
There were numerous mitotic figures ranging from 30 to 60 per 10 high power fields, and some of them were abnormal.
A few foci of typical LCH remained around the main tumor mass.
Immunohistochemically, the tumor cells were strongly positive for S-100 protein (Fig.5) and vimentin; they were also positive for CD68 (Dako N1577, Clone KPI), and focally positive for CD1a (Fig.6), and they were negative for cytokeratin, epithelial membrane antigen, CD3, CD20 and HMB45.
The ultrastructural analysis failed to demonstrate any Birbeck granules in the cytoplasm of the tumor cells.
Now, at five months after lobectomy, the patient is doing well with no significant change in the radiologic findings.
"""

sample_ans = """
{'entities': [{'label': 'Case',
    'id': 'case1',
    'summary': '34-yr-old man with fever, chronic cough, history of pulmonary tuberculosis, LCH diagnosis, and sarcoma. Underwent lobectomy and is doing well.'},
   {'label': 'Person',
    'id': 'person1',
    'age': '34',
    'location': '',
    'gender': 'male'},
   {'label': 'Symptom', 'id': 'fever', 'description': 'Fever'},
   {'label': 'Symptom', 'id': 'chronicCough', 'description': 'Chronic cough'},
   {'label': 'Disease',
    'id': 'pulmonaryTuberculosis',
    'name': 'Pulmonary Tuberculosis'},
   {'label': 'Disease',
    'id': 'langerhansCellHistiocytosis',
    'name': 'Langerhans Cell Histiocytosis'},
   {'label': 'Disease', 'id': 'sarcoma', 'name': 'Sarcoma'},
   {'label': 'BodySystem', 'id': 'lungs', 'name': 'Lungs'},
   {'label': 'BodySystem', 'id': 'heart', 'name': 'Heart'},
   {'label': 'Diagnosis',
    'id': 'ctScan',
    'name': 'CT Scan',
    'description': 'Computed Tomographic (CT) scan',
    'when': 'initial'},
   {'label': 'Diagnosis',
    'id': 'thoracoscopicLungBiopsy',
    'name': 'Thoracoscopic Lung Biopsy',
    'description': 'Thoracoscopic lung biopsy from the right upper lobe',
    'when': 'initial'},
   {'label': 'Diagnosis',
    'id': 'followUpCtScan',
    'name': 'Follow-up CT Scan',
    'description': 'Follow-up CT scan showing a 4 cm-sized mass in the left lower lobe',
    'when': 'one year later'},
   {'label': 'Diagnosis',
    'id': 'needleBiopsy',
    'name': 'Needle Biopsy',
    'description': 'Needle biopsy specimen revealing the possibility of a sarcoma',
    'when': 'one year later'},
   {'label': 'Diagnosis',
    'id': 'lobectomy',
    'name': 'Lobectomy',
    'description': 'Lobectomy performed to remove the mass',
    'when': 'one year later'},
   {'label': 'Biological',
    'id': 'multipleTinyNodules',
    'name': 'Multiple Tiny Nodules',
    'description': 'Multiple tiny nodules in both lungs'},
   {'label': 'Biological',
    'id': 'lchCells',
    'name': 'LCH Cells',
    'description': 'Typical LCH cells with vesicular and grooved nuclei'},
   {'label': 'Biological',
    'id': 'tumorCells',
    'name': 'Tumor Cells',
    'description': 'Tumor cells with malignant cytologic features'}],
  'relationships': ['case1|FOR|person1',
   "person1|HAS_SYMPTOM{when:'initial',frequency:'',span:''}|fever",
   "person1|HAS_SYMPTOM{when:'initial',frequency:'',span:''}|chronicCough",
   "person1|HAS_DISEASE{when:'past'}|pulmonaryTuberculosis",
   "person1|HAS_DISEASE{when:'initial'}|langerhansCellHistiocytosis",
   "person1|HAS_DISEASE{when:'one year later'}|sarcoma",
   'chronicCough|SEEN_ON|lungs',
   'langerhansCellHistiocytosis|AFFECTS|lungs',
   'sarcoma|AFFECTS|lungs',
   'person1|HAS_DIAGNOSIS|ctScan',
   'person1|HAS_DIAGNOSIS|thoracoscopicLungBiopsy',
   'person1|HAS_DIAGNOSIS|followUpCtScan',
   'person1|HAS_DIAGNOSIS|needleBiopsy',
   'person1|HAS_DIAGNOSIS|lobectomy',
   'ctScan|SHOWED|multipleTinyNodules',
   'thoracoscopicLungBiopsy|SHOWED|lchCells',
   'lobectomy|SHOWED|tumorCells']}
"""

que = """A 28-year-old previously healthy man presented with a 6-week history of palpitations.
The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea.
Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings.
An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway.
Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2).
The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead).
Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).
The patient underwent an electrophysiologic study with mapping of the accessory pathway, followed by radiofrequency ablation (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ablation catheter).
His post-ablation ECG showed a prolonged PR interval and an odd “second” QRS complex in leads III, aVF and V2–V4 (Fig.1Bottom), a consequence of abnormal impulse conduction in the “atrialized” right ventricle.
The patient reported no recurrence of palpitations at follow-up 6 months after the ablation.

"""

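The sample answer above is written as a Python-style dictionary with single-quoted strings, which `json.loads` rejects. A minimal sketch of a tolerant parser, assuming the model may reply in either JSON or Python-literal form, is shown below. The helper name `parse_llm_output` is hypothetical and is not part of this notebook.

```python
import ast
import json


def parse_llm_output(text):
    """Parse model output that is either JSON or a Python-literal dict."""
    text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single-quoted keys and strings)
        return ast.literal_eval(text)


example = """
{'entities': [{'label': 'Symptom', 'id': 'fever', 'description': 'Fever'}],
 'relationships': ["person1|HAS_SYMPTOM{when:'initial'}|fever"]}
"""

parsed = parse_llm_output(example)
print(parsed["entities"][0]["id"])
print(parsed["relationships"])
```

`ast.literal_eval` only evaluates literals, so it is a safer fallback than `eval` for text that comes back from a model.
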
## Prompt Definition

**⚠️** Duplicate the `config.env.example` file shown on the left and rename the copy to `config.env`. Edit the values in this file to provide your API keys and Neo4j credentials.

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
#load_dotenv('/content/drive/MyDrive/Colab Notebooks/GenAI-Playground/config-gcp.env', override=True)
#
#shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
#PROJECT_ID = os.getenv('PROJECT_ID')
#os.environ["GCLOUD_PROJECT"] = PROJECT_ID
#os.environ['GCLOUD_REGION'] = 'us-central1'

This is a helper function to talk to the LLM with our prompt and text input

In [None]:
def process_gpt(
    project_id: str,
    model_name: str,
    temperature: float,
    max_output_tokens: int,
    top_p: float,
    top_k: int,
    context: str,
    prompt: str,
    que: str,
    examples,
    location: str = "us-central1"
    ) :
    """Predict using a Large Language Model."""
    aiplatform.init(project=project_id, location=location)

    chat_model = ChatModel.from_pretrained(model_name)
    parameters = {
      "temperature": temperature,
      "max_output_tokens": max_output_tokens,
      "top_p": top_p,
      "top_k": top_k,
    }

    chat = chat_model.start_chat(
      context=context,
      examples=examples
    )
    return chat.send_message(prompt+que,**parameters)


This is a simple prompt to start with. If the processing is complex, you can chain several prompts as required. Here we use a single prompt that extracts information strictly according to the entities and relationships defined. This is a simplification. In a real scenario, especially with medical records, you would rely on domain experts to define the ontology systematically and capture the important information. You might also fine-tune the model as required.

Instead of one large model, you can also consider chaining a number of smaller ones to suit your needs.

We are going with this graph schema for our case sheet:

![schema.png](https://github.com/neo4j-partners/intelligent-app-google-generativeai-neo4j/blob/main/ingestion/schema.png?raw=1)

In [None]:
prompt="""From the Case sheet for a patient below, extract the following Entities & relationships described in the mentioned format 
0. ALWAYS FINISH THE OUTPUT. Never send partial responses
1. First, look for these Entity types in the text and generate as comma-separated format similar to entity type.
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. Do not create new entity types that aren't mentioned below. Document must be summarized and stored inside Case entity under `summary` property. You will have to generate as many entities as needed as per the types below:
    Entity Types:
    label:'Case',id:string,summary:string //Case
    label:'Person',id:string,age:string,location:string,gender:string //Patient mentioned in the case
    label:'Symptom',id:string,description:string //Symptom Entity; `id` property is the name of the symptom, in lowercase & camel-case & should always start with an alphabet
    label:'Disease',id:string,name:string //Disease diagnosed now or previously as per the Case sheet; `id` property is the name of the disease, in lowercase & camel-case & should always start with an alphabet
    label:'BodySystem',id:string,name:string //Body Part affected. Eg: Chest, Lungs; id property is the name of the part, in lowercase & camel-case & should always start with an alphabet
    label:'Diagnosis',id:string,name:string,description:string,when:string //Diagnostic procedure conducted; `id` property is the summary of the Diagnosis, in lowercase & camel-case & should always start with an alphabet
    label:'Biological',id:string,name:string,description:string //Results identified from Diagnosis; `id` property is the summary of the Biological, in lowercase & camel-case & should always start with an alphabet
    
3. Next generate each relationships as triples of head, relationship and tail. To refer the head and tail entity, use their respective `id` property. Relationship property should be mentioned within brackets as comma-separated. They should follow these relationship types below. You will have to generate as many relationships as needed as defined below:
    Relationship types:
    case|FOR|person
    person|HAS_SYMPTOM{when:string,frequency:string,span:string}|symptom //the properties inside HAS_SYMPTOM gets populated from the Case sheet
    person|HAS_DISEASE{when:string}|disease //the properties inside HAS_DISEASE gets populated from the Case sheet
    symptom|SEEN_ON|chest
    disease|AFFECTS|heart
    person|HAS_DIAGNOSIS|diagnosis
    diagnosis|SHOWED|biological
4. Do not send any response other than code block in the response

The output should look like :
{
    "entities": [{"label":"Case","id":string,"summary":string}],
    "relationships": ["disease|AFFECTS|heart"]
}

Case Sheet:
$ctext
"""

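The relationship format requested in the prompt is `head|TYPE{properties}|tail`, with the property block being optional. As a rough sketch of how such a triple could be taken apart, the hypothetical helper below splits it into its four parts with regular expressions. It assumes property values are single-quoted strings, as in the sample answer.

```python
import re

# head|TYPE{props}|tail, where the {props} part is optional
TRIPLE_RE = re.compile(r"^\s*([^|]+?)\s*\|\s*([A-Z_]+)\s*(\{.*\})?\s*\|\s*([^|]+?)\s*$")
# key:'value' pairs inside the property block
PROP_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_relationship(triple):
    """Return (head, relationship type, properties dict, tail)."""
    match = TRIPLE_RE.match(triple)
    if match is None:
        raise ValueError(f"Not a head|REL|tail triple: {triple!r}")
    head, rel_type, props, tail = match.groups()
    properties = dict(PROP_RE.findall(props)) if props else {}
    return head, rel_type, properties, tail


print(parse_relationship("person1|HAS_SYMPTOM{when:'initial',frequency:'',span:''}|fever"))
print(parse_relationship("case1|FOR|person1"))
```

Parsing the properties into a dictionary makes it possible to validate them before they are written to the graph.
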
Let's run our completion task with our LLM

In [None]:
import re
from string import Template
from google.cloud.aiplatform.private_preview.language_models import InputOutputTextPair, ChatModel


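def run_completion(prompt, results, ctext):
    try:
      # Fill the $ctext placeholder with the case sheet to process
      pr = Template(prompt).substitute(ctext=ctext)
      res = process_gpt(project_id,
                        'chat-bison-001'
                        , 0, 1024, 0.8, 40, '''You are a helpful Medical Case Sheet expert who extracts relevant information which will be eventually used to store them on a Neo4j Knowledge Graph after processing''',
                        pr, '', [
        InputOutputTextPair(
          # Few-shot example: the same template filled with the sample case sheet
          input_text=Template(prompt).substitute(ctext=clean_text(sample_que)),
          output_text=sample_ans, 
        )
      ], location="us-central1")
      results.append(res)
    except Exception as e:
        print(e)
    # Always return the list so a failed call does not replace `results` with None
    return results

prompts = [prompt]
results = []

for p in prompts:
  # `que` is the new case sheet to extract from; `sample_que` only serves as the example
  results = run_completion(p, results, clean_text(que))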
    

403 Permission 'aiplatform.endpoints.predict' denied on resource '//aiplatform.googleapis.com/projects/cloud-large-language-models/locations/us-central1/endpoints/chat-bison-001' (or it may not exist). [reason: "IAM_PERMISSION_DENIED"
domain: "aiplatform.googleapis.com"
metadata {
  key: "resource"
  value: "projects/cloud-large-language-models/locations/us-central1/endpoints/chat-bison-001"
}
metadata {
  key: "permission"
  value: "aiplatform.endpoints.predict"
}
]


In [None]:
results[0].text

## Neo4j Cypher Generation

The entities and relationships we got from the LLM have to be transformed to Cypher so we can write them into Neo4j.

In [None]:
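import time

def get_prop_str(prop_dict, _id):
    s = []
    for key, val in prop_dict.items():
      if key != 'label' and key != 'id':
         # Escape backslashes and double quotes so the value is a valid Cypher string literal
         escaped = str(val).replace('\\', '\\\\').replace('"', '\\"')
         s.append(_id+"."+key+' = "'+escaped+'"')
    # Avoid emitting a dangling ON CREATE SET when there are no properties
    return ' ON CREATE SET ' + ','.join(s) if s else ''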

def get_cypher_compliant_var(_id):
    return "_"+ re.sub(r'[\W_]', '', _id)

def generate_cypher(in_json):
    e_map = {}
    e_stmt = []
    r_stmt = []
    e_stmt_tpl = Template("($id:$label{id:'$key'})")
    r_stmt_tpl = Template("""
      MATCH $src
      MATCH $tgt
      MERGE ($src_id)-[:$rel]->($tgt_id)
    """)
    for obj in in_json:
      for j in obj['entities']:
          props = ''
          label = j['label']
          id = j['id']
          if label == 'Case':
                id = 'c'+str(time.time_ns())
          elif label == 'Person':
                id = 'p'+str(time.time_ns())
          varname = get_cypher_compliant_var(j['id'])
          stmt = e_stmt_tpl.substitute(id=varname, label=label, key=id)
          e_map[varname] = stmt
          e_stmt.append('MERGE '+ stmt + get_prop_str(j, varname))

      for st in obj['relationships']:
          rels = st.split("|")
          src_id = get_cypher_compliant_var(rels[0].strip())
          rel = rels[1].strip()
          tgt_id = get_cypher_compliant_var(rels[2].strip())
          stmt = r_stmt_tpl.substitute(
              src_id=src_id, tgt_id=tgt_id, src=e_map[src_id], tgt=e_map[tgt_id], rel=rel)
          
          r_stmt.append(stmt)

    return e_stmt, r_stmt
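
The functions above build Cypher by string concatenation. As an alternative sketch, not the approach this notebook uses, property values can be passed as query parameters so that quotes in the text never need escaping. Labels cannot be parameterized in Cypher, so the hypothetical helper below checks them against the entity types from the prompt instead.

```python
# Entity types taken from the extraction prompt
ALLOWED_LABELS = {"Case", "Person", "Symptom", "Disease",
                  "BodySystem", "Diagnosis", "Biological"}


def entity_to_query(entity):
    """Return a parameterized MERGE statement and its parameters."""
    label = entity["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label: {label}")
    # Everything except label and id becomes a node property
    props = {k: v for k, v in entity.items() if k not in ("label", "id")}
    query = f"MERGE (n:{label} {{id: $id}}) ON CREATE SET n += $props"
    return query, {"id": entity["id"], "props": props}


query, params = entity_to_query(
    {"label": "Disease", "id": "sarcoma", "name": 'Sarcoma "soft tissue"'}
)
print(query)
print(params)
```

The pair could then be handed to the database client together, for example as `gds.run_cypher(query, params)`.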

In [None]:
ent_cyp, rel_cyp = generate_cypher(results)

TypeError: 'NoneType' object is not iterable

## Data Ingestion

You will need a Neo4j AuraDS Pro instance.  You can deploy that on Google Cloud Marketplace [here](https://console.cloud.google.com/marketplace/product/endpoints/prod.n4gcp.neo4j.io).

With that complete, you'll need to install the Neo4j Graph Data Science Python client (`graphdatascience`) and set up your database connection.

In [None]:
%pip install graphdatascience

In [None]:
from graphdatascience import GraphDataScience

In [None]:
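import os

# Connection settings are read from environment variables (for example, loaded from config.env)
connectionUrl = os.getenv('NEO4J_CONN_URL')
username = os.getenv('NEO4J_USER')
password = os.getenv('NEO4J_PASSWORD')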

In [None]:
gds = GraphDataScience(connectionUrl, auth=(username, password))
gds.version()

Before loading the data, create constraints as below

In [None]:
gds.run_cypher('CREATE CONSTRAINT unique_case_id IF NOT EXISTS FOR (n:Case) REQUIRE n.id IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_person_id IF NOT EXISTS FOR (n:Person) REQUIRE (n.id) IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_symptom_id IF NOT EXISTS FOR (n:Symptom) REQUIRE (n.id) IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_disease_id IF NOT EXISTS FOR (n:Disease) REQUIRE n.id IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_bodysys_id IF NOT EXISTS FOR (n:BodySystem) REQUIRE n.id IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_diag_id IF NOT EXISTS FOR (n:Diagnosis) REQUIRE n.id IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_biological_id IF NOT EXISTS FOR (n:Biological) REQUIRE n.id IS UNIQUE')

Ingest the entities

In [None]:
%%time
for e in ent_cyp:
    gds.run_cypher(e)


Ingest relationships now

In [None]:
%%time
for r in rel_cyp:
    gds.run_cypher(r)

These helper functions ingest all case sheets in the `data/case_sheets/` directory.

In [None]:
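import glob
import json
from timeit import default_timer as timer

def run_pipeline(count=191):
    # Process at most `count` case sheets from data/case_sheets/
    txt_files = glob.glob("data/case_sheets/*.txt")[0:count]
    print(f"Running pipeline for {len(txt_files)} files")
    failed_files = process_pipeline(txt_files)
    print(failed_files)
    return failed_files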

def process_pipeline(files):
    failed_files = []
    for f in files:
        try:
            with open(f, 'r') as file:
                print(f"  {f}: Reading File...")
                data = file.read().rstrip()
                text = clean_text(data)
                print(f"    {f}: Extracting E & R")
                results = extract_entities_relationships(f, text)
                print(f"    {f}: Generating Cypher")
                ent_cyp, rel_cyp = generate_cypher(results)
                print(f"    {f}: Ingesting Entities")
                for e in ent_cyp:
                    gds.run_cypher(e)
                print(f"    {f}: Ingesting Relationships")
                for r in rel_cyp:
                    gds.run_cypher(r)
                print(f"    {f}: Processing DONE")
        except Exception as e:
            print(f"    {f}: Processing Failed with exception {e}")
            failed_files.append(f)
    return failed_files
            
def extract_entities_relationships(f, text):
    start = timer()
    system = "You are a helpful Medical Case Sheet expert who extracts relevant information and stores it in a Neo4j Knowledge Graph"
    # Assumes the extraction prompt defined earlier in `prompt` is the one intended here
    prompts = [prompt]
    results = []
    for p in prompts:
      p = Template(p).substitute(ctext=text)
      # process_gpt needs the model settings and returns a response object, so parse its text
      res = process_gpt(project_id, 'chat-bison-001', 0, 1024, 0.8, 40, system, p, '', [])
      results.append(json.loads(res.text))
    end = timer()
    elapsed = (end-start)
    print(f"    {f}: E & R took {elapsed}secs")
    return results

In [None]:
%%time
failed_files = run_pipeline(200)

If processing failed for some files because of API rate limits or other errors, you can retry the failed files as shown below.

In [None]:
%%time
failed_files = process_pipeline(failed_files)
failed_files
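
When failures come from rate limits, retrying immediately often fails again. A minimal sketch of retrying with exponential backoff, using only the standard library, is shown below. The helper `call_with_retry` and the `flaky` function are hypothetical; the `sleep` parameter exists only so the demo can record the delays instead of waiting.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**i seconds and try again."""
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if i == attempts - 1:
                raise  # Out of attempts: surface the last error
            delay = base_delay * (2 ** i)
            print(f"Attempt {i + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)


# Demo: a stand-in call that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


delays = []
result = call_with_retry(flaky, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In the pipeline, the wrapped call would be the LLM request for a single file, so that one slow response does not fail the whole file.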

In [None]:
results

## Cypher Generation for Consumption

In [None]:
%%time
def run_completion(prompt, que, results, ctext):
    try:
      # The example queries follow the schema in the prompt and avoid SIZE, as the prompt requires
      examples = [InputOutputTextPair(
          input_text=prompt+'Which disease affects most of my patients?',
          output_text="""MATCH (p:Person)-[:HAS_DISEASE]->(d:Disease) RETURN d.name AS disease, COUNT(DISTINCT p) AS affected_patients ORDER BY affected_patients DESC LIMIT 1"""
        ), InputOutputTextPair(
          input_text=prompt+'Which patient has the most number of symptoms?',
          output_text="""MATCH (p:Person)-[:HAS_SYMPTOM]->(s:Symptom) RETURN p.id AS patient, COUNT(s) AS symptoms ORDER BY symptoms DESC LIMIT 1"""
        )]
      # Use the project_id variable set at the top of the notebook
      res = process_gpt(project_id,
                        'chat-bison-001'
                        , 0, 1024, 0.8, 40, '''You are an assistant that translates english to Neo4j cypher\n''',
                        prompt, que, examples, location="us-central1")
      results.append(res)
    except Exception as e:
        print(e)
    # Always return the list, even when the call failed
    return results

prompt = '''Using this Neo4j schema and Reply ONLY in Cypher when it makes sense.\nHere are the instructions to follow:\n1. Use the Neo4j schema to generate cypher compatible ONLY for Neo4j Version 5\n2. Do not use EXISTS, SIZE keywords in the cypher.\n3. Use only Nodes and relationships mentioned in the schema while generating the response\n4. Reply ONLY in Cypher when it makes sense.\n5. Always do a case-insensitive and fuzzy search for any properties related search. Eg: to search for a Heart Disease use `toLower(d.name) contains 'heart disease'`\n6. Patient node is synonymous to Person\n\nSchema:\nNodes:\n    label:'Case',id:string,summary:string //Case Node\n    label:'Person',id:string,age:string,location:string,gender:string //Patient Node\n    label:'Symptom',id:string,description:string //Symptom Node\n    label:'Disease',id:string,name:string //Disease Node\n    label:'BodySystem',id:string,name:string //Node for Body Part affected Eg: Heart, lungs\n    label:'Diagnosis',id:string,name:string,description:string,when:string //Diagnostic Node\n    label:'Biological',id:string,name:string,description:string //Node for Results identified from Diagnosis\n\nRelationships:\n    (:Case)-[:FOR]->(:Person)\n    (:Person)-[:HAS_SYMPTOM{when:string,frequency:string,span:string}]->(:Symptom)\n    (:Person)-[:HAS_DISEASE{when:string}]->(:Disease)\n    (:Symptom)-[:SEEN_ON]->(:BodySystem)\n    (:Disease)-[:AFFECTS]->(:BodySystem)\n    (:Person)-[:HAS_DIAGNOSIS]->(:Diagnosis)\n    (:Diagnosis)-[:SHOWED]->(:Biological)'''
results = []
run_completion(prompt, 'Which age group has the most number of heart diseases?', results, clean_text(sample_que))
    

As you can see, the generated Cypher is not yet syntactically correct. We might need to do some fine-tuning here or pick another model, such as Codey.
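
One way to catch invalid Cypher before using it is to prefix the query with `EXPLAIN`, which asks Neo4j to plan the query without executing it, so syntax errors surface as exceptions. The sketch below takes the runner as an argument; `check_cypher` is a hypothetical helper, and `fake_run_cypher` is a stand-in used only so the example runs without a database.

```python
def check_cypher(query, run_cypher):
    """Return (True, None) if the query can be planned, else (False, error text)."""
    try:
        run_cypher("EXPLAIN " + query)
        return True, None
    except Exception as exc:
        return False, str(exc)


def fake_run_cypher(query):
    # Stand-in for a real client call: rejects queries without a RETURN clause
    if "RETURN" not in query.upper():
        raise ValueError("query has no RETURN clause")
    return []


print(check_cypher("MATCH (d:Disease) RETURN d.name", fake_run_cypher))
print(check_cypher("MATCH (d:Disease)", fake_run_cypher))
```

With a live connection, the runner passed in would be the client's own query function, such as `gds.run_cypher`.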

In [None]:
# To do - move libraries to where they are imported.  Remove unused libraries.

#%pip install graphdatascience
#%pip install python-dotenv
#%pip install retry


# To do - move libraries to where they are imported.  Remove unused libraries.

#import os
#from retry import retry
#import re
#from string import Template
#import json 
#import ast
#import time
#import pandas as pd
#from graphdatascience import GraphDataScience
#import glob
#from timeit import default_timer as timer
#from dotenv import load_dotenv

#from google.cloud import aiplatform
#from google.cloud.aiplatform.private_preview.language_models import ChatModel, InputOutputTextPair
