# 1. Intelligent App with Google GenAI & Neo4j
In this notebook, let's explore how to effectively leverage Google GenAI to build and consume knowledge in Neo4j.

This notebook parses data from a public [corpus of Resumes / Curriculum Vitae](https://github.com/florex/resume_corpus) using Google Vertex AI Generative AI's `text-bison` model. The model will be prompted to recognise and extract entities and relationships. We will then generate Neo4j Cypher queries from them and write the data to a Neo4j database.
We will then use a `text-bison` model again, prompting it to convert questions in English to Cypher, Neo4j's query language, which can be used for data retrieval.

## Setup
This notebook should be run within Vertex AI Workbench. Be sure to select "single user" when starting a managed notebook to run this. Otherwise, authentication will not allow access to the preview.

First we need to install the latest libraries for Generative AI.

In [None]:
!pip install --user google-cloud-aiplatform --upgrade

You will need to restart the kernel after the pip install completes.

In [1]:
# Note, you will need to set your project_id
project_id = 'neo4jbusinessdev'
location = 'us-central1'

In [2]:
import vertexai
vertexai.init(project=project_id, location=location)

## Prompt Definition

In the upcoming sections, we will extract knowledge that adheres to the following schema. This is a very simplified schema for representing a resume. Normally, domain experts would design an ideal ontology.

![schema](images/schema.png)

To meet our extraction goal for this schema, I am going to chain a series of prompts, each focused on a single task: extracting one specific kind of entity. This keeps each request within the model's token limits and also tends to improve extraction quality.
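
As a rough illustration of the chaining idea, the sketch below fills several task-specific templates with the same resume text, producing one prompt per task. The miniature templates and the `build_prompt_chain` helper are hypothetical stand-ins; the notebook's real prompts are much longer.

```python
from string import Template

# Hypothetical miniature templates, used only to illustrate the chaining idea.
mini_prompts = {
    "person": Template("Extract only the Person entity.\n$ctext\nAnswer:"),
    "skill": Template("Extract only Skill entities.\n$ctext\nAnswer:"),
}

def build_prompt_chain(templates, text):
    """Return one filled-in prompt per extraction task."""
    return {task: tpl.substitute(ctext=text) for task, tpl in templates.items()}

chain = build_prompt_chain(mini_prompts, "Jane Doe, Graph Engineer. Skills: Cypher.")
for task, prompt in chain.items():
    print(task, "->", len(prompt), "characters")
```

Each prompt is sent to the model separately, so a long skill list cannot crowd out the person summary within a single response.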

In [287]:
person_prompt_tpl="""From the Curriculum Vitae text for a job aspirant below, extract Entities strictly as instructed below
1. First, look for this Entity type in the text and generate as comma-separated format similar to entity type.
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. NEVER create new entity types that aren't mentioned below. Document must be summarized and stored inside Person entity under `description` property
    Entity Types:
    label:'Person',id:string,role:string,description:string //Person Node
2. Description property should be a crisp text summary and MUST NOT be more than 100 characters
3. If you cannot find any information on the entities & relationships above, it is okay to return empty value. DO NOT create fictitious data
4. Do NOT create duplicate entities
5. Restrict yourself to extracting only Person information. Do NOT extract Position, Company, Education or Skill information.
Example Output JSON:
{"entities": [{"label":"Person","id":"person1","role":"Prompt Developer","description":"Prompt Developer with more than 30 years of LLM experience"}]}

Question: Now, extract the Person for the text below -
$ctext

Answer:
"""

In [288]:
postion_prompt_tpl="""From the Curriculum Vitae text for a job aspirant below, extract Entities & relationships strictly as instructed below
1. First, look for these Entity types in the text and generate as comma-separated format.
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. NEVER create new entity types that aren't mentioned below. You will have to generate as many entities as needed as per the types below:
    Entity Types:
    label:'Position',id:string,title:string,location:string,startDate:string,endDate:string,url:string //Position Node
    label:'Company',id:string,name:string //Company Node
2. Next generate each relationships as triples of head, relationship and tail. To refer the head and tail entity, use their respective `id` property. NEVER create new Relationship types that aren't mentioned below:
    Relationship definition:
    position|AT_COMPANY|company //Ensure this is a string in the generated output
3. If you cannot find any information on the entities & relationships above, it is okay to return empty value. DO NOT create fictitious data
4. Do NOT create duplicate entities. 
5. Restrict yourself to extracting only Position and Company information. Do NOT extract Education or Skill information.
Example Output JSON:
{"entities": [{"label":"Position","id":"position1","title":"Software Engineer","location":"Singapore","startDate":"2021-01-01","endDate":"present"},{"label":"Position","id":"position2","title":"Senior Software Engineer","location":"Mars","startDate":"2020-01-01","endDate":"2020-12-31"},{"label":"Company","id":"company1","name":"Neo4j Singapore Pte Ltd"},{"label":"Company","id":"company2","name":"Neo4j Mars Inc"}],"relationships": ["position1|AT_COMPANY|company1","position2|AT_COMPANY|company2"]}

Question: Now, extract entities & relationships as mentioned above for the text below -
$ctext

Answer:
"""

In [240]:
skill_prompt_tpl="""From the Curriculum Vitae text below, extract Entities strictly as instructed below
1. Look for prominent Skill Entities in the text. The `id` property of each entity must be alphanumeric and must be unique among the entities. NEVER create new entity types that aren't mentioned below:
    Entity Definition:
    label:'Skill',id:string,name:string,level:string //Skill Node
Example Output Format:
{"entities": [{"label":"Skill","id":"skill1","name":"Neo4j","level":"expert"},{"label":"Skill","id":"skill2","name":"Pytorch","level":"expert"}]}

Question: Now, extract entities as mentioned above for the text below -
$ctext

Answer:
"""

In [281]:
edu_prompt_tpl="""From the Curriculum Vitae text for a job aspirant below, extract Entities strictly as instructed below
1. Look for this Education entity type and generate as comma-separated format similar to entity type.
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. NEVER create other entity types that aren't mentioned below. You will have to generate as many entities as needed as per the types below:
    Entity Definition:
    label:'Education',id:string,degree:string,university:string,graduation_date:string,score:string,url:string //Education Node
2. If you cannot find any information on the entities above, it is okay to return empty value. DO NOT create fictitious data
3. Do NOT create duplicate entities or properties
4. Strictly extract only Education. No Skill or other Entity should be extracted
Output JSON (Strict):
{"entities": [{"label":"Education","id":"education1","degree":"Bachelor of Science","graduation_date":"May 2022","score":"5.0"}]}

Question: Now, extract entities as mentioned above for the text below -
$ctext

Answer:
"""

This helper function sends our prompt and text input to the LLM. We will use the `text-bison` base model. For your use case, you might need to fine-tune it, which Vertex AI supports. The tuned weights stay within your tenant and the base model remains frozen.

In [19]:
from vertexai.preview.language_models import TextGenerationModel
def run_text_model(
    project_id: str,
    model_name: str,
    temperature: float,
    max_decode_steps: int,
    top_p: float,
    top_k: int,
    prompt: str,
    location: str = "us-central1",
    tuned_model_name: str = "",
    ) :
    """Text Completion Use a Large Language Model."""
    vertexai.init(project=project_id, location=location)
    model = TextGenerationModel.from_pretrained(model_name)
    if tuned_model_name:
      model = model.get_tuned_model(tuned_model_name)
    response = model.predict(
        prompt,
        temperature=temperature,
        max_output_tokens=max_decode_steps,
        top_k=top_k,
        top_p=top_p,)
    return response.text

In [5]:
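def extract_entities_relationships(prompt, tuned_model_name):
    """Run the prompt through text-bison; return the reply text, or None if the call fails."""
    try:
        res = run_text_model(project_id, "text-bison@001", 0, 1024, 0.8, 40, prompt, location, tuned_model_name)
        return res
    except Exception as e:
        # Report the error; callers must handle the None result.
        print(e)
        return None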

Now, let's run our extraction task

In [290]:
sample_que = """Contractor Contractor Contractor Jacksonville, FL Work Experience Contractor Criterion Systems - Naval Air Station Key West, FL April 2015 to November 2018 Provide support for Command Naval Region South East, Naval Air Station Key West† ï Worked in conjunction with other contract personnel as well as government civilian and military personnel in the accomplishment of the tasks associated with the contract† ï Communicated with the IT Director if the time lines for specific tasks could not be met and provided detailed information on the impediments, risks associated with not meeting timelines and a mitigation strategy to complete the milestones as soon as possible† ï Documented all tasks accomplished in a monthly status report† ï Provided Navy Marine Corp Intranet, NMCI, support in the form of guidance, education and assistance to ensure that the end users at Naval Region Southeast and its subordinate installations/detachments receive the NMCI systems and applications required to perform mission deliverables† ï Provided the Activity Contract Technical Representatives (ACTRs) and NMCI managers training, guidance, and support services for all NMCI functions to include Defense Messaging System (DMS), IT Administration, and customer support† ï DMS support included providing operational support, maintenance, and distribution of official US Navy messages via DMS, the message traffic included both unclassified and classified messages† ï IT Administration and customer support included providing support for work requests and resolving all IT related issues† ï Provided Tier 1 hardware, software and network connectivity support† ï Identified, researched, and resolved all technical problems† ï Responded to all telephone calls, email and personnel requests for all technical support† ï Documented, tracked and monitored problems to ensure a timely resolution† ï Assisted end users with email delivery analysis and remediation† ï Performed routine maintenance for both 
desktop and networked printers to include exchanging parts, toners, drums, and maintenance kits† ï Administered Active Directory groups and distribution lists† ï Administered share drive folders and permissions† ï Provided technical support to include establishing and troubleshooting Video Teleconference Calls, VTC, as needed† ï Provided complete classroom support to instructors conducting training that included loading NMCI approved Commercial Off the Shelf, COTS, and Government Off the Shelf, GOTS, software, troubleshoot software, and tested for compatibility with Windows 10 computers† ï Submitted and tracked via Remedy Ticketing Systems all base wide issues that were reported† ï Assisted with the Blackberry activation and configuration of Blackberry Enterprise Server email accounts† ï Updated Global Address Lists in Exchange via Active Directory† ï Provided 1st tier support and escalated as deemed necessary to include hardware, software, and connectivity support to both military and civilian personnel System Administrator DMI Mobile Solutions May 2013 to March 2015 Provide support for Command Naval Region South East, Naval Air Station Key West† ï Worked in conjunction with other contract personnel as well as government civilian and military personnel in the accomplishment of the tasks associated with the contract† ï Communicated with the IT Director if the time lines for specific tasks could not be met and provided detailed information on the impediments, risks associated with not meeting timelines and a mitigation strategy to complete the milestones as soon as possible† ï Documented all tasks accomplished in a monthly status report† ï Provided Navy Marine Corp Intranet, NMCI, support in the form of guidance, education and assistance to ensure that the end users at Naval Region Southeast and its subordinate installations/detachments receive the NMCI systems and applications required to perform mission deliverables† ï Provided the Activity Contract Technical 
Representatives (ACTRs) and NMCI managers training, guidance, and support services for all NMCI functions to include Defense Messaging System (DMS), IT Administration, and customer support† ï DMS support included providing operational support, maintenance, and distribution of official US Navy messages via DMS, the message traffic included both unclassified and classified messages† ï IT Administration and customer support included providing support for work requests and resolving all IT related issues† ï Provided Tier 1 hardware, software and network connectivity support† ï Identified, researched, and resolved all technical problems† ï Responded to all telephone calls, email and personnel requests for all technical support† ï Documented, tracked and monitored problems to ensure a timely resolution† ï Assisted end users with email delivery analysis and remediation† ï Performed routine maintenance for both desktop and networked printers to include exchanging parts, toners, drums, and maintenance kits† ï Administered Active Directory groups and distribution lists† ï Administered share drive folders and permissions† ï Provided technical support to include establishing and troubleshooting Video Teleconference Calls, VTC, as needed† ï Provided complete classroom support to instructors conducting training that included loading NMCI approved Commercial Off the Shelf, COTS, and Government Off the Shelf, GOTS, software, troubleshoot software, and tested for compatibility with Windows 10 computers† ï Submitted and tracked via Remedy Ticketing Systems all base wide issues that were reported† ï Assisted with the Blackberry activation and configuration of Blackberry Enterprise Server email accounts† ï Updated Global Address Lists in Exchange via Active Directory† ï Provided 1st tier support and escalated as deemed necessary to include hardware, software, and connectivity support to both military and civilian personnel Systems Administrator SAIC November 2008 to May 2013 Provide 
support for Command Naval Region South East, Naval Air Station Key West† ï Worked in conjunction with other contract personnel as well as government civilian and military personnel in the accomplishment of the tasks associated with the contract† ï Communicated with the IT Director if the time lines for specific tasks could not be met and provided detailed information on the impediments, risks associated with not meeting timelines and a mitigation strategy to complete the milestones as soon as possible† ï Documented all tasks accomplished in a monthly status report† ï Provided Navy Marine Corp Intranet, NMCI, support in the form of guidance, education and assistance to ensure that the end users at Naval Region Southeast and its subordinate installations/detachments receive the NMCI systems and applications required to perform mission deliverables† ï Provided the Activity Contract Technical Representatives (ACTRs) and NMCI managers training, guidance, and support services for all NMCI functions to include Defense Messaging System (DMS), IT Administration, and customer support† ï DMS support included providing operational support, maintenance, and distribution of official US Navy messages via DMS, the message traffic included both unclassified and classified messages† ï IT Administration and customer support included providing support for work requests and resolving all IT related issues† ï Provided Tier 1 hardware, software and network connectivity support† ï Identified, researched, and resolved all technical problems† ï Responded to all telephone calls, email and personnel requests for all technical support† ï Documented, tracked and monitored problems to ensure a timely resolution† ï Assisted end users with email delivery analysis and remediation† ï Performed routine maintenance for both desktop and networked printers to include exchanging parts, toners, drums, and maintenance kits† ï Administered Active Directory groups and distribution lists† ï Administered share 
drive folders and permissions† ï Provided technical support to include establishing and troubleshooting Video Teleconference Calls, VTC, as needed† ï Provided complete classroom support to instructors conducting training that included loading NMCI approved Commercial Off the Shelf, COTS, and Government Off the Shelf, GOTS, software, troubleshoot software, and tested for compatibility with Windows 10 computers† ï Submitted and tracked via Remedy Ticketing Systems all base wide issues that were reported† ï Assisted with the Blackberry activation and configuration of Blackberry Enterprise Server email accounts† ï Updated Global Address Lists in Exchange via Active Directory† ï Provided 1st tier support and escalated as deemed necessary to include hardware, software, and connectivity support to both military and civilian personnel Network Administrator Keys Federal Credit Union September 1998 to September 2008 ï Administered and maintained network that included updating Cisco Routers/Switches IOS† ï Installed/troubleshoot both desktop and network printers to include all preventive maintenance tasks† ï Created/administered user and exchange email accounts† ï Ensured security of the network by regulating and monitoring access to share drive files, password administration, and backed up all files via Veritas Backup to ensure the integrity in the event of network outages, natural disasters† ï Documented network configurations and prepared backup strategies and procedures† ï Identified and documented network problems, possible causes, and ramifications† ï Continually assessed the current systems to make sure it met the needs of Keys Federal Credit Union† ï Monitored servers, IIS, Exchange, SQL, Print servers, via Paessler PRTG Enterprise Console† ï Developed, implemented, and tested off site systems to provide business continuity in case of on-site emergencies such as power failures, hurricanes, fires, and floods† ï Provided all aspects of troubleshooting and fielded 
questions encountered by staff members in regards to all applications and software† ï Trained all staff on both software and hardware usage† ï Stayed abreast of all recent developments, new technologies and products, and emerging communication strategies and methods† ï Continually ensured that all skills are up to date through education and by reading computer related literature† ï Administered Hyland Onbase SQL Enterprise Server which housed all imaging of credit union legal documents to include configuration of the ODBC client which provided connectivity to the Onbase server† ï Provided recommendations on all purchased computer and network systems Education Weslaco High School - Weslaco, TX 1979 to 1982"""
import json
from string import Template

prompts = [person_prompt_tpl, postion_prompt_tpl, skill_prompt_tpl, edu_prompt_tpl]
results = {"entities": [], "relationships": []}
for p in prompts:
    _prompt = Template(p).substitute(ctext=sample_que)
    _extraction = extract_entities_relationships(_prompt, '')
    # The helper returns None when the API call failed; stop with a clear error.
    if _extraction is None:
        raise RuntimeError("LLM call failed; see the error printed above")
    # Drop everything up to an echoed "Answer:" marker, if present.
    if 'Answer:\n' in _extraction:
        _extraction = _extraction.split('Answer:\n', 1)[1]
    if _extraction.strip() == '':
        continue
    # Remove backslash-escaped single quotes and stray backticks, which break json.loads.
    _extraction = _extraction.replace("\\'", "'").replace('`', '')
    try:
        _extraction = json.loads(_extraction)
    except json.JSONDecodeError:
        print(_extraction)
        # Temporary hack: when the Skill list is cut off by the output token limit,
        # keep everything up to the last complete object and close the JSON.
        _extraction = json.loads(_extraction[:_extraction.rfind("}")+1] + ']}')
    results["entities"].extend(_extraction["entities"])
    if "relationships" in _extraction:
        results["relationships"].extend(_extraction["relationships"])
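
The parsing steps in the cell above can also be gathered into one small function. The sketch below is a hypothetical helper (`parse_llm_json` is not part of the notebook). It assumes that a truncated reply was cut off inside the `entities` list, which is the only repair it attempts.

```python
import json

def parse_llm_json(raw):
    """Best-effort parse of a model reply that should contain one JSON object."""
    if raw is None:
        return None
    # Keep only what follows the last "Answer:" marker, if the model echoed it.
    text = raw.rsplit("Answer:", 1)[-1].strip().replace("`", "")
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Assumption: the reply was cut off inside the "entities" list, so keep
    # everything up to the last complete object and close the list and object.
    cut = text.rfind("}")
    if cut == -1:
        return None
    try:
        return json.loads(text[: cut + 1] + "]}")
    except json.JSONDecodeError:
        return None

truncated = '{"entities": [{"label":"Skill","id":"skill1","name":"Neo4j","level":"expert"},{"label":"Skill","id":"sk'
print(parse_llm_json(truncated))
```

Returning `None` instead of raising lets the caller decide whether a failed prompt should abort the whole resume or just be skipped.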

In [283]:
person_id = results["entities"][0]["id"]
for e in results["entities"][1:]:
    if e['label'] == 'Position':
        results["relationships"].append(f"{person_id}|HAS_POSITION|{e['id']}")
    if e['label'] == 'Skill':
        results["relationships"].append(f"{person_id}|HAS_SKILL|{e['id']}")
    if e['label'] == 'Education':
        results["relationships"].append(f"{person_id}|HAS_EDUCATION|{e['id']}")

The extracted entities & relationships will look like this

In [100]:
results

{'entities': [{'label': 'Person',
   'id': 'person1',
   'role': 'Systems Administrator',
   'description': 'Systems Administrator with over 20 years of experience in the IT industry'},
  {'label': 'Position',
   'id': 'position1',
   'title': 'Systems Administrator',
   'location': 'Key West, FL',
   'startDate': 'September 1998',
   'endDate': 'September 2008',
   'url': ''},
  {'label': 'Position',
   'id': 'position2',
   'title': 'Systems Administrator',
   'location': 'Key West, FL',
   'startDate': 'November 2008',
   'endDate': 'May 2013',
   'url': ''},
  {'label': 'Position',
   'id': 'position3',
   'title': 'Systems Administrator',
   'location': 'Key West, FL',
   'startDate': 'April 2015',
   'endDate': 'November 2018',
   'url': ''},
  {'label': 'Company', 'id': 'company1', 'name': 'Keys Federal Credit Union'},
  {'label': 'Company', 'id': 'company2', 'name': 'SAIC'},
  {'label': 'Company', 'id': 'company3', 'name': 'DMI Mobile Solutions'},
  {'label': 'Company', 'id': '

## Data Ingestion Cypher Generation

The entities and relationships we got from the LLM have to be transformed to Cypher so we can write them into Neo4j.

In [264]:
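import re
import time

def get_prop_str(prop_dict, _id):
    """Build the ON CREATE SET clause for every property except label and id."""
    s = []
    for key, val in prop_dict.items():
        if key != 'label' and key != 'id':
            # Escape backslashes and double quotes so the value is a valid Cypher string literal.
            escaped = str(val).replace('\\', '\\\\').replace('"', '\\"')
            s.append(f'{_id}.{key} = "{escaped}"')
    # Without properties there is nothing to set; an empty SET clause is a syntax error.
    return ' ON CREATE SET ' + ','.join(s) if s else ''

def get_cypher_compliant_var(_id):
    # Strip non-alphanumeric characters and prefix with "_" so the name never starts with a digit.
    s = "_" + re.sub(r'[\W_]', '', _id).lower()
    return s[:20]  # restrict variable size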

def generate_cypher(in_json):
    e_map = {}
    e_stmt = []
    r_stmt = []
    e_stmt_tpl = Template("($id:$label{id:'$key'})")
    r_stmt_tpl = Template("""
      MATCH $src
      MATCH $tgt
      MERGE ($src_id)-[:$rel]->($tgt_id)
    """)
    for obj in in_json:
      for j in obj['entities']:
          props = ''
          label = j['label']
          id = ''
          if label == 'Person':
            id = 'p'+str(time.time_ns())
          elif label == 'Position':
            id = 'j'+str(time.time_ns())
          elif label == 'Education':
            id = 'e'+str(time.time_ns())
          else:
                id = get_cypher_compliant_var(j['name'])
          if label in ['Person', 'Position', 'Education', 'Skill', 'Company']:
            varname = get_cypher_compliant_var(j['id'])
            stmt = e_stmt_tpl.substitute(id=varname, label=label, key=id)
            e_map[varname] = stmt
            e_stmt.append('MERGE '+ stmt + get_prop_str(j, varname))

      for st in obj['relationships']:
          rels = st.split("|")
          src_id = get_cypher_compliant_var(rels[0].strip())
          rel = rels[1].strip()
          if rel in ['HAS_SKILL', 'HAS_EDUCATION', 'AT_COMPANY', 'HAS_POSITION']: #we ignore other relationships
            tgt_id = get_cypher_compliant_var(rels[2].strip())
            stmt = r_stmt_tpl.substitute(
              src_id=src_id, tgt_id=tgt_id, src=e_map[src_id], tgt=e_map[tgt_id], rel=rel)
            r_stmt.append(stmt)

    return e_stmt, r_stmt
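
The function above builds Cypher by string concatenation, so it depends on correct escaping of every property value. An alternative, sketched below, is to pass the values as query parameters. `build_parameterized_merge` is a hypothetical helper, not part of the notebook. The assumption is that the resulting pair would be executed with `gds.run_cypher(query, params)`.

```python
def build_parameterized_merge(entity, key):
    """Return (query, params) for one entity dict; values travel as parameters."""
    label = entity["label"]
    # Labels cannot be passed as parameters, so restrict them to the known schema.
    if label not in {"Person", "Position", "Education", "Skill", "Company"}:
        raise ValueError(f"Unexpected label: {label}")
    # Everything except label and id becomes a node property.
    props = {k: v for k, v in entity.items() if k not in ("label", "id")}
    query = f"MERGE (n:{label} {{id: $key}}) ON CREATE SET n += $props"
    return query, {"key": key, "props": props}

query, params = build_parameterized_merge(
    {"label": "Skill", "id": "skill1", "name": "Neo4j", "level": "expert"},
    key="_neo4j",
)
print(query)
print(params)
```

Because the values never become part of the query text, quotes or backslashes in the extracted resume data cannot break the statement.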

In [102]:
ent_cyp, rel_cyp = generate_cypher([results])

print(ent_cyp, rel_cyp)

['MERGE (_person1:Person{id:\'p1685102194065206925\'}) ON CREATE SET _person1.role = "Systems Administrator",_person1.description = "Systems Administrator with over 20 years of experience in the IT industry"', 'MERGE (_position1:Position{id:\'j1685102194065241742\'}) ON CREATE SET _position1.title = "Systems Administrator",_position1.location = "Key West, FL",_position1.startDate = "September 1998",_position1.endDate = "September 2008",_position1.url = ""', 'MERGE (_position2:Position{id:\'j1685102194065257867\'}) ON CREATE SET _position2.title = "Systems Administrator",_position2.location = "Key West, FL",_position2.startDate = "November 2008",_position2.endDate = "May 2013",_position2.url = ""', 'MERGE (_position3:Position{id:\'j1685102194065270601\'}) ON CREATE SET _position3.title = "Systems Administrator",_position3.location = "Key West, FL",_position3.startDate = "April 2015",_position3.endDate = "November 2018",_position3.url = ""', 'MERGE (_company1:Company{id:\'_keysfederalcre

## Data Ingestion

You will need a Neo4j AuraDS Pro instance.  You can deploy that on Google Cloud Marketplace [here](https://console.cloud.google.com/marketplace/product/endpoints/prod.n4gcp.neo4j.io).

With that complete, you'll need to install the Neo4j library and set up your database connection.

In [None]:
%pip install --user graphdatascience

In [10]:
from graphdatascience import GraphDataScience

In [11]:
# You will need to change these variables
connectionUrl = 'neo4j+s://7929d24d.databases.neo4j.io'
username = 'neo4j'
password = 'MhfEDkNieOS79LCTR6KbGNyblfmfjeAmaBbdx-70wVg'
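
Since the cell above holds credentials in plain text, one option is to read them from environment variables instead. The sketch below assumes hypothetical variable names (`NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`) and uses a placeholder URI. Neither comes from the original notebook.

```python
import os

def load_neo4j_settings(env=os.environ):
    """Read connection settings from a mapping (the process environment by default)."""
    missing = [name for name in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # The username falls back to "neo4j", the value used in this notebook.
    return env["NEO4J_URI"], env.get("NEO4J_USERNAME", "neo4j"), env["NEO4J_PASSWORD"]

# Demonstration with an explicit mapping; placeholder values only.
settings = load_neo4j_settings({
    "NEO4J_URI": "neo4j+s://example.databases.neo4j.io",
    "NEO4J_PASSWORD": "change-me",
})
print(settings[0], settings[1])
```

The three returned values map directly onto `connectionUrl`, `username` and `password`.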

In [12]:
gds = GraphDataScience(connectionUrl, auth=(username, password))
gds.version()

'2.3.6+19'

Before loading the data, create constraints as below

In [29]:
gds.run_cypher('CREATE CONSTRAINT unique_person_id IF NOT EXISTS FOR (n:Person) REQUIRE (n.id) IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_position_id IF NOT EXISTS FOR (n:Position) REQUIRE (n.id) IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_skill_id IF NOT EXISTS FOR (n:Skill) REQUIRE n.id IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_education_id IF NOT EXISTS FOR (n:Education) REQUIRE n.id IS UNIQUE')
gds.run_cypher('CREATE CONSTRAINT unique_company_id IF NOT EXISTS FOR (n:Company) REQUIRE n.id IS UNIQUE')
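
The five statements above differ only in the node label, so they can also be generated from a list. This is a minimal sketch; `constraint_statements` is a hypothetical helper, and each resulting string would be passed to `gds.run_cypher` as in the cell above.

```python
LABELS = ["Person", "Position", "Skill", "Education", "Company"]

def constraint_statements(labels):
    """Build one uniqueness-constraint statement per node label."""
    return [
        f"CREATE CONSTRAINT unique_{label.lower()}_id IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.id IS UNIQUE"
        for label in labels
    ]

for stmt in constraint_statements(LABELS):
    print(stmt)
```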

Ingest the entities

In [None]:
%%time
for e in ent_cyp:
    gds.run_cypher(e)


CPU times: user 161 ms, sys: 15.4 ms, total: 176 ms
Wall time: 13.1 s


Ingest relationships now

In [104]:
%%time
for r in rel_cyp:
    gds.run_cypher(r)

CPU times: user 128 ms, sys: 15.6 ms, total: 144 ms
Wall time: 11.7 s


Your ingested data from the above commands might look like this:
![ingested data](images/ingested_data.png)

We have thousands of resumes in the `data` directory. Let us run a pipeline to ingest only a few of them for now.

In [460]:
import glob
def run_pipeline(start=0, count=1):
    # Sort for a stable order, then take `count` files beginning at index `start`.
    txt_files = sorted(glob.glob("data/*.txt"))[start:start + count]
    print(f"Running pipeline for {len(txt_files)} files")
    failed_files = process_pipeline(txt_files)
    print(failed_files)
    return failed_files

def process_pipeline(files):
    failed_files = []
    i = 0
    for f in files:
        i += 1
        try:
            with open(f, 'r', encoding='utf-8', errors='ignore') as file:
                print(f"  {f}: Reading File No. ({i})")
                data = file.read().rstrip()
                text = data
                print(f"    {f}: Extracting Entities & Relationships")
                results = run_extraction(f, text)
                print(f"    {f}: Generating Cypher")
                ent_cyp, rel_cyp = generate_cypher(results)
                print(f"    {f}: Ingesting Entities")
                for e in ent_cyp:
                    gds.run_cypher(e)
                print(f"    {f}: Ingesting Relationships")
                for r in rel_cyp:
                    gds.run_cypher(r)
                print(f"    {f}: Processing DONE")
        except Exception as e:
            print(f"    {f}: Processing Failed with exception {e}")
            failed_files.append(f)
    return failed_files
        
from timeit import default_timer as timer
def run_extraction(f, text):
    start = timer()
    prompts = [person_prompt_tpl, postion_prompt_tpl, skill_prompt_tpl, edu_prompt_tpl]
    results = {"entities": [], "relationships": []}
    for p in prompts:
        _prompt = Template(p).substitute(ctext=text)
        _extraction = extract_entities_relationships(_prompt, '')
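        # The helper returns None when the API call failed; raise so the file is recorded as failed.
        if _extraction is None:
            raise RuntimeError("LLM call failed")
        # Drop everything up to an echoed "Answer:" marker, if present.
        if 'Answer:\n' in _extraction:
            _extraction = _extraction.split('Answer:\n', 1)[1]
        if _extraction.strip() == '':
            continue
        # Remove backslash-escaped single quotes and stray backticks, which break json.loads.
        _extraction = _extraction.replace("\\'", "'").replace('`', '')
        try:
            _extraction = json.loads(_extraction)
        except json.JSONDecodeError:
            # Temporary hack: when the Skill list is cut off by the output token limit,
            # keep everything up to the last complete object and close the JSON.
            _extraction = json.loads(_extraction[:_extraction.rfind("}")+1] + ']}')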
        results["entities"].extend(_extraction["entities"])
        if "relationships" in _extraction:
            results["relationships"].extend(_extraction["relationships"])
    person_id = results["entities"][0]["id"]
    for e in results["entities"][1:]:
        if e['label'] == 'Position':
            results["relationships"].append(f"{person_id}|HAS_POSITION|{e['id']}")
        if e['label'] == 'Skill':
            results["relationships"].append(f"{person_id}|HAS_SKILL|{e['id']}")
        if e['label'] == 'Education':
            results["relationships"].append(f"{person_id}|HAS_EDUCATION|{e['id']}")
    end = timer()
    elapsed = (end-start)
    print(f"    {f}: Entity Extraction took {elapsed}secs")
    return [results]

Let's run the pipeline for only the first 5 files.

In [None]:
%%time
failed_files = run_pipeline(0, 5)

If processing failed for some files because of the API rate limit, you can retry as below. For token limit errors, it is better to chunk the text and retry.
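
One way to chunk the text is sketched below. It splits on character counts as a rough proxy for tokens. The `chunk_text` helper and the limits of 4000 characters with a 200-character overlap are assumptions for illustration, not values taken from the model's documentation.

```python
def chunk_text(text, max_chars=4000, overlap=200):
    """Split text into overlapping character windows, preferring to break at spaces."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half.
            space = text.rfind(" ", start + overlap + 1, end)
            if space != -1:
                end = space
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back so consecutive chunks share `overlap` characters of context.
        start = end - overlap
    return chunks

resume_text = "Provided Tier 1 hardware, software and network connectivity support. " * 200
chunks = chunk_text(resume_text, max_chars=4000, overlap=200)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk could then be passed through the extraction prompts separately. Entities that appear in the overlapping region may be extracted twice and would need de-duplication.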

In [None]:
%%time
failed_files = process_pipeline(failed_files)
failed_files
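
For rate-limit failures, a common approach is to retry the call after an exponentially growing delay rather than re-running whole files immediately. The sketch below is a generic, hypothetical helper (`call_with_backoff` is not part of the notebook or of the Vertex AI SDK). It would need to wrap `run_text_model` directly, because `extract_entities_relationships` catches exceptions itself.

```python
import random
import time

def call_with_backoff(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on an exception wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Demonstration with a stand-in function that fails twice before succeeding.
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(call_with_backoff(flaky_call, base_delay=0.01))
```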

## Cypher Generation for Consumption

### Tune the model to generate Cypher

At the time of writing, the `text-bison` base model needs some fine-tuning to generate syntactically correct Cypher statements, so let us tune it for that task. In this section, we tune the model using only 30 Cypher statements. With this limited tuning, the model gains some Cypher generation capability, but it is not state of the art. In a production scenario, you should aim for more training data. Training takes more than an hour in total and uses TPU resources.

First, let us upload our training set in `jsonl` format to a GCS bucket. We will use the file `finetuning/eng-to-cypher-trng.jsonl` for our fine-tuning. You can inspect the data there.

VertexAI expects you to adhere to this format for each line of the `jsonl` file. 
```json
{"input_text": "MY_INPUT_PROMPT", "output_text": "CYPHER_QUERY"} 
```
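
A training file in this format can be produced with the standard `json` module. In the sketch below, `to_jsonl` is a hypothetical helper, and the question and Cypher pair is a made-up example rather than a line from the actual training set.

```python
import json

def to_jsonl(pairs):
    """Serialize (question, cypher) pairs into one JSON object per line."""
    lines = []
    for question, cypher in pairs:
        if not question.strip() or not cypher.strip():
            raise ValueError("Both the prompt and the Cypher query must be non-empty")
        lines.append(json.dumps({"input_text": question, "output_text": cypher}))
    return "\n".join(lines) + "\n"

# Hypothetical training pair, for illustration only.
sample = to_jsonl([
    ("How many people know Python?",
     "MATCH (p:Person)-[:HAS_SKILL]->(s:Skill) "
     "WHERE toLower(s.name) CONTAINS 'python' RETURN count(p)"),
])
print(sample)
```

Using `json.dumps` for each record takes care of escaping quotes and newlines inside the Cypher text, so every record stays on a single line.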

When the training data changes, upload the updated file under a different name than in your previous tuning runs. Vertex AI caches previously uploaded data, so it tends to skip file validation and fall back to the earlier upload.

In [311]:
import time
from google.cloud import storage
from google.cloud.exceptions import NotFound

bucket_name = project_id + '-genai'
client = storage.Client()
try:
    bucket = client.get_bucket(bucket_name)
except NotFound:
    # The bucket does not exist yet, so create it.
    bucket = client.bucket(bucket_name)
    bucket.storage_class = 'STANDARD'
    bucket = client.create_bucket(bucket)

# A timestamp in the object name ensures Vertex AI reloads the file.
upload_name = f"finetuning/eng-to-cypher-trng-{int(time.time())}.jsonl"
filename = 'finetuning/eng-to-cypher-trng.jsonl'
blob = bucket.blob(upload_name)
blob.upload_from_filename(filename)

Let's tune the model for a hundred training steps. When you run the code below, the following sequence happens:
1. Pipeline Validation
2. Dataset Export
3. Prompt Validation
4. jsonl to tfrecord conversion
5. Parameter Composition for Adapter tuning
6. LLM Tuning
7. Model upload
8. Endpoint deployment


![finetuning-process](images/finetune-seq.png)

In [None]:
from vertexai.preview.language_models import TextGenerationModel

training_data = 'gs://' + bucket_name + '/' + upload_name
train_steps = 100

vertexai.init(project=project_id, location=location)
model = TextGenerationModel.from_pretrained("text-bison@001")

# The tuning job runs in europe-west4; the tuned model is deployed to us-central1
model.tune_model(
    training_data=training_data,
    train_steps=train_steps,
    tuning_job_location="europe-west4",
    tuned_model_location="us-central1",
)

To get the details of the fine-tuned model:

In [319]:
model = TextGenerationModel.from_pretrained("text-bison@001")
models = model.list_tuned_model_names()
#The first model in the list is the one we just tuned.
entity_extraction_tuned_model = models[0]
entity_extraction_tuned_model

'projects/803648085855/locations/us-central1/models/4627846640332439552'
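
The name returned above is a full resource path of the form `projects/<project>/locations/<location>/models/<model id>`. As a small sketch, it can be split into its parts, for instance to confirm that the model landed in the expected location. The helper `parse_model_name` is a made-up name for illustration and is not part of the Vertex AI SDK.

```python
from typing import Dict


def parse_model_name(resource_name: str) -> Dict[str, str]:
    """Split 'projects/P/locations/L/models/M' into its three identifiers."""
    parts = resource_name.split("/")
    # Every other segment is a fixed keyword; the segments in between are the values
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "models"]:
        raise ValueError(f"Unexpected model resource name: {resource_name!r}")
    project, location, model_id = parts[1::2]
    return {"project": project, "location": location, "model_id": model_id}


info = parse_model_name(
    "projects/803648085855/locations/us-central1/models/4627846640332439552"
)
print(info["location"])  # us-central1
```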

### Generate Cypher

Let's create a wrapper to call the text model.

In [481]:
def english_to_cypher(prompt, tuned_model_name=''):
    try:
        res = run_text_model(project_id, "text-bison@001", 0.1, 1024, 0.95, 40, prompt, location, tuned_model_name)
        return res
    except Exception as e:
        print(e)

We need a prompt template that clearly states which schema to use, what kind of Cypher to generate, and how. We also provide a question-answer example, along with the reasoning behind that answer.

In [483]:
prompt = """
Context:
You are an expert Neo4j Cypher translator who understands the question in English and converts it to Cypher strictly based on the Neo4j Schema provided and the instructions below:
1. Use the Neo4j schema to generate cypher compatible ONLY for Neo4j Version 5
2. Do not use EXISTS, SIZE keywords in the cypher. Use alias when using the WITH keyword
3. Use only Nodes and relationships mentioned in the schema while generating the response
4. Reply ONLY in Cypher
5. Always do a case-insensitive and fuzzy search for any properties related search. Eg: to search for a Company name use `toLower(c.name) contains 'neo4j'`
6. Candidate node is synonymous to Person
7. Always use aliases to refer properties in the query
Now, use this Neo4j schema and Reply ONLY in Cypher when it makes sense.
Schema:
Nodes:
    label:'Person',id:string,role:string,description:string //Person Node
    label:'Position',id:string,title:string,location:string,startDate:string,endDate:string,url:string //Position Node
    label:'Company',id:string,name:string //Company Node
    label:'Skill',id:string,name:string,level:string //Skill Node
    label:'Education',id:string,degree:string,university:string,graduation_date:string,score:string,url:string //Education Node
Relationships:
    (:Person)-[:HAS_POSITION]->(:Position)
    (:Position)-[:AT_COMPANY]->(:Company)
    (:Person)-[:HAS_SKILL]->(:Skill)
    (:Person)-[:HAS_EDUCATION]->(:Education)
Output Format (Strict): //Only code as output. No other text
MATCH (p:Person)-[:HAS_SKILL]->(s:Skill) WHERE toLower(s.name) CONTAINS 'java' AND toLower(s.level) CONTAINS 'expert' RETURN COUNT(p)

Question: How many Texas-based experts do I have on Delphi?
Answer:
MATCH (p:Person)-[:HAS_SKILL]->(s:Skill) 
MATCH (p)-[:HAS_POSITION]->(pos:Position)
WHERE toLower(s.name) CONTAINS 'delphi' AND toLower(s.level) CONTAINS 'expert' 
AND (toLower(pos.location) CONTAINS 'texas' OR toLower(pos.location) CONTAINS 'tx') RETURN COUNT(p)

Reason:
1. As per schema definition of nodes & relationships above, Person node is related to Skill node via HAS_SKILL relationship.
2. From the schema, Skill has name and levels as properties. Expertise can be checked using `level`
3. Since Texas can be denoted as TX, we search for the Position's location as either 'texas' or 'tx'
4. Finally, we return the number of persons who match the input criteria using COUNT function

Question:
{question}

Answer:
"""

que = 'How many experts do we have on MS Word?'
_prompt = prompt.replace('{question}', que)

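# english_to_cypher returns None when the call fails, so fall back to an empty string
cypher = english_to_cypher(_prompt, entity_extraction_tuned_model) or ''
if 'Answer:\n ' in cypher:
    cypher = cypher.split('Answer:\n ')[1]
# Join lines with a space so multi-line Cypher does not run tokens together
cypher = cypher.replace('\n', ' ').strip()
cypher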
    

"MATCH (p:Person)-[:HAS_SKILL]->(s:Skill) WHERE toLower(s.name) CONTAINS 'word' RETURN COUNT(p)"

## Skill Finder Chatbot

With our tuned model now working, let's create a chatbot that lets us interact with Neo4j using only English.

We will use LangChain to quickly build a chatbot that converts English to Cypher, executes it on Neo4j, and then uses GenAI to turn the query result into a natural-language response for the user.
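
Before bringing in LangChain, the three-stage flow can be sketched in plain Python. This is only an outline of the idea, not how LangChain implements it: the function `answer_question` and the three stand-in callables are hypothetical, and the stand-ins return canned values so the sketch runs without Vertex AI or Neo4j.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def answer_question(
    question: str,
    to_cypher: Callable[[str], str],
    run_query: Callable[[str], Rows],
    summarize: Callable[[str, Rows], str],
) -> str:
    """English question -> Cypher -> query rows -> English answer."""
    cypher = to_cypher(question)      # step 1: an LLM translates the question
    rows = run_query(cypher)          # step 2: the database executes the Cypher
    return summarize(question, rows)  # step 3: an LLM phrases the result


# Stand-ins with canned values (assumptions for illustration only)
def fake_to_cypher(question: str) -> str:
    return "MATCH (p:Person)-[:HAS_SKILL]->(s:Skill) RETURN count(p) AS experts"


def fake_run_query(cypher: str) -> Rows:
    return [{"experts": 4}]


def fake_summarize(question: str, rows: Rows) -> str:
    return f"We have {rows[0]['experts']} experts."


reply = answer_question(
    "How many experts do we have?", fake_to_cypher, fake_run_query, fake_summarize
)
print(reply)  # We have 4 experts.
```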

### VertexAI LangChain
LangChain support for Vertex AI has not yet been merged into the LangChain codebase. Until then, we will use the following code.

In [438]:
import time
from typing import Any, Mapping, List, Dict, Optional, Tuple, Union
from dataclasses import dataclass, field

from pydantic import BaseModel, Extra, root_validator

from langchain.llms.base import LLM
from langchain.embeddings.base import Embeddings
from langchain.chat_models.base import BaseChatModel
from langchain.llms.utils import enforce_stop_tokens
from langchain.schema import Generation, LLMResult
from langchain.schema import AIMessage, BaseMessage, ChatGeneration, ChatResult, HumanMessage, SystemMessage

from vertexai.preview.language_models import TextGenerationResponse, ChatSession


def rate_limit(max_per_minute):
  period = 60 / max_per_minute
  print('Waiting')
  while True:
    before = time.time()
    yield
    after = time.time()
    elapsed = after - before
    sleep_time = max(0, period - elapsed)
    if sleep_time > 0:
      print('.', end='')
      time.sleep(sleep_time)


class _VertexCommon(BaseModel):
    """Wrapper around Vertex AI large language models.

    To use, you should have the
    ``vertexai.preview.language_models`` python package
    installed.
    """
    client: Any = None #: :meta private:
    model_name: str = "text-bison@001"
    """Model name to use."""

    temperature: float = 0.2
    """What sampling temperature to use."""

    top_p: float = 0.8
    """Total probability mass of tokens to consider at each step."""

    top_k: int = 40
    """The number of highest probability tokens to keep for top-k filtering."""

    max_output_tokens: int = 200
    """The maximum number of tokens to generate in the completion."""

    @property
    def _default_params(self) -> Mapping[str, Any]:
        """Get the default parameters for calling Vertex AI API."""
        return {
            "temperature": self.temperature,
            "top_p": self.top_p,
            "top_k": self.top_k,
            "max_output_tokens": self.max_output_tokens
        }

    def _predict(self, prompt: str, stop: Optional[List[str]]) -> str:
        res = self.client.predict(prompt, **self._default_params)
        return self._enforce_stop_words(res.text, stop)

    def _enforce_stop_words(self, text: str, stop: Optional[List[str]]) -> str:
        if stop:
            return enforce_stop_tokens(text, stop)
        return text

    @property
    def _llm_type(self) -> str:
        """Return type of llm."""
        return "vertex_ai"

class VertexLLM(_VertexCommon, LLM):
    model_name: str = "text-bison@001"

    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that the python package exists in environment."""
        try:
            from vertexai.preview.language_models import TextGenerationModel
        except ImportError:
            raise ValueError(
                "Could not import Vertex AI LLM python package. "
            )

        try:
            values["client"] = TextGenerationModel.from_pretrained(values["model_name"])
        except AttributeError:
            raise ValueError(
                "Could not set Vertex Text Model client."
            )

        return values

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        """Call out to Vertex AI's create endpoint.

        Args:
            prompt: The prompt to pass into the model.

        Returns:
            The string generated by the model.
        """
        return self._predict(prompt, stop)


@dataclass
class _MessagePair:
    """A question from the human paired with the AI's answer."""

    question: HumanMessage
    answer: AIMessage


@dataclass
class _ChatHistory:
    """Chat history as question-answer pairs plus an optional system message."""

    history: List[_MessagePair] = field(default_factory=list)
    system_message: Optional[SystemMessage] = None


def _parse_chat_history(history: List[BaseMessage]) -> _ChatHistory:
    """Parses a sequence of messages into history.

    A sequence should be either (SystemMessage, HumanMessage, AIMessage,
    HumanMessage, AIMessage, ...) or (HumanMessage, AIMessage, HumanMessage,
    AIMessage, ...).
    """
    if not history:
        return _ChatHistory()
    first_message = history[0]
    system_message = first_message if isinstance(first_message, SystemMessage) else None
    chat_history = _ChatHistory(system_message=system_message)
    messages_left = history[1:] if system_message else history
    # if len(messages_left) % 2 != 0:
    #     raise ValueError(
    #         f"Amount of messages in history should be even, got {len(messages_left)}!"
    #     )
    for question, answer in zip(messages_left[::2], messages_left[1::2]):
        if not isinstance(question, HumanMessage) or not isinstance(answer, AIMessage):
            raise ValueError(
                "An AI message should follow a human one, "
                f"got {question.type}, {answer.type}."
            )
        chat_history.history.append(_MessagePair(question=question, answer=answer))
    return chat_history


class _VertexChatCommon(_VertexCommon):
    """Wrapper around Vertex AI Chat large language models.

    To use, you should have the
    ``vertexai.preview.language_models`` python package
    installed.
    """
    model_name: str = "chat-bison@001"
    """Model name to use."""

    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that the python package exists in environment."""
        try:
            from vertexai.preview.language_models import ChatModel
        except ImportError:
            raise ValueError(
                "Could not import Vertex AI LLM python package. "
            )

        try:
            values["client"] = ChatModel.from_pretrained(values["model_name"])
        except AttributeError:
            raise ValueError(
                "Could not set Vertex Text Model client."
            )

        return values

    def _response_to_chat_results(
        self, response: TextGenerationResponse, stop: Optional[List[str]]
    ) -> ChatResult:
        text = self._enforce_stop_words(response.text, stop)
        return ChatResult(generations=[ChatGeneration(message=AIMessage(content=text))])


class VertexChat(_VertexChatCommon, BaseChatModel):
    """Wrapper around Vertex AI large language models.

    To use, you should have the
    ``vertexai.preview.language_models`` python package
    installed.
    """

    model_name: str = "chat-bison@001"
    chat: Any = None  #: :meta private:

    def send_message(
        self, message: Union[HumanMessage, str], stop: Optional[List[str]] = None
    ) -> ChatResult:
        text = message.content if isinstance(message, BaseMessage) else message
        response = self.chat.send_message(text)
        text = self._enforce_stop_words(response.text, stop)
        return ChatResult(generations=[ChatGeneration(message=AIMessage(content=text))])

    def _generate(
        self, messages: List[BaseMessage], stop: Optional[List[str]] = None
    ) -> ChatResult:
        if not messages:
            raise ValueError(
                "You should provide at least one message to start the chat!"
            )
        question = messages[-1]
        if not isinstance(question, HumanMessage):
            raise ValueError(
                f"Last message in the list should be from human, got {question.type}."
            )
        self.start_chat(messages[:-1])
        return self.send_message(question, stop=stop)

    def start_chat(self, messages: List[BaseMessage]) -> None:
        """Starts a chat."""
        history = _parse_chat_history(messages)
        context = history.system_message.content if history.system_message else None
        self.chat = self.client.start_chat(context=context, **self._default_params)
        for pair in history.history:
            self.chat._history.append((pair.question.content, pair.answer.content))

    def clear_chat(self) -> None:
        self.chat = None

    @property
    def history(self) -> List[BaseMessage]:
        """Chat history."""
        history: List[BaseMessage] = []
        if self.chat:
            for question, answer in self.chat._history:
                history.append(HumanMessage(content=question))
                history.append(AIMessage(content=answer))
        return history

    async def _agenerate(
        self, messages: List[BaseMessage], stop: Optional[List[str]] = None
    ) -> ChatResult:
        raise NotImplementedError(
            """Vertex AI doesn't support async requests at the moment."""
        )

class VertexMultiTurnChat(_VertexChatCommon, BaseChatModel):
    """Wrapper around Vertex AI large language models."""

    model_name: str = "chat-bison@001"
    chat: Optional[ChatSession] = None

    def clear_chat(self) -> None:
        self.chat = None

    def start_chat(self, message: Optional[SystemMessage] = None) -> None:
        if self.chat:
            raise ValueError("Chat has already been started. Please, clear it first.")
        if message and not isinstance(message, SystemMessage):
            raise ValueError("Context should be a system message")
        context = message.content if message else None
        self.chat = self.client.start_chat(context=context, **self._default_params)

    @property
    def history(self) -> List[Tuple[str, str]]:
        """Chat history."""
        if self.chat:
            return self.chat._history
        return []

    def _generate(
        self, messages: List[BaseMessage], stop: Optional[List[str]] = None
    ) -> ChatResult:
        if len(messages) != 1:
            raise ValueError(
                "You should send exactly one message to the chat each turn."
            )
        if not self.chat:
            raise ValueError("You should start_chat first!")
        response = self.chat.send_message(messages[0].content)
        return self._response_to_chat_results(response, stop=stop)

    async def _agenerate(
        self, messages: List[BaseMessage], stop: Optional[List[str]] = None
    ) -> ChatResult:
        raise NotImplementedError(
            """Vertex AI doesn't support async requests at the moment."""
        )

class VertexEmbeddings(Embeddings, BaseModel):
    """Wrapper around Vertex AI large language models embeddings API.

    To use, you should have the
    ``vertexai.preview.language_models`` python package
    installed.
    """
    model_name: str = "textembedding-gecko@001"
    """Model name to use."""

    model: Any
    requests_per_minute: int = 15


    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that the python package exists in environment."""
        try:
            from vertexai.preview.language_models import TextEmbeddingModel

        except ImportError:
            raise ValueError(
                "Could not import Vertex AI LLM python package. "
            )

        try:
            values["model"] = TextEmbeddingModel

        except AttributeError:
            raise ValueError(
                "Could not set Vertex Text Model client."
            )

        return values

    class Config:
        """Configuration for this pydantic object."""

        extra = Extra.forbid

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
      """Call Vertex LLM embedding endpoint for embedding docs
      Args:
          texts: The list of texts to embed.
      Returns:
          List of embeddings, one for each text.
      """
      self.model = self.model.from_pretrained(self.model_name)

      limiter = rate_limit(self.requests_per_minute)
      results = []
      docs = list(texts)

      while docs:
        # Working in batches of 2 because the API apparently won't let
        # us send more than 2 documents per request to get embeddings.
        head, docs = docs[:2], docs[2:]
        # print(f'Sending embedding request for: {head!r}')
        chunk = self.model.get_embeddings(head)
        results.extend(chunk)
        next(limiter)

      return [r.values for r in results]

    def embed_query(self, text: str) -> List[float]:
      """Call Vertex LLM embedding endpoint for embedding query text.
      Args:
        text: The text to embed.
      Returns:
        Embedding for the text.
      """
      single_result = self.embed_documents([text])
      return single_result[0]
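
The batching and rate limiting inside `embed_documents` can also be written as a standalone helper, which makes the logic easier to test. The sketch below is an alternative formulation, not the code used above: `batched` and `embed_all` are hypothetical names, the batch size of 2 and the 15 requests per minute mirror the defaults in `VertexEmbeddings`, and `fake_embed` stands in for the real embedding call.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
    requests_per_minute: int = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed texts in small batches, spacing requests to respect a rate limit."""
    min_interval = 60.0 / requests_per_minute
    vectors: List[List[float]] = []
    last_start = None
    for batch in batched(texts, batch_size):
        if last_start is not None:
            # Wait until a full interval has passed since the previous request
            wait = min_interval - (time.monotonic() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = time.monotonic()
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in for the embedding call; it records the batch sizes it receives
sent = []

def fake_embed(batch: List[str]) -> List[List[float]]:
    sent.append(len(batch))
    return [[float(len(text))] for text in batch]


pauses = []  # collects the requested pauses instead of actually sleeping
vectors = embed_all(["a", "bb", "ccc", "dddd", "e"], fake_embed, sleep=pauses.append)
print(sent, len(vectors), len(pauses))  # [2, 2, 1] 5 2
```

Passing `sleep` as a parameter is a design choice for testability: the example records the pauses instead of waiting four seconds between batches.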

### Neo4j LangChain module
Neo4j released a LangChain agent that converts English to Cypher and, based on your DB schema, augments the query result with an LLM. This makes graph consumption easier for people who are not Cypher experts.

![neo4j-langchain](images/langchain-neo4j.png)

Let's see how to use it. First, we create the Neo4jGraph and VertexLLM connection objects.

In [439]:
from langchain.chains import GraphCypherQAChain
from langchain.graphs import Neo4jGraph
from langchain.prompts.prompt import PromptTemplate

graph = Neo4jGraph(
    url=connectionUrl, 
    username="neo4j", 
    password=password
)
chain = GraphCypherQAChain.from_llm(
    VertexLLM(model_name='text-bison@001',
            max_output_tokens=1024,
            temperature=0,
            top_p=0.95,
            top_k=40), graph=graph, verbose=True
)

That's it! You can now run the agent. Simply ask your question in English, and the chain generates the Cypher, runs it, and returns the answer.

In [453]:
chain.run("""How many experts do we have on MS Word?""")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person)-[:HAS_SKILL]->(s:Skill) WHERE s.name = "MS Word" RETURN count(p)
Full Context:
[{'count(p)': 4}]

> Finished chain.


'We have 4 experts on MS Word.'

### Chatbot!
Time to build a chatbot. We will use Gradio to quickly try it out.

In [None]:
!pip install --user gradio --quiet

Running the code below renders a chat widget. The Cypher generated for each input is shown below the widget.
P.S.: Due to quota limitations, you may see errors when submitting input. If so, wait a while between queries.
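
One way to soften the quota errors is to retry a failed call after a growing pause. The sketch below is an assumption-laden illustration: `with_retries` is a hypothetical helper, it treats every exception as retryable (in practice you would likely narrow this to the quota error raised by the client library), and `flaky` simulates a function that fails twice before succeeding.

```python
import time
from typing import Callable


def with_retries(
    fn: Callable[[str], str],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], str]:
    """Wrap fn so that failures are retried with exponentially growing pauses."""
    def wrapper(text: str) -> str:
        for attempt in range(attempts):
            try:
                return fn(text)
            except Exception as exc:  # assumption: any error is worth retrying
                if attempt == attempts - 1:
                    return f"Sorry, the request failed: {exc}"
                sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
        return ""  # only reached when attempts is 0
    return wrapper


# Simulated backend that fails twice, then answers
calls = {"n": 0}

def flaky(text: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("quota exceeded")
    return f"answer to: {text}"


delays = []  # collects the pauses instead of actually sleeping
safe = with_retries(flaky, sleep=delays.append)
result = safe("How many experts?")
print(result, delays)  # answer to: How many experts? [2.0, 4.0]
```

With Gradio, the wrapped function could be passed in place of the original, for example `gr.Interface(fn=with_retries(chat_response), inputs="text", outputs="text")`.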

In [455]:
import gradio as gr
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
llm=VertexLLM(model_name='text-bison@001',
                            max_output_tokens=1024,
                            temperature=0,
                            top_p=0.95,
                            top_k=40)
agent_chain = chain
def chat_response(input_text):
    response = agent_chain.run(input_text)
    return response

interface = gr.Interface(fn=chat_response, inputs="text", outputs="text", 
                         description="Skill Finder Chatbot")

interface.launch(share=True)

Running on local URL:  http://127.0.0.1:7863
Running on public URL: https://82df6cc5bf8520af37.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person)-[:HAS_EDUCATION]->(e:Education)-[:DEGREE]->(d:Degree) WHERE d.name = 'Bachelor' WITH p, e, d MATCH (p)-[:HAS_SKILL]->(s:Skill) RETURN s.name AS skill
Full Context:
[]

> Finished chain.


> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person)-[:HAS_POSITION]->(pos:Position)-[:AT_COMPANY]->(c:Company) RETURN c.name AS company, count(p) AS count
Full Context:
[{'company': 'Mutual Benefit Group', 'count': 1}, {'company': 'Genworth Financial', 'count': 3}, {'company': 'Kaybamz Inc', 'count': 1}, {'company': 'Lastcard Managment Inc', 'count': 1}, {'company': 'Express Scripts', 'count': 1}, {'company': 'Apex Systems Inc', 'count': 1}, {'company': 'Charter Communications', 'count': 2}, {'company': 'LevelUp RPO', 'count': 1}, {'company': 'Ally Bank', 'count': 1}, {'company': 'ITT Corporation', 'count': 1},