<a href="https://colab.research.google.com/github/lqst/notebooks/blob/main/GenAI_Workshop_Jan_24_participants.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#pip install graphdatascience openai retry

In [11]:
from graphdatascience import GraphDataScience
import openai
import pandas as pd
from retry import retry
import re
from string import Template
import time
import glob
from timeit import default_timer as timer

Launch a Neo4j instance in Sandbox and set your OpenAI API key

[Neo4j Sandbox](https://neo4j.com/sandbox/)

In [12]:
bolt = "bolt://3.239.161.211:7687" # Put your Bolt URL between the double quotation marks
user = "neo4j"
password = "salute-tomorrows-submarining" # Put your sandbox password between the double quotation marks
auth = (user, password)
openai.api_key = "--- key goes here ----" # Replace with your OpenAI API key

Testing the pipes

In [13]:
gds = GraphDataScience(bolt, auth=auth, aura_ds=False)
print(gds.version())

2.6.0


Create a pandas dataframe with toy data

In [None]:
transaction_df = pd.DataFrame([
    {'name': 'Anna', 'merchant':'Amazon', 'amount': 100},
    {'name': 'Anna', 'merchant':'Dustin', 'amount': 5499},
    {'name': 'Anna', 'merchant':'eBay', 'amount': 220},
    {'name': 'Jonas', 'merchant':'Amazon', 'amount': 220},
    {'name': 'Jonas', 'merchant':'Dustin', 'amount': 399},
    {'name': 'Jonas', 'merchant':'eBay', 'amount': 1499},
    {'name': 'Jonas', 'merchant':'Bikes.de', 'amount': 2000},
    {'name': 'Kristof', 'merchant':'Amazon', 'amount': 423},
    {'name': 'Kristof', 'merchant':'Dustin', 'amount': 530},
    {'name': 'Kristof', 'merchant':'Hello Fresh', 'amount': 1050},
    {'name': 'Kristof', 'merchant':'Steam', 'amount': 230},
    {'name': 'Kristof', 'merchant':'Activision', 'amount': 783},
    {'name': 'Jonas', 'merchant':'Bikes.de', 'amount': 22000},
    {'name': 'Håkan', 'merchant':'Hello Fresh', 'amount': 2100},
    {'name': 'Håkan', 'merchant':'Steam', 'amount': 230},
    {'name': 'Håkan', 'merchant':'Activision', 'amount': 783},

], columns = ['name', 'merchant', 'amount'])
transaction_df.head(15)

We create a graph from the tabular data with the following relationship:


> (:Person)-[:TRANSACTED_WITH]->(:Merchant)



In [None]:
gds.run_cypher(
    """
    unwind $transactions as transaction
    merge (p:Person{name: transaction['name']})
    merge (m:Merchant{name: transaction['merchant']})
    merge (p)-[tx:TRANSACTED_WITH]->(m)
       set tx.amount = transaction['amount']
    """,
    params = { 'transactions': transaction_df.to_dict(orient='records') }
)
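Note that the toy data lists Jonas and Bikes.de twice (amounts 2000 and 22000). Because the query uses `merge` for the relationship and then `set`s the amount, both rows resolve to a single `TRANSACTED_WITH` relationship, and the later amount overwrites the earlier one. The plain-Python sketch below models that behaviour with a dictionary. `merge_transactions` is a hypothetical helper written for illustration, not part of any library.

```python
# A subset of the toy rows, in the same order as in the dataframe
transactions = [
    {"name": "Jonas", "merchant": "Bikes.de", "amount": 2000},
    {"name": "Anna", "merchant": "Amazon", "amount": 100},
    {"name": "Jonas", "merchant": "Bikes.de", "amount": 22000},
]

def merge_transactions(rows):
    """Model MERGE + SET: one relationship per (person, merchant) pair, last amount wins."""
    relationships = {}
    for row in rows:
        relationships[(row["name"], row["merchant"])] = row["amount"]
    return relationships

merged = merge_transactions(transactions)
print(merged)
```

If every transaction should be kept, the relationship would need a distinguishing key (for example a transaction id), or the amounts would have to be summed instead of overwritten.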

# **Basic navigation of graph with cypher**


Exercise

In [None]:
# What persons are in the database?
gds.run_cypher("""
  match (p:Person)
  return p.name as person_name
""").head()

In [None]:
# What merchants are persons transacting with?
gds.run_cypher("""
  match (p:Person)-[tx:TRANSACTED_WITH]->(m:Merchant)
  return m.name as name, collect(p.name) as persons
""").head()

In [None]:
# Exercise 1: Who's the biggest spender?
# Replace with your solution
gds.run_cypher("""
   // Your code goes here
""").head(10)

# **Let's find some Implicit Relationships and persist them**

We let the graph grow: relationships that are only implicit in the transaction data are computed and then written back to the graph as explicit relationships.

In [None]:
G, res = gds.graph.project(
    "shopping",                                     #  Graph name
    ["Person", "Merchant"],                         #  Node projection
    {"TRANSACTED_WITH": {                           #  Relationship projection
        "properties": "amount",
        "orientation": "REVERSE"}
    }
)

In [None]:
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")


Run the node similarity algorithm and write back the result as a relationship with a property holding the similarity score

In [None]:
write_results = gds.nodeSimilarity.write(
    G,
    writeRelationshipType= 'SIMILAR_CUSTOMERS' ,
    writeProperty= 'SIM_SCORE',
    relationshipWeightProperty= 'amount'
)

# removing symmetric relationships
gds.run_cypher("""
    MATCH (m:Merchant)-[r:SIMILAR_CUSTOMERS]->(n:Merchant)
    WHERE (n)-[:SIMILAR_CUSTOMERS]->(m) AND id(m)<id(n)
    DELETE r
""")


write_results
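Because the projection uses the `REVERSE` orientation, relationships point from merchants to persons, so node similarity compares merchants by the customers they share. With `relationshipWeightProperty` set, the GDS documentation describes the metric as a weighted Jaccard similarity. The sketch below recomputes that formula in plain Python for two merchants from the toy data. It is an illustration of the formula only, and `weighted_jaccard` is a hypothetical helper, not a GDS function.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard: sum of per-key minima divided by sum of per-key maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 0.0

# Customer -> amount, taken from the toy transactions
steam = {"Kristof": 230, "Håkan": 230}
activision = {"Kristof": 783, "Håkan": 783}

score = weighted_jaccard(steam, activision)
print(round(score, 4))
```

Both merchants have exactly the same customers, yet the score is well below 1 because the amounts differ. Dropping `relationshipWeightProperty` would make the comparison depend on shared customers only.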

# **Using LLMs to augment your Knowledge Graph**

Use LLMs to enrich the graph with descriptions of the merchants. We start by defining a function that calls the OpenAI API and returns a description of a merchant.

In [None]:
from openai import OpenAI
client = OpenAI(api_key=openai.api_key)

In [None]:
def get_merchant_desc(merchant):
    completion = client.chat.completions.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Provide a description of the merchant " + merchant + ". Use a maximum of 300 characters"}])
    return completion.choices[0].message.content
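The prompt asks for at most 300 characters, but nothing in the function enforces that limit on the reply. A minimal sketch of how the prompt construction and a length check could be separated from the API call follows. `build_merchant_prompt` and `enforce_limit` are hypothetical helpers, not part of the notebook or of the OpenAI client.

```python
MAX_CHARS = 300  # same limit the prompt asks the model to respect

def build_merchant_prompt(merchant, max_chars=MAX_CHARS):
    """Build the user message sent to the chat model."""
    return (
        f"Provide a description of the merchant {merchant}. "
        f"Use a maximum of {max_chars} characters"
    )

def enforce_limit(text, max_chars=MAX_CHARS):
    """Trim whitespace and cut over-long answers at the last full word."""
    text = text.strip()
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rsplit(" ", 1)[0]

prompt = build_merchant_prompt("Steam")
trimmed = enforce_limit("word " * 100)  # a 500-character stand-in for a long reply
print(prompt)
print(len(trimmed))
```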

*Populate* a dataframe with all the merchants in the graph

In [None]:
merchant_df = gds.run_cypher("""
  match (m:Merchant)
  return distinct m.name as merchant
""")
merchant_df.head(10)

Create a new column for the merchant description and use the previously defined function to generate descriptions

In [None]:
merchant_df['description'] = merchant_df['merchant'].apply(get_merchant_desc)
merchant_df.head(10)

We write each description to its merchant node as a string property

In [None]:
gds.run_cypher("""
  unwind $merchant_des as merchant_des
  MATCH  (m:Merchant {name: merchant_des['merchant']})
  set m.description=merchant_des['description']""",
  params = { 'merchant_des': merchant_df.to_dict(orient='records') }
)

# ***It's better with Vectors...***

In [None]:
gds.run_cypher("""CALL db.index.vector.createNodeIndex('description-embeddings', 'Merchant', 'embedding', 1536, 'cosine')""")
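The index above is declared with 1536 dimensions and the `'cosine'` similarity function. 1536 is the output size of OpenAI's `text-embedding-ada-002` model; that this is the model used for the embeddings here is an assumption. The sketch below shows what cosine similarity measures, using tiny hand-made vectors instead of real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated vectors
print(same_direction, orthogonal)
```

Only the direction of the vectors matters, not their length, which is why cosine is a common choice for comparing text embeddings.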

In [None]:
gds.run_cypher("""
  MATCH (m:Merchant)
  WITH collect(m) as nodes
  UNWIND nodes as n
  CALL apoc.ml.openai.embedding([n.description], $apiKey, {}) yield embedding
  WITH n, embedding
  MATCH (n)
  CALL db.create.setVectorProperty(n, 'embedding', embedding)
  YIELD node
  RETURN COUNT(node)""",
  params = {'apiKey': openai.api_key}
)

Now, let's have a look at the browser.

# **Show Time! RAG with vector search**

In [None]:
search_phrase = input("Search phrase:")

In [None]:

gds.run_cypher("""
  CALL apoc.ml.openai.embedding([$search_phrase], $apiKey, {})
  YIELD embedding
  CALL db.index.vector.queryNodes('description-embeddings', 3, embedding)
  YIELD node, score
  MATCH (node)<-[:TRANSACTED_WITH]-(p:Person)
  RETURN node.name as merchant, node.description as merchantDescription, collect(p.name) as customers, score""",
  params = {'apiKey': openai.api_key, 'search_phrase': search_phrase}
)
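The query above covers the retrieval half of RAG. A minimal sketch of the augmentation half, turning the retrieved rows into context for a chat prompt, could look like the following. `build_rag_prompt` is a hypothetical helper, and the rows are placeholders with the same column names as the query result; real rows could be obtained with `.to_dict(orient='records')` on the returned dataframe.

```python
def build_rag_prompt(question, rows):
    """Format retrieved merchants as context and append the user's question."""
    context_lines = [
        f"- {row['merchant']}: {row['merchantDescription']} "
        f"(customers: {', '.join(row['customers'])})"
        for row in rows
    ]
    return (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}"
    )

# Placeholder rows; the descriptions would come from the graph
rows = [
    {"merchant": "Steam", "merchantDescription": "<description from the graph>",
     "customers": ["Kristof", "Håkan"], "score": 0.0},
]

rag_prompt = build_rag_prompt("Where can I buy video games?", rows)
print(rag_prompt)
```

The resulting string could then be sent as the user message in the same kind of chat completion call used for the merchant descriptions.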

In [None]:
G.drop()
gds.run_cypher("""
  MATCH (n)
  DETACH DELETE n"""
)
gds.run_cypher("""DROP INDEX `description-embeddings`""")

# ***Named Entity Recognition with LLMs***

In [14]:
def clean_text(text):
  clean = "\n".join([row for row in text.split("\n")])
  clean = re.sub(r'\(fig[^)]*\)', '', clean, flags=re.IGNORECASE)
  return clean
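The regular expression in `clean_text` removes parenthesised figure references such as `(Fig.1)`. The short example below shows its effect on one sentence from the article, including the stray space it leaves before the punctuation. The extra tidy-up step is an addition for illustration and is not part of the original function.

```python
import re

# Same pattern as in clean_text: "(fig" up to the next ")", case-insensitive
FIGURE_REF = re.compile(r'\(fig[^)]*\)', flags=re.IGNORECASE)

sample = "they formed small aggregations around the bronchioles (Fig.1)."
stripped = FIGURE_REF.sub('', sample)
print(repr(stripped))

# Optional tidy-up (not in the original): drop whitespace left before punctuation
tidy = re.sub(r'\s+([.,;])', r'\1', stripped)
print(repr(tidy))
```

Other parentheses, such as `(Dako N1577, Clone KPI)` in the article, are left untouched because they do not start with `fig`.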

In [15]:
article_txt = """The patient was a 34-yr-old man who presented with complaints of fever and a chronic cough.
He was a smoker and had a history of pulmonary tuberculosis that had been treated and cured.
A computed tomographic (CT) scan revealed multiple tiny nodules in both lungs.
A thoracoscopic lung biopsy was taken from the right upper lobe.
The microscopic examination revealed a typical LCH.
The tumor cells had vesicular and grooved nuclei, and they formed small aggregations around the bronchioles (Fig.1).
The tumor cells were strongly positive for S-100 protein, vimentin, CD68 and CD1a.
There were infiltrations of lymphocytes and eosinophils around the tumor cells.
With performing additional radiologic examinations, no other organs were thought to be involved.
He quit smoking, but he received no other specific treatment.
He was well for the following one year.
After this, a follow-up CT scan was performed and it showed a 4 cm-sized mass in the left lower lobe, in addition to the multiple tiny nodules in both lungs (Fig.2).
A needle biopsy specimen revealed the possibility of a sarcoma; therefore, a lobectomy was performed.
Grossly, a 4 cm-sized poorly-circumscribed lobulated gray-white mass was found (Fig.3), and there were a few small satellite nodules around the main mass.
Microscopically, the tumor cells were aggregated in large sheets and they showed an infiltrative growth.
The cytologic features of some of the tumor cells were similar to those seen in a typical LCH.
However, many tumor cells showed overtly malignant cytologic features such as pleomorphic/hyperchromatic nuclei and prominent nucleoli (Fig.4), and multinucleated tumor giant cells were also found.
There were numerous mitotic figures ranging from 30 to 60 per 10 high power fields, and some of them were abnormal.
A few foci of typical LCH remained around the main tumor mass.
Immunohistochemically, the tumor cells were strongly positive for S-100 protein (Fig.5) and vimentin; they were also positive for CD68 (Dako N1577, Clone KPI), and focally positive for CD1a (Fig.6), and they were negative for cytokeratin, epithelial membrane antigen, CD3, CD20 and HMB45.
The ultrastructural analysis failed to demonstrate any Birbeck granules in the cytoplasm of the tumor cells.
Now, at five months after lobectomy, the patient is doing well with no significant change in the radiologic findings.
"""

In [17]:
@retry(tries=2, delay=5)
def process_gpt(system,
                prompt):

    # openai>=1.0 removed openai.ChatCompletion; reuse the `client` created earlier
    completion = client.chat.completions.create(
        model="gpt-4",
        max_tokens=4000,
        # Try to be as deterministic as possible
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]
    )
    nlp_results = completion.choices[0].message.content
    return nlp_results
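In the `retry` package, `tries` is the maximum number of attempts and `delay` is the pause between them in seconds, so `@retry(tries=2, delay=5)` calls the function at most twice. The stdlib-only sketch below mimics that behaviour to make it concrete. `simple_retry` is a hypothetical stand-in, not the package's implementation.

```python
import time
from functools import wraps

def simple_retry(tries=2, delay=5):
    """Call the wrapped function up to `tries` times, sleeping `delay` seconds between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
        return wrapper
    return decorator

calls = []

@simple_retry(tries=2, delay=0)  # delay=0 keeps the demo instant
def flaky():
    calls.append(1)
    if len(calls) < 2:
        raise RuntimeError("transient failure")
    return "ok"

result = flaky()
print(result, len(calls))
```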

In [18]:
prompt1="""From the Case sheet for a patient below, extract the following Entities & relationships described in the mentioned format
0. ALWAYS FINISH THE OUTPUT. Never send partial responses
1. First, look for these Entity types in the text and generate as comma-separated format similar to entity type.
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. Do not create new entity types that aren't mentioned below. Document must be summarized and stored inside Case entity under `summary` property. You will have to generate as many entities as needed as per the types below:
    Entity Types:
    label:'Case',id:string,summary:string //Case
    label:'Person',id:string,age:string,location:string,gender:string //Patient mentioned in the case
    label:'Symptom',id:string,description:string //Symptom Entity; `id` property is the name of the symptom, in lowercase & camel-case & should always start with an alphabet
    label:'Disease',id:string,name:string //Disease diagnosed now or previously as per the Case sheet; `id` property is the name of the disease, in lowercase & camel-case & should always start with an alphabet
    label:'BodySystem',id:string,name:string //Body Part affected. Eg: Chest, Lungs; id property is the name of the part, in lowercase & camel-case & should always start with an alphabet
    label:'Diagnosis',id:string,name:string,description:string,when:string //Diagnostic procedure conducted; `id` property is the summary of the Diagnosis, in lowercase & camel-case & should always start with an alphabet
    label:'Biological',id:string,name:string,description:string //Results identified from Diagnosis; `id` property is the summary of the Biological, in lowercase & camel-case & should always start with an alphabet

2. Next, generate the relationships as triples of head, relationship and tail. To refer to the head and tail entities, use their respective `id` property. Relationship properties should be mentioned within brackets as comma-separated values. They should follow the relationship types below. You will have to generate as many relationships as needed as defined below:
    Relationship types:
    case|FOR|person
    person|HAS_SYMPTOM{when:string,frequency:string,span:string}|symptom //the properties inside HAS_SYMPTOM gets populated from the Case sheet
    person|HAS_DISEASE{when:string}|disease //the properties inside HAS_DISEASE gets populated from the Case sheet
    symptom|SEEN_ON|chest
    disease|AFFECTS|heart
    person|HAS_DIAGNOSIS|diagnosis
    diagnosis|SHOWED|biological

The output should look like :
{
    "entities": [{"label":"Case","id":string,"summary":string}],
    "relationships": ["disease|AFFECTS|heart"]
}

Case Sheet:
$ctext
"""

In [19]:
%%time
import json

def run_completion(prompt, results, ctext):
    system = "You are a helpful Medical Case Sheet expert who extracts relevant information and store them on a Neo4j Knowledge Graph"
    try:
        pr = Template(prompt).substitute(ctext=ctext)
        res = process_gpt(system, pr)
        # The model is asked for JSON; parsing fails if the reply is truncated or malformed
        results.append(json.loads(res))
    except Exception as e:
        print(e)
    # Always return the list so a failed call does not replace it with None
    return results

prompts = [prompt1]
results = []
for p in prompts:
  results = run_completion(p, results, clean_text(article_txt))


CPU times: user 24.8 ms, sys: 8.62 ms, total: 33.4 ms
Wall time: 55.8 s
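Before loading the model's reply into the graph, it can help to check that it has the shape the prompt asked for: every entity has a `label` and a unique `id`, and every relationship is a three-part triple whose head and tail refer to known ids. A minimal sketch of such a check follows. `validate_extraction` is a hypothetical helper and the sample data is made up for illustration.

```python
def validate_extraction(result):
    """Return a list of problems found in one parsed LLM response."""
    problems = []
    ids = set()
    for entity in result.get("entities", []):
        if "label" not in entity or "id" not in entity:
            problems.append(f"entity missing label or id: {entity}")
            continue
        if entity["id"] in ids:
            problems.append(f"duplicate id: {entity['id']}")
        ids.add(entity["id"])
    for triple in result.get("relationships", []):
        parts = triple.split("|")
        if len(parts) != 3:
            problems.append(f"malformed triple: {triple}")
            continue
        head, _, tail = parts
        for ref in (head, tail):
            if ref not in ids:
                problems.append(f"unknown id '{ref}' in: {triple}")
    return problems

# Made-up sample in the output format requested by the prompt
sample = {
    "entities": [
        {"label": "Case", "id": "case1", "summary": "..."},
        {"label": "Person", "id": "person1"},
    ],
    "relationships": ["case1|FOR|person1", "person1|HAS_DISEASE|lch"],
}

problems = validate_extraction(sample)
print(problems)
```

With the real data, the call would be `validate_extraction(results[0])`.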


In [20]:
results

[{'entities': [{'label': 'Case',
    'id': 'case1',
    'summary': 'The patient was a 34-yr-old man who presented with complaints of fever and a chronic cough. He was a smoker and had a history of pulmonary tuberculosis that had been treated and cured. A computed tomographic (CT) scan revealed multiple tiny nodules in both lungs. A thoracoscopic lung biopsy was taken from the right upper lobe. The microscopic examination revealed a typical LCH. The tumor cells had vesicular and grooved nuclei, and they formed small aggregations around the bronchioles. The tumor cells were strongly positive for S-100 protein, vimentin, CD68 and CD1a. There were infiltrations of lymphocytes and eosinophils around the tumor cells. With performing additional radiologic examinations, no other organs were thought to be involved. He quit smoking, but he received no other specific treatment. He was well for the following one year. After this, a follow-up CT scan was performed and it showed a 4 cm-sized mass in

In [39]:
gds.run_cypher('CREATE CONSTRAINT unique_entity_id IF NOT EXISTS FOR (n:Entity) REQUIRE n.id IS UNIQUE')

In [55]:
gds.run_cypher(""" 
    unwind $rows as row
    merge (n:Entity{id: row.id})
    set n += apoc.map.removeKeys(row, ['id','label'])
    with n, row
    call apoc.create.addLabels(id(n),[row.label]) yield node
    return count(*) as nodes_processed 
""",
params={'rows': results[0]['entities']})

   nodes_processed
0               11


In [57]:
gds.run_cypher(""" 
    unwind $rows as row
    with split(row, '|') as r
    match (a:Entity{id: r[0]}), (b:Entity{id: r[2]}) 
    call apoc.merge.relationship(a, r[1], {}, {}, b, {}) yield rel
    return count(*) as rels_processed 
""",
params={'rows': results[0]['relationships']})

   rels_processed
0              15
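The Cypher above splits each triple on `|` and uses the middle part verbatim as the relationship type. If the model follows the prompt's format and returns something like `HAS_DISEASE{when:...}`, the braces and properties would end up in the type name. The sketch below separates the type from its properties in Python before loading. `parse_triple` is a hypothetical helper, and it assumes property values contain no commas or closing braces.

```python
import re

# head | TYPE, optionally followed by {key:value,...} | tail
TRIPLE = re.compile(
    r'^(?P<head>[^|]+)\|(?P<type>\w+)(?:\{(?P<props>[^}]*)\})?\|(?P<tail>[^|]+)$'
)

def parse_triple(triple):
    """Split 'head|TYPE{k:v,...}|tail' into (head, type, properties, tail)."""
    match = TRIPLE.match(triple.strip())
    if match is None:
        raise ValueError(f"not a head|TYPE{{props}}|tail triple: {triple!r}")
    props = {}
    if match.group("props"):
        for pair in match.group("props").split(","):
            key, _, value = pair.partition(":")
            props[key.strip()] = value.strip().strip("'\"")
    return match.group("head"), match.group("type"), props, match.group("tail")

with_props = parse_triple("person1|HAS_DISEASE{when:previously}|pulmonaryTuberculosis")
without_props = parse_triple("case1|FOR|person1")
print(with_props)
print(without_props)
```

The parsed parts could then be sent as query parameters, with the properties passed to `apoc.merge.relationship` instead of the empty maps used above.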


In [54]:
# gds.run_cypher("MATCH (n) DETACH DELETE n")