### Introduction

Humans reason with something like a "mental model" of the world: a network of concepts and the relationships between them. This network is a way to represent knowledge and to reason about it.
Rather than seeing the whole network at once, we explore it eagerly, moving from one node to the next along the links between them.
Querying the network this way is simpler and more direct than trying to formulate one specific query that covers the whole network up front.
Applying the same idea to a knowledge graph is intended to make our LLM-generated Cypher queries more accurate and reliable.

#### Initial Imports

In [1]:
from pprint import pprint

#### Loading data into Neo4j

In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('documents/Adyen_ A First Principles Payment Platform.pdf')
documents = loader.load()

In [3]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    base_url="http://localhost:11434",
    model="llama3:instruct"
)

semantic_chunker = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
# Re-chunk the text of every loaded page
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])

pprint(semantic_chunks[:5])

[Document(page_content="Introduction\n[00:01:59]Zack: This is Zack Fuss and today we are breaking  down European  based pay business , Adyen. Adyen was found \nin Amsterdam  in 2006 by a group of payments  entrepreneurs  who had already built and sold a business  in the space. Adyen \nwas their chance to start a fresh and build a modern solution  to displace  the patchwork  legacy system that merchants  were \nbeing forced to use. To breakdown  the business , I'm joined by Michael Willar, a portfolio  manager  at Stenham  Asset \nManagement . Our discussion  covers Adyen's single platform  solution  in detail, the driving force behind their track record of \nprofitable  growth, and why payments  isn't a winner take all market. Please enjoy this breakdown  of Adyen. A Bird's Eye View of Adyen and Payments\n[00:02:42]Zack: So today we'll be breaking  down Adyen, a large payment  processing  business . Despite its nearly $70 billion \nUS market cap, it's a business  that's relatively  unk

##### Extracting entities from each chunk

In [4]:
import spacy

# Load English tokenizer, POS tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_lg")

# Process whole documents
text = semantic_chunks[0].page_content

doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)



Noun phrases: ['Introduction', 'This', 'Zack Fuss', 'we', 'European  based pay business', 'Adyen', 'Adyen', 'Amsterdam', 'a group', 'payments', ' entrepreneurs', 'who', 'a business', 'the space', 'their chance', 'a modern solution', 'the patchwork', ' legacy system', 'that', 'the business', 'I', 'Michael Willar', 'a portfolio  manager', 'Stenham  Asset \nManagement', 'Our discussion', "Adyen's single platform  solution", 'detail', 'the driving force', 'their track record', '\nprofitable  growth', 'why payments', 'a winner', 'all market', 'this breakdown', 'Adyen', "A Bird's Eye View", 'Adyen', 'Payments', '00:02:42]Zack', 'we', 'Adyen', 'processing  business', 'its nearly $70 billion \nUS market cap', 'it', 'a business', 'that', 'American  investors']
Verbs: ['break', 'base', 'find', 'build', 'sell', 'start', 'build', 'displace', 'force', 'use', 'breakdown', 'join', 'cover', 'drive', 'take', 'enjoy', 'break']
Zack Fuss PERSON
today DATE
European NORP
Amsterdam GPE
2006 DATE
Michael Wil

In [345]:
from langchain_community.llms import Ollama

# Use LLM to recursively extract entities from each chunk
llm = Ollama(model="llama3:instruct", temperature=0, base_url="http://localhost:11434", verbose=False)

In [210]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

entities_extraction_prompt = PromptTemplate(
    template="""
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Extract all entities from the following text. 
    Present your answers in the format ie "ENTITY_1;ENTITY_2;ENTITY_3"
    DO NOT give any preamble or explanation.
    Some example entities: 'john;apple;new_york'
    <|start_header_id|>user<|end_header_id|>
    Text: {text}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["text"],
)

entities_extraction_pipeline = entities_extraction_prompt | llm | StrOutputParser()

In [211]:
rs_extraction_prompt = PromptTemplate(
    template="""
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    Extract all relationships between the given entities from the given text.
    Present your answers in the format ie "ENTITY_1|RELATIONSHIP_1|ENTITY_2;ENTITY_2|RELATIONSHIP_2|ENTITY_3"
    DO NOT give any preamble or explanation.
    Some example relationships:
    'sarah_johnson|HAS_SKILL|machine_learning;sarah_johnson|HAS_SKILL|data_analytics'

    <|start_header_id|>user<|end_header_id|>
    Text: {text}
    Identified entities: {entities}
    Already identified relationship terms: {existing_rs_terms}
    IMPORTANT: You are NOT to extract any entities. Use the given entities to extract relationships.
    IMPORTANT: DO NOT USE ANY ENTITIES THAT ARE NOT IN THE "Identified entities" LIST.
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["text", "entities", "existing_rs_terms"],
)

rs_extraction_pipeline = rs_extraction_prompt | llm | StrOutputParser()

In [212]:
def get_entities_from_text(text):
    entities = entities_extraction_pipeline.invoke({"text": text})
    return entities

In [213]:
def get_rs_from_text_and_entities(text, entities, existing_rs_terms):
    rs_from_chunk = rs_extraction_pipeline.invoke({"text": text, "entities": entities, "existing_rs_terms": existing_rs_terms})

    return rs_from_chunk

In [214]:
import re

# TODO: Some entities in the relationships may still be missing from the entities list. We will replace the entities set with the entities extracted from the relationships.
def get_entities_and_rs_from_chunks(chunks: list):
    entities = set()
    rs = set()
    rs_terms = set()

    for chunk in chunks:
        chunk_text = chunk.page_content

        # Extract ENTITIES from chunk
        unformatted_entities = get_entities_from_text(chunk_text)
        # Remove leading and trailing whitespaces and Format entities
        # curr_entities_list = [entity.strip() for entity in unformatted_entities.split(";")]
        curr_entities_list = [re.sub(r'\W+', '', entity.strip().replace(" ", "_")) for entity in unformatted_entities.split(";")]
        # print("Pre-cleaned entities: ", curr_entities_list)
        # Only retain alphanumeric entities
        cleaned_entities_list = [''.join(re.findall(r'\w+', entity)).lower() for entity in curr_entities_list]
        # print("Cleaned entities: ", cleaned_entities_list)
        # Update entities set
        entities.update(cleaned_entities_list)

        # Extract RELATIONSHIPS from chunk
        unformatted_rs = get_rs_from_text_and_entities(chunk_text, cleaned_entities_list, rs_terms)
        # Remove leading and trailing whitespaces and Format relationships
        curr_rs_list = [rs.strip() for rs in unformatted_rs.split(";")]
        # Update relationships set
        rs.update(curr_rs_list)

        # Extract RELATIONSHIP TERMS from chunk
        curr_rs_terms = []
        for relationship in curr_rs_list:
            parts = relationship.split("|")
            # Skip malformed items that are not "ENTITY|RELATIONSHIP|ENTITY"
            if len(parts) == 3:
                curr_rs_terms.append(parts[1].strip())
        # Update relationship terms set
        rs_terms.update(curr_rs_terms)

    return (entities, rs, rs_terms)


In [215]:
_, relationships, _  = get_entities_and_rs_from_chunks(semantic_chunks[:2])

In [216]:
print(relationships)

{'Micheal_willar|WORKS_AT|Stenham_Asset_Management', 'Adyen|FOUND_IN|Amsterdam', 'adyen|OPERATES_IN|apac', 'adyen|HAS_BUSINESS|nike', 'adyen|IS_HEADQUARTERED_IN|amsterdam', 'adyen|OPERATES_IN|latin_america', 'adyen|HAS_BUSINESS|overmatch', 'adyen|OPERATES_IN|europe', 'adyen|OPERATES_IN|us', 'Zack_fuss|IS_JOINED_BY|Micheal_willar', 'adyen|HAS_BUSINESS|mcdonalds', 'adyen|HAS_BUSINESS|microsoft'}
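
The set above mixes casing (`Adyen` and `adyen`, `Micheal_willar`), and the LLM can also return fragments that are not valid `ENTITY|RELATIONSHIP|ENTITY` items. A minimal sketch of a parser that normalizes names and silently skips malformed items is shown below. The helper names `normalize_name` and `parse_relationships` are hypothetical and are not part of the notebook.

```python
import re
from typing import Set, Tuple


def normalize_name(name: str) -> str:
    """Lowercase, turn whitespace into underscores, drop other non-word characters."""
    name = re.sub(r"\s+", "_", name.strip().lower())
    return re.sub(r"\W+", "", name)


def parse_relationships(raw: str) -> Set[Tuple[str, str, str]]:
    """Parse 'A|REL|B;C|REL|D' into normalized (source, REL, target) triples."""
    triples = set()
    for item in raw.split(";"):
        parts = [p.strip() for p in item.split("|")]
        if len(parts) != 3 or not all(parts):
            continue  # skip malformed fragments instead of raising
        src, rel, tgt = parts
        triples.add((normalize_name(src), normalize_name(rel).upper(), normalize_name(tgt)))
    return triples


# Hypothetical raw LLM output: two spellings of one fact plus a malformed fragment
raw = "Micheal_willar|WORKS_AT|Stenham_Asset_Management;Adyen|FOUND_IN|Amsterdam;adyen|FOUND_IN|amsterdam;oops"
print(sorted(parse_relationships(raw)))
```

Normalizing before the triples go into a set means the two spellings of the same fact collapse into one entry, so the later lowercasing step has less to clean up.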


In [217]:
r_statements = []

for rs in relationships:
    src_id, rs_type, tgt_id = rs.split("|")
    src_id = src_id.replace("-", "").lower()
    print("src_id", src_id, end = " | ") 
    print("rs_type", rs_type, end = " | ")
    tgt_id = tgt_id.replace("-", "").lower()
    print("tgt_id", tgt_id)
    print()
            
    cypher = f'MERGE (a:Recursive_Test {{id: "{src_id}"}}) MERGE (b:Recursive_Test {{id: "{tgt_id}"}}) MERGE (a)-[:{rs_type}]->(b)'
    r_statements.append(cypher)

src_id micheal_willar | rs_type WORKS_AT | tgt_id stenham_asset_management

src_id adyen | rs_type FOUND_IN | tgt_id amsterdam

src_id adyen | rs_type OPERATES_IN | tgt_id apac

src_id adyen | rs_type HAS_BUSINESS | tgt_id nike

src_id adyen | rs_type IS_HEADQUARTERED_IN | tgt_id amsterdam

src_id adyen | rs_type OPERATES_IN | tgt_id latin_america

src_id adyen | rs_type HAS_BUSINESS | tgt_id overmatch

src_id adyen | rs_type OPERATES_IN | tgt_id europe

src_id adyen | rs_type OPERATES_IN | tgt_id us

src_id zack_fuss | rs_type IS_JOINED_BY | tgt_id micheal_willar

src_id adyen | rs_type HAS_BUSINESS | tgt_id mcdonalds

src_id adyen | rs_type HAS_BUSINESS | tgt_id microsoft
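
Because the statements above are built with f-strings from LLM output, an unexpected quote or bracket could break the query or change its meaning. The sketch below shows one possible alternative: node ids are sent as query parameters, and the relationship type (which Cypher does not allow as a parameter) is checked against a conservative pattern before it is interpolated. The helper name `build_merge` and the pattern are assumptions, not code from the notebook.

```python
import re

# Assumed convention: relationship types are upper snake case, e.g. OPERATES_IN
REL_TYPE_PATTERN = re.compile(r"^[A-Z_][A-Z0-9_]*$")


def build_merge(src_id: str, rs_type: str, tgt_id: str, label: str = "Recursive_Test"):
    """Return (cypher, params); only the validated relationship type is interpolated."""
    if not REL_TYPE_PATTERN.match(rs_type):
        raise ValueError(f"Unsafe relationship type: {rs_type!r}")
    cypher = (
        f"MERGE (a:{label} {{id: $src_id}}) "
        f"MERGE (b:{label} {{id: $tgt_id}}) "
        f"MERGE (a)-[:{rs_type}]->(b)"
    )
    return cypher, {"src_id": src_id, "tgt_id": tgt_id}


cypher, params = build_merge("adyen", "OPERATES_IN", "europe")
print(cypher)
print(params)
```

With a recent 5.x Neo4j Python driver, the pair could then be executed as `gds.execute_query(cypher, parameters_=params)`; check the driver version in use before relying on that keyword.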



In [218]:
from neo4j import GraphDatabase

url = "bolt://localhost:7687"
username = "neo4j"
password = "password"
gds = GraphDatabase.driver(url, auth=(username, password))

In [219]:
# Generate and execute cypher statements
cypher_statements = r_statements
for i, stmt in enumerate(cypher_statements):
    print(f"Executing cypher statement {i+1} of {len(cypher_statements)}")
    try:
        gds.execute_query(stmt)
    except Exception as e:
        with open("failed_statements.txt", "a") as f:  # append so earlier failures are not overwritten
            f.write(f"{stmt} - Exception: {e}\n")

Executing cypher statement 1 of 12
Executing cypher statement 2 of 12
Executing cypher statement 3 of 12
Executing cypher statement 4 of 12
Executing cypher statement 5 of 12
Executing cypher statement 6 of 12
Executing cypher statement 7 of 12
Executing cypher statement 8 of 12
Executing cypher statement 9 of 12
Executing cypher statement 10 of 12
Executing cypher statement 11 of 12
Executing cypher statement 12 of 12


In [235]:
from langchain_community.vectorstores import Neo4jVector

existing_graph = Neo4jVector.from_existing_graph(
    embedding=embeddings,
    url=url,
    username=username,
    password=password,
    node_label="Recursive_Test",
    text_node_properties=["id"],
    embedding_node_property="embedding",
)



![image](images/image.png)

In [249]:
root_node = existing_graph.similarity_search("the company that process payments", k=1)[0]

In [254]:
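# page_content looks like "id: adyen"; drop the "id:" prefix and surrounding whitespace
root_entity = root_node.page_content.strip().removeprefix("id:").strip()
print("Root entity:", root_entity)

Root entity: adyen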


In [425]:
# Extract all immediate entities and relationships from the root node ("adyen" here)
with gds.session() as session:
    curr_knowledge = []
    # Pass the id as a query parameter rather than hardcoding it in the Cypher string
    result = session.run(
        "MATCH (a:Recursive_Test {id: $id})-[r]-(b) RETURN a, r, b",
        {"id": root_entity.strip()},
    )
    for record in result:
        curr_knowledge.append(f'{record["a"]["id"]}-{record["r"].type}-{record["b"]["id"]}')

In [426]:
curr_knowledge

['adyen-HAS_BUSINESS-microsoft',
 'adyen-HAS_BUSINESS-mcdonalds',
 'adyen-OPERATES_IN-us',
 'adyen-OPERATES_IN-europe',
 'adyen-HAS_BUSINESS-overmatch',
 'adyen-OPERATES_IN-latin_america',
 'adyen-IS_HEADQUARTERED_IN-amsterdam',
 'adyen-HAS_BUSINESS-nike',
 'adyen-OPERATES_IN-apac',
 'adyen-FOUND_IN-amsterdam']

In [427]:
# query = "WHAT OTHER COMPANIES are located in the same place as adyen's headquarters?"
# query = "Where is Adyen headquartered?"
query = "What does Adyen do?"

In [428]:
statement = llm.invoke(f"""
    You are trying to see if a given question can be answered with the current knowledge and answer it if possible.
    You are given this knowledge: {curr_knowledge}.
    The knowledge is formatted as "currentNode-HAS_A_RELATIONSHIP-possibleNextNode".
    You are currently exploring all the relationships of the "currentNode" in the knowledge.
    You are asked the question: "{query}".
    Assume that you are traversing a knowledge graph and the knowledge given earlier is NOT all the information you have.
    If this current node doesn't help us answer the question, we want to determine which node to explore next. 

    The task:
    If you can answer the question with the current knowledge, provide the answer and explain how your answer answers the question.
    If you cannot answer the question with the current knowledge, either say "give up" OR specify the relationship that will help you answer the question.
    
    In addition, when you cannot answer the question, give the relationship that you want to explore next in a new line without any preamble or explanation.
    """)

In [429]:
print(statement)

I'll start by exploring the relationships of Adyen.

The current knowledge is: ['adyen-HAS_BUSINESS-microsoft', 'adyen-HAS_BUSINESS-mcdonalds', 'adyen-OPERATES_IN-us', 'adyen-OPERATES_IN-europe', 'adyen-HAS_BUSINESS-overmatch', 'adyen-OPERATES_IN-latin_america', 'adyen-IS_HEADQUARTERED_IN-amsterdam', 'adyen-HAS_BUSINESS-nike', 'adyen-OPERATES_IN-apac', 'adyen-FOUND_IN-amsterdam']

Since Adyen is the current node, I'll look at its relationships.

The question is: "What does Adyen do?"

I can answer this question with the current knowledge. The answer is that Adyen has businesses in various companies such as Microsoft, McDonald's, Overmatch, and Nike, which suggests that it provides some kind of services or solutions to these companies. Additionally, its presence in different regions like Europe, Latin America, APAC, and US implies that it operates globally.

So, the answer is: Adyen has businesses in various companies and operates globally.

adyen-OPERATES_IN-apac


In [447]:
feedback = (llm.invoke(f"You are a critic trying to detect gaps in someone's logic in their statement. \
                       The question posed to the person is: ({query}). The knowledge they and you have is limited to: ({curr_knowledge}). \
                        Given the statement below, if the statement is logically sound, respond with 'good'. \
                        If the statement is not logically sound or doesn't fully answer the question and requires further exploration of extensions to the knowledge, \
                            explain why. Suggest DIRECTLY how to correct the answer. Statement: {statement}"))

In [448]:
print(feedback)

Not quite "good" yet!

The statement attempts to answer the question "What does Adyen do?" by highlighting its relationships with other entities. However, it doesn't fully address the question. The statement only mentions that Adyen has businesses in various companies and operates globally, but it doesn't explicitly state what kind of services or solutions Adyen provides.

To make the answer more comprehensive and logically sound, I would suggest adding a direct connection between Adyen's relationships with these companies and its services or solutions. For example:

"I can answer this question with the current knowledge. The answer is that Adyen has businesses in various companies such as Microsoft, McDonald's, Overmatch, and Nike, which suggests that it provides payment processing or e-commerce solutions to these companies. Additionally, its presence in different regions like Europe, Latin America, APAC, and US implies that it operates globally."

By explicitly stating the services o

In [457]:
reattempted_statement = llm.invoke(f"""
    You were given the task as follows:
        ***TASK START***
        You are given this knowledge: {curr_knowledge}.
        The knowledge is formatted as "currentNode-HAS_A_RELATIONSHIP-possibleNextNode".
        You are currently exploring all the relationships of the "currentNode" in the knowledge.
        You are asked the question: "{query}".
        Assume that you are traversing a knowledge graph and the knowledge given earlier is NOT all the information you have.
        If this current node doesn't help us answer the question, we want to determine which node to explore next. 

        The task:
        If you can answer the question with the current knowledge, provide the answer.
        If you cannot answer the question with the current knowledge, either say "give up" OR specify the relationship that will help you answer the question.
        
        ***TASK END***

    You gave an answer: {statement}.
    BUT you were given the feedback: {feedback}.

    Correct your answer based on the feedback given. 

    Format the LAST LINE of your answer as follows:
    If you CAN answer the question, give in the format "ANSWER: <your answer>".
    If you CANNOT answer the question, give in the format "RELATIONSHIP: <relationship to explore next> or GIVE UP".
    """)

In [458]:
print(reattempted_statement)

I'll start by exploring the relationships of Adyen.

The current knowledge is: ['adyen-HAS_BUSINESS-microsoft', 'adyen-HAS_BUSINESS-mcdonalds', 'adyen-OPERATES_IN-us', 'adyen-OPERATES_IN-europe', 'adyen-HAS_BUSINESS-overmatch', 'adyen-OPERATES_IN-latin_america', 'adyen-IS_HEADQUARTERED_IN-amsterdam', 'adyen-HAS_BUSINESS-nike', 'adyen-OPERATES_IN-apac', 'adyen-FOUND_IN-amsterdam']

Since Adyen is the current node, I'll look at its relationships.

The question is: "What does Adyen do?"

I can answer this question with the current knowledge. The answer is that Adyen has businesses in various companies such as Microsoft, McDonald's, Overmatch, and Nike, which suggests that it provides some kind of services or solutions to these companies. Additionally, its presence in different regions like Europe, Latin America, APAC, and US implies that it operates globally.

I can provide more context by explicitly stating the services or solutions Adyen provides.

ANSWER: Adyen has businesses in variou

In [446]:
# Explore the next node based on the feedback
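agent_response = reattempted_statement.strip().split("\n")[-1].lower().strip()
# Split on the first colon only, so answers that contain ":" are not truncated
header, _, agent_verdict = agent_response.partition(":")
header, agent_verdict = header.strip(), agent_verdict.strip()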

if header == "relationship":
    if agent_verdict == "give up":
        print("Agent has given up")
    else:
        print("Agent wants to explore the relationship:", agent_verdict)
        entity_to_explore = agent_verdict.split("-")[-1]
        print("Entity to explore:", entity_to_explore)
        # Extract all immediate entities and relationships from the chosen node
        with gds.session() as session:
            curr_knowledge = []
            # Pass the id as a query parameter instead of concatenating it into the Cypher string
            query_statement = "MATCH (a:Recursive_Test {id: $id})-[r]-(b) RETURN a, r, b"
            result = session.run(query_statement, {"id": entity_to_explore})
            for record in result:
                curr_knowledge.append(f'{record["a"]["id"]}-{record["r"].type}-{record["b"]["id"]}')

        print(curr_knowledge)
else:
    print("Answer:", agent_verdict)

Answer: adyen has businesses in various companies, such as microsoft, mcdonald's, overmatch, and nike.
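
The last-line convention (`ANSWER: ...` or `RELATIONSHIP: ...`) is easy for the model to break, for example by adding a trailing blank line or omitting the marker. A small sketch of a more defensive parser follows. The function name `parse_agent_reply` and the `'unparsed'` fallback are assumptions added for illustration.

```python
def parse_agent_reply(reply: str):
    """Return (kind, payload) from the last non-empty line of an LLM reply.

    kind is 'answer', 'relationship', 'give_up', or 'unparsed'.
    """
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    if not lines:
        return "unparsed", ""
    # partition splits on the first colon only, so colons in the answer survive
    header, sep, body = lines[-1].partition(":")
    header, body = header.strip().lower(), body.strip()
    if not sep:
        return "unparsed", lines[-1]
    if header == "answer":
        return "answer", body
    if header == "relationship":
        if body.lower() == "give up":
            return "give_up", ""
        return "relationship", body
    return "unparsed", lines[-1]


print(parse_agent_reply("Some reasoning...\nRELATIONSHIP: adyen-OPERATES_IN-apac\n"))
```

Returning an explicit `'unparsed'` kind lets the caller re-prompt the model instead of crashing on an `IndexError`.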


!! The idea from here is to recurse: explore the graph one node at a time, taking notes along the way, and terminate ONLY when we arrive back at the starting node.
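
A minimal sketch of that loop is given below, using an in-memory dictionary in place of Neo4j and a plain function in place of the LLM. Everything here is an assumption for illustration: the names `explore`, `neighbours` and `toy_decide`, the `decide` callback contract, and the extra stop conditions (an answer was found, the agent gave up, or a `max_steps` budget ran out) that are added on top of the "back at the starting node" rule.

```python
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, str]]]  # node -> [(relationship, neighbour)]


def neighbours(graph: Graph, node: str) -> List[str]:
    """Format a node's edges like the curr_knowledge strings in this notebook."""
    return [f"{node}-{rel}-{other}" for rel, other in graph.get(node, [])]


def explore(graph: Graph, start: str,
            decide: Callable[[List[str], List[str]], Tuple[str, str]],
            max_steps: int = 10) -> Tuple[Optional[str], List[str]]:
    """Walk node by node, collecting notes, until an answer or a stop condition."""
    notes: List[str] = []
    current = start
    for _ in range(max_steps):
        knowledge = neighbours(graph, current)
        notes.extend(k for k in knowledge if k not in notes)
        kind, payload = decide(knowledge, notes)
        if kind == "answer":
            return payload, notes
        if kind == "give_up" or payload == start:
            break  # gave up, or looped back to the starting node
        current = payload
    return None, notes


def toy_decide(knowledge: List[str], notes: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: answer on a WORKS_AT edge, else follow the first edge."""
    for fact in knowledge:
        _, rel, tgt = fact.split("-")
        if rel == "WORKS_AT":
            return "answer", tgt
    if knowledge:
        return "explore", knowledge[0].split("-")[-1]
    return "give_up", ""


graph = {
    "zack_fuss": [("IS_JOINED_BY", "micheal_willar")],
    "micheal_willar": [("WORKS_AT", "stenham_asset_management")],
}
answer, notes = explore(graph, "zack_fuss", toy_decide)
print(answer, notes)
```

In the real notebook, `decide` would wrap the answer, critique and re-attempt prompts, and `neighbours` would be the `MATCH (a)-[r]-(b)` query.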

#### [!IMPORTANT] The use of an adversarial agent to check the output helps tremendously in ensuring accuracy of the answer and adherence to the task.
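
The answer, critique and re-attempt cells can be folded into one loop. The sketch below uses stub functions in place of the two LLM calls. The names `answer_with_critic`, `stub_answer` and `stub_critic`, and the `max_rounds` limit, are assumptions rather than part of the notebook.

```python
from typing import Callable, Optional, Tuple


def answer_with_critic(answer_fn: Callable[[Optional[str]], str],
                       critic_fn: Callable[[str], str],
                       max_rounds: int = 3) -> Tuple[str, int]:
    """Draft an answer, then let a critic accept it or request a revision."""
    feedback: Optional[str] = None
    answer = ""
    for round_no in range(1, max_rounds + 1):
        answer = answer_fn(feedback)   # first call gets no feedback
        feedback = critic_fn(answer)
        # Accept only if the critique starts with "good"
        if feedback.strip().lower().startswith("good"):
            return answer, round_no
    return answer, max_rounds  # budget exhausted; return the last attempt


def stub_answer(feedback: Optional[str]) -> str:
    """Stand-in for the answering LLM call."""
    if feedback is None:
        return "Adyen operates globally."
    return "Adyen provides payment processing and operates globally."


def stub_critic(answer: str) -> str:
    """Stand-in for the critic LLM call."""
    return "good" if "payment" in answer else "State what service Adyen provides."


final_answer, rounds = answer_with_critic(stub_answer, stub_critic)
print(final_answer, rounds)
```

Checking that the critique starts with "good", rather than merely contains it, matters here: the critic output shown earlier begins with `Not quite "good" yet!`, which a substring test would wrongly accept.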

In [337]:

# Close the driver
gds.close()