# Knowledge Graph Evaluation and Metrics
This notebook identifies the points at which a knowledge graph can be evaluated, and the parameters that can be adjusted at each point to improve the performance of GraphRAG. The three stages of RAG are listed below, along with the control points where these parameters can be tuned.


## **Three Stages of RAG**
### **Ingestion and Preprocessing**
The code below uses the [Neo4j App](https://llm-graph-builder.neo4jlabs.com/) to upload and process unstructured data (e.g., a PDF). When the Neo4j app generates a graph, it creates a document node and associated chunk nodes from the data. From each chunk of text, the selected LLM (e.g., Gemini, OpenAI) extracts entity nodes and the relationships between them.

**Levers of Control:**
- **Schema & Chunk Tailoring**
  - Defining entity node types and relationships specific to the domain of the data can produce a knowledge graph better suited to graph search and document retrieval.
- **Embedding Model Assessment**
  - Different embedding models (e.g., all-MiniLM-L6-v2) produce different vectors for the same text, which changes the similarity scores used during retrieval.
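
To make the embedding lever concrete, the sketch below computes cosine similarity, a common way to compare a query embedding with a chunk embedding. The vectors are hypothetical toy values rather than real model output, and the notebook itself does not confirm which similarity function the vector index uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for real embeddings
query_vec = [1.0, 0.0, 1.0]
chunk_a = [1.0, 0.0, 1.0]   # same direction as the query
chunk_b = [0.0, 1.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query_vec, chunk_a))
print(cosine_similarity(query_vec, chunk_b))
```

Running the same comparison with vectors from two different embedding models is one way to see how the choice of model shifts the ranking of chunks.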


### **Retrieval**
Retrieving the relevant documents from the Neo4j graph/vector database. Traditional data science metrics (e.g., precision, recall, F1 score) are used to measure how many relevant documents the GraphQARetriever obtains.

**Levers of Control**
- **Query to Cypher Statement**
  - Converting the user query into a relevant Cypher statement using an LLM (e.g., Gemini, OpenAI) before creating a document retriever object.
- **Document Retrieval Assessment**
  - Retrieving the relevant documents from the graph/vector DB (e.g., with GraphQARetriever) using various graph traversal and vector similarity search techniques.
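
The precision, recall, and F1 metrics mentioned above can be sketched as follows. This is a minimal illustration that assumes a hand-labeled set of relevant chunk IDs exists for each query; the IDs `c1`, `c2`, and so on are hypothetical placeholders, not values from the graph.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 4 chunks retrieved, 3 chunks labeled relevant
retrieved_ids = ["c1", "c2", "c3", "c4"]
relevant_ids = ["c1", "c3", "c7"]

p, r, f1 = precision_recall_f1(retrieved_ids, relevant_ids)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

These scores ignore ranking order, so they are best read alongside a rank-aware metric when the retriever returns scored results.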


### **Response**
Creating a response from retrieved documents and ensuring relevance.

**Levers of Control**
- **Embedding to Response**
  - Converting the embeddings of the retrieved documents into a human-readable response.
- **Human and Validation Assessment**
  - Checking whether the response answers the original query.

*References*: 
- [Neo4j llm-graph-builder Github repo](https://github.com/neo4j-labs/llm-graph-builder)
- [RAG TRIAD Metrics](https://truera.com/ai-quality-education/generative-ai-rags/what-is-the-rag-triad/)

## Initial Setup

In [1]:
# Add path to sys.path to enable methods within backend/src (do once)
import os
import sys
import logging
sys.path.append('/home/shinhojung/llm-graph-builder/backend')

from dotenv import load_dotenv
load_dotenv("example.env")

print(sys.path)
key = os.getenv("OPENAI_API_KEY")
# Confirm the key is loaded without echoing the secret into the notebook output
print(f"{key[:3]}...[redacted]" if key else "OPENAI_API_KEY not set")

Python-dotenv could not parse statement starting at line 32


['/usr/lib/python311.zip', '/usr/lib/python3.11', '/usr/lib/python3.11/lib-dynload', '', '/home/shinhojung/llm-graph-builder/backend/.venv/lib/python3.11/site-packages', '/home/shinhojung/llm-graph-builder/backend']
sk-...[redacted]


## Connecting with Neo4j DB instance
Start the llm-graph-builder backend by running the following command from the `llm-graph-builder/backend` folder:

```uvicorn score:app --reload --log-level debug```

**NOTE**: This assumes there's an existing Neo4j instance already created using the Neo4j app.

In [2]:
# Connect with Neo4j Database @/connect
import requests
import json

uri="neo4j+s://819c2e86.databases.neo4j.io:7687"
userName="neo4j"
password=os.environ["NEO4J_PASSWORD"]  # loaded from example.env; avoid hardcoding credentials
database="neo4j"
document_names = ["Anatomy_and_Physiology_CH13.pdf"]

url = "http://127.0.0.1:8000/connect"

data = {
    "uri": uri,
    "userName": userName,
    "password":password,
    "database":database
}

# requests form-encodes `data` and computes Content-Length itself,
# so only the content type is declared here
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}

response = requests.post(url, headers=headers, data=data)

print(response.text)

{"status":"Success","data":{"db_vector_dimension":384,"application_dimension":384,"message":"Connection Successful"}}
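
The response above reports both `db_vector_dimension` and `application_dimension`. As a hedged sketch, the check below parses that payload and confirms the two values agree. The helper name `dimensions_match` is hypothetical, and the reading that a mismatch means the stored vectors came from a different embedding model than the one the application loads is an assumption, not something the API documents here.

```python
import json

# Response text copied from the /connect output above
response_text = (
    '{"status":"Success","data":{"db_vector_dimension":384,'
    '"application_dimension":384,"message":"Connection Successful"}}'
)

def dimensions_match(text):
    """Return True when the connection succeeded and both dimensions agree."""
    payload = json.loads(text)
    if payload.get("status") != "Success":
        return False
    data = payload.get("data", {})
    return data.get("db_vector_dimension") == data.get("application_dimension")

print(dimensions_match(response_text))
```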


## Query the Existing Knowledge Graph

The query `MATCH (d:Document) RETURN d` returns all nodes labeled `Document`.
Within the returned record, look for the `total_chunks` property to see how many chunks the document was split into:
```
records: [<Record d=<Node element_id='4:187c0626-2398-4709-a30d-c929bee00f7d:2' 
labels=frozenset({'Document'}) 
properties={
    'fileName': 'Anatomy_and_Physiology_CH13.pdf', 
    'errorMessage': '', 
    'fileSource': 'local file', 
    'total_chunks': 188, 
    'processingTime': 435.63, 
    'createdAt': neo4j.time.DateTime(2024, 10, 13, 23, 10, 11, 139430000), 
    'fileSize': 13380426, 'nodeCount': 799, 
    'model': 'openai-gpt-4o', 
    'processed_chunk': 188, 
    'fileType': 'pdf', 
    'relationshipCount': 532, 
    'is_cancelled': False, 
    'status': 'Completed', 
    'updatedAt': neo4j.time.DateTime(2024, 10, 13, 23, 17, 39, 134987000)}>>]
```


In [3]:
# /graph_query >> Query test
from src.graph_query import get_graphDB_driver, execute_query,extract_node_elements,extract_relationships
from src.shared.constants import GRAPH_QUERY, GRAPH_CHUNK_LIMIT

driver = get_graphDB_driver(uri,userName,password)
query =  """
MATCH (d:Document)
RETURN d

"""  # labels are case-sensitive
records, summary, keys = execute_query(driver, query, document_names, doc_limit=None)
document_nodes = extract_node_elements(records)
document_relationships = extract_relationships(records)

print(f"records: {records}")
# print(f"summary: {summary}")
# print(f"keys: {keys}")
# print(f"document_nodes: {document_nodes}")
# print(f"document_relationships: {document_relationships}")

records: [<Record d=<Node element_id='4:187c0626-2398-4709-a30d-c929bee00f7d:2' labels=frozenset({'Document'}) properties={'fileName': 'Anatomy_and_Physiology_CH13.pdf', 'errorMessage': '', 'fileSource': 'local file', 'total_chunks': 188, 'processingTime': 435.63, 'createdAt': neo4j.time.DateTime(2024, 10, 13, 23, 10, 11, 139430000), 'fileSize': 13380426, 'nodeCount': 799, 'model': 'openai-gpt-4o', 'processed_chunk': 188, 'fileType': 'pdf', 'relationshipCount': 532, 'is_cancelled': False, 'status': 'Completed', 'updatedAt': neo4j.time.DateTime(2024, 10, 13, 23, 17, 39, 134987000)}>>]


## Visualize Graph in Jupyter Notebooks using yfiles
Compare the result with Neo4j's online visualization tool to check that the graph is rendered accurately.

In [4]:
# Visualize Graph
from neo4j import GraphDatabase
from yfiles_jupyter_graphs import GraphWidget

default_cypher = "MATCH (n:Document)<-[r]->(c:Chunk)<-[s]->(e) RETURN n,r,c,s,e LIMIT 500"

def showGraph(cypher: str = default_cypher):
    # Open a short-lived driver and session; both are closed when the blocks exit
    with GraphDatabase.driver(
        uri,
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"])
    ) as driver:
        with driver.session() as session:
            # Build the widget while the session is still open
            widget = GraphWidget(graph=session.run(cypher).graph())
    widget.node_label_mapping = 'id'
    return widget

showGraph()

GraphWidget(layout=Layout(height='800px', width='100%'))

In [5]:
# /graph_query >> get_graph_results function
from src.graph_query import get_graphDB_driver, execute_query,extract_node_elements,extract_relationships
from src.shared.constants import GRAPH_QUERY, GRAPH_CHUNK_LIMIT

driver = get_graphDB_driver(uri,userName,password)
query = GRAPH_QUERY.format(graph_chunk_limit=GRAPH_CHUNK_LIMIT)
document_names =["Anatomy_and_Physiology_CH13.pdf"]
records, summary, keys = execute_query(driver, query, document_names, doc_limit=None)
document_nodes = extract_node_elements(records)
document_relationships = extract_relationships(records)

# print(f"records: {records}")
print(f"summary: {summary}")
print(f"keys: {keys}")

print(f"no of nodes : {len(document_nodes)}")
print(f"no of relations : {len(document_relationships)}")
results = {
    "nodes": document_nodes,
    "relationships": document_relationships
}




summary: <neo4j._work.summary.ResultSummary object at 0x7f8738ad5ed0>
keys: ['nodes', 'rels']
no of nodes : 212
no of relations : 945


In [9]:
from src.api_response import create_api_response

print(create_api_response('Success', data=results))

{'status': 'Success', 'data': {'nodes': [{'element_id': '4:187c0626-2398-4709-a30d-c929bee00f7d:2', 'labels': ['Document'], 'properties': {'fileName': 'Anatomy_and_Physiology_CH13.pdf', 'errorMessage': '', 'fileSource': 'local file', 'total_chunks': 188, 'processingTime': 435.63, 'createdAt': '2024-10-13T23:10:11.139430000', 'fileSize': 13380426, 'nodeCount': 799, 'model': 'openai-gpt-4o', 'processed_chunk': 188, 'fileType': 'pdf', 'relationshipCount': 532, 'is_cancelled': False, 'status': 'Completed', 'updatedAt': '2024-10-13T23:17:39.134987000'}}, {'element_id': '4:187c0626-2398-4709-a30d-c929bee00f7d:3', 'labels': ['Chunk'], 'properties': {'fileName': 'Anatomy_and_Physiology_CH13.pdf', 'content_offset': 0, 'page_number': 1, 'length': 1043, 'id': '25cfd9eace2fcb0f5fc90f2f144a85bf17483389', 'position': 1}}, {'element_id': '4:187c0626-2398-4709-a30d-c929bee00f7d:4', 'labels': ['Chunk'], 'properties': {'fileName': 'Anatomy_and_Physiology_CH13.pdf', 'content_offset': 1043, 'page_number':

In [6]:
# Deconstructed: /chat_bot API >> QA_RAG >> create_graph_chain
from src.llm import get_llm
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain

# Fetch the graph from the database, identifying this client by its user agent
graph = Neo4jGraph(
    url=uri,
    username=userName,
    password=password,
    database=database,
    sanitize=True,
    refresh_schema=True,
    driver_config={'user_agent': os.environ.get('NEO4J_USER_AGENT')}
)

# RAG Parameters
model="gemini-1.5-pro"
question="Tell me about anatomy and physiology."
document_names="Anatomy_and_Physiology_CH13.pdf"
session_id = None
mode="graph + vector + fulltext"

# Create_graph_chain
cypher_llm,model_name = get_llm(model)
qa_llm,model_name = get_llm(model)
graph_chain = GraphCypherQAChain.from_llm(
    cypher_llm=cypher_llm,
    qa_llm=qa_llm,
    validate_cypher= True,
    graph=graph,
    # verbose=True, 
    return_intermediate_steps = True,
    top_k=3
)

print(graph_chain)
print(qa_llm)

graph=<langchain_community.graphs.neo4j_graph.Neo4jGraph object at 0x7f8738ad4ed0> cypher_generation_chain=LLMChain(prompt=PromptTemplate(input_variables=['question', 'schema'], template='Task:Generate Cypher statement to query a graph database.\nInstructions:\nUse only the provided relationship types and properties in the schema.\nDo not use any other relationship types or properties that are not provided.\nSchema:\n{schema}\nNote: Do not include any explanations or apologies in your responses.\nDo not respond to any questions that might ask anything else than for you to construct a Cypher statement.\nDo not include any text except the generated Cypher statement.\n\nThe question is:\n{question}'), llm=ChatVertexAI(project='genie-rbs-dev', model_name='gemini-1.5-pro-preview-0514', full_model_name='projects/genie-rbs-dev/locations/us-central1/publishers/google/models/gemini-1.5-pro-preview-0514', client_options=ClientOptions: {'api_endpoint': 'us-central1-aiplatform.googleapis.com', 'cl

## Retrieval
There are three options for retrieval:
- Graph
- Vector
- Hybrid (Graph + Vector)

In [7]:
# Deconstructed: /chat_bot API >> QA_RAG >> get_graph_response
# Graph Retrieval
from src.QA_integration_new import get_graph_response
graph_response = get_graph_response(graph_chain,question)
print("This is graph response:")
print(graph_response)  # Open question: this is slower than the online app, and on a second run the chain sometimes answers that it doesn't know




This is graph response:
{'response': 'The cerebrum, a large part of the central nervous system, handles higher functions like memory and emotion. It contains the cerebral cortex, which is responsible for higher functions of the nervous system, and the basal nuclei, which are responsible for cognitive processing, particularly planning movements. The cerebrum also includes the basal forebrain, which contains nuclei important in learning and memory. \n', 'cypher_query': 'MATCH (a:Anatomy)-[r]->(b) RETURN a, r, b', 'context': [{'a': {'description': 'The cerebrum is a large component of the CNS in humans, responsible for higher neurological functions such as memory, emotion, and consciousness. It includes the cerebral cortex and several deep nuclei.', 'id': 'Cerebrum'}, 'r': ({'description': 'The cerebrum is a large component of the CNS in humans, responsible for higher neurological functions such as memory, emotion, and consciousness. It includes the cerebral cortex and several deep nuclei

In [10]:
from src.QA_integration_new import load_embedding_model
# Under the get_neo4j_retriever function
EMBEDDING_MODEL = os.getenv('EMBEDDING_MODEL')
EMBEDDING_FUNCTION , _ = load_embedding_model(EMBEDDING_MODEL)

print(EMBEDDING_FUNCTION)

client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
) model_name='all-MiniLM-L6-v2' cache_folder=None model_kwargs={} encode_kwargs={} multi_process=False show_progress=False


In [12]:
# Deconstructed: /chat_bot API >> QA_RAG >> VECTOR_GRAPH_SEARCH and setup_chat
# Different from RAG_Graph
from src.shared.constants import VECTOR_GRAPH_SEARCH_QUERY, VECTOR_GRAPH_SEARCH_ENTITY_LIMIT, VECTOR_SEARCH_QUERY
from src.QA_integration_new import setup_chat, retrieve_documents, process_documents, get_neo4j_retriever, create_document_retriever_chain
from src.shared.common_fn import create_graph_database_connection
from src.shared.constants import CHAT_SEARCH_KWARG_K, CHAT_SEARCH_KWARG_SCORE_THRESHOLD
from src.llm import get_llm
from langchain_community.vectorstores.neo4j_vector import Neo4jVector

llm,model_name = get_llm(model)

# under QA_RAG with Hybrid Search
retrieval_query = VECTOR_GRAPH_SEARCH_QUERY.format(no_of_entites=VECTOR_GRAPH_SEARCH_ENTITY_LIMIT)

# Create retriever
index_name = "vector"
keyword_index = "keyword"
search_k = CHAT_SEARCH_KWARG_K  # trailing comma removed: it turned this value into a 1-tuple
score_threshold = CHAT_SEARCH_KWARG_SCORE_THRESHOLD

# Vectors within a graph (aka. setup chat)
neo_db = Neo4jVector.from_existing_graph(
                embedding=EMBEDDING_FUNCTION,
                index_name=index_name,
                retrieval_query=retrieval_query,
                graph=graph,
                search_type="hybrid",
                node_label="Chunk",
                embedding_node_property="embedding",
                text_node_properties=["text"],
                keyword_index_name=keyword_index
                )
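# k is not passed in search_kwargs, so the retriever uses its default of k=4
retriever = neo_db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": score_threshold})
doc_retriever = create_document_retriever_chain(llm, retriever)
# `messages` is built in the "Create messages" cell (In [11]); run that cell first
docs = retrieve_documents(doc_retriever, messages)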

# Point of Evaluation - Retrieval
## Precision
## Recall
## F1 Score
## Mean Reciprocal Rank (MRR)
## Mean Average Precision (MAP)

if docs:
    content, result, total_tokens = process_documents(docs, question, messages, llm,model)
else:
    content = "I couldn't find any relevant documents to answer your question."
    result = {"sources": [], "chunkdetails": []}
    total_tokens = 0

# Point of Evaluation - Response
## Human Evaluation
## Embedding-Based Comparison
## ROUGE-N

print()
print(f"message: {content}")
print(f"sources: {result['sources']}")
# print(f"model: {model_version}")
print(f"chunkdetails: {result['chunkdetails']}")
print(f"total_tokens: {total_tokens}")
print(f"mode: {mode}")





message: Anatomy and physiology are about the structures of the body and how they work.  The nervous system, for example, controls much of the body, both voluntary actions like movement and involuntary ones like digestion.  Certain structures in the nervous system are specifically responsible for these functions. Understanding these structures and functions requires a detailed look at the anatomy of the nervous system, both the central and peripheral parts. 

sources: {'Anatomy_and_Physiology_CH13.pdf'}
chunkdetails: [{'id': '3dcb1d2c500ba3fb4c07154048a8dd87b1a73cad', 'score': 1.0}, {'id': '2a3e54668f2428ec2ebc6f523c46a89feab17f4f', 'score': 1.0}, {'id': '25cfd9eace2fcb0f5fc90f2f144a85bf17483389', 'score': 0.9825}, {'id': '90e1d10d0082182a3861e32e61d127b3426e6cba', 'score': 0.9671}]
total_tokens: 3185
mode: graph + vector + fulltext
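
The rank-aware retrieval metrics named in the cell above (MRR and MAP) can be sketched as below. This is an illustration only: it assumes each query has a hand-labeled set of relevant chunk IDs, and the IDs used here are hypothetical placeholders rather than the hashes shown in `chunkdetails`.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k that holds a relevant result."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Hypothetical evaluation set: (ranked retrieval result, labeled relevant IDs)
queries = [
    (["a", "b", "c", "d"], {"b", "d"}),
    (["e", "f", "g"], {"e"}),
]

mrr = mean(reciprocal_rank(r, rel) for r, rel in queries)
map_score = mean(average_precision(r, rel) for r, rel in queries)
print(f"MRR={mrr:.3f} MAP={map_score:.3f}")
```

In practice the ranked list would come from ordering the retrieved chunks by their `score`, with ties such as the two 1.0 scores above broken consistently.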


In [11]:
# Create messages
from src.QA_integration_new import create_neo4j_chat_message_history
from src.shared.common_fn import create_graph_database_connection
from langchain_core.messages import HumanMessage, SystemMessage

history = create_neo4j_chat_message_history(graph, session_id=" ")
messages = history.messages
user_question = HumanMessage(content=question)
messages.append(user_question)

print(messages)

[HumanMessage(content='Tell me about anatomy and physiology.')]


## Compare Graph, Vector, and Hybrid (Graph + Vector) Responses


In [12]:
print("This is Graph-only response")
print(graph_response['response'])
print("This is Graph + Vector response ")
print(content)


This is Graph-only response
The cerebrum, a large part of the central nervous system, handles higher functions like memory and consciousness. It contains the cerebral cortex, responsible for higher nervous system functions, and the basal nuclei, which plan movements, and the basal forebrain, important for learning and memory. 

This is Graph + Vector response 
Anatomy and physiology are about the structures of the body and how they work. Anatomy focuses on the physical makeup of the body, like organs and tissues, while physiology explores the functions and processes of those structures, explaining how they work together to keep the body alive and functioning. 
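
One way to quantify the difference between two responses is ROUGE-N, which is listed as a response evaluation point earlier in this notebook. The sketch below computes ROUGE-N recall with plain whitespace tokenization, which is a simplifying assumption; dedicated ROUGE packages apply more careful tokenization and stemming. The example strings are hypothetical, and a real evaluation would score each response against a trusted reference answer.

```python
from collections import Counter

def ngrams(text, n):
    """Count the n-grams of a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return overlap / total

# Hypothetical strings for illustration
reference = "the cat sat on the mat"
candidate = "the cat sat"

print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```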



In [None]:
# Deconstructed: /chat_bot API >> QA_RAG >> get_graph_response
from langchain_core.messages.ai import AIMessage
from langchain_community.chat_message_histories.neo4j import Neo4jChatMessageHistory
from src.QA_integration_new import summarize_and_log

history = Neo4jChatMessageHistory(
            graph=graph,
            session_id="session_id"
)
messages = history.messages
ai_response = AIMessage(content=graph_response["response"]) if graph_response["response"] else AIMessage(content="Something went wrong")
messages.append(ai_response)
# summarize_and_log(history, messages, qa_llm)

result = {
    "session_id": session_id, 
    "message": graph_response["response"], 
    "info": {
        "model": model_name,  # model_version is never defined in this notebook; model_name comes from get_llm
        'cypher_query':graph_response["cypher_query"],
        "context" : graph_response["context"],
        "mode" : mode,
        "response_time": 0
    },
    "user": "chatbot"
} 
print(result)

In [None]:
# Test chat_bot API 
url = "http://127.0.0.1:8000/chat_bot"

data = {
    "uri": uri,
    "userName": userName,
    "password": password,
    "database": database,
    "question": question,
    "model": "gemini-1.5-pro",
    "mode": "graph + vector + fulltext",
    "document_names": []
}

# requests form-encodes `data` and computes Content-Length itself,
# so only the content type is declared here
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}

response = requests.post(url, headers=headers, data=data)

print(response.text)

In [None]:
# Test graph_query API 
url = 'http://127.0.0.1:8000/graph_query'

data = {
    "uri": uri,
    "userName": userName,
    "password":password,
    "database":database,
    "document_names":["Anatomy_and_Physiology_2e_-_WEB_c9nD9QL_Chapter-13.pdf"]
}
response = requests.post(url, headers=headers, data=data)
print(response.text)