#### <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:white; font-size:180%; text-align:left;padding:3.0px; background: maroon; border-bottom: 8px solid black" > TABLE OF CONTENTS<br><div>
* [IMPORTS](#1)
* [Introduction](#2)
* [GraphQA Chain](#3)
* [Custom Chain](#4)
* [Semantic Retrieval](#5)
* [Final Chain](#6)


In [1]:
import os

import langchain
## Chains
from operator import itemgetter

from langchain_community.graphs import Neo4jGraph

from langchain_core.runnables import RunnableLambda, RunnablePassthrough

from langchain_community.chains.graph_qa.prompts import (
    CYPHER_QA_PROMPT,
    CYPHER_GENERATION_PROMPT
)
## LLMs:
from langchain_openai import OpenAI, ChatOpenAI

langchain.debug=False

<a id="2"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color: white; font-size:120%; text-align:left;padding:3.0px; background: maroon; border-bottom: 8px solid black" > Introduction<br><div>

In this notebook we show how to build a custom GraphRAG setup, using the GraphQA chain provided by LangChain as a reference.

<a id="3"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color: white; font-size:120%; text-align:left;padding:3.0px; background: maroon; border-bottom: 8px solid black" > GraphQA Chain<br><div>

We based this development on the [GraphCypherQAChain](https://api.python.langchain.com/en/latest/chains/langchain_community.chains.graph_qa.cypher.GraphCypherQAChain.html) chain. We show how to replicate its main behaviour here, adapted to LangChain's LCEL notation, since the legacy chain classes are deprecated. This should bring the final solution closer to 'production ready' and make it easier to customize.


Inspecting the chain class definition, we found that by default it uses two predefined prompts:

* Query generation prompt: converts the user question into a Cypher query.

* Question-answer prompt: once the context has been retrieved from the knowledge graph, this prompt handles the conversation and phrases the answer.

Let's take a look at these prompts:

## Visualize the prompts

In [2]:
CYPHER_GENERATION_PROMPT

PromptTemplate(input_variables=['question', 'schema'], template='Task:Generate Cypher statement to query a graph database.\nInstructions:\nUse only the provided relationship types and properties in the schema.\nDo not use any other relationship types or properties that are not provided.\nSchema:\n{schema}\nNote: Do not include any explanations or apologies in your responses.\nDo not respond to any questions that might ask anything else than for you to construct a Cypher statement.\nDo not include any text except the generated Cypher statement.\n\nThe question is:\n{question}')

In [3]:
CYPHER_GENERATION_PROMPT.invoke({'question':"How to build a confusion matrix with plotly?",'schema':'This will be the schema of the graph'}).text

'Task:Generate Cypher statement to query a graph database.\nInstructions:\nUse only the provided relationship types and properties in the schema.\nDo not use any other relationship types or properties that are not provided.\nSchema:\nThis will be the schema of the graph\nNote: Do not include any explanations or apologies in your responses.\nDo not respond to any questions that might ask anything else than for you to construct a Cypher statement.\nDo not include any text except the generated Cypher statement.\n\nThe question is:\nHow to build a confusion matrix with plotly?'
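For templates with simple named placeholders like this one, the rendering step behaves much like Python's built-in `str.format`. The sketch below is a plain-Python stand-in, not LangChain code: `TEMPLATE` is an abbreviated version of the prompt above and `render_prompt` is a hypothetical helper introduced only for illustration.

```python
# Plain-Python stand-in for the prompt rendering step (no LangChain required).
# TEMPLATE is an abbreviated copy of the generation prompt shown above.
TEMPLATE = (
    "Task:Generate Cypher statement to query a graph database.\n"
    "Schema:\n{schema}\n"
    "The question is:\n{question}"
)


def render_prompt(template: str, **variables: str) -> str:
    """Fill the {placeholders} of the template with the given values."""
    return template.format(**variables)


prompt_text = render_prompt(
    TEMPLATE,
    schema="This will be the schema of the graph",
    question="How to build a confusion matrix with plotly?",
)
print(prompt_text)
```

Whatever string is passed as `schema` ends up verbatim in the prompt, which is why the quality of the schema description matters so much for the generated query.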

In [4]:
CYPHER_QA_PROMPT

PromptTemplate(input_variables=['context', 'question'], template="You are an assistant that helps to form nice and human understandable answers.\nThe information part contains the provided information that you must use to construct an answer.\nThe provided information is authoritative, you must never doubt it or try to use your internal knowledge to correct it.\nMake the answer sound as a response to the question. Do not mention that you based the result on the given information.\nHere is an example:\n\nQuestion: Which managers own Neo4j stocks?\nContext:[manager:CTL LLC, manager:JANE STREET GROUP LLC]\nHelpful Answer: CTL LLC, JANE STREET GROUP LLC owns Neo4j stocks.\n\nFollow this example when generating answers.\nIf the provided information is empty, say that you don't know the answer.\nInformation:\n{context}\n\nQuestion: {question}\nHelpful Answer:")

As we can see, the second prompt only handles the conversation once the query has retrieved some content, so we will focus on the first one.

<a id="4"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color: white; font-size:120%; text-align:left;padding:3.0px; background: maroon; border-bottom: 8px solid black" > Custom Chain<br><div>

We are going to replicate a chain with essentially the same behaviour, but adapted to our use case.

In [5]:
llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo") # gpt-4-0125-preview occasionally has issues

chain = CYPHER_GENERATION_PROMPT | llm

In [6]:
result = chain.invoke({'question':"How to build a confusion matrix with plotly?",'schema':'This will be the schema of the graph'})

In [7]:
result.content

'MATCH (a:Actual)-[r:ACTUAL_PREDICTED]->(p:Predicted)\nWITH {label: a.label, prediction: p.label} as data, count(*) as count\nRETURN data, count'

As we can see, this query makes no sense, because we have not provided the real schema to the LLM yet. Let's reproduce this part based on the reference chain.

### Load a graph from neo4j

We do this with the wrapper that the LangChain community package offers. We could write our own, but for now we will stick with it.
The main advantage of this client is that it builds the schema for us directly, so we can give the LLM that context.

In [8]:
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password=os.environ['NEO4J_PASSWORD'],database='graphrag')

### This property lets us get the schema directly from the database

In [None]:
graph.get_schema

In [9]:
graph.get_structured_schema

{'node_props': {'Function': [{'property': 'description', 'type': 'STRING'},
   {'property': 'embedding', 'type': 'LIST'},
   {'property': 'name', 'type': 'STRING'},
   {'property': 'code', 'type': 'STRING'},
   {'property': 'file_path', 'type': 'STRING'}],
  'Area': [{'property': 'name', 'type': 'STRING'}],
  'SubArea': [{'property': 'name', 'type': 'STRING'}],
  'Framework': [{'property': 'name', 'type': 'STRING'}],
  'Class': [{'property': 'description', 'type': 'STRING'},
   {'property': 'name', 'type': 'STRING'},
   {'property': 'code', 'type': 'STRING'},
   {'property': 'file_path', 'type': 'STRING'}]},
 'rel_props': {},
 'relationships': [{'start': 'Area',
   'type': 'CONTAINS_SUBAREA',
   'end': 'SubArea'},
  {'start': 'Area', 'type': 'CONTAINS_FRAMEWORK', 'end': 'Framework'},
  {'start': 'SubArea', 'type': 'CONTAINS_FRAMEWORK', 'end': 'Framework'},
  {'start': 'Framework', 'type': 'CONTAINS_FUNCTION', 'end': 'Function'},
  {'start': 'Framework', 'type': 'CONTAINS_CLASS', 'end':
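The structured schema is a dictionary, while the prompt needs a string. The sketch below shows one way such a dictionary could be flattened into text. It is an approximation inferred from the two schema representations printed in this notebook, not the wrapper's actual implementation; `format_props` and `render_schema` are hypothetical names, and the example dictionary is a small subset of the schema above.

```python
def format_props(props: list) -> str:
    """Render a list of {'property', 'type'} dicts as 'name: TYPE, ...'."""
    return ", ".join(f"{p['property']}: {p['type']}" for p in props)


def render_schema(structured: dict) -> str:
    """Flatten a structured schema dict into a prompt-friendly string."""
    node_lines = [
        f"{label} {{{format_props(props)}}}"
        for label, props in structured["node_props"].items()
    ]
    rel_prop_lines = [
        f"{rel_type} {{{format_props(props)}}}"
        for rel_type, props in structured["rel_props"].items()
    ]
    rel_lines = [
        f"(:{r['start']})-[:{r['type']}]->(:{r['end']})"
        for r in structured["relationships"]
    ]
    sections = [
        "Node properties:",
        "\n".join(node_lines),
        "Relationship properties:",
        "\n".join(rel_prop_lines),
        "The relationships:",
        "\n".join(rel_lines),
    ]
    return "\n".join(sections)


# Small subset of the schema printed above, used only as example data.
example_schema = {
    "node_props": {
        "Area": [{"property": "name", "type": "STRING"}],
        "Framework": [{"property": "name", "type": "STRING"}],
    },
    "rel_props": {},
    "relationships": [
        {"start": "Area", "type": "CONTAINS_FRAMEWORK", "end": "Framework"},
    ],
}

schema_text = render_schema(example_schema)
print(schema_text)
```

Note that only labels, property names, and types reach the LLM. Nothing in this text explains what a 'Framework' or an 'Area' means in our domain.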

## Invoke the chain with this schema as input

In [10]:
result = chain.invoke({'question':"How to use plotly framework?",'schema':graph.get_schema})

In [11]:
result.content

'MATCH (a:Area)-[:CONTAINS_FRAMEWORK]->(f:Framework {name: "plotly"})\nRETURN f, a'

Another advantage of the LangChain graph client is that it can directly run the queries returned by this first LLM call:

In [12]:
context = graph.query(result.content)[:5]

In [13]:
context

[{'f': {'name': 'plotly'}, 'a': {'name': 'visualization'}}]

### Here we added the keyword 'framework' to the question, but that is highly unlikely in a normal query

In [14]:
result = chain.invoke({'question':"How to use plotly?",'schema':graph.get_schema})

result.content

"MATCH (a:Area)-[:CONTAINS_SUBAREA]->(sa:SubArea)-[:CONTAINS_FRAMEWORK]->(f:Framework)-[:CONTAINS_FUNCTION]->(func:Function)\nWHERE func.name = 'plotly'\nRETURN a, sa, f, func;"

In [15]:
context = graph.query(result.content)[:5]
context

[]

The first problem we see is that, given how our graph is built, entities are hard to identify by name alone. This is a general problem with this kind of solution: entity labels need to be very descriptive (Person, Organization...) so that a general-purpose LLM can identify them easily. So we should give the LLM more context about the entities it can expect. To do that, we take the base prompt used in GraphCypherQAChain as a reference and add the following lines defining the entities for our problem.

You are a helpful assistant that understands the context of data science and can generate Cypher queries to retrieve information from a Neo4j database.

The database schema includes the following entities:
- Data Preprocessing Area: Nodes labeled as 'DataPreprocessingArea' representing areas of data preprocessing.
- SubArea: Nodes labeled as 'SubArea' representing sub-areas within data preprocessing.
- Framework: Nodes labeled as 'Framework' representing frameworks used in data science.
- Class: Nodes labeled as 'Class' representing a set of functions defining a Python class within a framework.
- Function: Nodes labeled as 'Function' representing specific functions within frameworks.

In [16]:
# Prompts:
from langchain_core.prompts import (
    PromptTemplate
)

cypher_gen_prompt = PromptTemplate.from_template(
    """
    You are a Cypher language expert.
    Your Task:Generate Cypher statement to query a graph database.
    To better contextualize, the Graph database is mapping the Data Science implementations using python
    and is divided in the following entities:
        - Data Preprocessing Area: Nodes labeled as 'Area' representing areas of Data Science, like 'Data Visualization' or 'Data Preprocessing'.
        - SubArea: Nodes labeled as 'SubArea' representing sub-areas within data preprocessing. This field is optional, some of the nodes may not have a relation with a 'SubArea' node, so generally
        should not be added in the query.
        - Framework: Nodes labeled as 'Framework' representing frameworks used in data science.
        - Class: Nodes labeled as 'Class' representing a set of functions defining a Python class within a framework.
        - Function: Nodes labeled as 'Function' representing custom functions built on top of those frameworks.
    Nodes do not necessarily have parents of each type of label.

    Instructions:
    Use only the provided relationship types and properties in the schema.
    Do not use any other relationship types or properties that are not provided.
    Schema:
    {schema}
    Note: Do not include any explanations or apologies in your responses.
    Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
    Do not include any text except the generated Cypher statement.

    Your main focus should be to identify the Framework and the Function that is being asked.
    The question is:
    {question}
    """
)

In [18]:
llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo") # gpt-4-0125-preview occasionally has issues

custom_prompt_chain = cypher_gen_prompt | llm

In [24]:
result = custom_prompt_chain.invoke({'question':"How to use plotly?",'schema':graph.get_schema})

In [25]:
result.content

'MATCH (:Framework{name: "plotly"})-[:CONTAINS_FUNCTION]->(f:Function)\nRETURN f;'

In [26]:

result = custom_prompt_chain.invoke({'question':"How to use plotly to generate a confusion matrix?",'schema':graph.get_schema})
result.content

'MATCH (:Framework{name: "plotly"})-[:CONTAINS_FUNCTION]->(:Function{name: "generate_confusion_matrix"})\nRETURN *;'

In [27]:
context = graph.query(result.content)[:5]
context

ValueError: Generated Cypher Statement is not valid
{code: Neo.ClientError.Statement.SyntaxError} {message: RETURN * is not allowed when there are no variables in scope (line 2, column 1 (offset: 104))
"RETURN *;"
 ^}
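Because the generated Cypher is not guaranteed to be valid, it can help to guard the query step so that a rejected statement yields an empty context instead of stopping the run. The following is a minimal sketch under stated assumptions: `safe_graph_query` is a hypothetical helper, and `fake_query` is a stub standing in for `graph.query` that mimics the `ValueError` seen above.

```python
from typing import Callable


def safe_graph_query(run_query: Callable[[str], list], cypher: str, limit: int = 5) -> list:
    """Run a generated Cypher statement and return at most `limit` rows.

    Returns an empty list when the statement is rejected, so the caller
    can treat it like a query that matched nothing.
    """
    try:
        return run_query(cypher)[:limit]
    except ValueError as error:
        # In this notebook the invalid statement surfaced as a ValueError.
        print("Query rejected:", error)
        return []


def fake_query(cypher: str) -> list:
    """Stub for graph.query: rejects 'RETURN *', otherwise returns 7 dummy rows."""
    if "RETURN *" in cypher:
        raise ValueError("Generated Cypher Statement is not valid")
    return [{"f.name": f"function_{i}"} for i in range(7)]


bad = 'MATCH (:Framework{name: "plotly"})-[:CONTAINS_FUNCTION]->(:Function)\nRETURN *;'
good = "MATCH (f:Function) RETURN f.name"

print(safe_graph_query(fake_query, bad))
print(len(safe_graph_query(fake_query, good)))
```

An empty context is not a dead end: the question-answer prompt instructs the model to say it does not know the answer when the provided information is empty.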

In [28]:

result = custom_prompt_chain.invoke({'question':"How to use plotly to generate a 'plot_confusion_matrix' function? Provide the code",'schema':graph.get_schema})
result

AIMessage(content="MATCH (a:Area)-[:CONTAINS_FRAMEWORK]->(f:Framework)-[:CONTAINS_FUNCTION]->(func:Function)\nWHERE f.name = 'plotly' AND func.name = 'plot_confusion_matrix'\nRETURN func.code", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 48, 'prompt_tokens': 477, 'total_tokens': 525}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-06f7b36a-04ba-4b7c-a7da-27ba12d48e27-0', usage_metadata={'input_tokens': 477, 'output_tokens': 48, 'total_tokens': 525})

In [29]:
context = graph.query(result.content)[:5]
context

[{'func.code': 'def plot_confusion_matrix(confusion_matrix, class_names):     """     Generates a confusion matrix plot from a sklearn confusion matrix object.      Parameters:     - confusion_matrix (array): Numpy array containing the confusion matrix information.     - class_names (list of str): List of class names corresponding to the labels.      Returns:     - fig (plotly.graph_objects.Figure): The Plotly figure object for the confusion matrix plot.     """     fig = ff.create_annotated_heatmap(         z=confusion_matrix,         x=class_names,         y=class_names,         colorscale=\'Blues\',         showscale=True     )     fig.update_layout(title=\'Confusion Matrix\', xaxis_title=\'Predicted Label\', yaxis_title=\'True Label\')     fig.update_traces(text=confusion_matrix.astype(str), texttemplate=\'%{text}\')     return fig'}]

### We can see that after contextualizing what can be understood as a 'Framework' in our graph, the LLM is correctly identifying plotly as a framework

In [30]:
qa_prompt_template =  PromptTemplate.from_template(
    """You are an assistant that helps to form nice and human understandable answers.
    The information part contains the provided information that you must use to construct an answer.
    The provided information is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
    Make the answer sound as a response to the question. Do not mention that you based the result on the given information.
    If the provided information is empty, say that you don't know the answer.
    Information:
    {context}
    
    Question: {question}
    Helpful Answer:
    """
)

We have slightly adapted the prompt from the reference chain for our purposes, mainly by removing the example. Let's build the complete chain.

In [33]:
def run_cypher_query(query):
    """Run the Cypher statement generated by the LLM and keep at most 5 rows."""
    print("Generated query---->", query.content)
    node_contents = graph.query(query.content)[:5]
    return node_contents

In [34]:
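# itemgetter forwards only the question text to the QA prompt;
# RunnablePassthrough() would forward the whole input dict, schema included.
full_qa_chain = (
    {
        'context': cypher_gen_prompt | llm | RunnableLambda(run_cypher_query),
        'question': itemgetter('question'),
    }
    | qa_prompt_template
    | llm
)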

langchain.debug=True

full_qa_chain.invoke({'question':"How to use plotly to generate a 'plot_confusion_matrix' function? Provide the code",'schema':graph.get_schema})

[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "question": "How to use plotly to generate a 'plot_confusion_matrix' function? Provide the code",
  "schema": "Node properties:\nFunction {description: STRING, embedding: LIST, name: STRING, code: STRING, file_path: STRING}\nArea {name: STRING}\nSubArea {name: STRING}\nFramework {name: STRING}\nClass {description: STRING, name: STRING, code: STRING, file_path: STRING}\nRelationship properties:\n\nThe relationships:\n(:Area)-[:CONTAINS_SUBAREA]->(:SubArea)\n(:Area)-[:CONTAINS_FRAMEWORK]->(:Framework)\n(:SubArea)-[:CONTAINS_FRAMEWORK]->(:Framework)\n(:Framework)-[:CONTAINS_FUNCTION]->(:Function)\n(:Framework)-[:CONTAINS_CLASS]->(:Class)"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence > chain:RunnableParallel<context,question>] Entering Chain run with input:
[0m{
  "question": "How to use plotly to generate a 'plot_confusion_matrix' function? Provide the code",
  "schema": "No

AIMessage(content="To generate a 'plot_confusion_matrix' function using plotly, you can use the following code:\n\n```python\ndef plot_confusion_matrix(confusion_matrix, class_names):\n    fig = ff.create_annotated_heatmap(\n        z=confusion_matrix,\n        x=class_names,\n        y=class_names,\n        colorscale='Blues',\n        showscale=True\n    )\n    fig.update_layout(title='Confusion Matrix', xaxis_title='Predicted Label', yaxis_title='True Label')\n    fig.update_traces(text=confusion_matrix.astype(str), texttemplate='%{text}')\n    return fig\n```\n\nThis code will generate a confusion matrix plot using the provided sklearn confusion matrix object and class names.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 149, 'prompt_tokens': 479, 'total_tokens': 628}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-430d6c36-9970-42f7-8339-a33508187fbd-0', usage_meta
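To make the data flow of this chain explicit, here is a plain-Python sketch of the same three steps: generate a Cypher query, run it against the graph, then answer from the retrieved context. It does not use LangChain. `build_graph_qa` and the three `fake_*` stubs are hypothetical stand-ins for the LLM calls and the graph client.

```python
from typing import Callable


def build_graph_qa(
    generate_cypher: Callable[[str, str], str],
    run_query: Callable[[str], list],
    answer: Callable[[list, str], str],
    limit: int = 5,
) -> Callable[[dict], str]:
    """Compose the three steps into a single callable taking {'question', 'schema'}."""

    def pipeline(inputs: dict) -> str:
        # Step 1: question + schema -> Cypher statement
        cypher = generate_cypher(inputs["question"], inputs["schema"])
        # Step 2: Cypher statement -> at most `limit` rows of context
        context = run_query(cypher)[:limit]
        # Step 3: context + original question -> final answer
        return answer(context, inputs["question"])

    return pipeline


# Stubs standing in for the LLM and the graph client (assumptions for illustration).
def fake_generate_cypher(question: str, schema: str) -> str:
    return "MATCH (f:Function) RETURN f.name"


def fake_run_query(cypher: str) -> list:
    return [{"f.name": f"function_{i}"} for i in range(8)]


def fake_answer(context: list, question: str) -> str:
    return f"{len(context)} rows used to answer: {question}"


qa = build_graph_qa(fake_generate_cypher, fake_run_query, fake_answer)
answer_text = qa({
    "question": "How to use plotly?",
    "schema": "(:Framework)-[:CONTAINS_FUNCTION]->(:Function)",
})
print(answer_text)
```

The question is used twice, once to generate the query and once to phrase the answer, while the schema is only needed in the first step.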

That covers using the graph with a query-based retrieval strategy. But since this approach is not always accurate and may return no results, we are going to combine it with a 'semantic similarity' retrieval procedure.
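As a rough illustration of the idea, semantic similarity retrieval ranks nodes by how close their stored embedding is to the embedding of the question. The sketch below uses cosine similarity in plain Python with toy three-dimensional vectors. The node names other than `plot_confusion_matrix`, the vectors, and the `top_k` helper are all made up for the example, and the retrieval actually used in this notebook may work differently.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: list, nodes: list, k: int = 2) -> list:
    """Return the k nodes whose embedding is most similar to the query."""
    ranked = sorted(
        nodes,
        key=lambda node: cosine_similarity(query_embedding, node["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: real embeddings would come from an embedding model.
nodes = [
    {"name": "plot_confusion_matrix", "embedding": [0.9, 0.1, 0.0]},
    {"name": "load_csv", "embedding": [0.0, 0.2, 0.9]},
    {"name": "plot_roc_curve", "embedding": [0.7, 0.3, 0.1]},
]
query_embedding = [1.0, 0.0, 0.0]

best = top_k(query_embedding, nodes, k=2)
print([node["name"] for node in best])
```

Unlike the Cypher route, this ranking always returns the closest candidates, even when the question does not mention an exact function or framework name.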