# BYOKG RAG Demo
This notebook demonstrates a RAG (Retrieval Augmented Generation) system built on top of a Knowledge Graph. The system lets you query a knowledge graph with natural language questions and retrieves relevant information to generate answers. It is built from the following components:

1. **Graph Store**: Neptune Analytics endpoint for the graph structure
2. **KG Linker**: Links natural language queries to graph entities and paths
3. **CypherKGLinker**: Maps natural language queries to cypher queries
4. **Entity Linker**: Matches entities from text to graph nodes
5. **Triplet Retriever**: Retrieves relevant triplets from the graph
6. **Path Retriever**: Finds paths between entities in the graph
7. **Query Engine**: Orchestrates all components to answer questions

### Setup
If you haven't already, install the toolkit and its dependencies by following the instructions in [README.md](../../byokg-rag/README.md).
Let's validate that the package is correctly installed.

In [1]:
# !pip install https://github.com/awslabs/graphrag-toolkit/archive/refs/tags/v3.8.1.zip#subdirectory=byokg-rag

In [2]:
from graphrag_toolkit.byokg_rag.graphstore import NeptuneAnalyticsGraphStore

### Graph Store
The `NeptuneAnalyticsGraphStore` class provides an interface to work with the Neptune Analytics graph.
If you already have a Neptune Analytics graph endpoint you want to use, simply change the cell below to set `graph_identifier` to your Neptune Analytics graph ID.

If you don't already have a Neptune Analytics graph, you can create one by running the command below from an environment that has the AWS CLI configured with appropriate permissions. Please refer to the documentation for more details about [creating a graph](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/create-graph-using-console.html) and [loading data into the graph](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/batch-load.html).

```
aws neptune-graph create-graph --graph-name 'edgar-byokg' --provisioned-memory 128 --public-connectivity --replica-count 0 --vector-search-configuration '{"dimension": 384}'
```

After running the command, you should receive a response that includes the graph ID. Change the cell below to set `graph_identifier` to that ID.

To run the rest of the notebook, make sure the environment has the IAM permissions needed to interact with your Neptune Analytics graph endpoint. Specifically, you will need `neptune-graph:ReadDataViaQuery` and `neptune-graph:GetGraph`. If you are also using the example dataset, you will need S3 read permissions so that `graph_store.read_from_csv` can access data from `s3://aws-neptune-customer-samples-*/*`. If you are using your own dataset, you also need S3 write access so that `read_from_csv` can upload your CSV file to an S3 location you specify, from which Neptune Analytics ingests it.
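As a rough illustration, the read permissions named above could be written as an IAM policy document like the one built below. This is a minimal sketch under stated assumptions: the account ID `123456789012`, the graph ARN layout, and the specific S3 actions (`s3:GetObject`, `s3:ListBucket`) are placeholders chosen for illustration and should be checked against the IAM documentation for your setup. The helper `build_read_policy` is hypothetical and not part of the toolkit.

```python
import json

# Hypothetical placeholders: replace with your own values.
ACCOUNT_ID = "123456789012"
REGION = "us-east-1"
GRAPH_ID = "g-owod636z08"


def build_read_policy(account_id: str, region: str, graph_id: str) -> dict:
    """Build an IAM policy dict covering the read permissions described above."""
    # Assumed ARN layout for a Neptune Analytics graph.
    graph_arn = f"arn:aws:neptune-graph:{region}:{account_id}:graph/{graph_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["neptune-graph:ReadDataViaQuery", "neptune-graph:GetGraph"],
                "Resource": graph_arn,
            },
            {
                # Assumed S3 actions for reading the public sample dataset.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::aws-neptune-customer-samples-*",
                    "arn:aws:s3:::aws-neptune-customer-samples-*/*",
                ],
            },
        ],
    }


policy = build_read_policy(ACCOUNT_ID, REGION, GRAPH_ID)
print(json.dumps(policy, indent=4))
```

If you load your own dataset, the second statement would also need write access to the S3 location you choose.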

In the rest of the notebook, we:
1. Initialize the BYOKG graph store to use a Neptune Analytics graph
2. Optionally, load example data from a CSV file into a new graph
3. Get basic statistics about the graph and sample edges for a specific node
4. Run the BYOKG retrieval functions and QueryEngine on a sample question

In [3]:
region = "us-east-1"  # replace with your AWS region
graph_identifier = "g-owod636z08"  # replace with your graph ID

In [4]:
graph_store = NeptuneAnalyticsGraphStore(graph_identifier=graph_identifier,
                                         region=region)

#### Loading Data

If you ran the command to create a new graph, then uncomment the code cell below to load the new graph with some data. The data we are loading is a public dataset from the [SEC EDGAR system](https://www.sec.gov/search-filings) and contains information on stock holdings. 

See the [Neptune Analytics example notebook](https://github.com/aws/graph-notebook/blob/main/src/graph_notebook/notebooks/02-Neptune-Analytics/03-Sample-Use-Cases/02-Investment-Analysis/01-EDGAR-Competitor-Analysis-using-Knowledge-Graph-Graph-Algorithms-and-Vector-Search.ipynb) for more details.

In [5]:
# graph_store.read_from_csv(s3_path=f"s3://aws-neptune-customer-samples-{region}/sample-datasets/gremlin/edgar/")

In [6]:
# Print graph statistics
number_of_nodes = len(graph_store.nodes())
number_of_edges = len(graph_store.edges())
print(f"The graph has {number_of_nodes} nodes and {number_of_edges} edges.")

The graph has 43072 nodes and 11335002 edges.


In [7]:
# Print graph schema
import json

schema = graph_store.get_schema()
print(json.dumps(schema, indent=4))

[
    {
        "schema": {
            "edgeLabelDetails": {
                "owns": {
                    "properties": {
                        "value": {
                            "datatypes": [
                                "Float"
                            ]
                        },
                        "quantity": {
                            "datatypes": [
                                "Float"
                            ]
                        }
                    }
                },
                "has_holderquarter": {
                    "properties": {}
                }
            },
            "edgeLabels": [
                "has_holderquarter",
                "owns"
            ],
            "nodeLabels": [
                "Holding",
                "Holder",
                "HolderQuarter"
            ],
            "labelTriples": [
                {
                    "~type": "has_holderquarter",
                    "~from": "Holder",
      

In order to customize how we refer to nodes in the graph, we can tell the graphstore to assign a property as the text representation key for each node.


To see the properties available for each node, you can run
```
print(schema[0]["schema"]["nodeLabelDetails"])
```
and select the appropriate property for each node label.
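To make that lookup easier to read, the property names can be listed per node label. The sketch below runs on a small hand-written sample rather than the live schema: it assumes `nodeLabelDetails` mirrors the `edgeLabelDetails` layout printed above, and the `String` datatypes and property lists are illustrative guesses, not output from the graph. `properties_by_label` is a hypothetical helper.

```python
# Illustrative sample only: assumes nodeLabelDetails mirrors the
# edgeLabelDetails layout printed above. Property lists are hypothetical.
sample_schema = [
    {
        "schema": {
            "nodeLabelDetails": {
                "Holder": {
                    "properties": {
                        "name": {"datatypes": ["String"]},
                        "description": {"datatypes": ["String"]},
                    }
                },
                "Holding": {"properties": {"name": {"datatypes": ["String"]}}},
                "HolderQuarter": {"properties": {}},
            }
        }
    }
]


def properties_by_label(schema: list) -> dict:
    """Map each node label to a sorted list of its property names."""
    details = schema[0]["schema"].get("nodeLabelDetails", {})
    return {
        label: sorted(info.get("properties", {}))
        for label, info in details.items()
    }


for label, props in properties_by_label(sample_schema).items():
    print(f"{label}: {props if props else 'no properties listed'}")
```

Passing the real `schema` object from the cell above in place of `sample_schema` should work the same way, provided the layout assumption holds.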


Below we use the `name` property for the `Holder` and `Holding` nodes. We leave `HoldingQuarter` on the default `~id` property by assigning its representation as `None`, so that each holding quarter can be uniquely identified together with its Holder. This is optional: you only need to pass in the node labels that you want to refer to by a particular property.

In [8]:
graph_store.assign_text_repr_prop_for_nodes(Holder='name', Holding='name', HoldingQuarter=None)

Now we can ask for details of some nodes in the graph. For example, let's look up the following nodes:

* `"Miracle Mile Advisors, LLC"`
* `"Cranbrook Wealth Management, LLC"`


In [9]:
nodes_details = graph_store.get_nodes(["Miracle Mile Advisors, LLC", "Cranbrook Wealth Management, LLC"])
print(json.dumps(nodes_details, indent=4))

{
    "1599084": {
        "description": "Cranbrook Wealth Management is a registered investment advisor devoted to providing customized investment solutions and comprehensive financial planning for individuals and families. With a flexible mandate anchored to clients\u00d5 goals  Cranbrook constructs globally diversified portfolios optimized for tax efficiency. The firm provides holistic services including education planning and legacy strategies to reduce uncertainty. Cranbrook distinguishes itself through transparent fees  proactive communication  and a disciplined process designed to limit emotional investing. As an independent fiduciary  Cranbrook places client interests first to help secure financial futures with confidence.",
        "name": "Cranbrook Wealth Management, LLC"
    },
    "1585859": {
        "description": " Here is a 99-word description of Miracle Mile Advisors  LLC and their financial mandate:Miracle Mile Advisors is a registered investment advisory firm based

We can also take a look at the connections from `"Miracle Mile Advisors, LLC"`:

In [10]:
graph_store.get_one_hop_edges(["Miracle Mile Advisors, LLC"])

{'Miracle Mile Advisors, LLC': {'has_holderquarter': {('Miracle Mile Advisors, LLC',
    'has_holderquarter',
    '20231025_1585859')}}}

In [11]:
graph_store.get_one_hop_edges(["20231025_1585859"])

{'20231025_1585859': {'owns': {('20231025_1585859',
    'owns',
    '23ANDME HOLDING CO'),
   ('20231025_1585859', 'owns', '3-D SYS CORP DEL'),
   ('20231025_1585859', 'owns', '3M CO'),
   ('20231025_1585859', 'owns', 'ABBOTT LABS'),
   ('20231025_1585859', 'owns', 'ABBVIE INC'),
   ('20231025_1585859', 'owns', 'ACCENTURE PLC IRELAND'),
   ('20231025_1585859', 'owns', 'ACCOLADE INC'),
   ('20231025_1585859', 'owns', 'ACTIVISION BLIZZARD INC'),
   ('20231025_1585859', 'owns', 'ADC THERAPEUTICS SA'),
   ('20231025_1585859', 'owns', 'ADOBE INC'),
   ('20231025_1585859', 'owns', 'ADVANCED MICRO DEVICES INC'),
   ('20231025_1585859', 'owns', 'AFLAC INC'),
   ('20231025_1585859', 'owns', 'AIR PRODS & CHEMS INC'),
   ('20231025_1585859', 'owns', 'AIRBNB INC'),
   ('20231025_1585859', 'owns', 'AKERO THERAPEUTICS INC'),
   ('20231025_1585859', 'owns', 'ALBERTSONS COS INC'),
   ('20231025_1585859', 'owns', 'ALIBABA GROUP HLDG LTD'),
   ('20231025_1585859', 'owns', 'ALLSTATE CORP'),
   ('20231025

### Question Answering

We define a sample question to test our system. The question requires reasoning through multiple hops in the knowledge graph to find the answer.
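The multi-hop reasoning can be sketched on a toy in-memory graph. `toy_one_hop` below is a hypothetical stand-in that mimics the nested return shape of `get_one_hop_edges` seen earlier; it does not call Neptune Analytics, and the edges are a tiny subset copied from the outputs above for illustration.

```python
from collections import defaultdict

# Tiny subset of edges copied from the outputs above (illustration only).
TOY_EDGES = [
    ("Miracle Mile Advisors, LLC", "has_holderquarter", "20231025_1585859"),
    ("20231025_1585859", "owns", "3M CO"),
    ("20231025_1585859", "owns", "VANGUARD INDEX FDS"),
]


def toy_one_hop(nodes):
    """Return {node: {relation: {(head, relation, tail), ...}}} for outgoing edges."""
    result = {}
    for node in nodes:
        by_relation = defaultdict(set)
        for head, relation, tail in TOY_EDGES:
            if head == node:
                by_relation[relation].add((head, relation, tail))
        if by_relation:
            result[node] = dict(by_relation)
    return result


def follow(nodes, relation):
    """Follow one relation from each node and return the sorted tail nodes."""
    tails = set()
    for edges_by_relation in toy_one_hop(nodes).values():
        for _, _, tail in edges_by_relation.get(relation, set()):
            tails.add(tail)
    return sorted(tails)


# Hop 1: holder -> holder quarter. Hop 2: holder quarter -> holdings.
quarters = follow(["Miracle Mile Advisors, LLC"], "has_holderquarter")
holdings = follow(quarters, "owns")
matches = [name for name in holdings if "vanguard index" in name.lower()]
print(matches)
```

The question cannot be answered from the holder's direct neighbors alone, because holdings hang off the intermediate holder-quarter node.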

In [12]:
question = "Does Miracle Mile Advisors own any Vanguard Index funds"

In [13]:
# check answers
for triplet in graph_store.get_one_hop_edges(["20231025_1585859"])['20231025_1585859']['owns']:
    if "vanguard index" in triplet[2].lower():
        print(triplet)

('20231025_1585859', 'owns', 'VANGUARD INDEX FDS')


### Cypher KG Linker & Cypher Retrieval
The `CypherKGLinker` uses an LLM (Claude 3.5 Sonnet) to:
1. Generate openCypher queries for linking question entities to KG nodes
2. Generate openCypher queries for retrieving KG answers
3. Generate initial responses based on its knowledge

In [14]:
from graphrag_toolkit.byokg_rag.graph_connectors import CypherKGLinker
from graphrag_toolkit.byokg_rag.llm import BedrockGenerator

# Initialize llm
llm_generator = BedrockGenerator(
                model_name='us.anthropic.claude-3-5-sonnet-20240620-v1:0',
                region_name='us-west-2')

cypher_linker = CypherKGLinker(graph_store=graph_store, llm_generator=llm_generator)
response = cypher_linker.generate_response(
                question=question,
                schema=schema,
                graph_context="Not provided. Use the above schema to understand the graph."
            )
response
artifacts = cypher_linker.parse_response(response)
artifacts

{'opencypher-linking': ['MATCH (h:Holder), (holding:Holding)',
  "WHERE toLower(h.name) CONTAINS toLower('Miracle Mile Advisors')",
  "  AND toLower(holding.name) CONTAINS toLower('Vanguard')",
  "  AND toLower(holding.type) CONTAINS toLower('Index')",
  'RETURN ID(h), ID(holding)',
  'LIMIT 5'],
 'opencypher': ['MATCH (h:Holder)-[:has_holderquarter]->(hq:HolderQuarter)-[o:owns]->(holding:Holding)',
  "WHERE toLower(h.name) CONTAINS toLower('Miracle Mile Advisors')",
  "  AND toLower(holding.name) CONTAINS toLower('Vanguard')",
  "  AND toLower(holding.type) CONTAINS toLower('Index')",
  'RETURN h.name AS Holder, holding.name AS Fund, o.quantity AS Quantity, o.value AS Value'],
 'draft-answer-generation': []}

In [15]:
from graphrag_toolkit.byokg_rag.graph_retrievers import GraphQueryRetriever
graph_query_executor = GraphQueryRetriever(graph_store)
graph_query = " ".join(artifacts["opencypher"])
cypher_context, cypher_answers = graph_query_executor.retrieve(graph_query, return_answers=True)
cypher_answers

[]

### KG Linker
The `KGLinker` uses an LLM (Claude 3.5 Sonnet) to:
1. Extract entities from the question
2. Identify potential relationship paths in the graph
3. Generate initial responses based on its knowledge

In [16]:

from graphrag_toolkit.byokg_rag.graph_connectors import KGLinker
from graphrag_toolkit.byokg_rag.llm import BedrockGenerator



# Initialize llm
llm_generator = BedrockGenerator(
                model_name='us.anthropic.claude-3-5-sonnet-20240620-v1:0',
                region_name='us-west-2')

kg_linker = KGLinker(graph_store=graph_store, llm_generator=llm_generator)
response = kg_linker.generate_response(
                question=question,
                schema=schema,
                graph_context="Not provided. Use the above schema to understand the graph."
            )
response


'<entities>\nMiracle Mile Advisors\nVanguard Index funds\n</entities>\n\n<paths>\nHolder -has_holderquarter-> HolderQuarter -owns-> Holding\n</paths>\n\n<answers>\n</answers>\n\n<opencypher>\nMATCH (h:Holder {name: "Miracle Mile Advisors"})-[:has_holderquarter]->(hq:HolderQuarter)-[:owns]->(holding:Holding)\nWHERE holding.name CONTAINS "Vanguard" AND holding.type = "Index Fund"\nRETURN DISTINCT holding.name AS VanguardIndexFunds\n</opencypher>'

In [17]:
artifacts = kg_linker.parse_response(response)
artifacts

{'entity-extraction': ['Miracle Mile Advisors', 'Vanguard Index funds'],
 'path-extraction': ['Holder -has_holderquarter-> HolderQuarter -owns-> Holding'],
 'draft-answer-generation': [],
 'opencypher': ['MATCH (h:Holder {name: "Miracle Mile Advisors"})-[:has_holderquarter]->(hq:HolderQuarter)-[:owns]->(holding:Holding)',
  'WHERE holding.name CONTAINS "Vanguard" AND holding.type = "Index Fund"',
  'RETURN DISTINCT holding.name AS VanguardIndexFunds']}
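The step from the tagged LLM response to the artifacts dictionary can be approximated with a few lines of regular-expression parsing. This is only a sketch: the tag-to-key mapping is inferred from the two outputs above, `parse_tagged_response` is a hypothetical helper, and the library's own `parse_response` may be implemented differently.

```python
import re

# Tag-to-key mapping inferred from the outputs above (an assumption).
TAG_TO_KEY = {
    "entities": "entity-extraction",
    "paths": "path-extraction",
    "answers": "draft-answer-generation",
    "opencypher": "opencypher",
}


def parse_tagged_response(text: str) -> dict:
    """Extract the non-empty lines found inside each known <tag>...</tag> block."""
    artifacts = {}
    for tag, key in TAG_TO_KEY.items():
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        body = match.group(1) if match else ""
        artifacts[key] = [line.strip() for line in body.splitlines() if line.strip()]
    return artifacts


sample = (
    "<entities>\nMiracle Mile Advisors\nVanguard Index funds\n</entities>\n\n"
    "<paths>\nHolder -has_holderquarter-> HolderQuarter -owns-> Holding\n</paths>\n\n"
    "<answers>\n</answers>"
)
parsed = parse_tagged_response(sample)
print(parsed["entity-extraction"])
print(parsed["path-extraction"])
print(parsed["draft-answer-generation"])
```

Missing or empty blocks come back as empty lists, which matches the empty `draft-answer-generation` entry in the output above.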

### Entity Linking
The `EntityLinker` uses fuzzy string matching to:
1. Match extracted entities to actual nodes in the graph
2. Link potential answers to graph nodes
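To give a feel for fuzzy matching, the sketch below ranks a handful of node names against a query using `difflib.SequenceMatcher` from the Python standard library. This is a swapped-in technique for illustration: `FuzzyStringIndex` may use a different similarity measure, and `top_matches` is a hypothetical helper, not a toolkit function.

```python
from difflib import SequenceMatcher


def top_matches(query: str, candidates: list, k: int = 2) -> list:
    """Rank candidates by case-insensitive similarity to the query."""
    scored = [
        (SequenceMatcher(None, query.lower(), c.strip().lower()).ratio(), c)
        for c in candidates
    ]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:k]]


# A few node names taken from the outputs in this notebook.
node_names = [
    "Miracle Mile Advisors, LLC",
    "Professional Financial Advisors, LLC",
    "Scotts Miracle-Gro Company Class A",
    "VANGUARD INDEX FDS",
]
print(top_matches("Miracle Mile Advisors", node_names, k=1))
print(top_matches("Vanguard Index funds", node_names, k=1))
```

Because matching is approximate, a larger `k` also surfaces loosely related names, which is why unrelated advisors can appear among the linked entities.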

In [18]:
from graphrag_toolkit.byokg_rag.indexing import FuzzyStringIndex
from graphrag_toolkit.byokg_rag.graph_retrievers import EntityLinker

# Add graph nodes text for string matching
string_index = FuzzyStringIndex()
string_index.add(graph_store.nodes())
retriever = string_index.as_entity_matcher()
entity_linker = EntityLinker(retriever=retriever)

linked_entities = entity_linker.link(artifacts["entity-extraction"], return_dict=False)
linked_answers = entity_linker.link(artifacts["draft-answer-generation"], return_dict=False)
linked_entities, linked_answers

(['VANGUARD INDEX FUNDS',
  'Miracle Mile Advisors, LLC',
  'Vanguard Index Fds',
  'VANGUARD INDEX FDS          ',
  'Scotts Miracle-Gro Company Class A',
  'Professional Financial Advisors, LLC'],
 [])

### Triplet Retrieval
The `AgenticRetriever` uses an LLM to:
1. Navigate the graph starting from the linked entities
2. Select relevant relations based on the question
3. Expand those relations and decide which entities to explore next
4. Return the (head -> relation -> tail) triplets relevant to the question
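The loop described above can be sketched without an LLM by replacing the relation-selection step with a keyword stub. Everything here is an assumption made for illustration: `select_relations` is a hypothetical stand-in for the LLM call, `TOY_GRAPH` is a tiny hand-written graph, and the real `AgenticRetriever` may prune and order its results differently.

```python
# Toy graph: node -> relation -> list of tail nodes (illustration only).
TOY_GRAPH = {
    "Miracle Mile Advisors, LLC": {"has_holderquarter": ["20231025_1585859"]},
    "20231025_1585859": {"owns": ["3M CO", "VANGUARD INDEX FDS"]},
}


def select_relations(question: str, relations: list) -> list:
    """Hypothetical stand-in for the LLM: keep relations hinted at by the question."""
    hints = {"owns": ["own"], "has_holderquarter": ["own", "hold"]}
    lowered = question.lower()
    return [r for r in relations if any(h in lowered for h in hints.get(r, []))]


def retrieve_triplets(question: str, source_nodes: list, max_hops: int = 2) -> list:
    """Expand selected relations hop by hop and verbalize each expansion."""
    context, frontier, visited = [], list(source_nodes), set(source_nodes)
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            edges = TOY_GRAPH.get(node, {})
            for relation in select_relations(question, sorted(edges)):
                tails = edges[relation]
                context.append(f"{node} -> {relation} -> {' | '.join(tails)}")
                for tail in tails:
                    if tail not in visited:
                        visited.add(tail)
                        next_frontier.append(tail)
        frontier = next_frontier
    return context


question = "Does Miracle Mile Advisors own any Vanguard Index funds"
triplet_lines = retrieve_triplets(question, ["Miracle Mile Advisors, LLC"])
for line in triplet_lines:
    print(line)
```

The verbalized lines use the same `head -> relation -> tail | tail` layout as the retriever output shown in this notebook.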


In [19]:
from graphrag_toolkit.byokg_rag.graph_retrievers import AgenticRetriever
from graphrag_toolkit.byokg_rag.graph_retrievers import GTraversal, TripletGVerbalizer
graph_traversal = GTraversal(graph_store)
graph_verbalizer = TripletGVerbalizer()
triplet_retriever = AgenticRetriever(
    llm_generator=llm_generator, 
    graph_traversal=graph_traversal,
    graph_verbalizer=graph_verbalizer)

In [20]:
triplet_context = triplet_retriever.retrieve(query=question, source_nodes=linked_entities)
triplet_context

['Miracle Mile Advisors, LLC -> has_holderquarter -> 20231025_1585859',
 'Professional Financial Advisors, LLC -> has_holderquarter -> 20231025_1798221',
 '20231025_1585859 -> owns -> JOHNSON CTLS INTL PLC | D R HORTON INC | TESLA INC | BRISTOL-MYERS SQUIBB CO | MANCHESTER UTD PLC NEW | WABTEC | AUTOMATIC DATA PROCESSING IN | GENERAL ELECTRIC CO | FIRST MAJESTIC SILVER CORP | ALTIMMUNE INC | EXLSERVICE HOLDINGS INC | ACCENTURE PLC IRELAND | HSBC HLDGS PLC | NCR CORP NEW | DUPONT DE NEMOURS INC | FACTSET RESH SYS INC | BANK AMERICA CORP | ELEVANCE HEALTH INC | CARRIER GLOBAL CORPORATION | INVESCO QQQ TR | PROSHARES TR | AMGEN INC | FIRST TR EXCHANGE-TRADED FD | RETAIL OPPORTUNITY INVTS COR | AMETEK INC | EQUINIX INC | TEXAS INSTRS INC | XYLEM INC | CHEVRON CORP NEW | SHOCKWAVE MED INC | AFLAC INC | NETAPP INC | NATIONAL FUEL GAS CO | CISCO SYS INC | CLEARWATER PAPER CORP | BROADCOM INC | KLA CORP | ECOLAB INC | TEGNA INC | COPA HOLDINGS SA | CARDINAL HEALTH INC | QUALYS INC | SYSCO CORP

### Path Retrieval
The `PathRetriever` uses the identified metapaths and candidate answers to:
1. Retrieve actual paths in the graph following the metapath
2. Retrieve shortest paths connecting question entities and candidate answers (if any) 
3. Verbalize the paths for context
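Following a metapath can be illustrated on the same kind of toy graph. The sketch below extracts the relation names from a metapath string, walks them in order from a start node, and renders each resulting path as text. The helpers and the verbalized format are hypothetical choices for illustration; `PathRetriever` and `PathVerbalizer` may parse metapaths and format paths differently.

```python
import re

# Tiny subset of edges copied from the outputs above (illustration only).
TOY_EDGES = [
    ("Miracle Mile Advisors, LLC", "has_holderquarter", "20231025_1585859"),
    ("20231025_1585859", "owns", "3M CO"),
    ("20231025_1585859", "owns", "VANGUARD INDEX FDS"),
]


def relations_in_metapath(metapath: str) -> list:
    """Extract relation names from 'A -rel-> B -rel2-> C' style strings."""
    return re.findall(r"-(\w+)->", metapath)


def follow_metapath(start: str, relations: list) -> list:
    """Return every node path from `start` that follows the relations in order."""
    paths = [[start]]
    for relation in relations:
        paths = [
            path + [tail]
            for path in paths
            for head, rel, tail in TOY_EDGES
            if head == path[-1] and rel == relation
        ]
    return paths


def verbalize(path: list, relations: list) -> str:
    """Render a node path as 'n0 -> r1 -> n1 -> r2 -> n2'."""
    parts = [path[0]]
    for relation, node in zip(relations, path[1:]):
        parts += [relation, node]
    return " -> ".join(parts)


metapath = "Holder -has_holderquarter-> HolderQuarter -owns-> Holding"
relations = relations_in_metapath(metapath)
verbalized = [
    verbalize(path, relations)
    for path in follow_metapath("Miracle Mile Advisors, LLC", relations)
]
for line in verbalized:
    print(line)
```

A start node with no matching outgoing relation yields no paths, so an empty result usually means the metapath and the linked entities do not line up.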

In [21]:
from graphrag_toolkit.byokg_rag.graph_retrievers import PathRetriever
from graphrag_toolkit.byokg_rag.graph_retrievers import GTraversal, PathVerbalizer
graph_traversal = GTraversal(graph_store)
path_verbalizer = PathVerbalizer()
path_retriever = PathRetriever(
    graph_traversal=graph_traversal,
    path_verbalizer=path_verbalizer)

metapaths = [[component.strip() for component in path.split("->")] for path in artifacts["path-extraction"]]

path_context = path_retriever.retrieve(linked_entities, metapaths, linked_answers)
path_context

[]

Now let's try answering the question with the context retrieved by the various retrieval mechanisms.

First, we create a `ByoKGQueryEngine` instance, which can invoke an LLM and generate a response using the context we already retrieved from the graph.

In [22]:
from graphrag_toolkit.byokg_rag.byokg_query_engine import ByoKGQueryEngine

byokg_query_engine = ByoKGQueryEngine(
    graph_store=graph_store,
    kg_linker=kg_linker,
    cypher_kg_linker=cypher_linker,
    triplet_retriever=triplet_retriever,
    path_retriever=path_retriever,
    entity_linker=entity_linker
)

Next, we generate a response using the triplet context from graph traversal:

In [23]:
answers, response = byokg_query_engine.generate_response(question, "\n".join(triplet_context))

print("Generated answers: ", answers)
if "Yes" in "\n".join(answers) or "vanguard index" in "\n".join(answers).lower():
    print("Success! Ground-truth answer retrieved!")


Generated answers:  ['VANGUARD INDEX FDS', 'VANGUARD SPECIALIZED FUNDS', 'VANGUARD MALVERN FDS', 'VANGUARD INTL EQUITY INDEX F', 'VANGUARD ADMIRAL FDS INC', 'VANGUARD MUN BD FDS', 'VANGUARD WORLD FDS', 'VANGUARD STAR FDS', 'VANGUARD BD INDEX FDS', 'VANGUARD TAX-MANAGED FDS', 'VANGUARD SCOTTSDALE FDS', 'VANGUARD WHITEHALL FDS']
Success! Ground-truth answer retrieved!


### BYOKG RAG Pipeline

We can also use the `ByoKGQueryEngine` to combine all of these steps into a single pipeline that will:
1. Process natural language questions
2. Retrieve relevant context from the graph
3. Generate answers based on the retrieved information

In [25]:
from graphrag_toolkit.byokg_rag.byokg_query_engine import ByoKGQueryEngine

byokg_query_engine = ByoKGQueryEngine(
    graph_store=graph_store,
    kg_linker=kg_linker,
    cypher_kg_linker=None, # deactivate cypher linker and rely on entity linker
    triplet_retriever=triplet_retriever,
    path_retriever=path_retriever,
    entity_linker=entity_linker
)

iterations = 1
retrieved_context = byokg_query_engine.query(question, iterations=iterations)
answers, response = byokg_query_engine.generate_response(question, "\n".join(retrieved_context))

print(retrieved_context)
print(answers)
print(response)

['Miracle Mile Advisors, LLC -> has_holderquarter -> 20231025_1585859', 'Professional Financial Advisors, LLC -> has_holderquarter -> 20231025_1798221', '20231025_1585859 -> owns -> JOHNSON CTLS INTL PLC | D R HORTON INC | TESLA INC | BRISTOL-MYERS SQUIBB CO | MANCHESTER UTD PLC NEW | WABTEC | AUTOMATIC DATA PROCESSING IN | GENERAL ELECTRIC CO | FIRST MAJESTIC SILVER CORP | ALTIMMUNE INC | EXLSERVICE HOLDINGS INC | ACCENTURE PLC IRELAND | HSBC HLDGS PLC | NCR CORP NEW | DUPONT DE NEMOURS INC | FACTSET RESH SYS INC | BANK AMERICA CORP | ELEVANCE HEALTH INC | CARRIER GLOBAL CORPORATION | INVESCO QQQ TR | PROSHARES TR | AMGEN INC | FIRST TR EXCHANGE-TRADED FD | RETAIL OPPORTUNITY INVTS COR | AMETEK INC | EQUINIX INC | TEXAS INSTRS INC | XYLEM INC | CHEVRON CORP NEW | SHOCKWAVE MED INC | AFLAC INC | NETAPP INC | NATIONAL FUEL GAS CO | CISCO SYS INC | CLEARWATER PAPER CORP | BROADCOM INC | KLA CORP | ECOLAB INC | TEGNA INC | COPA HOLDINGS SA | CARDINAL HEALTH INC | QUALYS INC | SYSCO CORP |