# BYOKG RAG Demo
This notebook demonstrates a RAG (Retrieval Augmented Generation) system built on top of a Knowledge Graph. The system lets you query a knowledge graph with natural language questions and retrieves relevant information from the graph to generate answers.

The system is made up of the following components:

1. **Graph Store**: Neptune Analytics endpoint for the graph structure
2. **KG Linker**: Links natural language queries to graph entities and paths
3. **Entity Linker**: Matches entities from text to graph nodes
4. **Triplet Retriever**: Retrieves relevant triplets from the graph
5. **Path Retriever**: Finds paths between entities in the graph
6. **Query Engine**: Orchestrates all components to answer questions

#### Setup
If you haven't already, install the toolkit and its dependencies by following the instructions in [README.md](../../byokg-rag/README.md).
Let's check that the package is installed correctly.

In [None]:
# !pip install https://github.com/awslabs/graphrag-toolkit/archive/refs/tags/v3.8.1.zip#subdirectory=byokg-rag

In [None]:
from graphrag_toolkit.byokg_rag.graphstore import NeptuneAnalyticsGraphStore

### Graph Store
The `NeptuneAnalyticsGraphStore` class provides an interface to work with the Neptune Analytics graph.
If you already have a Neptune Analytics graph you want to use, set `graph_identifier` in the cell below to its graph id.

If you don't already have a Neptune Analytics graph, you can create one by running the command below from an environment that has the AWS CLI configured with appropriate permissions. Refer to the documentation for more details about [creating a graph](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/create-graph-using-console.html) and [loading data into the graph](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/batch-load.html).

```
aws neptune-graph create-graph --graph-name 'edgar-byokg' --provisioned-memory 128 --public-connectivity --replica-count 0 --vector-search-configuration '{"dimension": 384}'
```

After running the command, you should receive a response that includes the graph id. Set `graph_identifier` in the cell below to that id.
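
As a small sketch, the graph id can be read out of the JSON that the CLI prints. The sample response below is made up for illustration (the id and status values are placeholders, and a real response contains more fields); the only assumption is that the response carries an `id` field. `extract_graph_id` is a hypothetical helper, not part of the toolkit.

```python
import json

# Hypothetical, abbreviated create-graph response; the values are placeholders.
sample_response = """
{
    "id": "g-0123456789",
    "name": "edgar-byokg",
    "status": "CREATING"
}
"""


def extract_graph_id(cli_output: str) -> str:
    """Return the graph id from the JSON printed by the create-graph command."""
    payload = json.loads(cli_output)
    graph_id = payload.get("id")
    if not graph_id:
        raise ValueError("No 'id' field found in the create-graph response")
    return graph_id


graph_identifier = extract_graph_id(sample_response)
print(graph_identifier)
```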

To run the rest of the notebook, make sure the environment has the IAM permissions needed to interact with your Neptune Analytics graph endpoint. Specifically, you will need `neptune-graph:ReadDataViaQuery` and `neptune-graph:GetGraph`. If you are also using the example dataset, you will need S3 read permissions so that `graph_store.read_from_csv` can access data from `s3://aws-neptune-customer-samples-*/*`. If you are using your own dataset, you also need S3 write access so that `read_from_csv` can upload your CSV file to an S3 location you specify, from which Neptune Analytics ingests it.
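
The permissions above could be written as an IAM policy along the following lines. This is a sketch under assumptions, not a vetted policy: the graph ARN layout, the choice of `s3:GetObject` and `s3:ListBucket` for read access, and the account id `123456789012` are illustrative and should be checked against the IAM documentation for your account. `build_policy` is a hypothetical helper.

```python
import json


def build_policy(region: str, account_id: str, graph_id: str) -> dict:
    """Build a policy document covering graph queries and sample-data reads."""
    # Assumed ARN layout for a Neptune Analytics graph.
    graph_arn = f"arn:aws:neptune-graph:{region}:{account_id}:graph/{graph_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "QueryGraph",
                "Effect": "Allow",
                "Action": [
                    "neptune-graph:ReadDataViaQuery",
                    "neptune-graph:GetGraph",
                ],
                "Resource": graph_arn,
            },
            {
                # Read access to the public sample dataset buckets.
                "Sid": "ReadSampleData",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::aws-neptune-customer-samples-*",
                    "arn:aws:s3:::aws-neptune-customer-samples-*/*",
                ],
            },
        ],
    }


# Placeholder account and graph ids, for illustration only.
policy = build_policy("us-east-1", "123456789012", "g-0123456789")
print(json.dumps(policy, indent=4))
```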

In the rest of the notebook, we:
1. Initialize the BYOKG graph store to use a Neptune Analytics Graph
2. Optionally, load example data from a CSV file into a new graph
3. Get basic statistics about the graph and sample edges for a specific node
4. Run the BYOKG retrieval functions and QueryEngine on a sample question

In [None]:
region = "us-east-1"  # replace with your AWS region
graph_identifier = "<>"  # replace with your graph id

In [None]:
graph_store = NeptuneAnalyticsGraphStore(graph_identifier=graph_identifier,
                                         region=region)

#### Loading Data

If you ran the command to create a new graph, then uncomment the code cell below to load the new graph with some data. The data we are loading is a public dataset from the [SEC EDGAR system](https://www.sec.gov/search-filings) and contains information on stock holdings. 

See the [Neptune Analytics example notebook](https://github.com/aws/graph-notebook/blob/main/src/graph_notebook/notebooks/02-Neptune-Analytics/03-Sample-Use-Cases/02-Investment-Analysis/01-EDGAR-Competitor-Analysis-using-Knowledge-Graph-Graph-Algorithms-and-Vector-Search.ipynb) for more details.

In [None]:
# graph_store.read_from_csv(s3_path=f"s3://aws-neptune-customer-samples-{region}/sample-datasets/gremlin/edgar/")

In [None]:
# Print graph statistics
number_of_nodes = len(graph_store.nodes())
number_of_edges = len(graph_store.edges())
print(f"The graph has {number_of_nodes} nodes and {number_of_edges} edges.")

In [None]:
# Print graph schema
import json

schema = graph_store.get_schema()
print(json.dumps(schema, indent=4))

To customize how we refer to nodes in the graph, we can tell the graph store which property to use as the text representation of each node label.


To see the properties available for each node, you can run
```
print(schema[0]["schema"]["nodeLabelDetails"])
```
and pick the property that best identifies the nodes of each label.
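
To compare candidates side by side, the properties can be grouped by node label. The sketch below runs on a hand-written schema fragment: its nesting under `nodeLabelDetails` (label, then `properties`, then property name) and the property names themselves are assumptions for illustration, and the output of `get_schema()` on your graph may be laid out differently. `properties_by_label` is a hypothetical helper.

```python
from typing import Dict, List

# Hypothetical schema fragment; the real get_schema() output may differ.
sample_schema = [
    {
        "schema": {
            "nodeLabelDetails": {
                "Holder": {"properties": {"name": {"datatypes": ["String"]}}},
                "Holding": {"properties": {"name": {"datatypes": ["String"]}}},
                "HoldingQuarter": {"properties": {"name": {"datatypes": ["String"]}}},
            }
        }
    }
]


def properties_by_label(schema: list) -> Dict[str, List[str]]:
    """Map each node label to the sorted list of its property names."""
    details = schema[0]["schema"]["nodeLabelDetails"]
    return {
        label: sorted(info.get("properties", {}))
        for label, info in details.items()
    }


for label, props in properties_by_label(sample_schema).items():
    print(f"{label}: {props}")
```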


Below we use the `name` property for `Holder` and `Holding` nodes. For `HoldingQuarter` nodes we keep the default `~id` property by setting the text representation to `None`, so that each holding quarter can be uniquely identified together with its Holder. Passing every label is optional: you only need to include the node labels that you want to refer to by a particular property.
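
The effect of this mapping can be pictured with a toy example. This is only a sketch of the idea, not the toolkit's implementation: the node ids and the `2023Q4` value are invented, and `text_repr` is a hypothetical helper that falls back to the node id whenever a label has no property assigned.

```python
from typing import Dict, Optional

# Toy nodes keyed by id; ids and the quarter name are invented.
nodes = {
    "holder-1": {"label": "Holder", "properties": {"name": "Miracle Mile Advisors, LLC"}},
    "quarter-1": {"label": "HoldingQuarter", "properties": {"name": "2023Q4"}},
}


def text_repr(node_id: str, text_repr_props: Dict[str, Optional[str]]) -> str:
    """Return the string used to refer to a node under the given mapping."""
    node = nodes[node_id]
    prop = text_repr_props.get(node["label"])
    if prop is None:
        # No property assigned for this label: fall back to the node id.
        return node_id
    return str(node["properties"].get(prop, node_id))


mapping = {"Holder": "name", "HoldingQuarter": None}
print(text_repr("holder-1", mapping))
print(text_repr("quarter-1", mapping))
```

With this mapping, the holder is referred to by its name, while the holding quarter keeps its id.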

In [None]:
graph_store.assign_text_repr_prop_for_nodes(Holder='name', Holding='name', HoldingQuarter=None)

Now we can ask for the details of some nodes in the graph. For example, let's look up the following nodes:

* `"Miracle Mile Advisors, LLC"`
* `"Cranbrook Wealth Management, LLC"`


In [None]:
nodes_details = graph_store.get_nodes(["Miracle Mile Advisors, LLC", "Cranbrook Wealth Management, LLC"])
print(json.dumps(nodes_details, indent=4))

We can also take a look at the connections from `"Miracle Mile Advisors, LLC"`:

In [None]:
graph_store.get_one_hop_edges(["Miracle Mile Advisors, LLC"])

In [None]:
graph_store.get_one_hop_edges(["20231025_1585859"])

### Question Answering

We define a sample question to test our system. The question requires reasoning through multiple hops in the knowledge graph to find the answer.

In [None]:
question = "Does Miracle Mile Advisors own any Vanguard Index funds"

### KG Linker
The `KGLinker` uses an LLM (Claude 3.5 Sonnet) to:
1. Extract entities from the question
2. Identify potential relationship paths in the graph
3. Generate initial responses based on its knowledge
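
To make the three outputs concrete, here is a sketch of splitting a tagged LLM response into per-task lists. The tag layout is an assumption for illustration: the toolkit's actual prompt and response format are not shown in this notebook, the relation names in the sample are invented, and `parse_tagged_response` is a hypothetical stand-in for `parse_response`. Only the three task names are taken from the artifact keys used later in the notebook.

```python
import re
from typing import Dict, List

TASKS = ["entity-extraction", "path-extraction", "draft-answer-generation"]

# Invented sample response using one tag per task.
sample_response = """
<entity-extraction>
Miracle Mile Advisors
Vanguard
</entity-extraction>
<path-extraction>
has_filing -> holds
</path-extraction>
<draft-answer-generation>
Yes
</draft-answer-generation>
"""


def parse_tagged_response(text: str) -> Dict[str, List[str]]:
    """Collect the non-empty lines found inside each task's tag."""
    artifacts = {}
    for task in TASKS:
        tag = re.escape(task)
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        body = match.group(1) if match else ""
        artifacts[task] = [line.strip() for line in body.splitlines() if line.strip()]
    return artifacts


parsed = parse_tagged_response(sample_response)
print(parsed)
```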

In [None]:

from graphrag_toolkit.byokg_rag.graph_connectors import KGLinker
from graphrag_toolkit.byokg_rag.llm import BedrockGenerator

# Initialize the LLM
llm_generator = BedrockGenerator(
    model_name='us.anthropic.claude-3-5-sonnet-20240620-v1:0',
    region_name='us-west-2'
)

kg_linker = KGLinker(graph_store=graph_store, llm_generator=llm_generator)
response = kg_linker.generate_response(
    question=question,
    schema=schema,
    graph_context="Not provided. Use the above schema to understand the graph."
)
response


In [None]:
artifacts = kg_linker.parse_response(response)
artifacts

### Entity Linking
The `EntityLinker` uses fuzzy string matching to:
1. Match extracted entities to actual nodes in the graph
2. Link potential answers to graph nodes
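
The idea behind fuzzy matching can be sketched with the standard library. This is an illustration only: `FuzzyStringIndex` may use a different scoring method, and `link_entity` is a hypothetical helper that ranks node names by `difflib.SequenceMatcher` similarity to the mention.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def link_entity(mention: str, node_names: List[str], top_k: int = 1) -> List[Tuple[str, float]]:
    """Rank node names by case-insensitive similarity to the mention."""
    scored = [
        (name, SequenceMatcher(None, mention.lower(), name.lower()).ratio())
        for name in node_names
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


node_names = ["Miracle Mile Advisors, LLC", "Cranbrook Wealth Management, LLC"]
best_name, best_score = link_entity("Miracle Mile Advisors", node_names)[0]
print(best_name, round(best_score, 2))
```

The mention in the question omits the `, LLC` suffix, so an exact lookup would miss the node, while a similarity score still ranks it first.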

In [None]:
from graphrag_toolkit.byokg_rag.indexing import FuzzyStringIndex
from graphrag_toolkit.byokg_rag.graph_retrievers import EntityLinker

# Add graph nodes text for string matching
string_index = FuzzyStringIndex()
string_index.add(graph_store.nodes())
retriever = string_index.as_entity_matcher()
entity_linker = EntityLinker(retriever=retriever)

linked_entities = entity_linker.link(artifacts["entity-extraction"], return_dict=False)
linked_answers = entity_linker.link(artifacts["draft-answer-generation"], return_dict=False)
linked_entities, linked_answers

### Triplet Retrieval
The `AgenticRetriever` uses an LLM to:
1. Navigate the graph starting from linked entities
2. Select relevant relations based on the question
3. Expand those relations and decide which relevant entities to explore next
4. Return the relevant (head->relation->tail) triplets for the question
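
The loop described above can be sketched on a toy graph. This is not the `AgenticRetriever` implementation: the graph, node names and relation names are invented, and the LLM is replaced by `allow_list_selector`, a hand-written stand-in that keeps relations from a fixed allow list.

```python
from typing import Callable, List, Set, Tuple

Triplet = Tuple[str, str, str]

# Toy graph as (head, relation, tail) triplets; all names are invented.
EDGES: List[Triplet] = [
    ("Fund Manager A", "has_filing", "filing-1"),
    ("filing-1", "holds", "Index Fund X"),
    ("filing-1", "filed_in", "2023Q4"),
    ("Fund Manager A", "located_in", "California"),
]


def allow_list_selector(question: str, relations: Set[str]) -> Set[str]:
    """Stand-in for the LLM: keep only relations on a fixed allow list."""
    return relations & {"has_filing", "holds"}


def iterative_retrieve(
    question: str,
    source_nodes: List[str],
    select_relations: Callable[[str, Set[str]], Set[str]],
    max_hops: int = 2,
) -> List[Triplet]:
    """Expand selected relations hop by hop and collect the triplets."""
    frontier = set(source_nodes)
    visited = set(source_nodes)
    collected: List[Triplet] = []
    for _ in range(max_hops):
        candidate_edges = [edge for edge in EDGES if edge[0] in frontier]
        relations = {rel for _, rel, _ in candidate_edges}
        chosen = select_relations(question, relations)
        next_frontier = set()
        for head, rel, tail in candidate_edges:
            if rel in chosen:
                collected.append((head, rel, tail))
                if tail not in visited:
                    visited.add(tail)
                    next_frontier.add(tail)
        if not next_frontier:
            break
        frontier = next_frontier
    return collected


triplets = iterative_retrieve(
    "Does Fund Manager A own any index funds?",
    ["Fund Manager A"],
    allow_list_selector,
)
for head, rel, tail in triplets:
    print(f"{head} -> {rel} -> {tail}")
```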


In [None]:
from graphrag_toolkit.byokg_rag.graph_retrievers import AgenticRetriever
from graphrag_toolkit.byokg_rag.graph_retrievers import GTraversal, TripletGVerbalizer
graph_traversal = GTraversal(graph_store)
graph_verbalizer = TripletGVerbalizer()
triplet_retriever = AgenticRetriever(
    llm_generator=llm_generator, 
    graph_traversal=graph_traversal,
    graph_verbalizer=graph_verbalizer)

In [None]:
triplet_context = triplet_retriever.retrieve(query=question, source_nodes=linked_entities)
triplet_context

### Path Retrieval
The `PathRetriever` uses the identified metapaths and candidate answers to:
1. Retrieve actual paths in the graph following the metapath
2. Retrieve shortest paths connecting question entities and candidate answers (if any) 
3. Verbalize the paths for context
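
Following a metapath can be sketched on a toy graph as below. The sketch assumes a metapath is a sequence of relation names, which is an interpretation for illustration rather than the toolkit's definition; the graph contents are invented, and `follow_metapath` and `verbalize` are hypothetical helpers. Shortest-path retrieval is not covered here.

```python
from typing import Dict, List, Tuple

# Toy adjacency list: node -> list of (relation, neighbor); names are invented.
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "Fund Manager A": [("has_filing", "filing-1"), ("has_filing", "filing-2")],
    "filing-1": [("holds", "Index Fund X")],
    "filing-2": [("holds", "Bond Fund Y")],
}


def follow_metapath(start: str, metapath: List[str]) -> List[List[str]]:
    """Return every path from start whose relations match the metapath in order."""
    paths = [[start]]
    for relation in metapath:
        extended = []
        for path in paths:
            for rel, neighbor in GRAPH.get(path[-1], []):
                if rel == relation:
                    extended.append(path + [rel, neighbor])
        paths = extended
    return paths


def verbalize(path: List[str]) -> str:
    """Render a path as a single line of text for the LLM context."""
    return " -> ".join(path)


matched = follow_metapath("Fund Manager A", ["has_filing", "holds"])
for path in matched:
    print(verbalize(path))
```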

In [None]:
from graphrag_toolkit.byokg_rag.graph_retrievers import PathRetriever
from graphrag_toolkit.byokg_rag.graph_retrievers import GTraversal, PathVerbalizer
graph_traversal = GTraversal(graph_store)
path_verbalizer = PathVerbalizer()
path_retriever = PathRetriever(
    graph_traversal=graph_traversal,
    path_verbalizer=path_verbalizer)

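# Each extracted metapath is a "->"-separated string; split it into its components
metapaths = [[component.strip() for component in path.split("->")] for path in artifacts["path-extraction"]]
# Also add the one- and two-component prefixes of longer metapaths as extra candidates
shortened_paths = []
for path in metapaths:
    if len(path) > 1:
        shortened_paths.append(path[:1])
for path in metapaths:
    if len(path) > 2:
        shortened_paths.append(path[:2])
metapaths += shortened_paths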
path_context = path_retriever.retrieve(linked_entities, metapaths, linked_answers)
path_context

Let's try answering the question now with the retrieved context from various retrieval mechanisms.

First, we create a `ByoKGQueryEngine` instance, which can invoke an LLM and generate a response from the context we have already retrieved from the graph.

In [None]:
from graphrag_toolkit.byokg_rag.byokg_query_engine import ByoKGQueryEngine

byokg_query_engine = ByoKGQueryEngine(
    graph_store=graph_store,
    kg_linker=kg_linker,
    triplet_retriever=triplet_retriever,
    path_retriever=path_retriever,
    entity_linker=entity_linker
)

Next, we generate a response using the triplet context from graph traversal.

In [None]:
answers, response = byokg_query_engine.generate_response(question, "\n".join(triplet_context))

print("Generated answers: ", answers)
if "Yes" in "\n".join(answers):
    print("Success! Ground-truth answer retrieved!")
else:
    print("Failure..")


Now we generate a response using the path context from the path retriever.

In [None]:
answers, response = byokg_query_engine.generate_response(question, "\n".join(path_context))

print("Generated answers: ", answers)
if "Yes" in "\n".join(answers):
    print("Success! Ground-truth answer retrieved!")
else:
    print("Failure..")

### BYOKG RAG Pipeline

We can also use the `ByoKGQueryEngine` to run the whole pipeline end to end:
1. Process natural language questions
2. Retrieve relevant context from the graph
3. Generate answers based on the retrieved information
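
When context comes from more than one retriever, the same fact can show up twice. The sketch below shows one simple way to merge context lists while dropping duplicates and keeping the original order. It is an illustration only: how `ByoKGQueryEngine.query` combines context internally is not shown in this notebook, the context lines are invented, and `merge_context` is a hypothetical helper.

```python
from typing import Iterable, List


def merge_context(*context_lists: Iterable[str]) -> List[str]:
    """Concatenate context lists, keeping the first occurrence of each line."""
    seen = set()
    merged = []
    for context in context_lists:
        for line in context:
            if line not in seen:
                seen.add(line)
                merged.append(line)
    return merged


# Invented context lines standing in for retriever output.
triplet_lines = ["A -> holds -> X", "A -> holds -> Y"]
path_lines = ["A -> holds -> X", "A -> has_filing -> F -> holds -> Z"]

combined = merge_context(triplet_lines, path_lines)
prompt_context = "\n".join(combined)
print(prompt_context)
```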

In [None]:
from graphrag_toolkit.byokg_rag.byokg_query_engine import ByoKGQueryEngine

byokg_query_engine = ByoKGQueryEngine(
    graph_store=graph_store,
    kg_linker=kg_linker,
    triplet_retriever=triplet_retriever,
    path_retriever=path_retriever,
    entity_linker=entity_linker
)

retrieved_context = byokg_query_engine.query(question)
answers, response = byokg_query_engine.generate_response(question, "\n".join(retrieved_context))

print(answers)
print(response)