# 1a - Customising Extraction

## Prerequisites

Complete <a href="../../../../nbclassic/notebooks/graphrag-toolkit/1-Indexing.ipynb"><b>Exercise 1 - Indexing</b></a> before beginning these additional exercises.


## Overview

You are now going to add some JSON data to your graph using custom prompts and a list of preferred entity classifications.

The JSON data to be added to the graph represents the results of calls to the Amazon Neptune `describe-db-instances` and Amazon EC2 `describe-security-groups` management API methods.

### 🔍 1a.1 Review the JSON data

Run the cells below to view the JSON data to be indexed:

In [None]:
%pycat source-data/neptune/db.json

In [None]:
%pycat source-data/neptune/sg.json

### 🔍 1a.2 Review the custom prompts

You can customize the extraction process using custom prompts and a list of preferred entity classifications.

Run the cells below to view the custom prompts for extracting a) propositions, and b) topics, statements and facts from the JSON source documents:

In [None]:
%pycat prompts/extract-propositions-json.txt

In [None]:
%pycat prompts/extract-topics-json.txt

### 🎯 1a.3 Extract and build from JSON documents

The following cell combines the Extract and Build stages into a single operation: `extract_and_build()`. 

The Extract stage uses the custom prompts discussed above, further parameterized with a list of preferred entity classifications. These entity classifications help guide the LLM to label entities (e.g. database instances, endpoints and VPCs) in a consistent manner.
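
As a rough illustration of what "parameterized" means here, the sketch below renders a list of classifications into a prompt template using plain string formatting. The template text, the `render_prompt` helper, the sample input and the `{preferred_entity_classifications}` placeholder name are all assumptions made for this example, not the toolkit's implementation; check the prompt files above for the placeholders they actually use.

```python
# Hypothetical template and placeholder names, for illustration only.
TEMPLATE = (
    "Extract entities from the text below.\n"
    "Prefer these entity classifications where they fit:\n"
    "{preferred_entity_classifications}\n"
    "\n"
    "Text:\n"
    "{text}\n"
)

def render_prompt(template, classifications, text):
    """Fill the template, listing one classification per line."""
    return template.format(
        preferred_entity_classifications="\n".join(classifications),
        text=text,
    )

# Made-up input record; braces in the value are safe because only
# the template itself is parsed for placeholders.
prompt = render_prompt(
    TEMPLATE,
    ['DBInstance', 'Endpoint', 'VPC'],
    '{"DBInstanceClass": "db.example.large"}',
)
print(prompt)
```

Supplying the same list on every call is what nudges the model towards consistent labels across documents.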

Run the code below to extract data from the two JSON files and build the graph and vector stores:

In [None]:
%reload_ext dotenv
%dotenv

import os

from graphrag_toolkit.lexical_graph import LexicalGraphIndex
from graphrag_toolkit.lexical_graph import IndexingConfig, ExtractionConfig, BuildConfig
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from graphrag_toolkit.lexical_graph.indexing.load import JSONArrayReader
from graphrag_toolkit.lexical_graph.indexing.build import Checkpoint
from graphrag_toolkit.lexical_graph.utils.io_utils import read_text

def get_metadata(data):
    metadata = {}
    if 'GroupId' in data:
        metadata['GroupId'] = f"GroupId: {data.get('GroupId', '')}"
    if 'DBInstanceIdentifier' in data:
        metadata['DBInstanceIdentifier'] = f"DBInstanceIdentifier: {data.get('DBInstanceIdentifier', '')}"
    return metadata

with (
    GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
    VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE'], index_names=['chunk']) as vector_store
):

    config = IndexingConfig(
        chunking=None,  # skip chunking so each JSON record is processed whole
        extraction=ExtractionConfig(
            extract_propositions_prompt_template=read_text('./prompts/extract-propositions-json.txt'),
            extract_topics_prompt_template=read_text('./prompts/extract-topics-json.txt'),
            preferred_entity_classifications=[
                'DBInstance',
                'DBClusterIdentifier',
                'DBInstanceClass',
                'Endpoint',
                'SecurityGroup',
                'DBSubnetGroup',
                'VPC',
                'Subnet',
                'SubnetAvailabilityZone',
                'IPPermissionsEgress',
                'IPPermissions'
            ]
        )
    )
    
    checkpoint = Checkpoint('1-extract-build')

    graph_index = LexicalGraphIndex(
        graph_store, 
        vector_store,
        indexing_config=config
    )

    reader = JSONArrayReader(metadata_fn=get_metadata)
    
    graph_index.extract_and_build(
        nodes=reader.load_data('./source-data/neptune/db.json'), 
        show_progress=True,
        checkpoint=checkpoint
    )
    
    graph_index.extract_and_build(
        nodes=reader.load_data('./source-data/neptune/sg.json'), 
        show_progress=True,
        checkpoint=checkpoint
    )

print('Complete')
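
To see what `get_metadata` contributes, the sketch below applies the same function to two hypothetical records shaped like entries from the two files. It uses only the standard `json` module and assumes that each source file holds a JSON array whose elements are handled as separate documents; the identifiers and values are made up.

```python
import json

def get_metadata(data):
    # Same logic as the function in the cell above.
    metadata = {}
    if 'GroupId' in data:
        metadata['GroupId'] = f"GroupId: {data.get('GroupId', '')}"
    if 'DBInstanceIdentifier' in data:
        metadata['DBInstanceIdentifier'] = f"DBInstanceIdentifier: {data.get('DBInstanceIdentifier', '')}"
    return metadata

# Hypothetical stand-in for the contents of a source file.
raw = (
    '[{"DBInstanceIdentifier": "example-instance-1"},'
    ' {"GroupId": "sg-example", "GroupName": "default"}]'
)

# One metadata dictionary per array element.
records = json.loads(raw)
metadata_per_record = [get_metadata(record) for record in records]

for metadata in metadata_per_record:
    print(metadata)
```

The first record yields `{'DBInstanceIdentifier': 'DBInstanceIdentifier: example-instance-1'}` and the second `{'GroupId': 'GroupId: sg-example'}`. Keys that the function does not look for, such as `GroupName`, are left out of the metadata.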

### 🎯 1a.4 Visualise the extracted data

Once again, you can view the extracted data. The code below supplies some filter criteria to the visualisation so that only the newly extracted data is displayed:

In [None]:
NB_CLASSIC = True

from graphrag_toolkit.lexical_graph.visualisation import GraphNotebookVisualisation

v = GraphNotebookVisualisation(nb_classic=NB_CLASSIC)

source_filter = [
    {'filename': 'db.json'},
    {'filename': 'sg.json'}
]

v.display_sources(filter=source_filter)

You can also view _all_ of the entities that have been extracted so far. Notice how the database and security group entities conform to the list of preferred entity classifications supplied during the extract operation.

In [None]:
v.display_entities()

### 🎯 1a.5 Visualise the inferred schema

Besides creating nodes that represent sources, chunks, topics, statements, facts and entities, the GraphRAG Toolkit also creates schema nodes that represent the inferred domain semantics at the entity-relationship tier (the lowest tier of the hierarchical lexical graph structure). You can view this schema by running the following cell:

In [None]:
v.display_schema()

This ability to create an inferred schema will become important in the third notebook **03 - Agentic Use Cases**, when you create domain-specific tools for use by an AI agent.
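
One way to picture an inferred schema is as the set of class-level patterns that remain once individual entities are abstracted away. The sketch below is a conceptual illustration only, not the toolkit's implementation: the facts, the relationship names and the `infer_schema` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical entity-relationship facts:
# (subject, subject class, predicate, object, object class)
facts = [
    ('instance-a', 'DBInstance', 'HAS_ENDPOINT', 'endpoint-a', 'Endpoint'),
    ('instance-b', 'DBInstance', 'HAS_ENDPOINT', 'endpoint-b', 'Endpoint'),
    ('instance-a', 'DBInstance', 'MEMBER_OF', 'sg-example', 'SecurityGroup'),
    ('sg-example', 'SecurityGroup', 'BELONGS_TO', 'vpc-example', 'VPC'),
]

def infer_schema(facts):
    """Collapse instance-level facts into counts of
    (subject class, predicate, object class) patterns."""
    schema = defaultdict(int)
    for _, subject_class, predicate, _, object_class in facts:
        schema[(subject_class, predicate, object_class)] += 1
    return dict(schema)

schema = infer_schema(facts)
for (subject_class, predicate, object_class), count in sorted(schema.items()):
    print(f"({subject_class})-[{predicate}]->({object_class}) x{count}")
```

Four instance-level facts collapse into three class-level patterns here. A compact summary of this kind is what makes a schema useful for describing a domain to a tool or an agent.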

### 🎯 1a.6 Query across the data

You're now in a position to ask a question of your data. The following cell asks a question that requires joining across the two sources of data (the Neptune clusters described in one of the JSON files, and the instance family descriptions from the Neptune documentation):

In [None]:
%reload_ext dotenv
%dotenv

import os

from graphrag_toolkit.lexical_graph import LexicalGraphQueryEngine
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory

with (
    GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
    VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
):

    query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
        graph_store, 
        vector_store,
        streaming=True,
        no_cache=True
    )

    response = query_engine.query("Can gr-1756394635-cluster currently use the lookup cache?")
    
response.print_response_stream()