# A - Advanced Construction

## Setup

If you haven't already, install the toolkit and dependencies using the [Setup](./00-Setup.ipynb) notebook.

## Extract and build pipelines

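This notebook configures the extraction and build stages as separate pipelines, then chains them together to index a small set of web pages. For details, see [Advanced graph construction](https://github.com/awslabs/graphrag-toolkit/blob/main/docs/lexical-graph/indexing.md#advanced-graph-construction).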
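The first extraction component in the cell below is a `SentenceSplitter` configured with `chunk_size=350` and `chunk_overlap=50`. As a rough illustration of how those two parameters interact, here is a minimal sliding-window sketch. It is not the LlamaIndex implementation: `SentenceSplitter` counts tokens and tries to keep sentences intact, whereas the hypothetical `sliding_window_chunks` helper simply slices a list of placeholder tokens.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Hypothetical helper: split tokens into fixed-size windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be in [0, chunk_size)')

    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Placeholder tokens standing in for a tokenized document
toy_tokens = [f't{i}' for i in range(1000)]
toy_chunks = sliding_window_chunks(toy_tokens, chunk_size=350, chunk_overlap=50)

print([len(chunk) for chunk in toy_chunks])
print(toy_chunks[0][-50:] == toy_chunks[1][:50])
```

With these settings each window begins 300 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next.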

In [1]:
%reload_ext dotenv
%dotenv

import os

from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from graphrag_toolkit.lexical_graph.indexing import sink
from graphrag_toolkit.lexical_graph.indexing.constants import PROPOSITIONS_KEY, DEFAULT_ENTITY_CLASSIFICATIONS
from graphrag_toolkit.lexical_graph.indexing.extract import LLMPropositionExtractor
from graphrag_toolkit.lexical_graph.indexing.extract import TopicExtractor
from graphrag_toolkit.lexical_graph.indexing.extract import GraphScopedValueStore
from graphrag_toolkit.lexical_graph.indexing.extract import ScopedValueProvider, DEFAULT_SCOPE
from graphrag_toolkit.lexical_graph.indexing.extract import ExtractionPipeline
from graphrag_toolkit.lexical_graph.indexing.build import Checkpoint
from graphrag_toolkit.lexical_graph.indexing.build import BuildPipeline
from graphrag_toolkit.lexical_graph.indexing.build import VectorIndexing
from graphrag_toolkit.lexical_graph.indexing.build import GraphConstruction
from graphrag_toolkit.lexical_graph import LexicalGraphIndex, set_logging_config
from graphrag_toolkit.lexical_graph.storage.graph.falkordb import FalkorDBGraphStoreFactory


from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.web import SimpleWebPageReader

# Checkpoint shared by the extraction and build pipelines
checkpoint = Checkpoint('advanced-construction-example', enabled=True)
set_logging_config('INFO')


# Register the FalkorDB backend with the factory
GraphStoreFactory.register(FalkorDBGraphStoreFactory)

# Create graph and vector stores
graph_store = GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE'])
vector_store = VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE'])

graph_index = LexicalGraphIndex(
    graph_store,
    vector_store
)

# Create extraction pipeline components

# 1. Chunking using SentenceSplitter
splitter = SentenceSplitter(
    chunk_size=350,
    chunk_overlap=50
)

# 2. Proposition extraction
proposition_extractor = LLMPropositionExtractor()

# 3. Topic extraction
entity_classification_provider = ScopedValueProvider(
    label='EntityClassification',
    scoped_value_store=GraphScopedValueStore(graph_store=graph_store),
    initial_scoped_values={DEFAULT_SCOPE: DEFAULT_ENTITY_CLASSIFICATIONS}
)

topic_extractor = TopicExtractor(
    source_metadata_field=PROPOSITIONS_KEY, # Omit this line if not performing proposition extraction
    entity_classification_provider=entity_classification_provider # Entity classifications saved to graph between LLM invocations
)

# Create extraction pipeline
extraction_pipeline = ExtractionPipeline.create(
    components=[
        splitter,
        proposition_extractor,
        topic_extractor
    ],
    num_workers=2,
    batch_size=4,
    checkpoint=checkpoint,
    show_progress=True
)

# Create build pipeline components
graph_construction = GraphConstruction.for_graph_store(graph_store)
vector_indexing = VectorIndexing.for_vector_store(vector_store)

# Create build pipeline
build_pipeline = BuildPipeline.create(
    components=[
        graph_construction,
        vector_indexing
    ],
    num_workers=2,
    batch_size=10,
    batch_writes_enabled=True,
    checkpoint=checkpoint,
    show_progress=True
)

# Load source documents
doc_urls = [
    'https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html'
]

docs = SimpleWebPageReader(
    html_to_text=True,
    metadata_fn=lambda url: {'url': url}
).load_data(doc_urls)

# Run the extraction and build stages
docs | extraction_pipeline | build_pipeline | sink

print('Complete')

2025-05-06 11:08:42:INFO:g.l.i.e.extraction_pipeline:Running extraction pipeline [batch_size: 4, num_workers: 2]


Extracting propositions [nodes: 4, num_workers: 4]: 100%|██████████| 4/4 [00:08<00:00,  2.10s/it]
Extracting propositions [nodes: 7, num_workers: 4]: 100%|██████████| 7/7 [00:19<00:00,  2.73s/it]
Extracting topics [nodes: 4, num_workers: 4]: 100%|██████████| 4/4 [00:19<00:00,  4.85s/it]
Extracting topics [nodes: 7, num_workers: 4]: 100%|██████████| 7/7 [00:54<00:00,  7.78s/it]


2025-05-06 11:09:56:INFO:g.l.i.b.build_pipeline:Running build pipeline [batch_size: 10, num_workers: 2, job_sizes: [425, 170], batch_writes_enabled: True, batch_write_size: 25]


Building graph [batch_writes_enabled: True, batch_write_size: 25]: 100%|██████████| 170/170 [00:00<00:00, 53797.47it/s]
Building graph [batch_writes_enabled: True, batch_write_size: 25]: 100%|██████████| 425/425 [00:00<00:00, 51060.67it/s]
Building vector index [batch_writes_enabled: True, batch_write_size: 25]: 100%|██████████| 170/170 [00:00<00:00, 915316.66it/s]
Building vector index [batch_writes_enabled: True, batch_write_size: 25]: 100%|██████████| 425/425 [00:00<00:00, 1111333.67it/s]


UndefinedObject: data type text has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.
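The cell above runs everything with `docs | extraction_pipeline | build_pipeline | sink`. The sketch below shows one way this kind of `|` chaining can be built in plain Python, using `__ror__` and generators. It is an illustration under stated assumptions, not the toolkit's implementation: `Stage`, `toy_extraction`, `toy_build`, `toy_sink` and `written` are hypothetical names, and the toy stages only split and record strings.

```python
from typing import Callable, Iterable, List


class Stage:
    """Hypothetical wrapper that lets a function be used as `upstream | stage`."""

    def __init__(self, fn: Callable[[Iterable], object]):
        self.fn = fn

    def __ror__(self, upstream: Iterable):
        # Python calls this for `upstream | stage` when `upstream` has no matching `|`
        return self.fn(upstream)


written: List[str] = []  # stands in for the graph and vector stores


def _extract(docs: Iterable[str]):
    # Lazily yield one "chunk" per sentence
    for doc in docs:
        for sentence in doc.split('. '):
            yield sentence.strip('. ')


def _build(chunks: Iterable[str]):
    # Record each chunk, then pass it downstream
    for chunk in chunks:
        written.append(chunk)
        yield chunk


def _drain(items: Iterable) -> None:
    # Consume the stream so the lazy upstream stages actually run
    for _ in items:
        pass


toy_extraction = Stage(_extract)
toy_build = Stage(_build)
toy_sink = Stage(_drain)

toy_docs = ['First sentence. Second sentence.', 'Third sentence.']

lazy_stream = toy_docs | toy_extraction | toy_build
written_before_sink = len(written)  # generators have not run yet

lazy_stream | toy_sink
print(written_before_sink, written)
```

In this sketch nothing is written until the final stage consumes the stream, which is why a terminal consumer sits at the end of the chain.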


In [None]:
from graphrag_toolkit.lexical_graph import GraphRAGConfig
print(f"GraphRAGConfig: {GraphRAGConfig}")