# 1 - Indexing (20 mins)

## What You Will Learn

In this exercise you will learn how to extract information from unstructured, text-based source documents to build a graph and vector index. 

In the course of this exercise you will:
  
  - Inspect the document to be indexed
  - Load and extract the data to disk
  - Inspect the extracted chunks
  - Build the graph and vector index from the extracted data
  - Learn about the hierarchical lexical graph model
  - Visualise the graph and inferred schema
  - Query the graph

## Indexing Process

The indexing process is split into two pipeline stages:

#### 1. Extract

Loads and chunks documents, then uses a Large Language Model (LLM) to perform two extraction steps: proposition extraction, which converts chunked text into well-formed statements, and topic/entity/fact extraction, which identifies relations and concepts.

#### 2. Build

Populates a graph and creates embeddings from the results of the Extract stage.

  
![Extract and Build Stages](./images/extract-and-build.png)

### Indexing options

The GraphRAG Toolkit allows for flexible indexing configurations:

  - You can run the Extract and Build pipelines together for continuous ingest.
  - You can run the Extract and Build stages separately, for more controlled processing.
  - You can configure the toolkit to perform batch extraction, which uses Amazon Bedrock batch inference, for handling large datasets.

***

## Extract Stage

In the following example, you'll run the Extract stage to load some data, chunk the text, and perform both proposition extraction and topic/entity/fact extraction. The results of the Extract stage are then written to disk.

![Extract Stage](./images/extract.png)

#### Loading

To load data from an external source, you use a LlamaIndex reader. LlamaIndex provides a range of [readers and connectors](https://llamaindexxx.readthedocs.io/en/latest/understanding/loading/llamahub.html) for different data sources. In this example, you're going to use a LlamaIndex `SimpleDirectoryReader` to load some Amazon Neptune documentation from a markdown document.

#### Chunking

Chunking splits a large text into smaller chunks using a [LlamaIndex splitter](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/modules/#text-splitters), such as `MarkdownNodeParser` or `SentenceSplitter`. If you don't explicitly specify a chunking strategy, the GraphRAG Toolkit uses a `SentenceSplitter` by default.
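
To make heading-based chunking concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not the `MarkdownNodeParser` implementation: the `split_on_headings` function and the sample text are hypothetical, and the sketch treats every line that starts with `#` as a heading, so it would mis-split markdown that contains `#` comments inside code fences.

```python
import re

# Matches ATX-style markdown headings such as "# Title" or "### Section"
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_on_headings(markdown_text: str) -> list[str]:
    """Split markdown into chunks, starting a new chunk at each heading line."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown_text.splitlines():
        # A heading closes the chunk collected so far and starts a new one
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace
    return [chunk for chunk in chunks if chunk]


# Hypothetical sample document
sample = """# Instance types
Intro text.

## Memory-optimised
Details about one family.

## General purpose
Details about another family.
"""

chunks = split_on_headings(sample)
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---")
    print(chunk)
```

Each chunk keeps its heading together with the text beneath it, which is why heading-based chunking suits documentation that is already organised into sections.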

#### Extraction

The actual extraction consists of two LLM calls per chunk:

  1. Extract sets of propositions (small, well-formed statements) from the chunked text.
  2. From the propositions, further extract topics, statements and facts.
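
The two calls per chunk can be sketched as a small data flow. This is an illustration only: the `LLM` type, the `extract_chunk` function, the prompt wording and the `fake_llm` stub are all hypothetical placeholders, not the toolkit's prompts or API. The point is that the second call consumes the output of the first.

```python
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns the model's text
LLM = Callable[[str], str]


def extract_chunk(chunk_text: str, llm: LLM) -> dict:
    """Run the two extraction steps for one chunk and return both results."""
    # Call 1: rewrite the chunk as standalone propositions, one per line
    proposition_text = llm(f"Rewrite as standalone propositions:\n{chunk_text}")
    propositions = [p.strip() for p in proposition_text.splitlines() if p.strip()]

    # Call 2: derive topics, statements and facts from the propositions
    topics_text = llm(
        "Group into topics, statements and facts:\n" + "\n".join(propositions)
    )

    return {"propositions": propositions, "topics": topics_text}


# Stub that records each prompt instead of calling a real model
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "placeholder line 1\nplaceholder line 2"


result = extract_chunk("Some chunk text.", fake_llm)
print(f"LLM calls made for one chunk: {len(calls)}")
```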

### 🔍 1.1 Inspect the document that you are about to index

Look at the document that you're about to load by running the cell below:

In [None]:
%pycat source-data/neptune/instance-types.md

### 🎯 1.2 Load, chunk and extract data

Run the following cell to extract the data and save it to the filesystem:

In [None]:
%reload_ext dotenv
%dotenv

import os

from graphrag_toolkit.lexical_graph import LexicalGraphIndex, IndexingConfig
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory, VectorStoreFactory
from graphrag_toolkit.lexical_graph.indexing.load import FileBasedDocs

from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core import SimpleDirectoryReader

with (
    GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
    VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
):
    
    # create a LexicalGraphIndex indexing component
    
    config = IndexingConfig(
        chunking=[MarkdownNodeParser()] # chunks document based on markdown headings
    )

    graph_index = LexicalGraphIndex( # core GraphRAG Toolkit indexing component
        graph_store, 
        vector_store,
        indexing_config=config
    )
    
    # load the source document

    loader = SimpleDirectoryReader( # reads source documents from filesystem
        input_files=["./source-data/neptune/instance-types.md"],
        file_metadata=lambda p: {'file_name': os.path.basename(p)}
    )
    
    source_docs = loader.load_data()
    
    # create a destination for the extracted data
    
    extracted_docs = FileBasedDocs( # saves extracted data to filesystem
        docs_directory='extracted',
        collection_id='example-1'
    )
    
    # extract the data
    
    graph_index.extract(
        nodes=source_docs, 
        handler=extracted_docs,
        show_progress=True
    )

print('Complete')

### 🔍 1.3 Inspect the extracted data

Before you proceed to the Build stage, take a look at the results of the Extract stage. View one of the extracted chunks that has been saved to disk by running the cell below:

In [None]:
%pycat extracted/example-1/aws::e61317db:7893/aws::e61317db:7893:5cbc4df2.json
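
The path in the cell above points to one specific chunk file. To see every chunk file the Extract stage produced, a helper along these lines could list them. This is a sketch that assumes the `<docs_directory>/<collection_id>/<source id>/<chunk id>.json` layout implied by that path. The `list_chunk_files` function and the file names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path


def list_chunk_files(docs_directory: str, collection_id: str) -> list[Path]:
    """Return every chunk JSON file under <docs_directory>/<collection_id>/, sorted by path."""
    collection_dir = Path(docs_directory) / collection_id
    # One sub-directory per source document, one JSON file per chunk (assumed layout)
    return sorted(collection_dir.glob("*/*.json"))


# Demonstration against a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "example-1" / "source-1"
    source_dir.mkdir(parents=True)
    for name in ("chunk-b.json", "chunk-a.json"):
        (source_dir / name).write_text("{}", encoding="utf-8")
    found = [path.name for path in list_chunk_files(tmp, "example-1")]

print(found)
```

In the notebook, the equivalent call would be `list_chunk_files('extracted', 'example-1')`.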

There's a lot of detail to each extracted chunk, but these are the fields we're interested in:

```
{
    ...
    "metadata": {
        ...
        "aws::graph::propositions": [
            <propositions>
        ],
        "aws::graph::topics": {
            "topics": [
                <topics/statements/facts>
            ]
        }
    },
    ...
    "text": <chunk text>,    
}
```

Take a moment to review the contents of these fields:

#### text

Towards the bottom of the file is the raw text of the chunk. In our example, the `MarkdownNodeParser` split the document into multiple chunks based on markdown headings in the document. This is the text of one of those chunks.

#### metadata

This field contains the results of the proposition extraction and topic/entity/fact extraction steps. 
  
  - Look at the contents of the `aws::graph::propositions` metadata field. Proposition extraction 'cleans up' the content by extracting sets of well-formed, self-contained propositions from the chunked text.
  - Compare the propositions in `aws::graph::propositions` with the original chunk text. Notice how each proposition has been framed as a standalone statement, with pronouns replaced by proper names.
  - Next, look at the contents of the `aws::graph::topics` metadata field. Here, the extracted propositions are further broken up into statements and facts, grouped by topics. When you run the Build stage later in this exercise, these topics, statements and facts will be inserted into the graph.
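
If you prefer to review these fields programmatically rather than scrolling through the raw JSON, something like the following could pull them out. It is a sketch that relies only on the field names shown above. The `summarise_chunk` function is hypothetical, and the sample chunk is an invented stand-in: real files contain more fields, and the shape of each topic entry here is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def summarise_chunk(path: Path) -> dict:
    """Return the chunk text, propositions and topics from one extracted chunk file."""
    chunk = json.loads(path.read_text(encoding="utf-8"))
    metadata = chunk.get("metadata", {})
    return {
        "text": chunk.get("text", ""),
        "propositions": metadata.get("aws::graph::propositions", []),
        "topics": metadata.get("aws::graph::topics", {}).get("topics", []),
    }


# Hypothetical stand-in for an extracted chunk; topic entry shape is a placeholder
sample_chunk = {
    "metadata": {
        "aws::graph::propositions": ["Example proposition A", "Example proposition B"],
        "aws::graph::topics": {"topics": [{"placeholder": "topic entry"}]},
    },
    "text": "Example chunk text.",
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample-chunk.json"
    sample_path.write_text(json.dumps(sample_chunk), encoding="utf-8")
    summary = summarise_chunk(sample_path)

print(f"{len(summary['propositions'])} propositions, {len(summary['topics'])} topics")
```

Using `.get()` with defaults means the helper returns empty results, rather than raising an error, for a file that lacks one of these fields.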

***

## Build Stage

The Build stage uses the extracted chunks that you looked at above to create the graph and associated vector embeddings:

![Build Stage](./images/build.png)

### 🎯 1.4 Build the graph and vector stores from the previously extracted chunks

Run the cell below to populate the graph and vector stores from the chunks you extracted earlier. 

In [None]:
%reload_ext dotenv
%dotenv

import os

from graphrag_toolkit.lexical_graph import LexicalGraphIndex
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from graphrag_toolkit.lexical_graph.indexing.load import FileBasedDocs

with (
    GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
    VectorStoreFactory.for_vector_store(
        os.environ['VECTOR_STORE'], 
        index_names=['chunk'] # valid values include 'chunk', 'topic', 'statement'
    ) as vector_store
):
    # create a LexicalGraphIndex indexing component 
    
    graph_index = LexicalGraphIndex(
        graph_store, 
        vector_store
    )
    
    # specify the source of previously extracted data
    
    extracted_docs = FileBasedDocs(
        docs_directory='extracted',
        collection_id='example-1'
    )
    
    # populate the graph and vector stores

    graph_index.build(
        nodes=extracted_docs, 
        show_progress=True
    )

print('Build complete')

***

## Lexical Graph

The GraphRAG Toolkit builds a particular kind of graph model, called a hierarchical lexical graph.

This graph has a very specific purpose: to make it easy to find sets of relevant statements that can be used to answer the user's question.

Statements are the key data elements in the lexical graph. You can think of the entire lexical graph as a "bucket" of statements, with other nodes (sources, chunks, topics, facts and entities) playing specific roles to help find and organize statements:

  - **Sources** Contain source document metadata for filtering and versioning
  - **Chunks** Provide vector-based entry points into the graph
  - **Statements** Standalone propositions - the primary unit of context supplied to an LLM
  - **Topics** Thematically group statements belonging to the _same_ source
  - **Facts** Connect statements belonging to _different_ sources
  - **Entities** Represent the domain semantics of the dataset

![Lexical Graph](./images/lexical-graph.png)
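
The different roles of topics and facts can be sketched with a small in-memory model. This is an illustration of the ideas in the list above, not the toolkit's graph schema: the `Statement` class, both helper functions and all sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Illustrative stand-in for a statement node; not a toolkit class."""
    text: str
    source_id: str  # the source document the statement came from
    topic: str      # topics group statements within one source
    facts: list[str] = field(default_factory=list)  # facts can be shared across sources


def topics_for_source(statements: list[Statement], source_id: str) -> dict[str, list[str]]:
    """Group the statements of a single source by topic."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for statement in statements:
        if statement.source_id == source_id:
            grouped[statement.topic].append(statement.text)
    return dict(grouped)


def sources_connected_by(statements: list[Statement], fact: str) -> set[str]:
    """Return the sources that have at least one statement linked to the given fact."""
    return {s.source_id for s in statements if fact in s.facts}


# Hypothetical sample data: two sources that share one fact
statements = [
    Statement("Statement 1", "source-a", "topic-x", ["fact-1"]),
    Statement("Statement 2", "source-a", "topic-x", ["fact-2"]),
    Statement("Statement 3", "source-b", "topic-y", ["fact-1"]),
]

print(topics_for_source(statements, "source-a"))
print(sorted(sources_connected_by(statements, "fact-1")))
```

In this toy data, the topic gathers two statements from the same source, while the shared fact links statements from two different sources, mirroring the local and cross-source roles described above.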

### 🎯 1.6 Visualise the graph

You can visualise the graph that you've just built by running the cells below. Click on the **Graph** tab in the output to view the visualisation. You should see a single source document, several chunks, and then the topics, statements, facts and entities derived from those chunks.

<div class="alert alert-success">
🔧 <b style="color:black;">Jupyter Notebook formatting</b>

If you're running in JupyterLab or version 7 of the Jupyter Notebook application, set the `NB_CLASSIC` variable below to `False`. 
</div>

<div class="alert alert-danger">
⚠️ <b style="color:black;">Display errors</b>

With the Classic Notebook setup, the display sometimes breaks. If it does, scroll through the output (including what may appear to be JavaScript errors) to reach the **Graph** tab.
</div>

In [None]:
NB_CLASSIC = True

from graphrag_toolkit.lexical_graph.visualisation import GraphNotebookVisualisation

v = GraphNotebookVisualisation(nb_classic=NB_CLASSIC)
v.display_sources()

### 🎯 1.7 Visualise the inferred schema

In addition to the lexical graph, the GraphRAG Toolkit creates schema nodes that represent the implicit domain semantics at the entity-relationship tier (the lowest tier of the lexical graph structure). You can view this schema by running the following cell:

In [None]:
NB_CLASSIC = True

from graphrag_toolkit.lexical_graph.visualisation import GraphNotebookVisualisation

v = GraphNotebookVisualisation(nb_classic=NB_CLASSIC)
v.display_schema()

If you later run the optional notebook, **03 - Agentic Use Cases**, you will see how this ability to infer the schema is used to create descriptions of domain-specific agentic tools.

### 🎯 1.8 Query the Data

You can now ask a question of your data:

In [None]:
%reload_ext dotenv
%dotenv

import os

from graphrag_toolkit.lexical_graph import LexicalGraphQueryEngine
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory

with (
    GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
    VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
):

    query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
        graph_store, 
        vector_store,
        streaming=True,
        no_cache=True
    )

    response = query_engine.query("What are the differences between the r7i and r8g instance families?")
    
response.print_response_stream()

## Next Exercise

Go to <a href="../../../nbclassic/notebooks/graphrag-toolkit/2-Querying.ipynb"><b>Exercise 2 - Querying</b></a> to continue the workshop exercises.