# Appendix 2 - Batch Extraction

In this workshop, you've seen how to index data on a chunk-by-chunk basis – either using separate Extract and Build steps, or combined in a single operation. In both cases, individual chunks are passed to an LLM for extraction.

For large datasets, this chunk-by-chunk approach can be slow and costly. [Batch extraction](https://github.com/awslabs/graphrag-toolkit/blob/main/docs/lexical-graph/batch-extraction.md) is a feature of the GraphRAG Toolkit that can significantly improve extraction performance for large datasets. It uses [Amazon Bedrock batch inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html) in the Extract stage of the indexing process, submitting large batches of chunks to Bedrock for simultaneous processing. Bedrock batch inference workloads are charged at a 50% discount compared to On-Demand pricing.
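To make the pricing difference concrete, here is a back-of-the-envelope sketch. The function name, token volume, and per-token price are hypothetical placeholders, not real Bedrock prices; only the 50% discount comes from the paragraph above.

```python
def estimate_extraction_cost(total_input_tokens, price_per_1k_tokens, batch_discount=0.5):
    """Return (on_demand_cost, batch_cost) in dollars for a given input token volume."""
    on_demand = total_input_tokens / 1000 * price_per_1k_tokens
    # Batch inference is assumed to be billed at (1 - discount) of the On-Demand cost
    batch = on_demand * (1 - batch_discount)
    return on_demand, batch


# Hypothetical workload: 200 million input tokens at a made-up
# On-Demand price of $0.003 per 1,000 tokens
on_demand, batch = estimate_extraction_cost(200_000_000, 0.003)
print(f"On-Demand: ${on_demand:,.2f}  Batch: ${batch:,.2f}")
```

Actual costs depend on the model used, output tokens, and current pricing, none of which this sketch accounts for.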

This Appendix summarizes the batch extraction feature. You won't be using it during this workshop, but it's good to know it exists for larger workloads.

#### Prerequisites

To set up batch extraction, you need to create an Amazon S3 bucket in the AWS Region where you will run batch extraction, and a custom service role for batch inference with appropriate permissions.
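As a rough illustration of what these prerequisites involve, the sketch below builds the two IAM policy documents a batch inference service role typically needs: a trust policy that lets Amazon Bedrock assume the role, and a permissions policy granting access to the S3 bucket. The helper names, bucket name, and account ID are hypothetical, and the exact statements are an assumption about the general shape of such a role. Check the Amazon Bedrock and GraphRAG Toolkit documentation for the authoritative policies.

```python
import json


def build_trust_policy(account_id, region):
    """Trust policy allowing Amazon Bedrock to assume the role for batch jobs."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            # Restrict the role to batch inference jobs in this account and Region
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnEquals": {
                    "aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:model-invocation-job/*"
                },
            },
        }],
    }


def build_s3_access_policy(bucket_name):
    """Permissions policy allowing the role to read and write batch files in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }],
    }


# Placeholder account ID, Region, and bucket name
trust_policy = build_trust_policy("111111111111", "us-west-2")
s3_policy = build_s3_access_policy("my-bucket")

print(json.dumps(trust_policy, indent=2))
print(json.dumps(s3_policy, indent=2))
```

The resulting JSON could then be supplied when creating the role, for example through the IAM console or the boto3 IAM client's `create_role` and `put_role_policy` calls. Depending on the model you use, the role may need additional permissions beyond those shown here.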

#### Configuration

To use batch extraction with the `LexicalGraphIndex`, a `BatchConfig` object must be created and supplied to the `LexicalGraphIndex` as part of the `IndexingConfig`. This `BatchConfig` object manages the configuration settings for Amazon Bedrock batch inference jobs.

#### Example

```python
import os

from graphrag_toolkit.lexical_graph import LexicalGraphIndex
from graphrag_toolkit.lexical_graph import GraphRAGConfig, IndexingConfig
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from graphrag_toolkit.lexical_graph.indexing.extract import BatchConfig

from llama_index.core import SimpleDirectoryReader
    
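def batch_extract_and_load():

    # Number of source documents handed to the extract stage at a time;
    # keep this large so each batch inference job receives enough chunks
    GraphRAGConfig.extraction_batch_size = 1000

    # Settings for the Amazon Bedrock batch inference jobs
    # (the bucket name, role ARN, and account ID below are placeholders)
    batch_config = BatchConfig(
        region='us-west-2',
        bucket_name='my-bucket',
        key_prefix='batch-extract',
        role_arn='arn:aws:iam::111111111111:role/my-batch-inference-role',
        max_batch_size=40000
    )

    # Supply the batch settings to the index via the IndexingConfig
    indexing_config = IndexingConfig(batch_config=batch_config)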

    with (
        GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
        VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
    ):

        graph_index = LexicalGraphIndex(
            graph_store, 
            vector_store,
            indexing_config=indexing_config
        )

        reader = SimpleDirectoryReader(input_dir='path/to/directory')
        docs = reader.load_data()

        graph_index.extract_and_build(docs, show_progress=True)
    
batch_extract_and_load()
```