In [1]:
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.

## API Overview

This notebook demonstrates how to use graphrag as a library through its API rather than through the CLI. Note that graphrag's CLI itself calls into the library through this API for all operations.

In [2]:
import graphrag.api as api
from graphrag.index.typing import PipelineRunResult


## Prerequisite
As a prerequisite to all API operations, a `GraphRagConfig` object is required. It is the primary means to control the behavior of graphrag and can be instantiated from a `settings.yaml` configuration file.

Please refer to the [CLI docs](https://microsoft.github.io/graphrag/cli/#init) for more detailed information on how to generate the `settings.yaml` file.

### Load `settings.yaml` configuration

In [3]:
from pathlib import Path

import yaml

# read_text() opens and closes the file, so no handle is left open.
settings = yaml.safe_load(Path("settings.yaml").read_text(encoding="utf-8"))

At this point, you can modify the imported settings to align with your application's requirements. For example, if building a UI application, the application might need to change the input and/or storage destinations dynamically in order to enable users to build and query different indexes.
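As a rough illustration of per-user overrides, the sketch below merges a small dictionary of changes into the loaded settings without mutating the original. The helper `with_overrides` and the keys shown (`input.base_dir`, `output.base_dir`) are assumptions made for this example; check them against the keys actually present in your own `settings.yaml`.

```python
from copy import deepcopy
from typing import Any


def with_overrides(settings: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `settings` with `overrides` merged in recursively."""
    merged = deepcopy(settings)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = with_overrides(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the old value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical keys, for illustration only.
example_settings = {
    "input": {"type": "file", "base_dir": "input"},
    "output": {"base_dir": "output"},
}
user_settings = with_overrides(
    example_settings,
    {
        "input": {"base_dir": "users/alice/input"},
        "output": {"base_dir": "users/alice/output"},
    },
)
```

Because the merge works on a copy, the same base settings can be reused to derive a separate configuration for each user or index.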

### Generate a `GraphRagConfig` object

In [4]:
from graphrag.config.create_graphrag_config import create_graphrag_config

graphrag_config = create_graphrag_config(
    values=settings, root_dir="."
)

## Indexing API

*Indexing* is the process of ingesting raw text data and constructing a knowledge graph. GraphRAG currently supports plaintext (`.txt`) and `.csv` file formats.

In [5]:
#graphrag_config

## Build an index

In [6]:
import logging

from graphrag.cache.noop_pipeline_cache import NoopPipelineCache
from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks
from graphrag.config.enums import CacheType
from graphrag.config.models.graph_rag_config import GraphRagConfig
from graphrag.index.run.run_workflows import run_workflows
from graphrag.index.typing import PipelineRunResult
from graphrag.logger.base import ProgressLogger

log = logging.getLogger(__name__)


async def build_index(
    config: GraphRagConfig,
    memory_profile: bool = False,
    callbacks: list[WorkflowCallbacks] | None = None,
    progress_logger: ProgressLogger | None = None,
) -> list[PipelineRunResult]:
    """Run the pipeline with the given configuration.

    Parameters
    ----------
    config : GraphRagConfig
        The configuration.
    memory_profile : bool
        Whether to enable memory profiling.
    callbacks : list[WorkflowCallbacks] | None default=None
        A list of callbacks to register.
    progress_logger : ProgressLogger | None default=None
        The progress logger.

    Returns
    -------
    list[PipelineRunResult]
        The list of pipeline run results
    """
    is_update_run = bool(config.update_index_output)

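    # Use a no-op cache when caching is disabled in the config; otherwise pass None.
    # (The previous chained comparison `== CacheType.none is None` was always False.)
    pipeline_cache = (
        NoopPipelineCache() if config.cache.type == CacheType.none else None
    )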
    # create a pipeline reporter and add to any additional callbacks
    # TODO: remove the type ignore once the new config engine has been refactored
    callbacks = callbacks or []
    outputs: list[PipelineRunResult] = []

    if memory_profile:
        log.warning("New pipeline does not yet support memory profiling.")

    workflows = _get_workflows_list(config)

    async for output in run_workflows(
        workflows,
        config,
        cache=pipeline_cache,
        callbacks=callbacks,
        logger=progress_logger,
        is_update_run=is_update_run,
    ):
        outputs.append(output)
        if progress_logger:
            if output.errors and len(output.errors) > 0:
                progress_logger.error(output.workflow)
            else:
                progress_logger.success(output.workflow)
            progress_logger.info(str(output.result))

    return outputs


def _get_workflows_list(config: GraphRagConfig) -> list[str]:
    return [
        "create_base_text_units",
        "create_final_documents",
        "extract_graph",
        "compute_communities",
        "create_final_entities",
        "create_final_relationships",
        "create_final_nodes",
        "create_final_communities",
        *(["create_final_covariates"] if config.claim_extraction.enabled else []),
        "create_final_text_units",
        "create_final_community_reports",
        "generate_text_embeddings",
    ]

In [7]:
config = graphrag_config


workflows = _get_workflows_list(config)

outputs: list[PipelineRunResult] = []

async for output in run_workflows(
    workflows,
    config,
    cache=None,
    callbacks=[],
    logger=None,
    is_update_run=False,  # a bool is expected; this run builds a fresh index
):
    outputs.append(output)
    print(output)




PipelineRunResult(workflow='create_base_text_units', result=                                                   id  \
0   336671e337e5f4539069473e8f8691b3ed696331aabe67...   
1   2160a0c64179a7920c578f3400ad64f77c22927e6ab8c7...   
2   d798befe565a9ed5b6b536fd8a95a1d396867b232ec308...   
3   cc6a8a52ea673776c03f32442c2a05f75b59d30a0bf4c0...   
4   1c129c3dd67b1761adbdb4186b2de1036b2e4ff3683e4d...   
5   fdd19e6236e61193504953904d1221bc393a60fe728ffa...   
6   a998e6a1b2d1e74ba419f937061024540104d2716c1917...   
7   3292473b26f94c7aff219219ee64dcba1585532bac0857...   
8   25ae520bd79457caa7d277e1ba3731e3f498fc62f02935...   
9   d38581de899a32c16a744f6a867412cb91e528f9383372...   
10  b644ae78a58c60ff6b7a6b959c84a3f5f7d8b97123992d...   
11  f995cd9f704ad3fe03a64029c2dfa6beb97262269f6a4c...   
12  d1537788200767168593eb8e9d4f4c4b7006aba28fb1cc...   
13  5c5adb5118a758e4a0a70d2d702eb73cddadb7d95e2efa...   
14  3c6bd4bf5311e797262e6b100e39817183f99836bd9760...   
15  88479779a69573e42cf7992d

  _edge_swap_numba = nb.jit(_edge_swap, nopython=False)


PipelineRunResult(workflow='compute_communities', result=    level  community  parent                 title
0       0          0      -1              ABRAHAMS
0       0          0      -1               SCROOGE
0       0          0      -1              ALI BABA
0       0          0      -1    ANGELIC MESSENGERS
0       0          0      -1              APOSTLES
..    ...        ...     ...                   ...
22      2         22      14  MR. SCROOGE'S NEPHEW
23      2         23      14       CRATCHIT FAMILY
23      2         23      14                MARTHA
23      2         23      14          MISS BELINDA
23      2         23      14                 PETER

[202 rows x 4 columns], errors=None)
PipelineRunResult(workflow='create_final_entities', result=                                       id  human_readable_id  \
0    66d26fbf-e56b-458d-9112-8b39972578f9                  0   
1    71b5fb9e-9db2-45e9-aecf-c245055bc548                  1   
2    ff909442-b318-453f-961c-acf2b3cc6248 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  input.loc[:, NODE_DETAILS] = input.loc[


PipelineRunResult(workflow='create_final_community_reports', result=                                  id  human_readable_id  community  parent  \
0   1575e4e3317644e595fc65c31c9cd01e                 22         22      14   
1   06cb97b320e147e49703426869bca94e                 23         23      14   
2   62cc766c70e542418ef676dc10d9c7bb                  9          9       0   
3   55fc907c8ddc47eb834b9683961bed4a                 10         10       0   
4   bdafda2d67a24804b1c2a1e61f47a723                 11         11       0   
5   1b34f5b6a973489cae01f4e25da0aece                 12         12       0   
6   b1411d7e91f34625aa80b10553b58960                 13         13       3   
7   e79a990069aa4708a06627e02c1e71de                 14         14       3   
8   4d4ac8cf8d93444bb7054bda27c531b9                 15         15       4   
9   ba96ba06edd9405f810bc9a92cb3666f                 16         16       4   
10  072d0fe4ac2941b8ab8b24f23de46fea                 17         17       4

[2025-03-17T16:04:02Z WARN  lance::dataset::write::insert] No existing dataset at /Users/paulbruffett/Documents/code/agents/graph_carol/output/lancedb/default-entity-description.lance, it will be created
[2025-03-17T16:04:11Z WARN  lance::dataset::write::insert] No existing dataset at /Users/paulbruffett/Documents/code/agents/graph_carol/output/lancedb/default-text_unit-text.lance, it will be created


PipelineRunResult(workflow='generate_text_embeddings', result=None, errors=None)


[2025-03-17T16:04:23Z WARN  lance::dataset::write::insert] No existing dataset at /Users/paulbruffett/Documents/code/agents/graph_carol/output/lancedb/default-community-full_content.lance, it will be created


In [8]:
#index_result: list[PipelineRunResult] = await api.build_index(config=graphrag_config)

# outputs holds one PipelineRunResult for each workflow in the indexing pipeline that was run
for workflow_result in outputs:
    print(workflow_result)
    status = f"error\n{workflow_result.errors}" if workflow_result.errors else "success"
    print(f"Workflow Name: {workflow_result.workflow}\tStatus: {status}")

PipelineRunResult(workflow='create_base_text_units', result=                                                   id  \
0   336671e337e5f4539069473e8f8691b3ed696331aabe67...   
1   2160a0c64179a7920c578f3400ad64f77c22927e6ab8c7...   
2   d798befe565a9ed5b6b536fd8a95a1d396867b232ec308...   
3   cc6a8a52ea673776c03f32442c2a05f75b59d30a0bf4c0...   
4   1c129c3dd67b1761adbdb4186b2de1036b2e4ff3683e4d...   
5   fdd19e6236e61193504953904d1221bc393a60fe728ffa...   
6   a998e6a1b2d1e74ba419f937061024540104d2716c1917...   
7   3292473b26f94c7aff219219ee64dcba1585532bac0857...   
8   25ae520bd79457caa7d277e1ba3731e3f498fc62f02935...   
9   d38581de899a32c16a744f6a867412cb91e528f9383372...   
10  b644ae78a58c60ff6b7a6b959c84a3f5f7d8b97123992d...   
11  f995cd9f704ad3fe03a64029c2dfa6beb97262269f6a4c...   
12  d1537788200767168593eb8e9d4f4c4b7006aba28fb1cc...   
13  5c5adb5118a758e4a0a70d2d702eb73cddadb7d95e2efa...   
14  3c6bd4bf5311e797262e6b100e39817183f99836bd9760...   
15  88479779a69573e42cf7992d

## Query an index

To query an index, several index files must first be read into memory and passed to the query API. 

In [10]:
import pandas as pd

final_nodes = pd.read_parquet("output/create_final_nodes.parquet")
final_entities = pd.read_parquet(
    "output/create_final_entities.parquet"
)
final_communities = pd.read_parquet(
    "output/create_final_communities.parquet"
)
final_community_reports = pd.read_parquet(
    "output/create_final_community_reports.parquet"
)

response, context = await api.global_search(
    config=graphrag_config,
    nodes=final_nodes,
    entities=final_entities,
    communities=final_communities,
    community_reports=final_community_reports,
    community_level=2,
    dynamic_community_selection=False,
    response_type="Multiple Paragraphs",
    query="Who is Scrooge and what are his main relationships?",
)

creating llm client with {'api_key': 'REDACTED,len=164', 'type': "openai_chat", 'encoding_model': 'cl100k_base', 'model': 'gpt-4o', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'frequency_penalty': 0.0, 'presence_penalty': 0.0, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'audience': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 50000, 'requests_per_minute': 1000, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25, 'responses': None}


The `response` object is the answer generated by graphrag, while the `context` object holds metadata about the query process used to produce that answer.

In [11]:
print(response)

### Ebenezer Scrooge: Character Overview

Ebenezer Scrooge is initially depicted as a miserly and cold-hearted individual, renowned for his business acumen and disdain for Christmas. He is characterized as a wealthy but solitary man, whose interactions with others are marked by his stinginess and lack of empathy [Data: Reports (9)].

### Key Relationships

1. **Supernatural Spirits**: Scrooge's interactions with the supernatural spirits, including the Ghost of Christmas Past, the Ghost of Christmas Yet to Come, and the Ghost of Jacob Marley, are pivotal in his journey towards redemption. These spirits guide him through reflections on his past, present, and potential future, playing a crucial role in his transformation [Data: Reports (9, 19, 20)].

2. **Jacob Marley**: Scrooge's relationship with his deceased business partner, Jacob Marley, is significant. Marley's ghost warns Scrooge about the consequences of his current lifestyle and sets the stage for his transformation [Data: Report

Digging into the context provides very granular information, such as which sources of data (down to the level of text chunks) were ultimately retrieved and used as part of the context sent to the LLM.
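Since the full context can be very large, a quick size summary is often a useful first look. The sketch below counts the records under each key. It assumes the context is a mapping whose values support `len()` (lists in the output of this notebook); `context_sizes` and `toy_context` are names introduced for illustration, and the toy data is shortened rather than copied from a real run.

```python
from typing import Any


def context_sizes(context: dict[str, Any]) -> dict[str, int]:
    """Count the records stored under each key of a context mapping."""
    return {
        key: len(value)
        for key, value in context.items()
        if hasattr(value, "__len__")  # skip values that have no length
    }


# Toy context using the same keys as this notebook's output; content is shortened.
toy_context = {
    "claims": [],
    "entities": [],
    "relationships": [],
    "reports": [{"content": "# Ebenezer Scrooge and His Transformative Journey"}],
}
sizes = context_sizes(toy_context)
```

With the sizes in hand, it is easier to decide which keys are worth printing in full.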

In [12]:
from pprint import pprint

pprint(context)  # noqa: T203

{'claims': [],
 'entities': [],
 'relationships': [],
 'reports': [{'content': '# Ebenezer Scrooge and His Transformative Journey\n'
                         '\n'
                         'The community centers around Ebenezer Scrooge, a '
                         'wealthy but miserly businessman, and his '
                         'interactions with various entities that lead to his '
                         'profound personal transformation. Key relationships '
                         'include his business partnership with Marley, his '
                         'familial connection with his nephew Fred, and his '
                         'encounters with supernatural spirits that guide him '
                         'towards redemption. The narrative unfolds in the '
                         "City of London, where Scrooge's past, present, and "
                         'potential future are explored, ultimately leading to '
                         'his change of heart and newfound