# GraphRAG Quickstart

## Prerequisites
Install the third-party packages (those outside the Python Standard Library) that this notebook requires.

In [35]:
! pip install devtools python-magic requests tqdm

Defaulting to user installation because normal site-packages is not writeable


In [36]:
import getpass
import json
import time
from pathlib import Path

import magic
import requests
from devtools import pprint
from tqdm import tqdm

## (REQUIRED) User Configuration
Set the API subscription key, API base endpoint, and some file directory names that will be referenced later in the notebook.

#### API subscription key

Azure API Management (APIM) supports multiple forms of authentication and access control (e.g. managed identity). This notebook uses a **[subscription key](https://learn.microsoft.com/en-us/azure/api-management/api-management-subscriptions)**. To locate the key, open the Azure Portal and navigate to `<my_resource_group> --> <API Management service> --> <APIs> --> <Subscriptions> --> <Built-in all-access subscription> Primary Key`. If multiple users share the API, an individual subscription key can be generated for each of them.

In [19]:
ocp_apim_subscription_key = getpass.getpass(
    "Enter the API Management subscription key: "
)

"""
"Ocp-Apim-Subscription-Key": 
    This is a custom HTTP header used by Azure API Management service (APIM) to 
    authenticate API requests. The value for this key should be set to the subscription 
    key provided by the Azure APIM instance in your GraphRAG resource group.
"""
headers = {"Ocp-Apim-Subscription-Key": ocp_apim_subscription_key}
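
For unattended runs, an interactive prompt can be inconvenient. The following is a minimal sketch, not part of the original notebook, that reads the key from an environment variable first and falls back to `getpass`. The variable name `GRAPHRAG_APIM_KEY` and the helper `load_subscription_key` are hypothetical names chosen for illustration.

```python
import getpass
import os


def load_subscription_key(env_var: str = "GRAPHRAG_APIM_KEY") -> str:
    """Return the subscription key from the environment, or prompt for it."""
    # GRAPHRAG_APIM_KEY is a hypothetical variable name used only in this sketch.
    key = os.environ.get(env_var, "").strip()
    if not key:
        # fall back to an interactive prompt that does not echo the input
        key = getpass.getpass("Enter the API Management subscription key: ").strip()
    if not key:
        raise ValueError("A subscription key is required.")
    return key
```

The returned value can then be used to build the same `headers` dictionary: `headers = {"Ocp-Apim-Subscription-Key": load_subscription_key()}`.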

#### Setup directories and API endpoint

For demonstration purposes, use the provided `get-wiki-articles.py` script to download a small set of Wikipedia articles, or provide your own data (graphrag requires `.txt` files to be UTF-8 encoded).
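
If you bring your own data, it can help to check the encoding before uploading. The sketch below is an illustration rather than part of the GraphRAG tooling: the hypothetical helper `find_non_utf8_files` lists the `.txt` files in a flat directory that fail strict UTF-8 decoding.

```python
from pathlib import Path


def find_non_utf8_files(directory: str) -> list[Path]:
    """Return the .txt files in `directory` that are not valid UTF-8."""
    bad_files = []
    for path in sorted(Path(directory).glob("*.txt")):
        try:
            # strict decoding raises on the first invalid byte sequence
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad_files.append(path)
    return bad_files
```

An empty list means every `.txt` file decoded cleanly; any reported file should be re-encoded before upload.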

In [20]:
"""
These parameters must be defined by the notebook user:

- file_directory: a local directory of text files. The file structure should be flat,
                  with no nested directories. (i.e. file_directory/file1.txt, file_directory/file2.txt, etc.)
- storage_name:   a unique name to identify a blob storage container in Azure where files
                  from `file_directory` will be uploaded.
- index_name:     a unique name to identify a single graphrag knowledge graph index.
                  Note: Multiple indexes may be created from the same `storage_name` blob storage container.
- endpoint:       the base/endpoint URL for the GraphRAG API (this is the Gateway URL found in the APIM resource).
"""

file_directory = "../data"
storage_name = "data"
index_name = "test-all-01"
endpoint = "https://apim-uugxkaopsbxne.azure-api.net"

In [21]:
assert (
    file_directory != "" and storage_name != "" and index_name != "" and endpoint != ""
)
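
The assertion above only rejects empty strings. A slightly stricter check is sketched below under two assumptions that the notebook does not state outright: that the endpoint is an `https://` URL, and that the "flat directory" requirement should be enforced up front. `validate_config` is a hypothetical helper name.

```python
from pathlib import Path
from urllib.parse import urlparse


def validate_config(
    file_directory: str, storage_name: str, index_name: str, endpoint: str
) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for name, value in [("storage_name", storage_name), ("index_name", index_name)]:
        if not value.strip():
            problems.append(f"{name} must not be empty")
    # assumption of this sketch: the APIM gateway is reached over HTTPS
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or not parsed.netloc:
        problems.append("endpoint must be an https:// URL")
    directory = Path(file_directory)
    if not directory.is_dir():
        problems.append(f"{file_directory} is not a directory")
    elif any(child.is_dir() for child in directory.iterdir()):
        problems.append(f"{file_directory} contains nested directories")
    return problems
```

Calling `validate_config(file_directory, storage_name, index_name, endpoint)` before uploading surfaces all problems at once instead of failing on the first one.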

## Upload Files

For a demonstration of how to index data in graphrag, we first need to ingest a few files into graphrag.

In [22]:
def upload_files(
    file_directory: str,
    storage_name: str,
    batch_size: int = 100,
    overwrite: bool = True,
    max_retries: int = 5,
) -> requests.Response:
    """
    Upload files to a blob storage container.

    Args:
    file_directory - a local directory of .txt files to upload. All files must have utf-8 encoding.
    storage_name - a unique name for the Azure storage blob container.
    batch_size - the number of files to upload in a single batch.
    overwrite - whether or not to overwrite files if they already exist in the storage blob container.
    max_retries - the maximum number of times to retry uploading a batch of files if the API is busy.

    NOTE: Uploading files may sometimes fail if the blob container was recently deleted
    (i.e. a few seconds before). In practice, the solution is to sleep a few seconds and try again.
    """
    url = endpoint + "/data"

    def upload_batch(
        files: list, storage_name: str, overwrite: bool, max_retries: int
    ) -> requests.Response:
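        for _ in range(max_retries):
            # rewind the file handles so that a retry resends the full contents
            for _, handle in files:
                handle.seek(0)
            response = requests.post(
                url=url,
                files=files,
                params={"storage_name": storage_name, "overwrite": overwrite},
                headers=headers,
            )
            # API may be busy, retry
            if response.status_code == 500:
                print("API busy. Sleeping and will try again.")
                time.sleep(10)
                continue
            return response
        # retries exhausted: return the last (failed) response
        return response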

    batch_files = []
    accepted_file_types = ["text/plain"]
    filepaths = list(Path(file_directory).iterdir())
    response = None  # stays None if no valid file is found
    for file in tqdm(filepaths):
        # validate that file is a regular file, has a .txt extension, and has a text/plain MIME type
        if (
            not file.is_file()
            or file.suffix != ".txt"
            or magic.from_file(str(file), mime=True) not in accepted_file_types
        ):
            print(f"Skipping invalid file: {file}")
            continue
        # open and decode file as utf-8, ignore bad characters
        batch_files.append(
            ("files", open(file=file, mode="r", encoding="utf-8", errors="ignore"))
        )
        # upload batch of files
        if len(batch_files) == batch_size:
            response = upload_batch(batch_files, storage_name, overwrite, max_retries)
            # close the file handles of the batch that was just sent
            for _, handle in batch_files:
                handle.close()
            batch_files.clear()
            # if response is not ok, return early
            if not response.ok:
                return response
    # upload remaining files
    if len(batch_files) > 0:
        response = upload_batch(batch_files, storage_name, overwrite, max_retries)
        for _, handle in batch_files:
            handle.close()
    # fail with a clear message instead of returning an undefined response
    if response is None:
        raise ValueError(f"No valid .txt files found in {file_directory}")
    return response


response = upload_files(
    file_directory=file_directory,
    storage_name=storage_name,
    batch_size=100,
    overwrite=True,
)
if not response.ok:
    print(response.text)
else:
    print(response)

  0%|          | 0/168 [00:00<?, ?it/s]

100%|██████████| 168/168 [00:05<00:00, 31.61it/s]


<Response [200]>
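
The batching step inside `upload_files` can also be viewed in isolation. The sketch below uses a hypothetical generator, `batched_paths`, to split a list of paths into groups of at most `batch_size`; it only illustrates the chunking idea and performs no upload.

```python
from collections.abc import Iterator
from itertools import islice
from pathlib import Path


def batched_paths(paths: list[Path], batch_size: int) -> Iterator[list[Path]]:
    """Yield consecutive groups of at most `batch_size` paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    # islice pulls up to batch_size items; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

With the 168 files from the run above and `batch_size=100`, this would produce one batch of 100 paths followed by one batch of 68.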


## Build an Index

After data files have been uploaded, we can construct a knowledge graph by building a search index.

In [49]:
def build_index(
    storage_name: str,
    index_name: str,
) -> requests.Response:
    """Create a search index.
    This function kicks off a job that builds a knowledge graph index from files located in a blob storage container.
    """
    url = endpoint + "/index"
    request = {"storage_name": storage_name, "index_name": index_name}
    return requests.post(url, params=request, headers=headers)


response = build_index(storage_name=storage_name, index_name=index_name)
print(response)
if response.ok:
    print(response.text)
else:
    print(f"Failed to submit job.\nStatus: {response.text}")

<Response [202]>
{"detail":"Index 'test-01' already exists and has not finished building."}


### Check status of an indexing job

Wait for the index to reach 100 percent completion before continuing to the next section (running queries). You may rerun the next cell as often as needed to monitor the status. Note: graphrag indexing speed depends directly on the tokens-per-minute (TPM) quota of the Azure OpenAI model you are using.

In [52]:
def index_status(index_name: str) -> requests.Response:
    url = endpoint + f"/index/status/{index_name}"
    return requests.get(url, headers=headers)


response = index_status(index_name)
pprint(response.json())

{
    'status_code': 200,
    'index_name': 'test-01',
    'storage_name': 'data',
    'status': 'complete',
    'percent_complete': 100.0,
    'progress': '16 out of 16 workflows completed successfully.',
}
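
Instead of rerunning the cell by hand, the status can be polled in a loop. The following is a minimal sketch, assuming the status payload has the shape shown above (`status` and `percent_complete`). The helper name `wait_for_index`, the polling interval, and the treatment of a `"failed"` status as terminal are assumptions of this sketch, not documented API behavior.

```python
import time
from collections.abc import Callable


def wait_for_index(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> dict:
    """Poll `fetch_status` until it reports completion, then return the payload."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status.get("status") == "complete":
            return status
        # assumption: a "failed" status means the job will not recover
        if status.get("status") == "failed":
            raise RuntimeError(f"Indexing failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Index did not complete before the timeout.")
        print(f"{status.get('percent_complete', 0)}% complete, waiting...")
        time.sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_index(lambda: index_status(index_name).json())`. Passing the fetch step as a callable keeps the loop independent of `requests`.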


## Query

Once an indexing job is complete, the knowledge graph is ready to query. Two types of queries (global and local) are currently supported. We encourage you to try both and compare the responses. Note that query response time also depends on the TPM quota of the Azure OpenAI model you are using.

In [32]:
# a helper function to parse out the result from a query response
def parse_query_response(
    response: requests.Response, return_context_data: bool = False
) -> requests.Response | dict[str, list[dict]]:
    """
    Print the `result` field of a successful query response.

    Returns the `context_data` dict if `return_context_data` is True and the
    request succeeded; otherwise returns the response unchanged.
    """
    if response.ok:
        body = response.json()
        print(body["result"])
        if return_context_data:
            return body["context_data"]
        return response
    # request failed: show the reason and raw body for debugging
    print(response.reason)
    print(response.content)
    return response
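
The answers returned in this notebook embed citation markers such as `[Data: Reports (4)]` or `[Data: Entities (27, 28); Relationships (39, 43, 44, 40)]`. As an illustration, the sketch below collects those ids per source type with regular expressions. The helper `extract_citations` is hypothetical, and the pattern is inferred from the sample outputs in this notebook rather than from a formal specification.

```python
import re

# matches one "[Data: ...]" marker and captures its inner text
CITATION_PATTERN = re.compile(r"\[Data: ([^\]]+)\]")


def extract_citations(result_text: str) -> dict[str, set[int]]:
    """Map each cited source type (e.g. 'Reports') to the set of cited ids."""
    citations: dict[str, set[int]] = {}
    for block in CITATION_PATTERN.findall(result_text):
        # a block looks like "Entities (27, 28); Relationships (39, 43)"
        for source, ids in re.findall(r"(\w+) \(([^)]*)\)", block):
            numbers = {int(n) for n in re.findall(r"\d+", ids)}
            citations.setdefault(source, set()).update(numbers)
    return citations
```

The resulting ids can be matched against the `id` fields in the returned context data (for example, the entries under `reports`) to trace which records supported an answer.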

### Global Query 

Global queries are resource-intensive, but provide good responses to questions that require an understanding of the dataset as a whole.

In [53]:
%%time


def global_search(
    index_name: str | list[str], query: str, community_level: int
) -> requests.Response:
    """Run a global query over the knowledge graph(s) associated with one or more indexes"""
    url = endpoint + "/query/global"
    # optional parameter: community level to query the graph at (default for global query = 1)
    request = {
        "index_name": index_name,
        "query": query,
        "community_level": community_level,
    }
    return requests.post(url, json=request, headers=headers)


# perform a global query
global_response = global_search(
    index_name=index_name,
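    query="Summarize the main topics found in this data",
    community_level=1,
)
global_response_data = parse_query_response(global_response, return_context_data=True)
global_response_data

## Overview of the Main Topics

This dataset contains information on the construction features of residential areas, the role of after-sales service, challenges specific to cold regions, ventilation system maintenance, and the installation and maintenance of gas water heaters. Details on each topic follow.

### Construction Features of Residential Areas

The ハイム and デシオGT residential areas have unique construction features, such as walls and ceilings built with gypsum board. These materials require specific construction and maintenance guidelines to ensure structural integrity and safety [Data: Reports (4)].

### Role of After-Sales Service

The after-sales service department plays an important role in providing support and consultation on installations and structural changes. This includes ensuring the safety and efficiency of systems such as gas water heaters and floor reinforcement [Data: Reports (9)].

### Challenges Specific to Cold Regions

In cold regions such as Hokkaido, specific measures are required to prevent water pipes, meters, and heaters from freezing. These measures maintain functionality and prevent damage [Data: Reports (1)].

### Ventilation System Maintenance

The Type 1 ventilation system, also known as エアファクトリー, is a comprehensive ventilation system used in residential areas such as ハイム and デシオGT. Regular maintenance of components such as filters is required to ensure air quality and comfort [Data: Reports (8)].

### Installation and Maintenance of Gas Water Heaters

Installers are responsible for installing and maintaining gas water heaters, and work closely with the after-sales service department to ensure efficient operation and address failures [Data: Reports (11)].

These topics highlight key factors in ensuring the safety, efficiency, and comfort of homes.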
CPU times: user 23.3 ms, sys: 1.41 ms, total: 24.7 ms
Wall time: 13.8 s


{'reports': [{'id': '4',
   'title': 'ハイム and デシオGT Residential Areas',
   'occurrence weight': 1.0,
   'content': "# ハイム and デシオGT Residential Areas\n\nThe community is centered around the residential areas of ハイム and デシオGT, which are characterized by specific construction and maintenance requirements. These areas utilize gypsum board construction and have detailed guidelines for ventilation systems and air conditioner installations, involving entities like ツーユー and エアコン室内機.\n\n## Unique construction features of ハイム and デシオGT\n\nBoth ハイム and デシオGT are residential areas with unique construction features that involve the use of gypsum board for walls and ceilings. This construction method requires specific attention to the positioning of wood studs and the fixing of nails or screws to ensure structural integrity and safety. The use of gypsum board necessitates adherence to specific guidelines to maintain the durability and safety of the structures [Data: Entities (27, 28); Relationships

### Local Query

Local queries are best suited to narrowly focused questions that require an understanding of specific entities mentioned in the documents (e.g. "What are the healing properties of chamomile?").

In [54]:
%%time


def local_search(
    index_name: str | list[str], query: str, community_level: int
) -> requests.Response:
    """Run a local query over the knowledge graph(s) associated with one or more indexes"""
    url = endpoint + "/query/local"
    # optional parameter: community level to query the graph at (default for local query = 2)
    request = {
        "index_name": index_name,
        "query": query,
        "community_level": community_level,
    }
    return requests.post(url, json=request, headers=headers)


# perform a local query
local_response = local_search(
    index_name=index_name,
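    query="Summarize the main topics found in this data",
    community_level=2,
)
local_response_data = parse_query_response(local_response, return_context_data=True)
local_response_data

The main topics based on the provided data are summarized below.

### 1. ハイム and デシオGT Residential Areas

ハイム and デシオGT are residential areas with specific construction and maintenance requirements. In these areas, walls and ceilings are built with gypsum board, which requires special attention to the positioning of wood studs and the fixing of nails and screws. This construction method requires adherence to specific guidelines to maintain the durability and safety of the structures [Data: Entities (27, 28); Relationships (39, 43, 44, 40)].

### 2. Ventilation Systems and Air Conditioning Equipment

These residential areas use Type 1 ventilation systems, which are important for maintaining residents' comfort and air quality. These systems are essential to the living environment and require regular maintenance [Data: Relationships (45, 42)]. In addition, installing air conditioner indoor units requires specific materials and methods, and materials such as デコスケ and ツーユー are used [Data: Entities (64); Relationships (83, 61, 92, 93, 94)].

### 3. Role of ツーユー

ツーユー plays an important role in the construction of the residential areas, providing the materials needed for wall construction and air conditioner installation. ツーユー is used in combination with gypsum board and is important for ensuring structural integrity and functionality [Data: Entities (42); Relationships (58, 59, 60, 61)].

### 4. Maintenance and Safety Guidelines

The ハイム and デシオGT residential areas have detailed maintenance and safety guidelines to ensure a comfortable and safe living environment. These guidelines cover aspects such as load limits for ceilings and walls and the maintenance of ventilation systems. Following these guidelines is important for maintaining the structural integrity and safety of the residential areas [Data: Entities (27, 28); Relationships (44, 40)].

These topics highlight important elements of the construction, equipment, and maintenance of the residential areas, and reflect efforts to improve residents' quality of life.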
CPU 

{'reports': [{'id': '4',
   'title': 'ハイム and デシオGT Residential Areas',
   'content': "# ハイム and デシオGT Residential Areas\n\nThe community is centered around the residential areas of ハイム and デシオGT, which are characterized by specific construction and maintenance requirements. These areas utilize gypsum board construction and have detailed guidelines for ventilation systems and air conditioner installations, involving entities like ツーユー and エアコン室内機.\n\n## Unique construction features of ハイム and デシオGT\n\nBoth ハイム and デシオGT are residential areas with unique construction features that involve the use of gypsum board for walls and ceilings. This construction method requires specific attention to the positioning of wood studs and the fixing of nails or screws to ensure structural integrity and safety. The use of gypsum board necessitates adherence to specific guidelines to maintain the durability and safety of the structures [Data: Entities (27, 28); Relationships (39, 43, 44, 40)].\n\n## Ven