# Query Text Data workflow with "Lazy GraphRAG" retrieval mechanism

Vector-based semantic search is a "best first" approach to finding relevant chunks: it simply takes the top k chunks most similar to the query, as measured by embedding distance. It has no sense of the breadth of semantic content in the dataset.
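
As a minimal sketch of this "best first" ranking, the snippet below scores toy vectors by cosine similarity and keeps the top k. The helper names (`cosine_similarity`, `top_k_chunks`) and the two-dimensional vectors are illustrative assumptions, not part of the toolkit, and real embeddings would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, cid_to_vector, k):
    """Return the k chunk ids whose vectors are most similar to the query."""
    ranked = sorted(
        cid_to_vector,
        key=lambda cid: cosine_similarity(query_vector, cid_to_vector[cid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d "embeddings" (hypothetical values, not real model output)
toy_vectors = {1: [1.0, 0.0], 2: [0.8, 0.6], 3: [0.0, 1.0]}
top_two = top_k_chunks([1.0, 0.1], toy_vectors, k=2)
print(top_two)
```

Note that the ranking only ever looks at similarity to the query, so chunks covering other parts of the dataset are never considered once k is reached.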

Full [GraphRAG](https://aka.ms/graphrag) is a "breadth first" approach to finding relevant content in community summaries: it always queries every community. It has no sense of which communities are most relevant to the query.

Lazy GraphRAG is a hybrid approach that requires no summarization-based index (which is expensive to build, especially if the user only has a few queries). Instead, it relies purely on an embedding-based index of the input text chunks (cheap, fast), augmented by a concept graph of noun-phrase co-occurrences extracted from these chunks (free, fast). It works as follows:

- Given a query, text chunks are ranked using vector-based semantic search.
- Chunk rankings are used to determine community rankings in the concept graph.
- In community rank order, the top k chunks of each community are subjected to a rapid relevance test (cheap, fast). If relevant chunks are found, the community is retained; otherwise it is discarded.
- After n successive communities fail to return any relevant chunks, the "iterative narrowing" process repeats with the remaining communities in rank order.
- The process terminates when no more testable chunks exist or a predefined testing budget is reached (e.g., 100 relevance tests).
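
The steps above can be sketched as a small loop over a flat set of communities. This is a simplified illustration under stated assumptions, not the toolkit's implementation: the names `rank_communities`, `lazy_search`, `community_to_cids`, and `is_relevant` are hypothetical, communities are ranked by their single best-ranked chunk, and the hierarchy levels and adjacent-chunk search used later in this notebook are omitted.

```python
def rank_communities(ranked_cids, community_to_cids):
    """Order communities by the best semantic-search rank of any member chunk."""
    position = {cid: i for i, cid in enumerate(ranked_cids)}
    return sorted(
        community_to_cids,
        key=lambda c: min(position.get(cid, len(position)) for cid in community_to_cids[c]),
    )


def lazy_search(ranked_cids, community_to_cids, is_relevant,
                budget, tests_per_community, restart_after):
    """Iteratively narrow the community list until the budget or the chunks run out."""
    position = {cid: i for i, cid in enumerate(ranked_cids)}
    communities = rank_communities(ranked_cids, community_to_cids)
    tested, relevant = set(), []
    while communities and len(tested) < budget:
        retained, misses = [], 0
        for idx, community in enumerate(communities):
            untested = sorted(
                (cid for cid in community_to_cids[community] if cid not in tested),
                key=lambda cid: position.get(cid, len(position)),
            )
            batch = untested[: min(tests_per_community, budget - len(tested))]
            if not batch:
                continue  # nothing left to test here (or budget spent): drop it
            tested.update(batch)
            hits = [cid for cid in batch if is_relevant(cid)]
            relevant.extend(hits)
            if hits:
                retained.append(community)  # keep communities that yield relevant chunks
                misses = 0
            else:
                misses += 1  # community is discarded
                if misses >= restart_after:
                    # Restart early, carrying over the communities not yet visited
                    retained.extend(communities[idx + 1:])
                    break
        communities = retained
    return relevant, len(tested)


# Toy data: in practice is_relevant would be a cheap LLM relevance test
toy_communities = {"A": [1, 2, 3], "B": [4, 5], "C": [6, 7]}
toy_ranking = [1, 4, 2, 6, 5, 3, 7]
truly_relevant = {1, 2, 3, 6}
found, tests_used = lazy_search(
    toy_ranking, toy_communities, lambda cid: cid in truly_relevant,
    budget=10, tests_per_community=2, restart_after=2,
)
print(found, tests_used)
```

In this toy run, community B is discarded after its first batch yields nothing, and the search stops once no retained community has untested chunks, before the budget of 10 is spent.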

The advantages of this approach include:

- A balance of best-first (semantic search) and breadth-first (community structure) approaches for any testing budget
- Very fast and almost free, making it suitable for one-off and exploratory queries before committing to full indexing
- Easily accommodates dynamic contexts where new texts/chunks are streaming into the dataset (supported in the current implementation)
- Provides a consistent level of "effort" for a fixed test budget, providing various standards of answers to questions (e.g., global questions) for which there is no ground truth (e.g., 1000 chunks = gold standard, 500 chunks = silver standard, 100 chunks = bronze standard, etc.)


In [156]:
import sys

sys.path.append("..")

In [157]:
import os
import asyncio
import io
import pandas as pd
from toolkit.AI.openai_configuration import OpenAIConfiguration
from toolkit.AI.openai_embedder import OpenAIEmbedder
from toolkit.helpers.constants import CACHE_PATH

import toolkit.query_text_data.input_processor as input_processor
import toolkit.query_text_data.relevance_assessor as relevance_assessor
import toolkit.query_text_data.answer_builder as answer_builder
import toolkit.query_text_data.graph_builder as graph_builder
import toolkit.query_text_data.pattern_detector as pattern_detector
import toolkit.graph.graph_fusion_encoder_embedding as gfee
import toolkit.query_text_data.helper_functions as helper_functions
import nest_asyncio # Necessary to run async code in ipynb

nest_asyncio.apply()

In [158]:
# Comment/uncomment the following lines to toggle access to library code updates
import importlib
importlib.reload(input_processor)
importlib.reload(pattern_detector)
importlib.reload(helper_functions)
importlib.reload(relevance_assessor)
importlib.reload(answer_builder)
importlib.reload(graph_builder)


<module 'toolkit.query_text_data.graph_builder' from 'c:\\Users\\daedge\\Code\\ITK\\intelligence-toolkit\\example_notebooks\\..\\toolkit\\query_text_data\\graph_builder.py'>

In [159]:
# Set up the AI model and embedding model
ai_configuration = OpenAIConfiguration(
    {
        "api_type": "OpenAI",
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-4o-2024-08-06",
    }
)

text_embedder = OpenAIEmbedder(
    configuration=ai_configuration,
)

In [160]:
# Provide text inputs as rows of a dataframe
input_path = "../example_outputs/query_text_data/news_articles/news_articles_texts.csv"
file_name = input_path.split("/")[-1]
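# Read the file inside a context manager so the file handle is closed promptly
with open(input_path, "rb") as input_file:
    file_to_chunks = input_processor.process_file_bytes({file_name: input_file.read()}, input_processor.PeriodOption.NONE)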
print("Chunked input text")

Chunked input text


In [161]:
max_cluster_size = 25 # The maximum number of concepts in a leaf-level community in the concept graph
# Process the chunks into the index data structures
(
    cid_to_text,
    text_to_cid,
    period_concept_graphs,
    hierarchical_communities,
    community_to_label,
    concept_to_cids,
    cid_to_concepts,
    previous_cid,
    next_cid,
    period_to_cids,
    node_period_counts,
    edge_period_counts,
) = input_processor.process_chunks(file_to_chunks=file_to_chunks, max_cluster_size=max_cluster_size)
print("Processed chunks")

Processed chunks
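
The internals of `input_processor.process_chunks` are not shown in this notebook. As a rough sketch of the idea behind a concept graph of noun-phrase co-occurrences, the snippet below counts how often pairs of concepts share a chunk. The function `build_cooccurrence_edges` and the toy concepts are hypothetical; noun-phrase extraction and community detection are deliberately left out.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_edges(cid_to_concepts):
    """Count, for each pair of concepts, the number of chunks containing both."""
    edge_counts = Counter()
    for concepts in cid_to_concepts.values():
        # Sort the unique concepts so each pair has one canonical (a, b) key
        for a, b in combinations(sorted(set(concepts)), 2):
            edge_counts[(a, b)] += 1
    return edge_counts


# Toy chunk-to-concept mapping (illustrative values only)
toy_cid_to_concepts = {
    1: ["town hall", "community meeting"],
    2: ["town hall", "mayoral debate"],
    3: ["community meeting", "town hall", "volunteer fair"],
}
edges = build_cooccurrence_edges(toy_cid_to_concepts)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

Edge weights of this kind could then feed a community detection step, which is where a parameter such as `max_cluster_size` would plausibly apply.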


In [162]:
# Embed the text chunks
cid_to_vector = await helper_functions.embed_texts(
    cid_to_text=cid_to_text,
    text_embedder=text_embedder,
)
print("Embedded chunk text")

Got 501 existing texts
Got 0 new texts
Embedded chunk text


In [163]:
question = "What events are discussed?"
relevance_test_budget = 100 # How many relevance tests are permitted per query. Higher values may provide higher quality results at higher cost
adjacent_search_steps = 1 # How many chunks before and after each relevant chunk to test, once the relevance test budget is nearly reached or the search process has terminated
community_relevance_tests = 5 # How many relevance tests to run on each community in turn
relevance_test_batch_size = 5 # How many relevance tests to run in parallel at a time
community_ranking_chunks = 5 # How many chunks to use to rank communities by relevance
irrelevant_community_restart = 3 # How many successive irrelevant communities trigger a restart of community testing in relevance order

# Mine relevant chunks to the question
async def mine_relevant_chunks():
    (
        relevant_cids,
        chunk_progress,
    ) = await relevance_assessor.detect_relevant_chunks(
        ai_configuration=ai_configuration,
        question=question,
        cid_to_text=cid_to_text,
        cid_to_concepts=cid_to_concepts,
        cid_to_vector=cid_to_vector,
        hierarchical_communities=hierarchical_communities,
        community_to_label=community_to_label,
        previous_cid=previous_cid,
        next_cid=next_cid,
        embedder=text_embedder,
        embedding_cache=CACHE_PATH,
        select_logit_bias=5,
        adjacent_search_steps=adjacent_search_steps,
        community_ranking_chunks=community_ranking_chunks,
        relevance_test_budget=relevance_test_budget,
        community_relevance_tests=community_relevance_tests,
        relevance_test_batch_size=relevance_test_batch_size,
        irrelevant_community_restart=irrelevant_community_restart,
    )
    return relevant_cids, chunk_progress

if __name__ == "__main__":
    relevant_cids, chunk_progress = await mine_relevant_chunks()
    print("Mined relevant chunks")
    


Semantic search cids: [213, 50, 104, 359, 38, 449, 76, 178, 182, 314, 344, 376, 237, 484, 464, 181, 29, 96, 324, 144, 93, 202, 150, 399, 279, 155, 179, 441, 364, 370, 62, 229, 381, 56, 446, 183, 189, 310, 409, 470, 55, 267, 400, 64, 457, 196, 388, 90, 485, 380, 138, 290, 341, 4, 224, 31, 47, 219, 24, 251, 239, 244, 109, 440, 348, 84, 180, 296, 284, 130, 168, 173, 112, 188, 65, 233, 382, 349, 164, 442, 418, 14, 304, 493, 216, 277, 184, 468, 192, 278, 408, 166, 443, 165, 72, 20, 105, 412, 325, 379]
Level 0 community sequence: ['1.4', '1.1', '1.7', '1.10', '1.5', '1.3', '1.13', '1.6', '1.8', '1.18', '1.9', '1.11', '1.17', '1.15', '1.19', '1.14', '1.2', '1.16', '1.12']
Level 1 community sequence: ['1.1.3', '1.1.1', '1.10.1', '1.4.3', '1.4.1', '1.4.6', '1.1.6', '1.1.4', '1.1.2', '1.4.2', '1.4.5', '1.4.4', '1.3.1', '1.13.2', '1.1.5', '1.1.7', '1.10.2', '1.18', '1.3.4', '1.9.2', '1.5.1', '1.6.3', '1.10.3', '1.8.5', '1.13.3', '1.3.3', '1.5.5', '1.17.3', '1.7.18', '1.6.5', '1.5.2', '1.10.5', '1

100%|██████████| 5/5 [00:00<00:00,  5.69it/s]


Batch 1 of 1: 5 relevant chunks
Community 1 relevant? True
Incrementing level
New level 0 loop after 5 tests
Community sequence: ['1.4', '1.1', '1.7', '1.10', '1.5', '1.3', '1.13', '1.6', '1.8', '1.18', '1.9', '1.11', '1.17', '1.15', '1.19', '1.14', '1.2', '1.16', '1.12']
Assessing relevance for community 1.4 with chunks [314, 344, 376, 484, 464]
Assessing relevance for topic 1.4 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  4.41it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.4 relevant? True
Assessing relevance for community 1.1 with chunks [449, 76, 178, 182, 181]
Assessing relevance for topic 1.1 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  4.95it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.1 relevant? True
Assessing relevance for community 1.7 with chunks [237, 196, 388, 168, 166]
Assessing relevance for topic 1.7 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  8.21it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.7 relevant? True
Assessing relevance for community 1.10 with chunks [399, 90, 290, 109, 277]
Assessing relevance for topic 1.10 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  7.91it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.10 relevant? True
Assessing relevance for community 1.5 with chunks [267, 65, 184, 343, 297]
Assessing relevance for topic 1.5 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  4.71it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.5 relevant? True
Assessing relevance for community 1.3 with chunks [188, 468, 408, 347, 483]
Assessing relevance for topic 1.3 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  8.84it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.3 relevant? True
Assessing relevance for community 1.13 with chunks [179, 485, 141, 120]
Assessing relevance for topic 1.13 with 4 chunks


100%|██████████| 4/4 [00:00<00:00,  6.96it/s]


Batch 1 of 1: 4 relevant chunks
Community 1.13 relevant? True
Assessing relevance for community 1.6 with chunks [341, 445, 252, 320, 274]
Assessing relevance for topic 1.6 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  5.56it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.6 relevant? True
Assessing relevance for community 1.8 with chunks [433, 363, 32, 19, 447]
Assessing relevance for topic 1.8 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  5.59it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.8 relevant? True
Assessing relevance for community 1.18 with chunks [373]
Assessing relevance for topic 1.18 with 1 chunks


100%|██████████| 1/1 [00:00<00:00,  1.60it/s]


Batch 1 of 1: 1 relevant chunks
Community 1.18 relevant? True
Assessing relevance for community 1.9 with chunks [278, 162, 281, 346, 421]
Assessing relevance for topic 1.9 with 5 chunks


100%|██████████| 5/5 [00:02<00:00,  1.72it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.9 relevant? True
Assessing relevance for community 1.11 with chunks [272, 301, 355, 142]
Assessing relevance for topic 1.11 with 4 chunks


100%|██████████| 4/4 [00:00<00:00,  7.35it/s]


Batch 1 of 1: 4 relevant chunks
Community 1.11 relevant? True
Assessing relevance for community 1.17 with chunks [374, 58, 40, 488, 236]
Assessing relevance for topic 1.17 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  7.42it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.17 relevant? True
Assessing relevance for community 1.15 with chunks [306, 139, 100, 103]
Assessing relevance for topic 1.15 with 4 chunks


100%|██████████| 4/4 [00:00<00:00,  4.49it/s]


Batch 1 of 1: 3 relevant chunks
Community 1.15 relevant? True
Assessing relevance for community 1.14 with chunks [151, 234, 437, 207, 390]
Assessing relevance for topic 1.14 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  6.77it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.14 relevant? True
Assessing relevance for community 1.2 with chunks [266, 472, 435, 185, 286]
Assessing relevance for topic 1.2 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  5.59it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.2 relevant? True
Assessing relevance for community 1.16 with chunks [1, 396]
Assessing relevance for topic 1.16 with 2 chunks


100%|██████████| 2/2 [00:00<00:00,  3.87it/s]


Batch 1 of 1: 2 relevant chunks
Community 1.16 relevant? True
Assessing relevance for community 1.12 with chunks [137, 392, 92, 57]
Assessing relevance for topic 1.12 with 4 chunks


100%|██████████| 4/4 [00:00<00:00,  7.75it/s]


Batch 1 of 1: 4 relevant chunks
Community 1.12 relevant? True
Incrementing level
New level 1 loop after 84 tests
Community sequence: ['1.1.3', '1.1.1', '1.10.1', '1.4.3', '1.4.1', '1.4.6', '1.1.6', '1.1.4', '1.1.2', '1.4.2', '1.4.5', '1.4.4', '1.3.1', '1.13.2', '1.1.5', '1.1.7', '1.10.2', '1.18', '1.3.4', '1.9.2', '1.5.1', '1.6.3', '1.10.3', '1.8.5', '1.13.3', '1.3.3', '1.5.5', '1.17.3', '1.7.18', '1.6.5', '1.5.2', '1.10.5', '1.7.25', '1.11.1', '1.8.2', '1.5.4', '1.10.4', '1.19', '1.14.5', '1.7.5', '1.11.2', '1.6.1', '1.8.1', '1.6.2', '1.5.6', '1.11.4', '1.7.21', '1.11.6', '1.7.9', '1.7.4', '1.13.5', '1.11.3', '1.13.4', '1.4.7', '1.13.1', '1.3.2', '1.7.1', '1.10.6', '1.15.4', '1.5.3', '1.9.3', '1.15.2', '1.12.2', '1.5.8', '1.7.19', '1.2.1', '1.8.3', '1.2.2', '1.7.24', '1.7.6', '1.11.5', '1.14.2', '1.17.5', '1.2.4', '1.5.9', '1.7.28', '1.17.1', '1.16.1', '1.7.10', '1.2.5', '1.7.8', '1.9.5', '1.7.17', '1.9.1', '1.5.7', '1.8.6', '1.9.4', '1.3.5', '1.2.6', '1.14.1', '1.14.3', '1.6.6', '1.1

100%|██████████| 5/5 [00:00<00:00,  6.30it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.1.3 relevant? True
Assessing relevance for community 1.1.1 with chunks [29, 144, 441, 229, 446]
Assessing relevance for topic 1.1.1 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  8.40it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.1.1 relevant? True
Assessing relevance for community 1.10.1 with chunks [324, 155, 14, 192, 70]
Assessing relevance for topic 1.10.1 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  7.46it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.10.1 relevant? True
Assessing relevance for community 1.4.3 with chunks [364, 310, 55, 224, 440]
Assessing relevance for topic 1.4.3 with 5 chunks


100%|██████████| 1/1 [00:00<00:00,  1.71it/s]


Batch 1 of 1: 1 relevant chunks
Community 1.4.3 relevant? True
Assessing relevance for community 1.4.1 with chunks [189, 112, 304, 153, 288]
Assessing relevance for topic 1.4.1 with 5 chunks


0it [00:00, ?it/s]


Batch 1 of 1: 0 relevant chunks
Community 1.4.1 relevant? False
Assessing relevance for community 1.4.6 with chunks [279, 477, 113, 45]
Assessing relevance for topic 1.4.6 with 4 chunks


0it [00:00, ?it/s]


Batch 1 of 1: 0 relevant chunks
Community 1.4.6 relevant? False
Assessing relevance for community 1.1.6 with chunks [470, 337, 261, 119, 132]
Assessing relevance for topic 1.1.6 with 5 chunks


0it [00:00, ?it/s]

Batch 1 of 1: 0 relevant chunks
Community 1.1.6 relevant? False
0 successive irrelevant communities; restarting
Incrementing level
Assessing relevance for neighbours with 0 chunks
[('topic 1.10.1', 192, 'Yes'), ('topic 1.10.1', 70, 'Yes'), ('topic 1.4.3', 364, 'Yes')]
**Chunks relevant/tested: 99/100** (topic 1: 5/5; topic 1.4: 5/5; topic 1.1: 5/5; topic 1.7: 5/5; topic 1.10: 5/5; topic 1.5: 5/5; topic 1.3: 5/5; topic 1.13: 4/4; topic 1.6: 5/5; topic 1.8: 5/5; topic 1.18: 1/1; topic 1.9: 5/5; topic 1.11: 4/4; topic 1.17: 5/5; topic 1.15: 3/4; topic 1.14: 5/5; topic 1.2: 5/5; topic 1.16: 2/2; topic 1.12: 4/4; topic 1.1.3: 5/5; topic 1.1.1: 5/5; topic 1.10.1: 5/5; topic 1.4.3: 1/1)
Mined relevant chunks





In [164]:
answer_batch_size = 10 # How many chunks to integrate into the answer at a time
# Generate an extended answer to the question, which could then be summarized
(
    partial_answers,
    answer_progress,
) = answer_builder.answer_question(
    ai_configuration=ai_configuration,
    question=question,
    relevant_cids=relevant_cids,
    cid_to_text=cid_to_text,
    answer_batch_size=answer_batch_size,
)
print("Answered question")

Answered question
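
How `answer_builder.answer_question` integrates chunks internally is not shown here. One plausible reading of `answer_batch_size` is that the relevant chunk ids are split into consecutive batches, sketched below with a hypothetical helper `batch_cids` and placeholder ids.

```python
def batch_cids(relevant_cids, batch_size):
    """Split a list of chunk ids into consecutive batches of at most batch_size."""
    return [
        relevant_cids[start : start + batch_size]
        for start in range(0, len(relevant_cids), batch_size)
    ]


# 99 placeholder chunk ids, matching the number of relevant chunks mined above
toy_relevant_cids = list(range(99))
batches = batch_cids(toy_relevant_cids, 10)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

With 99 relevant chunks and a batch size of 10, this gives ten batches with nine chunks in the last one, which is consistent with the per-answer counts in the progress summary printed in the next cell.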


In [165]:
# Output some process summary information
print(chunk_progress)
print(answer_progress)

**Chunks relevant/tested: 99/100** (topic 1: 5/5; topic 1.4: 5/5; topic 1.1: 5/5; topic 1.7: 5/5; topic 1.10: 5/5; topic 1.5: 5/5; topic 1.3: 5/5; topic 1.13: 4/4; topic 1.6: 5/5; topic 1.8: 5/5; topic 1.18: 1/1; topic 1.9: 5/5; topic 1.11: 4/4; topic 1.17: 5/5; topic 1.15: 3/4; topic 1.14: 5/5; topic 1.2: 5/5; topic 1.16: 2/2; topic 1.12: 4/4; topic 1.1.3: 5/5; topic 1.1.1: 5/5; topic 1.10.1: 5/5; topic 1.4.3: 1/1)
**Chunks referenced/relevant: 84/99** (answer 1: 10/10, answer 2: 8/10, answer 3: 9/10, answer 4: 8/10, answer 5: 9/10, answer 6: 9/10, answer 7: 9/10, answer 8: 9/10, answer 9: 7/10, answer 10: 6/9)


In [166]:
# Output the final extended answer
final_answer = partial_answers[-1]
print(final_answer)

# A Review of Recent Notable Events

*In response to: What events are discussed?*

## Introduction

This report provides a comprehensive overview of several significant events that have taken place recently, categorized into community, economic, sports, cultural, and environmental themes. 

**Community Events**
- Community Meeting at Greenwood Town Hall
- Community Center Hosts Successful Volunteer Fair
- Springfield Community Festival
- Mayoral Candidates Debate at City Hall
- Joint Effort for Community Betterment Launched
- Riverside Park Cleanup Drive: A Community Unites for the Environment
- Downtown Park Festival Kicks Off with a Grand Opening Ceremony
- Annual Charity Gala Raises Funds for Local Causes
- Hope Fundraiser Gala Raises Millions for Charity
- Regional Leaders Drive Community Engagement Efforts
- Main Street Comes Alive with Flavors: Community Cook-Off Raises Funds for Local Food Bank
- Helping Hands Hosts Valentine's Day Service Event

**Economic Forums**
- Global Eco