# Query Text Data workflow with "Lazy GraphRAG" retrieval mechanism

Vector-based semantic search is a "best first" search approach for relevant chunks - it simply takes the top k most similar chunks based on embedding distances. It has no sense of the breadth of semantic content in the dataset.
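
As a rough illustration of this "best first" behaviour, the sketch below ranks chunks by cosine similarity to a query vector and keeps the top k. The function names and toy vectors are hypothetical and chosen only for illustration; this is not the toolkit's own search code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, chunk_vectors, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vectors,
        key=lambda cid: cosine_similarity(query_vector, chunk_vectors[cid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d "embeddings" keyed by chunk id (illustrative values only)
toy_vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.7, 0.7]}
top_two = top_k_chunks([1.0, 0.0], toy_vectors, k=2)
```

Note that the result depends only on each chunk's distance to the query; nothing in the ranking reflects how the chunks relate to one another.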

Full [GraphRAG](https://aka.ms/graphrag) is a "breadth first" search approach for relevant content in community summaries - it always queries each community. It has no sense of which communities are best.

Lazy GraphRAG is a hybrid approach that requires no summarization-based index (which is expensive, especially if the user only has a few queries). Instead, it relies purely on an embedding-based index for input text chunks (cheap, fast) augmented by the concept graph of noun-phrase cooccurrences extracted from these chunks (free, fast). It works as follows:
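
To make the "concept graph of noun-phrase cooccurrences" more concrete, here is a minimal sketch that counts how often pairs of concepts appear in the same chunk. It assumes the noun phrases have already been extracted and are supplied as a mapping from chunk id to a list of concepts; the function name and toy data are hypothetical, and the toolkit's own graph construction may differ.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_edges(chunk_concepts):
    """Count, for each unordered concept pair, the chunks containing both."""
    edge_counts = Counter()
    for concepts in chunk_concepts.values():
        # Deduplicate and sort so each pair has one canonical ordering
        for a, b in combinations(sorted(set(concepts)), 2):
            edge_counts[(a, b)] += 1
    return edge_counts


# Toy input: chunk id -> noun phrases assumed to be found in that chunk
toy_chunk_concepts = {
    0: ["city council", "budget vote"],
    1: ["city council", "budget vote", "public transit"],
    2: ["public transit", "fare increase"],
}
edges = build_cooccurrence_edges(toy_chunk_concepts)
```

The edge weights could then feed a community detection step; no LLM calls are needed at any point, which is why this part of the index is described as free.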

- Given a query, text chunks are ranked using vector-based semantic search.
- Chunk rankings are used to determine community rankings in the concept graph.
- In community rank order, the top k chunks of each community are subjected to a rapid relevance test (cheap, fast). If relevant chunks are found, the community is retained; otherwise it is discarded.
- After n communities fail to return any relevant chunks, the "iterative narrowing" process repeats with remaining communities in rank order.
- The process terminates when no more testable chunks exist or a predefined testing budget is reached (e.g., 100 relevance tests).
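
The steps above can be sketched as a simplified loop. This is a sketch under stated assumptions rather than the toolkit's implementation: communities are treated as a flat mapping instead of a hierarchy, `is_relevant` is a stand-in for the LLM relevance test, "remaining communities" is interpreted as every community not yet discarded, and all function and parameter names are hypothetical.

```python
def lazy_relevance_search(ranked_cids, community_to_cids, is_relevant,
                          tests_per_community=5, budget=100, restart_after=3):
    """Test chunks community by community until the budget is spent.

    ranked_cids: chunk ids ordered best first by semantic search.
    community_to_cids: community id -> chunk ids (flat; an assumption).
    is_relevant: callable(cid) -> bool, standing in for the LLM test.
    """
    rank = {cid: position for position, cid in enumerate(ranked_cids)}
    tested, relevant = set(), []
    active = set(community_to_cids)

    def untested_in(community):
        pending = [cid for cid in community_to_cids[community]
                   if cid in rank and cid not in tested]
        return sorted(pending, key=rank.__getitem__)

    while len(tested) < budget:
        # Rank communities by their best-ranked untested chunk
        candidates = [c for c in active if untested_in(c)]
        if not candidates:
            break  # no more testable chunks
        candidates.sort(key=lambda c: rank[untested_in(c)[0]])
        failures = 0
        for community in candidates:
            remaining_budget = budget - len(tested)
            if remaining_budget <= 0:
                break
            batch = untested_in(community)[:min(tests_per_community, remaining_budget)]
            if not batch:
                continue
            tested.update(batch)
            hits = [cid for cid in batch if is_relevant(cid)]
            relevant.extend(hits)
            if not hits:
                active.discard(community)  # discard irrelevant community
                failures += 1
                if failures >= restart_after:
                    break  # restart from the top of the ranking
    return relevant, len(tested)


# Toy run: three communities, four relevant chunks, budget of 8 tests
toy_communities = {"a": [0, 1, 2, 3], "b": [4, 5], "c": [6, 7, 8]}
found, tests_used = lazy_relevance_search(
    ranked_cids=list(range(9)),
    community_to_cids=toy_communities,
    is_relevant=lambda cid: cid in {0, 2, 3, 7},
    tests_per_community=2, budget=8, restart_after=2,
)
```

In the toy run, community "b" is discarded after its first batch yields nothing, so the second pass spends the remaining budget on "a", which ranks ahead of the other retained community.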

The advantages of this approach include:

- A balance of best-first (semantic search) and breadth-first (community structure) approaches for any testing budget
- Very fast and almost free, making it suitable for one-off and exploratory queries before committing to full indexing
- Easily accommodates dynamic contexts where new texts/chunks are streaming into the dataset (supported in the current implementation)
- Delivers a consistent level of "effort" for a fixed test budget, so that different budgets yield different standards of answer to questions (e.g., global questions) for which there is no ground truth (e.g., 1000 chunks = gold standard, 500 chunks = silver standard, 100 chunks = bronze standard, etc.)


In [145]:
import sys

sys.path.append("..")

In [146]:
import os
import asyncio
import io
import pandas as pd
from toolkit.AI.openai_configuration import OpenAIConfiguration
from toolkit.AI.openai_embedder import OpenAIEmbedder
from toolkit.helpers.constants import CACHE_PATH

import toolkit.query_text_data.input_processor as input_processor
import toolkit.query_text_data.relevance_assessor as relevance_assessor
import toolkit.query_text_data.answer_builder as answer_builder
import toolkit.query_text_data.graph_builder as graph_builder
import toolkit.query_text_data.pattern_detector as pattern_detector
import toolkit.graph.graph_fusion_encoder_embedding as gfee
import toolkit.query_text_data.helper_functions as helper_functions
import nest_asyncio # Necessary to run async code in ipynb

nest_asyncio.apply()

In [147]:
# Comment/uncomment the following lines to toggle access to library code updates
import importlib
importlib.reload(input_processor)
importlib.reload(pattern_detector)
importlib.reload(helper_functions)
importlib.reload(relevance_assessor)
importlib.reload(answer_builder)
importlib.reload(graph_builder)

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\daedge\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\daedge\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\daedge\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


<module 'toolkit.query_text_data.graph_builder' from 'c:\\Users\\daedge\\Code\\ITK\\intelligence-toolkit\\example_notebooks\\..\\toolkit\\query_text_data\\graph_builder.py'>

In [148]:
# Set up the AI model and embedding model
ai_configuration = OpenAIConfiguration(
    {
        "api_type": "OpenAI",
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-4o-2024-08-06",
    }
)

text_embedder = OpenAIEmbedder(
    configuration=ai_configuration,
)

In [149]:
# Provide text inputs as rows of a dataframe
input_path = "../example_outputs/query_text_data/news_articles/news_articles_texts.csv"
file_name = input_path.split("/")[-1]
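# Read the file inside a context manager so the handle is closed promptly
with open(input_path, "rb") as input_file:
    file_bytes = input_file.read()
file_to_chunks = input_processor.process_file_bytes({file_name: file_bytes}, input_processor.PeriodOption.NONE)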
print("Chunked input text")

Chunked input text


In [150]:
max_cluster_size = 25 # The maximum number of concepts in a leaf-level community in the concept graph
# Process the chunks into the index data structures
(
    cid_to_text,
    text_to_cid,
    period_concept_graphs,
    hierarchical_communities,
    community_to_label,
    concept_to_cids,
    cid_to_concepts,
    previous_cid,
    next_cid,
    period_to_cids,
    node_period_counts,
    edge_period_counts,
) = input_processor.process_chunks(file_to_chunks=file_to_chunks, max_cluster_size=max_cluster_size)
print("Processed chunks")

Processed chunks


In [151]:
# Embed the text chunks
cid_to_vector = await helper_functions.embed_texts(
    cid_to_text=cid_to_text,
    text_embedder=text_embedder,
)
print("Embedded chunk text")

Got 501 existing texts
Got 0 new texts
Embedded chunk text


In [152]:
question = "What events are discussed?"
relevance_test_budget = 100 # How many relevance tests are permitted per query. Higher values may provide higher quality results at higher cost
adjacent_search_steps = 1 # How many chunks before and after each relevant chunk to test, once the relevance test budget is nearly reached or the search process has terminated
community_relevance_tests = 5 # How many relevance tests to run on each community in turn
relevance_test_batch_size = 5 # How many relevance tests to run in parallel at a time
community_ranking_chunks = 5 # How many chunks to use to rank communities by relevance
irrelevant_community_restart = 3 # How many irrelevant communities to encounter before restarting testing of communities in relevance order

# Mine chunks relevant to the question
async def mine_relevant_chunks():
    (
        relevant_cids,
        chunk_progress,
    ) = await relevance_assessor.detect_relevant_chunks(
        ai_configuration=ai_configuration,
        question=question,
        cid_to_text=cid_to_text,
        cid_to_concepts=cid_to_concepts,
        cid_to_vector=cid_to_vector,
        hierarchical_communities=hierarchical_communities,
        community_to_label=community_to_label,
        previous_cid=previous_cid,
        next_cid=next_cid,
        embedder=text_embedder,
        embedding_cache=CACHE_PATH,
        select_logit_bias=5,
        adjacent_search_steps=adjacent_search_steps,
        community_ranking_chunks=community_ranking_chunks,
        relevance_test_budget=relevance_test_budget,
        community_relevance_tests=community_relevance_tests,
        relevance_test_batch_size=relevance_test_batch_size,
        irrelevant_community_restart=irrelevant_community_restart,
    )
    return relevant_cids, chunk_progress

if __name__ == "__main__":
    relevant_cids, chunk_progress = await mine_relevant_chunks()
    print("Mined relevant chunks")


New level -1 loop after 0 tests
Assessing relevance for topic 1 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  5.04it/s]


Batch 1 of 1: 5 relevant chunks
Community 1 relevant? True
Incrementing level
New level 0 loop after 5 tests
Assessing relevance for topic 1.4 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  2.68it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.4 relevant? True
Assessing relevance for topic 1.1 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  8.18it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.1 relevant? True
Assessing relevance for topic 1.7 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  7.13it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.7 relevant? True
Assessing relevance for topic 1.10 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  9.10it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.10 relevant? True
Assessing relevance for topic 1.5 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  8.28it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.5 relevant? True
Assessing relevance for topic 1.3 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  3.08it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.3 relevant? True
Assessing relevance for topic 1.13 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  9.07it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.13 relevant? True
Assessing relevance for topic 1.6 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  4.45it/s]


Batch 1 of 1: 3 relevant chunks
Community 1.6 relevant? True
Assessing relevance for topic 1.8 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  5.33it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.8 relevant? True
Assessing relevance for topic 1.18 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  6.57it/s]


Batch 1 of 1: 2 relevant chunks
Community 1.18 relevant? True
Assessing relevance for topic 1.9 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  4.33it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.9 relevant? True
Assessing relevance for topic 1.11 with 5 chunks


100%|██████████| 5/5 [00:02<00:00,  1.72it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.11 relevant? True
Assessing relevance for topic 1.17 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  5.57it/s]


Batch 1 of 1: 4 relevant chunks
Community 1.17 relevant? True
Assessing relevance for topic 1.15 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  6.68it/s]


Batch 1 of 1: 3 relevant chunks
Community 1.15 relevant? True
Assessing relevance for topic 1.19 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  4.42it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.19 relevant? True
Assessing relevance for topic 1.14 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  7.98it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.14 relevant? True
Assessing relevance for topic 1.2 with 5 chunks


100%|██████████| 5/5 [00:01<00:00,  4.27it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.2 relevant? True
Assessing relevance for topic 1.16 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  7.10it/s]


Batch 1 of 1: 5 relevant chunks
Community 1.16 relevant? True
Assessing relevance for topic 1.12 with 5 chunks


100%|██████████| 5/5 [00:00<00:00,  5.81it/s]

Batch 1 of 1: 5 relevant chunks
Community 1.12 relevant? True
Incrementing level
Assessing relevance for neighbours with 0 chunks
[('topic 1.12', 392, 'Yes'), ('topic 1.12', 137, 'Yes'), ('topic 1.12', 268, 'Yes')]
**Chunks relevant/tested: 92/100** (topic 1: 5/5; topic 1.4: 5/5; topic 1.1: 5/5; topic 1.7: 5/5; topic 1.10: 5/5; topic 1.5: 5/5; topic 1.3: 5/5; topic 1.13: 5/5; topic 1.6: 3/5; topic 1.8: 5/5; topic 1.18: 2/5; topic 1.9: 5/5; topic 1.11: 5/5; topic 1.17: 4/5; topic 1.15: 3/5; topic 1.19: 5/5; topic 1.14: 5/5; topic 1.2: 5/5; topic 1.16: 5/5; topic 1.12: 5/5)
Mined relevant chunks





In [153]:
answer_batch_size = 10 # How many chunks to integrate into the answer at a time
# Generate an extended answer to the question, which could then be summarized
(
    partial_answers,
    answer_progress,
) = answer_builder.answer_question(
    ai_configuration=ai_configuration,
    question=question,
    relevant_cids=relevant_cids,
    cid_to_text=cid_to_text,
    answer_batch_size=answer_batch_size,
)
print("Answered question")

Answered question


In [154]:
# Output some process summary information
print(chunk_progress)
print(answer_progress)

**Chunks relevant/tested: 92/100** (topic 1: 5/5; topic 1.4: 5/5; topic 1.1: 5/5; topic 1.7: 5/5; topic 1.10: 5/5; topic 1.5: 5/5; topic 1.3: 5/5; topic 1.13: 5/5; topic 1.6: 3/5; topic 1.8: 5/5; topic 1.18: 2/5; topic 1.9: 5/5; topic 1.11: 5/5; topic 1.17: 4/5; topic 1.15: 3/5; topic 1.19: 5/5; topic 1.14: 5/5; topic 1.2: 5/5; topic 1.16: 5/5; topic 1.12: 5/5)
**Chunks referenced/relevant: 68/92** (answer 1: 10/10, answer 2: 6/10, answer 3: 7/10, answer 4: 8/10, answer 5: 8/10, answer 6: 7/10, answer 7: 8/10, answer 8: 6/10, answer 9: 6/10, answer 10: 2/2)


In [155]:
# Output the final extended answer
final_answer = partial_answers[-1]
print(final_answer)

# A Review of Recent Notable Events

*In response to: What events are discussed?*

## Introduction

This report provides an overview of several notable events across different domains, including sports, international diplomacy, cultural festivals, art exhibitions, community initiatives, and environmental efforts. The events are grouped into the following categories:

- **Sports Events**
  - Greenfield Hawks Triumph in Championship Final
  - Madrid FC Celebrates Victory Parade
  - Alex Torres Signs New Contract with Madrid FC
  - European Football League Announces New Season Schedule
  - Manchester United Secures Rising Star Alex Johnson
  - Hillside Marathon Sees Record Participation
  - Jordan Banks Scores Hat-Trick in Thrilling Exhibition Match in Lisbon
  - LA Lakers Secure Championship with Historic Win
  - LeBron James Shatters Records in Thrilling Lakers Victory
  - City League Basketball Finals Set for December at Downtown Arena
  - Serena Blake Advances to Finals in Melbourne w