# Query Text Data workflow with "Lazy GraphRAG" retrieval mechanism

Vector-based semantic search is a "best first" search approach for relevant chunks - it simply takes the top k most similar chunks based on embedding distances. It has no sense of the breadth of semantic content in the dataset.
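As a minimal illustration of this "best first" behaviour, the sketch below ranks a handful of toy chunk vectors by cosine similarity to a query vector and keeps the top k. The names `cosine_similarity`, `top_k_chunks`, and the toy vectors are hypothetical and are not part of the toolkit; a real index would use an embedding model and a vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, chunk_vectors, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), cid)
        for cid, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]


# Toy 2-d "embeddings": chunks 0 and 1 are close to the query, chunk 2 covers a different topic
chunk_vectors = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [0.7, 0.7],
}
print(top_k_chunks([1.0, 0.0], chunk_vectors, k=2))
```

With k=2 the search never looks at chunk 2, however much of the dataset that topic represents, which is the breadth limitation described above.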

Full [GraphRAG](https://aka.ms/graphrag) is a "breadth first" search approach for relevant content in community summaries - it always queries each community. It has no sense of which communities are best.

Lazy GraphRAG is a hybrid approach that requires no summarization-based index (which is expensive, especially if the user only has a few queries). Instead, it relies purely on an embedding-based index for input text chunks (cheap, fast) augmented by the concept graph of noun-phrase cooccurrences extracted from these chunks (free, fast). It works as follows:

- Given a query, text chunks are ranked using vector-based semantic search.
- Chunk rankings are used to determine community rankings in the concept graph.
- In community rank order, the top k chunks of each community are subject to a rapid relevance test (cheap, fast). If relevant chunks are found, the community is retained; otherwise it is discarded.
- After n successive communities fail to return any relevant chunks, the "iterative narrowing" process repeats with the remaining communities in rank order.
- The process terminates when no more testable chunks exist or a predefined testing budget is reached (e.g., 100 relevance tests).
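The steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the toolkit's implementation: communities are flat rather than hierarchical, a community's rank is taken from its best-ranked chunk, and `is_relevant` stands in for the LLM relevance test. The function and parameter names (`lazy_search`, `tests_per_community`, `max_successive_failures`) are hypothetical.

```python
def lazy_search(chunk_ranking, chunk_to_community, is_relevant,
                budget=100, tests_per_community=5, max_successive_failures=5):
    """Budgeted relevance testing over communities, in rank order.

    chunk_ranking: chunk ids ordered by semantic similarity to the query.
    chunk_to_community: maps each chunk id to its community label.
    is_relevant: callable standing in for the (cheap) relevance test.
    Returns (relevant chunk ids, number of tests used).
    """
    # Group chunks by community, preserving semantic rank order. Dicts keep
    # insertion order, so communities end up ordered by their best-ranked chunk
    # (an assumed ranking rule).
    community_chunks = {}
    for cid in chunk_ranking:
        community_chunks.setdefault(chunk_to_community[cid], []).append(cid)

    active = list(community_chunks)
    relevant, tests_used = [], 0
    while active and tests_used < budget:
        retained, failures = [], 0
        for position, community in enumerate(active):
            if tests_used >= budget or failures >= max_successive_failures:
                # Restart the pass; communities not yet tested stay in play
                retained.extend(active[position:])
                break
            queue = community_chunks[community]
            batch = queue[:min(tests_per_community, budget - tests_used)]
            community_chunks[community] = queue[len(batch):]
            hits = [cid for cid in batch if is_relevant(cid)]
            tests_used += len(batch)
            relevant.extend(hits)
            if hits:
                failures = 0
                # Retain the community only if it still has untested chunks
                if community_chunks[community]:
                    retained.append(community)
            else:
                failures += 1  # the community is discarded
        active = retained
    return relevant, tests_used


# Toy data: three communities, with a set lookup in place of the relevance test
chunk_ranking = [10, 11, 20, 12, 30, 21, 31]
chunk_to_community = {10: "A", 11: "A", 12: "A", 20: "B", 21: "B", 30: "C", 31: "C"}
truly_relevant = {10, 12, 30}
found, used = lazy_search(
    chunk_ranking, chunk_to_community, lambda cid: cid in truly_relevant,
    budget=6, tests_per_community=1, max_successive_failures=2,
)
print(found, used)
```

In this toy run, community B is dropped after its first chunk fails, so no further tests are spent on it. Chunk 12 is also missed because community A is discarded on the second pass, which shows how a small budget and small per-community test count trade recall for cost.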

The advantages of this approach include:

- A balance of best-first (semantic search) and breadth-first (community structure) approaches for any testing budget
- Very fast and almost free, making it suitable for one-off and exploratory queries before committing to full indexing
- Easily accommodates dynamic contexts where new texts/chunks are streaming into the dataset (supported in the current implementation)
- Provides a consistent level of "effort" for a fixed test budget, yielding graded standards of answers to questions (e.g., global questions) for which there is no ground truth (e.g., 1000 chunks = gold standard, 500 chunks = silver standard, 100 chunks = bronze standard, etc.)


In [1]:
import sys
sys.path.append("..")
import os
os.environ['MKL_THREADING_LAYER'] = 'GNU'  # Avoids threadpoolctl error on Linux and macOS
from toolkit.query_text_data import QueryTextData, ProcessedChunks, ChunkSearchConfig, AnswerConfig, AnswerObject
from toolkit.AI import OpenAIConfiguration, OpenAIEmbedder
from toolkit.helpers.constants import CACHE_PATH
import nest_asyncio # Necessary to run async code in ipynb
import pandas as pd
nest_asyncio.apply()
import importlib
import toolkit.query_text_data
importlib.reload(toolkit.query_text_data)

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\daedge\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\daedge\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\daedge\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


<module 'toolkit.query_text_data' from 'c:\\Users\\daedge\\Code\\ITK\\intelligence-toolkit\\example_notebooks\\..\\toolkit\\query_text_data\\__init__.py'>

In [2]:
# Create the workflow object
qtd = QueryTextData()
# Set up the AI model and embedding model
ai_configuration = OpenAIConfiguration(
    {
        "api_type": "OpenAI",
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-4o",
    }
)
qtd.set_ai_config(
    ai_configuration=ai_configuration,
    embedding_cache=CACHE_PATH
)
text_embedder = OpenAIEmbedder(
    configuration=ai_configuration,
)
qtd.set_embedder(text_embedder)
print("Created QueryTextData object")

Created QueryTextData object


In [3]:
# Provide text inputs as a dictionary of title->text
input_path = "../example_outputs/query_text_data/news_articles/news_articles_texts.csv"
file_name = input_path.split("/")[-1]
df = pd.read_csv(input_path)
text_to_chunks = qtd.process_data_from_df(df, file_name)
print("Processed data from df")

Processed data from df


In [4]:
# Process the chunks into the index data structures
processed_chunks: ProcessedChunks = qtd.process_text_chunks()
print("Processed chunks")

Processed chunks


In [5]:
# Embed the text chunks
cid_to_vector = await qtd.embed_text_chunks()
print("Embedded chunks")

Got 501 existing texts
Got 0 new texts
Embedded chunks


In [6]:
question = "What events are discussed?"
# Mine chunks relevant to the question
chunk_search_config: ChunkSearchConfig = ChunkSearchConfig(
    # How many relevance tests are permitted per query. Higher values may provide higher quality results at higher cost
    relevance_test_budget=20,
    # How many chunks before and after each relevant chunk to test, once the relevance test budget is nearly exhausted or the search process has terminated
    adjacent_test_steps=1,
    # How many relevance tests to run on each community in turn
    community_relevance_tests=5,
    # How many relevance tests to run in parallel at a time
    relevance_test_batch_size=5,
    # How many chunks to use to rank communities by relevance
    community_ranking_chunks=5,
    # When to restart testing communities in relevance order
    irrelevant_community_restart=5
)
relevant_cids, search_summary = await qtd.detect_relevant_text_chunks(
    question=question,
    chunk_search_config=chunk_search_config
)
print("Mined relevant chunks")

Top semantic search cids: [213, 104, 50, 38, 449, 76, 178, 182, 29, 359, 344, 484, 376, 314, 464, 381, 399, 181, 93, 324, 179, 150, 202, 409, 144, 155, 441, 96, 279, 64, 229, 237, 56, 189, 55, 364, 370, 219, 251, 400, 388, 84, 62, 341, 284, 130, 47, 485, 277, 31, 183, 310, 196, 239, 267, 475, 290, 470, 457, 138, 440, 112, 296, 446, 164, 180, 24, 418, 224, 493, 4, 468, 168, 173, 380, 244, 188, 90, 233, 382, 65, 349, 442, 348, 14, 304, 109, 337, 105, 379, 391, 408, 166, 443, 165, 216, 72, 374, 20, 152]
Level 0 community sequence: ['1.1', '1.16', '1.8', '1.6', '1.3', '1.11', '1.5', '1.14', '1.7', '1.19', '1.4', '1.12', '1.10', '1.9', '1.18', '1.20', '1.2', '1.17', '1.13', '1.15']
Level 1 community sequence: ['1.1.1', '1.1.6', '1.1.2', '1.1.4', '1.8.1', '1.1.5', '1.16.1', '1.11.1', '1.1.3', '1.16.2', '1.19', '1.16.4', '1.14.2', '1.11.2', '1.5.1', '1.4.1', '1.11.3', '1.14.4', '1.8.2', '1.1.7', '1.3.1', '1.7.4', '1.9.3', '1.18.3', '1.12.5', '1.8.3', '1.5.3', '1.8.4', '1.6.14', '1.6.19', '1.3

100%|██████████| 5/5 [00:00<00:00,  5.05it/s]


Community 1 relevant? True
Incrementing level
New level 0 loop after 5 tests
Community sequence: ['1.1', '1.16', '1.8', '1.6', '1.3', '1.11', '1.5', '1.14', '1.7', '1.19', '1.4', '1.12', '1.10', '1.9', '1.18', '1.20', '1.2', '1.17', '1.13', '1.15']
Assessing relevance for community 1.1 with chunks [178, 182, 29, 359, 344]


100%|██████████| 5/5 [00:00<00:00,  5.36it/s]


Community 1.1 relevant? True
Assessing relevance for community 1.16 with chunks [76, 93, 261, 458, 490]


100%|██████████| 5/5 [00:00<00:00,  8.42it/s]


Community 1.16 relevant? True
Assessing relevance for community 1.8 with chunks [399, 277, 290, 168, 90]


100%|██████████| 5/5 [00:00<00:00,  7.13it/s]


Community 1.8 relevant? True
Assessing relevance for community 1.6 with chunks [237, 388, 196, 379, 166]


0it [00:00, ?it/s]


Community 1.6 relevant? False
Assessing relevance for community 1.3 with chunks [267, 65, 184, 343, 87]


0it [00:00, ?it/s]


Community 1.3 relevant? False
Assessing relevance for community 1.11 with chunks [468, 188, 408, 313, 483]


0it [00:00, ?it/s]


Community 1.11 relevant? False
Assessing relevance for community 1.5 with chunks [47, 39, 347, 260, 21]


0it [00:00, ?it/s]


Community 1.5 relevant? False
Assessing relevance for community 1.14 with chunks [179, 141, 99, 120]


0it [00:00, ?it/s]

Community 1.14 relevant? False
0 successive irrelevant communities; restarting
Incrementing level
Mined relevant chunks
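
The `adjacent_test_steps` setting in the configuration above asks for chunks before and after each relevant chunk to be tested as well. A minimal sketch of how such neighbours might be selected is shown below. It assumes that chunk ids are consecutive integers in reading order, which is a simplification (adjacency would normally be defined within a single source document), and `adjacent_candidates` is a hypothetical helper, not a toolkit function.

```python
def adjacent_candidates(relevant_cids, tested_cids, num_chunks, steps=1):
    """Return untested chunk ids within `steps` positions of any relevant chunk.

    Assumes chunk ids run from 0 to num_chunks - 1 in reading order.
    """
    seen = set(tested_cids)
    candidates = []
    for cid in relevant_cids:
        for offset in range(1, steps + 1):
            for neighbor in (cid - offset, cid + offset):
                if 0 <= neighbor < num_chunks and neighbor not in seen:
                    seen.add(neighbor)  # avoid queuing the same chunk twice
                    candidates.append(neighbor)
    return candidates


# Toy example: chunks 3, 4 and 9 were found relevant in a 10-chunk dataset
print(adjacent_candidates([3, 4, 9], tested_cids={3, 4, 9}, num_chunks=10, steps=1))
```

Chunks that were already tested, and positions beyond the ends of the dataset, are skipped, so the extra tests are spent only on new neighbours.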





In [7]:
# Generate an extended answer to the question, which could then be summarized
answer_config = AnswerConfig(
    extract_claims = True, # Whether to extract claims from relevant text chunks
    claim_search_depth = 5, # If extracting claims, how many chunks to search for additional claim support
    target_chunks_per_cluster = 5, # How many chunks to aim to analyze together in a single LLM call
)
answer_object: AnswerObject = await qtd.answer_question_with_relevant_chunks(
    answer_config=answer_config
)
print("Answered question")

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

100%|██████████| 4/4 [00:10<00:00,  2.73s/it]


Claims to embed: {0: 'Scientists and policymakers convened in Cape Town to address global climate issues and explore solutions. (context: A conference in Cape Town focused on climate challenges and solutions.)', 1: 'Experts and policymakers debated education reform, focusing on curriculum updates, funding disparities, and technology integration. (context: A policy debate on education reform took place in Capital City.)', 2: 'The Global Climate Forum in Sydney gathered experts to discuss carbon reduction, renewable energy, and socio-economic impacts of climate policies. (context: The Global Climate Forum in Sydney addressed climate change challenges.)', 3: 'Elena Garcia presented AI breakthroughs at the Tech Innovators Conference, focusing on practical applications and ethical considerations. (context: The Tech Innovators Conference in Tokyo highlighted AI advancements.)', 4: 'Leaders at the Global Climate Action Forum in Berlin discussed solutions to mitigate climate change impacts. (c

100%|██████████| 19/19 [00:01<00:00, 17.32it/s]


Got 0 existing texts
Got 19 new texts
Cix to vector: dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18])
Running 19 requery tasks


100%|██████████| 19/19 [00:16<00:00,  1.13it/s]
100%|██████████| 19/19 [00:05<00:00,  3.30it/s]


Answered question


In [8]:
# Output the final extended answer
print(answer_object.extended_answer)

# Global Conferences and Forums Address Key Issues in Climate, Economy, Education, and Technology

*In response to: What events are discussed?*

## Executive summary

The report covers a range of global events and forums that address pressing issues in climate change, economic stability, education reform, and technological advancements. These events bring together experts, policymakers, and industry leaders to discuss and propose solutions to these challenges. The themes include climate action, economic resilience, education reform, and technological innovation, each highlighting the importance of international cooperation and strategic planning.

## Theme: Global Climate Conferences Highlight Urgent Environmental Challenges and Solutions

Several global climate conferences have been organized to address urgent environmental challenges. These events focus on solutions such as renewable energy, carbon reduction, and international cooperation to combat climate change. The conferences aim