# Query Text Data

Demonstrates use of the Intelligence Toolkit library to respond to queries about a collection of text documents.

See the [README](https://github.com/microsoft/intelligence-toolkit/blob/main/app/workflows/query_text_data/README.md) for more details.


In [1]:
import sys

sys.path.append("..")
import os

# Avoids a threadpoolctl error on Linux and macOS
os.environ["MKL_THREADING_LAYER"] = "GNU"
from intelligence_toolkit.query_text_data import (
    QueryTextData,
    ProcessedChunks,
    ChunkSearchConfig,
    AnswerConfig,
    AnswerObject,
)
import intelligence_toolkit.query_text_data.prompts as prompts
from intelligence_toolkit.AI.openai_configuration import OpenAIConfiguration
from intelligence_toolkit.AI.openai_embedder import OpenAIEmbedder
from intelligence_toolkit.helpers.constants import CACHE_PATH
import nest_asyncio  # Necessary to run async code in ipynb
import pandas as pd

nest_asyncio.apply()

[nltk_data] Downloading package brown to /home/ddesouza/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/ddesouza/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/ddesouza/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [2]:
# Create the workflow object
qtd = QueryTextData()
# Set up the AI model and embedding model
ai_configuration = OpenAIConfiguration(
    {
        "api_type": "OpenAI",
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-4o",
    }
)
qtd.set_ai_config(ai_configuration=ai_configuration, embedding_cache=CACHE_PATH)
text_embedder = OpenAIEmbedder(
    configuration=ai_configuration,
)
qtd.set_embedder(text_embedder)
print("Created QueryTextData object")

Created QueryTextData object
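
The cell above reads `os.environ["OPENAI_API_KEY"]` directly, which raises a bare `KeyError` when the variable is unset. A minimal sketch of a friendlier check, using only the standard library, might look like the following. The helper name `require_env` is hypothetical and is not part of Intelligence Toolkit.

```python
import os


def require_env(name: str, env=None) -> str:
    """Return the value of an environment variable or raise a descriptive error.

    `require_env` is a hypothetical helper for illustration only.
    """
    # Allow a plain dict to be passed in so the helper is easy to test
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running the notebook."
        )
    return value
```

With this in place, `require_env("OPENAI_API_KEY")` could stand in for the direct dictionary lookup when building the configuration.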


In [3]:
# Load text inputs from a CSV file and split them into chunks
# Enter the path to your own data here
input_path = "../example_outputs/query_text_data/news_articles/news_articles_texts.csv"
file_name = input_path.split("/")[-1]
df = pd.read_csv(input_path)
text_to_chunks = qtd.process_data_from_df(df, file_name)
print("Processed data from df")

Processed data from df
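
The cell above derives `file_name` by splitting the path on `/`, which assumes forward-slash separators. A small sketch of an alternative using `pathlib` from the standard library, with the same path as above:

```python
from pathlib import Path

# Same path as in the cell above
input_path = "../example_outputs/query_text_data/news_articles/news_articles_texts.csv"

# Path.name returns the final path component using the platform's separator rules
file_name = Path(input_path).name
print(file_name)  # prints news_articles_texts.csv
```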


In [4]:
# Process the chunks into the index data structures
processed_chunks: ProcessedChunks = qtd.process_text_chunks()
print("Processed chunks")

Processed chunks


In [5]:
# Embed the text chunks
cid_to_vector = await qtd.embed_text_chunks()
print("Embedded chunks")

100%|██████████| 501/501 [00:34<00:00, 14.34it/s]


Got 0 existing texts
Got 501 new texts
Embedded chunks
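
The log lines above report how many texts were already known and how many had to be embedded. The toy sketch below illustrates that general caching idea with an in-memory dictionary keyed by a hash of each text. It is not the toolkit's implementation: `ToyEmbeddingCache` and `fake_embed` are hypothetical stand-ins, and the vectors come from a hash rather than a model.

```python
import hashlib


def fake_embed(text: str, dim: int = 4) -> list[float]:
    """Deterministic stand-in for a real embedding call (an assumption, not a model)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


class ToyEmbeddingCache:
    """Hypothetical in-memory cache that embeds each distinct text only once."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}

    def embed_all(self, texts: list[str]) -> tuple[list[list[float]], int, int]:
        existing, new = 0, 0
        vectors = []
        for text in texts:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key in self._store:
                existing += 1  # cache hit: no embedding call needed
            else:
                self._store[key] = fake_embed(text)
                new += 1  # cache miss: embed and remember the vector
            vectors.append(self._store[key])
        return vectors, existing, new


cache = ToyEmbeddingCache()
_, first_existing, first_new = cache.embed_all(["chunk a", "chunk b"])
_, second_existing, second_new = cache.embed_all(["chunk a", "chunk b", "chunk c"])
print(f"First run: {first_existing} existing, {first_new} new")
print(f"Second run: {second_existing} existing, {second_new} new")
```

On the second call only the unseen text is embedded, which is why rerunning the notebook against a populated cache would be expected to report fewer new texts.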


In [6]:
# Edit the query to be answered
query = "What events are discussed?"
expanded_query = await qtd.anchor_query_to_concepts(query=query, top_concepts=100)
print(f"Expanded query: {expanded_query}")

Expanded query: What events are discussed in the context of renowned artists, artificial intelligence, healthcare reform, tech innovators, culinary arts, international tennis, political landscape, and global relief efforts? Are there significant gatherings or summits involving world leaders, tech enthusiasts, art lovers, or food enthusiasts? Are there events related to the film industry, fashion enthusiasts, or music enthusiasts? What about events in major cities like Chicago, Washington D.C., Beijing, or Paris? Are there discussions on key issues such as sustainable growth, renewable energy sources, or geopolitical tensions?


In [7]:
# Mine chunks relevant to the query
chunk_search_config: ChunkSearchConfig = ChunkSearchConfig(
    # How many relevance tests are permitted per query. Higher values may provide higher quality results at higher cost
    relevance_test_budget=50,
    # How many chunks before and after each relevant chunk to test, once the relevance test budget is nearly exhausted or the search process has terminated
    adjacent_test_steps=1,
    # How many relevance tests to run on each community in turn
    community_relevance_tests=5,
    # How many relevance tests to run in parallel at a time
    relevance_test_batch_size=5,
    # How many chunks to use to rank communities by relevance
    community_ranking_chunks=5,
    # When to restart testing communities in relevance order
    irrelevant_community_restart=5,
)
relevant_cids, search_summary = await qtd.detect_relevant_text_chunks(
    query=query, expanded_query=expanded_query, chunk_search_config=chunk_search_config
)
print("Mined relevant chunks")

Top semantic search cids: [381, 144, 470, 90, 391, 104, 55, 284, 70, 344, 213, 484, 497, 239, 182, 314, 93, 155, 364, 31, 443, 249, 290, 324, 441, 296, 325, 359, 464, 178, 400, 409, 309, 14, 352, 244, 150, 96, 50, 61, 164, 202, 219, 7, 224, 276, 408, 449, 251, 29, 261, 380, 376, 34, 370, 382, 76, 313, 84, 208, 189, 188, 415, 58, 195, 130, 304, 288, 180, 173, 19, 181, 374, 218, 440, 387, 183, 38, 89, 1, 129, 168, 152, 260, 403, 333, 141, 418, 432, 264, 289, 118, 132, 312, 228, 442, 363, 274, 109, 279]
Level 0 community sequence: ['1.1', '1.8', '1.12', '1.11', '1.5', '1.16', '1.19', '1.6', '1.14', '1.3', '1.4', '1.7', '1.10', '1.9', '1.2', '1.18', '1.15', '1.17', '1.13', '1.20']
Level 1 community sequence: ['1.8.1', '1.8.3', '1.1.4', '1.1.1', '1.11.1', '1.8.2', '1.1.2', '1.1.5', '1.8.4', '1.1.3', '1.5.1', '1.19', '1.12.3', '1.6.5', '1.16.1', '1.16.4', '1.11.3', '1.1.6', '1.11.2', '1.1.7', '1.14.2', '1.12.5', '1.14.4', '1.8.6', '1.5.3', '1.16.2', '1.3.3', '1.4.1', '1.7.2', '1.9.3', '1.4.2

100%|██████████| 5/5 [00:01<00:00,  3.49it/s]


Community 1 relevant? True
Incrementing level
New level 0 loop after 5 tests
Community sequence: ['1.1', '1.8', '1.12', '1.11', '1.5', '1.16', '1.19', '1.6', '1.14', '1.3', '1.4', '1.7', '1.10', '1.9', '1.2', '1.18', '1.15', '1.17', '1.13', '1.20']
Assessing relevance for community 1.1 with chunks [104, 55, 284, 344, 213]


100%|██████████| 5/5 [00:00<00:00,  6.42it/s]


Community 1.1 relevant? True
Assessing relevance for community 1.8 with chunks [70, 93, 290, 309, 352]


100%|██████████| 5/5 [00:00<00:00,  5.00it/s]


Community 1.8 relevant? True
Assessing relevance for community 1.12 with chunks [7, 347, 322, 489, 308]


100%|██████████| 5/5 [00:01<00:00,  3.17it/s]


Community 1.12 relevant? True
Assessing relevance for community 1.11 with chunks [61, 408, 261, 313, 188]


100%|██████████| 5/5 [00:01<00:00,  4.96it/s]


Community 1.11 relevant? True
Assessing relevance for community 1.5 with chunks [260, 47, 128, 67, 343]


100%|██████████| 5/5 [00:00<00:00,  5.57it/s]


Community 1.5 relevant? True
Assessing relevance for community 1.16 with chunks [76, 136, 295, 83, 479]


100%|██████████| 5/5 [00:01<00:00,  4.77it/s]


Community 1.16 relevant? True
Assessing relevance for community 1.19 with chunks [230, 99, 2, 77]


100%|██████████| 4/4 [00:00<00:00,  4.64it/s]


Community 1.19 relevant? True
Assessing relevance for community 1.6 with chunks [462, 237, 375, 166, 339]


100%|██████████| 5/5 [00:13<00:00,  2.66s/it]


Community 1.6 relevant? True
Assessing relevance for community 1.14 with chunks [141, 179, 120, 60, 107]


100%|██████████| 5/5 [00:01<00:00,  4.47it/s]


Community 1.14 relevant? True
Assessing relevance for community 1.3 with chunks [206, 0, 186, 246, 204]


100%|██████████| 1/1 [00:01<00:00,  1.18s/it]


Community 1.3 relevant? True
Assessing relevance for community 1.4 with chunks [320, 133, 252, 268, 95]


0it [00:00, ?it/s]


Community 1.4 relevant? False
Assessing relevance for community 1.7 with chunks [301, 161, 272, 355, 413]


0it [00:00, ?it/s]


Community 1.7 relevant? False
Assessing relevance for community 1.10 with chunks [19, 363, 360, 91, 171]


0it [00:00, ?it/s]


Community 1.10 relevant? False
Assessing relevance for community 1.9 with chunks [278, 358, 281, 423, 162]


0it [00:00, ?it/s]


Community 1.9 relevant? False
Assessing relevance for community 1.2 with chunks [374, 266, 396, 151, 169]


0it [00:00, ?it/s]

Community 1.2 relevant? False
5 successive irrelevant communities; restarting
Incrementing level
Mined relevant chunks
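
In the log above, the progress-bar totals add up to 50 tests (5 for community 1, then 5 each for six communities, 4, 5, 5, and 1), which matches `relevance_test_budget=50`. After that, the remaining communities are assessed with zero tests and reported as not relevant until five successive irrelevant communities trigger a restart. The toy sketch below reproduces only that budgeting behavior. It is not the toolkit's algorithm, which also ranks communities, descends through levels, and tests adjacent chunks; the function name and the `is_relevant` callback are hypothetical.

```python
def budgeted_community_search(
    communities, is_relevant, budget, tests_per_community, restart_after
):
    """Toy budgeted search over an ordered list of (name, chunk_ids) pairs.

    is_relevant is a hypothetical callback standing in for an LLM relevance test.
    """
    relevant = []
    tests_used = 0
    irrelevant_streak = 0
    log = []
    for name, chunk_ids in communities:
        # Never test more chunks than the remaining budget allows
        remaining = budget - tests_used
        batch = chunk_ids[: min(tests_per_community, remaining)]
        hits = [cid for cid in batch if is_relevant(cid)]
        tests_used += len(batch)
        relevant.extend(hits)
        log.append((name, len(batch), bool(hits)))
        # A community with no relevant chunks (or no tests left) extends the streak
        irrelevant_streak = 0 if hits else irrelevant_streak + 1
        if irrelevant_streak >= restart_after:
            break  # the real workflow would restart at the next level here
    return relevant, tests_used, log


communities = [
    ("1.1", [2, 3, 4]),
    ("1.8", [5, 6, 7]),
    ("1.12", [8, 9, 10]),
    ("1.11", [12, 14, 16]),
    ("1.5", [18, 20, 22]),
]
relevant, tests_used, log = budgeted_community_search(
    communities,
    is_relevant=lambda cid: cid % 2 == 0,  # stand-in rule: even ids are "relevant"
    budget=7,
    tests_per_community=3,
    restart_after=2,
)
for name, tests_run, found in log:
    print(f"Community {name}: {tests_run} tests, relevant? {found}")
```

With a budget of 7, the third community receives a single test and the last two receive none, mirroring the `1/1` and `0it` progress bars in the log above.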





In [8]:
# Generate an extended answer to the query, which could then be summarized into a shorter form
answer_config = AnswerConfig(
    extract_claims=False,  # Whether to extract claims from relevant text chunks
    claim_search_depth=0,  # If extracting claims, how many chunks to search for additional claim support
    target_chunks_per_cluster=5,  # How many chunks to aim to analyze together in a single LLM call
)
await qtd.answer_query_with_relevant_chunks(answer_config=answer_config)
print("Answered query")

  0%|          | 0/8 [00:00<?, ?it/s]

100%|██████████| 8/8 [00:14<00:00,  1.76s/it]


Extracted references: [2, 7, 55, 60, 61, 70, 76, 83, 90, 93, 104, 107, 120, 136, 144, 206, 213, 230, 284, 295, 308, 309, 322, 339, 344, 375, 381, 391, 462, 470, 479]
Answered query


In [9]:
# Output the final extended answer
print(qtd.answer_object.extended_answer)

# Report

## Query

*What events are discussed?*

## Expanded Query

*What events are discussed in the context of renowned artists, artificial intelligence, healthcare reform, tech innovators, culinary arts, international tennis, political landscape, and global relief efforts? Are there significant gatherings or summits involving world leaders, tech enthusiasts, art lovers, or food enthusiasts? Are there events related to the film industry, fashion enthusiasts, or music enthusiasts? What about events in major cities like Chicago, Washington D.C., Beijing, or Paris? Are there discussions on key issues such as sustainable growth, renewable energy sources, or geopolitical tensions?*

## Answer

The events discussed include a range of international and local initiatives across various fields. These include earthquake relief efforts in Indonesia, global climate summits, art exhibitions at the Louvre Museum and the Modern Art Museum, the Tech Innovators Conferences, film festivals, political

In [10]:
# Condense the answer
qtd.condense_answer(ai_instructions=prompts.user_prompt)
print("Condensed answer")

Condensed answer


In [11]:
# Output the condensed answer
print(qtd.condensed_answer)

# Global Events and Their Impact on Society

## Introduction

This report explores a diverse array of events that highlight significant cultural, technological, political, and economic developments across the globe. These events range from international disaster relief efforts and climate summits to art exhibitions, film festivals, political campaigns, and economic forums. Each event underscores the importance of collaboration, innovation, and community engagement in addressing contemporary challenges and fostering a sustainable future.

## Disaster Relief and Climate Action

### Earthquake Relief Efforts in Indonesia

Following a devastating earthquake in Bali and Lombok, international and local organizations have mobilized to provide relief. The Global Relief Network and the World Health Organization have deployed medical teams to deliver essential healthcare services, addressing both immediate and long-term needs [source: news_articles_texts.csv_3 (1), news_articles_texts.csv_121 (1