# Query Text Data

Demonstrates use of the Intelligence Toolkit library to respond to queries about a collection of text documents.

See the [README](https://github.com/microsoft/intelligence-toolkit/blob/main/app/workflows/query_text_data/README.md) for more details.


In [15]:
import sys

sys.path.append("..")
import os

# Avoids threadpoolctl error on Linux and macOS
os.environ["MKL_THREADING_LAYER"] = "GNU"
from intelligence_toolkit.query_text_data.api import QueryTextData
from intelligence_toolkit.query_text_data.classes import (
    ProcessedChunks,
    ChunkSearchConfig,
    AnswerConfig,
)

import intelligence_toolkit.query_text_data.prompts as prompts
from intelligence_toolkit.AI.openai_configuration import OpenAIConfiguration
from intelligence_toolkit.AI.openai_embedder import OpenAIEmbedder
from intelligence_toolkit.helpers.constants import CACHE_PATH
import nest_asyncio  # Necessary to run async code in ipynb
import pandas as pd

nest_asyncio.apply()

In [16]:
# Create the workflow object
qtd = QueryTextData()
# Set up the AI model and embedding model
ai_configuration = OpenAIConfiguration(
    {
        "api_type": "OpenAI",
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-4o",
    }
)
qtd.set_ai_config(ai_configuration=ai_configuration, embedding_cache=CACHE_PATH)
text_embedder = OpenAIEmbedder(
    configuration=ai_configuration,
)
qtd.set_embedder(text_embedder)
print("Created QueryTextData object")

Created QueryTextData object
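The setup cell reads the key with `os.environ["OPENAI_API_KEY"]`, which raises a bare `KeyError` if the variable is missing. A minimal sketch of a friendlier check follows. The helper name `require_env` is hypothetical and is not part of Intelligence Toolkit.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a descriptive error.

    Hypothetical helper for this notebook; not part of Intelligence Toolkit.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before running the notebook."
        )
    return value


# Possible usage in the setup cell:
# api_key = require_env("OPENAI_API_KEY")
```

The returned value could then be passed as `"api_key"` in the configuration dictionary above.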


In [17]:
# Load text inputs from a CSV file and process them into text chunks
# Enter the path to your own data here
input_path = "../example_outputs/query_text_data/news_articles/news_articles_texts.csv"
file_name = input_path.split("/")[-1]
df = pd.read_csv(input_path)
text_to_chunks = qtd.process_data_from_df(df, file_name)
print("Processed data from df")

Processed data from df


In [18]:
# Process the chunks into the index data structures
processed_chunks: ProcessedChunks = qtd.process_text_chunks()
print("Processed chunks")

Processed chunks


In [19]:
# Embed the text chunks
cid_to_vector = await qtd.embed_text_chunks()
print("Embedded chunks")

Got 501 existing texts
Got 0 new texts
Embedded chunks
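The output above reports 501 existing texts and 0 new texts, which suggests that every chunk already had a cached embedding, so no new embedding requests were needed. The sketch below illustrates the general idea of splitting texts into cached and uncached sets. Keying the cache by a SHA-256 hash of the text is an assumption made for illustration, not a description of how the library stores embeddings, and `split_cached` is a hypothetical name.

```python
import hashlib


def text_key(text: str) -> str:
    """Cache key for a text (assumed scheme: SHA-256 of the UTF-8 bytes)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_cached(texts, cache):
    """Partition texts into those with a cached vector and those still to embed."""
    existing, new = [], []
    for text in texts:
        if text_key(text) in cache:
            existing.append(text)
        else:
            new.append(text)
    return existing, new


# Toy cache holding a vector for one text only
cache = {text_key("alpha"): [0.1, 0.2]}
existing, new = split_cached(["alpha", "beta"], cache)
print(f"Got {len(existing)} existing texts")  # Got 1 existing texts
print(f"Got {len(new)} new texts")  # Got 1 new texts
```

Only the texts in `new` would need to be sent to the embedding model.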


In [20]:
# Edit the query to be answered
query = "What events are discussed?"
expanded_query = await qtd.anchor_query_to_concepts(query=query, top_concepts=100)
print(f"Expanded query: {expanded_query}")

Expanded query: What events are discussed in the context of renowned artists, artificial intelligence, healthcare reform, tech innovators, and significant gatherings involving world leaders and political analysts? Are there events related to the art world, culinary arts, tech industry, or international organizations? Additionally, are there events in major cities like Chicago, Washington D.C., or international locations such as Beijing, Paris, or Tokyo?


In [21]:
# Mine chunks relevant to the query
chunk_search_config: ChunkSearchConfig = ChunkSearchConfig(
    # How many relevance tests are permitted per query. Higher values may provide higher-quality results at higher cost.
    relevance_test_budget=50,
    # How many chunks before and after each relevant chunk to test, once the relevance test budget is nearly exhausted or the search process has terminated
    adjacent_test_steps=1,
    # How many relevance tests to run on each community in turn
    community_relevance_tests=5,
    # How many relevance tests to run in parallel at a time
    relevance_test_batch_size=5,
    # How many chunks to use to rank communities by relevance
    community_ranking_chunks=5,
    # When to restart testing communities in relevance order
    irrelevant_community_restart=5,
)
relevant_cids, search_summary = await qtd.detect_relevant_text_chunks(
    query=query, expanded_query=expanded_query, chunk_search_config=chunk_search_config
)
print("Mined relevant chunks")

Top semantic search cids: [90, 70, 391, 470, 55, 309, 497, 155, 381, 7, 104, 324, 14, 415, 359, 400, 213, 93, 96, 219, 144, 296, 352, 408, 208, 290, 249, 61, 441, 322, 188, 320, 376, 38, 182, 344, 251, 370, 325, 239, 202, 341, 443, 314, 178, 264, 380, 58, 364, 284, 261, 31, 276, 289, 50, 449, 382, 399, 218, 387, 180, 76, 313, 195, 484, 181, 129, 83, 434, 164, 118, 34, 462, 288, 89, 141, 403, 72, 333, 183, 374, 343, 150, 133, 228, 29, 176, 418, 168, 210, 19, 67, 189, 363, 274, 56, 224, 464, 491, 132]
Level 0 community sequence: ['1.8', '1.1', '1.12', '1.11', '1.16', '1.6', '1.7', '1.5', '1.19', '1.4', '1.14', '1.3', '1.10', '1.2', '1.9', '1.18', '1.20', '1.15', '1.13', '1.17']
Level 1 community sequence: ['1.8.3', '1.8.1', '1.8.2', '1.8.4', '1.1.4', '1.12.3', '1.11.1', '1.11.3', '1.11.2', '1.19', '1.1.1', '1.16.1', '1.12.5', '1.5.1', '1.1.5', '1.1.2', '1.1.6', '1.6.5', '1.1.7', '1.4.1', '1.1.3', '1.8.6', '1.16.4', '1.4.2', '1.7.2', '1.5.3', '1.4.3', '1.16.2', '1.8.7', '1.14.4', '1.3.3',

100%|██████████| 5/5 [00:00<00:00,  5.52it/s]


Community 1 relevant? True
Incrementing level
New level 0 loop after 5 tests
Community sequence: ['1.8', '1.1', '1.12', '1.11', '1.16', '1.6', '1.7', '1.5', '1.19', '1.4', '1.14', '1.3', '1.10', '1.2', '1.9', '1.18', '1.20', '1.15', '1.13', '1.17']
Assessing relevance for community 1.8 with chunks [309, 497, 155, 324, 14]


100%|██████████| 5/5 [00:00<00:00,  6.60it/s]


Community 1.8 relevant? True
Assessing relevance for community 1.1 with chunks [381, 104, 400, 213, 219]


100%|██████████| 5/5 [00:01<00:00,  3.72it/s]


Community 1.1 relevant? True
Assessing relevance for community 1.12 with chunks [7, 322, 308, 347, 379]


100%|██████████| 5/5 [00:00<00:00,  6.86it/s]


Community 1.12 relevant? True
Assessing relevance for community 1.11 with chunks [408, 61, 188, 261, 387]


100%|██████████| 5/5 [00:00<00:00,  6.63it/s]


Community 1.11 relevant? True
Assessing relevance for community 1.16 with chunks [76, 83, 136, 458, 295]


100%|██████████| 5/5 [00:00<00:00,  5.55it/s]


Community 1.16 relevant? True
Assessing relevance for community 1.6 with chunks [462, 323, 375, 237, 240]


100%|██████████| 5/5 [00:01<00:00,  4.04it/s]


Community 1.6 relevant? True
Assessing relevance for community 1.7 with chunks [320, 141, 67, 161, 47]


100%|██████████| 5/5 [00:00<00:00,  6.51it/s]


Community 1.7 relevant? True
Assessing relevance for community 1.5 with chunks [343, 159, 260, 438, 87]


100%|██████████| 5/5 [00:00<00:00,  6.36it/s]


Community 1.5 relevant? True
Assessing relevance for community 1.19 with chunks [230, 2]


100%|██████████| 2/2 [00:00<00:00,  3.42it/s]


Community 1.19 relevant? True
Assessing relevance for community 1.4 with chunks [133, 268, 342, 410, 398]


100%|██████████| 3/3 [00:00<00:00,  5.36it/s]


Community 1.4 relevant? True
Assessing relevance for community 1.14 with chunks [179, 60, 120, 18]


0it [00:00, ?it/s]


Community 1.14 relevant? False
Assessing relevance for community 1.3 with chunks [206, 186, 0, 394, 246]


0it [00:00, ?it/s]


Community 1.3 relevant? False
Assessing relevance for community 1.10 with chunks [19, 363, 360, 91, 32]


0it [00:00, ?it/s]


Community 1.10 relevant? False
Assessing relevance for community 1.2 with chunks [374, 278, 266, 396, 338]


0it [00:00, ?it/s]


Community 1.2 relevant? False
Assessing relevance for community 1.9 with chunks [281, 423, 421, 358, 162]


0it [00:00, ?it/s]

Community 1.9 relevant? False
5 successive irrelevant communities; restarting
Incrementing level
Mined relevant chunks
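In the log above, the test counts shown by the progress bars add up to 5 + 8 × 5 + 2 + 3 = 50, which matches `relevance_test_budget=50`. Community 1.4 lists five chunks but runs only three tests, and the communities after it run none and are marked irrelevant until five in a row trigger the restart. The sketch below is a toy reading of that behavior, not the library's implementation: the function names, the `is_relevant` callback, and the assumption that adjacent chunks have consecutive ids are all illustrative.

```python
def search_communities(
    community_to_chunks, is_relevant, budget, tests_per_community, restart_after
):
    """Test chunks community by community under a fixed test budget.

    Stops after `restart_after` successive communities yield no relevant
    chunk (the real workflow restarts at another level; this sketch stops).
    Returns (relevant chunk ids, number of tests used).
    """
    relevant = []
    tests_used = 0
    successive_irrelevant = 0
    for chunks in community_to_chunks.values():
        remaining = budget - tests_used
        # With no budget left the batch is empty, so the community
        # counts as irrelevant, as in the "0it" lines of the log
        batch = chunks[: min(tests_per_community, remaining)]
        tests_used += len(batch)
        hits = [cid for cid in batch if is_relevant(cid)]
        relevant.extend(hits)
        if hits:
            successive_irrelevant = 0
        else:
            successive_irrelevant += 1
            if successive_irrelevant >= restart_after:
                break
    return relevant, tests_used


def adjacent_chunks(relevant, steps, n_chunks):
    """Ids within `steps` positions of a relevant chunk, excluding the relevant ones.

    Assumes chunk ids are consecutive positions in one document.
    """
    neighbours = set()
    for cid in relevant:
        for offset in range(-steps, steps + 1):
            candidate = cid + offset
            if 0 <= candidate < n_chunks:
                neighbours.add(candidate)
    return sorted(neighbours - set(relevant))


communities = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7],
    "C": [8, 9, 10, 11, 12],
    "D": [13, 14, 15],
    "E": [16],
    "F": [17],
}
found, used = search_communities(
    communities,
    is_relevant=lambda cid: cid in {2, 7, 9},
    budget=10,
    tests_per_community=5,
    restart_after=2,
)
print(found, used)  # [2, 7, 9] 10
print(adjacent_chunks(found, steps=1, n_chunks=18))  # [1, 3, 6, 8, 10]
```

Community C is cut to three tests because only three remain in the budget, and D and E run no tests, which ends the loop after two successive irrelevant communities.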





In [22]:
# Generate an extended answer to the query, which could then be summarized into a shorter form
answer_config = AnswerConfig(
    extract_claims=False,  # Whether to extract claims from relevant text chunks
    claim_search_depth=0,  # If extracting claims, how many chunks to search for additional claim support
    target_chunks_per_cluster=5,  # How many chunks to aim to analyze together in a single LLM call
)
await qtd.answer_query_with_relevant_chunks(answer_config=answer_config)
print("Answered query")

100%|██████████| 8/8 [00:13<00:00,  1.73s/it]


Extracted references: [7, 14, 55, 61, 67, 70, 76, 87, 90, 104, 136, 155, 159, 213, 219, 230, 240, 268, 308, 309, 324, 381, 391, 400, 408, 438, 462, 470]
Answered query
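The `target_chunks_per_cluster` setting describes how many chunks to aim to analyze together in a single LLM call. As a rough illustration of a target group size, the sketch below splits a list of chunk ids into evenly sized groups that do not exceed the target. It groups by position only. How the library actually forms its clusters is not shown in this notebook, and `group_chunks` is a hypothetical name.

```python
import math


def group_chunks(cids, target_per_group):
    """Split chunk ids into balanced groups of at most `target_per_group` items."""
    if target_per_group < 1:
        raise ValueError("target_per_group must be at least 1")
    n_groups = max(1, math.ceil(len(cids) / target_per_group))
    base, extra = divmod(len(cids), n_groups)
    groups = []
    start = 0
    for i in range(n_groups):
        # The first `extra` groups take one additional item
        size = base + (1 if i < extra else 0)
        groups.append(cids[start : start + size])
        start += size
    return groups


groups = group_chunks(list(range(12)), target_per_group=5)
print([len(g) for g in groups])  # [4, 4, 4]
```

Balancing the groups avoids a final call that analyzes only one or two leftover chunks.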


In [23]:
# Output the final extended answer
print(qtd.answer_object.extended_answer)

# Report

## Query

*What events are discussed?*

## Expanded Query

*What events are discussed in the context of renowned artists, artificial intelligence, healthcare reform, tech innovators, and significant gatherings involving world leaders and political analysts? Are there events related to the art world, culinary arts, tech industry, or international organizations? Additionally, are there events in major cities like Chicago, Washington D.C., or international locations such as Beijing, Paris, or Tokyo?*

## Answer

The events discussed include a variety of cultural, technological, economic, and health-related gatherings. These include the Louvre Museum's Asian Art Exhibition curated by Sophia Lin, Emily Rivera's exhibition at the Metropolitan Art Gallery, Emma Thompson's cultural exploration of Paris and the Louvre Museum, Mia Liu's fashion collection at Paris Fashion Week, the Tech Innovators Conferences in global tech hubs, political agendas focusing on healthcare reform, the Glo

In [24]:
# Condense the answer
qtd.condense_answer(ai_instructions=prompts.user_prompt)
print("Condensed answer")

Condensed answer


In [25]:
# Output the condensed answer
print(qtd.condensed_answer)

# Global Events: A Comprehensive Overview of Cultural, Technological, Economic, and Health Innovations

## Introduction

This report provides an overview of various significant events across the globe, focusing on cultural, technological, economic, and health-related gatherings. These events highlight the interconnectedness of different sectors and the importance of collaboration and cultural exchange in addressing global challenges.

## Cultural and Artistic Events

### Louvre Museum's Asian Art Exhibition

The Louvre Museum in Paris has launched an Asian Art Exhibition curated by Sophia Lin, showcasing a diverse collection of artworks from countries such as China, Japan, Korea, and India. This exhibition aims to foster cultural exchange between East and West and is set to run for six months, attracting art enthusiasts globally [source: news_articles_texts.csv_8 (1)].

### Emily Rivera's Exhibition at the Metropolitan Art Gallery

Emily Rivera's exhibition at the Metropolitan Art Gall