-------------------------
#### Demo: Document Summary Index on Wikipedia Articles

This demo showcases the **Document Summary Index**, applied to Wikipedia articles about different cities.

#### Key Features of Document Summary Index
1. **Summary Extraction**:
   - A summary is extracted from each document.
   - The index stores the summary as well as all nodes corresponding to the document.

2. **Retrieval Process**:
   - Retrieval can be performed using:
     - **LLM-based querying** (currently supported).
     - **Embedding-based querying** (planned; marked TODO).
   - Retrieval begins by selecting the documents relevant to the query, based on their summaries.
   - All nodes corresponding to the selected documents are then retrieved.
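
The two-stage flow described above (pick documents by summary, then return all of their nodes) can be sketched in plain Python. This is a toy illustration, not the LlamaIndex implementation: the `ToySummaryIndex` class, its method names, and the sample data are hypothetical, and simple word overlap stands in for the LLM's relevance judgement.

```python
from dataclasses import dataclass, field


@dataclass
class ToySummaryIndex:
    """Hypothetical stand-in for a document summary index (not the LlamaIndex class)."""
    summaries: dict = field(default_factory=dict)  # doc_id -> summary text
    doc_nodes: dict = field(default_factory=dict)  # doc_id -> list of node texts

    def add_document(self, doc_id, summary, nodes):
        # Store the summary alongside every node of the document.
        self.summaries[doc_id] = summary
        self.doc_nodes[doc_id] = list(nodes)

    def retrieve(self, query, top_k=1):
        # Stage 1: rank documents using their summaries only.
        # Word overlap is a crude placeholder for an LLM relevance judgement.
        query_words = set(query.lower().split())

        def score(doc_id):
            return len(query_words & set(self.summaries[doc_id].lower().split()))

        ranked = sorted(self.summaries, key=score, reverse=True)
        selected = [doc_id for doc_id in ranked[:top_k] if score(doc_id) > 0]
        # Stage 2: return every node belonging to each selected document.
        return [node for doc_id in selected for node in self.doc_nodes[doc_id]]


# Toy data invented for illustration only.
toy_index = ToySummaryIndex()
toy_index.add_document(
    "boston", "Boston is the capital of Massachusetts", ["Boston node 1", "Boston node 2"]
)
toy_index.add_document(
    "chicago", "Chicago is a city in Illinois on Lake Michigan", ["Chicago node 1"]
)
retrieved = toy_index.retrieve("capital of Massachusetts")
print(retrieved)
```

Note that the query never touches node text directly: a document is either selected as a whole (all of its nodes are returned) or skipped entirely.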

#### Use Case
This approach is ideal for cases where:
- High-level document summaries can help narrow down the search space.
- Specific nodes within the document need to be accessed after selecting relevant documents.

#### Enhancements
- Embedding-based retrieval, once implemented, would select documents by vector similarity between the query embedding and each summary embedding, rather than asking an LLM to judge each summary.
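
As a rough sketch of what summary selection by vector similarity might look like, the snippet below ranks toy summaries by cosine similarity to the query. Everything here is an assumption made for illustration: `embed` is a bag-of-words counter rather than a real embedding model, and `select_documents` and the sample summaries are hypothetical names and data, not part of any library.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_documents(query, summaries, top_k=1):
    """Return the ids of the top_k documents whose summaries are closest to the query."""
    query_vec = embed(query)
    scored = [
        (cosine_similarity(query_vec, embed(summary)), doc_id)
        for doc_id, summary in summaries.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy summaries invented for illustration only.
toy_summaries = {
    "boston": "boston is the capital of massachusetts",
    "chicago": "chicago is a city on lake michigan",
}
selected = select_documents("city on lake michigan", toy_summaries)
print(selected)
```

With a real embedding model the summary vectors would typically be computed once at index time, so only the query needs to be embedded at retrieval time.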

--------------------------

In [1]:
import openai

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

In [2]:
# Load the city articles from ./data/cities
documents = SimpleDirectoryReader("./data/cities").load_data()

len(documents)

5

In [3]:
type(documents[0])

llama_index.core.schema.Document

In [4]:
dict(documents[0]).keys()

dict_keys(['id_', 'embedding', 'metadata', 'excluded_embed_metadata_keys', 'excluded_llm_metadata_keys', 'relationships', 'metadata_template', 'metadata_separator', 'text', 'mimetype', 'start_char_idx', 'end_char_idx', 'metadata_seperator', 'text_template'])

In [5]:
# Character count of each loaded document
tuple(len(doc.text) for doc in documents)

(66607, 85828, 83621, 69201, 82752)

In [6]:
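# Note: this builds a plain VectorStoreIndex (documents are split into nodes and each node is embedded)
index = VectorStoreIndex.from_documents(documents)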

In [8]:
index.docstore.get_all_ref_doc_info()

{'b47ece49-2c05-487d-98ce-ea2078d08780': RefDocInfo(node_ids=['af0eca2f-e133-4f6c-b6cf-02c9329a065e', '53be1e7e-fee0-480b-b990-b936c18ae385', '3e4cc218-2b4a-4b2d-bfb7-028990e90838', 'd4b37a4d-2648-495b-b2f6-4fe4a15eab94', 'accb4b65-6e38-4c0e-99cc-24aaf24e168f', '484b84b9-e2bd-4693-a471-a0be2956b2e1', 'd54653ee-f28a-4cfd-a8a3-804154b9e208', 'c439034f-adb3-437c-86f6-10bf32d77c9a', '6e00e117-eb63-40e6-8fe8-310aaf9b390d', '683023d3-4411-410f-b62f-735e646f3b2c', '03682304-8256-4475-9a7e-057615ab2ec4', 'a74eb861-6502-4870-8124-07e643371e6a', 'c32ab5c7-b510-4bd4-8f23-aa96692ea696', '1674b646-8461-42bf-b421-9c3f9c64d745', 'b60b84d3-9aae-449f-b555-13469cd9328c', '33c3818f-58c7-473b-ab09-e7a7c19cbb9e', '9ca8cefe-29de-4b4b-b797-0cbacc055366', '40f3b6d9-5255-450b-a532-e2c8689820ee', 'baaf4aaa-2b8b-4141-8a90-4012814153b4'], metadata={'file_path': 'D:\\gridflowai\\NIIT-Tredence-LLM\\Day21 - 21 NOV\\data\\cities\\Boston.txt', 'file_name': 'Boston.txt', 'file_type': 'text/plain', 'file_size': 66712, '

In [9]:
index.docstore.get_all_ref_doc_info().keys()  # 5 documents

dict_keys(['b47ece49-2c05-487d-98ce-ea2078d08780', 'a9be3c96-b36d-4481-8fb5-24af647a3251', '418197b3-6f3a-4ed8-8b88-18b6336d5e58', '4b600617-a4c8-4916-8b8c-5ef8ed5d7653', '7fd39e3f-f588-4be2-a3ad-1d7b0b2fa5a5'])

In [10]:
index.docstore.get_all_ref_doc_info()['b47ece49-2c05-487d-98ce-ea2078d08780']  # a specific document

RefDocInfo(node_ids=['af0eca2f-e133-4f6c-b6cf-02c9329a065e', '53be1e7e-fee0-480b-b990-b936c18ae385', '3e4cc218-2b4a-4b2d-bfb7-028990e90838', 'd4b37a4d-2648-495b-b2f6-4fe4a15eab94', 'accb4b65-6e38-4c0e-99cc-24aaf24e168f', '484b84b9-e2bd-4693-a471-a0be2956b2e1', 'd54653ee-f28a-4cfd-a8a3-804154b9e208', 'c439034f-adb3-437c-86f6-10bf32d77c9a', '6e00e117-eb63-40e6-8fe8-310aaf9b390d', '683023d3-4411-410f-b62f-735e646f3b2c', '03682304-8256-4475-9a7e-057615ab2ec4', 'a74eb861-6502-4870-8124-07e643371e6a', 'c32ab5c7-b510-4bd4-8f23-aa96692ea696', '1674b646-8461-42bf-b421-9c3f9c64d745', 'b60b84d3-9aae-449f-b555-13469cd9328c', '33c3818f-58c7-473b-ab09-e7a7c19cbb9e', '9ca8cefe-29de-4b4b-b797-0cbacc055366', '40f3b6d9-5255-450b-a532-e2c8689820ee', 'baaf4aaa-2b8b-4141-8a90-4012814153b4'], metadata={'file_path': 'D:\\gridflowai\\NIIT-Tredence-LLM\\Day21 - 21 NOV\\data\\cities\\Boston.txt', 'file_name': 'Boston.txt', 'file_type': 'text/plain', 'file_size': 66712, 'creation_date': '2024-11-21', 'last_modif

In [12]:
index.docstore.get_all_ref_doc_info()['b47ece49-2c05-487d-98ce-ea2078d08780'].to_dict()

{'node_ids': ['af0eca2f-e133-4f6c-b6cf-02c9329a065e',
  '53be1e7e-fee0-480b-b990-b936c18ae385',
  '3e4cc218-2b4a-4b2d-bfb7-028990e90838',
  'd4b37a4d-2648-495b-b2f6-4fe4a15eab94',
  'accb4b65-6e38-4c0e-99cc-24aaf24e168f',
  '484b84b9-e2bd-4693-a471-a0be2956b2e1',
  'd54653ee-f28a-4cfd-a8a3-804154b9e208',
  'c439034f-adb3-437c-86f6-10bf32d77c9a',
  '6e00e117-eb63-40e6-8fe8-310aaf9b390d',
  '683023d3-4411-410f-b62f-735e646f3b2c',
  '03682304-8256-4475-9a7e-057615ab2ec4',
  'a74eb861-6502-4870-8124-07e643371e6a',
  'c32ab5c7-b510-4bd4-8f23-aa96692ea696',
  '1674b646-8461-42bf-b421-9c3f9c64d745',
  'b60b84d3-9aae-449f-b555-13469cd9328c',
  '33c3818f-58c7-473b-ab09-e7a7c19cbb9e',
  '9ca8cefe-29de-4b4b-b797-0cbacc055366',
  '40f3b6d9-5255-450b-a532-e2c8689820ee',
  'baaf4aaa-2b8b-4141-8a90-4012814153b4'],
 'metadata': {'file_path': 'D:\\gridflowai\\NIIT-Tredence-LLM\\Day21 - 21 NOV\\data\\cities\\Boston.txt',
  'file_name': 'Boston.txt',
  'file_type': 'text/plain',
  'file_size': 66712,
  '

In [14]:
len(index.docstore.get_all_ref_doc_info()['b47ece49-2c05-487d-98ce-ea2078d08780'].to_dict()['node_ids'])

19
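
The per-document lookup above generalizes to all documents at once. The helper below is a minimal sketch: `node_counts` is a hypothetical name, and the sample mapping is a made-up stand-in that only mimics the `node_ids` attribute visible in the `RefDocInfo` output above.

```python
from types import SimpleNamespace


def node_counts(ref_doc_info):
    """Map each source document id to the number of nodes it was split into."""
    return {doc_id: len(info.node_ids) for doc_id, info in ref_doc_info.items()}


# Hypothetical stand-in for index.docstore.get_all_ref_doc_info();
# the ids below are invented for illustration.
fake_ref_doc_info = {
    "doc-a": SimpleNamespace(node_ids=["n1", "n2", "n3"]),
    "doc-b": SimpleNamespace(node_ids=["n4"]),
}
counts = node_counts(fake_ref_doc_info)
print(counts)
```

In the notebook itself, the same helper could presumably be called as `node_counts(index.docstore.get_all_ref_doc_info())` to get one count per loaded document.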

In [22]:
len(dict(index.docstore.get_node('9ca8cefe-29de-4b4b-b797-0cbacc055366'))['text'])

4153

In [23]:
query_engine = index.as_query_engine()

In [24]:
# Ask the same question five times to see how the answers vary between runs
for _ in range(5):
    response = query_engine.query("What did the author do growing up?")
    print(response)
    print()

The author of the text was involved in the foundation of Boston by Puritan colonists in 1630.

The author of the text was involved in the foundation of Boston by Puritan colonists in 1630.

The author of the text was involved in the foundation of Boston by Puritan colonists in 1630.

The author of the text did not mention anything about their personal life or what they did growing up.

The author founded Boston by inviting Puritan colonists to share the peninsula with them after the failing colony of Charlestown.

