-----------------------------------
#### Documents, nodes, and indexes
--------------------------------

![image.png](attachment:0ac63af3-ad0d-4954-91c1-3f16081621a4.png)

In [1]:
#pip install llama-index

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

`SimpleDirectoryReader` in LlamaIndex (formerly known as GPT Index) is a utility that loads the files in a directory into a list of `Document` objects, which can then be used to build an index.

It reads each file in the specified directory, attaches file metadata, and prepares the content for indexing, so that it can later be queried or used in applications such as retrieval-augmented generation (RAG).

**Key Features:**
- **Automated file reading**: It reads all the files in a given directory and supports a variety of file types, such as .txt, .pdf, and .docx.
- **Recursive reading**: With `recursive=True`, it also reads files in subfolders.
- **Custom parsing**: The `file_extractor` argument lets you supply custom parsing logic for specific file types or use cases.
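
To make these features concrete, here is a minimal plain-Python sketch of what a directory reader does. It is not LlamaIndex's implementation: `load_directory` and the dictionary layout are hypothetical names chosen for illustration, and only UTF-8 text files are handled.

```python
from pathlib import Path


def load_directory(root, recursive=False, extensions=(".txt",)):
    """Read matching files under `root` into simple document dicts.

    Hypothetical helper for illustration only; it mimics the idea behind a
    directory reader, not LlamaIndex's actual code.
    """
    root = Path(root)
    # rglob walks subfolders as well; glob stays in the top-level directory.
    paths = root.rglob("*") if recursive else root.glob("*")
    documents = []
    for path in sorted(paths):
        if path.is_file() and path.suffix.lower() in extensions:
            documents.append(
                {
                    "text": path.read_text(encoding="utf-8"),
                    "metadata": {
                        "file_name": path.name,
                        "file_size": path.stat().st_size,
                    },
                }
            )
    return documents
```

Calling `load_directory("files", recursive=True)` would return one dictionary per `.txt` file, each holding the file text and a small metadata record, which is roughly the shape of the `Document` objects inspected in the next cells.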

In [3]:
documents = SimpleDirectoryReader('files').load_data()

In [4]:
len(documents)

2

In [5]:
dict(documents[0])

{'id_': '0980058c-d7f6-4d5f-b4f5-2e1182feaa77',
 'embedding': None,
 'metadata': {'file_path': 'D:\\gridflowai\\NIIT-Tredence-LLM\\Day20 - 20 NOV\\files\\sample_document1.txt',
  'file_name': 'sample_document1.txt',
  'file_type': 'text/plain',
  'file_size': 612,
  'creation_date': '2024-11-20',
  'last_modified_date': '2024-09-07'},
 'excluded_embed_metadata_keys': ['file_name',
  'file_type',
  'file_size',
  'creation_date',
  'last_modified_date',
  'last_accessed_date'],
 'excluded_llm_metadata_keys': ['file_name',
  'file_type',
  'file_size',
  'creation_date',
  'last_modified_date',
  'last_accessed_date'],
 'relationships': {},
 'text': "In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire. It was known for its grand architecture, including iconic structures like the Colosseum and the Pantheon. The Romans were skilled engineers and builders, creating an extensive network of roads, aqueducts, and bridges that connected their far-reaching territories

In [6]:
from pprint import pprint

In [7]:
# Function to pretty print document content and metadata
def pretty_print_documents(documents):
    for doc in documents:
        print(f"Document ID: {doc.id_}")
        print("Metadata:")
        pprint(doc.metadata, indent=4)
        print("\nContent:")
        print(doc.text)
        print("="*80)  # separator between documents

# Call the function
pretty_print_documents(documents)

Document ID: 0980058c-d7f6-4d5f-b4f5-2e1182feaa77
Metadata:
{   'creation_date': '2024-11-20',
    'file_name': 'sample_document1.txt',
    'file_path': 'D:\\gridflowai\\NIIT-Tredence-LLM\\Day20 - 20 '
                 'NOV\\files\\sample_document1.txt',
    'file_size': 612,
    'file_type': 'text/plain',
    'last_modified_date': '2024-09-07'}

Content:
In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire. It was known for its grand architecture, including iconic structures like the Colosseum and the Pantheon. The Romans were skilled engineers and builders, creating an extensive network of roads, aqueducts, and bridges that connected their far-reaching territories. The Roman Republic, with its Senate and elected officials, gave rise to the famous Roman legions, which conquered vast lands and brought them under Roman rule. The Roman civilization's influence on art, law, and governance can still be seen in modern societies today.
Document ID: 1471c016-9984-42

In [8]:
from llama_index.core import Document

In [9]:
text = "The quick brown fox jumps over the lazy dog."

In [10]:
doc = Document(
    text     = text,
    metadata = {'author': 'bhupen','category': 'others'},
    id_      = '1'
)

print(doc)

Doc ID: 1
Text: The quick brown fox jumps over the lazy dog.


![image.png](attachment:7045dbb6-dc10-4bbb-9c86-21792d4ce78b.png)

#### Automated data ingestion

In [11]:
#!pip install wikipedia

In [12]:
#!pip install llama-index-readers-wikipedia

In [13]:
from llama_index.readers.wikipedia import WikipediaReader

In [14]:
loader = WikipediaReader()

In [15]:
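# Note: the underlying `wikipedia` package may auto-suggest a different
# page for a short title such as 'AI' (the first document printed below is
# about Earth's atmosphere). Full page titles are a safer choice.
documents = loader.load_data(
                pages=['AI', 'Deep Learning']
)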

In [16]:
print(f"loaded {len(documents)} documents")

loaded 2 documents


In [17]:
# Call the function
pretty_print_documents(documents)

Document ID: 202898
Metadata:
{}

Content:
The atmosphere of Earth is composed of a layer of gas mixture that surrounds the Earth's planetary surface (both lands and oceans), known collectively as air, with variable quantities of suspended aerosols and particulates (which create weather features such as clouds and hazes), all retained by Earth's gravity. The atmosphere serves as a protective buffer between the Earth's surface and outer space, shields the surface from most meteoroids and ultraviolet solar radiation, keeps it warm and reduces diurnal temperature variation (temperature extremes between day and night) through heat retention (greenhouse effect), redistributes heat and moisture among different regions via air currents, and provides the chemical and climate conditions allowing life to exist and evolve on Earth.
By mole fraction (i.e., by quantity of molecules), dry air contains 78.08% nitrogen, 20.95% oxygen, 0.93% argon, 0.04% carbon dioxide, and small amounts of other trace

#### Nodes

While Documents represent the raw data and can be used as such, Nodes are smaller `chunks` of content extracted from the Documents. The goal is to break down Documents into smaller, more manageable pieces of text. This serves a few purposes:

- Solves prompt limit problems: Only the relevant parts of a large document need to be sent to the LLM, so the prompt stays within the model's context window.
- Creates semantic units of data: Organizes data into smaller, more focused units.
- Allows relationships between Nodes: Nodes can be linked to one another (for example, to the previous or next chunk), which helps capture connections and dependencies between different pieces of information.
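
The basic idea of breaking a document into chunks can be sketched in plain Python. This is only an illustration under simple assumptions: `chunk_text` is a hypothetical helper that cuts at fixed character positions, whereas real node parsers split on tokens or sentences.

```python
def chunk_text(text, chunk_size):
    """Split `text` into consecutive chunks of at most `chunk_size` characters.

    Each chunk records the character range it came from, so it can be traced
    back to its position in the original document.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for start in range(0, len(text), chunk_size):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end], "start": start, "end": end})
    return chunks


chunks = chunk_text("This is a sample document text", chunk_size=16)
for chunk in chunks:
    print(chunk)
```

Because every chunk keeps its start and end offsets, joining the chunk texts in order reproduces the original document.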

![image.png](attachment:80c62e08-c80b-4f99-9d76-13adb21265a0.png)

- Nodes can also store images.
- This section focuses on the `TextNode` class.

**TextNode**

- Chunk of text derived from an original Document.
- Attributes:
    - `start_char_idx` and `end_char_idx`: Optional integers giving the chunk's start and end character positions in the source text.
    - `text_template` and `metadata_template`: Template strings that control how the text and each metadata entry are formatted.
    - `metadata_seperator`: String placed between formatted metadata entries (the attribute name is spelled this way in the output shown in this notebook).
    - `metadata`: Any useful metadata (e.g., parent Document ID, relationships, tags).
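
How the template fields fit together can be sketched with ordinary string formatting. The default values below are the ones visible in the `TextNode` output printed in this notebook; that they are combined with plain `str.format`, as done here, is an assumption about the library's internals, and `render_node` is a hypothetical helper (its `metadata_separator` parameter plays the role of `metadata_seperator`).

```python
def render_node(
    content,
    metadata,
    text_template="{metadata_str}\n\n{content}",
    metadata_template="{key}: {value}",
    metadata_separator="\n",
):
    """Format metadata entries, join them, and place them above the text."""
    metadata_str = metadata_separator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
    )
    return text_template.format(metadata_str=metadata_str, content=content)


rendered = render_node(
    "This is sentence 1.",
    {"author": "Grokkers Founder", "category": "others"},
)
print(rendered)
```

Changing `metadata_template` or the separator changes only how the metadata header is laid out; the node text itself is untouched.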

In [18]:
from llama_index.core import Document
from llama_index.core.schema import TextNode

In [19]:
doc = Document(text="This is a sample document text")

In [20]:
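# Slice the text by hand. As the output below shows, the doc_id keyword
# is not stored (relationships stays empty), so it does not link to doc.
n1 = TextNode(text=doc.text[0:16],  doc_id=doc.id_)
n2 = TextNode(text=doc.text[17:30], doc_id=doc.id_)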

In [21]:
n1, n2

(TextNode(id_='c27d06bd-d066-42c1-a3e7-a78042f49f64', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='This is a sample', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 TextNode(id_='01f1e069-a2a3-46fb-8df5-8c79bc69caf6', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='document text', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'))

In [22]:
from llama_index.core import Document
from llama_index.core.node_parser import TokenTextSplitter

In [23]:
doc = Document(
    text=(
        "This is sentence 1. "
        "This is sentence 2. "
        "This is Sentence 3. "
    ),
    metadata={"author": "Grokkers Founder"}
)

splitter = TokenTextSplitter(
    chunk_size   = 17,
    chunk_overlap= 0,
    separator    = "."
)

nodes = splitter.get_nodes_from_documents([doc])

Metadata length (8) is close to chunk size (17). Resulting chunks are less than 50 tokens. Consider increasing the chunk size or decreasing the size of your metadata to avoid this.


In [24]:
for node in nodes:
    print(node.text)
    print(node.metadata)

This is sentence 1
{'author': 'Grokkers Founder'}
. This is sentence 2
{'author': 'Grokkers Founder'}
. This is Sentence 3.
{'author': 'Grokkers Founder'}
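
The effect of `chunk_size` and `chunk_overlap` can be sketched with whitespace-separated words standing in for tokens. This is a simplification: `TokenTextSplitter` counts tokenizer tokens and honours the separator, so its chunks differ from the ones produced here. `split_words` is a hypothetical helper.

```python
def split_words(text, chunk_size, chunk_overlap=0):
    """Slide a window of `chunk_size` words, stepping by size minus overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


text = "This is sentence 1. This is sentence 2. This is Sentence 3."
no_overlap = split_words(text, chunk_size=4)
with_overlap = split_words(text, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With an overlap, neighbouring chunks share words, which yields more chunks but keeps some context on both sides of every boundary.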


In [25]:
from llama_index.core import Document
from llama_index.core.schema import (
    TextNode,
    NodeRelationship,
    RelatedNodeInfo
)

In [26]:
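doc = Document(text="First sentence. Second sentence")

# Each node keeps its own auto-generated ID; the link to the document
# belongs in a SOURCE relationship, not in the node ID.
n1  = TextNode(text="First sentence")
n2  = TextNode(text="Second sentence")

In [27]:
# Relationships hold RelatedNodeInfo objects rather than bare ID strings.
n1.relationships[NodeRelationship.SOURCE]   = RelatedNodeInfo(node_id=doc.doc_id)
n2.relationships[NodeRelationship.SOURCE]   = RelatedNodeInfo(node_id=doc.doc_id)
n1.relationships[NodeRelationship.NEXT]     = RelatedNodeInfo(node_id=n2.node_id)
n2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id=n1.node_id)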

#### Why are relationships important?

Creating relationships between Nodes in LlamaIndex can be useful for several reasons:

- **Enables more contextual querying**: By linking Nodes together, you can leverage their relationships during querying to retrieve additional relevant context. For example, when querying a node, you could also return the previous or next Nodes to provide more context.

- **Allows tracking provenance**: Relationships encode provenance – where source Nodes originated and how they are connected. This is useful when you need to identify the original source of a node.

- **Enables navigation through nodes**: Traversing Nodes by their relationships enables new types of queries, such as finding the next node that contains a specific keyword. Navigation along relationships provides another dimension for searching.

- **Supports the construction of knowledge graphs**: Nodes and relationships are the building blocks of knowledge graphs. Linking Nodes into a graph structure allows for constructing knowledge graphs from text using LlamaIndex. (More on this in Chapter 5: *Indexing with LlamaIndex*.)

- **Improves the index structure**: Some LlamaIndex indexes, such as trees and graphs, use node relationships to build their internal structure. This allows for the creation of more complex and expressive index topologies (discussed further in Chapter 5).

In summary, relationships augment Nodes with additional contextual connections, supporting more expressive querying, source tracking, knowledge graph construction, and the development of complex index structures.
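
The first point, retrieving neighbouring Nodes for extra context, can be sketched without LlamaIndex. `SimpleNode`, `link_in_order`, and `with_neighbours` are hypothetical names for illustration; the string keys "previous" and "next" stand in for the `NodeRelationship` values.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleNode:
    node_id: str
    text: str
    # Maps a relationship name ("previous" or "next") to another node's ID.
    relationships: dict = field(default_factory=dict)


def link_in_order(nodes):
    """Connect each node to the one before and after it."""
    for left, right in zip(nodes, nodes[1:]):
        left.relationships["next"] = right.node_id
        right.relationships["previous"] = left.node_id


def with_neighbours(node_id, nodes_by_id):
    """Return the text of a node together with its previous and next nodes."""
    node = nodes_by_id[node_id]
    ordered_ids = [
        node.relationships.get("previous"),
        node_id,
        node.relationships.get("next"),
    ]
    return [nodes_by_id[i].text for i in ordered_ids if i is not None]


nodes = [
    SimpleNode("n1", "First sentence."),
    SimpleNode("n2", "Second sentence."),
    SimpleNode("n3", "Third sentence."),
]
link_in_order(nodes)
nodes_by_id = {node.node_id: node for node in nodes}
context = with_neighbours("n2", nodes_by_id)
print(context)
```

A retriever that matches only the middle node can use such links to hand the LLM the surrounding text as well.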


#### Indexes

LlamaIndex supports different types of indexes, each with unique strengths and trade-offs. Here are some of the available index types:

- **SummaryIndex**: Functions like a recipe box, keeping Nodes in order for sequential access. It chunks documents into Nodes and concatenates them into a list, making it ideal for reading through large documents.

- **DocumentSummaryIndex**: Creates a concise summary for each document and maps these summaries back to their respective nodes. This facilitates quick identification of relevant documents by summarizing them for efficient information retrieval.

- **VectorStoreIndex**: The index most often used in Retrieval-Augmented Generation (RAG) applications. It converts the text of each Node into a vector embedding and, at query time, retrieves the Nodes whose embeddings are most similar to the embedding of the query (for example, by cosine similarity).

- **TreeIndex**: Organizes Nodes in a hierarchical, tree-like structure, with parent nodes storing summaries of child nodes. Summaries are generated using an LLM. This index is particularly useful for tasks that involve summarization.

- **KeywordTableIndex**: Similar to finding a dish by ingredients, this index links important keywords to the Nodes that contain them. It makes locating any node easy by searching for specific keywords.

- **KnowledgeGraphIndex**: Used to link facts within a large network of data, stored as a knowledge graph. This index excels at answering complex queries about interconnected information.

- **ComposableGraph**: Allows for building complex index structures where document-level indexes are indexed into higher-level collections. It even supports creating an "index of indexes" to access data from multiple documents in a larger collection.
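
The similarity search behind a vector index can be sketched with a toy embedding. Everything here is an assumption made for illustration: `embed` uses word counts instead of a learned embedding model, and `retrieve` scans every text rather than querying a vector store.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. Real indexes use a learned model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, texts, top_k=1):
    """Return the `top_k` texts most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        texts,
        key=lambda text: cosine_similarity(query_vector, embed(text)),
        reverse=True,
    )
    return ranked[:top_k]


texts = [
    "The Romans built roads, aqueducts, and bridges.",
    "Deep learning uses neural networks with many layers.",
    "The quick brown fox jumps over the lazy dog.",
]
best = retrieve("Who built aqueducts and roads?", texts)
print(best)
```

Only the best-matching texts would then be passed to the LLM, which is what keeps prompts short in a RAG pipeline.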

**SummaryIndex**

In [29]:
from llama_index.core import SummaryIndex, Document
from llama_index.core.schema import TextNode

In [30]:
nodes = [
    TextNode(text="Lionel Messi is a football player from Argentina."),
    TextNode(text="He also played cricket. He has won the Ballon d'Or trophy 7 times."),
    TextNode(text="Lionel Messi's hometown is Rosario."),
    TextNode(text="He was born on June 24, 1987.")
]

In [31]:
index = SummaryIndex(nodes)

In [32]:
import os

In [33]:
# Convert the index to a query engine.
# The default query engine calls OpenAI, so OPENAI_API_KEY must be set.

query_engine = index.as_query_engine()
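
Since the query in the next cell fails without an API key, a small check such as the following can be run first. It is a sketch only: `openai_key_is_set` is a hypothetical helper, and it merely tests that the environment variable is non-empty, not that the key is valid.

```python
import os


def openai_key_is_set(environ=os.environ):
    """Return True when a non-empty OPENAI_API_KEY is present."""
    return bool(environ.get("OPENAI_API_KEY", "").strip())


if not openai_key_is_set():
    print("Set OPENAI_API_KEY before calling query_engine.query(...)")
```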

In [34]:
response = query_engine.query("What is Messi's hometown?")
print(response)

Rosario


![image.png](attachment:64372979-fbe9-4c4a-b1c7-19a70818c99d.png)