## A very basic RAG

You would never build a RAG system this basic, but it helps illustrate the problems that the more advanced techniques are meant to solve.

In [1]:
#%pip install --quiet llama-index llama-index-retrievers-bm25 llama-index-llms-anthropic anthropic

In [1]:
MODEL_ID = "claude-3-7-sonnet-latest"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
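# Use .get() so a missing key triggers the assertion message instead of a KeyError
assert os.environ.get("ANTHROPIC_API_KEY", "").startswith("sk"),\
       "Please specify the ANTHROPIC_API_KEY access token in the keys.env file"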

In [2]:
import gutenberg_text_loader as gtl
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

Try reading the Anabasis of Alexander (https://www.gutenberg.org/cache/epub/46976/pg46976.txt),
a 2nd-century historical account of Alexander the Great.

In [4]:
gs = gtl.GutenbergSource()
doc = gs.load_from_url("https://www.gutenberg.org/cache/epub/46976/pg46976.txt")

2025-03-26 16:32:29,018 - INFO - Loading https://www.gutenberg.org/cache/epub/46976/pg46976.txt from cache
2025-03-26 16:32:29,069 - INFO - Cleaned Gutenberg text: removed 1033 chars from start, 18492 chars from end
2025-03-26 16:32:29,072 - INFO - Successfully loaded text from https://www.gutenberg.org/cache/epub/46976/pg46976.txt.


In [5]:
doc.text[21000:22000]

'he calls himself so in _Cynegeticus_ (v.\n6); and in _Periplus_ (xii. 5; xxv. 1), he distinguishes Xenophon by\nthe addition _the elder_. Lucian (_Alexander_, 56) calls Arrian simply\n_Xenophon_. During the stay of the emperor Hadrian at Athens, A.D. 126,\nArrian gained his friendship. He accompanied his patron to Rome, where\nhe received the Roman citizenship. In consequence of this, he assumed\nthe name of Flavius.[2] In the same way the Jewish historian, Josephus,\nhad been allowed by Vespasian and Titus to bear the imperial name\nFlavius.[3]\n\nPhotius says, that Arrian had a distinguished career in Rome, being\nentrusted with various political offices, and at last reaching the\nsupreme dignity of consul under Antoninus Pius.[4] Previous to this\nhe was appointed (A.D. 132) by Hadrian, Governor of Cappadocia, which\nprovince was soon after invaded by the Alani, or Massagetae, whom he\ndefeated and expelled.[5] When Marcus Aurelius came to the throne,\nArrian withdrew into private 

In [6]:
print(doc.id_)

d8d32970-34fb-4ade-a5e5-fc4283fac4a9


## Step 1: Index document

We will break the document up into chunks and index the chunks using BM25.
See: https://kmwllc.com/index.php/2020/03/20/understanding-tf-idf-and-bm-25/
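
To make the ranking idea concrete, here is a minimal sketch of textbook Okapi BM25 scoring in plain Python. It is an illustration only: the whitespace tokenizer, the `k1` and `b` defaults, the Lucene-style idf, and the toy sentences are assumptions made for this example, and the library behind `BM25Retriever` may differ in tokenization, stemming, and the exact BM25 variant.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with textbook Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a higher idf than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical sentences, not the indexed book).
docs = [
    "if diaphragm is ruptured replace the safety head",
    "unscrew diaphragm cap and pull out washer",
    "spring is broken and spring case should be replaced",
]
scores = bm25_scores("diaphragm ruptured", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The document that contains both query terms ranks first, and a document that shares no term with the query scores exactly zero, which is the behavior that matters for the limitations discussed later.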

In [7]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core import Document

class Indexer:
    """
    A class that chunks documents and stores the resulting nodes in a LlamaIndex docstore for BM25 retrieval.
    
    Attributes:
        chunk_size (int): Size of text chunks for processing.
        chunk_overlap (int): Overlap between text chunks.
        docstore (SimpleDocumentStore): Document store for storing processed documents.
    """
    
    def __init__(
        self,
        cache_dir: str = "./.cache",
        chunk_size: int = 1024,
        chunk_overlap: int = 20
    ):
        """
        Initialize the Indexer.
        
        Args:
            cache_dir (str): Cache directory; currently unused. Defaults to "./.cache".
            chunk_size (int): Size of text chunks for processing. Defaults to 1024.
            chunk_overlap (int): Overlap between text chunks. Defaults to 20.
        """        
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        
        # Initialize a simple document store
        self.docstore = SimpleDocumentStore()
        
        self.node_parser = SentenceSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap
        )
        
        logger.info("Indexer initialized")
    

    def add_document_to_index(self, document: Document):
        # Parse the document into nodes
        nodes = self.node_parser.get_nodes_from_documents([document])

        # Add nodes to the document store
        self.docstore.add_documents(nodes)

        logger.info(f"Successfully loaded text from {document.id_} -- {len(nodes)} nodes created.")
            
    def get_docstore(self) -> SimpleDocumentStore:
        return self.docstore

In [8]:
index = Indexer(chunk_size=100, chunk_overlap=20)
index.add_document_to_index(doc)

2025-03-26 16:32:34,088 - INFO - Indexer initialized
2025-03-26 16:32:41,245 - INFO - Successfully loaded text from d8d32970-34fb-4ade-a5e5-fc4283fac4a9 -- 6104 nodes created.


## Step 2: Retrieve nodes that match query

In [9]:
from llama_index.retrievers.bm25 import BM25Retriever
retriever = BM25Retriever.from_defaults(
    docstore=index.get_docstore(),
    similarity_top_k=5)

2025-03-26 16:32:43,327 - DEBUG - Building index from IDs objects


In [10]:
from llama_index.core.response.notebook_utils import display_source_node
retrieved_nodes = retriever.retrieve("Describe the relationship between Alexander and Diogenes")
for node in retrieved_nodes:
    display_source_node(node, 1024)

**Node ID:** 9af68013-02e5-46c5-b269-043de46e5fc4<br>**Similarity:** 4.2463765144348145<br>**Text:** But Diogenes said that he
wanted nothing else, except that he and his attendants would stand out
of the sunlight. Alexander is said to have expressed his admiration
of Diogenes’s conduct.<br>

**Node ID:** 1a90db02-ab85-4048-9630-e9ccc553b69e<br>**Similarity:** 4.118840217590332<br>**Text:** 100 stades; and most of it is the mean between
these breadths.[642] This river Indus Alexander crossed at daybreak
with his army into the country of the Indians; concerning whom, in
this history I have described neither what laws they enjoy,<br>

**Node ID:** 6fa83134-b9a3-4ef3-8d8b-11b4903ce9fa<br>**Similarity:** 3.639586925506592<br>**Text:** 32). Alexander said: “If I were
not Alexander, I should like to be Diogenes.” Cf. _Arrian_, i. 1;
Plutarch (_de Fortit. Alex._, p. 331).<br>

**Node ID:** b4525555-6685-499c-af0d-656909d02e7b<br>**Similarity:** 3.4104578495025635<br>**Text:** Alexander is said to have expressed his admiration
of Diogenes’s conduct.[832] Thus it is evident that Alexander was
not entirely destitute of better feelings; but he was the slave of
his insatiable ambition.<br>

**Node ID:** 4431854c-be3d-43b1-9a10-29afab936025<br>**Similarity:** 3.2550690174102783<br>**Text:** He also ascertained that for
the present Bessus held the supreme command, both on account of his
relationship to Darius and because the war was being carried on in his
viceregal province. Hearing this,<br>

## Step 3: Generate using these nodes

In [11]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(
    model=MODEL_ID,
    api_key=os.environ['ANTHROPIC_API_KEY'],
    temperature=0.2
)

In [12]:
from llama_index.core.llms import ChatMessage
messages = [
    ChatMessage(
        role="system", content="Use the following text to answer the given question."
    )
]
messages += [
    ChatMessage(role="system", content=node.text) for node in retrieved_nodes
]
messages += [
    ChatMessage(role="user", content="Describe the relationship between Alexander and Diogenes.")
]
response = llm.chat(messages)
print(response)

2025-03-26 16:32:53,376 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"


assistant: Based on the text, Alexander and Diogenes had a brief but notable encounter. When Alexander met Diogenes, Alexander asked what he could do for him, but Diogenes simply requested that Alexander and his attendants move out of his sunlight. Rather than being offended by this dismissive response, Alexander is said to have expressed admiration for Diogenes's conduct. 

The text also mentions that Alexander reportedly said, "If I were not Alexander, I should like to be Diogenes," suggesting he respected Diogenes's philosophical approach and independence. However, the text notes that despite showing this capacity for "better feelings," Alexander remained "the slave of his insatiable ambition," implying a contrast between Diogenes's simple, unambitious lifestyle and Alexander's conquering nature.


## LlamaIndex Query engine to simplify Step 3

In [14]:
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever, llm=llm
)

response = query_engine.query("Describe the relationship between Alexander and Diogenes.")
response = {
    "answer": str(response),
    "source_nodes": response.source_nodes
}
print(response['answer'])

2025-03-26 16:33:19,766 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"


The relationship between Alexander and Diogenes was marked by a notable encounter where Diogenes requested only that Alexander and his attendants stand out of his sunlight. Rather than being offended by this unusual request from someone addressing such a powerful figure, Alexander expressed admiration for Diogenes' conduct. 

This interaction reveals something about both men's characters. Alexander, despite his immense power and ambition, showed appreciation for Diogenes' simple and independent nature. In fact, Alexander is quoted as saying, "If I were not Alexander, I should like to be Diogenes," suggesting he respected the philosopher's way of life.

While Alexander was described as "not entirely destitute of better feelings," he was ultimately characterized as "the slave of his insatiable ambition," which contrasts with Diogenes' apparent contentment with merely having access to sunlight.


In [15]:
for node in response['source_nodes']:
    print(node)

Node ID: 9af68013-02e5-46c5-b269-043de46e5fc4
Text: But Diogenes said that he wanted nothing else, except that he
and his attendants would stand out of the sunlight. Alexander is said
to have expressed his admiration of Diogenes’s conduct.
Score:  4.246

Node ID: 1a90db02-ab85-4048-9630-e9ccc553b69e
Text: 100 stades; and most of it is the mean between these
breadths.[642] This river Indus Alexander crossed at daybreak with his
army into the country of the Indians; concerning whom, in this history
I have described neither what laws they enjoy,
Score:  4.119

Node ID: 6fa83134-b9a3-4ef3-8d8b-11b4903ce9fa
Text: 32). Alexander said: “If I were not Alexander, I should like to
be Diogenes.” Cf. _Arrian_, i. 1; Plutarch (_de Fortit. Alex._, p.
331).
Score:  3.640

Node ID: b4525555-6685-499c-af0d-656909d02e7b
Text: Alexander is said to have expressed his admiration of Diogenes’s
conduct.[832] Thus it is evident that Alexander was not entirely
destitute of better feelings; but he was the slave

## End to end example

In [18]:
def build_query_engine(urls: list[str], chunk_size: int) -> RetrieverQueryEngine:
    gs = gtl.GutenbergSource()
    index = Indexer(chunk_size=chunk_size, chunk_overlap=chunk_size//10)
    
    for url in urls:
        doc = gs.load_from_url(url)
        index.add_document_to_index(doc)
    
    retriever = BM25Retriever.from_defaults(
        docstore=index.get_docstore(),
        similarity_top_k=5)
    
    llm = Anthropic(
        model=MODEL_ID,
        api_key=os.environ['ANTHROPIC_API_KEY'],
        temperature=0.2
    )
    
    query_engine = RetrieverQueryEngine.from_args(
        retriever=retriever, llm=llm
    )
    
    return query_engine

def print_response_to_query(query_engine: RetrieverQueryEngine, query: str):
    response = query_engine.query(query)
    response = {
        "answer": str(response),
        "source_nodes": response.source_nodes
    }
    print(response['answer'])
    print("\n\n**Sources**:")
    for node in response['source_nodes']:
        print(node)

In [19]:
query_engine = build_query_engine(["https://www.gutenberg.org/files/53669/53669-0.txt"], 100) # Portable Flame Thrower
print_response_to_query(query_engine, "What should I do if the diaphragm is ruptured?")

2025-03-26 16:33:31,935 - INFO - Indexer initialized
2025-03-26 16:33:31,938 - INFO - Loading https://www.gutenberg.org/files/53669/53669-0.txt from cache
2025-03-26 16:33:31,952 - INFO - Cleaned Gutenberg text: removed 50 chars from start, 49 chars from end
2025-03-26 16:33:31,954 - INFO - Successfully loaded text from https://www.gutenberg.org/files/53669/53669-0.txt.
2025-03-26 16:33:32,521 - INFO - Successfully loaded text from 0b831228-929d-440c-9dff-214c1568e6bb -- 1208 nodes created.
2025-03-26 16:33:32,676 - DEBUG - Building index from IDs objects
2025-03-26 16:33:35,382 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"


If the diaphragm is ruptured, you should replace the safety head with an unbroken head. Additionally, if you notice any tears, separation, or leaks occurring at the diaphragm, you should replace the entire valve-diaphragm assembly.

When handling the diaphragm components during maintenance, remember to unscrew the diaphragm cap by hand (not using a wrench) and be careful not to disturb the position of the yoke block by turning the needle, as this would affect the valve-needle adjustment.


**Sources**:
Node ID: bcb413ea-6213-4774-bbd6-5d97e13f3222
Text: Inspect to see if diaphragm is intact. If diaphragm is ruptured,
replace the safety head with an unbroken head.
Score:  4.869

Node ID: 5f5de0db-4942-4fd1-9ac0-e6b4490351e6
Text: (3) Unscrew diaphragm cap and pull out washer, support, and
valve-diaphragm assembly. To prevent loss of valve-needle adjustment
(Fig 54), do not disturb position of yoke block by turning the needle.
Score:  3.282

Node ID: 1cf20eb6-8a0d-4cbe-9efe-b1d3f7b2637c


## Limitation 1: Semantic Understanding

Although "ruptured" and "broken" mean the same thing here, BM25 matches on the words in the query rather than on their meaning, so the two queries retrieve very different nodes: the search for "broken" does not return the sentence explaining what to do when the diaphragm is ruptured (and vice versa).
As a result, the answer generated for "broken" misses the key point about replacing the safety head.
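
A minimal sketch of why this happens, assuming a plain lowercase word tokenizer (the real retriever's tokenization may differ): compare which words each query shares with the passage that holds the answer. The helper name `shared_terms` is hypothetical; the passage is the top source node from the "ruptured" query.

```python
def shared_terms(query, passage):
    """Return the lowercase word tokens that the query and the passage have in common."""
    punctuation = ".,;:?!()\"'"
    query_words = {w.strip(punctuation) for w in query.lower().split()}
    passage_words = {w.strip(punctuation) for w in passage.lower().split()}
    return (query_words & passage_words) - {""}

passage = ("Inspect to see if diaphragm is intact. If diaphragm is ruptured, "
           "replace the safety head with an unbroken head.")

ruptured_overlap = shared_terms("What should I do if the diaphragm is ruptured?", passage)
broken_overlap = shared_terms("What should I do if the diaphragm is broken?", passage)

print(sorted(ruptured_overlap))
print(sorted(broken_overlap))
print("matched only by the 'ruptured' query:", ruptured_overlap - broken_overlap)
```

The only distinctive term the passage offers is "ruptured". The "broken" query shares nothing with it beyond common words, and "unbroken" is a different token, so a purely lexical scorer has little reason to rank this passage highly.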

In [20]:
print_response_to_query(query_engine, "What should I do if the diaphragm is broken?")

2025-03-26 16:33:41,946 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"


If the diaphragm is broken, you should replace the valve-diaphragm assembly. You would need to unscrew the diaphragm cap and pull out the washer, support, and valve-diaphragm assembly. When doing this repair, it's important not to disturb the position of the yoke block by turning the needle, as this would affect the valve-needle adjustment. After replacing the damaged components, you should screw the diaphragm cap back on by hand without using a wrench, and then install the valve grip.


**Sources**:
Node ID: 5f5de0db-4942-4fd1-9ac0-e6b4490351e6
Text: (3) Unscrew diaphragm cap and pull out washer, support, and
valve-diaphragm assembly. To prevent loss of valve-needle adjustment
(Fig 54), do not disturb position of yoke block by turning the needle.
Score:  3.282

Node ID: 043fd85f-3f8d-4bfe-8e5f-44bbfd091bf1
Text: (Par 49)    (2) _Spring-case assembly._ If outer case rotates
and inner case does   not, and no spring action occurs, spring is
broken and spring case   should be replaced as 

## Limitation 2: Chunk size

The results vary considerably with the chunk size, and it is unclear which chunk size is best for a given query.
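
To get a feel for what the chunk size changes, here is a rough sketch of sliding-window chunking. It is a simplification: it counts whitespace-separated words, whereas `SentenceSplitter` measures chunks in tokens and tries to keep sentences intact. The function `chunk_words` and the synthetic text are assumptions made for this illustration.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into windows of `chunk_size` words that overlap by `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Synthetic 1000-word text, with overlap set to one tenth of the chunk size.
text = " ".join(f"w{i}" for i in range(1000))
for size in (100, 200, 500):
    chunks = chunk_words(text, chunk_size=size, chunk_overlap=size // 10)
    print(size, len(chunks))
```

Smaller chunks produce many short candidates that may omit the surrounding instructions, while larger chunks carry more context per retrieved node but dilute the matched terms; with a fixed `similarity_top_k`, the amount of text handed to the LLM changes accordingly.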

In [21]:
def print_response(chunk_size: int) -> None:
    query_engine = build_query_engine(["https://www.gutenberg.org/files/53669/53669-0.txt"],
                                     chunk_size=chunk_size)
    response = query_engine.query("What should I do if the diaphragm is ruptured?")
    print(response)

print_response(100)

2025-03-26 16:33:42,137 - INFO - Indexer initialized
2025-03-26 16:33:42,142 - INFO - Loading https://www.gutenberg.org/files/53669/53669-0.txt from cache
2025-03-26 16:33:42,158 - INFO - Cleaned Gutenberg text: removed 50 chars from start, 49 chars from end
2025-03-26 16:33:42,160 - INFO - Successfully loaded text from https://www.gutenberg.org/files/53669/53669-0.txt.
2025-03-26 16:33:42,726 - INFO - Successfully loaded text from 92daa847-5f40-4692-97b7-8bc943f54f17 -- 1208 nodes created.
2025-03-26 16:33:42,885 - DEBUG - Building index from IDs objects
2025-03-26 16:33:45,718 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"


If the diaphragm is ruptured, you should replace the safety head with an unbroken head. Additionally, if you notice any tears, separation, or leaks occurring at the diaphragm, you should replace the entire valve-diaphragm assembly.

When handling the diaphragm components during maintenance, remember to unscrew the diaphragm cap by hand (not using a wrench) and be careful not to disturb the position of the yoke block by turning the needle, as this would affect the valve-needle adjustment.


In [22]:
print_response(200)

2025-03-26 16:33:45,736 - INFO - Indexer initialized
2025-03-26 16:33:45,739 - INFO - Loading https://www.gutenberg.org/files/53669/53669-0.txt from cache
2025-03-26 16:33:45,751 - INFO - Cleaned Gutenberg text: removed 50 chars from start, 49 chars from end
2025-03-26 16:33:45,753 - INFO - Successfully loaded text from https://www.gutenberg.org/files/53669/53669-0.txt.
2025-03-26 16:33:46,062 - INFO - Successfully loaded text from b29e8780-f093-41cb-9138-2924ee6d1d99 -- 376 nodes created.
2025-03-26 16:33:46,387 - DEBUG - Building index from IDs objects
2025-03-26 16:33:50,268 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"


If the diaphragm is ruptured, you should replace the safety head with an unbroken head. When checking the safety-head plug, you'll need to remove the deflector tube from the head (using your hand, not a wrench) to inspect if the diaphragm is intact. After replacement, reassemble the plug, head, and deflector tube in the left fuel tank.


In [23]:
print_response(500)

2025-03-26 16:33:50,291 - INFO - Indexer initialized
2025-03-26 16:33:50,294 - INFO - Loading https://www.gutenberg.org/files/53669/53669-0.txt from cache
2025-03-26 16:33:50,306 - INFO - Cleaned Gutenberg text: removed 50 chars from start, 49 chars from end
2025-03-26 16:33:50,309 - INFO - Successfully loaded text from https://www.gutenberg.org/files/53669/53669-0.txt.
2025-03-26 16:33:50,542 - INFO - Successfully loaded text from 3bfc6656-bcbf-403b-991a-e9bd762d9a59 -- 124 nodes created.
2025-03-26 16:33:50,594 - DEBUG - Building index from IDs objects
2025-03-26 16:33:55,299 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"


If you find that the diaphragm is ruptured, you should replace the safety head with an unbroken head. After replacement, you'll need to reassemble the plug, head, and deflector tube in the left fuel tank. When reinstalling, the deflector tube should face to the rear at a 45-degree angle to the operator's left shoulder. Remember to screw in the deflector tube by hand only (do not use a wrench on it), and then tighten the lock nut with a wrench.


## Exploring tf-idf


In [25]:
gs = gtl.GutenbergSource()
index = Indexer(chunk_size=200, chunk_overlap=0)
for url in [
    "https://www.gutenberg.org/cache/epub/46976/pg46976.txt", # Alexander
    # "https://www.gutenberg.org/cache/epub/6400/pg6400.txt", # Twelve Caesars
    # "https://www.gutenberg.org/cache/epub/3296/pg3296.txt", # Augustine
]:
    doc = gs.load_from_url(url)
    index.add_document_to_index(doc)
docstore = index.get_docstore()

2025-03-26 16:33:56,625 - INFO - Indexer initialized
2025-03-26 16:33:56,651 - INFO - Loading https://www.gutenberg.org/cache/epub/46976/pg46976.txt from cache
2025-03-26 16:33:56,695 - INFO - Cleaned Gutenberg text: removed 1033 chars from start, 18492 chars from end
2025-03-26 16:33:56,697 - INFO - Successfully loaded text from https://www.gutenberg.org/cache/epub/46976/pg46976.txt.
2025-03-26 16:33:58,532 - INFO - Successfully loaded text from cdfce596-f986-4ee5-9e11-99040b5efa88 -- 1788 nodes created.


In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
corpus = [str(value.text) for key, value in docstore.docs.items()]
tfidf_vector = tfidf_vectorizer.fit_transform(corpus)
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

count_vectorizer = CountVectorizer(stop_words='english')
count_vector = count_vectorizer.fit_transform(corpus)
count_df = pd.DataFrame(count_vector.toarray(), columns=count_vectorizer.get_feature_names_out())

In [27]:
tfidf_df.columns[3050], count_df.columns[3050]

('circuit', 'circuit')

In [28]:
tfidf_df[['astonishment']].sum().values[0], count_df[['astonishment']].sum().values[0]

(0.4144554127366783, 2)

In [29]:
try:
    tfidf_df["Describe the relationship between Alexander and Diogenes".lower().split()].sum()
except Exception as e:
    print("ERROR:", e)

ERROR: "['describe', 'the', 'between', 'and'] not in index"


In [30]:
def count_tfidf(words):
    print("Count:")
    print(count_df[words.lower().split()].sum())
    print("TFIDF:")
    print(tfidf_df[words.lower().split()].sum())

count_tfidf("relationship Alexander Diogenes")

Count:
relationship       1
alexander       1311
diogenes           6
dtype: int64
TFIDF:
relationship     0.255707
alexander       61.043674
diogenes         1.013318
dtype: float64
