# Hands-on 1: Ingestion and Chunking

### 🎯 Problem
Given Paul Graham's essay, build a sophisticated chat engine that can handle this target conversation:

- Primary query: “What did Paul Graham do in the summer of 1995?”
- Follow-up query: “Why did he organize it?”

### 🔍 Suggested tasks
- 🔍 Use the previous QueryFusionRetriever
- 🔤 Create a ChatMemoryBuffer
- 🚀 Create a CondensePlusContextChatEngine
- 📊 Validate the result!
- 🌐 [+] Create a Chat & QueryGeneration prompt

## Code

In [1]:
# If running on Colab, run this cell to install the dependencies and download the essay.
# The version specifier with ">=" is quoted so the shell does not treat ">" as a redirect.
%pip install httpx==0.27.2
%pip install llama-index==0.12.3
%pip install "llama-index-retrievers-bm25>=0.5.0"
%pip install openai==1.57.0
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'


In [1]:
# load .env
# from dotenv import load_dotenv
# load_dotenv()
import os
os.environ["OPENAI_API_KEY"] = "sk-..."

In [2]:
import nest_asyncio

nest_asyncio.apply()

In [3]:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from rich import print as rprint
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core import SimpleDirectoryReader
from llama_index.core.retrievers.fusion_retriever import FUSION_MODES
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.chat_engine.condense_plus_context import CondensePlusContextChatEngine
from llama_index.core.base.base_retriever import BaseRetriever
from llama_index.core.base.base_query_engine import BaseQueryEngine


In [4]:
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

splitter = SentenceSplitter(chunk_size=256)

index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], show_progress=True
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 18.85it/s]
Generating embeddings: 100%|██████████| 587/587 [00:09<00:00, 60.05it/s]


In [5]:
QUERY_GEN_PROMPT="""You are a helpful assistant that generates multiple search queries based on a single input query. Generate {num_queries} search queries, one on each line, related to the following input query:
Query: {query}
Queries:"""
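
The template exposes two placeholders, `{num_queries}` and `{query}`. As a minimal sketch, assuming the retriever fills them with ordinary string formatting, the rendered prompt can be inspected before any LLM call. The helper `render_query_gen_prompt` is hypothetical and only there for illustration. In the runs further down, `num_queries=3` yields two generated queries, which is consistent with the original query counting as one of the three; treat that as an observation from this notebook's output rather than a documented guarantee.

```python
# Same template as QUERY_GEN_PROMPT above, repeated so this block runs on its own.
QUERY_GEN_PROMPT = """You are a helpful assistant that generates multiple search queries based on a single input query. Generate {num_queries} search queries, one on each line, related to the following input query:
Query: {query}
Queries:"""


def render_query_gen_prompt(query: str, num_queries: int) -> str:
    """Hypothetical helper: fill the two placeholders the template exposes."""
    return QUERY_GEN_PROMPT.format(num_queries=num_queries, query=query)


rendered = render_query_gen_prompt(
    "What did Paul Graham do in the summer of 1995?", num_queries=2
)
print(rendered)
```

Printing the rendered prompt is a cheap way to check wording changes before spending tokens on a full retrieval run.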

In [6]:
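def init_query_fusion_retriever(index: VectorStoreIndex, similarity_top_k: int, mode: FUSION_MODES, num_queries: int = 3) -> QueryFusionRetriever:
    """Build a hybrid retriever that fuses dense (vector) and sparse (BM25) results."""
    # Dense retriever backed by the index embeddings.
    vector_retriever = index.as_retriever(similarity_top_k=similarity_top_k)
    # Sparse keyword retriever over the same nodes stored in the index docstore.
    bm25_retriever = BM25Retriever.from_defaults(
        docstore=index.docstore, similarity_top_k=similarity_top_k
    )
    # Fuse both result lists; extra query variants are generated from QUERY_GEN_PROMPT.
    retriever = QueryFusionRetriever(
        [vector_retriever, bm25_retriever],
        similarity_top_k=similarity_top_k,
        num_queries=num_queries,
        mode=mode,
        use_async=True,
        verbose=True,
        query_gen_prompt=QUERY_GEN_PROMPT,
    )
    return retriever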
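
The fusion mode used later in this notebook is `RECIPROCAL_RANK`. As a rough sketch of the idea, and not necessarily the library's exact implementation, reciprocal rank fusion scores each node by summing `1 / (rank + k)` over every ranked list it appears in. The function name `reciprocal_rank_fusion`, the constant `k = 60`, and the toy node ids are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: float = 60.0, top_k: int = 5
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of node ids into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # rank is 0-based; k dampens the gap between neighbouring ranks.
        for rank, node_id in enumerate(ranking):
            scores[node_id] += 1.0 / (rank + k)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]


# Toy rankings standing in for the vector and BM25 retriever outputs.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits], top_k=3)
print([node_id for node_id, _ in fused])
```

Nodes that appear near the top of both lists (here `b` and `a`) come out ahead of nodes that only one retriever found, which is the behaviour a hybrid retriever is after.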

In [7]:
CHAT_PROMPT = """You're a chatbot, able to have normal conversations, about an essay discussing Paul Graham's life.
Here are the relevant documents for the context:
{context_str}
Instruction: use the previous chat history, or the context above, to interact with the user. Use only the information provided in the context to generate responses."""

In [8]:
def init_condense_plus_context_chat_engine(retriever: BaseRetriever, query_engine: BaseQueryEngine) -> CondensePlusContextChatEngine:

    memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

    chat_engine = CondensePlusContextChatEngine.from_defaults(
        retriever=retriever,
        query_engine=query_engine,
        memory=memory,
        context_prompt=CHAT_PROMPT,
    )
    return chat_engine
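
The `token_limit=3000` argument caps how much history the memory hands back to the engine. The sketch below illustrates the general idea of a token-limited buffer: keep the most recent messages that fit and drop the oldest. `TokenLimitedBuffer`, `Message`, and the whitespace-based `count_tokens` are hypothetical stand-ins, not the `ChatMemoryBuffer` implementation, which uses a real tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str
    content: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


@dataclass
class TokenLimitedBuffer:
    token_limit: int
    messages: list[Message] = field(default_factory=list)

    def put(self, message: Message) -> None:
        self.messages.append(message)

    def get(self) -> list[Message]:
        """Return the most recent messages that fit within token_limit."""
        kept: list[Message] = []
        used = 0
        # Walk from newest to oldest and stop once the budget is exceeded.
        for message in reversed(self.messages):
            cost = count_tokens(message.content)
            if used + cost > self.token_limit:
                break
            kept.append(message)
            used += cost
        return list(reversed(kept))

    def reset(self) -> None:
        self.messages.clear()


buffer = TokenLimitedBuffer(token_limit=8)
buffer.put(Message("user", "alpha beta gamma delta epsilon zeta eta"))  # 7 tokens
buffer.put(Message("assistant", "one two three four five"))  # 5 tokens
buffer.put(Message("user", "why"))  # 1 token

window = buffer.get()
print([m.content for m in window])

buffer.reset()
print(buffer.get())
```

With a small limit the oldest turn falls out of the window, so a follow-up that depends on it can no longer be resolved. The same thing happens after `reset()`, which is what the `chat_engine.reset()` cell later in this notebook demonstrates.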


In [9]:
retriever = init_query_fusion_retriever(index, similarity_top_k=5, mode=FUSION_MODES.RECIPROCAL_RANK, num_queries=3)
query_engine = RetrieverQueryEngine(retriever)
chat_engine = init_condense_plus_context_chat_engine(retriever=retriever, query_engine=query_engine)

In [10]:

chat_engine.chat("What did Paul Graham do in the summer of 1995?")

Generated queries:
1. What projects did Paul Graham work on in the summer of 1995?
2. How did Paul Graham's activities in the summer of 1995 influence his career trajectory?


AgentChatResponse(response='In the summer of 1995, Paul Graham and his team used a building he owned in Cambridge as their headquarters. They organized weekly dinners on Tuesdays and invited experts on startups to give talks after dinner. They also created something called the Summer Founders Program, inviting undergraduates to apply. This initiative received 225 applications, including from recent graduates or those about to graduate that spring.', sources=[ToolOutput(content='[NodeWithScore(node=TextNode(id_=\'8d51f2b1-4a01-48ac-8e75-9306b66ea07b\', embedding=None, metadata={\'file_path\': \'c:\\\\Users\\\\n.fretti\\\\Desktop\\\\projects\\\\rag_and_roll\\\\lesson_02\\\\data\\\\paul_graham\\\\paul_graham_essay.txt\', \'file_name\': \'paul_graham_essay.txt\', \'file_type\': \'text/plain\', \'file_size\': 75042, \'creation_date\': \'2024-12-04\', \'last_modified_date\': \'2024-12-04\'}, excluded_embed_metadata_keys=[\'file_name\', \'file_type\', \'file_size\', \'creation_date\', \'last_

In [11]:
chat_engine.chat("Why did he organize it?")

Generated queries:
1. What were the specific activities organized by Paul Graham in the summer of 1995?
2. How did the activities organized by Paul Graham in the summer of 1995 impact the participants?


AgentChatResponse(response='Paul Graham organized the Summer Founders Program and other activities in the summer of 1995 to provide undergraduates with an alternative to traditional summer jobs at tech companies. The goal was to give them the opportunity to start their own startups instead of working at places like Microsoft. Additionally, Graham and his team saw this as a way to gain experience as investors by funding multiple startups at once and practicing their investor skills on the participants of the program.', sources=[ToolOutput(content='[NodeWithScore(node=TextNode(id_=\'3b6d272b-2d0f-4b72-ae44-dcd08627b775\', embedding=None, metadata={\'file_path\': \'c:\\\\Users\\\\n.fretti\\\\Desktop\\\\projects\\\\rag_and_roll\\\\lesson_02\\\\data\\\\paul_graham\\\\paul_graham_essay.txt\', \'file_name\': \'paul_graham_essay.txt\', \'file_type\': \'text/plain\', \'file_size\': 75042, \'creation_date\': \'2024-12-04\', \'last_modified_date\': \'2024-12-04\'}, excluded_embed_metadata_keys=[\
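
The follow-up "Why did he organize it?" only makes sense together with the previous turn. A condense-plus-context engine first rewrites the follow-up into a standalone question using the chat history, then retrieves context for that rewritten question. The sketch below shows how such a condense prompt could be assembled; `CONDENSE_TEMPLATE` and `build_condense_prompt` are hypothetical and are not the library's default prompt.

```python
# Hypothetical condense template, written for illustration only.
CONDENSE_TEMPLATE = """Given the conversation below and a follow-up message, rewrite the follow-up as a standalone question.

Chat history:
{chat_history}

Follow-up: {question}
Standalone question:"""


def build_condense_prompt(history: list[tuple[str, str]], question: str) -> str:
    """Render the chat history and the follow-up into one prompt string."""
    lines = [f"{role}: {text}" for role, text in history]
    return CONDENSE_TEMPLATE.format(
        chat_history="\n".join(lines), question=question
    )


history = [
    ("user", "What did Paul Graham do in the summer of 1995?"),
    ("assistant", "(assistant answer from the previous turn)"),
]
condense_prompt = build_condense_prompt(history, "Why did he organize it?")
print(condense_prompt)
```

The LLM's answer to a prompt like this is what gets sent to the retriever, which is why the generated queries above mention Paul Graham and 1995 even though the follow-up itself does not.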

In [12]:
chat_engine.reset()

In [13]:
chat_engine.chat("What did I ask you?")

Generated queries:
1. How can I improve my memory recall?
2. What are some techniques for enhancing cognitive function?


AgentChatResponse(response="You asked me about an essay discussing Paul Graham's life. Would you like to know more details about it?", sources=[ToolOutput(content='[NodeWithScore(node=TextNode(id_=\'19fde6ea-4a39-4948-b911-eb016f54de41\', embedding=None, metadata={\'file_path\': \'c:\\\\Users\\\\n.fretti\\\\Desktop\\\\projects\\\\rag_and_roll\\\\lesson_02\\\\data\\\\paul_graham\\\\paul_graham_essay.txt\', \'file_name\': \'paul_graham_essay.txt\', \'file_type\': \'text/plain\', \'file_size\': 75042, \'creation_date\': \'2024-12-04\', \'last_modified_date\': \'2024-12-04\'}, excluded_embed_metadata_keys=[\'file_name\', \'file_type\', \'file_size\', \'creation_date\', \'last_modified_date\', \'last_accessed_date\'], excluded_llm_metadata_keys=[\'file_name\', \'file_type\', \'file_size\', \'creation_date\', \'last_modified_date\', \'last_accessed_date\'], relationships={<NodeRelationship.SOURCE: \'1\'>: RelatedNodeInfo(node_id=\'a9934eb1-c149-4e2c-a5dd-5bdd8e472503\', node_type=\'4\', meta

Try asking more questions, or experiment with different wording for QUERY_GEN_PROMPT and CHAT_PROMPT, and see how the answers change.
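
To run several questions in a row and compare answers, a small loop helps. This is a minimal sketch: `run_conversation` and `EchoEngine` are hypothetical names, and the helper only assumes the engine exposes a `chat(message)` method whose result converts to the answer text with `str()`. The stand-in engine lets the block run without an API key; in the notebook you would pass `chat_engine` instead.

```python
from typing import Any, Protocol


class SupportsChat(Protocol):
    def chat(self, message: str) -> Any: ...


def run_conversation(
    engine: SupportsChat, questions: list[str]
) -> list[tuple[str, str]]:
    """Send each question in order and collect (question, answer) pairs."""
    transcript = []
    for question in questions:
        response = engine.chat(question)
        # Assumption: str(response) yields the answer text.
        transcript.append((question, str(response)))
    return transcript


class EchoEngine:
    """Stand-in engine so the helper can be exercised without an API key."""

    def __init__(self) -> None:
        self.turns = 0

    def chat(self, message: str) -> str:
        self.turns += 1
        return f"turn {self.turns}: {message}"


transcript = run_conversation(
    EchoEngine(),
    ["What did Paul Graham do in the summer of 1995?", "Why did he organize it?"],
)
for question, answer in transcript:
    print(f"Q: {question}\nA: {answer}\n")
```

Calling `chat_engine.reset()` between runs keeps one experiment's history from leaking into the next when comparing prompt variants.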