## Chapter 4

    - This notebook contains the code for Chapter 4 of the book.

In [1]:
from getpass import getpass
import os
OPENAI_API_KEY = getpass()

In [2]:
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

### Recipe 4.2a: Loading a PDF document (default parameters)

    NOTE: Install pypdf before running the code

In [4]:
from langchain_community.document_loaders import PyPDFLoader
 
## Step 1: Define the path to the PDF file
path = "./rag_resources/product_specification_sheet.pdf"
 
## Step 2: Instantiate the class with the path
pdf_loader = PyPDFLoader(path)
 

## Step 3: Load into memory
pdf_docs_basic = pdf_loader.load() 

## Step 4: Print the Document class returned.
print(pdf_docs_basic) 


[Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2025-05-23T10:56:06+00:00', 'moddate': '2025-05-23T10:56:06+00:00', 'source': './rag_resources/product_specification_sheet.pdf', 'total_pages': 4, 'page': 0, 'page_label': '1'}, page_content='Product Specification Sheet – Zonic Earbuds Pro \n \nProduct Introduction \nThe Zonic Earbuds Pro represent the next generation of true wireless audio for consumers \nwho value performance, reliability, and convenience. Developed for busy professionals and \ntech-savvy users alike, these earbuds integrate premium hardware and sophisticated software \nto create an immersive, hassle-free listening experience. Whether you are managing calls on \nthe go, relaxing with your favorite playlist, or seeking comfort during long work sessions, Zonic \nEarbuds Pro are engineered to fit seamlessly into modern lifestyles. \n \nKey Features & Use Cases \n• All-day battery: Spend l

In [3]:
## Count the length of the result using the len function
print(len(pdf_docs_basic))
 
## Print the .page_content of the first Document 
print(pdf_docs_basic[0].page_content)
 
## Print the metadata of the first Document 
print(pdf_docs_basic[0].metadata) 


4
Product Specification Sheet – Zonic Earbuds Pro 
 
Product Introduction 
The Zonic Earbuds Pro represent the next generation of true wireless audio for consumers 
who value performance, reliability, and convenience. Developed for busy professionals and 
tech-savvy users alike, these earbuds integrate premium hardware and sophisticated software 
to create an immersive, hassle-free listening experience. Whether you are managing calls on 
the go, relaxing with your favorite playlist, or seeking comfort during long work sessions, Zonic 
Earbuds Pro are engineered to fit seamlessly into modern lifestyles. 
 
Key Features & Use Cases 
• All-day battery: Spend less time charging and more time connected, with up to 8 hours 
of listening on a single charge and an additional 24 hours provided via the compact 
charging case. 
• Superior sound quality : Advanced 10mm bio -inspired drivers deliver rich bass, clear 
mids, and crisp highs across all music genres and voice calls. 
• Smart connectivi

### Recipe 4.2b: Harnessing PyPDFLoader for a PDF with tables and images

    NOTE: Install "rapidocr-onnxruntime" before running this code

In [5]:
from langchain_community.document_loaders import PyPDFLoader
 
## Specify the path
path = "./rag_resources/product_specification_sheet.pdf"
 
## Instantiate the class with the path

pdf_loader = PyPDFLoader(path, 
                         mode = "page", 
                         extraction_mode = 'layout', 
                         extract_images = True,                         
                         images_inner_format="markdown-img")

## Load into memory
pdf_docs_layout = pdf_loader.load()
 
## Print the first page to see the layout
print(pdf_docs_layout[0].page_content)
 
## Print the second page to see the layout of the table
print(pdf_docs_layout[1].page_content)
 
## Print the last page to see if the images are well parsed
print(pdf_docs_layout[3].page_content) 


Product Specification Sheet – Zonic Earbuds Pro


Product Introduction

The Zonic Earbuds Pro represent the next generation of true wireless audio for consumers
who value performance, reliability, and convenience. Developed for busy professionals and
tech-savvy users alike, these earbuds integrate premium hardware and sophisticated software
to create an immersive, hassle-free listening experience. Whether you are managing calls on
the go, relaxing with your favorite playlist, or seeking comfort during long work sessions, Zonic
Earbuds Pro are engineered to fit seamlessly into modern lifestyles.


Key Features & Use Cases

    •   All-day battery: Spend less time charging and more time connected, with up to 8 hours
        of listening on a single charge and an additional 24 hours provided via the compact
        charging case.

    •   Superior sound quality: Advanced 10mm bio-inspired drivers deliver rich bass, clear
        mids, and crisp highs across all music genres and voice call

### Recipe 4.2c: Loading CSV files

In [7]:
from langchain_community.document_loaders import CSVLoader
 
## Specify the path
path = "./rag_resources/monthly_sales_report_may_2025.csv"
 
## Instantiate the class with the path
csv_loader = CSVLoader(path)
 
## Load into memory
csv_docs = csv_loader.load()


In [8]:
print(len(csv_docs))
 
print("\n ###")            
 
print(csv_docs[0].page_content)
 
print("\n ###")            
 
print(csv_docs[0].metadata) 


372

 ###
date: 1/5/2025
product_ID: ZF-SWP-EXT-2025
product_name: ZonicFit Smartwatch Pro (Extended)
region: North America
units_sold: 105
revenue_USD: 52447.5
cost_of_goods_sold_USD: 26223.75
gross_profit_USD: 26223.75

 ###
{'source': './rag_resources/monthly_sales_report_may_2025.csv', 'row': 0}
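
The output above suggests that each CSV row becomes one document whose text is a list of `column: value` lines, with the file path and row index stored as metadata. The sketch below reproduces that shape with the standard library only. It is an illustration, not the loader's actual implementation; `rows_to_records` and the two-row `sample_csv` are hypothetical names and data introduced here.

```python
import csv
import io

# Hypothetical sample standing in for the real sales report (assumption).
sample_csv = (
    "date,product_ID,units_sold\n"
    "1/5/2025,ZF-SWP-EXT-2025,105\n"
    "2/5/2025,ZF-SWP-EXT-2025,98\n"
)

def rows_to_records(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, mirroring the printed output above.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        records.append(
            {"page_content": page_content,
             "metadata": {"source": source, "row": row_index}}
        )
    return records

records = rows_to_records(sample_csv, "sample.csv")
print(len(records))
print(records[0]["page_content"])
print(records[0]["metadata"])
```

Because every row is a separate document, a 372-row file yields 372 documents, which matches the count printed above.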


### Recipe 4.2d: Loading HTML documents

NOTE: This requires a dependency called "unstructured"

In [9]:
from langchain_community.document_loaders import UnstructuredHTMLLoader
 
## Step 1: Specify the path
path = "./rag_resources/zonic_landing_page.html"
 
## Step 2: Instantiate the class with the path
html_loader = UnstructuredHTMLLoader(path)
 
## Step 3: Load into memory
html_docs = html_loader.load()
 


In [10]:
# Print the number of documents loaded
print(len(html_docs))
 
# Print the first document's content
print(html_docs[0].page_content)
 
# Print the first document's metadata
print(html_docs[0].metadata)


1
Zonic

Experience the Future of Connected Living with Zonic

From smart wearables to intelligent home devices, Zonic empowers your lifestyle with innovative technology designed for simplicity and performance.

Discover Our Products

Our Innovative Zonic Products

ZonicFit Smartwatch Pro

ZonicFit Smartwatch Pro

Track your health, stay connected, and push your limits with advanced sensors and extended battery life.

ZonicSound Smart Speaker

ZonicSound Smart Speaker

Immersive audio and intelligent voice assistant for your home. Experience sound like never before.

ZonicAudio Pro Earbuds

ZonicAudio Pro Earbuds

Premium sound, active noise cancellation, and all-day comfort for your music and calls.

ZonicConnect Home Camera

ZonicConnect Home Camera

Keep an eye on what matters most with smart monitoring, motion detection, and two-way audio.

About Zonic Innovations

At Zonic, we believe in a world where technology enhances daily life seamlessly and intuitively. Founded on principles

### Recipe 4.2e: Batch loading PDFs from a directory 

In [11]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
 
## Step 1: Define the path (a directory)
path = "./rag_resources"
 
## Step 2: Instantiate the directory loader with the path
dir_loader = DirectoryLoader(path, glob="**/*.pdf", 
                             loader_cls = PyPDFLoader,
                             show_progress=True,
                             loader_kwargs = {"mode":"page", 
                                              "extraction_mode":'layout',  
                                              "images_inner_format":"markdown-img"}
                            )
 
 
## Step 3: Invoke the load method to load the files into memory
dir_pdf_docs = dir_loader.load() 
 


100%|██████████| 3/3 [00:00<00:00,  5.30it/s]


### Recipe 4.2f: Loading multiple file types from a directory 

In [14]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader

# Define the directory path
path = "rag_resources"
 
# Instantiate the directory loader for the PDF docs
pdf_loader = DirectoryLoader(
    path,
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    loader_kwargs={
        "mode": "page",
        "extraction_mode": "layout",
        "images_inner_format": "markdown-img"
    },
    show_progress=True
)

# Load the doc into memory
pdf_docs = pdf_loader.load()
for doc in pdf_docs:
    doc.metadata["source_type"] = "pdf"
 
# Instantiate the directory loader for the text docs
txt_loader = DirectoryLoader(
    path,
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True
)

# Load the docs into memory
txt_docs = txt_loader.load()
for doc in txt_docs:
    doc.metadata["source_type"] = "txt"
 
# Combine all documents
all_docs = pdf_docs + txt_docs

print(len(all_docs))


100%|██████████| 3/3 [00:00<00:00,  5.05it/s]
100%|██████████| 2/2 [00:00<00:00, 40.25it/s]

14
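
Once every document carries a `source_type` tag, the combined list can be summarized or filtered by that tag. The sketch below shows one way to do this with the standard library. The `Doc` dataclass is a hypothetical stand-in for LangChain's `Document` (it only mimics the `page_content` and `metadata` attributes), and the sample entries are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes used in this recipe."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Made-up documents tagged the same way as in the recipe above.
all_docs_demo = [
    Doc("spec page 1", {"source_type": "pdf"}),
    Doc("spec page 2", {"source_type": "pdf"}),
    Doc("faq text", {"source_type": "txt"}),
]

# Count documents per source type.
counts = Counter(doc.metadata.get("source_type", "unknown") for doc in all_docs_demo)
print(counts)

# Keep only the text-file documents.
txt_only = [doc for doc in all_docs_demo if doc.metadata.get("source_type") == "txt"]
print(len(txt_only))
```

The same two expressions should work on `all_docs` from the recipe, since they rely only on the `metadata` dictionary.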





### Recipe 4.3. Recursive text splitting

In [15]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
 
## Step 1: Parse the file and load it into memory
path = "./rag_resources/product_specification_sheet.pdf"
pdf_docs = PyPDFLoader(path).load()
 
 
# Step 2: Define a text splitter that splits recursively through the character list
text_splitter = RecursiveCharacterTextSplitter(
    separators = ['\n\n', '\n', ' ', ''],
    chunk_size=200,  
    chunk_overlap=50  
)
 
# Step 3: Split the documents using text_splitter's .split_documents method
doc_chunks = text_splitter.split_documents(pdf_docs)  

print(f"We have a total of {len(doc_chunks)} chunks in our document\n")

# Step 4: Print the first 4 document chunks with their respective chunk size
for i, chunk in enumerate(doc_chunks[:4]):
    print(f"This is chunk #{i+1}:\n{chunk.page_content} \nChunk size = {len(chunk.page_content)}\n") 


We have a total of 27 chunks in our document

This is chunk #1:
Product Specification Sheet – Zonic Earbuds Pro 
 
Product Introduction 
The Zonic Earbuds Pro represent the next generation of true wireless audio for consumers 
Chunk size = 161

This is chunk #2:
who value performance, reliability, and convenience. Developed for busy professionals and 
tech-savvy users alike, these earbuds integrate premium hardware and sophisticated software 
Chunk size = 182

This is chunk #3:
to create an immersive, hassle-free listening experience. Whether you are managing calls on 
the go, relaxing with your favorite playlist, or seeking comfort during long work sessions, Zonic 
Chunk size = 190

This is chunk #4:
Earbuds Pro are engineered to fit seamlessly into modern lifestyles. 
 
Key Features & Use Cases 
• All-day battery: Spend less time charging and more time connected, with up to 8 hours 
Chunk size = 185
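
To make the idea behind the separator list more concrete, here is a simplified pure-Python sketch of recursive splitting: try the coarsest separator first, merge neighboring pieces while they fit, and fall back to the next separator only for pieces that are still too long. It is not the library's implementation and it deliberately leaves out `chunk_overlap`; the function name `recursive_split` and the sample text are assumptions made for illustration.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep merging into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, rest or [""], chunk_size))
    if current:
        chunks.append(current)
    return chunks

sample_text = "aaaa bbbb cccc\n\ndddd eeee ffff gggg"
chunks = recursive_split(sample_text, ["\n\n", "\n", " ", ""], chunk_size=10)
for chunk in chunks:
    print(repr(chunk), len(chunk))
```

In this toy run the paragraph break is tried first, and the spaces are used only because each paragraph exceeds the 10-character limit on its own.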



### Recipe 4.4 A vanilla RAG pipeline integrating document processing, embedding, storage, and retrieval (OpenAI model)

    NOTE: Replace the OPENAI_API_KEY variable with your own API key

##### Steps 1 – 5: Initiate document processing

In [16]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

## Step 1: Define the path (a directory)
path = "./rag_resources"

## Step 2: Instantiate the directory loader with the path
dir_loader = DirectoryLoader(path, glob="**/*.pdf", 
                             loader_cls = PyPDFLoader,
                             show_progress=True,
                             loader_kwargs = {"mode":"page"}
                            )

## Step 3: Invoke the load method to load the files into memory
dir_pdf_docs = dir_loader.load()


## Step 4: Define a text splitter that splits recursively through the character list
text_splitter = RecursiveCharacterTextSplitter(
    separators = ['\n\n', '\n', ' ', ''],
    chunk_size=400,  
    chunk_overlap=20,
    length_function = len
)

## Step 5: Split the documents using text_splitter
doc_chunks = text_splitter.split_documents(dir_pdf_docs)

100%|██████████| 3/3 [00:00<00:00,  4.84it/s]


##### Steps 6 – 7: Define a vector store

In [21]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
 
## Step 6: Specify an embedding model
embedding_model = OpenAIEmbeddings(api_key = OPENAI_API_KEY,
                                   model = "text-embedding-3-small")
 
## Step 7: Create a Chroma instance from the document chunks and the embedding model
vectore_store = Chroma.from_documents(
    embedding = embedding_model,
    documents = doc_chunks
   
) 


##### Step 8: Define the retriever using the vector store

In [22]:
## Step 8: Turn the vector store into a retriever
retriever = vectore_store.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k":2}
)
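
Conceptually, a similarity retriever with `k = 2` scores every stored chunk against the query embedding and returns the two best matches. The sketch below illustrates that idea with cosine similarity over hand-written three-dimensional vectors. It is only a toy model: the vectors, the chunk labels, and the helper names `cosine_similarity` and `top_k` are assumptions, and the metric a real vector store uses depends on its configuration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy index of (chunk text, embedding) pairs; all values are made up.
toy_index = [
    ("battery life", [0.9, 0.1, 0.0]),
    ("warranty terms", [0.0, 0.2, 0.9]),
    ("charging case", [0.8, 0.3, 0.1]),
]

results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(results)
```

A real store avoids scoring every chunk one by one, but the ranking step is the same in spirit.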


##### Steps 9 – 10: Defining a model to generate a response, with the prompt template specifying instructions

In [23]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
  
## Step 9: Define a prompt template
prompt_template = ChatPromptTemplate.from_template(
    """
    Use the following pieces of context to answer the question at the end. 
    Instruction: Stay as close to the context as possible. Don't add your opinion. If the context is not enough to answer the question, state this.
    Context: {context} 
    Question: {question}
    
    """
)
 
## Step 10: Define a chat model for response generation
llm = ChatOpenAI(
    openai_api_key = OPENAI_API_KEY,
    model = "gpt-4o-mini",
    temperature = 0,
)


##### Steps 11-12: Creating the LCEL chain that integrates all into a runnable chain

In [25]:
## Step 11: Define the chain
from langchain_core.runnables import RunnablePassthrough
 
rag_chain = (
    {"question": RunnablePassthrough(), "context": retriever}
    | prompt_template 
    | llm 
   )
 
## Step 12: Invoke the chain with a suitable question
response = rag_chain.invoke("What are the battery life specifications and charging capabilities of the Zonic Earbuds Pro, including quick charge features?")
 
 
## Print the response
print(response.content)


The battery life specifications and charging capabilities of the Zonic Earbuds Pro are as follows:

- Battery Life (Earbuds): 8 hours
- Battery Life (Case): 24 hours (additional)
- Charging Time: 
  - Earbuds: 1.5 hours
  - Case: 2 hours
- Quick Charge: 10 minutes of charging provides 1 hour of playback.
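
In the chain above, the retriever's output is handed to the prompt as it is, so the `{context}` slot receives a list of document objects rather than plain text. A common refinement is to join the `page_content` of the retrieved chunks into a single string first. The sketch below shows such a helper; `format_docs` is a name chosen here, and `Doc` is a hypothetical stand-in for LangChain's `Document`.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in exposing the attributes of a retrieved chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

# Made-up retrieval result for illustration.
retrieved = [
    Doc("Battery Life (Earbuds): 8 hours"),
    Doc("Quick Charge: 10 minutes of charging provides 1 hour of playback."),
]

context_text = format_docs(retrieved)
print(context_text)
```

Assuming the usual LCEL behavior of wrapping plain functions as runnables, the helper could be plugged into the chain as `"context": retriever | format_docs`, which keeps metadata out of the prompt.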
