<div style="text-align: center;">
    <h1 style="color: #FF6347;">Self-Guided Lab: Retrieval-Augmented Generation (RAG)</h1>
</div>

<div style="text-align: center;">
    <img src="https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExZ3FsdzRveTBrenMxM3VnbDMwaTJxN2NnZm50aGFibXk1NzNnY2Q0MCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/LR5ZBwZHv02lmpVoEU/giphy.gif" alt="NLP Gif" style="width: 300px; height: 150px; object-fit: cover; object-position: center;">
</div>

<h1 style="color: #FF6347;">Data Storage & Retrieval</h1>


<h2 style="color: #FF8C00;">PyPDFLoader</h2>

`PyPDFLoader` is a LangChain document loader, built on the `pypdf` package, that loads and parses PDF documents for text processing tasks. It is particularly useful in Retrieval-Augmented Generation workflows where text must be extracted from PDFs.

- **What Does PyPDFLoader Do?**
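  - Extracts the text of each page of a PDF file and returns it as a document with metadata such as the source path and page number. Visual layout (columns, tables) is generally not preserved.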
  - Simplifies the preprocessing of document-based datasets.
  - Supports efficient and scalable loading of large PDF collections.

- **Key Features:**
  - Compatible with popular NLP libraries and frameworks.
  - Handles multi-page PDFs; text inside embedded images requires an additional OCR setup.
  - Provides flexible configurations for structured text extraction.

- **Use Cases:**
  - Preparing PDF documents for retrieval-based systems in RAG pipelines.
  - Automating the text extraction pipeline for document analysis.
  - Creating datasets from academic papers, technical manuals, and reports.


In [2]:
%pip install langchain langchain_community pypdf
%pip install termcolor langchain_openai langchain-huggingface sentence-transformers chromadb langchain_chroma tiktoken openai python-dotenv


Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
# Recent LangChain releases expose these classes from the split-out packages
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter
import warnings
warnings.filterwarnings('ignore')


<h3 style="color: #FF8C00;">Loading the Documents</h3>

In [4]:
# # File path for the document

# file_path = r"../LAB/ai-for-everyone.pdf"

<h3 style="color: #FF8C00;">Documents into pages</h3>

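`PyPDFLoader` loads a PDF and splits it into smaller, manageable parts for NLP tasks. Note that `load()` returns exactly one document per page, whereas `load_and_split()` also runs a default text splitter over the loaded pages, so its result can contain more items than the PDF has pages.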

This functionality is particularly useful in workflows requiring granular text processing, such as Retrieval-Augmented Generation (RAG).


In [5]:
# # Load and split the document
# loader = PyPDFLoader(file_path)
# pages = loader.load_and_split()
# len(pages)

<h3 style="color: #FF8C00;">Pages into Chunks</h3>


####  RecursiveCharacterTextSplitter in LangChain

The `RecursiveCharacterTextSplitter` is the **recommended splitter** in LangChain when you want to break down long documents into smaller, semantically meaningful chunks — especially useful in **RAG pipelines**, where clean context chunks lead to better LLM responses.

####  Parameters

| Parameter       | Description                                                                 |
|-----------------|-----------------------------------------------------------------------------|
| `chunk_size`    | The **maximum number of characters** allowed in a chunk (e.g., `1000`).     |
| `chunk_overlap` | The number of **overlapping characters** between consecutive chunks (e.g., `200`). This helps preserve context continuity. |

####  How it works
`RecursiveCharacterTextSplitter` attempts to split the text **intelligently**, trying the following separators in order:
1. Paragraphs (`"\n\n"`)
2. Lines (`"\n"`)
3. Words (`" "`)
4. Individual characters (`""`, as a last resort)

This makes it well suited to **natural language documents**, such as PDFs, articles, or long reports: paragraphs and lines stay together whenever they fit within `chunk_size`, and the splitter falls back to finer separators only when they do not.
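
The separator fallback described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the `recursive_split` function and the sample text are hypothetical, and chunk overlap is left out to keep the logic visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting (illustration only, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left (or the empty one): fall back to fixed-size slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the first one.\n\n"
    "Three."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

With a 40-character limit, the short paragraphs survive intact and only the long one is broken at a word boundary, which is the behavior the separator order is meant to encourage.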



In [6]:
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=200
# )
# chunks = text_splitter.split_documents(pages)

# len(chunks)

####  Alternative: CharacterTextSplitter

`CharacterTextSplitter` is a simpler splitter: it splits the text on a **single separator** (by default `"\n\n"`) and merges the resulting pieces into chunks of up to `chunk_size` characters, without falling back to finer separators.

##### Example:
```python
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
```

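This method is simpler and more predictable, but it has no fallback: a piece that contains no separator is kept whole even if it is longer than `chunk_size`, and with a finer (or empty) separator it can cut text in the middle of a sentence or paragraph. Either case can hurt performance in downstream tasks like retrieval or QA.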

---

#### Comparison Table

| Feature                        | RecursiveCharacterTextSplitter | CharacterTextSplitter     |
| ------------------------------ | ------------------------------ | ------------------------- |
| Structure-aware splitting      |  Yes                          |  No                      |
| Preserves sentence/paragraphs  |  Yes                          |  No                      |
| Risk of splitting mid-sentence |  Minimal                     |  High                   |
| Ideal for RAG/document QA      |  Highly recommended           |  Only if structured text |
| Performance speed              |  Slightly slower             |  Faster                  |

---

#### Recommendation

Use `RecursiveCharacterTextSplitter` for most real-world document processing tasks, especially when building RAG pipelines or working with structured natural language content like PDFs or articles.

## Best Practices for Choosing Chunk Size in RAG


| Factor                      | Recommendation                                                                                                                                                                                          |
| ---------------------------| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **LLM context limit**       | Choose a chunk size that lets you retrieve multiple chunks **without exceeding the model’s token limit**. For example, GPT-4o supports 128k tokens, but with GPT-3.5 (16k) or GPT-4 (8k, or 32k for the extended variant), keep it modest. |
| **Chunk size (in characters)** | Typically: **500–1,000 characters** per chunk → roughly 125–250 tokens at about 4 characters per token for English text. This fits well for retrieval + prompt without context overflow.                 |
| **Chunk size (in tokens)**  | If using token-based splitter (e.g. `TokenTextSplitter`): aim for **100–300 tokens** per chunk.                                                                                                            |
| **Chunk overlap**           | Use **overlap of 10–30%** (e.g., 100–300 characters or ~50 tokens) to preserve context across chunk boundaries and avoid cutting off important ideas mid-sentence.                                        |
| **Document structure**      | Use **`RecursiveCharacterTextSplitter`** to preserve semantic boundaries (paragraphs, sentences) instead of arbitrary cuts.                                                                                |
| **Task type**               | For **question answering**, smaller chunks (~500–800 chars) reduce noise.<br>For **summarization**, slightly larger chunks (~1000–1500) are OK.                                                          |
| **Embedding model**         | Some models (e.g., `text-embedding-3-large`) can handle long input. But still, smaller chunks give **finer-grained retrieval**, which improves relevance.                                                  |
| **Query type**              | If users ask **very specific questions**, small focused chunks are better. For broader queries, bigger chunks might help.                                                                                  |


### Rule of Thumb

| Use Case                 | Chunk Size      | Overlap |
| ------------------------| --------------- | ------- |
| Factual Q&A              | 500–800 chars   | 100–200 |
| Summarization            | 1000–1500 chars | 200–300 |
| Technical documents      | 400–700 chars   | 100–200 |
| Long reports/books       | 800–1200 chars  | 200–300 |
| Small LLMs (≤16k tokens) | ≤800 chars      | 100–200 |


### Avoid

- Chunks longer than 2,000 characters: retrieving several of them risks exceeding the context window and dilutes relevance.
- No overlap: may lose key information between chunks.
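
The effect of overlap can be seen with a plain sliding window. The sketch below is a deliberately naive fixed-width splitter (the hypothetical `sliding_chunks` helper, not a LangChain API), used only to show how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-width chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Without overlap, a phrase that straddles a boundary (here `"de"`) never appears whole in any chunk; with an overlap of 2 it is contained in the second chunk.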



<h2 style="color: #FF8C00;">Embeddings</h2>

Embeddings transform text into dense vector representations, capturing semantic meaning and contextual relationships. They are essential for efficient document retrieval and similarity analysis.

- **What are OpenAI Embeddings?**
  - Pre-trained embeddings like `text-embedding-3-large` generate high-quality vector representations for text.
  - Encapsulate semantic relationships in the text, enabling robust NLP applications.

- **Key Features of `text-embedding-3-large`:**
  - Large-scale embedding model optimized for accuracy and versatility.
  - Handles diverse NLP tasks, including retrieval, classification, and clustering.
  - Ideal for applications with high-performance requirements.

- **Benefits:**
  - Reduces the need for extensive custom training.
  - Provides state-of-the-art performance in retrieval-augmented systems.
  - Fits naturally into RAG pipelines to build context-aware applications.


In [7]:
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv

In [8]:
# load_dotenv()

In [9]:
# api_key = os.getenv("OPENAI_API_KEY")
# embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

<h2 style="color: #FF8C00;">ChromaDB</h2>

ChromaDB is a versatile vector database designed for efficiently storing and retrieving embeddings. It integrates seamlessly with embedding models to enable high-performance similarity search and context-based retrieval.

### Workflow Overview:
- **Step 1:** Generate embeddings using a pre-trained model (e.g., OpenAI's `text-embedding-3-large`).
- **Step 2:** Store the embeddings in ChromaDB for efficient retrieval and similarity calculations.
- **Step 3:** Use the stored embeddings to perform searches, matching, or context-based retrieval.
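
The three steps can be mimicked with a toy in-memory store. This is only a sketch of the idea, not how ChromaDB works internally: the `ToyVectorStore` class is hypothetical, the 3-dimensional vectors are hand-picked rather than produced by an embedding model, and ranking uses squared Euclidean distance purely for simplicity.

```python
class ToyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        # Step 2: store the embedding next to the text it represents.
        self._items.append((list(vector), text))

    def search(self, query_vector, k=2):
        # Step 3: rank stored vectors by distance to the query vector.
        def squared_distance(vector):
            return sum((a - b) ** 2 for a, b in zip(vector, query_vector))

        ranked = sorted(self._items, key=lambda item: squared_distance(item[0]))
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
# Step 1 would normally call an embedding model; these vectors are made up.
store.add([0.9, 0.1, 0.0], "cats")
store.add([0.8, 0.2, 0.1], "kittens")
store.add([0.0, 0.1, 0.9], "stock markets")

print(store.search([1.0, 0.0, 0.0], k=2))
```

A real vector database replaces the brute-force sort with an approximate nearest-neighbor index so that search stays fast on large collections.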

### Key Features of ChromaDB:
- **Scalability:** Handles large-scale datasets with optimized indexing and search capabilities.
- **Speed:** Provides fast and accurate retrieval of embeddings for real-time applications.
- **Integration:** Supports integration with popular frameworks and libraries for embedding generation.

In [10]:
from langchain_chroma import Chroma

In [11]:
# db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db_LAB")
# print("ChromaDB created with document embeddings.")

<h1 style="color: #FF6347;">Retrieving Documents</h1>


### Exercise 1: Write a user question that someone might ask about your book’s topic or content.

In [12]:
# user_question = "How to increase accuracy in the prediction?" # User question
# retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve

In [13]:
# # Display top results
# for i, doc in enumerate(retrieved_docs[:3]): # Display top 3 results
#     print(f"Document {i+1}:\n{doc.page_content[36:1000]}") # Display content

<h2 style="color: #FF8C00;">Preparing Content for GenAI</h2>

In [14]:
# def _get_document_prompt(docs):
#     prompt = "\n"
#     for doc in docs:
#         prompt += "\nContent:\n"
#         prompt += doc.page_content + "\n\n"
#     return prompt

In [15]:
# # Generate a formatted context from the retrieved documents
# formatted_context = _get_document_prompt(retrieved_docs)
# print("Context formatted for GPT model.")
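
The prompt used later in this lab asks the model to cite book names and page numbers, but `_get_document_prompt` passes only `page_content`. One possible variant, sketched below, also includes each document's metadata. It assumes the `source` and `page` metadata keys that `PyPDFLoader` attaches to each document; the `Doc` dataclass is a hypothetical stand-in so the snippet runs without LangChain, and `format_context_with_sources` is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context_with_sources(docs):
    """Build a context string that keeps the source and page of every chunk."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        # The page value is reported as stored; it is typically a 0-based index.
        page_label = f", page index {page}" if page is not None else ""
        parts.append(f"Source: {source}{page_label}\nContent:\n{doc.page_content}")
    return "\n\n".join(parts)


docs = [
    Doc("Flood peaks are ranked by magnitude.", {"source": "book.pdf", "page": 3}),
    Doc("A chunk without metadata."),
]
print(format_context_with_sources(docs))
```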

<h2 style="color: #FF8C00;">ChatBot Architecture</h2>

### Exercise 2: Write a prompt that is relevant and tailored to the content and style of your book.

In [16]:
# prompt = f"""
# ## SYSTEM ROLE
# You are a knowledgeable and factual chatbot designed to assist with technical questions about **AI**, specifically focusing on **Accuracy**.
# Your answers must be based exclusively on the content provided from technical books.

# ## USER QUESTION
# The user has asked:
# "{user_question}"

# ## CONTEXT
# Here is the relevant content from the technical books:
# '''
# {formatted_context}
# '''

# ## GUIDELINES
# 1. **Accuracy**:
#    - Only use the content in the `CONTEXT` section to answer.
#    - If the answer cannot be found, explicitly state: "The provided context does not contain this information."
#    - Start by explaining machine learning and then prediction in bullet points (charts, graphs, background, and other aspects to consider)
#    - Follow with a differential diagnosis
#    - Lastly explain the values for performance interpretation.

# 2. **Transparency**:
#    - Reference the book's name and page numbers when providing information.
#    - Do not speculate or provide opinions.

# 3. **Clarity**:
#    - Use simple, professional, and concise language.
#    - Format your response in Markdown for readability.

# ## TASK
# 1. Answer the user's question **directly** if possible.
# 2. Point the user to relevant parts of the documentation.
# 3. Provide the response in the following format:

# ## RESPONSE FORMAT
# '''
# # [Brief Title of the Answer]
# [Answer in simple, clear text.]

# **Source**:
# • [Book Title], Page(s): [...]
# '''
# """
# print("Prompt constructed.")


In [17]:
import openai

### Exercise 3: Tune parameters such as temperature and the penalties to control how creative, focused, or varied the model's responses are.

In [18]:
# # Set up GPT client and parameters
# client = openai.OpenAI()
# model_params = {
#     'model': 'gpt-4o',
#     'temperature': 0.9,  # Increase creativity
#     'max_tokens': 4000,  # Allow for longer responses
#     'top_p': 0.9,        # Use nucleus sampling
#     'frequency_penalty': 0.5,  # Reduce repetition
#     'presence_penalty': 0.6    # Encourage new topics
# }
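
To build intuition for `temperature` and `top_p`, the following sketch applies both to a made-up set of token scores. It illustrates the general technique (temperature-scaled softmax followed by nucleus filtering), not the internals of any particular model; the logits and function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized probabilities


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.9, 1.5):
    rounded = [round(p, 3) for p in softmax_with_temperature(logits, t)]
    print(f"temperature={t}: {rounded}")

print(nucleus_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

Low temperatures concentrate probability on the top token, which gives focused and repeatable answers; higher values flatten the distribution. `top_p` then discards the unlikely tail before sampling.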

<h1 style="color: #FF6347;">Response</h1>


In [19]:
# messages = [{'role': 'user', 'content': prompt}]
# completion = client.chat.completions.create(messages=messages, **model_params, timeout=120)

In [20]:
# answer = completion.choices[0].message.content
# print(answer)

<img src="https://miro.medium.com/v2/resize:fit:824/1*GK56xmDIWtNQAD_jnBIt2g.png" alt="NLP Gif" style="width: 500px">

<h2 style="color: #FF6347;">Cosine Similarity</h2>

**Cosine similarity** measures how closely two vectors point in the same direction, calculated as the cosine of the angle between them. It is one of the **most common metrics used in RAG pipelines** for vector retrieval. Its values range from -1 to 1:

- **-1**: Vectors are completely opposite.
- **0**: Vectors are orthogonal (uncorrelated or unrelated).
- **1**: Vectors point in exactly the same direction (their magnitudes may differ).
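
For two vectors $a$ and $b$, the value is $\cos(\theta) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. A minimal pure-Python sketch of that formula follows; the `cosine_similarity` helper and the 2-dimensional vectors are illustrative, whereas real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [2, 0]))   # same direction, different length
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-3, 0]))  # opposite direction
```

Because the norms divide out, scaling a vector does not change the result: only direction matters.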


<img src="https://storage.googleapis.com/lds-media/images/cosine-similarity-vectors.original.jpg" alt="Cosine similarity between vectors" style="width: 700px">

<h2 style="color: #FF6347;">Keyword Highlighting</h2>

Highlighting important keywords helps users quickly understand the relevance of the retrieved text to their query.

In [21]:
from termcolor import colored

The `highlight_keywords` function is designed to highlight specific keywords within a given text. It replaces each keyword in the text with a highlighted version using the `colored` function from the `termcolor` library.


In [22]:
# def highlight_keywords(text, keywords):
#     for keyword in keywords:
#         text = text.replace(keyword, colored(keyword, 'green', attrs=['bold']))
#     return text

### Exercise 4: Add your keywords

In [23]:
# query_keywords = ["statistic","chart","algorithm"] # add your keywords
# for i, doc in enumerate(retrieved_docs[:1]):
#     snippet = doc.page_content[:200]
#     highlighted = highlight_keywords(snippet, query_keywords)
#     print(f"Snippet {i+1}:\n{highlighted}\n{'-'*80}")

1. `query_keywords` is a list of keywords to be highlighted.
2. The loop iterates over the first document in `retrieved_docs`.
3. For each document, a snippet of the first 200 characters is extracted.
4. The `highlight_keywords` function is called to highlight the keywords in the snippet.
5. The highlighted snippet is printed along with a separator line.
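
One limitation of the `str.replace` approach is that it is case-sensitive, so the keyword `"algorithm"` does not match `"Algorithm"` at the start of a sentence. A possible case-insensitive variant is sketched below. To stay self-contained it writes ANSI escape codes directly instead of calling `termcolor`; the function name `highlight_keywords_ci` is hypothetical.

```python
import re

GREEN_BOLD = "\033[1;32m"  # ANSI escape sequence: bold green
RESET = "\033[0m"          # ANSI escape sequence: reset styling


def highlight_keywords_ci(text, keywords):
    """Highlight every keyword occurrence, ignoring case."""
    keywords = [k for k in keywords if k]
    if not keywords:
        return text
    # Longest first, so a longer keyword wins over a shorter one it contains.
    alternatives = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    pattern = re.compile("|".join(alternatives), flags=re.IGNORECASE)
    # Reuse the matched text so the original capitalization is preserved.
    return pattern.sub(lambda m: f"{GREEN_BOLD}{m.group(0)}{RESET}", text)


print(highlight_keywords_ci("Algorithm choice affects each chart.", ["algorithm", "chart"]))
```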

<h1 style="color: #FF6347;">Bonus</h1>

**Try loading one of your own PDF books and go through the steps again to explore how the pipeline works with your content**:


In [24]:
# File path for the document
file_path = r"C:\Users\Nekky Lung\Desktop\IronHack AI Course\2_FT_July2025\week7\day5\LAB\H.M._Raghunath_Hydrology_Principles_Analysis.pdf"

# Load and split the document
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()
len(pages)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(pages)

len(chunks)

1119

In [25]:
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db_lesson")
print("ChromaDB created with document embeddings.")

user_question = "How I calculate Flood Frequency?" # User question
retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve

# Display top results
for i, doc in enumerate(retrieved_docs[:3]): # Display top 3 results
    print(f"Document {i+1}:\n{doc.page_content[36:1000]}") # Display content

ChromaDB created with document embeddings.
Document 1:
tive frequency curve
Annual No. of Cumulative Probability
flood peak occurrences occurrences =×FHG IKJ
CF
f 100Σ
%
C.I. or or
(1000 cumec) frequency, f frequency, CF
0-2* 0 87 100
2-4* 17 87 100
4-6 27 70 80.5
6-8 18 43 49.5
8-10 18 25 28.8
10-12 3 7 8.05
12-14 0 4 4.6
14-16 2 4 4.6
16-18 1 2 2.3
18-20 1 1 1.15
Σf = 87
*0-<2.
2- <4, and like that.
(a) Partial duration series. There are 175 flood exceedances (above Qb) during 87
years. Average number of exceedances per year.
λ = 175
87  = 2.01
73. 1957 4548 3.6579
74. 1958 4056 3.6081
75. 1959 4493 3.6525
76. 1960 3884 3.5893
77. 1961 4855 3.6861
78. 1962 5760 3.7604
79. 1963 9192 3.9634
80. 1964 3024 3.4806
81. 1965 2509 3.3994
82. 1966 4741 4.6759
83. 1967 5919 3.7725
84. 1968 3798 3.5795
85. 1969 4546 3.6577
86. 1970 3842 3.5845
87. 1971 4542 3.6573
Document 2:
interval of flood (T-yr)
5 10 50 100 200
1 20 10 2 1 0.5
56 7 41 10 5 2
10 89 65 18 10 5
25 99.6 93 40 22 12
50 — 99.5 6

In [26]:
def _get_document_prompt(docs):
    prompt = "\n"
    for doc in docs:
        prompt += "\nContent:\n"
        prompt += doc.page_content + "\n\n"
    return prompt

# Generate a formatted context from the retrieved documents
formatted_context = _get_document_prompt(retrieved_docs)
print("Context formatted for GPT model.")

Context formatted for GPT model.


In [27]:
prompt = f"""
## SYSTEM ROLE
You are a knowledgeable and factual chatbot designed to assist with technical questions about **Hydrology**, specifically focusing on **flooding**.
Your answers must be based exclusively on the content provided from technical books.

## USER QUESTION
The user has asked:
"{user_question}"

## CONTEXT
Here is the relevant content from the technical books:
'''
{formatted_context}
'''

## GUIDELINES
1. **Accuracy**:
   - Only use the content in the `CONTEXT` section to answer.
   - If the answer cannot be found, explicitly state: "The provided context does not contain this information."
   - Start by explaining hydrology and then the statistics in bullet points (calculation, values, background, graph, and other aspects to consider)
   - Follow with a differential diagnosis
   - Lastly explain the values for interpretation.

2. **Transparency**:
   - Reference the book's name and page numbers when providing information.
   - Do not speculate or provide opinions.

3. **Clarity**:
   - Use simple, professional, and concise language.
   - Format your response in Markdown for readability.

## TASK
1. Answer the user's question **directly** if possible.
2. Point the user to relevant parts of the documentation.
3. Provide the response in the following format:

## RESPONSE FORMAT
'''
# [Brief Title of the Answer]
[Answer in simple, clear text.]

**Source**:
• [Book Title], Page(s): [...]
'''
"""
print("Prompt constructed.")

Prompt constructed.


In [28]:
import openai

# Set up GPT client and parameters
client = openai.OpenAI()
model_params = {
    'model': 'gpt-4o',
    'temperature': 0.7,  # Increase creativity
    'max_tokens': 4000,  # Allow for longer responses
    'top_p': 0.9,        # Use nucleus sampling
    'frequency_penalty': 0.5,  # Reduce repetition
    'presence_penalty': 0.6    # Encourage new topics
}

In [29]:
messages = [{'role': 'user', 'content': prompt}]
completion = client.chat.completions.create(messages=messages, **model_params, timeout=120)

answer = completion.choices[0].message.content
print(answer)

'''
# Calculating Flood Frequency

To calculate flood frequency, you can use statistical methods to estimate how often floods of a certain magnitude are expected to occur. Here's a step-by-step guide based on the provided context:

- **Hydrology Background**:
  - Flood frequency analysis is used to predict the probability of flood events over time.
  - It involves compiling and analyzing historical data of flood peaks.

- **Statistical Calculation**:
  - Arrange stream flow peaks in descending order by magnitude; this forms the basis for statistical analysis.
  - Compute recurrence intervals using stochastic methods. The recurrence interval (T-year) indicates the average time between floods of a given size.
  - Use formulas like Gumbel's or Weibull's method, which involve plotting on specialized paper (e.g., semi-log or log-log paper).

- **Values and Graphs**:
  - The cumulative frequency curve is computed with tables like Table 15.6, which lists annual flood occurrences and probabili