# **How to Build a RAG Application - PDF Method**
---
A demo of how to build a retrieval-augmented generation (RAG) pipeline from a PDF file.

## **Library Installation**
Install the required libraries.

In [14]:
%pip install --quiet -U langchain ## LLM library
%pip install --quiet -U chromadb ## Vector Storage
# %pip install --quiet -U langchain-chroma ## LLM Vector Storage
%pip install --quiet -U pypdf ## Loading PDFs
%pip install --quiet -U pytest ## Unit testing
%pip install --quiet -U langchain-community ## LLM Community Library
%pip install --quiet -U langchain-ollama ## LLM Ollama Library

Note: you may need to restart the kernel to use updated packages.


Import some libraries

In [17]:
import argparse
import os
import shutil
from IPython.display import display, Markdown
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain_ollama import OllamaEmbeddings
from langchain.vectorstores.chroma import Chroma
# from langchain_chroma import Chroma
from langchain.prompts import ChatPromptTemplate
# from langchain_community.llms.ollama import Ollama
from langchain_ollama import OllamaLLM

Define variables

In [None]:
# PDF_PATH = "./data/pdf/Maestro_Policy_Engine_25.40.00_Configuration_Guide.pdf"
# CHROMA_PATH = "./chroma-database/mpe-db"
# PDF_PATH = "./data/pdf/Monopoly Manual 2007.pdf"
# CHROMA_PATH = "./chroma-database/monopoly-db"
PDF_PATH = "./data/pdf/mysql-tutorial-excerpt-8.0-en.a4.pdf"
CHROMA_PATH = "./chroma-database/mysql-db"
# PDF_PATH = "./data/pdf/mysql-security-excerpt-8.0-en.pdf"
# CHROMA_PATH = "./chroma-database/mysql-security-db"

## **Loading PDF Data**
Load the PDF. `PyPDFLoader` returns one document per page.

In [19]:

loader = PyPDFLoader(PDF_PATH)
pages = loader.load()
# print(pages)

Split the pages into overlapping chunks

In [20]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    length_function=len,
    is_separator_regex=False
)
chunks = text_splitter.split_documents(pages)
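
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and lines); the hypothetical `sliding_window_chunks` helper only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 80) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters.

    Illustration only: a plain sliding window, not the recursive separator logic.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.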

Prepare the embedding model

In [22]:
embeddings = OllamaEmbeddings(model="nomic-embed-text")

**(OPTIONAL) Clear Database**

Set `IS_DB_CLEARED` to `True` to delete the existing database directory before rebuilding it.

In [6]:
IS_DB_CLEARED = False
if IS_DB_CLEARED:
    # Remove the persisted Chroma directory so the DB is rebuilt from scratch
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

## **Preparing Chroma**

In [23]:
db = Chroma(
    persist_directory=CHROMA_PATH,
    embedding_function=embeddings
)

Calculate chunk IDs in the form `source:page:chunk_index`

In [25]:
last_page_id = None
current_chunk_index = 0

# Build an ID of the form "source:page:chunk_index" for every chunk
print(f"Processing {len(chunks)} chunks")
for chunk in chunks:
    source = chunk.metadata.get("source")
    page = chunk.metadata.get("page")
    current_page_id = f"{source}:{page}"
    # print(f"=== Processing {current_page_id} ===")

    # If the page ID is the same as the last one, increment the index;
    # otherwise this is the first chunk of a new page, so restart at 0
    # print(f"Last page ID: {last_page_id} | Current page ID: {current_page_id}")

    if current_page_id == last_page_id:
        current_chunk_index += 1
    else:
        current_chunk_index = 0
    # print(f"Chunk index: {current_chunk_index}")

    # Calculate the chunk ID
    chunk_id = f"{current_page_id}:{current_chunk_index}"
    last_page_id = current_page_id
    # print(f"Chunk ID: {chunk_id}")

    # Store the ID in the chunk metadata
    chunk.metadata["id"] = chunk_id

Processing 1958 chunks
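
As a small worked example of the ID scheme, the sketch below applies the same counting rule to plain metadata dictionaries instead of LangChain documents. The `assign_chunk_ids` helper and the `manual.pdf` file name are hypothetical and exist only for illustration.

```python
def assign_chunk_ids(metadatas: list[dict]) -> list[str]:
    """Return 'source:page:index' IDs, restarting the index on every new page."""
    ids = []
    last_page_id = None
    index = 0
    for meta in metadatas:
        page_id = f"{meta.get('source')}:{meta.get('page')}"
        # Same page as the previous chunk: count up; new page: start again at 0
        index = index + 1 if page_id == last_page_id else 0
        last_page_id = page_id
        ids.append(f"{page_id}:{index}")
    return ids


example = [
    {"source": "manual.pdf", "page": 0},
    {"source": "manual.pdf", "page": 0},
    {"source": "manual.pdf", "page": 1},
]
example_ids = assign_chunk_ids(example)
print(example_ids)  # ['manual.pdf:0:0', 'manual.pdf:0:1', 'manual.pdf:1:0']
```

Because the IDs depend only on the file path, page number, and chunk position, re-running the notebook on the same PDF with the same splitter settings should produce the same IDs, which is what makes the later duplicate check possible.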


Fetch the IDs of the documents already stored in the DB

In [28]:
existing_items = db.get(include=[]) # IDs are always included by default
existing_ids = set(existing_items["ids"])
print(f"Number of existing documents in DB: {len(existing_ids)}")

Number of existing documents in DB: 1958


Only add documents that don't already exist in the DB.

In [27]:
new_chunks = []
for chunk in chunks:
    if chunk.metadata["id"] not in existing_ids:
        new_chunks.append(chunk)

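if len(new_chunks) > 0:
    print(f"Adding new documents to DB: {len(new_chunks)}")
    # Use a separate name so the original `chunks` list is not overwritten
    new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
    db.add_documents(new_chunks, ids=new_chunk_ids)
    # Recent Chroma versions persist automatically, so this call may be unnecessary
    db.persist()
else:
    print("No new documents to add to DB")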

Adding new documents to DB: 1958
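
The filtering step can also be written as a small reusable function. The sketch below works on ID strings only, and the `select_new_ids` name is hypothetical. As an added assumption beyond the notebook's logic, it also skips IDs that repeat within the same batch.

```python
def select_new_ids(candidate_ids: list[str], existing_ids: set[str]) -> list[str]:
    """Keep IDs that are neither stored already nor repeated earlier in the batch."""
    seen = set(existing_ids)  # copy, so the caller's set is left untouched
    fresh = []
    for chunk_id in candidate_ids:
        if chunk_id not in seen:
            seen.add(chunk_id)
            fresh.append(chunk_id)
    return fresh


stored = {"a.pdf:0:0"}
incoming = ["a.pdf:0:0", "a.pdf:0:1", "a.pdf:0:1", "a.pdf:1:0"]
fresh_ids = select_new_ids(incoming, stored)
print(fresh_ids)  # ['a.pdf:0:1', 'a.pdf:1:0']
```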




## **Query Data**
Prepare the question and the prompt template, then retrieve the five most similar chunks from the DB

In [None]:

query_text = """
Imagine you are Database Engineer who has been given several tasks as below:
- Create a new table to store information of the company staffs in MySQL database server. It will contain the "First Name", "Last Name", "Age", "Contact Number", and "Email".
- Insert a few records in table that just created.
- Select all record in table that just inserted.
You are requested to create a document to share with your team member.
Please provide some examples of the SQL query.\n
Could you please provide the steps in bullet points.\n
Sample as below:\n
Step 1: ...\n
Step 2: ...\n
Step 3: ...\n
"""
# query_text = """
# Imagine you are Securify Officer who will propose security practises for items as below:
# - What is security guideline for MySQL database server?
# - How to keep password secured?
# You are requested to create a document to share with your team member.
# Could you please provide the steps in bullet points.\n
# Sample as below:\n
# Step 1: ...\n
# Step 2: ...\n
# Step 3: ...\n
# """
PROMPT_TEMPLATE = """
1. If not sure, say "I don't know".
2. Answer the question based only on the following context:

Context: {context}

---

Answer the question based on the above context: {question}
"""
results = db.similarity_search_with_score(query_text, k=5)
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
print(f"Context: \n{context_text}")
print(f"Question: \n{query_text}")
prompt = prompt_template.format(context=context_text, question=query_text)

Context: 
2.2.1 End-User Guidelines for Password Security
MySQL users should use the following guidelines to keep passwords secure.
When you run a client program to connect to the MySQL server, it is inadvisable to specify your password
in a way that exposes it to discovery by other users. The methods you can use to specify your password
when you run client programs are listed here, along with an assessment of the risks of each method.
In short, the safest methods are to have the client program prompt for the password or to specify the
password in a properly protected option file.
• Use the mysql_config_editor utility, which enables you to store authentication credentials in an
encrypted login path file named .mylogin.cnf. The file can be read later by MySQL client programs to

---

Security Guidelines
Review the MySQL installation instructions, paying particular attention to the information about setting
a root password. See Section 3.4, “Securing the Initial MySQL Account”.
• Use the

Separate the reasoning from the final answer when a DeepSeek-R1 model is used

In [30]:
def extract_think_content(response_text):
    """
    Parses and separates content wrapped in XML-style "think" tags
    from the final response.
    """
    start_tag = "<think>"
    end_tag = "</think>"

    start_pos = response_text.find(start_tag)
    end_pos = response_text.find(end_tag)

    # find() returns -1 when a tag is missing, so check before adding any offset
    if start_pos != -1 and end_pos > start_pos:
        start_index = start_pos + len(start_tag)
        reasoning_content = response_text[start_index:end_pos].strip()
        final_response = response_text[end_pos + len(end_tag):].strip()
        return reasoning_content, final_response
    else:
        return None, response_text
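
An alternative way to do the same split is with a regular expression. The sketch below is a self-contained variant under the assumption that the response contains at most one `<think>...</think>` block; the `split_think_block` name is hypothetical.

```python
import re

# DOTALL lets "." match newlines, since the reasoning usually spans several lines
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_block(response_text: str) -> tuple:
    """Return (reasoning, final_answer); reasoning is None when no think block exists."""
    match = THINK_PATTERN.search(response_text)
    if match is None:
        return None, response_text
    reasoning = match.group(1).strip()
    final_answer = response_text[match.end():].strip()
    return reasoning, final_answer


with_reasoning = split_think_block("<think>\nCheck the context first.\n</think>\n\nStep 1: ...")
without_reasoning = split_think_block("Step 1: ...")
print(with_reasoning)     # ('Check the context first.', 'Step 1: ...')
print(without_reasoning)  # (None, 'Step 1: ...')
```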

Prepare the model and run the prompt

In [31]:
# MODEL_NAME = "deepseek-r1:1.5b"
# MODEL_NAME = "deepseek-r1:8b"
MODEL_NAME = "llama3.2:3b"
# MODEL_NAME = "tinyllama"
# MODEL_NAME = "tinydolphin"
# MODEL_NAME = "phi3"

model = OllamaLLM(model=MODEL_NAME)
response_text = model.invoke(prompt)

sources = [doc.metadata.get("id", None) for doc, _score in results]
if MODEL_NAME.startswith("deepseek-r1"):
    reasoning_content, final_response = extract_think_content(response_text)
    formatted_response = f"Query:\n\n{query_text}\n\nResponse:\n\n{final_response}\n\nReasoning:\n\n{reasoning_content}\n\nSources: {sources}"
else:
    formatted_response = f"Query:\n\n{query_text}\n\nResponse:\n\n{response_text}\n\nSources: {sources}"
# print(formatted_response)
display(Markdown(formatted_response))

Query:


Imagine you are Securify Officer who will propose security practises for items as below:
- What is security guideline for MySQL database server?
- How to keep password secured?
You are requested to create a document to share with your team member.
Could you please provide the steps in bullet points.

Sample as below:

Step 1: ...

Step 2: ...

Step 3: ...



Response:

I'd be happy to help you with that. Here is a sample document proposing security practices for MySQL database server:

**Security Guidelines for MySQL Database Server**

As a Securify Officer, it's essential to follow these guidelines to ensure the security of our MySQL database server.

**Step 1: Review and Understand Security Guidelines**

* Review the MySQL installation instructions, paying particular attention to setting a root password (Section 3.4, "Securing the Initial MySQL Account").
* Familiarize yourself with the End-User Guidelines for Password Security (Section 2.2.1) and security guidelines in Chapter 1, Chapter 2, Chapter 3, Chapter 4, Chapter 6, and other relevant sections.

**Step 2: Secure Passwords**

* Use a properly protected option file to store MySQL credentials.
* Consider using the mysql_config_editor utility to store authentication credentials in an encrypted login path file named .mylogin.cnf.
* Never hardcode passwords in scripts or configuration files. Instead, use environment variables or command-line arguments to pass passwords securely.

**Step 3: Manage Privileges and Access Control**

* Use the SHOW GRANTS statement to check which accounts have access to what privileges.
* Regularly review and revoke unnecessary privileges using the REVOKE statement (Section 2.2.1).
* Ensure that users only have the necessary permissions to perform their tasks.

**Step 4: Protect Against SQL Injections and Data Corruption**

* Regularly update MySQL and its plugins to ensure you have the latest security patches.
* Use prepared statements and parameterized queries to prevent SQL injections.
* Implement data validation and sanitization to prevent data corruption.

**Step 5: Network Security and Access Control**

* Configure MySQL to only listen on localhost or a limited set of other hosts (Section Chapter 6, Security Components and Plugins).
* Use firewalls to restrict access to the MySQL server.
* Regularly monitor network traffic for suspicious activity.

**Step 6: Regular Backups and Maintenance**

* Regularly back up database files, configuration, and log files to prevent data loss in case of an attack or other disaster.
* Perform regular maintenance tasks, such as updating MySQL plugins and running security audits.

By following these guidelines, we can ensure the security of our MySQL database server and protect against various threats.

Sources: ['./data/pdf/mysql-security-excerpt-8.0-en.pdf:12:1', './data/pdf/mysql-security-excerpt-8.0-en.pdf:11:0', './data/pdf/mysql-security-excerpt-8.0-en.pdf:494:1', './data/pdf/mysql-security-excerpt-8.0-en.pdf:8:0', './data/pdf/mysql-security-excerpt-8.0-en.pdf:8:1']
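
The source IDs above follow the `source:page:chunk_index` format built earlier, so they can be parsed back into a file and page for display. The sketch below is one way to do that; the `parse_chunk_id` and `pages_cited` helpers are hypothetical. Splitting from the right keeps any colon inside the file path intact.

```python
def parse_chunk_id(chunk_id: str) -> tuple:
    """Split 'source:page:index' from the right into (source, page, index)."""
    source, page, index = chunk_id.rsplit(":", 2)
    return source, int(page), int(index)


def pages_cited(chunk_ids: list[str]) -> list[tuple]:
    """Return the unique (source, page) pairs behind a list of chunk IDs, sorted."""
    return sorted({parse_chunk_id(chunk_id)[:2] for chunk_id in chunk_ids})


source_ids = [
    "./data/pdf/mysql-security-excerpt-8.0-en.pdf:12:1",
    "./data/pdf/mysql-security-excerpt-8.0-en.pdf:11:0",
    "./data/pdf/mysql-security-excerpt-8.0-en.pdf:494:1",
    "./data/pdf/mysql-security-excerpt-8.0-en.pdf:8:0",
    "./data/pdf/mysql-security-excerpt-8.0-en.pdf:8:1",
]
cited = pages_cited(source_ids)
print([page for _source, page in cited])  # [8, 11, 12, 494]
```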