# **How to Build a RAG Application - PDF Method**
---
This notebook demonstrates how to build a retrieval-augmented generation (RAG) pipeline from a PDF file.

## **Library Installation**
Install the required libraries.

In [1]:
%pip install --quiet -U langchain ## LLM library
%pip install --quiet -U chromadb ## Vector storage
# %pip install --quiet -U langchain-chroma ## LLM vector storage
%pip install --quiet -U pypdf ## Loading PDFs
%pip install --quiet -U pytest ## Unit testing
%pip install --quiet -U langchain-community ## LLM community library
%pip install --quiet -U langchain-ollama ## LLM Ollama library

Note: you may need to restart the kernel to use updated packages.


Import some libraries

In [2]:
import argparse
import os
import shutil
from IPython.display import display, Markdown
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings
# langchain.vectorstores is deprecated; the community package provides the same class
from langchain_community.vectorstores import Chroma
# from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
# from langchain_community.llms.ollama import Ollama
from langchain_ollama import OllamaLLM

Define variables

In [3]:
# PDF_PATH = "./data/pdf/Maestro_Policy_Engine_25.40.00_Configuration_Guide.pdf"
# CHROMA_PATH = "./chroma-database/mpe-db"
# PDF_PATH = "./data/pdf/Monopoly Manual 2007.pdf"
# CHROMA_PATH = "./chroma-database/monopoly-db"
# PDF_PATH = "./data/pdf/mysql-tutorial-excerpt-8.0-en.a4.pdf"
# CHROMA_PATH = "./chroma-database/mysql-db"
PDF_PATH = "./data/pdf/mysql-security-excerpt-8.0-en.pdf"
CHROMA_PATH = "./chroma-database/all-mysql-db"

## **Loading PDF Data**
Load the PDF with `PyPDFLoader`, which returns one `Document` per page.

In [4]:

loader = PyPDFLoader(PDF_PATH)
pages = loader.load()
# print(pages)

Split the pages into chunks of up to 800 characters, with up to 80 characters of overlap between neighbouring chunks.

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    length_function=len,
    is_separator_regex=False
)
chunks = text_splitter.split_documents(pages)
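To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return pieces


# 2000 characters of filler text, split with the same settings as the cell above
sample = "".join(chr(97 + i % 26) for i in range(2000))
pieces = sliding_window_chunks(sample, chunk_size=800, chunk_overlap=80)
print([len(p) for p in pieces])
```

The last 80 characters of each window reappear at the start of the next one, so a sentence that straddles a boundary is still fully contained in at least one chunk.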

Prepare the embedding model. This assumes the `nomic-embed-text` model is available in the local Ollama instance.

In [6]:
embeddings = OllamaEmbeddings(model="nomic-embed-text")

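**(OPTIONAL) Clear Database**
Delete the persisted database directory if a fresh index is required.

In [7]:
IS_DB_CLEARED = False  # set to True to delete the persisted database before indexing
if IS_DB_CLEARED:
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)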
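The same clean-up can be wrapped in a small helper that reports whether anything was deleted. This is a sketch only: `clear_database` is a hypothetical name, and the demonstration runs against a throwaway temporary directory rather than the real `CHROMA_PATH`.

```python
import shutil
import tempfile
from pathlib import Path


def clear_database(path: str) -> bool:
    """Delete the directory at `path` if it exists; return True when something was removed."""
    target = Path(path)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Demonstration on a scratch directory that is cleaned up automatically
with tempfile.TemporaryDirectory() as scratch_root:
    demo_path = Path(scratch_root) / "demo-db"
    demo_path.mkdir()
    removed_first = clear_database(str(demo_path))   # directory existed
    removed_second = clear_database(str(demo_path))  # nothing left to remove
    still_exists = demo_path.exists()
```

Returning a boolean makes it easy to log whether the next run starts from an empty index.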

## **Preparing Chroma**

In [8]:
db = Chroma(
    persist_directory=CHROMA_PATH,
    embedding_function=embeddings
)


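Calculate chunk IDs. Each chunk gets a deterministic ID of the form `source:page:chunk_index`, so re-running the notebook can detect chunks that are already stored.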

In [9]:
last_page_id = None
current_chunk_index = 0

# Build an ID of the form "source:page:chunk_index" for every chunk
print(f"Processing {len(chunks)} chunks")
for chunk in chunks:
    source = chunk.metadata.get("source")
    page = chunk.metadata.get("page")
    current_page_id = f"{source}:{page}"

    # Chunks from the same page get increasing indices; a new page resets the index
    if current_page_id == last_page_id:
        current_chunk_index += 1
    else:
        current_chunk_index = 0

    # Calculate the chunk ID
    chunk_id = f"{current_page_id}:{current_chunk_index}"
    last_page_id = current_page_id

    # Store it in the chunk metadata
    chunk.metadata["id"] = chunk_id

Processing 1958 chunks
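The ID scheme can be checked in isolation on plain metadata dictionaries. The sketch below mirrors the loop above; `assign_chunk_ids` and the sample file name `a.pdf` are hypothetical and used only for illustration.

```python
def assign_chunk_ids(metadatas: list[dict]) -> list[str]:
    """Write a "source:page:chunk_index" ID into each metadata dict and return the IDs."""
    last_page_id = None
    chunk_index = 0
    ids = []
    for meta in metadatas:
        page_id = f"{meta.get('source')}:{meta.get('page')}"
        # Same page as the previous chunk: count up; new page: start again at 0
        chunk_index = chunk_index + 1 if page_id == last_page_id else 0
        last_page_id = page_id
        meta["id"] = f"{page_id}:{chunk_index}"
        ids.append(meta["id"])
    return ids


# Two chunks from page 0 followed by one chunk from page 1
sample_metadata = [
    {"source": "a.pdf", "page": 0},
    {"source": "a.pdf", "page": 0},
    {"source": "a.pdf", "page": 1},
]
sample_ids = assign_chunk_ids(sample_metadata)
print(sample_ids)
```

The scheme relies on chunks arriving in page order, which is the order in which the splitter emits them here.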


Fetch the IDs of the documents that are already stored in the database.

In [10]:
existing_items = db.get(include=[]) # IDs are always included by default
existing_ids = set(existing_items["ids"])
print(f"Number of existing documents in DB: {len(existing_ids)}")

Number of existing documents in DB: 2116


Only add documents that do not exist in the DB yet.

In [11]:
new_chunks = [chunk for chunk in chunks if chunk.metadata["id"] not in existing_ids]

if new_chunks:
    print(f"Adding new documents to DB: {len(new_chunks)}")
    # Use a separate name so the `chunks` list is not overwritten with IDs
    new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
    db.add_documents(new_chunks, ids=new_chunk_ids)
    db.persist()  # kept for older Chroma versions; Chroma 0.4+ persists automatically
else:
    print("No new documents to add to DB")

No new documents to add to DB
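Because the IDs are deterministic, the add step is idempotent: a second run over the same PDF adds nothing. The sketch below illustrates this with a plain dictionary standing in for the Chroma collection; `add_missing` and the sample IDs are hypothetical.

```python
def add_missing(store: dict[str, str], candidates: dict[str, str]) -> list[str]:
    """Insert only the IDs that `store` does not contain yet; return the added IDs."""
    added = [chunk_id for chunk_id in candidates if chunk_id not in store]
    for chunk_id in added:
        store[chunk_id] = candidates[chunk_id]
    return added


# The dict is a stand-in for the vector store, keyed by chunk ID
store = {"a.pdf:0:0": "first chunk"}
candidates = {"a.pdf:0:0": "first chunk", "a.pdf:0:1": "second chunk"}

first_run = add_missing(store, candidates)   # only the unseen chunk is added
second_run = add_missing(store, candidates)  # nothing new on a repeat run
```

Note that this check only compares IDs. If a page's text changes but its ID stays the same, the stored chunk is not refreshed.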


## **Query Data**
Retrieve the five most similar chunks from the database and build the prompt from them.

In [18]:
# reuse existing db
# query_text = """
# Imagine you are Database Engineer who has been given several tasks as below:
# - Create a new table to store information of the company staffs in MySQL database server. It will contain the "First Name", "Last Name", "Age", "Contact Number", and "Email".
# - Insert a few records in table that just created.
# - Select all record in table that just inserted.
# You are requested to create a document to share with your team member.
# Please provide some examples of the SQL query.\n
# Could you please provide the steps in bullet points.\n
# Sample as below:\n
# Step 1: ...\n
# Step 2: ...\n
# Step 3: ...\n
# """
query_text = """
Imagine you are Securify Officer who will propose security practises for items as below:\n
- What is security guideline for MySQL database server?\n
- How to keep password secured?\n
You are requested to create a document to share with your team member.\n
Could you please provide the document format as below:\n
Topic\n
  - item 1\n
  - item 2\n
  - item 3\n
"""
PROMPT_TEMPLATE = """
1. If not sure, say "I don't know".
2. Answer the question based only on the following context:

Context: {context}

---

Answer the question based on the above context: {question}
"""
results = db.similarity_search_with_score(query_text, k=5)
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
print(f"Context: \n{context_text}")
print(f"Question: \n{query_text}")
prompt = prompt_template.format(context=context_text, question=query_text)

Context: 
2.2.1 End-User Guidelines for Password Security
MySQL users should use the following guidelines to keep passwords secure.
When you run a client program to connect to the MySQL server, it is inadvisable to specify your password
in a way that exposes it to discovery by other users. The methods you can use to specify your password
when you run client programs are listed here, along with an assessment of the risks of each method.
In short, the safest methods are to have the client program prompt for the password or to specify the
password in a properly protected option file.
• Use the mysql_config_editor utility, which enables you to store authentication credentials in an
encrypted login path file named .mylogin.cnf. The file can be read later by MySQL client programs to

---

concerns include the following:
• Section 2.1, “Security Guidelines”.
• Section 2.3, “Making MySQL Secure Against Attackers”.
• How to Reset the Root Password.
• Section 2.5, “How to Run MySQL as a Normal U
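The context above is simply the retrieved chunks joined by a separator. A possible refinement is to drop weak matches before joining. The sketch below assumes that the scores are distances, so lower means more similar; verify this for your Chroma configuration. `build_context`, the threshold, and the scores are placeholder values for illustration, not measurements from this notebook.

```python
def build_context(results, max_distance=None, separator="\n\n---\n\n"):
    """Join the texts of (text, score) pairs, optionally dropping scores above max_distance."""
    kept = [
        text
        for text, score in results
        if max_distance is None or score <= max_distance
    ]
    return separator.join(kept)


# Placeholder (text, distance) pairs; the numbers are invented for the example
sample_results = [
    ("chunk about passwords", 0.42),
    ("chunk about privileges", 0.61),
    ("unrelated chunk", 1.35),
]
context = build_context(sample_results, max_distance=1.0)
print(context)
```

With real results, the pairs would come from `[(doc.page_content, score) for doc, score in results]`.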

DeepSeek-R1 models wrap their reasoning in `<think>` tags. The helper below separates that reasoning from the final answer.

In [19]:
def extract_think_content(response_text):
    """
    Parses and separates content wrapped in XML-style "think" tags
    from the final response.
    """
    start_tag = "<think>"
    end_tag = "</think>"

    # Look up both tags before adding any offset, so a missing tag stays -1
    tag_index = response_text.find(start_tag)
    end_index = response_text.find(end_tag)

    if tag_index != -1 and end_index != -1:
        start_index = tag_index + len(start_tag)
        reasoning_content = response_text[start_index:end_index].strip()
        final_response = response_text[end_index + len(end_tag):].strip()
        return reasoning_content, final_response
    else:
        return None, response_text
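An alternative sketch of the same idea uses a regular expression, which makes the "no tags present" case explicit. `split_reasoning` is a hypothetical name, and the sample strings are invented for the example.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response_text: str):
    """Return (reasoning, final_answer); reasoning is None when no think block exists."""
    match = THINK_PATTERN.search(response_text)
    if match is None:
        return None, response_text
    reasoning = match.group(1).strip()
    final_answer = response_text[match.end():].strip()
    return reasoning, final_answer


with_tags = split_reasoning("<think>check the context</think>Use an option file.")
without_tags = split_reasoning("Use an option file.")
print(with_tags)
print(without_tags)
```

`re.DOTALL` lets the reasoning span several lines, which is the usual case for model output.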

Send the prompt to the model and display the response together with its sources.

In [20]:
# MODEL_NAME = "deepseek-r1:1.5b"
# MODEL_NAME = "deepseek-r1:8b"
MODEL_NAME = "llama3.2:3b"
# MODEL_NAME = "tinyllama"
# MODEL_NAME = "tinydolphin"
# MODEL_NAME = "phi3"

model = OllamaLLM(model=MODEL_NAME, temperature=0.7)
response_text = model.invoke(prompt)

sources = [doc.metadata.get("id", None) for doc, _score in results]
if MODEL_NAME.startswith("deepseek-r1"):  # DeepSeek-R1 models emit <think> blocks
    reasoning_content, final_response = extract_think_content(response_text)
    formatted_response = f"Query:\n\n{query_text}\n\nResponse:\n\n{final_response}\n\nReasoning:\n\n{reasoning_content}\n\nSources: {sources}"
else:
    formatted_response = f"Query:\n\n{query_text}\n\nResponse:\n\n{response_text}\n\nSources: {sources}"
# print(formatted_response)
display(Markdown(formatted_response))

Query:


Imagine you are Securify Officer who will propose security practises for items as below:

- What is security guideline for MySQL database server?

- How to keep password secured?

You are requested to create a document to share with your team member.

Could you please provide the document format as below:

Topic

  - item 1

  - item 2

  - item 3



Response:

I don't know about the default authentication plugin in MySQL 8.0.

 

**Secure Password Handling Guidelines for MySQL Database Server**

As a Securify Officer, it is essential to follow best practices to protect our MySQL database server from unauthorized access. The following guidelines outline the recommended methods for securing passwords:

1. **Use mysql_config_editor utility**: This utility enables you to store authentication credentials in an encrypted login path file named .mylogin.cnf. The file can be read later by MySQL client programs.

2. **Specify password securely in option files**: Passwords should not be exposed to discovery by other users when running client programs. Instead, use the mysql_config_editor utility or specify the password in a properly protected option file.

3. **Avoid storing cleartext passwords in database**: Never store cleartext passwords in your database, as this can lead to significant security risks if the computer becomes compromised. Use SHA2() or other one-way hashing functions and store the hash value instead.

4. **Use salt values for password hashing**: To prevent password recovery using rainbow tables, do not use plain passwords with these functions; instead, choose a string to be used as a salt and use the hash(hash(password)+salt) format.

5. **Regularly review and revoke unnecessary privileges**: Use the SHOW GRANTS statement to check which accounts have access to what, then use the REVOKE statement to remove those privileges that are not necessary.

By following these guidelines, we can significantly improve the security of our MySQL database server and protect our sensitive data.

Sources: ['./data/pdf/mysql-security-excerpt-8.0-en.pdf:12:1', './data/pdf/mysql-security-excerpt-8.0-en.pdf:494:1', './data/pdf/mysql-security-excerpt-8.0-en.pdf:1:0', './data/pdf/mysql-security-excerpt-8.0-en.pdf:8:0', './data/pdf/mysql-security-excerpt-8.0-en.pdf:11:0']
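Each source ID encodes the file, the page, and the chunk index, so it can be split back into its parts to locate a passage in the PDF. A minimal sketch, assuming the `source:page:chunk_index` scheme used earlier in this notebook; `parse_chunk_id` is a hypothetical helper.

```python
def parse_chunk_id(chunk_id: str) -> tuple[str, int, int]:
    """Split "source:page:chunk_index" into its parts, reading from the right."""
    # rsplit keeps any colons inside the file path intact
    source, page, chunk_index = chunk_id.rsplit(":", 2)
    return source, int(page), int(chunk_index)


parsed = parse_chunk_id("./data/pdf/mysql-security-excerpt-8.0-en.pdf:12:1")
print(parsed)
```

The page value is whatever the loader stored in the metadata. If the loader numbers pages from zero, add 1 to get the page shown in a PDF viewer.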