# Retrieval-augmented Generation using PDF URLs

**Author**: Dany Raihan

This project, for demonstration purposes, uses `langchain` to chain the steps into a readable flow, `chromadb` as the vector database, OpenAI's `text-embedding-3-small` to vectorize text, and OpenAI's `gpt-4o-mini` to answer questions from the retrieved context. Any of these components can be replaced with an alternative model or library.


## Workflow:
1. **Fetch PDF Content**: Download and extract content from a list of PDF URLs.
2. **Trim Metadata**: Prepare the necessary metadata for vectorization.
3. **Split Text into Chunks**: Convert the PDF content with metadata into overlapping text chunks to prepare the data for vectorization.
4. **Vectorize the Text Chunks**: Vectorize the text chunks using the specified embedding model and store the vectors in a Chroma vector database.
5. **Query the Vector Store**: Perform a similarity search in the vector store based on a given query to retrieve the most relevant documents.
6. **Parse the Search Results**: Parse and join the text from the search results to create a context for the model.
7. **Construct the Model Instruction**: Construct the instruction that includes the context and the query to be passed to the model.
8. **Perform Retrieval-augmented Generation**: Generate the final answer using the RAG pipeline by querying the model with the constructed instruction.

Each step is crucial in constructing a reliable RAG pipeline, especially when working with unstructured data like PDFs.

#### Project Configuration

Define the initial configurations necessary to set up our RAG pipeline. This includes specifying the PDF URLs, setting environment variables like API keys, and defining the models we will use.

In [1]:
import os

# list the pdf urls
urls = ['https://bitcoin.org/bitcoin.pdf']

# Configurations
os.environ["OPENAI_API_KEY"] = "enter_your_openai_key_here"

# Models
embedding_model = "text-embedding-3-small"
llm_model = "gpt-4o-mini"

# Vector database configuration [optional]
collection_name = "example_collection"
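
Hard-coding the key in the notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key may already be exported in the shell; `resolve_api_key` is a hypothetical helper introduced here for illustration, not part of any library:

```python
import os
from getpass import getpass
from typing import Mapping, Optional


def resolve_api_key(env: Mapping[str, str], name: str = "OPENAI_API_KEY") -> Optional[str]:
    """Return a non-empty key from `env`, or None if it is missing or blank."""
    value = env.get(name, "").strip()
    return value or None


# Hypothetical usage: prompt interactively only when no key is set.
# api_key = resolve_api_key(os.environ) or getpass("OpenAI API key: ")
# os.environ["OPENAI_API_KEY"] = api_key
```

Note that a placeholder such as `"enter_your_openai_key_here"` still counts as non-empty, so it should be removed before relying on this check.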

#### Step 1: Fetch PDF Contents

This step covers the process of downloading and extracting text content from the provided PDF URLs. We use `langchain`'s `PyMuPDFLoader` for loading the PDF documents. This step is essential as it converts unstructured PDF data into a format suitable for further processing.

In [2]:
# import necessary libraries
import re
import requests
from tempfile import NamedTemporaryFile
from langchain.document_loaders import PyMuPDFLoader
from langchain.schema import Document

In [3]:
# Validate the URLs
def ensure_url(string: str) -> str:
    """
    Ensures the given string is a URL by adding 'http://' if it doesn't start with 'http://' or 'https://'.
    Raises an error if the string is not a valid URL.

    Parameters:
        string (str): The string to be checked and possibly modified.

    Returns:
        str: The modified string that is ensured to be a URL.

    Raises:
        ValueError: If the string is not a valid URL.
    """
    if not string.startswith(("http://", "https://")):
        string = "http://" + string

    # Basic URL validation regex from https://stackoverflow.com/a/7160778
    url_regex = re.compile(
        r"^(https?:\/\/)?"  # optional protocol
        r"(www\.)?"  # optional www
        r"([a-zA-Z0-9.-]+)"  # domain
        r"(\.[a-zA-Z]{2,})?"  # top-level domain
        r"(:\d+)?"  # optional port
        r"(\/[^\s]*)?$",  # optional path
        re.IGNORECASE,
    )

    if not url_regex.match(string):
        raise ValueError(f"Invalid URL: {string}")

    return string

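# Validate every URL first, then report success once
invalid_urls = []
for url in urls:
    try:
        ensure_url(url)
    except ValueError as e:
        invalid_urls.append(url)
        print(str(e))

if not invalid_urls:
    print("✅ All URLs are valid")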

✅ All URLs are valid
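
The regex above is fairly permissive. As a complementary check, the standard library can parse the URL and confirm it has an http(s) scheme and a host. This is a sketch only; `is_http_url` is a hypothetical helper and is not used elsewhere in this notebook:

```python
from urllib.parse import urlparse


def is_http_url(candidate: str) -> bool:
    """Return True if `candidate` parses as an http(s) URL with a host part."""
    try:
        parsed = urlparse(candidate.strip())
    except ValueError:
        # urlparse can raise on malformed input such as broken IPv6 brackets
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_http_url("https://bitcoin.org/bitcoin.pdf"))  # True
print(is_http_url("ftp://bitcoin.org/bitcoin.pdf"))    # False
print(is_http_url("bitcoin.org/bitcoin.pdf"))          # False (no scheme)
```

Unlike `ensure_url`, this sketch does not add a missing scheme; it only reports whether one is present.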


In [4]:
# Fetch the PDF content
def fetch_pdf_content(urls: list[str]) -> list[Document]:
    """
    Fetches and parses text from one or more PDF URLs.

    Parameters:
        urls (list[str]): A list of PDF URLs.

    Returns:
        list[Document]: A list of Document objects containing the text and metadata from each PDF.
    """
    urls = [ensure_url(url.strip()) for url in urls if url.strip()]
    data = []

    for url in urls:
        try:
            # A timeout keeps an unresponsive server from hanging the notebook
            response = requests.get(url, timeout=30)
            response.raise_for_status()

            with NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
                temp_file.write(response.content)
                temp_file_path = temp_file.name

            pdf_loader = PyMuPDFLoader(file_path=temp_file_path)
            pdf_docs = pdf_loader.load()

            for pdf_doc in pdf_docs:
                data.append(Document(page_content=pdf_doc.page_content, metadata=pdf_doc.metadata))
        except requests.exceptions.RequestException as e:
            print(f"Failed to fetch URL {url}: {e}")
            continue

    return data

pdf_data = []

try:
    pdf_data = fetch_pdf_content(urls)
except Exception as e:
    print(f"Failed to fetch PDF content: {e}")

# Only report success and preview the first page if something was loaded
if pdf_data:
    print("✅ Successfully fetched PDF content with its metadata")
    print(f"Content: -> \n{pdf_data[0].page_content[:100]}...")
    print(f"Metadata: -> \n{pdf_data[0].metadata}")
else:
    print("No PDF content was loaded")
    
# for doc in pdf_data:
#     print(f"Content: {doc.page_content[:100]}...")
#     print(f"Metadata: {doc.metadata}")

✅ Successfully fetched PDF content with its metadata
Content: -> 
Bitcoin: A Peer-to-Peer Electronic Cash System
Satoshi Nakamoto
satoshin@gmx.com
www.bitcoin.org
Abs...
Metadata: -> 
{'source': '/var/folders/q9/6pq5y1l504j2_bdkt5hykm7h0000gn/T/tmpw3wbv20w.pdf', 'file_path': '/var/folders/q9/6pq5y1l504j2_bdkt5hykm7h0000gn/T/tmpw3wbv20w.pdf', 'page': 0, 'total_pages': 9, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Writer', 'producer': 'OpenOffice.org 2.4', 'creationDate': "D:20090324113315-06'00'", 'modDate': '', 'trapped': ''}
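
A URL that ends in `.pdf` can still return an HTML error page. One inexpensive guard, sketched here as an assumption rather than something the loader requires, is to check for the `%PDF-` header before writing the temporary file. `looks_like_pdf` is a hypothetical helper:

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the payload begins with the PDF header marker."""
    # PDF files start with "%PDF-" followed by a version such as 1.4.
    # Tolerating leading whitespace is an assumption made for this sketch.
    return content[:1024].lstrip().startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.4\n%..."))          # True
print(looks_like_pdf(b"<html>Not found</html>"))  # False
```

In `fetch_pdf_content`, such a check could run on `response.content` right after `raise_for_status()`.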


#### Step 2: Trim Metadata
Keep only the metadata fields that should be stored alongside each chunk. The remaining fields, such as the temporary file paths, are dropped before vectorization.


In [5]:
# Keep only the metadata fields needed for vectorization
important_metadata = ['title', 'page', 'creator', 'author']  # customizable list of fields to keep

trimmed_pdf_data = []
for data in pdf_data:
    trimmed_metadata = {key: data.metadata.get(key) for key in important_metadata if key in data.metadata}
    trimmed_data_object = Document(page_content=data.page_content, metadata=trimmed_metadata)
    trimmed_pdf_data.append(trimmed_data_object)
      
# print(f"Content: -> \n{trimmed_pdf_data[0].page_content[:100]}...")
# print(f"Metadata: -> \n{trimmed_pdf_data[0].metadata}")
print("✅ Successfully trimmed metadata for vectorization")



✅ Successfully trimmed metadata for vectorization
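
To see the filtering in isolation, the same dict comprehension can be applied to a plain dictionary. The sample below reuses a subset of the metadata printed in Step 1; `trim_metadata` is a hypothetical helper name:

```python
def trim_metadata(metadata: dict, keep: list[str]) -> dict:
    """Return only the entries of `metadata` whose keys are listed in `keep`."""
    return {key: metadata[key] for key in keep if key in metadata}


# Subset of the metadata shown in the Step 1 output
sample_metadata = {
    "page": 0,
    "total_pages": 9,
    "format": "PDF 1.4",
    "title": "",
    "author": "",
    "creator": "Writer",
    "producer": "OpenOffice.org 2.4",
}

trimmed = trim_metadata(sample_metadata, ["title", "page", "creator", "author"])
print(trimmed)  # {'title': '', 'page': 0, 'creator': 'Writer', 'author': ''}
```

The key order follows the `keep` list, not the order in the original metadata.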


#### Step 3: Split Text into Chunks
Convert PDFs with metadata into overlapping text chunks. This step prepares the documents for vectorization by breaking them into manageable chunks of text.

In [6]:
# import necessary libraries 
from typing import List
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document

In [7]:
# Function to split text into chunks based on specified criteria
def split_text(documents: List[Document], chunk_overlap: int = 200, chunk_size: int = 1000, separator: str = "\n") -> List[Document]:
    """
    Splits text into chunks based on specified criteria.

    Parameters:
        documents (List[Document]): A list of Document objects to split.
        chunk_overlap (int): Number of characters to overlap between chunks.
        chunk_size (int): The maximum number of characters in each chunk.
        separator (str): The character to split on. Defaults to newline.

    Returns:
        List[Document]: A list of Document objects with the split content.
    """
    # Initialize the text splitter
    splitter = CharacterTextSplitter(
        chunk_overlap=chunk_overlap,
        chunk_size=chunk_size,
        separator=separator,
    )
    # Split the documents; each chunk keeps the metadata of its source page
    return splitter.split_documents(documents)


# Parameters for splitting
chunk_overlap = 200
chunk_size = 1000
separator = " "  # Split by space for demonstration

# Split the text
splitted_data = split_text(trimmed_pdf_data, chunk_overlap, chunk_size, separator)

# Display the split chunks (displaying only the first 3 chunks for demonstration)
for idx, doc in enumerate(splitted_data[0:3]):
    print(f"Chunk {idx + 1}:")
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 50)
    

Chunk 1:
Content: Bitcoin: A Peer-to-Peer Electronic Cash System
Sat...
Metadata: {'title': '', 'page': 0, 'creator': 'Writer', 'author': ''}
--------------------------------------------------
Chunk 2:
Content: it came from the largest pool of CPU power. As 
lo...
Metadata: {'title': '', 'page': 0, 'creator': 'Writer', 'author': ''}
--------------------------------------------------
Chunk 3:
Content: really possible, since financial institutions cann...
Metadata: {'title': '', 'page': 0, 'creator': 'Writer', 'author': ''}
--------------------------------------------------
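
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-width sliding window. This is not how `CharacterTextSplitter` works internally (it splits on the separator and merges the pieces), so real chunk boundaries will differ; `sliding_window_chunks` is a hypothetical helper:

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role the 200-character overlap plays in the cell above.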


#### Step 4: Vectorize the Text Chunks
Vectorize the text chunks using the specified embedding model. This step involves creating a vector representation of the text data, which is crucial for the retrieval process.

In [8]:
# Import necessary libraries
from typing import List, Optional
from copy import deepcopy
from langchain.schema import Document
from chromadb import Client
from chromadb.config import Settings
from loguru import logger
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from openai import OpenAI
import tiktoken

In [9]:
# Helper function to add documents to the vector store
def add_documents_to_vector_store(vector_store: Chroma, ingest_data: List[Document], allow_duplicates: bool, limit: Optional[int] = None) -> None:
    """
    Adds documents to the Vector Store.

    Parameters:
        vector_store (Chroma): The Chroma vector store instance.
        ingest_data (List[Document]): The data to ingest into the vector store.
        allow_duplicates (bool): Whether to allow duplicates in the vector store.
        limit (Optional[int]): The limit on the number of records to compare when `allow_duplicates` is False.
    """
    if not ingest_data:
        return

    _stored_documents_without_id = []
    if not allow_duplicates:
        stored_data = vector_store.similarity_search("", k=limit)  # Fetch existing data
        for doc in deepcopy(stored_data):
            doc.metadata.pop('id', None)  # tolerate stored documents without an 'id' field
            _stored_documents_without_id.append(doc)

    documents_to_add = []
    for doc in ingest_data:
        if doc not in _stored_documents_without_id:
            documents_to_add.append(doc)

    if documents_to_add:
        logger.debug(f"Adding {len(documents_to_add)} documents to the Vector Store.")
        vector_store.add_documents(documents_to_add)
    else:
        logger.debug("No documents to add to the Vector Store.")

# Function to build the Chroma vector store
def build_vector_store(
    collection_name: str = "vector_collection",
    persist_directory: Optional[str] = None,
    chroma_server_host: Optional[str] = None,
    chroma_server_http_port: Optional[int] = None,
    chroma_server_grpc_port: Optional[int] = None,
    chroma_server_ssl_enabled: Optional[bool] = False,
    chroma_server_cors_allow_origins: Optional[list] = None,
    embedding_model: str = "text-embedding-3-small",
    ingest_data: Optional[List[Document]] = None,
    allow_duplicates: bool = True,
    limit: Optional[int] = None,
) -> Chroma:
    """
    Builds the Chroma vector store.

    Parameters:
        collection_name (str): The name of the collection in Chroma.
        persist_directory (Optional[str]): Directory to persist data.
        chroma_server_host (Optional[str]): Chroma server host.
        chroma_server_http_port (Optional[int]): Chroma server HTTP port.
        chroma_server_grpc_port (Optional[int]): Chroma server gRPC port.
        chroma_server_ssl_enabled (Optional[bool]): Whether SSL is enabled on the server.
        chroma_server_cors_allow_origins (Optional[list]): CORS origins allowed by the server.
        embedding_model (str): The embedding model to use.
        ingest_data (Optional[List[Document]]): Data to ingest into the vector store.
        allow_duplicates (bool): Whether to allow duplicates in the vector store.
        limit (Optional[int]): Limit on the number of records to compare when `allow_duplicates` is False.

    Returns:
        Chroma: The initialized Chroma vector store instance.
    """
    # Initialize Chroma settings and client
    chroma_settings = None
    client = None
    if chroma_server_host:
        chroma_settings = Settings(
            chroma_server_cors_allow_origins=chroma_server_cors_allow_origins or [],
            chroma_server_host=chroma_server_host,
            chroma_server_http_port=chroma_server_http_port or None,
            chroma_server_grpc_port=chroma_server_grpc_port or None,
            chroma_server_ssl_enabled=chroma_server_ssl_enabled,
        )
        client = Client(settings=chroma_settings)

    # Resolve persist directory
    persist_directory = persist_directory or None

    # Initialize embedding function
    embedding_function = OpenAIEmbeddings(model=embedding_model)

    # Initialize the Chroma vector store
    chroma = Chroma(
        collection_name=collection_name,
        persist_directory=persist_directory,
        client=client,
        embedding_function=embedding_function,
    )

    # Add documents to the vector store
    add_documents_to_vector_store(chroma, ingest_data, allow_duplicates, limit)

    return chroma



# Build the vector store
chroma_store = build_vector_store(
    collection_name=collection_name,
    embedding_model=embedding_model,
    ingest_data=splitted_data,
    allow_duplicates=False,
    limit=10
)

print(f"✅ Successfully built the Chroma vector store collection named '{collection_name}'")


2024-08-13 03:03:00.689 | DEBUG    | __main__:add_documents_to_vector_store:28 - Adding 29 documents to the Vector Store.


✅ Successfully built the Chroma vector store collection named 'example_collection'
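
The duplicate check above compares new documents against at most `limit` stored ones. A different approach, sketched here as an assumption and not what this notebook does, is to fingerprint each chunk by hashing its content and metadata. `document_fingerprint` and `drop_duplicates` are hypothetical helpers that work on plain `(content, metadata)` pairs:

```python
import hashlib
import json


def document_fingerprint(page_content: str, metadata: dict) -> str:
    """Return a stable SHA-256 hex digest of a chunk's content and metadata."""
    payload = json.dumps(
        {"content": page_content, "metadata": metadata},
        sort_keys=True,       # key order must not change the digest
        ensure_ascii=False,
        default=str,          # fall back to str() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def drop_duplicates(records: list[tuple[str, dict]]) -> list[tuple[str, dict]]:
    """Keep the first occurrence of each (content, metadata) pair."""
    seen = set()
    unique = []
    for content, metadata in records:
        fingerprint = document_fingerprint(content, metadata)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append((content, metadata))
    return unique


records = [("a", {"page": 0}), ("a", {"page": 0}), ("a", {"page": 1})]
print(len(drop_duplicates(records)))  # 2
```

Such fingerprints could also serve as stable document identifiers at ingestion time; how the vector store treats repeated identifiers should be checked against the library version in use.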


#### Step 5: Query the Vector Store
Perform a similarity search based on a given query. This step retrieves the most relevant documents from the vector store based on the input query.


In [10]:
# Write the query
query = 'what probability density used to estimate attackers potential progress in bitcon network?'

# Perform a search
search_results = chroma_store.similarity_search(query, k=5)
print(f"✅ Successfully performed a search with the query '{query}'")

# Display the top 2 search results
for idx, result in enumerate(search_results[:2]):
    print(f"Result {idx + 1}:")
    print(f"Content: {result.page_content}...")
    print(f"Metadata: {result.metadata}")
    print("-" * 50)

✅ Successfully performed a search with the query 'what probability density used to estimate attackers potential progress in bitcon network?'
Result 1:
Content: spent.
The race between the honest chain and an attacker chain can be characterized as a Binomial 
Random Walk. The success event is the honest chain being extended by one block, increasing its 
lead by +1, and the failure event is the attacker's chain being extended by one block, reducing the 
gap by -1.
The probability of an attacker catching up from a given deficit is analogous to a Gambler's 
Ruin problem. Suppose a gambler with unlimited credit starts at a deficit and plays potentially an 
infinite number of trials to try to reach breakeven. We can calculate the probability he ever 
reaches breakeven, or that an attacker ever catches up with the honest chain, as follows [8]:
p = probability an honest node finds the next block
q = probability the attacker finds the next block
qz = probability the attacker will ever catch up 
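
Conceptually, a similarity search embeds the query and ranks the stored vectors by closeness to it. The toy sketch below uses cosine similarity on two-dimensional vectors. The metric Chroma actually applies depends on the collection configuration, so cosine is an assumption here, and `cosine_similarity` and `top_k` are hypothetical helpers:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], indexed: dict[str, list[float]], k: int) -> list[str]:
    """Return the labels of the `k` stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), label) for label, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions
indexed = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], indexed, k=2))  # ['a', 'c']
```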

#### Step 6: Parse the Query-Similarity Search Results
Parse the text from the search results and join them to create a context for the model.

In [11]:
# Parse the text from the search results, and join them
search_result_text = [result.page_content for result in search_results]
search_result_text = "\n".join(search_result_text)
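
Joining every result works for a handful of chunks, but the context can grow large as `k` increases. A minimal sketch of capping the context by character count, assuming the results are already ordered from most to least relevant; `build_context` and its `max_chars` parameter are hypothetical:

```python
def build_context(chunks: list[str], max_chars: int, separator: str = "\n") -> str:
    """Join chunks in order, stopping before the total exceeds `max_chars`."""
    selected = []
    used = 0
    for chunk in chunks:
        # The separator is only paid for from the second chunk onward
        cost = len(chunk) + (len(separator) if selected else 0)
        if used + cost > max_chars:
            break
        selected.append(chunk)
        used += cost
    return separator.join(selected)


print(repr(build_context(["aaaa", "bbbb", "cccc"], max_chars=9)))  # 'aaaa\nbbbb'
```

A token-based budget would be more precise for a language model; characters are used here only to keep the sketch dependency-free.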

#### Step 7: Construct Instruction Format for the Model
Construct the instruction that will be passed to the model, including both the context and the query.

In [12]:
instructions = f"Context: {search_result_text} \nQuestion: {query}"
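
The single-string instruction above is the simplest option. One common variation, shown here as a sketch and not as the format this notebook uses, separates the grounding rule into a system message so the model is told to answer only from the retrieved context. `build_messages` is a hypothetical helper and the wording of the system prompt is an assumption:

```python
def build_messages(context: str, question: str) -> list[dict]:
    """Build a chat message list with a grounding rule and the user question."""
    system_prompt = (
        "Answer using only the provided context. "
        "If the context does not contain the answer, say so."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Bitcoin is a peer-to-peer cash system.", "What is Bitcoin?")
print([m["role"] for m in messages])  # ['system', 'user']
```

The resulting list has the same shape as the `messages` argument passed to the chat completion call in Step 8.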

#### Step 8: Perform Retrieval-augmented Generation (RAG)
Generate the final answer using the RAG pipeline, leveraging the context and the query to produce a coherent response.


In [13]:
client = OpenAI()
result = client.chat.completions.create(
    model=llm_model,
    messages=[{"role": "user", "content": instructions }],
)

print("✅ Successfully generated the answer using Retrieval-augmented Generation pipeline")
print("Answer: \n")

# Display the answer in markdown format
from IPython.display import Markdown, display
display(Markdown(result.choices[0].message.content))

✅ Successfully generated the answer using Retrieval-augmented Generation pipeline
Answer: 



The probability density used to estimate the attacker's potential progress in the Bitcoin network is a Poisson distribution. The expected value \( \lambda \) for this distribution is given by the formula:

\[
\lambda = z \cdot \frac{q}{p}
\]

where:
- \( z \) is the number of blocks the attacker is behind,
- \( q \) is the probability that the attacker finds the next block,
- \( p \) is the probability that the honest node finds the next block.

The Poisson density function evaluates the probability of the attacker making \( k \) blocks of progress given the expected value \( \lambda \), which is used to derive the probability that the attacker could catch up from that point.

In the code example you provided, the Poisson probability mass function is computed for \( k \) blocks of progress using the formula:

\[
\text{Poisson}(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}
\]

This is integrated into a summation to evaluate the attacker's overall chances of catching up when accounting for all potential progress \( k \).
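
The formula in the answer can be checked numerically. A minimal sketch, assuming the summation form from the source paper, where the attacker's progress is Poisson distributed with \( \lambda = z \cdot q/p \) and each term is weighted by the chance of catching up from the remaining deficit. The function name is chosen for this notebook:

```python
import math


def attacker_success_probability(q: float, z: int) -> float:
    """Probability that an attacker with hash share `q` catches up from `z` blocks behind."""
    p = 1.0 - q                # probability an honest node finds the next block
    lam = z * (q / p)          # expected attacker progress (Poisson mean)
    total = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam**k / math.factorial(k)
        # (q/p)^(z-k) is the chance of catching up from the remaining deficit
        total -= poisson * (1.0 - (q / p) ** (z - k))
    return total


for z in (0, 1, 2):
    print(f"q=0.1 z={z} P={attacker_success_probability(0.1, z):.7f}")
```

For `q = 0.1`, the probability is 1 at `z = 0` and falls quickly as `z` grows, which is the behaviour the retrieved passages describe.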