<a href="https://colab.research.google.com/github/kjprice/hrbamboo/blob/master/Bamboo_HR_Rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using RAG Architecture To Query PDF Documents

The answers are found at the very bottom. Be sure to include an API Key for Gemini (next step) before running all the cells.

### Issues Encountered

 - The Gemini safety threshold had to be relaxed: a response was rated `MEDIUM` for `HARM_CATEGORY_DANGEROUS_CONTENT`, so that category is set to `BLOCK_ONLY_HIGH`.
 - When querying the index, duplicates were found (_Fixed when implementing `CharacterTextSplitter`_)
 - To answer a particular question, many sources were required (_Fixed when implementing `CharacterTextSplitter`_)
 - Gemini only occasionally determines the correct answer for one question, even though the retrieved sources contain it.
 - The PDFs seem difficult to parse (especially `Renard R.31.pdf`): data is scattered, and some "sentences" are too sparse to carry meaningful information.

### Wishlist

Ideally, the following features would be implemented:

- Use LangChain or LlamaIndex to build the data pipeline
- Try different Text Splitters (however `CharacterTextSplitter` seems to work really well).
- Use a persistent vector store (the current `faiss` index lives only in memory and is rebuilt on every run).
- Document compression
- Use a different PDF loader

In [94]:
# @title Set Gemini API
# @markdown Your API Key can either be set here or by adding a colab secret with name "GEMINI_API_KEY".
# @markdown An API Key can be created by visiting: https://aistudio.google.com/app/apikey

from google.colab import userdata

API_KEY = "" # @param {"type":"string"}

API_KEY = API_KEY.strip() or userdata.get('GEMINI_API_KEY')


if not API_KEY:
  raise ValueError("API_KEY is not set. Please create one at https://aistudio.google.com/app/apikey")



In [95]:
from typing import List

AllSentences = List[str]

In [96]:
PDF_FILES_URLS = {
    "Australia Women's Softball Team.pdf": "https://drive.google.com/file/d/1peCsJBkC14R93SngSH1-vnWc7DFnbDkX/view?usp=sharing",
    "Renard R.31.pdf": "https://drive.google.com/file/d/1it1elgDySqtXT5HFNQWDlbS3E9tm0HHd/view?usp=sharing"
}

TEXT_SPLITTER_CHUNK_SIZE = 400
TEXT_SPLITTER_CHUNK_OVERLAP = 60
SOURCES_PER_QUERY = 3

In [97]:
# @title Download Relevant PDFs

import os

def extract_id_from_google_drive_url(url):
  # Share links look like .../file/d/<file_id>/view?usp=sharing
  return url.split('/')[-2]

def download_file_from_google_drive(url, destination):
  file_id = extract_id_from_google_drive_url(url)
  !wget -O "$destination" "https://docs.google.com/uc?export=download&confirm=t&id=$file_id"

def download_pdf_if_necessary(pdf_name, pdf_url):
  if not os.path.exists(pdf_name):
    download_file_from_google_drive(pdf_url, pdf_name)
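
The ID extraction above relies on the file ID being the second-to-last path segment of the share link. As a rough alternative, the sketch below uses a regular expression instead. It assumes links of the form `/file/d/<id>/...` or `...?id=<id>`, and `extract_drive_file_id` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
import re
from typing import Optional

# Assumption: the ID follows "/file/d/" or an "id=" query parameter.
_DRIVE_ID_PATTERN = re.compile(r"(?:/file/d/|[?&]id=)([A-Za-z0-9_-]+)")

def extract_drive_file_id(url: str) -> Optional[str]:
    """Return the Drive file ID embedded in the URL, or None if none is found."""
    match = _DRIVE_ID_PATTERN.search(url)
    return match.group(1) if match else None

share_url = "https://drive.google.com/file/d/1it1elgDySqtXT5HFNQWDlbS3E9tm0HHd/view?usp=sharing"
print(extract_drive_file_id(share_url))
```

Returning `None` for an unrecognised link lets the caller decide whether to skip the file or raise, rather than silently downloading with a wrong ID.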

In [98]:
!pip install -q PyMuPDF sentence-transformers faiss-cpu torch transformers langchain

In [99]:
# @title Step 1: Extract Text from PDF Documents

import pymupdf  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
  """
  Extract text from a PDF file using PyMuPDF.
  """
  document = pymupdf.open(pdf_path)
  text = ""
  for page_num in range(len(document)):
      page = document.load_page(page_num)
      text += page.get_text()
  return text


In [100]:
# @title Load PDF Text

def load_pdfs_as_text() -> AllSentences:
  """
  Load all PDF files and extract text.
  """
  texts = []
  for pdf_name, pdf_url in PDF_FILES_URLS.items():
    download_pdf_if_necessary(pdf_name, pdf_url)
    texts.append(extract_text_from_pdf(pdf_name))
  return texts



In [101]:
# @title Step 2: Preprocess the Text
# @markdown Note that I replaced "sent_tokenize" with CharacterTextSplitter, which removed the duplicate results and reduced the number of sources needed per answer

import re
from langchain.text_splitter import CharacterTextSplitter


def preprocess_text(text: str):
  """
  Preprocess the text by removing extra whitespaces and converting to lowercase.
  """
  # Basic preprocessing
  text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
  text = text.lower()
  return text

def split_text(text):
  """
  Split the text into overlapping chunks using LangChain's CharacterTextSplitter.
  """
  text_splitter = CharacterTextSplitter(separator=' ', chunk_size=TEXT_SPLITTER_CHUNK_SIZE, chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP)
  return text_splitter.split_text(text)
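
To make the roles of the chunk size and chunk overlap concrete, here is a simplified pure-Python sketch of word-based chunking with overlap. It is not LangChain's implementation and may split text differently from `CharacterTextSplitter`; `chunk_words` is a hypothetical name used only for illustration.

```python
from typing import List

def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Pack space-separated words into chunks of roughly chunk_size characters,
    repeating trailing words (up to chunk_overlap characters) in the next chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split(" "):
        if current and len(" ".join(current + [word])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over the trailing words that fit within the overlap budget
            tail: List[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > chunk_overlap:
                    break
                tail.insert(0, prev)
            current = tail
        # Simplification: a very long word can push a chunk past chunk_size
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "a second aircraft was fitted with an enclosed canopy"
for chunk in chunk_words(sample, chunk_size=20, chunk_overlap=8):
    print(repr(chunk))
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some repeated text in the index.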


In [125]:
# @title Load and Process Text

texts = load_pdfs_as_text()

for text in texts:
  print(len(text))
  print()

preprocessed_texts = [preprocess_text(text) for text in texts]

# Flatten the chunks from both documents into a single list
all_sentences = [sentence for text in preprocessed_texts for sentence in split_text(text)]


15020

5269



In [103]:
# @title Step 3: Create Text Embeddings
from sentence_transformers import SentenceTransformer

class ModelTextEmbeddings:
  def __init__(self, all_sentences: List[str]):
    # Load a pre-trained sentence transformer model
    self.model = SentenceTransformer('all-MiniLM-L6-v2')
    self.embeddings = self.model.encode(all_sentences, convert_to_tensor=True)

  def get_embeddings(self):
    return self.embeddings

In [104]:
# @title Step 4: Store Embeddings in a Vector Store

import faiss
import numpy as np

class RagIndex(ModelTextEmbeddings):
  def __init__(self, all_sentences: List[str]):
    super().__init__(all_sentences)

    # Convert embeddings to a NumPy array
    embeddings_np = self.embeddings.cpu().detach().numpy()

    # Create a FAISS index and add the embeddings
    self.index = faiss.IndexFlatL2(embeddings_np.shape[1])
    self.index.add(embeddings_np)
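
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The toy sketch below mimics that behaviour in plain Python on made-up 2-D vectors; `brute_force_search` and `toy_vectors` are hypothetical names, and real embeddings from `all-MiniLM-L6-v2` have far more dimensions.

```python
from typing import List, Sequence, Tuple

def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], top_k: int
) -> List[Tuple[float, int]]:
    """Return (distance, index) pairs for the top_k vectors closest to the query."""
    scored = [(squared_l2(vector, query), i) for i, vector in enumerate(vectors)]
    return sorted(scored)[:top_k]

# Made-up vectors for illustration only
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
for distance, index in brute_force_search(toy_vectors, [0.9, 0.1], top_k=2):
    print(index, round(distance, 2))
```

Comparing the query against every stored vector is affordable here because the two PDFs produce only a small number of chunks.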


In [105]:
# @title Step 5: Query the Vector Store

# rag_index = RagIndex(all_sentences)
# index = rag_index.index
# model = rag_index.model

class RagQueryIndex(RagIndex):
  def __init__(self, all_sentences: List[str]):
    super().__init__(all_sentences)
    # Keep the chunks so search hits can be mapped back to their text
    self.all_sentences = all_sentences

  def query_index(self, query: str, top_k=SOURCES_PER_QUERY) -> List[str]:
    """
    Query the FAISS index for the chunks most similar to the given query.
    Returns the relevant chunks, closest first.
    """
    # Encode the query
    query_embedding = self.model.encode([query], convert_to_tensor=True).cpu().detach().numpy()

    # Search the index for the top_k most similar embeddings
    distances, indices = self.index.search(query_embedding, top_k)

    # Retrieve the corresponding chunks
    results = [self.all_sentences[idx] for idx in indices[0]]
    return results


In [106]:
# @title Instantiate Class and Run Example
rag = RagQueryIndex(all_sentences)

# Example query
query = "Which two companies created the R.31 reconnaissance aircraft?"
relevant_sentences = rag.query_index(query)

print("Relevant sentences:")
for sentence in relevant_sentences:
    print(sentence)
    print()

Relevant sentences:
a second aircraft was fitted with an enclosed canopy and a gnome-rhône mistral major radial engine, becoming the r-32, with this then being replaced by a hispano-suiza 12y engine, but the r-32 did not show sufficiently improved performance to gain a production order. a further six r.31s were ordered in august 1935.[1] the r.31 entered service with the belgian air force in 1935,[2] replacing the

r.31 role reconnaissance manufacturer renard first flight 1932 introduction 1935 retired 1940 primary user belgian air force number built 34 renard r.31 the renard r.31 was a belgian reconnaissance aircraft of the 1930s. a single-engined parasol monoplane, 32 r.31s were built for the belgian air force, the survivors of which, although obsolete, remained in service when nazi germany invaded belgium

was held in position by a single vee strut on each side, conjoined with its fixed under carriage. an order for 28 r.31s was placed in march 1934, with six to be built by renard an

In [107]:
!pip install -q -U google-generativeai


In [108]:
# @title Step 6 (Optional) - Use Gemini To Filter Results
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold

import os

genai.configure(api_key=API_KEY)
gen_model = genai.GenerativeModel('gemini-1.5-flash')

def gen_ai_answer(question: str, context: str):
    prompt = f"Using the following context, answer the question '{question}' \n\n" + context + ". Result: "
    response = gen_model.generate_content(
        prompt,
        generation_config={"temperature": 0.1},
        # Relax the dangerous-content threshold (see "Issues Encountered")
        safety_settings={
            HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
        },
    )
    return response.text

In [114]:
from IPython.display import display, HTML
import markdown

def print_html(html: str):
    display(HTML(html))

def print_markdown(markdown_text: str):
    html_text = markdown.markdown(markdown_text)
    print_html(html_text)

def answer_question(question: str, sources_limit = SOURCES_PER_QUERY):
    print_html(f"<h2>Question: {question}</h2>")
    relevant_sentences = rag.query_index(question, sources_limit)

    gemini_answer = gen_ai_answer(question, " ".join(relevant_sentences))
    print_html("<h3>Final Answer (From Gemini)</h3>")
    print_markdown(gemini_answer)

    print_html("<h3>Relevant Sentences</h3>")
    # Render the list in one call; separate display() calls would not nest the <li> items inside the <ol>
    list_items = "".join(f"<li>{sentence}</li>" for sentence in relevant_sentences)
    print_html(f"<ol>{list_items}</ol>")

    print()
    print()



# Answer Four Different Questions

In [115]:
answer_question("Which two companies created the R.31 reconnaissance aircraft?")





In [116]:
# Note that we had to use *many* sources to find the correct answer
answer_question("What guns were mounted on the Renard R.31?")





In [117]:
answer_question("Who was the first softball player to represent any country at four World Series of Softball?")





In [123]:
# Note that Gemini does not always find the correct answer, even though the sources include it
answer_question("Who were the pitchers on the Australian softball team's roster at the 2020 Summer Olympics?", 4)



