# RAGdoll PDF Interrogate example

@untrueaxioms

<img src='img/github-header-image.png' />


In [1]:
import logging
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [2]:
from ragdoll.helpers import set_logger
loginfo = set_logger(logging.INFO)

In [3]:
config={
    'log_level':logging.INFO 
    }

In [4]:
from ragdoll.helpers import is_notebook
from ragdoll.index import RagdollIndex

index= RagdollIndex(config)
check_notebook = is_notebook(print_output=True)


Running in a Jupyter Notebook or JupyterLab environment.


## Load

In [5]:
import os

# Build a list of relative paths to the PDFs in the test_docs folder
pdfs = [os.path.join('test_docs', pdf) for pdf in os.listdir('./test_docs') if pdf.endswith('.pdf')]

pdflist = "".join(f"\n ○ {d}" for d in pdfs)
print(f"🧠 I will conduct my research based on the following pdf documents:\n {pdflist}...")

🧠 I will conduct my research based on the following pdf documents:
 
 ○ test_docs\ukpga_20070003_en.pdf
 ○ test_docs\ukpga_20160019_en.pdf...
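
The folder listing above can also be written with `pathlib`, which avoids naming the folder twice and matches the `.pdf` extension case-insensitively. This is a minimal sketch; `list_pdfs` is a hypothetical helper introduced here for illustration and is not part of RAGdoll.

```python
from pathlib import Path


def list_pdfs(folder: str) -> list[str]:
    """Return the sorted paths of PDF files directly inside `folder`.

    Hypothetical helper for illustration only; not a RAGdoll function.
    """
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Calling `list_pdfs("test_docs")` should give a list that can be passed on in place of `pdfs`, assuming the loader accepts plain string paths as it does in this notebook.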


In [6]:
documents = index.get_scraped_content(pdfs)
print("-" * 100)
print(f"extracted {len(documents)} pdf documents")
print("-" * 100)
print('\n📄  Splitting Documents\n')
split_docs = index.get_split_documents(documents)
print("-" * 100)
print(f"extracted {len(split_docs)} documents from {len(documents)} documents")
print("-" * 100)

[index] 🌐 Fetching raw source content
[index] 📰 Chunking document


----------------------------------------------------------------------------------------------------
extracted 2 pdf documents
----------------------------------------------------------------------------------------------------

📄  Splitting Documents

----------------------------------------------------------------------------------------------------
extracted 3626 documents from 2 documents
----------------------------------------------------------------------------------------------------


In [7]:
print('🔗 Or all in one like this')
split_docs = index.run_document_pipeline(pdfs)
print("-" * 100)
print(f"extracted {len(split_docs)} documents from {len(documents)} documents")
print("-" * 100)

[index] Running document index pipeline
[index] 🌐 Fetching raw source content


🔗 Or all in one like this


[index] 📰 Chunking document


----------------------------------------------------------------------------------------------------
extracted 3626 documents from 2 documents
----------------------------------------------------------------------------------------------------
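
Splitting is what turns 2 PDFs into 3626 smaller documents. How RAGdoll splits internally is not shown in this notebook, so the following is only a sketch of one common approach: break the text on a separator and greedily pack the pieces into chunks up to a size limit. The function name `split_text`, the 500-character limit and the newline separator are assumptions made for illustration.

```python
def split_text(text: str, chunk_size: int = 500, separator: str = "\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole, so a chunk can
    exceed the limit when the text has no separator to break on.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by consecutive separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new (possibly oversized) chunk
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "first line\nsecond line\nthird line"
for chunk in split_text(sample, chunk_size=25):
    print(repr(chunk))
```

With this kind of splitter, a long run of text without any separator ends up as one chunk larger than the requested size.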


## Embed and Store

Let’s start by initializing a simple vector store retriever and storing our docs (in chunks).


In [14]:
from ragdoll.retriever import RagdollRetriever
ragdoll = RagdollRetriever(config)
db = ragdoll.get_db(split_docs)

[retriever] 🗃️  retrieving vector database (FAISS)...
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


We'll use the contextual compression retriever with a multi-query retriever as its base because, well, why not?

In [15]:
base_retriever = ragdoll.get_mq_retriever() 

ccfg={
        "use_embeddings_filter":True, 
        "use_splitter":True, 
        "use_redundant_filter":True, 
        "use_relevant_filter":True,
        "similarity_threshold":0.6, #embeddings filter settings
    }

retriever = ragdoll.get_compression_retriever(base_retriever, ccfg)

[retriever] 📋 getting multi query retriever
[retriever] 💭 Remember that the multi query retriever will incur additional calls to your LLM
[models] 🤖 retrieving OpenAI model for multi query retriever
[retriever] 🗜️ Compression object pipeline: embeddings_filter ➤ splitter ➤ redundant_filter ➤ relevant_filter
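
To make the `similarity_threshold` setting more concrete, here is a toy sketch of two of the stages named in the pipeline log: an embeddings filter that keeps only documents similar enough to the query, and a redundant filter that drops near-duplicates. It uses hand-written 2-dimensional vectors instead of real embeddings. The function names, the use of cosine similarity and the 0.95 duplicate threshold are assumptions for illustration, not RAGdoll's actual implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_filter(query_vec, docs, threshold=0.6):
    """Keep (text, vector) pairs whose similarity to the query is at least `threshold`."""
    return [(text, vec) for text, vec in docs if cosine(query_vec, vec) >= threshold]


def redundant_filter(docs, threshold=0.95):
    """Drop a document when it is nearly identical to one that was already kept."""
    kept = []
    for text, vec in docs:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return kept


# Toy vectors standing in for real embeddings (assumed values)
query = [1.0, 0.0]
docs = [
    ("on topic", [0.9, 0.1]),
    ("near duplicate", [0.9, 0.11]),
    ("off topic", [0.1, 0.9]),
]

relevant = embeddings_filter(query, docs, threshold=0.6)
unique = redundant_filter(relevant)
print([text for text, _ in unique])
```

Raising the threshold makes the filter stricter and returns fewer documents; lowering it lets more loosely related chunks through to the later stages.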


## 🙋‍♂️ Question Time

In [16]:
question='what are these documents about?'

In [18]:
response = ragdoll.answer_me_this(question, retriever)
print(response)

[retriever] 🔗 Running RAG chain
[models] 🤖 retrieving OpenAI model for RAG chain
[_client] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[text_splitter] Created a chunk of size 555, which is longer than the specified 500
[text_splitter] Created a chunk of size 514, which is longer than the specified 500
[text_splitter] Created a chunk of size 598, which is longer than the specified 500
[text_splitter] Created a chunk of size 516, which is

Based on the provided context, the documents are related to the UK legislation, specifically the Immigration Act 2016 and the Income Tax Act 2007. The content of the documents includes provisions, amendments, penalties, enforcement, and support related to immigration and taxation in the United Kingdom.
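
A RAG chain retrieves relevant chunks, places them in a prompt next to the question, and sends that prompt to the model. The sketch below illustrates only the prompt-assembly step. The `build_prompt` function, the dictionary keys and the prompt wording are hypothetical and are not taken from RAGdoll.

```python
def build_prompt(question: str, docs: list[dict]) -> str:
    """Combine retrieved chunks and the user's question into a single prompt string."""
    context = "\n\n".join(
        f"Source: {doc['source']}\nContent: {doc['content']}" for doc in docs
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunks standing in for retriever output
retrieved = [
    {"source": "doc_a.pdf", "content": "Text of the first retrieved chunk."},
    {"source": "doc_b.pdf", "content": "Text of the second retrieved chunk."},
]
print(build_prompt("what are these documents about?", retrieved))
```

Keeping the source next to each chunk in the prompt makes it easier to trace an answer back to the document it came from.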


The retriever used the following docs to support the answer:

In [21]:
from ragdoll.helpers import pretty_print_docs

simdocs = retriever.get_relevant_documents(question)
print("-" * 100)
print(f"The retriever returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

[_client] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[text_splitter] Created a chunk of size 555, which is longer than the specified 500
[text_splitter] Created a chunk of size 514, which is longer than the specified 500
[text_splitter] Created a chunk of size 598, which is longer than the specified 500
[text_splitter] Created a chunk of size 516, which is longer than the specified 500
[text_splitter] Created a chunk of size 598, which is longe

----------------------------------------------------------------------------------------------------
The retriever returned 18 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: test_docs\ukpga_20160019_en.pdf
Title: newbook.book
Content: 19)\nSchedule 14 — Maritime enforcement\n220\n(b)\nseize and retain any document the officer has reasonable\ngrounds to believe to be an item subject to legal privilege.\n(8) In this paragraph a “nationality document”, in relation to a person,\nmeans any document which might—\n(a)\nestablish the person’s identity, nationality or citizenship,\nor\n(b)\nindicate the place from which the person has travelled to\nthe United Kingdom
