### Context of this Notebook

The purpose of this notebook is to introduce some of the models and processes that I'll be using in a simpler context. It takes our PDF document, extracts its text with OCR, and chunks it before running it through a lightweight RAG process powered by Ollama.

Checking important dependencies

In [2]:
import torch
print('CUDA Available:', torch.cuda.is_available())

CUDA Available: True


Defining paths

In [10]:
import dotenv
dotenv.load_dotenv()
file_name = "sample-new-fidelity-acnt-stmt"
pdf_data_path = "./pdf_data/" + file_name + ".pdf"
text_data_path = "./text_data/" + file_name + ".txt"

First, we want to convert our PDF to images and run them through an OCR model. The cell below uses Tesseract through pytesseract; a Microsoft TrOCR model tuned on invoice data is an alternative for this step.

In [4]:
from pdf2image import convert_from_path
from PIL import Image
from pytesseract import image_to_string

def extract_text_from_pdf(pdf_path):
    try:
        # Convert PDF to images
        images = convert_from_path(pdf_path, dpi=150)  # Adjust DPI if needed
    except Exception as e:
        print(f"Error converting PDF to images: {e}")
        return None

    text = ""
    for image in images:
        try:
            # Ensure the page image is in RGB mode before OCR
            pil_image = image.convert('RGB')
            # Run Tesseract OCR on the page and append the recognized text
            page_text = image_to_string(pil_image)
            text += page_text
        except Exception as e:
            print(f"Error processing image: {e}")
    return text

In [5]:
import os

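pdf_text = ''
os.makedirs("text_data", exist_ok=True)
# Run OCR only if the text file does not exist yet
if not os.path.exists(text_data_path):
    pdf_text = extract_text_from_pdf(pdf_data_path)
    # Write only on success so a failed conversion does not leave an empty file
    if pdf_text is not None:
        with open(text_data_path, "w") as f:
            f.write(pdf_text)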

Now that we have our text file, we want to transform our query into something better based on the information in the document. To do this, we first need to set up a retriever so we can pull relevant information from the document we just created.

Let's split our document into chunks. There are many ways to do this, depending on our needs.

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

with open(f"text_data/{file_name}.txt", "r") as f:
    pdf_text = f.read()
passages = text_splitter.split_text(pdf_text)
passages[:2]

['*** SAMPLE STATEMENT *** For informational purposes only INVESTMENT REPORT  ¢ Fidelity July 1 — July 31, 2015  FMV E STADE eT  Your Portfolio Value: $274,222.20  Envelope # BABCEJBBPRTLA Change from Last Period: A $21,000.37  John W. Doe 100 Main St. This Period Year-to-Date Boston, MA 02201 Beginning Portfolio Value $253,221.83 $232,643.16 Additions 59,269.64 121,433.55 Subtractions -45,430.74 -98,912.58 Transaction Costs, Fees & Charges -139.77 -625.87 Change in Investment Value* 7,161.47 19,058.07 Ending',
 'in Investment Value* 7,161.47 19,058.07 Ending Portfolio Value** $274,222.20 $274,222.20  * — Appreciation or depreciation of your holdings due to price changes plus any distribution and income earned during the statement period. ** Excludes unpriced securities.  Contact Information  Online Fidelity.com FASTs™ Automated Telephone (800) 544-5555 Private Client Group (800) 544-5704  Welcome to your new Fidelity statement.  Your account numbers can be found on page 2 in the Accou
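
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraphs, lines, and spaces. The function name `chunk_with_overlap` and the sample string are made up for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows; neighbors share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # hypothetical input
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap is why the end of the first passage above ("in Investment Value* 7,161.47 19,058.07 Ending") reappears at the start of the second: a sentence cut at a chunk boundary still shows up whole in at least one chunk.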

Now we need to turn our passages into vector embeddings that we can store. A vector representation lets the machine compare passages numerically, which is what makes retrieving relevant passages possible. Many embedding models are available, and the right one depends on the type of documents we want to retrieve from and on our purpose. Text with different semantic meaning maps to different vectors, and the exact vectors depend on the embedding model you use.
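
As a rough illustration of how vectors make retrieval possible, the sketch below ranks a few labeled vectors against a query vector by cosine similarity. The three-dimensional vectors are invented for the example (real embeddings have hundreds of dimensions), and cosine similarity is only one common choice; which metric a given vector store uses depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" keyed by a description of the passage
toy_vectors = {
    "portfolio value summary": [0.9, 0.1, 0.0],
    "contact phone numbers": [0.0, 0.2, 0.9],
    "holdings and quantities": [0.7, 0.6, 0.1],
}
query_vector = [0.8, 0.5, 0.0]  # hypothetical embedding of the user's query

# Rank passages from most to least similar to the query
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query_vector, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```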

Nomic provides a lightweight embedding model that can run locally.

In [7]:
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_nomic.embeddings import NomicEmbeddings

# Define open source embeddings
embedding = NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local")

vectorstore = SKLearnVectorStore.from_texts(
    texts=passages,
    embedding=embedding,
)
# Pass k through search_kwargs so the retriever returns the top 8 passages
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})

We now have a vector store of our document. We set k=8 for the retriever, but the right value depends on our chunk size and the context length of our LLM. We can also add documents that augment this if needed. For example, a description may be important if a user asks about a use case.
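
One way to sanity-check a choice of k is to estimate how many tokens the retrieved chunks will occupy in the prompt. The sketch below assumes roughly 4 characters per token, which is only a rule of thumb for English text, and the context window and reserved-token figures are hypothetical placeholders, not properties of any particular model.

```python
def estimate_context_tokens(k: int, chunk_size_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token count for k retrieved chunks (heuristic, not a tokenizer)."""
    return int(k * chunk_size_chars / chars_per_token)


def fits_in_context(k: int, chunk_size_chars: int,
                    context_window_tokens: int, reserved_tokens: int) -> bool:
    """True if the retrieved chunks plus a reserve for prompt and answer fit the window."""
    needed = estimate_context_tokens(k, chunk_size_chars) + reserved_tokens
    return needed <= context_window_tokens


# k=8 and chunk_size=512 come from this notebook; the window sizes are assumptions
print(estimate_context_tokens(k=8, chunk_size_chars=512))
print(fits_in_context(k=8, chunk_size_chars=512,
                      context_window_tokens=8192, reserved_tokens=1024))
```

For an exact count, tokenize the formatted context with the tokenizer of the model you actually run.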

The next step is to set up a RAG workflow.

In [8]:
retriever.invoke("Get me a csv of all asset quantities.")

[Document(metadata={'id': 'c75bc94d-b90d-4361-8aed-62321eb657ed'}, page_content='and Certificates of Deposit (CDs). There is no guarantee that Al will be paid by the issuer. Al for treasury and GNMA securities, however, is backed by the full faith and credit of the United States Government. Al totals represent accruals for only those securities with listed Al in the Holdings section of this statement. Please refer to the Help/Glossary section of Fidelity.com for additional information. See Cost Basis Information and Endnotes for important information about the adjusted cost basis'),
 Document(metadata={'id': '94e3d052-e77b-495d-9541-fac12e413950'}, page_content='from the following securities: preferred stocks, international stocks, exchange trade products (ETF\'s & ETN\'s), UITs, variable rate bonds, and international bonds, but may be included in future enhancements.  18 of 28 *** SAMPLE STATEMENT *** For informational purposes only INVESTMENT REPORT July 1 — July 31, 2015  €" Fidelit

In [14]:
from langchain_community.chat_models import ChatOllama
llama_model = ChatOllama(model="llama3.1:70b", temperature=0)

In [16]:
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
prompt = hub.pull("rlm/rag-prompt")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llama_model
    | StrOutputParser()
)



In [17]:
rag_chain.invoke("Describe this document to me.")

'This document appears to be a sample brokerage account statement from Fidelity. It provides information about various aspects of the account, including interest rates, option transactions, cost basis, and gain/loss information. The document also includes disclaimers and instructions for reviewing and reporting any inaccuracies or discrepancies in the statement.'
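
The chain above follows a retrieve, format, prompt, generate sequence. The sketch below spells that flow out as plain function calls. Every function here is a hypothetical stand-in: `toy_retrieve` ranks by shared words instead of vector similarity, `build_prompt` is not the `rlm/rag-prompt` template, and `toy_llm` does not call a model.

```python
def toy_retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Stand-in retriever: rank passages by how many lowercase words they share with the question."""
    question_words = set(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(question_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def join_passages(passages: list[str]) -> str:
    """Concatenate retrieved passages into one context string."""
    return "\n\n".join(passages)


def build_prompt(context: str, question: str) -> str:
    """Stand-in prompt template combining context and question."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def toy_llm(prompt: str) -> str:
    """Stand-in for the model call; a real chain would send the prompt to the LLM."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def toy_rag(question: str, passages: list[str]) -> str:
    """Retrieve, format, prompt, and generate in sequence."""
    context = join_passages(toy_retrieve(question, passages))
    return toy_llm(build_prompt(context, question))


sample_passages = [
    "Your Portfolio Value: $274,222.20",
    "Contact Information Online Fidelity.com",
    "Beginning Portfolio Value $253,221.83",
]
print(toy_rag("portfolio value", sample_passages))
```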