### Context of this Notebook

This notebook introduces, in a simplified setting, some of the models and processes I'll be using. It takes our PDF document, extracts its text with OCR, splits the text into chunks, and runs them through a lightweight RAG process powered by Ollama.

Checking important dependencies

In [2]:
import torch
print('CUDA Available:', torch.cuda.is_available())

CUDA Available: True
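Beyond CUDA, the later cells rely on several Python packages and two external programs (pdf2image calls Poppler's `pdftoppm`, and pytesseract calls the `tesseract` binary). A minimal sketch of a pre-flight check follows; the helper name `check_environment` and the exact module list are assumptions made for illustration, not part of the original notebook.

```python
import importlib.util
import shutil

# Top-level module names imported later in this notebook (assumed list).
REQUIRED_MODULES = [
    "torch", "dotenv", "pdf2image", "pytesseract",
    "langchain_text_splitters", "langchain_nomic", "langchain_community",
]
# External binaries: pdf2image shells out to pdftoppm, pytesseract to tesseract.
REQUIRED_BINARIES = ["pdftoppm", "tesseract"]


def check_environment(modules=REQUIRED_MODULES, binaries=REQUIRED_BINARIES):
    """Return {name: bool} saying whether each module or binary can be found."""
    report = {}
    for name in modules:
        try:
            # find_spec locates the module without importing it.
            report[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            report[name] = False
    for name in binaries:
        # shutil.which searches PATH for an executable.
        report[name] = shutil.which(name) is not None
    return report


for name, found in check_environment().items():
    print(f"{name:28s} {'ok' if found else 'MISSING'}")
```

Anything reported as `MISSING` would need to be installed before the OCR and embedding cells can run.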


Defining paths

In [6]:
import dotenv
dotenv.load_dotenv()
file_name = "sample-new-fidelity-acnt-stmt"
pdf_data_path = "./pdf_data/" + file_name + ".pdf"
text_data_path = "./text_data/" + file_name + ".txt"

First, we want to convert our PDF to images and run each page through an OCR model. I had planned to use Microsoft TrOCR tuned on invoice data; the cell below uses Tesseract through `pytesseract`.

In [7]:
from pdf2image import convert_from_path
from PIL import Image
from pytesseract import image_to_string

def extract_text_from_pdf(pdf_path):
    try:
        # Convert PDF to images
        images = convert_from_path(pdf_path, dpi=150)  # Adjust DPI if needed
    except Exception as e:
        print(f"Error converting PDF to images: {e}")
        return None

    text = ""
    for image in images:
        try:
            # Convert image to RGB
            pil_image = image.convert('RGB')
            # Run Tesseract OCR on the page image and get the recognized text
            addText = image_to_string(pil_image)
            text += addText
        except Exception as e:
            print(f"Error processing image: {e}")
    return text

In [8]:
import os

pdf_text = ''
os.makedirs("text_data", exist_ok=True)
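if not os.path.exists(text_data_path):
    # Run OCR first so a failed extraction does not leave an empty file behind
    pdf_text = extract_text_from_pdf(pdf_data_path)
    if pdf_text is not None:
        # Write inside the with-block, while the file is still open
        with open(text_data_path, "w") as f:
            f.write(pdf_text)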

Now that we have our text file, we want to rewrite our query into something better informed by the contents of the document. To do this, we first need to set up a retriever so we can pull relevant information from the text we just created.

Let's split our document into chunks. There are many ways to do this, depending on your needs.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

with open(f"text_data/{file_name}.txt", "r") as f:
    pdf_text = f.read()
passages = text_splitter.split_text(pdf_text)
passages

['*“" SAMPLE STATEMENT ***\nFor informational purposes only INVESTMENT REPORT\n\n69 Fidelil f July 1 - July 31, 2015\n\nYour Portfolio Value: $274,222.20\n\nEnvelope # BABCEJBBPRTLA Change from Last Period: A $21,000.37',
 'John W. Doe\npe Main et 02201 This Period Year-to-Date\nston, Beginning Portfolio Value $253,221.83 $232,643.16\nAdditions 59,269.64 121,433.55\nSubtractions -45,430.74 -98,912.58\nTransaction Costs, Fees & Charges -139.77 ~625.87\nChange in Investment Value* 7,161.47 19,058.07\nEnding Portfolio Value** $274,222.20 $274,222.20\n\nAppreciation or depreciation of your holdings due to price changes plus any distribution and\nincome earned during the statement period.\n* Excludes unpriced securities.',
 'Contact Information\n\nOnline Fidelity.com\nFAST" Automated Telephone (800) 544-5555,\nPrivate Client Group (800) 544-5704\n\nWelcome to your new Fidelity statement.\n\nYour account numbers can be found on page 2 in the Accounts Included in this\nReport section. Your st
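To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window chunker. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.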

Now we need to turn our passages into vector embeddings that we can store. A vector representation puts each passage in a form that a machine can compare numerically, which is what makes retrieving relevant passages possible. Many embedding models are available, and the right choice depends on the type of documents we want to retrieve from and on our purpose. The same text will map to different vectors, and capture different aspects of meaning, depending on the model you use.
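As a rough illustration of how stored vectors are used at query time, the sketch below ranks passages by cosine similarity, which is one common choice of similarity measure. The three-dimensional vectors and passage labels are made up for the example; real embeddings have hundreds of dimensions, and the vector store used later may use a different metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, passage_vecs, k=2):
    """Return the names of the k passages most similar to the query."""
    ranked = sorted(
        passage_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy, hand-written "embeddings" (hypothetical values).
toy_vectors = {
    "holdings table": [0.9, 0.1, 0.0],
    "contact information": [0.0, 0.2, 0.9],
    "portfolio summary": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]
print(top_k(query_vector, toy_vectors, k=2))
# ['holdings table', 'portfolio summary']
```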

Nomic provides a lightweight embedding model that can run locally.

In [10]:
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_nomic.embeddings import NomicEmbeddings

# Define open source embeddings
embedding = NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local")

vectorstore = SKLearnVectorStore.from_texts(
    texts=passages,
    embedding=embedding,
)
# Pass k through search_kwargs so the retriever actually returns 8 passages
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})

We now have a vector store of our document. We set k=8 for the retriever, but the right value depends on our chunk size and on the context length of our LLM. We can also add documents that augment the store if needed. For example, a description of the document may be important if a user asks about its use case.
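A quick back-of-the-envelope check can show how k, chunk size, and context length interact. The sketch below assumes roughly 4 characters per token, which is only a rule of thumb for English text and may be off for noisy OCR output; the helper names and the 2,000-token budget are hypothetical.

```python
def estimated_context_tokens(k, chunk_size, chars_per_token=4.0):
    """Rough token count for k retrieved chunks of chunk_size characters."""
    return int(k * chunk_size / chars_per_token)


def max_k_for_budget(token_budget, chunk_size, chars_per_token=4.0):
    """Largest k whose retrieved chunks fit in the given token budget."""
    tokens_per_chunk = chunk_size / chars_per_token
    return int(token_budget // tokens_per_chunk)


# With the settings used above: 8 chunks of up to 512 characters each.
print(estimated_context_tokens(k=8, chunk_size=512))        # 1024
# How many 512-character chunks fit in a 2,000-token budget (assumed)?
print(max_k_for_budget(token_budget=2000, chunk_size=512))  # 15
```

The budget should also leave room for the prompt template, the question, and the model's answer, so the usable k is lower than this upper bound.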

The next step is to set up a RAG workflow.

In [11]:
retriever.invoke("Get me a csv of all asset quantities.")

[Document(metadata={'id': '331d5551-b816-477b-9ef0-b1879ea35459'}, page_content='represent accruals for only those securities with listed Al in the Holdings section of this\nstatement. Please refer to the Help/Glossary section of Fidelity. com for additional information.\nSee Cost Basis Information and Endnotes for important information about the adjusted cost\nbasis information provided.'),
 Document(metadata={'id': '5b6d7ff0-c94d-4ec4-baa8-93db324b515e'}, page_content='IMPORTANT: If you have any unsettled trades pending, the asset allocation presented above\nmay be materially impacted and, depending on the size and scope of such unsettled trades,\nrendered unreliable. Asset allocation includes Other Holdings and Assets Held Away when\napplicable. Please note that, due to rounding, percentages may not add to 100%. For further\ndetails, please see "Frequently Asked Questions" at Fidelity.com/Statements.\n\n3 of 28\n(Fidelity\n\nAccount Value:\n\nChange in Account Value'),
 Document(met

In [12]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")

In [None]:
from langchain_community.chat_models import ChatOllama
llama_model = ChatOllama(model="llama3", temperature=0)

In [1]:
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
prompt = hub.pull("rlm/rag-prompt")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llama_model
    | StrOutputParser()
)




In [14]:
rag_chain.invoke("Describe this document to me.")

'The document is an investment report for the period of July 1 to July 31, 2015, provided for informational purposes. It includes details on minimum required distributions (MRD) and encourages readers to consult additional sections for calculation details and cost basis information. The report emphasizes the importance of adhering to IRS requirements and suggests contacting Fidelity for further assistance.'
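For readers new to the pipe syntax, the chain above can be read as four plain steps: retrieve, format, fill the prompt, call the model. The sketch below spells those steps out with stand-ins so it runs without LangChain or Ollama. `Doc`, `fake_retriever`, `build_prompt`, and `fake_llm` are all hypothetical, and the template is not the actual text of `rlm/rag-prompt`.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document with a page_content field."""
    page_content: str


def fake_retriever(question):
    # Stand-in: a real retriever would rank stored passages against the question.
    return [
        Doc("Your Portfolio Value: $274,222.20"),
        Doc("Ending Portfolio Value** $274,222.20 $274,222.20"),
    ]


def format_docs(docs):
    # Same helper as in the notebook: join passages with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(context, question):
    # Hypothetical template, for illustration only.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def fake_llm(prompt):
    # Stand-in for the chat model call.
    return f"(model answer based on {len(prompt)} prompt characters)"


def rag_answer(question, retriever=fake_retriever, llm=fake_llm):
    docs = retriever(question)                                  # retriever
    context = format_docs(docs)                                 # | format_docs
    prompt = build_prompt(context=context, question=question)   # | prompt
    return llm(prompt)                                          # | model | parser


print(rag_answer("What is the ending portfolio value?"))
```

Swapping `fake_retriever` for `retriever.invoke` and `fake_llm` for a call to the chat model would give roughly the same flow as `rag_chain`, under the assumptions stated above.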