Checking important dependencies

In [1]:
import torch
print('CUDA Available:', torch.cuda.is_available())

CUDA Available: True


Defining paths

In [2]:
file_name = "sample-new-fidelity-acnt-stmt"
pdf_data_path = "./pdf_data/" + file_name + ".pdf"
text_data_path = "./text_data/" + file_name + ".txt"

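First, we convert the PDF to images and run each page through an OCR model. I will use Microsoft TrOCR (microsoft/trocr-large-printed), which is fine-tuned on the SROIE receipt dataset. Note that TrOCR is trained on single text-line images, so full-page inputs may be recognized poorly unless they are first cropped into lines.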

In [37]:
from pdf2image import convert_from_path
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the TrOCR processor (image preprocessing + tokenizer) and the model
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')
ocr = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed')

def extract_text_from_pdf(pdf_path):
    # Render every PDF page to a PIL image
    images = convert_from_path(pdf_path)
    text = ""
    for image in images:
        # TrOCR expects RGB input; note that the whole page is passed as one image
        pil_image = image.convert('RGB')
        pixel_values = processor(images=pil_image, return_tensors="pt").pixel_values
        generated_ids = ocr.generate(pixel_values)
        # Decode the generated token ids back into a string and append it
        text += processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text

Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-large-printed and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [38]:
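import os

os.makedirs("text_data", exist_ok=True)
# Run OCR only when no cached text file exists. Extract before opening the
# file so a failed OCR run does not leave an empty file behind.
if not os.path.exists(text_data_path):
    pdf_text = extract_text_from_pdf(pdf_data_path)
    with open(text_data_path, "w") as f:
        f.write(pdf_text)
else:
    # Reuse the cached OCR output so pdf_text is never left empty
    with open(text_data_path, "r") as f:
        pdf_text = f.read()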

Now that we have our text file, we want to rewrite the query into something better, based on the information in the document. To do this, we first need to set up a retrieval strategy so we can pull relevant information from the document we just created.

Let's split our document into chunks. There are many ways to do this, depending on our needs.
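To make the idea concrete, here is a minimal sketch of the simplest strategy: fixed-size character windows that overlap. The helper name chunk_text and the toy input are made up for illustration. The RecursiveCharacterTextSplitter used in the next cell is smarter than this, since it also tries to break at separators such as paragraphs, newlines, and spaces.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at a boundary still appears intact in one of the two chunks.
    (Hypothetical helper, not part of LangChain.)
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for c in chunks:
    print(repr(c))
```

A larger overlap reduces the chance of splitting a fact across two chunks, at the cost of storing more duplicated text.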

In [20]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

with open(f"text_data/{file_name}.txt", "r") as f:
    pdf_text = f.read()
passages = text_splitter.split_text(pdf_text)
passages


['TEL:TEL :TOTALTOTALTOTALTOTALTOTAL :TOTAL :TOTAL :ITEMTOTAL :TOTAL:TOTAL:******TOTAL***TELTOTALTOTAL :TOTAL :TOTAL :***TOTALITEMITEM***']

Now we need to turn our passages into vector embeddings that we can store. Embeddings help with retrieval by representing each passage as a vector, a form that a machine can compare numerically. Many embedding models are available; the right choice depends on the type of documents we want to retrieve from and on our purpose. The same text will receive different vectors under different models, because each model captures semantic meaning differently.
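As a rough illustration of why vectors help, the sketch below ranks passages by cosine similarity to a query vector. The three-dimensional vectors and passage labels are invented by hand for this example; a real embedding model produces vectors with hundreds of dimensions, and the vector store may use a different similarity measure internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, passage_vecs, k=2):
    """Return the names of the k passages most similar to the query."""
    scored = sorted(
        passage_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hand-made toy "embeddings" (assumption: purely illustrative values)
toy_vectors = {
    "holdings table": [0.9, 0.1, 0.0],
    "account summary": [0.6, 0.4, 0.1],
    "legal disclaimer": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
result = top_k(query, toy_vectors, k=2)
print(result)
```

Passages whose vectors point in a similar direction to the query are returned first, which is the behavior we rely on when retrieving context for the LLM.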

Nomic provides a lightweight embedding model that can run locally.

In [12]:
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_nomic.embeddings import NomicEmbeddings

# Define open source embeddings
embedding = NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local")

vectorstore = SKLearnVectorStore.from_texts(
    texts=passages,
    embedding=embedding,
)
# Pass k through search_kwargs so the retriever returns 8 passages
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})

We now have a vector store of our document. We set k=8 for the retriever, but the right value depends on our chunking and on the context length of our LLM. We can also add documents that augment the store if needed. For example, a description may be important if a user queries about a use case.
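A quick back-of-the-envelope check can show how many chunks fit into a prompt. The sketch below is only an estimate: the 4-characters-per-token ratio is a common rule of thumb, not a property of any particular tokenizer, and the 8192-token context window and 1024 reserved tokens are illustrative assumptions. Noisy OCR text may tokenize less efficiently than clean prose.

```python
def max_passages(context_tokens: int, reserved_tokens: int,
                 chunk_size_chars: int, chars_per_token: float = 4.0) -> int:
    """Estimate how many retrieved chunks fit in the prompt.

    reserved_tokens covers the instructions, the query, and the answer.
    (Hypothetical helper; all default numbers are assumptions.)
    """
    tokens_per_chunk = chunk_size_chars / chars_per_token
    available = context_tokens - reserved_tokens
    return max(0, int(available // tokens_per_chunk))


# Assumed: 8192-token context, 1024 tokens reserved, 512-character chunks
k_budget = max_passages(context_tokens=8192, reserved_tokens=1024,
                        chunk_size_chars=512)
print(k_budget)
```

Under these assumptions, k=8 sits well inside the budget, so the limit on k here is more about retrieval quality than about context length.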

The next step is to set up a RAG workflow.

In [15]:
retriever.invoke("Get me a csv of all asset quantities.")

[Document(metadata={'id': 'de869e23-3ae9-4cee-8830-447eae8b9c80'}, page_content='ee a 58 $s\n** SAMPLE STATEMENT **\n—— For informational purposes only veer went ReeORT\n@ Fidelity So ecyoe an\n** SAMPLE STATEMENT **\n—— For informational purposes only nvesrient eePoRT\n@ Fidelity So ecyoe an\nHoldings (ontines ‘aocount 111411111\nSh ne eae ayetiy cme 8 aie ior mere ence\nErgon ramen tmesnmymmcrmmmeapiy, roms lpm tmtonme ence a\nenMtemmmtreimeninmcmmence —" amremas nnn sonare wmraeetaree\nBema oemociasecmramatey — Rusmaniltracgee eae Grcerennnetnte'),
 Document(metadata={'id': 'c64ffb9f-aedf-4718-8ed3-05f27d4ed627'}, page_content='—— For informational purposes only nvesrient eePoRT\n@ Fidelity So ecyoe an\nHoldings (continued) pecount WTA 11111\nsets\nseston ‘ust een untae Costas ‘Smitece _neame yes Yen\nSe CT\nTod FrefsredSiock GM cfaccouthaares) Oar\nsesaiptinn surty __uantty__rsrtnn_Aecwlvinmasiyay Cost ae ‘Goes _neoneeay tse\n** SAMPLE STATEMENT **\n—— For informational purposes on