# Building an AI-Powered Invoice Extraction Tool
*Created by Thu Vu for the Python for AI Projects course.*

*Date: 20 January, 2025*

👋 Welcome to another hands-on project!

🎯 In this project, we’ll explore how to use a **Retrieval-Augmented Generation (RAG) framework** to extract structured information from business invoices. We’ll walk through the **RAG pipeline** step by step to process and extract key invoice details such as the invoice date, client name, and total amount.

🛠️ Tools we'll be using:

- **LangChain**: A framework for developing applications powered by language models. Read more about [LangChain](https://python.langchain.com/docs/get_started/introduction) here.

- **Local language model (Llama3.2 via [Ollama](https://ollama.com/)):** Using local models is more secure for private data, such as the invoices in this case.

- **Streamlit**: Finally, we’ll bring everything together in a user-friendly, interactive **Streamlit application** that lets users upload multiple invoices and view the extracted data in a structured table.

Let's get started! 👊

### Overview of a Vanilla Retrieval Augmented Generation (RAG) Pipeline

If you followed the last video lesson on RAG, this framework should look familiar to you.

<img src="RAG_deepdive.png" width=800 />

## Install and import necessary modules

If you haven't installed the required packages yet, run the following command in a notebook cell:

`!pip3 install --upgrade --quiet langchain langchain-community langchain_ollama chromadb pandas streamlit pypdf`

In [1]:
!pip3 install --upgrade --quiet langchain langchain-community langchain_ollama chromadb pandas streamlit pypdf

In [2]:
# Import Langchain modules
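# Loaders and vector stores live in langchain_community; the text splitters
# live in langchain_text_splitters (installed as a dependency of langchain)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma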
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from langchain_ollama import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

# Other modules and packages
import streamlit as st  
import pandas as pd

import uuid
import re
import os
import tempfile

## Define our LLM

We'll be using Llama3.2, which is an open-source LLM developed by Meta. 

✅👉 Since we're accessing the model through Ollama, **please make sure Ollama is open on your computer** and running in the background when you run the cell below.

In [3]:
llm = ChatOllama(model='llama3.2', temperature=0)
llm.invoke("Tell me a joke about coffee")

AIMessage(content='Why did the coffee file a police report?\n\nBecause it got mugged!', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-01-27T15:11:08.656243Z', 'done': True, 'done_reason': 'stop', 'total_duration': 4569981791, 'load_duration': 836568000, 'prompt_eval_count': 31, 'prompt_eval_duration': 3426000000, 'eval_count': 16, 'eval_duration': 305000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-7a9d2788-ce80-4ed2-aa24-6a359eea15a0-0', usage_metadata={'input_tokens': 31, 'output_tokens': 16, 'total_tokens': 47})

## 1. Ingestion Pipeline

### Load PDF document

We'll use the `PyPDFLoader` class to load the text from the invoice documents.

Running it returns a list of `Document` objects, one per page of the PDF, each holding that page's metadata and content.

In [4]:
FILE_NAME = "invoice_ABC_consulting.pdf"

loader = PyPDFLoader("data/"+ FILE_NAME)
pages = loader.load()
pages

[Document(metadata={'source': 'data/invoice_ABC_consulting.pdf', 'page': 0, 'page_label': '1'}, page_content='Business Invoice\nBusiness Name: ABC Consulting Ltd.\nAddress: 123 Main Street, City, Country\nEmail: contact@abcconsulting.com\nPhone: +123 456 7890\nBill To:\nClient Name: XYZ Corporation\nAddress: 456 Market Street, City, Country\nEmail: finance@xyzcorp.com\nPhone: +987 654 3210\nInvoice Number: INV-2025001\nInvoice Date: January 24, 2025\nDue Date: February 7, 2025\nDescription Quantity Unit Price Total\nConsulting Services - January 10 $50.00 $500.00\nData Analysis Report 1 $300.00 $300.00\nSoftware License 2 $150.00 $300.00\nSubtotal: $1100.00\nTax (10%): $110.00\nTotal: $1210.00\nPayment Instructions:\nPage 1'),
 Document(metadata={'source': 'data/invoice_ABC_consulting.pdf', 'page': 1, 'page_label': '2'}, page_content='Business Invoice\nBank: ABC Bank\nAccount No: 123456789\nSWIFT: ABCD1234\nPayment Due by February 7, 2025\nPage 2')]

### Split document into smaller chunks (optional in this case)

Chunking is optional for small PDF files like the invoices in this case.
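
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of a fixed-size character chunker with overlap. It is a simplified illustration and not LangChain's actual algorithm: `RecursiveCharacterTextSplitter` also tries to split on the given separators first. The helper name `chunk_text` and the sample string are made up for this example.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of at most chunk_size that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. A text shorter than `chunk_size` comes back as a single chunk, which is why each short invoice page stays in one piece with `chunk_size=1500`.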

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500,
                                            chunk_overlap=200,
                                            length_function=len,
                                            separators=["\n\n", "\n", " "])
chunks = text_splitter.split_documents(pages)

In [6]:
chunks

[Document(metadata={'source': 'data/invoice_ABC_consulting.pdf', 'page': 0, 'page_label': '1'}, page_content='Business Invoice\nBusiness Name: ABC Consulting Ltd.\nAddress: 123 Main Street, City, Country\nEmail: contact@abcconsulting.com\nPhone: +123 456 7890\nBill To:\nClient Name: XYZ Corporation\nAddress: 456 Market Street, City, Country\nEmail: finance@xyzcorp.com\nPhone: +987 654 3210\nInvoice Number: INV-2025001\nInvoice Date: January 24, 2025\nDue Date: February 7, 2025\nDescription Quantity Unit Price Total\nConsulting Services - January 10 $50.00 $500.00\nData Analysis Report 1 $300.00 $300.00\nSoftware License 2 $150.00 $300.00\nSubtotal: $1100.00\nTax (10%): $110.00\nTotal: $1210.00\nPayment Instructions:\nPage 1'),
 Document(metadata={'source': 'data/invoice_ABC_consulting.pdf', 'page': 1, 'page_label': '2'}, page_content='Business Invoice\nBank: ABC Bank\nAccount No: 123456789\nSWIFT: ABCD1234\nPayment Due by February 7, 2025\nPage 2')]

### Create embeddings

In [7]:
def get_embedding_function():
    embeddings = OllamaEmbeddings(model="nomic-embed-text",show_progress=False)
    return embeddings

embedding_function = get_embedding_function()
test_vector = embedding_function.embed_query("cat")



### Create vector database

We'll be using ChromaDB for this project, an open-source vector database that is fast and simple to use. It's not the only option: many other vector databases do much the same thing.

In [9]:
# Create vectorstore
vectorstore = Chroma.from_documents(documents=chunks, 
                                    embedding=embedding_function, 
                                    persist_directory = "test_vectorstore",
                                    collection_name = FILE_NAME)
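
The file name is used directly as the collection name here. Chroma places restrictions on collection names (length and allowed characters; check the Chroma documentation for the exact rules in your version), so file names containing spaces or brackets may be rejected. Below is a hedged sketch of a clean-up helper. The function `clean_collection_name` is hypothetical, and the rules it applies (letters, digits, dots, underscores and hyphens only, alphanumeric at both ends, 3 to 63 characters) are an assumption rather than a guaranteed match for Chroma's validation.

```python
import re


def clean_collection_name(file_name: str) -> str:
    """Hypothetical helper: make a file name safer to use as a collection name."""
    # Replace every character outside the assumed allowed set with an underscore
    name = re.sub(r"[^A-Za-z0-9._-]", "_", file_name)
    # Collapse runs of dots into a single dot
    name = re.sub(r"\.{2,}", ".", name)
    # Trim to the assumed maximum length, then strip punctuation from both ends
    name = name[:63].strip("._-")
    # Pad very short names up to the assumed minimum length of 3
    return name if len(name) >= 3 else name.ljust(3, "0")


print(clean_collection_name("invoice ABC (final).pdf"))
# invoice_ABC__final_.pdf
```

A name that already satisfies these rules, such as `invoice_ABC_consulting.pdf`, passes through unchanged.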

In [10]:
def create_vectorstore(chunks, embedding_function, file_name, vectorstore_path):

    # Create a deterministic id for each chunk based on its content,
    # so identical chunks always map to the same id
    ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)) for doc in chunks]

    # Keep only the first chunk for each id. The ids are collected in a list
    # (not taken from the set) so that each id stays paired with its own chunk.
    seen_ids = set()
    unique_ids = []
    unique_chunks = []
    for chunk, chunk_id in zip(chunks, ids):
        if chunk_id not in seen_ids:
            seen_ids.add(chunk_id)
            unique_ids.append(chunk_id)
            unique_chunks.append(chunk)

    # Create a new Chroma database from the documents
    vectorstore = Chroma.from_documents(documents=unique_chunks,
                                        ids=unique_ids,
                                        embedding=embedding_function,
                                        persist_directory=vectorstore_path,
                                        collection_name=file_name)

    # Write the database to disk (recent Chroma versions persist automatically,
    # in which case this call only triggers a deprecation warning)
    vectorstore.persist()

    return vectorstore
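
The de-duplication in `create_vectorstore` relies on `uuid.uuid5` being deterministic: the same text always produces the same id, so a repeated chunk can be detected by its id. The sketch below shows this with made-up strings standing in for chunk contents, and keeps ids and texts in the same order so that each id stays paired with its own text.

```python
import uuid

# Made-up chunk contents; the third one duplicates the first
texts = ["Page 1 text", "Page 2 text", "Page 1 text"]

# uuid5 hashes the namespace and the text, so identical text gives an identical id
ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, text)) for text in texts]

# A dict preserves insertion order and keeps the first text seen for each id
unique = {}
for doc_id, text in zip(ids, texts):
    unique.setdefault(doc_id, text)

unique_ids = list(unique.keys())
unique_texts = list(unique.values())
print(len(unique_ids), unique_texts)
# 2 ['Page 1 text', 'Page 2 text']
```

Order matters here because a vector store pairs the first id with the first document, the second with the second, and so on.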

In [11]:
# Create vectorstore
vectorstore = create_vectorstore(chunks=chunks, 
                                 embedding_function=embedding_function, 
                                 file_name=FILE_NAME,
                                 vectorstore_path="vectorstore")



## 2. Retrieval Pipeline 
Here we are querying for relevant data chunks from the vector database based on their similarity to a user question.

In [12]:
# Create retriever
retriever = vectorstore.as_retriever(search_type="similarity")

In [13]:
# Get relevant chunks for a question
relevant_chunks = retriever.invoke("What is the client name in this invoice?")
relevant_chunks

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


[Document(metadata={'page': 0, 'page_label': '1', 'source': 'data/invoice_ABC_consulting.pdf'}, page_content='Business Invoice\nBusiness Name: ABC Consulting Ltd.\nAddress: 123 Main Street, City, Country\nEmail: contact@abcconsulting.com\nPhone: +123 456 7890\nBill To:\nClient Name: XYZ Corporation\nAddress: 456 Market Street, City, Country\nEmail: finance@xyzcorp.com\nPhone: +987 654 3210\nInvoice Number: INV-2025001\nInvoice Date: January 24, 2025\nDue Date: February 7, 2025\nDescription Quantity Unit Price Total\nConsulting Services - January 10 $50.00 $500.00\nData Analysis Report 1 $300.00 $300.00\nSoftware License 2 $150.00 $300.00\nSubtotal: $1100.00\nTax (10%): $110.00\nTotal: $1210.00\nPayment Instructions:\nPage 1'),
 Document(metadata={'page': 1, 'page_label': '2', 'source': 'data/invoice_ABC_consulting.pdf'}, page_content='Business Invoice\nBank: ABC Bank\nAccount No: 123456789\nSWIFT: ABCD1234\nPayment Due by February 7, 2025\nPage 2')]
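
The message about "requested results 4" appears because the retriever asks for the four most similar chunks by default, while this collection only holds two. Passing `search_kwargs={"k": 2}` to `as_retriever` should avoid it. To illustrate the idea behind similarity search, here is a minimal pure-Python sketch that ranks documents by cosine similarity and clamps `k` to the number of stored vectors. The three-dimensional vectors are made up, and Chroma's real index and distance metric work differently under the hood.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k stored vectors most similar to the query."""
    k = min(k, len(doc_vecs))  # cannot return more results than are stored
    scored = sorted(
        ((cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


# Made-up embeddings for two pages and one query
doc_vecs = {"page_1": [0.9, 0.1, 0.0], "page_2": [0.1, 0.8, 0.3]}
query = [1.0, 0.0, 0.0]
print(top_k(query, doc_vecs))
# ['page_1', 'page_2']
```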

In [14]:
# Prompt template
PROMPT_TEMPLATE = """
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer
the question. If you don't know the answer, say that you
don't know. DON'T MAKE UP ANYTHING.

{context}

---

Answer the question based on the above context: {question}
"""

prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

## 3. Generation Pipeline

Here, we will pass our full prompt (context + question) through the LLM to generate an answer.

### Manual process

In [15]:
# Concatenate context text
context_text = "\n\n".join(doc.page_content for doc in relevant_chunks)

# Create prompt
prompt = prompt_template.format(context=context_text, 
                                question="What is the client name in this invoice?")
print(prompt)

Human: 
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer
the question. If you don't know the answer, say that you
don't know. DON'T MAKE UP ANYTHING.

Business Invoice
Business Name: ABC Consulting Ltd.
Address: 123 Main Street, City, Country
Email: contact@abcconsulting.com
Phone: +123 456 7890
Bill To:
Client Name: XYZ Corporation
Address: 456 Market Street, City, Country
Email: finance@xyzcorp.com
Phone: +987 654 3210
Invoice Number: INV-2025001
Invoice Date: January 24, 2025
Due Date: February 7, 2025
Description Quantity Unit Price Total
Consulting Services - January 10 $50.00 $500.00
Data Analysis Report 1 $300.00 $300.00
Software License 2 $150.00 $300.00
Subtotal: $1100.00
Tax (10%): $110.00
Total: $1210.00
Payment Instructions:
Page 1

Business Invoice
Bank: ABC Bank
Account No: 123456789
SWIFT: ABCD1234
Payment Due by February 7, 2025
Page 2

---

Answer the question based on the above context: What is the client name 

In [16]:
llm.invoke(prompt)

AIMessage(content='The client name in this invoice is XYZ Corporation.', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-01-27T15:46:18.927194Z', 'done': True, 'done_reason': 'stop', 'total_duration': 5065195292, 'load_duration': 828002750, 'prompt_eval_count': 326, 'prompt_eval_duration': 4024000000, 'eval_count': 11, 'eval_duration': 211000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-0c72adde-e571-43ba-a123-703eb77ab85a-0', usage_metadata={'input_tokens': 326, 'output_tokens': 11, 'total_tokens': 337})

### Using LangChain Expression Language (LCEL)

In [17]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [18]:
rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt_template
            | llm
        )
rag_chain.invoke("Extract relevant details from this business invoice")

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


AIMessage(content='Here are the relevant details extracted from the business invoice:\n\n- Business Name: ABC Consulting Ltd.\n- Address: 123 Main Street, City, Country\n- Email: contact@abcconsulting.com\n- Phone: +123 456 7890\n- Bill To:\n  - Client Name: XYZ Corporation\n  - Address: 456 Market Street, City, Country\n  - Email: finance@xyzcorp.com\n  - Phone: +987 654 3210\n- Invoice Number: INV-2025001\n- Invoice Date: January 24, 2025\n- Due Date: February 7, 2025\n- Description and quantities:\n  - Consulting Services - January 10\n  - Data Analysis Report 1\n  - Software License 2\n- Subtotal: $1100.00\n- Tax (10%): $110.00\n- Total: $1210.00\n- Payment Instructions:\n  - Bank: ABC Bank\n  - Account No: 123456789\n  - SWIFT: ABCD1234\n  - Payment Due by February 7, 2025', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-01-27T15:50:55.716175Z', 'done': True, 'done_reason': 'stop', 'total_duration': 7999007208, 'load_duration': 22520375, 'prompt_
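
If the `|` syntax in the chain looks unusual, the toy sketch below shows the general idea of composing steps with the pipe operator. It is not LangChain's implementation: the `Step` class, the `parallel` helper, and the hard-coded "retriever" are all made up for illustration. The dictionary at the start of the real chain plays the role of `parallel` here: it runs each entry on the same input and collects the results by key.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its output into b
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**steps):
    """Run every step on the same input and return a dict of their outputs."""
    return Step(lambda value: {key: step.invoke(value) for key, step in steps.items()})


# Hard-coded stand-ins for the retriever, formatter, passthrough, and prompt
retrieve = Step(lambda question: ["Client Name: XYZ Corporation", "Bank: ABC Bank"])
join_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
fill_prompt = Step(lambda d: f"{d['context']}\n---\nQuestion: {d['question']}")

toy_chain = parallel(context=retrieve | join_docs, question=passthrough) | fill_prompt
result = toy_chain.invoke("What is the client name?")
print(result)
```

The question flows into both branches: one branch retrieves and joins the context, the other passes the question through unchanged, and the final step fills both into the prompt.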

### Generate structured responses

In [19]:
class ExtractedInfo(BaseModel):
    """Extracted information from the invoice document"""
    invoice_items: str =  Field(description="Extract invoice items")
    invoice_date: str =  Field(description="Extract invoice date")
    business_name: str =  Field(description="Extract business name")
    total_amount: str =  Field(description="Extract total amount")

In [20]:
rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt_template
            | llm.with_structured_output(ExtractedInfo)
        )

In [21]:
# Generate structured response
structured_response = rag_chain.invoke("Extract relevant details from this business invoice.")
structured_response

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


ExtractedInfo(invoice_items='Consulting Services - January 10, Data Analysis Report 1, Software License 2', invoice_date='January 24, 2025', business_name='ABC Consulting Ltd.', total_amount='$1210.00')
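
Every field in `ExtractedInfo` is a string, so values such as `'$1210.00'` and `'January 24, 2025'` need converting before you can sort or sum them. The following is a minimal sketch using only the standard library. The helper names are made up, and the formats (a leading `$` and a "Month day, year" date) are assumptions based on this sample invoice, so other invoices may need different parsing.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a string like '$1,210.00' into a Decimal."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


def parse_invoice_date(text: str) -> date:
    """Convert a string like 'January 24, 2025' into a date."""
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


total = parse_amount("$1210.00")
invoice_date = parse_invoice_date("January 24, 2025")

# Sanity check against the subtotal and tax printed on the sample invoice
subtotal = parse_amount("$1100.00")
tax = parse_amount("$110.00")
print(total, invoice_date.isoformat(), subtotal + tax == total)
# 1210.00 2025-01-24 True
```

`Decimal` is used instead of `float` to avoid rounding surprises with currency values.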

### Transform response into a dataframe

In [22]:
df = pd.DataFrame([{
    "Invoice Items": structured_response.invoice_items,
    "Invoice Date": structured_response.invoice_date,
    "Business Name": structured_response.business_name,
    "Total Amount": structured_response.total_amount,
}])
df

| | Invoice Items | Invoice Date | Business Name | Total Amount |
|---|---|---|---|---|
| 0 | Consulting Services - January 10, Data Analysi... | January 24, 2025 | ABC Consulting Ltd. | $1210.00 |


## Conclusions

Congratulations on successfully implementing your first RAG-powered invoice extraction system! 🎉

In this tutorial, you have:

✅ Built a Retrieval-Augmented Generation (RAG) pipeline to extract structured data from invoices.

✅ Used LangChain to process documents efficiently.

✅ Integrated a local LLM with Ollama for offline, cost-effective processing.

✅ Developed a Streamlit web application for an interactive, user-friendly experience.

💡 If you're excited to adapt this project to other use cases, here are some ideas for inspiration:

1. Automated Resume Screening – Extract key details from resumes and match them with job descriptions.
2. Contract Clause Extraction – Identify and summarize important clauses from legal contracts.
3. Expense Receipt Processing – Automate the extraction of transaction details from receipts for expense tracking.
4. Medical Report Summarization – Convert lengthy medical reports into structured, digestible summaries.
5. Customer Support Ticket Analysis – Categorize and extract key insights from support requests.


**⚙️ Deploy this project**

If you’d like to take this further and deploy this app, follow the deployment lessons in this course to launch your app using Streamlit Community Cloud or Docker for a scalable, production-ready solution.