# 🧪 Invoice Processing Lab Exercise
## Technology Stack Demonstration: LangGraph + LangChain + OpenAI + Pinecone

### 🎯 Lab Objectives
This lab combines four tools into a single invoice-processing pipeline:
- **LangGraph**: Workflow orchestration and state management
- **LangChain**: Model wrappers, message formatting, and `Document` objects (PDF pages themselves are rendered with PyMuPDF)
- **OpenAI**: Vision-based extraction and embeddings
- **Pinecone**: Vector storage and retrieval

### 📋 Business Rules
**Invoice Validation Logic:**
- ✅ **Valid**: All required fields present (vendor name, invoice number, date, amount)
- ❌ **Invalid**: Missing any required field
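
The rule above can also be checked deterministically once the four fields have been extracted. The following is a minimal sketch, not the notebook's own implementation (the lab delegates this decision to the LLM in Cell 3). The helper name `check_required_fields` and the constant `REQUIRED_FIELDS` are assumptions introduced for illustration.

```python
from typing import Any, Dict, List

# The four required fields named in the business rules.
REQUIRED_FIELDS = ("vendor_name", "invoice_number", "date", "amount")


def check_required_fields(extracted: Dict[str, Any]) -> Dict[str, Any]:
    """Return 'Valid' only if every required field has a non-empty value."""
    missing: List[str] = []
    for field in REQUIRED_FIELDS:
        value = extracted.get(field)
        # Treat None and blank strings as missing.
        if value is None or not str(value).strip():
            missing.append(field)
    return {
        "validation_status": "Invalid" if missing else "Valid",
        "missing_fields": missing,
    }


# Field values as extracted for invoice_2.pdf later in this lab.
example = check_required_fields(
    {"vendor_name": None, "invoice_number": "12345", "date": None, "amount": None}
)
print(example)
```

A check like this could run after the LLM extraction step so that the final Valid/Invalid decision does not depend on the model's own judgment.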

### 📋 Expected Outcomes
1. Convert the first page of each PDF to markdown and display the content
2. Validate invoices based on business rules
3. Store markdown content in vector database
4. Execute test queries with structured results

## Cell 1: Environment Setup
**Purpose**: Install dependencies and configure environment
**Expected Output**: Confirmation of successful package installation

In [1]:
# Install required packages
!pip install langgraph langchain langchain-openai langchain-pinecone pinecone PyMuPDF pandas

print("✅ All packages installed successfully")


✅ All packages installed successfully


## Cell 2: Import Libraries and Configuration
**Purpose**: Import all required libraries and set up configuration
**Expected Output**: Successful imports and configuration confirmation

In [2]:
import os
import base64
import getpass
import json
from pathlib import Path
from typing import Dict, List, TypedDict
import pandas as pd
from datetime import datetime
from IPython.display import display, Markdown

# PDF Processing
import fitz  # PyMuPDF

# LangChain Components
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage

# LangGraph Components
from langgraph.graph import StateGraph, END

# Pinecone
from pinecone import Pinecone, ServerlessSpec

# Configuration
class Config:
    DATA_DIR = "data"
    INDEX_NAME = "invoice-validation-vectors"
    OPENAI_MODEL = "gpt-4o-mini"
    EMBEDDING_MODEL = "text-embedding-3-small"

# API Keys
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or getpass.getpass("OpenAI API Key: ")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY") or getpass.getpass("Pinecone API Key: ")

print("✅ Libraries imported and configuration set")




✅ Libraries imported and configuration set


## Cell 3: Define State and Core Functions
**Purpose**: Define LangGraph state structure and core processing functions
**Expected Output**: State class and function definitions ready for workflow

In [3]:
# Define State for LangGraph
class ProcessingState(TypedDict):
    pdf_files: List[str]
    markdown_content: Dict[str, str]
    validation_results: Dict[str, Dict]
    documents: List[Document]
    vector_ids: List[str]
    status: str
    error: str

# Initialize AI Components
llm = ChatOpenAI(api_key=OPENAI_API_KEY, model=Config.OPENAI_MODEL, temperature=0)
embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY, model=Config.EMBEDDING_MODEL)

def pdf_to_base64(pdf_path: str) -> str:
    """Convert PDF first page to base64 image for LLM processing"""
    doc = fitz.open(pdf_path)
    page = doc.load_page(0)
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
    img_data = pix.tobytes("png")
    doc.close()
    return base64.b64encode(img_data).decode()

def pdf_to_markdown(pdf_path: str) -> str:
    """Convert PDF to markdown using OpenAI Vision"""
    image_base64 = pdf_to_base64(pdf_path)
    
    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": "Convert this invoice document to clean markdown format. Preserve all important details including vendor name, invoice number, date, amount, and any other relevant information. Return only the markdown."
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"}
            }
        ]
    )
    
    response = llm.invoke([message])
    return response.content.strip()

def validate_invoice(markdown_content: str, filename: str) -> Dict:
    """Validate invoice based on business rules"""
    validation_prompt = f"""
    Analyze this invoice content and extract the following required fields:
    1. vendor_name: Company issuing the invoice
    2. invoice_number: Invoice ID/reference number
    3. date: Invoice date
    4. amount: Total amount
    
    Invoice Content:
    {markdown_content}
    
    Return JSON with extracted fields and validation status:
    {{
        "vendor_name": "extracted value or null",
        "invoice_number": "extracted value or null",
        "date": "extracted value or null",
        "amount": "extracted value or null",
        "validation_status": "Valid or Invalid",
        "missing_fields": ["list of missing required fields"],
        "rationale": "explanation of validation decision"
    }}
    """
    
    try:
        response = llm.invoke(validation_prompt)
        content = response.content.strip()
        
        # Clean JSON response
        if '```json' in content:
            content = content.split('```json')[1].split('```')[0].strip()
        elif '```' in content:
            content = content.split('```')[1].strip()
        
        result = json.loads(content)
        result['filename'] = filename
        return result
        
    except Exception as e:
        return {
            'filename': filename,
            'vendor_name': None,
            'invoice_number': None,
            'date': None,
            'amount': None,
            'validation_status': 'Error',
            'missing_fields': ['All fields'],
            'rationale': f'Validation error: {str(e)}'
        }

print("✅ State and core functions defined")

✅ State and core functions defined
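
The fence-stripping logic inside `validate_invoice` handles a fenced reply and a bare reply, but not a reply where prose surrounds the JSON object. The sketch below shows one way to make that step more tolerant. The helper name `extract_json` is an assumption introduced here, not part of the notebook, and the approach (regex for a fenced block, then the outermost braces) is only one of several reasonable options.

```python
import json
import re
from typing import Any, Dict

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str) -> Dict[str, Any]:
    """Parse a JSON object from an LLM reply, with or without a code fence."""
    match = FENCED_BLOCK.search(reply)
    candidate = match.group(1) if match else reply
    # Fall back to the outermost braces in case prose surrounds the object.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start:end + 1])


fenced_reply = FENCE + 'json\n{"vendor_name": "Wardiere Inc", "date": null}\n' + FENCE
print(extract_json(fenced_reply))
```

Raising `ValueError` keeps the failure visible to the existing `except Exception` branch in `validate_invoice`, which already converts errors into an `'Error'` status.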


## Cell 4: Initialize Pinecone Vector Store
**Purpose**: Set up Pinecone vector database for document storage
**Expected Output**: Pinecone index created/connected and vector store initialized

In [4]:
# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)

# Create index if it doesn't exist
existing_indexes = [idx["name"] for idx in pc.list_indexes()]
if Config.INDEX_NAME not in existing_indexes:
    print(f"Creating new Pinecone index: {Config.INDEX_NAME}")
    pc.create_index(
        name=Config.INDEX_NAME,
        dimension=1536,  # text-embedding-3-small dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
    import time
    time.sleep(10)  # Wait for index to be ready
    print("✅ Index created successfully")
else:
    print(f"✅ Using existing index: {Config.INDEX_NAME}")

# Initialize vector store
index = pc.Index(Config.INDEX_NAME)
vector_store = PineconeVectorStore(index=index, embedding=embeddings)

print("✅ Pinecone vector store initialized")

Creating new Pinecone index: invoice-validation-vectors
✅ Index created successfully
✅ Pinecone vector store initialized
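
The fixed `time.sleep(10)` in Cell 4 assumes the index is ready after ten seconds, which may be too short or needlessly long. A polling loop with a timeout is a common alternative. The sketch below uses a generic helper, `wait_until`, which is an assumption introduced here and not part of the notebook. The Pinecone readiness check in the trailing comment follows the pattern shown in Pinecone's client documentation, but treat it as unverified against your installed client version.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> None:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)


# Assumed usage with the objects from Cell 4 (not executed here):
# wait_until(lambda: pc.describe_index(Config.INDEX_NAME).status["ready"])
```

Passing the readiness check as a callable keeps the helper independent of Pinecone, so it can be exercised without network access.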


## Cell 5: Define LangGraph Workflow Nodes
**Purpose**: Create workflow nodes for document processing pipeline
**Expected Output**: LangGraph nodes defined for orchestrated processing

In [5]:
def load_pdf_files(state: ProcessingState) -> ProcessingState:
    """Node: Load PDF files from data directory"""
    data_dir = Path(Config.DATA_DIR)
    pdf_files = [str(f) for f in data_dir.glob("*.pdf")]
    
    state["pdf_files"] = pdf_files
    state["markdown_content"] = {}
    state["validation_results"] = {}
    state["status"] = f"Found {len(pdf_files)} PDF files"
    print(f"📁 {state['status']}")
    return state

def convert_to_markdown(state: ProcessingState) -> ProcessingState:
    """Node: Convert PDFs to markdown and display content"""
    print("📄 Converting PDFs to Markdown...\n")
    
    for pdf_path in state["pdf_files"]:
        filename = Path(pdf_path).name
        print(f"Processing: {filename}")
        
        try:
            markdown_content = pdf_to_markdown(pdf_path)
            state["markdown_content"][filename] = markdown_content
            
            # Display markdown content
            print(f"\n📋 Content for {filename}:")
            print("=" * 60)
            display(Markdown(markdown_content))
            print("\n" + "=" * 60 + "\n")
            
        except Exception as e:
            print(f"❌ Error processing {filename}: {e}")
            state["error"] = str(e)
    
    state["status"] = f"Converted {len(state['markdown_content'])} files to markdown"
    return state

def validate_invoices(state: ProcessingState) -> ProcessingState:
    """Node: Validate invoices based on business rules"""
    print("✅ Validating invoices based on business rules...\n")
    
    for filename, content in state["markdown_content"].items():
        validation_result = validate_invoice(content, filename)
        state["validation_results"][filename] = validation_result
        
        # Use .get so a reply that omits a key does not abort the whole workflow
        status = validation_result.get('validation_status', 'Unknown')
        print(f"{filename}: {status}")
        missing = validation_result.get('missing_fields') or []
        if missing:
            print(f"  Missing: {', '.join(missing)}")
    
    state["status"] = f"Validated {len(state['validation_results'])} invoices"
    return state

def store_in_vector_db(state: ProcessingState) -> ProcessingState:
    """Node: Store markdown documents in Pinecone vector database"""
    documents = []
    
    for filename, content in state["markdown_content"].items():
        validation = state["validation_results"].get(filename, {})
        
        doc = Document(
            page_content=content,
            metadata={
                "pdf_filename": filename,
                "doc_type": "invoice_markdown",
                "validation_status": validation.get('validation_status', 'Unknown'),
                "vendor_name": validation.get('vendor_name'),
                "invoice_number": validation.get('invoice_number'),
                "date": validation.get('date'),
                "amount": validation.get('amount'),
                "processing_timestamp": datetime.now().isoformat()
            }
        )
        documents.append(doc)
    
    state["documents"] = documents
    
    if documents:
        try:
            vector_ids = vector_store.add_documents(documents)
            state["vector_ids"] = vector_ids
            state["status"] = f"Stored {len(vector_ids)} documents in vector database"
            print(f"🗄️ {state['status']}")
            
        except Exception as e:
            state["error"] = str(e)
            state["status"] = f"Error storing documents: {e}"
            print(f"❌ {state['status']}")
    
    return state

print("✅ LangGraph workflow nodes defined")

✅ LangGraph workflow nodes defined


## Cell 6: Build and Execute LangGraph Workflow
**Purpose**: Orchestrate the complete document processing pipeline
**Expected Output**: Processed documents with validation results and vector storage

In [6]:
# Build LangGraph workflow
workflow = StateGraph(ProcessingState)

# Add nodes
workflow.add_node("load_files", load_pdf_files)
workflow.add_node("convert_markdown", convert_to_markdown)
workflow.add_node("validate_invoices", validate_invoices)
workflow.add_node("store_vectors", store_in_vector_db)

# Define workflow edges
workflow.set_entry_point("load_files")
workflow.add_edge("load_files", "convert_markdown")
workflow.add_edge("convert_markdown", "validate_invoices")
workflow.add_edge("validate_invoices", "store_vectors")
workflow.add_edge("store_vectors", END)

# Compile workflow
app = workflow.compile()

# Execute workflow
print("🚀 Starting LangGraph workflow execution...\n")

initial_state = ProcessingState(
    pdf_files=[],
    markdown_content={},
    validation_results={},
    documents=[],
    vector_ids=[],
    status="",
    error=""
)

# Run the workflow
final_state = app.invoke(initial_state)

print(f"\n✅ Workflow completed: {final_state['status']}")
if final_state.get('error'):
    print(f"⚠️ Errors encountered: {final_state['error']}")

🚀 Starting LangGraph workflow execution...

📁 Found 6 PDF files
📄 Converting PDFs to Markdown...

Processing: invoice_3.pdf

📋 Content for invoice_3.pdf:


```markdown
# Invoice

**Invoice to:**  
**BAILEY DUPONT**  
Studio Shadowe  
123 Anywhere St.,  
Any City, ST 12345  
Studio.Shadowe@mail.com  

**From:**  
**AVERY DAVIS**  
Business Consultant  
123 Anywhere St.,  
Any City, ST 12345  
hello@reallygreatsite.com  
123-456-7890  
[www.reallygreatsite.com](http://www.reallygreatsite.com)

---

## Invoice Details

| Description                                      | HRS/QTY | Rate | Subtotal  |
|--------------------------------------------------|---------|------|-----------|
| Legal Consultation                               | 12      | 285  | $3,420    |
| Financial and Tax Consultation                   | 40      | 450  | $18,000   |
| Management Consultation                           | 65      | 350  | $22,750   |

---

**Subtotal:**  $44,170  
**Tax (10%):** $4,417  
**Total:**     $48,587  

---

## Payment Method

**Bank Transfer:**  
Thynk Unlimited Bank  
**Account Number:** 123-456-7890  

---

## Terms and Conditions

Payment Terms Are Usually Stated on the Invoice. These May Specify That the Buyer Has a Maximum Number of Days in Which To Pay and Is Sometimes Offered a Discount if Paid Before the Due Date.
```



Processing: invoice_2.pdf

📋 Content for invoice_2.pdf:


```markdown
# INVOICE

**Date:**  
**No. Invoice:** 12345  

**Bill to:**  
123 Anywhere St., Any City, ST 12345  

| Date | Item Description | Price | Qty | Total |
|------|------------------|-------|-----|-------|
|      |                  |       |     |       |
|      |                  |       |     |       |
|      |                  |       |     |       |
|      |                  |       |     |       |
|      |                  |       |     |       |

**Total:**  

---

**Thank you!**

**Contact Information:**  
123 Anywhere St., Any City, ST 12345  
+123-456-7890  
hello@reallygreatsite.com  

**Payment Method:**  
Bank Name: Borcelle Bank  
Account Number: 0123 4567 89  
```



Processing: invoice_1.pdf

📋 Content for invoice_1.pdf:


```markdown
# Invoice

**Vendor:**  
MORGAN MAXWELL  
design & branding  

---

**Issued To:**  
Jonathan Patterson  
Liceria & Co.  
123 Anywhere St., Any City  

**Invoice No:** 01234  
**Date:** 11.02.2030  
**Due Date:** 11.03.2030  

---

| Description              | Unit Price | Qty | Total  |
|--------------------------|------------|-----|--------|
| brand consultation        | $100       | 1   | $100   |
| logo design              | $100       | 1   | $100   |
| website design           | $100       | 1   | $100   |
| social media templates    | $100       | 1   | $100   |
| brand manual             | $100       | 1   | $100   |

---

**Subtotal:** $500  
**Tax (10%):** $50  
**Total:** $550  

---

**Bank Details:**  
Borcele Bank  
Account Name: Avery Davis  
Account No.: 0123 4567 8901  

---

**Thank You!**
```



Processing: invoice_5.pdf

📋 Content for invoice_5.pdf:


```markdown
# Invoice

**Vendor Name:** Studio Shodwe  
**Invoice Number:** 12345  
**Date:** 25 June 2022  

**Invoice To:**  
Ketut Susilo  
123-456-7890  
hello@reallygreatsite.com  
123 Anywhere St., Any City  

| NO | DESCRIPTION          | QTY | PRICE | TOTAL  |
|----|----------------------|-----|-------|--------|
| 1  | Logo Design          | 5   | $100  | $500   |
| 2  | Website Design       | 2   | $800  | $1,600 |
| 3  | Brand Design         | 3   | $300  | $900   |
| 4  | Banner Design        | 2   | $300  | $600   |
| 5  | Flyer Design         | 2   | $400  | $800   |
| 6  | Social Media Template | 10  | $50   | $500   |
| 7  | Name Card            | 15  | $25   | $750   |
| 8  | Web Developer        | 2   | $1,000| $2,000 |

**Sub Total:** $7,650  
**Tax (15%):** $1,148  
**Grand Total:** $8,798  

**Payment Method:**  
Bank Name: Borcelle  
Account Number: 123-456-7890  

---

Thank you for business with us!

**Terms and Conditions:**  
Please send payment within 30 days of receiving this invoice. There will be a 10% interest charge per month on late invoices.

---

**Administrator:**  
Henrietta Mitchell
```



Processing: invoice_4.pdf

📋 Content for invoice_4.pdf:


```markdown
# Invoice

**Vendor Name:** Warner & Spencer  
**Phone:** +123-456-7890  
**Email:** hello@reallygreatsite.com  
**Address:** 123 Anywhere St., Any City  

---

**Bill To:**  
Jamie Chastain  
123 Anywhere St., Any City  
ST 12345  

**Invoice Number:** ST 12345  
**Date:** [Insert Date]  
**Amount Due:** [Insert Amount]  

---

**Greetings!**

A letter is a message written for a variety of purposes, from friendly to formal. They can help maintain bonds between friends, especially if they’re far apart. Letters are also used by professionals to communicate their concerns. In some schools, kids are encouraged to write letters to Santa for Christmas. There are also letters given by school administrators to the students’ parents or guardians.

If you’re thinking of writing a letter yourself, make your intentions clear from the start. You can be fun and creative or straightforward, depending on your needs. Most letters are divided into sections, including the date, recipient’s name, and salutations. As for the main content of your letter, there are often three main parts: the introduction, paragraph, and conclusion.

Your letter’s introduction can be a brief greeting, a few polite statements, or a background of why you’re writing. The paragraph is the bulk of your letter, containing the most important parts of your message. Finally, the conclusion sums up all your ideas. It can also include a closing statement or salutation. No matter what reason you have behind writing, it’s best to be organized and plan the contents of your letter before sending it out.

---

**Sincerely,**  
Neil Tran  
```



Processing: invoice_6.pdf

📋 Content for invoice_6.pdf:


```markdown
# INVOICE

**Invoice Number:** 1009-01  
**Due Date:** 1 April 2022  

## Invoice To
**Name:** Bailey Dupont  
**Address:** Studio Shadowe, 123 Anywhere St., Any City, ST 12345  

## Company
**Name:** Wardiere Inc  
**Address:** 123 Anywhere St., Any City, ST 12345  

## Description of Services
| Description                     | Price   |
|---------------------------------|---------|
| Digital Consulting Services      | $1000   |
| Application Management Services   | $2470   |
| Cloud Business Services          | $3000   |
| Business Analyst                 | $1700   |

## Summary
| Item        | Amount  |
|-------------|---------|
| Subtotal    | $8170   |
| Tax (10%)   | $817    |
| **TOTAL**   | **$8987** |

## Payment Method
**Bank Name:** Thynk Unlimited Bank  
**Bank Account:** 123-456-7890  

*Payment Terms Are Usually Stated on the Invoice. These May Specify That the Buyer Has a Maximum Number of Days in Which To Pay and Is Sometimes Offered a Discount if Paid Before the Due Date. The Buyer Could Have Already Paid for the Products or Services Listed on the Invoice.*

---

**Signature of Authorized Person:** ______________________  
**Date:** ______________________  
```



✅ Validating invoices based on business rules...

invoice_3.pdf: Invalid
  Missing: invoice_number, date
invoice_2.pdf: Invalid
  Missing: vendor_name, date, amount
invoice_1.pdf: Valid
invoice_5.pdf: Valid
invoice_4.pdf: Invalid
  Missing: date, amount
invoice_6.pdf: Valid
❌ Error storing documents: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 29 Sep 2025 02:37:32 GMT', 'Content-Type': 'application/json', 'Content-Length': '132', 'Connection': 'keep-alive', 'x-pinecone-request-latency-ms': '1310', 'x-pinecone-request-id': '3920942694973497685', 'x-envoy-upstream-service-time': '24', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"Metadata value must be a string, number, boolean or list of strings, got 'null' for field 'date'","details":[]}


✅ Workflow completed: Error storing documents: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 29 Sep 2025 02:37:32 GMT', 'Content-Type': 'application/json', 'Content-
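
The 400 response above is Pinecone rejecting `null` metadata values: invoices with missing fields carry `None` for keys such as `date` and `invoice_number`, and the error body states that metadata values must be a string, number, boolean, or list of strings. One possible fix is to drop `None` entries before building each `Document` in `store_in_vector_db`. The helper name `drop_null_metadata` is an assumption introduced for this sketch and is not part of the notebook.

```python
from typing import Any, Dict


def drop_null_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of metadata without keys whose value is None."""
    return {key: value for key, value in metadata.items() if value is not None}


# Values taken from the invoice_3.pdf validation result in this lab.
metadata = {
    "pdf_filename": "invoice_3.pdf",
    "validation_status": "Invalid",
    "vendor_name": "AVERY DAVIS",
    "invoice_number": None,
    "date": None,
    "amount": "$48,587",
}
cleaned = drop_null_metadata(metadata)
print(cleaned)
```

Dropping the key, rather than substituting a placeholder such as an empty string, keeps later metadata filters from matching invoices on a value that was never present. Whether a placeholder is preferable depends on how the test queries filter results.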

## Cell 7: Display Validation Results
**Purpose**: Show invoice validation results in structured format
**Expected Output**: DataFrame showing validation status for all processed invoices

In [7]:
# Display validation results
if final_state.get('validation_results'):
    validation_data = []
    
    for filename, result in final_state['validation_results'].items():
        validation_data.append({
            'Filename': filename,
            'Status': result['validation_status'],
            'Vendor Name': result.get('vendor_name', 'Missing'),
            'Invoice Number': result.get('invoice_number', 'Missing'),
            'Date': result.get('date', 'Missing'),
            'Amount': result.get('amount', 'Missing'),
            'Missing Fields': ', '.join(result.get('missing_fields', [])),
            'Rationale': result.get('rationale', '')[:50] + '...'
        })
    
    validation_df = pd.DataFrame(validation_data)
    
    print("📊 INVOICE VALIDATION RESULTS")
    print("=" * 100)
    print(validation_df.to_string(index=False))
    
    # Summary statistics
    valid_count = len([r for r in final_state['validation_results'].values() if r['validation_status'] == 'Valid'])
    invalid_count = len([r for r in final_state['validation_results'].values() if r['validation_status'] == 'Invalid'])
    
    print(f"\n📈 VALIDATION SUMMARY")
    print(f"Total invoices processed: {len(final_state['validation_results'])}")
    print(f"Valid invoices: {valid_count}")
    print(f"Invalid invoices: {invalid_count}")
    print(f"Validation success rate: {valid_count/len(final_state['validation_results'])*100:.1f}%")
    
else:
    print("❌ No validation results available")

📊 INVOICE VALIDATION RESULTS
     Filename  Status      Vendor Name Invoice Number         Date  Amount            Missing Fields                                             Rationale
invoice_3.pdf Invalid      AVERY DAVIS           None         None $48,587      invoice_number, date The invoice does not contain an invoice number or ...
invoice_2.pdf Invalid             None          12345         None    None vendor_name, date, amount The vendor name, invoice date, and total amount ar...
invoice_1.pdf   Valid   MORGAN MAXWELL          01234   11.02.2030    $550                           All required fields (vendor_name, invoice_number, ...
invoice_5.pdf   Valid    Studio Shodwe          12345 25 June 2022  $8,798                           All required fields are present and correctly extr...
invoice_4.pdf Invalid Warner & Spencer       ST 12345         None    None              date, amount The invoice content is missing the actual date and...
invoice_6.pdf   Valid     Wardiere Inc   