# Invoice Parser

# Initialize the OpenAI client and secure my API key

I import the OpenAI Python SDK and create a reusable `client` that reads my credentials from the environment, so I can send requests to models, work with files, and stream results across cells. At first I set `OPENAI_API_KEY` directly in the notebook. I later moved it into an external script that sets the environment variable before launch, which keeps the key out of the code and out of version control without changing how the client is used.

In [1]:
# Import the OpenAI Python library to connect to the API
from openai import OpenAI

# Create a reusable client that reads API key from the environment
client = OpenAI()
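
Since the client depends on `OPENAI_API_KEY` being present in the environment, an explicit check in the first cell can fail early with a message specific to this setup. A minimal sketch, assuming a hypothetical helper named `require_api_key` (it is not part of the OpenAI SDK):

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical notebook helper, not part of the OpenAI SDK.
    """
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before launching the notebook."
        )
    return key
```

Calling `require_api_key()` before `client = OpenAI()` would surface a missing variable before any request is made.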

# Upload a file to OpenAI for use

I upload the PDF file to OpenAI with its purpose set to `user_data`, which stores the file on the platform so I can reference it by ID in later requests. The document's contents can then be used for tasks such as analysis, extraction, or question answering.

In [None]:
# Upload the PDF file to OpenAI and mark it as user data
# This stores the file on the platform so it can be referenced in later requests
# The `with` block closes the local file handle once the upload finishes
with open("invoice.pdf", "rb") as pdf:
    file = client.files.create(file=pdf, purpose="user_data")
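
The upload step can be wrapped so that a wrong path or file type is caught before any request is sent. A minimal sketch, assuming a hypothetical helper named `upload_pdf`; the only SDK call it relies on is `client.files.create(file=..., purpose=...)`, as used in the cell above:

```python
from pathlib import Path


def upload_pdf(client, path, purpose="user_data"):
    """Upload a local PDF and return the file object from the API.

    Hypothetical helper: `client` is expected to be an `OpenAI()` instance.
    """
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such file: {pdf_path}")
    if pdf_path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {pdf_path.name}")
    # The context manager closes the handle even if the upload raises.
    with pdf_path.open("rb") as handle:
        return client.files.create(file=handle, purpose=purpose)
```

With this in place, `file = upload_pdf(client, "invoice.pdf")` would play the same role as the cell above.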

# Extract and structure invoice data with a response request

I send the uploaded PDF file along with an instruction to the model, asking it to first extract all readable text from the invoice and then transform that text into a structured JSON object. The prompt specifies the exact fields to include, such as invoice details, company information, billing information, line items, and totals. I initially tried the `gpt-5-nano` model for this task, but it was not as accurate in handling the extraction and structuring, so I switched to `gpt-5-mini`, which provided more reliable results.

In [3]:
# Send a request to the `gpt-5-mini` model asking it to process the uploaded PDF
# The input includes both the file reference and instructions as text
# The model is told to first extract all readable text from the invoice in order
# Then, it must structure the extracted information into a JSON object with specific fields
# The response is returned as JSON without extra explanation so it can be parsed directly
response = client.responses.create(
    model="gpt-5-mini",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "file_id": file.id,
                },
                {
                    "type": "input_text",
                    "text": (
                        "You are given a PDF invoice file. First, extract all the readable text "
                        "from the PDF in proper order. Then, convert the extracted text into a "
                        "structured JSON object with the following fields:\n\n"
                        "- invoice_number\n"
                        "- invoice_date\n"
                        "- po_number\n"
                        "- company_address\n"
                        "- company_phone\n"
                        "- company_fax\n"
                        "- company_email\n"
                        "- bill_to_name\n"
                        "- bill_to_company\n"
                        "- bill_to_address\n"
                        "- bill_to_phone\n"
                        "- amount_qualify_additional_discount (Items over this amount qualify for an additional discount)\n"
                        "- percentage_discount (% discount)\n"
                        "- line_items (list of {description, quantity, unit_price, amount, discount})\n"
                        "- subtotal\n"
                        "- credit\n"
                        "- tax_percent\n"
                        "- additional_discount_percent\n"
                        "- balance_due\n\n"
                        "Return only valid JSON with no explanation."
                    ),
                },
            ],
        }
    ],
)
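
The field list in the prompt above is written out by hand inside the string. One way to keep it in a single place, so the same list could later drive validation of the parsed result, is to build the instruction from a Python list. This is a sketch only; `INVOICE_FIELDS` and `build_prompt` are hypothetical names, and the wording is copied from the prompt above:

```python
# Field names and parenthetical hints copied from the prompt above.
INVOICE_FIELDS = [
    ("invoice_number", None),
    ("invoice_date", None),
    ("po_number", None),
    ("company_address", None),
    ("company_phone", None),
    ("company_fax", None),
    ("company_email", None),
    ("bill_to_name", None),
    ("bill_to_company", None),
    ("bill_to_address", None),
    ("bill_to_phone", None),
    ("amount_qualify_additional_discount",
     "Items over this amount qualify for an additional discount"),
    ("percentage_discount", "% discount"),
    ("line_items",
     "list of {description, quantity, unit_price, amount, discount}"),
    ("subtotal", None),
    ("credit", None),
    ("tax_percent", None),
    ("additional_discount_percent", None),
    ("balance_due", None),
]


def build_prompt(fields=INVOICE_FIELDS):
    """Assemble the extraction instruction from a list of (name, hint) pairs."""
    bullets = "\n".join(
        f"- {name} ({hint})" if hint else f"- {name}"
        for name, hint in fields
    )
    return (
        "You are given a PDF invoice file. First, extract all the readable text "
        "from the PDF in proper order. Then, convert the extracted text into a "
        "structured JSON object with the following fields:\n\n"
        f"{bullets}\n\n"
        "Return only valid JSON with no explanation."
    )
```

The result of `build_prompt()` could be passed as the `"text"` value of the `input_text` item in place of the literal string.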

# Parse the model output into JSON

I take the text response from the model and load it with Python’s `json` library, which parses the JSON string into a Python dictionary. This lets me work with the structured data directly in Python, making it easy to access specific fields or use the values in further analysis.
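
Parsing the raw text directly works when the model returns bare JSON, as the prompt requests. A more defensive variant is sketched here under the assumption that the text might occasionally arrive wrapped in a Markdown code fence; this notebook does not show that happening, and `parse_model_json` is a hypothetical helper:

```python
import json

# Built with multiplication so the fence marker never appears literally.
FENCE = "`" * 3


def parse_model_json(text):
    """Parse JSON from model output, tolerating a Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag)
        # and the closing fence line if present.
        lines = cleaned.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    return data
```

In the cell that follows, `parse_model_json(response.output_text)` could stand in for the direct `json.loads` call.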

In [None]:
# Import the built-in JSON library to work with JSON data
import json

# Parse the model’s text output (a JSON string) into a Python dictionary
json_object = json.loads(response.output_text)

# Print the JSON object in a nicely formatted way with indentation for readability
print(json.dumps(json_object, indent=2))

{
  "invoice_number": "1111",
  "invoice_date": "6/28/2024",
  "po_number": "123456",
  "company_address": "321 Avenue A, Portland, OR 12345",
  "company_phone": "(206) 555-1163",
  "company_fax": "(206) 555-1164",
  "company_email": "someone@websitegoeshere.com",
  "bill_to_name": "Natasha Jones",
  "bill_to_company": "Central Beauty",
  "bill_to_address": "123 Main St., Manhattan, NY 98765",
  "bill_to_phone": "(321) 555-1234",
  "amount_qualify_additional_discount": 100.0,
  "percentage_discount": 10.0,
  "line_items": [
    {
      "description": "Item Number 1",
      "quantity": 1,
      "unit_price": 2.0,
      "amount": 2.0,
      "discount": null
    },
    {
      "description": "Item Number 2",
      "quantity": 1,
      "unit_price": 2.0,
      "amount": 2.0,
      "discount": null
    },
    {
      "description": "Item Number 3",
      "quantity": 1,
      "unit_price": 2.0,
      "amount": 2.0,
      "discount": null
    }
  ],
  "subtotal": 6.0,
  "credit": 1000.0,
  "t

# Analysis of the structured invoice JSON

The JSON output successfully captures the key details of the invoice in a structured way. The header information is complete, with invoice number, date, and purchase order included, as well as company contact information and billing details. The line items list is structured properly, though the example uses very simple entries with the same unit price and amount, which may suggest placeholder or sample data.

The subtotal matches the sum of the line items (three items at 2.0 each give 6.0), confirming that the extraction is internally consistent. The credit stands out: 1000.0 against charges of only a few dollars yields a balance due of -994.2, meaning the customer holds more credit than the total charges.
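
The subtotal check described above can be done in code rather than by eye. A minimal sketch, using the line items and subtotal copied from the printed output; `check_subtotal` is a hypothetical helper, and it covers only the subtotal, since the remaining totals are cut off in the output shown:

```python
import math

# Values copied from the printed JSON output above.
line_items = [
    {"description": "Item Number 1", "quantity": 1, "unit_price": 2.0, "amount": 2.0},
    {"description": "Item Number 2", "quantity": 1, "unit_price": 2.0, "amount": 2.0},
    {"description": "Item Number 3", "quantity": 1, "unit_price": 2.0, "amount": 2.0},
]
subtotal = 6.0


def check_subtotal(items, subtotal, tol=0.005):
    """Return a list of inconsistencies; an empty list means all checks pass."""
    problems = []
    for item in items:
        expected = item["quantity"] * item["unit_price"]
        # Each amount should equal quantity times unit price.
        if not math.isclose(item["amount"], expected, abs_tol=tol):
            problems.append(f"{item['description']}: amount != quantity * unit_price")
    total = sum(item["amount"] for item in items)
    # The subtotal should equal the sum of the line-item amounts.
    if not math.isclose(total, subtotal, abs_tol=tol):
        problems.append(f"subtotal {subtotal} != sum of amounts {total}")
    return problems


print(check_subtotal(line_items, subtotal))
```

The same function could be applied to the parsed result as `check_subtotal(json_object["line_items"], json_object["subtotal"])`.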

The inclusion of discount fields and tax percentage shows the model followed the prompt structure well, even though discounts on individual line items are `null` in the JSON (`None` once parsed in Python). Overall, the data looks coherent and ready for use in further financial analysis or integration into accounting workflows.