## **Data Extraction from PDF**

Extracting structured data from documents has long been a hard problem. Advances in AI have made even complex cases tractable, and with tools like Orq the process is now efficient and practical. This cookbook shows how to use Orq to process PDF invoices, from uploading the files to extracting structured fields.

To get started, you'll need to [sign up](https://orq.ai/create-account) for an Orq account if you haven't already.

Additionally, we've prepared this [Google Colab](https://colab.research.google.com/drive/1QR1H2PTQhSB5ST29s-tHKCqfUnxU0R-9#scrollTo=FDQYGou5b66Y) file that you can copy and run right away, allowing you to quickly experiment with document processing after replacing your API key.

**Step 1: Setting Up the Environment**  
The first step is ensuring the environment is ready. Installing the Orq SDK is quick and straightforward.

```python
!pip install orq-ai-sdk

# Imports (google.colab is only available inside a Colab runtime)
import pandas as pd
from google.colab import auth
```


**Step 2: Connecting to Orq**

Interacting with Orq’s platform starts with client initialization.

```python
import os

from orq_ai_sdk import Orq

# Read the API key from the environment; replace the placeholder if it is not set
client = Orq(
    api_key=os.environ.get("ORQ_API_KEY", "insert_API_key_here"),
)
```
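Falling back to a placeholder string means a missing key only surfaces later as a rejected request. As a minimal sketch, a small helper can fail early instead. `resolve_api_key` is a hypothetical name introduced here for illustration; it is not part of the Orq SDK.

```python
import os


def resolve_api_key(env=None, var_name="ORQ_API_KEY", placeholder="insert_API_key_here"):
    """Return the API key from the environment, or raise if it is missing.

    Hypothetical helper for this cookbook, not an Orq SDK function.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    # Treat an unset variable and the unreplaced placeholder the same way
    if not key or key == placeholder:
        raise RuntimeError(
            f"Set the {var_name} environment variable before creating the client."
        )
    return key


# Usage, assuming the `Orq` import from the cell above:
# client = Orq(api_key=resolve_api_key())
```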

**Step 3: Uploading PDF Files for Processing**

Here, PDF invoices are uploaded to Orq, making them ready for extraction and analysis. In this case, we have a few PDF files stored in a Google Drive folder that will be used for demonstration. You can easily replace these with your own files to suit your use case.

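```python
import os
import requests

# API details
url = "https://my.orq.ai/v2/files"
# Reuse the same key as the client instead of hardcoding it in the header
api_key = os.environ.get("ORQ_API_KEY", "insert_API_key_here")
headers = {
    "Authorization": f"Bearer {api_key}"
}

# Folder path (in Colab, mount Google Drive first with
# `from google.colab import drive; drive.mount('/content/drive')`)
folder_path = '/content/drive/MyDrive/invoice_test'

# Get the first three PDF files, sorted by name so the selection is repeatable
pdf_files = sorted(
    file for file in os.listdir(folder_path) if file.lower().endswith('.pdf')
)[:3]
```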

```python
# List to store JSON responses
responses_json = []

# Process each PDF file
for file_name in pdf_files:
    file_path = os.path.join(folder_path, file_name)

    try:
        # Prepare the multipart form data: a plain `purpose` field plus the file itself
        with open(file_path, 'rb') as file:
            files = {
                'purpose': (None, 'retrieval'),
                'file': (file_name, file)
            }

            # Send the POST request (the timeout keeps a stalled upload from hanging the cell)
            response = requests.post(url, headers=headers, files=files, timeout=60)

            # Store the JSON response after upload
            if response.status_code == 200:
                responses_json.append(response.json())
                print(f"Uploaded {file_name}, response stored.")
            else:
                print(f"Failed to upload {file_name}: {response.status_code} - {response.text}")

    except Exception as e:
        print(f"Error processing {file_name}: {e}")

# JSON responses stored in `responses_json` can now be parsed for file_ids
```

In the recorded run below, every upload failed: two requests hit an SSL connection error and one was rejected with `403 Forbidden: Invalid token`, which usually means the API key placeholder was not replaced with a valid key.

```plaintext
Error processing invoice1.pdf: HTTPSConnectionPool(host='my.orq.ai', port=443): Max retries exceeded with url: /v2/files (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Failed to upload invoice2.pdf: 403 - {"message":"Forbidden: Invalid token."}
Error processing invoice3.pdf: HTTPSConnectionPool(host='my.orq.ai', port=443): Max retries exceeded with url: /v2/files (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
```
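Connection-level failures such as the SSL errors in the output above are often transient, so retrying with a growing delay may help. A 403 will not be fixed by retrying and needs a valid key instead. The following is a minimal sketch of a generic retry helper; `with_retries` and its parameters are hypothetical names introduced here, not part of `requests` or the Orq SDK.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call `func()` up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Usage inside the upload loop, assuming `url`, `headers`, and `file_path` from above.
# The file is reopened on every attempt so each retry sends it from the start:
#
# def upload_once():
#     with open(file_path, 'rb') as file:
#         files = {'purpose': (None, 'retrieval'), 'file': (file_name, file)}
#         return requests.post(url, headers=headers, files=files, timeout=60)
#
# response = with_retries(upload_once, retry_on=(requests.exceptions.ConnectionError,))
```

Restricting `retry_on` to `requests.exceptions.ConnectionError` (which `SSLError` subclasses) keeps programming errors from being retried silently.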


**Extracting File IDs**  

Once the files are uploaded, their unique identifiers (file_ids) are extracted from the responses. These IDs are required for processing in the next step.

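```python
# Collect the `_id` of every successful upload response
file_ids = [response.get('_id') for response in responses_json if response.get('_id')]
print("Extracted file_ids:", file_ids)
```

Output (empty here because no upload succeeded in the recorded run):

```plaintext
Extracted file_ids: []
```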
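To see what the extraction looks like when uploads do succeed, here is a small sketch on made-up data. The sample responses are hypothetical: only the `_id` field is taken from the notebook's own code, and real responses may carry different fields. `extract_file_ids` is a helper name introduced for illustration.

```python
def extract_file_ids(responses, id_field="_id"):
    """Return the non-empty IDs found under `id_field` in each response dict."""
    return [r[id_field] for r in responses if isinstance(r, dict) and r.get(id_field)]


# Hypothetical upload responses; the field names other than "_id" are assumptions
sample_responses = [
    {"_id": "file_001", "file_name": "invoice1.pdf"},
    {"message": "Forbidden: Invalid token."},  # failed upload: no "_id"
    {"_id": "file_002", "file_name": "invoice2.pdf"},
]

file_ids_demo = extract_file_ids(sample_responses)
print(file_ids_demo)  # ['file_001', 'file_002']

# Stop early rather than invoking the deployment with nothing to process
if not file_ids_demo:
    raise ValueError("No file IDs extracted; check the upload step before continuing.")
```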


**Step 4: Deploying for Data Extraction**

To get consistent, structured output from the extraction, the GPT-4o model behind the deployment can be configured to follow a predefined JSON schema. The schema tells the model exactly which fields to return and in what format, which reduces ambiguity and keeps the output compatible with downstream systems.

Below are the prompt and an example schema for extracting key fields from receipts and invoices, including the transaction date, vendor name, and payment details. With `"strict": true`, every field listed under `required` must be present with its declared data type, and `"additionalProperties": false` rules out extra keys. The output can therefore be loaded directly into applications or databases for analysis, reporting, or automation.

This is the prompt in Orq.ai:
```plaintext
Analyze the provided images of receipts and invoices. Extract the following relevant information:

Date: The date of the transaction.
Vendor Name: The name of the company or individual from whom the goods or services were purchased.
Amount: The total amount spent, including any applicable taxes.
Category: An appropriate category for the expense (e.g., Travel, Food, Office Supplies).
Payment Method: The method of payment used (e.g., Credit Card, Cash, Bank Transfer).
Invoice Number: If available, the unique identifier for the invoice.
Map each extracted piece of information to the appropriate columns in a CSV file with the following headers: Date, Vendor Name, Amount, Category, Payment Method, Invoice Number. Provide the results in a structured format suitable for CSV output.

This is the receipt:
```

This is the JSON Schema that helps generate the structured output:
```plaintext
{
  "name": "dataextraction_receipts",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "Date": {
        "type": "string",
        "description": "The date of the transaction in YYYY-MM-DD format."
      },
      "VendorName": {
        "type": "string",
        "description": "The name of the company or individual from whom the goods or services were purchased."
      },
      "Amount": {
        "type": "number",
        "description": "The total amount spent, including any applicable taxes."
      },
      "Category": {
        "type": "string",
        "description": "An appropriate category for the expense (e.g., Travel, Food, Office Supplies)."
      },
      "PaymentMethod": {
        "type": "string",
        "description": "The method of payment used (e.g., Credit Card, Cash, Bank Transfer)."
      },
      "InvoiceNumber": {
        "type": "string",
        "description": "The unique identifier for the invoice, if available."
      }
    },
    "additionalProperties": false,
    "required": [
      "Date",
      "VendorName",
      "Amount",
      "Category",
      "PaymentMethod",
      "InvoiceNumber"
    ]
  }
}
```

With file_ids in hand, the next step is invoking a pre-configured deployment (here `DataExtraction_Receipts`) to extract structured data from the invoices. Each call passes a single file ID and returns the extracted fields as the content of the first message choice.

```python
# Iterate through each file_id and invoke the deployment
for file_id in file_ids:
    try:
        generation = client.deployments.invoke(
            key="DataExtraction_Receipts",
            context={
                "environments": []
            },
            file_ids=[file_id],  # Use a single file_id in the list
            metadata={
                "custom-field-name": "custom-metadata-value"
            }
        )

        # Print the content for each invocation
        print(f"Response for file_id {file_id}: {generation.choices[0].message.content}")

    except Exception as e:
        print(f"Error invoking deployment for file_id {file_id}: {e}")
```
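Since the deployment is expected to return JSON matching the schema above, the message content can be parsed, checked, and written out as CSV. The following is a minimal sketch using only the standard library. The field names mirror the schema, but `parse_receipt`, `to_csv`, and the sample content are illustrative assumptions rather than Orq functionality or real output.

```python
import csv
import io
import json

# Field names and expected Python types, mirroring the JSON schema above
FIELDS = {
    "Date": str,
    "VendorName": str,
    "Amount": (int, float),
    "Category": str,
    "PaymentMethod": str,
    "InvoiceNumber": str,
}


def parse_receipt(content):
    """Parse one model response (a JSON string) and check it against FIELDS."""
    record = json.loads(content)
    for name, expected in FIELDS.items():
        if name not in record:
            raise ValueError(f"Missing field: {name}")
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"Unexpected type for {name}: {type(value).__name__}")
    # Keep only the schema fields, in schema order
    return {name: record[name] for name in FIELDS}


def to_csv(records):
    """Render parsed records as CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELDS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up example of what `generation.choices[0].message.content` might contain
sample_content = json.dumps({
    "Date": "2024-01-15",
    "VendorName": "Example Supplies",
    "Amount": 42.5,
    "Category": "Office Supplies",
    "PaymentMethod": "Credit Card",
    "InvoiceNumber": "INV-001",
})

print(to_csv([parse_receipt(sample_content)]))
```

In the invocation loop, each `generation.choices[0].message.content` could be passed to `parse_receipt` and the results collected in a list before calling `to_csv`. The same list of dicts can also be handed to `pd.DataFrame` if a table is preferred.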

**What’s Next?**  
Orq’s tools provide robust capabilities for extracting structured data from unstructured PDF documents. With this workflow, you can:

- Scale Data Processing: Adapt the workflow to handle larger batches of PDF files or seamlessly integrate it into your existing systems.
- Refine Extraction Outputs: Leverage Orq’s deployment configurations to fine-tune the extraction process for specific document formats, layouts, or fields.
- Automate End-to-End Workflows: Combine this process with automated pipelines to optimize tasks such as invoice management, financial reporting, or compliance monitoring.

By transforming unstructured PDF data into actionable insights, Orq empowers businesses to streamline operations, improve decision-making, and unlock new efficiencies with ease.