## Purchase order extraction with HeatWave GenAI

This notebook shows how to digitize legacy purchase orders by extracting the contents of a document and answering questions about it. The extracted contents can also be stored in the database (not shown here). We use MySQL [HeatWave GenAI](https://www.oracle.com/heatwave/genai/) to extract structured information from an unstructured document, specifically a purchase order (PO) in PDF format. The process loads the PDF from an object store into a HeatWave table using the [VECTOR_STORE_LOAD](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-vector-store-load.html) routine, which automatically parses the document and prepares it for querying. The [ML_RAG](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-rag.html) procedure is then used to ask natural-language questions about the PO's content, such as the order date, item costs, and supplier details, which demonstrates the document understanding and retrieval capabilities of HeatWave GenAI.

For documents with complex objects such as tables (like this PO), we use a [vision LLM](https://docs.oracle.com/en-us/iaas/Content/generative-ai/meta-llama-4-scout.htm) accessed through HeatWave's integration with the OCI Generative AI service.

**This requires mysql-connector-python>=9.5.0**

### Connect to the HeatWave instance
First, we need to establish a connection to an active MySQL HeatWave instance on OCI.

**Action Required**: Create an SSH tunnel to your HeatWave instance by running the command below in your terminal, substituting the placeholder values with your OCI credentials.

ssh -o ServerAliveInterval=60 -i BASTION_PKEY -L LOCAL_PORT:DBSYSTEM_IP:DBSYSTEM_PORT BASTION_USER@BASTION_IP 
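
Before running the cells below, it can help to confirm that the tunnel is actually listening on the local port. The following is a minimal sketch using only the Python standard library; the function name `tunnel_is_open` and its default values are illustrative assumptions, not part of HeatWave or the MySQL connector. It only checks that a TCP connection can be opened, not that the server behind it is MySQL.

```python
import socket


def tunnel_is_open(host: str = "127.0.0.1", port: int = 3306, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: `port` should match the LOCAL_PORT used in the ssh command.
    """
    try:
        # create_connection performs the TCP handshake; we close the socket right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For example, `tunnel_is_open(port=LOCAL_PORT)` should return `True` once the ssh command above is running, and `False` if the tunnel has dropped.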

Modify the Python variables below to match the credentials for your HeatWave instance. We also define a helper function, execute_sql, that runs a SQL query through a cursor and returns the result as a Pandas DataFrame.

In [1]:
import mysql.connector
import pandas as pd
from mysql.connector.errors import OperationalError, InterfaceError
import time
import json

DBSYSTEM_SCHEMA = "ml_benchmark"
LOCAL_PORT = 3306

mydb = None  # global handle we keep fresh


def _get_conn():
    """Return a live MySQL connection, recreating it if needed."""
    global mydb
    if mydb is None or not mydb.is_connected():
        try:
            if mydb:
                mydb.close()
        except Exception:
            pass
        mydb = mysql.connector.connect(
            host="127.0.0.1",
            port=LOCAL_PORT,
            user="root",
            password="",
            database=DBSYSTEM_SCHEMA,
            allow_local_infile=True,
            use_pure=True,
            autocommit=True,
        )
    return mydb


# Helper function to execute SQL queries and return the results as a Pandas DataFrame
def execute_sql(sql: str, _retry=True) -> pd.DataFrame:
    """
    Execute SQL and return a DataFrame. Empty DF for DDL/DML without result sets.
    Ensures connection is alive; retries once on OperationalError or InterfaceError.
    """
    conn = _get_conn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            if cur.description is None:
                return pd.DataFrame()
            rows = cur.fetchall()
            cols = [d[0] for d in cur.description]
            return pd.DataFrame(rows, columns=cols)
    except (OperationalError, InterfaceError) as e:
        if _retry:
            time.sleep(0.5)
            try:
                conn.close()
            except Exception:
                pass
            global mydb
            mydb = None
            return execute_sql(sql, _retry=False)
        raise

### Load a purchase order PDF from the object store
We take a sample purchase order PDF (for example, https://docs.oracle.com/en/cloud/saas/readiness/scm/25c/ssproc25c/images/F38300_2.png) and upload it to an object store bucket. We then use VECTOR_STORE_LOAD to automatically parse the purchase order and load it from object storage into a table (vector_store_data_1).
 

In [2]:
# Drop the vector_store_data_1 table if it already exists
execute_sql("DROP TABLE IF EXISTS vector_store_data_1;")
# Start the load; the returned row includes a task_status_query used below to track progress
df = execute_sql(
    f"""
    CALL sys.VECTOR_STORE_LOAD(
        "oci://baumeister-datalake-sharded@lrsrfayerklw/unstructured_data/vision/Sample_PO.pdf",
        JSON_OBJECT(
            "schema_name", "{DBSYSTEM_SCHEMA}",
            "table_name", "vector_store_data_1",
            "document_parser_model", "meta.llama-4-scout-17b-16e-instruct"
        )
    )"""
)

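Monitor the progress of VECTOR_STORE_LOAD. The call above returns a task_status_query, which we run every few seconds until the task is completed, fails, or is cancelled.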

In [3]:
while True:
    df_status = execute_sql(df["task_status_query"][0])
    if not df_status.empty:
        task = json.loads(df_status.iloc[0, 0])  # parse the status JSON once
        print(f"Status:{task['status']}, Progress:{task['progress']}%")
        if task["status"] in ("COMPLETED", "ERROR", "CANCELLED"):
            break
    else:
        print("Not started yet")
    time.sleep(5)

Not started yet
Status:RUNNING, Progress:0%
Status:RUNNING, Progress:0%
Status:RUNNING, Progress:10%
Status:RUNNING, Progress:10%
Status:RUNNING, Progress:10%
Status:RUNNING, Progress:40%
Status:RUNNING, Progress:40%
Status:RUNNING, Progress:40%
Status:COMPLETED, Progress:100%
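
The polling loop above runs until the task reaches a terminal state, so a stalled load would block the notebook indefinitely. One possible variant, sketched below, adds a deadline. The names `wait_for_task`, `fetch_status`, and `TERMINAL_STATES`, and the default interval and timeout, are assumptions made for illustration; the status strings and the `status`/`progress` keys are the ones used in the cell above.

```python
import json
import time
from typing import Callable, Optional

# Terminal states taken from the polling cell above.
TERMINAL_STATES = {"COMPLETED", "ERROR", "CANCELLED"}


def wait_for_task(
    fetch_status: Callable[[], Optional[str]],
    interval: float = 5.0,
    timeout: float = 1800.0,
) -> dict:
    """Poll `fetch_status` until the task is terminal or `timeout` seconds elapse.

    `fetch_status` (hypothetical callback) returns the raw status JSON string,
    or None when no status row exists yet.
    """
    deadline = time.monotonic() + timeout
    while True:
        raw = fetch_status()
        if raw is not None:
            info = json.loads(raw)  # parse once, reuse for both fields
            print(f"Status:{info['status']}, Progress:{info.get('progress')}%")
            if info["status"] in TERMINAL_STATES:
                return info
        else:
            print("Not started yet")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task did not finish within {timeout} seconds")
        time.sleep(interval)


# Possible wiring to this notebook's helpers (not executed here):
# def fetch_status():
#     df_status = execute_sql(df["task_status_query"][0])
#     return None if df_status.empty else df_status.iloc[0, 0]
# final = wait_for_task(fetch_status)
```

Passing the status lookup in as a callback keeps the loop independent of the database connection, so it can be exercised with canned responses.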


Invoke the ML_RAG procedure with your query about the purchase order. Let's find out the order date.

In [4]:
execute_sql(f"""CALL sys.ML_RAG("What is the order date?", @output, NULL);""")
json.loads(execute_sql(sql=f"""SELECT @output;""").iat[0, 0])["text"]

'The order date is April 22, 2025.'

In [5]:
execute_sql(f"""CALL sys.ML_RAG("what was the printer cost?", @output, NULL);""")
json.loads(execute_sql(f"""SELECT @output;""").iat[0, 0])["text"]

'The price of the printer is $250.00.'

In [6]:
execute_sql(f"""CALL sys.ML_RAG("Who was the supplier?", @output, NULL);""")
json.loads(execute_sql(f"""SELECT @output;""").iat[0, 0])["text"]

'The supplier is CV_SuppA01, which has an address of Gruvfaltsgatan 45, 768 90 Kiruna, SWEDEN.'
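
The three cells above repeat the same call-then-read pattern, and they embed the question directly in a double-quoted SQL string, which would break on a question containing a quote. A small wrapper can handle both. This is a sketch under stated assumptions: `sql_quote`, `ask`, and `extract_fields` are hypothetical helpers; the escaping assumes MySQL's default SQL mode (backslash escapes enabled); and the optional `vector_store` entry in the ML_RAG options is assumed to accept a fully qualified `schema.table` name, which is worth checking against the ML_RAG documentation for your HeatWave version. `run_sql` is expected to behave like this notebook's execute_sql.

```python
import json
from typing import Callable, Dict, Optional


def sql_quote(value: str) -> str:
    """Quote a string as a MySQL single-quoted literal (default SQL mode assumed)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def ask(question: str, run_sql: Callable[[str], object], vector_store: Optional[str] = None) -> str:
    """Run ML_RAG for one question and return the generated answer text."""
    options = "NULL"
    if vector_store is not None:
        # Assumption: restricts retrieval to the named vector store table.
        options = f"JSON_OBJECT('vector_store', JSON_ARRAY({sql_quote(vector_store)}))"
    run_sql(f"CALL sys.ML_RAG({sql_quote(question)}, @output, {options});")
    result = run_sql("SELECT @output;")
    # As in the cells above, the answer is under the "text" key of the output JSON.
    return json.loads(result.iat[0, 0])["text"]


def extract_fields(
    questions: Dict[str, str],
    run_sql: Callable[[str], object],
    vector_store: Optional[str] = None,
) -> Dict[str, str]:
    """Ask one question per field and collect the answers in a dict."""
    return {field: ask(q, run_sql, vector_store) for field, q in questions.items()}
```

With the notebook's helper, `extract_fields({"order_date": "What is the order date?", "supplier": "Who was the supplier?"}, execute_sql)` would return one free-text answer per field. The answers are sentences rather than typed values, so further parsing would be needed before storing them in a table.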

We invite you to try [HeatWave AutoML and GenAI](https://www.oracle.com/heatwave/free/). If you're new to Oracle Cloud Infrastructure, try Oracle Cloud Free Trial, a free 30-day trial with US$300 in credits.