# HeatWave GenAI

HeatWave GenAI is the industry's first automated in-database Generative AI service. By integrating large language models (LLMs) and embedding generation within the database, it lets you generate new content, speed up manual or repetitive tasks such as summarizing large documents, perform Retrieval Augmented Generation (RAG), and hold natural language interactions. Refer to https://www.oracle.com/heatwave/genai for further details on HeatWave GenAI.


This notebook demonstrates the application of [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) for summarization using data from the 2024 Olympic Games.

### References
- https://www.economicsobservatory.com/what-happened-at-the-2024-olympics
- https://en.wikipedia.org/wiki/2024_Summer_Olympics


### Prerequisites
Install the necessary packages:

- mysql-connector-python
- pandas 
- langchain_community
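
Before running the notebook, it can help to confirm that these packages are importable. The following is a minimal sketch using only the standard library; the mapping from package name to module name and the helper `missing_packages` are assumptions made for illustration, not part of the original notebook.

```python
import importlib.util

# Assumed mapping from each listed package to the module it provides.
REQUIRED = {
    "mysql-connector-python": "mysql.connector",
    "pandas": "pandas",
    "langchain_community": "langchain_community",
}

def missing_packages(required=REQUIRED):
    """Return the package names whose modules cannot be found."""
    missing = []
    for package_name, module_name in required.items():
        try:
            found = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # Raised when the parent package of a dotted name is absent
            found = False
        if not found:
            missing.append(package_name)
    return missing

print("Missing packages:", missing_packages() or "none")
```

Any name printed here would need to be installed (for example with `pip install`) before continuing.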

##### Import Python packages

In [15]:
# import Python packages
import time
import json
import numpy as np
import pandas as pd
import mysql.connector
from mysql.connector.errors import OperationalError, InterfaceError
from langchain_community.document_loaders import WebBaseLoader

### Connect to the HeatWave instance
We create a connection to an active [HeatWave](https://www.oracle.com/mysql/) instance using [MySQL Connector/Python](https://dev.mysql.com/doc/connector-python/en/). We also define a helper function that executes a SQL query through a cursor and returns the result as a Pandas DataFrame. Modify the variables below to point to your HeatWave instance. On AWS, set USE_BASTION to False. On OCI, create an SSH tunnel from your machine with the command below, substituting each variable with its value:

ssh -o ServerAliveInterval=60 -i BASTION_PKEY -L LOCAL_PORT:DBSYSTEM_IP:DBSYSTEM_PORT BASTION_USER@BASTION_IP
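
Once the tunnel is running, a quick TCP check can confirm that the local port is reachable before attempting a database connection. This is a minimal sketch under the assumption that the tunnel listens on 127.0.0.1 and the default `LOCAL_PORT` of 3306; the helper name `port_is_open` is hypothetical.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

# 127.0.0.1:3306 mirrors the notebook defaults when the tunnel is in use.
print("Tunnel reachable:", port_is_open("127.0.0.1", 3306))
```

A result of `False` usually means the tunnel is not running or is bound to a different local port. A result of `True` only shows that something is listening on that port, not that the credentials are valid.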

In [None]:
BASTION_IP = "ip_address"
BASTION_USER = "opc"
BASTION_PKEY = "private_key_file"
DBSYSTEM_IP = "ip_address"
DBSYSTEM_PORT = 3306
DBSYSTEM_USER = "username"
DBSYSTEM_PASSWORD = "password"
DBSYSTEM_SCHEMA = "mlcorpus"
LOCAL_PORT = 3306
USE_BASTION = True

if USE_BASTION:
    # Traffic goes through the local end of the SSH tunnel
    DBSYSTEM_IP = "127.0.0.1"
else:
    # Connect to the DB system directly on its own port
    LOCAL_PORT = DBSYSTEM_PORT

mydb = None  # global handle we keep fresh

def _get_conn():
    """Return a live MySQL connection, recreating it if needed."""
    global mydb
    if mydb is None or not mydb.is_connected():
        try:
            if mydb:
                mydb.close()
        except Exception:
            pass
        mydb = mysql.connector.connect(
            host=DBSYSTEM_IP,
            port=LOCAL_PORT,
            user=DBSYSTEM_USER,
            password=DBSYSTEM_PASSWORD,
            database=DBSYSTEM_SCHEMA,
            autocommit=True,
            connection_timeout=10,
        )
    return mydb

# Helper function to execute SQL queries and return the results as a Pandas DataFrame
def execute_sql(sql: str, _retry=True) -> pd.DataFrame:
    """
    Execute SQL and return a DataFrame. Empty DF for DDL/DML without result sets.
    Ensures connection is alive; retries once on OperationalError or InterfaceError.
    """
    global mydb
    conn = _get_conn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            # Statements without a result set (DDL, DML, SET) have no description
            if cur.description is None:
                return pd.DataFrame()
            rows = cur.fetchall()
            cols = [d[0] for d in cur.description]
            return pd.DataFrame(rows, columns=cols)
    except (OperationalError, InterfaceError):
        if _retry:
            time.sleep(0.5)
            try:
                conn.close()
            except Exception:
                pass
            # Drop the stale handle so _get_conn() reconnects on the retry
            mydb = None
            return execute_sql(sql, _retry=False)
        raise
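
The retry-once behavior of `execute_sql` can be seen in isolation without a database. The sketch below reproduces the same pattern (try, reset the connection on a connection-type error, try exactly once more) with a stand-in operation; `call_with_one_retry`, `flaky_query`, and `reset_connection` are hypothetical names used only for this illustration.

```python
import time

def call_with_one_retry(operation, reset, retriable=(ConnectionError,), delay=0.0):
    """Run operation(); on a retriable error, call reset() and try once more."""
    try:
        return operation()
    except retriable:
        time.sleep(delay)
        reset()
        # A second failure propagates to the caller, as in execute_sql
        return operation()

# Stand-in for a query that fails on the first call and succeeds on the second.
calls = {"n": 0, "resets": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("connection dropped")
    return "ok"

def reset_connection():
    calls["resets"] += 1

result = call_with_one_retry(flaky_query, reset_connection)
print(result, calls)
```

Limiting the retry to a single attempt keeps a persistent outage from turning into an endless loop, while still covering the common case of an idle connection that was closed by the server or the tunnel.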

# ML_GENERATE operation

You can summarize content with the [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) routine by setting the task type to summarization.

### Text summarization

Load the 2024 Olympic data from the URL.

In [17]:
# Initialize loader with the webpage URL
loader = WebBaseLoader(web_paths=("https://www.economicsobservatory.com/what-happened-at-the-2024-olympics",))

# Load and parse content into Documents
docs = loader.load()

Set a variable (@document, in this example) to the content of the document you want to summarize.

In [18]:
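# Escape backslashes and single quotes so the page text forms a valid SQL string literal
escaped_document = docs[0].page_content.replace("\\", "\\\\").replace("'", "''")
execute_sql(f"""SET @document = '{escaped_document}';""")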

Invoke [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) with @document as the input and the task set to "summarization".

In [19]:
df = execute_sql(f"""SELECT JSON_PRETTY(sys.ML_GENERATE(@document, JSON_OBJECT("task", "summarization", "max_tokens", 512)));""")
json.loads(df.iat[0,0])["text"]

'The 2024 Summer Olympic Games were held in Paris, France, with over 10,500 athletes from 206 territories participating in 32 sports. The games saw a significant increase in representation, with more countries competing than ever before. China and the United States had the largest delegations, while smaller countries like Bangladesh, Myanmar, and Pakistan had fewer athletes.\n\nThe games also highlighted issues around representation, equality, cost control, and home advantage. Despite efforts to achieve gender parity, women still made up only 49% of competitors, with some countries having more female athletes than male. The distribution of medals was also uneven, with the top ten countries winning 63% of available medals.\n\nThe games saw several new sports, including breaking (break-dancing), which was introduced for the first time at the Olympics. The prize money for athletes varied widely, with some countries offering generous compensation schemes and others having little to no fund
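
The response handling above can be wrapped in a small helper that tolerates a missing field. This is a minimal sketch: the only response key confirmed by this notebook is `"text"`, and both the helper name `extract_generated_text` and the sample response string are hypothetical.

```python
import json

def extract_generated_text(raw):
    """Return the "text" field of an ML_GENERATE JSON response, or None if absent."""
    if isinstance(raw, (bytes, bytearray)):
        # Some connectors return JSON columns as bytes
        raw = raw.decode("utf-8")
    payload = json.loads(raw)
    return payload.get("text")

# Hypothetical response, shaped like the value read from df.iat[0, 0] above.
sample = '{"text": "The 2024 Summer Olympic Games were held in Paris."}'
print(extract_generated_text(sample))
```

In the notebook, the argument would be `df.iat[0, 0]` from the query above.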

In [25]:
# Drop the vector_store_data_1 table if it already exists, so the load below starts clean
execute_sql("""DROP TABLE IF EXISTS vector_store_data_1;""")

### Load documents with VECTOR_STORE_LOAD

Use VECTOR_STORE_LOAD to load the files that contain the proprietary data from object storage into the HeatWave cluster.

In [26]:
df = execute_sql("""CALL sys.VECTOR_STORE_LOAD("oci://hwml-test-bucket@lrsrfayerklw/2024_Summer_Olympics_Wikipedia.pdf", JSON_OBJECT("schema_name","mlcorpus","table_name","vector_store_data_1"))""")

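Monitor the progress of VECTOR_STORE_LOAD. The call returns a `task_status_query` that reports the status and progress of the load; poll it until the task completes or fails.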

In [27]:
while True:
    df_status = execute_sql(f"""{df['task_status_query'][0]}""")
    if df_status is not None and not df_status.empty:
        status = json.loads(df_status.iloc[0,0])['status']
        progress = json.loads(df_status.iloc[0,0])['progress']
        print(f"Status:{status}, Progress:{progress}%")
        if status == "COMPLETED" or status == "ERROR":
            break
    else:
        print("Not started yet")
    time.sleep(5)

Not started yet
Status:RUNNING, Progress:0%
Status:RUNNING, Progress:0%
Status:RUNNING, Progress:10%
Status:RUNNING, Progress:10%
Status:RUNNING, Progress:40%
Status:RUNNING, Progress:40%
Status:RUNNING, Progress:40%
Status:RUNNING, Progress:40%
Status:RUNNING, Progress:40%
Status:RUNNING, Progress:70%
Status:COMPLETED, Progress:100%
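
The loop above runs until the task reports COMPLETED or ERROR, so a stalled load would block the notebook indefinitely. The following sketch adds an overall timeout to the same polling logic. It assumes the status payload carries the `status` and `progress` keys seen above; `wait_for_task` and its parameters are hypothetical, and the status source is simulated so the example runs without a database.

```python
import json
import time

def wait_for_task(fetch_status_json, poll_seconds=5.0, timeout_seconds=1800.0):
    """Poll until the task reports COMPLETED or ERROR; raise TimeoutError otherwise.

    fetch_status_json() returns the status JSON string, or None if no row exists yet.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        raw = fetch_status_json()
        if raw is not None:
            info = json.loads(raw)
            print(f"Status:{info['status']}, Progress:{info['progress']}%")
            if info["status"] in ("COMPLETED", "ERROR"):
                return info
        else:
            print("Not started yet")
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        time.sleep(poll_seconds)

# Simulated replies standing in for the result of the task status query.
replies = iter([
    None,
    '{"status": "RUNNING", "progress": 40}',
    '{"status": "COMPLETED", "progress": 100}',
])
final = wait_for_task(lambda: next(replies), poll_seconds=0)
```

With a live connection, `fetch_status_json` would run the query stored in `df['task_status_query'][0]` through `execute_sql` and return the first cell, or `None` when the result is empty.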


We can also summarize the document parsed by VECTOR_STORE_LOAD. The query below concatenates the first 10 segments, in segment order, into @parsed_document.

In [28]:
execute_sql("""SELECT GROUP_CONCAT(segment ORDER BY segment_number SEPARATOR ' ') INTO @parsed_document FROM (SELECT segment, segment_number FROM mlcorpus.vector_store_data_1 ORDER BY segment_number LIMIT 10) AS limited_segments;""")
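
One caveat with this query: MySQL truncates the result of GROUP_CONCAT to `group_concat_max_len`, which is 1024 bytes by default in stock MySQL, so ten segments may not fit unless that setting has been raised on your instance. An alternative is to fetch the segments and join them on the client. The sketch below shows only the joining step on hypothetical rows; `join_segments` is an assumed helper, and the rows would in practice come from a `SELECT segment_number, segment` query against the vector store table.

```python
def join_segments(rows, max_segments=10, separator=" "):
    """Join the first max_segments segments in segment_number order.

    rows is an iterable of (segment_number, segment) pairs.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    return separator.join(segment for _, segment in ordered[:max_segments])

# Hypothetical rows, deliberately out of order.
rows = [
    (2, "held in Paris."),
    (1, "The 2024 Summer Olympics were"),
    (3, "More text follows."),
]
print(join_segments(rows, max_segments=2))
```

The joined text could then be assigned to @parsed_document in the same way as @document earlier in the notebook.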

In [29]:
df = execute_sql(f"""SELECT JSON_PRETTY(sys.ML_GENERATE(@parsed_document, JSON_OBJECT("task", "summarization", "max_tokens", 512)));""")
json.loads(df.iat[0,0])["text"]

'Here is a summary of the text:\n\nThe 2024 Summer Olympics, officially known as Paris 2024, will be an international multi-sport event held in France from July 26 to August 11, 204. The games will feature 204 participating nations and 10,714 athletes competing in 329 events across 32 sports. The opening ceremony will take place at the Jardins du Trocadéro and Seine, while the closing ceremony will be held at the Stade de France.'