# HeatWave GenAI

HeatWave GenAI is the industry's first automated in-database Generative AI service. It integrates large language models (LLMs) and embedding generation within the database, so you can generate new and realistic content, speed up manual or repetitive tasks such as summarizing large documents, perform Retrieval Augmented Generation (RAG), and interact with your data in natural language. Refer to https://www.oracle.com/heatwave/genai for further details on HeatWave GenAI.


This notebook demonstrates the application of ML_GENERATE for summarization using data from the 2024 Olympic Games.

### References
- https://www.economicsobservatory.com/what-happened-at-the-2024-olympics
- https://en.wikipedia.org/wiki/2024_Summer_Olympics


### Prerequisites
Install the necessary packages:

- mysql-connector-python
- pandas 
- langchain_community
- unstructured
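
As a quick sanity check before running the notebook, the sketch below reports which of the packages above cannot be imported in the current environment. The module names are assumed from the import cell that follows, and `missing_packages` is a hypothetical helper name, not part of any library.

```python
import importlib.util

# Import names assumed to correspond to the pip packages listed above.
REQUIRED_MODULES = ["mysql.connector", "pandas", "langchain_community", "unstructured"]


def missing_packages(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in modules:
        try:
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except ModuleNotFoundError:
            # find_spec raises when a parent package (e.g. "mysql") is absent
            missing.append(name)
    return missing


print("Missing packages:", missing_packages() or "none")
```

Any name it reports can then be installed with pip before continuing.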

##### Import Python packages

In [63]:
# import Python packages
import os
import json
import numpy as np
import pandas as pd
import mysql.connector
from langchain_community.document_loaders import UnstructuredURLLoader

### Connect to the HeatWave instance
We create a connection to an active [HeatWave](https://www.oracle.com/mysql/) instance using [MySQL Connector/Python](https://dev.mysql.com/doc/connector-python/en/). We also define a helper function that executes a SQL query through a cursor and returns the result as a Pandas DataFrame. Modify the variables below to point to your HeatWave instance. On AWS, set USE_BASTION to False. On OCI, create an SSH tunnel on your machine using the command below, substituting each variable with its value:

ssh -o ServerAliveInterval=60 -i BASTION_PKEY -L LOCAL_PORT:DBSYSTEM_IP:DBSYSTEM_PORT BASTION_USER@BASTION_IP
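
To avoid substitution mistakes, the tunnel command can be assembled from the same variables the notebook defines. This is a minimal sketch: `build_tunnel_command` is a hypothetical helper, and the addresses and key path in the example call are placeholders, not real values.

```python
import shlex


def build_tunnel_command(bastion_ip, bastion_user, bastion_pkey,
                         dbsystem_ip, dbsystem_port=3306, local_port=3306):
    """Return the ssh port-forwarding command as a single shell-safe string."""
    args = [
        "ssh",
        "-o", "ServerAliveInterval=60",
        "-i", bastion_pkey,
        # Forward local_port on this machine to the DB system behind the bastion
        "-L", f"{local_port}:{dbsystem_ip}:{dbsystem_port}",
        f"{bastion_user}@{bastion_ip}",
    ]
    return shlex.join(args)


# Placeholder values; replace them with your own settings.
print(build_tunnel_command("203.0.113.10", "opc", "/home/opc/.ssh/id_rsa", "10.0.1.5"))
```

Run the printed command in a separate terminal and keep it open while the notebook is connected.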

In [None]:
BASTION_IP = "ip_address"
BASTION_USER = "opc"
BASTION_PKEY = "private_key_file"
DBSYSTEM_IP = "ip_address"
DBSYSTEM_PORT = 3306
DBSYSTEM_USER = "username"
DBSYSTEM_PASSWORD = "password"
DBSYSTEM_SCHEMA = "mlcorpus"
LOCAL_PORT = 3306
USE_BASTION = True

if USE_BASTION is True:
    DBSYSTEM_IP = "127.0.0.1"
else:
    LOCAL_PORT = DBSYSTEM_PORT

mydb = mysql.connector.connect(
    host=DBSYSTEM_IP,
    port=LOCAL_PORT,
    user=DBSYSTEM_USER,
    password=DBSYSTEM_PASSWORD,
    database=DBSYSTEM_SCHEMA,
    allow_local_infile=True,
    use_pure=True,
    autocommit=True,
)
mycursor = mydb.cursor()


# Helper function to execute SQL queries and return the results as a Pandas DataFrame
def execute_sql(sql: str) -> pd.DataFrame:
    mycursor.execute(sql)
    return pd.DataFrame(mycursor.fetchall(), columns=mycursor.column_names)
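
If you would rather not keep credentials in the notebook itself, one option is to read the connection settings from environment variables. The sketch below mirrors the bastion logic of the cell above; the `HEATWAVE_*` variable names and the `load_db_config` helper are assumptions made for illustration.

```python
import os


def load_db_config(env=os.environ):
    """Build connection settings from environment variables (names are hypothetical)."""
    use_bastion = env.get("HEATWAVE_USE_BASTION", "true").lower() == "true"
    db_port = int(env.get("HEATWAVE_DB_PORT", "3306"))
    local_port = int(env.get("HEATWAVE_LOCAL_PORT", "3306"))
    return {
        # Through the tunnel we connect to the local end of the forwarded port.
        "host": "127.0.0.1" if use_bastion else env["HEATWAVE_DB_HOST"],
        "port": local_port if use_bastion else db_port,
        "user": env["HEATWAVE_DB_USER"],
        "password": env["HEATWAVE_DB_PASSWORD"],
        "database": env.get("HEATWAVE_DB_SCHEMA", "mlcorpus"),
    }
```

The resulting dictionary could be unpacked into `mysql.connector.connect(**load_db_config(), ...)` together with the remaining keyword arguments used above.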


# ML_GENERATE operation

You can summarize content with the ML_GENERATE routine by setting the task type to summarization.
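
As a rough sketch of what such a call looks like, the helper below assembles the SQL statement for a given session variable. The option names and default values are the ones used in this notebook's cells; `build_summarize_sql` itself is a hypothetical helper, and the validation rules are a conservative assumption rather than the server's actual naming rules.

```python
import re


def build_summarize_sql(variable, model_id="llama3.2-3b-instruct-v1", max_tokens=512):
    """Return a SELECT statement that summarizes the text held in a session variable."""
    # Restrict inputs to simple identifiers so they cannot alter the statement.
    if not re.fullmatch(r"@[A-Za-z0-9_]+", variable):
        raise ValueError(f"unexpected session variable name: {variable!r}")
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", model_id):
        raise ValueError(f"unexpected model id: {model_id!r}")
    return (
        f"SELECT JSON_PRETTY(sys.ML_GENERATE({variable}, "
        f'JSON_OBJECT("task", "summarization", "model_id", "{model_id}", '
        f'"max_tokens", {int(max_tokens)})));'
    )


print(build_summarize_sql("@document"))
```

The returned string can be passed to `execute_sql` once the session variable has been set.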

### Text summarization

Load the 2024 Olympic data from the URL.

In [65]:
url = "https://www.economicsobservatory.com/what-happened-at-the-2024-olympics"
loader = UnstructuredURLLoader(urls=[url])
docs = loader.load()

Set a variable (@document, in this example) to the content of the document you want to summarize.

In [66]:
# Bind the text as a parameter so quotes in the article cannot break the SQL statement
mycursor.execute("SET @document = %s", (docs[0].page_content,))

Invoke ML_GENERATE on the @document variable, specifying the task as "summarization".

In [67]:
df = execute_sql(f"""SELECT JSON_PRETTY(sys.ML_GENERATE(@document, JSON_OBJECT("task", "summarization", "model_id", "llama3.2-3b-instruct-v1", "max_tokens", 512)));""")
json.loads(df.iat[0,0])["text"]

"The 2024 Summer Olympic Games, held in Paris, France, saw significant milestones and challenges. The games featured 10,500 Olympians from 206 territories and the International Olympic Committee (IOC) Refugee Olympic Team, with China and the United States having the largest delegations. The games highlighted issues of representation, equality, cost control, and home advantage.\n\nThe number of countries participating in the Olympics has increased over time, but there was a notable decrease in 1980 due to the American-led boycott of the Moscow games. The Paris games saw four countries winning their first-ever medals, including Saint Lucia's Julien Alfred who won gold in the women's 100m sprint.\n\nThe distribution of medals among top-performing countries reversed slightly from previous Olympics, with China and the United States claiming the most medals. However, when considering victory ratios, Kenya, North Korea, and Saint Lucia outperformed traditional Olympic giants.\n\nThe Paris gam

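The summary is read from the "text" field of the JSON that ML_GENERATE returns. A small wrapper, sketched below, makes that step explicit and fails with a readable message if the field is absent. `extract_text` is a hypothetical helper, and the sample string is a hand-written stand-in rather than real model output.

```python
import json


def extract_text(response_json: str) -> str:
    """Return the "text" field of an ML_GENERATE JSON response."""
    payload = json.loads(response_json)
    if "text" not in payload:
        raise KeyError(f"no 'text' field in response; keys: {sorted(payload)}")
    return payload["text"].strip()


# Hand-written stand-in for a real response, for illustration only.
sample = '{"text": " Paris hosted the 2024 Summer Olympics. "}'
print(extract_text(sample))
```

In the notebook, it would be called as `extract_text(df.iat[0, 0])`.
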
In [68]:
# Delete the vector_store_data_1 table if it exists
execute_sql("""DROP TABLE IF EXISTS vector_store_data_1;""")

Use vector_store_load to create a vector store from the files that contain the proprietary data.

To do that, first copy the files into the /var/lib/mysql-files folder.

Example: sudo cp /home/john_doe/2024_Summer_Olympics_Wikipedia.pdf /var/lib/mysql-files

In [69]:
execute_sql("""CALL sys.vector_store_load('file:///var/lib/mysql-files/2024_Summer_Olympics_Wikipedia.pdf', JSON_OBJECT("schema_name","mlcorpus","table_name","vector_store_data_1"))""")
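
When loading several files, it can help to generate the CALL statement instead of editing the string by hand. The sketch below reproduces the statement above from its three inputs; `build_vector_store_load_sql` is a hypothetical helper, and the validation is a conservative assumption, not the server's own rule set.

```python
import re


def build_vector_store_load_sql(file_path, schema_name, table_name):
    """Return the CALL statement that loads one local file into a vector store table."""
    for label, value in (("schema_name", schema_name), ("table_name", table_name)):
        # Allow only simple identifiers so the values cannot alter the statement.
        if not re.fullmatch(r"[A-Za-z0-9_]+", value):
            raise ValueError(f"unexpected {label}: {value!r}")
    if not file_path.startswith("/") or "'" in file_path or "\\" in file_path:
        raise ValueError(f"unexpected file path: {file_path!r}")
    return (
        f"CALL sys.vector_store_load('file://{file_path}', "
        f'JSON_OBJECT("schema_name","{schema_name}","table_name","{table_name}"))'
    )


print(build_vector_store_load_sql(
    "/var/lib/mysql-files/2024_Summer_Olympics_Wikipedia.pdf",
    "mlcorpus",
    "vector_store_data_1",
))
```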


We can also summarize the document that was parsed by vector_store_load.

In [70]:
execute_sql("""SELECT GROUP_CONCAT(segment ORDER BY segment_number SEPARATOR ' ') INTO @parsed_document FROM (SELECT segment, segment_number FROM mlcorpus.vector_store_data_1 ORDER BY segment_number LIMIT 10) AS limited_segments;""")
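
The query above concatenates the first 10 segments in order. The sketch below parameterizes the table name and segment count, and also emits a statement that raises `group_concat_max_len` for the session. MySQL documents a default of 1024 bytes for that variable, beyond which GROUP_CONCAT output is truncated; whether your instance uses that default is an assumption worth checking with `SELECT @@group_concat_max_len`. `build_concat_statements` and the chosen `max_len` are hypothetical.

```python
import re


def build_concat_statements(table, limit=10, target="@parsed_document", max_len=1_000_000):
    """Return SQL statements that join the first `limit` segments into a session variable."""
    if not re.fullmatch(r"[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)?", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not re.fullmatch(r"@[A-Za-z0-9_]+", target):
        raise ValueError(f"unexpected session variable name: {target!r}")
    return [
        # Raise the GROUP_CONCAT length cap for this session only.
        f"SET SESSION group_concat_max_len = {int(max_len)};",
        "SELECT GROUP_CONCAT(segment ORDER BY segment_number SEPARATOR ' ') "
        f"INTO {target} FROM (SELECT segment, segment_number FROM {table} "
        f"ORDER BY segment_number LIMIT {int(limit)}) AS limited_segments;",
    ]


for statement in build_concat_statements("mlcorpus.vector_store_data_1"):
    print(statement)
```

Each returned statement can be passed to `execute_sql` in turn.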

In [71]:
df = execute_sql(f"""SELECT JSON_PRETTY(sys.ML_GENERATE(@parsed_document, JSON_OBJECT("task", "summarization", "model_id", "llama3.2-3b-instruct-v1", "max_tokens", 512)));""")
json.loads(df.iat[0,0])["text"]

"Here is a summary of the text:\n\nThe 2024 Summer Olympics, officially known as Paris 2024, will be held in Paris, France from July 26 to August 11, 2024. The games will feature 204 nations and 10,714 athletes competing in 32 sports across 329 events. This will be the third time Paris hosts the Summer Olympics, after 1900 and 1924. The games were awarded to Paris at the 2017 IOC Session, with a unique process that allowed both Paris and Los Angeles to bid for the 2024 and 2028 Games. The games will feature several new sports, including breaking, and will be the final Olympics under IOC President Thomas Bach's presidency. The opening ceremony will take place outside of a stadium for the first time in modern Olympic history, with athletes paraded by boat along the Seine. The United States is expected to top the medal table, followed closely by China. The games are expected to cost â‚¬9 billion and have broken records for ticket sales, with over 9.5 million tickets sold. Despite some con