# MySQL AI
MySQL AI lets developers build rich applications on MySQL using built-in machine learning, GenAI, LLMs, and semantic search. Developers can create vectors from documents stored in a local file system. Customers can deploy these AI applications on premises or migrate them to MySQL HeatWave for lower cost, higher performance, richer functionality, and the latest LLMs, with no change to their application. This gives developers the flexibility to build their applications on MySQL EE and then deploy them either on premises or in the cloud.


This notebook demonstrates the application of [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) for summarization using data from the 2024 Olympic Games.

### References
- https://blogs.oracle.com/mysql/post/announcing-mysql-ai
- https://dev.mysql.com/doc/mysql-ai/9.4/en/
- https://dev.mysql.com/doc/dev/mysql-studio/latest/#overview
- https://www.economicsobservatory.com/what-happened-at-the-2024-olympics
- https://en.wikipedia.org/wiki/2024_Summer_Olympics

### Prerequisites

- mysql-connector-python
- pandas 

##### Import Python packages

In [17]:
# import Python packages
import os
import json
import numpy as np
import pandas as pd
import mysql.connector

### Connect to the MySQL AI instance
We create a connection to an active MySQL AI instance using [MySQL Connector/Python](https://dev.mysql.com/doc/connector-python/en/). We also define a helper function that executes a SQL query through a cursor and returns the result as a Pandas DataFrame. Modify the variables below to point to your MySQL AI instance.

 - In MySQL Studio, connections are restricted to localhost as the host.
 - In MySQL Studio, the only accepted password values are the string 'unused' or None.

In [None]:
HOST = 'localhost'
PORT = 3306
USER = 'root'
PASSWORD = 'unused'
DATABASE = 'mlcorpus'


myconn = mysql.connector.connect(
    host=HOST,
    port=PORT,
    user=USER,
    password=PASSWORD,
    database=DATABASE,
    allow_local_infile=True,
    use_pure=True,
    autocommit=True,
)
mycursor = myconn.cursor()


# Helper function to execute SQL queries and return the results as a Pandas DataFrame
def execute_sql(sql: str) -> pd.DataFrame:
    mycursor.execute(sql)
    return pd.DataFrame(mycursor.fetchall(), columns=mycursor.column_names)


# ML_GENERATE operation

You can summarize content with the [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) routine by specifying the task type as summarization.

### Text summarization
Set a variable (@document, in this example) to the content of the document you want to summarize.

In [0]:
document = "Paris has been the host for the 2024 Olympic and Paralympic Games. Over the course of the summer, the city – and other venues across France – have welcomed 10,500 Olympians (competing in 32 sports) and over 4,000 Paralympians (competing in 22 sports). In the Paris games, 206 territories were represented, alongside the International Olympic Committee (IOC) Refugee Olympic Team. Comparably, the 1900 Olympics – also hosted by Paris – featured athletes from only 24 nations. The American-led boycott of that edition was instigated by political tensions with the Soviet Union over its occupation of Afghanistan. Approximately 60 countries joined the United States in boycotting the games, resulting in only 81 countries sending athletes."
execute_sql(f"""SET @document = '{document}';""")
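
Interpolating the text into the statement with an f-string stops working as soon as the document contains a single quote or a backslash. A minimal sketch of an alternative, assuming the cursor accepts the Connector/Python `execute(operation, params)` form with `%s` placeholders. The helper name `set_user_variable` is hypothetical and not part of this notebook:

```python
import re


def set_user_variable(cursor, name: str, value: str) -> None:
    """Assign `value` to the MySQL user variable @name via a bound parameter.

    Hypothetical helper for illustration. The variable name cannot be bound
    as a parameter, so it is validated before it is placed in the statement.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid user variable name: {name!r}")
    # The connector escapes the bound value, so quotes inside it are safe.
    cursor.execute(f"SET @{name} = %s", (value,))
```

With the connection from this notebook, the call would look like `set_user_variable(mycursor, "document", document)`.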

Invoke [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) with the @document variable, specifying the task as "summarization".

In [0]:
df = execute_sql(f"""SELECT JSON_PRETTY(sys.ML_GENERATE(@document, JSON_OBJECT("task", "summarization", "max_tokens", 512)));""")
json.loads(df.iat[0,0])["text"]

"Here is a summary:\n\nThe 2024 Olympic and Paralympic Games were held in Paris, welcoming 10,500 Olympians from 206 territories and 4,000 Paralympians from 22 sports. This marked a significant increase from the 1900 Olympics, which featured athletes from only 24 nations due to an American-led boycott of the Soviet Union's occupation of Afghanistan."

### Use VECTOR_STORE_LOAD to parse the documents that contain the proprietary data.

To do this, first copy the files into the /var/lib/mysql-files folder.

Example: sudo cp /home/john_doe/2024_Summer_Olympics_Wikipedia.pdf /var/lib/mysql-files

In [26]:
# Drop the parsed_document_segments table if it exists
execute_sql("""DROP TABLE IF EXISTS mlcorpus.parsed_document_segments;""")

In [27]:
df = execute_sql("""CALL sys.VECTOR_STORE_LOAD('file:///var/lib/mysql-files/2024_Summer_Olympics_Wikipedia.pdf', JSON_OBJECT("schema_name","mlcorpus","table_name","parsed_document_segments"))""")

In [0]:
df_status = execute_sql(df['task_status_query'][0])
print(f"Progress:{json.loads(df_status.iloc[0,0])['progress']}%")

Progress:100%
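
The status query above reports progress at a single point in time, so for larger documents it may need to be repeated until loading completes. A minimal polling sketch, assuming the status JSON carries a numeric `progress` field between 0 and 100 as in the output above. The function name `wait_for_task`, the polling interval, and the timeout are hypothetical, and failure states are not handled:

```python
import json
import time
from typing import Callable


def wait_for_task(fetch_status: Callable[[], str],
                  poll_seconds: float = 2.0,
                  timeout_seconds: float = 600.0) -> dict:
    """Poll a status source until it reports 100% progress or times out.

    `fetch_status` is any callable returning the status as a JSON string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = json.loads(fetch_status())
        if float(status["progress"]) >= 100:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"task still at {status['progress']}% after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

In this notebook the callable could be `lambda: execute_sql(df['task_status_query'][0]).iloc[0, 0]`.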


We can also summarize the document parsed by VECTOR_STORE_LOAD.

In [28]:
execute_sql("""SELECT GROUP_CONCAT(segment ORDER BY segment_number SEPARATOR ' ') INTO @parsed_document FROM (SELECT segment, segment_number FROM mlcorpus.parsed_document_segments ORDER BY segment_number LIMIT 10) AS limited_segments;""")
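
Note that MySQL truncates the result of GROUP_CONCAT to `group_concat_max_len`, which defaults to 1024 bytes, so ten segments may not all fit. Raising the session value of that variable is one option. Another is to join the segments on the client, sketched below under the assumption that the rows are `(segment_number, segment)` pairs. The function name `join_segments` is hypothetical:

```python
from typing import Iterable, Tuple


def join_segments(rows: Iterable[Tuple[int, str]],
                  limit: int = 10,
                  separator: str = " ") -> str:
    """Join the first `limit` segments, ordered by segment number."""
    ordered = sorted(rows, key=lambda row: row[0])
    return separator.join(segment for _, segment in ordered[:limit])
```

The rows could come from `SELECT segment_number, segment FROM mlcorpus.parsed_document_segments`, and the joined text can then be assigned to @parsed_document.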

Invoke [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) with the @parsed_document variable, specifying the task as "summarization".

In [29]:
df = execute_sql(f"""SELECT JSON_PRETTY(sys.ML_GENERATE(@parsed_document, JSON_OBJECT("task", "summarization", "max_tokens", 512)));""")
json.loads(df.iat[0,0])["text"]

'Here is a summary of the text:\n\nThe 2024 Summer Olympics, officially known as Paris 2024, will be held in Paris, France from July 26 to August 11, 2024. The event will feature 204 participating nations and 10,714 athletes competing in 329 events across 32 sports. The opening ceremony is scheduled for July 26, while the closing ceremony will take place on August 11.'
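
Both summaries in this notebook begin with a preamble such as "Here is a summary:". A minimal post-processing sketch, assuming the response is a JSON object with a `text` key as shown above. The function name `extract_summary` and the preamble pattern are hypothetical and based only on these two outputs:

```python
import json
import re


def extract_summary(raw_response: str) -> str:
    """Return the generated text without a leading 'Here is a summary...:' line."""
    text = json.loads(raw_response)["text"]
    # Remove the preamble only when it appears at the very start of the text.
    return re.sub(r"^\s*Here is a summary[^\n:]*:\s*", "", text, count=1).strip()
```

For example, `extract_summary(df.iat[0, 0])` would return only the body of the summary.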