# Text summarization with MySQL AI
MySQL AI lets developers build rich applications on MySQL using built-in machine learning, GenAI, LLMs, and semantic search. Developers can create vectors from documents stored in a local file system. Customers can deploy these AI applications on premises or migrate them to MySQL HeatWave for lower cost, higher performance, richer functionality, and the latest LLMs, with no change to their application. This gives developers the flexibility to build their applications on MySQL EE and then deploy them either on premises or in the cloud.


This notebook demonstrates how to use [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) to summarize text about the 2024 Olympic Games that is stored in a MySQL table.

### References
- https://blogs.oracle.com/mysql/post/announcing-mysql-ai
- https://dev.mysql.com/doc/mysql-ai/9.4/en/
- https://dev.mysql.com/doc/dev/mysql-studio/latest/#overview
- https://www.economicsobservatory.com/what-happened-at-the-2024-olympics
- https://en.wikipedia.org/wiki/2024_Summer_Olympics

### Prerequisites

- mysql-connector-python
- pandas 

##### Import Python packages

In [17]:
# import Python packages
import os
import json
import time
import numpy as np
import pandas as pd
import mysql.connector

### Connect to the MySQL AI instance
We create a connection to an active MySQL AI instance using [MySQL Connector/Python](https://dev.mysql.com/doc/connector-python/en/). We also define a helper function that executes a SQL query through a cursor and returns the result as a Pandas DataFrame. Modify the variables below to point to your MySQL AI instance.

 - In MySQL Studio, connections are restricted to only allow localhost as the host. 
 - In MySQL Studio, the only accepted password values are the string unused or None (authentication happens through the web interface).

In [None]:
HOST = 'localhost'
PORT = 3306
USER = 'root'
PASSWORD = 'unused'
DATABASE = 'mlcorpus'


myconn = mysql.connector.connect(
    host=HOST,
    port=PORT,
    user=USER,
    password=PASSWORD,
    database=DATABASE,
    allow_local_infile=True,
    use_pure=True,
    autocommit=True,
)
mycursor = myconn.cursor()
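
If you would rather not hard-code the connection variables, one option is to read them from environment variables and fall back to the values used above. The following is a minimal sketch; the environment variable names (MYSQL_AI_HOST and so on) and the helper name connection_settings are hypothetical and not part of MySQL AI or Connector/Python.

```python
import os


def connection_settings(env=os.environ):
    """Build connection keyword arguments from environment variables.

    The MYSQL_AI_* variable names are hypothetical; the defaults mirror
    the values used in this notebook.
    """
    return {
        "host": env.get("MYSQL_AI_HOST", "localhost"),
        "port": int(env.get("MYSQL_AI_PORT", "3306")),
        "user": env.get("MYSQL_AI_USER", "root"),
        "password": env.get("MYSQL_AI_PASSWORD", "unused"),
        "database": env.get("MYSQL_AI_DATABASE", "mlcorpus"),
    }


# Override only the port; every other setting falls back to its default.
settings = connection_settings({"MYSQL_AI_PORT": "3307"})
print(settings["host"], settings["port"])
```

The resulting dictionary can be unpacked into the connect call, for example mysql.connector.connect(**connection_settings(), autocommit=True).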

In [0]:

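# Helper function to execute SQL queries and return the results as a Pandas DataFrame
def execute_sql(sql: str) -> pd.DataFrame:
    mycursor.execute(sql)
    # Statements such as DROP, CREATE, INSERT, and SET produce no result set,
    # so return an empty DataFrame instead of calling fetchall() on them.
    if not mycursor.with_rows:
        return pd.DataFrame()
    return pd.DataFrame(mycursor.fetchall(), columns=mycursor.column_names)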

### Create a MySQL table
In this step, we will create a MySQL table to store data from the 2024 Olympic Games. This table will organize the dataset in a structured format, making it easier to query, analyze, and manipulate.

In [0]:
execute_sql("""DROP TABLE IF EXISTS mlcorpus.olympics_table""")
execute_sql("""CREATE TABLE mlcorpus.olympics_table (
                id INT AUTO_INCREMENT PRIMARY KEY,
                text TEXT NOT NULL);""")

execute_sql("""INSERT INTO mlcorpus.olympics_table (text)
                VALUES
                ("Paris has been the host for the 2024 Olympic and Paralympic Games. Over the course of the summer, the city – and other venues across France – have welcomed 10,500 Olympians (competing in 32 sports) and over 4,000 Paralympians (competing in 22 sports)."),
                ("In the Paris games, 206 territories were represented, alongside the International Olympic Committee (IOC) Refugee Olympic Team. Comparably, the 1900 Olympics – also hosted by Paris – featured athletes from only 24 nations."),
                ("The increase in participating countries has been consistent aside from a noticeable break in 1980 when the Moscow games saw the largest Olympic boycott in history."),
                ("The American-led boycott of that edition was instigated by political tensions with the Soviet Union over its occupation of Afghanistan. Approximately 60 countries joined the United States in boycotting the games, resulting in only 81 countries sending athletes."),
                ("China (with 388 athletes) and the United States (with 594) had among the largest delegations, while countries like Australia (460 athletes), France (572), New Zealand (212) and Slovenia (74) all had high numbers of athletes at the games relative to their population size.");""")
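
The INSERT statement above embeds each paragraph as a double-quoted SQL literal. That works for this dataset, but it would break if a paragraph itself contained a double quote. A common alternative is to pass the text as query parameters using the %s placeholder style that Connector/Python supports. The sketch below only builds the statement and its parameters; build_insert is a hypothetical helper name and the sample paragraphs are illustrative.

```python
def build_insert(paragraphs):
    """Return an INSERT statement with a %s placeholder and one parameter tuple per row."""
    sql = "INSERT INTO mlcorpus.olympics_table (text) VALUES (%s)"
    params = [(paragraph,) for paragraph in paragraphs]
    return sql, params


# Illustrative paragraphs containing characters that are awkward inside SQL literals.
sample = [
    'A paragraph with "double quotes" inside.',
    "A paragraph with an apostrophe: Paris's venues.",
]
sql, params = build_insert(sample)
print(sql)
print(len(params))
```

With an open cursor, the rows could then be inserted with mycursor.executemany(sql, params), leaving the quoting to the connector.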

### Concatenate Text Column into a Session Variable
In this step, we will combine the contents of the text column from the mlcorpus.olympics_table into a single session variable (e.g., @document).

**Important**: MySQL’s GROUP_CONCAT() function has a maximum output length, which is controlled by the system variable group_concat_max_len.

- **Default limit**: 1024 bytes (~1 KB)

- **Issue**: This is often too small for long text fields, such as our multi-paragraph Olympic dataset.

To prevent truncation, increase the session limit before running GROUP_CONCAT. For example, we can set it to 10,000 bytes or higher:

In [0]:
execute_sql("""SET SESSION group_concat_max_len = 10000;""")
execute_sql("""SELECT GROUP_CONCAT(text ORDER BY id SEPARATOR ' ') INTO @document FROM mlcorpus.olympics_table;""")
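
To pick a suitable value for group_concat_max_len rather than guessing, you can estimate how many bytes the concatenated result needs. The sketch below does this on the client side, assuming the text is UTF-8 encoded and the separator is a single space as in the query above; required_group_concat_len is a hypothetical helper and the sample strings are illustrative.

```python
def required_group_concat_len(texts, separator=" ", encoding="utf-8"):
    """Return the number of bytes needed to join texts with separator."""
    if not texts:
        return 0
    text_bytes = sum(len(text.encode(encoding)) for text in texts)
    # One separator sits between each pair of adjacent rows.
    separator_bytes = len(separator.encode(encoding)) * (len(texts) - 1)
    return text_bytes + separator_bytes


# Illustrative rows; in practice these would be the values of the text column.
sample = ["Paris hosted the games.", "206 territories took part."]
needed = required_group_concat_len(sample)
print(needed)
```

A similar estimate can be computed on the server with SELECT SUM(LENGTH(text)) + COUNT(*) - 1 FROM mlcorpus.olympics_table, since LENGTH() returns the length in bytes and the separator here is one byte.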

### Text summarization
Invoke [ML_GENERATE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-generate.html) with the @document variable as input, specifying the task as "summarization" and a max_tokens limit of 512.

In [29]:
df = execute_sql("""SELECT JSON_PRETTY(sys.ML_GENERATE(@document, JSON_OBJECT("task", "summarization", "max_tokens", 512)));""")
json.loads(df.iat[0,0])["text"]

'Here is a summary:\n\nThe 2024 Paris Olympics saw a significant increase in participating countries compared to previous editions, with 206 territories represented. This is a notable improvement from the 1900 Olympics, which featured only 24 nations. The 1980 Moscow games were marked by a large boycott, led by the US, due to political tensions with the Soviet Union over Afghanistan, resulting in only 81 countries sending athletes. In contrast, the 2024 Paris Olympics had a diverse range of participating countries, with China and the US having the largest delegations.'
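
The result comes back as a JSON string whose "text" field holds the summary, as the cell above shows. If you call ML_GENERATE repeatedly, it can help to wrap the parsing in a small function that fails clearly when the field is missing. This is a minimal sketch: extract_summary is a hypothetical helper, and the sample response is made up for illustration and contains only the "text" field seen above.

```python
import json


def extract_summary(raw_response):
    """Parse a JSON response string and return its "text" field, stripped of surrounding whitespace."""
    payload = json.loads(raw_response)
    if "text" not in payload:
        raise KeyError('response has no "text" field')
    return payload["text"].strip()


# Hypothetical response for illustration; real responses may carry additional fields.
sample = '{"text": "Here is a summary:\\n\\nParis hosted the 2024 games."}'
print(extract_summary(sample))
```

In the notebook, the same function could be applied to the query result with extract_summary(df.iat[0, 0]).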

### Close the MySQL connection

In [0]:
mycursor.close()  # Closes the cursor
myconn.close()    # Closes the connection