# Embedding generation with MySQL AI
MySQL AI lets developers build rich applications on MySQL using built-in machine learning, GenAI, LLMs, and semantic search. Developers can create vectors from documents stored in a local file system. Customers can deploy these AI applications on premises or migrate them to MySQL HeatWave for lower cost, higher performance, richer functionality, and the latest LLMs, with no change to their application. This gives developers the flexibility to build applications on MySQL EE and then deploy them either on premises or in the cloud.

This notebook demonstrates the application of [ML_EMBED_ROW](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-embed-row.html) and [ML_EMBED_TABLE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-embed-table.html) for embedding generation using data from the 2024 Olympic Games.

Embeddings transform text or other unstructured data into high-dimensional numerical vectors that capture semantic meaning and contextual relationships. This allows machines to compare and reason about concepts based on meaning rather than exact keyword matches. You would typically generate embeddings when you need to retrieve or analyze information based on conceptual similarity. For example, you could find all customer support tickets that describe the same issue even if they use different wording.
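
As a rough illustration of comparing by meaning rather than by keywords, the sketch below measures cosine similarity between tiny hand-made vectors. The three-dimensional vectors and the ticket names are invented for illustration and are not output from any embedding model; real embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-dimensional "embeddings" (illustrative values only, not model output).
ticket_login_fails = [0.9, 0.1, 0.0]
ticket_cannot_sign_in = [0.8, 0.2, 0.1]
ticket_refund_request = [0.1, 0.1, 0.9]

same_issue = cosine_similarity(ticket_login_fails, ticket_cannot_sign_in)
different_issue = cosine_similarity(ticket_login_fails, ticket_refund_request)
print(same_issue > different_issue)  # True
```

The two tickets about signing in point in nearly the same direction, so they score far higher than the unrelated refund ticket, even though no words are compared.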

### References
- https://blogs.oracle.com/mysql/post/announcing-mysql-ai
- https://dev.mysql.com/doc/mysql-ai/9.4/en/
- https://dev.mysql.com/doc/dev/mysql-studio/latest/#overview
- https://www.economicsobservatory.com/what-happened-at-the-2024-olympics
- https://en.wikipedia.org/wiki/2024_Summer_Olympics

### Prerequisites

- mysql-connector-python
- pandas 

##### Import Python packages

In [3]:
# import Python packages
import os
import json
import numpy as np
import pandas as pd
import mysql.connector

### Connect to MySQL AI instance
We create a connection to an active MySQL AI instance using [MySQL Connector/Python](https://dev.mysql.com/doc/connector-python/en/). We also define a helper function that executes a SQL query with a cursor and returns the result as a Pandas DataFrame. Modify the variables below to point to your MySQL AI instance.

 - In MySQL Studio, connections are restricted to `localhost` as the host.
 - In MySQL Studio, the only accepted password values are the string `unused` or `None` (authentication happens through the web interface).

In [None]:
HOST = 'localhost'
PORT = 3306
USER = 'root'
PASSWORD = 'unused'
DATABASE = 'mlcorpus'


myconn = mysql.connector.connect(
    host=HOST,
    port=PORT,
    user=USER,
    password=PASSWORD,
    database=DATABASE,
    allow_local_infile=True,
    use_pure=True,
    autocommit=True,
)
mycursor = myconn.cursor()
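
The connection values above are hard-coded. As a sketch, they could instead be read from environment variables, which keeps them out of the notebook. The `MYSQL_AI_*` variable names and the `connection_settings_from_env` helper are hypothetical and not part of MySQL AI or Connector/Python; the defaults mirror the values in the cell above.

```python
import os

def connection_settings_from_env(env=os.environ):
    """Build keyword arguments for mysql.connector.connect from environment variables.

    The MYSQL_AI_* names are hypothetical; defaults mirror this notebook's values.
    """
    return {
        "host": env.get("MYSQL_AI_HOST", "localhost"),
        "port": int(env.get("MYSQL_AI_PORT", "3306")),
        "user": env.get("MYSQL_AI_USER", "root"),
        "password": env.get("MYSQL_AI_PASSWORD", "unused"),
        "database": env.get("MYSQL_AI_DATABASE", "mlcorpus"),
    }

# Demonstrate with an explicit mapping instead of the real environment.
settings = connection_settings_from_env({"MYSQL_AI_PORT": "3307"})
print(settings["host"], settings["port"])  # localhost 3307
```

The resulting dictionary could be unpacked into the call, for example `mysql.connector.connect(**settings, autocommit=True)`.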


In [0]:
# Helper function to execute SQL queries and return the results as a Pandas DataFrame
def execute_sql(sql: str) -> pd.DataFrame:
    mycursor.execute(sql)
    return pd.DataFrame(mycursor.fetchall(), columns=mycursor.column_names)
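
When the text to embed comes from user input or contains quotes, building the SQL string by hand becomes fragile. A minimal sketch of a parameterized variant of the helper follows, assuming a cursor that behaves like the Connector/Python cursor (`execute` with `%s` placeholders, `fetchall`, `column_names`). The name `execute_sql_params` and the constant `EMBED_ONE_SQL` are hypothetical.

```python
import pandas as pd

def execute_sql_params(cursor, sql, params=()):
    """Run a query with bound parameters and return the rows as a DataFrame.

    `cursor` is expected to behave like a mysql.connector cursor
    (execute, fetchall, column_names).
    """
    cursor.execute(sql, params)
    return pd.DataFrame(cursor.fetchall(), columns=cursor.column_names)

# The text is passed separately from the SQL, so quotes inside it need no escaping.
EMBED_ONE_SQL = "SELECT sys.ML_EMBED_ROW(%s, NULL) AS embedding"

# Example call against a live connection (not executed here):
# df = execute_sql_params(mycursor, EMBED_ONE_SQL, ("Paris hosted the 2024 Games.",))
```

Passing the cursor as an argument, rather than relying on a global, also makes the helper easy to exercise with a stand-in cursor.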

# ML_EMBED_ROW operation
The [ML_EMBED_ROW](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-embed-row.html) function generates a vector embedding for a given text input.

An embedding is a high-dimensional vector that represents the semantic meaning of text. Texts with similar meanings have embeddings that are close together in vector space.

In the example below, we use the default embedding model to convert a sentence into a numerical representation.

In [0]:
df = execute_sql(f"""SELECT sys.ML_EMBED_ROW("Paris has been the host for the 2024 Olympic and Paralympic Games.", NULL);""")
df.iat[0,0][:5]

[-0.01162190455943346,
 0.04976009577512741,
 -0.00718254130333662,
 -0.010453645139932632,
 0.06114770099520683]
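
For client-side work it can be convenient to hold the embedding as a NumPy array. The sketch below defines a hypothetical `to_vector` helper. It assumes the value may arrive as a Python sequence of floats (as displayed above), as a JSON-style string, or as raw bytes of packed little-endian 4-byte floats; which form you actually receive depends on your connector and its version.

```python
import json
import numpy as np

def to_vector(value):
    """Convert an embedding value into a 1-D float32 NumPy array.

    Handles three representations a client might plausibly receive (assumptions):
    a sequence of floats, a JSON-style string such as "[0.1, 0.2]",
    or raw bytes holding packed little-endian 4-byte floats.
    """
    if isinstance(value, (bytes, bytearray, memoryview)):
        return np.frombuffer(bytes(value), dtype="<f4").copy()
    if isinstance(value, str):
        value = json.loads(value)
    return np.asarray(value, dtype=np.float32)

# The five leading components shown in the output above.
first_five = [-0.01162190455943346, 0.04976009577512741, -0.00718254130333662,
              -0.010453645139932632, 0.06114770099520683]
vec = to_vector(first_five)
print(vec.shape, vec.dtype)  # (5,) float32
```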

# Comparing embeddings

Once we generate embeddings, we can compute similarity scores to measure how closely two sentences are related. 

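We use the DISTANCE function with the "DOT" metric, which returns the dot product of the two embeddings. In the cases below, a higher score means the sentences are more similar.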

### Case 1: Identical Sentences
Two identical sentences should yield a similarity score very close to 1.0.

In [0]:
df = execute_sql(f"""SELECT DISTANCE(sys.ML_EMBED_ROW("Paris has been the host for the 2024 Olympic and Paralympic Games.", NULL), sys.ML_EMBED_ROW("Paris has been the host for the 2024 Olympic and Paralympic Games.", NULL), "DOT");""")
float(df.iat[0,0])

1.0000001192092896
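
The score is marginally above 1.0 because of single-precision rounding, not because the sentences are "more than identical". The sketch below shows that the value printed above is exactly the next float32 after 1.0. It also shows that a unit-length float32 vector has a self dot product that is only approximately 1. The 384-dimensional random vector is an arbitrary stand-in, and the assumption that the model returns unit-length float32 vectors is inferred from the score, not documented here.

```python
import numpy as np

eps32 = np.finfo(np.float32).eps  # gap between 1.0 and the next float32
next_after_one = float(np.float32(1.0) + eps32)
print(next_after_one)  # 1.0000001192092896

# A random unit-length float32 vector (arbitrary stand-in for an embedding).
rng = np.random.default_rng(0)
v = rng.standard_normal(384).astype(np.float32)
v /= np.linalg.norm(v)

# The self dot product lands near 1.0, within float32 rounding error.
self_dot = float(np.dot(v, v))
print(abs(self_dot - 1.0) < 1e-4)  # True
```

When comparing scores in application code, a small tolerance is therefore safer than testing for equality with 1.0.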

### Case 2: Synonyms
Sentences with the same meaning but different wording (here, "has been the host for" vs. "served as the venue for") should have a high similarity, though slightly less than 1.

In [0]:
df = execute_sql(f"""SELECT DISTANCE(sys.ML_EMBED_ROW("Paris has been the host for the 2024 Olympic and Paralympic Games.", NULL), sys.ML_EMBED_ROW("Paris served as the venue for the 2024 Olympic and Paralympic Games.", NULL), "DOT");""")
float(df.iat[0,0])

0.976202130317688

### Case 3: Related but Different
Sentences about the same domain (e.g., Olympic Games) but with different details will yield a moderate similarity score.

In [0]:
df = execute_sql(f"""SELECT DISTANCE(sys.ML_EMBED_ROW("Paris has been the host for the 2024 Olympic and Paralympic Games.", NULL), sys.ML_EMBED_ROW("Many teams sent more women athletes to Paris 2024 than men.", NULL), "DOT");""")
float(df.iat[0,0])

0.8721277713775635

The similarity score decreases as the semantic difference between the sentences increases.
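
The same ordering can be reproduced on the client once embeddings have been fetched. The sketch below ranks candidates by dot product against a query vector. The `rank_by_dot` helper and the two-dimensional unit vectors are hypothetical stand-ins chosen so that the ordering mirrors the three cases above; they are not real model output.

```python
import numpy as np

def rank_by_dot(query, candidates):
    """Return (label, score) pairs sorted from most to least similar to `query`."""
    q = np.asarray(query, dtype=float)
    scored = [(label, float(np.dot(q, np.asarray(vec, dtype=float))))
              for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy unit-length vectors standing in for real embeddings (illustrative only).
query = [1.0, 0.0]
candidates = {
    "related": [0.6, 0.8],
    "identical": [1.0, 0.0],
    "paraphrase": [0.96, 0.28],
}
ranking = rank_by_dot(query, candidates)
print([label for label, _ in ranking])  # ['identical', 'paraphrase', 'related']
```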

# ML_EMBED_TABLE operation

The [ML_EMBED_TABLE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-embed-table.html) procedure generates embeddings for an entire column or set of rows in a table.  
This is useful when you need to embed a collection of documents, sentences, or any dataset stored in MySQL.  

The ML_EMBED_TABLE routine runs multiple embedding generations in a batch, in parallel. 

In the example below, we:
1. Create a table with Olympic-related sentences.
2. Use [ML_EMBED_TABLE](https://dev.mysql.com/doc/heatwave/en/mys-hwgenai-ml-embed-table.html) to generate embeddings for all rows.
3. Store the embeddings for later similarity queries.


In [0]:
execute_sql("DROP TABLE IF EXISTS mlcorpus.table_olympics;")

# 1. Create the table
execute_sql("""
CREATE TABLE mlcorpus.table_olympics (
    id INT NOT NULL,
    input TEXT DEFAULT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
""")

# 2. Insert rows
execute_sql("INSERT INTO mlcorpus.table_olympics VALUES(0, 'Paris has been the host for the 2024 Olympic and Paralympic Games.');")
execute_sql("INSERT INTO mlcorpus.table_olympics VALUES(1, 'In the Paris games, 206 territories were represented, alongside the International Olympic Committee (IOC) Refugee Olympic Team. Comparably, the 1900 Olympics – also hosted by Paris – featured athletes from only 24 nations.');")
execute_sql("INSERT INTO mlcorpus.table_olympics VALUES(2, 'Many teams sent more women athletes to Paris 2024 than men.');")

# 3. Query table to verify
df = execute_sql("SELECT * FROM mlcorpus.table_olympics;")
df.head()

| | id | input |
|---|---|---|
| 0 | 0 | Paris has been the host for the 2024 Olympic a... |
| 1 | 1 | In the Paris games, 206 territories were repre... |
| 2 | 2 | Many teams sent more women athletes to Paris 2... |


In [0]:
execute_sql(f"""CALL sys.ML_EMBED_TABLE('mlcorpus.table_olympics.input', 'mlcorpus.table_olympics.response', NULL);""")
df = execute_sql("SELECT * FROM mlcorpus.table_olympics;")
df

| | id | input | response | details |
|---|---|---|---|---|
| 0 | 0 | Paris has been the host for the 2024 Olympic a... | [-0.01162190455943346, 0.04976009577512741, -0... | {"error": null} |
| 1 | 1 | In the Paris games, 206 territories were repre... | [0.049480509012937546, -0.006018734071403742, ... | {"error": null} |
| 2 | 2 | Many teams sent more women athletes to Paris 2... | [-0.012446102686226368, -0.020043356344103813,... | {"error": null} |
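
Each row carries a `details` value with an `error` field, so it is worth checking for failures before using the embeddings. The sketch below separates successful rows from failed ones and then computes all pairwise dot products. The helper names are hypothetical, and the toy rows use invented two-dimensional vectors and an invented error message. It assumes `response` is already a sequence of floats and `details` is a JSON string or a parsed dict.

```python
import json
import numpy as np

def split_embedded_rows(rows):
    """Separate rows with a usable embedding from rows whose details report an error.

    Each row is a dict with "id", "response" and "details" keys, mirroring the
    columns shown above; "details" may be a JSON string or an already-parsed dict.
    """
    ok, failed = [], []
    for row in rows:
        details = row["details"]
        if isinstance(details, (str, bytes, bytearray)):
            details = json.loads(details)
        error = (details or {}).get("error")
        (failed if error else ok).append(row)
    return ok, failed

def pairwise_dot(rows):
    """Stack the embeddings into a matrix and return all pairwise dot products."""
    matrix = np.asarray([row["response"] for row in rows], dtype=float)
    return matrix @ matrix.T

# Toy rows with 2-dimensional vectors (illustrative values, not model output).
rows = [
    {"id": 0, "response": [1.0, 0.0], "details": '{"error": null}'},
    {"id": 1, "response": [0.6, 0.8], "details": '{"error": null}'},
    {"id": 2, "response": None, "details": '{"error": "example failure"}'},
]
ok, failed = split_embedded_rows(rows)
scores = pairwise_dot(ok)
print([r["id"] for r in failed], scores.shape)  # [2] (2, 2)
```

With the DataFrame from the cell above, `df.to_dict("records")` would produce rows in this shape.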


### Close the MySQL connection

In [0]:
mycursor.close()  # Closes the cursor
myconn.close()    # Closes the connection

We invite you to try [HeatWave AutoML and GenAI](https://www.oracle.com/heatwave/free/). If you’re new to Oracle Cloud Infrastructure, try Oracle Cloud Free Trial, a free 30-day trial with US$300 in credits.