# 03. Generate Embeddings

## What are Embeddings?
Computers don't understand text; they understand numbers. 
**Embeddings** are lists of numbers (vectors) that represent the *meaning* of a piece of text.

For example:
- "Dog" and "Puppy" get vectors that sit close together, because the words mean similar things.
- "Dog" and "Car" get vectors that sit far apart, because the words mean different things.

We will use a pre-trained AI model to convert our text chunks into these number lists.
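
To make "similar numbers" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The three-number vectors are invented for illustration; a real model produces much longer vectors, and its actual values will differ.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-number "embeddings", made up for this example only
dog = [0.90, 0.80, 0.10]
puppy = [0.85, 0.75, 0.20]
car = [0.10, 0.20, 0.90]

print("dog vs puppy:", round(cosine_similarity(dog, puppy), 3))
print("dog vs car:  ", round(cosine_similarity(dog, car), 3))
```

The dog/puppy score comes out close to 1, while the dog/car score is much lower. That gap is what a search over embeddings relies on.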

## Step 1: Install Libraries
We use `sentence-transformers`, a Python library for computing text embeddings with pre-trained models.

In [None]:
%pip install sentence-transformers

## Step 2: Load Chunked Data
We load the `silver_chunks` table from the previous step.

In [None]:
spark.sql("USE rag_demo")
df_chunks = spark.table("silver_chunks")
display(df_chunks)

## Step 3: Define Embedding Function
We will use a model called `all-MiniLM-L6-v2`. It is small and fast, runs well on CPUs (a good fit for the Free Edition), and turns each piece of text into a vector of 384 numbers.

In [None]:
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import ArrayType, FloatType
from sentence_transformers import SentenceTransformer

# Name of the pre-trained model we want to download
model_name = "all-MiniLM-L6-v2"

# Define an iterator-style Pandas UDF: Spark passes in batches of rows as pandas Series
@pandas_udf(ArrayType(FloatType()))
def generate_embeddings_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model inside the function so it is created on the worker nodes,
    # once per task rather than once per batch
    model = SentenceTransformer(model_name)

    for text_series in batches:
        # Generate embeddings for the whole batch of text (a 2-D NumPy array)
        embeddings = model.encode(text_series.tolist())

        # Return one list of floats per input row
        yield pd.Series(embeddings.tolist())
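
A Pandas UDF like the one above receives the column in batches rather than one row at a time, and must return exactly one result per input row, in the same order. The plain-Python sketch below imitates that contract without Spark. `fake_encode`, `batches`, and `embed_in_batches` are hypothetical helpers written for this illustration, and `fake_encode` is only a stand-in for `model.encode`, not a real embedding model.

```python
from typing import Iterator, List

def fake_encode(texts: List[str], dim: int = 4) -> List[List[float]]:
    """Hypothetical stand-in for model.encode: one fixed-length vector per text."""
    return [[float(len(t) % (i + 2)) for i in range(dim)] for t in texts]

def batches(rows: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of rows, roughly how Spark feeds a Pandas UDF."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def embed_in_batches(rows: List[str], batch_size: int = 2) -> List[List[float]]:
    """Encode rows batch by batch, keeping one vector per row in the original order."""
    out: List[List[float]] = []
    for batch in batches(rows, batch_size):
        vectors = fake_encode(batch)
        out.extend(vectors)
    return out

chunks = ["first chunk", "second chunk", "third", "fourth chunk of text", "fifth"]
vectors = embed_in_batches(chunks, batch_size=2)
print(len(vectors), len(vectors[0]))  # 5 vectors, each of length 4
```

Because each text is encoded independently, the batch size changes how the work is grouped but not the vectors that come out.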


## Step 4: Compute Embeddings
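Now we apply the function. Spark is lazy, so `withColumn` only records what to do; the model actually runs when an action such as `display` or the write in Step 5 asks for results. This might take a minute or two depending on how much data you have.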

In [None]:
# Apply the UDF to the 'chunk_text' column
# repartition(4) helps split the work across available cores
df_with_embeddings = df_chunks.repartition(4).withColumn(
    "embedding", 
    generate_embeddings_udf(col("chunk_text"))
)

display(df_with_embeddings)

## Step 5: Save to Gold Table
We save the results to `gold_embeddings`. This table now contains each chunk's text alongside its embedding vector.

In [None]:
df_with_embeddings.write.format("delta").mode("overwrite").saveAsTable("gold_embeddings")

print("Success! Embeddings generated and saved to 'gold_embeddings'.")
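
As a rough guide to how large the embedding column gets, the sketch below multiplies out the raw vector size. It assumes 384 numbers per vector (the output size of `all-MiniLM-L6-v2`) stored as 4-byte floats (`FloatType`). The chunk count is a hypothetical figure, and the Delta table's size on disk will differ because of compression and the other columns.

```python
EMBEDDING_DIM = 384   # output size of all-MiniLM-L6-v2
BYTES_PER_FLOAT = 4   # Spark FloatType is a 32-bit float

def embedding_bytes(num_chunks: int, dim: int = EMBEDDING_DIM) -> int:
    """Raw size in bytes of num_chunks embedding vectors, before any compression."""
    return num_chunks * dim * BYTES_PER_FLOAT

num_chunks = 10_000  # hypothetical number of rows in gold_embeddings
size_mib = embedding_bytes(num_chunks) / (1024 * 1024)
print(f"{num_chunks} chunks -> about {size_mib:.1f} MiB of raw vector data")
```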