# Eventhouse as a Vector database for AI embeddings

This notebook provides step-by-step instructions for using an Eventhouse in Microsoft Fabric as a vector database with OpenAI embeddings.

In this notebook, you will:

1. Store precomputed embeddings created by the OpenAI API in an Eventhouse.
2. Convert a raw text query to an embedding ("vectorize") using the Azure OpenAI API.
3. Compare the embedded query vector to the stored vectors using KQL cosine similarity, returning the top 10 most similar vectors.


## Prerequisites

1. A workspace with a [Microsoft Fabric-enabled capacity](https://learn.microsoft.com/fabric/enterprise/licenses#capacity). 
2. An [Eventhouse in Microsoft Fabric](https://learn.microsoft.com/fabric/real-time-intelligence/eventhouse). 
3. An Azure OpenAI resource with the text-embedding-ada-002 (Version 2) model deployed. This model is currently only available in certain regions. For more information, see [Create a resource](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/create-resource).


## Set up your environment

In [None]:
%%configure -f
{"conf":
    {
        "spark.rpc.message.maxSize": "1024"
    }
}

In [None]:
%pip install wget

In [None]:
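# Pin the pre-1.0 SDK: the embedding code in this notebook uses openai.Embedding, which openai 1.0 removed
%pip install openai==0.28.1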

## Download precomputed embeddings



Precomputed embedding data is provided so that you don't have to spend your own credits embedding articles. The data contains tens of thousands of Wikipedia pages that were embedded with the text-embedding-ada-002 OpenAI model.

The cells in this section download the data and load it into a dataframe. Later in the notebook, the dataframe is written to a table in the Eventhouse. The Spark option tableCreateOptions, set to CreateIfNotExist, creates the table automatically if it doesn't exist.


In [None]:
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so it might take some time
wget.download(embeddings_url)

In [None]:
import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("/lakehouse/default/Files/data")

In [None]:
import pandas as pd

from ast import literal_eval

article_df = pd.read_csv('/lakehouse/default/Files/data/vector_database_wikipedia_articles_embedded.csv')
# Read vectors from strings back into a list
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()
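
The vectors are stored in the CSV as text, which is why the cell above passes them through literal_eval. As a minimal sketch of that step, the example below parses a hypothetical two-row sample (not the real dataset) using only the standard library, and shows the column changing from a string to a list of floats.

```python
import csv
import io
from ast import literal_eval

# Hypothetical two-row sample shaped like the downloaded CSV:
# each vector is serialized as a quoted string such as "[0.1, 0.2, 0.3]".
sample_csv = io.StringIO(
    'id,title,content_vector\n'
    '1,Alpha,"[0.1, 0.2, 0.3]"\n'
    '2,Beta,"[0.4, 0.5, 0.6]"\n'
)

rows = list(csv.DictReader(sample_csv))

# Straight after reading, the vector is still plain text.
assert isinstance(rows[0]["content_vector"], str)

# literal_eval evaluates the Python literal only (no arbitrary code)
# and returns a real list of floats.
for row in rows:
    row["content_vector"] = literal_eval(row["content_vector"])

print(type(rows[0]["content_vector"]).__name__, len(rows[0]["content_vector"]))
```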

## Write to Eventhouse

The cluster URI can be found in the Eventhouse [system overview](https://learn.microsoft.com/fabric/real-time-intelligence/manage-monitor-eventhouse#view-system-overview-details-for-an-eventhouse). Enter the name of the database in this Eventhouse to which you will write the data. 

In [None]:
# replace with your Eventhouse Cluster URI, Database name, and Table name
KUSTO_CLUSTER =  "Cluster URI"
KUSTO_DATABASE = "Database Name"
KUSTO_TABLE = "Wiki"

In [None]:
kustoOptions = {"kustoCluster": KUSTO_CLUSTER, "kustoDatabase" :KUSTO_DATABASE, "kustoTable" : KUSTO_TABLE }

access_token=mssparkutils.credentials.getToken(kustoOptions["kustoCluster"])

In [None]:
# Convert the pandas DataFrame to a Spark DataFrame
sparkDF=spark.createDataFrame(article_df)

In [None]:
# Write data to a table in Eventhouse
sparkDF.write. \
format("com.microsoft.kusto.spark.synapse.datasource"). \
option("kustoCluster",kustoOptions["kustoCluster"]). \
option("kustoDatabase",kustoOptions["kustoDatabase"]). \
option("kustoTable", kustoOptions["kustoTable"]). \
option("accessToken", access_token). \
option("tableCreateOptions", "CreateIfNotExist").\
mode("Append"). \
save()

## Embed your query terms with Azure OpenAI


Now, you need to embed the query terms. The embedded query terms can then be compared to the vectors stored in the Eventhouse to find similar entries.

The Azure OpenAI API key is used to vectorize the query terms. For instructions on how to create and retrieve your Azure OpenAI key and endpoint, see the [Azure OpenAI embeddings tutorial](https://learn.microsoft.com/azure/cognitive-services/openai/tutorials/embeddings).
Use the text-embedding-ada-002 (Version 2) model, because the precomputed embeddings were created with the text-embedding-ada-002 OpenAI model and the query vector must come from the same model to be comparable.


In [None]:
import openai

#### Connect to Azure OpenAI

To successfully make a call against Azure OpenAI, you need an endpoint, key, and deployment ID.

| Variable name | Value |
|---|---|
| endpoint | This value can be found in the **Keys & Endpoint** section when examining your resource from the Azure portal. Alternatively, you can find the value in the **Azure OpenAI Studio > Playground > Code View**. An example endpoint is: https://docs-test-001.openai.azure.com/. |
| api key | This value can be found in the **Keys & Endpoint** section when examining your resource from the Azure portal. You can use either KEY1 or KEY2. |
| deployment id | This value can be found under the **Deployments** section in the [Azure OpenAI Studio](https://oai.azure.com/). |

In [None]:
openai.api_version = '2022-12-01'
openai.api_base = 'endpoint' # Add your endpoint here
openai.api_type = 'azure'
openai.api_key = 'api key'  # Add your api key here

def embed(query):
    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
            input=query,
            deployment_id="deployment id", # Add your deployment id here
            chunk_size=1
    )["data"][0]["embedding"]
    return embedded_query
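
The cell above uses placeholder strings for the endpoint, key, and deployment ID. If you prefer not to paste secrets into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the function name and the three variable names are hypothetical, so substitute whatever your environment or secret store provides.

```python
import os


def load_azure_openai_settings(env=None):
    """Return (endpoint, api_key, deployment_id) read from environment variables.

    The variable names below are hypothetical placeholders, not names
    required by Azure OpenAI or by the openai package.
    """
    env = os.environ if env is None else env
    names = (
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_DEPLOYMENT",
    )
    # Fail early with a readable message instead of sending an empty key.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return tuple(env[name] for name in names)


# Possible use with the cell above (left commented out because it needs real values):
# openai.api_base, openai.api_key, deployment_id = load_azure_openai_settings()
```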

### Generate embedding for the search term

In [None]:
searchedEmbedding = embed("most difficult gymnastics moves in the olympics")
#print(searchedEmbedding)
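
Before querying, it can help to confirm that the query embedding has the shape the stored vectors have. The text-embedding-ada-002 model returns 1536-dimensional vectors, so a mismatch usually means a different model or deployment was called. The helper below is a hypothetical sanity check, not part of the OpenAI or Kusto APIs.

```python
import math

EXPECTED_DIM = 1536  # output size of text-embedding-ada-002


def check_embedding(vector, expected_dim=EXPECTED_DIM):
    """Return the vector unchanged, or raise ValueError if it looks malformed."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(vector)}"
        )
    # Every component should be a finite number (no NaN or infinity).
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        raise ValueError("embedding contains non-numeric or non-finite values")
    return vector


# Possible use with the cell above:
# searchedEmbedding = check_embedding(searchedEmbedding)
```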

### Run the semantic search over the data in your Eventhouse

This query uses the [cosine similarity function](https://learn.microsoft.com/azure/data-explorer/kusto/query/series-cosine-similarity-function) to compare the query vector to the vectors stored in the Eventhouse. The example query below returns the top 10 most similar vectors.
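
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, and "top 10" simply sorts rows by that score. The plain-Python sketch below illustrates the formula on three hypothetical two-dimensional rows. It is not how the Eventhouse evaluates the query, and the function names are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # This sketch chooses to reject zero vectors, for which the ratio is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=10):
    """Score every (title, vector) row against the query and keep the k best."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy data: the query points "east".
rows = [("east", [1.0, 0.0]), ("north", [0.0, 1.0]), ("north-east", [1.0, 1.0])]
best = top_k([1.0, 0.0], rows, k=2)
print([title for title, _ in best])
```

The identical direction scores 1, the orthogonal one scores 0, and the diagonal falls in between, which is the ordering the KQL query produces over the stored content vectors.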

If you have used a different table than the one defined above, change the table name in the query below.

In [None]:

kustoQuery = "Wiki | extend similarity = series_cosine_similarity(dynamic("+str(searchedEmbedding)+"), content_vector) | top 10 by similarity desc" 
accessToken = mssparkutils.credentials.getToken(KUSTO_CLUSTER)
kustoDf  = spark.read\
    .format("com.microsoft.kusto.spark.synapse.datasource")\
    .option("accessToken", accessToken)\
    .option("kustoCluster", KUSTO_CLUSTER)\
    .option("kustoDatabase", KUSTO_DATABASE)\
    .option("kustoQuery", kustoQuery).load()

# Example that uses the result data frame.
kustoDf.show()
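
The query string above is assembled by concatenation with the table name written inline. As one possible alternative, the hypothetical helper below builds the same KQL text from parameters, which makes it easier to switch tables, search the title_vector column instead of content_vector, or change the number of results. It only produces a string and does not contact the Eventhouse.

```python
import json


def build_similarity_query(table, query_vector, vector_column="content_vector", k=10):
    """Return a KQL query that ranks rows of `table` by cosine similarity."""
    # json.dumps renders the list as a bracketed array literal for dynamic(...).
    vector_literal = json.dumps([float(x) for x in query_vector])
    return (
        f"{table} "
        f"| extend similarity = series_cosine_similarity(dynamic({vector_literal}), {vector_column}) "
        f"| top {k} by similarity desc"
    )


# Small hypothetical vector so the generated text is easy to read.
print(build_similarity_query("Wiki", [0.1, 0.2], k=3))
```

With the variables defined earlier in the notebook, a call such as build_similarity_query(KUSTO_TABLE, searchedEmbedding) could be passed to the kustoQuery option in place of the concatenated string.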