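## Exercise 1: Prerequisites
Before you start, create an F64 capacity, a Fabric workspace in that capacity, one lakehouse, an Azure AI services multi-service account, an Azure AI Search service, and a key vault. Once these exist, you can work through the exercises below.

### Retrieving the service keys from Key Vault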

In [9]:
from trident_token_library_wrapper import PyTridentTokenLibrary 



access_token = mssparkutils.credentials.getToken("keyvault")
KEY_VAULT_URL = "https://summerschool-kv.vault.azure.net/"

# Azure AI Search

AI_SEARCH_SECRETNAME = "aisearchkey"
AI_SEARCH_KEY = PyTridentTokenLibrary.get_secret_with_token(KEY_VAULT_URL, AI_SEARCH_SECRETNAME, access_token)

AI_SEARCH_INDEX_NAME = "rag-demo-index"

AI_SEARCH_NAME = "summerschoollab3search"

# Azure AI Services
AI_SERVICES_SECRETNAME = "aiservicekey"
AI_SERVICES_KEY = PyTridentTokenLibrary.get_secret_with_token(KEY_VAULT_URL,AI_SERVICES_SECRETNAME,access_token)

AI_SERVICES_LOCATION = "eastus"


## Exercise 2: Loading and Pre-processing PDF Documents 
### Task 1: Download the sample PDF


In [2]:
import requests

import os

url = "https://github.com/Azure-Samples/azure-openai-rag-workshop/raw/main/data/support.pdf"

response = requests.get(url)

# Stop if the download failed, so an error page is not saved as a PDF

response.raise_for_status()

# Specify your path here

path = "/lakehouse/default/Files/"

# Ensure the directory exists

os.makedirs(path, exist_ok=True)

# Write the content to a file in the specified path

filename = url.rsplit("/")[-1]

with open(os.path.join(path, filename), "wb") as f:

    f.write(response.content)
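
The cell above takes the file name from the last segment of the URL. That works for this URL, but it would keep a query string or fragment if the link had one. A minimal sketch of a more defensive variant, using only the standard library, could look like this (`filename_from_url` is a hypothetical helper name, not part of the lab):

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    # urlsplit separates the path from "?query" and "#fragment"
    path = urlsplit(url).path
    # basename keeps only the text after the final "/", unquote decodes %20 etc.
    return unquote(posixpath.basename(path))


sample_url = "https://github.com/Azure-Samples/azure-openai-rag-workshop/raw/main/data/support.pdf"
print(filename_from_url(sample_url))
```

For the workshop URL both approaches give the same file name, so this is optional hardening rather than a required change.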


### Task 2: Loading & Analyzing the Document

In [3]:
from pyspark.sql.functions import udf

from pyspark.sql.types import StringType

document_path = f"Files/{filename}"

df = spark.read.format("binaryFile").load(document_path).select("_metadata.file_name", "content").limit(10).cache()

display(df)


Next, we'll use Azure AI Document Intelligence to read the PDF document and extract its text.

In [4]:
from synapse.ml.services import AnalyzeDocument

from pyspark.sql.functions import col

analyze_document = (

    AnalyzeDocument()

    .setPrebuiltModelId("prebuilt-layout")

    .setSubscriptionKey(AI_SERVICES_KEY)

    .setLocation(AI_SERVICES_LOCATION)

    .setImageBytesCol("content")

    .setOutputCol("result")

)

analyzed_df = (

    analyze_document.transform(df)

    .withColumn("output_content", col("result.analyzeResult.content"))

    .withColumn("paragraphs", col("result.analyzeResult.paragraphs"))

).cache()


In [5]:
# content is not used anymore so we drop it
analyzed_df = analyzed_df.drop("content")

display(analyzed_df)


## Exercise 3: Generating and storing embeddings
### Task 1: Text Chunking
Before we can generate the embeddings, we need to split the text into chunks. To do this we leverage SynapseML's PageSplitter to divide the documents into smaller sections, which are subsequently stored in the chunks column. This allows for more granular representation and processing of the document content.

In [6]:
from synapse.ml.featurize.text import PageSplitter
from pyspark.sql.functions import col, explode

ps = (

    PageSplitter()

    .setInputCol("output_content")

    .setMaximumPageLength(4000)

    .setMinimumPageLength(3000)

    .setOutputCol("chunks")

)

splitted_df = ps.transform(analyzed_df)

display(splitted_df)
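
To build intuition for what the minimum and maximum page lengths do, here is a simplified, pure-Python sketch of length-bounded chunking. It is an approximation for illustration only and does not reproduce PageSplitter's actual algorithm: it cuts at the last whitespace found between the minimum and maximum length, and falls back to a hard cut at the maximum. The function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text: str, min_len: int, max_len: int) -> list[str]:
    """Split text into chunks of at most max_len characters.

    Prefers to cut at the last whitespace located at or after min_len,
    so words are not broken when a suitable boundary exists.
    """
    if not 0 < min_len < max_len:
        raise ValueError("expected 0 < min_len < max_len")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Scan backwards from the hard limit down to the minimum length
            for i in range(end - 1, start + min_len - 1, -1):
                if text[i].isspace():
                    end = i + 1  # keep the whitespace with the current chunk
                    break
        chunks.append(text[start:end])
        start = end
    return chunks


sample = "word " * 100  # 500 characters
pieces = split_into_chunks(sample, min_len=30, max_len=40)
print(len(pieces), max(len(p) for p in pieces))
```

Concatenating the chunks gives back the original text, which is a useful property to check for whenever you change the length settings.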


In [7]:
from pyspark.sql.functions import posexplode, col, concat

# Each "chunks" column contains the chunks for a single document in an array

# The posexplode function will separate each chunk into its own row

exploded_df = splitted_df.select("file_name", posexplode(col("chunks")).alias("chunk_index", "chunk"))

# Add a unique identifier for each chunk

exploded_df = exploded_df.withColumn("unique_id", concat(exploded_df.file_name, exploded_df.chunk_index))

display(exploded_df)
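
The `unique_id` built above is the file name followed by the chunk index, for example `support.pdf0`. Azure AI Search document keys accept only a restricted character set (to my knowledge letters, digits, underscore, dash, and equals sign), so a value containing `.` needs to be transformed before upload. One reversible option is URL-safe Base64, sketched below with hypothetical helper names; this is an alternative shown for illustration, not the approach this lab uses later.

```python
import base64


def encode_key(raw_id: str) -> str:
    """Encode an arbitrary string as URL-safe Base64 (A-Z, a-z, 0-9, '-', '_', '=')."""
    return base64.urlsafe_b64encode(raw_id.encode("utf-8")).decode("ascii")


def decode_key(key: str) -> str:
    """Recover the original string from a key produced by encode_key."""
    return base64.urlsafe_b64decode(key.encode("ascii")).decode("utf-8")


key = encode_key("support.pdf0")
print(key, "->", decode_key(key))
```

Unlike replacing disallowed characters with a fixed character, an encoding like this cannot map two different identifiers to the same key, at the cost of keys that are harder to read.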


### Task 2: Generating Embeddings
Next, we'll generate the embeddings for each chunk. To do this we use both SynapseML and the Azure OpenAI service. By integrating the built-in Azure OpenAI service with SynapseML, we can use the Apache Spark distributed computing framework to process many prompts in parallel.

Generating the embeddings with the built-in service requires an F64 capacity.

In [8]:
from synapse.ml.services import OpenAIEmbedding

embedding = (

    OpenAIEmbedding()

    .setDeploymentName("text-embedding-ada-002")

    .setTextCol("chunk")

    .setErrorCol("error")

    .setOutputCol("embeddings")

)

df_embeddings = embedding.transform(exploded_df)

display(df_embeddings)


### Task 3: Creating the AI Search index

Azure AI Search is a powerful search engine that includes the ability to perform full text search, vector search, and hybrid search. For more examples of its vector search capabilities, see the [azure-search-vector-samples repository](https://github.com/Azure/azure-search-vector-samples/).

Storing data in Azure AI Search involves two main steps:

**Creating the index**: The first step is to define the schema of the search index, which includes the properties of each field as well as any vector search strategies that will be used.

**Adding chunked documents and embeddings**: The second step is to upload the chunked documents, along with their corresponding embeddings, to the index. This allows for efficient storage and retrieval of the data using hybrid and vector search.

In [11]:
import requests

import json

# Length of the embedding vector (OpenAI ada-002 generates embeddings of length 1536)

EMBEDDING_LENGTH = 1536

# Create index for AI Search with fields id, content, and contentVector

# Note the datatypes for each field below

url = f"https://{AI_SEARCH_NAME}.search.windows.net/indexes/{AI_SEARCH_INDEX_NAME}?api-version=2023-11-01"

payload = json.dumps(

    {

        "name": AI_SEARCH_INDEX_NAME,

        "fields": [

            # Unique identifier for each document

            {

                "name": "id",

                "type": "Edm.String",

                "key": True,

                "filterable": True,

            },

            # Text content of the document

            {

                "name": "content",

                "type": "Edm.String",

                "searchable": True,

                "retrievable": True,

            },

            # Vector embedding of the text content

            {

                "name": "contentVector",

                "type": "Collection(Edm.Single)",

                "searchable": True,

                "retrievable": True,

                "dimensions": EMBEDDING_LENGTH,

                "vectorSearchProfile": "vectorConfig",

            },

        ],

        "vectorSearch": {

            "algorithms": [{"name": "hnswConfig", "kind": "hnsw", "hnswParameters": {"metric": "cosine"}}],

            "profiles": [{"name": "vectorConfig", "algorithm": "hnswConfig"}],

        },

    }

)

headers = {"Content-Type": "application/json", "api-key": AI_SEARCH_KEY}

response = requests.request("PUT", url, headers=headers, data=payload)

if response.status_code == 201:

    print("Index created!")

elif response.status_code == 204:

    print("Index updated!")

else:

    print(f"HTTP request failed with status code {response.status_code}")

    print(f"HTTP response body: {response.text}")


Index created!
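
In the schema above, the vector field points to a profile by name (`vectorSearchProfile`), and the profile points to an algorithm by name. A typo in either name makes the request fail. The following sketch rebuilds the same schema as a plain dictionary and checks those references locally before anything is sent. The helper names `build_index_schema` and `check_vector_config` are hypothetical and the check is only an illustration, not something the service requires.

```python
def build_index_schema(index_name: str, dimensions: int) -> dict:
    """Return an index definition with id, content, and contentVector fields."""
    return {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            {"name": "content", "type": "Edm.String", "searchable": True, "retrievable": True},
            {
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "retrievable": True,
                "dimensions": dimensions,
                "vectorSearchProfile": "vectorConfig",
            },
        ],
        "vectorSearch": {
            "algorithms": [{"name": "hnswConfig", "kind": "hnsw", "hnswParameters": {"metric": "cosine"}}],
            "profiles": [{"name": "vectorConfig", "algorithm": "hnswConfig"}],
        },
    }


def check_vector_config(schema: dict) -> list[str]:
    """Return a list of problems with profile and algorithm references (empty if none)."""
    algorithms = {a["name"] for a in schema["vectorSearch"]["algorithms"]}
    profiles = {p["name"]: p["algorithm"] for p in schema["vectorSearch"]["profiles"]}
    problems = []
    for field in schema["fields"]:
        profile = field.get("vectorSearchProfile")
        if profile is None:
            continue  # not a vector field
        if profile not in profiles:
            problems.append(f"field {field['name']}: unknown profile {profile}")
        elif profiles[profile] not in algorithms:
            problems.append(f"profile {profile}: unknown algorithm {profiles[profile]}")
    return problems


schema = build_index_schema("rag-demo-index", 1536)
print(check_vector_config(schema))
```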


### Task 4: Storing embeddings in the AI Search index

The next step is to upload the chunks to the newly created Azure AI Search index. The Azure AI Search REST API supports up to 1000 "documents" per request. Note that in this case, each of our "documents" is in fact a chunk of the original file.

In [13]:
import re

from pyspark.sql.functions import monotonically_increasing_id

def insert_into_index(documents):

    """Uploads a list of 'documents' to Azure AI Search index."""

    url = f"https://{AI_SEARCH_NAME}.search.windows.net/indexes/{AI_SEARCH_INDEX_NAME}/docs/index?api-version=2023-11-01"

    payload = json.dumps({"value": documents})

    headers = {

        "Content-Type": "application/json",

        "api-key": AI_SEARCH_KEY,

    }

    response = requests.request("POST", url, headers=headers, data=payload)

    if response.status_code == 200 or response.status_code == 201:

        return "Success"

    else:

        return f"Failure: {response.text}"

def make_safe_id(row_id: str):

    """Strips disallowed characters from row id for use as Azure AI search document ID."""

    return re.sub("[^0-9a-zA-Z_-]", "_", row_id)

def upload_rows(rows):

    """Uploads the rows in a Spark dataframe to Azure AI Search.

    Limits uploads to 1000 rows at a time due to Azure AI Search API limits.

    """

    BATCH_SIZE = 1000

    rows = list(rows)

    for i in range(0, len(rows), BATCH_SIZE):

        row_batch = rows[i : i + BATCH_SIZE]

        documents = []

        for row in row_batch:

            documents.append(

                {

                    "id": make_safe_id(row["unique_id"]),

                    "content": row["chunk"],

                    "contentVector": row["embeddings"].tolist(),

                    "@search.action": "upload",

                },

            )

        status = insert_into_index(documents)

        yield [row_batch[0]["row_index"], row_batch[-1]["row_index"], status]

# Add ID to help track what rows were successfully uploaded

df_embeddings = df_embeddings.withColumn("row_index", monotonically_increasing_id())

# Run upload_rows on each partition of the dataframe

res = df_embeddings.rdd.mapPartitions(upload_rows)

display(res.toDF(["start_index", "end_index", "insertion_status"]))
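
The batching logic is the part of this cell that is easiest to get wrong: each request must contain only the rows of the current slice, never the whole partition. A minimal, Spark-free sketch of that slicing is shown below; `iter_batches` is a hypothetical helper used only for illustration.

```python
def iter_batches(items: list, batch_size: int):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


rows = list(range(2500))  # stand-in for the rows of one partition
for batch in iter_batches(rows, 1000):
    # In the notebook, this is where one upload request would be sent per batch
    print(len(batch), batch[0], batch[-1])
```

With 2500 rows and a batch size of 1000, this produces three batches, and the last one holds the remaining 500 rows.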


## Exercise 4: Retrieving Relevant Documents and Answering Questions

After processing the document, we can proceed to pose a question. We will use SynapseML to convert the user's question into an embedding and then utilize cosine similarity to retrieve the top K document chunks that closely match the user's question.
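
As a reminder of what cosine similarity and top-K retrieval mean, here is a small brute-force sketch in plain Python. It is for intuition only: the index in this lab is configured with HNSW, an approximate nearest-neighbour algorithm, so the service does not compare the question against every chunk this way. The function names and the toy vectors are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunk_vectors: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


toy_chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], toy_chunks, k=2))
```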

### Task 1: Generating an embedding of the user question
The following function takes a user's question as input and converts it into an embedding using the text-embedding-ada-002 model. This code assumes you're using the prebuilt AI services in Microsoft Fabric.

In [14]:
def gen_question_embedding(user_question):

    """Generates embedding for user_question using SynapseML."""

    from synapse.ml.services import OpenAIEmbedding

    df_ques = spark.createDataFrame([(user_question, 1)], ["questions", "dummy"])

    embedding = (

        OpenAIEmbedding()

        .setDeploymentName('text-embedding-ada-002')

        .setTextCol("questions")

        .setErrorCol("errorQ")

        .setOutputCol("embeddings")

    )

    df_ques_embeddings = embedding.transform(df_ques)

    row = df_ques_embeddings.collect()[0]

    question_embedding = row.embeddings.tolist()

    return question_embedding


### Task 2: Retrieve Relevant Documents
The next step is to use the user question and its embedding to retrieve the top K most relevant document chunks from the search index. The following function retrieves the top K entries using hybrid search, which combines a full-text query on the question with a vector query on its embedding.

In [16]:
import json

import requests

def retrieve_top_chunks(k, question, question_embedding):

    """Retrieve the top K entries from Azure AI Search using hybrid search."""

    url = f"https://{AI_SEARCH_NAME}.search.windows.net/indexes/{AI_SEARCH_INDEX_NAME}/docs/search?api-version=2023-11-01"

    payload = json.dumps({

        "search": question,

        "top": k,

        "vectorQueries": [

            {

                "vector": question_embedding,

                "k": k,

                "fields": "contentVector",

                "kind": "vector"

            }

        ]

    })

    headers = {

        "Content-Type": "application/json",

        "api-key": AI_SEARCH_KEY,

    }

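    response = requests.request("POST", url, headers=headers, data=payload)

    # Fail loudly on an HTTP error instead of returning an error body without "value"

    response.raise_for_status()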

    output = json.loads(response.text)

    return output
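
A hybrid query returns one list even though two rankings are computed, one from the text query and one from the vector query. Azure AI Search documents that it merges them with Reciprocal Rank Fusion (RRF). The sketch below shows the general RRF idea in plain Python; the constant `k=60` and the function name are assumptions for illustration and are not taken from this lab or from the service configuration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of ids into one list, best first.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


text_ranking = ["a", "b", "c"]    # hypothetical full-text result order
vector_ranking = ["b", "c", "a"]  # hypothetical vector result order
print(reciprocal_rank_fusion([text_ranking, vector_ranking]))
```

A chunk that ranks well in both lists ends up ahead of one that is first in a single list but last in the other, which is the behaviour you typically want from hybrid search.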


With those functions defined, we can define a function that takes a user's question, generates an embedding for the question, retrieves the top K document chunks, and concatenates the content of the retrieved documents to form the context for the user's question.

In [17]:
def get_context(user_question, retrieved_k = 5):

    # Generate embeddings for the question

    question_embedding = gen_question_embedding(user_question)

    # Retrieve the top K entries

    output = retrieve_top_chunks(retrieved_k, user_question, question_embedding)

    # Collect the content of the retrieved chunks into a list

    context = [chunk["content"] for chunk in output["value"]]

    return context


### Task 3: Answering the User's Question
Finally, we can define a function that takes a user's question, retrieves the context for the question, and sends both the context and the question to a large language model to generate a response. For this demo, we'll use gpt-35-turbo-16k, a model that is optimized for conversation.

In [20]:
from pyspark.sql import Row

from synapse.ml.services.openai import OpenAIChatCompletion

def make_message(role, content):

    return Row(role=role, content=content, name=role)

def get_response(user_question):

    context = get_context(user_question)

    # Write a prompt with context and user_question as variables

    metaprompt = f"""

    context: {context}

    Answer the question based on the context above.

    If the information to answer the question is not present in the given context then reply "I don't know".

    """

    chat_df = spark.createDataFrame(

        [

            (

                [

                    make_message(

                        "system", metaprompt

                    ),

                    make_message("user", user_question),

                ],

            ),

        ]

    ).toDF("messages")

    chat_completion = (

        OpenAIChatCompletion()

        .setDeploymentName("gpt-35-turbo-16k") # deploymentName could be one of {gpt-35-turbo, gpt-35-turbo-16k}

        .setMessagesCol("messages")

        .setErrorCol("error")

        .setOutputCol("chat_completions")

    )

    result_df = chat_completion.transform(chat_df).select("chat_completions.choices.message.content")

    result = []

    for row in result_df.collect():

        content_string = ' '.join(row['content'])

        result.append(content_string)

    # Join the list into a single string

    result = ' '.join(result)

    return result
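
Because `get_context` returns a Python list, formatting it directly into the prompt inserts the list's printed form, including brackets, quotes, and escaped newlines. A possible refinement, sketched below with a hypothetical `build_metaprompt` helper, is to join the chunks with a visible separator first. This is a suggestion under that assumption, not a change the lab requires.

```python
def build_metaprompt(chunks: list[str]) -> str:
    """Build the system prompt from retrieved chunks, separated by a divider line."""
    context = "\n\n---\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "context:\n"
        f"{context}\n\n"
        "Answer the question based on the context above.\n"
        "If the information to answer the question is not present in the given context "
        'then reply "I don\'t know".'
    )


print(build_metaprompt(["  How to search for rentals.  ", "How to make a booking."]))
```

The instructions in the prompt are the same as in the cell above; only the way the context is rendered changes.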


In [19]:
user_question = "how do i make a booking?"

response = get_response(user_question)

print(response)


To make a booking on Contoso Real Estate, follow these steps:

1. Search for Rentals:
   - Enter your destination, check-in and check-out dates, and the number of guests.
   - Apply filters such as price range, property type, and amenities to narrow down your options.
   - Browse through the listings to find the perfect place for your stay.

2. View Listing Details:
   - Click on a listing to view detailed information, including photos, property description, reviews, and host information.

3. Make a Booking:
   - Click the "Book Now" button on the listing page.
   - Review the booking details, including the total cost and house rules.
   - Confirm your booking by providing payment information.
   - Once the host accepts your booking, you'll receive a confirmation.

Please note that Contoso Real Estate handles the payment process securely, and you'll only be charged once your booking is confirmed. If you have any questions or special requests, you can communicate with the host through o