# Azure SQL + Azure Cognitive Search
This sample shows how to create and use a search index on Azure Cognitive Search when your data is in Azure SQL.

### Requirements

1. Install python-dotenv: `pip install python-dotenv`
   1. Enter your credentials in `example.env`
   2. If needed, install the other Python packages listed in `requirements.txt`
2. You will need pyodbc and an ODBC driver to connect to and interact with Azure SQL from Python
   1. Install pyodbc ([instructions](https://pypi.org/project/pyodbc/))
   2. Install the Microsoft ODBC Driver 18 for SQL Server ([instructions](https://learn.microsoft.com/en-us/sql/connect/odbc/microsoft-odbc-driver-for-sql-server?view=sql-server-ver16))
3. Allow your IP address to access your SQL server by adding it in the [Azure portal](https://ms.portal.azure.com/)
   1. Search for your SQL server resource (note: there are generally two resources, a SQL database and a SQL server; Security / Networking is under the SQL server)
   2. Navigate to Security / Networking
   3. Add your IP

# Load environment variables and keys 

In [None]:
from dotenv import dotenv_values
# Specify the name of the .env file
env_name = "example.env" # change to use your own .env file
config = dotenv_values(env_name)
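
Every later cell reads its settings from `config`, so it can help to confirm up front that no key is missing. The sketch below is one possible check. The key names are the ones this notebook reads; `REQUIRED_KEYS`, `missing_keys`, and the sample dictionary are hypothetical names introduced only for illustration.

```python
# Keys that the cells in this notebook read from the .env file.
REQUIRED_KEYS = [
    "server", "database", "username", "password",
    "openai_api_type", "openai_api_key", "openai_api_base",
    "openai_api_version", "openai_deployment_embedding",
    "openai_deployment_completion",
    "cogsearch_name", "cogsearch_index_name", "cogsearch_api_key",
]

def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in config."""
    return [key for key in required if not config.get(key)]

# Hypothetical, incomplete configuration used only to demonstrate the check.
sample_config = {
    "server": "example.database.windows.net",
    "database": "reviews",
    "username": "",
}
missing = missing_keys(sample_config)
print(missing)
```

With a complete `.env` file, `missing_keys(config)` should return an empty list.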

# Connect to Azure SQL

In [None]:
import pyodbc

# Define your Azure SQL database connection details
server = config["server"] 
database = config["database"] 
username = config["username"] 
password = config["password"] 
driver = '{ODBC Driver 18 for SQL Server}'  # Use the appropriate driver for your system

# Create a connection string
conn_str = f"DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}"

# Establish a connection to the Azure SQL database
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

#### Load data to a table in the database
If this is the first time you are running the notebook, load the sample dataset into the database first. The cell below creates a new table, `food_review`, and loads the data from the CSV file. Note that it drops any existing table of that name before creating the new one.

In [None]:
# Drop previous table of same name if one exists
cursor.execute("DROP TABLE IF EXISTS food_review;")
print("Finished dropping table (if existed)")

# Create a table
cursor.execute("CREATE TABLE food_review (Id integer, ProductId text, UserId text, ProfileName text, HelpfulnessNumerator integer, HelpfulnessDenominator integer, Score integer, Time bigint, Summary text, Text text);")
print("Finished creating table")

# Create an index
cursor.execute("CREATE INDEX idx_Id ON food_review(Id);")
print("Finished creating index")

# Load data
import pandas as pd
df = pd.read_csv('../../DataSet/Reviews_small.csv')

# Specify the batch size
batch_size = 30
table_name = "food_review" 

# Split the dataframe into batches
batches = [df[i:i + batch_size] for i in range(0, len(df), batch_size)]

# Define the parameterized INSERT statement once; it is the same for every batch
query = (
    f"INSERT INTO {table_name} (Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, "
    "HelpfulnessDenominator, Score, Time, Summary, Text) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
)

# Iterate over each batch and insert the data into the database
for batch in batches:
    # Convert the batch dataframe to a list of tuples for bulk insertion
    rows = [tuple(row) for row in batch.itertuples(index=False)]
    cursor.executemany(query, rows)

# pyodbc does not autocommit by default, so commit to persist the table and its rows
conn.commit()

# Insert the data from the CSV file into the database table row by row
# table_name = "food_review"
# for row in df.itertuples(index=False):
#     values = ', '.join(['?'] * len(row))
#     insert_query = f"INSERT INTO {table_name} VALUES ({values});"
#     cursor.execute(insert_query, row)
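
The batching pattern used above can be tried without an Azure SQL instance. The following is a minimal sketch that assumes an in-memory SQLite database as a stand-in for Azure SQL (both `sqlite3` and pyodbc accept `?` placeholders with `executemany`). The table `food_review_demo`, the helper `chunked`, and the rows are made up for illustration.

```python
import sqlite3

def chunked(rows, size):
    """Yield consecutive slices of rows with at most size items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# In-memory SQLite stands in for Azure SQL; the table is a reduced, hypothetical version.
conn_demo = sqlite3.connect(":memory:")
cur_demo = conn_demo.cursor()
cur_demo.execute("CREATE TABLE food_review_demo (Id INTEGER, Score INTEGER, Text TEXT)")

# 70 made-up rows, inserted 30 at a time as in the cell above.
demo_rows = [(i, 5, f"review {i}") for i in range(1, 71)]
batch_sizes = []
for batch in chunked(demo_rows, 30):
    cur_demo.executemany(
        "INSERT INTO food_review_demo (Id, Score, Text) VALUES (?, ?, ?)", batch
    )
    batch_sizes.append(len(batch))
conn_demo.commit()  # persist the inserted rows

cur_demo.execute("SELECT COUNT(Id) FROM food_review_demo")
row_count = cur_demo.fetchone()[0]
print(batch_sizes, row_count)
```

The last batch is simply shorter than the others, so no padding or special casing is needed.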

#### Example query

In [None]:
# Assuming you have already established a connection and have a cursor object

# Execute the SELECT statement
try:
    cursor.execute("SELECT count(Id) FROM food_review;")
    rows = cursor.fetchall()
    for row in rows:
        print(row)
except pyodbc.Error as e:
    print(f"Error executing SELECT statement: {e}")

## Retrieve data from database and store the embedding in Azure Cognitive Search
In this step, we first retrieve the id and a concatenation of the desired columns from the database. We then use Azure OpenAI to generate a text embedding for each row and store the embeddings in Azure Cognitive Search for later retrieval.

#### Retrieve data from database

In [None]:
# Assuming you have already established a connection and have a cursor object

# Execute the SELECT statement
try:
    cursor.execute("SELECT id, CONCAT('productid: ', productid, ' ', 'score: ', score, ' ', 'text: ', text) AS concat FROM food_review;")
    rows = cursor.fetchall()

except pyodbc.Error as e:
    print(f"Error executing SELECT statement: {e}")
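
Each row returned by the query above pairs an id with a single string built by `CONCAT`. As a rough illustration of what that string looks like for non-NULL values, the sketch below builds the same text in Python. `format_review` and the sample values are hypothetical.

```python
def format_review(product_id, score, text):
    """Build the same string the CONCAT expression produces for one row."""
    return f"productid: {product_id} score: {score} text: {text}"

# Made-up review values, for illustration only.
example = format_review("B000EXAMPLE", 5, "Great taffy at a great price.")
print(example)
```

This string, not the raw review text alone, is what gets embedded in the next step.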

#### Create the content and generate the embedding

In [None]:
import openai
import time

openai.api_type = config["openai_api_type"] #"azure"
openai.api_key = config['openai_api_key']
openai.api_base = config['openai_api_base'] #"https://synapseml-openai.openai.azure.com/"
openai.api_version = config['openai_api_version'] 


def createEmbeddings(text):
    response = openai.Embedding.create(input=text , engine=config["openai_deployment_embedding"])
    embeddings = response['data'][0]['embedding']
    return embeddings

content_embeddings = []
idx = []
sleep_timer = 1
for row in rows:
    idx.append(row[0])
    content_embeddings.append(createEmbeddings(row[1]))

    # Delay embedding every 20 rows to stay within OpenAI throughput limit
    if sleep_timer % 20 == 0:
        print("Waiting...")
        time.sleep(10)
    sleep_timer += 1

df = pd.DataFrame({'embeddings': content_embeddings}, index=idx) # storing embeddings in a dataframe
df
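
The loop above pauses for 10 seconds after every 20 embedding requests. One way to keep that pacing logic separate from the embedding call is a small generator, sketched below. `throttled` is a hypothetical helper, and the `sleep` parameter exists only so the pause can be replaced by a fake in the demonstration.

```python
import time

def throttled(items, every=20, pause=10, sleep=time.sleep):
    """Yield items one by one, pausing after each group of `every` items."""
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            sleep(pause)

# Demonstration with a fake sleep that records pauses instead of waiting.
pauses = []
processed = [item for item in throttled(range(45), every=20, pause=10, sleep=pauses.append)]
print(len(processed), pauses)
```

In the notebook this could be used as `for row in throttled(rows): ...`, with the default `time.sleep` providing the real delay.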

#### Store the embeddings in Azure Cognitive Search Vector Store

[Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) provides a simple interface to create a vector database and to store and retrieve data using vector search. You can read more about vector search [here](https://github.com/Azure/cognitive-search-vector-pr/tree/main).

There are two steps to store data in the Azure Cognitive Search vector database:
- First, we create the index (or schema) of the vector database
- Second, we add the chunked documents and their embeddings to the vector datastore

In [None]:
import requests
import json


# Azure Cognitive Search
cogsearch_name = config["cogsearch_name"] #TODO: fill in your cognitive search name
cogsearch_index_name = config["cogsearch_index_name"] #TODO: fill in your index name: must only contain lowercase letters, digits, and dashes
cogsearch_api_key = config["cogsearch_api_key"] #TODO: fill in your api key with admin key

EMBEDDING_LENGTH = 1536


In [None]:
# Create the Cognitive Search index with two fields: id and contentVector
# Note the datatypes for each field below

url = f"https://{cogsearch_name}.search.windows.net/indexes/{cogsearch_index_name}?api-version=2023-10-01-Preview"
payload = json.dumps({
  "name": cogsearch_index_name,
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": True,
      "filterable": True
    },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": True,
      "retrievable": True,
      "dimensions": EMBEDDING_LENGTH,
      "vectorSearchProfile": "my-vector-search-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      {
        "name": "my-hnsw-config",
        "kind": "hnsw",
        # "hnswParameters": {
        #   "m": 4,
        #   "efConstruction": 400,
        #   "metric": "cosine"
        # }
      }
    ],
    "profiles": [
       {
         "name": "my-vector-search-profile",
         "algorithm": "my-hnsw-config"
       }
     ]
  },
  "semantic": {
    "configurations": [
      {
        "name": "my-semantic-config",
        "prioritizedFields": {
          "prioritizedContentFields": [
            {
              "fieldName": "id"
            }
          ],
        }
      }
    ]
  }
})
headers = {
  'Content-Type': 'application/json',
  'api-key': cogsearch_api_key
}

response = requests.request("PUT", url, headers=headers, data=payload)
print(response.status_code)

In [None]:
def batch_append_payload(df):
    """append payload for batch insertion (note: max 1000 rows per insertion) of embeddings to Cognitive Search"""
    value_list = []
    for index, row in df.iterrows():
        value_list.append(
            {
            "id": str(index),
            "contentVector": row['embeddings'],
            "@search.action": "upload"
            }
        )
    print('payload of size {}'.format(len(value_list)))
    print('start: {}'.format(value_list[0]))
    print('end: {}'.format(value_list[-1]))
    payload = json.dumps({
        "value": value_list
    })
    return payload

def BatchInsertToCogSearch(df):
    """Batch insertion of embedding to Cognitive Search, note: column name must be 'embeddings'"""
    url = f"https://{cogsearch_name}.search.windows.net/indexes/{cogsearch_index_name}/docs/index?api-version=2023-10-01-Preview"
    payload = batch_append_payload(df)
    headers = {
    'Content-Type': 'application/json',
    'api-key': cogsearch_api_key,
    }

    response = requests.request("POST", url, headers=headers, data=payload)
    print(response.json())

    if response.status_code == 200 or response.status_code == 201:
        return "Success"
    else:
        return "Failure"

In [None]:
BatchInsertToCogSearch(df)
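
The docstring above notes a limit of 1000 rows per insertion, and the index expects vectors of length `EMBEDDING_LENGTH`. If the dataframe ever holds more rows than that limit, the request bodies need to be split. The sketch below shows one way to do this on plain lists; `build_upload_payloads` is a hypothetical helper and the demonstration data is made up.

```python
import json

def build_upload_payloads(ids, vectors, dims=1536, max_docs=1000):
    """Return JSON request bodies, each holding at most max_docs upload actions."""
    docs = []
    for doc_id, vector in zip(ids, vectors):
        # Reject vectors whose length does not match the index definition.
        if len(vector) != dims:
            raise ValueError(
                f"document {doc_id}: expected {dims} dimensions, got {len(vector)}"
            )
        docs.append(
            {"id": str(doc_id), "contentVector": list(vector), "@search.action": "upload"}
        )
    return [
        json.dumps({"value": docs[start:start + max_docs]})
        for start in range(0, len(docs), max_docs)
    ]

# Made-up ids and tiny 3-dimensional vectors keep the demonstration small.
demo_payloads = build_upload_payloads(range(1, 6), [[0.1, 0.2, 0.3]] * 5, dims=3, max_docs=2)
demo_sizes = [len(json.loads(body)["value"]) for body in demo_payloads]
print(demo_sizes)
```

Each returned string could then be sent with its own POST request to the same `docs/index` URL used above.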

## User Asks a Question 
In this step, the code converts the user's question to an embedding and then retrieves the top K document chunks for that question using cosine similarity. Please note that other similarity metrics can also be used.
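
To make the ranking idea concrete, here is a small brute-force sketch of cosine similarity and top-K selection in plain Python. It is only a conceptual illustration: the search service uses the HNSW index configured earlier rather than comparing against every vector, and the names `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, documents, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc_id: cosine_similarity(query, documents[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Made-up 2-dimensional vectors; the embeddings in this notebook have 1536 dimensions.
toy_documents = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
toy_top = top_k([1.0, 0.1], toy_documents, k=2)
print(toy_top)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common default for text embeddings.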

In [None]:
userQuestion = "Great Taffy"
retrieve_k = 3 # Retrieve the top 3 documents from vector database

In [None]:
# Retrieve k chunks
def retrieve_k_chunk(k, questionEmbedding):
    # Retrieve the top K entries
    url = f"https://{cogsearch_name}.search.windows.net/indexes/{cogsearch_index_name}/docs/search?api-version=2023-10-01-Preview"

    payload = json.dumps({
    "vectorQueries": [
        {
            "kind": "vector",
            "vector": questionEmbedding,
            "fields": "contentVector",
            "k": k
        }
    ]
    })
    headers = {
    'Content-Type': 'application/json',
    'api-key': cogsearch_api_key,
    }

    response = requests.request("POST", url, headers=headers, data=payload)
    output = json.loads(response.text)
    print(response.status_code)
    return output

# Generate embeddings for the question and retrieve the top k document chunks
questionEmbedding = createEmbeddings(userQuestion)
output = retrieve_k_chunk(retrieve_k, questionEmbedding)

In [None]:
print(len(output['value']))

In [None]:
# Use the top k ids to retrieve the actual text from the database 
top_ids = []
for i in range(len(output['value'])):
    top_ids.append(int(output['value'][i]['id']))

print(top_ids)

#### Retrieve text from database

In [None]:
# Assuming you have already established a connection and have a cursor object
top_ids_string = ', '.join(map(str, top_ids))

sql = f"SELECT CONCAT('productid: ', productid, ' ', 'score: ', score, ' ', 'text: ', text) AS concat FROM food_review WHERE Id IN ({top_ids_string})"

# Execute the SELECT statement
try:
    cursor.execute(sql)    
    top_rows = cursor.fetchall()
    for row in top_rows:
        print(row)
except pyodbc.Error as e:
    print(f"Error executing SELECT statement: {e}")
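
Two details of the query above are worth noting: a `WHERE ... IN (...)` clause does not guarantee that rows come back in the order of `top_ids`, and an empty id list would produce invalid SQL. The sketch below shows one way to handle both with `?` placeholders. It assumes an in-memory SQLite database as a stand-in for Azure SQL; `fetch_in_rank_order`, the table `food_review_demo`, and its rows are hypothetical.

```python
import sqlite3

def fetch_in_rank_order(cursor, ids):
    """Fetch rows for ids using placeholders and return them in the order of ids."""
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    cursor.execute(
        f"SELECT Id, Text FROM food_review_demo WHERE Id IN ({placeholders})",
        list(ids),
    )
    by_id = {row[0]: row for row in cursor.fetchall()}
    # Re-order the rows to follow the ranking returned by the vector search.
    return [by_id[i] for i in ids if i in by_id]

# In-memory SQLite and made-up rows stand in for the Azure SQL table.
rank_conn = sqlite3.connect(":memory:")
rank_cur = rank_conn.cursor()
rank_cur.execute("CREATE TABLE food_review_demo (Id INTEGER, Text TEXT)")
rank_cur.executemany(
    "INSERT INTO food_review_demo VALUES (?, ?)",
    [(1, "one"), (2, "two"), (3, "three")],
)

ranked_rows = fetch_in_rank_order(rank_cur, [3, 1])
print(ranked_rows)
```

Selecting the id alongside the text is what makes the re-ordering possible.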


# OPTIONAL: Offer Response to User's Question
To offer a response, you can either follow a simple prompting method as shown below or leverage more sophisticated approaches offered by other libraries, such as [langchain](https://python.langchain.com/en/latest/index.html).

#### Prompting directly using Azure OpenAI Service

In [None]:
# create a prompt template 
template = """
    context :{context}
    Answer the question based on the context above. Provide the product id associated with the answer as well. If the
    information to answer the question is not present in the given context then reply "I don't know".
    Question: {query}
    Answer: """

In [None]:
# create the context from the top_rows
context = ""
for row in top_rows:
    context += row[0]
    context += "\n"
    
print(context)

In [None]:
print(userQuestion)
prompt = template.format(context=context, query=userQuestion)
print(prompt)

In [None]:

response = openai.Completion.create(
    engine=config["openai_deployment_completion"],
    prompt=prompt,
    max_tokens=1024,
    n=1,
    stop=None,
    temperature=1,
)

print("prompt: ", prompt)
print('~~~~~')
# print("response: ", response['choices'][0]['text'].replace('\n', '').replace(' .', '.').strip())
print(response['choices'][0]['text'])

