## Setup Azure OpenAI

We'll start, as usual, by defining our Azure OpenAI service API key and endpoint details and specifying the model deployments we want to use. Then we'll initiate a connection to the Azure OpenAI service.

**NOTE**: As with previous labs, we'll use the values from the `.env` file in the root of this repository.
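Because `os.getenv` returns `None` for any variable that is missing from `.env`, it can help to check the settings up front. Here is a minimal sketch of such a check. `missing_settings` is a hypothetical helper (not part of `python-dotenv`), and the list of names is an assumption based on the variables this lab reads.

```python
import os

# Names this lab reads from the environment (adjust to match your .env file).
REQUIRED_SETTINGS = [
    "OPENAI_API_TYPE",
    "OPENAI_API_KEY",
    "OPENAI_API_BASE",
    "OPENAI_API_VERSION",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
]

def missing_settings(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings: " + ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a misspelled or absent variable before it causes a less obvious error further down.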

In [65]:
import os
from dotenv import load_dotenv

# Load environment variables
if load_dotenv():
    print(f"Found OpenAI Base Endpoint: {os.getenv('OPENAI_API_BASE')}")
else:
    print("No file .env found")
openai_api_type = os.getenv("OPENAI_API_TYPE")
openai_api_key = os.getenv("OPENAI_API_KEY")
openai_api_base = os.getenv("OPENAI_API_BASE")
openai_api_version = os.getenv("OPENAI_API_VERSION")
deployment_name = os.getenv("OPENAI_DEPLOYMENT_NAME")
embedding_name = os.getenv("OPENAI_EMBEDDING_DEPLOYMENTE")
acs_service_name = os.getenv("AZURE_SEARCH_SERVICE_NAME")
acs_endpoint_name = os.getenv("AZURE_SEARCH_ENDPOINT")
acs_index_name = "gds-metadata-index-hzm"
acs_api_key = os.getenv("AZURE_SEARCH_KEY")

Found OpenAI Base Endpoint: https://trefoil.openai.azure.com/


### Initiate the embedding model, the chat completion model and the instruct model

In [67]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import AzureChatOpenAI
from langchain.llms import AzureOpenAI

# Create an Embeddings Instance of Azure OpenAI

embeddings = OpenAIEmbeddings(
    openai_api_base = openai_api_base,
    openai_api_version = openai_api_version,
    deployment_name ="text-embedding-ada-002",
    openai_api_key = openai_api_key,
    openai_api_type = openai_api_type,
    embedding_ctx_length=8191,
    chunk_size=1000,
    max_retries=6)

# Create a Chat Completion Instance of Azure OpenAI
llm = AzureChatOpenAI(
    openai_api_base= openai_api_base,
    openai_api_version= openai_api_version,
    deployment_name="gpt-35-turbo-16k",
    temperature=0,
    openai_api_key= openai_api_key,
    openai_api_type = openai_api_type,
    max_retries=6,
    max_tokens=4000
)

# Create an Instruct (text completion) Instance of Azure OpenAI
llmi = AzureOpenAI(
    openai_api_base= openai_api_base,
    openai_api_version= openai_api_version,
    deployment_name="gpt-35-turbo-instruct",
    temperature=0,
    openai_api_key= openai_api_key,
    openai_api_type = openai_api_type,
    max_retries=6,
    max_tokens=4000
)
print('Completed creation of embedding and completion instances.')

Completed creation of embedding and completion instances.


#### Load the data from the metadatashort.csv file using the LangChain CSV document loader.

In [68]:
from langchain.document_loaders.csv_loader import CSVLoader

# Metadata fields in the CSV
# id,SourceSysId,SourceSysName,businessLine,BusinessEntity,Maturity,DataLifecycle,Location,dataDomain,DataSubDomain,GoldenDataSetName,
# DataExpert,DataValidator,DataDescription,data_steward_id,DataStewardID,data_owner_id,DataOwnerID,DataOwnerName,DataStewardName,
# DataClassification,LegalGroundCollection,HistoricalData,UnlockedGDP,CIARating,NbDataElements

loader = CSVLoader(file_path='../data/metadatashort.csv', source_column='GoldenDataSetName', encoding='utf-8', 
                   csv_args={'delimiter':',', 
                             'fieldnames': ['id','SourceSysId','SourceSysName','businessLine','BusinessEntity','Maturity','DataLifecycle','Location','dataDomain','DataSubDomain','GoldenDataSetName',
                                            'DataExpert','DataValidator','DataDescription','DataStewardID','DataOwnerID','DataOwnerName','DataStewardName',
                                            'DataClassification','LegalGroundCollection','HistoricalData','UnlockedGDP','CIARating','NbDataElements']
                            }
                    )
data = loader.load()
data = data[1:101] # skip the header row (index 0) and keep the first 100 records; widen the slice to load more
print('Loaded %s datasets' % len(data))
data[0]

Loaded 100 datasets


Document(page_content='id: GDS98394\nSourceSysId: SYSUID.288941\nSourceSysName: Dataedo CRDM\nbusinessLine: Leasing\nBusinessEntity: Masreph\nMaturity: Prepared for distribution\nDataLifecycle: Active\nLocation: Europe\ndataDomain: Product\nDataSubDomain: Lease\nGoldenDataSetName: Enterprise Equity Segmentation Map\nDataExpert: Braxton, Eddie\nDataValidator: Hussein, Jazmyne\nDataDescription: This dataset provides a comprehensive view of enterprise equity segmentation, enabling financial institutions to identify and target high-value customers.\nDataStewardID: DOWID384111\nDataOwnerID: DOWID384111\nDataOwnerName: Fernandez, Chelsea\nDataStewardName: Amos, Katelyn\nDataClassification: Non-personal data\nLegalGroundCollection: Corporate restructuring and bankruptcy\nHistoricalData: No\nUnlockedGDP: Achieved (Production)\nCIARating: 1-1-1\nNbDataElements: 14', metadata={'source': 'Enterprise Equity Segmentation Map', 'row': 1})
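The slice above starts at index 1 because the loader was given explicit `fieldnames`. The sketch below shows the effect using only the standard library. It assumes `CSVLoader` passes `csv_args` through to `csv.DictReader`, and the two-column sample is made up for illustration.

```python
import csv
import io

# Hypothetical two-column sample standing in for metadatashort.csv.
sample = io.StringIO("id,GoldenDataSetName\nGDS1,Alpha\nGDS2,Beta\n")

# With explicit fieldnames, the first line is NOT consumed as a header.
rows = list(csv.DictReader(sample, fieldnames=["id", "GoldenDataSetName"]))

print(rows[0])   # the header line, returned as if it were a record
print(rows[1:])  # the real records
```

If you drop the `fieldnames` argument instead, the reader takes the column names from the first line and the slice can start at 0.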

### Create embeddings for every entry/row in our `data` object and put everything in a list called `items`

In [69]:
import uuid  # Generates universally unique identifiers (UUIDs), used below as document keys.

# Let's take a quick look at the data structure of the CSVLoader
print(data[0])
print(data[0].metadata['source'])
print("----------")

# Generate document embeddings for the page_content field of each loaded dataset using Azure OpenAI
items = []
for dataset in data:
    content = dataset.page_content
    items.append({
        "id": str(uuid.uuid4()),
        "GoldenDataSetName": dataset.metadata['source'],
        "content": content,
        "content_vector": embeddings.embed_query(content),
    })

# Print out a sample item to validate the updated data structure.
# It should have the id, GoldenDataSetName, content, and content_vector values.
print(items[0])

page_content='id: GDS98394\nSourceSysId: SYSUID.288941\nSourceSysName: Dataedo CRDM\nbusinessLine: Leasing\nBusinessEntity: Masreph\nMaturity: Prepared for distribution\nDataLifecycle: Active\nLocation: Europe\ndataDomain: Product\nDataSubDomain: Lease\nGoldenDataSetName: Enterprise Equity Segmentation Map\nDataExpert: Braxton, Eddie\nDataValidator: Hussein, Jazmyne\nDataDescription: This dataset provides a comprehensive view of enterprise equity segmentation, enabling financial institutions to identify and target high-value customers.\nDataStewardID: DOWID384111\nDataOwnerID: DOWID384111\nDataOwnerName: Fernandez, Chelsea\nDataStewardName: Amos, Katelyn\nDataClassification: Non-personal data\nLegalGroundCollection: Corporate restructuring and bankruptcy\nHistoricalData: No\nUnlockedGDP: Achieved (Production)\nCIARating: 1-1-1\nNbDataElements: 14' metadata={'source': 'Enterprise Equity Segmentation Map', 'row': 1}
Enterprise Equity Segmentation Map
----------
{'id': 'f1705906-7a02-4bc2
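The loop above calls the embedding endpoint once per row. LangChain embedding classes also expose `embed_documents`, which takes a list of texts, so the same items can be built from a single batched call. Here is a minimal sketch. `build_items` is a hypothetical helper, and the demo uses a stand-in embedder so it runs without calling Azure.

```python
import uuid
from types import SimpleNamespace

def build_items(docs, embed_documents):
    """Build index items, embedding all page contents in one batched call.

    `docs` is a sequence of objects with `page_content` and `metadata`;
    `embed_documents` maps a list of strings to a list of vectors.
    """
    contents = [doc.page_content for doc in docs]
    vectors = embed_documents(contents)
    return [
        {
            "id": str(uuid.uuid4()),
            "GoldenDataSetName": doc.metadata["source"],
            "content": content,
            "content_vector": vector,
        }
        for doc, content, vector in zip(docs, contents, vectors)
    ]

# Demo with a fake document and a fake embedder (both are illustrative only).
fake_docs = [SimpleNamespace(page_content="id: GDS1", metadata={"source": "Alpha"})]
demo_items = build_items(fake_docs, lambda texts: [[0.0, 1.0] for _ in texts])
print(demo_items[0]["GoldenDataSetName"])
```

In this notebook the call would look like `items = build_items(data, embeddings.embed_documents)`.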

In [70]:
dataset.page_content

'id: GDS49573\nSourceSysId: SYSUID.617170\nSourceSysName: Core contact repository\nbusinessLine: Innovation & Technology\nBusinessEntity: Masreph\nMaturity: Catalogued for processing\nDataLifecycle: Active (Under review)\nLocation: Europe\ndataDomain: IT\nDataSubDomain: Data\nGoldenDataSetName: Finance App Insights\nDataExpert: Olivares, Neri\nDataValidator: Webb, Jude\nDataDescription: "Finance App Insights" is a dataset that provides valuable insights into user behavior and trends within the finance app industry.\nDataStewardID: DOWID654616\nDataOwnerID: DOWID339056\nDataOwnerName: Sambula-Sheriff, Ethan\nDataStewardName: Webb, Jason\nDataClassification: Natural data\nLegalGroundCollection: Provision of financial products and services\nHistoricalData: Yes\nUnlockedGDP: Achieved (Production)\nCIARating: 1-1-1\nNbDataElements: 27'

In [71]:
content = dataset.page_content
print(content)

id: GDS49573
SourceSysId: SYSUID.617170
SourceSysName: Core contact repository
businessLine: Innovation & Technology
BusinessEntity: Masreph
Maturity: Catalogued for processing
DataLifecycle: Active (Under review)
Location: Europe
dataDomain: IT
DataSubDomain: Data
GoldenDataSetName: Finance App Insights
DataExpert: Olivares, Neri
DataValidator: Webb, Jude
DataDescription: "Finance App Insights" is a dataset that provides valuable insights into user behavior and trends within the finance app industry.
DataStewardID: DOWID654616
DataOwnerID: DOWID339056
DataOwnerName: Sambula-Sheriff, Ethan
DataStewardName: Webb, Jason
DataClassification: Natural data
LegalGroundCollection: Provision of financial products and services
HistoricalData: Yes
UnlockedGDP: Achieved (Production)
CIARating: 1-1-1
NbDataElements: 27


In [72]:
print(embeddings.embed_query(content))

[-0.0018467813342098478, 0.001482095906154002, -0.007882953555152969, -0.04443772844851857, -0.026559158936302937, 0.019919369202227203, -0.005091224288592525, -0.005245721389196196, -0.016110831736139773, -0.05455550133005336, 0.012237621368369785, 0.006668533608115649, -0.004354667347555534, 0.00251327546062434, -0.038947686225458876, 0.018295351262012198, 0.016599474335189334, -0.03406126023496327, 0.002795322774684376, -0.015061687551432792, -0.01887022501752022, 0.005903232793038744, -0.031589302900014264, 0.0004980740444757142, -0.005702027071743193, 0.007674562214593401, 0.006808659004530004, -0.01565093347679141, -0.030267093076081306, -0.007660190510404086, 0.0052277566425442306, -0.011892696742535945, 0.009219534617613961, -0.005360695954033287, -0.012302294270052347, -0.010577673934850852, 0.012251992606897818, -0.002383928958767292, 0.03386005358234515, -0.013868825393510089, 0.017088116934238895, 0.019488213419934904, -0.01935886761656978, -0.003323487909500517, -0.0150760

### Load metadata into Azure Cognitive Search

Next, we'll create the Azure Cognitive Search index, embed the metadata loaded from the CSV file, and upload the data into the newly created index. Depending on the number of datasets loaded and on rate limiting, the embeddings might take a while, so be patient.

In [74]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
    SemanticSettings,
    VectorSearch,
    HnswVectorSearchAlgorithmConfiguration,
)

### Let's Create the Azure Cognitive Search Index Client

In [75]:
index_client = SearchIndexClient(
    acs_endpoint_name,
    AzureKeyCredential(acs_api_key)
)

### Define search characteristics for the metadata fields in the CSV

In [77]:
# Fields in the csv file
# Definition of the structure of the index. 
# id,SourceSysId,SourceSysName,businessLine,BusinessEntity,Maturity,DataLifecycle,Location,dataDomain,DataSubDomain,GoldenDataSetName,
# DataExpert,DataValidator,DataDescription,data_steward_id,DataStewardID,data_owner_id,DataOwnerID,DataOwnerName,DataStewardName,
# DataClassification,LegalGroundCollection,HistoricalData,UnlockedGDP,CIARating,NbDataElements

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="SourceSysId", type=SearchFieldDataType.String),
    SearchableField(name="SourceSysName", type=SearchFieldDataType.String),
    SearchableField(name="Maturity", type=SearchFieldDataType.String),
    SearchableField(name="DataLifecycle", type=SearchFieldDataType.String),
    SearchableField(name="Location", type=SearchFieldDataType.String),
    SearchableField(name="dataDomain", type=SearchFieldDataType.String),
    SearchableField(name="DataSubDomain", type=SearchFieldDataType.String),
    SearchableField(name="GoldenDataSetName", type=SearchFieldDataType.String),
    SearchableField(name="DataExpert", type=SearchFieldDataType.String),
    SearchableField(name="DataValidator", type=SearchFieldDataType.String),
    SearchableField(name="DataDescription", type=SearchFieldDataType.String),
    SimpleField(name="DataStewardID", type=SearchFieldDataType.String),
    SimpleField(name="DataOwnerID", type=SearchFieldDataType.String),
    SearchableField(name="DataOwnerName", type=SearchFieldDataType.String),
    SearchableField(name="DataStewardName", type=SearchFieldDataType.String),
    SearchableField(name="DataClassification", type=SearchFieldDataType.String),
    SearchableField(name="LegalGroundCollection", type=SearchFieldDataType.String),
    SearchableField(name="HistoricalData", type=SearchFieldDataType.String),
    SearchableField(name="UnlockedGDP", type=SearchFieldDataType.String),
    SimpleField(name="CIARating", type=SearchFieldDataType.String),
    SimpleField(name="NbDataElements", type=SearchFieldDataType.String,sortable=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchField(name="content_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), searchable=True, vector_search_dimensions=1536, vector_search_configuration="my-vector-config"),
]

### Configure Vector Search Configuration

In [78]:
vector_search = VectorSearch(
    algorithm_configurations=[
        HnswVectorSearchAlgorithmConfiguration(
            name="my-vector-config",
            kind="hnsw",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "efSearch": 500,
                "metric": "cosine"
            }
        )
    ]
)

### Configure Semantic Configuration (work in progress)

In [79]:
semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="GoldenDataSetName"),
        prioritized_keywords_fields=[SemanticField(field_name="GoldenDataSetName"), SemanticField(field_name="DataDescription")],
        prioritized_content_fields=[SemanticField(field_name="content")]
    )
)


### Create the semantic settings with the configuration

In [80]:
# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

### Create the search index with the desired vector search and semantic configuration

In [81]:
# Create the search index with the desired vector search and semantic configurations
index = SearchIndex(
    name=acs_index_name,
    fields=fields,
    vector_search=vector_search,
    semantic_settings=semantic_settings
)
result = index_client.create_or_update_index(index)
print(f'The {result.name} index was created.')

The gds-metadata-index-hzm index was created.


### Upload metadata to Azure Cognitive Search index.

In [82]:
from azure.search.documents.models import Vector
from azure.search.documents import SearchClient

# Insert Text and Embeddings into the Azure Cognitive Search index created.
search_client = SearchClient(
    acs_endpoint_name,
    acs_index_name,
    AzureKeyCredential(acs_api_key)
)
result = search_client.upload_documents(items)
print("Successfully added documents to Azure Cognitive Search index.")
print(f"Uploaded {len(items)} documents")

Successfully added documents to Azure Cognitive Search index.
Uploaded 100 documents
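For larger files you may prefer to upload in smaller batches and count how many documents the service reports as indexed. The sketch below is one way to do that. `upload_in_batches` is a hypothetical helper, the batch size of 50 is arbitrary, and it assumes each result returned by the upload call exposes a boolean `succeeded` attribute. The demo uses a fake upload function so it runs offline.

```python
from types import SimpleNamespace

def upload_in_batches(upload, documents, batch_size=50):
    """Upload `documents` in chunks and return how many were reported as succeeded.

    `upload` is a callable such as `search_client.upload_documents`.
    """
    succeeded = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        results = upload(batch)
        succeeded += sum(1 for r in results if r.succeeded)
    return succeeded

# Demo with a stand-in for the search client (illustrative only).
calls = []

def fake_upload(batch):
    calls.append(len(batch))
    return [SimpleNamespace(succeeded=True) for _ in batch]

demo_count = upload_in_batches(fake_upload, [{"id": str(i)} for i in range(5)], batch_size=2)
print(demo_count, calls)
```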


### First, let's do a plain vanilla text search, no vectors or embeddings.

In [86]:
query = "Finance Persona Data"

search_client = SearchClient(
    acs_endpoint_name,
    acs_index_name,
    AzureKeyCredential(acs_api_key)
)

# Execute the search
results = list(search_client.search(
    search_text=query,
    include_total_count=True,
    top=3
))

# Print count of total results.
print(f"Returned {len(results)} results using only text-based search.")
print("----------")
# Iterate over Results
# Index Fields - id, content, content_vector
for result in results:
    print("Dataset: {}".format(result["content"]))
    print("----------")

Returned 3 results using only text-based search.
----------
Dataset: id: GDS92046
SourceSysId: SYSUID.633613
SourceSysName: TransactFinance
businessLine: Leasing
BusinessEntity: America
Maturity: Prepared for distribution
DataLifecycle: Active
Location: United Kingdom
dataDomain: Client
DataSubDomain: Retail
GoldenDataSetName: Finance Persona Data
DataExpert: Bustos, Michael
DataValidator: el-Nour, Samraa
DataDescription: This dataset contains demographic and behavioral information on individuals in the financial sector, providing insights for targeted marketing and customer segmentation.
DataStewardID: DOWID165488
DataOwnerID: DOWID165488
DataOwnerName: Gonzalez Barajas, Freedom
DataStewardName: Hinckfoot, Jesse
DataClassification: Natural data
LegalGroundCollection: Execution of transactions
HistoricalData: Yes
UnlockedGDP: Achieved (Production)
CIARating: 1-2-2
NbDataElements: 34
----------
Dataset: id: GDS30178
SourceSysId: SYSUID.633613
SourceSysName: TransactFinance
businessLine:

### Now let's do a vector search that uses the embeddings we created and inserted into the `content_vector` field in the index.

In [87]:
query = "Who is the data expert of Enterprise Equity Segmentation Map"

search_client = SearchClient(
    acs_endpoint_name,
    acs_index_name,
    AzureKeyCredential(acs_api_key)
)

# You can see here that we are getting the embedding representation of the query.
vector = Vector(
    value=embeddings.embed_query(query),
    k=5,
    fields="content_vector"
)

# Execute the search
results = list(search_client.search(
    search_text="",
    include_total_count=True,
    vectors=[vector],
    select=["id", "content", "GoldenDataSetName"],
))

# Print count of total results.
print(f"Returned {len(results)} results using only vector-based search.")
print("----------")
# Iterate over results and print out the content.
for result in results:
    print(result['GoldenDataSetName'])
    print("----------")

Returned 5 results using only vector-based search.
----------
Enterprise Equity Segmentation Map
----------
Finance Segmentation Master
----------
Finance Application Insights
----------
Finance Email Analytics
----------
Finance Regions Mapping
----------


Did that return what you expected? Probably not. Let's dig deeper to see why by running the same search again, this time returning the `search score` so we can see the value produced by the vector store's cosine similarity calculation.

### Try again, but this time print the relevance score to see why

In [90]:
query = "Who is the data expert of Enterprise Equity Segmentation Map"
search_client = SearchClient(
    acs_endpoint_name,
    acs_index_name,
    AzureKeyCredential(acs_api_key)
)

# You can see here that we are getting the embedding representation of the query.
vector = Vector(
    value=embeddings.embed_query(query),
    k=5,
    fields="content_vector"
)

# Execute the search
results = list(search_client.search(
    search_text="",
    include_total_count=True,
    vectors=[vector],
    select=["id", "content", "GoldenDataSetName"],
))

# Print count of total results.
print(f"Returned {len(results)} results using vector search.")
print("----------")
# Iterate over results and print out the id and search score.
for result in results:  
    print(f"Id: {result['id']}")
    print(f"Name: {result['GoldenDataSetName']}")
    print(f"Score: {result['@search.score']}")
    print("----------")

Returned 5 results using vector search.
----------
Id: f1705906-7a02-4bc2-821f-ac9290187abd
Name: Enterprise Equity Segmentation Map
Score: 0.8599365
----------
Id: b41f7ed8-9efd-4c58-81b9-fd283a93b043
Name: Finance Segmentation Master
Score: 0.82765174
----------
Id: 2c091438-cd28-431d-ba30-46a9459e51aa
Name: Finance Application Insights
Score: 0.8202118
----------
Id: dbf7ca24-8bc7-4a2b-bb8e-f3d30bf70586
Name: Finance Email Analytics
Score: 0.8196362
----------
Id: 2930e649-ba7d-4897-b384-b6db123fbf4d
Name: Finance Regions Mapping
Score: 0.8173157
----------


If you look at the `search score`, you will see the relative ranking of the closest vector matches to the query. The lower the score, the farther apart the two vectors are. Try changing the search term to see whether you can get a higher search score, which means a better match and closer vector proximity.
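To make the relationship between the score and the vectors concrete, here is a small sketch. It assumes that, for the `cosine` metric, the service reports the score as `1 / (1 + cosine distance)`, where cosine distance is `1 - cosine similarity`. Treat that mapping as our reading of the Azure documentation, not something this notebook verifies. The function names are our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_score(similarity):
    """Assumed mapping from cosine similarity to score: 1 / (1 + distance)."""
    return 1.0 / (1.0 + (1.0 - similarity))

def similarity_from_score(score):
    """Invert the assumed mapping to recover the cosine similarity."""
    return 2.0 - 1.0 / score

print(search_score(cosine_similarity([1.0, 0.0], [1.0, 0.0])))  # identical direction
print(search_score(cosine_similarity([1.0, 0.0], [0.0, 1.0])))  # orthogonal vectors
```

Under this assumption the score can never drop below 1/3, which is one reason even weak matches show scores that look fairly high.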

### Try again with a query of your choice and compare the relevance scores

In [93]:

query =  "Who is the data expert of Enterprise Equity Segmentation Map"

search_client = SearchClient(
    acs_endpoint_name,
    acs_index_name,
    AzureKeyCredential(acs_api_key)
)

# You can see here that we are getting the embedding representation of the query.
vector = Vector(
    value=embeddings.embed_query(query),
    k=5,
    fields="content_vector"
)

# Execute the search
results = list(search_client.search(
    search_text="",
    include_total_count=True,
    vectors=[vector],
    select=["id", "content", "GoldenDataSetName"],
))

# Print count of total results.
print(f"Returned {len(results)} results using vector search.")
print("----------")
# Iterate over results and print out the id and search score.
for result in results:  
    print(f"Id: {result['id']}")
    print(f"Name: {result['GoldenDataSetName']}")
    print(f"Score: {result['@search.score']}")
    print("----------")


Returned 5 results using vector search.
----------
Id: f1705906-7a02-4bc2-821f-ac9290187abd
Name: Enterprise Equity Segmentation Map
Score: 0.8599365
----------
Id: b41f7ed8-9efd-4c58-81b9-fd283a93b043
Name: Finance Segmentation Master
Score: 0.82765174
----------
Id: 2c091438-cd28-431d-ba30-46a9459e51aa
Name: Finance Application Insights
Score: 0.8202118
----------
Id: dbf7ca24-8bc7-4a2b-bb8e-f3d30bf70586
Name: Finance Email Analytics
Score: 0.8196362
----------
Id: 2930e649-ba7d-4897-b384-b6db123fbf4d
Name: Finance Regions Mapping
Score: 0.8173157
----------


**NOTE:** As you have seen from the results, different inputs can return different results; it all depends on what data is in the Vector Store. The higher the score, the higher the likelihood of a match.

### Hybrid Searching using Azure Cognitive Search
What is Hybrid Search? Search is implemented at the field level, which means you can build queries that include both vector fields and searchable text fields. The text and vector queries execute in parallel, and the results are merged into a single response. Optionally, you can add semantic search (currently in preview) for even more accuracy, with L2 reranking that uses the same language models that power Bing.

**NOTE:** Hybrid Search is a key value proposition of Azure Cognitive Search in comparison to vector-only data stores. See the Azure Cognitive Search documentation on hybrid search for more details.
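Hybrid scores are not on the same scale as the vector scores we saw earlier, because the merge works on rank positions, not on raw scores. Azure Cognitive Search describes this merge as Reciprocal Rank Fusion (RRF). The sketch below illustrates the idea. The constant `k=60` and the 0-based rank are assumptions we chose because they appear consistent with the hybrid scores printed in this notebook; they are not values read from the service.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one score per id.

    Each document receives 1 / (k + rank) from every list it appears in,
    with rank counted from 0 (both choices are assumptions, see above).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up rankings: one from the text query, one from the vector query.
text_ranking = ["a", "b", "c"]
vector_ranking = ["a", "c", "d"]
fused = reciprocal_rank_fusion([text_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, score)
```

With these assumptions, a document ranked first in both lists gets 2/60, and a document that appears in only one list gets roughly half of that at best.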

In [95]:
# Hybrid Search
# Let's try our original query again using Hybrid Search (ie. Combination of Text & Vector Search)
query = "Who is the data expert of Enterprise Equity Segmentation Map?"

search_client = SearchClient(
    acs_endpoint_name,
    acs_index_name,
    AzureKeyCredential(acs_api_key)
)

# You can see here that we are getting the embedding representation of the query.
vector = Vector(
    value=embeddings.embed_query(query),
    k=5,
    fields="content_vector"
)

# Notice we also fill in the search_text parameter with the query.
results = list(search_client.search(
    search_text=query,
    include_total_count=True,
    top=10,
    vectors=[vector],
    select=["id", "content", "GoldenDataSetName"],
))

# Print count of total results.
print(f"Returned {len(results)} results using hybrid search.")
print("----------")
# Iterate over results and print out the id and search score.
for result in results:
    print(f"Id: {result['id']}")
    print(result['GoldenDataSetName'])
    print(f"Hybrid Search Score: {result['@search.score']}")
    print("----------")

Returned 10 results using hybrid search.
----------
Id: f1705906-7a02-4bc2-821f-ac9290187abd
Enterprise Equity Segmentation Map
Hybrid Search Score: 0.03333333507180214
----------
Id: b41f7ed8-9efd-4c58-81b9-fd283a93b043
Finance Segmentation Master
Hybrid Search Score: 0.032522473484277725
----------
Id: 2c091438-cd28-431d-ba30-46a9459e51aa
Finance Application Insights
Hybrid Search Score: 0.029386531561613083
----------
Id: 2930e649-ba7d-4897-b384-b6db123fbf4d
Finance Regions Mapping
Hybrid Search Score: 0.02524038404226303
----------
Id: dbf7ca24-8bc7-4a2b-bb8e-f3d30bf70586
Finance Email Analytics
Hybrid Search Score: 0.023171285167336464
----------
Id: d375b825-2a56-4d8a-80d1-20ea2274c9dc
Finance Persona Data
Hybrid Search Score: 0.016393441706895828
----------
Id: a9d55d31-b3d4-46f8-adba-7125ee80876f
TransactFinance Data
Hybrid Search Score: 0.01587301678955555
----------
Id: 7e9183db-77b5-4422-870b-5527017fecbb
TransactFinance Data
Hybrid Search Score: 0.015625
----------
Id: 5223

### Hybrid Search
#### Let's try our more specific query again to see the difference in the score returned.

In [97]:
query = "Who is the data expert of Enterprise Equity Segmentation Map?"
search_client = SearchClient(
    acs_endpoint_name,
    acs_index_name,
    AzureKeyCredential(acs_api_key)
)

# You can see here that we are getting the embedding representation of the query.
vector = Vector(
    value=embeddings.embed_query(query),
    k=5,
    fields="content_vector"
)

# -----
# Notice we also fill in the search_text parameter with the query along with the vector.
# -----
results = list(search_client.search(
    search_text=query,
    include_total_count=True,
    top=10,
    vectors=[vector],
    select=["id", "content", "GoldenDataSetName"],
))

# Print count of total results.
print(f"Returned {len(results)} results using hybrid search.")
print("----------")
# Iterate over results and print out the id and search score.
for result in results:  
    print(f"Id: {result['id']}")
    print(f"Title: {result['GoldenDataSetName']}")
    print(f"Hybrid Search Score: {result['@search.score']}")
    print("----------")

Returned 10 results using hybrid search.
----------
Id: f1705906-7a02-4bc2-821f-ac9290187abd
Title: Enterprise Equity Segmentation Map
Hybrid Search Score: 0.03333333507180214
----------
Id: b41f7ed8-9efd-4c58-81b9-fd283a93b043
Title: Finance Segmentation Master
Hybrid Search Score: 0.032522473484277725
----------
Id: 2c091438-cd28-431d-ba30-46a9459e51aa
Title: Finance Application Insights
Hybrid Search Score: 0.029386531561613083
----------
Id: 2930e649-ba7d-4897-b384-b6db123fbf4d
Title: Finance Regions Mapping
Hybrid Search Score: 0.02524038404226303
----------
Id: dbf7ca24-8bc7-4a2b-bb8e-f3d30bf70586
Title: Finance Email Analytics
Hybrid Search Score: 0.023171285167336464
----------
Id: d375b825-2a56-4d8a-80d1-20ea2274c9dc
Title: Finance Persona Data
Hybrid Search Score: 0.016393441706895828
----------
Id: a9d55d31-b3d4-46f8-adba-7125ee80876f
Title: TransactFinance Data
Hybrid Search Score: 0.01587301678955555
----------
Id: 7e9183db-77b5-4422-870b-5527017fecbb
Title: TransactFinanc

### Bringing it All Together with Retrieval Augmented Generation (RAG) + Langchain (LC)
Now that we have our Vector Store set up and the data loaded, we are ready to implement the RAG pattern using AI Orchestration. At a high level, the following steps are required:

* Ask the question
* Create Prompt Template with inputs
* Get the embedding representation of the question
* Use the embedded version of the question to search Azure Cognitive Search (i.e. the Vector Store)
* Inject the results of the search into the Prompt Template & Execute the Prompt to get the completion
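The steps above can be sketched as a single function before we wire in the real services. This is only an outline under stated assumptions: `answer_with_rag` is a hypothetical helper, and `embed`, `search` and `complete` are placeholder callables standing in for the embedding model, Azure Cognitive Search and the chat model. The demo uses trivial fakes so it runs offline.

```python
def answer_with_rag(question, embed, search, complete, template):
    """Run the RAG steps with caller-supplied callables.

    embed(question) -> vector
    search(vector)  -> list of text snippets
    complete(prompt) -> answer string
    """
    question_vector = embed(question)           # embed the question
    snippets = search(question_vector)          # query the vector store
    prompt = template.format(                   # inject results into the prompt
        original_question=question,
        search_results="\n".join(snippets),
    )
    return complete(prompt)                     # execute the prompt

TEMPLATE = (
    "Question: {original_question}\n\n"
    "Only use the data below when responding.\n"
    "{search_results}\n"
)

# Demo with stand-ins: the fake "model" just upper-cases the prompt.
demo_answer = answer_with_rag(
    "Which datasets exist?",
    embed=lambda text: [float(len(text))],
    search=lambda vector: ["Alpha", "Beta"],
    complete=lambda prompt: prompt.upper(),
    template=TEMPLATE,
)
print(demo_answer)
```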

### Question to be asked

In [99]:
question = "List all Golden dataset from Business Line Leasing."

In [105]:
# Create a prompt template with variables, note the curly braces
from langchain.prompts import PromptTemplate
prompt = PromptTemplate(
    input_variables=["original_question","search_results"],
    template="""
    Question: {original_question}

    Do not use any other data.
    Only use the movie data below when responding.
    {search_results}
    """,
)


### Get Embedding for the original question

In [106]:
question_embedded=embeddings.embed_query(question)

### Search Vector Store

In [107]:
vector = Vector(
    value=question_embedded,
    k=5,
    fields="content_vector"
)

In [108]:
results = list(search_client.search(
    search_text="",
    include_total_count=True,
    vectors=[vector],
    select=["GoldenDataSetName"],
))

### Build the Prompt and Execute against the Azure OpenAI to get the completion

In [109]:
from langchain.chains import LLMChain
chain = LLMChain(llm=llm, prompt=prompt, verbose=True)
response = chain.run({"original_question": question, "search_results": results})
print(response)



> Entering new LLMChain chain...
Prompt after formatting:

    Question: List all Golden dataset from Business Line Leasing.

    Do not use any other data.
    Only use the movie data below when responding.
    [{'GoldenDataSetName': 'Golden Finance Elements Catalog', '@search.score': 0.851433, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}, {'GoldenDataSetName': 'Non-Real Estate Collateral Data', '@search.score': 0.8487694, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}, {'GoldenDataSetName': 'Finance Entity Insights', '@search.score': 0.84847194, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}, {'GoldenDataSetName': 'Finance Persona Data', '@search.score': 0.8469015, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}, {'GoldenDataSetName': 'Credit Finance Offerings Dataset', '@search.score': 0.84545535, '@search
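In the formatted prompt above, `{search_results}` is filled with the raw result objects, so the model also sees bookkeeping keys such as `@search.score`. One option is to render only the fields you care about before building the prompt. A minimal sketch follows. `format_search_results` is a hypothetical helper, it assumes each result behaves like a dictionary, and the sample values are copied from the output above.

```python
def format_search_results(results, fields):
    """Render each result as one line containing only the requested fields."""
    lines = []
    for result in results:
        parts = [f"{field}: {result[field]}" for field in fields if field in result]
        lines.append("; ".join(parts))
    return "\n".join(lines)

sample_results = [
    {"GoldenDataSetName": "Finance Persona Data", "@search.score": 0.8469015},
    {"GoldenDataSetName": "Finance Entity Insights", "@search.score": 0.84847194},
]
context = format_search_results(sample_results, ["GoldenDataSetName"])
print(context)
```

The resulting string can then be passed as `search_results` in place of the raw list. If the question depends on other attributes (the business line, for example), add `content` to both the `select` list and `fields`.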

### Using the `ConversationalRetrievalChain` chain

In [125]:
from langchain.vectorstores.azuresearch import AzureSearch
acs = AzureSearch(azure_search_endpoint=acs_endpoint_name,
                 azure_search_key=acs_api_key,
                 index_name=acs_index_name,
                 embedding_function=embeddings.embed_query)

### Initiate your retriever

In [None]:
retriever = acs.as_retriever()

We create our question-answering chat chain. In this case, we specify the condense question prompt, which converts the user's question into a standalone question (using the chat history) in case the user asked a follow-up question.

In [132]:
# Adapt if needed
from langchain.chains import ConversationalRetrievalChain
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template("""Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:""")

qa = ConversationalRetrievalChain.from_llm(llm=llm,
                                           retriever=acs.as_retriever(),
                                           condense_question_prompt=CONDENSE_QUESTION_PROMPT,
                                           return_source_documents=True,
                                           verbose=False)


Let’s ask a question:

In [133]:
chat_history = []  # first question: there are no previous turns yet
query = "Who is the owner of the golden dataset named Trade Finance Collateral Dataset?"
result = qa({"question": query, "chat_history": chat_history})

print("Question:", query)
print("Answer:", result["answer"])


Question: Who is the owner of the golden dataset named Trade Finance Collateral Dataset?
Answer: The owner of the Trade Finance Collateral Dataset is Rovaris, Desiree.


From here, we can also ask follow-up questions:

In [137]:
print(result)

{'question': 'ok thanks, I need to do some analysis on lending practices and risk management in the finance sector, which datasets you recomend?', 'chat_history': [('What is his data owner id?', 'The data owner ID of Rovaris, Desiree, the owner of the golden dataset named Trade Finance Collateral Dataset is DOWID945815.')], 'answer': 'Based on the provided information, I recommend the following datasets for analysis on lending practices and risk management in the finance sector:\n\n1. Mortgage Risk Finance Dataset (GoldenDataSetName: Mortgage Risk Finance Dataset)\n   - SourceSysName: Mortgage risk finance sys\n   - BusinessEntity: Insurance\n   - DataDescription: Provides information on the risk associated with mortgage loans, allowing financial institutions to make informed lending decisions.\n   - DataClassification: Non-sensitive personal data\n   - LegalGroundCollection: Internal reporting and analysis\n\n2. Risk Finance Insights Dataset (GoldenDataSetName: Risk Finance Insights D

In [134]:
chat_history = [(query, result["answer"])]
query = "What is his data owner id?"
result = qa({"question": query, "chat_history": chat_history})

print("Question:", query)
print("Answer:", result["answer"])


Question: What is his data owner id?
Answer: The data owner ID of Rovaris, Desiree, the owner of the golden dataset named Trade Finance Collateral Dataset is DOWID945815.
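The follow-up cells rebuild `chat_history` from the latest turn only, so earlier turns are dropped. If you want the chain to see the whole conversation, one option is to append each turn to a single list. Here is a minimal sketch. `ask` is a hypothetical helper, and `qa` can be any callable with the same input and output shape as the chain above. The demo uses a fake chain so it runs offline.

```python
def ask(qa, query, chat_history):
    """Ask `query`, then append the (question, answer) turn to `chat_history`."""
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result["answer"]))
    return result

# Demo with a stand-in chain that reports how many turns it was given.
def fake_qa(inputs):
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to: {inputs['question']} (history turns: {turns})"}

history = []
first = ask(fake_qa, "Who owns dataset X?", history)
second = ask(fake_qa, "What is the owner's id?", history)
print(second["answer"])
print(len(history))
```

With the real chain you would call `ask(qa, query, history)` in each cell and keep reusing the same `history` list.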


In [135]:
chat_history = [(query, result["answer"])]
query = "ok thanks, I need to do some analysis on lending practices and risk management in the finance sector, which datasets you recomend?"
result = qa({"question": query, "chat_history": chat_history})

print("Question:", query)
print("Answer:", result["answer"])

Question: ok thanks, I need to do some analysis on lending practices and risk management in the finance sector, which datasets you recomend?
Answer: Based on the provided information, I recommend the following datasets for analysis on lending practices and risk management in the finance sector:

1. Mortgage Risk Finance Dataset (GoldenDataSetName: Mortgage Risk Finance Dataset)
   - SourceSysName: Mortgage risk finance sys
   - BusinessEntity: Insurance
   - DataDescription: Provides information on the risk associated with mortgage loans, allowing financial institutions to make informed lending decisions.
   - DataClassification: Non-sensitive personal data
   - LegalGroundCollection: Internal reporting and analysis

2. Risk Finance Insights Dataset (GoldenDataSetName: Risk Finance Insights Dataset)
   - SourceSysName: HOC
   - BusinessEntity: Masreph
   - DataDescription: Provides valuable information on financial risk management, allowing businesses to make informed decisions and mit