## 📚 Prerequisites

Before executing this notebook, make sure you have properly set up your Azure Services, created your Conda environment, and configured your environment variables as per the instructions provided in the [README.md](README.md) file.

> `%pip install azure-search-documents==11.4.0b10`

## 📋 Table of Contents

Explore different retrieval methods in Azure AI Search:

1. [**Understanding Types of Search**](#define-field-types): This section provides a comprehensive overview of the different types of search methods available in Azure AI Search.
2. [**Keyword Search**](#keyword-search): Use direct query term matching with document content.
3. [**Vector Search**](#vector-search): Employ embeddings for semantic content understanding and relevance ranking.
4. [**Hybrid Search**](#hybrid-search): Combine keyword and vector search for comprehensive results.
5. [**Reranking Search**](#reranking-search): Reorder initial search results for improved top result relevance.

Additional resources:
- [Azure AI Search Documentation](https://learn.microsoft.com/en-us/azure/search/)

### 🧭 Understanding Types of Search  

- **Keyword Search**: Traditional search method relying on direct term matching. Efficient for exact matches but struggles with synonyms and context. [Learn More](https://learn.microsoft.com/en-us/azure/search/search-lucene-query-architecture)

- **Vector Search**: Converts text into high-dimensional vectors to understand semantic meaning. Finds relevant documents even without exact keyword matches. Effectiveness depends on the quality of the embedding model's training data. [Learn More](https://learn.microsoft.com/en-us/azure/search/vector-search-overview)

- **Hybrid Search**: Combines Keyword and Vector Search for comprehensive, contextually relevant results. Effective for complex queries requiring nuanced understanding. [Learn More](https://learn.microsoft.com/en-us/azure/search/vector-search-ranking#hybrid-search)

- **Reranking Search**: Fine-tunes initial search results using advanced algorithms for relevance. Useful when initial retrieval returns relevant but not optimally ordered results. [Learn More](https://learn.microsoft.com/en-us/azure/search/semantic-search-overview)

### 🚧 Limitations

##### Keyword Search

- **Synonym Challenges**: Struggles with recognizing synonyms or different expressions of the same concept.
- **Context Understanding**: May not fully capture the broader context or the query's intent, especially in complex queries.

##### Embedding-Based Search

- **Keyword Precision**: May miss documents that contain exact terms if those terms don't semantically align with the query or document's overall content.
- **Contextual Misinterpretations**: May overgeneralize or incorrectly interpret context, missing specific nuances.
- **Training Data Dependency**: Performance heavily relies on the diversity and depth of the training data.

### 💡 Recommendations

To achieve higher relevance out of the box: 

1. **Hybrid Search**: Combines keyword and vector search methods to ensure comprehensive document retrieval across a range of queries, from highly specific to semantically complex.

2. **Re-Ranking (L2 Ranking) in AI Search**: Reorders the initial (L1) search results with a second-stage semantic ranking model, improving relevance and accuracy, especially for nuanced queries.

In [2]:
# %pip install azure-search-documents==11.4.0b10

In [1]:
import os

# Define the target directory (change yours)
target_directory = r"C:\Users\pablosal\Desktop\gbbai-chat-with-your-database"

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-chat-with-your-database


In [2]:
import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import RawVectorQuery

from src.aoai.azure_openai import AzureOpenAIManager

In [3]:
# Load environment variables from .env file
load_dotenv()

# Set up Azure AI Search credentials (both environment variables must be defined in .env)
service_endpoint = os.getenv("AZURE_AI_SEARCH_SERVICE_ENDPOINT")
key = os.getenv("AZURE_SEARCH_ADMIN_KEY")
credential = AzureKeyCredential(key)

# Define the name of the Azure Search index
# This is the index where your data is stored in Azure Search
index_name = "query-dev-index"

# Set up the Azure Search client with the specified index
# This prepares the client to interact with the Azure Search service
search_client = SearchClient(service_endpoint, index_name, credential=credential)

In [4]:
embedding_aoai_deployment_model = "foundational-ada"
aoai_client = AzureOpenAIManager()

In [5]:
search_query = "Identify players who played more than 100 games, have an OPS (On-base Plus Slugging) higher than .900, and have less than 10 errors in a season."
search_vector = aoai_client.generate_embedding(input_text=search_query)

## Keyword Search 

**Full-text search**: This method reports relevance through the `@search.score` property and uses the BM25 algorithm for scoring. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. There is no upper limit for the score in this method. The sample response below also includes `@search.features`, a per-field breakdown that is returned only when the `featuresMode` query parameter is enabled.

```json
"value": [
 {
    "@search.score": 5.1958685,
    "@search.features": {
        "description": {
            "uniqueTokenMatches": 1.0,
            "similarityScore": 0.29541412,
            "termFrequency" : 2
        },
        "title": {
            "uniqueTokenMatches": 3.0,
            "similarityScore": 1.75451557,
            "termFrequency" : 6
        }
    }
 }
]
 ```

- `uniqueTokenMatches`: This parameter indicates the number of unique query terms found in the document field. A higher value means more unique query terms were found, suggesting a stronger match.

- `similarityScore`: This parameter measures how similar the content of the document field is to the query terms. It is a term-based (BM25) measure, not a semantic one. A higher `similarityScore` means the field content matches the query terms more closely, indicating a more relevant match.

- `termFrequency`: This parameter shows how often the query terms appear within the document field. A higher `termFrequency` means the query terms appear more often, suggesting a stronger match.

These parameters contribute to the overall `@search.score`. The `@search.score` is a cumulative measure of a document's relevance to the search query. A higher `@search.score` indicates a stronger match between the document and the search query.

When interpreting search results, documents with higher scores are generally considered more relevant to the query than those with lower scores.
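
To make the bag-of-words idea concrete, here is a minimal sketch of a textbook BM25 scorer. It is not the service's implementation: the whitespace tokenizer, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions made for illustration, so the numbers will not match what Azure AI Search returns.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    # Assumption: naive lowercase + whitespace tokenization (no analyzer, no stemming).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    query_terms = set(query.lower().split())

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_counts[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Document frequency: how many documents contain the term at all.
            df = sum(1 for other in tokenized if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization dampens the score of longer documents.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, unrelated to the index used in this notebook.
docs = [
    "player games played and errors per season",
    "team stadium address and seating capacity",
    "player batting average",
]
scores = bm25_scores("player errors", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The first document matches both query terms and scores highest, the third matches only the common term `player`, and the second matches nothing and scores zero. Word order and proximity play no role, which is the bag-of-words property described above.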

In [7]:
# keyword search
r = search_client.search(search_query, top=5)
for doc in r:
    if "players" in doc["table_content"]:
        content = doc["table_content"].replace("\n", " ")[:1000]
        print(f"score: {doc['@search.score']}. {content}")

score: 7.3274617. The provided schema is for a table named 'detroit_tigers_baseball_stats', which likely stores statistical data related to the players of the Detroit Tigers, a professional baseball team. The table has eight columns, each representing different characteristics or statistics related to a player.  1. 'name' (nvarchar): This column is used to store the name of the player. As the data type is 'nvarchar', it can contain both text and numbers, allowing for diverse names.  2. 'position' (nvarchar): This field stores the position of a player on the field. The 'nvarchar' data type indicates that this field can also contain alphanumeric characters, allowing for various positions like 'SS' (Short Stop), 'CF' (Center Field), etc.  3. 'Games_Played' (tinyint): This field records the number of games a player has played. The 'tinyint' data type indicates that this number is likely to be relatively small, generally less than 255.  4. 'At_Bats' (smallint): This field represents the num

## Vector Search 

This method also reports relevance through `@search.score`. HNSW (Hierarchical Navigable Small World) is an efficient algorithm for approximate nearest neighbor search in high-dimensional spaces: it selects the candidate documents, and the score itself is derived from the similarity metric configured for the vector field. The scoring range is 0.333 - 1.00 for Cosine similarity, and 0 to 1 for Euclidean and DotProduct similarities.
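
The 0.333 lower bound for Cosine follows if the score is computed as `1 / (1 + cosine_distance)` with `cosine_distance = 1 - cosine_similarity`. The sketch below assumes that mapping (it is treated as an assumption here, not verified against the service) and shows how a reported score could be converted back to a raw cosine similarity. The function names are hypothetical helpers, not SDK calls.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_to_search_score(similarity):
    """Assumed mapping: score = 1 / (1 + (1 - similarity)) = 1 / (2 - similarity)."""
    return 1.0 / (2.0 - similarity)


def search_score_to_cosine(score):
    """Inverse of the assumed mapping above."""
    return 2.0 - 1.0 / score


# Boundary cases: opposite, orthogonal, and identical directions.
for similarity in (-1.0, 0.0, 1.0):
    print(similarity, round(cosine_to_search_score(similarity), 3))

# Round trip on two toy vectors.
sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 5.0])
print(sim, search_score_to_cosine(cosine_to_search_score(sim)))
```

Under this assumption, a cosine similarity of -1 maps to about 0.333 and a similarity of 1 maps to 1.0, which matches the range stated above.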

In [8]:
# Pure vector Search
r = search_client.search(
    None,
    top=5,
    vector_queries=[RawVectorQuery(vector=search_vector, k=50, fields="table_vector")],
)
for doc in r:
    content = doc["table_content"].replace("\n", " ")[:1000]
    print(f"score: {doc['@search.score']}. {content}")

score: 0.82221097. The provided schema is for a table named 'detroit_tigers_baseball_stats', which likely stores statistical data related to the players of the Detroit Tigers, a professional baseball team. The table contains eight columns, each representing a different aspect of a player's performance or role in the team.  1. 'name': This field is of data type 'nvarchar', which means it stores non-Unicode character data. It's likely used to store the names of the players.  2. 'position': This 'nvarchar' field probably represents the role or position a player holds within the team (such as pitcher, catcher, infielder, etc).  3. 'Games_Played': This 'tinyint' field likely indicates the number of games a player has participated in.   4. 'At_Bats': The 'At_Bats' column, a 'smallint' data type, likely represents the number of times a player has been at bat.  5. 'Hits': This 'tinyint' field is likely used to record the number of successful hits a player has made.  6. 'Home_Runs': This is ano

## Hybrid Search

This method uses the `@search.score` parameter and the RRF (Reciprocal Rank Fusion) algorithm for scoring. The RRF algorithm is a method for data fusion that combines the results of multiple queries. The upper limit of the score is bounded by the number of queries being fused, with each query contributing a maximum of approximately 1 to the RRF score. For example, merging three queries would produce higher RRF scores than if only two search results are merged.
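
As a rough illustration of how rank fusion works, the following sketch implements the generic RRF formula, where each ranked list contributes `1 / (k + rank)` to a document's fused score. The constant `k=60`, the 1-based ranks, and the table identifiers are assumptions for this example; the exact constant and rank offset used by the service are not confirmed here.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids (best first) with Reciprocal Rank Fusion."""
    fused = {}
    for ranking in rankings:
        # Assumption: ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from a keyword query and a vector query.
keyword_ranking = ["table_a", "table_b", "table_c"]
vector_ranking = ["table_b", "table_c", "table_a"]

fused = rrf_fuse([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.6f}")
```

Here `table_b` wins because it ranks near the top of both lists, even though it is not first in the keyword ranking. Fused scores stay small because each list contributes only a reciprocal of the rank plus the constant. The hybrid score of about 0.03306 printed in this notebook is numerically close to `1/60 + 1/61`, which would be consistent with a constant near 60.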

In [9]:
r = search_client.search(
    search_query,
    top=5,
    vector_queries=[RawVectorQuery(vector=search_vector, k=50, fields="table_vector")],
)
for doc in r:
    content = doc["table_content"].replace("\n", " ")[:1000]
    print(
        f"score: {doc['@search.score']}, reranker: {doc['@search.reranker_score']}. {content}"
    )

score: 0.03306011110544205, reranker: None. The provided schema is for a table named 'detroit_tigers_baseball_stats', which likely stores statistical data related to the players of the Detroit Tigers, a professional baseball team. The table has eight columns, each representing different characteristics or statistics related to a player.  1. 'name' (nvarchar): This column is used to store the name of the player. As the data type is 'nvarchar', it can contain both text and numbers, allowing for diverse names.  2. 'position' (nvarchar): This field stores the position of a player on the field. The 'nvarchar' data type indicates that this field can also contain alphanumeric characters, allowing for various positions like 'SS' (Short Stop), 'CF' (Center Field), etc.  3. 'Games_Played' (tinyint): This field records the number of games a player has played. The 'tinyint' data type indicates that this number is likely to be relatively small, generally less than 255.  4. 'At_Bats' (smallint): Thi

#### Enable Exhaustive KNN (`exhaustive=True`)
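
Setting `exhaustive=True` on the vector query in the next cell requests an exhaustive k-nearest-neighbor search, which compares the query vector against every indexed vector instead of walking the approximate HNSW graph. Conceptually this is a brute-force scan, sketched below with plain Python on toy vectors. The helper names and data are hypothetical, and cosine similarity is assumed as the metric.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exhaustive_knn(query_vector, vectors, k):
    """Score every stored vector against the query and keep the k most similar."""
    scored = (
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in vectors.items()
    )
    # nlargest visits every candidate exactly once: exact results, linear cost.
    return heapq.nlargest(k, scored)


# Hypothetical 2-dimensional embeddings keyed by document id.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
neighbors = exhaustive_knn([1.0, 0.0], vectors, k=2)
for similarity, doc_id in neighbors:
    print(f"{doc_id}: {similarity:.4f}")
```

The trade-off is exact recall at the price of a scan over all vectors. On a small index such as the one used here, both modes can return the same top results, as the identical scores in the two hybrid outputs suggest.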

In [10]:
r = search_client.search(
    search_query,
    top=5,
    vector_queries=[
        RawVectorQuery(
            vector=search_vector, k=50, fields="table_vector", exhaustive=True
        )
    ],
)
for doc in r:
    content = doc["table_content"].replace("\n", " ")[:1000]
    print(
        f"score: {doc['@search.score']}, reranker: {doc['@search.reranker_score']}. {content}"
    )

score: 0.03306011110544205, reranker: None. The provided schema is for a table named 'detroit_tigers_baseball_stats', which likely stores statistical data related to the players of the Detroit Tigers, a professional baseball team. The table has eight columns, each representing different characteristics or statistics related to a player.  1. 'name' (nvarchar): This column is used to store the name of the player. As the data type is 'nvarchar', it can contain both text and numbers, allowing for diverse names.  2. 'position' (nvarchar): This field stores the position of a player on the field. The 'nvarchar' data type indicates that this field can also contain alphanumeric characters, allowing for various positions like 'SS' (Short Stop), 'CF' (Center Field), etc.  3. 'Games_Played' (tinyint): This field records the number of games a player has played. The 'tinyint' data type indicates that this number is likely to be relatively small, generally less than 255.  4. 'At_Bats' (smallint): Thi

## Semantic ranking

This method reports relevance through `@search.rerankerScore` (exposed as `@search.reranker_score` in the Python SDK results used below) and uses a semantic ranking algorithm for scoring. Semantic ranking uses machine learning models to understand the semantic content of the queries and documents, and reranks the initially retrieved documents based on their relevance to the query. The scoring range is 0.00 - 4.00 in this method.

Remember, a higher score indicates a higher relevance of the document to the query.
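
One way to use the reranker score downstream is to drop weak matches before passing results on. The sketch below works on plain dictionaries shaped like the results printed in this notebook. The helper name, the sample values, and the `min_score=1.5` threshold are illustrative assumptions, not recommended settings.

```python
def filter_by_reranker_score(docs, min_score=1.5):
    """Keep documents whose reranker score meets the threshold, best first."""
    kept = []
    for doc in docs:
        score = doc.get("@search.reranker_score")
        # The score is None when semantic ranking was not requested for the query.
        if score is not None and score >= min_score:
            kept.append(doc)
    return sorted(kept, key=lambda d: d["@search.reranker_score"], reverse=True)


# Hypothetical results; only the first table name appears in this notebook.
sample_results = [
    {"table_name": "detroit_tigers_baseball_stats", "@search.reranker_score": 2.5},
    {"table_name": "hypothetical_stadiums", "@search.reranker_score": 0.8},
    {"table_name": "hypothetical_unranked", "@search.reranker_score": None},
]
relevant = filter_by_reranker_score(sample_results)
print([doc["table_name"] for doc in relevant])
```

Because the reranker score has a fixed 0.00 - 4.00 range, a threshold on it is easier to reason about than one on the unbounded BM25 score or the small RRF score.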

In [28]:
# hybrid retrieval + rerank
r = search_client.search(
    search_query,
    top=5,
    vector_queries=[RawVectorQuery(vector=search_vector, k=50, fields="table_vector")],
    query_type="semantic",
    semantic_configuration_name="query-index-semantic-config",
    query_language="en-us",
)

for doc in r:
    content = doc["table_content"].replace("\n", " ")[:1000]
    table_name = doc["table_name"]
    score = doc["@search.score"]
    reranker_score = doc["@search.reranker_score"]

    print(f"Table Name: {table_name}")
    print(f"Score: {score}")
    print(f"Reranker Score: {reranker_score}")
    print(f"Content: {content}")
    print("-" * 50)  # prints a separator for readability

Table Name: detroit_tigers_baseball_stats
Score: 0.03306011110544205
Reranker Score: 2.504136085510254
Content: The provided schema is for a table named 'detroit_tigers_baseball_stats', which likely stores statistical data related to the players of the Detroit Tigers, a professional baseball team. The table has eight columns, each representing different characteristics or statistics related to a player.  1. 'name' (nvarchar): This column is used to store the name of the player. As the data type is 'nvarchar', it can contain both text and numbers, allowing for diverse names.  2. 'position' (nvarchar): This field stores the position of a player on the field. The 'nvarchar' data type indicates that this field can also contain alphanumeric characters, allowing for various positions like 'SS' (Short Stop), 'CF' (Center Field), etc.  3. 'Games_Played' (tinyint): This field records the number of games a player has played. The 'tinyint' data type indicates that this number is likely to be rela