## Lab 4: Building Scalable RAG Applications for Enterprise AI – Optimizing Performance, Cost, and Reliability

#### Overview 📋
In this lab, we will focus on designing and building a Retrieval-Augmented Generation (RAG) solution tailored for enterprise scalability and production readiness. The solution emphasizes performance optimization, cost efficiency, and modular design, ensuring seamless deployment across various platforms.

- **Semantic Caching**: Implement caching to enhance data retrieval efficiency, reduce latency, and optimize resource usage.
- **Logging and Monitoring**: Introduce robust logging mechanisms to track system performance and user behavior while ensuring reliability through real-time monitoring.
- **Modular Design**: Build a flexible, context-aware user interface (UI) that can adapt to various use cases and be deployed on platforms such as web applications, container services, or Kubernetes clusters.
- **Scalability and Availability**: Leverage load balancing and multi-region deployments to ensure throughput, redundancy, and high availability.

This lab will equip you with the skills to design a reliable, efficient, and scalable RAG system while meeting enterprise requirements for modularity, traceability, and cost-effectiveness.

#### Learning Objectives 🎯

**1. Ensure Throughput and Availability with Multi-Region Load Balancing**

**2. Integrate Semantic Caching with Cosmos DB (MongoDB Core)**

**3. Develop an Interactive Web Application with Streamlit**
  - **Why Streamlit?**
    - Streamlit is a Python-based framework for quickly creating dynamic, interactive user interfaces without requiring complex front-end development.
  - **Features to Build:**
    - **Dynamic User Interface**: Create a user-friendly interface that seamlessly connects users to the backend.
    - **"Chat with Your Data"**: Implement a feature that allows users to input questions and receive real-time, context-aware responses based on cached or retrieved data.

**4. Ensure Traceability and Logging with Application Insights**
  - **Why Logging and Monitoring?**
    - Logging and monitoring are essential for maintaining system performance, reliability, and traceability.
  - **Implementation Focus:**
    - **Logging**: Capture user interactions, queries, and responses to understand behavior and troubleshoot issues.
    - **Performance Monitoring**: Use telemetry data to identify latency issues and optimize the system.
    - **Error Handling**: Detect and address errors in real-time to ensure a consistent user experience.

By the end of this lab, you will have a comprehensive understanding of how to enhance data retrieval performance with semantic caching, build an interactive web application using Streamlit, and ensure traceability and logging for better system insights.

In [1]:
import os
# Define the target directory
target_directory = r"/Users/pablosal/Desktop/azure-ai-engineer-in-five-weeks"  # change your directory to the root folder

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

from dotenv import load_dotenv
load_dotenv(dotenv_path=".env")

Directory changed to c:\Users\pablosal\Desktop\azure-ai-engineer-in-five-weeks


True

## **1. Ensure Throughput and Availability with Multi-Region Load Balancing in your AOAI (Azure OpenAI) solutions**

It's time to explore advanced system design. In production, we use an "AI gateway" approach where we combine serverless (PAYG) with PTU, and the gateways manage the throughput intelligently. This can "AutoScale" to handle many users/sessions, allowing you to focus on quality instead of troubleshooting.

![image.png](attachment:image.png)

- **Why Multi-Region and Load Balancing?**
    - Ensuring optimal performance and reliability is critical for enterprise AI applications. Load balancing and multi-region deployments improve throughput, ensure redundancy, and support global availability.
- **Implementation Focus:**
    - Use load-balancing tools, such as Azure API Management (APIM), to distribute API requests across multiple regions.
    - Deploy services in multiple regions to provide redundancy and minimize downtime.
    - Monitor usage and scale resources proactively to meet demand.

> **Note:** We won't implement `AI Gateaway` in the lab, but you can refer to the following articles for more information and to push forward with your Understanding why we need this layer enterprise level:

- [Key Technical Challenges While Transitioning GenAI Applications to Production](https://pabloaicorner.hashnode.dev/key-technical-challenges-while-transitioning-genai-applications-to-production)
- [Error 429 Explained: Navigating Azure OpenAI API Rate Limits](https://pabloaicorner.hashnode.dev/error-429-explained-navigating-azure-openai-api-rate-limits)


Please visit this guide for detailed instructions on how to configure and optimize your Azure OpenAI enterprise design: [Azure OpenAI in Production](https://github.com/pablosalvador10/gbbai-azure-openai-in-production).

## **2. Integrate Semantic Caching with Cosmos DB (MongoDB Core)**
  - **What is Semantic Caching?**
    - Semantic caching stores commonly queried data or dynamically generated responses, reducing database latency and optimizing query performance.
  - **Why Cosmos DB?**
    - Cosmos DB, with its MongoDB Core API, offers a scalable, low-latency backend solution for storing and querying data. It supports distributed, high-performance, real-time interactions.
  - **Implementation Focus:**
    - Set up caching layers to reduce repeated queries.
    - Optimize database performance using Cosmos DB's scalable architecture.

### Prerequisites

**1. Create an Azure Cosmos DB for MongoDB vCore resource**

Let's start by creating an Azure Cosmos DB for MongoDB vCore Resource following this quick start guide: https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/quickstart-portal

Then copy the connection details (server, user, pwd) into the .env file ('COSMOS_MONGO_USER', 'COSMOS_MONGO_PWD', 'COSMOS_MONGO_SERVER').

**2. Install packages**

In [2]:
#Make sure you have attached the correct kernel for your environment,
#then run the command below and restart the kernel:
#%pip install pymongo azure-core azure-cosmos

In [3]:
# Import necessary libraries
import os
import urllib.parse
import pymongo
from azure.core.credentials import AzureKeyCredential

# Environment variables for MongoDB connection
COSMOS_MONGO_USER = os.environ.get('COSMOS_MONGO_USER')
COSMOS_MONGO_PWD = os.environ.get('COSMOS_MONGO_PWD')
COSMOS_MONGO_SERVER = os.environ.get('COSMOS_MONGO_SERVER')

# Construct MongoDB connection string
mongo_conn = (
    "mongodb+srv://"
    + urllib.parse.quote(COSMOS_MONGO_USER)
    + ":"
    + urllib.parse.quote(COSMOS_MONGO_PWD)
    + "@"
    + COSMOS_MONGO_SERVER
    + "?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000"
)

# MongoDB connection setup
try:
    mongo_client = pymongo.MongoClient(mongo_conn)
    db = mongo_client["mydatabase"]
    print("✅ Connected to MongoDB.")
except pymongo.errors.ConnectionError as e:
    print(f"❌ MongoDB connection error: {e}")


  mongo_client = pymongo.MongoClient(mongo_conn)


✅ Connected to MongoDB.


**Set up the DB and collection**

In [4]:
# create a database called TutorialDB
db = mongo_client['ExampleDB']

# Create collection if it doesn't exist
COLLECTION_NAME = "ExampleCollection"

collection = db[COLLECTION_NAME]

if COLLECTION_NAME not in db.list_collection_names():
    # Creates a unsharded collection that uses the DBs shared throughput
    db.create_collection(COLLECTION_NAME)
    print("Created collection '{}'.\n".format(COLLECTION_NAME))
else:
    print("Using collection: '{}'.\n".format(COLLECTION_NAME))

Using collection: 'ExampleCollection'.



**Create the vector index**

> IMPORTANT: You can only create one index per vector property. That is, you cannot create more than one index that points to the same vector property. If you want to change the index type (e.g., from IVF to HNSW) you must drop the index first before creating a new index.

**IVF index** (choose this one for our lab)

IVF is an approximate nerarest neighbors (ANN) approach that uses clustering to speed up the search for similar vectors in a dataset. It's a good choice for proof-of-concepts and smaller datasets (under a few thousand documents). However it's not recommended to use at scale or when higher throughput is needed.

IVF is supported on all cluster tiers, including the free tier.

In [5]:
# Use only if re-reunning code and want to reset db and collection
collection.drop_indexes()
mongo_client.drop_database("ExampleDB")

In [6]:
db.command({
    'createIndexes': 'ExampleCollection',
    'indexes': [
        {
            'name': 'VectorSearchIndex',
            'key': {
                'queryVector': 'cosmosSearch'  # Removed space in field name
            },
            'cosmosSearchOptions': {
                'kind': 'vector-ivf',
                'numLists': 50,  # Increase if needed
                'similarity': 'COS',  # Ensure correct similarity metric
                'dimensions': 1536  # Match the embedding size from Azure OpenAI
            }
        }
    ]
})


{'raw': {'defaultShard': {'numIndexesBefore': 1,
   'numIndexesAfter': 2,
   'createdCollectionAutomatically': True,
   'ok': 1}},
 'ok': 1}

In [7]:
indexes = collection.index_information()
for index_name, index_info in indexes.items():
    print(f"Index Name: {index_name}, Details: {index_info}")

Index Name: _id_, Details: {'v': 2, 'key': [('_id', 1)]}
Index Name: VectorSearchIndex, Details: {'v': 2, 'key': [('queryVector', 'cosmosSearch')], 'cosmosSearchOptions': SON([('kind', 'vector-ivf'), ('numLists', 50), ('similarity', 'COS'), ('dimensions', 1536)])}


**HNSW Index** (Don't choose this one for our lab but preferred for Prod)

HNSW stands for Hierarchical Navigable Small World, a graph-based index that partitions vectors into clusters and subclusters. With HNSW, you can perform fast approximate nearest neighbor search at higher speeds with greater accuracy. HNSW is now available on M40 and higher cluster tiers.

In [16]:
# db.command({ 
#     "createIndexes": "ExampleCollection",
#     "indexes": [
#         {
#             "name": "VectorSearchIndex",
#             "key": {
#                 "contentVector": "cosmosSearch"
#             },
#             "cosmosSearchOptions": { 
#                 "kind": "vector-hnsw", 
#                 "m": 16, # default value 
#                 "efConstruction": 64, # default value 
#                 "similarity": "COS", 
#                 "dimensions": 1536
#             } 
#         } 
#     ] 
# }
# )

**Upload data to the collection**

In [8]:
import os
import uuid
import datetime
import logging
from pymongo.collection import Collection
from src.aoai.aoai_helper import AzureOpenAIManager
from typing import List, Dict

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Simulated historical data to be stored in CosmosDB
data = [
    {
        "query": "What is covered under the PerksPlus program?",
        "response": "PerksPlus offers coverage for gym memberships, yoga, and outdoor sports.",
        "context": "PerksPlus covers a wide range of fitness activities, including gym memberships, yoga classes, outdoor activities like hiking and kayaking."
    },
    {
        "query": "What health plans are available at Contoso Electronics?",
        "response": "Northwind Health Plus and Northwind Standard are the two plans available.",
        "context": "Contoso Electronics offers Northwind Health Plus and Northwind Standard health plans, providing different levels of coverage."
    },
]

# Initialize Azure OpenAI Manager
aoai_helper = AzureOpenAIManager(
    api_key=os.getenv('AZURE_OPENAI_KEY'),
    api_version=os.getenv('AZURE_OPENAI_API_VERSION', "2023-05-15"),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    embedding_model_name=os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT'),
    completion_model_name=os.getenv('AZURE_OPENAI_CHAT_DEPLOYMENT_ID'),
)

def generate_embeddings(text: str) -> List[float]:
    """
    Generates an embedding vector for the given text using Azure OpenAI.

    Args:
        text (str): The text to be converted into an embedding.

    Returns:
        List[float]: The generated embedding vector.
    """
    try:
        embedding_response = aoai_helper.generate_embedding(text)
        return embedding_response.data[0].embedding
    except Exception as e:
        logging.error(f"Error generating embedding for text: {text}. Error: {e}")
        return []

def generate_and_store_embeddings(data: List[Dict], collection: Collection, conversation_id: str) -> None:
    """
    Generates embeddings for the query fields only and stores them in CosmosDB.
    Ensures each message is associated with a provided conversation ID.

    Args:
        data (List[Dict]): The data entries containing queries and responses.
        collection (Collection): The CosmosDB collection where data will be stored.
        conversation_id (str): A consistent ID across all messages in a session.
    """
    processed_data = []

    for idx, item in enumerate(data):
        try:
            item["conversation_id"] = conversation_id
            item["message_id"] = str(uuid.uuid4())
            item["queryVector"] = generate_embeddings(item["query"])

            now = datetime.datetime.utcnow().isoformat()
            item["created_at"] = now
            item["updated_at"] = now
            item["@search.action"] = "upload"

            processed_data.append(item)
            logging.info(f"Processed item {idx+1}/{len(data)}")

        except Exception as e:
            logging.error(f"Error processing item {idx+1}: {e}")

    if processed_data:
        try:
            collection.insert_many(processed_data)
            logging.info("All records successfully inserted into CosmosDB.")
        except Exception as e:
            logging.error(f"Error inserting records into CosmosDB: {e}")

# Example usage (assuming `collection` is a valid CosmosDB collection)
conversation_id = "12345-abcde"
generate_and_store_embeddings(data, collection, conversation_id)


2025-01-29 15:23:43,918 - INFO - HTTP Request: POST https://pablo-m2unbt6w-swedencentral.openai.azure.com//openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-01-29 15:23:43,928 - INFO - Processed item 1/2
2025-01-29 15:23:44,089 - INFO - HTTP Request: POST https://pablo-m2unbt6w-swedencentral.openai.azure.com//openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-01-29 15:23:44,095 - INFO - Processed item 2/2
2025-01-29 15:23:44,171 - INFO - All records successfully inserted into CosmosDB.


**Retrieving data to the collection**

In [9]:
def vector_search_catching(query, collection, num_results=5, similarity_threshold=0.96):
    """
    Searches for semantically similar documents in CosmosDB.

    If a high-confidence result (similarity ≥ 0.96) is found, returns its response.
    Otherwise, returns None (indicating a fallback to an LLM-generated response may be needed).
    """

    # Generate the embeddings
    embedding_response = generate_embeddings(query)

    # Define search pipeline with required fields
    search_stage = {
        "$vectorSearch": {
            "index": "VectorSearchIndex",  # Ensure this matches the actual index name
            "path": "queryVector",         # Must match the field storing embeddings
            "queryVector": embedding_response,
            "numCandidates": num_results,
            "limit": num_results
        }
    }

    # Projection stage to include similarity score
    project_stage = {
        "$project": {
            "similarityScore": {"$meta": "searchScore"},
            "response": 1
        }
    }

    # Assemble and execute pipeline
    pipeline = [search_stage, project_stage]
    
    try:
        results = list(collection.aggregate(pipeline))  # Convert cursor to list for easy handling
    except Exception as e:
        print(f"❌ MongoDB vector search failed: {e}")
        return None

    if not results:
        print("⚠️ No matching vector results found in CosmosDB.")
        return None  # Prevents unnecessary fallback attempts

    # Retrieve best result by similarity score
    best_result = max(results, key=lambda x: x.get("similarityScore", 0), default=None)

    if best_result and best_result.get("similarityScore", 0) >= similarity_threshold:
        print(f"✅ Using cached response (score: {best_result['similarityScore']:.2f})")
        return best_result["response"]

    print(f"⚠️ No confident match found (best score: {best_result.get('similarityScore', 0):.2f}), falling back to LLM.")
    return None


In [10]:
# Example Usage
query = "What are the health plans available at Contoso Electronics?"

cached_response = vector_search_catching(query, collection, similarity_threshold=0.98)

if cached_response:
    print(f"🔹 Cached Response: {cached_response}")
else:
    # Here you can implement a fallback mechanism like calling an LLM
    print("❌ No cached response. Generating new response via LLM...")

2025-01-29 15:23:54,890 - INFO - HTTP Request: POST https://pablo-m2unbt6w-swedencentral.openai.azure.com//openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"


✅ Using cached response (score: 1.00)
🔹 Cached Response: Northwind Health Plus and Northwind Standard are the two plans available.


**Building RAG Pattern (Azure AI search + LLM) + Semantic Search**

![image.png](attachment:image.png)

This pattern helps us achieve:
- Reduced API Costs
- Faster Response Times
- Scalability

In [11]:
import os
import uuid
import datetime
import logging
import asyncio
from typing import List, Dict, Optional
from pymongo import MongoClient
from pymongo.collection import Collection
from azure.search.documents import SearchClient
from azure.search.documents.models import QueryType, QueryCaptionType, QueryAnswerType, VectorizableTextQuery
from azure.core.credentials import AzureKeyCredential
from src.aoai.aoai_helper import AzureOpenAIManager
import urllib.parse

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Initialize Azure OpenAI Manager
azure_aoai_client = AzureOpenAIManager(
    api_key=os.getenv('AZURE_OPENAI_KEY'),
    api_version=os.getenv('AZURE_OPENAI_API_VERSION', "2023-05-15"),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    embedding_model_name=os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT'),
    completion_model_name=os.getenv('AZURE_OPENAI_CHAT_DEPLOYMENT_ID'),
)

# Initialize Azure AI Search Client
search_client = SearchClient(
    endpoint=os.environ["AZURE_AI_SEARCH_SERVICE_ENDPOINT"],
    index_name="lab-ai-index",  # Change this if you renamed your index
    credential=AzureKeyCredential(os.environ["AZURE_AI_SEARCH_ADMIN_KEY"]),
)

# Environment variables for MongoDB connection
COSMOS_MONGO_USER = os.environ.get('COSMOS_MONGO_USER')
COSMOS_MONGO_PWD = os.environ.get('COSMOS_MONGO_PWD')
COSMOS_MONGO_SERVER = os.environ.get('COSMOS_MONGO_SERVER')

# Construct MongoDB connection string
mongo_conn = (
    "mongodb+srv://"
    + urllib.parse.quote(COSMOS_MONGO_USER)
    + ":"
    + urllib.parse.quote(COSMOS_MONGO_PWD)
    + "@"
    + COSMOS_MONGO_SERVER
    + "?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000"
)

# MongoDB connection setup
mongo_client = MongoClient(mongo_conn)
db = mongo_client['ExampleDB']
COLLECTION_NAME = "ExampleCollection"
collection = db[COLLECTION_NAME]

if COLLECTION_NAME not in db.list_collection_names():
    db.create_collection(COLLECTION_NAME)
    logging.info(f"Created collection '{COLLECTION_NAME}'.")
else:
    logging.info(f"Using collection: '{COLLECTION_NAME}'.")


def generate_embeddings(text: str) -> List[float]:
    """
    Generates an embedding vector for the given text using Azure OpenAI.

    Args:
        text (str): The text to be converted into an embedding.

    Returns:
        List[float]: The generated embedding vector.
    """
    try:
        embedding_response = azure_aoai_client.generate_embedding(text)
        return embedding_response.data[0].embedding
    except Exception as e:
        logging.error(f"Error generating embedding for text: {text}. Error: {e}")
        return []


def retrieve_documents_from_azure_ai_search(query: str) -> List[str]:
    """
    Retrieve relevant documents using BM25 and vector search from Azure AI Search.
    """
    try:
        vector_query = VectorizableTextQuery(
            text=query, k_nearest_neighbors=5, fields="vector", weight=0.5
        )

        search_results = search_client.search(
            search_text=query,
            vector_queries=[vector_query],
            query_type=QueryType.SEMANTIC,
            semantic_configuration_name="my-semantic-config",
            query_caption=QueryCaptionType.EXTRACTIVE,
            query_answer=QueryAnswerType.EXTRACTIVE,
            top=5,
        )

        retrieved_content = []
        for doc in search_results:
            content = doc["chunk"].replace("\n", " ")
            retrieved_content.append(content)
            logging.info(f"score: {doc['@search.score']}, reranker: {doc['@search.reranker_score']}. {content}")

        return retrieved_content

    except Exception as e:
        logging.error(f"Error retrieving documents from Azure AI Search: {e}")
        return []


async def generate_llm_response(query: str, context: List[str]) -> str:
    """
    Generate a response from LLM using retrieved context.
    """
    prompt = f"""
    Context:
    {context}

    User Query:
    {query}

    Instructions:
    - If sufficient context exists, provide an answer.
    - If insufficient information, respond with "I'm sorry, I don't have enough information."
    """
    try:
        response = await azure_aoai_client.generate_chat_response(
            query=prompt,
            conversation_history=[],
            system_message_content="You are an AI assistant helping users find information from unstructured data.",
            max_tokens=2000,
            temperature=0.5,
            stream=True,
        )
        return response
    except Exception as e:
        logging.error(f"Error generating LLM response: {e}")
        return "I'm sorry, but I encountered an error processing your request."


def store_response(query: str, response: str, context: List[str], collection: Collection):
    """
    Store the generated response in CosmosDB for caching.
    """
    try:
        entry = {
            "query": query,
            "response": response,
            "context": context,
            "queryVector": generate_embeddings(query),
            "created_at": datetime.datetime.utcnow().isoformat(),
        }
        collection.insert_one(entry)
        logging.info("Stored response in cache.")
    except Exception as e:
        logging.error(f"Error storing response in CosmosDB: {e}")


def vector_search_catching(query, collection, num_results=5, similarity_threshold=0.96):
    """
    Searches for semantically similar documents in CosmosDB.

    If a high-confidence result (similarity ≥ 0.96) is found, returns its response.
    Otherwise, returns None (indicating a fallback to an LLM-generated response may be needed).
    """
    embedding_response = generate_embeddings(query)

    search_stage = {
        "$vectorSearch": {
            "index": "VectorSearchIndex",
            "path": "queryVector",
            "queryVector": embedding_response,
            "numCandidates": num_results,
            "limit": num_results
        }
    }

    project_stage = {
        "$project": {
            "similarityScore": {"$meta": "searchScore"},
            "response": 1
        }
    }

    pipeline = [search_stage, project_stage]
    
    try:
        results = list(collection.aggregate(pipeline))
    except Exception as e:
        logging.error(f"MongoDB vector search failed: {e}")
        return None

    if not results:
        logging.warning("No matching vector results found in CosmosDB.")
        return None

    best_result = max(results, key=lambda x: x.get("similarityScore", 0), default=None)

    if best_result and best_result.get("similarityScore", 0) >= similarity_threshold:
        logging.info(f"Using cached response (score: {best_result['similarityScore']:.2f})")
        return best_result["response"]

    logging.warning(f"No confident match found (best score: {best_result.get('similarityScore', 0):.2f}), falling back to LLM.")
    return None


async def main(query: str, use_cache=True) -> str:
    """
    Main function to orchestrate the full RAG pipeline.
    """
    if use_cache:
        cached_response = vector_search_catching(query, collection, similarity_threshold=0.99)
        if cached_response:
            return cached_response

    retrieved_docs = retrieve_documents_from_azure_ai_search(query)
    llm_response = await generate_llm_response(query, retrieved_docs)
    store_response(query, llm_response, retrieved_docs, collection)
    
    return llm_response

2025-01-29 15:24:05,177 - INFO - You appear to be connected to a CosmosDB cluster. For more information regarding feature compatibility and support please visit https://www.mongodb.com/supportability/cosmosdb
2025-01-29 15:24:07,221 - INFO - Using collection: 'ExampleCollection'.



**Exercise: Ask the same questions and evaluate the outcome. Evaluate the savings in time when `use_cache=True.`**

SEARCH_QUERY = "What are the differences between the Northwind Health Plus and Northwind Standard plans offered by Contoso Electronics?"

In [12]:
import asyncio

# run this cell to start the chatbot, it will keep running until you type 'exit'. look for the input box above to type your queries
async def interactive_chatbot(use_cache=True):
    """
    Interactive chatbot loop in Jupyter Notebook or script.
    """
    print("💬 AI Chatbot is running. Type 'exit' to stop.")
    
    while True:
        user_query = input("\n🧑 You: ")
        if user_query.lower() == "exit":
            print("👋 Chatbot stopped.")
            break

        response = await main(user_query, use_cache)
        print(f"\n🤖 AI: {response}")

# Run the interactive chatbot
await interactive_chatbot()


💬 AI Chatbot is running. Type 'exit' to stop.


2025-01-29 15:24:24,740 - INFO - HTTP Request: POST https://pablo-m2unbt6w-swedencentral.openai.azure.com//openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-01-29 15:24:24,812 - INFO - Request URL: 'https://azureaisearch-centraus-nw-dev.search.windows.net/indexes('lab-ai-index')/docs/search.post.search?api-version=REDACTED'
Request method: 'POST'
Request headers:
    'Content-Type': 'application/json'
    'Content-Length': '265'
    'api-key': 'REDACTED'
    'Accept': 'application/json;odata.metadata=none'
    'x-ms-client-request-id': '69c1f772-de87-11ef-bacb-f43bd8cfe843'
    'User-Agent': 'azsdk-python-search-documents/11.6.0b5 Python/3.10.16 (Windows-10-10.0.26100-SP0)'
A body is sent with the request
2025-01-29 15:24:25,897 - INFO - Response status: 200
Response headers:
    'Transfer-Encoding': 'chunked'
    'Content-Type': 'application/json; odata.metadata=none; odata.streaming=true; charset=utf-8'
    'Content-Encoding': 'REDACTE

I'm just a virtual assistant, but I'm here and ready to help you! How can I assist you today?

2025-01-29 15:24:26,946 - micro - MainProcess - INFO     Function generate_chat_response finished at 2025-01-29 15:24:26 (Duration: 1.00 seconds) (aoai_helper.py:generate_chat_response:481)
2025-01-29 15:24:26,946 - INFO - Function generate_chat_response finished at 2025-01-29 15:24:26 (Duration: 1.00 seconds)
2025-01-29 15:24:27,140 - INFO - HTTP Request: POST https://pablo-m2unbt6w-swedencentral.openai.azure.com//openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-01-29 15:24:27,204 - INFO - Stored response in cache.



🤖 AI: {'response': "I'm just a virtual assistant, but I'm here and ready to help you! How can I assist you today?", 'conversation_history': [{'role': 'system', 'content': 'You are an AI assistant helping users find information from unstructured data.'}, {'role': 'user', 'content': [{'type': 'text', 'text': '\n    Context:\n    ["7. Accountability: We take responsibility for our actions and hold ourselves and others accountable for their performance 8. Community: We are committed to making a positive impact in the communities in which we work and live. Performance Reviews Performance Reviews at Contoso Electronics At Contoso Electronics, we strive to ensure our employees are getting the feedback they need to continue growing and developing in their roles. We understand that performance reviews are a key part of this process and it is important to us that they are conducted in an effective and efficient manner. Performance reviews are conducted annually and are an important part of your

2025-01-29 15:24:37,816 - INFO - HTTP Request: POST https://pablo-m2unbt6w-swedencentral.openai.azure.com//openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-01-29 15:24:37,855 - INFO - Using cached response (score: 1.00)



🤖 AI: {'response': "I'm just a virtual assistant, but I'm here and ready to help you! How can I assist you today?", 'conversation_history': [{'role': 'system', 'content': 'You are an AI assistant helping users find information from unstructured data.'}, {'role': 'user', 'content': [{'type': 'text', 'text': '\n    Context:\n    ["7. Accountability: We take responsibility for our actions and hold ourselves and others accountable for their performance 8. Community: We are committed to making a positive impact in the communities in which we work and live. Performance Reviews Performance Reviews at Contoso Electronics At Contoso Electronics, we strive to ensure our employees are getting the feedback they need to continue growing and developing in their roles. We understand that performance reviews are a key part of this process and it is important to us that they are conducted in an effective and efficient manner. Performance reviews are conducted annually and are an important part of your

## **3. Develop an Interactive Web Application with Streamlit**

**Why Choose Streamlit?**

Streamlit is a powerful Python framework designed for quickly building interactive AI applications, even if you lack front-end development skills.

- ✅ **Simplicity & Speed** – Develop a complete UI with just a few lines of Python code.
- ✅ **Real-Time Interaction** – Enable live chatbot responses for an interactive user experience.
- ✅ **Effortless AI Integration** – Easily integrate with Azure AI Search, MongoDB, and Azure OpenAI for seamless AI functionality.

Streamlit is preferred for quick prototyping, but for production environments, frameworks like React or Angular are more suitable. Streamlit is perfect for MVPs and prototypes before having a front-end team ready to build you UI !!

For a basic introduction, refer to [A Beginner's Guide to Streamlit](https://www.geeksforgeeks.org/a-beginners-guide-to-streamlit/). For quick documentation, visit [Streamlit's Get Started Guide](https://docs.streamlit.io/get-started).

**Before diving in, install Streamlit using pip:**

In [20]:
#%pip install streamlit streamlit-chat

**Running a Streamlit App**

Open command prompt or Anaconda shell and type

```bash
streamlit run your_filename.py
```

in our case let's run, where our application (chatbot) leaves

```bash
streamlit run ./weeks/week-4/chatbot.py
```

> if you run into `file not found src...` run this command first `export PYTHONPATH=$(pwd)`

## **4. Ensure Traceability and Logging with Application Insights**


For detailed instructions on how to set up and configure Azure Application Insights, please refer to the [documentation](https://learn.microsoft.com/en-us/azure/azure-monitor/overview):

In [None]:
#install dependecies
#%pip install azure-monitor-opentelemetry opentelemetry-exporter-otlp opentelemetry-instrumentation-httpx

**Set Your Connection String**
Obtain your connection string from the Azure Portal 

Get the [Azure Application Insights Connection Strings](https://learn.microsoft.com/en-us/azure/azure-monitor/app/connection-strings) and add it to the .env file in 

Set it in your environment (.env):

APPLICATIONINSIGHTS_CONNECTION_STRING="app insight connnection string value"

**Testing APP isnsights**

1. Configures Azure Monitor.
2. Sends a log message.
3. manual trace span.

In [13]:
import os
import logging

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter

# Configure Azure Monitor (for logging, tracing, metrics).
configure_azure_monitor(
    connection_string=os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING"),
    logging_exporter_enabled=True,
    tracing_exporter_enabled=True,
    metrics_exporter_enabled=True
)

# Set up Python's built-in logging
logger = logging.getLogger("my_app_logger")
logger.setLevel(logging.INFO)

# Log something
logger.info("Hello from Python, this will appear in Application Insights!")

# --- Tracing Example ---
# Make sure there's a TracerProvider if not automatically created
if not isinstance(trace.get_tracer_provider(), TracerProvider):
    tracer_provider = TracerProvider()
    trace.set_tracer_provider(tracer_provider)
    exporter = AzureMonitorTraceExporter(connection_string=os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING"))
    tracer_provider.add_span_processor(BatchSpanProcessor(exporter))

tracer = trace.get_tracer("my_tracer")
with tracer.start_as_current_span("my_sample_span"):
    # Your business logic or code you'd like to trace
    logger.info("Inside the traced block!")
    # any operation here

logger.info("Done with the sample span.")


2025-01-29 15:30:19,698 - INFO - Request URL: 'https://westus-0.in.applicationinsights.azure.com//v2.1/track'
Request method: 'POST'
Request headers:
    'Content-Type': 'application/json'
    'Content-Length': '1403'
    'Accept': 'application/json'
    'x-ms-client-request-id': '3d4a3efb-de88-11ef-ace8-f43bd8cfe843'
    'User-Agent': 'azsdk-python-azuremonitorclient/unknown Python/3.10.16 (Windows-10-10.0.26100-SP0)'
A body is sent with the request
2025-01-29 15:30:20,012 - INFO - Response status: 200
Response headers:
    'Transfer-Encoding': 'chunked'
    'Content-Type': 'application/json; charset=utf-8'
    'Server': 'Microsoft-HTTPAPI/2.0'
    'Strict-Transport-Security': 'REDACTED'
    'X-Content-Type-Options': 'REDACTED'
    'Date': 'Wed, 29 Jan 2025 21:30:19 GMT'
2025-01-29 15:30:23,969 - INFO - Hello from Python, this will appear in Application Insights!
2025-01-29 15:30:23,977 - INFO - Inside the traced block!
2025-01-29 15:30:23,980 - INFO - Done with the sample span.


2025-01-29 15:30:25,029 - INFO - Request URL: 'https://eastus-8.in.applicationinsights.azure.com//v2.1/track'
Request method: 'POST'
Request headers:
    'Content-Type': 'application/json'
    'Content-Length': '1513'
    'Accept': 'application/json'
    'x-ms-client-request-id': '4077d125-de88-11ef-ace1-f43bd8cfe843'
    'User-Agent': 'azsdk-python-azuremonitorclient/unknown Python/3.10.16 (Windows-10-10.0.26100-SP0)'
A body is sent with the request
2025-01-29 15:30:25,051 - INFO - Request URL: 'https://eastus-8.in.applicationinsights.azure.com//v2.1/track'
Request method: 'POST'
Request headers:
    'Content-Type': 'application/json'
    'Content-Length': '2606'
    'Accept': 'application/json'
    'x-ms-client-request-id': '407b2556-de88-11ef-843e-f43bd8cfe843'
    'User-Agent': 'azsdk-python-azuremonitorclient/unknown Python/3.10.16 (Windows-10-10.0.26100-SP0)'
A body is sent with the request
2025-01-29 15:30:25,240 - INFO - Response status: 200
Response headers:
    'Transfer-Enco

##### **Verifying the Logs and Traces in Azure**

**Wait a Minute:** Telemetry can take up to 1–2 minutes to appear in Azure.

1. **Go to Azure Portal** → **Application Insights** → **Logs** (sometimes called Log Analytics).

2. **Run a Kusto Query to check your logs:**

    ```kusto
    traces
    | order by timestamp desc
    | limit 50
    ```

    You should see messages (e.g., “Hello from Python, this will appear in Application Insights!”).

3. **Check Traces (Spans):**

    In Application Insights, you can see the spans by going to **Transaction Search** or in the **Performance** tab.
    Look for operations labeled "my_sample_span" (as used in the snippet above).


![image.png](attachment:image.png)