# Lab 2: Configure Azure AI Search RAG Knowledge Base

## Overview

This notebook configures a RAG (Retrieval-Augmented Generation) knowledge base using Azure AI Search.

### Architecture

```
┌──────────────────────────────────────────────────────────┐
│                  RAG Pipeline                            │
│                                                          │
│  Documents (JSON) → Embeddings (OpenAI)                  │
│                         ↓                                │
│              Azure AI Search Index                       │
│           (Vector Store + Keyword Search)                │
│                         ↓                                │
│            Hybrid Search (Vector + BM25)                 │
│                         ↓                                │
│              Research Agent Query                        │
└──────────────────────────────────────────────────────────┘
```

### Learning Objectives

Upon completing this lab, you will be able to:

1. ✅ Design and create an Azure AI Search index schema
2. ✅ Generate text embeddings with Azure OpenAI
3. ✅ Upload documents for vector and keyword search
4. ✅ Execute and test hybrid search (vector + BM25)
5. ✅ Evaluate and optimize RAG pipeline performance

### Data to Use

- **data/knowledge-base.json**: 50 entries describing a diverse set of Korean travel destinations
- **Categories**: Nature/Healing, Culture/History, City/Beach, Activity/Sports, Food/Market
- **Embedding Model**: text-embedding-3-large (3072 dimensions)
- **Search Testing**: Dataset that clearly demonstrates the strengths of vector/keyword/hybrid search

---

## ⚙️ Before You Start

**Select a Python kernel:**

1. Click **"Select Kernel"** at the top right of the notebook
2. Select **"Python Environments..."**
3. Choose **`.venv (Python 3.x.x)`** (virtual environment created in the project root)

> 💡 **GitHub Codespaces**: In Codespaces, the `.venv` environment is created automatically.  
> If you don't see `.venv`, create it with `python -m venv .venv` in the terminal.

---

## 1. Prerequisites Check

Verify that the following tools are installed:

- Python 3.9 or higher
- Azure CLI
- Azure Developer CLI (azd)
- Docker (required for Container Apps deployment)

In [None]:
import sys, subprocess, os
import platform

# Set PATH based on operating system (supports macOS, Linux, and Codespaces)
system = platform.system()
if system == 'Darwin':  # macOS
    # Add Homebrew paths (Intel & Apple Silicon)
    extra_paths = '/opt/homebrew/bin:/usr/local/bin'
elif system == 'Linux':  # Linux / Codespaces
    # Common Linux binary paths
    extra_paths = '/usr/local/bin:/usr/bin:/home/codespace/.local/bin'
else:  # Windows
    extra_paths = ''

if extra_paths:
    os.environ['PATH'] = extra_paths + ':' + os.environ.get('PATH', '')

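def check(cmd, name):
    """Run a version command and report whether the tool is available."""
    try:
        # The Azure CLI can take several seconds to start, so allow a generous timeout
        result = subprocess.run(cmd, shell=True, capture_output=True, timeout=30, env=os.environ)
        print(f"{'✓' if result.returncode == 0 else '✗'} {name}")
    except Exception as e:
        # Report the failure type (for example TimeoutExpired) instead of hiding it
        print(f"✗ {name} ({type(e).__name__})")

print("=== Prerequisites Check ===")
print(f"✓ Python {sys.version.split()[0]} ({system})")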
check("az --version", "Azure CLI")
check("azd version", "Azure Developer CLI")
check("docker --version", "Docker")
print("="*50)

## 2. Install Required Packages

Install essential Azure AI-related packages. If you're running in GitHub Codespaces, most packages may already be installed.

In [None]:
# Install required packages
import subprocess
import sys

packages = [
    "azure-search-documents",
    "azure-identity",
    "openai",
    "python-dotenv"
]

print("=== Installing Required Packages ===\n")

for package in packages:
    print(f"Installing {package}...")
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", package],
        capture_output=True,
        text=True
    )
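    if result.returncode == 0:
        print(f"✅ {package} installed")
    else:
        # A non-zero exit code means pip failed; show the last line of its error output
        error_lines = result.stderr.strip().splitlines()
        print(f"⚠️  {package} failed to install: {error_lines[-1] if error_lines else 'unknown error'}")

print("\n" + "="*50)
print("✅ Package installation step finished!")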

## 3. Load Configuration

In [None]:
# Load configuration file saved from Notebook 1
import json
import os

config_path = "./config.json"

if not os.path.exists(config_path):
    raise FileNotFoundError(
        f"❌ Configuration file not found: {config_path}\n"
        "Please run Notebook 1 (01_deploy_azure_resources.ipynb) first."
    )

with open(config_path, 'r') as f:
    config = json.load(f)

# Verify required settings (no API keys are stored in config.json;
# the Search admin key is retrieved later through the Azure CLI)
required_keys = ["search_endpoint", "project_connection_string"]
missing_keys = [key for key in required_keys if not config.get(key)]

if missing_keys:
    raise ValueError(f"❌ Required settings are missing: {', '.join(missing_keys)}")

print("✅ Configuration file loaded successfully")
print(f"📍 Search Endpoint: {config['search_endpoint']}")
print(f"📍 AI Project Connection: {'✓ Set' if config['project_connection_string'] else '✗ Missing'}")

## 4. Azure Authentication

Although we already logged into Azure in Lab 1, the session may have expired, so we'll verify authentication status and re-authenticate if necessary.

### Tenant ID Setup Guide

**In most cases**: You don't need to specify a tenant ID. Leave the `tenant_id` variable as `"<YOUR_TENANT_ID>"` or `None` and run.

**When Tenant ID is required**:
- ✅ When you have access to multiple Azure tenants (organizations/companies)
- ✅ When you need to work only with resources from a specific organization
- ✅ When you encounter "multiple tenants" related errors during login

**How to find your Tenant ID**:
- Azure Portal → Microsoft Entra ID (formerly Azure Active Directory) → Overview → Copy Tenant ID
- Or contact your organization administrator

In [None]:
import subprocess, json

print("=== Azure Authentication ===")
print("ℹ️  Checking authentication status and logging in if necessary.\n")

# Enter your tenant ID here (optional)
# Example: tenant_id = "16b3c013-d300-468d-ac64-7eda0820b6d3"
tenant_id = "<YOUR_TENANT_ID>"  # Or set to None to use the default tenant

# Check Azure CLI authentication status
az_account = subprocess.run("az account show", shell=True, capture_output=True, text=True)

if az_account.returncode == 0:
    account_info = json.loads(az_account.stdout)
    print(f"✅ Azure CLI authentication successful (using existing session)")
    print(f"   Subscription: {account_info.get('name', 'N/A')}")
    print(f"   Tenant: {account_info.get('tenantId', 'N/A')}")
else:
    print("⚠️  Azure CLI authentication required. Opening browser...")
    # Login with tenant ID if set
    if tenant_id and tenant_id != "<YOUR_TENANT_ID>":
        az_login = subprocess.run(f"az login --tenant {tenant_id}", shell=True)
    else:
        az_login = subprocess.run("az login", shell=True)
    
    if az_login.returncode == 0:
        print("✅ Azure CLI login successful")
    else:
        raise Exception("❌ Azure CLI login failed")

print("="*50)

## 5. Load Knowledge Base Data

In [None]:
# Load knowledge base JSON file
knowledge_base_path = "./data/knowledge-base.json"

if not os.path.exists(knowledge_base_path):
    raise FileNotFoundError(f"❌ Knowledge base file not found: {knowledge_base_path}")

with open(knowledge_base_path, 'r', encoding='utf-8') as f:
    knowledge_base = json.load(f)

# JSON can be a direct array or wrapped in documents
if isinstance(knowledge_base, list):
    documents = knowledge_base
else:
    documents = knowledge_base.get("documents", [])

print(f"✅ Travel destination knowledge base loaded successfully")
print(f"📚 Total destinations: {len(documents)}")
print(f"\n📂 Destinations by category:")

# Classify by category
from collections import Counter
categories = Counter(doc.get("section", doc.get("category", "Other")) for doc in documents)
for category, count in sorted(categories.items()):
    print(f"  • {category}: {count} destinations")

# Display first document sample
if documents:
    print(f"\n📄 Sample destination:")
    sample = documents[0]
    print(f"  ID: {sample['id']}")
    print(f"  Title: {sample['title']}")
    print(f"  Category: {sample['category']}")
    print(f"  Section: {sample.get('section', 'N/A')}")
    print(f"  Content length: {len(sample['content'])} characters")
    if 'metadata' in sample and 'tags' in sample['metadata']:
        print(f"  Tags: {', '.join(sample['metadata']['tags'][:5])}")

## 6. Create Azure AI Search Index

### 📋 Index Schema Design

The core of a RAG system is an **efficient index schema**. This lab creates an index with the following structure.

---

### 🔑 Key Field Descriptions

| Field | Type | Role | Attributes |
|------|------|------|------|
| **id** | String | Unique identifier | `key=True` (required) |
| **title** | String | Document title | `searchable=True` (keyword search), filterable, sortable |
| **content** | String | Body content | `searchable=True`, Korean analyzer (`ko.microsoft`) |
| **category** | String | Category | `filterable=True`, sortable, facetable |
| **section** | String | Section within a category | `filterable=True` (filtering) |
| **contentVector** | Float[] | Embedding vector | `dimensions=3072` (vector search) |

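---

### 🧠 Vector Search Configuration (HNSW Algorithm)

**Core settings for the contentVector field.** The index-creation cell in this lab passes only a name to `HnswAlgorithmConfiguration` and relies on the service defaults; the snippet spells out the corresponding parameters explicitly:

```python
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearch,
    VectorSearchProfile,
)

VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="hnsw-algorithm",
            parameters=HnswParameters(
                m=4,                    # Graph connectivity
                ef_construction=400,    # Indexing quality
                metric="cosine"         # Similarity metric
            )
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-algorithm"
        )
    ]
)
```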

**Parameter Meanings:**

| Parameter | Value | Meaning | Impact |
|----------|-----|------|------|
| **m** | 4 | Connections per node | Higher = better recall, but more memory and slower indexing |
| **ef_construction** | 400 | Build-time search depth | Higher = better index quality, but slower index builds |
| **metric** | cosine | Similarity calculation | Common choice for text embeddings |

**HNSW vs Other Algorithms:**

| Algorithm | Search Speed | Accuracy | Memory | Index Build Speed | Recommended Use |
|---------|----------|--------|--------|-----------------|-----------|
| **HNSW** | ⚡⚡⚡ Fast | ⭐⭐⭐ High (approximate) | Medium | Fast | **Production RAG** ⭐ |
| **Exhaustive KNN** (flat, brute force) | ⚡ Slow | ⭐⭐⭐⭐ Exact (100% recall) | Low | Instant | Small scale (<1K docs), highest accuracy needed |
| IVF | Fast | Medium | High | Slow | Large scale (>1M docs) |

IVF is listed for comparison only; Azure AI Search offers HNSW and exhaustive KNN.

**Lab Choice:** HNSW. At 50 documents exhaustive KNN would work equally well, but HNSW is the configuration that carries over as the knowledge base grows.

---

### 📊 Index Configuration Summary

Final configuration of the created index:

```yaml
Index name: ai-agent-knowledge-base
Fields: 6
  - Keyword search: title, content (ko.microsoft)
  - Vector search: contentVector (3072 dimensions, HNSW)
  - Filtering: category, section
Vector algorithm: HNSW (m=4, ef_construction=400)
Language support: Korean (ko.microsoft analyzer)
```

**Next Step:** Embed and upload documents according to this schema! 📤

In [None]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
    SimpleField,
    SearchableField
)
from azure.core.credentials import AzureKeyCredential
import subprocess

# Get Azure AI Search Admin Key
print("🔑 Retrieving AI Search Admin Key...")
search_service_name = config.get("search_service_name", "")
resource_group = config.get("resource_group", "")

key_result = subprocess.run(
    f"az search admin-key show --resource-group {resource_group} --service-name {search_service_name} --query primaryKey -o tsv",
    shell=True,
    capture_output=True,
    text=True
)

if key_result.returncode != 0:
    raise Exception(f"❌ Failed to retrieve Admin Key: {key_result.stderr}")

search_admin_key = key_result.stdout.strip()
print("✅ Admin Key acquired successfully")

# Create Search Index Client (Admin Key authentication)
index_client = SearchIndexClient(
    endpoint=config["search_endpoint"],
    credential=AzureKeyCredential(search_admin_key)
)
print("✅ Search Index Client created (Admin Key authentication)")

# Set index name
index_name = "ai-agent-knowledge-base"

# Define index schema
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="title", type=SearchFieldDataType.String, 
                   filterable=True, sortable=True),
    SearchableField(name="content", type=SearchFieldDataType.String, 
                   analyzer_name="ko.microsoft"),  # Korean analyzer
    SimpleField(name="category", type=SearchFieldDataType.String, 
               filterable=True, sortable=True, facetable=True),
    SimpleField(name="section", type=SearchFieldDataType.String, 
               filterable=True, sortable=False),
    SearchField(
        name="contentVector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=3072,  # text-embedding-3-large
        vector_search_profile_name="vector-profile"
    )
]

# Vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="hnsw-algorithm")
    ],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-algorithm"
        )
    ]
)

# Create index
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search
)

# Delete any existing index with the same name, then recreate it
try:
    index_client.delete_index(index_name)
    print(f"⚠️ Removed any existing index named '{index_name}'")
except Exception as e:
    # Avoid a bare except: report why deletion was skipped; create_index surfaces real problems
    print(f"ℹ️ Skipped index deletion: {e}")

result = index_client.create_index(index)
print(f"✅ Index created successfully: {result.name}")
print(f"📊 Number of fields: {len(result.fields)}")
print("🔍 Vector search: Enabled (3072 dimensions)")


## 7. Generate Embeddings with Azure OpenAI

### 🧠 What is Text Embedding?

Text embedding converts text into numeric vectors so that computers can compare **meaning**, not just characters.

```
"healing travel"   →  [0.123, -0.456, ..., 0.234]  (3072 numbers)
"relaxation trip"  →  [0.119, -0.451, ..., 0.228]  (very similar vector!)
```

**Key Point:** Semantically similar text = similar vectors → basis for RAG search
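
"Similar vectors" is usually quantified with cosine similarity, the metric this lab's index is configured with. The following is a minimal sketch using made-up 3-dimensional vectors; real embeddings from the model have 3072 values, and the variable names are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration (not real model output)
healing = [0.9, 0.1, 0.2]
relaxing = [0.8, 0.2, 0.3]
nightlife = [0.1, 0.9, 0.1]

# The two related concepts score much closer to 1.0 than the unrelated pair
print(cosine_similarity(healing, relaxing) > cosine_similarity(healing, nightlife))  # True
```

A value near 1.0 means the vectors point in almost the same direction; values near 0 indicate unrelated content.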

---

### 🎯 text-embedding-3-large Model

**OpenAI embedding model released in January 2024**

| Model | Dimensions | Performance | Suitable For |
|------|------|------|------------|
| **text-embedding-3-large** | 3072 | ⭐⭐⭐⭐⭐ | **Production RAG (recommended)** |
| text-embedding-3-small | 1536 | ⭐⭐⭐⭐ | Quick prototypes, cost savings |
| text-embedding-ada-002 | 1536 | ⭐⭐⭐ | Old model (legacy) |

**Meaning of 3072 dimensions:**
- More dimensions allow finer semantic distinctions, at the cost of more storage per vector
- MTEB benchmark: **64.6%** (ada-002: 61.0%)
- Accepts long inputs (up to 8,191 tokens)

---

### 📐 Embedding Generation Process

```
Text Input ‚Üí Tokenization ‚Üí Transformer Processing ‚Üí Vector Generation ‚Üí Normalization
```

**Implementation in this Lab:**
```python
def generate_embedding(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-3-large",
        dimensions=3072  # Explicitly specify 3072 dimensions
    )
    return response.data[0].embedding

# Generate embedding by combining title + content
text_to_embed = f"{doc['title']}\n\n{doc['content']}"
doc["contentVector"] = generate_embedding(text_to_embed)
```
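
Embedding requests can fail transiently, for example when a rate limit is hit. One common mitigation is to retry with exponential backoff. The helper below is a sketch, not part of the lab code: `with_retries` and its parameters are hypothetical names, and it catches every exception for brevity, whereas real code would normally catch only the SDK's rate-limit error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up after the final attempt and surface the original error
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Demonstration with a fake function that fails twice before succeeding.
# Delays are recorded instead of slept so the demo finishes instantly.
attempts = {"count": 0}
recorded_delays: list[float] = []


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


print(with_retries(flaky_call, sleep=recorded_delays.append))  # ok
print(recorded_delays)  # [1.0, 2.0]
```

In the notebook this could wrap the embedding call, for instance `with_retries(lambda: generate_embedding(text_to_embed))`.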

---

In [None]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import time

# Create Azure OpenAI client (keyless: DefaultAzureCredential uses the Azure CLI login here, or a managed identity when deployed)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

# Extract OpenAI endpoint from project connection string
# Format 1: https://aoai-xxx.services.ai.azure.com/api/projects/proj-xxx;...
# Format 2: workspace=...;subscription_id=...;resource_group=...;aiservices_name=...
import re

project_conn_str = config['project_connection_string']

# Extract AI Services name from URL (Format 1)
url_match = re.match(r'https://([^.]+)\.', project_conn_str)
if url_match:
    aiservices_name = url_match.group(1)
    openai_endpoint = f"https://{aiservices_name}.openai.azure.com/"
else:
    # Extract from key-value format (Format 2)
    conn_parts = {}
    for part in project_conn_str.split(';'):
        if '=' in part:
            key, value = part.split('=', 1)
            conn_parts[key] = value
    
    aiservices_name = conn_parts.get('aiservices_name', '')
    if not aiservices_name:
        raise ValueError("❌ AI Services name not found. Please check project_connection_string in config.json.")
    
    openai_endpoint = f"https://{aiservices_name}.openai.azure.com/"

print(f"🔗 Azure OpenAI Endpoint: {openai_endpoint}")

openai_client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-02-01",
    azure_endpoint=openai_endpoint
)

# Set embedding model
embedding_model = "text-embedding-3-large"

def generate_embedding(text: str) -> list[float]:
    """Convert text to vector"""
    response = openai_client.embeddings.create(
        input=text,
        model=embedding_model,
        dimensions=3072  # Explicitly set to 3072 dimensions
    )
    return response.data[0].embedding

# Generate embeddings for all documents
print("🔄 Generating embeddings...")
print(f"📄 Documents to process: {len(documents)}")

for i, doc in enumerate(documents, 1):
    # Generate embedding by combining title and content
    text_to_embed = f"{doc['title']}\n\n{doc['content']}"
    doc["contentVector"] = generate_embedding(text_to_embed)
    
    print(f"  [{i}/{len(documents)}] {doc['title'][:50]}... ✓")
    
    # Pause between calls to stay under the tokens-per-minute (TPM) rate limit
    if i < len(documents):
        time.sleep(0.5)

print(f"\n✅ Embedding generation completed")
print(f"📊 Vector dimensions: {len(documents[0]['contentVector'])} (expected: 3072)")
print(f"💾 Memory usage: ~{len(documents) * 3072 * 4 / 1024 / 1024:.2f} MB")


## 8. Upload Documents to Azure AI Search

### 📤 Batch Upload Strategy

Azure AI Search recommends **batch upload**:

| Batch Size | Processing Speed | Recommended Scenario |
|-----------|-----------|-----------|
| 1-10 | Slow | Real-time single document |
| **10-100** | **Fast ⭐** | **General indexing (recommended)** |
| 100-1000 | Very fast | Bulk initial load |

**Lab Usage: 50 documents → single batch (optimal)**
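
For a knowledge base larger than this lab's 50 documents, the upload would need to be split into batches. The sketch below shows one way to chunk a document list; `batched` and the default size of 100 are illustrative choices, not values taken from the lab code.

```python
from typing import Iterator


def batched(items: list[dict], batch_size: int = 100) -> Iterator[list[dict]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 250 placeholder documents split into batches of 100
docs = [{"id": str(i)} for i in range(250)]
sizes = [len(batch) for batch in batched(docs, batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Each yielded batch could then be passed to `search_client.upload_documents(documents=batch)` in a loop.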

---

### 🔧 upload_documents() Method

```python
search_client.upload_documents(documents=cleaned_documents)
```

**Internal Operation:**
1. **Validation**: Check required fields (`id`, `contentVector`, etc.)
2. **Serialization**: Convert to JSON array
3. **HTTP POST**: `/docs/index` endpoint
4. **Response Processing**: Return success/failure results per document

**Response Example:**
```json
{
  "value": [
    {"key": "doc1", "status": true, "statusCode": 201},
    {"key": "doc2", "status": true, "statusCode": 201}
  ]
}
```
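
A response shaped like the example above can be summarized in a few lines. This is a sketch against the raw JSON shape only: `summarize_index_response` is a hypothetical helper, and the failing `doc3` entry is invented for illustration. The Python SDK returns result objects rather than this raw dictionary.

```python
def summarize_index_response(payload: dict) -> dict:
    """Count successes and collect the keys of failed documents."""
    results = payload.get("value", [])
    failed_keys = [item.get("key") for item in results if not item.get("status")]
    return {
        "succeeded": len(results) - len(failed_keys),
        "failed": len(failed_keys),
        "failed_keys": failed_keys,
    }


# Sample payload: the first two entries mirror the example response,
# the third is a made-up failure
sample_response = {
    "value": [
        {"key": "doc1", "status": True, "statusCode": 201},
        {"key": "doc2", "status": True, "statusCode": 201},
        {"key": "doc3", "status": False, "statusCode": 400},
    ]
}

print(summarize_index_response(sample_response))
# {'succeeded': 2, 'failed': 1, 'failed_keys': ['doc3']}
```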


In [None]:
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

# Create Search Client (for document upload)
search_client = SearchClient(
    endpoint=config["search_endpoint"],
    index_name=index_name,
    credential=AzureKeyCredential(search_admin_key)
)
print("✅ Search Client created (for document upload)")

# Clean documents (include only fields matching index schema)
allowed_fields = {"id", "title", "content", "category", "section", "contentVector"}
cleaned_documents = []

for doc in documents:
    cleaned_doc = {key: value for key, value in doc.items() if key in allowed_fields}
    cleaned_documents.append(cleaned_doc)

print(f"\n📦 Upload preparation:")
print(f"   - Number of documents: {len(cleaned_documents)}")
print(f"   - Fields: {', '.join(sorted(allowed_fields))}")  # sorted: set order is not stable
print(f"   - Vector dimensions: {len(cleaned_documents[0]['contentVector'])}")

# Upload documents (batch)
print(f"\n🔄 Uploading documents...")

try:
    result = search_client.upload_documents(documents=cleaned_documents)
    
    # Analyze upload results
    succeeded = sum(1 for r in result if r.succeeded)
    failed = len(result) - succeeded
    
    print(f"\n✅ Upload completed!")
    print(f"   - Succeeded: {succeeded}")
    print(f"   - Failed: {failed}")
    
    if failed > 0:
        print(f"\n⚠️ Failed documents:")
        for r in result:
            if not r.succeeded:
                print(f"   - {r.key}: {r.error_message}")
    
except Exception as e:
    print(f"❌ Upload failed: {str(e)}")
    raise

print(f"\n📊 Index status:")
print(f"   - Index name: {index_name}")
print(f"   - Total documents: {len(cleaned_documents)}")
print(f"   - Search ready! 🎉")


## 9. Hybrid Search Example

### 🔍 What is Hybrid Search?

**Hybrid search** combines **vector search** and **keyword search** to leverage the advantages of both:

| Search Method | Algorithm | Strengths | Weaknesses |
|-----------|----------|------|------|
| **Keyword Search** | BM25 | Exact word/phrase matching | Cannot understand synonyms/meaning |
| **Vector Search** | Cosine similarity | Understands semantic similarity | Weaker exact term matching |
| **Hybrid** | RRF (rank fusion) | **Combines both advantages** ⭐ | Slightly increased computational cost |

---

### ⚙️ RRF (Reciprocal Rank Fusion) Algorithm

Hybrid search uses **RRF** to fuse the two ranked result lists:

```
RRF Score = 1/(k + vector_rank) + 1/(k + keyword_rank)
           (k = 60, Azure AI Search default)
```

**Example:**
- Document A: Vector rank 1, Keyword rank 3 → RRF = 1/61 + 1/63 = 0.0323
- Document B: Vector rank 2, Keyword rank 1 → RRF = 1/62 + 1/61 = 0.0325 ← **Higher (priority)**
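
The arithmetic in this example can be reproduced with a short function. This is a sketch of the RRF formula as stated here, not the service's internal implementation; `rrf_scores` is an illustrative name, and document C is added only to make the ranking less trivial.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranked list in which a document appears."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


vector_ranking = ["A", "B", "C"]   # A is vector rank 1, B is rank 2
keyword_ranking = ["B", "C", "A"]  # B is keyword rank 1, A is rank 3

scores = rrf_scores([vector_ranking, keyword_ranking])
fused_order = sorted(scores, key=scores.get, reverse=True)

for doc_id in fused_order:
    print(doc_id, round(scores[doc_id], 4))
# B 0.0325
# A 0.0323
# C 0.032
```

A document that appears in only one of the lists contributes a single term, so it tends to rank below documents found by both methods.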

---

### 💡 Practice Query Analysis

**Query:** "Healing travel destinations with ocean views"

This query is well suited to hybrid search:
- **Keyword strength**: "ocean" (exact location feature)
- **Vector strength**: "healing travel destinations" (conceptual question)
- **Hybrid effect**: Considers both exact location features and atmosphere/feel

Run the search in the next cell, then compare the three search methods in **Section 10**.

In [None]:
from azure.search.documents.models import VectorizedQuery

# ✅ Practice query: location feature + conceptual question (optimal for hybrid search)
# Both keyword ("ocean") and meaning ("healing") are important
test_query = "Healing travel destinations with ocean views"

print(f"🔍 Search query: '{test_query}'")
print(f"📌 This query includes both exact location features (ocean) and conceptual meaning (healing).")
print(f"   → Hybrid search is expected to provide the most accurate results.\n")

# 1️⃣ Convert query to vector (embedding)
print("🔄 Generating query embedding...")
query_vector = generate_embedding(test_query)
print(f"✅ Embedding generation completed (dimensions: {len(query_vector)})\n")

# 2️⃣ Create vector query object
vector_query = VectorizedQuery(
    vector=query_vector,         # Query embedding vector
    k_nearest_neighbors=5,       # Search top 5 similar documents
    fields="contentVector"       # Target vector field for search
)

# 3️⃣ Execute hybrid search (vector + keyword)
print("🔍 Executing hybrid search...")
results = search_client.search(
    search_text=test_query,      # Keyword search (BM25 algorithm)
    vector_queries=[vector_query],  # Vector search (cosine similarity)
    select=["title", "content", "category"],
    top=5  # Return only top 5 results
)

# 4️⃣ Display search results
print("=" * 100)
print("📊 Hybrid Search Results (Vector + Keyword)")
print("=" * 100)

result_count = 0
for i, result in enumerate(results, 1):
    print(f"\n🔹 Result {i}")
    print(f"   📂 Category: {result['category']}")
    print(f"   📄 Title: {result['title']}")
    print(f"   📝 Content preview: {result['content'][:150]}...")
    result_count = i

print("\n" + "=" * 100)
print(f"✅ Search completed! Total {result_count} destinations found")
print("💡 Next section will compare keyword/vector/hybrid search.")

## 10. Search Performance Comparison

### 🔬 Comparison Experiment of 3 Search Methods

This section demonstrates the differences between the search methods through **travel destination search scenarios**.

---

### 📊 Characteristics by Search Method

| Search Method | Algorithm | Key Advantages | Key Disadvantages | Recommended Use Cases |
|-----------|----------|----------|----------|-----------------|
| **Keyword Search** | BM25 | Exact word matching, fast speed | Cannot understand synonyms/meaning | **Specific place names/proper nouns** ⭐, festival names, restaurant names |
| **Vector Search** | Cosine similarity | Semantic-based, synonym handling, atmosphere understanding | Weaker exact place name matching | Natural language questions, feel/atmosphere-based search |
| **Hybrid (RRF)** | Vector + BM25 fusion | **Combines advantages of both** ⭐ | Slightly increased computational cost | **Production RAG (recommended)** |

---

### 🧪 Experimental Scenario: Specific Theme Travel Destination Search

**Query:** "Beaches where you can surf"

This query shows the differences between the search methods:

1. **Keyword Search (Expected: exact matching)**
   - Returns only destinations whose text contains the query terms, such as "surf"
   - Surf spots such as Yangyang and Busan are expected to rank high

2. **Vector Search (Expected: includes related activities)**
   - Returns destinations semantically close to concepts like "water leisure" and "marine sports"
   - May include marine-activity places that never mention surfing explicitly

3. **Hybrid Search (Expected: balanced results)**
   - Exact surf spots plus related marine-activity places
   - Provides the most comprehensive results

---

### 🎯 Search Method Selection Guide

**Hybrid Search (recommended for most cases):**
- ✅ Answering general user questions
- ✅ Complex queries (place name + theme/atmosphere)
- ✅ Production environments where accuracy is important

**Keyword-only Search:**
- ✅ Specific place name search (e.g., "Jeju Island", "Gyeongju")
- ✅ Festival names/proper nouns (e.g., "Mud Festival", "Lantern Festival")
- ✅ Specialty products/food names (e.g., "snow crab", "bibimbap")

**Vector-only Search:**
- ✅ When semantic similarity is important (feeling, atmosphere)
- ✅ Natural language questions (e.g., "good place for family")
- ✅ Conceptual questions (e.g., "healing travel" ≈ "relaxation trip")
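
In code, the three modes differ only in which of `search_text` and `vector_queries` is passed to `search_client.search`. The helper below is a sketch of that switch; `build_search_kwargs` and its `mode` strings are hypothetical names introduced here, and the demo uses a placeholder object instead of a real `VectorizedQuery` so it runs without an Azure connection.

```python
from typing import Any, Optional


def build_search_kwargs(
    mode: str,
    query: str,
    vector_query: Optional[Any] = None,
    top: int = 3,
) -> dict:
    """Build keyword arguments for a keyword, vector, or hybrid search call."""
    if mode not in {"keyword", "vector", "hybrid"}:
        raise ValueError(f"unknown mode: {mode}")
    if mode != "keyword" and vector_query is None:
        raise ValueError("vector and hybrid modes need a vector_query")
    return {
        # Keyword (BM25) matching is active only when search_text is set
        "search_text": query if mode in {"keyword", "hybrid"} else None,
        # Vector matching is active only when vector_queries is set
        "vector_queries": [vector_query] if mode in {"vector", "hybrid"} else None,
        "top": top,
    }


placeholder_vector_query = object()  # stands in for a VectorizedQuery instance

for mode in ("keyword", "vector", "hybrid"):
    kwargs = build_search_kwargs(mode, "Beaches where you can surf", placeholder_vector_query)
    print(mode, kwargs["search_text"] is not None, kwargs["vector_queries"] is not None)
# keyword True False
# vector False True
# hybrid True True
```

With a live client, the result could be used as `search_client.search(**build_search_kwargs("hybrid", test_query, vector_query))`.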

In [None]:
import sys
import time

# 🧪 Experiment: Query that clearly shows strengths of each search method
# "surf": exact activity name (keyword search strength)
# "Beaches": location type (vector search can find related places)
# → Hybrid considers both

test_query = "Beaches where you can surf"

print("=" * 100)
print("🧪 Search Experiment: Specific Activity Travel Destination Search")
print("=" * 100)
print(f"📌 Query: '{test_query}'")
print(f"💡 Expected: Keyword search will be most accurate (exact activity name matching)")
print(f"   - Keyword search: Finds the exact term 'surf' ⭐")
print(f"   - Vector search: Also includes semantically similar marine activity places")
print(f"   - Hybrid: Combines both for balanced results\n")

# Generate query embedding (once)
print("🔄 Generating search query embedding...")
query_vector = generate_embedding(test_query)
vector_query = VectorizedQuery(vector=query_vector, k_nearest_neighbors=3, fields="contentVector")
print("✅ Embedding generation completed\n")
sys.stdout.flush()  # Force flush output buffer

# === Method 1: Keyword-only Search (recommended: specific activity/place name search) ===
print("🔍 Method 1: Keyword-only Search (BM25)")
print("   → Search destinations that match the term 'surf'")
print("-" * 100)
sys.stdout.flush()

# Execute search and collect results (synchronous)
keyword_search_results = search_client.search(
    search_text=test_query,
    vector_queries=None,  # Disable vector search
    select=["title", "category", "section"],
    top=3
)
keyword_results = []
for result in keyword_search_results:
    keyword_results.append(result)

# Display results
for i, r in enumerate(keyword_results, 1):
    section = r.get('section', 'Other')
    print(f"   {i}. [{section}] {r['title']}")
sys.stdout.flush()
time.sleep(0.1)  # Wait for output completion

# === Method 2: Vector-only Search ===
print("\n🔍 Method 2: Vector-only Search (Cosine Similarity)")
print("   → Search semantically similar destinations (marine activity focused)")
print("-" * 100)
sys.stdout.flush()

# Execute search and collect results (synchronous)
vector_search_results = search_client.search(
    search_text=None,  # Disable keyword search
    vector_queries=[vector_query],
    select=["title", "category", "section"],
    top=3
)
vector_results = []
for result in vector_search_results:
    vector_results.append(result)

# Display results
for i, r in enumerate(vector_results, 1):
    section = r.get('section', 'Other')
    print(f"   {i}. [{section}] {r['title']}")
sys.stdout.flush()
time.sleep(0.1)  # Wait for output completion

# === Method 3: Hybrid Search (RRF) ===
print("\n🔍 Method 3: Hybrid Search (Vector + Keyword Fusion)")
print("   → Combines exact activity matching + semantic similarity")
print("-" * 100)
sys.stdout.flush()

# Execute search and collect results (synchronous)
hybrid_search_results = search_client.search(
    search_text=test_query,  # Enable keyword search
    vector_queries=[vector_query],
    select=["title", "category", "section"],
    top=3
)
hybrid_results = []
for result in hybrid_search_results:
    hybrid_results.append(result)

# Display results
for i, r in enumerate(hybrid_results, 1):
    section = r.get('section', 'Other')
    print(f"   {i}. [{section}] {r['title']}")
sys.stdout.flush()
time.sleep(0.1)  # Wait for output completion

print("\n" + "=" * 100)
print("📊 Search Results Analysis:")
print("=" * 100)
print("✅ Keyword Search (most accurate in this case):")
print("   - Ranks destinations that contain the keyword 'surf' at the top")
print("   - Surf spots such as Yangyang and Busan Haeundae are the main results")
print("\n⚠️  Vector Search:")
print("   - Also includes destinations related to 'marine activities' (semantically similar but may not mention surfing)")
print("   - Beach and ocean-related destinations may be returned broadly")
print("\n⭐ Hybrid Search (recommended):")
print("   - Combines keyword search accuracy + vector search semantic understanding")
print("   - Prioritizes destinations that mention surfing explicitly and includes related marine activity places")
print("\n💡 Conclusion: Keyword search is strong for specific activity/place name searches,")
print("         but hybrid provides more comprehensive results in actual production")

### 📈 Search Results Analysis Guide

The experimental results above demonstrate the characteristics of Azure AI Search's three search methods:

---

#### 🔑 Keyword Search (BM25)
- **Strengths**: Exact keyword matching (surfing, hanok village, Seoraksan, etc.)
- **Weaknesses**: Weak synonym/similar concept search (e.g., "healing" search won't find "relaxation", "meditation" documents)
- **Recommended Use Cases**: 
  - Specific place names/activity search (e.g., "Gyeongbokgung", "Jeju Island", "surfing")
  - Technical term search (for technical documentation RAG)

#### 🧠 Vector Search (Semantic Search)
- **Strengths**: Semantic similarity search (e.g., "healing destination" → meditation/yoga/nature retreat places)
- **Weaknesses**: Accuracy may drop when exact keyword matching is important
- **Recommended Use Cases**:
  - Abstract concept search (e.g., "family travel", "healing places")
  - Multilingual/synonym search (embedding model understands meaning)

#### ⚖️ Hybrid Search (RRF)
- **Strengths**: Keyword + vector search combination → balances accuracy and semantic understanding
- **Production Recommendation**: Hybrid is the best fit for most real services
- **How it Works**: 
  - Re-ranks keyword search Top-K and vector search Top-K using RRF (Reciprocal Rank Fusion)
  - Documents ranked high in both methods get higher final scores
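
One simple way to quantify how much two methods agree is the overlap between their top-K result lists. The sketch below uses Jaccard overlap; `top_k_overlap` is an illustrative helper and the titles are invented placeholders, not actual output from the lab's index.

```python
def top_k_overlap(first: list[str], second: list[str]) -> float:
    """Jaccard overlap of two result lists: shared items divided by all distinct items."""
    set_first, set_second = set(first), set(second)
    if not set_first and not set_second:
        return 1.0  # two empty result lists agree trivially
    return len(set_first & set_second) / len(set_first | set_second)


# Hypothetical top-3 titles for the same query under two search methods
keyword_titles = ["Yangyang Surf Beach", "Haeundae Beach", "Songjeong Beach"]
vector_titles = ["Haeundae Beach", "Jungmun Beach", "Yangyang Surf Beach"]

print(top_k_overlap(keyword_titles, vector_titles))  # 0.5
```

A low overlap suggests the two methods surface different documents, which is the situation where fusing them with RRF adds the most value.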

---

### 🎯 Practice Tips
Try experimenting with various queries:
```python
# When exact keyword matching is important
test_query = "Jeju Island Seopjikoji"

# When semantic search is advantageous
test_query = "Natural retreat where I can heal with family"

# Activity-focused search
test_query = "Places to enjoy diving and scuba"
```

## 11. Update Configuration File

Save the index name to `config.json` so it can be used in Notebook 3 (Agent Deployment).

In [None]:
# Reload config.json
with open("./config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Add index name
config["search_index"] = index_name

# Save updated configuration
with open("./config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)

print("✅ Configuration file updated successfully!")
print(f"   - Index name: {index_name}")
print(f"   - Saved to: config.json")

## 📍 Next Steps

You have successfully configured the RAG knowledge base! Now proceed to the following notebooks in order:

1. **Notebook 03**: Deploy Foundry Agent (`03_deploy_foundry_agent.ipynb`)
2. **Notebook 04**: Deploy MAF-based Agent (`04_deploy_foundry_agent_with_maf.ipynb`)
3. **Notebook 05**: MAF Workflow Patterns Practice (`05_maf_workflow_patterns.ipynb`)
4. **Notebook 06**: MAF Dev UI Practice (`06_maf_dev_ui.ipynb`)
5. **Notebook 07**: Evaluate Agents (`07_evaluate_agents.ipynb`)