# 🔍 Azure AI Search Index Creation for Multilingual RAG System

## 🎯 Use Case

This notebook creates **three Azure AI Search indexes** to test different multilingual retrieval strategies for a RAG (Retrieval-Augmented Generation) system. Each index uses a different approach to handle multilingual car troubleshooting queries, allowing you to compare and choose the best strategy for your use case.

### 💡 Why Three Index Strategies?

When building multilingual RAG systems, you face critical decisions:
- 🌍 **Preserve native languages** for cultural accuracy?
- 🔄 **Translate to English** to leverage powerful English-trained models?
- 🤖 **Which embedding model** works best for your multilingual data?

This notebook lets you test all three approaches!

### 🏗️ Architecture Overview

```mermaid
graph TB
    A[📊 Multilingual Dataset<br/>60 Records, 7 Languages] --> B{Index Strategy}
    
    B -->|Strategy 1| C[🌐 Multilanguage<br/>Cohere Embeddings]
    B -->|Strategy 2| D[🔄 Translated<br/>OpenAI Embeddings]
    B -->|Strategy 3| E[🌍 Multi Language OpenAI<br/>OpenAI Embeddings]
    
    C --> F[Cohere 1024-dim<br/>Native Language]
    D --> G[OpenAI 1536-dim<br/>English Only]
    E --> H[OpenAI 1536-dim<br/>Native Language]
    
    F --> I[🔍 HNSW Vector Search]
    G --> I
    H --> I
    
    I --> J[⚡ Fast Semantic Retrieval]
    
    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:3px,color:#fff
    style C fill:#27AE60,stroke:#1E8449,stroke-width:3px,color:#fff
    style D fill:#E74C3C,stroke:#C0392B,stroke-width:3px,color:#fff
    style E fill:#9B59B6,stroke:#7D3C98,stroke-width:3px,color:#fff
    style J fill:#F39C12,stroke:#D68910,stroke-width:3px,color:#fff
```

### 📋 What This Notebook Does

```mermaid
sequenceDiagram
    participant User as 👤 You
    participant Notebook as 📓 This Notebook
    participant Azure as ☁️ Azure AI Search
    
    User->>Notebook: Run cells
    Notebook->>Notebook: Load credentials
    Notebook->>Azure: Connect to service
    Notebook->>Azure: Delete old indexes (if exist)
    Notebook->>Azure: Create "multilanguage" index
    Notebook->>Azure: Create "translated" index
    Notebook->>Azure: Create "multi_language_openai" index
    Azure->>Notebook: ✅ All indexes created
    Notebook->>User: 🎉 Ready for data upload!
```

---

## 📚 Prerequisites

Before running this notebook, make sure you have:

| Requirement | Status | Description |
|------------|--------|-------------|
| ☁️ **Azure AI Search** | ⬜ | Service created in Azure Portal |
| 🔑 **API Keys** | ⬜ | Admin key for search service |
| 📁 **Environment File** | ⬜ | `.env` file with credentials |
| 📦 **Python Packages** | ⬜ | `azure-search-documents`, `python-dotenv`, `aiohttp` (async transport used by the `.aio` client) |

### 🔐 Required Environment Variables

Create a `.env` file in your project root:
```env
SEARCH_ENDPOINT=https://your-service.search.windows.net
SEARCH_API_KEY=your-admin-api-key-here
```

Let's get started! 🚀

## Step 1: Import Required Libraries 📦

We'll import Azure AI Search libraries to:
- 🔐 Authenticate securely with Azure
- 📊 Define index schemas with fields and data types
- 🧮 Configure vector search algorithms (HNSW)
- 🔍 Bind the vector field to a vector search profile

In [3]:
from dotenv import load_dotenv
from azure.search.documents.indexes.aio import SearchIndexClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes.models import (
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SearchIndex,    
    SearchFieldDataType
)
import os

✅ **Libraries imported successfully!**

## Step 2: Load Azure AI Search Configuration 🔑

Loading secure credentials from environment variables:

```mermaid
graph LR
    A[.env File] -->|Load| B[Environment Variables]
    B -->|Extract| C[🔗 SEARCH_ENDPOINT]
    B -->|Extract| D[🔑 SEARCH_API_KEY]
    C --> E[Azure Connection]
    D --> E
    
    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style E fill:#27AE60,stroke:#1E8449,stroke-width:2px,color:#fff
```

💡 **Security Best Practice**: Never hardcode credentials in notebooks!

In [4]:
load_dotenv(override=True)

search_endpoint = os.getenv('SEARCH_ENDPOINT')
search_api_key = os.getenv('SEARCH_API_KEY')

🔌 **Credentials loaded!** The connection itself is opened in Step 4.
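
`os.getenv` returns `None` when a variable is missing, so a typo in `.env` would only surface later as an opaque client error. The following is a minimal sketch of an early check. The `require_env` helper is hypothetical and is not part of this notebook or of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail fast if it is unset.

    Hypothetical helper: raises instead of returning None so that a
    missing or empty setting is reported by name.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Possible usage, assuming load_dotenv() has already run:
# search_endpoint = require_env("SEARCH_ENDPOINT")
# search_api_key = require_env("SEARCH_API_KEY")
```

Failing here keeps the error next to its cause instead of inside the index creation loop.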

## Step 3: Define Three Index Strategies 🏗️

We're creating **three indexes** to compare different multilingual search approaches:

### 📊 Index Comparison Table

| Feature | 🌐 Multilanguage | 🔄 Translated | 🌍 Multi Language OpenAI |
|---------|------------------|---------------|--------------------------|
| **Primary Strategy** | Native language | English only | Native language |
| **Embedding Model** | Cohere | OpenAI | OpenAI |
| **Vector Dimensions** | 1024 | 1536 | 1536 |
| **Best For** | Language-specific nuances | Leveraging English models | OpenAI with native text |
| **Raw Vector Size (float32)** | 4 KiB per record | 6 KiB per record | 6 KiB per record |
| **Query Language** | Must match data | Any (translated to EN) | Must match data |
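
To make the dimension difference concrete, here is a rough sketch of the raw vector payload for each index. It assumes 4 bytes per value (the `vector` field is a collection of `Single`, i.e. 32-bit floats) and the 60-record dataset from the overview. It ignores the HNSW graph and the text fields, so treat the numbers as a lower bound rather than a measured index size.

```python
BYTES_PER_FLOAT32 = 4  # each vector component is a 32-bit float

INDEX_DIMENSIONS = {
    "multilanguage": 1024,          # Cohere
    "translated": 1536,             # OpenAI, English translation
    "multi_language_openai": 1536,  # OpenAI, native language
}


def raw_vector_bytes(dimensions: int, documents: int = 1) -> int:
    """Bytes needed to store the embedding values alone."""
    return dimensions * BYTES_PER_FLOAT32 * documents


for name, dims in INDEX_DIMENSIONS.items():
    kib = raw_vector_bytes(dims, documents=60) / 1024
    print(f"{name}: {kib:.0f} KiB of vector data for 60 records")
```

At this dataset size the difference is negligible, but it grows linearly with the number of records.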

### 🌐 Index 1: "multilanguage"

```mermaid
graph LR
    A[User Query<br/>Any Language] --> B[Cohere Embedding<br/>1024-dim]
    B --> C[Vector Search<br/>Same Language]
    C --> D[Results<br/>Native Language]
    
    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style D fill:#27AE60,stroke:#1E8449,stroke-width:2px,color:#fff
```

**Strategy**: Keep everything in its original language with Cohere embeddings
- ✅ Preserves cultural context and terminology
- ✅ No translation quality loss
- ✅ Smaller vector dimensions (1024)
- ⚠️ Requires query in same language as data

**Fields**:
- `id` 🔑: Unique document identifier
- `brand` 🚗: Car manufacturer (searchable)
- `model` 🏷️: Car model (filterable, facetable)
- `fault` ⚠️: Problem description in native language
- `vector` 🧮: 1024-dim Cohere embedding
- `fix` 🔧: Solution in native language

### 🔄 Index 2: "translated"

```mermaid
graph LR
    A[User Query<br/>Any Language] --> B[Translate<br/>to English]
    B --> C[OpenAI Embedding<br/>1536-dim]
    C --> D[Vector Search<br/>English Space]
    D --> E[Results<br/>Original Language]
    
    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style E fill:#27AE60,stroke:#1E8449,stroke-width:2px,color:#fff
```

**Strategy**: Translate everything to English before embedding with OpenAI
- ✅ Leverages powerful English-trained models (OpenAI)
- ✅ Consistent semantic space across languages
- ✅ Larger vector dimensions (1536)
- ⚠️ Translation step adds complexity
- ⚠️ May lose language-specific nuances

**Fields**:
- Same as multilanguage, but:
- `vector` 🧮: 1536-dim OpenAI embedding (of English translation)
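
The translate-then-embed flow in the diagram above can be sketched as a small function. The `translate` and `embed` callables are placeholders for whatever translation and embedding services you use. Neither is defined in this notebook, and their signatures are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

TRANSLATED_DIMENSIONS = 1536  # must match the "translated" index schema


def embed_for_translated_index(
    text: str,
    translate: Callable[[str], str],          # hypothetical: any language -> English
    embed: Callable[[str], Sequence[float]],  # hypothetical: English text -> vector
) -> List[float]:
    """Translate to English first, then embed the English text."""
    english = translate(text)
    vector = list(embed(english))
    if len(vector) != TRANSLATED_DIMENSIONS:
        raise ValueError(
            f"Expected {TRANSLATED_DIMENSIONS} dimensions, got {len(vector)}"
        )
    return vector
```

The same function would serve both documents at indexing time and user queries at search time, which keeps both sides in the same English embedding space.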

### 🌍 Index 3: "multi_language_openai"

```mermaid
graph LR
    A[User Query<br/>Any Language] --> B[OpenAI Embedding<br/>1536-dim]
    B --> C[Vector Search<br/>Same Language]
    C --> D[Results<br/>Native Language]
    
    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style D fill:#9B59B6,stroke:#7D3C98,stroke-width:2px,color:#fff
```

**Strategy**: Keep everything in its original language with OpenAI embeddings
- ✅ OpenAI's multilingual capabilities
- ✅ No translation required
- ✅ Larger vector dimensions (1536)
- ✅ Best of both worlds: native language + powerful model
- ⚠️ Requires query in same language as data

**Fields**:
- Same as multilanguage, but:
- `vector` 🧮: 1536-dim OpenAI embedding (of native language text)

### 🔍 Common Search Features

All three indexes include:
- **HNSW Algorithm**: Fast approximate nearest neighbor search
- **Faceted Search**: Filter by car model
- **Keyword Search**: Traditional text matching on brand, model, fault, fix
- **Hybrid Search**: Combine keyword + vector search
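
As an illustration of how these features combine in a single query, the sketch below builds a hybrid request body as a plain dictionary. The property names (`search`, `vectorQueries`, `facets`, `filter`) follow the Search Documents REST API as of the `2023-11-01` version, which is an assumption here, so check them against the API version your service uses. The `build_hybrid_request` helper itself is hypothetical.

```python
from typing import Any, Dict, Optional, Sequence


def build_hybrid_request(
    query_text: str,
    query_vector: Sequence[float],
    k: int = 5,
    model_filter: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a keyword + vector (hybrid) search request body."""
    body: Dict[str, Any] = {
        "search": query_text,  # keyword part, matched against searchable fields
        "vectorQueries": [
            {
                "kind": "vector",
                "vector": list(query_vector),
                "fields": "vector",  # the vector field defined in the schemas
                "k": k,
            }
        ],
        "select": "id,brand,model,fault,fix",
        "facets": ["model"],  # `model` is the only facetable field
        "top": k,
    }
    if model_filter is not None:
        # OData string literals escape a single quote by doubling it
        escaped = model_filter.replace("'", "''")
        body["filter"] = f"model eq '{escaped}'"
    return body
```

The query vector must come from the same embedding model as the target index: 1024 values for `multilanguage`, 1536 for the other two.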

In [None]:
indexes = [
    {
        # This index embeds each record in its original language using Cohere.
        # This affects search: the user's prompt is also embedded in whatever
        # language the user writes in.
        'name': 'multilanguage',
        'fields': [
                SearchField(name="id", type=SearchFieldDataType.String,key=True),                   
                SearchField(name="brand", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False),                      
                SearchField(name="model", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=True, filterable=True),                  
                SearchField(name="fault", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False),                
                SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1024, vector_search_profile_name="vector-profile-1",searchable=True,sortable=False, facetable=False, filterable=False),
                SearchField(name="fix", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False)    
        ]    
    },
    {
        # This index embeds the English translation when the source language is not English
        'name': 'translated',
        'fields': [
                SearchField(name="id", type=SearchFieldDataType.String,key=True),                   
                SearchField(name="brand", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False),                      
                SearchField(name="model", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=True, filterable=True),                  
                SearchField(name="fault", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False),                
                SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1536, vector_search_profile_name="vector-profile-1",searchable=True,sortable=False, facetable=False, filterable=False),
                SearchField(name="fix", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False)    
        ]            
    },
    {
        # This index embeds each record in its original language using OpenAI
        'name': 'multi_language_openai',
        'fields': [
                SearchField(name="id", type=SearchFieldDataType.String,key=True),                   
                SearchField(name="brand", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False),                      
                SearchField(name="model", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=True, filterable=True),                  
                SearchField(name="fault", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False),                
                SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1536, vector_search_profile_name="vector-profile-1",searchable=True,sortable=False, facetable=False, filterable=False),
                SearchField(name="fix", type=SearchFieldDataType.String, searchable=True,sortable=False, facetable=False, filterable=False)    
        ]            
    },    
]

📐 **Index schemas defined!** Ready to deploy all three to Azure.
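
Because the three schemas differ only in vector dimensions, sending an embedding from the wrong model to an index is an easy mistake to make at upload time. The following sketch shows a pre-upload check. `validate_document`, `EXPECTED_FIELDS`, and `INDEX_DIMENSIONS` are hypothetical names introduced here, with values copied from the schemas above.

```python
from typing import Any, Dict, List

INDEX_DIMENSIONS = {
    "multilanguage": 1024,
    "translated": 1536,
    "multi_language_openai": 1536,
}
EXPECTED_FIELDS = ("id", "brand", "model", "fault", "vector", "fix")


def validate_document(index_name: str, document: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the document looks fine."""
    problems: List[str] = []

    missing = [field for field in EXPECTED_FIELDS if field not in document]
    if missing:
        problems.append(f"missing fields: {', '.join(missing)}")

    unknown = [field for field in document if field not in EXPECTED_FIELDS]
    if unknown:
        problems.append(f"fields not in schema: {', '.join(unknown)}")

    expected = INDEX_DIMENSIONS[index_name]
    vector = document.get("vector")
    if vector is not None and len(vector) != expected:
        problems.append(
            f"vector has {len(vector)} dimensions, {index_name} expects {expected}"
        )
    return problems
```

Returning a list instead of raising lets you collect every problem in a batch before deciding whether to upload.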

---

## Step 4: Deploy Indexes to Azure ☁️

### 🔄 Deployment Process

```mermaid
stateDiagram-v2
    [*] --> Initialize: Connect to Azure
    Initialize --> CheckExisting: For each index (3 total)
    CheckExisting --> Delete: If exists
    CheckExisting --> Configure: If new
    Delete --> Configure
    Configure --> CreateHNSW: Setup vector search
    CreateHNSW --> Deploy: Push to Azure
    Deploy --> Verify: Confirm creation
    Verify --> CheckExisting: Next index
    Verify --> [*]: All done ✅
```

### ⚙️ HNSW Algorithm Configuration

**HNSW (Hierarchical Navigable Small World)** provides:
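- ⚡ **Speed**: Low-latency approximate nearest neighbor search that avoids scanning every vector
- 🎯 **Accuracy**: High recall in practice, though results are approximate and depend on the data and parameters
- 📈 **Scalability**: Query cost grows roughly logarithmically with the number of vectors, at the price of extra memory for the graph
- 🔧 **Flexibility**: Configurable precision/speed tradeoff through `m`, `efConstruction`, and `efSearch` (this notebook keeps the defaults)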
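
To show what "approximate" means here, the sketch below implements the exact brute-force search that HNSW approximates, plus a recall measure for comparing the two. It is plain Python for illustration only, with cosine similarity assumed as the metric. It is not how Azure AI Search computes results.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Indices of the k most similar vectors, found by scanning all of them."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approximate_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approximate_ids)) / len(exact)
```

With only 60 records an exhaustive scan is cheap, so running it once gives a ground truth against which to measure the recall of each index.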

### 🚀 Execution Steps

1. **Initialize client** - Connect to Azure AI Search
2. **Configure HNSW** - Set up vector search algorithm
3. **For each of the 3 indexes**:
   - 🗑️ Delete if exists (clean slate)
   - 🏗️ Create with schema definition
   - ✅ Verify successful creation
4. **Close connection** - Clean up resources

Let's deploy! 🎬

In [6]:
# Initialize the search index client
index_client = SearchIndexClient(endpoint=search_endpoint, credential=AzureKeyCredential(search_api_key))

# Configure vector search using HNSW (Hierarchical Navigable Small World) algorithm
# This enables efficient approximate nearest neighbor search for semantic similarity
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnsw"),
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="vector-profile-1",  
            algorithm_configuration_name="myHnsw"
        )
    ]
)

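# get_index raises ResourceNotFoundError when the index does not exist
from azure.core.exceptions import ResourceNotFoundError

for index in indexes:
    # Delete existing index if it exists to start fresh
    try:
        await index_client.get_index(index['name'])
        await index_client.delete_index(index['name'])
    except ResourceNotFoundError:
        # Only a missing index is expected here; other errors (bad key, network) should surface
        print("No Index found")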

    # Create the search index with the defined schema and vector search configuration
    index_definition = SearchIndex(name=index['name'], fields=index['fields'], vector_search=vector_search)
    result = await index_client.create_or_update_index(index_definition)
    print(f"{result.name} created")

# Clean up: close the index client connection
await index_client.close()

multilanguage created
translated created
No Index found
multi_language_openai created


## 🎉 Success! Indexes Created

All three indexes are now deployed and ready to receive data!
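
To confirm this against the service instead of relying on the printed names, one option is to list the index names and compare them with the expected set. The sketch below assumes `client` is the async `SearchIndexClient` from Step 4 and uses its `list_index_names()` method. The `find_missing_indexes` helper is hypothetical.

```python
import asyncio
from typing import Iterable, List

EXPECTED_INDEXES = ("multilanguage", "translated", "multi_language_openai")


async def find_missing_indexes(
    client, expected: Iterable[str] = EXPECTED_INDEXES
) -> List[str]:
    """Return the expected index names that the service does not report."""
    existing = set()
    # list_index_names() is consumed as an async iterator of names
    async for name in client.list_index_names():
        existing.add(name)
    return [name for name in expected if name not in existing]


# Possible usage in the notebook, before `await index_client.close()`:
# missing = await find_missing_indexes(index_client)
# print("All indexes present" if not missing else f"Missing: {missing}")
```

An empty list means all three indexes exist on the service.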

### 📊 Summary

✅ **multilanguage** - Cohere embeddings, native language (1024-dim)  
✅ **translated** - OpenAI embeddings, English translation (1536-dim)  
✅ **multi_language_openai** - OpenAI embeddings, native language (1536-dim)

Next step: Upload your multilingual data to test each strategy! 🚀