# 🌐 Multilingual Document Indexing with OpenAI Embeddings

## 📌 Important Note
This notebook demonstrates a **simplified approach** to document indexing using **OpenAI's text-embedding-3-large model**. OpenAI embeddings support multilingual text, and this approach fits best when:
- 📝 Documents are in **major languages** (English, Spanish, French, German, etc.)
- 🔄 You want to use the **same embedding model** for both indexing and search
- ⚡ You need a **quick setup** without complex language detection

## 📊 Workflow Overview

```mermaid
graph TB
    A[üìÅ Multilingual Excel File] --> B[üìã Convert to JSON]
    B --> C[ü§ñ Generate OpenAI Embeddings]
    C --> D[üîç Upload to Azure AI Search]
    
    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style C fill:#9013FE,stroke:#6A0DAD,stroke-width:2px,color:#fff
    style D fill:#50E3C2,stroke:#2ECC71,stroke-width:2px,color:#000
```

In [None]:
from dotenv import load_dotenv
from azure.identity.aio import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient
from openai import AsyncAzureOpenAI
import json
import os
import asyncio
import pandas as pd

## 📦 Setup: Import Libraries and Initialize Clients

This section imports the libraries needed for the indexing pipeline:
- 🔐 **Azure authentication**: `AzureKeyCredential` for key-based access to Azure AI Search (`DefaultAzureCredential` is also imported, but the clients in this notebook authenticate with API keys)
- 🤖 **OpenAI**: `AsyncAzureOpenAI` for generating embeddings
- 🔍 **Azure AI Search**: `SearchClient` for document indexing

In [2]:
load_dotenv(override=True)

open_ai_endpoint = os.getenv('OPENAI_ENDPOINT')
open_ai_key = os.getenv('OPENAI_KEY')
open_ai_embedding_model = os.getenv('EMBEDDING_OPENAI_DEPLOYMENT')

# Search
search_endpoint = os.getenv('SEARCH_ENDPOINT')
search_api_key = os.getenv('SEARCH_API_KEY')

## ⚙️ Configuration: Load Environment Variables

Load the required API keys and endpoints from the `.env` file:
- 🤖 **OpenAI endpoint and key** (`OPENAI_ENDPOINT`, `OPENAI_KEY`): For generating embeddings with text-embedding-3-large
- 📐 **Embedding model deployment** (`EMBEDDING_OPENAI_DEPLOYMENT`): The name of the embedding model deployment to use
- 🔍 **Azure AI Search credentials** (`SEARCH_ENDPOINT`, `SEARCH_API_KEY`): For uploading and indexing documents
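
A missing variable otherwise only surfaces later as a client error, so it can help to check the configuration up front. The following is a minimal sketch and not part of the original notebook: the helper name `find_missing_settings` is hypothetical, while the variable names are the ones read in the configuration cell.

```python
import os

# Variable names taken from the configuration cell of this notebook
REQUIRED_SETTINGS = [
    "OPENAI_ENDPOINT",
    "OPENAI_KEY",
    "EMBEDDING_OPENAI_DEPLOYMENT",
    "SEARCH_ENDPOINT",
    "SEARCH_API_KEY",
]


def find_missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the required setting names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


missing = find_missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present")
```

Running this right after `load_dotenv(override=True)` would list any names that still need to be added to the `.env` file.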

In [3]:
default_credential = DefaultAzureCredential()

openai = AsyncAzureOpenAI(
    azure_endpoint=open_ai_endpoint,
    api_key=open_ai_key,
    api_version="2024-12-01-preview"
)

index_name="multi_language_openai"

credential = AzureKeyCredential(search_api_key)

search_client = SearchClient(endpoint=search_endpoint,
                             index_name=index_name,
                             credential=credential)

## 🛠️ Helper Functions

### 📋 CSV/Excel to JSON Converter

The `csv_to_json_array()` function converts tabular data to JSON format:
- ✅ Supports both **CSV** and **Excel** files (.xlsx, .xls)
- 🔄 Converts column names to snake_case by lowercasing them and replacing spaces with underscores (for example, "Title Case" becomes `title_case`)
- 🧹 Replaces NaN values with empty strings
- 💾 Saves the result as a JSON array file

This function is essential for preparing data before embedding generation.
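
The column renaming in this notebook only lowercases names and replaces spaces. If the spreadsheet headers contain punctuation, CamelCase, or stray whitespace, a stricter normalizer may be useful. The sketch below is one possible variation, not the function used in this notebook, and `normalize_column_name` is a hypothetical name.

```python
import re


def normalize_column_name(name):
    """Turn an arbitrary header into snake_case (hypothetical helper)."""
    text = str(name).strip()
    # Insert an underscore at lowercase/digit -> uppercase boundaries (CamelCase)
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text)
    # Collapse every run of non-word characters and underscores into one underscore
    text = re.sub(r"[\W_]+", "_", text)
    return text.strip("_").lower()


for header in ["Title Case", "carModel", "Error-Code #2"]:
    print(header, "->", normalize_column_name(header))
```

It could replace the inner `to_snake_case` helper if the input headers turn out to be messier than plain "Title Case" names.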

In [4]:
def csv_to_json_array(csv_file:str, output_file:str):
    """Convert CSV or Excel file to array of JSON objects with snake_case field names"""
    
    # Check the file extension (case-insensitively) and read accordingly
    if csv_file.lower().endswith(('.xlsx', '.xls')):
        # Read Excel file into DataFrame
        df = pd.read_excel(csv_file)
    else:
        # Read CSV file into DataFrame
        df = pd.read_csv(csv_file)
    
    # Replace NaN values with empty strings
    df = df.fillna('')
    
    # Convert column names from "Title Case" to "snake_case"
    def to_snake_case(name):
        # Replace spaces with underscores and convert to lowercase
        return name.replace(' ', '_').lower()
    
    # Rename all columns to snake_case
    df.columns = [to_snake_case(col) for col in df.columns]
    
    # Convert DataFrame to list of dictionaries (JSON objects)
    data = df.to_dict(orient='records')
    
    # Print a summary and, if there is data, the first record
    print(f"Converted {len(data)} records from {csv_file} to JSON array")
    print(f"Converted column names: {list(df.columns)}")
    if data:
        print("\nFirst record example:")
        print(json.dumps(data[0], indent=2, ensure_ascii=False))
    
    # Save JSON array to file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"\nJSON array saved to: {output_file}")

## 📄 Step 1: Convert Excel to JSON

Convert the multilingual car problems Excel file to JSON format:
- 📥 **Input**: `car_problems_multilingual.xlsx` - Excel file with car problems in multiple languages
- 📤 **Output**: `car_problems_multilingual.json` - JSON array with snake_case field names

This step prepares the data structure for embedding generation.

In [None]:
csv_to_json_array(csv_file="car_problems_multilingual.xlsx", output_file="car_problems_multilingual.json")

## 🤖 Step 2: Generate OpenAI Embeddings

This cell processes each document and generates vector embeddings:

### Process Flow:
1. 📖 **Load documents** from the JSON file
2. 🔄 **For each document**:
   - Extract the `fault` field as the text to embed
   - 🧮 Call OpenAI's embedding API to generate a vector representation
   - 📊 Add the embedding to the document under the `vector` key
   - ⏱️ Sleep for 1 second to stay under the API rate limits
3. ✅ **Report the result** by printing the number of vectorized documents

### Why OpenAI Embeddings?
- 🌍 **Multilingual support**: Works well with major languages
- 📐 **High-quality vectors**: text-embedding-3-large produces 3072-dimensional embeddings by default
- 🔄 **Consistency**: The same model can be used for both indexing and search queries
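
The loop in the next cell sends one request per document and sleeps for a second after each. The embeddings endpoint also accepts a list of strings as `input`, so several documents can be embedded per request. The sketch below shows that variation under assumptions: `embed_in_batches`, `batch_size=16`, and `pause_seconds` are illustrative choices rather than values from this notebook, and `client` is expected to behave like the `AsyncAzureOpenAI` client created earlier.

```python
import asyncio


async def embed_in_batches(client, texts, model, batch_size=16, pause_seconds=1.0):
    """Embed texts in batches and return one vector per text, in input order.

    Hypothetical helper: `client` must expose `embeddings.create(input=..., model=...)`
    like the AsyncAzureOpenAI client; batch_size and pause_seconds are assumptions.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = await client.embeddings.create(input=batch, model=model)
        # Each item carries the index of its input within the batch; sort to be safe
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
        # Pause between requests, except after the last batch
        if start + batch_size < len(texts):
            await asyncio.sleep(pause_seconds)
    return vectors
```

In this notebook it could be called as `await embed_in_batches(openai, [car['fault'] for car in cars], open_ai_embedding_model)`, with each returned vector then assigned to the matching document's `vector` key.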

In [None]:
# Read the JSON file
with open('car_problems_multilingual.json', 'r', encoding='utf-8') as f:
    cars = json.load(f)

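documents = []

for car in cars:
    # The fault description is the text that gets embedded
    text_to_embed = car['fault']

    # Request an embedding from the Azure OpenAI deployment
    response = await openai.embeddings.create(
        input=text_to_embed,
        model=open_ai_embedding_model
    )

    # Store the embedding on the document itself
    car['vector'] = response.data[0].embedding
    documents.append(car)

    # Pause between requests to stay under the rate limit
    await asyncio.sleep(1)

print(f"{len(documents)} documents vectorized")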

## 🔍 Step 3: Upload Documents to Azure AI Search

Upload the vectorized documents to the Azure AI Search index:
- 📤 **Batch upload**: All documents with their embeddings are uploaded in a single call
- ✅ **Success verification**: The per-document results returned by the service show whether the upload succeeded
- 🔍 **Index**: Documents are added to the `multi_language_openai` search index
- ⚠️ **Error handling**: Catches and displays any upload errors

Once uploaded, these documents can be searched using:
- 📝 **Keyword search**: Traditional text-based search
- 🧮 **Vector search**: Semantic similarity search using embeddings
- 🔀 **Hybrid search**: Combination of both approaches, which often gives the best results
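
Uploading everything in one call is fine for a small file. Azure AI Search caps the size of a single indexing request, so larger datasets are usually split into chunks. The following is a minimal sketch under assumptions: the chunk size of 500 is an arbitrary value (check the current service limits), `upload_in_chunks` is a hypothetical helper name, and `client` stands for the async `SearchClient` created earlier.

```python
async def upload_in_chunks(client, documents, chunk_size=500):
    """Upload documents chunk by chunk and return the keys that failed.

    Hypothetical helper: `client` must expose an async `upload_documents(documents=...)`
    that returns per-document results with `key` and `succeeded` attributes,
    as the async SearchClient does. chunk_size=500 is an arbitrary assumption.
    """
    failed_keys = []
    for start in range(0, len(documents), chunk_size):
        chunk = documents[start:start + chunk_size]
        results = await client.upload_documents(documents=chunk)
        # Collect the keys of documents the service did not accept
        failed_keys.extend(r.key for r in results if not r.succeeded)
    return failed_keys
```

In a notebook cell this would be awaited directly, for example `failed = await upload_in_chunks(search_client, cars)`. An empty list means every document was accepted.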
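
To make the idea behind vector search concrete: the query is embedded with the same model as the documents, and documents are ranked by how close their vectors are to the query vector. Azure AI Search does this on the service side. The sketch below only illustrates the ranking, assuming cosine similarity as the metric and using made-up 3-dimensional vectors (real text-embedding-3-large vectors have 3072 dimensions). The function names and toy data are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, documents, top=3):
    """Return the `top` documents most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, doc["vector"]), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top]]


# Toy data: invented vectors, not real embeddings
toy_documents = [
    {"fault": "engine overheating", "vector": [0.9, 0.1, 0.0]},
    {"fault": "flat tire", "vector": [0.0, 0.2, 0.9]},
    {"fault": "coolant leak", "vector": [0.8, 0.3, 0.1]},
]
toy_query = [1.0, 0.2, 0.0]

for doc in rank_by_similarity(toy_query, toy_documents, top=2):
    print(doc["fault"])
```

Because the ranking depends only on the vectors, it behaves the same way whatever language the original text was written in, provided the embedding model places similar meanings close together.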

In [None]:
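try:
    # upload_documents returns one IndexingResult per document
    results = await search_client.upload_documents(documents=documents)
    failed = [r.key for r in results if not r.succeeded]
    print(f"Uploaded {len(results) - len(failed)} of {len(results)} documents")
    if failed:
        print(f"Failed document keys: {failed}")
except Exception as ex:
    print(f"Upload failed: {ex}")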