# 🌍 Multilingual Document Translation & Indexing Pipeline

This notebook demonstrates an end-to-end pipeline for translating multilingual documents to English and indexing them in Azure AI Search with vector embeddings.

## 📊 Workflow Overview

```mermaid
graph TB
    A[üìÅ Multilingual Excel File] --> B[üì§ Upload to Source Container]
    B --> C[üîê Generate SAS Tokens]
    C --> D[üîÑ Azure Document Translation]
    D --> E[‚è≥ Wait for Translation]
    E --> F[üì• Download Translated File]
    F --> G[üìã Convert to JSON]
    G --> H[üßÆ Generate Embeddings]
    H --> I[üîç Index in Azure AI Search]
    
    style A fill:#e3f2fd
    style D fill:#fff3e0
    style H fill:#f3e5f5
    style I fill:#e8f5e9
```

## 🎯 Key Steps

1. **Setup** - Initialize Azure service clients (Storage, Translation, OpenAI, Search)
2. **Upload** - Upload multilingual documents to blob storage
3. **Translate** - Use Azure Document Translation to convert to English
4. **Vectorize** - Generate embeddings using Azure OpenAI
5. **Index** - Store in Azure AI Search for semantic search


## 📦 Step 1: Import Required Libraries

Import all necessary Azure SDK libraries and utilities for the translation and indexing pipeline.

In [148]:
from dotenv import load_dotenv
from azure.storage.blob.aio import BlobServiceClient, BlobClient, ContainerClient
from azure.storage.blob import (
    generate_blob_sas,     
    UserDelegationKey, 
    BlobSasPermissions,
    ContainerSasPermissions, 
    generate_container_sas
)
from azure.identity.aio import DefaultAzureCredential
from azure.ai.translation.document.aio import DocumentTranslationClient
from azure.ai.translation.document.aio import SingleDocumentTranslationClient
from azure.ai.translation.document.models import DocumentTranslateContent
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient
from openai import AsyncAzureOpenAI
import datetime
import aiohttp
import json
import os
import asyncio
import pandas as pd

## ⚙️ Step 2: Load Configuration

Load environment variables containing Azure service endpoints and credentials.

In [153]:
load_dotenv(override=True)

storage_account_url = os.getenv('STORAGE_ACCOUNT_URL')
translator_endpoint = os.getenv('DOCUMENT_TRANSLATION_ENDPOINT')
translator_key = os.getenv('TRANSLATION_KEY')

open_ai_endpoint = os.getenv('OPENAI_ENDPOINT')
open_ai_key = os.getenv('OPENAI_KEY')
open_ai_embedding_model = os.getenv('EMBEDDING_OPENAI_DEPLOYMENT')

# Search
search_endpoint = os.getenv('SEARCH_ENDPOINT')
search_api_key = os.getenv('SEARCH_API_KEY')
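
Because `os.getenv` returns `None` for unset variables, a missing setting would only surface later as a client error. A minimal sketch of an early check, assuming the variable names from the cell above; the helper name `missing_settings` is hypothetical:

```python
import os

# Variable names taken from the configuration cell above.
REQUIRED_SETTINGS = [
    "STORAGE_ACCOUNT_URL",
    "DOCUMENT_TRANSLATION_ENDPOINT",
    "TRANSLATION_KEY",
    "OPENAI_ENDPOINT",
    "OPENAI_KEY",
    "EMBEDDING_OPENAI_DEPLOYMENT",
    "SEARCH_ENDPOINT",
    "SEARCH_API_KEY",
]


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


missing = missing_settings()
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

Running this right after `load_dotenv` lists every gap at once instead of failing on the first client that needs one.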

## 🔌 Step 3: Initialize Azure Clients

Create authenticated clients for:
- 📦 **Azure Blob Storage** - Document storage
- 🌐 **Azure Document Translation** - Multi-language translation
- 🤖 **Azure OpenAI** - Embedding generation
- 🔍 **Azure AI Search** - Vector search indexing

In [155]:
default_credential = DefaultAzureCredential()

blob_service_client = BlobServiceClient(storage_account_url,credential=default_credential)

document_translation_client = DocumentTranslationClient(translator_endpoint,AzureKeyCredential(translator_key))

openai = AsyncAzureOpenAI(
    azure_endpoint=open_ai_endpoint,
    api_key=open_ai_key,
    api_version="2024-12-01-preview"
)

index_name="translated"

credential = AzureKeyCredential(search_api_key)

search_client = SearchClient(endpoint=search_endpoint,
                             index_name=index_name,
                             credential=credential)

## 📁 Step 4: Configure Storage Containers

Set up source and destination blob containers:
- **upload** - Source container for multilingual files
- **translation** - Destination container for translated files

In [90]:
container_source = "upload"
container_translation = "translation"

container_client_source = blob_service_client.get_container_client(container_source)
container_client_translation = blob_service_client.get_container_client(container_translation)

## 🛠️ Step 5: Define Helper Functions

Define a helper that resets a blob container so each run starts empty. The SAS token and file conversion helpers are defined in Step 8.

In [63]:
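async def create_container(container_client: ContainerClient):
    # Empty an existing container instead of deleting and recreating it:
    # creating a container right after deleting one with the same name can
    # fail with 409 ContainerBeingDeleted while the deletion is in progress.
    if await container_client.exists():
        async for blob in container_client.list_blobs():
            await container_client.delete_blob(blob.name)
    else:
        await container_client.create_container()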

## 🧹 Step 6: Prepare Clean Containers

Reset both containers so the translation job starts from a clean state.

In [64]:
await create_container(container_client_source)
await create_container(container_client_translation)

## 📤 Step 7: Upload Source Document

Upload the multilingual Excel file (`car_problems_multilingual.xlsx`) to the source container.

In [65]:
filename = "car_problems_multilingual.xlsx"

blob_client_original_doc = blob_service_client.get_blob_client(container=container_source,blob=filename)

if await blob_client_original_doc.exists():
    await blob_client_original_doc.delete_blob()

# Upload the data
with open(file=filename, mode="rb") as data:
    await blob_client_original_doc.upload_blob(data)

## 🔐 Step 8: Define SAS Token Functions

Define functions that generate user delegation Shared Access Signatures (SAS) for temporary, scoped access to blob storage. The cell also defines `csv_to_json_array`, the file conversion helper used in Step 18.

In [137]:
async def request_user_delegation_key(blob_service_client: BlobServiceClient) -> UserDelegationKey:
    # Get a user delegation key that's valid for 1 hour
    delegation_key_start_time = datetime.datetime.now(datetime.timezone.utc)
    delegation_key_expiry_time = delegation_key_start_time + datetime.timedelta(hours=1)

    user_delegation_key = await blob_service_client.get_user_delegation_key(
        key_start_time=delegation_key_start_time,
        key_expiry_time=delegation_key_expiry_time
    )

    return user_delegation_key

def create_user_delegation_sas_blob(blob_client: BlobClient, user_delegation_key: UserDelegationKey):
    # Create a SAS token that's valid for one hour, as an example
    start_time = datetime.datetime.now(datetime.timezone.utc)
    expiry_time = start_time + datetime.timedelta(hours=1)

    sas_token = generate_blob_sas(
        account_name=blob_client.account_name,
        container_name=blob_client.container_name,
        blob_name=blob_client.blob_name,
        user_delegation_key=user_delegation_key,
        permission=BlobSasPermissions(read=True),
        expiry=expiry_time,
        start=start_time
    )

    return sas_token

def create_user_delegation_sas_container(container_client: ContainerClient, 
                                         permission:ContainerSasPermissions,
                                         user_delegation_key: UserDelegationKey):
    # Create a SAS token that's valid for one hour, as an example
    start_time = datetime.datetime.now(datetime.timezone.utc)
    expiry_time = start_time + datetime.timedelta(hours=1)

    sas_token = generate_container_sas(
        account_name=container_client.account_name,
        container_name=container_client.container_name,
        user_delegation_key=user_delegation_key,
        permission=permission,
        expiry=expiry_time,
        start=start_time
    )

    return sas_token

def csv_to_json_array(csv_file:str, output_file:str):
    """Convert CSV or Excel file to array of JSON objects with snake_case field names"""
    
    # Check file extension and read accordingly
    if csv_file.endswith('.xlsx') or csv_file.endswith('.xls'):
        # Read Excel file into DataFrame
        df = pd.read_excel(csv_file)
    else:
        # Read CSV file into DataFrame
        df = pd.read_csv(csv_file)
    
    # Replace NaN values with empty strings
    df = df.fillna('')
    
    # Convert column names from "Title Case" to "snake_case"
    def to_snake_case(name):
        # Replace spaces with underscores and convert to lowercase
        return name.replace(' ', '_').lower()
    
    # Rename all columns to snake_case
    df.columns = [to_snake_case(col) for col in df.columns]
    
    # Convert DataFrame to list of dictionaries (JSON objects)
    data = df.to_dict(orient='records')
    
    # Print the result
    print(f"Converted {len(data)} records from {csv_file} to JSON array")
    print(f"Converted column names: {list(df.columns)}")
    print("\nFirst record example:")
    print(json.dumps(data[0], indent=2))
    
    # Save JSON array to file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"\nJSON array saved to: {output_file}")

## 🎫 Step 9: Request User Delegation Key

Obtain a user delegation key from Azure Storage for creating SAS tokens.

In [92]:
user_delegation_key = await request_user_delegation_key(blob_service_client)

## 🔒 Step 10: Generate SAS Tokens

Create SAS tokens for:
- **Source container** - Read and list permissions
- **Destination container** - Write and list permissions

In [108]:
source_container_sas = create_user_delegation_sas_container(container_client_source,
                                                            ContainerSasPermissions(list=True,read=True),
                                                            user_delegation_key)
destination_container_sas = create_user_delegation_sas_container(container_client_translation,
                                                                 ContainerSasPermissions(list=True,write=True),
                                                                 user_delegation_key)
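
The SAS helpers start each validity window at the current time, so a token can look not yet valid if the client clock runs ahead of the storage service. A minimal sketch of computing the window with a back-dated start; the helper name `sas_window` and the 15-minute skew allowance are assumptions, not part of the notebook:

```python
import datetime


def sas_window(lifetime=datetime.timedelta(hours=1),
               skew=datetime.timedelta(minutes=15),
               now=None):
    """Return (start, expiry) in UTC, with the start back-dated by `skew`."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return now - skew, now + lifetime


start_time, expiry_time = sas_window()
print(f"SAS valid from {start_time.isoformat()} to {expiry_time.isoformat()}")
```

If this were adopted, the same window would presumably also be used for the user delegation key request, since a SAS should not outlive the key that signs it.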

## 🔗 Step 11: Build Container URLs

Construct complete URLs with SAS tokens for the Azure Document Translation service.

In [122]:
container_source_url = f"{container_client_source.url}?{source_container_sas}"
container_destination_url = f"{container_client_translation.url}?{destination_container_sas}"
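
The f-strings above assume the container URL carries no query string of its own. A small standard-library sketch that appends a SAS token whether or not a query is already present; `append_sas`, the example URL, and the token are hypothetical placeholders:

```python
from urllib.parse import urlsplit, urlunsplit


def append_sas(url: str, sas_token: str) -> str:
    """Append a SAS token to a URL, preserving any existing query string."""
    parts = urlsplit(url)
    token = sas_token.lstrip("?")
    query = f"{parts.query}&{token}" if parts.query else token
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# Placeholder account name and token, for illustration only.
print(append_sas("https://example.blob.core.windows.net/upload", "sv=2024-01-01&sig=abc"))
```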

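## 🔄 Step 12: Start Translation Job

Start a batch translation job that translates every document in the source container to English (`en-US`). No source language is passed, so the service detects it for each document.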

In [126]:
poller = await document_translation_client.begin_translation(source_url=container_source_url,
                                                             target_url=container_destination_url,
                                                             target_language="en-US")

## ⏳ Step 13: Wait for Translation Completion

Poll the translation service until the job completes.

In [127]:
result = await poller.result()

## 📊 Step 14: Check Translation Status

Verify the current status of the translation operation.

In [128]:
poller.status()

'Succeeded'

## ✅ Step 15: Retrieve Translation Results

Process the translation results and get the URL of the translated document.

In [None]:
async for document in result:

    print(f"Document ID: {document.id}")
    print(f"Document status: {document.status}")

    if document.status == "Succeeded":
        print(f"Source document location: {document.source_document_url}")
        blob_translated_document = document.translated_document_url
        print(f"Translated document location: {document.translated_document_url}")
        print(f"Translated to language: {document.translated_to}\n")
    elif document.error:
        print(f"Error Code: {document.error.code}, Message: {document.error.message}\n")

    # In this scenario we have only one document 
    break

## 🔑 Step 16: Create Download SAS Token

Generate a SAS token for downloading the translated blob.

In [143]:
blob_client = BlobClient.from_blob_url(blob_translated_document)

sas_blob = create_user_delegation_sas_blob(blob_client,user_delegation_key)

## 📥 Step 17: Download Translated Document

Download the translated Excel file from blob storage to local disk.

In [None]:
# Create the full URL with SAS token
blob_url_with_sas = f"{blob_translated_document}?{sas_blob}"

async with aiohttp.ClientSession() as session:
    async with session.get(blob_url_with_sas) as response:
        if response.status == 200:
            blob_data = await response.read()
            
            # Write to disk
            output_filename = "car_problems_translated_english.xlsx"
            with open(output_filename, "wb") as file:
                file.write(blob_data)
            
            print(f"Translated file saved as: {output_filename}")
            print(f"File size: {len(blob_data)} bytes")
        else:
            print(f"Error downloading file: {response.status}")

## 📋 Step 18: Convert Excel to JSON

Transform the translated Excel file into JSON format with snake_case field names for easier processing.

In [None]:
csv_to_json_array("car_problems_translated_english.xlsx","car_problems_translated_english.json")
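
The converter's `to_snake_case` only lowercases and replaces single spaces, so headers with punctuation, repeated spaces, or surrounding whitespace would produce awkward field names. A slightly more defensive sketch, assuming ASCII headers after translation; the sample headers are made up:

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a header and collapse non-alphanumeric runs to one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    return cleaned.strip("_").lower()


# Hypothetical headers, not taken from the actual spreadsheet.
print([to_snake_case(h) for h in ["Car Model", " Fault  Description ", "Cost (EUR)"]])
```

Non-ASCII letters are replaced by underscores here, which is only acceptable because the headers are expected to be English at this point in the pipeline.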

## 🧮 Step 19: Generate Vector Embeddings

Use Azure OpenAI to create vector embeddings for each car problem description. These embeddings enable semantic search capabilities.

In [None]:
# Read the JSON file
with open('car_problems_translated_english.json', 'r', encoding='utf-8') as f:
    cars = json.load(f)

documents = []

for car in cars:

    text_to_embed = car['fault']

    response = await openai.embeddings.create(
        input=text_to_embed,
        model=open_ai_embedding_model
    )    

    car['vector'] = response.data[0].embedding

    documents.append(car)

print(f"{len(documents)} documents vectorized")        
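
The loop above issues one request per record. The embeddings endpoint also accepts a list of inputs, so records can be embedded in batches. A sketch under assumptions: `chunked`, `embed_in_batches`, and the batch size of 16 are hypothetical, and `client` is expected to behave like the `AsyncAzureOpenAI` instance created in Step 3:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def embed_in_batches(client, model, records, text_field="fault", batch_size=16):
    """Attach a 'vector' to each record, sending one request per batch."""
    for batch in chunked(records, batch_size):
        response = await client.embeddings.create(
            input=[record[text_field] for record in batch],
            model=model,
        )
        # Each result carries the index of its input within the request,
        # so match on that rather than relying on response order.
        for item in response.data:
            batch[item.index]["vector"] = item.embedding
    return records
```

In the notebook this could be called as `documents = await embed_in_batches(openai, open_ai_embedding_model, cars)`.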

## 🔍 Step 20: Upload to Azure AI Search

Upload the translated, vectorized documents to the `translated` index in Azure AI Search. The index must already exist, with fields that match the JSON keys and a vector field named `vector`.

In [None]:
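try:
    # Upload the vectorized records and report how many were accepted
    result = await search_client.upload_documents(documents)
    succeeded = sum(1 for r in result if r.succeeded)
    print(f"Indexed {succeeded} of {len(result)} documents")
except Exception as ex:
    print(ex)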

## 🎉 Pipeline Complete!

### Summary of Achievements:

✅ **Translation**: Multilingual documents converted to English  
✅ **Vectorization**: Text embeddings generated for semantic search  
✅ **Indexing**: Documents stored in Azure AI Search  

### 🚀 Next Steps:

You can now perform:
- 🔎 Semantic searches across translated content
- 🌐 Multi-language query support
- 📊 Vector similarity searches for car problems
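
Vector similarity search ranks documents by how close their embeddings are to the query embedding. As a rough local illustration of that idea (not of how Azure AI Search executes it), here is a sketch of cosine-similarity ranking over toy three-dimensional vectors; the function names and data are made up:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vector, documents, top=3):
    """Return the `top` documents most similar to the query vector."""
    scored = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return scored[:top]


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
docs = [
    {"fault": "engine overheating", "vector": [0.9, 0.1, 0.0]},
    {"fault": "flat tyre", "vector": [0.0, 0.2, 0.9]},
    {"fault": "coolant leak", "vector": [0.8, 0.3, 0.1]},
]
for doc in rank_by_similarity([1.0, 0.0, 0.0], docs, top=2):
    print(doc["fault"])
```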

### 📈 Results:

- **Source**: Multilingual Excel file with car problems
- **Target**: English-translated, vectorized, searchable index
- **Capability**: Semantic search with AI-powered understanding