# Multilingual Intent Classification with Amazon Nova Multimodal Embeddings

Customer service systems receive requests in multiple languages daily. A user might say "I want to cancel my order" in English, "Quiero cancelar mi pedido" in Spanish, or "Je veux annuler ma commande" in French. Each expresses the same intent but uses completely different words and grammar structures.

Traditional intent classification systems struggle with this linguistic diversity. Most approaches require training separate models for each language or translating everything to English first. Both solutions introduce complexity, cost, and potential errors.

This notebook explores a different approach using Amazon Nova Multimodal Embeddings. Instead of focusing on words, we'll work with semantic embeddings: mathematical representations that capture meaning across languages. When Nova processes "cancel order" in English and "cancelar pedido" in Spanish, it should produce similar vectors because the underlying intent is the same.

The system we'll build uses these embeddings with a K-Nearest Neighbors classifier. The classifier finds training examples with similar meanings to new inputs, regardless of language. This approach allows us to train once and classify everywhere, eliminating the need for language-specific models or translation steps.

We begin by installing the required Python libraries and importing the modules we'll need throughout the implementation. The installation includes boto3 for AWS services, scikit-learn for machine learning operations, and visualization libraries for analyzing results.

In [None]:
%pip install boto3 pandas numpy scikit-learn matplotlib seaborn tqdm langdetect

In [None]:


import boto3
import json
import pandas as pd
import numpy as np
import os
from typing import List, Dict, Optional, Tuple
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

plt.style.use('default')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 8)

print('Libraries imported successfully')

Next, we configure our AWS environment and set up the connection to Nova Multimodal Embeddings. The configuration specifies us-east-1 as our region since Nova is currently available there. We also set the embedding dimension to 1024, a reasonable trade-off between representational capacity and the compute and storage cost of larger vectors. The S3 bucket name ends with a short hash derived from the current username, which makes name collisions between different users running this notebook unlikely.

In [None]:
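import getpass
import hashlib

AWS_REGION = 'us-east-1'
# Derive a stable 6-character suffix from the username. The built-in hash() is
# salted per process for strings (and may be negative), so it would produce a
# different bucket name on every kernel restart. getpass.getuser() also works
# where os.getlogin() fails for lack of a controlling terminal.
_user_suffix = hashlib.sha256(getpass.getuser().encode('utf-8')).hexdigest()[:6]
S3_BUCKET_NAME = f'nova-multilingual-classification-{_user_suffix}'
DATASET_PATH = './massive'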

try:
    bedrock_client = boto3.client('bedrock-runtime', region_name=AWS_REGION)
    s3_client = boto3.client('s3', region_name=AWS_REGION)
    s3vectors_client = boto3.client('s3vectors', region_name=AWS_REGION)
    print('AWS clients initialized')
except Exception as e:
    print(f'AWS client initialization failed: {e}')
    print('Ensure AWS credentials are configured')

NOVA_MODEL_ID = 'amazon.nova-2-multimodal-embeddings-v1:0'
EMBEDDING_DIMENSION = 1024

print(f'Region: {AWS_REGION}')
print(f'Model: {NOVA_MODEL_ID}')
print(f'Embedding dimension: {EMBEDDING_DIMENSION}')

We need to choose our dataset strategy. The MASSIVE dataset contains over one million utterances across 51 languages in the 1.0 release downloaded below (release 1.1 extends this to 52), making it well suited to production systems but expensive for learning purposes. Our sample dataset contains just 12 hand-picked examples across 4 languages, which demonstrates the core concepts while minimizing API costs. For this tutorial, we'll use the sample data to keep costs low while still showing how the system works across multiple languages.

In [None]:
USE_MASSIVE_DATASET = False

MASSIVE_URL = 'https://amazon-massive-nlu-dataset.s3.amazonaws.com/amazon-massive-dataset-1.0.tar.gz'
MASSIVE_ARCHIVE = 'amazon-massive-dataset-1.0.tar.gz'
MASSIVE_DATA_DIR = '1.0/data'
SAMPLE_LANGUAGES = ['en-US', 'es-ES', 'fr-FR', 'de-DE']

dataset_mode = "MASSIVE" if USE_MASSIVE_DATASET else "Sample"
print(f'Dataset mode: {dataset_mode}')
if USE_MASSIVE_DATASET:
    print(f'Languages: {len(SAMPLE_LANGUAGES)} of 52 available')
else:
    print(f'Languages: 4 sample languages')

Now we implement the core system architecture. The NovaEmbeddings class handles all communication with the Bedrock API: it formats requests, handles errors, and counts requests for cost tracking. Each text input is converted into a 1024-dimensional vector that captures its semantic meaning. Empty inputs and failed requests return an all-zero vector instead of raising, so check the printed error messages before trusting downstream results. The batch helper loops over the inputs one request at a time and shows a progress bar, which becomes useful on larger datasets. The request counter helps monitor API usage for cost management in production deployments.

In [None]:
class NovaEmbeddings:
    def __init__(self, region='us-east-1'):
        self.bedrock = boto3.client('bedrock-runtime', region_name=region)
        self.model_id = 'amazon.nova-2-multimodal-embeddings-v1:0'
        self.request_count = 0
    
    def embed_text(self, text: str, dimension: int = 1024) -> List[float]:
        if not text or not text.strip():
            return [0.0] * dimension
            
        request_body = {
            'taskType': 'SINGLE_EMBEDDING',
            'singleEmbeddingParams': {
                'embeddingDimension': dimension,
                'embeddingPurpose': 'CLASSIFICATION',
                'text': {"truncationMode": "END", 'value': text[:8000]}
            }
        }
        
        try:
            response = self.bedrock.invoke_model(
                modelId=self.model_id,
                body=json.dumps(request_body)
            )
            result = json.loads(response['body'].read())
            self.request_count += 1
            return result['embeddings'][0]['embedding']
        except Exception as e:
            print(f'Embedding error: {e}')
            return [0.0] * dimension
    
    def embed_batch_text(self, texts: List[str], dimension: int = 1024) -> List[List[float]]:
        embeddings = []
        for text in tqdm(texts, desc='Generating embeddings'):
            embedding = self.embed_text(text, dimension)
            embeddings.append(embedding)
        return embeddings
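
Later cells embed the same utterances several times (training, transfer evaluation, similarity analysis), and each call counts toward API usage. One possible way to avoid repeated requests is a small in-memory cache around any object that exposes `embed_text`. The sketch below is illustrative only: `CachedEmbedder` and `FakeEmbedder` are hypothetical names that do not appear elsewhere in this notebook, and the fake client stands in for `NovaEmbeddings` so the example runs without AWS credentials.

```python
from typing import Dict, List, Tuple


class FakeEmbedder:
    """Offline stand-in for NovaEmbeddings (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.request_count = 0

    def embed_text(self, text: str, dimension: int = 4) -> List[float]:
        self.request_count += 1
        # Deterministic toy vector; a real client would call Bedrock here.
        return [float(len(text) + i) for i in range(dimension)]


class CachedEmbedder:
    """Memoizes embed_text results keyed by (text, dimension)."""

    def __init__(self, client) -> None:
        self.client = client
        self._cache: Dict[Tuple[str, int], List[float]] = {}

    def embed_text(self, text: str, dimension: int = 4) -> List[float]:
        key = (text, dimension)
        if key not in self._cache:
            self._cache[key] = self.client.embed_text(text, dimension)
        return self._cache[key]

    def embed_batch_text(self, texts: List[str], dimension: int = 4) -> List[List[float]]:
        return [self.embed_text(t, dimension) for t in texts]


fake = FakeEmbedder()
cached = CachedEmbedder(fake)
cached.embed_batch_text(['play music', 'set alarm', 'play music'])
cached.embed_text('set alarm')
print(fake.request_count)  # 2: only the two distinct texts reached the client
```

If something like this were wrapped around the real client, it would probably be wise not to cache the all-zero fallback vector that `embed_text` returns on errors, so that a transient failure can be retried.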

The MultilingualClassifier class implements our intent recognition logic using K-Nearest Neighbors with cosine distance. KNN suits this task because similar intents tend to cluster together in the embedding space, regardless of the language used to express them. The classifier stores embeddings from training examples and finds the k most similar examples when making predictions. Cosine similarity compares the direction of two vectors and ignores their length, which is a common choice for comparing text embeddings. The class provides single predictions with a confidence score (the share of the k neighbors that voted for the predicted intent) and a batch method that classifies a list of texts in one call.

In [None]:
class MultilingualClassifier:
    def __init__(self, embeddings_client, k_neighbors: int = 3):
        self.embeddings = embeddings_client
        self.classifier = KNeighborsClassifier(n_neighbors=k_neighbors, metric='cosine')
        self.is_trained = False
        self.intent_labels = []
    
    def train(self, texts: List[str], intents: List[str]):
        print(f'Training classifier with {len(texts)} samples')
        
        embeddings = self.embeddings.embed_batch_text(texts)
        self.classifier.fit(embeddings, intents)
        self.intent_labels = sorted(set(intents))  # sorted for a deterministic class order
        self.is_trained = True
        
        print(f'Training complete: {len(self.intent_labels)} intent classes')
    
    def predict(self, text: str) -> Dict:
        if not self.is_trained:
            raise ValueError('Classifier must be trained first')
        
        embedding = self.embeddings.embed_text(text)
        prediction = self.classifier.predict([embedding])[0]
        probabilities = self.classifier.predict_proba([embedding])[0]
        
        class_scores = dict(zip(self.classifier.classes_, probabilities))
        
        return {
            'predicted_intent': prediction,
            'confidence': max(probabilities),
            'all_scores': class_scores
        }
    
    def predict_batch(self, texts: List[str]) -> List[str]:
        if not self.is_trained:
            raise ValueError('Classifier must be trained first')
        
        embeddings = self.embeddings.embed_batch_text(texts)
        predictions = self.classifier.predict(embeddings)
        return predictions.tolist()
    
    def evaluate(self, test_texts: List[str], true_intents: List[str]) -> Dict:
        predictions = self.predict_batch(test_texts)
        accuracy = accuracy_score(true_intents, predictions)
        report = classification_report(true_intents, predictions, output_dict=True)
        
        return {
            'accuracy': accuracy,
            'classification_report': report,
            'predictions': predictions
        }

print('Classes defined')
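
To make the `confidence` value concrete: with scikit-learn's default uniform weights, `predict_proba` on a `KNeighborsClassifier` reports the fraction of the k nearest neighbors that belong to each class. The following is a minimal pure-Python sketch of that voting logic, not the library's implementation. The 2-D vectors are made up for illustration and are not real Nova output, and `knn_vote` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_vote(query: Sequence[float],
             train_vectors: List[Sequence[float]],
             train_labels: List[str],
             k: int = 3) -> Tuple[str, Dict[str, float]]:
    # Rank training examples by cosine distance to the query.
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine_distance(query, pair[0]))
    # Each of the k closest neighbors casts one vote for its label.
    votes = Counter(label for _, label in ranked[:k])
    scores = {label: count / k for label, count in votes.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy 2-D "embeddings" (made up for illustration).
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_labels = ['alarm_set', 'alarm_set', 'music_play', 'music_play']

label, scores = knn_vote([1.0, 0.0], train_vectors, train_labels, k=3)
print(label, scores)
```

Here two of the three nearest neighbors are `alarm_set`, so its score is 2/3. This also shows why confidence is coarse for small k: with k=3 it can only be 1/3, 2/3, or 1.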

Our data loading functions handle both the full MASSIVE dataset and our sample data. The MASSIVE dataset requires downloading and extracting a compressed archive, then parsing JSONL files for each language. Our sample data creation function builds a small but representative dataset with three common intents: setting alarms, querying weather, and playing music. Each intent appears in all four languages with natural variations in phrasing. This gives us enough data to demonstrate multilingual understanding while keeping API costs minimal.

In [None]:
import tarfile
import urllib.request

def download_massive_dataset():
    if not os.path.exists(MASSIVE_ARCHIVE):
        print(f'Downloading MASSIVE dataset from {MASSIVE_URL}')
        urllib.request.urlretrieve(MASSIVE_URL, MASSIVE_ARCHIVE)
        print('Download complete')
    
    if not os.path.exists(MASSIVE_DATA_DIR):
        print('Extracting dataset')
        with tarfile.open(MASSIVE_ARCHIVE, 'r:gz') as tar:
            tar.extractall()
        print('Extraction complete')

def load_massive_data(languages=None, max_samples_per_lang=1000):
    if languages is None:
        languages = SAMPLE_LANGUAGES
    
    multilingual_data = {}
    
    for lang in languages:
        jsonl_path = os.path.join(MASSIVE_DATA_DIR, f'{lang}.jsonl')
        if os.path.exists(jsonl_path):
            data = []
            with open(jsonl_path, 'r', encoding='utf-8') as f:
                for i, line in enumerate(f):
                    if i >= max_samples_per_lang:
                        break
                    data.append(json.loads(line.strip()))
            multilingual_data[lang] = pd.DataFrame(data)
            print(f'{lang}: {len(data)} samples loaded')
        else:
            print(f'{lang}: File not found')
    
    return multilingual_data

def create_sample_data():
    sample_data = {
        'en-US': [
            {'id': '1', 'locale': 'en-US', 'partition': 'train', 'intent': 'alarm_set', 'scenario': 'alarm', 'utt': 'set alarm for 7 AM tomorrow'},
            {'id': '2', 'locale': 'en-US', 'partition': 'train', 'intent': 'weather_query', 'scenario': 'weather', 'utt': 'what is the weather like today'},
            {'id': '3', 'locale': 'en-US', 'partition': 'train', 'intent': 'music_play', 'scenario': 'music', 'utt': 'play some jazz music'},
        ],
        'es-ES': [
            {'id': '1', 'locale': 'es-ES', 'partition': 'train', 'intent': 'alarm_set', 'scenario': 'alarm', 'utt': 'pon la alarma para las 7 de la mañana'},
            {'id': '2', 'locale': 'es-ES', 'partition': 'train', 'intent': 'weather_query', 'scenario': 'weather', 'utt': '¿cómo está el tiempo hoy?'},
            {'id': '3', 'locale': 'es-ES', 'partition': 'train', 'intent': 'music_play', 'scenario': 'music', 'utt': 'reproduce música jazz'},
        ],
        'fr-FR': [
            {'id': '1', 'locale': 'fr-FR', 'partition': 'train', 'intent': 'alarm_set', 'scenario': 'alarm', 'utt': 'règle l\'alarme pour 7 heures demain matin'},
            {'id': '2', 'locale': 'fr-FR', 'partition': 'train', 'intent': 'weather_query', 'scenario': 'weather', 'utt': 'quel temps fait-il aujourd\'hui'},
            {'id': '3', 'locale': 'fr-FR', 'partition': 'train', 'intent': 'music_play', 'scenario': 'music', 'utt': 'joue de la musique jazz'},
        ],
        'de-DE': [
            {'id': '1', 'locale': 'de-DE', 'partition': 'train', 'intent': 'alarm_set', 'scenario': 'alarm', 'utt': 'stelle den Wecker auf 7 Uhr morgen früh'},
            {'id': '2', 'locale': 'de-DE', 'partition': 'train', 'intent': 'weather_query', 'scenario': 'weather', 'utt': 'wie ist das Wetter heute'},
            {'id': '3', 'locale': 'de-DE', 'partition': 'train', 'intent': 'music_play', 'scenario': 'music', 'utt': 'spiele Jazz-Musik'},
        ]
    }
    
    multilingual_data = {}
    for lang, samples in sample_data.items():
        multilingual_data[lang] = pd.DataFrame(samples)
    
    return multilingual_data

if USE_MASSIVE_DATASET:
    print('Loading MASSIVE dataset')
    download_massive_dataset()
    multilingual_data = load_massive_data(SAMPLE_LANGUAGES, max_samples_per_lang=100)
    print(f'Loaded MASSIVE data for {len(multilingual_data)} languages')
else:
    print('Creating sample data')
    multilingual_data = create_sample_data()
    print(f'Created sample data for {len(multilingual_data)} languages')

total_samples = sum(len(df) for df in multilingual_data.values())
print(f'\nDataset summary:')
for lang, df in multilingual_data.items():
    print(f'{lang}: {len(df)} samples')
print(f'Total: {total_samples} samples')

all_intents = []
for df in multilingual_data.values():
    all_intents.extend(df['intent'].tolist())
intent_counts = pd.Series(all_intents).value_counts()
print(f'\nIntent distribution:')
for intent, count in intent_counts.items():
    print(f'{intent}: {count} samples')
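
Each record carries a `partition` field, which is `'train'` for every sample record above. Assuming the full MASSIVE files also use values such as `'dev'` and `'test'` in that field, one option is to respect the dataset's own split instead of taking the first N lines. The sketch below groups JSONL lines by partition. `split_by_partition` is a hypothetical helper and the input lines are made up in the same shape as the sample records.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_partition(jsonl_lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group parsed JSONL records by their 'partition' field."""
    partitions: Dict[str, List[dict]] = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:  # tolerate blank lines instead of failing in json.loads
            continue
        record = json.loads(line)
        partitions[record.get('partition', 'unknown')].append(record)
    return dict(partitions)


# Made-up lines in the same shape as the sample records above.
lines = [
    '{"id": "1", "locale": "en-US", "partition": "train", "intent": "alarm_set", "utt": "set alarm for 7 AM"}',
    '{"id": "2", "locale": "en-US", "partition": "test", "intent": "music_play", "utt": "play some jazz"}',
    '',
    '{"id": "3", "locale": "en-US", "partition": "train", "intent": "weather_query", "utt": "weather today"}',
]
parts = split_by_partition(lines)
print({name: len(records) for name, records in parts.items()})
```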

With our data loaded, we can now initialize and train our multilingual classification system. The training process involves two main steps: first, we create instances of our NovaEmbeddings and MultilingualClassifier classes, then we combine all text samples from all languages into a single training set. This approach allows the classifier to learn from examples across languages simultaneously, which improves its ability to recognize similar intents regardless of the language used to express them.

In [None]:
print('Initializing system components')

nova_embeddings = NovaEmbeddings(region=AWS_REGION)
classifier = MultilingualClassifier(nova_embeddings, k_neighbors=3)

print('System initialized')

train_texts = []
train_intents = []

for language, df in multilingual_data.items():
    for _, row in df.iterrows():
        train_texts.append(row['utt'])
        train_intents.append(row['intent'])

print(f'Prepared {len(train_texts)} training samples across {len(multilingual_data)} languages')

classifier.train(train_texts, train_intents)
print('Training complete')

Now we test our trained classifier with queries in different languages to verify that it can correctly identify intents across linguistic boundaries. The test queries include both complete phrases and partial expressions to see how well the system handles different levels of specificity. Each prediction returns the most likely intent together with a score for every intent. With KNN these scores are neighbor vote shares, so with k=3 they can only take the values 0, 1/3, 2/3, or 1; read them as a rough signal of agreement among neighbors rather than a calibrated probability.

In [None]:
test_queries = [
    ('set an alarm for 8 AM', 'en-US'),
    ('¿qué tiempo hace?', 'es-ES'),
    ('joue ma chanson préférée', 'fr-FR'),
    ('Wecker für morgen früh', 'de-DE'),
    ('play rock music', 'en-US'),
    ('tiempo mañana', 'es-ES')
]

print('multilingual classification test\n')

for i, (query, lang) in enumerate(test_queries, 1):
    print(f'Query {i}: "{query}" ({lang})')
    
    result = classifier.predict(query)
    
    print(f'Predicted intent: {result["predicted_intent"]}')
    print(f'Confidence: {result["confidence"]:.4f}')
    
    print('All scores:')
    for intent, score in sorted(result['all_scores'].items(), key=lambda x: x[1], reverse=True):
        print(f'  {intent}: {score:.4f}')
    print()
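
One way the confidence score might be used downstream is to route low-agreement predictions to a fallback instead of acting on them. The sketch below assumes the dictionary shape returned by `MultilingualClassifier.predict`. The helper `route_prediction`, the `'unknown'` label, and the 0.6 threshold are all hypothetical choices for illustration, not part of the notebook's system.

```python
from typing import Dict


def route_prediction(result: Dict, min_confidence: float = 0.6, fallback: str = 'unknown') -> str:
    """Return the predicted intent, or a fallback label when confidence is low."""
    if result['confidence'] >= min_confidence:
        return result['predicted_intent']
    return fallback


# Hand-written dicts in the shape returned by MultilingualClassifier.predict.
confident = {
    'predicted_intent': 'alarm_set',
    'confidence': 2 / 3,
    'all_scores': {'alarm_set': 2 / 3, 'music_play': 1 / 3, 'weather_query': 0.0},
}
uncertain = {
    'predicted_intent': 'music_play',
    'confidence': 1 / 3,
    'all_scores': {'alarm_set': 1 / 3, 'music_play': 1 / 3, 'weather_query': 1 / 3},
}

print(route_prediction(confident))  # alarm_set
print(route_prediction(uncertain))  # unknown
```

With k=3 a threshold of 0.6 effectively means "at least two of three neighbors agree"; a larger k would allow finer thresholds.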

To properly evaluate multilingual transfer capabilities, we need to test whether knowledge learned from one language can be applied to others. This experiment trains a new classifier using only English examples, then tests it on Spanish, French, and German inputs. This simulates a real-world scenario where you might have abundant labeled data in one language but need to support users speaking other languages. Strong performance in this test suggests that the embeddings capture semantic meaning rather than language-specific patterns. With the sample data there are only three test utterances per language, so treat the numbers as a smoke test rather than a benchmark.

In [None]:
print('multilingual transfer evaluation')
print('Training on English, testing on other languages\n')

en_df = multilingual_data['en-US']
en_texts = en_df['utt'].tolist()
en_intents = en_df['intent'].tolist()

en_classifier = MultilingualClassifier(nova_embeddings, k_neighbors=1)
en_classifier.train(en_texts, en_intents)

transfer_results = {}
for test_lang in ['es-ES', 'fr-FR', 'de-DE']:
    if test_lang in multilingual_data:
        test_df = multilingual_data[test_lang]
        test_texts = test_df['utt'].tolist()
        test_intents = test_df['intent'].tolist()
        
        print(f'Testing transfer to {test_lang}:')
        
        results = en_classifier.evaluate(test_texts, test_intents)
        transfer_results[test_lang] = results
        
        print(f'Accuracy: {results["accuracy"]:.4f} ({results["accuracy"]*100:.1f}%)')
        print(f'Macro F1: {results["classification_report"]["macro avg"]["f1-score"]:.4f}')
        
        print('Per-intent performance:')
        for intent in ['alarm_set', 'weather_query', 'music_play']:
            if intent in results['classification_report']:
                f1 = results['classification_report'][intent]['f1-score']
                print(f'  {intent}: F1={f1:.4f}')
        print()

avg_accuracy = np.mean([r['accuracy'] for r in transfer_results.values()])
print(f'Average multilingual accuracy: {avg_accuracy:.4f} ({avg_accuracy*100:.1f}%)')
print('multilingual evaluation complete')
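
Because each test language has only three utterances in the sample data, a single mistake changes the metrics sharply. The worked example below recomputes per-class F1 and macro F1 by hand for a hypothetical run with one error. It is a plain-Python sketch of the standard definitions (undefined precision or recall is treated as 0), not scikit-learn's code, and the predictions are invented.

```python
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[cls] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented outcome: the weather query is misclassified as music_play.
y_true = ['alarm_set', 'weather_query', 'music_play']
y_pred = ['alarm_set', 'music_play', 'music_play']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f'accuracy={accuracy:.3f}')                    # accuracy=0.667
print(f'macro F1={macro_f1(y_true, y_pred):.3f}')    # macro F1=0.556
```

One wrong answer out of three drops accuracy to 66.7% and macro F1 to about 0.556, since `weather_query` scores 0 and `music_play` scores 2/3.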

Understanding how Nova represents the same intent across different languages provides insight into why multilingual classification works. We'll measure the cosine similarity between embeddings of the same intent expressed in different languages. High similarity scores indicate that Nova creates similar vector representations for semantically equivalent phrases, regardless of the specific words or grammar used. This analysis helps validate that the embeddings capture meaning rather than surface linguistic features.

In [None]:
print('multilingual semantic similarity analysis\n')

intent_groups = {}
for lang, df in multilingual_data.items():
    for _, row in df.iterrows():
        intent = row['intent']
        if intent not in intent_groups:
            intent_groups[intent] = {}
        intent_groups[intent][lang] = row['utt']

intent_similarities = {}
for intent, lang_texts in intent_groups.items():
    print(f'Intent: {intent}')
    print(f'Languages: {", ".join(lang_texts.keys())}')
    
    if len(lang_texts) >= 2:
        texts = list(lang_texts.values())
        langs = list(lang_texts.keys())
        
        embeddings = nova_embeddings.embed_batch_text(texts)
        similarities = cosine_similarity(embeddings)
        
        print('multilingual similarities:')
        intent_sims = []
        for i in range(len(langs)):
            for j in range(i+1, len(langs)):
                sim = similarities[i][j]
                intent_sims.append(sim)
                print(f'  {langs[i]} - {langs[j]}: {sim:.4f}')
        
        avg_similarity = np.mean(intent_sims)
        intent_similarities[intent] = avg_similarity
        print(f'Average similarity: {avg_similarity:.4f}')
    
    print()

if intent_similarities:
    overall_similarity = np.mean(list(intent_similarities.values()))
    print(f'Overall multilingual semantic consistency: {overall_similarity:.4f}')

print('Similarity analysis complete')
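
A same-intent similarity score is easier to interpret next to a baseline: the similarity between utterances with different intents. If both numbers are high, the embeddings are not separating intents well even though the same-intent score looks good. The sketch below computes both means from labeled vectors. `similarity_gap` is a hypothetical helper, and the 2-D vectors are made up rather than taken from Nova.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_gap(vectors: List[Sequence[float]], intents: List[str]) -> Tuple[float, float]:
    """Return (mean same-intent similarity, mean different-intent similarity)."""
    same, different = [], []
    for i, j in combinations(range(len(vectors)), 2):
        bucket = same if intents[i] == intents[j] else different
        bucket.append(cosine(vectors[i], vectors[j]))
    # Assumes at least one pair of each kind exists.
    return sum(same) / len(same), sum(different) / len(different)


# Toy vectors (made up): two "languages" per intent.
vectors = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]]
intents = ['alarm_set', 'alarm_set', 'music_play', 'music_play']

same_mean, diff_mean = similarity_gap(vectors, intents)
print(f'same-intent: {same_mean:.3f}, different-intent: {diff_mean:.3f}')
```

The larger the gap between the two means, the easier the nearest-neighbor classifier's job should be.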

For a complete performance assessment, we need to evaluate the system using standard machine learning metrics and analyze the cost implications of our approach. When we have sufficient data, we'll split it into training and testing sets to get unbiased performance estimates. The evaluation includes accuracy, precision, recall, and F1 scores, which provide different perspectives on classification performance. We also track API usage to understand the cost structure for production deployments.

In [None]:
print('Performance analysis\n')

all_texts = []
all_intents = []

for language, df in multilingual_data.items():
    for _, row in df.iterrows():
        all_texts.append(row['utt'])
        all_intents.append(row['intent'])

if len(all_texts) > 6:
    X_train, X_test, y_train, y_test = train_test_split(
        all_texts, all_intents, test_size=0.3, random_state=42, stratify=all_intents
    )
    
    eval_classifier = MultilingualClassifier(nova_embeddings, k_neighbors=3)
    eval_classifier.train(X_train, y_train)
    
    results = eval_classifier.evaluate(X_test, y_test)
    
    print('Classification performance:')
    print(f'Overall accuracy: {results["accuracy"]:.4f} ({results["accuracy"]*100:.2f}%)')
    print(f'Macro F1: {results["classification_report"]["macro avg"]["f1-score"]:.4f}')
    print(f'Weighted F1: {results["classification_report"]["weighted avg"]["f1-score"]:.4f}')
        
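else:
    # Fallback for very small datasets: this scores the classifier on the same
    # samples it was trained on, so treat the result as a sanity check only.
    print('Small dataset - using full dataset evaluation')
    results = classifier.evaluate(all_texts, all_intents)
    print(f'Full dataset accuracy: {results["accuracy"]:.4f} ({results["accuracy"]*100:.2f}%)')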

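# Placeholder flat per-request rate for a rough estimate only; substitute the
# current Amazon Bedrock pricing for your region before budgeting.
COST_PER_REQUEST_USD = 0.0002
estimated_cost = nova_embeddings.request_count * COST_PER_REQUEST_USD

print(f'\nCost analysis:')
# request_count is cumulative across every cell run so far, not just this one.
print(f'Total Nova API calls: {nova_embeddings.request_count}')
print(f'Estimated cost: ${estimated_cost:.4f}')
print(f'Cost per sample: ${estimated_cost / len(all_texts):.6f}')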

print(f'\nSystem specifications:')
print(f'Languages supported: {len(multilingual_data)}')
print(f'Intent classes: {len(set(all_intents))}')
print(f'Training samples: {len(all_texts)}')

print('Performance analysis complete')
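
Since `request_count` accumulates across the whole session, it can help to attribute calls to individual stages. One possible approach is a small context manager that snapshots the counter before and after a block. The sketch below is an assumption-laden illustration: `track_requests` and `CountingClient` are hypothetical names, the client is an offline stand-in for `NovaEmbeddings`, and the rate is the same placeholder used in the cell above rather than a quoted price.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


class CountingClient:
    """Offline stand-in exposing a request_count attribute (hypothetical)."""

    def __init__(self) -> None:
        self.request_count = 0

    def embed_text(self, text: str) -> List[float]:
        self.request_count += 1
        return [0.0]


@contextmanager
def track_requests(client, label: str, usage: Dict[str, int]) -> Iterator[None]:
    """Record how many requests were made inside the with-block under `label`."""
    start = client.request_count
    try:
        yield
    finally:
        # Runs even if the block raises, so partial usage is still recorded.
        usage[label] = client.request_count - start


PRICE_PER_REQUEST_USD = 0.0002  # placeholder rate, not a quoted price

client = CountingClient()
usage: Dict[str, int] = {}
with track_requests(client, 'training', usage):
    for text in ['a', 'b', 'c']:
        client.embed_text(text)
with track_requests(client, 'evaluation', usage):
    for text in ['d', 'e']:
        client.embed_text(text)

for stage, calls in usage.items():
    print(f'{stage}: {calls} calls, ~${calls * PRICE_PER_REQUEST_USD:.4f}')
```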

To make the evaluation more systematic, we'll implement a broader assessment inspired by MTEB (Massive Text Embedding Benchmark) practice. This includes stratified sampling to maintain class balance, multiple evaluation metrics, and systematic multilingual transfer analysis. The evaluation functions we'll define handle both single-language performance assessment and multilingual transfer matrix generation, which shows how well each language transfers to every other language in our dataset.

In [None]:
from sklearn.metrics import f1_score

def comprehensive_evaluation(texts, labels, test_size=0.2):
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=42, stratify=labels
    )
    
    eval_classifier = MultilingualClassifier(nova_embeddings, k_neighbors=3)
    eval_classifier.train(X_train, y_train)
    
    y_pred = eval_classifier.predict_batch(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    macro_f1 = f1_score(y_test, y_pred, average='macro')
    micro_f1 = f1_score(y_test, y_pred, average='micro')
    weighted_f1 = f1_score(y_test, y_pred, average='weighted')
    
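    # Fix the class order once so the confusion-matrix rows/columns and the
    # heatmap tick labels refer to the same intents (a bare set() has no
    # guaranteed order and could mislabel the axes).
    class_names = sorted(set(labels))
    report = classification_report(y_test, y_pred, output_dict=True)
    cm = confusion_matrix(y_test, y_pred, labels=class_names)
    
    return {
        'accuracy': accuracy,
        'macro_f1': macro_f1,
        'micro_f1': micro_f1,
        'weighted_f1': weighted_f1,
        'classification_report': report,
        'confusion_matrix': cm,
        'class_names': class_names
    }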

def cross_lingual_transfer_matrix(multilingual_data, embeddings_client):
    languages = list(multilingual_data.keys())
    results_matrix = np.zeros((len(languages), len(languages)))
    
    for i, train_lang in enumerate(languages):
        train_df = multilingual_data[train_lang]
        train_texts = train_df['utt'].tolist()
        train_labels = train_df['intent'].tolist()
        
        classifier = MultilingualClassifier(embeddings_client, k_neighbors=1)
        classifier.train(train_texts, train_labels)
        
        for j, test_lang in enumerate(languages):
            test_df = multilingual_data[test_lang]
            test_texts = test_df['utt'].tolist()
            test_labels = test_df['intent'].tolist()
            
            predictions = classifier.predict_batch(test_texts)
            accuracy = accuracy_score(test_labels, predictions)
            results_matrix[i, j] = accuracy
    
    return results_matrix, languages

print('Running comprehensive evaluation\n')

all_texts = []
all_labels = []

for lang, df in multilingual_data.items():
    for _, row in df.iterrows():
        all_texts.append(row['utt'])
        all_labels.append(row['intent'])

if len(set(all_labels)) > 1 and len(all_texts) > 6:
    print('Overall classification performance:')
    
    results = comprehensive_evaluation(all_texts, all_labels)
    
    print(f'Accuracy: {results["accuracy"]:.4f} ({results["accuracy"]*100:.2f}%)')
    print(f'Macro F1: {results["macro_f1"]:.4f}')
    print(f'Micro F1: {results["micro_f1"]:.4f}')
    print(f'Weighted F1: {results["weighted_f1"]:.4f}')
    
    print('\nPer-class performance:')
    for intent in results['class_names']:
        if intent in results['classification_report']:
            metrics = results['classification_report'][intent]
            print(f'{intent:15} | F1: {metrics["f1-score"]:.4f} | Precision: {metrics["precision"]:.4f} | Recall: {metrics["recall"]:.4f}')
    
    plt.figure(figsize=(8, 6))
    sns.heatmap(results['confusion_matrix'], 
                annot=True, fmt='d', cmap='Blues',
                xticklabels=results['class_names'],
                yticklabels=results['class_names'])
    plt.title('Classification Confusion Matrix')
    plt.ylabel('True Intent')
    plt.xlabel('Predicted Intent')
    plt.tight_layout()
    plt.show()

print('\nmultilingual transfer analysis:')

transfer_matrix, langs = cross_lingual_transfer_matrix(multilingual_data, nova_embeddings)

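print('Transfer accuracy matrix (train → test):')
# Use double quotes inside the f-string: reusing the outer quote character is a
# SyntaxError on Python versions before 3.12.
print(f'{"":12}', end='')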
for lang in langs:
    print(f'{lang:8}', end='')
print()

for i, train_lang in enumerate(langs):
    print(f'{train_lang:12}', end='')
    for j, test_lang in enumerate(langs):
        print(f'{transfer_matrix[i,j]:8.3f}', end='')
    print()

same_lang_acc = np.mean([transfer_matrix[i,i] for i in range(len(langs))])
cross_lang_acc = np.mean([transfer_matrix[i,j] for i in range(len(langs)) for j in range(len(langs)) if i != j])

print(f'\nSame-language accuracy: {same_lang_acc:.4f} ({same_lang_acc*100:.1f}%)')
print(f'Multilingual accuracy: {cross_lang_acc:.4f} ({cross_lang_acc*100:.1f}%)')
print(f'Transfer efficiency: {cross_lang_acc/same_lang_acc:.4f} ({cross_lang_acc/same_lang_acc*100:.1f}%)')

plt.figure(figsize=(8, 6))
sns.heatmap(transfer_matrix, 
            annot=True, fmt='.3f', cmap='RdYlBu_r',
            xticklabels=langs, yticklabels=langs,
            vmin=0, vmax=1)
plt.title('Multilingual Transfer Matrix')
plt.ylabel('Training Language')
plt.xlabel('Test Language')
plt.tight_layout()
plt.show()

print('Comprehensive evaluation complete')
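
The stratified split in `comprehensive_evaluation` only works when both sides of the split can hold at least one sample per class. As far as I understand scikit-learn's behaviour, a float `test_size` is turned into `ceil(test_size * n_samples)` test samples, and a stratified split raises a `ValueError` if either side has fewer samples than there are classes. The helper below is a hypothetical sketch of that arithmetic, useful for checking a configuration before spending API calls; it does not call scikit-learn.

```python
import math
from typing import Tuple


def stratified_split_sizes(n_samples: int, n_classes: int, test_size: float) -> Tuple[int, int, bool]:
    """Return (n_train, n_test, feasible) for a float test_size.

    Mirrors the size arithmetic described above; an assumption-based sketch,
    not scikit-learn's own code.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    feasible = n_test >= n_classes and n_train >= n_classes
    return n_train, n_test, feasible


# The 12-sample, 3-intent dataset used in this notebook.
print(stratified_split_sizes(12, 3, 0.2))  # (9, 3, True)
print(stratified_split_sizes(12, 3, 0.3))  # (8, 4, True)
# A 7-sample dataset passes the `len(all_texts) > 6` guard but cannot be split this way.
print(stratified_split_sizes(7, 3, 0.2))   # (5, 2, False)
```

Under these assumptions the sample data splits cleanly, but the `> 6` guard alone would not protect a 7-sample, 3-class dataset with `test_size=0.2`.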

## 12. Multilingual Utterance Grouping Analysis

Analyze utterance patterns across 10 languages to find the most common intents and expressions. When `USE_MASSIVE_DATASET` is `False`, the six additional languages are filled with mock English utterances tagged with the locale, so the counts illustrate the mechanics rather than real usage:
- **Load 10 Languages**: Expand to top 10 languages from MASSIVE dataset
- **Utterance Grouping**: Group equivalent utterances across languages by intent
- **Frequency Ranking**: Rank utterance groups by frequency across all languages
- **Multilingual Patterns**: Identify most common expressions globally

In [None]:
# Load 10 languages for comprehensive analysis
TOP_10_LANGUAGES = ['en-US', 'es-ES', 'fr-FR', 'de-DE', 'it-IT', 'pt-PT', 'nl-NL', 'pl-PL', 'ru-RU', 'ja-JP']

def load_expanded_multilingual_data(languages, max_samples_per_lang=200):
    """Load data for 10 languages"""
    if USE_MASSIVE_DATASET:
        print('üìä Loading MASSIVE dataset for 10 languages...')
        download_massive_dataset()
        return load_massive_data(languages, max_samples_per_lang)
    else:
        # Create expanded sample data for 10 languages
        expanded_data = {}
        base_intents = {
            'alarm_set': ['set alarm for 7 AM', 'wake me up at 7', 'alarm for tomorrow morning'],
            'weather_query': ['what is the weather', 'how is the weather today', 'weather forecast'],
            'music_play': ['play music', 'start some music', 'play my playlist'],
            'calendar_query': ['what is my schedule', 'show my calendar', 'any meetings today'],
            'timer_set': ['set timer for 5 minutes', 'start a timer', 'timer for cooking']
        }
        
        for lang in languages[:4]:  # Use existing 4 languages
            expanded_data[lang] = multilingual_data.get(lang, pd.DataFrame())
        
        # Add mock data for remaining languages
        for i, lang in enumerate(languages[4:], 1):
            samples = []
            for j, (intent, utterances) in enumerate(base_intents.items(), 1):
                samples.append({
                    'id': str(j), 'locale': lang, 'partition': 'train',
                    'intent': intent, 'scenario': intent.split('_')[0],
                    'utt': f'{utterances[0]} ({lang})'
                })
            expanded_data[lang] = pd.DataFrame(samples)
        
        return expanded_data

# Load expanded dataset
expanded_multilingual_data = load_expanded_multilingual_data(TOP_10_LANGUAGES)

print(f'\nüìà Expanded Dataset Summary:')
total_samples = 0
for lang, df in expanded_multilingual_data.items():
    print(f'  {lang}: {len(df)} samples')
    total_samples += len(df)
print(f'  Total: {total_samples} samples across {len(expanded_multilingual_data)} languages')

In [None]:
# Analyze utterance groupings by intent across all languages
print('üîç Utterance Grouping Analysis Across 10 Languages\n')
print('=' * 70)

# Group utterances by intent across all languages
intent_groups = {}
intent_frequencies = {}

for lang, df in expanded_multilingual_data.items():
    for _, row in df.iterrows():
        intent = row['intent']
        utterance = row['utt']
        
        # Group by intent
        if intent not in intent_groups:
            intent_groups[intent] = {}
            intent_frequencies[intent] = 0
        
        if lang not in intent_groups[intent]:
            intent_groups[intent][lang] = []
        
        intent_groups[intent][lang].append(utterance)
        intent_frequencies[intent] += 1

# Rank intents by frequency
ranked_intents = sorted(intent_frequencies.items(), key=lambda x: x[1], reverse=True)

print('üèÜ Intent Ranking by Frequency Across All Languages:')
print(f'{'Rank':<4} {'Intent':<20} {'Total Count':<12} {'Languages':<10}')
print('-' * 50)

for rank, (intent, count) in enumerate(ranked_intents, 1):
    lang_count = len(intent_groups[intent])
    print(f'{rank:<4} {intent:<20} {count:<12} {lang_count:<10}')

print(f'\nüìä Detailed Utterance Groups (Top 3 Most Common Intents):')
print('=' * 70)

# Show detailed breakdown for top 3 intents
for rank, (intent, count) in enumerate(ranked_intents[:3], 1):
    print(f'\nüéØ #{rank} Intent: {intent} (Total: {count} utterances across {len(intent_groups[intent])} languages)')
    print('-' * 60)
    
    for lang in TOP_10_LANGUAGES:
        if lang in intent_groups[intent]:
            utterances = intent_groups[intent][lang]
            print(f'  {lang}: {len(utterances)} utterances')
            for utt in utterances[:2]:  # Show first 2 utterances per language
                print(f'    • "{utt}"')
            if len(utterances) > 2:
                print(f'    ... and {len(utterances)-2} more')
        else:
            print(f'  {lang}: No data')

print('\n✅ Utterance grouping analysis complete!')
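
The same grouping can be expressed more compactly with pandas. The sketch below assumes each per-language DataFrame has `utt` and `intent` columns, as in the loop above. The `toy_data` dictionary and its utterances are made up for illustration and stand in for `expanded_multilingual_data`.

```python
import pandas as pd

# Hypothetical stand-in for expanded_multilingual_data: {language: DataFrame}
toy_data = {
    "en-US": pd.DataFrame({
        "utt": ["wake me up at nine", "set an alarm", "play some jazz"],
        "intent": ["alarm_set", "alarm_set", "play_music"],
    }),
    "es-ES": pd.DataFrame({
        "utt": ["pon una alarma", "pon música"],
        "intent": ["alarm_set", "play_music"],
    }),
}

# Stack the per-language frames into one long frame with a 'lang' column
combined = pd.concat(
    [df.assign(lang=lang) for lang, df in toy_data.items()],
    ignore_index=True,
)

# One row per intent: total utterances and number of languages that cover it
intent_summary = (
    combined.groupby("intent")
    .agg(total_count=("utt", "size"), languages=("lang", "nunique"))
    .sort_values("total_count", ascending=False)
)
print(intent_summary)
```

Keeping the data in one long frame also makes later steps, such as per-language filtering or stratified sampling, a single `groupby` or boolean mask away.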

In [None]:
# Visualize intent distribution across languages
import matplotlib.pyplot as plt
import seaborn as sns

# Create intent-language matrix for visualization
intent_lang_matrix = []
intent_names = []
lang_names = TOP_10_LANGUAGES

for intent, lang_data in intent_groups.items():
    intent_names.append(intent)
    row = []
    for lang in lang_names:
        count = len(lang_data.get(lang, []))
        row.append(count)
    intent_lang_matrix.append(row)

# Convert to numpy array for plotting
matrix = np.array(intent_lang_matrix)

# Create heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(matrix, 
            annot=True, fmt='d', cmap='YlOrRd',
            xticklabels=lang_names,
            yticklabels=intent_names,
            cbar_kws={'label': 'Number of Utterances'})
plt.title('Intent Distribution Across 10 Languages\n(Utterance Count per Intent-Language Pair)')
plt.xlabel('Languages')
plt.ylabel('Intents')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Summary statistics
print('\n📈 Multilingual Intent Statistics:')
print(f'  🎯 Total Unique Intents: {len(intent_groups)}')
print(f'  🌍 Languages Analyzed: {len(TOP_10_LANGUAGES)}')
print(f'  📝 Total Utterances: {sum(intent_frequencies.values())}')
print(f'  📊 Average Utterances per Intent: {sum(intent_frequencies.values())/len(intent_groups):.1f}')
print(f'  🏆 Most Common Intent: {ranked_intents[0][0]} ({ranked_intents[0][1]} utterances)')
print(f'  📉 Least Common Intent: {ranked_intents[-1][0]} ({ranked_intents[-1][1]} utterances)')
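
The intent-by-language count matrix can also be built with `pandas.crosstab` instead of nested loops. This is a minimal sketch on made-up rows. The `languages` list is a hypothetical stand-in for `TOP_10_LANGUAGES`, and the `reindex` call keeps a zero-filled column for any language that has no data.

```python
import pandas as pd

# Hypothetical long-format data: one row per utterance
rows = pd.DataFrame({
    "intent": ["alarm_set", "alarm_set", "play_music", "alarm_set", "play_music"],
    "lang":   ["en-US", "en-US", "en-US", "es-ES", "es-ES"],
})

# Hypothetical stand-in for TOP_10_LANGUAGES; 'fr-FR' has no rows on purpose
languages = ["en-US", "es-ES", "fr-FR"]

# Count utterances per (intent, language) pair and keep a zero column
# for any language without data
counts = pd.crosstab(rows["intent"], rows["lang"]).reindex(
    columns=languages, fill_value=0
)
print(counts)
```

Because `counts` is a labeled DataFrame, it can be passed straight to `sns.heatmap(counts, annot=True, fmt='d')`, which takes the tick labels from the index and columns.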

Finally, we summarize the system's performance and outline practical considerations for production deployment. The summary covers key performance metrics, a cost estimate, and recommendations for scaling, which together help you judge whether the current implementation meets your requirements and what would need to change for production use. The cost estimate matters because every embedding requires an API call, so the request count drives both budgeting and optimization decisions.
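
To make the cost arithmetic concrete, the sketch below estimates spend from a list of input texts and shows how caching would lower it when inputs repeat. The per-request rate is a placeholder assumption, not an official price, so check current Amazon Bedrock pricing before relying on it. The function and variable names are hypothetical.

```python
from typing import Dict, List

# Placeholder rate (USD per embedding request); an assumption for illustration,
# not an official price
ASSUMED_COST_PER_REQUEST_USD = 0.0002


def estimate_embedding_cost(texts: List[str], cost_per_request: float) -> Dict[str, float]:
    """Estimate cost with one request per text vs. one request per unique text."""
    total_requests = len(texts)
    unique_requests = len(set(texts))  # what a perfect cache would send
    return {
        "requests_without_cache": total_requests,
        "requests_with_cache": unique_requests,
        "cost_without_cache": total_requests * cost_per_request,
        "cost_with_cache": unique_requests * cost_per_request,
    }


sample_texts = [
    "cancel my order",
    "cancelar mi pedido",
    "cancel my order",      # repeated input: a cache would skip this call
    "track my package",
]
estimate = estimate_embedding_cost(sample_texts, ASSUMED_COST_PER_REQUEST_USD)
for key, value in estimate.items():
    print(f"{key}: {value}")
```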

In [None]:
print('System summary\n')

print('Performance metrics:')
if 'results' in locals():
    print(f'Overall accuracy: {results["accuracy"]*100:.2f}%')
    print(f'Macro F1: {results["macro_f1"]*100:.2f}%')
    print(f'Weighted F1: {results["weighted_f1"]*100:.2f}%')

print('\nMultilingual performance:')
print(f'Same-language: {same_lang_acc*100:.2f}%')
print(f'Cross-lingual: {cross_lang_acc*100:.2f}%')
if same_lang_acc > 0:  # avoid dividing by zero
    print(f'Transfer efficiency: {cross_lang_acc/same_lang_acc*100:.1f}%')

# Per-request rate assumed by this notebook; check current Amazon Bedrock pricing
COST_PER_REQUEST_USD = 0.0002
estimated_cost = nova_embeddings.request_count * COST_PER_REQUEST_USD

print('\nCost analysis (estimate):')
print(f'Total API calls: {nova_embeddings.request_count}')
print(f'Estimated total cost: ${estimated_cost:.4f}')
print(f'Estimated cost per sample: ${estimated_cost / len(all_texts):.6f}')

print(f'\nSystem specifications:')
print(f'Model: {NOVA_MODEL_ID}')
print(f'Embedding dimension: {EMBEDDING_DIMENSION}d')
print(f'Intent classes: {len(set(all_labels))}')
print(f'Languages: {len(multilingual_data)}')
print(f'Training samples: {len(all_texts)}')

print('\nProduction considerations:')
print('Dataset expansion: Scale to full MASSIVE dataset for comprehensive coverage')
print('Embedding caching: Store embeddings to reduce API calls for repeated inference')
print('Batch processing: Process multiple requests together to improve throughput')
print('Rate limiting: Implement proper rate limiting for Bedrock API calls')
print('Architecture: Deploy with API Gateway and Lambda for real-time classification')
print('Monitoring: Use CloudWatch for API usage and performance tracking')
print('Optimization: Tune k-neighbors parameter based on dataset size')
print('Confidence thresholds: Implement thresholds for uncertain predictions')

print('\nMultilingual intent classification system complete')
print('The system demonstrates effective multilingual understanding through semantic embeddings,')
print('providing a foundation for multilingual intent classification in production environments.')
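
One of the considerations listed above, confidence thresholds, can be sketched with scikit-learn's `KNeighborsClassifier`. The example uses tiny made-up 2-D vectors in place of real Nova embeddings. The threshold value, the `"unknown"` fallback label, and the helper name are illustrative assumptions that would need tuning on validation data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D vectors standing in for real embeddings
X_train = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],   # "cancel_order"
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],   # "track_order"
])
y_train = np.array(["cancel_order"] * 3 + ["track_order"] * 3)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, y_train)


def predict_with_threshold(model, X, threshold=0.7, fallback="unknown"):
    """Return the top class, or `fallback` when the neighbor vote share is below `threshold`."""
    proba = model.predict_proba(X)            # share of the k neighbors voting for each class
    best = proba.argmax(axis=1)               # index of the most-voted class per query
    confident = proba.max(axis=1) >= threshold
    labels = model.classes_[best]
    return [str(label) if ok else fallback for label, ok in zip(labels, confident)]


queries = np.array([
    [1.0, 0.05],   # clearly near the "cancel_order" examples
    [0.5, 0.5],    # ambiguous: equally close to both groups
])
predictions = predict_with_threshold(knn, queries)
print(predictions)
```

With uniform weights the vote share can only take values in steps of 1/k, so a larger `n_neighbors` gives the threshold finer granularity at the cost of pulling in more distant examples.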