# Lab 09: Embeddings and Semantic Search

**Course:** Generative AI for Banking Sector  
**Institution:** Banco Nacional de Costa Rica (BNCR)  
**Instructor:** Manuela Larrea  
**Duration:** 3 hours

---

## Learning Objectives

By the end of this lab, you will be able to:

1. Understand what embeddings are and how they work
2. Use the Azure OpenAI embeddings API
3. Implement vector similarity search
4. Build a document search system for banking
5. Create semantic search for banking FAQs
6. Implement a product recommendation system
7. Build a document similarity finder

---

## Azure Infrastructure

```
╔══════════════════════════════════════════════════════════════════════════╗
║                  LAB 09 - AZURE INFRASTRUCTURE                           ║
╚══════════════════════════════════════════════════════════════════════════╝

┌────────────────────────────────────────────────────────────────────────┐
│                        YOU (Jupyter Notebook)                          │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │  Banking Documents & FAQs                                        │ │
│  │  • Product descriptions                                          │ │
│  │  • Customer FAQs                                                 │ │
│  │  • Policy documents                                              │ │
│  └──────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬───────────────────────────────────────┘
                                 │
                                 │ Text to Embed
                                 ▼
                    ┌────────────────────────────┐
                    │   Azure OpenAI Service     │
                    │                            │
                    │  ┌──────────────────────┐  │
                    │  │  text-embedding-     │  │
                    │  │  ada-002             │  │
                    │  └──────────────────────┘  │
                    └────────────────────────────┘
                                 │
                                 │ Vector (1536 dimensions)
                                 ▼
┌────────────────────────────────────────────────────────────────────────┐
│                     Vector Database (In-Memory)                        │
│                     • Store embeddings                                 │
│                     • Compute similarity                               │
│                     • Return ranked results                            │
└────────────────────────────────────────────────────────────────────────┘

📊 Resources Used:
  • Azure OpenAI Service (text-embedding-ada-002)
  • In-memory vector storage (NumPy)

💰 Estimated Cost: ~$0.05 per lab session
```

In [None]:
import os
import sys
from openai import AzureOpenAI
from dotenv import load_dotenv
import json
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple

sys.path.append('../../utils')
from azure_openai_helper import AzureOpenAIClient

load_dotenv()
client = AzureOpenAIClient()

print("✓ Setup complete")

## Introduction

In this lab, we will explore **Embeddings and Semantic Search** and their applications in the banking sector.

### What are Embeddings?

**Embeddings** are numerical representations of text that capture semantic meaning. They convert words, sentences, or documents into vectors (arrays of numbers) in a high-dimensional space.

**Key Properties:**
- Similar texts have similar embeddings
- Embeddings capture semantic relationships
- They enable mathematical operations on text
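
To make these properties concrete, here is a toy illustration. The 3-dimensional vectors below are invented by hand for this sketch and are not real model outputs (real `text-embedding-ada-002` vectors have 1536 dimensions); the helper name `euclidean_distance` is also just an illustrative choice.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_vectors = {
    "cuenta de ahorros": np.array([0.9, 0.1, 0.0]),
    "cuenta corriente": np.array([0.8, 0.2, 0.1]),
    "préstamo hipotecario": np.array([0.1, 0.9, 0.2]),
}

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two vectors (smaller = closer)."""
    return float(np.linalg.norm(a - b))

# Two account products sit close together; an account and a loan sit far apart.
d_accounts = euclidean_distance(
    toy_vectors["cuenta de ahorros"], toy_vectors["cuenta corriente"]
)
d_account_loan = euclidean_distance(
    toy_vectors["cuenta de ahorros"], toy_vectors["préstamo hipotecario"]
)

print(f"account vs account: {d_accounts:.3f}")
print(f"account vs loan:    {d_account_loan:.3f}")
```

Because texts become arrays of numbers, ordinary vector arithmetic (differences, norms, averages) applies to them directly.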

**Banking Use Cases:**
- FAQ search systems
- Product recommendations
- Document similarity
- Customer query routing
- Duplicate detection

This is a critical component for building production-ready GenAI systems at BNCR.

## Part 1: Understanding Embeddings

Let's start by generating embeddings for simple banking terms and visualizing their relationships.

In [None]:
# Generate embeddings for banking terms
banking_terms = [
    "cuenta de ahorros",
    "cuenta corriente",
    "préstamo personal",
    "préstamo hipotecario",
    "tarjeta de crédito",
    "tarjeta de débito"
]

print("Generando embeddings para términos bancarios...\n")

embeddings = []
for term in banking_terms:
    response = client.embeddings.create(
        input=term
    )
    embedding = response.data[0].embedding
    embeddings.append(embedding)
    print(f"✓ {term}: vector de {len(embedding)} dimensiones")

print(f"\n✓ Total de embeddings generados: {len(embeddings)}")

### Visualizing Embedding Dimensions

Let's look at the first few dimensions of an embedding:

In [None]:
# Show first 10 dimensions of the first embedding
print("Primeras 10 dimensiones del embedding de 'cuenta de ahorros':")
print(embeddings[0][:10])
print(f"\nTotal de dimensiones: {len(embeddings[0])}")
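
Raw numbers are hard to interpret, so one common way to look at relationships is to project the vectors down to two dimensions. The following is a minimal sketch using PCA computed with NumPy's SVD; the function name `project_to_2d` and the random stand-in data are assumptions made for this example, not part of the lab's helper code.

```python
import numpy as np

def project_to_2d(vectors) -> np.ndarray:
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    # PCA requires centered data: subtract the mean of each dimension.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Stand-in data: 6 random vectors with the same width as ada-002 embeddings.
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(6, 1536))

coords = project_to_2d(demo_vectors)
print(coords.shape)  # (6, 2)
```

In the notebook you could call `project_to_2d(embeddings)` and draw the resulting points with a plotting library of your choice, labeling each point with its entry from `banking_terms`.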

## Part 2: Cosine Similarity

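To compare embeddings, we use **cosine similarity**, which measures the cosine of the angle between two vectors and ignores their lengths.

**Cosine Similarity:**
- Range: -1 to 1
- 1 = same direction (most similar)
- 0 = orthogonal (unrelated)
- -1 = opposite direction

In practice, scores between real text embeddings tend to fall in a much narrower band than the full range, so rank results by relative score rather than reading a single score in isolation.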
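
The three landmark values follow directly from the formula cos(θ) = (a · b) / (‖a‖ ‖b‖). Here is a small check with hand-picked 2-dimensional vectors; the helper name `cosine` is chosen for this sketch so it does not clash with the function defined in the next cell.

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction, different length: length does not affect the result.
same_direction = cosine([1, 2], [2, 4])
# Perpendicular vectors: the dot product is zero.
orthogonal = cosine([1, 0], [0, 1])
# Exactly opposite directions.
opposite = cosine([1, 2], [-1, -2])

print(f"same direction: {same_direction:.2f}")
print(f"orthogonal:     {orthogonal:.2f}")
print(f"opposite:       {opposite:.2f}")
```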

In [None]:
def cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    """Calculate cosine similarity between two vectors"""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    
    # A zero-length vector has no direction; return 0.0 instead of dividing by zero
    if norm1 == 0 or norm2 == 0:
        return 0.0
    
    return float(dot_product / (norm1 * norm2))

# Test the function
print("Similitudes entre términos bancarios:\n")

# Compare "cuenta de ahorros" with others
base_term = "cuenta de ahorros"
base_idx = 0

for i, term in enumerate(banking_terms):
    if i != base_idx:
        similarity = cosine_similarity(embeddings[base_idx], embeddings[i])
        print(f"{base_term} ↔ {term}: {similarity:.4f}")

### 💡 Observation

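The cell above compares only "cuenta de ahorros" with the other terms. In its output you should typically see that:
- "cuenta de ahorros" is most similar to "cuenta corriente" (both are accounts)

To check the other groups, set `base_idx` to 2 or 4 and rerun the cell. You would then expect that:
- "préstamo personal" is most similar to "préstamo hipotecario" (both are loans)
- "tarjeta de crédito" is most similar to "tarjeta de débito" (both are cards)

This demonstrates that embeddings capture semantic relationships!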
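
Comparing every pair at once makes observations like these easier to verify. The sketch below computes a full similarity matrix with one matrix product; `pairwise_cosine`, `most_similar_other`, and the 2-dimensional toy vectors are assumptions introduced for illustration.

```python
import numpy as np

def pairwise_cosine(vectors) -> np.ndarray:
    """Return an (n, n) matrix of cosine similarities between row vectors."""
    X = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; dot products then equal cosines.
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_unit @ X_unit.T

def most_similar_other(sim_matrix: np.ndarray) -> np.ndarray:
    """For each row, index of the most similar row other than itself."""
    S = sim_matrix.copy()
    np.fill_diagonal(S, -np.inf)  # exclude self-matches
    return S.argmax(axis=1)

# Toy vectors: rows 0 and 1 point one way, rows 2 and 3 another.
toy = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
sims = pairwise_cosine(toy)
nearest = most_similar_other(sims)
print(nearest.tolist())  # [1, 0, 3, 2]
```

In the notebook, `pairwise_cosine(embeddings)` followed by `most_similar_other(...)` would give, for each entry of `banking_terms`, the index of its closest neighbor.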

## Part 3: Building a Semantic Search System

Now let's build a complete semantic search system for banking FAQs.

In [None]:
class SemanticSearchEngine:
    """Simple semantic search engine using embeddings"""
    
    def __init__(self, client: AzureOpenAIClient):
        self.client = client
        self.documents = []
        self.embeddings = []
    
    def add_documents(self, documents: List[str]):
        """Add documents to the search index"""
        print(f"Indexando {len(documents)} documentos...")
        
        for doc in documents:
            # Generate embedding
            response = self.client.embeddings.create(input=doc)
            embedding = response.data[0].embedding
            
            # Store document and embedding
            self.documents.append(doc)
            self.embeddings.append(embedding)
        
        print(f"✓ {len(documents)} documentos indexados")
    
    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        """Search for most similar documents"""
        # Generate query embedding
        response = self.client.embeddings.create(input=query)
        query_embedding = response.data[0].embedding
        
        # Calculate similarities
        similarities = []
        for i, doc_embedding in enumerate(self.embeddings):
            similarity = cosine_similarity(query_embedding, doc_embedding)
            similarities.append((self.documents[i], similarity))
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        return similarities[:top_k]

print("✓ SemanticSearchEngine class defined")
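
The `search` method above loops over documents in Python, which is fine for a few dozen FAQs. For larger indexes, the same ranking can be computed with array operations. This is a minimal sketch, assuming the embeddings are stacked into one NumPy matrix; `top_k_indices` and the toy data are hypothetical names and values, not part of the class.

```python
import numpy as np
from typing import List, Tuple

def top_k_indices(query_vec, doc_matrix, k: int = 3) -> List[Tuple[int, float]]:
    """Rank documents by cosine similarity to the query and return the top k."""
    q = np.asarray(query_vec, dtype=float)
    D = np.asarray(doc_matrix, dtype=float)
    # One matrix-vector product gives all dot products at once.
    scores = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    # Negate so argsort returns the highest scores first.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy index of three 2-dimensional "documents" and one query.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k_indices([1.0, 0.1], docs, k=2)
print([idx for idx, _ in ranked])  # [0, 2]
```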

### Testing the Search Engine with Banking FAQs

In [None]:
# Create search engine
search_engine = SemanticSearchEngine(client)

# Banking FAQs
faqs = [
    "¿Cómo puedo abrir una cuenta de ahorros en el BNCR?",
    "¿Cuál es la tasa de interés para préstamos personales?",
    "¿Cómo solicito una tarjeta de crédito?",
    "¿Dónde puedo consultar el saldo de mi cuenta?",
    "¿Cómo hago una transferencia bancaria?",
    "¿Qué documentos necesito para solicitar un préstamo hipotecario?",
    "¿Cuál es el horario de atención de las sucursales?",
    "¿Cómo activo mi tarjeta de débito?",
    "¿Puedo hacer pagos de servicios en línea?",
    "¿Cómo cambio mi PIN de la tarjeta?",
    "¿Qué hago si pierdo mi tarjeta?",
    "¿Cómo puedo aumentar el límite de mi tarjeta de crédito?",
    "¿Cuál es el monto mínimo para abrir un certificado de depósito?",
    "¿Cómo puedo descargar mi estado de cuenta?",
    "¿El BNCR tiene una aplicación móvil?"
]

# Index FAQs
search_engine.add_documents(faqs)

In [None]:
# Test searches
test_queries = [
    "¿Cómo veo mi dinero disponible?",
    "Perdí mi tarjeta, ¿qué hago?",
    "Quiero comprar una casa, ¿qué necesito?",
    "¿Tienen app para celular?"
]

for query in test_queries:
    print(f"\n{'='*80}")
    print(f"🔍 Consulta: '{query}'")
    print(f"{'='*80}\n")
    
    results = search_engine.search(query, top_k=3)
    
    for i, (doc, score) in enumerate(results, 1):
        print(f"{i}. [{score:.4f}] {doc}")

## Part 4: Product Recommendation System

Let's build a system that recommends banking products based on customer queries.

In [None]:
# Load banking products
products_df = pd.read_csv('../../data/banking_products.csv')
print("Productos bancarios disponibles:")
print(products_df.head())
print(f"\nTotal de productos: {len(products_df)}")

In [None]:
# Create product descriptions for embedding
product_descriptions = []
for _, row in products_df.iterrows():
    description = f"{row['product_name']}: {row['description']}. {row['features']}"
    product_descriptions.append(description)

# Create product search engine
product_search = SemanticSearchEngine(client)
product_search.add_documents(product_descriptions)

In [None]:
# Test product recommendations
customer_queries = [
    "Necesito guardar dinero para emergencias",
    "Quiero comprar una casa",
    "Necesito dinero rápido para pagar una deuda",
    "Quiero invertir mi dinero a largo plazo"
]

print("\n" + "="*80)
print("SISTEMA DE RECOMENDACIÓN DE PRODUCTOS")
print("="*80)

for query in customer_queries:
    print(f"\n👤 Cliente: '{query}'")
    print("\n💡 Productos recomendados:\n")
    
    results = product_search.search(query, top_k=2)
    
    for i, (product, score) in enumerate(results, 1):
        # Extract product name (before the colon)
        product_name = product.split(':')[0]
        print(f"  {i}. {product_name} (relevancia: {score:.2%})")
    print()

## Part 5: Document Similarity Finder

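A document similarity finder compares every pair of documents and reports the pairs whose similarity meets a threshold. This is useful for: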
- Finding duplicate customer inquiries
- Grouping similar support tickets
- Detecting related transactions

In [None]:
def find_similar_documents(documents: List[str], threshold: float = 0.85) -> List[Tuple[int, int, float]]:
    """Find pairs of similar documents above threshold"""
    
    print(f"Analizando {len(documents)} documentos...\n")
    
    # Generate embeddings
    embeddings = []
    for doc in documents:
        response = client.embeddings.create(input=doc)
        embeddings.append(response.data[0].embedding)
    
    # Find similar pairs
    similar_pairs = []
    
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            similarity = cosine_similarity(embeddings[i], embeddings[j])
            
            if similarity >= threshold:
                similar_pairs.append((i, j, similarity))
    
    return similar_pairs

print("✓ find_similar_documents function defined")
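
The function returns pairs, but grouping tickets usually calls for clusters. One way to get there is to merge pairs that share a document, using a small union-find structure. The sketch below assumes the `(i, j, score)` tuple format returned above; `group_similar` and the example scores are invented for illustration.

```python
from typing import Dict, List, Tuple

def group_similar(n_docs: int, pairs: List[Tuple[int, int, float]]) -> List[List[int]]:
    """Merge similar pairs into groups of document indices (union-find)."""
    parent = list(range(n_docs))

    def find(x: int) -> int:
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, _ in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups: Dict[int, List[int]] = {}
    for idx in range(n_docs):
        groups.setdefault(find(idx), []).append(idx)
    # Keep only groups with at least two members.
    return [members for members in groups.values() if len(members) > 1]

# Made-up pairs: docs 0, 1, 7 chain together; docs 2 and 5 form a second group.
example_pairs = [(0, 1, 0.91), (1, 7, 0.88), (2, 5, 0.90)]
groups = group_similar(8, example_pairs)
print(groups)  # [[0, 1, 7], [2, 5]]
```

Note that this merging is transitive: documents 0 and 7 end up in the same group even though they were never compared directly, which may or may not suit your use case.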

In [None]:
# Test with customer inquiries
customer_inquiries = [
    "No puedo acceder a mi cuenta en línea",
    "Mi banca en línea no funciona",
    "¿Cuál es la tasa de interés actual?",
    "Necesito ayuda con mi contraseña",
    "Olvidé mi clave de acceso",
    "¿Cuánto cobran de interés?",
    "Quiero abrir una cuenta nueva",
    "No puedo iniciar sesión en el portal"
]

similar_pairs = find_similar_documents(customer_inquiries, threshold=0.80)

print("\n" + "="*80)
print("DOCUMENTOS SIMILARES DETECTADOS")
print("="*80 + "\n")

if similar_pairs:
    for i, j, score in similar_pairs:
        print(f"Similitud: {score:.2%}")
        print(f"  📄 Doc {i+1}: {customer_inquiries[i]}")
        print(f"  📄 Doc {j+1}: {customer_inquiries[j]}")
        print()
else:
    print("No se encontraron documentos similares con el umbral especificado.")
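
The threshold of 0.80 used here is a starting point, not a universal value. If you have a handful of pairs labeled by hand as duplicate or not, you can pick the threshold empirically. This is a minimal sketch that sweeps candidate thresholds and keeps the one with the best F1 score; the function `best_threshold`, the scores, and the labels are all made up for the example.

```python
import numpy as np
from typing import Optional, Sequence, Tuple

def best_threshold(
    scores: Sequence[float], labels: Sequence[bool], candidates: Sequence[float]
) -> Tuple[Optional[float], float]:
    """Return the (threshold, F1) pair with the highest F1 among candidates."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    best: Tuple[Optional[float], float] = (None, -1.0)
    for t in candidates:
        predicted = s >= t
        tp = int(np.sum(predicted & y))    # duplicates correctly flagged
        fp = int(np.sum(predicted & ~y))   # non-duplicates wrongly flagged
        fn = int(np.sum(~predicted & y))   # duplicates missed
        f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
        if f1 > best[1]:
            best = (float(t), float(f1))
    return best

# Invented similarity scores with hand-assigned "is duplicate" labels.
example_scores = [0.92, 0.88, 0.83, 0.81, 0.79, 0.76]
example_labels = [True, True, True, False, False, False]

threshold, f1 = best_threshold(
    example_scores, example_labels, candidates=[0.75, 0.80, 0.85, 0.90]
)
print(f"best threshold: {threshold:.2f} (F1 = {f1:.3f})")
```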

## 🎯 Practical Exercises

Now it's your turn to practice!

### Exercise 1: Build a Multi-Language FAQ Search

Create a search engine that can find relevant FAQs even when the query is in a different language.

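**Hint:** Embedding models trained on multilingual text tend to place a sentence and its translation close together, so cross-language search often works. Expect somewhat lower scores than for same-language matches.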

In [None]:
# Exercise 1: Your code here

# FAQs in Spanish
spanish_faqs = [
    "¿Cómo abrir una cuenta de ahorros?",
    "¿Cuál es la tasa de interés?",
    "¿Cómo solicitar una tarjeta de crédito?"
]

# TODO: Create a search engine with Spanish FAQs
# TODO: Test with English queries like "How to open a savings account?"
# TODO: Observe that it still finds the correct Spanish FAQ

# Your code here:


### Exercise 2: Customer Intent Classification

Build a system that classifies customer queries into categories using embeddings.

**Categories:**
- Account Management
- Loans
- Cards
- Technical Support
- General Information

In [None]:
# Exercise 2: Your code here

# Define category examples
categories = {
    "Account Management": "Quiero abrir una cuenta de ahorros",
    "Loans": "Necesito un préstamo personal",
    "Cards": "¿Cómo solicito una tarjeta de crédito?",
    "Technical Support": "No puedo acceder a mi cuenta en línea",
    "General Information": "¿Cuál es el horario de atención?"
}

# Test queries
test_queries = [
    "Mi banca en línea no funciona",
    "Quiero solicitar un crédito hipotecario",
    "¿Cómo cierro mi cuenta?"
]

# TODO: Generate embeddings for category examples
# TODO: For each test query, find the most similar category
# TODO: Print the classification results

# Your code here:


### Exercise 3: Smart Document Routing

Build a system that routes customer inquiries to the appropriate department based on semantic similarity.

**Departments:**
- Customer Service
- Loans Department
- Credit Cards
- IT Support
- Fraud Prevention

In [None]:
# Exercise 3: Your code here

# Define department descriptions
departments = {
    "Customer Service": "Atención general al cliente, consultas sobre productos y servicios bancarios",
    "Loans Department": "Solicitudes de préstamos personales, hipotecarios y empresariales",
    "Credit Cards": "Tarjetas de crédito, límites, pagos y beneficios",
    "IT Support": "Problemas técnicos con banca en línea, aplicación móvil y acceso a cuentas",
    "Fraud Prevention": "Transacciones sospechosas, tarjetas perdidas o robadas, seguridad"
}

# Test inquiries
inquiries = [
    "Creo que alguien está usando mi tarjeta sin autorización",
    "No puedo iniciar sesión en la app",
    "Quiero solicitar un préstamo para comprar un carro",
    "¿Cuál es el límite de mi tarjeta de crédito?"
]

# TODO: Create embeddings for department descriptions
# TODO: For each inquiry, find the best matching department
# TODO: Print routing recommendations

# Your code here:


## Part 6: Best Practices for Production

### 1. Caching Embeddings

Don't regenerate embeddings for the same text repeatedly!

In [None]:
import hashlib
import pickle
from pathlib import Path

class CachedEmbeddingClient:
    """Embedding client with caching"""
    
    def __init__(self, client: AzureOpenAIClient, cache_dir: str = "./embedding_cache"):
        self.client = client
        self.cache_dir = Path(cache_dir)
        # parents=True also creates missing parent folders for nested cache paths
        self.cache_dir.mkdir(parents=True, exist_ok=True)
    
    def _get_cache_key(self, text: str) -> str:
        """Generate cache key from text"""
        return hashlib.md5(text.encode()).hexdigest()
    
    def get_embedding(self, text: str) -> List[float]:
        """Get embedding with caching"""
        cache_key = self._get_cache_key(text)
        cache_file = self.cache_dir / f"{cache_key}.pkl"
        
        # Check cache
        if cache_file.exists():
            with open(cache_file, 'rb') as f:
                return pickle.load(f)
        
        # Generate embedding
        response = self.client.embeddings.create(input=text)
        embedding = response.data[0].embedding
        
        # Save to cache
        with open(cache_file, 'wb') as f:
            pickle.dump(embedding, f)
        
        return embedding

print("✓ CachedEmbeddingClient class defined")
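
To see the effect of caching without calling the API, the sketch below wraps a stand-in embedding function and counts hits and misses. It also includes the model name in the cache key, so vectors from different models are not mixed. `InMemoryEmbeddingCache` and `fake_embed` are hypothetical names for this example; the real lab class above stores files on disk instead.

```python
import hashlib
from typing import Callable, Dict, List

class InMemoryEmbeddingCache:
    """Dictionary-backed cache around any text -> vector function."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model_name: str):
        self.embed_fn = embed_fn
        self.model_name = model_name
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model name so a model change invalidates old entries.
        payload = f"{self.model_name}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self.embed_fn(text)
        self._store[key] = vector
        return vector

def fake_embed(text: str) -> List[float]:
    """Stand-in for an API call; returns a tiny made-up vector."""
    return [float(len(text)), float(text.count(" "))]

cache = InMemoryEmbeddingCache(fake_embed, model_name="text-embedding-ada-002")
cache.get("cuenta de ahorros")   # miss: computed
cache.get("cuenta de ahorros")   # hit: served from the cache
cache.get("cuenta corriente")    # miss: new text
print(cache.hits, cache.misses)  # 1 2
```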

### 2. Batch Processing

Process multiple texts in a single API call when possible:

In [None]:
def batch_embed(texts: List[str], client: AzureOpenAIClient, batch_size: int = 16) -> List[List[float]]:
    """Generate embeddings in batches"""
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        
        response = client.embeddings.create(input=batch)
        
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    
    return all_embeddings

# Test batch processing
test_texts = [f"Documento de prueba {i}" for i in range(5)]
batch_embeddings = batch_embed(test_texts, client)
print(f"✓ Generados {len(batch_embeddings)} embeddings en batch")
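
The slicing logic inside `batch_embed` can be isolated and checked without any API call. The following sketch shows how a list is split into batches, including the shorter final batch; the generator name `chunked` is an assumption for this example.

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

sample_texts = [f"doc {i}" for i in range(5)]
batch_sizes = [len(batch) for batch in chunked(sample_texts, batch_size=2)]
print(batch_sizes)  # [2, 2, 1]
```

Five texts with a batch size of 2 need three requests instead of five, and the saving grows with the batch size your deployment allows.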

### 3. Monitoring and Logging

In [None]:
import time
from datetime import datetime

class MonitoredSearchEngine(SemanticSearchEngine):
    """Search engine with monitoring"""
    
    def __init__(self, client: AzureOpenAIClient):
        super().__init__(client)
        self.search_log = []
    
    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        """Search with logging"""
        start_time = time.time()
        
        results = super().search(query, top_k)
        
        elapsed_time = time.time() - start_time
        
        # Log search
        self.search_log.append({
            'timestamp': datetime.now().isoformat(),
            'query': query,
            'top_result_score': results[0][1] if results else 0,
            'elapsed_time': elapsed_time
        })
        
        return results
    
    def get_stats(self) -> Dict:
        """Get search statistics"""
        if not self.search_log:
            return {}
        
        avg_time = np.mean([log['elapsed_time'] for log in self.search_log])
        avg_score = np.mean([log['top_result_score'] for log in self.search_log])
        
        return {
            'total_searches': len(self.search_log),
            'avg_response_time': f"{avg_time:.3f}s",
            'avg_top_score': f"{avg_score:.4f}"
        }

print("✓ MonitoredSearchEngine class defined")
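
Averages can hide occasional slow searches, so it often helps to report percentiles as well. This sketch summarizes a log shaped like `search_log` above (a list of dictionaries with an `elapsed_time` key); the function name `summarize_latency` and the timing values are invented for illustration.

```python
import numpy as np
from typing import Dict, List

def summarize_latency(search_log: List[dict]) -> Dict[str, float]:
    """Return count, median, 95th percentile, and maximum response time."""
    times = np.array([entry["elapsed_time"] for entry in search_log], dtype=float)
    if times.size == 0:
        return {}
    return {
        "count": int(times.size),
        "p50": float(np.percentile(times, 50)),
        "p95": float(np.percentile(times, 95)),
        "max": float(times.max()),
    }

# Made-up timings in seconds: four fast searches and one slow outlier.
example_log = [{"elapsed_time": t} for t in [0.10, 0.12, 0.11, 0.13, 0.95]]
summary = summarize_latency(example_log)
print(summary)
```

Here the median stays near the typical response time while the 95th percentile exposes the outlier that a mean would blur.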

## Summary

In this lab, you learned about **Embeddings and Semantic Search** and how to apply them in banking scenarios.

### Key Takeaways:

1. **Embeddings** convert text into numerical vectors that capture semantic meaning
2. **Cosine similarity** scores how closely two embeddings point in the same direction
3. **Semantic search** finds relevant documents based on meaning, not just keywords
4. **Banking applications** include FAQ search, product recommendations, and document routing
5. **Production best practices** include caching, batch processing, and monitoring

### Banking Use Cases:

- ✅ FAQ search systems
- ✅ Product recommendation engines
- ✅ Customer query routing
- ✅ Duplicate detection
- ✅ Document similarity analysis
- ✅ Multi-language support

### Best Practices:

1. **Cache embeddings** to reduce API calls and costs
2. **Use batch processing** for multiple documents
3. **Monitor performance** and log searches
4. **Set appropriate similarity thresholds** for your use case
5. **Handle edge cases** (empty queries, no results, etc.)
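
As an illustration of the last two points, here is a minimal sketch of a guard around a search function. It skips blank queries and drops results below a minimum score. `safe_search` and `fake_search` are hypothetical names; in the notebook you could pass `search_engine.search` as `search_fn`.

```python
from typing import Callable, List, Tuple

SearchFn = Callable[[str, int], List[Tuple[str, float]]]

def safe_search(
    search_fn: SearchFn, query: str, top_k: int = 3, min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Run a search only for non-blank queries and filter weak matches."""
    if not query or not query.strip():
        return []  # nothing to embed, so skip the API call entirely
    results = search_fn(query.strip(), top_k)
    return [(doc, score) for doc, score in results if score >= min_score]

def fake_search(query: str, top_k: int) -> List[Tuple[str, float]]:
    """Stand-in for a real search engine, with made-up scores."""
    ranked = [("FAQ A", 0.91), ("FAQ B", 0.78), ("FAQ C", 0.70)]
    return ranked[:top_k]

print(safe_search(fake_search, "   "))                     # []
print(safe_search(fake_search, "saldo", min_score=0.75))
```

An empty list tells the caller to fall back to another channel, for example a human agent, rather than showing a poor match.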

### Next Steps:

In the next lab, we'll explore **RAG (Retrieval-Augmented Generation)**, which combines semantic search with LLMs to create powerful Q&A systems!

---

**Instructor:** Manuela Larrea | manuela.larrea@idataglobal.com