# Notebook 02: Index - "The Zero-Config Setup"

## Innocenti Risk Management Enablement Kit

---

### The Problem

Setting up semantic search traditionally requires:
1. Deploying an embedding model
2. Managing vector dimensions and similarity metrics
3. Writing ingest pipelines to generate embeddings
4. Configuring dense vector fields

### The Solution: Elastic Inference Service (EIS) + `semantic_text`

With EIS, you get:
- **One-click model deployment** via inference endpoints
- **`semantic_text` field type** that auto-embeds at index time
- **No pipeline configuration** - just index your documents
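
To make the contrast concrete, here is a minimal sketch of the two mapping styles as plain Python dictionaries. The field name `text_embedding`, the endpoint ID `my-embeddings-endpoint`, and the default values are illustrative assumptions, not part of this kit.

```python
# Illustrative sketch only: field names, defaults, and the endpoint ID below
# are assumptions chosen to show the contrast between the two approaches.

def manual_vector_mapping(dims: int = 1024, similarity: str = "cosine") -> dict:
    """Manual approach: you declare dimensions and similarity yourself."""
    return {
        "properties": {
            "text": {"type": "text"},
            "text_embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": similarity,
            },
        }
    }


def semantic_text_mapping(inference_id: str) -> dict:
    """semantic_text approach: one field that points at an inference endpoint."""
    return {
        "properties": {
            "text": {"type": "semantic_text", "inference_id": inference_id}
        }
    }


manual = manual_vector_mapping()
auto = semantic_text_mapping("my-embeddings-endpoint")
print("Manual fields:", sorted(manual["properties"]))
print("semantic_text fields:", sorted(auto["properties"]))
```

In the manual version, the vector field still has to be populated by a separate embedding step; in the `semantic_text` version, the field itself carries the reference to the model.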

**What we'll do:**
1. Connect to Elastic Cloud
2. Create a Jina Embeddings v3 inference endpoint
3. Create an index with `semantic_text`
4. Bulk index our EU AI Act articles

---

## 1. Setup & Dependencies

In [None]:
# Install dependencies (uncomment if needed)
# !pip install elasticsearch python-dotenv tqdm

In [None]:
import json
from pathlib import Path
from elasticsearch import Elasticsearch, BadRequestError
from elasticsearch.helpers import bulk
from tqdm import tqdm

# Import our credential helper
import sys
sys.path.insert(0, str(Path.cwd()))
from utils.credentials import setup_notebook, get_index_name, get_inference_id

print("Libraries loaded successfully!")

In [None]:
# Setup credentials and display configuration
creds = setup_notebook(require_elastic=True, require_jina=False)

# Get unique names for this user
INDEX_NAME = get_index_name("search-eu-ai-act")
INFERENCE_ID = get_inference_id("embeddings")

## 2. Connect to Elastic Cloud

We'll use the official Python client, identifying the deployment by its Cloud ID and authenticating with an API key.

In [None]:
# Initialize Elasticsearch client
es = Elasticsearch(
    cloud_id=creds["ELASTIC_CLOUD_ID"],
    api_key=creds["ELASTIC_API_KEY"]
)

# Verify connection
info = es.info()
print(f"✓ Connected to Elasticsearch {info['version']['number']}")
print(f"  Cluster: {info['cluster_name']}")
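
If the connection fails, a missing or empty credential is a common cause. The following is a small sketch of a pre-flight check that could run before the client is constructed. The helper `missing_credentials` is hypothetical and not part of `utils.credentials`; only the two key names are taken from the cell above.

```python
# Hypothetical helper: not part of utils.credentials.
REQUIRED_KEYS = ("ELASTIC_CLOUD_ID", "ELASTIC_API_KEY")


def missing_credentials(creds: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent, None, or blank."""
    return [key for key in required if not str(creds.get(key) or "").strip()]


# Example with made-up values: the API key is blank, so it is reported.
example = {"ELASTIC_CLOUD_ID": "my-deployment:abc123", "ELASTIC_API_KEY": "  "}
problems = missing_credentials(example)
if problems:
    print(f"Missing credentials: {', '.join(problems)}")
```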

## 3. Create Inference Endpoint (Jina Embeddings v3)

The inference endpoint is like a "model service" that Elasticsearch calls to generate embeddings.

**Jina Embeddings v3 highlights:**
- Multilingual (100+ languages)
- Task-specific modes (retrieval, classification, etc.)
- 1024 dimensions by default

We'll wrap this in a try/except for idempotency - if the endpoint already exists, we continue gracefully.

In [None]:
def create_embedding_inference(es_client, inference_id: str) -> bool:
    """
    Create a Jina Embeddings v3 inference endpoint.
    
    Handles ResourceAlreadyExists gracefully for idempotency.
    
    Args:
        es_client: Elasticsearch client
        inference_id: Unique ID for this inference endpoint
    
    Returns:
        True if created, False if already existed
    """
    try:
        es_client.inference.put(
            inference_id=inference_id,
            task_type="text_embedding",
            body={
                "service": "jinaai",
                "service_settings": {
                    "model_id": "jina-embeddings-v3"
                    # Note: API key can be set here or in Kibana
                    # "api_key": "your-jina-key"  # Optional - can configure in Kibana
                },
                "task_settings": {
                    "task": "retrieval.passage"  # Optimized for document retrieval
                }
            }
        )
        print(f"✓ Created inference endpoint: {inference_id}")
        return True
        
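    except BadRequestError as e:
        # Treat "already exists" as success so the cell can be re-run safely
        message = str(e).lower()
        if "resource_already_exists_exception" in message or "already exists" in message:
            print(f"✓ Inference endpoint already exists: {inference_id}")
            return False
        raise  # Re-raise anything else with its original traceback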

In [None]:
# Create the embedding inference endpoint
create_embedding_inference(es, INFERENCE_ID)
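
The idempotency check above boils down to a string match on the error message. As a sketch, that test can be pulled into a standalone predicate so it is easy to exercise without a cluster. The function name and the sample error messages are made up for illustration; the actual wording returned by Elasticsearch may differ.

```python
# Sketch: the marker strings mirror the ones used in the cell above.
ALREADY_EXISTS_MARKERS = ("resource_already_exists_exception", "already exists")


def is_already_exists_error(error: Exception) -> bool:
    """Return True if the error text looks like an 'already exists' failure."""
    message = str(error).lower()
    return any(marker in message for marker in ALREADY_EXISTS_MARKERS)


# Made-up error messages, for illustration only
duplicate = Exception("Inference endpoint [embeddings] already exists")
other = Exception("validation_exception: unknown setting")

print(is_already_exists_error(duplicate))
print(is_already_exists_error(other))
```

Matching on message text is fragile by nature, so anything that does not match is best re-raised rather than silently ignored.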

## 4. Create Index with `semantic_text`

The `semantic_text` field type is the magic:
- Automatically calls the inference endpoint at index time
- Stores both the original text and the embeddings
- Enables semantic search with zero configuration

**Our schema:**
- `article_number` - Keyword for filtering
- `title` - Keyword + text for hybrid search
- `text` - **semantic_text** for vector search
- `language`, `url` - Metadata

In [None]:
# Define the index mapping
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "article_number": {
                "type": "keyword"  # Exact match for filtering
            },
            "title": {
                "type": "text",  # Full-text search on titles
                "fields": {
                    "keyword": {"type": "keyword"}  # Also exact match
                }
            },
            "text": {
                "type": "semantic_text",  # THE MAGIC FIELD
                "inference_id": INFERENCE_ID  # Points to our Jina model
            },
            "language": {
                "type": "keyword"
            },
            "url": {
                "type": "keyword"
            }
        }
    }
}

print("Index mapping defined with semantic_text field")
print(f"  Inference endpoint: {INFERENCE_ID}")
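
Before creating the index, it can help to confirm which fields will trigger inference. The sketch below walks a mapping dictionary shaped like `INDEX_MAPPING` and lists its `semantic_text` fields with the endpoint each one references. The helper name is hypothetical, the sample mapping uses a placeholder endpoint ID, and only top-level properties are inspected (nested objects are ignored).

```python
# Hypothetical helper; inspects top-level properties only.
def semantic_text_fields(mapping: dict) -> dict:
    """Map each semantic_text field name to its inference_id (or None)."""
    properties = mapping.get("mappings", {}).get("properties", {})
    return {
        name: spec.get("inference_id")
        for name, spec in properties.items()
        if spec.get("type") == "semantic_text"
    }


# Sample mapping with a placeholder endpoint ID
sample_mapping = {
    "mappings": {
        "properties": {
            "article_number": {"type": "keyword"},
            "text": {"type": "semantic_text", "inference_id": "my-embeddings-endpoint"},
        }
    }
}

for field, endpoint in semantic_text_fields(sample_mapping).items():
    print(f"{field} -> {endpoint}")
```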

In [None]:
# Create the index (delete if exists for clean re-runs)
if es.indices.exists(index=INDEX_NAME):
    print(f"Index {INDEX_NAME} exists, deleting for fresh start...")
    es.indices.delete(index=INDEX_NAME)

es.indices.create(index=INDEX_NAME, body=INDEX_MAPPING)
print(f"✓ Created index: {INDEX_NAME}")

## 5. Load and Index Documents

Now we'll load the JSON from Notebook 01 and bulk index it.

**Note:** With `semantic_text`, embeddings are generated automatically at index time. This may take a moment for larger datasets.

In [None]:
# Load the articles from Notebook 01
data_file = Path.cwd().parent / "data" / "eu_ai_act_clean.json"

if not data_file.exists():
    raise FileNotFoundError(
        f"Data file not found: {data_file}\n"
        "Please run Notebook 01 first to generate the data."
    )

with open(data_file, 'r', encoding='utf-8') as f:
    articles = json.load(f)

print(f"✓ Loaded {len(articles)} articles from {data_file.name}")
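
Because each indexed document triggers an embedding call, it is cheaper to catch malformed records before indexing than after. The following sketch checks that every article carries the fields the bulk generator in the next cell reads directly (`id`, `article_number`, `title`, `text`). The helper name and the sample records are illustrative assumptions.

```python
# Fields the bulk generator accesses without a default value
REQUIRED_FIELDS = ("id", "article_number", "title", "text")


def find_invalid_articles(articles: list) -> list:
    """Return (position, missing_fields) for each record missing a required field."""
    problems = []
    for position, article in enumerate(articles):
        missing = [field for field in REQUIRED_FIELDS if not article.get(field)]
        if missing:
            problems.append((position, missing))
    return problems


# Made-up records: the second one has no text
sample_articles = [
    {"id": "a1", "article_number": "1", "title": "First", "text": "Body"},
    {"id": "a2", "article_number": "2", "title": "Second", "text": ""},
]

for position, missing in find_invalid_articles(sample_articles):
    print(f"Record {position} is missing: {', '.join(missing)}")
```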

In [None]:
def generate_actions(articles: list, index_name: str):
    """
    Generator for bulk indexing actions.
    
    Args:
        articles: List of article dictionaries
        index_name: Target index name
    
    Yields:
        Bulk action dictionaries
    """
    for article in articles:
        yield {
            "_index": index_name,
            "_id": article["id"],  # Use article ID for idempotent indexing
            "_source": {
                "article_number": article["article_number"],
                "title": article["title"],
                "text": article["text"],
                "language": article.get("language", "en"),
                "url": article.get("url", "")
            }
        }

In [None]:
# Bulk index with progress bar
print(f"Indexing {len(articles)} articles to {INDEX_NAME}...")
print("(Embeddings are generated automatically - this may take a minute)")

success, errors = bulk(
    es,
    # tqdm advances as the bulk helper consumes actions from the generator
    tqdm(generate_actions(articles, INDEX_NAME), total=len(articles), desc="Indexing"),
    chunk_size=10,         # Smaller chunks for semantic_text (embedding generation)
    refresh=True,          # Make documents immediately searchable
    raise_on_error=False   # Collect failures in `errors` instead of raising BulkIndexError
)

print(f"\n✓ Indexed {success} documents")
if errors:
    print(f"✗ {len(errors)} errors occurred")
    for error in errors[:5]:  # Show first 5 errors
        print(f"  - {error}")
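
To see what `chunk_size=10` means in practice, the sketch below splits a stream of actions into fixed-size batches, which is roughly how the actions are grouped into bulk requests by count. It is a simplified illustration, not the helper's actual implementation (the real helper also limits batches by byte size); `batched` and the dummy actions are hypothetical.

```python
from itertools import islice


def batched(iterable, size: int):
    """Yield lists of up to `size` items from any iterable, including generators."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 25 dummy actions split into chunks of 10
actions = ({"_id": f"doc_{i}"} for i in range(25))
sizes = [len(batch) for batch in batched(actions, 10)]
print(sizes)  # [10, 10, 5]
```

Smaller chunks mean more requests, but each request asks the inference endpoint to embed fewer documents at once.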

## 6. Verification

Let's verify the indexing worked by checking the document count and running a test search.

In [None]:
# Check document count
count = es.count(index=INDEX_NAME)["count"]
print(f"✓ Index {INDEX_NAME} contains {count} documents")

In [None]:
# Run a test semantic search
test_query = "biometric identification systems"

results = es.search(
    index=INDEX_NAME,
    body={
        "query": {
            "semantic": {
                "field": "text",
                "query": test_query
            }
        },
        "size": 5,
        "_source": ["article_number", "title"]
    }
)

print(f"\n--- Test Search: \"{test_query}\" ---")
print(f"Found {results['hits']['total']['value']} matches\n")

for i, hit in enumerate(results['hits']['hits'], 1):
    print(f"{i}. Article {hit['_source']['article_number']}: {hit['_source']['title'][:60]}")
    print(f"   Score: {hit['_score']:.4f}")
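
Since `language` is mapped as a keyword, a semantic query can be narrowed with an exact-match filter. The sketch below builds such a request body by placing the `semantic` clause inside a `bool` query. The builder function is hypothetical, and this notebook itself only runs the unfiltered form.

```python
# Hypothetical builder; produces a plain dict suitable as a search request body.
def build_semantic_query(query_text: str, language: str = None, size: int = 5) -> dict:
    """Build a semantic search body, optionally filtered by language."""
    semantic_clause = {"semantic": {"field": "text", "query": query_text}}
    if language is None:
        query = semantic_clause
    else:
        query = {
            "bool": {
                "must": [semantic_clause],                       # Scored by semantic similarity
                "filter": [{"term": {"language": language}}],    # Exact match, not scored
            }
        }
    return {"query": query, "size": size, "_source": ["article_number", "title"]}


body = build_semantic_query("biometric identification systems", language="en")
print(body["query"]["bool"]["filter"])
```

The resulting dictionary could be passed as `body=` to `es.search` in the same way as the test query above.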

## 7. View Sample Document with Embeddings

Let's peek at how `semantic_text` stores the embeddings internally.

In [None]:
# Get a sample document to see the structure
sample = es.get(index=INDEX_NAME, id="en_art_5")

print("--- Sample Document Structure ---")
print(f"ID: {sample['_id']}")
print(f"Article: {sample['_source']['article_number']}")
print(f"Title: {sample['_source']['title']}")
print(f"\nText field type: {type(sample['_source']['text'])}")

# Depending on the Elasticsearch version, the semantic_text value in _source may be
# the plain original text or a dict holding the text plus inference metadata
text_field = sample['_source']['text']
if isinstance(text_field, dict):
    print("\nNote: semantic_text stores the original text plus embedding metadata")
    print(f"Keys in text field: {list(text_field.keys())}")
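
Because the field may come back as either a string or a dictionary, downstream code is easier to write against a helper that handles both. This is a minimal sketch; the dictionary layout (a `text` key next to an `inference` key) is an assumption about the stored format and should be checked against the keys printed above.

```python
# Sketch: assumes the dict form keeps the original text under a "text" key.
def original_text(field_value):
    """Return the original text from a semantic_text value, or None if unrecognized."""
    if isinstance(field_value, str):
        return field_value
    if isinstance(field_value, dict):
        return field_value.get("text")
    return None


# Made-up values covering both shapes
as_string = "Prohibited AI practices"
as_dict = {"text": "Prohibited AI practices", "inference": {"chunks": []}}

print(original_text(as_string) == original_text(as_dict))
```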

---

## Next Steps

You've successfully:
1. ✅ Connected to Elastic Cloud
2. ✅ Created a Jina Embeddings v3 inference endpoint
3. ✅ Created an index with `semantic_text`
4. ✅ Bulk indexed the EU AI Act articles
5. ✅ Verified with a semantic search

**Continue to Notebook 03** to see how reranking with Jina Reranker v3 improves precision.

---

### Key Takeaways

| Concept | What We Learned |
|---------|----------------|
| **EIS** | Elastic Inference Service = managed model endpoints |
| **semantic_text** | Auto-embeds at index time, no pipelines needed |
| **Idempotency** | Wrap the `inference.put` call in try/except for safe re-runs |
| **Zero Config** | One field type + one inference ID = semantic search ready |