### 1.0 Create an Azure AI Search Index 


#### Vector-Only Index Creation with Azure AI Search

This script creates a vector-only index in Azure AI Search using the General Availability (GA) schema introduced in mid-2024. It sets up an index with just two fields:

- A string-based document ID (`id`), used as the primary key
- A vector field (`contentVector`) that holds embedding data (e.g., from Azure OpenAI)

We configure vector search to use the HNSW algorithm with the cosine similarity metric, a common choice for embedding-based semantic search. This setup is lean and suited to scenarios that rely purely on vector search (e.g., similarity search over embeddings) rather than keyword-based retrieval.
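To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity on toy vectors. It only illustrates the formula: the search service computes similarity itself, and HNSW returns approximate nearest neighbors rather than comparing against every vector. The function name `cosine_similarity` and the sample vectors are illustrative and not part of the Azure SDK.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (the index in this section uses 1536 dimensions)
query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]   # same direction, different length
orthogonal = [0.0, 1.0, 0.0]       # unrelated direction

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Because the metric depends only on direction, the longer vector scores the same as a unit vector pointing the same way, which is why cosine is often paired with text embeddings.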



In [8]:
# create_index_vector_only.py – GA-compatible vector-only index
from dotenv import load_dotenv
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchProfile,
)

# ── 1. env ──────────────────────────────────────────────────────────────
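load_dotenv()
ENDPOINT   = os.getenv("AZURE_SEARCH_ENDPOINT")
ADMIN_KEY  = os.getenv("AZURE_SEARCH_ADMIN_KEY")
INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME", "index01")
if not ENDPOINT or not ADMIN_KEY:
    # Fail early with a clear message instead of an SDK error later
    raise SystemExit("Missing AZURE_SEARCH_ENDPOINT or AZURE_SEARCH_ADMIN_KEY in .env")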

# ── 2. algorithm + profile (HNSW + cosine) ─────────────────────────────
algo_cfg = HnswAlgorithmConfiguration(
    name="hnsw-cosine",
    parameters=HnswParameters(metric="cosine")  # m, ef_construction, ef_search keep their defaults (m=4)
)

profile_cfg = VectorSearchProfile(           # ← referenced by the field
    name="hnsw-cosine-profile",
    algorithm_configuration_name="hnsw-cosine",
)

vector_search = VectorSearch(
    algorithms=[algo_cfg],
    profiles=[profile_cfg],
)

# ── 3. schema: key + vector field only ─────────────────────────────────
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchField(
        name="contentVector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="hnsw-cosine-profile",
    ),
]

index = SearchIndex(
    name=INDEX_NAME,
    fields=fields,
    vector_search=vector_search,
)

# ── 4. push the index ──────────────────────────────────────────────────
client = SearchIndexClient(endpoint=ENDPOINT, credential=AzureKeyCredential(ADMIN_KEY))
print(f"Creating or updating index '{INDEX_NAME}' …")
client.create_or_update_index(index)
print("✅  Vector-only index ready")


Creating or updating index 'index01' …
✅  Vector-only index ready


✅ Result
Once this script runs, you’ll have a minimal vector-only index that uses the GA schema and supports approximate nearest-neighbor search via HNSW with the cosine metric.

You can now upload documents with precomputed embeddings and run vector similarity queries against the `contentVector` field.
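As a rough sketch of that next step, the snippet below builds documents in the shape the index expects and, in a separate function, uploads them and runs a vector query with `SearchClient` and `VectorizedQuery` from `azure-search-documents`. The helper names `make_documents` and `upload_and_query`, the document IDs, and the placeholder vectors are assumptions for illustration; real vectors would come from an embedding model with 1536 output dimensions. Only the payload-building part runs without credentials.

```python
import os

VECTOR_FIELD = "contentVector"   # field name from the index schema in this section
DIMENSIONS = 1536                # must match vector_search_dimensions

def make_documents(ids, vectors, dimensions=DIMENSIONS):
    """Pair each ID with its embedding and check the vector length."""
    if len(ids) != len(vectors):
        raise ValueError("ids and vectors must have the same length")
    docs = []
    for doc_id, vec in zip(ids, vectors):
        if len(vec) != dimensions:
            raise ValueError(f"{doc_id}: expected {dimensions} dimensions, got {len(vec)}")
        docs.append({"id": str(doc_id), VECTOR_FIELD: [float(x) for x in vec]})
    return docs

def upload_and_query(docs, query_vector, k=3):
    """Upload docs, then return (id, score) pairs for the k nearest neighbors."""
    # Imported lazily so make_documents() works without the SDK installed.
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    from azure.search.documents.models import VectorizedQuery

    client = SearchClient(
        endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
        index_name=os.getenv("AZURE_SEARCH_INDEX_NAME", "index01"),
        credential=AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"]),
    )
    client.upload_documents(documents=docs)
    vq = VectorizedQuery(vector=query_vector, k_nearest_neighbors=k, fields=VECTOR_FIELD)
    results = client.search(search_text=None, vector_queries=[vq], select=["id"])
    return [(r["id"], r["@search.score"]) for r in results]

# Placeholder vectors only; they stand in for real embeddings.
docs = make_documents(["doc-1", "doc-2"], [[0.1] * DIMENSIONS, [0.2] * DIMENSIONS])
print(len(docs), len(docs[0][VECTOR_FIELD]))
```

Calling `upload_and_query(docs, query_vector)` requires the same environment variables as the index script. Newly uploaded documents can take a moment to become searchable, so a query issued immediately after the upload may return fewer results than expected.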

### 2.0 OCR the PDF 



In [1]:
"""
ocr_single_pdf.py – OCR one PDF (2504_IMF_WOO.pdf) with Azure Document Intelligence
Outputs 2504_IMF_WOO.txt in the same directory.
"""
from pathlib import Path
import os, sys
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

# ─────────────── SETUP ───────────────
try:
    SCRIPT_DIR = Path(__file__).resolve().parent       # running as a script
except NameError:
    SCRIPT_DIR = Path.cwd()                            # notebook: __file__ is undefined
load_dotenv(dotenv_path=SCRIPT_DIR / ".env")           # credentials in .env

ENDPOINT = os.getenv("DOCUMENTINTELLIGENCE_ENDPOINT")
KEY      = os.getenv("DOCUMENTINTELLIGENCE_API_KEY")
if not ENDPOINT or not KEY:
    sys.exit("❌  Missing DOCUMENTINTELLIGENCE_… values in .env")

client = DocumentIntelligenceClient(
    endpoint=ENDPOINT,
    credential=AzureKeyCredential(KEY)
)

PDF_FILE = SCRIPT_DIR / "2504_IMF_WOO.pdf"             # ← target file
if not PDF_FILE.exists():
    sys.exit(f"❗  {PDF_FILE.name} not found in {SCRIPT_DIR.resolve()}")

print(f"🔍  Processing {PDF_FILE.name} …")

# ─────────────── OCR ───────────────
try:
    with PDF_FILE.open("rb") as fh:
        poller = client.begin_analyze_document(
            "prebuilt-read",                           # model_id
            fh,                                        # binary stream
            content_type="application/pdf",
        )
    result = poller.result()

    pages_txt = [
        "\n".join(ln.content for ln in (p.lines or []))
        for p in (result.pages or [])
    ]
    raw_text = "\n\n".join(pages_txt)

    TXT_OUT = PDF_FILE.with_suffix(".txt")
    TXT_OUT.write_text(raw_text, encoding="utf-8")
    print(f"✅  Text saved to {TXT_OUT.name}")

except Exception as e:
    print(f"⚠️  Failed to process {PDF_FILE.name}: {e}")


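Note: `__file__` is defined only when the code runs as a script. In a notebook cell, an unguarded `Path(__file__)` raises `NameError: name '__file__' is not defined`; falling back to `Path.cwd()` avoids this.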
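The text-assembly step of the OCR cell can be checked without calling the service. The sketch below wraps the same list comprehension in a hypothetical helper, `pages_to_text`, and feeds it a stand-in object built with `SimpleNamespace`; the stand-in only mimics the `pages`, `lines`, and `content` attributes the cell reads and is not a real Document Intelligence result.

```python
from types import SimpleNamespace

def pages_to_text(result):
    """Join line contents per page, then join pages with a blank line between them."""
    pages_txt = [
        "\n".join(ln.content for ln in (p.lines or []))
        for p in (result.pages or [])
    ]
    return "\n\n".join(pages_txt)

# Hypothetical stand-in: two pages, the second one without any recognized lines.
fake_result = SimpleNamespace(pages=[
    SimpleNamespace(lines=[SimpleNamespace(content="Title"), SimpleNamespace(content="Body")]),
    SimpleNamespace(lines=None),
])

print(repr(pages_to_text(fake_result)))
```

The `or []` guards are what let a page with `lines=None`, or a result with `pages=None`, pass through as empty text instead of raising a `TypeError`.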