### 1.0 Create an Azure AI Search Index 


#### Vector-Only Index Creation with Azure AI Search

This script creates a vector-only index in Azure AI Search using the General Availability (GA) schema introduced in mid-2024. It sets up an index with just two fields:

* A string-based document ID (`id`), used as the primary key
* A vector field (`contentVector`) that holds embedding data (e.g., from Azure OpenAI)

We configure vector search to use the HNSW algorithm with cosine similarity, a common choice for semantic search over text embeddings. This lean setup suits scenarios that rely purely on vector search (e.g., similarity search over embeddings) rather than keyword-based retrieval.
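
To make the metric concrete, here is a minimal sketch of the cosine similarity that the `hnsw-cosine` configuration ranks by. It is plain Python for illustration only: the function name `cosine_similarity` and the toy vectors are assumptions, and the service computes the score internally while searching its HNSW graph, not with code like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (real ones have 1536 dimensions here)
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # nothing in common with the query

print(cosine_similarity(query, same_direction))  # close to 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Because the score depends only on direction, two vectors that differ in length but point the same way are treated as a near-perfect match.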



In [8]:
# create_index_vector_only.py – GA-compatible vector-only index
from dotenv import load_dotenv
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchProfile,
)

# ── 1. env ──────────────────────────────────────────────────────────────
load_dotenv()
ENDPOINT   = os.getenv("AZURE_SEARCH_ENDPOINT")
ADMIN_KEY  = os.getenv("AZURE_SEARCH_ADMIN_KEY")
INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME", "index01")

# ── 2. algorithm + profile (HNSW + cosine) ─────────────────────────────
algo_cfg = HnswAlgorithmConfiguration(
    name="hnsw-cosine",
    # cosine metric; other HNSW settings keep the service defaults
    # (m=4, efConstruction=400, efSearch=500)
    parameters=HnswParameters(metric="cosine"),
)

profile_cfg = VectorSearchProfile(           # ← referenced by the field
    name="hnsw-cosine-profile",
    algorithm_configuration_name="hnsw-cosine",
)

vector_search = VectorSearch(
    algorithms=[algo_cfg],
    profiles=[profile_cfg],
)

# ── 3. schema: key + vector field only ─────────────────────────────────
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchField(
        name="contentVector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="hnsw-cosine-profile",
    ),
]

index = SearchIndex(
    name=INDEX_NAME,
    fields=fields,
    vector_search=vector_search,
)

# ── 4. push the index ──────────────────────────────────────────────────
client = SearchIndexClient(endpoint=ENDPOINT, credential=AzureKeyCredential(ADMIN_KEY))
print(f"Creating or updating index '{INDEX_NAME}' …")
client.create_or_update_index(index)
print("✅  Vector-only index ready")


Creating or updating index 'index01' …
✅  Vector-only index ready


✅ Result
Once this script runs, you have a minimal vector-only index that is compatible with the GA schema and supports vector similarity search via HNSW with the cosine metric. The `contentVector` field expects 1536-dimensional vectors, so the embedding model must produce vectors of that length (the default for `text-embedding-3-small`).

You can now upload vectorized documents and run vector similarity queries against the index.
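
As a rough illustration of what a document for this index looks like, the sketch below builds a payload with exactly the two fields the schema defines and checks it before upload. The helper names (`is_valid_key`, `make_document`) are hypothetical, and the key pattern reflects my understanding of the service's key rules (letters, digits, underscore, dash, equal sign), so treat it as an assumption to verify.

```python
import re

VECTOR_DIMS = 1536  # must equal vector_search_dimensions in the index
# Assumed rule: keys may contain letters, digits, "_", "-" and "=" only
KEY_PATTERN = re.compile(r"^[A-Za-z0-9_\-=]+$")


def is_valid_key(doc_id: str) -> bool:
    """True if doc_id uses only the characters assumed to be allowed in a key."""
    return bool(KEY_PATTERN.match(doc_id))


def make_document(doc_id: str, embedding) -> dict:
    """Build a payload that matches the two-field schema (id + contentVector)."""
    if not is_valid_key(doc_id):
        raise ValueError(f"invalid document key: {doc_id!r}")
    if len(embedding) != VECTOR_DIMS:
        raise ValueError(
            f"expected {VECTOR_DIMS} dimensions, got {len(embedding)}"
        )
    return {"id": doc_id, "contentVector": [float(x) for x in embedding]}


doc = make_document("WEO25_p1_c0000", [0.0] * VECTOR_DIMS)
print(doc["id"], len(doc["contentVector"]))
```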

### 2.0 OCR the PDF 

This code performs OCR on a single PDF file, `2504_IMF_WOO.pdf`, using Azure AI Document Intelligence and saves the extracted text as `2504_IMF_WOO.txt` in the same directory. It loads API credentials from a `.env` file, sets up the client, and resolves the working directory so that it runs both as a script and in a notebook. The script submits the PDF to the `prebuilt-read` model, waits for the result, extracts text line by line from each page, and writes the output as a plain-text file with a blank line between pages. Errors during analysis are caught and reported with a status message.
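
The page-by-page text assembly can be tried without calling the service. The sketch below uses `SimpleNamespace` objects as stand-ins for the `pages` and `lines` of an analysis result; the function name `pages_to_text` and the sample strings are made up for illustration.

```python
from types import SimpleNamespace


def pages_to_text(pages) -> str:
    """Join lines with newlines inside a page and pages with a blank line."""
    page_texts = [
        "\n".join(line.content for line in (page.lines or []))
        for page in (pages or [])
    ]
    return "\n\n".join(page_texts)


# Stand-in objects that mimic only the attributes the script reads
fake_pages = [
    SimpleNamespace(lines=[
        SimpleNamespace(content="Global growth"),
        SimpleNamespace(content="is projected"),
    ]),
    SimpleNamespace(lines=None),   # a page with no recognized lines
    SimpleNamespace(lines=[SimpleNamespace(content="Chapter 1")]),
]

print(repr(pages_to_text(fake_pages)))
```

The `or []` guards matter: a page without recognized lines contributes an empty string instead of raising an error.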



In [3]:
"""
OCR one PDF (2504_IMF_WOO.pdf) with Azure Document Intelligence
Saves 2504_IMF_WOO.txt in the same folder.
"""
from pathlib import Path
import os, sys
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

# ─────────────── SETUP ───────────────
try:
    SCRIPT_DIR = Path(__file__).resolve().parent   # works in a .py file
except NameError:
    SCRIPT_DIR = Path.cwd()                        # Jupyter fallback

load_dotenv(dotenv_path=SCRIPT_DIR / ".env")       # credentials in .env

ENDPOINT = os.getenv("DOCUMENTINTELLIGENCE_ENDPOINT")
KEY      = os.getenv("DOCUMENTINTELLIGENCE_API_KEY")
if not ENDPOINT or not KEY:
    sys.exit("❌  Missing DOCUMENTINTELLIGENCE_… values in .env")

client = DocumentIntelligenceClient(
    endpoint=ENDPOINT, credential=AzureKeyCredential(KEY)
)

PDF_FILE = SCRIPT_DIR / "2504_IMF_WOO.pdf"
if not PDF_FILE.exists():
    sys.exit(f"❗  {PDF_FILE.name} not found in {SCRIPT_DIR.resolve()}")

print(f"🔍  Processing {PDF_FILE.name} …")

# ─────────────── OCR ───────────────
try:
    with PDF_FILE.open("rb") as fh:
        poller = client.begin_analyze_document(
            "prebuilt-read", fh, content_type="application/pdf"
        )
    result = poller.result()

    pages_txt = [
        "\n".join(ln.content for ln in (p.lines or []))
        for p in (result.pages or [])
    ]
    (PDF_FILE.with_suffix(".txt")).write_text("\n\n".join(pages_txt), "utf-8")
    print(f"✅  Text saved to {PDF_FILE.with_suffix('.txt').name}")

except Exception as e:
    print(f"⚠️  Failed to process {PDF_FILE.name}: {e}")


🔍  Processing 2504_IMF_WOO.pdf …
✅  Text saved to 2504_IMF_WOO.txt


### 3.0 Pre-processing Text

To prepare the OCR dump `2504_IMF_WOO.txt` for RAG, the script first rejoins words split by line-break hyphens (e.g., “eco- \n nomic” → “economic”). It then strips generic noise (tabs, carriage returns, non-ASCII characters, HTML tags, Markdown markers, divider lines, and bold “IMPORTANT/NOTE” labels) using regex replacements. Next, it removes IMF-specific clutter such as page headers and footers, Roman or Arabic page numbers, table-of-contents lines, chapter titles, and figure/table captions. Finally, it replaces all remaining newlines with spaces and collapses runs of whitespace into a single space, producing a compact string with the boilerplate removed, ready for tokenization and chunking. The cleaned output is saved as `2504_IMF_WOO.cleaned.txt`.
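
The hyphen repair is easy to check in isolation. This sketch applies the same regular expression the script uses to a few sample strings; the wrapper name `join_hyphen_breaks` and the samples are illustrative only.

```python
import re

# Same pattern as in clean_text(): word, hyphen, line break, word
HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def join_hyphen_breaks(text: str) -> str:
    """Glue together the two halves of a word split across a line break."""
    return HYPHEN_BREAK.sub(r"\1\2", text)


print(join_hyphen_breaks("eco-\n  nomic growth"))   # economic growth
print(join_hyphen_breaks("medium-term outlook"))    # unchanged, no line break
print(join_hyphen_breaks("long-\nterm outlook"))    # longterm outlook
```

The last case shows the trade-off: a genuine compound that happens to wrap at its hyphen also loses the hyphen, because the pattern cannot tell the two situations apart.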

In [4]:
#!/usr/bin/env python3
"""
clean_2504_imf_woo.py  –  Create a RAG-ready version of 2504_IMF_WOO.txt

Reads the raw OCR dump, removes headers/footers, TOC noise, figure captions,
hyphen-breaks, HTML/Markdown tags, non-UTF8 chars, etc., and writes
2504_IMF_WOO.cleaned.txt in the same directory.

Run with:  python clean_2504_imf_woo.py
"""

from pathlib import Path
import re
import sys


# ────────────────────────────────────────────────────────────────
# 1.  Text-cleaning utility
# ────────────────────────────────────────────────────────────────
def clean_text(text: str) -> str:
    """Return a compact, boilerplate-free string suitable for chunking."""
    # fix hyphenated line breaks  (eco-\n  nomic → economic)
    text = re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)

    # generic noise
    generic = [
        r"\t", r"\r\n", r"\r",                # tabs / CRs
        r"[^\x00-\x7F]+",                     # non-ASCII characters
        r"<\/?(table|tr|td|ul|li|p|br)>",     # HTML tags
        r"\*\*IMPORTANT:\*\*|\*\*NOTE:\*\*", # doc notes
        r"<!|no-loc |text=|<--|-->",          # markup
        r"```|:::|---|--|###|##|#",           # md code / hr / headers
    ]
    for pat in generic:
        text = re.sub(pat, " ", text, flags=re.I)

    # IMF-specific headers / footers / TOC lines / captions
    imf_noise = [
        r"INTERNATIONAL MONETARY FUND",
        r"WORLD\s+ECONOMIC\s+OUTLOOK",
        r"\|\s*April\s+\d{4}",
        r"^CONTENTS$|^DATA$|^PREFACE$|^FOREWORD$|^EXECUTIVE SUMMARY$",
        r"^ASSUMPTIONS AND CONVENTIONS$|^FURTHER INFORMATION$|^ERRATA$",
        r"^Chapter\s+\d+.*$",
        r"^(Table|Figure|Box|Annex)\s+[A-Z0-9].*$",
        r"^\s*[ivxlcdm]+\s*$",   # Roman numerals
        r"^\s*\d+\s*$",          # arabic page nos
    ]
    for pat in imf_noise:
        text = re.sub(pat, " ", text, flags=re.I | re.M)

    # remove remaining newlines → single spaces
    text = text.replace("\n", " ")
    text = re.sub(r"\s{2,}", " ", text).strip()
    return text


# ────────────────────────────────────────────────────────────────
# 2.  Entrypoint
# ────────────────────────────────────────────────────────────────
def main() -> None:
    raw_path = Path.cwd() / "2504_IMF_WOO.txt"
    if not raw_path.exists():
        sys.exit(f"❌  {raw_path.name} not found in {Path.cwd()}")

    raw_text = raw_path.read_text(encoding="utf-8", errors="ignore")
    cleaned   = clean_text(raw_text)

    out_path = raw_path.with_suffix(".cleaned.txt")
    out_path.write_text(cleaned, encoding="utf-8")

    print(f"✅  Saved cleaned text → {out_path.name}  "
          f"({len(cleaned):,} characters)")


if __name__ == "__main__":
    main()


✅  Saved cleaned text → 2504_IMF_WOO.cleaned.txt  (624,877 characters)


### 4.0 Chunking Documents for RAG

#### 🔹 What Is “Chunking” in RAG?

*Chunking* means slicing long documents into smaller, self-contained pieces (“chunks”) so they fit the model’s token window and can be embedded, indexed, and retrieved accurately. The aim is to keep **enough context** for a useful answer while **avoiding overly large inputs** that waste tokens or hurt precision.

---

#### Common Chunking Methods

| Method | How it works | Best for |
|--------|--------------|----------|
| **Fixed-length windows** | Split every *N* tokens/characters, often with 10–20 % overlap. | Logs, code, data dumps where structure ≈ length. |
| **Sentence/paragraph split** | Use an NLP splitter; keep full sentences or paragraphs. | Narrative or news text; avoids mid-sentence cuts. |
| **Recursive / semantic split** | Split on headings → paragraphs → sentences until each piece < limit (e.g., LangChain `RecursiveCharacterTextSplitter`). | Long structured docs (white papers, legal contracts). |
| **Sliding window at retrieval** | No pre-processing; generate overlapping windows on demand around query anchors. | Recall-critical QA (wikis, forums) when storage is cheap. |
| **Adaptive / LLM-assisted** | An LLM places boundaries where topics shift. | Highly variable content; experimental but coherent. |
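
As one way to picture the recursive row of the table, here is a small pure-Python sketch. It is not LangChain's implementation: the function name, the separator order, and the use of character counts instead of tokens are all simplifying assumptions.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing into oversized parts."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # nothing left to split on: fall back to a hard cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_len:
            current = candidate          # keep packing small parts together
            continue
        if current:
            chunks.append(current)       # the separator at a cut point is dropped
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            chunks.extend(recursive_split(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Heading one\n\nFirst sentence here. Second sentence here.\n\nTail"
for chunk in recursive_split(sample, max_len=25):
    print(repr(chunk))
```

Paragraph breaks are tried first, so a chunk only falls back to sentence or word boundaries when a paragraph on its own exceeds the limit.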

---

#### Choosing a Strategy

* **Code & logs:** fixed 400-token windows + 10 % overlap.  
* **Technical reports / legal PDFs:** recursive splitting on headings.  
* **Emails & web articles:** paragraph/sentence chunks of ~300-500 tokens.  
* **Large wiki corpora:** sliding windows to maximise recall.  
* **Mixed formats needing topic coherence:** try LLM-assisted splitting.

> **Rule of thumb:** keep chunks **200–800 tokens** and add a small overlap when continuity matters.
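
The fixed-window recommendation (400 tokens with 10 % overlap) works out as follows. This is a minimal sketch over a list of integers standing in for token IDs; the function name `fixed_windows` is hypothetical and no tokenizer is involved.

```python
def fixed_windows(tokens, size, overlap):
    """Cut tokens into windows of `size` that share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):   # this window already reaches the end
            break
    return windows


tokens = list(range(1000))                 # stand-in for 1,000 token IDs
windows = fixed_windows(tokens, size=400, overlap=40)

print([len(w) for w in windows])           # [400, 400, 280]
print(windows[0][-40:] == windows[1][:40]) # True: 40 tokens are shared
```

With a 400-token window and 40 tokens of overlap, each new window starts 360 tokens after the previous one, so 1,000 tokens produce three chunks.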


In [None]:
#!/usr/bin/env python3
"""
embed_and_upload_chunks.py
──────────────────────────
• Reads IMF WEO cleaned text (2504_IMF_WOO.cleaned.txt) + page_map (weo25_page_map.json)
• Splits into ≈500-token chunks (10 % overlap, heading-aware)
• Embeds with Azure OpenAI (text-embedding-3-small, or whatever you set)
• Uploads to Azure AI Search index01
• Shows INFO logging + tqdm progress bars (Jupyter-friendly)
"""

import os, sys, json, re, logging
from pathlib import Path
from dotenv import load_dotenv
from tqdm.auto import tqdm

import tiktoken               # pip install tiktoken
import openai                 # pip install openai>=1.3
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# ── 1.  Logging ────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%H:%M:%S",
)
log = logging.getLogger("IMF-Embed")

# ── 2.  Load .env & pull parameters ───────────────────────────────
load_dotenv()  # reads .env in cwd

# Azure Search
SEARCH_ENDPOINT   = os.getenv("AZURE_SEARCH_ENDPOINT")
SEARCH_ADMIN_KEY  = os.getenv("AZURE_SEARCH_ADMIN_KEY")
SEARCH_INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME", "index01")
if not SEARCH_ENDPOINT or not SEARCH_ADMIN_KEY:
    sys.exit("❌  Set AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_ADMIN_KEY in .env")

# Azure OpenAI
AOAI_ENDPOINT     = os.getenv("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
AOAI_KEY          = os.getenv("AZURE_OPENAI_API_KEY")
AOAI_API_VERSION  = os.getenv("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")
EMBED_DEPLOYMENT  = os.getenv("AZURE_TEXT_EMBEDDING_DEPLOYMENT_NAME")
if not AOAI_ENDPOINT or not AOAI_KEY or not EMBED_DEPLOYMENT:
    sys.exit("❌  Missing Azure OpenAI env vars (endpoint/key/deployment).")

# Files
CLEAN_FILE = Path("2504_IMF_WOO.cleaned.txt")
PAGE_MAP   = Path("weo25_page_map.json")   # [{start,end,page}, …]

# Chunking params
CHUNK_TOKENS = 500
OVERLAP      = 50
EMB_BATCH    = 16
UPL_BATCH    = 100

enc = tiktoken.get_encoding("cl100k_base")

# ── 3.  Helpers ────────────────────────────────────────────────────
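from itertools import count

def slide(tokens, size, step):
    """Yield overlapping token windows, stopping once a window reaches the end."""
    for i in range(0, len(tokens), step):
        yield tokens[i : i + size]
        if i + size >= len(tokens):      # a further window would be pure overlap
            break

_chunk_no = count()   # running counter keeps IDs unique across heading blocks

def make_chunks(text: str, meta: dict):
    """Turn one text block into overlapping chunk records."""
    step = CHUNK_TOKENS - OVERLAP
    for win in slide(enc.encode(text), CHUNK_TOKENS, step):
        yield {
            "id": f"{meta['parent']}_p{meta['page']}_c{next(_chunk_no):04}",
            "parentId": meta["parent"],
            "chapter": meta["chapter"],
            "pageStart": meta["page"],
            "pageEnd": meta["page"],
            "raw": enc.decode(win),
            "@search.action": "upload",
        }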

# ── 4.  Load cleaned text & page map ──────────────────────────────
if not (CLEAN_FILE.exists() and PAGE_MAP.exists()):
    sys.exit("❌  Clean file or page map JSON missing")

full_text = CLEAN_FILE.read_text("utf-8")
page_map  = json.loads(PAGE_MAP.read_text())

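log.info("Splitting document on headings …")
# NOTE: this pattern needs a line break on both sides of each heading.
# clean_text() in section 3 replaces every newline with a space, so on that
# output nothing matches and the whole document becomes one "Preamble" block.
head_pat = re.compile(r"\n([A-Z][^\n]{3,100})\n")
blocks   = head_pat.split(full_text)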

chunks = []
for i in range(0, len(blocks), 2):
    heading = blocks[i - 1].strip() if i else "Preamble"
    body    = blocks[i]
    start   = full_text.find(body)
    page    = next(p["page"] for p in page_map
                   if p["start"] <= start < p["end"])
    meta    = {"parent": "WEO25", "chapter": heading, "page": page}
    chunks.extend(make_chunks(body, meta))

log.info("Generated %s chunks (≈%s tokens ea).", len(chunks), CHUNK_TOKENS)

# ── 5.  Azure OpenAI embed ────────────────────────────────────────
openai_client = openai.AzureOpenAI(
    api_key     = AOAI_KEY,
    api_version = AOAI_API_VERSION,
    base_url    = f"{AOAI_ENDPOINT}/openai/deployments/{EMBED_DEPLOYMENT}",
)

log.info("Embedding with %s …", EMBED_DEPLOYMENT)
for i in tqdm(range(0, len(chunks), EMB_BATCH), desc="Embedding", unit="batch"):
    batch  = chunks[i : i + EMB_BATCH]
    inputs = [c["raw"] for c in batch]
    resp   = openai_client.embeddings.create(model=EMBED_DEPLOYMENT, input=inputs)
    for rec, emb in zip(batch, resp.data):
        rec["rawVector"] = emb.embedding

# ── 6.  Upload to Azure Search ────────────────────────────────────
search = SearchClient(
    endpoint    = SEARCH_ENDPOINT,
    index_name  = SEARCH_INDEX_NAME,
    api_version = "2024-07-01",
    credential  = AzureKeyCredential(SEARCH_ADMIN_KEY),
)

log.info("Uploading to Search index %s …", SEARCH_INDEX_NAME)
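for i in tqdm(range(0, len(chunks), UPL_BATCH), desc="Uploading", unit="batch"):
    batch   = chunks[i : i + UPL_BATCH]
    # The index created in section 1.0 defines only "id" and "contentVector";
    # documents carrying other fields are rejected, so send just those two.
    docs    = [{"id": c["id"], "contentVector": c["rawVector"]} for c in batch]
    results = search.upload_documents(docs)
    failed  = [r for r in results if not r.succeeded]
    if failed:
        log.warning("%d failures starting at ID %s", len(failed), batch[0]["id"])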

log.info("✅  Finished — %s chunks embedded & indexed.", len(chunks))
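
The page lookup in step 4 scans `page_map` from the start for every block, and `next(...)` raises `StopIteration` when no range contains the offset. A possible alternative is sketched below, assuming the entries are sorted, non-overlapping `[start, end)` character ranges in the `{start, end, page}` shape noted next to `PAGE_MAP`; the helper name `build_page_lookup` and the sample ranges are made up.

```python
from bisect import bisect_right


def build_page_lookup(page_map):
    """Return a function mapping a character offset to its page (or None)."""
    starts = [p["start"] for p in page_map]   # assumed sorted ascending

    def page_for(offset):
        i = bisect_right(starts, offset) - 1  # last range starting at or before offset
        if i >= 0 and page_map[i]["start"] <= offset < page_map[i]["end"]:
            return page_map[i]["page"]
        return None                           # offset falls outside every range

    return page_for


# Hypothetical page map with two pages
sample_map = [
    {"start": 0, "end": 100, "page": 1},
    {"start": 100, "end": 250, "page": 2},
]
page_for = build_page_lookup(sample_map)

print(page_for(0), page_for(99), page_for(100), page_for(250))
```

Returning `None` for an unmapped offset lets the caller decide whether to skip the block or assign a fallback page instead of stopping the whole run.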
