# Mazda Owner's Manual RAG — **MVP Orchestration Notebook (v2)**

**Goal:** Keep the Mazda-specific pipeline structure, and include a small in-notebook BM25 test plus clear comments on swapping in FAISS or ColBERT.
Outputs are exported to `/app/` so your live app can use them immediately.

## 0) Setup (install if needed)

This notebook is dependency-light: the BM25-like demo runs as-is. To use FAISS or ColBERT instead, install the library and wire it in through the optional cells later.

## 1) Config paths

In [None]:
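from pathlib import Path
import json, csv, re, time, shutil, math

# Mount Google Drive only when running inside Colab; skip the mount on a local machine.
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    print("google.colab not available; skipping Drive mount.")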

In [None]:
from pathlib import Path

# Set project output root explicitly
WORK_ROOT = Path("/Users/blackcat/MyDrive/AugWorlds/Mazda_PDFs")

# Define subdirectories inside Mazda_PDFs for pipeline outputs
PARSED_DIR = WORK_ROOT / "parsed"
ENRICHED_DIR = WORK_ROOT / "enriched"
INDEX_DIR = WORK_ROOT / "index"
APP_DIR = WORK_ROOT / "app"  # app configs also stored here

# Mazda-specific input PDF
INPUT_PDF = WORK_ROOT / "2024-cx-50-owners-manual.pdf"  # Mazda PDF lives in same folder
MODEL_YEAR = "2024"  # for configs if needed

# Config files used by the live app
SYNONYMS_CSV = APP_DIR / "config" / "synonyms.csv"
ANS_PLANS_JSON = APP_DIR / "config" / "answer_plans.json"
RETRIEVAL_CFG = APP_DIR / "config" / "retrieval_config.yaml"

# Ensure output directories exist
for d in [PARSED_DIR, ENRICHED_DIR, INDEX_DIR, APP_DIR, APP_DIR / "config"]:
    d.mkdir(parents=True, exist_ok=True)

print("WORK_ROOT:", WORK_ROOT)
print("INPUT_PDF:", INPUT_PDF)

if INPUT_PDF.is_file():
    print(f"✅ Found Mazda PDF: {INPUT_PDF}")
else:
    print(f"❌ PDF not found or path incorrect: {INPUT_PDF}")


## 2) Parsing: chunk pages + harvest TOC lines

In [None]:
# Replace with your real parser (PyMuPDF/Tika). Demo JSONL for structure.
SECTIONS_JSONL = PARSED_DIR / f"sections_{MODEL_YEAR}.jsonl"
if not SECTIONS_JSONL.exists():
    demo_sections = [
        {"manual_ref": "1-1", "title": "Introduction", "text": "Welcome to your Mazda owner's manual."},
        {"manual_ref": "2-14", "title": "TPMS", "text": "TPMS monitors tire pressure and warns if it is low."},
        {"manual_ref": "3-2", "title": "Engine Oil", "text": "Use 0W-20. Check level regularly."},
    ]
    with open(SECTIONS_JSONL, "w", encoding="utf-8") as f:
        for i, row in enumerate(demo_sections):
            # Explicitly add _id starting from 0 for demo data consistency
            row["_id"] = i
            f.write(json.dumps(row) + "\n")
print("Parsed →", SECTIONS_JSONL)

Parsed → /Users/blackcat/MyDrive/AugWorlds/Mazda_PDFs/parsed/sections_2024.jsonl
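
The heading above mentions harvesting TOC lines, which the demo cell stubs out. Here is a minimal sketch of that step, assuming each TOC entry is a title, then dot leaders or spaces, then a chapter-page reference such as `2-14`. The regex, the `harvest_toc_lines` name, and the sample page text are illustrative assumptions, not the confirmed layout of the Mazda PDF:

```python
import re

# Assumed TOC line shape: "<title><dot leaders or 2+ spaces><chapter>-<page>"
TOC_LINE = re.compile(r"^(?P<title>.+?)[ .]{2,}(?P<ref>\d+-\d+)\s*$")

def harvest_toc_lines(page_text):
    """Return [{'manual_ref', 'title'}] for lines matching the assumed TOC shape."""
    entries = []
    for line in page_text.splitlines():
        m = TOC_LINE.match(line.strip())
        if m:
            entries.append({"manual_ref": m.group("ref"),
                            "title": m.group("title").strip()})
    return entries

# Hypothetical page text standing in for real parser output
sample_page = """Table of Contents
Introduction .......... 1-1
TPMS .................. 2-14
Engine Oil   3-2
This line is body text, not a TOC entry."""

toc = harvest_toc_lines(sample_page)
print(toc)
```

With a real parser, the page text would come from the PDF rather than a string literal, and the pattern would likely need tuning against the actual TOC pages.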


## 3) Enrichment: fill `manual_ref` for Top-20 plans (per year/manual)

In [None]:

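def load_synonyms(path=SYNONYMS_CSV):
    """Build an alias -> canonical map from a CSV with `canonical` and `alias_list` columns."""
    norm = {}
    if path.exists():
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                canonical = row["canonical"].strip().lower()
                # Aliases are pipe-separated; skip blanks so an empty alias cannot match at every word boundary.
                for alias in row["alias_list"].split("|"):
                    alias = alias.strip().lower()
                    if alias:
                        norm[alias] = canonical
    return norm

def normalize_text(text: str, norm_map: dict) -> str:
    """Replace each whole-word alias (case-insensitive) with its canonical form."""
    out = text
    for alias, canon in norm_map.items():
        out = re.sub(rf"\b{re.escape(alias)}\b", canon, out, flags=re.IGNORECASE)
    return out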

norm_map = load_synonyms()

ENRICHED_JSONL = ENRICHED_DIR / f"sections_{MODEL_YEAR}.enriched.jsonl"
with open(SECTIONS_JSONL, "r", encoding="utf-8") as fin, open(ENRICHED_JSONL, "w", encoding="utf-8") as fout:
    for line in fin:
        rec = json.loads(line)
        rec["text_norm"] = normalize_text(rec["text"], norm_map)
        rec["metadata"] = {"source": "mazda_demo", "model_year": MODEL_YEAR, "ts": int(time.time())}
        fout.write(json.dumps(rec) + "\n")

print("Enriched →", ENRICHED_JSONL)


Enriched → /Users/blackcat/MyDrive/AugWorlds/Mazda_PDFs/enriched/sections_2024.enriched.jsonl
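
`load_synonyms` reads a CSV with a `canonical` column and a pipe-separated `alias_list` column. The sketch below shows that format and its effect on a sentence. The sample rows and the `build_norm_map` and `normalize` helper names are made up for illustration and use an in-memory string instead of the real `synonyms.csv`:

```python
import csv, io, re

# Hypothetical synonyms.csv content; only the column names come from the notebook.
SAMPLE_CSV = """canonical,alias_list
TPMS,tire pressure monitoring system|tyre pressure monitor
engine oil,motor oil
"""

def build_norm_map(csv_text):
    """Map each lowercased alias to its lowercased canonical term."""
    norm = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        canonical = row["canonical"].strip().lower()
        for alias in row["alias_list"].split("|"):
            alias = alias.strip().lower()
            if alias:
                norm[alias] = canonical
    return norm

def normalize(text, norm_map):
    """Replace whole-word aliases, ignoring case."""
    for alias, canon in norm_map.items():
        text = re.sub(rf"\b{re.escape(alias)}\b", canon, text, flags=re.IGNORECASE)
    return text

norm_map_demo = build_norm_map(SAMPLE_CSV)
normalized = normalize(
    "Check the Motor Oil and the tire pressure monitoring system.", norm_map_demo
)
print(normalized)  # Check the engine oil and the tpms.
```

Because canonical terms are lowercased, normalized text loses the original casing of replaced terms. The tokenizer lowercases everything later, so this does not affect retrieval.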


## 4) Indexing: build lightweight retrieval artifacts (BM25-ish demo)

*This cell provides an in-notebook BM25-like index so the notebook works out-of-the-box.*

**Swap to FAISS/ColBERT**:
- FAISS: create vectors for `text_norm`, save `faiss.index` + `docs.jsonl` under `app/index/`.
- ColBERT: run your existing CLI to build an index and then point to it via `app/config/retrieval_config.yaml`.

Keep the *calling interface* below so your app code remains unchanged.
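
One way to keep the calling interface stable is to hide the backend behind a single `search(query, k)` method that returns the same result fields as the notebook's `bm25_search` (`doc_id`, `score`, `manual_ref`, `title`, `snippet`). The sketch below is an assumption about how that could look. `Retriever`, `KeywordRetriever`, and `answer_sources` are hypothetical names, and the toy backend only counts shared terms, so it stands in for BM25, FAISS, or ColBERT without implementing any of them:

```python
from typing import Dict, List, Protocol

class Retriever(Protocol):
    """Anything with this method can back the app (hypothetical interface)."""
    def search(self, query: str, k: int = 5) -> List[Dict]: ...

class KeywordRetriever:
    """Toy stand-in backend: ranks docs by the number of shared query terms."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, query, k=5):
        terms = set(query.lower().split())
        hits = []
        for d in self.docs:
            overlap = len(terms & set(d["text_norm"].lower().split()))
            if overlap:
                hits.append({
                    "doc_id": d["_id"], "score": float(overlap),
                    "manual_ref": d["manual_ref"], "title": d["title"],
                    "snippet": d["text_norm"][:240],
                })
        return sorted(hits, key=lambda h: h["score"], reverse=True)[:k]

def answer_sources(retriever: Retriever, query: str, k: int = 5):
    """App-side code depends only on .search(), not on the backend."""
    return [(h["manual_ref"], h["title"]) for h in retriever.search(query, k)]

demo_docs = [
    {"_id": 0, "manual_ref": "2-14", "title": "TPMS",
     "text_norm": "TPMS monitors tire pressure and warns if it is low."},
    {"_id": 1, "manual_ref": "3-2", "title": "Engine Oil",
     "text_norm": "Use 0W-20. Check level regularly."},
]
sources = answer_sources(KeywordRetriever(demo_docs), "tire pressure")
print(sources)  # [('2-14', 'TPMS')]
```

A FAISS or ColBERT backend would be another class with the same `search` method, so `answer_sources` and similar app code would not need to change.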

In [None]:

from collections import defaultdict

INDEX_JSON = INDEX_DIR / f"bm25_index_{MODEL_YEAR}.json"
DOCS_JSON = INDEX_DIR / f"docs_{MODEL_YEAR}.jsonl"

# Load docs
docs = []
with open(ENRICHED_JSONL, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        rec = json.loads(line)
        rec["_id"] = i
        docs.append(rec)

def tokenize(s):
    import re
    return re.findall(r"[a-z0-9_]+", s.lower())

N = len(docs)
df = defaultdict(int)
postings = defaultdict(list)   # term -> list of (doc_id, tf)
doc_len = {}

for d in docs:
    tokens = tokenize(d["text_norm"])
    doc_len[d["_id"]] = len(tokens)
    tf = defaultdict(int)
    for t in tokens:
        tf[t] += 1
    for t, c in tf.items():
        df[t] += 1
        postings[t].append((d["_id"], c))

index_obj = {
    "N": N,
    "df": dict(df),
    "doc_len": {int(k): int(v) for k, v in doc_len.items()},
    "postings": {t: [(int(d), int(tf)) for d, tf in lst] for t, lst in postings.items()},
}

INDEX_DIR.mkdir(parents=True, exist_ok=True)
with open(DOCS_JSON, "w", encoding="utf-8") as f:
    for d in docs:
        f.write(json.dumps(d) + "\n")
with open(INDEX_JSON, "w", encoding="utf-8") as f:
    json.dump(index_obj, f)

print("Index written →", INDEX_JSON)


Index written → /Users/blackcat/MyDrive/AugWorlds/Mazda_PDFs/index/bm25_index_2024.json


## 5) Retrieval helper (for your app)

BM25-ish search now; swap internals for FAISS/ColBERT and keep the same function signature.

In [None]:
def load_index(path=INDEX_JSON):
    import json
    from pathlib import Path
    return json.loads(Path(path).read_text())

def bm25_search(query, k=5, k1=1.5, b=0.75):
    """Score docs against `query` with BM25 and return the top-k hits as dicts."""
    import math, re, json
    from collections import defaultdict

    idx = load_index()
    N = idx["N"]
    tokens = re.findall(r"[a-z0-9_]+", query.lower())
    # Average document length; fall back to 1.0 so empty docs cannot cause a division by zero.
    avgdl = sum(idx["doc_len"].values()) / max(1, N) or 1.0

    # Load docs into a dictionary keyed by their _id
    doc_map = {}
    with open(DOCS_JSON, "r", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            doc_map[doc["_id"]] = doc

    scores = defaultdict(float)
    for t in tokens:
        df = idx["df"].get(t, 0)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        for doc_id, tf in idx["postings"].get(t, []):
            # JSON object keys are strings, so look up the stored length by str(doc_id).
            dl = idx["doc_len"][str(doc_id)]
            denom = tf + k1 * (1 - b + b * dl / avgdl)
            scores[doc_id] += idf * (tf * (k1 + 1)) / denom
    top = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

    # Map the top doc ids back to the fields the app needs
    results = []
    for doc_id, score in top:
        d = doc_map[doc_id]
        results.append({
            "doc_id": doc_id, "score": score,
            "manual_ref": d["manual_ref"], "title": d["title"], "snippet": d["text_norm"][:240]
        })
    return results

print(bm25_search("tpms warning light"))

[{'doc_id': 1, 'score': 0.8626796489204477, 'manual_ref': '2-14', 'title': 'TPMS', 'snippet': 'TPMS monitors tire pressure and warns if it is low.'}]
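
The score in the output above can be reproduced by hand. Of the three query terms, only `tpms` appears in the demo corpus, so the score is a single IDF times a length-normalized term frequency. The sketch below redoes that arithmetic with the defaults `k1=1.5` and `b=0.75`. It assumes no synonyms file was present, so `text_norm` equals the raw demo text, and the variable names are local to this example:

```python
import math, re

texts = [
    "Welcome to your Mazda owner's manual.",
    "TPMS monitors tire pressure and warns if it is low.",
    "Use 0W-20. Check level regularly.",
]
# Same tokenizer pattern as the notebook: lowercase runs of [a-z0-9_]
lengths = [len(re.findall(r"[a-z0-9_]+", t.lower())) for t in texts]  # [7, 10, 6]
n_docs = len(texts)
avgdl = sum(lengths) / n_docs

k1, b = 1.5, 0.75
term_freq, doc_freq, dl = 1, 1, lengths[1]  # "tpms" occurs once, in one doc

idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
tf_norm = term_freq * (k1 + 1) / (term_freq + k1 * (1 - b + b * dl / avgdl))
score = idf * tf_norm
print(round(idf, 4), round(tf_norm, 4), round(score, 4))  # 0.9808 0.8795 0.8627
```

The TPMS section is longer than the average document (10 tokens against about 7.67), so the length normalization pulls the term-frequency factor below 1.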


## 6) One-click: Copy artifacts into `/app/` layout

In [None]:

def export_to_app():
    targets = [
        (PARSED_DIR, APP_DIR / "data" / "parsed"),
        (ENRICHED_DIR, APP_DIR / "data" / "enriched"),
        (INDEX_DIR, APP_DIR / "index"),
    ]
    for src, dst in targets:
        dst.mkdir(parents=True, exist_ok=True)
        for p in src.glob("**/*"):
            if p.is_file():
                rel = p.relative_to(src)
                (dst / rel).parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(p, dst / rel)
    print("Export complete →", APP_DIR.resolve())

export_to_app()


Export complete → /Users/blackcat/MyDrive/AugWorlds/Mazda_PDFs/app
