Conceptually: strong idea, and the “mixture of experts” framing maps very naturally to how economists actually work (parse → frame → model → solve → interpret → critique assumptions). The main question is implementation strategy so you get something usable quickly without getting trapped in “train-a-giant-model-from-scratch” territory.

1) What I think works best in practice (and why)

A. Start as agentic MoE (router + specialist experts), not “true MoE pretraining”

“True” MoE (multiple expert subnetworks inside one transformer trained jointly) is powerful but expensive and operationally heavy. You can get 80–90% of the benefit by doing:
	•	A Router/Framer model that classifies + structures the task
	•	Specialist expert models (or LoRA adapters) per subfield/task
	•	A tool/code expert that writes and executes code
	•	A critic that checks assumptions, units, equilibrium conditions, identification strategy, etc.

This is also far easier to iterate on, evaluate, and swap components.

B. Use RAG + structured knowledge as the foundation, fine-tuning second

For economics, correctness often depends on definitions, assumptions, and canonical results. RAG over your curated sources plus a knowledge graph of concepts is extremely effective, and it avoids “baking in” errors.

Then fine-tune (LoRA/QLoRA) to improve:
	•	framing templates
	•	step ordering
	•	style (econ-precise)
	•	consistent notation
	•	code patterns

C. You should treat “econ ML expert → code” as a first-class toolchain

Make code generation verifiable:
	•	generated code runs in a sandbox (Python)
	•	includes tests or sanity checks (units, signs, boundary cases)
	•	produces plots/tables where relevant
	•	logs decisions and assumptions

That makes the system materially better than a “smart text bot”.

2) Data plan: what to use and what to avoid

What’s ideal
	•	Your own notes, homework solutions, summaries, derivations, cheat sheets
	•	Problem sets where you have the right to use the material
	•	Public-domain / permissively licensed resources
	•	Self-generated Q/A pairs from your notes (with careful review)

What’s risky
	•	Textbooks: typically copyrighted. Using them for training is a legal/licensing question.
	•	Quizlet: content and access are governed by their Terms; scraping or bulk use can be a problem, and sets may be copyrighted by the creators.

Pragmatic approach:
	•	Use copyrighted sources for private RAG if you have lawful access and keep it internal (still not legal advice), but avoid training directly on them unless you have permission.
	•	Prefer converting what you learn into your own “canonical cards” (definitions, theorems, standard derivations, solved examples) and train on those.

3) Proposed expert lineup (a concrete blueprint)
	1.	Problem Digestor (Econ Parser)

	•	Extracts: agents, goods, constraints, timing, information, equilibrium concept
	•	Detects: “this is actually a mechanism design / adverse selection / IV / general equilibrium / game” etc.
	•	Output format: a structured schema (JSON)

	2.	Field Router

	•	Multi-label classification: micro theory, macro, IO, metrics, finance, public, labor, trade, dev, behavioral, etc.
	•	Routes to 1–3 specialists, not just one

	3.	Model Framer

	•	Select canonical model class + assumptions
	•	Offers alternatives (e.g., partial vs general equilibrium; static vs dynamic; OLS vs IV vs DiD)

	4.	Solver / Derivation Expert

	•	Produces the math (FOCs, comparative statics, equilibrium existence, welfare)

	5.	Empirical Strategy Expert (Metrics)

	•	Identification strategy, threats, robustness, DAG thinking
	•	Suggests data transforms + estimator choice

	6.	Code Expert

	•	Turns the framed model/estimation into reproducible code
	•	Adds checks, diagnostics, and figures

	7.	Critic / Referee #2

	•	Assumption audit, sign checks, edge cases
	•	“What would a journal referee attack?”

This is exactly the kind of structure that benefits from a router + specialists.

4) Minimum Viable Product I’d build first (fast path)

MVP goal: A system that takes any econ question and returns:
	•	structured framing (schema)
	•	recommended approach (model/estimator)
	•	solution sketch (math or logic)
	•	optional runnable code (if asked)
	•	a short “assumptions & failure modes” section

Implementation: start with one strong base LLM + prompt-routed agents + RAG over your notes.

Then iterate:
	•	fine-tune router (cheap)
	•	fine-tune “framer” and “critic” on your canonical format
	•	later: LoRA adapters per subfield

5) The hardest parts (so you don’t get blindsided)
	•	Evaluation: you need a gold set of questions with expected framing and key points. Without this, progress becomes vibes-based.
	•	Notation consistency: econ is fragile to notation drift; solve with a global style guide + schema + critic enforcement.
	•	Data cleanliness: your notes/homework will be noisy. The trick is converting them into clean, structured “cards” and “worked examples”.
	•	Copyright/ToS: biggest non-technical risk. Design the pipeline so you can swap in compliant datasets easily.

6) If you want my recommendation in one line

Build it as an agentic MoE with a strong router + RAG + verifier, then add LoRA specialists once you’ve locked the schema and evaluation suite.

If you reply with (a) which subfields you care about most (top 5) and (b) your intended runtime stack (local Ollama vs API vs server GPUs), I’ll propose a concrete architecture diagram + repo structure + the exact schemas/prompts for each expert so you can start implementing immediately.

pip install pymupdf neo4j openai tiktoken
export OPENAI_API_KEY="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"
gemini AIzaSyAODasJVvnnhDYAu7-qh9B7jZKbfo-3B7w

In [16]:
import argparse
import hashlib
import os
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

import fitz  # PyMuPDF
from neo4j import GraphDatabase
from openai import OpenAI
from google import genai
from google.genai import types


# -----------------------------
# Config
# -----------------------------
DEFAULT_EMBED_MODEL = "text-embedding-3-small"  # 1536 dims in common usage
# OpenAI embeddings guide lists text-embedding-3-small / 3-large as current.   [oai_citation:4‡OpenAI Platform](https://platform.openai.com/docs/guides/embeddings?utm_source=chatgpt.com)


In [17]:

# -----------------------------
# Helpers
# -----------------------------
def sha1(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8", errors="ignore")).hexdigest()

def clean_text(s: str) -> str:
    s = s.replace("\u00a0", " ")
    s = re.sub(r"[ \t]+\n", "\n", s)
    s = re.sub(r"\n{3,}", "\n\n", s)
    return s.strip()

def chunk_text(text: str, target_chars: int = 2500, overlap_chars: int = 250) -> List[str]:
    """
    Simple, robust chunker (char-based) that tries to keep paragraphs together.
    Good enough for v0; you can swap to token-based later.
    """
    if not text:
        return []
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    cur = ""

    for p in paras:
        if len(cur) + len(p) + 2 <= target_chars:
            cur = (cur + "\n\n" + p).strip() if cur else p
        else:
            if cur:
                chunks.append(cur)
            # if paragraph is huge, hard-split
            if len(p) > target_chars:
                start = 0
                while start < len(p):
                    end = min(len(p), start + target_chars)
                    chunks.append(p[start:end])
                    start = max(end - overlap_chars, end)
                cur = ""
            else:
                cur = p

    if cur:
        chunks.append(cur)

    # add overlap between chunks
    if overlap_chars > 0 and len(chunks) > 1:
        out = []
        for i, c in enumerate(chunks):
            if i == 0:
                out.append(c)
            else:
                prev = out[-1]
                overlap = prev[-overlap_chars:] if len(prev) > overlap_chars else prev
                out.append((overlap + "\n\n" + c).strip())
        return out

    return chunks

@dataclass
class TocItem:
    level: int
    title: str
    start_page: int  # 1-based page number (matches PyMuPDF TOC)
    end_page: int    # computed

def compute_end_pages(toc: List[Tuple[int, str, int]], total_pages: int) -> List[TocItem]:
    """
    toc is list of (level, title, page) with 1-based page numbers
    end_page is next item page - 1 at same or higher level, else total_pages
    """
    items = [TocItem(lvl, title.strip(), int(pg), total_pages) for (lvl, title, pg) in toc]
    for i in range(len(items)):
        this = items[i]
        end = total_pages
        for j in range(i + 1, len(items)):
            nxt = items[j]
            if nxt.level <= this.level:
                end = max(this.start_page, nxt.start_page - 1)
                break
        items[i].end_page = end
    return items


In [18]:
# -----------------------------
# Neo4j writers (NO APOC)
# -----------------------------

SCHEMA_CYPHER = """
MERGE (b:Book {book_id: $book_id})
SET b.title = $title,
    b.authors = $authors,
    b.edition = $edition,
    b.year = $year,
    b.total_pages = $total_pages;
"""

UPSERT_PAGES_CYPHER = """
UNWIND $pages AS row
MERGE (p:Page {book_id: $book_id, page_no: row.page_no})
SET p.text = row.text,
    p.checksum = row.checksum
WITH p
MATCH (b:Book {book_id: $book_id})
MERGE (b)-[:HAS_PAGE]->(p);
"""

# Outline ingestion in 3 passes (Neo4j 5 safe; no MATCH inside FOREACH)
UPSERT_OUTLINE_NODES_CYPHER = """
UNWIND $items AS row
MERGE (o:Outline {outline_id: row.outline_id})
SET o.kind = row.kind,
    o.level = row.level,
    o.title = row.title,
    o.start_page = row.start_page,
    o.end_page = row.end_page;
"""

UPSERT_OUTLINE_BOOK_REL_CYPHER = """
UNWIND $items AS row
WITH row
WHERE row.parent_outline_id IS NULL
MATCH (b:Book {book_id: $book_id})
MATCH (o:Outline {outline_id: row.outline_id})
MERGE (b)-[:HAS_OUTLINE]->(o);
"""

UPSERT_OUTLINE_PARENT_REL_CYPHER = """
UNWIND $items AS row
WITH row
WHERE row.parent_outline_id IS NOT NULL
MATCH (parent:Outline {outline_id: row.parent_outline_id})
MATCH (o:Outline {outline_id: row.outline_id})
MERGE (parent)-[:HAS_CHILD]->(o);
"""

LINK_COVERS_CYPHER = """
UNWIND $items AS row
MATCH (o:Outline {outline_id: row.outline_id})
MATCH (p:Page {book_id: $book_id, page_no: row.page_no})
MERGE (o)-[:COVERS]->(p);
"""

UPSERT_CHUNKS_CYPHER = """
UNWIND $chunks AS row
MERGE (c:Chunk {chunk_id: row.chunk_id})
SET c.book_id = $book_id,
    c.page_no = row.page_no,
    c.chunk_index = row.chunk_index,
    c.text = row.text,
    c.checksum = row.checksum
WITH c, row
MATCH (p:Page {book_id: $book_id, page_no: row.page_no})
MERGE (p)-[:HAS_CHUNK]->(c);
"""

# Store embeddings as LIST<FLOAT> property (no procedures)
SET_EMBEDDINGS_CYPHER = """
UNWIND $rows AS row
MATCH (c:Chunk {chunk_id: row.chunk_id})
SET c.embedding = row.embedding
RETURN count(*) AS updated;
"""

SUPPORTED_BY_CYPHER = """
MATCH (o:Outline)-[:COVERS]->(:Page)-[:HAS_CHUNK]->(c:Chunk)
MERGE (o)-[:SUPPORTED_BY]->(c);
"""

def kind_for_level(level: int) -> str:
    if level == 1: return "chapter"
    if level == 2: return "subchapter"
    if level == 3: return "topic"
    return "subtopic"

In [19]:

# -----------------------------
# Main
# -----------------------------
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--pdf", required=True, help="Path to textbook PDF")
    ap.add_argument("--book-id", required=True)
    ap.add_argument("--title", required=True)
    ap.add_argument("--authors", default="")
    ap.add_argument("--edition", default="")
    ap.add_argument("--year", default="")
    ap.add_argument("--neo4j-uri", default="bolt://localhost:7687")
    ap.add_argument("--neo4j-user", default="neo4j")
    ap.add_argument("--neo4j-pass", default="testpassword")
    ap.add_argument("--embed-model", default=DEFAULT_EMBED_MODEL)
    ap.add_argument("--embed-batch", type=int, default=64)
    args = ap.parse_args()

    if not os.path.exists(args.pdf):
        raise FileNotFoundError(args.pdf)

    # ---- Read PDF
    doc = fitz.open(args.pdf)
    total_pages = doc.page_count

    # ---- Extract TOC (best-case: PDF has a real outline)
    raw_toc = doc.get_toc(simple=True)  # [(level, title, pageNo), ...]
    toc_items: List[TocItem] = []
    if raw_toc:
        toc_items = compute_end_pages(raw_toc, total_pages)

    # ---- Extract pages
    pages_payload = []
    for i in range(total_pages):
        page_no = i + 1
        text = clean_text(doc.load_page(i).get_text("text"))
        pages_payload.append({
            "page_no": page_no,
            "text": text,
            "checksum": sha1(text),
        })

    # ---- Connect Neo4j
    driver = GraphDatabase.driver(args.neo4j_uri, auth=(args.neo4j_user, args.neo4j_pass))

    with driver.session() as s:
        s.run(SET_EMBEDDINGS_CYPHER, rows=payload)

        # Pages (batch)
        s.run(UPSERT_PAGES_CYPHER, book_id=args.book_id, pages=pages_payload)

        # Outline + parent-child
        if toc_items:
            items_payload = []
            parent_stack: Dict[int, str] = {}  # level -> outline_id
            for idx, it in enumerate(toc_items):
                outline_id = f"{args.book_id}:toc:{idx}:{sha1(it.title)[:10]}"
                parent_id = parent_stack.get(it.level - 1)
                parent_stack[it.level] = outline_id
                # truncate deeper levels when we go up
                for lvl in list(parent_stack.keys()):
                    if lvl > it.level:
                        del parent_stack[lvl]

                items_payload.append({
                    "outline_id": outline_id,
                    "parent_outline_id": parent_id,
                    "kind": kind_for_level(it.level),
                    "level": it.level,
                    "title": it.title,
                    "start_page": it.start_page,
                    "end_page": it.end_page,
                })

            s.run(UPSERT_OUTLINE_CYPHER, book_id=args.book_id, items=items_payload)

            # Link COVERS -> Page for every page in range
            # For v0 we create explicit links to pages (fine for one book).
            covers_rows = []
            for row in items_payload:
                for p in range(row["start_page"], row["end_page"] + 1):
                    covers_rows.append({"outline_id": row["outline_id"], "page_no": p})
            # chunk this to avoid huge tx
            CHUNK = 2000
            for k in range(0, len(covers_rows), CHUNK):
                part = covers_rows[k:k+CHUNK]
                s.run(LINK_COVERS_CYPHER, book_id=args.book_id, items=part)

        # Chunks
        chunk_rows = []
        for p in pages_payload:
            page_no = p["page_no"]
            chunks = chunk_text(p["text"])
            for j, ct in enumerate(chunks):
                chunk_id = f"{args.book_id}:p{page_no}:c{j}:{sha1(ct)[:10]}"
                chunk_rows.append({
                    "chunk_id": chunk_id,
                    "page_no": page_no,
                    "chunk_index": j,
                    "text": ct,
                    "checksum": sha1(ct),
                })

        # Insert chunks in manageable batches
        CHUNK = 1000
        for k in range(0, len(chunk_rows), CHUNK):
            s.run(UPSERT_CHUNKS_CYPHER, book_id=args.book_id, chunks=chunk_rows[k:k+CHUNK])

        # If we have outline, create SUPPORTED_BY edges (Outline -> Chunk)
        if toc_items:
            s.run(SUPPORTED_BY_CYPHER)

    # ---- Embeddings (OpenAI) and write to Neo4j
    client = OpenAI()
    batch = args.embed_batch

    def embed_texts(texts: List[str]) -> List[List[float]]:
        resp = client.embeddings.create(model=args.embed_model, input=texts)
        return [d.embedding for d in resp.data]

    # embed only chunks missing embeddings
    with driver.session() as s:
        missing = s.run(
            "MATCH (c:Chunk) WHERE c.book_id=$book_id AND c.embedding IS NULL RETURN c.chunk_id AS id, c.text AS text",
            book_id=args.book_id,
        ).data()

    rows = [{"chunk_id": r["id"], "text": r["text"]} for r in missing]

    for i in range(0, len(rows), batch):
        group = rows[i:i+batch]
        vectors = embed_texts([g["text"] for g in group])
        payload = [{"chunk_id": g["chunk_id"], "embedding": v} for g, v in zip(group, vectors)]

        with driver.session() as s:
            try:
                s.run(SET_EMBEDDINGS_CYPHER, rows=payload)
            except Exception:
                # fallback for older Neo4j builds where the procedure may be absent
                s.run(SET_EMBEDDINGS_FALLBACK_CYPHER, rows=payload)

    driver.close()
    doc.close()

    print(f"Done. Ingested book_id={args.book_id}, pages={total_pages}, chunks={len(chunk_rows)}.")
    print("Reminder: CREATE VECTOR INDEX uses the embedding dimension you selected.")
    print("Vector index query procedure: CALL db.index.vector.queryNodes(...).")


In [None]:
# -----------------------------
# Notebook-friendly entrypoint (FIXED, Neo4j 5 safe, no APOC, no procedures)
# -----------------------------
def run_ingestion(
    pdf: str,
    book_id: str,
    title: str,
    authors: str = "",
    edition: str = "",
    year: str = "",
    neo4j_uri: str = "bolt://localhost:7687",
    neo4j_user: str = "neo4j",
    neo4j_pass: str = "testpassword",
    embed_model: str = DEFAULT_EMBED_MODEL,
    embed_batch: int = 64,
):
    """
    Notebook-friendly runner. Mirrors the argparse main() behavior,
    but takes explicit parameters.

    Assumes these globals already exist in the notebook:
      - SCHEMA_CYPHER, UPSERT_PAGES_CYPHER, LINK_COVERS_CYPHER, UPSERT_CHUNKS_CYPHER
      - UPSERT_OUTLINE_NODES_CYPHER, UPSERT_OUTLINE_BOOK_REL_CYPHER, UPSERT_OUTLINE_PARENT_REL_CYPHER
      - SUPPORTED_BY_CYPHER, SET_EMBEDDINGS_CYPHER
      - sha1, clean_text, chunk_text, compute_end_pages, kind_for_level
      - TocItem (dataclass), plus imports (os, fitz, GraphDatabase, OpenAI, Dict, List)
    """
    import os
    from typing import Dict, List

    if not os.path.exists(pdf):
        raise FileNotFoundError(pdf)

    # ---- Read PDF
    doc = fitz.open(pdf)
    total_pages = doc.page_count

    # ---- Extract TOC (best-case: PDF has a real outline)
    raw_toc = doc.get_toc(simple=True)  # [(level, title, pageNo), ...]
    toc_items: List[TocItem] = []
    if raw_toc:
        toc_items = compute_end_pages(raw_toc, total_pages)

    # ---- Extract pages
    pages_payload = []
    for i in range(total_pages):
        page_no = i + 1
        text = clean_text(doc.load_page(i).get_text("text"))
        pages_payload.append(
            {
                "page_no": page_no,
                "text": text,
                "checksum": sha1(text),
            }
        )

    # ---- Connect Neo4j
    driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_pass))

    # ---- Write Book, Pages, Outline, Chunks
    chunk_rows = []  # keep for return value
    try:
        with driver.session() as s:
            # Book node
            s.run(
                SCHEMA_CYPHER,
                book_id=book_id,
                title=title,
                authors=authors,
                edition=edition,
                year=year,
                total_pages=total_pages,
            )

            # Pages
            s.run(UPSERT_PAGES_CYPHER, book_id=book_id, pages=pages_payload)

            # Outline + parent-child (Neo4j 5 safe; no FOREACH-with-MATCH)
            items_payload = []
            if toc_items:
                parent_stack: Dict[int, str] = {}  # level -> outline_id

                for idx, it in enumerate(toc_items):
                    outline_id = f"{book_id}:toc:{idx}:{sha1(it.title)[:10]}"
                    parent_id = parent_stack.get(it.level - 1)

                    parent_stack[it.level] = outline_id
                    # remove deeper levels when moving up
                    for lvl in list(parent_stack.keys()):
                        if lvl > it.level:
                            del parent_stack[lvl]

                    items_payload.append(
                        {
                            "outline_id": outline_id,
                            "parent_outline_id": parent_id,
                            "kind": kind_for_level(it.level),
                            "level": it.level,
                            "title": it.title,
                            "start_page": it.start_page,
                            "end_page": it.end_page,
                        }
                    )

                # 1) nodes
                s.run(UPSERT_OUTLINE_NODES_CYPHER, items=items_payload)
                # 2) book -> root outline rels
                s.run(UPSERT_OUTLINE_BOOK_REL_CYPHER, book_id=book_id, items=items_payload)
                # 3) parent -> child rels
                s.run(UPSERT_OUTLINE_PARENT_REL_CYPHER, items=items_payload)

                # Link COVERS -> Page for every page in range
                covers_rows = []
                for row in items_payload:
                    for pno in range(row["start_page"], row["end_page"] + 1):
                        covers_rows.append({"outline_id": row["outline_id"], "page_no": pno})

                RANGE_BATCH = 2000
                for k in range(0, len(covers_rows), RANGE_BATCH):
                    part = covers_rows[k : k + RANGE_BATCH]
                    s.run(LINK_COVERS_CYPHER, book_id=book_id, items=part)

            # Chunks
            for p in pages_payload:
                page_no = p["page_no"]
                chunks = chunk_text(p["text"])
                for j, ct in enumerate(chunks):
                    chunk_id = f"{book_id}:p{page_no}:c{j}:{sha1(ct)[:10]}"
                    chunk_rows.append(
                        {
                            "chunk_id": chunk_id,
                            "page_no": page_no,
                            "chunk_index": j,
                            "text": ct,
                            "checksum": sha1(ct),
                        }
                    )

            CHUNK_BATCH = 1000
            for k in range(0, len(chunk_rows), CHUNK_BATCH):
                s.run(
                    UPSERT_CHUNKS_CYPHER,
                    book_id=book_id,
                    chunks=chunk_rows[k : k + CHUNK_BATCH],
                )

            # If we have outline, create SUPPORTED_BY edges (Outline -> Chunk)
            if toc_items:
                s.run(SUPPORTED_BY_CYPHER)

        # ---- Embeddings (Gemini) and write to Neo4j
        gemini_client = genai.Client()  # picks up GEMINI_API_KEY automatically  [oai_citation:2‡Google AI for Developers](https://ai.google.dev/gemini-api/docs/quickstart)

        def embed_texts(texts: List[str]) -> List[List[float]]:
            # Keep 1536 dims so you don't need to change your Neo4j vector index settings
            # Gemini defaults to 3072, but output_dimensionality can be set.  [oai_citation:3‡Google AI for Developers](https://ai.google.dev/gemini-api/docs/embeddings)
            result = gemini_client.models.embed_content(
                model="text-embedding-004",
                contents=texts,  # list batching supported  [oai_citation:4‡Google AI for Developers](https://ai.google.dev/gemini-api/docs/embeddings)
                config=types.EmbedContentConfig(
                    task_type="RETRIEVAL_DOCUMENT",   # correct for indexing chunks  [oai_citation:5‡Google AI for Developers](https://ai.google.dev/gemini-api/docs/embeddings)
                    output_dimensionality=1536        # 768/1536/3072 recommended  [oai_citation:6‡Google AI for Developers](https://ai.google.dev/gemini-api/docs/embeddings)
                ),
            )
            return [e.values for e in result.embeddings]

        # embed only chunks missing embeddings
        with driver.session() as s:
            missing = s.run(
                "MATCH (c:Chunk) WHERE c.book_id=$book_id AND c.embedding IS NULL "
                "RETURN c.chunk_id AS id, c.text AS text "
                "ORDER BY c.page_no, c.chunk_index",
                book_id=book_id,
            ).data()

        rows = [{"chunk_id": r["id"], "text": r["text"]} for r in missing]

        for i in range(0, len(rows), embed_batch):
            group = rows[i : i + embed_batch]
            vectors = embed_texts([g["text"] for g in group])
            payload = [{"chunk_id": g["chunk_id"], "embedding": v} for g, v in zip(group, vectors)]

            with driver.session() as s:
                s.run(SET_EMBEDDINGS_CYPHER, rows=payload)

        print(f"Done. Ingested book_id={book_id}, pages={total_pages}, chunks={len(chunk_rows)}.")
        return {
            "book_id": book_id,
            "pages": total_pages,
            "chunks": len(chunk_rows),
            "embedded": len(rows),
            "has_toc_outline": bool(toc_items),
        }

    finally:
        driver.close()
        doc.close()

In [21]:
os.environ["GEMINI_API_KEY"] = "AIzaSyAODasJVvnnhDYAu7-qh9B7jZKbfo-3B7w"

result = run_ingestion(
    pdf="./data/(The Pearson series in economics) Pindyck, Robert S._ Rubinfeld, Daniel L - Microeconomics-Pearson (2018).pdf",
    book_id="pindyck_micro_9e",
    title="Microeconomics",
    authors="Pindyck, Rubinfeld",
    edition="9",
    year="2018",
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_pass="testpassword",
    embed_model="text-embedding-3-small",
    embed_batch=64,
)

result

ClientError: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit. \n* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 100, model: gemini-embedding-1.0\nPlease retry in 24.519682647s.', 'status': 'RESOURCE_EXHAUSTED', 'details': [{'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Learn more about Gemini API quotas', 'url': 'https://ai.google.dev/gemini-api/docs/rate-limits'}]}, {'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'quotaMetric': 'generativelanguage.googleapis.com/embed_content_free_tier_requests', 'quotaId': 'EmbedContentRequestsPerMinutePerUserPerProjectPerModel-FreeTier', 'quotaDimensions': {'location': 'global', 'model': 'gemini-embedding-1.0'}, 'quotaValue': '100'}]}, {'@type': 'type.googleapis.com/google.rpc.RetryInfo', 'retryDelay': '24s'}]}}