# Email Ingestion & RAG Prebuild (Simplified)

This notebook demonstrates the email ingestion pipeline using the production components from `ingestion.email`.

## Environment setup

To run against a live IMAP server, set the following environment variables before launching the notebook:

- `IMAP_HOST`
- `IMAP_USERNAME`
- `IMAP_PASSWORD`
- `IMAP_MAILBOX` *(optional, defaults to `INBOX`)*

The notebook uses synthetic data by default. Set `USE_SYNTHETIC_DATA=false` to fetch from the live server instead.
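
Before switching to live mode, it can help to confirm that the required variables are actually set. The following is a minimal sketch, not part of `ingestion.email`: the helper names `missing_imap_vars`, `use_synthetic`, and `check_live_config` are hypothetical, and only the variable names come from the list above.

```python
import os

# Variable names are taken from the list above; the helpers are hypothetical.
REQUIRED_IMAP_VARS = ("IMAP_HOST", "IMAP_USERNAME", "IMAP_PASSWORD")


def missing_imap_vars(env=None):
    """Return the required IMAP variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_IMAP_VARS if not env.get(name)]


def use_synthetic(env=None):
    """Mirror the notebook's USE_SYNTHETIC_DATA parsing (default: true)."""
    env = os.environ if env is None else env
    return env.get("USE_SYNTHETIC_DATA", "true").lower() == "true"


def check_live_config(env=None):
    """Raise early if live mode is requested but credentials are incomplete."""
    if use_synthetic(env):
        return  # synthetic mode needs no IMAP settings
    missing = missing_imap_vars(env)
    if missing:
        raise RuntimeError("Missing IMAP settings: " + ", ".join(missing))
```

Calling `check_live_config()` at the top of the configuration cell would fail fast with a readable message instead of a connection error later on.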


In [None]:
# Imports
from __future__ import annotations

import os
import sqlite3
from datetime import datetime, timedelta
from typing import Any, Dict, List

from ingestion.email.connector import IMAPConnector
from ingestion.email.ingest import _normalize, run_email_ingestion
from ingestion.email.processor import EmailProcessor


In [None]:
# Configuration
USE_SYNTHETIC_DATA = os.getenv("USE_SYNTHETIC_DATA", "true").lower() == "true"
IMAP_HOST = os.getenv("IMAP_HOST", "")
IMAP_USERNAME = os.getenv("IMAP_USERNAME", "")
IMAP_PASSWORD = os.getenv("IMAP_PASSWORD", "")
IMAP_MAILBOX = os.getenv("IMAP_MAILBOX", "INBOX")
SINCE_DAYS = int(os.getenv("EMAIL_SYNC_SINCE_DAYS", "30"))

SYNTHETIC_EMAILS = [
    {
        "message_id": "1",
        "subject": "Test message",
        "from_addr": "alice@example.com",
        "to_addrs": ["bob@example.com"],
        "cc_addrs": [],
        # Explicit UTC offset: datetime.utcnow().tzinfo is None, which would yield a naive timestamp.
        "date_utc": "2024-01-01T00:00:00+00:00",
        "received_utc": "2024-01-01T00:00:00+00:00",
        "body_text": "Hello from Alice to Bob",
        "body_html": None,
        "thread_id": None,
        "in_reply_to": None,
        "references_ids": [],
        "is_reply": 0,
        "is_forward": 0,
        "raw_size_bytes": None,
        "language": None,
        "has_attachments": 0,
        "attachment_manifest": [],
        "processed": 0,
        "ingested_at": None,
        "updated_at": None,
        "content_hash": None,
        "summary": None,
        "keywords": None,
        "auto_topic": None,
        "manual_topic": None,
        "topic_confidence": None,
        "topic_version": None,
        "error_state": None,
        "direction": None,
        "participants": [],
        "participants_hash": None,
        "to_primary": None,
    }
]


### Email Record Field Reference (Detailed)
Each processed email record (pre-DB) uses the following semantics:

| Field | Type | Purpose / Semantics | Notes / Future Considerations |
|-------|------|--------------------|-------------------------------|
| message_id | str | Globally unique RFC822 Message-ID | Serves as the uniqueness anchor; collisions are rare but still checked. |
| thread_id | str | Logical conversation/thread root | Currently simplistic: the first reference, or the message's own ID; could be upgraded by collapsing the References chain. |
| subject | str | Original Subject line | Normalization (prefix stripping) deferred for display vs. analytics. |
| from_addr | str | Sender (lowercased) | Single sender only; malformed multi-sender headers are ignored. |
| to_addrs | list[str] | Primary recipients | Stored as JSON array in DB. |
| cc_addrs | list[str] | CC recipients | Optional JSON array. |
| date_utc | str (ISO) | Declared sent date/time (UTC) | Not validated vs Received; trust header for now. |
| received_utc | str (ISO) | Proxy for first-seen timestamp | Placeholder = date_utc; real pipeline would parse Received headers. |
| in_reply_to | str | Single Message-ID replied to | Null if not a reply. |
| references_ids | list[str] | Ordered ancestry of Message-IDs | Enables deeper thread reconstruction. |
| is_reply | int (0/1) | Heuristic flag for reply | Derived from subject OR In-Reply-To presence. |
| is_forward | int (0/1) | Heuristic flag for forward | Subject starts with fwd:/fw: (case-insensitive). |
| raw_size_bytes | int | Approx. body size before processing | Used for batching / limits. |
| body_text | str | Extracted canonical plain text body | HTML stripped / MIME selection to be added later. |
| body_html | str or null | Original HTML (if retained) | Currently None; later can store sanitized HTML. |
| language | str or null | ISO language code | Deferred: language detection integration. |
| has_attachments | int (0/1) | True if any attachments present | Quick filter for enrichment policies. |
| attachment_manifest | list[dict] | Lightweight metadata for attachments | JSON (filename,size,mime). No binary content stored. |
| processed | int (0/1) | Downstream processing completion marker | Set when embeddings/chunking done (future). |
| ingested_at | str (ISO) | DB insertion timestamp | DB default or set programmatically. |
| updated_at | str (ISO) | Last mutation timestamp | Null until updates occur. |
| content_hash | str | Deterministic dedupe hash | Subject + body + participants_hash + date_utc (current recipe). |
| summary | str | Short abstract summary | Stub now; LLM or heuristic later. |
| keywords | list[str] | Extracted keywords (naive or LLM) | Stored JSON array. |
| auto_topic | str | Heuristic / model inferred topic | May evolve with model versions. |
| manual_topic | str | Human override | Never overwritten automatically. |
| topic_confidence | float | Confidence score for auto_topic | Range 0..1; stub values now. |
| topic_version | int | Version of topic inference algorithm | Increment on model or rule change. |
| error_state | str | Classification of last processing error | Null if healthy; used for retry logic. |
| direction | str | One of `inbound`, `outbound`, `cc_only`, `other` | Derived relative to the primary mailbox. |
| participants | list[str] | Sorted unique participants (from/to/cc) | Facilitates hash, search facets. |
| participants_hash | str | Hash of participants list | Stable across ordering differences. |
| to_primary | str or null | Echoes PRIMARY_MAILBOX if directly addressed | Aids direct vs peripheral classification. |
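
The `participants`, `participants_hash`, and `content_hash` rows above can be sketched as follows. This is an illustration under stated assumptions, not the code in `ingestion.email`: the table gives the recipe (subject + body + participants_hash + date_utc) but not the hash function or separator, so SHA-256 and the unit-separator character are assumptions, and the function names are hypothetical.

```python
import hashlib


def build_participants(from_addr, to_addrs, cc_addrs):
    """Sorted, de-duplicated, lowercased participants drawn from from/to/cc."""
    addrs = [from_addr, *to_addrs, *cc_addrs]
    return sorted({a.strip().lower() for a in addrs if a})


def participants_hash(participants):
    """Hash the participant list; sorting first keeps it order-independent."""
    joined = "\n".join(sorted(participants))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()  # SHA-256 is an assumption


def content_hash(subject, body_text, p_hash, date_utc):
    """Deterministic dedupe hash following the recipe in the table."""
    parts = [subject or "", body_text or "", p_hash or "", date_utc or ""]
    # The "\x1f" separator is an assumption; it keeps field boundaries unambiguous.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


participants = build_participants("alice@example.com", ["bob@example.com"], [])
record_hash = content_hash(
    "Test message",
    "Hello from Alice to Bob",
    participants_hash(participants),
    "2024-01-01T00:00:00+00:00",
)
```

Because every input is normalized before hashing, re-ingesting the same message with recipients listed in a different order produces the same `content_hash`.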

Guiding Principles:
- Keep raw fidelity (subject, addresses) while adding normalized derivatives (hashes, flags).
- Avoid premature heavy processing (LLM calls) until base ingestion is stable.
- All enrichment fields are nullable and additive.
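
The cheap, derived flags in the field reference (`is_reply`, `is_forward`, `thread_id`, `direction`) might look like the sketch below. The function names are hypothetical, and two details are assumptions: the field reference does not say which subject prefix marks a reply (`re:` is assumed here), nor does it spell out the precedence used for `direction`.

```python
def is_reply(subject, in_reply_to):
    """1 if In-Reply-To is present or the subject starts with 're:' (assumed prefix)."""
    starts_with_re = (subject or "").strip().lower().startswith("re:")
    return int(bool(in_reply_to) or starts_with_re)


def is_forward(subject):
    """1 if the subject starts with fwd: or fw:, case-insensitively."""
    return int((subject or "").strip().lower().startswith(("fwd:", "fw:")))


def thread_id(message_id, references_ids):
    """First reference if any, otherwise the message's own ID."""
    return references_ids[0] if references_ids else message_id


def direction(from_addr, to_addrs, cc_addrs, primary_mailbox):
    """Classify a message relative to the primary mailbox (precedence is assumed)."""
    primary = primary_mailbox.lower()
    if (from_addr or "").lower() == primary:
        return "outbound"
    if primary in (a.lower() for a in to_addrs):
        return "inbound"
    if primary in (a.lower() for a in cc_addrs):
        return "cc_only"
    return "other"
```

None of these helpers touch the raw fields, so the original subject and addresses stay intact while the flags remain recomputable if the heuristics change.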


In [None]:
# Connector fetch test
if USE_SYNTHETIC_DATA:
    records = SYNTHETIC_EMAILS
else:
    connector = IMAPConnector(
        host=IMAP_HOST,
        username=IMAP_USERNAME,
        password=IMAP_PASSWORD,
        mailbox=IMAP_MAILBOX,
    )
    since = datetime.utcnow() - timedelta(days=SINCE_DAYS)
    records = connector.fetch_emails(since_date=since)
len(records), records[:1]


In [None]:
# Normalization output test
normalized = _normalize(records[0])
normalized


In [None]:
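# Processor persistence test
class DummyEmbeddings:
    """Stand-in embedding model: returns a 3-dimensional zero vector per text."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[0.0] * 3 for _ in texts]

class MockMilvus:
    """In-memory stand-in for the vector store: records every add_embeddings call."""

    def __init__(self):
        self.data = []

    def add_embeddings(self, embeddings, ids, metadatas):
        self.data.extend(zip(ids, embeddings, metadatas))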

sqlite_conn = sqlite3.connect(':memory:')
proc = EmailProcessor(MockMilvus(), sqlite_conn, embedding_model=DummyEmbeddings())
proc.process(normalized)
sqlite_conn.execute('SELECT message_id FROM emails').fetchall()


In [None]:
# End-to-end ingestion test
sqlite_conn = sqlite3.connect(':memory:')
mock_milvus = MockMilvus()
proc = EmailProcessor(mock_milvus, sqlite_conn, embedding_model=DummyEmbeddings())

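if USE_SYNTHETIC_DATA:
    # Placeholder credentials: fetch_emails is replaced with a stub that returns the synthetic records.
    connector = IMAPConnector(host='example', username='user', password='pass')
    connector.fetch_emails = lambda since_date=None: SYNTHETIC_EMAILS
    since = None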
else:
    connector = IMAPConnector(
        host=IMAP_HOST,
        username=IMAP_USERNAME,
        password=IMAP_PASSWORD,
        mailbox=IMAP_MAILBOX,
    )
    since = datetime.utcnow() - timedelta(days=SINCE_DAYS)

processed = run_email_ingestion(connector, proc, since_date=since)
processed, sqlite_conn.execute('SELECT COUNT(*) FROM emails').fetchone()[0]
