# Multi-Source HTML Enrichment (MCP)

Tests **compressed_html enrichment** for both Ra.co and Ticketmaster sources.

**What this notebook verifies:**
1. `fetch_compressed_html` is properly `await`-ed (bug fix validation)
2. Ra.co enrichment fetches real page text into `compressed_html` (this notebook forces the `hybrid` engine)
3. Ticketmaster hybrid engine fetches universe.com event pages
4. Non-music TM events (museums, exhibitions) have empty `artists` list
5. DB persist succeeds — no `source_updated_at` NOT NULL violations
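
Check 1 targets a failure mode that is easy to miss: calling an `async` function without `await` returns a coroutine object rather than the page text, and that object is truthy, so a field can look populated while holding no text. The sketch below illustrates this with a stand-in coroutine. `fake_fetch_compressed_html` is a hypothetical placeholder, not the real `HtmlEnrichmentScraper.fetch_compressed_html`.

```python
import asyncio
import inspect


async def fake_fetch_compressed_html(url: str) -> str:
    """Hypothetical stand-in for an async HTML fetcher."""
    await asyncio.sleep(0)  # simulate I/O
    return f"page text for {url}"


async def demo() -> tuple[bool, str]:
    # Bug: without `await` we hold a coroutine object, not the page text.
    not_awaited = fake_fetch_compressed_html("https://example.com/event/1")
    was_coroutine = inspect.iscoroutine(not_awaited)
    not_awaited.close()  # avoid the "coroutine was never awaited" warning

    # Fix: awaiting the call yields the actual string.
    awaited = await fake_fetch_compressed_html("https://example.com/event/1")
    return was_coroutine, awaited


was_coroutine, text = asyncio.run(demo())
print(was_coroutine, repr(text))
```

Inside a notebook, where an event loop is already running, use `await demo()` instead of `asyncio.run(demo())`.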

**Pipeline flow:**
1. Create pipelines with HTML enrichment **enabled** (override config)
2. Fetch small batch (1 page × 5 events, Barcelona only)
3. Inspect `compressed_html` field coverage + character counts
4. Standalone `SimpleHtmlFetcher` test: fetch up to 2 URLs per source directly
5. Persist to DB and verify `source_updated_at` is non-null

## Setup

In [1]:
import sys
import os
import logging

# Setup path — point to services/api so src.* imports work
API_ROOT = os.path.abspath(os.path.join("..", "services", "api"))
if API_ROOT not in sys.path:
    sys.path.insert(0, API_ROOT)

# Load .env from services/api so API keys are available
from dotenv import load_dotenv
env_path = os.path.join(API_ROOT, ".env")
load_dotenv(env_path, override=True)

# Enable logging
logging.basicConfig(
    level=logging.INFO,
    format="%(name)s - %(levelname)s - %(message)s",
)

# Verify key env vars
tm_key = os.environ.get("TICKETMASTER_API_KEY", "")
print(f"API root: {API_ROOT}")
print(f"TICKETMASTER_API_KEY loaded: {'yes (' + str(len(tm_key)) + ' chars)' if tm_key else 'NO - check .env'}")
print("Setup complete")

API root: /Users/josegarcia/Documents/GitHub/event-intelligence-platform/services/api
TICKETMASTER_API_KEY loaded: yes (32 chars)
Setup complete


## Step 1: Create Pipelines with HTML Enrichment ON

The factory builds each pipeline from the YAML config. After creation, we replace each pipeline's
`html_enrichment_scraper` so that enrichment is on regardless of the YAML setting.

In [2]:
from src.ingestion.factory import PipelineFactory
from src.ingestion.adapters.scraper_adapter import HtmlEnrichmentConfig, HtmlEnrichmentScraper

factory = PipelineFactory()
pipelines = factory.create_all_enabled_pipelines()

ra_co = pipelines["ra_co"]
ticketmaster = pipelines["ticketmaster"]

# Override HTML enrichment scraper — force enabled with hybrid engine
ra_enrichment_cfg = HtmlEnrichmentConfig(
    enabled=True,
    engine_type="hybrid",
    rate_limit_per_second=1.0,
    timeout_s=20.0,
    source_name="ra_co",
)
ra_co.html_enrichment_scraper = HtmlEnrichmentScraper(ra_enrichment_cfg)

tm_enrichment_cfg = HtmlEnrichmentConfig(
    enabled=True,
    engine_type="hybrid",
    rate_limit_per_second=1.0,
    timeout_s=20.0,
    source_name="ticketmaster",
)
ticketmaster.html_enrichment_scraper = HtmlEnrichmentScraper(tm_enrichment_cfg)

print("Pipelines ready with HTML enrichment enabled:")
print(f"  ra_co html_enrichment_scraper:       {ra_co.html_enrichment_scraper}")
print(f"  ticketmaster html_enrichment_scraper: {ticketmaster.html_enrichment_scraper}")

src.ingestion.factory - INFO - Created pipeline: ra_co
src.ingestion.factory - INFO - Created pipeline: ticketmaster


Pipelines ready with HTML enrichment enabled:
  ra_co html_enrichment_scraper:       <src.ingestion.adapters.scraper_adapter.HtmlEnrichmentScraper object at 0x10abf22d0>
  ticketmaster html_enrichment_scraper: <src.ingestion.adapters.scraper_adapter.HtmlEnrichmentScraper object at 0x1112b5210>


## Step 2: Execute Pipelines (Small Scope — Barcelona, 1 Page × 5 Events)

In [3]:
# Narrow to Barcelona only for speed
ra_co.source_config.defaults["areas"] = {"Barcelona": 20}

raco_result = await ra_co.execute(max_pages=1, page_size=5)

print("Ra.co Pipeline Results")
print("=" * 60)
print(f"Status:     {raco_result.status.value}")
print(f"Raw events: {raco_result.total_events_processed}")
print(f"Successful: {raco_result.successful_events}")
print(f"Duration:   {raco_result.duration_seconds:.2f}s")

pipeline.ra_co - INFO - Starting multi-city execution: ra_co_20260218_125102_b66e39a3 (1 cities)
pipeline.ra_co - INFO - Fetching events for Barcelona (area_id=20)...
pipeline.ra_co - INFO -   Barcelona: sliding window fetch [2026-02-18..2026-02-19] (capacity=500/call, window=168h)
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 1/1...
httpx - INFO - HTTP Request: POST https://ra.co/graphql "HTTP/1.1 200 OK"
src.ingestion.pipelines.apis.base_api - INFO - Pagination complete: fetched 5 total events across 2 pages
pipeline.ra_co - INFO -   Barcelona: [2026-02-18..2026-02-19] 5/29 events (SATURATED — shrinking to 84h)
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 1/1...
httpx - INFO - HTTP Request: POST https://ra.co/graphql "HTTP/1.1 200 OK"
src.ingestion.pipelines.apis.base_api - INFO - Pagination complete: fetched 5 total events across 2 pages
pipeline.ra_co - INFO -   Barcelona: [2026-02-18..2026-02-19] 5/29 events (SATURATED — shrinking to 42h)
src.ingesti

Ra.co Pipeline Results
Status:     partial_success
Raw events: 116
Successful: 28
Duration:   161.36s


In [4]:
# Narrow to Barcelona only
ticketmaster.source_config.defaults["cities"] = ["Barcelona"]

tm_result = await ticketmaster.execute(max_pages=1, page_size=5)

print("Ticketmaster Pipeline Results")
print("=" * 60)
print(f"Status:     {tm_result.status.value}")
print(f"Raw events: {tm_result.total_events_processed}")
print(f"Successful: {tm_result.successful_events}")
print(f"Duration:   {tm_result.duration_seconds:.2f}s")

pipeline.ticketmaster - INFO - Starting multi-city execution: ticketmaster_20260218_131019_b6540cd8 (1 cities)
pipeline.ticketmaster - INFO - Fetching events for Barcelona...
pipeline.ticketmaster - INFO -   Barcelona: sliding window fetch [2026-02-18..2026-02-19] (capacity=250/call, window=168h)
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 0/0...
httpx - INFO - HTTP Request: GET https://app.ticketmaster.com/discovery/v2/events.json?apikey=***REDACTED***&city=Barcelona&countryCode=ES&startDateTime=2026-02-18T00%3A00%3A00Z&endDateTime=2026-02-18T23%3A59%3A59Z&size=5&page=0&sort=date%2Casc "HTTP/1.1 200 OK"
src.ingestion.pipelines.apis.base_api - INFO - Pagination complete: fetched 0 total events across 1 pages
pipeline.ticketmaster - INFO -   Barcelona: sliding window complete — 0 total raw events
pipeline.ticketmaster - INFO -   Barcelona: 0 raw events fetched
pipeline.ticketmaster - INFO - Total raw events across all cities: 0
pipeline.ticketmaster - 

Ticketmaster Pipeline Results
Status:     failed
Raw events: 0
Successful: 0
Duration:   0.56s


## Step 3: Inspect `compressed_html` Field Coverage

In [5]:
def show_html_coverage(events, source_label):
    """Print compressed_html coverage stats and a snippet for each event."""
    print(f"{source_label} — {len(events)} events")
    print("-" * 70)
    with_html = 0
    for i, event in enumerate(events):
        html = event.source.compressed_html or ""
        has_html = bool(html)
        if has_html:
            with_html += 1
        artists_str = ", ".join(a.name for a in event.artists) if event.artists else "(none)"
        print(f"[{i+1}] {event.title[:55]:<55}")
        print(f"     artists:       {artists_str}")
        print(f"     html chars:    {len(html):,}" if has_html else "     html:         None")
        if html:
            print(f"     snippet:       {repr(html[:120])}")
        print()
    print(f"Coverage: {with_html}/{len(events)} events have compressed_html")
    print()

show_html_coverage(raco_result.events, "Ra.co")
show_html_coverage(tm_result.events, "Ticketmaster")

Ra.co — 28 events
----------------------------------------------------------------------
[1] Plaiia Parties                                         
     artists:       Saulo Pisa, Miguel Silva, Civaro
     html:         None

[2] Hurtado + Rubén Seoane                                 
     artists:       Rubén Seoane, Hurtado, Rubén Seoane Hurtado
     html:         None

[3] Laurence Guy en microdosis - Razzmatazz 3, Barcelona   
     artists:       Laurence Guy
     html:         None

[4] Ofenbach: CLONED [LIVE] - Apolo, Barcelona             
     artists:       Ofenbach
     html:         None

[5] HiFi: Vultur, Moray                                    
     artists:       Moray, Vultur
     html:         None

[6] Wednesnight with Chill Miracle, Djaq, Keyblow          
     artists:       @chill miracle @ djaq
     html:         None

[7] DANCE HALL REGGAE:SIZZLA GAMBIA-JULIA TOWERS-LEANDRO-EN
     artists:       @SIZZLA GAMBIA @julia towers@ ena ghema@ leandro
     html:       

## Step 4: Verify Artists Filter — Non-Music TM Events Must Have Empty `artists`

After the fix, Museo Banksy Madrid (segment=`Arts & Theatre & Comedy`) should no longer be listed in `artists`; non-music events are expected to have an empty list.
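
The rule being verified can be sketched as follows. This is a minimal illustration that assumes the filter keys off the Ticketmaster segment name. `MUSIC_SEGMENTS` and `extract_artists` are hypothetical names, and the real adapter may use different fields or logic.

```python
from typing import Optional

# Hypothetical set of segment names treated as music.
MUSIC_SEGMENTS = {"Music"}


def extract_artists(attraction_names: list[str], segment: Optional[str]) -> list[str]:
    """Return attraction names as artists only for music-segment events."""
    if segment not in MUSIC_SEGMENTS:
        return []  # museums, exhibitions, theatre: attractions are not artists
    return list(attraction_names)


print(extract_artists(["Museo Banksy Madrid"], "Arts & Theatre & Comedy"))
print(extract_artists(["Ofenbach"], "Music"))
```

An allow-list of music segments, rather than a deny-list of non-music ones, means an unknown or missing segment also yields an empty list.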

In [6]:
all_events = raco_result.events + tm_result.events

print("Artists per event (Ticketmaster):")
print("=" * 60)
for event in tm_result.events:
    artists_str = ", ".join(a.name for a in event.artists) if event.artists else "✓ (none — correctly filtered)"
    print(f"  {event.title[:50]:<50}  artists={artists_str}")

print()
print("Artists per event (Ra.co — should still have artists for music events):")
print("=" * 60)
for event in raco_result.events[:5]:
    artists_str = ", ".join(a.name for a in event.artists) if event.artists else "(none)"
    print(f"  {event.title[:50]:<50}  artists={artists_str}")

Artists per event (Ticketmaster):

Artists per event (Ra.co — should still have artists for music events):
  Plaiia Parties                                      artists=Saulo Pisa, Miguel Silva, Civaro
  Hurtado + Rubén Seoane                              artists=Rubén Seoane, Hurtado, Rubén Seoane Hurtado
  Laurence Guy en microdosis - Razzmatazz 3, Barcelo  artists=Laurence Guy
  Ofenbach: CLONED [LIVE] - Apolo, Barcelona          artists=Ofenbach
  HiFi: Vultur, Moray                                 artists=Moray, Vultur


## Step 5: Standalone `SimpleHtmlFetcher` — Direct URL Test

Tests HTML enrichment outside the pipeline for quick debugging of individual URLs.

In [7]:
class SimpleHtmlFetcher:
    """Direct URL fetcher using the scraping service; tests HTML enrichment outside the pipeline."""

    def __init__(self, source_name: str):
        from src.ingestion.adapters.scraper_adapter import HtmlEnrichmentConfig, HtmlEnrichmentScraper
        self._scraper = HtmlEnrichmentScraper(HtmlEnrichmentConfig(
            enabled=True,
            engine_type="hybrid",
            timeout_s=20.0,
            source_name=source_name,
        ))

    async def fetch(self, url: str) -> str | None:
        return await self._scraper.fetch_compressed_html(url)

print("SimpleHtmlFetcher defined")

SimpleHtmlFetcher defined


In [8]:
# Pick 2 Ra.co event URLs from the pipeline result (or use hardcoded fallbacks)
raco_urls = [
    event.source.source_url
    for event in raco_result.events
    if event.source.source_url
][:2]

if not raco_urls:
    raco_urls = [
        "https://ra.co/events/2348963",
        "https://ra.co/events/2297033",
    ]

print(f"Testing Ra.co URLs: {raco_urls}")
raco_fetcher = SimpleHtmlFetcher("ra_co")

for url in raco_urls:
    text = await raco_fetcher.fetch(url)
    if text:
        print(f"\n✓ {url}")
        print(f"  chars: {len(text):,}")
        print(f"  snippet: {repr(text[:200])}")
    else:
        print(f"\n✗ {url} — returned None")



Testing Ra.co URLs: ['https://ra.co/events/2348963', 'https://ra.co/events/2338673']

✗ https://ra.co/events/2348963 — returned None

✗ https://ra.co/events/2338673 — returned None


In [9]:
# Pick 2 Ticketmaster/universe.com event URLs from the pipeline result
tm_urls = [
    event.source.source_url
    for event in tm_result.events
    if event.source.source_url
][:2]

if not tm_urls:
    tm_urls = [
        "https://www.universe.com/events/museo-banksy-madrid-tickets-X38GZF?ref=ticketmaster",
    ]

print(f"Testing Ticketmaster URLs: {tm_urls}")
tm_fetcher = SimpleHtmlFetcher("ticketmaster")

for url in tm_urls:
    text = await tm_fetcher.fetch(url)
    if text:
        print(f"\n✓ {url}")
        print(f"  chars: {len(text):,}")
        print(f"  snippet: {repr(text[:200])}")
    else:
        print(f"\n✗ {url} — returned None (TM pages may require JS rendering or have anti-scraping)")



Testing Ticketmaster URLs: ['https://www.universe.com/events/museo-banksy-madrid-tickets-X38GZF?ref=ticketmaster']

✗ https://www.universe.com/events/museo-banksy-madrid-tickets-X38GZF?ref=ticketmaster — returned None (TM pages may require JS rendering or have anti-scraping)


## Step 6: Persist to DB

Verifies that:
- `source_updated_at` NOT NULL constraint is satisfied (fallback to `datetime.now(UTC)`)
- `compressed_html` is stored in `sources` table where fetched
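
The fallback in the first bullet can be sketched like this. It assumes the writer substitutes the current UTC time when the source supplies no timestamp. `resolve_source_updated_at` is a hypothetical helper name, not a function taken from `EventDataWriter`.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_source_updated_at(value: Optional[datetime]) -> datetime:
    """Return the source timestamp, or the current UTC time if it is missing."""
    return value if value is not None else datetime.now(timezone.utc)


known = datetime(2026, 2, 18, 12, 0, tzinfo=timezone.utc)
print(resolve_source_updated_at(known))         # existing timestamp is kept
print(resolve_source_updated_at(None).tzinfo)   # fallback is timezone-aware
```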

In [10]:
import psycopg2
from src.ingestion.persist import EventDataWriter
from urllib.parse import urlparse

DATABASE_URL = os.environ.get("DATABASE_URL", "")
if not DATABASE_URL:
    print("ERROR: DATABASE_URL not set — check .env")
else:
    u = urlparse(DATABASE_URL)
    conn_params = dict(
        host=u.hostname,
        port=u.port or 5432,
        dbname=u.path.lstrip("/"),
        user=u.username,
        password=u.password,
    )

    events_to_persist = raco_result.events + tm_result.events
    print(f"Persisting {len(events_to_persist)} events to PostgreSQL...")
    # Use conn_params so the default port is shown when the URL omits it
    print(f"  DB: {conn_params['host']}:{conn_params['port']}/{conn_params['dbname']}")

    # Check source_updated_at before persist
    none_count = sum(1 for e in events_to_persist if e.source.source_updated_at is None)
    print(f"  Events with source_updated_at=None (will use fallback): {none_count}/{len(events_to_persist)}")
    print()

    try:
        conn = psycopg2.connect(**conn_params)
        writer = EventDataWriter(conn)

        saved = writer.persist_batch(events_to_persist)
        conn.close()

        print(f"Persist complete: {saved}/{len(events_to_persist)} events saved")
        print(f"  Failed/skipped: {len(events_to_persist) - saved}")

        # Verification query
        conn2 = psycopg2.connect(**conn_params)
        with conn2.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM events")
            total_events = cur.fetchone()[0]
            cur.execute(
                "SELECT source_name, COUNT(*), "
                "COUNT(compressed_html) AS with_html, "
                "COUNT(source_updated_at) AS with_updated_at "
                "FROM sources GROUP BY source_name ORDER BY source_name"
            )
            by_source = cur.fetchall()
        conn2.close()

        print("\nDB verification:")
        print(f"  Total events: {total_events}")
        print("  Sources table (source_name | total | with_html | with_updated_at):")
        for row in by_source:
            src_name, total, with_html, with_updated_at = row
            print(f"    {src_name:20}: {total:4} total | {with_html:4} with_html | {with_updated_at:4} with_updated_at")

    except Exception as e:
        import traceback

        print(f"ERROR: {e}")
        traceback.print_exc()

ModuleNotFoundError: No module named 'psycopg2'

## Step 7: Field Coverage — Focus on `compressed_html` and `source_updated_at`

In [11]:

all_events = raco_result.events + tm_result.events

print("Field Coverage Summary")
print("=" * 70)

sources = sorted(set(e.source.source_name for e in all_events))
col_w = 25

def pct(events, pred):
    if not events:
        return "N/A"
    count = sum(1 for e in events if pred(e))
    return f"{100 * count / len(events):.0f}%  ({count}/{len(events)})"

checks = [
    ("compressed_html",    lambda e: bool(e.source.compressed_html)),
    ("source_updated_at",  lambda e: e.source.source_updated_at is not None),
    ("artists non-empty",  lambda e: len(e.artists) > 0),
    ("description",        lambda e: bool(e.description)),
    ("source_url",         lambda e: bool(e.source.source_url)),
]

print(f"  {'Field':<{col_w}}", end="")
for src in sources:
    print(f"  {src:>28}", end="")
print()
print("  " + "-" * (col_w + 32 * len(sources)))

for label, pred in checks:
    print(f"  {label:<{col_w}}", end="")
    for src in sources:
        src_events = [e for e in all_events if e.source.source_name == src]
        print(f"  {pct(src_events, pred):>28}", end="")
    print()

Field Coverage Summary
  Field                                             ra_co
  ---------------------------------------------------------
  compressed_html                              0%  (0/28)
  source_updated_at                          89%  (25/28)
  artists non-empty                         100%  (28/28)
  description                                68%  (19/28)
  source_url                                100%  (28/28)


## Cleanup

In [12]:
await ra_co.close()
await ticketmaster.close()
print("Resources released.")

Resources released.
