# Multi-Source Event Ingestion Pipeline

This notebook exercises **both the Ra.co and Ticketmaster** pipelines in one session, running them one after the other (Steps 3 and 4).
All sources are created through `PipelineFactory` using YAML configuration.

**Pipeline flow:**
1. Factory creates both `ra_co` (GraphQL) and `ticketmaster` (REST) pipelines from config
2. Each pipeline fetches raw data via its adapter
3. `FieldMapper` extracts and transforms fields according to each source's config
4. `TaxonomyMapper` assigns Human Experience Taxonomy dimensions
5. Events from both sources are combined, deduplicated, and compared
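
To make the field-mapping step of the flow concrete, here is a minimal sketch of config-driven extraction. It is not the project's `FieldMapper`: the helper names, the dotted-path convention, and the raw payload shapes are all assumptions chosen for illustration.

```python
from typing import Any

def extract_path(record: dict, path: str) -> Any:
    """Walk a dotted path such as 'venue.name'; return None if any key is missing."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Hypothetical per-source mappings: target field -> dotted path in the raw payload.
FIELD_MAPPINGS = {
    "ra_co": {"title": "event.title", "venue_name": "event.venue.name"},
    "ticketmaster": {"title": "name", "venue_name": "venue.name"},
}

def map_fields(source: str, raw: dict) -> dict:
    """Build a flat record using the mapping configured for the given source."""
    mapping = FIELD_MAPPINGS[source]
    return {target: extract_path(raw, path) for target, path in mapping.items()}

# Illustrative payloads only; the real API responses are shaped differently.
raco_raw = {"event": {"title": "Plaiia Parties", "venue": {"name": "Macarena Club"}}}
tm_raw = {"name": "Museo Banksy Madrid", "venue": {"name": "Museo Banksy Madrid"}}

print(map_fields("ra_co", raco_raw))
print(map_fields("ticketmaster", tm_raw))
```

The point of the design is that adding a source means adding a mapping entry rather than writing new extraction code.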

In [1]:
import sys
import os
import logging

# Setup path — point to services/api so src.* imports work
API_ROOT = os.path.abspath(os.path.join("..", "services", "api"))
if API_ROOT not in sys.path:
    sys.path.insert(0, API_ROOT)

# Load .env from services/api so API keys are available
from dotenv import load_dotenv
env_path = os.path.join(API_ROOT, ".env")
load_dotenv(env_path, override=True)

# Enable logging
logging.basicConfig(
    level=logging.INFO,
    format="%(name)s - %(levelname)s - %(message)s",
)

# Verify key env vars are loaded (without printing the values)
tm_key = os.environ.get("TICKETMASTER_API_KEY", "")
print(f"API root: {API_ROOT}")
print(f"TICKETMASTER_API_KEY loaded: {'yes (' + str(len(tm_key)) + ' chars)' if tm_key else 'NO - check .env'}")
print("Setup complete")

API root: /Users/josegarcia/Documents/GitHub/event-intelligence-platform/services/api
TICKETMASTER_API_KEY loaded: yes (32 chars)
Setup complete


## Step 1: PipelineFactory — List All Configured Sources

In [2]:
from src.ingestion.factory import PipelineFactory

factory = PipelineFactory()

print("Configured Sources:")
print("=" * 50)
for name, info in factory.list_sources().items():
    status = "ENABLED" if info["enabled"] else "disabled"
    print(f"  {name:20} type={info['type']:10} [{status}]")

enabled = factory.list_enabled_sources()
print(f"\nEnabled sources: {enabled}")

Configured Sources:
  ra_co                type=api        [ENABLED]
  ticketmaster         type=api        [ENABLED]

Enabled sources: ['ra_co', 'ticketmaster']


## Step 2: Create Both Pipelines

- **Ra.co**: GraphQL API, uses `defaults.areas` (dict city → area_id), 1-indexed pagination
- **Ticketmaster**: REST API, uses `defaults.cities` (list), 0-indexed pagination
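
The two differences listed above (dict of areas versus list of cities, and the pagination start page) can be absorbed by small helpers. The sketch below is an assumption about how that could look; `iter_targets` and `page_numbers` are hypothetical names, not functions from the pipeline code.

```python
from typing import Iterator, Optional, Tuple

def iter_targets(defaults: dict) -> Iterator[Tuple[str, Optional[int]]]:
    """Yield (city, area_id) pairs; area_id is None for sources that only list cities."""
    areas = defaults.get("areas")
    if areas:
        for city, area_id in areas.items():
            yield city, area_id
    else:
        for city in defaults.get("cities", []):
            yield city, None

def page_numbers(start_page: int, max_pages: int) -> list:
    """Return the page indices to request, honoring the source's first page number."""
    return list(range(start_page, start_page + max_pages))

raco_defaults = {"areas": {"Barcelona": 20, "Madrid": 41}}
tm_defaults = {"cities": ["Barcelona", "Madrid"]}

print(list(iter_targets(raco_defaults)))  # [('Barcelona', 20), ('Madrid', 41)]
print(list(iter_targets(tm_defaults)))    # [('Barcelona', None), ('Madrid', None)]
print(page_numbers(1, 2))                 # [1, 2]  (1-indexed, as for Ra.co)
print(page_numbers(0, 2))                 # [0, 1]  (0-indexed, as for Ticketmaster)
```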

In [3]:
# Create all enabled API pipelines at once
pipelines = factory.create_all_enabled_pipelines()

ra_co = pipelines["ra_co"]
ticketmaster = pipelines["ticketmaster"]

print("Ra.co pipeline:")
print(f"  Protocol:    {ra_co.source_config.protocol}")
print(f"  Endpoint:    {ra_co.source_config.endpoint}")
print(f"  Areas:       {ra_co.source_config.defaults.get('areas', {})}")
print(f"  Days ahead:  {ra_co.source_config.defaults.get('days_ahead')}")
print(f"  Start page:  {ra_co.source_config.pagination_start_page}")

print("\nTicketmaster pipeline:")
print(f"  Protocol:    {ticketmaster.source_config.protocol}")
print(f"  Endpoint:    {ticketmaster.source_config.endpoint}")
print(f"  Cities:      {ticketmaster.source_config.defaults.get('cities', [])}")
print(f"  Days ahead:  {ticketmaster.source_config.defaults.get('days_ahead')}")
print(f"  Start page:  {ticketmaster.source_config.pagination_start_page}")

src.ingestion.factory - INFO - Created pipeline: ra_co
src.ingestion.factory - INFO - Created pipeline: ticketmaster


Ra.co pipeline:
  Protocol:    graphql
  Endpoint:    https://ra.co/graphql
  Areas:       {'Barcelona': 20, 'Madrid': 41}
  Days ahead:  1
  Start page:  1

Ticketmaster pipeline:
  Protocol:    rest
  Endpoint:    https://app.ticketmaster.com/discovery/v2/events.json
  Cities:      ['Barcelona', 'Madrid']
  Days ahead:  1
  Start page:  0


## Step 3: Execute Ra.co Pipeline

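Quick smoke test requesting one page of 10 events per call. The Barcelona-only restriction in the cell below is commented out, so the run covers both configured cities (Barcelona and Madrid).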

In [4]:
# Optional: uncomment the next line to restrict the run to Barcelona for a faster smoke test
# ra_co.source_config.defaults["areas"] = {"Barcelona": 20}

raco_result = await ra_co.execute(max_pages=1, page_size=10)

print("Ra.co Pipeline Results")
print("=" * 60)
print(f"Status:          {raco_result.status.value}")
print(f"Raw events:      {raco_result.total_events_processed}")
print(f"Successful:      {raco_result.successful_events}")
print(f"Failed:          {raco_result.failed_events}")
print(f"Duration:        {raco_result.duration_seconds:.2f}s")
print(f"Success rate:    {raco_result.success_rate:.1f}%")
print(f"Cities:          {raco_result.metadata.get('cities', [])}")

if raco_result.errors:
    print(f"\nErrors: {raco_result.errors[:3]}")

pipeline.ra_co - INFO - Starting multi-city execution: ra_co_20260218_120327_79b18a5f (2 cities)
pipeline.ra_co - INFO - Fetching events for Barcelona (area_id=20)...
pipeline.ra_co - INFO -   Barcelona: sliding window fetch [2026-02-18..2026-02-19] (capacity=500/call, window=168h)
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 1/1...
httpx - INFO - HTTP Request: POST https://ra.co/graphql "HTTP/1.1 200 OK"
src.ingestion.pipelines.apis.base_api - INFO - Pagination complete: fetched 10 total events across 2 pages
pipeline.ra_co - INFO -   Barcelona: [2026-02-18..2026-02-19] 10/28 events (SATURATED — shrinking to 84h)
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 1/1...
httpx - INFO - HTTP Request: POST https://ra.co/graphql "HTTP/1.1 200 OK"
src.ingestion.pipelines.apis.base_api - INFO - Pagination complete: fetched 10 total events across 2 pages
pipeline.ra_co - INFO -   Barcelona: [2026-02-18..2026-02-19] 10/28 events (SATURATED — shrinking to 42h)
src.ing

Ra.co Pipeline Results
Status:          partial_success
Raw events:      212
Successful:      51
Failed:          161
Duration:        105.17s
Success rate:    24.1%
Cities:          ['Barcelona', 'Madrid']
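
The log above shows the fetch window shrinking from 168h to 84h to 42h each time a call is reported as saturated (10 of 28 events returned). A minimal sketch of that halving rule follows. It assumes "saturated" means fewer events were fetched than the API reports as available; the function name and the one-hour floor are hypothetical.

```python
def shrink_window(window_hours: int, fetched: int, total_available: int,
                  min_hours: int = 1) -> int:
    """Halve the window when a fetch is saturated; otherwise keep it unchanged.

    Assumption: a fetch is saturated when it returned fewer events than the
    API reports as available for the window.
    """
    saturated = fetched < total_available
    if saturated:
        return max(min_hours, window_hours // 2)
    return window_hours

window = 168
for fetched, available in [(10, 28), (10, 28), (10, 10)]:
    window = shrink_window(window, fetched, available)
    print(window)
```

Under these assumptions the loop prints 84, 42, and 42: two saturated calls halve the window and the third leaves it alone.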


## Step 4: Execute Ticketmaster Pipeline

Same smoke-test settings: one page of 10 events per call. The Barcelona-only restriction is commented out, so both cities are queried. Note: Ticketmaster uses 0-indexed pages.

In [5]:
# Optional: uncomment the next line to restrict the run to Barcelona
# ticketmaster.source_config.defaults["cities"] = ["Barcelona"]

tm_result = await ticketmaster.execute(max_pages=1, page_size=10)

print("Ticketmaster Pipeline Results")
print("=" * 60)
print(f"Status:          {tm_result.status.value}")
print(f"Raw events:      {tm_result.total_events_processed}")
print(f"Successful:      {tm_result.successful_events}")
print(f"Failed:          {tm_result.failed_events}")
print(f"Duration:        {tm_result.duration_seconds:.2f}s")
print(f"Success rate:    {tm_result.success_rate:.1f}%")
print(f"Cities:          {tm_result.metadata.get('cities', [])}")

if tm_result.errors:
    print(f"\nErrors: {tm_result.errors[:3]}")

pipeline.ticketmaster - INFO - Starting multi-city execution: ticketmaster_20260218_120512_3523c801 (2 cities)
pipeline.ticketmaster - INFO - Fetching events for Barcelona...
pipeline.ticketmaster - INFO -   Barcelona: sliding window fetch [2026-02-18..2026-02-19] (capacity=250/call, window=168h)
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 0/0...
httpx - INFO - HTTP Request: GET https://app.ticketmaster.com/discovery/v2/events.json?apikey=<REDACTED>&city=Barcelona&countryCode=ES&startDateTime=2026-02-18T00%3A00%3A00Z&endDateTime=2026-02-18T23%3A59%3A59Z&size=10&page=0&sort=date%2Casc "HTTP/1.1 200 OK"
src.ingestion.pipelines.apis.base_api - INFO - Pagination complete: fetched 0 total events across 1 pages
pipeline.ticketmaster - INFO -   Barcelona: sliding window complete — 0 total raw events
pipeline.ticketmaster - INFO -   Barcelona: 0 raw events fetched
pipeline.ticketmaster - INFO - Fetching events for Madrid...
pipeline.ticketmaster - INFO - 

Ticketmaster Pipeline Results
Status:          success
Raw events:      10
Successful:      10
Failed:          0
Duration:        1.04s
Success rate:    100.0%
Cities:          ['Barcelona', 'Madrid']
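
The reported success rates are consistent with a plain ratio of successful to processed events. The sketch below reproduces the figures from both result summaries; `success_rate` here is a hypothetical stand-alone helper, and the way the result object computes its own `success_rate` attribute is an assumption.

```python
def success_rate(successful: int, total: int) -> float:
    """Percentage of processed events that succeeded; 0.0 when nothing was processed."""
    return 100.0 * successful / total if total else 0.0

# Figures taken from the recorded runs above.
print(f"ra_co:        {success_rate(51, 212):.1f}%")  # 24.1%
print(f"ticketmaster: {success_rate(10, 10):.1f}%")   # 100.0%
print(f"ra_co failed: {212 - 51}")                    # 161
```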


## Step 5: Inspect Ticketmaster Events

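Verify that field mapping is correct for title, venue, artists, price, and classification (event type). In the recorded output, the Museo Banksy Madrid entries are typed `concert` and list the venue name as the artist, so those two mappings deserve a closer look.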

In [6]:
tm_events = tm_result.events

if tm_events:
    print(f"Ticketmaster events ({len(tm_events)} total):")
    print("=" * 70)
    for i, event in enumerate(tm_events[:10]):
        print(f"\n[{i+1}] {event.title}")
        print(f"    City:    {event.location.city} | Venue: {event.location.venue_name}")
        print(f"    Date:    {event.start_datetime}")
        print(f"    Type:    {event.event_type}")
        price = event.price
        if price and price.minimum_price is not None:
            price_str = f"{price.minimum_price}–{price.maximum_price} {price.currency}"
        elif price and price.price_raw_text:
            price_str = price.price_raw_text
        else:
            price_str = "N/A"
        print(f"    Price:   {price_str}")
        print(f"    Artists: {[a.name for a in event.artists]}")
        print(f"    Source:  {event.source.source_url}")
        if event.location.coordinates:
            print(f"    Coords:  ({event.location.coordinates.latitude}, {event.location.coordinates.longitude})")
        print(f"    Quality: {event.data_quality_score:.2f}")
        print(f"    Custom:  {event.custom_fields}")
else:
    print("No Ticketmaster events. Check pipeline logs above for errors.")

Ticketmaster events (10 total):

[1] Museo Banksy Madrid
    City:    Madrid | Venue: Museo Banksy Madrid
    Date:    2026-02-18 09:00:00+00:00
    Type:    concert
    Price:   N/A
    Artists: ['Museo Banksy Madrid']
    Source:  https://www.universe.com/events/museo-banksy-madrid-tickets-X38GZF?ref=ticketmaster
    Coords:  (40.40373, -3.70619)
    Quality: 0.58
    Custom:  {}

[2] Museo Banksy Madrid
    City:    Madrid | Venue: Museo Banksy Madrid
    Date:    2026-02-18 10:00:00+00:00
    Type:    concert
    Price:   N/A
    Artists: ['Museo Banksy Madrid']
    Source:  https://www.universe.com/events/museo-banksy-madrid-tickets-X38GZF?ref=ticketmaster
    Coords:  (40.40373, -3.70619)
    Quality: 0.58
    Custom:  {}

[3] Museo Banksy Madrid
    City:    Madrid | Venue: Museo Banksy Madrid
    Date:    2026-02-18 11:00:00+00:00
    Type:    concert
    Price:   N/A
    Artists: ['Museo Banksy Madrid']
    Source:  https://www.universe.com/events/museo-banksy-madrid-tickets-X

In [15]:
tm_events

 EventSchema(event_id='e148ec22-bcc1-5928-998f-72b9fcb41e57', title='Museo Banksy Madrid', description=None, taxonomy_dimension=TaxonomyDimension(primary_category='play_pure_fun', subcategory='1.4', subcategory_name=None, confidence=1.0, values=[], activity_id=None, activity_name=None, energy_level=None, social_intensity=None, cognitive_load=None, physical_involvement=None, cost_level=None, time_scale=None, environment=None, emotional_output=[], risk_level=None, age_accessibility=None, repeatability=None), start_datetime=datetime.datetime(2026, 2, 18, 13, 0, tzinfo=datetime.timezone.utc), end_datetime=datetime.datetime(2026, 2, 18, 14, 0, tzinfo=datetime.timezone.utc), duration_minutes=60, is_all_day=False, is_recurring=False, recurrence_pattern=None, location=LocationInfo(venue_name='Museo Banksy Madrid', street_address='P.º de la Esperanza, 1', city='Madrid', state_or_region='Madrid', postal_code='28005', country_code='ES', coordinates=Coordinates(latitude=40.40373, longitude=-3.7061

In [7]:
raco_events = raco_result.events

if raco_events:
    print(f"Ra.co events ({len(raco_events)} total):")
    print("=" * 70)
    for i, event in enumerate(raco_events[:10]):
        print(f"\n[{i+1}] {event.title}")
        print(f"    City:    {event.location.city} | Venue: {event.location.venue_name}")
        print(f"    Date:    {event.start_datetime}")
        print(f"    Type:    {event.event_type}")
        print(f"    Price:   {event.price.price_raw_text if event.price else 'N/A'}")
        print(f"    Artists: {[a.name for a in event.artists]}")
        print(f"    Source:  {event.source.source_url}")
        print(f"    Quality: {event.data_quality_score:.2f}")
else:
    print("No Ra.co events. Check pipeline logs above for errors.")

Ra.co events (51 total):

[1] Plaiia Parties
    City:    Barcelona | Venue: Macarena Club
    Date:    2026-02-18 23:59:00+00:00
    Type:    nightlife
    Price:   10€
    Artists: ['Saulo Pisa', 'Miguel Silva', 'Civaro']
    Source:  https://ra.co/events/2348963
    Quality: 0.70

[2] Hurtado + Rubén Seoane
    City:    Barcelona | Venue: Moog Club
    Date:    2026-02-18 23:59:00+00:00
    Type:    nightlife
    Price:   None
    Artists: ['Rubén Seoane', 'Hurtado', 'Rubén Seoane\xa0Hurtado']
    Source:  https://ra.co/events/2338673
    Quality: 0.65

[3] Laurence Guy en microdosis - Razzmatazz 3, Barcelona
    City:    Barcelona | Venue: Razzmatazz 3
    Date:    2026-02-18 20:00:00+00:00
    Type:    nightlife
    Price:   18
    Artists: ['Laurence Guy']
    Source:  https://ra.co/events/2297033
    Quality: 0.70

[4] Ofenbach: CLONED [LIVE] - Apolo, Barcelona
    City:    Barcelona | Venue: Sala Apolo
    Date:    2026-02-18 19:00:00+00:00
    Type:    concert
    Price:   22,

## Step 6: Combine Events from Both Sources

In [8]:
all_events = raco_events + tm_events

print("Combined Results")
print("=" * 60)
print(f"Ra.co events:        {len(raco_events)}")
print(f"Ticketmaster events: {len(tm_events)}")
print(f"Total combined:      {len(all_events)}")

# Source breakdown
from collections import Counter
source_counts = Counter(e.source.source_name for e in all_events)
print("\nBy source:")
for src, count in source_counts.most_common():
    print(f"  {src:20}: {count}")

# Event type breakdown
type_counts = Counter(e.event_type for e in all_events)
print("\nBy event type:")
for etype, count in type_counts.most_common():
    print(f"  {etype:20}: {count}")

Combined Results
Ra.co events:        51
Ticketmaster events: 10
Total combined:      61

By source:
  ra_co               : 51
  ticketmaster        : 10

By event type:
  nightlife           : 47
  concert             : 12
  festival            : 1
  party               : 1


## Step 7: Cross-Source Deduplication

Apply `ExactMatchDeduplicator` across events from both sources. Events that appear in both Ticketmaster and Ra.co will be merged.

In [9]:
from src.ingestion.deduplication import ExactMatchDeduplicator

deduplicator = ExactMatchDeduplicator()
deduplicated_events = deduplicator.deduplicate(all_events)

print("Cross-Source Deduplication")
print("=" * 60)
print(f"Events before dedup: {len(all_events)}")
print(f"Events after dedup:  {len(deduplicated_events)}")
print(f"Duplicates removed:  {len(all_events) - len(deduplicated_events)}")

if len(all_events) != len(deduplicated_events):
    print(f"\nNote: {len(all_events) - len(deduplicated_events)} duplicate event(s) removed (within or across sources)")
else:
    print("\nNo cross-source duplicates found — sources cover distinct event sets")

Cross-Source Deduplication
Events before dedup: 61
Events after dedup:  61
Duplicates removed:  0

No cross-source duplicates found — sources cover distinct event sets
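
For intuition, exact-match deduplication can be sketched as "keep the first event seen for each normalized key". This is not the `ExactMatchDeduplicator` implementation: the key fields (title, venue, start time), the normalization, and the `Event` class are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Event:
    """Hypothetical, reduced stand-in for the notebook's event objects."""
    title: str
    venue_name: str
    start_datetime: datetime
    source_name: str

def dedup_key(event: Event) -> tuple:
    """Assumed match key: case-insensitive title and venue plus exact start time."""
    return (
        event.title.strip().casefold(),
        event.venue_name.strip().casefold(),
        event.start_datetime,
    )

def deduplicate(events: list) -> list:
    """Keep the first event per key, preserving input order."""
    seen = {}
    for event in events:
        seen.setdefault(dedup_key(event), event)
    return list(seen.values())

start = datetime(2026, 2, 18, 19, 0, tzinfo=timezone.utc)
events = [
    Event("Ofenbach: CLONED [LIVE]", "Sala Apolo", start, "ra_co"),
    Event("ofenbach: cloned [live] ", "sala apolo", start, "ticketmaster"),
    Event("Laurence Guy", "Razzmatazz 3", start, "ra_co"),
]
unique = deduplicate(events)
print(len(events), "->", len(unique))
```

A key this strict only catches records whose title, venue, and start time agree after normalization. That may be one reason a cross-source run reports zero duplicates even when the same real-world event is listed under different titles.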


In [16]:
deduplicated_events

[EventSchema(event_id='5b2f6d9c-5790-53e5-9792-1dcdc8d35b76', title='Plaiia Parties', description='Esta noche va a ser muy especial , Plaiia presenta showcase del sello chileno anakonda records, no falten ni lleguen tarde a esta fecha tan especial ! Tickets gratis antes de la 1am en esta plataforma :)', taxonomy_dimension=TaxonomyDimension(primary_category='play_pure_fun', subcategory='1.4', subcategory_name=None, confidence=1.0, values=[], activity_id=None, activity_name=None, energy_level=None, social_intensity=None, cognitive_load=None, physical_involvement=None, cost_level=None, time_scale=None, environment=None, emotional_output=[], risk_level=None, age_accessibility=None, repeatability=None), start_datetime=datetime.datetime(2026, 2, 18, 23, 59, tzinfo=datetime.timezone.utc), end_datetime=datetime.datetime(2026, 2, 19, 5, 0, tzinfo=datetime.timezone.utc), duration_minutes=301, is_all_day=False, is_recurring=False, recurrence_pattern=None, location=LocationInfo(venue_name='Macaren

## Step 8: Combined DataFrame

Build a unified DataFrame with `source_name` for side-by-side comparison.

In [10]:
import pandas as pd

# Build dataframes per source then concat
dfs = []
if raco_events:
    df_raco = ra_co.to_dataframe(raco_events)
    dfs.append(df_raco)
    print(f"Ra.co DataFrame:        {df_raco.shape}")

if tm_events:
    df_tm = ticketmaster.to_dataframe(tm_events)
    dfs.append(df_tm)
    print(f"Ticketmaster DataFrame: {df_tm.shape}")

if dfs:
    df_combined = pd.concat(dfs, ignore_index=True)
    print(f"\nCombined DataFrame:     {df_combined.shape}")
else:
    df_combined = pd.DataFrame()
    print("No events to combine")

Ra.co DataFrame:        (51, 79)
Ticketmaster DataFrame: (10, 79)

Combined DataFrame:     (61, 79)


In [11]:
# Side-by-side comparison view
focus_cols = [
    "source_name", "title", "event_type", "city",
    "venue_name", "start_datetime",
    "price_minimum", "price_maximum", "price_currency",
    "artists", "data_quality_score",
]
available = [c for c in focus_cols if c in df_combined.columns]
df_combined.head(50)

Unnamed: 0,event_id,title,description,start_datetime,end_datetime,duration_minutes,is_all_day,is_recurring,recurrence_pattern,venue_name,...,taxonomy_age_accessibility,taxonomy_repeatability,taxonomy_dimension_json,data_quality_score,normalization_errors,tags,artists,custom_fields_json,created_at,updated_at
0,5b2f6d9c-5790-53e5-9792-1dcdc8d35b76,Plaiia Parties,"Esta noche va a ser muy especial , Plaiia pres...",2026-02-18 23:59:00+00:00,2026-02-19 05:00:00+00:00,301,False,False,,Macarena Club,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.7,,,"Saulo Pisa, Miguel Silva, Civaro","{""is_ticketed"": true}",2026-02-18 12:04:03.627827+00:00,2026-02-18 12:04:03.627829+00:00
1,9bd4b446-c7df-5fe6-acca-ba7e699627c1,Hurtado + Rubén Seoane,Hurtado és un artista espanyol establert a Ber...,2026-02-18 23:59:00+00:00,2026-02-19 05:00:00+00:00,301,False,False,,Moog Club,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.65,,,"Rubén Seoane, Hurtado, Rubén Seoane Hurtado","{""is_ticketed"": true}",2026-02-18 12:04:03.763337+00:00,2026-02-18 12:04:03.763338+00:00
2,09510f6b-1ca5-5288-9dbc-01bf062187bd,"Laurence Guy en microdosis - Razzmatazz 3, Bar...",microdosis presenta: LAURENCE GUY “Music to ma...,2026-02-18 20:00:00+00:00,2026-02-18 23:30:00+00:00,210,False,False,,Razzmatazz 3,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.7,,,Laurence Guy,"{""is_ticketed"": true}",2026-02-18 12:04:05.734464+00:00,2026-02-18 12:04:05.734469+00:00
3,e3eee00b-614b-5753-a17c-c2bd83850733,"Ofenbach: CLONED [LIVE] - Apolo, Barcelona","De Daft Punk a Polo & Pan, a menudo la histori...",2026-02-18 19:00:00+00:00,2026-02-18 23:00:00+00:00,240,False,False,,Sala Apolo,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.7,,,Ofenbach,"{""is_ticketed"": false}",2026-02-18 12:04:06.734706+00:00,2026-02-18 12:04:06.734708+00:00
4,b1d8a473-dcc2-5d6e-8709-dab062021746,"HiFi: Vultur, Moray",,2026-02-18 22:00:00+00:00,2026-02-19 02:30:00+00:00,270,False,False,,Switch Bar,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.65,,,"Moray, Vultur","{""is_ticketed"": false}",2026-02-18 12:04:07.745528+00:00,2026-02-18 12:04:07.745531+00:00
5,24fecb2e-8e7d-52d3-9886-d1d896d33ea3,"Wednesnight with Chill Miracle, Djaq, Keyblow",BEATIME presents: WEDNESNIGHT! Every Wednesday...,2026-02-18 22:30:00+00:00,2026-02-19 02:30:00+00:00,240,False,False,,Garage 442,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.65,,,@chill miracle @ djaq,"{""is_ticketed"": true}",2026-02-18 12:04:08.746921+00:00,2026-02-18 12:04:08.746924+00:00
6,5c1775bd-f5ff-5b82-bbf5-f278a2958d10,DANCE HALL REGGAE:SIZZLA GAMBIA-JULIA TOWERS-L...,,2026-02-18 22:00:00+00:00,2026-02-19 02:30:00+00:00,270,False,False,,Absenta del Raval,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.6,,,@SIZZLA GAMBIA @julia towers@ ena ghema@ leandro,"{""is_ticketed"": false}",2026-02-18 12:04:09.752163+00:00,2026-02-18 12:04:09.752168+00:00
7,f6ae9894-553d-5aee-96ff-cc97671c56a7,Dr. Resin Social Club meets Jazz K,"For more information, please send us your inqu...",2026-02-18 18:00:00+00:00,2026-02-18 22:00:00+00:00,240,False,False,,Dr. Resin Social Club,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.65,,,Jazz K,"{""is_ticketed"": false}",2026-02-18 12:04:10.758706+00:00,2026-02-18 12:04:10.758710+00:00
8,51048682-f9ee-55cf-a002-6cda140948fd,Festivale alternativo,Djtao se pone en esta ocasión al mando para ha...,2026-02-18 19:00:00+00:00,2026-02-19 23:00:00+00:00,1680,False,False,,El Club Verde,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.6,,,Dj tao,"{""is_ticketed"": false}",2026-02-18 12:04:12.761881+00:00,2026-02-18 12:04:12.761882+00:00
9,a4f2434f-0045-5d55-bc23-9bb375f5cc6b,Obelisk pres. Thursday Hard Ritual with Øxiyd ...,OBELISK presenta su segundo ritual en City Hal...,2026-02-19 23:59:00+00:00,2026-02-20 05:00:00+00:00,301,False,False,,City Hall,...,,,"{""primary_category"": ""play_pure_fun"", ""subcate...",0.65,,,"Øxiyd, Yeison M, Acidax, Roderiq, Rolo, Veydos...","{""is_ticketed"": true}",2026-02-18 12:04:14.773618+00:00,2026-02-18 12:04:14.773623+00:00


## Step 9: Summary Statistics Per Source

In [12]:
if df_combined.empty:
    print("No events to summarize.")
else:
    print("=" * 60)
    print("MULTI-SOURCE INGESTION SUMMARY")
    print("=" * 60)
    print(f"\nTotal events (combined): {len(df_combined)}")

    print("\n--- Events per Source ---")
    print(df_combined.groupby("source_name").size().to_string())

    print("\n--- Quality Score per Source ---")
    print(df_combined.groupby("source_name")["data_quality_score"].agg(["mean", "min", "max"]).round(3).to_string())

    print("\n--- Event Types per Source ---")
    print(df_combined.groupby(["source_name", "event_type"]).size().to_string())

    print("\n--- Artists Coverage per Source ---")
    artists_col = df_combined["artists"].fillna("")
    has_artists = artists_col != ""
    for src in df_combined["source_name"].unique():
        src_mask = df_combined["source_name"] == src
        total_src = src_mask.sum()
        with_artists = (src_mask & has_artists).sum()
        print(f"  {src:20}: {with_artists}/{total_src} events have artists ({100*with_artists/total_src:.0f}%)")

    print("\n--- Date Range per Source ---")
    for src in df_combined["source_name"].unique():
        src_df = df_combined[df_combined["source_name"] == src]
        print(f"  {src}:")
        print(f"    Earliest: {src_df['start_datetime'].min()}")
        print(f"    Latest:   {src_df['start_datetime'].max()}")

    print("\n--- Free vs Paid per Source ---")
    print(df_combined.groupby(["source_name", "price_is_free"]).size().to_string())

MULTI-SOURCE INGESTION SUMMARY

Total events (combined): 61

--- Events per Source ---
source_name
ra_co           51
ticketmaster    10

--- Quality Score per Source ---
               mean   min  max
source_name                   
ra_co         0.655  0.55  0.7
ticketmaster  0.592  0.58  0.6

--- Event Types per Source ---
source_name   event_type
ra_co         concert        2
              festival       1
              nightlife     47
              party          1
ticketmaster  concert       10

--- Artists Coverage per Source ---
  ra_co               : 51/51 events have artists (100%)
  ticketmaster        : 10/10 events have artists (100%)

--- Date Range per Source ---
  ra_co:
    Earliest: 2026-02-14 23:00:00+00:00
    Latest:   2026-02-19 23:59:00+00:00
  ticketmaster:
    Earliest: 2026-02-18 09:00:00+00:00
    Latest:   2026-02-18 18:00:00+00:00

--- Free vs Paid per Source ---
source_name   price_is_free
ra_co         False            25
              True             

## Step 10: City Stats & Full Field Coverage

City-level ingestion breakdown per source, followed by a comprehensive field coverage table
across all meaningful EventSchema sections (Core, Location, Pricing, Ticket, Organizer, etc.).

- **✓** = 100% populated
- **!** = 0% populated (gap to investigate)
- *blank* = partial coverage
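
The legend above can be read as a small function: compute the share of populated values, then pick a marker. The following sketch works on plain Python lists rather than DataFrame columns, and the helper names are hypothetical; the emptiness rules mirror the string checks used in the cell below.

```python
import math
from typing import Optional

def is_empty(value) -> bool:
    """Treat None, NaN, and the strings '', 'nan', 'None', '[]' as unpopulated."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value) in ("", "nan", "None", "[]")

def coverage_pct(values: list) -> Optional[float]:
    """Percentage of populated values, or None when there are no rows."""
    if not values:
        return None
    filled = sum(1 for value in values if not is_empty(value))
    return 100 * filled / len(values)

def coverage_marker(pct: float) -> str:
    """Check mark for full coverage, '!' for none, blank for partial."""
    if pct == 100:
        return "✓"
    if pct == 0:
        return "!"
    return ""

sample = ["Macarena Club", None, "", "Sala Apolo"]
pct = coverage_pct(sample)
print(f"{pct:.0f}% {coverage_marker(pct)}")
```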

In [13]:
if not df_combined.empty:
    # ── City Ingestion Statistics per Source ─────────────────────────────
    print("City Ingestion Statistics per Source")
    print("=" * 60)
    print(df_combined.groupby(["source_name", "city"]).size().rename("events").to_string())

    print()

    # ── Field Coverage Table (all meaningful schema fields) ───────────────
    # Grouped by schema section for readability
    field_groups = {
        "Core": [
            "title", "description", "event_type", "event_format",
            "capacity", "age_restriction", "is_all_day", "is_recurring",
        ],
        "Timing": [
            "start_datetime", "end_datetime", "duration_minutes",
        ],
        "Location": [
            "venue_name", "street_address", "city", "state_or_region",
            "postal_code", "country_code", "timezone", "latitude", "longitude",
        ],
        "Pricing": [
            "price_is_free", "price_minimum", "price_maximum",
            "price_currency", "price_raw_text",
        ],
        "Ticket": [
            "ticket_url", "ticket_is_sold_out", "ticket_count_available",
        ],
        "Organizer": [
            "organizer_name", "organizer_url", "organizer_phone",
            "organizer_email", "organizer_follower_count",
        ],
        "Artists & Media": [
            "artists", "media_assets_json",
        ],
        "Engagement": [
            "engagement_going_count", "engagement_interested_count",
            "engagement_views_count", "engagement_shares_count",
        ],
        "Source": [
            "source_event_id", "source_url", "source_updated_at",
        ],
    }

    sources = df_combined["source_name"].unique()
    col_w = 28

    def field_pct(src_df, field):
        """Return non-null, non-empty percentage for a field."""
        if field not in src_df.columns:
            return None
        total = len(src_df)
        if total == 0:
            return None
        col = src_df[field]
        non_null = col.notna().sum()
        if col.dtype == object:
            non_null = (
                col.notna()
                & (col.astype(str) != "")
                & (col.astype(str) != "nan")
                & (col.astype(str) != "None")
                & (col.astype(str) != "[]")
            ).sum()
        return 100 * non_null / total

    print("Field Coverage by Source (% non-null / non-empty)")
    print("=" * (col_w + 2 + 17 * len(sources)))
    print(f"  {'Field':<{col_w}}", end="")
    for src in sources:
        print(f"  {src:>15}", end="")
    print()

    for group_name, fields in field_groups.items():
        print(f"\n  ── {group_name} {'─' * (col_w - len(group_name) - 4)}")
        for field in fields:
            pcts = []
            for src in sources:
                src_df = df_combined[df_combined["source_name"] == src]
                pct = field_pct(src_df, field)
                pcts.append(pct)

            # Skip fields that are 0% for ALL sources (not interesting)
            if all(p == 0 or p is None for p in pcts):
                continue

            print(f"  {field:<{col_w}}", end="")
            for pct in pcts:
                if pct is None:
                    print(f"  {'N/A':>15}", end="")
                else:
                    marker = " ✓" if pct == 100 else (" !" if pct == 0 else "  ")
                    print(f"  {pct:>13.0f}%{marker}", end="")
            print()

City Ingestion Statistics per Source
source_name   city     
ra_co         Barcelona    27
              Madrid       24
ticketmaster  Madrid       10

Field Coverage by Source (% non-null / non-empty)
  Field                                   ra_co     ticketmaster

  ── Core ────────────────────
  title                                   100% ✓            100% ✓
  description                              76%                0% !
  event_type                              100% ✓            100% ✓
  event_format                            100% ✓            100% ✓
  capacity                                 94%                0% !
  age_restriction                          63%                0% !
  is_all_day                              100% ✓            100% ✓
  is_recurring                            100% ✓            100% ✓

  ── Timing ──────────────────
  start_datetime                          100% ✓            100% ✓
  end_datetime                            100% ✓            100% ✓

## Step 11: Persist to Database

Write deduplicated events to PostgreSQL using `EventDataWriter`.
Each event is persisted atomically (per-event rollback on failure).
Run this after a successful pipeline execution to populate the DB.
The cell needs the `psycopg2` driver; the recorded run below stopped with `ModuleNotFoundError` because it was not installed in the notebook environment.

In [14]:
import traceback
from contextlib import closing
from urllib.parse import urlparse

import psycopg2
from src.ingestion.persist import EventDataWriter

DATABASE_URL = os.environ.get("DATABASE_URL", "")
if not DATABASE_URL:
    print("ERROR: DATABASE_URL not set; check .env")
else:
    # Parse postgresql://user:pass@host:port/dbname
    u = urlparse(DATABASE_URL)
    conn_params = dict(
        host=u.hostname,
        port=u.port or 5432,
        dbname=u.path.lstrip("/"),
        user=u.username,
        password=u.password,
    )

    events_to_persist = deduplicated_events  # from Step 7 cross-source dedup

    print(f"Persisting {len(events_to_persist)} events to PostgreSQL...")
    # Print the resolved port so a URL without an explicit port does not show "None"
    print(f"  DB: {conn_params['host']}:{conn_params['port']}/{conn_params['dbname']}")
    print()

    try:
        # closing() releases the connection even if persistence raises
        with closing(psycopg2.connect(**conn_params)) as conn:
            writer = EventDataWriter(conn)

            # Report taxonomy cache
            print(f"  Taxonomy cache: {len(writer._valid_primary_categories)} primary categories, "
                  f"{len(writer._valid_subcategories)} subcategories, "
                  f"{len(writer._valid_activities)} activities")
            print()

            saved = writer.persist_batch(events_to_persist)

        print(f"Persist complete: {saved}/{len(events_to_persist)} events saved")
        print(f"  Failed/skipped: {len(events_to_persist) - saved}")

        # Quick verification query on a fresh connection
        with closing(psycopg2.connect(**conn_params)) as conn2:
            with conn2.cursor() as cur:
                cur.execute("SELECT COUNT(*) FROM events")
                total_events = cur.fetchone()[0]
                cur.execute("SELECT COUNT(*) FROM locations")
                total_locations = cur.fetchone()[0]
                cur.execute("SELECT COUNT(*) FROM sources")
                total_sources = cur.fetchone()[0]
                cur.execute(
                    "SELECT source_name, COUNT(*) FROM sources GROUP BY source_name ORDER BY source_name"
                )
                by_source = cur.fetchall()

        print("\nDB verification:")
        print(f"  events    : {total_events}")
        print(f"  locations : {total_locations}")
        print(f"  sources   : {total_sources}")
        print("  By source :")
        for src_name, cnt in by_source:
            print(f"    {src_name:20}: {cnt}")

    except Exception as e:
        print(f"ERROR: {e}")
        traceback.print_exc()

ModuleNotFoundError: No module named 'psycopg2'

## Step 12: Save Pipeline Results (Parquet + Pickle)

In [None]:
import pickle

output_dir = "../data/raw"
os.makedirs(output_dir, exist_ok=True)

if not df_combined.empty:
    # Save combined parquet
    parquet_path = f"{output_dir}/multi_source_events.parquet"
    try:
        df_combined.to_parquet(parquet_path, index=False, engine="pyarrow")
    except ImportError:
        df_combined.to_parquet(parquet_path, index=False, engine="fastparquet")
    print(f"Saved {len(df_combined)} events to {parquet_path}")

# Save pipeline results as pickle for downstream use
results = {"ra_co": raco_result, "ticketmaster": tm_result}
pkl_path = f"{output_dir}/multi_source_results.pkl"
with open(pkl_path, "wb") as f:
    pickle.dump(results, f, protocol=pickle.HIGHEST_PROTOCOL)
print(f"Saved PipelineExecutionResult objects to {pkl_path}")
print(f"  ra_co:        {raco_result.successful_events} events, status={raco_result.status.value}")
print(f"  ticketmaster: {tm_result.successful_events} events, status={tm_result.status.value}")
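
For downstream use, the pickle file is read back with `pickle.load`. The sketch below shows the round trip on a plain dictionary in a temporary directory; the dictionary contents are stand-ins for the real result objects, which can only be unpickled where the `src` package is importable. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in data; the notebook stores PipelineExecutionResult objects instead.
results = {
    "ra_co": {"successful_events": 51, "status": "partial_success"},
    "ticketmaster": {"successful_events": 10, "status": "success"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "multi_source_results.pkl")

    # Write with the same protocol setting the notebook uses.
    with open(pkl_path, "wb") as f:
        pickle.dump(results, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back as a downstream consumer would.
    with open(pkl_path, "rb") as f:
        loaded = pickle.load(f)

for name, summary in loaded.items():
    print(f"{name}: {summary['successful_events']} events, status={summary['status']}")
```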

## Cleanup

In [None]:
await ra_co.close()
await ticketmaster.close()
print("Resources released.")
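
If the first `close()` raises, the second pipeline above is never closed. One way around that, sketched here with a dummy class, is to close everything through `asyncio.gather(..., return_exceptions=True)`. `DummyPipeline` and `close_all` are hypothetical; the only assumption about the real pipelines is that they expose an async `close()`, as the cell above shows.

```python
import asyncio

class DummyPipeline:
    """Stand-in for a pipeline object; only models an async close()."""

    def __init__(self, name: str, fail_on_close: bool = False):
        self.name = name
        self.fail_on_close = fail_on_close
        self.closed = False

    async def close(self) -> None:
        if self.fail_on_close:
            raise RuntimeError(f"{self.name}: close failed")
        self.closed = True

async def close_all(pipelines: list) -> list:
    """Attempt to close every pipeline; return the names whose close() raised."""
    outcomes = await asyncio.gather(
        *(p.close() for p in pipelines), return_exceptions=True
    )
    return [p.name for p, outcome in zip(pipelines, outcomes)
            if isinstance(outcome, Exception)]

pipelines = [DummyPipeline("ra_co", fail_on_close=True), DummyPipeline("ticketmaster")]
failed = asyncio.run(close_all(pipelines))
print(f"Failed to close: {failed}")
print(f"ticketmaster closed: {pipelines[1].closed}")
```

Inside a notebook, where an event loop is already running, `await close_all(pipelines)` would replace the `asyncio.run(...)` call.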