# Event Ingestion Pipeline Testing

This notebook tests the **config-driven** event ingestion pipeline.
Every source (Ra.co, Ticketmaster, etc.) is created by `PipelineFactory`
from YAML configuration, so no source-specific code is needed.

**Pipeline flow:**
1. Factory reads `ingestion.yaml` and creates pipelines
2. Each pipeline fetches raw data via its adapter (GraphQL / REST)
3. `FieldMapper` extracts and transforms fields as specified in the config
4. `TaxonomyMapper` assigns Human Experience Taxonomy dimensions
5. Events are normalized to `EventSchema` and optionally enriched by an LLM
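
Step 3 of the flow above can be pictured with a small sketch. This is not the project's `FieldMapper`; it is a hypothetical stand-in that assumes mappings are written as dotted paths into the raw payload plus optional named transforms. The names `TRANSFORMS`, `get_path`, `map_fields`, and the mapping keys are all illustrative assumptions.

```python
from typing import Any, Callable

# Hypothetical named transforms that a config file could refer to by string.
TRANSFORMS: dict[str, Callable[[Any], Any]] = {
    "strip": lambda v: v.strip() if isinstance(v, str) else v,
    "to_int": lambda v: int(v) if v is not None else None,
}


def get_path(payload: dict, dotted: str, default: Any = None) -> Any:
    """Walk a dotted path such as 'event.venue.name' through nested dicts."""
    current: Any = payload
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def map_fields(raw: dict, mapping: dict[str, dict]) -> dict:
    """Build a flat record from `raw` according to a config-style mapping."""
    record = {}
    for target, spec in mapping.items():
        value = get_path(raw, spec["path"])
        for name in spec.get("transforms", []):
            value = TRANSFORMS[name](value)
        record[target] = value
    return record


# Illustrative mapping, shaped like something a YAML file could hold.
mapping = {
    "title": {"path": "event.title", "transforms": ["strip"]},
    "venue_name": {"path": "event.venue.name"},
    "area_id": {"path": "event.venue.area.id", "transforms": ["to_int"]},
}
# Made-up raw payload; the real Ra.co response shape is not shown in this notebook.
raw = {
    "event": {
        "title": "  Mise en Place ",
        "venue": {"name": "Macarena Club", "area": {"id": "20"}},
    }
}
mapped = map_fields(raw, mapping)
print(mapped)
```

Under this assumption, supporting a new source means writing a new mapping rather than new extraction code, which is the property the notebook sets out to test.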

In [1]:
import sys
import os
import logging


# Setup path — point to services/api so src.* imports work
API_ROOT = os.path.abspath(os.path.join("..", "services", "api"))
if API_ROOT not in sys.path:
    sys.path.insert(0, API_ROOT)

# Setup path — point to services/scrapping so scrapping.* imports work
SCRAPPING_ROOT = os.path.abspath(os.path.join("..", "services", "scrapping"))
if SCRAPPING_ROOT not in sys.path:
    sys.path.insert(0, SCRAPPING_ROOT)

# Enable logging
logging.basicConfig(
    level=logging.INFO,
    format="%(name)s - %(levelname)s - %(message)s",
)


print(f"API root: {API_ROOT}")
print(f"Scrapping root: {SCRAPPING_ROOT}")
print("Setup complete")

API root: /Users/josegarcia/Documents/GitHub/event-intelligence-platform/services/api
Scrapping root: /Users/josegarcia/Documents/GitHub/event-intelligence-platform/services/scrapping
Setup complete


## Step 1: PipelineFactory — List All Configured Sources

The factory reads `ingestion.yaml` and can create pipelines for any enabled source.
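
As a rough picture of what "config-driven" means here, the sketch below models the parsed YAML as a plain dict and selects an adapter by protocol. The key names, the `"rest"` protocol value, the `ADAPTERS` registry, and both helper functions are assumptions inferred from the cell outputs in this notebook; the actual `ingestion.yaml` schema and `PipelineFactory` internals may differ.

```python
from typing import Callable

# Assumed shape of the parsed YAML. Key names are guesses based on the cell
# outputs in this notebook, not the project's actual schema.
config = {
    "sources": {
        "ra_co": {
            "enabled": True,
            "type": "api",
            "protocol": "graphql",
            "endpoint": "https://ra.co/graphql",
            "defaults": {"areas": {"Barcelona": 20, "Madrid": 41}, "days_ahead": 120},
        },
        "ticketmaster": {"enabled": False, "type": "api", "protocol": "rest"},
    }
}

# Hypothetical adapter registry: one entry per protocol, none per source.
ADAPTERS: dict[str, Callable[[dict], str]] = {
    "graphql": lambda src: f"GraphQLAdapter({src['endpoint']})",
    "rest": lambda src: f"RestAdapter({src.get('endpoint')})",
}


def list_enabled_sources(cfg: dict) -> list[str]:
    """Return the names of sources whose `enabled` flag is true."""
    return [name for name, src in cfg["sources"].items() if src.get("enabled")]


def create_pipeline(cfg: dict, name: str) -> str:
    """Pick an adapter by protocol; the source name is only a config key."""
    src = cfg["sources"][name]
    return ADAPTERS[src["protocol"]](src)


enabled = list_enabled_sources(config)
pipeline_repr = create_pipeline(config, "ra_co")
print(enabled, pipeline_repr)
```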

In [2]:
from src.ingestion.factory import PipelineFactory

factory = PipelineFactory()

print("Configured Sources:")
print("=" * 50)
for name, info in factory.list_sources().items():
    status = "ENABLED" if info["enabled"] else "disabled"
    print(f"  {name:20} type={info['type']:10} [{status}]")

print(f"\nEnabled sources: {factory.list_enabled_sources()}")

Configured Sources:
  ra_co                type=api        [ENABLED]
  ticketmaster         type=api        [disabled]

Enabled sources: ['ra_co']


## Step 2: Ra.co Pipeline — Multi-City Ingestion

The Ra.co pipeline is created entirely from config. It uses:
- GraphQL API adapter
- Multi-city execution (Barcelona + Madrid via `defaults.areas`)
- Date-window splitting for complete coverage
- FieldMapper for extraction + transformations
- FeatureExtractor (LLM) for taxonomy enrichment, when one is configured (the run below reports `Feature extractor: False`)
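
The date-window splitting in the list above can be sketched as follows. The function `split_windows` is hypothetical, and it assumes fixed-size, non-overlapping windows; the pipeline's own logs mention a "sliding window" with a per-call capacity, so the real implementation may adjust window sizes dynamically.

```python
from datetime import datetime, timedelta
from typing import Iterator


def split_windows(
    start: datetime, end: datetime, window: timedelta
) -> Iterator[tuple[datetime, datetime]]:
    """Yield consecutive half-open [lo, hi) windows that cover [start, end)."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    lo = start
    while lo < end:
        hi = min(lo + window, end)  # the last window may be shorter
        yield lo, hi
        lo = hi


start = datetime(2026, 2, 11)
end = start + timedelta(days=120)  # days_ahead as reported by the config
windows = list(split_windows(start, end, timedelta(hours=168)))
print(len(windows), windows[0], windows[-1])
```

With a 120-day horizon and 168-hour windows this produces 18 windows, the last of which covers a single day. Fetching each window separately keeps every request under the API's result cap, which is why the splitting matters for complete coverage.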

In [3]:
ra_co = factory.create_pipeline("ra_co")

print(f"Pipeline: {ra_co.config.source_name}")
print(f"Source type: {ra_co.source_type.value}")
print(f"Protocol: {ra_co.source_config.protocol}")
print(f"Endpoint: {ra_co.source_config.endpoint}")
print(f"Areas: {ra_co.source_config.defaults.get('areas', {})}")
print(f"Days ahead: {ra_co.source_config.defaults.get('days_ahead')}")
print(f"Feature extractor: {ra_co.feature_extractor is not None}")

Pipeline: ra_co
Source type: api
Protocol: graphql
Endpoint: https://ra.co/graphql
Areas: {'Barcelona': 20, 'Madrid': 41}
Days ahead: 120
Feature extractor: False


In [4]:
# Execute Ra.co pipeline (multi-city: Barcelona + Madrid)
raco_result = ra_co.execute()

print("Ra.co Pipeline Results")
print("=" * 60)
print(f"Status: {raco_result.status.value}")
print(f"Total raw events: {raco_result.total_events_processed}")
print(f"Successful: {raco_result.successful_events}")
print(f"Failed: {raco_result.failed_events}")
print(f"Duration: {raco_result.duration_seconds:.2f}s")
print(f"Success rate: {raco_result.success_rate:.1f}%")
print(f"Cities: {raco_result.metadata.get('cities', [])}")

if raco_result.errors:
    print(f"\nErrors: {raco_result.errors}")

pipeline.ra_co - INFO - Starting multi-city execution: ra_co_20260211_195107_1bf5c7cf (2 cities)
pipeline.ra_co - INFO - Fetching events for Barcelona (area_id=20)...
pipeline.ra_co - INFO -   Barcelona: sliding window fetch [2026-02-11..2026-06-11] (capacity=500/call, window=168h)
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 1/10...
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 2/10...
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 3/10...
src.ingestion.pipelines.apis.base_api - INFO - Received 39 events (less than page_size), stopping pagination
src.ingestion.pipelines.apis.base_api - INFO - Pagination complete: fetched 139 total events across 3 pages
pipeline.ra_co - INFO -   Barcelona: [2026-02-11..2026-02-18] 139/139 events
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 1/10...
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 2/10...
src.ingestion.pipelines.apis.base_api - INFO - Fetching page 3/10...
src.inges

Ra.co Pipeline Results
Status: partial_success
Total raw events: 949
Successful: 899
Failed: 50
Duration: 1610.28s
Success rate: 94.7%
Cities: ['Barcelona', 'Madrid']
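
The reported figures are internally consistent, which can be checked by hand. The helper below assumes the success rate is simply successful events over total processed events; that matches the output above, but the exact formula used by the result object is not shown in this notebook.

```python
def success_rate(successful: int, total: int) -> float:
    """Percentage of processed events that were normalized successfully."""
    return 0.0 if total == 0 else successful / total * 100


total, successful, failed = 949, 899, 50
assert successful + failed == total
rate = success_rate(successful, total)
print(f"Success rate: {rate:.1f}%")
```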


In [7]:
# Show sample normalized events
if raco_result.events:
    print(f"Sample Events ({len(raco_result.events)} total):")
    print("=" * 70)

    for i, event in enumerate(raco_result.events[:5]):
        print(f"\n[{i+1}] {event.title}")
        print(f"    City: {event.location.city} | Venue: {event.location.venue_name}")
        print(f"    Date: {event.start_datetime}")
        print(f"    Type: {event.event_type} | Price: {event.price.price_raw_text}")
        print(f"    Source URL: {event.source.source_url}")
        desc = (event.description or 'N/A')[:120]
        print(f"    Description: {desc}...")
        print(f"    Quality: {event.data_quality_score:.2f}")
else:
    print("No events fetched. Check pipeline logs above for errors.")

Sample Events (899 total):

[1] NØVAROOM
    City: Barcelona | Venue: City Hall
    Date: 2026-02-11 23:59:00+00:00
    Type: nightlife | Price: 0
    Source URL: https://ra.co/events/2365398
    Description: NØVAROOM nace como un nuevo concepto enfocado en el groove, la conexión y la energía del club. Un espacio donde la músic...
    Quality: 0.55

[2] Mise en Place
    City: Barcelona | Venue: Macarena Club
    Date: 2026-02-11 23:59:00+00:00
    Type: nightlife | Price: 10€
    Source URL: https://ra.co/events/2348941
    Description: N/A...
    Quality: 0.55

[3] MEN (All Night Long)
    City: Barcelona | Venue: Moog Club
    Date: 2026-02-11 23:59:00+00:00
    Type: nightlife | Price: None
    Source URL: https://ra.co/events/2338538
    Description: Men llegó hace unos años a Barcelona después de recorrer todos los clubs de su Cantábrico natal (La Real, Locomotive, Wh...
    Quality: 0.55

[4] TIMEZERO SHOWCASE 
    City: Barcelona | Venue: Garage 442
    Date: 2026-02-11 22:30:0

In [15]:
raco_result.events

[EventSchema(event_id='430a341e-2b4e-5c50-ae7f-e565ea0569c5', title='NØVAROOM', description='NØVAROOM nace como un nuevo concepto enfocado en el groove, la conexión y la energía del club. Un espacio donde la música marca el ritmo y la pista se convierte en un punto de encuentro real. El proyecto se mueve entre la Tech House y el Minimal / Deep Tech , con sets diseñados para construir una narrativa progresiva, cuidando el detalle, la dinámica y la respuesta del público. NØVAROOM llega a City Hall Barcelona para transformar la sala en un viaje nocturno guiado por el groove, la intensidad y la conexión con la pista.', primary_category='play_and_fun', taxonomy_dimensions=[TaxonomyDimension(primary_category=<PrimaryCategory.PLAY_AND_PURE_FUN: 'play_and_fun'>, subcategory='1.4', subcategory_name=None, confidence=1.0, values=[], activity_id=None, activity_name=None, energy_level=None, social_intensity=None, cognitive_load=None, physical_involvement=None, cost_level=None, time_scale=None, envi

## Step 3: Ticketmaster Pipeline (REST API)

Ticketmaster uses a REST API rather than GraphQL, so running it through the same factory checks that the pipeline is source-agnostic.

**Note:** Requires the `TICKETMASTER_API_KEY` environment variable. If it is not set, this section is skipped.

In [6]:
tm_result = None
tm_api_key = os.environ.get("TICKETMASTER_API_KEY")

if tm_api_key:
    ticketmaster = factory.create_pipeline("ticketmaster")
    print(f"Pipeline: {ticketmaster.config.source_name}")
    print(f"Protocol: {ticketmaster.source_config.protocol}")
    print(f"Endpoint: {ticketmaster.source_config.endpoint}")

    tm_result = ticketmaster.execute(
        city="Barcelona",
        country_code="ES",
    )

    print("\nTicketmaster Results")
    print("=" * 60)
    print(f"Status: {tm_result.status.value}")
    print(f"Total: {tm_result.total_events_processed}")
    print(f"Successful: {tm_result.successful_events}")
    print(f"Duration: {tm_result.duration_seconds:.2f}s")
else:
    print("TICKETMASTER_API_KEY not set — skipping Ticketmaster pipeline.")
    print("Set the env var and re-run to test REST API ingestion.")

TICKETMASTER_API_KEY not set — skipping Ticketmaster pipeline.
Set the env var and re-run to test REST API ingestion.


## Step 4: Multi-Source Aggregation

Combine events from all sources and deduplicate.

## Alternative: Orchestrator Pattern

The `PipelineOrchestrator` executes all enabled pipelines and provides
cross-source deduplication in a single call.
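
How the orchestrator deduplicates is not shown in this notebook. One common approach, sketched below, derives a deterministic key from the fields that identify a real-world event and keeps the first event per key. The event IDs in the outputs above all carry version digit 5, which is consistent with name-based UUIDs, but that is only an observation. `SketchEvent`, `SKETCH_NAMESPACE`, `dedup_key`, and `deduplicate` are hypothetical names.

```python
import uuid
from dataclasses import dataclass

# Arbitrary namespace for this sketch; the project's real namespace is not shown.
SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/events")


@dataclass(frozen=True)
class SketchEvent:
    source_name: str
    title: str
    start: str  # ISO 8601 string, kept simple for the sketch
    venue_name: str


def dedup_key(event: SketchEvent) -> uuid.UUID:
    """Derive a deterministic key from fields that identify the same event."""
    parts = (
        event.title.casefold().strip(),
        event.start,
        event.venue_name.casefold().strip(),
    )
    return uuid.uuid5(SKETCH_NAMESPACE, "|".join(parts))


def deduplicate(events: list[SketchEvent]) -> list[SketchEvent]:
    """Keep the first event seen for each key, preserving input order."""
    seen: dict[uuid.UUID, SketchEvent] = {}
    for event in events:
        seen.setdefault(dedup_key(event), event)
    return list(seen.values())


events = [
    SketchEvent("ra_co", "Mise en Place", "2026-02-11T23:59:00+00:00", "Macarena Club"),
    # Made-up cross-source duplicate, for illustration only.
    SketchEvent("ticketmaster", "MISE EN PLACE ", "2026-02-11T23:59:00+00:00", "Macarena Club"),
    SketchEvent("ra_co", "MEN (All Night Long)", "2026-02-11T23:59:00+00:00", "Moog Club"),
]
unique = deduplicate(events)
print([(e.source_name, e.title) for e in unique])
```

Exact-key matching like this only catches duplicates whose normalized fields agree; listings with different start times or venue spellings would need fuzzier matching.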

In [None]:
from src.ingestion.orchestrator import load_orchestrator_from_config

config_path = os.path.join(API_ROOT, "src", "configs", "ingestion.yaml")
orchestrator = load_orchestrator_from_config(config_path)

print("Orchestrator pipelines:")
for p in orchestrator.list_pipelines():
    print(f"  {p}")

# Execute all enabled pipelines through orchestrator
pipeline_results = orchestrator.execute_all_pipelines()

for name, result in pipeline_results.items():
    print(f"\n{name}: {result.status.value} — {result.successful_events}/{result.total_events_processed} events")

# Access the ra_co result specifically (this rebinds raco_result from Step 2)
raco_result = pipeline_results.get("ra_co")
print(f"\nRa.co events: {raco_result.successful_events if raco_result else 0}")

In [None]:
# Deduplicate across all pipeline results using orchestrator
all_events = orchestrator.deduplicate_results(pipeline_results)

print(f"Total deduplicated events across all sources: {len(all_events)}")

# Count by source
from collections import Counter
source_counts = Counter(e.source.source_name for e in all_events)
print("\nEvents by source:")
for source, count in source_counts.items():
    print(f"  {source}: {count}")

# Count by city
city_counts = Counter(e.location.city for e in all_events)
print("\nEvents by city:")
for city, count in sorted(city_counts.items(), key=lambda x: -x[1]):
    print(f"  {city}: {count}")

## Step 5: DataFrame Visualization

Convert events to a comprehensive DataFrame using `pipeline.to_dataframe()`.
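
The column names in the resulting DataFrame (for example `taxonomy_energy_level` and `price_is_free`) suggest that nested event fields are flattened into prefixed columns. The sketch below shows one way to do that with plain dicts. It is an assumption about the general technique, not the body of `to_dataframe()`, and the real column naming differs in places (`location.city` appears as `city`, for instance).

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Illustrative values only; not taken from a real EventSchema instance.
event = {
    "title": "Mise en Place",
    "price": {"raw_text": "10€", "is_free": False},
    "taxonomy": {"energy_level": None, "cost_level": None},
}
row = flatten(event)
print(sorted(row))
```

A list of such rows can be passed directly to `pandas.DataFrame(rows)` to obtain one column per flattened key.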

In [None]:
# Use the orchestrator's pipeline to access to_dataframe
ra_co_pipeline = orchestrator.get_pipeline("ra_co")
df = ra_co_pipeline.to_dataframe(all_events)

print(f"DataFrame shape: {df.shape}")
print(f"\nColumns ({len(df.columns)} total):")
for col in df.columns:
    print(f"  {col}")

In [11]:
# Display the full DataFrame (pandas elides the middle rows and columns)
df

Unnamed: 0,event_id,title,description,start_datetime,end_datetime,duration_minutes,is_all_day,is_recurring,recurrence_pattern,venue_name,...,taxonomy_age_accessibility,taxonomy_repeatability,taxonomy_dimensions_json,data_quality_score,normalization_errors,tags,artists,custom_fields_json,created_at,updated_at
0,430a341e-2b4e-5c50-ae7f-e565ea0569c5,NØVAROOM,NØVAROOM nace como un nuevo concepto enfocado ...,2026-02-11 23:59:00+00:00,2026-02-12 05:00:00+00:00,301,False,False,,City Hall,...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.55,,,,"{""artists"": []}",2026-02-11 17:45:59.580539+00:00,2026-02-11 17:45:59.580542+00:00
1,98d7ef79-e1a3-5fb7-8a7d-a84eb3b0d784,Mise en Place,,2026-02-11 23:59:00+00:00,2026-02-12 05:00:00+00:00,301,False,False,,Macarena Club,...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.55,,,,"{""artists"": []}",2026-02-11 17:45:59.580929+00:00,2026-02-11 17:45:59.580930+00:00
2,7d2f32b8-e581-515b-823d-bde03d2be99e,MEN (All Night Long),Men llegó hace unos años a Barcelona después d...,2026-02-11 23:59:00+00:00,2026-02-12 05:00:00+00:00,301,False,False,,Moog Club,...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.55,,,,"{""artists"": []}",2026-02-11 17:45:59.581191+00:00,2026-02-11 17:45:59.581192+00:00
3,1aeca039-a388-533f-99b8-8126979af7a7,TIMEZERO SHOWCASE,TIMEZERO RECORDS arrives in Barcelona for the ...,2026-02-11 22:30:00+00:00,2026-02-12 02:30:00+00:00,240,False,False,,Garage 442,...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.55,,,,"{""artists"": []}",2026-02-11 17:45:59.581402+00:00,2026-02-11 17:45:59.581403+00:00
4,a59da926-3a44-59c8-9eb3-41a3eb1a303f,"LeMichael, Alessio Panasiti",,2026-02-11 22:00:00+00:00,2026-02-12 02:30:00+00:00,270,False,False,,Switch Bar,...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.55,,,,"{""artists"": []}",2026-02-11 17:45:59.581577+00:00,2026-02-11 17:45:59.581578+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
893,ecd894a8-151a-570a-be1c-5e6fbe8bca3c,Sigh.CLUB with P.O (live),__We are a club committed to creating an equal...,2026-05-29 23:59:00+00:00,2026-05-30 06:00:00+00:00,361,False,False,,Cadavra,...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.60,,,,"{""artists"": []}",2026-02-11 17:45:59.670963+00:00,2026-02-11 17:45:59.670963+00:00
894,7ea36eed-46bc-5a03-aa3b-048a8e3916b6,PervertMX en Madrid,,2026-05-30 22:00:00+00:00,2026-05-31 08:00:00+00:00,600,False,False,,TBA - visita www.pervert.mx para conocer la di...,...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.55,,,,"{""artists"": []}",2026-02-11 17:45:59.671137+00:00,2026-02-11 17:45:59.671138+00:00
895,82dd450f-0771-5a7f-b82e-a98c9827c190,MARTIN FREDES - Jodita Techno,JODITA PROGRESSIVE MADRID Después de ver el at...,2026-05-30 23:00:00+00:00,2026-05-31 08:00:00+00:00,540,False,False,,TBA - La Finca,...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.55,,,,"{""artists"": []}",2026-02-11 17:45:59.671208+00:00,2026-02-11 17:45:59.671209+00:00
896,20d928ec-9637-5e9c-8a5a-ffa6de982550,BC FESTIVAL,El BCFestival vuelve más fuerte que nunca. Dos...,2026-06-05 23:00:00+00:00,2026-06-07 07:00:00+00:00,1920,False,False,,"TBA - Villalgordo del Júcar, Albacete",...,,,"[{""primary_category"": ""play_and_fun"", ""subcate...",0.55,,,,"{""artists"": []}",2026-02-11 17:45:59.671381+00:00,2026-02-11 17:45:59.671381+00:00


## Step 6: Taxonomy Enrichment

View the taxonomy dimensions populated by `TaxonomyMapper` and, when one is configured, `FeatureExtractor`.

In [12]:
enrichment_cols = [
    "title",
    "taxonomy_subcategory",
    "taxonomy_subcategory_name",
    "taxonomy_energy_level",
    "taxonomy_social_intensity",
    "taxonomy_cognitive_load",
    "taxonomy_physical_involvement",
    "taxonomy_cost_level",
    "taxonomy_time_scale",
    "taxonomy_environment",
    "taxonomy_emotional_output",
    "taxonomy_age_accessibility",
    "taxonomy_repeatability",
]

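from IPython.display import display

if not df.empty:
    available = [c for c in enrichment_cols if c in df.columns]
    print(f"Taxonomy Enrichment Data ({len(available)} columns):")
    # A bare expression inside an if-block is not rendered, so display it explicitly.
    display(df[available].head(10))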
else:
    print("No taxonomy data — DataFrame is empty.")

Taxonomy Enrichment Data (13 columns):


## Step 7: Summary Statistics

In [13]:
if df.empty:
    print("No events ingested — summary statistics unavailable.")
else:
    print("=" * 60)
    print("INGESTION SUMMARY")
    print("=" * 60)

    print(f"\nTotal events: {len(df)}")
    print(f"Average quality score: {df['data_quality_score'].mean():.3f}")

    print("\n--- By Source ---")
    print(df.groupby("source_name").size().to_string())

    print("\n--- By City ---")
    print(df.groupby("city").size().sort_values(ascending=False).to_string())

    print("\n--- By Event Type ---")
    print(df.groupby("event_type").size().sort_values(ascending=False).to_string())

    print("\n--- Free vs Paid ---")
    print(df.groupby("price_is_free").size().to_string())

    print("\n--- Date Range ---")
    print(f"Earliest: {df['start_datetime'].min()}")
    print(f"Latest:   {df['start_datetime'].max()}")

INGESTION SUMMARY

Total events: 898
Average quality score: 0.570

--- By Source ---
source_name
ra_co    898

--- By City ---
city
Barcelona    513
Madrid       385

--- By Event Type ---
event_type
nightlife    835
party         28
concert       27
festival       8

--- Free vs Paid ---
price_is_free
False    544
True     354

--- Date Range ---
Earliest: 2026-02-11 22:00:00+00:00
Latest:   2026-06-11 20:00:00+00:00


## Step 8: Save Results (Optional)

In [32]:
if not df.empty:
    output_dir = "../data/raw"
    os.makedirs(output_dir, exist_ok=True)
    output_path = f"{output_dir}/events_all_sources.parquet"
    df.to_parquet(output_path, index=False, engine='fastparquet')
    print(f"Saved {len(df)} events to {output_path}")
else:
    print("DataFrame is empty — skipping save.")

Saved 898 events to ../data/raw/events_all_sources.parquet


In [None]:
import pickle

if raco_result:
    output_dir = "../data/raw"
    os.makedirs(output_dir, exist_ok=True)
    pkl_path = f"{output_dir}/raco_result.pkl"
    with open(pkl_path, "wb") as f:
        pickle.dump(raco_result, f, protocol=pickle.HIGHEST_PROTOCOL)
    print(f"Saved PipelineExecutionResult to {pkl_path}")
    print(f"  Events: {raco_result.successful_events}")
    print(f"  Status: {raco_result.status.value}")
else:
    print("No raco_result to save.")

## Cleanup

In [None]:
# Close pipeline resources
for p_info in orchestrator.list_pipelines():
    pipeline = orchestrator.get_pipeline(p_info["name"])
    if hasattr(pipeline, "close"):
        pipeline.close()

if tm_api_key:
    ticketmaster.close()
print("Resources released.")
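
An alternative to closing pipelines in a final cell is to tie their lifetime to a `with` block, so `close()` runs even if a pipeline raises partway through. The sketch below uses `contextlib.ExitStack` and `contextlib.closing` from the standard library; `SketchPipeline` is a stand-in with a `close()` method, not the project's pipeline class.

```python
from contextlib import ExitStack, closing


class SketchPipeline:
    """Stand-in with a close() method; not the project's pipeline class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def close(self) -> None:
        self.closed = True


pipelines = [SketchPipeline("ra_co"), SketchPipeline("ticketmaster")]

with ExitStack() as stack:
    for p in pipelines:
        # closing() calls p.close() when the stack unwinds.
        stack.enter_context(closing(p))
    # Run the pipelines here; every close() is called on exit,
    # even if one of the runs raises.
    ran = [p.name for p in pipelines]

print(all(p.closed for p in pipelines))
```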