# 00 – Exploratory Data Analysis of Source Data

This notebook runs an exploratory data analysis of all raw source datasets to identify their structure, cleaning requirements, and transformation needs.

**Approach**: Uses helper functions to eliminate repetition and consolidate analyses into focused sections.

**Datasets Analyzed**:
1. **EDGAR Emissions Data** (Excel): 4 gas sheets (CO2, CH4, N2O, F-gas), 1990-2022, 273 NUTS2 regions
2. **Eurostat Health Data** (TSV): Causes of Death (2011-2022) and Hospital Discharges (2000-2021)
3. **Eurostat Population Data** (TSV): 1990-2024, 352 NUTS2 codes

**Analysis Sections**:
- **Section 1**: EDGAR emissions structure, coverage, and missing values
- **Section 2**: Health data structure, parsing requirements, and completeness
- **Section 3**: Population data structure and NUTS2 availability
- **Section 4**: Geographic and temporal coverage intersection across all datasets
- **Section 5**: Summary table and pipeline requirements

**Output**: Technical requirements for the ETL pipeline implementation, including data cleaning steps, transformation workflows, and the harmonization window (2011-2021).


In [49]:
from pathlib import Path
import sys

PROJECT_ROOT = Path.cwd().resolve()
# Handle different notebook locations
if PROJECT_ROOT.name == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent
elif PROJECT_ROOT.name == "raw":
    # Notebook is in data/raw/
    PROJECT_ROOT = PROJECT_ROOT.parent.parent
elif PROJECT_ROOT.name == "data":
    # Notebook is in data/
    PROJECT_ROOT = PROJECT_ROOT.parent
elif PROJECT_ROOT.name == "mvp":
    PROJECT_ROOT = PROJECT_ROOT.parent
SRC_DIR = PROJECT_ROOT / "mvp" / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

print(f"Project root: {PROJECT_ROOT}")


Project root: C:\Users\narek.pirumyan\Desktop\IAE\2025\Big Data\Capstone Project\air-health-eu


In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from collections import Counter
from pathlib import Path

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries loaded successfully")

# ============================================================================
# Helper Functions for Data Analysis
# ============================================================================

def check_file_info(file_path: Path) -> dict:
    """Check if file exists and return basic info."""
    if file_path.exists():
        size_mb = file_path.stat().st_size / (1024*1024)
        return {"exists": True, "name": file_path.name, "size_mb": size_mb}
    return {"exists": False, "path": str(file_path)}

def extract_year_columns_edgar(df: pd.DataFrame) -> tuple:
    """Extract year columns from EDGAR format (Y_YYYY)."""
    year_cols = [col for col in df.columns if col.startswith("Y_")]
    years = sorted([int(col.replace("Y_", "")) for col in year_cols])
    return year_cols, years

def extract_year_columns_eurostat(df: pd.DataFrame) -> tuple:
    """Extract year columns from Eurostat TSV format (YYYY with trailing space)."""
    year_cols = [col for col in df.columns if col.strip().replace(' ', '').isdigit()]
    years = sorted([int(col.strip()) for col in year_cols])
    return year_cols, years

def parse_eurostat_dimensions(df: pd.DataFrame, dim_col_name: str, dim_indices: dict) -> pd.DataFrame:
    """Parse comma-separated dimensions from Eurostat TSV first column."""
    df_parsed = df.copy()
    for key, idx in dim_indices.items():
        df_parsed[key] = df_parsed[dim_col_name].str.split(',').str[idx]
    return df_parsed

def check_data_completeness(df: pd.DataFrame, year_cols: list, years: list, n_years: int = 5) -> None:
    """Check data completeness for the last N years."""
    print(f"\nData completeness by year (last {n_years} years):")
    for year in years[-n_years:]:
        # Find matching column (handles trailing spaces)
        year_col = None
        for col in year_cols:
            if int(col.strip().replace(' ', '')) == year:
                year_col = col
                break
        
        if year_col and year_col in df.columns:
            valid = df[year_col].apply(lambda x: pd.notna(x) and str(x).strip() != ':').sum()
            total = len(df)
            print(f"  {year}: {valid}/{total} ({valid/total*100:.1f}%) valid values")

def get_nuts_level(code: str) -> str:
    """Determine NUTS level from code length."""
    if not isinstance(code, str):
        code = str(code)
    length = len(code.strip())
    if length == 2:
        return 'Country'
    elif length == 3:
        return 'NUTS1'
    elif length == 4:
        return 'NUTS2'
    else:
        return 'NUTS3+'

print("Helper functions loaded")


Libraries loaded successfully
Helper functions loaded


## 1. EDGAR Emissions Data

**File**: Excel workbook with 4 gas sheets (CO2, CH4, N2O, F-gas)  
**Format**: Wide format with Y_YYYY year columns  
**Coverage**: 1990-2022, 273 NUTS2 regions


In [51]:
# File paths
edgar_path = PROJECT_ROOT / "data" / "raw" / "emissions" / "EDGARv8.0_GHG_by substance_GWP100_AR5_NUTS2_1990_2022.xlsx"
cod_path = PROJECT_ROOT / "data" / "raw" / "health" / "hlth_cd_asdr2.tsv"
discharge_path = PROJECT_ROOT / "data" / "raw" / "health" / "hlth_co_disch1t.tsv"
population_path = PROJECT_ROOT / "data" / "raw" / "population" / "demo_r_d2jan_tabular.tsv"

# Check all files
print("File Availability Check:")
print("=" * 60)
for name, path in [("EDGAR", edgar_path), ("Causes of Death", cod_path), 
                   ("Hospital Discharges", discharge_path), ("Population", population_path)]:
    info = check_file_info(path)
    if info["exists"]:
        print(f"✓ {name}: {info['name']} ({info['size_mb']:.2f} MB)")
    else:
        print(f"✗ {name}: Not found")


File Availability Check:
✓ EDGAR: EDGARv8.0_GHG_by substance_GWP100_AR5_NUTS2_1990_2022.xlsx (2.14 MB)
✓ Causes of Death: hlth_cd_asdr2.tsv (37.20 MB)
✓ Hospital Discharges: hlth_co_disch1t.tsv (94.75 MB)
✓ Population: demo_r_d2jan_tabular.tsv (33.63 MB)


In [56]:
# Comprehensive EDGAR Analysis
if edgar_path.exists():
    gas_sheets = {
        "Fossil CO2 AR5": "CO2",
        "CH4_AR5": "CH4",
        "N2O_AR5": "N2O",
        "F-gas AR5": "F-gas"
    }
    
    edgar_summary = {}
    missing_analysis = {}
    
    for sheet_name, gas_label in gas_sheets.items():
        try:
            # Read data (skip 5 metadata rows)
            df = pd.read_excel(edgar_path, sheet_name=sheet_name, skiprows=5)
            df_clean = df.dropna(subset=["NUTS 2"])
            
            # Extract year columns
            year_cols, years = extract_year_columns_edgar(df)
            
            # Summary statistics
            edgar_summary[gas_label] = {
                "total_rows": len(df),
                "unique_nuts2": df_clean["NUTS 2"].nunique(),
                "unique_countries": df_clean["ISO"].nunique(),
                "unique_sectors": df_clean["Sector"].nunique(),
                "year_range": f"{min(years)}-{max(years)}",
                "n_years": len(years)
            }
            
            # Missing values analysis (last 10 years)
            year_cols_recent = year_cols[-10:]
            missing_count = sum(df[col].isna().sum() for col in year_cols_recent)
            # Denominator follows the number of year columns actually selected
            n_cells = len(df) * len(year_cols_recent)
            missing_analysis[gas_label] = {
                "missing_values": missing_count,
                "missing_pct": (missing_count / n_cells * 100) if n_cells > 0 else 0
            }
            
        except Exception as e:
            print(f"Error analyzing {gas_label}: {e}")
    
    # Display summary
    print("\n" + "="*60)
    print("EDGAR Summary by Gas Type")
    print("="*60)
    summary_df = pd.DataFrame(edgar_summary).T
    print(summary_df)
    
    print("\n" + "="*60)
    print("Missing Values Summary (last 10 years)")
    print("="*60)
    missing_df = pd.DataFrame(missing_analysis).T
    print(missing_df)
else:
    print("EDGAR file not found")



EDGAR Summary by Gas Type
      total_rows unique_nuts2 unique_countries unique_sectors year_range  \
CO2         1420          273               32              8  1990-2022   
CH4         1505          273               32              8  1990-2022   
N2O         1538          274               32              9  1990-2022   
F-gas        237          237               27              1  1990-2022   

      n_years  
CO2        33  
CH4        33  
N2O        33  
F-gas      33  

Missing Values Summary (last 10 years)
       missing_values  missing_pct
CO2             132.0     0.929577
CH4              51.0     0.338870
N2O              70.0     0.455137
F-gas             0.0     0.000000


### Key Findings: EDGAR Data
- **Structure**: Excel workbook, 4 gas sheets, wide format with Y_YYYY columns
- **Metadata**: First 5 rows contain metadata (skip during read)
- **Temporal**: 1990-2022 (33 years)
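- **Geographic**: 273 NUTS2 codes and 32 countries for CO2 and CH4; N2O has 274 codes, and F-gas covers only 237 codes in 27 countries
- **Missing Values**: 0-0.9% of cells in the last 10 year columns (2013-2022), depending on the gas (possibly concentrated in specific sectors/regions)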
- **Cleaning**: Skip metadata rows, melt year columns, standardize NUTS codes, drop null emissions


## 2. Eurostat Health Data

**Files**: TSV format with comma-separated dimensions in first column  
**Causes of Death**: 2011-2022, 491 geo codes, 93 ICD10 groups  
**Hospital Discharges**: 2000-2021, 261 geo codes, 152 ICD10 groups


In [53]:
# Comprehensive Health Data Analysis
health_summary = {}

# Causes of Death
if cod_path.exists():
    cod_full = pd.read_csv(cod_path, sep='\t')
    dim_col = [col for col in cod_full.columns if 'geo' in col.lower() and 'TIME_PERIOD' in col]
    
    if dim_col:
        dim_col_name = dim_col[0]
        # Parse dimensions: freq,unit,sex,age,icd10,geo\TIME_PERIOD
        cod_full = parse_eurostat_dimensions(cod_full, dim_col_name, {
            'freq': 0, 'unit': 1, 'sex': 2, 'age': 3, 'icd10': 4, 'geo': 5
        })
        
        year_cols, years = extract_year_columns_eurostat(cod_full)
        
        health_summary['Causes of Death'] = {
            'file': cod_path.name,
            'rows': len(cod_full),
            'years': f"{min(years)}-{max(years)}",
            'n_years': len(years),
            'geo_codes': cod_full['geo'].nunique(),
            'icd10_groups': cod_full['icd10'].nunique(),
            'nuts2_codes': cod_full[cod_full['geo'].str.len() == 4]['geo'].nunique()
        }
        
        print("="*60)
        print("Causes of Death Analysis")
        print("="*60)
        print(f"Temporal coverage: {min(years)} - {max(years)} ({len(years)} years)")
        print(f"Total rows: {len(cod_full):,}")
        print(f"Geographic codes: {cod_full['geo'].nunique()} (NUTS2: {health_summary['Causes of Death']['nuts2_codes']})")
        print(f"ICD10 groups: {cod_full['icd10'].nunique()}")
        
        # Respiratory codes
        resp_codes = [c for c in cod_full['icd10'].unique() 
                     if isinstance(c, str) and (c.startswith('J') or 'resp' in c.lower())]
        print(f"Respiratory ICD10 codes: {resp_codes}")
        
        check_data_completeness(cod_full, year_cols, years, n_years=5)

# Hospital Discharges
if discharge_path.exists():
    discharge_full = pd.read_csv(discharge_path, sep='\t')
    dim_col = [col for col in discharge_full.columns if 'geo' in col.lower() and 'TIME_PERIOD' in col]
    
    if dim_col:
        dim_col_name = dim_col[0]
        # Parse dimensions: freq,age,indic_he,unit,sex,icd10,geo\TIME_PERIOD
        discharge_full = parse_eurostat_dimensions(discharge_full, dim_col_name, {
            'freq': 0, 'age': 1, 'indic_he': 2, 'unit': 3, 'sex': 4, 'icd10': 5, 'geo': 6
        })
        
        year_cols, years = extract_year_columns_eurostat(discharge_full)
        
        health_summary['Hospital Discharges'] = {
            'file': discharge_path.name,
            'rows': len(discharge_full),
            'years': f"{min(years)}-{max(years)}",
            'n_years': len(years),
            'geo_codes': discharge_full['geo'].nunique(),
            'icd10_groups': discharge_full['icd10'].nunique(),
            'nuts2_codes': discharge_full[discharge_full['geo'].str.len() == 4]['geo'].nunique()
        }
        
        print("\n" + "="*60)
        print("Hospital Discharges Analysis")
        print("="*60)
        print(f"Temporal coverage: {min(years)} - {max(years)} ({len(years)} years)")
        print(f"Total rows: {len(discharge_full):,}")
        print(f"Geographic codes: {discharge_full['geo'].nunique()} (NUTS2: {health_summary['Hospital Discharges']['nuts2_codes']})")
        print(f"ICD10 groups: {discharge_full['icd10'].nunique()}")
        
        check_data_completeness(discharge_full, year_cols, years, n_years=5)

# Summary
if health_summary:
    print("\n" + "="*60)
    print("Health Data Summary")
    print("="*60)
    print(pd.DataFrame(health_summary).T)



Causes of Death Analysis
Temporal coverage: 2011 - 2022 (12 years)
Total rows: 398,544
Geographic codes: 491 (NUTS2: 334)
ICD10 groups: 93
Respiratory ICD10 codes: ['J', 'J09-J11', 'J12-J18', 'J40-J44_J47', 'J40-J47', 'J45_J46', 'J_OTH']

Data completeness by year (last 5 years):
  2018: 364779/398544 (91.5%) valid values
  2019: 323682/398544 (81.2%) valid values
  2020: 336147/398544 (84.3%) valid values
  2021: 336078/398544 (84.3%) valid values
  2022: 332670/398544 (83.5%) valid values

Hospital Discharges Analysis
Temporal coverage: 2000 - 2021 (22 years)
Total rows: 872,164
Geographic codes: 261 (NUTS2: 192)
ICD10 groups: 152

Data completeness by year (last 5 years):
  2017: 598733/872164 (68.6%) valid values
  2018: 592948/872164 (68.0%) valid values
  2019: 617126/872164 (70.8%) valid values
  2020: 613964/872164 (70.4%) valid values
  2021: 644028/872164 (73.8%) valid values

Health Data Summary
                                    file    rows      years n_years geo_codes  \

### Key Findings: Health Data
- **Format**: TSV with comma-separated dimensions in first column
- **Parsing**: Split first column by comma, extract geo code (index 5 for COD, 6 for discharges)
- **Year Columns**: Named as `YYYY ` (with trailing space) - need `.strip()` before conversion
- **Missing Data**: `:` indicator - replace with `pd.NA` before numeric conversion
- **Completeness**: 81-92% for Causes of Death (2018-2022) and 68-74% for Hospital Discharges (2017-2021), measured over all rows
- **Cleaning**: Parse dimensions → Melt → Handle missing (`:`) → Standardize geo codes → Filter dimensions
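
The cleaning steps listed above can be sketched on a tiny in-memory table. This is a minimal illustration, not the pipeline's implementation: the two data rows are made up, and only the layout (dimension column, `YYYY ` headers, `:` for missing) follows the findings in this section. Using `errors="coerce"` is an assumption of this sketch; it turns every non-numeric cell into `NaN`, so if the real files attach flag letters to values, those would need to be split off first.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the Eurostat TSV files
raw = (
    "freq,unit,sex,age,icd10,geo\\TIME_PERIOD\t2020 \t2021 \n"
    "A,RT,T,TOTAL,J,AT11\t45.2 \t: \n"
    "A,RT,T,TOTAL,J,at12 \t51.0 \t48.7 \n"
)

df = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)

# 1. Parse dimensions: names come from the header, values from the first column
dim_col = df.columns[0]
dim_names = dim_col.split("\\")[0].split(",")
dims = df[dim_col].str.split(",", expand=True)
dims.columns = dim_names

# 2. Melt year columns, stripping the trailing space from their names
year_cols = [c for c in df.columns if c.strip().isdigit()]
wide = pd.concat([dims, df[year_cols]], axis=1)
wide = wide.rename(columns={c: c.strip() for c in year_cols})
tidy = wide.melt(id_vars=dim_names, var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)

# 3. Handle missing: ":" (and anything else non-numeric) becomes NaN
tidy["value"] = pd.to_numeric(tidy["value"].str.strip(), errors="coerce")

# 4. Standardize geo codes
tidy["geo"] = tidy["geo"].str.strip().str.upper()

print(tidy)
```

Reading everything as strings (`dtype=str`) keeps `:` and numeric cells in one column type until the explicit conversion step.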


## 3. Population Data

**File**: TSV format (demo_r_d2jan_tabular.tsv)  
**Format**: Same as health data - comma-separated dimensions  
**Coverage**: 1990-2024, 521 geo codes (352 NUTS2)


In [54]:
# Population Data Analysis
if population_path.exists():
    population_full = pd.read_csv(population_path, sep='\t')
    
    # Parse dimensions: freq,unit,sex,age,geo\TIME_PERIOD
    dim_col = population_full.columns[0]
    population_full = parse_eurostat_dimensions(population_full, dim_col, {
        'freq': 0, 'unit': 1, 'sex': 2, 'age': 3, 'geo': 4
    })
    
    year_cols, years = extract_year_columns_eurostat(population_full)
    
    # Geographic level distribution
    population_full['nuts_level'] = population_full['geo'].apply(get_nuts_level)
    
    print("="*60)
    print("Population Data Analysis")
    print("="*60)
    print(f"Temporal coverage: {min(years)} - {max(years)} ({len(years)} years)")
    print(f"Total rows: {len(population_full):,}")
    print(f"Geographic codes: {population_full['geo'].nunique()}")
    print(f"\nGeographic level distribution:")
    print(population_full['nuts_level'].value_counts())
    
    nuts2_count = population_full[population_full['nuts_level'] == 'NUTS2']['geo'].nunique()
    print(f"\nNUTS2 codes: {nuts2_count} unique codes")
    
    # Data completeness
    check_data_completeness(population_full, year_cols, years, n_years=5)
    
    # NUTS2 availability (filtered to standard dimensions)
    filtered = population_full[
        (population_full['freq'] == 'A') & 
        (population_full['unit'] == 'NR') & 
        (population_full['sex'] == 'T') & 
        (population_full['age'] == 'TOTAL') &
        (population_full['nuts_level'] == 'NUTS2')
    ]
    
    if len(filtered) > 0:
        print(f"\nNUTS2 data availability (freq=A, unit=NR, sex=T, age=TOTAL):")
        print(f"  Filtered rows: {len(filtered)}")
        for year in years[-5:]:
            year_col = None
            for col in year_cols:
                if int(col.strip()) == year:
                    year_col = col
                    break
            if year_col:
                valid = filtered[year_col].apply(lambda x: pd.notna(x) and str(x).strip() != ':').sum()
                total = len(filtered)
                print(f"  {year}: {valid}/{total} ({valid/total*100:.1f}%) valid")
else:
    print("Population file not found")



Population Data Analysis
Temporal coverage: 1990 - 2024 (35 years)
Total rows: 160,575
Geographic codes: 521

Geographic level distribution:
nuts_level
NUTS2      108540
NUTS1       39768
Country     11388
NUTS3+        879
Name: count, dtype: int64

NUTS2 codes: 352 unique codes

Data completeness by year (last 5 years):
  2020: 138539/160575 (86.3%) valid values
  2021: 136823/160575 (85.2%) valid values
  2022: 135702/160575 (84.5%) valid values
  2023: 138078/160575 (86.0%) valid values
  2024: 135117/160575 (84.1%) valid values

NUTS2 data availability (freq=A, unit=NR, sex=T, age=TOTAL):
  Filtered rows: 352
  2020: 303/352 (86.1%) valid
  2021: 293/352 (83.2%) valid
  2022: 294/352 (83.5%) valid
  2023: 298/352 (84.7%) valid
  2024: 292/352 (83.0%) valid


### Key Findings: Population Data
- **Format**: Same TSV structure as health data
- **Dimensions**: freq,unit,sex,age,geo\TIME_PERIOD
- **Coverage**: 1990-2024 (35 years), 352 NUTS2 codes
- **Completeness**: 83-86% for recent years at NUTS2 level
- **Cleaning**: Same as health data - parse dimensions, melt, handle missing, standardize


## 4. Geographic & Temporal Coverage

**Geographic**: Compare NUTS2 coverage across datasets  
**Temporal**: Find intersection of available years for harmonization


### Technical Findings: EDGAR Data Structure & Cleaning Requirements

**Data Structure:**
- **Metadata Rows**: First 5 rows contain metadata (Content, Compound, Start year, End year) - must be skipped during read
- **Column Format**: Year columns use `Y_YYYY` format (e.g., `Y_1990`, `Y_2022`) - requires string replacement to extract year integer
- **Wide Format**: Data is in wide format with year columns - requires melt operation to convert to tidy format
- **NUTS2 desc**: Contains region names, some rows have null values - this is expected and indicates country-level emissions that cannot be assigned to a specific NUTS2 region (e.g., domestic aviation, domestic shipping)

**Data Cleaning Requirements:**
- **NUTS2 Standardization**: Codes need `.str.strip().str.upper()` to ensure consistent formatting for joins
- **Country ISO Codes**: Need `.str.strip().str.upper()` for consistency
- **Missing Values in Emission Columns**: 
  - **Pattern**: Under 1% of cells per gas sheet (0-0.9%) are missing across the last 10 years (2013-2022)
  - **Cause**: Likely due to data collection gaps, reporting delays, or sectors/regions with incomplete reporting
  - **Distribution**: Missing values may be concentrated in specific sectors or geographic levels (country vs NUTS2)
  - **Handling Strategy**: Drop rows with missing emission values using `.dropna(subset=["emissions_kt_co2e"])` after melt operation
- **NUTS2 desc Missing Values**: 
  - **Expected Behavior**: Null values occur when `NUTS 2` code is a 2-character country code (e.g., "AT" for Austria)
  - **Handling**: No action needed - these represent country-level aggregations and should be preserved. Country-level rows should be kept for completeness, but may need separate handling in NUTS2-specific analyses
- **Sector Mapping**: Need to map detailed sectors to high-level groups (Industry, Buildings, Transport, Energy, Agriculture, Waste, Other)
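
A possible shape for the code standardization and sector grouping is sketched below. The sector labels in `SECTOR_GROUPS` are hypothetical placeholders, not the labels found in the EDGAR workbook; the real dictionary has to be built from the distinct values of the `Sector` column. Falling back to an "Other" group for unmapped labels is also an assumption of this sketch.

```python
import pandas as pd

# Hypothetical detailed-sector labels; replace with the values actually
# present in the EDGAR "Sector" column.
SECTOR_GROUPS = {
    "Power Industry": "Energy",
    "Industrial Combustion": "Industry",
    "Agriculture": "Agriculture",
    "Waste": "Waste",
}

# Made-up rows with inconsistent code formatting
df = pd.DataFrame({
    "NUTS 2": [" at11", "AT12 ", "de21", "DE21"],
    "Sector": ["Power Industry", "Waste", "Fuel Exploitation", "Agriculture"],
})

# Standardize codes so that joins match regardless of case or padding
df["nuts_id"] = df["NUTS 2"].str.strip().str.upper()

# Map detailed sectors to groups; unmapped labels fall back to "Other"
df["sector_group"] = df["Sector"].map(SECTOR_GROUPS).fillna("Other")

# List unmapped labels so the dictionary can be extended deliberately
unmapped = sorted(df.loc[~df["Sector"].isin(SECTOR_GROUPS), "Sector"].unique())
print(df[["nuts_id", "Sector", "sector_group"]])
print("Unmapped sectors:", unmapped)
```

Printing the unmapped labels makes it visible when a sector silently lands in "Other".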

**Transformation Requirements:**
- Melt year columns: `id_vars` = [Substance, ISO, Country, NUTS 2, NUTS 2 desc, Sector], `value_vars` = all Y_* columns
- Rename columns: Substance→gas, ISO→country_iso, Country→country_name, NUTS 2→nuts_id, NUTS 2 desc→nuts_label
- Extract year: Convert `Y_2022` → `2022` (integer)
- Combine sheets: Concatenate all gas sheets (CO2, CH4, N2O, F-gas) into single dataframe
- Sector grouping: Apply mapping dictionary to create `sector_group` column
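
The melt, rename, and year-extraction steps above could look roughly like this. The toy frame stands in for one gas sheet after `skiprows=5` and its values are invented; `edgar_to_long` is a hypothetical helper name, and `emissions_kt_co2e` is the value column name used in the cleaning requirements.

```python
import pandas as pd

# Toy stand-in for one EDGAR sheet (made-up values)
sheet = pd.DataFrame({
    "Substance": ["CO2", "CO2"],
    "ISO": ["AUT", "AUT"],
    "Country": ["Austria", "Austria"],
    "NUTS 2": ["AT11", "at12 "],
    "NUTS 2 desc": ["Burgenland", "Niederösterreich"],
    "Sector": ["Transport", "Transport"],
    "Y_2021": [1.5, 2.0],
    "Y_2022": [1.4, None],
})

ID_VARS = ["Substance", "ISO", "Country", "NUTS 2", "NUTS 2 desc", "Sector"]
RENAMES = {
    "Substance": "gas",
    "ISO": "country_iso",
    "Country": "country_name",
    "NUTS 2": "nuts_id",
    "NUTS 2 desc": "nuts_label",
}


def edgar_to_long(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: wide Y_YYYY columns to one row per region/sector/year."""
    year_cols = [c for c in df.columns if str(c).startswith("Y_")]
    long_df = df.melt(
        id_vars=ID_VARS,
        value_vars=year_cols,
        var_name="year",
        value_name="emissions_kt_co2e",
    )
    # "Y_2022" -> 2022
    long_df["year"] = long_df["year"].str.replace("Y_", "", regex=False).astype(int)
    long_df = long_df.rename(columns=RENAMES)
    long_df["nuts_id"] = long_df["nuts_id"].str.strip().str.upper()
    long_df["country_iso"] = long_df["country_iso"].str.strip().str.upper()
    # Drop rows without an emission value, as described in the cleaning requirements
    return long_df.dropna(subset=["emissions_kt_co2e"]).reset_index(drop=True)


# With real data, one frame per gas sheet would go into this list
combined = pd.concat([edgar_to_long(sheet)], ignore_index=True)
print(combined)
```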


In [55]:
# Geographic Coverage Comparison
nuts2_coverage = {}

# EDGAR
if edgar_path.exists():
    df_co2 = pd.read_excel(edgar_path, sheet_name="Fossil CO2 AR5", skiprows=5)
    df_co2 = df_co2.dropna(subset=["NUTS 2"])
    edgar_nuts2 = set(df_co2["NUTS 2"].str.strip().str.upper().unique())
    nuts2_coverage['EDGAR'] = edgar_nuts2

# Health data
if cod_path.exists():
    cod_full = pd.read_csv(cod_path, sep='\t')
    dim_col = [col for col in cod_full.columns if 'geo' in col.lower() and 'TIME_PERIOD' in col]
    if dim_col:
        cod_full['geo'] = cod_full[dim_col[0]].str.split(',').str[5]
        cod_nuts2 = {g.strip().upper() for g in cod_full['geo'].unique() if isinstance(g, str) and len(str(g).strip()) == 4}
        nuts2_coverage['Causes of Death'] = cod_nuts2

if discharge_path.exists():
    discharge_full = pd.read_csv(discharge_path, sep='\t')
    dim_col = [col for col in discharge_full.columns if 'geo' in col.lower() and 'TIME_PERIOD' in col]
    if dim_col:
        discharge_full['geo'] = discharge_full[dim_col[0]].str.split(',').str[6]
        discharge_nuts2 = {g.strip().upper() for g in discharge_full['geo'].unique() if isinstance(g, str) and len(str(g).strip()) == 4}
        nuts2_coverage['Hospital Discharges'] = discharge_nuts2

# Population
if population_path.exists():
    pop_full = pd.read_csv(population_path, sep='\t')
    dim_col = pop_full.columns[0]
    pop_full['geo'] = pop_full[dim_col].str.split(',').str[4]
    pop_nuts2 = {g.strip().upper() for g in pop_full['geo'].unique() if isinstance(g, str) and len(str(g).strip()) == 4}
    nuts2_coverage['Population'] = pop_nuts2

# Display results
print("="*60)
print("NUTS2 Geographic Coverage")
print("="*60)
for name, codes in nuts2_coverage.items():
    print(f"{name}: {len(codes)} unique NUTS2 codes")

if len(nuts2_coverage) >= 2:
    all_nuts2 = set.union(*nuts2_coverage.values())
    print(f"\nTotal unique NUTS2 codes across all datasets: {len(all_nuts2)}")
    
    if 'EDGAR' in nuts2_coverage and 'Causes of Death' in nuts2_coverage:
        intersection = nuts2_coverage['EDGAR'] & nuts2_coverage['Causes of Death']
        print(f"NUTS2 codes in both EDGAR and Causes of Death: {len(intersection)}")

# Temporal Coverage Intersection
print("\n" + "="*60)
print("Temporal Coverage Intersection")
print("="*60)

temporal_coverage = {}

# EDGAR
if edgar_path.exists():
    df_co2 = pd.read_excel(edgar_path, sheet_name="Fossil CO2 AR5", skiprows=5)
    _, edgar_years = extract_year_columns_edgar(df_co2)
    temporal_coverage['EDGAR'] = set(edgar_years)
    print(f"EDGAR: {min(edgar_years)} - {max(edgar_years)} ({len(edgar_years)} years)")

# Causes of Death
if cod_path.exists():
    cod_full = pd.read_csv(cod_path, sep='\t')
    _, cod_years = extract_year_columns_eurostat(cod_full)
    temporal_coverage['Causes of Death'] = set(cod_years)
    print(f"Causes of Death: {min(cod_years)} - {max(cod_years)} ({len(cod_years)} years)")

# Hospital Discharges
if discharge_path.exists():
    discharge_full = pd.read_csv(discharge_path, sep='\t')
    _, discharge_years = extract_year_columns_eurostat(discharge_full)
    temporal_coverage['Hospital Discharges'] = set(discharge_years)
    print(f"Hospital Discharges: {min(discharge_years)} - {max(discharge_years)} ({len(discharge_years)} years)")

# Population
if population_path.exists():
    pop_full = pd.read_csv(population_path, sep='\t')
    _, pop_years = extract_year_columns_eurostat(pop_full)
    temporal_coverage['Population'] = set(pop_years)
    print(f"Population: {min(pop_years)} - {max(pop_years)} ({len(pop_years)} years)")

# Find intersection
if len(temporal_coverage) >= 2:
    intersection_years = set.intersection(*temporal_coverage.values())
    if intersection_years:
        print(f"\n✓ Harmonization window: {min(intersection_years)} - {max(intersection_years)} ({len(intersection_years)} years)")
        print(f"  Years: {sorted(intersection_years)}")
    else:
        print("\n⚠ No complete intersection - use pairwise intersections")
        
    if 'EDGAR' in temporal_coverage and 'Causes of Death' in temporal_coverage:
        edgar_cod = temporal_coverage['EDGAR'] & temporal_coverage['Causes of Death']
        if edgar_cod:
            print(f"\nEDGAR ∩ Causes of Death: {min(edgar_cod)} - {max(edgar_cod)} ({len(edgar_cod)} years)")


NUTS2 Geographic Coverage
EDGAR: 273 unique NUTS2 codes
Causes of Death: 334 unique NUTS2 codes
Hospital Discharges: 192 unique NUTS2 codes
Population: 352 unique NUTS2 codes

Total unique NUTS2 codes across all datasets: 416
NUTS2 codes in both EDGAR and Causes of Death: 242

Temporal Coverage Intersection
EDGAR: 1990 - 2022 (33 years)
Causes of Death: 2011 - 2022 (12 years)
Hospital Discharges: 2000 - 2021 (22 years)
Population: 1990 - 2024 (35 years)

✓ Harmonization window: 2011 - 2021 (11 years)
  Years: [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

EDGAR ∩ Causes of Death: 2011 - 2022 (12 years)


## 5. Summary & Pipeline Requirements

### Dataset Structure

| Dataset | Format | Years | NUTS2 Codes | Key Features |
|---------|--------|------|-------------|--------------|
| **EDGAR** | Excel (4 sheets) | 1990-2022 | 273 | Wide format, Y_YYYY columns, skip 5 metadata rows |
| **Causes of Death** | TSV | 2011-2022 | 334 | Comma-separated dimensions, 93 ICD10 groups |
| **Hospital Discharges** | TSV | 2000-2021 | 192 | Same format, 152 ICD10 groups |
| **Population** | TSV | 1990-2024 | 352 | Same format, multiple age/sex dimensions |

### Harmonization Window
**✓ 2011-2021 (11 years)** - Years present in all four datasets; completeness within these years still varies by dataset (see Sections 2 and 3)
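
The window follows directly from intersecting the year ranges reported in Section 4. The short check below reproduces it and shows which datasets set the bounds; it assumes each dataset covers a contiguous range of years, as the printed ranges and year counts suggest.

```python
# Year ranges as reported in the temporal coverage output
coverage = {
    "EDGAR": set(range(1990, 2023)),                # 1990-2022
    "Causes of Death": set(range(2011, 2023)),      # 2011-2022
    "Hospital Discharges": set(range(2000, 2022)),  # 2000-2021
    "Population": set(range(1990, 2025)),           # 1990-2024
}

window = sorted(set.intersection(*coverage.values()))
print(f"{window[0]}-{window[-1]} ({len(window)} years)")  # → 2011-2021 (11 years)

# Dataset with the latest start year and dataset with the earliest end year
limiting_start = max(coverage, key=lambda name: min(coverage[name]))
limiting_end = min(coverage, key=lambda name: max(coverage[name]))
print(f"Start set by: {limiting_start}; end set by: {limiting_end}")
```

Causes of Death sets the lower bound and Hospital Discharges the upper bound, so an analysis that leaves out discharges could extend to 2022.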

### Data Cleaning Requirements

**EDGAR:**
- Skip first 5 metadata rows
- Melt Y_YYYY columns to long format
- Standardize: `.str.strip().str.upper()` for NUTS codes
- Drop null emissions (under 1% of cells in 2013-2022)

**Health & Population (TSV):**
- Parse comma-separated dimensions from first column
- Melt year columns (handle trailing spaces)
- Replace `:` with `pd.NA` before numeric conversion
- Filter to: `freq='A'`, `unit='NR'`, `sex='T'`, `age='TOTAL'` (for population)
- Standardize geo codes: `.str.strip().str.upper()`

### Transformation Pipeline

1. **EDGAR**: Read Excel → Skip metadata → Melt → Standardize → Combine sheets
2. **Health**: Read TSV → Parse dimensions → Melt → Clean missing → Standardize → Filter
3. **Population**: Read TSV → Parse dimensions → Melt → Clean missing → Standardize → Filter
4. **Harmonization**: Filter to 2011-2021 → Merge on `nuts_id` + `year` → Calculate derived metrics
5. **Output**: Save to Parquet format
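
Step 4 might be sketched as follows on two small made-up tables. The per-capita metric and the `in_window` helper are illustrative assumptions, since this notebook does not define the derived metrics. The sketch also assumes emissions have already been aggregated to one row per `nuts_id` and `year`; `validate="one_to_one"` makes the merge fail loudly if that is not the case.

```python
import pandas as pd

# Made-up, already tidy inputs
emissions = pd.DataFrame({
    "nuts_id": ["AT11", "AT11", "AT12"],
    "year": [2010, 2011, 2011],
    "emissions_kt_co2e": [100.0, 90.0, 200.0],
})
population = pd.DataFrame({
    "nuts_id": ["AT11", "AT12", "AT13"],
    "year": [2011, 2011, 2011],
    "population": [300_000, 1_600_000, 1_700_000],
})


def in_window(df: pd.DataFrame, start: int = 2011, end: int = 2021) -> pd.DataFrame:
    """Hypothetical helper: keep rows inside the harmonization window."""
    return df[df["year"].between(start, end)]


merged = in_window(emissions).merge(
    in_window(population),
    on=["nuts_id", "year"],
    how="inner",            # keep only region-years present in both tables
    validate="one_to_one",  # raise if either side has duplicate keys
)

# Illustrative derived metric: tonnes CO2e per inhabitant (1 kt = 1000 t)
merged["emissions_t_per_capita"] = merged["emissions_kt_co2e"] * 1000 / merged["population"]
print(merged)

# Step 5 would then be a call such as merged.to_parquet(...), which needs
# a Parquet engine (pyarrow or fastparquet) to be installed.
```

An inner join drops regions that appear in only one dataset, which matters here because the NUTS2 code sets differ considerably between sources.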
