# ECG Digitization - Exploratory Data Analysis

This notebook explores the PhysioNet ECG Digitization competition data.

## Goals:
1. Understand the data structure
2. Analyze image variations (segments)
3. Examine ECG signal characteristics
4. Identify challenges and opportunities
5. Visualize sample data

---

In [None]:
# Import libraries
import sys
from pathlib import Path

# Add src directory to path (assumes the notebook runs from a sibling of src/, e.g. notebooks/)
sys.path.append(str(Path.cwd().parent / 'src'))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
from IPython.display import display

# Project imports
from config import *
from data.dataloader import ECGDataLoader, get_data_statistics
from utils.visualization import *
from utils.metrics import calculate_snr, evaluate_single_lead

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

%matplotlib inline

print("Imports successful!")

## 1. Dataset Overview

Let's start by understanding the structure of the competition data.

In [None]:
# Check if data exists
print(f"Data directory exists: {DATA_DIR.exists()}")
print(f"Train CSV exists: {TRAIN_CSV.exists()}")
print(f"Test CSV exists: {TEST_CSV.exists()}")
print(f"\nProject root: {PROJECT_ROOT}")
print(f"Data directory: {DATA_DIR}")

### 1.1 Load Metadata

In [None]:
# Load training metadata
if TRAIN_CSV.exists():
    train_df = pd.read_csv(TRAIN_CSV)
    print(f"Training records: {len(train_df)}")
    print(f"\nColumns: {train_df.columns.tolist()}")
    print(f"\nFirst few rows:")
    display(train_df.head())
    
    print(f"\nDataset info:")
    train_df.info()
else:
    print("⚠️  Training data not found. Please download the competition data first.")
    print("\nTo download data:")
    print("1. Install Kaggle API: pip install kaggle")
    print("2. Configure API credentials (kaggle.json)")
    print("3. Run: kaggle competitions download -c physionet-ecg-image-digitization")
    print("4. Extract to data/raw/")

In [None]:
# Load test metadata
if TEST_CSV.exists():
    test_df = pd.read_csv(TEST_CSV)
    print(f"Test records: {len(test_df)}")
    print(f"\nColumns: {test_df.columns.tolist()}")
    print(f"\nFirst few rows:")
    display(test_df.head())
else:
    print("⚠️  Test data not found.")

### 1.2 Sampling Frequency Analysis

In [None]:
if TRAIN_CSV.exists():
    # Analyze sampling frequencies
    print("Sampling Frequency Distribution:")
    print(train_df['fs'].value_counts().sort_index())
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Distribution
    train_df['fs'].value_counts().sort_index().plot(kind='bar', ax=axes[0])
    axes[0].set_title('Sampling Frequency Distribution')
    axes[0].set_xlabel('Sampling Frequency (Hz)')
    axes[0].set_ylabel('Count')
    axes[0].grid(True, alpha=0.3)
    
    # Histogram of the raw values
    axes[1].hist(train_df['fs'], bins=20, edgecolor='black')
    axes[1].set_title('Sampling Frequency Histogram')
    axes[1].set_xlabel('Sampling Frequency (Hz)')
    axes[1].set_ylabel('Count')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nStatistics:")
    print(f"  Mean: {train_df['fs'].mean():.2f} Hz")
    print(f"  Median: {train_df['fs'].median():.2f} Hz")
    print(f"  Min: {train_df['fs'].min():.2f} Hz")
    print(f"  Max: {train_df['fs'].max():.2f} Hz")

### 1.3 Signal Length Analysis

In [None]:
if TRAIN_CSV.exists():
    # Calculate expected signal lengths
    train_df['expected_lead_II_length'] = train_df['fs'] * 10  # Lead II: 10 seconds
    train_df['expected_other_length'] = train_df['fs'] * 2.5  # Other leads: 2.5 seconds
    
    print("Expected Signal Lengths:")
    print(f"\nLead II (10s):")
    print(train_df['expected_lead_II_length'].describe())
    
    print(f"\nOther Leads (2.5s):")
    print(train_df['expected_other_length'].describe())
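
The expected lengths above follow from samples = sampling frequency × duration. Below is a minimal standalone sketch of that arithmetic, assuming Lead II spans 10 s and every other lead 2.5 s as stated in this notebook. `expected_samples` is a hypothetical helper, not part of the project code. The sampling rates in the loop are illustrative, and rounding to the nearest integer is an assumption, since the convention for fractional sample counts is not confirmed here.

```python
def expected_samples(fs, lead, long_lead="II", long_s=10.0, short_s=2.5):
    """Return the expected number of samples for a lead at sampling rate fs (Hz)."""
    duration = long_s if lead == long_lead else short_s
    # Rounding is an assumption: fs * 2.5 is fractional for odd sampling rates.
    return int(round(fs * duration))


# Illustrative sampling rates only; check train_df['fs'] for the real values.
for fs in (250, 500, 1000):
    print(fs, expected_samples(fs, "II"), expected_samples(fs, "V1"))
```

Comparing these counts with the actual lengths of the loaded signals is a quick way to catch records whose duration deviates from the assumed layout.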

## 2. Image Segment Analysis

Training data contains multiple image variants for each ECG record. Let's explore these.

In [None]:
# Display image segment types
print("Image Segments in Training Data:")
print("=" * 60)
for seg_id, description in IMAGE_SEGMENTS.items():
    print(f"  {seg_id}: {description}")

In [None]:
# Initialize data loader
if TRAIN_DIR.exists():
    train_loader = ECGDataLoader(mode='train')
    print(f"Training loader initialized with {len(train_loader)} records")
    
    # Get first record ID
    record_ids = train_loader.get_record_ids()
    if record_ids:
        sample_id = record_ids[0]
        print(f"\nSample record ID: {sample_id}")
        
        # Check available segments
        segments = train_loader.get_available_segments(sample_id)
        print(f"Available segments: {segments}")
else:
    print("⚠️  Training images not found.")

### 2.1 Visualize Multiple Segments

Let's visualize all image variants for a single ECG record.

In [None]:
# Display multiple segments for one record
if TRAIN_DIR.exists() and record_ids:
    sample_id = record_ids[0]  # First record
    fig = display_multiple_segments(sample_id, TRAIN_DIR)
    plt.show()
else:
    print("⚠️  Cannot display segments - data not available")

### 2.2 Image Size Analysis

In [None]:
# Analyze image sizes for different segments
if TRAIN_DIR.exists() and record_ids:
    image_sizes = {}
    
    # Sample a few records
    sample_records = record_ids[:10]
    
    for record_id in sample_records:
        segments = train_loader.get_available_segments(record_id)
        for segment in segments:
            try:
                image = train_loader.load_image(record_id, segment)
                if segment not in image_sizes:
                    image_sizes[segment] = []
                image_sizes[segment].append(image.shape[:2])  # (height, width)
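            except Exception as exc:
                # Report unreadable images instead of silently skipping them
                print(f"  Skipped {record_id}/{segment}: {exc}")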
    
    # Display results
    print("Image Sizes by Segment:")
    for segment, sizes in image_sizes.items():
        # Normalize to tuples so equality checks and counting work
        sizes = [tuple(size) for size in sizes]
        print(f"\n  Segment {segment}:")
        for size in sorted(set(sizes)):
            count = sizes.count(size)
            print(f"    {size[0]}x{size[1]} - {count} images")
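
The per-segment tally can also be written with `collections.Counter`, which avoids rescanning the list for every unique size. This is a sketch under stated assumptions: `tally_sizes` is a hypothetical helper, and the shapes in `example` are made up for illustration rather than taken from the dataset.

```python
from collections import Counter, defaultdict


def tally_sizes(shapes_by_segment):
    """Count occurrences of each (height, width) pair per segment."""
    counts = defaultdict(Counter)
    for segment, shapes in shapes_by_segment.items():
        for shape in shapes:
            # Keep only height and width; drop the channel axis if present.
            counts[segment][tuple(shape[:2])] += 1
    return counts


# Made-up shapes for illustration only.
example = {
    "0001": [(1700, 2200, 3), (1700, 2200, 3)],
    "0003": [(1652, 2338, 3), (1700, 2200, 3)],
}
size_counts = tally_sizes(example)
for segment, counter in size_counts.items():
    for (height, width), count in counter.most_common():
        print(f"Segment {segment}: {height}x{width} - {count} images")
```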

## 3. ECG Signal Analysis

Let's examine the time-series ECG data.

In [None]:
# Load a complete record (image + signals)
if TRAIN_DIR.exists() and record_ids:
    sample_id = record_ids[0]
    record = train_loader.load_record(sample_id, segment='0001')
    
    print(f"Record ID: {record['id']}")
    print(f"Sampling Frequency: {record['fs']} Hz")
    print(f"Image Shape: {record['image'].shape}")
    print(f"\nLeads:")
    for lead_name, signal in record['leads'].items():
        print(f"  {lead_name}: {len(signal)} samples, range [{signal.min():.3f}, {signal.max():.3f}] mV")

### 3.1 Visualize ECG Signals

In [None]:
# Plot all 12 leads
if TRAIN_DIR.exists() and record_ids:
    fig = plot_all_leads(
        record['leads'],
        record['fs'],
        title=f"12-Lead ECG - Record {sample_id}"
    )
    plt.show()

### 3.2 Signal Statistics

In [None]:
# Calculate statistics for each lead
if TRAIN_DIR.exists() and record_ids:
    stats_list = []
    
    for lead_name, signal in record['leads'].items():
        stats = {
            'Lead': lead_name,
            'Length': len(signal),
            'Duration (s)': len(signal) / record['fs'],
            'Mean (mV)': signal.mean(),
            'Std (mV)': signal.std(),
            'Min (mV)': signal.min(),
            'Max (mV)': signal.max(),
            'Range (mV)': signal.max() - signal.min()
        }
        stats_list.append(stats)
    
    stats_df = pd.DataFrame(stats_list)
    display(stats_df)

### 3.3 Image and Signal Visualization

In [None]:
# Display the image alongside the record's reference signals
if TRAIN_DIR.exists() and record_ids:
    image_path = TRAIN_DIR / sample_id / f"{sample_id}-0001.png"
    fig = plot_image_with_signals(
        image_path,
        record['leads'],
        record['fs']
    )
    plt.show()

## 4. Test Set Analysis

In [None]:
# Load test data
if TEST_CSV.exists() and TEST_DIR.exists():
    test_loader = ECGDataLoader(mode='test')
    print(f"Test records: {len(test_loader)}")
    
    # Get first test record
    test_ids = test_loader.get_record_ids()
    if test_ids:
        test_sample_id = test_ids[0]
        test_record = test_loader.load_record(test_sample_id)
        
        print(f"\nSample Test Record: {test_sample_id}")
        print(f"Image Shape: {test_record['image'].shape}")
        print(f"Metadata: {test_record['metadata']}")
        
        # Display test image
        fig = display_ecg_image(
            TEST_DIR / f"{test_sample_id}.png",
            title=f"Test ECG Image - {test_sample_id}"
        )
        plt.show()
else:
    print("⚠️  Test data not found.")

## 5. Submission Format Analysis

In [None]:
# Load sample submission
if SAMPLE_SUBMISSION.exists():
    sample_sub = pd.read_parquet(SAMPLE_SUBMISSION)
    print(f"Sample submission shape: {sample_sub.shape}")
    print(f"\nColumns: {sample_sub.columns.tolist()}")
    print(f"\nFirst rows:")
    display(sample_sub.head(20))
    
    # Parse IDs to understand format
    sample_sub[['base_id', 'row_id', 'lead']] = sample_sub['id'].str.rsplit('_', n=2, expand=True)
    sample_sub['row_id'] = sample_sub['row_id'].astype(int)
    
    print(f"\nLeads distribution:")
    print(sample_sub['lead'].value_counts())
    
    print(f"\nUnique base IDs: {sample_sub['base_id'].nunique()}")
else:
    print("⚠️  Sample submission not found.")
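
The ID parsing above assumes each `id` has the form `<base_id>_<row_id>_<lead>`. Splitting from the right keeps any underscores inside the base ID intact. The sketch below shows the same idea on plain strings. `parse_submission_id` is a hypothetical helper, and the example IDs are made up.

```python
def parse_submission_id(submission_id):
    """Split '<base_id>_<row_id>_<lead>' into (base_id, row_id, lead)."""
    # rsplit with maxsplit=2 splits on the last two underscores only,
    # so underscores inside the base ID are preserved.
    base_id, row_id, lead = submission_id.rsplit("_", 2)
    return base_id, int(row_id), lead


# Made-up IDs for illustration only.
print(parse_submission_id("12345_0_II"))
print(parse_submission_id("rec_a_17_aVR"))
```

An ID with fewer than three underscore-separated parts raises `ValueError` on unpacking, which surfaces format surprises early.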

## 6. Key Findings and Next Steps

### Summary:
1. **Data Structure**: 
   - Training: Multiple image variants + time series
   - Test: Single image per ECG
   
2. **Challenges**:
   - Sampling frequency varies across records
   - Image quality and artifacts differ between variants
   - Complex 12-lead layout
   - Lead II spans 10 s, while the other leads span 2.5 s

3. **Opportunities**:
   - Multiple image variants per record can improve model robustness
   - Clear evaluation metric (SNR)
   - Standard ECG format
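
For reference, a generic signal-to-noise ratio in decibels can be sketched as below. This is the textbook definition, 10·log10 of signal power over error power. It is not necessarily the competition's official metric or the project's `calculate_snr`, which may add steps such as alignment or aggregation across leads. `snr_db` and the sample values are hypothetical.

```python
import math


def snr_db(reference, estimate):
    """Generic SNR in dB: 10 * log10(sum(ref^2) / sum((ref - est)^2))."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0:
        return math.inf  # perfect reconstruction
    if signal_power == 0:
        return -math.inf  # all-zero reference with a non-zero error
    return 10.0 * math.log10(signal_power / noise_power)


# Made-up values in mV for illustration only.
reference = [0.0, 1.0, 0.0, -1.0]
estimate = [0.0, 0.9, 0.0, -0.9]
print(f"{snr_db(reference, estimate):.1f} dB")
```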

### Next Steps:
1. Develop image preprocessing pipeline
2. Build baseline signal extraction model
3. Implement and test the evaluation metric
4. Explore data augmentation strategies
5. Research ECG digitization literature

In [None]:
print("\n" + "="*60)
print("EDA Complete!")
print("="*60)