# 01 - Data Exploration

This notebook demonstrates how to load and explore the facial keypoint dataset using the modernized `facial_keypoints` package.

## Overview

- Load training data from CSV files
- Explore data statistics and distributions
- Visualize sample images with keypoints
- Understand the data format and preprocessing

In [6]:
# Standard imports
import numpy as np
import matplotlib.pyplot as plt

# Package imports
from facial_keypoints.data.loader import load_data, get_data_statistics
from facial_keypoints.visualization.plotting import plot_keypoints, plot_training_samples

# Display settings
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 10)

## 1. Load Training Data

The `load_data()` function loads facial images and keypoint annotations from CSV files.

- Images: 96×96 grayscale, normalized to [0, 1]
- Keypoints: 15 facial landmarks (30 values: x,y pairs), normalized to [-1, 1]
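
The two normalization conventions listed above can be sketched with NumPy. This is a minimal illustration, not the package's actual code: it assumes raw pixels lie in [0, 255] and that keypoints are mapped with `(y - 48) / 48`, the inverse of the `y * 48 + 48` denormalization used later in this notebook. The helper names are hypothetical.

```python
import numpy as np

IMAGE_SIZE = 96            # images are 96x96 pixels
HALF = IMAGE_SIZE / 2      # 48.0, the centre of the image in pixels


def normalize_image(pixels):
    """Scale raw 8-bit pixel values (assumed range 0..255) to [0, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 255.0


def normalize_keypoints(coords):
    """Map pixel coordinates in [0, 96] to [-1, 1] (assumed convention)."""
    return (np.asarray(coords, dtype=np.float32) - HALF) / HALF


def denormalize_keypoints(coords):
    """Map normalized coordinates in [-1, 1] back to pixel coordinates."""
    return np.asarray(coords, dtype=np.float32) * HALF + HALF


# Tiny hand-made inputs, not real dataset values
raw_image = np.array([[0, 128, 255]], dtype=np.uint8)
raw_keypoints = np.array([0.0, 48.0, 96.0, 66.0])

image = normalize_image(raw_image)
keypoints = normalize_keypoints(raw_keypoints)
restored = denormalize_keypoints(keypoints)

print(image.min(), image.max())
print(keypoints)
print(np.allclose(restored, raw_keypoints))
```

Under this convention a normalized coordinate of 0 corresponds to the image centre (pixel 48), and the round trip recovers the original pixel coordinates.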

In [7]:
# Load training data
# Note: expects the dataset at data/training.csv, relative to the project root
try:
    X_train, y_train = load_data(test=False)
    print(f"Training images shape: {X_train.shape}")
    print(f"Training keypoints shape: {y_train.shape}")
    print(f"\nImage dtype: {X_train.dtype}")
    print(f"Keypoints dtype: {y_train.dtype}")
except Exception as e:
    print(f"Could not load data: {e}")
    print("\nTo use this notebook, download the dataset and place it in data/training.csv")

Could not load data: Data file not found: data/training.csv

To use this notebook, download the dataset and place it in data/training.csv


## 2. Data Statistics

Use `get_data_statistics()` to compute summary statistics for validation and exploration.
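
To make the returned fields concrete, here is a rough NumPy sketch of how such a summary could be computed. It is not the implementation of `get_data_statistics()`: the function name `summarize_arrays` is hypothetical, the key names are copied from the cell below, and the array shapes and the meaning of `n_keypoints` (number of x,y pairs) are assumptions.

```python
import numpy as np


def summarize_arrays(X, y):
    """Hypothetical stand-in that summarizes images X and keypoints y.

    Assumes X has shape (n_samples, height, width, ...) and
    y has shape (n_samples, 2 * n_keypoints).
    """
    return {
        "n_samples": X.shape[0],
        "image_height": X.shape[1],
        "image_width": X.shape[2],
        "n_keypoints": y.shape[1] // 2,   # assumed: count of (x, y) pairs
        "x_min": float(X.min()),
        "x_max": float(X.max()),
        "x_mean": float(X.mean()),
        "x_std": float(X.std()),
        "y_min": float(y.min()),
        "y_max": float(y.max()),
        "y_mean": float(y.mean()),
        "y_std": float(y.std()),
    }


# Synthetic data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.random((4, 96, 96, 1), dtype=np.float32)   # values in [0, 1)
y_demo = rng.uniform(-1.0, 1.0, size=(4, 30))           # values in [-1, 1)

stats = summarize_arrays(X_demo, y_demo)

# Simple validation: values should stay inside the documented ranges
images_ok = 0.0 <= stats["x_min"] and stats["x_max"] <= 1.0
keypoints_ok = -1.0 <= stats["y_min"] and stats["y_max"] <= 1.0
print(stats["n_samples"], stats["n_keypoints"], images_ok, keypoints_ok)
```

Range checks like the last two are a cheap way to catch data that was loaded without normalization.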

In [8]:
# Compute statistics
try:
    stats = get_data_statistics(X_train, y_train)
    
    print("Dataset Statistics")
    print("=" * 40)
    print(f"Number of samples: {stats['n_samples']}")
    print(f"Image dimensions: {stats['image_height']}×{stats['image_width']}")
    print(f"Number of keypoints: {stats['n_keypoints']}")
    print()
    print("Image Value Statistics (normalized [0, 1])")
    print(f"  Min: {stats['x_min']:.4f}")
    print(f"  Max: {stats['x_max']:.4f}")
    print(f"  Mean: {stats['x_mean']:.4f}")
    print(f"  Std: {stats['x_std']:.4f}")
    print()
    print("Keypoint Statistics (normalized [-1, 1])")
    print(f"  Min: {stats['y_min']:.4f}")
    print(f"  Max: {stats['y_max']:.4f}")
    print(f"  Mean: {stats['y_mean']:.4f}")
    print(f"  Std: {stats['y_std']:.4f}")
except NameError:
    print("Data not loaded - run the cell above first")

Data not loaded - run the cell above first


## 3. Visualize Training Samples

The `plot_training_samples()` function displays a grid of images with their keypoints overlaid.

In [None]:
# Plot a grid of training samples
try:
    fig = plot_training_samples(X_train, y_train, n_samples=9, figsize=(12, 12))
    plt.suptitle("Training Samples with Facial Keypoints", fontsize=14, y=1.02)
    plt.show()
except NameError:
    print("Data not loaded - run the data loading cell first")

## 4. Individual Sample Visualization

Use `plot_keypoints()` for detailed visualization of a single sample.

In [None]:
# Visualize a single sample with different marker styles
try:
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Sample index to visualize
    idx = 42
    
    # Original image without keypoints
    axes[0].imshow(X_train[idx].squeeze(), cmap='gray')
    axes[0].set_title('Original Image')
    axes[0].axis('off')
    
    # With cyan keypoints (default)
    plot_keypoints(X_train[idx], y_train[idx], ax=axes[1], 
                   denormalize=True, marker_color='cyan', marker_size=40)
    axes[1].set_title('Cyan Keypoints')
    
    # With red keypoints
    plot_keypoints(X_train[idx], y_train[idx], ax=axes[2],
                   denormalize=True, marker_color='red', marker_size=60)
    axes[2].set_title('Red Keypoints (larger)')
    
    plt.tight_layout()
    plt.show()
except NameError:
    print("Data not loaded - run the data loading cell first")

## 5. Keypoint Distribution Analysis

Analyze the distribution of keypoint positions across the dataset.
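
Besides the mean position, the per-keypoint spread is often informative. The sketch below uses synthetic data in place of `y_train` and NaN-aware reductions, on the assumption that some annotations might be missing. Whether `load_data()` already drops incomplete samples is not stated here, so treat the NaN handling as a precaution.

```python
import numpy as np

# Synthetic stand-in for y_train: 5 samples, 15 keypoints, values in [-1, 1)
rng = np.random.default_rng(0)
y_demo = rng.uniform(-1.0, 1.0, size=(5, 30))
y_demo[0, 0:2] = np.nan   # pretend keypoint 0 is unannotated in sample 0

# Denormalize to pixels and group values into (x, y) pairs
n_keypoints = y_demo.shape[1] // 2
pixels = (y_demo * 48 + 48).reshape(-1, n_keypoints, 2)

# NaN-aware statistics per keypoint, each of shape (n_keypoints, 2)
mean_xy = np.nanmean(pixels, axis=0)
std_xy = np.nanstd(pixels, axis=0)

# Number of samples in which each keypoint is missing
missing = np.isnan(pixels).any(axis=2).sum(axis=0)

for i in range(3):
    print(f"keypoint {i}: mean={mean_xy[i].round(1)}, "
          f"std={std_xy[i].round(1)}, missing={missing[i]}")
```

With `np.nanmean`, a single missing annotation does not turn the whole column into NaN, which is what a plain `mean(axis=0)` would do.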

In [None]:
# Analyze keypoint distributions
try:
    # Denormalize keypoints to pixel coordinates
    y_pixels = y_train * 48 + 48
    
    # Reshape to (n_samples, n_keypoints, 2)
    y_reshaped = y_pixels.reshape(-1, 15, 2)
    
    # Compute mean position for each keypoint
    mean_positions = y_reshaped.mean(axis=0)
    
    # Plot mean keypoint positions on a reference grid
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.set_xlim(0, 96)
    ax.set_ylim(96, 0)  # Invert y-axis (image coordinates)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    
    # Plot mean positions
    ax.scatter(mean_positions[:, 0], mean_positions[:, 1], 
               s=100, c='red', marker='o', edgecolors='black')
    
    # Add keypoint labels
    keypoint_names = [
        'left_eye_center', 'right_eye_center', 'left_eye_inner_corner',
        'left_eye_outer_corner', 'right_eye_inner_corner', 'right_eye_outer_corner',
        'left_eyebrow_inner', 'left_eyebrow_outer', 'right_eyebrow_inner',
        'right_eyebrow_outer', 'nose_tip', 'mouth_left_corner',
        'mouth_right_corner', 'mouth_center_top', 'mouth_center_bottom'
    ]
    
    for i, (x, y) in enumerate(mean_positions):
        ax.annotate(str(i), (x, y), xytext=(5, 5), textcoords='offset points', fontsize=8)
    
    ax.set_title('Mean Keypoint Positions (96×96 image space)')
    ax.set_xlabel('X coordinate')
    ax.set_ylabel('Y coordinate')
    
    plt.tight_layout()
    plt.show()
    
    # Print keypoint legend
    print("Keypoint Index Legend:")
    for i, name in enumerate(keypoint_names):
        print(f"  {i:2d}: {name}")
except NameError:
    print("Data not loaded - run the data loading cell first")

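## 6. Pixel Intensity Distribution

Compare the distribution of normalized pixel intensities with the distribution of normalized keypoint coordinates.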
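
The bin counts behind such histograms can also be inspected numerically with `np.histogram`, which is handy when a plot is not available. The arrays below are synthetic stand-ins for `X_train` and `y_train`, and the bin ranges assume the normalization described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((4, 96, 96, 1), dtype=np.float32)   # pixels in [0, 1)
y_demo = rng.uniform(-1.0, 1.0, size=(4, 30))           # coordinates in [-1, 1)

# 50 equal-width bins over the expected value ranges
pixel_counts, pixel_edges = np.histogram(X_demo.ravel(), bins=50, range=(0.0, 1.0))
coord_counts, coord_edges = np.histogram(y_demo.ravel(), bins=50, range=(-1.0, 1.0))

# Values outside the requested range are not counted, so comparing the
# total count with the array size reveals out-of-range values.
print(pixel_counts.sum() == X_demo.size)
print(coord_counts.sum() == y_demo.size)

busiest = int(np.argmax(pixel_counts))
print(f"Busiest pixel bin: [{pixel_edges[busiest]:.2f}, {pixel_edges[busiest + 1]:.2f})")
```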

In [None]:
# Plot histograms of pixel intensities and keypoint coordinates
try:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Overall pixel distribution
    axes[0].hist(X_train.ravel(), bins=50, color='steelblue', edgecolor='black', alpha=0.7)
    axes[0].set_xlabel('Pixel Value (normalized)')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Pixel Intensity Distribution')
    axes[0].axvline(X_train.mean(), color='red', linestyle='--', label=f'Mean: {X_train.mean():.3f}')
    axes[0].legend()
    
    # Keypoint coordinate distribution
    axes[1].hist(y_train.ravel(), bins=50, color='coral', edgecolor='black', alpha=0.7)
    axes[1].set_xlabel('Keypoint Coordinate (normalized)')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title('Keypoint Coordinate Distribution')
    axes[1].axvline(y_train.mean(), color='red', linestyle='--', label=f'Mean: {y_train.mean():.3f}')
    axes[1].legend()
    
    plt.tight_layout()
    plt.show()
except NameError:
    print("Data not loaded - run the data loading cell first")

## Summary

This notebook demonstrated:

1. **Data Loading**: Using `load_data()` to load preprocessed images and keypoints
2. **Statistics**: Using `get_data_statistics()` for dataset validation
3. **Visualization**: Using `plot_keypoints()` and `plot_training_samples()` for visual inspection
4. **Analysis**: Understanding keypoint distributions and image properties

### Next Steps

- Proceed to `02_model_training.ipynb` to train a CNN model
- Or jump to `03_inference_pipeline.ipynb` to use a pre-trained model