# MediaPipe-Based Eye Region Extraction

This notebook preprocesses raw videos using MediaPipe Face Mesh to:
- Detect eye regions frame-by-frame
- Extract eye bounding boxes and landmarks
- Save metadata for downstream ViViT training

**Output**
- Eye metadata (`.npz`): bounding boxes + landmarks per frame
- Full-frame BMPs aligned with metadata

This preprocessing is used by the ViViT blink classification model.


## 1. Imports and Environment Setup


In [None]:
import os
import cv2
import numpy as np
import mediapipe as mp
from tqdm import tqdm

## 2. MediaPipe Face Mesh Initialization


In [None]:
mp_face_mesh = mp.solutions.face_mesh

face_mesh = mp_face_mesh.FaceMesh(
    static_image_mode=False,      # video stream
    max_num_faces=1,              # single subject
    refine_landmarks=True,        # enables iris landmarks
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)


## 3. Eye Landmark Definitions


In [None]:
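# MediaPipe FaceMesh eye landmark indices.
# Note: "left"/"right" here follow image (viewer) orientation for an unmirrored frame.
# Indices such as 33 and 133 belong to the subject's right eye, which MediaPipe's own
# constants call FACEMESH_RIGHT_EYE.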
LEFT_EYE_IDX = [
    33, 133, 160, 159, 158, 157, 173,
    246, 161, 163, 144, 145, 153, 154, 155
]

RIGHT_EYE_IDX = [
    362, 263, 387, 386, 385, 384, 398,
    466, 388, 390, 373, 374, 380, 381, 382
]

EYE_LANDMARKS = LEFT_EYE_IDX + RIGHT_EYE_IDX
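
A quick sanity check on the index lists can catch typos before any video is processed. The sketch below assumes the 478-point output that Face Mesh produces when `refine_landmarks=True` (468 mesh points plus 10 iris points). `check_indices` is a hypothetical helper, not part of MediaPipe.

```python
# Hypothetical helper: verify that landmark indices are unique and in range.
NUM_LANDMARKS = 478  # assumption: refine_landmarks=True (468 mesh + 10 iris points)


def check_indices(indices, num_landmarks=NUM_LANDMARKS):
    """Return the number of indices; raise ValueError on duplicates or out-of-range values."""
    if len(set(indices)) != len(indices):
        raise ValueError("duplicate landmark indices")
    bad = [i for i in indices if not 0 <= i < num_landmarks]
    if bad:
        raise ValueError(f"indices out of range: {bad}")
    return len(indices)


# Same values as LEFT_EYE_IDX and RIGHT_EYE_IDX above, repeated so this cell runs on its own.
left = [33, 133, 160, 159, 158, 157, 173, 246, 161, 163, 144, 145, 153, 154, 155]
right = [362, 263, 387, 386, 385, 384, 398, 466, 388, 390, 373, 374, 380, 381, 382]

n_total = check_indices(left + right)
print(n_total)  # 30
```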


## 4. Frame-by-Frame Eye Detection


In [None]:
def eye_bbox_from_landmarks(landmarks, frame_shape, margin=0.15):
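    """
    Compute a padded eye bounding box in pixel coordinates.

    landmarks: iterable of landmark objects with normalized .x and .y attributes
    frame_shape: (H, W, 3)
    margin: fraction of the box width/height added as padding on each side
    Returns (x, y, w, h), clipped to the frame.
    """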
    h, w = frame_shape[:2]
    xs = [int(lm.x * w) for lm in landmarks]
    ys = [int(lm.y * h) for lm in landmarks]

    x1, x2 = min(xs), max(xs)
    y1, y2 = min(ys), max(ys)

    # expand bbox slightly
    bw, bh = x2 - x1, y2 - y1
    pad_w, pad_h = int(bw * margin), int(bh * margin)

    x1 = max(0, x1 - pad_w)
    y1 = max(0, y1 - pad_h)
    x2 = min(w, x2 + pad_w)
    y2 = min(h, y2 + pad_h)

    return x1, y1, x2 - x1, y2 - y1


## 6. Metadata Serialization


In [None]:
def process_video_extract_eyes(
    video_path,
    save_root,
    video_id,
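):
    """
    Run Face Mesh on every frame of `video_path` and save per-frame eye
    bounding boxes and landmarks to `<save_root>/<video_id>_metadata.npz`.
    """
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")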
    bboxes = []
    landmarks_all = []

    os.makedirs(save_root, exist_ok=True)

    last_bbox, last_landmarks = None, None
    frame_idx = 0

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        result = face_mesh.process(rgb)

        if result.multi_face_landmarks:
            face = result.multi_face_landmarks[0]
            eye_lms = [face.landmark[i] for i in EYE_LANDMARKS]

            bbox = eye_bbox_from_landmarks(eye_lms, frame.shape)
            bboxes.append(bbox)
            landmarks_all.append([(lm.x, lm.y) for lm in eye_lms])

            last_bbox, last_landmarks = bbox, eye_lms
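        else:
            if last_bbox is not None:
                # fallback to last valid detection
                bboxes.append(last_bbox)
                landmarks_all.append([(lm.x, lm.y) for lm in last_landmarks])
            else:
                # No detection yet: append a placeholder so metadata stays aligned
                # with frame indices (assumed convention: zero bbox, NaN landmarks).
                bboxes.append((0, 0, 0, 0))
                landmarks_all.append([(np.nan, np.nan)] * len(EYE_LANDMARKS))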

        frame_idx += 1

    cap.release()

    np.savez_compressed(
        os.path.join(save_root, f"{video_id}_metadata.npz"),
        bboxes=np.array(bboxes),
        landmarks=np.array(landmarks_all),
    )
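
To illustrate the saved format, the sketch below writes and reads back a small synthetic file with the same keys (`bboxes`, `landmarks`). The arrays, frame count, and the file name `demo_metadata.npz` are stand-ins chosen for the example, not real outputs of this notebook.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for one processed video: 5 frames, 30 eye landmarks per frame.
num_frames, num_points = 5, 30
bboxes = np.tile(np.array([100, 80, 200, 60]), (num_frames, 1))  # (x, y, w, h) in pixels
landmarks = np.random.rand(num_frames, num_points, 2)            # normalized (x, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_metadata.npz")  # hypothetical file name
    np.savez_compressed(path, bboxes=bboxes, landmarks=landmarks)

    # np.load returns a lazy NpzFile; read the arrays before the file is closed.
    with np.load(path) as data:
        loaded_bboxes = data["bboxes"]
        loaded_landmarks = data["landmarks"]

print(loaded_bboxes.shape)     # (5, 4)
print(loaded_landmarks.shape)  # (5, 30, 2)
```

Row `i` of each array corresponds to frame `i`, provided every frame contributes exactly one entry during extraction.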


## 7. Directory Structure and Output Format


    training/
    └── blink/
        └── <video_id>/
            └── 0/
                ├── 0.bmp
                ├── 1.bmp


    eye_dataset_train/
    └── blink/
        └── <video_id>_metadata.npz

### Why metadata instead of raw crops?
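Saving eye metadata instead of cropped images allows:
- Reproducible cropping from the original frames
- Different crop strategies (margin, aspect ratio, single eye) without rerunning MediaPipe
- Faster experimentation, since landmark detection runs once per video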
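
As an illustration of changing the crop strategy without rerunning MediaPipe, the sketch below recomputes a box from stored normalized landmarks with a different margin and crops a frame with it. `bbox_from_xy` and `crop` are hypothetical names. The first mirrors the logic of `eye_bbox_from_landmarks` but takes a `(num_points, 2)` array as stored in the `.npz` file. The frame and landmark values are synthetic.

```python
import numpy as np


def bbox_from_xy(xy, frame_shape, margin=0.15):
    """xy: (num_points, 2) normalized coords. Returns (x, y, w, h) in pixels."""
    h, w = frame_shape[:2]
    xs = (xy[:, 0] * w).astype(int)
    ys = (xy[:, 1] * h).astype(int)
    x1, x2 = int(xs.min()), int(xs.max())
    y1, y2 = int(ys.min()), int(ys.max())
    # Pad by a fraction of the tight box size, then clip to the frame.
    pad_w, pad_h = int((x2 - x1) * margin), int((y2 - y1) * margin)
    x1, y1 = max(0, x1 - pad_w), max(0, y1 - pad_h)
    x2, y2 = min(w, x2 + pad_w), min(h, y2 + pad_h)
    return x1, y1, x2 - x1, y2 - y1


def crop(frame, bbox):
    """Slice an (x, y, w, h) box out of an (H, W, C) frame."""
    x, y, bw, bh = bbox
    return frame[y:y + bh, x:x + bw]


# Synthetic 640x480 frame and four landmark points for one frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
xy = np.array([[0.25, 0.5], [0.75, 0.5], [0.5, 0.375], [0.5, 0.625]])

tight = bbox_from_xy(xy, frame.shape, margin=0.0)
wide = bbox_from_xy(xy, frame.shape, margin=0.5)

print(tight)                     # (160, 180, 320, 120)
print(wide)                      # (0, 120, 640, 240)
print(crop(frame, wide).shape)   # (240, 640, 3)
```

With `margin=0.5` the horizontal padding reaches the frame edges, so the box is clipped to the full frame width.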
