# Face Model: Detection, Cropping, and Training Prep

This notebook prepares a single-face dataset from FindingEmo and sketches a lightweight model for Valence/Arousal regression.

Expected data directories: `data/Run_1/` and `data/Run_2/`, with annotations in `data/processed_annotations.csv` (produced by `scripts/findingemo_process_annotations.py`).

## Environment & Dependencies
- Requires: opencv-python, mediapipe, numpy, pandas, torch, torchvision, pillow, tqdm, scikit-learn.
- In this project, prefer installing from shell with: `uv pip install mediapipe torch torchvision`
- If running directly in this notebook, you may use: `!pip install mediapipe torch torchvision`

In [None]:
# !pip install mediapipe torch torchvision --quiet

import os
import cv2
import numpy as np
import pandas as pd
import mediapipe as mp
import torch
import torch.nn as nn
import torchvision.transforms as T

from PIL import Image
from pathlib import Path
from tqdm import tqdm
from sklearn.model_selection import StratifiedGroupKFold


In [None]:
DATA_DIR = Path("..") / "data"
RUN_DIRS = [DATA_DIR / "Run_1", DATA_DIR / "Run_2"]
ANNOTATIONS_CSV = DATA_DIR / "processed_annotations.csv"
FACE_CROPS_DIR = DATA_DIR / "face_crops"
FACE_CSV = DATA_DIR / "face_annotations.csv"
FACE_CROPS_DIR.mkdir(parents=True, exist_ok=True)

assert ANNOTATIONS_CSV.exists(), f"Expected annotations at {ANNOTATIONS_CSV}"

In [None]:
# --- Config knobs ---
CONFIDENCE_THRESHOLD = 0.5  # lower -> more detections (incl. false positives)
MODEL_SELECTION = (
    0  # 0: short-range (bigger/closer faces), 1: long-range (smaller/farther faces)
)


## SingleFaceProcessor (MediaPipe)
Chooses the primary face (scored by confidence, area, and center proximity), then extracts a padded crop resized to 224×224.

**What MediaPipe is**

* MediaPipe is a Google framework that ships fast, pre-trained CV models (pose, hands, face) with easy Python APIs.  
* Here we use `mp.solutions.face_detection.FaceDetection`, a lightweight face detector (BlazeFace family) that runs in real time on CPU.  

**What the code does**

* Converts the image from BGR to RGB (OpenCV loads images as BGR; models typically expect RGB).
* Runs MediaPipe face detection to get zero or more face detections.
* Scores each detected face with: `score = confidence × sqrt(area) × (0.7 + 0.3 × center_proximity)`
  * confidence: the detector’s score for that face.
  * area: the box area as a fraction of the image; prefers larger faces (proxy for closeness/visibility).
  * center proximity: 1 at the image center, falling to 0 at the corners; prefers faces near the center.
* Selects the best-scoring face.
* Extracts a padded crop around that face and resizes to 224×224.
* Returns the crop and the bounding box; if no faces, returns None, None.
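
The scoring rule above can be reproduced without MediaPipe. The following is a minimal sketch using only the standard library; `face_score` and the two detections are hypothetical names and values chosen for illustration, not part of the notebook's class.

```python
import math


def face_score(conf, xmin, ymin, width, height, img_w, img_h):
    """Score one detection: conf * sqrt(relative area) * (0.7 + 0.3 * center proximity)."""
    area = max(width, 0.0) * max(height, 0.0)  # relative area, as a fraction of the image
    # Face center and image center in pixel coordinates.
    fx, fy = (xmin + width / 2) * img_w, (ymin + height / 2) * img_h
    cx, cy = img_w / 2, img_h / 2
    dist = math.hypot(fx - cx, fy - cy)
    max_dist = math.hypot(cx, cy) or 1.0  # distance from the center to a corner
    proximity = 1.0 - dist / max_dist
    return conf * math.sqrt(area) * (0.7 + 0.3 * proximity)


# Two made-up detections in a 640x480 image (relative bbox coordinates).
centered = face_score(0.80, 0.40, 0.40, 0.20, 0.20, 640, 480)
corner = face_score(0.95, 0.00, 0.00, 0.10, 0.10, 640, 480)
print(f"centered: {centered:.3f}, corner: {corner:.3f}")
```

In this example the larger, centered face wins even though the corner face has the higher detector confidence, which is the behaviour the weighting is meant to produce.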

**Why resize to 224×224?**

* It’s the conventional input size for ImageNet-pretrained models (e.g., ResNet18), whose weights were trained on 224×224 crops.  
* Using 224×224 keeps the crops compatible with torchvision backbones and their pretrained weights. Other sizes work too, but you would need to adjust the transforms (and possibly the model) accordingly.  

**What is OpenCV (cv2) and how it’s used here**

* OpenCV is a widely used computer-vision library.
* `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)`: convert BGR→RGB for the detector.
* `cv2.resize(crop, (224, 224))`: resize the face crop to the target size.
* The notebook also uses `cv2.imread` to load images and `cv2.imwrite` to save crops.  

**About confidence_threshold**

* `FaceDetection(min_detection_confidence=confidence_threshold)` tells MediaPipe to only return detections with confidence ≥ that threshold.
* In `extract_primary_face`, if `results.detections` is empty (e.g., no face ≥ threshold), it returns `None, None`. A threshold that is too high can therefore leave many images without any crop.
* If you lower the threshold (e.g., 0.3), you’ll get more candidate faces (including some false positives). If you raise it (e.g., 0.8), you’ll get fewer but higher-confidence detections.
* Practical tip: start around 0.5, then adjust based on how many frames come back empty vs. how many false crops you see.  
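
One way to act on that tip is to collect detector confidences once (for example by running the detector with a low threshold and reading `det.score[0]`) and then count how many images would still yield a face at each candidate threshold. The sketch below assumes the confidences have already been gathered; `coverage_by_threshold` and the sample values are hypothetical.

```python
def coverage_by_threshold(confidences_per_image, thresholds):
    """For each threshold, count images that keep at least one detection."""
    coverage = {}
    for t in thresholds:
        kept = sum(
            1 for confs in confidences_per_image if any(c >= t for c in confs)
        )
        coverage[t] = kept
    return coverage


# Made-up detector confidences for four images (the last has no candidates at all).
sample = [[0.92, 0.41], [0.55], [0.33], []]
result = coverage_by_threshold(sample, thresholds=[0.3, 0.5, 0.8])
print(result)
```

Coverage alone does not reveal false positives, so a quick visual check of the crops near the chosen threshold is still worthwhile.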

**Crop padding and safety**

* The crop uses a 20% padding around the detected box and clamps to image bounds.
* If the crop is empty (edge case), it returns None to avoid downstream errors.
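
The padding and clamping arithmetic can be checked in isolation. This is a small sketch that mirrors the box computation described above as a standalone function; `padded_crop_box` and the example boxes are illustrative names and values, not part of the notebook.

```python
def padded_crop_box(xmin, ymin, width, height, img_w, img_h, padding=0.2):
    """Convert a relative bbox to a padded pixel box (x0, y0, x1, y1) clamped to the image."""
    x0 = int(max(0, (xmin - padding * width) * img_w))
    y0 = int(max(0, (ymin - padding * height) * img_h))
    x1 = int(min(img_w, (xmin + width * (1 + padding)) * img_w))
    y1 = int(min(img_h, (ymin + height * (1 + padding)) * img_h))
    return x0, y0, x1, y1


# A box well inside a 640x480 image: padding grows it on all four sides.
inner = padded_crop_box(0.33, 0.27, 0.21, 0.31, 640, 480)
# A box near the bottom-right corner: the padded box is clamped to the image size.
edge = padded_crop_box(0.85, 0.80, 0.21, 0.31, 640, 480)
print(inner, edge)
```

Because of the clamping, crops near the border are padded asymmetrically, so the face is not necessarily centered in those crops.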


In [None]:
class SingleFaceProcessor:
    def __init__(self, confidence_threshold: float = 0.5, model_selection: int = 0):
        """
        MediaPipe single-face selector:
        - confidence_threshold: min detection confidence (0..1)
        - model_selection: 0 = short-range model, 1 = long-range model
        """
        self.mp_face = mp.solutions.face_detection
        self.detector = self.mp_face.FaceDetection(
            min_detection_confidence=confidence_threshold,
            model_selection=model_selection,
        )

    def extract_primary_face(self, image: np.ndarray):
        rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = self.detector.process(rgb)
        if not results.detections:
            return None, None
        bbox = self._select_best_face(results.detections, image.shape)
        if bbox is None:
            return None, None
        crop = self._extract_face_crop(image, bbox)
        if crop is None:  # degenerate box produced an empty crop: report as "no face"
            return None, None
        return crop, bbox

    def _select_best_face(self, detections, image_shape):
        h, w = image_shape[:2]
        center = np.array([w / 2, h / 2])
        best, best_score = None, -1.0
        for det in detections:
            bb = det.location_data.relative_bounding_box
            conf = det.score[0]
            area = max(bb.width, 0) * max(bb.height, 0)
            face_center = np.array(
                [bb.xmin + bb.width / 2, bb.ymin + bb.height / 2]
            ) * np.array([w, h])
            dist = np.linalg.norm(face_center - center)
            maxd = np.linalg.norm(center) or 1.0
            prox = 1 - (dist / maxd)
            score = float(conf) * float(np.sqrt(area)) * (0.7 + 0.3 * float(prox))
            if score > best_score:
                best, best_score = bb, score
        return best

    def _extract_face_crop(self, image: np.ndarray, bb, padding: float = 0.2):
        h, w = image.shape[:2]
        x0 = int(max(0, (bb.xmin - padding * bb.width) * w))
        y0 = int(max(0, (bb.ymin - padding * bb.height) * h))
        x1 = int(min(w, (bb.xmin + bb.width * (1 + padding)) * w))
        y1 = int(min(h, (bb.ymin + bb.height * (1 + padding)) * h))
        crop = image[y0:y1, x0:x1]
        if crop.size == 0:
            return None
        return cv2.resize(crop, (224, 224), interpolation=cv2.INTER_AREA)


In [None]:
# --- Dry-run detection stats (no writes) ---
# Measures: total rows/images in annotations, how many exist locally, how many get a detection

df_all = pd.read_csv(ANNOTATIONS_CSV)
df_all["local_path"] = df_all["image_path"].apply(
    lambda p: str(DATA_DIR / p.lstrip("/"))
)
df_all["exists"] = df_all["local_path"].apply(lambda p: Path(p).exists())

total_rows = len(df_all)
total_images = df_all["image_path"].nunique()
exists_rows = int(df_all["exists"].sum())
exists_images = df_all[df_all["exists"]]["image_path"].nunique()

processor = SingleFaceProcessor(
    confidence_threshold=CONFIDENCE_THRESHOLD, model_selection=MODEL_SELECTION
)

det_rows = 0
det_images = set()

for _, row in tqdm(df_all[df_all["exists"]].iterrows(), total=exists_rows):
    img = cv2.imread(row["local_path"])
    if img is None:
        continue
    crop, bb = processor.extract_primary_face(img)
    if crop is None:
        continue
    det_rows += 1
    det_images.add(row["image_path"])

stats = {
    "total_rows_in_annotations": total_rows,
    "total_unique_images_in_annotations": total_images,
    "rows_with_local_file": exists_rows,
    "unique_images_with_local_file": exists_images,
    "detected_rows": det_rows,
    "detected_unique_images": len(det_images),
    "confidence_threshold": CONFIDENCE_THRESHOLD,
    "model_selection": MODEL_SELECTION,
}
stats


## Build face crops + metadata (combine Run_1 and Run_2)
Input annotations: `data/processed_annotations.csv` with columns like `image_path`, `valence`, `arousal`, `emotion`, `user`, etc.  

Output: `data/face_crops/` directory and `data/face_annotations.csv` with one row per successful face crop.

In [None]:
df = pd.read_csv(ANNOTATIONS_CSV)
# Normalize image_path to local file path: it starts with '/Run_*/...' relative to data/
df["local_path"] = df["image_path"].apply(lambda p: str(DATA_DIR / p.lstrip("/")))

processor = SingleFaceProcessor(
    confidence_threshold=CONFIDENCE_THRESHOLD, model_selection=MODEL_SELECTION
)
records = []

for _, row in tqdm(df.iterrows(), total=len(df)):
    img_path = Path(row["local_path"])
    if not img_path.exists():
        continue
    image = cv2.imread(str(img_path))
    if image is None:
        continue
    crop, bb = processor.extract_primary_face(image)
    if crop is None or bb is None:
        continue
    # Build crop path mirroring original structure
    rel_under_data = img_path.relative_to(DATA_DIR)
    crop_path = FACE_CROPS_DIR / rel_under_data
    crop_path.parent.mkdir(parents=True, exist_ok=True)
    if not cv2.imwrite(str(crop_path), crop):
        continue  # skip rows whose crop could not be written to disk

    records.append(
        {
            "crop_path": str(crop_path),
            "image_path": row["image_path"],
            "run": rel_under_data.parts[0] if len(rel_under_data.parts) > 0 else None,
            "emotion": row.get("emotion"),
            "valence": row.get("valence"),
            "arousal": row.get("arousal"),
            "user": row.get("user"),
            # Store bbox (relative coords) for traceability
            "bbox_xmin": float(bb.xmin),
            "bbox_ymin": float(bb.ymin),
            "bbox_w": float(bb.width),
            "bbox_h": float(bb.height),
        }
    )

face_df = pd.DataFrame.from_records(records)
if not face_df.empty:
    face_df.to_csv(FACE_CSV, index=False)
face_df.head()

  1%|          | 200/20147 [00:01<02:26, 135.74it/s]Premature end of JPEG file
  1%|          | 214/20147 [00:01<02:27, 135.24it/s][ERROR:0@117.619] global loadsave.cpp:507 imread_ imread_('../data/Run_2/Appalled teenagers soccer/UAT4VQJ7VLN3UWZ7W2YNJWTO2I.jpg'): can't read data: OpenCV(4.11.0) /Users/xperience/GHA-Actions-OpenCV/_work/opencv-python/opencv-python/opencv/modules/imgcodecs/src/grfmt_jpeg2000_openjpeg.cpp:645: error: (-2:Unspecified error) in function 'virtual bool cv::detail::Jpeg2KOpjDecoderBase::readData(Mat &)'
> OpenJPEG2000: tiles are not supported (expected: '(int)comp.dx == 1'), where
>     '(int)comp.dx' is 2
> must

Unnamed: 0,crop_path,image_path,run,emotion,valence,arousal,user,bbox_xmin,bbox_ymin,bbox_w,bbox_h
0,../data/face_crops/Run_2/Frustrated forty-some...,/Run_2/Frustrated forty-something office/team-...,Run_2,Interest,0,2,5fd97d5b40332e276ea58209,0.487954,0.197185,0.209047,0.28276
1,../data/face_crops/Run_2/Remorseful toddlers c...,/Run_2/Remorseful toddlers court of law/dcfs-c...,Run_2,Interest,1,2,5985f6bdeef500000111db98,0.698446,0.256576,0.198112,0.29702
2,../data/face_crops/Run_2/Scared adolescents pr...,/Run_2/Scared adolescents prison/15-hampton-ro...,Run_2,Anger,0,3,5fd97d5b40332e276ea58209,0.621183,0.261322,0.193928,0.344931
3,../data/face_crops/Run_2/Cheerful soldiers des...,/Run_2/Cheerful soldiers desert/obamaslides_01...,Run_2,Joy,2,4,5985f6bdeef500000111db98,0.66878,0.132012,0.195301,0.28274
4,../data/face_crops/Run_2/Raging elderly war/vi...,/Run_2/Raging elderly war/vietvetwel0010.jpg,Run_2,Joy,3,4,5985f6bdeef500000111db98,0.099873,0.280671,0.162464,0.230036


In [8]:
face_df

Unnamed: 0,crop_path,image_path,run,emotion,valence,arousal,user,bbox_xmin,bbox_ymin,bbox_w,bbox_h
0,../data/face_crops/Run_2/Frustrated forty-some...,/Run_2/Frustrated forty-something office/team-...,Run_2,Interest,0,2,5fd97d5b40332e276ea58209,0.487954,0.197185,0.209047,0.282760
1,../data/face_crops/Run_2/Remorseful toddlers c...,/Run_2/Remorseful toddlers court of law/dcfs-c...,Run_2,Interest,1,2,5985f6bdeef500000111db98,0.698446,0.256576,0.198112,0.297020
2,../data/face_crops/Run_2/Scared adolescents pr...,/Run_2/Scared adolescents prison/15-hampton-ro...,Run_2,Anger,0,3,5fd97d5b40332e276ea58209,0.621183,0.261322,0.193928,0.344931
3,../data/face_crops/Run_2/Cheerful soldiers des...,/Run_2/Cheerful soldiers desert/obamaslides_01...,Run_2,Joy,2,4,5985f6bdeef500000111db98,0.668780,0.132012,0.195301,0.282740
4,../data/face_crops/Run_2/Raging elderly war/vi...,/Run_2/Raging elderly war/vietvetwel0010.jpg,Run_2,Joy,3,4,5985f6bdeef500000111db98,0.099873,0.280671,0.162464,0.230036
...,...,...,...,...,...,...,...,...,...,...,...
7675,../data/face_crops/Run_2/Guilt teenagers festi...,/Run_2/Guilt teenagers festival/SWUMXL4FUFEUFG...,Run_2,Vigilance,-1,3,6522a42f39b5bd8f96735aa9,0.072660,0.331239,0.262516,0.466683
7676,../data/face_crops/Run_2/Ashamed seniors rally...,/Run_2/Ashamed seniors rally/ap-19162542641934...,Run_2,Boredom,-1,1,6522a42f39b5bd8f96735aa9,0.374778,0.174922,0.233983,0.445683
7677,../data/face_crops/Run_2/Peaceful elderly rall...,/Run_2/Peaceful elderly rally/1569185152_chatt...,Run_2,Fear,-1,4,6522a42f39b5bd8f96735aa9,0.208560,0.235722,0.312466,0.387731
7678,../data/face_crops/Run_2/Guilt soldiers party/...,/Run_2/Guilt soldiers party/naroda-patiya-mass...,Run_2,Sadness,-2,5,6522a42f39b5bd8f96735aa9,0.316485,0.262779,0.160606,0.285523


## Train/Val/Test split (by emotion)
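Creates a `split` column to keep experiments reproducible. Splits are stratified by emotion and grouped by `image_path`, so all annotations of one image land in the same split. With 5 folds this gives roughly 60% train, 20% val, and 20% test.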

In [None]:
def add_splits(face_df: pd.DataFrame, seed: int = 42):
    df = face_df.copy().reset_index(drop=True)
    # Use emotion as strata; group by original image_path to avoid leakage
    y = df["emotion"].fillna("Unknown")
    groups = df["image_path"]
    sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=seed)
    # Held-out part of fold 0 becomes test, held-out part of fold 1 becomes val, rest train
    folds = list(sgkf.split(np.zeros(len(df)), y, groups))
    test_idx = folds[0][1]
    val_idx = folds[1][1]
    split = np.array(["train"] * len(df), dtype=object)
    split[test_idx] = "test"
    split[val_idx] = "val"
    df["split"] = split
    return df


if not face_df.empty:
    face_df = add_splits(face_df)
    face_df.to_csv(FACE_CSV, index=False)
face_df["split"].value_counts(dropna=False) if not face_df.empty else "No crops created"
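
Since the split is grouped by `image_path`, a quick sanity check is to confirm that no image appears in more than one split. The sketch below works on plain `(group, split)` pairs so it runs without pandas; `find_leaking_groups` and the sample pairs are hypothetical.

```python
from collections import defaultdict


def find_leaking_groups(rows):
    """Return groups (e.g. image paths) that appear in more than one split."""
    splits_by_group = defaultdict(set)
    for group, split in rows:
        splits_by_group[group].add(split)
    return {g: sorted(s) for g, s in splits_by_group.items() if len(s) > 1}


# Made-up (image_path, split) pairs; "/Run_1/b.jpg" is deliberately leaked.
pairs = [
    ("/Run_1/a.jpg", "train"),
    ("/Run_1/a.jpg", "train"),
    ("/Run_1/b.jpg", "train"),
    ("/Run_1/b.jpg", "val"),
    ("/Run_2/c.jpg", "test"),
]
leaks = find_leaking_groups(pairs)
print(leaks)
```

With the notebook's DataFrame, the same check could be run as `find_leaking_groups(zip(face_df["image_path"], face_df["split"]))`, which should return an empty dict.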

## PyTorch Model (ResNet18 backbone with V/A heads)
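Lightweight heads with dropout. Calling `forward` with `n_samples > 1` runs Monte Carlo Dropout and returns the mean and variance over the sampled passes (each of shape `[2, batch]`, valence first) instead of `(v, a)`.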

In [None]:
import torch
import torch.nn as nn
import torchvision.transforms as T
from PIL import Image


class FaceEmotionRegressor(nn.Module):
    def __init__(self, dropout_rate: float = 0.3):
        super().__init__()
        backbone = torch.hub.load("pytorch/vision:v0.10.0", "resnet18", pretrained=True)
        for p in list(backbone.parameters())[:-10]:
            p.requires_grad = False
        num_features = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.dropout = nn.Dropout(dropout_rate)
        self.valence_head = nn.Sequential(
            nn.Linear(num_features, 128), nn.ReLU(), self.dropout, nn.Linear(128, 1)
        )
        self.arousal_head = nn.Sequential(
            nn.Linear(num_features, 128), nn.ReLU(), self.dropout, nn.Linear(128, 1)
        )

    def forward(self, x, n_samples: int = 1):
        if n_samples and n_samples > 1:
            return self._mc_forward(x, n_samples)
        feats = self.backbone(x)
        feats = self.dropout(feats)
        v = self.valence_head(feats).squeeze(-1)
        a = self.arousal_head(feats).squeeze(-1)
        return v, a

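    def _mc_forward(self, x, n_samples: int):
        # MC Dropout: keep BatchNorm in eval mode (no running-stat updates) and
        # switch only the dropout layers to train mode; restore the mode afterwards.
        was_training = self.training
        self.eval()
        for m in self.modules():
            if isinstance(m, nn.Dropout):
                m.train()
        preds = []
        with torch.no_grad():
            for _ in range(n_samples):
                v, a = self.forward(x, n_samples=1)
                preds.append(torch.stack([v, a]))
        self.train(was_training)
        preds = torch.stack(preds)  # [n_samples, 2, batch]
        mean = preds.mean(dim=0)  # [2, batch]: row 0 = valence, row 1 = arousal
        var = preds.var(dim=0)
        return mean, var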


# Basic dataset for crops
class FaceCropsDataset(torch.utils.data.Dataset):
    def __init__(self, csv_file: str, split: str = "train"):
        self.df = pd.read_csv(csv_file)
        if "split" in self.df.columns:
            self.df = self.df[self.df["split"] == split].reset_index(drop=True)
        self.tx = T.Compose(
            [
                T.Resize((224, 224)),
                T.ToTensor(),
                T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
            ]
        )

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row["crop_path"]).convert("RGB")
        x = self.tx(img)
        v = torch.tensor(row["valence"], dtype=torch.float32)
        a = torch.tensor(row["arousal"], dtype=torch.float32)
        return x, v, a
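
To make the Monte Carlo Dropout idea concrete without loading the model, here is a toy sketch on a single linear unit in pure Python. It is an illustration of the technique only, not the notebook's implementation; `mc_dropout_predict`, the features, and the weights are made up.

```python
import random
import statistics


def mc_dropout_predict(features, weights, p=0.3, n_samples=200, seed=0):
    """Toy MC Dropout on one linear unit: sample dropout masks, return (mean, variance)."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        # Inverted dropout: zero each feature with probability p,
        # and rescale the survivors by 1 / (1 - p).
        masked = [0.0 if rng.random() < p else f / (1.0 - p) for f in features]
        outputs.append(sum(w * m for w, m in zip(weights, masked)))
    return statistics.fmean(outputs), statistics.variance(outputs)


feats = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
mean, var = mc_dropout_predict(feats, w)
deterministic = sum(wi * fi for wi, fi in zip(w, feats))
print(f"MC mean {mean:.3f} (no dropout: {deterministic:.3f}), variance {var:.3f}")
```

The mean over the sampled passes approximates the deterministic prediction, while the variance serves as a rough uncertainty estimate. With `p=0` every pass is identical and the variance collapses to zero.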

## Training Sketch
Quick example loop (not executed by default).

In [None]:
# Example training loop (set RUN_TRAINING=True to run)
RUN_TRAINING = False
if RUN_TRAINING and FACE_CSV.exists():
    train_ds = FaceCropsDataset(str(FACE_CSV), split="train")
    val_ds = FaceCropsDataset(str(FACE_CSV), split="val")
    train_loader = torch.utils.data.DataLoader(
        train_ds, batch_size=32, shuffle=True, num_workers=2
    )
    val_loader = torch.utils.data.DataLoader(
        val_ds, batch_size=32, shuffle=False, num_workers=2
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = FaceEmotionRegressor().to(device)
    opt = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3
    )
    loss_fn = nn.SmoothL1Loss(beta=0.5)

    for epoch in range(5):
        model.train()
        total = 0.0
        for x, v, a in train_loader:
            x, v, a = x.to(device), v.to(device), a.to(device)
            opt.zero_grad()
            pv, pa = model(x)
            loss = loss_fn(pv, v) + loss_fn(pa, a)
            loss.backward()
            opt.step()
            total += float(loss.item())
        print(f"Epoch {epoch + 1}: train loss {total / max(1, len(train_loader)):.4f}")

        # val
        model.eval()
        vtotal = 0.0
        with torch.no_grad():
            for x, v, a in val_loader:
                x, v, a = x.to(device), v.to(device), a.to(device)
                pv, pa = model(x)
                vtotal += float((loss_fn(pv, v) + loss_fn(pa, a)).item())
        print(f"          val loss {vtotal / max(1, len(val_loader)):.4f}")

## Dataset format summary (face_annotations.csv)
Each row corresponds to one detected primary face crop. Suggested columns:
- `crop_path` (str): absolute/relative path to saved 224×224 crop
- `image_path` (str): original relative path from FindingEmo (starts with `/Run_*`)
- `run` (str): `Run_1` or `Run_2`
- `emotion` (str): discrete emotion label from annotations
- `valence` (int/float): V score (dataset provides ints like -3..+3)
- `arousal` (int/float): A score (0..6)
- `user` (str): annotator id
- `bbox_xmin`, `bbox_ymin`, `bbox_w`, `bbox_h` (float, relative 0..1): MediaPipe bbox used for crop
- `split` (str): `train`/`val`/`test` after splitting

This format is sufficient to train the face emotion regressor with standard PyTorch datasets/dataloaders.
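
Before training, it can help to validate the CSV against the column list above. The following is a minimal sketch using the standard `csv` module; `validate_row`, `REQUIRED_COLUMNS`, and the two sample rows are hypothetical, and the value ranges are the ones stated in this summary.

```python
import csv
import io

REQUIRED_COLUMNS = {
    "crop_path", "image_path", "run", "emotion", "valence", "arousal",
    "user", "bbox_xmin", "bbox_ymin", "bbox_w", "bbox_h", "split",
}


def validate_row(row):
    """Return a list of problems found in one CSV row (a dict of strings)."""
    problems = [f"missing column: {c}" for c in sorted(REQUIRED_COLUMNS - set(row))]
    if problems:
        return problems
    if not -3 <= float(row["valence"]) <= 3:
        problems.append("valence outside -3..3")
    if not 0 <= float(row["arousal"]) <= 6:
        problems.append("arousal outside 0..6")
    if float(row["bbox_w"]) <= 0 or float(row["bbox_h"]) <= 0:
        problems.append("empty bounding box")
    if row["split"] not in {"train", "val", "test"}:
        problems.append(f"unknown split: {row['split']}")
    return problems


# Two made-up rows: the second has an out-of-range arousal and a bad split name.
sample_csv = io.StringIO(
    "crop_path,image_path,run,emotion,valence,arousal,user,"
    "bbox_xmin,bbox_ymin,bbox_w,bbox_h,split\n"
    "crops/a.jpg,/Run_1/a.jpg,Run_1,Joy,2,4,u1,0.1,0.2,0.3,0.4,train\n"
    "crops/b.jpg,/Run_2/b.jpg,Run_2,Fear,-1,9,u2,0.1,0.2,0.3,0.4,dev\n"
)
report = [validate_row(r) for r in csv.DictReader(sample_csv)]
print(report)
```

The sketch does not range-check `bbox_xmin` and `bbox_ymin`, on the assumption that a detector may report boxes that extend slightly past the frame for faces cut off at the image border.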