A deep learning pipeline for recognizing 8 foot gestures from video, designed for hands-free control in industrial production environments. Two model families are provided: a fast per-frame model (EfficientNetV2-S) and a spatiotemporal model (R(2+1)D-18) for gestures that require motion context.
Status: implementation complete — ready for data collection and training.
| Label | Code | Gesture Name | Description |
|---|---|---|---|
| 0 | G1 | Heel Tap | Tap heel down vertically |
| 1 | G2 | Forward Kick | Kick foot forward from the knee |
| 2 | G3 | Foot Lift | Lift foot straight up, hold briefly |
| 3 | G4 | Lateral Slide | Slide foot sideways (left or right) |
| 4 | G5 | Forward Step | Step forward and return |
| 5 | G6 | Cross Front | Cross one foot in front of the other |
| 6 | G7 | Foot Hold | Hold foot still above ground for ≥1.5 s |
| 7 | G8 | Flamingo Bend | Raise heel, bend knee (one-legged stance) |
G7 (Foot Hold) is a dwell gesture: it cannot be classified from a single frame. Use the video model (`train_video.py`) if G7 accuracy is critical.
- ImageNet pretrained backbone, custom 8-class head
- Input: `(B, 3, 224, 224)`, a single frame
- Inference: average softmax over all 16 extracted frames per clip
- ~21M parameters; fast enough for real-time on CPU
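As a sketch of how this could look (the actual `models/frame_model.py` may differ), the 8-class head swap and the per-clip softmax averaging are a few lines of torchvision code:

```python
# Minimal sketch of the per-frame approach; models/frame_model.py may differ.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights

NUM_CLASSES = 8

model = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)
# Replace the ImageNet classifier (1000 classes) with an 8-class head
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
model.eval()

@torch.no_grad()
def predict_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (16, 3, 224, 224), the pre-extracted frames of one clip.
    Returns the clip-level probability vector, averaged over frames."""
    probs = torch.softmax(model(frames), dim=1)  # (16, 8) per-frame softmax
    return probs.mean(dim=0)                     # (8,) clip-level prediction
```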
- Kinetics-400 pretrained backbone, custom 8-class head
- Input: `(B, 3, 16, 112, 112)`, a 16-frame clip
- Handles temporal dynamics (direction, speed, dwell)
- ~33M parameters; GPU recommended
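A comparable sketch for the clip model, again hedged: the real `models/video_model.py` contains freeze/unfreeze helpers, and the freeze loop below is only illustrative (the repo's `--freeze` flag may count blocks differently):

```python
# Minimal sketch of the clip model; models/video_model.py may differ.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

model = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, 8)  # 8 gesture classes

clip = torch.randn(2, 3, 16, 112, 112)  # (B, C, T, H, W): two 16-frame clips
logits = model(clip)                    # (2, 8)

# Illustrative partial freeze: train only the last two residual stages + head
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(("layer3", "layer4", "fc"))
```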
```
SensiFoot/
├── config/
│ └── config.py # All hyperparameters, paths, class names
│
├── data/
│ ├── raw/
│ │ ├── cam1/ # Front-facing camera
│ │ ├── cam2/ # 45° left (phone)
│ │ └── cam3/ # 45° right or overhead
│ ├── processed/
│ │ └── frames/ # Pre-extracted JPG frames, one dir per clip
│ └── splits/
│ ├── dataset_master.csv
│ ├── train.csv
│ ├── val.csv
│ └── test.csv
│
├── scripts/
│ ├── 01_sync_check.py # Detect beep timestamps, report + trim offsets
│ ├── 02_extract_frames.py # Extract FRAMES_PER_CLIP frames from each video
│ ├── 03_build_dataset_csv.py # Scan frames/ → build master CSV + stats
│ ├── 04_split_dataset.py # Subject-based train/val/test split
│ └── 05_visualize_samples.py # Grid of sample frames per class
│
├── datasets/
│ ├── foot_frame_dataset.py # Frame-level dataset (all/middle/random selection)
│ └── foot_video_dataset.py # Clip-level dataset with temporal jitter aug
│
├── models/
│ ├── frame_model.py # EfficientNetV2-S/M + ConvNeXt variants
│ └── video_model.py # R(2+1)D-18 with freeze/unfreeze helpers
│
├── train/
│ ├── train_frame.py # Full loop: AMP, cosine LR, early stopping
│ └── train_video.py # Same pipeline for R(2+1)D
│
├── evaluate/
│ └── evaluate.py # Test-set eval: overall + per-class + per-camera
│
├── inference/
│ └── predict.py # Inference on a new .mp4 — Top-3 output
│
├── utils/
│ ├── augmentation.py # MotionBlurTransform, VideoConsistentTransform
│ ├── metrics.py # accuracy, F1, per-class report
│ └── visualization.py # Training curves, confusion matrix, distribution plot
│
├── checkpoints/ # Saved model weights (.pth)
├── outputs/
│ ├── plots/ # PNG figures
│ └── predictions/ # predictions_frame.csv, predictions_video.csv
│
├── requirements.txt
├── DEVELOPMENT_LOG.md
└── .gitignore
```
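All hyperparameters, paths, and class names live in `config/config.py`. Below is a hedged, illustrative excerpt built only from constants this README references (`FRAMES_PER_CLIP`, the gesture names, `SYMMETRIC_GESTURES`); the actual file will differ:

```python
# config/config.py, illustrative excerpt only; actual contents may differ
FRAMES_PER_CLIP = 16          # consumed by scripts/02_extract_frames.py

CLASS_NAMES = [
    "heeltap", "forwardkick", "footlift", "lateralslide",
    "forwardstep", "crossfront", "foothold", "flamingobend",
]

# Gestures considered safe to mirror horizontally; membership here is a
# guess, the real set excludes at least the lateral gestures
SYMMETRIC_GESTURES = {"heeltap", "forwardkick", "footlift",
                      "forwardstep", "foothold", "flamingobend"}
```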
```bash
pip install -r requirements.txt
```

Place `.mp4` files in `data/raw/cam1/`, `cam2/`, `cam3/`, using the naming format:
```
{gesture}_{personID}_{cameraID}_{repeatID}.mp4
```

Examples:

```
heeltap_p01_cam1_r01.mp4
forwardkick_p03_cam2_r05.mp4
flamingobend_p12_cam3_r03.mp4
```
Valid gesture names: `heeltap`, `forwardkick`, `footlift`, `lateralslide`, `forwardstep`, `crossfront`, `foothold`, `flamingobend`.
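A hypothetical parser for this convention (the names `PATTERN` and `parse_clip_name` are illustrative, not part of the repo):

```python
# Hypothetical parser for the clip-naming convention above.
import re
from pathlib import Path

PATTERN = re.compile(
    r"^(?P<gesture>[a-z]+)_(?P<person>p\d+)_(?P<camera>cam\d)_(?P<repeat>r\d+)\.mp4$"
)

def parse_clip_name(path: str) -> dict:
    """Split a clip filename into its gesture/person/camera/repeat fields."""
    m = PATTERN.match(Path(path).name)
    if m is None:
        raise ValueError(f"Bad clip name: {path}")
    return m.groupdict()

print(parse_clip_name("heeltap_p01_cam1_r01.mp4"))
# {'gesture': 'heeltap', 'person': 'p01', 'camera': 'cam1', 'repeat': 'r01'}
```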
```bash
# Step 1 — Check/fix timing offset between cameras (requires ffmpeg)
python scripts/01_sync_check.py
python scripts/01_sync_check.py --trim   # also trims offset videos

# Step 2 — Extract 16 frames from every video
python scripts/02_extract_frames.py

# Step 3 — Build master CSV
python scripts/03_build_dataset_csv.py

# Step 4 — Split by subject (train 70 / val 15 / test 15)
python scripts/04_split_dataset.py

# Step 5 — Sanity check visuals
python scripts/05_visualize_samples.py
```

```bash
# Frame model (EfficientNetV2-S) — start here, ~2–4h on RTX 3070
python train/train_frame.py
# With backbone frozen except last 2 blocks (faster early epochs):
python train/train_frame.py --freeze 2
# Video model (R(2+1)D-18) — ~8–12h on RTX 3070
python train/train_video.py --freeze 2
# Resume from last checkpoint:
python train/train_frame.py --resume
```

```bash
python evaluate/evaluate.py --model frame
python evaluate/evaluate.py --model video
```

Outputs: confusion matrix PNG, per-class table, per-camera breakdown, predictions CSV.
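For ad-hoc analysis, a per-camera accuracy can also be computed straight from the predictions CSV. The column names below (`camera`, `label`, `pred`) are assumptions about the CSV schema, not confirmed:

```python
# Hedged sketch: per-camera accuracy from a predictions CSV.
# Column names (camera, label, pred) are assumed, not the script's schema.
import pandas as pd

df = pd.read_csv("outputs/predictions/predictions_frame.csv")
per_camera = (df["label"] == df["pred"]).groupby(df["camera"]).mean()
print(per_camera)  # fraction correct for cam1 / cam2 / cam3
```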
```bash
python inference/predict.py --video path/to/clip.mp4
python inference/predict.py --video path/to/clip.mp4 --model video
```

Example output:
```
Input      : heeltap_p01_cam1_r01.mp4
────────────────────────────────────────────
Prediction : G1 - Heel Tap (94.3%) ✓
Top 3:
  1. G1 - Heel Tap        94.23%  ████████████████████████████
  2. G3 - Foot Lift        3.81%  █
  3. G8 - Flamingo Bend    1.02%
```
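A minimal sketch of how a Top-3 readout like the one above can be produced from the model's logits (`inference/predict.py` may format it differently):

```python
# Sketch of a Top-3 readout from logits; inference/predict.py may differ.
import torch

GESTURES = ["G1 - Heel Tap", "G2 - Forward Kick", "G3 - Foot Lift",
            "G4 - Lateral Slide", "G5 - Forward Step", "G6 - Cross Front",
            "G7 - Foot Hold", "G8 - Flamingo Bend"]

def print_top3(logits: torch.Tensor) -> None:
    """logits: (8,) raw scores for one clip."""
    probs = torch.softmax(logits, dim=-1)
    values, indices = probs.topk(3)
    for rank, (p, i) in enumerate(zip(values.tolist(), indices.tolist()), 1):
        print(f"{rank}. {GESTURES[i]:<22} {p:6.2%} {'█' * int(p * 30)}")
```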
| Choice | Decision | Reason |
|---|---|---|
| Split strategy | Subject-based | Prevents data leakage across train/test |
| Frame model | EfficientNetV2-S | Best accuracy/speed trade-off for this 8-class task |
| Video model | R(2+1)D-18 | Kinetics-400 pretrained; proven for gesture recognition |
| Frame count | 16 | Covers 2–3 s clips; matches R(2+1)D input |
| Temporal aug | ±25% jitter | Regularizes clip start time at training |
| G7 sampling | Center window | Dwell gesture peaks in the middle of clip |
| Loss function | WeightedCE + label smoothing 0.05 | Handles class imbalance + overconfidence |
| LR schedule | Cosine annealing | Smooth decay, good fine-tuning convergence |
| Flip strategy | Per-gesture (`SYMMETRIC_GESTURES`) | Lateral gestures must NOT be flipped (see sketch below) |
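Two of these decisions sketched in code: the loss line uses PyTorch's built-in class weights and label smoothing, and the flip helper consults `SYMMETRIC_GESTURES` so lateral gestures are never mirrored. Everything except the `SYMMETRIC_GESTURES` name is illustrative, including which gestures are listed as flip-safe:

```python
# Hedged sketch of two design decisions; the training code may differ.
import random
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

# WeightedCE + label smoothing 0.05; placeholder weights, the real ones
# would be computed from train-split class counts
class_weights = torch.tensor([1.0] * 8)
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.05)

# Illustrative membership; only flip gestures whose meaning survives mirroring
SYMMETRIC_GESTURES = {"heeltap", "forwardkick", "footlift",
                      "forwardstep", "foothold", "flamingobend"}

def maybe_hflip(frame: torch.Tensor, gesture: str, p: float = 0.5) -> torch.Tensor:
    """Horizontally flip a (C, H, W) frame only for symmetric gestures."""
    if gesture in SYMMETRIC_GESTURES and random.random() < p:
        return TF.hflip(frame)
    return frame
```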
- Mark a 50×50 cm square on the floor for foot placement
- 3 cameras: front (cam1), 45° left (cam2), 45° right or overhead (cam3)
- Play a 1000 Hz beep before each session (used by `01_sync_check.py`, sketched after this list)
- Lighting: record under 3 conditions (bright / medium / dim)
- Target: 20 subjects × 15 repeats × 3 conditions = 900 clips/camera
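The beep-based alignment in `scripts/01_sync_check.py` can be approximated as follows. This is a hedged sketch of the idea (spectral energy near 1 kHz marks the beep onset), not the script's actual implementation:

```python
# Hedged sketch of beep-onset detection; 01_sync_check.py may detect and
# trim differently. Requires ffmpeg on PATH plus numpy and scipy.
import os
import subprocess
import tempfile
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def beep_onset_seconds(video_path, target_hz=1000.0, band_hz=50.0, ratio=10.0):
    """Return the time (s) where 1 kHz band energy first spikes, else None."""
    fd, wav_path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        # Extract mono 16 kHz audio from the clip
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
                        "-ar", "16000", wav_path],
                       check=True, capture_output=True)
        rate, audio = wavfile.read(wav_path)
    finally:
        os.remove(wav_path)
    f, t, Z = stft(audio.astype(np.float32), fs=rate, nperseg=1024)
    band = np.abs(f - target_hz) <= band_hz       # frequency bins near 1 kHz
    energy = np.abs(Z[band]).mean(axis=0)         # band energy over time
    hits = np.where(energy > ratio * (np.median(energy) + 1e-8))[0]
    return float(t[hits[0]]) if len(hits) else None

# Per-camera offsets relative to cam1 could then be, e.g.:
# offset = beep_onset_seconds(cam2_clip) - beep_onset_seconds(cam1_clip)
```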
- GPU: NVIDIA RTX 3070 (8 GB VRAM)
- CPU: AMD Ryzen 7 5800H
- RAM: 32 GB DDR4
Fall Detection Video Dataset — same team, same GPU, R(2+1)D backbone, 98.71% weighted F1: https://github.com/Mortezamohasebati/Fall-Detection-Video-Dataset
- Lucy (lead researcher & implementation)