A deep learning pipeline for recognizing 8 foot gestures from video, designed for hands-free control in industrial production environments. Two model families are provided: a fast per-frame model (EfficientNetV2-S) and a spatiotemporal model (R(2+1)D-18) for gestures that require motion context.
Status: implementation complete — ready for data collection and training.
| Label | Code | Gesture Name | Description |
|---|---|---|---|
| 0 | G1 | Heel Tap | Tap heel down vertically |
| 1 | G2 | Forward Kick | Kick foot forward from the knee |
| 2 | G3 | Foot Lift | Lift foot straight up, hold briefly |
| 3 | G4 | Lateral Slide | Slide foot sideways (left or right) |
| 4 | G5 | Forward Step | Step forward and return |
| 5 | G6 | Cross Front | Cross one foot in front of the other |
| 6 | G7 | Foot Hold | Hold foot still above ground for ≥1.5 s |
| 7 | G8 | Flamingo Bend | Raise heel, bend knee (one-legged stance) |
G7 (Foot Hold) is a dwell gesture: it cannot be classified from a single frame. Use the video model (`train_video.py`) if G7 accuracy is critical.
- ImageNet pretrained backbone, custom 8-class head
- Input: `(B, 3, 224, 224)`, a single frame
- Inference: average softmax over all 16 extracted frames per clip
- ~21M parameters; fast enough for real-time on CPU
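As a sketch of how this could look (the actual `models/frame_model.py` may differ), the 8-class head swap and the per-clip softmax averaging are a few lines of torchvision code:

```python
# Minimal sketch of the per-frame approach; models/frame_model.py may differ.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights

NUM_CLASSES = 8

model = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)
# Replace the ImageNet classifier (1000 classes) with an 8-class head
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
model.eval()

@torch.no_grad()
def predict_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (16, 3, 224, 224), the pre-extracted frames of one clip.
    Returns the clip-level probability vector, averaged over frames."""
    probs = torch.softmax(model(frames), dim=1)  # (16, 8) per-frame softmax
    return probs.mean(dim=0)                     # (8,) clip-level prediction
```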
- Kinetics-400 pretrained backbone, custom 8-class head
- Input: `(B, 3, 16, 112, 112)`, a 16-frame clip
- Handles temporal dynamics (direction, speed, dwell)
- ~33M parameters; GPU recommended
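A comparable sketch for the clip model, again hedged: the real `models/video_model.py` contains freeze/unfreeze helpers, and the freeze loop below is only illustrative (the repo's `--freeze` flag may count blocks differently):

```python
# Minimal sketch of the clip model; models/video_model.py may differ.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

model = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, 8)  # 8 gesture classes

clip = torch.randn(2, 3, 16, 112, 112)  # (B, C, T, H, W): two 16-frame clips
logits = model(clip)                    # (2, 8)

# Illustrative partial freeze: train only the last two residual stages + head
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(("layer3", "layer4", "fc"))
```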
```
SensiFoot/
├── config/
│ └── config.py # All hyperparameters, paths, class names
│
├── data/
│ ├── raw/
│ │ ├── cam1/ # Front-facing camera
│ │ ├── cam2/ # 45° left (phone)
│ │ └── cam3/ # 45° right or overhead
│ ├── processed/
│ │ └── frames/ # Pre-extracted JPG frames, one dir per clip
│ └── splits/
│ ├── dataset_master.csv
│ ├── train.csv
│ ├── val.csv
│ └── test.csv
│
├── scripts/
│ ├── 01_sync_check.py # Detect beep timestamps, report + trim offsets
│ ├── 02_extract_frames.py # Extract FRAMES_PER_CLIP frames from each video
│ ├── 03_build_dataset_csv.py # Scan frames/ → build master CSV + stats
│ ├── 04_split_dataset.py # Subject-based train/val/test split
│ └── 05_visualize_samples.py # Grid of sample frames per class
│
├── datasets/
│ ├── foot_frame_dataset.py # Frame-level dataset (all/middle/random selection)
│ └── foot_video_dataset.py # Clip-level dataset with temporal jitter aug
│
├── models/
│ ├── frame_model.py # EfficientNetV2-S/M + ConvNeXt variants
│ └── video_model.py # R(2+1)D-18 with freeze/unfreeze helpers
│
├── train/
│ ├── train_frame.py # Full loop: AMP, cosine LR, early stopping
│ └── train_video.py # Same pipeline for R(2+1)D
│
├── evaluate/
│ └── evaluate.py # Test-set eval: overall + per-class + per-camera
│
├── inference/
│ └── predict.py # Inference on a new .mp4 — Top-3 output
│
├── utils/
│ ├── augmentation.py # MotionBlurTransform, VideoConsistentTransform
│ ├── metrics.py # accuracy, F1, per-class report
│ └── visualization.py # Training curves, confusion matrix, distribution plot
│
├── checkpoints/ # Saved model weights (.pth)
├── outputs/
│ ├── plots/ # PNG figures
│ └── predictions/ # predictions_frame.csv, predictions_video.csv
│
├── requirements.txt
├── DEVELOPMENT_LOG.md
└── .gitignore
```
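All hyperparameters, paths, and class names live in `config/config.py`. Below is a hedged, illustrative excerpt built only from constants this README references (`FRAMES_PER_CLIP`, the gesture names, `SYMMETRIC_GESTURES`); the actual file will differ:

```python
# config/config.py, illustrative excerpt only; actual contents may differ
FRAMES_PER_CLIP = 16          # consumed by scripts/02_extract_frames.py

CLASS_NAMES = [
    "heeltap", "forwardkick", "footlift", "lateralslide",
    "forwardstep", "crossfront", "foothold", "flamingobend",
]

# Gestures considered safe to mirror horizontally; membership here is a
# guess, the real set excludes at least the lateral gestures
SYMMETRIC_GESTURES = {"heeltap", "forwardkick", "footlift",
                      "forwardstep", "foothold", "flamingobend"}
```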
```bash
pip install -r requirements.txt
```

Place `.mp4` files in `data/raw/cam1/`, `cam2/`, `cam3/`, using the naming format:
```
{gesture}_{personID}_{cameraID}_{repeatID}.mp4
```

Examples:

```
heeltap_p01_cam1_r01.mp4
forwardkick_p03_cam2_r05.mp4
flamingobend_p12_cam3_r03.mp4
```
Valid gesture names: `heeltap`, `forwardkick`, `footlift`, `lateralslide`, `forwardstep`, `crossfront`, `foothold`, `flamingobend`.
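A hypothetical parser for this convention (the names `PATTERN` and `parse_clip_name` are illustrative, not part of the repo):

```python
# Hypothetical parser for the clip-naming convention above.
import re
from pathlib import Path

PATTERN = re.compile(
    r"^(?P<gesture>[a-z]+)_(?P<person>p\d+)_(?P<camera>cam\d)_(?P<repeat>r\d+)\.mp4$"
)

def parse_clip_name(path: str) -> dict:
    """Split a clip filename into its gesture/person/camera/repeat fields."""
    m = PATTERN.match(Path(path).name)
    if m is None:
        raise ValueError(f"Bad clip name: {path}")
    return m.groupdict()

print(parse_clip_name("heeltap_p01_cam1_r01.mp4"))
# {'gesture': 'heeltap', 'person': 'p01', 'camera': 'cam1', 'repeat': 'r01'}
```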
```bash
# Step 1 — Check/fix timing offset between cameras (requires ffmpeg)
python scripts/01_sync_check.py
python scripts/01_sync_check.py --trim   # also trims offset videos

# Step 2 — Extract 16 frames from every video
python scripts/02_extract_frames.py

# Step 3 — Build master CSV
python scripts/03_build_dataset_csv.py

# Step 4 — Split by subject (train 70 / val 15 / test 15)
python scripts/04_split_dataset.py

# Step 5 — Sanity check visuals
python scripts/05_visualize_samples.py
```

```bash
# Frame model (EfficientNetV2-S) — start here, ~2–4h on RTX 3070
python train/train_frame.py
# With backbone frozen except last 2 blocks (faster early epochs):
python train/train_frame.py --freeze 2
# Video model (R(2+1)D-18) — ~8–12h on RTX 3070
python train/train_video.py --freeze 2
# Resume from last checkpoint:
python train/train_frame.py --resume
```

```bash
python evaluate/evaluate.py --model frame
python evaluate/evaluate.py --model video
```

Outputs: confusion matrix PNG, per-class table, per-camera breakdown, predictions CSV.
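For ad-hoc analysis, a per-camera accuracy can also be computed straight from the predictions CSV. The column names below (`camera`, `label`, `pred`) are assumptions about the CSV schema, not confirmed:

```python
# Hedged sketch: per-camera accuracy from a predictions CSV.
# Column names (camera, label, pred) are assumed, not the script's schema.
import pandas as pd

df = pd.read_csv("outputs/predictions/predictions_frame.csv")
per_camera = (df["label"] == df["pred"]).groupby(df["camera"]).mean()
print(per_camera)  # fraction correct for cam1 / cam2 / cam3
```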
```bash
python inference/predict.py --video path/to/clip.mp4
python inference/predict.py --video path/to/clip.mp4 --model video
```

Example output:
```
Input      : heeltap_p01_cam1_r01.mp4
────────────────────────────────────────────
Prediction : G1 - Heel Tap (94.3%) ✓
Top 3:
  1. G1 - Heel Tap        94.23%  ████████████████████████████
  2. G3 - Foot Lift        3.81%  █
  3. G8 - Flamingo Bend    1.02%
```
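A minimal sketch of how a Top-3 readout like the one above can be produced from the model's logits (`inference/predict.py` may format it differently):

```python
# Sketch of a Top-3 readout from logits; inference/predict.py may differ.
import torch

GESTURES = ["G1 - Heel Tap", "G2 - Forward Kick", "G3 - Foot Lift",
            "G4 - Lateral Slide", "G5 - Forward Step", "G6 - Cross Front",
            "G7 - Foot Hold", "G8 - Flamingo Bend"]

def print_top3(logits: torch.Tensor) -> None:
    """logits: (8,) raw scores for one clip."""
    probs = torch.softmax(logits, dim=-1)
    values, indices = probs.topk(3)
    for rank, (p, i) in enumerate(zip(values.tolist(), indices.tolist()), 1):
        print(f"{rank}. {GESTURES[i]:<22} {p:6.2%} {'█' * int(p * 30)}")
```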
| Choice | Decision | Reason |
|---|---|---|
| Split strategy | Subject-based | Prevents data leakage across train/test |
| Frame model | EfficientNetV2-S | Best accuracy/speed trade-off for this 8-class task |
| Video model | R(2+1)D-18 | Kinetics-400 pretrained; proven for gesture recognition |
| Frame count | 16 | Covers 2–3 s clips; matches R(2+1)D input |
| Temporal aug | ±25% jitter | Regularizes clip start time at training |
| G7 sampling | Center window | Dwell gesture peaks in the middle of clip |
| Loss function | WeightedCE + label smoothing 0.05 | Handles class imbalance + overconfidence |
| LR schedule | Cosine annealing | Smooth decay, good fine-tuning convergence |
| Flip strategy | Per-gesture (`SYMMETRIC_GESTURES`) | Lateral gestures must NOT be flipped (see sketch below) |
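Two of these decisions sketched in code: the loss line uses PyTorch's built-in class weights and label smoothing, and the flip helper consults `SYMMETRIC_GESTURES` so lateral gestures are never mirrored. Everything except the `SYMMETRIC_GESTURES` name is illustrative, including which gestures are listed as flip-safe:

```python
# Hedged sketch of two design decisions; the training code may differ.
import random
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

# WeightedCE + label smoothing 0.05; placeholder weights, the real ones
# would be computed from train-split class counts
class_weights = torch.tensor([1.0] * 8)
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.05)

# Illustrative membership; only flip gestures whose meaning survives mirroring
SYMMETRIC_GESTURES = {"heeltap", "forwardkick", "footlift",
                      "forwardstep", "foothold", "flamingobend"}

def maybe_hflip(frame: torch.Tensor, gesture: str, p: float = 0.5) -> torch.Tensor:
    """Horizontally flip a (C, H, W) frame only for symmetric gestures."""
    if gesture in SYMMETRIC_GESTURES and random.random() < p:
        return TF.hflip(frame)
    return frame
```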
- Mark a 50×50 cm square on the floor for foot placement
- 3 cameras: front (cam1), 45° left (cam2), 45° right or overhead (cam3)
- Play a 1000 Hz beep before each session (used by `01_sync_check.py`, sketched after this list)
- Lighting: record under 3 conditions (bright / medium / dim)
- Target: 20 subjects × 15 repeats × 3 conditions = 900 clips/camera
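The beep-based alignment in `scripts/01_sync_check.py` can be approximated as follows. This is a hedged sketch of the idea (spectral energy near 1 kHz marks the beep onset), not the script's actual implementation:

```python
# Hedged sketch of beep-onset detection; 01_sync_check.py may detect and
# trim differently. Requires ffmpeg on PATH plus numpy and scipy.
import os
import subprocess
import tempfile
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def beep_onset_seconds(video_path, target_hz=1000.0, band_hz=50.0, ratio=10.0):
    """Return the time (s) where 1 kHz band energy first spikes, else None."""
    fd, wav_path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        # Extract mono 16 kHz audio from the clip
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
                        "-ar", "16000", wav_path],
                       check=True, capture_output=True)
        rate, audio = wavfile.read(wav_path)
    finally:
        os.remove(wav_path)
    f, t, Z = stft(audio.astype(np.float32), fs=rate, nperseg=1024)
    band = np.abs(f - target_hz) <= band_hz       # frequency bins near 1 kHz
    energy = np.abs(Z[band]).mean(axis=0)         # band energy over time
    hits = np.where(energy > ratio * (np.median(energy) + 1e-8))[0]
    return float(t[hits[0]]) if len(hits) else None

# Per-camera offsets relative to cam1 could then be, e.g.:
# offset = beep_onset_seconds(cam2_clip) - beep_onset_seconds(cam1_clip)
```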
- GPU: NVIDIA RTX 3070 (8 GB VRAM)
- CPU: AMD Ryzen 7 5800H
- RAM: 32 GB DDR4
Fall Detection Video Dataset — same team, same GPU, R(2+1)D backbone, 98.71% weighted F1: https://github.com/Mortezamohasebati/Fall-Detection-Video-Dataset
- Lucy (lead researcher & implementation)