
SensiFoot — Industrial Foot Gesture Recognition

A deep learning pipeline for recognizing 8 foot gestures from video, designed for hands-free control in industrial production environments. Two model families are provided: a fast per-frame model (EfficientNetV2-S) and a spatiotemporal model (R(2+1)D-18) for gestures that require motion context.

Status: implementation complete — ready for data collection and training.


Gesture Classes

Label  Code  Gesture Name    Description
0      G1    Heel Tap        Tap heel down vertically
1      G2    Forward Kick    Kick foot forward from the knee
2      G3    Foot Lift       Lift foot straight up, hold briefly
3      G4    Lateral Slide   Slide foot sideways (left or right)
4      G5    Forward Step    Step forward and return
5      G6    Cross Front     Cross one foot in front of the other
6      G7    Foot Hold       Hold foot still above ground for ≥1.5 s
7      G8    Flamingo Bend   Raise heel, bend knee (one-legged stance)

G7 (Foot Hold) is a dwell gesture — it cannot be classified from a single frame. Use the video model (train_video.py) if G7 accuracy is critical.


Architecture

Frame model — EfficientNetV2-S

  • ImageNet pretrained backbone, custom 8-class head
  • Input: (B, 3, 224, 224) — single frame
  • Inference: average softmax over all 16 extracted frames per clip
  • ~21M parameters; fast enough for real-time on CPU
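
A minimal sketch of how this model could be assembled with torchvision (the real definition lives in models/frame_model.py; the function name and dropout rate below are illustrative assumptions):

import torch.nn as nn
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights

def build_frame_model(num_classes: int = 8) -> nn.Module:
    # ImageNet-pretrained EfficientNetV2-S backbone
    model = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)
    # Swap the 1000-class ImageNet head for an 8-class head
    in_features = model.classifier[1].in_features   # 1280 for V2-S
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.3),                           # dropout rate is an assumption
        nn.Linear(in_features, num_classes),
    )
    return model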

Video model — R(2+1)D-18

  • Kinetics-400 pretrained backbone, custom 8-class head
  • Input: (B, 3, 16, 112, 112) — 16-frame clip
  • Handles temporal dynamics (direction, speed, dwell)
  • ~33M parameters; GPU recommended
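
A similar sketch for the video model, assuming the standard torchvision R(2+1)D-18 with its 400-class Kinetics head swapped for an 8-class linear head (see models/video_model.py for the actual implementation, including the freeze/unfreeze helpers):

import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

def build_video_model(num_classes: int = 8) -> nn.Module:
    # Kinetics-400 pretrained R(2+1)D-18 backbone
    model = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
    # Backbone ends in a 512-dim feature; replace the 400-class head
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model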

Project Structure

SensiFoot/
├── config/
│   └── config.py                # All hyperparameters, paths, class names
│
├── data/
│   ├── raw/
│   │   ├── cam1/                # Front-facing camera
│   │   ├── cam2/                # 45° left (phone)
│   │   └── cam3/                # 45° right or overhead
│   ├── processed/
│   │   └── frames/              # Pre-extracted JPG frames, one dir per clip
│   └── splits/
│       ├── dataset_master.csv
│       ├── train.csv
│       ├── val.csv
│       └── test.csv
│
├── scripts/
│   ├── 01_sync_check.py         # Detect beep timestamps, report + trim offsets
│   ├── 02_extract_frames.py     # Extract FRAMES_PER_CLIP frames from each video
│   ├── 03_build_dataset_csv.py  # Scan frames/ → build master CSV + stats
│   ├── 04_split_dataset.py      # Subject-based train/val/test split
│   └── 05_visualize_samples.py  # Grid of sample frames per class
│
├── datasets/
│   ├── foot_frame_dataset.py    # Frame-level dataset (all/middle/random selection)
│   └── foot_video_dataset.py    # Clip-level dataset with temporal jitter aug
│
├── models/
│   ├── frame_model.py           # EfficientNetV2-S/M + ConvNeXt variants
│   └── video_model.py           # R(2+1)D-18 with freeze/unfreeze helpers
│
├── train/
│   ├── train_frame.py           # Full loop: AMP, cosine LR, early stopping
│   └── train_video.py           # Same pipeline for R(2+1)D
│
├── evaluate/
│   └── evaluate.py              # Test-set eval: overall + per-class + per-camera
│
├── inference/
│   └── predict.py               # Inference on a new .mp4 — Top-3 output
│
├── utils/
│   ├── augmentation.py          # MotionBlurTransform, VideoConsistentTransform
│   ├── metrics.py               # accuracy, F1, per-class report
│   └── visualization.py         # Training curves, confusion matrix, distribution plot
│
├── checkpoints/                 # Saved model weights (.pth)
├── outputs/
│   ├── plots/                   # PNG figures
│   └── predictions/             # predictions_frame.csv, predictions_video.csv
│
├── requirements.txt
├── DEVELOPMENT_LOG.md
└── .gitignore

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Record videos

Place .mp4 files in data/raw/cam1/, cam2/, cam3/. Use the naming format:

{gesture}_{personID}_{cameraID}_{repeatID}.mp4

Examples:

heeltap_p01_cam1_r01.mp4
forwardkick_p03_cam2_r05.mp4
flamingobend_p12_cam3_r03.mp4

Valid gesture names: heeltap forwardkick footlift lateralslide forwardstep crossfront foothold flamingobend
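
As a hedged illustration, a filename in this scheme can be parsed into metadata and a class label along these lines (the helper name and regex are hypothetical; the label order follows the Gesture Classes table above):

import re
from pathlib import Path

GESTURES = ["heeltap", "forwardkick", "footlift", "lateralslide",
            "forwardstep", "crossfront", "foothold", "flamingobend"]

# {gesture}_{personID}_{cameraID}_{repeatID}.mp4, e.g. heeltap_p01_cam1_r01.mp4
PATTERN = re.compile(r"^([a-z]+)_(p\d+)_(cam\d)_(r\d+)\.mp4$")

def parse_clip_name(path: str) -> dict:
    m = PATTERN.match(Path(path).name)
    if m is None:
        raise ValueError(f"Unexpected file name: {path}")
    gesture, person, camera, repeat = m.groups()
    return {
        "gesture": gesture,
        "label": GESTURES.index(gesture),   # 0..7, same order as the table above
        "person": person,
        "camera": camera,
        "repeat": repeat,
    }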

3. Run the data pipeline

# Step 1 — Check/fix timing offset between cameras (requires ffmpeg)
python scripts/01_sync_check.py
python scripts/01_sync_check.py --trim          # also trims offset videos

# Step 2 — Extract 16 frames from every video
python scripts/02_extract_frames.py

# Step 3 — Build master CSV
python scripts/03_build_dataset_csv.py

# Step 4 — Split by subject (train 70 / val 15 / test 15)
python scripts/04_split_dataset.py

# Step 5 — Sanity check visuals
python scripts/05_visualize_samples.py
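
Conceptually, Step 2 samples FRAMES_PER_CLIP (16) evenly spaced frames from each video. A minimal OpenCV sketch of that idea (output layout and function name are illustrative, not the exact code in 02_extract_frames.py):

import cv2
import numpy as np
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, frames_per_clip: int = 16) -> None:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick frames_per_clip indices spread evenly across the clip
    indices = np.linspace(0, total - 1, frames_per_clip).astype(int)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(Path(out_dir) / f"frame_{i:02d}.jpg"), frame)
    cap.release()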

4. Train

# Frame model (EfficientNetV2-S) — start here, ~2–4h on RTX 3070
python train/train_frame.py

# With backbone frozen except last 2 blocks (faster early epochs):
python train/train_frame.py --freeze 2

# Video model (R(2+1)D-18) — ~8–12h on RTX 3070
python train/train_video.py --freeze 2

# Resume from last checkpoint:
python train/train_frame.py --resume
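
train_frame.py combines mixed precision (AMP), a cosine LR schedule and early stopping; the essential pattern looks roughly like this (optimizer choice, learning rate and patience are placeholders, not the values in config/config.py):

import torch
from torch.cuda.amp import autocast, GradScaler
from torch.optim.lr_scheduler import CosineAnnealingLR

def fit(model, train_loader, val_loader, criterion, epochs=50, patience=7, device="cuda"):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    scaler = GradScaler()
    best_val, bad_epochs = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad(set_to_none=True)
            with autocast():                       # mixed-precision forward pass
                loss = criterion(model(images), labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()

        # Validation + early stopping on val loss
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                val_loss += criterion(model(images), labels).item() * len(labels)
                n += len(labels)
        val_loss /= max(n, 1)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "checkpoints/best.pth")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break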

5. Evaluate on test set

python evaluate/evaluate.py --model frame
python evaluate/evaluate.py --model video

Outputs: confusion matrix PNG, per-class table, per-camera breakdown, predictions CSV.
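
A minimal sketch of how such a per-class table and confusion matrix can be produced with scikit-learn, assuming predicted and true labels have already been collected from the test loader:

from sklearn.metrics import classification_report, confusion_matrix

CLASS_NAMES = ["Heel Tap", "Forward Kick", "Foot Lift", "Lateral Slide",
               "Forward Step", "Cross Front", "Foot Hold", "Flamingo Bend"]

def report(y_true, y_pred):
    print(classification_report(y_true, y_pred, target_names=CLASS_NAMES, digits=4))
    print(confusion_matrix(y_true, y_pred))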

6. Predict on a new clip

python inference/predict.py --video path/to/clip.mp4
python inference/predict.py --video path/to/clip.mp4 --model video

Example output:

  Input  : heeltap_p01_cam1_r01.mp4
  ────────────────────────────────────────────
  Prediction  :  G1 - Heel Tap           (94.3%)  ✓

  Top 3:
    1. G1 - Heel Tap               94.23%  ████████████████████████████
    2. G3 - Foot Lift               3.81%  █
    3. G8 - Flamingo Bend           1.02%
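
For the frame model, this Top-3 output comes from averaging softmax probabilities over the 16 extracted frames; a rough sketch, assuming the frames are already a preprocessed (16, 3, 224, 224) tensor:

import torch

@torch.no_grad()
def predict_clip(model, frames, class_names, device="cuda"):
    """frames: tensor of shape (16, 3, 224, 224), already normalized."""
    model.eval()
    logits = model(frames.to(device))                  # (16, 8)
    probs = torch.softmax(logits, dim=1).mean(dim=0)   # average over frames -> (8,)
    top_p, top_i = torch.topk(probs, k=3)
    return [(class_names[i], p) for p, i in zip(top_p.tolist(), top_i.tolist())]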

Key Design Decisions

Choice          Decision                             Reason
Split strategy  Subject-based                        Prevents data leakage across train/test
Frame model     EfficientNetV2-S                     Best accuracy/speed ratio for 8-class
Video model     R(2+1)D-18                           Kinetics-400 pretrained, proven for gesture
Frame count     16                                   Covers 2–3 s clips; matches R(2+1)D input
Temporal aug    ±25% jitter                          Regularizes clip start time at training
G7 sampling     Center window                        Dwell gesture peaks in the middle of clip
Loss function   WeightedCE + label smoothing 0.05    Handles class imbalance + overconfidence
LR schedule     Cosine annealing                     Smooth decay, good fine-tuning convergence
Flip strategy   Per-gesture (SYMMETRIC_GESTURES)     Lateral gestures must NOT be flipped
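
The loss-function row maps directly onto PyTorch's built-in cross entropy; a small sketch, assuming inverse-frequency class weights computed from the training split (the exact weighting scheme used in the project may differ):

import torch
import torch.nn as nn

def make_criterion(class_counts, device="cuda"):
    # Inverse-frequency class weights, normalized so the mean weight is 1
    counts = torch.tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)
    return nn.CrossEntropyLoss(weight=weights.to(device), label_smoothing=0.05)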

Recording Protocol (Recommended)

  • Mark a 50×50 cm square on the floor for foot placement
  • 3 cameras: front (cam1), 45° left (cam2), 45° right or overhead (cam3)
  • Play a 1000 Hz beep before each session (used by 01_sync_check.py)
  • Lighting: record 3 conditions — bright / medium / dim
  • Target: 20 subjects × 15 repeats × 3 conditions = 900 clips/camera
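
01_sync_check.py relies on that 1000 Hz beep to align the three cameras. One plausible way to locate it, assuming the audio track has first been extracted to a mono WAV (e.g. with ffmpeg), is a narrow band-pass around 1 kHz followed by an energy threshold; this is an illustrative sketch, not the script's actual detector:

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def find_beep(wav_path: str, freq: float = 1000.0, band: float = 50.0) -> float:
    """Return the time in seconds where the 1 kHz beep starts."""
    rate, audio = wavfile.read(wav_path)
    audio = audio.astype(np.float32)
    if audio.ndim > 1:                      # stereo -> mono
        audio = audio.mean(axis=1)
    # Band-pass around the beep frequency, then threshold the envelope
    sos = butter(4, [freq - band, freq + band], btype="bandpass", fs=rate, output="sos")
    envelope = np.abs(sosfiltfilt(sos, audio))
    onset = int(np.argmax(envelope > 0.5 * envelope.max()))
    return onset / rate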

Hardware

  • GPU: NVIDIA RTX 3070 (8 GB VRAM)
  • CPU: AMD Ryzen 7 5800H
  • RAM: 32 GB DDR4

Related Work

Fall Detection Video Dataset — same team, same GPU, R(2+1)D backbone, 98.71% weighted F1: https://github.com/Mortezamohasebati/Fall-Detection-Video-Dataset


Authors

  • Lucy (lead researcher & implementation)
