Skip to content
A deep neural network implementation for detection of food and drink intake gestures from video.
Python R
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
data
model
README.md

README.md

deep-intake-detection: Automatic detection of intake gestures from video

Implementation of various deep neural networks for detection of food and drink intake gestures from video. We use recordings from a 360-degree camera to predict frame-level labels.

Prepare data for TensorFlow

We use .tfrecords files to feed data into the TensorFlow model. This includes TV-L1 optical flow, which is stored in the file alongside with raw frames and labels. A dependency due to optical flow calculation is opencv. To generate these files from data in folder train:

$ python video_to_tfrecords.py --directory=train

Since we pre-compute optical flow, it has to be decided at this stage whether the framerate is to be downsampled. Use the argument keep_every to change this setting (defaults to 3, which downsamples from 24 fps to 8 fps).

Train and evaluate deep neural network classifier

Train and evaluate the TensorFlow classifier using the .tfrecords files in train/eval folders.

$ python oreba_main.py

The best checkpoints based on the evaluation set are automatically saved using best_checkpoint_copier.

Specify model architecture and other settings using the flags introduced below, see the code for more details.

What are the flags?

Argument Description
--base_learning_rate Choose the default learning rate for Adam optimizer
--batch_size Batch size used for training.
--data_format Set the data format used in the model (defaults to channels_last).
--dtype TensorFlow datatype used for calculations {fp16, fp32} (defaults to fp32).
--eval_dir Directory for eval data (defaults to 'eval')
--finetune_only If using warmstart, specify what types of layer should be finetuned (defaults to all)
--label_category Label category for classification task (defaults to 1)
--mode Which mode to start in {train_and_evaluate, predict_and_export_csv, predict_and_export_tfrecords} (defaults to train_and_evaluate)
--model Select the model: {oreba_2d_cnn, oreba_cnn_lstm, oreba_two_stream, oreba_slowfast, resnet_2d_cnn, resnet_3d_cnn, resnet_cnn_lstm, resnet_two_stream, resnet_slowfast}
--model_dir Output directory for model and training stats (defaults to 'run')
--num_frames Specify how many frames are in train data
--num_parallel_calls Number of parallel calls in input pipeline (defaults to None)
--num_sequences Specify how many sequences are in train data
--save_checkpoints_steps Save checkpoints every x steps (defaults to 100)
--save_summary_steps Save summaries every x steps (defaults to 10)
--sequence_shift Shift taken in sequence generation (defaults to 1)
--shuffle_buffer Buffer size for shuffling (defaults to 25000)
--slowfast_alpha Alpha parameter in SlowFast (defaults to 4)
--slowfast_beta Beta parameter in SlowFast (defaults to 0.25)
--train_dir Directory for training data (defaults to 'train')
--train_epochs Number of training epochs (defaults to 40)
--use_distribution Use tf.distribute.MirroredStrategy (defaults to False)
--use_flows Use optical flow as input (defaults to False)
--use_frames Use frames as input (defaults to True)
--use_sequence_input Use sequences as input instead of single examples (defaults to False)
--use_sequence_loss Use sequence-to-sequence loss instead of sequence-to-one loss (defaults to False)
--warmstart_dir Directory with checkpoints for warmstart (defaults to None)

Which models can be warmstarted and how?

  • resnet_2d_cnn can be warmstarted with resnet (ResNet-50 v2 checkpoints available here)
  • oreba_cnn_lstm can be warmstarted with oreba_2d_cnn
  • resnet_cnn_lstm can be warmstarted with resnet_2d_cnn
  • oreba_two_stream can be warmstarted with oreba_2d_cnn
  • resnet_two_stream can be warmstarted with resnet_2d_cnn

Checkpoints for these models have to be present in the directory specificied in --warmstart_dir.

Results on OREBA-LBY dataset

The following models have been trained on the training set of 62 participants to distinguish between idle and intake frames. Unweighted average recall (UAR) is the average classification accuracy for idle and intake frames on the test set of 20 participants. F1 is based on actual detection of individual intake gestures on the test set based on the evaluation scheme proposed by Kyritsis et al..

Model Features UAR F1
Small 2D CNN frames 82.63% 0.674
Small 2D CNN flows 71.76% 0.478
ResNet-50 2D CNN frames 86.39% 0.795
ResNet-50 2D CNN flows 71.34% 0.461
Small 3D CNN frames 87.54% 0.798
Small CNN-LSTM frames 83.36% 0.755
Small Two-Stream frames and flows 81.96% 0.700
Small SlowFast frames 88.71% 0.803
ResNet-50 3D CNN frames 88.77% 0.840
ResNet-50 CNN-LSTM frames 89.74% 0.856
ResNet-50 Two-Stream frames and flows 85.25% 0.836
ResNet-50 SlowFast frames 89.01% 0.858
You can’t perform that action at this time.