
End-to-End Audiovisual Speech Recognition


This is the repository for End-to-End Audiovisual Speech Recognition. Our paper can be found here.

The video-only stream is based on T. Stafylakis and G. Tzimiropoulos's implementation. The paper can be found here.

This implementation uses a 2-layer BGRU backend with 1024 cells in each layer, whereas Themos's implementation uses a 2-layer BLSTM with 512 cells.
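To give a rough sense of how the two backends differ in size, the sketch below counts recurrent parameters using the standard PyTorch layout (per layer and direction: input weights, hidden weights, and two bias vectors per gate block). It reads "cells" as the hidden size per direction, and `FEATURE_SIZE = 256` is an assumed input width that is not stated here, so the numbers are illustrative only.

```python
def recurrent_params(input_size, hidden_size, num_layers, gate_blocks, bidirectional=True):
    """Count weights and biases of a stacked recurrent network (PyTorch layout)."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Each gate block has input weights, hidden weights, and two bias vectors.
        per_direction = gate_blocks * hidden_size * (layer_input + hidden_size + 2)
        total += directions * per_direction
        # Later layers read the concatenated outputs of all directions.
        layer_input = directions * hidden_size
    return total


FEATURE_SIZE = 256  # assumption: width of the features entering the backend

# GRU: reset, update, and candidate blocks (3). LSTM: input, forget, cell, output (4).
bgru = recurrent_params(FEATURE_SIZE, 1024, num_layers=2, gate_blocks=3)
blstm = recurrent_params(FEATURE_SIZE, 512, num_layers=2, gate_blocks=4)

print("2-layer BGRU, 1024 cells :", bgru)
print("2-layer BLSTM, 512 cells :", blstm)
```

Under these assumptions the BGRU backend is noticeably larger than the BLSTM one; the exact ratio depends on the real input width.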


  • python 2.7
  • pytorch 0.3.1
  • opencv-python


The results reported below were obtained with the proposed model on the LRW dataset. The suggested coordinates for cropping the mouth ROI are (x1, y1, x2, y2) = (80, 116, 175, 211).
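A minimal sketch of applying these coordinates with NumPy slicing is shown here. The function name `crop_mouth`, the `inclusive` flag, and the dummy clip shape are illustrative assumptions; the text does not say whether (x2, y2) is inclusive, so the flag makes the choice explicit (inclusive gives a 96x96 crop, exclusive gives 95x95).

```python
import numpy as np

# Suggested mouth ROI from the text: (x1, y1, x2, y2).
MOUTH_ROI = (80, 116, 175, 211)


def crop_mouth(frames, roi=MOUTH_ROI, inclusive=True):
    """Crop the mouth region from a grayscale frame (H, W) or clip (T, H, W).

    Assumption: the last two axes are height and width.
    """
    x1, y1, x2, y2 = roi
    end = 1 if inclusive else 0  # assumption: whether (x2, y2) belongs to the ROI
    return frames[..., y1:y2 + end, x1:x2 + end]


# Dummy clip standing in for real data (shape is an assumption for the demo).
clip = np.zeros((29, 256, 256), dtype=np.uint8)
mouth = crop_mouth(clip)
print(mouth.shape)
```

Rows are indexed by y and columns by x, which is why the slice order is `y1:y2` followed by `x1:x2`.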


This is the suggested order for training the video-only, audio-only, and audiovisual models:

i) Start by training with the temporal convolutional backend. You can run the script:

CUDA_VISIBLE_DEVICES='' python --path '' --dataset <dataset_path> \
                                       --mode 'temporalConv' --every-frame False \
                                       --batch_size 36 --lr 3e-4 \
                                       --epochs 30 --test False 

ii) Throw away the temporal convolutional backend, freeze the parameters of the frontend and the ResNet, and train the GRU backend. You can run the script:

CUDA_VISIBLE_DEVICES='' python --path './temporalConv/' --dataset <dataset_path> \
                                       --mode 'backendGRU' --every-frame True \
                                       --batch_size 36 --lr 3e-4 \
                                       --epochs 5 --test False 

iii) Train the whole network end-to-end. You can run the script:

CUDA_VISIBLE_DEVICES='' python --path './backendGRU/' --dataset <dataset_path> \
                                       --mode 'finetuneGRU' --every-frame True \
                                       --batch_size 36 --lr 3e-4 \
                                       --epochs 30 --test False 
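The three steps form a chain: each stage loads its starting weights from the directory named after the previous stage's mode. The sketch below only assembles the argument lists for the three runs so the pattern is easy to see. `SCRIPT` is a placeholder because the script name is not given in the commands above, `build_command` and `training_plan` are hypothetical helper names, and the sketch targets Python 3 rather than the Python 2.7 listed in the requirements.

```python
import shlex

SCRIPT = "<training_script>"  # placeholder: the script name is not given above

# (mode, --path, --every-frame, --epochs), copied from steps i) to iii).
STAGES = [
    ("temporalConv", "", False, 30),
    ("backendGRU", "./temporalConv/", True, 5),
    ("finetuneGRU", "./backendGRU/", True, 30),
]


def build_command(mode, path, every_frame, epochs, dataset, batch_size=36, lr="3e-4"):
    """Return the argument list for one training stage."""
    return [
        "python", SCRIPT,
        "--path", path,
        "--dataset", dataset,
        "--mode", mode,
        "--every-frame", str(every_frame),
        "--batch_size", str(batch_size),
        "--lr", lr,
        "--epochs", str(epochs),
        "--test", "False",
    ]


def training_plan(dataset):
    """Return the three stage commands in the suggested order."""
    return [build_command(*stage, dataset=dataset) for stage in STAGES]


# Print the commands instead of running them; the dataset path is a placeholder.
for cmd in training_plan("<dataset_path>"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing rather than executing keeps the sketch safe to run without the dataset; the lists could be passed to `subprocess.run` once the real script name and dataset path are filled in.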


--every-frame should be set to True when the backend is a recurrent neural network.

--dataset must be specified correctly before running. … or … are the models with the best validation performance in step ii) or step iii).

.mat and .npy are the default formats for the audio waveform and mouth ROI datasets.


Stream        Accuracy (%)
video-only    83.39
audio-only    97.72
audiovisual   98.38

The results are slightly better than those reported in the ICASSP paper due to further fine-tuning of the models. To request the pre-trained models, please send an email to pingchuan.ma16 <AT> with your name and affiliation.


If the code of this repository was useful for your research, please cite our work:

  title={End-to-end audiovisual speech recognition},
  author={Petridis, Stavros and Stafylakis, Themos and Ma, Pingchuan and Cai, Feipeng and Tzimiropoulos, Georgios and Pantic, Maja},