Skip to content
No description or website provided.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.
codes Add NSLTools audspec example Oct 8, 2018
demo Update readme, add demo Nov 6, 2017


This is the keras implementation of Lip2AudSpec: Speech reconstruction from silent lip movements video.

Main Network


In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use auditory spectrogram as spectral representation of speech and its corresponding sound generation method resulting in a more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram which is then used as target to our main lip reading network comprising of CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with a 98% correlation and also improves the quality of reconstructed speech from the main lip reading network. Our model, trained jointly on different speakers is able to extract individual speaker characteristics and gives promising results of reconstructing intelligible speech with superior word recognition accuracy.

Full paper for this work can be found here.


We implemented the code in python2 using tensorflow, keras, scipy, numpy, cv2, sklearn, IPython, fnmatch. The mentioned libraries should be installed before running the codes. All the libraries can be easily installed using pip:

pip install tensorflow-gpu keras scipy opencv-python sklearn

The backend for Keras can be changed easily if needed.

Data preparation

This study is based on GRID corpus( To run the codes, you need to first download and preprocess both videos and audios.

By running data will be downloaded and frames will be cropped by a manual mask. In order to generate auditory spectrograms, the audios should be processed by NSLTools( using wav2aud function in Matlab.

Since some of the frames in the dataset are corrupted, we generate a path for valid data by Last step before training the network is windowing and integration of all data in .mat formats. This can be done by running

Training the models

Once data preparation steps are done, autoencoder model could be trained on the auditory spectrograms corresponding to valid videos using Training the main network could be performed using


You can find all demo files here.

A few samples of the network output are given below:

Speaker 1


Speaker 29



If you found this work/code helpful, please cite:

  title={Lip2AudSpec: Speech reconstruction from silent lip movements video},
  author={Akbari, Hassan and Arora, Himani and Cao, Liangliang and Mesgarani, Nima},
  journal={arXiv preprint arXiv:1710.09798},
You can’t perform that action at this time.