# LipReading with Deep Neural Networks

The goal is to implement LipReading: similar to how end-to-end speech recognition systems map high-fidelity speech audio to sensible character- and word-level outputs, we will do the same for "speech visuals". In particular, we will take video frames as input, extract the relevant mouth/chin signals, and map them to characters and words.
- Architecture: High-level pipeline
- Setup: Quick setup and installation instructions
- SpaCy Setup: Setup for NLP utilities
- Data Directories Structure: How data files are organized
- Collecting Data: See README_DATA_COLLECTION.md
- Getting Started: Finally get started on running things
- Train Model
- Examples: Example initial configurations to experiment with
- Tensorboard Visualization
- Other Resources: Collection of reading material and projects
A high-level overview of some TODO items. For more project details, please see the GitHub project.
- Download Data (926 videos)
- Build Vision Pipeline (1 week) (in review)
- Build NLP Pipeline (1 week) (wip)
- Build Loss Fn and Training Pipeline (2 weeks) (wip)
- Train 🚋 and Ship 🚢 (wip)
## Architecture

There are two primary interconnected pipelines: a "vision" pipeline for extracting face and lip features from video frames, and an "NLP-inspired" pipeline for temporally correlating the sequential lip features into the final output.

Here's a quick dive into the tensor dimensionalities.

Vision pipeline:

```
Video -> Frames        -> Face Bounding Box Detection       -> Face Landmarking Repr.
         (n, y, x, c)  -> (n, (box=1, y_i, x_i, w_i, h_i))  -> (n, (idx=68, y, x))
```

NLP pipeline:

```
-> Letters   -> Words     -> Language Model
   (chars,)  -> (words,)  -> (sentences,)
```
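Below is a rough, shape-only sketch of the vision pipeline's representations in numpy. The array names here are illustrative placeholders, not this repository's actual API, and the 68 landmark points assume a dlib-style landmark predictor:

```python
import numpy as np

# n video frames of height y, width x, with c color channels: (n, y, x, c)
n, y, x, c = 75, 256, 256, 3
frames = np.zeros((n, y, x, c), dtype=np.uint8)

# Face bounding-box detection: one box per frame, each (y_i, x_i, w_i, h_i)
# -> (n, box=1, 4)
boxes = np.zeros((n, 1, 4), dtype=np.int32)

# Face landmarking: 68 (y, x) landmark coordinates per frame -> (n, idx=68, 2)
landmarks = np.zeros((n, 68, 2), dtype=np.float32)

print(frames.shape, boxes.shape, landmarks.shape)
# (75, 256, 256, 3) (75, 1, 4) (75, 68, 2)
```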
The dataset splits, by video count:

- all: 926 videos (projected, not generated yet)
- large: 464 videos (failed at 35/464)
- medium: 104 videos (currently at 37/104)
- small: 23 videos
- micro: 6 videos
- nano: 1 video
## Setup

- Clone this repository and install the requirements. We will be using Python 3. Please make sure that when you run the Python scripts, your PYTHONPATH includes the repository root (./) and the workspace env variable is set (see the next step).

```bash
git clone git@github.com:joseph-zhong/LipReading.git
# (optional) set up a venv
cd LipReading
python3 -m venv .
```
- Once the repository is cloned, the last setup step is to set the repository's PYTHONPATH and workspace environment variable, to take advantage of the standardized directory utilities in ./src/utils/utility.py. Copy the following into your shell profile (e.g. ~/.bashrc):

```bash
export PYTHONPATH="$PYTHONPATH:/path/to/LipReading/"
export LIP_READING_WS_PATH="/path/to/LipReading/"
```
- Install the dependencies, including SpaCy and others.

  On macOS, for CPU capabilities only:

  ```bash
  pip3 install -r requirements.macos.txt
  ```

  On Ubuntu, for GPU support:

  ```bash
  pip3 install -r requirements.ubuntu.txt
  ```

## SpaCy Setup

We need to install a pre-built English model for some of SpaCy's capabilities:

```bash
python3 -m spacy download en
```
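To sanity-check that the model downloaded correctly, a quick test from a Python shell:

```python
import spacy

# Load the pre-built English model installed above.
nlp = spacy.load("en")
doc = nlp("Read my lips.")
print([token.text for token in doc])  # ['Read', 'my', 'lips', '.']
```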
## Data Directories Structure

This allows us to have a simple standardized directory structure for all our datasets, raw data, model weights, logs, etc.

```
./data/
  /datasets  (numpy dataset files for dataloaders to load)
  /raw       (raw caption/video files extracted from online sources)
  /weights   (model weights, both for training/checkpointing/running)
  /tb        (Tensorboard logging)
  /...
```

See ./src/utils/utility.py for more.
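For illustration, a directory helper built on the workspace variable might look like the sketch below. This is hypothetical: get_data_dir is not necessarily the actual function in ./src/utils/utility.py.

```python
import os

def get_data_dir(*subdirs):
    """Hypothetical helper: resolve (and create) a standardized data
    subdirectory under the LIP_READING_WS_PATH workspace root."""
    ws = os.environ["LIP_READING_WS_PATH"]
    path = os.path.join(ws, "data", *subdirs)
    os.makedirs(path, exist_ok=True)
    return path

weights_dir = get_data_dir("weights")  # e.g. /path/to/LipReading/data/weights
```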
## Getting Started

Now that the dependencies are all set up, we can finally do stuff!

Each of our "standard" scripts in ./src/scripts (i.e. not ./src/scripts/misc) takes the standard arguments. For each of the "standard" scripts, you can pass --help to see the expected arguments.
To maintain reproducibility, command-line arguments can be written in a raw text file with one argument per line. For example, ./config/gen_dataview/nano represents the arguments to pass to ./src/scripts/generate_dataview.py, passable automatically via shell expansion:

```bash
./src/scripts/generate_dataview.py $(cat ./config/gen_dataview/nano)
```
The arguments are applied in left-to-right order, so if an argument is repeated, the later setting overrides the earlier one. This allows for modularity in configuring hyperparameters.
(For demonstration purposes; not a working example:)

```bash
./src/scripts/train.py \
  $(cat ./config/dataset/large) \
  $(cat ./config/train/model/small-model) \
  $(cat ./config/train/model/rnn/lstm) \
  ...
```
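This override behavior matches, for example, Python's standard argparse semantics, where a repeated flag keeps its last value. A minimal sketch (the --dataset flag is invented for illustration):

```python
import argparse

# Repeated flags keep the last value, so arguments expanded from a later
# config file override those from an earlier one.
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="micro")
args = parser.parse_args(["--dataset=large", "--dataset=nano"])
print(args.dataset)  # nano: the later setting wins
```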
## Train Model

Training on the micro dataset:

```bash
./src/scripts/train_model.py $(cat ./config/train/micro)
```
## Other Resources

This is a collection of external links, papers, projects, and other potentially helpful starting points for the project.
- Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks (Jul. 2017, West Virginia University)
  - GitHub: https://github.com/joseph-zhong/lip-reading-deeplearning#demo
  - Demo: https://codeocean.com/2017/07/14/3d-convolutional-neural-networks-for-audio-visual-recognition/code
- Lip reading using CNN and LSTM (2017, Stanford)
- LipNet (Dec. 2016, DeepMind)
### Other Academic Papers

- Deep Audio-Visual Speech Recognition (Sept. 2018, DeepMind)
- Lip Reading Sentences in the Wild (Jan. 2017, DeepMind)
  - CNN + LSTM encoder, attentive LSTM decoder
- Large-Scale Visual Speech Recognition (Oct. 2018, DeepMind)
- Lip Reading in Profile (2017, Oxford)
- Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning (Jan. 2017, CMU)
- A Comparison of Sequence-to-Sequence Models for Speech Recognition (2017, Google & Nvidia)
  - CTC vs. attention vs. RNN-transducer vs. RNN-transducer w/ attention
- Exploring Neural Transducers for End-to-End Speech Recognition (July 2017, Baidu)
  - CTC vs. attention vs. RNN-transducer
- Lip Reading Datasets (Oxford)