ObamaNet : Photo-realistic lip-sync from audio (Unofficial port)

ObamaNet : Lip Sync from Audio




Install the requirements by running the following command:

sudo pip3 install -r requirements.txt

The project is built for Python 3.5 and above. The other libraries it depends on are listed below:

  • OpenCV (sudo pip3 install opencv-contrib-python)
  • Dlib (sudo pip3 install dlib) with this file unzipped in the data folder
  • Python Speech Features (sudo pip3 install python-speech-features)

For a complete list, refer to the requirements.txt file.
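
Before running anything else, it can help to confirm that the libraries listed above are importable. The snippet below is a minimal sketch, not part of the repository. The import names `cv2`, `dlib` and `python_speech_features` are assumptions based on how these pip packages are usually imported.

```python
import importlib.util

# Assumed import names for the pip packages opencv-contrib-python, dlib
# and python-speech-features; adjust them if your environment differs.
REQUIRED_MODULES = ("cv2", "dlib", "python_speech_features")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

An empty result only means the modules were found. It does not check their versions against requirements.txt.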

I used the tools below to extract and manipulate the data:

Data Extraction

I extracted the data from YouTube using youtube-dl, which is perhaps the best YouTube downloader on Linux. Commands for extracting particular streams are given below.

  • Subtitle Extraction
youtube-dl --sub-lang en --skip-download --write-sub --output '~/obamanet/data/captions/%(autonumber)s.%(ext)s' --batch-file ~/obamanet/data/obama_addresses.txt --ignore-config
  • Video Extraction
youtube-dl --batch-file ~/obamanet/data/obama_addresses.txt -o '~/obamanet/data/videos/%(autonumber)s.%(ext)s' -f "best[height=720]" --autonumber-start 1

(Videos not available in 720p: 165)

  • Video to Audio Conversion
  • Video to Images
ffmpeg -i 00001.mp4 -r 1/5 -vf scale=-1:720 images/00001-$filename%05d.bmp
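
The ffmpeg command above handles a single video. Because `-r 1/5` comes after the input, it sets the output rate, so one frame is written for every five seconds of video, and `scale=-1:720` fixes the height while keeping the aspect ratio. The sketch below builds the same argument list for any numbered video. The function name and its parameters are hypothetical, and the `$filename` shell variable from the original pattern is left out on the assumption that it was empty.

```python
from pathlib import Path


def frame_extraction_command(video, out_dir="images", seconds_per_frame=5, height=720):
    """Build an ffmpeg argument list that samples one frame every few seconds."""
    stem = Path(video).stem  # e.g. "00001" for "00001.mp4"
    # Assumption: the "$filename" part of the original output pattern is dropped.
    pattern = f"{out_dir}/{stem}-%05d.bmp"
    return [
        "ffmpeg", "-i", str(video),
        "-r", f"1/{seconds_per_frame}",  # output rate: one frame per N seconds
        "-vf", f"scale=-1:{height}",     # keep the aspect ratio, fix the height
        pattern,
    ]
```

To run it over every downloaded video, pass each list to `subprocess.run(cmd, check=True)`. A list with no shell involved means `%05d` reaches ffmpeg unchanged.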

To convert the images from BMP to JPG, run the following in the directory that contains them:

mogrify -format jpg *.bmp
rm -rf *.bmp

Copy the patched images into folder a and the cropped images into folder b, then run:

python3 tools/ --input_dir a --b_dir b --operation combine --output_dir c
python3 tools/ --dir c

You may use this pretrained model or train pix2pix from scratch using this dataset. Unzip the dataset into the pix2pix main directory.

python3 --mode train --output_dir output --max_epochs 200 --input_dir c/train/ --which_direction AtoB

To run the trained pix2pix model:

python3 --mode test --output_dir test_out/ --input_dir c_test/ --checkpoint output/

To convert the images to a video:

ffmpeg -r 30 -f image2 -s 256x256 -i %d-targets.png -vcodec libx264 -crf 25 ../targets.mp4
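
The `%d-targets.png` pattern makes ffmpeg read a numbered image sequence. As far as I understand, reading stops at the first missing number, so a gap in the frames silently shortens the video. The sketch below is a hypothetical helper, not part of the repository. It lists the frame numbers in a directory and reports any gaps before you run the command above.

```python
import re
from pathlib import Path

# Matches the file names consumed by the "%d-targets.png" pattern.
TARGET_NAME = re.compile(r"^(\d+)-targets\.png$")


def target_frame_numbers(directory):
    """Return the sorted frame indices of all N-targets.png files in directory."""
    numbers = []
    for path in Path(directory).iterdir():
        match = TARGET_NAME.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def missing_frames(numbers):
    """Return the indices absent between the smallest and largest frame number."""
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(min(present), max(present) + 1) if n not in present]
```

For example, `missing_frames(target_frame_numbers("test_out/images"))` returns an empty list when the sequence is contiguous. The directory name here is only a placeholder.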

Pretrained Model

The pretrained model and a subset of the data are available here: Link

Download and extract the checkpoints and the data folders into the repository. The file structure should look as shown below.

└─ data
|   | audios
|   | a2key_data
|   ...
└─ checkpoints
|   | output
|   | model.h5
|   ...
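
A quick way to confirm the layout after extracting the archives is sketched below. It only checks the entries named in the tree above and ignores the `...` lines. It assumes that `model.h5` sits directly under `checkpoints`, which is my reading of the tree. The function name is hypothetical.

```python
from pathlib import Path

# Entries copied from the tree above. Assumption: model.h5 is a direct
# child of checkpoints, next to the output folder.
EXPECTED_PATHS = (
    "data/audios",
    "data/a2key_data",
    "checkpoints/output",
    "checkpoints/model.h5",
)


def missing_paths(repo_root=".", expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_paths()` from the repository root returns an empty list when everything is in place.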

Running sample wav file

Run the following commands

bash <relative_path_to_audio_wav_file>


bash data/audios/karan.wav

Feel free to experiment with different voices. However, the result will depend on how close your voice is to the subject we trained on.


If you use this code for your research, please cite the paper it is based on, ObamaNet: Photo-realistic lip-sync from text, as well as the amazing pix2pix repository by affinelayer.

Cite as arXiv:1801.01442v1 [cs.CV]

