<a href="https://colab.research.google.com/github/jrtabletsms6/document-qa/blob/main/wav2lip.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wav2Lip: Accurately Lip-syncing Videos and OpenVINO

Lip-sync technology is widely used in digital human applications, where it improves the user experience in dialog scenarios.

[Wav2Lip](https://github.com/Rudrabha/Wav2Lip) is an approach for generating accurate 2D lip-synced videos in the wild from a single video and an audio clip. Wav2Lip relies on an accurate lip-sync "expert" model and consecutive face frames to generate accurate, natural lip motion.

![teaser](https://github.com/user-attachments/assets/11d2fb00-4b5a-45f3-b13b-49636b0d48b1)

In this notebook, we show how to enable and optimize the Wav2Lip pipeline with OpenVINO. It is an adaptation of the blog article [Enable 2D Lip Sync Wav2Lip Pipeline with OpenVINO Runtime](https://blog.openvino.ai/blog-posts/enable-2d-lip-sync-wav2lip-pipeline-with-openvino-runtime).

Here is an overview of the Wav2Lip pipeline:

![wav2lip_pipeline](https://cdn.prod.website-files.com/62c72c77b482b372ac273024/669487bc70c2767fbb9b6c8e_wav2lip_pipeline.png)


#### Table of contents:

- [Prerequisites](#Prerequisites)
- [Convert the model to OpenVINO IR](#Convert-the-model-to-OpenVINO-IR)
- [Compiling models and prepare pipeline](#Compiling-models-and-prepare-pipeline)
- [Interactive inference](#Interactive-inference)

### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).


## Prerequisites
[back to top ⬆️](#Table-of-contents:)

In [1]:
import requests
from pathlib import Path


# Fetch the shared utility modules from the openvino_notebooks repository.
utils_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils"
for util_file in ["notebook_utils.py", "pip_helper.py", "cmd_helper.py"]:
    r = requests.get(url=f"{utils_url}/{util_file}")
    r.raise_for_status()  # stop here rather than save an error page as a module
    with open(util_file, "w") as f:
        f.write(r.text)

from pip_helper import pip_install

pip_install("-q", "openvino>=2024.4.0")
pip_install(
    "-q",
    "huggingface_hub",
    "torch>=2.1",
    "gradio>=4.19",
    "librosa==0.9.2",
    "opencv-contrib-python",
    "opencv-python",
    "tqdm",
    "numba",
    "numpy<2",
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
)

# Fetch the notebook-specific helper modules unless they are already present.
helpers = ["gradio_helper.py", "ov_inference.py", "ov_wav2lip_helper.py"]
for helper_file in helpers:
    if not Path(helper_file).exists():
        r = requests.get(url=f"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/wav2lip/{helper_file}")
        r.raise_for_status()
        with open(helper_file, "w") as f:
            f.write(r.text)

In [2]:
from cmd_helper import clone_repo


clone_repo("https://github.com/Rudrabha/Wav2Lip.git")

PosixPath('Wav2Lip')

Download example files.

In [3]:
from notebook_utils import download_file


download_file("https://github.com/sammysun0711/openvino_aigc_samples/blob/main/Wav2Lip/data_audio_sun_5s.wav?raw=true")
download_file("https://github.com/sammysun0711/openvino_aigc_samples/blob/main/Wav2Lip/data_video_sun_5s.mp4?raw=true")

data_audio_sun_5s.wav:   0%|          | 0.00/436k [00:00<?, ?B/s]

data_video_sun_5s.mp4:   0%|          | 0.00/916k [00:00<?, ?B/s]

PosixPath('/content/data_video_sun_5s.mp4')

## Convert the model to OpenVINO IR
[back to top ⬆️](#Table-of-contents:)

You don't need to download the checkpoints or load the models yourself; just call the helper function `download_and_convert_models`. It downloads the checkpoints and converts both models (face detection and Wav2Lip) to the OpenVINO IR format.

In [4]:
from ov_wav2lip_helper import download_and_convert_models


OV_FACE_DETECTION_MODEL_PATH = Path("models/face_detection.xml")
OV_WAV2LIP_MODEL_PATH = Path("models/wav2lip.xml")

download_and_convert_models(OV_FACE_DETECTION_MODEL_PATH, OV_WAV2LIP_MODEL_PATH)

Convert Face Detection Model ...


s3fd-619a316812.pth:   0%|          | 0.00/85.7M [00:00<?, ?B/s]

  model_weights = torch.load(path_to_detector)


Converted face detection OpenVINO model:  models/face_detection.xml
Convert Wav2Lip Model ...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


wav2lip.pth:   0%|          | 0.00/436M [00:00<?, ?B/s]

Load checkpoint from: checkpoints/Wav2lip/wav2lip.pth


  checkpoint = torch.load(checkpoint_path, map_location=lambda storage, loc: storage)


Converted face detection OpenVINO model:  models/wav2lip.xml
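
For orientation, the sketch below shows one way a PyTorch module can be converted to OpenVINO IR and cached on disk. It is not the code of `download_and_convert_models`, whose internals are not shown here. The names `ir_files`, `needs_conversion`, and `convert_if_missing` are hypothetical, and skipping conversion when the IR files already exist is an assumption about a sensible workflow. The only OpenVINO calls used are `ov.convert_model` and `ov.save_model`.

```python
from pathlib import Path


def ir_files(xml_path):
    """Return the (.xml, .bin) pair that makes up one OpenVINO IR model."""
    xml_path = Path(xml_path)
    return xml_path, xml_path.with_suffix(".bin")


def needs_conversion(xml_path):
    """True when either IR file is missing, so conversion has to run."""
    return not all(p.exists() for p in ir_files(xml_path))


def convert_if_missing(torch_model, example_input, xml_path):
    """Hypothetical helper: convert a PyTorch module to IR unless it is cached."""
    xml_path = Path(xml_path)
    if not needs_conversion(xml_path):
        return xml_path  # reuse the IR files produced by an earlier run

    import openvino as ov  # imported lazily so the path helpers work without it

    torch_model.eval()
    # example_input lets OpenVINO trace the module with concrete tensors.
    ov_model = ov.convert_model(torch_model, example_input=example_input)
    xml_path.parent.mkdir(parents=True, exist_ok=True)
    ov.save_model(ov_model, str(xml_path))  # writes the .xml and the matching .bin
    return xml_path
```

With this layout, a call such as `convert_if_missing(model, example, OV_WAV2LIP_MODEL_PATH)` would only pay the conversion cost on the first run; `model` and `example` stand for a loaded checkpoint and a matching input tensor, which the notebook's helper prepares internally.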


## Compiling models and prepare pipeline
[back to top ⬆️](#Table-of-contents:)

Select a device from the dropdown list to run inference with OpenVINO.

In [5]:
from notebook_utils import device_widget

device = device_widget()

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
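
The options shown in the dropdown can be reproduced without the widget. The sketch below is an assumption about an equivalent approach, not the implementation of `device_widget`: it queries `ov.Core().available_devices` and appends the `AUTO` virtual device. `build_device_options` is a hypothetical name, and the `["CPU"]` fallback exists only so the snippet runs where OpenVINO is not installed.

```python
def build_device_options(available_devices, default="AUTO"):
    """Return (options, selected value): detected devices plus the AUTO device."""
    options = list(available_devices)
    if "AUTO" not in options:
        options.append("AUTO")
    # Fall back to the first option if the requested default is not available.
    value = default if default in options else options[0]
    return options, value


try:
    import openvino as ov  # optional: only needed to query real hardware

    detected = ov.Core().available_devices
except ImportError:
    detected = ["CPU"]  # assumed fallback so the sketch runs without OpenVINO

options, value = build_device_options(detected)
print(options, value)
```

In this notebook the selected string is passed on as `inference_device=device.value`, so a plain string such as `"CPU"` can be substituted when no widget is available.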

`ov_inference.py` is an adaptation of the original pipeline, which only has a command-line interface. The `ov_inference` function runs inference through the Python API using the converted OpenVINO models.

In [6]:
import os

from ov_inference import ov_inference


# Create the output directory if it does not exist yet.
os.makedirs("results", exist_ok=True)

ov_inference(
    "data_video_sun_5s.mp4",
    "data_audio_sun_5s.wav",
    face_detection_path=OV_FACE_DETECTION_MODEL_PATH,
    wav2lip_path=OV_WAV2LIP_MODEL_PATH,
    inference_device=device.value,
    outfile="results/result_voice.mp4",
)

Reading video frames...
Number of frames available for inference: 125


  return librosa.filters.mel(hp.sample_rate, hp.n_fft, n_mels=hp.num_mels,


(80, 405)
Length of mel chunks: 123


  0%|          | 0/1 [00:00<?, ?it/s]

face_detect_ov images[0].shape:  (768, 576, 3)



  0%|          | 0/8 [00:00<?, ?it/s]
 12%|█▎        | 1/8 [01:02<07:18, 62.69s/it]
 25%|██▌       | 2/8 [02:04<06:13, 62.19s/it]
 38%|███▊      | 3/8 [03:06<05:11, 62.20s/it]
 50%|█████     | 4/8 [04:07<04:06, 61.60s/it]
 62%|██████▎   | 5/8 [05:09<03:05, 61.80s/it]
 75%|███████▌  | 6/8 [06:10<02:02, 61.42s/it]
 88%|████████▊ | 7/8 [07:11<01:01, 61.43s/it]
100%|██████████| 8/8 [07:53<00:00, 59.20s/it]


Model loaded


100%|██████████| 1/1 [08:22<00:00, 502.89s/it]


'results/result_voice.mp4'
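
The counts in the log above (a mel spectrogram of shape `(80, 405)`, 123 mel chunks, 8 face-detection batches, and 1 Wav2Lip batch) can be checked with a little arithmetic. The sketch below assumes the chunking scheme and defaults of the upstream Wav2Lip `inference.py` script: 80 mel frames per second, a 16-frame mel window, a face-detection batch size of 16, and a Wav2Lip batch size of 128. It also assumes a frame rate of 25 fps, inferred from 125 frames in a 5-second clip. Whether `ov_inference.py` uses exactly these values is not confirmed here, and the function names are illustrative.

```python
import math

MEL_STEP_SIZE = 16          # mel frames per chunk (upstream Wav2Lip value, assumed)
MEL_FRAMES_PER_SECOND = 80  # mel frames per second of audio (upstream value, assumed)


def mel_chunk_starts(num_mel_frames, fps):
    """Return the start index of each mel window, one window per video frame."""
    step = MEL_FRAMES_PER_SECOND / fps  # mel frames advanced per video frame
    starts = []
    i = 0
    while True:
        start = int(i * step)
        if start + MEL_STEP_SIZE > num_mel_frames:
            # The last window is right-aligned so it still spans MEL_STEP_SIZE frames.
            starts.append(num_mel_frames - MEL_STEP_SIZE)
            break
        starts.append(start)
        i += 1
    return starts


def num_batches(num_items, batch_size):
    """Number of batches needed to process num_items."""
    return math.ceil(num_items / batch_size)


starts = mel_chunk_starts(num_mel_frames=405, fps=25)
print(len(starts))                    # 123 mel chunks
print(num_batches(len(starts), 16))   # 8 face-detection batches
print(num_batches(len(starts), 128))  # 1 Wav2Lip batch
```

Under these assumptions the numbers agree with the log, which also suggests why face detection dominates the runtime in this run: it processes eight batches at roughly a minute each.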

To compare the result with the inputs, first play the original video and the driving audio:

In [7]:
from IPython.display import Video, Audio

Video("data_video_sun_5s.mp4", embed=True)

In [8]:
Audio("data_audio_sun_5s.wav")

The generated video:

In [9]:
Video("results/result_voice.mp4", embed=True)

## Interactive inference
[back to top ⬆️](#Table-of-contents:)
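
The log captured further down in this section ends in `LibsndfileError: Error opening 'temp/temp.wav'`. One plausible cause, which the truncated traceback does not confirm, is that the relative `temp` directory does not exist in the working directory when the uploaded audio is extracted. A minimal precaution under that assumption is to create the directory before launching the demo. `ensure_dir` is a hypothetical helper, not part of the notebook's helper modules.

```python
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any parents) if missing; return it as a Path."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Assumption: extracted audio is written to the relative path temp/temp.wav.
ensure_dir("temp")
```

If the error persists with the directory in place, the audio extraction step itself is the next thing to check.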

In [None]:
from gradio_helper import make_demo


demo = make_demo(fn=ov_inference)

try:
    demo.queue().launch(debug=True)
except Exception:
    demo.queue().launch(debug=True, share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/



Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://7fc7b956a89785d195.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Reading video frames...
Number of frames available for inference: 250
Extracting raw audio...


  return f(*args, **kwargs)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/librosa/core/audio.py", line 164, in load
    y, sr_native = __soundfile_load(path, offset, duration, dtype)
  File "/usr/local/lib/python3.10/dist-packages/librosa/core/audio.py", line 195, in __soundfile_load
    context = sf.SoundFile(path)
  File "/usr/local/lib/python3.10/dist-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/usr/local/lib/python3.10/dist-packages/soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening 'temp/temp.wav': System error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 625, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local