# A Talking Head That Describes the Scene Captured from a Webcam

Steps:
*   Capture an image using the webcam
*   Use the moondream2 VLM to describe the image or to answer questions about it (VQA)
*   Clone a reference voice using the TTS (xtts_v2) model
*   Generate speech from the moondream2 text output using the cloned voice
*   Animate a reference image (a head portrait) driven by that audio using LivePortrait Talker





## Open Webcam and Take a photo

In [None]:
from IPython.display import display, Javascript
from google.colab.output import eval_js
from base64 import b64decode

def take_photo(filename='photo.jpg', quality=0.8):
  js = Javascript('''
    async function takePhoto(quality) {
      const div = document.createElement('div');
      const capture = document.createElement('button');
      capture.textContent = 'Capture';
      div.appendChild(capture);

      const video = document.createElement('video');
      video.style.display = 'block';
      const stream = await navigator.mediaDevices.getUserMedia({video: true});

      document.body.appendChild(div);
      div.appendChild(video);
      video.srcObject = stream;
      await video.play();

      // Resize the output to fit the video element.
      google.colab.output.setIframeHeight(document.documentElement.scrollHeight, true);

      // Wait for Capture to be clicked.
      await new Promise((resolve) => capture.onclick = resolve);

      const canvas = document.createElement('canvas');
      canvas.width = video.videoWidth;
      canvas.height = video.videoHeight;
      canvas.getContext('2d').drawImage(video, 0, 0);
      stream.getVideoTracks()[0].stop();
      div.remove();
      return canvas.toDataURL('image/jpeg', quality);
    }
    ''')
  display(js)
  data = eval_js('takePhoto({})'.format(quality))
  binary = b64decode(data.split(',')[1])
  with open(filename, 'wb') as f:
    f.write(binary)
  return filename

In [None]:
from IPython.display import Image
try:
  webcam_image = take_photo()
  print('Saved to {}'.format(webcam_image))

  # Show the image which was just taken.
  display(Image(webcam_image))
except Exception as err:
  # Errors will be thrown if the user does not have a webcam or if they do not
  # grant the page permission to access it.
  print(str(err))
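
`take_photo` receives the captured frame from the browser as a data URL (the return value of `canvas.toDataURL`) and keeps only the base64 payload after the comma. Below is a minimal sketch of that decoding step in isolation. The helper name `decode_data_url` and the sample bytes are illustrative and not part of the notebook.

```python
from base64 import b64decode, b64encode


def decode_data_url(data_url: str) -> bytes:
    """Return the raw bytes carried by a base64 data URL.

    Hypothetical helper mirroring the `data.split(',')[1]` step in take_photo.
    """
    header, _, payload = data_url.partition(',')
    if not header.startswith('data:') or not header.endswith(';base64'):
        raise ValueError('expected a base64 data URL')
    return b64decode(payload)


# Build a data URL shaped like the one canvas.toDataURL('image/jpeg', quality)
# returns, using placeholder bytes instead of a real JPEG.
sample = b'\xff\xd8placeholder-bytes\xff\xd9'
url = 'data:image/jpeg;base64,' + b64encode(sample).decode()
decoded = decode_data_url(url)
```

Validating the header before decoding gives a clearer error than an `IndexError` when the browser returns something unexpected.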

## Use moondream2 VLM to describe the captured image

Ref: https://github.com/vikhyat/moondream


In [None]:
!pip install transformers einops

Collecting safetensors>=0.4.1 (from transformers)
  Downloading safetensors-0.4.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading safetensors-0.4.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (435 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 435.0/435.0 kB 10.4 MB/s eta 0:00:00
Installing collected packages: safetensors
  Attempting uninstall: safetensors
    Found existing installation: safetensors 0.3.1
    Uninstalling safetensors-0.3.1:
      Successfully uninstalled safetensors-0.3.1
Successfully installed safetensors-0.4.5


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import torch

model_id = "vikhyatk/moondream2"
revision = "2024-08-26"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open(webcam_image)
enc_image = model.encode_image(image)
image_desc = model.answer_question(enc_image, "Describe this image.", tokenizer)
print(image_desc)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A person is holding a smartphone with a blue and purple gradient screen in front of their face, obscuring their eyes and nose.
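
The same encoded image can be reused for visual question answering by passing different prompts to `answer_question`. The sketch below assumes `encode_image` and `answer_question` behave as in the cell above. `ask_questions` and the stand-in model class are hypothetical and exist only so the snippet runs without downloading weights.

```python
from typing import Any, Dict, Iterable


def ask_questions(model: Any, tokenizer: Any, image: Any,
                  questions: Iterable[str]) -> Dict[str, str]:
    """Encode the image once, then ask each question against that encoding."""
    enc_image = model.encode_image(image)
    return {q: model.answer_question(enc_image, q, tokenizer) for q in questions}


class _StubModel:
    """Stand-in exposing the same two methods, used only to exercise the helper."""

    def __init__(self):
        self.encode_calls = 0

    def encode_image(self, image):
        self.encode_calls += 1
        return ("encoded", image)

    def answer_question(self, enc_image, question, tokenizer):
        return f"stub answer to: {question}"


stub = _StubModel()
answers = ask_questions(
    stub,
    tokenizer=None,
    image="photo.jpg",
    questions=["Describe this image.", "How many people are visible?"],
)
```

With the real model loaded, calling `ask_questions(model, tokenizer, image, [...])` would return one answer per question while encoding the image only once.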


## Text to Speech with TTS (xtts_v2) using a reference voice and above text

Ref: https://pypi.org/project/TTS/


In [None]:

!pip install TTS

Collecting TTS
  Downloading TTS-0.22.0-cp310-cp310-manylinux1_x86_64.whl.metadata (21 kB)
Collecting anyascii>=0.3.0 (from TTS)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pysbd>=0.3.4 (from TTS)
  Downloading pysbd-0.3.4-py3-none-any.whl.metadata (6.1 kB)
Collecting umap-learn>=0.5.1 (from TTS)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pandas<2.0,>=1.4 (from TTS)
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting trainer>=0.0.32 (from TTS)
  Downloading trainer-0.0.36-py3-none-any.whl.metadata (8.1 kB)
Collecting coqpit>=0.0.16 (from TTS)
  Downloading coqpit-0.0.17-py3-none-any.whl.metadata (11 kB)
Collecting pypinyin (from TTS)
  Downloading pypinyin-0.53.0-py2.py3-none-any.whl.metadata (12 kB)
Collecting hangul-romanize (from TTS)
  Downloading hangul_romanize-0.1.0-py3-none-any.whl.metadata (1.2 kB)
Collecting gruut==2.2.3 (from gruut[de,es,fr]==2.2.3->T

In [None]:
!pip install numpy==1.23

Collecting numpy==1.23
  Downloading numpy-1.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.2 kB)
Downloading numpy-1.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.0/17.0 MB 69.9 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.0
    Uninstalling numpy-1.22.0:
      Successfully uninstalled numpy-1.22.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tts 0.22.0 requires numpy==1.22.0; python_version <= "3.10", but you have numpy 1.23.0 which is incompatible.
albucore 0.0.19 requires numpy>=1.24.4, but you have numpy 1.23.0 which is incompatible.
albumentations 1.4.20 requires numpy>=1.24.4, but you have numpy 1.23.0 which is incomp

In [None]:
from TTS.api import TTS
import torch
from IPython.display import Audio, display

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)


 > You must confirm the following:
 | > "I have purchased a commercial license from Coqui: licensing@coqui.ai"
 | > "Otherwise, I agree to the terms of the non-commercial CPML: https://coqui.ai/cpml" - [y/n]
 | | > y
 > Downloading model to /root/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2


100%|█████████▉| 1.87G/1.87G [00:44<00:00, 42.3MiB/s]
100%|██████████| 1.87G/1.87G [00:44<00:00, 42.0MiB/s]
100%|██████████| 4.37k/4.37k [00:00<00:00, 21.6kiB/s]

100%|██████████| 361k/361k [00:00<00:00, 1.41MiB/s]
100%|██████████| 32.0/32.0 [00:00<00:00, 130iB/s]
 54%|█████▍    | 4.19M/7.75M [00:00<00:00, 41.2MiB/s]

 > Model's license - CPML
 > Check https://coqui.ai/cpml.txt for more info.
 > Using model: xtts


  self.speakers = torch.load(speaker_file_path)
  return torch.load(f, map_location=map_location, **kwargs)


In [None]:

# Synthesize speech for image_desc in the cloned voice and save it to a file.
# source_voice.wav is the reference clip to clone and must already exist.
tts.tts_to_file(
    text=image_desc,
    speaker_wav="source_voice.wav",
    file_path="output_audio.wav",
    language="en",
)
# Display audio widgets for the reference clip and the generated speech
print("Source Audio Clip:\n")
audio_widget_source = Audio(filename="source_voice.wav", autoplay=False)
display(audio_widget_source)
print("\nOutput Audio - moondream text in cloned voice:\n")
audio_widget_output = Audio(filename="output_audio.wav", autoplay=False)
display(audio_widget_output)

 > Text splitted to sentences.
['The image features a close-up portrait of a man with gray hair and a beard, wearing a dark suit and a white shirt, against a dark background.']


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


 > Processing time: 6.6382386684417725
 > Real-time factor: 0.6806281277394776
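
The log reports a processing time and a real-time factor. Assuming the real-time factor is processing time divided by the duration of the generated audio (so values below 1 mean synthesis ran faster than playback), the figures above imply roughly 9.75 s of audio. The sketch below measures a WAV file's duration with the standard `wave` module and derives the ratio. The function names are illustrative, the file is assumed to be PCM-encoded, and the demo writes a short silent clip rather than reading `output_audio.wav`.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, 'rb') as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent synthesizing per second of audio produced."""
    return processing_seconds / audio_seconds


# Demo: write 2 seconds of 16-bit mono silence at 24 kHz (an arbitrary rate).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
with wave.open(demo_path, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wav.setframerate(24000)
    wav.writeframes(b'\x00\x00' * 24000 * 2)

duration = wav_duration_seconds(demo_path)
rtf = real_time_factor(1.0, duration)  # pretend synthesis took 1 second
```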


## Generate video with LivePortrait Talker using an Image and the Audio clip.


Ref: https://github.com/zachysaur/liveportrait_talker

**LivePortraitTalker** is a zero-shot talking head generation approach. It combines the pretrained models of SadTalker and LivePortrait, and involves:

*   Training SadTalker's mapping network for use with LivePortrait's rendering networks.
*   Generating synthetic head poses from the initial head pose and the mapping network's outputs.




In [None]:
import locale
# Force UTF-8 as the preferred encoding so the shell commands below run in Colab
locale.getpreferredencoding = lambda: "UTF-8"

%cd /content
!git clone https://github.com/zachysaur/liveportrait_talker.git
%cd liveportrait_talker
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -r requirements.txt
!pip install onnxruntime-gpu==1.18.0

/content
Cloning into 'liveportrait_talker'...
remote: Enumerating objects: 336, done.
remote: Counting objects: 100% (81/81), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 336 (delta 61), reused 45 (delta 45), pack-reused 255 (from 1)
Receiving objects: 100% (336/336), 2.62 MiB | 4.74 MiB/s, done.
Resolving deltas: 100% (176/176), done.
/content/liveportrait_talker
Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting facexlib==0.3.0 (from -r requirements.txt (line 1))
  Downloading facexlib-0.3.0-py3-none-any.whl.metadata (4.6 kB)
Collecting kornia==0.7.3 (from -r requirements.txt (line 2))
  Downloading kornia-0.7.3-py2.py3-none-any.whl.metadata (7.7 kB)
Collecting librosa==0.10.1 (from -r requirements.txt (line 3))
  Using cached librosa-0.10.1-py3-none-any.whl.metadata (8.3 kB)
Collecting lws==1.2.8 (from -r requirements.txt (line 4))
  Downloading lws-1.2.8.tar.gz (140 kB)

Collecting onnxruntime-gpu==1.18.0
  Downloading onnxruntime_gpu-1.18.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Collecting coloredlogs (from onnxruntime-gpu==1.18.0)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime-gpu==1.18.0)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading onnxruntime_gpu-1.18.0-cp310-cp310-manylinux_2_28_x86_64.whl (199.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.8/199.8 MB 6.4 MB/s eta 0:00:00
Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.0/46.0 kB 3.4 MB/s eta 0:00:00
Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.8/86.8 kB 7.4 MB/s eta 0:00:00
Installing collected pa

In [None]:
# Download models
%cd /content/liveportrait_talker/
!sh scripts/download_models.sh

/content/liveportrait_talker
--2024-11-06 18:39:00--  https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2-rc/SadTalker_V0.0.2_256.safetensors
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/569518584/93be550c-5100-467a-9ac3-994ddf04fb7e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20241106%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241106T183901Z&X-Amz-Expires=300&X-Amz-Signature=4e6f28cd3438214e09975660da5dbfc95133cdb7173a93221484aeeddabf8ed9&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3DSadTalker_V0.0.2_256.safetensors&response-content-type=application%2Foctet-stream [following]
--2024-11-06 18:39:01--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/569518584/93be550c-51

In [None]:
# Generate and save video to file
!python inference.py --config_path config.yaml --source_path /content/lee.png --audio_path /content/output_audio.wav --save_path /content/



Config File is loaded successfully!
Downloading: "https://github.com/xinntao/facexlib/releases/download/v0.1.0/alignment_WFLW_4HG.pth" to /content/liveportrait_talker/pretrained_models/sadtalker/alignment_WFLW_4HG.pth

100% 185M/185M [00:00<00:00, 238MB/s]
Downloading: "https://github.com/xinntao/facexlib/releases/download/v0.1.0/detection_Resnet50_Final.pth" to /content/liveportrait_talker/pretrained_models/sadtalker/detection_Resnet50_Final.pth

100% 104M/104M [00:00<00:00, 262MB/s] 
Pipeline Objects are initialized!
Generating Mel Spectrograms...: 100% 224/224 [00:00<00:00, 42838.05it/s]
Audio2Exp Predicting...: 100% 23/23 [00:00<00:00, 179.36it/s]
Rendering..: 100% 224/224 [00:56<00:00,  3.96it/s]
Done
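
The next cell loads the result from a path whose filename embeds a timestamp, so the literal has to be edited after every run. The sketch below picks the most recently modified `.mp4` in a directory instead. `latest_video` is a hypothetical helper, and the assumption that results land in a folder such as `/content/lee/` comes only from the path used in that cell.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_video(directory: str, pattern: str = '*.mp4') -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo with placeholder files standing in for two inference outputs.
demo_dir = tempfile.mkdtemp()
older = Path(demo_dir, 'output_audio_old.mp4')
newer = Path(demo_dir, 'output_audio_new.mp4')
older.write_bytes(b'old')
newer.write_bytes(b'new')
# Set explicit modification times so the ordering is unambiguous.
os.utime(older, (1_000_000_000, 1_000_000_000))
os.utime(newer, (1_000_000_100, 1_000_000_100))

picked = latest_video(demo_dir)
```

In the notebook, `latest_video('/content/lee')` could stand in for the hardcoded path, with a check for `None` in case inference produced no file.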


In [None]:
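from IPython.display import HTML
from base64 import b64encode

# Output of the inference step; the timestamp in the filename changes on every run.
video_path = '/content/lee/output_audio_06112024-183941.mp4'
with open(video_path, 'rb') as f:
    mp4 = f.read()
# Embed the video in the notebook as a base64 data URL
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)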

