# Auto-AVSR Tutorial
**Authors**: [Pingchuan Ma](https://mpc001.github.io/), [Alexandros Haliassos](https://dblp.org/pid/257/3052.html), [Adriana Fernandez-Lopez](https://scholar.google.com/citations?user=DiVeQHkAAAAJ), [Honglie Chen](https://scholar.google.com/citations?user=HPwdvwEAAAAJ), [Stavros Petridis](https://ibug.doc.ic.ac.uk/people/spetridis), [Maja Pantic](https://ibug.doc.ic.ac.uk/people/mpantic).

This tutorial shows how to use the Auto-AVSR model to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.

**Disclaimer**: Both the VSR and AV-ASR models were trained on videos pre-processed with RetinaFace. To speed up inference, this tutorial uses MediaPipe instead.

In [1]:
%cd "/content/"
!git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages.git
%cd "Visual_Speech_Recognition_for_Multiple_Languages"

/content
Cloning into 'Visual_Speech_Recognition_for_Multiple_Languages'...
remote: Enumerating objects: 267, done.
remote: Counting objects: 100% (90/90), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 267 (delta 27), reused 73 (delta 18), pack-reused 177
Receiving objects: 100% (267/267), 69.77 MiB | 18.39 MiB/s, done.
Resolving deltas: 100% (52/52), done.
/content/Visual_Speech_Recognition_for_Multiple_Languages


In [2]:
!pip install torch torchvision torchaudio
!pip install opencv-python
!pip install scipy
!pip install scikit-image
!pip install av
!pip install six
!pip install mediapipe
!pip install ffmpeg-python

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting av
  Downloading av-10.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31.0 MB)
Installing collected packages: av
Successfully installed av-10.0.0

## Video preparation

1. Download a video.

In [3]:
!mkdir -p /content/data/
!wget --content-disposition http://www.doc.ic.ac.uk/~pm4115/autoAVSR/autoavsr_demo_video.mp4 -O /content/data/clip.mp4

--2023-06-22 08:57:10--  http://www.doc.ic.ac.uk/~pm4115/autoAVSR/autoavsr_demo_video.mp4
Resolving www.doc.ic.ac.uk (www.doc.ic.ac.uk)... 146.169.13.6
Connecting to www.doc.ic.ac.uk (www.doc.ic.ac.uk)|146.169.13.6|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3644186 (3.5M) [video/mp4]
Saving to: ‘/content/data/clip.mp4’


2023-06-22 08:57:11 (38.0 MB/s) - ‘/content/data/clip.mp4’ saved [3644186/3644186]



In [4]:
from IPython.display import HTML
from base64 import b64encode

# play_video function based on: https://colab.research.google.com/drive/1bNXkfpHiVHzXQH8WjGhzQ-fsDxolpUjD

def play_video(video_path, width=200):
  # Read the file and embed it in the notebook as a base64 data URL.
  with open(video_path, 'rb') as f:
    mp4 = f.read()
  data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
  return HTML(f"""
  <video width={width} controls>
        <source src="{data_url}" type="video/mp4">
  </video>
  """)

In [5]:
play_video('/content/data/clip.mp4', width=300)

2. Create a noisy clip.


In [6]:
!mkdir -p /content/data/
!wget http://www.doc.ic.ac.uk/~pm4115/autoAVSR/babble_noise.wav -O /content/data/babble_noise.wav

--2023-06-22 08:57:11--  http://www.doc.ic.ac.uk/~pm4115/autoAVSR/babble_noise.wav
Resolving www.doc.ic.ac.uk (www.doc.ic.ac.uk)... 146.169.13.6
Connecting to www.doc.ic.ac.uk (www.doc.ic.ac.uk)|146.169.13.6|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15054806 (14M) [audio/x-wav]
Saving to: ‘/content/data/babble_noise.wav’


2023-06-22 08:57:11 (98.4 MB/s) - ‘/content/data/babble_noise.wav’ saved [15054806/15054806]



In [7]:
import os
import random
import ffmpeg
import torch
import torchaudio

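def create_noisy_clip(src_filename, dst_filename, noise, snr_level):
    # Load the speech track and the noise; resample the noise if the rates differ.
    speech, sample_rate = torchaudio.load(src_filename)
    noise_wave, noise_sample_rate = torchaudio.load(noise)
    if noise_sample_rate != sample_rate:
        noise_wave = torchaudio.functional.resample(noise_wave, noise_sample_rate, sample_rate)
    # Pick a random noise segment with the same length as the speech.
    # This assumes the noise recording is at least as long as the speech.
    start_idx = random.randint(0, noise_wave.shape[1] - speech.shape[1])
    noise_wave = noise_wave[:, start_idx:start_idx + speech.shape[1]]
    noisy_speech = torchaudio.functional.add_noise(speech, noise_wave, torch.tensor([snr_level]))
    # Write the noisy audio to a temporary WAV file next to the destination.
    wav_filename = os.path.splitext(dst_filename)[0] + ".wav"
    torchaudio.save(wav_filename, noisy_speech, sample_rate)

    # Mux the original video stream with the noisy audio stream.
    in1 = ffmpeg.input(src_filename)
    in2 = ffmpeg.input(wav_filename)
    out = ffmpeg.output(in1['v'], in2['a'], dst_filename, loglevel="panic")
    out = out.overwrite_output()
    out.run()
    os.remove(wav_filename)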

In [8]:
src_filename = "/content/data/clip.mp4"
dst_filename = "/content/data/noisy_clip.mp4"
noise = "/content/data/babble_noise.wav"
create_noisy_clip(src_filename, dst_filename, noise, snr_level=-5)

In [9]:
play_video("/content/data/noisy_clip.mp4", width=300)
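
Here `snr_level=-5` means the speech power sits 5 dB below the noise power, so the babble is louder than the speaker. As a rough illustration of what mixing at a target SNR involves, the sketch below scales the noise until the speech-to-noise power ratio matches the requested value. It is a plain-Python approximation written for this tutorial (the names `power` and `mix_at_snr` are hypothetical), not the `torchaudio.functional.add_noise` implementation.

```python
import math

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db (in dB), then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

# Toy signals: a slow sine wave as "speech" and a faster one as "noise".
speech = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
noise = [0.3 * math.sin(2 * math.pi * 37 * t / 1000) for t in range(1000)]
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-5)
achieved = 10 * math.log10(power(speech) / power(scaled_noise))
print(f"achieved SNR: {achieved:.2f} dB")
```

Lower SNR values make the noise louder relative to the speech, which is what makes the audio-only transcript in the next section harder to get right.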

## Building an inference pipeline


In [10]:
import os
import torch
from pipelines.model import AVSR
from pipelines.data.data_module import AVSRDataLoader
from pipelines.detectors.mediapipe.detector import LandmarksDetector

class InferencePipeline(torch.nn.Module):
    def __init__(self, modality, model_path, model_conf, detector="mediapipe", face_track=False, device="cuda:0"):
        super(InferencePipeline, self).__init__()
        self.device = device
        # modality configuration
        self.modality = modality
        self.dataloader = AVSRDataLoader(modality, detector=detector)
        self.model = AVSR(modality, model_path, model_conf, rnnlm=None, rnnlm_conf=None, penalty=0.0, ctc_weight=0.1, lm_weight=0.0, beam_size=40, device=device)
        if face_track and self.modality in ["video", "audiovisual"]:
            self.landmarks_detector = LandmarksDetector()
        else:
            self.landmarks_detector = None


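    def process_landmarks(self, data_filename, landmarks_filename):
        # Audio-only inference needs no landmarks. In this tutorial
        # landmarks_filename is not used; landmarks are always detected from the video.
        if self.modality == "audio":
            return None
        if self.modality in ["video", "audiovisual"]:
            if self.landmarks_detector is None:
                raise ValueError("Build the pipeline with face_track=True for video or audiovisual input.")
            landmarks = self.landmarks_detector(data_filename)
            return landmarks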


    def forward(self, data_filename, landmarks_filename=None):
        assert os.path.isfile(data_filename), f"data_filename: {data_filename} does not exist."
        landmarks = self.process_landmarks(data_filename, landmarks_filename)
        data = self.dataloader.load_data(data_filename, landmarks)
        transcript = self.model.infer(data)
        return transcript

    def extract_features(self, data_filename, landmarks_filename=None, extract_resnet_feats=False):
        assert os.path.isfile(data_filename), f"data_filename: {data_filename} does not exist."
        landmarks = self.process_landmarks(data_filename, landmarks_filename)
        data = self.dataloader.load_data(data_filename, landmarks)
        with torch.no_grad():
            if isinstance(data, tuple):
                enc_feats = self.model.model.encode(data[0].to(self.device), data[1].to(self.device), extract_resnet_feats)
            else:
                enc_feats = self.model.model.encode(data.to(self.device), extract_resnet_feats)
        return enc_feats
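
The `AVSR` constructor call inside `InferencePipeline` passes `ctc_weight=0.1`, `lm_weight=0.0`, `penalty=0.0`, and `beam_size=40`. Assuming the decoder follows the usual hybrid CTC/attention beam search, where each hypothesis is ranked by a weighted sum of log-probabilities plus a length bonus, the weights would combine roughly as sketched below. The function name and the example scores are made up for illustration, and the repository's decoder may differ in detail.

```python
def joint_score(att_logp, ctc_logp, lm_logp, length,
                ctc_weight=0.1, lm_weight=0.0, penalty=0.0):
    """Weighted hypothesis score under an assumed hybrid CTC/attention scheme."""
    return ((1.0 - ctc_weight) * att_logp
            + ctc_weight * ctc_logp
            + lm_weight * lm_logp
            + penalty * length)

# Two made-up hypotheses: (attention log-prob, CTC log-prob, LM log-prob, token count).
hyp_a = (-4.0, -9.0, -12.0, 10)
hyp_b = (-4.5, -3.0, -20.0, 11)
print(joint_score(*hyp_a), joint_score(*hyp_b))
```

With `lm_weight=0.0` and `penalty=0.0`, as in this tutorial, the language-model and length terms drop out and only the attention and CTC scores decide the ranking.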

## Auto-AVSR functions

### Infer the noisy clip using an audio stream

1. Download an ASR checkpoint

In [11]:
%mkdir -p /content/data/
!wget http://www.doc.ic.ac.uk/~pm4115/autoAVSR/LRS3_A_WER1.0.zip -O /content/data/LRS3_A_WER1.0.zip
!unzip -o /content/data/LRS3_A_WER1.0.zip -d /content/data/

--2023-06-22 08:57:49--  http://www.doc.ic.ac.uk/~pm4115/autoAVSR/LRS3_A_WER1.0.zip
Resolving www.doc.ic.ac.uk (www.doc.ic.ac.uk)... 146.169.13.6
Connecting to www.doc.ic.ac.uk (www.doc.ic.ac.uk)|146.169.13.6|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 902649180 (861M) [application/zip]
Saving to: ‘/content/data/LRS3_A_WER1.0.zip’


2023-06-22 08:57:55 (149 MB/s) - ‘/content/data/LRS3_A_WER1.0.zip’ saved [902649180/902649180]

Archive:  /content/data/LRS3_A_WER1.0.zip
  inflating: /content/data/LRS3_A_WER1.0/model.json  
  inflating: /content/data/LRS3_A_WER1.0/model.pth  


2. Build an ASR pipeline

In [12]:
modality = "audio"
model_conf = "/content/data/LRS3_A_WER1.0/model.json"
model_path = "/content/data/LRS3_A_WER1.0/model.pth"
pipeline = InferencePipeline(modality, model_path, model_conf)

3. Infer the noisy clip using the audio stream.

In [13]:
transcript = pipeline("/content/data/noisy_clip.mp4")
print(transcript)

COMPLETELY UNCONSTRUED DEPARTMENTS WHERE WE HAVE LARGE CHANGES IN CATCALLS AND
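
Each checkpoint archive in this tutorial unpacks to a directory holding `model.json` and `model.pth`. A small check like the one below can confirm that both files are present before a pipeline is built. The helper name `checkpoint_files` is hypothetical and not part of the repository.

```python
import os

def checkpoint_files(checkpoint_dir):
    """Return (model_conf, model_path) for a checkpoint directory, or raise if a file is missing."""
    model_conf = os.path.join(checkpoint_dir, "model.json")
    model_path = os.path.join(checkpoint_dir, "model.pth")
    missing = [p for p in (model_conf, model_path) if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"missing checkpoint files: {missing}")
    return model_conf, model_path
```

For the ASR checkpoint above, `model_conf, model_path = checkpoint_files("/content/data/LRS3_A_WER1.0")` would return the same two paths that were written out by hand.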


### Infer the noisy clip using a video stream


1. Download a VSR checkpoint

In [14]:
%mkdir -p /content/data/
!wget http://www.doc.ic.ac.uk/~pm4115/autoAVSR/LRS3_V_WER19.1.zip -O /content/data/LRS3_V_WER19.1.zip
!unzip -o /content/data/LRS3_V_WER19.1.zip -d /content/data/

--2023-06-22 08:58:17--  http://www.doc.ic.ac.uk/~pm4115/autoAVSR/LRS3_V_WER19.1.zip
Resolving www.doc.ic.ac.uk (www.doc.ic.ac.uk)... 146.169.13.6
Connecting to www.doc.ic.ac.uk (www.doc.ic.ac.uk)|146.169.13.6|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 937274463 (894M) [application/zip]
Saving to: ‘/content/data/LRS3_V_WER19.1.zip’


2023-06-22 08:58:22 (179 MB/s) - ‘/content/data/LRS3_V_WER19.1.zip’ saved [937274463/937274463]

Archive:  /content/data/LRS3_V_WER19.1.zip
  inflating: /content/data/LRS3_V_WER19.1/model.json  
  inflating: /content/data/LRS3_V_WER19.1/model.pth  


2. Build a VSR pipeline

In [15]:
modality = "video"
model_conf = "/content/data/LRS3_V_WER19.1/model.json"
model_path = "/content/data/LRS3_V_WER19.1/model.pth"
pipeline = InferencePipeline(modality, model_path, model_conf, face_track=True)

3. Infer the noisy clip using the video stream

In [16]:
transcript = pipeline("/content/data/noisy_clip.mp4")
print(transcript)

COMPLETELY CONCENTRATED ENVIRONMENTS WHERE WE HAVE LARGE CHANGES IN GET POSTS AND


### Infer the noisy clip using both audio and visual streams

1. Download an AV-ASR checkpoint

In [17]:
%mkdir -p /content/data/
!wget http://www.doc.ic.ac.uk/~pm4115/autoAVSR/LRS3_AV_WER0.9.zip -O /content/data/LRS3_AV_WER0.9.zip
!unzip -o /content/data/LRS3_AV_WER0.9.zip -d /content/data/

--2023-06-22 08:58:47--  http://www.doc.ic.ac.uk/~pm4115/autoAVSR/LRS3_AV_WER0.9.zip
Resolving www.doc.ic.ac.uk (www.doc.ic.ac.uk)... 146.169.13.6
Connecting to www.doc.ic.ac.uk (www.doc.ic.ac.uk)|146.169.13.6|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1655043546 (1.5G) [application/zip]
Saving to: ‘/content/data/LRS3_AV_WER0.9.zip’


2023-06-22 08:58:59 (135 MB/s) - ‘/content/data/LRS3_AV_WER0.9.zip’ saved [1655043546/1655043546]

Archive:  /content/data/LRS3_AV_WER0.9.zip
  inflating: /content/data/LRS3_AV_WER0.9/model.json  
  inflating: /content/data/LRS3_AV_WER0.9/model.pth  


2. Build an AV-ASR pipeline

In [18]:
modality = "audiovisual"
model_conf = "/content/data/LRS3_AV_WER0.9/model.json"
model_path = "/content/data/LRS3_AV_WER0.9/model.pth"
pipeline = InferencePipeline(modality, model_path, model_conf, face_track=True)

3. Infer the noisy clip using both audio and video streams

In [19]:
transcript = pipeline("/content/data/noisy_clip.mp4")
print(transcript)

COMPLETELY CONSTRAINED ENVIRONMENTS WHERE WE HAVE LARGE CHANGES IN GET PLATFORMS
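
The three pipelines produce different transcripts for the same noisy clip. No reference transcript is provided in this tutorial, so a word error rate cannot be computed, but the disagreement between modalities can still be measured. The sketch below counts word-level edits between each pair of outputs printed above; the function name `word_edit_distance` is hypothetical.

```python
def word_edit_distance(first, second):
    """Levenshtein distance between two transcripts, counted in words."""
    a, b = first.split(), second.split()
    previous = list(range(len(b) + 1))
    for i, a_word in enumerate(a, start=1):
        current = [i]
        for j, b_word in enumerate(b, start=1):
            substitution = previous[j - 1] + (a_word != b_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

# Transcripts copied from the outputs of the three pipelines above.
transcripts = {
    "audio": "COMPLETELY UNCONSTRUED DEPARTMENTS WHERE WE HAVE LARGE CHANGES IN CATCALLS AND",
    "video": "COMPLETELY CONCENTRATED ENVIRONMENTS WHERE WE HAVE LARGE CHANGES IN GET POSTS AND",
    "audiovisual": "COMPLETELY CONSTRAINED ENVIRONMENTS WHERE WE HAVE LARGE CHANGES IN GET PLATFORMS",
}
names = list(transcripts)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        distance = word_edit_distance(transcripts[first], transcripts[second])
        print(f"{first} vs {second}: {distance} word edits")
```

Given a ground-truth transcript, dividing the edit distance by the number of reference words would give the usual word error rate.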


### Crop mouth ROIs


In [20]:
import cv2
import torchvision
from pipelines.data.data_module import AVSRDataLoader
from pipelines.detectors.mediapipe.detector import LandmarksDetector

def save2vid(filename, vid, frames_per_second):
    # Create the output directory if needed, then write the frames as a video.
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    torchvision.io.write_video(filename, vid, frames_per_second)

def preprocess_video(src_filename, dst_filename):
    # Detect landmarks, crop the mouth ROI, and save it at the source frame rate.
    landmarks = landmarks_detector(src_filename)
    data = dataloader.load_data(src_filename, landmarks)
    capture = cv2.VideoCapture(src_filename)
    fps = capture.get(cv2.CAP_PROP_FPS)
    capture.release()
    save2vid(dst_filename, data, fps)

# preprocess_video uses these two objects as globals.
dataloader = AVSRDataLoader(modality="video", speed_rate=1, transform=False, detector="mediapipe", convert_gray=False)
landmarks_detector = LandmarksDetector()

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [22]:
preprocess_video(src_filename="/content/data/clip.mp4", dst_filename="/content/data/roi.mp4")

In [23]:
play_video("/content/data/roi.mp4", width=300)
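
Conceptually, cropping a mouth ROI means locating the mouth from facial landmarks and cutting a fixed-size window around it. The sketch below shows that idea on plain coordinates. It is an illustration only: the function name, the 96-pixel window, and the landmark format are assumptions, and the repository's `AVSRDataLoader` may apply additional steps that are not shown here.

```python
def mouth_crop_box(mouth_points, frame_width, frame_height, size=96):
    """Return (left, top, right, bottom) of a size x size box centred on the mouth."""
    if size > frame_width or size > frame_height:
        raise ValueError("crop size exceeds frame dimensions")
    center_x = sum(x for x, _ in mouth_points) / len(mouth_points)
    center_y = sum(y for _, y in mouth_points) / len(mouth_points)
    # Clamp the top-left corner so the box stays inside the frame.
    left = min(max(int(round(center_x - size / 2)), 0), frame_width - size)
    top = min(max(int(round(center_y - size / 2)), 0), frame_height - size)
    return left, top, left + size, top + size

# Hypothetical mouth landmarks (x, y) in a 640x480 frame.
points = [(300, 350), (340, 350), (320, 340), (320, 360)]
print(mouth_crop_box(points, 640, 480))
```

Clamping keeps the crop the same size even when the mouth is close to the frame border, so every frame yields a window of identical shape.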

### Extract visual-only features

In [24]:
modality = "video"
model_conf = "/content/data/LRS3_V_WER19.1/model.json"
model_path = "/content/data/LRS3_V_WER19.1/model.pth"
pipeline = InferencePipeline(modality, model_path, model_conf, face_track=True)

[**Option 1**]. Extract features from the output of Conformer.

In [25]:
features = pipeline.extract_features("/content/data/clip.mp4")
print(features.size())

torch.Size([178, 768])


[**Option 2**]. Extract features from the output of ResNet.

In [26]:
features = pipeline.extract_features("/content/data/clip.mp4", extract_resnet_feats=True)
print(features.size())

torch.Size([178, 512])
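
Both options return a two-dimensional tensor whose first axis (178 here) runs over time and whose second axis is the feature dimension. One simple way to turn such a sequence into a single clip-level vector is to average over time. The sketch below does this on nested lists so that it runs without a checkpoint; with the tensors returned above, `features.mean(dim=0)` would be the equivalent call. The helper names are hypothetical.

```python
import math

def mean_pool(frames):
    """Average a sequence of equal-length feature vectors over the time axis."""
    num_frames = len(frames)
    return [sum(column) / num_frames for column in zip(*frames)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for two clips: 3 time steps x 4 dimensions each.
clip_a = [[1.0, 0.0, 2.0, 1.0], [3.0, 0.0, 2.0, 1.0], [2.0, 0.0, 2.0, 1.0]]
clip_b = [[2.0, 0.0, 2.0, 1.0], [2.0, 0.0, 2.0, 1.0], [2.0, 0.0, 2.0, 1.0]]
print(cosine_similarity(mean_pool(clip_a), mean_pool(clip_b)))
```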
