## Setup

In [1]:
!nvidia-smi

Sun May 28 12:54:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.14                 Driver Version: 531.14       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1650       WDDM | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P8                5W /  N/A|      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

##### Install required libraries

In [2]:
# # Specify your cuda version if needed
# !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# # Tensorboard
# !pip install tensorboard
# !pip install tensorboardX
# # Other
# !pip install opencv-python
# !pip install nltk
# !pip install editdistance

##### Import everything that will be used in this notebook

In [3]:
import os
import cv2
import glob
import shutil
import zipfile
import math
import time
import re
import copy
import json
import random
import editdistance

import nltk

import torch

import numpy as np

import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
import torch.optim as optim

from typing import List, Tuple, Dict

from nltk.corpus import cmudict

from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from tensorboardX import SummaryWriter

from collections import Counter, defaultdict, deque

### Downloads:
## ------------------------------------------------
# Download the CMU Pronouncing Dictionary
nltk.download('cmudict')

[nltk_data] Downloading package cmudict to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package cmudict is already up-to-date!


True

The model used in this notebook was implemented in this [GitHub repo](https://github.com/nazarkohut/LipNet-PyTorch.git), so uncomment below if you want to clone that project and find out what's in it.

In [4]:
# !git clone https://github.com/nazarkohut/LipNet-PyTorch.git /content/drive/MyDrive/nn_course_work/LipNet-PyTorch/

Connect to Drive(if all your data is in there).

In [5]:
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

Here we will use some boolean value to know if we get data from Drive or not and pathes to easily use relative pathes(which makes code more readable).

In [6]:
# drive=True
# drive_path='/content/drive/MyDrive'
#
# # init dummy variable that will not be used, but is essential for code to work
# data_path=''

Uncomment below cell if you run notebook localy(without connection to Google Drive)

In [7]:
drive = False
data_path = '/Users/admin/PycharmProjects'

# init dummy variable that will not be used, but is essential for code to work
drive_path = ''

Let's create two separate methods to form paths from relative paths based on where our data is(we use 2 methods for readability).

In [8]:
def form_path_from_drive(relative_path, drive_path=drive_path):
    return f"{drive_path}{relative_path}"


def form_path_from_relative_and_prefix(relative_path, prefix_path=data_path):
    return f"{prefix_path}{relative_path}"

Creating every variable may introduce a lot of redundant lines that will make code less readable, so let's implement method to avoid such behaviour.

In [9]:
def create_list_of_paths(relative_paths):
    global drive
    res = []
    for relative_path in relative_paths:
        if drive:
            res.append(form_path_from_drive(relative_path))
        else:
            res.append(form_path_from_relative_and_prefix(relative_path))
    return res

Initializing device variables.

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cpu_device = torch.device('cpu')

## Data preparation and preprocessing

**These are separate files for each aspect of dataset:**
* front view videos
* side view videos
* alignments
* metadata

All of them are zip files, so we need some method to extract data from zip files.

In [39]:
def extract_zip(zip_file_path, dest_folder_path):
    os.makedirs(dest_folder_path, exist_ok=True)
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(dest_folder_path)

Now let's create variables that reference our zip files and destination folders(it is where we will place our data). 

In [11]:
front_view_videos_zip = "/nn_course_work/data/lombardgrid_front.zip"
front_view_videos_folder = "/nn_course_work/data/front_view_videos"

side_view_videos_zip = "/nn_course_work/data/lombardgrid_side.zip"
side_view_videos_folder = "/nn_course_work/data/side_view_videos"

align_zip = "/nn_course_work/data/lombardgrid_alignment.zip"
align_folder = "/nn_course_work/data/alignment_view_videos"

metadata_zip = "/nn_course_work/data/lombardgrid_json.zip"
metadata_folder = "/nn_course_work/data/metadata"

zip_folder_list = [(front_view_videos_zip, front_view_videos_folder),
                   (side_view_videos_zip, side_view_videos_folder),
                   (align_zip, align_folder), (metadata_zip, metadata_folder)]

Also, it would be convinient for us to create a method that would take into account the enviroment in which we run our `.ipnb` file.

In [12]:
def create_initial_zip_folder_pathes(relative_pathes: List[Tuple[str, str]]):
    global drive
    zip_folder_list = []
    for zip_path, folder_path in relative_pathes:
        if drive:
            zip_folder_list.append((form_path_from_drive(zip_path), form_path_from_drive(folder_path)))
        else:
            zip_folder_list.append(
                (form_path_from_relative_and_prefix(zip_path), form_path_from_relative_and_prefix(folder_path)))
    return zip_folder_list

Finally, we can unpack our data from zip files.

In [13]:
zip_folder_path_list = create_initial_zip_folder_pathes(zip_folder_list)
zip_folder_path_list

[('/Users/admin/PycharmProjects/nn_course_work/data/lombardgrid_front.zip',
  '/Users/admin/PycharmProjects/nn_course_work/data/front_view_videos'),
 ('/Users/admin/PycharmProjects/nn_course_work/data/lombardgrid_side.zip',
  '/Users/admin/PycharmProjects/nn_course_work/data/side_view_videos'),
 ('/Users/admin/PycharmProjects/nn_course_work/data/lombardgrid_alignment.zip',
  '/Users/admin/PycharmProjects/nn_course_work/data/alignment_view_videos'),
 ('/Users/admin/PycharmProjects/nn_course_work/data/lombardgrid_json.zip',
  '/Users/admin/PycharmProjects/nn_course_work/data/metadata')]

Implement wrapper method to extract few files in one call.

In [114]:
def extract_multiple_zip_files(zip_folder_full_path_list: List[Tuple[str, str]]):
    for zip_file_path, dest_folder_path in zip_folder_full_path_list:
        extract_zip(zip_file_path, dest_folder_path)

Call wrapper method to unpack all zip files.

In [115]:
extract_multiple_zip_files(zip_folder_path_list)

Now let's check how many files we have in every folder.

In [85]:
def output_number_of_files_in_folder(complete_folder_pathes: List[str]):
    for folder_path in complete_folder_pathes:
        print(len(os.listdir(folder_path)))

There is a need to create mapper due to the nature of dataset folders.

In [87]:
destmapper = {"front_view_videos": "/front", "side_view_videos": "/side", "alignment_view_videos": "/alignment"}

destination_folders_list = [f"{dest_path}/lombardgrid{destmapper.get(dest_path.split('/')[-1])}" for
                            (zip_path, dest_path) in zip_folder_path_list]

output_number_of_files_in_folder(complete_folder_pathes=destination_folders_list)

5390
5390
5361


*To feed our videos our Neural Networks we need to change its representation. The video consists of seperate frames, so let’s create methods with which it will be easy to extract these photos.*

In [88]:
def extract_frames(video_path, output_folder, fps, subfolder=None):
    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Open the video file
    cap = cv2.VideoCapture(video_path)

    # Check if video file was opened correctly
    if not cap.isOpened():
        print(f"Could not open video {video_path}. Skipping this video.")
        return

    # Get the video FPS and total number of frames
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # Calculate the frame interval to extract frames at the specified fps
    frame_interval = int(video_fps / fps) if video_fps >= fps else 1

    # Initialize frame counter
    frame_count = 0

    # Loop through the video frames
    while True:
        # Read a frame from the video
        ret, frame = cap.read()

        # If there are no more frames, break the loop
        if not ret:
            break

        # Extract frames at the specified fps
        if frame_count % frame_interval == 0:
            # Create subfolder path if it is provided
            output_subfolder = os.path.join(output_folder, subfolder) if subfolder else output_folder

            # Create subfolder if it doesn't exist
            os.makedirs(output_subfolder, exist_ok=True)

            # Save the frame as an image file
            frame_file = os.path.join(output_subfolder, f"frame_{frame_count:05d}.jpg")
            cv2.imwrite(frame_file, frame)

        # Increment the frame counter
        frame_count += 1

    # Release the video file
    cap.release()


def extract_frames_for_all_videos(video_folder, output_folder, fps=25):
    # Get a list of all video files in the input folder
    video_files = glob.glob(os.path.join(video_folder, '*.mov'))

    # Loop over each video file
    for video_file in video_files:
        # Parse the video file name to get the speaker and video number
        file_name = os.path.basename(video_file)
        speaker, video_number, subfolder = file_name.split("_")[0:3]

        # Remove extension
        subfolder = subfolder[:-4]

        # Create the output directory for this video's frames
        video_output_folder = os.path.join(output_folder, speaker, video_number)

        # Call your function to extract the frames
        extract_frames(video_file, video_output_folder, fps, subfolder=subfolder)

Now, we can extract frames for every video we have

In [101]:
relative_front_video_folder = "/nn_course_work/data/front_view_videos/lombardgrid/front"
relative_front_output_folder = "/nn_course_work/clean_data/video_frames/front"

relative_side_video_folder = "/nn_course_work/data/side_view_videos/lombardgrid/side"
relative_side_output_folder = "/nn_course_work/clean_data/video_frames/side"

relative_paths = [relative_front_video_folder, relative_front_output_folder, relative_side_video_folder,
                  relative_side_output_folder]

folders = create_list_of_paths(relative_paths)
front_tuple, side_tuple = folders[:2], folders[2:]

front_video_folder, front_output_folder = front_tuple
side_video_folder, side_output_folder = side_tuple

Call to extract front videos(videos of people that look directly at camera).

In [102]:
extract_frames_for_all_videos(front_video_folder, front_output_folder)

Could not open video /Users/admin/PycharmProjects/nn_course_work/data/front_view_videos/lombardgrid/front\s32_l_pwip9p.mov. Skipping this video.
Could not open video /Users/admin/PycharmProjects/nn_course_work/data/front_view_videos/lombardgrid/front\s32_p_bwwj2n.mov. Skipping this video.
Could not open video /Users/admin/PycharmProjects/nn_course_work/data/front_view_videos/lombardgrid/front\s33_l_pwajza.mov. Skipping this video.
Could not open video /Users/admin/PycharmProjects/nn_course_work/data/front_view_videos/lombardgrid/front\s33_p_sgwq2s.mov. Skipping this video.


Call to extract side videos(the camera is filming people from the side).

In [104]:
extract_frames_for_all_videos(side_video_folder, side_output_folder)

Parsing takes a lot of time, so I do not want to parse all the data locally, in case the necessity arises, so I need a method that will create zip files which I will easily unpack. 

In this case, I won't need to do all the steps from scratch, which saves me a lot of time.

In [10]:
def create_zip_from_folder(folder_path, zip_filename):
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, _, files in os.walk(folder_path):
            for file in files:
                file_path = os.path.join(root, file)
                zipf.write(file_path, os.path.relpath(file_path, folder_path))

Here is how of I used it

In [None]:
# # Example usage
# folder_path = '/content/tmp_files'
# zip_filename = 'frames.zip'
# create_zip_from_folder(folder_path, zip_filename)

Here is how I extracted my zip files

In [None]:
# extract_zip("/content/drive/MyDrive/nn_course_work/clean_data/frames.zip", "/content/drive/MyDrive/nn_course_work/clean_data/video_frames/")

At first I had a confusion on how my data should be ordered, so I created below method to place all videos for specific speaker into particular folder.

In [None]:
def organize_videos(video_folder, output_folder):
    # Get a list of all video files in the input folder
    video_files = glob.glob(os.path.join(video_folder, '*.mov'))

    # Loop over each video file
    for video_file in video_files:
        # Parse the video file name to get the speaker and video number
        file_name = os.path.basename(video_file)
        speaker, sub_dir_name, video_number = file_name.split("_")[0:3]

        # Create the output directory for this speaker if it does not exist
        speaker_folder = os.path.join(output_folder, speaker)
        if not os.path.exists(speaker_folder):
            os.makedirs(speaker_folder)

        # Create the subdirectory inside the parent directory
        sub_dir_path = os.path.join(speaker_folder, sub_dir_name)
        if not os.path.exists(sub_dir_path):
            os.mkdir(sub_dir_path)

        # Copy the video file to the new directory
        shutil.copy(video_file, os.path.join(sub_dir_path, video_number))

Here is how I used it.

In [None]:
# video_folder = "/content/drive/MyDrive/nn_course_work/data/front_view_videos/lombardgrid/front"
# output_folder = "/content/drive/MyDrive/nn_course_work/clean_data/videos/"
#
# organize_videos(video_folder, output_folder)

*Current format of aligns is not acceptable, because it uses phonemes, but we would love to predict words* and not phonemes, so we will have to create methods that will *map phonemes to words*(this task would not be trivial without description from Lombard Grid Corpus dataset and Grid Corpus dataset which use similar sentences structure).

In [119]:
def decode_utterance_code(code):
    command_mapping = {'b': 'bin', 'l': 'lay', 'p': 'place', 's': 'set'}
    color_mapping = {'b': 'blue', 'g': 'green', 'r': 'red', 'w': 'white'}
    preposition_mapping = {'a': 'at', 'b': 'by', 'i': 'in', 'w': 'with'}
    letter_mapping = {chr(i): chr(i).upper() for i in range(ord('a'), ord('z') + 1)}

    digit_mapping = {
        'z': 'zero',  # The case which is not funny at all, I was so confused
        '1': 'one',
        '2': 'two',
        '3': 'three',
        '4': 'four',
        '5': 'five',
        '6': 'six',
        '7': 'seven',
        '8': 'eight',
        '9': 'nine'
    }
    adverb_mapping = {'a': 'again', 'n': 'now', 'p': 'please', 's': 'soon'}

    command = command_mapping[code[0]]
    color = color_mapping[code[1]]
    preposition = preposition_mapping[code[2]]
    letter = letter_mapping[code[3]]

    digit = digit_mapping[code[4]]

    adverb = adverb_mapping[code[5]]

    sentence = f"{command} {color} {preposition} {letter} {digit} {adverb}"

    return sentence


sentence = decode_utterance_code('braczp')
print(sentence)

bin red at C zero please


In the above code, we decode utterance code and get a sentence from it, in the next cell we have to create methods that will help us aggregate offset, duration and phonemes, meanwhile adding a word that corresponds to these values.

In [120]:
# Initialize the CMU Pronouncing Dictionary
pron_dict = cmudict.dict()


def arpabet_to_ipa(arpabet):
    """
    Convert Arpabet to IPA notation
    """
    return arpabet.lower()


def words_to_phonemes(words):
    """
    Convert a list of words to a list of phonemes
    """
    phonemes = []
    for word in words:
        if word.lower() in pron_dict:
            word_phonemes = pron_dict[word.lower()][0]  # Use the first pronunciation
            phonemes.extend(word_phonemes)
    return phonemes


def phoneme_to_word_mapping(sentence, phoneme_data):
    """
    Align phonemes with their corresponding words
    """
    words = sentence.split()
    sentence_phonemes = words_to_phonemes(words)
    word_index = 0
    word_phoneme_count = len(words_to_phonemes([words[word_index]]))
    phoneme_count = 0
    word_data = []
    last_phoneme_obj = None
    for phoneme_obj in phoneme_data:
        phoneme = phoneme_obj['phone'].split('_')[0]
        if phoneme == 'SIL':  # Ignore 'SIL' phonemes
            continue
        if last_phoneme_obj and last_phoneme_obj['phone'] == phoneme:  # Merge phoneme parts
            last_phoneme_obj['duration'] += phoneme_obj['duration']
        else:
            phoneme_count += 1
            phoneme_obj['phone'] = phoneme
            if word_index < len(words):
                phoneme_obj['word'] = words[word_index]
            last_phoneme_obj = phoneme_obj
            word_data.append(phoneme_obj)
        if phoneme_count >= word_phoneme_count and word_index < len(words) - 1:
            word_index += 1
            word_phoneme_count += len(words_to_phonemes([words[word_index]]))
    return word_data


def aggregate_words(word_data):
    """
    Merge same sequential words and adjust offset and duration
    """
    aggregated_words = []
    for word_obj in word_data:
        if not aggregated_words or aggregated_words[-1]['word'] != word_obj['word']:
            # If the list is empty or the last word is different from the current one, add the current word
            word_obj['phone'] = [word_obj['phone']]  # Initialize 'phone' as a list
            aggregated_words.append(word_obj)
        else:
            # If the last word is the same as the current one, merge them by increasing the duration and appending the phoneme
            aggregated_words[-1]['duration'] += word_obj['duration']
            aggregated_words[-1]['phone'].append(word_obj['phone'])  # Append the phoneme to the list
    return aggregated_words


def json_with_phonemes_to_align_with_words(phoneme_data: List[Dict], utterance_code: str):
    """
    This function takes the phoneme data and the utterance code as inputs and aligns the phonemes with their corresponding words.

    Parameters:
    phoneme_data (List[Dict]): A list of dictionaries, where each dictionary represents data of a phoneme.
    utterance_code (str): A code representing the utterance or sentence that needs to be decoded.

    Returns:
    aggregated_words (List[Dict]): A list of dictionaries, where each dictionary represents a word with its corresponding phonemes, offset, and duration.
    """
    sentence = decode_utterance_code(utterance_code)
    word_data = phoneme_to_word_mapping(sentence, phoneme_data)
    aggregated_words = aggregate_words(word_data)
    return aggregated_words

*Great, now we need some method to convert all values to the format that would work for us.*

The format that we need:

```
18 43 bin
46 84 blue
84 112 in
112 138 M
138 184 three
184 240 again
```



In [121]:
def convert_json_to_align(json_path, align_path):
    # Read the json file
    with open(json_path, 'r') as f:
        data = json.load(f)

    for key, values in data.items():
        curr_utterance_code = key.split('_')[2]
        data[key] = json_with_phonemes_to_align_with_words(values, curr_utterance_code)

    # Open the align file for writing
    with open(align_path, 'w') as f:
        # Loop over each item in the json data
        for key, values in data.items():
            for item in values:
                # Convert duration and offset from seconds to hundredths of a second
                start_time = int(item['offset'] * 100)
                end_time = int((item['offset'] + item['duration']) * 100)

                # Write the line to the align file
                f.write(f"{start_time} {end_time} {item['word']}\n")


def convert_all_json_files(json_folder, align_folder):
    # Get a list of all json files in the input folder
    json_files = glob.glob(os.path.join(json_folder, '*.json'))

    # Loop over each json file
    for json_file in json_files:
        # # Parse the video file name to get the speaker and video number
        file_name = os.path.basename(json_file)
        speaker, sub_dir_name, align_number = file_name.split("_")[0:3]

        # Remove the '.json' extension from align_number2
        align_number = align_number.split(".")[0]

        # # Create the output directory for this speaker if it does not exist
        speaker_folder = os.path.join(align_folder, speaker)  # Change output_folder to align_folder
        if not os.path.exists(speaker_folder):
            os.makedirs(speaker_folder)

        # # Create the subdirectory inside the parent directory
        sub_dir_path = os.path.join(speaker_folder, sub_dir_name)
        if not os.path.exists(sub_dir_path):
            # print("Sub True")
            os.mkdir(sub_dir_path)

        # Determine the align file path
        align_file = os.path.join(sub_dir_path, f"{align_number}.align")

        # Convert the json file to an align file
        convert_json_to_align(json_file, align_file)

Let's extract our align files, so that we can use them later.

In [122]:
align_relative_json_folder = "/nn_course_work/data/alignment_view_videos/lombardgrid/alignment/"

relative_align_folder = "/nn_course_work/clean_data/aligns/"

relative_paths = [align_relative_json_folder, relative_align_folder]

json_folder, align_folder = create_list_of_paths(relative_paths)

convert_all_json_files(json_folder, align_folder)

In case, something went wrong it would be great to be able to remove all data from our folder, so let's create method for this.

In [11]:
def remove_folder(folder_path):
    shutil.rmtree(folder_path)

# # Example usage
# folder_path = '/content/drive/MyDrive/nn_course_work/videos'
# remove_folder(folder_path)

At some point of time my Google Drive became full, so I had to copy all data from shared drive to my university account, because it does not have limits on how much data I can store.

In [3]:
def copy_directory(source_dir, dest_dir):
    # Create the destination directory if it doesn't exist
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)

    # Use shutil to copy the content of the source directory to the destination directory
    for item in os.listdir(source_dir):
        s = os.path.join(source_dir, item)
        d = os.path.join(dest_dir, item)
        if os.path.isdir(s):
            shutil.copytree(s, d, dirs_exist_ok=True)
        else:
            shutil.copy2(s, d)


def copy_multiple_dirs_into_another_directory(source_dir_list, dest_dir):
    for src_dir in source_dir_list:
        copy_directory(src_dir, dest_dir)

Here is how I used it.

In [9]:
# Define the source and destination directories
# source_dirs = [form_path_from_drive('/clean_data/'), form_path_from_drive('/data/')]

# source_dirs = [form_path_from_drive('/clean_data/video_frames')]
# dest_dir = '/content/drive/MyDrive/nn_course_work/video_frames/'

# # Copy directories
# copy_multiple_dirs_into_another_directory(source_dirs, dest_dir)

*One of the important things for our model are `.txt` files that contain relative references to directories with video frames.*

In [14]:
def write_paths_to_file(root_dir, file_name, folder_ends, exclude_speakers: set or None = None):
    with open(file_name, 'w') as f:
        for root, dirs, files in os.walk(root_dir):
            for folder in dirs:
                if folder in folder_ends:
                    folder_path = os.path.join(root, folder).replace('\\', '/')
                    speaker = folder_path.split('/')[-2]
                    for subfolder in os.listdir(folder_path):
                        if len(subfolder) != 6:
                            continue

                        if exclude_speakers and speaker in exclude_speakers:
                            continue

                        subfolder_path = os.path.join(folder_path, subfolder).replace('\\', '/')
                        if os.path.isdir(subfolder_path):
                            relative_path = subfolder_path.replace(root_dir, '', 1).strip('/')
                            f.write(relative_path + '\n')


Let's call above method to fulfill the `.txt` files, but before doing this let's initialize some variables and the method that will help us in dividing txt files into both train and test.

In [15]:
def divide_data(original_file, train_file, test_file, test_ratio=0.2, seed=42):
    with open(original_file, 'r') as f:
        lines = f.readlines()

    # Set the seed
    random.seed(seed)

    # Shuffle the lines to get videos for different speakers
    random.shuffle(lines)

    # Return seed to default
    random.seed()

    num_test = int(test_ratio * len(lines))
    test_lines = lines[:num_test]
    train_lines = lines[num_test:]

    with open(train_file, 'w') as f:
        for line in train_lines:
            f.write(line)

    with open(test_file, 'w') as f:
        for line in test_lines:
            f.write(line)


# Specify the number of speakers in the dataset
number_of_speakers_in_dataset = 55

# Create a set of all speakers
all_speakers = {f"s{i}" for i in range(1, number_of_speakers_in_dataset + 1)}

# Specify the excluded speakers, so that we can form unseen speakers data
train_excluded_speakers = {'s3', 's4', 's7', 's10', 's14', 's23', 's29', 's32', 's37', 's38', 's40', 's47', 's50', 's53', 's55'}
test_excluded_speakers = copy.deepcopy(train_excluded_speakers)
test_excluded_speakers_unseen = all_speakers - test_excluded_speakers

# We use only front directory for root directory assuming that every front has corresponding side
relative_root_directory = '/nn_course_work/clean_data/video_frames/front'

# Absolute root path
root_directory = create_list_of_paths([relative_root_directory])[0]

It's time to make the calls🙃.

**First of all, let's initialize one type videos; by one type I mean only *lombard* or only *plain* videos.**

In [16]:
# Define the other relative paths that include one type of recordings(lombard, noisy or plain)
relative_train_noisy_output_file = '/nn_course_work/clean_data/txt_files/one_type_only/train_noisy_files.txt'
relative_test_plain_output_file = '/nn_course_work/clean_data/txt_files/one_type_only/test_plain_files.txt'
relative_test_noisy_unseen_output_file = '/nn_course_work/clean_data/txt_files/one_type_only/test_noisy_unseen_files.txt'
relative_test_plain_unseen_output_file = '/nn_course_work/clean_data/txt_files/one_type_only/test_plain_unseen_files.txt'

## Obtain the absolute paths
# Init real(full) paths for one type only
one_type_only_paths = create_list_of_paths([relative_train_noisy_output_file, relative_test_plain_output_file, relative_test_noisy_unseen_output_file, relative_test_plain_unseen_output_file])

# Unpack them
train_noisy_output_file, test_plain_output_file, test_noisy_unseen_output_file, test_plain_unseen_output_file = one_type_only_paths

# Write the paths to the train, test, and test unseen output files
write_paths_to_file(root_directory, train_noisy_output_file, folder_ends={'l'}, exclude_speakers=train_excluded_speakers)
write_paths_to_file(root_directory, test_plain_output_file, folder_ends={'p'}, exclude_speakers=test_excluded_speakers)
write_paths_to_file(root_directory, test_noisy_unseen_output_file, folder_ends={'l'}, exclude_speakers=test_excluded_speakers_unseen)
write_paths_to_file(root_directory, test_plain_unseen_output_file, folder_ends={'p'}, exclude_speakers=test_excluded_speakers_unseen)

*And here let's init that are close to the one we saw in [GitHub](https://github.com/VIPL-Audio-Visual-Speech-Understanding/LipNet-PyTorch) repository.*

In [17]:
# Define the other relative paths that include both plain and lombard
relative_full_data_output_file = '/nn_course_work/clean_data/txt_files/full/data_files.txt'
relative_full_unseen_data_output_file = '/nn_course_work/clean_data/txt_files/full/unseen_data_files.txt'

# Same paths, but will be used to store train and test data(paths) in different txt files
relative_train_output_file = '/nn_course_work/clean_data/txt_files/full/train_overlap_files.txt'
relative_test_output_file = '/nn_course_work/clean_data/txt_files/full/test_overlap_files.txt'
relative_train_unseen_output_file = '/nn_course_work/clean_data/txt_files/full/train_unseen_files.txt'
relative_test_unseen_output_file = '/nn_course_work/clean_data/txt_files/full/test_unseen_files.txt'

## Obtain the absolute paths
full_data_paths_list = [relative_full_data_output_file, relative_full_unseen_data_output_file]
test_and_train_data_paths_list = [relative_train_output_file, relative_test_output_file, relative_train_unseen_output_file, relative_test_unseen_output_file]


# Init real(full) paths for one type only
all_paths = create_list_of_paths(full_data_paths_list + test_and_train_data_paths_list)

# Unpack them
full_data_output_file, full_unseen_data_output_file, train_output_file, test_output_file, train_unseen_output_file, test_unseen_output_file = all_paths

# Init all folder ends
both_folder_ends = {'l', 'p'}

# Write the paths for full data
write_paths_to_file(root_directory, full_data_output_file, folder_ends=both_folder_ends, exclude_speakers=train_excluded_speakers)
write_paths_to_file(root_directory, full_unseen_data_output_file, folder_ends=both_folder_ends, exclude_speakers=test_excluded_speakers)

# divide data into train and test and write it to files
divide_data(full_data_output_file, train_output_file, test_output_file)
divide_data(full_unseen_data_output_file, train_unseen_output_file, test_unseen_output_file)

#### *Dataset and dataloaders creation*

Pretrained LipNet from repository mentioned previously uses `width=128` and `height=64`, so I would need some method to make default LipNet(the one from the original repository, so that I can use weights) shapes and all other shapes that will be used in this notebook.

In [18]:
class Shape:
    def __init__(self, width=128, height=64):
        self.width=width
        self.height=height

    def get_shape(self) -> tuple:
        return self.width, self.height

Let us init default shapes that we will use to train our models.

In [19]:
DEFAULT_LIPNET_IMAGE_SHAPE = Shape().get_shape()

Finally, let's create the dataset class with which we will be able to create dataloaders easily.

In [20]:
class MyDataset(Dataset):
    letters = [' ', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
               'U', 'V', 'W', 'X', 'Y', 'Z']

    def __init__(self, video_path, anno_path, file_list, vid_pad, txt_pad, phase, resize_shape=DEFAULT_LIPNET_IMAGE_SHAPE):
        self.anno_path = anno_path
        self.vid_pad = vid_pad
        self.txt_pad = txt_pad
        self.phase = phase
        self.size_of_image_reshapes = resize_shape

        with open(file_list, 'r') as f:
            self.videos = [os.path.join(video_path, line.strip()) for line in f.readlines()]

        self.data = []
        for vid in self.videos:
            items = vid.split('/')
            self.data.append((vid, items[-3], items[-1]))

    def __getitem__(self, idx):
        (vid, spk, name) = self.data[idx]
        type_of_vid = vid.replace('\\', '/').split('/')[-2] # l or p
        vid = self._load_vid(vid)

        anno = self._load_anno(os.path.join(self.anno_path, spk, type_of_vid, name + '.align').replace('\\', '/'))

        # if(self.phase == 'train'):
        #     vid = cv2.flip(vid, 1)

        vid = self._normalize(vid)

        vid_len = vid.shape[0]
        anno_len = anno.shape[0]
        vid = self._padding(vid, self.vid_pad)
        anno = self._padding(anno, self.txt_pad)

        return {'vid': torch.FloatTensor(vid.transpose(3, 0, 1, 2)), # [68]
                'txt': torch.LongTensor(anno),
                'txt_len': anno_len,
                'vid_len': vid_len}

    def __len__(self):
        return len(self.data)

    def _padding(self, array, length):
        array = [array[_] for _ in range(array.shape[0])]
        size = array[0].shape
        for i in range(length - len(array)):
            array.append(np.zeros(size))
        return np.stack(array, axis=0)

    def _load_vid(self, p):
        files = os.listdir(p)
        files = list(filter(lambda file: file.find('.jpg') != -1, files))
        files = sorted(files, key=lambda file: os.path.splitext(file)[0])
        array = [cv2.imread(os.path.join(p, file)) for file in files]
        array = list(filter(lambda im: not im is None, array))
        array = [cv2.resize(im, self.size_of_image_reshapes, interpolation=cv2.INTER_LANCZOS4) for im in array]
        array = np.stack(array, axis=0).astype(np.float32)
        return array

    def _load_anno(self, name):
        with open(name, 'r') as f:
            lines = [line.strip().split(' ') for line in f.readlines()]
            txt = [line[2] for line in lines]
            txt = list(filter(lambda s: not s.upper() in ['SIL', 'SP'], txt))
        return MyDataset.txt2arr(' '.join(txt).upper(), 1)

    @staticmethod
    def _normalize(vid):
        vid = vid.astype('float32')
        vid /= 255.0
        vid -= 0.5
        vid *= 2.0
        return vid

    @staticmethod
    def txt2arr(txt, start):
        arr = []
        for c in list(txt):
            arr.append([MyDataset.letters.index(ci) + start for ci in c])
        return np.array(arr)

    @staticmethod
    def arr2txt(arr, start):
        txt = []
        for n in arr:
            if n >= start:
                txt.append(MyDataset.letters[n - start])
        return ''.join(txt).strip()

    @staticmethod
    def ctc_arr2txt(arr, start):
        pre = -1
        txt = []
        for n in arr:
            if pre != n and n >= start:
                if len(txt) > 0 and txt[-1] == ' ' and MyDataset.letters[n - start] == ' ':
                    pass
                else:
                    txt.append(MyDataset.letters[n - start])
            pre = n
        return ''.join(txt).strip()

    @staticmethod
    def wer(predict, truth):
        word_pairs = [(p[0].split(' '), p[1].split(' ')) for p in zip(predict, truth)]
        wer = [1.0 * editdistance.eval(p[0], p[1]) / len(p[1]) for p in word_pairs]
        return wer

    @staticmethod
    def cer(predict, truth):
        cer = [1.0 * editdistance.eval(p[0], p[1]) / len(p[1]) for p in zip(predict, truth)]
        return cer

Before we create our models and train/test methods it would be great to initialize some required methods that will make our life easier.

In [21]:
# Initialize a SummaryWriter for logging

logs_dir = form_path_from_relative_and_prefix('/nn_course_work/logs/')
writer = SummaryWriter(logs_dir)


# Function to create a DataLoader from a dataset
def dataset2dataloader(dataset, batch_size=32, num_workers=0, shuffle=True):
    return DataLoader(dataset,
                      batch_size=batch_size,
                      shuffle=shuffle,
                      num_workers=num_workers,
                      drop_last=False)


# Function to calculate the mean learning rate from the optimizer
def show_lr(optimizer):
    lr = []
    for param_group in optimizer.param_groups:
        lr += [param_group['lr']]
    return np.array(lr).mean()


# Function for CTC decoding of predicted outputs
def ctc_decode(y):
    y = y.argmax(-1)
    return [MyDataset.ctc_arr2txt(y[_], start=1) for _ in range(y.size(0))]


# Function to obtain the paths for the LipNet dataset
def get_lipnet_paths(relative_anno, relative_video, relative_list):
    relative_paths_list = [relative_anno, relative_video, relative_list]
    anno_path, video_path, video_list_path = create_list_of_paths(relative_paths_list)
    return anno_path, video_path, video_list_path


def get_dataset(relative_anno_path, relative_front_video_path, relative_train_list_path, vid_padding=150,
                txt_padding=300, method='train'):
    # Obtain the annotation, video, and train list paths using the get_lipnet_paths function
    anno_path, video_path, train_list_path = get_lipnet_paths(relative_anno_path, relative_front_video_path,
                                                              relative_train_list_path)

    # Create the LombardGrid train dataset using MyDataset class
    dataset = MyDataset(video_path,
                        anno_path,
                        train_list_path,
                        vid_padding,
                        txt_padding,
                        method)

    return dataset

Let's prepare some paths and other standard data, so that we can easily create dataset variables.

In [22]:
# Number of samples in each batch
batch_size = 8
# Relative paths that point out required dir and files for dataset class
relative_anno_path = "/nn_course_work/clean_data/aligns/"
# Front and side paths
relative_front_video_path = "/nn_course_work/clean_data/video_frames/front/"
relative_side_video_path = "/nn_course_work/clean_data/video_frames/side/"

# Reference only one type paths, so that it is easy to understand which path is which(also it is easier to find out where out value lies)
relative_train_noisy_list_path = relative_train_noisy_output_file
relative_test_plain_list_path = relative_test_plain_output_file
relative_test_noisy_unseen_list_path = relative_test_noisy_unseen_output_file
relative_test_plain_unseen_list_path = relative_test_plain_unseen_output_file


And now we can init our dataset variables.

In [23]:
# Obtain train datasets
lombardgrid_train_noisy_front_dataset = get_dataset(relative_anno_path, relative_front_video_path, relative_train_noisy_list_path)
lombardgrid_train_noisy_side_dataset = get_dataset(relative_anno_path, relative_side_video_path, relative_train_noisy_list_path)

# Obtain test datasets
lombardgrid_test_plain_front_dataset = get_dataset(relative_anno_path, relative_front_video_path, relative_test_plain_list_path)
lombardgrid_test_plain_side_dataset = get_dataset(relative_anno_path, relative_side_video_path, relative_test_plain_list_path)

lombardgrid_test_plain_unseen_front_dataset = get_dataset(relative_anno_path, relative_front_video_path,
                                                          relative_test_plain_unseen_list_path)
lombardgrid_test_plain_unseen_side_dataset = get_dataset(relative_anno_path, relative_side_video_path,
                                                         relative_test_plain_unseen_list_path)

Let's create dataloaders.

In [25]:
# Create a DataLoader for the LombardGrid(to train on front videos dataset)
lombardgrid_train_noisy_front_dataloader = dataset2dataloader(lombardgrid_train_noisy_front_dataset, batch_size)

# Create a DataLoader for the LombardGrid(to train on side videos dataset)
lombardgrid_train_noisy_side_dataloader = dataset2dataloader(lombardgrid_train_noisy_side_dataset, batch_size)

# Create test DataLoaders for the datasets
lombardgrid_test_plain_front_dataloader = dataset2dataloader(lombardgrid_test_plain_front_dataset, batch_size)
lombardgrid_test_plain_side_dataloader = dataset2dataloader(lombardgrid_test_plain_side_dataset, batch_size)

lombardgrid_test_plain_unseen_front_dataloader = dataset2dataloader(lombardgrid_test_plain_unseen_front_dataset, batch_size)
lombardgrid_test_plain_unseen_side_dataloader = dataset2dataloader(lombardgrid_test_plain_unseen_side_dataset, batch_size)

# Print dataset sizes
print('Number of train front data:{}'.format(len(lombardgrid_train_noisy_front_dataset.data)))
print('Number of train side data:{}'.format(len(lombardgrid_train_noisy_side_dataset.data)))
print('Number of test front data:{}'.format(len(lombardgrid_test_plain_front_dataset.data)))
print('Number of test side data:{}'.format(len(lombardgrid_test_plain_side_dataset.data)))
print('Number of test unseen front data:{}'.format(len(lombardgrid_test_plain_unseen_front_dataset.data)))
print('Number of test unseen side data:{}'.format(len(lombardgrid_test_plain_unseen_side_dataset.data)))

Number of train front data:1793
Number of train side data:1793
Number of test front data:1825
Number of test side data:1825
Number of test unseen front data:693
Number of test unseen side data:693


#### *Train, Test and other useful methods*

First of all, we need test, because we may use it in train method(After training model on some number of epochs we could just bring the model to evaluation stage and test and save model at the same time).

In [26]:
def test(model, dataloader, device=device, display_mod=1):
    with torch.no_grad():
        model.eval()
        loss_list = []
        wer = []
        cer = []
        crit = nn.CTCLoss()
        tic = time.time()
        for (i_iter, input) in enumerate(dataloader):
            vid = input.get('vid').to(device)
            txt = input.get('txt').to(device)
            vid_len = input.get('vid_len').to(device)
            txt_len = input.get('txt_len').to(device)

            y = model(vid)

            txt = txt.squeeze(-1)

            loss = crit(y.transpose(0, 1).log_softmax(-1), txt, vid_len.view(-1),
                        txt_len.view(-1)).detach().cpu().numpy()
            
            loss_list.append(loss)
            pred_txt = ctc_decode(y)

            truth_txt = [MyDataset.arr2txt(txt[_], start=1) for _ in range(txt.size(0))]
            wer.extend(MyDataset.wer(pred_txt, truth_txt))
            cer.extend(MyDataset.cer(pred_txt, truth_txt))
            if i_iter % display_mod == 0:
                v = 1.0 * (time.time() - tic) / (i_iter + 1)
                eta = v * (len(dataloader) - i_iter) / 3600.0

                print(''.join(101 * '-'))
                print('{:<50}|{:>50}'.format('predict', 'truth'))
                print(''.join(101 * '-'))
                for (predict, truth) in list(zip(pred_txt, truth_txt))[:10]:
                    print('{:<50}|{:>50}'.format(predict, truth))
                print(''.join(101 * '-'))
                print(
                    'test_iter={},eta={},wer={},cer={}'.format(i_iter, eta, np.array(wer).mean(), np.array(cer).mean()))
                print(''.join(101 * '-'))

        overall_metrics = (np.array(loss_list).mean(), np.array(wer).mean(), np.array(cer).mean())
        metrics_history = (loss_list, wer, cer)
        return overall_metrics, metrics_history

Now,let's create `train` method.

In [27]:
def train(model, dataloader, epochs, optimizer_callback=optim.Adam, optimizer_kwargs={'weight_decay': 0, 'amsgrad':True}, device=device, base_lr=0.0001, display_mod=1, **kwargs):
    print("Loading options...")

    optimizer = optimizer_callback(model.parameters(),
                           lr=base_lr,
                           **optimizer_kwargs)

    crit = nn.CTCLoss()
    tic = time.time()

    train_wer = []
    for epoch in range(0, epochs):
        for (i_iter, input) in enumerate(dataloader):
            model.train()
            vid = input.get('vid').to(device)
            txt = input.get('txt').to(device)
#             txt = txt.type(torch.LongTensor)
#             txt = txt.to(device)
            vid_len = input.get('vid_len').to(device)
            txt_len = input.get('txt_len').to(device)
#             vid = vid.type(FloatTensor)
#             vid = vid.type(torch.LongTensor)
            optimizer.zero_grad()
            y = model(vid)
            # print(txt.shape)
            # Removing one dimension
#             squeezed_tensor = txt.squeeze()
            
            txt = txt.squeeze(-1)

#             txt = squeezed_tensor
            
            # print(y.transpose(0, 1).log_softmax(-1).shape, txt.shape, vid_len.shape, txt_len.shape)
            loss = crit(y.transpose(0, 1).log_softmax(-1), txt, vid_len.view(-1), txt_len.view(-1))

            loss.backward()
            optimizer.step()

            tot_iter = i_iter + epoch * len(dataloader)
            pred_txt = ctc_decode(y)

            truth_txt = [MyDataset.arr2txt(txt[_], start=1) for _ in range(txt.size(0))]
            train_wer.extend(MyDataset.wer(pred_txt, truth_txt))

            if epoch % display_mod == 0:
                print("here")
                v = 1.0 * (time.time() - tic) / (tot_iter + 1)
                eta = (len(dataloader) - i_iter) * v / 3600.0
                current_model_name = kwargs.get('model_name', 'LipNet')
                writer.add_scalar(f"{current_model_name} train loss", loss, tot_iter)
                writer.flush()
                writer.add_scalar(f"{current_model_name} train wer", np.array(train_wer).mean(), tot_iter)
                writer.flush()
                print(''.join(101 * '-'))
                print('{:<50}|{:>50}'.format('predict', 'truth'))
                print(''.join(101 * '-'))

                for (predict, truth) in list(zip(pred_txt, truth_txt))[:3]:
                    print('{:<50}|{:>150}'.format(predict, truth))
                print(''.join(101 * '-'))
                print('epoch={},tot_iter={},eta={},loss={},train_wer={}'.format(epoch, tot_iter, eta, loss,
                                                                                np.array(train_wer).mean()))
                print(''.join(101 * '-'))

            test_and_save_model = kwargs.get('test_and_save_model', None)

            if test_and_save_model:

                if epoch % kwargs.get('test_step') == 0:

                    (test_loss, wer, cer), metrics_history = test(**kwargs.get('params'))
                    print('i_iter={},lr={},loss={},wer={},cer={}'
                          .format(tot_iter, show_lr(optimizer), test_loss, wer, cer))
                    writer.add_scalar('val loss', test_loss, tot_iter)
                    writer.flush()
                    writer.add_scalar('wer', wer, tot_iter)
                    writer.flush()
                    writer.add_scalar('cer', cer, tot_iter)
                    writer.flush()

                    current_model_name = kwargs.get('model_name', 'LipNet')

                    savename = '{}_loss_{}_wer_{}_cer_{}.pt'.format(
                        kwargs.get('save_prefix', f"weights/{current_model_name}/weight_{time.time()}/"), test_loss, wer, cer)
                    (path, name) = os.path.split(savename)
                    if not os.path.exists(path):
                        os.makedirs(path)
                    torch.save(model.state_dict(), savename)

            # Specify max_total_iteration to finish training before all epochs
            if kwargs.get('max_total_iteration_number', math.inf) <= tot_iter:
                return model

We will fine-tune models and it would be great to have some method to do this.

In [28]:
def freeze_layers(model, n, display_current_param_grads=True, freeze_first_layers=False):
    # Get a list of all parameters in the model
    initial_params = list(model.parameters())
    if n > len(initial_params):
        print("The parameter n is too big make sure it is not bigger than number of layers.\n")
        print("IMPORTANT: The layers have not been frozen.")
        return model

    # Freeze the first n layers
    params = initial_params[-n:] if not freeze_first_layers else initial_params[:n]
    for param in params:
        param.requires_grad = False

    param_grads = [param.requires_grad for param in list(model.parameters())]
    if display_current_param_grads:
        print("Layers:", param_grads)
        print(Counter(param_grads))
    return model

As we will use pretrained models we would have to load weights, let's create wrapper method for `load_state_dict` method.

In [29]:
def load_model_weights(model, weights_path, device):
    model.load_state_dict(torch.load(weights_path, map_location=device))
    return model

To easily save models in directories which might not exist it would be great to have separate method.

In [30]:
def save_model_weights(model, absolute_path_for_output):
    # Create the directory if it doesn't exist
    os.makedirs(os.path.dirname(absolute_path_for_output), exist_ok=True)

    # Save the model weights
    torch.save(model.state_dict(), absolute_path_for_output)

As we use tensorboard we need a method to plot our test metric.

In [31]:
def plot_metric_in_tensorboard(metric_history, metric_name: str, board_name: str):
    global writer
    for epoch, metric in enumerate(metric_history):
        writer.add_scalars(board_name, {metric_name: metric}, epoch)
        writer.flush()
    writer.close()

And a wrapper for the previous method, because usually, we will use 2 metrics(WER, CER) and loss, so we need to do it conveniently.

In [32]:
def plot_metrics_in_tensorboard(metrics_and_names: List[Tuple]):
    for metric_history, metric_name, board_name in metrics_and_names:
        plot_metric_in_tensorboard(metric_history, metric_name, board_name)

## **Architecture of LipNET and Simple LipNET**

##### *LipNet*

In [33]:
class LipNet(torch.nn.Module):
    def __init__(self, dropout_p=0.5):
        super().__init__()
        self.conv1 = nn.Conv3d(3, 32, (3, 5, 5), (1, 2, 2), (1, 2, 2))
        self.pool1 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))

        self.conv2 = nn.Conv3d(32, 64, (3, 5, 5), (1, 1, 1), (1, 2, 2))
        self.pool2 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))

        self.conv3 = nn.Conv3d(64, 96, (3, 3, 3), (1, 1, 1), (1, 1, 1))
        self.pool3 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))

        self.gru1 = nn.GRU(96 * 4 * 8, 256, 1, bidirectional=True)
        self.gru2 = nn.GRU(512, 256, 1, bidirectional=True)

        self.FC = nn.Linear(512, 27 + 1)  # original LipNet
        # self.FC    = nn.Linear(512, 16+1) # tmp
        self.dropout_p = dropout_p

        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(self.dropout_p)
        self.dropout3d = nn.Dropout3d(self.dropout_p)
        self._weights_init()

    def _weights_init(self):

        init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        init.constant_(self.conv1.bias, 0)

        init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        init.constant_(self.conv2.bias, 0)

        init.kaiming_normal_(self.conv3.weight, nonlinearity='relu')
        init.constant_(self.conv3.bias, 0)

        init.kaiming_normal_(self.FC.weight, nonlinearity='sigmoid')
        init.constant_(self.FC.bias, 0)

        for m in (self.gru1, self.gru2):
            stdv = math.sqrt(2 / (96 * 3 * 6 + 256))
            for i in range(0, 256 * 3, 256):
                init.uniform_(m.weight_ih_l0[i: i + 256],
                              -math.sqrt(3) * stdv, math.sqrt(3) * stdv)
                init.orthogonal_(m.weight_hh_l0[i: i + 256])
                init.constant_(m.bias_ih_l0[i: i + 256], 0)
                init.uniform_(m.weight_ih_l0_reverse[i: i + 256],
                              -math.sqrt(3) * stdv, math.sqrt(3) * stdv)
                init.orthogonal_(m.weight_hh_l0_reverse[i: i + 256])
                init.constant_(m.bias_ih_l0_reverse[i: i + 256], 0)

    def forward(self, x):

        x = self.conv1(x)
        x = self.relu(x)
        x = self.dropout3d(x)
        x = self.pool1(x)

        x = self.conv2(x)
        x = self.relu(x)
        x = self.dropout3d(x)
        x = self.pool2(x)

        x = self.conv3(x)
        x = self.relu(x)
        x = self.dropout3d(x)
        x = self.pool3(x)

        # (B, C, T, H, W)->(T, B, C, H, W)
        x = x.permute(2, 0, 1, 3, 4).contiguous()
        # (B, C, T, H, W)->(T, B, C*H*W)
        x = x.view(x.size(0), x.size(1), -1)

        self.gru1.flatten_parameters()
        self.gru2.flatten_parameters()

        x, h = self.gru1(x)
        x = self.dropout(x)
        x, h = self.gru2(x)
        x = self.dropout(x)

        x = self.FC(x)
        x = x.permute(1, 0, 2).contiguous()
        return x

##### *Simple LipNet*

In [34]:
class SimpleLipNet(torch.nn.Module):
    def __init__(self, dropout_p=0.5):
        super().__init__()

        self.conv = nn.Conv3d(3, 32, (3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
        self.pool = nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2))
        self.gru = nn.GRU(65536, 256, 1, bidirectional=True)
        self.FC = nn.Linear(512, 27 + 1)
        self.dropout_p = dropout_p

        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(self.dropout_p)
        self.dropout3d = nn.Dropout3d(self.dropout_p)

    def forward(self, x):
        x = self.conv(x)
        x = self.relu(x)
        x = self.dropout3d(x)
        x = self.pool(x)

        x = x.permute(2, 0, 1, 3, 4).contiguous()
        x = x.view(x.size(0), x.size(1), -1)

        self.gru.flatten_parameters()
        x, _ = self.gru(x)
        x = self.dropout(x)

        x = self.FC(x)
        x = x.permute(1, 0, 2).contiguous()
        return x


## *Training models on whole faces*

As you may know the weights we have were trained on mouth area, but even so we use them as the starting point to see how the model performs on faces. In this section we also review how the model performs on side view faces vs front view faces.



*And to be honest at first I did not notice that they used mouth area only, so I have trained models on whole faces until I realized that 128x64 is far from what we have in our video frames(720x480) after that I started reading [paper](https://arxiv.org/abs/1611.01599) more carefully and wrote lip and mouth area crop scrips.*

### SimpleLipNet

First of all, let's train `SimpleLipNet` model on side faces and see how it performs.

In [31]:
simple_lipnet = SimpleLipNet()
simple_lipnet = simple_lipnet.to(cpu_device)

train(simple_lipnet, lombardgrid_train_noisy_side_dataloader, 1, device=cpu_device, **{'model_name': 'SimpleLipNet'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
CBOXJCPCRLHSPCXQLXLSBCBXRPEPSRPEJSLRLCPBCACACVAVAPVAOAVBWVGCRACVPSCJVAVCACVCZJQVCVCVCVCOAJZACVACAGCVAQARZUJCVRCVQVCUVZAJROWCV|                                                                                                                                 BIN BLUE AT M ONE NOW
JWJWIJWNJPWEWDJIJAJPJWJKJKMJWJIJDJEJMIJKJGRKJZGPWJBJKRWZWEBJIAJRZJIQZCJWJEQWBXAWXJEJOJWZJTAOWAZKORWAWZJ GWRJFIJWJWPJQW|                                                                                                                             PLACE RED BY Y ONE PLEASE
KPSEWOIKOWECIOEMOKGIRIEIUXIJUXUICOCBOITEUOCOIRCITOWPJIEIORJREROURURUORUBRDRWURURURURUOKHBURURURUVJRORZRURZURZRURU|         

We can see that above architecture predicted empty strings at first, but later it gave a shot for a guess of one or two character that go consequently. At this point it is difficult to say how the model will perform, so we may save the weights and try training it on more epochs and using plain side data.

In [34]:
weights_save_path = create_list_of_paths(["/nn_course_work/Simple_LIPNET/weights/side_noisy_1.pth"])[0]
save_model_weights(simple_lipnet, weights_save_path)

Let's load the weights(we do this, so that if we run code next time there will be no need to train model from scratch).

In [36]:
simple_lipnet = load_model_weights(simple_lipnet , weights_save_path, device)

# Assign new variable as the main intention of lombardgrid_test_plain_side_dataloader was using it as test data for lipnet model, but here(simplelipnet) we won't use it for testing.
lombardgrid_train_plain_side_dataloader = lombardgrid_test_plain_side_dataloader
train(simple_lipnet, lombardgrid_train_plain_side_dataloader, 2, device=cpu_device, **{'model_name': 'SimpleLipNet'})

Loading options...
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LNNNNNNNNNNNNNNNNNNNNNNN                          |                                                                                                                           SET GREEN BY R THREE PLEASE
NNNNNNNNNNNNN                                     |                                                                                                                             SET WHITE WITH E FIVE NOW
NNNNNNNNNNNNN                                     |                                                                                                                             SET RED BY J THREE PLEASE
---------------------------------------------------------------------

We won't test this model as from predictions it is clear that the model is not able to predict Speech(at least on this stage).

### **LipNET**

We have two files with weights for LipNET, so let's download both of them and try them out on our data.

Initializing the model and loading weights.

In [115]:
pretrained_lipnet = LipNet()
pretrained_lipnet.to(device)


# Specify path to weights and load them into
weights_path = "LIPNET_original_weights/LipNet_overlap_loss_0.07664558291435242_wer_0.04644484056248762_cer_0.019676921477851092 (1).pt"
pretrained_lipnet.load_state_dict(torch.load(weights_path))

<All keys matched successfully>

Let's evaluate the model.

In [116]:
test(pretrained_lipnet, lombardgrid_test_plain_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
IT WHITE WITH I TNSSEN WHITE WITH O FEN           |                           BIN RED BY B ONE PLEASE
IT WHITE WITH T TNN WHITE WITH U ZEEN             |                       SET WHITE WITH D TWO PLEASE
SET WHITE WATH T SN                               |                         LAY BLUE AT E FIVE PLEASE
PLT WHIE WITH H SN                                |                          SET RED AT O FOUR PLEASE
SET WHITE ITH T SN WHT AT M ZEN                   |                        BIN RED WITH V NINE PLEASE
SET WHITE WITH T S WHTE ATH M FEN                 |                         LAY GREEN WITH G FIVE NOW
IT WHITE WITH T TNN WHT ITH T ZE                  |                          SET B

((6.5112514, 1.2137351598173516, 0.8532662809934636),
 ([array(6.012312, dtype=float32),
   array(6.7095633, dtype=float32),
   array(6.55056, dtype=float32),
   array(7.025367, dtype=float32),
   array(6.5995407, dtype=float32),
   array(5.387609, dtype=float32),
   array(5.9868765, dtype=float32),
   array(6.5832, dtype=float32),
   array(5.6628485, dtype=float32),
   array(7.2165956, dtype=float32),
   array(6.67186, dtype=float32),
   array(6.2709484, dtype=float32),
   array(6.6673183, dtype=float32),
   array(5.3873863, dtype=float32),
   array(6.3064866, dtype=float32),
   array(6.7428246, dtype=float32),
   array(5.5375605, dtype=float32),
   array(6.5459785, dtype=float32),
   array(6.333563, dtype=float32),
   array(6.3145347, dtype=float32),
   array(7.01388, dtype=float32),
   array(6.5476494, dtype=float32),
   array(6.252098, dtype=float32),
   array(5.8124247, dtype=float32),
   array(6.4311123, dtype=float32),
   array(7.5283914, dtype=float32),
   array(6.561282, dtype

Let's upload other weights that said to perform better than the weights from original paper.

In [121]:
pretrained_lipnet = LipNet()
pretrained_lipnet.to(device)


# Specify path to weights and load them into
weights_path = "LIPNET_original_weights/LipNet_unseen_loss_0.44562849402427673_wer_0.1332580699113564_cer_0.06796452465503355.pt"
pretrained_lipnet = load_model_weights(pretrained_lipnet, weights_path, device)

It is time to test the weights.

In [122]:
overall_pretrained_metrics,  pretrained_metrics_history = test(pretrained_lipnet, lombardgrid_test_plain_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
BIN BRUE BIN I ED FUF F B N                       |                            SET RED IN U NINE SOON
BIN BLIE IN                                       |                            SET RED AT X EIGHT NOW
IAN WHN H U                                       |                            PLACE RED IN Z TWO NOW
BIN IN IN S                                       |                       BIN GREEN IN B SEVEN PLEASE
BIN GHIEN IN                                      |                      LAY GREEN WITH B ZERO PLEASE
BIN BLUE IN BON I                                 |                       SET RED WITH Q THREE PLEASE
NN                                                |                           LAY 

We can see that they have smaller error rates than previous weights, so we probably will use them more often in this notebook.

Now it would be great to write them down into tensorboard, so that the results are more interpretable.

In [132]:
pretrained_loss_list, pretrained_wer_list, pretrained_cer_list = pretrained_metrics_history
board_name = 'Pretrained_LipNet_unseen_weights'
pretrained_list_for_metrics_and_names = [(pretrained_loss_list, 'val loss', board_name), (pretrained_wer_list, 'wer', board_name), (pretrained_cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(pretrained_list_for_metrics_and_names)

Well, we may start *fine-tuning the model on front side view videos that have a lombard effect it may help us to improve the performance of the model in the situations that are closer to real-world dialogs*(not only because of lombard effect, but because in Lombard Grid Corpus there was a man who asked speakers to say the sentence one more time which changed the way how people communicated).

In [26]:
# device = torch.device('cuda')
lipnet_fine_tuned = LipNet()
lipnet_fine_tuned.to(device)

# Specify path to weights and load them into
weights_path = "LIPNET_original_weights/LipNet_overlap_loss_0.07664558291435242_wer_0.04644484056248762_cer_0.019676921477851092 (1).pt"
lipnet_fine_tuned.load_state_dict(torch.load(weights_path))

# Freeze layers, we start with 2
freeze_layers(lipnet_fine_tuned, 2, freeze_first_layers=False)

Layers: [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False]
Counter({True: 22, False: 2})


LipNet(
  (conv1): Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2))
  (pool1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 1, 1), padding=(1, 2, 2))
  (pool2): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
  (conv3): Conv3d(64, 96, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
  (pool3): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
  (gru1): GRU(3072, 256, bidirectional=True)
  (gru2): GRU(512, 256, bidirectional=True)
  (FC): Linear(in_features=512, out_features=28, bias=True)
  (relu): ReLU(inplace=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (dropout3d): Dropout3d(p=0.5, inplace=False)
)

We freeze last layers, because we train models on whole faces, meaning we need to adjust filters and first GRU layer to work with whole faces.

In [27]:
fine_tuned_lipnet_test_kwargs_for_train = {'model': lipnet_fine_tuned, 'dataloader': lombardgrid_train_noisy_front_dataloader, 'device':device, 'display_mod': 100}
fine_tuned_lipnet_train_kwargs = {'test_and_save_model': True, 'test_step': 2, 'params': fine_tuned_lipnet_test_kwargs_for_train, 'model_name': 'FineTunedLipNet'}
train(lipnet_fine_tuned, lombardgrid_train_noisy_front_dataloader, 10) # **fine_tuned_lipnet_train_kwargs)

Loading options...
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
ST WHITE ITH Z SNE                                |                                                                                                                              SET RED WITH J ZERO SOON
BEC WHITE WITH T TONNTN                           |                                                                                                                                    SET RED IN U SEVEN
TAT WAT A TWTN NE E                               |                                                                                                                               SET BLUE AT D SIX AGAIN
---------------------------------------------------------------------

Let's save the model in case we want to use or train it later.

In [36]:
# Save the model weights
torch.save(lipnet_fine_tuned.state_dict(), f"/Users/admin/PycharmProjects/nn_course_work/LIPNET_fine_tuned/weights/front_many_speakers_{2}.pth")


We can try training the model almost from scratch the only difference will be the use of weights we have in order to save us some time.

In [35]:
weights_path = "LIPNET_original_weights/LipNet_unseen_loss_0.44562849402427673_wer_0.1332580699113564_cer_0.06796452465503355.pt"

Let's use above weights in our model.

In [None]:
lipnet_plain_front = LipNet()
lipnet_plain_front = lipnet_plain_front.to(device)
lipnet_plain_front = load_model_weights(lipnet_plain_front, weights_path, device)
train(lipnet_plain_front, lombardgrid_test_plain_front_dataloader, 10)

*There output was too long due to mistake made in parameters that control output, so the original output was deleted.*

<details><summary>Click here to see the output for last epoch</summary>

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAT WREE BT C FEE LGAOW S EE NSNA                 |                                                                                                                            PLACE BLUE WITH Z NINE NOW
LAT GRUTEN IT P TIRE AOAONI Z Z ZER OV            |                                                                                                                            PLACE GREEN BY T NINE SOON
SIY BRE IT X FIVE LGONWNLEEER EZE A               |                                                                                                                               BIN RED BY N FOUR AGAIN
-----------------------------------------------------------------------------------------------------
epoch=11,tot_iter=2531,eta=0.259840351075839,loss=1.344794750213623,train_wer=1.3400069379057435
-----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAT WRIE IT I FIE SNASNWNEN TR OIA                |                                                                                                                                SET WHITE IN I TWO NOW
BAT RTEN IT O FIE LGASEW O N                      |                                                                                                                             SET WHITE AT J FOUR AGAIN
LAT RUED IT R TIRE PSOAOENW T ZIRE A              |                                                                                                                                LAY BLUE AT Y NINE NOW
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
epoch=11,tot_iter=2532,eta=0.25864000196377845,loss=1.2597496509552002,train_wer=1.3400042931919882
-----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAC GRUE IT X SIE LEONWW S FI                     |                                                                                                                              BIN WHITE IN H ONE AGAIN
LEN WRUE IT K FERE AOONWN T E OSAH                |                                                                                                                                BIN RED IN I ZERO SOON
SAT GRE IT M FINE LOAON                           |                                                                                                                             SET BLUE IN U FOUR PLEASE
-----------------------------------------------------------------------------------------------------
epoch=11,tot_iter=2533,eta=0.2574455263500338,loss=1.3339658975601196,train_wer=1.339968639102088
-----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAT GRUED IT D SOVT PLGASNSNWESEE Z ZIR SGA       |                                                                                                                                SET BLUE BY I SIX SOON
LAC GRIEN IT P TIVE LEAON T EIE A                 |                                                                                                                              PLACE WHITE WITH F SEVEN
LAT RE IT M SIRE AGAONW ZHE ESE                   |                                                                                                                         PLACE RED WITH K THREE PLEASE
-----------------------------------------------------------------------------------------------------
epoch=11,tot_iter=2534,eta=0.25624167054352104,loss=1.3747767210006714,train_wer=1.3399726113283505
-----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAT GRUE IT F FIRO LESN ZRE A                     |                                                                                                                             LAY RED AT N THREE PLEASE
LET WRE IT T FIE AEIN T                           |                                                                                                                             PLACE RED WITH E ONE SOON
LAT BREE BT S FIVE SOASNET I A                    |                                                                                                                           PLACE GREEN AT R FIVE AGAIN
-----------------------------------------------------------------------------------------------------
epoch=11,tot_iter=2535,eta=0.2550366528498324,loss=1.2807831764221191,train_wer=1.3399369980043871
-----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAT RE IT Y TIRE SOON ZI                          |                                                                                                                             SET RED WITH X FIVE AGAIN
BEY BRIEN IN S SIUE PSGAON                        |                                                                                                                            SET BLUE AT B EIGHT PLEASE
LAT GREN BT V FIVE PLEAONWN                       |                                                                                                                               BIN RED WITH V TWO SOON
-----------------------------------------------------------------------------------------------------
epoch=11,tot_iter=2536,eta=0.25383718439821906,loss=1.2519567012786865,train_wer=1.3398601975039979
-----------------------------------------------------------------------------------------------------

</details>


In [69]:
extract_zip("/Users/admin/PycharmProjects/nn_course_work/data/lip_frames/lips_15_17.zip", "/Users/admin/PycharmProjects/nn_course_work/clean_data/video_frames_lips")

## *Training models on lips*

***In [paper](https://arxiv.org/abs/1611.01599) they only mention that they use mouth region and there were photos of what they feed the LipNet. Even so we are going to use lip frames to see how pretrained LipNet will behave and, perhaps, we will train SimpleLipNet or/and LipNet on our lip frames.***

After we extracted lip photos from all video frames we are able to create new dataloaders, by following all of the steps we did previously.

In [30]:
# Specify the number of speakers in the dataset
# Create a set of all speakers that has corresponding lip crop files
all_speakers = {'s15', 's16', 's17', 's18', 's19', 's20'}

# Specify the excluded speakers, so that we can form unseen speakers data
train_excluded_speakers = {'s15', 's16', 's17'}
test_excluded_speakers = copy.deepcopy(train_excluded_speakers)
test_excluded_speakers_unseen = all_speakers - test_excluded_speakers

# We use only front directory for root directory assuming that every front has corresponding side
relative_root_directory = '/nn_course_work/clean_data/video_frames_lips/front'

# Absolute root path
root_directory = create_list_of_paths([relative_root_directory])[0]

Just as before we need to write paths to our txt files.

In [31]:
relative_train_plain_output_file = '/nn_course_work/clean_data/txt_files/lips/train_plain_files.txt'
relative_test_noisy_output_file = '/nn_course_work/clean_data/txt_files/lips/test_noisy_files.txt'
relative_test_noisy_unseen_output_file = '/nn_course_work/clean_data/txt_files/lips/test_noisy_unseen_files.txt'
relative_test_plain_unseen_output_file = '/nn_course_work/clean_data/txt_files/lips/test_plain_unseen_files.txt'


lips_paths = create_list_of_paths([relative_train_plain_output_file, relative_test_noisy_output_file, relative_test_noisy_unseen_output_file, relative_test_plain_unseen_output_file])

# Unpack them
train_plain_output_file, test_noisy_output_file, test_noisy_unseen_output_file, test_plain_unseen_output_file = lips_paths

# Write the paths to the train, test, and test unseen output files
write_paths_to_file(root_directory, train_plain_output_file, folder_ends={'p'}, exclude_speakers=train_excluded_speakers)
write_paths_to_file(root_directory, test_noisy_output_file, folder_ends={'l'}, exclude_speakers=test_excluded_speakers)
write_paths_to_file(root_directory, test_noisy_unseen_output_file, folder_ends={'l'}, exclude_speakers=test_excluded_speakers_unseen)
write_paths_to_file(root_directory, test_plain_unseen_output_file, folder_ends={'p'}, exclude_speakers=test_excluded_speakers_unseen)

Setting variables that we will use later.

In [32]:
# Number of samples in each batch
batch_size = 8
# Relative paths that point out required dir and files for dataset class
relative_anno_path = "/nn_course_work/clean_data/aligns/"
# Front and side paths
relative_front_video_path = "/nn_course_work/clean_data/video_frames_lips/front/"

Some paths remain the same, so to make things clear we just reference previous variable values.

In [33]:
relative_train_plain_list_path = relative_train_plain_output_file
relative_test_noisy_list_path = relative_test_noisy_output_file
relative_test_plain_unseen_list_path = relative_test_noisy_unseen_output_file
relative_test_noisy_unseen_list_path = relative_test_noisy_unseen_output_file

Now, we can create datasets.

In [34]:
# Obtain train datasets
lombardgrid_train_plain_front_dataset = get_dataset(relative_anno_path, relative_front_video_path, relative_train_plain_list_path)

# Obtain test datasets
lombardgrid_test_noisy_front_dataset = get_dataset(relative_anno_path, relative_front_video_path, relative_test_noisy_list_path)

lombardgrid_test_plain_unseen_front_dataset = get_dataset(relative_anno_path, relative_front_video_path,
                                                          relative_test_plain_unseen_list_path)

lombardgrid_test_noisy_unseen_front_dataset = get_dataset(relative_anno_path, relative_front_video_path,
                                                          relative_test_noisy_unseen_list_path)

Finally, just as previously we create dataloaders for lip frames.

In [35]:
# Create a DataLoader for the LombardGrid(to train on front videos dataset)
lombardgrid_train_plain_front_dataloader = dataset2dataloader(lombardgrid_train_plain_front_dataset, batch_size)

# Create test DataLoaders for the datasets
lombardgrid_test_noisy_front_dataloader = dataset2dataloader(lombardgrid_test_noisy_front_dataset, batch_size)

lombardgrid_test_plain_unseen_front_dataloader = dataset2dataloader(lombardgrid_test_plain_unseen_front_dataset, batch_size)
lombardgrid_test_noisy_unseen_front_dataloader = dataset2dataloader(lombardgrid_test_noisy_unseen_front_dataset, batch_size)

# Print dataset sizes
print('Number of train front data:{}'.format(len(lombardgrid_train_plain_front_dataset.data)))
print('Number of test front data:{}'.format(len(lombardgrid_test_noisy_front_dataset.data)))
print('Number of test unseen front data:{}'.format(len(lombardgrid_test_plain_unseen_front_dataset.data)))
print('Number of test unseen front data:{}'.format(len(lombardgrid_test_noisy_unseen_front_dataset.data)))

Number of train front data:150
Number of test front data:149
Number of test unseen front data:149
Number of test unseen front data:149


Just like previously let's test pretrained model.

In [83]:
pretrained_lipnet = LipNet()
pretrained_lipnet.to(device)


# Specify path to weights and load them into
weights_path = "LIPNET_original_weights/LipNet_unseen_loss_0.44562849402427673_wer_0.1332580699113564_cer_0.06796452465503355.pt"
pretrained_lipnet = load_model_weights(pretrained_lipnet, weights_path, device)

Here we call the test method on data with lombard effect.

In [84]:
overall_pretrained_metrics,  pretrained_metrics_history = test(pretrained_lipnet, lombardgrid_test_noisy_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
N LYLU OU                                         |                      PLACE WHITE AT F THREE AGAIN
BIBBIN BLUE WIT WITH OU                           |                           BIN BLUE WITH G SIX NOW
BY G N B I BBI M W M                              |                          PLACE WHITE AT M TWO NOW
UBBI IYFIV U                                      |                       PLACE GREEN IN F FIVE AGAIN
R BUB BI WW OU                                    |                            LAY WHITE AT B ONE NOW
B UHR BIB BI W OU                                 |                         LAY BLUE BY N THREE AGAIN
I BA WIH I                                        |                         PLACE 

Clearly, the format of the images they used is different(probably, frames they used in schemes are the one they used), but it is not the big deal, so let's try training SimpleLipNet model and see how it performs on images of lips.

In [119]:
simple_lipnet = SimpleLipNet()
simple_lipnet = simple_lipnet.to(cpu_device)

train(simple_lipnet, lombardgrid_train_plain_front_dataloader, 10, device=cpu_device, **{'model_name': 'SimpleLipNet'})

Loading options...
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
JXKC KCK CKIMZVKCUMWKJXITKU MXKLTBX MHSBCMKXC M MH CHKH WF NFNFN HZNCXNCNCNUVG GXF VNFCNCX N NFGN F F N CNFGNGCNFCNFNFN YX|                                                                                                                            PLACE WHITE WITH A TWO NOW
BRXJXBKRURTPT XRX YX R AJXZ M R THCIE EXE N XIYKN X XNX Y Q Q NX I J Q E A X V Q YNXQ IVX KW V X|                                                                                                                            PLACE WHITE IN J EIGHT NOW
JKRRANCYYKZTNYAXLXAL YNHNKTXZKLNX ZXZAXFZ CNRW WF PDJ J JXZ AJ C X ZJCXWA K ZBZFJPCFABD JMCJAMCJ D XC JEXJZC SJCZCGZ|                                    

We won't be able to compare SimpleLipNet with LipNet model, becauseI am not able to move the model to the GPU and LipNet has pretrained weights, but I am sure that LipNet performs better due to a lot of papers that use this model and a lot of research on it.

In [42]:
lipnet = LipNet()
lipnet.to(device)

train(lipnet, lombardgrid_train_plain_front_dataloader, 100, display_mod=10, device=device, **{'model_name': 'LipNet_lips_only'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
T PBWBOCJSPI LOPHJP BCHELACOHOPZMEHAJHUJCHEP LUHEHVBOHWJVCWGWGWGWGMWCPWGWGWTGGWGWAWMWMEWGEGWJEMWGWGEMGMWGWMGWM|                                                                                                                             LAY BLUE AT Z NINE PLEASE
ZUACMCWGZCXACXZAUCBZWACXCZUZCUZCAJBZBXZXJCZCXZEZKEXUEWCWGICWJWJWMWWGWGWGWSMWGWGWACGWEWCWMWGCGCWGWWWGMWVJEWE|                                                                                                                               LAY GREEN IN L NINE NOW
JUJBQPQPZQJBOBYRPZQHWLQJLQWXQXQVLBYRXWJQMJQJQHQLJQMXZMEXEWEWMJEJWGWEGWGMWLCMEMQWJWGWGTMGWXMGVWMWEWCWCWGEGWGQWJEJWCWGMGWMEWGSGJWGW|                   

It is clear that the model will require a lot of epochs of training until we will be able to see good predictions or we won't see them at all, anyway let's save the weights.

In [54]:
save_model_weights(lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/lips_only/version1_6_speakers.pth")

In [87]:
extract_zip("/Users/admin/PycharmProjects/nn_course_work/data/mouth_frames/mouth_21_23.zip", "/Users/admin/PycharmProjects/nn_course_work/clean_data/video_frames_mouths")

## *Training models on mouth frames*

Creating datasets and dataloaders for mouth frames, we did it previously 2 times, so I won't comment it.

In [35]:
# Create a set of all speakers that has corresponding lip crop files
all_speakers = {'s15', 's16', 's17', 's18', 's19', 's20'}

# Specify the excluded speakers, so that we can form unseen speakers data
train_excluded_speakers = {'s15', 's16', 's17'}
test_excluded_speakers = copy.deepcopy(train_excluded_speakers)
test_excluded_speakers_unseen = all_speakers - test_excluded_speakers

# We use only front directory for root directory assuming that every front has corresponding side
relative_root_directory = '/nn_course_work/clean_data/video_frames_mouths/front'

# Absolute root path
root_directory = create_list_of_paths([relative_root_directory])[0]

In [36]:
relative_train_plain_output_file = '/nn_course_work/clean_data/txt_files/mouths/train_plain_files.txt'
relative_test_noisy_output_file = '/nn_course_work/clean_data/txt_files/mouths/test_noisy_files.txt'
relative_test_noisy_unseen_output_file = '/nn_course_work/clean_data/txt_files/mouths/test_noisy_unseen_files.txt'
relative_test_plain_unseen_output_file = '/nn_course_work/clean_data/txt_files/mouths/test_plain_unseen_files.txt'


lips_paths = create_list_of_paths([relative_train_plain_output_file, relative_test_noisy_output_file, relative_test_noisy_unseen_output_file, relative_test_plain_unseen_output_file])

# Unpack them
train_plain_output_file, test_noisy_output_file, test_noisy_unseen_output_file, test_plain_unseen_output_file = lips_paths

# Write the paths to the train, test, and test unseen output files
write_paths_to_file(root_directory, train_plain_output_file, folder_ends={'p'}, exclude_speakers=train_excluded_speakers)
write_paths_to_file(root_directory, test_noisy_output_file, folder_ends={'l'}, exclude_speakers=test_excluded_speakers)
write_paths_to_file(root_directory, test_noisy_unseen_output_file, folder_ends={'l'}, exclude_speakers=test_excluded_speakers_unseen)
write_paths_to_file(root_directory, test_plain_unseen_output_file, folder_ends={'p'}, exclude_speakers=test_excluded_speakers_unseen)

In [37]:
# Number of samples in each batch
batch_size = 8
# Relative paths that point out required dir and files for dataset class
relative_anno_path = "/nn_course_work/clean_data/aligns/"

# Front and side paths
relative_front_video_path = "/nn_course_work/clean_data/video_frames_mouths/front/"
relative_train_plain_list_path = relative_train_plain_output_file
relative_test_noisy_list_path = relative_test_noisy_output_file
relative_test_plain_unseen_list_path = relative_test_noisy_unseen_output_file
relative_test_noisy_unseen_list_path = relative_test_noisy_unseen_output_file
# Obtain train datasets
lombardgrid_train_plain_front_dataset = get_dataset(relative_anno_path, relative_front_video_path,
                                                    relative_train_plain_list_path)

# Obtain test datasets
lombardgrid_test_noisy_front_dataset = get_dataset(relative_anno_path, relative_front_video_path,
                                                   relative_test_noisy_list_path)

lombardgrid_test_plain_unseen_front_dataset = get_dataset(relative_anno_path, relative_front_video_path,
                                                          relative_test_plain_unseen_list_path)

lombardgrid_test_noisy_unseen_front_dataset = get_dataset(relative_anno_path, relative_front_video_path,
                                                          relative_test_noisy_unseen_list_path)
# Create a DataLoader for the LombardGrid(to train on front videos dataset)
lombardgrid_train_plain_front_dataloader = dataset2dataloader(lombardgrid_train_plain_front_dataset, batch_size)

# Create test DataLoaders for the datasets
lombardgrid_test_noisy_front_dataloader = dataset2dataloader(lombardgrid_test_noisy_front_dataset, batch_size)

lombardgrid_test_plain_unseen_front_dataloader = dataset2dataloader(lombardgrid_test_plain_unseen_front_dataset,
                                                                    batch_size)
lombardgrid_test_noisy_unseen_front_dataloader = dataset2dataloader(lombardgrid_test_noisy_unseen_front_dataset,
                                                                    batch_size)

# Print dataset sizes
print('Number of train front data:{}'.format(len(lombardgrid_train_plain_front_dataset.data)))
print('Number of test front data:{}'.format(len(lombardgrid_test_noisy_front_dataset.data)))
print('Number of test unseen front data:{}'.format(len(lombardgrid_test_plain_unseen_front_dataset.data)))
print('Number of test unseen front data:{}'.format(len(lombardgrid_test_noisy_unseen_front_dataset.data)))

Number of train front data:250
Number of test front data:282
Number of test unseen front data:282
Number of test unseen front data:282


### SimpleLipNet

Let's start with training SimpleLipNet model.

In [38]:
simple_lipnet = SimpleLipNet()
simple_lipnet = simple_lipnet.to(cpu_device)

In [40]:
train(simple_lipnet, lombardgrid_train_plain_front_dataloader, 50, display_mod=10, device=cpu_device, **{'model_name': 'Simple_LipNet/mouth_only/plain/'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
                                                  |                                                                                                                            PLACE WHITE AT R TWO AGAIN
                                                  |                                                                                                                               BIN GREEN BY J NINE NOW
                                                  |                                                                                                                         PLACE GREEN BY A SEVEN PLEASE
----------------------------------------------------------------

Let's save the weights.

In [41]:
save_model_weights(simple_lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/Simple_LIPNET/mouths_only/version1_6_speakers.pth")

And now test the model on data with the lombard effect and on speakers our network has already seen.

In [43]:
simple_lipnet_metrics,  simple_lipnet_metrics_history = test(simple_lipnet, lombardgrid_test_noisy_front_dataloader, device=cpu_device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LY BUE BE LASEN                                   |                         LAY BLUE AT Z NINE PLEASE
L R N                                             |                          PLACE GREEN BY R SIX NOW
S RE N                                            |                                SET RED IN V SEVEN
B W EN                                            |                        BIN WHITE AT C FOUR PLEASE
S R EWN                                           |                           LAY GREEN IN U FIVE NOW
LY RE F ON                                        |                          LAY GREEN IN G FOUR SOON
BIN BLUE WI E OWN                                 |                             BI

Writing results to tensorboard.

In [44]:
simple_lipnet_loss_list, simple_lipnet_wer_list, simple_lipnet_cer_list = simple_lipnet_metrics_history
board_name = 'Simple_LipNet/mouth_only/noisy/version1_6_speakers'
simple_lipnet_list_for_metrics_and_names = [(simple_lipnet_loss_list, 'val loss', board_name), (simple_lipnet_wer_list, 'wer', board_name), (simple_lipnet_cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(simple_lipnet_list_for_metrics_and_names)

simple_lipnet_loss, simple_lipnet_wer, simple_lipnet_cer = simple_lipnet_metrics
print(f"Loss: {simple_lipnet_loss}, WER: {simple_lipnet_wer}, CER: {simple_lipnet_cer}")

Loss: 1.653975486755371, WER: 0.947874720357942, CER: 0.6581407460010694


Let's test the model on unseen speakers and data without lombard effect.

In [45]:
simple_lipnet_metrics,  simple_lipnet_metrics_history = test(simple_lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=cpu_device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
B OON                                             |                          PLACE RED BY K NINE SOON
B N                                               |                            BIN RED IN O FOUR SOON
B N                                               |                           BIN BLUE AT M ONE AGAIN
S N                                               |                     LAY WHITE WITH H THREE PLEASE
B N                                               |                       SET WHITE AT J EIGHT PLEASE
B N                                               |                        PLACE GREEN IN Q NINE SOON
S N                                               |                         LAY GR

And output results and write plots to tensorboard.

In [46]:
simple_lipnet_loss_list, simple_lipnet_wer_list, simple_lipnet_cer_list = simple_lipnet_metrics_history
board_name = 'Simple_LipNet/mouth_only/unseen_plain/version1_6_speakers'
simple_lipnet_list_for_metrics_and_names = [(simple_lipnet_loss_list, 'val loss', board_name), (simple_lipnet_wer_list, 'wer', board_name), (simple_lipnet_cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(simple_lipnet_list_for_metrics_and_names)

simple_lipnet_loss, simple_lipnet_wer, simple_lipnet_cer = simple_lipnet_metrics
print(f"Loss: {simple_lipnet_loss}, WER: {simple_lipnet_wer}, CER: {simple_lipnet_cer}")

Loss: 2.748852491378784, WER: 0.9899328859060402, CER: 0.8773863821392283


### LipNet

#### Pretrained

Just as we did previously let's evaluate pretrained model, but now on mouth images.

In [60]:
pretrained_lipnet = LipNet()
pretrained_lipnet.to(device)


# Specify path to weights and load them into
weights_path = "LIPNET_original_weights/LipNet_unseen_loss_0.44562849402427673_wer_0.1332580699113564_cer_0.06796452465503355.pt"
pretrained_lipnet = load_model_weights(pretrained_lipnet, weights_path, device)

Testing on speakers that model has been trained on, but now data has lombard effect.

In [65]:
overall_pretrained_metrics,  pretrained_metrics_history = test(pretrained_lipnet, lombardgrid_test_noisy_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAY WRIE ATA NLPA AIN M                           |                     PLACE GREEN AT Z SEVEN PLEASE
SET LE AT L NE BIN OU                             |                         PLACE GREEN IN F FIVE NOW
IN WHEN IN E IN                                   |                        LAY GREEN AT O ZERO PLEASE
RDEREDNAN OU                                      |                          LAY WHITE IN T ONE AGAIN
RD RED BY                                         |                           LAY RED BY F NINE AGAIN
A IN                                              |                         SET RED WITH Q THREE SOON
LAC GRIE AT NB BIN W MIN                          |                       PLACE BL

Writing metrics to tensorboard.

In [69]:
pretrained_loss_list, pretrained_wer_list, pretrained_cer_list = pretrained_metrics_history
board_name = 'Pretrained_LipNet_unseen_weights/mouth_only/noisy'
pretrained_list_for_metrics_and_names = [(pretrained_loss_list, 'val loss', board_name), (pretrained_wer_list, 'wer', board_name), (pretrained_cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(pretrained_list_for_metrics_and_names)

Now let's test the model on data without lombard effect and on unseen speakers.

In [70]:
overall_pretrained_metrics,  pretrained_metrics_history = test(pretrained_lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE BLUE IN A LN MW                             |                         SET BLUE WITH Q ZERO SOON
PLACE WHITE BY N ON E M                           |                         PLACE BLUE WITH L TWO NOW
LET WHIE IN E S ON                                |                          SET RED WITH J ZERO SOON
PLACE WHITE BY A INE SOON YMY                     |                         PLACE WHITE AT X SIX SOON
BUN BLUE BY P LR                                  |                       LAY GREEN AT B EIGHT PLEASE
IEN WHUE WIT DPP OO                               |                       PLACE GREEN IN Y ZERO AGAIN
LAT BLUE ANY ONE AIN OW                           |                       LAY WHIT

And, of course, write the logs for tensorboard plots.

In [71]:
pretrained_loss_list, pretrained_wer_list, pretrained_cer_list = pretrained_metrics_history
board_name = 'Pretrained_LipNet_unseen_weights/mouth_only/unseen_plain'
pretrained_list_for_metrics_and_names = [(pretrained_loss_list, 'val loss', board_name), (pretrained_wer_list, 'wer', board_name), (pretrained_cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(pretrained_list_for_metrics_and_names)

Output overall metrics.

In [79]:
pretrained_loss, pretrained_wer, pretrained_cer = overall_pretrained_metrics
print(f"Loss: {pretrained_loss}, WER: {pretrained_wer}, CER: {pretrained_cer}")

Loss: 4.383695125579834, WER: 1.0161073825503355, CER: 0.6710928588902635


Our metrics are a bit closer to benchmark, but not close enough to produce 'normal' predictions. It may be due to the difference we may have in shapes of mouth frames we retrieve, but we do not know fo sure if they used 128x64 frames or something else. Also decreasing height and width may result in cropping mouth images a bit too much, so I decided to leave our current implementation(view lips and moths extraction notebook for more info).

### *Fine-tuned*

In [48]:
fine_tuned_lipnet = LipNet()
fine_tuned_lipnet.to(device)


# Specify path to weights and load them into
weights_path = "LIPNET_original_weights/LipNet_unseen_loss_0.44562849402427673_wer_0.1332580699113564_cer_0.06796452465503355.pt"
fine_tuned_lipnet = load_model_weights(fine_tuned_lipnet, weights_path, device)

Just as we did previously let's freeze the layers at the end, so that we are able to adjust CNN layers.

In [49]:
freeze_layers(fine_tuned_lipnet, 2, freeze_first_layers=False)

Layers: [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False]
Counter({True: 22, False: 2})


LipNet(
  (conv1): Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2))
  (pool1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 1, 1), padding=(1, 2, 2))
  (pool2): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
  (conv3): Conv3d(64, 96, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
  (pool3): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
  (gru1): GRU(3072, 256, bidirectional=True)
  (gru2): GRU(512, 256, bidirectional=True)
  (FC): Linear(in_features=512, out_features=28, bias=True)
  (relu): ReLU(inplace=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (dropout3d): Dropout3d(p=0.5, inplace=False)
)

Now we can train fine-tuned lipnet on data without Lombard Effect.

In [50]:
train(fine_tuned_lipnet, lombardgrid_train_plain_front_dataloader, 100, display_mod=10, device=device, **{'model_name': 'LipNet_fine_tuned_train/mouth_only/plain/'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
I I BI WINI I OIO                                 |                                                                                                                                    SET RED IN V SEVEN
IT HTNWIT TI NOHU                                 |                                                                                                                               LAY WHITE IN F ONE SOON
T PEWNEBWW R OIW                                  |                                                                                                                          PLACE GREEN WITH U SIX AGAIN
----------------------------------------------------------------

Let's save the model.

In [None]:
save_model_weights(fine_tuned_lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET_fine_tuned/mouths_only/version1_6_speakers.pth")

Test the fine-tuned model on the unseen speakers without lombard effect.

In [36]:
overall_fine_tune_metrics,  fine_tuned_metrics_history = test(fine_tuned_lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE WHITE IN S ONE PLEASE BREEN IN B            |                       PLACE WHITE AT S ONE PLEASE
PLACE GREEN IT O FOUR SOON N N AB                 |                        PLACE GREEN IN Q NINE SOON
PLAC GREEN AT T TINE PLEASEN REN AIN B            |                      PLACE GREEN BY Z FIVE PLEASE
LAY BLUE AT M ONE SOON WN B                       |                         LAY BLUE AT F THREE AGAIN
LAC WHETEN AT G NINE PLEASEN IN AINN A            |                       LAY GREEN AT T THREE PLEASE
BIN BLUE AT N ONE AGAIN B                         |                           BIN BLUE AT M ONE AGAIN
LAY WHITE BY Y NINE AGAIN EN N B                  |                       LAY WHIT

Write logs and output metrics.

In [37]:
fine_tuned_loss_list, fine_tuned_wer_list, fine_tuned_cer_list = fine_tuned_metrics_history
board_name = 'FineTuned_LipNet/mouth_only/unseen_plain/version1_6_speakers'
fine_tuned_list_for_metrics_and_names = [(fine_tuned_loss_list, 'val loss', board_name), (fine_tuned_wer_list, 'wer', board_name), (fine_tuned_cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(fine_tuned_list_for_metrics_and_names)

fine_tuned_loss, fine_tuned_wer, fine_tuned_cer = overall_fine_tune_metrics
print(f"Loss: {fine_tuned_loss}, WER: {fine_tuned_wer}, CER: {fine_tuned_cer}")

Loss: 1.5598423480987549, WER: 0.8011185682326621, CER: 0.5526594838834806


Test it one more time, but on unseen speakers with lombard effect.

In [38]:
overall_fine_tune_metrics,  fine_tuned_metrics_history = test(fine_tuned_lipnet, lombardgrid_test_noisy_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE GREEN BY E SEVEN PLEASE BRUE IN B           |                       PLACE RED BY E SEVEN PLEASE
PLACE GREEN WITH B SEVEN SOON AB                  |                       BIN GREEN WITH V SEVEN SOON
SET GREEN WITH Y SINX SOONWNIENAN AITN N          |                           SET BLUE WITH I SIX NOW
SET WHITE ATH B EIGHT SOONWI B                    |                       PLACE GREEN WITH K TWO SOON
LAY WHITE BY N ZERR AGAIN B                       |                           LAY RED BY N ZERO AGAIN
PLACE WHITE WITH Y SINX SOONWNSIENGAN AITN        |                          BIN WHITE BY U THREE NOW
SET GREEN BY E FOUR PLEASE RUEN IN B              |                        SET GRE

Same step as previously, but for unseen lombard data.

In [39]:
fine_tuned_loss_list, fine_tuned_wer_list, fine_tuned_cer_list = fine_tuned_metrics_history
board_name = 'FineTuned_LipNet/mouth_only/unseen_noise/version1_6_speakers'
fine_tuned_list_for_metrics_and_names = [(fine_tuned_loss_list, 'val loss', board_name), (fine_tuned_wer_list, 'wer', board_name), (fine_tuned_cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(fine_tuned_list_for_metrics_and_names)

fine_tuned_loss, fine_tuned_wer, fine_tuned_cer = overall_fine_tune_metrics
print(f"Loss: {fine_tuned_loss}, WER: {fine_tuned_wer}, CER: {fine_tuned_cer}")

Loss: 1.5543553829193115, WER: 0.8011185682326621, CER: 0.5526594838834806


### Training almost from scratch(using weights from [repo](https://github.com/VIPL-Audio-Visual-Speech-Understanding/LipNet-PyTorch) as initial weights)

#### Training using 6 speakers(such approach gives us an opportunity to parse mouth frames for more speakers and saves us some time to research possible problems and how the model might learn).

Let's try training the model on 100 epochs and see what do we get. Also, here we won't write markdown comments as the process of training and testing is the same as previously.

In [75]:
lipnet = LipNet()
lipnet.to(device)


# Specify path to weights and load them into
weights_path = "LIPNET_original_weights/LipNet_unseen_loss_0.44562849402427673_wer_0.1332580699113564_cer_0.06796452465503355.pt"
lipnet = load_model_weights(lipnet, weights_path, device)

In [76]:
train(lipnet, lombardgrid_train_plain_front_dataloader, 100, display_mod=10, device=device, **{'model_name': 'LipNet/mouth_only/plain/'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAT HIE AY OU                                     |                                                                                                                             LAY GREEN IN G ZERO AGAIN
SAT BRT A I I NI                                  |                                                                                                                             BIN WHITE AT U NINE AGAIN
GLRED ATI I II                                    |                                                                                                                           PLACE RED AT Q THREE PLEASE
----------------------------------------------------------------

In [80]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLANE WHITE WITH U SIX NOW BI                     |                          BIN WHITE WITH U SIX NOW
PLACE WHITE WITH T THREE SOON N I                 |                         LAY RED WITH T THREE SOON
SET WHITE WITH N FEVHT SOONW I                    |                       PLACE GREEN WITH K TWO SOON
PLACE WHITE WBYH B FIVE NOW I                     |                           SET RED WITH P FIVE NOW
SET RED BY S NINE AGAIN B                         |                           LAY RED BY S NINE AGAIN
PBANE WHITE IT A ZERO SOON I                      |                          BIN RED IN A SEVEN AGAIN
SETE BLUE AT B THREE NOW I                        |                         SET BL

In [82]:
loss_list, wer_list, cer_list = metrics_history
board_name = 'LipNet/mouth_only/unseen_plain/version1_6_speakers'
list_for_metrics_and_names = [(loss_list, 'val loss', board_name), (wer_list, 'wer', board_name), (cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(list_for_metrics_and_names)

loss, wer, cer = overall_metrics
print(f"Loss: {loss}, WER: {wer}, CER: {cer}")

Loss: 1.4870673418045044, WER: 0.7219239373601789, CER: 0.46412362047286365


Just in case, we save the model.

In [84]:
save_model_weights(lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/mouths_only/version1_6_speakers.pth")

In [85]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_noisy_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
SET BLUE BY H FIGHT SOON I                        |                           LAY BLUE BY G NINE SOON
BIN BLUE AT N ONE AGAIN B                         |                           BIN BLUE AT M ONE AGAIN
SETE BLUE BT D SEVEN NON N W                      |                               LAY BLUE BY E SEVEN
LATE WHITE AT J EIGHT PLEASEN BRUED IN W          |                       SET WHITE AT J EIGHT PLEASE
PLACE WHITE WITH G TWO PLEASE REN IN W            |                     PLACE WHITE WITH G TWO PLEASE
PLACE GREEN WITH A FIVE PLEASE RUED IN W          |                    PLACE GREEN WITH H FIVE PLEASE
LAY BLUE IT M FIVE PLEASE RUED IN W               |                         LAY BL

In [86]:
loss_list, wer_list, cer_list = metrics_history
board_name = 'LipNet/mouth_only/unseen_noise/version1_6_speakers'
list_for_metrics_and_names = [(loss_list, 'val loss', board_name), (wer_list, 'wer', board_name), (cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(list_for_metrics_and_names)

loss, wer, cer = overall_metrics
print(f"Loss: {loss}, WER: {wer}, CER: {cer}")

Loss: 1.4998291730880737, WER: 0.7219239373601789, CER: 0.46412362047286365


Let's see what happens when we train our model on 100 more epochs.

In [94]:
lipnet = LipNet()
lipnet.to(device)


# Specify path to weights and load them into
weights_path = "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/mouths_only/version1_6_speakers.pth"
lipnet = load_model_weights(lipnet, weights_path, device)

In [95]:
train(lipnet, lombardgrid_train_plain_front_dataloader, 100, display_mod=10, device=device, **{'model_name': 'LipNet/mouth_only/plain/'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
SET WHITE IN K FOUR PLEASE REN IN T OI            |                                                                                                                            SET WHITE IN K FOUR PLEASE
BIN BLUE WITH A SEVEN N N N AS                    |                                                                                                                                 BIN BLUE WITH H SEVEN
BINE RED BN M EIGHT SOONWU OY UEN WINBY BIR       |                                                                                                                               BIN RED AT M EIGHT SOON
----------------------------------------------------------------

In [96]:
save_model_weights(lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/mouths_only/version2_6_speakers.pth")

In [97]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_noisy_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE WHITE WITH M TWO SOON NW                    |                       PLACE WHITE WITH M TWO SOON
BIN WHITE WITH U SIX NOWONNW                      |                          BIN WHITE WITH U SIX NOW
PLACE WHITE AT S ONE AGAIN NA                     |                        PLACE WHITE AT E ONE AGAIN
PLACE WHITE AT D SEVEN AGAIN WINN W               |                          BIN RED BY T SEVEN AGAIN
LAY BLUE WITH M FIVE PLEASE REN IN NA             |                         LAY BLUE IN M FIVE PLEASE
LAY GREEN AT P ZERO SOON NW                       |                          LAY GREEN AT I ZERO SOON
PLACE GREEN WITH H EIGHT PLEASEN REN WINN         |                    PLACE GREEN

In [100]:
loss_list, wer_list, cer_list = metrics_history
board_name = 'LipNet/mouth_only/unseen_plain/version2_6_speakers'
list_for_metrics_and_names = [(loss_list, 'val loss', board_name), (wer_list, 'wer', board_name), (cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(list_for_metrics_and_names)

loss, wer, cer = overall_metrics
print(f"Loss: {loss}, WER: {wer}, CER: {cer}")

Loss: 1.5455169677734375, WER: 0.6512304250559284, CER: 0.4427242392351484


In [40]:
lipnet = LipNet()
lipnet.to(device)


# Specify path to weights and load them into
weights_path = "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/mouths_only/version2_6_speakers.pth"
lipnet = load_model_weights(lipnet, weights_path, device)

In [41]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE RED AT Z TWO SOON                           |                           BIN RED BY G SEVEN SOON
SET WHITE AT X TIGO SOON NNA                      |                           SET RED AT V EIGHT SOON
SET BLUE AT P THREE NOWNNA                        |                         SET BLUE AT V THREE AGAIN
PLACE RED BY E SEVEN PLEASE RUEN INN              |                       PLACE RED BY E SEVEN PLEASE
PLACE BLUE BY X THREE NOWONN                      |                         PLACE BLUE WITH L TWO NOW
SET GREE WITH P SIX SOONWA                        |                           SET BLUE WITH I SIX NOW
PLACE GREEN WITH T SEVEN AGAIN WNANA              |                       BIN GREE

In [101]:
loss_list, wer_list, cer_list = metrics_history
board_name = 'LipNet/mouth_only/unseen_plain/version2_6_speakers'
list_for_metrics_and_names = [(loss_list, 'val loss', board_name), (wer_list, 'wer', board_name), (cer_list, 'cer', board_name)]

# Write test data to logs
plot_metrics_in_tensorboard(list_for_metrics_and_names)

loss, wer, cer = overall_metrics
print(f"Loss: {loss}, WER: {wer}, CER: {cer}")

Loss: 1.5455169677734375, WER: 0.6512304250559284, CER: 0.4427242392351484


# *Change above which are not fine_tuned models, but models trained from scratch.*

It is clear that I need additional methods that would help me improve the code.

In [42]:
def get_model_with_weights(model_class, weights_path, device=device, freeze_layers_params=None):
    model = model_class()
    model.to(device)


    # Specify path to weights and load them into
    model = load_model_weights(model, weights_path, device)

    if freeze_layers_params:
        model = freeze_layers(model, freeze_layers_params.get('number_of_layers_to_freeze', 1), freeze_layers_params.get('freeze_first_layers', True))
    return model

Apart from initializing models with weights there is a problem with dataloaders creation as it is unreadable and I need to do it frequently, so let's create a method for this and hope that everything will be different.

In [46]:
def create_datasets_and_dataloaders(anno_path, video_path, list_paths, batch_size):
    """
    This function creates datasets and dataloaders for given paths.

    Parameters:
    anno_path (str): The relative path to the annotations.
    video_path (str): The relative path to the videos.
    list_paths (list): The list of relative paths to the file lists.
    batch_size (int): The batch size.

    Returns:
    list: A list of datasets.
    list: A list of dataloaders.
    """

    # Create datasets
    datasets = [get_dataset(anno_path, video_path, list_path) for list_path in list_paths]

    # Create dataloaders
    dataloaders = [dataset2dataloader(dataset, batch_size) for dataset in datasets]

    return datasets, dataloaders


In [55]:
def print_dataset_sizes(datasets):
    """
    This function prints the sizes of all datasets.

    Parameters:
    datasets (list): The list of datasets.

    Returns:
    None
    """
    names = ['train front', 'test front', 'test unseen front', 'test noisy unseen front']
    for name, dataset in zip(names, datasets):
        print(f'Number of {name} data:{len(dataset.data)}')

In [91]:
def log_and_print_metrics(metrics_history, overall_metrics, board_name):
    """
    Log metrics in TensorBoard and print the overall metrics.

    :param metrics_history: A tuple containing three lists (loss_list, wer_list, cer_list)
    :param overall_metrics: A tuple containing the overall metrics (loss, wer, cer)
    :param board_name: The name of the board to which the metrics will be written
    """
    loss_list, wer_list, cer_list = metrics_history
    list_for_metrics_and_names = [(loss_list, 'val loss', board_name),
                                  (wer_list, 'wer', board_name),
                                  (cer_list, 'cer', board_name)]

    # Write test data to logs
    plot_metrics_in_tensorboard(list_for_metrics_and_names)

    loss, wer, cer = overall_metrics
    print(f"Loss: {loss}, WER: {wer}, CER: {cer}")

In [79]:
# Create a set of all speakers that has corresponding lip crop files
all_speakers = {'s15', 's16', 's17', 's18', 's19', 's20', 's21', 's22', 's23', 's25', 's26', 's30', 's31', 's32'}

# Specify the excluded speakers, so that we can form unseen speakers data
train_excluded_speakers = {'s16', 's17', 's22', 's25', 's31'}
test_excluded_speakers = copy.deepcopy(train_excluded_speakers)
test_excluded_speakers_unseen = all_speakers - test_excluded_speakers

# We use only front directory for root directory assuming that every front has corresponding side
relative_root_directory = '/nn_course_work/clean_data/video_frames_mouths/front2'
relative_front_video_path = "/nn_course_work/clean_data/video_frames_mouths/front2/"

# Absolute root path
root_directory = create_list_of_paths([relative_root_directory])[0]

In [80]:
relative_train_plain_output_file = '/nn_course_work/clean_data/txt_files/mouths2/train_plain_files.txt'
relative_test_noisy_output_file = '/nn_course_work/clean_data/txt_files/mouths2/test_noisy_files.txt'
relative_test_noisy_unseen_output_file = '/nn_course_work/clean_data/txt_files/mouths2/test_noisy_unseen_files.txt'
relative_test_plain_unseen_output_file = '/nn_course_work/clean_data/txt_files/mouths2/test_plain_unseen_files.txt'

mouths_paths = create_list_of_paths([relative_train_plain_output_file, relative_test_noisy_output_file, relative_test_noisy_unseen_output_file, relative_test_plain_unseen_output_file])

# Unpack them
train_plain_output_file, test_noisy_output_file, test_noisy_unseen_output_file, test_plain_unseen_output_file = mouths_paths

# Write the paths to the train, test, and test unseen output files
write_paths_to_file(root_directory, train_plain_output_file, folder_ends={'p'}, exclude_speakers=train_excluded_speakers)
write_paths_to_file(root_directory, test_noisy_output_file, folder_ends={'l'}, exclude_speakers=test_excluded_speakers)
write_paths_to_file(root_directory, test_noisy_unseen_output_file, folder_ends={'l'}, exclude_speakers=test_excluded_speakers_unseen)
write_paths_to_file(root_directory, test_plain_unseen_output_file, folder_ends={'p'}, exclude_speakers=test_excluded_speakers_unseen)

In [81]:
# Relative paths to file lists
relative_file_list_paths = [relative_train_plain_output_file, relative_test_noisy_output_file,
                            relative_test_plain_unseen_output_file, relative_test_noisy_unseen_output_file]

# Create datasets and dataloaders
datasets, dataloaders = create_datasets_and_dataloaders(relative_anno_path, relative_front_video_path,
                                                        relative_file_list_paths, batch_size)

# Unpack datasets
(lombardgrid_train_plain_front_dataset, lombardgrid_test_noisy_front_dataset,
 lombardgrid_test_plain_unseen_front_dataset, lombardgrid_test_noisy_unseen_front_dataset) = datasets

# Unpack dataloaders
(lombardgrid_train_plain_front_dataloader, lombardgrid_test_noisy_front_dataloader,
 lombardgrid_test_plain_unseen_front_dataloader, lombardgrid_test_noisy_unseen_front_dataloader) = dataloaders

# Output sizes of datasets
print_dataset_sizes(datasets)


Number of train front data:397
Number of test front data:431
Number of test unseen front data:249
Number of test noisy unseen front data:248


## Train LipNet almost from scratch on 15 speakers.

6 speakers is small number especially when we train on 3 speakers and validate on other 3, so we parsed mouth frames for 15 more speakers while training models on 3 speakers and will train model using 10 speakers for training and 5 for validation. Doing this should improve model generalization and it should be able to predict better wider variety of speakers.

In [87]:
weights_path = "LIPNET_original_weights/LipNet_unseen_loss_0.44562849402427673_wer_0.1332580699113564_cer_0.06796452465503355.pt"
lipnet = get_model_with_weights(LipNet, weights_path, device=device)

Training the model on 500 epochs(previously checking if everything works on 1 epoch, so that we won't lose any time).

In [88]:
train(lipnet, lombardgrid_train_plain_front_dataloader, 500, display_mod=50, device=device, **{'model_name': 'LipNet/mouth_only/plain2/15_speakers/'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
N BLUE BN RQRR RROO                               |                                                                                                                              SET BLUE WITH I TWO SOON
IIN WBEY BYH B IR UOU                             |                                                                                                                               SET WHITE AT K NINE NOW
PLACE GREE ATN E WWH W R R GYOU                   |                                                                                                                               LAY WHITE BY M FOUR NOW
----------------------------------------------------------------

It is worth to save the model, after 15 hours of training🙃.

In [90]:
save_model_weights(lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/mouths_only/version3_15_speakers.pth")

Validate the model on unseen speakers without lombard effect.

In [92]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE WHITE IY E EIGHT AGAIN E YNBIN BYT          |                     PLACE WHITE WITH K NINE AGAIN
LAY RED IN M EIGHT PLEASE BLUE INBITN             |                          LAY RED IN F NINE PLEASE
PLACE BLUE BY V EIGHT PLEASE BLUEN INB B          |                         BIN BLUE BY V FIVE PLEASE
SET WHITE BY X EIGHT AGAIN WITNNBINTH             |                           LAY BLUE BY N SIX AGAIN
PLACE WRIED WITH N FOUR NOWOWOBIEN BIEN IN        |                          PLACE RED WITH L TWO NOW
PLACE BRED WITH K ONE SOON N YB I                 |                        PLACE BLUE WITH K ONE SOON
SET WHITE WITH Z ONE AGAIN EN NINBEN BIEN IT      |                        SET WHI

Output overall metrics and write logs.

In [93]:
board_name = "LipNet/mouth_only/unseen_plain/version3_15_speakers"
log_and_print_metrics(metrics_history, overall_metrics, board_name)

Loss: 0.9177827835083008, WER: 0.7045515394912986, CER: 0.6138050740177413


Validate the model on unseen speakers with lombard effect.

In [94]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_noisy_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE WHITE AT F SIX NOWO OBINE BIEN IT           |                         PLACE GREEN IN L ZERO NOW
BIN BLUE WITH H SIX SOON WN YB                    |                          BIN BLUE WITH O SIX SOON
PLACE WHITE AT L THREE SOONN BINBT                |                       PLACE WHITE IN L THREE SOON
PLACE GREEN AT Y ZEVEN SOONIN M                   |                          BIN GREEN IN P ZERO SOON
BIN RED BY A SEVEN PLEASE BLUEN ITB I             |                   PLACE GREEN WITH N SEVEN PLEASE
PLACE WHITE AT N TWO AGAIN WIENNB BIF             |                           PLACE RED IN K TWO SOON
PLACE WHITE BY T EIGHT NOWON BIE WITN             |                            BIN

Write the logs and output the metrics.

In [95]:
log_and_print_metrics(metrics_history, overall_metrics, board_name)

Loss: 1.2797966003417969, WER: 0.7205645161290322, CER: 0.5663246415666794


Training learning curves show us constant improvements in WER, so let's train our model on 100 epochs again.

In [96]:
train(lipnet, lombardgrid_train_plain_front_dataloader, 100, display_mod=10, device=device, **{'model_name': 'LipNet/mouth_only/plain2/15_speakers2/'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE WHITE BY D NINE SOON NWNYBYWSLSET BHLEN IT  |                                                                                                                            PLACE WHITE BY D NINE SOON
PLACE BLUE WITH F FOUR AGAIN YBYBYBBIMBVBEBIUEN IN|                                                                                                                          PLACE BLUE WITH F FOUR AGAIN
PLACE BLUE BY Y FOUR AGAIN BBBNBININIBISEN BIEN IN|                                                                                                                            PLACE BLUE BY Y FOUR AGAIN
----------------------------------------------------------------

In [97]:
save_model_weights(lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/mouths_only/version4_15_speakers.pth")

In [98]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE GREEN AT Y ZERO AGAIN AEN W                 |                       PLACE GREEN IN Y ZERO AGAIN
BIN RED WIT Z THREE NOWNNBNBEN BIYN               |                            BIN RED BY N THREE NOW
LAY RED AT S SIX PLEASE LIE BIN U                 |                           LAY RED AT S SIX PLEASE
PLACE GREEN WITH N FIVE PLEASEN WIUEN AN W        |                       BIN GREEN WITH D FIVE AGAIN
SAY RED BY G ONE NOWN NNYBNBIEN IN                |                            LAY RED BY G THREE NOW
PLACE RED BY F EIGHT SOON N WU                    |                          PLACE RED BY F NINE SOON
BIN BLUE BY B FOUR PLEASE BIEN ITN U              |                         BIN BL

In [99]:
board_name = "LipNet/mouth_only/unseen_plain/version4_15_speakers"
log_and_print_metrics(metrics_history, overall_metrics, board_name)

Loss: 0.9937862157821655, WER: 0.6499330655957162, CER: 0.44494882658585744


In [100]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_noisy_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAY BRED IT S EIGHT NOW WNMU W                    |                            LAY BLUE AT D NINE NOW
PLACE WRED WITH S ONE NOWNNBNBNNN                 |                          PLACE RED WITH S ONE NOW
SET WHITE BY P FIVE NOW WY                        |                           SET WHITE BY E FIVE NOW
BIN GREEN WITH D ZERO NOW N WU                    |                         BIN GREEN WITH D ZERO NOW
BIN BLUE BY P FOUR PLEASE BIEN WIT                |                         BIN BLUE BY B FOUR PLEASE
PLACE WHITE WITH M TWO SOON N                     |                       PLACE WHITE WITH M TWO SOON
BIN RED IN O FOUR SOON WN                         |                            BIN

In [101]:
board_name = "LipNet/mouth_only/unseen_noise/version4_15_speakers"
log_and_print_metrics(metrics_history, overall_metrics, board_name)

Loss: 1.3093518018722534, WER: 0.6561827956989247, CER: 0.4415391095571322


We can see the improvements in our model, so let's train it on 100 more epochs.

In [102]:
train(lipnet, lombardgrid_train_plain_front_dataloader, 100, display_mod=10, device=device, **{'model_name': 'LipNet/mouth_only/plain2/15_speakers3/'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
LAY RED BY S NINE AGAIN N NWN                     |                                                                                                                               LAY RED BY S NINE AGAIN
PLACE WHITE WITH A TWO NOW IMMMWYMYY M            |                                                                                                                            PLACE WHITE WITH A TWO NOW
PLACE WHITE IN E TWO PLEASEN BUEN WINW YM         |                                                                                                                           PLACE WHITE IN E TWO PLEASE
----------------------------------------------------------------

Let's follow the tradition and save the weights.

In [103]:
save_model_weights(lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/mouths_only/version5_15_speakers.pth")

Just as usual we need to evaluate the model, and we do this on unseen speakers and recordings without lombard effect.

In [104]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE RED BY F EIGHT SOON WN N                    |                          PLACE RED BY F NINE SOON
BIN BLUE WITH B TWO PLEASE WLIE IN                |                          BIN BLUE IN B TWO PLEASE
PLACE WHITE WITH Y EIGHT NOW W M                  |                       PLACE WHITE WITH Y NINE NOW
BLANE RE WITH H TWO PLEASE BEN WIN                |                       BIN WHITE WITH K TWO PLEASE
SET GHITEN WITH D SEVEN SOONN                     |                       LAY GREEN WITH C SEVEN SOON
PLACE WHITE AT T EIGHT SOONWN                     |                          BIN BLUE BY S EIGHT SOON
SET RED AT B EIGHT SOON WEN                       |                           SET 

Let's write the logs and output metrics.

In [105]:
board_name = "LipNet/mouth_only/unseen_plain/version5_15_speakers"
log_and_print_metrics(metrics_history, overall_metrics, board_name)

Loss: 1.0790446996688843, WER: 0.5050870147255689, CER: 0.32634025671223404


Testing on unseen speakers and recordings with lombard effect.

In [106]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_noisy_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE GREEN BY G SEVENWNN                         |                         BIN GREEN BY Q ZERO AGAIN
PLACE WHITE BN M EIGHT AGAIN E                    |                        PLACE WHITE BY M SIX AGAIN
LAY GREEN IN H ONE LEASEWEN IN                    |                       LAY GREEN WITH X ONE PLEASE
SET BLUE IN I EIVT NOWW M                         |                            SET BLUE IN G FIVE NOW
PLACE BLUE IT V THREE NOWM M                      |                         PLACE BLUE IN U TWO AGAIN
LAY WHITE BY M SEVENENW                           |                        LAY WHITE BY M SEVEN AGAIN
LAY WHITE BY O ONE NOWIE                          |                           LAY 

Time to see the metrics.

In [107]:
board_name = "LipNet/mouth_only/unseen_noise/version5_15_speakers"
log_and_print_metrics(metrics_history, overall_metrics, board_name)

Loss: 1.4667141437530518, WER: 0.5538978494623655, CER: 0.365775200548317


In [108]:
train(lipnet, lombardgrid_train_plain_front_dataloader, 100, display_mod=10, device=device, **{'model_name': 'LipNet/mouth_only/plain2/15_speakers4/'})

Loading options...
here
-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE BLUE AT Y SIX PLEASESEARENT IN              |                                                                                                                            PLACE BLUE AT Y SIX PLEASE
SET BLUE IN J ONE SOONWNG G                       |                                                                                                                                SET BLUE IN J ONE SOON
LAY GREEN IN U FIVE NOWON WIWNI                   |                                                                                                                               LAY GREEN IN U FIVE NOW
----------------------------------------------------------------

In [111]:
save_model_weights(lipnet, "/Users/admin/PycharmProjects/nn_course_work/weights/LIPNET/mouths_only/version6_15_speakers.pth")

In [109]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_plain_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
PLACE RED BY L FOUR AGAINWIEN                     |                          PLACE RED BY Q FOUR SOON
BIN RED IN T THREE NOW                            |                          BIN WHITE IN N THREE NOW
BIN BLUE WITH P TWO PLEASEN WHIE IN               |                          BIN BLUE IN B TWO PLEASE
SET GREEN IN X SIX SOONWEN                        |                         SET GREEN IN X EIGHT SOON
PLACE RED IN K TWO SOONWREN                       |                           PLACE RED IN K TWO SOON
LAY RED BY H SEVENIE IN                           |                            LAY RED BY H THREE NOW
BIN WHITE AT X NINE SOONWN N                      |                          SET W

In [110]:
board_name = "LipNet/mouth_only/unseen_plain/version6_15_speakers"
log_and_print_metrics(metrics_history, overall_metrics, board_name)

Loss: 1.0260133743286133, WER: 0.5119143239625168, CER: 0.3571097888913527


In [112]:
overall_metrics,  metrics_history = test(lipnet, lombardgrid_test_noisy_unseen_front_dataloader, device=device)

-----------------------------------------------------------------------------------------------------
predict                                           |                                             truth
-----------------------------------------------------------------------------------------------------
SET GREE AT S FOUR AGAIN IN                       |                         SET GREEN AT J FOUR AGAIN
PLACE GREEN IN U TWO SOONWEN N                    |                        PLACE GREEN IN Q NINE SOON
SET WHITE IN X EIGHT PLEASE EREENIT               |                        SET WHITE IN H NINE PLEASE
PLACE GREEN WITH M EIGHT PLEASEN WHIE IN          |                    PLACE GREEN WITH A FIVE PLEASE
SET RED AT C FIVE PLEASE BRUEN IN                 |                          SET RED AT C FIVE PLEASE
SET GREEN WITH L FOUR SOONWN                      |                        LAY GREEN WITH H FOUR SOON
PLACE GREEN AT S SEVEN AGAIN WEN IN               |                       PLACE GR

In [113]:
board_name = "LipNet/mouth_only/unseen_noise/version6_15_speakers"
log_and_print_metrics(metrics_history, overall_metrics, board_name)

Loss: 1.3939191102981567, WER: 0.5600806451612902, CER: 0.39876400109223337
