# Automatic Speech Recognition


Here we try to build a speech-recognition model from a minimal amount of speech data, to learn how much the quantity and the type of data matter:

- How much does equal phonemic distribution matter? 
- How much variation in speakers is required?
- What is a good-enough model?

We want to model the acoustic behaviour of human speech. To do that, we will:

1. Pick some audio files from any open-source dataset and place them in `./data/raw`.
2. Tag them with phonemes.
3. Split each audio file into 250ms frames, padding the last frame if the audio runs out, and save the resulting files in `./data/phonemes`.
4. Label each 250ms chunk with a single phoneme.
5. Slide the frame forward in steps of 50ms, so consecutive chunks overlap. (Kaldi does 20-25ms)
6. Mark a frame that contains no phonemes with SIL, for silence.

```
Chunks:
|-----------------------------------------------|
|<----------- total audio (400ms) ------------->|

|-----------------------------| [chunk:1](file__symA__1.wav)
         |-----------------------------| [chunk:2](file__symB__2.wav)
                  |-----------------------------| [chunk:3](file__symC__3.wav)
|< 50ms >|
|<----------- 250ms --------->|
```
The diagram shows the conventions we follow. A given `file` is split into 3 chunks by a 250ms frame that advances 50ms at a time. Each chunk is saved as a new file whose name joins the original file name, the phoneme contained in the frame, and the chunk number with `__`:

```
(file__symA__1.wav)
```
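
The naming convention can be sketched as a pair of helpers. The function names `chunk_name` and `parse_chunk_name` are hypothetical and not part of the notebook; the sketch only assumes the `__` separator and the 1-based chunk number described above.

```python
from pathlib import Path
from typing import Tuple


def chunk_name(source: str, phoneme: str, index: int) -> str:
    """Join the source file stem, the phoneme and the chunk number with '__'."""
    return f"{source}__{phoneme}__{index}.wav"


def parse_chunk_name(path: str) -> Tuple[str, str, int]:
    """Split a chunk file name back into (source, phoneme, index)."""
    # Split from the right so a source name that itself contains '__' survives.
    source, phoneme, index = Path(path).stem.rsplit("__", 2)
    return source, phoneme, int(index)


name = chunk_name("file", "symA", 1)
print(name)                    # file__symA__1.wav
print(parse_chunk_name(name))  # ('file', 'symA', 1)
```

Parsing the name back is handy later, since the phoneme in the file name is the training label for that chunk.
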
Once such files are created, we will prepare a PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.Dataset). There is another [guide](https://medium.com/speechmatics/how-to-build-a-streaming-dataloader-with-pytorch-a66dd891d9dd) that covers the topic in more detail.

Finally, we will train a phoneme decoder using a CNN-GRU architecture.

## Imports

In [1]:
# In case we need to save some artifacts
import pickle
# To utilize cores for data tasks (split audios and save)
import multiprocessing as mp
# Type check
from typing import List, Dict, Callable
# Get file paths
from glob import glob

# We should avoid np as much as possible and use torch since
# we have to use torch for models anyway.
import numpy as np
# NN models
import torch
# Visualizations
import matplotlib.pyplot as plt
# Control cells that must not always run
import ipywidgets as widgets
# read-write wave audios
from scipy.io import wavfile
# Since we will be training on spectrograms
from scipy.signal import spectrogram
# Play audios (analysis)
from IPython.display import Audio
# To not worry about vis legibility since we are using a dark theme.
from jupyterthemes import jtplot
# Building iterable dataset
from itertools import zip_longest

In [2]:
# Set matplotlib to go along the theme
jtplot.style(theme='solarizedd')

## Config

In [3]:
FRAME_SIZE = 250
WINDOW = 50
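
Both constants are in milliseconds, while the audio itself is indexed in samples. A minimal sketch of the conversion follows; the helper name `ms_to_samples` and the 16 kHz sample rate are assumptions made for illustration, since the real rate comes from each wav file.

```python
FRAME_SIZE = 250  # frame length in ms, as in the config above
WINDOW = 50       # step between consecutive frames in ms


def ms_to_samples(duration_ms: int, sample_rate: int) -> int:
    """Number of samples that cover `duration_ms` at `sample_rate` Hz."""
    return int(sample_rate * duration_ms / 1000)


sample_rate = 16_000  # assumed for illustration only
frame_samples = ms_to_samples(FRAME_SIZE, sample_rate)
hop_samples = ms_to_samples(WINDOW, sample_rate)
print(frame_samples, hop_samples)  # 4000 800
```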

## Data preparation

This section will be devoted to splitting audio files by phonemes and saving them in chunks of 250ms.

### Audio file chunk creation

Audio files are assumed to be saved in the "./data/raw/" dir. Prepare phoneme tags with Audacity or
any open-source audio-text aligner, and save the result as tab-separated columns in this format:

| start time | end time | phoneme |
| ---------- | -------- | ------- |
|  0.100     |  0.150   |  ah_S   |

and name it as `<file>_labels.txt` in the "./data/raw/" dir.
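
As a rough sketch, a label file in this format can be read as follows. It assumes tab-separated columns, which is what the parsing code in this notebook splits on; the helper `read_labels`, the temporary file and the second phoneme `k_B` are made up for the example.

```python
import os
import tempfile
from typing import List, Tuple


def read_labels(path: str) -> List[Tuple[float, float, str]]:
    """Read (start, end, phoneme) rows from a tab-separated label file."""
    rows = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            start, end, phoneme = line.split("\t")
            rows.append((float(start), float(end), phoneme))
    return rows


# Write a tiny label file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    label_path = os.path.join(tmp, "sample_labels.txt")
    with open(label_path, "w") as f:
        f.write("0.100\t0.150\tah_S\n0.150\t0.210\tk_B\n")
    labels = read_labels(label_path)

print(labels)  # [(0.1, 0.15, 'ah_S'), (0.15, 0.21, 'k_B')]
```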

Now we will break each audio file into frames of `FRAME_SIZE` milliseconds.

In [4]:
def get_padded_last_timestamp(timestamp: float) -> int:
    """
    Round a timestamp (in seconds) up to the next multiple of FRAME_SIZE
    and return it in milliseconds.
    
    example:
    
    Let timestamp be 5.120s. Its last chunk falls short of FRAME_SIZE (250ms),
    so we extend it until the last chunk matches the frame-size.
    
    >> 5120 - 5000 => 120 (can be covered in 1 extra frame)
    >> 120//250 + 1 (gives us the number of extra frames)
    >> 5000 + 250 * (120//250 + 1) => 5250 (the padded timestamp in ms)
    """
    timestamp_ms = int(round(timestamp * 1000))
    # Ceiling division: number of whole frames needed to cover the audio.
    # A timestamp already on a frame boundary is returned unchanged (in ms).
    n_frames = -(-timestamp_ms // FRAME_SIZE)
    return n_frames * FRAME_SIZE

In [5]:
def get_phoneme_timestamps(audio_label_path: str) -> List[Dict]:
    """
    Read audio label, return phoneme information for each frame.
    
    audio_labels contain tab-separated information in this format:
    | start time | end time | phoneme |
    | ---------- | -------- | ------- |
    |  0.100     |  0.150   |  ah_S   |
    
    We want to return a list shaped like this (times in seconds):
    [{
        "start": 0.0,
        "end": 0.25,
        "phoneme": "SIL"
    }, {
        "start": 0.05,
        "end": 0.3,
        "phoneme": "ah_S"
    }, ...]
    """
    with open(audio_label_path, "r") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        phoneme_timestamps = [t for t in f.read().splitlines() if t.strip()]
    
    phoneme_timestamps_ = []
    starts, ends, phonemes = zip(*[t.split("\t") for t in phoneme_timestamps])
    
    last_timestamp = float(ends[-1])
    padded_last_timestamp = int(get_padded_last_timestamp(last_timestamp))

    for i in range(0, padded_last_timestamp, WINDOW):
        start_time = i/1000
        end_time = (i + FRAME_SIZE)/1000
        sym = None
        # Labels are assumed to be sorted by time.
        for start, end, phoneme in zip(starts, ends, phonemes):
            if float(end) > end_time:
                break
            # Only accept phonemes that lie inside this frame; otherwise a
            # silent frame would inherit the last phoneme that ended before it.
            if float(start) >= start_time:
                sym = phoneme
            
        phoneme_timestamps_.append({
            "start": start_time,
            "end": end_time,
            "phoneme": sym or "SIL"
        })

    return phoneme_timestamps_
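
The frame labelling can be tried on in-memory data. This is a minimal sketch that uses integer milliseconds and reads "a frame contains a phoneme" as "the phoneme lies fully inside the frame", which is one possible interpretation; the function `label_frames` is hypothetical.

```python
from typing import List, Tuple

FRAME_MS = 250  # frame length
HOP_MS = 50     # step between frames


def label_frames(labels: List[Tuple[int, int, str]], total_ms: int) -> List[Tuple[int, int, str]]:
    """Label every frame with the last phoneme fully inside it, else SIL."""
    frames = []
    for start_ms in range(0, total_ms, HOP_MS):
        end_ms = start_ms + FRAME_MS
        sym = "SIL"
        for p_start, p_end, phoneme in labels:
            if p_start >= start_ms and p_end <= end_ms:
                sym = phoneme
        frames.append((start_ms, end_ms, sym))
    return frames


# One phoneme between 100ms and 150ms in a 250ms recording.
frames = label_frames([(100, 150, "ah_S")], total_ms=250)
for frame in frames:
    print(frame)
# (0, 250, 'ah_S')
# (50, 300, 'ah_S')
# (100, 350, 'ah_S')
# (150, 400, 'SIL')
# (200, 450, 'SIL')
```

Frames that start after the phoneme has begun no longer contain it and fall back to SIL.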

In [6]:
def save_phoneme_frames(audio_path: str) -> None:
    """
    Break the audio at `audio_path` into frames of FRAME_SIZE.
    Audio files are assumed to be placed in the "./data/raw/" dir.
    Chunked audio files are written to the "./data/phonemes/" dir,
    which must already exist.
    
    Frames with oov phonemes and files with missing phoneme labels are skipped.
    """
    audio_name = audio_path.replace(".wav", "").rsplit("/")[-1]
    audio_label_path = f"data/raw/{audio_name}_labels.txt"
    try:
        phoneme_timestamps = get_phoneme_timestamps(audio_label_path)
    except FileNotFoundError:
        # No label file for this audio, so there is nothing to chunk.
        return
    sr, wav = wavfile.read(audio_path)
    # Number of samples in one frame.
    data_size = int((sr * FRAME_SIZE)/1000)
    for i, phoneme_timestamp in enumerate(phoneme_timestamps):
        phoneme = phoneme_timestamp["phoneme"]
        if "oov" not in phoneme:
            start = int(phoneme_timestamp["start"] * sr)
            end = int(phoneme_timestamp["end"] * sr)
            data = wav[start:end]
            if len(data) < data_size:
                # Zero-pad short frames at the end (assumes mono audio).
                data = np.pad(data, (0, data_size - len(data)), 'constant', constant_values=(0))
            wavfile.write(f"data/phonemes/{audio_name}__{phoneme}__{i + 1}.wav", sr, data)
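
The padding step in isolation looks like this. It is a small sketch for mono audio; `pad_to_length` is a hypothetical helper, and the three-sample array stands in for a frame that ran past the end of the recording.

```python
import numpy as np


def pad_to_length(samples: np.ndarray, target_len: int) -> np.ndarray:
    """Append zeros (digital silence) until `samples` has `target_len` items."""
    missing = target_len - len(samples)
    if missing <= 0:
        return samples
    return np.pad(samples, (0, missing), mode="constant", constant_values=0)


short = np.array([3, -2, 5], dtype=np.int16)
padded = pad_to_length(short, 5)
print(padded)        # [ 3 -2  5  0  0]
print(padded.dtype)  # int16
```

`np.pad` keeps the input dtype, so the padded frame can be written back as a wav file with the same sample format.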

In [7]:
def mp__save_phoneme_frames(fn: Callable, data: List[str], workers: int = None) -> None:
    workers = workers or mp.cpu_count()
    # Use the arguments instead of globals, and close the pool when done.
    with mp.Pool(workers) as pool:
        pool.map(fn, data)

In [8]:
audio_chunk_btn = widgets.Button(description="Chunk audios")
output = widgets.Output()

display(audio_chunk_btn, output)
audio_files = glob("data/raw/*.wav")

def on_audio_chunk(btn):
    with output:
        mp__save_phoneme_frames(save_phoneme_frames, audio_files)

audio_chunk_btn.on_click(on_audio_chunk)


### Pytorch IterableDataset

An iterable Dataset.

- All datasets that represent an iterable of data samples should subclass it. Such form of datasets is particularly useful when data come from a stream.
- All subclasses should overwrite `__iter__()`, which would return an iterator of samples in this dataset.

When a subclass is used with DataLoader, each item in the dataset will be yielded from the DataLoader iterator. When `num_workers > 0`, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers. `get_worker_info()`, when called in a worker process, returns information about the worker. It can be used in either the dataset's `__iter__()` method or the `DataLoader`'s `worker_init_fn` option to modify each copy's behavior.
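
To make "configure each copy independently" concrete, here is a framework-free sketch of how a list of files could be split between workers. The helper `shard_bounds` and the file names are hypothetical; inside a real dataset, `num_workers` and `worker_id` would come from the `num_workers` and `id` attributes of the object returned by `get_worker_info()`.

```python
import math
from typing import Tuple


def shard_bounds(n_items: int, num_workers: int, worker_id: int) -> Tuple[int, int]:
    """Start and end index of the slice that one worker should read."""
    # Round up so the remainder is not dropped when n_items is not
    # an exact multiple of num_workers.
    per_worker = math.ceil(n_items / num_workers)
    start = min(worker_id * per_worker, n_items)
    end = min(start + per_worker, n_items)
    return start, end


files = [f"chunk_{i}.wav" for i in range(10)]
shards = [files[slice(*shard_bounds(len(files), 3, w))] for w in range(3)]
print([len(s) for s in shards])  # [4, 4, 2]
```

Every file lands in exactly one shard, so no worker returns data that another worker also returns.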

In [9]:
def grouper(iterable, n, fillvalue=None):
    """
    Similar to np.array_split(arr, chunks), but lazy
    and with fixed-size groups of n items.
    
    Let, iterable = [1, 2, 3, 4, 5]
    >> list(grouper(iterable, 2))
    >> [(1, 2), (3, 4), (5, None)]
    
    Tried grouper(list(range(200000)), 2)
    returns in 51ms.
    """
    # Create a list holding the same iterator n times, so each
    # tuple position pulls the next item from the shared iterator.
    args = [iter(iterable)] * n
    
    # zip will bunch together n elements
    # zip_longest additionally keeps a short final chunk
    # that zip would have discarded, padding it with fillvalue
    return zip_longest(*args, fillvalue=fillvalue)
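
A short, self-contained run of the same recipe shows the padding on the last group and one way to drop it again. The file names are placeholders.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-size tuples, padding the last one."""
    args = [iter(iterable)] * n  # the same iterator, n times
    return zip_longest(*args, fillvalue=fillvalue)


batches = list(grouper(["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"], 2))
print(batches)
# [('a.wav', 'b.wav'), ('c.wav', 'd.wav'), ('e.wav', None)]

# The fill value is not a real file, so filter it out before loading audio.
clean = [[f for f in batch if f is not None] for batch in batches]
print(clean[-1])  # ['e.wav']
```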

In [10]:
def wav_to_spectrogram(file: str):
    """
    Given a path to a wav file,
    Return a spectrogram as np.ndarray
    """
    sr, data = wavfile.read(file)
    # scipy.signal.spectrogram takes the sample rate as `fs`.
    _, _, specgram = spectrogram(data, fs=sr)
    return specgram


def lazy_load_files(files: List[str]):
    """
    For each wav file in a list,
    return their spectrogram within an iterator.
    """
    # grouper pads the last batch with None, so drop those entries.
    return map(wav_to_spectrogram, filter(None, files))


def stream_batch(batched_files):
    """
    Wrapper around lazy_load_files to implement
    batches as per pytorch's IterableDataset spec.
    """
    return map(lazy_load_files, batched_files)
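
To see what one spectrogram looks like without any audio files, the sketch below runs `scipy.signal.spectrogram` on a synthetic tone. The 16 kHz sample rate, the 440 Hz tone and the explicit `nperseg=256` are choices made for this illustration, not settings taken from the notebook.

```python
import numpy as np
from scipy.signal import spectrogram

sr = 16_000                           # assumed sample rate
t = np.arange(int(0.25 * sr)) / sr    # 250ms of time stamps
wave = np.sin(2 * np.pi * 440.0 * t)  # synthetic 440 Hz tone

# Rows are frequency bins, columns are time segments.
freqs, times, spec = spectrogram(wave, fs=sr, nperseg=256)
print(spec.shape[0])  # 129 frequency bins (nperseg // 2 + 1)

# The strongest bin should be the one closest to the tone.
peak_bin = spec.mean(axis=1).argmax()
print(freqs[peak_bin])
```

The number of columns depends on the chunk length and the overlap between segments, which is why every chunk is padded to the same 250ms before this step.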

In [11]:
class IterableAudioStreamDataset(torch.utils.data.IterableDataset):
    def __init__(self, data_path: str = None, batch_size: int = 0) -> None:
        """
        Setup the instance with all the assets that can be used by __iter__.
        We also do validations here.
        """
        super().__init__()
        self.files = glob(f"{data_path}/*.wav")
        self.batch_size = batch_size
        assert self.files, f"No .wav files found at path {data_path}\nIs there a typo?"
        assert self.batch_size > 0, "Batch size should be greater than 0!"

    def __iter__(self):
        """
        Return an iterator over batches of spectrograms.
        
        This loader returns data as required instead of loading
        everything in memory, which can slow down training even if
        GPUs are involved. To offset that, the DataLoader can use
        multiple cpu cores to fetch data.
        
        If the DataLoader is created with `num_workers > 0`, each worker
        process gets its own copy of this dataset, so every copy must
        serve a different slice of the files.
        """
        worker_info = torch.utils.data.get_worker_info()
        # Without worker processes (`num_workers = 0`) this single
        # copy of the dataset serves every file.
        start_idx = 0
        end_idx = len(self.files)
        if worker_info is not None:
            # Inside a worker process we need to set indices to
            # slice our list of files such that there is
            # no duplication of data.
            n_workers = worker_info.num_workers
            
            # Number of files per worker, rounded up so that
            # no file is left out.
            work_load = int(np.ceil(len(self.files) / n_workers))
            # So, we will index each slice by worker id
            worker_id = worker_info.id
            # such that each worker starts...
            start_idx = worker_id * work_load
            # and ends a slice without duplication.
            end_idx = min(start_idx + work_load, len(self.files))
        return stream_batch(grouper(self.files[start_idx:end_idx], self.batch_size))

In [12]:
ds = IterableAudioStreamDataset(data_path="./data/phonemes", batch_size=5)
# We are handling the batch_size within the dataset, so we tell
# the DataLoader not to batch again by passing batch_size=None.
loader = torch.utils.data.DataLoader(ds, batch_size=None, num_workers=2)

## Model Architecture

We will build a CNN-GRU model.

- CNN: identifies phonemes in a spectrogram.
- GRU: remembers information across sequences.
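
One possible shape for such a model is sketched below. Everything specific here is an assumption made for illustration: the class name `CnnGruSketch`, the single convolution layer, the layer sizes, the 129 frequency bins and the 10 output classes. It is not the architecture this notebook goes on to train.

```python
import torch
from torch import nn


class CnnGruSketch(nn.Module):
    """Convolution over the spectrogram, then a GRU over its time axis."""

    def __init__(self, n_phonemes: int, n_freq_bins: int = 129, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),  # halve the frequency axis only
        )
        self.gru = nn.GRU(
            input_size=16 * (n_freq_bins // 2),
            hidden_size=hidden,
            batch_first=True,
        )
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, freq, time)
        x = self.conv(spec.unsqueeze(1))      # (batch, 16, freq // 2, time)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, time, features)
        _, h = self.gru(x)                    # h: (1, batch, hidden)
        return self.out(h[-1])                # (batch, n_phonemes)


model = CnnGruSketch(n_phonemes=10)
logits = model(torch.randn(4, 129, 17))  # 4 random "spectrograms"
print(logits.shape)  # torch.Size([4, 10])
```

The convolution looks at local time-frequency patterns, and the GRU summarises them along the time axis into one vector per chunk, which the final linear layer maps to phoneme scores.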