# Seminar 4 and Homework 3

In Lecture 5, you got acquainted with another speech recognition model -- RNN-Transducer (RNN-T). This assignment is again a combination of a seminar and a homework. In them, you will first learn how the training of this model works, and also build and train a small version of it.

## Seminar 4 (10 points)

In seminar 4 you will implemement forward and backward algorithms for calculating the RNN-T loss.

## Homework 3 (40 points)

In this homework you will implement a variant of the RNN-T model. For that, you will have to
- implement each part of its architecture: Encoder, Predictor, Joiner
- implement the greedy decoding algorithm
- train your model on a subset of the LibriSpeech corpus

## Submitting results
This Jupyter notebook contains both a seminar and a homework.

For the seminar deadline please submit this Jupiter Notebook `(.ipynb)` with completed cells of the seminar. Save the artifact to a directory named `{your last name}_{your first name}_sem3` and pack them in `.zip` archive.

For the homework deadline please submit this Jupiter Notebook `(.ipynb)` with all cells completed.
Save the notebook and model weights to a directory named `{your last name}_{your first name}_hw3` and pack them in `.zip` archive.

# Setup - Install package, download files, etc...

In [1]:
!mkdir week_06_files
!wget -O week_06_files/utils.py https://raw.githubusercontent.com/yandexdataschool/speech_course/main/week_05/utils.py


mkdir: cannot create directory ‘week_06_files’: File exists
--2022-06-13 13:27:08--  https://raw.githubusercontent.com/yandexdataschool/speech_course/main/week_05/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6770 (6.6K) [text/plain]
Saving to: ‘week_06_files/utils.py’


2022-06-13 13:27:08 (33.7 MB/s) - ‘week_06_files/utils.py’ saved [6770/6770]



In [2]:
# TODO: change link to a link from repository
!wget --load-cookies /tmp/cookies.txt "https://drive.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=14vgOVBayQGYv9B1P3hYo3JM56rS6ap3U' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=14vgOVBayQGYv9B1P3hYo3JM56rS6ap3U" -O week_06_files/model_scripted_epoch_5.pt && rm -rf /tmp/cookies.txt


--2022-06-13 13:27:09--  https://drive.google.com/uc?export=download&confirm=t&id=14vgOVBayQGYv9B1P3hYo3JM56rS6ap3U
Resolving drive.google.com (drive.google.com)... 173.194.222.194, 2a00:1450:4010:c0b::c2
Connecting to drive.google.com (drive.google.com)|173.194.222.194|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0c-9s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/dmtshdck6a46oklrjv0dbnbpe59utnsd/1655115975000/02999746975866030610/*/14vgOVBayQGYv9B1P3hYo3JM56rS6ap3U?e=download [following]
--2022-06-13 13:27:09--  https://doc-0c-9s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/dmtshdck6a46oklrjv0dbnbpe59utnsd/1655115975000/02999746975866030610/*/14vgOVBayQGYv9B1P3hYo3JM56rS6ap3U?e=download
Resolving doc-0c-9s-docs.googleusercontent.com (doc-0c-9s-docs.googleusercontent.com)... 142.251.1.132, 2a00:1450:4010:c1e::84
Connecting to doc-0c-9s-docs.googleusercontent.com (doc-0c-9s-docs.g

In [3]:
import os
import string
from typing import Tuple, List, Dict, Optional

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import wandb
import ipywidgets as widgets
import itertools
from torch import optim
from torchaudio.transforms import RNNTLoss
from tqdm import tqdm_notebook, tqdm
from IPython.display import display, clear_output

In [4]:
import week_06_files.utils as utils 

In [5]:
snapshot_dir = "rnn_t_snapshots"

In [6]:
!mkdir rnn_t_snapshots

mkdir: cannot create directory ‘rnn_t_snapshots’: File exists


In [7]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Lecture recap

## Alignment

Let $\mathbf{x} = (x_1, x_2, \ldots, x_T)$ be a length $T$ input sequence of arbitrary length beloging to the set $X^*$ of all sequences over some input space $X$. Let $\mathbf{y} = (y_1, \ldots, y_U)$ be a length $U$ output sequence belonging to the set $Y^*$ of all sequences over some output space $Y$.

Define the *extended output space* $\overline Y$ as $Y \cup \emptyset$, where $\emptyset$ denotes the null output. The intuitive meaning of $\emptyset$ is 'output nothing'. The sequence $(y_1, \emptyset, \emptyset, y_2, \emptyset, y_3) \in \overline Y^*$ is therefore equivalent to $(y_1, y_2, y_3) \in Y^*$. We refer to the elements $\mathbf{a} \in \overline Y^*$ as *alignments*, since the location of the null symbols determines an alignment between the input and output sequences.

As we saw in CTC, various alignments can be represented in the form of a table called trellis. An example of how an RNN-T trellis may look like:

<p style="text-align:center;"><img src="http://drive.google.com/uc?export=view&id=1CfXfkePAESz2n20AABVUw9SaZ_xszxwf">
    
    
Possible alignments in that trellis:
    
<p style="text-align:center;"><img src="http://drive.google.com/uc?export=view&id=1ipRlSrznwmoD5gCk7k6G06JeUtqPzDQq">
    
The final label can be determined by simply removing the blank characher:
    
$$
    C \emptyset \emptyset A \emptyset T \emptyset \to CAT
$$
$$
    \emptyset \emptyset \emptyset C A T \emptyset \to CAT
$$
    
Given $\mathbf{x}$, the RNN transducer defines a conditional distribution $P(\mathbf{a} \in \overline Y^* | \mathbf{x})$. This distribution is then collapsed onto the following distribution over $Y^*$:
    
$$
    P(\mathbf y \in Y^* | \mathbf x) = \sum_{\mathbf a \in \mathcal{B}^{-1}(\mathbf y)} P(\mathbf a | \mathbf x),
$$
    
where $\mathcal B: \overline Y^* \mapsto Y^*$ is a function that removes the null symbols from the alignments in $Y^*$.


## Architecture

<p style="text-align:center;"><img src="http://drive.google.com/uc?export=view&id=1P2aztCi9Z7ookMbHmWBcGtSmG_JHIiMj">

The RNN-T model consists of three neural networks: Encoder, Predictor and Joiner. The Encoder converts the acoustic feature $x_t$ into a high-level representation $f_t$, where $t$ is time index:

$$
    f_t = \mathrm{Encoder}(x_t)
$$

The Predictor works like an RNN language model, which produces a high-level representation $g_u$ by conditioning on the previous non-blank target $y_{u - 1}$ predicted by the RNN-T model, where $u$ is output label index:

$$
    g_u = \mathrm{Predictor}(y_{u - 1})
$$

Note that the input sequence for the predictor **is prepended with the special symbol** $\langle s \rangle$ that defines the start of a sentence.

The Joiner is a feed forward network that combines the Encoder output $f_t$ and the Predictor output $g_u$ as

$$
    h_{t, u} = \mathrm{Joiner}(f_t, g_u) = \mathrm{FeedForward}(\mathrm{ReLU}(f_t + g_u))
$$

The final posterior for each output token $y$ is obtained after applying the softmax operation:

$$
    P(y | t, u) = \mathrm{softmax}(h_{t, u})
$$
    
where $P(y | t, u)$ is a distribution of probabilities to emit $y \in \overline Y$ at time step $t$ after $u$ previously generated characters, $t \in [1, T], u \in [0, U]$.

<p style="text-align:center;"><img src="http://drive.google.com/uc?export=view&id=1tn1wS3fCVFJGwrYumf5Im6gOFZsxRMV-">

We will further need to work with probabilities of individual tokens $y$ for different $t$ and $u$. Instead of writing each time something like $P(y = C | t = 1, u = 0)$, we will, for the sake of simplicity, write it as $P(C | 1, 0)$.

## Training: forward-backward algorithm

The loss function of RNN-T is the negative log posterior of output label sequence $\mathbf y$ given acoustic feature $\mathbf x$:

$$
    \mathcal L = -\ln P(\mathbf y \in Y^* | \mathbf x) = -\ln \sum_{\mathbf a \in \mathcal{B}^{-1}(\mathbf y)} P(\mathbf a | \mathbf x)
$$

To determine $P(\mathbf a | \mathbf x)$ for an arbitrary alignment $\mathbf a$, we need to multiply the probabilities $P(y | t, u)$ of each symbol across the path:

<p style="text-align:center;"><img src="http://drive.google.com/uc?export=view&id=1O-aykP5Wods7ZESCJDBsBw2MeBo5egW4">

$$
    \mathbf a = C \emptyset \emptyset A \emptyset T \emptyset
$$
    
$$
    P(\mathbf a | \mathbf x) = P(C | 1, 0) \cdot P(\emptyset | 1, 0) \cdot P(\emptyset | 2, 1) \cdot P(A | 3, 1) \cdot P(\emptyset | 3, 2) \cdot P(T | 3, 2) \cdot P(\emptyset | 4, 3)
$$

There are usually too many possible alignments to compute the loss function by just adding them all up directly. We will use dynamic programming to make this computation feasible.

Define the *forward variable* $\alpha(t, u)$ as the probability of outputting $\mathbf y_{[1:u]}$ during $\mathbf f_{[1:t]}$. The forward variables for all $1 \le t \le T$ and $0 \le u \le U$ can be calculated recursively using

$$
    \alpha(t, u) = \alpha(t - 1, u) P(\emptyset | t - 1, u) + \alpha(t, u - 1) P(y_{u - 1} | t, u - 1)
$$

with initial condition $\alpha(1, 0) = 1$. Here $y_{u - 1}$ is the $(u - 1)$-th symbol from the ground truth label $\mathbf y$.

The total output sequene probability is equal to the forward variable at the terminal node:

$$
    P(\mathbf y | \mathbf x) = \alpha(T, U) P(\emptyset | T, U)
$$

Define the *backward variable* $\beta(t, u)$ as the probability of outputting $\mathbf y_{[u + 1: U]}$ during $\mathbf f_{[t:T]}$. Then

$$
    \beta(t, u) = \beta(t + 1, u) P(\emptyset | t, u) + \beta(t, u + 1) P(y_u | t, u)
$$

with initial condition $\beta(T, U) = P(\emptyset | T, U)$. The final value is $\beta(1, 0)$.

From the definition of the forward and backward variables it follows that their product $\alpha(t, u) \beta(t, u)$ at any point $(t, u)$ in the output lattice is equal to the probability of emitting the complete output sequence *if $y_u$ is emitted during transcription step $t$*.

# Seminar 4: RNN-T Forward-Backward Algorithm (10 points)

```
[ ] (5 points) Implement a Forward Pass
[ ] (5 points) Implement a Backward Pass
```

Implement forward and backward passes.


### Implementation tips

- Note that all indices in the arrays you will work with in your code start with zeros. So, the initial condition for forward algorithm will be $\alpha(0, 0) = 1$ (and $\log \alpha(0, 0) = 0$) and the output value for backward algorithm will be $\beta(0, 0)$. The recurrent formulas stay the same. Also, don't be confused with the terminal node: you don't have to add it to $\alpha$- and $\beta$-arrays. The dynamic starts in the upper left corner for forward variables and in the lower right corner for backward variables.
- You will need to do everything in log-domain for calculations to be numercally stable. The function [np.logaddexp](https://numpy.org/doc/stable/reference/generated/numpy.logaddexp.html) might help you with it.

In [8]:
def log_mult(*args) -> float:
    exps = np.prod([np.exp(x) for x in args])
    if exps == 0:
        return NEG_INF
    
    return np.log(exps)

In [9]:
def forward(log_probs: torch.FloatTensor, targets: torch.LongTensor, 
            blank: int = -1) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
    """
    :param log_probs: model outputs after applying log_softmax
    :param targets: the target sequence of tokens, represented as integer indexes
    :param blank: the index of blank symbol
    :return: Tuple[ln alpha, -(ln alpha(T, U) + ln P(blank | T, U))]. The latter term is loss value, which is -ln P(y | x)
    """
    max_T, max_U, D = log_probs.shape
    
    # here the alpha variable contains logarithm of the alpha variable from the formulas above
    alpha = np.zeros((max_T, max_U), dtype=np.float32)
    
    for t in range(1, max_T):
        alpha[t, 0] = log_mult(alpha[t - 1, 0], log_probs[t - 1, 0, blank])

    for u in range(1, max_U):
        alpha[0, u] = log_mult(alpha[0, u - 1], log_probs[0, u - 1, targets[u - 1]])

    for t in range(1, max_T):
        for u in range(1, max_U):
            up = log_mult(alpha[t, u - 1], log_probs[t, u - 1, targets[u - 1]])
            left = log_mult(alpha[t - 1, u], log_probs[t - 1, u, blank])

            alpha[t, u] = np.logaddexp(up, left)

    cost = -log_mult(alpha[-1, -1], log_probs[-1, -1, blank])
    return alpha, cost


def backward(log_probs: torch.FloatTensor, targets: torch.LongTensor, 
             blank: int = -1) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
    """
    :param log_probs: model outputs after applying log_softmax
    :param targets: the target sequence of tokens, represented as integer indexes
    :param blank: the index of blank symbol
    :return: Tuple[ln beta, -ln beta(0, 0)]. The latter term is loss value, which is -ln P(y | x)
    """
    max_T, max_U, D = log_probs.shape
    
    # here the beta variable contains logarithm of the beta variable from the formulas above
    beta = np.zeros((max_T, max_U), dtype=np.float32)
    beta[-1, -1] = log_probs[-1, -1, blank]

    for t in reversed(range(max_T - 1)):
        beta[t, max_U - 1] = log_mult(beta[t + 1, max_U - 1], log_probs[t, max_U - 1, blank])

    for u in reversed(range(max_U - 1)):
        beta[max_T - 1, u] = log_mult(beta[max_T - 1, u + 1], log_probs[max_T - 1, u, targets[u]])

    for t in reversed(range(max_T - 1)):
        for u in reversed(range(max_U - 1)):
            right = log_mult(beta[t + 1, u], log_probs[t, u, blank])
            down = log_mult(beta[t, u + 1], log_probs[t, u, targets[u]])

            beta[t, u] = np.logaddexp(right, down)
            
    cost = -beta[0, 0]
    return beta, cost

In [10]:
def run_test(logits: torch.FloatTensor, targets: torch.LongTensor, 
             ref_costs: torch.FloatTensor, blank: int = -1) -> None:
    """
    :param logits: model outputs
    :param targets: the target sequence of tokens, represented as integer indexes
    :param ref_costs: the true values of RNN-T costs for test inputs
    :param blank: the index of blank symbol
    """
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
    cost = np.zeros(log_probs.shape[0])
    
    for batch_id in range(log_probs.shape[0]):        
        alphas, cost_alpha = forward(log_probs[batch_id], targets[batch_id], blank=blank)
        betas, cost_beta = backward(log_probs[batch_id], targets[batch_id], blank=blank)
        np.testing.assert_almost_equal(cost_alpha, cost_beta, decimal=2)
        cost[batch_id] = cost_alpha
    
    np.testing.assert_almost_equal(cost, ref_costs, decimal=2)

In [11]:
# Tests

'''
All logits in tests have shapes in the form (B, T, U, D) where

B: batch size
T: maximum source sequence length in batch
U: maximum target sequence length in batch
D: feature dimension of each source sequence element
'''

# test 1
logits = torch.FloatTensor([
    0.1, 0.6, 0.1, 0.1, 0.1,
    0.1, 0.1, 0.6, 0.1, 0.1,
    0.1, 0.1, 0.2, 0.8, 0.1,
    0.1, 0.6, 0.1, 0.1, 0.1,
    0.1, 0.1, 0.2, 0.1, 0.1,
    0.7, 0.1, 0.2, 0.1, 0.1,
]).reshape(1, 2, 3, 5)

targets = torch.LongTensor([[1, 2]])
ref_costs = torch.FloatTensor([5.09566688538])

run_test(
    logits=logits, 
    targets=targets, 
    ref_costs=ref_costs, 
    blank=-1
)

# test 2
logits = torch.FloatTensor([
    0.065357, 0.787530, 0.081592, 0.529716, 0.750675, 0.754135, 0.609764, 0.868140,
    0.622532, 0.668522, 0.858039, 0.164539, 0.989780, 0.944298, 0.603168, 0.946783,
    0.666203, 0.286882, 0.094184, 0.366674, 0.736168, 0.166680, 0.714154, 0.399400,
    0.535982, 0.291821, 0.612642, 0.324241, 0.800764, 0.524106, 0.779195, 0.183314,
    0.113745, 0.240222, 0.339470, 0.134160, 0.505562, 0.051597, 0.640290, 0.430733,
    0.829473, 0.177467, 0.320700, 0.042883, 0.302803, 0.675178, 0.569537, 0.558474,
    0.083132, 0.060165, 0.107958, 0.748615, 0.943918, 0.486356, 0.418199, 0.652408,
    0.024243, 0.134582, 0.366342, 0.295830, 0.923670, 0.689929, 0.741898, 0.250005,
    0.603430, 0.987289, 0.592606, 0.884672, 0.543450, 0.660770, 0.377128, 0.358021,
]).reshape(2, 4, 3, 3)

targets = torch.LongTensor([[1, 2], [1, 1]])
ref_costs = torch.FloatTensor([4.2806528590890736, 3.9384369822503591])

run_test(
    logits=logits, 
    targets=targets, 
    ref_costs=ref_costs, 
    blank=0
)

# Homework 4: Implementing, training and evaluating your RNN-T ASR model (40 points)

```
[ ] (18 points) Build the model
[ ] (18 points) Implementing a greedy decoder
[ ] (4 points) Train the model 
```

In [12]:
BLANK_SYMBOL = "_"
BOS = "<BOS>"


class Tokenizer:
    """
    Maps characters to integers and vice versa
    """
    def __init__(self):
        self.char_map = {}
        self.index_map = {}
        for i, ch in enumerate(["'", " "] + list(string.ascii_lowercase) + [BLANK_SYMBOL, BOS]):
            self.char_map[ch] = i
            self.index_map[i] = ch
        
    def text_to_indices(self, text: str) -> List[int]:
        """
        Maps string to a list of integers
        """
        return [self.char_map[ch] for ch in text]

    def indices_to_text(self, labels: List[int]) -> str:
        """
        Maps integers back to text
        """
        return "".join([self.index_map[i] for i in labels])
    
    def get_symbol_index(self, sym: str) -> int:
        """
        Returns index for the specified symbol
        """
        return self.char_map[sym]
    

tokenizer = Tokenizer()

### Utils for creating a dataloader

In [13]:
# Download LibriSpeech 100hr training and test data

if not os.path.isdir("./data"):
    os.makedirs("./data")

train_dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)
test_dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)

In [14]:
# For train you can use SpecAugment data aug here.
train_audio_transforms = nn.Sequential(
    torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80),
    torchaudio.transforms.FrequencyMasking(freq_mask_param=27),
    torchaudio.transforms.TimeMasking(time_mask_param=100)
)

test_audio_transforms = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

In [15]:
def data_processing(data: torchaudio.datasets.librispeech.LIBRISPEECH, 
                    data_type: str = "train") -> Tuple[torch.Tensor, torch.IntTensor, torch.IntTensor, torch.IntTensor]:
    """
    :param data: a LIBRISPEECH dataset
    :param data_type: "train" or "test"
    :return: tuple of
        spectrograms, shape: (B, T, n_mels)
        labels, shape: (B, U)
        input_lengths -- the length of each spectrogram in the batch, shape: (B,)
        label_lengths -- the length of each text label in the batch, shape: (B,)
        where
        B: batch size
        T: maximum source sequence length in batch
        U: maximum target sequence length in batch
        D: feature dimension of each source sequence element
    """
    spectrograms = []
    labels = []
    input_lengths = []
    label_lengths = []
    for (waveform, _, utterance, _, _, _) in data:
        if data_type == 'train':
            spec = train_audio_transforms(waveform).squeeze(0).transpose(0, 1)
        elif data_type == 'test':
            spec = test_audio_transforms(waveform).squeeze(0).transpose(0, 1)
        else:
            raise Exception('data_type should be train or valid')
        spectrograms.append(spec)
        label = torch.IntTensor(tokenizer.text_to_indices(utterance.lower()))
        labels.append(label)
        input_lengths.append(spec.shape[0])
        label_lengths.append(len(label))

    spectrograms = nn.utils.rnn.pad_sequence(spectrograms, batch_first=True)
    labels = nn.utils.rnn.pad_sequence(labels, batch_first=True)

    return spectrograms, torch.IntTensor(labels), torch.IntTensor(input_lengths), torch.IntTensor(label_lengths)


## Build the model (18 points)

In [16]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class EncoderRNNT(nn.Module):
    def __init__(self, input_dim: int, hidden_size: int, output_dim: int, n_layers: int, 
                 dropout: float = 0.2, bidirectional: bool = True):
        """
        An RNN-based model that encodes input audio features into a hidden representation. 
        The architecture is a stack of LSTM's followed by a fully-connected output layer.

        :param input_dim: the number of mel-spectrogram features
        :param hidden_size: the number of features in the hidden states in LSTM layers
        :param output_dim: the output dimension
        :param n_layers: the number of stacked LSTM layers
        :param dropout: the dropout probability for LSTM layers
        :param bidirectional: If True, each LSTM layer becomes bidirectional
        """
        super().__init__()

        self.lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_size, num_layers=n_layers,
                           dropout=dropout, bidirectional=bidirectional)

        right_hidden = 2 * hidden_size if bidirectional else hidden_size
        self.output_proj = nn.Linear(right_hidden, output_dim)

    def forward(self, inputs: torch.Tensor, input_lengths: torch.Tensor) -> Tuple[torch.Tensor, List[torch.Tensor]]:
        """
        :param inputs: spectrograms, shape: (B, T, n_mels)
        :param input_lengths: the lengths of the spectrograms in the batch, shape: (B,)
        :return: outputs of the projection layer and hidden states from LSTMs
        """
        padded_states = pack_padded_sequence(inputs, input_lengths, batch_first=True, enforce_sorted=False)
        x, hidden = self.lstm(padded_states)
        x, _ = pad_packed_sequence(x, batch_first=True)

        logits = self.output_proj(x)

        return logits, hidden

In [17]:
encoder = EncoderRNNT(
    input_dim=80,
    hidden_size=320,
    output_dim=512, 
    n_layers=4,
    dropout=0.2,
    bidirectional=True
)

loader = data.DataLoader(test_dataset, batch_size=2, shuffle=False, collate_fn=lambda x: data_processing(x, 'test'))
spectrograms, labels, input_lengths, label_lengths = next(iter(loader))
logits, hidden_states = encoder.forward(spectrograms, input_lengths)

assert spectrograms.shape == torch.Size([2, 835, 80])
assert logits.shape == torch.Size([2, 835, 512]) 
assert len(hidden_states) == 2
assert hidden_states[0].shape == torch.Size([8, 2, 320])

In [18]:
class DecoderRNNT(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, output_dim: int, 
                 n_layers: int, dropout: float = 0.2):
        """
        A simple RNN-based autoregressive language model that takes as input previously generated text tokens
        and outputs a hidden representation of the next token

        :param hidden_size: the number of features in the hidden states in LSTM layers
        :param vocab_size: the number of text tokens in the dictionary
        :param output_dim: the output dimension
        :param n_layers: the number of stacked LSTM layers
        :param dropout: the dropout probability for LSTM layers
        """
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(input_size=hidden_size, hidden_size=hidden_size, 
                            num_layers=n_layers, dropout=dropout)

        self.output_proj = nn.Linear(hidden_size, output_dim)

    def forward(self, inputs: torch.Tensor, input_lengths: Optional[torch.Tensor] = None, 
                hidden_states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) -> Tuple[torch.Tensor, List[torch.Tensor]]:
        """
        :param inputs: labels, shape: (B, U)
        :param input_lengths: the lengths of the text labels in the batch, shape: (B,)
        :return: outputs of the projection layer and hidden states from LSTMs
        """
        embed_inputs = self.embedding(inputs)

        if input_lengths is not None:
            # training phase, the code here is close to `forward` of the Encoder 
            padded_states = pack_padded_sequence(embed_inputs, input_lengths, 
                                                 batch_first=True, enforce_sorted=False)
            outputs, hidden = self.lstm(padded_states)
            outputs, _ = pad_packed_sequence(outputs, batch_first=True)
        else:
            outputs, hidden = self.lstm(embed_inputs, hidden_states)

        outputs = self.output_proj(outputs)
        return outputs, hidden

In [19]:
decoder = DecoderRNNT(
    hidden_size=512,
    vocab_size=len(tokenizer.char_map),
    output_dim=512, 
    n_layers=1, 
    dropout=0.2
)

loader = data.DataLoader(test_dataset, batch_size=2, shuffle=False, collate_fn=lambda x: data_processing(x, 'test'))
spectrograms, labels, input_lengths, label_lengths = next(iter(loader))
logits, hidden_states = decoder.forward(labels, label_lengths)

assert labels.shape == torch.Size([2, 158])
assert logits.shape == torch.Size([2, 158, 512])
assert len(hidden_states) == 2
assert hidden_states[0].shape == torch.Size([1, 2, 512])



In [20]:
class Joiner(torch.nn.Module):
    def __init__(self, joiner_dim: int, num_outputs: int):
        """
        Adds encoder and decoder outputs, applies ReLU and passes the result 
        through a fully connected layer to get the output logits

        :param joiner_dim: the dimension of the encoder and decoder outputs
        :num_outputs: the number of text tokens in the dictionary
        """
        super().__init__()
        self.linear = nn.Linear(joiner_dim, num_outputs)

    def forward(self, encoder_outputs: torch.Tensor, decoder_outputs: torch.Tensor) -> torch.Tensor:
        """
        :param encoder_outputs: the encoder outputs (f_t), shape: (B, T, joiner_dim) or (joiner_dim,)
        :param decoder_outputs: the decoder outputs (g_u), shape: (B, U, joiner_dim) or (joiner_dim,)
        :return: output logits
        """
        if encoder_outputs.dim() == 3 and decoder_outputs.dim() == 3:    # True for training phase
            encoder_outputs = encoder_outputs.unsqueeze(2)
            decoder_outputs = decoder_outputs.unsqueeze(1)

        # Linear(ReLU(f_t + g_u))
        out = F.relu(self.linear(encoder_outputs + decoder_outputs))
        return out

In [21]:
class RNNTransducer(torch.nn.Module):
    def __init__(self,
        num_classes: int,
        input_dim: int,
        num_encoder_layers: int = 4,
        num_decoder_layers: int = 1,
        encoder_hidden_state_dim: int = 320,
        decoder_hidden_state_dim: int = 512,
        output_dim: int = 512,
        encoder_is_bidirectional: bool = True,
        encoder_dropout_p: float = 0.2,
        decoder_dropout_p: float = 0.2
    ):
        """
        :param num_classes: the number of text tokens in the dictionary
        :param input_dim: the number of mel-spectrogram features
        :param num_encoder_layers: the number of LSTM layers in the encoder
        :param num_decoder_layers: the number of LSTM layers in the decoder
        :param encoder_hidden_state_dim: the number of features in the hidden states for the encoder
        :param decoder_hidden_state_dim: the number of features in the hidden states for the decoder
        :param output_dim: the output dimension
        :param encoder_is_bidirectional: whether to use bidirectional LSTM's in the encoder
        :param encoder_dropout_p: the dropout probability for the encoder
        :param decoder_dropout_p: the dropout probability for the decoder
        """
        super().__init__()

        self.encoder = EncoderRNNT(input_dim=input_dim, hidden_size=encoder_hidden_state_dim,
                                   output_dim=output_dim, n_layers=num_encoder_layers, 
                                   dropout=encoder_dropout_p, bidirectional=encoder_is_bidirectional)

        # The decoder takes the input <BOS> + the original sequence. 
        # You need to shift the current label, and F.pad can help with that.
        self.decoder = DecoderRNNT(
            hidden_size=decoder_hidden_state_dim,
            vocab_size=num_classes,
            output_dim=output_dim, 
            n_layers=num_decoder_layers, 
            dropout=decoder_dropout_p
        )
        self.joiner = Joiner(output_dim, num_classes)

    def forward(self, inputs: torch.Tensor, input_lengths: torch.Tensor, 
                targets: torch.Tensor, target_lengths: torch.Tensor) -> torch.Tensor:
        """
        :param inputs: spectrograms, shape: (B, T, n_mels)
        :param input_lengths: the lengths of the spectrograms in the batch, shape: (B,)
        :param targets: labels, shape: (B, U)
        :param target_lengths: the lengths of the text labels in the batch, shape: (B,)
        :return: the output logits, shape: (B, T, U, n_tokens)
        """
        # <BOS> adding
        targets = F.pad(targets, (1, 0, 0, 0), 'constant', tokenizer.char_map['<BOS>'])
        new_len = target_lengths + 1

        encoder_outputs, _ = self.encoder(inputs, input_lengths)
        decoder_outputs, _ = self.decoder(targets, new_len)
        joiner_out = self.joiner(encoder_outputs, decoder_outputs)
        return joiner_out


In [22]:
transducer = RNNTransducer(
    num_classes=len(tokenizer.char_map),
    input_dim=80,
    num_encoder_layers=4,
    num_decoder_layers=1,
    encoder_hidden_state_dim=320,
    decoder_hidden_state_dim=512,
    output_dim=512,
    encoder_is_bidirectional=True,
    encoder_dropout_p=0.2,
    decoder_dropout_p=0.2
)

loader = data.DataLoader(test_dataset, batch_size=2, shuffle=False, collate_fn=lambda x: data_processing(x, 'test'))
spectrograms, labels, input_lengths, label_lengths = next(iter(loader))
result = transducer.forward(spectrograms, input_lengths, labels, label_lengths)

assert spectrograms.shape == torch.Size([2, 835, 80])
assert labels.shape == torch.Size([2, 158])
assert result.shape == torch.Size([2, 835, 159, 30]), result.shape

## Implementing a greedy decoder (18 points)

<p style="text-align:center;"><img src="http://drive.google.com/uc?export=view&id=1tHsoq0ZH0tHSHYlYlw00y8ksF-wHmrmC">

Now we know how to train a Transducer, but how do we infer it? Our task is to generate an output sequence $\mathbf y$ given an input acoustic sequence $\mathbf x$.

Here we will index the encoder outputs $f_t$ starting from zero, because it is more convenient when describing an algorithm.

The greedy decoding procedure is as follows:
1. Compute $\{f_0, \ldots, f_T\}$ using $\mathbf x$.
2. Set $t = 0$, $u = 0$, $\mathbf y = []$, $\mathrm{iteration} = 0$.
3. If $u = 0$, set $g_0 = \mathrm{Encoder}(\langle s \rangle)$. If $u > 0$, compute $g_u$ using the last predicted token $\mathbf y[-1]$.
4. Compute $P(y | t, u)$ using $f_t$ and $g_u$.
5. If argmax of $P(y | t, u)$ is a label, set $u = u + 1$ and append the new label to $\mathbf y$. 
6. If argmax of $P(y | t, u)$ is $\emptyset$, set $t = t + 1$.
7. If $t = T$ or $\mathrm{iteration} = \mathrm{max\_iterations}$, we are done. Else, set $\mathrm{iteration} = \mathrm{iteration + 1}$ and go to step 3.

In [23]:
@torch.no_grad()
def greedy_decode(model: RNNTransducer, encoder_output: torch.Tensor, max_steps: int = 2000) -> torch.Tensor:
    """
    :param model: an RNN-T model in eval mode
    :param encoder_output: the output of the encoder part of RNN-T, shape: (T, encoder_output_dim)
    :param max_steps: the maximum number of decoding steps
    :return: the predicted labels
    """
    pred_tokens, hidden_state = [], None
    blank = tokenizer.get_symbol_index(BLANK_SYMBOL)
    max_time_steps = encoder_output.size(0)
    t = 0
    u = 0

    decoder_input = encoder_output.new_tensor([[tokenizer.get_symbol_index(BOS)]], dtype=torch.long)
    decoder_output, hidden_state = model.decoder(decoder_input, hidden_states=hidden_state)

    for _ in range(max_steps):
        prob_t_u = model.joiner(encoder_output.unsqueeze(0), decoder_output)
        argmax_prob = torch.argmax(prob_t_u[0, t, 0, :]).item()

        if argmax_prob == blank:
            t += 1
        else:
            u += 1
            pred_tokens.append(argmax_prob)
            decoder_input = encoder_output.new_tensor([[pred_tokens[-1]]], dtype=torch.long)
            decoder_output, hidden_state = model.decoder(decoder_input, hidden_states=hidden_state)

        if t == max_time_steps:
            break

    return torch.LongTensor(pred_tokens)


@torch.no_grad()
def recognize(model: RNNTransducer, inputs: torch.Tensor, input_lengths: torch.Tensor) -> List[torch.Tensor]:
    """
    :param model: an RNN-T model in eval mode
    :param inputs: spectrograms, shape: (B, T, n_mels)
    :param input_lengths: the lengths of the spectrograms in the batch, shape: (B,)
    :return: a list with the predicted labels
    """
    outputs = []
    encoder_outputs, _ = model.encoder(inputs, input_lengths)

    for encoder_output in encoder_outputs:
        decoded_seq = greedy_decode(model, encoder_output)
        outputs.append(decoded_seq)

    return outputs


def get_transducer_predictions(
        transducer: RNNTransducer, inputs: torch.Tensor, input_lengths: torch.Tensor, 
        targets: torch.Tensor, target_lengths: torch.Tensor
    ) -> pd.DataFrame:
    """
    :param transducer: an RNN-T model in eval mode
    :param inputs: spectrograms, shape: (B, T, n_mels)
    :param input_lengths: the lengths of the spectrograms in the batch, shape: (B,)
    :param targets: labels, shape: (B, U)
    :param target_lengths: the lengths of the text labels in the batch, shape: (B,)
    :return: a pd.DataFrame with inference results
    """
    predictions = recognize(transducer, inputs, input_lengths)
    result = []
    for pred, target, target_len in zip(predictions, targets, target_lengths):
        label = target[:target_len]
        utterance = tokenizer.indices_to_text(list(map(int, label)))
        pred_utterance = tokenizer.indices_to_text(list(map(int, pred)))
        result.append({
            "ground_truth": utterance,
            "prediction": pred_utterance,
            "cer": utils.cer(utterance, pred_utterance),
            "wer": utils.wer(utterance, pred_utterance)
        })
    return pd.DataFrame.from_records(result)


In [24]:
model = torch.jit.load('week_06_files/model_scripted_epoch_5.pt')
model.eval()

RecursiveScriptModule(
  original_name=RNNTransducer
  (encoder): RecursiveScriptModule(
    original_name=EncoderRNNT
    (lstm): RecursiveScriptModule(original_name=LSTM)
    (output_proj): RecursiveScriptModule(original_name=Linear)
  )
  (decoder): RecursiveScriptModule(
    original_name=DecoderRNNT
    (embedding): RecursiveScriptModule(original_name=Embedding)
    (lstm): RecursiveScriptModule(original_name=LSTM)
    (output_proj): RecursiveScriptModule(original_name=Linear)
  )
  (joiner): RecursiveScriptModule(
    original_name=Joiner
    (linear): RecursiveScriptModule(original_name=Linear)
  )
)

In [25]:
loader = data.DataLoader(test_dataset, batch_size=5, shuffle=False, collate_fn=lambda x: data_processing(x, 'test'))
spectrograms, labels, input_lengths, label_lengths = next(iter(loader))
predictions = get_transducer_predictions(
    model, spectrograms, input_lengths,
    labels, label_lengths
)
predictions

Unnamed: 0,ground_truth,prediction,cer,wer
0,he hoped there would be stew for dinner turnip...,he hoped there would be stew for dinner turnip...,0.132911,0.25
1,stuff it into you his belly counselled him,stuffed into you his belly counciled him,0.142857,0.375
2,after early nightfall the yellow lamps would l...,after early night fall the yellow lamps would ...,0.096154,0.333333
3,hello bertie any good in your mind,her about he and he good in your mind,0.352941,0.714286
4,number ten fresh nelly is waiting on you good ...,none but den fresh now as waiting on you could...,0.254237,0.545455


In [26]:
reference_values = [
    {
        "gt": "he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fattened sauce",
        "prediction": "he hoped there would be stew for dinner turnips and characts and bruised potatoes and fat much and pieces to be lateled out in the thick peppered flowerfacton sauce"
    },
    {
        "gt": "stuff it into you his belly counselled him",
        "prediction": "stuffed into you his belly counciled him"
    },
    {
        "gt": "after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels",
        "prediction": "after early night fall the yellow lamps would lie how peer and there the squalit quarter of the brothels"
    },
    {
        "gt": "hello bertie any good in your mind",
        "prediction": "her about he and he good in your mind"
    },
    {
        "gt": "number ten fresh nelly is waiting on you good night husband",
        "prediction": "none but den fresh now as waiting on you could night husband"
    }
]



In [27]:
for index in range(5):
    gt = predictions.iloc[index].ground_truth
    prediction = predictions.iloc[index].prediction
    assert gt == reference_values[index]["gt"]
    assert prediction == reference_values[index]["prediction"], f'{prediction} - {reference_values[index]["prediction"]}'

## Train your model (4 points)

Here you can launch training of the model you've just built. To get **2 points**, provide the curves for test loss, CER and WER from Weights & Biases.

After training, you will get the test metric values on the hold-out test set. To get the rest **2 points**, you need to pass the following thresholds:

- 0.15 test CER
- 0.3 test WER

In [28]:
def train(model: nn.Module, device: str, train_loader: data.DataLoader,
          test_sample: List[torch.Tensor], criterion: nn.Module, optimizer: 
          torch.optim.Optimizer, epoch: int, eval_period: int = 100) -> None:
    """
    :param model: an RNN-T model
    :param device: "gpu" or "cpu"
    :param train_loader: training data loader
    :param test_sample: a sample from the test set to log preliminary inference metrics
    :param criterion: the loss function
    :param optimizer: the training optimizer
    :param epoch: the current epoch number
    :param eval_period: the number of iterations between evaluations
    """
    model.train()
    data_len = len(train_loader.dataset)

    for batch_idx, _data in tqdm(enumerate(train_loader), total=data_len):
        spectrograms, labels, input_lengths, label_lengths = _data
        spectrograms, labels = spectrograms.to(device), labels.to(device)

        optimizer.zero_grad()

        output = model.forward(spectrograms, input_lengths, labels, label_lengths)   # (batch, time, label_length, n_class)
        output = F.log_softmax(output, dim=-1)

        loss = criterion(
            output, 
            labels, 
            input_lengths.to(device), 
            label_lengths.to(device)
        )
        loss.backward()
        optimizer.step()
        
        if batch_idx % eval_period == 0 or batch_idx == data_len:
            wandb.log({'loss_train': loss.item()})
            
            with torch.no_grad():
                spectrograms, labels, input_lengths, label_lengths = test_sample
                spectrograms, labels = spectrograms.to(device), labels.to(device)
                predictions = get_transducer_predictions(
                    model, spectrograms, input_lengths, 
                    labels, label_lengths
                )
                output = model.forward(spectrograms, input_lengths, labels, label_lengths)
                val_loss = criterion(
                  output, 
                  labels, 
                  input_lengths.to(device), 
                  label_lengths.to(device)
                )
                wandb.log({'loss_val': val_loss.item()})
                clear_output(wait=True)
                print('\nTrain Epoch: {} [{}/{} ({:.0f}%)]\tTrain Loss: {:.6f}\tVal loss: {:.6f}'.format(
                      epoch, batch_idx * len(spectrograms), data_len,
                      100. * batch_idx / len(train_loader), loss.item(), val_loss.item()))
                print(f"cer: {predictions.cer.mean()}, wer: {predictions.wer.mean()}")
                display(predictions)
                wandb.log({'cer_val': predictions.cer.mean()})
                wandb.log({'wer_val': predictions.wer.mean()})
                wandb.log({'val_predictions': wandb.Table(dataframe=predictions)})


def test(model: nn.Module, device: str, test_loader: data.DataLoader, 
         criterion: nn.Module, epoch: int, total_steps: int = None, 
         log_predictions: bool = False) -> None:
    """
    :param model: an RNN-T model
    :param device: "gpu" or "cpu"
    :param test_loader: test data loader
    :param criterion: the loss function
    :param epoch: the current epoch number
    :param total_steps: the number of test steps to perform. If None, the whole test set will be used for evaluation
    :param log_predictions: if True, the predicted labels will be logged to the W&B dashboard
    """
    print('Beginning eval...')
    model.eval()
    test_cer, test_wer, test_loss = [], [], []
    test_predictions = []
    if total_steps is None:
        total_steps = len(test_loader)
        
    with torch.no_grad():
        for i, _data in tqdm_notebook(enumerate(test_loader), total=total_steps):
            if i == total_steps:
                break
            spectrograms, labels, input_lengths, label_lengths = _data
            spectrograms, labels = spectrograms.to(device), labels.to(device)
            output = model.forward(spectrograms, input_lengths, labels, label_lengths)
            loss = criterion(
              output, 
              labels, 
              input_lengths.to(device), 
              label_lengths.to(device)
            )
            test_loss.append(loss.item())
            
            predictions = get_transducer_predictions(
                model, spectrograms, input_lengths, 
                labels, label_lengths
            )
            test_cer += list(predictions.cer)
            test_wer += list(predictions.wer)
            if log_predictions:
                test_predictions.append(predictions)

    avg_cer = np.mean(test_cer)
    avg_wer = np.mean(test_wer)
    avg_loss = np.mean(test_loss)
    
    if total_steps < len(test_loader):
        wandb.log({
            'loss_test': avg_loss, 
            'avg_cer': avg_cer, 
            'avg_wer': avg_wer
        })
    else:
        wandb.log({
            'loss_test_final': avg_loss, 
            'avg_cer_final': avg_cer, 
            'avg_wer_final': avg_wer
        })
    if log_predictions:
        wandb.log({'test_predictions': wandb.Table(dataframe=pd.concat(test_predictions, ignore_index=True))})
        
    print('Epoch: {:d}, Test set: Average loss: {:.4f}, Average CER: {:4f} Average WER: {:.4f}\n'.format(
        epoch, avg_loss, avg_cer, avg_wer))
    

In [29]:
torch.manual_seed(7)
if torch.cuda.is_available():
    print('GPU found! 🎉')
    device = 'cuda'
else:
    print('Only CPU found! 💻')
    device = 'cpu'

# Hyperparameters for your model

hparams = {
    'model': {
        'num_classes': len(tokenizer.char_map),
        'input_dim': 80,
        'num_encoder_layers': 4,
        'num_decoder_layers': 1,
        'encoder_hidden_state_dim': 320,
        'decoder_hidden_state_dim': 512,
        'output_dim': 512,
        'encoder_is_bidirectional': True,
        'encoder_dropout_p': 0.2,
        'decoder_dropout_p': 0.2
    },
    'data': {
        'batch_size': 3,
        'epochs': 10,
        'learning_rate': 1e-4
    }
}

kwargs = {'num_workers': 1, 'pin_memory': True} if device == 'cuda' else {}
train_loader = data.DataLoader(train_dataset, batch_size=hparams['data']['batch_size'], 
                               shuffle=True, collate_fn=lambda x: data_processing(x), **kwargs)
test_loader = data.DataLoader(test_dataset, batch_size=hparams['data']['batch_size'], 
                              shuffle=False, collate_fn=lambda x: data_processing(x, 'test'), **kwargs)



GPU found! 🎉


In [30]:
model = RNNTransducer(**hparams['model'])
model.to(device)



RNNTransducer(
  (encoder): EncoderRNNT(
    (lstm): LSTM(80, 320, num_layers=4, dropout=0.2, bidirectional=True)
    (output_proj): Linear(in_features=640, out_features=512, bias=True)
  )
  (decoder): DecoderRNNT(
    (embedding): Embedding(30, 512)
    (lstm): LSTM(512, 512, dropout=0.2)
    (output_proj): Linear(in_features=512, out_features=512, bias=True)
  )
  (joiner): Joiner(
    (linear): Linear(in_features=512, out_features=30, bias=True)
  )
)

In [31]:
wandb.init(project="speech-transducer", 
           group="base-architecture",
           config=hparams)

[34m[1mwandb[0m: Currently logged in as: [33mgrazder[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
optimizer = optim.Adam(model.parameters(), lr=hparams['data']['learning_rate'])
criterion = RNNTLoss(blank=tokenizer.get_symbol_index(BLANK_SYMBOL), reduction='mean')
test_sample = next(iter(test_loader))

for epoch in tqdm_notebook(range(1, hparams['data']['epochs'] + 1)):
    train(model, device, train_loader, test_sample, criterion, optimizer, epoch, eval_period=50)
    utils.save_checkpoint(model, checkpoint_name=f'model_epoch{epoch}.tar', path=snapshot_dir)
    wandb.save(f'model_epoch{epoch}.tar')
    test(model, device, test_loader, criterion, epoch, total_steps=20, log_predictions=True)

utils.save_checkpoint(model, checkpoint_name=f'model.tar')


cer: 0.4079264617239301, wer: 0.7589285714285714


Unnamed: 0,ground_truth,prediction,cer,wer
0,he hoped there would be stew for dinner turnip...,he hope the wouth stood ordinar ternison care ...,0.506329,0.892857
1,stuff it into you his belly counselled him,stuffiding to you his belly confelling,0.309524,0.625



 38%|███▊      | 10951/28539 [1:04:42<3:27:45,  1.41it/s][A
 38%|███▊      | 10952/28539 [1:04:43<3:02:28,  1.61it/s][A
 38%|███▊      | 10953/28539 [1:04:43<2:43:07,  1.80it/s][A
 38%|███▊      | 10954/28539 [1:04:43<2:27:41,  1.98it/s][A
 38%|███▊      | 10955/28539 [1:04:44<2:15:10,  2.17it/s][A
 38%|███▊      | 10956/28539 [1:04:44<2:11:06,  2.24it/s][A
 38%|███▊      | 10957/28539 [1:04:45<2:05:51,  2.33it/s][A
 38%|███▊      | 10958/28539 [1:04:45<1:58:47,  2.47it/s][A
 38%|███▊      | 10959/28539 [1:04:45<1:58:54,  2.46it/s][A
 38%|███▊      | 10960/28539 [1:04:46<1:55:28,  2.54it/s][A
 38%|███▊      | 10961/28539 [1:04:46<1:55:02,  2.55it/s][A
 38%|███▊      | 10962/28539 [1:04:47<1:58:38,  2.47it/s][A
 38%|███▊      | 10963/28539 [1:04:47<1:53:23,  2.58it/s][A
 38%|███▊      | 10964/28539 [1:04:47<1:55:38,  2.53it/s][A
 38%|███▊      | 10965/28539 [1:04:48<1:53:44,  2.58it/s][A
 38%|███▊      | 10966/28539 [1:04:48<1:49:49,  2.67it/s][A
 38%|███▊      | 10967/

In [35]:
test(model, device, test_loader, criterion, epoch)

Beginning eval...


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for i, _data in tqdm_notebook(enumerate(test_loader), total=total_steps):


  0%|          | 0/1310 [00:00<?, ?it/s]

Epoch: 10, Test set: Average loss: 25.4773, Average CER: 0.183985 Average WER: 0.3551



Results: https://wandb.ai/grazder/speech-transducer