# Explaining the Audio Model Used in the Research

## Whisper

In our project, we aimed to fine-tune several existing audio models, one of which is the Whisper model by OpenAI.

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It uses an encoder-decoder Transformer to convert spoken language into written text. The pipeline involves several steps:

1. Data Preparation: Whisper was trained on a large corpus of paired audio and transcripts (about 680,000 hours of weakly supervised data, according to OpenAI). The audio is resampled to 16 kHz, split into 30-second windows, and converted into a log-Mel spectrogram, which is the only input feature the model uses (the n_mels dimension in the code below).

2. Audio Encoding: Whisper does not use recurrent networks such as LSTMs. The spectrogram passes through two convolutional layers and a stack of Transformer blocks with self-attention, which produces a sequence of audio feature vectors.

3. Language Modeling: Whisper has no separate n-gram or external language model. Its Transformer decoder predicts the next text token from the tokens produced so far (through masked self-attention) and from the encoded audio (through cross-attention), so linguistic context is learned inside the same network.

4. Training: The model is trained end to end on paired audio-text data to minimize the cross-entropy between the predicted and the ground-truth transcript tokens, using gradient-based optimization (OpenAI reports using AdamW).

5. Decoding: At inference time the decoder generates the transcript autoregressively, one token at a time, using greedy or beam search. Whisper does not use Connectionist Temporal Classification (CTC); the alignment between audio and text is handled by the cross-attention layers.
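As a rough worked example of the data-preparation step, the sequence lengths seen by the model can be computed by hand. The constants below (16 kHz sampling, 30-second windows, a hop of 160 samples between spectrogram frames) are assumptions taken from the reference implementation's defaults and are not stated in this section.

```python
# Assumed front-end constants (not stated in the text above).
SAMPLE_RATE = 16_000      # samples per second
CHUNK_SECONDS = 30        # audio is padded or trimmed to this length
HOP_LENGTH = 160          # samples between successive spectrogram frames

n_samples = SAMPLE_RATE * CHUNK_SECONDS   # samples per window
n_frames = n_samples // HOP_LENGTH        # mel frames per window
n_audio_ctx = n_frames // 2               # after a stride-2 convolution

print(n_samples, n_frames, n_audio_ctx)
```

Under these assumptions, each 30-second window becomes 3000 spectrogram frames, and the encoder attends over 1500 positions.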

The Whisper model code provided by OpenAI is fairly long, so in this part of the research we analyze each part of it in detail.

In [None]:
import base64
import gzip
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

import numpy as np
import torch
import torch.nn.functional as F
from torch import Tensor, nn

from .decoding import decode as decode_function
from .decoding import detect_language as detect_language_function
from .transcribe import transcribe as transcribe_function

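This cell imports the libraries that the model code depends on: NumPy and PyTorch for tensor operations, base64 and gzip for decoding the alignment-head data, and the package's own decoding and transcribe modules through relative imports.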

In [None]:
@dataclass
class ModelDimensions:
    n_mels: int
    n_audio_ctx: int
    n_audio_state: int
    n_audio_head: int
    n_audio_layer: int
    n_vocab: int
    n_text_ctx: int
    n_text_state: int
    n_text_head: int
    n_text_layer: int


class LayerNorm(nn.LayerNorm):
    def forward(self, x: Tensor) -> Tensor:
        return super().forward(x.float()).type(x.dtype)


class Linear(nn.Linear):
    def forward(self, x: Tensor) -> Tensor:
        return F.linear(
            x,
            self.weight.to(x.dtype),
            None if self.bias is None else self.bias.to(x.dtype),
        )

Analyzing the above code:

1. @dataclass: This is a decorator from the dataclasses module in Python's standard library. It generates boilerplate such as __init__ and __repr__ for classes that are primarily used to hold data. In this code, it is used to define the ModelDimensions class as a data class.

2. ModelDimensions: This class represents the dimensions or sizes of various components of a model. It is defined using the dataclass decorator. The class has several attributes such as n_mels, n_audio_ctx, n_audio_state, and so on, which are integers representing the sizes or dimensions of different parts of the model.

3. LayerNorm: This class is a subclass of nn.LayerNorm, the PyTorch module for layer normalization. Its forward method overrides the base class's method: it casts the input x to float32 with x.float(), applies layer normalization, and casts the result back with .type(x.dtype), so the output has the same data type as the input. This keeps the normalization in full precision even when the input arrives in half precision.

4. Linear: This class is a subclass of nn.Linear, which represents a linear transformation (commonly known as a fully connected or dense layer). Its forward method overrides the base class's method so that the weight and the bias (if present) are cast to the data type of the input x before F.linear is called. This lets parameters stored in one precision be applied to inputs in another, for example float32 weights with float16 activations.
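The effect of the two overrides can be checked on small tensors. The sketch below re-declares the two classes under hypothetical names (CastingLayerNorm, CastingLinear) so that it runs on its own. It uses float64 as the mismatched dtype for the linear layer, purely because float64 matrix multiplication is available on any CPU; in practice the mismatch would more likely involve float16.

```python
import torch
import torch.nn.functional as F
from torch import nn


class CastingLayerNorm(nn.LayerNorm):
    """Normalize in float32, then cast back to the input dtype."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return super().forward(x.float()).type(x.dtype)


class CastingLinear(nn.Linear):
    """Cast the float32 parameters to the input dtype before multiplying."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bias = None if self.bias is None else self.bias.to(x.dtype)
        return F.linear(x, self.weight.to(x.dtype), bias)


torch.manual_seed(0)

# LayerNorm: a float16 input comes back as float16.
x_half = torch.randn(2, 4).to(torch.float16)
ln_out = CastingLayerNorm(4)(x_half)

# Linear: float64 stands in for "a dtype that differs from the parameters".
x_double = torch.randn(2, 4, dtype=torch.float64)
lin_out = CastingLinear(4, 3)(x_double)

# A plain nn.Linear rejects the same input because of the dtype mismatch.
try:
    nn.Linear(4, 3)(x_double)
    plain_linear_failed = False
except RuntimeError:
    plain_linear_failed = True

print(ln_out.dtype, lin_out.dtype, plain_linear_failed)
```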

In [None]:
class Conv1d(nn.Conv1d):
    def _conv_forward(
        self, x: Tensor, weight: Tensor, bias: Optional[Tensor]
    ) -> Tensor:
        return super()._conv_forward(
            x, weight.to(x.dtype), None if bias is None else bias.to(x.dtype)
        )


def sinusoids(length, channels, max_timescale=10000):
    """Returns sinusoids for positional embedding"""
    assert channels % 2 == 0
    log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
    return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)


class MultiHeadAttention(nn.Module):
    def __init__(self, n_state: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.query = Linear(n_state, n_state)
        self.key = Linear(n_state, n_state, bias=False)
        self.value = Linear(n_state, n_state)
        self.out = Linear(n_state, n_state)

    def forward(
        self,
        x: Tensor,
        xa: Optional[Tensor] = None,
        mask: Optional[Tensor] = None,
        kv_cache: Optional[dict] = None,
    ):
        q = self.query(x)

        if kv_cache is None or xa is None or self.key not in kv_cache:
            # hooks, if installed (i.e. kv_cache is not None), will prepend the cached kv tensors;
            # otherwise, perform key/value projections for self- or cross-attention as usual.
            k = self.key(x if xa is None else xa)
            v = self.value(x if xa is None else xa)
        else:
            # for cross-attention, calculate keys and values once and reuse in subsequent calls.
            k = kv_cache[self.key]
            v = kv_cache[self.value]

        wv, qk = self.qkv_attention(q, k, v, mask)
        return self.out(wv), qk

    def qkv_attention(
        self, q: Tensor, k: Tensor, v: Tensor, mask: Optional[Tensor] = None
    ):
        n_batch, n_ctx, n_state = q.shape
        scale = (n_state // self.n_head) ** -0.25
        q = q.view(*q.shape[:2], self.n_head, -1).permute(0, 2, 1, 3) * scale
        k = k.view(*k.shape[:2], self.n_head, -1).permute(0, 2, 3, 1) * scale
        v = v.view(*v.shape[:2], self.n_head, -1).permute(0, 2, 1, 3)

        qk = q @ k
        if mask is not None:
            qk = qk + mask[:n_ctx, :n_ctx]
        qk = qk.float()

        w = F.softmax(qk, dim=-1).to(q.dtype)
        return (w @ v).permute(0, 2, 1, 3).flatten(start_dim=2), qk.detach()

The code above defines the building blocks of the Transformer: a dtype-aware convolution, the positional-embedding function, and the attention module. Let's break it down step by step:

1. Conv1d: This class is a subclass of nn.Conv1d, which represents a 1-dimensional convolutional layer in a neural network. The _conv_forward method is implemented to override the base class's _conv_forward method. It performs the convolution operation by calling the _conv_forward method of the base class and passing the input tensor x, weight tensor weight (converted to the same data type as x), and bias tensor (if not None, converted to the same data type as x).

2. sinusoids: This function generates the fixed sinusoidal positional embedding. It takes three arguments: length (the number of positions), channels (the embedding width, which must be even), and max_timescale (which sets the longest wavelength). It computes channels // 2 geometrically spaced inverse timescales, multiplies them by the position index, and concatenates the sine and cosine of the result along the channel dimension, giving a tensor of shape (length, channels).

3. MultiHeadAttention: This class represents the multi-head attention mechanism in a neural network. It is a subclass of nn.Module. The class constructor takes two arguments: n_state (the input dimension of the attention mechanism) and n_head (the number of attention heads). The class defines several linear layers (self.query, self.key, self.value, self.out) which are used to project the input to the corresponding dimensions for attention calculations.

4. forward method: This method performs the forward pass of the multi-head attention mechanism. It takes x as the input tensor, xa as an optional tensor to attend to (the encoded audio), mask as an optional mask tensor, and kv_cache as an optional dictionary of cached key and value projections. It first applies the query projection (self.query) to x. If xa is None, the keys and values are projected from x (self-attention); otherwise they are projected from xa (cross-attention). For cross-attention with a kv_cache that already contains an entry for self.key, the cached keys and values are reused instead of being recomputed. It then calls qkv_attention and returns the output projection (self.out) of the weighted values together with the pre-softmax attention scores qk.

5. qkv_attention method: This method performs the core calculations of the multi-head attention mechanism. It takes the query tensor q, key tensor k, and value tensor v, and splits each of them into n_head heads. Both q and k are multiplied by head_dim ** -0.25, so their product carries the usual head_dim ** -0.5 scaling. It adds the optional mask to the scores, applies softmax in float32, and multiplies the resulting weights by v. Finally, it merges the heads back into one tensor and returns it along with the detached pre-softmax scores.
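The scaling used in qkv_attention can be compared against the textbook formulation on random data. This is a minimal sketch with hypothetical sizes; split_heads is a helper introduced here for illustration and does not exist in the original code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_batch, n_ctx, n_state, n_head = 2, 5, 16, 4
head_dim = n_state // n_head

q = torch.randn(n_batch, n_ctx, n_state)
k = torch.randn(n_batch, n_ctx, n_state)
v = torch.randn(n_batch, n_ctx, n_state)


def split_heads(t: torch.Tensor) -> torch.Tensor:
    """(batch, ctx, state) -> (batch, head, ctx, head_dim)."""
    return t.view(*t.shape[:2], n_head, -1).permute(0, 2, 1, 3)


# Variant used in the code above: scale q and k by head_dim ** -0.25 each.
scale = head_dim ** -0.25
qk_split = (split_heads(q) * scale) @ (split_heads(k) * scale).transpose(-1, -2)

# Textbook variant: scale the product once by head_dim ** -0.5.
qk_textbook = (split_heads(q) @ split_heads(k).transpose(-1, -2)) * head_dim ** -0.5

weights = F.softmax(qk_split, dim=-1)                  # (batch, head, ctx, ctx)
out = (weights @ split_heads(v)).permute(0, 2, 1, 3).flatten(start_dim=2)

print(torch.allclose(qk_split, qk_textbook, atol=1e-5), out.shape)
```

Splitting the scale between q and k keeps the intermediate values smaller, which is a plausible reason for the choice when running in half precision; the source does not state the motivation.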

Overall, the code defines classes and functions related to convolutional layers (Conv1d), positional embedding generation (sinusoids), and multi-head attention mechanism (MultiHeadAttention) commonly used in neural network models.
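For reference, the values computed by sinusoids can be written out as a formula. The symbols are introduced here for illustration: $C$ is channels, $T$ is max_timescale, $t$ is the position index, and $i$ runs over the first half of the channels.

```latex
\omega_i = \exp\!\left(-\,i\,\frac{\ln T}{C/2 - 1}\right) = T^{-i/(C/2 - 1)},
\qquad i = 0, 1, \dots, \tfrac{C}{2} - 1

\mathrm{PE}[t,\, i] = \sin(t\,\omega_i),
\qquad
\mathrm{PE}[t,\, \tfrac{C}{2} + i] = \cos(t\,\omega_i)
```

Because the code concatenates the two halves with torch.cat, the sine terms fill the first $C/2$ channels and the cosine terms fill the last $C/2$, rather than being interleaved.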

In [None]:
class ResidualAttentionBlock(nn.Module):
    def __init__(self, n_state: int, n_head: int, cross_attention: bool = False):
        super().__init__()

        self.attn = MultiHeadAttention(n_state, n_head)
        self.attn_ln = LayerNorm(n_state)

        self.cross_attn = (
            MultiHeadAttention(n_state, n_head) if cross_attention else None
        )
        self.cross_attn_ln = LayerNorm(n_state) if cross_attention else None

        n_mlp = n_state * 4
        self.mlp = nn.Sequential(
            Linear(n_state, n_mlp), nn.GELU(), Linear(n_mlp, n_state)
        )
        self.mlp_ln = LayerNorm(n_state)

    def forward(
        self,
        x: Tensor,
        xa: Optional[Tensor] = None,
        mask: Optional[Tensor] = None,
        kv_cache: Optional[dict] = None,
    ):
        x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)[0]
        if self.cross_attn:
            x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
        x = x + self.mlp(self.mlp_ln(x))
        return x


class AudioEncoder(nn.Module):
    def __init__(
        self, n_mels: int, n_ctx: int, n_state: int, n_head: int, n_layer: int
    ):
        super().__init__()
        self.conv1 = Conv1d(n_mels, n_state, kernel_size=3, padding=1)
        self.conv2 = Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1)
        self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state))

        self.blocks: Iterable[ResidualAttentionBlock] = nn.ModuleList(
            [ResidualAttentionBlock(n_state, n_head) for _ in range(n_layer)]
        )
        self.ln_post = LayerNorm(n_state)

    def forward(self, x: Tensor):
        """
        x : torch.Tensor, shape = (batch_size, n_mels, n_ctx)
            the mel spectrogram of the audio
        """
        x = F.gelu(self.conv1(x))
        x = F.gelu(self.conv2(x))
        x = x.permute(0, 2, 1)

        assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
        x = (x + self.positional_embedding).to(x.dtype)

        for block in self.blocks:
            x = block(x)

        x = self.ln_post(x)
        return x


class TextDecoder(nn.Module):
    def __init__(
        self, n_vocab: int, n_ctx: int, n_state: int, n_head: int, n_layer: int
    ):
        super().__init__()

        self.token_embedding = nn.Embedding(n_vocab, n_state)
        self.positional_embedding = nn.Parameter(torch.empty(n_ctx, n_state))

        self.blocks: Iterable[ResidualAttentionBlock] = nn.ModuleList(
            [
                ResidualAttentionBlock(n_state, n_head, cross_attention=True)
                for _ in range(n_layer)
            ]
        )
        self.ln = LayerNorm(n_state)

        mask = torch.empty(n_ctx, n_ctx).fill_(-np.inf).triu_(1)
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, x: Tensor, xa: Tensor, kv_cache: Optional[dict] = None):
        """
        x : torch.LongTensor, shape = (batch_size, <= n_ctx)
            the text tokens
        xa : torch.Tensor, shape = (batch_size, n_audio_ctx, n_audio_state)
            the encoded audio features to be attended on
        """
        offset = next(iter(kv_cache.values())).shape[1] if kv_cache else 0
        x = (
            self.token_embedding(x)
            + self.positional_embedding[offset : offset + x.shape[-1]]
        )
        x = x.to(xa.dtype)

        for block in self.blocks:
            x = block(x, xa, mask=self.mask, kv_cache=kv_cache)

        x = self.ln(x)
        logits = (
            x @ torch.transpose(self.token_embedding.weight.to(x.dtype), 0, 1)
        ).float()

        return logits

The provided Python code defines three classes: ResidualAttentionBlock, AudioEncoder, and TextDecoder. Together they form the encoder-decoder Transformer that Whisper uses for speech recognition.

1. ResidualAttentionBlock: This class represents a residual attention block in the model. It is a subclass of nn.Module. The class constructor takes three arguments: n_state (the input dimension of the attention mechanism), n_head (the number of attention heads), and cross_attention (a boolean flag indicating whether cross-attention is performed). The class defines the following components:

  1.1. self.attn: An instance of the MultiHeadAttention class representing self-attention.

  1.2. self.attn_ln: An instance of the LayerNorm class representing layer normalization applied to the input of self-attention.

  1.3. self.cross_attn (optional): An instance of the MultiHeadAttention class representing cross-attention if cross_attention is True, otherwise None.

  1.4. self.cross_attn_ln (optional): An instance of the LayerNorm class representing layer normalization applied to the input of cross-attention if cross_attention is True, otherwise None.

  1.5. self.mlp: A sequential module consisting of two linear layers with a GELU activation between them. The hidden width is four times n_state.

  1.6. self.mlp_ln: An instance of the LayerNorm class representing layer normalization applied to the input of the MLP.

  The forward method performs the forward pass of the residual attention block. It takes the input tensor x, an optional tensor xa to attend to, an optional mask tensor mask, and an optional key-value cache dictionary kv_cache. Each sublayer follows the pre-normalization pattern x = x + sublayer(layer_norm(x)): the layer normalization is applied before the sublayer, and the sublayer output is added back to the unnormalized x. The block first does this for self-attention (self.attn_ln, then self.attn with the mask), then, if cross-attention is enabled, for cross-attention over xa (self.cross_attn_ln, then self.cross_attn), and finally for the MLP (self.mlp_ln, then self.mlp). The output tensor x is returned.

2. AudioEncoder: This class represents the audio encoder in the model. It is a subclass of nn.Module. The class constructor takes five arguments: n_mels (the number of mel spectrogram channels), n_ctx (the maximum sequence length), n_state (the dimension of the hidden state), n_head (the number of attention heads), and n_layer (the number of residual attention blocks). The class defines the following components:

  2.1. self.conv1: An instance of the Conv1d class representing the first convolutional layer applied to the mel spectrogram.

  2.2. self.conv2: An instance of the Conv1d class representing the second convolutional layer applied to the output of the first convolutional layer.

  2.3. self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state)): A buffer tensor containing positional embeddings generated using the sinusoids function.

  2.4. self.blocks: A module list containing n_layer instances of the ResidualAttentionBlock class.

  2.5. self.ln_post: An instance of the LayerNorm class representing layer normalization applied to the output of the residual attention blocks.

  The forward method performs the forward pass of the audio encoder. It takes the input tensor x representing the mel spectrogram, with shape (batch_size, n_mels, number of frames). It applies the first convolutional layer (self.conv1) followed by GELU, then the second convolutional layer (self.conv2) followed by GELU. Because self.conv2 has a stride of 2, it halves the time dimension. The tensor is then permuted to (batch_size, time, n_state). The method asserts that this shape matches the positional embedding, which means the input must have exactly 2 * n_ctx frames, and adds the positional embedding. Then it applies each residual attention block in self.blocks in turn. Finally, it applies layer normalization (self.ln_post) and returns the result.

3. TextDecoder: This class represents the text decoder in the model. It is a subclass of nn.Module. The class constructor takes five arguments: n_vocab (the number of vocabulary tokens), n_ctx (the maximum sequence length), n_state (the dimension of the hidden state), n_head (the number of attention heads), and n_layer (the number of residual attention blocks). The class defines the following components:

  3.1. self.token_embedding: An embedding layer for the text tokens.

  3.2. self.positional_embedding: A trainable parameter representing the positional embeddings for the text tokens.

  3.3. self.blocks: A module list containing n_layer instances of the ResidualAttentionBlock class with cross_attention set to True.

  3.4. self.ln: An instance of the LayerNorm class representing layer normalization applied to the output of the residual attention blocks.

  3.5. self.register_buffer("mask", mask, persistent=False): A buffer tensor containing a triangular mask used in the self-attention mechanism.

  The forward method performs the forward pass of the text decoder. It takes the input tensor x representing the text tokens, the tensor xa representing the encoded audio features, and an optional key-value cache dictionary kv_cache. It looks up the token embedding (self.token_embedding) and adds the matching slice of the positional embedding (self.positional_embedding). When a kv_cache is present, the slice starts at an offset equal to the number of positions already cached, so that newly fed tokens receive the correct positions. It then applies each residual attention block in self.blocks, passing xa for cross-attention and self.mask for causal self-attention. After the blocks, it applies layer normalization (self.ln). Finally, it computes the logits by multiplying the result with the transposed weight matrix of the token embedding (self.token_embedding.weight), so the input embedding and the output projection share the same weights, and returns the logits as float32.
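Two details of the encoder and decoder are easy to verify on small tensors: the stride-2 convolution halves the time dimension, and the mask built with fill_ and triu_ blocks attention to later positions. The sizes below are hypothetical and chosen only to keep the output readable; plain nn.Conv1d is used in place of the dtype-aware Conv1d subclass.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Hypothetical small sizes; real checkpoints use larger values.
n_mels, n_state, n_frames = 8, 16, 20

conv1 = nn.Conv1d(n_mels, n_state, kernel_size=3, padding=1)
conv2 = nn.Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1)

mel = torch.randn(1, n_mels, n_frames)       # (batch, n_mels, frames)
h = F.gelu(conv2(F.gelu(conv1(mel))))        # (batch, n_state, frames / 2)
h = h.permute(0, 2, 1)                       # (batch, frames / 2, n_state)

# Causal mask: 0 on and below the diagonal, -inf strictly above it.
n_ctx = 4
mask = torch.empty(n_ctx, n_ctx).fill_(float("-inf")).triu_(1)

# Adding the mask to uniform scores and applying softmax leaves each
# position attending only to itself and to earlier positions.
weights = F.softmax(torch.zeros(n_ctx, n_ctx) + mask, dim=-1)

print(h.shape)
print(mask)
print(weights)
```

Each row of weights is uniform over the positions up to and including its own index and zero afterwards, which is what prevents the decoder from looking at future tokens.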

In [None]:
class Whisper(nn.Module):
    def __init__(self, dims: ModelDimensions):
        super().__init__()
        self.dims = dims
        self.encoder = AudioEncoder(
            self.dims.n_mels,
            self.dims.n_audio_ctx,
            self.dims.n_audio_state,
            self.dims.n_audio_head,
            self.dims.n_audio_layer,
        )
        self.decoder = TextDecoder(
            self.dims.n_vocab,
            self.dims.n_text_ctx,
            self.dims.n_text_state,
            self.dims.n_text_head,
            self.dims.n_text_layer,
        )
        # use the last half among the decoder layers for time alignment by default;
        # to use a specific set of heads, see `set_alignment_heads()` below.
        all_heads = torch.zeros(
            self.dims.n_text_layer, self.dims.n_text_head, dtype=torch.bool
        )
        all_heads[self.dims.n_text_layer // 2 :] = True
        self.register_buffer("alignment_heads", all_heads.to_sparse(), persistent=False)

    def set_alignment_heads(self, dump: bytes):
        array = np.frombuffer(
            gzip.decompress(base64.b85decode(dump)), dtype=bool
        ).copy()
        mask = torch.from_numpy(array).reshape(
            self.dims.n_text_layer, self.dims.n_text_head
        )
        self.register_buffer("alignment_heads", mask.to_sparse(), persistent=False)

    def embed_audio(self, mel: torch.Tensor):
        return self.encoder(mel)

    def logits(self, tokens: torch.Tensor, audio_features: torch.Tensor):
        return self.decoder(tokens, audio_features)

    def forward(
        self, mel: torch.Tensor, tokens: torch.Tensor
    ) -> Dict[str, torch.Tensor]:
        return self.decoder(tokens, self.encoder(mel))

    @property
    def device(self):
        return next(self.parameters()).device

    @property
    def is_multilingual(self):
        return self.dims.n_vocab >= 51865

    @property
    def num_languages(self):
        return self.dims.n_vocab - 51765 - int(self.is_multilingual)

    def install_kv_cache_hooks(self, cache: Optional[dict] = None):
        """
        The `MultiHeadAttention` module optionally accepts `kv_cache` which stores the key and value
        tensors calculated for the previous positions. This method returns a dictionary that stores
        all caches, and the necessary hooks for the key and value projection modules that save the
        intermediate tensors to be reused during later calculations.

        Returns
        -------
        cache : Dict[nn.Module, torch.Tensor]
            A dictionary object mapping the key/value projection modules to its cache
        hooks : List[RemovableHandle]
            List of PyTorch RemovableHandle objects to stop the hooks to be called
        """
        cache = {**cache} if cache is not None else {}
        hooks = []

        def save_to_cache(module, _, output):
            if module not in cache or output.shape[1] > self.dims.n_text_ctx:
                # save as-is, for the first token or cross attention
                cache[module] = output
            else:
                cache[module] = torch.cat([cache[module], output], dim=1).detach()
            return cache[module]

        def install_hooks(layer: nn.Module):
            if isinstance(layer, MultiHeadAttention):
                hooks.append(layer.key.register_forward_hook(save_to_cache))
                hooks.append(layer.value.register_forward_hook(save_to_cache))

        self.decoder.apply(install_hooks)
        return cache, hooks

    detect_language = detect_language_function
    transcribe = transcribe_function
    decode = decode_function

The provided Python code defines a class called Whisper, which is a subclass of nn.Module. This class represents the complete Whisper model for speech recognition: it combines the audio encoder and the text decoder to convert audio input into text output.

Here is a breakdown of the code:

1. Initialization:

  1.1. The __init__ method takes an input argument dims, which is an instance of the ModelDimensions class. This class contains various dimensions and parameters related to the model.

  1.2. The method initializes the parent class nn.Module using super().__init__().

  1.3. It assigns the dims argument to an instance variable self.dims for later use.

  1.4. It creates an instance of the AudioEncoder class, passing the necessary dimensions from dims, and assigns it to self.encoder.

  1.5. It creates an instance of the TextDecoder class, passing the necessary dimensions from dims, and assigns it to self.decoder.

  1.6. It initializes a tensor alignment_heads with shape (n_text_layer, n_text_head) where n_text_layer and n_text_head are dimensions from dims. The tensor is initialized with all False values except for the last half of the n_text_layer which is set to True. This tensor is registered as a buffer using self.register_buffer() and assigned to self.alignment_heads.

2. Setting Alignment Heads:

  2.1. The set_alignment_heads method takes a byte string dump as input.

  2.2. It decodes the byte string from base85 and then decompresses the result with gzip.

  2.3. It interprets the decompressed bytes as a boolean array and reshapes it to (n_text_layer, n_text_head).

  2.4. It converts the reshaped array to a sparse tensor and registers it as a non-persistent buffer named alignment_heads using self.register_buffer(), replacing the default set in __init__.

3. Audio Embedding:

  3.1. The embed_audio method takes a tensor mel representing mel spectrogram as input.

  3.2. It passes the mel tensor through the self.encoder and returns the result.

4. Logits Calculation:

  4.1. The logits method takes two tensors tokens and audio_features as input.

  4.2. It passes the tokens and audio_features tensors through the self.decoder and returns the result.

5. Forward Pass:

  5.1. The forward method takes two tensors mel and tokens as input.

  5.2. It passes the mel tensor through the self.encoder to obtain audio features.

  5.3. It passes the tokens and audio features through the self.decoder to obtain the output logits.

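  5.4. It returns the logits tensor produced by the decoder. The return annotation says Dict[str, torch.Tensor], but the method does not wrap the logits in a dictionary.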
  
6. Properties:

  6.1. The device property returns the device of the model parameters.

  6.2. The is_multilingual property returns True when the vocabulary size n_vocab is at least 51865, which the code treats as the marker of a multilingual model.

  6.3. The num_languages property returns n_vocab - 51765 - int(self.is_multilingual). For example, n_vocab = 51865 gives 51865 - 51765 - 1 = 99 languages.

7. Key-Value Cache Hooks:

  7.1. The install_kv_cache_hooks method is used to install hooks for caching intermediate tensors during the key-value projection in the MultiHeadAttention module.

  7.2. It takes an optional cache dictionary as input, which stores the key-value caches.

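  7.3. It makes a shallow copy of cache if one is given, and otherwise starts from an empty dictionary.

  7.4. It defines two nested functions: save_to_cache and install_hooks.

  7.5. The save_to_cache function is a forward hook. On the first call for a module, or when the output is longer than n_text_ctx (which happens for cross-attention over the audio features), it stores the output as-is. Otherwise it concatenates the new output to the cached tensor along the sequence dimension. In both cases it returns the cached tensor, which replaces the module's output.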

  7.6. The install_hooks function is used to traverse through the self.decoder and install hooks for the key and value projection modules.

  7.7. The method applies the install_hooks function to the self.decoder module and returns the cache dictionary and a list of hooks.
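The caching mechanism relies on a PyTorch behavior: when a forward hook returns a value, that value replaces the module's output. The sketch below shows this on a single projection layer. It is a simplified, hypothetical version of save_to_cache that omits the n_text_ctx length check, and key_proj is a stand-in for one layer.key module.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_state = 4
key_proj = nn.Linear(n_state, n_state, bias=False)   # stands in for layer.key
cache = {}


def save_to_cache(module, _inputs, output):
    """Append this call's projection to everything cached so far."""
    if module not in cache:
        cache[module] = output                        # first call: store as-is
    else:
        cache[module] = torch.cat([cache[module], output], dim=1).detach()
    return cache[module]                              # replaces the module output


handle = key_proj.register_forward_hook(save_to_cache)

# Feed one new token per step, as an autoregressive decoder would.
tokens = torch.randn(1, 3, n_state)                   # (batch, steps, n_state)
lengths = []
with torch.no_grad():
    for step in range(tokens.shape[1]):
        k = key_proj(tokens[:, step : step + 1])      # hook returns the full history
        lengths.append(k.shape[1])

handle.remove()                                       # stop caching
with torch.no_grad():
    k_all = key_proj(tokens)                          # recompute without the cache

print(lengths, torch.allclose(k, k_all, atol=1e-6))
```

Each step projects only the newest token, yet the attention layer receives keys for every position seen so far, which is why the decoder can be fed one token at a time during generation.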

The last three lines of the class bind functions imported from the package's decoding and transcribe modules (detect_language, transcribe, decode) as methods of Whisper. Their implementations are not part of this code snippet. Judging by their names and source modules, they handle language detection, transcription, and token decoding for the speech recognition model.
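The assignments such as transcribe = transcribe_function work because a plain function stored as a class attribute behaves like a method: the instance is passed as its first argument. A minimal sketch with hypothetical names (TinyModel, and a stand-in transcribe_function that is unrelated to the real one):

```python
def transcribe_function(model, audio):
    """Hypothetical stand-in for a function defined in another module."""
    return f"{model.name} received {audio}"


class TinyModel:
    def __init__(self, name):
        self.name = name

    # Assigning a plain function in the class body turns it into a method:
    # the instance is passed as the first positional argument.
    transcribe = transcribe_function


model = TinyModel("demo")
result = model.transcribe("clip.wav")
print(result)
```

This pattern lets long functions live in their own modules while still being callable as model.transcribe(...).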