- Trained model: https://huggingface.co/facebook/s2t-small-librispeech-asr
- Model description: https://huggingface.co/transformers/v4.4.2/model_doc/speech_to_text.html

1. It's a transformer-based seq2seq (encoder-decoder) model, so the transcripts/translations are generated autoregressively, one token at a time.
2. It uses a SentencePiece tokenizer.
3. Input from speech: log-mel filter-bank features extracted from the speech signal (https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html)
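
To make point 1 concrete, here is a minimal toy sketch of greedy autoregressive decoding. `toy_next_token_scores` and `TARGET` are hypothetical stand-ins, not part of the Speech2Text API; the real decoder also conditions on the encoder output and scores a full SentencePiece vocabulary.

```python
# Toy illustration of greedy autoregressive decoding (not the real model).
EOS = "</s>"
TARGET = ["a", "man", "said", EOS]  # made-up sequence the toy "model" emits


def toy_next_token_scores(prefix):
    """Hypothetical decoder step: score every token given the prefix so far."""
    next_tok = TARGET[min(len(prefix), len(TARGET) - 1)]
    return {tok: (1.0 if tok == next_tok else 0.0) for tok in set(TARGET)}


def greedy_decode(max_len=10):
    """Pick the highest-scoring token at each step and feed it back in."""
    prefix = []
    for _ in range(max_len):
        scores = toy_next_token_scores(prefix)
        best = max(scores, key=scores.get)
        prefix.append(best)
        if best == EOS:  # stop once the end-of-sequence token is produced
            break
    return prefix


decoded = greedy_decode()
print(decoded)
```

Each step depends on everything generated before it, which is why generation is a loop rather than a single forward pass.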

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
# import librosa.display
# import librosa

In [3]:
import pandas as pd
import numpy as np

In [23]:
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration, Speech2TextModel
from datasets import load_dataset
import soundfile as sf

## Functions

In [5]:
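def map_to_array(batch):
    """Read the audio file at batch["file"] and store the waveform in batch["speech"]."""
    # sf.read returns (samples, sample_rate); the sample rate is discarded here
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch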

## Step 1: Load the model and processor (feature extractor + tokenizer: Speech2TextFeatureExtractor + Speech2TextTokenizer)

In [6]:
## Speech2TextForConditionalGeneration => the Speech2Text model with a language modeling head; generates transcripts (ASR) or translations from speech
## https://huggingface.co/transformers/v4.4.2/model_doc/speech_to_text.html#speech2textforconditionalgeneration
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

Some weights of Speech2TextForConditionalGeneration were not initialized from the model checkpoint at facebook/s2t-small-librispeech-asr and are newly initialized: ['model.decoder.embed_positions.weights', 'model.encoder.embed_positions.weights']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 2: Load dataset

In [7]:
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [11]:
ds = ds.map(map_to_array)

Map: 100%|█████████████████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 329.73 examples/s]


In [16]:
inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
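
A rough sketch of how many feature frames the processor call above should produce. It assumes Kaldi-style framing with a 25 ms window, a 10 ms shift, and no padding at the edges; these values are assumptions, not read from this notebook, so check the result against `inputs["input_features"].shape`. `num_fbank_frames` is a hypothetical helper.

```python
def num_fbank_frames(num_samples, sampling_rate=16_000,
                     frame_length_ms=25.0, frame_shift_ms=10.0):
    """Count frames when only complete windows are kept (assumed framing)."""
    window = int(sampling_rate * frame_length_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sampling_rate * frame_shift_ms / 1000)    # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


one_second = num_fbank_frames(16_000)
print(one_second)  # frames for one second of 16 kHz audio
```

Under these assumptions, one second of audio gives 98 frames, each an 80-dimensional log-mel vector (the 80 matches the first `Conv1d(80, ...)` layer in the model printout).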

In [19]:
generated_ids = model.generate(input_features=inputs["input_features"], attention_mask=inputs["attention_mask"])

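In [21]:
# This is an ASR checkpoint, so the decoded text is a transcription, not a translation.
# Passing skip_special_tokens=True to batch_decode would strip the </s> markers.
transcription = processor.batch_decode(generated_ids)

In [22]:
transcription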

['</s> a man said to the universe sir i exist</s>']
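
The decoded string still carries `</s>` markers. Below is a minimal sketch for cleaning it up and comparing it with a reference via word error rate. `strip_special_tokens` and `word_error_rate` are hypothetical helpers, and the `reference` string is a placeholder for `ds["text"][0]`, not a value read from the dataset.

```python
def strip_special_tokens(text, specials=("</s>", "<s>", "<pad>")):
    """Remove special-token markers and collapse whitespace."""
    for tok in specials:
        text = text.replace(tok, " ")
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


raw = "</s> a man said to the universe sir i exist</s>"
hypothesis = strip_special_tokens(raw)
reference = "A MAN SAID TO THE UNIVERSE SIR I EXIST"  # placeholder reference
wer = word_error_rate(reference.lower(), hypothesis)
print(hypothesis, wer)
```

LibriSpeech references are upper-case, so the reference is lower-cased before comparison.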

In [24]:
ptt = Speech2TextModel.from_pretrained("facebook/s2t-small-librispeech-asr")

Some weights of Speech2TextModel were not initialized from the model checkpoint at facebook/s2t-small-librispeech-asr and are newly initialized: ['model.decoder.embed_positions.weights', 'model.encoder.embed_positions.weights']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
ptt(input_features=inputs["input_features"], attention_mask=inputs["attention_mask"])

ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds
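
The error above arises because the bare `Speech2TextModel` is an encoder-decoder: its forward pass needs decoder inputs as well as the audio features. A minimal sketch follows, using a tiny randomly initialised model so that nothing is downloaded. All config sizes are arbitrary assumptions chosen for speed, not the checkpoint's values.

```python
import torch
from transformers import Speech2TextConfig, Speech2TextModel

# Tiny random model (hypothetical sizes, NOT the pretrained checkpoint).
config = Speech2TextConfig(
    vocab_size=100, d_model=16,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
    conv_channels=32, input_feat_per_channel=80,
)
tiny = Speech2TextModel(config).eval()

input_features = torch.randn(1, 50, 80)  # (batch, frames, mel bins)
# Start the decoder with the configured start token, as generation would.
decoder_input_ids = torch.tensor([[config.decoder_start_token_id]])

with torch.no_grad():
    out = tiny(input_features=input_features,
               decoder_input_ids=decoder_input_ids)

print(out.last_hidden_state.shape)          # decoder states
print(out.encoder_last_hidden_state.shape)  # subsampled encoder states
```

With the notebook's own objects, the analogous call would pass `decoder_input_ids=torch.tensor([[ptt.config.decoder_start_token_id]])` alongside `input_features` and `attention_mask`. `model.generate` builds these decoder inputs itself, which is why the earlier cell worked.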

In [25]:
ptt

Speech2TextModel(
  (encoder): Speech2TextEncoder(
    (conv): Conv1dSubsampler(
      (conv_layers): ModuleList(
        (0): Conv1d(80, 1024, kernel_size=(5,), stride=(2,), padding=(2,))
        (1): Conv1d(512, 512, kernel_size=(5,), stride=(2,), padding=(2,))
      )
    )
    (embed_positions): Speech2TextSinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0-11): 12 x Speech2TextEncoderLayer(
        (self_attn): Speech2TextAttention(
          (k_proj): Linear(in_features=256, out_features=256, bias=True)
          (v_proj): Linear(in_features=256, out_features=256, bias=True)
          (q_proj): Linear(in_features=256, out_features=256, bias=True)
          (out_proj): Linear(in_features=256, out_features=256, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (activation_fn): ReLU()
        (fc1): Linear(in_features=256, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features

In [26]:
model

Speech2TextForConditionalGeneration(
  (model): Speech2TextModel(
    (encoder): Speech2TextEncoder(
      (conv): Conv1dSubsampler(
        (conv_layers): ModuleList(
          (0): Conv1d(80, 1024, kernel_size=(5,), stride=(2,), padding=(2,))
          (1): Conv1d(512, 512, kernel_size=(5,), stride=(2,), padding=(2,))
        )
      )
      (embed_positions): Speech2TextSinusoidalPositionalEmbedding()
      (layers): ModuleList(
        (0-11): 12 x Speech2TextEncoderLayer(
          (self_attn): Speech2TextAttention(
            (k_proj): Linear(in_features=256, out_features=256, bias=True)
            (v_proj): Linear(in_features=256, out_features=256, bias=True)
            (q_proj): Linear(in_features=256, out_features=256, bias=True)
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (activation_fn): ReLU()
          (fc1): Linear(in_features=2
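
The two `Conv1d` layers in the printouts above (kernel 5, stride 2, padding 2) shorten the feature sequence by roughly a factor of four before the transformer layers see it. Here is a small sketch of that arithmetic using the standard Conv1d output-length formula. The helper names are hypothetical, and the 98-frame input is only an example value.

```python
def conv_out_len(length, kernel_size=5, stride=2, padding=2):
    """Standard Conv1d output length (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def subsampled_len(num_frames, num_conv_layers=2):
    """Apply the length formula once per conv layer in the subsampler."""
    for _ in range(num_conv_layers):
        num_frames = conv_out_len(num_frames)
    return num_frames


encoder_len = subsampled_len(98)  # 98 -> 49 -> 25
print(encoder_len)
```

The channel counts (80 -> 1024, then 512 -> 512, with `d_model` = 256) are consistent with a gated linear unit halving the channels after each convolution. That step is not visible in the printout, so treat it as an inference.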

In [None]:
# x, sr = librosa.load(ds["file"][0])

In [None]:
# plt.figure(figsize=(14, 5))
# librosa.display.waveshow(x, sr=sr)

In [None]:
# import IPython.display as ipd
# ipd.Audio(x, sr)

In [None]:
# ds["file"][55]

In [None]:
# import numpy
# sr = 22050 # sample rate
# T = 2.0    # seconds
# t = numpy.linspace(0, T, int(T*sr), endpoint=False) # time variable
# x = 0.5*numpy.sin(2*numpy.pi*440*t) 

In [None]:
# ipd.Audio(x, rate=sr) 

In [None]:
ds['audio'][0]['array']

In [None]:
ds['audio'][0]['array'].shape

In [None]:
ds['file'][0]

In [None]:
ds['text'][0]

## Links:
1. wav2vec-2: https://jonathanbgn.com/2021/09/30/illustrated-wav2vec-2.html