https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft

In [1]:
!pip install transformers
!pip install torchaudio
!pip install torch
!pip install datasets
!pip install phonemizer
!apt-get install -y espeak

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
espeak is already the newest version (1.48.15+dfsg-3).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.


In [2]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

In [3]:
"GPU" if torch.cuda.is_available() else "CPU"

'GPU'

In [4]:
# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")

Some weights of the model checkpoint at facebook/wav2vec2-lv-60-espeak-cv-ft were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-lv-60-espeak-cv-ft and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably

In [5]:
# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

In [6]:
# tokenize (pass sampling_rate so a mismatch with the model's expected rate raises an error)
audio = ds[0]["audio"]
input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values
input_values, input_values.shape

(tensor([[-0.0014, -0.0037, -0.0019,  ..., -0.0467, -0.0585, -0.0571]]),
 torch.Size([1, 174160]))
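
The shape above means one clip of 174160 raw samples. As a rough sanity check, the sketch below converts that to a duration and estimates how many frames the convolutional feature encoder will emit for it. The kernel and stride values are the default Wav2Vec2 configuration and the 16 kHz sampling rate is the usual one for this model family; both are assumptions here, since the notebook does not print this checkpoint's config. `num_output_frames` is a hypothetical helper name.

```python
# Assumed values: default Wav2Vec2 feature-encoder settings (conv_kernel / conv_stride)
# and a 16 kHz sampling rate. Neither is printed anywhere in this notebook.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLING_RATE = 16_000  # Hz


def num_output_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Apply the unpadded 1-D convolution length formula once per layer."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


num_samples = 174160                      # second dimension of input_values
duration_s = num_samples / SAMPLING_RATE  # clip length in seconds
frames = num_output_frames(num_samples)

print(duration_s)  # 10.885
print(frames)      # 544
```

Under these assumptions the encoder yields 544 frames, about one every 20 ms, which agrees with the time dimension of the logits printed for this clip.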

In [7]:
# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits
    print(logits, logits.shape)

tensor([[[ 14.0357, -20.4767, -20.3324,  ..., -18.9290, -20.5111, -19.6695],
         [ 13.9722, -20.5228, -20.3650,  ..., -18.9446, -20.5404, -19.6964],
         [ 14.0599, -20.4421, -20.2371,  ..., -18.8507, -20.4842, -19.6655],
         ...,
         [ 14.1237, -20.5276, -20.3348,  ..., -18.8896, -20.5254, -19.7500],
         [ 14.0878, -20.6103, -20.4305,  ..., -18.9614, -20.5918, -19.8240],
         [ 14.1897, -20.5412, -20.3406,  ..., -18.9045, -20.5430, -19.7956]]]) torch.Size([1, 544, 392])
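
The logits have shape (batch, frames, vocabulary) = (1, 544, 392). Greedy CTC decoding, which the following cell performs with `torch.argmax` and `processor.batch_decode`, takes the highest-scoring id in each frame, merges consecutive repeats, and then drops the blank id. Below is a minimal plain-Python sketch of that collapse step. It assumes the blank id is 0, which is consistent with column 0 dominating the logits above but is not confirmed by the notebook. `ctc_greedy_collapse` is a hypothetical helper, not part of transformers, and the input list is an abridged, illustrative sequence of per-frame ids.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids: merge consecutive repeats, then drop blanks."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep an id only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Abridged per-frame ids (illustrative): mostly blanks, with one repeated id (24, 24).
frame_ids = [0, 0, 26, 0, 17, 0, 0, 11, 0, 33, 0, 21, 0, 0, 24, 24, 46]
print(ctc_greedy_collapse(frame_ids))  # [26, 17, 11, 33, 21, 24, 46]

# A blank between two equal ids keeps both, which is how CTC encodes doubled symbols.
print(ctc_greedy_collapse([8, 8, 0, 8]))  # [8, 8]
```

Mapping the surviving ids to phoneme strings needs the tokenizer's vocabulary, so the notebook leaves that step to `processor.batch_decode`.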


In [8]:
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
# => should give ['m ɪ s t ɚ k w ɪ l t ɚ ɹ ɪ z ð ɪ ɐ p ɑː s əl ʌ v ð ə m ɪ d əl k l æ s ᵻ z æ n d w iː ɑːɹ ɡ l æ d t ə w ɛ l k ə m h ɪ z ɡ ɑː s p əl']

print(f"predicted_ids: {predicted_ids, predicted_ids.shape}\n\n\n")
print("-"*20)
print(f"\n\n\ntranscription: {transcription, len(transcription)}")

predicted_ids: (tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0, 26,  0, 17,  0,  0,  0,  0, 11,  0, 33,  0,  0,  0,  0,  0,
          0, 21,  0,  0, 24, 24, 46,  0,  0,  0,  0, 43,  0,  0,  0,  0,  0,  0,
          5,  0,  0,  0,  0,  0,  0,  0,  8,  8,  0, 30,  0,  0,  0,  0,  0,  0,
         18,  0, 17,  0,  0,  0, 42,  0, 17,  0,  0,  4,  0,  0,  5,  0,  0,  6,
          6, 14,  0,  0,  0, 12, 33,  0,  0, 25,  0,  0,  0,  0, 11,  0,  0,  0,
          0, 41,  0,  0,  0,  0, 42, 42,  0,  0, 11,  0,  0, 43,  0, 27,  0,  0,
         17,  0,  0,  0,  0,  0, 42,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0, 22,  0,  7,  0,  0,  0,  0,  0,  8,  0, 33,
          0,  0,  0,  0, 25,  0,  0,  8,  0, 10,  0,  0,  0,  0,  0,  0,  0, 27,
         27,  0,  0, 49,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 21,  0,  0,  0,
          0,