This notebook shows how to convert speech to text using Facebook's **Wav2Vec 2.0** model.

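**Wav2Vec2** is a speech model that takes a float array holding the raw waveform of the speech signal as input. The checkpoint used here expects audio sampled at 16 kHz, which is why the file is loaded with `sr=16000`.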

The **Wav2Vec2** checkpoint used here was fine-tuned with **connectionist temporal classification (CTC)**, so the model output has to be decoded with `Wav2Vec2Tokenizer`.
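
To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC decoding in plain Python: consecutive repeated ids are collapsed, then the blank token is dropped. The toy vocabulary, the frame sequence, and the helper name `ctc_greedy_decode` are hypothetical and only illustrate the idea; they are not the tokenizer's actual implementation.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then drop the CTC blank token."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "H", 2: "E", 3: "L", 4: "O"}

# One predicted id per audio frame. The blank between the two runs of 3
# is what lets the decoder keep the double "L".
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]

decoded = ctc_greedy_decode(frames, toy_vocab)
print(decoded)
```

Without a blank between them, the two runs of `3` would collapse into a single letter, which is why CTC needs a dedicated blank symbol.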

In [2]:
!pip install --upgrade transformers

Requirement already up-to-date: transformers in /usr/local/lib/python3.7/dist-packages (4.5.1)


In [3]:
import transformers
print(transformers.__version__)

4.5.1


In [4]:
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import numpy as np

In [5]:
# load the pre-trained model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
import os
from google.colab import drive
drive.mount('/content/drive/')
os.chdir("/content/drive/My Drive/Colab Notebooks")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [8]:
# load the audio file and resample it to 16 kHz
audio, sampling_rate = librosa.load("Reasons.m4a", sr=16000)



In [9]:
audio, sampling_rate

(array([7.4539844e-06, 3.3039858e-05, 3.4422897e-05, ..., 2.1207858e-05,
        3.6763315e-05, 0.0000000e+00], dtype=float32), 16000)

In [16]:
# play the audio file
display.Audio("Reasons.m4a", autoplay=True)

In [17]:
# convert the waveform into a PyTorch tensor with a batch dimension
input_values = tokenizer(audio, return_tensors='pt').input_values
input_values

tensor([[ 0.0003,  0.0019,  0.0020,  ...,  0.0012,  0.0021, -0.0001]])
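
The values in `input_values` differ from the raw `audio` samples, which is consistent with the waveform being rescaled to zero mean and unit variance before it reaches the model. The following is a minimal sketch of that kind of normalization in plain Python. The helper name `zero_mean_unit_var`, the `eps` value, and the toy waveform are assumptions for illustration, not the tokenizer's actual code.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Rescale samples to zero mean and (approximately) unit variance.

    eps is an assumed small constant that avoids division by zero
    for a silent (all-zero) waveform.
    """
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return [(x - mean) / math.sqrt(var + eps) for x in samples]


# Hypothetical toy waveform standing in for the real audio array.
toy_waveform = [0.0, 0.5, -0.5, 0.25, -0.25]
normalized = zero_mean_unit_var(toy_waveform)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
print(norm_mean, norm_var)
```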

In [18]:
# store logits (non-normalized predictions)
logits = model(input_values).logits
logits

tensor([[[ 14.4304, -27.6985, -27.3353,  ...,  -7.3044,  -6.9058,  -7.5037],
         [ 14.4707, -27.6978, -27.3268,  ...,  -7.2829,  -6.9399,  -7.4950],
         [ 14.5191, -27.6290, -27.2457,  ...,  -7.0882,  -6.8463,  -7.4283],
         ...,
         [ 14.6774, -27.6292, -27.2344,  ...,  -7.2010,  -7.4676,  -7.3104],
         [ 14.4614, -27.5317, -27.1502,  ...,  -7.3144,  -7.5679,  -7.3884],
         [ 14.5024, -27.4075, -27.0266,  ...,  -7.2562,  -7.4774,  -7.3426]]],
       grad_fn=<AddBackward0>)
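
Each row of `logits` corresponds to one output frame, not to one audio sample. As a rough sketch, the number of frames can be estimated from the number of samples by walking through the convolutional feature encoder. The kernel sizes and strides below are assumed from the published base configuration of Wav2Vec2 and are not read from the loaded model; `num_output_frames` is a hypothetical helper.

```python
# Assumed feature-encoder layout for the base Wav2Vec2 configuration.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


# One second of audio at 16 kHz.
frames_per_second = num_output_frames(16000)
print(frames_per_second)
```

Under these assumptions the total stride is 320 samples, so the model emits roughly one frame every 20 ms of audio.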

In [19]:
# store the predicted ids
# take the argmax over the vocabulary dimension to get the most likely token id for each frame
predicted_ids = torch.argmax(logits, dim=-1)
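
Taking the argmax of the raw logits gives the same ids as taking the argmax of softmax probabilities, because softmax preserves the ordering of its inputs. A small sketch in plain Python illustrates this. The `frame_logits` list holds only the entries of the first logits row that are visible in the output above (the elided columns are left out), and the `softmax` and `argmax` helpers are written here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Visible entries of the first logits row; elided columns are omitted.
frame_logits = [14.4304, -27.6985, -27.3353, -7.3044, -6.9058, -7.5037]
probs = softmax(frame_logits)

print(argmax(frame_logits), argmax(probs))
```

Skipping the softmax saves a computation whenever only the most likely id per frame is needed.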

In [20]:
# pass the predicted ids to the tokenizer's decode method to get the transcription
transcriptions = tokenizer.decode(predicted_ids[0])

In [21]:
transcriptions

'LOOKING FOR THE REASONS WHY BUSINESS FAIL'

In [23]:
type(transcriptions)

str