In this notebook we convert speech into text using Facebook's Wav2Vec 2.0 model. Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. The model was trained using connectionist temporal classification (CTC), so its output has to be decoded using Wav2Vec2Tokenizer. To learn more, see the [documentation](https://huggingface.co/transformers/model_doc/wav2vec2.html).

In [1]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.4.1-py3-none-any.whl (2.1 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.9.4
    Uninstalling tokenizers-0.9.4:
      Successfully uninstalled tokenizers-0.9.4
  Attempting uninstall: transformers
    Found existing installation: transformers 4.2.2
    Uninstalling transformers-4.2.2:
      Successfully uninstalled transformers-4.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
allennlp 2.0.1 requires transformers<4.3,>=4.1, but you have transformers 4.4.1 w

In [2]:
import transformers
print(transformers.__version__)

4.4.1


If the version printed above is lower than 4.3.0, upgrade transformers before continuing.
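The same check can be done programmatically. The following is a minimal sketch: the helper name `meets_minimum` is hypothetical, and it assumes a dotted version string such as `4.4.1`, ignoring any non-numeric suffix.

```python
import re


def meets_minimum(installed: str, required: str = "4.3.0") -> bool:
    """Return True if `installed` is at least `required` (numeric parts only)."""

    def parse(version: str) -> tuple:
        # Keep only the leading numeric components, e.g. "4.4.0.dev0" -> (4, 4, 0).
        parts = []
        for piece in version.split("."):
            match = re.match(r"\d+", piece)
            if match is None:
                break
            parts.append(int(match.group()))
        return tuple(parts)

    a, b = parse(installed), parse(required)
    # Pad the shorter tuple with zeros so that "4.3" compares equal to "4.3.0".
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(meets_minimum("4.4.1"))
print(meets_minimum("4.2.2"))
```

In the notebook, this could be called as `meets_minimum(transformers.__version__)`.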

### Import Libraries

In [3]:
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import numpy as np

### Load pre-trained Wav2Vec model

In [4]:
#load pre-trained model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Load Audio file

In [5]:
# load the audio file and resample it to 16 kHz, the rate this model expects
audio, sampling_rate = librosa.load("../input/audio-dataset/Voice 002.m4a", sr=16000)



In [6]:
audio,sampling_rate

(array([0.        , 0.        , 0.        , ..., 0.00085527, 0.00068164,
        0.        ], dtype=float32),
 16000)

### Play the Audio

In [7]:
# play the original recording inline
display.Audio("../input/audio-dataset/Voice 002.m4a", autoplay=True)

### Speech to Text

First, tokenize the audio to obtain the model's input values. Then run the model to get the logits, take the highest-scoring token at each time step, and decode the resulting ids into text.

In [8]:
input_values = tokenizer(audio, return_tensors = 'pt').input_values
input_values

tensor([[-4.1209e-05, -4.1209e-05, -4.1209e-05,  ...,  2.6336e-02,
          2.0981e-02, -4.1209e-05]])
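The values printed above are not the raw samples: the zeros at the start of the recording became about -4.1e-05, which is consistent with the tokenizer rescaling the waveform to zero mean and unit variance. The sketch below illustrates that kind of normalization on a toy signal in plain Python. The function name `normalize` and the small `eps` term are assumptions for illustration, not the library's exact code.

```python
import math


def normalize(samples, eps=1e-5):
    """Rescale a list of samples to (approximately) zero mean and unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps guards against division by zero for an all-constant signal (assumed value).
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]


# Toy waveform: mostly silence with a few non-zero samples.
toy = [0.0, 0.0, 0.5, -0.25, 0.25, 0.0]
normalized = normalize(toy)
print(normalized)
```

Because the toy signal has a positive mean, its silent samples all map to the same small negative value, mirroring the pattern in the tensor above.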

In [9]:
# store logits (non-normalized predictions)
logits = model(input_values).logits
logits

tensor([[[ 11.9173, -26.7431, -26.4653,  ...,  -6.2810,  -6.3824,  -7.3416],
         [ 11.9257, -26.7902, -26.5125,  ...,  -6.2941,  -6.3788,  -7.3500],
         [ 11.9959, -26.8095, -26.5268,  ...,  -6.2574,  -6.3953,  -7.3421],
         ...,
         [ 12.2501, -26.4285, -26.1227,  ...,  -6.1991,  -6.4601,  -7.0932],
         [ 12.1604, -26.3363, -26.0326,  ...,  -6.2217,  -6.5658,  -7.1099],
         [ 12.0844, -26.2884, -25.9919,  ...,  -6.2779,  -6.6117,  -7.1327]]],
       grad_fn=<AddBackward0>)
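The logits tensor has one row per output frame and one column per vocabulary entry. As a rough guide to how many frames to expect for a given recording, the sketch below applies the standard 1-D convolution length formula, layer by layer. The kernel sizes and strides are assumed from the published base Wav2Vec2 configuration; `model.config.conv_kernel` and `model.config.conv_stride` give the values for the checkpoint actually loaded. The function name `num_output_frames` is hypothetical.

```python
# Assumed feature-encoder settings for the base model (verify against model.config).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Estimate how many frames the convolutional encoder emits for num_samples inputs."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


print(num_output_frames(16000))
```

Under these assumptions, one second of 16 kHz audio yields 49 frames, or roughly one frame every 20 ms.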

In [10]:
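# store the predicted ids
# take the argmax over the vocabulary dimension to get the most likely token id
# at each time step (no softmax is needed, since it would not change the argmax)
predicted_ids = torch.argmax(logits, dim=-1)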

In [11]:
# pass the predicted ids to the tokenizer's decode method to get the transcription
transcriptions = tokenizer.decode(predicted_ids[0])

In [12]:
transcriptions

'DO WHATEVER YOU WANT TO DO'
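To make the decoding step concrete, here is a minimal sketch of greedy CTC decoding in plain Python: pick the best token per frame, merge consecutive repeats, then drop the blank token. The toy vocabulary, the blank index 0, and the function name `greedy_ctc_decode` are hypothetical and chosen for illustration. The real tokenizer has its own vocabulary and also handles word delimiters.

```python
def greedy_ctc_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, remove blanks."""
    # Most likely token id for each frame.
    ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]

    chars = []
    previous = None
    for token_id in ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            chars.append(vocab[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary; index 0 plays the role of the CTC blank (an assumption here).
vocab = ["<pad>", "D", "O", " "]
frame_scores = [
    [0.1, 0.9, 0.0, 0.0],  # D
    [0.1, 0.8, 0.1, 0.0],  # D again, merged with the previous frame
    [0.9, 0.0, 0.1, 0.0],  # blank
    [0.0, 0.1, 0.9, 0.0],  # O
    [0.9, 0.0, 0.1, 0.0],  # blank separates the two O's
    [0.0, 0.1, 0.9, 0.0],  # O, kept because a blank came in between
]
decoded = greedy_ctc_decode(frame_scores, vocab)
print(decoded)
```

The blank token is what lets CTC represent genuinely doubled letters: repeats are merged only when no blank separates them.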