### Wav2vec demo using Hugging Face
The original model is available at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20
<br> For usage details, the documentation at https://huggingface.co/transformers/model_doc/wav2vec2.html is helpful

In [31]:
# # Installing Transformers and supporting packages
# !pip install -q transformers
# !pip install -q datasets
# !pip install -q jiwer

In [29]:
# For managing audio files
import librosa
import torch
import soundfile as sf
import numpy as np
from jiwer import wer
import pandas as pd

# Importing the Wav2Vec2 classes
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer, Wav2Vec2Processor, Wav2Vec2Model

In [30]:
print(torch.version.cuda)
torch.backends.cudnn.enabled

10.1


True

In [None]:
# Playing the Pulp Fiction audio clip
import IPython.display as display
display.Audio("../../data/wave2vec_audio_samples/Pulp_fiction_clip.wav", autoplay=False)

In [4]:
import IPython.display as display
display.Audio("../../data/wave2vec_audio_samples/Taken_clip.wav", autoplay=False)

In [49]:
# Loading the audio file
audio, rate = librosa.load("Pulp_fiction_clip.wav", sr = 16000)
audio2, rate2 = librosa.load("Taken_clip.wav", sr = 16000)

In [50]:
# Printing the audio arrays
print(audio, audio2)
# Printing the sampling rates
print(rate, rate2)

[-0.00037006 -0.00090173 -0.00261051 ...  0.00249109  0.00074104
  0.        ] [-4.3040191e-06  3.3560192e-07 -4.8272054e-06 ... -1.2877666e-06
  4.8289276e-06  0.0000000e+00]
16000 16000


### Wav2vec 2.0
The base model was pretrained and fine-tuned on 960 hours of LibriSpeech, using speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
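The loading cell above passes `sr = 16000` to `librosa.load`, which resamples on load. For files read by other means, a quick header check can flag a mismatch before inference. The following is a minimal sketch using only the standard-library `wave` module (PCM WAV files only); the helper names and the temporary demo files are hypothetical and not part of the original notebook.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sampling rate the checkpoint expects


def wav_sample_rate(path):
    """Return the sampling rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True when the file's native rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate


def write_silence(path, rate, n_samples=100):
    """Write a short mono 16-bit silent clip (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * n_samples)


# Hypothetical demo files, created in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_8k = os.path.join(demo_dir, "demo_8k.wav")
demo_16k = os.path.join(demo_dir, "demo_16k.wav")
write_silence(demo_8k, 8_000)
write_silence(demo_16k, 16_000)

print(needs_resampling(demo_8k))   # True
print(needs_resampling(demo_16k))  # False
```

A file flagged this way can be resampled at load time, as the `librosa.load(..., sr = 16000)` call in this notebook does.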

In [34]:
# Loading the pretrained Wav2Vec 2.0 tokenizer and model
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [51]:
model.config # to learn about the parameters fed into the model

Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-base-960h",
  "activation_dropout": 0.1,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "sum",
  "ctc_zero_infinity": false,
  "do_stable_layer_norm": false,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "group",
  "feat_proj_dropout": 0.1,
  "final_dropout": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.1,
  "mask_feature_leng

#### Passing the audio array into the tokenizer to get the input values as PyTorch tensors instead of Python lists

In [60]:
# Taking an input value
input_values = tokenizer(audio, return_tensors = "pt").input_values

In [61]:
# Getting the logits (non-normalized prediction values)
logits = model(input_values).logits

In [62]:
input_values.shape

torch.Size([1, 139152])

In [63]:
logits.shape

torch.Size([1, 434, 32])
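The 434 frames in the logits follow from the `conv_kernel` and `conv_stride` values shown in the model config. The sketch below assumes each convolutional layer is unpadded, so its output length is `(length - kernel) // stride + 1`; under that assumption the arithmetic reproduces the observed shape. The function names are illustrative only.

```python
# Values copied from the Wav2Vec2Config output above
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]
SAMPLE_RATE = 16_000


def num_output_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Number of feature frames for a raw waveform of num_samples samples.

    Assumes every convolution is unpadded (an assumption of this sketch).
    """
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


def total_stride(strides=CONV_STRIDE):
    """Product of all strides: input samples advanced per output frame."""
    product = 1
    for stride in strides:
        product *= stride
    return product


frames = num_output_frames(139152)  # length from input_values.shape
hop_ms = 1000 * total_stride() / SAMPLE_RATE

print(frames)  # 434
print(hop_ms)  # 20.0
```

Each of the 434 frames therefore advances by roughly 20 ms of audio, and the last dimension of 32 is the number of scores per frame over which `argmax` is taken.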

In [64]:
# Storing predicted ids
prediction = torch.argmax(logits, dim = -1)

In [65]:
prediction.shape

torch.Size([1, 434])

In [66]:
# Passing the prediction to the tokenizer to decode
transcription = tokenizer.batch_decode(prediction)[0]

In [67]:
transcription

'BASICLEUM JUST ON A WALK TH EARTH WHICH YOU MOLOCK TO EARTH YOU KNOW LIKE CANE AN CON FO WALK FROM PLACE TO PLACE MEET PEOPLE GET IN ADVENTURES'
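The `argmax` plus `batch_decode` steps amount to greedy CTC decoding: take the best id per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea under the usual CTC convention; the toy vocabulary, ids, and function name are hypothetical and are not the checkpoint's actual vocabulary or the tokenizer's implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary (NOT the real wav2vec2 vocabulary)
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "A", 3: "C", 4: "T"}
BLANK_ID = 0          # assumed blank token id
WORD_DELIMITER = "|"  # assumed word-boundary token


def ctc_greedy_decode(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse repeated ids, remove blanks, and join into text."""
    # Step 1: merge runs of identical consecutive ids into one id
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop blanks, which only separate genuine repeats
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: turn word delimiters into spaces
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


frame_ids = [0, 3, 3, 0, 2, 2, 2, 4, 0, 1, 1, 2, 0, 4, 4]
print(ctc_greedy_decode(frame_ids))  # CAT AT
```

The blank token is what lets a doubled letter survive: `[2, 0, 2]` decodes to `AA`, while `[2, 2, 2]` collapses to a single `A`.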

### Implementing Wav2vec2 with LibriSpeech data

In [17]:
from datasets import load_dataset
import soundfile as sf

# Construct a Wav2Vec2 processor, which wraps a Wav2Vec2 feature extractor
# and a Wav2Vec2 CTC tokenizer into a single object
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
base_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2Model: ['lm_head.bias', 'lm_head.weight']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
# define function to read in sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

In [19]:
# Load a sample of the Librispeech clean dataset for inference
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)



In [20]:
ds.description

'LibriSpeech is a corpus of approximately 1000 hours of read English speech with sampling rate of 16 kHz,\nprepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read\naudiobooks from the LibriVox project, and has been carefully segmented and aligned.\n\nNote that in order to limit the required storage for preparing this dataset, the audio\nis stored in the .flac format and is not converted to a float32 array. To convert, the audio\nfile to a float32 array, please make use of the `.map()` function as follows:\n\n\n```python\nimport soundfile as sf\n\ndef map_to_array(batch):\n    speech_array, _ = sf.read(batch["file"])\n    batch["speech"] = speech_array\n    return batch\n\ndataset = dataset.map(map_to_array, remove_columns=["file"])\n```\n'

In [21]:
np.array(ds['speech']).shape

(73,)

In [22]:
input_values = processor(ds["speech"][0], return_tensors="pt", sampling_rate = 16000).input_values  # Batch size 1

In [23]:
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

In [24]:
transcription = processor.decode(predicted_ids[0])
print(transcription)

A MAN SAID TO THE UNIVERSE SIR I EXIST


In [25]:
# Compute the CTC loss against the reference transcription
target_transcription = "A MAN SAID TO THE UNIVERSE SIR I EXIST"

# wrap processor as target processor to encode labels
with processor.as_target_processor():
    labels = processor(target_transcription, return_tensors="pt").input_ids

loss = model(input_values, labels=labels).loss

### Evaluate model

In [26]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

librispeech_eval = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to(device)
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

def map_to_pred(batch):
    input_values = tokenizer(batch["speech"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])

print("WER:", wer(result["text"], result["transcription"]))

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




WER: 0.05304347826086957
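To make the WER figure easier to interpret, the sketch below computes it in plain Python. It assumes the usual definition, total word-level edits (substitutions, deletions, insertions) divided by the total number of reference words. It is an illustration with hypothetical function names, not jiwer's implementation, and the sentences are shortened examples rather than full dataset rows.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(references, hypotheses):
    """Total edits divided by total reference words over all pairs."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words, hyp_words = reference.split(), hypothesis.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words


# Shortened example sentences (illustrative only)
references = ["THE OTHER VOICE SNAPPED", "A MAN SAID TO THE UNIVERSE SIR I EXIST"]
hypotheses = ["THE UTTER VOICE SNAPPED", "A MAN SAID TO THE UNIVERSE SIR I EXIST"]

print(round(word_error_rate(references, hypotheses), 4))  # 0.0769
```

Here one substituted word out of 13 reference words gives a WER of about 0.077; by the same reading, the 0.053 reported above corresponds to roughly one word error in every 19 reference words.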


In [27]:
difference_location = np.where(pd.Series(result['text']) != pd.Series(result['transcription']))

In [28]:
changed_from = pd.Series(result['text']).values[difference_location]
changed_to = pd.Series(result['transcription']).values[difference_location]

pd.DataFrame({'from': changed_from, 'to': changed_to})

| | from | to |
|---|---|---|
| 0 | SWEAT COVERED BRION'S BODY TRICKLING INTO THE ... | SWEAT COVERED BRION'S BODY TRICKLING INTO THE ... |
| 1 | THE CUT ON HIS CHEST STILL DRIPPING BLOOD THE ... | THE CUT ON HIS CHEST STILL DRIPPING BLOOD THE ... |
| 2 | HIS INSTANT OF PANIC WAS FOLLOWED BY A SMALL S... | HIS INSTANCT PANIC WAS FOLLOWED BY A SMALL SHA... |
| 3 | ONE MINUTE A VOICE SAID AND THE TIME BUZZER SO... | ONE MINUTE A VOICE SAID AND THE TIMEBUZZ ARE S... |
| 4 | THE BUZZER'S WHIRR TRIGGERED HIS MUSCLES INTO ... | THE BUZZERS WIRE TRIGGERED HIS MUSCLES INTO CO... |
| 5 | THE CONTESTANTS IN THE TWENTIES NEEDED UNDISTU... | THE CONTESTANTS INTO TWENTIES NEEDED UNDISTURB... |
| 6 | THE OTHER VOICE SNAPPED WITH A HARSH URGENCY C... | THE UTTER VOICE SNAPPED WITH A HARSH URGENCY C... |
| 7 | I'M HERE BECAUSE THE MATTER IS OF UTMOST IMPOR... | I'M HERE BECAUSE THE MATTER IS OF UTMOST IMPOR... |
| 8 | HE MUST HAVE DRAWN HIS GUN BECAUSE THE INTRUDE... | HE MUST HAVE DRAWN HIS GUN BECAUSE THE INTRUDE... |
| 9 | HE ASKED THE HANDLER WHO WAS KNEADING HIS ACHI... | HE ASKED THE HANDLER WHO WAS KNEEDING HIS ACHI... |
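Because the displayed rows are truncated, it can be hard to see exactly which words differ. One possible way to pinpoint them is a word-level diff with the standard-library `difflib`; the helper name below is hypothetical, and the example uses only the visible prefix of row 6 rather than the full sentence.

```python
from difflib import SequenceMatcher


def word_level_changes(reference, hypothesis):
    """List the non-matching word spans between reference and hypothesis."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of "replace", "delete", or "insert"
            changes.append((tag, ref_words[i1:i2], hyp_words[j1:j2]))
    return changes


# Visible prefix of row 6 in the table above (the full row is truncated)
reference = "THE OTHER VOICE SNAPPED WITH A HARSH URGENCY"
hypothesis = "THE UTTER VOICE SNAPPED WITH A HARSH URGENCY"

print(word_level_changes(reference, hypothesis))
# [('replace', ['OTHER'], ['UTTER'])]
```

Applied to each `(from, to)` pair in the DataFrame, this would list only the words that changed, which reads more easily than comparing whole sentences.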
