<a href="https://colab.research.google.com/github/nguyenquoctrung98/ant-design/blob/master/Vietnamese_ASR_wav2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vietnamese end-to-end speech recognition using wav2vec 2.0

In [3]:
!pip3 install transformers==4.9.2 soundfile datasets==1.11.0 pyctcdecode==v0.1.0
!pip3 install https://github.com/kpu/kenlm/archive/master.zip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting https://github.com/kpu/kenlm/archive/master.zip
  Using cached https://github.com/kpu/kenlm/archive/master.zip (550 kB)


In [4]:
from transformers.file_utils import cached_path, hf_bucket_url
import os, zipfile
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch
import kenlm
from pyctcdecode import Alphabet, BeamSearchDecoderCTC, LanguageModel
import IPython

## Load wav2vec model and tokenizer

In [5]:
cache_dir = './cache/'
# load the processor (feature extractor + tokenizer) and the CTC acoustic model
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h", cache_dir=cache_dir)
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h", cache_dir=cache_dir)
# download the zipped 4-gram language model and unpack it into the cache directory
lm_file = hf_bucket_url("nguyenvulebinh/wav2vec2-base-vietnamese-250h", filename='vi_lm_4grams.bin.zip')
lm_file = cached_path(lm_file, cache_dir=cache_dir)
with zipfile.ZipFile(lm_file, 'r') as zip_ref:
    zip_ref.extractall(cache_dir)
lm_file = os.path.join(cache_dir, 'vi_lm_4grams.bin')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Load n-gram LM

In [6]:
def get_decoder_ngram_model(tokenizer, ngram_lm_path):
    vocab_dict = tokenizer.get_vocab()
    # order tokens by id so that list position i corresponds to logit column i
    sort_vocab = sorted((value, key) for (key, value) in vocab_dict.items())
    # drop the last two entries (presumably the added special tokens)
    vocab_list = [x[1] for x in sort_vocab][:-2]
    # represent the CTC blank (the pad token) as an empty string
    vocab_list[tokenizer.pad_token_id] = ""
    # map the unknown token to an empty string
    vocab_list[tokenizer.unk_token_id] = ""
    # vocab_list[tokenizer.bos_token_id] = ""
    # vocab_list[tokenizer.eos_token_id] = ""
    # represent the word delimiter as a plain space
    vocab_list[tokenizer.word_delimiter_token_id] = " "
    # specify the CTC blank index explicitly, since by convention it is the last entry of the logit matrix
    alphabet = Alphabet.build_alphabet(vocab_list, ctc_token_idx=tokenizer.pad_token_id)
    # load the KenLM n-gram model and wrap it for beam search decoding
    lm_model = kenlm.Model(ngram_lm_path)
    decoder = BeamSearchDecoderCTC(alphabet,
                                   language_model=LanguageModel(lm_model))
    return decoder

In [7]:
ngram_lm_model = get_decoder_ngram_model(processor.tokenizer, lm_file)
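
The label-list preparation inside `get_decoder_ngram_model` can be illustrated on a toy vocabulary. This is only a sketch under stated assumptions: the vocabulary, the token ids, and the helper name `build_label_list` are hypothetical and are not taken from the model's tokenizer.

```python
def build_label_list(vocab_dict, pad_id, unk_id, delimiter_id, n_drop=2):
    """Turn a token->id dict into a list of labels ordered by id (hypothetical helper)."""
    # sort by id so that list position i corresponds to logit column i
    ordered = [tok for _, tok in sorted((idx, tok) for tok, idx in vocab_dict.items())]
    # drop trailing entries that have no logit column (assumed here: <s> and </s>)
    labels = ordered[:len(ordered) - n_drop]
    labels[pad_id] = ""          # CTC blank
    labels[unk_id] = ""          # unknown token
    labels[delimiter_id] = " "   # word delimiter
    return labels

# toy vocabulary, NOT the real tokenizer vocabulary
toy_vocab = {"<pad>": 0, "<unk>": 1, "|": 2, "a": 3, "b": 4, "<s>": 5, "</s>": 6}
toy_labels = build_label_list(toy_vocab, pad_id=0, unk_id=1, delimiter_id=2)
print(toy_labels)  # ['', '', ' ', 'a', 'b']
```

The resulting list has one entry per logit column, which is the form `Alphabet.build_alphabet` receives in the cell above.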

## Load audio and infer

In [8]:
# download the test audio file
audio_file = hf_bucket_url("nguyenvulebinh/wav2vec2-base-vietnamese-250h", filename='audio-test/t1_utt000000042.wav')
audio_file = cached_path(audio_file, cache_dir=cache_dir)
# move the cached file to a readable name; os.replace overwrites an existing copy when the cell is re-run
os.replace(audio_file, cache_dir + "t1_utt000000042.wav")
audio_file = cache_dir + "t1_utt000000042.wav"

In [9]:
# read a sound file into a sample array together with its sampling rate
def map_to_array(batch):
    speech, sampling_rate = sf.read(batch["file"])
    batch["speech"] = speech
    batch["sampling_rate"] = sampling_rate
    return batch

# read the test audio file
ds = map_to_array({"file": audio_file})

In [10]:
# run inference
input_values = processor(
    ds["speech"],
    sampling_rate=ds["sampling_rate"],
    return_tensors="pt"
).input_values
# ).input_values.to("cuda")
# model.to("cuda")
# gradients are not needed at inference time
with torch.no_grad():
    logits = model(input_values).logits[0]
print(logits.shape)

torch.Size([119, 110])
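
The first dimension of `logits` is the number of acoustic frames and the second is the number of output labels. As a rough sketch of where the frame count comes from, the function below applies the standard unpadded-convolution length formula with the default `Wav2Vec2Config` kernel sizes and strides. That this checkpoint uses those defaults is an assumption, and `conv_output_length` is a hypothetical helper name.

```python
# default Wav2Vec2Config feature-encoder settings (assumed to apply to this checkpoint)
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def conv_output_length(num_samples):
    """Number of frames produced for num_samples raw audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length

print(conv_output_length(16000))  # 49 frames for one second at 16 kHz
print(conv_output_length(38400))  # 119 frames for 2.4 seconds at 16 kHz
```

Under these assumptions the strides multiply to 320 samples, so each frame advances by about 20 ms of 16 kHz audio.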


In [11]:
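# decode the CTC output
# greedy search: take the most likely label in every frame, then let the tokenizer collapse it
pred_ids = torch.argmax(logits, dim=-1)
greedy_search_output = processor.decode(pred_ids)
# beam search: rescore candidate transcripts with the n-gram language model
beam_search_output = ngram_lm_model.decode(logits.cpu().detach().numpy(), beam_width=500)
print("Greedy search output: {}".format(greedy_search_output))
print("Beam search output: {}".format(beam_search_output))
# play back the test audio
IPython.display.Audio(filename=audio_file, autoplay=True)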

Greedy search output: những nơi đã không chế được căn bệnh
Beam search output: những nơi đã khống chế được căn bệnh
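
The greedy output comes from taking the most likely label in every frame and then applying the CTC collapse rule: merge consecutive repeats, then drop blanks. Below is a minimal sketch of that rule on made-up ids. The function name `ctc_greedy_collapse`, the toy labels, and the blank id 0 are illustrative assumptions, not the tokenizer's actual implementation.

```python
def ctc_greedy_collapse(frame_ids, labels, blank_id=0):
    """Collapse per-frame label ids into text: merge repeats, then drop blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        # emit a label only when it differs from the previous frame and is not the blank
        if idx != prev and idx != blank_id:
            out.append(labels[idx])
        prev = idx
    return "".join(out)

# toy labels and frame ids, NOT the model's vocabulary or output
demo_labels = ["", " ", "a", "b"]
demo_frame_ids = [2, 2, 0, 2, 3, 3, 0, 1, 1, 3]
print(ctc_greedy_collapse(demo_frame_ids, demo_labels))  # aab b
```

A blank between two identical labels is what lets the same character appear twice in a row, as with the two `a` labels in the toy sequence. Greedy search never looks beyond the single best label per frame, whereas the beam search decoder also scores candidates with the language model, which is consistent with "không chế" being corrected to "khống chế" in the output above.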
