# Wav2vec2 example
Accentuation (stress marking) and phonetic transcription can be useful for analyzing acoustic corpora. This notebook shows how to run the wav2vec2-lv-60-espeak-cv-ft model fine-tuned on the [`RUSLAN`](https://ruslan-corpus.github.io/) and [`Common Voice`](https://commonvoice.mozilla.org/ru) corpora.


In [2]:
# @title Download model from Hugging Face
!mkdir model
!git clone https://huggingface.co/omogr/wav2vec2-lv-60-ru-ipa model

Cloning into 'model'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 20 (delta 2), reused 0 (delta 0), pack-reused 4 (from 1)
Unpacking objects: 100% (20/20), 304.43 KiB | 6.34 MiB/s, done.


In [5]:
import os
import sys
import numpy as np
import torch
import torchaudio
import random

from transformers import Wav2Vec2CTCTokenizer
from transformers import Wav2Vec2FeatureExtractor
from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2ForCTC

MODEL_PATH = 'model'

tokenizer = Wav2Vec2CTCTokenizer(
    os.path.join(MODEL_PATH, "vocab.json"),
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    word_delimiter_token="|",
    do_lower_case=False
)

# @title Load model and processor
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_PATH)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
        MODEL_PATH,
        attention_dropout=0.0,
        hidden_dropout=0.0,
        feat_proj_dropout=0.0,
        mask_time_prob=0.0,
        layerdrop=0.0,
        gradient_checkpointing=True,
        ctc_loss_reduction="mean",
        ctc_zero_infinity=True,
        bos_token_id=processor.tokenizer.bos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=processor.tokenizer.pad_token_id,
        vocab_size=len(processor.tokenizer.get_vocab()),
    )

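def process_wav_file(wav_file_path: str):
    # Read the audio file; waveform has shape (channels, samples)
    waveform, sample_rate = torchaudio.load(wav_file_path)

    # The model expects 16 kHz audio, so resample if the file uses another rate
    bundle_sample_rate = 16000
    if sample_rate != bundle_sample_rate:
        waveform = torchaudio.functional.resample(waveform, sample_rate, bundle_sample_rate)

    # Extract input features from the raw waveform
    input_values = processor(waveform, sampling_rate=bundle_sample_rate, return_tensors="pt").input_values
    # Retrieve logits; view(1, -1) flattens the input into one sequence, which assumes mono audio
    with torch.no_grad():
        logits = model(input_values.view(1, -1)).logits
    # Take the most likely token per frame and decode the ids into text
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)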

Some weights of the model checkpoint at model were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.masked_spec_embed']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
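
The last two steps of `process_wav_file` (argmax, then `batch_decode`) amount to greedy CTC decoding: collapse runs of repeated ids, drop the blank token, and join what remains. The sketch below illustrates that idea in plain Python. It is not the library's implementation; the function name `greedy_ctc_decode`, the toy vocabulary, and the frame ids are hypothetical, and it assumes that `<pad>` serves as the CTC blank and `|` as the word delimiter, as in the tokenizer configured above.

```python
from itertools import groupby
from typing import Dict, List


def greedy_ctc_decode(
    frame_ids: List[int],
    id_to_token: Dict[int, str],
    blank_id: int,
    word_delimiter: str = "|",
) -> str:
    """Illustrative greedy CTC decoding for one utterance (hypothetical helper)."""
    # 1. Collapse runs of identical consecutive ids into a single id
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token (assumed here to be the <pad> id)
    kept = [token_id for token_id in collapsed if token_id != blank_id]
    # 3. Map ids to tokens and turn the word delimiter into spaces
    text = "".join(id_to_token[token_id] for token_id in kept)
    return text.replace(word_delimiter, " ").strip()


# Toy vocabulary and per-frame argmax ids, made up for illustration only
toy_vocab = {0: "<pad>", 1: "|", 2: "k", 3: "a"}
toy_frames = [2, 2, 0, 3, 3, 0, 2, 1, 1, 3]

decoded = greedy_ctc_decode(toy_frames, toy_vocab, blank_id=0)
print(decoded)
```

A blank between two identical ids keeps them as separate symbols, which is how CTC can emit doubled characters.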


In [6]:
sample_wav_files = [
    'model/sample_wav_files/common_voice_ru_38488940.wav',
    'model/sample_wav_files/common_voice_ru_38488941.wav',
]

# @title Transcribe wav files
for wav_file_path in sample_wav_files:
    print('File:', wav_file_path)
    transcription = process_wav_file(wav_file_path)
    print('Transcription:', transcription)



File: model/sample_wav_files/common_voice_ru_38488940.wav
Transcription: ['kak v tr`udnɨje tak i d`obrɨj vrʲɪmʲɪn`a n`aʂɨ məɫɐdʲ`ɵʂ `ɛtə ɡɫ`avnəjə bɐɡ`atstvə']
File: model/sample_wav_files/common_voice_ru_38488941.wav
Transcription: ['mɨ nɐdʲ`ejɪmsʲə ʂto fsʲe ɡəsʊd`arstvə pɐdʲː`erʐɨvəjɪt `ɛtət tʲekst pənʲɪm`ajɪt ʂto n`ɨnʲɪʂnʲɪjə bʲɪzʲdʲ`ejstvʲɪje nʲɪprʲɪ`jemlʲɪmə']
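
In the transcriptions above, a backtick appears to precede the stressed vowel of each accented word (for example `tr`udnɨje`). For corpus analysis it can be handy to separate the stress position from the phonetic string. The sketch below does this in plain Python. The reading of the backtick as a stress mark is an assumption inferred from the output, and the helper name `split_stress` is hypothetical.

```python
from typing import List, Optional, Tuple

STRESS_MARK = "`"  # assumption: the backtick precedes the stressed vowel


def split_stress(transcription: str) -> List[Tuple[str, Optional[int]]]:
    """Return (word without the mark, index of the stressed symbol or None) per word."""
    result = []
    for word in transcription.split():
        mark_position = word.find(STRESS_MARK)
        plain_word = word.replace(STRESS_MARK, "")
        # The mark's position equals the index of the following symbol
        # once the mark itself is removed (only the first mark is reported)
        stressed_index = mark_position if mark_position >= 0 else None
        result.append((plain_word, stressed_index))
    return result


# Fragment of the first transcription printed above
sample = "kak v tr`udnɨje tak i d`obrɨj"
for plain_word, stressed_index in split_stress(sample):
    print(plain_word, stressed_index)
```

Words without a mark, such as short function words, get `None` as their stress index.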
