- This is a training demo; you can run the code locally on better GPUs.
- The inference notebook is here: [Bengali SR wav2vec_v1_bengali [Inference]](https://www.kaggle.com/takanashihumbert/bengali-sr-wav2vec-v1-bengali-inference). It scores **0.445** on the leaderboard.
- Feel free to upvote, thanks!

In [1]:
# this part is not needed because the packages are already described in pyproject.toml

# !cp -r ../input/python-packages2 ./

# !tar xvfz ./python-packages2/jiwer.tgz
# !pip install ./jiwer/jiwer-2.3.0-py3-none-any.whl -f ./ --no-index
# !tar xvfz ./python-packages2/normalizer.tgz
# !pip install ./normalizer/bnunicodenormalizer-0.0.24.tar.gz -f ./ --no-index
# !tar xvfz ./python-packages2/pyctcdecode.tgz
# !pip install ./pyctcdecode/attrs-22.1.0-py2.py3-none-any.whl -f ./ --no-index --no-deps
# !pip install ./pyctcdecode/exceptiongroup-1.0.0rc9-py3-none-any.whl -f ./ --no-index --no-deps
# !pip install ./pyctcdecode/hypothesis-6.54.4-py3-none-any.whl -f ./ --no-index --no-deps
# !pip install ./pyctcdecode/numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl -f ./ --no-index --no-deps
# !pip install ./pyctcdecode/pygtrie-2.5.0.tar.gz -f ./ --no-index --no-deps
# !pip install ./pyctcdecode/sortedcontainers-2.4.0-py2.py3-none-any.whl -f ./ --no-index --no-deps
# !pip install ./pyctcdecode/pyctcdecode-0.4.0-py2.py3-none-any.whl -f ./ --no-index --no-deps

# !tar xvfz ./python-packages2/pypikenlm.tgz
# !pip install ./pypikenlm/pypi-kenlm-0.1.20220713.tar.gz -f ./ --no-index --no-deps

In [2]:
import torch 
import torch.nn as nn
import torchaudio
import torchaudio.transforms as tat
from datasets import load_dataset, load_metric, Audio
import os

import typing as tp
from pathlib import Path
from functools import partial
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

import pandas as pd
import pyctcdecode
import numpy as np
from tqdm.notebook import tqdm

import librosa
import gc
import jiwer
import kenlm
from transformers import Wav2Vec2Processor, Wav2Vec2ProcessorWithLM, Wav2Vec2ForCTC
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
from bnunicodenormalizer import Normalizer
import warnings
warnings.filterwarnings('ignore')
torchaudio.set_audio_backend("soundfile")

In [3]:
### hyper-parameters
SR = 16000
torch.backends.cudnn.benchmark = True

ROOT = Path.cwd().parent
INPUT = ROOT / "input"
DATA = INPUT / "bengaliai-speech"
TRAIN = DATA / "train_mp3s"
TEST = DATA  / "test_mp3s"

output_dir = INPUT / "saved_model"
MODEL_PATH = INPUT / "ai4bharat-indicwav2vec-v1-bengali/indicwav2vec_v1_bengali"
LM_PATH = INPUT / "arijitx-full-model/wav2vec2-xls-r-300m-bengali/language_model"

SENTENCES_PATH = INPUT / "macro-normalization/normalized.csv"
INDEXES_PATH = INPUT / "dataset-overlaps-with-commonvoice-11-bn/indexes.csv"

In [4]:
processor = Wav2Vec2Processor.from_pretrained(MODEL_PATH)
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

decoder = pyctcdecode.build_ctcdecoder(
    list(sorted_vocab_dict.keys()),
    str(LM_PATH) + "/5gram.bin",
    str(LM_PATH) + "/unigrams.txt",
)
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)

Found entries of length > 1 in alphabet. This is unusual unless style is BPE, but the alphabet was not recognized as BPE type. Is this correct?
Unigrams and labels don't seem to agree.
Only 141 unigrams passed as vocabulary. Is this small or artificial data?


- According to @mbmmurad's [Dataset overlaps with CommonVoice 11 bn](https://www.kaggle.com/code/mbmmurad/dataset-overlaps-with-commonvoice-11-bn), the competition dataset may contain audio from the mozilla-foundation/common_voice_11_0 dataset. Here I simply exclude those samples; the filter below drops the overlapping rows in the "train" split.
- I also use @UmongSain's normalized data from [here](https://www.kaggle.com/code/umongsain/macro-normalization/notebook). Thanks to him!

In [5]:
sentences = pd.read_csv(SENTENCES_PATH)
indexes = set(pd.read_csv(INDEXES_PATH)['id'])
print(len(sentences))
sentences = sentences[~((sentences.index.isin(indexes))&(sentences['split']=='train'))].reset_index(drop=True)
print(len(sentences))

963636
706689


* From the "valid" split, sample 10% into the validation set and keep the remaining 90% for training.
* From the "train" split, sample 5% of the data; of that subset, put 8% into the validation set and 92% into the training set.
* This yields **57776** training samples and **5667** validation samples.
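
The split described above can be sketched without pandas on plain ID lists. This is a minimal illustration, not the notebook's actual code: `sample_frac`, `holdout_split`, and the toy ID ranges are hypothetical, and Python's `random` module will not reproduce the rows that `DataFrame.sample` picks for the same seed.

```python
import random

def sample_frac(ids, frac, seed=42):
    """Return round(frac * len(ids)) randomly chosen IDs, kept in original order."""
    rng = random.Random(seed)
    chosen = set(rng.sample(ids, round(frac * len(ids))))
    return [i for i in ids if i in chosen]

def holdout_split(ids, valid_frac, seed=42):
    """Split ids into (train, valid) where valid holds valid_frac of the IDs."""
    valid = sample_frac(ids, valid_frac, seed)
    valid_set = set(valid)
    train = [i for i in ids if i not in valid_set]
    return train, valid

# Toy stand-ins for the "valid" and "train" parts of the CSV (sizes are made up).
valid_part = list(range(1000))
train_part = list(range(1000, 11000))

# 10% of the "valid" part goes to validation, the remaining 90% to training.
train_0, valid_0 = holdout_split(valid_part, 0.10)

# Keep 5% of the "train" part, then hold out 8% of that subset for validation.
subset = sample_frac(train_part, 0.05)
train_1, valid_1 = holdout_split(subset, 0.08)

train_ids_demo = train_0 + train_1
valid_ids_demo = valid_0 + valid_1
print(len(train_ids_demo), len(valid_ids_demo))
```

The two validation pieces come from different source splits, so the resulting validation set mixes both distributions in the proportions chosen here.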

In [6]:
data_0 = sentences.loc[sentences['split']=='valid'].reset_index(drop=True)
valid_0 = data_0.sample(frac=0.1, random_state=42)
train_0 = data_0[~data_0.index.isin(valid_0.index)]

data_1 = sentences.loc[sentences['split']=='train'].reset_index(drop=True).sample(frac=0.05, random_state=42)
valid_1 = data_1.sample(frac=0.08, random_state=42)
train_1 = data_1[~data_1.index.isin(valid_1.index)]

train = pd.concat([train_0, train_1], axis=0).sample(frac=1, random_state=42).reset_index(drop=True)
valid = pd.concat([valid_0, valid_1], axis=0).sample(frac=1, random_state=42).reset_index(drop=True)

del data_0, data_1, valid_0, valid_1, train_0, train_1
all_ids = sentences['id'].to_list()
train_ids = train['id'].to_list()
valid_ids = valid['id'].to_list()

# In a Kaggle notebook, validation is very time-consuming, so I use a small validation set of 2000 samples rather than all 5667.
valid = valid.sample(n=2000, random_state=42)

print(len(all_ids))
print("train_ids", len(train_ids))
print("valid_ids", len(valid_ids))

706689
train_ids 57776
valid_ids 5667


In [7]:
# For i = 0..9, print the size in bytes of the audio file at TRAIN / train_ids[i]
for i in range(10):
    print(os.path.getsize(str(TRAIN / train_ids[i]) + ".mp3"))

# Print the mean file size over all train_ids
print(np.mean([os.path.getsize(str(TRAIN / train_id) + ".mp3") for train_id in train_ids]))
# Print the maximum as well
print(np.max([os.path.getsize(str(TRAIN / train_id) + ".mp3") for train_id in train_ids]))
# Print the share of files that are 50000 bytes or larger (printed as a fraction, not a percentage)
print(np.sum([os.path.getsize(str(TRAIN / train_id) + ".mp3") >= 50000 for train_id in train_ids]) / len(train_ids))
# Same for valid_ids
print(np.sum([os.path.getsize(str(TRAIN / valid_id) + ".mp3") >= 50000 for valid_id in valid_ids]) / len(valid_ids))

# Remove files of 50000 bytes or more from train_ids and valid_ids
train_ids = [train_id for train_id in train_ids if os.path.getsize(str(TRAIN / train_id) + ".mp3") < 50000]
valid_ids = [valid_id for valid_id in valid_ids if os.path.getsize(str(TRAIN / valid_id) + ".mp3") < 50000]

print("train_ids", len(train_ids))
print("valid_ids", len(valid_ids))

# Compute the minimum and maximum of the summed sizes of each adjacent pair of files in train_ids
train_sizes = [os.path.getsize(str(TRAIN / train_id) + ".mp3") for train_id in train_ids]
min_size = np.min([train_sizes[i] + train_sizes[i + 1] for i in range(len(train_sizes) - 1)])
max_size = np.max([train_sizes[i] + train_sizes[i + 1] for i in range(len(train_sizes) - 1)])
print(min_size, max_size)

19485
38061
20565
26181
21213
29421
36981
19053
7605
25101
28911.71623165328
88821
0.07193298255330934
0.08011293453326275
train_ids 53620
valid_ids 5213
10890 99882
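
The cell above calls `os.path.getsize` again for every statistic. A possible variant reads each size once into a dictionary and derives the filter from it. This is only a sketch: `file_sizes` and `keep_smaller_than` are hypothetical helpers, and the temporary files stand in for the competition MP3s.

```python
import os
import tempfile
from pathlib import Path

def file_sizes(directory, ids, suffix=".mp3"):
    """Return {id: size_in_bytes}, touching the filesystem once per file."""
    return {i: os.path.getsize(Path(directory) / f"{i}{suffix}") for i in ids}

def keep_smaller_than(sizes, limit):
    """Keep IDs whose file is strictly smaller than `limit` bytes, preserving order."""
    return [i for i, size in sizes.items() if size < limit]

# Tiny demonstration with throwaway files standing in for the MP3s.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_bytes in [("a", 10), ("b", 60), ("c", 30)]:
        (Path(tmp) / f"{name}.mp3").write_bytes(b"\0" * n_bytes)
    demo_sizes = file_sizes(tmp, ["a", "b", "c"])
    demo_kept = keep_smaller_than(demo_sizes, limit=50)

print(demo_sizes, demo_kept)
```

With the real data, the same dictionary could also feed the mean, maximum, and adjacent-pair statistics without further disk access.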


In [8]:
# Sort train_ids by ascending file size and call the result train_ids_sorted
train_ids_sorted = sorted(train_ids, key=lambda train_id: os.path.getsize(str(TRAIN / train_id) + ".mp3"))
# Given train_ids_sorted[0], ..., train_ids_sorted[n],
# reorder to train_ids_sorted[0], train_ids_sorted[n], train_ids_sorted[1], train_ids_sorted[n-1], train_ids_sorted[2], train_ids_sorted[n-2], ...
# (with an odd number of IDs, the loop below drops the middle one)
train_ids_aligned = []
for i in range(len(train_ids_sorted) // 2):
    train_ids_aligned.append(train_ids_sorted[i])
    train_ids_aligned.append(train_ids_sorted[len(train_ids_sorted) - 1 - i])

In [9]:
# Compute the minimum and maximum of the summed sizes of each adjacent pair of files in train_ids_aligned
train_sizes_aligned = [os.path.getsize(str(TRAIN / train_id) + ".mp3") for train_id in train_ids_aligned]
min_size_aligned = np.min([train_sizes_aligned[i] + train_sizes_aligned[i + 1] for i in range(len(train_sizes_aligned) - 1)])
max_size_aligned = np.max([train_sizes_aligned[i] + train_sizes_aligned[i + 1] for i in range(len(train_sizes_aligned) - 1)])
print(min_size_aligned, max_size_aligned)

52362 56995


In [10]:
print(len(train_ids), len(train_ids_aligned))

53620 53620


In [11]:
# Apply the same processing to valid_ids
valid_sizes = [os.path.getsize(str(TRAIN / valid_id) + ".mp3") for valid_id in valid_ids]
min_size = np.min([valid_sizes[i] + valid_sizes[i + 1] for i in range(len(valid_sizes) - 1)])
max_size = np.max([valid_sizes[i] + valid_sizes[i + 1] for i in range(len(valid_sizes) - 1)])
print(min_size, max_size)

valid_ids_sorted = sorted(valid_ids, key=lambda valid_id: os.path.getsize(str(TRAIN / valid_id) + ".mp3"))
valid_ids_aligned = []
for i in range(len(valid_ids_sorted) // 2):
    valid_ids_aligned.append(valid_ids_sorted[i])
    valid_ids_aligned.append(valid_ids_sorted[len(valid_ids_sorted) - 1 - i])

valid_sizes_aligned = [os.path.getsize(str(TRAIN / valid_id) + ".mp3") for valid_id in valid_ids_aligned]
min_size_aligned = np.min([valid_sizes_aligned[i] + valid_sizes_aligned[i + 1] for i in range(len(valid_sizes_aligned) - 1)])
max_size_aligned = np.max([valid_sizes_aligned[i] + valid_sizes_aligned[i + 1] for i in range(len(valid_sizes_aligned) - 1)])
print(min_size_aligned, max_size_aligned)

print(len(valid_ids), len(valid_ids_aligned))

13698 99234
54306 57762
5213 5212
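
The last output line shows that `valid_ids_aligned` has one element fewer than `valid_ids`: with an odd number of IDs, `range(len(...) // 2)` never reaches the middle element. A minimal sketch of the same smallest/largest interleaving that keeps every item follows; `interleave_small_large` and the toy sizes are illustrative names, not part of the notebook.

```python
def interleave_small_large(items, key):
    """Order items as smallest, largest, 2nd smallest, 2nd largest, ...

    Unlike a loop over range(len(items) // 2), this keeps the middle
    element when the number of items is odd.
    """
    ordered = sorted(items, key=key)
    result = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        result.append(ordered[lo])
        result.append(ordered[hi])
        lo += 1
        hi -= 1
    if lo == hi:  # odd count: one element is left in the middle
        result.append(ordered[lo])
    return result

# Toy file sizes in bytes, keyed by a made-up ID.
toy_sizes = {"a": 1, "b": 5, "c": 3, "d": 9, "e": 7}
aligned = interleave_small_large(list(toy_sizes), key=toy_sizes.get)
print(aligned)  # ['a', 'd', 'c', 'e', 'b']
```

Pairing a small file with a large one keeps the combined size of adjacent items roughly constant, which is what the min/max sums printed above measure.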


In [7]:
class W2v2Dataset(torch.utils.data.Dataset):
    def __init__(self, df):
        self.df = df
        self.pathes = df['id'].values
        self.sentences = df['normalized'].values
        self.resampler = tat.Resample(32000, SR)  # assumes every source MP3 is sampled at 32 kHz

    def __getitem__(self, idx):
        apath = TRAIN / f'{self.pathes[idx]}.mp3'
        waveform, sample_rate = torchaudio.load(apath, format="mp3")
        waveform = self.resampler(waveform)
        batch = dict()
        y = processor(waveform.reshape(-1), sampling_rate=SR).input_values[0] 
        batch["input_values"] = y
        with processor.as_target_processor():
            batch["labels"] = processor(self.sentences[idx]).input_ids       
        
        return batch

    def __len__(self):
        return len(self.df)

train_dataset = W2v2Dataset(train)
valid_dataset = W2v2Dataset(valid)

In [8]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch
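
The label handling in the collator can be illustrated on plain lists. This is a simplified sketch of the idea only: the real collator pads with the tokenizer's pad token and then uses the attention mask to overwrite padded positions with -100, whereas the hypothetical `pad_labels` below writes -100 directly.

```python
IGNORE_INDEX = -100  # label value that is excluded from the loss

def pad_labels(label_ids, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in label_ids)
    return [list(seq) + [ignore_index] * (longest - len(seq)) for seq in label_ids]

# Two made-up token-ID sequences of different lengths.
padded = pad_labels([[5, 6, 7], [8]])
print(padded)  # [[5, 6, 7], [8, -100, -100]]
```

A negative fill value is used because `Wav2Vec2ForCTC` only counts non-negative label IDs as targets when it computes the CTC loss, so padded positions do not contribute.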

In [9]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

- In a Kaggle notebook, loading the metric raises an error: **cannot import name 'compute_measures' from 'jiwer' (unknown location)**. In my local notebook the error does not occur.
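
If jiwer cannot be imported, one possible workaround is a small pure-Python word error rate. The sketch below is an assumption about what would be an acceptable substitute, not the notebook's method: `edit_distance` and `word_error_rate` are hypothetical helpers that split on whitespace and pool edits over all sentences, so the result may differ slightly from jiwer's, which applies its own text transforms.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def word_error_rate(predictions, references):
    """Total word edits divided by total reference words over the whole corpus.

    Raises ZeroDivisionError if the references contain no words at all.
    """
    edits = 0
    words = 0
    for hyp, ref in zip(predictions, references):
        ref_words = ref.split()
        edits += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    return edits / words

# One substituted word out of three reference words.
print(word_error_rate(predictions=["a x c"], references=["a b c"]))
```

Inside `compute_metrics`, such a function could take the place of the `wer_metric.compute(...)` call, since it accepts the same `predictions` and `references` keyword arguments.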

In [10]:
wer_metric = load_metric("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

In [11]:
model = Wav2Vec2ForCTC.from_pretrained(
    MODEL_PATH,
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    #gradient_checkpointing=True, 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
    ctc_zero_infinity=True,
    diversity_loss_weight=100 
)

In [12]:
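# Optionally freeze some parameters; this call freezes the convolutional feature encoder
# (newer transformers releases name the method freeze_feature_encoder())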
model.freeze_feature_extractor()

- As a demo, "**num_train_epochs**", "**eval_steps**" and "**early_stopping_patience**" are set to very small values; you can increase them.
- If jiwer imports without error, you can set **metric_for_best_model**="wer"; remember to also set **greater_is_better**=False and pass **compute_metrics** to the Trainer.

In [13]:
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    group_by_length=False,
    lr_scheduler_type='cosine',
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    # max_steps=1000, # you can change to "num_train_epochs"
    num_train_epochs=1,
    fp16=True,
    save_steps=20,
    eval_steps=20,
    logging_steps=20,
    learning_rate=2e-5,
    warmup_steps=600,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    prediction_loss_only=False,
    auto_find_batch_size=True,
    report_to="none"
)

In [14]:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=processor.feature_extractor,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)

In [15]:
trainer.train()

  0%|          | 0/14444 [00:00<?, ?it/s]

{'loss': 2.4066, 'learning_rate': 6.333333333333334e-07, 'epoch': 0.0}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 2.4455785751342773, 'eval_wer': 0.5333676151296994, 'eval_runtime': 166.7851, 'eval_samples_per_second': 11.991, 'eval_steps_per_second': 0.749, 'epoch': 0.0}
{'loss': 2.1576, 'learning_rate': 1.3e-06, 'epoch': 0.0}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 2.4403610229492188, 'eval_wer': 0.5334818877842532, 'eval_runtime': 75.8077, 'eval_samples_per_second': 26.383, 'eval_steps_per_second': 1.649, 'epoch': 0.0}
{'train_runtime': 295.0803, 'train_samples_per_second': 195.798, 'train_steps_per_second': 48.949, 'train_loss': 2.2820834159851073, 'epoch': 0.0}


TrainOutput(global_step=40, training_loss=2.2820834159851073, metrics={'train_runtime': 295.0803, 'train_samples_per_second': 195.798, 'train_steps_per_second': 48.949, 'train_loss': 2.2820834159851073, 'epoch': 0.0})
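
The run above ends at global step 40 even though one epoch is 14444 steps. That is consistent with `early_stopping_patience=1`: the second evaluation's WER is slightly worse than the first. The sketch below is a simplified model of patience-based stopping, not the actual `EarlyStoppingCallback` code; `first_stop_step` is a hypothetical helper and it ignores the callback's improvement threshold.

```python
def first_stop_step(eval_scores, patience, eval_every, greater_is_better=False):
    """Return the training step at which patience runs out, or None if it never does."""
    best = None
    evals_without_improvement = 0
    for eval_index, score in enumerate(eval_scores, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return eval_index * eval_every
    return None

# eval_wer values copied from the log above; evaluation runs every 20 steps.
logged_wer = [0.5333676151296994, 0.5334818877842532]
stop_step = first_stop_step(logged_wer, patience=1, eval_every=20)
print(stop_step)  # 40
```

Raising the patience or evaluating less often gives the model more steps to recover from a single noisy evaluation.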

- To improve scores you can:
    * use different pretrained models
    * tune the hyper-parameters
    * train on more data
    * filter the data in another way.

In [16]:
trainer.save_model(output_dir)

In [17]:
model.save_pretrained(output_dir)
processor.feature_extractor.save_pretrained(output_dir)

['/home/nago/Documents/ML/kaggle-Bengali.AI_Speech-Recognition/input/saved_model/preprocessor_config.json']