# INTRO.

### Title -  ASR series by Professor - Facebook (Cross Lingual Speech Representation - XLSR)

###### ASR with Wav2vec2-XLSR: Swahili Challenge Edition 🌍

XLSR-Wav2Vec2 is a groundbreaking creation by Facebook AI for automatic speech recognition (ASR). 🚀🔊

In this tutorial, we'll walk you through the ins and outs of this multilingual ASR solution, from setup to inference for the Tanzania ASR challenge. 🛠️🎧➡️📝

_PS: Many of the code snippets were copied from my notebook on training Whisper. You may want to check that out._

###### Import libraries

In [1]:
!pip install pydub



In [2]:
import pandas as pd
import os
import warnings
warnings.filterwarnings("ignore")

###### Load in the Dataset

After loading the data, we set the model name to "alamsher/wav2vec2-large-xlsr-53-common-voice-sw". This checkpoint is fine-tuned for Swahili on the Wav2Vec2 XLSR architecture, so it can already transcribe Swahili speech into text and gives us a ready starting point for the challenge audio.

* You can check out more models on the Hugging Face Hub.

In [3]:
train = pd.read_csv("/kaggle/input/swahili-audio-dataset/train.csv")#.head(20)
test = pd.read_csv("/kaggle/input/swahili-audio-dataset/test.csv")#.head(20)
train["path"] = "/kaggle/input/swahili-audio-dataset/all_wav/content/all_wav_files/" + train["audio_ID"] +".wav"
test["path"] = "/kaggle/input/swahili-audio-dataset/all_wav/content/all_wav_files/" + test["audio_ID"] + ".wav"
test["sentence"] = " "
train["sentence"] = train["sentence"].str.lower()

name = "alamsher/wav2vec2-large-xlsr-53-common-voice-sw"

In [4]:
test

Unnamed: 0,audio_ID,path,sentence
0,audio_3c2f423a,/kaggle/input/swahili-audio-dataset/all_wav/co...,
1,audio_ca8009ec,/kaggle/input/swahili-audio-dataset/all_wav/co...,
2,audio_fffcdb86,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3,audio_0e3e8e6f,/kaggle/input/swahili-audio-dataset/all_wav/co...,
4,audio_859b792d,/kaggle/input/swahili-audio-dataset/all_wav/co...,
...,...,...,...
3162,audio_bdc72517,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3163,audio_50ee5991,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3164,audio_b9867662,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3165,audio_a0739877,/kaggle/input/swahili-audio-dataset/all_wav/co...,


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Work on a copy of the training DataFrame loaded above
df = train.copy()

# Split the DataFrame into train and validation sets
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

# Save the train and validation DataFrames to CSV files
train_df.to_csv('train_split.csv', index=False)
val_df.to_csv('valid_split.csv', index=False)
test.to_csv("test.csv", index=False)

print("Splitting complete. Train and validation CSV files created.")


Splitting complete. Train and validation CSV files created.


###### Install major Libraries

In [6]:
!pip install -qq datasets==2.1.0
!pip install -qq transformers==4.18.0
!pip install -qq jiwer

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import random
from datasets import load_dataset, load_metric, Audio
import re
import json
from transformers import TrainingArguments, Trainer
from transformers import AutoProcessor, AutoModelForCTC

import torch
import librosa
import IPython.display as ipd

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

Here, we create a processor using AutoProcessor.from_pretrained(name), where name corresponds to the Swahili-specific ASR model we've defined earlier.

In [8]:
processor = AutoProcessor.from_pretrained(name)

Downloading:   0%|          | 0.00/254 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/396 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/657 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/30.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.73k [00:00<?, ?B/s]

###### Data Cleaning Functions

So this is a long one, but I'll keep it concise. 

* load_data(data_files): Loads the CSV files using datasets.load_dataset().

* remove_special_characters(batch, column_name="sentence"): Removes special characters from the text. The list of characters is in chars_to_remove_regex in the code.

* replace_extra_characters(batch, column_name="sentence"): Replaces the accented characters ç and é with plain letters and removes stray characters such as ... and tabs.

* extract_all_chars(batch, column_name="sentence"): Extracts the unique characters from the text data.

* speech_file_to_array_fn(batch, column_name="sentence", audio_folder=...): Loads each audio file as mono, resamples it to 16 kHz, and adds the sampling rate and target text to the batch.

* prepare_dataset(batch): Computes the model's input values from the audio and encodes the target text to label IDs.
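
To make the text-cleaning steps concrete, here is a small standalone sketch that applies the same kind of cleanup to one sentence taken from the training preview and then collects its character vocabulary. The helper names `clean_text` and `build_vocab` are illustrative assumptions and are not part of the notebook's pipeline, which works on dataset batches instead of plain strings.

```python
import re

# Punctuation to strip, mirroring the character class used in this notebook
CHARS_TO_REMOVE = r'[,?.!\-;:"“%‘”\']'


def clean_text(text):
    """Lowercase the text, drop punctuation, and fold two accented letters."""
    text = text.lower()
    text = re.sub(CHARS_TO_REMOVE, "", text)
    text = text.replace("ç", "c").replace("é", "e")
    return text.replace("\t", "")


def build_vocab(sentences):
    """Map every distinct character in the sentences to an integer id."""
    chars = sorted(set(" ".join(sentences)))
    return {ch: idx for idx, ch in enumerate(chars)}


sample = "Wakifanya ubaya, uwajibu kwa wema."
cleaned = clean_text(sample)
vocab = build_vocab([cleaned])

print(cleaned)     # wakifanya ubaya uwajibu kwa wema
print(len(vocab))  # 13 distinct characters, including the space
```

A character-level vocabulary like this is what a CTC tokenizer is built from, so characters missed during cleaning end up as extra output classes.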

In [9]:
def load_data(data_files):
  """Load data using Hugging Face datasets.load_dataset()."""
  data = load_dataset("csv", data_files=data_files)
  return data

# Raw string so the regex escapes are not treated as Python string escapes
chars_to_remove_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\']'

def remove_special_characters(batch, column_name = "sentence"):
    batch[column_name] = re.sub(chars_to_remove_regex, '', batch[column_name])
    return batch

def replace_extra_characters(batch,column_name="sentence"):
    batch[column_name] = re.sub('[ç]', 'c', batch[column_name])
    batch[column_name] = re.sub('[é]', 'e', batch[column_name])
    batch[column_name] = re.sub('[...]', '', batch[column_name])
    batch[column_name] = re.sub('[\t]', '', batch[column_name])
    return batch

def extract_all_chars(batch,column_name="sentence"):
  all_text = " ".join(batch[column_name])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}


def speech_file_to_array_fn(batch,column_name = "sentence", audio_folder = "/kaggle/input/swahili-audio-dataset/all_wav/content/all_wav_files/"):
    speech_array, sampling_rate = librosa.load(audio_folder + batch["audio_ID"]+".wav", mono=True, sr=16000)
    batch["speech"] = np.array(speech_array, dtype = np.float32)
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch[column_name]
    return batch

def prepare_dataset(batch,column_name = "sentence"):
    # batch = batch["batch"]

    # batched output is "un-batched"
    batch["input_values"] = processor(np.array(batch["speech"]), sampling_rate=16000).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

###### Some Jargon. - DataCollatorCTCWithPadding

This data collator prepares batches for CTC-based ASR training, ensuring that inputs and labels are correctly padded before they reach the model.

The class DataCollatorCTCWithPadding is defined as a dataclass, taking two primary arguments:

1. processor: The Wav2Vec2 processor created earlier, which wraps the feature extractor and the tokenizer.
2. padding: A flag or string indicating the padding method (default is True, which pads to the longest sequence in the batch).

Within the __call__ method, the features are split into inputs and labels, because the two have different lengths and need different padding methods.

* The audio inputs are padded with the processor's feature extractor and returned as torch tensors.
* The label sequences are padded with the processor's tokenizer and returned as torch tensors.

Note that the padding positions in the labels are replaced by -100 so that they are ignored when the loss is computed.
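
The label-masking idea can be sketched in plain Python, without the processor or torch. This is only an illustration under assumptions: the token ids are made up, `pad_labels` is a hypothetical helper, and `pad_id=0` is an assumed padding id. The collator in the next cell does the same thing with `processor.pad` and `masked_fill` on tensors.

```python
def pad_labels(label_seqs, pad_id=0, ignore_index=-100):
    """Pad label id lists to the longest one, then mask the padding."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, masked = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        # 1 marks a real token, 0 marks a padding position
        attention = [1] * len(seq) + [0] * n_pad
        ids = list(seq) + [pad_id] * n_pad
        padded.append(ids)
        # Padding positions get ignore_index so the loss skips them
        masked.append(
            [tok if keep == 1 else ignore_index for tok, keep in zip(ids, attention)]
        )
    return padded, masked


labels = [[5, 9, 2], [7], [4, 4]]  # made-up token ids
padded, masked = pad_labels(labels)

print(padded)  # [[5, 9, 2], [7, 0, 0], [4, 4, 0]]
print(masked)  # [[5, 9, 2], [7, -100, -100], [4, 4, -100]]
```

Masking by attention mask instead of by value matters: a real token could share its id with the padding token, and it should still count toward the loss.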

In [10]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Any  # Wav2Vec2 processor instance (feature extractor + tokenizer)
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

###### WER

Here we define a function that computes the Word Error Rate (WER), a metric commonly used to assess the performance of Automatic Speech Recognition (ASR) models.

The function takes the model's predicted logits, picks the most likely token at each step, decodes both predictions and labels back to text, and compares the two.
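
For intuition, WER for a single utterance is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a minimal standalone version; `word_error_rate` is a hypothetical helper, not the metric implementation loaded in the next cell, which may aggregate over many utterances differently. The example pair is the first prediction shown in the inference section of this notebook.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


truth = "huko kwa wakiroba mkoa wa mara"
prediction = "hukwa wakioba mkoa wa mara"
print(word_error_rate(truth, prediction))  # 0.5 (3 errors over 6 words)
```

Because WER counts whole words, a single wrong character makes the entire word an error, which is why near-miss predictions can still score poorly.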

In [11]:
wer_metric = load_metric("wer")
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metric
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

Downloading builder script:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

In [12]:
pd.read_csv("train_split.csv")

Unnamed: 0,audio_ID,path,sentence
0,audio_3ff8be72,/kaggle/input/swahili-audio-dataset/all_wav/co...,haiwezi kurejeshwa kama umefanya makosa kama u...
1,audio_14af6640,/kaggle/input/swahili-audio-dataset/all_wav/co...,benki kuu ya tanzania inatanua wigo katika mif...
2,audio_27dd9959,/kaggle/input/swahili-audio-dataset/all_wav/co...,baada ya tanganyika na kisiwa cha zanzibar kuu...
3,audio_9074f8eb,/kaggle/input/swahili-audio-dataset/all_wav/co...,mwaka elfu moja mia tisa arobaini na saba aska...
4,audio_f946459b,/kaggle/input/swahili-audio-dataset/all_wav/co...,inaweza kuwa ngumu kutangaza kifaa cha umeme
...,...,...,...
17178,audio_4ab93a25,/kaggle/input/swahili-audio-dataset/all_wav/co...,laba shabani sasa unafikiria tatizo nini hasa
17179,audio_868b13da,/kaggle/input/swahili-audio-dataset/all_wav/co...,"wakifanya ubaya, uwajibu kwa wema."
17180,audio_c610a669,/kaggle/input/swahili-audio-dataset/all_wav/co...,wanu hafidh ameir ni mwakilishi maalum katika ...
17181,audio_f351ce61,/kaggle/input/swahili-audio-dataset/all_wav/co...,benki kuu inasimamia soko la hisa la dar es sa...


In [13]:
pd.read_csv("test.csv")

Unnamed: 0,audio_ID,path,sentence
0,audio_3c2f423a,/kaggle/input/swahili-audio-dataset/all_wav/co...,
1,audio_ca8009ec,/kaggle/input/swahili-audio-dataset/all_wav/co...,
2,audio_fffcdb86,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3,audio_0e3e8e6f,/kaggle/input/swahili-audio-dataset/all_wav/co...,
4,audio_859b792d,/kaggle/input/swahili-audio-dataset/all_wav/co...,
...,...,...,...
3162,audio_bdc72517,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3163,audio_50ee5991,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3164,audio_b9867662,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3165,audio_a0739877,/kaggle/input/swahili-audio-dataset/all_wav/co...,


In [14]:
pd.read_csv("valid_split.csv")

Unnamed: 0,audio_ID,path,sentence
0,audio_e0a09962,/kaggle/input/swahili-audio-dataset/all_wav/co...,rock hudson alioa mke wake na kutalakiana hapo...
1,audio_2257a639,/kaggle/input/swahili-audio-dataset/all_wav/co...,mazingira haya ni pamoja na mipangilio ya kuzu...
2,audio_0cbda50b,/kaggle/input/swahili-audio-dataset/all_wav/co...,piga hatua sana ya kimaendeleo afrika mashariki
3,audio_a07549f6,/kaggle/input/swahili-audio-dataset/all_wav/co...,alikimbia lakini walimkamata na kumlazimisha k...
4,audio_137dc8b9,/kaggle/input/swahili-audio-dataset/all_wav/co...,daktari muoga anakuja
...,...,...,...
1905,audio_8ff8fb7a,/kaggle/input/swahili-audio-dataset/all_wav/co...,mungu amempenda na amemchukua siku ya amueke m...
1906,audio_0be7e3dd,/kaggle/input/swahili-audio-dataset/all_wav/co...,vile ninavyohisi
1907,audio_b337b4d1,/kaggle/input/swahili-audio-dataset/all_wav/co...,ghana ule aa namuuona yule kiongozi kidogo ame...
1908,audio_57daf8fe,/kaggle/input/swahili-audio-dataset/all_wav/co...,katika hospitali ya cedars of lebanon huko mji...


In [15]:
from transformers import AutoProcessor, AutoModelForCTC

###### Apply Data Cleaning and Extraction

* Train
* Valid
* Test

In [16]:
# Step 1: point each split at its CSV file
data_files = {
    "train": "train_split.csv",
    "valid": "valid_split.csv",
    "test": "test.csv",
}
# Load data
data = load_data(data_files)

# Clean data
data["train"] = data["train"].map(remove_special_characters)
data["valid"] = data["valid"].map(remove_special_characters)
data["test"] = data["test"].map(remove_special_characters)

data["train"] = data["train"].map(replace_extra_characters)
data["valid"] = data["valid"].map(replace_extra_characters)
data["test"] = data["test"].map(replace_extra_characters)

!mkdir "artifacts/"

data["train"] = data["train"].map(speech_file_to_array_fn,
                                remove_columns=data["train"].column_names)
data["valid"] = data["valid"].map(speech_file_to_array_fn,
                                remove_columns=data["valid"].column_names)
data["test"] = data["test"].map(speech_file_to_array_fn,
                               remove_columns=data["test"].column_names)

print("check data")
# Pick a random index within the train split (len(data) would only count the splits)
rand_int = random.randint(0, len(data["train"]) - 1)

#print("Sampling rate:", data["train"][rand_int]["sampling_rate"])
ipd.Audio(data=data["train"][rand_int]["speech"], autoplay=True, rate=16000)

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-87479a57ac18879d/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-87479a57ac18879d/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/17183 [00:00<?, ?ex/s]

  0%|          | 0/1910 [00:00<?, ?ex/s]

  0%|          | 0/3167 [00:00<?, ?ex/s]

  0%|          | 0/17183 [00:00<?, ?ex/s]

  0%|          | 0/1910 [00:00<?, ?ex/s]

  0%|          | 0/3167 [00:00<?, ?ex/s]

  0%|          | 0/17183 [00:00<?, ?ex/s]

  0%|          | 0/1910 [00:00<?, ?ex/s]

  0%|          | 0/3167 [00:00<?, ?ex/s]

check data


In [17]:
data["train"] = data["train"].map(prepare_dataset, remove_columns=data["train"].column_names)
data["valid"] = data["valid"].map(prepare_dataset, remove_columns=data["valid"].column_names)
data["test"] = data["test"].map(prepare_dataset, remove_columns=data["test"].column_names)

  0%|          | 0/17183 [00:00<?, ?ex/s]

  0%|          | 0/1910 [00:00<?, ?ex/s]

  0%|          | 0/3167 [00:00<?, ?ex/s]

In [18]:
data["test"]#.column_names

Dataset({
    features: ['input_values', 'input_length', 'labels'],
    num_rows: 3167
})

###### Instantiate the Model

In [19]:
model = AutoModelForCTC.from_pretrained(name)

Downloading:   0%|          | 0.00/2.25k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.18G [00:00<?, ?B/s]

In [20]:
model.freeze_feature_extractor()

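Here, we configure the data collator and training arguments for our ASR task:

1. data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True):

We set up the data collator, which assembles individual examples into training batches. With padding=True, every sequence is padded to the length of the longest one in its batch.

2. training_args = TrainingArguments(...):

We define the training arguments for our ASR model: one epoch with a per-device batch size of 8 and 2 gradient accumulation steps, fp16 mixed precision, a learning rate of 3e-4 with 500 warmup steps, and evaluation and checkpointing every 1000 steps.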
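
It helps to work out how many optimizer steps these settings produce, since the evaluation, saving, and warmup intervals are all expressed in steps. The arithmetic below is a sketch that reproduces the figures reported in the training log further down (17183 examples, total batch size 16, 1074 steps). The single-GPU setting and the rounding rule are assumptions about how the trainer counts steps.

```python
import math

num_examples = 17183             # training rows reported in the training log
per_device_batch_size = 8
gradient_accumulation_steps = 2
num_devices = 1                  # assumption: a single GPU
num_epochs = 1
eval_steps = 1000
warmup_steps = 500

# Batches produced by the data loader in one epoch (the last one is partial)
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
# One optimizer update per group of accumulated batches
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = updates_per_epoch * num_epochs
effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps

print(effective_batch)                        # 16
print(total_steps)                            # 1074
print(total_steps // eval_steps)              # 1 evaluation, at step 1000
print(round(warmup_steps / total_steps, 2))   # 0.47 of training spent in warmup
```

With only one epoch, almost half of the run is warmup and the model is evaluated once, so longer training or smaller step intervals would give a clearer picture of progress.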

In [21]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
training_args = TrainingArguments(
    output_dir="artifacts",
    group_by_length=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=1,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
)

In [22]:
!wandb disabled

W&B disabled.


In this code cell, we initialize the Trainer with the model, data collator, training arguments, metric function, and the train and validation splits.

Finally, we start the training process using trainer.train(). The trainer takes care of training the model, saving checkpoints, and evaluating the model's performance based on the configured settings.

In [23]:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=data["train"],
    eval_dataset=data["valid"],
    tokenizer=processor.feature_extractor,
  )
trainer.train()

Using amp half precision backend
The following columns in the training set  don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 17183
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 2
  Total optimization steps = 1074
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Wer
1000,0.7664,inf,0.383399


The following columns in the evaluation set  don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1910
  Batch size = 8
Saving model checkpoint to artifacts/checkpoint-1000
Configuration saved in artifacts/checkpoint-1000/config.json
Model weights saved in artifacts/checkpoint-1000/pytorch_model.bin
Feature extractor saved in artifacts/checkpoint-1000/preprocessor_config.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1074, training_loss=0.7883610929634762, metrics={'train_runtime': 3674.3167, 'train_samples_per_second': 4.677, 'train_steps_per_second': 0.292, 'total_flos': 2.8386807963677225e+18, 'train_loss': 0.7883610929634762, 'epoch': 1.0})

In [24]:
val_df

Unnamed: 0,audio_ID,path,sentence
13792,audio_e0a09962,/kaggle/input/swahili-audio-dataset/all_wav/co...,rock hudson alioa mke wake na kutalakiana hapo...
3855,audio_2257a639,/kaggle/input/swahili-audio-dataset/all_wav/co...,mazingira haya ni pamoja na mipangilio ya kuzu...
344,audio_0cbda50b,/kaggle/input/swahili-audio-dataset/all_wav/co...,piga hatua sana ya kimaendeleo afrika mashariki
12092,audio_a07549f6,/kaggle/input/swahili-audio-dataset/all_wav/co...,alikimbia lakini walimkamata na kumlazimisha k...
7882,audio_137dc8b9,/kaggle/input/swahili-audio-dataset/all_wav/co...,daktari muoga anakuja
...,...,...,...
14769,audio_8ff8fb7a,/kaggle/input/swahili-audio-dataset/all_wav/co...,mungu amempenda na amemchukua siku ya amueke m...
14251,audio_0be7e3dd,/kaggle/input/swahili-audio-dataset/all_wav/co...,vile ninavyohisi
15099,audio_b337b4d1,/kaggle/input/swahili-audio-dataset/all_wav/co...,ghana ule aa namuuona yule kiongozi kidogo ame...
14165,audio_57daf8fe,/kaggle/input/swahili-audio-dataset/all_wav/co...,katika hospitali ya cedars of lebanon huko mji...


In [25]:
test

Unnamed: 0,audio_ID,path,sentence
0,audio_3c2f423a,/kaggle/input/swahili-audio-dataset/all_wav/co...,
1,audio_ca8009ec,/kaggle/input/swahili-audio-dataset/all_wav/co...,
2,audio_fffcdb86,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3,audio_0e3e8e6f,/kaggle/input/swahili-audio-dataset/all_wav/co...,
4,audio_859b792d,/kaggle/input/swahili-audio-dataset/all_wav/co...,
...,...,...,...
3162,audio_bdc72517,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3163,audio_50ee5991,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3164,audio_b9867662,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3165,audio_a0739877,/kaggle/input/swahili-audio-dataset/all_wav/co...,


###### Inference

In the following code cells, we run inference on the test set and on a few individual audio files.
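
The inference cells take the argmax of the logits and hand the resulting ids to the processor for decoding. As a rough sketch of what greedy CTC decoding does with those ids, the standalone example below collapses repeated ids and drops the blank token. The tiny vocabulary, the frame scores, and the helper `greedy_ctc_decode` are made up for illustration. The real tokenizer also handles details such as its word-delimiter token.

```python
def greedy_ctc_decode(frame_scores, id_to_char, blank_id=0):
    """Argmax per frame, collapse consecutive repeats, then drop blanks."""
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    chars = []
    prev = None
    for idx in best_ids:
        # Emit a character only when the id changes and is not the blank
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


id_to_char = {0: "<pad>", 1: "w", 2: "a", 3: " "}  # hypothetical vocabulary
frame_scores = [
    [0.1, 0.8, 0.1, 0.0],    # w
    [0.2, 0.7, 0.1, 0.0],    # w (repeat, collapsed)
    [0.9, 0.05, 0.05, 0.0],  # blank
    [0.1, 0.1, 0.8, 0.0],    # a
    [0.1, 0.2, 0.7, 0.0],    # a (repeat, collapsed)
    [0.6, 0.2, 0.2, 0.0],    # blank separates two genuine a's
    [0.1, 0.1, 0.8, 0.0],    # a
]
print(greedy_ctc_decode(frame_scores, id_to_char))  # waa
```

This is also why the labels are decoded with group_tokens=False: they are real text, so repeated letters in them are meaningful and must not be collapsed.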

In [26]:
### BATCH INFERENCE
def map_to_result(batch):
  with torch.no_grad():
    input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
    logits = model(input_values).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_str"] = processor.batch_decode(pred_ids)[0]
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)

  return batch

In [27]:
def single_audio_file_inference(filepath):
  speech_array, sampling_rate = librosa.load(filepath, mono=True, sr=16000)
  speech_num = np.array(speech_array, dtype = np.float32)
  #input_value = processor(speech_num, sampling_rate).input_values[0]
  input_values = torch.tensor(speech_num, device="cuda")#.unsqueeze(0)
  input_dict = processor(input_values, return_tensors="pt", padding=True)
  logits = model(input_dict.input_values.to("cuda")).logits
  pred_ids = torch.argmax(logits, dim=-1)[0]
  return(processor.decode(pred_ids))

In [28]:
train

Unnamed: 0,audio_ID,path,sentence
0,audio_faa7312a,/kaggle/input/swahili-audio-dataset/all_wav/co...,huko kwa wakiroba mkoa wa mara
1,audio_643a10c1,/kaggle/input/swahili-audio-dataset/all_wav/co...,alingaa katika medani za kisiasa na uongozi nd...
2,audio_5b626e74,/kaggle/input/swahili-audio-dataset/all_wav/co...,vitu saba ambavyo kila baba atakuwa.
3,audio_5972c5f3,/kaggle/input/swahili-audio-dataset/all_wav/co...,inaonyesha mawaziri wapya ambao wamechukua naf...
4,audio_deebd5b0,/kaggle/input/swahili-audio-dataset/all_wav/co...,ee hii pia inatumiwa na kiwanda cha
...,...,...,...
19088,audio_020c9902,/kaggle/input/swahili-audio-dataset/all_wav/co...,bado timu zitapata nafasi katika
19089,audio_7de63630,/kaggle/input/swahili-audio-dataset/all_wav/co...,wabunge wapo huru kufanya mikutano ya hadhara ...
19090,audio_a52c777e,/kaggle/input/swahili-audio-dataset/all_wav/co...,aah mimi sio mkaaji sana mzee hata mazungumzo ...
19091,audio_9adccc74,/kaggle/input/swahili-audio-dataset/all_wav/co...,rini na saba hadi kumalizika kwake mwaka wa el...


In [29]:
for i in range(5):
    example_ = "/kaggle/input/swahili-audio-dataset/all_wav/content/all_wav_files/"+train.iloc[i]["audio_ID"] + ".wav"
    print("Prediction: ",single_audio_file_inference(example_))
    print("Truth: ", train.iloc[i]["sentence"])

It is strongly recommended to pass the ``sampling_rate`` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Prediction:  hukwa wakioba mkoa wa mara
Truth:  huko kwa wakiroba mkoa wa mara
Prediction:  alingaa katika medani za kisiasa nwa uongozi ndani ya serikali ya mapinduzi ya zanzibar
Truth:  alingaa katika medani za kisiasa na uongozi ndani ya serikali ya mapinduzi ya zanzibar
Prediction:  vitu saba ambavyo kila baba atakua
Truth:  vitu saba ambavyo kila baba atakuwa.


It is strongly recommended to pass the ``sampling_rate`` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Prediction:  naanesha mawaziri wake ambao wamechuanafasihiyo isita
Truth:  inaonyesha mawaziri wapya ambao wamechukua nafasi hiyo ni sita
Prediction:  e hipia inatumiwa na kiwanda cha
Truth:  ee hii pia inatumiwa na kiwanda cha


In [30]:
%%time
single_audio_file_inference("/kaggle/input/swahili-audio-dataset/all_wav/content/all_wav_files/audio_40342cf4.wav")

It is strongly recommended to pass the ``sampling_rate`` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


CPU times: user 65.5 ms, sys: 0 ns, total: 65.5 ms
Wall time: 88.6 ms


'mwaka elfu mbili na kumi na nane ndege yenye wahudumu wa kike wote iliruka'

In [31]:
%%time
results = data["test"].map(map_to_result, remove_columns=data["test"].column_names)

  0%|          | 0/3167 [00:00<?, ?ex/s]

CPU times: user 6min 15s, sys: 2.38 s, total: 6min 17s
Wall time: 6min 18s


In [32]:
preds = [results[i]['pred_str'] for i in range(len(results))]
text_ = [results[i]['text'] for i in range(len(results))]

In [33]:
results[4]

{'pred_str': 'ni kama hapa akimwambia mtutwenda kuvaua hatatua kwenda kupaua ni nini',
 'text': ''}

In [34]:
ipd.Audio(data= "/kaggle/input/swahili-audio-dataset/all_wav/content/all_wav_files/audio_3c2f423a.wav", autoplay=True, rate=16000)

In [35]:
test

Unnamed: 0,audio_ID,path,sentence
0,audio_3c2f423a,/kaggle/input/swahili-audio-dataset/all_wav/co...,
1,audio_ca8009ec,/kaggle/input/swahili-audio-dataset/all_wav/co...,
2,audio_fffcdb86,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3,audio_0e3e8e6f,/kaggle/input/swahili-audio-dataset/all_wav/co...,
4,audio_859b792d,/kaggle/input/swahili-audio-dataset/all_wav/co...,
...,...,...,...
3162,audio_bdc72517,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3163,audio_50ee5991,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3164,audio_b9867662,/kaggle/input/swahili-audio-dataset/all_wav/co...,
3165,audio_a0739877,/kaggle/input/swahili-audio-dataset/all_wav/co...,


###### Prepare to submit to Zindi

In [36]:
submission = pd.DataFrame(columns = ['audio_ID','pred_str'])

submission["pred_str"] = preds
submission["audio_ID"] = test.audio_ID
submission.head()

Unnamed: 0,audio_ID,pred_str
0,audio_3c2f423a,chuza vyema
1,audio_ca8009ec,wtta wake kudeelea
2,audio_fffcdb86,biashara biasiana
3,audio_0e3e8e6f,hachufukiipirisa lrusho
4,audio_859b792d,ni kama hapa akimwambia mtutwenda kuvaua hatat...


In [37]:
submission.to_csv("data_cleaning and 1 epochs of alam.csv", index=False)

I am always open to helping enthusiasts with difficulties they face in machine learning and deep learning. Feel free to reach out to me, preferably on LinkedIn.

* [Twitter](https://twitter.com/olufemivictort)

* [LinkedIn](https://www.linkedin.com/in/olufemi-victor-tolulope)

* [GitHub](https://github.com/osinkolu)

### Author: Olufemi Victor Tolulope