To deploy an ASR model on mobile devices, you need a smaller, more efficient model. One suitable option is the Wav2Vec2.0 base model or one of its distilled versions. Another lightweight alternative is DeepSpeech.
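
To put "smaller" into numbers, the sketch below estimates the on-disk size of a model from its parameter count. The parameter counts are the approximate figures reported for the Wav2Vec2.0 base and large models (about 95M and 317M); the helper name `estimated_size_mb` is hypothetical, and real files will differ somewhat because of metadata and non-weight tensors.

```python
def estimated_size_mb(num_parameters, bytes_per_parameter):
    """Rough model size in megabytes: parameters times bytes per parameter."""
    return num_parameters * bytes_per_parameter / 1e6

# Approximate parameter counts (assumed from the published model descriptions)
BASE_PARAMS = 95_000_000
LARGE_PARAMS = 317_000_000

base_fp32_mb = estimated_size_mb(BASE_PARAMS, 4)    # 32-bit floats
large_fp32_mb = estimated_size_mb(LARGE_PARAMS, 4)  # 32-bit floats
base_int8_mb = estimated_size_mb(BASE_PARAMS, 1)    # 8-bit weights, if quantized

print(f"base fp32:  {base_fp32_mb:.0f} MB")
print(f"large fp32: {large_fp32_mb:.0f} MB")
print(f"base int8:  {base_int8_mb:.0f} MB")
```

Even the base model is a few hundred megabytes at full precision, which is why the choice of model variant and numeric precision matters for a mobile target.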

# Step-by-Step Guide for Training a Small ASR Model
## Step 1: Install Necessary Libraries

Make sure you have the required libraries installed:

In [None]:
pip install transformers datasets torch soundfile

## Step 2: Load a Smaller Pre-trained Model and Tokenizer
Use a smaller variant of the Wav2Vec2.0 model:

In [None]:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

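# Load a smaller pre-trained model.
# Note: "facebook/wav2vec2-base" is pre-trained on unlabeled English audio only, so it
# must be fine-tuned with a tokenizer whose vocabulary covers your target language.
model_name = "facebook/wav2vec2-base"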
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
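
Fine-tuning for a new language needs a character vocabulary that covers the transcripts. The following is a minimal sketch of how such a vocabulary could be built; the function name `build_ctc_vocab`, the example transcripts, and the `[UNK]`/`[PAD]` token names are assumptions chosen for illustration, not part of the original notebook.

```python
import json

def build_ctc_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id."""
    # Use "|" as the word delimiter instead of a literal space
    text = "".join(transcripts).replace(" ", "|")
    vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
    # Special tokens: unknown character and padding (padding doubles as the CTC blank)
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

# Hypothetical transcripts, for illustration only
example_transcripts = ["안녕하세요", "감사합니다 안녕"]
vocab = build_ctc_vocab(example_transcripts)

# Serialize without escaping non-ASCII characters so the file stays readable
vocab_json = json.dumps(vocab, ensure_ascii=False)
print(vocab_json)
```

The resulting JSON can be written to a `vocab.json` file and passed to `Wav2Vec2CTCTokenizer` with matching `unk_token`, `pad_token`, and `word_delimiter_token` arguments; the model's `vocab_size` then has to match `len(vocab)`.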

## Step 3: Prepare Your Dataset
Ensure your dataset has audio files and corresponding transcripts. Load and preprocess the dataset:

In [None]:
from datasets import load_dataset
import soundfile as sf

# Load your dataset
dataset = load_dataset("path_to_your_dataset")

# Preprocess the dataset
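def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = sf.read(batch["file"])
    # The processor call below assumes 16 kHz audio; resample first if your files differ
    if sampling_rate != 16_000:
        raise ValueError(f"Expected 16 kHz audio, got {sampling_rate} Hz: {batch['file']}")
    batch["speech"] = speech_array
    return batch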

def prepare_dataset(batch):
    batch["input_values"] = processor(batch["speech"], sampling_rate=16_000).input_values[0]
    # Newer transformers releases deprecate as_target_processor() in favor of processor(text=...)
    with processor.as_target_processor():
        batch["labels"] = processor(batch["transcript"]).input_ids
    return batch

dataset = dataset.map(speech_file_to_array_fn)
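# load_dataset without a split returns a DatasetDict, whose column_names is a dict
# keyed by split, so take the column list from one split
dataset = dataset.map(prepare_dataset, remove_columns=dataset["train"].column_names)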
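
Transcripts usually need cleaning before they are tokenized, so that the label vocabulary stays small. Below is a minimal sketch of such a normalization step; the function name `normalize_transcript` and the exact set of stripped punctuation are assumptions and should be adapted to your data.

```python
import re

# Punctuation to strip; this set is an assumption, adjust it to your transcripts
PUNCTUATION = re.compile(r"[,.?!;:\"'“”‘’()\[\]\-]")

def normalize_transcript(text):
    """Remove punctuation, collapse whitespace, and lowercase the text."""
    text = PUNCTUATION.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

print(normalize_transcript("  Hello,  World! "))
print(normalize_transcript("안녕하세요.  반갑습니다!"))
```

A function like this could be applied to the transcript column with `dataset.map` before `prepare_dataset` runs.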

If you don't have your own dataset, try a public one such as Common Voice. Browse the [Hugging Face open ASR datasets](https://huggingface.co/datasets?task_categories=task_categories:automatic-speech-recognition&sort=trending&search=ko) to find one that fits your language.

In [None]:
from datasets import load_dataset
import soundfile as sf

# Load a Korean speech dataset from the Hugging Face Hub
common_voice = load_dataset("JaepaX/korean_dataset", split="train+validation")

# Preprocess the dataset
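def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = sf.read(batch["path"])
    # The processor call below assumes 16 kHz audio; resample first if your files differ
    if sampling_rate != 16_000:
        raise ValueError(f"Expected 16 kHz audio, got {sampling_rate} Hz: {batch['path']}")
    batch["speech"] = speech_array
    return batch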

def prepare_dataset(batch):
    batch["input_values"] = processor(batch["speech"], sampling_rate=16_000).input_values[0]
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(speech_file_to_array_fn)
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names)

## Step 4: Fine-Tune the Model
Set up the training arguments and fine-tune the smaller model:

In [None]:
from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./wav2vec2-small-korean",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",  # renamed to eval_strategy in recent transformers releases
    num_train_epochs=3,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
)

# Define the data collator
from dataclasses import dataclass
from typing import Dict, List, Union
import torch

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor)

# Initialize Trainer
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
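    # Assumes a DatasetDict with "train" and "test" splits; a single split such as
    # common_voice can be divided first with common_voice.train_test_split(test_size=0.1)
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],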
    tokenizer=processor.feature_extractor,
)

# Train the model, then save the final weights to output_dir so they can be reloaded later
trainer.train()
trainer.save_model()
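
The training arguments above interact with the dataset size: batch size and gradient accumulation set the effective batch size, which in turn sets how many optimizer steps the run has. The sketch below works through that arithmetic; the helper name `training_schedule` and the example dataset size of 10,000 clips are assumptions, and the step count is an approximation rather than the exact number the `Trainer` will report.

```python
import math

def training_schedule(num_examples, per_device_batch_size=16,
                      gradient_accumulation_steps=2, num_devices=1, num_epochs=3):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs

# Hypothetical dataset of 10,000 training clips on a single device
effective_batch, total_steps = training_schedule(10_000)
print(f"effective batch size: {effective_batch}")
print(f"approximate optimizer steps: {total_steps}")
```

With a dataset of that size, `warmup_steps=500` would cover more than half of the run and `eval_steps=400` would evaluate only twice, so these values are worth scaling to your own data.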

## Step 5: Evaluate the Model
Evaluate your fine-tuned model on the test dataset:

In [None]:
# Evaluate the model (reports the evaluation loss only, since no compute_metrics
# function was passed to the Trainer)
results = trainer.evaluate()
print(results)
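
ASR quality is usually reported as word error rate (WER) rather than loss. The following is a minimal, dependency-free sketch of WER using edit distance over words; the function name `word_error_rate` is an assumption, and in practice a library metric could be plugged into the `Trainer` through `compute_metrics` instead.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance computed one row at a time
    previous_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    # Guard against an empty reference
    return previous_row[-1] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the hat sat"))
```

The same routine gives a character error rate if the strings are split into characters instead of words, which can be more informative for languages where word boundaries are less clear-cut.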

# Model Optimization for Mobile
To further optimize for mobile deployment, consider converting the model to ONNX format or using TensorFlow Lite:

## Convert to ONNX

In [None]:
import torch
from transformers import Wav2Vec2ForCTC

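# Load the fine-tuned model in inference mode
model = Wav2Vec2ForCTC.from_pretrained("./wav2vec2-small-korean")
model.eval()
model.config.return_dict = False  # return plain tuples, which the exporter traces cleanly

# Export the model to ONNX, allowing variable batch size and audio length
dummy_input = torch.zeros(1, 16000)  # one second of 16 kHz audio
torch.onnx.export(
    model,
    dummy_input,
    "wav2vec2-small-korean.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={"input_values": {0: "batch", 1: "samples"}, "logits": {0: "batch", 1: "frames"}},
)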
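
The exported model outputs per-frame logits, so the mobile app still has to turn them into text. Below is a minimal sketch of greedy CTC decoding applied to the per-frame argmax ids; the function name `ctc_greedy_decode` and the tiny vocabulary are hypothetical, and the sketch assumes the padding token serves as the CTC blank and `|` as the word delimiter.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join the remaining tokens."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()

# Hypothetical vocabulary and frame-level predictions, for illustration only
id_to_token = {0: "[PAD]", 1: "|", 2: "h", 3: "i"}
frame_ids = [2, 2, 0, 3, 3, 1, 2, 0, 3]
decoded = ctc_greedy_decode(frame_ids, id_to_token, blank_id=0)
print(decoded)
```

Because the logic is a single pass over integers, it ports easily to Kotlin or Swift for on-device use.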

This approach will help you create a smaller, more efficient ASR model suitable for mobile devices. Adjust the parameters and dataset paths as needed for your specific use case.