Whisper by OpenAI is a robust ASR model known for its high accuracy. Although it is larger and more resource-intensive than some other models, you can still use it effectively, especially if you optimize it for your specific use case. Here is how to fine-tune and use Whisper for Korean ASR using a publicly available dataset such as Common Voice.

## Train a Whisper model

### Step 1: Install Necessary Libraries
First, install the required libraries:

In [None]:
pip install transformers datasets torch soundfile
pip install git+https://github.com/openai/whisper.git

### Step 2: Load and Preprocess the Common Voice Dataset
Use the datasets library to load and preprocess the Common Voice dataset:

In [None]:
from datasets import load_dataset
import whisper

# Load the Common Voice dataset
common_voice_train = load_dataset("JaepaX/corean_dataset", split="train")
common_voice_test = load_dataset("JaepaX/corean_dataset", split="test")

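# Preprocess the dataset
def preprocess(batch):
    # load_audio decodes the file with ffmpeg and resamples it to 16 kHz mono
    audio = whisper.load_audio(batch["path"])
    # Pad with silence or trim so every clip is exactly 30 seconds long
    batch["audio"] = whisper.pad_or_trim(audio)
    batch["text"] = batch["sentence"]
    return batch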

common_voice_train = common_voice_train.map(preprocess)
common_voice_test = common_voice_test.map(preprocess)
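
To make the padding and trimming step concrete, here is a minimal pure-Python sketch of the idea, assuming Whisper's 30-second window at 16 kHz (480,000 samples). The function name `pad_or_trim_samples` is hypothetical and exists only for illustration; `whisper.pad_or_trim` itself works on arrays.

```python
SAMPLE_RATE = 16_000      # Whisper expects 16 kHz mono audio
WINDOW_SECONDS = 30       # Whisper processes audio in 30-second windows
N_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples

def pad_or_trim_samples(samples, length=N_SAMPLES):
    """Return exactly `length` samples: cut the tail or append zeros (silence)."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))

short_clip = [0.1] * (5 * SAMPLE_RATE)    # 5 seconds of audio
long_clip = [0.1] * (45 * SAMPLE_RATE)    # 45 seconds of audio
print(len(pad_or_trim_samples(short_clip)), len(pad_or_trim_samples(long_clip)))
```

Note that anything after the first 30 seconds of a clip is discarded, so transcripts of longer clips would no longer match their audio.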

### Step 3: Define the Model and Tokenizer
Load the Whisper model and processor:

In [None]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load Whisper model and processor
model_name = "openai/whisper-base"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Adjust the model for fine-tuning
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="ko", task="transcribe")

### Step 4: Prepare Data Loaders
Convert the dataset into PyTorch data loaders for training:

In [None]:
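import torch
from torch.utils.data import DataLoader

# Define collate function
def collate_fn(batch):
    # Log-Mel features for the whole batch, shape (batch_size, 80, 3000)
    audio = [f["audio"] for f in batch]
    input_features = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
    # Tokenize the transcripts, pad to the longest one, and mask padding with -100 so the loss ignores it
    tokenized = processor.tokenizer([f["text"] for f in batch], padding=True, return_tensors="pt")
    labels = tokenized.input_ids.masked_fill(tokenized.attention_mask.ne(1), -100)
    # The model prepends the decoder start token itself, so drop it from the labels
    if (labels[:, 0] == model.config.decoder_start_token_id).all():
        labels = labels[:, 1:]
    return input_features, labels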

# Create data loaders
train_dataloader = DataLoader(common_voice_train, batch_size=8, shuffle=True, collate_fn=collate_fn)
test_dataloader = DataLoader(common_voice_test, batch_size=8, shuffle=False, collate_fn=collate_fn)
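
Transcripts in a batch rarely have the same number of tokens, so the label sequences have to be padded before they can be stacked into one tensor. The sketch below illustrates the usual convention on plain Python lists: padded positions are filled with -100, the index that PyTorch's cross-entropy loss ignores by default. The helper name `pad_labels` and the token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # value that PyTorch's cross-entropy loss skips by default

def pad_labels(label_ids, pad_value=IGNORE_INDEX):
    """Right-pad every token ID list to the length of the longest one."""
    max_len = max(len(ids) for ids in label_ids)
    return [ids + [pad_value] * (max_len - len(ids)) for ids in label_ids]

# Hypothetical token IDs for two transcripts of different lengths
batch = [[11, 12, 13, 14], [21, 22]]
print(pad_labels(batch))
```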

### Step 5: Fine-Tune the Model
Set up the training loop:

In [None]:
import torch
from torch.optim import AdamW
from tqdm import tqdm

# Define optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for batch in tqdm(train_dataloader, desc=f"Training Epoch {epoch+1}/{num_epochs}"):
        input_features, labels = batch
        input_features = input_features.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_features, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    avg_train_loss = train_loss / len(train_dataloader)
    print(f"Epoch {epoch+1} - Training Loss: {avg_train_loss:.4f}")

    # Evaluation loop
    model.eval()
    eval_loss = 0.0
    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc=f"Evaluating Epoch {epoch+1}/{num_epochs}"):
            input_features, labels = batch
            input_features = input_features.to(device)
            labels = labels.to(device)

            outputs = model(input_features, labels=labels)
            loss = outputs.loss
            eval_loss += loss.item()

    avg_eval_loss = eval_loss / len(test_dataloader)
    print(f"Epoch {epoch+1} - Evaluation Loss: {avg_eval_loss:.4f}")
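
The evaluation loss shows whether training is converging, but it does not directly measure transcription quality. A common complement is the character error rate (CER), which is often preferred over word error rate for languages such as Korean. The following is a minimal pure-Python sketch; the function names are hypothetical, and the hypotheses are assumed to be decoded model outputs such as those produced in the inference step.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def character_error_rate(references, hypotheses):
    """Total character edits divided by total reference characters, ignoring spaces."""
    edits = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_chars = ref.replace(" ", "")
        hyp_chars = hyp.replace(" ", "")
        edits += edit_distance(ref_chars, hyp_chars)
        total += len(ref_chars)
    return edits / total if total else 0.0

# One wrong syllable out of five reference characters
print(character_error_rate(["안녕하세요"], ["안녕하세오"]))
```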

### Step 6: Save the Fine-Tuned Model
Save the model and processor for later use:

In [None]:
model.save_pretrained("whisper-corean-asr")
processor.save_pretrained("whisper-corean-asr")

### Step 7: Inference with the Fine-Tuned Model
Use the fine-tuned model for inference:

In [None]:
# Load the fine-tuned model and processor
model = WhisperForConditionalGeneration.from_pretrained("whisper-corean-asr")
processor = WhisperProcessor.from_pretrained("whisper-corean-asr")

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load an example audio file
audio_path = "path_to_audio_file.wav"
audio = whisper.load_audio(audio_path)
audio = whisper.pad_or_trim(audio)

# Preprocess the audio; return_tensors="pt" already yields shape (1, 80, 3000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)

# Generate predictions
with torch.no_grad():
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("Transcription:", transcription[0])
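
Because `whisper.pad_or_trim` keeps only the first 30 seconds, longer recordings have to be split before they are transcribed. The sketch below computes simple non-overlapping chunk boundaries; the function name `chunk_boundaries` is hypothetical, and this naive split can cut a word at a boundary, so treat it as a starting point rather than a complete long-form strategy.

```python
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30

def chunk_boundaries(n_samples, chunk_len=SAMPLE_RATE * CHUNK_SECONDS):
    """Return (start, end) sample indices covering the audio in fixed-size chunks."""
    return [(start, min(start + chunk_len, n_samples))
            for start in range(0, n_samples, chunk_len)]

# A 70-second recording splits into 30 s + 30 s + 10 s
print(chunk_boundaries(70 * SAMPLE_RATE))
```

Each `(start, end)` slice of the audio array can then be padded, converted to features, and passed to `model.generate` as in the cell above, with the partial transcriptions joined in order.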

This guide outlines the process of fine-tuning Whisper for Korean ASR using the Common Voice dataset. Adjust the parameters and paths as needed for your specific use case and dataset. Whisper, though resource-intensive, can deliver high accuracy for ASR tasks.

## Convert a pretrained Whisper model into ONNX format
### Install Required Libraries
Ensure you have the necessary libraries installed:

In [None]:
pip install onnx onnxruntime

### Prepare the Model for Export
You need a small wrapper module so that the exporter sees a forward pass with plain tensor inputs and a single tensor output. Whisper is an encoder-decoder model, so the forward pass needs decoder token IDs in addition to the audio features.

In [None]:
import torch

# Wrapper that exposes only the logits so the exported graph has one output
class WhisperLogits(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_features, decoder_input_ids):
        outputs = self.model(
            input_features=input_features,
            decoder_input_ids=decoder_input_ids,
            use_cache=False,
        )
        return outputs.logits

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
export_model = WhisperLogits(model)

### Define a Dummy Input
Create dummy inputs that match the input signature of the wrapper:

In [None]:
# Log-Mel features: (batch_size, n_mels, n_frames); Whisper always expects 3000 frames (30 s of audio)
dummy_input = torch.randn(1, 80, 3000).to(device)
# Decoder prompt: (batch_size, decoder_sequence_length)
dummy_decoder_ids = torch.full(
    (1, 4), model.config.decoder_start_token_id, dtype=torch.long, device=device
)

### Export the Model to ONNX
Use torch.onnx.export to convert the model:

In [None]:
# Export the model to ONNX
onnx_model_path = "whisper_corean_asr.onnx"
torch.onnx.export(
    export_model,
    (dummy_input, dummy_decoder_ids),
    onnx_model_path,
    input_names=["input_features", "decoder_input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_features": {0: "batch_size"},  # the frame axis stays fixed at 3000
        "decoder_input_ids": {0: "batch_size", 1: "decoder_sequence_length"},
        "logits": {0: "batch_size", 1: "decoder_sequence_length"},
    },
    # Assumption: opset 14 or later, since newer attention ops may not export at opset 11
    opset_version=14,
)

print(f"Model successfully exported to {onnx_model_path}")

### Verify the ONNX Model
Load the ONNX model and verify it using onnxruntime:

In [None]:
import onnxruntime as ort

# Load the ONNX model
onnx_model = ort.InferenceSession(onnx_model_path, providers=["CPUExecutionProvider"])

# Verify the model by running an inference
onnx_inputs = {
    "input_features": dummy_input.cpu().numpy(),
    "decoder_input_ids": dummy_decoder_ids.cpu().numpy(),
}
onnx_outputs = onnx_model.run(None, onnx_inputs)

print("ONNX model output shape:", onnx_outputs[0].shape)

This code converts your fine-tuned Whisper model into ONNX format and verifies the conversion by running an inference. This ONNX model can now be optimized further and used for deployment on various platforms, including mobile devices.
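
An exported graph of this kind returns logits for a single forward pass, not a finished transcript, so producing text still requires an autoregressive decoding loop around it. The sketch below shows greedy decoding in plain Python. `next_token_logits` is a hypothetical callable that, in practice, would wrap a call to the ONNX session and return the scores for the next token; here a toy stand-in is used so the loop can be run on its own.

```python
def greedy_decode(next_token_logits, prompt_ids, eos_id, max_new_tokens=50):
    """Repeatedly append the highest-scoring token until EOS or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)   # scores over the vocabulary for the next token
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy stand-in for the model: vocabulary of 5 tokens, always prefers "last token + 1"
def toy_logits(ids):
    scores = [0.0] * 5
    scores[min(ids[-1] + 1, 4)] = 1.0
    return scores

print(greedy_decode(toy_logits, prompt_ids=[0], eos_id=4))
```

Greedy decoding is the simplest option; beam search or the sampling strategies offered by `model.generate` would need additional logic on top of this loop.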

### Notes
Dynamic Axes: The dynamic_axes parameter marks dimensions that may change between calls, such as the batch size. The audio feature length itself stays fixed, because Whisper pads or trims every input to 30 seconds (3000 frames); longer recordings need to be split into 30-second chunks before inference.

Optimization: After exporting to ONNX, consider using tools like ONNX Runtime or TensorRT to optimize the model for better performance on your target deployment platform.