# 🎙️ Vedda ASR Model Training - Google Colab

Fine-tune OpenAI's Whisper speech recognition model for the Vedda language on Google Colab, using its free GPU tier.

**⚠️ IMPORTANT:** Before running this notebook:
1. Click "Runtime" → "Change runtime type" → Select "GPU (T4)" → "Save"
2. Have your Vedda audio files ready (WAV format, 16 kHz recommended)
3. Create a folder `vedda-asr-dataset` in your Google Drive containing an `audio/` subfolder with your recordings and a `transcriptions.json` file
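
The dataset preparation step in this notebook expects `transcriptions.json` to map each audio file name (without its extension) to the transcript. A minimal sketch of writing such a file follows; the helper name `write_transcriptions`, the two entries, and the output path are placeholders for illustration, not part of the notebook itself.

```python
import json
from pathlib import Path

def write_transcriptions(mapping, out_path):
    """Write {file_stem: transcript} as UTF-8 JSON, keeping non-ASCII text readable."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False stores Sinhala script as-is instead of \uXXXX escapes
        json.dump(mapping, f, ensure_ascii=False, indent=2)
    return out_path

# Placeholder entries: each key is an audio file name without the extension.
example = {
    "vedda_001": "හෙලෝ",
    "vedda_002": "ආයුබෝවන්",
}
path = write_transcriptions(example, "vedda-asr-dataset/transcriptions.json")
print(f"Wrote {len(example)} transcriptions to {path}")
```

The resulting file can then be uploaded to the `vedda-asr-dataset` folder in Google Drive next to the audio files.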

## 📋 Step 1: Set Up Environment

Check that a GPU is available, then install the required dependencies.

In [None]:
# Check GPU availability
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: GPU not available. Training will be very slow!")
    print("   Go to Runtime → Change runtime type → Select GPU")

In [None]:
# Install required packages
print("🔧 Installing dependencies...")
!pip install -q openai-whisper transformers datasets evaluate jiwer librosa soundfile scikit-learn tqdm tensorboard accelerate
print("✅ Installation complete!")

## 📂 Step 2: Mount Google Drive

Connect your Google Drive to access your Vedda audio files.

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')
print("✅ Google Drive mounted!")

# Check if dataset folder exists
dataset_dir = '/content/drive/MyDrive/vedda-asr-dataset'
if os.path.exists(dataset_dir):
    print(f"✅ Dataset found at: {dataset_dir}")
    print("\n📁 Contents:")
    for item in os.listdir(dataset_dir):
        print(f"   - {item}")
else:
    print(f"⚠️ Dataset folder not found at: {dataset_dir}")
    print("\n📋 Please create the folder in Google Drive and upload your audio files:")
    print("   1. Go to Google Drive")
    print("   2. Create folder 'vedda-asr-dataset'")
    print("   3. Upload audio files (WAV format) into an 'audio' subfolder")
    print("   4. Create 'transcriptions.json' with transcriptions")

## 📝 Step 3: Prepare Dataset

Clean the audio files (resampling, noise reduction, silence trimming), then split them into training and test sets.
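
Audio files without a matching key in `transcriptions.json` are skipped during preparation, so it can help to check the pairing up front. The sketch below is one way to do that on plain lists of names; `check_coverage` and the sample file names are hypothetical and only illustrate the matching rule (file name without extension equals the JSON key).

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac"}

def check_coverage(audio_names, transcription_keys):
    """Return (audio stems lacking a transcript, transcript keys lacking audio)."""
    stems = {
        Path(name).stem
        for name in audio_names
        if Path(name).suffix.lower() in AUDIO_EXTENSIONS
    }
    keys = set(transcription_keys)
    return sorted(stems - keys), sorted(keys - stems)

# Hypothetical folder listing and JSON keys
missing_text, missing_audio = check_coverage(
    ["vedda_001.wav", "vedda_002.wav", "vedda_003.WAV", "notes.txt"],
    ["vedda_001", "vedda_002", "vedda_004"],
)
print("Audio without transcription:", missing_text)
print("Transcription without audio:", missing_audio)
```

In Colab, the first argument could come from `os.listdir` on the `audio` folder and the second from the keys of the loaded JSON.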

In [None]:
import json
import os
from pathlib import Path

# Install noisereduce if not already installed
print("🔧 Installing audio processing tools...")
!pip install -q noisereduce
import noisereduce as nr

# Create project structure
project_dir = '/content/vedda-asr'
data_dir = os.path.join(project_dir, 'data')
processed_dir = os.path.join(data_dir, 'processed')
models_dir = os.path.join(project_dir, 'models')

os.makedirs(processed_dir, exist_ok=True)
os.makedirs(models_dir, exist_ok=True)

print("✅ Project structure created:")
print(f"   {project_dir}/")
print("   ├── data/")
print("   │   └── processed/")
print("   └── models/")

In [None]:
import librosa
import soundfile as sf
import numpy as np
from tqdm import tqdm
import noisereduce as nr

def prepare_dataset():
    """
    Prepare dataset from audio files in Google Drive with noise cancellation

    Expected structure in Google Drive (vedda-asr-dataset/):
    - audio/
      - vedda_001.wav
      - vedda_002.wav
      - ...
    - transcriptions.json (format: {"vedda_001": "හෙලෝ", "vedda_002": "ආයුබෝවන්", ...})
    """

    dataset_dir = '/content/drive/MyDrive/vedda-asr-dataset'
    audio_dir = os.path.join(dataset_dir, 'audio')
    transcriptions_file = os.path.join(dataset_dir, 'transcriptions.json')

    # Check if files exist
    if not os.path.exists(audio_dir):
        print(f"❌ Audio directory not found: {audio_dir}")
        print("\n📋 Create this structure in Google Drive:")
        print("   vedda-asr-dataset/")
        print("   ├── audio/")
        print("   │   ├── vedda_001.wav")
        print("   │   ├── vedda_002.wav")
        print("   │   └── ...")
        print("   └── transcriptions.json")
        return None

    if not os.path.exists(transcriptions_file):
        print(f"❌ Transcriptions file not found: {transcriptions_file}")
        print("\n📋 Create transcriptions.json with format:")
        print('   {"vedda_001": "හෙලෝ", "vedda_002": "ආයුබෝවන්"}')
        return None

    # Load transcriptions
    with open(transcriptions_file, 'r', encoding='utf-8') as f:
        transcriptions = json.load(f)

    print(f"📂 Found {len(transcriptions)} transcriptions")

    # Process audio files
    processed_entries = []
    audio_files = [f for f in os.listdir(audio_dir) if f.lower().endswith(('.wav', '.mp3', '.ogg', '.flac'))]

    print(f"🔧 Processing {len(audio_files)} audio files with noise cancellation...")

    for audio_file in tqdm(audio_files):
        audio_path = os.path.join(audio_dir, audio_file)
        file_id = Path(audio_file).stem

        # Get transcription
        if file_id not in transcriptions:
            tqdm.write(f"⚠️  Skipping {audio_file} (no transcription)")
            continue

        transcription = transcriptions[file_id]

        try:
            # Load audio
            audio, sr = librosa.load(audio_path, sr=None, mono=True)

            # Resample to 16kHz
            audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

            # ===== NOISE CANCELLATION =====
            tqdm.write(f"  🔇 Removing noise from {audio_file}...")
            audio = nr.reduce_noise(
                y=audio,
                sr=16000,
                stationary=True,          # Remove stationary background noise
                prop_decrease=1.0,        # Aggressiveness (0-1)
                chunk_size=600000,        # Memory-efficient chunking
                padding=30000,
                freq_mask_smooth_hz=500,  # Smooth frequency mask
                time_mask_smooth_ms=50    # Smooth time mask
            )

            # Normalize
            audio = librosa.util.normalize(audio)

            # Trim silence
            audio, _ = librosa.effects.trim(audio, top_db=25)

            # Pre-emphasis
            audio = librosa.effects.preemphasis(audio)

            # Final normalize
            audio = librosa.util.normalize(audio)

            # Drop clips that are too short to be useful after trimming
            duration = len(audio) / 16000
            if duration < 0.5:
                tqdm.write(f"⚠️  Skipping {audio_file} (shorter than 0.5 s after trimming)")
                continue

            # Save processed audio as WAV, whatever the input format was
            output_path = os.path.join(processed_dir, f"{file_id}.wav")
            sf.write(output_path, audio, 16000)

            processed_entries.append({
                'audio_path': output_path,
                'transcription': transcription,
                'duration': duration,
                'noise_cancelled': True
            })

        except Exception as e:
            tqdm.write(f"⚠️  Error processing {audio_file}: {e}")
            continue

    print(f"\n✅ Processed {len(processed_entries)} audio files (with noise cancellation)")

    total_duration = sum(e['duration'] for e in processed_entries)
    print(f"⏱️  Total duration: {total_duration/60:.1f} minutes")

    if total_duration < 600:  # Less than 10 minutes
        print("⚠️  WARNING: Dataset is small (under 10 minutes; at least 1 hour recommended)")

    return processed_entries

# Prepare dataset
dataset = prepare_dataset()

if dataset is None:
    print("\n⚠️  Please set up your dataset first!")

In [None]:
# Split dataset into train/test
if dataset:
    from sklearn.model_selection import train_test_split

    train_data, test_data = train_test_split(dataset, test_size=0.1, random_state=42)

    print("📊 Dataset split:")
    print(f"   Train: {len(train_data)} samples ({sum(e['duration'] for e in train_data)/60:.1f} min)")
    print(f"   Test:  {len(test_data)} samples ({sum(e['duration'] for e in test_data)/60:.1f} min)")

    # Save datasets
    train_file = os.path.join(data_dir, 'train_dataset.json')
    test_file = os.path.join(data_dir, 'test_dataset.json')

    with open(train_file, 'w', encoding='utf-8') as f:
        json.dump({'data': train_data}, f, ensure_ascii=False)

    with open(test_file, 'w', encoding='utf-8') as f:
        json.dump({'data': test_data}, f, ensure_ascii=False)

    print("\n✅ Datasets saved")

## 🎓 Step 4: Train Whisper Model

Fine-tune OpenAI Whisper on your Vedda data.
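
Transcripts in a batch have different lengths, so their token IDs are padded to a common length and the padded positions are excluded from the loss. The data collator in this step does this by replacing padding with -100, the label value that PyTorch's cross-entropy loss ignores by default. The following dependency-free sketch shows the idea only; `pad_labels` and the token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def pad_labels(label_seqs, ignore_index=IGNORE_INDEX):
    """Right-pad each label sequence to the batch maximum with ignore_index."""
    max_len = max(len(seq) for seq in label_seqs)
    return [list(seq) + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]

# Made-up token IDs for three transcripts of different lengths
batch = pad_labels([[11, 12, 13, 14], [21, 22], [31]])
for row in batch:
    print(row)
```

Every row ends up with four entries, and only the real token positions contribute to the training loss.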

In [None]:
import torch  # used by the collator's type hints and the fp16 check below
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
from datasets import Dataset, Audio
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

print("✅ Data structures ready")

In [None]:
# Load model and processor
print("Loading Whisper-small model...")

model_name = "openai/whisper-small"
# Whisper has no dedicated Vedda language token, so the Sinhala one is used here
processor = WhisperProcessor.from_pretrained(
    model_name,
    language="Sinhala",
    task="transcribe"
)

model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Correct way to modify generation settings in newer transformers versions
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = []

print(f"✅ Model loaded: {model_name}")
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")

In [None]:
# Load datasets
print("📂 Loading datasets...")

# Define paths (in case Step 3 cells weren't executed sequentially)
data_dir = '/content/vedda-asr/data'
train_file = os.path.join(data_dir, 'train_dataset.json')
test_file = os.path.join(data_dir, 'test_dataset.json')

# Check if dataset files exist
if not os.path.exists(train_file) or not os.path.exists(test_file):
    print("❌ Error: Dataset files not found!")
    print(f"   Train file: {train_file} ({'✅' if os.path.exists(train_file) else '❌'})")
    print(f"   Test file: {test_file} ({'✅' if os.path.exists(test_file) else '❌'})")
    print("\n📋 Please run Step 3 cells in order:")
    print("   1. 'Prepare dataset' cell")
    print("   2. 'Split dataset into train/test' cell")
else:
    with open(train_file, 'r', encoding='utf-8') as f:
        train_json = json.load(f)
    with open(test_file, 'r', encoding='utf-8') as f:
        test_json = json.load(f)

    train_dataset = Dataset.from_dict({
        'audio': [x['audio_path'] for x in train_json['data']],
        'transcription': [x['transcription'] for x in train_json['data']]
    })

    test_dataset = Dataset.from_dict({
        'audio': [x['audio_path'] for x in test_json['data']],
        'transcription': [x['transcription'] for x in test_json['data']]
    })

    train_dataset = train_dataset.cast_column("audio", Audio(sampling_rate=16000))
    test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))

    print(f"✅ Train samples: {len(train_dataset)}")
    print(f"✅ Test samples: {len(test_dataset)}")

In [None]:
# Prepare datasets
print("🔧 Preparing datasets for training...")

# Named differently from Step 3's prepare_dataset() so that cell can still be re-run
def prepare_example(batch):
    audio = batch["audio"]
    # Convert the waveform to log-Mel input features
    input_features = processor.feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenize the transcript into label IDs
    labels = processor.tokenizer(batch["transcription"]).input_ids
    return {
        "input_features": input_features,
        "labels": labels
    }

train_dataset_proc = train_dataset.map(
    prepare_example,
    remove_columns=train_dataset.column_names
)
test_dataset_proc = test_dataset.map(
    prepare_example,
    remove_columns=test_dataset.column_names
)

print("✅ Datasets prepared")

In [None]:
# Training arguments
output_dir = os.path.join(models_dir, 'whisper-vedda')

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=50,
    num_train_epochs=10,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=100,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    save_total_limit=3,
    fp16=torch.cuda.is_available(),
    report_to=["tensorboard"],
    predict_with_generate=True,
    generation_max_length=225,
)

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id
)

print("✅ Training configuration ready")
print(f"   Output: {output_dir}")
print(f"   Epochs: {training_args.num_train_epochs:g}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")

In [None]:
# Create trainer
from evaluate import load

wer_metric = load("wer")  # load once instead of on every evaluation

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # Restore padding tokens where the collator masked labels with -100
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_proc,
    eval_dataset=test_dataset_proc,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print("✅ Trainer ready")

In [None]:
# Train model
print(f"\n{'='*60}")
print("🚀 STARTING TRAINING")
print(f"{'='*60}\n")

trainer.train()

print(f"\n{'='*60}")
print("✅ TRAINING COMPLETE")
print(f"{'='*60}")

## 📊 Step 5: Evaluate Model
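
Word Error Rate is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The `wer` metric used in this notebook reports that ratio as a fraction, which is why the evaluation cell multiplies it by 100. The sketch below is a simplified illustration of the calculation, not the library's implementation; `word_error_rate` and the English sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one missing word ("the"): 2 errors / 6 words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer * 100:.1f}%")
```

Because the denominator is the reference length, a prediction with many extra words can produce a WER above 100%.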

In [None]:
# Evaluate on test set
print("📊 Evaluating on test set...\n")

results = trainer.evaluate(test_dataset_proc)

print(f"\n{'='*60}")
print("📈 EVALUATION RESULTS")
print(f"{'='*60}")
print(f"\nWord Error Rate (WER): {results['eval_wer']*100:.2f}%")

if results['eval_wer'] < 0.10:
    quality = "🌟 Excellent (Production-ready)"
elif results['eval_wer'] < 0.20:
    quality = "✅ Good (Usable)"
elif results['eval_wer'] < 0.30:
    quality = "⚠️  Fair (Needs more data)"
else:
    quality = "❌ Poor (Collect more data)"

print(f"Quality: {quality}")
print(f"{'='*60}\n")

## 💾 Step 6: Save and Download Model
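
Files stored under `/content` are lost when the Colab runtime is recycled, so it may be worth copying the packaged model to the mounted Drive once the ZIP exists. This is a sketch under stated assumptions: `backup_file` is a hypothetical helper, and the Drive destination folder in the commented call is an example path, not something the notebook creates.

```python
import shutil
from pathlib import Path

def backup_file(src, dest_dir):
    """Copy src into dest_dir (created if missing) and return the new path."""
    src, dest_dir = Path(src), Path(dest_dir)
    if not src.is_file():
        raise FileNotFoundError(f"Nothing to back up at {src}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir / src.name))

# In Colab, after the ZIP has been created (destination folder is an assumption):
# backup_file('/content/vedda-asr/vedda-asr-model.zip',
#             '/content/drive/MyDrive/vedda-asr-models')
```

The copy on Drive then remains available even if the browser download is interrupted.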

In [None]:
# Save final model
final_dir = os.path.join(models_dir, 'whisper-vedda-final')
os.makedirs(final_dir, exist_ok=True)

print("💾 Saving final model...")
model.save_pretrained(final_dir)
processor.save_pretrained(final_dir)

print(f"✅ Model saved to: {final_dir}")

In [None]:
# Create ZIP file for download
import shutil

print("Creating download package...")
zip_path = os.path.join(project_dir, 'vedda-asr-model')
shutil.make_archive(zip_path, 'zip', final_dir)

zip_file = zip_path + '.zip'
size_mb = os.path.getsize(zip_file) / 1e6

print("\n✅ Model packaged for download!")
print("   File: vedda-asr-model.zip")
print(f"   Size: {size_mb:.1f} MB")

print("\nThe file is available in Colab's file browser (left panel)")
print("   Right-click → Download")

## 🧪 Step 7: Test Trained Model (Optional)
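
An exact string comparison between reference and prediction can fail for reasons unrelated to recognition quality, such as leading spaces in the decoded output or differing Unicode normalization forms. A minimal sketch of a more forgiving comparison follows; `normalize_text` and `transcripts_match` are hypothetical helpers, and NFC normalization is an assumption about what suits your transcripts.

```python
import unicodedata

def normalize_text(text):
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def transcripts_match(reference, prediction):
    """Compare two transcripts after normalization."""
    return normalize_text(reference) == normalize_text(prediction)

# The same word with stray whitespace around the prediction
print(transcripts_match("ආයුබෝවන්", "  ආයුබෝවන් "))
```

Applying the same normalization to both sides keeps the check symmetric; it does not hide genuine word differences.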

In [None]:
# Test model on a sample audio
import librosa

print("🎤 Testing trained model...\n")

# Get first test sample
test_sample = test_json['data'][0]
audio_path = test_sample['audio_path']
reference = test_sample['transcription']

# Load audio
audio, sr = librosa.load(audio_path, sr=16000, mono=True)

# Process with processor
input_features = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt"
).input_features

# Move the inputs to the model's device (the GPU after training)
input_features = input_features.to(model.device)

# Generate prediction
model.eval()
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Strip surrounding whitespace so it does not affect the comparison
prediction = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0].strip()

print("📝 Sample Test Result:")
print(f"\n   Reference:  {reference}")
print(f"   Predicted:  {prediction}")
print(f"   Match: {'✅ YES' if reference.strip() == prediction else '❌ NO'}")

## 📋 Summary

### What you learned:
1. ✅ Set up GPU environment in Colab
2. ✅ Loaded Vedda audio data from Google Drive
3. ✅ Prepared dataset for training
4. ✅ Fine-tuned Whisper model on Vedda speech
5. ✅ Evaluated model performance (WER)
6. ✅ Saved and packaged trained model

### Next steps:
1. Download the trained model (vedda-asr-model.zip)
2. Extract it locally
3. Integrate with your speech service
4. For better accuracy: Collect more data and retrain

### Performance improvements:
- **More data:** aim for at least 1 hour of transcribed audio
- **More epochs:** 10 → 15 or 20
- **Larger model:** small → medium
- **Lower learning rate:** 1e-5 → 5e-6
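
When changing the number of epochs, it helps to know how many optimizer steps a run actually takes, since `warmup_steps`, `eval_steps`, and `save_steps` are all counted in steps. A rough sketch of the arithmetic, assuming a single GPU and `gradient_accumulation_steps=1` as in this notebook; `training_schedule` is a hypothetical helper and the 200-clip training set is a made-up example.

```python
import math

def training_schedule(num_samples, batch_size, epochs):
    """Return (optimizer steps per epoch, total optimizer steps)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Hypothetical training set of 200 clips with this notebook's batch size and epochs
per_epoch, total = training_schedule(num_samples=200, batch_size=8, epochs=10)
print(f"{per_epoch} steps per epoch, {total} steps in total")
print(f"Warmup share with warmup_steps=50: {50 / total:.0%}")
print(f"Evaluations with eval_steps=50: {total // 50}")
```

With very small datasets, a fixed `warmup_steps=50` can cover a large share of training, so the step-based settings may need to shrink along with the data.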

---

**Happy training! 🚀**