A minimal repository for educational purposes that implements a complete Automatic Speech Recognition (ASR) pipeline using Connectionist Temporal Classification (CTC). The toolkit lets you train CTC models on a single GPU/CPU or in a distributed fashion, with support for decoding and evaluation.
- CTC Model: Bidirectional LSTM-based CTC model for speech recognition
- Training: Single GPU/CPU and distributed multi-GPU training
- Decoding: Greedy CTC decoding with optional language model integration
- Evaluation: Word Error Rate (WER) computation
- Data Processing: LibriSpeech dataset support with mel-spectrogram features
- Augmentation: SpecAugment for improved model robustness
- Monitoring: Weights & Biases (wandb) integration for experiment tracking
- Mixed Precision: Automatic Mixed Precision (AMP) support for faster training
- Clone the repository:

```bash
git clone <repository-url>
cd asr_toolkit
```

- Create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install additional audio libraries (for FLAC support):

```bash
# On macOS
brew install flac

# On Ubuntu
sudo apt-get install flac

# Or install soundfile
pip install soundfile
```
Single GPU/CPU training:

```bash
python train_ctc.py \
  --data_root /path/to/LibriSpeech \
  --train_subset train-clean-100 \
  --valid_subset dev-clean \
  --epochs 10 \
  --batch_size 32 \
  --lr 1e-3 \
  --device cuda \
  --wandb_project ctc-minilab \
  --wandb_run my-experiment
```
Distributed multi-GPU training (4 GPUs shown):

```bash
torchrun --nproc_per_node=4 train_distributed.py \
  --data_root /path/to/LibriSpeech \
  --train_subset train-clean-100 \
  --valid_subset dev-clean \
  --epochs 10 \
  --batch_size 8 \
  --lr 1e-3 \
  --backend nccl \
  --wandb_project ctc-minilab \
  --wandb_run distributed-experiment
```
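Under the hood, `torchrun` starts one process per GPU and exposes `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` as environment variables; `train_distributed.py` presumably initializes a process group and wraps the model in `DistributedDataParallel`. A minimal sketch of that setup (the script's actual internals may differ):

```python
# Minimal DDP setup sketch for a script launched with torchrun.
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module, backend: str = "nccl") -> DDP:
    dist.init_process_group(backend=backend)     # reads the env vars torchrun sets
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    # Gradients are all-reduced across processes on backward().
    return DDP(model, device_ids=[local_rank])
```

Note that with `--batch_size 8` per process across 4 GPUs, the effective global batch size is 32, matching the single-GPU example.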
Decoding and evaluation with greedy CTC decoding:

```bash
python decode_ctc.py \
  --checkpoint_path ./exp_ctc_bilstm/best.pt \
  --data_root /path/to/LibriSpeech \
  --subset test-clean \
  --output_file results.txt
```
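Greedy CTC decoding picks the most likely token at every frame, collapses consecutive repeats, and removes blanks. A minimal sketch, assuming `log_probs` has shape `(time, vocab_size)` and blank index 0:

```python
# Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks.
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    best_path = log_probs.argmax(dim=-1).tolist()  # best label per frame
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:       # collapse repeats, skip blanks
            decoded.append(label)
        prev = label
    return decoded
```

The decoded label IDs would then be mapped back to characters by the tokenizer in `tokenizer.py`.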
Decoding with language model integration:

```bash
python decode_ctc.py \
  --checkpoint_path ./exp_ctc_bilstm/best.pt \
  --data_root /path/to/LibriSpeech \
  --subset test-clean \
  --lm_path /path/to/language_model.arpa \
  --output_file results_with_lm.txt
```

This toolkit uses the LibriSpeech dataset, which is automatically downloaded and processed:
- train-clean-100: 100 hours of clean speech for training
- dev-clean: Development set for validation
- test-clean: Test set for evaluation
The dataset is processed to extract 80-dimensional mel-spectrogram features with log-magnitude scaling.
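For reference, here is a sketch of that feature pipeline using torchaudio, with parameter values taken from the CLI defaults listed below (`features.py` may differ in details such as the log floor):

```python
# Sketch of the mel-spectrogram feature pipeline described above.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # LibriSpeech is 16 kHz
    n_fft=400,          # 25 ms window at 16 kHz
    hop_length=160,     # 10 ms hop
    n_mels=80,
)

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    spec = mel(waveform)           # (channels, n_mels, frames)
    return torch.log(spec + 1e-6)  # log-magnitude scaling
```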
The CTC model consists of:
- Feature Extractor: Mel-spectrogram with log-magnitude scaling
- Encoder: Bidirectional LSTM with 256 hidden units
- Projection: Linear layer mapping to vocabulary size
- CTC Loss: Connectionist Temporal Classification loss
- SpecAugment: Frequency and time masking for improved robustness
- CMVN: Cepstral Mean and Variance Normalization
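A minimal sketch of this architecture in PyTorch (the layer count and the 29-symbol character vocabulary, 26 letters plus space, apostrophe, and blank, are illustrative assumptions; see `model_ctc.py` for the actual definition):

```python
# Sketch of the BiLSTM CTC model described above.
import torch
import torch.nn as nn

class BiLSTMCTC(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab_size: int = 29):
        super().__init__()
        self.encoder = nn.LSTM(
            input_size=n_mels,
            hidden_size=hidden,
            num_layers=3,  # assumed depth
            bidirectional=True,
            batch_first=True,
        )
        self.proj = nn.Linear(2 * hidden, vocab_size)  # 2x for bidirectional

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        out, _ = self.encoder(feats)       # (batch, frames, 2 * hidden)
        logits = self.proj(out)            # (batch, frames, vocab_size)
        return logits.log_softmax(dim=-1)  # log-probs, as CTC loss expects
```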
Training features:
- AdamW Optimizer: With gradient clipping
- Learning Rate Scaling: Automatic scaling for distributed training
- Mixed Precision: AMP support for faster training on modern GPUs
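A sketch of how one training step might combine these pieces, reusing the `BiLSTMCTC` sketch above (the clipping threshold and the exact loop in `train_ctc.py` are assumptions):

```python
# One training step with AdamW, gradient clipping, and AMP.
# Input tensors are assumed to already live on the GPU.
import torch

model = BiLSTMCTC().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")
ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def train_step(feats, targets, feat_lens, target_lens):
    optimizer.zero_grad()
    with torch.amp.autocast(device_type="cuda"):
        log_probs = model(feats).transpose(0, 1)  # CTCLoss expects (T, N, V)
        loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # clip on unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # assumed threshold
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```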
Monitoring:
- Weights & Biases: Experiment tracking and visualization
- Gradient Monitoring: Gradient norm tracking
- Loss Tracking: Training and validation loss monitoring
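A sketch of how the `--wandb_project` and `--wandb_run` flags might map onto wandb calls (the metric names and loss values here are placeholders):

```python
# Illustrative wandb logging; real code would log actual training metrics.
import wandb

run = wandb.init(project="ctc-minilab", name="my-experiment")
for epoch in range(10):
    train_loss = 0.5 / (epoch + 1)  # placeholder for the real training loss
    valid_loss = 0.6 / (epoch + 1)  # placeholder for the real validation loss
    wandb.log({"train/loss": train_loss, "valid/loss": valid_loss, "epoch": epoch})
run.finish()
```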
The toolkit reports the following evaluation metrics:
- Word Error Rate (WER): Primary metric for ASR evaluation
- Character Error Rate (CER): Character-level accuracy
- Decoding Speed: Inference time measurement
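WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch (`decode_ctc.py` may rely on a library implementation instead):

```python
# Minimal WER: Levenshtein distance over words / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```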
```
asr_toolkit/
├── train_ctc.py          # Single GPU/CPU training
├── train_distributed.py  # Distributed training
├── decode_ctc.py         # Decoding and evaluation
├── model_ctc.py          # CTC model definition
├── data.py               # Dataset and data loading
├── features.py           # Feature extraction and augmentation
├── tokenizer.py          # Character tokenizer
├── utils.py              # Utility functions
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
Training options:
- `--data_root`: Path to LibriSpeech dataset
- `--train_subset`: Training subset (e.g., train-clean-100)
- `--valid_subset`: Validation subset (e.g., dev-clean)
- `--epochs`: Number of training epochs
- `--batch_size`: Batch size
- `--lr`: Learning rate
- `--device`: Device (cuda/cpu)
- `--num_workers`: Number of data loading workers

Model and feature options:
- `--hidden`: LSTM hidden size (default: 256)
- `--n_mels`: Number of mel filters (default: 80)
- `--n_fft`: FFT size (default: 400)
- `--hop_length`: Hop length (default: 160)
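For orientation, a sketch of how such flags are typically declared with argparse (names mirror the list above; required/default choices beyond the four documented defaults are assumptions):

```python
# Illustrative argparse wiring for the flags above; the actual scripts may differ.
import argparse

parser = argparse.ArgumentParser(description="CTC ASR training")
parser.add_argument("--data_root", type=str, required=True)
parser.add_argument("--train_subset", type=str, default="train-clean-100")
parser.add_argument("--valid_subset", type=str, default="dev-clean")
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--device", type=str, default="cuda")
parser.add_argument("--num_workers", type=int, default=0)
parser.add_argument("--hidden", type=int, default=256)      # documented default
parser.add_argument("--n_mels", type=int, default=80)       # documented default
parser.add_argument("--n_fft", type=int, default=400)       # documented default
parser.add_argument("--hop_length", type=int, default=160)  # documented default
args = parser.parse_args()
```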
- FLAC Backend Error: Install FLAC support or use `num_workers=0`
- MPS CTC Loss Error: Set `PYTORCH_ENABLE_MPS_FALLBACK=1` for Apple Silicon
- Segmentation Fault: Reduce `num_workers` or use `num_workers=0`
- CUDA Out of Memory: Reduce `batch_size` or use gradient accumulation (see the sketch below)
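Gradient accumulation runs several small batches before each optimizer step, trading speed for memory. A sketch of the standard pattern, reusing names from the training-step sketch above (the toolkit itself may not ship this):

```python
# Average gradients over `accum_steps` mini-batches per optimizer step.
accum_steps = 4  # effective batch size = batch_size * accum_steps

optimizer.zero_grad()
for step, (feats, targets, feat_lens, target_lens) in enumerate(loader):
    log_probs = model(feats).transpose(0, 1)  # (T, N, V) for CTCLoss
    loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    (loss / accum_steps).backward()           # scale so gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```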
- Use `pin_memory=True` for faster GPU data transfer
- Enable `torch.backends.cudnn.benchmark = True` when input sizes are consistent
- Use mixed precision training for faster training on modern GPUs
- Adjust `num_workers` based on your system (start with 0, then increase); see the DataLoader sketch after this list
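A DataLoader configured along these lines (the random tensors are a stand-in for the real LibriSpeech dataset from `data.py`):

```python
# DataLoader settings reflecting the performance tips above.
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.backends.cudnn.benchmark = True  # autotune kernels for fixed input sizes

dataset = TensorDataset(
    torch.randn(100, 200, 80),        # 100 utterances of 200 frames x 80 mels
    torch.randint(0, 29, (100, 50)),  # padded character-label sequences
)

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,     # start with 0, then increase if your system is stable
    pin_memory=True,   # faster host-to-GPU copies
)
```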
This repository is designed for educational purposes to understand:
- CTC loss and its applications in ASR
- Distributed training with PyTorch
- Speech feature extraction and preprocessing
- Model evaluation and decoding strategies
- End-to-end ASR pipeline implementation
This project is for educational purposes. Please check the original LibriSpeech dataset license for commercial use.
This is an educational repository. Feel free to fork and modify for your learning purposes.
- LibriSpeech dataset
- PyTorch team for the excellent framework
- Weights & Biases for experiment tracking