A minimal repository for educational purposes that implements a complete Automatic Speech Recognition (ASR) pipeline using Connectionist Temporal Classification (CTC). The toolkit lets you train CTC models on a single GPU/CPU or in a distributed fashion, with support for decoding and evaluation.
- CTC Model: Bidirectional LSTM-based CTC model for speech recognition
- Training: Single GPU/CPU and distributed multi-GPU training
- Decoding: Greedy CTC decoding with optional language model integration
- Evaluation: Word Error Rate (WER) computation
- Data Processing: LibriSpeech dataset support with mel-spectrogram features
- Augmentation: SpecAugment for improved model robustness
- Monitoring: Weights & Biases (wandb) integration for experiment tracking
- Mixed Precision: Automatic Mixed Precision (AMP) support for faster training
- Clone the repository:

```bash
git clone <repository-url>
cd asr_toolkit
```

- Create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install additional audio libraries (for FLAC support):

```bash
# On macOS
brew install flac

# On Ubuntu
sudo apt-get install flac

# Or install soundfile
pip install soundfile
```
Single GPU/CPU training:

```bash
python train_ctc.py \
  --data_root /path/to/LibriSpeech \
  --train_subset train-clean-100 \
  --valid_subset dev-clean \
  --epochs 10 \
  --batch_size 32 \
  --lr 1e-3 \
  --device cuda \
  --wandb_project ctc-minilab \
  --wandb_run my-experiment
```
Distributed multi-GPU training (4 GPUs shown):

```bash
torchrun --nproc_per_node=4 train_distributed.py \
  --data_root /path/to/LibriSpeech \
  --train_subset train-clean-100 \
  --valid_subset dev-clean \
  --epochs 10 \
  --batch_size 8 \
  --lr 1e-3 \
  --backend nccl \
  --wandb_project ctc-minilab \
  --wandb_run distributed-experiment
```
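Under the hood, `torchrun` starts one process per GPU and exposes `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` as environment variables; `train_distributed.py` presumably initializes a process group and wraps the model in `DistributedDataParallel`. A minimal sketch of that setup (the script's actual internals may differ):

```python
# Minimal DDP setup sketch for a script launched with torchrun.
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module, backend: str = "nccl") -> DDP:
    dist.init_process_group(backend=backend)     # reads the env vars torchrun sets
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    # Gradients are all-reduced across processes on backward().
    return DDP(model, device_ids=[local_rank])
```

Note that with `--batch_size 8` per process across 4 GPUs, the effective global batch size is 32, matching the single-GPU example.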
Decoding and evaluation with greedy CTC decoding:

```bash
python decode_ctc.py \
  --checkpoint_path ./exp_ctc_bilstm/best.pt \
  --data_root /path/to/LibriSpeech \
  --subset test-clean \
  --output_file results.txt
```
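Greedy CTC decoding picks the most likely token at every frame, collapses consecutive repeats, and removes blanks. A minimal sketch, assuming `log_probs` has shape `(time, vocab_size)` and blank index 0:

```python
# Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks.
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    best_path = log_probs.argmax(dim=-1).tolist()  # best label per frame
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:       # collapse repeats, skip blanks
            decoded.append(label)
        prev = label
    return decoded
```

The decoded label IDs would then be mapped back to characters by the tokenizer in `tokenizer.py`.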
Decoding with language model integration:

```bash
python decode_ctc.py \
  --checkpoint_path ./exp_ctc_bilstm/best.pt \
  --data_root /path/to/LibriSpeech \
  --subset test-clean \
  --lm_path /path/to/language_model.arpa \
  --output_file results_with_lm.txt
```

This toolkit uses the LibriSpeech dataset, which is automatically downloaded and processed:
- train-clean-100: 100 hours of clean speech for training
- dev-clean: Development set for validation
- test-clean: Test set for evaluation
The dataset is processed to extract 80-dimensional mel-spectrogram features with log-magnitude scaling.
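For reference, here is a sketch of that feature pipeline using torchaudio, with parameter values taken from the CLI defaults listed below (`features.py` may differ in details such as the log floor):

```python
# Sketch of the mel-spectrogram feature pipeline described above.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # LibriSpeech is 16 kHz
    n_fft=400,          # 25 ms window at 16 kHz
    hop_length=160,     # 10 ms hop
    n_mels=80,
)

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    spec = mel(waveform)           # (channels, n_mels, frames)
    return torch.log(spec + 1e-6)  # log-magnitude scaling
```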
The CTC model consists of:
- Feature Extractor: Mel-spectrogram with log-magnitude scaling
- Encoder: Bidirectional LSTM with 256 hidden units
- Projection: Linear layer mapping to vocabulary size
- CTC Loss: Connectionist Temporal Classification loss
- SpecAugment: Frequency and time masking for improved robustness
- CMVN: Cepstral Mean and Variance Normalization
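A minimal sketch of this architecture in PyTorch (the layer count and the 29-symbol character vocabulary, 26 letters plus space, apostrophe, and blank, are illustrative assumptions; see `model_ctc.py` for the actual definition):

```python
# Sketch of the BiLSTM CTC model described above.
import torch
import torch.nn as nn

class BiLSTMCTC(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab_size: int = 29):
        super().__init__()
        self.encoder = nn.LSTM(
            input_size=n_mels,
            hidden_size=hidden,
            num_layers=3,  # assumed depth
            bidirectional=True,
            batch_first=True,
        )
        self.proj = nn.Linear(2 * hidden, vocab_size)  # 2x for bidirectional

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        out, _ = self.encoder(feats)       # (batch, frames, 2 * hidden)
        logits = self.proj(out)            # (batch, frames, vocab_size)
        return logits.log_softmax(dim=-1)  # log-probs, as CTC loss expects
```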
Training features:
- AdamW Optimizer: With gradient clipping
- Learning Rate Scaling: Automatic scaling for distributed training
- Mixed Precision: AMP support for faster training on modern GPUs
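A sketch of how one training step might combine these pieces, reusing the `BiLSTMCTC` sketch above (the clipping threshold and the exact loop in `train_ctc.py` are assumptions):

```python
# One training step with AdamW, gradient clipping, and AMP.
# Input tensors are assumed to already live on the GPU.
import torch

model = BiLSTMCTC().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")
ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def train_step(feats, targets, feat_lens, target_lens):
    optimizer.zero_grad()
    with torch.amp.autocast(device_type="cuda"):
        log_probs = model(feats).transpose(0, 1)  # CTCLoss expects (T, N, V)
        loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # clip on unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # assumed threshold
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```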
Monitoring:
- Weights & Biases: Experiment tracking and visualization
- Gradient Monitoring: Gradient norm tracking
- Loss Tracking: Training and validation loss monitoring
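A sketch of how the `--wandb_project` and `--wandb_run` flags might map onto wandb calls (the metric names and loss values here are placeholders):

```python
# Illustrative wandb logging; real code would log actual training metrics.
import wandb

run = wandb.init(project="ctc-minilab", name="my-experiment")
for epoch in range(10):
    train_loss = 0.5 / (epoch + 1)  # placeholder for the real training loss
    valid_loss = 0.6 / (epoch + 1)  # placeholder for the real validation loss
    wandb.log({"train/loss": train_loss, "valid/loss": valid_loss, "epoch": epoch})
run.finish()
```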
The toolkit reports the following evaluation metrics:
- Word Error Rate (WER): Primary metric for ASR evaluation
- Character Error Rate (CER): Character-level accuracy
- Decoding Speed: Inference time measurement
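WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch (`decode_ctc.py` may rely on a library implementation instead):

```python
# Minimal WER: Levenshtein distance over words / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```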
```
asr_toolkit/
├── train_ctc.py          # Single GPU/CPU training
├── train_distributed.py  # Distributed training
├── decode_ctc.py         # Decoding and evaluation
├── model_ctc.py          # CTC model definition
├── data.py               # Dataset and data loading
├── features.py           # Feature extraction and augmentation
├── tokenizer.py          # Character tokenizer
├── utils.py              # Utility functions
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
Training options:
- `--data_root`: Path to LibriSpeech dataset
- `--train_subset`: Training subset (e.g., train-clean-100)
- `--valid_subset`: Validation subset (e.g., dev-clean)
- `--epochs`: Number of training epochs
- `--batch_size`: Batch size
- `--lr`: Learning rate
- `--device`: Device (cuda/cpu)
- `--num_workers`: Number of data loading workers

Model and feature options:
- `--hidden`: LSTM hidden size (default: 256)
- `--n_mels`: Number of mel filters (default: 80)
- `--n_fft`: FFT size (default: 400)
- `--hop_length`: Hop length (default: 160)
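For orientation, a sketch of how such flags are typically declared with argparse (names mirror the list above; required/default choices beyond the four documented defaults are assumptions):

```python
# Illustrative argparse wiring for the flags above; the actual scripts may differ.
import argparse

parser = argparse.ArgumentParser(description="CTC ASR training")
parser.add_argument("--data_root", type=str, required=True)
parser.add_argument("--train_subset", type=str, default="train-clean-100")
parser.add_argument("--valid_subset", type=str, default="dev-clean")
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--device", type=str, default="cuda")
parser.add_argument("--num_workers", type=int, default=0)
parser.add_argument("--hidden", type=int, default=256)      # documented default
parser.add_argument("--n_mels", type=int, default=80)       # documented default
parser.add_argument("--n_fft", type=int, default=400)       # documented default
parser.add_argument("--hop_length", type=int, default=160)  # documented default
args = parser.parse_args()
```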
- FLAC Backend Error: Install FLAC support or use `num_workers=0`
- MPS CTC Loss Error: Set `PYTORCH_ENABLE_MPS_FALLBACK=1` for Apple Silicon
- Segmentation Fault: Reduce `num_workers` or use `num_workers=0`
- CUDA Out of Memory: Reduce `batch_size` or use gradient accumulation (see the sketch below)
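Gradient accumulation runs several small batches before each optimizer step, trading speed for memory. A sketch of the standard pattern, reusing names from the training-step sketch above (the toolkit itself may not ship this):

```python
# Average gradients over `accum_steps` mini-batches per optimizer step.
accum_steps = 4  # effective batch size = batch_size * accum_steps

optimizer.zero_grad()
for step, (feats, targets, feat_lens, target_lens) in enumerate(loader):
    log_probs = model(feats).transpose(0, 1)  # (T, N, V) for CTCLoss
    loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    (loss / accum_steps).backward()           # scale so gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```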
- Use `pin_memory=True` for faster GPU data transfer
- Enable `torch.backends.cudnn.benchmark = True` when input sizes are consistent
- Use mixed precision training for faster training on modern GPUs
- Adjust `num_workers` based on your system (start with 0, then increase); see the DataLoader sketch after this list
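A DataLoader configured along these lines (the random tensors are a stand-in for the real LibriSpeech dataset from `data.py`):

```python
# DataLoader settings reflecting the performance tips above.
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.backends.cudnn.benchmark = True  # autotune kernels for fixed input sizes

dataset = TensorDataset(
    torch.randn(100, 200, 80),        # 100 utterances of 200 frames x 80 mels
    torch.randint(0, 29, (100, 50)),  # padded character-label sequences
)

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,     # start with 0, then increase if your system is stable
    pin_memory=True,   # faster host-to-GPU copies
)
```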
This repository is designed for educational purposes to understand:
- CTC loss and its applications in ASR
- Distributed training with PyTorch
- Speech feature extraction and preprocessing
- Model evaluation and decoding strategies
- End-to-end ASR pipeline implementation
This project is for educational purposes. Please check the original LibriSpeech dataset license for commercial use.
This is an educational repository. Feel free to fork and modify for your learning purposes.
- LibriSpeech dataset
- PyTorch team for the excellent framework
- Weights & Biases for experiment tracking