arpabo

Build ARPA format statistical language models with multiple smoothing methods.

Requires Python 3.9+. MIT licensed.

Features

Core Features

  • Multiple smoothing methods (Good-Turing, Kneser-Ney, Katz backoff)
  • Support for arbitrary n-gram orders
  • Standard ARPA format output
  • Binary format conversion (PocketSphinx, Kaldi)
  • Corpus normalization tool
  • Interactive debug mode
  • Zero runtime dependencies (pure Python)

New: First-Pass ASR Optimization

  • Multi-order training - Train multiple n-gram orders efficiently (--orders 1-4)
  • Perplexity evaluation - Evaluate model quality on test data (--eval test.txt)
  • Model statistics - Analyze backoff rates and model behavior (--stats --backoff test.txt)
  • Presets - Pre-configured settings for common use cases (--preset first-pass)
  • Smoothing comparison - Automatically compare smoothing methods (--compare-smoothing)
  • Vocabulary pruning - Reduce model size for mobile (--prune-vocab topk:10000)
  • ModelComparison API - High-level Python API for complete workflows
  • Uniform baseline - Maximum entropy models for comparison
  • Cross-validation - K-fold CV for robust model selection
  • Model interpolation - Alternative probability mixing strategy

Installation

pip install arpabo

This installs two commands:

  • arpabo - Build language models
  • arpabo-normalize - Normalize text corpora

Quick Start

# Quick demo
arpabo --demo -o model.arpa

# Build from your corpus
arpabo corpus.txt -o model.arpa

# With binary conversion
arpabo corpus.txt -o model.arpa --to-bin

# Two-stage: normalize then build
arpabo-normalize corpus.txt -o normalized.txt -c lower -n
arpabo normalized.txt -o model.arpa

Python API

Basic Usage

from arpabo import ArpaBoLM

# Build a language model
lm = ArpaBoLM(max_order=3, smoothing_method="good_turing")
with open("corpus.txt") as f:
    lm.read_corpus(f)
lm.compute()
lm.write_file("model.arpa")

Use Presets (New!)

# No need to pick parameters - use a preset!
lm = ArpaBoLM.from_preset("first-pass")  # For first-pass ASR
with open("corpus.txt") as f:
    lm.read_corpus(f)
lm.compute()
lm.write_file("model.arpa")

High-Level API (New!)

from arpabo import ModelComparison

# Complete optimization workflow
comparison = ModelComparison(corpus_file="train.txt")
comparison.train_orders([1, 2, 3, 4])
comparison.add_uniform_baseline()
comparison.evaluate(test_file="test.txt")
comparison.print_comparison()

# Get recommendation
best = comparison.recommend(goal="first-pass")
print(f"Best model: {best}-gram")

# Export for deployment
comparison.export_for_optimization("experiments/", convert_to_binary=True)

Evaluation & Analysis (New!)

# Evaluate model quality
with open("test.txt") as f:
    results = lm.perplexity(f)
print(f"Perplexity: {results['perplexity']:.1f}")

# Analyze backoff behavior
with open("test.txt") as f:
    backoff = lm.backoff_rate(f)
print(f"Backoff rate: {backoff['overall_backoff_rate']*100:.1f}%")

# Get model statistics
stats = lm.get_statistics()
print(f"Vocabulary: {stats['vocab_size']:,} words")

Smoothing Methods

  • good_turing (default) - Best for sparse data
  • kneser_ney - Best for larger corpora
  • auto - Automatically optimizes discount mass
  • fixed - Fixed discount mass (use -d 0.0 for MLE)
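
To compare methods programmatically rather than with --compare-smoothing, the documented ArpaBoLM constructor and perplexity() call are enough. A minimal hand-rolled sketch (corpus.txt and test.txt are placeholder file names, and it assumes perplexity() can be called on any computed model, as in the Evaluation section above):

from arpabo import ArpaBoLM

# Train one model per smoothing method and score each on held-out text.
for method in ["good_turing", "kneser_ney"]:
    lm = ArpaBoLM(max_order=3, smoothing_method=method)
    with open("corpus.txt") as f:
        lm.read_corpus(f)
    lm.compute()
    with open("test.txt") as f:
        results = lm.perplexity(f)
    print(f"{method}: perplexity {results['perplexity']:.1f}")  # lower is better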

Common Workflows

Basic Usage

# Simple model
arpabo corpus.txt -o model.arpa

# Use a preset (easiest!)
arpabo corpus.txt -o model.arpa --preset balanced

# List available presets
arpabo --list-presets

Multi-Order Training (New!)

# Train multiple orders efficiently
arpabo corpus.txt -o models/ --orders 1-4 --to-bin

# Creates: 1gram.arpa, 2gram.arpa, 3gram.arpa, 4gram.arpa (+ .lm.bin files)

Model Evaluation (New!)

# Train and evaluate
arpabo corpus.txt -o model.arpa --eval test.txt

# Evaluate existing model
arpabo --eval-only model.arpa test.txt

# With statistics and backoff analysis
arpabo corpus.txt -o model.arpa --stats --backoff test.txt

Advanced: Compare & Optimize (New!)

# Compare smoothing methods
arpabo corpus.txt --compare-smoothing --eval test.txt

# Prune for mobile deployment
arpabo corpus.txt -o mobile.arpa --prune-vocab topk:10000 --to-bin

Traditional Options

# 4-gram with Kneser-Ney smoothing
arpabo corpus.txt -o model.arpa -m 4 -s kneser_ney

# Lowercase normalization
arpabo corpus.txt -o model.arpa -c lower -v

# Token normalization (strip punctuation)
arpabo corpus.txt -o model.arpa -n

Corpus Preprocessing

# Normalize separately
arpabo-normalize corpus.txt -o clean.txt -c lower -n

# Build model
arpabo clean.txt -o model.arpa

# Or pipeline
cat corpus.txt | arpabo-normalize -c lower -n | arpabo -o model.arpa

Binary Conversion (Optional)

ARPA files work directly with PocketSphinx; converting to binary is optional and mainly speeds up model loading:

# Use ARPA directly (works as-is)
arpabo corpus.txt -o model.arpa

# Optional: Convert to binary for faster loading
arpabo corpus.txt -o model.arpa --to-bin

# Optional: Kaldi FST format
arpabo corpus.txt -o model.arpa --to-fst

# Or convert manually later
pocketsphinx_lm_convert -i model.arpa -o model.lm.bin

Compatibility

Produces standard ARPA format models that work directly with:

  • PocketSphinx - Use ARPA directly (optional binary conversion for speed)
  • Kaldi - Use ARPA directly or convert to FST
  • SphinxTrain - Use ARPA directly
  • NVIDIA Riva - ARPA format supported
  • Julius, HTK - ARPA compatible

Binary conversion is optional and only improves loading speed.
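
To make "use ARPA directly" concrete, here is a minimal decoding sketch. It assumes the pocketsphinx 5 pip package, whose Decoder accepts config options as keyword arguments, and a 16 kHz 16-bit mono PCM file named utterance.raw; swap in the model.arpa or model.lm.bin produced by the runs above:

from pocketsphinx import Decoder

# lm= points the decoder at the model built by arpabo; the acoustic
# model and dictionary fall back to PocketSphinx's bundled defaults.
decoder = Decoder(lm="model.lm.bin")

with open("utterance.raw", "rb") as f:
    decoder.start_utt()
    decoder.process_raw(f.read(), full_utt=True)
    decoder.end_utt()

if decoder.hyp() is not None:
    print("Hypothesis:", decoder.hyp().hypstr)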

Documentation

Feature Summaries

See PHASE_1_COMPLETE.md, PHASE_2_COMPLETE.md, and PHASE_3_COMPLETE.md for detailed feature documentation.

Development

git clone https://github.com/lenzo-ka/arpabo.git
cd arpabo
make venv
source venv/bin/activate
make test

See CONTRIBUTING.md for details.

License

MIT
