mozilla/fx-client-lang-detection

fx-client-lang-detection

Benchmarks several language detection libraries against a common dataset to evaluate accuracy and latency at varying input lengths.

Libraries tested: cld2, fastText, lingua, langdetect, whatlang, franc.
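As a rough illustration of what "accuracy and latency at varying input lengths" can mean, the sketch below truncates each sample to several lengths and records the hit rate and mean call time per length. The bucket sizes, the `toy_detect` stand-in detector, and the sample sentences are all assumptions made for this example; they are not taken from the repository's runners or dataset.

```python
import time
from collections import defaultdict

# Hypothetical length buckets (in characters); the repository's real buckets may differ.
BUCKETS = (10, 20, 50)

def toy_detect(text):
    """Stand-in detector for illustration only: 'de' if an umlaut or ß appears, else 'en'."""
    return "de" if any(c in "äöüß" for c in text.lower()) else "en"

def benchmark(detect, samples, buckets=BUCKETS):
    """Return {bucket: (accuracy, mean_latency_seconds)} for one detector."""
    correct = defaultdict(int)
    total = defaultdict(int)
    elapsed = defaultdict(float)
    for text, expected in samples:
        for n in buckets:
            snippet = text[:n]  # truncate the input to the bucket length
            start = time.perf_counter()
            predicted = detect(snippet)
            elapsed[n] += time.perf_counter() - start
            total[n] += 1
            correct[n] += predicted == expected
    return {n: (correct[n] / total[n], elapsed[n] / total[n]) for n in buckets}

# Made-up samples, only to exercise the harness.
samples = [
    ("The weather is lovely today, is it not?", "en"),
    ("Ich möchte heute Abend ins Kino gehen.", "de"),
    ("Das Wetter ist heute wirklich schön.", "de"),
]
results = benchmark(toy_detect, samples)
for n, (acc, lat) in results.items():
    print(f"<= {n:>3} chars  accuracy={acc:.2f}  mean latency={lat * 1e6:.1f} µs")
```

With the toy detector, short prefixes of the third sample contain no umlaut and are misclassified, which is the kind of length effect the benchmark is meant to surface. Any real library can be swapped in by passing a function that maps a string to a language code.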

Prerequisites

  • Python 3
  • Rust + Cargo
  • Node.js

Setup

Python virtualenv:

python3 -m venv .venv
.venv/bin/pip install -r requirements.txt

Node dependencies (for franc):

cd runners/run_franc && npm install && cd ../..

fastText model (downloaded once, ~900KB; lid.176.ftz is the compressed variant of the model):

curl -Lo data/raw/lid.176.ftz https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

Running

1. Build the dataset (downloads ~1GB Tatoeba sentences.csv on first run):

.venv/bin/python dataset/build_dataset.py

2. Run all benchmarks and aggregate results:

./run_all.sh

This runs each library against data/ground_truth.csv, writes per-library CSVs to results/, then prints comparison tables for accuracy by length bucket, per-language accuracy, and inference latency.
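The aggregation step can be pictured with a small sketch like the one below, which computes per-language accuracy from per-library result rows and prints a side-by-side table. The column names (`lang`, `predicted`), the library labels `lib_a` and `lib_b`, and the inline rows are hypothetical placeholders; the actual CSV schema written to results/ may differ, and the numbers here are not benchmark results.

```python
import csv
import io
from collections import defaultdict

def accuracy_by_language(rows):
    """Map each ground-truth language to the fraction of rows predicted correctly."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        counts[row["lang"]] += 1
        hits[row["lang"]] += row["predicted"] == row["lang"]
    return {lang: hits[lang] / counts[lang] for lang in counts}

# Inline stand-ins for two per-library result files; column names are assumed.
FAKE_RESULTS = {
    "lib_a": "lang,predicted\nen,en\nen,en\nde,de\nde,nl\n",
    "lib_b": "lang,predicted\nen,en\nen,fr\nde,de\nde,de\n",
}

table = {
    name: accuracy_by_language(csv.DictReader(io.StringIO(text)))
    for name, text in FAKE_RESULTS.items()
}

# Print one row per language and one column per library.
languages = sorted({lang for scores in table.values() for lang in scores})
print("lang  " + "  ".join(f"{name:>6}" for name in table))
for lang in languages:
    print(f"{lang:<6}" + "  ".join(f"{table[name][lang]:>6.2f}" for name in table))
```

To try this against real output, replace the `io.StringIO` stand-ins with opened files from results/ and adjust the column names to match.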
