Benchmarks several language detection libraries against a common dataset to evaluate accuracy and latency at varying input lengths.
Libraries tested: cld2, fastText, lingua, langdetect, whatlang, franc.
Prerequisites:

- Python 3
- Rust + Cargo
- Node.js
Python virtualenv:

```sh
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```

Node dependencies (for franc):

```sh
cd runners/run_franc && npm install && cd ../..
```

fastText model (downloaded once, ~1MB):

```sh
curl -Lo data/raw/lid.176.ftz https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
```

1. Build the dataset (downloads the ~1GB Tatoeba sentences.csv on first run):

```sh
python dataset/build_dataset.py
```

2. Run all benchmarks and aggregate results:

```sh
./run_all.sh
```

This runs each library against data/ground_truth.csv, writes per-library result CSVs to results/, and then prints comparison tables: accuracy by length bucket, per-language accuracy, and inference latency.
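The accuracy-by-length-bucket aggregation can be sketched roughly as below. This is a minimal illustration, not the repo's actual code: the bucket edges and the column names (`text`, `expected`, `predicted`) are assumptions, since the schema of the per-library CSVs is not documented here.

```python
from collections import defaultdict

# Hypothetical length buckets (character counts); the real edges used by
# run_all.sh may differ.
BUCKETS = [(1, 20), (21, 50), (51, 100), (101, 10_000)]

def bucket_label(n: int) -> str:
    """Map a character count to its length-bucket label, e.g. 11 -> '1-20'."""
    for lo, hi in BUCKETS:
        if lo <= n <= hi:
            return f"{lo}-{hi}"
    return "other"

def accuracy_by_bucket(rows):
    """Compute per-bucket accuracy from result rows.

    rows: iterable of dicts with 'text', 'expected', 'predicted' keys
    (assumed column names, not necessarily the repo's actual schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in rows:
        b = bucket_label(len(r["text"]))
        total[b] += 1
        correct[b] += r["expected"] == r["predicted"]
    return {b: correct[b] / total[b] for b in total}
```

In practice the rows would come from `csv.DictReader` over each file in results/, joined against data/ground_truth.csv on a shared sentence id.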
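For the latency table, per-call inference time can be measured along these lines; a hedged sketch, assuming each runner exposes some `detect(text)` callable (the name is illustrative, not an API from any of the tested libraries):

```python
import time

def mean_latency_ms(detect, texts):
    """Return the mean per-call latency of detect() in milliseconds,
    averaged over all input texts."""
    start = time.perf_counter()
    for t in texts:
        detect(t)
    elapsed = time.perf_counter() - start
    return elapsed / len(texts) * 1000
```

Averaging over the whole dataset in one timed loop, rather than timing each call individually, keeps clock-read overhead from dominating the very fast detectors.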