Natural language detection library for Rust

Whatlang


Natural language detection for Rust with a focus on simplicity and performance.

Features

  • Supports 84 languages
  • 100% written in Rust
  • Lightweight, fast and simple
  • Recognizes not only the language but also the script (Latin, Cyrillic, etc.)
  • Provides reliability information

Get started

Add to your Cargo.toml:

[dependencies]
whatlang = "0.6.0"

Example:

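// Required on the Rust 2015 edition; can be omitted on the 2018 edition.
extern crate whatlang;

use whatlang::{detect, Lang, Script};

fn main() {
    let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu! Estas unu de la plej bonaj aferoj!";

    // detect returns an Option, so unwrap panics if nothing is detected.
    let info = detect(text).unwrap();
    // The detected language and the script the text is written in.
    assert_eq!(info.lang(), Lang::Epo);
    assert_eq!(info.script(), Script::Latin);
    // Reliability information for the detection.
    assert_eq!(info.confidence(), 1.0);
    assert!(info.is_reliable());
}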

For more details (e.g. how to blacklist some languages), please check the documentation.

Requirements

The latest version of the whatlang library works with Rust 1.31.0 or higher.

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, which are a particular case of n-grams. To understand the idea, see the original paper by Cavnar and Trenkle, "N-Gram-Based Text Categorization" (1994).
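To make the trigram idea concrete, here is a minimal sketch in the spirit of the Cavnar and Trenkle paper. It is not whatlang's actual implementation: the function names (trigram_counts, ranked_profile, out_of_place_distance), the tiny in-line "language profiles", and the profile size of 300 are all assumptions made for illustration. It counts overlapping character trigrams, ranks them by frequency, and compares a text profile against a language profile using an "out-of-place" rank distance.

```rust
use std::collections::HashMap;

/// Counts overlapping character trigrams in the lowercased text.
fn trigram_counts(text: &str) -> HashMap<String, usize> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    let mut counts = HashMap::new();
    for window in chars.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }
    counts
}

/// Returns up to `max_len` trigrams, most frequent first.
/// Ties are broken alphabetically so the result is deterministic.
fn ranked_profile(text: &str, max_len: usize) -> Vec<String> {
    let mut items: Vec<(String, usize)> = trigram_counts(text).into_iter().collect();
    items.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    items.into_iter().take(max_len).map(|(t, _)| t).collect()
}

/// "Out-of-place" distance: the sum of rank differences between the two
/// profiles. A trigram missing from the language profile gets the maximum
/// penalty (the length of the language profile).
fn out_of_place_distance(text_profile: &[String], lang_profile: &[String]) -> usize {
    let max_penalty = lang_profile.len();
    text_profile
        .iter()
        .enumerate()
        .map(|(rank, trigram)| {
            match lang_profile.iter().position(|t| t == trigram) {
                Some(lang_rank) => rank.abs_diff(lang_rank),
                None => max_penalty,
            }
        })
        .sum()
}

fn main() {
    // Toy profiles built from single sentences; real profiles would be
    // built from large corpora.
    let english = ranked_profile("the quick brown fox jumps over the lazy dog and the other dog", 300);
    let german = ranked_profile("der schnelle braune fuchs springt über den faulen hund", 300);
    let sample = ranked_profile("the dog and the fox", 300);

    // A smaller distance means the sample looks more like that language.
    println!("distance to English: {}", out_of_place_distance(&sample, &english));
    println!("distance to German:  {}", out_of_place_distance(&sample, &german));
}
```

The language whose profile gives the smallest distance would be the best guess under this scheme. The linear position lookup keeps the sketch short; a map from trigram to rank would be the obvious choice for larger profiles.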

How is is_reliable calculated?

It is based on the following factors:

  • How many unique trigrams are in the given text
  • How big the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

Therefore, it can be represented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola and looks like this:

[Figure: plot of the hyperbolic threshold separating the "Reliable" and "Not reliable" areas]
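As a rough illustration of such a threshold, the sketch below treats the number of unique trigrams as the x axis and rate as the y axis, and calls a point reliable when it lies above a hyperbola of the form a / x + b. The function names (threshold, is_reliable_sketch), the constants A and B, and the choice that points above the curve are the reliable ones are assumptions for illustration only; they are not taken from whatlang's code.

```rust
/// Hypothetical hyperbola parameters, chosen only for illustration.
const A: f64 = 2.0;
const B: f64 = 0.05;

/// Minimum rate required for a text with the given number of unique
/// trigrams. The more trigrams there are, the lower the required rate.
fn threshold(unique_trigrams: usize) -> f64 {
    A / unique_trigrams as f64 + B
}

/// Returns true when the point (unique_trigrams, rate) lies above the
/// hyperbola, i.e. in the "Reliable" area of this sketch.
fn is_reliable_sketch(unique_trigrams: usize, rate: f64) -> bool {
    // With no trigrams there is nothing to base a decision on.
    if unique_trigrams == 0 {
        return false;
    }
    rate > threshold(unique_trigrams)
}

fn main() {
    // Short text: a large gap between the top two languages is required.
    println!("10 trigrams, rate 0.10:  {}", is_reliable_sketch(10, 0.10));
    // Longer text: the same gap is enough to clear the lower threshold.
    println!("100 trigrams, rate 0.10: {}", is_reliable_sketch(100, 0.10));
}
```

The shape captures the intuition behind the two factors: a short text with few unique trigrams needs a clear winner, while a long text can be trusted even when the top two candidates score closer together.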

For more details, see the blog article Introduction to Rust Whatlang Library and Natural Language Identification Algorithms.

Running benchmarks

This is mostly useful to test performance optimizations.

cargo bench

Ports and clones

Derivation

Whatlang is a derivative work of Franc (JavaScript, MIT) by Titus Wormer.

License

MIT © Sergey Potapov

Contributors

  • greyblake Potapov Sergey - creator, maintainer.
  • Dr-Emann Zachary Dremann - optimization and improvements