# The `langdetect` and `langid` Libraries

This bonus chapter covers two Python libraries for language identification. Neither is affiliated with NLTK.

## 1. `langdetect`

[`langdetect`](https://pypi.org/project/langdetect/) supports 55 languages out of the box ([ISO 639-1 codes](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes)):

> af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw

In [1]:
from langdetect import detect, detect_langs

In [2]:
detect('I am a noob.')

'so'

In [3]:
detect('Eisai gia ta panigiria.')

'lt'

In [4]:
detect('Είσαι')

'el'

In [5]:
detect_langs("What's up, man?")

[en:0.7142842020356717, tl:0.14285988119791657, id:0.14285591676641163]
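
The results above show that short inputs are easy to misclassify (`'so'` for an English sentence). The `langdetect` README also notes that the algorithm is non-deterministic, so repeated runs on the same text can disagree. The following is a minimal sketch of one way to guard against both: fix `DetectorFactory.seed`, as the README suggests, and accept the top candidate only above a probability threshold. The helper name `best_guess`, the `min_prob` default of 0.9, and the injectable `detect_langs_fn` parameter are illustrative assumptions, not part of the library.

```python
def best_guess(text, detect_langs_fn=None, min_prob=0.9):
    """Return the most probable language code, or None if it is below min_prob.

    best_guess, min_prob and detect_langs_fn are hypothetical names for this sketch.
    """
    if detect_langs_fn is None:
        # Imported lazily so the helper can be tried with a stand-in detector.
        from langdetect import DetectorFactory, detect_langs

        # A fixed seed makes repeated runs return the same result.
        DetectorFactory.seed = 0
        detect_langs_fn = detect_langs

    candidates = detect_langs_fn(text)
    if not candidates:
        return None

    # Each candidate exposes a language code (.lang) and a probability (.prob).
    top = max(candidates, key=lambda candidate: candidate.prob)
    return top.lang if top.prob >= min_prob else None


# Usage (requires langdetect to be installed):
# best_guess("What's up, man?")
```

Returning `None` for low-confidence input leaves it to the caller to decide whether to fall back to a default language or to ask for more text.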

## 2. `langid`

[`langid`](https://github.com/saffsd/langid.py) comes pre-trained on 97 languages:

> af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu

In [6]:
import langid

In [7]:
langid.classify('I am a noob.')[0]

'en'

In [8]:
langid.classify('Eisai gia ta panigiria.')[0]

'lv'

In [9]:
langid.classify("What's up, bro?")

('en', -17.363881587982178)

Note that the second value returned above is an unnormalized score, not a probability. Skipping normalization is faster, because picking the most probable language from a set of candidates does not require computing a full probability for each one.
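
To illustrate why normalization can be skipped, here is a small sketch that assumes the scores behave like unnormalized log-probabilities. It converts them to probabilities with a softmax and checks that the winner does not change. The scores are made up for illustration, and `normalize` is a hypothetical helper, not `langid`'s own code.

```python
import math


def normalize(log_scores):
    """Turn unnormalized log-scores into probabilities that sum to 1 (softmax)."""
    peak = max(log_scores.values())
    # Subtracting the peak before exponentiating avoids underflow
    # when the scores are very negative.
    weights = {lang: math.exp(score - peak) for lang, score in log_scores.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical scores for three candidate languages (not real langid output).
scores = {"en": -17.4, "fr": -25.9, "de": -31.2}
probs = normalize(scores)

# The transformation preserves the ordering, so the top language is the same.
print(max(scores, key=scores.get), max(probs, key=probs.get))  # en en
```

Because the ordering is preserved, the extra exponentials and the division are only worth paying for when the caller needs a confidence value.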

If a normalized confidence score is required, build an identifier with `norm_probs=True`:

In [10]:
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.classify("What's up, girl?")

('en', 0.9997648731958723)

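It is also possible to constrain the language set. The classifier then returns the best match among the allowed languages, even when the text is written in none of them, as the English input below shows: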

In [11]:
identifier.set_languages(['de', 'fr', 'it'])
identifier.classify("What's up, dude?")

('fr', 0.7643981024330109)
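
`set_languages` changes the identifier in place, so every later call to `classify` stays restricted to the chosen subset. The sketch below applies a restriction for a single call and then lifts it. The name `classify_within` is hypothetical, and the sketch assumes that `set_languages(None)` restores the full language set, as in the `langid.py` source at the time of writing; check this against your installed version.

```python
def classify_within(identifier, text, languages):
    """Classify text against a temporary language subset.

    identifier is expected to offer set_languages() and classify(),
    like langid's LanguageIdentifier. classify_within is a hypothetical helper.
    """
    identifier.set_languages(languages)
    try:
        return identifier.classify(text)
    finally:
        # Assumption: passing None lifts the restriction again.
        identifier.set_languages(None)


# Usage (requires langid to be installed):
# from langid.langid import LanguageIdentifier, model
# identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
# classify_within(identifier, "What's up, dude?", ['de', 'fr', 'it'])
```

The `try`/`finally` ensures the restriction is lifted even if `classify` raises, so one constrained call cannot silently affect later ones.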