# **Automatic Text Language Identification in Python**
> https://zoumanakeita.medium.com/  

## LangDetect

In [2]:
# Install the library
!pip install langdetect
from langdetect import detect, detect_langs

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)

In [3]:
def language_detection(text, method="single"):

  """
  @desc: 
    - detects the language of a text
  @params:
    - text: the text whose language needs to be detected
    - method: detection method: 
      "single": return only the most probable language (detect)
      any other value: return every candidate language with its probability (detect_langs)
  @return:
    - the language code, or the list of candidate languages
  """

  if method.lower() != "single":
    result = detect_langs(text)

  else:
    result = detect(text)

  return result

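The detection algorithm in langdetect is non-deterministic, so running the following two cells may give different results each time, especially on short or mixed-language text. Setting `DetectorFactory.seed` before the first detection, as done in a later cell, makes the results reproducible.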


In [8]:
# Test first case 
text = """
This library is the direct port of Google's language-detection library from Java to Python. 
Elle est vraiment éfficace dans la détection de langue.
"""
print(language_detection(text))

en


In [11]:
# Test second case
print(language_detection(text, "all languages"))

[en:0.714282782965898, fr:0.28571541584211635]
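
`detect_langs` returns a list of candidate objects rather than plain strings; in langdetect each candidate exposes a `lang` code and a `prob` score. As a minimal sketch, the hypothetical helper `top_languages` below (not part of langdetect) turns such a list into sorted `(lang, prob)` pairs and drops candidates under an assumed `min_prob` threshold. A `namedtuple` stands in for langdetect's objects so that the block runs without the library installed.

```python
from collections import namedtuple

# Stand-in for the candidate objects returned by detect_langs
# (hypothetical, for illustration only).
Candidate = namedtuple("Candidate", ["lang", "prob"])

def top_languages(candidates, min_prob=0.2):
    """Return (lang, prob) pairs with prob >= min_prob, most probable first."""
    kept = [(c.lang, round(c.prob, 4)) for c in candidates if c.prob >= min_prob]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Values copied from the output shown above.
sample = [Candidate("fr", 0.28571541584211635),
          Candidate("en", 0.714282782965898)]
print(top_languages(sample))
```

With the real library, `top_languages(detect_langs(text))` should behave the same way, as long as the returned objects keep the `lang` and `prob` attributes.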


In [14]:
from langdetect import DetectorFactory
DetectorFactory.seed = 0

# Test first case 
text = """
This library is the direct port of Google's language-detection library from Java to Python. Developed by Nakatani Shuyo at Cybozu Labs, Inc in 2010, this library has about 99% precision for over 50 languages.
"""
print(language_detection(text))

# Test second case
print(language_detection(text, "show_proba"))

en
[en:0.9999977497863055]


## Spacy-langdetect

In [7]:
#!pip install spacy-langdetect
import spacy
from spacy_langdetect import LanguageDetector 

In [8]:
# https://www.actuia.com/english/africa-launch-of-the-initiative-for-the-development-of-artificial-intelligence-in-french-speaking-african-countries/

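def spacy_language_detection(text, model):

  # Add the detector only once, so repeated calls do not fail on a duplicate component.
  # Note: passing a component instance to add_pipe is the spaCy 2.x API.
  if "language_detector" not in model.pipe_names:
    model.add_pipe(LanguageDetector(), name="language_detector", last=True)
    
  doc = model(text)

  # doc._.language is a dict holding the detected language code and its score
  return doc._.language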

In [9]:
english_text = """Niyel, a Dakar-based company that designs, implements, 
and evaluates advocacy campaigns to change policies, behaviors, and practices, 
will support the researchers in using the results to influence the implementation of AI-friendly policies.
"""
french_text = """Intelligence artificielle : la solution pour améliorer l'accès au crédit en Afrique ? 
Déjà une réalité au Kenya, en Afrique du Sud et au Nigeria, l'évaluation du risque crédit via 
l'intelligence artificielle dispose d'un fort potentiel en Afrique de l'Ouest malgré les inquiétudes liées à la protection de la vie privée."""


In [11]:
pre_trained_model = spacy.load("en_core_web_sm")

In [12]:
# Detection on English text
print(spacy_language_detection(english_text, pre_trained_model))

# Detection on French text
print(spacy_language_detection(french_text, pre_trained_model))

{'language': 'en', 'score': 0.9999968055183488}
{'language': 'fr', 'score': 0.9999948438951478}


['tagger', 'parser', 'ner', 'language_detector']

## fastText

In [16]:
!pip install fasttext
import fasttext as ft

In [26]:
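# Load the pretrained language identification model
# (lid.176.bin has to be downloaded beforehand into ./pretrained_model/)
ft_model = ft.load_model("./pretrained_model/lid.176.bin")

def fasttext_language_predict(text, model = ft_model):

  # predict() expects single-line input, so replace newlines with spaces
  text = text.replace('\n', " ")
  prediction = model.predict([text])

  return prediction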



In [27]:
print(fasttext_language_predict(english_text))
print(fasttext_language_predict(french_text))


([['__label__en']], [array([0.8957091], dtype=float32)])
([['__label__fr']], [array([0.99077034], dtype=float32)])
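
The raw fastText output pairs `__label__`-prefixed labels with arrays of scores, which is awkward to consume downstream. The sketch below assumes the nested structure shown in the output above (one list of labels and one sequence of scores per input text). `parse_fasttext_prediction` is a hypothetical helper, not part of fastText.

```python
LABEL_PREFIX = "__label__"

def parse_fasttext_prediction(prediction):
    """Convert a (labels, scores) tuple into a list of (lang, score) pairs per text."""
    labels, scores = prediction
    parsed = []
    for text_labels, text_scores in zip(labels, scores):
        pairs = []
        for label, score in zip(text_labels, text_scores):
            # Strip the "__label__" prefix when present, keep the label otherwise.
            lang = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
            pairs.append((lang, float(score)))
        parsed.append(pairs)
    return parsed

# Plain lists stand in for the NumPy arrays that fastText returns.
raw = ([["__label__en"]], [[0.8957091]])
print(parse_fasttext_prediction(raw))
```

Because the helper only iterates and calls `float()`, it should accept the NumPy arrays from `fasttext_language_predict` as well as plain lists.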


## gcld3

The install and import instructions below might not work as is, because it is recommended to install gcld3 inside a virtual environment. The functions are therefore written so that you can copy and paste them into your own notebook or .py file.

In [37]:
#!pip install gcld3
#import gcld3

In [33]:
# Both functions require `import gcld3` (see the cell above).

# First feature: Single Language detection

def cld3_single_language_detection(text):

  # max_num_bytes is a byte budget, so measure the UTF-8 encoding, not the character count
  max_num_bytes = len(text.encode("utf-8"))
  detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, 
                                          max_num_bytes=max_num_bytes)

  # Run the detector once and read both fields from the same result
  prediction = detector.FindLanguage(text=text)

  return {'language': prediction.language,
          'probability': prediction.probability}


# Second feature: Multiple Language detection

def cld3_multiple_language_detection(text, nb_language=2):

  max_num_bytes = len(text.encode("utf-8"))
  detector = gcld3.NNetLanguageIdentifier(min_num_bytes=5, 
                                          max_num_bytes=max_num_bytes)

  # Ask for the nb_language most frequent languages found in the text
  languages = detector.FindTopNMostFreqLangs(text=text,
                                             num_langs=nb_language)

  return [{"Language":l.language, 
           "Probability":l.probability} for l in languages]
    
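
The `max_num_bytes` argument of `NNetLanguageIdentifier` is a byte budget, whereas Python's `len()` on a `str` counts characters. For accented or non-Latin text the UTF-8 byte count is larger, so a limit computed from the character count can leave the tail of the input unread. A small sketch of the difference, where `utf8_byte_length` is a hypothetical helper name:

```python
def utf8_byte_length(text):
    """Return the number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

sample = "détection de langue"
print(len(sample))               # number of characters
print(utf8_byte_length(sample))  # number of UTF-8 bytes, larger because of the accent
```

Passing the byte length as `max_num_bytes` keeps the whole text within the detector's limit.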