forked from shuyo/ldig

Language Detection with Infinity-gram, Python 3 version.


lkevers/ldig-python3

 
 


ldig (Language Detection with Infinity Gram)

This is a prototype language detector for short messages (such as tweets), with 99.1% accuracy across 17 languages.

ldig can also be used successfully on longer documents. This version was started as part of research conducted at the University of Corsica on the automatic processing of less-resourced languages, Corsican in particular. The results reported in Kevers (2022) show an average accuracy between 99.10% and 99.71% for 18 languages (17 official EU languages plus Corsican).

ldig-python3 fork

The motivations for this fork are:

  • adapt ldig to Python 3
  • add alphabets (Greek and Cyrillic) not supported in the original version.

Usage

You can use ldig with the provided models, or you can train a new model from your own data.

Standard use with provided models:

  1. Extract model directory

     tar xf models/[select model archive]
    
  2. Detect

     ldig.py -m [model directory] [text data file]
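
The two steps above can also be scripted. The following is a minimal sketch, assuming the model archive is a tar file as in step 1. The helper names `extract_model` and `detect_command` are hypothetical and not part of ldig, and every path is a placeholder. Only the `-m` option shown above is used.

```python
import sys
import tarfile


def extract_model(archive, dest="."):
    """Unpack a model archive (step 1) and return the member names it contained."""
    with tarfile.open(archive) as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def detect_command(model_dir, data_file, script="ldig.py"):
    """Build the argument list for step 2 (detection); nothing is executed here."""
    return [sys.executable, script, "-m", str(model_dir), str(data_file)]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the directory that contains `ldig.py`.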
    

Train new models

  1. Compile maxsubst executable (if not already done)

     cd maxsubst
     g++ -Icybozulib/include maxsubst.cpp -o maxsubst
    
  2. Prepare your data

    Training data must be placed in a file formatted as follows:

     CorrectLabel [TAB] Metadata [TAB] Text.
    
  3. Initialisation

     python3 ldig.py -m [ModelDir] -x [MaxSubStBin] --init [LearnCorpusFile]
    

    Several options are available:

     --ff=[LowerLimitOfFrequency] : threshold of feature frequency
     -n [NgramUpperBound] : n-gram upper bound
    
  4. Learning

     python3 ldig.py -m [ModelDir] --learn [TxtCorpusFile] -e [LearningRate]
    

    Several options are available:

     -r [RegularizationConstant] : regularization constant
     --wr [NumWholeRegularizations] : number of whole regularizations
    
  5. Optimisation (optional)

     python3 ldig.py -m [ModelDir] --shrink
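
Steps 3 to 5 can be chained from a script. The sketch below only assembles the command lines using the options listed above; it does not run them. The function name `training_commands` and its parameter names are hypothetical. The text does not say whether initialisation and learning take the same corpus file, so the two are kept as separate arguments.

```python
import sys


def training_commands(model_dir, maxsubst_bin, init_corpus, learn_corpus,
                      learning_rate, min_freq=None, ngram_bound=None,
                      reg_const=None, whole_reg=None, shrink=True,
                      script="ldig.py"):
    """Return the argument lists for initialisation, learning and shrinking, in order."""
    base = [sys.executable, script, "-m", str(model_dir)]

    # Step 3: initialisation, with optional --ff and -n.
    init = base + ["-x", str(maxsubst_bin), "--init", str(init_corpus)]
    if min_freq is not None:
        init.append(f"--ff={min_freq}")
    if ngram_bound is not None:
        init += ["-n", str(ngram_bound)]

    # Step 4: learning, with optional -r and --wr.
    learn = base + ["--learn", str(learn_corpus), "-e", str(learning_rate)]
    if reg_const is not None:
        learn += ["-r", str(reg_const)]
    if whole_reg is not None:
        learn += ["--wr", str(whole_reg)]

    commands = [init, learn]
    # Step 5: optional optimisation.
    if shrink:
        commands.append(base + ["--shrink"])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` in turn, so that a failing step stops the sequence.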
    

Data format

In the input data, each "document" (tweet or other) is one line of a text file, in the format below.

[label]\t[some metadata separated '\t']\t[text without '\t']

[label] is a language code such as en, de or fr. Metadata is optional, but the tab separator must still be present. (ldig doesn't use the metadata or the label for detection, of course :D)
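
As an illustration, an input line in this format could be produced as follows. This is a sketch under one assumption: when there is no metadata, a single empty metadata field is written so that the tab separator is still present. The function name `format_record` is hypothetical.

```python
def format_record(label, text, metadata=()):
    """Return one input line: label, metadata fields, then text, tab-separated."""
    # Assumption: with no metadata, keep one empty field so the tab is present.
    fields = [label, *metadata] if metadata else [label, ""]
    # Tabs or newlines inside the text would break the one-line layout,
    # so every run of whitespace is collapsed to a single space.
    clean_text = " ".join(text.split())
    return "\t".join([*fields, clean_text])
```

Collapsing whitespace changes the text slightly; whether that matters depends on the corpus.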

The output of ldig has the format below.

[correct label]\t[detected label]\t[original metadata and text]
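
Because each output line starts with the correct label followed by the detected label, accuracy can be computed directly from the output. The sketch below assumes the output has been saved to a file or captured as a list of lines; `score_output` is a hypothetical helper, not part of ldig.

```python
from collections import Counter


def score_output(lines):
    """Return overall accuracy and a count of (correct, detected) mismatches."""
    total = 0
    hits = 0
    confusions = Counter()
    for line in lines:
        # Only the first two fields are needed; the rest is metadata and text.
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        correct, detected = parts[0], parts[1]
        total += 1
        if correct == detected:
            hits += 1
        else:
            confusions[(correct, detected)] += 1
    accuracy = hits / total if total else 0.0
    return accuracy, confusions
```

`confusions.most_common()` then lists the language pairs that are most often mistaken for each other.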

Estimation Tool

ldig has an estimation tool.

./server.py -m [model directory]

Open http://localhost:48000 and enter the target text in the text area. ldig then displays the language probabilities and the feature parameters found in the text.

Supported Languages (with provided models)

  • cs Czech
  • da Danish
  • de German
  • en English
  • es Spanish
  • fi Finnish
  • fr French
  • id Indonesian
  • it Italian
  • nl Dutch
  • no Norwegian
  • pl Polish
  • pt Portuguese
  • ro Romanian
  • sv Swedish
  • tr Turkish
  • vi Vietnamese

Supported Languages (with Laurent Kevers' models, data available at https://github.com/lkevers/ldig-models-TAL62-3)

The models have to be generated from the data following the documented procedure.

These models are designed to support 17 official languages of the European Union, plus Corsican.

  • bg / bul - Bulgarian
  • co / cos - Corsican
  • cs / ces - Czech
  • da / dan - Danish
  • de / deu - German
  • el / ell - Greek
  • en / eng - English
  • fi / fin - Finnish
  • fr / fra - French
  • hu / hun - Hungarian
  • it / ita - Italian
  • lt / lit - Lithuanian
  • nl / nld - Dutch
  • pl / pol - Polish
  • pt / por - Portuguese
  • ro / ron - Romanian
  • sp / spa - Spanish
  • sv / swe - Swedish

Documents

Copyright & License

  • (c) 2011-2012 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
  • (c) 2021 Laurent Kevers / University of Corsica (changes made for ldig-python3)
  • All code and resources are available under the MIT License.
