Programming language savant. It detects programming languages just like github/linguist, but is based on a naive Bayes classifier.
Install using pip:
pip install git+https://github.com/polyrabbit/polyglot
First, train polyglot on a multilingual training corpus. Each folder in the corpus should contain files of a single language, identified by the folder name. e.g.
polyglot train --corpus=./corpus --ngram=3 --verbose --output=./model.json
A pre-included model.json was generated using the above command. Run polyglot train --help
for usage specifics.
After training, we can use the naive Bayes classifier to classify a given file. e.g.
echo import os | polyglot classify --ngram=3 --top=3 --verbose --model=./model.json -
This outputs the top 3 most likely languages in descending order, with their scores:
[(u'Python', 6.719828065958895), (u'Frege', -11.021531184412824), (u'Objective-C++', -13.244791737113022)]
Run polyglot classify --help
for usage specifics.
- Lex the input string into tokens, and generate n-grams from those tokens (trigrams by default):

  "#include<stdio.h>".lex().ngram(max_n=3) => [['#'], ['include'], ['<'], ['stdio'], ['.'], ['h'], ['>'], ['#', 'include'], ['include', '<'], ['<', 'stdio'], ['stdio', '.'], ['.', 'h'], ['h', '>'], ['#', 'include', '<'], ['include', '<', 'stdio'], ['<', 'stdio', '.'], ['stdio', '.', 'h'], ['.', 'h', '>']]
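This lex/n-gram step can be sketched in plain Python. The `lex` and `ngrams` helpers below are illustrative approximations, not polyglot's actual API: `lex` splits source text into word runs and single punctuation characters, and `ngrams` enumerates all n-grams up to `max_n`.

```python
import re

def lex(text):
    # Split into word runs and individual non-space symbols,
    # roughly mimicking the tokenization in the example above.
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, max_n=3):
    # All n-grams for n = 1..max_n: unigrams first, then bigrams, etc.
    return [tokens[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

tokens = lex("#include<stdio.h>")
# tokens -> ['#', 'include', '<', 'stdio', '.', 'h', '>']
grams = ngrams(tokens, max_n=3)
# 7 unigrams + 6 bigrams + 5 trigrams = 18 n-grams, as in the example
```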
- Computing the probability of a language given a token:

  P(lang | token) = P(token | lang) * P(lang) / P(token)

                    n_token_on_lang(token, lang)   n_lang_tokens(lang)   n_token(token)
                  = ---------------------------- * ------------------- / --------------
                    n_lang_tokens(lang)            n_tokens()            n_tokens()

                    n_token_on_lang(token, lang)
                  = ----------------------------
                    n_token(token)
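The simplified form of that ratio can be sketched with toy per-language token counts. The `counts` table and the function name below are hypothetical, not polyglot's data structures:

```python
from collections import Counter

# Toy corpus: token counts per language (made-up numbers).
counts = {
    'Python': Counter({'import': 50, 'def': 40}),
    'C':      Counter({'include': 60, 'import': 2}),
}

def p_lang_given_token(token, lang):
    # P(lang | token) = n_token_on_lang(token, lang) / n_token(token)
    n_token_on_lang = counts[lang][token]
    n_token = sum(c[token] for c in counts.values())
    return n_token_on_lang / n_token

p_lang_given_token('import', 'Python')  # 50 / 52
```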
- Combining individual probabilities:

                                          P(tok_1, tok_2, tok_3...tok_n | lang) * P(lang)
  P(lang | tok_1, tok_2, tok_3...tok_n) = -----------------------------------------------
                                          P(tok_1, tok_2, tok_3...tok_n)

  (naively assume that tokens are independent of each other)

    P(tok_1|lang) * P(tok_2|lang) * P(tok_3|lang) ... P(tok_n|lang) * P(lang)
  = ------------------------------------------------------------------------
    P(tok_1) * P(tok_2) * P(tok_3) ... P(tok_n)

          P(tok|lang)   P(lang|tok)
  ( since ----------- = ----------- )
          P(tok)        P(lang)

    P(lang|tok_1) * P(lang|tok_2) * P(lang|tok_3) ... P(lang|tok_n) * P(lang)
  = ------------------------------------------------------------------------
    P(lang)^N
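Multiplying many probabilities below 1 underflows quickly, so such products are usually computed as sums of logarithms. A minimal sketch of the final formula in log space, with made-up posteriors and priors (none of these numbers or names come from polyglot's model):

```python
import math

# Hypothetical P(lang|token) values and P(lang) priors for two languages.
P_LANG_GIVEN_TOK = {
    ('Python', 'import'): 0.9, ('Python', 'os'): 0.6,
    ('C', 'import'): 0.1, ('C', 'os'): 0.4,
}
P_LANG = {'Python': 0.5, 'C': 0.5}

def log_score(lang, tokens):
    # log of  prod_i P(lang|tok_i) * P(lang) / P(lang)^N,
    # computed as a sum of logs to avoid floating-point underflow.
    return (sum(math.log(P_LANG_GIVEN_TOK[(lang, t)]) for t in tokens)
            + (1 - len(tokens)) * math.log(P_LANG[lang]))

ranked = sorted(P_LANG, key=lambda lang: log_score(lang, ['import', 'os']),
                reverse=True)
# ranked[0] == 'Python'
```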
- Dealing with rare words:

  P(unseen_token) = 1.0 / (n_all_tokens() + 1)
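A sketch of this fallback, assuming a hypothetical table of raw token counts: tokens never seen in training get a small non-zero probability, so a single rare token cannot zero out a language's combined score.

```python
def p_token(token, token_counts, n_all_tokens):
    # Unseen tokens fall back to 1 / (n_all_tokens + 1) instead of zero.
    if token not in token_counts:
        return 1.0 / (n_all_tokens + 1)
    return token_counts[token] / n_all_tokens

counts = {'import': 50, 'def': 40}
p_token('import', counts, 90)         # 50 / 90
p_token('no_such_token', counts, 90)  # 1 / 91
```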