Programming language savant. It detects programming languages just like github/linguist, but is based on a naive Bayes classifier.
Install using pip:
pip install git+https://github.com/polyrabbit/polyglot
First, train polyglot on a multilingual training corpus. Each folder in the corpus should contain files of a single language, identified by the folder name. e.g.
polyglot train --corpus=./corpus --ngram=3 --verbose --output=./model.json
A pre-included model.json was generated using the above command. Run polyglot train --help
for usage specifics.
After training, we can use the naive Bayes classifier to classify a given file. e.g.
echo import os | polyglot classify --ngram=3 --top=3 --verbose --model=./model.json -
This outputs the top 3 most likely languages in descending order, with their scores:
[(u'Python', 6.719828065958895), (u'Frege', -11.021531184412824), (u'Objective-C++', -13.244791737113022)]
Run polyglot classify --help
for usage specifics.
- Lex the input string into tokens, and generate n-grams from those tokens (trigrams by default):

  "#include<stdio.h>".lex().ngram(max_n=3) => [['#'], ['include'], ['<'], ['stdio'], ['.'], ['h'], ['>'], ['#', 'include'], ['include', '<'], ['<', 'stdio'], ['stdio', '.'], ['.', 'h'], ['h', '>'], ['#', 'include', '<'], ['include', '<', 'stdio'], ['<', 'stdio', '.'], ['stdio', '.', 'h'], ['.', 'h', '>']]
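This lex/n-gram step can be sketched in plain Python. The `lex` and `ngrams` helpers below are illustrative approximations, not polyglot's actual API: `lex` splits source text into word runs and single punctuation characters, and `ngrams` enumerates all n-grams up to `max_n`.

```python
import re

def lex(text):
    # Split into word runs and individual non-space symbols,
    # roughly mimicking the tokenization in the example above.
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, max_n=3):
    # All n-grams for n = 1..max_n: unigrams first, then bigrams, etc.
    return [tokens[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

tokens = lex("#include<stdio.h>")
# tokens -> ['#', 'include', '<', 'stdio', '.', 'h', '>']
grams = ngrams(tokens, max_n=3)
# 7 unigrams + 6 bigrams + 5 trigrams = 18 n-grams, as in the example
```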
- Computing the probability of a language given a token:

  P(lang | token) = P(token | lang) * P(lang) / P(token)

                    n_token_on_lang(token, lang)   n_lang_tokens(lang)   n_token(token)
                  = ---------------------------- * ------------------- / --------------
                    n_lang_tokens(lang)            n_tokens()            n_tokens()

                    n_token_on_lang(token, lang)
                  = ----------------------------
                    n_token(token)
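The simplified form of that ratio can be sketched with toy per-language token counts. The `counts` table and the function name below are hypothetical, not polyglot's data structures:

```python
from collections import Counter

# Toy corpus: token counts per language (made-up numbers).
counts = {
    'Python': Counter({'import': 50, 'def': 40}),
    'C':      Counter({'include': 60, 'import': 2}),
}

def p_lang_given_token(token, lang):
    # P(lang | token) = n_token_on_lang(token, lang) / n_token(token)
    n_token_on_lang = counts[lang][token]
    n_token = sum(c[token] for c in counts.values())
    return n_token_on_lang / n_token

p_lang_given_token('import', 'Python')  # 50 / 52
```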
- Combining individual probabilities:

                                          P(tok_1, tok_2, tok_3...tok_n | lang) * P(lang)
  P(lang | tok_1, tok_2, tok_3...tok_n) = -----------------------------------------------
                                          P(tok_1, tok_2, tok_3...tok_n)

  (naively assume that tokens are independent of each other)

    P(tok_1|lang) * P(tok_2|lang) * P(tok_3|lang) ... P(tok_n|lang) * P(lang)
  = ------------------------------------------------------------------------
    P(tok_1) * P(tok_2) * P(tok_3) ... P(tok_n)

          P(tok|lang)   P(lang|tok)
  ( since ----------- = ----------- )
          P(tok)        P(lang)

    P(lang|tok_1) * P(lang|tok_2) * P(lang|tok_3) ... P(lang|tok_n) * P(lang)
  = ------------------------------------------------------------------------
    P(lang)^N
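Multiplying many probabilities below 1 underflows quickly, so such products are usually computed as sums of logarithms. A minimal sketch of the final formula in log space, with made-up posteriors and priors (none of these numbers or names come from polyglot's model):

```python
import math

# Hypothetical P(lang|token) values and P(lang) priors for two languages.
P_LANG_GIVEN_TOK = {
    ('Python', 'import'): 0.9, ('Python', 'os'): 0.6,
    ('C', 'import'): 0.1, ('C', 'os'): 0.4,
}
P_LANG = {'Python': 0.5, 'C': 0.5}

def log_score(lang, tokens):
    # log of  prod_i P(lang|tok_i) * P(lang) / P(lang)^N,
    # computed as a sum of logs to avoid floating-point underflow.
    return (sum(math.log(P_LANG_GIVEN_TOK[(lang, t)]) for t in tokens)
            + (1 - len(tokens)) * math.log(P_LANG[lang]))

ranked = sorted(P_LANG, key=lambda lang: log_score(lang, ['import', 'os']),
                reverse=True)
# ranked[0] == 'Python'
```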
- Dealing with rare words:

  P(unseen_token) = 1.0 / (n_all_tokens() + 1)
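A sketch of this fallback, assuming a hypothetical table of raw token counts: tokens never seen in training get a small non-zero probability, so a single rare token cannot zero out a language's combined score.

```python
def p_token(token, token_counts, n_all_tokens):
    # Unseen tokens fall back to 1 / (n_all_tokens + 1) instead of zero.
    if token not in token_counts:
        return 1.0 / (n_all_tokens + 1)
    return token_counts[token] / n_all_tokens

counts = {'import': 50, 'def': 40}
p_token('import', counts, 90)         # 50 / 90
p_token('no_such_token', counts, 90)  # 1 / 91
```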