Simple project to build language models I did for fun
.gitignore
Predictor.py
README.md
ancient-greek_language.txt
chinese_language.txt
dutch_language.txt
english_forum_post.txt
english_language.txt
french_language.txt
german_language.txt
japanese_language.txt
latin_language.txt
modern-greek_language.txt
polish_language.txt
predict_language.py
predict_server.py
romanian_language.txt
russian_language.txt
spanish_language.txt

language_predictor

Simple project I made for fun to build language models and predict what language a string or text file is written in. It creates a model for each reference language, caches it to disk for repeated use, and rebuilds the model whenever the language file is updated.
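A rough sketch of how the cache check could work, assuming staleness is decided by comparing file modification times. The helper names here are made up for illustration and are not taken from Predictor.py; only the `<languagename>_language_ngrams.txt` cache name comes from the example output further down.

```python
import os

def cache_path_for(training_file):
    """Map english_language.txt to english_language_ngrams.txt (hypothetical helper)."""
    root, ext = os.path.splitext(training_file)
    return root + "_ngrams" + ext

def cache_is_stale(training_file, cache_file):
    """Return True when the cached model is missing or older than its training text.

    Assumption: freshness is judged by modification time; the real code may differ.
    """
    if not os.path.exists(cache_file):
        return True
    return os.path.getmtime(training_file) > os.path.getmtime(cache_file)
```

With a check like this, a model only needs rebuilding when `cache_is_stale(...)` returns True; otherwise the cached n-gram file can be loaded directly.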

I use the NGram class from http://blog.ebookglue.com/write-language-detector-50-lines-python/, which I've modified slightly. Check it out for the theory as to why this predictor works.
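For a feel of the general idea, here is a minimal sketch of n-gram language matching. It assumes character trigrams compared by cosine similarity, which is a common approach; it is not the modified NGram class used in this project, and all names below are illustrative.

```python
from collections import Counter
import math

def ngram_profile(text, n=3):
    """Count overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(text, profiles):
    """Return the language whose profile is most similar to text."""
    sample = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))
```

Here `profiles` would be a dict mapping a language name to the profile built from its reference text. Longer reference texts give more reliable profiles.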

I've included some language files with text I got from Project Gutenberg.

Included languages: chinese, dutch, english, french, german, japanese, latin, ancient-greek, modern-greek, polish, romanian, russian, spanish

To create a new language reference text, just slap any old text into a file and load it in (as described below). The only preprocessing done before building the language model is reducing all whitespace down to single spaces. Unicode text is supported.
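The whitespace step can be pictured like this. It is a sketch using a regular expression, with a made-up function name; the actual implementation in Predictor.py may do it differently.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text)
```

Nothing else is touched, so case, punctuation, and diacritics all end up in the model as they appear in the file.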

One thing to note is that languages with a lot of diacritics (á, é, č, ć, đ, etc.) might not perform well if you input a string without the marks, because the model treats marked and unmarked letters as completely separate characters. For these languages, make sure to construct the model file from text that includes the marks as well as text that doesn't have them, so the model has a variety to work with.
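One way to get the unmarked variant is to generate it from the marked text. The sketch below uses the standard `unicodedata` module; the function names are made up for illustration and are not part of this project. Note that letters such as đ and ł have no Unicode decomposition, so they pass through unchanged and would need a manual mapping.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks, e.g. turn 'č' into 'c'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def with_unmarked_copy(text):
    """Return the original text followed by a copy without diacritics."""
    return text + " " + strip_diacritics(text)
```

Writing the result of `with_unmarked_copy(...)` to the language file gives the model both spellings to learn from.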

Using the Predictor Class

from Predictor import Predictor
language_files = [
    {"language": "english", "training_file": "english_language.txt"},
    {"language": "spanish", "training_file": "spanish_language.txt"},
    {"language": "german", "training_file": "german_language.txt"},
]

predictor = Predictor(language_files)
best_guess = predictor.predict("Ich bin ein Berliner")

print("Language is: {}".format(best_guess.language))

# prints: Language is: german

predict_language.py

Or you can use the script I included, which takes a string or text file from the command line and guesses its language. It loads all files of the format <languagename>_language.txt in the current directory and builds models for them.

usage: predict_language.py [-h] [-s STRING] [-f TEXT_FILE]

optional arguments:
  -h, --help    show this help message and exit
  -s STRING
  -f TEXT_FILE
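For reference, a parser with this usage line could be set up roughly as follows. This is a sketch with `argparse`, not the script's actual source; the `dest` names and the `read_input` helper are assumptions, as is reading the file as UTF-8.

```python
import argparse

def build_parser():
    """Build a parser accepting -s STRING and -f TEXT_FILE."""
    parser = argparse.ArgumentParser(prog="predict_language.py")
    parser.add_argument("-s", dest="string")
    parser.add_argument("-f", dest="text_file")
    return parser

def read_input(args):
    """Return the text to classify: the -s string if given, else the -f file contents."""
    if args.string is not None:
        return args.string
    if args.text_file is not None:
        # Assumption: input files are UTF-8 encoded.
        with open(args.text_file, encoding="utf-8") as handle:
            return handle.read()
    return None
```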

Example:

$ predict_language.py -s "Fryderyk Franciszek Chopin polski kompozytor i pianista"
Loading model for ancient-greek - ancient-greek_language_ngrams.txt
Loading model for polish - polish_language_ngrams.txt
.
.
.
Loading model for spanish - spanish_language_ngrams.txt
Language is: polish
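
The file discovery the script performs can be sketched like this. It assumes a simple glob over the directory and is not the script's actual code; `discover_language_files` is a made-up name. The returned entries use the same dictionary format the Predictor class takes.

```python
import glob
import os

SUFFIX = "_language.txt"

def discover_language_files(directory="."):
    """Find <languagename>_language.txt files and describe them for Predictor."""
    entries = []
    for path in sorted(glob.glob(os.path.join(directory, "*" + SUFFIX))):
        name = os.path.basename(path)
        # Cached *_language_ngrams.txt files do not end in SUFFIX, so they are skipped.
        entries.append({"language": name[:-len(SUFFIX)], "training_file": path})
    return entries
```

The result could then be passed straight in, as in `Predictor(discover_language_files())`.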