Language Independent Truecaser for Python

This is an implementation of a trainable Truecaser for Python.

A truecaser converts a sentence where the casing was lost to the most probable casing. Use cases are sentences that are in all-upper case, in all-lower case or in title case.

A model for English is provided, achieving an accuracy of 98.39% on a small test set of random sentences from Wikipedia.


The model was inspired by the paper tRuEcasIng by Lucian Vlad Lita et al., but with some simplifications.

The model applies a greedy strategy. For each token, from left to right, it computes a score for every possible casing:

score(w_0) = P(w_0) * P(w_0 | w_{-1}) * P(w_0 | w_1) * P(w_0 | w_{-1}, w_1)

with w_0 the word at the current position, w_{-1} the previous word, w_1 the next word in the sentence.

All observed casings for w_0 are tested and the casing with the highest score is selected.

The probabilities P(...) are computed based on a large training corpus.
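
The greedy scoring described above can be sketched roughly as follows. This is a minimal illustration for Python 3, not the repository's actual code: the counter names (`uni`, `back_bi`, `fwd_bi`, `tri`), the `casings` lookup, the additive smoothing constant, the choice to lowercase the right neighbour, and the handling of unseen words are all assumptions made for the example.

```python
from collections import Counter

SMOOTHING = 1.0  # assumption: additive smoothing, value chosen for illustration


def share(count, candidate, alternatives):
    """Smoothed fraction of the count mass that falls on `candidate`."""
    total = sum(count(alt) + SMOOTHING for alt in alternatives)
    return (count(candidate) + SMOOTHING) / total


def score(candidate, alternatives, prev_word, next_word, uni, back_bi, fwd_bi, tri):
    """Product of the unigram, backward, forward and trigram terms."""
    result = share(lambda w: uni[w], candidate, alternatives)
    if prev_word is not None:
        result *= share(lambda w: back_bi[(prev_word, w)], candidate, alternatives)
    if next_word is not None:
        result *= share(lambda w: fwd_bi[(w, next_word)], candidate, alternatives)
    if prev_word is not None and next_word is not None:
        result *= share(lambda w: tri[(prev_word, w, next_word)], candidate, alternatives)
    return result


def truecase(tokens, casings, uni, back_bi, fwd_bi, tri):
    """Greedy left-to-right pass over one tokenized sentence."""
    out = []
    for i, token in enumerate(tokens):
        alternatives = casings.get(token.lower())
        if not alternatives:
            out.append(token)  # unseen word: keep as-is (assumption)
            continue
        prev_word = out[-1] if out else None  # casing already decided
        # Assumption: the right neighbour is lowercased, its casing is still unknown.
        next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        best = max(
            sorted(alternatives),
            key=lambda c: score(c, alternatives, prev_word, next_word,
                                uni, back_bi, fwd_bi, tri),
        )
        out.append(best)
    return out


# Tiny hand-made counts, purely illustrative.
casings = {"new": {"new", "New"}, "york": {"York"}, "in": {"in", "In"}}
uni = Counter({"new": 5, "New": 4, "York": 3, "in": 10, "In": 2})
back_bi = Counter({("in", "New"): 3, ("in", "new"): 1})
fwd_bi = Counter({("New", "york"): 3})
tri = Counter({("in", "New", "york"): 3})
print(truecase("IN NEW YORK".split(), casings, uni, back_bi, fwd_bi, tri))
# ['in', 'New', 'York']
```

Because each decision is final once made, the left context always uses the casing already chosen, while the right context has to rely on a case-insensitive form.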


The code was written for Python 2.7 and requires NLTK 3.0.

From NLTK, it uses the functions to split sentences into tokens and FreqDist(). These parts of the code can easily be replaced, so that the code can be used without installing NLTK.
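
As a rough idea of what such a replacement could look like, the sketch below uses only the Python 3 standard library. The regular expression is a simplistic stand-in and will not tokenize exactly like NLTK; the function names `tokenize` and `count_tokens` are made up for this example.

```python
import re
from collections import Counter

# Assumption: a token is a word (optionally with inner apostrophes or hyphens)
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['\-]\w+)*|[^\w\s]")


def tokenize(sentence):
    """Rough stand-in for an NLTK word tokenizer; not equivalent to it."""
    return TOKEN_PATTERN.findall(sentence)


def count_tokens(sentences):
    """Counter as a stand-in for FreqDist: missing keys count as 0."""
    dist = Counter()
    for sentence in sentences:
        for token in tokenize(sentence):
            dist[token] += 1
    return dist


print(tokenize("Hello, world!"))
```

`collections.Counter` supports the same `dist[key] += 1` update pattern, so counting code should need few changes.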

Run the Code

You need a distributions.obj that contains information on the frequencies of unigrams, bigrams, and trigrams.

A pre-trained distributions.obj for English is provided in the release section. Unzip it before you can use it.

One large distributions.obj for English is provided in the download section of GitHub.

You can train your own distributions.obj using the script.

To run the model on one (or multiple) text files, provide distributions.obj to the script. If no text file(s) are provided as arguments, input is read from STDIN.
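
The "files or STDIN" behaviour can be sketched with the standard `fileinput` module. This is only an illustration of the input handling in Python 3; the helper names `clean_lines` and `read_input` are hypothetical and not taken from the repository.

```python
import fileinput


def clean_lines(lines):
    """Strip whitespace and skip empty lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def read_input(paths):
    """Yield sentences from the given files, or from STDIN if `paths` is empty."""
    with fileinput.input(files=paths) as stream:
        yield from clean_lines(stream)
```

Calling `read_input(sys.argv[1:])` would then read every file named on the command line, and fall back to STDIN when none is given.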

To evaluate a model, have a look at

Train your own Truecaser

You can retrain the Truecaser easily. Simply replace the train.txt file with a large sample of sentences, change the training script such that it uses the train.txt, and run the script. You can also use it for languages other than English, such as German, Spanish, or French.
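
To give an idea of what training involves, the sketch below collects unigram, bigram, and trigram counts from a file with one sentence per line and pickles them. It is an illustration in Python 3 under several assumptions: whitespace tokenization, the dictionary layout, and the pickle format are choices made for this example and are not necessarily how the real distributions.obj is structured.

```python
import pickle
from collections import Counter, defaultdict


def collect_counts(sentences):
    """Count n-grams over tokenized, correctly cased sentences."""
    casings = defaultdict(set)  # lowercased word -> casings seen in training
    uni = Counter()
    back_bi = Counter()  # (previous word, word)
    fwd_bi = Counter()   # (word, lowercased next word)
    tri = Counter()      # (previous word, word, lowercased next word)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            uni[word] += 1
            casings[word.lower()].add(word)
            prev_word = tokens[i - 1] if i > 0 else None
            next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else None
            if prev_word is not None:
                back_bi[(prev_word, word)] += 1
            if next_word is not None:
                fwd_bi[(word, next_word)] += 1
            if prev_word is not None and next_word is not None:
                tri[(prev_word, word, next_word)] += 1
    return {"casings": dict(casings), "uni": uni,
            "back_bi": back_bi, "fwd_bi": fwd_bi, "tri": tri}


def train(corpus_path, model_path):
    """Read one sentence per line, collect counts, and pickle them."""
    with open(corpus_path, encoding="utf-8") as corpus:
        sentences = [line.split() for line in corpus if line.strip()]
    counts = collect_counts(sentences)
    with open(model_path, "wb") as out:
        pickle.dump(counts, out)
    return counts
```

Nothing here depends on English, which is why the same procedure should carry over to other languages given a suitable corpus. A more careful trainer might skip the first word of each sentence, since its capitalization is positional rather than lexical.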


Sorry that this is kind of shitty code without documentation. I was looking for a truecaser for my research, but I couldn't find any working implementation. I implemented this script in a hacky manner and it works quite well (at least for me).

I think the code is so simple that everyone can use and adapt it and maybe it is handy for someone. The principle behind the code is really simple, but as mentioned above, it achieves good results.

Hint: The casing of company and product names is the hardest. Train the system on a large and recent dataset to achieve the best results (e.g. on a recent dump of Wikipedia).

