Sandhi Splitter for Indian Languages (Currently only Malayalam)
Latest commit 4648adb (Jul 21, 2016), committed by @copyninja: Merge pull request #4 from jerinphilip/master (Joiner, Improved Tests and more).

Sandhi Splitter


A probabilistic approach to the problem of agglutination in Indic languages. The implementation here targets Malayalam, although the code is mostly language-agnostic.


  1. First, clone the repository:
    git clone
  2. Create an installable source distribution, then install it with pip:
    python setup.py sdist
    pip install dist/sandhisplitter*.tar.gz

Note: We suggest working in a virtualenv instead of installing system-wide with sudo, since the module is still under development.

Training and Testing

After installation, run the following commands with the necessary arguments:

    sandhisplitter_train [--help] [args]
    sandhisplitter_benchmark_model [--help] [args]

For more details, refer to docs/index.rst.

Using the Sandhisplitter class

The Sandhisplitter class provides two main methods, split and join.

>>> from sandhisplitter import Sandhisplitter
>>> s = Sandhisplitter()
>>> s.split('ആദ്യമെത്തി')
(['ആദ്യം', 'എത്തി'], [4])
>>> s.split('വയ്യാതെയായി')
(['വയ്യാതെ', 'ആയി'], [7])
>>> s.split('എന്നെക്കൊണ്ടുവയ്യ')
(['എന്നെക്കൊണ്ടുവയ്യ'], [])
>>> s.split('ഇന്നത്തെക്കാലത്ത്')
(['ഇന്നത്തെക്കാലത്ത്'], [])
>>> s.split('എന്തൊക്കെയോ')
(['എന്ത്', 'ഒക്കെയോ'], [3])

>>> s.join(['ആദ്യം', 'ആയി'])
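In the session above, each split call returns a tuple: the list of component words, followed by a list of integer positions (an empty list when the word is left unsplit). As a minimal sketch of consuming that return value, the helper below applies a splitter to every whitespace-separated token of a sentence and collects the resulting words. The names `split_sentence` and `StubSplitter` are hypothetical and not part of the library; the stub only replays two results copied from the session above so that the sketch runs without a trained model.

```python
def split_sentence(sentence, splitter):
    """Split every whitespace-separated token of `sentence` with `splitter`.

    `splitter` is any object whose split(word) returns a
    (words, positions) tuple, as in the Sandhisplitter session above.
    """
    pieces = []
    for token in sentence.split():
        # Only the words are needed here; the positions are ignored.
        words, _positions = splitter.split(token)
        pieces.extend(words)
    return pieces


class StubSplitter:
    """Hypothetical stand-in for Sandhisplitter (not part of the library).

    It replays two results copied from the session above.
    """

    _known = {
        'ആദ്യമെത്തി': (['ആദ്യം', 'എത്തി'], [4]),
        'എന്തൊക്കെയോ': (['എന്ത്', 'ഒക്കെയോ'], [3]),
    }

    def split(self, word):
        # Unknown words come back unsplit, mirroring the no-split cases above.
        return self._known.get(word, ([word], []))


if __name__ == '__main__':
    print(split_sentence('ആദ്യമെത്തി എന്തൊക്കെയോ', StubSplitter()))
```

With the package installed, passing `Sandhisplitter()` in place of `StubSplitter()` should work the same way, assuming split behaves as shown in the session above.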