Sandhi Splitter for Indian Languages (currently only Malayalam)

Sandhi Splitter

A probabilistic approach to the problem of agglutination in Indic languages. The implementation here targets Malayalam, although most of the code is language-agnostic.

Installation

  1. First, clone the repository:
    git clone https://github.com/libindic/sandhi-splitter.git
  2. Create an installable source distribution, then install it with pip:
    python setup.py sdist
    pip install dist/sandhisplitter*.tar.gz

Note: We suggest working in a virtualenv instead of installing system-wide with sudo, since the module is still under development.

Training and Testing

After installation, run the following commands with the necessary arguments:

    sandhisplitter_train [--help] [args]
    sandhisplitter_benchmark_model [--help] [args]

For more details, refer to docs/index.rst.

Using the Sandhisplitter class

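The Sandhisplitter class provides two main methods, split and join. As the examples below show, split returns a pair: the list of component words and the list of positions at which the input was split (the word itself and an empty list when no split is found). join combines a list of words into a single joined form.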

>>> from sandhisplitter import Sandhisplitter
>>> s = Sandhisplitter()
>>> s.split('ആദ്യമെത്തി')
(['ആദ്യം', 'എത്തി'], [4])
>>> s.split('വയ്യാതെയായി')
(['വയ്യാതെ', 'ആയി'], [7])
>>> s.split('എന്നെക്കൊണ്ടുവയ്യ')
(['എന്നെക്കൊണ്ടുവയ്യ'], [])
>>> s.split('ഇന്നത്തെക്കാലത്ത്')
(['ഇന്നത്തെക്കാലത്ത്'], [])
>>> s.split('എന്തൊക്കെയോ')
(['എന്ത്', 'ഒക്കെയോ'], [3])

>>> s.join(['ആദ്യം', 'ആയി'])
'ആദ്യമായി'