- Introduction
- News
- Installing
- Getting started
- Public API
- Supported languages
- Adding a new language
- Contributing
- Contact
- License
Introduction
libvarnam is a cross-platform, self-learning, open source library that supports transliteration and reverse transliteration for Indian languages. At its core is a C shared library providing the algorithms and patterns for transliteration. libvarnam also has a simple built-in learning module that learns words to improve the transliteration experience.
Installing libvarnam
Dependencies
| libvarnam | Standard C libraries |
| varnamc (libvarnam's command line client) | |
| libvarnam's unit tests | Check (http://check.sourceforge.net) |
Installation
libvarnam uses CMake as its build system. The library itself has no external dependencies, so building it is straightforward.
$ cmake .
$ make
$ make install
Getting started
To try out libvarnam quickly, use varnamc, its command line client.
$ ./varnamc
varnamc : no actions specified
Usage: varnamc options args
-l, --library FILE Sets the varnam library
-v, --verbose Enable verbose output
-t, --transliterate TEXT Transliterate the given text
--indic-digits Turns on indic digit rendering while transliterating
-r, --reverse-transliterate TEXT Reverse transliterate the given text
-n, --learn [TEXT] Learn the given text
-a, --train PATTERN=WORD Train varnam to use PATTERN for WORD
-f, --learn-from FILE|DIRECTORY Reads from the specified file/directory
--train-from FILE|DIRECTORY Reads the specified file/directory and trains all the words specified
-e, --export-words Export words to the output directory
-s, --symbols VALUE Sets the symbols file
-c, --compile FILE Compile symbols file
--learnings-file FILE Specify the file to store all learnings
--detect-language WORD Detect language of the word
-d, --output-dir dir Sets the output directory
-h, --help Display this screen
--version Display version
Each supported language has a scheme file under the schemes directory. A scheme file is plain text and must be compiled before use. To compile a scheme file, use the following command.
$ ./varnamc --compile schemes/<SCHEME_FILE_NAME>
This generates a file named <SCHEME_FILE_NAME>.vst (Varnam Symbol Table), a binary file containing all the symbols defined in the scheme file. Running make install installs the VST files, which allows varnamc to be used outside the source directory.
You can now start using libvarnam. To transliterate a word:
$ ./varnamc --symbols ml --transliterate navaneeth
The above command uses the Malayalam symbols to transliterate the text 'navaneeth'.
Similarly, to have varnam learn a word:
$ ./varnamc --symbols ml --learn വർണം
Public API
api.h defines the public API for libvarnam. Take a look at api.h in the source for available functions.
In short, libvarnam can be initialized using varnam_init(). varnam_init() will initialize a handle which needs to be passed to all other functions. varnam_transliterate() can transliterate a word. varnam_learn() can be used to learn a word.
The following example shows simple usage of libvarnam.
#include <stdio.h>
#include <varnam.h>

int main(int argc, char **argv)
{
    int rc, i;
    char *error;
    varnam *handle;
    varray *result;
    vword *word;

    /* Initialize a handle from a compiled scheme (VST) file. */
    rc = varnam_init ("path/to/ml-unicode.vst", &handle, &error);
    if (rc != VARNAM_SUCCESS)
    {
        printf ("Initialization failed. %s\n", error);
        return 1;
    }

    /* Transliterate a word; result receives the suggestions. */
    rc = varnam_transliterate (handle, "navaneeth", &result);
    if (rc != VARNAM_SUCCESS)
    {
        printf ("Transliteration failed. %s\n", varnam_get_last_error(handle));
        return 1;
    }

    /* Print every suggestion. */
    for (i = 0; i < varray_length (result); i++)
    {
        word = varray_get (result, i);
        printf ("%s\n", word->text);
    }
    return 0;
}
On a GNU/Linux machine, the above example can be compiled using the following command. The pkg-config output is placed after the source file so that the linker resolves the library symbols correctly.
gcc -o example example.c `pkg-config --cflags --libs varnam`
Supported languages
- Hindi
- Malayalam
- Gujarati (Experimental)
Adding a new language
A new language can be added to libvarnam by adding a new scheme file. A scheme file is a simple Ruby file that specifies the symbols for a language. The best way to write a new scheme file is to refer to an existing one. All scheme files are stored under the schemes/ directory.
Metadata
A scheme file often starts with metadata.
| language_code | Language code for the scheme |
| identifier | A unique identifier to identify this scheme |
| display_name | Friendly name for this scheme |
| author | Author of the scheme file |
Syntax
<symbol-type> options, symbols
options and symbols should be valid Ruby hashes. options is an optional argument and can contain the following values.
options = {:accept_if => starts_with | ends_with | in_between, :priority => 0..9}
symbols should be a hash with patterns as keys and replacements as values. It can take either of the following forms.
'a' => 'a-value', 'b' => 'b-value'
['a', 'aa'] => 'b-value'
Given the first mapping, varnam replaces token a with a-value and token b with b-value. Multiple patterns can be specified in an array, as in the second form, where both a and aa resolve to b-value.
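To make the array form concrete, here is a minimal Ruby sketch of how a symbols hash with an array key could be flattened into one entry per pattern. expand_symbols is a hypothetical helper written for this illustration; it is not a libvarnam function and does not claim to mirror how the scheme compiler works internally.

```ruby
# Illustration only: expand_symbols is a hypothetical helper, not part of libvarnam.
def expand_symbols(symbols)
  expanded = {}
  symbols.each do |patterns, value|
    # A key is either a single pattern or an array of patterns.
    Array(patterns).each { |pattern| expanded[pattern] = value }
  end
  expanded
end

mapping = expand_symbols(['a', 'aa'] => 'b-value')
mapping.each { |pattern, value| puts "#{pattern} => #{value}" }
```

Both a and aa end up pointing at the same replacement, which is the behaviour described above for array keys.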
Symbol types
The following functions are available in the scheme files to define different types of symbols.
- vowels
- consonants - Usually specified with the inherent 'a' sound.
- consonant_vowel_combinations
- anusvara
- visarga
- virama
- symbols
- numbers
- others
Other functions
The following functions are available in a scheme file.
infer_dead_consonants
Usage
infer_dead_consonants true
When this option is set, varnam infers a dead consonant from each consonant definition. Consider the following statements.
infer_dead_consonants true
consonants 'ka' => 'क'
In this case, varnam creates a consonant ka which resolves to क and a dead consonant k which resolves to क्.
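The inference can be pictured with a small standalone Ruby sketch. It assumes the rule is "drop the trailing inherent a from the pattern and append a virama to the value"; that rule, the infer_dead_consonant helper, and the hard-coded Devanagari virama are assumptions made for illustration, not libvarnam's actual implementation.

```ruby
VIRAMA = "\u094D" # Devanagari virama, assumed here for the Hindi example

# Hypothetical helper: derives a dead consonant from a consonant definition.
# Returns nil when the pattern has no trailing inherent 'a' to drop.
def infer_dead_consonant(pattern, value)
  return nil unless pattern.length > 1 && pattern.end_with?('a')
  [pattern.chomp('a'), value + VIRAMA]
end

dead_pattern, dead_value = infer_dead_consonant('ka', 'क')
puts "#{dead_pattern} => #{dead_value}"
```

With the definition from the example, the sketch yields the pattern k paired with क followed by the virama.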
generate_cv
When this function is called, varnam will autogenerate consonant-vowel combinations. Consider the following statements.
vowels 'aa' => ['आ', 'ा']
consonants 'ka' => 'क'
generate_cv
In this case, varnam will generate consonant-vowel combinations such as kaa => 'का'.
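As a rough picture of what such generation involves, the following Ruby sketch combines one consonant with one vowel. It assumes that the second value of a vowel is its dependent sign and that the inherent a is dropped from the consonant pattern before the vowel pattern is appended. sketch_generate_cv is a hypothetical name; the real logic lives inside libvarnam and may differ.

```ruby
# Hypothetical data mirroring the statements above.
vowels     = { 'aa' => ['आ', 'ा'] } # pattern => [independent form, dependent sign]
consonants = { 'ka' => 'क' }

# Sketch only (assumed rule, not libvarnam's code): drop the inherent 'a'
# from the consonant pattern, append the vowel pattern, and attach the
# vowel's dependent sign to the consonant value.
def sketch_generate_cv(consonants, vowels)
  combos = {}
  consonants.each do |c_pattern, c_value|
    base = c_pattern.chomp('a')
    vowels.each do |v_pattern, (_independent, sign)|
      next if sign.nil? # vowel without a dependent sign: nothing to combine
      combos[base + v_pattern] = c_value + sign
    end
  end
  combos
end

cv = sketch_generate_cv(consonants, vowels)
cv.each { |pattern, value| puts "#{pattern} => #{value}" }
```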
list
Creates a custom list and adds the tokens into the list.
list :consonants_with_inherent_a_sound do
consonants 'ka' => 'क'
end
# Token 'ka' will be added to the custom list named 'consonants_with_inherent_a_sound'. To read it,
consonants_with_inherent_a_sound.each do |c|
puts c
end
combine
The combine function generates combinations of tokens. Consider the following scheme file for Hindi.
consonants "k" => "क",
["kh", ["gh"]] => "ख",
["gh", ["kh"]] => "घ"
# Generating ka, kha etc
consonants(combine get_consonants, ["*a"] => ["*1"])
combine takes a list as its first argument and a hash as its second. The list can be a custom list created with the list function or any built-in list. In the above example, combine iterates over the list get_consonants and, for each token, replaces the wildcard character * with the token's pattern and *1 with its first value (value1). In values, you can use *1, *2 and *3 to refer to value1, value2 and value3.
combine returns a hash that can be passed to token creation functions like consonants or vowels.
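The wildcard substitution can be sketched in plain Ruby as follows. Token and sketch_combine are hypothetical names introduced for this illustration, and the token fields (pattern, value1, value2, value3) are assumed from the description above; this is not libvarnam's implementation.

```ruby
# Hypothetical token record; libvarnam's real token type is not shown here.
Token = Struct.new(:pattern, :value1, :value2, :value3)

# Sketch only: '*' in a key becomes the token's pattern, and '*1', '*2',
# '*3' in a value become the token's value1, value2 and value3.
def sketch_combine(tokens, template)
  result = {}
  tokens.each do |token|
    template.each do |patterns, values|
      keys = patterns.map { |p| p.gsub('*', token.pattern) }
      vals = values.map do |v|
        v.gsub(/\*([123])/) { token["value#{$1}"].to_s }
      end
      result[keys] = vals
    end
  end
  result
end

tokens   = [Token.new('k', 'क'), Token.new('kh', 'ख')]
combined = sketch_combine(tokens, ['*a'] => ['*1'])
combined.each { |patterns, values| puts "#{patterns.inspect} => #{values.inspect}" }
```

The result keeps the array-to-array shape of the template, so each generated entry could be handed to a token creation function in the same way a hand-written mapping would be.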
Setting priority for a token
When defining a token, you can assign a priority to it. During tokenization, higher priority tokens appear first in the list.
consonants({:priority => :high}, 'ka' => 'क')
This generates the consonant ka with its priority set to high.
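The effect of priority on ordering can be illustrated with a short Ruby sketch. The candidate hashes, their placeholder values, and the use of numeric priorities from the 0..9 range in the options syntax are assumptions for this example; how libvarnam stores and sorts tokens internally is not described here.

```ruby
# Hypothetical candidate tokens matching the same pattern.
candidates = [
  { pattern: 'ka', value: 'low-priority-value',  priority: 1 },
  { pattern: 'ka', value: 'high-priority-value', priority: 9 },
  { pattern: 'ka', value: 'default-value',       priority: 1 }
]

# Order by descending priority, keeping definition order among equals.
ordered = candidates.each_with_index
                    .sort_by { |token, index| [-token[:priority], index] }
                    .map(&:first)

ordered.each { |token| puts "#{token[:priority]} #{token[:value]}" }
```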
Setting accept condition for a token
Each token can have an optional accept condition. The accept condition takes one of three possible values: starts_with, ends_with or in_between.
consonants({:accept_if => :starts_with}, 'ka' => 'क')
In this case, varnam accepts the token ka only if the pattern starts with ka.
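One way to read the three conditions is as a check on where the token's pattern was matched inside the input. The Ruby sketch below encodes that reading. The accepts? helper, its arguments, and the exact semantics of in_between (neither at the start nor at the end) are assumptions for illustration, not libvarnam's code.

```ruby
# Sketch only (assumed semantics). position is the index in word at which
# the token's pattern was matched.
def accepts?(accept_if, word, pattern, position)
  at_start = position.zero?
  at_end   = position + pattern.length == word.length
  case accept_if
  when :starts_with then at_start
  when :ends_with   then at_end
  when :in_between  then !at_start && !at_end
  else true # no accept condition: accept anywhere
  end
end

puts accepts?(:starts_with, 'kamala', 'ka', 0)
puts accepts?(:starts_with, 'chakka', 'ka', 4)
```

Under these assumptions, the first call accepts ka because it opens the word, and the second rejects it because ka is matched at the end.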
Contributing
Thank you for your interest. Browse the open issues, pick one you find interesting to work on, and submit a pull request with your fix.
Contact
| Website | [www.varnamproject.com](http://www.varnamproject.com) |
| IRC | #varnamproject at freenode |
| Questions | Tweet your questions to @navaneethkn |
| Mailing list | varnamproject [at] googlegroups.com |
Copyright
Copyright (C) Navaneeth.K.N
This is part of libvarnam. See LICENSE.txt for the license
The MIT License (MIT)
Copyright (c) 2013 Navaneeth.K.N
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.