Vietnamese diacritics restoration
Repository contents (last commit message, date):
predict: change settings, Jan 18, 2016
train: Training, Jan 18, 2016
.gitignore: Update gitignore file, Jan 18, 2016
README.md: Update README.md, Jan 18, 2016

Vietnamese Diacritics Restoration

This is a Vietnamese Diacritics Restoration tool based on SVMs.

Usage

In both the "train" and "predict" directories, put the LIBLINEAR library file "liblinear.so.3" under the "src" directory.

Train

Make corpus

# make a copy of the corpus with the tone marks removed
% cat corpus.txt | python stdin2delete_tonemark.py > resource/viet_corpus_no_tonemark.txt
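
The source of stdin2delete_tonemark.py is not shown here, so the following is only a minimal sketch of how diacritics could be stripped from Vietnamese text, not the script's actual implementation. The function name strip_tone_marks is hypothetical. It assumes that every diacritic should be removed (tone marks as well as vowel marks such as the circumflex and horn), which matches the unaccented input "Toi la sinh vien" used in the Predict step.

```python
import unicodedata


def strip_tone_marks(text):
    """Return text with all Vietnamese diacritics removed (assumed behaviour)."""
    # "đ" and "Đ" are standalone letters that do not decompose, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (Unicode category "Mn").
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(strip_tone_marks("Tôi là sinh viên"))  # prints: Toi la sinh vien
```

Applying such a function to each line of corpus.txt would produce the parallel corpus without tone marks that training expects.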

Training

First, edit config.ini.

% emacs config.ini
[settings]
path1 = /Users/takahashi/restore-tonemark/train/resource/VNTQcorpus_small.txt
path2 = /Users/takahashi/restore-tonemark/train/resource/VNTQcorpus_small_no_tone_mark.txt
preserve_dir_path = /Users/takahashi/restore-tonemark/train/models
window_size = 2
# training
% cd train
% python train.py
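
How train.py reads config.ini and uses window_size is not documented here, so the sketch below rests on two assumptions: that the settings are read with Python's standard configparser module, and that window_size is the number of neighbouring syllables taken on each side of the target syllable as context features for the SVM. The function names load_settings and context_window, the padding token, and the sample paths are all hypothetical.

```python
import configparser

# Sample settings with placeholder paths (hypothetical, for illustration only).
SAMPLE_CONFIG = """
[settings]
path1 = train/resource/corpus.txt
path2 = train/resource/corpus_no_tone_mark.txt
preserve_dir_path = train/models
window_size = 2
"""


def load_settings(text):
    """Parse the [settings] section into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["settings"]
    return {
        "path1": section["path1"],
        "path2": section["path2"],
        "preserve_dir_path": section["preserve_dir_path"],
        "window_size": section.getint("window_size"),
    }


def context_window(syllables, index, window_size, pad="<PAD>"):
    """Collect the syllable at index plus window_size neighbours on each side."""
    window = []
    for offset in range(-window_size, window_size + 1):
        pos = index + offset
        # Positions outside the sentence are filled with a padding token.
        window.append(syllables[pos] if 0 <= pos < len(syllables) else pad)
    return window


settings = load_settings(SAMPLE_CONFIG)
tokens = "toi la sinh vien".split()
print(context_window(tokens, 0, settings["window_size"]))
# prints: ['<PAD>', '<PAD>', 'toi', 'la', 'sinh']
```

Under this reading, a larger window_size gives the classifier more context for each syllable at the cost of a larger feature space.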

Predict

% cd predict
% echo "Toi la sinh vien" | python predict.py
# or read the input from a file
% python predict.py < input.txt