A sentence aligner for comparable corpora



Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical Machine Translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited in size and coverage, and they take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches in comparable corpora. This opens up avenues for harvesting parallel corpora from sources such as translated documents and the web.


Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign


First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz
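
The same download-and-unpack step can also be scripted from Python, which may be convenient when the model is fetched as part of a larger pipeline. The following is a minimal sketch using only the standard library; the helper names `download_model` and `unpack_model` are hypothetical and are not part of Yalign.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the wget command in this README.
MODEL_URL = (
    "https://raw.githubusercontent.com/machinalis/yalign/"
    "develop/data/models/0.1/en-es.tar.gz"
)


def download_model(url=MODEL_URL, dest="en-es.tar.gz"):
    """Download the model archive to `dest` and return its path.

    Hypothetical helper, not part of Yalign.
    """
    path, _headers = urllib.request.urlretrieve(url, dest)
    return Path(path)


def unpack_model(archive, target_dir="."):
    """Extract a .tar.gz archive into `target_dir`, like `tar -xzf`.

    Returns the list of member names found in the archive.
    Hypothetical helper, not part of Yalign.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
        return tar.getnames()
```

Calling `unpack_model(download_model())` would then correspond to the two shell commands above.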

Now we can use the yalign-align script with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
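
Note that the Spanish URL is percent-encoded, because the page title "Antipartícula" contains a non-ASCII character. As a rough sketch, the URLs and the command line can be assembled from Python as shown here. The helpers `wikipedia_url`, `build_align_command` and `align_pages` are hypothetical names introduced for illustration, and the sketch assumes that the `yalign-align` script is on the PATH and that the model was unpacked into an `en-es` folder.

```python
import subprocess
from urllib.parse import quote


def wikipedia_url(lang, title):
    """Return the article URL, percent-encoding non-ASCII characters."""
    return "http://{}.wikipedia.org/wiki/{}".format(lang, quote(title))


def build_align_command(model_dir, url_a, url_b):
    """Argument list mirroring: yalign-align MODEL URL_A URL_B."""
    return ["yalign-align", model_dir, url_a, url_b]


def align_pages(model_dir, url_a, url_b):
    """Run yalign-align and return whatever it writes to stdout.

    Assumes the yalign-align script is installed and on the PATH.
    """
    cmd = build_align_command(model_dir, url_a, url_b)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `wikipedia_url("es", "Antipartícula")` produces the second URL used in the command above.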

Yalign is not limited to any one language pair. By creating your own models you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

Yalign is a Machinalis project. You can view our other open source contributions here.

The Yalign Team:

Andrew Vine
Gonzalo García Berrotarán
Rafael Carrascosa
Elías Andrawos
Laura Alonso Alemany