A trilingual corpus to be used with langid.py [Es-Ar-En]
corpus-esaren.test
corpus-esaren
corpuslib
LICENSE
README.md

corpus-esaren

A trilingual corpus to be used with langid.py. For now the 3 languages covered are Spanish, Arabic and English, hence es-ar-en. The corpus also spans 3 domains.
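
A quick way to check that the 3 languages and 3 domains are evenly covered is to count documents per (domain, language) pair. The snippet below is only a sketch: it assumes a `domain/language/file` directory layout, which is not confirmed by this README, and `count_documents` is a hypothetical helper, not part of langid.py or of this repository.

```python
import os
from collections import Counter


def count_documents(corpus_root):
    """Count files per (domain, language) pair.

    Assumes (hypothetically) a layout of corpus_root/<domain>/<language>/<file>.
    """
    counts = Counter()
    for domain in sorted(os.listdir(corpus_root)):
        domain_dir = os.path.join(corpus_root, domain)
        if not os.path.isdir(domain_dir):
            continue  # skip stray files at the top level
        for lang in sorted(os.listdir(domain_dir)):
            lang_dir = os.path.join(domain_dir, lang)
            if not os.path.isdir(lang_dir):
                continue
            # Only regular files count as documents.
            counts[(domain, lang)] = sum(
                1
                for name in os.listdir(lang_dir)
                if os.path.isfile(os.path.join(lang_dir, name))
            )
    return counts
```

Calling `count_documents("corpus-esaren")` would return a `Counter` keyed by `(domain, language)` tuples, so uneven cells are easy to spot, provided the directory really follows the assumed layout.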

I try to keep Named Entities to a minimum and balanced across the 3 languages, so that langid's Information Gain-based feature selection can exclude them when it builds its model.
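
To illustrate why balance matters, here is a simplified sketch of information gain computed over whole-word presence. It is not langid.py's actual implementation (which works on byte n-grams and also accounts for domain); the function names and the toy documents, including the transliterated Arabic, are made up for illustration. A token that appears equally often in every language tells the model nothing about the language, so its information gain is zero.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(docs, token):
    """Information gain of the feature 'token occurs in the document'
    with respect to the language label.

    docs is a list of (language, text) pairs.
    """
    langs = Counter(lang for lang, _ in docs)
    with_tok = Counter(lang for lang, text in docs if token in text.split())
    without_tok = {lang: langs[lang] - with_tok[lang] for lang in langs}

    n = len(docs)
    n_with = sum(with_tok.values())
    n_without = n - n_with
    # Entropy of the language label after splitting on token presence.
    conditional = (
        (n_with / n) * entropy(with_tok.values())
        + (n_without / n) * entropy(without_tok.values())
    )
    return entropy(langs.values()) - conditional


# Toy documents (hypothetical; Arabic is transliterated for simplicity).
docs = [
    ("es", "Madrid es una ciudad"),
    ("es", "el gato duerme"),
    ("ar", "Madrid madina kabira"),
    ("ar", "al qitt yanam"),
    ("en", "Madrid is a city"),
    ("en", "the cat sleeps"),
]

ig_entity = information_gain(docs, "Madrid")  # balanced across languages
ig_word = information_gain(docs, "the")       # occurs only in English
```

Here `ig_entity` is zero up to floating-point error, because "Madrid" is spread evenly over the three languages, while `ig_word` is clearly positive. An unbalanced Named Entity would behave like "the" and risk being picked up as a language feature.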