English-Tamil parallel corpus prepared by the National Languages Processing Center, University of Moratuwa. The data has been cleaned and then aligned.

#En-Ta Glossary Line Count: 22477

#En-Ta Corpus Line Count: 8950

#Source: Data extracted from publicly available government resources such as annual reports, procurement reports, circulars and websites.

#Processing: Each Word/PDF file was converted to a text file, and Unicode errors were fixed using a custom tool. The Tamil and English files were then manually sentence-aligned, and all spelling and grammatical errors were manually fixed.
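
Because the files are sentence-aligned, line N on the English side should correspond to line N on the Tamil side. The sketch below shows one way to load such a pair of files and check that their line counts match. It assumes one UTF-8 text file per language with one sentence per line, which this README does not confirm; the function names `read_lines` and `load_parallel` are hypothetical. The NFC normalization step is a generic precaution and is not the custom tool mentioned above.

```python
import unicodedata


def read_lines(path):
    """Read a UTF-8 text file and return its lines, stripped and NFC-normalized."""
    with open(path, encoding="utf-8") as handle:
        # NFC normalization is an assumption here, not the authors' custom tool.
        return [unicodedata.normalize("NFC", line.strip()) for line in handle]


def load_parallel(en_path, ta_path):
    """Pair line i of the English file with line i of the Tamil file.

    Assumes one sentence per line in each file (hypothetical layout).
    Raises ValueError when the two files differ in length, since that
    would mean the alignment is broken.
    """
    en_lines = read_lines(en_path)
    ta_lines = read_lines(ta_path)
    if len(en_lines) != len(ta_lines):
        raise ValueError(
            f"line count mismatch: {len(en_lines)} English vs {len(ta_lines)} Tamil"
        )
    return list(zip(en_lines, ta_lines))
```

Calling `load_parallel` with the paths of the English and Tamil corpus files returns a list of `(english, tamil)` tuples. If the layout assumption holds, the length of that list should equal the corpus line count stated above.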

#If you use this dataset, kindly cite the following publication: Fernando, A., Ranathunga, S., & Dias, G. (2020). Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation. arXiv preprint arXiv:2011.02821.
