Aravinth Bheemaraj edited this page Oct 23, 2020 · 2 revisions

Anuvaad Parallel Corpus

This repository contains links to parallel corpora for popular Indian languages, developed as part of the Anuvaad project.

Please reach out to nlp-nmt@tarento.com for any clarification/interpretation/usage of the linked datasets.

Status

The current size of the parallel corpus*:

| Language Pair | Parallel Corpus Count |
| --- | --- |
| English-Hindi | 228,631 |

This dataset is growing every day!

Goal

The goal is to build a high-quality parallel corpus for Indian languages across various domains (judicial, educational, medical, news, etc.). The corpus can then be used to train ML models for specific use cases.

Read more about Anuvaad at http://anuvaad.org/

The code for building the datasets listed below is available at https://github.com/project-anuvaad/anuvaad-corpus-tools

Links

English-Hindi

Domain: News

PIB (2016-2020) - Created from the parallel reports available on the PIB site

| Year | En-Hi pairs count |
| --- | --- |
| 2020 | 65,149 |
| 2019 | 41,695 |
| 2018 | 50,628 |
| 2017 | 32,113 |
| 2016 | 39,046 |
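
As a quick consistency check, the per-year PIB counts can be summed and compared with the English-Hindi total reported in the Status section. The sketch below is illustrative only: the variable names are hypothetical and are not part of the Anuvaad tooling, and the numbers are copied by hand from this page.

```python
# Per-year English-Hindi pair counts, copied by hand from the PIB table.
# The dictionary name is hypothetical, chosen for this illustration only.
pib_en_hi_pairs = {
    2020: 65_149,
    2019: 41_695,
    2018: 50_628,
    2017: 32_113,
    2016: 39_046,
}

# English-Hindi total as reported in the Status section of this page.
reported_total = 228_631

# Sum the yearly counts and compare with the reported total.
computed_total = sum(pib_en_hi_pairs.values())
print(computed_total)  # 228631
print(computed_total == reported_total)  # True
```

At the time of this revision the two figures agree, which suggests the reported English-Hindi total consists entirely of the PIB news data listed here.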