Anuvaad Parallel Corpus

This repository contains links to parallel corpora for popular Indian languages, built as part of the Anuvaad project.

Please reach out to nlp-nmt@tarento.com with any questions about the interpretation or use of the linked datasets.

Status

Current status of the parallel corpus built so far*:

Language Pair	Parallel Corpus Count
English-Hindi	228,631

This dataset is growing every day!

Goal

The goal is to build a high-quality parallel corpus for Indian languages across various domains (judicial, educational, medical, news, etc.). The corpus can eventually be used to train ML models for specific use cases.

Read more about Anuvaad at http://anuvaad.org/

The code for building the datasets listed below is available at https://github.com/project-anuvaad/anuvaad-corpus-tools

Links

English-Hindi

Domain: News

PIB (2016-2020) - Created from the parallel reports available on the PIB site

Year	En-Hi pairs count
2020	65,149
2019	41,695
2018	50,628
2017	32,113
2016	39,046
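
As a quick consistency check, the per-year PIB counts above can be summed and compared with the English-Hindi total reported in the Status section. The sketch below simply hard-codes the figures from the two tables; the variable names are illustrative and are not part of the Anuvaad tooling.

```python
# Per-year English-Hindi pair counts, copied from the PIB table above.
pib_pairs_by_year = {
    2020: 65_149,
    2019: 41_695,
    2018: 50_628,
    2017: 32_113,
    2016: 39_046,
}

# Total reported for English-Hindi in the Status table.
reported_total = 228_631

# Sum the yearly counts and compare against the reported total.
computed_total = sum(pib_pairs_by_year.values())
print(f"computed: {computed_total:,}  reported: {reported_total:,}")
assert computed_total == reported_total, "yearly counts do not match the reported total"
```

With the figures listed here the two totals agree, which suggests the PIB set accounts for the entire English-Hindi count in this snapshot. Since the dataset keeps growing, the numbers would need updating before rerunning the check.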
