Practical Comparable Data Collection for Low-Resource Languages via Images

  • Paper:

  • Slides:

  • We propose a method for curating high-quality comparable training data for low-resource languages that does not require bilingual annotators. The method uses a carefully selected set of images as a pivot between the source and target languages: captions for each image are collected independently in both languages. Human evaluation of the English-Hindi comparable corpus created with this method shows that 81.1% of the pairs are acceptable translations and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach with experiments on two downstream tasks: machine translation and dictionary extraction.

  • Joint work with Shruti Rijhwani, Antonios Anastasopoulos, Yiming Yang, and Graham Neubig.

  • To be presented as a poster at the Practical ML for Developing Countries Workshop at ICLR 2020.


python src/ -i data/alignments/wys.fastalign.input -a data/alignments/symmetric.align -l1 hin -l2 eng 
  • Extracting Counts
python src/ data/alignments/wys.fastalign.input data/alignments/symmetric.align output_path
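
For readers who want to see what counting over these two files involves, here is a minimal sketch. It is not the repository's script. It assumes the standard fast_align formats: each input line is `left sentence ||| right sentence`, and each alignment line is a list of `i-j` token index pairs. It also assumes Hindi is on the left side, following the `-l1 hin -l2 eng` flags above. The function names `count_aligned_pairs` and `best_translations` are hypothetical.

```python
from collections import Counter


def count_aligned_pairs(parallel_lines, alignment_lines):
    """Count (left_word, right_word) pairs linked by fast_align-style alignments.

    parallel_lines: iterable of "left tokens ||| right tokens" strings.
    alignment_lines: iterable of "i-j i-j ..." strings, one per sentence pair.
    """
    counts = Counter()
    for sentence, alignment in zip(parallel_lines, alignment_lines):
        if "|||" not in sentence:
            continue  # skip malformed lines rather than guessing a split
        left, right = (side.split() for side in sentence.split("|||", 1))
        for link in alignment.split():
            i, j = (int(index) for index in link.split("-"))
            # Guard against indices that fall outside either sentence.
            if i < len(left) and j < len(right):
                counts[(left[i], right[j])] += 1
    return counts


def best_translations(counts, min_count=1):
    """Keep, for each left-side word, its most frequently aligned right-side word."""
    best = {}
    for (left_word, right_word), count in counts.items():
        if count >= min_count and count > best.get(left_word, ("", 0))[1]:
            best[left_word] = (right_word, count)
    return {left_word: pair[0] for left_word, pair in best.items()}


if __name__ == "__main__":
    # Toy romanized example; the real files live under data/alignments/.
    pairs = ["kutta daudta hai ||| the dog runs", "kutta sota hai ||| the dog sleeps"]
    aligns = ["0-1 1-2", "0-1 1-2"]
    print(best_translations(count_aligned_pairs(pairs, aligns)))
```

Raising `min_count` is one simple way to drop word pairs that were aligned only once, at the cost of a smaller dictionary.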


├── alignments
│   ├── forward.align
│   ├── reverse.align
│   ├── symmetric.align
│   └── wys.fastalign.input
├── captions.tsv
└── dict.hin-eng.txt
| Contents | Path |
| --- | --- |
| Captions in both English and Hindi, as well as image ids | data/captions.tsv |
| Generated Dictionary | data/dict.hin-eng.txt |
| Fastalign input/output | data/alignments/ |
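
As an illustration of how the captions file could feed the aligner, the sketch below turns tab-separated caption rows into fast_align input lines. The column layout is an assumption, not something documented here: it supposes column 0 is the image id, column 1 the English caption, and column 2 the Hindi caption, so check the actual file and adjust `eng_col` and `hin_col`. The function name `captions_to_fastalign` is hypothetical, and Hindi is placed on the left to match the `-l1 hin -l2 eng` flags.

```python
import csv
import io


def captions_to_fastalign(tsv_text, eng_col=1, hin_col=2):
    """Build "hindi ||| english" lines from tab-separated caption rows.

    The column indices are assumptions about data/captions.tsv; pass
    different values if the file is laid out differently.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = []
    for row in reader:
        if len(row) <= max(eng_col, hin_col):
            continue  # skip blank or short rows
        # Collapse runs of whitespace; fast_align expects space-separated tokens.
        hindi = " ".join(row[hin_col].split())
        english = " ".join(row[eng_col].split())
        if hindi and english:
            lines.append(f"{hindi} ||| {english}")
    return lines


if __name__ == "__main__":
    sample = "img1\tA dog runs\tkutta daudta hai\nimg2\t\tkhali\n"
    for line in captions_to_fastalign(sample):
        print(line)
```

Rows missing either caption are dropped, because fast_align cannot align a sentence against an empty side.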

Task Instructions

  • You must describe each image with one sentence.
  • You do not need to use pure Hindi in the description. Write whatever feels natural.
  • Please give an accurate description of the activities, people, animals, and objects you see in the image.
  • Each description must be a single sentence.
  • The description must be written in Hindi.
  • Try to be concise.
  • If you do not know the Hindi word for something, write that word in English.
  • Pay attention to grammar and spelling.


If you use our work, please cite:

  title={Practical Comparable Data Collection for Low-Resource Languages via Images},
  author={Madaan, Aman and Rijhwani, Shruti and Anastasopoulos, Antonios and Yang, Yiming and Neubig, Graham},
  booktitle={Proceedings of the Practical ML for Developing Countries Workshop, ICLR 2020},


Code and Data for our work "Practical Comparable Data Collection for Low-Resource Languages via Images"


