Words and their images in 100 languages

The Massively Multilingual Image Dataset (MMID)

MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, every word is stored alongside the images that represent it and alongside its English translation (with the corresponding English images).
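To make the "doubly parallel" idea concrete, here is a minimal Python sketch of how one might model it in memory. The class names, field names, and file paths are hypothetical and chosen only for illustration; they do not describe the on-disk format of MMID, which is covered on the documentation page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WordEntry:
    """One foreign word with its images and English translations (hypothetical layout)."""
    word: str
    image_paths: List[str] = field(default_factory=list)
    english_translations: List[str] = field(default_factory=list)


@dataclass
class LanguagePack:
    """All word entries for one language (hypothetical container)."""
    language: str
    entries: Dict[str, WordEntry] = field(default_factory=dict)


def parallel_images(
    pack: LanguagePack,
    english_images: Dict[str, List[str]],
    word: str,
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return the foreign word's images and the images of each English translation."""
    entry = pack.entries[word]
    translated = {en: english_images.get(en, []) for en in entry.english_translations}
    return entry.image_paths, translated


# Toy data; the paths are placeholders, not real MMID files.
english_images = {"cat": ["en/cat/01.jpg", "en/cat/02.jpg"]}
spanish = LanguagePack(
    language="es",
    entries={"gato": WordEntry("gato", ["es/gato/01.jpg"], ["cat"])},
)

foreign_imgs, english_imgs = parallel_images(spanish, english_images, "gato")
```

In this sketch, `foreign_imgs` holds the images for the Spanish word and `english_imgs` maps each English translation to its own images, which is the two-way pairing the paragraph above describes.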

By far the largest dataset of its kind, with 100 languages and up to 10,000 words per language, it is useful for evaluating image-based translation methods.
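As a rough illustration of what evaluating an image-based translation method can look like, the sketch below ranks English candidates by the cosine similarity between averaged image feature vectors and reports top-1 accuracy. This is a simplified baseline written for this README, not the procedure from the paper; the function names, the averaging step, and the two-dimensional toy vectors are all assumptions.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def translate(foreign_feats: List[Vector], english: Dict[str, List[Vector]]) -> str:
    """Pick the English word whose mean image vector is closest to the foreign word's."""
    query = mean_vector(foreign_feats)
    return max(english, key=lambda en: cosine(query, mean_vector(english[en])))


def top1_accuracy(
    foreign: Dict[str, List[Vector]],
    english: Dict[str, List[Vector]],
    gold: Dict[str, Set[str]],
) -> float:
    """Fraction of foreign words whose best-ranked candidate is a gold translation."""
    correct = sum(1 for w, feats in foreign.items() if translate(feats, english) in gold[w])
    return correct / len(foreign)


# Toy 2-d "features" standing in for real CNN features (assumption).
english = {"cat": [[1.0, 0.0], [0.9, 0.1]], "dog": [[0.0, 1.0], [0.1, 0.9]]}
foreign = {"gato": [[0.8, 0.2]], "perro": [[0.2, 0.8]]}
gold = {"gato": {"cat"}, "perro": {"dog"}}

accuracy = top1_accuracy(foreign, english, gold)
```

With real data, the toy vectors would be replaced by the CNN features for each word's images, and `gold` by the dataset's word-to-English translation pairs.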

Getting Started

See the documentation page

Downloads

CNN features and web text for the 30 languages evaluated in the paper are available now. Hosting the full dataset (21+ TB) will take some time, but we're working on it!

Check out the downloads page for options on how to access the dataset.

Citation

If you use this dataset for your research, please cite:

Learning Translations via Images with a Massively Multilingual Image Dataset.
John Hewitt*, Daphne Ippolito*, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya and Chris Callison-Burch.
ACL 2018.

@InProceedings{hewitt-et-al:2018:Long,
  author    = {Hewitt, John and Ippolito, Daphne and Callahan, Brendan and Kriz, Reno and Wijaya, Derry Tanti and Callison-Burch, Chris},
  title     = {Learning Translations via Images with a Massively Multilingual Image Dataset},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics}
}