The Massively Multilingual Image Dataset (MMID)

MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, every word is stored alongside images that represent it and alongside its English translation (with the translation's own images).
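
As a rough illustration of what "doubly parallel" means, the sketch below models one foreign word with its own images and its English translations, then looks up the English-side images for each translation. The class `WordEntry`, the helper `images_for_translations`, the file paths, and the toy data are all hypothetical; they do not reflect MMID's actual on-disk format, which is described in the documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WordEntry:
    """One foreign word with its images and English translations (hypothetical layout)."""
    word: str
    image_paths: List[str] = field(default_factory=list)
    english_translations: List[str] = field(default_factory=list)


def images_for_translations(
    entry: WordEntry, english_images: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Map each English translation of the word to the English-side images."""
    return {t: english_images.get(t, []) for t in entry.english_translations}


# Toy, made-up data purely for illustration; not taken from MMID.
english_images = {"cat": ["en/cat/01.jpg", "en/cat/02.jpg"]}
entry = WordEntry(
    word="gato",
    image_paths=["es/gato/01.jpg"],
    english_translations=["cat"],
)
paired = images_for_translations(entry, english_images)
```

Here the first parallel is `entry.image_paths` (word to images) and the second is `paired` (word to English translation to images).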

By far the largest dataset of its kind, MMID covers 98 languages (including English), with up to 10,000 words per language and many more for English.

Getting Started

See the documentation page.

If you have questions, please email the MMID users list.


We're happy to announce that MMID is available via the Amazon Public Datasets program! Through their generosity, we're able to provide all of MMID free of charge via a public S3 bucket.

Check out the downloads page for options on how to access the dataset.


We gratefully acknowledge the support of an Amazon Research Award and AWS Research Credits, which enabled the construction of MMID.

If you use MMID for your research, please cite:

Learning Translations via Images with a Massively Multilingual Image Dataset.
John Hewitt*, Daphne Ippolito*, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya and Chris Callison-Burch.
ACL 2018.

  author    = {Hewitt, John  and  Ippolito, Daphne  and  Callahan, Brendan and Kriz, Reno and Wijaya, Derry Tanti and Callison-Burch, Chris},
  title     = {Learning Translations via Images with a Massively Multilingual Image Dataset},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics}

