Tools for working with the Massively Multilingual Image Dataset

Massively Multilingual Image Dataset Tools

This repository contains scripts and tools for working with the Massively Multilingual Image Dataset (MMID), a large dataset of images and the words they represent in 100 languages.

More information about the dataset is available on the dataset website.

Translation scripts


In the paper that introduced the dataset, we showed that the MMID can be used for translating words from many languages into English. This subdirectory contains scripts to recreate the experiments in the paper.

To replicate a translation experiment from our paper, run the following script. This example uses the Latvian language package from the MMID downloads page:

            python code/ \
                -f nlp/data/word-translation/language_packages/latvian/latvian-features/ \
                -e nlp/data/word-translation/language_packages/latvian/english-features/ \
                -d mmid-tools/dictionaries/ \
                -o ~ \
                -t 4 \
                -tc 4

where -t and -tc specify the number of CPUs to use for loading and for computing similarities, respectively.
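To illustrate what "computing similarities" can mean in image-based word translation, here is a minimal sketch, not the repository's script. It assumes each word is represented by a list of image feature vectors and ranks English candidates with an average-of-maximum cosine similarity, one common aggregation for this task. The function names, the toy feature values, and the choice of aggregation are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def avg_max_similarity(foreign_images, english_images):
    """For each foreign image, take its best match among the English
    images, then average those best-match scores."""
    best = [max(cosine(f, e) for e in english_images) for f in foreign_images]
    return sum(best) / len(best)


def rank_translations(foreign_images, english_candidates):
    """Return English candidate words sorted from most to least similar."""
    scores = {
        word: avg_max_similarity(foreign_images, images)
        for word, images in english_candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional features (hypothetical values, not taken from the MMID).
saule_images = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]  # images for Latvian "saule"
english_candidates = {
    "sun": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.1]],
    "dog": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}

ranking = rank_translations(saule_images, english_candidates)
print(ranking)
```

In this toy setup the candidate whose images lie closest to the foreign word's images is ranked first. Real feature vectors would have far more dimensions, and the pairwise comparisons are the part that benefits from multiple CPUs.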


To create the MMID, we started with crowdsourced dictionaries from the paper "The Language Demographics of Amazon Mechanical Turk".