Massively Multilingual Image Dataset Tools

This repository contains scripts and tools for working with the Massively Multilingual Image Dataset (MMID), a large dataset of images and the words they represent in 100 languages.

Information about the dataset can be found at the dataset website.

Translation scripts

src/translation/

In the paper that introduced the dataset, we showed that the MMID can be used for translating words from many languages into English. This subdirectory contains scripts to recreate the experiments in the paper.
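As a rough illustration of the idea of translating words through images, the sketch below ranks English candidate words by how similar their image features are to those of a foreign word. It is a toy under stated assumptions, not the method implemented by the scripts in this subdirectory: each word is assumed to be represented by the mean of its image feature vectors and compared by cosine similarity, and the names `mean_vector`, `cosine`, and `rank_translations`, as well as the toy vectors, are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_translations(foreign_images, english_words):
    """Rank English words by similarity to a foreign word's images.

    foreign_images: list of feature vectors for one foreign word.
    english_words: dict mapping an English word to its list of feature vectors.
    Returns (word, score) pairs, best match first.
    """
    query = mean_vector(foreign_images)
    scored = [
        (word, cosine(query, mean_vector(images)))
        for word, images in english_words.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "features" (assumption: real CNN features are far larger).
toy_foreign = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
toy_english = {
    "cat": [[1.0, 0.0, 0.0], [0.9, 0.2, 0.0]],
    "car": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
ranking = rank_translations(toy_foreign, toy_english)
print(ranking[0][0])
```

In this toy setup the foreign word's images point in nearly the same direction as those for "cat", so "cat" is ranked first.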

To replicate a translation experiment from our paper, run the following script. This example uses the Latvian language package, available from the MMID downloads page:

    python code/evaluate_package_cnn_combined.py \
        -f nlp/data/word-translation/language_packages/latvian/latvian-features/ \
        -e nlp/data/word-translation/language_packages/latvian/english-features/ \
        -d mmid-tools/dictionaries/ \
        -o ~ \
        -t 4 \
        -tc 4

Here, -t and -tc specify the number of CPUs to use for loading and for computing similarities.
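To run the same evaluation for several language packages, the command can be assembled programmatically. The sketch below only builds and prints the argument list, using the flags shown in the Latvian example. It assumes that every language package follows the same `<language>/<language>-features/` and `<language>/english-features/` layout as the Latvian one, which may not hold; the helper name `build_eval_command` and its parameters are hypothetical.

```python
import posixpath
import shlex


def build_eval_command(language, package_root, dictionaries_dir, output_dir,
                       load_cpus=4, sim_cpus=4):
    """Build the argument list for one translation evaluation run.

    Assumption: the package for `language` mirrors the Latvian example's
    directory layout under `package_root`.
    """
    package = posixpath.join(package_root, language)
    return [
        "python", "code/evaluate_package_cnn_combined.py",
        "-f", posixpath.join(package, language + "-features", ""),
        "-e", posixpath.join(package, "english-features", ""),
        "-d", dictionaries_dir,
        "-o", output_dir,
        "-t", str(load_cpus),   # CPUs for loading
        "-tc", str(sim_cpus),   # CPUs for computing similarities
    ]


cmd = build_eval_command(
    "latvian",
    "nlp/data/word-translation/language_packages",
    "mmid-tools/dictionaries/",
    "~",
)
# Print a shell-quoted version for inspection; nothing is executed here.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths have been verified. Note that `~` is not expanded when a command is run without a shell, so `os.path.expanduser` may be needed for the output directory.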

Dictionaries

To create the MMID, we started with crowdsourced dictionaries from the paper "The Language Demographics of Amazon Mechanical Turk".
