Google Research Datasets

Datasets released by Google Research


  1. Natural Questions (NQ) contains real user questions issued to Google search, and answers found in Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 740 137

  2. Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.

    Shell 366 20

  3. ToTTo Public

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, p…

    353 32

  4. dakshina Public

    The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia tex…

    154 16

  5. tydiqa Public

    TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the trai…

    Python 221 34

  6. GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia for the evaluation of coreference resolution in practica…

    Python 210 82


  • wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    686 28 2 1 Updated Jun 9, 2022
  • hiertext Public

    The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.

    Jupyter Notebook 92 CC-BY-SA-4.0 5 0 0 Updated Jun 3, 2022
  • informal Public

    InFormal is a formality style transfer dataset for four Indic languages. The dataset is made up of sentence pairs and corresponding human-annotated labels identifying the more formal sentence as well as the pair’s semantic similarity. This dataset can be used as an evaluation set for style transfer tasks in Indic languages. InFormal contains s…

    0 Apache-2.0 0 0 0 Updated May 24, 2022
  • RxR Public

    Room-across-Room (RxR) is a large-scale, multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments. It contains 126k navigation instructions in English, Hindi, and Telugu, and 126k navigation-following demonstrations. Both annotation types include dense spatiotemporal alignments between the text and the visual per…

    Python 78 CC-BY-4.0 9 1 0 Updated Apr 7, 2022
  • TF-IDF-IIF-top100-wordlists Public

    These are lists for a variety of languages containing words that are distinctive to each language.

    18 3 0 0 Updated Apr 5, 2022
  • taperception Public

    This repository contains the datasets that were used for the research described in "Predicting and Explaining Mobile UI Tappability with Vision Modeling and Saliency Analysis" by Eldon Schoop, Xin Zhou, Gang Li, Zhourong Chen, Bjoern Hartmann and Yang Li, which is to appear in CHI 2022.

    1 1 0 0 Updated Apr 4, 2022
  • cvss Public

    CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

    106 CC-BY-4.0 7 0 0 Updated Mar 29, 2022
  • WikipediaAbbreviationData Public

    This data set consists of 24,000 English sentences, extracted from Wikipedia in 2017, annotated to support development of an abbreviation expansion system for text-to-speech synthesis (e.g., a systm tht cn prnounc txt lk ths).

    Python 7 Apache-2.0 1 0 0 Updated Mar 21, 2022
  • ToTTo Public

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.

    353 32 6 1 Updated Mar 17, 2022
  • MAVE Public

    The dataset contains 3 million attribute-value annotations across 1257 unique categories on 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for the study of product attribute extraction.

    Python 57 12 1 0 Updated Mar 16, 2022

