{"payload":{"pageCount":6,"repositories":[{"type":"Public","name":"adversarial-nibbler","owner":"google-research-datasets","isFork":false,"description":"This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":16,"forksCount":3,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-09-25T00:49:53.340Z"}},{"type":"Public","name":"C4_200M-synthetic-dataset-for-grammatical-error-correction","owner":"google-research-datasets","isFork":false,"description":"This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. 
The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (<a href=\"https://www.aclweb.org/anthology/2021.bea-1.4/\" rel=\"nofollow\">https://www.aclweb.org/anthology/2021.bea-1.4/</a>)","allTopics":[],"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":0,"issueCount":0,"starsCount":153,"forksCount":24,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-09-24T14:30:31.751Z"}},{"type":"Public","name":"sanpo_dataset","owner":"google-research-datasets","isFork":false,"description":"","allTopics":[],"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":2,"issueCount":3,"starsCount":39,"forksCount":1,"license":"Apache License 2.0","participation":[0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-09-19T16:15:30.274Z"}},{"type":"Public","name":"SeeGULL-Multilingual","owner":"google-research-datasets","isFork":false,"description":"SeeGULL Multilingual is a multilingual and multicultural dataset of stereotypes. It consists of stereotypes in 20 languages with human annotations across 23 languages, including annotations on their degree of offensiveness.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":3,"forksCount":1,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-09-19T00:35:14.477Z"}},{"type":"Public","name":"ToTTo","owner":"google-research-datasets","isFork":false,"description":"ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. 
We hope it can serve as a useful research benchmark for high-precision conditional text generation.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":6,"starsCount":436,"forksCount":37,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-09-11T18:07:47.554Z"}},{"type":"Public","name":"indic-gen-bench","owner":"google-research-datasets","isFork":false,"description":"IndicGenBench is a high-quality, multilingual, multi-way parallel benchmark for evaluating Large Language Models (LLMs) on 4 user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":41,"forksCount":6,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-09-01T09:40:37.683Z"}},{"type":"Public","name":"hiertext","owner":"google-research-datasets","isFork":false,"description":"The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word, line and paragraph level annotations.","allTopics":[],"primaryLanguage":{"name":"Jupyter Notebook","color":"#DA5B0B"},"pullRequestCount":1,"issueCount":0,"starsCount":260,"forksCount":24,"license":"Creative Commons Attribution Share Alike 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-08-30T23:47:37.824Z"}},{"type":"Public","name":"cf_triviaqa","owner":"google-research-datasets","isFork":false,"description":"The CF-TriviaQA dataset accompanies the \"Hallucination Augmented Recitations for Language Models\" paper (<a href=\"https://arxiv.org/abs/2311.07424\" rel=\"nofollow\">https://arxiv.org/abs/2311.07424</a>). It is a counterfactual open-book QA dataset generated from the TriviaQA dataset using the Hallucination Augmented Recitations (HAR) approach, with the purpose of improving attribution in LLMs. 
","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":1,"starsCount":2,"forksCount":1,"license":"Apache License 2.0","participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-08-30T00:18:48.722Z"}},{"type":"Public","name":"BamTwoogle","owner":"google-research-datasets","isFork":false,"description":"The BamTwoogle dataset accompanies \"ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent\" paper (<a href=\"https://arxiv.org/abs/2312.10003\" rel=\"nofollow\">https://arxiv.org/abs/2312.10003</a>). It was written to be a complementary, slightly more challenging sequel to Bamboogle dataset. It addresses some of the shortcomings of Bamboogle we discovered while performing human evals for the paper.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":3,"forksCount":1,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-08-14T01:36:03.933Z"}},{"type":"Public","name":"mittens","owner":"google-research-datasets","isFork":false,"description":"Datasets for measuring misgendering in translation","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":5,"forksCount":0,"license":"Other","participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-08-13T20:16:08.914Z"}},{"type":"Public","name":"visage","owner":"google-research-datasets","isFork":false,"description":"Visage contains an image dataset of images with human annotations on whether or not certain attributes are present or depicted in the image. The attribute may either be stereotypical or non-stereotypical w.r.t. to the identity group in the image. 
It also contains a list of attributes in English along with annotations about whether they are visual.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":7,"forksCount":2,"license":"Apache License 2.0","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-08-13T15:31:03.545Z"}},{"type":"Public","name":"SPICE","owner":"google-research-datasets","isFork":false,"description":"SPICE is a stereotype dataset in English containing stereotypes collected in India with community engagement. It spans identity groups and stereotypes unique to India, as well as other stereotypes about gender and nationalities.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":2,"forksCount":1,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-07-26T19:42:00.395Z"}},{"type":"Public","name":"cube","owner":"google-research-datasets","isFork":false,"description":"CUBE is a benchmark to evaluate the Cultural Competence of T2I models","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":4,"forksCount":0,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-07-18T18:35:03.399Z"}},{"type":"Public","name":"screen_qa","owner":"google-research-datasets","isFork":false,"description":"ScreenQA dataset was introduced in the \"ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots\" paper. It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. 
It should be used to train and evaluate models capable of screen content understanding via question answering.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":1,"starsCount":87,"forksCount":7,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-07-18T14:21:22.080Z"}},{"type":"Public","name":"uicrit","owner":"google-research-datasets","isFork":false,"description":"UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for 1,000 mobile UIs from RICO. This dataset was collected for our UIST '24 paper: <a href=\"https://arxiv.org/abs/2407.08850\" rel=\"nofollow\">https://arxiv.org/abs/2407.08850</a>.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":4,"forksCount":1,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-07-18T03:20:18.083Z"}},{"type":"Public","name":"dices-dataset","owner":"google-research-datasets","isFork":false,"description":"This repository contains two datasets with multi-turn adversarial conversations generated by human agents interacting with a dialog model and rated for safety by two corresponding diverse rater pools.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":1,"starsCount":23,"forksCount":3,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-07-16T15:56:04.039Z"}},{"type":"Public","name":"wit","owner":"google-research-datasets","isFork":false,"description":"WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ 
languages.","allTopics":["multilingual","nlp","machine-learning","wikipedia","multimodal","cc-by-sa-3"],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":995,"forksCount":40,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-07-12T19:24:24.910Z"}},{"type":"Public","name":"rico_semantics","owner":"google-research-datasets","isFork":false,"description":"Consists of ~500k human annotations on the RICO dataset identifying various icons based on their shapes and semantics, and associations between selected general UI elements and their text labels. Annotations also include human annotated bounding boxes which are more accurate and have a greater coverage of UI elements.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":1,"starsCount":20,"forksCount":3,"license":"Creative Commons Attribution Share Alike 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-06-27T22:18:12.367Z"}},{"type":"Public","name":"tpu_graphs","owner":"google-research-datasets","isFork":false,"description":"","allTopics":[],"primaryLanguage":{"name":"C++","color":"#f34b7d"},"pullRequestCount":1,"issueCount":2,"starsCount":122,"forksCount":43,"license":"Apache License 2.0","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-06-25T20:26:24.556Z"}},{"type":"Public","name":"MISeD","owner":"google-research-datasets","isFork":false,"description":"MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. 
MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":8,"forksCount":3,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-06-25T16:03:13.340Z"}},{"type":"Public","name":"richhf-18k","owner":"google-research-datasets","isFork":false,"description":"RichHF-18K dataset contains rich human feedback labels we collected for our CVPR'24 paper: <a href=\"https://arxiv.org/pdf/2312.10240\" rel=\"nofollow\">https://arxiv.org/pdf/2312.10240</a>, along with the file name of the associated labeled images (no urls or images are included in this dataset).","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":9,"starsCount":97,"forksCount":2,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-06-25T00:03:34.808Z"}},{"type":"Public","name":"web-images","owner":"google-research-datasets","isFork":false,"description":"Images gathered from the Internet in 2023 and some metadata","allTopics":[],"primaryLanguage":{"name":"HTML","color":"#e34c26"},"pullRequestCount":0,"issueCount":0,"starsCount":1,"forksCount":1,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-06-24T13:49:05.799Z"}},{"type":"Public","name":"GeniL","owner":"google-research-datasets","isFork":false,"description":"GeniL dataset is an effort for detecting various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID and is annotated by native speakers of each language. 
Each sentence is collected from a public language corpus and contains at least one identity group name and an attribute.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":0,"forksCount":1,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-06-18T16:22:56.900Z"}},{"type":"Public","name":"D3code","owner":"google-research-datasets","isFork":false,"description":"D3code is a large-scale cross-cultural dataset of parallel annotations for offensive language detection by over 4k annotators, balanced across gender and age, from across 21 countries, representing eight geo-cultural regions.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":0,"forksCount":1,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-05-22T18:18:12.938Z"}},{"type":"Public","name":"scin","owner":"google-research-datasets","isFork":false,"description":"The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.","allTopics":[],"primaryLanguage":{"name":"Jupyter Notebook","color":"#DA5B0B"},"pullRequestCount":0,"issueCount":2,"starsCount":69,"forksCount":4,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-05-08T18:25:31.059Z"}},{"type":"Public","name":"cpcd","owner":"google-research-datasets","isFork":false,"description":"The Conversational Playlist Creation Dataset (CPCD) contains 917 conversations between two people, where users express preferences over sets of songs in natural language and wizards elicit preferences from users. 
The dataset includes per-song ratings and can be used to design and evaluate conversational recommendation systems.","allTopics":[],"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":1,"issueCount":1,"starsCount":9,"forksCount":3,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-05-03T20:40:05.465Z"}},{"type":"Public","name":"thesios","owner":"google-research-datasets","isFork":false,"description":"This repository describes I/O traces of Google storage servers and disks synthesized by Thesios. Thesios synthesizes representative I/O traces by combining down-sampled I/O traces collected from multiple disks (HDDs) attached to multiple storage servers in Google distributed storage system.","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":1,"starsCount":18,"forksCount":1,"license":null,"participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-04-29T18:43:51.646Z"}},{"type":"Public","name":"Taskmaster","owner":"google-research-datasets","isFork":false,"description":"Please see the readme file as well as our 2019 EMNLP paper linked here --&gt;","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":4,"starsCount":192,"forksCount":58,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-04-24T15:13:46.899Z"}},{"type":"Public","name":"Crosslingual-Morphosyntactic-Divergence-dataset","owner":"google-research-datasets","isFork":false,"description":"This repository contains the annotations from the paper \"To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human 
Translation.\"","allTopics":[],"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":0,"forksCount":1,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-25T23:09:25.760Z"}},{"type":"Public","name":"QuoteSum","owner":"google-research-datasets","isFork":false,"description":"QuoteSum is a textual QA dataset containing Semi-Extractive Multi-source Question Answering (SEMQA) examples written by humans, based on Wikipedia passages.","allTopics":[],"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":0,"issueCount":0,"starsCount":12,"forksCount":1,"license":"Creative Commons Attribution Share Alike 4.0 International","participation":[0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-25T03:47:22.252Z"}}],"repositoryCount":161,"userInfo":null,"searchable":true,"definitions":[],"typeFilters":[{"id":"all","text":"All"},{"id":"public","text":"Public"},{"id":"source","text":"Sources"},{"id":"fork","text":"Forks"},{"id":"archived","text":"Archived"},{"id":"template","text":"Templates"}],"compactMode":false},"title":"google-research-datasets repositories"}