{"payload":{"pageCount":5,"repositories":[{"type":"Public","name":"indic-gen-bench","owner":"google-research-datasets","isFork":false,"description":"IndicGenBench is a high-quality, multilingual, multi-way parallel benchmark for evaluating Large Language Models (LLMs) on 4 user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":11,"forksCount":0,"license":"Other","participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,10],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-04-26T14:55:04.683Z"}},{"type":"Public","name":"D3code","owner":"google-research-datasets","isFork":false,"description":"D3code is a large-scale cross-cultural dataset of parallel annotations for offensive language detection by over 4k annotators, balanced across gender and age, from across 21 countries, representing eight geo-cultural regions.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":0,"forksCount":0,"license":"Creative Commons Zero v1.0 Universal","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-04-25T16:35:03.938Z"}},{"type":"Public","name":"Taskmaster","owner":"google-research-datasets","isFork":false,"description":"Please see the readme file as well as our 2019 EMNLP paper linked here -->","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":4,"starsCount":187,"forksCount":57,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-04-24T15:13:46.899Z"}},{"type":"Public","name":"thesios","owner":"google-research-datasets","isFork":false,"description":"This repository describes I/O traces of Google storage servers and disks synthesized by Thesios. 
Thesios synthesizes representative I/O traces by combining down-sampled I/O traces collected from multiple disks (HDDs) attached to multiple storage servers in Google distributed storage system.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":1,"forksCount":0,"license":null,"participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-27T17:41:25.646Z"}},{"type":"Public","name":"Crosslingual-Morphosyntactic-Divergence-dataset","owner":"google-research-datasets","isFork":false,"description":"This repository contains the annotations from the paper \"To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation.\"","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":0,"forksCount":0,"license":null,"participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-25T23:09:25.760Z"}},{"type":"Public","name":"QuoteSum","owner":"google-research-datasets","isFork":false,"description":"QuoteSum is a textual QA dataset containing Semi-Extractive Multi-source Question Answering (SEMQA) examples written by humans, based on Wikipedia passages.","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":0,"issueCount":0,"starsCount":8,"forksCount":0,"license":"Creative Commons Attribution Share Alike 4.0 International","participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,3,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-25T03:47:22.252Z"}},{"type":"Public","name":"screen_qa","owner":"google-research-datasets","isFork":false,"description":"ScreenQA dataset was introduced 
in the \"ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots\" paper. It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It should be used to train and evaluate models capable of screen content understanding via question answering.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":51,"forksCount":5,"license":"Creative Commons Attribution 4.0 International","participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,1,0,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-19T13:41:12.080Z"}},{"type":"Public","name":"scin","owner":"google-research-datasets","isFork":false,"description":"The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Jupyter Notebook","color":"#DA5B0B"},"pullRequestCount":0,"issueCount":1,"starsCount":46,"forksCount":1,"license":"Other","participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,3,0,0,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-15T16:06:57.059Z"}},{"type":"Public","name":"LLAMA1-Test-Set","owner":"google-research-datasets","isFork":false,"description":"We introduce the LLAMA1 Test Set, a comprehensive open-domain world knowledge QA dataset for evaluating question-answering systems. We prompted the open-source LLama-7B model for questions and short answers on various topics. We gathered 300 questions (with Google Cloud TTS service, voice en-US-Neural2-C), and generally verified the answers. 
","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":0,"forksCount":0,"license":null,"participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-14T13:23:12.202Z"}},{"type":"Public","name":"screen_annotation","owner":"google-research-datasets","isFork":false,"description":"The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format, and describe the UI elements present on the screen: their type, location, OCR text and a short description. It has been introduced in the paper `ScreenAI: A Vision-Language Model for UI and Infographics Understanding`.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":1,"starsCount":10,"forksCount":2,"license":null,"participation":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0],"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-07T13:21:35.868Z"}},{"type":"Public","name":"SeeGULL-Multilingual","owner":"google-research-datasets","isFork":false,"description":"SeeGULL Multilingual is a multilingual and multicultural dataset of stereotypes. It consists of stereotypes in 20 languages with human annotations across 23 languages, including annotations on their degree of offensiveness.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":3,"forksCount":0,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-03-06T22:03:00.477Z"}},{"type":"Public","name":"seahorse","owner":"google-research-datasets","isFork":false,"description":"Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. 
It consists of 96K summaries with human ratings along 6 quality dimensions: comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness, covering 6 languages, 9 systems and 4 datasets.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":1,"starsCount":82,"forksCount":7,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-02-27T20:16:39.194Z"}},{"type":"Public","name":"dices-dataset","owner":"google-research-datasets","isFork":false,"description":"This repository contains two datasets with multi-turn adversarial conversations generated by human agents interacting with a dialog model and rated for safety by two corresponding diverse rater pools.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":1,"starsCount":20,"forksCount":1,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-01-25T17:21:39.039Z"}},{"type":"Public","name":"maxm","owner":"google-research-datasets","isFork":false,"description":"MaXM is a suite of test-only benchmarks for multilingual visual question answering in 7 languages: English (en), French (fr), Hindi (hi), Hebrew (iw), Romanian (ro), Thai (th), and Chinese (zh).","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":11,"forksCount":0,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-01-16T07:33:05.946Z"}},{"type":"Public","name":"mittens","owner":"google-research-datasets","isFork":false,"description":"Datasets for measuring misgendering in 
translation","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":5,"forksCount":0,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-01-12T23:40:59.914Z"}},{"type":"Public","name":"sco_rai","owner":"google-research-datasets","isFork":false,"description":"Societal Context Ontology and annotated dataset of 2023 proceedings of the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT).","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":1,"forksCount":0,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-01-10T22:58:56.737Z"}},{"type":"Public","name":"Synthetic-Persona-Chat","owner":"google-research-datasets","isFork":false,"description":"The Synthetic-Persona-Chat dataset is a synthetically generated persona-based dialogue dataset. It extends the original Persona-Chat dataset. ","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":0,"issueCount":0,"starsCount":33,"forksCount":1,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2024-01-02T19:27:27.793Z"}},{"type":"Public","name":"india-soil-health-card","owner":"google-research-datasets","isFork":false,"description":"","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":1,"issueCount":0,"starsCount":0,"forksCount":0,"license":"Apache License 2.0","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-12-27T19:18:23.934Z"}},{"type":"Public","name":"LaGOT","owner":"google-research-datasets","isFork":false,"description":"We enrich the LaSOT validation set with annotations of additional object tracks, up to 10 object tracks per video in total. 
Tracks consist of precise bounding box annotations of moving objects. Annotations are provided at 10 fps. The original LaSOT validation set annotations and video can be downloaded from: https://vision.cs.stonybrook.edu/~lasot/","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":0,"issueCount":0,"starsCount":6,"forksCount":0,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-12-23T11:02:24.785Z"}},{"type":"Public","name":"global_streamflow_model_paper","owner":"google-research-datasets","isFork":false,"description":"","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Jupyter Notebook","color":"#DA5B0B"},"pullRequestCount":0,"issueCount":0,"starsCount":25,"forksCount":2,"license":"Apache License 2.0","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-12-15T19:34:20.977Z"}},{"type":"Public","name":"DaTaSeg-Objects365-Instance-Segmentation","owner":"google-research-datasets","isFork":false,"description":"We release the DaTaSeg Objects365 Instance Segmentation Dataset introduced in the DaTaSeg paper, which can be used as an evaluation benchmark for weakly or semi supervised segmentation.","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Jupyter Notebook","color":"#DA5B0B"},"pullRequestCount":0,"issueCount":0,"starsCount":13,"forksCount":0,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-12-09T00:49:14.444Z"}},{"type":"Public","name":"aart-ai-safety-dataset","owner":"google-research-datasets","isFork":false,"description":"AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered 
Applications","topicNames":["ai-safety","responsible-ai","responsible-ml","ml-fairness"],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":4,"forksCount":0,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-11-29T14:48:27.633Z"}},{"type":"Public","name":"wit","owner":"google-research-datasets","isFork":false,"description":"WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.","topicNames":["multilingual","nlp","machine-learning","wikipedia","multimodal","cc-by-sa-3"],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":3,"starsCount":957,"forksCount":39,"license":"Other","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-11-15T23:55:48.556Z"}},{"type":"Public","name":"tpu_graphs","owner":"google-research-datasets","isFork":false,"description":"","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"C++","color":"#f34b7d"},"pullRequestCount":0,"issueCount":2,"starsCount":120,"forksCount":41,"license":"Apache License 2.0","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-11-15T07:54:27.854Z"}},{"type":"Public","name":"swim-ir","owner":"google-research-datasets","isFork":false,"description":"SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask 
prompting.","topicNames":["multilingual","nlp","machine-learning","natural-language-processing","information-retrieval","deep-learning","datasets","cross-lingual","training-data","neural-information-retrieval"],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":40,"forksCount":2,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-11-13T23:42:22.176Z"}},{"type":"Public","name":"hiertext","owner":"google-research-datasets","isFork":false,"description":"The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word, line and paragraph level annotations.","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Jupyter Notebook","color":"#DA5B0B"},"pullRequestCount":0,"issueCount":1,"starsCount":231,"forksCount":20,"license":"Creative Commons Attribution Share Alike 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-10-30T04:14:00.824Z"}},{"type":"Public","name":"sanpo_dataset","owner":"google-research-datasets","isFork":false,"description":"","topicNames":[],"topicsNotShown":0,"primaryLanguage":{"name":"Python","color":"#3572A5"},"pullRequestCount":0,"issueCount":3,"starsCount":37,"forksCount":0,"license":"Apache License 2.0","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-10-25T22:10:01.969Z"}},{"type":"Public","name":"SDOH-NLI","owner":"google-research-datasets","isFork":false,"description":"Description of the dataset: SDOH-NLI is a natural language inference dataset containing ~30k premise-hypothesis pairs with binary entailment labels in the domain of social and behavioral determinants of health.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":4,"forksCount":1,"license":"Creative Commons Attribution 4.0 
International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-10-17T19:12:07.531Z"}},{"type":"Public","name":"SPICE","owner":"google-research-datasets","isFork":false,"description":"SPICE is a stereotype dataset in English containing stereotypes collected in India with community engagement. It spans identity groups and stereotypes unique to India, as well as other stereotypes about gender and nationalities.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":0,"starsCount":2,"forksCount":0,"license":null,"participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-09-26T22:25:39.395Z"}},{"type":"Public","name":"seegull","owner":"google-research-datasets","isFork":false,"description":"SeeGULL is a broad-coverage stereotype dataset in English containing stereotypes about identity groups spanning 178 countries across 8 different geo-political regions across 6 continents, as well as state-level identities within the US and India.","topicNames":[],"topicsNotShown":0,"primaryLanguage":null,"pullRequestCount":0,"issueCount":3,"starsCount":30,"forksCount":1,"license":"Creative Commons Attribution 4.0 International","participation":null,"lastUpdated":{"hasBeenPushedTo":true,"timestamp":"2023-09-25T18:45:45.474Z"}}],"repositoryCount":125,"userInfo":null,"searchable":true,"definitions":[],"typeFilters":[{"id":"all","text":"All"},{"id":"public","text":"Public"},{"id":"source","text":"Sources"},{"id":"fork","text":"Forks"},{"id":"archived","text":"Archived"},{"id":"mirror","text":"Mirrors"},{"id":"template","text":"Templates"}],"compactMode":false},"title":"Repositories"}