- Named Entity Recognition (NER)
- Chinese Word Segmentation (CWS)
- Part-of-Speech Tagging (POS)
- Ultra-fine Entity Typing
- Event Extraction
- Entity Relation Joint Extraction
- End-to-End Entity Linking

Language | Dataset | Size | #Types | Description | Paper | Download |
---|---|---|---|---|---|---|
Chinese | msra | 46364/-/4365 | 3 | | Levow | damo/msra_ner |
Chinese | resume | 3821/463/477 | 9 | | Zhang & Yang | damo/resume_ner |
Chinese | weibo | 1350/269/270 | 4 | | Peng & Dredze | damo/weibo_ner |
Chinese | ontonotes-v4-zh | 15724/4301/4346 | | | - | ldc/ontonotes-v4 |
Chinese | cluener2020 | 10748/1343/1345 | 10 | | Xu et al., 2020 | github/cluener2020 |
Chinese | people_dairy1998 | | 3 | | | github/ChineseNLPCorpus |
Chinese | people_dairy2014 | | 3 | | | baidu-pan password:1fa3 |
Chinese | cmeee | 15000/5000/3000 | | CMeEE dataset in CBLUE benchmark | Zhang et al., 2022 | github/cblue |
Chinese | yidu-s4k | | | | - | openkg/yidu-s4k |
Chinese | ecommerce | | | | Jie et al., 2019 | github/ner_incomplete_annotation/ecommerce |
Chinese | dlner | | | | Xu et al., 2017 | github/dlner |
Dutch | conll2002-nl | 15796/2895/5196 | 4 | | Tjong Kim Sang, 2002 | |
English | wnut2016 | 2394/1000/3850 | | Noisy User-generated Text | Strauss et al., 2016 | damo/wnut16 |
English | wnut2017 | 3394/1009/1287 | | | Derczynski et al., 2017 | damo/wnut17 |
English | conll2003-en | 14041/3250/3453 | 4 | | Tjong Kim Sang & De Meulder, 2003 | |
English | conllpp | 14041/3250/3453 | 4 | corrected version of the conll03-en NER dataset | Wang et al., 2019 | damo/conllpp_ner |
English | ontonotes-v5-en | 59924/8528/8262(TBD) | | | Pradhan et al., 2013 | ldc/ontonotes-v5 |
English | ai | 100/350/431 | | | Liu et al., 2020 | damo/cross_ner |
English | literature | 100/400/416 | | | Liu et al., 2020 | damo/cross_ner |
English | music | 100/541/465 | | | Liu et al., 2020 | damo/cross_ner |
English | politics | 200/541/651 | | | Liu et al., 2020 | damo/cross_ner |
English | science | 200/450/543 | | | Liu et al., 2020 | damo/cross_ner |
English | bc5cdr | 4560/4581/4797 | | | Li et al., 2016 | |
English | ncbi | 5424/923/940 | | | Doğan et al., 2014 | |
English | mit-movie | 6816/1000/1953(TBD) | | | Liu et al., 2013 | mit/movie |
English | mit-restaurant | 6900/760/1521 | | | Liu et al., 2013 | mit/restaurant |
English | ace2004-en | | 7 | nested ner | Doddington et al., 2005 | ldc/ace04 |
English | ace2005-en | | 7 | nested ner | - | ldc/ace05 |
English | kbp2017 | | | nested ner | - | - |
English | genia | | | nested ner | Ohta et al., 2002 | |
English | few-nerd | 131767/18824/37548 | 8 / 66 | a few-shot ner dataset | Ding et al., 2021 | |
English | wikigold | | | | Balasuriya et al., 2009 | |
English | bionlp2014 | | | | Collier & Kim, 2004 | |
English | fin | | | | Alvarado et al., 2015 | |
English | btc | 6338/1001/2000 | 3 | | Derczynski et al., 2016 | |
English | ttc | | | | Rijhwani & Preoțiuc-Pietro | github/ttc |
English | tweebank | | | | Jiang et al., 2022 | github/tweebank |
English | tweetner7 | | | | Ushio et al., 2022 | huggingface/tweetner7 |
German | conll2003-de | 12152/2866/3005 | 4 | | Tjong Kim Sang & De Meulder, 2003 | |
Spanish | conll2002-es | 8302/1919/1517 | 4 | | Tjong Kim Sang, 2002 | |
English | twitter2015 | | | multi-modal | Zhang et al., 2018 | |
English | snap | | | multi-modal | Lu et al., 2018 | github/UMT |
English | twitter2017 | | | multi-modal | Yu et al., 2020 | github/UMT |
English | wiki-diverse | | | constructed from wiki-diverse (a multi-modal entity typing dataset) | Wang et al., 2022 | github/wikidiverse |
11 langs | multiconer2022 | - | 6 | dataset of SemEval 2022 Task 11 (English, Spanish, Dutch, Russian, Turkish, Korean, Farsi, German, Chinese, Hindi, and Bangla) | Malmasi et al., 2022 | aws/multiconer |
282 langs | wikiann | - | | silver-standard data | Pan et al., 2017 | github/wikiann |
9 langs | wikiner | - | | silver-standard data | Nothman et al., 2013 | |
9 langs | wikineural | - | | silver-standard data | Tedeschi et al., 2021 | |
10 langs | multinerd | - | | silver-standard data | Tedeschi & Navigli, 2022 | |

Language | Dataset | Size | #Types | Description | Paper | Download |
---|---|---|---|---|---|---|
Chinese | PKU | 19056/-/1944 | - | - | sighan05 | train test |
Chinese | MSRA | 86924/-/3985 | - | - | sighan05 | train test |
Chinese | CTB6 | 23401/2078/2795 | - | - | Chinese Tree Bank v6 | train dev test |

Language | Dataset | Size | #Types | Description | Paper | Download |
---|---|---|---|---|---|---|
Chinese | CTB5 | - | - | - | | train dev test |
Chinese | CTB8 | 23401/2078/2795 | - | - | Chinese Tree Bank v6 | train dev test |
Chinese | CTB9 | - | - | - | | train dev test |

Language | Dataset | Size | #Types | Description | Paper | Download |
---|---|---|---|---|---|---|
English | UFET | 1998/1998/1998 | 10331 | Ultra-fine Entity Typing | Choi et al., 2018 | izhx404/ufet |
Chinese | CFET | 2880/960/958 | 1299 | Unofficial split, no official split provided. | Lee et al., 2020 | izhx404/cfet |

Language | Dataset | Size | Description | Paper | Download |
---|---|---|---|---|---|
Chinese | FewFC | 7185/899/898 | Passage level | Zhou et al., 2021 | here |
Chinese | Duee | 11908/1492/34904 | Passage level | Li et al., 2020 | here |
Chinese | Duee-fin | 7015/1171/59394 | Document level | Li et al., 2020 | here |
Chinese | ChFinAnn | 25632/3204/3204 | Document level | Zheng et al., 2019 | here |
English | WIKIEVENTS | 206/20/20 | Document level | Li et al., 2021 | train / dev / test |
English | RAMS | 7329/924/871 | Document level | Ebner et al., 2020 | here |

Language | Dataset | Size | Description | Paper | Download |
---|---|---|---|---|---|
English | NYT | - | - | Ren et al., 2017 | here |
English | NYT10-HRL/11-HRL | 70339/-/4006;62648/-/369 | obtained via the preprocessing in the HRL paper | Takanobu et al., 2019 | here |
English | WebNLG | 5019/-/703 | - | Gardent et al., 2017 | here |
English | ADE | - | - | Gurulingappa et al., 2012 | - |
English | SciERC | 1816/275/551 | - | Luan et al., 2018 | here |
English | CoNLL04 | - | - | Roth et al., 2004 | - |
English | ACE04 | - | - | - | here |
English | ACE05 | 10051/2424/2050 | - | - | here |
Chinese | DuIE2.0 | 171135/-/21055 | - | Li et al., 2019 | here |

Language | Domain | Dataset | Train/Dev/Test/KB Size | Paper/Link | Download |
---|---|---|---|---|---|
English | News | AIDA-CoNLL | 12820/4242/3953/5903530 | Hoffart et al., 2011 | here |
English | Medical | BC5CDR | 9535/9481/10032/2291 | Li et al., 2016 | here |
English | Speech | NLPCC2022 | 28400/7640/2905/118795 | NLPCC2022 | here |
Chinese | ShortText | CCKS2020 | 69691/9148/-/3234418 | CCKS2020 | - |
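
The Size cells in the tables above appear to follow a `train/dev/test` convention (with a fourth KB field in the entity linking table), where `-` marks a split that is not provided. A minimal sketch of reading such a cell, assuming that convention holds; `parse_size` is a hypothetical helper written for this illustration and not part of any toolkit. Compound cells such as the NYT10-HRL/11-HRL one would need to be split on `;` first.

```python
from typing import List, Optional


def parse_size(cell: str) -> List[Optional[int]]:
    """Split a Size cell such as '46364/-/4365' into per-split counts.

    Assumption: fields are separated by '/', and '-' or an empty field
    means the split is missing, which is returned as None.
    """
    counts: List[Optional[int]] = []
    for field in cell.split("/"):
        # Drop surrounding whitespace and the '(TBD)' annotation used in some rows.
        field = field.replace("(TBD)", "").strip()
        counts.append(int(field) if field.isdigit() else None)
    return counts


print(parse_size("46364/-/4365"))
print(parse_size("12820/4242/3953/5903530"))
```

Returning `None` rather than `0` for a missing split keeps "not provided" distinct from "empty", which matters when summing or comparing dataset sizes.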