WNUT 2020: Named Entity Extraction

The WNUT 2020 shared task is built on wet lab protocol data. Wet lab protocols are collections of steps from different lab procedures. They are noisy, dense, and domain-specific. Automatic or semi-automatic conversion of protocols into a machine-readable format benefits medical and biological research. In this task, participants are invited to perform event recognition and relation extraction over these lab protocols. Note that these protocols are written by researchers and lab technicians from all over the world, and some of them may contain non-standard language or spelling errors.

All of the protocols were collected from protocols.io using its public APIs. The full protocol dump is also available in JSON format in their GitHub repository. For this shared task, we provide annotations for 615 protocols. The brat-styled annotated protocols can be visualized at: http://bit.ly/WNUT2020. Below is a sample of the input data:

Shared-task Organizers

Jeniya Tabassum (Ohio State University)
Wei Xu (Ohio State University → Georgia Tech)
Alan Ritter (Ohio State University → Georgia Tech)

Shared-task Details

Data

Our Wet Lab Protocol (WLP) dataset consists of 615 unique protocols from the 623 protocols of Kulkarni et al. (2018). It excludes the following 8 duplicated protocols:

protocol 45 (duplicate of protocol 441)
protocol 459 (duplicate of protocol 310)
protocol 464 (duplicate of protocol 46)
protocol 480 (duplicate of protocol 473)
protocol 482 (duplicate of protocol 474)
protocol 483 (duplicate of protocol 475)
protocol 484 (duplicate of protocol 476)
protocol 621 (duplicate of protocol 570)

After discarding the duplicate protocols, the remaining 615 unique protocols were re-annotated in brat by 3 annotators with an inter-annotator agreement of 0.75, measured by span-level Cohen's Kappa. Our annotators added the missing entity relations and also corrected the inconsistencies. The updated dataset is provided in the data directory in both StandOff and CoNLL format. The data is divided into 3 sub-directories as below:

train_data: 370 protocols with 8444 sentences
dev_data: 122 protocols with 2839 sentences
test_data: 123 protocols with 2813 sentences
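
As an illustration of the StandOff annotations, here is a minimal sketch of a reader for brat-style `.ann` content. It assumes the standard brat layout (tab-separated fields, `T` lines for text-bound entities, `R` lines for binary relations). The sample annotation string and the `Entity`/`Relation` helper types are made up for this example and are not taken from the dataset or from the repository's scripts.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


class Relation(NamedTuple):
    id: str
    label: str
    arg1: str
    arg2: str


def parse_standoff(ann_text):
    """Parse brat standoff text into entity and relation lists."""
    entities, relations = [], []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # e.g. "T1<TAB>Action 0 3<TAB>Add"
            label, span = fields[1].split(" ", 1)
            # Discontinuous spans use ';' between fragments; keep the outer bounds.
            offsets = [int(x) for frag in span.split(";") for x in frag.split()]
            text = fields[2] if len(fields) > 2 else ""
            entities.append(Entity(ann_id, label, offsets[0], offsets[-1], text))
        elif ann_id.startswith("R"):
            # e.g. "R1<TAB>Acts-on Arg1:T1 Arg2:T2"
            label, arg1, arg2 = fields[1].split()
            relations.append(
                Relation(ann_id, label, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


# Hypothetical annotation for the sentence "Add 50 ul buffer" (illustrative only).
SAMPLE_ANN = (
    "T1\tAction 0 3\tAdd\n"
    "T2\tReagent 10 16\tbuffer\n"
    "R1\tActs-on Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_standoff(SAMPLE_ANN)
```

Other brat annotation types (events, attributes, notes) are ignored by this sketch and would need their own branches.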

We will release a new test set on August 31, 2020 for the official evaluation. Please sign up here to receive the data by email. We will use a mailing list for announcements and discussions.

Timeline

Data available: June 8
Evaluation window: Aug 31 - Sep 4
System Description Papers Submitted: Sep 18
Papers Reviewed: Sep 28
Papers Camera-ready: Oct 8
Workshop Day: Nov 11

Baselines

We have provided a feature-based linear CRF tagger for the Named Entity Recognition task.
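
To give a sense of what "feature-based" means for a CRF tagger, the following is a minimal sketch of per-token feature extraction. The feature set here (lowercased form, word shape, suffix, neighbouring tokens) is an assumption chosen for illustration; it is not the feature set of the provided baseline, and the function names are hypothetical.

```python
import re


def word_shape(token):
    """Map characters to a coarse shape, e.g. 'PBS' -> 'XXX', '50' -> 'dd'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "suffix3": tok[-3:].lower(),
        "is_digit": tok.isdigit(),
        # Context features, with boundary markers at the sentence edges.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """One feature dictionary per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


features = sentence_features(["Add", "50", "ul", "PBS"])
```

Lists of feature dictionaries like these are the typical input to CRF toolkits that accept per-token features; a gazetteer lookup could be added as one more key in the same dictionary.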

Evaluation

Participants are required to produce predictions on the protocols in StandOff format or CoNLL format, which will be compared with the gold data using the evaluation script.
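
For a quick local sanity check before running the evaluation script, the comparison against gold data can be sketched as exact-match, span-level precision, recall, and F1 over BIO tags. This is an assumption about a reasonable metric, not a description of the official script; the function names and the toy tag sequences are hypothetical.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end, label) spans."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
                start, label = None, None
            # An I- tag without a matching open span is treated as a span start.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
    return spans


def evaluate(gold_sents, pred_sents):
    """Exact-match span precision, recall, and F1 over a list of sentences."""
    gold, pred = set(), set()
    for k, (g, p) in enumerate(zip(gold_sents, pred_sents)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold |= {(k,) + s for s in bio_to_spans(g)}
        pred |= {(k,) + s for s in bio_to_spans(p)}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the predicted Amount span is one token too short.
GOLD = [["B-Action", "O", "B-Amount", "I-Amount", "B-Reagent"]]
PRED = [["B-Action", "O", "B-Amount", "O", "B-Reagent"]]

precision, recall, f1 = evaluate(GOLD, PRED)
```

In the toy example two of the three predicted spans match a gold span exactly, so precision, recall, and F1 all come out at 2/3.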

Directory Structure

.
├── code
│   ├── baseline_CRF
│   │   ├── Conll_Outputs
│   │   ├── Standoff_Outputs
│   │   └── gazetters
│   ├── eval
│   └── scripts
│       ├── convert_conll_to_standoff
│       └── convert_standoff_conll_ner
└── data
    ├── dev_data
    │   ├── Conll_Format
    │   └── Standoff_Format
    ├── test_data
    │   ├── Conll_Format
    │   └── Standoff_Format
    └── train_data
        ├── Conll_Format
        └── Standoff_Format
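
The files under the Conll_Format directories can be loaded with a small reader along the following lines. This sketch assumes the usual CoNLL layout: one token per line, whitespace-separated columns with the token first and the tag last, and a blank line between sentences. The exact column layout of the released files should be checked before relying on it, and the sample string below is made up for illustration.

```python
def read_conll(lines):
    """Read CoNLL-style lines into a list of (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        cols = line.split()
        tokens.append(cols[0])   # assumed: first column is the token
        tags.append(cols[-1])    # assumed: last column is the tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Hypothetical two-sentence sample (not taken from the dataset).
SAMPLE_CONLL = "Add\tB-Action\n50\tB-Amount\nul\tI-Amount\n\nMix\tB-Action\n"

sentences = read_conll(SAMPLE_CONLL.splitlines())
```

Because the function takes any iterable of lines, an open file handle can be passed in place of the sample string's lines.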

Relevant Papers:

Paper about the shared task:

@inproceedings{tabassum2020,
  author={Tabassum, Jeniya and Lee, Sydney and Xu, Wei and Ritter, Alan },
  title={{WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols}},
  booktitle={Proceedings of EMNLP 2020 Workshop on Noisy User-generated Text {(WNUT)}},
  year = {2020}
}

Paper about the dataset:

@inproceedings{kulkarni2018wetlab,
  author     = {Kulkarni, Chaitanya and Xu, Wei and Ritter, Alan and Machiraju, Raghu},
  title      = {An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols},
  booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)},
  year       = {2018}
}
