This is the repository for the NorMed_neg dataset: a Norwegian dataset of biomedical articles annotated with negation.
The original dataset is The Norwegian GastroSurgery Biomedical Negation Corpus. It was created as part of a master's thesis at Stockholm University: Building and evaluating the NegEx negation detection system for Norwegian biomedical text (Sadhukhan, 2021).
English sentences and duplicate sentences have been removed from the original dataset, and sentence splitting errors in negated sentences have been corrected manually. The cleaned dataset has been re-annotated according to a different annotation scheme, which is described below.
The annotation follows the guidelines that were used to annotate the NoReC_neg dataset; see Negation in Norwegian: an annotated dataset (Mæhlum et al., NoDaLiDa 2021).
Some additional assumptions have been made. These are described in Chapter 5 of my master's thesis. (The link will be posted when my thesis is made publicly accessible.)
Brat (see brat: a Web-based Tool for NLP-Assisted Text Annotation (Stenetorp et al., EACL 2012)) was used as the annotation tool, and the config files are provided. The code and data used for computation of inter-annotator agreement are also included.
The dataset is provided in JSON format:
{
    "sent_id": "biomedical_sentence_1601",
    "text": "Forskjellen i bruk av keisersnitt mellom gruppene vedvarte da førstegangsfødende ble analysert separat , mens forskjellene i bruk av tang- og vakuumforløsning forsvant .",
    "negations": [
        {
            "Cue": [
                [
                    "forsvant"
                ],
                [
                    "159:167"
                ]
            ],
            "Scope": [
                [
                    "forskjellene i bruk av tang- og vakuumforløsning"
                ],
                [
                    "110:158"
                ]
            ],
            "Affixal": false
        }
    ],
    "focus_relations": []
}
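In the example above, the offsets appear to be "start:end" character positions with an exclusive end, so each annotated fragment can be recovered by slicing the sentence text. The following is a minimal sketch of such a check, not code from this repository; the helper names `parse_offset` and `check_spans` are hypothetical.

```python
import json


def parse_offset(offset):
    """Convert an offset string such as "159:167" into (start, end) integers."""
    start, end = offset.split(":")
    return int(start), int(end)


def check_spans(record):
    """Return True if every Cue/Scope fragment equals the text slice at its offset."""
    for negation in record["negations"]:
        for key in ("Cue", "Scope"):
            # Each annotation is a pair: [list of fragments, list of offset strings].
            fragments, offsets = negation[key]
            for fragment, offset in zip(fragments, offsets):
                start, end = parse_offset(offset)
                if record["text"][start:end] != fragment:
                    return False
    return True


# The example record shown above, parsed from its JSON form.
record = json.loads("""
{
    "sent_id": "biomedical_sentence_1601",
    "text": "Forskjellen i bruk av keisersnitt mellom gruppene vedvarte da førstegangsfødende ble analysert separat , mens forskjellene i bruk av tang- og vakuumforløsning forsvant .",
    "negations": [
        {
            "Cue": [["forsvant"], ["159:167"]],
            "Scope": [["forskjellene i bruk av tang- og vakuumforløsning"], ["110:158"]],
            "Affixal": false
        }
    ],
    "focus_relations": []
}
""")

print(check_spans(record))
```

A check like this can be useful after any preprocessing that changes whitespace or tokenization, since the offsets refer to the "text" string exactly as stored.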
The following description of the JSON format is provided in the NoReC_neg repo:
"Each sentence has a dictionary with the following keys and values:
- 'sent_id': unique NoReC identifier for document + paragraph + sentence which lines up with the identifiers from the document and sentence-level NoReC data
- 'text': raw text
- 'negations': list of all negations (dictionaries) in the sentence
Additionally, each negation in a sentence is a dictionary with the following keys and values:
- 'Cue': a list of text and character offsets for the negation cues
- 'Scope': a list of text and character offsets for the negation scopes
- 'Affixal': (True, False). Indicating whether the cue is affixal (prefix,suffix) or not.
Note that a single sentence may contain several annotated negations. All negations must contain at least one cue, but it is possible for a negation to be without scope. These cases are mostly short comments with references to earlier sentences. Both cues and scopes can be multiword and discontinuous."
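Because cues and scopes can be discontinuous, and a scope can be missing, it may help to convert each annotation into a list of spans before further processing. The sketch below shows one way to do this. It assumes that a negation without scope is stored as two empty lists, which should be verified against the data. The helper `to_spans` and the example sentence are invented for illustration and are not taken from the dataset.

```python
def to_spans(annotation):
    """Pair each text fragment with its (start, end) character offsets.

    Assumes `annotation` is a two-element list [fragments, offsets].
    Returns a list of (start, end, fragment) tuples sorted by position.
    """
    fragments, offsets = annotation
    spans = []
    for fragment, offset in zip(fragments, offsets):
        start, end = (int(part) for part in offset.split(":"))
        spans.append((start, end, fragment))
    return sorted(spans)


# Hypothetical negation with a discontinuous scope (invented, not from the dataset).
text = "Pasienten hadde ikke feber ."
negation = {
    "Cue": [["ikke"], ["16:20"]],
    "Scope": [["Pasienten hadde", "feber"], ["0:15", "21:26"]],
    "Affixal": False,
}

cue_spans = to_spans(negation["Cue"])
scope_spans = to_spans(negation["Scope"])
empty_scope = to_spans([[], []])  # assumed encoding of a negation without scope

print(cue_spans)
print(scope_spans)
print(empty_scope)
```

Keeping each fragment as its own span, rather than joining the fragments into one string, preserves the gap where the cue sits between the two parts of the scope.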
Note: In NorMed_neg, each sentence also has a "focus_relations" list. If non-empty, it has the same structure as the "negations" list, except that each entry has the keys "Cue" and "Focus" in place of "Cue" and "Scope". "Focus" denotes a term annotated as negated in the original dataset.
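As a rough sketch of how the "focus_relations" list might be read, the function below collects the cue and focus text of each relation in a sentence. It assumes that "Cue" and "Focus" are stored as [fragments, offsets] pairs, like "Cue" and "Scope" in the "negations" list. The function name `focus_pairs` and the example sentence are hypothetical and not taken from the dataset.

```python
def focus_pairs(sentence):
    """Return a list of (cue text, focus text) pairs for one sentence dictionary."""
    pairs = []
    for relation in sentence.get("focus_relations", []):
        # The first element of each annotation holds the text fragments.
        cue_fragments = relation["Cue"][0]
        focus_fragments = relation["Focus"][0]
        pairs.append((" ".join(cue_fragments), " ".join(focus_fragments)))
    return pairs


# Hypothetical sentence entry (invented for illustration, not from the dataset).
sentence = {
    "sent_id": "hypothetical_sentence_1",
    "text": "Pasienten hadde ikke feber .",
    "negations": [],
    "focus_relations": [
        {"Cue": [["ikke"], ["16:20"]], "Focus": [["feber"], ["21:26"]]}
    ],
}

print(focus_pairs(sentence))
```

Sentences with an empty "focus_relations" list simply yield an empty result, so the function can be applied to every sentence without a separate check.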