marieef/NorMed_neg

NorMed_neg

This is the repository for the NorMed_neg dataset: a Norwegian dataset of biomedical articles annotated with negation.

Original dataset

The original dataset is The Norwegian GastroSurgery Biomedical Negation Corpus. It was created as part of a master's thesis at Stockholm University: Building and evaluating the NegEx negation detection system for Norwegian biomedical text (Sadhukhan, 2021).

English sentences and duplicate sentences have been removed from the original dataset, and sentence-splitting errors in negated sentences have been corrected manually. The remaining sentences have been re-annotated according to a different annotation scheme, which is described below.

Annotation

The annotation is based on the guidelines used in the annotation of the NoReC_neg dataset; see Negation in Norwegian: an annotated dataset (Mæhlum et al., NoDaLiDa 2021).

Some additional assumptions have been made. These are described in Chapter 5 of my master's thesis. (The link will be posted when my thesis is made publicly accessible.)

Brat was used as the annotation tool (see brat: a Web-based Tool for NLP-Assisted Text Annotation, Stenetorp et al., EACL 2012), and the config files are provided. The code and data used to compute inter-annotator agreement are also included.

Data format

The dataset is provided in JSON format:

{
    "sent_id": "biomedical_sentence_1601",
    "text": "Forskjellen i bruk av keisersnitt mellom gruppene vedvarte da førstegangsfødende ble analysert separat , mens forskjellene i bruk av tang- og vakuumforløsning forsvant .",
    "negations": [
        {
            "Cue": [
                [
                    "forsvant"
                ],
                [
                    "159:167"
                ]
            ],
            "Scope": [
                [
                    "forskjellene i bruk av tang- og vakuumforløsning"
                ],
                [
                    "110:158"
                ]
            ],
            "Affixal": false
        }
    ],
    "focus_relations": []
}

The following description of the JSON format is provided in the NoReC_neg repo:

"Each sentence has a dictionary with the following keys and values:

  • 'sent_id': unique NoReC identifier for document + paragraph + sentence which lines up with the identifiers from the document and sentence-level NoReC data
  • 'text': raw text
  • 'negations': list of all negations (dictionaries) in the sentence

Additionally, each negation in a sentence is a dictionary with the following keys and values:

  • 'Cue': a list of text and character offsets for the negation cues
  • 'Scope': a list of text and character offsets for the negation scopes
  • 'Affixal': (True, False). Indicating whether the cue is affixal (prefix,suffix) or not.

Note that a single sentence may contain several annotated negations. All negations must contain at least one cue, but it is possible for a negation to be without scope. These cases are mostly short comments with references to earlier sentences. Both cues and scopes can be multiword and discontinuous."
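As a minimal sketch of how this format could be read, the Python snippet below parses the example record shown earlier and checks that each cue and scope fragment matches the text at its offsets. The helper names `spans` and `check_record` are hypothetical and not part of the repository. The sketch assumes that offsets are "start:end" character positions with an exclusive end, which holds for the example record, and that a negation without scope either omits "Scope" or stores two empty lists.

```python
import json

# The example record from this README, stored as a JSON string.
EXAMPLE = """
{
    "sent_id": "biomedical_sentence_1601",
    "text": "Forskjellen i bruk av keisersnitt mellom gruppene vedvarte da førstegangsfødende ble analysert separat , mens forskjellene i bruk av tang- og vakuumforløsning forsvant .",
    "negations": [
        {
            "Cue": [["forsvant"], ["159:167"]],
            "Scope": [["forskjellene i bruk av tang- og vakuumforløsning"], ["110:158"]],
            "Affixal": false
        }
    ],
    "focus_relations": []
}
"""


def spans(annotation):
    """Pair each text fragment with its (start, end) character offsets.

    `annotation` is a two-element list: [fragment texts, "start:end" strings].
    Discontinuous cues or scopes simply yield more than one tuple.
    """
    texts, offsets = annotation
    result = []
    for fragment, offset in zip(texts, offsets):
        start, end = (int(n) for n in offset.split(":"))
        result.append((fragment, start, end))
    return result


def check_record(record):
    """Return True if every cue and scope fragment equals text[start:end]."""
    text = record["text"]
    for negation in record["negations"]:
        for key in ("Cue", "Scope"):
            # Assumption: a missing scope is an absent key or two empty lists.
            annotation = negation.get(key, [[], []])
            for fragment, start, end in spans(annotation):
                if text[start:end] != fragment:
                    return False
    return True


record = json.loads(EXAMPLE)
cue_spans = spans(record["negations"][0]["Cue"])
scope_spans = spans(record["negations"][0]["Scope"])
offsets_ok = check_record(record)
```

For the example record, `cue_spans` holds a single tuple for "forsvant", and `check_record` confirms that the stored fragments agree with the sentence text. How the dataset files themselves are organised (one JSON list per file, file names, splits) is not described here, so loading them is left out of the sketch.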

Note: In NorMed_neg, each sentence also has a "focus_relations" list. When non-empty, it has the same structure as the "negations" list, except that its entries use the keys "Cue" and "Focus" in place of "Cue" and "Scope". "Focus" denotes a term annotated as negated in the original dataset.
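The "focus_relations" list can be read in the same way as "negations". The sketch below collects cue and focus text pairs from a record. The function name `focus_pairs` is hypothetical, and the record `HYPOTHETICAL` is invented purely for illustration; it is not taken from the dataset. Joining discontinuous fragments with a space is a simplification.

```python
def focus_pairs(record):
    """Return (cue text, focus text) pairs from a record's "focus_relations"."""
    pairs = []
    for relation in record.get("focus_relations", []):
        # Each value is [fragment texts, "start:end" strings]; take the texts.
        cue_texts = relation["Cue"][0]
        focus_texts = relation["Focus"][0]
        pairs.append((" ".join(cue_texts), " ".join(focus_texts)))
    return pairs


# Invented record for illustration only; NOT from NorMed_neg.
HYPOTHETICAL = {
    "sent_id": "hypothetical_sentence_0",
    "text": "ingen feber",
    "negations": [],
    "focus_relations": [
        {"Cue": [["ingen"], ["0:5"]], "Focus": [["feber"], ["6:11"]]}
    ],
}

pairs = focus_pairs(HYPOTHETICAL)
empty = focus_pairs({"focus_relations": []})
```

Records with an empty "focus_relations" list, such as the example record in this README, yield an empty result.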
