NorMed_neg

This is the repository for the NorMed_neg dataset: a Norwegian dataset of biomedical articles annotated with negation.

Original dataset

The original dataset is The Norwegian GastroSurgery Biomedical Negation Corpus. It was created as part of a master's thesis at Stockholm University: Building and evaluating the NegEx negation detection system for Norwegian biomedical text (Sadhukhan, 2021).

English sentences and duplicate sentences have been removed from the original dataset, and sentence-splitting errors in negated sentences have been corrected manually. The resulting dataset has been re-annotated according to a different annotation scheme from the original, which is described below.

Annotation

The annotation is based on the guidelines that were used to annotate the NoReC_neg dataset; see Negation in Norwegian: an annotated dataset (Mæhlum et al., NoDaLiDa 2021).

Some additional assumptions have been made. These are described in Chapter 5 of my master's thesis. (The link will be posted when my thesis is made publicly accessible.)

Brat (see brat: a Web-based Tool for NLP-Assisted Text Annotation (Stenetorp et al., EACL 2012)) was used as the annotation tool, and the config files are provided in the brat_config folder. The code and data used to compute inter-annotator agreement are also included, in the IAA folder.

Data format

The dataset is provided in JSON format:

{
    "sent_id": "biomedical_sentence_1601",
    "text": "Forskjellen i bruk av keisersnitt mellom gruppene vedvarte da førstegangsfødende ble analysert separat , mens forskjellene i bruk av tang- og vakuumforløsning forsvant .",
    "negations": [
        {
            "Cue": [
                [
                    "forsvant"
                ],
                [
                    "159:167"
                ]
            ],
            "Scope": [
                [
                    "forskjellene i bruk av tang- og vakuumforløsning"
                ],
                [
                    "110:158"
                ]
            ],
            "Affixal": false
        }
    ],
    "focus_relations": []
}

The following description of the JSON format is provided in the NoReC_neg repo:

"Each sentence has a dictionary with the following keys and values:

'sent_id': unique NoReC identifier for document + paragraph + sentence which lines up with the identifiers from the document and sentence-level NoReC data
'text': raw text
'negations': list of all negations (dictionaries) in the sentence

Additionally, each negation in a sentence is a dictionary with the following keys and values:

'Cue': a list of text and character offsets for the negation cues
'Scope': a list of text and character offsets for the negation scopes
'Affixal': (True, False). Indicating whether the cue is affixal (prefix,suffix) or not.

Note that a single sentence may contain several annotated negations. All negations must contain at least one cue, but it is possible for a negation to be without scope. These cases are mostly short comments with references to earlier sentences. Both cues and scopes can be multiword and discontinuous."
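
The key layout described above can be read with a few lines of Python. The sketch below is not part of the repository. It assumes that each offset string has the form "start:end", counted in characters with an exclusive end, which holds for the example record shown earlier. It also assumes that a negation without scope is stored as an empty list or as a pair of empty lists. The helper name parse_spans is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]


def parse_spans(annotation: List[List[str]]) -> List[Span]:
    """Turn [[text, ...], ["start:end", ...]] into (text, start, end) tuples."""
    if not annotation:
        # Assumption: a missing scope may be stored as an empty list.
        return []
    texts, offsets = annotation
    spans = []
    for text, offset in zip(texts, offsets):
        start, end = (int(part) for part in offset.split(":"))
        spans.append((text, start, end))
    return spans


# The example record from this README.
record = {
    "sent_id": "biomedical_sentence_1601",
    "text": (
        "Forskjellen i bruk av keisersnitt mellom gruppene vedvarte da "
        "førstegangsfødende ble analysert separat , mens forskjellene i "
        "bruk av tang- og vakuumforløsning forsvant ."
    ),
    "negations": [
        {
            "Cue": [["forsvant"], ["159:167"]],
            "Scope": [
                ["forskjellene i bruk av tang- og vakuumforløsning"],
                ["110:158"],
            ],
            "Affixal": False,
        }
    ],
    "focus_relations": [],
}

# Check that every offset pair slices the annotated text out of the sentence.
for negation in record["negations"]:
    for key in ("Cue", "Scope"):
        for text, start, end in parse_spans(negation[key]):
            assert record["text"][start:end] == text
```

Because cues and scopes can be discontinuous, each key may yield several spans; the sketch keeps them as separate tuples rather than joining them.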

Note: In NorMed_neg, each sentence also has a "focus_relations" list. When non-empty, it has the same structure as the "negations" list, with "Cue" and "Focus" taking the place of "Cue" and "Scope". Here "Focus" denotes a term that was annotated as negated in the original dataset.
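
As a rough illustration of how "focus_relations" might be read, the sketch below collects the cue and focus texts of each relation. The layout of a non-empty entry is inferred from the note above, not checked against the data files. The function name focus_pairs is hypothetical, and the record used here is invented for illustration and does not come from the dataset.

```python
from typing import Dict, List, Tuple


def focus_pairs(record: Dict) -> List[Tuple[List[str], List[str]]]:
    """Return (cue texts, focus texts) for each focus relation in a record."""
    pairs = []
    for relation in record.get("focus_relations", []):
        # Assumption: "Cue" and "Focus" are [[text, ...], ["start:end", ...]],
        # mirroring "Cue" and "Scope" in the "negations" list.
        cue_texts = list(relation["Cue"][0])
        focus_texts = list(relation["Focus"][0])
        pairs.append((cue_texts, focus_texts))
    return pairs


# Invented record, NOT taken from NorMed_neg; it only shows the assumed shape.
made_up_record = {
    "sent_id": "made_up_example",
    "text": "Ingen feber .",
    "negations": [],
    "focus_relations": [
        {"Cue": [["Ingen"], ["0:5"]], "Focus": [["feber"], ["6:11"]]}
    ],
}

print(focus_pairs(made_up_record))

# A record with an empty "focus_relations" list yields no pairs.
print(focus_pairs({"focus_relations": []}))
```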
