
kuhumcst/semdax

 
 


SemDaX corpora

Sense-annotated corpora from the Semantic Processing Across Domains project. This project pools together the data from several articles related to sense annotation for Danish corpora.

This repository contains three main folders:

  1. supersenses contains the all-words supersense-annotated corpus. It contains a folder official_distribution with the files used for training and testing in the noted articles, and a folder all_annotations with all the annotations produced by each annotator prior to adjudication. The corpus is made up of six domains from the ClarinDK corpus plus the test section of the Danish Dependency Treebank (DDT).
  2. lexicalsample contains the lexical-sample annotations for a regular, dictionary-based sense inventory and for a supersense-clustered inventory.
  3. active_learning contains the annotations resulting from "Active Learning for Sense Annotation".
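
As a quick sanity check after cloning, the folder layout described above can be verified with a short script. This is only a minimal sketch: the folder names come from the list above, while the function name `missing_dirs`, the constant `EXPECTED_DIRS`, and the local clone path `semdax` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names taken from the list above; the nesting of official_distribution
# and all_annotations under supersenses follows the description in item 1.
EXPECTED_DIRS = (
    "supersenses/official_distribution",
    "supersenses/all_annotations",
    "lexicalsample",
    "active_learning",
)


def missing_dirs(repo_root):
    """Return the expected sub-folders that are absent under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # "semdax" is an assumed path to a local clone of the repository.
    print(missing_dirs("semdax"))
```

An empty list means all the folders named above were found; any names returned are the ones missing from the local copy.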

The following publications make use of or document the construction of this resource.

@inproceedings{pedersen-etal-2016-semdax,
    title = "The {S}em{D}a{X} Corpus ― Sense Annotations with Scalable Sense Inventories",
    author = "Pedersen, Bolette  and
      Braasch, Anna  and
      Johannsen, Anders  and
      Alonso, H{\'e}ctor Mart{\'\i}nez  and
      Nimb, Sanni  and
      Olsen, Sussi  and
      S{\o}gaard, Anders  and
      S{\o}rensen, Nicolai Hartvig",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L16-1136",
    pages = "842--847",
    abstract = "We launch the SemDaX corpus which is a recently completed Danish human-annotated corpus available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. The aim of the developed corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish. To these aims, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than what is normally seen: for the all-words task we double annotated 60{\%} of the material and for the lexical sample task 100{\%}. We include in the corpus not only the adjucated files, but also the diverging annotations. In other words, we consider not all disagreement to be noise, but rather to contain valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.",
}

@inproceedings{olsenetal2015,
  title={Coarse-Grained Sense Annotation of Danish across Textual Domains},
  author={Olsen, Sussi and Pedersen, Bolette Sandford and Mart{\'\i}nez Alonso, H{\'e}ctor and Johannsen, Anders},
  booktitle={Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA},
  pages={37},
  year={2015}
}

@inproceedings{martinezalonsoetal2015supersenses,
  title={Supersense tagging for Danish},
  author={Mart{\'\i}nez Alonso, H{\'e}ctor and Johannsen, Anders and Olsen, Sussi and Nimb, Sanni and S{\o}rensen, Nicolai Hartvig and Braasch, Anna and S{\o}gaard, Anders and Pedersen, Bolette Sandford},
  booktitle={Nordic Conference of Computational Linguistics NODALIDA 2015},
  pages={21},
  year={2015}
}

@inproceedings{martinezalonsoetal2016,
  title={An empirically grounded expansion of the supersense inventory},
  author={Mart{\'\i}nez Alonso, H{\'e}ctor and Johannsen, Anders and Olsen, Sussi and Nimb, Sanni and Pedersen, Bolette Sandford},
  booktitle={Global Wordnet Conference 2016 (to appear)},
}


@inproceedings{martinezalonsoetal2015active,
  title={Active learning for sense annotation},
  author={Mart{\'\i}nez Alonso, H{\'e}ctor and Plank, Barbara and Johannsen, Anders and S{\o}gaard, Anders},
  booktitle={Nordic Conference of Computational Linguistics NODALIDA 2015},
  pages={245},
  year={2015}
}
