This repository contains the supplementary data to the paper "Euphemistic Abuse -- A New Dataset and Classification Experiments for Implicitly Abusive Language".
The repository comprises the following data:
This file contains the set of 1797 sentences along with their manual annotation. Each row corresponds to one sentence and each column provides some information regarding that sentence. The first row of the file indicates the semantics of each column (the labels are self-explanatory). In addition to the information whether a sentence was judged abusive (i.e. "ABUSIVE") or not (i.e. "OTHER") -- this corresponds to the judgement of the majority of the 5 crowdworkers -- the file also includes the manual annotation of the different features of the feature-based classifier. The file also contains the fragment and the cue phrase of the respective sentence.
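For illustration, the following minimal Python sketch shows one way to load this file. The file name ("annotation.tsv"), the tab separator and the column names ("label", "fragment", "cue_phrase") are assumptions and have to be adjusted to the actual header given in the first row of the file.

```python
# Minimal sketch for loading the annotation file.
# File name, separator and column names are assumptions; consult the
# header row of the actual file for the correct column labels.
import pandas as pd

df = pd.read_csv("annotation.tsv", sep="\t")

# Each row is one sentence; the gold label is the majority judgement
# of the 5 crowdworkers ("ABUSIVE" vs. "OTHER").
print(df["label"].value_counts())

# Fragment and cue phrase of the first sentence.
print(df.loc[0, ["fragment", "cue_phrase"]])
```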
This file contains, for each sentence, the set of completions generated by GPT-3 that were used for the classifiers "GPT-3::inclusive" and "GPT-3::exclusive" in our paper. Each line represents one individual completion of the respective original sentence; the original sentence is given in the first column and the completion in the last column. For some sentences, there are fewer than 100 completions although we queried 100 prompts per sentence. This is because some of these prompts returned an empty string; completions corresponding to empty strings were omitted from the file.
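A minimal sketch for reading the completion file and grouping the completions by their original sentence is given below; the file name ("gpt3_completions.tsv") and the tab separator are assumptions.

```python
# Minimal sketch for grouping the GPT-3 completions by original sentence.
# File name and separator are assumptions; the original sentence is taken
# from the first column and the completion from the last column.
import csv
from collections import defaultdict

completions = defaultdict(list)
with open("gpt3_completions.tsv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        sentence, completion = row[0], row[-1]
        completions[sentence].append(completion)

# Fewer than 100 completions per sentence are possible since empty
# completions were omitted from the file.
for sentence, comps in list(completions.items())[:3]:
    print(len(comps), sentence)
```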
This file contains the negative instances which were automatically generated with the help of GPT-3. These data are not part of our proposed gold standard (in which the negative data were produced manually via crowdsourcing)! These instances were used in Section 5.2 as an alternative method to produce negative data. (They correspond to the method "automatic" in Table 4.) For fragments that also indicate how the sentence is to be concluded, the sentence generated by GPT-3 sometimes does not match the end of the given fragment. This is because GPT-3 may have generated more than one sentence (with the second sentence containing the concluding part of the fragment). In those cases, we only included the first sentence of the completion.
This directory contains the folds we used for 5-fold cross-validation. The folds are created in such a way that all instances of euphemistic abuse belonging to one particular cue phrase are in the same fold. Thus, in 5-fold cross-validation, the test fold always includes euphemistic abuse of unseen cue phrases, i.e. cue phrases whose euphemistic counterparts were not observed in the training data. This is the strictest and most realistic evaluation setting.
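The precomputed folds in this directory should be used to reproduce our splits. Purely to illustrate the underlying principle, the sketch below shows how cue-phrase-disjoint folds can be derived with scikit-learn's GroupKFold; the annotation file name and column names are assumptions (see above), and GroupKFold is not guaranteed to reproduce our exact folds.

```python
# Illustration of the fold construction principle only; use the folds
# shipped in this directory to reproduce the splits from the paper.
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.read_csv("annotation.tsv", sep="\t")   # assumed file/column names
cue_phrases = df["cue_phrase"]

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(df, df["label"], groups=cue_phrases):
    # All instances sharing a cue phrase end up in the same fold, so the
    # cue phrases of the test fold are unseen during training.
    assert not set(cue_phrases.iloc[train_idx]) & set(cue_phrases.iloc[test_idx])
```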
This directory includes all annotation guidelines. It is split into 2 subdirectories. The subdirectory "Crowdsourcing" contains all guidelines related to the crowdsourcing experiments, while the subdirectory "Experts" contains all guidelines related to the expert annotation (i.e. the annotation guidelines referring to the features of the feature-based classifier). Please note that for the crowdsourcing experiments we used the term "polite insults" when referring to euphemistic abuse. We used that term since we felt it was more intuitive (though less linguistically adequate) for laypeople.
A data sheet summarizing important information about the data contained in this repository.
This data set is published under Creative Commons Attribution 4.0.
The research was partially supported by the Austrian Science Fund (FWF): P 35467-G.
Please direct any questions about these data to Michael Wiegand at Alpen-Adria-Universitaet Klagenfurt.
Michael Wiegand email: michael.wiegand@aau.at
M. Wiegand, J. Kampfmeier, E. Eder, J. Ruppenhofer: "Euphemistic Abuse -- A New Dataset and Classification Experiments for Implicitly Abusive Language", in EMNLP, 2023.