
Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection

We propose a Responsible AI (RAI) methodology designed to guide data collection with a set of metrics for an iterative, in-depth analysis of the factors that influence the quality and reliability of the generated data. The methodology comprises a granular set of measurements that inform on the internal reliability of a dataset and on its external stability over time. We validate our approach on nine existing datasets and annotation tasks, spanning four content modalities.
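
As an illustration of the kind of per-repetition reliability measurement the notebooks perform, the sketch below computes a simple majority-vote agreement score per annotated unit and compares it across two repetitions. It is only a minimal sketch: the column names (unit_id, rater_id, label) and the toy data are assumptions for illustration, not the paper's actual metrics or data layout.

import pandas as pd

def majority_agreement(annotations: pd.DataFrame) -> pd.Series:
    """Fraction of raters agreeing with the majority label, per unit.

    `annotations` is assumed to have hypothetical columns
    `unit_id`, `rater_id`, and `label`; the real column names
    depend on the dataset being analysed.
    """
    def unit_score(labels: pd.Series) -> float:
        counts = labels.value_counts()
        return counts.iloc[0] / counts.sum()  # share of the most frequent label
    return annotations.groupby("unit_id")["label"].apply(unit_score)

# Toy example: compare internal reliability across two repetitions of one task.
rep1 = pd.DataFrame({
    "unit_id": [1, 1, 1, 2, 2, 2],
    "rater_id": ["a", "b", "c", "a", "b", "c"],
    "label": ["relevant", "relevant", "irrelevant", "relevant", "relevant", "relevant"],
})
rep2 = rep1.assign(label=["relevant"] * 6)

print("repetition 1 mean agreement:", majority_agreement(rep1).mean())  # ~0.83
print("repetition 2 mean agreement:", majority_agreement(rep2).mean())  # 1.0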

This repository contains the validation of our proposed Responsible AI (RAI) methodology on nine datasets, together with the supplemental material that accompanies our publication. We publish nine notebooks, one for each of the following datasets:

  • Video Concept Relevance (VCR_E, VCR_P, VCR_L, VCR_O, VCR_ALL): Datasets of 208, 234, 223, 59, and 969 video-concept pairs, respectively, annotated in terms of relevance in five annotation tasks focusing on relevant events (VCR_E), relevant people (VCR_P), relevant locations (VCR_L), relevant organizations (VCR_O), and all relevant concepts (VCR_ALL). The concepts were machine-extracted (from video subtitles and the video stream) from ten short English news broadcasts (i.e., videos) published on YouTube, taken from a publicly available dataset used by Inel, Tintarev and Aroyo (2020), Mavridis et al. (2018), and Inel and Aroyo (2022). Each task was repeated three times, at least three months apart; each repetition used the same rater qualifications, and raters were allowed to participate in multiple repetitions.

  • Video Human Facial Expressions (IRep): Dataset of 1090 video recordings of human facial expressions, part of the International Replication (IRep) dataset published by Wong, Paritosh, and Aroyo (2020). Each video recording is annotated with emotions chosen from a set of 30, and each recording was annotated by two raters. The video recordings are generally very short, 5 seconds on average. The task was repeated three times, each time with raters from a different pool, namely raters from Mexico City, Kuala Lumpur, Budapest, and internationals.

  • Product Reviews (PR): Dataset of 20 English product reviews for fashion items (each accompanied by a photo representative of the respective product), randomly selected from the dataset published by Chernushenko et al. (2018). Each product review is annotated with one of three possible issue classes, as described in the study conducted by Qarout et al. (2019). The task was repeated five times at intervals of one week. The raters were not allowed to participate in more than one repetition.

  • Crisis Tweets (CT): Dataset of 20 English crisis-related Twitter messages (e.g., earthquake, flood), randomly selected from the dataset published by Imran, Mitra, and Castillo (2016). Each tweet is annotated with one of nine possible crisis-related options, as described in the study conducted by Qarout et al. (2019). The task was repeated five times at intervals of one week. Each rater was allowed to participate in just one repetition.

  • WordSim (WS353): Dataset of 353 English word pairs published by Finkelstein et al. (2001), used as a benchmark for semantic similarity and word embeddings. Each pair is annotated in terms of how similar the two words are on a scale from 1 to 10. In the original run by Finkelstein et al. (2001), each pair of words was annotated by either 13 or 16 raters, and each rater annotated all pairs. The task was run a second time by Welty, Paritosh, and Aroyo (2019) in 2019, almost 20 years later, on AMT; in this repetition, each pair of words was annotated by 13 raters, and each rater was allowed to annotate as many pairs as they wanted (see the sketch after this list for an example of comparing two such repetitions).
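
As an example of the kind of external-stability check performed across repetitions, the sketch below correlates the mean similarity scores of two hypothetical runs of a WS353-style task. The file names (run_2001.csv, run_2019.csv) and column names (word1, word2, score) are assumptions for illustration, not the repository's actual file layout.

# Illustrative comparison of two repetitions of a word-similarity task.
# File and column names are hypothetical; adapt them to the downloaded data.
import pandas as pd

run_a = pd.read_csv("run_2001.csv")   # assumed columns: word1, word2, score
run_b = pd.read_csv("run_2019.csv")

# Aggregate per word pair: the mean similarity score over all raters.
mean_a = run_a.groupby(["word1", "word2"])["score"].mean()
mean_b = run_b.groupby(["word1", "word2"])["score"].mean()

# Align the two repetitions on the same word pairs and correlate them.
aligned = pd.concat({"a": mean_a, "b": mean_b}, axis=1).dropna()
stability = aligned["a"].corr(aligned["b"], method="spearman")
print(f"Spearman correlation between repetitions: {stability:.3f}")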

All results are published in the following paper:

Oana Inel, Tim Draws and Lora Aroyo: Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection. HCOMP 2023.

If you find the paper and the data useful in your research, please consider citing:

@inproceedings{inel2023collect,
  title={Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection},
  author={Inel, Oana and Draws, Tim and Aroyo, Lora},
  booktitle={To Appear in the Proceedings of the Eleventh AAAI Conference on Human Computation and Crowdsourcing (HCOMP)},
  year={2023},
  organization={AAAI}
}

Note

To reproduce the analysis in the notebooks, each dataset needs to be downloaded from its respective source. All sources are provided in this repository or in the publication.

Furthermore, several Python libraries need to be installed before running the notebooks; an illustrative setup is sketched below.
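
The authoritative dependency list is in the repository itself; as an assumption-only illustration, the imports below are the kind of libraries such analysis notebooks typically rely on.

# Assumed imports for running the analysis notebooks; not the official
# dependency list of this repository.
import pandas as pd               # loading and reshaping annotation tables
import numpy as np                # numeric aggregation of rater judgments
import matplotlib.pyplot as plt   # plotting reliability metrics per repetition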
