Debiasing Vandalism Detection Models at Wikidata: Classification and Evaluation

The Wikidata Vandalism Detectors FAIR-E and FAIR-S are machine learning models for automatic vandalism detection in Wikidata without discriminating against anonymous editors. They were developed as a joint project between Paderborn University and Leipzig University.

This is the classification and evaluation component for FAIR-E, FAIR-S and the baselines WDVD, ORES, and FILTER. The feature extraction can be done with the corresponding feature extraction component.


This source code forms the basis for our WWW 2019 paper Debiasing Vandalism Detection Models at Wikidata. When using the code, please make sure to refer to it as follows:

  author    = {Stefan Heindorf and
               Yan Scholten and
               Gregor Engels and
               Martin Potthast},
  title     = {Debiasing Vandalism Detection Models at Wikidata},
  booktitle = {{WWW}},
  publisher = {{ACM}},
  year      = {2019}

Classification and Evaluation Component


The code was tested with Python 3.5.2 (64-bit) under Windows 10 on a machine with 16 cores and 256 GB RAM.


We recommend Miniconda for easy installation on many platforms.

  1. Create new environment:
    conda create --name www19-fair python=3.5.2 --file requirements.txt
  2. Activate environment:
    activate www19-fair
  3. Install Kernel:
    python -m ipykernel install --user --name www19-fair --display-name www19-fair
  4. Start Jupyter:
    jupyter notebook

Execute Notebooks

Run the Jupyter notebooks in this order:


Required Data

We assume the following project structure:

├── data/
│   ├── classification/
│   ├── corpus-validity/
│   ├── external/
│   │   └── wdvc-2016/
│   ├── features/
│   │   ├── test/
│   │   │   ├── embeddings/
│   │   │   └── features.csv.bz2
│   │   ├── training/
│   │   │   ├── embeddings/
│   │   │   └── features.csv.bz2
│   │   ├── validation/
│   │   │   ├── embeddings/
│   │   │   └── features.csv.bz2
│   │   └── wdvd_features.csv.bz2
│   ├── item-properties/
│   └── property-domains/
└── www19-fair-classification/

classification: This folder will contain the output of the classification component: plots, tables, and vandalism scores. Initially, it can be empty.

corpus-validity: Manually reviewed Wikidata revisions. You can download the folder corpus-validity.

external: Contains the Wikidata Vandalism Corpus 2016.

features: Contains the features for our models. The feature extraction can be done with the feature extraction component. Alternatively, you can download the features directly.

item-properties: The list of Wikidata item properties at the end of the training set. The file can be created with the feature extraction component. Alternatively, you can download the item-properties directly.

property-domains: The domain each Wikidata property belongs to. You can download the folder property-domains.

www19-fair-classification: This git repository.
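
Before running the notebooks, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name find_missing and the two path lists are hypothetical, and the paths are simply copied from the project structure shown above.

```python
from pathlib import Path

# Relative paths copied from the project structure above.
EXPECTED_DIRS = [
    "data/classification",
    "data/corpus-validity",
    "data/external/wdvc-2016",
    "data/features/test/embeddings",
    "data/features/training/embeddings",
    "data/features/validation/embeddings",
    "data/item-properties",
    "data/property-domains",
    "www19-fair-classification",
]
EXPECTED_FILES = [
    "data/features/test/features.csv.bz2",
    "data/features/training/features.csv.bz2",
    "data/features/validation/features.csv.bz2",
    "data/features/wdvd_features.csv.bz2",
]


def find_missing(root):
    """Return the expected paths that do not exist below root (hypothetical helper)."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Assumes the script is started from the directory that contains data/.
    for path in find_missing("."):
        print("missing:", path)
```

The check only looks at names, not at file contents, so an empty features.csv.bz2 would still pass.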

Known Issues

The dataset contains some revisions that change the references of subject-predicate-object triples rather than the triples themselves. To filter out all such reference changes, change the condition df['revisionAction'].isin(revisionActions) in the notebook 01-dataset-analysis.ipynb to (df['revisionAction'].isin(revisionActions) & df['param4'].isna()). This change has little effect on our evaluation results. For consistency with the paper, this repository uses the original version.
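
The effect of the two conditions can be illustrated on a toy DataFrame. This is only a sketch: the column names revisionAction and param4 and the variable revisionActions come from the notebook condition quoted above, while the action names and cell values below are made-up placeholders, not real corpus data.

```python
import numpy as np
import pandas as pd

# Placeholder action names, for illustration only (not the notebook's list).
revisionActions = ["actionA", "actionB"]

# Toy data: the second row has a non-empty param4, which the stricter
# condition treats as a revision to be filtered out.
df = pd.DataFrame({
    "revisionAction": ["actionA", "actionB", "otherAction", "actionB"],
    "param4": [np.nan, "some-value", np.nan, np.nan],
})

# Original condition, as used in this repository and in the paper.
original_mask = df["revisionAction"].isin(revisionActions)

# Stricter condition: additionally require param4 to be missing.
filtered_mask = df["revisionAction"].isin(revisionActions) & df["param4"].isna()

print(df[original_mask])
print(df[filtered_mask])
```

In this toy example the stricter mask keeps one row fewer than the original one, because the row with a non-empty param4 is dropped.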


For questions and feedback please contact:

Stefan Heindorf, Paderborn University
Yan Scholten, Paderborn University
Gregor Engels, Paderborn University
Martin Potthast, Leipzig University


The code by Stefan Heindorf, Yan Scholten, Gregor Engels, and Martin Potthast is licensed under the MIT license.

