Skip to content
WWW 2019 Paper: Debiasing Vandalism Detection Models at Wikidata: Feature Extraction Component
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

Debiasing Vandalism Detection Models at Wikidata: Feature Extraction

The Wikidata Vandalism Detectors FAIR-E and FAIR-S are machine learning models for automatic vandalism detection in Wikidata without discriminating against anonymous editors. They were developed as a joint project between Paderborn University and Leipzig University.

This is the feature extraction component that extracts features for FAIR-E and FAIR-S. Classification and evaluation for FAIR-E, FAIR-S and the baselines WDVD, ORES, and FILTER can be done with the corresponding classification and evaluation component.


This source code forms the basis for our WWW 2019 paper Debiasing Vandalism Detection Models at Wikidata. When using the code, please make sure to refer to it as follows:

  author    = {Stefan Heindorf and
               Yan Scholten and
               Gregor Engels and
               Martin Potthast},
  title     = {Debiasing Vandalism Detection Models at Wikidata},
  booktitle = {{WWW}},
  publisher = {{ACM}},
  year      = {2019}

Feature Extraction Component


The code was tested with Java 8, under Linux 4.9.0-8-amd64 with 16 cores and 256 GB RAM.

We require an installation of 7z for decompression.


We assume the following project structure:

├── data
├── www19-fair-feature-classification
└── www19-fair-feature-extraction

Required Data

Before you can start the feature extraction, you need to download the following data:

  1. Wikidata Vandalism Corpus 2016:

    Expected Path: www19-fair/data/external/wdvc-2016/

  2. Wikidata JSON Dump of 2/29/2016:

    Expected Path: www19-fair/data/external/wikidata-20160229-all.json.bz2

  3. WDVD features:

    Expected Path: www19-fair/data/features/wdvd_features.csv.bz2


To start the feature extraction, you need to execute ./

Computed Features

This feature extraction component will compute the following feature files:

├── features/
│   ├── test/
│   │   ├── embeddings/
│   │   └── features.csv.bz2
│   ├── training/
│   │   ├── embeddings/
│   │   └── features.csv.bz2
│   └── validation/
│       ├── embeddings/
│       └── features.csv.bz2
├── item-properties/
│   └── item-properties.bz2
└── wikidata-graph/
    └── wikidata-graph.csv.bz2

features: Contains the features for the models FAIR-E and FAIR-S. The file has the following columns: revisionId, isEditingTool, subject, predicate, object, superSubject, superObject. Each row was extracted from the Wikidata Vandalism Corpus 2016 and represents a revision that adds, removes, or updates statements between two Wikidata items.

embeddings: This folder contains predicate embeddings as described in the paper. We store embeddings in four CSR-matrices: subjectOut, predicate, objectOut, objectIn.

item-properties: The list of Wikidata item properties extracted from the Wikidata JSON Dump from 2/29/2016. Item properties are the Wikidata properties solely used to describe relations between two Wikidata items.

wikidata-graph: Statements between two Wikidata items extracted from the Wikidata JSON Dump from 2/29/2016. This file contains subject-predicate-object-triple where subject and object are Wikidata items. The predicate is an item property.


For questions and feedback please contact:

Stefan Heindorf, Paderborn University
Yan Scholten, Paderborn University
Gregor Engels, Paderborn University
Martin Potthast, Leipzig University


The code by Stefan Heindorf, Yan Scholten, Gregor Engels, Martin Potthast is licensed under a MIT license.

You can’t perform that action at this time.