WSDM Cup 2017 Wikidata Vandalism Detector: Feature Extraction Component

WSDM Cup 2017 Vandalism Detection Task: Feature Extraction

The WSDM Cup 2017 was a data mining challenge held in conjunction with the 10th International Conference on Web Search and Data Mining (WSDM). The goal of the vandalism detection task was to compute a vandalism score for each Wikidata revision, denoting the likelihood that the revision is vandalism or similarly damaging. This is the feature extraction component for the baselines WDVD, ORES, and FILTER. Classification and evaluation can be done with the corresponding classification component.


This source code forms the basis for the overview paper of the vandalism detection task at WSDM Cup 2017. When using the code, please make sure to refer to it as follows:

  author    = {Stefan Heindorf and
               Martin Potthast and
               Gregor Engels and
               Benno Stein},
  title     = {Overview of the Wikidata Vandalism Detection Task at {WSDM} Cup 2017},
  booktitle = {{WSDM Cup 2017 Notebook Papers}},
  url       = {},
  year      = {2017}

The code is based on the Wikidata Vandalism Detector 2016:

  author    = {Stefan Heindorf and
               Martin Potthast and
               Benno Stein and
               Gregor Engels},
  title     = {Vandalism Detection in Wikidata},
  booktitle = {{CIKM}},
  pages     = {327--336},
  publisher = {{ACM}},
  url       = {},
  year      = {2016}


The code depends on the WSDM Cup 2017 Data Server. It was tested with Java 8 Update 77 (64-bit) on Windows 10.


In Eclipse, executing "Run As -> Maven install" creates a JAR file which includes all dependencies.


The program can be executed with the provided script, which starts a data server for every file of the corpus. The paths to the JAR files can be configured within the script.



Given a CORPUS directory, extract features and store them in the FEATURES file.


./ wdvc-2016/ features.csv.bz2
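
Once extraction has finished, it can be useful to peek at the resulting file before passing it on to the classification component. The following is a minimal Python sketch, not part of this repository, which assumes that the FEATURES file is a bz2-compressed, UTF-8 encoded CSV file whose first line is a header. The function name `read_feature_rows` and the column names in the demo are hypothetical placeholders; the actual columns are determined by the feature extraction itself.

```python
import bz2
import csv
import itertools
import os
import tempfile


def read_feature_rows(path, limit=5):
    """Return (header, rows) with the first `limit` data rows of a bz2-compressed CSV."""
    # Assumption: UTF-8 text with a header line; adjust if the real file differs.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # islice stops after `limit` rows, so a large file is never read in full.
        rows = list(itertools.islice(reader, limit))
    return header, rows


if __name__ == "__main__":
    # Build a tiny stand-in file; the column names are hypothetical placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv.bz2")
        with bz2.open(sample, mode="wt", encoding="utf-8", newline="") as out:
            csv.writer(out).writerows(
                [["revisionId", "exampleFeature"], ["1", "0.5"], ["2", "0.0"]]
            )
        header, rows = read_feature_rows(sample, limit=1)
        print(header)
        print(rows)
```

To inspect a real extraction result, call `read_feature_rows("features.csv.bz2")` with the path that was passed as FEATURES. Reading through `bz2.open` in text mode decompresses the stream on the fly, so the file does not have to be unpacked on disk first.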

Required Data

Computed Feature File

The computed feature file is also available for download:


For questions and feedback please contact:

Stefan Heindorf, Paderborn University
Martin Potthast, Leipzig University
Gregor Engels, Paderborn University
Benno Stein, Bauhaus-Universität Weimar


The code by Stefan Heindorf, Martin Potthast, Gregor Engels, and Benno Stein is licensed under the MIT license.