
Hate Speech Datasets

This page catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful for tasks such as training a natural language processing system to detect such language.

Datasets on this list should, in most cases, be obtainable either from the authors or by direct download, though other significant work may also be listed.

The list is maintained by Leon Derczynski and Bertie Vidgen.

Please contribute via pull request or email. An accompanying data statement is preferred for every corpus.


  1. L-HSAB


  1. DKhate
    • Annotation type: Offensive speech, target, and grade
    • Annotation level: Document
    • Text genre: Twitter, Reddit, News comments
    • Size: 3600
    • Data link: to appear 2019
    • Reference: Cross-lingual Multi-Platform Hate Speech Detection (to appear)


  1. Davidson et al.

  2. Wikipedia Detox

  3. Waseem & Hovy

  4. Imperium

  5. OffensEval 2019

  6. Liu et al.

  7. StormfrontWS

  8. Toxic Comment Classification Challenge

  9. hatEval

  10. Founta et al.


  1. IWG_hatespeech_public

  2. GermEval 2018

  3. GermEval 2019


  1. Ibrohim & Budi


  1. HSC



  1. PolEval 2019


  1. Fortuna et al.

  2. OffComBR


  1. hatEval

Lists of abusive keywords

  1. Hatebase

    • "Researchers are encouraged to take advantage of Hatebase's vocabulary dataset, which is a valuable lexicon for searching other data repositories such as public forums, as well as Hatebase's sightings dataset, which is useful for trending analysis"
    • Data link:
  2. Hurtlex

  3. Gorrell et al.

  4. Wiegand et al.

  5. Chandrasekharan et al.
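
Keyword lists like those above are often used to search or pre-filter other text collections. The following is a minimal sketch of that idea, not a method taken from any of the listed resources: it assumes the lexicon is available as a plain list of strings, and the function names `build_matcher` and `find_terms` and the placeholder terms are hypothetical. Matching is case-insensitive and restricted to whole words or phrases.

```python
import re
from typing import Iterable, List


def build_matcher(lexicon: Iterable[str]) -> "re.Pattern[str]":
    """Compile one case-insensitive, whole-term pattern from lexicon entries."""
    # Normalise entries, drop empties, and try longer terms first so that
    # multi-word phrases win over shorter terms they contain.
    terms = sorted(
        {t.strip().lower() for t in lexicon if t.strip()},
        key=len,
        reverse=True,
    )
    if not terms:
        raise ValueError("lexicon is empty")
    alternation = "|".join(re.escape(t) for t in terms)
    # Lookarounds require a non-word character (or string edge) on both sides.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def find_terms(text: str, matcher: "re.Pattern[str]") -> List[str]:
    """Return the lexicon terms found in text, lower-cased, in order of appearance."""
    return [m.group(0).lower() for m in matcher.finditer(text)]


# Placeholder entries standing in for a real keyword list (assumption).
lexicon = ["badword", "bad phrase"]
matcher = build_matcher(lexicon)

posts = [
    "This has a BadWord in it",
    "nothing to see here",
    "a bad phrase here",
]
flagged = [p for p in posts if find_terms(p, matcher)]
```

Keyword matching of this kind ignores context, so it is better suited to sampling candidate documents for annotation than to labelling them.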

