Przetak: fewer weeds on the Web

Przetak is a library for checking whether a text contains abusive or vulgar speech in Polish. While it is written in Go, it can be used by programs written in many other languages thanks to FFI (Foreign Function Interface).

Przetak is resilient to:

  • replicating letters,
  • spacing out the words,
  • inserting non-letters between letters,
  • homograph spoofing, i.e. replacing letters with similar characters.
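The obfuscations listed above can be illustrated with a small normalization pass. This is only a sketch under stated assumptions, not Przetak's actual algorithm: the `normalize` function and the `lookalikes` table are hypothetical names introduced here, and the table covers just a handful of look-alike characters.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// lookalikes is a tiny, hypothetical table of characters that are
// commonly substituted for letters. A real table would be far larger.
var lookalikes = map[rune]rune{
	'0': 'o',
	'1': 'l',
	'3': 'e',
	'@': 'a',
	'$': 's',
}

// normalize lowercases s, maps look-alike characters to letters,
// drops every remaining non-letter (spaces, punctuation, digits),
// and collapses runs of the same letter into a single letter.
func normalize(s string) string {
	var b strings.Builder
	var prev rune
	for _, r := range strings.ToLower(s) {
		if mapped, ok := lookalikes[r]; ok {
			r = mapped
		}
		if !unicode.IsLetter(r) {
			continue // spacing and inserted non-letters are skipped
		}
		if r == prev {
			continue // replicated letters are collapsed
		}
		b.WriteRune(r)
		prev = r
	}
	return b.String()
}

func main() {
	for _, s := range []string{"S p a a a m", "sp@@m!!", "h.e.l.l.0"} {
		fmt.Printf("%q -> %q\n", s, normalize(s))
	}
}
```

Note that collapsing repeated letters also shortens legitimate double letters, so a matcher built on this sketch would have to normalize its word list the same way.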

Also, thanks to its use of character 5-grams, it handles some frequent misspellings and out-of-vocabulary words composed of morphemes with an abusive or vulgar meaning.
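To make the term concrete, here is a minimal sketch of how character 5-grams can be extracted from a string. The helper `charNGrams` is a hypothetical name introduced for illustration; how Przetak itself builds and scores its 5-grams is not shown here.

```go
package main

import "fmt"

// charNGrams returns all contiguous substrings of s that are n
// characters long. It works on runes rather than bytes, so Polish
// letters such as "ż" or "ł" count as one character each.
func charNGrams(s string, n int) []string {
	runes := []rune(s)
	if n <= 0 || len(runes) < n {
		return nil
	}
	grams := make([]string, 0, len(runes)-n+1)
	for i := 0; i+n <= len(runes); i++ {
		grams = append(grams, string(runes[i:i+n]))
	}
	return grams
}

func main() {
	fmt.Println(charNGrams("przetak", 5))
}
```

For the word "przetak" this yields the three 5-grams "przet", "rzeta", and "zetak". Because neighboring 5-grams overlap, a misspelling or an unfamiliar compound still shares most of its 5-grams with known words.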

Przetak took second place in PolEval 2019, the Polish contest of cyberbullying detection. Here is a paper about Przetak, and here are the slides from my presentation at AI & NLP Workshop Day 2019.

Installation

First, get the package:

$ go get github.com/MarcinCiura/przetak

Change directory to your ${GOPATH}/src/github.com/MarcinCiura/przetak and run make to build the shared library. Depending on your operating system, the shared library will be called:

  • libprzetak.so on Linux,
  • libprzetak.dylib on macOS,
  • przetak.dll on Windows.

Usage

Przetak's evaluate() function takes a UTF-8 string and returns an integer used as a bit mask. The bits with values 1, 2, and 4 are set, respectively, if the input contains:

  • abusive words,
  • vulgar words with negative connotations,
  • vulgar words with positive connotations.
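The returned integer can be decoded with bitwise AND. The sketch below assumes only the bit values stated above; the constant names and the `describe` helper are hypothetical and not part of Przetak's API, and the integers passed in stand for results obtained from evaluate().

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical names for the three documented bit values.
const (
	Abusive        = 1 // abusive words
	VulgarNegative = 2 // vulgar words with negative connotations
	VulgarPositive = 4 // vulgar words with positive connotations
)

// describe turns a result integer into a human-readable summary
// by testing each documented bit in turn.
func describe(result int) string {
	var found []string
	if result&Abusive != 0 {
		found = append(found, "abusive")
	}
	if result&VulgarNegative != 0 {
		found = append(found, "vulgar (negative)")
	}
	if result&VulgarPositive != 0 {
		found = append(found, "vulgar (positive)")
	}
	if len(found) == 0 {
		return "clean"
	}
	return strings.Join(found, ", ")
}

func main() {
	// Sample values standing in for results of evaluate().
	for _, result := range []int{0, 1, 6} {
		fmt.Printf("%d: %s\n", result, describe(result))
	}
}
```

Since the bits are independent, a single result can report several categories at once: a value of 6, for example, means that both kinds of vulgar words were found.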

The examples directory shows how to call Przetak directly from Go and, via FFI, from several other programming languages.

Author

Marcin Ciura

License

Przetak is licensed under Apache License, Version 2.0.