Toxicity Detection in Context: we assume that each comment exists in a thread, and that the parent comment and/or the discussion topic provide enough context for the humans or systems detecting toxicity.

Toxicity detection w/ and w/o context

  • The comments under study exist in a thread.
  • Context information comprises:
    • the parent comment, and
    • the discussion topic.
  • The large dataset is included in the data folder as two CSV files (see the loading sketch below):
    • gn.csv contains the out-of-context annotations.
    • gc.csv contains the in-context annotations.
  • The small dataset will be added soon.
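
A minimal loading sketch, assuming pandas is installed; it only inspects the files, since the exact column names are not documented here:

import pandas as pd

# Load the two annotation files from the data folder.
ooc = pd.read_csv("data/gn.csv")  # out-of-context annotations
inc = pd.read_csv("data/gc.csv")  # in-context annotations

# Inspect the headers before relying on any particular column name.
print(ooc.columns.tolist(), len(ooc))
print(inc.columns.tolist(), len(inc))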

Word embeddings

  • You will need to add an embeddings folder when using pre-trained embeddings.
    • For example, GloVe embeddings; a loading sketch follows.
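
A small sketch for reading GloVe-style text files; the file name below (glove.6B.300d.txt) is just the standard GloVe release and only illustrates what the embeddings folder is expected to contain:

import numpy as np

def load_glove(path="embeddings/glove.6B.300d.txt"):
    """Read a GloVe text file into a word -> vector dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype="float32")
    return vectors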

Building the datasets

Create random splits:

python experiments.py --create_random_splits 10
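
Conceptually, each split shuffles the data and cuts it into train/dev/test parts. The sketch below is a hypothetical illustration only; the ratios and details inside experiments.py may differ:

import pandas as pd

def random_split(df, seed, train=0.8, dev=0.1):
    # Shuffle deterministically, then cut into train/dev/test.
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train, n_dev = int(len(df) * train), int(len(df) * dev)
    return df[:n_train], df[n_train:n_train + n_dev], df[n_train + n_dev:]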

Downsample one category per dataset so that the two datasets are both balanced and equally sized:

python experiments.py --create_balanced_datasets
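
The balancing step amounts to sampling the majority class down to the minority-class size; a hedged sketch, where the "label" column name is an assumption:

def downsample(df, label_col="label", seed=0):
    # Sample every class down to the size of the smallest one.
    n = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=n, random_state=seed)))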

Then, create 10 random splits:

python experiments.py --create_random_splits 10 --use_balanced_datasets True

Running a classifier

Run a simple bi-LSTM with:

nohup python experiments.py --with_context_data False --with_context_model "RNN:OOC" --repeat 10 > rnn.ooc.log &

  • You can also train it on IC (in-context) data by changing the related argument.
    • If you call "RNN:INC1", the same LSTM is trained, but another LSTM encodes the parent text (IC data required), and the two encoded texts are concatenated before the dense layers on top (see the sketch below).
    • If you call "BERT:OOC1", you get a plain BERT classifier.
    • If you call "BERT:OOC2", the parent text (IC data required) is concatenated to the target comment with a [SEP] token.
    • If you call "BERT:CA", you extend BERT:OOC1 with the LSTM-encoded parent text, similarly to RNN:INC1.

The names are messy, but they will hopefully change.
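
To make the RNN:INC1 idea concrete, here is a minimal Keras sketch, not the repository's actual code: one LSTM encodes the target comment, a second LSTM encodes the parent, and the two encodings are concatenated before the dense layers. All sizes are illustrative:

from tensorflow.keras import layers, Model

VOCAB, MAXLEN = 20000, 128  # illustrative vocabulary size and sequence length

target_in = layers.Input(shape=(MAXLEN,), name="target_tokens")
parent_in = layers.Input(shape=(MAXLEN,), name="parent_tokens")

embed = layers.Embedding(VOCAB, 128)                 # shared embedding layer
target_enc = layers.Bidirectional(layers.LSTM(64))(embed(target_in))
parent_enc = layers.LSTM(64)(embed(parent_in))       # parent-comment encoder

merged = layers.concatenate([target_enc, parent_enc])
hidden = layers.Dense(64, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(hidden)  # toxic / non-toxic

model = Model([target_in, parent_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])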

The article

@misc{pavlopoulos2020toxicity,
  title={Toxicity Detection: Does Context Really Matter?},
  author={John Pavlopoulos and Jeffrey Sorensen and Lucas Dixon and Nithum Thain and Ion Androutsopoulos},
  year={2020},
  eprint={2006.00998},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
