🇵🇹 🇸🇦 🇵🇱 🇮🇩 🇮🇹 Solving the problem of hate speech detection in 9 languages across 16 datasets. 🇫🇷 🇺🇸 🇪🇸 🇩🇪
New update -- 🎉 🎉 all our BERT models are available here. Be sure to check them out 🎉 🎉.
Please look here to check model loading and inference.
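As a rough illustration of what loading and inference can look like, here is a minimal sketch that assumes the released models are sequence-classification checkpoints loadable with the Hugging Face `transformers` library and PyTorch. `MODEL_NAME` is a hypothetical placeholder, not a real identifier, and the `classify` and `softmax` helpers are ours for illustration; replace the placeholder with the identifier of the model you want to use.

```python
import math
from typing import List, Sequence

# Hypothetical placeholder: substitute the identifier of one of the released models.
MODEL_NAME = "<released-model-id>"


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(texts: List[str], model_name: str = MODEL_NAME) -> List[dict]:
    """Return the predicted label and class probabilities for each text."""
    # Imported lazily so that softmax() stays usable without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    results = []
    for text, row in zip(texts, logits.tolist()):
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        # Label names come from the checkpoint's own config; this sketch does
        # not assume which index corresponds to the hate class.
        results.append(
            {"text": text, "label": model.config.id2label[best], "probs": probs}
        )
    return results
```

Calling `classify(["some text"], model_name=...)` downloads the checkpoint on first use, so it needs network access. The label names it returns are whatever the checkpoint's configuration defines.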
Please cite our paper in any published work that uses any of these resources.
@inproceedings{aluru2021deep,
title={A Deep Dive into Multilingual Hate Speech Classification},
author={Aluru, Sai Saketh and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
booktitle={Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14--18, 2020, Proceedings, Part V},
pages={423--439},
year={2021},
organization={Springer International Publishing}
}
./Dataset --> Contains the dataset-related files.
./BERT_Classifier --> Contains the code for the BERT classifiers performing binary classification on the dataset
./CNN_GRU --> Contains the code for the CNN-GRU model
./LASER+LR --> Contains the code for the logistic regression classifier used on top of LASER embeddings
Make sure to use Python 3 when running the scripts. The package requirements can be installed by running `pip install -r requirements.txt`.
Check out the Dataset folder to learn more about how we curated the dataset for different languages.
We release the code for training/finetuning the following models along with their hyperparameters.
🥇 best for high-resource languages, 🏅 best for low-resource languages, ✈️ fastest to train, 🛩️ slowest to train
- mBERT Baseline: This setting uses the multilingual BERT model, with training and testing on the dataset of the same language. Refer to the BERT Classifier folder for the code and usage instructions.
- mBERT All_but_one 🥇 🛩️: This setting uses the multilingual BERT model with training data from multiple languages, and validation and test data from a single target language. Refer to the BERT Classifier folder for the code and usage instructions.
- Translation + BERT Baseline: This setting translates the other-language datasets to English and finetunes the bert-base model on these translated datasets. Refer to the BERT Classifier folder for the code and usage instructions.
- CNN+GRU Baseline: This setting uses MUSE word embeddings with a CNN-GRU based model, training and testing on the same language. Refer to the CNN_GRU folder for the code and usage instructions.
- LASER+LR baseline ✈️: This setting trains a logistic regression model on the LASER embeddings of the dataset. The training and testing datasets are from the same language. Refer to the LASER+LR folder for the code and usage instructions.
- LASER+LR all_but_one 🏅: This setting trains a logistic regression model on the LASER embeddings of the dataset. The datasets from other languages are also used to train the LR model. Refer to the LASER+LR folder for the code and usage instructions.
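To make the all_but_one settings above more concrete, the following is a minimal sketch of how such a split could be assembled: training data is pooled from every language except the target, while validation and test data come from the target language only. The `all_but_one` helper, the data layout, the label encoding, and the `include_target_train` option are all assumptions made for illustration; see the BERT Classifier and LASER+LR folders for the actual scripts.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, int]          # (text, label); 1 = hate, 0 = not hate (assumed encoding)
Splits = Dict[str, List[Example]]  # keys: "train", "val", "test"


def all_but_one(data: Dict[str, Splits], target: str,
                include_target_train: bool = False) -> Splits:
    """Pool training data from the non-target languages.

    Validation and test sets are taken from the target language only.
    include_target_train is a hypothetical switch that also adds the
    target language's own training examples to the pool.
    """
    if target not in data:
        raise KeyError(f"unknown target language: {target}")

    train: List[Example] = []
    for lang in sorted(data):  # sorted for a deterministic order
        if lang != target or include_target_train:
            train.extend(data[lang]["train"])

    return {
        "train": train,
        "val": list(data[target]["val"]),
        "test": list(data[target]["test"]),
    }


# Toy corpus with made-up placeholder examples.
toy = {
    "en": {"train": [("en-1", 0), ("en-2", 1)], "val": [("en-3", 0)], "test": [("en-4", 1)]},
    "es": {"train": [("es-1", 1)], "val": [("es-2", 0)], "test": [("es-3", 1)]},
    "it": {"train": [("it-1", 0)], "val": [("it-2", 1)], "test": [("it-3", 0)]},
}
splits = all_but_one(toy, target="it")
```

The same pooled split can feed either model family: the texts go to the mBERT tokenizer directly, or are first converted to LASER embeddings for the logistic regression classifier.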
- MUSE embeddings were downloaded and extracted using the code from the MUSE GitHub repository
- For finetuning BERT, we followed this blog by Chris McCormick and also referred to the Transformers GitHub repo
- For the CNN-GRU model, we used the original repo for reference
- For generating the LASER embeddings of the dataset, we used the code from the LASER GitHub repository
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. 2020. "Deep Learning Models for Multilingual Hate Speech Detection". ECML-PKDD
- Upload our models to transformers community to make them public
- Add arXiv paper link and description
- Create an interface for social scientists where they can use our models easily with their data
- Create a pull request to add the models to official transformers repo