[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hate-alert/Tutorial-ICWSM-2021/blob/main/Demos/Multilingual_abuse_predictor.ipynb)

# **Multilingual abuse predictor**
> This tool provides a suite of classifiers for different abuse detection tasks in a multilingual setting. It is provided by [Hate-alert](https://github.com/hate-alert).


![Multilingual](https://cdn.eventplanner.net/imgs/xnr8886_how-to-run-an-efficient-multilingual-conference.jpg)



In [1]:
#@title **Install necessary modules**
#@markdown this cell will install transformers and other necessary packages required for running the code
%%capture
!pip install transformers
!pip install transformers[sentencepiece]
!pip install ekphrasis
!git clone https://github.com/hate-alert/Tutorial-ICWSM-2021.git

In [2]:
%cd Tutorial-ICWSM-2021

/content/Tutorial-ICWSM-2021


In [3]:
%%capture
import transformers
import random
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import BertForSequenceClassification
from transformers import XLMRobertaForSequenceClassification
import torch.nn as nn
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import re
import torch.nn.functional as F
import numpy as np
from Code.utils import *
from Code.model import *
from Code.predictions import *

### **Set GPU**
> This selects the device based on your current configuration. To use a GPU, select Runtime --> Change runtime type and choose GPU as the hardware accelerator.

In [4]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [5]:
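def getDatasetPrediction(dataset, config):
    # Relies on the global `model` that is created in a later cell.
    # `dataset` needs 'Sentences' and 'Index' columns; `config` maps class ids to label names.
    labels = model.return_probab(dataset['Sentences'])
    predictions = {}
    for index, row in dataset.iterrows():
        # `index` is used as a position into `labels`, so the frame needs a default 0..n-1 index.
        dict_labels = {}
        for ele in config:
            dict_labels[config[ele]] = round(labels[index][ele], 3)
        predictions[row['Index']] = {'Sentence': row['Sentences'], 'Labels': dict_labels}
    return predictions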

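def getRandomTextFromPred(pred=None):
    # Fall back to the global `prediction` dictionary when no argument is given.
    if pred is None:
        pred = prediction
    return random.choice(list(pred.items()))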
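
The mapping from class indices to label names that `getDatasetPrediction` performs can be seen in isolation. The sketch below is illustrative only: the `id2label` dictionary and the probability row are made-up values, assumed to have the same shape as `model.config.id2label` and one row of the output of `model.return_probab`.

```python
# Hypothetical values for illustration; the real ones come from the loaded model.
id2label = {0: "NON_HATE", 1: "HATE"}
probabilities = [0.0912, 0.9088]  # assumed: one probability per class index

# Same mapping as the inner loop of getDatasetPrediction:
# look up each class index, round its probability, and key it by label name.
label_scores = {name: round(probabilities[idx], 3) for idx, name in id2label.items()}
print(label_scores)
```

The result is keyed by label name instead of class index, which is the form stored under `"Labels"` for each sentence.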

# **Models and their origins**
The models differ in the datasets they were trained on and the labels they predict.
* **The Kannada, Malayalam, and Tamil models were trained on data from the Offensive Language [shared task](https://competitions.codalab.org/competitions/27654) at the DravidianLangTech workshop at EACL 2021**  

These are XLM-R models with the following labels:
> Not_in_intended_language, Not_offensive, Off_target_group (*offensive, targeting a group*),
  Off_target_ind (*offensive, targeting an individual*), Profanity (*presence of a slur*)

#### **If used cite this**  
```
@misc{saha2021hatealertdravidianlangtecheacl2021,
      title={Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies for Transformer-based Offensive language Detection}, 
      author={Debjoy Saha and Naman Paharia and Debajit Chakraborty and Punyajoy Saha and Animesh Mukherjee},
      year={2021},
      eprint={2102.10084},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```


*  **English_hatexplain is a model trained using the [hatexplain dataset](https://huggingface.co/datasets/hatexplain)** 

This is a BERT-BASE-UNCASED model trained with hateful, offensive, and normal labels.


#### **If used cite this**  
```
@article{mathew2020hatexplain,
  title={HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection},
  author={Mathew, Binny and Saha, Punyajoy and Yimam, Seid Muhie and Biemann, Chris and Goyal, Pawan and Mukherjee, Animesh},
  journal={arXiv preprint arXiv:2012.10289},
  year={2020}
}
```
*  **The rest of the models are trained using the datasets in the [DELIMIT repo](https://github.com/hate-alert/DE-LIMIT)** 

These are BERT-BASE-UNCASED models trained with hateful and non-hateful labels.

#### **If used cite this**  
```
@article{aluru2020deep,
  title={Deep Learning Models for Multilingual Hate Speech Detection},
  author={Aluru, Sai Saket and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
  journal={arXiv preprint arXiv:2004.06465},
  year={2020}
}
```



In [6]:
#@title ### **Select a language**
Language = "French" #@param ["Arabic", "English", "French", "German", "Indonesian", "Polish", "Portugese", "Italian", "Spanish", "Kannada", "Malyalam", "Tamil", "English_hatexplain"]


In [7]:
model = modelPred(language=Language.lower(), device=device)


In [8]:
from google.colab import files
uploaded = files.upload()

Saving new.csv to new.csv


In [9]:
import io
import pandas as pd
dataset = pd.read_csv(io.BytesIO(uploaded[list(uploaded)[0]])).reset_index()
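
`getDatasetPrediction` reads two columns from the uploaded file, `Sentences` and `Index`, so the CSV needs both. A minimal sketch of writing such a file with the standard library follows. The file name `sample_input.csv` and the example sentences are placeholders, not part of the original tutorial.

```python
import csv

# Placeholder rows; replace them with your own texts.
rows = [
    {"Index": 0, "Sentences": "Bonjour tout le monde"},
    {"Index": 1, "Sentences": "Quelle belle journée"},
]

# Write a CSV with the two column names the notebook code looks up.
with open("sample_input.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Index", "Sentences"])
    writer.writeheader()
    writer.writerows(rows)

# Read the header back to confirm the column names.
with open("sample_input.csv", newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
print(header)
```

A file written this way can be selected in the upload dialog of the previous cell.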

In [10]:
prediction = getDatasetPrediction(dataset,model.config.id2label)

Running eval on test data...


In [11]:
getRandomTextFromPred(prediction)

(0,
 {'Labels': {'HATE': 0.91, 'NON_HATE': 0.09},
  'Sentence': 'The nigger help each other'})
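
Each entry returned by `getDatasetPrediction` pairs a sentence with a dictionary of label scores, so the most likely label can be read off with `max`. The sketch below uses a made-up entry in the same shape as the output above; the helper name `top_label` is hypothetical and not part of the tutorial code.

```python
def top_label(entry):
    """Return the (label, score) pair with the highest score for one prediction entry."""
    labels = entry["Labels"]
    best = max(labels, key=labels.get)
    return best, labels[best]

# Made-up entry for illustration, shaped like the values in `prediction`.
example_entry = {
    "Sentence": "placeholder text",
    "Labels": {"HATE": 0.12, "NON_HATE": 0.88},
}
print(top_label(example_entry))
```

Applied to the real output, the same helper could be called as `top_label(prediction[key])` for any key in `prediction`.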

### **Word of caution**

> The models used here were each trained on a particular dataset and may carry biases or errors. Their predictions should be used only as complementary labels in any analysis.

In [14]:
#@title **Download the file generated**
#@markdown Run this cell and select the destination folder

from google.colab import files
import json

with open('predictions.json', 'w') as f:
    json_string = json.dumps(prediction, cls=NumpyEncoder, sort_keys=True, indent=4)
    f.write(json_string)
  

files.download('predictions.json')
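
When the downloaded file is loaded again, note that JSON object keys are always strings, so an integer index such as `0` comes back as `"0"`. A minimal sketch with made-up data follows; the file name `example_predictions.json` is a placeholder chosen to avoid overwriting the real output.

```python
import json

# Made-up predictions in the same shape as the `prediction` dictionary.
example = {0: {"Sentence": "placeholder text", "Labels": {"HATE": 0.12, "NON_HATE": 0.88}}}

with open("example_predictions.json", "w") as f:
    json.dump(example, f, sort_keys=True, indent=4)

with open("example_predictions.json") as f:
    loaded = json.load(f)

# Integer keys were converted to strings during serialization.
print(list(loaded.keys()))

# Convert the keys back to integers if the original indexing is needed.
restored = {int(key): value for key, value in loaded.items()}
```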
