<a href="https://colab.research.google.com/github/natybkl/Hate-Speech-Detection-in-Amharic-Language/blob/main/Hate_Speech_Detection_in_Amharic_Language%5BmBERT%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Hate-Speech-Detection-in-Amharic-Language using fine-tuned mBERT





This repository contains the code and resources for a machine learning project that uses a fine-tuned mBERT model to detect hate speech in the Amharic language. The model was fine-tuned using the Hugging Face Trainer API.

1. mBERT (multilingual BERT) is a pre-trained language model developed by Google that can understand multiple languages. You can find more information about mBERT [here](https://github.com/google-research/bert/blob/master/multilingual.md).
2. Davlan has provided a fine-tuned mBERT model specifically for the Amharic language, which is the starting point for this project. The model is available on Hugging Face [here](https://huggingface.co/Davlan/bert-base-multilingual-cased-finetuned-amharic).
3. Since the original mBERT model is not well-suited for Amharic, we fine-tuned Davlan's model specifically for the task of hate speech detection in the Amharic language.

### **Step 1**: Installing necessary packages

We need two major packages:
1. The Transformers package made available by Hugging Face
2. The Datasets package made available by Hugging Face

In [None]:
#installing the transformers package
!pip install transformers

In [None]:
#installing the datasets package
!pip install datasets

### **Step 2**: Import relevant libraries from the installed packages

In [None]:
#importing the datasets package
from datasets import Dataset
import datasets
#import load metric for model evaluation
from datasets import load_metric

In [None]:
#import numpy and pandas for mathematical computation and data manipulation respectively 
import numpy as np
import pandas as pd
#import drive package to connect this colab file with the drive where the data will be retrieved from
from google.colab import drive
#import the pipeline of transformers
from transformers import pipeline
#import AutoTokenizer for tokenization purposes
from transformers import AutoTokenizer
#import the Trainer API
from transformers import TrainingArguments, Trainer
#import early stopping callback 
from transformers import EarlyStoppingCallback, IntervalStrategy


### **Step 3**: Import the dataset to be used for training, validating and testing the model


The dataset used for this project is an Amharic dataset published on Mendeley Data. It contains Amharic posts and comments retrieved from Facebook and has 30,000 rows. The dataset can be accessed [here](https://data.mendeley.com/datasets/ymtmxx385m).

In [None]:
#mount google drive to access the dataset directly from drive
drive.mount('/content/drive') 

In [None]:
#fetch the dataset from drive; the posts are stored in Texts, the name used by the later cells
Labels = pd.read_csv('/content/drive/MyDrive/Machine Learning/Data/Amharic-Hate-Speech-Dataset/Labels.txt',header=None)
Texts = pd.read_csv('/content/drive/MyDrive/Machine Learning/Data/Amharic-Hate-Speech-Dataset/Posts.txt',header=None)

### **Step 4**: Preprocess the Dataset

When the dataset was retrieved, the labels and the posts were in different files.


1. Hence, the first step in this phase is merging the files into one pandas DataFrame.
2. The second step is label encoding. Label encoding is the process of converting the labels (classes) into a numeric format that the model can process.
3. The third step is splitting the dataset into training, validation and test sets. The split is roughly 7:1:2: 20% is held out for testing, and 11.5% of the remainder (about 9% of the total) is used for validation.
4. The last step is to remove any unnecessary columns and merge all the splits into one main dataset.
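
The two-stage split used in this phase can be checked with a little arithmetic. The sketch below assumes the `test_size` values that appear in the split cell of this notebook (0.20, then 0.115 of what remains); the helper name `effective_split` is illustrative and not part of any library.

```python
def effective_split(test_size, val_size_of_remainder):
    """Return (train, validation, test) fractions of the full dataset."""
    remainder = 1.0 - test_size
    validation = remainder * val_size_of_remainder
    train = remainder - validation
    return train, validation, test_size


train_frac, val_frac, test_frac = effective_split(0.20, 0.115)
print(f"train={train_frac:.3f}, validation={val_frac:.3f}, test={test_frac:.3f}")
# train=0.708, validation=0.092, test=0.200
```

For comparison, an exact 7:1:2 split would need the second `test_size` to be 0.125, since 0.125 of the remaining 80% is 10% of the total.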



In [None]:
#naming the columns; the text column is called "post" to match the tokenization step
Labels.columns = ["labels"]
Texts.columns = ["post"]

In [None]:
#encoding the classes into numerical data
Labels = Labels.replace(['Free', 'Free ','Hate'],[0,0,1]) 
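
The encoding cell above lists `'Free'` and `'Free '` separately because some labels carry trailing whitespace. As a hedged alternative, the sketch below strips whitespace before mapping and fails loudly on anything unexpected. It is plain Python for illustration; `LABEL_TO_ID` and `encode_label` are hypothetical names, not part of the notebook.

```python
# Hypothetical mapping from class name to integer id (Free = 0, Hate = 1).
LABEL_TO_ID = {"Free": 0, "Hate": 1}


def encode_label(raw_label):
    """Strip surrounding whitespace, then map the class name to its integer id."""
    cleaned = raw_label.strip()
    if cleaned not in LABEL_TO_ID:
        raise ValueError(f"Unexpected label: {raw_label!r}")
    return LABEL_TO_ID[cleaned]


encoded = [encode_label(label) for label in ["Free", "Free ", "Hate"]]
print(encoded)  # [0, 0, 1]
```

Raising on unknown labels makes a stray class name visible immediately, whereas a plain replace would leave it in the column as a string.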

In [None]:
#check the encoded label data
Labels.head(10)

In [None]:
#check the Amharic data
Texts.head(1000)

In [None]:
#merge the datasets
Frames = [Labels, Texts]
Merged = pd.concat(Frames, axis=1)

In [None]:
#preview of merged data
Merged

In [None]:
#Divide the dataset into train, validation and test categories
from sklearn.model_selection import train_test_split
train_val_df, test_dataset = train_test_split(Merged, test_size=0.20, random_state=42)
train_dataset, evaluation_dataset = train_test_split(train_val_df, test_size=0.115, random_state=42)
print('Training dataset shape: ', train_dataset.shape)
print('Validation dataset shape: ', evaluation_dataset.shape)
print('Testing dataset shape: ', test_dataset.shape)

In [None]:
#convert format of the dataset to HuggingFace Dataset from Pandas DataFrame
test_dataset=Dataset.from_pandas(test_dataset)

In [None]:
#convert the format of the dataset to HuggingFace Dataset from Pandas DataFrame
train_dataset=Dataset.from_pandas(train_dataset)

In [None]:
#convert the format of the dataset to HuggingFace Dataset from Pandas DataFrame
evaluation_dataset=Dataset.from_pandas(evaluation_dataset)

In [None]:
#preview of the test dataset after conversion
test_dataset

In [None]:
#combine the train, test and evaluation datasets into one DatasetDict
main_dataset= datasets.DatasetDict({
    'train': train_dataset,
    'test': test_dataset,
    'evaluate': evaluation_dataset
})

In [None]:
#preview of the dataset after merging
main_dataset

In [None]:
# training and testing data size
training_data_size = main_dataset['train'].num_rows
testing_data_size = main_dataset['test'].num_rows
evaluation_data_size = main_dataset['evaluate'].num_rows

### **Step 5**: Tokenizing Dataset

A Tokenizer is used to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

In this case, the tokenizer is loaded with AutoTokenizer from the fine-tuned mBERT model made available on Hugging Face [here](https://huggingface.co/Davlan/bert-base-multilingual-cased-finetuned-amharic).

In this phase, we have the following tasks:

1. Load the tokenizer
2. Create a tokenizer function that takes the dataset in batches and tokenizes them using the tokenizer loaded from the model
3. Call the tokenizer function on the whole dataset
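
To make the idea of padding and truncation concrete, here is a toy sketch in plain Python. It is not the WordPiece tokenizer that mBERT uses: it splits on whitespace and builds its own tiny vocabulary, and `build_vocab` and `toy_tokenize` are hypothetical helpers written only for this illustration.

```python
PAD_ID, UNK_ID = 0, 1  # ids reserved for padding and unknown tokens in this toy vocabulary


def build_vocab(texts):
    """Assign an integer id to every distinct whitespace-separated token."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def toy_tokenize(batch, vocab, max_length):
    """Encode a batch of texts, truncating and padding each one to max_length."""
    input_ids, attention_mask = [], []
    for text in batch:
        ids = [vocab.get(token, UNK_ID) for token in text.split()][:max_length]
        padding = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


vocab = build_vocab(["ሰላም ለሁሉም", "ሰላም"])
encoded = toy_tokenize(["ሰላም ለሁሉም", "ሰላም አዲስ ቃል ነው"], vocab, max_length=3)
print(encoded["input_ids"])       # [[2, 3, 0], [2, 1, 1]]
print(encoded["attention_mask"])  # [[1, 1, 0], [1, 1, 1]]
```

The short text is padded with `0` and masked out, while the long one is cut to three tokens and its unseen words fall back to the unknown id. The real tokenizer returns the same kind of `input_ids` and `attention_mask` fields, but works on subword pieces.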

In [None]:
#loading a tokenizer from the pretrained model
tokenizer = AutoTokenizer.from_pretrained("Davlan/bert-base-multilingual-cased-finetuned-amharic")

In [None]:
#Have a tokenizer function that uses the tokenizer 
def tokenize_function(data):
    return tokenizer(data["post"], padding="max_length", truncation=True)

In [None]:
#Tokenize all the data in batches using the mapping functionality
tokenized_datasets = main_dataset.map(tokenize_function, batched=True)

In [None]:
#release cached GPU memory that PyTorch is not currently using
import torch
torch.cuda.empty_cache()

### **Step 6**: Prepare the tokenized Dataset

In this phase, we do the following tasks:

1. Remove unnecessary columns such as the "post" column from the tokenized dataset as we no longer need them
2. Change the format of the tokenized dataset to PyTorch tensors since we are using PyTorch
3. Load the dataset using DataLoader with the proper batch size
4. Preview the features of the dataset to make sure everything is okay
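
As a rough picture of what batching does, the plain-Python sketch below groups examples into consecutive batches of four. It mimics an unshuffled loader that keeps the final, smaller batch (PyTorch's `DataLoader` also keeps it by default); `iter_batches` is a hypothetical helper, not a PyTorch function.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


batches = list(iter_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```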

In [None]:
#remove the post column as it is no longer needed
tokenized_datasets = tokenized_datasets.remove_columns(["post"])

In [None]:
#changing the format of the tokenized dataset to torch
tokenized_datasets.set_format("torch")

In [None]:
#shuffling and selecting the needed size of dataset for training and evaluating the model
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(training_data_size))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(testing_data_size))
small_eval_dataset = tokenized_datasets["evaluate"].shuffle(seed=42).select(range(evaluation_data_size))

In [None]:
# preview of the shuffled and selected evaluation dataset
small_eval_dataset

In [None]:
# preview of the shuffled and selected training dataset
small_train_dataset

In [None]:
# preview of the shuffled and selected testing dataset
small_test_dataset

In [None]:
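#load the dataset using DataLoader
#note: the Trainer in Step 7 builds its own data loaders from the datasets, so these are only needed for a manual loop
from torch.utils.data import DataLoader
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=4)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=4)
test_dataloader = DataLoader(small_test_dataset, batch_size=4)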

### **Step 7**: Fine-tune the model

This phase has the following steps:

1. Load the model
2. Specify the computing metric
3. Specify the Training/fine-tuning arguments
4. Load the Trainer class
5. Fine-tune the model

7.1 **Load the model**

We load the fine-tuned mBERT model in this step, with a classification head for two labels.


In [None]:
#Load a sequence classification model from the pretrained checkpoint
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("Davlan/bert-base-multilingual-cased-finetuned-amharic", num_labels=2)

7.2 **Computing Metrics**

In this stage, we load the evaluation metrics. The metrics used are the F1 score and accuracy. They are computed during the validation and testing phases.
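
For reference, both metrics can be computed by hand from logits and labels. The sketch below is plain Python with made-up numbers; it treats class 1 (Hate) as the positive class, and `argmax` and `accuracy_and_f1` are hypothetical helpers rather than library functions.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy over all examples and binary F1 for the positive class."""
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    correct = sum(p == y for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits for four examples; the predicted classes are [0, 1, 1, 0].
logits = [[2.0, 0.1], [0.3, 1.5], [0.2, 0.9], [1.1, 0.4]]
labels = [0, 1, 0, 1]
print(accuracy_and_f1(logits, labels))  # {'accuracy': 0.5, 'f1': 0.5}
```

Here one hate post is caught, one is missed and one free post is flagged, so precision and recall are both 0.5 and F1 is 0.5 as well.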

In [None]:
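import numpy as np
from datasets import load_metric

#load F1 and accuracy separately; the second positional argument of load_metric is a config name, not a second metric
#note: recent datasets releases deprecate load_metric in favor of the evaluate package
f1_metric = load_metric("f1")
accuracy_metric = load_metric("accuracy")

In [None]:
#Function that uses the loaded metrics to compute the performance of the model
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    f1 = f1_metric.compute(predictions=predictions, references=labels)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    #merge both results so the Trainer reports eval_f1 and eval_accuracy
    return {**f1, **accuracy}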

7.3 **Specify the training arguments**

This phase includes loading the training parameters and hyperparameters. It also specifies the validation interval during the fine-tuning process.
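
The interaction between `warmup_steps` and `learning_rate` is easier to see with numbers. The sketch below assumes the Trainer's default linear scheduler is in effect, since this notebook does not set `lr_scheduler_type`. `linear_schedule_lr` is a hypothetical helper, and `TOTAL_STEPS` is a made-up value because the real one depends on dataset size and epochs.

```python
def linear_schedule_lr(step, total_steps, base_lr=1e-5, warmup_steps=500):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # hypothetical total number of optimizer steps
for step in (0, 250, 500, 5_250, 10_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

The rate climbs from zero to the configured `1e-5` over the first 500 steps, then falls back to zero by the final step.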

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")

In [None]:
from transformers import EarlyStoppingCallback, IntervalStrategy

In [None]:
training_args = TrainingArguments(
   output_dir="training_with_callbacks",
   evaluation_strategy = IntervalStrategy.STEPS, # "steps"
   warmup_steps=500,                # number of warmup steps for learning rate  
   save_steps=2000,
   eval_steps = 2000, # Evaluation and Save happens every 2000 steps
   save_total_limit = 3, # Only last 3 models are saved. Older ones are deleted.
   learning_rate=1e-5,
   per_device_train_batch_size=4,
   per_device_eval_batch_size=4,
   num_train_epochs=15,
   weight_decay=0.01,
   push_to_hub=False,
   metric_for_best_model = 'f1',
   do_predict=True,
   load_best_model_at_end=True)

7.4 **Load the Trainer class**

The Trainer is created with an early stopping callback. Early stopping is a regularization technique used to reduce overfitting without compromising model accuracy. It lets you specify an arbitrarily large number of training epochs and stops training once model performance stops improving on a held-out validation dataset. For this model, the early stopping patience is 10, meaning training stops after 10 consecutive evaluations (one every 2000 steps) without an improvement in F1.
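
The patience logic can be sketched in a few lines of plain Python. In the Trainer, patience is counted in evaluation calls, and a score that only ties the best one does not count as an improvement. The F1 history below is made up, and `evaluations_until_stop` is a hypothetical helper that mirrors the idea rather than the library's implementation.

```python
def evaluations_until_stop(scores, patience):
    """Return the 1-based evaluation that triggers early stopping, or None if none does."""
    best = float("-inf")
    stale = 0  # consecutive evaluations without an improvement
    for index, score in enumerate(scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return index
    return None


f1_history = [0.60, 0.68, 0.71, 0.70, 0.71, 0.69, 0.70]  # made-up validation F1 scores
print(evaluations_until_stop(f1_history, patience=3))  # 6
```

The best score is reached at the third evaluation; the next three fail to beat it, so training stops at the sixth.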

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=10)],
)

7.5 **Fine-tune the model**

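The fine-tuning process embeds validation within itself. After every 2000 steps of fine-tuning, the model is evaluated on the validation set with the loaded metrics. These scores do not change the hyperparameters; they drive early stopping and determine which checkpoint is reloaded as the best model at the end of training.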
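
To get a feel for how often validation runs, the sketch below turns the settings from the training arguments (batch size 4, 15 epochs, evaluation every 2000 steps) into step counts. The training set size is an assumption (roughly 70% of 30,000 rows), and the calculation assumes a single device with no gradient accumulation; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size)


NUM_TRAIN_EXAMPLES = 21_000  # hypothetical training set size
per_epoch = steps_per_epoch(NUM_TRAIN_EXAMPLES, batch_size=4)
total_steps = per_epoch * 15          # num_train_epochs=15
evaluations = total_steps // 2000     # eval_steps=2000
print(per_epoch, total_steps, evaluations)  # 5250 78750 39
```

Under these assumptions, a patience of 10 evaluations corresponds to 20,000 steps, or a little under four epochs without improvement.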

In [None]:
trainer.train()

### **Step 8**: Test the model

In this stage, the model is tested on the testing dataset, which the model never sees during the fine-tuning process.

In [None]:
trainer.evaluate(small_test_dataset)

### **Step 9**: Push the model to Huggingface Hub

The final model was pushed and made publicly available on Hugging Face. You can find the model [here](https://huggingface.co/NathyB/Hate-Speech-Detection-in-Amharic-Language-mBERT).

In [None]:
#install huggingface_hub package to interact with huggingface platform
!pip install huggingface_hub

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()

In [None]:
#save the trained model and tokenizer to a local directory before pushing them to the Hugging Face repository
model.save_pretrained("path/to/improved-amharic-hate-speech-detection-mBERT")
tokenizer.save_pretrained("path/to/improved-amharic-hate-speech-detection-mBERT")

In [None]:
#push the trained model to huggingface repository
model.push_to_hub("improved-amharic-hate-speech-detection-mBERT")

In [None]:
#push the tokenizer to huggingface repository
tokenizer.push_to_hub("improved-amharic-hate-speech-detection-mBERT")