<center>
    <h1><b>Pretraining a Transformer Model from Scratch on an Amharic Dataset and Fine-Tuning for Amharic Hate Speech Recognition Task</b></h1>
</center>


## Table of Contents  

1. [Introduction](#introduction)  
2. [Importing Packages](#importing-packages)  
3. [Dataset Collection & Preprocessing](#dataset-collection--preprocessing)  
   - 3.1 [Data Collection](#data-collection)  
   - 3.2 [Data Cleaning](#data-cleaning)  
   - 3.3 [Tokenization](#tokenization)  
4. [Pretraining the Transformer Model](#pretraining-the-transformer-model)  
5. [Fine-Tuning for Hate Speech Recognition](#fine-tuning-for-hate-speech-recognition)  
6. [Evaluation](#evaluation)  
7. [Deployment on Mahder AI App](#deployment-on-mahder-ai-app)  
8. [Conclusion](#conclusion)  


## 1. Introduction  

In this notebook, I will pretrain a Transformer network on an Amharic dataset collected from a variety of Telegram channels, using the masked language modeling (MLM) objective. The primary goal of pretraining is to enable the model to learn contextualized word and phrase representations, thereby improving its understanding of language semantics. The Transformer’s self-attention mechanism plays a crucial role by allowing the model to dynamically weigh different parts of the input sequence, effectively capturing long-range dependencies in the data.  

After pretraining, I will fine-tune the model on a labeled dataset of hate speech and deploy the resulting model in the **Mahder AI** app.



## 2. Importing Packages

Let's start by importing all the required libraries. 

In [173]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt 
import pandas as pd 
import json
import re
import random
import emoji
import sentencepiece as spm

## 3. Dataset Collection & Preprocessing

### 3.1 Data Collection  

To pretrain the Transformer network from scratch, we will use **self-supervised learning**, which requires a large corpus of unlabeled text. Specifically, we will use the **masked language modeling (MLM)** objective: some tokens in each input are hidden, and the model is trained to predict them from the surrounding context.  
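
To make the MLM objective concrete, the sketch below hides a fraction of the tokens in a sentence and records the originals as prediction targets. It is only an illustration under stated assumptions: the `[MASK]` symbol, the 15% masking rate (the rate used by BERT), the helper name `mask_tokens`, and the whitespace split are all stand-ins, not the masking code used later in this notebook.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder symbol for a hidden token


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens and return (masked_tokens, targets).

    `targets` maps each masked position to the original token, which is
    what the model would be asked to predict during pretraining.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets


# Whitespace splitting stands in for a real subword tokenizer here.
tokens = "ሰውን ለመርዳት ሰው መሆን በቂ ነው".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

No labels are needed: the training targets come from the text itself, which is why this setup is called self-supervised.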

#### **Why Telegram Channels?**  
Telegram is one of the most widely used platforms for sharing information in Ethiopia, so I chose **Telegram channels** as the primary data source. Most of the selected channels are news channels, which cover a wide range of topics and help keep the dataset diverse.  

#### **Data Collection Method**  
To collect the data, I used the **Telethon Python library** and the **Telegram API** to scrape text from selected channels.  
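
The collection step can be sketched roughly as follows. This is not the scraper used for this dataset (see the repository linked below for that); it is a minimal outline assuming Telethon's `TelegramClient` and `iter_messages`, your own `api_id` and `api_hash`, and hypothetical helper names, session name, and output paths. Each post is stored as a `{"text": ...}` record, the layout that the loading code in section 3.2 reads.

```python
import json


def messages_to_records(texts):
    """Drop empty or media-only messages and wrap each text as {"text": ...}."""
    return [{"text": t} for t in texts if t and t.strip()]


async def scrape_channel(channel, api_id, api_hash, limit=1000):
    """Fetch up to `limit` recent messages from a public channel."""
    # Imported here so the helpers in this cell work without Telethon installed.
    from telethon import TelegramClient

    texts = []
    # "scraper_session" is a hypothetical name for the local session file.
    async with TelegramClient("scraper_session", api_id, api_hash) as client:
        async for message in client.iter_messages(channel, limit=limit):
            texts.append(message.text)
    return messages_to_records(texts)


def save_records(records, filepath):
    """Write records as UTF-8 JSON, keeping Amharic characters readable."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Usage (needs real credentials, so it is left commented out):
# import asyncio
# records = asyncio.run(scrape_channel("tikvahethiopia", API_ID, API_HASH))
# save_records(records, "datas/tikvahethiopia.json")
```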

#### **Selected Telegram Channels**  
The dataset has been collected from the following Telegram channels:  

- [Tikvah Ethiopia](https://t.me/tikvahethiopia)  
- [Addis Standard Amharic](https://t.me/AddisstandardAmh)  
- [Tarikn Wedehuala](https://t.me/TariknWedehuala)  
- [Addis News](https://t.me/Addis_News)  
- [Zena 24 Now](https://t.me/zena24now)  
- [Tikvah University](https://t.me/TikvahUniversity)  
- [Tikvah Ethiopia Magazine](https://t.me/tikvahethmagazine)  
- [Tikvah Ethiopia Sport](https://t.me/tikvahethsport)  
- [Philosophy Thoughts](https://t.me/Philosophy_Thoughts1)  
- [Mudenyaz](https://t.me/Mudenyaz)  
- [Yemeri Terekoch](https://t.me/yemeri_terekoch)  
- [Bemnet Library](https://t.me/Bemnet_Library)  
- [Amazing Fact](https://t.me/amazing_fact_433)  
- [Zephilosophy](https://t.me/Zephilosophy)  
- [Huluezih](https://t.me/huluezih)  

#### **Accessing Collected Data**  
To access the code and all the raw data collected from each channel, visit the following GitHub repository:  
[GitHub Repository Link](https://github.com/your-repo-link-here).


### 3.2 Data Loading and Cleaning

Next, we define a function that loads the posts from a JSON file and returns their text as a list of strings.

In [28]:
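def load_data(filepath):
    """Load a JSON array of posts and return the "text" field of each one."""
    # Read as UTF-8 explicitly so the Amharic text decodes the same way on every platform.
    with open(filepath, "r", encoding="utf-8") as file:
        posts = json.load(file)
    return [post["text"] for post in posts]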

The following code prints the first five posts from a sample file to show what the raw data looks like.

In [29]:
sample=load_data("datas/sample.json")
for news in sample[:5]:
    print(news)
    
    print("---------------------------------------------------------------------------------\n\n")

#መቄዶንያ

ሰውን ለመርዳት ሰው መሆን በቂ ነው !

ትላንት የካቲት 1/2017 ዓ/ም በጀመረው የመቄዶንያ የአረጋዊያን እና የአእምሮ ህሙማን መርጃ ማዕከል የድጋፍ ማሰባሰብ ዘመቻ እስኩን 120,000,000 ብር ተሰብስቧል።

መቄዶንያ በሚያስገነባው ሆስፒታል ጭምር ያለው ህንፃ ለማጠናቀቅ የገንዘብ እጥረት አጋጥሞታል። ህንፃው ለማጠናቀቅ ገንዘብ ተቸግረናል። ለማጠናቀቅ ወደ 5 ቢሊዮን ብር ያስፈልጋል።

በቀጥታ ይከታተሉ 👇
https://www.youtube.com/live/q0bMjwt9PvM?feature=shared

የምትችሉትን ሁሉ ድጋፍ አድርጉ።

@tikvahethiopia
---------------------------------------------------------------------------------


🔊 #የሠራተኞችድምጽ

" ቋሚ ሠራተኞች ሆነን ሳለ በደሞዝ ማሻሻያው አልተካተትንም " - የሀዋሳ ዙሪያ ወረዳ መንግስት ሠራተኞች

የማክሮ ኢኮኖሚ ማሻሻያ ሪፎርሙን ተከትሎ የሚከሰቱ የኑሮ ዉድነትና ተያያዥ ጉዳዮችን ታሳቢ በማድረግ የመንግስት ሠራተኞች ደሞዝ ማሻሻያ ተደርጎ ከጥቅምት ወር 2017 ዓ/ም ጀምሮ ተግባራዊ የተደረገ መሆኑ ይታወቃል።

በሲዳማ ክልል፤ ሰሜናዊ ሲዳማ ዞን፤ ሀዋሳ ዙሪያ ወረዳ በተለያዩ የመንግስት መስሪያ ቤቶች የሚሰሩ የመንግስት ሠራተኞች ግን " ከ2012 ዓ/ም ጀምሮ በቋሚነት ተቀጥረን እየሰራን ያለን ቢሆንም በአዲሱ የመንግስት ሠራተኞች የደመወዝ ማሻሻያ አልተካተትንም " ሲሉ ቅሬታቸዉን ለቲክቫህ ኢትዮጵያ አስገብተዋል።

ቅሬታቸዉን ካደረሱን መካከል ፦
- በከተማ ልማትና ኮንስትራክሽን፣
- ማዘጋጃ ቤቶች፣
- በትምህርት ዘርፍ ፣
- በሴቶችና ሕፃናት እንዲሁም በሕብረት ስራ ጽ/ቤቶች የሚሰሩ ሠራተኞች ናቸው።

" በወቅቱ በአግባቡ ማስታወቂያ ወ

Because the data comes from Telegram, we clean it by removing substrings that are specific to the platform: hashtags, usernames, hyperlinks, and emojis. For the first three we use regular expressions from Python's `re` library: we define a search pattern for each and call `re.sub()` to replace every match with an empty string (`''`). Emojis are replaced with a space using the `emoji` library.

However, we keep the English words that occasionally appear in the text. Amharic posts on social media are often mixed with English, and purely Amharic text is hard to find in informal digital communication. These English words often carry meaningful context, so preserving them keeps the cleaned dataset representative of real-world usage while still removing platform-specific clutter.

In [153]:
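def clean_text(text):
    """Remove Telegram-specific clutter: links, hashtags, usernames, and emojis."""
    text = re.sub(r'https?://\S+', '', text)  # hyperlinks
    text = re.sub(r'#\S+', '', text)          # hashtags
    text = re.sub(r'@\S+', '', text)          # usernames and channel mentions
    text = emoji.replace_emoji(text, " ")     # each emoji becomes a space
    return text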

Let's test the above function on sample data
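
A quick check of the three regular-expression patterns on a made-up post. The sample string and the helper name `strip_telegram_markup` are illustrative, and emoji removal is left out so that the sketch needs only the standard library.

```python
import re

# The same three patterns the cleaning step relies on.
URL_PATTERN = re.compile(r'https?://\S+')
HASHTAG_PATTERN = re.compile(r'#\S+')
MENTION_PATTERN = re.compile(r'@\S+')


def strip_telegram_markup(text):
    """Remove links, hashtags, and @mentions; other text is left untouched."""
    for pattern in (URL_PATTERN, HASHTAG_PATTERN, MENTION_PATTERN):
        text = pattern.sub('', text)
    return text


sample = "#news Watch live https://example.com/live?x=1 via @some_channel today"
cleaned = strip_telegram_markup(sample)

print(repr(sample))
print(repr(cleaned))
```

Each removal leaves the surrounding whitespace in place, so the cleaned string can contain doubled or leading spaces.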


Let's load the training data, count how many posts it contains, and print the first five.

In [130]:
data=load_data("datas/totaldata.json")
number_of_contents=len(data)
print(f'Total number of contents: {number_of_contents}\n')
print(f'First 5 contents: \n')
for news in data[:5]:
    print(news)
    print("---------------------------------------------------------------------------------\n\n")




Total number of contents: 193419

First 5 contents: 

#መቄዶንያ

ሰውን ለመርዳት ሰው መሆን በቂ ነው !

ትላንት የካቲት 1/2017 ዓ/ም በጀመረው የመቄዶንያ የአረጋዊያን እና የአእምሮ ህሙማን መርጃ ማዕከል የድጋፍ ማሰባሰብ ዘመቻ እስኩን 120,000,000 ብር ተሰብስቧል።

መቄዶንያ በሚያስገነባው ሆስፒታል ጭምር ያለው ህንፃ ለማጠናቀቅ የገንዘብ እጥረት አጋጥሞታል። ህንፃው ለማጠናቀቅ ገንዘብ ተቸግረናል። ለማጠናቀቅ ወደ 5 ቢሊዮን ብር ያስፈልጋል።

በቀጥታ ይከታተሉ 👇
https://www.youtube.com/live/q0bMjwt9PvM?feature=shared

የምትችሉትን ሁሉ ድጋፍ አድርጉ።

@tikvahethiopia
---------------------------------------------------------------------------------


🔊 #የሠራተኞችድምጽ

" ቋሚ ሠራተኞች ሆነን ሳለ በደሞዝ ማሻሻያው አልተካተትንም " - የሀዋሳ ዙሪያ ወረዳ መንግስት ሠራተኞች

የማክሮ ኢኮኖሚ ማሻሻያ ሪፎርሙን ተከትሎ የሚከሰቱ የኑሮ ዉድነትና ተያያዥ ጉዳዮችን ታሳቢ በማድረግ የመንግስት ሠራተኞች ደሞዝ ማሻሻያ ተደርጎ ከጥቅምት ወር 2017 ዓ/ም ጀምሮ ተግባራዊ የተደረገ መሆኑ ይታወቃል።

በሲዳማ ክልል፤ ሰሜናዊ ሲዳማ ዞን፤ ሀዋሳ ዙሪያ ወረዳ በተለያዩ የመንግስት መስሪያ ቤቶች የሚሰሩ የመንግስት ሠራተኞች ግን " ከ2012 ዓ/ም ጀምሮ በቋሚነት ተቀጥረን እየሰራን ያለን ቢሆንም በአዲሱ የመንግስት ሠራተኞች የደመወዝ ማሻሻያ አልተካተትንም " ሲሉ ቅሬታቸዉን ለቲክቫህ ኢትዮጵያ አስገብተዋል።

ቅሬታቸዉን ካደረሱን መካከል ፦
- በከተማ ልማትና ኮንስትራክሽን፣
- ማዘጋጃ ቤቶች፣
- በትምህርት ዘርፍ ፣
- በሴቶችና ሕፃናት እንዲሁም

Let's clean the data with the `clean_text` function and pick a random post to compare the raw and cleaned versions.

In [154]:
cleaned_data=[clean_text(content) for content in data]

In [171]:
index = random.randrange(number_of_contents)  # randint's upper bound is inclusive and could overrun the list
print(f"Data at index {index} before cleaning: \n\n",data[index])
print("\n---------------------------------------------------------------------------------\n\n")
print(f"Data at index {index} after cleaning: \n\n",cleaned_data[index])

Data at index 182822 before cleaning: 

 #የዛሬ (ለነሐሴ 4/2014 የተመረጡ የቲክቫህ ዜናዎች)

💐 የሙርሌ ጎሳዎች ባደረሱት ጥቃት የሁለት ሰዎች ህይወት አለፈ....

- በጋምቤላ ከደቡብ ሱዳን የሚነሱ የ "ሙርሌ ጎሳ" ታጣቂዎች ድንበር አቋርጠው በመግባት ከትናንት በስቲያ ከምሽቱ 4:30 ላይ በፈጸሙት ጥቃት የ2 ሰው ህይወት ሲያልፍ አንድ ሰው ቆስሎ 2 ህፃናት በታጣቂዎች መወሰዳቸውን የክልሉ ፖሊስ አሳውቋል። በአሁኑ ሰዓት ልዩ ኃይሉንና ሚሊሻ በመጠቀም ሰላምና ፀጥታ የማስከበር ስራውን እየሰራ መሆኑን አሳውቋል።

🏛 የክላስተር አደረጃጀትን ያላጸደቀው ብቸኛው የጉራጌ ዞን ምክር ቤት ነገ አስቸኳይ ስብሰባውን ያካሂዳል.....

- በደቡብ ክልል ከሚገኙት ዞኖችና ልዩ ወረዳዎች ውስጥ የ " ክላስተር " አደረጃጀትን ባለማጽድቅ ብቸኛ የሆነው የጉራጌ ዞን ምክር ቤት አስቸኳይ ጉባኤውን ነገ ሐሙስ ነሐሴ 5/2014 እንደሚያካሂድ ተገልጿል። ከክልልነት ጥያቄ ጋር ተያይዞ በትላንትናው ዕለት በወልቂጤ ከተማ የስራ ማቆም አድማ ተደርጎ የነበር ሲሆን ዛሬ ሩቡዕ  ከተማዋ ወደ መደበኛ እንቅስቃሴ መግባቷ ተሰምቷል።

🇺🇸🇸🇴 አሜሪካ 4 የአልሸባብ የሽብር ቡድን አባላትን መግደሏን ገለጸች...

- አሜሪካ ትላንትና ማክሰኞ ፤ ከሶማሊያ መንግስት ጋር በመተባበር በፈፀመችው የአየር ድብደባ 4 የአልሸባብ የሽብር ቡድን አባላትን መግደሏን አሳውቃለች።

4⃣ የደቡብ ምዕራብ ኢትዮጵያ ህዝቦች ክልል የብዝሃ ዋና ከተሞች ረቂቅ አዋጅን አጸደቀ....

አዲሱ የደቡብ ምዕራብ ኢትዮጵያ ህዝቦች ክልል መንግስት የብዝሃ ዋና ከተሞች ረቂቅ አዋጅን መርምሮ ማፅደቁ ተሰምቷል። በዚህም ቦንጋ ከተማ የክልሉ የፖለቲካ እና የርዕሰ መስተዳደሩ መቀመጫ፤ ተርጫ ከተማ የክልሉ ምክር 

### 3.3 Tokenization

#### Training the Tokenizer

Tokenization is a critical step in natural language processing (NLP) as it converts raw text into smaller, meaningful units (tokens) that can be processed by machine learning models. Effective tokenization ensures that the model can understand and interpret the text accurately, which is essential for tasks like text classification, machine translation, and sentiment analysis.

For this task, we will use the **SentencePiece tokenizer** instead of traditional word-based tokenization. The [SentencePieceTokenizer](https://www.tensorflow.org/text/api_docs/python/text/SentencepieceTokenizer) is a powerful tool that tokenizes text into **subword units**, which offers several advantages:

1. **Handling Complex Word Structures**: SentencePiece breaks words into smaller subword units, making it effective for handling complex word structures and morphological variations, which are common in languages like Amharic.
2. **Out-of-Vocabulary (OOV) Words**: By using subword tokenization, SentencePiece can handle out-of-vocabulary words more gracefully, as it can decompose them into known subword units.
3. **Language-Agnostic Design**: SentencePiece treats its input as a raw character stream and does not assume whitespace-delimited words or any particular script. This makes it suitable for Amharic's Ge'ez script, for the English words mixed into the text, and for the recurring subwords and morphological patterns of the language.
4. **Simplified Preprocessing**: SentencePiece works directly on raw text, eliminating the need for extensive preprocessing steps like word segmentation or stemming.
5. **Seamless Integration**: It integrates seamlessly with popular machine learning frameworks like TensorFlow and PyTorch, ensuring consistent tokenization across training and inference pipelines.

Given these benefits, SentencePiece is an ideal choice for tokenizing Amharic text, as it can effectively capture the language's unique characteristics while simplifying the overall preprocessing workflow.
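
To illustrate the out-of-vocabulary point, the toy sketch below splits unseen words into known pieces using greedy longest-match over a small vocabulary. It is deliberately not SentencePiece's algorithm, which learns its vocabulary from data with a unigram language model or BPE; the hand-picked `VOCAB` and the `segment` helper are hypothetical and serve only to show the idea of subword decomposition.

```python
def segment(word, vocab):
    """Split `word` into the longest known pieces, scanning left to right.

    A character that matches no vocabulary entry is emitted on its own,
    so the pieces always join back into the original word.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            if candidate in vocab or end == start + 1:
                pieces.append(candidate)
                start = end
                break
    return pieces


# Hypothetical subword vocabulary, chosen by hand for illustration.
VOCAB = {"የ", "በ", "መንግስት", "ሠራተ", "ኞች"}

# None of these full words is in VOCAB, yet each splits into known pieces.
for word in ["የመንግስት", "በመንግስት", "ሠራተኞች"]:
    print(word, "->", segment(word, VOCAB))
```

A word-level vocabulary would map all three words to a single unknown token, while the subword pieces keep the shared stem visible to the model.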