<center>
    <h1><b>Pretraining a Transformer Model from Scratch on an Amharic Dataset and Fine-Tuning for Amharic Hate Speech Recognition Task</b></h1>
</center>


## Table of Contents  
1. [Introduction](#introduction)  
2. [Importing Packages](#importing-packages)  
3. [Dataset Collection & Preprocessing](#dataset-collection--preprocessing)  
   - 3.1 [Data Collection](#data-collection)  
   - 3.2 [Data Cleaning](#data-cleaning)  
   - 3.3 [Tokenization](#tokenization)  
   - 3.4 [Tokenizing and Masking](#tokenizing-and-masking)  
4. [Pretraining the Transformer Model](#pretraining-the-transformer-model)  
5. [Fine-Tuning for Hate Speech Recognition](#fine-tuning-for-hate-speech-recognition)  
6. [Evaluation](#evaluation)  
7. [Deployment on Mahder AI App](#deployment-on-mahder-ai-app)  
8. [Conclusion](#conclusion)  


## 1. Introduction  

In this notebook, I will pretrain a Transformer network on an Amharic dataset collected from a variety of Telegram channels, using a Masked Language Modeling (MLM) objective. The primary goal of pretraining is to enable the model to learn contextualized word and phrase representations, improving its grasp of the language's semantics. The Transformer's self-attention mechanism lets the model dynamically weigh different parts of the input sequence, which helps it capture long-range dependencies in the data.  

After pretraining, I will fine-tune the model on a labeled dataset of hate speech and deploy the resulting model in the **Mahder AI** app.



## 2. Importing the Packages

Let's start by importing all the required libraries. 

In [137]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt 
import pandas as pd 
import json
import re
import random
import emoji
import sentencepiece as spm
import string

## 3. Dataset Collection & Preprocessing

### 3.1 Data Collection  

In order to pretrain the Transformer network from scratch, we will use **self-supervised learning**, which requires a large corpus of unlabeled text. We will apply a **Masked Language Model (MLM)** to pre-train the model.  

#### **Why Telegram Channels?**  
Telegram is one of the most widely used platforms for sharing information in Ethiopia. For this reason, I have chosen **Telegram channels** as the primary data source. Most of the selected channels are news channels, which provides a large and varied body of text.  

#### **Data Collection Method**  
To collect the data, I used the **Telethon Python library** and the **Telegram API** to scrape text from selected channels.  

#### **Selected Telegram Channels**  
The dataset has been collected from the following Telegram channels:  

- [Tikvah Ethiopia](https://t.me/tikvahethiopia)  
- [Addis Standard Amharic](https://t.me/AddisstandardAmh)  
- [Tarikn Wedehuala](https://t.me/TariknWedehuala)  
- [Addis News](https://t.me/Addis_News)  
- [Zena 24 Now](https://t.me/zena24now)  
- [Tikvah University](https://t.me/TikvahUniversity)  
- [Tikvah Ethiopia Magazine](https://t.me/tikvahethmagazine)  
- [Tikvah Ethiopia Sport](https://t.me/tikvahethsport)  
- [Philosophy Thoughts](https://t.me/Philosophy_Thoughts1)  
- [Mudenyaz](https://t.me/Mudenyaz)  
- [Yemeri Terekoch](https://t.me/yemeri_terekoch)  
- [Bemnet Library](https://t.me/Bemnet_Library)  
- [Amazing Fact](https://t.me/amazing_fact_433)  
- [Zephilosophy](https://t.me/Zephilosophy)  
- [Huluezih](https://t.me/huluezih)  

#### **Accessing Collected Data**  
To access the code and all the raw data collected from each channel, visit the following GitHub repository:  
[GitHub Repository Link](https://github.com/your-repo-link-here).


### 3.2 Data Loading and Cleaning

Next, we will define a function that will load data from a JSON file as an array of strings.

In [50]:
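def load_data(filepath):
    """Load a JSON array of message objects and return their "text" fields."""
    return_data = []
    # Amharic text is stored as UTF-8, so set the encoding explicitly
    with open(filepath, "r", encoding="utf-8") as file:
        datas = json.load(file)
        for data in datas:
            return_data.append(data["text"])

    return return_data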
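
To make the expected file layout concrete, here is a minimal sketch that writes a tiny JSON file and reads it back the same way `load_data` does. The records and the temporary path are made up for illustration; the only assumption taken from the code above is that the file holds a list of objects with a `"text"` key.

```python
import json
import os
import tempfile

# Hypothetical records: a list of objects, each with a "text" field
records = [{"text": "ሰላም ለሁሉም"}, {"text": "ዛሬ የሰላም ሚኒስቴር"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")

    # ensure_ascii=False keeps the Amharic characters readable in the file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)

    # Same logic as load_data: collect the "text" field of every record
    with open(path, "r", encoding="utf-8") as f:
        texts = [item["text"] for item in json.load(f)]

print(texts)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding, which matters for Ethiopic script.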

The following code shows what the news items look like:

In [15]:
sample=load_data("datas/sample.json")
for news in sample[:5]:
    print(news)
    
    print("---------------------------------------------------------------------------------\n\n")

#መቄዶንያ

ሰውን ለመርዳት ሰው መሆን በቂ ነው !

ትላንት የካቲት 1/2017 ዓ/ም በጀመረው የመቄዶንያ የአረጋዊያን እና የአእምሮ ህሙማን መርጃ ማዕከል የድጋፍ ማሰባሰብ ዘመቻ እስኩን 120,000,000 ብር ተሰብስቧል።

መቄዶንያ በሚያስገነባው ሆስፒታል ጭምር ያለው ህንፃ ለማጠናቀቅ የገንዘብ እጥረት አጋጥሞታል። ህንፃው ለማጠናቀቅ ገንዘብ ተቸግረናል። ለማጠናቀቅ ወደ 5 ቢሊዮን ብር ያስፈልጋል።

በቀጥታ ይከታተሉ 👇
https://www.youtube.com/live/q0bMjwt9PvM?feature=shared

የምትችሉትን ሁሉ ድጋፍ አድርጉ።

@tikvahethiopia
---------------------------------------------------------------------------------


🔊 #የሠራተኞችድምጽ

" ቋሚ ሠራተኞች ሆነን ሳለ በደሞዝ ማሻሻያው አልተካተትንም " - የሀዋሳ ዙሪያ ወረዳ መንግስት ሠራተኞች

የማክሮ ኢኮኖሚ ማሻሻያ ሪፎርሙን ተከትሎ የሚከሰቱ የኑሮ ዉድነትና ተያያዥ ጉዳዮችን ታሳቢ በማድረግ የመንግስት ሠራተኞች ደሞዝ ማሻሻያ ተደርጎ ከጥቅምት ወር 2017 ዓ/ም ጀምሮ ተግባራዊ የተደረገ መሆኑ ይታወቃል።

በሲዳማ ክልል፤ ሰሜናዊ ሲዳማ ዞን፤ ሀዋሳ ዙሪያ ወረዳ በተለያዩ የመንግስት መስሪያ ቤቶች የሚሰሩ የመንግስት ሠራተኞች ግን " ከ2012 ዓ/ም ጀምሮ በቋሚነት ተቀጥረን እየሰራን ያለን ቢሆንም በአዲሱ የመንግስት ሠራተኞች የደመወዝ ማሻሻያ አልተካተትንም " ሲሉ ቅሬታቸዉን ለቲክቫህ ኢትዮጵያ አስገብተዋል።

ቅሬታቸዉን ካደረሱን መካከል ፦
- በከተማ ልማትና ኮንስትራክሽን፣
- ማዘጋጃ ቤቶች፣
- በትምህርት ዘርፍ ፣
- በሴቶችና ሕፃናት እንዲሁም በሕብረት ስራ ጽ/ቤቶች የሚሰሩ ሠራተኞች ናቸው።

" በወቅቱ በአግባቡ ማስታወቂያ ወ

Since we are working with a Telegram dataset, we aim to clean the text by removing substrings that are commonly used on the platform, such as hashtagged entities, usernames, hyperlinks, and emojis. To achieve this, we will use Python's re library to perform regular expression operations. We will define specific search patterns and use the sub() method to remove matches by replacing them with an empty string ('').

However, we will retain some rarely occurring English words. This decision is based on the observation that Amharic texts on social media are often mixed with English words, and completely pure Amharic text is difficult to find in informal digital communication. These rarely occurring English words often carry meaningful context, so preserving them ensures that the cleaned dataset remains representative of real-world usage while still achieving our goal of removing platform-specific clutter.

In [16]:
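def clean_text(text):
    # Remove hyperlinks
    text = re.sub(r'https?://[^\s\n\r]+', '', text)
    # Remove hashtags
    text = re.sub(r'#\S+', '', text)
    # Remove usernames (mentions)
    text = re.sub(r'@\S+', '', text)
    # Replace each emoji with a space
    text = emoji.replace_emoji(text, " ")

    return text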

Let's test the above function on sample data
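
A minimal sketch of such a test follows. It reproduces only the three regular-expression substitutions from `clean_text`, so it runs with the standard library alone (the emoji step is left out). The sample message and the helper name `strip_telegram_markup` are made up for illustration.

```python
import re

def strip_telegram_markup(text):
    """Apply the URL, hashtag and mention patterns used by clean_text."""
    text = re.sub(r'https?://[^\s\n\r]+', '', text)
    text = re.sub(r'#\S+', '', text)
    text = re.sub(r'@\S+', '', text)
    return text

# Hypothetical message mixing Amharic text with Telegram-specific clutter
message = "ሰውን ለመርዳት ሰው መሆን በቂ ነው"
sample = f"#መቄዶንያ {message} https://example.com/live @tikvahethiopia"

cleaned = strip_telegram_markup(sample)
print(repr(cleaned))
```

Note that the substitutions leave the surrounding whitespace in place, so the cleaned text can contain leading, trailing, or doubled spaces.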


Let's load our training data, check how many items it contains, and look at the first 5 of them.

In [17]:
data=load_data("datas/totaldata.json")
number_of_contents=len(data)
print(f'Total number of contents: {number_of_contents}\n')
print(f'First 5 contents: \n')
for news in data[:5]:
    print(news)
    print("---------------------------------------------------------------------------------\n\n")




Total number of contents: 193419

First 5 contents: 

#መቄዶንያ

ሰውን ለመርዳት ሰው መሆን በቂ ነው !

ትላንት የካቲት 1/2017 ዓ/ም በጀመረው የመቄዶንያ የአረጋዊያን እና የአእምሮ ህሙማን መርጃ ማዕከል የድጋፍ ማሰባሰብ ዘመቻ እስኩን 120,000,000 ብር ተሰብስቧል።

መቄዶንያ በሚያስገነባው ሆስፒታል ጭምር ያለው ህንፃ ለማጠናቀቅ የገንዘብ እጥረት አጋጥሞታል። ህንፃው ለማጠናቀቅ ገንዘብ ተቸግረናል። ለማጠናቀቅ ወደ 5 ቢሊዮን ብር ያስፈልጋል።

በቀጥታ ይከታተሉ 👇
https://www.youtube.com/live/q0bMjwt9PvM?feature=shared

የምትችሉትን ሁሉ ድጋፍ አድርጉ።

@tikvahethiopia
---------------------------------------------------------------------------------


🔊 #የሠራተኞችድምጽ

" ቋሚ ሠራተኞች ሆነን ሳለ በደሞዝ ማሻሻያው አልተካተትንም " - የሀዋሳ ዙሪያ ወረዳ መንግስት ሠራተኞች

የማክሮ ኢኮኖሚ ማሻሻያ ሪፎርሙን ተከትሎ የሚከሰቱ የኑሮ ዉድነትና ተያያዥ ጉዳዮችን ታሳቢ በማድረግ የመንግስት ሠራተኞች ደሞዝ ማሻሻያ ተደርጎ ከጥቅምት ወር 2017 ዓ/ም ጀምሮ ተግባራዊ የተደረገ መሆኑ ይታወቃል።

በሲዳማ ክልል፤ ሰሜናዊ ሲዳማ ዞን፤ ሀዋሳ ዙሪያ ወረዳ በተለያዩ የመንግስት መስሪያ ቤቶች የሚሰሩ የመንግስት ሠራተኞች ግን " ከ2012 ዓ/ም ጀምሮ በቋሚነት ተቀጥረን እየሰራን ያለን ቢሆንም በአዲሱ የመንግስት ሠራተኞች የደመወዝ ማሻሻያ አልተካተትንም " ሲሉ ቅሬታቸዉን ለቲክቫህ ኢትዮጵያ አስገብተዋል።

ቅሬታቸዉን ካደረሱን መካከል ፦
- በከተማ ልማትና ኮንስትራክሽን፣
- ማዘጋጃ ቤቶች፣
- በትምህርት ዘርፍ ፣
- በሴቶችና ሕፃናት እንዲሁም

Let's clean our data using the `clean_text` function and sample one item to see the difference between the original and the cleaned data.

In [18]:
cleaned_data=[clean_text(content) for content in data]

In [171]:
# randrange excludes the upper bound; randint(0, n) could return n, which is out of range
index = random.randrange(number_of_contents)
print(f"Data at index {index} before cleaning: \n\n",data[index])
print("\n---------------------------------------------------------------------------------\n\n")
print(f"Data at index {index} after cleaning: \n\n",cleaned_data[index])

Data at index 182822 before cleaning: 

 #የዛሬ (ለነሐሴ 4/2014 የተመረጡ የቲክቫህ ዜናዎች)

💐 የሙርሌ ጎሳዎች ባደረሱት ጥቃት የሁለት ሰዎች ህይወት አለፈ....

- በጋምቤላ ከደቡብ ሱዳን የሚነሱ የ "ሙርሌ ጎሳ" ታጣቂዎች ድንበር አቋርጠው በመግባት ከትናንት በስቲያ ከምሽቱ 4:30 ላይ በፈጸሙት ጥቃት የ2 ሰው ህይወት ሲያልፍ አንድ ሰው ቆስሎ 2 ህፃናት በታጣቂዎች መወሰዳቸውን የክልሉ ፖሊስ አሳውቋል። በአሁኑ ሰዓት ልዩ ኃይሉንና ሚሊሻ በመጠቀም ሰላምና ፀጥታ የማስከበር ስራውን እየሰራ መሆኑን አሳውቋል።

🏛 የክላስተር አደረጃጀትን ያላጸደቀው ብቸኛው የጉራጌ ዞን ምክር ቤት ነገ አስቸኳይ ስብሰባውን ያካሂዳል.....

- በደቡብ ክልል ከሚገኙት ዞኖችና ልዩ ወረዳዎች ውስጥ የ " ክላስተር " አደረጃጀትን ባለማጽድቅ ብቸኛ የሆነው የጉራጌ ዞን ምክር ቤት አስቸኳይ ጉባኤውን ነገ ሐሙስ ነሐሴ 5/2014 እንደሚያካሂድ ተገልጿል። ከክልልነት ጥያቄ ጋር ተያይዞ በትላንትናው ዕለት በወልቂጤ ከተማ የስራ ማቆም አድማ ተደርጎ የነበር ሲሆን ዛሬ ሩቡዕ  ከተማዋ ወደ መደበኛ እንቅስቃሴ መግባቷ ተሰምቷል።

🇺🇸🇸🇴 አሜሪካ 4 የአልሸባብ የሽብር ቡድን አባላትን መግደሏን ገለጸች...

- አሜሪካ ትላንትና ማክሰኞ ፤ ከሶማሊያ መንግስት ጋር በመተባበር በፈፀመችው የአየር ድብደባ 4 የአልሸባብ የሽብር ቡድን አባላትን መግደሏን አሳውቃለች።

4⃣ የደቡብ ምዕራብ ኢትዮጵያ ህዝቦች ክልል የብዝሃ ዋና ከተሞች ረቂቅ አዋጅን አጸደቀ....

አዲሱ የደቡብ ምዕራብ ኢትዮጵያ ህዝቦች ክልል መንግስት የብዝሃ ዋና ከተሞች ረቂቅ አዋጅን መርምሮ ማፅደቁ ተሰምቷል። በዚህም ቦንጋ ከተማ የክልሉ የፖለቲካ እና የርዕሰ መስተዳደሩ መቀመጫ፤ ተርጫ ከተማ የክልሉ ምክር 

### 3.3 Tokenization

#### Next, We Will Train the Tokenizer

Tokenization is a critical step in natural language processing (NLP) as it converts raw text into smaller, meaningful units (tokens) that can be processed by machine learning models. Effective tokenization ensures that the model can understand and interpret the text accurately, which is essential for tasks like text classification, machine translation, and sentiment analysis.

For this task, we will use the **SentencePiece tokenizer** instead of traditional word-based tokenization. The [SentencePieceTokenizer](https://www.tensorflow.org/text/api_docs/python/text/SentencepieceTokenizer) is a powerful tool that tokenizes text into **subword units**, which offers several advantages:

1. **Handling Complex Word Structures**: SentencePiece breaks words into smaller subword units, making it effective for handling complex word structures and morphological variations, which are common in languages like Amharic.
2. **Out-of-Vocabulary (OOV) Words**: By using subword tokenization, SentencePiece can handle out-of-vocabulary words more gracefully, as it can decompose them into known subword units.
3. **Multilingual Support**: SentencePiece is language-agnostic, making it suitable for multilingual datasets. This is particularly useful for Amharic, as it can handle the repetition of common subwords and morphological patterns unique to the language.
4. **Simplified Preprocessing**: SentencePiece works directly on raw text, eliminating the need for extensive preprocessing steps like word segmentation or stemming.
5. **Seamless Integration**: It integrates seamlessly with popular machine learning frameworks like TensorFlow and PyTorch, ensuring consistent tokenization across training and inference pipelines.

Given these benefits, SentencePiece is an ideal choice for tokenizing Amharic text, as it can effectively capture the language's unique characteristics while simplifying the overall preprocessing workflow.
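
To give a feel for how subword units emerge, the following is a toy sketch of byte-pair-encoding (BPE) style merging in plain Python. BPE is one of the model types SentencePiece supports, but this is not how SentencePiece is implemented internally. The word list and the `learn_bpe` helper are invented for illustration and only show the core idea of repeatedly merging the most frequent adjacent pair of symbols.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical mini-corpus sharing the stem "ሰላም"
words = ["ሰላም", "ሰላም", "ሰላምታ", "የሰላም"]
merges, vocab = learn_bpe(words, num_merges=2)
print(merges)
print(dict(vocab))
```

After two merges the shared stem becomes a single symbol, while the prefix and suffix characters remain separate pieces. This is the behaviour that helps with morphologically rich words.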

Let's train the SentencePiece tokenizer model first. To do that, we need to save our cleaned data as a single text corpus in a `.txt` file.

In [175]:
# "w" overwrites the file, so re-running the cell does not duplicate the corpus
with open("datas/cleaned_data.txt", "w", encoding="utf-8") as file:
    for content in cleaned_data:
        file.write(content + "\n")

In [110]:
input_file="datas/cleaned_data.txt"
model_prefix="amharic_sp_model"

spm.SentencePieceTrainer.train(
    input=input_file,
    model_prefix=model_prefix,
    vocab_size=32000,  
    model_type="bpe",  
    character_coverage=0.99, 
    num_threads=12,  
    max_sentence_length=8192, 
    split_by_whitespace=True,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
)
print(f"Training complete! Check '{model_prefix}.model' and '{model_prefix}.vocab'.")

Training complete! Check 'amharic_sp_model.model' and 'amharic_sp_model.vocab'.


After training the SentencePiece tokenizer, the next step is to load the trained tokenizer model.

In [111]:
tokenizer=spm.SentencePieceProcessor(model_file="amharic_sp_model.model")

This code shows how individual words from a given text are tokenized, in this case the entry at index 3000 of the cleaned dataset.

In [120]:
# printing the encoding of each word to see how subwords are tokenized
tokenized_text = [(list(tokenizer.tokenize(word)), word) for word in cleaned_data[3000].split()]

print("Word\t\t-->\tTokenization")
print("-"*40)
for element in tokenized_text:
    print(f"{element[1]:<8}\t-->\t{element[0]}")
    

Word		-->	Tokenization
----------------------------------------
በአማራ    	-->	[1763]
ክልል     	-->	[233]
መዲና     	-->	[5973]
በባህር    	-->	[6161]
ዳር      	-->	[1575]
ከተማ     	-->	[140]
ቀበሌ     	-->	[1582]
14      	-->	[1448]
ትላንት    	-->	[1534]
መጋቢት    	-->	[2900]
29      	-->	[2460]
የመግሪብ   	-->	[66, 1901, 31797]
ሰላት     	-->	[25, 200]
ሰግደው    	-->	[25, 31795, 476]
ሲመለሱ    	-->	[11140]
የነበሩ    	-->	[824]
አባት     	-->	[2970]
ከ3      	-->	[10, 31875]
ልጆቹ     	-->	[14764]
እንዲሁም   	-->	[342]
አንድ     	-->	[278]
ጎረቤታቸውን 	-->	[13460, 416]
ጨምሮ     	-->	[774]
አጠቃላይ   	-->	[1739]
5       	-->	[285]
ሰዎች     	-->	[146]
በተከፈተባቸው	-->	[18007, 293]
የጥይት    	-->	[17380]
እሩምታ    	-->	[7, 3700, 31800]
መገደላቸው  	-->	[10615]
ተነግሯል።  	-->	[1928]
ትላንትና   	-->	[14907]
ምሽት     	-->	[1473]
በግፍ     	-->	[13965]
የተገደሉት  	-->	[12161]
፥       	-->	[31769, 1]
አቶ      	-->	[261]
ሙሄ      	-->	[234, 31920]
፣       	-->	[164]
ልጃቸው    	-->	[9806]
አበባዉ    	-->	[255, 31894]
ሙሄ      	-->	[234, 31920]
፣       	-->	[164]
ሽኩር    

Now let's take one item from `cleaned_data` and see what the tokenization of the whole text looks like.

In [130]:
index=890
print(f"Data at index {index} before tokenization:", cleaned_data[index])
print("\n---------------------------------------------------------------------------------")
print(f"Data at index {index} after tokenization: ", tokenizer.encode(cleaned_data[index]))
print("\n---------------------------------------------------------------------------------")
print(f"Data at index {index} as subword pieces:", tokenizer.encode_as_pieces(cleaned_data[index]))

Data at index 890 before tokenization: 

" ሰላም ከአንገት በላይና ዝም ላለማለት ያህል የምንናገረው ሳይሆን ዋጋ ከፍለን የምናመጣው ነው " - ቅዱስነታቸው

ዛሬ የሰላም ሚኒስቴር አንድ  ዓለም አቀፍ ኮንፈረንስ አዘጋጅቶ ነበር።

በዚህ መድረክ ላይም የሰላም ሚኒስትር አቶ ብናልፍ አዱዓለም ፣ የኢትዮጵያ የሃይማኖት ተቋማት የበላይ ጠባቂ አባቶች፣ ብፁዓን አበው ሊቃነ ጳጳሳት ወኤጲስ ቆጶሳት ፣ የመንግሥት ባለስልጣናት ፣ አባሳደሮች ጭምር ተገኝተው ነበር።

በመድረኩ  ፤ ብፁዕ ወቅዱስ አቡነ ማትያስ ቀዳማዊ ፓትርያርክ ርእሰ ሊቃነ ጳጳሳት ዘኢትዮጵያ ሊቀ ጳጳስ ዘአክሱም ወእጨጌ ዘመንበረ ተክለሃይማኖት መልዕክት አስተላልፈዋል።

ቅዱስነታቸው ምን አሉ ?

(ከመልዕክታቸው የተወሰደ)

" ሰላም የሰው ልጆች ፍላጎት፣ የብዙ ምንዱባን የየዕለት ናፍቆት ነው። የበርካታ ዘመናት ቅርሶች፤ ጊዜ፣ ገንዘብ እና የሰው ጉልበት የፈሰሰባቸው ግንባታዎች በሰላም ማጣት በቅጽበት ይፈርሳሉ። 

ሰላም ካለ የዓለም ሀብት ለሁሉም በቂ ነው። ሰላም ማጣት ግን ብዙ ሠራዊት፣ ብዙ የጦር መሳሪያ እንዲዘጋጅ እያደረገ ሀብትን ያወድማል። 

ጦርነት ማለት ሀብትና ሕይወትን ወደሚነድ እሳት ውስጥ መጣል ነው። 

የአንደኛና የሁለተኛ ዓለም ጦርነት፣ ታሪክ ብቻ ሳይሆን ጠባሳው አሁንም የዓለምን መልክ አበላሽቶታል። 

ሰላም በውስጥዋ ገራምነት፣ ትዕግሥት፣ ታዛዥነትና በትህትና ዝቅ ማለት ስለሚገኙ መራራ ትመስላለች፤ በውጤቷ ግን ሀገርን ከጥፋት፣ ሕዝብን ከመከራ ማትረፍ የሚቻል በመሆኑ ዋጋዋ ከፍ ያለ ነው። 

ቅድስት ቤተ ክርስቲያናችን ፦
° ሰላም የሆነው ክርስቶስ የሚሰበክባት፣ 
° የሰላም መልእክተኞች በውስጥዋ የሚመላለሱባት፣ 
° በግብረ ኃጢአት የወደቁት በንስሓ ከእግዚአብሔር 

To pretrain our Transformer network, we will use the Masked Language Model (MLM) approach. This technique involves randomly masking a percentage of words in a sentence and replacing them with special tokens. The model then attempts to predict these masked words, enabling it to learn contextual and semantic representations effectively.

I will implement the Masked Language Model (MLM) objective as illustrated in the following image. 

<img src = "images/losses.png" width="600" height = "400">

Assume you have the following text: <span style = "color:blue"> **ሰላም <span style = "color:red">የሰው ልጆች </span> ፍላጎት፣ የብዙ ምንዱባን የየዕለት <span style = "color:red">ናፍቆት</span>  ነው።** </span>     


Now as input you will mask the words in red in the text: 

<span style = "color:blue"> **Input:**</span> ሰላም  **X** ፍላጎት፣ የብዙ ምንዱባን የየዕለት **Y** ነው።

<span style = "color:blue">**Output:**</span> The model should predict the word(s) for **X** and **Y**. 

**[EOS]** will be used to mark the end of the target sequence.
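
The example above can be reproduced with a few lines of Python. This is a word-level sketch with hand-picked mask positions, so the helper `mask_words` and the sentinel names `<X>` and `<Y>` are illustrative assumptions. The actual implementation later in this notebook works on subword ids and chooses the masked positions at random.

```python
def mask_words(words, masked_positions, sentinels=("<X>", "<Y>", "<Z>"), eos="[EOS]"):
    """Replace each run of masked words with one sentinel (word-level sketch)."""
    inputs, targets = [], []
    next_sentinel = 0
    prev_masked = False
    for i, word in enumerate(words):
        if i in masked_positions:
            # Only the first word of a masked run introduces a new sentinel
            if not prev_masked:
                sentinel = sentinels[next_sentinel]
                next_sentinel += 1
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(word)
            prev_masked = True
        else:
            inputs.append(word)
            prev_masked = False
    # The target sequence always ends with the end-of-sequence marker
    targets.append(eos)
    return inputs, targets

sentence = "ሰላም የሰው ልጆች ፍላጎት፣ የብዙ ምንዱባን የየዕለት ናፍቆት ነው።"
words = sentence.split()

# Mask the two-word span at positions 1-2 and the single word at position 7
inputs, targets = mask_words(words, masked_positions={1, 2, 7})
print("Input: ", " ".join(inputs))
print("Target:", " ".join(targets))
```

The two consecutive masked words share a single sentinel, which keeps the input short while the target still lists every hidden word.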

As you can see above, I was able to take a string and tokenize it. 

Now I will create `input` and `target` pairs that will allow me to pre-train the model. The ids at the end of the vocabulary are used as sentinels. For display purposes, they will be shown as follows: 
   - `vocab_size - 1` as `<Z>`
   - `vocab_size - 2` as `<Y>`
   - and so forth. 
   
Each sentinel id is thus paired with one character of `string.ascii_letters`, taken in reverse order.

The `pretty_decode` function below, which I will use in a bit, helps in handling the type when decoding. 

Notice that:
```python
string.ascii_letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
```
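
As a small illustration of this numbering, the sketch below builds the id-to-label mapping directly, assuming a vocabulary of 32,000 pieces (the `vocab_size` passed to the trainer earlier). The dictionary name `sentinel_labels` is used only for illustration.

```python
import string

vocab_size = 32000  # assumed: matches the vocab_size passed to the trainer

# Walk the letters backwards (Z, Y, ..., A, z, ..., a) and pair each one
# with an id counted down from the end of the vocabulary
sentinel_labels = {
    vocab_size - i: f"<{char}>"
    for i, char in enumerate(reversed(string.ascii_letters), 1)
}

print(sentinel_labels[vocab_size - 1])   # <Z>
print(sentinel_labels[vocab_size - 2])   # <Y>
print(len(sentinel_labels))              # 52
```

With 52 letters available, only the last 52 ids of the vocabulary receive a readable label.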



In [None]:
def get_sentinels(tokenizer, display=False):
    sentinels = {}
    vocab_size = tokenizer.vocab_size()
    for i, char in enumerate(reversed(string.ascii_letters), 1):
        decoded_text = tokenizer.detokenize([vocab_size - i])
        
        # Sentinels, ex: <Z> - <a>
        sentinels[decoded_text] = f'<{char}>'    
    
        if display:
            print(f'The sentinel is <{char}> and the decoded token is:', decoded_text)

    return sentinels

def pretty_decode(encoded_str_list, sentinels, tokenizer):
    # If already a string, just do the replacements.
    if isinstance(encoded_str_list, str):
        for token, char in sentinels.items():
            # Literal replacement: the token may contain regex metacharacters,
            # and an empty token would otherwise match everywhere
            if token:
                encoded_str_list = encoded_str_list.replace(token, char)
        return encoded_str_list

    # We need to decode and then prettify it.
    return pretty_decode(tokenizer.detokenize(encoded_str_list), sentinels, tokenizer)

In [301]:
sentinels = get_sentinels(tokenizer, display=True)

The sentinel is <Z> and the decoded token is: C
The sentinel is <Y> and the decoded token is: ጌ
The sentinel is <X> and the decoded token is: f
The sentinel is <W> and the decoded token is: E
The sentinel is <V> and the decoded token is: S
The sentinel is <U> and the decoded token is: ጪ
The sentinel is <T> and the decoded token is: ጁ
The sentinel is <S> and the decoded token is: ቼ
The sentinel is <R> and the decoded token is: ”
The sentinel is <Q> and the decoded token is: b
The sentinel is <P> and the decoded token is: ሟ
The sentinel is <O> and the decoded token is: ዬ
The sentinel is <N> and the decoded token is: ሂ
The sentinel is <M> and the decoded token is: T
The sentinel is <L> and the decoded token is: •
The sentinel is <K> and the decoded token is: ቧ
The sentinel is <J> and the decoded token is: y
The sentinel is <I> and the decoded token is: ፦
The sentinel is <H> and the decoded token is: ሔ
The sentinel is <G> and the decoded token is: p
The sentinel is <F> and the decoded toke

Now, let's use the `pretty_decode` function in the following sentence.

<a name='1-5'></a>
### 3.4 - Tokenizing and Masking

In this task, I will implement the `tokenize_and_mask` function, which tokenizes the input text and masks tokens with a given probability. The probability is controlled by the `noise` parameter, typically set so that around `15%` of the tokens are masked. The function returns two lists of token ids, the inputs and the targets, following the algorithm outlined below:


###  tokenize_and_mask

- Start with two empty lists: `inps` and `targs`
- Tokenize the input text using the given tokenizer.
- For each `token` in the tokenized sequence:
  - Generate a random number (simulating a weighted coin toss)
  - If the random value is greater than the given threshold (`noise`):
    - Add the current token to the `inps` list
  - Else:
    - If a new sentinel must be included:
      - Compute the next sentinel ID using a progression.
      - Add a sentinel into the `inps` and `targs` to mark the position of the masked element.
    - Add the current token to the `targs` list.

**Note:** There is a special case to consider. If two or more consecutive tokens are masked, there is no need to add a new sentinel for each of them. To account for this, use the `prev_no_mask` flag, which starts as `True`, is set to `False` each time a token is masked, and is set back to `True` when a token is kept. The code that adds a sentinel runs only if the flag was `True` before the current token was masked.


In [302]:
def tokenize_and_mask(text, 
                      noise=0.15, 
                      randomizer=np.random.uniform, 
                      tokenizer=None):
    """Tokenizes and masks a given input.

    Args:
        text (str or bytes): Text input.
        noise (float, optional): Probability of masking a token. Defaults to 0.15.
        randomizer (function, optional): Function that generates random values. Defaults to np.random.uniform.
        tokenizer (sentencepiece.SentencePieceProcessor): Trained tokenizer used to encode the text. Must be provided, despite the None default.

    Returns:
        inps, targs: Lists of integers associated to inputs and targets.
    """
    
    # Current sentinel number (starts at 0)
    cur_sentinel_num = 0
    
    # Inputs and targets
    inps, targs = [], []

    # Vocab_size
    vocab_size = int(tokenizer.vocab_size())
    
    # EOS token id 
    # Must be at the end of each target!
    eos = tokenizer.piece_to_id("</s>")
    

    
    # prev_no_mask is True if the previous token was NOT masked, False otherwise
    # set prev_no_mask to True
    prev_no_mask = True
    
    # Loop over the tokenized text
    for token in tokenizer.tokenize(text):
        
        # Generate a random value between 0 and 1
        rnd_val = randomizer() 
        
        # Check if the noise is greater than a random value (weighted coin flip)
        if noise > rnd_val:
            
            # Check if previous token was NOT masked
            if prev_no_mask:
                
                # Current sentinel increases by 1
                cur_sentinel_num += 1
                
                # Compute end_id by subtracting current sentinel value out of the total vocabulary size
                end_id = vocab_size - cur_sentinel_num
                
                # Append end_id at the end of the targets
                targs.append(end_id)
                
                # Append end_id at the end of the inputs
                inps.append(end_id)
                
            # Append token at the end of the targets
            targs.append(token)
            
            # set prev_no_mask accordingly
            prev_no_mask = False

        else:
            
            # Append token at the end of the inputs
            inps.append(token)
            
            # Set prev_no_mask accordingly
            prev_no_mask = True
    
    
    # Add EOS token to the end of the targets
    targs.append(eos)
    

    
    return inps, targs

I will now take one item from `cleaned_data`, pass it to the `tokenize_and_mask` function, and see how it randomly masks tokens and separates inputs from targets.

In [303]:
random_data=cleaned_data[3000]
print("Random data before tokenization: \n\n", random_data)

Random data before tokenization: 

 

በአማራ ክልል መዲና በባህር ዳር ከተማ ቀበሌ 14 ትላንት መጋቢት 29 የመግሪብ ሰላት ሰግደው ሲመለሱ የነበሩ አባት ከ3 ልጆቹ እንዲሁም አንድ ጎረቤታቸውን ጨምሮ አጠቃላይ 5 ሰዎች በተከፈተባቸው የጥይት እሩምታ መገደላቸው ተነግሯል።

ትላንትና ምሽት በግፍ የተገደሉት ፥ አቶ ሙሄ ፣ ልጃቸው አበባዉ ሙሄ ፣ ሽኩር ሙሄ ፣ ሙላት ሙሄ እና ጎረቤታቸው አቶ እንድሪስ የተባሉ ሲሆኑ ስርዓት ቀብራቸው በዛሬው ዕለት ተፈፅሟል።

እስካሁን ገዳዮች ስለመያዛቸው የተባለ ነገር የለም።

በከተማዋ ከተገደሉት ሰዎች ባሻገር ባህርዳር ከተማ አባይ ማዶ የሚገኘው መስጂድ ከፍተኛ የመሳሪያ ድብደባ እንደተፈፀመበት ተገልጿል።

ከዚሁ ጋር በተያያዘ ዛሬ የባህር ዳር ሙስሊሞች በክልሉ በሙስሊሞች ላይ አነጣጥረዋል ያሉትን ግድያ እና እገታ በመቃወም ሰልፍ ማድረጋቸውን " ሀሩን ሚዲያ " ዘግቧል።

እስካሁን በአማራ ክልል እስልምና ጉዳዮች ከፍተኛ ምክር ቤትም ሆነ በኢትዮጵያ እስልምና ጉዳዮች ጠቅላይ ምክር ቤት የተሰጠ አስተያየት የለም።

ቲክቫህ ኢትዮጵያ በክልሉ ተፈፅመዋል ስለተባሉ ግድያዎች ፣ ጥቃቶች ፣ ዘረፋና እገታዎች የአማራ ክልል እስልምና ጉዳዮች ከፍተኛ ምክር ቤት እና የሚመለከታቸው አካላትን ለማነጋገር ጥረት እያደረገ ይገኛል፤ ምላሽ እንዳገኘ ተጨማሪ መረጃዎችን ያቀርባል።




In [304]:
inps_sample, targs_sample = tokenize_and_mask(random_data, noise=0.15, randomizer=np.random.uniform, tokenizer=tokenizer)


In [None]:
print('Inputs: \n\n', pretty_decode(inps_sample, sentinels, tokenizer))
print('\nTargets: \n\n', pretty_decode(targs_sample, sentinels, tokenizer))

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()