### Text Processing: Handling Amharic text, tokenization, and preprocessing techniques.

Before the scraped Amharic text can be tokenized and labeled, it needs to be cleaned and normalized with steps tailored to the language and its linguistic features.

Here’s how we can approach this task:

**Steps to Preprocess Amharic Text**

- **Tokenization**: Tokenization splits text into individual units such as words or subwords. Amharic is written in the Ge'ez (Ethiopic) script and uses its own punctuation marks (for example ። and ፣), so general-purpose tokenizers may need adjustment.
    - Use a library that supports Amharic text, or write a custom rule-based tokenizer.

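- **Normalization**: This step cleans the text and converts it into a standard format:

    - Remove emoji, special characters, and punctuation; remove numbers only if the task does not need them (the preprocessed output below keeps prices and phone numbers).
    - Normalize homophonous characters that writers use interchangeably (for example ሀ/ሐ/ኀ, ሰ/ሠ, አ/ዐ, and ጸ/ፀ).
    - Convert text to a standard form (for example, removing diacritics if necessary).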

- **Handling Amharic-Specific Features**: Amharic morphology and orthography affect where useful token boundaries fall:

    - Amharic, like other Semitic languages, has root-and-pattern morphology.

    - Handle orthographic variants and account for the prefixes, suffixes, and infixes attached to words.

    - Identify verb conjugations, plural forms, and possessives for better tokenization.
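
The first two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of `AmharicTextPreprocessor`, whose internals are not shown here. The function names `normalize_amharic` and `tokenize_amharic`, the list of homophone families, and the choice to keep only Ethiopic letters and ASCII digits are all assumptions.

```python
import re

# Homophone families as (variant base, canonical base) code points.
# Each family is mapped across its seven basic vowel orders.
# Which families to merge is an assumption; adjust to your task.
_HOMOPHONE_BASES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]
_HOMOPHONE_TABLE = {
    variant + order: canonical + order
    for variant, canonical in _HOMOPHONE_BASES
    for order in range(7)
}

# Anything that is not an Ethiopic letter (U+1200..U+135A), an ASCII digit,
# or whitespace. This also removes Ethiopic punctuation, emoji, and Latin text.
_NOT_KEPT = re.compile(r'[^\u1200-\u135A0-9\s]')


def normalize_amharic(text):
    """Merge homophone characters, drop unwanted symbols, collapse spaces."""
    text = text.translate(_HOMOPHONE_TABLE)
    text = _NOT_KEPT.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()


def tokenize_amharic(text):
    """Rule-based tokenizer: normalize, then split on whitespace."""
    return normalize_amharic(text).split()


sample_tokens = tokenize_amharic("ሰላም እንዴት ነህ? እንኳን ደህና መጣህ።")
```

Keeping digits preserves prices and phone numbers, which matter for later entity labeling; remove `0-9` from the pattern if numbers are not needed.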

In [1]:

# Import necessary libraries
import pandas as pd
import logging
import os
import sys
import matplotlib.pyplot as plt
from matplotlib import font_manager
from collections import Counter
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))
# Import data preprocessor class
from amharic_text_processor import AmharicTextPreprocessor
from amharic_labeler import AmharicNERLabeler

# Set max rows and columns to display
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

# Configure logging
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

logger.info("Imported libraries and configured logging.")

2024-09-30 09:48:35,104 - INFO - Imported libraries and configured logging.


**Load the scraped Telegram data**

In [2]:
# Read the data
data = pd.read_csv('../data/telegram_data.csv')
# Explore the first five rows
data.head()

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path
0,ልዩ እቃ,@Leyueqa,5822,🌼🌼🌼🔴ይህን መፍጫ ከሁሉም የተሻለ ሆኖ አግኝተነዋል❗️kitchen expe...,2024-09-26 09:27:56+00:00,
1,ልዩ እቃ,@Leyueqa,5820,⌛💧🌼🌼🌼Telescopic Stainless Steel Majic Mop\n\n✔...,2024-09-26 05:49:13+00:00,
2,ልዩ እቃ,@Leyueqa,5819,🔠🔠🔠🔠🔠Siliver crest ➡️Brand ባለ1 እና ባለ 2 ተች ስቶ...,2024-09-25 17:39:49+00:00,
3,ልዩ እቃ,@Leyueqa,5818,🔠🔠🔠🔠ሶስት ፍሬ የዳቦ እና የኬክ ቅርጽ ማውጫ ( መጋገሪያ ፓትራ )\n\...,2024-09-25 10:38:58+00:00,
4,ልዩ እቃ,@Leyueqa,5817,🧳🧳🧳HIGH PRESSURE WATER GUN HEAD SET\n👉 360° የሚ...,2024-09-25 07:44:47+00:00,


In [3]:
# Check the last five rows
data.tail()

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path
1668,ልዩ እቃ,@Leyueqa,148,ይመቻቹ ፈታ ያለ ምሽት ተመኘሁ,2018-10-25 13:09:24+00:00,
1669,ልዩ እቃ,@Leyueqa,136,,2018-10-20 12:46:15+00:00,
1670,ልዩ እቃ,@Leyueqa,70,,2018-09-04 15:28:25+00:00,
1671,ልዩ እቃ,@Leyueqa,55,,2018-08-23 20:18:56+00:00,
1672,ልዩ እቃ,@Leyueqa,1,,2018-08-02 07:30:19+00:00,


In [4]:
data.shape

(1673, 6)

In [5]:
# Let's check the missing values
data.isnull().sum()

Channel Title         0
Channel Username      0
ID                    0
Message             704
Date                  0
Media Path          548
dtype: int64

In [6]:
# Preprocess the Amharic messages in the 'Message' column
preprocessor = AmharicTextPreprocessor()

# The result holds the cleaned text in a new 'preprocessed_message' column
tokens = preprocessor.preprocess_dataframe(data, 'Message')
display(tokens)


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,preprocessed_message
0,ልዩ እቃ,@Leyueqa,5822,🌼🌼🌼🔴ይህን መፍጫ ከሁሉም የተሻለ ሆኖ አግኝተነዋል❗️kitchen expe...,2024-09-26 09:27:56+00:00,,ይህን መፍጫ ከሁሉም የተሻለ ሆኖ አግኝተነዋል 1 አስተማማኝ የሆነ ዕቃ በ...
1,ልዩ እቃ,@Leyueqa,5820,⌛💧🌼🌼🌼Telescopic Stainless Steel Majic Mop\n\n✔...,2024-09-26 05:49:13+00:00,,ማጂክ መወልወያ ውሃ በከፍተኛ ደረጃ ይመጣል በራሱ ይጨምቃል ከእጅ ንኪኪ ...
2,ልዩ እቃ,@Leyueqa,5819,🔠🔠🔠🔠🔠Siliver crest ➡️Brand ባለ1 እና ባለ 2 ተች ስቶ...,2024-09-25 17:39:49+00:00,,ባለ1 እና ባለ 2 ተች ስቶቭ ግዜዎን እና ጉልበትዎን የሚቆጥብ ፈጣን ስቶ...
3,ልዩ እቃ,@Leyueqa,5818,🔠🔠🔠🔠ሶስት ፍሬ የዳቦ እና የኬክ ቅርጽ ማውጫ ( መጋገሪያ ፓትራ )\n\...,2024-09-25 10:38:58+00:00,,ሶስት ፍሬ የዳቦ እና የኬክ ቅርጽ ማውጫ መጋገሪያ ትራ ትልቁ 3 24 26...
4,ልዩ እቃ,@Leyueqa,5817,🧳🧳🧳HIGH PRESSURE WATER GUN HEAD SET\n👉 360° የሚ...,2024-09-25 07:44:47+00:00,,360 የሚዞር በቀላሉ የውሃ ቱቦ ላይ የሚገጠም ለመኪና እጥበት ተመራጭ አ...
...,...,...,...,...,...,...,...
1668,ልዩ እቃ,@Leyueqa,148,ይመቻቹ ፈታ ያለ ምሽት ተመኘሁ,2018-10-25 13:09:24+00:00,,ይመቻቹ ፈታ ያለ ምሽት ተመኘሁ
1669,ልዩ እቃ,@Leyueqa,136,,2018-10-20 12:46:15+00:00,,
1670,ልዩ እቃ,@Leyueqa,70,,2018-09-04 15:28:25+00:00,,
1671,ልዩ እቃ,@Leyueqa,55,,2018-08-23 20:18:56+00:00,,


In [7]:
# Drop rows that have no message text
data.dropna(subset=['Message'], inplace=True)

In [8]:
list(data['preprocessed_message'])

['ይህን መፍጫ ከሁሉም የተሻለ ሆኖ አግኝተነዋል 1 አስተማማኝ የሆነ ዕቃ በኤሌክትሪክ የሚሰራ የሽንኩርት የስጋ የጨጓራ መፍጫ በ 15 ሰከንዶች ዉስጥ ድቅቅ አድርጎ የሚፈጭ ምላጭ በ 3 ሊትር እና 5ሊትር ፈጣን የሆኑ 3ሊትር 2100 5ሊትር 2600 የ1 አመት ዋስትና ያላቸዉ በጣም የሚያስደስት ዕቃ ደዉለዉ ማዘዝ ይችላሉ ይህን ዕቃ እና ሌላ 1 እቃ ጨምረዉ ሲገዙ 1 ምርጥ ቢላ በነፃ አድራሻ ቁጥር 1 ልደታ ወደ ባልቻ ሆስታል ገባ ብሎ አህመድ ህንፃ ላይ 1ኛፎቅ 105 ቁጥር 2 22 አውራሪስ ሆቴል አጠገብ በፀጋ ህንፃ 3ኛ ፎ ቅ ከ ሊፍቱ በ ግራ የሱቅ ቁጥር 10 ይደዉሉልን ባሉበት ያለተጨማሪ ክፍያ ማዘዝ ይችላሉ ክፍያዎንበሞባይልባንኪንግመፈፀምምይችላሉ በተጨማሪ ከ1000ብር በላይ የሆኑ ሁለትዕቃዎች ሲገዙ ስጦታ እንልክለዎታለን / ቻናላችንን ለጓደኛዎ ሸር ማድረግዎን አይርሱ 0933334444 0944109295 0946242424',
 'ማጂክ መወልወያ ውሃ በከፍተኛ ደረጃ ይመጣል በራሱ ይጨምቃል ከእጅ ንኪኪ ነፃ ዋጋ 900 ብር ከነፃ ዲሊቨሪ ጋር አድራሻ ቁጥር 1 ልደታ ወደ ባልቻ ሆስታል ገባ ብሎ አህመድ ህንፃ ላይ 1ኛፎቅ 105 ቁጥር 2 22 አውራሪስ ሆቴል አጠገብ በፀጋ ህንፃ 3ኛ ፎ ቅ ከ ሊፍቱ በ ግራ የሱቅ ቁጥር 10 ይደዉሉልን 0933334444 0946242424 0944109295',
 'ባለ1 እና ባለ 2 ተች ስቶቭ ግዜዎን እና ጉልበትዎን የሚቆጥብ ፈጣን ስቶቭ ባለ 2 7200 ባለ 13100 ክፍያዎን ዕቃዉ እጅዎ ሲደርስበሞባይልባንኪንግመፈፀምይችላሉ በተጨማሪ ሁለት ዕቃዎችን ከ1000ብር በላይ የሚተመኑ 2 ዕቃዎችንአንዴ ሲገዙ ስጦታ እንልክለዎታለን 0933334444 0944109295 0946242424',
 'ሶስት ፍሬ የዳቦ እና የኬክ ቅርጽ ማውጫ መጋገሪያ ትራ ት

In [9]:
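# Drop NaN values from the preprocessed column and collect the remaining texts
preprocessed_texts = tokens['preprocessed_message'].dropna().tolist()
# reset_index turns the Series into a DataFrame with 'index' and 'message' columns
df = pd.Series(preprocessed_texts).reset_index(name='message')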


In [10]:
# Initialize the labeler
labeler = AmharicNERLabeler()

# Split each preprocessed message (df from the previous cell) on whitespace
df['Tokenized'] = df['message'].str.split()
# Label the tokens in the DataFrame
labeled_df = labeler.label_dataframe(df, 'Tokenized')

# Save to CoNLL format
labeler.save_conll_format(labeled_df, '../labeled_data_conll.conll')
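
For reference, a CoNLL-style file puts one token and its label on each line and separates messages with a blank line. The sketch below shows that layout only. `to_conll` and the label names (`B-PRICE`, `I-PRICE`, `B-LOC`) are hypothetical, and the project's `save_conll_format` may use a different separator or tag set.

```python
def to_conll(sentences):
    """Render labeled messages as CoNLL-style text.

    sentences: list of messages, each a list of (token, label) pairs.
    """
    blocks = [
        "\n".join(f"{token} {label}" for token, label in sentence)
        for sentence in sentences
    ]
    # A blank line separates consecutive messages.
    return "\n\n".join(blocks) + "\n"


# Hypothetical labels, for illustration only
example = [
    [("ዋጋ", "B-PRICE"), ("900", "I-PRICE"), ("ብር", "I-PRICE")],
    [("ልደታ", "B-LOC")],
]
conll_text = to_conll(example)
```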



In [11]:
# Drop the helper 'index' column created by reset_index
labeled_df.drop(columns=['index'], inplace=True)

In [12]:
# Count messages that repeat an earlier message
labeled_df['message'].duplicated().sum()

np.int64(271)
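
`duplicated()` flags every occurrence of a message after its first, so the sum above counts surplus rows rather than distinct repeated messages. The same count, and an order-preserving de-duplication, can be sketched in plain Python. The helper names and the sample list are illustrative only.

```python
def count_surplus_duplicates(messages):
    """Number of items that repeat an earlier item."""
    return len(messages) - len(set(messages))


def drop_repeats(messages):
    """Keep the first occurrence of each message, preserving order."""
    return list(dict.fromkeys(messages))


sample = ["a", "b", "a", "c", "b", "a"]
surplus = count_surplus_duplicates(sample)  # 3 items repeat an earlier one
unique = drop_repeats(sample)               # ['a', 'b', 'c']
```

On the DataFrame itself, `labeled_df.drop_duplicates(subset='message')` keeps the first occurrence of each message.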