# EthioMart - Named Entity Recognition for Amharic E-commerce Data

## Overview

**EthioMart**, a growing hub for Telegram-based e-commerce in Ethiopia, aims to consolidate multiple independent e-commerce channels into a single centralized platform. With the increasing use of Telegram for business transactions, customers and vendors are currently spread across various channels, leading to challenges in product discovery, communication, and order management.

This project focuses on building an Amharic Named Entity Recognition (NER) system to extract important business entities—such as product names, prices, and locations—from the messages shared in these Telegram channels. The extracted data will be used to populate EthioMart's centralized database, providing a seamless and organized shopping experience for customers and a unified platform for vendors.

## Key Objectives

1. Real-time Data Extraction: Fetch data from various Ethiopian Telegram e-commerce channels.
2. Fine-tuning Large Language Models (LLMs): Adapt existing LLMs to accurately extract business entities like product names, prices, and locations from Amharic text.

## Install Necessary Packages

In [1]:
!pip install transformers datasets seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=934bac6590b2903f743c7421514941d1b0cd603ef8818082e8c2a977deb2adb1
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


## Import Necessary Libraries

In [2]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
from seqeval.metrics import accuracy_score, classification_report
import torch
from sklearn.model_selection import train_test_split
from collections import Counter

### Load the Dataset

In [3]:
def load_conll_dataset(file_path):
    sentences, labels = [], []
    sentence, label = [], []
    
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            if line.strip():  # If the line is not empty
                parts = line.strip().split()
                if len(parts) == 2:  # Ensure the line has exactly two components
                    token, tag = parts
                    sentence.append(token)
                    label.append(tag)
                else:
                    print(f"Skipping malformed line: {line.strip()}")
            else:
                if sentence:  # Append only if the sentence is not empty
                    sentences.append(sentence)
                    labels.append(label)
                sentence, label = [], []
    
    if sentence:  # Append any remaining sentence
        sentences.append(sentence)
        labels.append(label)
    
    return pd.DataFrame({"tokens": sentences, "ner_tags": labels})
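The loader above expects a two-column CoNLL layout: one whitespace-separated `token tag` pair per line, with a blank line between sentences. The following is a minimal sketch of that format and of the same parsing idea, not the notebook's own function. The sample text and the helper name `parse_conll` are hypothetical and exist only for illustration, and unlike the loader this sketch skips malformed lines silently.

```python
import io

# Hypothetical sample in the two-column layout the loader expects:
# one "token tag" pair per line, blank line between sentences.
SAMPLE = """\
ዋጋ O
500 B-PRICE
ብር I-PRICE

አዲስ B-LOC
አበባ I-LOC
"""


def parse_conll(handle):
    """Yield (tokens, tags) for each sentence in a two-column CoNLL stream."""
    tokens, tags = [], []
    for line in handle:
        parts = line.split()
        if len(parts) == 2:
            # Well-formed line: token followed by its tag.
            tokens.append(parts[0])
            tags.append(parts[1])
        elif not parts and tokens:
            # Blank line closes the current sentence.
            yield tokens, tags
            tokens, tags = [], []
        # Any other line is malformed and is skipped silently in this sketch.
    if tokens:
        # The file may end without a trailing blank line.
        yield tokens, tags


sentences = list(parse_conll(io.StringIO(SAMPLE)))
print(sentences)
```

Reading from a file handle rather than a path keeps the sketch easy to try on small in-memory samples before pointing it at the real data file.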


### Split the dataset

In [4]:
# Function to split the CoNLL dataset into training and validation sets
def split_conll_dataset(conll_df, train_ratio=0.8):
    # Split the dataset into train and validation sets
    train_df, val_df = train_test_split(conll_df, train_size=train_ratio, random_state=42, shuffle=True)

    return train_df, val_df

# Example usage
file_path = "/kaggle/input/reduced-data/labeled_ner_data.txt"
conll_df = load_conll_dataset(file_path)  # Load the dataset
train_dataset, val_dataset = split_conll_dataset(conll_df)  # Split the dataset

# Check the sizes of the resulting datasets
print(f"Training set size: {len(train_dataset)}")
print(f"Validation set size: {len(val_dataset)}")


Training set size: 1072
Validation set size: 268


### Label Mapping

In [5]:
# Define label mapping
label_to_id = {
    "O": 0,  # Outside of entity
    "B-Product": 1,  # Beginning of a Product entity
    "I-Product": 2,  # Inside of a Product entity
    "B-PRICE": 3,  # Beginning of a Price entity
    "I-PRICE": 4,  # Inside of a Price entity
    "B-LOC": 5,  # Beginning of a Location entity
    "I-LOC": 6   # Inside of a Location entity
}

# Reverse mapping for predictions
id_to_label = {v: k for k, v in label_to_id.items()}
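Because `label_to_id` is indexed directly during tokenization, any tag spelled differently from the keys above (for example `B-Price` instead of `B-PRICE`) raises a `KeyError`. A small check run before training can surface such tags early. This is a sketch only: the helper name `find_unknown_tags` and the example sequences are hypothetical.

```python
from collections import Counter

# Copied from the mapping above so the sketch runs on its own.
label_to_id = {
    "O": 0,
    "B-Product": 1,
    "I-Product": 2,
    "B-PRICE": 3,
    "I-PRICE": 4,
    "B-LOC": 5,
    "I-LOC": 6,
}


def find_unknown_tags(tag_sequences, known):
    """Count every tag that has no entry in the label mapping."""
    return Counter(
        tag for tags in tag_sequences for tag in tags if tag not in known
    )


# Hypothetical tag sequences, for illustration only.
example = [["O", "B-Price", "I-Price"], ["B-LOC", "I-LOC", "O"]]
unknown = find_unknown_tags(example, label_to_id)
print(unknown)
```

An empty `Counter` means every tag can be converted to an integer id.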

### Tokenize and Align Labels

In [6]:
def tokenize_and_align_labels(examples, tokenizer, label_all_tokens=True):
    tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)
    
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        # Replace any '0' (zero) with 'O' (uppercase letter O) and 'o' (lowercase) with 'O'
        label = ['O' if l in ['0', 'o'] else l for l in label]
        
        # Convert string labels to integers using label_to_id mapping
        label = [label_to_id[l] for l in label]  # Mapping the string NER tags to integers
        
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Special tokens (e.g. [CLS], [SEP]) are ignored by the loss
            elif word_idx != previous_word_idx:  # First token of a word
                label_ids.append(label[word_idx])
            else:  # Non-first token of a word
                label_ids.append(-100 if not label_all_tokens else label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
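The alignment step can be tried without downloading a tokenizer. The sketch below reimplements only the inner loop as a standalone function. The name `align_labels` and the `word_ids` list are hypothetical: the list stands in for what a fast tokenizer's `word_ids()` might return for three words, two of which are split into two sub-word pieces.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level label ids to sub-word level, using -100 as the ignore index."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens carry no word index and are masked out.
            aligned.append(-100)
        elif word_idx != previous:
            # First sub-word piece of a word keeps the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Later pieces either repeat the label or are masked out.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return aligned


# Hypothetical word ids: [special] w0 w0 w1 w2 w2 [special]
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [1, 0, 5]  # B-Product, O, B-LOC

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 1, 1, 0, 5, 5, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 1, -100, 0, 5, -100, -100]
```

With `label_all_tokens=True`, a `B-` label is repeated on every piece of a split word, which may affect entity-level counts at evaluation time. Setting it to `False` scores only the first piece of each word.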

### Train and Evaluate Model

In [7]:
# Function to fine-tune the model
def train_and_evaluate_model(model_name, train_dataset, val_dataset, label_list, batch_size=16, epochs=15):
    print(f"Training model: {model_name}")
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))

    # Tokenize dataset
    # Passing tokenizer inside lambda function
    training_dataset = train_dataset.map(lambda x: tokenize_and_align_labels(x, tokenizer), batched=True)
    evaluation_dataset = val_dataset.map(lambda x: tokenize_and_align_labels(x, tokenizer), batched=True)

    # Data collator
    data_collator = DataCollatorForTokenClassification(tokenizer)
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir=f"./results_{model_name}",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=epochs,
        weight_decay=0.01,
        logging_dir=f"./logs_{model_name}",
        logging_steps=50
    )

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=training_dataset,
        eval_dataset=evaluation_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()

    # Evaluate the model
    eval_results = trainer.evaluate()
    print(f"Evaluation results for {model_name}:", eval_results)
    return eval_results

### Compute Metrics

In [8]:
def compute_metrics(pred):
    # Retrieve predictions and true labels
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    
    # Convert numeric labels back to their string names using id_to_label mapping
    true_labels = [[id_to_label[l] for l in label_row if l != -100] for label_row in labels]
    true_preds = [[id_to_label[p] for (p, l) in zip(pred_row, label_row) if l != -100] for pred_row, label_row in zip(preds, labels)]
    
    # Use seqeval to evaluate the performance
    report = classification_report(true_labels, true_preds)
    accuracy = accuracy_score(true_labels, true_preds)
    
    return {"accuracy": accuracy, "report": report}
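The masking logic in `compute_metrics` is easier to follow on plain lists. The sketch below shows the same filtering step in isolation: positions whose gold label is `-100` are dropped from both the gold and the predicted sequence before the ids are turned back into tag strings. The helper name `decode_with_mask` and the toy ids are hypothetical.

```python
# Copied from the label mapping so the sketch runs on its own.
id_to_label = {
    0: "O",
    1: "B-Product",
    2: "I-Product",
    3: "B-PRICE",
    4: "I-PRICE",
    5: "B-LOC",
    6: "I-LOC",
}


def decode_with_mask(pred_ids, label_ids, id_to_label, ignore_index=-100):
    """Drop ignored positions, then map the remaining ids back to tag strings."""
    true_labels, true_preds = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        true_preds.append([id_to_label[p] for p, _ in kept])
        true_labels.append([id_to_label[l] for _, l in kept])
    return true_labels, true_preds


# Hypothetical ids for one sequence; first and last positions are special tokens.
label_ids = [[-100, 1, 2, 0, -100]]
pred_ids = [[0, 1, 0, 0, 3]]

gold, predicted = decode_with_mask(pred_ids, label_ids, id_to_label)
print(gold)       # [['B-Product', 'I-Product', 'O']]
print(predicted)  # [['B-Product', 'O', 'O']]
```

The prediction at a masked position (here the trailing `3`) never reaches the metric, so only positions with a real gold label count.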

### Compare Models

In [9]:
# Function to compare models
def compare_models(models, train_dataset, val_dataset, label_list):
    results = {}
    for model_name in models:
        # train_and_evaluate_model needs both the training and the validation split
        eval_result = train_and_evaluate_model(model_name, train_dataset, val_dataset, label_list)
        results[model_name] = eval_result
    return results

In [10]:
# Load the labeled CoNLL dataset
#conll_df = load_conll_dataset("/kaggle/input/collection-ner/NER_Collection_data.txt")
#dataset = Dataset.from_pandas(conll_df)

### Count Labels in the Dataset

In [11]:
# Function to count each label in the dataset
def count_labels(dataset):
    all_labels = [label for labels in dataset['ner_tags'] for label in labels]
    label_counts = Counter(all_labels)
    
    # Print the counts for each label
    for label, count in label_counts.items():
        print(f"Label: {label}, Count: {count}")
    
    return label_counts

In [12]:
train_label_counts = count_labels(train_dataset)

Label: O, Count: 26629
Label: B-Product, Count: 1096
Label: I-Product, Count: 168
Label: B-LOC, Count: 3178
Label: I-LOC, Count: 28566
Label: B-Price, Count: 263
Label: I-Price, Count: 263


In [13]:
evaluation_label_counts = count_labels(val_dataset)

Label: B-Product, Count: 295
Label: O, Count: 7119
Label: B-LOC, Count: 829
Label: I-LOC, Count: 7388
Label: B-Price, Count: 67
Label: I-Price, Count: 67
Label: I-Product, Count: 44
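Raw counts make the class balance hard to read at a glance. As a sketch, the training-set counts printed above can be converted to shares of all tags. The numbers are copied from that output and nothing else is assumed.

```python
from collections import Counter

# Training-set label counts copied from the output above.
train_counts = Counter({
    "O": 26629,
    "B-Product": 1096,
    "I-Product": 168,
    "B-LOC": 3178,
    "I-LOC": 28566,
    "B-Price": 263,
    "I-Price": 263,
})

total = sum(train_counts.values())
shares = {label: count / total for label, count in train_counts.items()}

# Print labels from most to least frequent.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<10} {share:6.1%}")
```

Labels with a very small share, such as the product and price tags here, are the ones where per-entity precision and recall are more informative than overall token accuracy.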


#### Map labels to correct labels

In [14]:
# Function to map incorrect labels to correct labels
def map_labels(dataset):
    # Define the mapping from incorrect to correct labels
    label_mapping = {
        'B-PROD': 'B-Product',   # Map 'B-PROD' to 'B-Product'
        'B-PRODUCT': 'B-Product', # Map 'B-PRODUCT' to 'B-Product'
        'I-PRODUCT': 'I-Product', # Map 'I-PRODUCT' to 'I-Product'
        'B-Price': 'B-PRICE',    # Map 'B-Price' to 'B-PRICE'
        'I-Price': 'I-PRICE',    # Map 'I-Price' to 'I-PRICE'
        'IO': 'O'                # Map 'IO' to 'O'
    }
    
    # Replace the incorrect labels with the correct ones
    dataset['ner_tags'] = dataset['ner_tags'].apply(
        lambda tags: [label_mapping.get(tag, tag) for tag in tags]
    )
    
    return dataset
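An explicit alias table like the one above has to list every misspelling by hand. As an alternative sketch, not the notebook's approach, most case variants can be resolved with a case-insensitive lookup against the canonical label set, leaving only true aliases such as `B-PROD` or `IO` to be listed explicitly. The names `CANONICAL`, `ALIASES`, and `normalize_tag` are hypothetical.

```python
# Canonical tag spellings, matching the keys of label_to_id.
CANONICAL = ["O", "B-Product", "I-Product", "B-PRICE", "I-PRICE", "B-LOC", "I-LOC"]
_BY_UPPER = {tag.upper(): tag for tag in CANONICAL}

# Aliases that differ by more than letter case.
ALIASES = {"B-PROD": "B-Product", "IO": "O", "0": "O"}


def normalize_tag(tag):
    """Return the canonical spelling of a tag, or the tag unchanged if unknown."""
    if tag in ALIASES:
        return ALIASES[tag]
    return _BY_UPPER.get(tag.upper(), tag)


for raw in ["B-Price", "I-PRODUCT", "B-PROD", "IO", "o", "B-LOC"]:
    print(raw, "->", normalize_tag(raw))
```

Unknown tags are returned unchanged rather than silently mapped to `O`, so they still show up in a later label count and can be reviewed by hand.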


In [15]:
# Example usage:
train_df = map_labels(train_dataset)
train_dataset = Dataset.from_pandas(train_df)

# Verify the label counts after remapping
label_counts = count_labels(train_df)

Label: O, Count: 26629
Label: B-Product, Count: 1096
Label: I-Product, Count: 168
Label: B-LOC, Count: 3178
Label: I-LOC, Count: 28566
Label: B-PRICE, Count: 263
Label: I-PRICE, Count: 263


In [16]:
# Example usage:
val_df = map_labels(val_dataset)
val_dataset = Dataset.from_pandas(val_df)

# Verify the label counts after remapping
label_counts = count_labels(val_df)

Label: B-Product, Count: 295
Label: O, Count: 7119
Label: B-LOC, Count: 829
Label: I-LOC, Count: 7388
Label: B-PRICE, Count: 67
Label: I-PRICE, Count: 67
Label: I-Product, Count: 44


In [17]:
# Example usage:
conll_df = map_labels(conll_df)

# Verify the label counts after remapping
label_counts = count_labels(conll_df)

Label: O, Count: 33748
Label: B-Product, Count: 1391
Label: I-Product, Count: 212
Label: B-LOC, Count: 4007
Label: I-LOC, Count: 35954
Label: B-PRICE, Count: 330
Label: I-PRICE, Count: 330


In [18]:
"""
# Function to save the dataset to storage
def save_dataset(dataset, file_path):
    
    Save the modified dataset to a specified file path in CSV format.
    
    Args:
        dataset (pd.DataFrame): The DataFrame containing the dataset.
        file_path (str): The file path where the dataset will be saved.
    
    dataset.to_csv(file_path, index=False)
    print(f"Dataset saved to {file_path}")

# Save the mapped dataset to a CSV file
save_dataset(conll_df, "preprocessed_conll_data.txt")

"""


### List Models and Labels

In [19]:
"""
# List of entity labels 
label_list = ['O', 'B-Product', 'I-Product', 'B-PRICE', 'I-PRICE', 'B-LOC', 'I-LOC']

# Define models for comparison
models = [
    "xlm-roberta-base",  
    "bert-base-multilingual-cased",  
    "distilbert-base-multilingual-cased"  
]

# Compare models
results = compare_models(models, dataset, label_list)

"""


### Print Comparison Results

In [20]:
"""
# Print out comparison results
for model_name, result in results.items():
    print(f"Model: {model_name}")
    print(f"Accuracy: {result['eval_accuracy']}")
    print(result['eval_report'])

"""


In [21]:
# API key redacted. Do not commit credentials; wandb can read the key from the WANDB_API_KEY environment variable.

### Fine-tune a Single Model at a Time

In [22]:
# Function to train and evaluate one model
def run_single_model(model_name, train_dataset, val_dataset, label_list):
    # Train and evaluate the model
    eval_result = train_and_evaluate_model(model_name, train_dataset, val_dataset, label_list)
    
    # Print the evaluation result for the model
    print(f"Model: {model_name}")
    print(f"Accuracy: {eval_result['eval_accuracy']}")
    print(eval_result['eval_report'])
    
    return eval_result


In [23]:
# List of entity labels 
label_list = ['O', 'B-Product', 'I-Product', 'B-PRICE', 'I-PRICE', 'B-LOC', 'I-LOC']

# Define the model to run
model_name = "distilbert-base-multilingual-cased"

# Run and evaluate the model
eval_result = run_single_model(model_name, train_dataset, val_dataset, label_list)


Training model: distilbert-base-multilingual-cased



Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Per-entity and average columns show precision / recall / F1. Support: LOC 829, PRICE 74, Product 295, averages 1198.

| Epoch | Training Loss | Validation Loss | Accuracy | LOC | PRICE | Product | Micro avg | Macro avg | Weighted avg |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.6786 | 0.389414 | 0.888706 | 0.06 / 0.02 / 0.03 | 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 | 0.06 / 0.01 / 0.02 | 0.02 / 0.01 / 0.01 | 0.04 / 0.01 / 0.02 |
| 2 | 0.3887 | 0.296287 | 0.906228 | 0.10 / 0.03 / 0.05 | 0.28 / 0.20 / 0.24 | 0.84 / 0.38 / 0.52 | 0.35 / 0.13 / 0.19 | 0.41 / 0.20 / 0.27 | 0.30 / 0.13 / 0.18 |
| 3 | 0.2857 | 0.238103 | 0.920353 | 0.42 / 0.12 / 0.19 | 0.84 / 0.77 / 0.80 | 0.98 / 0.38 / 0.54 | 0.64 / 0.22 / 0.33 | 0.75 / 0.42 / 0.51 | 0.58 / 0.22 / 0.31 |
| 4 | 0.231 | 0.187392 | 0.928689 | 0.41 / 0.37 / 0.39 | 0.86 / 0.81 / 0.83 | 0.97 / 0.38 / 0.54 | 0.51 / 0.40 / 0.45 | 0.75 / 0.52 / 0.59 | 0.58 / 0.40 / 0.46 |
| 5 | 0.2105 | 0.171767 | 0.939309 | 0.68 / 0.41 / 0.51 | 0.85 / 0.86 / 0.86 | 0.95 / 0.41 / 0.57 | 0.75 / 0.44 / 0.55 | 0.83 / 0.56 / 0.65 | 0.76 / 0.44 / 0.55 |
| 6 | 0.1809 | 0.150577 | 0.949079 | 0.69 / 0.65 / 0.67 | 0.87 / 0.92 / 0.89 | 0.96 / 0.39 / 0.56 | 0.74 / 0.61 / 0.67 | 0.84 / 0.66 / 0.71 | 0.77 / 0.61 / 0.66 |
| 7 | 0.1636 | 0.135937 | 0.959539 | 0.74 / 0.69 / 0.72 | 0.94 / 0.91 / 0.92 | 0.86 / 0.48 / 0.61 | 0.78 / 0.65 / 0.71 | 0.85 / 0.69 / 0.75 | 0.78 / 0.65 / 0.70 |
| 8 | 0.162 | 0.119027 | 0.962831 | 0.79 / 0.75 / 0.77 | 0.93 / 0.92 / 0.93 | 0.94 / 0.44 / 0.60 | 0.82 / 0.68 / 0.74 | 0.88 / 0.70 / 0.76 | 0.83 / 0.68 / 0.74 |
| 9 | 0.1362 | 0.108057 | 0.969734 | 0.83 / 0.81 / 0.82 | 0.92 / 0.91 / 0.91 | 0.92 / 0.47 / 0.63 | 0.85 / 0.73 / 0.79 | 0.89 / 0.73 / 0.79 | 0.86 / 0.73 / 0.78 |
| 10 | 0.1271 | 0.107849 | 0.967663 | 0.84 / 0.78 / 0.81 | 0.92 / 0.92 / 0.92 | 0.74 / 0.46 / 0.57 | 0.83 / 0.71 / 0.76 | 0.83 / 0.72 / 0.77 | 0.82 / 0.71 / 0.76 |


Trainer is attempting to log a value of "              precision    recall  f1-score   support

         LOC       0.06      0.02      0.03       829
       PRICE       0.00      0.00      0.00        74
     Product       0.00      0.00      0.00       295

   micro avg       0.06      0.01      0.02      1198
   macro avg       0.02      0.01      0.01      1198
weighted avg       0.04      0.01      0.02      1198
" of type <class 'str'> for key "eval/report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


Evaluation results for distilbert-base-multilingual-cased: {'eval_loss': 0.09585697948932648, 'eval_accuracy': 0.9748845112302873, 'eval_report': '              precision    recall  f1-score   support\n\n         LOC       0.91      0.87      0.89       829\n       PRICE       0.93      0.93      0.93        74\n     Product       0.94      0.49      0.65       295\n\n   micro avg       0.92      0.78      0.84      1198\n   macro avg       0.93      0.76      0.82      1198\nweighted avg       0.92      0.78      0.83      1198\n', 'eval_runtime': 1.4402, 'eval_samples_per_second': 186.086, 'eval_steps_per_second': 11.804, 'epoch': 15.0}
Model: distilbert-base-multilingual-cased
Accuracy: 0.9748845112302873
              precision    recall  f1-score   support

         LOC       0.91      0.87      0.89       829
       PRICE       0.93      0.93      0.93        74
     Product       0.94      0.49      0.65       295

   micro avg       0.92      0.78      0.84      1198
   macro a