# Amharic E-commerce NER Demo

This notebook demonstrates the Amharic Named Entity Recognition system for e-commerce data extraction.

## Overview
- Data collection from Telegram channels
- Text preprocessing and entity extraction
- Model training and evaluation
- Vendor scorecard analysis

In [8]:
# Import required libraries
import sys
import os
sys.path.append('../scripts')

import json

# Try to import optional dependencies
try:
    import pandas as pd
    PANDAS_AVAILABLE = True
except ImportError:
    print("Pandas not available. Some features may be limited.")
    PANDAS_AVAILABLE = False

# Import our custom modules
try:
    from data_processor import AmharicDataProcessor
    from conll_labeler import CoNLLLabeler
    from vendor_scorecard import VendorAnalytics
    print("✅ All modules imported successfully")
except ImportError as e:
    print(f"⚠️ Import error: {e}")
    print("Make sure the working directory sits one level below the project root so that '../scripts' resolves")

Pandas not available. Some features may be limited.
✅ All modules imported successfully


## 1. Data Loading and Exploration

In [9]:
# Load raw Telegram data
data = []
with open('../raw_telegram_data.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        try:
            obj = json.loads(line.strip())
            data.append(obj)
        except json.JSONDecodeError:
            continue

print(f"Loaded {len(data)} messages")
print(f"Channels: {set(item['channel'] for item in data)}")

Loaded 1200 messages
Channels: {'@meneshayeofficial', ' @ZemenExpress', '@Leyueqa', '@sinayelj', '@nevacomputer', '@ethio_brand_collection'}
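
The channel set above contains ` @ZemenExpress` with a leading space, so the same vendor could be split across two keys if the handle is stored inconsistently. A minimal sketch of normalizing handles before grouping follows; `normalize_channel` is a hypothetical helper, not part of the project's modules.

```python
def normalize_channel(name: str) -> str:
    """Strip surrounding whitespace and any leading '@' from a channel handle."""
    return name.strip().lstrip("@")


# Handles as they might appear in the raw data, including the padded variant.
channels = [" @ZemenExpress", "@ZemenExpress", "@Leyueqa"]
unique = sorted({normalize_channel(c) for c in channels})
print(unique)  # ['Leyueqa', 'ZemenExpress']
```

Dropping the `@` also matches the vendor names printed later in the scorecard section.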


In [10]:
# Display sample messages
for i, item in enumerate(data[:3]):
    print(f"\n--- Message {i+1} ---")
    print(f"Channel: {item['channel']}")
    print(f"Text: {item['text'][:200]}...")
    print(f"Media: {item['media']}")


--- Message 1 ---
Channel:  @ZemenExpress
Text: 💥💥...................................💥💥

📌Over Door Hooks Japanese Style Door Back Hangers 

👍Punch Organizers for Towels Coats and More 
👍Easy Installation Versatile Home Storage

ዋጋ፦  💵🏷 1ፍሬ 300  ብር...
Media: media/ @ZemenExpress_7022.jpg

--- Message 2 ---
Channel:  @ZemenExpress
Text: 💥💥...................................💥💥

📌Over Door Hooks Japanese Style Door Back Hangers 

👍Punch Organizers for Towels Coats and More 
👍Easy Installation Versatile Home Storage

ዋጋ፦  💵🏷 1ፍሬ 300  ብር...
Media: None

--- Message 3 ---
Channel:  @ZemenExpress
Text: 💥💥...................................💥💥

📌Ball Ice Cube Tray

ዋጋ፦  💰 🏷  700 ብር

♦️ውስን ፍሬ ነው ያለው 🔥🔥🔥

🏢 አድራሻ👉

📍♦️#መገናኛ_መሰረት_ደፋር_ሞል_ሁለተኛ_ፎቅ ቢሮ ቁ. S05/S06


     💧💧💧💧


    📲 0902660722
    📲 0928460606...
Media: None
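
Messages 1 and 2 above carry identical text and differ only in the `media` field, so repeated posts may inflate per-vendor counts. One possible way to collapse them is sketched below, assuming the `channel`, `text`, and `media` keys shown above; `dedupe_messages` and the sample media path are hypothetical. Whether repeats should be dropped at all depends on how posting frequency is meant to be measured.

```python
def dedupe_messages(messages):
    """Keep the first message seen for each (channel, whitespace-normalized text) pair."""
    seen = set()
    unique = []
    for msg in messages:
        # Collapse runs of whitespace so trivially different copies compare equal.
        text = " ".join((msg.get("text") or "").split())
        key = (msg.get("channel", "").strip(), text)
        if key in seen:
            continue
        seen.add(key)
        unique.append(msg)
    return unique


sample = [
    {"channel": " @ZemenExpress", "text": "📌Ball Ice Cube Tray", "media": "a.jpg"},
    {"channel": " @ZemenExpress", "text": "📌Ball  Ice Cube Tray ", "media": None},
    {"channel": "@Leyueqa", "text": "📌Ball Ice Cube Tray", "media": None},
]
deduped = dedupe_messages(sample)
print(len(deduped))  # 2
```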


## 2. Text Preprocessing

In [11]:
# Initialize data processor
processor = AmharicDataProcessor()

# Process sample text
sample_text = "💥💥ዋጋ፦ 500 ብር የሆነ ሻሚዝ በቦሌ ይሸጣል📍"
clean_text = processor.normalize_amharic_text(sample_text)

print(f"Original: {sample_text}")
print(f"Cleaned: {clean_text}")

# Extract entities
entities = processor.extract_entities_from_text(clean_text)
print(f"\nExtracted entities: {entities}")

Original: 💥💥ዋጋ፦ 500 ብር የሆነ ሻሚዝ በቦሌ ይሸጣል📍
Cleaned: ዋጋ፦ 500 ብር የሆነ ሻሚዝ በቦሌ ይሸጣል

Extracted entities: {'products': [], 'prices': ['500'], 'locations': ['ቦሌ']}
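
The internals of `AmharicDataProcessor` are not shown in this notebook. As a rough illustration of how pattern-based cleaning and extraction of this kind can work, the sketch below uses plain regular expressions and a tiny location list. The names `clean`, `extract`, and `KNOWN_LOCATIONS` are assumptions made for this example, and it does not attempt product extraction.

```python
import re

# Anything outside Ethiopic script (U+1200-U+137F), ASCII letters and digits,
# whitespace, and a few punctuation marks is treated as decoration and dropped.
NON_TEXT = re.compile(r"[^\u1200-\u137F0-9A-Za-z\s.,/#_-]")
PRICE = re.compile(r"(\d+)\s*ብር")  # digits followed by the word for "birr"
KNOWN_LOCATIONS = {"ቦሌ", "መገናኛ"}  # hypothetical gazetteer, for illustration only


def clean(text):
    """Remove emoji and similar symbols, then collapse whitespace."""
    return " ".join(NON_TEXT.sub(" ", text).split())


def extract(text):
    """Return prices found by regex and locations found by gazetteer lookup."""
    locations = []
    for token in text.split():
        # Strip the locative prefix "በ" before looking the token up.
        bare = token[1:] if token.startswith("በ") else token
        if bare in KNOWN_LOCATIONS:
            locations.append(bare)
    return {"prices": PRICE.findall(text), "locations": locations}


cleaned = clean("💥💥ዋጋ፦ 500 ብር የሆነ ሻሚዝ በቦሌ ይሸጣል📍")
print(cleaned)
print(extract(cleaned))
```

A gazetteer lookup is simple but brittle: it misses unlisted places and other prefixes, which is one motivation for the model-based approach described later.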


## 3. CoNLL Format Labeling

In [17]:
# Initialize CoNLL labeler
labeler = CoNLLLabeler()

# Label sample text
sample_amharic = "ዋጋ 500 ብር የሆነ ሻሚዝ በቦሌ ይሸጣል"
labeled_tokens = labeler.auto_label_text(sample_amharic)

print("Token\t\tLabel")
print("-" * 30)
for token, label in labeled_tokens:
    print(f"{token}\t\t{label}")

Token		Label
------------------------------
ዋጋ		B-PRICE
500		I-PRICE
ብር		I-PRICE
የሆነ		O
ሻሚዝ		B-PRODUCT
በቦሌ		B-LOC
ይሸጣል		O
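
To make the CoNLL layout concrete, here is a small sketch that writes `(token, label)` pairs in the usual one-token-per-line form with a blank line between sentences, and reads them back. The space separator and the helper names `to_conll` and `from_conll` are assumptions; the project's labeled file may use a different delimiter.

```python
def to_conll(sentences):
    """Serialize sentences of (token, label) pairs, one 'token label' per line."""
    blocks = ["\n".join(f"{tok} {lab}" for tok, lab in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def from_conll(text):
    """Parse CoNLL text back into a list of sentences of (token, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.rsplit(maxsplit=1)  # works for spaces or tabs
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


labeled = [[("ዋጋ", "B-PRICE"), ("500", "I-PRICE"), ("ብር", "I-PRICE"),
            ("የሆነ", "O"), ("ሻሚዝ", "B-PRODUCT"), ("በቦሌ", "B-LOC"), ("ይሸጣል", "O")]]
round_trip = from_conll(to_conll(labeled))
print(round_trip == labeled)  # True
```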


## 4. Dataset Statistics

In [18]:
# Load labeled dataset statistics
try:
    stats = labeler.validate_conll_format("../Data/merged_labeled_data.txt")
    
    print("=== Dataset Statistics ===")
    print(f"Total sentences: {stats['total_sentences']:,}")
    print(f"Total tokens: {stats['total_tokens']:,}")
    print("\nEntity counts:")
    for entity, count in stats['entity_counts'].items():
        print(f"  {entity}: {count:,}")
    
    if stats['errors']:
        print(f"\nErrors found: {len(stats['errors'])}")
        
except FileNotFoundError:
    print("Labeled dataset not found. Run the labeling script first.")

=== Dataset Statistics ===
Total sentences: 3,216
Total tokens: 174,695

Entity counts:
  PRODUCT: 14,399
  PRICE: 8,204
  LOC: 2,920
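
How `validate_conll_format` arrives at these numbers is not shown. One common convention is to count an entity once per `B-` tag and to flag any `I-` tag that does not continue an entity of the same type. The sketch below follows that convention; `count_entities` is a hypothetical helper and may not match the project's definition of an entity count.

```python
from collections import Counter


def count_entities(sentences):
    """Count entity spans by B- tags and collect I- tags that break the BIO scheme."""
    counts = Counter()
    errors = []
    for s_idx, sent in enumerate(sentences):
        prev = "O"
        for t_idx, (_, label) in enumerate(sent):
            if label.startswith("B-"):
                counts[label[2:]] += 1
            elif label.startswith("I-"):
                # An I- tag must follow a B- or I- tag of the same entity type.
                if prev == "O" or prev[2:] != label[2:]:
                    errors.append((s_idx, t_idx, label))
            prev = label
    return counts, errors


good = [[("ዋጋ", "B-PRICE"), ("500", "I-PRICE"), ("ብር", "I-PRICE"),
         ("ሻሚዝ", "B-PRODUCT"), ("በቦሌ", "B-LOC")]]
bad = [[("500", "I-PRICE")]]
counts, errors = count_entities(good)
print(dict(counts), errors)
print(count_entities(bad)[1])
```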


## 5. Vendor Scorecard Analysis

In [19]:
# Initialize vendor analytics
analytics = VendorAnalytics()

# Load and process data
telegram_data = analytics.load_telegram_data("../raw_telegram_data.jsonl")
print(f"Processed {len(telegram_data)} messages")

# Calculate vendor metrics
vendor_metrics = analytics.calculate_vendor_metrics(telegram_data)

print("\n=== Vendor Metrics ===")
for vendor, metrics in vendor_metrics.items():
    print(f"\n{vendor}:")
    print(f"  Posts per week: {metrics['posts_per_week']}")
    print(f"  Average price: {metrics['avg_price_etb']} ETB")
    print(f"  Price consistency: {metrics['price_consistency']:.2%}")
    print(f"  Media ratio: {metrics['media_ratio']:.2%}")

Processed 1200 messages

=== Vendor Metrics ===

ZemenExpress:
  Posts per week: 38.89
  Average price: 842.86 ETB
  Price consistency: 42.00%
  Media ratio: 81.00%

nevacomputer:
  Posts per week: 3.45
  Average price: 0 ETB
  Price consistency: 0.00%
  Media ratio: 98.00%

meneshayeofficial:
  Posts per week: 4.28
  Average price: 6670.59 ETB
  Price consistency: 9.00%
  Media ratio: 48.00%

ethio_brand_collection:
  Posts per week: 8.92
  Average price: 0 ETB
  Price consistency: 0.00%
  Media ratio: 99.00%

Leyueqa:
  Posts per week: 36.84
  Average price: 2226.04 ETB
  Price consistency: 34.00%
  Media ratio: 71.00%

sinayelj:
  Posts per week: 73.68
  Average price: 6350.0 ETB
  Price consistency: 4.00%
  Media ratio: 98.00%
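
The exact definitions inside `VendorAnalytics` are not shown here. The sketch below gives one plausible reading of three of the metrics: posts per week over the observed date span, the share of posts with media, and the mean extracted price. The `date` field and all helper names are assumptions; only `channel`, `text`, and `media` appear in the sample messages above.

```python
from datetime import datetime


def posts_per_week(dates):
    """Posts divided by the observed span in weeks (span clamped to one week)."""
    if not dates:
        return 0.0
    span_days = (max(dates) - min(dates)).days
    weeks = max(span_days / 7, 1.0)
    return len(dates) / weeks


def media_ratio(messages):
    """Fraction of messages whose 'media' field is non-empty."""
    if not messages:
        return 0.0
    return sum(1 for m in messages if m.get("media")) / len(messages)


def avg_price(prices):
    """Mean price, or None when no price was extracted."""
    return sum(prices) / len(prices) if prices else None


# Hypothetical messages; the 'date' key is an assumed field.
msgs = [
    {"date": "2025-01-01", "media": "a.jpg"},
    {"date": "2025-01-08", "media": None},
    {"date": "2025-01-15", "media": "b.jpg"},
]
dates = [datetime.fromisoformat(m["date"]) for m in msgs]
print(posts_per_week(dates))  # 1.5
print(round(media_ratio(msgs), 2))  # 0.67
print(avg_price([300, 700]), avg_price([]))  # 500.0 None
```

Returning `None` rather than `0` for a vendor with no extracted prices keeps a missing value distinguishable from a genuine zero, which matters if the average later feeds a score.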


In [20]:
# Create vendor scorecard
scorecard_data = analytics.create_vendor_scorecard(telegram_data)

print("=== Vendor Scorecard ===")
print(f"{'Vendor':<25} {'Lending Score':<15} {'Posts/Week':<12} {'Avg Price':<12}")
print("-" * 70)

for vendor_data in scorecard_data:
    print(f"{vendor_data['Vendor']:<25} {vendor_data['Lending_Score']:<15.1f} "
          f"{vendor_data['Posts_Per_Week']:<12.1f} {vendor_data['Avg_Price_ETB']:<12.1f}")

=== Vendor Scorecard ===
Vendor                    Lending Score   Posts/Week   Avg Price   
----------------------------------------------------------------------
sinayelj                  79.0            73.7         6350.0      
ZemenExpress              74.5            38.9         842.9       
Leyueqa                   73.3            36.8         2226.0      
ethio_brand_collection    70.3            8.9          0.0         
nevacomputer              60.9            3.5          0.0         
meneshayeofficial         58.6            4.3          6670.6      
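
The formula behind `Lending_Score` is not given in this notebook. Purely to illustrate how several metrics can be folded into one ranking, the sketch below min-max normalizes each metric across vendors and takes a weighted average scaled to 0-100. The weights and the choice of metrics are hypothetical, and the result is not expected to reproduce the scores in the table above.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_vendors(vendors, weights):
    """Return (vendor, score) pairs sorted from highest to lowest score."""
    names = list(vendors)
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        normed = min_max([vendors[name][metric] for name in names])
        for name, value in zip(names, normed):
            scores[name] += weight * value
    total = sum(weights.values())
    ranked = ((name, 100 * s / total) for name, s in scores.items())
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Metric values copied from the vendor metrics output; weights are hypothetical.
vendors = {
    "sinayelj": {"posts_per_week": 73.68, "media_ratio": 0.98},
    "ZemenExpress": {"posts_per_week": 38.89, "media_ratio": 0.81},
    "meneshayeofficial": {"posts_per_week": 4.28, "media_ratio": 0.48},
}
ranking = score_vendors(vendors, {"posts_per_week": 0.5, "media_ratio": 0.5})
for name, score in ranking:
    print(f"{name:<20} {score:.1f}")
```

Min-max scaling makes scores relative to the vendors in the batch, so adding or removing a vendor can shift everyone's score.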


## 6. Model Training Demo (Conceptual)

In [21]:
# Note: This is a conceptual demonstration.
# Actual model training requires GPU resources and the transformers library.

print("=== Model Training Configuration ===")
print("Models to train:")
print("  - XLM-RoBERTa-base (multilingual)")
print("  - mBERT (multilingual BERT)")
print("  - DistilBERT (lightweight)")
print("\nTraining parameters:")
print("  - Learning rate: 2e-5")
print("  - Batch size: 16")
print("  - Max length: 128")
print("  - Epochs: 3")
print("\nDataset split:")
print("  - Training: 80%")
print("  - Validation: 20%")

=== Model Training Configuration ===
Models to train:
  - XLM-RoBERTa-base (multilingual)
  - mBERT (multilingual BERT)
  - DistilBERT (lightweight)

Training parameters:
  - Learning rate: 2e-5
  - Batch size: 16
  - Max length: 128
  - Epochs: 3

Dataset split:
  - Training: 80%
  - Validation: 20%
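
The 80/20 split can be made reproducible with a seeded shuffle at the sentence level, so that no sentence is divided between the two sets. The sketch below uses only the standard library; `split_sentences` and the seed value are assumptions, and integers stand in for the 3,216 labeled sentences reported earlier.

```python
import random


def split_sentences(sentences, train_fraction=0.8, seed=42):
    """Shuffle a copy of the sentences with a fixed seed and cut at train_fraction."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integers stand in for labeled sentences in this illustration.
train, val = split_sentences(range(3216))
print(len(train), len(val))  # 2572 644
```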


## 7. Results Summary

In [22]:
# Summary of achievements
print("=== Project Results Summary ===")
print("\n Data Collection:")
print(f"   - {len(data)} messages from 6 channels")
print(f"   - {len([d for d in data if d.get('media')])} media files")

print("\n Data Processing:")
print("   - Amharic text normalization")
print("   - Entity pattern extraction")
print("   - CoNLL format labeling")

print("\n Vendor Analytics:")
print(f"   - {len(scorecard_data)} vendors analyzed")
print(f"   - Top vendor: {scorecard_data[0]['Vendor']} (Score: {scorecard_data[0]['Lending_Score']:.1f})")

print("\n Model Framework:")
print("   - Multi-model training pipeline")
print("   - Evaluation and comparison framework")
print("   - Interpretability tools (SHAP/LIME)")

print("\n Business Value:")
print("   - Automated entity extraction")
print("   - Vendor performance scoring")
print("   - Micro-lending decision support")
print("   - Scalable e-commerce intelligence")

=== Project Results Summary ===

 Data Collection:
   - 1200 messages from 6 channels
   - 992 media files

 Data Processing:
   - Amharic text normalization
   - Entity pattern extraction
   - CoNLL format labeling

 Vendor Analytics:
   - 6 vendors analyzed
   - Top vendor: sinayelj (Score: 79.0)

 Model Framework:
   - Multi-model training pipeline
   - Evaluation and comparison framework
   - Interpretability tools (SHAP/LIME)

 Business Value:
   - Automated entity extraction
   - Vendor performance scoring
   - Micro-lending decision support
   - Scalable e-commerce intelligence


## Conclusion

This notebook walks through the Amharic e-commerce data extraction pipeline from raw messages to vendor scores:

1. **Data Collection**: 1,200 messages scraped from six Telegram channels
2. **Preprocessing**: Amharic-specific text normalization
3. **Entity Extraction**: Pattern-based extraction here, with CoNLL-labeled data prepared for ML-based models
4. **Model Training**: Configuration for transformer fine-tuning (conceptual; not run in this notebook)
5. **Vendor Analytics**: Per-vendor metrics and a lending scorecard

The pipeline gives EthioMart a working baseline for its e-commerce platform and shows how extracted entities can feed AI-driven business intelligence in Ethiopian markets.