# Vendor Scorecard Development

This notebook implements the FinTech Vendor Scorecard for Micro-Lending.
It combines entities extracted by the NER model with Telegram post metadata
(views, timestamps, channel/vendor name) to create a rich profile for each vendor.
It then calculates key performance metrics and a weighted "Lending Score" for each.

This notebook is a core component of the FinTech E-commerce Data Extractor project,
specifically designed to fulfill the business objective of identifying promising
vendors for micro-lending opportunities within the EthioMart ecosystem.

**Project Goal:** To leverage unstructured Telegram e-commerce data to provide
data-driven insights for financial decision-making, particularly micro-lending.

**Business Objective:** Enable EthioMart to make informed micro-lending decisions
by quantifying vendor performance and potential based on their digital footprint
and engagement on Telegram channels. This aims to reduce risk and optimize lending
to active and successful small businesses.

**Methodology:**
The methodology integrates several stages of data processing and analysis:
1.  **Data Integration:** Utilizes preprocessed Telegram message data, incorporating
    critical metadata such as message views, posting dates, and channel information.
2.  **Entity Extraction:** Employs a fine-tuned Named Entity Recognition (NER) model
    to accurately extract crucial business entities (Product, Price, Location, Contact Info)
    from the free-form Amharic text of Telegram posts.
3.  **Vendor Analytics Engine Development:** Constructs a robust analytics engine
    to compute key performance indicators for each vendor, transforming raw data
    and extracted entities into quantifiable metrics.
4.  **Lending Score Formulation:** Develops a composite "Lending Score" by normalizing
    and weighting selected performance indicators, providing a single, interpretable
    metric to rank vendors.

**Implementation Details:**
This notebook orchestrates the following implementation steps:
 -   Loading of preprocessed Telegram data and the custom-trained NER model (from a Hugging Face repository or local path).
 -   Performing high-throughput NER inference across all available messages to enrich vendor profiles with extracted product and pricing information.
 -   Calculating vendor-specific metrics including:
     -   **Posting Frequency:** Measuring the regularity and volume of a vendor's posts.
     -   **Average Views per Post:** Quantifying the reach and engagement of a vendor's content.
     -   **Average Price Point:** Analyzing the typical pricing of products offered, indicating market segment.
     -   **Top Performing Post Analysis:** Identifying the most successful posts and their associated product-price pairs.
 -   Normalizing these metrics and applying predefined weights to compute the final "Lending Score".
 -   Presenting a comprehensive summary table of the calculated metrics and Lending Scores,
     sorted for immediate business insights, along with detailed interpretations of the results.

In [None]:
# --- 1. Setup and Library Installation ---
# Install necessary libraries if running in a fresh Colab environment
# !pip install pandas numpy transformers scikit-learn -q

import pandas as pd
import numpy as np
import os
from pathlib import Path
from datetime import datetime, timedelta
import re
from collections import defaultdict
from typing import Dict, List  # Needed for the type hints in the helper functions below
import logging

# Set up logging for informative messages
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Import Hugging Face components for loading the fine-tuned model
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

print("Libraries imported successfully.")

In [None]:
# --- 2. Define Paths and Configuration ---
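# Assumes the notebook's working directory sits two levels below the project root
# (e.g. 'notebooks/<subfolder>/'). If it runs directly from 'notebooks/', use Path.cwd().parent instead.
PROJECT_ROOT = Path(os.getcwd()).parent.parent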

# Paths to data and model
# Ensure preprocessed_telegram_data.csv has 'views', 'date', 'channel' columns
# (These should ideally be passed through during preprocessing or original merged data used)
PREPROCESSED_DATA_PATH = PROJECT_ROOT / 'data' / 'processed' / 'preprocessed_telegram_data.csv'

# --- Model source: local directory first, then Hugging Face Hub ---
# If you pushed the fine-tuned model to the Hub in Task 4, replace the default below with
# your repository name, e.g. "YOUR_HF_USERNAME/your_amharic_ner_model".
# Note: "xlm-roberta-base" is the base checkpoint without a trained NER head. It only keeps
# the notebook runnable as a demo and will not produce meaningful entities.
FINE_TUNED_MODEL_NAME_OR_PATH = "xlm-roberta-base"

# Prefer a locally saved fine-tuned model when one exists
LOCAL_MODEL_DIR = PROJECT_ROOT / 'fine_tuned_ner_model'
if LOCAL_MODEL_DIR.exists():
    FINE_TUNED_MODEL_NAME_OR_PATH = str(LOCAL_MODEL_DIR)
    logger.info(f"Loading model from local path: {FINE_TUNED_MODEL_NAME_OR_PATH}")
else:
    logger.warning(f"Local fine-tuned model not found at {LOCAL_MODEL_DIR}. "
                   f"Falling back to '{FINE_TUNED_MODEL_NAME_OR_PATH}'. "
                   f"Please ensure your model is uploaded to Hugging Face Hub or exists locally.")


# Output for the vendor scorecard
VENDOR_SCORECARD_PATH = PROJECT_ROOT / 'data' / 'processed' / 'vendor_scorecard.csv'

print(f"Project root: {PROJECT_ROOT}")
print(f"Preprocessed data path: {PREPROCESSED_DATA_PATH}")
print(f"Fine-tuned model loading from: {FINE_TUNED_MODEL_NAME_OR_PATH}")
print(f"Vendor scorecard output path: {VENDOR_SCORECARD_PATH}")

In [None]:
# --- 3. Load Data and Fine-tuned NER Model ---

# Load preprocessed data (should ideally contain views, date, channel, and preprocessed_text)
if not PREPROCESSED_DATA_PATH.exists():
    logger.error(f"Error: Preprocessed data not found at {PREPROCESSED_DATA_PATH}. Please ensure preprocessing output is available and includes metadata.")
    raise FileNotFoundError(f"Preprocessed data not found: {PREPROCESSED_DATA_PATH}")

try:
    df = pd.read_csv(PREPROCESSED_DATA_PATH)
    logger.info(f"Loaded {len(df)} rows from {PREPROCESSED_DATA_PATH}")
    # Basic check for essential columns for this analysis
    required_cols = ['preprocessed_text', 'views', 'date', 'channel']
    if not all(col in df.columns for col in required_cols):
        logger.error(f"Missing one or more required columns ({required_cols}) in the dataframe. This will prevent scorecard generation.")
        logger.error("Please ensure src/data_preprocessing/text_processor.py::preprocess_dataframe passes through 'views', 'date', and 'channel' columns from the raw data.")
        raise ValueError("Essential metadata (views, date, channel) columns are missing in the input DataFrame.")

except Exception as e:
    logger.error(f"Failed to load or validate data: {e}")
    raise

# Coerce 'date' and 'views' first so unparseable values become NaT/NaN,
# then drop every row that is missing essential metadata
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['views'] = pd.to_numeric(df['views'], errors='coerce') # Ensure views is numeric
df = df.dropna(subset=['date', 'views', 'channel', 'preprocessed_text']).reset_index(drop=True)
logger.info(f"DataFrame after coercing types and dropping rows with missing essential metadata: {len(df)} rows.")

# Load the fine-tuned NER model and tokenizer
try:
    tokenizer = AutoTokenizer.from_pretrained(FINE_TUNED_MODEL_NAME_OR_PATH)
    model = AutoModelForTokenClassification.from_pretrained(FINE_TUNED_MODEL_NAME_OR_PATH)
    
    # Create an NER pipeline for efficient inference
    ner_pipeline = pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple" # Groups adjacent tokens that share an entity label into one span
    )
    logger.info("Fine-tuned NER model and tokenizer loaded successfully.")
except Exception as e:
    logger.error(f"Failed to load fine-tuned model or create pipeline from '{FINE_TUNED_MODEL_NAME_OR_PATH}': {e}")
    logger.error("Please verify the model path/name and ensure it exists locally or on Hugging Face Hub.")
    raise
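
The entity columns built later filter on the group names `PRODUCT`, `PRICE`, `LOC`, and `CONTACT_INFO`, so it can help to confirm that the loaded checkpoint actually emits them. The following is a minimal sketch, assuming the label map follows the usual `B-`/`I-` tagging scheme. `missing_entity_groups` is a hypothetical helper, not part of the project, and the label map shown is illustrative.

```python
from typing import Dict, Iterable, Set


def missing_entity_groups(id2label: Dict[int, str], expected: Iterable[str]) -> Set[str]:
    """Return the expected entity groups that the label map does not contain."""
    groups = set()
    for label in id2label.values():
        # Strip the B-/I- prefix, e.g. "B-PRICE" -> "PRICE"; "O" stays "O".
        groups.add(label.split("-", 1)[1] if "-" in label else label)
    return set(expected) - groups


# Illustrative label map; in the notebook this would be model.config.id2label.
example_id2label = {
    0: "O",
    1: "B-PRODUCT", 2: "I-PRODUCT",
    3: "B-PRICE", 4: "I-PRICE",
    5: "B-LOC", 6: "I-LOC",
}
missing = missing_entity_groups(
    example_id2label, ["PRODUCT", "PRICE", "LOC", "CONTACT_INFO"]
)
print(missing)
```

If the notebook falls back to the base `xlm-roberta-base` checkpoint, the label map holds only generic names such as `LABEL_0`, so every expected group would be reported as missing.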

In [None]:
# --- 4. Perform NER Inference on All Data ---
logger.info("Starting NER inference on all preprocessed messages...")

# Apply NER to all preprocessed text
# This might take a while depending on the number of messages and GPU availability
ner_results = []
for i, text in enumerate(df['preprocessed_text']):
    if not isinstance(text, str) or not text.strip():
        ner_results.append([]) # Append empty list for empty/NaN texts
        continue
    try:
        results = ner_pipeline(text)
        ner_results.append(results)
    except Exception as e:
        logger.warning(f"Error during NER for message {i}: {e}. Appending empty result.")
        ner_results.append([])

df['extracted_entities'] = ner_results
logger.info("NER inference complete. Extracted entities column added.")
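
Calling the pipeline once per message can be slow on large datasets. One possible alternative is to send the non-empty texts in chunks while keeping the results aligned with the DataFrame rows. The sketch below is an assumption-laden illustration: `infer_in_batches` and `fake_ner` are hypothetical names, and `fake_ner` stands in for the real model so the example runs without downloading weights.

```python
from typing import Any, Callable, List, Optional, Sequence


def infer_in_batches(
    texts: Sequence[Optional[str]],
    infer_fn: Callable[[List[str]], List[Any]],
    batch_size: int = 32,
) -> List[Any]:
    """Run infer_fn over non-empty texts in chunks, keeping results aligned with the input."""
    # Blank or missing texts keep an empty result, as in the loop above.
    results: List[Any] = [[] for _ in texts]
    valid = [(i, t) for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    for start in range(0, len(valid), batch_size):
        chunk = valid[start:start + batch_size]
        outputs = infer_fn([t for _, t in chunk])
        for (i, _), out in zip(chunk, outputs):
            results[i] = out
    return results


def fake_ner(batch: List[str]) -> List[Any]:
    """Stand-in for the NER pipeline: tags each whole text as one PRICE entity."""
    return [[{"entity_group": "PRICE", "word": t}] for t in batch]


demo = infer_in_batches(["5000 ብር", "", None, "ዋጋ 300"], fake_ner, batch_size=2)
print([len(r) for r in demo])
```

With the real model, `infer_fn` could be something like `lambda batch: ner_pipeline(batch, batch_size=len(batch))`, assuming a transformers version in which a list input yields one result list per text.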

In [None]:
# --- 5. Entity Extraction and Normalization Helpers ---

def extract_entities_by_type(entities: List[Dict], entity_type: str) -> List[str]:
    """Extracts all entities of a specific type from a list of NER results."""
    return [ent['word'] for ent in entities if ent['entity_group'] == entity_type]

def extract_numerical_price(price_tokens: List[str]) -> float:
    """
    Converts a price entity (e.g., ['5000', 'ብር']) into a float.
    Strips thousands separators and whitespace, then takes the first number found,
    so surrounding currency words or symbols (e.g., 'ብር', 'birr', 'ETB') are ignored.
    Returns NaN if no number is present.
    """
    if not price_tokens:
        return np.nan

    full_price_str = "".join(price_tokens)

    # Drop commas and whitespace so '1,500' and '1 500' both become '1500'
    compact_price_str = re.sub(r'[\s,]', '', full_price_str)

    # Take the first integer or decimal number in the string
    match = re.search(r'\d+(?:\.\d+)?', compact_price_str)
    if match is None:
        logger.warning(f"Could not convert price '{full_price_str}' to float. Returning NaN.")
        return np.nan
    return float(match.group())

# Extract specific entity types into new columns
df['products'] = df['extracted_entities'].apply(lambda x: extract_entities_by_type(x, 'PRODUCT'))
df['prices'] = df['extracted_entities'].apply(lambda x: extract_entities_by_type(x, 'PRICE'))
df['locations'] = df['extracted_entities'].apply(lambda x: extract_entities_by_type(x, 'LOC'))
df['contact_info'] = df['extracted_entities'].apply(lambda x: extract_entities_by_type(x, 'CONTACT_INFO'))

# Convert each extracted price string to a number (one entry per price entity in the message)
df['numerical_prices'] = df['prices'].apply(lambda x: [extract_numerical_price([p]) for p in x if p])
# Keep only the prices that parsed successfully (drops NaN entries) for easier averaging
df['all_numerical_prices'] = df['numerical_prices'].apply(lambda x: [val for val in x if not pd.isna(val)])

logger.info("Extracted specific entities and converted prices.")
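
To make the helper logic concrete, the sketch below walks one hand-written entity list through grouping and price parsing. The entities are illustrative, not real model output; with `aggregation_strategy="simple"` each dictionary also carries `start` and `end` offsets, which are omitted here. `group_entities` is a hypothetical single-pass alternative to calling `extract_entities_by_type` once per type.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hand-written example in the shape returned by the aggregated pipeline.
sample_entities = [
    {"entity_group": "PRODUCT", "word": "ጫማ", "score": 0.98},
    {"entity_group": "PRICE", "word": "ዋጋ 1,500 ብር", "score": 0.95},
    {"entity_group": "LOC", "word": "ቦሌ", "score": 0.91},
]


def group_entities(entities: List[dict]) -> Dict[str, List[str]]:
    """Collect entity words per entity group in one pass."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


grouped = group_entities(sample_entities)

# Parse each PRICE span: drop thousands separators, then take the first number.
prices = []
for word in grouped.get("PRICE", []):
    match = re.search(r"\d+(?:\.\d+)?", word.replace(",", ""))
    if match:
        prices.append(float(match.group()))

print(grouped)
print(prices)
```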

In [None]:
# --- 6. Develop Vendor Analytics Engine & Calculate Key Metrics ---

def calculate_vendor_metrics(df_vendor: pd.DataFrame) -> Dict:
    """
    Calculates key performance metrics for a single vendor.

    Args:
        df_vendor (pd.DataFrame): DataFrame containing posts for a single vendor.

    Returns:
        Dict: A dictionary of calculated metrics for the vendor.
    """
    if df_vendor.empty:
        return {
            'Avg. Views/Post': 0,
            'Posts/Week': 0,
            'Avg. Price (ETB)': 0,
            'Top Product': 'N/A',
            'Top Product Price': 'N/A',
            'Total Posts': 0,
            'Date Range Days': 0
        }

    # Total Posts
    total_posts = len(df_vendor)

    # Average Views per Post
    avg_views_per_post = df_vendor['views'].mean() if not df_vendor['views'].empty else 0

    # Posting Frequency (Posts per Week)
    min_date = df_vendor['date'].min()
    max_date = df_vendor['date'].max()
    date_range_days = (max_date - min_date).days + 1 # +1 counts both the first and the last day
    
    if date_range_days <= 0: # Defensive guard only: with the +1 above, the span is at least 1 day
        posting_frequency = total_posts # Consider all posts within a "single period"
        date_range_days = 1 # Set to 1 to avoid division by zero later if used
    else:
        # Posts per 7 days. Note: a vendor whose posts all fall on one day has a 1-day span,
        # so its post count is scaled up to a full week.
        posting_frequency = total_posts / (date_range_days / 7)

    # Average Price Point
    # Flatten all prices from all posts of the vendor
    all_prices_flat = [price for sublist in df_vendor['all_numerical_prices'] for price in sublist]
    avg_price_point = np.mean(all_prices_flat) if all_prices_flat else np.nan

    # Top Performing Post - find the message with the highest view count
    top_post = df_vendor.loc[df_vendor['views'].idxmax()]
    # Get the first product if available, otherwise 'N/A'
    top_product = top_post['products'][0] if top_post['products'] else 'N/A'
    # Get the first numerical price if available, otherwise NaN
    top_product_price = top_post['all_numerical_prices'][0] if top_post['all_numerical_prices'] else np.nan
    
    return {
        'Total Posts': total_posts,
        'Avg. Views/Post': avg_views_per_post,
        'Posts/Week': posting_frequency,
        'Avg. Price (ETB)': avg_price_point,
        'Top Product': top_product,
        'Top Product Price': top_product_price,
        'Date Range Days': date_range_days # Useful for debugging frequency
    }

In [None]:
# --- 7. Process All Vendors ---
logger.info("Calculating metrics for all unique vendors...")
vendor_metrics_list = []
unique_vendors = df['channel'].unique()

for vendor_name in unique_vendors:
    df_vendor = df[df['channel'] == vendor_name].copy()
    metrics = calculate_vendor_metrics(df_vendor)
    metrics['Vendor'] = vendor_name
    vendor_metrics_list.append(metrics)

vendor_scorecard_df = pd.DataFrame(vendor_metrics_list)

logger.info("Vendor metrics calculated.")
print("\nRaw Vendor Metrics:")
print(vendor_scorecard_df[['Vendor', 'Total Posts', 'Avg. Views/Post', 'Posts/Week', 'Avg. Price (ETB)', 'Top Product', 'Top Product Price']].to_string())

## Interpretation of Key Vendor Metrics
**Average Views per Post:**
This metric directly reflects a vendor's market reach and the level of customer engagement their content generates. A higher average view count indicates that the vendor's posts are being seen by more potential customers, suggesting broader visibility and potentially more interest in their products.

**Posting Frequency (Posts per Week):**
This metric indicates a vendor's activity and consistency. A higher posting frequency suggests an active and potentially reliable business that regularly engages its audience and updates its product offerings. Consistent activity is a positive signal for potential micro-lending.

**Average Price Point (ETB):**
This metric provides insight into a vendor's business profile and market segment. A higher average price point might indicate a vendor dealing in higher-value, lower-volume products (e.g., electronics, specialized machinery), while a lower average price point could suggest a high-volume, lower-margin business (e.g., common household goods, fast-moving consumer goods). This helps lenders understand the scale and nature of the business.

**Top Performing Post (Product & Price):**
Identifying the product and its price from the highest-viewed post helps understand what kind of content or product resonates most with a vendor's audience. This provides qualitative insight into their most successful offerings, which can be valuable for business assessment.
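
As a worked example of the arithmetic behind these metrics, the sketch below computes them for one hypothetical vendor using only the standard library. The dates and view counts are made up for illustration; the span and frequency formulas mirror the ones in `calculate_vendor_metrics`.

```python
from datetime import date
from statistics import mean

# Made-up posts for one hypothetical vendor: (posting date, view count).
posts = [
    (date(2024, 1, 1), 100),
    (date(2024, 1, 8), 300),
    (date(2024, 1, 14), 200),
]

dates = [d for d, _ in posts]
views = [v for _, v in posts]

avg_views = mean(views)                          # (100 + 300 + 200) / 3
span_days = (max(dates) - min(dates)).days + 1   # inclusive span: 14 days
posts_per_week = len(posts) / (span_days / 7)    # 3 posts over 2 weeks

# The top performing post is simply the one with the most views.
top_date, top_views = max(posts, key=lambda p: p[1])

print(avg_views, span_days, posts_per_week, top_date, top_views)
```

Here the vendor averages 200 views per post and 1.5 posts per week, and the post from 8 January is the top performer. Because the span is inclusive, a vendor with a single post has a one-day span and comes out at roughly 7 posts per week, which is worth keeping in mind when reading the frequency column.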

In [None]:
# --- 8. Create a Final "Lending Score" ---
# Normalize metrics before combining them to ensure fair weighting
# We will use Min-Max scaling: (X - min(X)) / (max(X) - min(X))

# Define metrics to normalize and their importance (weights)
# Example weights, these can be adjusted based on business priorities
# Views and Posting Frequency are often good indicators of engagement.
# Price point might indicate market segment (luxury vs. mass market).
METRIC_WEIGHTS = {
    'Avg. Views/Post': 0.4,
    'Posts/Week': 0.4,
    'Avg. Price (ETB)': 0.2, # Lower weight as it defines market segment, not directly engagement
}

# Normalize on a copy so the raw metrics stay untouched for display
# (a vendor with no extracted prices keeps NaN there rather than showing a price of 0)
normalized_df = vendor_scorecard_df.copy()
for metric in METRIC_WEIGHTS:
    if metric not in normalized_df.columns:
        normalized_df[metric] = np.nan # Metric unavailable for every vendor
    # For scoring only, treat a missing value as 0 (e.g., if no prices were extracted)
    values = normalized_df[metric].fillna(0)
    min_val = values.min()
    max_val = values.max()
    
    if max_val == min_val: # Avoid division by zero if all values are the same
        normalized_df[f'Normalized {metric}'] = 0.0
    else:
        normalized_df[f'Normalized {metric}'] = (values - min_val) / (max_val - min_val)
        
    logger.info(f"Normalized '{metric}' (Min: {min_val:.2f}, Max: {max_val:.2f})")

# Calculate the Lending Score
normalized_df['Lending Score'] = 0.0
for metric, weight in METRIC_WEIGHTS.items():
    normalized_df['Lending Score'] += normalized_df[f'Normalized {metric}'] * weight

# Optionally scale Lending Score to 0-100 or another range for easier interpretation
max_possible_score = sum(METRIC_WEIGHTS.values()) # Should be 1.0 if weights sum to 1
normalized_df['Lending Score (0-100)'] = (normalized_df['Lending Score'] / max_possible_score) * 100

logger.info("Lending Score calculated.")

## Interpretation of the Lending Score
The 'Lending Score' is a composite metric designed to identify promising vendors for micro-lending. It is calculated as a weighted sum of normalized key performance indicators:
  - Average Views per Post (Weight: 0.4): Reflects market reach and customer engagement.
  - Posting Frequency (Weight: 0.4): Indicates business activity and consistency.
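  - Average Price Point (Weight: 0.2): Reflects the vendor's market segment (e.g., high-value vs. high-volume goods). Because the normalized value is added to the score, a higher average price raises it, and a vendor with no extracted prices is scored as 0 on this metric.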

A higher Lending Score (on a scale of 0-100) suggests a vendor with stronger engagement, consistent activity, and a potentially viable business model based on their Telegram presence. This score serves as a data-driven input for EthioMart's micro-lending decisions, complementing traditional financial assessments.
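
The calculation can be traced by hand on a small example. The sketch below applies min-max scaling and the weights above to three hypothetical vendors; all vendor names and metric values are invented for illustration, and the short metric keys are placeholders for the scorecard column names.

```python
# Weights as described above; keys are shortened placeholders for the scorecard columns.
WEIGHTS = {"views": 0.4, "posts_per_week": 0.4, "price": 0.2}

# Invented metrics for three hypothetical vendors.
vendors = {
    "shop_a": {"views": 1000.0, "posts_per_week": 10.0, "price": 500.0},
    "shop_b": {"views": 600.0, "posts_per_week": 2.0, "price": 2000.0},
    "shop_c": {"views": 200.0, "posts_per_week": 6.0, "price": 1250.0},
}


def min_max(values):
    """Scale values to [0, 1]; return zeros when all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


names = list(vendors)
scores = {name: 0.0 for name in names}
for metric, weight in WEIGHTS.items():
    normalized = min_max([vendors[n][metric] for n in names])
    for name, value in zip(names, normalized):
        scores[name] += weight * value  # weighted sum of normalized metrics

total_weight = sum(WEIGHTS.values())
scores_0_100 = {n: round(100 * s / total_weight, 2) for n, s in scores.items()}
print(scores_0_100)
```

In this example `shop_a` scores 80, `shop_b` scores 40, and `shop_c` scores 30. Since min-max scaling is relative to the vendors being compared, the lowest vendor on a metric always contributes 0 for it, and scores shift whenever vendors are added or removed.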

In [None]:
# --- 9. Present Vendor Scorecard Table ---
final_scorecard_display = normalized_df[[
    'Vendor',
    'Avg. Views/Post',
    'Posts/Week',
    'Avg. Price (ETB)',
    'Top Product',
    'Top Product Price',
    'Lending Score (0-100)'
]].sort_values(by='Lending Score (0-100)', ascending=False) # Sort by lending score

print("\n--- FinTech Vendor Scorecard Summary ---")
print("The table below presents the calculated metrics and the composite 'Lending Score' for each analyzed vendor. Vendors are sorted by their Lending Score in descending order, with higher scores indicating potentially more promising candidates for micro-lending.")
print(final_scorecard_display.to_string(index=False))

In [None]:
# --- Result Summary ---
if not final_scorecard_display.empty:
    top_vendor = final_scorecard_display.iloc[0]
    print("\n**Top Performing Vendor:**")
    print(f"  - Vendor: {top_vendor['Vendor']}")
    print(f"  - Lending Score (0-100): {top_vendor['Lending Score (0-100)']:.2f}")
    print(f"  - Avg. Views/Post: {top_vendor['Avg. Views/Post']:.2f}")
    print(f"  - Posts/Week: {top_vendor['Posts/Week']:.2f}")
    print(f"  - Average Price (ETB): {top_vendor['Avg. Price (ETB)'] if not pd.isna(top_vendor['Avg. Price (ETB)']) else 'N/A'}")
    print(f"  - Top Product: {top_vendor['Top Product']}")
    print(f"  - Top Product Price: {top_vendor['Top Product Price'] if not pd.isna(top_vendor['Top Product Price']) else 'N/A'}")
    print("\nThis vendor demonstrates strong online activity and customer engagement, making them a prime candidate for micro-lending based on their Telegram presence.")
    
    if len(final_scorecard_display) > 1:
        bottom_vendor = final_scorecard_display.iloc[-1]
        print("\n**Lowest Performing Vendor (for comparison):**")
        print(f"  - Vendor: {bottom_vendor['Vendor']}")
        print(f"  - Lending Score (0-100): {bottom_vendor['Lending Score (0-100)']:.2f}")
        print(f"  - Avg. Views/Post: {bottom_vendor['Avg. Views/Post']:.2f}")
        print(f"  - Posts/Week: {bottom_vendor['Posts/Week']:.2f}")
        print(f"  - Average Price (ETB): {bottom_vendor['Avg. Price (ETB)'] if not pd.isna(bottom_vendor['Avg. Price (ETB)']) else 'N/A'}")
        print(f"  - Top Product: {bottom_vendor['Top Product']}")
        print(f"  - Top Product Price: {bottom_vendor['Top Product Price'] if not pd.isna(bottom_vendor['Top Product Price']) else 'N/A'}")
        print("\nThis vendor shows lower activity and engagement, indicating a less promising profile for micro-lending based solely on Telegram data.")
else:
    print("No vendor data available to generate a summary.")

# --- 10. Save the Scorecard ---
os.makedirs(VENDOR_SCORECARD_PATH.parent, exist_ok=True)
final_scorecard_display.to_csv(VENDOR_SCORECARD_PATH, index=False, encoding='utf-8')
logger.info(f"Vendor scorecard saved to: {VENDOR_SCORECARD_PATH}")