# Task 1 — Data Exploration & Enrichment
**Author:** Kudu T  
**Date:** 2026-01-30

Objective: Understand the starter dataset on financial inclusion in Ethiopia and enrich it with additional observations, events, or impact links.


In [12]:
import pandas as pd
import numpy as np
from datetime import datetime

pd.set_option("display.max_columns", None)


In [7]:
# File paths
file_path = "../data/raw/ethiopia_fi_unified_data.xlsx"
ref_codes_path = "../data/raw/reference_codes.xlsx"

# Load Excel sheets
xls = pd.ExcelFile(file_path)
df_data = pd.read_excel(xls, sheet_name="ethiopia_fi_unified_data")
df_impact = pd.read_excel(xls, sheet_name="Impact_sheet")

# Load reference codes
ref_codes = pd.read_excel(ref_codes_path)

# Quick look
df_data.head(), df_impact.head(), ref_codes.head()


(  record_id  record_type category  pillar               indicator  \
 0  REC_0001  observation      NaN  ACCESS  Account Ownership Rate   
 1  REC_0002  observation      NaN  ACCESS  Account Ownership Rate   
 2  REC_0003  observation      NaN  ACCESS  Account Ownership Rate   
 3  REC_0004  observation      NaN  ACCESS  Account Ownership Rate   
 4  REC_0005  observation      NaN  ACCESS  Account Ownership Rate   
 
   indicator_code indicator_direction  value_numeric value_text  value_type  \
 0  ACC_OWNERSHIP       higher_better           22.0        NaN  percentage   
 1  ACC_OWNERSHIP       higher_better           35.0        NaN  percentage   
 2  ACC_OWNERSHIP       higher_better           46.0        NaN  percentage   
 3  ACC_OWNERSHIP       higher_better           56.0        NaN  percentage   
 4  ACC_OWNERSHIP       higher_better           36.0        NaN  percentage   
 
    ... impact_direction impact_magnitude impact_estimate lag_months  \
 0  ...              NaN      

## Unified Schema Overview

The dataset follows a unified schema design where:
- All records share the same columns
- The `record_type` field determines how each row is interpreted

Record types include:
- observation: measured indicators (e.g., Findex, infrastructure)
- event: policies, product launches, milestones
- impact_link: modeled relationships between events and indicators
- target: official policy goals

Events are intentionally NOT assigned to pillars.
Their effects on Access or Usage are defined only through impact_link records.
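
The rule that events carry no pillar can be checked directly. The following is a minimal sketch, assuming a frame with the `record_type` and `pillar` columns described above; the helper name `events_with_pillar` and the sample rows are made up for illustration.

```python
import pandas as pd


def events_with_pillar(df: pd.DataFrame) -> pd.DataFrame:
    """Return event rows that carry a pillar value (the schema expects none)."""
    is_event = df["record_type"] == "event"
    return df[is_event & df["pillar"].notna()]


# Illustrative rows only, not taken from the dataset
sample = pd.DataFrame(
    {
        "record_id": ["A", "B", "C"],
        "record_type": ["observation", "event", "event"],
        "pillar": ["ACCESS", None, "USAGE"],
    }
)

violations = events_with_pillar(sample)
print(violations["record_id"].tolist())
```

Applied to `df_data`, an empty result would indicate that every event follows the rule.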


In [29]:
# Separate record types
obs = df_data[df_data["record_type"] == "observation"]
events = df_data[df_data["record_type"] == "event"]

print("Observations:", len(obs))
print("Events:", len(events))

# Record type counts
print(df_data["record_type"].value_counts())

# Pillar coverage
print(df_data["pillar"].value_counts(dropna=False))

# Date range across all record types (computed on df_data, not obs)
print(df_data["observation_date"].min(), df_data["observation_date"].max())

# Indicator codes and the number of records carrying each
print(df_data["indicator_code"].value_counts())

# Source types and confidence
print(df_data["source_type"].value_counts())
print(df_data["confidence"].value_counts())

# Number of distinct observation dates per indicator (observations only)
indicator_coverage = obs.groupby("indicator_code")["observation_date"].nunique().sort_values()
indicator_coverage


Observations: 31
Events: 12
record_type
observation    31
event          12
target          3
Name: count, dtype: int64
pillar
ACCESS           17
NaN              12
USAGE            11
GENDER            5
AFFORDABILITY     1
Name: count, dtype: int64
2014-12-31 00:00:00 2030-12-31 00:00:00
indicator_code
ACC_OWNERSHIP         7
ACC_FAYDA             4
ACC_MM_ACCOUNT        2
ACC_4G_COV            2
USG_P2P_COUNT         2
GEN_MM_SHARE          2
GEN_GAP_ACC           2
USG_ATM_COUNT         1
USG_P2P_VALUE         1
USG_CROSSOVER         1
USG_TELEBIRR_USERS    1
USG_TELEBIRR_VALUE    1
USG_ATM_VALUE         1
ACC_MOBILE_PEN        1
USG_MPESA_ACTIVE      1
USG_MPESA_USERS       1
AFF_DATA_INCOME       1
USG_ACTIVE_RATE       1
GEN_GAP_MOBILE        1
EVT_TELEBIRR          1
EVT_SAFARICOM         1
EVT_MPESA             1
EVT_FAYDA             1
EVT_FX_REFORM         1
EVT_CROSSOVER         1
EVT_MPESA_INTEROP     1
EVT_ETHIOPAY          1
EVT_NFIS2             1
EVT_SAFCOM_PRICE    

indicator_code
ACC_MOBILE_PEN        1
ACC_SMARTPHONE_PCT    1
AFF_DATA_INCOME       1
USG_ATM_VALUE         1
USG_ATM_COUNT         1
USG_ACTIVE_RATE       1
GEN_MM_SHARE          1
GEN_GAP_MOBILE        1
USG_MPESA_ACTIVE      1
USG_CROSSOVER         1
USG_MPESA_USERS       1
USG_P2P_VALUE         1
USG_TELEBIRR_USERS    1
USG_TELEBIRR_VALUE    1
ACC_MM_ACCOUNT        2
ACC_4G_COV            2
GEN_GAP_ACC           2
USG_P2P_COUNT         2
ACC_FAYDA             3
ACC_OWNERSHIP         4
Name: observation_date, dtype: int64

**Observations:**
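- The starter data holds 30 observations, 10 events, and 3 targets. The output above reports 31 observations and 12 events because this cell was run (In [29]) after the enrichment cells below, which added one observation and two copies of the enrichment event
- ACCESS (17 records) and USAGE (11 records) are the most common pillars
- Dates across all record types range from 2014-12-31 to 2030-12-31; the printed range is computed on `df_data`, not on `obs` alone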
- High-confidence events include Telebirr launch, NFIS policy, M-Pesa entry
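
The coverage series computed in the cell above can be turned into a shortlist of indicators to enrich. This is a minimal sketch: the helper name `sparse_indicators` and the threshold `min_dates=2` are assumptions, and the sample rows use placeholder dates rather than dataset values.

```python
import pandas as pd


def sparse_indicators(obs: pd.DataFrame, min_dates: int = 2) -> list[str]:
    """Indicator codes observed on fewer than `min_dates` distinct dates."""
    coverage = obs.groupby("indicator_code")["observation_date"].nunique()
    return sorted(coverage[coverage < min_dates].index)


# Illustrative rows; the dates are placeholders
sample_obs = pd.DataFrame(
    {
        "indicator_code": ["ACC_OWNERSHIP", "ACC_OWNERSHIP", "USG_ATM_COUNT"],
        "observation_date": pd.to_datetime(
            ["2014-12-31", "2017-12-31", "2024-06-30"]
        ),
    }
)

print(sparse_indicators(sample_obs))
```

Calling `sparse_indicators(obs)` on the notebook's `obs` frame would list the indicators that currently have a single observation date.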


In [30]:
events_df = df_data[df_data["record_type"] == "event"][
    ["observation_date", "category", "original_text", "notes", "confidence"]
].sort_values("observation_date")

events_df


| | observation_date | category | original_text | notes | confidence |
|---|---|---|---|---|---|
| 33 | 2021-05-17 | product_launch | First major mobile money service in Ethiopia | | high |
| 41 | 2021-09-01 | policy | 5-year national financial inclusion strategy | | high |
| 34 | 2022-08-01 | market_entry | End of state telecom monopoly | | high |
| 45 | 2023-06-01 | policy | Ethiopia launched the Fayda Digital ID system ... | Digital ID reduces KYC barriers for account op... | medium |
| 35 | 2023-08-01 | product_launch | Second mobile money entrant | | high |
| 36 | 2024-01-01 | infrastructure | National biometric digital ID system | | high |
| 37 | 2024-07-29 | policy | Birr float introduced | | high |
| 38 | 2024-10-01 | milestone | Historic: digital > cash for first time | | high |
| 39 | 2025-10-27 | partnership | Full interoperability for M-Pesa | | high |
| 42 | 2025-12-15 | pricing | Data and voice prices increased 20-82% | | high |


**Insight:**  
Core indicators such as Account Ownership have multi-year coverage (`ACC_OWNERSHIP` has 4 distinct observation dates and `ACC_FAYDA` has 3),
while most infrastructure and usage indicators have only one or two.
This motivates targeted data enrichment.

**Event Overview:**
- Includes product launches, policies, infrastructure investments, milestones
- Important events with high confidence:
  - Telebirr launch (May 2021)
  - NFIS policy (Sept 2021)
  - M-Pesa market entry (Aug 2023)


In [24]:
new_observation = {
    "record_id": "OBS_ENRICH_001",
    "record_type": "observation",
    "pillar": "ACCESS",
    "indicator": "Smartphone penetration",
    "indicator_code": "ACC_SMARTPHONE_PCT",
    "indicator_direction": "higher_better",
    "value_numeric": 28.0,
    "value_type": "percentage",
    "unit": "%",
    "observation_date": pd.to_datetime("2023-01-01"),
    "source_name": "GSMA",
    "source_type": "industry_report",
    "source_url": "https://www.gsma.com",
    "confidence": "medium",
    "original_text": "Smartphone penetration in Ethiopia reached approximately 28% in 2023.",
    "collected_by": "Kudu T",
    "collection_date": pd.to_datetime("2026-01-30"),
    "notes": "Smartphone access enables app-based digital financial services."
}

df_data = pd.concat([df_data, pd.DataFrame([new_observation])], ignore_index=True)

df_data[df_data["record_id"] == "OBS_ENRICH_001"]


Unnamed: 0,record_id,record_type,category,pillar,indicator,indicator_code,indicator_direction,value_numeric,value_text,value_type,unit,observation_date,period_start,period_end,fiscal_year,gender,location,region,source_name,source_type,source_url,confidence,related_indicator,relationship_type,impact_direction,impact_magnitude,impact_estimate,lag_months,evidence_basis,comparable_country,collected_by,collection_date,original_text,notes,event_name,event_date
44,OBS_ENRICH_001,observation,,ACCESS,Smartphone penetration,ACC_SMARTPHONE_PCT,higher_better,28.0,,percentage,%,2023-01-01,NaT,NaT,,,,,GSMA,industry_report,https://www.gsma.com,medium,,,,,,,,,Kudu T,2026-01-30 00:00:00,Smartphone penetration in Ethiopia reached app...,Smartphone access enables app-based digital fi...,,


In [25]:
new_event = {
    "record_id": "EVT_ENRICH_001",
    "record_type": "event",
    "category": "policy",
    "event_name": "Fayda Digital ID national rollout",
    "observation_date": pd.to_datetime("2023-06-01"),
    "source_name": "Government of Ethiopia",
    "source_type": "government",
    "source_url": "https://id.gov.et",
    "confidence": "medium",
    "original_text": "Ethiopia launched the Fayda Digital ID system to enable access to digital services.",
    "collected_by": "Kudu T",
    "collection_date": pd.to_datetime("2026-01-30"),
    "notes": "Digital ID reduces KYC barriers for account opening."
}

df_data = pd.concat([df_data, pd.DataFrame([new_event])], ignore_index=True)

df_data[df_data["record_id"] == "EVT_ENRICH_001"]


Unnamed: 0,record_id,record_type,category,pillar,indicator,indicator_code,indicator_direction,value_numeric,value_text,value_type,unit,observation_date,period_start,period_end,fiscal_year,gender,location,region,source_name,source_type,source_url,confidence,related_indicator,relationship_type,impact_direction,impact_magnitude,impact_estimate,lag_months,evidence_basis,comparable_country,collected_by,collection_date,original_text,notes,event_name,event_date
43,EVT_ENRICH_001,event,policy,,,,,,,,,NaT,NaT,NaT,,,,,Government of Ethiopia,,https://id.gov.et,medium,,,,,,,,,Kudu T,2026-01-30,Ethiopia launched the Fayda Digital ID system ...,Digital ID reduces KYC barriers for account op...,Fayda Digital ID national rollout,2023-06-01
45,EVT_ENRICH_001,event,policy,,,,,,,,,2023-06-01,NaT,NaT,,,,,Government of Ethiopia,government,https://id.gov.et,medium,,,,,,,,,Kudu T,2026-01-30 00:00:00,Ethiopia launched the Fayda Digital ID system ...,Digital ID reduces KYC barriers for account op...,Fayda Digital ID national rollout,
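
The display above lists `EVT_ENRICH_001` twice (rows 43 and 45), which is what happens when an append cell is executed more than once. One way to make the append safe to re-run is sketched below; the helper name `append_record` is hypothetical and the frame is a stand-in for `df_data`.

```python
import pandas as pd


def append_record(df: pd.DataFrame, record: dict) -> pd.DataFrame:
    """Append `record` unless a row with the same record_id already exists."""
    if (df["record_id"] == record["record_id"]).any():
        return df  # already present, so re-running the cell changes nothing
    return pd.concat([df, pd.DataFrame([record])], ignore_index=True)


# Illustrative frame and record, not the real dataset
df = pd.DataFrame({"record_id": ["REC_0001"], "record_type": ["observation"]})
event = {"record_id": "EVT_ENRICH_001", "record_type": "event"}

df = append_record(df, event)
df = append_record(df, event)  # second call is a no-op
print(len(df))
```

For a frame that already contains duplicates, `df_data.drop_duplicates(subset="record_id", keep="last")` would keep the later copy. Which copy to keep is a judgment call here, since row 43 stores the date in `event_date` and row 45 stores it in `observation_date`.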


In [26]:
new_impact_link = {
    "record_id": "IMP_ENRICH_001",
    "parent_id": "EVT_ENRICH_001",
    "record_type": "impact_link",
    "pillar": "ACCESS",
    "related_indicator": "ACC_OWNERSHIP",
    "impact_direction": "positive",
    "impact_magnitude": 0.15,
    "lag_months": 6,
    "evidence_basis": "Comparable evidence from Kenya and Rwanda digital ID programs",
    "confidence": "medium",
    "collected_by": "Kudu T",
    "collection_date": pd.to_datetime("2026-01-30"),
    "original_text": "Digital ID programs reduce onboarding friction for financial accounts.",
    "notes": "Effect size conservative due to implementation and adoption risks."
}

df_impact = pd.concat([df_impact, pd.DataFrame([new_impact_link])], ignore_index=True)

df_impact[df_impact["record_id"] == "IMP_ENRICH_001"]


Unnamed: 0,record_id,parent_id,record_type,category,pillar,indicator,indicator_code,indicator_direction,value_numeric,value_text,value_type,unit,observation_date,period_start,period_end,fiscal_year,gender,location,region,source_name,source_type,source_url,confidence,related_indicator,relationship_type,impact_direction,impact_magnitude,impact_estimate,lag_months,evidence_basis,comparable_country,collected_by,collection_date,original_text,notes
16,IMP_ENRICH_001,EVT_ENRICH_001,impact_link,,ACCESS,,,,,,,,NaT,,,,,,,,,,medium,ACC_OWNERSHIP,,positive,0.15,,6,Comparable evidence from Kenya and Rwanda digi...,,Kudu T,2026-01-30 00:00:00,Digital ID programs reduce onboarding friction...,Effect size conservative due to implementation...
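
The new link stores `"positive"` and `0.15`, whereas the impact links displayed in the join output later in this notebook use `increase`/`decrease` for direction and `high`/`medium` for magnitude. A small check along the following lines could flag such mismatches before saving. The allowed sets are assumptions inferred from the displayed rows (the authoritative lists may be in `reference_codes.xlsx`), and `off_vocabulary` and `IMP_0001` are hypothetical names.

```python
import pandas as pd

# Assumed vocabularies. Only "high" and "medium" appear in the displayed
# rows; "low" is a guess.
ALLOWED = {
    "impact_direction": {"increase", "decrease"},
    "impact_magnitude": {"high", "medium", "low"},
}


def off_vocabulary(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose non-null value in a checked column is not allowed."""
    bad = pd.Series(False, index=df.index)
    for column, allowed in ALLOWED.items():
        values = df[column]
        bad |= values.notna() & ~values.isin(allowed)
    return df[bad]


# Illustrative rows; IMP_0001 is a made-up identifier
links = pd.DataFrame(
    {
        "record_id": ["IMP_0001", "IMP_ENRICH_001"],
        "impact_direction": ["increase", "positive"],
        "impact_magnitude": ["high", 0.15],
    }
)

print(off_vocabulary(links)["record_id"].tolist())
```

If the numeric estimate is worth keeping, the schema's `impact_estimate` column looks like a candidate home for it, though that is a reading of the column name rather than something the data confirms.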


In [27]:
print("New observations added:")
print(df_data[df_data["record_id"] == "OBS_ENRICH_001"])

print("\nNew events added:")
print(df_data[df_data["record_id"] == "EVT_ENRICH_001"])

print("\nNew impact links added:")
print(df_impact[df_impact["record_id"] == "IMP_ENRICH_001"])


New observations added:
         record_id  record_type category  pillar               indicator  \
44  OBS_ENRICH_001  observation      NaN  ACCESS  Smartphone penetration   

        indicator_code indicator_direction  value_numeric value_text  \
44  ACC_SMARTPHONE_PCT       higher_better           28.0        NaN   

    value_type unit observation_date period_start period_end fiscal_year  \
44  percentage    %       2023-01-01          NaT        NaT         NaN   

   gender location  region source_name      source_type            source_url  \
44    NaN      NaN     NaN        GSMA  industry_report  https://www.gsma.com   

   confidence  related_indicator  relationship_type  impact_direction  \
44     medium                NaN                NaN               NaN   

    impact_magnitude  impact_estimate  lag_months  evidence_basis  \
44               NaN              NaN         NaN             NaN   

   comparable_country collected_by      collection_date  \
44               

In [31]:
df_event_impact = df_impact.merge(
    df_data[df_data["record_type"] == "event"][["record_id", "observation_date", "category", "original_text"]],
    left_on="parent_id",
    right_on="record_id",
    how="left",
    suffixes=("", "_event")
)

df_event_impact[[
    "observation_date_event",
    "category_event",
    "original_text_event",
    "pillar",
    "related_indicator",
    "impact_direction",
    "impact_magnitude",
    "lag_months"
]]


| | observation_date_event | category_event | original_text_event | pillar | related_indicator | impact_direction | impact_magnitude | lag_months |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-05-17 | product_launch | First major mobile money service in Ethiopia | ACCESS | ACC_OWNERSHIP | increase | high | 12 |
| 1 | 2021-05-17 | product_launch | First major mobile money service in Ethiopia | USAGE | USG_TELEBIRR_USERS | increase | high | 3 |
| 2 | 2021-05-17 | product_launch | First major mobile money service in Ethiopia | USAGE | USG_P2P_COUNT | increase | high | 6 |
| 3 | 2022-08-01 | market_entry | End of state telecom monopoly | ACCESS | ACC_4G_COV | increase | medium | 12 |
| 4 | 2022-08-01 | market_entry | End of state telecom monopoly | AFFORDABILITY | AFF_DATA_INCOME | decrease | medium | 12 |
| 5 | 2023-08-01 | product_launch | Second mobile money entrant | USAGE | USG_MPESA_USERS | increase | high | 3 |
| 6 | 2023-08-01 | product_launch | Second mobile money entrant | ACCESS | ACC_MM_ACCOUNT | increase | medium | 6 |
| 7 | 2024-01-01 | infrastructure | National biometric digital ID system | ACCESS | ACC_OWNERSHIP | increase | medium | 24 |
| 8 | 2024-01-01 | infrastructure | National biometric digital ID system | GENDER | GEN_GAP_ACC | decrease | medium | 24 |
| 9 | 2024-07-29 | policy | Birr float introduced | AFFORDABILITY | AFF_DATA_INCOME | increase | high | 3 |
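
Because the join key on the event side is `record_id`, a duplicated event row would duplicate every impact link that points to it. pandas can guard against this: `merge` accepts `validate="many_to_one"` and raises `pandas.errors.MergeError` when the right-hand keys are not unique. The sketch below uses small illustrative frames, not the real data.

```python
import pandas as pd

# Illustrative frames; EVT_B is deliberately duplicated on the event side
links = pd.DataFrame(
    {"parent_id": ["EVT_A", "EVT_A", "EVT_B"], "lag_months": [12, 3, 6]}
)
events = pd.DataFrame(
    {
        "record_id": ["EVT_A", "EVT_B", "EVT_B"],
        "category": ["policy", "pricing", "pricing"],
    }
)

try:
    links.merge(
        events,
        left_on="parent_id",
        right_on="record_id",
        how="left",
        validate="many_to_one",
    )
    duplicate_keys = False
except pd.errors.MergeError:
    duplicate_keys = True  # the right side has repeated record_id values

# After dropping duplicate event rows the same merge goes through
clean = events.drop_duplicates(subset="record_id")
joined = links.merge(
    clean,
    left_on="parent_id",
    right_on="record_id",
    how="left",
    validate="many_to_one",
)
print(duplicate_keys, len(joined))
```

With the check in place, the joined frame keeps one row per impact link.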


In [32]:
df_data.to_excel(
    "../data/processed/ethiopia_fi_unified_data_enriched.xlsx",
    index=False
)

df_impact.to_excel(
    "../data/processed/impact_links_enriched.xlsx",
    index=False
)

print("Enriched datasets saved successfully.")


Enriched datasets saved successfully.
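
The save step assumes that `../data/processed/` already exists. A small helper can create it on demand; this is a sketch, and `ensure_parent` is a hypothetical name. It runs in a temporary directory so it does not touch the project tree.

```python
from pathlib import Path
import tempfile


def ensure_parent(path: str) -> Path:
    """Create the parent directory of `path` if needed and return the Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demonstrated in a temporary directory so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp:
    out = ensure_parent(f"{tmp}/data/processed/example.xlsx")
    parent_exists = out.parent.is_dir()

print(parent_exists)
```

In the notebook this could be used as `df_data.to_excel(ensure_parent("../data/processed/ethiopia_fi_unified_data_enriched.xlsx"), index=False)`.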


**Next Steps:**
- Perform exploratory data analysis on the enriched dataset (Task 2)
- Visualize trends in Access and Usage
- Map events to indicators to estimate historical impact
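
As a starting point for the event-to-indicator mapping, the date at which an effect is expected to appear can be derived from the event date and `lag_months`. The sketch below reuses two rows from the joined event/impact table shown earlier. Reading `lag_months` as "months until the effect starts" is an assumption, and `expected_onset` is a hypothetical column name.

```python
import pandas as pd

# Two rows copied from the joined event/impact table
links = pd.DataFrame(
    {
        "observation_date_event": pd.to_datetime(["2021-05-17", "2024-07-29"]),
        "related_indicator": ["ACC_OWNERSHIP", "AFF_DATA_INCOME"],
        "lag_months": [12, 3],
    }
)

# Shift each event date forward by its own lag, in calendar months
links["expected_onset"] = pd.to_datetime(
    links.apply(
        lambda row: row["observation_date_event"]
        + pd.DateOffset(months=int(row["lag_months"])),
        axis=1,
    )
)

print(links["expected_onset"].dt.strftime("%Y-%m-%d").tolist())
```

Observations dated after `expected_onset` for the linked indicator are then the ones to compare against the pre-event values.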
