# Problem Formulation: Predicting the Risk of Energy Poverty in Developing Areas

## Problem Statement

Globally, more than 700 million people lack access to electricity, a major obstacle to progress in education, healthcare, and the economy. Most current approaches to energy poverty are reactive: they rely on outdated annual surveys and reports that cannot pinpoint the areas most in need of intervention.

Using historical economic, demographic, and infrastructure data from 1990 to 2020, we aim to forecast which nations will experience energy poverty (electricity access below 90%) between 2024 and 2027. To support data-driven resource allocation by international development organizations, the model will classify nations into three risk categories: severe, moderate, and minimal.

## Central Question

**Using historical trends in infrastructure development, demographic shifts, and economic indicators from 1990 to 2020, can we forecast which countries will face electricity access challenges?**

## Secondary Questions
- Which variables (GDP, urbanization, population density) correlate most strongly with a rise or drop in energy access?
- Can nations be grouped into meaningful risk categories (severe, moderate, and minimal) to help prioritize interventions?
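
A simple correlation screen is one way to start on the first question. The sketch below assumes a table with one row per country-year; the function name `rank_correlations`, the column names, and the toy values are illustrative and are not taken from the World Bank files. Correlation only indicates association, so the ranking is a starting point rather than evidence of what drives access.

```python
import pandas as pd

def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`,
    ordered by absolute strength (strongest first)."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy values for illustration only (not real World Bank figures)
toy = pd.DataFrame({
    "access_pct": [20.0, 35.0, 60.0, 85.0, 99.0],
    "gdp_per_capita": [400.0, 700.0, 1500.0, 4000.0, 12000.0],
    "urban_pct": [18.0, 25.0, 40.0, 60.0, 80.0],
    "pop_density": [90.0, 30.0, 150.0, 60.0, 110.0],
})

ranking = rank_correlations(toy, "access_pct")
print(ranking)
```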

## Problem Type

This is primarily a **supervised learning problem** with an exploratory component. It combines:
- **Prediction**: Predicting electricity access rates as a continuous percentage (regression)
- **Classification**: Classifying nations into risk categories for energy poverty
- **Exploration**: Finding the main underlying factors and trends in energy access

## Inputs (Features)

The following predictor variables from World Bank datasets will be used in the model:

*Economic Indicators:*
- GDP per capita (Current USD)

*Demographic Indicators:*
- Urban population (% of total)
- Rural population (% of total)
- Population density (persons/km²)
- Total population

*Structural Variables:*
- Year (1990-2023)
- Country/Region (ISO codes)

*Source: World Bank World Development Indicators*

## Time Horizon

**Training:** 1990-2020 historical data

**Prediction:** 2024-2027 (a four-year forecast window)

**Validation:** Predict 2021-2023 from 1990-2020 data and compare against the observed values (accuracy check)
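
The split described above can be sketched as a filter on the year. This assumes the data has already been reshaped into one row per country-year with an integer `Year` column; the function name `temporal_split` and the placeholder rows are hypothetical. Splitting by time rather than at random keeps future observations out of the training set.

```python
import pandas as pd

def temporal_split(panel: pd.DataFrame, train_start: int = 1990,
                   train_end: int = 2020, valid_end: int = 2023,
                   year_col: str = "Year"):
    """Split a country-year panel by time.

    Training rows:   train_start <= year <= train_end
    Validation rows: train_end   <  year <= valid_end
    """
    years = panel[year_col]
    train = panel[(years >= train_start) & (years <= train_end)]
    valid = panel[(years > train_end) & (years <= valid_end)]
    return train, valid

# Placeholder rows for a made-up country code (not real data)
panel = pd.DataFrame({
    "Country Code": ["AAA"] * 6,
    "Year": [2018, 2019, 2020, 2021, 2022, 2023],
    "access_pct": [70.0, 72.0, 74.0, 75.0, 77.0, 79.0],
})

train, valid = temporal_split(panel)
print(len(train), len(valid))
```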

## Outputs (Target Variable)

**Primary Output:** 
- Predicted rates of access to electricity (%) for each country in 2024–2027

**Classification Output:**
- Risk categories for energy poverty: Severe (<50%), Moderate (50% to <90%), Minimal (≥90%)
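
A minimal sketch of how these thresholds could be applied to predicted access rates. It treats the bands as half-open intervals, so a value such as 89.5% counts as Moderate and 100% counts as Minimal; that boundary handling is an assumption. The names `risk_category` and `rank_highest_risk` and the country codes are hypothetical.

```python
import math

def risk_category(access_pct: float) -> str:
    """Map an electricity access rate (%) to a risk label.

    Assumed boundaries: Severe < 50 <= Moderate < 90 <= Minimal.
    """
    if access_pct is None or math.isnan(access_pct):
        return "Unknown"  # no prediction available
    if access_pct < 50:
        return "Severe"
    if access_pct < 90:
        return "Moderate"
    return "Minimal"

def rank_highest_risk(predictions: dict, n: int = 10) -> list:
    """Return the n (code, access_pct) pairs with the lowest predicted access."""
    valid = {code: v for code, v in predictions.items()
             if v is not None and not math.isnan(v)}
    return sorted(valid.items(), key=lambda kv: kv[1])[:n]

# Hypothetical predictions keyed by placeholder country codes
predicted = {"AAA": 42.0, "BBB": 89.5, "CCC": 100.0, "DDD": 67.3}
labels = {code: risk_category(v) for code, v in predicted.items()}
top_two = rank_highest_risk(predicted, n=2)
print(labels)
print(top_two)
```

For a pandas column of predictions, the same function could be applied with `Series.map(risk_category)`.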
 
**Actionable Outputs:**
- Ranked list of the ten to twenty highest-risk nations that need intervention
- Importances of the top three to five features (which factors matter most)
- Confidence intervals for every forecast
- Regional trends and patterns

## Success Criteria

The project will be considered successful if:

1. **Prediction Accuracy**: The model predicts risk categories with at least 75% classification accuracy on held-out test data.

2. **Temporal Validation**: A model trained only on 1990-2020 data accurately predicts electricity access rates for 2021-2023, which demonstrates genuine forecasting ability.

3. **Interpretability**: We can identify and explain the top three to five features that have the biggest impact on forecasts of energy poverty.

4. **Actionability**: The model generates a prioritized list of ten to twenty nations that need immediate intervention, each with an explanation grounded in its predicted risk factors.
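
The first two criteria can be checked with two small metrics, sketched below. Criterion 1 names classification accuracy and a 75% threshold. Criterion 2 does not name a metric, so mean absolute error (in percentage points) is used here as an assumption. The labels and numbers are placeholders, not model results.

```python
def classification_accuracy(y_true, y_pred) -> float:
    """Share of predicted risk categories that match the true ones."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(actual, predicted) -> float:
    """Average absolute gap between observed and predicted access rates."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Placeholder values for illustration only
true_labels = ["Severe", "Moderate", "Minimal", "Minimal"]
pred_labels = ["Severe", "Moderate", "Moderate", "Minimal"]
accuracy = classification_accuracy(true_labels, pred_labels)
meets_target = accuracy >= 0.75  # threshold from criterion 1

observed = [48.0, 75.0, 92.0]   # e.g. observed 2021-2023 access rates
forecast = [50.0, 73.0, 95.0]   # matching model forecasts
mae = mean_absolute_error(observed, forecast)
print(accuracy, meets_target, round(mae, 2))
```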


In [40]:
%pip install pandas
import pandas as pd

# Load raw datasets (World Bank CSV exports; skiprows=4 skips the metadata lines above the header row)
electricity_access_df = pd.read_csv("data/raw/1.AccessToElectricityAPI_EG.ELC.ACCS.ZS_DS2_en_csv_v2_63/API_EG.ELC.ACCS.ZS_DS2_en_csv_v2_63.csv", skiprows=4)
gdp_df = pd.read_csv("data/raw/2.GDPAPI_NY.GDP.PCAP.CD_DS2_en_csv_v2_31/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_31.csv", skiprows=4)
urban_population_df = pd.read_csv("data/raw/3.UrbanAPI_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_608/API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_608.csv", skiprows=4)
rural_population_df = pd.read_csv("data/raw/4.RuralAPI_SP.RUR.TOTL.ZS_DS2_en_csv_v2_2502/API_SP.RUR.TOTL.ZS_DS2_en_csv_v2_2502.csv", skiprows=4)
population_density_df = pd.read_csv("data/raw/5.PopulationDensityAPI_EN.POP.DNST_DS2_en_csv_v2_275/API_EN.POP.DNST_DS2_en_csv_v2_275.csv", skiprows=4)
total_population_df = pd.read_csv("data/raw/6.TotalPopulationAPI_SP.POP.TOTL_DS2_en_csv_v2_7/API_SP.POP.TOTL_DS2_en_csv_v2_7.csv", skiprows=4)
renewable_energy_df = pd.read_csv("data/raw/7.RenewableEnergyAPI_EG.FEC.RNEW.ZS_DS2_en_csv_v2_1409/API_EG.FEC.RNEW.ZS_DS2_en_csv_v2_1409.csv", skiprows=4)
government_effectiveness_df = pd.read_csv("data/raw/8.GovernmentEffectivenessAPI_GE.EST_DS2_en_csv_v2_2683/API_GE.EST_DS2_en_csv_v2_2683.csv", skiprows=4)

# Display the first few rows of each dataset to verify successful loading
print("Electricity Access Dataset:")
print(electricity_access_df.head())
print("\nGDP Dataset:")
print(gdp_df.head())
print("\nUrban Population Dataset:")
print(urban_population_df.head())
print("\nRural Population Dataset:")
print(rural_population_df.head())
print("\nPopulation Density Dataset:")
print(population_density_df.head())
print("\nTotal Population Dataset:")
print(total_population_df.head())
print("\nRenewable Energy Consumption Dataset:")
print(renewable_energy_df.head())
print("\nGovernment Effectiveness Dataset:")
print(government_effectiveness_df.head())


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.3[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
Electricity Access Dataset:
                  Country Name Country Code  \
0                        Aruba          ABW   
1  Africa Eastern and Southern          AFE   
2                  Afghanistan          AFG   
3   Africa Western and Central          AFW   
4                       Angola          AGO   

                            Indicator Name  Indicator Code  1960  1961  1962  \
0  Access to electricity (% of population)  EG.ELC.ACCS.ZS   NaN   NaN   NaN   
1  Access to electricity (% of population)  EG.ELC.ACCS.ZS   NaN   NaN   NaN   
2  Access to electricity (% of population)  EG.ELC.ACCS.ZS   NaN   NaN   NaN   
3  Access to electricity (% of population)  EG.ELC.
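
The output above shows that each file is in wide format, with one column per year. Modeling by country and year is usually easier with one row per country-year, so a reshape step is likely needed. The sketch below is one way to do it with `DataFrame.melt`; the helper name `to_long`, the value column name, and the tiny stand-in CSV are assumptions, not part of the original notebook. Selecting only the year columns also drops the indicator columns and any trailing unnamed column.

```python
import io
import pandas as pd

def to_long(wide: pd.DataFrame, value_name: str,
            start_year: int = 1990, end_year: int = 2023) -> pd.DataFrame:
    """Melt a wide World Bank table into one row per country-year."""
    id_cols = ["Country Name", "Country Code"]
    # Keep only columns whose name is a year inside the requested window
    year_cols = [c for c in wide.columns
                 if str(c).isdigit() and start_year <= int(c) <= end_year]
    long_df = wide.melt(id_vars=id_cols, value_vars=year_cols,
                        var_name="Year", value_name=value_name)
    long_df["Year"] = long_df["Year"].astype(int)
    return long_df.sort_values(["Country Code", "Year"]).reset_index(drop=True)

# Tiny stand-in that mimics the file layout; the values are placeholders
sample_csv = io.StringIO(
    "Country Name,Country Code,Indicator Name,Indicator Code,1989,1990,1991,\n"
    "Examplia,EXA,Access to electricity (% of population),EG.ELC.ACCS.ZS,,40.0,42.5,\n"
)
sample_wide = pd.read_csv(sample_csv)
sample_long = to_long(sample_wide, value_name="access_pct")
print(sample_long)
```

With the real data, the same call would be `to_long(electricity_access_df, "access_pct")`, and the per-indicator long tables could then be merged on `Country Code` and `Year`.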

In [41]:
# Check dataframe structure, column types, and missing values
print("Electricity Access Dataset:")
print(electricity_access_df.info())
print("\nGDP Dataset:")
print(gdp_df.info())
print("\nUrban Population Dataset:")
print(urban_population_df.info())
print("\nRural Population Dataset:")
print(rural_population_df.info())
print("\nPopulation Density Dataset:")
print(population_density_df.info())
print("\nTotal Population Dataset:")
print(total_population_df.info())
print("\nRenewable Energy Consumption Dataset:")
print(renewable_energy_df.info())
print("\nGovernment Effectiveness Dataset:")
print(government_effectiveness_df.info())

Electricity Access Dataset:
<class 'pandas.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 70 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Name    266 non-null    str    
 1   Country Code    266 non-null    str    
 2   Indicator Name  266 non-null    str    
 3   Indicator Code  266 non-null    str    
 4   1960            0 non-null      float64
 5   1961            0 non-null      float64
 6   1962            0 non-null      float64
 7   1963            0 non-null      float64
 8   1964            0 non-null      float64
 9   1965            0 non-null      float64
 10  1966            0 non-null      float64
 11  1967            0 non-null      float64
 12  1968            0 non-null      float64
 13  1969            0 non-null      float64
 14  1970            0 non-null      float64
 15  1971            0 non-null      float64
 16  1972            0 non-null      float64
 17  1973            0 

In [42]:
# Quick overview of distributions and ranges of numerical data
print("Electricity Access Dataset:")
print(electricity_access_df.describe())
print("\nGDP Dataset:")
print(gdp_df.describe())
print("\nUrban Population Dataset:")
print(urban_population_df.describe())
print("\nRural Population Dataset:")
print(rural_population_df.describe())
print("\nPopulation Density Dataset:")
print(population_density_df.describe())
print("\nTotal Population Dataset:")
print(total_population_df.describe())
print("\nRenewable Energy Consumption Dataset:")
print(renewable_energy_df.describe())
print("\nGovernment Effectiveness Dataset:")
print(government_effectiveness_df.describe())

Electricity Access Dataset:
       1960  1961  1962  1963  1964  1965  1966  1967  1968  1969  ...  \
count   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
mean    NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
std     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
min     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
25%     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
50%     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
75%     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
max     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   

             2016        2017        2018        2019        2020        2021  \
count  263.000000  263.000000  263.000000  263.000000  263.000000  263.000000   
mean    83.819459   84.566219   85.202457   85.759857   86.293403   86.855188   
std     25.496742   24.937006   24.083577   23.831564   23.429

In [43]:
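# Inspect missing data and the number of unique country codes
# (the files also include regional aggregates such as AFE and AFW, so this count exceeds the number of countries)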
print("Electricity Access Dataset:")
print(electricity_access_df.isnull().sum())
print(electricity_access_df['Country Code'].nunique())
print("\nGDP Dataset:")
print(gdp_df.isnull().sum())
print(gdp_df['Country Code'].nunique())
print("\nUrban Population Dataset:")
print(urban_population_df.isnull().sum())
print(urban_population_df['Country Code'].nunique())
print("\nRural Population Dataset:")
print(rural_population_df.isnull().sum())
print(rural_population_df['Country Code'].nunique())
print("\nPopulation Density Dataset:")
print(population_density_df.isnull().sum())
print(population_density_df['Country Code'].nunique())
print("\nTotal Population Dataset:")
print(total_population_df.isnull().sum())
print(total_population_df['Country Code'].nunique())
print("\nRenewable Energy Consumption Dataset:")
print(renewable_energy_df.isnull().sum())
print(renewable_energy_df['Country Code'].nunique())
print("\nGovernment Effectiveness Dataset:")
print(government_effectiveness_df.isnull().sum())
print(government_effectiveness_df['Country Code'].nunique())

Electricity Access Dataset:
Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
1960              266
                 ... 
2021                3
2022                3
2023                3
2024              266
Unnamed: 69       266
Length: 70, dtype: int64
266

GDP Dataset:
Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
1960              115
                 ... 
2021                8
2022                9
2023               15
2024               26
Unnamed: 69       266
Length: 70, dtype: int64
266

Urban Population Dataset:
Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
1960                1
                 ... 
2021                1
2022                1
2023                1
2024                1
Unnamed: 69       266
Length: 70, dtype: int64
266

Rural Population Dataset:
Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
196