# 01 - Data Cleaning: Inflation Risk Analysis
**Author:** Namora Fernando  
**Date:** 2025-08-17 <br>
**Objective:** Clean and merge four World Bank datasets:
1. CPI (% change, annual)
2. GDP Growth (%)
3. Money Supply (% of GDP)
4. Exchange Rate (LCU per USD)

## 1. Introduction
The goal of this notebook is to prepare a single clean dataset for inflation risk analysis across countries.  
We will:
- Load all raw datasets.
- Standardize the structure.
- Handle missing values.
- Merge into a wide table.
- Save as `cleaned_merged_inflation_data.csv`.

## 2. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os

## 3. Define File Paths

In [2]:
# Raw data directory
RAW_DIR = "raw_data/"

# Output directory
OUTPUT_DIR = "data_intermediate/"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# File paths
cpi_file = os.path.join(RAW_DIR, "API_FP.CPI.TOTL.ZG.csv")
gdp_file = os.path.join(RAW_DIR, "API_NY.GDP.MKTP.KD.ZG.csv")
money_file = os.path.join(RAW_DIR, "API_FM.LBL.BMNY.GD.ZS.csv")
exchange_file = os.path.join(RAW_DIR, "API_PA.NUS.FCRF.csv")
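
Before loading, it can help to confirm that every expected raw file is actually present, so a missing download fails early with a clear message. The sketch below is an optional addition, not part of the original notebook; `find_missing_files` is a hypothetical helper name, and the demo path is deliberately one that does not exist.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Demo with a path that is assumed not to exist on disk.
missing = find_missing_files(["raw_data/__no_such_file__.csv"])
print("Missing files:", missing)
```

In the notebook, the same helper could be called with `[cpi_file, gdp_file, money_file, exchange_file]` and followed by a `FileNotFoundError` if the returned list is non-empty.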

## 4. Load and Inspect Data
First, let's print the first few raw lines of the CPI file to see how it is laid out:

In [3]:
# Inspect the raw CPI CSV without skiprows, to see the metadata lines.
# "utf-8-sig" strips the byte-order mark (BOM) at the start of the file.
with open(cpi_file, "r", encoding="utf-8-sig") as f:
    for _ in range(6):   # print first 6 lines
        print(f.readline().strip())

"Data Source","World Development Indicators",

"Last Updated Date","2025-07-01",

"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020","2021","2022","2023","2024",
"Aruba","ABW","Inflation, consumer prices (annual %)","FP.CPI.TOTL.ZG","","","","","","","","","","","","","","","","","","","","","","","","","","4.03225805628628","1.07396640826829","3.64304545817706","3.12186849610723","3.99162804604575","5.83668775158166","5.55555555555579","3.8733753699648","5.2155599603571","6.3110797127044","3.36139107320899","3.22528797213979","2.99994809778412","1.8694

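> Notice that the first 4 lines hold metadata (and blank separator lines) rather than actual data; the column header starts on line 5. This is the standard layout of World Bank CSV downloads, which is why the loader below uses `skiprows=4`.
> The same structure applies to the other 3 datasets (not shown here for brevity).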
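
To illustrate the layout described above, here is a minimal sketch that parses a tiny in-memory stand-in for a World Bank file. The country names and values are made up. It also shows how the trailing comma on each line produces an extra `Unnamed: N` column, and one possible way to drop it (the notebook itself does not drop this column at load time).

```python
import io

import pandas as pd

# Made-up miniature of the World Bank CSV layout: two metadata lines,
# each followed by a blank line, then the header and the data rows.
raw = (
    '"Data Source","World Development Indicators",\n'
    "\n"
    '"Last Updated Date","2025-07-01",\n'
    "\n"
    '"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961",\n'
    '"Country A","AAA","Some indicator","X.Y.Z","1.5","2.5",\n'
    '"Country B","BBB","Some indicator","X.Y.Z","","3.0",\n'
)

# Skip the 4 leading lines so that the header row is parsed as column names.
df = pd.read_csv(io.StringIO(raw), skiprows=4)
raw_columns = df.columns.tolist()
print(raw_columns)  # the last entry is the artifact of the trailing comma

# Optional: drop every column whose name starts with "Unnamed".
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
print(df.shape)
```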

In [4]:
def load_worldbank_csv(file_path):
    """
    Load World Bank CSV, skip metadata rows, and return DataFrame.
    """
    df = pd.read_csv(file_path, skiprows=4)
    return df

cpi_df = load_worldbank_csv(cpi_file)
gdp_df = load_worldbank_csv(gdp_file)
money_df = load_worldbank_csv(money_file)
exchange_df = load_worldbank_csv(exchange_file)

# Quick re-inspection 
cpi_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,ABW,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,-0.931196,-1.028282,3.626041,4.257462,,,,,,
1,Africa Eastern and Southern,AFE,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,6.596505,6.399343,4.720805,4.644967,5.405162,7.240978,10.773751,7.126975,4.425471,
2,Afghanistan,AFG,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,4.383892,4.975952,0.626149,2.302373,5.601888,5.133203,13.712102,-4.644709,-6.601186,
3,Africa Western and Central,AFW,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,1.487416,1.725486,1.78405,1.983092,2.490378,3.7457,7.774027,5.302548,3.765558,
4,Angola,AGO,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,,,,,,...,30.694415,29.84448,19.628938,17.080954,22.271539,25.754295,21.35529,13.644102,28.240495,


## 5. Select and Rename Columns

We will:

- Drop the metadata columns `Indicator Name` and `Indicator Code`.
- Keep `Country Name` and `Country Code` as identifier columns.
- Melt the year columns from wide to long format, producing one `Year` column and one value column.
- Name the value column after the indicator of each dataset.
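
To make the wide-to-long step concrete, here is a toy illustration on a made-up two-country frame. The column name `Indicator Value` is only a placeholder for the per-dataset indicator name.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year.
wide = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country B"],
        "Country Code": ["AAA", "BBB"],
        "2000": [1.0, 2.0],
        "2001": [1.5, None],
    }
)

# Identifier columns stay as they are; every other column becomes a
# (Year, value) pair, so 2 countries x 2 years give 4 rows.
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Indicator Value",
)
print(long)
```

Note that the melted `Year` values are still strings at this point, because they come from column labels.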

In [5]:
# First, drop the metadata columns Indicator Name and Indicator Code
cols_to_drop = [c for c in ["Indicator Name", "Indicator Code"] if c in cpi_df.columns]
cpi_df = cpi_df.drop(columns=cols_to_drop, errors="ignore")

# Second, keep Country Name and Country Code as identifiers and melt the year columns
cpi_long = cpi_df.melt(
    id_vars=["Country Name", "Country Code"], 
    var_name = "Year", 
    value_name = "CPI_AnnualChange"
)

# Inspect the result of the melt, in particular the Year column
cpi_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17556 entries, 0 to 17555
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country Name      17556 non-null  object 
 1   Country Code      17556 non-null  object 
 2   Year              17556 non-null  object 
 3   CPI_AnnualChange  11260 non-null  float64
dtypes: float64(1), object(3)
memory usage: 548.8+ KB


In [6]:
cpi_long["Year"].unique()

array(['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967',
       '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975',
       '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991',
       '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999',
       '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015',
       '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023',
       '2024', 'Unnamed: 69'], dtype=object)

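The `Year` column is stored as strings (`object` dtype), and it also contains a stray `Unnamed: 69` label, which comes from the trailing comma at the end of each line in the raw CSV. We convert the column to numeric; `errors="coerce"` turns the unparseable `Unnamed: 69` label into `NaN`: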

In [7]:
cpi_long["Year"] = pd.to_numeric(cpi_long["Year"], errors = "coerce")

cpi_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17556 entries, 0 to 17555
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country Name      17556 non-null  object 
 1   Country Code      17556 non-null  object 
 2   Year              17290 non-null  float64
 3   CPI_AnnualChange  11260 non-null  float64
dtypes: float64(2), object(2)
memory usage: 548.8+ KB
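
After the conversion, `Year` is a float column because it contains `NaN` for the label that could not be parsed. As an optional follow-up (not something this notebook does), the rows without a valid year could be dropped and the column cast to integers. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up long-format frame that includes the stray "Unnamed: 69" label.
demo = pd.DataFrame(
    {
        "Country Code": ["AAA", "AAA", "AAA"],
        "Year": ["1960", "1961", "Unnamed: 69"],
        "CPI_AnnualChange": [1.0, 2.0, None],
    }
)

# Unparseable labels become NaN, which forces a float dtype.
demo["Year"] = pd.to_numeric(demo["Year"], errors="coerce")

# Drop rows without a valid year, then store the year as an integer.
demo = demo.dropna(subset=["Year"]).astype({"Year": "int64"})
print(demo)
```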


In [8]:
# Apply the same steps to the other data frames

def reshape_worldbank(df, indicator_name):
    cols_to_drop = [c for c in ["Indicator Name", "Indicator Code"] if c in df.columns]
    df = df.drop(columns=cols_to_drop, errors="ignore") # drop metadata column process
    
    df_long = df.melt(
        id_vars=["Country Name", "Country Code"], 
        var_name="Year", 
        value_name=indicator_name
    ) # keep only Country Name and Country Code, melt the rest to Year and Indicator columns
    
    df_long["Year"] = pd.to_numeric(df_long["Year"], errors="coerce") # convert numeric
    
    return df_long

gdp_long = reshape_worldbank(gdp_df, "GDP_Growth")
money_long = reshape_worldbank(money_df, "MoneySupply_GDPpct")
exchange_long = reshape_worldbank(exchange_df, "ExchangeRate_LCUperUSD")

## 6. Merge Datasets

We will merge step by step on `Country Name`, `Country Code`, and `Year`, using outer joins so that no country-year present in any dataset is lost.

In [9]:
merged_df = cpi_long.merge(gdp_long, on=["Country Name", "Country Code", "Year"], how="outer")
merged_df = merged_df.merge(money_long, on=["Country Name", "Country Code", "Year"], how="outer")
merged_df = merged_df.merge(exchange_long, on=["Country Name", "Country Code", "Year"], how="outer")

merged_df.head()

Unnamed: 0,Country Name,Country Code,Year,CPI_AnnualChange,GDP_Growth,MoneySupply_GDPpct,ExchangeRate_LCUperUSD
0,Afghanistan,AFG,1960.0,,,,17.196561
1,Afghanistan,AFG,1961.0,,,,17.196561
2,Afghanistan,AFG,1962.0,,,,17.196561
3,Afghanistan,AFG,1963.0,,,,35.109645
4,Afghanistan,AFG,1964.0,,,,38.692262
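
The three chained merges can also be written as a single fold over a list of frames. The sketch below is an alternative formulation, not the notebook's own code; `merge_all` is a hypothetical helper, the data is made up, and `validate="one_to_one"` is an added safety check that makes pandas raise a `MergeError` if a key combination appears more than once on either side.

```python
from functools import reduce

import pandas as pd

KEYS = ["Country Name", "Country Code", "Year"]


def merge_all(frames, keys=KEYS):
    """Outer-merge a list of long-format frames on the shared key columns."""
    return reduce(
        lambda left, right: left.merge(
            right, on=keys, how="outer", validate="one_to_one"
        ),
        frames,
    )


# Made-up frames: the second one only covers one of the two years.
cpi_demo = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country A"],
        "Country Code": ["AAA", "AAA"],
        "Year": [2000, 2001],
        "CPI_AnnualChange": [1.0, 2.0],
    }
)
gdp_demo = pd.DataFrame(
    {
        "Country Name": ["Country A"],
        "Country Code": ["AAA"],
        "Year": [2001],
        "GDP_Growth": [3.0],
    }
)

merged_demo = merge_all([cpi_demo, gdp_demo])
print(merged_demo)
```

With an outer join, the year that is missing from the second frame is kept and its `GDP_Growth` value is `NaN`.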


## 7. Handle Missing Values

Strategy:
- Check missing values
- Investigate rows with missing values
- Handle missing values after investigation

### 7A. Check Missing Values
Before cleaning, let's inspect missing data to understand the extent and pattern.

In [10]:
merged_df.isna().sum()

Country Name                 0
Country Code                 0
Year                       266
CPI_AnnualChange          6296
GDP_Growth                3442
MoneySupply_GDPpct        6728
ExchangeRate_LCUperUSD    5248
dtype: int64
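
Absolute counts are easier to judge next to the share of missing values per column and the number of missing indicators per row. A small sketch on made-up data (the same calls would apply to `merged_df`):

```python
import pandas as pd

# Made-up indicator table with a few gaps.
demo = pd.DataFrame(
    {
        "CPI_AnnualChange": [1.0, None, None, 4.0],
        "GDP_Growth": [None, 2.0, 3.0, 4.0],
    }
)

# Percentage of missing values in each column.
missing_share = demo.isna().mean().mul(100).round(1)
print(missing_share)

# Number of missing indicators in each row.
n_missing_per_row = demo.isna().sum(axis=1)
print(n_missing_per_row.value_counts())
```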

### 7B. Investigate Rows with Missing Values
Check a few examples to see which indicators are missing.

In [11]:
merged_df[merged_df.isna().any(axis=1)].sample(5, random_state=39)

Unnamed: 0,Country Name,Country Code,Year,CPI_AnnualChange,GDP_Growth,MoneySupply_GDPpct,ExchangeRate_LCUperUSD
12951,Portugal,PRT,1975.0,15.271686,-4.347632,,25.55275
9351,Liechtenstein,LIE,2005.0,,4.828077,,
4186,East Asia & Pacific,EAS,1988.0,7.887324,7.608775,145.65521,
12352,Pacific island small states,PSS,1970.0,,,29.625367,
9127,Lesotho,LSO,1979.0,16.003552,2.893919,,0.842023


### 7C. Handle Missing Values After Investigation
Our strategy for Step 01:
- If a country-year has **all 4 indicators missing**, drop the row → because it carries no usable information.  
- Otherwise, leave NaN as-is.  
- Imputation will be considered later in Step 03 (feature engineering stage).

In [12]:
# Drop rows where all key indicators are missing
key_cols = ["CPI_AnnualChange", "GDP_Growth", "MoneySupply_GDPpct", "ExchangeRate_LCUperUSD"]
merged_df = merged_df.dropna(subset=key_cols, how="all")

In [13]:
# Check final dataset dimensions
print("Dataset shape after cleaning:", merged_df.shape)

Dataset shape after cleaning: (15918, 7)
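
The choice of `how="all"` matters here. A toy comparison on made-up data shows that `how="all"` only removes rows with no indicator at all, whereas `how="any"` would also discard partially observed country-years:

```python
import pandas as pd

key_cols = ["CPI_AnnualChange", "GDP_Growth"]

# Made-up rows: fully observed, partially observed, and fully missing.
demo = pd.DataFrame(
    {
        "Year": [2000, 2001, 2002],
        "CPI_AnnualChange": [1.0, None, None],
        "GDP_Growth": [2.0, 3.0, None],
    }
)

kept_all = demo.dropna(subset=key_cols, how="all")  # drops only 2002
kept_any = demo.dropna(subset=key_cols, how="any")  # drops 2001 and 2002

print(len(demo), len(kept_all), len(kept_any))
```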


## 8. Save Cleaned Dataset

In [14]:
output_file = os.path.join(OUTPUT_DIR, "cleaned_merged_inflation_data.csv")
merged_df.to_csv(output_file, index=False)
print(f"Cleaned dataset saved to: {output_file}")

Cleaned dataset saved to: data_intermediate/cleaned_merged_inflation_data.csv
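
As an optional sanity check, the saved file can be read back and compared with the in-memory frame. The sketch below uses a made-up frame and a temporary directory, so it does not touch `data_intermediate/`; in the notebook the same comparison could be run on `merged_df` and `output_file`.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the cleaned dataset.
demo = pd.DataFrame(
    {
        "Country Code": ["AAA", "BBB"],
        "Year": [2000, 2001],
        "CPI_AnnualChange": [1.5, None],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")
    demo.to_csv(path, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == demo.shape
same_columns = reloaded.columns.tolist() == demo.columns.tolist()
print(same_shape, same_columns)
```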


## 9. Summary
- Loaded 4 datasets from World Bank.  
- Reshaped them into long format.  
- Merged them into a single dataset using outer joins.  
- Removed rows where all indicators are missing, leaving 15918 rows and 7 columns.  
- Left partial NaN values untouched (to be handled in Step 03 during feature engineering).  
- Saved dataset as `cleaned_merged_inflation_data.csv` for further EDA.  
