# Data Cleaning

First, we create a conda environment with pinned package versions so that the rest of the project is reproducible.

In [25]:
!conda create -y -n wb-analysis-env -c conda-forge python=3.10 ipykernel=6.29.3 numpy=1.26.4 pandas=2.2.1 matplotlib=3.8.4 seaborn=0.13.2 wbgapi=1.0.12 requests=2.31.0 sqlalchemy=2.0.29
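!conda run -n wb-analysis-env python -m ipykernel install --user --name wb-analysis-env --display-name "Python (wb-analysis-env)"
# `!conda activate` runs in a throwaway subshell and cannot switch the running kernel;
# select the "Python (wb-analysis-env)" kernel from the Jupyter kernel menu instead.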

Channels:
 - conda-forge
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/Tom/miniforge3/envs/wb-analysis-env

  added / updated specs:
    - ipykernel=6.29.3
    - matplotlib=3.8.4
    - numpy=1.26.4
    - pandas=2.2.1
    - python=3.10
    - requests=2.31.0
    - seaborn=0.13.2
    - sqlalchemy=2.0.29
    - wbgapi=1.0.12


The following NEW packages will be INSTALLED:

  _openmp_mutex      conda-forge/osx-arm64::_openmp_mutex-4.5-7_kmp_llvm 
  appnope            conda-forge/noarch::appnope-0.1.4-pyhd8ed1ab_1 
  asttokens          conda-forge/noarch::asttokens-3.0.1-pyhd8ed1ab_0 
  backports.zstd     conda-forge/osx-arm64::backports.zstd-1.1.0-py310hdc7f11d_1 
  brotli             conda-forge/osx-arm64::brotli-1.2.0-h7d5ae5b_1 
  brotli-bin         conda-forge/osx-arm64::brotli-bin-1.2.0-hc919400_1 
  brotli-python      conda-forge/osx-arm64::brotli-python-1.2.0-py310h6123dab_1 
  bzip2 
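
Because the notebook may be running on a kernel other than the new environment, it can help to confirm the pins from inside Python. The following is a minimal sketch, not part of the original notebook: `EXPECTED` and `find_version_mismatches` are hypothetical names, and the version numbers are copied from the `conda create` command above.

```python
from importlib import metadata

# Pins copied from the `conda create` command above.
EXPECTED = {
    "numpy": "1.26.4",
    "pandas": "2.2.1",
    "matplotlib": "3.8.4",
    "seaborn": "0.13.2",
    "wbgapi": "1.0.12",
    "requests": "2.31.0",
    "sqlalchemy": "2.0.29",
    "ipykernel": "6.29.3",
}


def find_version_mismatches(expected):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed for this kernel.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_version_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

If nothing is printed, the active kernel matches the pinned versions.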

### Imports and Setup
We import `wbgapi` to fetch data directly from the World Bank, `pandas` to work with the resulting table, and the standard-library `os` module for file handling.

In [26]:
import wbgapi as wb
import pandas as pd
import os

### Data Retrieval
Here we map each World Bank indicator code to a short column name and list the ISO3 codes of the countries we want to analyze. We then query the World Bank API for the years 2017 through 2023.

In [27]:
indicators = {
    'NY.GDP.PCAP.KD': 'gdp_pc',
    'SL.EMP.TOTL.SP.ZS': 'emp_ratio',
    'NY.GDP.MKTP.KD.ZG': 'gdp_growth',
    'SL.TLF.0714.ZS': 'child_labor_rate',
    'SE.PRM.UNER': 'children_out_of_school_primary'
}

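# ISO 3166-1 alpha-3 codes of the countries to analyze
countries = ['PRT','GBR','NOR','ALB','UKR','ITA']

# 2017 through 2023 inclusive (the end of range() is exclusive)
years = range(2017, 2024)

# Download one row per (economy, year) and one column per indicator code
df = wb.data.DataFrame(list(indicators), economy=countries, time=years,
                       numericTimeKeys=True, labels=True, columns='series')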

### Initial Data Inspection

In [28]:
df


economy,time,Country,Time,NY.GDP.MKTP.KD.ZG,NY.GDP.PCAP.KD,SE.PRM.UNER,SL.EMP.TOTL.SP.ZS,SL.TLF.0714.ZS
ITA,2023,Italy,2023,0.715373,34146.023226,52117.0,46.002,
ITA,2022,Italy,2022,4.821177,33891.886653,64907.0,45.078,
ITA,2021,Italy,2021,8.931062,32267.709105,55640.0,43.842,
ITA,2020,Italy,2020,-8.868221,29469.798924,39912.0,43.832,
ITA,2019,Italy,2019,0.429163,32180.434102,43971.0,44.651,
ITA,2018,Italy,2018,0.826647,31819.397183,43832.0,44.338,
ITA,2017,Italy,2017,1.6037,31635.522771,46243.0,43.952,
UKR,2023,Ukraine,2023,5.534734,2164.26123,,,
UKR,2022,Ukraine,2022,-28.758584,1874.662964,,,
UKR,2021,Ukraine,2021,3.445621,2426.612305,299855.0,49.266,
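
The output above already shows gaps, for example Ukraine in 2022 and 2023. One way to summarize them is to count missing cells per country and indicator. This is a sketch on a toy frame built from a few of the rows shown above; `missing_by_country` is a hypothetical helper, not part of `wbgapi` or `pandas`.

```python
import numpy as np
import pandas as pd


def missing_by_country(frame, value_cols, country_col="Country"):
    """Count NaN cells per country for each indicator column."""
    return frame[value_cols].isna().groupby(frame[country_col]).sum()


# A few rows copied from the output above (two indicators only).
toy = pd.DataFrame(
    {
        "Country": ["Italy", "Italy", "Ukraine", "Ukraine"],
        "Time": [2022, 2023, 2022, 2023],
        "SE.PRM.UNER": [64907.0, 52117.0, np.nan, np.nan],
        "SL.TLF.0714.ZS": [np.nan, np.nan, np.nan, np.nan],
    }
)

report = missing_by_country(toy, ["SE.PRM.UNER", "SL.TLF.0714.ZS"])
print(report)
```

On the full frame, the same call with all five indicator codes would show which country and indicator combinations need attention before analysis.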


### Sorting and Index Handling
We sort the values by Country and Time. We also reset the index to ensure 'economy' and 'time' become accessible columns rather than index levels.

In [29]:
df = df.sort_values(['Country', 'Time'])

if 'economy' not in df.columns:
    df = df.reset_index()
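
To see what this step does to the column layout, the sketch below repeats it on a toy frame. The toy frame is an assumption that mimics the two-level `(economy, time)` index and the `Country` and `Time` label columns shown in the inspection output.

```python
import pandas as pd

# Toy frame with the same layout as the wbgapi result: a two-level
# (economy, time) index plus the label columns 'Country' and 'Time'.
index = pd.MultiIndex.from_tuples(
    [("UKR", 2021), ("ITA", 2018), ("ITA", 2017)], names=["economy", "time"]
)
toy = pd.DataFrame(
    {"Country": ["Ukraine", "Italy", "Italy"], "Time": [2021, 2018, 2017]},
    index=index,
)

toy = toy.sort_values(["Country", "Time"])
if "economy" not in toy.columns:
    toy = toy.reset_index()

print(toy.columns.tolist())
# ['economy', 'time', 'Country', 'Time']
```

Note that the reset leaves two year columns, lowercase `time` (from the index) and `Time` (the label column), which carry the same values.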

### Column Renaming
We apply the `indicators` dictionary to rename the World Bank codes to human-readable names. We also standardize the identifier columns.

In [30]:
# Rename indicator columns
df = df.rename(columns=indicators)

# Rename structural columns for clarity
df = df.rename(columns={'economy': 'iso_code', 'Time': 'year', 'Country': 'country_name'})

# Ensure year is an integer (removing 'YR' prefix if present)
df['year'] = df['year'].astype(str).str.replace('YR', '').astype(int)
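
The renaming above only touches `Time`, so the lowercase `time` column produced by `reset_index()` stays in the frame alongside `year`. The sketch below is one possible way to remove it, on a toy frame and only after checking that it really duplicates `year`. `normalize_year` is a hypothetical helper wrapping the same conversion as the cell above.

```python
import pandas as pd


def normalize_year(values):
    """Convert 2017, '2017', or 'YR2017' style keys to integers."""
    return values.astype(str).str.replace("YR", "", regex=False).astype(int)


toy = pd.DataFrame(
    {
        "iso_code": ["ITA", "ITA"],
        "time": [2017, 2018],          # leftover index level
        "year": ["YR2017", "YR2018"],  # label column after renaming
    }
)
toy["year"] = normalize_year(toy["year"])

# Drop `time` only once it is confirmed to duplicate `year`.
if "time" in toy.columns and normalize_year(toy["time"]).equals(toy["year"]):
    toy = toy.drop(columns="time")

print(toy.columns.tolist())
# ['iso_code', 'year']
```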

### Handling Missing Data
We check for columns that are entirely empty. In this dataset, `child_labor_rate` has no values for any of the selected countries in 2017-2023, so it is dropped. Columns with only some missing values are kept.

In [31]:
# Drop columns that are 100% empty (NaN)
df = df.dropna(axis=1, how='all')
print("Remaining columns:", df.columns.tolist())

Remaining columns: ['iso_code', 'time', 'country_name', 'year', 'gdp_growth', 'gdp_pc', 'children_out_of_school_primary', 'emp_ratio']
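
`dropna(axis=1, how='all')` removes columns silently, so it can be useful to list what will be dropped first. A minimal sketch on a toy frame (the column names follow the renamed indicators; the single `emp_ratio` value is taken from the Italy 2023 row above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "iso_code": ["ITA", "UKR"],
        "emp_ratio": [46.002, np.nan],         # partly missing: kept
        "child_labor_rate": [np.nan, np.nan],  # entirely missing: dropped
    }
)

# Names of columns where every value is NaN.
empty_cols = toy.columns[toy.isna().all()].tolist()
print("Dropping:", empty_cols)

cleaned = toy.dropna(axis=1, how="all")
print("Remaining columns:", cleaned.columns.tolist())
```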


### Saving Data
Finally, we reorder the columns so that the identifiers come first and save the cleaned dataset as a CSV file in the `../data/` directory.

In [33]:
# Reorder columns for readability
cols = ['iso_code', 'country_name', 'year'] + [c for c in df.columns if c not in ['iso_code', 'country_name', 'year']]
df = df[cols]

# Create output directory
output_dir = '../data'
os.makedirs(output_dir, exist_ok=True)

# Save to CSV
output_path = os.path.join(output_dir, 'cleaned_wb_data.csv')
df.to_csv(output_path, index=False)

print(f"Data cleaned and saved to: {output_path}")
df.head()

Data cleaned and saved to: ../data/cleaned_wb_data.csv


,iso_code,country_name,year,time,gdp_growth,gdp_pc,children_out_of_school_primary,emp_ratio
0,ALB,Albania,2017,2017,3.283176,4283.982627,1139.424549,50.152
1,ALB,Albania,2018,2018,3.671419,4452.237147,2234.391432,52.0
2,ALB,Albania,2019,2019,2.062578,4563.467363,3575.110586,53.391
3,ALB,Albania,2020,2020,-3.313756,4437.653469,10359.0,51.026
4,ALB,Albania,2021,2021,8.969576,4880.723462,10881.0,52.115
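
As an optional sanity check, the saved file can be read back and compared with the in-memory frame. The sketch below does this on a toy frame in a temporary directory, so it does not touch `../data/`. The toy values are taken from the first two Albania rows above.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {
        "iso_code": ["ALB", "ALB"],
        "country_name": ["Albania", "Albania"],
        "year": [2017, 2018],
        "emp_ratio": [50.152, 52.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_wb_data.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if values, dtypes, or column order differ.
pd.testing.assert_frame_equal(toy, reloaded)
```

Writing with `index=False`, as the cell above does, is what keeps the reloaded frame free of an extra unnamed index column.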
