# **Economic Development vs. Sustainability**
# Data Preprocessing - GDP Growth
Katlyn Goeujon-Mackness <br>
05/05/2025

## Introduction
Economic growth is often pursued at the cost of environmental sustainability. This study aims to analyze the balance between economic development and sustainable practices across different regions, industries, and policies.

In this phase of the analysis, we locate, collect, and process the raw data, applying transformations where needed. Sections are organized by key indicator. Finally, we export the processed data in CSV format for analysis.

### Key Challenge
Achieving sustainable economic growth requires balancing financial prosperity with environmental and social responsibility. Identifying actionable patterns in historical data can inform policymakers, businesses, and environmental advocates.

### Data of Interest
- GDP growth rate compared to carbon emissions per capita.
- Percentage of renewable energy adoption.
- Employment trends in green industries.
- Improvement in environmental quality indicators (air quality, water safety).
- Sustainability index scores vs. economic performance.

### Locating Relevant Data
- **World Bank**: Economic and environmental indicators.
    * [GDP per capita growth (annual %)](https://data.worldbank.org/indicator/NY.GDP.PCAP.KD.ZG)
    * [GDP per capita (constant 2015 US$)](https://data.worldbank.org/indicator/NY.GDP.PCAP.KD)
- **United Nations SDGs Database**: Sustainable development statistics.
- **OECD**: Policy effectiveness on sustainability.
- **NASA Earth Observations**: Environmental impact metrics.
- **National Employment Data**: Job growth in sustainable sectors.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Prevent truncating columns and rows
pd.set_option("display.max_rows", None) 
pd.set_option("display.max_columns", None) 

---
### Data Discovery
The raw data from the World Bank Group cover two indicators, GDP per capita growth (annual %) and GDP per capita (constant 2015 US$), together with metadata files that aid the analysis.

#### Load the Datasets

In [2]:
# GDP per capita % growth
gdp_percent = pd.read_csv("data/raw/gdp/GDP-per-capita-percent-growth_raw.csv", encoding="latin1")
gdp_percent_meta = pd.read_csv("data/raw/gdp/Metadata-GDP-per-capita-percent-growth_raw.csv", encoding="latin1")
gdp_percent_meta_ind = pd.read_csv("data/raw/gdp/Metadata_Indicator_GDP-per-capita-percent-growth_raw.csv", encoding='latin1')

# GDP per capita constant USD 2015
gdp_const = pd.read_csv("data/raw/gdp/gdp_per_capita_constant_USD_raw.csv", encoding="latin1")
gdp_const_meta = pd.read_csv("data/raw/gdp/Metadata_gdp_per_capita_constant_USD_raw.csv", encoding="latin1")
gdp_const_meta_ind = pd.read_csv("data/raw/gdp/Metadata_Indicator_gdp_per_capita_constant_USD_raw.csv", encoding='latin1')

#### Metadata Indicators

In [3]:
# Metadata indicator provides information about the dataset
print(gdp_percent_meta_ind)

  ï»¿"INDICATOR_CODE"                    INDICATOR_NAME  \
0   NY.GDP.PCAP.KD.ZG  GDP per capita growth (annual %)   

                                         SOURCE_NOTE  \
0  Annual percentage growth rate of GDP per capit...   

                                 SOURCE_ORGANIZATION  Unnamed: 4  
0  World Bank national accounts data, and OECD Na...         NaN  


In [4]:
# Metadata indicator provides information about the dataset
print(gdp_const_meta_ind)

  ï»¿"INDICATOR_CODE"                      INDICATOR_NAME  \
0      NY.GDP.PCAP.KD  GDP per capita (constant 2015 US$)   

                                         SOURCE_NOTE  \
0  GDP per capita is gross domestic product divid...   

                                 SOURCE_ORGANIZATION  Unnamed: 4  
0  World Bank national accounts data, and OECD Na...         NaN  
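
The `ï»¿` prefix on the first header is what a UTF-8 byte-order mark looks like when a file is decoded as Latin-1. The following is a minimal sketch, assuming the World Bank files are UTF-8 with a BOM; the inline CSV is a hypothetical stand-in, not the real data. Reading with `encoding="utf-8-sig"` strips the mark at load time.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for a CSV that starts with a UTF-8 BOM
raw_bytes = (
    '\ufeff"Country Code","Region"\n'
    '"ABW","Latin America & Caribbean"\n'
).encode("utf-8")

# Decoding as Latin-1 keeps the BOM bytes as visible characters in the header
as_latin1 = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")

# "utf-8-sig" consumes the BOM, so the first header comes out clean
as_utf8_sig = pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8-sig")

print(list(as_latin1.columns))
print(list(as_utf8_sig.columns))
```

If the assumption holds for the real files, the column rename in the Restructuring section becomes a safety net rather than a required step.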


---
#### GDP (annual %)

In [5]:
gdp_percent.head(3)

Unnamed: 0,"ï»¿""Country Name""",Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,ABW,GDP per capita growth (annual %),NY.GDP.PCAP.KD.ZG,,,,,,,,,,,,,,,,,,,,,,,,,,,,17.593206,18.304687,10.066932,0.13448,2.813435,1.111856,0.492195,2.751524,-0.292643,-2.733864,2.978397,-0.48716,-0.125966,6.519224,3.212406,-1.628099,-0.033839,5.026912,-2.930823,-0.673258,2.322678,1.061772,-12.274936,-2.956953,2.610525,-2.484648,4.855279,-2.629615,-1.635753,0.951538,7.040657,2.234428,-2.496549,-25.79323,25.154964,8.912308,4.216132,,
1,Africa Eastern and Southern,AFE,GDP per capita growth (annual %),NY.GDP.PCAP.KD.ZG,,-2.13663,5.009835,2.794289,1.830475,2.245841,1.917904,2.418062,1.18621,2.097636,-1.747625,2.477136,-0.110166,1.595887,2.383976,-1.515149,-0.561033,-1.699047,-1.480369,-0.240458,2.346863,0.84449,-2.858611,-3.045377,0.277329,-3.091139,-0.744192,0.885399,1.287891,-0.166717,-2.655658,-2.808265,-4.910566,-3.314233,-0.745923,1.61081,2.699137,1.214009,-0.850095,-0.007474,0.570581,0.860717,1.163242,0.294752,2.803787,3.378235,3.770141,3.814958,1.563935,-1.852648,2.393038,1.308359,-0.957114,1.512944,1.218467,0.264191,-0.468619,0.007074,-0.102513,-0.549575,-5.45141,1.842087,0.903488,-0.226928,,
2,Afghanistan,AFG,GDP per capita growth (annual %),NY.GDP.PCAP.KD.ZG,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-10.119484,22.02019,2.345672,-2.148212,7.383377,1.132485,11.692303,1.677279,17.043896,11.055031,-3.213295,8.279369,2.052068,-0.939985,-1.665057,-0.300121,-0.19557,-1.713743,0.856295,-5.382515,-22.584482,-7.576669,0.540656,,


In [6]:
gdp_percent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 70 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ï»¿"Country Name"  266 non-null    object 
 1   Country Code       266 non-null    object 
 2   Indicator Name     266 non-null    object 
 3   Indicator Code     266 non-null    object 
 4   1960               0 non-null      float64
 5   1961               145 non-null    float64
 6   1962               152 non-null    float64
 7   1963               152 non-null    float64
 8   1964               152 non-null    float64
 9   1965               152 non-null    float64
 10  1966               155 non-null    float64
 11  1967               159 non-null    float64
 12  1968               161 non-null    float64
 13  1969               161 non-null    float64
 14  1970               161 non-null    float64
 15  1971               185 non-null    float64
 16  1972               185 non

In [7]:
gdp_percent.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
count,0.0,145.0,152.0,152.0,152.0,152.0,155.0,159.0,161.0,161.0,161.0,185.0,185.0,185.0,185.0,187.0,190.0,191.0,196.0,196.0,197.0,206.0,208.0,210.0,210.0,213.0,214.0,216.0,220.0,221.0,223.0,242.0,242.0,243.0,243.0,244.0,245.0,245.0,247.0,247.0,247.0,248.0,248.0,252.0,252.0,252.0,252.0,253.0,253.0,256.0,257.0,257.0,257.0,257.0,258.0,258.0,257.0,257.0,258.0,257.0,257.0,257.0,254.0,243.0,0.0,0.0
mean,,1.551609,2.982683,2.762094,3.97995,3.57312,2.421155,1.777137,3.927512,4.517874,4.316691,2.932702,3.098565,3.293566,3.08158,-0.013086,3.903495,2.145544,2.254961,2.028502,0.668481,0.219847,-1.010194,-0.679292,1.190731,1.018527,1.317372,1.703381,2.474171,0.97366,1.138905,-1.070382,-0.635606,-0.142672,0.421589,2.292233,2.878801,3.73535,1.981517,1.718242,3.164254,1.727875,1.919283,2.419399,4.357371,3.588082,4.198835,3.970836,2.21756,-1.344087,3.061513,2.378194,1.78168,1.704722,1.914273,1.372379,1.832635,2.087969,1.884746,1.760215,-5.723905,4.818938,3.254277,2.123455,,
std,,5.57907,4.778384,5.428235,5.269092,5.565084,4.284444,6.757169,7.300174,4.525287,7.432219,4.383079,5.556503,7.383348,6.108751,5.585016,6.169807,5.011204,5.435626,6.164588,6.111777,5.837776,4.987514,5.02532,4.760954,4.633277,4.894383,5.023585,5.52647,5.721857,7.060999,7.820009,9.346929,6.866627,7.363371,5.828973,6.071763,11.406349,4.803281,4.608824,6.227211,4.934002,4.491697,5.178328,5.137998,3.928841,4.026833,4.27262,4.049335,4.995444,3.958534,5.114071,7.643561,4.394242,3.247474,4.38871,3.665226,3.991325,2.897729,3.322241,7.631557,5.708325,6.060594,6.193483,,
min,,-26.527644,-20.854896,-14.163578,-14.458423,-15.018576,-10.519981,-17.52377,-8.580169,-9.300131,-48.243296,-13.31412,-15.612517,-19.237863,-18.599207,-17.555827,-26.374675,-14.64238,-25.892434,-28.435258,-24.388705,-24.756413,-21.074529,-17.441688,-19.243349,-14.894531,-20.143467,-19.412935,-15.252919,-43.604223,-44.552252,-64.423582,-45.324884,-35.546284,-41.540612,-14.230487,-18.31943,-13.590922,-23.877852,-27.113647,-16.217142,-11.716099,-14.988201,-38.538218,-6.824979,-12.676373,-6.893259,-22.076582,-18.654258,-15.899715,-13.015172,-49.127857,-48.428726,-36.824547,-24.511054,-30.150755,-12.09711,-9.306452,-18.05056,-12.498917,-55.228911,-22.584482,-22.745681,-21.164316,,
25%,,-0.417608,1.163929,0.330003,1.924268,0.89927,-0.265863,-0.763237,1.18621,1.894998,1.258591,0.599649,-0.328982,0.881619,0.636585,-2.571406,0.631037,-0.054005,-0.37927,-0.252211,-2.299758,-2.287274,-3.528409,-3.601858,-1.435592,-1.256779,-0.928378,-1.119946,0.003492,-0.571288,-2.049457,-2.843855,-3.089044,-2.242711,-1.503778,0.354981,0.883746,1.214009,0.365098,-0.406102,1.12397,-0.016422,0.17168,0.625484,2.099233,1.705244,2.117991,1.854643,0.1747,-4.517814,0.95227,0.800573,-0.353869,0.226083,0.558534,0.00806,0.277408,0.46232,0.503853,0.071604,-8.075626,1.771729,1.349278,0.29487,,
50%,,2.198412,2.733083,3.079408,4.33008,3.386048,2.461462,2.048103,3.653098,4.358969,3.807399,2.728298,3.247397,3.788932,2.941323,-0.183188,3.784248,2.449207,2.692403,2.491293,1.13499,0.732017,-0.685287,-0.250305,1.620469,1.621519,1.77121,1.389041,2.573625,1.339109,1.260802,-0.384955,0.127308,0.459093,1.768206,2.233087,2.572464,2.794926,2.059203,1.882233,3.089054,1.630748,1.624427,2.586906,3.735167,3.20013,3.734873,3.761896,2.233752,-1.273541,3.019247,2.435003,1.526211,1.754549,1.89961,1.585844,1.895257,2.111455,2.006731,1.641704,-4.66529,4.414695,2.808548,1.708805,,
75%,,4.466541,4.806262,4.742374,5.727695,5.1467,4.892294,4.557754,5.210488,6.406064,6.386637,4.424564,5.158774,5.530668,4.579107,2.63912,5.741338,4.775645,5.307611,4.952525,4.258175,3.504438,1.444202,2.52152,3.756309,3.052309,3.379565,4.113804,4.726862,3.517981,3.498611,2.298572,3.45435,3.103502,3.740303,3.979864,4.496944,4.425325,3.899009,3.686224,4.42279,3.120589,3.842652,4.704912,5.962149,5.436045,6.271037,6.289637,4.42259,1.922682,5.028538,4.776793,3.89322,3.712311,3.690974,3.442575,3.488788,3.93498,3.8152,3.228631,-2.460782,7.06063,4.882776,3.732239,,
max,,22.113903,27.110688,32.280917,39.840747,42.016075,15.702949,62.285555,77.41001,22.461098,46.366027,23.694967,24.65025,55.823529,44.365224,19.340903,32.536867,20.201679,20.442132,23.669347,20.883835,19.147281,20.966438,13.144281,23.46106,20.570991,20.034433,25.178805,32.321813,21.194246,55.872942,46.443818,51.103509,31.01041,17.553636,61.873535,60.341092,140.490578,30.825438,20.816108,77.089569,53.235688,22.02019,14.619701,49.073853,26.660095,33.030488,23.590685,20.513208,17.043896,24.656802,18.828534,91.78137,18.140048,13.984753,23.443691,31.097891,30.395061,8.384961,21.827247,43.512346,33.768559,62.111024,74.674529,,


In [8]:
# Metadata table provides additional categorical information that could be useful in the current analysis
gdp_percent_meta.head()

Unnamed: 0,"ï»¿""Country Code""",Region,IncomeGroup,SpecialNotes,TableName,Unnamed: 5
0,ABW,Latin America & Caribbean,High income,,Aruba,
1,AFE,,,"26 countries, stretching from the Red Sea in t...",Africa Eastern and Southern,
2,AFG,South Asia,Low income,The reporting period for national accounts dat...,Afghanistan,
3,AFW,,,"22 countries, stretching from the westernmost ...",Africa Western and Central,
4,AGO,Sub-Saharan Africa,Lower middle income,The World Bank systematically assesses the app...,Angola,


In [9]:
gdp_percent_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ï»¿"Country Code"  265 non-null    object 
 1   Region             217 non-null    object 
 2   IncomeGroup        216 non-null    object 
 3   SpecialNotes       127 non-null    object 
 4   TableName          265 non-null    object 
 5   Unnamed: 5         0 non-null      float64
dtypes: float64(1), object(5)
memory usage: 12.6+ KB


In [10]:
gdp_percent.shape

(266, 70)

#### Comments
- GDP main dataset:
    * The dataset needs to be melted from wide to long format, with a single column for year.
    * Column names need cleaning: the first header carries stray encoding characters, and the table ends with an empty "Unnamed: 69" column.
    * Coverage varies by year: 1960 and 2024 contain no values, and the early years have far fewer observations than later ones. Empty years can be dropped, and some gaps can be filled by interpolation.
- GDP metadata:
    * TableName can be used to filter regional groups out of the list of countries.
    * Region and IncomeGroup can be added to the main dataset for future analysis.
    * SpecialNotes is not needed for this analysis and can be dropped.
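
As an illustration of the interpolation idea in the comments above, here is a minimal sketch on toy data. The helper name `interpolate_within_country`, the `max_gap` limit, and the toy values are assumptions made for the example, not part of this notebook's pipeline. It fills only interior gaps within each country's series and leaves leading and trailing gaps untouched.

```python
import numpy as np
import pandas as pd

# Toy long-format data (hypothetical values, not World Bank figures)
toy = pd.DataFrame({
    "Country Code": ["AAA"] * 5 + ["BBB"] * 5,
    "Year": [2000, 2001, 2002, 2003, 2004] * 2,
    "Value": [1.0, np.nan, 3.0, np.nan, np.nan,
              np.nan, 10.0, np.nan, np.nan, 16.0],
})


def interpolate_within_country(df, value_col, max_gap=2):
    """Linearly fill interior gaps per country; never extrapolate at the edges."""
    out = df.sort_values(["Country Code", "Year"]).copy()
    out[value_col] = out.groupby("Country Code")[value_col].transform(
        lambda s: s.interpolate(method="linear", limit=max_gap, limit_area="inside")
    )
    return out


filled = interpolate_within_country(toy, "Value")
print(filled)
```

Grouping by country keeps one country's values from leaking into another's series, and `limit_area="inside"` avoids inventing values before the first or after the last observation.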


#### GDP (constant USD)

In [11]:
gdp_const.head(3)

Unnamed: 0,"ï»¿""Country Name""",Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,ABW,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,,,,,,,,,,,,,,,,,,,,,,,,,,,17354.421151,20407.620205,24143.171284,26573.647914,26609.384075,27358.021707,27662.203405,27798.355327,28563.233609,28479.645266,27701.050412,28526.097567,28387.129957,28351.371902,30199.661277,31169.796914,30662.321715,30651.945756,32192.791947,31249.278037,31038.889867,31759.823202,32097.040224,28157.148989,27324.555426,28037.869793,27341.227428,28668.72042,27914.843344,27458.225331,27719.500731,29671.135829,30334.115949,29576.809942,21947.99533,27469.005622,29917.128029,31178.473679,,
1,Africa Eastern and Southern,AFE,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,1172.316285,1147.268217,1204.74446,1238.408507,1261.077266,1289.399062,1314.128504,1345.904947,1361.870203,1390.437288,1366.137659,1399.978749,1398.436442,1420.753904,1454.624336,1432.584603,1424.547333,1400.343603,1379.613354,1376.295964,1408.595745,1420.491197,1379.884876,1337.862183,1341.572468,1300.102603,1290.427341,1301.852771,1318.619218,1316.420856,1281.461222,1245.4744,1184.314554,1145.063606,1136.522316,1154.82953,1185.999956,1200.398106,1190.193576,1190.104623,1196.895129,1207.197004,1221.239624,1224.839257,1259.181138,1301.719234,1350.795881,1402.328173,1424.259677,1397.873154,1431.324792,1450.051661,1436.173008,1457.901496,1475.665545,1479.564123,1472.630604,1472.734783,1471.225033,1463.139549,1383.377818,1408.860838,1421.589728,1418.363737,,
2,Afghanistan,AFG,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,308.31827,277.118051,338.139974,346.071627,338.637274,363.640141,367.758312,410.757729,417.647283,488.830652,542.87103,525.426983,568.929021,580.603833,575.146246,565.56973,563.872337,562.769574,553.125152,557.861533,527.834554,408.625855,377.665627,379.707497,,


In [12]:
gdp_const.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 70 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ï»¿"Country Name"  266 non-null    object 
 1   Country Code       266 non-null    object 
 2   Indicator Name     266 non-null    object 
 3   Indicator Code     266 non-null    object 
 4   1960               144 non-null    float64
 5   1961               151 non-null    float64
 6   1962               151 non-null    float64
 7   1963               151 non-null    float64
 8   1964               151 non-null    float64
 9   1965               153 non-null    float64
 10  1966               157 non-null    float64
 11  1967               159 non-null    float64
 12  1968               159 non-null    float64
 13  1969               159 non-null    float64
 14  1970               182 non-null    float64
 15  1971               182 non-null    float64
 16  1972               182 non

In [13]:
gdp_const.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
count,144.0,151.0,151.0,151.0,151.0,153.0,157.0,159.0,159.0,159.0,182.0,182.0,182.0,182.0,184.0,187.0,188.0,193.0,193.0,194.0,203.0,205.0,207.0,207.0,210.0,211.0,213.0,217.0,218.0,220.0,239.0,239.0,240.0,240.0,241.0,242.0,242.0,244.0,244.0,244.0,246.0,246.0,250.0,250.0,250.0,250.0,251.0,251.0,254.0,256.0,256.0,257.0,256.0,257.0,257.0,259.0,256.0,256.0,256.0,256.0,256.0,256.0,254.0,244.0,0.0,0.0
mean,4722.687616,4713.422002,4868.959952,5022.767026,5282.246491,5588.266011,5697.014604,5791.508218,6047.932624,6340.79997,8077.490462,8325.526642,8646.876039,9141.355227,9554.551171,9261.309115,9635.262205,9651.911962,9743.605417,10083.820647,9760.072432,9505.879509,9289.263496,9230.113875,9333.281822,9319.265994,9428.043513,9472.506655,9669.467687,9870.072132,9679.39135,9585.01259,9550.595107,9575.397235,9726.350411,9892.727249,10081.099591,10551.666649,10774.845561,11025.092043,11441.973487,11607.777971,11984.832967,12222.355706,12678.099096,12995.718739,13799.553272,14260.15843,14311.844231,13966.375736,14232.955523,14524.630556,14618.116528,14716.596391,14893.244484,15670.464509,15306.2824,15560.411184,15816.721713,16047.401838,14876.202969,15879.487362,16020.295054,16166.983754,,
std,6878.403997,7014.774559,7206.322205,7380.116061,7803.688541,8106.761083,8395.887336,8684.362507,8942.475539,9301.363592,13384.376239,13740.602712,14252.353328,15672.141706,16212.109393,15498.78785,16106.44723,16199.72497,15895.425335,16733.850758,16608.970412,15938.113654,15116.390332,14796.35802,14806.66258,14604.677354,14525.642914,14634.046052,14902.913898,15234.898438,14964.340324,14870.717256,14891.424345,14915.723,15183.29095,15437.113001,15629.186929,16268.435904,16677.248864,17176.210441,17887.017199,18258.511851,18365.723511,18657.230565,19259.880008,19614.669581,21010.104709,21964.581418,21737.915824,20552.911587,20785.983182,21153.374544,21036.962533,21398.170057,21598.451121,23568.143103,21966.93939,22069.901735,22476.157463,22943.642992,21160.998831,23174.966333,23744.448117,24384.197051,,
min,129.77204,127.564281,129.887941,144.000793,132.52244,143.417218,133.398255,122.678901,134.388137,135.702756,139.275134,141.8238,142.104345,137.829005,142.387514,145.45796,151.381528,157.33762,164.35842,169.599433,179.588601,187.439973,194.360862,194.645655,178.415686,176.854086,169.7878,188.871668,164.700417,168.113854,170.233437,166.71044,180.333227,188.680525,188.658267,211.087543,219.936395,214.852364,219.712174,226.444124,233.032407,239.839088,248.636469,252.806675,266.257287,275.457057,284.470807,290.979116,292.793765,288.941918,289.856951,290.948032,293.232621,296.816901,299.077479,280.966831,274.513363,269.47689,265.672394,261.75018,255.078218,255.917951,253.691642,253.530723,,
25%,877.88616,817.806503,843.711202,851.891426,903.398879,929.989868,945.928636,892.016726,898.827884,934.045454,988.767826,1010.180461,1047.950107,1082.475072,1119.788012,1119.424664,1163.72997,1235.697101,1283.357084,1330.037696,1275.508013,1276.927788,1250.058326,1218.927479,1193.216534,1219.693975,1234.178161,1242.608486,1278.233036,1271.989609,1266.423461,1232.862799,1197.499363,1155.939187,1136.792281,1143.61608,1186.280768,1255.695692,1265.353835,1268.586247,1270.917404,1308.483298,1367.979392,1387.975405,1452.017452,1449.949804,1587.115358,1580.947448,1646.108901,1669.318742,1722.197493,1763.81961,1812.137921,1890.278613,1950.16252,1999.690173,2052.270476,2000.828412,2049.754152,2037.350605,1966.672178,2105.56376,2071.149022,2093.23865,,
50%,1531.448961,1575.524061,1544.730841,1624.518695,1647.166479,1789.993422,1837.501502,1784.374133,1852.79561,1896.19772,2363.889377,2394.320466,2414.682351,2510.402157,2734.944623,2806.685631,2839.939844,2968.662581,3148.238868,3080.347268,2842.89927,2670.747093,2833.014421,2797.125478,2824.847012,2803.236921,2934.26575,3050.231762,3006.659985,2947.468593,3222.201027,3137.835017,3125.133858,3122.554851,3164.667779,3238.083948,3268.262583,3314.308772,3298.829806,3413.104318,3547.838972,3699.494175,3905.655649,4024.176117,4192.05871,4281.005926,4578.872593,4715.303598,4973.843732,5023.930217,5202.647809,5453.457187,5707.531841,5910.805923,6119.997154,6154.495935,6187.978534,6246.85347,6407.591385,6437.250652,6180.599141,6468.396596,6427.272509,6494.911413,,
75%,5310.533344,4718.247471,4839.548816,4948.212775,5918.835714,6547.982998,5818.520977,6194.414479,6620.176549,6892.524689,10634.889432,10826.482526,11432.034184,11212.636,11660.508094,11698.443777,12523.000472,11860.542143,11863.693968,11880.541051,10750.154497,10756.210938,11173.814282,11210.383205,11423.108846,11496.80327,12022.664789,11381.23077,11641.212291,11386.888255,9910.651523,10120.343245,10345.279119,10571.464561,10770.276504,9922.976417,10343.155689,11284.79505,11407.388956,11334.475628,11551.897883,11236.692355,13275.937295,13514.634911,14110.432449,15131.18433,16743.017931,17654.707749,18317.505723,17689.34663,17752.406444,17958.681398,18113.437203,18597.721407,18726.202695,19060.625424,18861.217726,19242.388883,19165.300266,19701.24956,17748.878511,18871.202106,19041.964969,18823.854519,,
max,39907.263948,42297.964739,43214.515515,44364.466052,45930.292981,46358.450863,51062.408151,56696.152709,56867.577462,57787.656227,75772.766268,75563.526831,78166.303341,108198.817535,110539.679013,105792.829745,108662.746591,114430.052844,98826.101729,106042.17417,117949.376546,113999.509456,100275.259936,94848.727743,94912.517394,95274.311722,96396.228381,97606.949962,100947.054449,104091.09475,105878.372767,106020.717885,106558.024152,104739.794413,106244.898613,107691.087544,108107.452989,109747.029484,112873.60449,116084.581378,120456.506337,123122.122915,124481.880221,126022.239091,129403.404955,132253.218403,140615.178788,162028.480643,163243.752211,142375.532007,141814.786642,148272.711992,146472.67786,157167.975028,165288.200345,170437.101188,173120.189785,165781.779122,173041.761058,185582.75569,161262.925884,194674.777609,213937.006284,224582.449752,,


In [14]:
gdp_const_meta.head()

Unnamed: 0,"ï»¿""Country Code""",Region,IncomeGroup,SpecialNotes,TableName,Unnamed: 5
0,ABW,Latin America & Caribbean,High income,,Aruba,
1,AFE,,,"26 countries, stretching from the Red Sea in t...",Africa Eastern and Southern,
2,AFG,South Asia,Low income,The reporting period for national accounts dat...,Afghanistan,
3,AFW,,,"22 countries, stretching from the westernmost ...",Africa Western and Central,
4,AGO,Sub-Saharan Africa,Lower middle income,The World Bank systematically assesses the app...,Angola,


In [15]:
gdp_const_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ï»¿"Country Code"  265 non-null    object 
 1   Region             217 non-null    object 
 2   IncomeGroup        216 non-null    object 
 3   SpecialNotes       127 non-null    object 
 4   TableName          265 non-null    object 
 5   Unnamed: 5         0 non-null      float64
dtypes: float64(1), object(5)
memory usage: 12.6+ KB


In [16]:
gdp_const.shape

(266, 70)

#### Comments
* Structure is consistent with GDP (annual %): same shape and same columns.
* Unlike the growth table, 1960 contains values here (144 non-null).

---
### Restructuring

In [17]:
# Remove special characters from column headers
gdp_percent.rename(columns={gdp_percent.columns[0]: "Country Name"}, inplace=True)
gdp_const.rename(columns={gdp_const.columns[0]: "Country Name"}, inplace=True)

gdp_percent_meta.rename(columns={gdp_percent_meta.columns[0]: "Country Code"}, inplace=True)
gdp_const_meta.rename(columns={gdp_const_meta.columns[0]: "Country Code"}, inplace=True)

In [18]:
# Drop columns that are completely empty
gdp_percent = gdp_percent.dropna(axis=1, how='all')
gdp_const = gdp_const.dropna(axis=1, how='all')

# Drop unneeded columns
gdp_percent.drop(columns=['Indicator Code', 'Indicator Name'], inplace=True)
gdp_const.drop(columns=['Indicator Code', 'Indicator Name'], inplace=True)
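
Before dropping all-empty columns, it can help to list which ones will go, since the two tables do not necessarily lose the same years. A minimal sketch on a toy frame (the frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy wide-format frame with two completely empty columns
wide = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "1960": [np.nan, np.nan],
    "1961": [1.2, np.nan],
    "Unnamed: 69": [np.nan, np.nan],
})

# Columns where every value is missing; these are what dropna(how="all") removes
empty_cols = wide.columns[wide.isna().all()].tolist()
print("Columns to be dropped:", empty_cols)

trimmed = wide.dropna(axis=1, how="all")
print(list(trimmed.columns))
```

Running the same check on both GDP tables would make any difference in their year ranges explicit before they are merged.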

In [19]:
gdp_const.head(5)

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Aruba,ABW,,,,,,,,,,,,,,,,,,,,,,,,,,,17354.421151,20407.620205,24143.171284,26573.647914,26609.384075,27358.021707,27662.203405,27798.355327,28563.233609,28479.645266,27701.050412,28526.097567,28387.129957,28351.371902,30199.661277,31169.796914,30662.321715,30651.945756,32192.791947,31249.278037,31038.889867,31759.823202,32097.040224,28157.148989,27324.555426,28037.869793,27341.227428,28668.72042,27914.843344,27458.225331,27719.500731,29671.135829,30334.115949,29576.809942,21947.99533,27469.005622,29917.128029,31178.473679
1,Africa Eastern and Southern,AFE,1172.316285,1147.268217,1204.74446,1238.408507,1261.077266,1289.399062,1314.128504,1345.904947,1361.870203,1390.437288,1366.137659,1399.978749,1398.436442,1420.753904,1454.624336,1432.584603,1424.547333,1400.343603,1379.613354,1376.295964,1408.595745,1420.491197,1379.884876,1337.862183,1341.572468,1300.102603,1290.427341,1301.852771,1318.619218,1316.420856,1281.461222,1245.4744,1184.314554,1145.063606,1136.522316,1154.82953,1185.999956,1200.398106,1190.193576,1190.104623,1196.895129,1207.197004,1221.239624,1224.839257,1259.181138,1301.719234,1350.795881,1402.328173,1424.259677,1397.873154,1431.324792,1450.051661,1436.173008,1457.901496,1475.665545,1479.564123,1472.630604,1472.734783,1471.225033,1463.139549,1383.377818,1408.860838,1421.589728,1418.363737
2,Afghanistan,AFG,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,308.31827,277.118051,338.139974,346.071627,338.637274,363.640141,367.758312,410.757729,417.647283,488.830652,542.87103,525.426983,568.929021,580.603833,575.146246,565.56973,563.872337,562.769574,553.125152,557.861533,527.834554,408.625855,377.665627,379.707497
3,Africa Western and Central,AFW,1110.513849,1107.762042,1124.66106,1178.787178,1215.750864,1238.103967,1192.193605,1060.095104,1052.208145,1184.296706,1358.785939,1466.146029,1475.895726,1496.925455,1604.662308,1528.886851,1613.828576,1642.258225,1566.17486,1600.027462,1585.401904,1436.929034,1350.38603,1230.695048,1205.045355,1235.065196,1218.020774,1202.161752,1224.014494,1210.919721,1247.047875,1226.86277,1221.82126,1173.725506,1139.201703,1129.417531,1150.091738,1168.592294,1177.722013,1163.56122,1175.774844,1202.42389,1285.712183,1319.360539,1385.00979,1424.638807,1459.3281,1497.188074,1546.852237,1598.600925,1663.015834,1695.187851,1733.897753,1791.113987,1846.050944,1845.767804,1799.974945,1793.465292,1798.339783,1811.727583,1751.19501,1779.340159,1804.198142,1820.754741
4,Angola,AGO,,,,,,,,,,,,,,,,,,,,,2835.459185,2613.739386,2519.351585,2529.401923,2583.619826,2577.465092,2557.958838,2571.496919,2637.781855,2550.294313,2380.176243,2324.369376,2118.166545,1559.517152,1529.994431,1701.697277,1867.935031,1936.916193,1961.037225,1938.86184,1932.988479,1947.85673,2139.872597,2128.195793,2277.866965,2526.410579,2716.253026,2983.041873,3193.287723,3100.609274,3114.696086,3099.94229,3236.574945,3268.638907,3300.73511,3213.902611,3020.983328,2911.600842,2775.746423,2664.43851,2433.376373,2385.448818,2382.02264,2332.886913


In [20]:
# Melt dataset from wide format to long format
gdp_percent = gdp_percent.melt(
    id_vars=['Country Name', 'Country Code'],
    var_name="Year", # Creates new column to store Year data
    value_name="GDPAnnualPercent"  # Data values go here
)

# Ensure Year is coded as a numeric column
gdp_percent["Year"] = pd.to_numeric(gdp_percent["Year"], errors="coerce")

# preview
gdp_percent.head(3)

Unnamed: 0,Country Name,Country Code,Year,GDPAnnualPercent
0,Aruba,ABW,1961,
1,Africa Eastern and Southern,AFE,1961,-2.13663
2,Afghanistan,AFG,1961,
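
A quick way to confirm that a melt behaved as intended is to compare row counts and look for duplicate keys. The sketch below uses a toy frame; the names `wide`, `long_df`, and the value column `Value` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame (hypothetical values)
wide = pd.DataFrame({
    "Country Name": ["Aruba", "Angola"],
    "Country Code": ["ABW", "AGO"],
    "1961": [np.nan, 1.0],
    "1962": [2.0, 3.0],
})

id_cols = ["Country Name", "Country Code"]
long_df = wide.melt(id_vars=id_cols, var_name="Year", value_name="Value")
long_df["Year"] = pd.to_numeric(long_df["Year"], errors="coerce")

# One long row per (country, year column) pair
n_year_cols = wide.shape[1] - len(id_cols)
rows_match = len(long_df) == len(wide) * n_year_cols

# Every year label parsed, and each (country, year) key appears once
years_parsed = bool(long_df["Year"].notna().all())
keys_unique = not long_df.duplicated(["Country Code", "Year"]).any()

print(rows_match, years_parsed, keys_unique)
```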


In [21]:
gdp_const.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 66 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  266 non-null    object 
 1   Country Code  266 non-null    object 
 2   1960          144 non-null    float64
 3   1961          151 non-null    float64
 4   1962          151 non-null    float64
 5   1963          151 non-null    float64
 6   1964          151 non-null    float64
 7   1965          153 non-null    float64
 8   1966          157 non-null    float64
 9   1967          159 non-null    float64
 10  1968          159 non-null    float64
 11  1969          159 non-null    float64
 12  1970          182 non-null    float64
 13  1971          182 non-null    float64
 14  1972          182 non-null    float64
 15  1973          182 non-null    float64
 16  1974          184 non-null    float64
 17  1975          187 non-null    float64
 18  1976          188 non-null    

In [22]:
# Melt dataset from wide format to long format
gdp_const = gdp_const.melt(
    id_vars=['Country Name', 'Country Code'],
    var_name="Year", # Creates new column to store Year data
    value_name="GDPConstantUSD"  # Data values go here
)

# Ensure Year is coded as a numeric column
gdp_const["Year"] = pd.to_numeric(gdp_const["Year"], errors="coerce")

# preview
gdp_const.head(3)

Unnamed: 0,Country Name,Country Code,Year,GDPConstantUSD
0,Aruba,ABW,1960,
1,Africa Eastern and Southern,AFE,1960,1172.316285
2,Afghanistan,AFG,1960,


#### Merge Annual % and Constant USD tables

In [23]:
gdp_combined = gdp_percent.merge(gdp_const, on=["Country Name", "Country Code", "Year"], how="outer")

In [24]:
gdp_combined.tail(10)

Unnamed: 0,Country Name,Country Code,Year,GDPAnnualPercent,GDPConstantUSD
17014,Zimbabwe,ZWE,2014,0.101989,1377.25456
17015,Zimbabwe,ZWE,2015,0.665693,1386.422847
17016,Zimbabwe,ZWE,2016,-0.490074,1379.628343
17017,Zimbabwe,ZWE,2017,2.58932,1415.351332
17018,Zimbabwe,ZWE,2018,3.459492,1464.315294
17019,Zimbabwe,ZWE,2019,-7.78558,1350.309851
17020,Zimbabwe,ZWE,2020,-9.333971,1224.272314
17021,Zimbabwe,ZWE,2021,6.611911,1305.220113
17022,Zimbabwe,ZWE,2022,4.343667,1361.91453
17023,Zimbabwe,ZWE,2023,3.584864,1410.737311
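
An outer merge silently keeps rows that exist on only one side. In this notebook the growth table lost its 1960 column when all-empty columns were dropped, while the constant-USD table kept it. A minimal sketch of how such a merge could be audited with the `validate` and `indicator` options of `DataFrame.merge`, using toy data (frame names and values are hypothetical):

```python
import pandas as pd

# Toy stand-ins: the left table starts one year later than the right
left = pd.DataFrame({
    "Country Code": ["AAA", "AAA"],
    "Year": [1961, 1962],
    "GDPAnnualPercent": [1.5, 2.0],
})
right = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "AAA"],
    "Year": [1960, 1961, 1962],
    "GDPConstantUSD": [100.0, 101.5, 103.5],
})

checked = left.merge(
    right,
    on=["Country Code", "Year"],
    how="outer",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a _merge column: left_only / right_only / both
)

n_both = int((checked["_merge"] == "both").sum())
n_right_only = int((checked["_merge"] == "right_only").sum())
n_left_only = int((checked["_merge"] == "left_only").sum())
print(n_both, n_right_only, n_left_only)
```

Rows flagged `right_only` or `left_only` show where one indicator has no counterpart, which is useful to know before any growth-versus-level comparison.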


---
#### Create Regional & Income Groups Table
First, compare gdp_percent_meta and gdp_const_meta to check whether they are identical.

In [25]:
# Are columns identical? Yes.
print(gdp_percent_meta.columns)
print(gdp_const_meta.columns)

Index(['Country Code', 'Region', 'IncomeGroup', 'SpecialNotes', 'TableName',
       'Unnamed: 5'],
      dtype='object')
Index(['Country Code', 'Region', 'IncomeGroup', 'SpecialNotes', 'TableName',
       'Unnamed: 5'],
      dtype='object')


In [26]:
# Are missing values identical? Yes.
print(gdp_percent_meta.isnull().sum())
print("-----")
print(gdp_const_meta.isnull().sum())

Country Code      0
Region           48
IncomeGroup      49
SpecialNotes    138
TableName         0
Unnamed: 5      265
dtype: int64
-----
Country Code      0
Region           48
IncomeGroup      49
SpecialNotes    138
TableName         0
Unnamed: 5      265
dtype: int64


In [27]:
# Check if there are differences in countries on gdp_const_meta and gdp_percent_meta
diff_countries = set(gdp_percent_meta["Country Code"]) - set(gdp_const_meta["Country Code"])
print("Countries in gdp_percent_meta but not in gdp_const_meta:", diff_countries)

Countries in gdp_percent_meta but not in gdp_const_meta: set()


#### Comments
The column names and missing-value counts match, and no country code in gdp_percent_meta is absent from gdp_const_meta. The two tables are therefore assumed to be identical, and only gdp_percent_meta is used from here on.
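
The checks above compare column names, null counts, and country codes in one direction. A stricter comparison could be sketched as follows. The helper name `frames_match` and the toy tables are hypothetical. `DataFrame.equals` treats NaN values in the same position as equal, and the set symmetric difference (`^`) checks country codes in both directions.

```python
import pandas as pd

def frames_match(a: pd.DataFrame, b: pd.DataFrame, key: str) -> bool:
    """Return True when both frames hold the same rows, ignoring row order."""
    if list(a.columns) != list(b.columns):
        return False
    a_sorted = a.sort_values(key).reset_index(drop=True)
    b_sorted = b.sort_values(key).reset_index(drop=True)
    return a_sorted.equals(b_sorted)

# Toy metadata tables; codes and regions are hypothetical
meta_a = pd.DataFrame({"Country Code": ["AAA", "BBB"], "Region": ["North", None]})
meta_b = meta_a.iloc[::-1].copy()                      # same rows, reversed order
meta_c = meta_a.assign(Region=["North", "South"])      # one value differs

# Codes present in exactly one of the two tables (both directions)
code_diff = set(meta_a["Country Code"]) ^ set(meta_b["Country Code"])

print(frames_match(meta_a, meta_b, "Country Code"))
print(frames_match(meta_a, meta_c, "Country Code"))
print(code_diff)
```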

In [28]:
# Work on a copy under a new name so gdp_percent_meta is left unchanged
gdp_meta = gdp_percent_meta.copy()

# Remove unneeded columns from gdp_meta
gdp_meta.drop(columns=['SpecialNotes', 'Unnamed: 5'], inplace=True)

In [29]:
# Create a new DataFrame that contains only the regional and income-group aggregates
#   (rows with no Region value)
gdp_groups_regions = gdp_meta[gdp_meta["Region"].isna()].copy()
gdp_groups_regions.drop(columns=["Region", "IncomeGroup"], inplace=True)
gdp_groups_regions.rename(columns={'TableName': "IncomeRegionGroup"}, inplace=True)

# Reset the index to a clean 0..n-1 range
gdp_groups_regions.reset_index(drop=True, inplace=True)

# Preview the updated table
gdp_groups_regions.head(5)


Unnamed: 0,Country Code,IncomeRegionGroup
0,AFE,Africa Eastern and Southern
1,AFW,Africa Western and Central
2,ARB,Arab World
3,CEB,Central Europe and the Baltics
4,CSS,Caribbean small states


In [30]:
# Remove rows with empty "Region" values 
#   Removes the income/regional groups 
#   and leaves only country-specific income and region data
gdp_meta_cleaned = gdp_meta[gdp_meta["Region"].notna()]

# Overwrite original dataframe
gdp_meta = gdp_meta_cleaned.copy()

In [31]:
# Reset the index after filtering and preview the result
gdp_meta.reset_index(drop=True, inplace=True)
gdp_meta.head(5)

Unnamed: 0,Country Code,Region,IncomeGroup,TableName
0,ABW,Latin America & Caribbean,High income,Aruba
1,AFG,South Asia,Low income,Afghanistan
2,AGO,Sub-Saharan Africa,Lower middle income,Angola
3,ALB,Europe & Central Asia,Upper middle income,Albania
4,AND,Europe & Central Asia,High income,Andorra


In [32]:
gdp_combined.head(5)

Unnamed: 0,Country Name,Country Code,Year,GDPAnnualPercent,GDPConstantUSD
0,Afghanistan,AFG,1960,,
1,Afghanistan,AFG,1961,,
2,Afghanistan,AFG,1962,,
3,Afghanistan,AFG,1963,,
4,Afghanistan,AFG,1964,,


In [33]:
# Add gdp_meta information to gdp_combined
gdp_with_meta = gdp_combined.merge(gdp_meta, on="Country Code", how="left")

# Reorder columns
columns = ["Country Name", "Country Code", "Region", "IncomeGroup", "Year", "GDPAnnualPercent", "GDPConstantUSD"]
gdp_with_meta = gdp_with_meta[columns]

# Preview DataFrame
gdp_with_meta.head(5)

Unnamed: 0,Country Name,Country Code,Region,IncomeGroup,Year,GDPAnnualPercent,GDPConstantUSD
0,Afghanistan,AFG,South Asia,Low income,1960,,
1,Afghanistan,AFG,South Asia,Low income,1961,,
2,Afghanistan,AFG,South Asia,Low income,1962,,
3,Afghanistan,AFG,South Asia,Low income,1963,,
4,Afghanistan,AFG,South Asia,Low income,1964,,
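
Because gdp_meta no longer contains the regional and income-group rows at this point, a left merge leaves Region and IncomeGroup empty for any code that is not a country. The sketch below uses hypothetical toy tables and codes to show one way to list such codes after the merge. `validate="many_to_one"` is a standard `merge` parameter that raises an error if a code appears more than once in the metadata table.

```python
import pandas as pd

# Toy observations: "ZZZ" stands in for an aggregate with no metadata row
observations = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "ZZZ"],
    "Year": [2000, 2001, 2000],
    "GDPConstantUSD": [1000.0, 1020.0, 750.0],
})
meta = pd.DataFrame({
    "Country Code": ["AAA"],
    "Region": ["Region X"],
    "IncomeGroup": ["High income"],
})

with_meta = observations.merge(
    meta,
    on="Country Code",
    how="left",
    validate="many_to_one",  # each code may appear only once in meta
)

# Codes that found no metadata match keep NaN in Region
unmatched = with_meta.loc[with_meta["Region"].isna(), "Country Code"].unique()
print(list(unmatched))
```

Rows flagged this way can then be dropped or routed to the groups table, depending on which export they belong in.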


---

In [34]:
# Keep only the rows of gdp_combined whose code is in gdp_groups_regions
#   (regional and income-group aggregates, not individual countries)
gdp_filtered_rows = gdp_combined[gdp_combined["Country Code"].isin(gdp_groups_regions["Country Code"])]

# Merge filtered rows with gdp_groups_regions table
gdp_groups_regions = gdp_groups_regions.merge(gdp_filtered_rows, on="Country Code", how="left")
gdp_groups_regions.head(5)

Unnamed: 0,Country Code,IncomeRegionGroup,Country Name,Year,GDPAnnualPercent,GDPConstantUSD
0,AFE,Africa Eastern and Southern,Africa Eastern and Southern,1960,,1172.316285
1,AFE,Africa Eastern and Southern,Africa Eastern and Southern,1961,-2.13663,1147.268217
2,AFE,Africa Eastern and Southern,Africa Eastern and Southern,1962,5.009835,1204.74446
3,AFE,Africa Eastern and Southern,Africa Eastern and Southern,1963,2.794289,1238.408507
4,AFE,Africa Eastern and Southern,Africa Eastern and Southern,1964,1.830475,1261.077266


### Export GDP by Country to CSV
The output lists each country and country code with its GDP annual % and constant 2015 USD values by year. The data has been restructured but not yet cleaned or transformed.

In [None]:
# Commented out to avoid duplicated exports
# gdp_percent.to_csv("data/in_process/gdp_percent_structured.csv", index=False)

In [36]:
# Check to see if IncomeRegionGroup and Country Name are identical
#   and therefore one can be deleted
if "Country Name" in gdp_groups_regions.columns:
    matches = gdp_groups_regions["IncomeRegionGroup"] == gdp_groups_regions["Country Name"]
    print(matches.value_counts())
else:
    print("'Country Name' column not found in gdp_groups_regions.")

True     2752
False     320
Name: count, dtype: int64


In [37]:
# Given that there are mismatches, let's check them out
if "Country Name" in gdp_groups_regions.columns:
    mismatches = gdp_groups_regions[gdp_groups_regions["IncomeRegionGroup"] != gdp_groups_regions["Country Name"]]
    print(mismatches.head())
else:
    print("'Country Name' column not found in gdp_groups_regions. No mismatches to display.")

     Country Code                 IncomeRegionGroup  \
2560          TEA  East Asia & Pacific (IDA & IBRD)   
2561          TEA  East Asia & Pacific (IDA & IBRD)   
2562          TEA  East Asia & Pacific (IDA & IBRD)   
2563          TEA  East Asia & Pacific (IDA & IBRD)   
2564          TEA  East Asia & Pacific (IDA & IBRD)   

                                    Country Name  Year  GDPAnnualPercent  \
2560  East Asia & Pacific (IDA & IBRD countries)  1960               NaN   
2561  East Asia & Pacific (IDA & IBRD countries)  1961        -13.260177   
2562  East Asia & Pacific (IDA & IBRD countries)  1962         -1.935164   
2563  East Asia & Pacific (IDA & IBRD countries)  1963          3.769435   
2564  East Asia & Pacific (IDA & IBRD countries)  1964          8.117929   

      GDPConstantUSD  
2560      327.135944  
2561      283.757140  
2562      278.265973  
2563      288.755027  
2564      312.195956  
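
The preview above shows only the first five mismatching rows, all for the same group. Because each group repeats once per year, listing the distinct name pairs gives a shorter view of every mismatch. This is a minimal sketch on a hypothetical toy table, not the notebook's data. If every group has one row per year from 1960 to 2023 (64 rows), the 320 mismatching rows would correspond to five groups.

```python
import pandas as pd

# Toy groups table; codes and names are hypothetical
groups = pd.DataFrame({
    "Country Code": ["GR1", "GR1", "GR2", "GR2"],
    "IncomeRegionGroup": ["Group One", "Group One",
                          "Group Two (short)", "Group Two (short)"],
    "Country Name": ["Group One", "Group One",
                     "Group Two (short name)", "Group Two (short name)"],
    "Year": [1960, 1961, 1960, 1961],
})

# Rows where the two name columns disagree
differs = groups["IncomeRegionGroup"] != groups["Country Name"]

# One line per distinct (code, name, name) combination
pairs = (
    groups.loc[differs, ["Country Code", "IncomeRegionGroup", "Country Name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(pairs)
```

If each pair differs only in wording, as in the preview above, one of the two columns can safely be dropped.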


In [38]:
# The mismatches shown are naming variants of the same group,
#   so column "Country Name" is redundant
gdp_groups_regions.drop(columns=["Country Name"], inplace=True)
gdp_groups_regions.head(5)

Unnamed: 0,Country Code,IncomeRegionGroup,Year,GDPAnnualPercent,GDPConstantUSD
0,AFE,Africa Eastern and Southern,1960,,1172.316285
1,AFE,Africa Eastern and Southern,1961,-2.13663,1147.268217
2,AFE,Africa Eastern and Southern,1962,5.009835,1204.74446
3,AFE,Africa Eastern and Southern,1963,2.794289,1238.408507
4,AFE,Africa Eastern and Southern,1964,1.830475,1261.077266


### Export GDP Income Groups and Regions to CSV
The output lists the country code for each worldwide region and income group (as classified in the World Bank metadata), with its GDP annual % and constant 2015 USD values by year. The data has been restructured but not yet cleaned or transformed.

In [None]:
# Commented out to avoid duplicated exports
# gdp_groups_regions.to_csv("data/in_process/gdp_groups_regions_structured.csv", index=False)
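
As an alternative to commenting the export in and out, the write could be guarded so that it only happens when the target file does not exist yet. The helper name `export_once` is hypothetical, and the demonstration writes a toy table to a temporary directory rather than to the notebook's data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_once(df: pd.DataFrame, path: str) -> bool:
    """Write df to path unless the file already exists; return True if written."""
    target = Path(path)
    if target.exists():
        return False
    # Create missing parent folders before writing
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(target, index=False)
    return True

# Demonstration with a toy table in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = str(Path(demo_dir) / "in_process" / "demo_structured.csv")
demo_df = pd.DataFrame({"Country Code": ["AAA"], "Year": [2000]})

first_write = export_once(demo_df, demo_path)   # file is created
second_write = export_once(demo_df, demo_path)  # skipped, file exists
print(first_write, second_write)
```

With a guard like this the cell can be re-run without producing duplicate or overwritten exports.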

---

Author: Katlyn Goeujon-Mackness