# **Economic Development vs. Sustainability**
# CO2 Emissions and GDP - Version 2
Katlyn Goeujon-Mackness <br>
Last Updated: 20/06/2025

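This notebook revises version 1 of the GDP-CO2 dataset: it rescales per capita CO2, handles the Sint Maarten outlier, updates the GDP categories, and prepares the version 2 export.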

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import zscore

# Prevent truncating columns and rows
pd.set_option("display.max_rows", None) 
pd.set_option("display.max_columns", None) 

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [33]:
# Load the dataset
gdp_co2 = pd.read_csv("../data/processed/gdp_co2_by_country_v1.csv")
gdp_co2.head(3)

Unnamed: 0,Country Name,Country Code,Year,Population,Pop Log,Pop Outliers,Pop Category,CO2,CO2 %,Per Capita CO2,Cumulative CO2,CO2 Log,CO2 Outliers,Emissions Category,GDP USD,GDP USD Log,GDP %,GDP % Winsor,GDP Per Capita,GDP Category,CO2 Per GDP
0,Afghanistan,AFG,1961,9214082.0,16.036244,not outlier,1M-10M,0.491,,5.3288e-08,235.001,0.399447,False,Moderate,308.31827,5.734371,-10.119484,-10.119484,3.3e-05,Low GDP,0.001593
1,Afghanistan,AFG,1962,9404411.0,16.056689,not outlier,1M-10M,0.689,40.325866,7.326349e-08,235.001,0.524137,False,Moderate,308.31827,5.734371,-10.119484,-10.119484,3.3e-05,Low GDP,0.002235
2,Afghanistan,AFG,1963,9604491.0,16.077741,not outlier,1M-10M,0.707,2.612482,7.36114e-08,235.001,0.534737,False,Moderate,308.31827,5.734371,-10.119484,-10.119484,3.2e-05,Low GDP,0.002293


## Rescale Per Capita CO2 

In [34]:
# Previous: metric tons per person
# Current: kilograms per person (1 metric ton = 1,000 kg, so multiply by 1,000)
gdp_co2['Per Capita CO2 (kg)'] = gdp_co2['Per Capita CO2'] * 1000
gdp_co2.drop(columns=['Per Capita CO2'], inplace=True)
gdp_co2.head(3)

Unnamed: 0,Country Name,Country Code,Year,Population,Pop Log,Pop Outliers,Pop Category,CO2,CO2 %,Cumulative CO2,CO2 Log,CO2 Outliers,Emissions Category,GDP USD,GDP USD Log,GDP %,GDP % Winsor,GDP Per Capita,GDP Category,CO2 Per GDP,Per Capita CO2 (kg)
0,Afghanistan,AFG,1961,9214082.0,16.036244,not outlier,1M-10M,0.491,,235.001,0.399447,False,Moderate,308.31827,5.734371,-10.119484,-10.119484,3.3e-05,Low GDP,0.001593,5.3e-05
1,Afghanistan,AFG,1962,9404411.0,16.056689,not outlier,1M-10M,0.689,40.325866,235.001,0.524137,False,Moderate,308.31827,5.734371,-10.119484,-10.119484,3.3e-05,Low GDP,0.002235,7.3e-05
2,Afghanistan,AFG,1963,9604491.0,16.077741,not outlier,1M-10M,0.707,2.612482,235.001,0.534737,False,Moderate,308.31827,5.734371,-10.119484,-10.119484,3.2e-05,Low GDP,0.002293,7.4e-05
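
As a quick sanity check on the units, the per capita figure can be recomputed from the `CO2` and `Population` values of the first row shown above. The sketch below assumes `CO2` is reported in million metric tons, which this notebook does not state; the helper name `per_capita_kg` and the constant `KG_PER_MILLION_TONNES` are hypothetical. If that assumption holds, the original `Per Capita CO2` column is in million tons per person, and converting it to kilograms would take a factor of 1e9 rather than 1,000.

```python
# Values copied from the first row of the output above (Afghanistan, 1961).
co2 = 0.491               # ASSUMPTION: million metric tons (unit not stated in the notebook)
population = 9_214_082.0

KG_PER_MILLION_TONNES = 1e9  # 1e6 tonnes * 1e3 kg per tonne


def per_capita_kg(co2_mt: float, population: float) -> float:
    """Hypothetical helper: per capita emissions in kg, given CO2 in million tonnes."""
    return co2_mt * KG_PER_MILLION_TONNES / population


# Plain ratio of the two columns; compare with the original 'Per Capita CO2' value.
ratio = co2 / population
kg_per_person = per_capita_kg(co2, population)

print(f"CO2 / Population: {ratio:.4e}")
print(f"kg per person (under the assumption): {kg_per_person:.1f}")
```

If the printed ratio matches the original `Per Capita CO2` value for the row, the column is simply `CO2 / Population` in whatever unit `CO2` uses, so the correct scaling factor depends on that unit.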


## Merge Sint Maarten data with the Netherlands
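Because Sint Maarten has a very small population, its data can disproportionately skew per capita analyses. One way to improve consistency is to merge it with its principal sovereign entity, the Netherlands. The cell below builds the merged rows, but the step that writes them back into the dataset is commented out.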

In [35]:
# Filter rows for both countries
nld = gdp_co2[gdp_co2['Country Name'] == "Netherlands"].copy()
sxm = gdp_co2[gdp_co2['Country Name'] == "Sint Maarten (Dutch part)"].copy()

# Get numeric columns (exclude 'Year' to avoid suffix issues)
numeric_cols = nld.select_dtypes(include='number').columns.drop('Year').tolist()

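# Merge on 'Year' (default inner join: only years present for both countries are kept)
merged = pd.merge(nld, sxm, on='Year', suffixes=('_nld', '_sxm'))

# Sum numeric columns into the main Netherlands row
# Caution: this adds every numeric column, including logs, percentages,
# and per capita ratios, which are not additive
for col in numeric_cols:
    merged[col] = merged[f"{col}_nld"] + merged[f"{col}_sxm"]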

# Keep only the base numeric + 'Year' columns
final = merged[['Year'] + numeric_cols].copy()
final['Country Name'] = 'Netherlands'
final['Country Code'] = 'NLD'

# Remove old rows and replace with aggregated version
# Commented out to avoid making changes to the dataset
# gdp_co2 = gdp_co2[~gdp_co2['Country Name'].isin(['Netherlands', 'Sint Maarten (Dutch part)'])]
# gdp_co2 = pd.concat([gdp_co2, final], ignore_index=True)

# Confirm the changes
# print("Netherlands rows:", gdp_co2[gdp_co2['Country Name'] == "Netherlands"].shape[0])
# print("Sint Maarten rows:", gdp_co2[gdp_co2['Country Name'] == "Sint Maarten (Dutch part)"].shape[0])
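
One caveat with the loop above: it adds every numeric column, including logs, percentage changes, and per capita ratios, and the default inner join keeps only the years that appear for both countries. A more conservative sketch sums only columns that are plausibly additive and recomputes ratios afterwards. The choice of `Population` and `CO2` as additive columns, the helper name `merge_into_parent`, and the toy values are assumptions for illustration, not the notebook's data.

```python
import pandas as pd


def merge_into_parent(df, parent, child, additive_cols):
    """Hypothetical helper: add the child's additive columns to the parent's rows, year by year."""
    parent_rows = df[df["Country Name"] == parent].set_index("Year")
    child_rows = df[df["Country Name"] == child].set_index("Year")

    # Align the child on the parent's years; years without child data contribute 0.
    child_aligned = child_rows[additive_cols].reindex(parent_rows.index).fillna(0)

    combined = parent_rows.copy()
    combined[additive_cols] = parent_rows[additive_cols] + child_aligned
    return combined.reset_index()


# Toy data (made-up numbers) to show the behaviour.
toy = pd.DataFrame({
    "Country Name": ["Netherlands", "Netherlands", "Sint Maarten (Dutch part)"],
    "Year": [2000, 2001, 2001],
    "Population": [100.0, 110.0, 5.0],
    "CO2": [10.0, 11.0, 1.0],
})

merged_toy = merge_into_parent(
    toy, "Netherlands", "Sint Maarten (Dutch part)", ["Population", "CO2"]
)
# Recompute ratios from the summed totals instead of adding them.
merged_toy["Per Capita CO2"] = merged_toy["CO2"] / merged_toy["Population"]
print(merged_toy)
```

In this sketch every parent year survives, including years where the child has no rows, and derived columns are rebuilt from the totals rather than summed.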


### Alternative Option
Given Sint Maarten’s status as a statistical outlier, largely due to its small population and unique emission profile, it may be more appropriate to exclude it from the dataset to avoid distortion. This is the option applied here.

In [36]:
# Remove Sint Maarten data
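# .copy() makes the result an independent DataFrame, so later column assignments are unambiguous
gdp_co2 = gdp_co2[gdp_co2['Country Name'] != 'Sint Maarten (Dutch part)'].copy()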

# Confirm the changes
print("Remaining rows for Sint Maarten:", len(gdp_co2[gdp_co2['Country Name'] == 'Sint Maarten (Dutch part)']))


Remaining rows for Sint Maarten: 0


## Update GDP Categories
The previous GDP bins were set far too high for the range of values in `GDP USD`, so they did not separate countries into meaningful categories. We update the bin edges to fix this.

In [37]:
print(gdp_co2['GDP USD'].min())
print(gdp_co2['GDP USD'].max())

122.678900960286
167187.15730982


In [38]:
# GDP categories (rounded cutoffs loosely based on World Bank income groups)
bins = [0, 1000, 12000, np.inf]
labels = ['Low GDP', 'Middle GDP', 'High GDP']

gdp_co2["GDP Category"] = pd.cut(
    gdp_co2["GDP USD"],
    bins=bins,
    labels=labels,
    include_lowest=True
)
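
To see how `pd.cut` treats values that sit exactly on a bin edge, the sketch below applies the same bins and labels to a few made-up values. With the default `right=True`, each bin includes its right edge, so 1000 falls in 'Low GDP' and 12000 in 'Middle GDP'; `include_lowest=True` additionally makes the first bin include 0. The sample values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

bins = [0, 1000, 12000, np.inf]
labels = ["Low GDP", "Middle GDP", "High GDP"]

# Made-up values chosen to land on and around the bin edges.
sample = pd.Series([0, 999.99, 1000, 1000.01, 12000, 12000.01])

categories = pd.cut(sample, bins=bins, labels=labels, include_lowest=True)

# Show each value next to its assigned category.
print(pd.DataFrame({"GDP USD": sample, "GDP Category": categories}))

# Count how many values landed in each category.
print(categories.value_counts())
```

Running the same `value_counts()` call on `gdp_co2["GDP Category"]` is a quick way to confirm that the new bins spread rows across all three categories.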

---
## Export Updated Dataset

In [None]:
# Commented out to avoid duplicate exports; uncomment to write the v2 file
# gdp_co2.to_csv("../data/processed/gdp_co2_by_country_v2.csv", index=False)
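
An alternative to commenting the export in and out is to guard it with an existence check. This is a minimal sketch; the helper name `export_once` is hypothetical, and the demonstration writes to a temporary directory rather than the notebook's real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_once(df: pd.DataFrame, path: Path) -> bool:
    """Hypothetical helper: write df to CSV only if the file does not exist yet.

    Returns True if a file was written, False if the export was skipped.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return True


# Demonstration with a throwaway frame and a temporary directory.
demo = pd.DataFrame({"Country Code": ["NLD"], "Year": [2000]})
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "processed" / "demo_export.csv"
    first = export_once(demo, target)    # file is created
    second = export_once(demo, target)   # file already exists, so nothing is written
    round_trip = pd.read_csv(target)

print(first, second)
```

With a guard like this the cell can be rerun safely, at the cost of having to delete the file by hand when a fresh export is actually wanted.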