# **Economic Development vs. Sustainability**
# Data Discovery and Structuring - CO_2 Emissions
Katlyn Goeujon-Mackness <br>
Last Updated: 17/06/2025

## Table of Contents

**1. Introduction**
- Overview of CO₂ emissions and dataset scope  
- Objectives of the analysis  

**2. Locating Relevant Data**
- Importing and exploring emissions datasets  
- Identifying structured data sources  

**3. Data Preprocessing**
- Handling missing values and inconsistencies  
- Filtering and structuring country-based emissions data  
- Assigning ISO codes to regions  

**4. Exploratory Data Analysis**
- Trends in CO₂ emissions by country and region  
- Comparisons across income groups  
- Visualizing emissions per capita and per GDP  

**5. Regional Aggregation**
- Managing large region classifications (Africa, Europe, etc.)  
- Integrating grouped emissions data  

**6. Cleaning and Refining Data**
- Filtering out non-country entities (aviation, shipping, etc.)  
- Handling missing ISO codes and country classifications  

**7. Final Adjustments**
- Validating structured emissions datasets  
- Exporting cleaned data for further analysis  



## Introduction

Economic growth is often pursued at the cost of environmental sustainability. This study aims to analyze the balance between economic development and sustainable practices across different regions, industries, and policies.

In this **discovery and structuring phase** of the data analysis, we will locate, collect and process necessary raw data to structure it in a way that will be useful in analysis. Finally, we will export processed data in CSV format for analysis.

### Key Challenge
Achieving sustainable economic growth requires balancing financial prosperity with environmental and social responsibility. Identifying actionable patterns in historical data can inform policymakers, businesses, and environmental advocates.

### Data of Interest
- GDP growth rate compared to carbon emissions per capita (current analysis).
- Percentage of renewable energy adoption.
- Employment trends in green industries.
- Improvement in environmental quality indicators (air quality, water safety).
- Sustainability index scores vs. economic performance.

### Locating Relevant Data
- **World Bank**: Economic indicators.
    * [GDP per capita growth (annual %)](https://data.worldbank.org/indicator/NY.GDP.PCAP.KD.ZG)
    * [GDP per capita (constant 2015 US$)](https://data.worldbank.org/indicator/NY.GDP.PCAP.KD)
- **Our World in Data**: Environmental indicators and population data
    * [CO2 and Greenhouse Gas Emissions](https://github.com/owid/co2-data)
    * [Historical and Projected Population](https://ourworldindata.org/population-sources) 
- **United Nations SDGs Database**: Sustainable development statistics.
- **OECD**: Policy effectiveness on sustainability.
- **NASA Earth Observations**: Environmental impact metrics.
- **National Employment Data**: Job growth in sustainable sectors.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Prevent truncating columns and rows
pd.set_option("display.max_rows", None) 
pd.set_option("display.max_columns", None) 

In [2]:
# OWIS complete dataset
co2_data = pd.read_csv("../data/raw/emissions/owid-co2-data_raw.csv")

# gdp_groups_regions_structured to extract regional and income groups from the OWIS dataset
groups_regions = pd.read_csv("../data/in_process/1a_gdp_region_groups_structured.csv")

# access gdp_percent_structured from World Bank for the list of countries to compare to OWIS
countries = pd.read_csv("../data/in_process/1a_gdp_countries_structured.csv")

In [3]:
co2_data.head()

Unnamed: 0,country,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,co2_including_luc,co2_including_luc_growth_abs,co2_including_luc_growth_prct,co2_including_luc_per_capita,co2_including_luc_per_gdp,co2_including_luc_per_unit_energy,co2_per_capita,co2_per_gdp,co2_per_unit_energy,coal_co2,coal_co2_per_capita,consumption_co2,consumption_co2_per_capita,consumption_co2_per_gdp,cumulative_cement_co2,cumulative_co2,cumulative_co2_including_luc,cumulative_coal_co2,cumulative_flaring_co2,cumulative_gas_co2,cumulative_luc_co2,cumulative_oil_co2,cumulative_other_co2,energy_per_capita,energy_per_gdp,flaring_co2,flaring_co2_per_capita,gas_co2,gas_co2_per_capita,ghg_excluding_lucf_per_capita,ghg_per_capita,land_use_change_co2,land_use_change_co2_per_capita,methane,methane_per_capita,nitrous_oxide,nitrous_oxide_per_capita,oil_co2,oil_co2_per_capita,other_co2_per_capita,other_industry_co2,primary_energy_consumption,share_global_cement_co2,share_global_co2,share_global_co2_including_luc,share_global_coal_co2,share_global_cumulative_cement_co2,share_global_cumulative_co2,share_global_cumulative_co2_including_luc,share_global_cumulative_coal_co2,share_global_cumulative_flaring_co2,share_global_cumulative_gas_co2,share_global_cumulative_luc_co2,share_global_cumulative_oil_co2,share_global_cumulative_other_co2,share_global_flaring_co2,share_global_gas_co2,share_global_luc_co2,share_global_oil_co2,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
0,Afghanistan,1750,AFG,2802560.0,,0.0,0.0,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Afghanistan,1751,AFG,,,0.0,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Afghanistan,1752,AFG,,,0.0,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,Afghanistan,1753,AFG,,,0.0,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,Afghanistan,1754,AFG,,,0.0,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [4]:
groups_regions.head()

Unnamed: 0,Region Group Name,Country Code
0,Africa Eastern and Southern,AFE
1,Africa Western and Central,AFW
2,Arab World,ARB
3,Central Europe and the Baltics,CEB
4,Caribbean small states,CSS


In [5]:
countries.head()

Unnamed: 0,Country Name,Country Code
0,Afghanistan,AFG
1,Albania,ALB
2,Algeria,DZA
3,American Samoa,ASM
4,Andorra,AND


### Select Data of Interest
Select the most relevant columns for the current analysis.
**Economic Performance Data**
* gdp (2011 inflation adjusted) - _Validate trends consistency with World Bank GDP_ data

**Emissions & Growth**
* co2 (annual CO_2 emissions per country)
* co2_per_capita (CO_2 emissions relative to population)
* co2_per_gdp (measures annual CO_2 emissions intensity according to GDP)

**Sustainability/Energy Effieciency**
* co2_per_unit_energy (measure carbon emissions per unit of energy used)
* energy_per_gdp (primary energy consumption per GDP - energy sustainability metric)

In [6]:
# Select relevant columns
co2_selected = co2_data[['country', 'iso_code', 'year', 'population', 'gdp', 'co2', 'co2_per_capita', 'co2_per_gdp', 'co2_per_unit_energy', 'energy_per_gdp']].copy()
co2_selected.head()

Unnamed: 0,country,iso_code,year,population,gdp,co2,co2_per_capita,co2_per_gdp,co2_per_unit_energy,energy_per_gdp
0,Afghanistan,AFG,1750,2802560.0,,,,,,
1,Afghanistan,AFG,1751,,,,,,,
2,Afghanistan,AFG,1752,,,,,,,
3,Afghanistan,AFG,1753,,,,,,,
4,Afghanistan,AFG,1754,,,,,,,


In [7]:
co2_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50191 entries, 0 to 50190
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   country              50191 non-null  object 
 1   iso_code             42262 non-null  object 
 2   year                 50191 non-null  int64  
 3   population           41019 non-null  float64
 4   gdp                  15251 non-null  float64
 5   co2                  29137 non-null  float64
 6   co2_per_capita       26182 non-null  float64
 7   co2_per_gdp          17528 non-null  float64
 8   co2_per_unit_energy  10350 non-null  float64
 9   energy_per_gdp       7696 non-null   float64
dtypes: float64(7), int64(1), object(2)
memory usage: 3.8+ MB


In [8]:
co2_selected.describe()

Unnamed: 0,year,population,gdp,co2,co2_per_capita,co2_per_gdp,co2_per_unit_energy,energy_per_gdp
count,50191.0,41019.0,15251.0,29137.0,26182.0,17528.0,10350.0,7696.0
mean,1919.883067,56861410.0,330049500000.0,415.698178,3.815391,0.397621,0.239634,1.768091
std,65.627296,319990500.0,3086383000000.0,1945.843973,14.383452,0.753783,0.257531,1.717964
min,1750.0,215.0,49980000.0,0.0,0.0,0.0,0.0,0.078
25%,1875.0,327313.0,7874038000.0,0.374,0.169,0.131,0.175,0.85575
50%,1924.0,2289522.0,27438610000.0,4.99,1.013,0.2635,0.216,1.294
75%,1974.0,9862459.0,121262700000.0,53.273,4.29675,0.508,0.256,2.13925
max,2023.0,8091735000.0,130112600000000.0,37791.57,782.682,82.576,10.686,25.253


In [9]:
co2_selected.shape

(50191, 10)

### Standardize Countries, Regions and Groupings
Like the GDP data, the OWIS dataset includes Income and Regional groups. We will extract this subset and combine it with gdp_groups_regions from the World Bank dataset with the following steps:
1. Align consistent country codes where available
2. Standardize country and region names across datasets
3. Separate country data from regional/economic group data
4. Handle regional aggregates separately

In [10]:
# Some regions and groupings are historical, so let's drop years that are not relevant to our analysis
# Earliest GDP data is from 1960 - Reduce the year range to 1960 to 2023
co2_selected = co2_selected[(co2_selected['year'] >= 1960) & (co2_selected['year'] <= 2023)]

In [11]:
# Select unique country names and their ISO codes
unique_countries_co2 = co2_selected[["country", "iso_code"]].drop_duplicates()

In [12]:
unique_countries_co2.head(10)

Unnamed: 0,country,iso_code
210,Afghanistan,AFG
484,Africa,
658,Africa (GCP),
832,Albania,ALB
1006,Algeria,DZA
1280,Andorra,AND
1454,Angola,AGO
1628,Anguilla,AIA
1902,Antarctica,ATA
2076,Antigua and Barbuda,ATG


In [13]:
unique_countries_co2.count()

country     255
iso_code    218
dtype: int64

#### Align Consistent Country Codes Where Available

In [14]:
# Filter out countries that already have an appropriate country code 
#    according to the World Bank list in countries1.csv
# Use anti-join method
unmatched = unique_countries_co2.merge(
    countries,
    left_on=["country", "iso_code"],
    right_on=["Country Name", "Country Code"],
    how="left",
    indicator=True
).query('_merge == "left_only"').drop(columns=['_merge'])

In [15]:
unmatched = unmatched[["country", "iso_code"]]
unmatched.head(10)

Unnamed: 0,country,iso_code
1,Africa,
2,Africa (GCP),
7,Anguilla,AIA
8,Antarctica,ATA
13,Asia,
14,Asia (GCP),
15,Asia (excl. China and India),
19,Bahamas,BHS
30,Bonaire Sint Eustatius and Saba,BES
35,Brunei,BRN


In [16]:
unmatched.count()

country     76
iso_code    39
dtype: int64

In [17]:
# Next, filter out any matching income/regional group that has an associated country code
unmatched2 = unmatched.merge(
    groups_regions,
    left_on="iso_code",
    right_on="Country Code",
    how="left",
    indicator=True
).query('_merge == "left_only"').drop(columns=['_merge'])


In [18]:
unmatched2 = unmatched2[["country", "iso_code"]]
unmatched2.head(10)

Unnamed: 0,country,iso_code
0,Africa,
1,Africa (GCP),
2,Anguilla,AIA
3,Antarctica,ATA
4,Asia,
5,Asia (GCP),
6,Asia (excl. China and India),
7,Bahamas,BHS
8,Bonaire Sint Eustatius and Saba,BES
9,Brunei,BRN


In [19]:
unmatched2.count()

country     76
iso_code    39
dtype: int64

In [20]:
# Find matches in by country code where the names may differ
matched_regions = unmatched2.merge(
    countries,
    left_on="iso_code",
    right_on="Country Code",
    how="inner"
)
matched_regions

Unnamed: 0,country,iso_code,Country Name,Country Code
0,Bahamas,BHS,"Bahamas, The",BHS
1,Brunei,BRN,Brunei Darussalam,BRN
2,Cape Verde,CPV,Cabo Verde,CPV
3,Congo,COG,"Congo, Rep.",COG
4,Democratic Republic of Congo,COD,"Congo, Dem. Rep.",COD
5,East Timor,TLS,Timor-Leste,TLS
6,Egypt,EGY,"Egypt, Arab Rep.",EGY
7,Gambia,GMB,"Gambia, The",GMB
8,Hong Kong,HKG,"Hong Kong SAR, China",HKG
9,Iran,IRN,"Iran, Islamic Rep.",IRN


#### Update co2_data to reflect the World Data country names

In [21]:
# Set mapping to come from countries dataframe
country_mapping = countries.set_index("Country Code")["Country Name"]

# Update country name in co2_data using iso_code
co2_selected["country"] = co2_selected["iso_code"].map(country_mapping).fillna(co2_selected["country"])

In [22]:
# Show the remaining regions/income groups without country codes
unmatched3 = co2_selected[co2_selected['iso_code'].isnull()][['country', 'iso_code']].drop_duplicates().head(20)
unmatched3

Unnamed: 0,country,iso_code
484,Africa,
658,Africa (GCP),
2876,Asia,
3050,Asia (GCP),
3324,Asia (excl. China and India),
9294,Central America (GCP),
15129,Europe,
15303,Europe (GCP),
15577,Europe (excl. EU-27),
15851,Europe (excl. EU-28),


In [23]:
# Regional GCP data appears to be redundant and will be removed from the DataFrame
co2_selected = co2_selected[~co2_selected["country"].str.contains("(GCP)", case=False, na=False)]
#   In addition to other categories not relevant to the current analysis

# Show the remaining regions/income groups without country codes
unmatched4 = co2_selected[co2_selected['iso_code'].isnull()][['country', 'iso_code']].drop_duplicates()
unmatched4


  co2_selected = co2_selected[~co2_selected["country"].str.contains("(GCP)", case=False, na=False)]


Unnamed: 0,country,iso_code
484,Africa,
2876,Asia,
3324,Asia (excl. China and India),
15129,Europe,
15577,Europe (excl. EU-27),
15851,Europe (excl. EU-28),
16125,European Union (27),
16399,European Union (28),
20507,High-income countries,
22043,International aviation,


In [24]:
# Match income-related categories
filtered_income_countries = co2_selected[co2_selected["country"].str.contains("income", case=False, na=False)][["country", "iso_code"]].drop_duplicates()
filtered_income_countries2 = groups_regions[groups_regions["Region Group Name"].str.contains("income", case=False, na=False)][["Region Group Name", "Country Code"]].drop_duplicates()

print(filtered_income_countries)
print(filtered_income_countries2)


                             country iso_code
20507          High-income countries      NaN
26946           Low-income countries      NaN
27220  Lower-middle-income countries      NaN
48010  Upper-middle-income countries      NaN
                                    Region Group Name Country Code
5         East Asia & Pacific (excluding high income)          EAP
8       Europe & Central Asia (excluding high income)          ECA
13                                        High income          HIC
20  Latin America & Caribbean (excluding high income)          LAC
23                                         Low income          LIC
24                                Lower middle income          LMC
25                                Low & middle income          LMY
28                                      Middle income          MIC
29  Middle East & North Africa (excluding high inc...          MNA
37         Sub-Saharan Africa (excluding high income)          SSA
46                               

In [25]:
# Define mapping for country name and ISO codes
income_mapping = {
    "High-income countries": ("High income", "HIC"),
    "Low-income countries": ("Low income", "LIC"),
    "Lower-middle-income countries": ("Lower middle income", "LMC"),
    "Upper-middle-income countries": ("Upper middle income", "UMC")
}

# Apply transformations
co2_selected["country"] = co2_selected["country"].str.strip()
mask = co2_selected["country"].isin(income_mapping.keys())
co2_selected.loc[mask, ["country", "iso_code"]] = co2_selected["country"].map(income_mapping).apply(pd.Series)

# Show the remaining regions/income groups without country codes
unmatched5 = co2_selected[co2_selected['iso_code'].isnull()][['country', 'iso_code']].drop_duplicates()
unmatched5

Unnamed: 0,country,iso_code
484,Africa,
2876,Asia,
3324,Asia (excl. China and India),
15129,Europe,
15577,Europe (excl. EU-27),
15851,Europe (excl. EU-28),
16125,European Union (27),
16399,European Union (28),
20507,,
22043,International aviation,


In [26]:
# Define country codes to remaining regions
region_iso_mapping = {
    "Africa": "AFR",
    "Asia": "ASI",
    "Asia (excl. China and India)": "ASI-XCNIN",
    "Europe": "EUR",
    "Europe (excl. EU-27)": "EUR-X27",
    "Europe (excl. EU-28)": "EUR-X28",
    "European Union (27)": "EU-27",
    "European Union (28)": "EU-28",
    "North America": "NAM",
    "North America (excl. USA)": "NAM-XUSA",
    "Oceania": "OCE",
    "South America": "SAM",
    "World": "WLD"
}

# Integrate ISO codes into co2_selected dataframe
co2_selected.loc[co2_selected["iso_code"].isna(), "iso_code"] = co2_selected["country"].map(region_iso_mapping)

# Convert mapping to DataFrame
region_groups_df = pd.DataFrame(list(region_iso_mapping.items()), columns=["Region Group Name", "Country Code"])

# Append to existing `regions_groups`
groups_regions = pd.concat([groups_regions, region_groups_df], ignore_index=True)

print(groups_regions)

                                    Region Group Name Country Code
0                         Africa Eastern and Southern          AFE
1                          Africa Western and Central          AFW
2                                          Arab World          ARB
3                      Central Europe and the Baltics          CEB
4                              Caribbean small states          CSS
5         East Asia & Pacific (excluding high income)          EAP
6                          Early-demographic dividend          EAR
7                                 East Asia & Pacific          EAS
8       Europe & Central Asia (excluding high income)          ECA
9                               Europe & Central Asia          ECS
10                                          Euro area          EMU
11                                     European Union          EUU
12           Fragile and conflict affected situations          FCS
13                                        High income         

In [27]:
# Show the remaining regions/income groups without country codes
unmatched6 = co2_selected[co2_selected['iso_code'].isnull()][['country', 'iso_code']].drop_duplicates()
unmatched6

Unnamed: 0,country,iso_code
20507,,
22043,International aviation,
22317,International shipping,
22381,International transport,
24542,Kosovo,
24889,Kuwaiti Oil Fires,
25624,Least developed countries (Jones et al.),
35056,OECD (Jones et al.),
38441,Ryukyu Islands,


In [28]:
# Remove the remaining list of groups
# Define the list of countries to remove
remove_list = [
    "International aviation", "International shipping", "International transport",
    "Kuwaiti Oil Fires", "Least developed countries (Jones et al.)", "OECD (Jones et al.)"
]
# Remove matching rows from co2_selected
co2_selected = co2_selected[~co2_selected["country"].isin(remove_list)]

#### Separate Country data from Regional/Economic group data
We will keep the data in separate datasets to aid in future analysis.

In [29]:
co2_selected.head(3)

Unnamed: 0,country,iso_code,year,population,gdp,co2,co2_per_capita,co2_per_gdp,co2_per_unit_energy,energy_per_gdp
210,Afghanistan,AFG,1960,9035048.0,13033250000.0,0.414,0.046,0.032,,
211,Afghanistan,AFG,1961,9214082.0,13146290000.0,0.491,0.053,0.037,,
212,Afghanistan,AFG,1962,9404411.0,13367630000.0,0.689,0.073,0.052,,


In [30]:
unique_countries = co2_selected['country'].unique()

In [31]:
# Filter co2_selected with many-to-one match for countries
# co2_selected includes "iso_code"
# countries includes "Country Code"

# Get the list of country codes to match against
country_codes = countries['Country Code'].unique()

# Filter co2_selected where iso_code is in country_codes
co2_by_country = co2_selected[co2_selected['iso_code'].isin(country_codes)].copy()

In [32]:
co2_by_country.head(5)

Unnamed: 0,country,iso_code,year,population,gdp,co2,co2_per_capita,co2_per_gdp,co2_per_unit_energy,energy_per_gdp
210,Afghanistan,AFG,1960,9035048.0,13033250000.0,0.414,0.046,0.032,,
211,Afghanistan,AFG,1961,9214082.0,13146290000.0,0.491,0.053,0.037,,
212,Afghanistan,AFG,1962,9404411.0,13367630000.0,0.689,0.073,0.052,,
213,Afghanistan,AFG,1963,9604491.0,13630300000.0,0.707,0.074,0.052,,
214,Afghanistan,AFG,1964,9814318.0,13870500000.0,0.839,0.085,0.06,,


In [33]:
# Rename "iso_code" for consistency
co2_by_country = co2_by_country.rename(columns={"iso_code": "Country Code"})

In [34]:
# Filter co2_selected with many-to-one match for regional/economic groups
# co2_selected includes "iso_code"
# groups_regions includes "Country Code"

# Get the list of country codes to match against
country_codes_groups = groups_regions['Country Code'].unique()

# Filter co2_selected where iso_code is in country_codes
co2_by_groups = co2_selected[co2_selected['iso_code'].isin(country_codes_groups)].copy()

In [35]:
co2_by_groups.head(5)

Unnamed: 0,country,iso_code,year,population,gdp,co2,co2_per_capita,co2_per_gdp,co2_per_unit_energy,energy_per_gdp
484,Africa,AFR,1960,283922289.0,,156.567,0.552,0.345,,
485,Africa,AFR,1961,290814083.0,,161.994,0.558,0.349,,
486,Africa,AFR,1962,297959967.0,,166.335,0.559,0.346,,
487,Africa,AFR,1963,305371382.0,,176.335,0.578,0.342,,
488,Africa,AFR,1964,313060731.0,,193.639,0.619,0.356,,


In [36]:
co2_by_groups['country'].unique()

array(['Africa', 'Asia', 'Asia (excl. China and India)', 'Europe',
       'Europe (excl. EU-27)', 'Europe (excl. EU-28)',
       'European Union (27)', 'European Union (28)', 'Namibia',
       'North America', 'North America (excl. USA)', 'Oceania',
       'South America', 'World'], dtype=object)

In [37]:
# Rename "country" column to reflect change to groups
# Rename "iso_code" for consistency
co2_by_groups = co2_by_groups.rename(columns={
    "country": "Region/Group",
    "iso_code": "Country Code"
    })
co2_by_groups.head(3)

Unnamed: 0,Region/Group,Country Code,year,population,gdp,co2,co2_per_capita,co2_per_gdp,co2_per_unit_energy,energy_per_gdp
484,Africa,AFR,1960,283922289.0,,156.567,0.552,0.345,,
485,Africa,AFR,1961,290814083.0,,161.994,0.558,0.349,,
486,Africa,AFR,1962,297959967.0,,166.335,0.559,0.346,,


In [38]:
# Check if there are any remaining stragglers that are in neither new dataset

# Get a list of all the country codes present in the DataFrames
present_country_codes = pd.concat([
    co2_by_country['Country Code'],
    co2_by_groups['Country Code']
]).unique()

# Find any countries that are not in the list of present country codes
unmatched_countries = co2_selected[~co2_selected['iso_code'].isin(present_country_codes)].copy()

print(len(unmatched_countries))

1152


In [39]:
print(unmatched_countries['country'].unique())

['Anguilla' 'Antarctica' 'Bonaire Sint Eustatius and Saba'
 'Christmas Island' 'Cook Islands' nan 'Kosovo' 'Montserrat' 'Niue'
 'Ryukyu Islands' 'Saint Helena' 'Saint Pierre and Miquelon' 'Taiwan'
 'Vatican' 'Wallis and Futuna']


#### Comments
Many of these can be safely excluded from the data, including overseas territories, tiny island nations and unusual entities. We will manually handle those which may be relevant to the data.

In [40]:
# Filter out countries
manual_countries = co2_selected[co2_selected['country'].isin(['Taiwan', 'Kosovo'])].copy()

# Fix iso_codes
iso_corrections = {'Taiwan': 'TWN', 'Kosovo': 'XK'}
manual_countries['iso_code'] = manual_countries['country'].map(iso_corrections)

In [41]:
# Rename columns to match co2_by_country
manual_countries = manual_countries.rename(columns={"iso_code": "Country Code"})
manual_countries.head(3)

Unnamed: 0,country,Country Code,year,population,gdp,co2,co2_per_capita,co2_per_gdp,co2_per_unit_energy,energy_per_gdp
24542,Kosovo,XK,1960,984853.0,,,,,,
24543,Kosovo,XK,1961,1011428.0,,,,,,
24544,Kosovo,XK,1962,1036955.0,,,,,,


In [42]:
# Concatenate manual_countries with co2_by_country
co2_by_country = pd.concat([co2_by_country, manual_countries], ignore_index=True).drop_duplicates()

In [43]:
# Check if any countries are present in both countries and groups datasets
duplicates = set(co2_by_country['country']) & set(co2_by_groups['Region/Group'])
print(sorted(duplicates))

['Namibia', 'North America']


In [44]:
# Namibia doesn't belong in co2_by_groups
co2_by_groups = co2_by_groups[co2_by_groups['Region/Group'] != 'Namibia']

# North America doesn't belong ins co2_by_country
co2_by_country = co2_by_country[co2_by_country['country'] != 'North America']

# check duplicates again
duplicates = set(co2_by_country['country']) & set(co2_by_groups['Region/Group'])
print(sorted(duplicates))

[]


---
## Export data
### Export revised region/income groupings

In [45]:
# Comment out to avoid duplicate exports
# groups_regions.to_csv("../data/in_process/1b_co2_region_groups_structured.csv")

### Export CO_2 Emissions Datasets by Country and Groups

In [46]:
# Comment out to avoid duplicate exports
# co2_by_country.to_csv("../data/in_process/1b_co2_by_country_structured.csv")

# Comment out to avoid duplicate exports
# co2_by_groups.to_csv("../data/in_process/1b_co2_by_groups_structured.csv")