# Clean and Merge PM2.5, SDI, and GBD Data

## Introduction

To address the first part of the research question, we use three separate datasets:

- **PM2.5 dataset** from WHO  
- **SDI dataset** from IHME  
- **GBD dataset** from IHME  

Details about these datasets are provided in the [datasets documentation](../1_datasets/README.md).

In this notebook, we **clean and merge** these datasets to prepare them for analysis.

### Cleaning Strategy

Each dataset provides measurements by **country and year**. To bring them together, we:

1. **Match country names across datasets**
2. **Define a `Country` class** to store data per country
3. **Load data into class instances**
4. **Extract data from the class objects and save it into a CSV**

---

### Imports

In [1]:
import pandas as pd
from rapidfuzz import process, fuzz

### Load Raw Data

In [4]:
# Load the datasets
PM25_df = pd.read_csv('../1_datasets/raw_datasets/WHO-PM2.5 Data.csv')
SDI_df = pd.read_csv('../1_datasets/raw_datasets/IHME-SDI Data.csv')
GBD_df = pd.read_csv('../1_datasets/raw_datasets/IHME-GBD Data.csv')

In [5]:
# Filter SDI data to include only 2010-2019
SDI_df = SDI_df[(SDI_df['year_id'] >= 2010) & (SDI_df['year_id'] <= 2019)]

### Stage 1: Match Country Names

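We use the country list from the WHO PM2.5 dataset as the reference set, since it contains exactly 195 countries. Where a country name is spelled differently in another dataset, we rename it in the PM2.5 data so that all three datasets share the same spelling.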

In [25]:
# Helper function for fuzzy matching

def match_countries_semantically(set1, set2):
    """
    Matches each country name in set1 with the most similar name in set2
    using fuzzy string matching (string similarity, not semantic similarity).

    Parameters:
    set1 : iterable, a set or list of country names to be matched.
    set2 : iterable, a set or list of reference country names to match against.

    Returns:
    matches : A dictionary where keys are items from set1 and values are
    the best-matched items from set2, based on token sort ratio similarity.
    The best match is accepted regardless of how low its score is.
    """
    matches = {}
    for country in set1:
        best_match, score, _ = process.extractOne(
            country, set2, scorer=fuzz.token_sort_ratio
        )
        matches[country] = best_match
    return matches
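
The helper above always accepts the top-scoring candidate, however weak the match. Below is a minimal sketch of adding a score cutoff so that weak matches are flagged for manual review. It uses the standard library's `difflib.SequenceMatcher` in place of rapidfuzz's `fuzz.token_sort_ratio`, and the function name `match_with_cutoff` and the cutoff value are assumptions, not part of this notebook.

```python
from difflib import SequenceMatcher


def match_with_cutoff(names, references, cutoff=0.6):
    """Map each name to its closest reference, or None if the best score is below cutoff.

    Hypothetical helper: scores are difflib ratios (0.0 to 1.0), not rapidfuzz scores.
    """
    matches = {}
    for name in names:
        # Score every reference name and keep the highest-scoring one
        scored = [
            (SequenceMatcher(None, name.lower(), ref.lower()).ratio(), ref)
            for ref in references
        ]
        score, best = max(scored)
        # Names without a sufficiently close reference are flagged with None
        matches[name] = best if score >= cutoff else None
    return matches


result = match_with_cutoff(["Cote d'Ivoire", "Atlantis"], ["Côte d'Ivoire", "Palestine"])
print(result)
```

With rapidfuzz itself, `process.extractOne` accepts a `score_cutoff` argument that serves a similar purpose. A suitable threshold would need to be tuned, since long official names such as the WHO spelling of the United Kingdom score low against their short forms.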

In [8]:
# Create country sets
PM25_countries = set(PM25_df['Location'])
SDI_countries = set(SDI_df['location_name'])
GBD_countries = set(GBD_df['location'])

#### PM2.5 countries vs. SDI countries

In [9]:
# Identify mismatches
PM25_only = PM25_countries - SDI_countries
PM25_only

{"Cote d'Ivoire",
 'Netherlands (Kingdom of the)',
 'United Kingdom of Great Britain and Northern Ireland',
 'occupied Palestinian territory, including east Jerusalem'}

In [10]:
# Manual correction: the target spellings, as used in the SDI dataset
correct_spelling = {
    "Côte d'Ivoire",
    "United Kingdom",
    "Netherlands",
    "Palestine"
}

# Apply matching
matches = match_countries_semantically(PM25_only, correct_spelling)

# Replace in dataframe
for country, matched in matches.items():
    PM25_df['Location'] = PM25_df['Location'].replace(country, matched)
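
The loop above calls `replace` once per corrected name. As a sketch of an equivalent one-step form, `Series.replace` also accepts a dictionary that maps old values to new ones. The frame and mapping below are toy stand-ins (`toy_df` and `toy_matches` are hypothetical names), not the real data.

```python
import pandas as pd

# Toy frame standing in for PM25_df; the mapping mirrors one of the corrections above
toy_df = pd.DataFrame({"Location": ["Cote d'Ivoire", "Japan", "Cote d'Ivoire"]})
toy_matches = {"Cote d'Ivoire": "Côte d'Ivoire"}

# One call replaces every key with its mapped value; other names are left untouched
toy_df["Location"] = toy_df["Location"].replace(toy_matches)
print(toy_df["Location"].tolist())
```

Printing the `matches` dictionary before applying it is a cheap way to confirm that each fuzzy match points to the intended country.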

In [12]:
# Recheck for names that are still missing from the SDI dataset
PM25_countries = set(PM25_df['Location'])
PM25_only = PM25_countries - SDI_countries
PM25_only # Should be an empty set

set()

#### PM2.5 countries vs. GBD countries

In [13]:
# Check GBD match
PM25_countries - GBD_countries

set()

No mismatches were found, so the GBD country names need no correction.
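
The checks above only look for PM2.5 names that are missing from the other datasets. Below is a small sketch of checking both directions, using toy sets whose contents are illustrative rather than taken from the real files. The reverse difference would reveal locations that appear in SDI or GBD but not in the PM2.5 reference list.

```python
# Toy stand-ins for PM25_countries and SDI_countries (illustrative values only)
reference = {"Japan", "Kenya", "Peru"}
other = {"Japan", "Kenya", "Peru", "Global"}

# Reference names that the other dataset lacks
missing_from_other = reference - other
# Names that only the other dataset has
extra_in_other = other - reference

print(missing_from_other)
print(extra_in_other)
```

Extra locations are harmless for this notebook, because the loading loop in Stage 3 iterates over the PM2.5 country names only.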

### Stage 2: Define the `Country` Class

We use a class to store each country's data.

In [None]:
class Country(object):
    """
    A class to represent a country and hold various health and environmental datasets for analysis.

    Attributes:
    name : String, name of the country.
    PM25_total, PM25_urban, PM25_towns, PM25_rural, PM25_cities : Dictionary, PM2.5 pollution data
    categorized by region type and year.
    SDI : Dictionary, Socio-demographic Index data by year.
    all_causes, cardiovascular, stroke, respiratory : Dictionary, health outcome data from the Global
    Burden of Disease study by cause and year.

    Class Attributes:
    countries : list, a list containing all instances of the Country class.

    Methods:
    get_name(): Returns the name of the country.
    get_country(name): Returns the Country instance with the given name, if it exists.
    load_PM25_data(dataframe): Loads PM2.5 data from a dataframe.
    load_SDI_data(dataframe): Loads SDI data from a dataframe.
    load_GBD_data(dataframe): Loads health outcome data from a dataframe.
    """
    countries = []

    def __init__(self, name):
        """Initializes the object with a country name and registers it in Country.countries."""
        self.name = name
        self.PM25_total = {}
        self.PM25_urban = {}
        self.PM25_towns = {}
        self.PM25_rural = {}
        self.PM25_cities = {}
        self.SDI = {}
        self.all_causes = {}
        self.cardiovascular = {}
        self.stroke = {}
        self.respiratory = {}
        Country.countries.append(self)

    def get_name(self):
        """Returns the name of the country."""
        return self.name

    @staticmethod
    def get_country(name):
        """
        Searches for a country instance by name.

        Parameters:
        name : string, the name of the country to search for.

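        Returns:
        The Country object with the given name.

        Raises:
        KeyError if no country with that name exists.
        """
        for country in Country.countries:
            if country.get_name() == name:
                return country
        # Raise instead of returning a string, so a typo cannot be mistaken for a Country
        raise KeyError(f"There is no listed country with that name: {name!r}")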

    def load_PM25_data(self, dataframe):
        """
        Loads PM2.5 pollution data into the country instance from a pandas DataFrame.
        The DataFrame is expected to have columns: 'Period', 'FactValueNumeric', and 'Dim1'.
        """
        for _, row in dataframe.iterrows():
            year = row['Period']
            value = row['FactValueNumeric']
            dim = row['Dim1']
            if dim == 'Total':
                self.PM25_total[year] = value
            elif dim == 'Cities':
                self.PM25_cities[year] = value
            elif dim == 'Urban':
                self.PM25_urban[year] = value
            elif dim == 'Towns':
                self.PM25_towns[year] = value
            elif dim == 'Rural':
                self.PM25_rural[year] = value

    def load_SDI_data(self, dataframe):
        """
        Loads SDI (Socio-demographic Index) data into the country instance from a pandas DataFrame.
        The DataFrame is expected to have columns: 'year_id' and 'mean_value'.
        """
        for _, row in dataframe.iterrows():
            year = row['year_id']
            value = row['mean_value']
            self.SDI[year] = value

    def load_GBD_data(self, dataframe):
        """
        Loads Global Burden of Disease data into the country instance from a pandas DataFrame.
        The DataFrame is expected to have columns: 'year', 'val', and 'cause'.
        """
        for _, row in dataframe.iterrows():
            year = row['year']
            value = row['val']
            cause = row['cause']
            if cause == 'All causes':
                self.all_causes[year] = value
            elif cause == 'Cardiovascular diseases':
                self.cardiovascular[year] = value
            elif cause == 'Stroke':
                self.stroke[year] = value
            elif cause == 'Chronic respiratory diseases':
                self.respiratory[year] = value
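
The `if`/`elif` chains in the loaders route each row to a dictionary based on a label. One possible alternative, sketched here under assumptions, is a lookup table from label to attribute name. `PM25Holder` and `load_rows` are hypothetical names, the class is reduced to the PM2.5 attributes, and the values are invented.

```python
class PM25Holder:
    """Hypothetical stand-in for Country, reduced to the PM2.5 attributes."""

    # Maps each 'Dim1' label to the attribute that stores its values
    DIM_TO_ATTR = {
        "Total": "PM25_total",
        "Cities": "PM25_cities",
        "Urban": "PM25_urban",
        "Towns": "PM25_towns",
        "Rural": "PM25_rural",
    }

    def __init__(self):
        # Create one empty year -> value dictionary per label
        for attr in self.DIM_TO_ATTR.values():
            setattr(self, attr, {})

    def load_rows(self, rows):
        """Store (year, value, dim) tuples; rows with an unknown dim are skipped."""
        for year, value, dim in rows:
            attr = self.DIM_TO_ATTR.get(dim)
            if attr is not None:
                getattr(self, attr)[year] = value


holder = PM25Holder()
holder.load_rows([(2019, 1.0, "Total"), (2019, 2.0, "Urban"), (2019, 3.0, "Other")])
print(holder.PM25_total, holder.PM25_urban)
```

Like the original loaders, this sketch silently ignores labels it does not recognize. Adding a category then means adding one entry to the table rather than another branch.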

### Stage 3: Load Data into `Country` Objects

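The following script groups each dataset by country, creates one `Country` object per country, and loads that country's rows into it. Because every new instance is appended to `Country.countries`, re-run the class definition cell before re-running this one to avoid duplicate objects.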

In [16]:
# Group data by country
PM25_country_groups = PM25_df.groupby('Location')
SDI_country_groups = SDI_df.groupby('location_name')
GBD_country_groups = GBD_df.groupby('location')

# Create Country instances and load data
for name in PM25_countries:
    country = Country(name)
    PM25_data = PM25_country_groups.get_group(name)
    SDI_data = SDI_country_groups.get_group(name)
    GBD_data = GBD_country_groups.get_group(name)
    country.load_PM25_data(PM25_data)
    country.load_SDI_data(SDI_data)
    country.load_GBD_data(GBD_data)

In [None]:
# Checks
print(len(Country.countries)) # There should be 195 country objects
Country.get_country('Japan').PM25_total # should output a dict of values per year (2010-2019)

195


{2019: 10.84,
 2018: 10.87,
 2017: 11.89,
 2016: 12.54,
 2015: 12.75,
 2014: 13.41,
 2013: 13.45,
 2012: 12.37,
 2011: 13.91,
 2010: 14.22}

### Stage 4: Data Extraction

We now extract the data from the `Country` objects and write it to a CSV file with one row per country and year, a format suitable for analysis.

In [23]:
# Create a list to store rows
data_rows = []

# Loop over all Country objects
for country in Country.countries:
    # Get the list of years in the PM2.5 dictionary (assumed to have the 2010–2019 keys)
    years = sorted(country.PM25_total.keys())
    
    for year in years:
        row = {
            'Country': country.name,
            'Year': year,
            'SDI': country.SDI.get(year, None),
            'PM2.5': country.PM25_total.get(year, None),
            'All-cause DALYs': country.all_causes.get(year, None),
            'Cardiovascular DALYs': country.cardiovascular.get(year, None),
            'Stroke DALYs': country.stroke.get(year, None),
            'Respiratory DALYs': country.respiratory.get(year, None)
        }
        data_rows.append(row)
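
The same country-year table could also be assembled directly in pandas, which may serve as a cross-check on the loop above. Below is a hedged sketch with toy frames: the column names follow the loaders in Stage 2, the values are invented, and the GBD step is omitted for brevity.

```python
import pandas as pd

# Toy stand-ins for the cleaned frames; values are invented for illustration
pm25 = pd.DataFrame({
    "Location": ["Japan", "Japan"],
    "Period": [2018, 2019],
    "Dim1": ["Total", "Total"],
    "FactValueNumeric": [2.0, 1.0],
})
sdi = pd.DataFrame({
    "location_name": ["Japan", "Japan"],
    "year_id": [2018, 2019],
    "mean_value": [0.5, 0.6],
})

# Keep the 'Total' rows and rename columns to the output schema
pm25_total = (
    pm25[pm25["Dim1"] == "Total"]
    .rename(columns={"Location": "Country", "Period": "Year", "FactValueNumeric": "PM2.5"})
    [["Country", "Year", "PM2.5"]]
)
sdi_tidy = sdi.rename(
    columns={"location_name": "Country", "year_id": "Year", "mean_value": "SDI"}
)

# Left join on the (Country, Year) key; validate raises if a key is duplicated
merged = pm25_total.merge(
    sdi_tidy, on=["Country", "Year"], how="left", validate="one_to_one"
)
print(merged)
```

The `validate="one_to_one"` argument makes the merge fail loudly if any country-year pair appears more than once in either frame, a condition the dictionary-based loaders would hide by overwriting earlier values.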

In [None]:
# Convert list of dicts to DataFrame
clean_df = pd.DataFrame(data_rows)

# Check: 195 countries x 10 years (2010-2019) should give 1950 rows
len(clean_df)

1950
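
The row count alone would not reveal missing values, because the extraction loop fills any absent year with `None`. Below is a short sketch of two further checks, run here on a toy frame (`toy_clean` is a hypothetical stand-in for `clean_df`, with invented values).

```python
import pandas as pd

# Toy stand-in for clean_df with one missing SDI value (invented data)
toy_clean = pd.DataFrame({
    "Country": ["Japan", "Japan", "Kenya"],
    "Year": [2018, 2019, 2019],
    "SDI": [0.5, None, 0.4],
})

# No (Country, Year) pair should appear twice
has_duplicates = toy_clean.duplicated(subset=["Country", "Year"]).any()

# Count missing values per column to spot lookups that fell back to None
missing_counts = toy_clean.isna().sum()

print(has_duplicates)
print(missing_counts)
```

Applied to `clean_df`, a nonzero count in any column would point to a country-year pair that one of the source datasets does not cover.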

In [27]:
# Save to CSV
output_filename = '../1_datasets/final_datasets/clean_merged_data.csv'
clean_df.to_csv(output_filename, index=False)

print(f"✅ Clean data saved to ({output_filename})")

✅ Clean data saved to (../1_datasets/final_datasets/clean_merged_data.csv)
