# <div style="text-align:center">Tropical Cyclone impact data comparison between Wikimpacts1.0 and EM-DAT database </div>

<div style="text-align:center">
PhiRu Environmental Engineering Members: <br>
Bernal, Chiara (r) <br>
Caligdong, Ronan (r0966302) <br>
Espejo, Kristine Nadeen (r1017911) <br>
Haghebaert, Lukas (r0826858) <br>
</div>

## Dataset
1. **Wikimpacts 1.0**: contains data on the occurrence and impacts of climate extremes at country and sub-national scales. The database is derived from Wikipedia using generative AI. <br>
2. **EM-DAT**: downloaded from the public EM-DAT platform, keeping only “tropical cyclone” records.

## Tasks


1. Download the Wikimpacts 1.0 database in db format. 
2. Load Data:   
- Read the database file and load all tables that start with "Total" into a DataFrame named `L1`.
- Identify all tables that start with "Specific" and load them into separate DataFrames named `L3_*`, where `*` represents impact categories, only load Deaths, Injuries and Damage.


This code extracts data from the raw tables. It pulls only the necessary data and stores it in separate dataframes.

Importing necessary modules.

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import ast # This library turns string "[...]" into list [...]
import matplotlib.pyplot as plt
import seaborn as sns

db_path = "impactdb.v1.0.2.dg_filled.db"  # <-- database
conn = sqlite3.connect(db_path)


This code queries the database for the list of all existing tables and keeps the names of the tables we are interested in.

In [None]:
# List all tables
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table';", conn)

all_total_tables = tables[tables["name"].str.startswith("Total")]["name"]


This code reads every table whose name starts with 'Total' and concatenates them into one dataframe (`L1`).

In [None]:
# Concatenate to one big L1 dataframe
L1_list = []
for table_name in all_total_tables:
    df = pd.read_sql(f"SELECT * FROM {table_name};", conn)
    df["source_table"] = table_name
    L1_list.append(df)

L1 = pd.concat(L1_list, ignore_index=True)
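
As a self-contained illustration of this loading step, the sketch below builds a tiny in-memory SQLite database and concatenates every table whose name starts with "Total". The table names and the helper `load_tables_with_prefix` are hypothetical; only the `sqlite_master` query and the `source_table` column mirror the cells above.

```python
import sqlite3
import pandas as pd

def load_tables_with_prefix(conn, prefix):
    """Concatenate all tables whose name starts with `prefix` (hypothetical helper)."""
    names = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)["name"]
    frames = []
    for name in names[names.str.startswith(prefix)]:
        # Quote the identifier so unusual table names do not break the query
        df = pd.read_sql(f'SELECT * FROM "{name}";', conn)
        df["source_table"] = name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Toy database with made-up table names and values
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({"Event_ID": ["a1"], "Main_Event": ["Flood"]}).to_sql(
    "Total_Summary_A", demo_conn, index=False)
pd.DataFrame({"Event_ID": ["b2"], "Main_Event": ["Tropical Storm/Cyclone"]}).to_sql(
    "Total_Summary_B", demo_conn, index=False)
pd.DataFrame({"Event_ID": ["a1"], "Num_Min": [3]}).to_sql(
    "Specific_Demo_Deaths", demo_conn, index=False)

L1_demo = load_tables_with_prefix(demo_conn, "Total")
print(L1_demo)
```

Quoting the table name inside the query is a small precaution in case a table name contains characters that SQLite would otherwise misread.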


This code sorts the tables that start with 'Specific' into impact categories (deaths, injuries and damage).

In [None]:
spec_tables = tables[tables["name"].str.startswith("Specific")]["name"].tolist()

L3 = {}  # empty dictionary of category -> dataframe

for table_name in spec_tables:  # for each table that starts with "Specific"
    # classify tables into three impacts: deaths, injuries & damage
    if "Deaths" in table_name:
        category = "Deaths"
    elif "Injuries" in table_name:
        category = "Injuries"
    elif "Damage" in table_name:
        category = "Damage"
    else:
        continue

    df = pd.read_sql(f"SELECT * FROM {table_name};", conn)
    df["source_table"] = table_name
    L3.setdefault(category, []).append(df)  # if the category is not in the dictionary yet, create an empty list for it, then add this dataframe to that list


This code concatenates the tables collected for each category, giving three dataframes: one each for deaths, injuries and damage.

In [None]:
# Get only Deaths, Injuries and Damage
for category in L3:
    L3[category] = pd.concat(L3[category], ignore_index=True)

L3_Deaths = L3.get("Deaths")
L3_Injuries = L3.get("Injuries")
L3_Damage = L3.get("Damage")

3. Filter by “Tropical Storm/Cyclone”:
- Using the “Main_Event”, filter the Tropical Storm/Cyclone events from L1 into a new dataframe “L1_TC”
- Using “Event_ID” from “L1_TC”, filter the “L3_*” with only impact from Tropical Storm/Cyclone
- For the “Start/End_Date_Year,” “Start/End_Date_Month,” and “Start/End_Date_Day” columns: if these date fields are missing in `L3_*`, fill them with the corresponding information from `L1_TC`.
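
The three bullets of Step 3 can be sketched on toy data as follows. This is only an illustration under stated assumptions: the label `"Tropical Storm/Cyclone"` in `Main_Event` is taken from the task wording, the `*_demo` frames and their values are made up, and only two of the date columns are shown.

```python
import pandas as pd

# Toy stand-ins for L1 and one L3_* table (made-up values)
L1_demo = pd.DataFrame({
    "Event_ID": ["e1", "e2", "e3"],
    "Main_Event": ["Tropical Storm/Cyclone", "Flood", "Tropical Storm/Cyclone"],
    "Start_Date_Year": [1998, 2005, 2013],
    "Start_Date_Month": [9, 6, 11],
})
L3_demo = pd.DataFrame({
    "Event_ID": ["e1", "e2", "e3", "e3"],
    "Start_Date_Year": [1998, 2005, None, 2013],
    "Start_Date_Month": [None, 6, None, 11],
    "Num_Min": [10, 2, 50, 7],
})

# 1. Keep only tropical cyclone events in L1
L1_TC_demo = L1_demo[L1_demo["Main_Event"] == "Tropical Storm/Cyclone"].copy()

# 2. Keep only L3 rows whose Event_ID belongs to a tropical cyclone
L3_TC_demo = L3_demo[L3_demo["Event_ID"].isin(L1_TC_demo["Event_ID"])].copy()

# 3. Fill missing date fields from L1_TC, matched on Event_ID
date_cols = ["Start_Date_Year", "Start_Date_Month"]
lookup = L1_TC_demo.drop_duplicates("Event_ID").set_index("Event_ID")
for col in date_cols:
    fallback = L3_TC_demo["Event_ID"].map(lookup[col])
    L3_TC_demo[col] = L3_TC_demo[col].fillna(fallback)

print(L3_TC_demo)
```

Mapping on `Event_ID` keeps values that are already present in `L3_*` and only fills the gaps, which is what the task asks for.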

4. Filter by Date:
- In each `L3_*` DataFrame, filter the records to include only those events that occurred after the year 1900. Name these filtered DataFrames as `L3_*_1900`.

In [None]:
def filter_year(df, year):
    """Return a copy of `df` keeping only rows with Start_Date_Year > year.

    Example: year=1900 keeps events that started in 1901 or later.
    """
    if not isinstance(year, int):
        # Fail loudly instead of silently returning None
        raise TypeError("Year must be an int data type")
    year_mask = df["Start_Date_Year"] > year
    return df[year_mask].copy()
        
year_to_filter = 1900
L3_Deaths_TC_1900 = filter_year(L3_Deaths_TC, year_to_filter)
L3_Injuries_TC_1900 = filter_year(L3_Injuries_TC, year_to_filter)
L3_Damage_TC_1900 = filter_year(L3_Damage_TC, year_to_filter)

We created a function that filters a dataframe by year. It only works for dataframes that have a column titled "Start_Date_Year". <br>
An explanation of how the function works was added in the function's docstring, and an if statement was added to help users troubleshoot errors they may encounter.

5. Aggregate by Administrative Area:
- Using the “Administrative_Area_GID” column in each `L3_*_1900` DataFrame obtained from Step 4, aggregate the impacts that share the same “Event_ID” and the same “Administrative_Area_GID”. <br>
- Only consider rows with one valid GID (in specific cases, such as one country involving several GIDs, use only the GID without digits, or the first three letters). Name the new dataframe `L3_*_1900_aggregated`.

In [1]:
# ----- GID CLEANING FUNCTION (applied to one cell at a time) -----

def get_single_valid_gid(gid_entry):  # Checks a single GID entry at a time

    # 1. Handle empty or missing cells -> return NaN
    if gid_entry is None or (isinstance(gid_entry, float) and np.isnan(gid_entry)):
        return np.nan

    # 2. Convert strings that LOOK like lists into real Python lists
    #    Examples:
    #        "['USA']"      -> ['USA']
    #        "[['USA']]"    -> [['USA']]
    #        "USA"          -> ['USA'] (literal_eval fails, so the string is wrapped)
    if isinstance(gid_entry, str):
        try:  # ast.literal_eval safely parses a string into a Python object
            check_stringorlist = ast.literal_eval(gid_entry)
            # If literal_eval returns a list, use it
            if isinstance(check_stringorlist, list):
                gid_entry = check_stringorlist
        except (ValueError, SyntaxError):
            # If literal_eval fails, treat the string as a single element
            gid_entry = [gid_entry]

    # 3. Ensure the entry is ALWAYS treated as a flat list of strings
    #    Cases handled:
    #        gid_entry = "USA"        -> ['USA']
    #        gid_entry = ['USA']      -> ['USA']
    #        gid_entry = [['USA']]    -> ['USA']

    if isinstance(gid_entry, str):
        elements = [gid_entry]  #wrap single string in a list
    else:
        #If it's a list, flatten and ensure all elements are strings
        #Example: [['USA']] -> ['USA']
        flat_list = []
        for e in gid_entry:
            if isinstance(e, list):
                flat_list.extend(e)  # Flatten nested lists
            else:
                flat_list.append(e)
        # Convert all elements to strings and remove NaNs
        elements = [str(e) for e in flat_list if pd.notna(e)]

    # 4. Extract valid 3-letter country codes
    valid_codes = []  # Start an empty list to store valid country codes
    
    for e in elements:  # Loop through every cleaned element
        # Clean formatting: remove whitespace, take first 3 chars, force UPPERCASE
        # Examples:
        #   'AUS.10' → 'AUS'
        #   'chn'    → 'CHN'
        code = e.strip()[:3].upper()

        # Validation rule:
        # Must be exactly 3 letters AND contain only letters
        if len(code) == 3 and code.isalpha():
            valid_codes.append(code)

    # 5. Enforce "Single Valid GID"
    #    Only accept rows with EXACTLY ONE valid country code
    if len(valid_codes) == 1:
        return valid_codes[0]  # Return the clean code (e.g., 'CHN')
    else:
        return np.nan  # If zero or multiple codes found → discard row

In this part of the code, we created a function that cleans each GID entry one at a time.<br>
We handled the cleaning in several steps:

- **First, we checked whether the cell was empty or missing.**<br>
    -> If it was, we simply returned `NaN` so we wouldn’t process invalid or unusable data.<br>
    -> This prevents errors and keeps the dataset clean from the start.

- **Next, we handled entries that were stored as text.**<br>
    -> Many GID values were saved as strings that *looked* like lists (for example: `"['USA']"` or `"[['USA']]"`).<br>
    -> To deal with this, we imported the `ast` library because it allows us to safely convert these string representations into actual Python list objects.

- **Then, we attempted to convert the text into real Python lists using `ast.literal_eval`.**<br>
    -> If the conversion worked and produced a list, we used that list as the cleaned version of the entry.<br>
    -> If the conversion failed (for example, if the value was just `"USA"`), we treated the value as a single‑item list like `['USA']` so that all entries follow the same structure.<br>
    -> By doing this, we standardized all GID formats into clean, consistent lists, making them much easier to filter, validate, and aggregate later in the process.
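
To make the string-to-list step concrete, here is a minimal sketch of how `ast.literal_eval` behaves on the kinds of values described above. The helper name `parse_gid_text` and the sample strings are illustrative only and are not part of the notebook.

```python
import ast

def parse_gid_text(text):
    """Return a list for list-like strings, else wrap the raw string (hypothetical helper)."""
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return [text]  # e.g. "USA" is not a valid Python literal
    return parsed if isinstance(parsed, list) else [text]

examples = ["['USA']", "[['USA']]", "USA", "['PHL', 'PHL.12_1']"]
parsed_examples = {text: parse_gid_text(text) for text in examples}
for text, value in parsed_examples.items():
    print(repr(text), "->", value)
```

`ast.literal_eval` only evaluates Python literals (strings, numbers, lists and so on), so unlike `eval` it cannot run arbitrary code found in a cell.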


In the next part of the function, we made sure that every GID entry is treated as a clean list of strings and then extracted a single valid country code from it.<br>
We did this in several steps:

- **First, we ensured that the entry is always treated as a list.**<br>
    -> If `gid_entry` was just a single string like `"USA"`, we wrapped it into a list, becoming `['USA']`.<br>
    -> If `gid_entry` was already a list (for example `['USA']` or even `[['USA']]`), we processed it differently in the next step.<br>
    -> This step guarantees that, no matter the original format, we can handle all entries in a consistent way.

- **Next, we flattened list structures and cleaned the elements.**<br>
    -> If `gid_entry` was a list, we created an empty list called `flat_list` and went through each element `e`.<br>
    -> If an element `e` was itself a list (e.g., `['USA']` inside `[['USA']]`), we extended `flat_list` with its contents to remove nesting.<br>
    -> If `e` was not a list, we simply appended it to `flat_list`.<br>
    -> After flattening, we converted all elements to strings and removed any `NaN` values, storing the result in `elements`.<br>
    -> This step makes sure we end up with a simple, clean list of string values that we can safely process.

- **Then, we extracted valid 3-letter country codes from these cleaned elements.**<br>
    -> We created an empty list called `valid_codes` to store valid country codes.<br>
    -> For each element `e` in `elements`, we removed extra spaces, took only the first three characters, and converted them to uppercase.<br>
    -> For example: `'AUS.10'` becomes `'AUS'`, and `'chn'` becomes `'CHN'`.<br>
    -> We then checked if this code was exactly 3 characters long and contained only letters. If so, we added it to `valid_codes`.<br>
    -> This step ensures that we only keep properly formatted 3-letter country codes.

- **Finally, we enforced the “single valid GID” rule.**<br>
    -> If `valid_codes` contained exactly one valid country code, we returned that code (for example, `'CHN'`).<br>
    -> If there were no valid codes or more than one, we returned `NaN` and discarded that row.<br>
    -> This rule guarantees that only rows with one clear, unambiguous GID are kept for later analysis and aggregation.

In [None]:
# --- MAIN PROCESSING AND AGGREGATION FUNCTION ---
def clean_dataframe(df):
    df_clean = df.copy()

    # 1. IDENTIFY THE COLUMN
    if 'Administrative_Area_GID' in df_clean.columns:
        target_col = 'Administrative_Area_GID'

        print("TARGET COLUMN:", target_col)
        print("FIRST VALUES:\n", df_clean[target_col].head())
        print("COLUMN DTYPE:", df_clean[target_col].dtype)
        print("PYTHON TYPE OF VALUE:", type(df_clean[target_col].iloc[2]))

    elif 'Administrative_Areas_GID' in df_clean.columns:
        target_col = 'Administrative_Areas_GID'

        #Step 1: Convert string "[['USA']]" → [['USA']]
        df_clean[target_col] = df_clean[target_col].apply(
            lambda x: ast.literal_eval(x) if isinstance(x, str) else x
        )

        #Step 2: Flatten [['USA']] → ['USA']
        df_clean[target_col] = df_clean[target_col].apply(
            lambda x: x[0] if isinstance(x, list) and len(x) > 0 else x
        )

        #Step 3: Convert ['USA'] → "['USA']" (string)
        df_clean[target_col] = df_clean[target_col].apply(
            lambda x: str([x]) if isinstance(x, str) else x
        )

        print("TARGET COLUMN:", target_col)
        print("FIRST VALUES:\n", df_clean[target_col].head())
        print("COLUMN DTYPE:", df_clean[target_col].dtype)
        print("PYTHON TYPE OF VALUE:", type(df_clean[target_col].iloc[2]))

    else:
        print("Error: Neither GID column found.")
        return df_clean

    #return df_clean
    
    # Debug: Confirm which column is being used
    print(f"Detected column: {target_col}")
    
    # Debug: Print before cleaning to see what we are dealing with
    print(f"Rows before cleaning: {len(df_clean)}")
    
    # A. Clean the GID column
    # Apply the cell-level cleaning function to every row of the detected GID column
    df_clean[target_col] = df_clean[target_col].apply(get_single_valid_gid) 
    
    # B. Filter out the NaNs
    # Remove any row where the GID cleaning process returned NaN (discarding bad/multiple GID rows)
    df_clean = df_clean.dropna(subset=[target_col]) 
    
    # Debug: Print after cleaning
    print(f"Rows after cleaning: {len(df_clean)}")
    return df_clean

In this part of the code, we created a main function that processes an entire dataframe and cleans its GID column step by step.<br>
The goal of this function is to detect the correct GID column, standardize its format, apply our GID‑cleaning function, and remove invalid rows.<br>
We handled this in several stages:

**1. We identified which GID column exists in the dataframe.**  
Different datasets use different column names, so we checked both possibilities.

- **If the dataframe contains `Administrative_Area_GID`:**<br>
    -> We set this as our target column.<br>
    -> We printed sample values and data types to understand the format before cleaning.

- **If the dataframe contains `Administrative_Areas_GID`:**<br>
    -> We set this as the target column and performed three preprocessing steps:<br>
    -> Step 1: Convert strings like `"[['USA']]"` into actual Python lists using `ast.literal_eval`.<br>
    -> Step 2: Flatten nested lists such as `[['USA']]` into `['USA']`.<br>
    -> Step 3: Convert the list back into a string format like `"['USA']"` so it matches the expected input format of our cleaning function.<br>
    -> We printed sample values again to confirm the transformation.

- **If neither column exists:**<br>
    -> We printed an error message and returned the dataframe unchanged.

**2. We printed debug information before cleaning.**  
These debug prints help us understand what the dataframe looks like before applying the cleaning function.<br>
-> We printed which column was detected.<br>
-> We printed how many rows the dataframe had before cleaning.

**3. We applied the GID‑cleaning function to every row.**  
-> We used `.apply(get_single_valid_gid)` to clean each GID entry one at a time.<br>
-> This step standardizes messy formats and extracts a single valid 3‑letter country code.

**4. We removed rows with invalid or ambiguous GIDs.**  
-> If the cleaning function returned `NaN` (meaning zero or multiple valid codes), we dropped those rows using `dropna`.<br>
-> This ensures that only rows with one clear, valid GID remain.

**5. We printed debug information after cleaning.**  
-> We printed how many rows remained after filtering out invalid entries.<br>
-> This helps us verify how much data was cleaned or discarded.

**6. Finally, we returned the cleaned dataframe.**  
-> At this point, the dataframe contains only rows with a single valid GID.<br>
-> This cleaned version is ready for aggregation and further analysis.

In [2]:
def aggregate_by_eventID(df_clean):
    # --- C. FIXED AGGREGATION LOGIC (Prevents adding years) ---
    
    # 1. Define the columns we are grouping by
    group_cols = ['Event_ID', 'Administrative_Area_GID'] # The keys that must be identical to form a group
    
    # 2. Create the "Rule Book" for aggregation
    agg_rules = {} # This dictionary tells Pandas what math to do for each column
    
    # Loop through every column to decide what to do with it
    for col in df_clean.columns:
        if col in group_cols:
            continue # Skip the grouping keys—they are handled automatically by groupby
            
        # If it is a Numerical Impact column -> SUM it
        if col in ['Num_Min', 'Num_Max', 'Num_Approx']:
            agg_rules[col] = 'sum' # Add the numbers together
            
        # For Dates and everything else -> KEEP FIRST value
        # (This prevents adding 1992 + 1992)
        else:
            agg_rules[col] = 'first' # Just take the first value found in the group

    # 3. Apply the rules
    # Groups the rows, applies the specific SUM/FIRST rules, and flattens the result
    df_agg = df_clean.groupby(group_cols).agg(agg_rules).reset_index()
    
    return df_agg

In this part of the code, we created a function that aggregates the cleaned dataframe by combining rows that belong to the same Event_ID and the same Administrative_Area_GID.<br>
The goal is to sum numerical impact values while keeping non‑numerical information consistent and avoiding incorrect operations like adding years together.<br>
We handled this in several steps:

**1. We defined the columns used for grouping.**  
Rows are grouped on the event identifier and the cleaned GID column.

- **Grouping columns:**<br>
    -> `Event_ID`<br>
    -> `Administrative_Area_GID`<br>
    -> Rows with the same values in these two columns will be merged into one.

**2. We created a “rule book” for how each column should be aggregated.**  
We built a dictionary called `agg_rules` that tells Pandas what operation to apply to each column.

- **For each column in the dataframe:**<br>
    -> If the column is one of the grouping keys, we skip it because `groupby` handles those automatically.<br>
    -> If the column is a numerical impact column (`Num_Min`, `Num_Max`, `Num_Approx`), we sum the values.<br>
    -> For all other columns (like dates, names, descriptions), we keep only the first value found in the group.<br>
    -> This prevents incorrect operations such as adding years (e.g., `1992 + 1992`).

**3. We applied the aggregation rules to the dataframe.**  
-> We used `groupby(group_cols).agg(agg_rules)` to combine rows that belong to the same event and administrative area.<br>
-> The `.reset_index()` step flattens the grouped result back into a normal dataframe.<br>
-> The final output contains one row per unique combination of Event_ID and Administrative_Area_GID.

**4. We returned the aggregated dataframe.**  
-> At this point, all numerical impacts are properly summed.<br>
-> All non‑numerical fields are kept consistent by taking the first value.<br>
-> The dataframe is now ready for analysis or merging with other datasets.
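
The sum/first rule can be checked on a toy frame. The values below are made up, and the second variant is only a suggestion: by default a pandas group sum over a group with no valid numbers yields 0, which can look like a reported zero impact, whereas `min_count=1` keeps it as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy rows: two reports for the same event and country, one for another country
demo = pd.DataFrame({
    "Event_ID": ["e1", "e1", "e1"],
    "Administrative_Area_GID": ["PHL", "PHL", "VNM"],
    "Start_Date_Year": [2013, 2013, 2013],
    "Num_Min": [10, 5, np.nan],
    "Num_Max": [12, 5, np.nan],
})

group_cols = ["Event_ID", "Administrative_Area_GID"]
agg_rules = {"Start_Date_Year": "first", "Num_Min": "sum", "Num_Max": "sum"}
summed = demo.groupby(group_cols).agg(agg_rules).reset_index()

# Variant: min_count=1 keeps NaN when a group has no valid number at all
strict = (
    demo.groupby(group_cols)
    .agg(
        Start_Date_Year=("Start_Date_Year", "first"),
        Num_Min=("Num_Min", lambda s: s.sum(min_count=1)),
        Num_Max=("Num_Max", lambda s: s.sum(min_count=1)),
    )
    .reset_index()
)
print(summed)
print(strict)
```

Whether an all-missing group should count as 0 or stay missing is a judgment call for the analysis; the sketch only shows that the two options give different results.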

In [3]:
# --- Run Again ---
# Execute the process on each of our filtered dataframes:
L3_Deaths_TC_1900_aggregated = aggregate_by_eventID(clean_dataframe(L3_Deaths_TC_1900))
L3_Damage_TC_1900_aggregated = aggregate_by_eventID(clean_dataframe(L3_Damage_TC_1900))
L3_Injuries_Damage_TC_1900_aggregated = aggregate_by_eventID(clean_dataframe(L3_Injuries_TC_1900))
#5------


In this final part of the code, we executed the entire cleaning and aggregation pipeline on each of our filtered dataframes.<br>
The goal here was to apply the same standardized process to all datasets so that they become consistent and ready for analysis.<br>
We handled this in a straightforward sequence:

**1. We applied the cleaning function to each dataframe.**  
-> We used `clean_dataframe(...)` to detect the correct GID column, standardize its format, clean each GID entry, and remove invalid rows.<br>
-> This ensures that every dataset has only one valid GID per row before aggregation.

**2. We applied the aggregation function to the cleaned data.**  
-> We used `aggregate_by_eventID(...)` to group rows by `Event_ID` and `Administrative_Area_GID`.<br>
-> Numerical impact values were summed, while non‑numerical fields kept their first valid entry.<br>
-> This step produces one clean, aggregated row per event per administrative area.

**3. We stored the final aggregated outputs.**  
-> `L3_Deaths_TC_1900_aggregated` contains the cleaned and aggregated deaths data.<br>
-> `L3_Damage_TC_1900_aggregated` contains the cleaned and aggregated damage data.<br>
-> `L3_Injuries_Damage_TC_1900_aggregated` contains the cleaned and aggregated injuries data (despite the `Damage` in its name, it holds injuries only).<br>
-> All three outputs now follow the same structure and can be compared or merged easily.

This completes the full cleaning and aggregation workflow for all filtered datasets.

6. Comparison with L2 tables
- Read all tables that start with "Instance" and load them into separate DataFrames named `L2_*`, where `*` represents impact categories, only load Deaths, Injuries and Damage.
- Using the same Event_ID from `L3_*_1900_aggregated`, filter the events from `L2_*` and name the result `L2_*_filter`.
- For events with the same Event_ID, match rows on “Administrative_Area_GID” from `L3_*_1900_aggregated` and “Administrative_Areas_GID” from `L2_*_filter`, compute the impact difference between `L3_*_1900_aggregated` and `L2_*_filter`, and report the average relative difference score for each impact category: (`L3_*_1900_aggregated` - `L2_*_filter`) / `L2_*_filter`.
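
A minimal sketch of the average relative difference score, assuming both tables already carry a single cleaned GID per row and that the impact value is the mean of `Num_Min` and `Num_Max` (as in Step 7). The helper `average_relative_difference` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def average_relative_difference(l3_agg, l2_filter):
    """Mean of (L3 - L2) / L2 over rows matched on Event_ID and GID (hypothetical helper)."""
    merged = l3_agg.merge(
        l2_filter,
        left_on=["Event_ID", "Administrative_Area_GID"],
        right_on=["Event_ID", "Administrative_Areas_GID"],
        suffixes=("_L3", "_L2"),
    )
    # Assumption: the impact value is the mean of Num_Min and Num_Max
    l3_val = merged[["Num_Min_L3", "Num_Max_L3"]].mean(axis=1)
    l2_val = merged[["Num_Min_L2", "Num_Max_L2"]].mean(axis=1)
    # Rows where the L2 value is zero or missing are skipped (division undefined)
    rel = (l3_val - l2_val) / l2_val.where(l2_val != 0)
    return rel.mean()

# Made-up values for two matched events
l3_demo = pd.DataFrame({
    "Event_ID": ["e1", "e2"], "Administrative_Area_GID": ["PHL", "USA"],
    "Num_Min": [90, 10], "Num_Max": [110, 10],
})
l2_demo = pd.DataFrame({
    "Event_ID": ["e1", "e2"], "Administrative_Areas_GID": ["PHL", "USA"],
    "Num_Min": [80, 20], "Num_Max": [80, 20],
})
score = average_relative_difference(l3_demo, l2_demo)
print(score)
```

Here the two rows give relative differences of 0.25 and -0.5, so the average score is -0.125.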

7. Identify and Analyze same tropical cyclone (TC) Events:
- Using the ISO from EM-DAT, the Administrative_Areas_GID in `L2_*_filter` (only consider rows with one GID), and the “Start/End_Date_Year” and “Start/End_Date_Month” columns, identify the same TC events and save them in a new dataframe named “EM_DAT_Wikimapcts_Matched”.
- Calculate the impact difference of these matched events (e.g., for Deaths, use the mean of Num_Min and Num_Max). Compute the relative difference, (Wikimpacts - EM_DAT) / EM_DAT, sort it into 5 categories (-50% less, -30% less, Perfect Match, +30% more, +50% more), and visualize the result in a bar plot.
- Save the plot as “EM_DAT_Wikimpacts_*_comparison.png”.


The first cell loads the EM-DAT data and keeps a copy of the columns we need.

In [None]:
# Load EM-DAT Excel file
emdat = pd.read_excel("EMDAT.xlsx", sheet_name="EM-DAT Data")

emdat = emdat[[
    "ISO",
    "Start Year", "Start Month",
    "End Year", "End Month", 'Total Deaths', 'No. Injured', "Total Damage ('000 US$)", "Total Damage, Adjusted ('000 US$)"
]].copy()

This code defines the columns needed for matching and extracts them from the three Wikimpacts dataframes.

In [None]:
cols_for_matching = [
    "Event_ID",
    "Administrative_Area_GID",
    "Start_Date_Year", "Start_Date_Month",
    "End_Date_Year", "End_Date_Month",
    "Num_Min", "Num_Max", "Num_Approx"
]

L2_Deaths_match = L2_Deaths_filter[cols_for_matching].copy()
L2_Injuries_match = L2_Injuries_filter[cols_for_matching].copy()
L2_Damage_match = L2_Damage_filter[cols_for_matching].copy()


This code combines the two data sources: the Wikimpacts 1.0 rows are matched to the EM-DAT rows with `.merge()`. The `left_on` and `right_on` arguments list the columns in each dataframe that must agree (country code, start year and month, end year and month), and `how="inner"` keeps only rows for which a match is found in both datasets.

In [None]:
match_deaths = L2_Deaths_match.merge(
    emdat,
    left_on=["Administrative_Area_GID", "Start_Date_Year", "Start_Date_Month", "End_Date_Year", "End_Date_Month"],
    right_on=["ISO", "Start Year", "Start Month", "End Year", "End Month"],
    how="inner"
)

match_injuries = L2_Injuries_match.merge(
    emdat,
    left_on=["Administrative_Area_GID", "Start_Date_Year", "Start_Date_Month", "End_Date_Year", "End_Date_Month"],
    right_on=["ISO", "Start Year", "Start Month", "End Year", "End Month"],
    how="inner"
)

match_damage = L2_Damage_match.merge(
    emdat,
    left_on=["Administrative_Area_GID", "Start_Date_Year", "Start_Date_Month", "End_Date_Year", "End_Date_Month"],
    right_on=["ISO", "Start Year", "Start Month", "End Year", "End Month"],
    how="inner"
)
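
Since the three merges use identical keys, they could be wrapped in one helper. The sketch below is a possible refactoring, not the notebook's code: `merge_with_emdat` is a hypothetical name, the toy values are made up, and the integer coercion is an assumption that guards against year and month columns arriving as floats or strings, which can prevent matches or raise errors.

```python
import pandas as pd

WIKI_KEYS = ["Administrative_Area_GID", "Start_Date_Year", "Start_Date_Month",
             "End_Date_Year", "End_Date_Month"]
EMDAT_KEYS = ["ISO", "Start Year", "Start Month", "End Year", "End Month"]

def merge_with_emdat(wiki_df, emdat_df):
    """Inner-join Wikimpacts rows to EM-DAT rows on country, year and month (hypothetical helper)."""
    wiki_df, emdat_df = wiki_df.copy(), emdat_df.copy()
    # Assumption: coerce the date parts on both sides to a nullable integer dtype
    for col in WIKI_KEYS[1:]:
        wiki_df[col] = pd.to_numeric(wiki_df[col], errors="coerce").astype("Int64")
    for col in EMDAT_KEYS[1:]:
        emdat_df[col] = pd.to_numeric(emdat_df[col], errors="coerce").astype("Int64")
    return wiki_df.merge(emdat_df, left_on=WIKI_KEYS, right_on=EMDAT_KEYS, how="inner")

# Made-up rows: e1 agrees on every key, e2 differs in the month
wiki_demo = pd.DataFrame({
    "Event_ID": ["e1", "e2"], "Administrative_Area_GID": ["PHL", "USA"],
    "Start_Date_Year": [2013.0, 2005.0], "Start_Date_Month": [11.0, 8.0],
    "End_Date_Year": [2013.0, 2005.0], "End_Date_Month": [11.0, 8.0],
    "Num_Min": [100, 40],
})
emdat_demo = pd.DataFrame({
    "ISO": ["PHL", "USA"], "Start Year": [2013, 2005], "Start Month": [11, 9],
    "End Year": [2013, 2005], "End Month": [11, 9], "Total Deaths": [120, 50],
})
matched_demo = merge_with_emdat(wiki_demo, emdat_demo)
print(matched_demo[["Event_ID", "ISO", "Num_Min", "Total Deaths"]])
```

The example also shows how strict the match is: an event whose start month differs by one between the two databases is dropped by the inner join.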


This code keeps the same columns in each of the three matched dataframes (deaths, injuries and damage) and concatenates them into one dataframe (`EM_DAT_Wikimapcts_Matched`).

In [None]:
cols_final = [
    "Event_ID",
    "ISO",
    "Administrative_Area_GID",
    "Start_Date_Year", "Start_Date_Month",
    "End_Date_Year", "End_Date_Month",
    "Start Year", "Start Month", "End Year", "End Month",
    "Num_Min", "Num_Max", "Num_Approx",
    "Total Deaths",
    "No. Injured",
    "Total Damage ('000 US$)",
    "Total Damage, Adjusted ('000 US$)"
] #all three matched dataframes have the same columns as mentioned above.

match_deaths = match_deaths[cols_final].copy()
match_injuries = match_injuries[cols_final].copy()
match_damage = match_damage[cols_final].copy()

EM_DAT_Wikimapcts_Matched = pd.concat(
    [match_deaths, match_injuries, match_damage],
    ignore_index=True
    )


This code categorizes how closely the two databases agree. It computes the relative difference, (Wikimpacts - EM_DAT) / EM_DAT, sorts it into 5 categories (-50% less, -30% less, Perfect Match, +30% more, +50% more), and shows the result as bar graphs.

In [None]:
def process_and_plot_impacts(df, category_name, emdat_col):
    """
    1. Calculates Wikimpacts Mean.
    2. Calculates Relative Difference vs EM-DAT.
    3. Categorizes into bins.
    4. Plots and saves the result.
    """
    # Work on a copy to avoid SettingWithCopy warnings
    df = df.copy()
    # 1. Calculate Wikimpacts Mean (row-wise mean of Num_Min and Num_Max)
    # mean(axis=1) skips NaNs, so a row with only one of the two values still gets a mean.
    df['Wikimpact_Mean'] = df[['Num_Min', 'Num_Max']].mean(axis=1)
    
    # 2. Calculate Relative Difference: (Wikimpacts - EM_DAT) / EM_DAT
    # We must handle cases where EM_DAT is 0 or NaN to avoid infinite errors.
    
    # Extract series for easier handling
    wiki_val = df['Wikimpact_Mean']
    emdat_val = df[emdat_col]
    
    # Define logic for division
    # Case A: Both are 0 -> 0 diff (Perfect Match)
    # Case B: EM_DAT is 0 but Wiki > 0 -> Treat as High Positive (set to 1.0 for binning)
    # Case C: Standard Formula
    
    conditions = [
        (emdat_val == 0) & (wiki_val == 0), # Both zero
        (emdat_val == 0) & (wiki_val > 0),  # EM_DAT zero, Wiki positive
        (emdat_val.isna()) | (wiki_val.isna()) # Missing data
    ]
    
    choices = [
        0.0,     # Perfect match
        1.0,     # Arbitrary high number to push it into +50% bin
        np.nan   # Missing data: no difference can be computed, dropped below
    ]
    
    # Calculate standard formula
    standard_calc = (wiki_val - emdat_val) / emdat_val
    
    # Apply logic
    df['Relative_Diff'] = np.select(conditions, choices, default=standard_calc)
    
    # Drop rows where we couldn't calculate a difference (NaNs)
    df = df.dropna(subset=['Relative_Diff'])

    # 3. Sort into 5 categories
    # Bins: 
    #   < -0.5       -> -50% less
    #   -0.5 to -0.3 -> -30% less
    #   -0.3 to 0.3  -> Perfect Match
    #   0.3 to 0.5   -> +30% more
    #   > 0.5        -> +50% more
    
    bins = [-np.inf, -0.5, -0.3, 0.3, 0.5, np.inf]
    labels = ['-50% less', '-30% less', '"Perfect" Match', '+30% more', '+50% more']
    
    df['Impact_Category'] = pd.cut(df['Relative_Diff'], bins=bins, labels=labels)

    # 4. Visualization
    plt.figure(figsize=(10, 6))
    
    # Count the values for the plot
    ax = sns.countplot(x='Impact_Category', data=df, palette='viridis', order=labels)
    
    # Formatting
    plt.title(f'Comparison of {category_name}: EM-DAT vs Wikimpacts', fontsize=15)
    plt.xlabel('Impact Difference Category', fontsize=12)
    plt.ylabel('Count of Events', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add count labels on top of bars
    for p in ax.patches:
        ax.annotate(f'{int(p.get_height())}', 
                    (p.get_x() + p.get_width() / 2., p.get_height()), 
                    ha = 'center', va = 'center', 
                    xytext = (0, 9), 
                    textcoords = 'offset points')

    # Save the plot
    filename = f"EM_DAT_Wikimpacts_{category_name}_comparison.png"
    plt.savefig(filename, dpi=300)
    print(f"Plot saved: {filename}")
    plt.show() # Optional: Show plot in IDE
    
    return df

# --- Execute for each Category ---

print("Processing Deaths...")
match_deaths_processed = process_and_plot_impacts(
    match_deaths, 
    category_name="Deaths", 
    emdat_col="Total Deaths"
)

print("Processing Injuries...")
match_injuries_processed = process_and_plot_impacts(
    match_injuries, 
    category_name="Injuries", 
    emdat_col="No. Injured"
)

print("Processing Damage...")
match_damage_processed = process_and_plot_impacts(
    match_damage, 
    category_name="Damage", 
    emdat_col="Total Damage, Adjusted ('000 US$)"
)
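
To see where values on the bin edges end up, the same `bins` and `labels` can be applied to a few made-up relative differences. Because `pd.cut` uses right-closed intervals by default, a value exactly on an edge falls into the lower bin.

```python
import numpy as np
import pandas as pd

bins = [-np.inf, -0.5, -0.3, 0.3, 0.5, np.inf]
labels = ['-50% less', '-30% less', '"Perfect" Match', '+30% more', '+50% more']

# Made-up relative differences, including the bin edges
rel_diff = pd.Series([-0.8, -0.5, -0.4, -0.3, 0.0, 0.3, 0.4, 0.5, 2.0])
categories = pd.cut(rel_diff, bins=bins, labels=labels)
print(pd.DataFrame({"Relative_Diff": rel_diff, "Impact_Category": categories}))
```

So a relative difference of exactly 0.3 still counts as a "Perfect" Match, and exactly 0.5 counts as +30% more.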

8. Analyze the spatial differences between two databases
- Using the ISO from EM-DAT and the Administrative_Areas_GID in `L2_*_filter` (only consider rows with one GID), compute the difference in the number of impact data entries between the two databases, and visualize the difference on a world map.
- Save the plot as “EM_DAT_Wikimpacts_Spatial_*_comparison.png”.
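
The counting part of Step 8 could be sketched as below, assuming the GID column has already been reduced to one ISO3 code per row. The helper `entry_count_difference` and the sample codes are hypothetical; the map itself is left out because the plotting library is not specified here.

```python
import pandas as pd

def entry_count_difference(wiki_gids, emdat_isos):
    """Per-country entry counts and their difference, Wikimpacts minus EM-DAT (hypothetical helper)."""
    counts = pd.concat(
        [wiki_gids.value_counts().rename("Wikimpacts"),
         emdat_isos.value_counts().rename("EM_DAT")],
        axis=1,
    ).fillna(0).astype(int)  # countries missing from one database count as 0 entries
    counts["Difference"] = counts["Wikimpacts"] - counts["EM_DAT"]
    return counts.rename_axis("ISO").reset_index().sort_values("ISO", ignore_index=True)

# Made-up country codes, one per impact entry
wiki_demo = pd.Series(["PHL", "PHL", "USA", "JPN"], name="Administrative_Areas_GID")
emdat_demo = pd.Series(["PHL", "USA", "USA", "USA", "MEX"], name="ISO")
diff_demo = entry_count_difference(wiki_demo, emdat_demo)
print(diff_demo)
```

The resulting table has one row per country code and could then be joined to country geometries keyed by ISO3 to colour a world map by `Difference`.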