# <div style="text-align:center">Tropical Cyclone impact data comparison between the Wikimpacts 1.0 and EM-DAT databases</div>

<div style="text-align:center">
PhiRu Environmental Engineering Members: <br>
Bernal, Chiara (r) <br>
Caligdong, Ronan (r) <br>
Espejo, Kristine Nadeen (r1017911) <br>
Haghebaert, Lukas (r) <br>
</div>

## Dataset
1. **Wikimpacts 1.0**: contains data on the occurrence and impacts of climate extremes at country and sub-national scales. The database is derived from Wikipedia using generative AI. <br>
2. **EM-DAT**: downloaded from the public EM-DAT platform, keeping only “tropical cyclone” records.

## Tasks


1. Download the Wikimpacts 1.0 database in db format. 
2. Load Data:   
- Read the database file and load all tables that start with "Total" into a DataFrame named `L1`.
- Identify all tables that start with "Specific" and load them into separate DataFrames named `L3_*`, where `*` represents the impact category. Load only Deaths, Injuries, and Damage.
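
The loading step could be sketched as follows, assuming the `.db` file is a SQLite database. The helper `load_tables`, the demo table names, and the demo values are hypothetical and only illustrate the prefix-based selection; the real table names come from the downloaded file.

```python
import sqlite3
import pandas as pd

def load_tables(conn, prefix):
    """Return {table_name: DataFrame} for every table whose name starts with prefix."""
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table'", conn
    )["name"]
    return {
        name: pd.read_sql_query(f'SELECT * FROM "{name}"', conn)
        for name in names
        if name.startswith(prefix)
    }

# Demo on an in-memory database with made-up table names and values.
# For the real data, replace ":memory:" with the path to the downloaded file.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"Event_ID": ["e1"], "Main_Event": ["Flood"]}).to_sql(
    "Total_Summary", conn, index=False
)
pd.DataFrame({"Event_ID": ["e1"], "Num_Min": [3]}).to_sql(
    "Specific_Example_Deaths", conn, index=False
)
pd.DataFrame({"Event_ID": ["e1"], "Num_Min": [7]}).to_sql(
    "Specific_Example_Homeless", conn, index=False
)

# All "Total" tables are stacked into one DataFrame named L1.
total_tables = load_tables(conn, "Total")
L1 = pd.concat(list(total_tables.values()), ignore_index=True)

# Keep only the "Specific" tables for the three requested impact categories.
wanted = ("Deaths", "Injuries", "Damage")
L3 = {
    name: df
    for name, df in load_tables(conn, "Specific").items()
    if name.endswith(wanted)
}
conn.close()
```

Here the `L3_*` DataFrames are kept in a dictionary keyed by table name. They can be assigned to individual variables afterwards if the `L3_*` naming is required.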


3. Filter by “Tropical Storm/Cyclone”:
- Using “Main_Event”, filter the Tropical Storm/Cyclone events from `L1` into a new DataFrame, `L1_TC`.
- Using “Event_ID” from `L1_TC`, filter each `L3_*` DataFrame to keep only impacts from Tropical Storm/Cyclone events.
- Check the “Start/End_Date_Year,” “Start/End_Date_Month,” and “Start/End_Date_Day” columns. If these date fields are missing in `L3_*`, fill them with the corresponding information from `L1_TC`.
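
A minimal sketch of this step on made-up rows is shown below. It assumes that “Main_Event” holds the exact label "Tropical Storm/Cyclone" and that each Event_ID appears once in `L1_TC`. Only two date columns are shown; the remaining Start/End columns would go in the same `date_cols` list.

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names follow the task description.
L1 = pd.DataFrame({
    "Event_ID": ["e1", "e2", "e3"],
    "Main_Event": ["Tropical Storm/Cyclone", "Flood", "Tropical Storm/Cyclone"],
    "Start_Date_Year": [1998, 2001, 2013],
    "Start_Date_Month": [10, 6, 11],
})
L3_Deaths = pd.DataFrame({
    "Event_ID": ["e1", "e2", "e3", "e3"],
    "Start_Date_Year": [np.nan, 2001, 2013, np.nan],
    "Start_Date_Month": [np.nan, 6, np.nan, 11],
    "Num_Min": [5, 2, 100, 40],
})

# Keep tropical cyclone events, then keep only their impact rows.
L1_TC = L1[L1["Main_Event"] == "Tropical Storm/Cyclone"].copy()
L3_Deaths_TC = L3_Deaths[L3_Deaths["Event_ID"].isin(L1_TC["Event_ID"])].copy()

# Fill missing dates from L1_TC by looking each row's Event_ID up in L1_TC.
date_cols = ["Start_Date_Year", "Start_Date_Month"]
lookup = L1_TC.drop_duplicates("Event_ID").set_index("Event_ID")
for col in date_cols:
    fallback = L3_Deaths_TC["Event_ID"].map(lookup[col])
    L3_Deaths_TC[col] = L3_Deaths_TC[col].fillna(fallback)
```

Using `fillna` keeps any date already present in `L3_*` and only falls back to `L1_TC` where the value is missing.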

4. Filter by Date:
- In each `L3_*` DataFrame, keep only the events that occurred after the year 1900. Name the filtered DataFrames `L3_*_1900`.

In [None]:
def filter_year(df, year):
    """Return a copy of df with only the rows whose "Start_Date_Year" is
    strictly greater than year (e.g. year=1900 keeps 1901 onwards)."""

    # Fail loudly instead of printing a message and silently returning None
    if not isinstance(year, int):
        raise TypeError("Year must be an int data type")

    year_mask = df["Start_Date_Year"] > year
    return df[year_mask].copy()
        
year_to_filter = 1900
L3_Deaths_TC_1900 = filter_year(L3_Deaths_TC, year_to_filter)
L3_Injuries_TC_1900 = filter_year(L3_Injuries_TC, year_to_filter)
L3_Damage_TC_1900 = filter_year(L3_Damage_TC, year_to_filter)

We created a function that filters a DataFrame by year. It only works for DataFrames that have a column named "Start_Date_Year". <br>
An explanation of how the function works was added in the docstring, and a type check on `year` was added to help users troubleshoot errors they may encounter.

5. Aggregate by Administrative Area:
- Using the “Administrative_Area_GID” column in each `L3_*_1900` DataFrame obtained from Step 4, aggregate the impacts that share the same “Event_ID” and the same “Administrative_Area_GID”. <br>
- Only consider rows with one valid GID (in special cases where one country involves several GIDs, use only the one without digits, or the first three letters). Name the new DataFrame `L3_*_1900_aggregated`.

In [4]:
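import ast          # This library turns string "[...]" into list [...]
import numpy as np  # Provides np.nan for missing values
import pandas as pd # Provides pd.isna / pd.notna and the DataFrame operations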

# --- 1. GID CLEANING FUNCTION (applied to one cell at a time) ---
def get_single_valid_gid(gid_entry):

    # Empty cells are returned as NaN. Lists are skipped here because
    # pd.isna() on a list returns an array, which cannot be used in an if test.
    if not isinstance(gid_entry, list) and pd.isna(gid_entry):
        return np.nan

    # A GID cell may hold text that looks like a list ("[...]"); convert that text into a Python list
    if isinstance(gid_entry, str) and gid_entry.startswith('[') and gid_entry.endswith(']'):
        try:
            gid_entry = ast.literal_eval(gid_entry) # ast.literal_eval safely converts the text into a real Python list
        except (ValueError, SyntaxError):
            pass # If the string cannot be converted, keep it as a plain string

    # Make sure elements is always a list of strings
    if not isinstance(gid_entry, list): # A single value such as 'USA'
        elements = [str(gid_entry)] # Wrap the single item in a list so we can loop over it
    else: # Already a list
        elements = [str(e) for e in gid_entry if pd.notna(e)] # Convert every item to a string and skip NaNs inside the list

    valid_codes = [] # Start an empty list to store valid country codes
    
    for e in elements: # Loop through every item in the cleaned list (ex: 'Z03', 'CHN')
        # Clean formatting: remove whitespace, take first 3 chars
        # 'AUS.10' -> 'AUS'
        code = e.strip()[:3] # Apply the cleaning and standardization
        
        # Validation Rule: 
        # Must be exactly 3 letters AND contain only letters (this excludes codes like 'Z03')
        if len(code) == 3 and code.isalpha(): 
            valid_codes.append(code) #If it passes the test, add it to our list of valid_codes
    
    # Enforce the "single valid GID" rule
    if len(valid_codes) == 1: # Check if we found exactly one valid country code
        return valid_codes[0] # If yes, return the code (ex: 'CHN')
    else:
        return np.nan #If zero or multiple valid codes were found, return NaN (Discard the row)

# --- 2. THE MAIN PROCESSING AND AGGREGATION FUNCTION ---
def process_step_5(df):
    df_clean = df.copy() # Create a copy of the input data to work on safely
    
    # Debug: Print before cleaning to see what we are dealing with
    print(f"Rows before cleaning: {len(df_clean)}")
    
    # A. Clean the GID column
    # Apply the complex cleaning function to every row in the 'Administrative_Area_GID' column
    df_clean['Administrative_Area_GID'] = df_clean['Administrative_Area_GID'].apply(get_single_valid_gid) 
    
    # B. Filter out the NaNs
    # Remove any row where the GID cleaning process returned NaN (discarding bad/multiple GID rows)
    df_clean = df_clean.dropna(subset=['Administrative_Area_GID']) 
    
    # Debug: Print after cleaning
    print(f"Rows after cleaning: {len(df_clean)}")

    # --- C. FIXED AGGREGATION LOGIC (Prevents adding years) ---
    
    # 1. Define the columns we are grouping by
    group_cols = ['Event_ID', 'Administrative_Area_GID'] # The keys that must be identical to form a group
    
    # 2. Create the "Rule Book" for aggregation
    agg_rules = {} # This dictionary tells Pandas what math to do for each column
    
    # Loop through every column to decide what to do with it
    for col in df_clean.columns:
        if col in group_cols:
            continue # Skip the grouping keys—they are handled automatically by groupby
            
        # If it is a Numerical Impact column -> SUM it
        if col in ['Num_Min', 'Num_Max', 'Num_Approx']:
            agg_rules[col] = 'sum' # Add the numbers together
            
        # For Dates and everything else -> KEEP FIRST value
        # (This prevents adding 1992 + 1992)
        else:
            agg_rules[col] = 'first' # Just take the first value found in the group

    # 3. Apply the rules
    # Groups the rows, applies the specific SUM/FIRST rules, and flattens the result
    df_agg = df_clean.groupby(group_cols).agg(agg_rules).reset_index()
    
    return df_agg

# --- Run the aggregation ---
# Execute the process on each filtered DataFrame:
L3_Deaths_TC_1900_aggregated = process_step_5(L3_Deaths_TC_1900)
L3_Damage_TC_1900_aggregated = process_step_5(L3_Damage_TC_1900)
L3_Injuries_TC_1900_aggregated = process_step_5(L3_Injuries_TC_1900)
print(L3_Deaths_TC_1900_aggregated.head())


6. Comparison with L2 tables
- Read all tables that start with "Instance" and load them into separate DataFrames named `L2_*`, where `*` represents the impact category. Load only Deaths, Injuries, and Damage.
- Using the Event_IDs present in `L3_*_1900_aggregated`, filter the events in `L2_*` and name the result `L2_*_filter`.
- For events with the same Event_ID, match rows on GID using “Administrative_Area_GID” from `L3_*_1900_aggregated` and “Administrative_Areas_GID” from `L2_*_filter`. Compute the impact difference between the two and, for each impact category, report the average relative difference: (`L3_*_1900_aggregated` - `L2_*_filter`) / `L2_*_filter`.
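
The relative difference in this step can be sketched as below, on made-up numbers. Two choices here are assumptions rather than part of the task: the impact value is taken as the mean of Num_Min and Num_Max (the helper `mid_estimate` is hypothetical), and pairs where the L2 value is zero are dropped because the ratio is undefined. The sketch also assumes the L2 GID column has already been reduced to a single code per row.

```python
import pandas as pd

# Made-up values; column names follow the task description.
L3_agg = pd.DataFrame({
    "Event_ID": ["e1", "e1", "e2"],
    "Administrative_Area_GID": ["PHL", "VNM", "USA"],
    "Num_Min": [120, 30, 10],
    "Num_Max": [120, 50, 10],
})
L2_filter = pd.DataFrame({
    "Event_ID": ["e1", "e1", "e2"],
    "Administrative_Areas_GID": ["PHL", "VNM", "USA"],
    "Num_Min": [100, 40, 0],
    "Num_Max": [100, 40, 0],
})

def mid_estimate(df):
    """Hypothetical helper: midpoint of the Num_Min / Num_Max range."""
    return df[["Num_Min", "Num_Max"]].mean(axis=1)

left = L3_agg.assign(impact_L3=mid_estimate(L3_agg))
right = L2_filter.assign(impact_L2=mid_estimate(L2_filter)).rename(
    columns={"Administrative_Areas_GID": "Administrative_Area_GID"}
)

# Pair rows that share both the event and the GID.
keys = ["Event_ID", "Administrative_Area_GID"]
merged = left[keys + ["impact_L3"]].merge(
    right[keys + ["impact_L2"]], on=keys, how="inner"
)

# Relative difference (L3 - L2) / L2, skipping pairs with a zero denominator.
merged = merged[merged["impact_L2"] != 0].copy()
merged["rel_diff"] = (merged["impact_L3"] - merged["impact_L2"]) / merged["impact_L2"]
avg_rel_diff = merged["rel_diff"].mean()
```

The mean here is signed, so over- and under-estimates can cancel out. Taking the mean of `merged["rel_diff"].abs()` instead would measure the typical size of the disagreement.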

7. Identify and Analyze same tropical cyclone (TC) Events:
- Using the ISO code from EM-DAT, the Administrative_Areas_GID in `L2_*_filter` (only consider rows with a single GID), and “Start/End_Date_Year” and “Start/End_Date_Month”, identify the same TC events in both databases and save them in a new DataFrame named “EM_DAT_Wikimapcts_Matched”.
- Calculate the impact difference (e.g., for Deaths, using the mean of Num_Min and Num_Max) for these matched events. Using the relative difference, (Wikimpacts - EM_DAT) / EM_DAT, sort each event into one of five categories (-50% less, -30% less, Perfect Match, +30% more, +50% more) and visualize the result in a bar plot.
- Save the plot as “EM_DAT_Wikimpacts_*_comparison.png”.
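
One way to sketch the matching and categorization is shown below on made-up data. The EM-DAT column names ("ISO", "Start Year", "Start Month", "Total Deaths") and the bin edges are assumptions. The task names the five categories but not their boundaries, so the edges -50%, -30%, +30% and +50% used here are only one possible reading.

```python
import numpy as np
import pandas as pd

# Made-up rows; EM-DAT column names are assumed, not confirmed by the task.
emdat = pd.DataFrame({
    "ISO": ["PHL", "VNM", "USA", "JPN"],
    "Start Year": [2013, 2013, 2005, 2019],
    "Start Month": [11, 11, 8, 10],
    "Total Deaths": [100, 10, 1000, 50],
})
L2_Deaths_filter = pd.DataFrame({
    "Event_ID": ["e1", "e1", "e2", "e3"],
    "Administrative_Areas_GID": ["PHL", "VNM", "USA", "CHN"],
    "Start_Date_Year": [2013, 2013, 2005, 2019],
    "Start_Date_Month": [11, 11, 8, 10],
    "Num_Min": [100, 4, 1500, 5],
    "Num_Max": [110, 4, 1700, 5],
})

# Wikimpacts impact = mean of Num_Min and Num_Max, as stated in the task.
wiki = L2_Deaths_filter.assign(
    Wikimpacts=L2_Deaths_filter[["Num_Min", "Num_Max"]].mean(axis=1)
)

# Same country, same start year and same start month count as the same event.
matched = wiki.merge(
    emdat,
    left_on=["Administrative_Areas_GID", "Start_Date_Year", "Start_Date_Month"],
    right_on=["ISO", "Start Year", "Start Month"],
    how="inner",
)
matched["rel_diff"] = (
    matched["Wikimpacts"] - matched["Total Deaths"]
) / matched["Total Deaths"]

# Assumed bin edges; each bin includes its right edge (pd.cut default).
bins = [-np.inf, -0.5, -0.3, 0.3, 0.5, np.inf]
labels = ["-50% less", "-30% less", "Perfect Match", "+30% more", "+50% more"]
matched["category"] = pd.cut(matched["rel_diff"], bins=bins, labels=labels)

# Number of matched events per category, in the fixed label order.
counts = (
    matched["category"].astype(str).value_counts().reindex(labels, fill_value=0)
)
```

With matplotlib installed, `counts.plot.bar()` would draw the bar plot, and the returned axes' figure can be saved with `savefig`. End dates could be added to the merge keys in the same way if stricter matching is needed.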

8. Analyze the spatial differences between two databases
- Using the ISO code from EM-DAT and the Administrative_Areas_GID in `L2_*_filter` (only consider rows with a single GID), compute the difference in the number of impact data entries between the two databases and visualize it on a world map.
- Save the plot as “EM_DAT_Wikimpacts_Spatial_*_comparison.png”.
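
The per-country count difference behind the map can be sketched as follows, on made-up codes. The sign convention (Wikimpacts minus EM-DAT) and the name of the result are assumptions. The sketch also assumes both columns already hold one three-letter code per row.

```python
import pandas as pd

# Made-up country codes, one entry per impact record.
emdat_iso = pd.Series(["PHL", "PHL", "VNM", "JPN"], name="ISO")
wiki_gid = pd.Series(
    ["PHL", "VNM", "VNM", "VNM", "CHN"], name="Administrative_Areas_GID"
)

# Count entries per country in each database and subtract.
# fill_value=0 keeps countries that appear in only one of the two databases.
count_diff = (
    wiki_gid.value_counts()
    .sub(emdat_iso.value_counts(), fill_value=0)
    .astype(int)
    .rename("entries_Wikimpacts_minus_EMDAT")
    .sort_index()
)
```

Positive values mark countries with more entries in Wikimpacts, negative values those with more in EM-DAT. To draw the map, this Series can be joined to a world boundary layer keyed on three-letter country codes and plotted as a choropleth, for example with geopandas.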