# Natural Disaster Data

## Task:

The goal is to convert the data into the following formats for later use.
Along the way, this notebook also does some data preparation.


### Disaster-All
disaster/disaster-all:
Columns: disaster_no, year, subgroup, type, total_deaths, dis_mag_value, dis_mag_scale, start_year, end_year
Other interesting columns?


### Disaster-Global
disaster/disaster-global.csv
Columns: year, subgroup, type, total_deaths


### Disaster-Region
disaster/disaster-region.csv
Columns: region_code, region_name, year, subgroup, type, total_deaths
Calculate from country data, use UN Dataset to assign region to each country


### Disaster-Country
disaster/disaster-country.csv
Columns: year, country_code, country_name, subtype, type, total_deaths

Prefix: dis


Publisher: Centre for Research on the Epidemiology of Disasters (CRED)

CRED defines a disaster as “a situation or event that overwhelms local capacity, necessitating a
request at the national or international level for external assistance; an unforeseen and often sudden
event that causes great damage, destruction and human suffering”

For a disaster to be entered into the database at least one of the following criteria must be fulfilled:

- 10 or more people reported killed
- 100 or more people reported affected
- Declaration of a state of emergency
- Call for international assistance
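The inclusion rule above is a plain "at least one of four" check. A minimal sketch of it follows; the function name and its parameters are hypothetical and are not part of the CRED data or of this notebook.

```python
def meets_entry_criteria(killed=0, affected=0,
                         emergency_declared=False, assistance_requested=False):
    """Return True if at least one of the four listed criteria is fulfilled.

    All parameter names are illustrative assumptions.
    """
    return (
        killed >= 10
        or affected >= 100
        or emergency_declared
        or assistance_requested
    )


# Twelve reported deaths are enough on their own.
print(meets_entry_criteria(killed=12))
# Three deaths and forty affected people fulfil none of the criteria.
print(meets_entry_criteria(killed=3, affected=40))
```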

## Setup & Imports

In [None]:
import os
import numpy as np
import pandas as pd
from pathlib import Path
filepath_source = Path('data/raw/disaster/emdat_public_2022_12_22_full.xlsx')
filepath_all = Path("data/processed/disaster/disaster-all.csv")
filepath_global = Path("data/processed/disaster/disaster-global.csv")
filepath_country = Path('data/processed/disaster/disaster-country.csv')
filepath_region = Path("data/processed/disaster/disaster-region.csv")

In [None]:
dis = pd.read_excel(filepath_source, skiprows = 6)

In [None]:
# Create the output directory if it does not exist yet
os.makedirs("data/processed/disaster/", exist_ok=True)

## First Look

In [None]:
dis.head()

## Reformat Attribute-Names

1. Replace whitespaces with underscores
2. Convert every character to lowercase
3. Rename specific columns to ensure uniformity

In [None]:
# Replace whitespace in all column names with underscores and convert them to lower-case
dis.columns = [c.replace(' ', '_').lower() for c in dis.columns]
dis.rename(columns={'country': 'country_name', 'iso': 'country_code',
                    'disaster_subtype': 'subtype', 'disaster_type': 'type',
                    'total_deaths': 'deaths'}, inplace=True)

## Filter for all relevant attributes & observations

1. We only consider observations of disasters of type natural. (rows)
2. We only consider relevant attributes. (columns)

## Which disaster groups are present in the dataset?

In [None]:
dis.disaster_group.unique()

There are natural disasters, technological disasters, and complex disasters. The latter represent specific events (e.g. famine) which are not directly linked to a natural hazard.

We only focus on disasters which have a natural cause.

In [None]:
dis = dis[dis.disaster_group == "Natural"]

### MAYBE

***The dataset contains observations starting from the year 1900. Since our task is to analyze only observations from the last 100 years, we would only take into account observations from the year 1920 onwards.***

In [None]:
#dis = dis[dis.year >= 1920]

Which types of natural disasters are there?

In [None]:
dis["type"].unique()

Most types are the ones one would expect when thinking about natural disasters.
But there are some unusual types, like insect infestation or animal accident, which are less obvious.
Therefore, we need to take a closer look at the disasters of those types.

In [None]:
dis[dis.type == "Insect infestation"].deaths.sum()

It is safe to say that we can omit observations of disasters of type insect-infestations since there are no accounts of people dying from those kinds of incidents.

In [None]:
dis[dis.type == "Impact"].deaths.sum()

The same goes for the disaster type impact, which occurred only once (in Russia), again with 0 deaths.
We omit this event.

In [None]:
dis[dis.type == "Animal accident"].deaths.count()

We also decided to omit disasters of type animal accident, since there is only one recorded accident over the last 100 years, in which 12 people died. It therefore makes little sense to include it in our further research.

In [None]:
dis = dis[((dis.type != "Insect infestation") & (dis.type != "Animal accident")) & (dis.type != "Impact")]
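The per-type checks above can also be condensed into a single grouped summary. The following is a minimal sketch on an invented toy frame (the values are not taken from the dataset); it assumes the same `type` and `deaths` column names used in this notebook.

```python
import pandas as pd

# Toy stand-in for the `dis` frame; all values are invented for illustration.
toy = pd.DataFrame({
    "type": ["Flood", "Flood", "Insect infestation", "Impact", "Animal accident"],
    "deaths": [120.0, None, None, 0.0, 12.0],
})

# One row per disaster type: number of events and total reported deaths.
# Missing death counts are skipped by sum(), so an all-missing group sums to 0.
summary = toy.groupby("type")["deaths"].agg(events="size", total_deaths="sum")
print(summary)

# Types whose combined death toll is zero are candidates for exclusion.
zero_death_types = summary.index[summary["total_deaths"] == 0].tolist()
print(zero_death_types)
```

Rare types with a non-zero toll (such as the single animal accident) still need a manual decision; the summary only makes the counts visible in one place.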

We also decided not to include epidemics, since this would go beyond the scope of this task.

In [None]:
dis = dis[dis.type != "Epidemic"]

We only consider the following types of disasters:

In [None]:
dis["type"].unique()

In [None]:
dis[dis.type == "Impact"].deaths.head()

Now we take a look at the different attributes of each observation

In [None]:
dis.dtypes

In [None]:
# Disaster-All
dis_all_col_names = ["year", "dis_no", "region", "continent", "country_name", "country_code", "location",
                     "type", "subtype", "deaths", "dis_mag_value", "dis_mag_scale", "start_year", "end_year"]
dis_all = dis.filter(items=dis_all_col_names)

## Are there missing values?

Fill missing values for the number of deaths.

We assume that a missing value for the number of deaths of a particular disaster means that the death toll was 0.

For the subtype, we check for which types of natural disasters no subtype is provided.

In [None]:
for col in dis_all:
    print(col + ": " + str(dis_all.loc[:, col].isnull().sum()))
print("Total: " + str(len(dis_all)))
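The same per-column count can be obtained without an explicit loop. This is a small sketch on invented toy data; `report` and `share` are names introduced here for illustration.

```python
import pandas as pd

# Toy frame with a few gaps; the values are invented.
toy = pd.DataFrame({
    "type": ["Flood", "Storm", "Drought"],
    "subtype": ["Riverine flood", None, None],
    "deaths": [10.0, None, 3.0],
})

missing = toy.isna().sum()               # missing values per column
share = (missing / len(toy)).round(2)    # fraction of rows affected
report = pd.DataFrame({"missing": missing, "share": share})
print(report)
```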

In [None]:
dis_all[dis_all["subtype"].isna()]["type"].unique()

Unfortunately, the missing values in the subtype column do not correspond to specific types of disasters.
We cannot easily conclude what caused the values to be missing.

In [None]:
dis_all[['deaths']] = dis_all[['deaths']].fillna(value=0)

We convert the number of deaths from float to integer, since absolute death counts are by definition always integers.

In [None]:
dis_all["deaths"] = dis_all["deaths"].astype(int)
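The two steps above (fill with 0, then cast) can be seen on a toy series. The sketch also shows pandas' nullable `Int64` dtype as a possible alternative for cases where "not reported" should stay distinguishable from "zero"; this notebook does not use that alternative.

```python
import pandas as pd

# Invented values; the middle entry stands for an unreported death toll.
toy = pd.Series([12.0, None, 3.0], name="deaths")

filled = toy.fillna(0).astype(int)   # approach used in this notebook
nullable = toy.astype("Int64")       # alternative: keeps the gap as <NA>

print(filled.tolist())
print(nullable.isna().sum())
```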

## ISO-Codes

Compare ISO-Codes to match the IDs of each row with the other datasets.
We use the ISO3-Codes, which identify a country with 3 letters.
From now on, whenever ISO-Codes are mentioned, we mean ISO3-Codes.

In [None]:
un_country_codes = pd.read_csv("data/raw/country-codes/un-country-codes.csv", sep=";")
un_country_codes.columns = [c.replace(' ', '_').replace('-','_') for c in un_country_codes.columns]

In [None]:
countries_with_iso = dis_all.merge(un_country_codes, how="left", left_on='country_name', right_on='Country_or_Area')[["country_name", "country_code", "ISO_alpha3_Code"]]

In [None]:
countries_with_iso.head(10)

## Display all countries for which NO matching ISO-Code was found

In [None]:
mismatches = countries_with_iso[countries_with_iso.ISO_alpha3_Code.isnull()]
mismatches.country_name.unique()
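Another way to find the unmatched names is the `indicator` option of `DataFrame.merge`, which records for every row whether a partner was found. A minimal sketch on invented toy frames, assuming the same column names as above:

```python
import pandas as pd

# Toy stand-ins for the disaster table and the UN code table.
events = pd.DataFrame({"country_name": ["Germany", "Germany Fed Rep", "Japan"]})
un_codes = pd.DataFrame({
    "Country_or_Area": ["Germany", "Japan"],
    "ISO_alpha3_Code": ["DEU", "JPN"],
})

merged = events.merge(
    un_codes, how="left",
    left_on="country_name", right_on="Country_or_Area",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Rows that found no partner in the UN table.
unmatched = merged.loc[merged["_merge"] == "left_only", "country_name"].unique()
print(unmatched)
```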

## Rename remaining Countrynames to a standardized format

In [None]:
# Remove the " (the)" suffix
dis_all['country_name'] = dis_all['country_name'].apply(lambda x: x.replace(' (the)', ''))
# Reorder complex country names of the form "Name, Qualifier" into "Qualifier Name"
dis_all['country_name'] = dis_all['country_name'].apply(lambda x: x.split(',')[1] + " " + x.split(',')[0] if ',' in x else x)
# Remove a leading whitespace left over from the reordering
dis_all['country_name'] = dis_all['country_name'].apply(lambda x: x[1:] if x.startswith(' ') else x)
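The three cleaning steps can also be bundled into one helper, which makes them easy to try on single names. This is a sketch, not the notebook's implementation: `normalize_country_name` is a hypothetical name, it splits only at the first comma, and it strips whitespace on both sides. The example inputs are chosen for illustration.

```python
def normalize_country_name(name):
    """Apply the three cleaning steps to a single country name (illustrative helper)."""
    # 1. Drop the " (the)" suffix.
    name = name.replace(" (the)", "")
    # 2. Turn "Name, Qualifier" into "Qualifier Name" (split at the first comma only).
    if "," in name:
        head, tail = name.split(",", 1)
        name = f"{tail} {head}"
    # 3. Remove surrounding whitespace left over from the reordering.
    return name.strip()


print(normalize_country_name("Bahamas (the)"))
print(normalize_country_name("Tanzania, United Republic of"))
```

With a helper like this, the column could be cleaned in one pass, for example via `dis_all['country_name'].apply(normalize_country_name)`.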


In [None]:
new_country_names = {
    "Germany Fed Rep": "Germany",
    "Germany Dem Rep": "Germany",
    "Hong Kong": "China",
    "Macao": "China",
    "Åland": "Åland Islands",
    "Congo (the Democratic Republic of the)": "Democratic Republic of the Congo",
    "Turkey": "Türkiye",
    "Korea (the Republic of)": "Republic of Korea",
    "Macedonia (the former Yugoslav Republic of)": "North Macedonia",
    "Congo (Democratic Republic of the)": "Democratic Republic of the Congo",
    "Yemen P Dem Rep": "Yemen",
    "Yemen Arab Rep": "Yemen",
    "Korea (the Democratic People's Republic of)": "Democratic People's Republic of Korea",
    "Serbia Montenegro" : "Serbia",
    "Moldova (the Republic of)" : "Republic of Moldova",
    "Czech Republic" : "Czechia",
    "Taiwan (Province of China)" : "Taiwan"
}

In [None]:
dis_all = dis_all.replace({"country_name": new_country_names}, inplace=False)

## Check for countries with missing ISO-Codes

In [None]:
countries_with_iso = dis_all.merge(un_country_codes, how="left", left_on='country_code', right_on='ISO_alpha3_Code')[["country_name", "country_code", "ISO_alpha3_Code"]]
mismatches = countries_with_iso[countries_with_iso.ISO_alpha3_Code.isnull()]
mismatches.country_name.unique()

In [None]:
countries_with_iso.head(10)
mask = countries_with_iso.notnull().all(axis=1)
countries_with_iso[~mask].country_name.unique()

## Assign ISO-Codes (we know of) to countries

In [None]:
dis_all.loc[dis_all.country_name == "Germany", "country_code"] = "DEU"
dis_all.loc[dis_all.country_name == "Serbia", "country_code"] = "SRB"
dis_all.loc[dis_all.country_name == "Yemen", "country_code"] = "YEM"
dis_all.loc[dis_all.country_name == "Taiwan", "country_code"] = "TWN"
dis_all.loc[dis_all.country_name == "Canary Is", "country_code"] = "SPI"
dis_all.loc[dis_all.country_name == "Azores Islands", "country_code"] = "AZO"

## Check which countries still do not have an ISO-Code

The following countries do not have an ISO-Code assigned, either because they no longer exist or because they are not recognized internationally.

For small territories like the Azores Islands or the Netherlands Antilles this is not critical, since they probably contribute only marginally to the total number of deaths caused by natural disasters, both globally and within their region.
They are therefore negligible.

For internationally unrecognized countries (Taiwan) we can fall back to an ISO-Code that we assign ourselves.

The difficult part is making sense of the observations that belong to a larger country which has split into smaller nations over the last 100 years (Soviet Union, Czechoslovakia, Yugoslavia).

In [None]:
countries_with_iso = dis_all.merge(un_country_codes, how="left", left_on='country_code', right_on='ISO_alpha3_Code')[["country_name", "country_code", "ISO_alpha3_Code", "Region_Code"]]
mismatches = countries_with_iso[countries_with_iso.ISO_alpha3_Code.isnull()]
mismatches.country_name.unique()

## Taiwan: Assign ISO-Code manually (TWN)

In [None]:
dis_all.loc[dis_all.country_name == "Taiwan", "country_code"] = "TWN"

## Count all disasters that happened in the Soviet Union

In [None]:
dis_all[dis_all.country_code == "SUN"].deaths.count()

## Count all disasters that happened in Czechoslovakia

In [None]:
dis_all[dis_all.country_code == "CSK"].deaths.count()

## Count all disasters that happened in Yugoslavia

In [None]:
dis_all[dis_all.country_code == "YUG"].deaths.count()

## Check in which part of the country the disaster occurred

In [None]:
dis_all[dis_all.country_code == "CSK"].location

## Determine Location

To determine in which currently existing country those disasters happened,
we need to take a look at the location attribute.

Fortunately, only the disasters in Czechoslovakia have missing values for the location attribute.
For all disasters in the other dissolved countries an exact location is provided.

Next, we match each location with the present-day countries that were part of the former nations.

Soviet Union (SUN):
- Armenia
- Azerbaijan
- Belarus
- Estonia
- Georgia
- Kazakhstan
- Kyrgyzstan
- Latvia
- Lithuania
- Moldova
- Russia
- Tajikistan
- Turkmenistan
- Ukraine
- Uzbekistan

Yugoslavia (YUG):
- Bosnia and Herzegovina
- Croatia
- Kosovo (included but not part of the dataset)
- Montenegro
- North Macedonia
- Serbia

Czechoslovakia (CSK):
- The Czech Republic
- Slovakia

In [None]:
former_sun_country_names = ["Russian Federation", "Armenia", "Azerbaijan", "Belarus", "Estonia", "Georgia", "Kazakhstan", "Kyrgyzstan", "Latvia", "Lithuania", "Moldova", "Tajikistan", "Turkmenistan", "Ukraine", "Uzbekistan"]

former_yug_country_names = ["Bosnia and Herzegovina", "Croatia", "Kosovo", "Montenegro", "North Macedonia","Serbia"]

former_csk_country_names = ["Czechia", "Slovakia"]
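Lists like these can be used to guess the present-day country from the free-text location. The following is only a sketch: `guess_successor` is a hypothetical helper, the candidate list is shortened, and the location strings are invented rather than taken from the dataset. Real location texts use varying spellings, so a plain substring match will miss some rows.

```python
import pandas as pd

# Shortened candidate list for the example.
successors = ["Armenia", "Ukraine", "Uzbekistan"]


def guess_successor(location, candidates):
    """Return the first candidate mentioned in the location text, else None."""
    if not isinstance(location, str):  # missing locations arrive as None/NaN
        return None
    lowered = location.lower()
    for name in candidates:
        if name.lower() in lowered:
            return name
    return None


# Invented location strings, including one missing value and one without a match.
locations = pd.Series(["Spitak, Armenia", "Tashkent (Uzbekistan)", None, "Volga region"])
guessed = locations.apply(lambda loc: guess_successor(loc, successors))
print(guessed)
```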

In [None]:
dis_all[dis_all.country_code == "SUN"].location

In [None]:
dis_sun_with_region = dis_all[dis_all.country_code == "SUN"].copy()
dis_sun_with_region["region"] = np.nan

mask_europe = dis_sun_with_region["location"].str.contains("Russian Federation|Ukraine|Moldavia|Siberia").fillna(False)
mask_asia = dis_sun_with_region["location"].str.contains("Kazakhstan|Azerbaijan|Uzbekistan|Turkmenistan|Georgia|Armenia|Kyrgystan|Tajikistan|Tajiskistan|Tadzhikistan|Tadjikistan|Caucasus region|Dushanbe", case=False).fillna(False)

# Disasters whose location matches both the European and the Asian mask
dis_sun_with_region[mask_europe & mask_asia]

Only one event happened in both the Asian and the European part of the Soviet Union.
It is also a major event: a drought which caused the death of 1.2 million people.

Researching the details of this event, one can conclude that this observation can only be the Russian famine of 1921–1922.
It mostly affected people living in Europe, hence we assign this single observation the region Europe.
(https://en.wikipedia.org/wiki/Russian_famine_of_1921%E2%80%931922)

For all other observations, the region should be unambiguous.

In [None]:
dis_sun_with_region.loc[mask_europe, "region"] = "Europe"
dis_sun_with_region.loc[mask_asia, "region"] = "Asia"
dis_sun_with_region.loc[1262, "region"] = "Europe"

The observations which still have missing region values all have no recorded deaths and can therefore be safely ignored.
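A claim like this can be backed by a short check rather than by inspection. A minimal sketch on an invented toy frame, assuming the `region` and `deaths` columns used above:

```python
import pandas as pd

# Toy frame: the last row has neither a location nor an assigned region.
toy = pd.DataFrame({
    "location": ["Ukraine", "Tajikistan", None],
    "deaths": [5, 20, 0],
    "region": ["Europe", "Asia", None],
})

# Rows that still lack a region after the mask-based assignment.
unassigned = toy[toy["region"].isna()]

# Fails loudly if an unassigned event reports any deaths.
assert unassigned["deaths"].sum() == 0, "some unassigned events report deaths"
print(len(unassigned), "unassigned event(s), none with recorded deaths")
```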

In [None]:
# Czechoslovakia
dis_all[dis_all.country_code == "CSK"].region

In [None]:
# Yugoslavia
dis_all[dis_all.country_code == "YUG"].region

## Save the Disaster-All file

In [None]:
dis_all.to_csv(filepath_all, index=False)

In [None]:
dis_all.head()

## Create/Save the Disaster-Country file

In [None]:
dis_country_col_names = ["year", "country_name", "country_code", "type", "subtype", "deaths"]
dis_country = dis_all.filter(items=dis_country_col_names)
dis_country.to_csv(filepath_country, index=False)

In [None]:
dis_country.head()

## Create/Save the Disaster-Region file

In [None]:
dis_region_col_names = ["year", "region", "country_code", "type", "subtype", "deaths"]
dis_region = dis_all.filter(items=dis_region_col_names)
dis_region.to_csv(filepath_region, index=False)

In [None]:
dis_region.head()
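The task description at the top asks for `region_code` and `region_name` derived from the UN dataset and for totals per region. One possible way to get there is sketched below on invented toy data. `Region_Code` appears in the merges above, while `Region_Name` is an assumed column name for the cleaned UN table.

```python
import pandas as pd

# Toy disaster events; the values are invented.
events = pd.DataFrame({
    "year": [2000, 2000, 2001],
    "country_code": ["DEU", "FRA", "JPN"],
    "type": ["Flood", "Flood", "Earthquake"],
    "deaths": [3, 4, 10],
})

# Toy UN lookup table; "Region_Name" is an assumption about the column name.
un_codes = pd.DataFrame({
    "ISO_alpha3_Code": ["DEU", "FRA", "JPN"],
    "Region_Code": [150, 150, 142],
    "Region_Name": ["Europe", "Europe", "Asia"],
})

# Attach the region to every event via the ISO3 code ...
with_region = events.merge(
    un_codes, how="left", left_on="country_code", right_on="ISO_alpha3_Code"
)

# ... and sum the deaths per region, year and disaster type.
by_region = (
    with_region
    .groupby(["Region_Code", "Region_Name", "year", "type"], as_index=False)["deaths"]
    .sum()
)
print(by_region)
```

Events from dissolved countries (SUN, CSK, YUG) would find no partner in such a merge and would need the manually assigned region instead.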

## Create/Save the Disaster-Global file

In [None]:
dis_global_attributes = ["year", "type", "subtype", "deaths"]
dis_global = dis_all[dis_global_attributes]
dis_global.to_csv(filepath_global, index=False)

In [None]:
dis_global.head()
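The saved global file still contains one row per event. For later use, yearly totals per disaster type may be more convenient. The following sketch on invented toy data shows one way to build such a table with `pivot_table`; it is an optional follow-up step, not part of the files written above.

```python
import pandas as pd

# Toy event-level data; the values are invented.
toy = pd.DataFrame({
    "year": [1990, 1990, 1991],
    "type": ["Flood", "Flood", "Storm"],
    "deaths": [10, 5, 7],
})

# One row per year, one column per disaster type, summed deaths as values.
wide = toy.pivot_table(
    index="year", columns="type", values="deaths", aggfunc="sum", fill_value=0
)
print(wide)
```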