# Natural Disaster Data

## Task:

The goal is to convert the raw disaster data into the following formats for later use.
Along the way, this notebook also performs some data preparation.


### Disaster-All
disaster/disaster-all.csv
Columns: disaster_no, year, subgroup, type, total_deaths, dis_mag_value, dis_mag_scale, start_year, end_year
Other interesting columns?


### Disaster-Global
disaster/disaster-global.csv
Columns: year, subgroup, type, total_deaths


### Disaster-Region
disaster/disaster-region.csv
Columns: region_code, region_name, year, subgroup, type, total_deaths
Calculate from country data, use UN Dataset to assign region to each country


### Disaster-Country
disaster/disaster-country.csv
Columns: year, country_code, country_name, subtype, type, total_deaths


## Setup & Imports

In [168]:
import pandas as pd
from pathlib import Path
filepath_source = Path('data/raw/disaster/emdat_public_2022_12_22_full.xlsx')
filepath_all = Path("data/processed/disaster/disaster-all.csv")
filepath_global = Path("data/processed/disaster/disaster-global.csv")
filepath_country = Path('data/processed/disaster/disaster-country.csv')
filepath_region = Path("data/processed/disaster/disaster-region.csv")

In [169]:
disasters = pd.read_excel(filepath_source, skiprows = 6)

  warn("Workbook contains no default style, apply openpyxl's default")


## First Look

In [170]:
disasters.head()

Unnamed: 0,Dis No,Year,Seq,Glide,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,Disaster Subsubtype,Event Name,...,"Reconstruction Costs, Adjusted ('000 US$)",Insured Damages ('000 US$),"Insured Damages, Adjusted ('000 US$)",Total Damages ('000 US$),"Total Damages, Adjusted ('000 US$)",CPI,Adm Level,Admin1 Code,Admin2 Code,Geo Locations
0,1900-9002-CPV,1900,9002,,Natural,Climatological,Drought,Drought,,,...,,,,,,3.077091,,,,
1,1900-9001-IND,1900,9001,,Natural,Climatological,Drought,Drought,,,...,,,,,,3.077091,,,,
2,1901-0003-BEL,1901,3,,Technological,Technological,Industrial accident,Explosion,,Coal mine,...,,,,,,3.077091,,,,
3,1902-0012-GTM,1902,12,,Natural,Geophysical,Earthquake,Ground movement,,,...,,,,25000.0,781207.0,3.200175,,,,
4,1902-0003-GTM,1902,3,,Natural,Geophysical,Volcanic activity,Ash fall,,Santa Maria,...,,,,,,3.200175,,,,


## Reformat Attribute-Names

1. Replace whitespaces with underscores
2. Convert every character to lowercase
3. Rename specific columns to ensure uniformity

In [171]:
# Replace whitespace in all column names with underscores and convert them to lower case
disasters.columns = [c.replace(' ', '_').lower() for c in disasters.columns]
# Rename specific columns to ensure uniformity with the other datasets
disasters.rename(columns={'country':'country_name', 'iso':'country_code', 'disaster_subtype':'subtype', 'disaster_type':'type', 'total_deaths':'deaths'}, inplace=True)

## Filter for all relevant attributes & observations

1. We only keep observations whose disaster group is "Natural". (rows)
2. We only keep the relevant attributes. (columns)

In [172]:
disasters = disasters[disasters.disaster_group == "Natural"]

In [173]:
disasters.dtypes

dis_no                                        object
year                                           int64
seq                                            int64
glide                                         object
disaster_group                                object
disaster_subgroup                             object
type                                          object
subtype                                       object
disaster_subsubtype                           object
event_name                                    object
country_name                                  object
country_code                                  object
region                                        object
continent                                     object
location                                      object
origin                                        object
associated_dis                                object
associated_dis2                               object
ofda_response                                 

In [174]:
# Disaster-All
disaster_all_col_names = ["year", "dis_no", "region", "continent", "country_name", "country_code", "type",
                           "subtype", "deaths", "dis_mag_value", "dis_mag_scale", "start_year", "end_year"]
disasters_all = disasters.filter(items=disaster_all_col_names)

## Check how many missing values each attribute contains

In [175]:
for col in disasters_all:
    print(col + ": " + str(disasters_all.loc[:, col].isnull().sum()))
print("Total: " + str(len(disasters_all)))

year: 0
dis_no: 0
region: 0
continent: 0
country_name: 0
country_code: 0
type: 0
subtype: 3269
deaths: 4748
dis_mag_value: 11458
dis_mag_scale: 1211
start_year: 0
end_year: 0
Total: 16488
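
The same per-column count can be obtained without an explicit loop. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from EM-DAT); `toy` stands in for `disasters_all`.

```python
import pandas as pd

# Toy stand-in for disasters_all; the values are made up for illustration.
toy = pd.DataFrame({
    "year": [1900, 1902, 1904],
    "subtype": ["Drought", None, "Tropical cyclone"],
    "deaths": [11000.0, 2000.0, None],
})

# Number of missing values per column, as a Series indexed by column name.
missing_per_column = toy.isna().sum()

# Share of missing values per column, which is easier to compare than raw counts.
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
print(f"Total rows: {len(toy)}")
```

Reading the share next to the count makes it easier to judge whether a column such as `dis_mag_value` is usable at all.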


## ISO-Codes

Compare the ISO codes in the disaster data with the UN country codes, so that each row can later be matched with the other datasets.

In [176]:
un_country_codes = pd.read_csv("data/raw/country-codes/un-country-codes.csv", sep=";")
un_country_codes.columns = [c.replace(' ', '_').replace('-','_') for c in un_country_codes.columns]

In [177]:
countries_with_iso = disasters_all.merge(un_country_codes, how="left", left_on='country_name', right_on='Country_or_Area')[["country_name", "country_code", "ISO_alpha3_Code"]]

In [178]:
countries_with_iso.head(10)

Unnamed: 0,country_name,country_code,ISO_alpha3_Code
0,Cabo Verde,CPV,CPV
1,India,IND,IND
2,Guatemala,GTM,GTM
3,Guatemala,GTM,GTM
4,Guatemala,GTM,GTM
5,Canada,CAN,CAN
6,Comoros (the),COM,
7,Bangladesh,BGD,BGD
8,Canada,CAN,CAN
9,India,IND,IND
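
Merging on the country name leaves a gap wherever the spelling differs between the two sources, as row 6 ("Comoros (the)") shows. One way to make such gaps explicit is the `indicator=True` option of `DataFrame.merge`. The sketch below uses two small made-up frames; only the column names `Country_or_Area` and `ISO_alpha3_Code` are taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for the disaster data and the UN code list (illustrative rows).
events = pd.DataFrame({
    "country_name": ["India", "Comoros (the)", "Canada"],
    "country_code": ["IND", "COM", "CAN"],
})
un_codes = pd.DataFrame({
    "Country_or_Area": ["India", "Comoros", "Canada"],
    "ISO_alpha3_Code": ["IND", "COM", "CAN"],
})

# Left merge on the name; "_merge" records whether a partner row was found.
by_name = events.merge(
    un_codes, how="left",
    left_on="country_name", right_on="Country_or_Area",
    indicator=True,
)
unmatched_by_name = by_name.loc[by_name["_merge"] == "left_only", "country_name"].tolist()

# Left merge on the code instead; spelling differences no longer matter.
by_code = events.merge(
    un_codes, how="left",
    left_on="country_code", right_on="ISO_alpha3_Code",
    indicator=True,
)
unmatched_by_code = by_code.loc[by_code["_merge"] == "left_only", "country_name"].tolist()

print(unmatched_by_name)
print(unmatched_by_code)
```

In this toy case the name-based merge misses one country while the code-based merge misses none, which is why comparing codes is usually the more robust check.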


## Display all countries for which NO matching ISO-Code was found

In [179]:
mismatches = countries_with_iso[countries_with_iso.ISO_alpha3_Code.isnull()]
mismatches.country_name.unique()

array(['Comoros (the)', 'Hong Kong', 'Gambia (the)', 'Germany Fed Rep',
       'Bahamas (the)', 'Dominican Republic (the)', 'Cook Islands (the)',
       'Azores Islands',
       'United Kingdom of Great Britain and Northern Ireland (the)',
       'Netherlands Antilles', 'Congo (the)', 'Czechoslovakia',
       'United States of America (the)', 'Soviet Union', 'Niger (the)',
       'Turkey', 'Philippines (the)', 'Taiwan (Province of China)',
       'Korea (the Republic of)', 'Sudan (the)', 'Netherlands (the)',
       'Canary Is', 'Tanzania, United Republic of',
       "Lao People's Democratic Republic (the)", 'Yemen Arab Rep',
       'Yugoslavia', 'Wallis and Futuna',
       'Congo (the Democratic Republic of the)', 'Yemen P Dem Rep',
       'Germany Dem Rep', 'Palestine, State of',
       "Korea (the Democratic People's Republic of)",
       'Turks and Caicos Islands (the)', 'Marshall Islands (the)',
       'Russian Federation (the)',
       'Macedonia (the former Yugoslav Republic of)'

## Rename remaining country names to a standardized format

In [180]:
# Remove the " (the)" suffix
disasters_all['country_name'] = disasters_all['country_name'].apply(lambda x: x.replace(' (the)', ''))
# Reorder complex country names of the form "X, Y" to "Y X"
disasters_all['country_name'] = disasters_all['country_name'].apply(lambda x: x.split(',')[1] + " " + x.split(',')[0] if ',' in x else x)
# Remove the leading whitespace left over from the split
disasters_all['country_name'] = disasters_all['country_name'].apply(lambda x: x[1:] if x.startswith(' ') else x)
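
The three clean-up steps above can also be written with pandas' vectorized string methods. This is a minimal sketch on a few sample names, not a drop-in replacement: it behaves the same as the cell above for names with at most one comma, but keeps everything after the first comma together if a name contains several.

```python
import pandas as pd

# A few sample names in the style of the source data.
names = pd.Series([
    "Comoros (the)",
    "Tanzania, United Republic of",
    "Palestine, State of",
    "India",
])

# 1. Drop the " (the)" suffix (literal match, no regular expression).
cleaned = names.str.replace(" (the)", "", regex=False)

# 2. Turn "X, Y" into "Y X"; names without a comma are left unchanged.
cleaned = cleaned.str.replace(r"^([^,]+),\s*(.+)$", r"\2 \1", regex=True)

# 3. Strip any leftover leading or trailing whitespace.
cleaned = cleaned.str.strip()

print(cleaned.tolist())
```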


In [181]:
new_country_names = {
    "Germany Fed Rep": "Germany",
    "Germany Dem Rep": "Germany",
    "Hong Kong": "China",
    "Macao": "China",
    "Åland": "Åland Islands",
    "Turkey": "Türkiye",
    "Korea (the Republic of)": "Republic of Korea",
    "Macedonia (the former Yugoslav Republic of)": "North Macedonia",
    # The " (the)" suffix removal does not touch this name, so both spellings are mapped
    "Congo (the Democratic Republic of the)": "Democratic Republic of the Congo",
    "Congo (Democratic Republic of the)": "Democratic Republic of the Congo",
    "Yemen P Dem Rep": "Yemen",
    "Yemen Arab Rep": "Yemen",
    "Korea (the Democratic People's Republic of)": "Democratic People's Republic of Korea",
    "Serbia Montenegro": "Serbia",
    "Moldova (the Republic of)": "Republic of Moldova",
    "Czech Republic": "Czechia",
    "Taiwan (Province of China)": "Taiwan",
}

In [182]:
disasters_all = disasters_all.replace({"country_name": new_country_names}, inplace=False)

## Check for countries with missing ISO-Codes

In [183]:
countries_with_iso = disasters_all.merge(un_country_codes, how="left", left_on='country_code', right_on='ISO_alpha3_Code')[["country_name", "country_code", "ISO_alpha3_Code"]]
mismatches = countries_with_iso[countries_with_iso.ISO_alpha3_Code.isnull()]
mismatches.country_name.unique()

array(['Germany', 'Azores Islands', 'Netherlands Antilles',
       'Czechoslovakia', 'Soviet Union', 'Taiwan', 'Canary Is', 'Yemen',
       'Yugoslavia', 'Serbia'], dtype=object)

In [184]:
# Cross-check: list every country whose row contains a missing value in any of the three columns
mask = countries_with_iso.notnull().all(axis=1)
countries_with_iso[~mask].country_name.unique()

array(['Germany', 'Azores Islands', 'Netherlands Antilles',
       'Czechoslovakia', 'Soviet Union', 'Taiwan', 'Canary Is', 'Yemen',
       'Yugoslavia', 'Serbia'], dtype=object)

## Assign ISO-Codes (we know of) to countries

In [185]:
disasters_all.loc[disasters_all.country_name == "Germany", "country_code"] = "DEU"
disasters_all.loc[disasters_all.country_name == "Serbia", "country_code"] = "SRB"
disasters_all.loc[disasters_all.country_name == "Yemen", "country_code"] = "YEM"
disasters_all.loc[disasters_all.country_name == "Taiwan", "country_code"] = "TWN"
disasters_all.loc[disasters_all.country_name == "Canary Is", "country_code"] = "SPI"
disasters_all.loc[disasters_all.country_name == "Azores Islands", "country_code"] = "AZO"

## Check which countries still do not have an ISO-Code

The following countries have no matching ISO-Code in the UN dataset, either because they no longer exist or because they are not recognized internationally.

For small territories such as the Azores Islands or the Netherlands Antilles this is not a serious problem, since they probably contribute only marginally to the total number of deaths from natural disasters, both globally and within their region.
They are therefore negligible.

For internationally unrecognized countries (Taiwan) we can fall back to an ISO-Code that we assign ourselves.

The difficult part is making sense of the observations that belong to a larger country which has been split up into smaller nations over the last 100 years (Soviet Union, Czechoslovakia, Yugoslavia).

In [186]:
countries_with_iso = disasters_all.merge(un_country_codes, how="left", left_on='country_code', right_on='ISO_alpha3_Code')[["country_name", "country_code", "ISO_alpha3_Code"]]
mismatches = countries_with_iso[countries_with_iso.ISO_alpha3_Code.isnull()]
mismatches.country_name.unique()

array(['Azores Islands', 'Netherlands Antilles', 'Czechoslovakia',
       'Soviet Union', 'Taiwan', 'Canary Is', 'Yugoslavia'], dtype=object)
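
Whether the unmatched entries are in fact negligible can be checked rather than assumed. The sketch below computes the share of all recorded deaths that falls on a set of codes; the codes `SUN`, `CSK` and `YUG` are the ones used in this notebook, while the frame `toy` and its numbers are made up for illustration.

```python
import pandas as pd

# Toy data; the death counts are invented and only illustrate the calculation.
toy = pd.DataFrame({
    "country_code": ["IND", "SUN", "GTM", "YUG", "CSK"],
    "deaths": [900.0, 50.0, 30.0, 20.0, None],
})
historical_codes = ["SUN", "CSK", "YUG"]

# Rows that belong to states which no longer exist.
is_historical = toy["country_code"].isin(historical_codes)

# Deaths per historical code (missing values count as 0 in the sum).
deaths_by_code = toy.loc[is_historical].groupby("country_code")["deaths"].sum()

# Share of all recorded deaths that falls on the historical codes.
share = toy.loc[is_historical, "deaths"].sum() / toy["deaths"].sum()

print(deaths_by_code)
print(f"Share of deaths on historical codes: {share:.1%}")
```

Run against `disasters_all`, the same calculation would show how much weight the dissolved states carry before deciding how to treat them.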

## Count the disasters that happened in the Soviet Union

In [187]:
disasters_all[disasters_all.country_code == "SUN"].count()

year             64
dis_no           64
region           64
continent        64
country_name     64
country_code     64
type             64
subtype          47
deaths           37
dis_mag_value    34
dis_mag_scale    53
start_year       64
end_year         64
dtype: int64

## Count the disasters that happened in Czechoslovakia

In [188]:
disasters_all[disasters_all.country_code == "CSK"].count()

year             9
dis_no           9
region           9
continent        9
country_name     9
country_code     9
type             9
subtype          7
deaths           2
dis_mag_value    0
dis_mag_scale    7
start_year       9
end_year         9
dtype: int64

## Count the disasters that happened in Yugoslavia

In [189]:
disasters_all[disasters_all.country_code == "YUG"].count()

year             22
dis_no           22
region           22
continent        22
country_name     22
country_code     22
type             22
subtype          18
deaths           12
dis_mag_value     9
dis_mag_scale    22
start_year       22
end_year         22
dtype: int64

In [190]:
disasters_all.dtypes

year               int64
dis_no            object
region            object
continent         object
country_name      object
country_code      object
type              object
subtype           object
deaths           float64
dis_mag_value    float64
dis_mag_scale     object
start_year         int64
end_year           int64
dtype: object

## Save the Disaster-All file

In [191]:
disasters_all.to_csv(filepath_all)
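
By default `to_csv` also writes the row index as an unnamed first column, and it fails if the target directory does not exist yet. If neither is wanted, the save step could look like the sketch below. The temporary directory and the frame `toy` are stand-ins for illustration; whether the index should be dropped depends on how the file is read later.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for disasters_all.
toy = pd.DataFrame({"year": [1900, 1902], "deaths": [11000.0, 2000.0]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output path below a directory that does not exist yet.
    out_path = Path(tmp) / "processed" / "disaster" / "disaster-all.csv"

    # Create missing parent directories before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False leaves the row index out of the file.
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm that no unnamed index column was written.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```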

## Create/Save the Disaster-Country file

In [192]:
disaster_country_col_names = ["year", "country_name", "country_code", "type", "subtype", "deaths"]
disasters_country = disasters_all.filter(items=disaster_country_col_names)
disasters_country.to_csv(filepath_country)

In [193]:
disasters_country.head(10)

Unnamed: 0,year,country_name,country_code,type,subtype,deaths
0,1900,Cabo Verde,CPV,Drought,Drought,11000.0
1,1900,India,IND,Drought,Drought,1250000.0
3,1902,Guatemala,GTM,Earthquake,Ground movement,2000.0
4,1902,Guatemala,GTM,Volcanic activity,Ash fall,1000.0
5,1902,Guatemala,GTM,Volcanic activity,Ash fall,6000.0
6,1903,Canada,CAN,Mass movement (dry),Rockfall,76.0
7,1903,Comoros,COM,Volcanic activity,Ash fall,17.0
10,1904,Bangladesh,BGD,Storm,Tropical cyclone,
12,1905,Canada,CAN,Mass movement (dry),Rockfall,18.0
13,1905,India,IND,Earthquake,Ground movement,20000.0


## Create/Save the Disaster-Region file

In [194]:
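# Note: filter(items=...) silently skips labels that are not columns of the frame.
# disasters_all has no "region_name", "region_code" or "iso" column, so only
# year, type, subtype and deaths end up in the file (see the output below).
disaster_region_col_names = ["year", "region_name", "region_code", "iso", "type", "subtype", "deaths"]
disasters_region = disasters_all.filter(items=disaster_region_col_names)
disasters_region.to_csv(filepath_region)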

In [195]:
disasters_region.head(10)

Unnamed: 0,year,type,subtype,deaths
0,1900,Drought,Drought,11000.0
1,1900,Drought,Drought,1250000.0
3,1902,Earthquake,Ground movement,2000.0
4,1902,Volcanic activity,Ash fall,1000.0
5,1902,Volcanic activity,Ash fall,6000.0
6,1903,Mass movement (dry),Rockfall,76.0
7,1903,Volcanic activity,Ash fall,17.0
10,1904,Storm,Tropical cyclone,
12,1905,Mass movement (dry),Rockfall,18.0
13,1905,Earthquake,Ground movement,20000.0
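
The output above contains no region columns: `disasters_all` has no `region_name` or `region_code` column, and `DataFrame.filter(items=...)` skips labels it cannot find. The task description suggests assigning a region to each country from the UN dataset. A minimal sketch of that idea follows. The frames are made up, and the UN column names `Region_Code` and `Region_Name` are assumptions about the file's layout; only `ISO_alpha3_Code` appears in this notebook.

```python
import pandas as pd

# Toy stand-in for the country-level disaster data.
events = pd.DataFrame({
    "year": [1900, 1900, 1902],
    "country_code": ["CPV", "IND", "GTM"],
    "type": ["Drought", "Drought", "Earthquake"],
    "deaths": [11000.0, 1250000.0, 2000.0],
})

# Toy stand-in for the UN code list; Region_Code / Region_Name are assumed names.
un_codes = pd.DataFrame({
    "ISO_alpha3_Code": ["CPV", "IND", "GTM"],
    "Region_Code": [2, 142, 19],
    "Region_Name": ["Africa", "Asia", "Americas"],
})

# Attach the region to every event via the ISO code.
with_region = events.merge(
    un_codes, how="left",
    left_on="country_code", right_on="ISO_alpha3_Code",
)
with_region = with_region.rename(
    columns={"Region_Code": "region_code", "Region_Name": "region_name"}
)

# Sum deaths per region, year and disaster type.
region_deaths = (
    with_region
    .groupby(["region_code", "region_name", "year", "type"], as_index=False)["deaths"]
    .sum()
)

print(region_deaths)
```

Events whose code is missing from the UN list would get no region in the left merge and drop out of the grouped result, so they need separate handling.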


## Create/Save the Disaster-Global file

In [196]:
disasters_global_attributes = ["year", "type", "subtype", "deaths"]
disasters_global = disasters_all[disasters_global_attributes]
disasters_global.to_csv(filepath_global)

In [197]:
disasters_global.head(10)

Unnamed: 0,year,type,subtype,deaths
0,1900,Drought,Drought,11000.0
1,1900,Drought,Drought,1250000.0
3,1902,Earthquake,Ground movement,2000.0
4,1902,Volcanic activity,Ash fall,1000.0
5,1902,Volcanic activity,Ash fall,6000.0
6,1903,Mass movement (dry),Rockfall,76.0
7,1903,Volcanic activity,Ash fall,17.0
10,1904,Storm,Tropical cyclone,
12,1905,Mass movement (dry),Rockfall,18.0
13,1905,Earthquake,Ground movement,20000.0
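
As written, the global file still holds one row per event. If yearly totals per disaster type are wanted instead (an assumption about the intended shape, which the task description does not spell out), the rows could be aggregated as in this sketch on made-up data.

```python
import pandas as pd

# Toy stand-in for disasters_global; the last event has no recorded death count.
toy = pd.DataFrame({
    "year": [1902, 1902, 1902, 1903],
    "type": ["Volcanic activity", "Volcanic activity", "Earthquake", "Storm"],
    "deaths": [1000.0, 6000.0, 2000.0, None],
})

# Sum deaths per year and type. min_count=1 keeps the total missing (NaN)
# when no event in a group has a recorded death count, instead of reporting 0.
global_deaths = (
    toy.groupby(["year", "type"], as_index=False)["deaths"]
    .sum(min_count=1)
)

print(global_deaths)
```

Keeping the missing totals as NaN avoids reporting zero deaths for years in which the count is simply unknown.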
