# Natural Disaster Data

## Task:

The goal is to convert the data into the following formats for later use.
Along the way, this notebook performs some data preparation.


### Disaster-All
disaster/disaster-all.csv
Columns: disaster_no, year, subgroup, type, total_deaths, dis_mag_value, dis_mag_scale, start_year, end_year
Other interesting columns?


### Disaster-Global
disaster/disaster-global.csv
Columns: year, subgroup, type, total_deaths


### Disaster-Region
disaster/disaster-region.csv
Columns: region_code, region_name, year, subgroup, type, total_deaths
Calculate from country data, use UN Dataset to assign region to each country


### Disaster-Country
disaster/disaster-country.csv
Columns: year, country_code, country_name,  subtype, type, total_deaths

Prefix: dis


## Setup & Imports

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

filepath_source = Path('data/raw/disaster/emdat_public_2022_12_22_full.xlsx')
filepath_all = Path("data/processed/disaster/disaster-all.csv")

In [2]:
dis = pd.read_excel(filepath_source, skiprows = 6)
un_country_codes = pd.read_csv("data/raw/country-codes/un-country-codes.csv", sep=";")

  warn("Workbook contains no default style, apply openpyxl's default")


## First Look

In [3]:
dis.head()

Unnamed: 0,Dis No,Year,Seq,Glide,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,Disaster Subsubtype,Event Name,...,"Reconstruction Costs, Adjusted ('000 US$)",Insured Damages ('000 US$),"Insured Damages, Adjusted ('000 US$)",Total Damages ('000 US$),"Total Damages, Adjusted ('000 US$)",CPI,Adm Level,Admin1 Code,Admin2 Code,Geo Locations
0,1900-9002-CPV,1900,9002,,Natural,Climatological,Drought,Drought,,,...,,,,,,3.077091,,,,
1,1900-9001-IND,1900,9001,,Natural,Climatological,Drought,Drought,,,...,,,,,,3.077091,,,,
2,1901-0003-BEL,1901,3,,Technological,Technological,Industrial accident,Explosion,,Coal mine,...,,,,,,3.077091,,,,
3,1902-0012-GTM,1902,12,,Natural,Geophysical,Earthquake,Ground movement,,,...,,,,25000.0,781207.0,3.200175,,,,
4,1902-0003-GTM,1902,3,,Natural,Geophysical,Volcanic activity,Ash fall,,Santa Maria,...,,,,,,3.200175,,,,


## Select & Rename Attributes

1. Replace whitespaces with underscores
2. Convert every character to lowercase
3. Rename specific columns to ensure uniformity

In [4]:
# Replace spaces in all column names with underscores and convert them to lowercase
dis.columns = [c.replace(' ', '_').lower() for c in dis.columns]
# Rename specific columns so that names are uniform across datasets
dis.rename(columns={'country':'country_name', 'iso':'country_code', 'disaster_subgroup':'subgroup', 'disaster_subtype':'subtype', 'disaster_type':'type', 'total_deaths':'deaths'}, inplace=True)

In [5]:
# Select the most interesting columns
dis_all_col_names = ["year", "dis_no", "country_name", "country_code", "location", "subgroup", "type", "subtype", "deaths", "dis_mag_value", "dis_mag_scale", "start_year", "end_year"]
dis_all = dis.filter(items=dis_all_col_names)

## Which disaster groups are present in the dataset?

In [6]:
dis.disaster_group.unique()

array(['Natural', 'Technological', 'Complex Disasters'], dtype=object)

We focus only on disasters with natural causes.

In [7]:
dis = dis[dis.disaster_group == "Natural"]

Which types of natural disasters are there?

In [8]:
dis.groupby(["subgroup","type"]).agg({"deaths":"sum"}).reset_index()

Unnamed: 0,subgroup,type,deaths
0,Biological,Animal accident,12.0
1,Biological,Epidemic,9618804.0
2,Biological,Insect infestation,0.0
3,Climatological,Drought,11733889.0
4,Climatological,Glacial lake outburst,262.0
5,Climatological,Wildfire,4653.0
6,Extra-terrestrial,Impact,0.0
7,Geophysical,Earthquake,2343912.0
8,Geophysical,Mass movement (dry),4644.0
9,Geophysical,Volcanic activity,86893.0


Most of these types are what one would expect when thinking about natural disasters. Some, such as insect infestations or animal accidents, are less obvious, and they account for almost no deaths. For our research we also want to exclude epidemics.

We therefore decided to omit disasters of the subgroups `Biological` and `Extra-terrestrial`.

In [9]:
dis = dis[(dis["subgroup"] != "Biological") & (dis["subgroup"] != "Extra-terrestrial")]
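Note that `dis_all` was derived from `dis` in cell 5, before these filters ran, so the filters above do not automatically carry over to it. A minimal sketch on toy data of one way to propagate them, assuming `dis_no` uniquely identifies a row (the frames `raw` and `selected` are made-up stand-ins, not the notebook's data):

```python
import pandas as pd

# Toy stand-in for the full table (hypothetical rows, not EM-DAT data).
raw = pd.DataFrame({
    "dis_no": ["A-1", "A-2", "A-3", "A-4"],
    "disaster_group": ["Natural", "Technological", "Natural", "Natural"],
    "subgroup": ["Geophysical", "Technological", "Biological", "Hydrological"],
})

# Column subset taken before any filtering, as in the notebook.
selected = raw.filter(items=["dis_no", "subgroup"])

# Apply the group and subgroup filters to the full table ...
excluded_subgroups = ["Biological", "Extra-terrestrial"]
kept = raw[(raw["disaster_group"] == "Natural")
           & ~raw["subgroup"].isin(excluded_subgroups)]

# ... then keep only the matching rows in the column subset.
selected = selected[selected["dis_no"].isin(kept["dis_no"])]
print(selected)
```

Filtering by identifier avoids having to carry `disaster_group` into the column subset just for the filter.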

We only consider the following types of disasters:

In [10]:
dis["type"].unique()

array(['Drought', 'Earthquake', 'Volcanic activity',
       'Mass movement (dry)', 'Storm', 'Flood', 'Landslide', 'Wildfire',
       'Extreme temperature ', 'Fog', 'Glacial lake outburst'],
      dtype=object)
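The value `'Extreme temperature '` in the output above carries a trailing space, so an exact comparison against `'Extreme temperature'` would not match it. A small sketch, on toy data, of stripping surrounding whitespace from a string column (the series `types` is illustrative only):

```python
import pandas as pd

# Toy values mimicking the "type" column, including the trailing space.
types = pd.Series(["Flood", "Extreme temperature ", "Storm"])

# Remove leading and trailing whitespace from every value.
cleaned = types.str.strip()
print(cleaned.tolist())
```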

## Handle Missing Values

Two columns need attention here: the number of deaths and the subtype.

For the number of deaths, we assume that a missing value for a particular disaster means that the death toll was 0.

For the subtype, we check for which types of disasters no subtype is provided.

In [11]:
dis_all.isna().sum()

year                 0
dis_no               0
country_name         0
country_code         0
location          2340
subgroup             0
type                 0
subtype           3279
deaths            5316
dis_mag_value    20722
dis_mag_scale     8931
start_year           0
end_year             0
dtype: int64

In [12]:
dis_all[dis_all["subtype"].isna()]["type"].unique()

array(['Flood', 'Storm', 'Landslide', 'Wildfire', 'Fog', 'Epidemic',
       'Complex Disasters', 'Miscellaneous accident',
       'Insect infestation', 'Mass movement (dry)', 'Impact',
       'Volcanic activity', 'Animal accident', 'Drought', 'Earthquake',
       'Glacial lake outburst', 'Industrial accident'], dtype=object)

Unfortunately, the missing values in the subtype column do not correspond to specific types of disasters.
We cannot easily conclude what caused the values to be missing, so we label them as "Uncategorized".

In [13]:
dis_all[["subtype"]] = dis_all[["subtype"]].fillna("Uncategorized")

We assume that disasters with no reported death toll have a death toll of 0.
This also aligns with the information we get from EM-DAT (death counts below 10 are missing).

In [14]:
dis_all[['deaths']] = dis_all[['deaths']].fillna(value=0)

## Determine Regions

As mentioned in `data integration`, we determine the region of each disaster by taking the UN region assigned to its country.
Some country codes are not in the list of UN countries, so we handle them separately.

In [15]:
dis_iso_codes = set(dis_all["country_code"].unique())
un_iso_codes = set(un_country_codes["ISO-alpha3 Code"].tolist())
emdat_without_un = dis_iso_codes-un_iso_codes
print(f"countries in emdat, but not un ({len(emdat_without_un)}): \n{emdat_without_un}")

countries in emdat, but not un (12): 
{'SCG', 'YMD', 'ANT', 'TWN', 'AZO', 'YUG', 'YMN', 'DFR', 'DDR', 'CSK', 'SUN', 'SPI'}


### Automatically assign regions with UN Dataset

For countries that were split in the past but are now unified, we can simply assign the code of the unified country.

In [16]:
# Germany
dis_all.loc[dis_all['country_code'] == "DFR",'country_code'] = "DEU"
dis_all.loc[dis_all['country_code'] == "DDR",'country_code'] = "DEU"
# Yemen
dis_all.loc[dis_all['country_code'] == "YMD",'country_code'] = "YEM"
dis_all.loc[dis_all['country_code'] == "YMN",'country_code'] = "YEM"

Next, we determine the region by using the UN dataset:

In [17]:
dis_all = pd.merge(dis_all, un_country_codes[["ISO-alpha3 Code","Region Name", "Region Code"]], left_on='country_code', right_on='ISO-alpha3 Code', how='left')
dis_all.rename(columns={"Region Code": "region_code", "Region Name": "region_name"}, inplace=True)
dis_all.drop(columns=["ISO-alpha3 Code"],inplace=True)
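A left merge silently leaves the region columns empty for any code missing from the UN table. As a hedged sketch on toy data, the `indicator=True` option of `merge` can list those codes right after the join (the frames `events` and `regions` are made-up examples):

```python
import pandas as pd

# Toy event table and toy region lookup (hypothetical rows).
events = pd.DataFrame({"country_code": ["DEU", "TWN", "IND"]})
regions = pd.DataFrame({
    "ISO-alpha3 Code": ["DEU", "IND"],
    "Region Name": ["Europe", "Asia"],
    "Region Code": [150, 142],
})

# indicator=True adds a "_merge" column telling where each row came from.
merged = events.merge(
    regions,
    left_on="country_code",
    right_on="ISO-alpha3 Code",
    how="left",
    indicator=True,
)

# Codes that found no partner in the lookup table.
unmatched = sorted(
    merged.loc[merged["_merge"] == "left_only", "country_code"].unique()
)
print(unmatched)
```

Because unmatched rows receive missing values, an integer `Region Code` column becomes a float column after the merge.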

### Manually Assign Regions

Some countries clearly belong to one region, so we can assign their disasters manually.

In [18]:
# Taiwan
dis_all.loc[dis_all['country_code']=='TWN','region_name'] = 'Asia'
dis_all.loc[dis_all['country_code']=='TWN','region_code'] = 142
# Czechoslovakia
dis_all.loc[dis_all['country_code']=='CSK','region_name'] = 'Europe'
dis_all.loc[dis_all['country_code']=='CSK','region_code'] = 150
# Yugoslavia
dis_all.loc[dis_all['country_code']=='YUG','region_name'] = 'Europe'
dis_all.loc[dis_all['country_code']=='YUG','region_code'] = 150
# Serbia Montenegro
dis_all.loc[dis_all['country_code']=='SCG','region_name'] = 'Europe'
dis_all.loc[dis_all['country_code']=='SCG','region_code'] = 150
# Netherlands Antilles
dis_all.loc[dis_all['country_code']=='ANT','region_name'] = 'Americas'
dis_all.loc[dis_all['country_code']=='ANT','region_code'] = 19
# Azores Islands
dis_all.loc[dis_all['country_code']=='AZO','region_name'] = 'Europe'
dis_all.loc[dis_all['country_code']=='AZO','region_code'] = 150
# Canary Islands
dis_all.loc[dis_all['country_code']=='SPI','region_name'] = 'Europe'
dis_all.loc[dis_all['country_code']=='SPI','region_code'] = 150
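The same assignments can be written more compactly with a lookup table. This is only a sketch of an alternative layout, run on a toy frame; the dictionary name `manual_regions` and the frame `df` are hypothetical, while the codes and regions are the ones used in the cell above:

```python
import pandas as pd

# country_code -> (region_name, region_code), values copied from the cell above.
manual_regions = {
    "TWN": ("Asia", 142),
    "CSK": ("Europe", 150),
    "YUG": ("Europe", 150),
    "SCG": ("Europe", 150),
    "ANT": ("Americas", 19),
    "AZO": ("Europe", 150),
    "SPI": ("Europe", 150),
}

# Toy frame with two unassigned rows and one already assigned row.
df = pd.DataFrame({
    "country_code": ["TWN", "ANT", "IND"],
    "region_name": [None, None, "Asia"],
    "region_code": [float("nan"), float("nan"), 142.0],
})

for code, (name, number) in manual_regions.items():
    rows = df["country_code"] == code
    # Codes that do not occur in the frame simply select no rows.
    df.loc[rows, "region_name"] = name
    df.loc[rows, "region_code"] = number
print(df)
```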

### Manually Assign Regions of Soviet Union Disasters

Disasters in the Soviet Union can occur in its European and/or Asian parts.
Thus we also take the "location" attribute into account and try to derive whether the disaster was in Europe or Asia.

In [19]:
mask_europe = dis_all.loc[dis_all['country_code']=='SUN']["location"].str.contains("Russian Federation|Ukraine|Moldavia|Siberia").fillna(False)
mask_asia = dis_all.loc[dis_all['country_code']=='SUN']["location"].str.contains("Kazakhstan|Azerbaijan|Uzbekistan|Turkmenistan|Georgia|Armenia|Kyrgystan|Tajikistan|Tajiskistan|Tadzhikistan|Tadjikistan|Caucasus region|Dushanbe", case=False).fillna(False)
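Two details of the masks above are worth keeping in mind: `mask_europe` is case-sensitive while `mask_asia` passes `case=False`, and missing locations are handled with `.fillna(False)`. A small sketch on toy strings, assuming one wants both masks to behave identically, using the `na=False` argument of `str.contains` instead (the series `locations` and the shortened patterns are illustrative only):

```python
import pandas as pd

# Toy location strings, including a missing value and mixed casing.
locations = pd.Series(
    ["South Ukraine, Volga", "Dushanbe", None, "ukraine (Kazakhstan)"]
)

# na=False makes missing locations count as "no match" directly.
in_europe = locations.str.contains("Ukraine|Moldavia", case=False, na=False)
in_asia = locations.str.contains("Kazakhstan|Dushanbe", case=False, na=False)

# Rows matching both patterns need a manual decision.
both = in_europe & in_asia
print(both.tolist())
```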

In [20]:
dis_all[dis_all['country_code']=='SUN'][mask_europe & mask_asia]

Unnamed: 0,year,dis_no,country_name,country_code,location,subgroup,type,subtype,deaths,dis_mag_value,dis_mag_scale,start_year,end_year,region_name,region_code
1262,1921,1921-9001-SUN,Soviet Union,SUN,"South Ukraine, Volga, Ural (Kazakhstan,Russian...",Climatological,Drought,Drought,1200000.0,,Km2,1921,1921,,
1275,1923,1923-0001-SUN,Soviet Union,SUN,"Nationwide (Ukraine, Georgia, Russian Federati...",Biological,Epidemic,Parasitic disease,0.0,,Vaccinated,1923,1923,,
1316,1932,1932-9001-SUN,Soviet Union,SUN,"Nationwide (Russian Federation, Ukraine, Kazak...",Complex Disasters,Complex Disasters,Uncategorized,5000000.0,,,1932,1932,,


Three events match both the Asian and the European mask, but only one of them, the 1921 drought, belongs to the disaster types we consider; the other two are an epidemic and a complex disaster.
The drought is also a major event, since it caused the death of 1.2 million people.

Researching the details of this event, one can conclude that this observation can only be the Russian famine of 1921–1922.
It mostly affected people living in Europe, hence we assign this single observation the region Europe.
(https://en.wikipedia.org/wiki/Russian_famine_of_1921%E2%80%931922)

For all other observations within our scope, the region should be unambiguous.

In [21]:
dis_all.loc[(dis_all['country_code']=='SUN') & mask_europe, "region_name"] = "Europe"
dis_all.loc[(dis_all['country_code']=='SUN') & mask_europe, "region_code"] = 150

dis_all.loc[(dis_all['country_code']=='SUN') & mask_asia, "region_name"] = "Asia"
dis_all.loc[(dis_all['country_code']=='SUN') & mask_asia, "region_code"] = 142

dis_all.loc[dis_all['dis_no']=='1921-9001-SUN', "region_name"] = "Europe"
dis_all.loc[dis_all['dis_no']=='1921-9001-SUN', "region_code"] = 150

The only disasters still without a region are the following four Soviet Union events. Three of them report no deaths, and the fourth is an epidemic, which belongs to the excluded `Biological` subgroup, so we can safely ignore them.

In [22]:
dis_all[dis_all["region_name"].isna()]

Unnamed: 0,year,dis_no,country_name,country_code,location,subgroup,type,subtype,deaths,dis_mag_value,dis_mag_scale,start_year,end_year,region_name,region_code
1250,1917,1917-0002-SUN,Soviet Union,SUN,Nationwide,Biological,Epidemic,Uncategorized,2500000.0,,Vaccinated,1917,1917,,
4787,1981,1981-0280-SUN,Soviet Union,SUN,East,Meteorological,Storm,Uncategorized,0.0,,Kph,1981,1981,,
4788,1981,1981-0301-SUN,Soviet Union,SUN,,Hydrological,Flood,Uncategorized,0.0,,Km2,1981,1981,,
4822,1982,1982-0346-SUN,Soviet Union,SUN,East,Hydrological,Flood,Uncategorized,0.0,150.0,Km2,1982,1982,,


## Save File

In [23]:
dis_all.to_csv(filepath_all)
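By default `to_csv` also writes the row index as an extra unnamed column, and it raises an error if the target directory does not exist. A hedged sketch of a more defensive save, written against a temporary directory; whether later notebooks expect the index column is not stated here, so `index=False` is an assumption:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the processed table.
df = pd.DataFrame({"year": [1900, 1901], "deaths": [0.0, 12.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "disaster" / "disaster-all.csv"
    # Create missing parent directories before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False omits the row index from the file.
    df.to_csv(out_path, index=False)
    header = out_path.read_text().splitlines()[0]
print(header)
```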