# Climate Change and Deaths by Natural Disasters

*Sebastian Fürndraht, Hannes Rokitte, Paul Schmitt, Lukas Wieser*

## Overview
- Introduction
    - Research Questions
    - Used Datasets
    - Requirements & Dependencies
    - Constants
    - Download Temperature Data
- Data Integration
    - ...
- Prepare Datasets
    - ...
- Data Exploration
    - ...
- Conclusion
    - ...

## 1. Introduction

### Research Question

### Used Datasets

### Requirements & Dependencies

This project was created using Python 3.9.
The exact versions of the dependencies can be installed with the following command.

In [None]:
#!pip install -r requirements.txt -q

In [1]:
import numpy as np
import pandas as pd
import urllib.parse
import requests
from pathlib import Path

### Constants

In [19]:
# Constants should be uppercase
# Path variables should end with FOLDER or FILE
RAW_TEMP_DATA_FOLDER = "data/raw/temperature/"
COUNTRIES_LIST_FILE = "temp-countries-list.csv"

DIS_RAW_FILE = Path('data/raw/disaster/emdat_public_2022_12_22_full.xlsx')
DIS_PROCESSED_ALL_FILE = Path("data/processed/disaster/disaster-all.csv")
DIS_PROCESSED_FOLDER = "data/processed/disaster"

TEMP_PROCESSED_FOLDER = "data/processed/temperature"

### Create Directories

In [20]:
Path(DIS_PROCESSED_FOLDER).mkdir(parents=True, exist_ok=True)
Path(TEMP_PROCESSED_FOLDER).mkdir(parents=True, exist_ok=True)

### Download Temperature Data
The following functions download the regional and country temperature data from Berkeley Earth automatically, so that the files do not have to be fetched one by one.

In [4]:
countries = pd.read_csv("data/raw/temperature/countries-list.csv", sep=";")

In [5]:
temp_regions = countries["Region"].dropna().unique().tolist()
temp_countries = countries["Country"].tolist()

In [6]:
def download_temperature_countries(country_names: list[str]):
    for country in country_names:
        print(f"downloading {country}")
        # build the file name used in the URL: lower case, hyphens instead of spaces, cp1252 percent-encoding
        country_encoded = urllib.parse.quote(country.lower().replace(" ", "-"), encoding='cp1252')
        url = f"http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/{country_encoded}-TAVG-Trend.txt"
        response = requests.get(url, timeout=60)
        # fail loudly instead of writing an HTTP error page to disk
        response.raise_for_status()
        with open(f'data/raw/temperature/countries-land/{country}.txt', 'w', encoding="utf-8") as file:
            file.write(response.text)

In [7]:
def download_temperature_regions(region_names: list[str]):
    for region in region_names:
        print(f"downloading {region}")
        # build the file name used in the URL: lower case, hyphens instead of spaces, cp1252 percent-encoding
        region_encoded = urllib.parse.quote(region.lower().replace(" ", "-"), encoding='cp1252')
        url = f"http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/{region_encoded}-TAVG-Trend.txt"
        response = requests.get(url, timeout=60)
        # fail loudly instead of writing an HTTP error page to disk
        response.raise_for_status()
        with open(f'data/raw/temperature/regions-land/{region}.txt', 'w', encoding="utf-8") as file:
            file.write(response.text)

Set the variable `should_download_temperature_data` to `True` to download the temperature data of countries and regions. (This should not be necessary, since the data should already be downloaded.)

In [8]:
should_download_temperature_data = False
if should_download_temperature_data:
    download_temperature_countries(temp_countries)
    download_temperature_regions(temp_regions)
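
The URL construction used by both download functions can be looked at in isolation. The following is a minimal sketch; the helper name `build_trend_url` and the constant `BASE_URL` are hypothetical and not part of the notebook, while the base address and the `-TAVG-Trend.txt` suffix are taken from the functions above.

```python
import urllib.parse

# Base address as used in the download functions above
BASE_URL = "http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text"


def build_trend_url(name: str) -> str:
    """Hypothetical helper: build the trend-file URL for a country or region name."""
    # lower case and hyphens instead of spaces, e.g. "South Korea" -> "south-korea"
    slug = name.lower().replace(" ", "-")
    # percent-encode non-ASCII characters as cp1252 bytes, as the notebook does
    encoded = urllib.parse.quote(slug, encoding="cp1252")
    return f"{BASE_URL}/{encoded}-TAVG-Trend.txt"


print(build_trend_url("South Korea"))
print(build_trend_url("Åland"))
```

With `encoding="cp1252"`, a character such as the `å` in `åland` becomes the single byte `%E5` rather than a two-byte UTF-8 sequence.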

## 2. Data Integration

### Load Datasets

In [9]:
disasters = pd.read_excel('data/raw/disaster/emdat_public_2022_12_22_full.xlsx', skiprows=6, sheet_name="emdat data")

temperature_countries = pd.read_csv("data/raw/temperature/countries-list.csv", sep=";")
population_by_country = pd.read_excel('data/raw/population/gapminder-population-v7.xlsx', sheet_name="data-for-countries-etc-by-year")
population_by_region = pd.read_excel('data/raw/population/gapminder-population-v7.xlsx', sheet_name="data-for-regions-by-year")

un_country_codes = pd.read_csv("data/raw/country-codes/un-country-codes.csv", sep=";")
cia_country_codes = pd.read_csv("data/raw/country-codes/cia-country-codes.csv", sep="\t")



In [10]:
dis = pd.read_excel(DIS_RAW_FILE, skiprows = 6)



Preprocess CIA dataset (since it is a bit messy)

In [11]:
cia_country_codes.head(3)

Unnamed: 0,Entity,GENC,ISO 3166,Stanag,Internet,Comment
0,Afghanistan,AFG,AF | AFG | 004,AFG,.af,-
1,Akrotiri,XQZ,- | - | -,-,-,-
2,Albania,ALB,AL | ALB | 008,ALB,.al,-


In [12]:
# split iso-codes in separate columns (n must be passed as a keyword in newer pandas versions)
iso_columns = ["ISO-alpha2", "ISO-alpha3", "ISO-numeric"]
cia_country_codes[iso_columns] = cia_country_codes["ISO 3166"].str.split("|", n=2, expand=True)
cia_country_codes.drop(columns=["ISO 3166"], inplace=True)
# strip whitespaces from iso-codes
cia_country_codes[iso_columns] = cia_country_codes[iso_columns].apply(lambda x: x.str.strip())
# replace not existing iso-codes with NaN for more clarity
cia_country_codes[iso_columns] = cia_country_codes[iso_columns].replace("-", np.nan)
# show preprocessed cia data
cia_country_codes.head(3)



Unnamed: 0,Entity,GENC,Stanag,Internet,Comment,ISO-alpha2,ISO-alpha3,ISO-numeric
0,Afghanistan,AFG,AFG,.af,-,AF,AFG,4.0
1,Akrotiri,XQZ,-,-,-,,,
2,Albania,ALB,ALB,.al,-,AL,ALB,8.0


### Determine ISO codes for temperature data

Remove aggregated countries.

In [13]:
temperature_countries.shape

(237, 2)

Some countries, e.g. Denmark, appear twice in the list. This happens several times, because some entries are aggregates of other entries: `Denmark` consists of `Denmark (Europe)`, also known as `Denmark Mainland`, and `Greenland`. The Berkeley Earth website shows a world map on which the selected country is highlighted, which helped us to understand what each of the conflicting entries covers.

In [14]:
temperature_countries.iloc[55: 55+5]

Unnamed: 0,Country,Region
55,Cyprus,Asia
56,Czech Republic,Europe
57,Denmark,North America
58,Denmark (Europe),Europe
59,Djibouti,Africa


We decided to remove the aggregated entries. The following aggregates were removed; their individual parts still exist in the dataset:
- Denmark (Denmark Mainland, Greenland)
- France (France Mainland, French Guiana, French Polynesia, French Southern and Antarctic Lands)
- Netherlands (Netherlands Mainland, Sint Maarten, Curaçao, Aruba)
- United Kingdom (United Kingdom + overseas territories such as Montserrat, Bermuda)

In [15]:
temperature_countries_remove = pd.DataFrame({
    "Country": ["Denmark","France", "Netherlands", "United Kingdom"],
    "Region": ["North America", np.nan, "Europe", "Europe"]
})
temperature_countries_cleaned = pd.concat([temperature_countries, temperature_countries_remove]).drop_duplicates(keep=False)
temperature_countries_cleaned.shape

(233, 2)
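
The removal step relies on a small trick: the rows to remove are appended to the original table, and `drop_duplicates(keep=False)` then discards every row that occurs more than once. The sketch below replays this on a few invented rows; the `toy_` names are illustrative only.

```python
import numpy as np
import pandas as pd

# Invented miniature version of the country list
toy_countries = pd.DataFrame({
    "Country": ["Denmark", "Denmark (Europe)", "France", "France (Europe)"],
    "Region": ["North America", "Europe", np.nan, "Europe"],
})
# Rows to remove; they must match in every column, including a missing region
toy_remove = pd.DataFrame({
    "Country": ["Denmark", "France"],
    "Region": ["North America", np.nan],
})

# Each row to remove now occurs twice, and keep=False drops both copies
toy_cleaned = pd.concat([toy_countries, toy_remove]).drop_duplicates(keep=False)
print(toy_cleaned)
```

One caveat of this approach is that rows which are already duplicated within the original table would be dropped as well, so it fits best when the list is known to be free of duplicates.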

Rename Countries & Match ISO Codes

In [16]:
# some countries need to be renamed so that we find the matching country-code later
new_country_names = {
    "Denmark (Europe)": "Denmark",
    "France (Europe)": "France",
    "Netherlands (Europe)": "Netherlands",
    "United Kingdom (Europe)": "United Kingdom",
    "Åland": "Åland Islands",
    "Czech Republic": "Czechia",
    "Turkey": "Türkiye",
    "Svalbard and Jan Mayen": "Svalbard and Jan Mayen Islands",
    "Cape Verde": "Cabo Verde",
    "Turks and Caicas Islands": "Turks and Caicos Islands",
    "Swaziland": "Eswatini",
    "Macedonia": "North Macedonia",
    "Côte d'Ivoire": "Côte d’Ivoire",
    "Federated States of Micronesia": "Micronesia (Federated States of)",
    "South Georgia and the South Sandwich Isla": "South Georgia and the South Sandwich Islands",
    "Bonaire, Saint Eustatius and Saba": "Bonaire, Sint Eustatius and Saba",
    "Congo (Democratic Republic of the)": "Democratic Republic of the Congo",
    "South Korea": "Korea, South",
    "North Korea": "Korea, North",
    "Palestina": "State of Palestine"
}

temperature_countries_cleaned = temperature_countries_cleaned.replace({"Country": new_country_names}, inplace=False)

# left-join cia-country-codes and un-country-codes
temperature_countries_with_iso = temperature_countries_cleaned.merge(cia_country_codes,how="left",left_on='Country', right_on='Entity')[["Country","ISO-alpha3"]]
temperature_countries_with_iso = temperature_countries_with_iso.merge(un_country_codes,how="left",left_on='Country', right_on='Country or Area')[["Country","ISO-alpha3", "ISO-alpha3 Code"]]

# fill missing cia-country codes with un-country-codes (assign the result instead of a chained inplace call)
temperature_countries_with_iso["ISO-alpha3"] = temperature_countries_with_iso["ISO-alpha3"].fillna(temperature_countries_with_iso["ISO-alpha3 Code"])
temperature_countries_with_iso.drop(columns=["ISO-alpha3 Code"], inplace=True)

# show countries for which we could not find an ISO code
temperature_countries_with_iso[temperature_countries_with_iso["ISO-alpha3"].isna()]

Unnamed: 0,Country,ISO-alpha3
18,Baker Island,
113,Kingman Reef,
161,Palmyra Atoll,


These three areas have no country code of their own and are very small, so we ignore them in the following steps.
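
The lookup with a fallback source can be sketched on a few invented rows. The column names follow the CIA and UN tables used above; which names appear in which toy table is made up for illustration.

```python
import pandas as pd

# Invented toy inputs
names = pd.DataFrame({"Country": ["Albania", "Czechia", "Baker Island"]})
cia = pd.DataFrame({"Entity": ["Albania"], "ISO-alpha3": ["ALB"]})
un = pd.DataFrame({
    "Country or Area": ["Albania", "Czechia"],
    "ISO-alpha3 Code": ["ALB", "CZE"],
})

# First source: left join keeps every name, unmatched names get NaN
with_iso = names.merge(cia, how="left", left_on="Country", right_on="Entity")
# Second source: joined the same way, used only where the first one has no code
with_iso = with_iso.merge(un, how="left", left_on="Country", right_on="Country or Area")
with_iso["ISO-alpha3"] = with_iso["ISO-alpha3"].fillna(with_iso["ISO-alpha3 Code"])
with_iso = with_iso[["Country", "ISO-alpha3"]]

# Names that neither source could resolve
unmatched = with_iso[with_iso["ISO-alpha3"].isna()]
print(unmatched)
```

Because both joins are left joins, no name is lost along the way; unresolved names simply end up with a missing code and can be inspected afterwards.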

In [22]:
# Todo: When combining adjust to use variable without saving temperature_countries_with_iso as csv file
# build the target path from the constants defined above
processed_countries_list_path = Path(TEMP_PROCESSED_FOLDER) / COUNTRIES_LIST_FILE
temperature_countries_with_iso.to_csv(processed_countries_list_path, index=False)

### Which countries are in which datasets?

Disaster vs. Temperature Dataset

In [23]:
berkely_iso_codes = set(temperature_countries_with_iso["ISO-alpha3"].dropna().tolist())
emdat_iso_codes = set(disasters["ISO"].unique().tolist())

emdat_and_bekely = emdat_iso_codes.intersection(berkely_iso_codes)
emdat_without_berkely = emdat_iso_codes-emdat_and_bekely
berkely_without_emdat = berkely_iso_codes-emdat_and_bekely

print(f"countries in emdat & berkely: {len(emdat_and_bekely)}")
print(f"countries in emdat but not berkely ({len(emdat_without_berkely)}):")
print(sorted(emdat_without_berkely))
print(f"countries in berkely but not emdat ({len(berkely_without_emdat)}):")
print(sorted(berkely_without_emdat))

countries in emdat & berkely: 209
countries in emdat but not berkely (22):
['ANT', 'AZO', 'BMU', 'BRN', 'COK', 'CSK', 'DDR', 'DFR', 'MDV', 'MHL', 'SCG', 'SHN', 'SPI', 'SSD', 'SUN', 'TKL', 'TUV', 'VUT', 'WLF', 'YMD', 'YMN', 'YUG']
countries in berkely but not emdat (20):
['ABW', 'ALA', 'AND', 'ATA', 'ATF', 'BES', 'CXR', 'ESH', 'FLK', 'FRO', 'GGY', 'GRL', 'HMD', 'JEY', 'LIE', 'MCO', 'SGS', 'SJM', 'SMR', 'SPM']


The countries for which we have disaster data but no temperature data are as follows:
- Existing Countries (usually very small countries/islands):
    - AZO Azores Islands
    - BMU Bermuda
    - BRN Brunei Darussalam
    - COK Cook Islands (the)
    - MDV Maldives
    - MHL Marshall Islands (the)
    - SHN Saint Helena, Ascension and Tristan da Cunha
    - SSD South Sudan
    - TKL Tokelau
    - TUV Tuvalu
    - VUT Vanuatu
    - WLF Wallis and Futuna
- Existing Countries (but invalid country code):
    - SPI Canary Islands
- Former Countries:
    - ANT Netherlands Antilles
    - CSK Czechoslovakia
    - DDR Germany Dem Rep
    - DFR Germany Fed Rep
    - SCG Serbia Montenegro
    - SUN Soviet Union
    - YMD Yemen P Dem Rep
    - YMN Yemen Arab Rep
    - YUG Yugoslavia

Disaster vs Population Dataset

In [24]:
gapminder_iso_codes = set(population_by_country["geo"].str.upper().unique())

In [25]:
emdat_and_gapminder = emdat_iso_codes.intersection(gapminder_iso_codes)
emdat_without_gapminder = emdat_iso_codes-emdat_and_gapminder
gapminder_without_emdat = gapminder_iso_codes-emdat_and_gapminder

print(f"countries in emdat & gapminder: {len(emdat_and_gapminder)}")
print(f"countries in emdat but not gapminder ({len(emdat_without_gapminder)}):")
print(sorted(emdat_without_gapminder))
print(f"countries in gapminder but not emdat ({len(gapminder_without_emdat)}):")
print(sorted(gapminder_without_emdat))

countries in emdat & gapminder: 191
countries in emdat but not gapminder (40):
['AIA', 'ANT', 'ASM', 'AZO', 'BLM', 'BMU', 'COK', 'CSK', 'CUW', 'CYM', 'DDR', 'DFR', 'GLP', 'GUF', 'GUM', 'IMN', 'MAC', 'MAF', 'MNP', 'MSR', 'MTQ', 'MYT', 'NCL', 'NIU', 'PRI', 'PYF', 'REU', 'SCG', 'SHN', 'SPI', 'SUN', 'SXM', 'TCA', 'TKL', 'VGB', 'VIR', 'WLF', 'YMD', 'YMN', 'YUG']
countries in gapminder but not emdat (6):
['AND', 'HOS', 'LIE', 'MCO', 'NRU', 'SMR']


The countries for which we have disaster data but no population data are as follows:
- Existing Countries (independent)
    - COK Cook Islands (the)
    - NIU Niue
- Existing Countries (dependent, e.g. overseas territories)
    - AIA Anguilla
    - ASM American Samoa
    - AZO Azores Islands
    - BLM Saint Barthélemy
    - BMU Bermuda
    - CUW Curaçao
    - CYM Cayman Islands (the)
    - GLP Guadeloupe
    - GUF French Guiana
    - GUM Guam
    - IMN Isle of Man
    - MAC Macao
    - MAF Saint Martin (French Part)
    - MNP Northern Mariana Islands (the)
    - MSR Montserrat
    - MTQ Martinique
    - MYT Mayotte
    - NCL New Caledonia
    - PRI Puerto Rico
    - PYF French Polynesia
    - REU Réunion
    - SHN Saint Helena, Ascension and Tristan da Cunha
    - SPI Canary Islands
    - SXM Sint Maarten (Dutch part)
    - TCA Turks and Caicos Islands (the)
    - TKL Tokelau
    - VGB Virgin Islands (British)
    - VIR Virgin Islands (U.S.)
    - WLF Wallis and Futuna
- Former Countries
    - ANT Netherlands Antilles
    - CSK Czechoslovakia
    - DDR Germany Dem Rep
    - DFR Germany Fed Rep
    - SCG Serbia Montenegro
    - SUN Soviet Union
    - YMD Yemen P Dem Rep
    - YMN Yemen Arab Rep
    - YUG Yugoslavia
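
Both comparisons above follow the same pattern of set operations, which could be wrapped in a small helper. This is only a sketch: the function name `compare_code_sets` and the example codes are chosen for illustration and do not appear in the notebook.

```python
def compare_code_sets(left: set, right: set) -> tuple:
    """Hypothetical helper: return (codes in both, codes only in left, codes only in right)."""
    in_both = left & right
    return in_both, left - right, right - left


# Invented example sets
toy_emdat = {"DEU", "SUN", "TWN"}
toy_other = {"DEU", "AND", "TWN"}

in_both, only_emdat, only_other = compare_code_sets(toy_emdat, toy_other)
print(f"in both: {sorted(in_both)}")
print(f"only in first: {sorted(only_emdat)}")
print(f"only in second: {sorted(only_other)}")
```

Sorting the sets before printing keeps the output stable between runs, since sets have no defined order.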

### Which regions are in which datasets?

In [26]:
disasters["Continent"].unique().tolist()

['Africa', 'Asia', 'Europe', 'Americas', 'Oceania']

In [27]:
population_by_region["geo"].unique().tolist()

['africa', 'asia', 'europe', 'americas']

In [28]:
temperature_countries["Region"].dropna().unique().tolist()

['Asia', 'Europe', 'Africa', 'South America', 'Oceania', 'North America']

As we can see, each dataset uses a different set of regions. Additionally, we do not know which countries belong to each region. For example, the region Europe could consist of different countries in the disaster dataset than in the temperature dataset. That is why we decided to compute the regional data ourselves by aggregating the countries according to UN regions.
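
Aggregating country-level values to UN regions could look roughly like the following sketch. All numbers are invented, and the column names `country_code` and `deaths` are assumptions made for this example; only `ISO-alpha3 Code` and `Region Name` are taken from the UN table.

```python
import pandas as pd

# Invented country-level values
country_values = pd.DataFrame({
    "country_code": ["DEU", "FRA", "JPN"],
    "deaths": [10, 5, 7],
})
# Invented excerpt of a country-to-region mapping
un_regions = pd.DataFrame({
    "ISO-alpha3 Code": ["DEU", "FRA", "JPN"],
    "Region Name": ["Europe", "Europe", "Asia"],
})

# Attach the region to every country, then sum per region
regional = (
    country_values
    .merge(un_regions, left_on="country_code", right_on="ISO-alpha3 Code", how="left")
    .groupby("Region Name")["deaths"]
    .sum()
)
print(regional)
```

Using the same mapping for every dataset ensures that a region such as Europe always consists of the same countries.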

### Which countries need to be manually assigned a region?

Check which countries are not in the UN Dataset, and thus need to be manually assigned a region:

In [29]:
un_iso_codes = set(un_country_codes["ISO-alpha3 Code"].tolist())

emdat_without_un = emdat_iso_codes-un_iso_codes
gapminder_without_un = gapminder_iso_codes-un_iso_codes
berkely_without_un = berkely_iso_codes-un_iso_codes

print(f"countries in emdat, but not un ({len(emdat_without_un)}): \n{emdat_without_un}")
print(f"countries in gapminder, but not un ({len(gapminder_without_un)}): \n{gapminder_without_un}")
print(f"countries in berkely, but not un ({len(berkely_without_un)}): \n{berkely_without_un}")

countries in emdat, but not un (12): 
{'DDR', 'CSK', 'SCG', 'TWN', 'DFR', 'SUN', 'YMD', 'YUG', 'AZO', 'YMN', 'ANT', 'SPI'}
countries in gapminder, but not un (2): 
{'HOS', 'TWN'}
countries in berkely, but not un (1): 
{'TWN'}


## 3. Prepare Datasets

### 3.1 Disaster Data

Task:

The goal is to convert the data into the following formats for later use.
Along the way, this notebook does some data preparation.

Disaster-All (`disaster/disaster-all`):
- Columns: disaster_no, year, subgroup, type, total_deaths, dis_mag_value, dis_mag_scale, start_year, end_year
- Other interesting columns?

Prefix: dis


Publisher: Centre for Research on the Epidemiology of Disasters (CRED)

CRED defines a disaster as “a situation or event that overwhelms local capacity, necessitating a
request at the national or international level for external assistance; an unforeseen and often sudden
event that causes great damage, destruction and human suffering”

For a disaster to be entered into the database at least one of the following criteria must be fulfilled:

- 10 or more people reported killed
- 100 or more people reported affected
- Declaration of a state of emergency
- Call for international assistance
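
The entry rule is a plain "at least one of" condition, which can be written as a small predicate. The function and parameter names below are hypothetical; EM-DAT itself does not provide such a function.

```python
def meets_entry_criteria(
    killed: int,
    affected: int,
    emergency_declared: bool,
    assistance_requested: bool,
) -> bool:
    """Hypothetical predicate: True if at least one of the four criteria is fulfilled."""
    return (
        killed >= 10
        or affected >= 100
        or emergency_declared
        or assistance_requested
    )


# An event with few casualties still qualifies if a state of emergency was declared
print(meets_entry_criteria(killed=2, affected=30, emergency_declared=True, assistance_requested=False))
# An event that fulfils none of the criteria would not be recorded
print(meets_entry_criteria(killed=2, affected=30, emergency_declared=False, assistance_requested=False))
```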

First look at the disaster data

In [30]:
dis.head()

Unnamed: 0,Dis No,Year,Seq,Glide,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,Disaster Subsubtype,Event Name,...,"Reconstruction Costs, Adjusted ('000 US$)",Insured Damages ('000 US$),"Insured Damages, Adjusted ('000 US$)",Total Damages ('000 US$),"Total Damages, Adjusted ('000 US$)",CPI,Adm Level,Admin1 Code,Admin2 Code,Geo Locations
0,1900-9002-CPV,1900,9002,,Natural,Climatological,Drought,Drought,,,...,,,,,,3.077091,,,,
1,1900-9001-IND,1900,9001,,Natural,Climatological,Drought,Drought,,,...,,,,,,3.077091,,,,
2,1901-0003-BEL,1901,3,,Technological,Technological,Industrial accident,Explosion,,Coal mine,...,,,,,,3.077091,,,,
3,1902-0012-GTM,1902,12,,Natural,Geophysical,Earthquake,Ground movement,,,...,,,,25000.0,781207.0,3.200175,,,,
4,1902-0003-GTM,1902,3,,Natural,Geophysical,Volcanic activity,Ash fall,,Santa Maria,...,,,,,,3.200175,,,,


#### Select & Rename Attributes

1. Replace whitespaces with underscores
2. Convert every character to lowercase
3. Rename specific columns to ensure uniformity

In [31]:
# Replace whitespaces in all col-names with underscores and convert them to lower-case
dis.columns = [c.replace(' ', '_').lower() for c in dis.columns]
dis.rename(columns={'country':'country_name', 'iso':'country_code', 'disaster_subgroup':'subgroup', 'disaster_subtype':'subtype', 'disaster_type':'type', 'total_deaths':'deaths'}, inplace=True)

In [32]:
# Select the most interesting columns
dis_all_col_names = ["year", "dis_no", "country_name", "country_code", "location", "subgroup", "type", "subtype", "deaths", "dis_mag_value", "dis_mag_scale", "start_year", "end_year"]
dis_all = dis.filter(items=dis_all_col_names)

#### Which disaster groups are present in the dataset?

The dataset contains natural disasters, technological disasters, and complex disasters; the latter represent specific events (e.g. famine) that are not directly linked to a natural hazard.

In [33]:
dis.disaster_group.unique()

array(['Natural', 'Technological', 'Complex Disasters'], dtype=object)

We focus only on disasters that have natural causes.

In [34]:
dis = dis[dis.disaster_group == "Natural"]

Which types of natural disasters are there?

In [35]:
dis.groupby(["subgroup","type"]).agg({"deaths":"sum"}).reset_index()

Unnamed: 0,subgroup,type,deaths
0,Biological,Animal accident,12.0
1,Biological,Epidemic,9618804.0
2,Biological,Insect infestation,0.0
3,Climatological,Drought,11733889.0
4,Climatological,Glacial lake outburst,262.0
5,Climatological,Wildfire,4653.0
6,Extra-terrestrial,Impact,0.0
7,Geophysical,Earthquake,2343912.0
8,Geophysical,Mass movement (dry),4644.0
9,Geophysical,Volcanic activity,86893.0


The disaster types are mostly the ones one would expect when thinking about natural disasters. Some types, such as insect infestations or animal accidents, are less obvious, and they account for almost no deaths. We also want to exclude epidemics from our research, since they would go beyond the scope of this task.

We therefore decided to omit disasters of the subgroups `Biological` and `Extra-terrestrial`.

In [36]:
dis = dis[(dis["subgroup"] != "Biological") & (dis["subgroup"] != "Extra-terrestrial")]

We only consider the following types of disasters:

In [37]:
dis["type"].unique()

array(['Drought', 'Earthquake', 'Volcanic activity',
       'Mass movement (dry)', 'Storm', 'Flood', 'Landslide', 'Wildfire',
       'Extreme temperature ', 'Fog', 'Glacial lake outburst'],
      dtype=object)
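
The two filters above can be combined into a single boolean mask. The sketch below uses invented rows and additionally assumes that a column subset derived beforehand (like `dis_all`, which was created from `dis` before these filters ran) should be restricted to the same rows; the notebook does not state whether that is intended.

```python
import pandas as pd

# Invented miniature disaster table
toy = pd.DataFrame({
    "disaster_group": ["Natural", "Natural", "Technological", "Natural"],
    "subgroup": ["Climatological", "Biological", "Technological", "Extra-terrestrial"],
    "type": ["Drought", "Epidemic", "Industrial accident", "Impact"],
    "deaths": [100.0, 50.0, 3.0, 0.0],
})
# Column subset derived before any row filter is applied
toy_subset = toy.filter(items=["subgroup", "type", "deaths"])

# One mask for "natural, but neither biological nor extra-terrestrial"
excluded_subgroups = ["Biological", "Extra-terrestrial"]
in_scope = (toy["disaster_group"] == "Natural") & ~toy["subgroup"].isin(excluded_subgroups)

toy = toy[in_scope]
# toy_subset is a separate object and still holds all four rows,
# so the same mask has to be applied to it explicitly
toy_subset_in_scope = toy_subset[in_scope]
print(toy_subset_in_scope)
```

Reusing one mask for both frames keeps the row selection consistent, since filtering one frame has no effect on the other.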

#### Handle Missing Values

Fill missing values for the number of deaths

We assume that a missing value for the number of deaths of a particular disaster means that the death toll was 0.

For the subtype, we check for which types of disasters no subtype is provided.

In [38]:
dis_all.isna().sum()

year                 0
dis_no               0
country_name         0
country_code         0
location          2340
subgroup             0
type                 0
subtype           3279
deaths            5316
dis_mag_value    20722
dis_mag_scale     8931
start_year           0
end_year             0
dtype: int64

In [39]:
dis_all[dis_all["subtype"].isna()]["type"].unique()

array(['Flood', 'Storm', 'Landslide', 'Wildfire', 'Fog', 'Epidemic',
       'Complex Disasters', 'Miscellaneous accident',
       'Insect infestation', 'Mass movement (dry)', 'Impact',
       'Volcanic activity', 'Animal accident', 'Drought', 'Earthquake',
       'Glacial lake outburst', 'Industrial accident'], dtype=object)

Unfortunately, the missing values in the subtype column do not correspond to specific types of disasters.
We therefore cannot easily determine why the values are missing.

In [40]:
dis_all[["subtype"]] = dis_all[["subtype"]].fillna("Uncategorized")

We assume that disasters with no reported death toll have a death toll of 0.
This also aligns with the information we get from EM-DAT (deaths < 10 are missing).

In [41]:
dis_all[['deaths']] = dis_all[['deaths']].fillna(value=0)
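
Since filling with 0 is an assumption, it can be useful to remember which values were originally missing. The following sketch on invented rows adds an indicator column before filling; the column name `deaths_reported` is hypothetical and not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Invented rows: one reported death toll, one missing, one reported zero
toy = pd.DataFrame({
    "dis_no": ["x1", "x2", "x3"],
    "deaths": [12.0, np.nan, 0.0],
})

# Remember which values were actually reported before filling
toy["deaths_reported"] = toy["deaths"].notna()
toy["deaths"] = toy["deaths"].fillna(0)
print(toy)
```

The indicator makes it possible to distinguish a reported death toll of 0 from an imputed one later on.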

#### Determine Regions

As mentioned in the Data Integration section, we want to determine the region of each disaster by taking the UN region assigned to the country in which the disaster occurred.
Some country codes are not in the list of UN countries, so we handle them separately.

In [42]:
dis_iso_codes = set(dis_all["country_code"].unique())
un_iso_codes = set(un_country_codes["ISO-alpha3 Code"].tolist())
emdat_without_un = dis_iso_codes-un_iso_codes
print(f"countries in emdat, but not un ({len(emdat_without_un)}): \n{emdat_without_un}")

countries in emdat, but not un (12): 
{'DDR', 'CSK', 'SCG', 'TWN', 'DFR', 'SUN', 'YMD', 'YUG', 'AZO', 'YMN', 'ANT', 'SPI'}


##### Automatically assign regions with UN Dataset

For countries that were split in the past, but are now unified, we can just assign the unified country.

In [43]:
# Germany
dis_all.loc[dis_all['country_code'] == "DFR",'country_code'] = "DEU"
dis_all.loc[dis_all['country_code'] == "DDR",'country_code'] = "DEU"
# Yemen
dis_all.loc[dis_all['country_code'] == "YMD",'country_code'] = "YEM"
dis_all.loc[dis_all['country_code'] == "YMN",'country_code'] = "YEM"

Next, we determine the region by using the UN dataset:

In [44]:
dis_all = pd.merge(dis_all, un_country_codes[["ISO-alpha3 Code","Region Name", "Region Code"]], left_on='country_code', right_on='ISO-alpha3 Code', how='left')
dis_all.rename(columns={"Region Code": "region_code", "Region Name": "region_name"}, inplace=True)
dis_all.drop(columns=["ISO-alpha3 Code"],inplace=True)
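
The recoding and the region lookup can be replayed on a few invented rows. The dictionary name `successor_codes` and the `toy_` tables are illustrative; the code mappings and region codes are the ones used in this notebook.

```python
import pandas as pd

# Invented disaster rows, two of them with codes of formerly split countries
toy_disasters = pd.DataFrame({"country_code": ["DDR", "YMN", "JPN", "TWN"]})
# Invented excerpt of the UN table
toy_un = pd.DataFrame({
    "ISO-alpha3 Code": ["DEU", "YEM", "JPN"],
    "Region Name": ["Europe", "Asia", "Asia"],
    "Region Code": [150, 142, 142],
})

# Map the codes of formerly split countries to the unified country
successor_codes = {"DFR": "DEU", "DDR": "DEU", "YMD": "YEM", "YMN": "YEM"}
toy_disasters["country_code"] = toy_disasters["country_code"].replace(successor_codes)

# Left join keeps every disaster; unknown codes end up without a region
toy_disasters = (
    toy_disasters
    .merge(toy_un, left_on="country_code", right_on="ISO-alpha3 Code", how="left")
    .drop(columns=["ISO-alpha3 Code"])
    .rename(columns={"Region Name": "region_name", "Region Code": "region_code"})
)

# Codes that still need a manually assigned region
missing = toy_disasters.loc[toy_disasters["region_name"].isna(), "country_code"].tolist()
print(missing)
```

Codes that are missing from the UN table, like `TWN` here, keep an empty region after the join and have to be handled in a separate step.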

##### Manually Assign Regions

Some countries clearly belong to one region, so we can assign their disasters manually.

In [45]:
# country code -> (region name, region code)
manual_regions = {
    "TWN": ("Asia", 142),      # Taiwan
    "CSK": ("Europe", 150),    # Czechoslovakia
    "YUG": ("Europe", 150),    # Yugoslavia
    "SCG": ("Europe", 150),    # Serbia Montenegro
    "ANT": ("Americas", 19),   # Netherlands Antilles
    "AZO": ("Europe", 150),    # Azores Islands
    "SPI": ("Europe", 150),    # Canary Islands
}
for code, (region_name, region_code) in manual_regions.items():
    code_mask = dis_all['country_code'] == code
    dis_all.loc[code_mask, 'region_name'] = region_name
    dis_all.loc[code_mask, 'region_code'] = region_code

##### Manually Assign Regions of Soviet Union Disasters

Disasters in the Soviet Union can have occurred in its European and/or Asian parts.
We therefore also take the `location` attribute into account and try to derive whether a disaster happened in Europe or Asia.

In [46]:
mask_europe = dis_all.loc[dis_all['country_code']=='SUN']["location"].str.contains("Russian Federation|Ukraine|Moldavia|Siberia").fillna(False)
mask_asia = dis_all.loc[dis_all['country_code']=='SUN']["location"].str.contains("Kazakhstan|Azerbaijan|Uzbekistan|Turkmenistan|Georgia|Armenia|Kyrgystan|Tajikistan|Tajiskistan|Tadzhikistan|Tadjikistan|Caucasus region|Dushanbe", case=False).fillna(False)

In [47]:
dis_all[dis_all['country_code']=='SUN'][mask_europe & mask_asia]

Unnamed: 0,year,dis_no,country_name,country_code,location,subgroup,type,subtype,deaths,dis_mag_value,dis_mag_scale,start_year,end_year,region_name,region_code
1262,1921,1921-9001-SUN,Soviet Union,SUN,"South Ukraine, Volga, Ural (Kazakhstan,Russian...",Climatological,Drought,Drought,1200000.0,,Km2,1921,1921,,
1275,1923,1923-0001-SUN,Soviet Union,SUN,"Nationwide (Ukraine, Georgia, Russian Federati...",Biological,Epidemic,Parasitic disease,0.0,,Vaccinated,1923,1923,,
1316,1932,1932-9001-SUN,Soviet Union,SUN,"Nationwide (Russian Federation, Ukraine, Kazak...",Complex Disasters,Complex Disasters,Uncategorized,5000000.0,,,1932,1932,,


Three events match both the European and the Asian patterns. Two of them, the 1923 epidemic and the 1932 complex disaster, fall outside the scope of our analysis, which leaves a single relevant event.
It is a major one: a drought that caused the death of 1.2 million people.

Research into this event shows that it can only be the Russian famine of 1921–1922.
It mostly affected people living in Europe, hence we assign this single observation to the region Europe.
(https://en.wikipedia.org/wiki/Russian_famine_of_1921%E2%80%931922)

For all other observations, the region should be unambiguous.

In [48]:
dis_all.loc[(dis_all['country_code']=='SUN') & mask_europe, "region_name"] = "Europe"
dis_all.loc[(dis_all['country_code']=='SUN') & mask_europe, "region_code"] = 150

dis_all.loc[(dis_all['country_code']=='SUN') & mask_asia, "region_name"] = "Asia"
dis_all.loc[(dis_all['country_code']=='SUN') & mask_asia, "region_code"] = 142

dis_all.loc[dis_all['dis_no']=='1921-9001-SUN', "region_name"] = "Europe"
dis_all.loc[dis_all['dis_no']=='1921-9001-SUN', "region_code"] = 150

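Four Soviet Union disasters remain without a region. Three of them (a storm and two floods) report no deaths, and the fourth is the 1917 epidemic, which belongs to the `Biological` subgroup that we do not consider. We can therefore safely ignore them.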

In [49]:
dis_all[dis_all["region_name"].isna()]

Unnamed: 0,year,dis_no,country_name,country_code,location,subgroup,type,subtype,deaths,dis_mag_value,dis_mag_scale,start_year,end_year,region_name,region_code
1250,1917,1917-0002-SUN,Soviet Union,SUN,Nationwide,Biological,Epidemic,Uncategorized,2500000.0,,Vaccinated,1917,1917,,
4787,1981,1981-0280-SUN,Soviet Union,SUN,East,Meteorological,Storm,Uncategorized,0.0,,Kph,1981,1981,,
4788,1981,1981-0301-SUN,Soviet Union,SUN,,Hydrological,Flood,Uncategorized,0.0,,Km2,1981,1981,,
4822,1982,1982-0346-SUN,Soviet Union,SUN,East,Hydrological,Flood,Uncategorized,0.0,150.0,Km2,1982,1982,,
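
The location-based assignment can be sketched on invented rows as follows. This is a variant of the approach above, not the notebook's exact logic: it leaves rows that match both patterns unassigned for manual review instead of overwriting them, and the Asian pattern is shortened for brevity.

```python
import pandas as pd

# Invented Soviet Union rows with different kinds of location strings
toy = pd.DataFrame({
    "dis_no": ["a", "b", "c", "d"],
    "location": ["Ukraine", "Uzbekistan", "Ukraine, Kazakhstan", None],
})

europe_pattern = "Russian Federation|Ukraine|Moldavia|Siberia"
asia_pattern = "Kazakhstan|Azerbaijan|Uzbekistan"  # shortened list

# na=False treats missing locations as "no match"
in_europe = toy["location"].str.contains(europe_pattern, na=False)
in_asia = toy["location"].str.contains(asia_pattern, case=False, na=False)

# Assign a region only where exactly one pattern matches
toy["region_name"] = None
toy.loc[in_europe & ~in_asia, "region_name"] = "Europe"
toy.loc[in_asia & ~in_europe, "region_name"] = "Asia"

ambiguous = toy[in_europe & in_asia]     # needs a manual decision
unresolved = toy[~in_europe & ~in_asia]  # location gives no hint
print(ambiguous["dis_no"].tolist(), unresolved["dis_no"].tolist())
```

Keeping ambiguous and unresolved rows in separate frames makes it easy to see which observations still need a manual decision.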


#### Save File

In [51]:
dis_all.to_csv(DIS_PROCESSED_ALL_FILE, index=False)