# Title

Intro / research explanation <br>
Link to datasets

In [8]:
import yaml
import pandas as pd

## Load files and create data frames

In [63]:
with open("config.yaml", "r") as stream:
    config_values = yaml.safe_load(stream)
    temperature_path = config_values['temperature_data_path']
    townships_path = config_values['townships_data_path']
    mortality_path = config_values['mortality_data_path']
    
# Create data frames
temperature_df = pd.read_csv(temperature_path, skiprows=11)
townships_df = pd.read_excel(townships_path)
mortality_df = pd.read_csv(mortality_path, sep=';')

## Inspect data
This section inspects the data frames.

### Temperature

In [64]:
temperature_df.head()

Unnamed: 0,STN,YYYY,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,YEAR
0,235,1906,40,29,38,68,110,129,162,170,151,123,87,21,94
1,235,1907,26,8,45,73,116,132,137,150,140,115,69,42,88
2,235,1908,0,36,31,55,114,145,160,152,136,101,51,19,83
3,235,1909,17,6,16,71,101,120,139,150,129,112,60,33,80
4,235,1910,39,39,49,72,112,153,144,159,136,103,44,55,92


In [66]:
temperature_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   STN     944 non-null    int64 
 1   YYYY    944 non-null    int64 
 2      JAN  944 non-null    object
 3      FEB  944 non-null    object
 4      MAR  944 non-null    object
 5      APR  944 non-null    object
 6      MAY  944 non-null    object
 7      JUN  944 non-null    object
 8      JUL  944 non-null    object
 9      AUG  944 non-null    int64 
 10     SEP  944 non-null    object
 11     OCT  944 non-null    object
 12     NOV  944 non-null    object
 13     DEC  944 non-null    object
 14    YEAR  944 non-null    object
dtypes: int64(3), object(12)
memory usage: 110.8+ KB


The temperature data frame has 944 entries and 15 columns. Twelve of these are month columns (JAN to DEC); the remaining three hold the station number, the year, and the yearly average. The values must be divided by 10 to get the real temperatures. Of the temperature columns, only AUG has a numeric data type (int64); the others were read as object. The `info()` output also shows that the month and yearly-average column names carry leading whitespace. In addition, the column names for the station number and the year could be more descriptive, while the name of the yearly average column (`YEAR`, which we don't need) is misleading. These issues will be fixed during data wrangling.
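The fixes listed above could look roughly like the following. This is only a sketch on a small hand-made sample, not the final wrangling step: the function name `clean_temperature` and the new column names `station` and `year` are assumptions, and the sample rows merely imitate the padded, string-typed layout seen in the output.

```python
import pandas as pd

# Hypothetical sample imitating the raw layout: padded column names and
# month values that were read as strings (object dtype).
raw = pd.DataFrame({
    "STN": [235, 235],
    "YYYY": [1906, 1907],
    "   JAN": ["   40", "   26"],
    "   AUG": [170, 150],
    "  YEAR": ["   94", "   88"],
})


def clean_temperature(df):
    """Strip column names, drop the yearly average, scale months by 1/10."""
    out = df.copy()
    out.columns = out.columns.str.strip()
    out = out.drop(columns=["YEAR"])
    month_cols = [c for c in out.columns if c not in ("STN", "YYYY")]
    for col in month_cols:
        # Strip padding first; errors="coerce" turns anything that is
        # still not a number into NaN instead of raising.
        as_text = out[col].astype(str).str.strip()
        out[col] = pd.to_numeric(as_text, errors="coerce") / 10
    # "station" and "year" are assumed names, not taken from the source.
    return out.rename(columns={"STN": "station", "YYYY": "year"})


clean = clean_temperature(raw)
print(clean)
```

Using `errors="coerce"` means that any non-numeric placeholder in the raw file would surface as NaN after conversion, so it is worth re-running `isna().sum()` on the cleaned frame.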

In [65]:
temperature_df.isna().sum()

STN       0
YYYY      0
   JAN    0
   FEB    0
   MAR    0
   APR    0
   MAY    0
   JUN    0
   JUL    0
   AUG    0
   SEP    0
   OCT    0
   NOV    0
   DEC    0
  YEAR    0
dtype: int64

The temperature data frame is not missing any data.

### Townships

In [58]:
townships_df.head()

Unnamed: 0,GM_CODE,GM_NAAM,geometry
0,GM0003,Appingedam,<Polygon><outerBoundaryIs><LinearRing><coordin...
1,GM0005,Bedum,<Polygon><outerBoundaryIs><LinearRing><coordin...
2,GM0007,Bellingwedde,<Polygon><outerBoundaryIs><LinearRing><coordin...
3,GM0009,Ten Boer,<Polygon><outerBoundaryIs><LinearRing><coordin...
4,GM0010,Delfzijl,<Polygon><outerBoundaryIs><LinearRing><coordin...


In [60]:
townships_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 431 entries, 0 to 430
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   GM_CODE   431 non-null    object
 1   GM_NAAM   431 non-null    object
 2   geometry  431 non-null    object
dtypes: object(3)
memory usage: 10.2+ KB


The township data frame consists of 431 entries and only 3 columns: the township code (which matches the codes in the mortality data), the township name, and the geometry. The data types of the columns are fine as they are. The geometry column contains the data needed to draw a polygon. For this research, however, we don't need the complete polygon, only its center point (lat/lon). Furthermore, the column names could be more descriptive. These issues will be fixed during data wrangling.
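One simple way to get a center point from such a polygon string is to average the vertices of the outer ring. The sketch below assumes the truncated geometry cells follow the usual KML layout (`lon,lat` pairs separated by whitespace, no XML namespace, a single `Polygon` per cell); none of that is confirmed by the output above. The helper name `polygon_center` and the sample square are made up for illustration. The vertex mean is an approximation, not the true area centroid.

```python
import xml.etree.ElementTree as ET


def polygon_center(kml_polygon):
    """Return (lat, lon) as the mean of the outer ring's vertices."""
    root = ET.fromstring(kml_polygon)
    coords_text = root.find("./outerBoundaryIs/LinearRing/coordinates").text
    # Each vertex is "lon,lat" or "lon,lat,alt"; keep only lon and lat.
    points = [tuple(map(float, p.split(",")[:2])) for p in coords_text.split()]
    # Closed rings repeat the first vertex at the end; drop the duplicate
    # so it is not counted twice in the average.
    if len(points) > 1 and points[0] == points[-1]:
        points = points[:-1]
    lon = sum(p[0] for p in points) / len(points)
    lat = sum(p[1] for p in points) / len(points)
    return lat, lon


# Hypothetical square, not taken from the townships file.
sample = (
    "<Polygon><outerBoundaryIs><LinearRing><coordinates>"
    "6.0,53.0 7.0,53.0 7.0,54.0 6.0,54.0 6.0,53.0"
    "</coordinates></LinearRing></outerBoundaryIs></Polygon>"
)
center = polygon_center(sample)
print(center)
```

If the real cells turn out to contain multi-part geometries or namespaced tags, the `find` path would need to be adapted, or a geometry library could be used instead.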

In [67]:
townships_df.isna().sum()

GM_CODE     0
GM_NAAM     0
geometry    0
dtype: int64

The township data frame is not missing any data.

### Mortality

In [70]:
mortality_df.head()

Unnamed: 0,ID,RegioS,Perioden,Overledenen_3
0,0,NL01,2002MM01,13469
1,1,NL01,2002MM02,11735
2,2,NL01,2002MM03,13281
3,3,NL01,2002MM04,11968
4,4,NL01,2002MM05,11623


In [72]:
mortality_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167076 entries, 0 to 167075
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   ID             167076 non-null  int64 
 1   RegioS         167076 non-null  object
 2   Perioden       167076 non-null  object
 3   Overledenen_3  129332 non-null  object
dtypes: int64(1), object(3)
memory usage: 5.1+ MB


The mortality data frame consists of 167076 entries and 4 columns: an ID, a region code, a period, and the number of deceased people. The ID column should be set as the index of the data frame. The values in the region code column start with one of the following prefixes: NL (the Netherlands as a whole), LD (landsdeel: part of the country), PV (province), or GM (gemeente: township). For this research we only need the records with a GM code. The period column combines a year and a month number (for example `2002MM01`) and should be split into two columns: year and month. Furthermore, the deceased column is of type object and should be converted to a numeric type. Lastly, the column names of this data frame could be more descriptive. These issues will be fixed during data wrangling.
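The steps described above can be sketched as follows. The function name `clean_mortality` and the new column names (`township_code`, `deceased`, `year`, `month`) are assumptions, as is the premise that every period has the `YYYYMMmm` shape shown in the `head()` output. The sample rows are invented.

```python
import pandas as pd

# Hypothetical sample in the same shape as the raw mortality data.
raw = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "RegioS": ["NL01", "GM0003", "GM0003", "PV20"],
    "Perioden": ["2002MM01", "2002MM01", "2002MM02", "2002MM01"],
    "Overledenen_3": ["13469", "12", "9", "480"],
})


def clean_mortality(df):
    """Index by ID, keep townships, split the period, convert the counts."""
    out = df.set_index("ID")
    # Keep only township (gemeente) records.
    out = out[out["RegioS"].str.startswith("GM")].copy()
    # Split e.g. "2002MM01" into year and month. astype(int) raises if a
    # period does not match the assumed pattern, which makes surprises loud.
    parts = out["Perioden"].str.extract(r"^(\d{4})MM(\d{2})$")
    out["year"] = parts[0].astype(int)
    out["month"] = parts[1].astype(int)
    # Counts were read as strings; unparseable entries become NaN.
    out["Overledenen_3"] = pd.to_numeric(
        out["Overledenen_3"].str.strip(), errors="coerce"
    )
    out = out.drop(columns=["Perioden"])
    # The new names below are assumed, not taken from the source.
    return out.rename(
        columns={"RegioS": "township_code", "Overledenen_3": "deceased"}
    )


clean = clean_mortality(raw)
print(clean)
```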

In [73]:
mortality_df.isna().sum()

ID                   0
RegioS               0
Perioden             0
Overledenen_3    37744
dtype: int64

There are 37744 entries that are missing the number of deceased people. For this research, townships with missing data are left out entirely to get the most reliable results.
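Leaving out whole townships, rather than only the individual rows with a missing value, could be done along these lines. This is a sketch on invented rows; the helper name `drop_incomplete_townships` and its parameters are assumptions.

```python
import pandas as pd

# Hypothetical sample: GM0005 has one month without a count.
df = pd.DataFrame({
    "RegioS": ["GM0003", "GM0003", "GM0005", "GM0005"],
    "Perioden": ["2002MM01", "2002MM02", "2002MM01", "2002MM02"],
    "Overledenen_3": ["12", "9", "7", None],
})


def drop_incomplete_townships(df, region_col="RegioS",
                              value_col="Overledenen_3"):
    """Remove every region that has at least one missing value."""
    # For each row, flag whether its region has any missing value at all.
    has_missing = df[value_col].isna().groupby(df[region_col]).transform("any")
    return df[~has_missing]


complete = drop_incomplete_townships(df)
print(complete)
```

Compared with a plain `dropna`, this keeps every remaining township's time series complete, at the cost of discarding the valid months of any township that has a gap.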

## Wrangle data

## Merge dataframes

## Analyse data (create graphs, etc.)

## Statistics