# Spruce Bark Beetle Prediction - Data Aggregation

#### Modeling the spruce bark beetle infestation for given spatial administrative units within Saxony and for distinct time intervals on the basis of the infestation development and the weather pattern up to the time of the forecast

**by**
Yannic Holländer

**Abstract**
This notebook encompasses the merging of various data sources. We create a single dataset containing all relevant information on the bark beetle infestation in Saxony. Subsequent notebooks use this dataset for the exploratory data analysis and model training. 

# 1. Overview of available data

This project draws on four main data sources:

1. **The infestation history**
    * contains all observations for the amount of accrued infested wood (target variable, in solid m$^3$)
    * also contains the timeframe for these observations, the respective forestry district, the type of forest (separated into private and state owned) and the amount of disposed wood in this time period
    * data supplied by *Staatsbetrieb Sachsenforst*


2. **Information on the forestry districts**
    * contrary to previous approaches, we are predicting the amount of infested wood not for the whole state of Saxony, but for given spatial administrative units - forestry districts - of which there currently are 53
    * contains the geodata/polygons for these districts
    * also contains, for every district, the area covered by forest, separated by private/state owned forest as well as endangered and safe forest area (endangered are only sections that consist predominantly of adult spruce trees)
    * the forestry district borders changed slightly in 2013/2014, we have the shape of the old districts as well as the new districts
    * data supplied by *Staatsbetrieb Sachsenforst*


3. **Meteorological raster data**
    * contain climatic parameters such as the mean temperature, humidity, wind speed, global irradiation etc. (15 variables total)
    * one raster file for every variable and month/day of the covered time period (from 2005 up to February 2020)
    * 1000 m x 1000 m raster
    * supplied by ReKIS (*Regionales Klima-Informationssystem Sachsen, Sachsen-Anhalt und Thüringen*, https://rekis.hydro.tu-dresden.de/)
    
    
4. **Information on abiotic damages**
    * covers windfall and demolition wood, damages from drought, storm, ice, snow etc.
    * gathered semi-annually (April and September)
    * similar variables to infestation_history
    * data supplied by *Staatsbetrieb Sachsenforst*
    
To make sense of the data we will have to aggregate this information into a single dataframe that can be used for an EDA and the modeling process. 


# 2. Setup

In [1]:
# import modules
import numpy as np
import pandas as pd
import geopandas as gpd

import matplotlib.pyplot as plt
import matplotlib.patches as patches

import rasterio 
import rasterio.plot
from rasterio.mask import mask

import time
from datetime import date

import warnings
# warnings.filterwarnings('ignore')

# display all columns of a dataframe
pd.options.display.max_columns = None

We first load the infestation history dataset, which was supplied as a Microsoft Excel file. We use it as a skeleton to which information from the other sources is added. This way we build our dataset on top of the central observations of the target variable. It has the following columns:

* **county_acronym** - A shorthand of the respective county. The 53 forestry districts form the 13 counties of the free state of Saxony. 
* **county_nr** - The number of the respective county. Every county has a unique double-digit number.
* **fdist_nr** - Denotes the forestry district within the county. In combination with county_nr makes up the unique identifier for forestry districts (fdist_id)
* **fdist_id** - Unique identifier for the forestry district. Has four digits. The first two digits are made up of county_nr and the last two are made up of fdist_nr.
* **year** - Year of the observation. Ranges from 2006 until 2020. Last observation is from September 2020
* **timeframe** - The timeframe of the observation within the year. The data is gathered monthly from April till September and quarterly from October till March.
* **forest_ownership** - A binary categorical variable that distinguishes between state owned forest (SW) and private/corporate forest (NSW)
* **infested_wood** - Target variable. The amount of accrued infested wood in solid cubic metres. 
* **disposed_wood** - Infested wood that was disposed, i.e. the infested trees were cut down and removed. 
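
The timeframe labels that appear in this notebook (for example `06 Juni` or `10 Oktober-Dezember`) start with a two-digit month number. The following is a minimal sketch of how such a label could be turned into the calendar months it covers. It assumes that the prefix always marks the first month of the period and that every label containing a hyphen describes a three-month quarter. The function name `timeframe_to_months` is hypothetical and not part of the original notebook.

```python
def timeframe_to_months(timeframe):
    """Return the calendar months covered by a timeframe label.

    Assumption: the label starts with the two-digit number of the first
    month, and labels containing a hyphen describe a three-month quarter.
    """
    first_month = int(timeframe.split()[0])
    n_months = 3 if '-' in timeframe else 1
    return list(range(first_month, first_month + n_months))


print(timeframe_to_months('06 Juni'))              # [6]
print(timeframe_to_months('10 Oktober-Dezember'))  # [10, 11, 12]
```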

In [2]:
# load the infestation history data
infestation_history = pd.read_excel(
    r'data_raw/ML_BDR_20201019.xlsx', 
    names=['county_acronym', 
           'county_nr', 
           'fdist_nr', 
           'fdist_id',
           'year', 
           'timeframe', 
           'forest_ownership', 
           'infested_wood', 
           'disposed_wood'])

# display first few rows of the dataframe
infestation_history.head()

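| | county_acronym | county_nr | fdist_nr | fdist_id | year | timeframe | forest_ownership | infested_wood | disposed_wood |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BZ | 25 | 1 | 2501 | 2007 | 06 Juni | SW | 5.0 | 0.0 |
| 1 | BZ | 25 | 1 | 2501 | 2007 | 08 August | SW | 12.0 | 12.0 |
| 2 | BZ | 25 | 1 | 2501 | 2007 | 10 Oktober-Dezember | SW | 2.0 | 0.0 |
| 3 | BZ | 25 | 1 | 2501 | 2008 | 04 April | SW | 1.0 | 0.0 |
| 4 | BZ | 25 | 1 | 2501 | 2008 | 06 Juni | SW | 2.0 | 0.0 |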


# 3. Forestry Districts
## 3.1 Preparing the infestation history dataframe 

The fdist_id column of our infestation_history dataframe contains a unique identifier for the forestry districts. The first two digits indicate the county (*Landkreis*) and the last two digits indicate the number of the forestry district in this county. These digits correspond to the county_nr and fdist_nr columns respectively.

In some forestry districts the district number (last two digits) begins with a leading nine instead of a leading zero:

In [3]:
# display all forestry district numbers
infestation_history['fdist_id'].unique()

array([2501, 2502, 2503, 2504, 2505, 2506, 2507, 2508, 2509, 2510, 1101,
       1201, 2101, 2102, 2103, 2104, 2105, 2106, 2107, 2191, 2192, 2193,
       2194, 2195, 2196, 2197, 2198, 2201, 2202, 2203, 2204, 2601, 2602,
       2603, 2604, 2605, 2606, 2691, 2901, 2902, 2701, 2702, 2703, 2704,
       2791, 2792, 2793, 2801, 2802, 2803, 2804, 2805, 3001, 3002, 3003,
       2301, 2302, 2303, 2304, 2305, 2306, 2401, 2402], dtype=int64)

During the observation timeframe, some of the forestry districts (counties *Erzgebirgskreis* and *Meißen*) underwent a restructuring process. A leading nine instead of a leading zero signifies that the border of the district was different than it is today. According to *Sachsenforst* these changes happened in July 2013 for the county *Meißen* (fdist_id 27xx) and in September 2014 for *Erzgebirgskreis* (fdist_id 21xx).

In [4]:
# check the max years for districts with a fdist_nr with leading nine
infestation_history[
    infestation_history['fdist_nr'] >= 90
].groupby('fdist_id')['year'].max()

fdist_id
2191    2014
2192    2014
2193    2014
2194    2014
2195    2014
2196    2014
2197    2014
2198    2014
2691    2020
2791    2013
2792    2013
2793    2013
Name: year, dtype: int64
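
The identifier convention described above can be made explicit with a small helper. This is only an illustrative sketch: the function name `split_fdist_id` is hypothetical, and the flag merely reports whether the district number starts with a nine. It does not decide how such a district should be treated.

```python
def split_fdist_id(fdist_id):
    """Split a four-digit fdist_id into its parts.

    Returns (county_nr, fdist_nr, has_leading_nine), where
    has_leading_nine is True when the district number is 90 or larger.
    """
    county_nr, fdist_nr = divmod(int(fdist_id), 100)
    return county_nr, fdist_nr, fdist_nr >= 90


print(split_fdist_id(2501))  # (25, 1, False)
print(split_fdist_id(2793))  # (27, 93, True)
```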

The two parts of 'fdist_id' also appear separately in the 'county_nr' and 'fdist_nr' columns, which would make those two columns redundant. We check whether the information in these three columns really is the same for every observation. If that is the case we drop 'county_nr' and 'fdist_nr':

In [5]:
# county_nr as a string
county_nr = infestation_history[
    'county_nr'
].astype(str) 
# fdist_nr as a string (pad with 0)
fdist_nr = infestation_history[
    'fdist_nr'
].astype(str).map(lambda x: x.zfill(2)) 

# concatenate these strings and check if they are identical to the 'fdist_id' column at every observation
(county_nr + fdist_nr == infestation_history['fdist_id'].astype(str)).all() 

True
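
The same consistency check can also be expressed with integer arithmetic instead of string concatenation. The sketch below runs on a small toy dataframe (the rows are made up for illustration, using identifiers that occur in the data) and assumes that all three columns are stored as integers.

```python
import pandas as pd

# toy frame with the same three identifier columns
toy = pd.DataFrame({
    'county_nr': [25, 21, 27],
    'fdist_nr': [1, 91, 4],
    'fdist_id': [2501, 2191, 2704],
})

# fdist_id should equal county_nr * 100 + fdist_nr for every row
is_consistent = (
    toy['county_nr'] * 100 + toy['fdist_nr'] == toy['fdist_id']
).all()
print(is_consistent)  # True
```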

Our comparison confirms that the columns are redundant. We drop county_nr and fdist_nr in favor of fdist_id, which combines them into a single, four-digit identifier. We also drop the county_acronym column, which only contains the county initials belonging to the county_nr. A full association of the forestry district and county names is found in the forestry district geodata, which we will load later. For now the fdist_id contains all the information we need in this dataframe. 

In [6]:
# drop 'county_nr' and 'fdist_nr' columns 
# because the information is also found in 'fdist_id'
infestation_history.drop(
    [
        'county_nr', 
        'fdist_nr', 
        'county_acronym'
    ], 
    axis=1, 
    inplace=True
)

Let's continue examining the cases with leading 9s. The *Stadtwald Zittau* (fdist_id 2691) is a special case among those special cases. From the fdist_id and the maximum year of occurrence in the data we can already conclude that it is not in either of the restructured counties. According to Sachsenforst the correct procedure for this fdist_id is to add the corresponding observations to the forestry district *Zittau* (fdist_id 2601).

In [7]:
# allocate fdist_id 2691 to fdist_id 2601 and add the observations

# in column 'fdist_id' change all occurrences of 2691 to 2601
infestation_history['fdist_id'] = infestation_history[
    'fdist_id'
].replace(2691, 2601)

# aggregate the values 
# by summing them together for the 'infested_wood' column
# but only if every other column value is the same (same observation)
infestation_history['infested_wood'] = infestation_history.groupby(
    [
        'fdist_id', 
        'year', 
        'timeframe', 
        'forest_ownership'
    ]
)['infested_wood'].transform('sum')

#  do the same for the 'disposed_wood' column
infestation_history['disposed_wood'] = infestation_history.groupby(
    [
        'fdist_id', 
        'year', 
        'timeframe', 
        'forest_ownership'
    ]
)['disposed_wood'].transform('sum')

# now the values are aggregated correctly but duplicated 
# (since we didn't change the shape of the dataframe)
# drop the duplicated rows that were just created
infestation_history.drop_duplicates(inplace=True)

# reset index
infestation_history.reset_index(inplace=True, drop=True)
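
The relabel-and-sum step above can also be written as a single `groupby(...).sum()`, which aggregates and collapses the rows in one pass. The sketch below uses a made-up toy dataframe with the same column names. Note that, unlike the cell above, it sorts the result by the grouping keys, so the row order can differ.

```python
import pandas as pd

# toy data: district 2691 reports for the same period as 2601
toy = pd.DataFrame({
    'fdist_id': [2601, 2691, 2601],
    'year': [2019, 2019, 2019],
    'timeframe': ['06 Juni', '06 Juni', '07 Juli'],
    'forest_ownership': ['SW', 'SW', 'SW'],
    'infested_wood': [10.0, 5.0, 3.0],
    'disposed_wood': [4.0, 1.0, 0.0],
})

# relabel 2691 as 2601
toy['fdist_id'] = toy['fdist_id'].replace(2691, 2601)

# sum both value columns within each observation key
keys = ['fdist_id', 'year', 'timeframe', 'forest_ownership']
aggregated = toy.groupby(keys, as_index=False)[
    ['infested_wood', 'disposed_wood']
].sum()

print(aggregated)
```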

## 3.2 Loading and formatting forestry district information

For the remaining forestry districts we need to distinguish between the old borders and the new ones. Sachsenforst supplied us with two shape files, one with the current district borders for all forestry districts and one with only the old borders of districts that changed in some way. The districts in the old shapefile are not yet formatted in the same way as our infestation_history dataframe. We have to change the 'fdist_id' numbers for the abolished districts so they match the format with the leading 9s. Currently they still have leading zeros. After that we merge both geodataframes.

Every row corresponds to one forestry district. The column names are as follows:
* **county_name** - The name of the county this forestry district is located in.
* **fdist_name** - The name of the forestry district.
* **fdist_id** - Unique identifier for the forestry district. Has four digits. The first two digits are made up of county_nr and the last two are made up of fdist_nr. For the file with the new borders, same as the fdist_id in the infestation_history dataframe. For the file with the old borders, the third digit still has to be changed to a nine to match the fdist_id in the infestation_history dataframe.
* **area_nse** - Forest area (in ha) that is not state owned (private/corporate forest) and endangered by the spruce bark beetle. Endangered forest is forest with a spruce ratio > 10% and a tree height >= 20 m.
* **area_nsne** - Forest area (in ha) that is not state owned (private/corporate forest) and not endangered by the spruce bark beetle.
* **area_se** - Forest area (in ha) that is state owned and endangered by the spruce bark beetle. Endangered forest is forest with a spruce ratio > 10% and a tree height >= 20 m.
* **area_sne** - Forest area (in ha) that is state owned and not endangered by the spruce bark beetle.
* **geometry** - The forestry district border as polygons.

In [8]:
# load in the first shape file as a geopandas geodataframe
districts_new = gpd.read_file(
    r'data_raw/shape/ufb_rev_wald_teil.shp', 
    encoding='utf-8'
)

# column names
districts_new.columns=[
    'county_name',
    'fdist_name', 
    'fdist_id', 
    'area_nse', 
    'area_nsne', 
    'area_se', 
    'area_sne', 
    'geometry'
]

# display first few rows of the dataframe
districts_new.head(5)

| | county_name | fdist_name | fdist_id | area_nse | area_nsne | area_se | area_sne | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | Mittelsachsen | Reinsberg | 2203 | 1597.32 | 3274.630917 | 2706.18 | 2133.910411 | POLYGON ((386902.476 5656907.025, 386910.595 5... |
| 1 | Mittelsachsen | Geringswalde | 2201 | 841.61 | 3508.60581 | 196.15 | 1453.972847 | POLYGON ((332902.962 5650328.573, 332905.989 5... |
| 2 | Leipzig | Leipziger Land | 2902 | 401.71 | 8199.85385 | 615.51 | 5314.476829 | POLYGON ((332897.160 5650325.466, 332893.592 5... |
| 3 | Mittelsachsen | Striegistal | 2202 | 954.18 | 3156.650864 | 1147.04 | 1844.186239 | MULTIPOLYGON (((377509.195 5657427.330, 377569... |
| 4 | Meißen | Süd | 2703 | 392.75 | 4365.001441 | 381.91 | 1973.920712 | POLYGON ((377329.166 5657157.286, 377285.838 5... |


In [9]:
# load in the second shape file as a geopandas geodataframe
districts_old = gpd.read_file(
    r'data_raw/shape/ufb_rev_vorUmstrukturierungen.shp', 
    encoding='utf-8'
)

# column names
districts_old.columns=[
    'county_name', 
    'fdist_name', 
    'fdist_id', 
    'area_nse', 
    'area_nsne', 
    'area_se', 
    'area_sne', 
    'geometry'
]
# display first few rows of the dataframe
districts_old.head(5)

| | county_name | fdist_name | fdist_id | area_nse | area_nsne | area_se | area_sne | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | Meißen | Nord | 2703 | 143.31 | 5780.407594 | 1.09 | 768.093453 | POLYGON ((418952.942 5692288.782, 418909.147 5... |
| 1 | Meißen | West | 2701 | 22.8 | 4255.041515 | 3.93 | 3650.063576 | POLYGON ((389635.997 5699901.234, 389648.747 5... |
| 2 | Meißen | Süd | 2702 | 411.13 | 4543.837549 | 381.83 | 1975.417673 | POLYGON ((378695.051 5678837.912, 378676.082 5... |
| 3 | Erzgebirgskreis | Annaberg | 2105 | 2408.13 | 1882.699804 | 6142.78 | 2810.089861 | MULTIPOLYGON (((366499.454 5606840.063, 366468... |
| 4 | Erzgebirgskreis | Eibenstock | 2101 | 726.57 | 1754.566081 | 10922.47 | 3573.808856 | POLYGON ((333355.788 5609845.954, 333373.979 5... |


Format the districts_old geodataframe so it matches the notation in the infestation_history dataframe. This is done by changing the third digit from a zero to a nine. We also change the fdist_id in both geodataframes from a string to an integer.

In [10]:
# to get the leading 9 notation for abolished forestry districts
# add 90 to every 'fdist_id' in the districts_old dataframe 
districts_old['fdist_id'] = districts_old['fdist_id'].astype(int) + 90

# also change 'fdist_id' of districts_new to type int
districts_new['fdist_id'] = districts_new['fdist_id'].astype(int)

# merge the geodataframes
districts = pd.merge(districts_new, districts_old, how ='outer') 

# shape should be 64x8 now (53 new districts + 11 old districts)
districts.shape

(64, 8)

We make one more modification and change some forestry district names slightly to make them unambiguous. Currently, the counties *Meißen* and *Zwickau* have their districts labeled as *Nord* (north), *Süd* (south) etc. Since we also have the 'county_name' column to distinguish them, this is not a dealbreaker, but we would need to consult the county_name column every time to tell these forestry districts apart. This is tedious, and we would rather use the district names by themselves. Thus we add the first letter of the county name to the district name ('M Nord', 'Z Nord' and so on) for those forestry districts only.

In [11]:
# locate the fdist_name column for all forestry districts in Zwickau
# and add the letter Z to the start of the string
districts.loc[
    districts['county_name'] == 'Zwickau', 
    'fdist_name'
] = districts.loc[
    districts['county_name'] == 'Zwickau', 
    'fdist_name'
].map(
    lambda x: 'Z '+ x
)

# locate the fdist_name column for all forestry districts in Meißen
# and add the letter M to the start of the string
districts.loc[
    districts['county_name'] == 'Meißen', 
    'fdist_name'
] = districts.loc[
    districts['county_name'] == 'Meißen', 
    'fdist_name'
].map(
    lambda x: 'M '+ x
)

# check the results by filtering for Meißen, Zwickau
districts[districts['county_name'].isin(['Meißen', 'Zwickau'])]

| | county_name | fdist_name | fdist_id | area_nse | area_nsne | area_se | area_sne | geometry |
|---|---|---|---|---|---|---|---|---|
| 4 | Meißen | M Süd | 2703 | 392.75 | 4365.001441 | 381.91 | 1973.920712 | POLYGON ((377329.166 5657157.286, 377285.838 5... |
| 9 | Zwickau | Z Nord | 2401 | 1319.45 | 4145.980567 | 1348.61 | 1555.580286 | POLYGON ((305774.813 5632213.129, 305790.058 5... |
| 14 | Meißen | M West | 2704 | 36.08 | 1499.801018 | 0.08 | 3411.198775 | MULTIPOLYGON (((378097.915 5695126.311, 378079... |
| 27 | Zwickau | Z Süd | 2402 | 1794.78 | 6175.259947 | 196.48 | 236.623276 | POLYGON ((324855.422 5602320.960, 324853.742 5... |
| 31 | Meißen | M Ost | 2702 | 114.56 | 4945.945638 | 0.6 | 762.36579 | POLYGON ((413698.678 5674573.351, 413686.981 5... |
| 36 | Meißen | M Nord | 2701 | 33.41 | 3794.019452 | 3.85 | 243.218414 | POLYGON ((408736.142 5692125.831, 408767.450 5... |
| 53 | Meißen | M Nord | 2793 | 143.31 | 5780.407594 | 1.09 | 768.093453 | POLYGON ((418952.942 5692288.782, 418909.147 5... |
| 54 | Meißen | M West | 2791 | 22.8 | 4255.041515 | 3.93 | 3650.063576 | POLYGON ((389635.997 5699901.234, 389648.747 5... |
| 55 | Meißen | M Süd | 2792 | 411.13 | 4543.837549 | 381.83 | 1975.417673 | POLYGON ((378695.051 5678837.912, 378676.082 5... |


The old and new district borders are now correctly labeled in a single geodataframe and match the fdist_id in the infestation_history dataframe. 
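
The two renaming steps for *Zwickau* and *Meißen* follow the same pattern, so they could also be written as one loop over a mapping. The sketch below shows this on a made-up toy dataframe. The `prefixes` dictionary is an illustrative assumption and simply encodes the two prefixes used in this notebook.

```python
import pandas as pd

# toy data with ambiguous district names in two counties
toy = pd.DataFrame({
    'county_name': ['Meißen', 'Zwickau', 'Leipzig'],
    'fdist_name': ['Nord', 'Nord', 'Leipziger Land'],
})

# hypothetical mapping from county name to name prefix
prefixes = {'Zwickau': 'Z ', 'Meißen': 'M '}

for county, prefix in prefixes.items():
    in_county = toy['county_name'] == county
    # prepend the prefix only for districts of this county
    toy.loc[in_county, 'fdist_name'] = prefix + toy.loc[in_county, 'fdist_name']

print(toy['fdist_name'].tolist())  # ['M Nord', 'Z Nord', 'Leipziger Land']
```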

## 3.3 Supplement forestry district features

From the columns in our districts geodataframe we can derive additional features. First we calculate the actual area of all districts as well as the endangered forest density.

In [12]:
# calculate area of the forestry district polygons in square kilometres
# to get the correct area, we need to use an equal area projection 
# (in this case cea) 
districts['area_fdist'] = districts.to_crs(
    {'proj':'cea'}
)['geometry'].area / 1000000

# as a brief evaluation we check the area for the town of Leipzig 
# area should be 297.8 km^2 according to wikipedia
kfs_leipzig_area = districts[
    districts['county_name'] == 'Kreisfreie Stadt Leipzig'
]['area_fdist'].sum()

print(f'Area for Kreisfreie Stadt Leipzig is {kfs_leipzig_area:.1f} km^2,',
      'should be 297.8 km^2')

Area for Kreisfreie Stadt Leipzig is 297.8 km^2, should be 297.8 km^2


In [13]:
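# the endangered forest density relates the area of non state and
# state endangered forest to the total forestry district area
# note on units: the forest areas are given in ha and area_fdist in km^2,
# so the resulting value is a relative index (share of endangered
# forest area multiplied by 10,000), not a percentage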
districts['endangered_forest_density'] = (
    districts['area_nse'] + districts['area_se']
) * 100 / districts['area_fdist']

In case we want to use the geographical information contained within the geodataframe in our model, we calculate the centroid x and y coordinates of the polygons. Thus, we get two numerical features with location information of the districts.

In [14]:
# Add columns for coordinates of centroid for every district
# could be used as features instead of dummy for every district
districts['centroid_xcoord'] = districts[
    'geometry'
].map(
    lambda x: x.centroid.coords[0][0]
)

districts['centroid_ycoord'] = districts[
    'geometry'
].map(
    lambda x: x.centroid.coords[0][1]
)
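
To illustrate what the centroid of a polygon is, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula in plain Python. It is not the code used in this notebook, where the centroid comes from the geometry objects themselves. The function name `polygon_centroid` is hypothetical, and polygons with holes as well as multipolygons are not handled here.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon (shoelace formula).

    vertices: list of (x, y) tuples in order, without repeating the
    first point at the end. Holes and multipolygons are not handled.
    """
    area2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    # pair every vertex with its successor, wrapping around at the end
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)


# a 4 x 2 rectangle with its lower left corner at the origin
print(polygon_centroid([(0, 0), (4, 0), (4, 2), (0, 2)]))  # (2.0, 1.0)
```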

## 3.4 Insert missing observations in infestation history

Before we merge the infestation history with the information on forestry districts, we need to pad the observations in the infestation_history dataframe. The dataset currently does not include an observation for every combination of forestry district, forest ownership and observation period. These observations are missing because sometimes neither infested nor disposed wood is reported. This may be the case if a forestry district has almost no endangered forest (of the respective forest ownership type), if a winter month yielded unfavorable conditions for bark beetles, or if the infestation subsided locally in a particular year (or a combination of these factors). According to *Sachsenforst* the appropriate procedure is to assume that for every observation not in the dataframe there was neither infested nor disposed wood, because none was reported. Since this is still crucial information, we augment the dataset so that these cases are taken into account.

Let's see how many rows our dataset currently has in total and how many of those rows already have neither infested nor disposed wood.

In [15]:
# how many 'zero rows' do we already have?
n_zrows = infestation_history[
    (infestation_history['infested_wood'] == 0) & 
    (infestation_history['disposed_wood'] == 0)
].shape[0]

print(f'Initially {n_zrows} observations with neither infested wood nor',
      f'disposed wood (out of {infestation_history.shape[0]}',
      f'total observations).')

Initially 839 observations with neither infested wood nor disposed wood (out of 8007 total observations).


So in total we have 8007 observations. If the dataset contained every combination of timeframe, district and forest ownership from 2006 until September 2020, we would have 12,636 observations. We also add the year 2005 to the dataframe, since the climate data for this year is available. In case we ever want to use a moving average as a feature, we can then already start with valid values for 2006. This brings the total number of rows we should have at the end of this chapter to 13,484. Calculations below:

In [16]:
# estimate how many rows we should get after filling in zero rows

# 2006 - September 2020 (without 2005)
# 12 full years with 8 timeframes and 53 forestry districts
# (these years are 2006-2012 and 2015-2019 = 12 total)
# 12 remaining timeframes with 53 districts
# (Jan-Jun 2013, Oct-Dec 2014 & Jan-Sep 2020)
# 11 timeframes with 54 forestry districts (Jul 2013 - Sep 2014)
# everything x2 for state & non-state forest

n_full = (
    (12 * 8 * 53) + 
    (1 * 12 * 53) + 
    (1 * 11 * 54)
) * 2

# 2005 - September 2020 (including 2005 for climate data only)
# same as above, only one more full year
n_full_2005 = (
    (13 * 8 * 53) + 
    (1 * 12 * 53) + 
    (1 * 11 * 54) 
) * 2

print(f'Total rows after padding should be {n_full}',
      f'({n_full_2005} including 2005).')

Total rows after padding should be 12636 (13484 including 2005).
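
As a cross-check of the arithmetic above, the expected row count can also be obtained by enumerating every year and timeframe and looking up how many districts existed at that time. This is a sketch under assumptions: the helper names are hypothetical, the labels `07 Juli` and `09 September` are assumed to follow the same pattern as the labels that appear in this notebook, and the district counts (53, 54, 53) are taken from the comments above.

```python
timeframes = ['01 Januar-März', '04 April', '05 Mai', '06 Juni',
              '07 Juli', '08 August', '09 September', '10 Oktober-Dezember']


def n_districts(year, timeframe):
    """Number of forestry districts that exist in a given period."""
    month = int(timeframe[:2])
    if (year, month) < (2013, 7):
        return 53   # old borders in Meißen and Erzgebirgskreis
    if (year, month) < (2014, 10):
        return 54   # new borders in Meißen, old ones in Erzgebirgskreis
    return 53       # new borders everywhere


def expected_rows(first_year):
    """Expected number of rows from first_year until September 2020."""
    total = 0
    for year in range(first_year, 2021):
        for tf in timeframes:
            # no observations after September 2020
            if year == 2020 and tf == '10 Oktober-Dezember':
                continue
            total += n_districts(year, tf)
    return total * 2  # state and non-state forest


print(expected_rows(2006), expected_rows(2005))  # 12636 13484
```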


As illustrated in the above calculation, accounting for the exact forestry district borders of every year and timeframe is somewhat complicated, but still necessary to yield reliable results. Before we fill the dataset with missing rows, we construct one dataframe for each combination of forestry districts that existed from 2006 until 2020. They are created from subsets of our districts_old and districts_new geodataframes. In July 2013 all districts of the *Meißen* county changed from their xx9x version to the xx0x version. In September 2014 the forestry districts of the *Erzgebirgskreis* county followed suit. So in total we have three different sets of geographical borders: one before July 2013 (all old borders for *Meißen* and *Erzgebirgskreis*), one after September 2014 (all new borders) and one in between (*Meißen* already had new borders, *Erzgebirgskreis* still had the old ones).

In [17]:
# account for the different forestry district border changes
# by making three geodataframes containing the right polygons

# before July 2013: all old borders for Meißen and Erzgebirgskreis
# other borders are unchanged and thus in districts_new
districts_before_jul2013 = pd.concat(
    [
        districts_old, 
        districts_new[
            (districts_new['county_name'] != 'Erzgebirgskreis') & 
            (districts_new['county_name'] != 'Meißen')
        ]
    ], axis=0)

# between July 2013 and September 2014
# take Erzgebirgskreis from districts_old 
# and everything else from districts_new
districts_jul2013_sep2014 = pd.concat(
    [
        districts_old[
            districts_old['county_name'] == 'Erzgebirgskreis'
        ], 
        districts_new[
            districts_new['county_name'] != 'Erzgebirgskreis'
        ]
    ], axis=0)

# the districts after September 2014 are already in districts_new
districts_after_sep2014 = districts_new

Now we can finally fill in the missing rows for every observation with zeroes for infested and disposed wood. The first part is to define a function which creates a row based on its input. 

In [18]:
def create_zero_row(df, fdist_id, year, timeframe, forest_ownership):
    '''
    This function takes in a dataframe, as well as values for the columns
    'fdist_id', 'year', 'timeframe' and 'forest_ownership' of this 
    dataframe. 
    It returns a dictionary with a new row where the last two column 
    values are zero, only if there is not yet a row with the 
    other column values as specified by the inputs.
    
    inputs:
        - df: the dataframe in question
        - fdist_id: value for the 'fdist_id' column
        - year: value for the 'year' column
        - timeframe: value for the 'timeframe' column
        - forest_ownership: value for the forest_ownership column
        
    returns:
        - a dictionary serving as the new row in the dataframe, if the
          row in question does not exist yet
    '''
    
    # first check if there already is an observation for 
    # this combination of parameters    
    if not (
        (df['fdist_id'] == fdist_id) & 
        (df['year'] == year) &
        (df['timeframe'] == timeframe) &
        (df['forest_ownership'] == forest_ownership)
    ).any():
        
        # if there is no observation yet: create one with the last two 
        # columns as zero
        # note: this will only work if the shape and column names 
        # are exactly as they are in infestation_history
        # however we do it this way because the function can also be used
        # for demolition wood this way (more on that later)
        return {
            'fdist_id': fdist_id, 
            'year': year,
            'timeframe': timeframe,
            'forest_ownership': forest_ownership,
            df.columns[-2]: 0,
            df.columns[-1]: 0
        } 


The second part is to go over all possible combinations of features and call the create_zero_row() function. This is done in the following function.

In [19]:
def zero_fill(df=infestation_history, 
              districts_before_jul2013=districts_before_jul2013, 
              districts_jul2013_sep2014=districts_jul2013_sep2014, 
              districts_after_sep2014=districts_after_sep2014):
    '''
    This function takes in a dataframe and iterates over all possible
    combinations of years, timeframes, forestry districts 
    and forest ownerships. 
    It then calls the create_zero_row function with the applicable inputs.
    
    inputs:
        - df: the dataframe in question
        - districts_before_jul2013: a geodataframe with information
          on the forestry districts that existed before July 2013
        - districts_jul2013_sep2014: a geodataframe with information
          on the forestry districts that existed between July 2013
          and September 2014
        - districts_after_sep2014: a geodataframe with information
          on the forestry districts that existed after September 2014
    returns:
        - the augmented dataframe, filled with new rows whenever a valid 
          combination of features did not exist in df. infested_wood and 
          disposed_wood values are zero in those rows
    '''
    
    # print current number of rows in df
    print(f'Number of rows before zero_fill(): {df.shape[0]}')
    
    # to check every valid combination of timeframes, forest types, years 
    # and districts we use nested for loops
    # loop through all unique months and quarters
    for tf in df['timeframe'].unique():
        
        # loop through forest ownership types
        for fo in df['forest_ownership'].unique():
            
            # loop through all years we want to include
            for yr in range(2005, 2021):
                
                # depending on the year there were differences
                # in forestry districts
                # we check which year it is via an if-statement
                
                # before July 2013
                if yr < 2013 or (
                    yr == 2013 and 
                    tf in ['01 Januar-März', '04 April', 
                           '05 Mai', '06 Juni']
                ):
                    
                    # loop only through the old districts before July 2013
                    for fd in districts_before_jul2013[
                        'fdist_id'
                    ].unique():
                    
                        # create new row if conditions are met 
                        # by calling create_zero_row()
                        df = df.append(
                            create_zero_row(df, fd, yr, tf, fo), 
                            ignore_index=True
                        )
                        
                # between July 2013 and September 2014    
                elif yr == 2013 or (
                    yr == 2014 and not 
                    tf == '10 Oktober-Dezember'
                ):
                    
                    # loop only through the districts from July 2013 
                    # until September 2014
                    for fd in districts_jul2013_sep2014[
                        'fdist_id'
                    ].unique():
                    
                        # create new row if conditions are met 
                        # by calling create_zero_row()
                        df = df.append(
                            create_zero_row(df, fd, yr, tf, fo),
                            ignore_index=True
                        )
                        
                # after September 2014        
                elif yr >= 2014 and not (
                    yr == 2020 and 
                    tf == '10 Oktober-Dezember'
                ):
                    
                    # loop only through the new districts after 2014
                    for fd in districts_after_sep2014[
                        'fdist_id'
                    ].unique():
                        
                        # create new row if conditions are met 
                        # by calling create_zero_row()
                        df = df.append(
                            create_zero_row(df, fd, yr, tf, fo), 
                            ignore_index=True
                        )
                        
    # reset the index
    df.reset_index(inplace=True, drop=True)  
    
    # print new number of rows
    print(f'Number of rows after zero_fill(): {df.shape[0]}')
          
    return df

In [20]:
# call zero_fill() function to augment infestation_history
infestation_history = zero_fill(infestation_history)

Number of rows before zero_fill(): 8007
Number of rows after zero_fill(): 13484
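
The `DataFrame.append` method used in zero_fill() was removed in pandas 2.0, so the function needs an older pandas version to run as written. One possible alternative is to build the full grid of valid key combinations first and then left-join the observations onto it. The sketch below shows the idea on a made-up toy dataframe with a single set of districts. For the real data the grid would have to be built separately for each of the three border periods.

```python
import itertools
import pandas as pd

# toy observations: only two of the eight possible combinations exist
observed = pd.DataFrame({
    'fdist_id': [2501, 2502],
    'year': [2019, 2019],
    'timeframe': ['06 Juni', '06 Juni'],
    'forest_ownership': ['SW', 'NSW'],
    'infested_wood': [5.0, 2.0],
    'disposed_wood': [1.0, 0.0],
})

keys = ['fdist_id', 'year', 'timeframe', 'forest_ownership']

# every combination that should exist (toy values)
grid = pd.DataFrame(
    list(itertools.product(
        [2501, 2502], [2019], ['06 Juni', '07 Juli'], ['SW', 'NSW']
    )),
    columns=keys,
)

# the left join keeps every grid row, missing observations become NaN
# and are then replaced by zero
filled = grid.merge(observed, on=keys, how='left').fillna(
    {'infested_wood': 0, 'disposed_wood': 0}
)

print(filled.shape)  # (8, 6)
```

Because the grid is joined in one step, the run time no longer grows with a full dataframe scan per candidate row, which is what the nested loops do.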


## 3.5 Merge the datasets

The number of rows after inserting missing observations matches what we calculated earlier. Now that we finally have all the observations we need, we can focus on creating new features and integrating our other data sources into the infestation_history dataframe. First, we merge the information from the districts geodataframe with infestation_history. To merge, we use the fdist_id column.

In [21]:
# merge information on the forestry districts 
# with the observations in infestation_history
infestation_history = pd.merge(
    infestation_history, 
    districts[[
        'county_name', 
        'fdist_name',
        'area_nse', 
        'area_nsne', 
        'area_se', 
        'area_sne', 
        'fdist_id', 
        'centroid_xcoord', 
        'centroid_ycoord', 
        'area_fdist', 
        'endangered_forest_density'
    ]], 
    on='fdist_id')

# we also save the districts geodataframe as a shapefile 
# this way we can access it during the EDA
# 
# appending the geometry feature to our main dataset
# would unnecessarily impair performance, memory and storage 
# as the polygons contain a rather large amount of information
districts.to_file('forestry_districts.shp', encoding='utf-8')

# 4. Meteorological raster data

For every meteorological parameter and every month (alternatively every day), we have one raster file with a 1000 m x 1000 m grid that covers all of Saxony. Before aggregating the climate data, we define two helper functions. The first one reads in a specific raster file and calculates the mean for every polygon geometry that we pass to it.

In [22]:
def raster_mean(filename, polygons):
    '''
    This function calculates the mean of values in a raster grid for 
    specific polygon shapes. 
    This is done by masking the raster grid with the polygon vectors and 
    using the masked raster points to calculate the mean.

    inputs:
        - filename: path/filename of the raster file
        - polygons: a list of polygons that we want to calculate mean values for
        
    returns:
        - a list with one element for every input polygon, every element being the
          mean value of the raster masked with the polygon
    '''  
    # try to read in raster file
    try:
        current_raster = rasterio.open(filename, nodata=-9999.0)
        
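    # specify procedure when the file cannot be opened (e.g. it does not exist)
    # catch only rasterio's I/O error instead of using a bare except
    except rasterio.errors.RasterioIOError: 
        # return NaN for all polygons and print message with filename
        print(f'File {filename} not found in directory. Returning NaN.')
        return [np.nan for i in range(polygons.shape[0])] 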
    
    # prepare list with results
    results = []
    
    # do masking and calculation for every polygon
    for polygon in polygons['geometry']:
        
        # mask raster with polygon and read in the relevant raster points
        masked, mask_transform = mask(
            dataset=current_raster, 
            shapes=[polygon], 
            crop=True, # avoids loading in the whole raster
            filled=False, # mask outside values with nodata instead of 0, 
                          # so we can safely compute zonal stats
            all_touched=True # overfill polygon instead of underfilling
        ) 
        
        # calculate the mean of the remaining raster points
        # use numpy.ma as it supports masked arrays
        # and append to results
        results.append(
            np.ma.mean(masked)
        )
        
    # return results list
    return results
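
To illustrate why the masked array (rather than a zero-filled array) is used for the zonal mean, here is a minimal NumPy-only sketch. The grid values and the `outside` mask are made up for illustration and no raster file is read; the masked array merely stands in for what the masking step hands to `np.ma.mean`.

```python
import numpy as np

# toy 3x3 raster; True in `outside` marks cells that lie outside the polygon
grid = np.array([[10.0, 12.0, 14.0],
                 [11.0, 13.0, 15.0],
                 [ 9.0,  8.0,  7.0]])
outside = np.array([[True, False, False],
                    [True, False, False],
                    [True, True,  True]])

# masked array: the masked cells are ignored by np.ma.mean
masked = np.ma.masked_array(grid, mask=outside)
zonal_mean = np.ma.mean(masked)  # mean of 12, 14, 13 and 15

# filling the outside cells with 0 instead would bias the mean downwards
biased_mean = np.where(outside, 0.0, grid).mean()

print(zonal_mean, biased_mean)
```

Only the four cells inside the polygon contribute to the masked mean, whereas the zero-filled variant averages over all nine cells.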
    

The second function addresses a problem that stems from the different timeframe lengths. The observations are gathered monthly from April to September and quarterly for the other six months. Some climate parameters are defined as the sum of daily values. For example, the sunshine duration 'SD0' of a month is the sum of the sunshine durations of every day. Our target variable is the infested wood that was **accrued** in the timeframe, meaning it is the sum of the accrued wood of every month (technically also of each day or any other subdivision). For the quarterly timeframes we therefore want these features to be summed as well (for example the total sunshine duration of the three months), so that we can explore the relationship between sunshine duration and the target variable.

In case one of the files does not exist in the data (for now every file exists, but this might not be the case in production), the correct approach is to calculate the numpy.nansum. However, if all three files of a quarterly timeframe are missing, we want to return NaN, whereas numpy.nansum returns 0 for an all-NaN input. This is why we create a wrapper function. 

In [23]:
def nansumwrapper(a, **kwargs):
    '''
    A function that returns NaN when all elements of an array-like
    are NaN and else calculates the nansum of this array-like.
    
    inputs:
        - a: any array-like
    returns:
        - np.nan if all elements are np.nan, else the nansum of input
    '''
    if np.isnan(a).all():
        return np.nan
    else:
        return np.nansum(a, **kwargs)
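
The difference between the plain `np.nansum` and the wrapper can be checked with a small sketch. The function `nansum_or_nan` below is a hypothetical stand-in that mirrors `nansumwrapper`, so the block runs on its own; the input values are made up.

```python
import numpy as np

def nansum_or_nan(a, **kwargs):
    # hypothetical stand-in mirroring nansumwrapper
    a = np.asarray(a, dtype=float)
    if np.isnan(a).all():
        return np.nan
    return np.nansum(a, **kwargs)

all_missing = [np.nan, np.nan, np.nan]  # all three monthly files missing
one_missing = [10.0, np.nan, 5.0]       # one monthly file missing

plain_all = np.nansum(all_missing)          # 0.0, looks like a real measurement
wrapped_all = nansum_or_nan(all_missing)    # NaN, marks the value as missing
wrapped_one = nansum_or_nan(one_missing)    # sum of the available months

print(plain_all, wrapped_all, wrapped_one)
```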

On to the aggregation of the climate data. Reading in thousands of raster files, overlaying vector data and doing calculations with the results is computationally expensive. We therefore organize the aggregation so that every file is read only once, do the necessary calculations and merge the results onto our main dataset only at the end. We also want to avoid repeatedly appending to dataframes (or handling dataframes at all) as we did during the insertion of missing rows, where it was a real issue. 

For these reasons we neither iterate over the infestation_history dataframe nor use dataframes to store our interim results. While we calculate the meteorological feature values we work with a list of lists, each inner list representing one of the parameters. Only at the end, once the aggregation is done, do we transform the results into a pandas dataframe.
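
The list-of-lists pattern can be sketched on a toy example. The column names match two of the parameters used in this notebook, but the values, years and district ids are made up for illustration.

```python
import pandas as pd

# one inner list per future column, filled with extend() inside the loop
columns = ['TM0', 'RRU', 'merge_dummy']
results = [[] for _ in columns]

for year in (2005, 2006):
    # pretend these values came from the raster aggregation of two districts
    results[0].extend([8.1, 7.9])
    results[1].extend([55.0, 61.5])
    results[2].extend([f'{year}-04 April-{dist}' for dist in (2601, 2602)])

# build the dataframe once at the end: inner lists become columns
res_df = pd.DataFrame(results).T
res_df.columns = columns

# the transposed frame holds generic objects, so cast the numeric columns
res_df = res_df.astype({'TM0': float, 'RRU': float})
print(res_df)
```

Because the inner lists mix numbers and strings, the transposed dataframe initially has the object dtype in every column; the explicit cast is one way to restore numeric columns.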

The following parameters exist in the data, calculated for every month. All parameters will be included as a feature with the same column name in the dataframe:
* **TX0** - mean of the daily maximum temperatures in °C
* **TM0** - mean temperature in °C
* **TN0** - mean of the daily minimum temperatures in °C
* **RF0** - mean relative humidity in %
* **SD0** - total sunshine duration in h
* **RRU** - total precipitation in mm
* **RRK** - corrected total precipitation in mm (corrects systematic errors of the measuring device and installation location such as wetting/evaporation losses)
* **FF1** - mean of the daily mean wind velocity 10 metres above ground in m s$^{-1}$
* **FF2** - mean of the daily mean wind velocity 2 metres above ground in m s$^{-1}$
* **FFB** - mean of the daily wind speed on the Beaufort scale in Bft
* **RGK** - total global irradiation in kWh m$^{-2}$
* **ETP** - potential evaporation in mm
* **GRV** - potential evapotranspiration in mm
* **KWU** - water balance in mm
* **KWK** - corrected water balance in mm (corrects systematic errors of the measuring device and installation location such as wetting/evaporation losses)
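
Whether a parameter is a mean or a total determines how the three months of a quarterly timeframe are combined. A minimal sketch with made-up monthly values for two districts (the variable names are illustrative only):

```python
import numpy as np

# per-month results: one inner list per month, one value per district
monthly_temperature = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
monthly_precipitation = [[30.0, 40.0], [20.0, 10.0], [10.0, 25.0]]

# zip(*...) regroups the values per district, e.g. (1.0, 3.0, 5.0)
quarterly_temperature = [
    float(np.nanmean(x)) for x in zip(*monthly_temperature)
]
quarterly_precipitation = [
    float(np.nansum(x)) for x in zip(*monthly_precipitation)
]

print(quarterly_temperature)    # means per district
print(quarterly_precipitation)  # totals per district
```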

In [24]:
# aggregate meteorological data

# specify location of raster files
raster_dir = r'data_raw/climate_monthly_1000/'

# the observations from April until September are gathered monthly 
# while they are gathered quarterly from October until March
# create a dictionary that maps the timeframe values from infestation_history 
# to the naming pattern that is used in the raster file names 
timeframe_dict = {
    '01 Januar-März': ['01', '02', '03'],
    '04 April': ['04'],
    '05 Mai': ['05'],
    '06 Juni': ['06'],
    '07 Juli': ['07'],
    '08 August': ['08'],
    '09 September': ['09'],
    '10 Oktober-Dezember': ['10', '11', '12']
}

# create a dictionary of all meteorological parameter shorthands to calculate
# these shorthands match the notation used in the respective filenames
#
# they are mapped to the respective aggregation function that will be used 
# if there are multiple months in the timeframe
parameter_info = {
    'TX0' : np.nanmean,  
    'TM0' : np.nanmean,  
    'TN0' : np.nanmean,  
    'RF0' : np.nanmean,  
    'SD0' : nansumwrapper,
    'RRU' : nansumwrapper,
    'RRK' : nansumwrapper,
    'FF1' : np.nanmean,  
    'FF2' : np.nanmean,  
    'FFB' : np.nanmean,  
    'RGK' : nansumwrapper,
    'ETP' : nansumwrapper,
    'GRV' : nansumwrapper,
    'KWU' : nansumwrapper,
    'KWK' : nansumwrapper
}

# since this might take a while we track the time
# start time
start = time.time()

# use list of lists to store results
# one inner list for every parameter plus one containing
# the information we use to merge infestation_history on
climate_res = [[] for _ in range(len(parameter_info) + 1)]

# to read every file only once, iterate over the years (and timeframes) 
for current_year in np.sort(infestation_history['year'].unique()):

    # whenever we start a new year, print elapsed time
    elapsed = (time.time() - start) / 60
    print(f'Starting with {current_year}, elapsed time: {elapsed:.2f} min')
    
    # if we are past 2014 we only need to go through the new districts
    # else go through all forestry districts in the districts geodataframe
    current_districts = districts if current_year <= 2014 else districts_new
    
    # get the forestry district id as well as the polygons
    polygons = current_districts[['fdist_id', 'geometry']]
    
    # iterate over the timeframes
    for current_timeframe in timeframe_dict:
        for idx, current_parameter in enumerate(parameter_info):
            
            # get all filenames 
            # (one if timeframe is one month, else list has three filenames)
            filenames = [
                fr'{raster_dir}GRID_1_Messungen_Tageswerte_2020_' +
                fr'{current_parameter}_MW_{current_year}' +
                fr'{current_month}00_utm.asc' 
                for current_month in timeframe_dict.get(current_timeframe)
            ]
            
            # call raster_mean() function we defined earlier
            aggregation_results = [
                raster_mean(filename, polygons) for filename in filenames
            ]
            
            # now use mapped function to handle timeframes 
            # with multiple months
            results_after_dispatch = [
                parameter_info[current_parameter](x) for x in zip(
                    *aggregation_results
                )
            ]
            
            # now we have one list left independent of timeframe
            # extend results
            climate_res[idx].extend(results_after_dispatch)         
        
        # extend the list on which we merge the results by specifying what 
        # combination of year, timeframe and districts we calculated
        merge_dummies = [
            f'{current_year}-{current_timeframe}-{dist}' 
            for dist in polygons['fdist_id']
        ]
        climate_res[-1].extend(merge_dummies)



# calculation done for all years, timeframes, districts and parameters
# transform results into dataframe
climate_res = pd.DataFrame(climate_res).T

# name columns correctly
climate_res.columns = [*parameter_info, 'merge_dummy']

# print final time
print(f'Finished aggregation, total time: {(time.time()-start)/60:.2f} min')

Starting with 2005, elapsed time: 0.00 min
Starting with 2006, elapsed time: 1.67 min
Starting with 2007, elapsed time: 3.54 min
Starting with 2008, elapsed time: 5.48 min
Starting with 2009, elapsed time: 7.45 min
Starting with 2010, elapsed time: 9.41 min
Starting with 2011, elapsed time: 11.38 min
Starting with 2012, elapsed time: 13.27 min
Starting with 2013, elapsed time: 15.25 min
Starting with 2014, elapsed time: 17.15 min
Starting with 2015, elapsed time: 18.85 min
Starting with 2016, elapsed time: 20.29 min
Starting with 2017, elapsed time: 21.68 min
Starting with 2018, elapsed time: 23.08 min
Starting with 2019, elapsed time: 24.44 min
Starting with 2020, elapsed time: 25.80 min
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_TX0_MW_20200300_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_TM0_MW_20200300_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_



File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_FFB_MW_20201000_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_FFB_MW_20201100_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_FFB_MW_20201200_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_RGK_MW_20201000_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_RGK_MW_20201100_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_RGK_MW_20201200_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_ETP_MW_20201000_utm.asc not found in directory. Returning NaN.
File data_raw/climate_monthly_1000/GRID_1_Messungen_Tageswerte_2020_ETP_MW_20201100

In [25]:
import glob

# specify location of the daily raster files
raster_dir = r'data_raw/climate/'

# since this will take a while we track the time it takes
start_time = time.time()

# we do not want to append to dataframes and thus use a list of lists 
# (one list for every heat sum and a final list that will be used for merging)
heatsum_res = [[], [], []]

for current_year in np.sort(infestation_history['year'].unique()):
    
    elapsed_time = round((time.time() - start_time)/60, 2)
    print(f'Starting with year {current_year}, elapsed time: {elapsed_time} min')
    
    # up to 2014 go through all districts, afterwards only the new ones
    current_districts = districts if current_year <= 2014 else districts_new
    polygons = current_districts[['fdist_id', 'geometry']]
    
    for current_timeframe in timeframe_dict:
        
        # collect the daily TN0 files of every month in the timeframe
        filenames = []
        for current_month in timeframe_dict.get(current_timeframe):
            filenames.extend(glob.glob(
                fr'{raster_dir}GRID_1_Messungen_Tageswerte_2020_TN0_TW_'
                fr'{current_year}{current_month}??_utm.asc'
            ))
         
        # one list of district means per daily file
        aggregation_results = [
            raster_mean(filename, polygons) for filename in filenames
        ]
        
        # regroup into one array of daily values per district
        arr = [np.array(x) for x in zip(*aggregation_results)]
        
        # sum the daily values that reach the threshold
        heatsum_8 = [np.nansum(x[x >= 8.3]) for x in arr]
        # NOTE: this uses the same threshold as heatsum_8, so HS16 is 
        # identical to HS8; a different threshold was presumably intended
        heatsum_16 = [np.nansum(x[x >= 8.3]) for x in arr]

        heatsum_res[0].extend(heatsum_8)         
        heatsum_res[1].extend(heatsum_16)   
        
        # key for merging: year, timeframe and district
        merge_dummies = [
            f'{current_year}-{current_timeframe}-{dist}' 
            for dist in polygons['fdist_id']
        ]
        heatsum_res[2].extend(merge_dummies)

# transform results into dataframe and name the columns
heatsum_res = pd.DataFrame(heatsum_res).T
heatsum_res.columns = ['HS8', 'HS16', 'merge_dummy']

print(f'Finished heatsum calculations, total time: {(time.time()-start_time)/60:.2f} min')

Starting with year 2005, elapsed time: 0.0 min
Starting with year 2006, elapsed time: 3.01 min
Starting with year 2007, elapsed time: 6.02 min
Starting with year 2008, elapsed time: 9.04 min
Starting with year 2009, elapsed time: 12.07 min
Starting with year 2010, elapsed time: 15.11 min
Starting with year 2011, elapsed time: 18.17 min
Starting with year 2012, elapsed time: 21.21 min
Starting with year 2013, elapsed time: 24.25 min
Starting with year 2014, elapsed time: 28.65 min
Starting with year 2015, elapsed time: 40.19 min
Starting with year 2016, elapsed time: 44.61 min
Starting with year 2017, elapsed time: 55.27 min
Starting with year 2018, elapsed time: 60.76 min
Starting with year 2019, elapsed time: 64.78 min
Starting with year 2020, elapsed time: 67.52 min
Finished heatsum calculations, total time: 68.08 min


In [26]:
# build the same key (year-timeframe-district) that was used 
# for the climate and heat sum results
infestation_history['merge_dummy'] = (
    infestation_history['year'].astype(str) + '-' 
    + infestation_history['timeframe'] + '-' 
    + infestation_history['fdist_id'].astype(str)
)

infestation_history = pd.merge(
    infestation_history, 
    climate_res, 
    on='merge_dummy'
)

infestation_history = pd.merge(
    infestation_history, 
    heatsum_res, 
    on='merge_dummy'
).drop('merge_dummy', axis=1)
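
As an alternative to a concatenated string key, pandas can merge on several columns at once. The sketch below assumes that the climate results carry `year`, `timeframe` and `fdist_id` as separate columns, which is not how `climate_res` is built above; the toy frames and their values are made up.

```python
import pandas as pd

observations = pd.DataFrame({
    'year': [2005, 2005, 2006],
    'timeframe': ['04 April', '05 Mai', '04 April'],
    'fdist_id': [2601, 2601, 2602],
    'infested_wood': [0.0, 12.5, 3.0],
})
climate = pd.DataFrame({
    'year': [2005, 2005, 2006],
    'timeframe': ['04 April', '05 Mai', '04 April'],
    'fdist_id': [2601, 2601, 2602],
    'TM0': [8.1, 13.4, 9.0],
})

merged = pd.merge(
    observations,
    climate,
    on=['year', 'timeframe', 'fdist_id'],
    how='left',               # keep observations without climate data
    validate='many_to_one',   # fail loudly if a climate key is duplicated
)
print(merged)
```

Note that `pd.merge` performs an inner join by default, so observations without a matching key are dropped silently; `how='left'` keeps them with NaN values instead.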

### County/District names, ID, timestamp - preparation for time series analysis

In [27]:
# for Meißen, logically connect the old forestry districts to the new ones 
# that best approximate their location/shape
def connect_districts(fdist_name, fdist_id):
    if fdist_id not in [2793, 2791]:
        return fdist_name
    
    else:
        return fdist_name.replace(
            'Nord', 'Ost' # what was M Nord is almost exactly M Ost in the new structure
        ).replace(
            'West', 'Nord' # M West is best approximated by M Nord in the new structure
        )


In [28]:
# access the row values by label instead of by position
infestation_history['fdist_newname'] = infestation_history[
    ['fdist_name', 'fdist_id']
].apply(lambda x: connect_districts(x['fdist_name'], x['fdist_id']), axis=1)

# unique id of every time series: county, (new) district name and ownership
infestation_history['id'] = (
    infestation_history['county_name'] + '-' 
    + infestation_history['fdist_newname'] + '-' 
    + infestation_history['forest_ownership']
)

In [29]:
end_of_timeframe = {
    '01 Januar-März': '-03-31',
    '04 April': '-04-30',
    '05 Mai': '-05-31',
    '06 Juni': '-06-30',
    '07 Juli': '-07-31',
    '08 August': '-08-31',
    '09 September': '-09-30',
    '10 Oktober-Dezember': '-12-31'
    }
             
     
infestation_history['timestamp'] = infestation_history['year'].astype(str) + infestation_history['timeframe'].map(lambda x: end_of_timeframe.get(x))
infestation_history['timestamp'] = pd.to_datetime(infestation_history['timestamp'])
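
The end-of-timeframe dates can also be derived from the last month of each timeframe instead of hard-coding the day. The following is only a sketch of that alternative; the `last_month` mapping and the toy dataframe are assumptions made for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# hypothetical mapping from timeframe label to its last month
last_month = {
    '01 Januar-März': 3,
    '04 April': 4,
    '10 Oktober-Dezember': 12,
}

toy = pd.DataFrame({
    'year': [2019, 2020, 2020],
    'timeframe': ['10 Oktober-Dezember', '01 Januar-März', '04 April'],
})

# first day of the last month, then roll forward to the end of that month
first_day = pd.to_datetime(pd.DataFrame({
    'year': toy['year'],
    'month': toy['timeframe'].map(last_month),
    'day': 1,
}))
toy['timestamp'] = first_day + MonthEnd(0)
print(toy)
```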

### Accounting for previously infested/disposed wood

In [30]:
from pandas.tseries.offsets import MonthEnd



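# for every row, look up the observation of the preceding timeframe:
#   - prev_disposed_wood / prev_infested_wood: same 'id'
#   - prev_infested_wood_ofo: same district, other forest ownership
# monthly timeframes (April-September) look back one month,
# quarterly timeframes look back three months
# if no previous observation is found, the value is NaN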

prev_disposed_wood = []
prev_infested_wood = []
prev_infested_wood_ofo = []
    
for i, row in infestation_history.iterrows():
    if row['timestamp'].month in range(4,10):
        previous_row = infestation_history.loc[
            (infestation_history['timestamp'] == row['timestamp'] + MonthEnd(-1)) & 
            (infestation_history['id'] == row['id'])
        ]
        
        previous_row_ofo = infestation_history.loc[
            (infestation_history['timestamp'] == row['timestamp'] + MonthEnd(-1)) & 
            (infestation_history['fdist_newname'] == row['fdist_newname']) & 
            (infestation_history['forest_ownership'] != row['forest_ownership'])
        ]
        
    else:
        previous_row = infestation_history.loc[
            (infestation_history['timestamp'] == row['timestamp'] + MonthEnd(-3)) & 
            (infestation_history['id'] == row['id'])
        ]
        
        previous_row_ofo = infestation_history.loc[
            (infestation_history['timestamp'] == row['timestamp'] + MonthEnd(-3)) & 
            (infestation_history['fdist_newname'] == row['fdist_newname']) & 
            (infestation_history['forest_ownership'] != row['forest_ownership'])
        ]
    
    pdw = previous_row['disposed_wood'].values
    piw = previous_row['infested_wood'].values
    piw_ofo = previous_row_ofo['infested_wood'].values
    
    prev_disposed_wood.append(pdw[0] if len(pdw)==1 else np.nan)
    prev_infested_wood.append(piw[0] if len(piw)==1 else np.nan)
    prev_infested_wood_ofo.append(piw_ofo[0] if len(piw_ofo)==1 else np.nan)
    
infestation_history['prev_disposed_wood'] = prev_disposed_wood
infestation_history['prev_infested_wood'] = prev_infested_wood
infestation_history['prev_infested_wood_ofo'] = prev_infested_wood_ofo
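
Looking up the previous observation row by row is slow for thousands of rows. A vectorized sketch using `groupby` and `shift` is shown below. It assumes that every `id` has a complete series without gaps (as should be the case after `zero_fill()`), because `shift` takes the previous row rather than the previous period; the toy data are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': pd.to_datetime([
        '2019-04-30', '2019-05-31', '2019-06-30',
        '2019-04-30', '2019-05-31',
    ]),
    'infested_wood': [1.0, 4.0, 2.0, 0.0, 7.0],
})

# sort chronologically within every id, then shift by one observation
toy = toy.sort_values(['id', 'timestamp'])
toy['prev_infested_wood'] = toy.groupby('id')['infested_wood'].shift(1)
print(toy)
```

The first observation of every `id` has no predecessor and therefore receives NaN, which matches the behavior of the loop above.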


### Moving Averages of climate features for the last year

In [31]:
# rolling features over the last year (suffix _rollyr)
# a window of 8 timeframes covers 12 months

rolling_df = pd.DataFrame(np.nan, 
                          index=range(infestation_history.shape[0]),
                          columns=[*[name + '_rollyr' for name in parameter_info], 
                                   'prev_infested_wood_rollyr',
                                   'prev_disposed_wood_rollyr',
                                   'HS8_rollyr',
                                   'HS16_rollyr'
                                  ])

for ID in infestation_history['id'].unique(): 
    
    # extract relevant time series for the id, sorted by timestamps
    id_subset = infestation_history.loc[infestation_history['id'] == ID].sort_values('timestamp')
    
    # the timeframes with three months need to be weighted accordingly
    weight = pd.Series([1 if element.month in range(4, 10) else 3 for element in id_subset['timestamp']])
        
    # calculate moving average for every meteorological parameter
    for current_parameter in parameter_info:
        
        # multiply parameter values by weight
        # use weight.values to keep the original index from id_subset
        weighted_parameter =  pd.Series(id_subset[current_parameter] * weight.values, name=current_parameter+'_rollyr')
        
        # perform rolling, save results in rolling_df
        rolling_df.loc[
            rolling_df.index.isin(weighted_parameter.index), 
            current_parameter+'_rollyr'
        ] = weighted_parameter.rolling(8).apply(
            lambda x: np.nansum(x)/12
        )
    # also do this for prev_infested_wood and prev_disposed_wood
    rolling_df.loc[
        rolling_df.index.isin(weighted_parameter.index), 
        'prev_infested_wood_rollyr'
    ] = id_subset['prev_infested_wood'].rolling(8).apply(lambda x: np.nansum(x))
    
    rolling_df.loc[
        rolling_df.index.isin(weighted_parameter.index), 
        'prev_disposed_wood_rollyr'
    ] = id_subset['prev_disposed_wood'].rolling(8).apply(lambda x: np.nansum(x))
    
    rolling_df.loc[
        rolling_df.index.isin(weighted_parameter.index), 
        'HS8_rollyr'
    ] = id_subset['HS8'].rolling(8).apply(lambda x: np.nansum(x))
    
    rolling_df.loc[
        rolling_df.index.isin(weighted_parameter.index), 
        'HS16_rollyr'
    ] = id_subset['HS16'].rolling(8).apply(lambda x: np.nansum(x))
                          
# since we kept the original indices, merge results on index        
infestation_history = pd.merge(infestation_history, rolling_df, left_index=True, right_index=True)
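
The weighting scheme for the yearly window can be illustrated on a single made-up year of mean temperatures. The values are invented, and the sketch uses the built-in rolling sum rather than `apply`.

```python
import pandas as pd

# mean temperatures for the eight timeframes of one year
# (Jan-Mar, Apr, May, Jun, Jul, Aug, Sep, Oct-Dec)
values = pd.Series([0.0, 8.0, 13.0, 16.0, 18.0, 18.0, 14.0, 3.0])

# quarterly timeframes represent three months, monthly timeframes one
weights = pd.Series([3, 1, 1, 1, 1, 1, 1, 3])

# weighted sum over the last 8 timeframes, divided by 12 months
weighted = values * weights
rolling_mean = weighted.rolling(8).sum() / 12
print(rolling_mean)
```

The first seven entries are NaN because a full window of eight timeframes is not yet available; the last entry is the month-weighted mean of the whole year.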

In [32]:
# rolling features over the summer half-year (suffix _rollsr)

rolling_df = pd.DataFrame(np.nan, 
                          index=range(infestation_history.shape[0]),
                          columns=[*[name + '_rollsr' for name in parameter_info]])

for ID in infestation_history['id'].unique(): 
    
    # extract relevant time series for the id, sorted by timestamps
    id_subset = infestation_history.loc[infestation_history['id'] == ID].sort_values('timestamp')
    
    # only the monthly timeframes (April-September) get a non-zero weight
    weight = pd.Series([1 if element.month in range(4, 10) else 0 for element in id_subset['timestamp']])
        
    # calculate moving average for every meteorological parameter
    for current_parameter in parameter_info:
        
        # multiply parameter values by weight
        # use weight.values to keep the original index from id_subset
        weighted_parameter =  pd.Series(id_subset[current_parameter] * weight.values, name=current_parameter+'_rollsr')
        
        # perform rolling, save results in rolling_df
        rolling_df.loc[
            rolling_df.index.isin(weighted_parameter.index), 
            current_parameter+'_rollsr'
        ] = weighted_parameter.rolling(8).apply(
            lambda x: np.nansum(x)/6
        )
                          
# since we kept the original indices, merge results on index        
infestation_history = pd.merge(infestation_history, rolling_df, left_index=True, right_index=True)

In [33]:
# rolling features over the winter half-year (suffix _rollwr)

rolling_df = pd.DataFrame(np.nan, 
                          index=range(infestation_history.shape[0]),
                          columns=[*[name + '_rollwr' for name in parameter_info]])

for ID in infestation_history['id'].unique(): 
    
    # extract relevant time series for the id, sorted by timestamps
    id_subset = infestation_history.loc[infestation_history['id'] == ID].sort_values('timestamp')
    
    # only the quarterly timeframes (October-March) get a non-zero weight
    weight = pd.Series([0 if element.month in range(4, 10) else 1 for element in id_subset['timestamp']])
        
    # calculate moving average for every meteorological parameter
    for current_parameter in parameter_info:
        
        # multiply parameter values by weight
        # use weight.values to keep the original index from id_subset
        weighted_parameter =  pd.Series(id_subset[current_parameter] * weight.values, name=current_parameter+'_rollwr')
        
        # perform rolling, save results in rolling_df
        rolling_df.loc[
            rolling_df.index.isin(weighted_parameter.index), 
            current_parameter+'_rollwr'
        ] = weighted_parameter.rolling(8).apply(
            lambda x: np.nansum(x)/6
        )
           
# since we kept the original indices, merge results on index        
infestation_history = pd.merge(infestation_history, rolling_df, left_index=True, right_index=True)

In [34]:
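# endangered area for this id: area_nse for 'NSW' rows, else area_se
# access the row values by label instead of by position
infestation_history['area_endangered'] = infestation_history[
    ['forest_ownership', 'area_nse', 'area_se']
].apply(
    lambda x: x['area_nse'] if x['forest_ownership'] == 'NSW' else x['area_se'], 
    axis=1
)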

## Windfall and demolition wood 

Abiotic damage: breakage and windthrow, with snow, ice and storm damage combined.

In [35]:
# load in the data set
demolition_history = pd.read_excel(r'data_raw/ML_WB_20201112.xlsx', 
                                    names=['county_acronym', 'county_nr', 'fdist_nr', 'fdist_id','year', 
                                           'timeframe', 'forest_ownership', 'demolition_wood', 'disposed_demolition_wood'])


In [36]:
# just like with infestation_history, we drop the redundant columns
demolition_history.drop(['county_acronym', 'county_nr', 'fdist_nr'], axis=1, inplace=True)

Add *Stadtwald Zittau*

In [37]:
# in column 'fdist_id' change all occurrences of 2691 to 2601
demolition_history['fdist_id'] = demolition_history['fdist_id'].replace(2691, 2601)

In [38]:
# reuse zero_fill() to insert the missing observations as zero rows
demolition_history = zero_fill(demolition_history)

# since there could be manual entry errors etc., attach the new district name 
# now so we can merge on it later
# merging fdist_newname on fdist_id logically connects old and new districts
newname_df = infestation_history[['fdist_id', 'fdist_newname']].drop_duplicates('fdist_id').copy()

demolition_history = pd.merge(demolition_history, newname_df, on='fdist_id')

# aggregate the 'demolition_wood' and 'disposed_demolition_wood' columns 
# by summing the values of all rows that agree in every other column
demolition_history['demolition_wood'] = demolition_history.groupby(['fdist_newname', 'year', 'timeframe', 'forest_ownership'])['demolition_wood'].transform('sum')

demolition_history['disposed_demolition_wood'] = demolition_history.groupby(['fdist_newname', 'year', 'timeframe', 'forest_ownership'])['disposed_demolition_wood'].transform('sum')

# Now drop the duplicated rows that were just created
demolition_history.drop_duplicates(['fdist_newname', 'year', 'timeframe', 'forest_ownership'], inplace=True)

# reset the index
demolition_history.reset_index(inplace=True, drop=True)



Number of rows before zero_fill(): 1945
Number of rows after zero_fill(): 3405
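
The same aggregation can be written as a single `groupby` sum instead of `transform` followed by `drop_duplicates`. This is a sketch on made-up data; unlike the cell above, it keeps only the key columns and the summed columns, so other columns such as `fdist_id` would be dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    'fdist_newname': ['M Nord', 'M Nord', 'M Ost'],
    'year': [2013, 2013, 2013],
    'timeframe': ['04 April', '04 April', '04 April'],
    'forest_ownership': ['SW', 'SW', 'SW'],
    'demolition_wood': [10.0, 5.0, 2.0],
    'disposed_demolition_wood': [4.0, 1.0, 2.0],
})

keys = ['fdist_newname', 'year', 'timeframe', 'forest_ownership']
value_columns = ['demolition_wood', 'disposed_demolition_wood']

# one row per key combination, values summed
aggregated = toy.groupby(keys, as_index=False)[value_columns].sum()
print(aggregated)
```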


In [39]:
# timeframes other than April and September take the values 
# of the most recent April or September observation

In [40]:
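def prev_demolition_date(timestamp):
    '''
    Maps the end of a timeframe to the end of the most recent
    April or September timeframe.
    
    inputs:
        - timestamp: end of the timeframe as pandas Timestamp
    returns:
        - the input if it lies in April or September, else the end of the
          preceding April (for May-August) or September (for all other months)
    '''
    # return pandas Timestamps throughout so the column keeps a datetime dtype
    if timestamp.month in [4, 9]:
        return timestamp
    elif timestamp.month in range(5, 9):
        return pd.Timestamp(timestamp.year, 4, 30)
    elif timestamp.month in range(1, 4):
        return pd.Timestamp(timestamp.year - 1, 9, 30)
    else:
        return pd.Timestamp(timestamp.year, 9, 30)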

In [41]:
demolition_history['timestamp'] = demolition_history['year'].astype(str) + demolition_history['timeframe'].map(lambda x: end_of_timeframe.get(x))
demolition_history['timestamp'] = pd.to_datetime(demolition_history['timestamp'])

In [42]:
infestation_history['demolition_date'] = infestation_history['timestamp'].map(lambda x: prev_demolition_date(x))

In [43]:
demolition_history[demolition_history.duplicated(['fdist_newname','forest_ownership', 'timestamp'], keep=False)]

Unnamed: 0,fdist_id,year,timeframe,forest_ownership,demolition_wood,disposed_demolition_wood,fdist_newname,timestamp


In [44]:
infestation_history = pd.merge(
    infestation_history, 
    demolition_history[[
        'fdist_newname',
        'forest_ownership', 
        'timestamp', 
        'demolition_wood', 
        'disposed_demolition_wood']], 
    left_on=['fdist_newname', 'forest_ownership', 'demolition_date'], 
    right_on=['fdist_newname', 'forest_ownership', 'timestamp'], 
    suffixes=('', '_drop'),
    how='left'
).drop(['demolition_date', 'timestamp_drop'], axis=1)

In [45]:
infestation_history[(infestation_history['demolition_wood'].isna()) & (infestation_history['year'] > 2005)]

Unnamed: 0,fdist_id,year,timeframe,forest_ownership,infested_wood,disposed_wood,county_name,fdist_name,area_nse,area_nsne,area_se,area_sne,centroid_xcoord,centroid_ycoord,area_fdist,endangered_forest_density,TX0,TM0,TN0,RF0,SD0,RRU,RRK,FF1,FF2,FFB,RGK,ETP,GRV,KWU,KWK,HS8,HS16,fdist_newname,id,timestamp,prev_disposed_wood,prev_infested_wood,prev_infested_wood_ofo,TX0_rollyr,TM0_rollyr,TN0_rollyr,RF0_rollyr,SD0_rollyr,RRU_rollyr,RRK_rollyr,FF1_rollyr,FF2_rollyr,FFB_rollyr,RGK_rollyr,ETP_rollyr,GRV_rollyr,KWU_rollyr,KWK_rollyr,prev_infested_wood_rollyr,prev_disposed_wood_rollyr,HS8_rollyr,HS16_rollyr,TX0_rollsr,TM0_rollsr,TN0_rollsr,RF0_rollsr,SD0_rollsr,RRU_rollsr,RRK_rollsr,FF1_rollsr,FF2_rollsr,FFB_rollsr,RGK_rollsr,ETP_rollsr,GRV_rollsr,KWU_rollsr,KWK_rollsr,TX0_rollwr,TM0_rollwr,TN0_rollwr,RF0_rollwr,SD0_rollwr,RRU_rollwr,RRK_rollwr,FF1_rollwr,FF2_rollwr,FFB_rollwr,RGK_rollwr,ETP_rollwr,GRV_rollwr,KWU_rollwr,KWK_rollwr,area_endangered,demolition_wood,disposed_demolition_wood
8410,2704,2013,08 August,SW,0.0,0.0,Meißen,M West,36.08,1499.801018,0.08,3411.198775,384941.58554,5684987.0,378.29004,9.558803,25.5675,18.8919,12.5429,69.7946,228.568,37.0475,41.0908,2.67386,1.94619,2.13399,136.811,101.182,99.5412,-64.1349,-60.0915,374.969,374.969,M West,Meißen-M West-SW,2013-08-31,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.08,,
8411,2704,2013,08 August,NSW,0.0,0.0,Meißen,M West,36.08,1499.801018,0.08,3411.198775,384941.58554,5684987.0,378.29004,9.558803,25.5675,18.8919,12.5429,69.7946,228.568,37.0475,41.0908,2.67386,1.94619,2.13399,136.811,101.182,99.5412,-64.1349,-60.0915,374.969,374.969,M West,Meißen-M West-NSW,2013-08-31,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,36.08,,
8476,2704,2013,07 Juli,SW,0.0,0.0,Meißen,M West,36.08,1499.801018,0.08,3411.198775,384941.58554,5684987.0,378.29004,9.558803,26.6843,20.2111,13.5906,69.032,305.959,34.9166,37.6292,2.57407,1.88758,2.05033,180.731,133.936,131.126,-99.0196,-96.307,423.479,423.479,M West,Meißen-M West-SW,2013-07-31,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.08,,
8477,2704,2013,07 Juli,NSW,0.0,0.0,Meißen,M West,36.08,1499.801018,0.08,3411.198775,384941.58554,5684987.0,378.29004,9.558803,26.6843,20.2111,13.5906,69.032,305.959,34.9166,37.6292,2.57407,1.88758,2.05033,180.731,133.936,131.126,-99.0196,-96.307,423.479,423.479,M West,Meißen-M West-NSW,2013-07-31,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,36.08,,


In [46]:
infestation_history[['demolition_wood', 'disposed_demolition_wood']] = infestation_history[['demolition_wood', 'disposed_demolition_wood']].fillna(value=0)

### Ratios

In [47]:
# share of the accrued wood that was disposed of, defined as 1 if nothing accrued
infestation_history['disposing_rate_demolition'] = infestation_history[
    ['demolition_wood', 'disposed_demolition_wood']
].apply(lambda x: 1 if x.iloc[0] == 0 else x.iloc[1] / x.iloc[0], axis=1)  
infestation_history['disposing_rate_infested_yr'] = infestation_history[
    ['prev_infested_wood_rollyr', 'prev_disposed_wood_rollyr']
].apply(lambda x: 1 if x.iloc[0] == 0 else x.iloc[1] / x.iloc[0], axis=1)  

# cap the rates at 1: values above 1 stem from rounding or entry errors 
# or from more wood being disposed of than accrued in the period
infestation_history['disposing_rate_demolition'] = infestation_history['disposing_rate_demolition'].map(lambda x: 1 if x > 1 else x)

infestation_history['disposing_rate_infested_yr'] = infestation_history['disposing_rate_infested_yr'].map(lambda x: 1 if x > 1 else x)
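
A vectorized sketch of the capped ratio is shown below on made-up values. It assumes that neither column contains missing values (as is the case for the demolition columns after `fillna`), because `fillna(1)` would otherwise turn a genuinely missing ratio into 1.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'demolition_wood': [0.0, 10.0, 10.0, 4.0],
    'disposed_demolition_wood': [0.0, 5.0, 12.0, 4.0],
})

# avoid division by zero: a zero denominator yields NaN first
ratio = (
    toy['disposed_demolition_wood']
    / toy['demolition_wood'].replace(0, np.nan)
)

# nothing accrued -> rate 1; more disposed than accrued -> capped at 1
toy['disposing_rate_demolition'] = ratio.fillna(1).clip(upper=1)
print(toy)
```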

In [48]:
infestation_history.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13484 entries, 0 to 13483
Data columns (total 93 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   fdist_id                    13484 non-null  int64         
 1   year                        13484 non-null  int64         
 2   timeframe                   13484 non-null  object        
 3   forest_ownership            13484 non-null  object        
 4   infested_wood               13484 non-null  float64       
 5   disposed_wood               13484 non-null  float64       
 6   county_name                 13484 non-null  object        
 7   fdist_name                  13484 non-null  object        
 8   area_nse                    13484 non-null  float64       
 9   area_nsne                   13484 non-null  float64       
 10  area_se                     13484 non-null  float64       
 11  area_sne                    13484 non-null  float64   

### Saving

In [49]:
infestation_history.to_csv('barkbeetle_dataset.csv', index=False)