# Bark Beetles: Predicting the Plague - Predictions

#### Modeling the spruce bark beetle infestation in short-time intervals for locally distinct spatial administrative units within Saxony on the basis of the infestation development and the weather pattern up to the time of forecast

**Abstract**
With the help of this notebook the model can be used to forecast the amount of infested wood based on multiple user-defined scenarios.

# 1. Setup

A Jupyter Notebook was chosen as the way to deploy the machine learning model (as well as the preferred method to supply all code written as part of this project) at the behest of *Sachsenforst*. Notebooks make it clear and easy for novices to understand and examine code. More importantly, the client does not have access to Python on their machines but **does** have the option of using Google Colab to view and execute these notebooks online. This is also why this notebook in particular was created with Google Colab in mind and contains optional code blocks to upload and download data to and from the platform.

To make predictions with this notebook, two more files are needed. The first is the 'model.pkl' file, which was written in the previous notebook and contains the final prediction model. The other file is named 'input.xlsx' and is a simplified version of the bark beetle dataset that contains only the minimal set of columns needed to make predictions with the model. Since the precipitation rolling mean as well as previous values for the amount of infested wood are predictors in the model, the use of this file is mandatory. However, it is purposely designed to require minimal maintenance. Derived features, such as the previous amount of infested wood, are not included in the file but calculated 'on the fly' in this notebook. The file also supplies historic values for the climate parameters as well as the demolition wood, both of which can be used to create the different scenarios.
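
As a rough illustration of what 'on the fly' means here, the following sketch derives a previous-period value from a tiny made-up table using `groupby(...).shift(1)`. The column `period` and all numbers are invented for this example. The notebook itself builds the feature differently, via a timestamp lookup in `populate_features`.

```python
import pandas as pd

# toy data: values and the 'period' column are invented for illustration
toy = pd.DataFrame({
    'fdist_id': [2101, 2101, 2101, 2102, 2102, 2102],
    'forest_ownership': ['SW'] * 6,
    'period': [1, 2, 3, 1, 2, 3],
    'infested_wood': [10.0, 25.0, 40.0, 5.0, 0.0, 7.5],
})

# sort so that shifting moves values forward in time within each group
toy = toy.sort_values(['fdist_id', 'forest_ownership', 'period'])

# previous period's infested wood for the same district and ownership group
toy['prev_infested_wood'] = (
    toy.groupby(['fdist_id', 'forest_ownership'])['infested_wood'].shift(1)
)

print(toy)
```

The first period of each group has no predecessor and therefore stays empty, which is one reason why the input file needs to reach back far enough in time.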

The first step is to upload the data (this first codeblock can be ignored if the notebook is executed locally), import the necessary modules and read in the data from 'input.xlsx'.

In [1]:
# Only in Google Colab:
# Upload input.xlsx and model.pkl in the dialogue.
from google.colab import files
data_to_load = files.upload()

%pip install scikit-learn==0.23.2

In [2]:
# imports
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
import pickle
import zipfile

# read file with historic observations
data = pd.read_excel(
    'input.xlsx',
    names=[
        'fdist_id',
        'year',
        'forest_ownership',
        'timeframe',
        'RRK',
        'TM0',
        'demolition_wood',
        'infested_wood'
    ]
)

# 2. User Settings

The prediction parameters in the following code block can be freely modified by the user and affect the calculation of the predictions. The rest of the code can then be executed directly without the need for further adjustments. An exception would be, for example, if a restructuring of the forest districts takes place. Then the subsequently assigned endangered forest areas would need to be adjusted.

The predictions are made for three scenarios, each defined by two parameters. One parameter (s_cli) defines the presumed future climate conditions and a second one (s_dem) the accumulated demolition wood. The parameters can be defined separately and populated with any valid values, but the names of the scenarios are pre-defined as: 1 - warm/dry, 2 - moderate/average, 3 - cold/humid. 

The parameters can be specified either as a four-digit integer representing a year or by using a quantile in the range 1-99. Combinations of these two possibilities within one scenario are allowed. If a year number is selected (e.g. 2015), then the climate or demolition wood values from the chosen year are referenced. If a quantile is specified, the respective quantile from the entire history in 'input.xlsx' is calculated for the respective month and used as the assumed values for the scenario. A higher quantile value for the climatic parameter expresses that the climate is getting drier and warmer (i.e. at 75, the 75% quantile of temperature and the 25% quantile of precipitation are calculated). The 50% quantile corresponds to the median.
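
To make the two ways of specifying a climate scenario more concrete, here is a minimal sketch for a single district, ownership group and month. The helper `resolve_climate` and the `history` values are hypothetical and exist only for this illustration. The notebook performs the equivalent lookup on the full dataset inside `populate_features`.

```python
import numpy as np

# toy history for one district, ownership group and month (values invented)
history = {
    2016: {'TM0': 17.0, 'RRK': 80.0},
    2017: {'TM0': 18.0, 'RRK': 60.0},
    2018: {'TM0': 21.0, 'RRK': 20.0},
    2019: {'TM0': 20.0, 'RRK': 40.0},
    2020: {'TM0': 19.0, 'RRK': 50.0},
}

def resolve_climate(s_cli, history):
    """Hypothetical helper: return (TM0, RRK) for a scenario parameter."""
    if s_cli in history:
        # a year was given: take that year's values directly
        return history[s_cli]['TM0'], history[s_cli]['RRK']
    if s_cli in range(1, 100):
        # a quantile was given: higher means warmer AND drier, so the
        # precipitation quantile is mirrored (75 -> 25% quantile of RRK)
        q = s_cli / 100
        tm0 = np.quantile([v['TM0'] for v in history.values()], q)
        rrk = np.quantile([v['RRK'] for v in history.values()], 1 - q)
        return float(tm0), float(rrk)
    raise ValueError('scenario must be a year in the data or a value 1-99')

print(resolve_climate(2018, history))  # values of the year 2018
print(resolve_climate(75, history))    # warm/dry quantile scenario
print(resolve_climate(50, history))    # median of both parameters
```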

A last parameter indicates the end of the prediction period. For example, if the last real observation in 'input.xlsx' is September 2020 and this parameter is defined as '2021-05', then the periods 'October-December', 'January-March', 'April' and 'May' will be calculated consecutively, i.e. everything between the last actual entry up to and including the end of the specified month. In general, it is recommended not to make predictions for the distant future, because several predictors are calculated from the past values for infested wood. Since the model always takes its own output (prediction) as input (observation) for the next period and no limitation of the feedback loop is implemented, deviations as well as wrong assumptions of the model may be amplified. Also for this reason, it is important that real, current observations are entered into the input file when they are available to allow corrections of the predictions by the model.
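
The stepping through prediction periods described above can be sketched as follows. The helper `forecast_periods` is a hypothetical name used only here. It assumes that every period is identified by the last day of its final month, as is done elsewhere in this notebook.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

def forecast_periods(last_obs_end, pred_end):
    """Hypothetical helper: period end dates after last_obs_end up to pred_end."""
    periods = []
    cur = last_obs_end
    while cur < pred_end:
        # after March to August the next period is a single month,
        # after September and December it spans three months
        step = 1 if cur.month in range(3, 9) else 3
        cur = cur + MonthEnd(step)
        periods.append(cur)
    return periods

# last real observation: September 2020, prediction end: '2021-05'
last_obs_end = pd.Timestamp('2020-09-30')
pred_end = pd.to_datetime('2021-05') + MonthEnd()

for ts in forecast_periods(last_obs_end, pred_end):
    print(ts.date())
```

For this example the loop yields four period ends (December, March, April and May), matching the periods 'October-December', 'January-March', 'April' and 'May' named in the text.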

In [3]:
# user-defined settings
# scenarios for the climate as well as demolition wood

# scenario 1 - warm/dry
s_cli_1 = 2018
s_dem_1 = 2018

# scenario 2 - moderate/average
s_cli_2 = 50
s_dem_2 = 50

# scenario 3 - cold/wet
s_cli_3 = 2010
s_dem_3 = 2010


# last prediction period ('YYYY-MM')
# month values of 03-09 as well as 12 allowed
pred_end = '2021-05'

# 3. Input verification

In the following code, the imported Excel file 'input.xlsx' and the specified setting parameters are scanned for errors and problems. Any warnings are printed below.

In [4]:
# Check input data integrity

# count warnings
warn_count = 0

################################################################################

# no empty values?
n_nan = data.isna().sum().sum()

if n_nan == 0:
    # no empty values in historic observations
    pass
else:
    print(
        f'Warning: {n_nan} empty values in past observations.\n'
        f'This may affect the quality of the predictions negatively.\n'
    )
    warn_count += 1
    
################################################################################

# no duplicate rows?
dup_rows = data.duplicated(
    ['fdist_id', 'forest_ownership', 'year', 'timeframe']
)

if not dup_rows.any():
    # no duplicate rows in data
    pass
else:    
    print(
        f'Warning: Multiple entries for certain observations found.\n'
        f'The following rows will be deleted: {data[dup_rows]}.\n'
    )
    
    warn_count += 1
    
    data.drop_duplicates(
        ['fdist_id', 'forest_ownership', 'year', 'timeframe'], 
        inplace=True
    )

################################################################################

# Check if current values are available for all forestry districts

# get current timeframe
max_year = data['year'].max()
max_timeframe = data[data['year'] == max_year]['timeframe'].max()

newest_data = data.loc[
    (data['year'] == max_year) & (data['timeframe'] == max_timeframe)
]

if newest_data.shape[0] == 106:
    # current values for all forestry districts found
    newest_ids = newest_data['fdist_id'].unique()
else:
    # existing IDs in current values
    newest_ids = newest_data['fdist_id'].unique()
    
    # existing IDs with only one forest ownership group
    mo_ids = [
        ID for ID in newest_ids 
        if newest_data[newest_data['fdist_id']==ID].shape[0] == 1
    ]
    
    # output
    print(
        f'Warning: Current observations found for {len(newest_ids)} ' 
        f'forestry districts.\n'
        f'One forest ownership group is missing in {len(mo_ids)} of '
        f'these districts.\n'
        f'Only districts with both ownership groups are taken into account.\n'
    )
    
    warn_count += 1
    
    # only continue with forestry districts containing both ownerships
    newest_ids = [x for x in newest_ids if x not in mo_ids]

################################################################################ 

# Check if all forestry districts are present for the previous 7 observations 

def eotf(x):
    '''
    Return the end of a timeframe. 
    input:
        - x: a timestamp (datetime object)
    returns:
        - End of the month (April-September) or end of the month after the next
          (January, October) as a datetime object
    '''

    if x.month in range(4, 10):
        return x + MonthEnd()
    else:
        return x + MonthEnd(3)
    

for ID in newest_ids:
    for fo in ['NSW', 'SW']:

        sub = data[
            (data['fdist_id'] == ID) & (data['forest_ownership'] == fo)
        ].copy()
        
        sub['ts'] = sub['year'].astype(str) + sub['timeframe'].map(
            lambda x: '-' + x.split(' ')[0])
        sub['ts'] = pd.to_datetime(sub['ts']).map(lambda x: eotf(x))
        
        max_ts = str(max_year) + '-' + max_timeframe.split(' ')[0]
        max_ts = eotf(pd.to_datetime(max_ts))
        cur_ts = max_ts
        
        for i in range(1,8):
            if cur_ts.month in range(4, 10):
                cur_ts = cur_ts + MonthEnd(-1)
            elif cur_ts.month in (3, 12):
                cur_ts = cur_ts + MonthEnd(-3)
            else:
                print(
                    f'Error: Invalid months in observations.\n'
                    f'District {ID}, ownership group {fo}, timeframe {cur_ts}.\n' 
                    f'Timeframes beginning with 02, 03, 11, 12 not allowed.\n'
                    f'Interrupting code execution...'
                )
                raise SystemExit(f'Invalid timeframe in {ID}, {fo}, {cur_ts}!')
                
            if (sub['ts'] == cur_ts).any():
                pass
            else:
                print(
                    f'Error: Necessary entry not found.\n'
                    f'District {ID}, ownership group {fo}, timeframe {cur_ts}.\n'
                    f'All forest districts for which predictions are to be '
                    f'made require observations for the whole last year.'
                )
                raise SystemExit(f'Missing entry: {ID}, {fo}, {cur_ts}!')

# all necessary observation entries found in input data

################################################################################

# check the user-defined setting parameters
for p in [s_cli_1, s_dem_1, s_cli_2, s_dem_2, s_cli_3, s_dem_3]:
    if (p in range(1,100)) or (p in data['year'].unique()):
        pass
    else:
        raise SystemExit(
            'Invalid prediction parameters. The scenario parameters are '
            'required to be between 1 and 99 or correspond to a year contained '
            'in the dataset.'
        )
        
pred_end = pd.to_datetime(pred_end) + MonthEnd()   

if pred_end - max_ts < pd.Timedelta('0'):
    raise SystemExit(
        'Invalid prediction parameters. The end of the prediction timeframe '
        'cannot be in the past.'
        )
elif pred_end - max_ts > pd.Timedelta('470 days'):
    print(
        'Warning: Predictions for the distant future are not recommended.'
    )
    warn_count += 1

################################################################################  

print(
    f'Review of the data set and the prediction parameters completed.\n'
    f'0 errors and {warn_count} warning(s) occurred.'
)    



Review of the data set and the prediction parameters completed.


# 3. Forecast preparation

The following codeblock contains functions to create the features needed for the prediction from 'input.xlsx'. The assignment of information to forest district ids, timeframes etc. is done using different dictionaries. This has the advantage that in case of restructuring etc. the values can be manually adjusted easily by the user. Furthermore, this information does not have to be read from additional data into Google Colab, reducing potential sources of errors. The disadvantage of this approach is that the code becomes longer and possibly more convoluted.
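
As a small illustration of the dictionary approach, the sketch below uses a four-entry excerpt of the id-to-name mapping defined in the next code cell. The observations and their years are invented. The point is only that an old and a current district id resolve to the same name, so that rows from different years can be matched.

```python
import pandas as pd

# excerpt of the id-to-name mapping (old and current ids share a name)
names = {
    2101: 'Eibenstock',
    2191: 'Eibenstock',
    2102: 'Zwönitz',
    2193: 'Zwönitz',
}

# invented observations: which id belongs to which year is an assumption
obs = pd.DataFrame({
    'fdist_id': [2191, 2193, 2101, 2102],
    'year': [2010, 2010, 2020, 2020],
})

# attach the common name, then select by name instead of by id
obs['fdist_newname'] = obs['fdist_id'].map(names)
eib = obs[obs['fdist_newname'] == 'Eibenstock']

print(eib['year'].tolist())
```

If districts are restructured, only the dictionary entries need to be edited. The lookup code stays unchanged.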

In [5]:
# Dictionary for the assignment of the (new) names of the forest districts.
# As in the previous notebooks, for the 'old' districts with leading 9s the 
# name of the current district that best approximates it is used.
# This logical connection is necessary to ensure that the correct district
# will be used if a year prior to 2015 is used as the scenario.
fdist_names = {
    2501: 'Elsterheide',
    2502: 'Bernsdorf',
    2503: 'Königswartha',
    2504: 'Nebelschütz',
    2505: 'Königsbrück',
    2506: 'Radibor',
    2507: 'Kamenz',
    2508: 'Ohorn',
    2509: 'Bischofswerda',
    2510: 'Cunewalde',
    1101: 'Chemnitz',
    1201: 'Dresden',
    2101: 'Eibenstock',
    2102: 'Zwönitz',
    2103: 'Stollberg',
    2104: 'Zschopau',
    2105: 'Annaberg',
    2106: 'Marienberg',
    2107: 'Olbernhau',
    2191: 'Eibenstock',
    2192: 'Schwarzenberg',
    2193: 'Zwönitz',
    2194: 'Stollberg',
    2195: 'Annaberg',
    2196: 'Zschopau',
    2197: 'Marienberg',
    2198: 'Olbernhau',
    2201: 'Geringswalde',
    2202: 'Striegistal',
    2203: 'Reinsberg',
    2204: 'Frauenstein',
    2601: 'Zittau',
    2602: 'Löbau',
    2603: 'Niesky',
    2604: 'Krauschwitz',
    2605: 'Boxberg',
    2606: 'Weißwasser',
    2901: 'Muldental',
    2902: 'Leipziger Land',
    2701: 'M Nord',
    2702: 'M Ost',
    2703: 'M Süd',
    2704: 'M West',
    2791: 'M Nord',
    2792: 'M Süd',
    2793: 'M Ost',
    2801: 'Freital',
    2802: 'Glashütte',
    2803: 'Bad-Gottleuba',
    2804: 'Pirna',
    2805: 'Sebnitz',
    3001: 'Delitzsch',
    3002: 'Torgau',
    3003: 'Oschatz',
    2301: 'Adorf',
    2302: 'Schöneck',
    2303: 'Weischlitz',
    2304: 'Plauen',
    2305: 'Treuen',
    2306: 'Auerbach',
    2401: 'Z Nord',
    2402: 'Z Süd',
    1302: 'Connewitz',
    1301: 'Leutzsch'
}

# endangered forest area for state owned forests from fdist_id
endarea_sw = {
    2501: 26.85,
    2502: 82.46,
    2503: 24.17,
    2504: 8.68,
    2505: 483.53,
    2506: 63.43,
    2507: 3.43,
    2508: 1417.36,
    2509: 127.91,
    2510: 8.3,
    1101: 774.06,
    1201: 1762.71,
    2101: 15177.9,
    2102: 2847.32,
    2103: 1921.76,
    2104: 3646.39,
    2105: 7210.79,
    2106: 3946.53,
    2107: 2314.46,
    2191: 10922.47,
    2192: 6940.4,
    2193: 1290.14,
    2194: 2364.83,
    2195: 6142.78,
    2196: 3191.06,
    2197: 3892.5,
    2198: 2315.36,
    2201: 196.15,
    2202: 1147.04,
    2203: 2706.18,
    2204: 3833.76,
    2601: 14.92,
    2602: 20.78,
    2603: 8.2,
    2604: 122.78,
    2605: 71.88,
    2606: 42.73,
    2901: 51.06,
    2902: 615.51,
    2701: 3.85,
    2702: 0.6,
    2703: 381.91,
    2704: 0.08,
    2791: 3.93,
    2792: 381.83,
    2793: 1.09,
    2801: 5291.51,
    2802: 4507.81,
    2803: 2690.42,
    2804: 1584.51,
    2805: 8458.97,
    3001: 0.12,
    3002: 0.0,
    3003: 465.65,
    2301: 4312.15,
    2302: 8310.18,
    2303: 791.57,
    2304: 779.57,
    2305: 475.83,
    2306: 2924.21,
    2401: 1348.61,
    2402: 196.48,
    1302: 0.0,
    1301: 0.0
}

# endangered forest area for private/corporate forests from fdist_id
endarea_nsw = {
    2501: 11.37,
    2502: 60.79,
    2503: 109.44,
    2504: 231.78,
    2505: 173.07,
    2506: 246.66,
    2507: 721.84,
    2508: 1028.08,
    2509: 2184.61,
    2510: 3323.88,
    1101: 280.84,
    1201: 124.17,
    2101: 929.71,
    2102: 2953.21,
    2103: 2903.46,
    2104: 1745.57,
    2105: 1855.4,
    2106: 2062.88,
    2107: 1379.08,
    2191: 726.57,
    2192: 1143.21,
    2193: 3482.08,
    2194: 1898.64,
    2195: 2408.13,
    2196: 1577.87,
    2197: 1208.74,
    2198: 1380.03,
    2201: 841.61,
    2202: 954.18,
    2203: 1597.32,
    2204: 1708.41,
    2601: 3757.67,
    2602: 2858.26,
    2603: 631.21,
    2604: 401.0,
    2605: 146.13,
    2606: 103.7,
    2901: 225.27,
    2902: 401.71,
    2701: 33.41,
    2702: 114.56,
    2703: 392.75,
    2704: 36.08,
    2791: 22.8,
    2792: 411.13,
    2793: 143.31,
    2801: 1675.64,
    2802: 1439.74,
    2803: 1073.94,
    2804: 1153.3,
    2805: 2152.83,
    3001: 11.28,
    3002: 14.46,
    3003: 271.09,
    2301: 3922.55,
    2302: 2342.52,
    2303: 2123.89,
    2304: 2152.93,
    2305: 2621.37,
    2306: 2748.21,
    2401: 1319.45,
    2402: 1794.78,
    1302: 0.54,
    1301: 0.0
}

# numerical encoding of timeframes analogous to training notebook
timeframe_encoder = {
    '01 Januar-März': 0.04695437961628093, 
    '04 April': 0.01613466392797747, 
    '05 Mai': 0.026238326902676162, 
    '06 Juni': 0.048392631582756127, 
    '07 Juli': 0.11820248134015453, 
    '08 August': 0.2340321842232208, 
    '09 September': 0.2396741387722562, 
    '10 Oktober-Dezember': 0.2703711936346777
}

# name of timeframe in notation used by Sachsenforst from month integer
tf_from_month = {
    3:  '01 Januar-März',
    4:  '04 April',
    5:  '05 Mai',
    6:  '06 Juni',
    7:  '07 Juli',
    8:  '08 August',
    9:  '09 September',
    12: '10 Oktober-Dezember'
}
    
# create column with (new) name of forestry districts
data['fdist_newname'] = data['fdist_id'].map(lambda x: fdist_names.get(x))

# function to create the foundation for the next prediction in the form of a
# dataframe with only the basic descriptive columns to which features can be
# added on
def prepare_next(
    newest_ids, cur_ts, 
    fdist_names=fdist_names, 
    tf_from_month=tf_from_month
):
    '''
    This function creates a new dataframe as the foundation for the next forecast. 
    The dataframe contains columns for the forest district (id, name), 
    the ownership group, the year and the timeframe.
    
    input:
        - newest_ids: all fdist_ids included in the next prediction
        - cur_ts: timeframe of the next prediction as a datetime object
        - fdist_names: dictionary with mappings of forestry district ids to names
        - tf_from_month: dictionary with mappings of timeframe strings from month
        
    returns:
        - dataframe with the needed descriptive columns to serve as the foundation 
          for a new round of predictions
    '''  
    
    # columns will be initiated as numpy arrays
    # every fdist_id needed two times, as there are two forest ownership groups
    fdist_id = np.append(newest_ids, newest_ids)
    
    # year array of same length
    year = [cur_ts.year] * 2 * len(newest_ids)
    
    # ownership group once each for every forestry district
    forest_ownership = np.append(
        ['NSW'] * len(newest_ids), 
        ['SW'] * len(newest_ids)
    )
    
    # timeframe as month, derived from cur_ts
    timeframe = [cur_ts.month]* 2 * len(newest_ids)
    # timeframe as string via tf_from_month dictionary
    timeframe = np.array([tf_from_month.get(x) for x in timeframe])
    
    # forestry district names via fdist_names dictionary
    fdist_newname = [fdist_names.get(x) for x in fdist_id]
    
    # merging arrays into a dataframe
    df = pd.DataFrame([
        fdist_id, 
        year, 
        forest_ownership, 
        timeframe, 
        fdist_newname
    ]).T
    
    # set column names
    df.columns =[
        'fdist_id',
        'year',
        'forest_ownership',
        'timeframe',
        'fdist_newname'
    ]
    
    return df
    
# function to fill 'foundation' with remaining features
def populate_features(
    X, data, original_data, s_cli, s_dem, cur_ts,
    endarea_sw=endarea_sw, 
    endarea_nsw=endarea_nsw, 
    tf_from_month=tf_from_month,
    timeframe_encoder=timeframe_encoder
):
    '''
    
    This function takes a dataframe with the minimal set of descriptive 
    columns and adds all features needed for a prediction by picking out the 
    respective values from the historical observations as well as the latest 
    observations/predictions and aggregating or processing them.
    
    inputs:
        - X: the dataframe to which the features will be attached and for which
          the prediction will later take place
        - data: dataframe with all previous actual observations and previous 
          predictions. Needed to get previous values for infested wood
        - original_data: dataframe with actual observations only. Needed for 
          designating climate parameters, since only real observations should 
          be basis for calculation of quantiles
        - s_cli: current climate scenario
        - s_dem: current demolition wood scenario
        - cur_ts: current timeframe as datetime object
        - endarea_sw: dictionary with mapping of endangered forest area to 
          forestry districts (state owned forest).
        - endarea_nsw: dictionary with mapping of endangered forest area to 
          forestry districts (private/corporate forest).
        - tf_from_month: dictionary with mapping of timeframe strings from month
        - timeframe_encoder: dictionary with mapping of numerical factors to 
          timeframes, analogous to the model training
          
    returns:
        - dataFrame with all rows and columns required to make predictions with 
          the regression model.
    '''
    ############################################################################

    # preparation: make copy of 'data' as original should not be touched
    df = data.copy()
    
    # preparation: add column for timestamp
    df['ts'] = df['year'].astype(str) + df['timeframe'].map(
            lambda x: '-' + x.split(' ')[0])
    df['ts'] = pd.to_datetime(df['ts']).map(lambda x: eotf(x))
    
    # preparation: timestamp of previous timeframe
    if cur_ts.month in range(4,10):
        prev_ts = cur_ts + MonthEnd(-1)
    else:
        prev_ts = cur_ts + MonthEnd(-3)
        
    ############################################################################ 

    # feature: area_endangered
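    # look up by column label instead of position (positional access on a
    # labelled row is deprecated in newer pandas versions)
    X['area_endangered'] = X[['fdist_id','forest_ownership']].apply(
        lambda x: (
            endarea_sw.get(x['fdist_id']) if x['forest_ownership'] == 'SW'
            else endarea_nsw.get(x['fdist_id'])
        ),
        axis=1
    )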
    
    ############################################################################    

    # features regarding the previous infested wood
    prev_inf_wood = []
    prev_inf_wood_ofo = []
    prev_infested_Wood_rollyr = []
    
    for r in X[['fdist_newname', 'forest_ownership']].itertuples(index=False):
        # feature: prev_infested_wood
        # value of the previous timeframe with same fdist_id and ownership
        piw = df.loc[
            (df['fdist_newname'] == r[0]) &
            (df['forest_ownership'] == r[1]) &
            (df['ts'] == prev_ts)
        ]['infested_wood'].values[0]
        
        prev_inf_wood.append(piw)
        
        # feature: prev_infested_wood_ofo
        # value of the previous timeframe with same fdist_id other ownership
        piwo = df.loc[
            (df['fdist_newname'] == r[0]) &
            (df['forest_ownership'] != r[1]) &
            (df['ts'] == prev_ts)
        ]['infested_wood'].values[0]
        
        prev_inf_wood_ofo.append(piwo)
        
        # feature: prev_infested_Wood_rollyr
        # sum of last year's infested wood values for same fdist_id and ownership
        piwryr = df.loc[
            (df['fdist_newname'] == r[0]) &
            (df['forest_ownership'] == r[1]) &
            (df['ts'] >= cur_ts + MonthEnd(-12)) & 
            (df['ts'] < cur_ts)
        ]['infested_wood'].sum()
        
        prev_infested_Wood_rollyr.append(piwryr)
        
    X['prev_infested_wood'] = prev_inf_wood
    X['prev_infested_wood_ofo'] = prev_inf_wood_ofo
    X['prev_infested_wood_rollyr'] = prev_infested_Wood_rollyr
        
    ############################################################################ 

    # features regarding climate parameters
    rrk = []
    tm0 = []
    rrk_rollsr = []
    
    # option 1: year integer specified as scenario
    if s_cli in original_data['year'].unique():
        for r in X[[
            'fdist_newname', 'forest_ownership', 'timeframe'
        ]].itertuples(index=False):
            # issue if year <= 2013 and fdist_newname 'Meißen West': 
            # entry does not exist - workaround in the following if-statement:
            if r[0] == 'M West' and s_cli <= 2013:
                # feature: RRK
                rrk_s = original_data.loc[
                    (original_data['fdist_newname'] == 'M Nord') &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_cli)  
                ]['RRK'].values[0]
                
                # feature: TM0
                tm0_s = original_data.loc[
                    (original_data['fdist_newname'] == 'M Nord') &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_cli)  
                ]['TM0'].values[0]
                
            # other forestry districts as follows:    
            else:
                # feature: RRK
                rrk_s = original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_cli)  
                ]['RRK'].values[0]
                
                # feature: TM0
                tm0_s = original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_cli)  
                ]['TM0'].values[0]
                
            rrk.append(rrk_s)
            tm0.append(tm0_s)
            
            # feature: RRK_rollsr
            # sum of the summer values over the last year (including
            # current one) in the same fdist_id and forest ownership group
            
            # previous values
            rrk_rollsr1 = df.loc[
                (df['fdist_newname'] == r[0]) &
                (df['forest_ownership'] == r[1]) &
                (df['ts'] >= cur_ts + MonthEnd(-11)) & 
                (df['ts'] <= cur_ts) &
                (df['ts'].map(lambda x: x.month).isin(range(4,10)))
            ]['RRK'].sum()
            
            # current value
            rrk_rollsr2 = rrk_s
            
            # combine previous and current in feature
            rrk_rollsr.append((rrk_rollsr1 + rrk_rollsr2) / 6)
                
    # option 2: quantile specified as scenario        
    elif s_cli in range(1,100):
        for r in X[[
            'fdist_newname', 'forest_ownership', 'timeframe'
        ]].itertuples(index=False):
            # Feature: RRK
            rrk_s = np.quantile(
                original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2])  
                ]['RRK'].values, 
                1 - (s_cli * 0.01)
            )
            
            rrk.append(rrk_s)
            
            # feature: TM0
            tm0_s = np.quantile(
                original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2])  
                ]['TM0'].values, 
                s_cli * 0.01
            )
            
            tm0.append(tm0_s)
            
            # feature: RRK_rollsr
            # sum of the summer values over the last year (including
            # current one) in the same fdist_id and forest ownership group
            
            # previous values
            rrk_rollsr1 = df.loc[
                (df['fdist_newname'] == r[0]) &
                (df['forest_ownership'] == r[1]) &
                (df['ts'] >= cur_ts + MonthEnd(-11)) & 
                (df['ts'] <= cur_ts) &
                (df['ts'].map(lambda x: x.month).isin(range(4,10)))
            ]['RRK'].sum()
            
            # current value
            rrk_rollsr2 = rrk_s
            
            # combine previous and current in feature
            rrk_rollsr.append((rrk_rollsr1 + rrk_rollsr2) / 6)

            
    X['RRK'] = rrk
    X['TM0'] = tm0
    X['RRK_rollsr'] = rrk_rollsr
    ############################################################################ 

    # demolition wood
    dmw = []
    
    # option 1: year integer specified as scenario
    if s_dem in original_data['year'].unique():
        for r in X[[
            'fdist_newname', 'forest_ownership', 'timeframe'
        ]].itertuples(index=False):
            # feature: demolition_wood
            # issue if year <= 2013 and fdist_newname 'Meißen West': 
            # entry does not exist - workaround in the following if-statement:
            if r[0] == 'M West' and s_dem <= 2013:
                dmw_s = original_data.loc[
                    (original_data['fdist_newname'] == 'M Nord') &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_dem)  
                ]['demolition_wood'].values[0]
            
            # other forestry districts as follows:   
            else:
                dmw_s = original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_dem)  
                ]['demolition_wood'].values[0]
            
            dmw.append(dmw_s)
            
    # option 2: quantile specified as scenario           
    elif s_dem in range(1,100):
        for r in X[[
            'fdist_newname', 'forest_ownership', 'timeframe'
        ]].itertuples(index=False):
            # feature: demolition_wood
            dmw_s = np.quantile(
                original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2])  
                ]['demolition_wood'].values, 
                s_dem * 0.01
            )
            
            dmw.append(dmw_s)
            
    X['demolition_wood'] = dmw
            
    ############################################################################ 
    
    # feature: timeframe (encoded)
    X['timeframe_enc'] = X['timeframe'].map(lambda x: timeframe_encoder.get(x))
    
    
    return X    
        
# load the model
with open('model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

# 4. Make forecasts for different scenarios

After all the required variables and functions have been defined, the prediction is performed for the three scenarios.

In [6]:
# store original data values
original_data = data.copy()

# predictions for three scenarios
for i in range(3):
    # reset data before a new scenario is calculated
    data = original_data.copy()
    
    # counter for number of timeframes
    period_count = 0
    pred_range = pd.date_range(start=max_ts, end=pred_end, freq='M')
    max_periods = len(
        [ts for ts in pred_range if ts.month not in (10,11,1,2)]
    ) - 1
    
    # define current scenario based on user settings
    if i == 0:
        s_name = 'warmdry'
        s_cli = s_cli_1
        s_dem = s_dem_1
        print('Starting with predictions for scenario warm/dry.')
    elif i == 1:
        s_name = 'moderate'
        s_cli = s_cli_2
        s_dem = s_dem_2
        print('Starting with predictions for scenario moderate/average.')
    elif i == 2:
        s_name = 'coldwet'
        s_cli = s_cli_3
        s_dem = s_dem_3
        print('Starting with predictions for scenario cold/wet.')
        
    cur_ts = max_ts     
    
    # execute algorithm until end of prediction period is reached
    while cur_ts < pred_end:
        
        # go one period forward
        period_count += 1
        
        if cur_ts.month in range(3,9):
            cur_ts = cur_ts + MonthEnd(1)
        else:
            cur_ts = cur_ts + MonthEnd(3)
        
        # make dataframe as foundation for next prediction
        # and populate it with features
        X = prepare_next(newest_ids, cur_ts)
        X = populate_features(X, data, original_data, s_cli, s_dem, cur_ts)
        
        # make prediction with model
        X['infested_wood'] = model.predict(X[[
            'area_endangered',
            'timeframe_enc',
            'prev_infested_wood',
            'prev_infested_wood_rollyr',
            'prev_infested_wood_ofo',
            'RRK',
            'TM0',
            'demolition_wood',
            'RRK_rollsr'
        ]])
        
        # combine predictions with data as preparation for next iteration
        data = pd.concat(
            [data, 
            X[[
                'fdist_id', 
                'year', 
                'forest_ownership', 
                'timeframe', 
                'RRK', 
                'TM0', 
                'demolition_wood', 
                'infested_wood',
                'fdist_newname'
            ]]], 
            ignore_index=True
        )
        
        print(f'Period {period_count}/{max_periods} complete')
    
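    # after all predictions for scenario are done, extract them from data
    # (the concat with ignore_index=True places the first prediction
    # at position original_data.shape[0])
    predictions = data.iloc[original_data.shape[0]:].copy()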
    
    # column names in German and 'Sachsenforst notation' to match 'input.xlsx'
    predictions.columns = [
        'REVUFBADR',
        'Jahr',
        'Eigentumsgruppe',
        'ZR',
        'Niederschlagsumme in l/m2',
        'Mittlere Temperatur in °C',
        'Zugang Wurf-/Bruchholz',
        'Zugang Schadholz',
        'Revier'
    ]
    
    # save predictions as 'elaborate' version in Excel-file
    predictions.to_excel(
        'predictions_elaborate_'+s_name+'.xlsx',
        index=False
    )
    
    # sum up results for respective forestry districts and ownership groups
    predictions = predictions.groupby(
        ['REVUFBADR', 'Revier', 'Eigentumsgruppe']
    )['Zugang Schadholz'].sum()
    
    # save predictions as 'condensed' version in a second Excel-file
    predictions.to_excel(
        'predictions_total_'+s_name+'.xlsx'
    )
    
    print(
        f'Completed forecasts for scenario. ' 
        f'Written in "predictions_elaborate_{s_name}.xlsx" '
        f'and "predictions_total_{s_name}.xlsx".\n'
    )
    
    

Starting with predictions for scenario warm/dry.
Period 1/4 complete
Period 2/4 complete
Period 3/4 complete
Period 4/4 complete
Completed forecasts for scenario. Written in "predictions_elaborate_warmdry.xlsx" and "predictions_total_warmdry.xlsx".

Starting with predictions for scenario moderate/average.
Period 1/4 complete
Period 2/4 complete
Period 3/4 complete
Period 4/4 complete
Completed forecasts for scenario. Written in "predictions_elaborate_moderate.xlsx" and "predictions_total_moderate.xlsx".

Starting with predictions for scenario cold/dry.
Period 1/4 complete
Period 2/4 complete
Period 3/4 complete
Period 4/4 complete
Completed forecasts for scenario. Written in "predictions_elaborate_coldwet.xlsx" and "predictions_total_coldwet.xlsx".
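
The way the loop above walks through the forecast periods can be reproduced on its own. The sketch below copies only the stepping rule from the loop (one month-end step while the current month is March to August, otherwise a three-month step). The function name `forecast_periods` and the two dates are hypothetical and chosen purely for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd


def forecast_periods(max_ts, pred_end):
    """List the period end dates visited between max_ts and pred_end."""
    periods = []
    cur_ts = max_ts
    while cur_ts < pred_end:
        if cur_ts.month in range(3, 9):
            # monthly timeframes during the main season
            cur_ts = cur_ts + MonthEnd(1)
        else:
            # three-month timeframes outside of it
            cur_ts = cur_ts + MonthEnd(3)
        periods.append(cur_ts)
    return periods


# hypothetical start and end of the prediction range
periods = forecast_periods(
    pd.Timestamp('2020-06-30'),
    pd.Timestamp('2021-04-30')
)
for ts in periods:
    print(ts.date())
```

With these example dates the period ends fall in July, August, September, December, March, and April, which matches the months that are not excluded when `max_periods` is calculated.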



The following code bundles the Excel files containing the predictions into a zip archive and downloads it from Google Colab. It is not needed if the notebook is run locally.

In [7]:
# only for use in Google Colab:
# add files to zip archive
# (the archive gets its own name so that the zipfile module is not
# shadowed and the cell can be executed more than once)
with zipfile.ZipFile(
    'predictions.zip', 
    mode='w', 
    compression=zipfile.ZIP_DEFLATED
) as archive:
    archive.write('predictions_elaborate_warmdry.xlsx')
    archive.write('predictions_elaborate_moderate.xlsx')
    archive.write('predictions_elaborate_coldwet.xlsx')
    archive.write('predictions_total_warmdry.xlsx')
    archive.write('predictions_total_moderate.xlsx')
    archive.write('predictions_total_coldwet.xlsx')

# download files from Google Colab
files.download('predictions.zip')