# Bark Beetles: Predicting the Plague - Model Training

#### Modeling the spruce bark beetle infestation in short-time intervals for locally distinct spatial administrative units within Saxony on the basis of the infestation development and the weather pattern up to the time of forecast

**Author**
Yannic Holländer

**Abstract**
In this notebook the model will be used to predict future values for the amount of infested wood based on multiple user-defined scenarios.

# 1. Setup

A Jupyter Notebook was chosen as the way to deploy the machine learning model (as well as the preferred method to supply all code written as part of this project) at the behest of *Sachsenforst*. Notebooks make it easy for novices to read and examine code, but more importantly, the client does not have access to Python on their machines and **does** have the option of using Google Colab to view and execute these notebooks online. This is also why this notebook in particular is written with Google Colab in mind and contains optional code blocks to upload data to and download data from the platform.

To make predictions with this notebook, two more files are needed. The first is the 'model.pkl' file, which was written in the previous notebook and contains the final prediction model. The other file is named 'input.xlsx' and is a simplified version of the bark beetle dataset that contains only the minimal set of columns needed to make predictions with the model. Since the precipitation rolling mean as well as previous values for the amount of infested wood are predictors in the model, the use of this file is mandatory. However, it is purposely designed to require minimal maintenance. Derived features, such as the previous amount of infested wood, are not included in the file but calculated 'on the fly' in this notebook. The file also supplies historic values for the climate parameters as well as the demolition wood, both of which can be used to create the different scenarios.
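To illustrate what calculating a derived feature 'on the fly' can look like, the following is a minimal sketch on made-up data. It is not the implementation used further down (the notebook looks the values up row by row in `populate_features`); the `period` column and all numbers are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy observations for one district; all values are made up for illustration.
toy = pd.DataFrame({
    'fdist_id': [2101] * 6,
    'forest_ownership': ['SW', 'NSW'] * 3,
    'period': [1, 1, 2, 2, 3, 3],  # hypothetical running period index
    'infested_wood': [10.0, 4.0, 25.0, 8.0, 40.0, 12.0],
})

# Order the rows chronologically within each district/ownership group
toy = toy.sort_values(['fdist_id', 'forest_ownership', 'period'])

# Previous period's infested wood within the same district and ownership group;
# the first period of each group has no predecessor and therefore stays NaN
toy['prev_infested_wood'] = (
    toy.groupby(['fdist_id', 'forest_ownership'])['infested_wood'].shift(1)
)

print(toy)
```

Because the feature is derived from the stored observations, only the raw `infested_wood` column has to be maintained in the input file.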

The first step is to upload the data (this first code block can be ignored if the notebook is executed locally), import the necessary modules and read in the data from 'input.xlsx'.
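If the same notebook is meant to run both in Google Colab and locally without editing, the upload step could be guarded as sketched below. This is only a suggestion and not part of the original notebook; the helper name `running_in_colab` is hypothetical, and the check simply tests whether the `google.colab` package can be imported.

```python
def running_in_colab():
    """Return True if the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False


if running_in_colab():
    # Same upload dialogue as in the Colab-only cell of this notebook
    from google.colab import files
    data_to_load = files.upload()
else:
    print('Running locally: input.xlsx and model.pkl are read from the '
          'working directory.')
```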

In [1]:
# Only in Google Colab:
# Upload input.xlsx and model.pkl in the dialogue.
from google.colab import files
data_to_load = files.upload()

%pip install scikit-learn==0.23.2

In [2]:
# imports
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
import pickle
import zipfile

# read file with historic observations
data = pd.read_excel(
    'input.xlsx',
    names=[
        'fdist_id',
        'year',
        'forest_ownership',
        'timeframe',
        'RRK',
        'TM0',
        'demolition_wood',
        'infested_wood'
    ]
)

# 2. User Settings

The prediction parameters in the following code block can be freely modified by the user and affect the calculation of the predictions. The rest of the code can then be executed directly without the need for further adjustments. An exception would be, for example, if a restructuring of the forest districts takes place. Then the subsequently assigned endangered forest areas would need to be adjusted.

The predictions are made for three scenarios, each defined by two parameters. One parameter (s_cli) defines the presumed future climate conditions and a second one (s_dem) the accumulated demolition wood. The parameters can be defined separately and populated with any valid values, but the names of the scenarios are pre-defined as: 1 - warm/dry, 2 - moderate/average, 3 - cold/humid. 

The parameters can be specified either as a four-digit integer representing a year or by using a quantile in the range 1-99. Combinations of these two possibilities within one scenario are allowed. If a year number is selected (e.g. 2015), then the climate or demolition wood values from the chosen year are referenced. If a quantile is specified, the respective quantile from the entire history in 'input.xlsx' is calculated for the respective month and used as the assumed values for the scenario. A higher quantile value for the climatic parameter expresses that the climate is getting drier and warmer (i.e. at 75, the 75% quantile of temperature and the 25% quantile of precipitation are calculated). The 50% quantile corresponds to the median.
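The rule described above can be sketched as a small helper. This is a simplified illustration under stated assumptions, not the notebook's actual code (which performs the lookup per district, ownership group and timeframe inside `populate_features`); the function name `scenario_value`, the `invert` flag and the toy history are hypothetical.

```python
import numpy as np
import pandas as pd


def scenario_value(history, column, s, invert=False):
    """Resolve one scenario parameter for a single history.

    history: observations of one district/ownership group/timeframe,
             one row per year (assumption of this sketch)
    column:  name of the column to look up
    s:       a year contained in the history or a quantile in 1..99
    invert:  use the (100 - s) % quantile instead, as done for
             precipitation in the climate scenario
    """
    if s in history['year'].unique():
        # Year given: reuse the value observed in that year
        return history.loc[history['year'] == s, column].iloc[0]
    if s in range(1, 100):
        # Quantile given: compute it over the whole history
        q = 1 - s / 100 if invert else s / 100
        return np.quantile(history[column].to_numpy(), q)
    raise ValueError('s must be a year in the data or a quantile in 1..99')


# Made-up history for one district and timeframe
hist = pd.DataFrame({
    'year': [2016, 2017, 2018, 2019, 2020],
    'TM0': [14.0, 15.0, 16.0, 17.0, 18.0],  # mean temperature
    'RRK': [90.0, 70.0, 50.0, 30.0, 10.0],  # precipitation sum
})

print(scenario_value(hist, 'TM0', 2018))              # value from the year 2018
print(scenario_value(hist, 'TM0', 75))                # 75 % quantile of temperature
print(scenario_value(hist, 'RRK', 75, invert=True))   # 25 % quantile of precipitation
```

With `s_cli = 75`, the sketch picks a high temperature and a low precipitation value, which is what 'warmer and drier' means in the paragraph above.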

A last parameter indicates the end of the prediction period. For example, if the last real observation in 'input.xlsx' is September 2020 and this parameter is defined as '2021-05', then the periods 'October-December', 'January-March', 'April' and 'May' will be calculated consecutively, i.e. everything between the last actual entry up to and including the end of the specified month. In general, it is recommended not to make predictions for the distant future, because several predictors are calculated from the past values for infested wood. Since the model always takes its own output (prediction) as input (observation) for the next period and no limitation of the feedback loop is implemented, deviations as well as wrong assumptions of the model may be amplified. Also for this reason, it is important that real, current observations are entered into the input file when they are available to allow corrections of the predictions by the model.
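The sequence of prediction periods from the example above can be reproduced with a few lines. The sketch below uses the same stepping rule as the prediction loop later in this notebook (single-month steps after March to August, three-month steps otherwise); the function name `forecast_periods` is hypothetical and only serves the illustration.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd


def forecast_periods(last_obs_end, pred_end):
    """List the period-end timestamps that will be predicted."""
    periods = []
    cur = pd.Timestamp(last_obs_end)
    end = pd.Timestamp(pred_end)
    while cur < end:
        # After March..August the next period is a single month,
        # after September and December it spans three months
        step = 1 if cur.month in range(3, 9) else 3
        cur = cur + MonthEnd(step)
        periods.append(cur)
    return periods


# Last real observation: September 2020, prediction end: '2021-05'
periods = forecast_periods('2020-09-30', '2021-05-31')
print([p.strftime('%Y-%m') for p in periods])
# ['2020-12', '2021-03', '2021-04', '2021-05']
```

Each of these periods is predicted from the output of the previous one, so four consecutive periods already mean three rounds of feeding predictions back into the model.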

In [3]:
# user-defined settings
# scenarios for the climate as well as demolition wood

# scenario 1 - warm/dry
s_cli_1 = 2018
s_dem_1 = 2018

# scenario 2 - moderate/average
s_cli_2 = 50
s_dem_2 = 50

# scenario 3 - cold/humid
s_cli_3 = 2010
s_dem_3 = 2010


# last prediction period ('YYYY-MM')
# month values of 03-09 as well as 12 allowed
pred_end = '2022-03'

# 3. Input verification

In the following code, the imported Excel file 'input.xlsx' and the specified setting parameters are scanned for errors and problems. Any warnings are printed below.

In [4]:
# Check input data integrity

# count warnings
warn_count = 0

################################################################################

# no empty values?
n_nan = data.isna().sum().sum()

if n_nan == 0:
    # no empty values in historic observations
    pass
else:
    print(
        f'Warning: {n_nan} empty values in past observations.\n'
        f'This may affect the quality of the predictions negatively.\n'
    )
    warn_count += 1
    
################################################################################

# no duplicate rows?
dup_rows = data.duplicated(
    ['fdist_id', 'forest_ownership', 'year', 'timeframe']
)

if not dup_rows.any():
    # no duplicate rows in data
    pass
else:    
    print(
        f'Warning: Multiple entries for certain observations found.\n'
        f'The following rows will be deleted: {data[dup_rows]}.\n'
    )
    
    warn_count += 1
    
    data.drop_duplicates(
        ['fdist_id', 'forest_ownership', 'year', 'timeframe'], 
        inplace=True
    )

################################################################################

# Check if current values are available for all forestry districts

# get current timeframe
max_year = data['year'].max()
max_timeframe = data[data['year'] == max_year]['timeframe'].max()

newest_data = data.loc[
    (data['year'] == max_year) & (data['timeframe'] == max_timeframe)
]

if newest_data.shape[0] == 106:
    # current values for all forestry districts found
    newest_ids = newest_data['fdist_id'].unique()
else:
    # existing IDs in current values
    newest_ids = newest_data['fdist_id'].unique()
    
    # existing IDs with only one forest ownership group
    mo_ids = [
        ID for ID in newest_ids 
        if newest_data[newest_data['fdist_id']==ID].shape[0] == 1
    ]
    
    # output
    print(
        f'Warning: Current observations found for {len(newest_ids)} ' 
        f'forestry districts.\n'
        f'One forest ownership group is missing in {len(mo_ids)} of '
        f'these districts.\n'
        f'Only districts with both ownership groups are taken into account.\n'
    )
    
    warn_count += 1
    
    # only continue with forestry districts containing both ownerships
    newest_ids = [x for x in newest_ids if x not in mo_ids]

################################################################################ 

# Check if all forestry districts are present for the previous 7 observations 

def eotf(x):
    '''
    Return the end of a timeframe. 
    input:
        - x: a timestamp (datetime object)
    returns:
        - End of the month (April-September) or end of the month after the next
          (January, October) as a datetime object
    '''

    if x.month in range(4, 10):
        return x + MonthEnd()
    else:
        return x + MonthEnd(3)
    

for ID in newest_ids:
    for fo in ['NSW', 'SW']:

        sub = data[
            (data['fdist_id'] == ID) & (data['forest_ownership'] == fo)
        ].copy()
        
        sub['ts'] = sub['year'].astype(str) + sub['timeframe'].map(
            lambda x: '-' + x.split(' ')[0])
        sub['ts'] = pd.to_datetime(sub['ts']).map(lambda x: eotf(x))
        
        max_ts = str(max_year) + '-' + max_timeframe.split(' ')[0]
        max_ts = eotf(pd.to_datetime(max_ts))
        cur_ts = max_ts
        
        for i in range(1,8):
            if cur_ts.month in range(4, 10):
                cur_ts = cur_ts + MonthEnd(-1)
            elif cur_ts.month in (3, 12):
                cur_ts = cur_ts + MonthEnd(-3)
            else:
                print(
                    f'Error: Invalid months in observations.\n'
                    f'District {ID}, ownership group {fo}, timeframe {cur_ts}.\n'
                    f'Timeframes beginning with 02, 03, 11, 12 not allowed.\n'
                    f'Interrupting code execution...'
                )
                raise SystemExit(f'Invalid timeframe in {ID}, {fo}, {cur_ts}!')
                
            if (sub['ts'] == cur_ts).any():
                pass
            else:
                print(
                    f'Error: Necessary entry not found.\n'
                    f'District {ID}, ownership group {fo}, timeframe {cur_ts}.\n'
                    f'All forest districts for which predictions are to be '
                    f'made require observations for the whole last year.'
                )
                raise SystemExit(f'Missing entry: {ID}, {fo}, {cur_ts}!')

# all necessary observation entries found in input data

################################################################################

# check the user-defined setting parameters
for p in [s_cli_1, s_dem_1, s_cli_2, s_dem_2, s_cli_3, s_dem_3]:
    if (p in range(1,100)) or (p in data['year'].unique()):
        pass
    else:
        raise SystemExit(
            'Invalid prediction parameters. The scenario parameters are '
            'required to be between 1 and 99 or correspond to a year contained '
            'in the dataset.'
        )
        
pred_end = pd.to_datetime(pred_end) + MonthEnd()   

if pred_end - max_ts < pd.Timedelta('0'):
    raise SystemExit(
        'Invalid prediction parameters. The end of the prediction timeframe '
        'cannot be in the past.'
        )
elif pred_end - max_ts > pd.Timedelta('470 days'):
    print(
        'Warning: predictions for the distant future are not recommended.'
    )
    warn_count += 1

################################################################################  

print(
    f'Review of the data set and the prediction parameters completed.\n'
    f'0 errors and {warn_count} warning(s) occurred.'
)    



Warning: 4 empty values in past observations.
This may affect the quality of the predictions negatively.

Warning: predictions for the distant future are not recommended.
Review of the data set and the prediction parameters completed.
0 errors and 2 warning(s) occurred.


# 4. Forecast preparation

The following codeblock contains functions to create the features needed for the prediction from 'input.xlsx'. The assignment of information to forest district ids, timeframes etc. is done using different dictionaries. This has the advantage that in case of restructuring etc. the values can be manually adjusted easily by the user. Furthermore, this information does not have to be read from additional data into Google Colab, reducing potential sources of errors. The disadvantage of this approach is that the code becomes longer and possibly more convoluted.
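One practical consequence of the dictionary approach is that an id missing from a dictionary silently maps to a missing value. The following sketch shows how such ids could be detected after a restructuring; it is an illustration only, with a shortened mapping and a made-up id `9999` standing in for a new district that has not been added yet.

```python
import pandas as pd

# Hypothetical excerpt of the id-to-name mapping. Old and new ids
# (e.g. 2191 and 2101) may point to the same current district name.
names = {2101: 'Eibenstock', 2191: 'Eibenstock', 2102: 'Zwönitz'}

# 9999 is a made-up id that is not part of the mapping
ids = pd.Series([2101, 2191, 2102, 9999], name='fdist_id')

# Ids without a dictionary entry become NaN
mapped = ids.map(names)
unmapped = ids[mapped.isna()].tolist()

print(mapped.tolist())
print('Ids without a dictionary entry:', unmapped)
```

Running a check like this after editing the dictionaries would flag any district that needs to be added manually.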

In [5]:
# Dictionary for the assignment of the (new) names of the forest districts.
# As in the previous notebooks, for the 'old' districts with leading 9s the 
# name of the current district that best approximates it is used.
# This logical connection is necessary to ensure that the correct district
# will be used if a year prior to 2015 is used as the scenario.
fdist_names = {
    2501: 'Elsterheide',
    2502: 'Bernsdorf',
    2503: 'Königswartha',
    2504: 'Nebelschütz',
    2505: 'Königsbrück',
    2506: 'Radibor',
    2507: 'Kamenz',
    2508: 'Ohorn',
    2509: 'Bischofswerda',
    2510: 'Cunewalde',
    1101: 'Chemnitz',
    1201: 'Dresden',
    2101: 'Eibenstock',
    2102: 'Zwönitz',
    2103: 'Stollberg',
    2104: 'Zschopau',
    2105: 'Annaberg',
    2106: 'Marienberg',
    2107: 'Olbernhau',
    2191: 'Eibenstock',
    2192: 'Schwarzenberg',
    2193: 'Zwönitz',
    2194: 'Stollberg',
    2195: 'Annaberg',
    2196: 'Zschopau',
    2197: 'Marienberg',
    2198: 'Olbernhau',
    2201: 'Geringswalde',
    2202: 'Striegistal',
    2203: 'Reinsberg',
    2204: 'Frauenstein',
    2601: 'Zittau',
    2602: 'Löbau',
    2603: 'Niesky',
    2604: 'Krauschwitz',
    2605: 'Boxberg',
    2606: 'Weißwasser',
    2901: 'Muldental',
    2902: 'Leipziger Land',
    2701: 'M Nord',
    2702: 'M Ost',
    2703: 'M Süd',
    2704: 'M West',
    2791: 'M Nord',
    2792: 'M Süd',
    2793: 'M Ost',
    2801: 'Freital',
    2802: 'Glashütte',
    2803: 'Bad-Gottleuba',
    2804: 'Pirna',
    2805: 'Sebnitz',
    3001: 'Delitzsch',
    3002: 'Torgau',
    3003: 'Oschatz',
    2301: 'Adorf',
    2302: 'Schöneck',
    2303: 'Weischlitz',
    2304: 'Plauen',
    2305: 'Treuen',
    2306: 'Auerbach',
    2401: 'Z Nord',
    2402: 'Z Süd',
    1302: 'Connewitz',
    1301: 'Leutzsch'
}

# endangered forest area for state-owned forests by fdist_id
endarea_sw = {
    2501: 26.85,
    2502: 82.46,
    2503: 24.17,
    2504: 8.68,
    2505: 483.53,
    2506: 63.43,
    2507: 3.43,
    2508: 1417.36,
    2509: 127.91,
    2510: 8.3,
    1101: 774.06,
    1201: 1762.71,
    2101: 15177.9,
    2102: 2847.32,
    2103: 1921.76,
    2104: 3646.39,
    2105: 7210.79,
    2106: 3946.53,
    2107: 2314.46,
    2191: 10922.47,
    2192: 6940.4,
    2193: 1290.14,
    2194: 2364.83,
    2195: 6142.78,
    2196: 3191.06,
    2197: 3892.5,
    2198: 2315.36,
    2201: 196.15,
    2202: 1147.04,
    2203: 2706.18,
    2204: 3833.76,
    2601: 14.92,
    2602: 20.78,
    2603: 8.2,
    2604: 122.78,
    2605: 71.88,
    2606: 42.73,
    2901: 51.06,
    2902: 615.51,
    2701: 3.85,
    2702: 0.6,
    2703: 381.91,
    2704: 0.08,
    2791: 3.93,
    2792: 381.83,
    2793: 1.09,
    2801: 5291.51,
    2802: 4507.81,
    2803: 2690.42,
    2804: 1584.51,
    2805: 8458.97,
    3001: 0.12,
    3002: 0.0,
    3003: 465.65,
    2301: 4312.15,
    2302: 8310.18,
    2303: 791.57,
    2304: 779.57,
    2305: 475.83,
    2306: 2924.21,
    2401: 1348.61,
    2402: 196.48,
    1302: 0.0,
    1301: 0.0
}

# endangered forest area for private/corporate forests by fdist_id
endarea_nsw = {
    2501: 11.37,
    2502: 60.79,
    2503: 109.44,
    2504: 231.78,
    2505: 173.07,
    2506: 246.66,
    2507: 721.84,
    2508: 1028.08,
    2509: 2184.61,
    2510: 3323.88,
    1101: 280.84,
    1201: 124.17,
    2101: 929.71,
    2102: 2953.21,
    2103: 2903.46,
    2104: 1745.57,
    2105: 1855.4,
    2106: 2062.88,
    2107: 1379.08,
    2191: 726.57,
    2192: 1143.21,
    2193: 3482.08,
    2194: 1898.64,
    2195: 2408.13,
    2196: 1577.87,
    2197: 1208.74,
    2198: 1380.03,
    2201: 841.61,
    2202: 954.18,
    2203: 1597.32,
    2204: 1708.41,
    2601: 3757.67,
    2602: 2858.26,
    2603: 631.21,
    2604: 401.0,
    2605: 146.13,
    2606: 103.7,
    2901: 225.27,
    2902: 401.71,
    2701: 33.41,
    2702: 114.56,
    2703: 392.75,
    2704: 36.08,
    2791: 22.8,
    2792: 411.13,
    2793: 143.31,
    2801: 1675.64,
    2802: 1439.74,
    2803: 1073.94,
    2804: 1153.3,
    2805: 2152.83,
    3001: 11.28,
    3002: 14.46,
    3003: 271.09,
    2301: 3922.55,
    2302: 2342.52,
    2303: 2123.89,
    2304: 2152.93,
    2305: 2621.37,
    2306: 2748.21,
    2401: 1319.45,
    2402: 1794.78,
    1302: 0.54,
    1301: 0.0
}

# numerical encoding of timeframes analogous to training notebook
timeframe_encoder = {
    '01 Januar-März': 0.04695437961628093, 
    '04 April': 0.01613466392797747, 
    '05 Mai': 0.026238326902676162, 
    '06 Juni': 0.048392631582756127, 
    '07 Juli': 0.11820248134015453, 
    '08 August': 0.2340321842232208, 
    '09 September': 0.2396741387722562, 
    '10 Oktober-Dezember': 0.2703711936346777
}

# name of timeframe in notation used by Sachsenforst from month integer
tf_from_month = {
    3:  '01 Januar-März',
    4:  '04 April',
    5:  '05 Mai',
    6:  '06 Juni',
    7:  '07 Juli',
    8:  '08 August',
    9:  '09 September',
    12: '10 Oktober-Dezember'
}
    
# Create column with the (new) names of the forest districts
data['fdist_newname'] = data['fdist_id'].map(lambda x: fdist_names.get(x))

# Function to create the 'skeleton' for the next prediction,
# i.e. a dataframe containing only the descriptive columns
def prepare_next(
    newest_ids, cur_ts, 
    fdist_names=fdist_names, 
    tf_from_month=tf_from_month
):
    '''
    Create a new dataframe for the next prediction.
    The dataframe contains the columns for the forest district (id, name),
    the ownership group, the year and the timeframe.

    input:
        - newest_ids: all REVUFBADR numbers included in the new prediction
        - cur_ts: timeframe of the new prediction as a datetime object
        - fdist_names: dictionary mapping ids to forest district names
        - tf_from_month: dictionary mapping months to categorical timeframes

    returns:
        - dataframe serving as the basis for the next prediction
    '''

    # Create the later columns as numpy arrays
    # Each REVUFBADR (fdist_id) twice, because there are two ownership groups
    fdist_id = np.append(newest_ids, newest_ids)

    # Column with the year, same length
    year = [cur_ts.year] * 2 * len(newest_ids)

    # Each ownership group once per forest district
    forest_ownership = np.append(
        ['NSW'] * len(newest_ids),
        ['SW'] * len(newest_ids)
    )

    # Timeframe as month, taken from cur_ts
    timeframe = [cur_ts.month] * 2 * len(newest_ids)
    # Convert timeframe back to the Sachsenforst notation via dictionary
    timeframe = np.array([tf_from_month.get(x) for x in timeframe])

    # Forest district name via dictionary
    fdist_newname = [fdist_names.get(x) for x in fdist_id]

    # Combine the columns into a dataframe
    df = pd.DataFrame([
        fdist_id,
        year,
        forest_ownership,
        timeframe,
        fdist_newname
    ]).T

    # Set column names
    df.columns =[
        'fdist_id',
        'year',
        'forest_ownership',
        'timeframe',
        'fdist_newname'
    ]
    
    return df
    
    
    
# Function to fill the 'skeleton' with features
def populate_features(
    X, data, original_data, s_cli, s_dem, cur_ts,
    endarea_sw=endarea_sw, 
    endarea_nsw=endarea_nsw, 
    tf_from_month=tf_from_month,
    timeframe_encoder=timeframe_encoder
):
    '''
    Take a dataframe with descriptive columns and add all features needed
    for the prediction by looking up the respective values in the historic
    observations and the latest observations/predictions and aggregating
    or processing them.

    inputs:
        - X: the dataframe to which the features are appended and for
          which the prediction is made later
        - data: dataframe with all real observations so far as well as
          the predictions made so far. Relevant when the most recent
          infested wood is included
        - original_data: dataframe with real observations only. Relevant
          when the climate is determined, because only real observations
          form the basis for the quantile
        - s_cli: current climate scenario
        - s_dem: scenario for demolition wood
        - cur_ts: current period as a datetime object
        - endarea_sw: dictionary mapping the forest districts to their
          endangered forest area for state-owned forest
        - endarea_nsw: dictionary mapping the forest districts to their
          endangered forest area for non-state forest
        - tf_from_month: dictionary mapping months to categorical timeframes
        - timeframe_encoder: dictionary mapping the timeframes to their
          numerical factors, analogous to the model training

    returns:
        - dataframe with all rows and columns needed to predict the next
          timeframe with the trained model
    '''
    ############################################################################

    # Preparation: copy 'data' (the original should remain unchanged)
    df = data.copy()

    # Preparation: add timestamp column
    df['ts'] = df['year'].astype(str) + df['timeframe'].map(
            lambda x: '-' + x.split(' ')[0])
    df['ts'] = pd.to_datetime(df['ts']).map(lambda x: eotf(x))

    # Preparation: previous timestamp
    if cur_ts.month in range(4,10):
        prev_ts = cur_ts + MonthEnd(-1)
    else:
        prev_ts = cur_ts + MonthEnd(-3)
        
    ############################################################################ 

    # Feature: area_endangered
    X['area_endangered'] = X[['fdist_id','forest_ownership']].apply(
        lambda x: endarea_sw.get(x[0]) if x[1]=='SW' else endarea_nsw.get(x[0]),
        axis=1
    )
    
    ############################################################################    

    # Features based on previous infested wood
    prev_inf_wood = []
    prev_inf_wood_ofo = []
    prev_infested_Wood_rollyr = []
    
    for r in X[['fdist_newname', 'forest_ownership']].itertuples(index=False):
        # Feature: prev_infested_wood
        # Value for the last timeframe of the same district and ownership group
        piw = df.loc[
            (df['fdist_newname'] == r[0]) &
            (df['forest_ownership'] == r[1]) &
            (df['ts'] == prev_ts)
        ]['infested_wood'].values[0]
        
        prev_inf_wood.append(piw)
        
        # Feature: prev_infested_wood_ofo
        # Value for the last timeframe of the same district in the other
        # ownership group
        piwo = df.loc[
            (df['fdist_newname'] == r[0]) &
            (df['forest_ownership'] != r[1]) &
            (df['ts'] == prev_ts)
        ]['infested_wood'].values[0]
        
        prev_inf_wood_ofo.append(piwo)
        
        # Feature: prev_infested_Wood_rollyr
        # Sum of the values of the last year for the respective district
        # and ownership group
        piwryr = df.loc[
            (df['fdist_newname'] == r[0]) &
            (df['forest_ownership'] == r[1]) &
            (df['ts'] >= cur_ts + MonthEnd(-12)) & 
            (df['ts'] < cur_ts)
        ]['infested_wood'].sum()
        
        prev_infested_Wood_rollyr.append(piwryr)
        
    X['prev_infested_wood'] = prev_inf_wood
    X['prev_infested_wood_ofo'] = prev_inf_wood_ofo
    X['prev_infested_wood_rollyr'] = prev_infested_Wood_rollyr
        
    ############################################################################ 

    # Climate features
    rrk = []
    tm0 = []
    rrk_rollsr = []

    # Option 1: scenario given as a year
    if s_cli in original_data['year'].unique():
        for r in X[[
            'fdist_newname', 'forest_ownership', 'timeframe'
        ]].itertuples(index=False):
            # Problem if the year is <= 2013 and fdist_newname is 'M West'
            # (Meißen West): the entry does not exist. Workaround in the
            # following if statement:
            if r[0] == 'M West' and s_cli <= 2013:
                # Feature: RRK
                rrk_s = original_data.loc[
                    (original_data['fdist_newname'] == 'M Nord') &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_cli)  
                ]['RRK'].values[0]
                
                #Feature: TM0
                tm0_s = original_data.loc[
                    (original_data['fdist_newname'] == 'M Nord') &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_cli)  
                ]['TM0'].values[0]
                
            # For the other districts as follows:
            else:
                # Feature: RRK
                rrk_s = original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_cli)  
                ]['RRK'].values[0]
                
                #Feature: TM0
                tm0_s = original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_cli)  
                ]['TM0'].values[0]
                
            rrk.append(rrk_s)
            tm0.append(tm0_s)
            
            # Feature: RRK_rollsr
            # Sum of the summer values of the last year (including the
            # current one) for the respective district and ownership group,
            # divided by 6 below. First the past values:
            rrk_rollsr1 = df.loc[
                (df['fdist_newname'] == r[0]) &
                (df['forest_ownership'] == r[1]) &
                (df['ts'] >= cur_ts + MonthEnd(-11)) & 
                (df['ts'] <= cur_ts) &
                (df['ts'].map(lambda x: x.month).isin(range(4,10)))
            ]['RRK'].sum()
            
            # current value
            rrk_rollsr2 = rrk_s
            
            rrk_rollsr.append((rrk_rollsr1 + rrk_rollsr2) / 6)
                
    # Option 2: scenario given as a quantile
    elif s_cli in range(1,100):
        for r in X[[
            'fdist_newname', 'forest_ownership', 'timeframe'
        ]].itertuples(index=False):
            # Feature: RRK
            rrk_s = np.quantile(
                original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2])  
                ]['RRK'].values, 
                1 - (s_cli * 0.01)
            )
            
            rrk.append(rrk_s)
            
            # Feature: TM0
            tm0_s = np.quantile(
                original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2])  
                ]['TM0'].values, 
                s_cli * 0.01
            )
            
            tm0.append(tm0_s)
            
            # Feature: RRK_rollsr
            # Sum of the summer values of the last year (including the
            # current one) for the respective district and ownership group,
            # divided by 6 below. First the past values:
            rrk_rollsr1 = df.loc[
                (df['fdist_newname'] == r[0]) &
                (df['forest_ownership'] == r[1]) &
                (df['ts'] >= cur_ts + MonthEnd(-11)) & 
                (df['ts'] <= cur_ts) &
                (df['ts'].map(lambda x: x.month).isin(range(4,10)))
            ]['RRK'].sum()
            
            # current value
            rrk_rollsr2 = rrk_s
            
            rrk_rollsr.append((rrk_rollsr1 + rrk_rollsr2) / 6)

            
    X['RRK'] = rrk
    X['TM0'] = tm0
    X['RRK_rollsr'] = rrk_rollsr
    ############################################################################ 

    # Demolition wood
    dmw = []

    # Option 1: scenario given as a year
    if s_dem in original_data['year'].unique():
        for r in X[[
            'fdist_newname', 'forest_ownership', 'timeframe'
        ]].itertuples(index=False):
            # Feature: demolition_wood
            # Problem if the year is <= 2013 and fdist_newname is 'M West'
            # (Meißen West): the entry does not exist. Workaround in the
            # following if statement (checks s_dem, the demolition wood scenario)
            if r[0] == 'M West' and s_dem <= 2013:
                dmw_s = original_data.loc[
                    (original_data['fdist_newname'] == 'M Nord') &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_dem)  
                ]['demolition_wood'].values[0]
            
            # For the other districts as follows:
            else:
                dmw_s = original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2]) &    
                    (original_data['year'] == s_dem)  
                ]['demolition_wood'].values[0]
            
            dmw.append(dmw_s)
            
    # Option 2: scenario given as a quantile
    elif s_dem in range(1,100):
        for r in X[[
            'fdist_newname', 'forest_ownership', 'timeframe'
        ]].itertuples(index=False):
            # Feature: demolition_wood
            dmw_s = np.quantile(
                original_data.loc[
                    (original_data['fdist_newname'] == r[0]) &
                    (original_data['forest_ownership'] == r[1]) &
                    (original_data['timeframe'] == r[2])  
                ]['demolition_wood'].values, 
                s_dem * 0.01
            )
            
            dmw.append(dmw_s)
            
    X['demolition_wood'] = dmw
            
    ############################################################################ 
    
    # Feature: timeframe (encoded)
    X['timeframe_enc'] = X['timeframe'].map(lambda x: timeframe_encoder.get(x))
    
    
    return X    
        
# Load model
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# Predictions for the scenarios

Now that all required variables and functions have been defined, the predictions for the three scenarios can be made.

In [None]:
# Keep the original values of the data
original_data = data.copy()

# Predictions for three scenarios
for i in range(3):
    # Reset 'data' before each scenario
    data = original_data.copy()

    # Create counter for the intervals
    period_count = 0
    pred_range = pd.date_range(start=max_ts, end=pred_end, freq='M')
    max_periods = len(
        [ts for ts in pred_range if ts.month not in (10,11,1,2)]
    ) - 1
    
    # Define scenario via the user settings
    # (s_name is used in the output file names and is left unchanged)
    if i == 0:
        s_name = 'warmtrocken'
        s_cli = s_cli_1
        s_dem = s_dem_1
        print('Starting prediction for scenario warm/dry.')
    elif i == 1:
        s_name = 'gemäßigt'
        s_cli = s_cli_2
        s_dem = s_dem_2
        print('Starting prediction for scenario moderate/average.')
    elif i == 2:
        s_name = 'kaltfeucht'
        s_cli = s_cli_3
        s_dem = s_dem_3
        print('Starting prediction for scenario cold/humid.')
        
    cur_ts = max_ts     
    
    # Algorithmus ausführen, bis Ende Vorhersagezeitraum erreicht
    while cur_ts < pred_end:
        
        # Advance by one interval: one month from March to August,
        # three months otherwise
        period_count += 1
        
        if cur_ts.month in range(3,9):
            cur_ts = cur_ts + MonthEnd(1)
        else:
            cur_ts = cur_ts + MonthEnd(3)
        
        # Create the dataframe for this interval and fill it with features
        X = prepare_next(newest_ids, cur_ts)
        X = populate_features(X, data, original_data, s_cli, s_dem, cur_ts)
        
        # Make the prediction
        X['infested_wood'] = model.predict(X[[
            'area_endangered',
            'timeframe_enc',
            'prev_infested_wood',
            'prev_infested_wood_rollyr',
            'prev_infested_wood_ofo',
            'RRK',
            'TM0',
            'demolition_wood',
            'RRK_rollsr'
        ]])
        
        # Append the predictions to the dataset in preparation
        # for the next iteration
        data = pd.concat(
            [data, 
            X[[
                'fdist_id', 
                'year', 
                'forest_ownership', 
                'timeframe', 
                'RRK', 
                'TM0', 
                'demolition_wood', 
                'infested_wood',
                'fdist_newname'
            ]]], 
            ignore_index=True
        )
        
        print(f'Intervall {period_count}/{max_periods} abgeschlossen')
    
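    # Once all predictions for the scenario are done, extract them from data.
    # concat(..., ignore_index=True) renumbers the rows from 0, so the first
    # predicted row sits at position original_data.shape[0]
    predictions = data.iloc[original_data.shape[0]:].copy()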

    predictions.columns = [
        'REVUFBADR',
        'Jahr',
        'Eigentumsgruppe',
        'ZR',
        'Niederschlagsumme in l/m2',
        'Mittlere Temperatur in °C',
        'Zugang Wurf-/Bruchholz',
        'Zugang Schadholz',
        'Revier'
    ]
    
    # Save the detailed version of the predictions to an Excel file
    predictions.to_excel(
        'vorhersagen_ausführlich_'+s_name+'.xlsx',
        index=False
    )
    
    # Sum the results per district to obtain the overall forecast
    predictions = predictions.groupby(
        ['REVUFBADR', 'Revier', 'Eigentumsgruppe']
    )['Zugang Schadholz'].sum()
    
    # Save the aggregated results to a second Excel file
    predictions.to_excel(
        'vorhersagen_gesamt_'+s_name+'.xlsx'
    )
    
    print(
        f'Vorhersage für Szenario abgeschlossen. ' 
        f'Gespeichert in "vorhersagen_ausführlich_{s_name}.xlsx" '
        f'sowie "vorhersagen_gesamt_{s_name}.xlsx".\n'
    )
    
    

Starte Vorhersage für Szenario warm/trocken.
Intervall 1/10 abgeschlossen
Intervall 2/10 abgeschlossen
Intervall 3/10 abgeschlossen
Intervall 4/10 abgeschlossen
Intervall 5/10 abgeschlossen
Intervall 6/10 abgeschlossen
Intervall 7/10 abgeschlossen
Intervall 8/10 abgeschlossen
Intervall 9/10 abgeschlossen
Intervall 10/10 abgeschlossen
Vorhersage für Szenario abgeschlossen. Gespeichert in "vorhersagen_ausführlich_warmtrocken.xlsx" sowie "vorhersagen_gesamt_warmtrocken.xlsx".

Starte Vorhersage für Szenario gemäßigt.
Intervall 1/10 abgeschlossen
Intervall 2/10 abgeschlossen
Intervall 3/10 abgeschlossen
Intervall 4/10 abgeschlossen
Intervall 5/10 abgeschlossen
Intervall 6/10 abgeschlossen
Intervall 7/10 abgeschlossen
Intervall 8/10 abgeschlossen
Intervall 9/10 abgeschlossen
Intervall 10/10 abgeschlossen
Vorhersage für Szenario abgeschlossen. Gespeichert in "vorhersagen_ausführlich_gemäßigt.xlsx" sowie "vorhersagen_gesamt_gemäßigt.xlsx".

Starte Vorhersage für Szenario kalt/feucht.
Interval

The following code downloads the Excel files containing the predictions from Google Colab. It is not needed when the notebook is run locally.

In [None]:
# # Google Colab only:
# # Add the files to a zip archive. The archive gets its own name so that
# # the zipfile module is not shadowed when the cell is run again.
# with zipfile.ZipFile(
#     'vorhersagen.zip',
#     mode='w',
#     compression=zipfile.ZIP_DEFLATED
# ) as archive:
#     archive.write('vorhersagen_ausführlich_warmtrocken.xlsx')
#     archive.write('vorhersagen_ausführlich_gemäßigt.xlsx')
#     archive.write('vorhersagen_ausführlich_kaltfeucht.xlsx')
#     archive.write('vorhersagen_gesamt_warmtrocken.xlsx')
#     archive.write('vorhersagen_gesamt_gemäßigt.xlsx')
#     archive.write('vorhersagen_gesamt_kaltfeucht.xlsx')
#
# # Download the zip archive from Google Colab
# files.download('vorhersagen.zip')
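
If a scenario run was interrupted, some of the Excel files may be missing and writing them to the archive would fail. The following is a minimal sketch of a more forgiving variant that only adds files that exist. The helper `zip_existing` and the file names `demo_a.xlsx` and `demo_b.xlsx` are hypothetical and are not part of the notebook.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_existing(paths, archive_path):
    """Write every existing file in `paths` to a zip archive; return the names added."""
    added = []
    with zipfile.ZipFile(
        archive_path, mode='w', compression=zipfile.ZIP_DEFLATED
    ) as archive:
        for path in map(Path, paths):
            if path.is_file():
                # Store only the file name, not the full directory path.
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added


# Demonstration in a temporary directory with one existing and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    present = tmp_dir / 'demo_a.xlsx'
    present.write_bytes(b'placeholder')
    missing = tmp_dir / 'demo_b.xlsx'

    added = zip_existing([present, missing], tmp_dir / 'vorhersagen.zip')
    with zipfile.ZipFile(tmp_dir / 'vorhersagen.zip') as archive:
        names = archive.namelist()

print(added)
```

The returned list makes it easy to report which files actually ended up in the archive.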