In [1]:
import pandas as pd, numpy as np
import bayes_net_utils as bn
import matplotlib.pyplot as plt

# Introduction

This is a draft notebook to be used as guidance in developing a Voila app for producing seasonal weather and water quality forecasts for Lake Vansjø.

# Workflow to follow before running this notebook after 2020

- If the forecast is being issued in April, the water chemistry and ecology data need updating to include values from the previous summer. This involves updating two files. It is probably best to ask Sigrid Haande for the data and update the files manually once a year:

    * cyanobacterial count data: '../../Data/Observed_Chem_Ecol/Van2_Vanemfjorden_Cyanobacteria.csv'
    * lake TP, chla and colour data: '../../Data/Observed_Chem_Ecol/Van2_Vanemfjorden_chem_obs.csv'<br><br>
    
- Optionally update the model fitting to include more recent data:
    * Also optionally: download and update latest ERA5 met data (see notebook '../../MetData_Processing/notebooks/05_download_era5'). As met data isn't included at the moment in the network, this isn't important for now.
    * Update the fitting of the Bayesian Network: re-run notebooks 04_MakeHistoricTrainingData onwards, to create a new bn.fit object. The model performance statistics will also change, and will need updating in the online tool.


# Weather forecast for the coming season

The first part of the forecast is a weather forecast for the three-month 'target season'. The workflow would be something like:
- Download System5 data for the area from Copernicus. Just 25 members (to match the hindcast period), or all 50 if that's easier. Just precipitation and air temperature are fine for now.
- Using a similar workflow to that used in the original Voila app (put together before Ireland), calculate the proportion of S5 members in each of the three terciles. Use this to populate the 'Likelihood' column in the layout template.
- Work out which tercile is most likely. Use this to generate a string variable, most_likely_tercile (see the first column in the forecast table for possible string descriptions of the terciles for temperature and precipitation).
- Also work out least_likely_tercile.
- Read in the historic ROCSS values (calculated in the met_data_processing notebooks, by comparing S5 and ERA5), and use them to populate the 'Historic skill' ROCSS column in the layout. NB there are different scores for each season/met variable/tercile. If the ROCSS was significant, set the text in this column to 'Some skill'; if it was not significant, set the text to 'None'.
- Generate 'overall confidence' scores. For each tercile, combine the 'Likelihood' and 'Historic skill' columns (see the 'Confidence score guide' table in the layout word document), and then pick out value for most and least likely terciles:
    * Overall confidence score, confidence that most likely tercile will happen
    * Overall confidence score that the least likely tercile won't happen
- Combine information across terciles and from the two kinds of 'overall confidence' score to generate a single forecast text summary, for the 'Forecast summary' column in the table. Example logic (the score strings are placeholders; use the labels from the 'Confidence score guide' table):

        low_scores = ('None', 'very low')
        if confidence_score_most_likely in low_scores and confidence_score_least_likely in low_scores:
            print("Forecast confidence is too low to make %s predictions for the coming season" % weather_variable)
        if confidence_score_most_likely in ('low', 'medium', 'high'):
            print("There is %s confidence that the coming season will be %s" % (confidence_score_most_likely, most_likely_tercile))
        if confidence_score_least_likely in ('medium', 'high'):
            print("There is %s confidence that the coming season will not be %s" % (confidence_score_least_likely, least_likely_tercile))
         
Feel free to modify this in any way though, it's possible I've made mistakes, or that things could be done better! If you do change anything, make sure to change both the layout and the accompanying text guidance doc too.
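
As a rough illustration of the tercile steps above, here is a minimal sketch. It assumes the forecast members and the hindcast climatology are already available as 1-D arrays of seasonal values. The function names `tercile_likelihoods` and `skill_text`, the tercile names and all numbers are hypothetical placeholders, not part of the existing Voila app.

```python
import numpy as np

def tercile_likelihoods(forecast_members, climatology):
    """Fraction of ensemble members in the lower, middle and upper tercile.

    Tercile boundaries are taken from the climatology (assumed here to be
    the hindcast values for the same season and variable).
    """
    lower, upper = np.quantile(climatology, [1 / 3, 2 / 3])
    members = np.asarray(forecast_members, dtype=float)
    counts = np.array([
        np.sum(members < lower),
        np.sum((members >= lower) & (members <= upper)),
        np.sum(members > upper),
    ])
    return counts / members.size

def skill_text(rocss_is_significant):
    """Text for the 'Historic skill' column."""
    return 'Some skill' if rocss_is_significant else 'None'

# Hypothetical numbers, for illustration only
climatology = np.arange(1.0, 31.0)
members = np.array([5.0, 8.0, 12.0, 25.0, 26.0, 27.0, 28.0, 29.0])

# Placeholder names; the real strings come from the forecast table
tercile_names = ['lower', 'middle', 'upper']

likelihood = tercile_likelihoods(members, climatology)
most_likely_tercile = tercile_names[int(np.argmax(likelihood))]
least_likely_tercile = tercile_names[int(np.argmin(likelihood))]
```

If two terciles tie, `np.argmax` and `np.argmin` return the first one, so a real implementation may want an explicit tie-breaking rule.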

# Lake chemistry and ecology forecast for the summer season

For the forecast issued in April, the seasonal weather forecast for the next three months (May-July) is accompanied by a forecast of water chemistry/ecology for the whole growing season (May-October). The forecast is for mean TP, chla, and colour concentration, and maximum cyanobacteria biovolume. This period and these variables match those used in WFD assessment of ecological status here.

All forecasts were originally going to be based on a Bayesian Belief Network (BBN) which included several weather-related nodes (mean seasonal wind speed and seasonal precipitation sum). However, the results of cross validation of the Bayesian Network and different versions of the network (notebook BN_CV_PythonPostProcess), and a comparison of different models for the hindcast period (notebook Hindcast_stats_and_plots), led to the following choices of model for operational forecasting:

- TP: BBN
- chla: Naive seasonal forecast
- colour: BBN, no met (stats were the same for BBN with met, without met, or seasonal naive. Chosen for consistency with cyano)
- cyano: BBN, no met

For those nodes where the prediction will be based on the BBN without met data drivers, the forecast is still derived using the BBN which includes met nodes. However, the met nodes are not used when setting evidence for forecasting.
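
To illustrate what leaving the met nodes out of the evidence means, here is a minimal sketch for a joint Gaussian distribution. It is not how `bayes_net_utils` or bnlearn produce the forecast. The function `conditional_mean`, the three-variable layout (previous-summer TP, wind, target TP) and all numbers are hypothetical. Nodes without evidence are simply marginalised out, so the prediction depends only on the nodes that are observed.

```python
import numpy as np

def conditional_mean(mu, cov, target, evidence):
    """Expected value of variable `target` given evidence {index: value}.

    Standard conditional mean of a multivariate normal; any variable not
    listed in `evidence` is treated as unobserved.
    """
    idx = sorted(evidence)
    values = np.array([evidence[i] for i in idx], dtype=float)
    s_ee = cov[np.ix_(idx, idx)]   # covariance among evidence variables
    s_te = cov[target, idx]        # covariance between target and evidence
    return float(mu[target] + s_te @ np.linalg.solve(s_ee, values - mu[idx]))

# Hypothetical joint distribution: [TP previous summer, wind, TP target]
mu = np.array([20.0, 3.0, 22.0])
cov = np.array([[4.0, 0.0, 2.0],
                [0.0, 1.0, 0.5],
                [2.0, 0.5, 3.0]])

# Evidence set only on previous-summer TP; the wind node is left unobserved
tp_expected = conditional_mean(mu, cov, target=2, evidence={0: 24.0})
```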

**Seasons**

The original plan was to update the lake chemistry and ecology forecast in the middle of the current summer, to take into account observed weather data for the first half of the summer, and update predictions for the second half using more up-to-date (and probably more accurate) seasonal forecasts. However, given the lack of sensitivity to met data, and therefore the decision to remove it from the network, this mid-summer update has been dropped (there would not be any data to base an updated prediction on). So just produce one lake water quality forecast in April for the coming summer season (May-October).

## Set up

In [2]:
target_yr = 2018

# ------------------------------------------------------------------------------
# Read in daily lake chem and ecol data

# Lake water quality
lakewq_fpath = r'../../Data/Observed_Chem_Ecol/Van2_Vanemfjorden_chem_obs.csv'
lakewq_df = pd.read_csv(lakewq_fpath, index_col=0, parse_dates=True, dayfirst=True)
# display(lakewq_df.head())

# Ecol count data
ecol_fpath = r'../../Data/Observed_Chem_Ecol/Van2_Vanemfjorden_Cyanobacteria.csv'
ecol_df = pd.read_csv(ecol_fpath, index_col=0, parse_dates=True, dayfirst=True)
# convert units to mm3/l (mg/l if assume density is same as water)
ecol_df['Cyano_biovol_mm3_per_l'] = ecol_df['Cyano_biovol_mm3_per_m3']/1000.
ecol_df.drop(['Cyano_biovol_mm3_per_m3'], axis=1, inplace=True)
# display(ecol_df.head())

# ------------------------------------------------------------------------------
# Join chem and ecol data together and tidy

daily_df = lakewq_df.join(ecol_df)

daily_df = daily_df.loc['%s-01-01'%(target_yr-1): '%s-12-31'%(target_yr-1)].asfreq('D') # Truncate to time period of interest

var_rename_dict = {'Biovolume_mm3_per_l':'Biovolume',
                   'Cyano_biovol_mm3_per_l':'cyano',
                   'chl-a':'chla'}
daily_df.rename(var_rename_dict, axis=1, inplace=True)

cols_to_keep = ['TP','chla','colour','cyano']
# Drop all other columns (keeps the existing column order)
daily_df = daily_df.drop(columns=[col for col in daily_df.columns if col not in cols_to_keep])

# display(daily_df.head())

# ------------------------------------------------------------------------------
# Aggregate to seasonal frequency (maximum of cyano, mean of everything else)
previous_summer_waterquality_df = bn.daily_to_summer_season(daily_df)
previous_summer_waterquality_df

year,colour,TP,chla,cyano
2017,43.333333,19.142857,13.385714,0.232
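
The source of `bn.daily_to_summer_season` is not shown in this notebook. The sketch below is only a guess at the kind of aggregation it performs, based on the comment above (maximum of cyano, mean of everything else) and the May-October growing season. The function name `summer_season_summary` and the sample values are hypothetical.

```python
import pandas as pd

def summer_season_summary(daily_df, max_cols=('cyano',)):
    """Aggregate a daily dataframe to one row per year for May-October.

    Columns listed in `max_cols` are summarised by their maximum, all other
    columns by their mean. Missing values are skipped.
    """
    summer = daily_df[daily_df.index.month.isin([5, 6, 7, 8, 9, 10])]
    agg = {col: ('max' if col in max_cols else 'mean') for col in summer.columns}
    seasonal = summer.groupby(summer.index.year).agg(agg)
    seasonal.index.name = 'year'
    return seasonal

# Hypothetical sample: the April value falls outside the season and is ignored
dates = pd.to_datetime(['2017-04-15', '2017-06-01', '2017-08-01'])
daily = pd.DataFrame({'TP': [100.0, 10.0, 20.0],
                      'cyano': [9.0, 0.1, 0.3]}, index=dates)
seasonal = summer_season_summary(daily)
```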


## Produce the forecast

In [3]:
# 1) For lake TP, cyano and colour
# Use the pre-fitted BN, excluding meteorological nodes when setting the 'evidence'

# Fitted bnlearn object
rfile_fpath = "../Data/RData/Vansjo_fitted_GaussianBN_era5_1981-2019.rds"

# Make predictions for target season
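# Select scalar values explicitly (calling float() on a one-element Series is deprecated in recent pandas)
forecast_df = bn.bayes_net_predict_operational(rfile_fpath,
                          float(target_yr),
                          float(previous_summer_waterquality_df.loc[target_yr-1, 'chla']),
                          float(previous_summer_waterquality_df.loc[target_yr-1, 'colour']),
                          float(previous_summer_waterquality_df.loc[target_yr-1, 'TP'])
                         )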

# Re-order cols
forecast_df = forecast_df[['year', 'node', 'threshold','prob_below_threshold', 
     'prob_above_threshold', 'expected_value', 'WFD_class']]
forecast_df

# ----------------------------------------------------------------------------------
# 2) For chl-a
# Estimate using naive seasonal forecast (i.e. lake observations from the previous summer (May-Oct average))

chla_forecast = previous_summer_waterquality_df.loc[target_yr-1, 'chla']

# Corresponding expected WFD class (just split into <20: Moderate or better, or >20: Poor or worse)
chla_class = bn.discretize([20.0], chla_forecast)

# ----------------------------------------------------------------------------------
# Add predictions to the df containing all forecasts
forecast_df.loc[len(forecast_df)+1] = [target_yr,'chla', 20.0, np.nan, np.nan, chla_forecast, chla_class]

forecast_df = forecast_df.set_index('node')
forecast_df

node,year,threshold,prob_below_threshold,prob_above_threshold,expected_value,WFD_class
colour,2018,48.0,0.72,0.28,42.5,0
cyano,2018,1.0,0.78,0.22,0.442,0
TP,2018,29.5,0.96,0.04,22.8,0
chla,2018,20.0,,,13.385714,0
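
The chl-a class in the last row comes from `bn.discretize`, whose source is not shown here. A minimal sketch of an equivalent threshold classification, assuming it behaves like `numpy.digitize`, is given below. The function name `classify` is hypothetical, and how a value exactly equal to the boundary is treated is an assumption.

```python
import numpy as np

def classify(value, boundaries):
    """Index of the class that `value` falls into.

    With a single boundary of 20.0, class 0 is below the boundary
    (Moderate or better) and class 1 is at or above it (Poor or worse).
    """
    return int(np.digitize(value, boundaries))

chla_class_example = classify(13.385714, [20.0])
```

With more than one boundary the same call returns classes 0, 1, 2 and so on, provided the boundaries are sorted in increasing order.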


## Add skill information

Read in pre-calculated skill information for the historic period (1981-2019), to accompany the forecast.

In [4]:
# Historic goodness-of-fit stats for variables predicted using bayesian network (derived from cross validation)

cv_stats_fpath = r'../Data/CrossValidation/Stats/CV_results_predictableNodes_era5-vs-nomet.csv'
cv_stats = pd.read_csv(cv_stats_fpath)

# Drop stats for chla (not relevant as BN not used), and for era5-driven met (also not used)
cv_stats = cv_stats.loc[cv_stats['Variable'] != 'chla']
cv_stats = cv_stats.loc[cv_stats['met_included'] == 'nomet']
cv_stats = cv_stats.drop('met_included', axis=1).set_index('Variable')
display(cv_stats)

Variable,mean_CC,mean_rmse,mean_class_error,mean_mcc,mean_ROC_AUC
TP,0.577181,3.915841,0.325641,0.342134,0.669444
colour,0.818836,9.360379,0.230769,0.471405,0.730769
cyano,0.674139,0.950204,0.125,0.77762,0.884615


In [5]:
# Historic gof stats from naive forecaster
naive_stats_fpath = r'../Hindcast_stats_plots/GoF_sim_vs_obs_1981-2019.csv'
naive_stats = pd.read_csv(naive_stats_fpath)

# Just pick out the naive forecast stats for chla
is_chla_naive = (naive_stats['var'] == 'chla') & (naive_stats['model'] == 'sim_naive')
naive_stats_chla = naive_stats.loc[is_chla_naive].set_index('var')
naive_stats_chla

var,model,pearsons_cc,spearman_cc,mae,rmse,bias,mape,mathews_cc,roc_auc_score,classification_error
chla,sim_naive,0.648,0.631,3.614,4.6,0.058,5.054,0.706,0.853,0.108


In [6]:
# Rename and join the two dfs, dropping some skill info

cv_stats = cv_stats.rename({'mean_CC':'Pearsons r',
                            'mean_rmse':'RMSE',
                            'mean_class_error': 'Classification error',
                            'mean_mcc': 'Matthews correlation coefficient',
                            'mean_ROC_AUC': 'Area under ROC curve'}, axis=1)

naive_stats_chla = naive_stats_chla[['pearsons_cc','rmse','mathews_cc','roc_auc_score','classification_error']]
naive_stats_chla = naive_stats_chla.rename({'pearsons_cc':'Pearsons r',
                                           'rmse':'RMSE',
                                           'mathews_cc':'Matthews correlation coefficient',
                                           'roc_auc_score':'Area under ROC curve',
                                           'classification_error':'Classification error'}, axis=1)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
cv_stats = pd.concat([cv_stats, naive_stats_chla], sort=False)

cv_stats

,Pearsons r,RMSE,Classification error,Matthews correlation coefficient,Area under ROC curve
TP,0.577181,3.915841,0.325641,0.342134,0.669444
colour,0.818836,9.360379,0.230769,0.471405,0.730769
cyano,0.674139,0.950204,0.125,0.77762,0.884615
chla,0.648,4.6,0.108,0.706,0.853
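
The statistics in this table were calculated in other notebooks and are only read in here. For orientation, the sketch below shows one common way of computing four of them from paired observed and simulated seasonal values. The function name `skill_summary`, the threshold and the sample numbers are hypothetical, and the original calculations may differ in detail (for example, the cross-validation values are means over folds).

```python
import numpy as np

def skill_summary(obs, sim, threshold):
    """Continuous and classification skill scores for simulated vs observed values."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)

    # Two classes: below the threshold, or at/above it
    obs_cls = obs >= threshold
    sim_cls = sim >= threshold
    tp = np.sum(obs_cls & sim_cls)
    tn = np.sum(~obs_cls & ~sim_cls)
    fp = np.sum(~obs_cls & sim_cls)
    fn = np.sum(obs_cls & ~sim_cls)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = float((tp * tn - fp * fn) / denom) if denom > 0 else float('nan')

    return {
        'Pearsons r': float(np.corrcoef(obs, sim)[0, 1]),
        'RMSE': float(np.sqrt(np.mean((sim - obs) ** 2))),
        'Classification error': float(np.mean(obs_cls != sim_cls)),
        'Matthews correlation coefficient': mcc,
    }

# Hypothetical values, for illustration only
stats = skill_summary(obs=[10.0, 25.0, 15.0, 30.0],
                      sim=[12.0, 22.0, 21.0, 28.0],
                      threshold=20.0)
```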
