In [1]:
import pandas as pd, numpy as np, os
from scipy.stats import boxcox
import bayes_net_utils as bn

# Intro

This notebook takes the observed and predicted lake chemistry and ecology values output by notebook 02_BN_development_1Season (from the continuous network cross-validation section), discretizes them using WFD-related class boundaries, and calculates the classification error. This error can then be compared with the classification error obtained by cross-validating discrete networks that use the same classes, or reported as supporting error information alongside predicted class probabilities.
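
As a minimal sketch of the discretize-then-score step described above: the helpers `bn.discretize` and `bn.classification_error` live in `bayes_net_utils` and are not shown in this notebook, so the two functions below are hypothetical stand-ins. They assume that a value on or above a boundary falls in the upper class, and that classification error is the fraction of cases where the predicted class differs from the observed class. The values are illustrative only, not lake data.

```python
from bisect import bisect_right

def discretize_sketch(boundaries, value):
    """Return the 0-based class index of value, given sorted class boundaries.
    Assumption: a value exactly on a boundary goes into the upper class."""
    return bisect_right(boundaries, value)

def classification_error_sketch(obs_classes, pred_classes):
    """Fraction of cases where the predicted class differs from the observed class."""
    pairs = list(zip(obs_classes, pred_classes))
    n_wrong = sum(1 for o, p in pairs if o != p)
    return n_wrong / len(pairs)

tp_boundaries = [29.5]            # single TP boundary used in this notebook
obs = [12.0, 31.0, 45.5, 28.0]    # illustrative values only
pred = [15.2, 27.9, 40.1, 30.3]

obs_cls = [discretize_sketch(tp_boundaries, v) for v in obs]
pred_cls = [discretize_sketch(tp_boundaries, v) for v in pred]
error = classification_error_sketch(obs_cls, pred_cls)
print(obs_cls, pred_cls, error)
```

Two of the four illustrative predictions land on the wrong side of the boundary, so the error here is 0.5 even though every prediction is within a few units of the observation. This is why classification error is reported alongside the correlation coefficient and mean square error rather than instead of them.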

# Set up for processing

In [2]:
# USER INPUT

# Boundaries dictionary, copied from notebook B_seasonal_data_matrix_1Season
bound_dict = {
             'TP': [29.5], # No data below 20, so drop this class boundary. 29.5 is middle of 'Mod' class
             'chla': [20.0],  # WFD boundaries: [10.5, 20.0], but only 6 data points are below 10.5, so merge G and M classes.
                              # For predicting cyano, 17.4 would be better.
             'colour': [48.0], # [51.2] if comparing CV scores of continuous and discrete. [48.0] if calculating CV score of final network (66th percentile)
             'cyano': [1.0], # M-P boundary is 2.0, but there were only 2 values in this class. Plenty above 2, though.
             }

# Alter the boundaries in the boundaries dict for cyano, to take account of the box-cox transformation applied to the continuous
# data:  y* = (y^L - 1)/L, where we used lambda=0.1 when transforming original cyano data
bound_dict['cyano'] = boxcox(bound_dict['cyano'], lmbda=0.1)

met_source = 'metno' #'metno' or 'era5'

var_li = ['TP','chla','cyano','colour'] # What do you want to produce stats for? Need to have corresponding files in 'Data/CrossValidation/%s' %met_source folder
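
To make the boundary adjustment concrete, the sketch below applies the Box-Cox formula from the comment above, y* = (y^L - 1)/L, using only the standard library rather than calling `scipy.stats.boxcox`. The function names are hypothetical. With lambda = 0.1, the cyano boundary of 1.0 maps to 0 on the transformed scale, so transformed observations and predictions are split at zero.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform for y > 0; natural log in the limit lam == 0."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def inverse_box_cox(y_star, lam):
    """Back-transform from the Box-Cox scale to the original scale."""
    if lam == 0:
        return math.exp(y_star)
    return (lam * y_star + 1.0) ** (1.0 / lam)

lam = 0.1  # lambda used for the cyano data in this notebook
boundary_transformed = box_cox(1.0, lam)     # boundary of 1.0 used here
mp_boundary_transformed = box_cox(2.0, lam)  # M-P boundary of 2.0, for comparison
print(boundary_transformed, round(mp_boundary_transformed, 3))
```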

In [3]:
def xval_postprocess(var, fpath):
    """
    Function to read in a csv of observed and predicted values from a continuous
    Bayesian belief network produced in BNLearn R notebook, and calculate correlation
    coefficients, and then classify according to WFD and work out classification error
    for comparison to classification error calculated directly in BNLearn using discrete network.
    
    Inputs:
        var: string, one of 'TP','chla','cyano','colour' (must be a key in bound_dict)
        fpath: string giving location of csv to be read in. csv should have columns:
                'obs_1','pred_1','obs_2','pred_2',... where _1, _2, etc. is the cross
                validation run number.
    Returns:
        pandas Series of statistics averaged over the cross validation runs, with index
        'mean_CC': mean Pearson correlation coefficient between observed and predicted
        'mean_mse': mean of the per-run mean square errors
        'mean_class_error': mean of the per-run classification errors
        
    """

    # ---------------------------------------------------------------------------------
    # Read in data
    df = pd.read_csv(fpath, index_col=0)

    # ---------------------------------------------------------------------------------
    # Split into separate dataframes for each cross validation run
    cont_dict = {}  # Key: run number, returns df with obs and pred
    for i, col_name in enumerate(df.columns):
        if i % 2 == 0:  # If even, i.e. only do this for half the cols (the 'obs_N' cols)
            run_no = int(col_name.split("_", 1)[1])
            # Columns for run N sit at positions 2N-2 and 2N-1 (this also covers run 1).
            # Copy so that renaming the columns does not act on a view of df.
            temp_df = df.iloc[:, [2 * run_no - 2, 2 * run_no - 1]].copy()
            temp_df.columns = ["obs", "pred"]
            cont_dict[run_no] = temp_df

    # ---------------------------------------------------------------------------------
    # Calculate correlation coefficients, convert to WFD classes and work out classification error

    cc_dict = {}  # key: run_no, returns correlation coeff
    mse_dict = {}  # key: run_no, returns mse (pred - obs)
    classified_dict = {} # Key: run number. Returns df with cols 'obs' and 'pred', discrete data
    class_error_dict = {}

    for run_no in cont_dict.keys():

        cont_df = cont_dict[run_no]  # df with continuous data

        # Correlation coefficients
        cc_dict[run_no] = cont_df["obs"].corr(cont_df["pred"], method="pearson")

        # Mean square error
        mse = np.mean(((cont_df["pred"] - cont_df["obs"]) ** 2))
        mse_dict[run_no] = mse

        # Classify obs and pred into WFD (or related) class boundaries
        disc_df = pd.DataFrame(index=cont_df.index, columns=cont_df.columns)  # New empty df to be populated
        for col in cont_df.columns:
            disc_df[col] = cont_df[col].apply(lambda x: bn.discretize(bound_dict[var], x))
        classified_dict[run_no] = disc_df

        # Calculate classification error (proportion of cases where the predicted class differs from the observed class)
        error = bn.classification_error(disc_df['obs'], disc_df['pred'])
        class_error_dict[run_no] = error

    # ---------------------------------------------------------------------------------
    # Aggregate results over runs

    corr_coeffs = pd.Series(cc_dict)  # These match those calculated by bnlearn, good!
    mses = pd.Series(mse_dict)
    errors = pd.Series(class_error_dict)

    # ---------------------------------------------------------------------------------
    # Take a look at the output
#     print("Correlation coefficients, %s:" % var)
#     print(corr_coeffs)
#     print("\nMean correlation coefficient, %s: %s" % (var, corr_coeffs.mean()))

#     print("\nmse, %s:" % var)
#     print(mses)
#     print("Mean mse, %s: %s" % (var, mses.mean()))

#     print("\nClassification errors, %s:" % var)
#     print(errors)
#     print("Mean classification error for %s: %s" % (var, errors.mean()))

    # Return dictionaries
#     results_dict = {
#                     "corr_coeffs": corr_coeffs,
#                     "classification_errors": errors,
#                     "cont_data_dict": cont_dict,
#                     "classified_data_dict": classified_dict,
#                     }
#     return results_dict
    results_series = pd.Series(data = np.array([corr_coeffs.mean(), mses.mean(), errors.mean()]),
                               index = ['mean_CC','mean_mse','mean_class_error'])
    
    return results_series
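
The column-splitting step in the function above relies on the `obs_N` and `pred_N` columns sitting next to each other in run order. As a hedged alternative sketch, the columns could instead be paired by name, which does not depend on column position. The helper name and the sample column names below are hypothetical; only the `obs_N`/`pred_N` naming convention comes from the docstring.

```python
import re

def pair_run_columns(column_names):
    """Map run number -> (obs column, pred column), matched by name."""
    runs = {}
    for name in column_names:
        match = re.fullmatch(r"(obs|pred)_(\d+)", name)
        if match is None:
            continue  # ignore anything that is not an obs_N / pred_N column
        kind, run_no = match.group(1), int(match.group(2))
        runs.setdefault(run_no, {})[kind] = name
    # Keep only runs that have both an observed and a predicted column
    return {run_no: (cols["obs"], cols["pred"])
            for run_no, cols in sorted(runs.items())
            if "obs" in cols and "pred" in cols}

columns = ["pred_2", "obs_1", "pred_1", "obs_2", "obs_3"]  # deliberately unordered
pairs = pair_run_columns(columns)
print(pairs)
```

Each pair could then be selected with `df[[obs_col, pred_col]]` and renamed to `obs` and `pred`, as in the function above. Run 3 is dropped in this example because it has no matching prediction column.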

# Calculate stats

In [4]:
# Get list of files to loop over. Produced in 02_BN_development_1Season

fpaths = os.listdir("../Data/CrossValidation/CV_predictions")

fpath_dict = {}
for file in fpaths:
    var = file.split('_')[0]
    if var in var_li:
        if var in fpath_dict.keys():
            fpath_dict[var].append(file)
        else:
            fpath_dict[var] = [file]

fpath_dict

{'cyano': ['cyano_continuous_fromAllNodes_metno.csv',
  'cyano_continuous_fromPredictableNodes_metno.csv',
  'cyano_continuous_fromAllNodes_era5.csv',
  'cyano_continuous_fromPredictableNodes_era5.csv'],
 'TP': ['TP_continuous_fromPredictableNodes_metno.csv',
  'TP_continuous_fromAllNodes_era5.csv',
  'TP_continuous_fromPredictableNodes_era5.csv',
  'TP_continuous_fromAllNodes_metno.csv'],
 'chla': ['chla_cont_fromAllNodes_noWind_era5.csv',
  'chla_continuous_fromPredictableNodes_era5.csv',
  'chla_continuous_fromPredictableNodes_metno.csv',
  'chla_continuous_fromAllNodes_era5.csv',
  'chla_cont_fromPredictableNodes_noWind_metno.csv',
  'chla_continuous_fromAllNodes_metno.csv',
  'chla_cont_fromPredictableNodes_noWind_era5.csv',
  'chla_cont_fromAllNodes_noWind_metno.csv'],
 'colour': ['colour_continuous_metno.csv', 'colour_continuous_era5.csv']}
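
The grouping loop above can also be written with `collections.defaultdict`, which removes the explicit key check. This is a sketch on a hard-coded list of file names (a subset of those in the output above, plus one hypothetical stray file) rather than on `os.listdir`, so it runs without the data folder.

```python
from collections import defaultdict

var_li = ['TP', 'chla', 'cyano', 'colour']
files = [
    'cyano_continuous_fromAllNodes_metno.csv',
    'TP_continuous_fromAllNodes_era5.csv',
    'chla_cont_fromAllNodes_noWind_era5.csv',
    'colour_continuous_metno.csv',
    'TP_continuous_fromPredictableNodes_metno.csv',
    'README.txt',  # hypothetical stray file, skipped because its prefix is not in var_li
]

grouped = defaultdict(list)
for fname in sorted(files):      # sort so the order does not depend on the file system
    var = fname.split('_')[0]    # variable name is the text before the first underscore
    if var in var_li:
        grouped[var].append(fname)

fpath_dict = dict(grouped)
print(fpath_dict)
```

Sorting the file names first also makes the row order of the results table reproducible, since `os.listdir` returns entries in arbitrary order.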

In [5]:
series_li = []
for var in var_li:
    for file in fpath_dict[var]:
        stats_series = xval_postprocess(var, os.path.join('../Data/CrossValidation/CV_predictions', file))
        stats_series.name = file.split('.')[0]
        series_li.append(stats_series)

df = pd.concat(series_li, axis=1, keys=[s.name for s in series_li]).transpose()

df['Variable'] = [i.split('_')[0] for i in list(df.index)]
df['met_data'] = [i.split('_')[-1] for i in list(df.index)]

# Write to csv
# df.to_csv('../Data/CrossValidation/results/CV_results.csv')

df

Unnamed: 0,mean_CC,mean_mse,mean_class_error,Variable,met_data
TP_continuous_fromPredictableNodes_metno,0.57598,15.681293,0.328947,TP,metno
TP_continuous_fromAllNodes_era5,0.723217,10.934003,0.315385,TP,era5
TP_continuous_fromPredictableNodes_era5,0.581586,15.179048,0.325641,TP,era5
TP_continuous_fromAllNodes_metno,0.751779,10.206142,0.265789,TP,metno
chla_cont_fromAllNodes_noWind_era5,0.697502,15.762621,0.2,chla,era5
chla_continuous_fromPredictableNodes_era5,0.570645,20.995248,0.274359,chla,era5
chla_continuous_fromPredictableNodes_metno,0.545362,22.763624,0.336842,chla,metno
chla_continuous_fromAllNodes_era5,0.707008,15.356126,0.194872,chla,era5
chla_cont_fromPredictableNodes_noWind_metno,0.537419,22.886193,0.315789,chla,metno
chla_continuous_fromAllNodes_metno,0.712344,15.572903,0.242105,chla,metno


# TP

TP: Comparison of classification errors between continuous and discrete networks:

Including met.no wind:
- Predictable/measurable nodes, discrete BN: 0.55
- Predictable/measurable nodes, continuous BN: 0.36
- All nodes, discrete BN: 0.59
- All nodes, continuous BN: 0.32

**Other structures**

Remove met.no wind:
- All stats are the same or slightly better without wind in the continuous network. Therefore **decide to remove wind-TP link**.
- **Comparison of discrete and continuous network classification error, without metno wind**, excluding 2019 data, becomes:

    - Predictable/measurable nodes, discrete: 0.51
    - Predictable/measurable nodes, continuous: 0.33
    - All nodes, discrete: 0.53
    - All nodes, continuous: 0.31
    
So take this as the final comparison of discrete vs continuous for TP, and as part of the justification for going continuous. The other justification is that the CPTs in the discrete network don't show the right behaviour (see markdown comments in the BN development notebook).
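
The no-wind TP figures above can be compared directly by computing how much lower the continuous classification error is than the discrete one. A small sketch using only the four numbers quoted in the list (the dictionary layout is just for illustration):

```python
# Classification errors quoted above for TP without met.no wind
tp_errors = {
    'Predictable/measurable nodes': {'discrete': 0.51, 'continuous': 0.33},
    'All nodes': {'discrete': 0.53, 'continuous': 0.31},
}

reductions = {}
for node_set, err in tp_errors.items():
    # Positive value: the continuous network misclassifies less often than the discrete one
    reductions[node_set] = round(err['discrete'] - err['continuous'], 2)
    print(f"{node_set}: continuous error is lower by {reductions[node_set]:.2f}")
```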

# Chl-a

In [6]:
df.loc[df['Variable']=='chla']

Unnamed: 0,mean_CC,mean_mse,mean_class_error,Variable,met_data
chla_cont_fromAllNodes_noWind_era5,0.697502,15.762621,0.2,chla,era5
chla_continuous_fromPredictableNodes_era5,0.570645,20.995248,0.274359,chla,era5
chla_continuous_fromPredictableNodes_metno,0.545362,22.763624,0.336842,chla,metno
chla_continuous_fromAllNodes_era5,0.707008,15.356126,0.194872,chla,era5
chla_cont_fromPredictableNodes_noWind_metno,0.537419,22.886193,0.315789,chla,metno
chla_continuous_fromAllNodes_metno,0.712344,15.572903,0.242105,chla,metno
chla_cont_fromPredictableNodes_noWind_era5,0.529494,22.539917,0.310256,chla,era5
chla_cont_fromAllNodes_noWind_metno,0.704083,15.859139,0.215789,chla,metno


**Comparison of classification error between continuous and discrete, met.no data (data to 2018):**

Whole network used to predict chl-a node:
- discrete BN: 0.08
- continuous BN: 0.21
- continuous, adding extra WFD class (G-M boundary at 10.0): 0.44

Only include nodes that will be updated when forecasting:
- discrete BN: 0.08
- continuous BN: 0.32
- continuous, adding extra WFD class (G-M boundary at 10.0): 0.49

Classification error is lower for the discrete network, but problems with the discrete chla CPTs outweigh this difference:
- When chla_prevSummer is high, changing wind speed from L to H makes no difference to the probabilities when TP is low, but when TP is high, higher wind speed makes high chla more likely. This isn't right, and is only due to a lack of data points. Therefore the wind-chla link would have had to be removed.
- Even after doing this, problems remain: when the previous summer's chl-a is low, chla is always predicted to be low, and TP has no effect. When the previous summer's chl-a is high, the TP effect is the wrong way around: there is a lower chance of high chla when TP is high than when TP is low. Again, this is just due to low data volume.
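
Because the points above attribute the odd CPT behaviour to sparse data, one way to check this is to count how many observations fall into each combination of parent states. The sketch below does this on made-up records: the node names follow the text, but the values are purely illustrative and are not the lake data, and the threshold of 3 observations is a hypothetical choice.

```python
from collections import Counter
from itertools import product

# Illustrative discretized records: (chla_prevSummer, TP, wind), each 'L' or 'H'
records = [
    ('L', 'L', 'L'), ('L', 'L', 'H'), ('L', 'L', 'L'), ('L', 'H', 'L'),
    ('H', 'H', 'H'), ('H', 'H', 'L'), ('H', 'H', 'H'), ('H', 'H', 'H'),
    ('H', 'L', 'L'), ('L', 'L', 'L'),
]

min_count = 3  # hypothetical threshold below which a CPT column is poorly constrained
counts = Counter(records)

# Enumerate all 2 x 2 x 2 parent-state combinations, including unobserved ones
all_combos = list(product('LH', repeat=3))
sparse = [combo for combo in all_combos if counts[combo] < min_count]

for combo in all_combos:
    flag = 'SPARSE' if combo in sparse else 'ok'
    print(combo, counts[combo], flag)
```

Combinations flagged as sparse are the ones whose conditional probabilities rest on very few observations, which is where counterintuitive CPT entries of the kind described above would be expected to appear.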

**Removing wind:**

- No change or slight deterioration, depending on whether you predict using just predictable nodes (no change) or the whole network (slight deterioration).

# Cyano

In [7]:
df.loc[df['Variable']=='cyano']

Unnamed: 0,mean_CC,mean_mse,mean_class_error,Variable,met_data
cyano_continuous_fromAllNodes_metno,0.786742,0.644327,0.186957,cyano,metno
cyano_continuous_fromPredictableNodes_metno,0.635034,0.999678,0.16087,cyano,metno
cyano_continuous_fromAllNodes_era5,0.786441,0.623401,0.1875,cyano,era5
cyano_continuous_fromPredictableNodes_era5,0.632786,0.971029,0.179167,cyano,era5


**Comparison of classification error, continuous vs discrete networks (met.no data, data to 2018):**

- Predictable/measurable nodes, discrete: 0.23
- Predictable/measurable nodes, continuous: 0.17
- All nodes, discrete: 0.13
- All nodes, continuous: 0.18

In this case, the continuous network has a lower classification error than the discrete network when using predictable/measurable nodes, but a slightly higher error when using all nodes (the first time this has been the case). However, the errors are still fairly low, and the continuous network has the added benefit of sensible relationships between nodes. By contrast, one of the CPTs for the cyano node in the discrete network has a problem: in the high summer colour class, the chance of high cyano decreases as chla increases from L to H. I think we would expect a positive relationship between chla and cyano regardless of colour, just with a lower chance of high cyano when colour is higher. So this is another low data volume artefact.

**To investigate in the future:**
- Cross-validation of cyano predictive ability with and without the colour-cyano link

# Proposed changes to structure based on cross validation results

(in this notebook, and the BN_Development_1Season one)

- Remove wind-TP link. Keep wind-chla link.
- Keep rain-colour link