# Try committee model

We found that the following three models generally perform without significant differences:
- D-MPNN/CGR
- XGB/FP
- LogReg/FP

If we use these as a committee (i.e. average the predicted probabilities of all three models), do we get better predictions?
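As a minimal sketch of what "averaging probabilities" means here, the cell below builds a committee from three sets of multi-label probabilities and scores it with macro average precision. The data and the `noisy_probs` helper are hypothetical toy stand-ins, not the SynFerm predictions used further down.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples with 3 binary targets each
y_true = rng.integers(0, 2, size=(200, 3))


def noisy_probs(y, noise):
    """Hypothetical stand-in for a model: the truth plus Gaussian noise, clipped to [0, 1]."""
    return np.clip(y + rng.normal(0.0, noise, size=y.shape), 0.0, 1.0)


p_a = noisy_probs(y_true, 0.6)
p_b = noisy_probs(y_true, 0.7)
p_c = noisy_probs(y_true, 0.8)

# Committee prediction: unweighted mean of the three probability arrays
committee = np.mean(np.stack([p_a, p_b, p_c]), axis=0)

for name, probs in [("A", p_a), ("B", p_b), ("C", p_c), ("committee", committee)]:
    print(name, average_precision_score(y_true, probs, average="macro"))
```

The unweighted mean is the simplest choice; weighting the members by validation performance would be an alternative that is not explored in this notebook.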

In [1]:
import pathlib
import sys
sys.path.append(str(pathlib.Path("__file__").absolute().parents[1]))

from sklearn.metrics import average_precision_score
from sklearn.preprocessing import LabelEncoder
from scipy.stats import wilcoxon
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wandb

from src.util.definitions import PRED_DIR, DATA_ROOT
from utils import get_runs_as_list

In [2]:
df_true = pd.read_csv(DATA_ROOT / "synferm_dataset_2023-09-05_40018records.csv")

In [3]:
summary_list, config_list, tag_list, name_list = get_runs_as_list(filters={"jobType": "hparam_best"})
df_all = pd.json_normalize(config_list).merge(pd.json_normalize(summary_list), left_index=True, right_index=True)
df_all["tags"] = tag_list
df_all["run_id"] = name_list
df_all["run_group"] = [s.rsplit("_", maxsplit=1)[0] for s in name_list]
df_all["Model+Features"] = df_all["name"] + "/" + df_all["decoder.global_features"].str.join("+").str.replace("None", "CGR")

In [4]:
# check available experiments by split
for tag, row in df_all.groupby("tags")[["experiment_id"]].agg(set).iterrows():
    print(tag, "-->", row["experiment_id"])

('0D',) --> {'JG1160', 'JG1116', 'JG1117', 'JG1106', 'JG1131', 'JG1100', 'JG1128', 'JG1135', 'JG1115', 'JG1109', 'JG1185'}
('0D_1.25',) --> {'JG1153', 'JG1141', 'JG1159', 'JG1147'}
('0D_10',) --> {'JG1144', 'JG1138', 'JG1150', 'JG1156'}
('0D_2.5',) --> {'JG1146', 'JG1140', 'JG1158', 'JG1152'}
('0D_20',) --> {'JG1149', 'JG1137', 'JG1155', 'JG1143'}
('0D_40',) --> {'JG1136', 'JG1142', 'JG1154', 'JG1148'}
('0D_5',) --> {'JG1157', 'JG1139', 'JG1151', 'JG1145'}
('1D',) --> {'JG1125', 'JG1118', 'JG1101', 'JG1123', 'JG1104', 'JG1129', 'JG1132', 'JG1121', 'JG1186', 'JG1126'}
('1D_10',) --> {'JG1164', 'JG1182', 'JG1170', 'JG1176'}
('1D_2.5',) --> {'JG1184', 'JG1178', 'JG1166', 'JG1172'}
('1D_20',) --> {'JG1175', 'JG1169', 'JG1163', 'JG1181'}
('1D_40',) --> {'JG1162', 'JG1174', 'JG1180', 'JG1168'}
('1D_5',) --> {'JG1177', 'JG1165', 'JG1171', 'JG1183'}
('1D_80',) --> {'JG1167', 'JG1173', 'JG1179', 'JG1161'}
('2D',) --> {'JG1124', 'JG1112', 'JG1130', 'JG1133', 'JG1105', 'JG1119', 'JG1102', 'JG1122

## 1D split

In [108]:
exp_ids = ["JG1101", "JG1129", "JG1186"]  # D-MPNN/CGR, XGB/FP, LogReg/FP for 1D
df_exps = df_all.loc[df_all.experiment_id.isin(exp_ids)]

In [109]:
val_avg_precision = []
for fold in range(9):
    val_preds = []
    test_preds = []
    # obtain all experiments for that fold
    for _, exp in df_exps.loc[df_exps["run_id"].str[-1] == str(fold)].iterrows():
        # first we check if predicted values are available
        val_pred_path = PRED_DIR / exp.run_id / "val_preds_last.csv"
        test_pred_path = PRED_DIR / exp.run_id / "test_preds_last.csv"
    
        for name, file, preds in zip(["val", "test"], [val_pred_path, test_pred_path], [val_preds, test_preds]):
            if file.is_file():
                # import predictions
                df = pd.read_csv(file, index_col="idx")
                preds.append(df)
            else:
                print(f"{name} predictions not found for {exp.run_id} ({exp.experiment_id})")
    # merge all the predictions and the ground truth
    val = pd.concat(val_preds, keys=["modelA", "modelB", "modelC",], axis=1).merge(pd.concat([df_true], axis=1, keys=["true"]), how="left", left_index=True, right_index=True)
    test = pd.concat(test_preds, keys=["modelA", "modelB", "modelC",], axis=1).merge(pd.concat([df_true], axis=1, keys=["true"]), how="left", left_index=True, right_index=True)
    
    
    # add the committee predictions
    for i in range(3):
        val["committee", f"pred_{i}"] = (val["modelA", f"pred_{i}"] + val["modelB", f"pred_{i}"] + val["modelC", f"pred_{i}"]) / 3  # simply take the mean
        test["committee", f"pred_{i}"] = (test["modelA", f"pred_{i}"] + test["modelB", f"pred_{i}"] + test["modelC", f"pred_{i}"]) / 3  # simply take the mean
    
    
    # calculate metrics for the committee model
    
    # extract predictions
    y_prob = val["committee"].to_numpy()
    y_hat = (y_prob > 0.5).astype(np.int_)
    y_true = val["true"][["binary_A", "binary_B", "binary_C"]].to_numpy()
    
    # calculate metric
    val_avg_precision.append(average_precision_score(y_true, y_prob, average="macro"))


In [110]:
# check mean and std for committee model on val set
print("Committee model:")
print(np.mean(val_avg_precision), np.std(val_avg_precision), sep="±")
print()

# check mean and std for constituent models
print("Constituent models:")
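# pass "mean"/"std" as strings; pandas' "std" is the sample std (ddof=1), unlike np.std above (ddof=0)
print(df_exps.groupby(["Model+Features"])["val/avgPrecision_macro"].aggregate(["mean", "std"]))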

Committee model:
0.8919505789953232±0.03353670577237679

Constituent models:
                           mean       std
Model+Features                           
D-MPNN/CGR             0.890784  0.030425
LogisticRegression/FP  0.871912  0.043093
XGB/FP                 0.883578  0.042024


In [111]:
# is the committee model significantly different from the individual models?
for model in df_exps["Model+Features"].drop_duplicates():
    metrics_model = df_exps.loc[df_exps["Model+Features"] == model].sort_values(by="run_id")["val/avgPrecision_macro"].to_numpy()
    print(model, ":", wilcoxon(val_avg_precision, metrics_model))

XGB/FP : WilcoxonResult(statistic=11.0, pvalue=0.203125)
LogisticRegression/FP : WilcoxonResult(statistic=0.0, pvalue=0.00390625)
D-MPNN/CGR : WilcoxonResult(statistic=22.0, pvalue=1.0)


### Conclusion 1D
The committee model has the highest mean score with the second-lowest std.
It is significantly better than LogReg/FP, but not significantly different from the other two models.
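A note on reading the p-values: with only 9 folds, the two-sided Wilcoxon signed-rank test has a smallest attainable p-value, reached when one model wins on every fold. The sketch below uses hypothetical score differences (not our results) to show that this floor equals the 0.00390625 reported for LogReg/FP. It assumes SciPy uses its exact method, which it selects automatically for small samples without ties or zeros.

```python
import numpy as np
from scipy.stats import wilcoxon

n_folds = 9

# Hypothetical per-fold score differences (committee minus single model):
# all positive, with distinct magnitudes and no zeros
diffs = 0.01 * np.arange(1, n_folds + 1)

result = wilcoxon(diffs)

# Only one of the 2**9 equally likely sign patterns gives a statistic of 0;
# the two-sided test doubles that probability
smallest_p = 2 / 2**n_folds

print(result.statistic, result.pvalue, smallest_p)
```

So a p-value of 0.00390625 with statistic 0 means the committee scored higher on all 9 folds. It cannot get any smaller at this number of folds.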

## 2D split

In [112]:
exp_ids = ["JG1102", "JG1130", "JG1112"]  # D-MPNN/CGR, XGB/FP, LogReg/FP for 2D
df_exps = df_all.loc[df_all.experiment_id.isin(exp_ids)]

In [113]:
val_avg_precision = []
for fold in range(9):
    val_preds = []
    test_preds = []
    # obtain all experiments for that fold
    for _, exp in df_exps.loc[df_exps["run_id"].str[-1] == str(fold)].iterrows():
        # first we check if predicted values are available
        val_pred_path = PRED_DIR / exp.run_id / "val_preds_last.csv"
        test_pred_path = PRED_DIR / exp.run_id / "test_preds_last.csv"
    
        for name, file, preds in zip(["val", "test"], [val_pred_path, test_pred_path], [val_preds, test_preds]):
            if file.is_file():
                # import predictions
                df = pd.read_csv(file, index_col="idx")
                preds.append(df)
            else:
                print(f"{name} predictions not found for {exp.run_id} ({exp.experiment_id})")
    # merge all the predictions and the ground truth
    val = pd.concat(val_preds, keys=["modelA", "modelB", "modelC",], axis=1).merge(pd.concat([df_true], axis=1, keys=["true"]), how="left", left_index=True, right_index=True)
    test = pd.concat(test_preds, keys=["modelA", "modelB", "modelC",], axis=1).merge(pd.concat([df_true], axis=1, keys=["true"]), how="left", left_index=True, right_index=True)
    
    
    # add the committee predictions
    for i in range(3):
        val["committee", f"pred_{i}"] = (val["modelA", f"pred_{i}"] + val["modelB", f"pred_{i}"] + val["modelC", f"pred_{i}"]) / 3  # simply take the mean
        test["committee", f"pred_{i}"] = (test["modelA", f"pred_{i}"] + test["modelB", f"pred_{i}"] + test["modelC", f"pred_{i}"]) / 3  # simply take the mean
    
    
    # calculate metrics for the committee model
    
    # extract predictions
    y_prob = val["committee"].to_numpy()
    y_hat = (y_prob > 0.5).astype(np.int_)
    y_true = val["true"][["binary_A", "binary_B", "binary_C"]].to_numpy()
    
    # calculate metric
    val_avg_precision.append(average_precision_score(y_true, y_prob, average="macro"))


In [114]:
# check mean and std for committee model on val set
print("Committee model:")
print(np.mean(val_avg_precision), np.std(val_avg_precision), sep="±")
print()

# check mean and std for constituent models
print("Constituent models:")
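# pass "mean"/"std" as strings; pandas' "std" is the sample std (ddof=1), unlike np.std above (ddof=0)
print(df_exps.groupby(["Model+Features"])["val/avgPrecision_macro"].aggregate(["mean", "std"]))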

Committee model:
0.809958986512768±0.11051314722498096

Constituent models:
                           mean       std
Model+Features                           
D-MPNN/CGR             0.809626  0.124337
LogisticRegression/FP  0.774621  0.122331
XGB/FP                 0.790416  0.112848


In [115]:
# is the committee model significantly different from the individual models?
for model in df_exps["Model+Features"].drop_duplicates():
    metrics_model = df_exps.loc[df_exps["Model+Features"] == model].sort_values(by="run_id")["val/avgPrecision_macro"].to_numpy()
    print(model, ":", wilcoxon(val_avg_precision, metrics_model))

LogisticRegression/FP : WilcoxonResult(statistic=0.0, pvalue=0.00390625)
XGB/FP : WilcoxonResult(statistic=7.0, pvalue=0.07421875)
D-MPNN/CGR : WilcoxonResult(statistic=22.0, pvalue=1.0)


### Conclusion 2D
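The committee model has the highest mean score and, as printed, the lowest std.
The std values are not directly comparable, though: the committee std comes from `np.std` (ddof=0), while the constituent stds come from the pandas aggregation, which most likely used the sample std (ddof=1). Rescaled to ddof=1, the committee std is roughly 0.117, slightly above XGB/FP (0.113).
It is significantly better than the LogReg/FP model, but not significantly different from the other two models.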
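When comparing the std values, keep in mind that `np.std` and pandas use different defaults. The sketch below uses hypothetical scores to show the difference for 9 folds. That the groupby aggregation above resolved `np.std` to the pandas sample std is an assumption, based on how pandas versions of that time handled NumPy callables.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold scores for one model (9 folds)
scores = np.array([0.70, 0.75, 0.80, 0.85, 0.90, 0.78, 0.82, 0.88, 0.73])
n = len(scores)

population_std = np.std(scores)       # NumPy default: ddof=0
sample_std = pd.Series(scores).std()  # pandas default: ddof=1

# The two differ by a constant factor sqrt(n / (n - 1)), about 1.06 for n = 9
factor = np.sqrt(n / (n - 1))

print(population_std, sample_std, factor)
```

Passing `ddof=1` to `np.std` puts both numbers on the same footing.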

## 3D split

In [117]:
exp_ids = ["JG1103", "JG1111", "JG1108"]  # D-MPNN/CGR, XGB/FP, LogReg/FP for 3D
df_exps = df_all.loc[df_all.experiment_id.isin(exp_ids)]

In [118]:
val_avg_precision = []
for fold in range(9):
    val_preds = []
    test_preds = []
    # obtain all experiments for that fold
    for _, exp in df_exps.loc[df_exps["run_id"].str[-1] == str(fold)].iterrows():
        # first we check if predicted values are available
        val_pred_path = PRED_DIR / exp.run_id / "val_preds_last.csv"
        test_pred_path = PRED_DIR / exp.run_id / "test_preds_last.csv"
    
        for name, file, preds in zip(["val", "test"], [val_pred_path, test_pred_path], [val_preds, test_preds]):
            if file.is_file():
                # import predictions
                df = pd.read_csv(file, index_col="idx")
                preds.append(df)
            else:
                print(f"{name} predictions not found for {exp.run_id} ({exp.experiment_id})")
    # merge all the predictions and the ground truth
    val = pd.concat(val_preds, keys=["modelA", "modelB", "modelC",], axis=1).merge(pd.concat([df_true], axis=1, keys=["true"]), how="left", left_index=True, right_index=True)
    test = pd.concat(test_preds, keys=["modelA", "modelB", "modelC",], axis=1).merge(pd.concat([df_true], axis=1, keys=["true"]), how="left", left_index=True, right_index=True)
    
    
    # add the committee predictions
    for i in range(3):
        val["committee", f"pred_{i}"] = (val["modelA", f"pred_{i}"] + val["modelB", f"pred_{i}"] + val["modelC", f"pred_{i}"]) / 3  # simply take the mean
        test["committee", f"pred_{i}"] = (test["modelA", f"pred_{i}"] + test["modelB", f"pred_{i}"] + test["modelC", f"pred_{i}"]) / 3  # simply take the mean
    
    
    # calculate metrics for the committee model
    
    # extract predictions
    y_prob = val["committee"].to_numpy()
    y_hat = (y_prob > 0.5).astype(np.int_)
    y_true = val["true"][["binary_A", "binary_B", "binary_C"]].to_numpy()
    
    # calculate metric
    val_avg_precision.append(average_precision_score(y_true, y_prob, average="macro"))


In [119]:
# check mean and std for committee model on val set
print("Committee model:")
print(np.mean(val_avg_precision), np.std(val_avg_precision), sep="±")
print()

# check mean and std for constituent models
print("Constituent models:")
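# pass "mean"/"std" as strings; pandas' "std" is the sample std (ddof=1), unlike np.std above (ddof=0)
print(df_exps.groupby(["Model+Features"])["val/avgPrecision_macro"].aggregate(["mean", "std"]))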

Committee model:
0.8064186998158501±0.04332848404326717

Constituent models:
                           mean       std
Model+Features                           
D-MPNN/CGR             0.769681  0.056595
LogisticRegression/FP  0.794498  0.042385
XGB/FP                 0.800110  0.041297


In [120]:
# is the committee model significantly different from the individual models?
for model in df_exps["Model+Features"].drop_duplicates():
    metrics_model = df_exps.loc[df_exps["Model+Features"] == model].sort_values(by="run_id")["val/avgPrecision_macro"].to_numpy()
    print(model, ":", wilcoxon(val_avg_precision, metrics_model))

LogisticRegression/FP : WilcoxonResult(statistic=5.0, pvalue=0.0390625)
XGB/FP : WilcoxonResult(statistic=11.0, pvalue=0.203125)
D-MPNN/CGR : WilcoxonResult(statistic=1.0, pvalue=0.0078125)


### Conclusion 3D
The committee model has the highest mean score but the second-highest standard deviation.
It is significantly better than D-MPNN/CGR and LogReg/FP. It is not significantly different from XGB/FP.

## Conclusion
Several things can be seen here (some corroborate findings from other experiments):

- The committee model performs as well as or better than each of its constituents on every split.
- The committee model is significantly better than SOME of its constituents (LogReg/FP on all three splits, D-MPNN/CGR on the 3D split).
- XGB is the most reliable model across all different situations.
- The D-MPNN model is not suitable for the 3D problem due to lack of data.
- In 1D and 2D situations, where lots of data is available, XGB and D-MPNN outperform the simpler Logistic Regression.
  It is not fully clear how much of this is due to XGB and D-MPNN profiting from the larger number of samples vs. exploiting combinatorial information.

### So should we use a committee model?
- Performance-wise, the answer is yes: the committee has the highest mean score on every split.
- On the flip side, the committee model is more expensive at inference time and more complex, adding possible points of failure.
- The committee model never significantly outperforms XGB/FP alone, so using only XGB/FP seems to be the logical compromise.