# Compare

In this notebook, we compare the performance of several causal inference methods: VAR, VarLingam, PCMCI, Granger, Dynotears, and D2C. We load the causal links predicted by each method and evaluate how well they match the ground truth.

We score each method with accuracy, balanced accuracy, precision, recall, and F1-score, and collect the results in a single table for comparison.
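
As a rough illustration of what these metrics measure, the sketch below computes them from raw confusion counts in plain Python. The helper `binary_metrics` and the toy labels are hypothetical and not part of this notebook, which relies on `sklearn.metrics` instead. The example assumes binary labels where 1 means "causal", and it shows why accuracy alone can mislead when causal links are rare.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute the five metrics from confusion counts (1 = causal, 0 = not causal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios become NaN instead of raising ZeroDivisionError.
        return num / den if den else math.nan

    recall = ratio(tp, tp + fn)        # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": ratio(tp, tp + fp),
        "recall": recall,
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy labels (illustrative only): 5 causal pairs among 100 candidates.
y_true = [1] * 5 + [0] * 95
never_causal = [0] * 100  # a degenerate method that predicts no links

result = binary_metrics(y_true, never_causal)
print(result)
```

The degenerate predictor reaches an accuracy of 0.95 while its balanced accuracy is 0.5 and its recall is 0, which is why the imbalance-aware metrics are reported alongside accuracy.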


In [31]:
import pickle 
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, balanced_accuracy_score

In [32]:
# The pickle holds one dict-like collection of DataFrames per method, plus the ground truth.
with open('example/causal_dfs.pkl', 'rb') as f:
    causal_dfs_var, causal_dfs_varlingam, causal_dfs_pcmci, causal_dfs_granger, causal_dfs_dynotears, causal_dfs_d2c, true_causal_dfs = pickle.load(f)

# Stack each collection into a single DataFrame with a fresh 0..n-1 index.
causal_dfs_varlingam = pd.concat(causal_dfs_varlingam.values()).reset_index(drop=True)
causal_dfs_var = pd.concat(causal_dfs_var.values()).reset_index(drop=True)
causal_dfs_pcmci = pd.concat(causal_dfs_pcmci.values()).reset_index(drop=True)
causal_dfs_granger = pd.concat(causal_dfs_granger.values()).reset_index(drop=True)
causal_dfs_dynotears = pd.concat(causal_dfs_dynotears.values()).reset_index(drop=True)
causal_dfs_d2c = pd.concat(causal_dfs_d2c.values()).reset_index(drop=True)
true_causal_dfs = pd.concat(true_causal_dfs.values()).reset_index(drop=True)

In [35]:
# Sanity check: every method must list the same candidate edges ('from', 'to')
# in the same row order, so that predictions can be compared row by row.

assert (causal_dfs_varlingam['from'] == causal_dfs_var['from']).all()
assert (causal_dfs_varlingam['from'] == causal_dfs_pcmci['from']).all()
assert (causal_dfs_varlingam['from'] == causal_dfs_granger['from']).all()
assert (causal_dfs_varlingam['from'] == causal_dfs_dynotears['from']).all()
assert (causal_dfs_varlingam['from'] == causal_dfs_d2c['from']).all()
assert (causal_dfs_varlingam['from'] == true_causal_dfs['from']).all()

assert (causal_dfs_varlingam['to'] == causal_dfs_var['to']).all()
assert (causal_dfs_varlingam['to'] == causal_dfs_pcmci['to']).all()
assert (causal_dfs_varlingam['to'] == causal_dfs_granger['to']).all()
assert (causal_dfs_varlingam['to'] == causal_dfs_dynotears['to']).all()
assert (causal_dfs_varlingam['to'] == causal_dfs_d2c['to']).all()
assert (causal_dfs_varlingam['to'] == true_causal_dfs['to']).all()
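
The pairwise assertions above could also be written as one loop. The following is a minimal sketch, assuming each frame has `from` and `to` columns as in this notebook; the helper `check_same_edges` and the toy frames are hypothetical names introduced for illustration only. Note that `DataFrame.equals` also requires matching dtypes, which is stricter than the element-wise comparison used above.

```python
import pandas as pd


def check_same_edges(reference, others, key_cols=("from", "to")):
    """Raise AssertionError if any frame lists candidate edges differently from `reference`."""
    cols = list(key_cols)
    ref = reference[cols].reset_index(drop=True)
    for name, df in others.items():
        cand = df[cols].reset_index(drop=True)
        assert len(cand) == len(ref), f"{name}: {len(cand)} rows, expected {len(ref)}"
        assert cand.equals(ref), f"{name}: columns {cols} differ from the reference"


# Toy frames standing in for the per-method results (values are illustrative).
truth = pd.DataFrame({"from": [0, 0, 1], "to": [1, 2, 2], "is_causal": [1, 0, 1]})
method_a = pd.DataFrame({"from": [0, 0, 1], "to": [1, 2, 2], "is_causal": [1, 1, 0]})
method_b = pd.DataFrame({"from": [0, 1, 0], "to": [1, 2, 2], "is_causal": [1, 0, 0]})

check_same_edges(truth, {"method_a": method_a})  # same edges, passes silently

try:
    check_same_edges(truth, {"method_b": method_b})  # rows are in a different order
    mismatch_detected = False
except AssertionError as err:
    mismatch_detected = True
    print(err)
```

Naming the offending method in the assertion message makes a failed check easier to trace than a bare `assert`.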


In [39]:
y_true = true_causal_dfs['is_causal'].astype(int)
y_pred_var = causal_dfs_var['is_causal'].astype(int)
y_pred_pcmci = causal_dfs_pcmci['is_causal'].astype(int)
y_pred_granger = causal_dfs_granger['is_causal'].astype(int)
y_pred_dynotears = causal_dfs_dynotears['is_causal'].astype(int)
y_pred_d2c = causal_dfs_d2c['is_causal'].astype(int)
y_pred_varlingam = causal_dfs_varlingam['is_causal'].astype(int)

In [40]:
concat_predictions = pd.concat([y_pred_var, 
                                y_pred_varlingam,
                                y_pred_pcmci, 
                                y_pred_granger, 
                                y_pred_dynotears,
                                y_pred_d2c, 
                                y_true], axis=1)
concat_predictions.columns = ['VAR', 'VarLingam', 'PCMCI', 'Granger', 'Dynotears', 'D2C', 'True']

In [41]:
# Map each method name to its predictions; the order fixes the row order of the table.
predictions = {
    "VAR": y_pred_var,
    "PCMCI": y_pred_pcmci,
    "Granger": y_pred_granger,
    "Dynotears": y_pred_dynotears,
    "VarLingam": y_pred_varlingam,
    "D2C": y_pred_d2c,
}

# zero_division=np.nan (scikit-learn >= 1.3) reports undefined metrics as NaN,
# e.g. precision when a method makes no positive prediction.
rows = [
    {
        "Method": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Balanced Accuracy": balanced_accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=np.nan),
        "Recall": recall_score(y_true, y_pred, zero_division=np.nan),
        "F1": f1_score(y_true, y_pred, zero_division=np.nan),
        "Total": len(y_true),       # number of candidate pairs
        "Positive": y_true.sum(),   # number of truly causal pairs
    }
    for name, y_pred in predictions.items()
]

scores = pd.DataFrame(rows)

In [42]:
scores

| | Method | Accuracy | Balanced Accuracy | Precision | Recall | F1 | Total | Positive |
|---|---|---|---|---|---|---|---|---|
| 0 | VAR | 0.9334 | 0.585005 | 0.393103 | 0.188742 | 0.255034 | 5000 | 302 |
| 1 | PCMCI | 0.9586 | 0.977969 | 0.59332 | 1.0 | 0.74476 | 5000 | 302 |
| 2 | Granger | 0.7316 | 0.43734 | 0.028131 | 0.102649 | 0.04416 | 5000 | 302 |
| 3 | Dynotears | 0.9396 | 0.5 | NaN | 0.0 | 0.0 | 5000 | 302 |
| 4 | VarLingam | 0.9546 | 0.975841 | 0.570888 | 1.0 | 0.726835 | 5000 | 302 |
| 5 | D2C | 0.6594 | 0.49497 | 0.058601 | 0.307947 | 0.098465 | 5000 | 302 |
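
As a worked check of how the columns in the table above relate to one another, the sketch below recovers the confusion counts for PCMCI from its precision, recall, total, and positive count, then recomputes accuracy and balanced accuracy. The variable names are illustrative; the input numbers are copied from the PCMCI row, and rounding is needed because the table reports precision to six digits.

```python
# Values copied from the PCMCI row of the results table.
total, positives = 5000, 302
precision, recall = 0.59332, 1.0

tp = round(recall * positives)        # causal links correctly found
fn = positives - tp                   # causal links missed
fp = round(tp / precision) - tp       # non-causal pairs flagged as causal
tn = total - positives - fp           # non-causal pairs correctly rejected

accuracy = (tp + tn) / total
balanced_accuracy = (tp / positives + tn / (total - positives)) / 2

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.4f} balanced_accuracy={balanced_accuracy:.6f}")
```

Under these assumptions the counts come out as 302 true positives and 207 false positives, and the recomputed accuracy (0.9586) and balanced accuracy (0.977969) agree with the reported row.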


# Conclusion

The results in this notebook show that the methods perform very differently on this limited training dataset. PCMCI and VarLingam recover every true causal link (recall 1.0, balanced accuracy about 0.98) at a precision of roughly 0.57 to 0.59, whereas Dynotears predicts no causal links at all: its accuracy of 0.94 only reflects that 302 of the 5000 candidate pairs (about 6%) are causal. These methods typically require larger and more diverse datasets to achieve reliable and accurate results, particularly for a task as complex as causal inference.

For more detailed and complete results, see the `submission` branch, or repeat the experiment on broader datasets.
