# Benchmark
We compare our approach against state-of-the-art causal discovery methods. <br>
Each selected method represents a distinct category of causal inference approaches. <br>
First, we include the pairwise Granger causality test implementation available in the `statsmodels` package for Python. <br>

From the constraint-based family, we select the PCMCI algorithm developed by Runge, which is implemented in the Tigramite Python package. <br> 
The VarLiNGAM method, as proposed by Hyvärinen et al., is our choice for the noise-based category.  <br> 
It is implemented in Python via the LiNGAM library. <br> 

Finally, for the score-based category, we incorporate DYNOTEARS, a method introduced by Pamfil et al. and implemented in the CausalNex Python library.  <br> 

We also include standard VAR modeling, treating the regression coefficients as if they were causal effects.  <br> 


It is important to note that each method has its own underlying assumptions, which might not always hold in practical scenarios. For example, VarLiNGAM assumes non-Gaussian errors, which is not the case in our experiments. Similarly, we use PCMCI with the ParCorr independence test, although a nonlinear test would be more appropriate: we faced computational challenges with PCMCI when attempting to use its nonlinear independence test, CMIknn. Nevertheless, our goal is not to demonstrate that our approach outperforms all others under all conditions. Instead, we aim to show that our method can be a valuable addition to the toolbox for causal discovery in time series, offering unique insights and potentially complementing existing techniques. The conditions adopted for each method might be suboptimal for the given dataset, yet they provide a reasonable benchmark for evaluating the relative strengths and potential applications of our proposed approach.

Each method has been wrapped in a class with a common interface, so that all of them are created, executed, and queried in the same way and return results in the same structure.  

In [1]:
import os

# VARLiNGAM would otherwise use every available CPU when n_jobs > 1, so limit
# each numerical backend to a single thread (set before the imports below).
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import pickle
from d2c.descriptors import DataLoader
from d2c.benchmark import VARLiNGAM, PCMCI, Granger, DYNOTEARS, D2CWrapper, VAR

from imblearn.ensemble import BalancedRandomForestClassifier

# Suppress FutureWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


In [2]:
N_VARS = 5
N_JOBS = 40
MAXLAGS = 5

## Explanation
We start by loading our test data and collecting the corresponding observations. <br>
Notice that we collect the original observations rather than the lagged ones. <br>
This is because the methods will build lag matrices internally. <br>
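
As an illustration of what building a lag matrix involves, here is a minimal NumPy sketch. The function name `build_lag_matrix` and the column ordering (present first, then increasing lags) are assumptions made for this example; the wrapped methods may build their lag matrices differently.

```python
import numpy as np

def build_lag_matrix(ts: np.ndarray, maxlags: int) -> np.ndarray:
    """Stack a (T, n_variables) series with its lagged copies column-wise.

    Column block k (k = 0..maxlags) holds the series at time t-k, so the
    result has shape (T - maxlags, n_variables * (maxlags + 1)).
    This helper is hypothetical and not part of the benchmarked libraries.
    """
    n_obs = ts.shape[0]
    if n_obs <= maxlags:
        raise ValueError("need more observations than lags")
    # Block k drops the last k rows and keeps alignment with time t.
    blocks = [ts[maxlags - k : n_obs - k] for k in range(maxlags + 1)]
    return np.hstack(blocks)

ts = np.arange(12, dtype=float).reshape(6, 2)  # 6 time steps, 2 variables
lagged = build_lag_matrix(ts, maxlags=2)
```

Each row of `lagged` contains the values at time `t`, followed by the values at `t-1` and `t-2`; the first `maxlags` time steps are dropped because their lags are not available.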

At this stage it's important to introduce the concept of `causal_df`. <br> 
A `causal_df` is a representation of the output of causal discovery.<br> 
It's a dataframe containing the following columns: <br> 
`['from', 'to', 'effect', 'p_value', 'probability', 'is_causal']`

- `from`: the source variable (in the past)
- `to`: the target variable (in the present)
- `effect`: the estimated effect size (if provided by the method)
- `p_value`: the p-value associated with the effect (if provided by the method)
- `probability`: the probability of the causal relationship (if provided by the method)
- `is_causal`: a boolean indicating whether the relationship is causal <br>

Recall our variable naming convention: <br>
- A time series with `n_variables` dimensions uses the names 0 to `n_variables - 1` for the variables at time `t` (present)
- names from `n_variables` to `n_variables*2 - 1` refer to the same variables at time `t-1` (lag 1)
- names from `n_variables*2` to `n_variables*3 - 1` refer to the same variables at time `t-2` (lag 2)  <br>

For example, if `n_variables = 5`, the row where `from` is 12 and `to` is 4 describes the link from the third variable (name `2`) at time `t-2` to the fifth variable (name `4`) at time `t`. <br>
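
The naming convention above amounts to integer division by `n_variables`. The two helpers below are hypothetical (they are not part of the package) and only sketch how a name can be decoded into a zero-based variable index and a lag, and encoded back.

```python
from typing import Tuple

def decode_name(name: int, n_variables: int) -> Tuple[int, int]:
    """Return (variable_index, lag) for a column name; indices start at 0."""
    if name < 0:
        raise ValueError("names are non-negative integers")
    lag, variable = divmod(name, n_variables)
    return variable, lag

def encode_name(variable: int, lag: int, n_variables: int) -> int:
    """Inverse of decode_name: the name of `variable` observed at time t-lag."""
    return lag * n_variables + variable

source = decode_name(12, n_variables=5)  # variable index 2 at lag 2
target = decode_name(4, n_variables=5)   # variable index 4 at lag 0 (present)
```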

Here is an example of a `causal_df` for 5 variables and 5 lags:

|   |    from | to  |  effect  | p_value |probability |is_causal |
|------|---------|-----|----------|---------|------------|----------|
|     |   5  | 0 |  0.52155 | 0.313978   |     None    |     0     |
|     |   5  | 1 |-0.006598 | 0.059683   |     None    |     0     |
|     |   5  | 2 | 0.445405 | 0.968117   |     None    |     0     |
|     |   5  | 3 | 0.017567 | 0.033022   |     None    |     0     |
|     |   5  | 4 | -0.04921 | 0.205457   |     None    |     0     |
|    | ...  |.. |      ... |      ...   |      ...    |   ...     |
|   |  29  | 0 | 0.064319 | 0.170674   |     None    |     0     |
|   |  29  | 1 |-0.059419 | 0.550454   |     None    |     0     |
|   |  29  | 2 |-0.017221 | 0.829726   |     None    |     0     |
|   |  29  | 3 | 0.020435 | 0.218739   |     None    |     0     |
|   |  29  | 4 | 0.042379 | 0.294798   |     None    |     0     |
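
To make the layout concrete, the sketch below builds an empty `causal_df` skeleton with pandas. It assumes, as the example table suggests, one row for every pair of lagged source and present target; the function name `empty_causal_df` is hypothetical and the values are placeholders, not results.

```python
from itertools import product

import pandas as pd

COLUMNS = ["from", "to", "effect", "p_value", "probability", "is_causal"]

def empty_causal_df(n_variables: int, maxlags: int) -> pd.DataFrame:
    """One row per (lagged source, present target) pair, with unset scores."""
    sources = range(n_variables, n_variables * (maxlags + 1))  # lags 1..maxlags
    targets = range(n_variables)                               # variables at time t
    rows = [
        {"from": s, "to": t, "effect": None, "p_value": None,
         "probability": None, "is_causal": False}
        for s, t in product(sources, targets)
    ]
    return pd.DataFrame(rows, columns=COLUMNS)

skeleton = empty_causal_df(n_variables=5, maxlags=5)
```

With 5 variables and 5 lags, the `from` column ranges over the names 5 to 29 and the `to` column over 0 to 4, as in the table above.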

In [3]:
dataloader = DataLoader(n_variables = N_VARS, maxlags = MAXLAGS)
dataloader.from_pickle('example/synthetic_data_test.pkl')
observations = dataloader.get_original_observations()
dags = dataloader.get_dags()
true_causal_dfs = dataloader.get_true_causal_dfs()


Each method in the benchmark takes as input: 
- `ts_list`: a list of NumPy arrays containing the values of the time series, 
- `maxlags`: the maximum number of lags to consider, 
- `n_jobs`: the number of parallel jobs. <br>

Note that `get_causal_dfs()` returns a dictionary of dataframes, keyed by the index of the corresponding time series in the input list `ts_list`.
So, if the data contains `15` time series, the result for the last one is `causal_dfs[15 - 1]`.
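
The sketch below mimics this interface with a toy class. It is a hypothetical stand-in, not the code in `d2c.benchmark`: the class name, the `threshold` parameter, and the use of a lagged correlation as the "effect" are all assumptions, and a lagged correlation is not a causal discovery method. The point is only to show the constructor arguments, `run()`, and the dictionary returned by `get_causal_dfs()`.

```python
from typing import Dict, List

import numpy as np
import pandas as pd

class LaggedCorrelationBaseline:
    """Hypothetical stand-in that reproduces the wrappers' interface only."""

    def __init__(self, ts_list: List[np.ndarray], maxlags: int,
                 n_jobs: int = 1, threshold: float = 0.5):
        self.ts_list = ts_list
        self.maxlags = maxlags
        self.n_jobs = n_jobs  # kept for interface parity; this sketch runs serially
        self.threshold = threshold
        self.causal_dfs: Dict[int, pd.DataFrame] = {}

    def _infer(self, ts: np.ndarray) -> pd.DataFrame:
        n_obs, n_vars = ts.shape
        rows = []
        for lag in range(1, self.maxlags + 1):
            for src in range(n_vars):
                for dst in range(n_vars):
                    # Correlation between src at time t-lag and dst at time t.
                    effect = np.corrcoef(ts[:n_obs - lag, src], ts[lag:, dst])[0, 1]
                    rows.append({"from": lag * n_vars + src, "to": dst,
                                 "effect": effect, "p_value": None,
                                 "probability": None,
                                 "is_causal": bool(abs(effect) > self.threshold)})
        return pd.DataFrame(rows)

    def run(self) -> "LaggedCorrelationBaseline":
        # Keys are the positions of the series in ts_list.
        for idx, ts in enumerate(self.ts_list):
            self.causal_dfs[idx] = self._infer(ts)
        return self

    def get_causal_dfs(self) -> Dict[int, pd.DataFrame]:
        return self.causal_dfs

rng = np.random.default_rng(0)
toy_list = [rng.normal(size=(50, 3)) for _ in range(15)]  # 15 toy series
toy_dfs = LaggedCorrelationBaseline(ts_list=toy_list, maxlags=2).run().get_causal_dfs()
last_df = toy_dfs[len(toy_list) - 1]  # result for the last series
```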

## Competitors

In [4]:
var = VAR(ts_list=observations, maxlags=MAXLAGS, n_jobs=N_JOBS)
var.run()
causal_dfs_var = var.get_causal_dfs()

In [5]:
varlingam = VARLiNGAM(ts_list=observations, maxlags=MAXLAGS, n_jobs=N_JOBS)
varlingam.run()
causal_dfs_varlingam = varlingam.get_causal_dfs()

In [6]:
pcmci = PCMCI(ts_list=observations, maxlags=MAXLAGS, n_jobs=N_JOBS)
pcmci.run()
causal_dfs_pcmci = pcmci.get_causal_dfs()

In [7]:
granger = Granger(ts_list=observations, maxlags=MAXLAGS, n_jobs=N_JOBS)
granger.run()
causal_dfs_granger = granger.get_causal_dfs()

In [8]:
dynotears = DYNOTEARS(ts_list=observations, maxlags=MAXLAGS, n_jobs=N_JOBS)
dynotears.run()
causal_dfs_dynotears = dynotears.get_causal_dfs()

## D2CWrapper
For consistency with the other results, a `D2CWrapper` class has been created that behaves exactly like the other approaches.<br>
It therefore exposes the methods `run()` and `get_causal_dfs()`. <br>
It requires a model that has already been trained, and it computes descriptors for unseen data without using their DAGs. <br>
In this case, the model cannot select a subset of features (no `couples_to_consider_per_dag` attribute).
The model's predictions on the newly computed descriptors are the labels provided in the `causal_df`. <br>

<b>Important:</b> make sure your model has been trained on the same feature set. If you used `full=True` when generating the training descriptors, you should use `full=True` here as well.

In [9]:
with open('descriptors_df_train.pkl', 'rb') as f:
    descriptors_df_train = pickle.load(f)

X_train = descriptors_df_train.drop(columns=['graph_id','edge_source','edge_dest','is_causal'])
y_train = descriptors_df_train['is_causal']

clf = BalancedRandomForestClassifier(
    n_estimators=10,
    max_depth=None,
    random_state=0,
    sampling_strategy='auto',
    replacement=True,
    bootstrap=True,
)
clf.fit(X_train, y_train)

In [10]:
d2cwrapper = D2CWrapper(ts_list=observations, n_variables=N_VARS, model=clf, maxlags=MAXLAGS, n_jobs = N_JOBS, full=True)
d2cwrapper.run()
causal_dfs_d2c = d2cwrapper.get_causal_dfs()

## Saving

In [11]:
with open('example/causal_dfs.pkl', 'wb') as f:
    pickle.dump((causal_dfs_var, 
                causal_dfs_varlingam, 
                causal_dfs_pcmci,
                causal_dfs_granger, 
                causal_dfs_dynotears,
                causal_dfs_d2c, 
                true_causal_dfs), f)
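
Because the results are saved as a plain tuple, they have to be unpacked in the same order when loaded back. The sketch below illustrates this with a round trip through a temporary file; the placeholder payload and the labels in `METHOD_NAMES` are assumptions made for the example and stand in for the seven objects saved above.

```python
import os
import pickle
import tempfile

# Labels in the same order as the tuple dumped above (hypothetical names).
METHOD_NAMES = ["var", "varlingam", "pcmci", "granger", "dynotears", "d2c", "true"]

# Placeholder payload standing in for the seven dictionaries of dataframes.
payload = tuple({0: f"causal_df for {name}"} for name in METHOD_NAMES)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "causal_dfs.pkl")
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Pair each loaded object with its label so the order is explicit downstream.
results = dict(zip(METHOD_NAMES, loaded))
```

Saving a dictionary keyed by method name instead of a tuple would avoid relying on the order, at the cost of changing the file layout used here.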