This notebook generates the training descriptors. 
For each DAG, only a subset of 60 couples of variables is chosen (`COUPLES_TO_CONSIDER_PER_DAG`), according to the following logic:

- 20 causal couples: A selection of 20 pairs of variables that are causally related according to the DAG. These serve as positive examples.
- 20 opposite couples: For each selected causal couple, the reversed pair (effect as cause, cause as effect) is also chosen. These pairs are theoretically informative, but they do not respect the temporal ordering, so they never appear in the results returned by competitor methods. For this reason they are excluded from the evaluation phase later on, although they remain in the training set of our classifier.
- 20 additional noncausal couples: Chosen to compensate for the exclusion of the opposite couples in validation and to keep the dataset balanced. Unlike the opposite couples, these pairs are chosen because they have no causal connection in the DAG, so they are more likely to resemble the noncausal relationships encountered in real-world scenarios.
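
The selection logic above can be illustrated with a minimal sketch. This is not the `d2c` implementation: `select_couples` is a hypothetical helper, the DAG is given as a plain edge list with no time lags, and "noncausal" is assumed here to mean "no direct edge in either direction".

```python
import random
from itertools import permutations


def select_couples(nodes, edges, n_per_class, seed=42):
    """Hypothetical helper: pick causal, opposite and noncausal couples from one DAG."""
    rng = random.Random(seed)
    edge_set = set(edges)

    # Causal couples: (cause, effect) pairs that are edges of the DAG.
    causal = rng.sample(sorted(edge_set), min(n_per_class, len(edge_set)))

    # Opposite couples: the same pairs with cause and effect swapped.
    opposite = [(effect, cause) for cause, effect in causal]

    # Noncausal couples (assumption): ordered pairs with no edge in either direction.
    candidates = [
        (a, b)
        for a, b in permutations(nodes, 2)
        if (a, b) not in edge_set and (b, a) not in edge_set
    ]
    noncausal = rng.sample(candidates, min(n_per_class, len(candidates)))

    return causal, opposite, noncausal


# Toy DAG, made up for illustration only.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (0, 4)]
causal, opposite, noncausal = select_couples(nodes, edges, n_per_class=3)
```

The three lists have the same length whenever the DAG has enough edges and enough unconnected pairs, which is what keeps the classes balanced once the opposite couples are dropped at evaluation time.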

In [2]:
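import os

from d2c.descriptors import D2C, DataLoader
from tqdm import tqdm

N_JOBS = 40  # number of parallel jobs passed to D2C
SEED = 42  # random seed, for reproducibility
MB_SIZE = 5  # Markov blanket size passed to D2C
COUPLES_TO_CONSIDER_PER_DAG = 60  # couples sampled from each DAG
maxlags = 5  # maximum time lag considered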

root = './data/'
for file in tqdm(sorted(os.listdir(root))):
    # The generation parameters are encoded in the file name as '_'-separated,
    # letter-prefixed fields; drop the extension, then strip each prefix.
    fields = os.path.splitext(file)[0].split('_')
    gen_process_number = int(fields[0][1:])
    n_variables = int(fields[1][1:])
    max_neighborhood_size = int(fields[2][2:])
    noise_std = float(fields[3][1:])

    dataloader = DataLoader(n_variables=n_variables, maxlags=maxlags)
    dataloader.from_pickle(os.path.join(root, file))

    d2c = D2C(observations=dataloader.get_observations(), 
            dags=dataloader.get_dags(), 
            couples_to_consider_per_dag=COUPLES_TO_CONSIDER_PER_DAG, 
            MB_size=MB_SIZE, 
            n_variables=n_variables, 
            maxlags=maxlags,
            seed=SEED,
            n_jobs=N_JOBS,
            full=True)

    d2c.initialize()

    descriptors_df = d2c.get_descriptors_df()

    descriptors_df.insert(0, 'process_id', gen_process_number)
    descriptors_df.insert(2, 'n_variables', n_variables)
    descriptors_df.insert(3, 'max_neighborhood_size', max_neighborhood_size)
    descriptors_df.insert(4, 'noise_std', noise_std)

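    # Make sure the output directory exists before writing the descriptors.
    os.makedirs('./descriptors', exist_ok=True)
    descriptors_df.to_pickle(f'./descriptors/P{gen_process_number}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}_MB{MB_SIZE}.pkl')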

100%|██████████| 486/486 [4:09:47<00:00, 30.84s/it]  
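
The loop above recovers the generation parameters from each file name with fixed slice offsets, which fails silently or with an opaque error if an unexpected file sits in `./data/`. A possible alternative is sketched below with a regular expression. The pattern `P<int>_N<int>_Nj<int>_n<float>.pkl` is an assumption inferred from the slicing and from the output file name, and `parse_filename` and the example name are hypothetical.

```python
import re

# Assumed input naming scheme, inferred from the slicing in the loop.
FILENAME_RE = re.compile(r'^P(\d+)_N(\d+)_Nj(\d+)_n([0-9.eE+-]+)\.pkl$')


def parse_filename(name):
    """Return (process_id, n_variables, max_neighborhood_size, noise_std)."""
    match = FILENAME_RE.match(name)
    if match is None:
        # Fail loudly on files that do not follow the expected scheme.
        raise ValueError(f'unexpected file name: {name!r}')
    process, n_vars, neighborhood, noise = match.groups()
    return int(process), int(n_vars), int(neighborhood), float(noise)


# Hypothetical file name, for illustration only.
parsed = parse_filename('P1_N5_Nj2_n0.01.pkl')
```

Raising on a mismatch makes stray files (for example a hidden system file in the data folder) easy to spot before hours of descriptor computation begin.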
