# RUN NOTEBOOK

## INTRO

This Jupyter notebook runs the experiments for a machine learning pipeline built around the `TD2C` method. It covers data selection, model training, and evaluation on `multivariate time series` data. The notebook selects the data files to process, loads precomputed descriptors, trains a `Random Forest` classifier, and compares the results with Granger, PCMCI, DYNOTEARS, and VARLiNGAM baselines.

### Breakdown of the Notebook:

1. **Scope and Purpose**:
    - Data Selection: The notebook filters and selects relevant files containing time-series data, ensuring consistency with precomputed descriptors.
    - Descriptor Loading: It loads and concatenates data descriptors, which are crucial for the subsequent machine learning tasks.
    - Model Training and Evaluation: It employs machine learning models (particularly Random Forest) to train on the selected data and then evaluates the models using various metrics.

2. **Key Modules and Libraries**:
    - `Pandas`: For data manipulation and handling.
    - `NumPy`: For numerical computations.
    - `Scikit-learn`: For implementing machine learning models and evaluation metrics.
    - `Imbalanced-learn`: Specifically for handling imbalanced datasets using a BalancedRandomForestClassifier.
    - Custom Modules (`d2c.benchmark`, `d2c.descriptors_generation.loader`): Part of the custom `d2c` package, which implements the `D2C` (from Dependency to Causality) methodology and wraps the benchmark methods used below.

3. **Main Functions and Classes**:
    - `DataLoader`: Loads the pickled time-series files and exposes the original observations and the ground-truth causal DataFrames.
    - `D2CWrapper`: A wrapper class for benchmarking models with the D2C approach. It is imported here but not called in the cells below.
    - `RandomForestClassifier`: A machine learning model used to train on the data.
    - `BalancedRandomForestClassifier`: An alternative to the standard RandomForest, designed to handle imbalanced datasets.

4. **Outputs**: The notebook produces several key outputs:
    - Trained Machine Learning Models: After processing and training, models like the Random Forest classifier are produced.
    - Evaluation Metrics: The notebook computes `ROC-AUC`, `precision`, `recall`, and `F1 score` for each graph and stores them per generative process.
    - Filtered Data Files: It also generates or modifies lists of files to be processed based on specific criteria.

## SETTINGS

### IMPORT PACKAGES

In [None]:
import pickle 
import os
import pandas as pd
from tqdm import tqdm
import numpy as np

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, balanced_accuracy_score

from d2c.benchmark import D2CWrapper

from d2c.descriptors_generation.loader import DataLoader

from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

from sklearn.metrics import roc_auc_score

### SET LAGS

In [None]:
maxlags = 5 

We select only the files we want to work with. The selection must be consistent with the files for which descriptors have already been computed.
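
The selection relies on fields encoded in each file name. The following is a minimal sketch of that parsing, assuming names of the form `P<process>_N<variables>_Nj<neighborhood>_n<noise>.pkl`. The letter prefixes, the example file name, and the names `RunSpec`, `parse_run_filename` and `is_selected` are assumptions for illustration; only the prefix lengths are implied by the slicing used in this notebook.

```python
from typing import NamedTuple


class RunSpec(NamedTuple):
    """Fields encoded in a data file name (hypothetical container)."""
    process_id: int
    n_variables: int
    max_neighborhood_size: int
    noise_std: float


def parse_run_filename(filename: str) -> RunSpec:
    """Parse a name such as 'P3_N5_Nj2_n0.01.pkl' into its four fields.

    One prefix character is skipped for the process, variable and noise
    fields, and two for the neighborhood field.
    """
    stem = filename[:-len(".pkl")] if filename.endswith(".pkl") else filename
    process, variables, neighborhood, noise = stem.split("_")
    return RunSpec(
        process_id=int(process[1:]),
        n_variables=int(variables[1:]),
        max_neighborhood_size=int(neighborhood[2:]),
        noise_std=float(noise[1:]),
    )


def is_selected(spec: RunSpec, noise_std: float = 0.01, neighborhood: int = 2) -> bool:
    """Keep only runs with the requested noise level and neighborhood size."""
    return spec.noise_std == noise_std and spec.max_neighborhood_size == neighborhood


spec = parse_run_filename("P3_N5_Nj2_n0.01.pkl")  # hypothetical file name
print(spec, is_selected(spec))
```

Parsing once into a named tuple avoids repeating the `split('_')` slicing in every loop and makes the filter conditions explicit.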

## RUN METHODS

### TD2C

In [None]:
data_root = '/home/jpalombarini/td2c/notebooks/paper_td2c/.data/'

to_dos = []

# This loop gets a list of all the files to be processed
for testing_file in sorted(os.listdir(data_root)):
    if testing_file.endswith('.pkl'):
        gen_process_number = int(testing_file.split('_')[0][1:])
        n_variables = int(testing_file.split('_')[1][1:])
        max_neighborhood_size = int(testing_file.split('_')[2][2:])
        noise_std = float(testing_file.split('_')[3][1:-4])
          
        if noise_std != 0.01: # if the noise is different we skip the file
            continue

        if max_neighborhood_size != 2: # if the max_neighborhood_size is different we skip the file
            continue

        to_dos.append(testing_file) # we add the file to the list (to_dos) to be processed

# group to_dos by number of variables
to_dos_5_variables = [file for file in to_dos if int(file.split('_')[1][1:]) == 5]
to_dos_10_variables = [file for file in to_dos if int(file.split('_')[1][1:]) == 10]
to_dos_25_variables = [file for file in to_dos if int(file.split('_')[1][1:]) == 25]

# we create a dictionary with the lists of files to be processed
todos = {'5': to_dos_5_variables, '10': to_dos_10_variables, '25': to_dos_25_variables}

In [None]:
# we collect the descriptor DataFrames in a list

dfs = []
descriptors_root = '/home/jpalombarini/td2c/notebooks/paper_td2c/.descriptors_ts_original_entropy/'

# This loop gets the descriptors for the files to be processed
for testing_file in sorted(os.listdir(descriptors_root)):
    if testing_file.endswith('.pkl'):
        df = pd.read_pickle(descriptors_root + testing_file)
        dfs.append(df)

# we concatenate the descriptors
descriptors_training = pd.concat(dfs, axis=0).reset_index(drop=True)
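
The loading step above can be wrapped in a small helper. This is a sketch only: `load_descriptors` is a hypothetical name, and it assumes every `.pkl` file in the folder holds a pandas DataFrame with the same columns. Using `pathlib` removes the dependence on a trailing slash in the root path, and the explicit check gives a clearer error than `pd.concat` raises on an empty list.

```python
from pathlib import Path

import pandas as pd


def load_descriptors(root: str) -> pd.DataFrame:
    """Concatenate every pickled DataFrame found directly under `root`."""
    paths = sorted(Path(root).glob("*.pkl"))
    if not paths:
        raise FileNotFoundError(f"no .pkl descriptor files found in {root!r}")
    frames = [pd.read_pickle(path) for path in paths]
    # Reset the index so rows from different files do not share labels.
    return pd.concat(frames, axis=0).reset_index(drop=True)
```

With this helper, the two descriptor sets used in this notebook could each be loaded by one call with the corresponding root folder.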

In [None]:
# # plot the precision recall curve against the threshold
# from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, precision_recall_curve
# import matplotlib.pyplot as plt

# maxlags = 5

# td2c_rocs_process = {}
# td2c_precision_process = {}
# td2c_recall_process = {}
# td2c_f1_process = {}

# for testing_file in to_dos_5_variables:
#     print("Working on", testing_file)
#     gen_process_number = int(testing_file.split('_')[0][1:])
#     n_variables = int(testing_file.split('_')[1][1:])
#     max_neighborhood_size = int(testing_file.split('_')[2][2:])
#     noise_std = float(testing_file.split('_')[3][1:-4])


#     # split training and testing data
#     training_data = descriptors_training.loc[descriptors_training['process_id'] != gen_process_number]
#     X_train = training_data.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
#     y_train = training_data['is_causal']

#     testing_data = descriptors_training.loc[(descriptors_training['process_id'] == gen_process_number) & (descriptors_training['n_variables'] == n_variables) & (descriptors_training['max_neighborhood_size'] == max_neighborhood_size) & (descriptors_training['noise_std'] == noise_std)]

#     model = BalancedRandomForestClassifier(n_estimators=50, random_state=0, n_jobs=50, max_depth=None, sampling_strategy='auto', replacement=True, bootstrap=False)
#     # model = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=50, max_depth=10)

#     model.fit(X_train, y_train)

#     rocs = {}
#     precisions = {}
#     recalls = {}
#     f1s = {}
#     #load testing descriptors
#     test_df = testing_data
#     test_df = test_df.sort_values(by=['edge_source','edge_dest']).reset_index(drop=True) # sort to match the order of the model!

#     X_test = test_df.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
#     y_test = test_df['is_causal']

#     y_pred_proba = model.predict_proba(X_test)[:,1]
#     y_pred = model.predict(X_test)

#     # Step 2: Calculate precision, recall, and thresholds
#     precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

#     # Step 3: Plot the precision-recall curve
#     plt.figure(figsize=(10, 6))
#     plt.plot(thresholds, precision[:-1], label='Precision')
#     plt.plot(thresholds, recall[:-1], label='Recall')
#     plt.xlabel('Threshold')
#     plt.ylabel('Precision/Recall')
#     plt.title('Precision-Recall vs Threshold')
#     plt.legend()
#     plt.grid()
#     plt.show()


In [None]:
# from sklearn.metrics import roc_auc_score
# maxlags = 5
# descriptors_root = './descriptors_ts_original_entropy/'
# data_root = '../data/'

# training_set_full = []
# for testing_file in to_dos_5_variables:
#     gen_process_number = int(testing_file.split('_')[0][1:])
#     n_variables = int(testing_file.split('_')[1][1:])
#     max_neighborhood_size = int(testing_file.split('_')[2][2:])
#     noise_std = float(testing_file.split('_')[3][1:-4])

#     dataloader = DataLoader(n_variables = n_variables,
#                 maxlags = maxlags)
#     dataloader.from_pickle(data_root+testing_file)
#     true_causal_dfs = dataloader.get_true_causal_dfs()

#     for i in range(40):

#         test_df = pd.read_csv(descriptors_root+testing_file+'_'+str(i)+'.csv', index_col=0)
#         test_df = test_df.sort_values(by=['edge_source','edge_dest']).reset_index(drop=True) # sort to match the order of the model!
#         test_df['is_causal'] = true_causal_dfs[i]['is_causal']
#         test_df['process_id'] = gen_process_number
#         test_df['graph_id'] = i
#         test_df['n_variables'] = n_variables
#         test_df['max_neighborhood_size'] = max_neighborhood_size
#         test_df['noise_std'] = noise_std

#         training_set_full.append(test_df)

# training_set_full = pd.concat(training_set_full, axis=0).reset_index(drop=True)



In [None]:
# This loop does the following:
# 1. Creates some dictionaries to store the results
# 2. Loads the training data
# 3. Trains the model
# 4. Evaluates the model
# 5. Stores the results in the dictionaries
# 6. Saves the dictionaries in a pickle file

for n_vars, todo in todos.items():
    td2c_rocs_process = {}
    td2c_precision_process = {}
    td2c_recall_process = {}
    td2c_f1_process = {}
    for testing_file in tqdm(todo):
        gen_process_number = int(testing_file.split('_')[0][1:])
        n_variables = int(testing_file.split('_')[1][1:])
        max_neighborhood_size = int(testing_file.split('_')[2][2:])
        noise_std = float(testing_file.split('_')[3][1:-4])

        # split training and testing data
        training_data = descriptors_training.loc[descriptors_training['process_id'] != gen_process_number]
        X_train = training_data.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
        y_train = training_data['is_causal']

        testing_data = descriptors_training.loc[(descriptors_training['process_id'] == gen_process_number) & (descriptors_training['n_variables'] == n_variables) & (descriptors_training['max_neighborhood_size'] == max_neighborhood_size) & (descriptors_training['noise_std'] == noise_std)]

        model = BalancedRandomForestClassifier(n_estimators=100, random_state=0, n_jobs=50, max_depth=None, sampling_strategy='auto', replacement=True, bootstrap=False)
        # model = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=50, max_depth=10)

        model.fit(X_train, y_train)

        rocs = {}
        precisions = {}
        recalls = {}
        f1s = {}
        for graph_id in range(40):
            #load testing descriptors
            test_df = testing_data.loc[testing_data['graph_id'] == graph_id]
            test_df = test_df.sort_values(by=['edge_source','edge_dest']).reset_index(drop=True) # sort for coherence

            X_test = test_df.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
            y_test = test_df['is_causal']

            y_pred_proba = model.predict_proba(X_test)[:,1]
            y_pred = model.predict(X_test)

            roc = roc_auc_score(y_test, y_pred_proba)
            precision = precision_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            
            rocs[graph_id] = roc
            precisions[graph_id] = precision
            recalls[graph_id] = recall
            f1s[graph_id] = f1

        td2c_rocs_process[gen_process_number] = rocs
        td2c_precision_process[gen_process_number] = precisions
        td2c_recall_process[gen_process_number] = recalls
        td2c_f1_process[gen_process_number] = f1s

    # pickle everything
    with open(f'journal_results_t2dc_N{n_vars}.pkl', 'wb') as f:
        everything = (td2c_rocs_process, td2c_precision_process, td2c_recall_process, td2c_f1_process)
        pickle.dump(everything, f)
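
The loop above follows a leave-one-process-out scheme: the classifier is trained on every generative process except the one under test, then scored graph by graph. The sketch below isolates the split and adds a guarded ROC-AUC. The names `META_COLUMNS`, `leave_one_process_out` and `safe_roc_auc` are hypothetical, and the split is simplified: it holds out the whole process, without the extra filters on `n_variables`, `max_neighborhood_size` and `noise_std` applied in the cell above. The guard is there because ROC-AUC is undefined when a graph's labels contain a single class (depending on the version, scikit-learn raises or warns in that case).

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Columns that identify a row rather than describe it (same list as in the cell above).
META_COLUMNS = ['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size',
                'noise_std', 'edge_source', 'edge_dest', 'is_causal']


def leave_one_process_out(descriptors: pd.DataFrame, process_id: int):
    """Return (X_train, y_train, test_frame) with `process_id` held out."""
    held_out = descriptors['process_id'] == process_id
    train = descriptors.loc[~held_out]
    test = descriptors.loc[held_out]
    return train.drop(columns=META_COLUMNS), train['is_causal'], test


def safe_roc_auc(y_true, y_score) -> float:
    """ROC-AUC, or NaN when the labels contain a single class."""
    if pd.Series(y_true).nunique() < 2:
        return float('nan')
    return roc_auc_score(y_true, y_score)
```

Returning NaN keeps the per-graph dictionaries complete, and `DataFrame.mean()` skips NaN values by default when the results are averaged later.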

### TD2C CMIKNN

In [None]:
dfs = []
descriptors_root = '/home/jpalombarini/td2c/notebooks/paper_td2c/.descriptors_ts_cmiknn_entropy/'
for testing_file in sorted(os.listdir(descriptors_root)):
    if testing_file.endswith('.pkl'):
        df = pd.read_pickle(descriptors_root + testing_file)
        dfs.append(df)

descriptors_training = pd.concat(dfs, axis=0).reset_index(drop=True)

In [None]:

for n_vars, todo in todos.items():
    td2c_rocs_process = {}
    td2c_precision_process = {}
    td2c_recall_process = {}
    td2c_f1_process = {}
    for testing_file in tqdm(todo):
        gen_process_number = int(testing_file.split('_')[0][1:])
        n_variables = int(testing_file.split('_')[1][1:])
        max_neighborhood_size = int(testing_file.split('_')[2][2:])
        noise_std = float(testing_file.split('_')[3][1:-4])

        # split training and testing data
        training_data = descriptors_training.loc[descriptors_training['process_id'] != gen_process_number]
        X_train = training_data.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
        y_train = training_data['is_causal']

        testing_data = descriptors_training.loc[(descriptors_training['process_id'] == gen_process_number) & (descriptors_training['n_variables'] == n_variables) & (descriptors_training['max_neighborhood_size'] == max_neighborhood_size) & (descriptors_training['noise_std'] == noise_std)]

        model = BalancedRandomForestClassifier(n_estimators=100, random_state=0, n_jobs=50, max_depth=None, sampling_strategy='auto', replacement=True, bootstrap=False)
        # model = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=50, max_depth=10)

        model.fit(X_train, y_train)

        rocs = {}
        precisions = {}
        recalls = {}
        f1s = {}
        for graph_id in range(40):
            #load testing descriptors
            test_df = testing_data.loc[testing_data['graph_id'] == graph_id]
            test_df = test_df.sort_values(by=['edge_source','edge_dest']).reset_index(drop=True) # sort for coherence

            X_test = test_df.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
            y_test = test_df['is_causal']

            y_pred_proba = model.predict_proba(X_test)[:,1]
            y_pred = model.predict(X_test)

            roc = roc_auc_score(y_test, y_pred_proba)
            precision = precision_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            
            rocs[graph_id] = roc
            precisions[graph_id] = precision
            recalls[graph_id] = recall
            f1s[graph_id] = f1

        td2c_rocs_process[gen_process_number] = rocs
        td2c_precision_process[gen_process_number] = precisions
        td2c_recall_process[gen_process_number] = recalls
        td2c_f1_process[gen_process_number] = f1s

    # pickle everything
    with open(f'journal_results_t2dc_cmiknn_N{n_vars}.pkl', 'wb') as f:
        everything = (td2c_rocs_process, td2c_precision_process, td2c_recall_process, td2c_f1_process)
        pickle.dump(everything, f)

In [None]:
td2c_rocs_process

In [None]:
# # load 
# with open('journal_results_t2dc_N5.pkl', 'rb') as f:
#     td2c_rocs_process, td2c_precision_process, td2c_recall_process, td2c_f1_process = pickle.load(f)
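
Each results file in this notebook is a pickled 4-tuple of nested dictionaries in the order ROC-AUC, precision, recall, F1. A minimal sketch of a loader that names those positions follows; `MethodResults` and `load_results` are hypothetical helpers, not part of the `d2c` package.

```python
import pickle
from typing import Dict, NamedTuple

Scores = Dict[int, Dict[int, float]]  # process id -> graph id -> score


class MethodResults(NamedTuple):
    roc: Scores
    precision: Scores
    recall: Scores
    f1: Scores


def load_results(path: str) -> MethodResults:
    """Load a 4-tuple pickled as (rocs, precisions, recalls, f1s)."""
    with open(path, 'rb') as f:
        return MethodResults(*pickle.load(f))
```

Accessing `results.f1` instead of unpacking by position makes it harder to mix up the metrics when several methods are loaded side by side.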

In [None]:
# mix_td2c = pd.concat([pd.DataFrame(td2c_rocs_process).mean(),pd.DataFrame(td2c_precision_process).mean(),pd.DataFrame(td2c_recall_process).mean(),pd.DataFrame(td2c_f1_process).mean()],axis=1)
# mix_td2c.columns = ['roc','precision','recall','f1']
# #index name 'generative process'
# mix_td2c.index.name = 'generative process'
# mix_td2c
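
The aggregation in the commented cell above averages each metric over the graphs of a generative process. The same idea can be written as a reusable function; `summarize` is a hypothetical name and the inputs are assumed to be the nested `{process: {graph: score}}` dictionaries built in this notebook.

```python
import pandas as pd


def summarize(rocs, precisions, recalls, f1s) -> pd.DataFrame:
    """Average each metric over graphs, giving one row per generative process."""
    metrics = {'roc': rocs, 'precision': precisions, 'recall': recalls, 'f1': f1s}
    # pd.DataFrame(nested) puts processes in columns and graphs in rows,
    # so .mean() averages over graphs for each process.
    table = pd.DataFrame(
        {name: pd.DataFrame(values).mean() for name, values in metrics.items()}
    )
    table.index.name = 'generative process'
    return table
```

Calling it once per method gives tables with identical columns, which can then be compared directly.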

In [None]:
# mix_pcmci = pd.concat([pd.DataFrame(pcmci_rocs_process).mean(),pd.DataFrame(pcmci_precision_process).mean(),pd.DataFrame(pcmci_recall_process).mean(),pd.DataFrame(pcmci_f1_process).mean()],axis=1)
# mix_pcmci.columns = ['roc','precision','recall','f1']
# #index name 'generative process'
# mix_pcmci.index.name = 'generative process'
# mix_pcmci

### Granger

In [None]:
#ETA 8 min

from d2c.benchmark.granger import Granger
# suppress Future Warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

for n_vars, todo in todos.items():

        granger_rocs_process = {}
        granger_precision_process = {}
        granger_recall_process = {}
        granger_f1_process = {}

        for testing_file in tqdm(todo):
                gen_process_number = int(testing_file.split('_')[0][1:])
                n_variables = int(testing_file.split('_')[1][1:])
                max_neighborhood_size = int(testing_file.split('_')[2][2:])
                noise_std = float(testing_file.split('_')[3][1:-4])

                # load original data for truth values
                dataloader = DataLoader(n_variables = n_variables,
                                maxlags = maxlags)
                dataloader.from_pickle(data_root+testing_file)
                observations = dataloader.get_original_observations()
                true_causal_dfs = dataloader.get_true_causal_dfs()


                granger = Granger(ts_list=observations, maxlags=maxlags, n_jobs=40)
                granger.run()
                causal_dfs_granger = granger.get_causal_dfs()
                rocs = {}
                precisions = {}
                recalls = {}
                f1s = {}
                for i in range(40):

                        y_pred = causal_dfs_granger[i]['is_causal'].astype(int)
                        y_pred_proba = 1 - causal_dfs_granger[i]['p_value']
                        y_test = true_causal_dfs[i]['is_causal'].astype(int)

                        roc = roc_auc_score(y_test, y_pred_proba)
                        precision = precision_score(y_test, y_pred, zero_division=0)
                        recall = recall_score(y_test, y_pred)
                        f1 = f1_score(y_test, y_pred)
                        
                        rocs[i] = roc
                        precisions[i] = precision
                        recalls[i] = recall
                        f1s[i] = f1

                granger_rocs_process[gen_process_number] = rocs
                granger_precision_process[gen_process_number] = precisions
                granger_recall_process[gen_process_number] = recalls
                granger_f1_process[gen_process_number] = f1s

        # pickle everything
        with open(f'journal_results_granger_N{n_vars}.pkl', 'wb') as f:
                everything = (granger_rocs_process, granger_precision_process, granger_recall_process, granger_f1_process)
                pickle.dump(everything, f)
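
The Granger cell scores each candidate edge with `1 - p_value`, so that a smaller p-value gives a higher score. ROC-AUC depends only on the ranking of these scores, whereas precision, recall and F1 depend on the hard decision stored in `is_causal`. The toy example below illustrates the difference; the data and the significance level `ALPHA` are made up for illustration and are not taken from the `Granger` wrapper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ALPHA = 0.05  # hypothetical significance level, not taken from the notebook

p_values = np.array([0.01, 0.50, 0.20, 0.90])  # toy p-values, one per candidate edge
y_true = np.array([1, 0, 1, 0])                # toy ground truth

scores = 1.0 - p_values                  # larger score = stronger evidence for an edge
y_pred = (p_values < ALPHA).astype(int)  # hard decision at the chosen level

auc = roc_auc_score(y_true, scores)
print(auc, y_pred)
```

Here both true edges rank above both non-edges, so the ROC-AUC is perfect even though the hard decision at `ALPHA` misses the edge with p-value 0.20.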

In [None]:
# pickle everything
with open('journal_results_granger_N25.pkl', 'wb') as f:
    everything = (granger_rocs_process, granger_precision_process, granger_recall_process, granger_f1_process)
    pickle.dump(everything, f)

### PCMCI

In [None]:
#ETA 50 min 

from d2c.benchmark.pcmci import PCMCI

for n_vars, todo in todos.items():

        pcmci_rocs_process = {}
        pcmci_precision_process = {}
        pcmci_recall_process = {}
        pcmci_f1_process = {}

        for testing_file in tqdm(todo):
                gen_process_number = int(testing_file.split('_')[0][1:])
                n_variables = int(testing_file.split('_')[1][1:])
                max_neighborhood_size = int(testing_file.split('_')[2][2:])
                noise_std = float(testing_file.split('_')[3][1:-4])

                # load original data for truth values
                dataloader = DataLoader(n_variables = n_variables,
                                maxlags = maxlags)
                dataloader.from_pickle(data_root+testing_file)
                observations = dataloader.get_original_observations()
                true_causal_dfs = dataloader.get_true_causal_dfs()


                pcmci = PCMCI(ts_list=observations, maxlags=maxlags, n_jobs=40, ci="ParCorr")
                pcmci.run()
                causal_dfs_pcmci = pcmci.get_causal_dfs()
                rocs = {}
                precisions = {}
                recalls = {}
                f1s = {}
                for i in range(40):

                        y_pred = causal_dfs_pcmci[i]['is_causal'].astype(int)
                        y_pred_proba = 1 - causal_dfs_pcmci[i]['p_value']
                        y_test = true_causal_dfs[i]['is_causal'].astype(int)

                        roc = roc_auc_score(y_test, y_pred_proba)
                        precision = precision_score(y_test, y_pred)
                        recall = recall_score(y_test, y_pred)
                        f1 = f1_score(y_test, y_pred)
                        
                        rocs[i] = roc
                        precisions[i] = precision
                        recalls[i] = recall
                        f1s[i] = f1

                pcmci_rocs_process[gen_process_number] = rocs
                pcmci_precision_process[gen_process_number] = precisions
                pcmci_recall_process[gen_process_number] = recalls
                pcmci_f1_process[gen_process_number] = f1s

        # pickle everything
        with open(f'journal_results_pcmci_N{n_vars}.pkl', 'wb') as f:
                everything = (pcmci_rocs_process, pcmci_precision_process, pcmci_recall_process, pcmci_f1_process)
                pickle.dump(everything, f)

In [None]:
# mix_pcmci = pd.concat([pd.DataFrame(pcmci_rocs_process).mean(),pd.DataFrame(pcmci_precision_process).mean(),pd.DataFrame(pcmci_recall_process).mean(),pd.DataFrame(pcmci_f1_process).mean()],axis=1)
# mix_pcmci.columns = ['roc','precision','recall','f1']
# #index name 'generative process'
# mix_pcmci.index.name = 'generative process'
# mix_pcmci

In [None]:
# ! conda install -y seaborn

In [None]:
# # boxplot mix_pcmci and mix_td2c
# import matplotlib.pyplot as plt
# import seaborn as sns

# fig, ax = plt.subplots(2, 2, figsize=(20, 10))

# sns.boxplot(data=mix_pcmci, ax=ax[0, 0])
# ax[0, 0].set_title('PCMCI')
# ax[0, 0].set_ylabel('Score')
# ax[0, 0].set_xlabel('Metric')

# sns.boxplot(data=mix_td2c, ax=ax[0, 1])
# ax[0, 1].set_title('TD2C')
# ax[0, 1].set_ylabel('Score')
# ax[0, 1].set_xlabel('Metric')



### Dynotears

In [None]:
from d2c.benchmark.dynotears import DYNOTEARS
N_JOBS = 40

for n_vars, todo in todos.items():

        dyno_rocs_process = {}
        dyno_precision_process = {}
        dyno_recall_process = {}
        dyno_f1_process = {}

        for testing_file in tqdm(todo):
                gen_process_number = int(testing_file.split('_')[0][1:])
                n_variables = int(testing_file.split('_')[1][1:])
                max_neighborhood_size = int(testing_file.split('_')[2][2:])
                noise_std = float(testing_file.split('_')[3][1:-4])

                # load original data for truth values
                dataloader = DataLoader(n_variables = n_variables,
                                maxlags = maxlags)
                dataloader.from_pickle(data_root+testing_file)
                observations = dataloader.get_original_observations()
                true_causal_dfs = dataloader.get_true_causal_dfs()

                dynotears = DYNOTEARS(ts_list=observations, maxlags=maxlags, n_jobs=N_JOBS)
                dynotears.run()
                causal_dfs_dynotears = dynotears.get_causal_dfs()

                rocs = {}
                precisions = {}
                recalls = {}
                f1s = {}
                for i in range(40):

                        y_pred = causal_dfs_dynotears[i]['is_causal'].astype(int)
                        y_pred_proba = 1 - causal_dfs_dynotears[i]['p_value']
                        y_test = true_causal_dfs[i]['is_causal'].astype(int)

                        roc = None  # ROC-AUC is not computed for DYNOTEARS in this run
                        precision = precision_score(y_test, y_pred, zero_division=0)
                        recall = recall_score(y_test, y_pred)
                        f1 = f1_score(y_test, y_pred)
                        
                        rocs[i] = roc
                        precisions[i] = precision
                        recalls[i] = recall
                        f1s[i] = f1

                dyno_rocs_process[gen_process_number] = rocs
                dyno_precision_process[gen_process_number] = precisions
                dyno_recall_process[gen_process_number] = recalls
                dyno_f1_process[gen_process_number] = f1s

        # pickle everything
        with open(f'journal_results_dyno_N{n_vars}.pkl', 'wb') as f:
                everything = (dyno_rocs_process, dyno_precision_process, dyno_recall_process, dyno_f1_process)
                pickle.dump(everything, f)

In [None]:
# mix_dyno = pd.concat([pd.DataFrame(dyno_rocs_process).mean(),pd.DataFrame(dyno_precision_process).mean(),pd.DataFrame(dyno_recall_process).mean(),pd.DataFrame(dyno_f1_process).mean()],axis=1)
# mix_dyno.columns = ['roc','precision','recall','f1']
# #index name 'generative process'
# mix_dyno.index.name = 'generative process'
# mix_dyno

### VARLiNGAM

In [None]:
import os
N_JOBS = 40
from d2c.benchmark.varlingam import VARLiNGAM

# Limit each numerical backend to 1 thread
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'

# def complete_causal_df(causal_df, n_variables, maxlags):
#     causal_df = causal_df.copy()
#     all_pairs = [(from_, to) for from_ in range(n_variables,n_variables * (maxlags + 1)) for to in range(n_variables)]
    
#     existing_pairs = set(zip(causal_df['from'], causal_df['to']))
#     missing_pairs = [(from_, to) for from_, to in all_pairs if (from_, to) not in existing_pairs]
    
#     # Create all missing rows at once if there are any missing pairs
#     if missing_pairs:
#         missing_rows = pd.DataFrame(missing_pairs, columns=['from', 'to'])
#         missing_rows['effect'] = 0.0
#         missing_rows['p-value'] = None
#         missing_rows['probability'] = 0.0
#         missing_rows['is_causal'] = False
#         causal_df = pd.concat([causal_df, missing_rows], ignore_index=True)
    
#     return causal_df.sort_values(by=['from', 'to']).reset_index(drop=True)

for n_vars, todo in todos.items():
        
        if n_vars != '25':
                continue


        varlingam_rocs_process = {}
        varlingam_precision_process = {}
        varlingam_recall_process = {}
        varlingam_f1_process = {}

        for testing_file in tqdm(todo):
                gen_process_number = int(testing_file.split('_')[0][1:])
                n_variables = int(testing_file.split('_')[1][1:])
                max_neighborhood_size = int(testing_file.split('_')[2][2:])
                noise_std = float(testing_file.split('_')[3][1:-4])

                # load original data for truth values
                dataloader = DataLoader(n_variables = n_variables,
                                maxlags = maxlags)
                dataloader.from_pickle(data_root+testing_file)
                observations = dataloader.get_original_observations()
                true_causal_dfs = dataloader.get_true_causal_dfs()

                varlingam = VARLiNGAM(ts_list=observations, maxlags=maxlags, n_jobs=N_JOBS)
                varlingam.run()
                causal_dfs_varlingam = varlingam.get_causal_dfs()

                # causal_dfs_varlingam = [complete_causal_df(causal_df, n_variables,maxlags) for causal_df in causal_dfs_varlingam.values()]

                rocs = {}
                precisions = {}
                recalls = {}
                f1s = {}
                for i in range(40):

                        y_pred = causal_dfs_varlingam[i]['is_causal'].astype(int)
                        y_pred_proba = causal_dfs_varlingam[i]['probability']
                        y_test = true_causal_dfs[i]['is_causal'].astype(int)

                        roc = roc_auc_score(y_test, y_pred_proba)
                        precision = precision_score(y_test, y_pred)
                        recall = recall_score(y_test, y_pred)
                        f1 = f1_score(y_test, y_pred)
                        
                        rocs[i] = roc
                        precisions[i] = precision
                        recalls[i] = recall
                        f1s[i] = f1

                varlingam_rocs_process[gen_process_number] = rocs
                varlingam_precision_process[gen_process_number] = precisions
                varlingam_recall_process[gen_process_number] = recalls
                varlingam_f1_process[gen_process_number] = f1s

        # pickle everything
        with open(f'journal_results_varlingam_N{n_vars}.pkl', 'wb') as f:
                everything = (varlingam_rocs_process, varlingam_precision_process, varlingam_recall_process, varlingam_f1_process)
                pickle.dump(everything, f)
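
The cell above limits the numerical backends to one thread through environment variables before running VARLiNGAM with parallel jobs. Such variables are generally read when the numerical library is first loaded, so setting them after `numpy` has been imported may not affect the current process, although worker processes started afterwards can still inherit them. A minimal sketch of setting the limits at the very top of a script or notebook follows; `limit_native_threads` is a hypothetical helper.

```python
import os

THREAD_VARS = ('MKL_NUM_THREADS', 'NUMEXPR_NUM_THREADS',
               'OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS')


def limit_native_threads(n_threads: int = 1) -> None:
    """Cap the thread pools of common numerical backends via environment variables."""
    for name in THREAD_VARS:
        os.environ[name] = str(n_threads)


limit_native_threads(1)
# Import numpy / scipy / scikit-learn only after this point.
```

Keeping one thread per worker avoids oversubscribing the CPU when many jobs run in parallel.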

In [None]:
mix_varlingam = pd.concat([pd.DataFrame(varlingam_rocs_process).mean(),pd.DataFrame(varlingam_precision_process).mean(),pd.DataFrame(varlingam_recall_process).mean(),pd.DataFrame(varlingam_f1_process).mean()],axis=1)
mix_varlingam.columns = ['roc','precision','recall','f1']
#index name 'generative process'
mix_varlingam.index.name = 'generative process'
mix_varlingam

In [None]:
# pickle everything
with open('journal_results_N5.pkl', 'wb') as f:
    everything = (td2c_rocs_process, td2c_precision_process, td2c_recall_process, td2c_f1_process, pcmci_rocs_process, pcmci_precision_process, pcmci_recall_process, pcmci_f1_process, dyno_rocs_process, dyno_precision_process, dyno_recall_process, dyno_f1_process, varlingam_rocs_process, varlingam_precision_process, varlingam_recall_process, varlingam_f1_process)
    pickle.dump(everything, f)

In [None]:
# load 
import pickle
with open('journal_results_N5.pkl', 'rb') as f:
    td2c_rocs_process, td2c_precision_process, td2c_recall_process, td2c_f1_process, pcmci_rocs_process, pcmci_precision_process, pcmci_recall_process, pcmci_f1_process, dyno_rocs_process, dyno_precision_process, dyno_recall_process, dyno_f1_process, varlingam_rocs_process, varlingam_precision_process, varlingam_recall_process, varlingam_f1_process = pickle.load(f)

In [None]:
import matplotlib.pyplot as plt

df1 = pd.DataFrame(td2c_rocs_process)
df2 = pd.DataFrame(pcmci_rocs_process)
df3 = pd.DataFrame(varlingam_rocs_process)

# Combine data for boxplot
combined_data = []

for col in df1.columns:
    combined_data.append(df1[col])
    combined_data.append(df2[col])
    combined_data.append(df3[col])

# Create labels for x-axis
labels = []
for col in df1.columns:
    labels.append(f'{col} TD2C')
    labels.append(f'{col} PCMCI')
    labels.append(f'{col} VARLiNGAM')

# Plotting
plt.figure(figsize=(12, 6))
box = plt.boxplot(combined_data, patch_artist=True)

# Color customization
colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
for i, patch in enumerate(box['boxes']):
    patch.set_facecolor(colors[i % 3])


plt.xticks(range(1, len(labels) + 1), labels, rotation=-90)
plt.title('Side-by-Side ROC-AUC Boxplots per Process')
plt.xlabel('Processes')
plt.ylabel('Values')
plt.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt

df1 = pd.DataFrame(td2c_f1_process)
df2 = pd.DataFrame(pcmci_f1_process)
df3 = pd.DataFrame(varlingam_f1_process)
df4 = pd.DataFrame(dyno_f1_process)

# Combine data for boxplot
combined_data = []

for col in df1.columns:
    combined_data.append(df1[col])
    combined_data.append(df2[col])
    combined_data.append(df3[col])
    combined_data.append(df4[col])

# Create labels for x-axis
labels = []
for col in df1.columns:
    labels.append(f'{col} td2c')
    labels.append(f'{col} pcmci')
    labels.append(f'{col} varlingam')
    labels.append(f'{col} dyno')

# Plotting
plt.figure(figsize=(12, 6))
box = plt.boxplot(combined_data, patch_artist=True)

# Color customization
colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
for i, patch in enumerate(box['boxes']):
    # One color per method, repeated for every process (td2c, pcmci, varlingam, dyno)
    patch.set_facecolor(colors[i % len(colors)])

plt.xticks(range(1, len(labels) + 1), labels, rotation=-90)
plt.title('F1 Score per Process: TD2C vs PCMCI vs VARLiNGAM vs DYNOTEARS')
plt.xlabel('Processes')
plt.ylabel('F1 Score')
plt.tight_layout()
plt.show()
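
The two cells above build the interleaved boxplot data and the tick labels with near-identical loops. A minimal sketch of a shared helper follows, assuming every DataFrame has the same columns (one per process). The name `interleave_methods` and the toy frames are hypothetical.

```python
import pandas as pd


def interleave_methods(frames):
    """Return (data, labels) with the methods interleaved process by process.

    `frames` maps a method name to a DataFrame whose columns are process ids.
    """
    first = next(iter(frames.values()))
    data, labels = [], []
    for col in first.columns:
        for method, frame in frames.items():
            data.append(frame[col])
            labels.append(f'{col} {method}')
    return data, labels


# Toy frames with made-up scores: two processes, two graphs each
toy_frames = {
    'TD2C': pd.DataFrame({1: [0.9, 0.8], 2: [0.7, 0.6]}),
    'PCMCI': pd.DataFrame({1: [0.5, 0.4], 2: [0.3, 0.2]}),
}
data, labels = interleave_methods(toy_frames)
```

The returned `data` can be passed to `plt.boxplot(data, patch_artist=True)`, and coloring box `i` with `colors[i % len(frames)]` keeps one color per method.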

In [None]:
df1.columns

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Assuming td2c_f1_process, pcmci_f1_process, varlingam_f1_process, and dyno_f1_process are defined

df1 = pd.DataFrame(td2c_f1_process)
df2 = pd.DataFrame(pcmci_f1_process)
df3 = pd.DataFrame(varlingam_f1_process)
df4 = pd.DataFrame(dyno_f1_process)

# Number of processes
num_processes = df1.shape[1]
fontsize = 7
# Plotting each process separately in a grid of 6 columns and 3 rows
fig, axes = plt.subplots(3, 6, figsize=(10, 7), sharex=True, sharey=True)

colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']

for i, col in enumerate([1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20]):
    combined_data = [df1[col], df2[col], df3[col], df4[col]]
    labels = ['TD2C', 'PCMCI', 'VARLINGAM', 'DYNOTEARS']
    
    row, col_idx = divmod(i, 6)
    box = axes[row, col_idx].boxplot(combined_data, patch_artist=True)
    
    for patch, j in zip(box['boxes'], range(len(box['boxes']))):
        patch.set_facecolor(colors[j % 4])
    
    axes[row, col_idx].set_title(f'Process {col}')
    axes[row, col_idx].title.set_fontsize(fontsize)
    axes[row, col_idx].set_xticks(range(1, len(labels) + 1))
    axes[row, col_idx].set_xticklabels(labels, rotation=-90)
    axes[row, col_idx].tick_params(axis='x', labelsize=fontsize)
    if col_idx == 0:
        axes[row, col_idx].set_ylabel('F1 Score')
        axes[row, col_idx].yaxis.label.set_size(fontsize)
    # Show grid lines so values are easier to compare across panels
    axes[row, col_idx].grid(True)


# Remove any empty subplots if the number of processes is less than 18
if num_processes < 18:
    for i in range(num_processes, 18):
        row, col_idx = divmod(i, 6)
        fig.delaxes(axes[row, col_idx])

plt.tight_layout()
# Save the figure as a vector-graphics PDF
plt.savefig('f1_scores_N5.pdf', format='pdf')

plt.show()


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Assuming td2c_rocs_process, pcmci_rocs_process, and varlingam_rocs_process are defined

df1 = pd.DataFrame(td2c_rocs_process)
df2 = pd.DataFrame(pcmci_rocs_process)
df3 = pd.DataFrame(varlingam_rocs_process)

# Number of processes
num_processes = df1.shape[1]
fontsize = 7
# Plotting each process separately in a grid of 6 columns and 3 rows
fig, axes = plt.subplots(3, 6, figsize=(10, 7), sharex=True, sharey=True)

colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']

for i, col in enumerate([1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20]):
    combined_data = [df1[col], df2[col], df3[col]]
    labels = ['TD2C', 'PCMCI', 'VARLINGAM']
    
    row, col_idx = divmod(i, 6)
    box = axes[row, col_idx].boxplot(combined_data, patch_artist=True)
    
    for patch, j in zip(box['boxes'], range(len(box['boxes']))):
        patch.set_facecolor(colors[j % 3])
    
    axes[row, col_idx].set_title(f'Process {col}')
    axes[row, col_idx].title.set_fontsize(fontsize)
    axes[row, col_idx].set_xticks(range(1, len(labels) + 1))
    axes[row, col_idx].set_xticklabels(labels, rotation=-90)
    axes[row, col_idx].tick_params(axis='x', labelsize=fontsize)
    if col_idx == 0:
        axes[row, col_idx].set_ylabel('ROC AUC')
        axes[row, col_idx].yaxis.label.set_size(fontsize)
    # Show grid lines so values are easier to compare across panels
    axes[row, col_idx].grid(True)


# Remove any empty subplots if the number of processes is less than 18
if num_processes < 18:
    for i in range(num_processes, 18):
        row, col_idx = divmod(i, 6)
        fig.delaxes(axes[row, col_idx])

plt.tight_layout()
# Save the figure as a vector-graphics PDF
plt.savefig('roc_scores_N5.pdf', format='pdf')

plt.show()


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Assuming td2c_f1_process, pcmci_f1_process, varlingam_f1_process, and dyno_f1_process are defined

df1 = pd.DataFrame(td2c_f1_process)
df2 = pd.DataFrame(pcmci_f1_process)
df3 = pd.DataFrame(varlingam_f1_process)
df4 = pd.DataFrame(dyno_f1_process)

# Concatenate the data for each method across all processes
combined_td2c = pd.concat([df1[col] for col in df1.columns], ignore_index=True)
combined_pcmci = pd.concat([df2[col] for col in df2.columns], ignore_index=True)
combined_varlingam = pd.concat([df3[col] for col in df3.columns], ignore_index=True)
combined_dyno = pd.concat([df4[col] for col in df4.columns], ignore_index=True)

# Collect all methods into one list for plotting
combined_data = [combined_td2c, combined_pcmci, combined_varlingam, combined_dyno]
labels = ['TD2C', 'PCMCI', 'VARLINGAM', 'DYNOTEARS']

# Create a single boxplot
fig, ax = plt.subplots(figsize=(8, 6))
box = ax.boxplot(combined_data, patch_artist=True)

colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)

ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels, rotation=-90, fontsize=10)
ax.set_ylabel('F1 Score', fontsize=10)
ax.set_title('Combined F1 Scores for All Processes - N=5', fontsize=12)
ax.grid(True)

plt.tight_layout()
plt.savefig('combined_f1_scores_N5.pdf', format='pdf')

plt.show()
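
The `pd.concat` calls above pool the scores of every process into one long Series per method, column by column. As a small illustration of what that pooling does, the sketch below compares it with `DataFrame.melt`, which yields the same values in the same order and also keeps the process id. The frame `toy_scores` and its values are made up.

```python
import pandas as pd

# Toy score table: columns are process ids, rows are graphs (made-up values)
toy_scores = pd.DataFrame({1: [0.9, 0.8], 2: [0.7, 0.6], 3: [0.5, 0.4]})

# Pooling as done in the notebook: stack the columns one after another
pooled_concat = pd.concat(
    [toy_scores[col] for col in toy_scores.columns], ignore_index=True
)

# Same values in long format, with the originating process kept alongside
pooled_melt = toy_scores.melt(var_name='process', value_name='score')
```

The long format is convenient when the pooled scores later need to be grouped or filtered by process again.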


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Assuming td2c_rocs_process, pcmci_rocs_process, and varlingam_rocs_process are defined

df1 = pd.DataFrame(td2c_rocs_process)
df2 = pd.DataFrame(pcmci_rocs_process)
df3 = pd.DataFrame(varlingam_rocs_process)

# Concatenate the data for each method across all processes
combined_td2c = pd.concat([df1[col] for col in df1.columns], ignore_index=True)
combined_pcmci = pd.concat([df2[col] for col in df2.columns], ignore_index=True)
combined_varlingam = pd.concat([df3[col] for col in df3.columns], ignore_index=True)

# Collect all methods into one list for plotting
combined_data = [combined_td2c, combined_pcmci, combined_varlingam]
labels = ['TD2C', 'PCMCI', 'VARLINGAM']

# Create a single boxplot
fig, ax = plt.subplots(figsize=(8, 6))
box = ax.boxplot(combined_data, patch_artist=True)

colors = ['lightblue', 'lightgreen', 'lightcoral']
for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)

ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels, rotation=-90, fontsize=10)
ax.set_ylabel('ROC AUC', fontsize=10)
ax.set_title('Combined ROC AUC Scores for All Processes - N=5', fontsize=12)
ax.grid(True)

plt.tight_layout()
plt.savefig('combined_roc_scores_N5.pdf', format='pdf')

plt.show()


In [None]:

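from sklearn.metrics import roc_auc_score

# Leave-one-process-out evaluation: for each generative process, train on the
# descriptors of all other processes and test on the held-out one.
rocs_process = {}
for process in descriptors_training['process_id'].unique():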

    training_data = descriptors_training.loc[descriptors_training['process_id'] != process]
    X_train = training_data.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
    y_train = training_data['is_causal']

    model = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=50)

    model.fit(X_train, y_train)

    rocs = {}
    for i in range(40):
        test_df = pd.read_csv(f'./d2c_benchmark/P{process}_N5_Nj2_n0.01.pkl_{i}.csv', index_col=0)
        test_df = test_df.sort_values(by=['edge_source','edge_dest']).reset_index(drop=True)

        X_test = test_df.drop(columns=['graph_id', 'edge_source', 'edge_dest', 'is_causal'])
        # Ground-truth labels; assumes true_causal_dfs[i] is ordered like the sorted test_df
        y_test = true_causal_dfs[i]['is_causal']


        y_pred = model.predict_proba(X_test)[:,1]

        roc = roc_auc_score(y_test, y_pred)

        rocs[i] = roc

    rocs_process[process] = rocs
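
The loop above holds out one generative process at a time. A minimal, self-contained sketch of the same idea on synthetic data follows, using scikit-learn's `LeaveOneGroupOut` to produce the splits. Unlike the notebook, which tests on separately stored CSV files, this sketch tests on the held-out rows of the same table. The arrays `X`, `y`, and `process_id` are synthetic stand-ins, not the real descriptors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in for the descriptor table: 300 rows, 4 features,
# and a group label in {1, 2, 3} playing the role of `process_id`.
n_rows = 300
X = rng.normal(size=(n_rows, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)
process_id = np.repeat([1, 2, 3], n_rows // 3)

toy_rocs_process = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=process_id):
    held_out = int(process_id[test_idx][0])
    # Train on every process except the held-out one
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Score the held-out process with the probability of the positive class
    scores = model.predict_proba(X[test_idx])[:, 1]
    toy_rocs_process[held_out] = roc_auc_score(y[test_idx], scores)
```

Each entry of `toy_rocs_process` is then the ROC AUC obtained on a process the model never saw during training.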




In [None]:
data_root = '../../data/new_data/'
destination_root = './'
destination = 'd2c_all_couples_MB5_full'
# Create the output folder if it does not exist yet
os.makedirs(os.path.join(destination_root, destination), exist_ok=True)
maxlags = 5
# empty folder ../../data/new_benchmark/
for todo in [to_dos_10_variables]:
    for testing_file in tqdm(todo):
        if testing_file.endswith('.pkl'):
            gen_process_number = int(testing_file.split('_')[0][1:])
            n_variables = int(testing_file.split('_')[1][1:])
            max_neighborhood_size = int(testing_file.split('_')[2][2:])
            noise_std = float(testing_file.split('_')[3][1:-4])
                
            filename = f'{destination_root}/{destination}/P{gen_process_number}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl'

            training_data = descriptors_training.loc[descriptors_training['process_id'] != gen_process_number]
            X_train = training_data.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
            y_train = training_data['is_causal']

            model = BalancedRandomForestClassifier(n_estimators=50, random_state=0, n_jobs=1, replacement=True, sampling_strategy='all')

            model.fit(X_train, y_train)

            dataloader = DataLoader(n_variables = n_variables,
                            maxlags = maxlags)
            dataloader.from_pickle(data_root+testing_file)
            observations = dataloader.get_original_observations()
            true_causal_dfs = dataloader.get_true_causal_dfs()

            d2cwrapper = D2CWrapper(ts_list=observations, n_variables=n_variables, model=model, maxlags=maxlags, n_jobs = 55, full=True)

            d2cwrapper.run()

            causal_df = d2cwrapper.get_causal_dfs()

            # Save the predictions together with the ground truth
            with open(filename, 'wb') as f:
                pickle.dump((causal_df, true_causal_dfs), f)
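
The cell above recovers the run parameters from file names such as `P8_N25_Nj8_n0.005.pkl` by chained `split` and slice operations. A minimal sketch of a stricter alternative follows, assuming every file follows the `P<process>_N<variables>_Nj<neighborhood>_n<noise>.pkl` pattern. The names `RunConfig` and `parse_filename` are hypothetical and not part of the `d2c` package.

```python
import re
from typing import NamedTuple


class RunConfig(NamedTuple):
    process: int
    n_variables: int
    max_neighborhood_size: int
    noise_std: float


# P<process>_N<variables>_Nj<neighborhood>_n<noise>.pkl
_PATTERN = re.compile(r'^P(\d+)_N(\d+)_Nj(\d+)_n([0-9.eE+-]+)\.pkl$')


def parse_filename(name):
    """Parse the run parameters encoded in a benchmark file name."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f'Unexpected file name: {name!r}')
    process, n_vars, nj, noise = match.groups()
    return RunConfig(int(process), int(n_vars), int(nj), float(noise))


config = parse_filename('P8_N25_Nj8_n0.005.pkl')
```

A malformed name raises a `ValueError` instead of silently producing wrong parameters, which the slice-based parsing cannot guarantee.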
    

25 variables

In [None]:
data_root = '../../data/new_data/'
destination_root = '../../data/new_benchmark'
destination = 'd2c_all_couples_MB5_full'

# Create the output folder if it does not exist yet
os.makedirs(os.path.join(destination_root, destination), exist_ok=True)
maxlags = 5
# empty folder ../../data/new_benchmark/

for todo in [to_dos_25_variables]:
    for testing_file in tqdm(todo):
        if testing_file.endswith('.pkl'):
            gen_process_number = int(testing_file.split('_')[0][1:])
            n_variables = int(testing_file.split('_')[1][1:])
            max_neighborhood_size = int(testing_file.split('_')[2][2:])
            noise_std = float(testing_file.split('_')[3][1:-4])
                
            filename = f'{destination_root}/{destination}/P{gen_process_number}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl'

            training_data = descriptors_training.loc[descriptors_training['process_id'] != gen_process_number]
            X_train = training_data.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
            y_train = training_data['is_causal']

            model = BalancedRandomForestClassifier(n_estimators=50, random_state=0, n_jobs=1, replacement=True, sampling_strategy='all')

            model.fit(X_train, y_train)

            dataloader = DataLoader(n_variables = n_variables,
                            maxlags = maxlags)
            dataloader.from_pickle(data_root+testing_file)
            observations = dataloader.get_original_observations()
            true_causal_dfs = dataloader.get_true_causal_dfs()


            d2cwrapper = D2CWrapper(ts_list=observations, n_variables=n_variables, model=model, maxlags=maxlags, n_jobs = 55, full=True)

            d2cwrapper.run()

            causal_df = d2cwrapper.get_causal_dfs()

            # Save the predictions together with the ground truth
            with open(filename, 'wb') as f:
                pickle.dump((causal_df, true_causal_dfs), f)
    

In [None]:
# handle the two missing files separately
missing_files = ['P8_N25_Nj8_n0.005.pkl','P9_N25_Nj8_n0.005.pkl']

data_root = '../../data/new_data/'
destination_root = '../../data/new_benchmark'
destination = 'd2c_all_couples_MB5_full'

maxlags = 5
# empty folder ../../data/new_benchmark/


for testing_file in tqdm(missing_files):
    if testing_file.endswith('.pkl'):
        gen_process_number = int(testing_file.split('_')[0][1:])
        n_variables = int(testing_file.split('_')[1][1:])
        max_neighborhood_size = int(testing_file.split('_')[2][2:])
        noise_std = float(testing_file.split('_')[3][1:-4])
            
        filename = f'{destination_root}/{destination}/P{gen_process_number}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl'

        training_data = descriptors_training.loc[descriptors_training['process_id'] != gen_process_number]
        X_train = training_data.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
        y_train = training_data['is_causal']

        model = BalancedRandomForestClassifier(n_estimators=50, random_state=0, n_jobs=1, replacement=True, sampling_strategy='all')

        model.fit(X_train, y_train)

        dataloader = DataLoader(n_variables = n_variables,
                        maxlags = maxlags)
        dataloader.from_pickle(data_root+testing_file)
        observations = dataloader.get_original_observations()
        true_causal_dfs = dataloader.get_true_causal_dfs()


        d2cwrapper = D2CWrapper(ts_list=observations, n_variables=n_variables, model=model, maxlags=maxlags, n_jobs = 55, full=True)

        d2cwrapper.run()

        causal_df = d2cwrapper.get_causal_dfs()

        # Save the predictions together with the ground truth
        with open(filename, 'wb') as f:
            pickle.dump((causal_df, true_causal_dfs), f)
    

50 variables is too much to compute, so the cell below is left commented out.

In [None]:
# root = '../../data/new_data/'
# destination_root = '../../data/new_benchmark'
# destination = 'd2c_all_couples_MB5_full'

# if not os.path.exists(destination_root+'/'+destination):
#     os.makedirs(destination_root+'/'+destination)
# maxlags = 5
# # empty folder ../../data/new_benchmark/

# for todo in [to_dos_50_variables]:
#     for file in tqdm(todo):
#         if file.endswith('.pkl'):
#             gen_process_number = int(file.split('_')[0][1:])
#             n_variables = int(file.split('_')[1][1:])
#             max_neighborhood_size = int(file.split('_')[2][2:])
#             noise_std = float(file.split('_')[3][1:-4])
                
#             filename = f'{destination_root}/{destination}/P{gen_process_number}_N{n_variables}_Nj{max_neighborhood_size}_n{noise_std}.pkl'

#             training_data = data.loc[data['process_id'] != gen_process_number]
#             X_train = training_data.drop(columns=['process_id', 'graph_id', 'n_variables', 'max_neighborhood_size','noise_std', 'edge_source', 'edge_dest', 'is_causal',])
#             y_train = training_data['is_causal']

#             model = BalancedRandomForestClassifier(n_estimators=50, random_state=0, n_jobs=1, replacement=True, sampling_strategy='all')

#             model.fit(X_train, y_train)

#             dataloader = DataLoader(n_variables = n_variables,
#                             maxlags = maxlags)
#             dataloader.from_pickle(root+file)
#             observations = dataloader.get_original_observations()
#             true_causal_dfs = dataloader.get_true_causal_dfs()


#             d2cwrapper = D2CWrapper(ts_list=observations, n_variables=n_variables, model=model, maxlags=maxlags, n_jobs = 55, full=True)

#             d2cwrapper.run()

#             causal_df = d2cwrapper.get_causal_dfs()

#             with open(filename, 'wb') as f:
#                     pickle.dump((
#                                 causal_df, 
#                                 true_causal_dfs), f)     
    