This notebook covers the computation of the descriptors.
The procedure involves the following steps:
- For each variable, we estimate its Markov Blanket (MB) by selecting its own lagged versions, one time step before and one time step after.
- We standardize the time series to avoid varsortability, i.e., so that differences in marginal variance cannot leak information about the causal order.
- Using the estimated MB, we compute a set of descriptors for all possible causal pairs (i.e., $t-\tau \rightarrow t, \forall \tau$). These descriptors characterize the causal relationship between the two variables of each pair. They include conditional mutual information terms and other statistical properties that describe the dependencies and interactions between the variables.
- For families of descriptors, we compute the quantiles of their empirical distributions. This step captures the distributional characteristics and aids in feature representation for the classifier.
- The computed descriptors and their quantiles are compiled into an input feature vector. This vector encapsulates the essential characteristics of the causal relationships and serves as the input for the classifier.
- For training data, each input vector is labeled as causal (1) or noncausal (0) based on the original selection criteria from the synthetic data's Directed Acyclic Graph (DAG). This labeling is crucial for supervised learning and model training.
- The labeled dataset, comprising the feature vectors, is used to train a classifier. The classifier learns to predict the likelihood of causal relationships based on the descriptors.
- For unseen time series data, the trained classifier predicts the probability of causal links for each pair of variables. The predictions are based on the computed descriptors for the test data.

![descriptors.png](descriptors.png)
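
As a rough illustration of the standardization and quantile steps listed above, the sketch below z-scores each column and summarizes a family of values by three quantiles. The helper names (`standardize`, `quantile_features`) and the quantile levels (0.25, 0.5, 0.75) are assumptions made for this example and are not taken from the `d2c` package. Only the `_q0`/`_q1`/`_q2` suffix pattern mirrors the column names printed later in this notebook.

```python
import numpy as np
import pandas as pd


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score every column so that all variables have mean 0 and variance 1."""
    return (df - df.mean()) / df.std(ddof=0)


def quantile_features(values, name, probs=(0.25, 0.5, 0.75)):
    """Summarize a family of descriptor values by a few quantiles.

    The quantile levels are an assumption for illustration only.
    """
    qs = np.quantile(np.asarray(values, dtype=float), probs)
    return {f"{name}_q{i}": float(q) for i, q in enumerate(qs)}


rng = np.random.default_rng(0)
# Two toy variables with very different scales.
raw = pd.DataFrame({"x": rng.normal(0, 1, 500), "y": rng.normal(0, 10, 500)})
z = standardize(raw)

# A toy family of descriptor values, summarized under a hypothetical name.
features = quantile_features([0.0, 0.1, 0.2, 0.3, 0.4], "m_cau")
print(features)
```

After standardization both columns have the same variance, so a classifier cannot exploit scale differences between variables.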

In [2]:
from d2c.descriptors import D2C, DataLoader

In [3]:
N_VARS = 5
MAXLAGS = 3
N_JOBS = 40

The `DataLoader` class handles time series data and directed acyclic graphs (DAGs) and prepares them for the next stage, the descriptor computation.
The preparation includes:
1. Creating lagged time series: This step involves generating lagged versions of the original time series data. Lagged time series help in capturing temporal dependencies and interactions between variables at different time steps.
2. Flattening the original dictionaries into coherent lists: The original data, which may be stored in nested dictionaries, is flattened into lists. This transformation ensures that the data is in a consistent and accessible format for further processing.
3. Renaming the nodes of the DAGs: The nodes of the Directed Acyclic Graphs (DAGs) are renamed to maintain consistency and clarity. This step is crucial for accurately representing the causal relationships between variables in the DAGs.
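
The lagging step in item 1 can be sketched with `pandas.DataFrame.shift`. This is a minimal sketch, not the `DataLoader` implementation: the function name `make_lagged` and the column numbering (`lag * n_vars + variable_index`) are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd


def make_lagged(df: pd.DataFrame, maxlags: int) -> pd.DataFrame:
    """Stack lag-0..maxlags copies of every column side by side.

    Columns are numbered lag * n_vars + j (an assumed convention).
    Rows that lack a full history are dropped.
    """
    n_vars = df.shape[1]
    blocks = []
    for lag in range(maxlags + 1):
        shifted = df.shift(lag)  # value at time t - lag, aligned on row t
        shifted.columns = [lag * n_vars + j for j in range(n_vars)]
        blocks.append(shifted)
    return pd.concat(blocks, axis=1).dropna().reset_index(drop=True)


rng = np.random.default_rng(0)
series = pd.DataFrame(rng.normal(size=(250, 5)))  # 250 time steps, 5 variables
lagged = make_lagged(series, maxlags=3)
print(lagged.shape)  # (247, 20)
```

With 5 variables and 3 lags, each of the 250 - 3 = 247 remaining rows holds 5 × (3 + 1) = 20 values.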


In [4]:
dataloader = DataLoader(n_variables = N_VARS,
                    maxlags = MAXLAGS)
dataloader.from_pickle('synthetic_data.pkl')

original_observations = dataloader.get_original_observations()
lagged_flattened_observations = dataloader.get_observations()
flattened_dags = dataloader.get_dags()

We are now ready for the core of our methodology: the D2C method. <br>
This method starts from a list of observations and DAGs and computes the corresponding descriptors, storing them in a DataFrame. <br>
The `D2C` class takes the following arguments:
**Args:**
- `dags` (list): List of directed acyclic graphs (DAGs) representing causal relationships.
- `observations` (list): List of observations (pd.DataFrame) corresponding to each DAG.
- `n_variables` (int, optional): Number of variables in the time series. Defaults to 3.
- `maxlags` (int, optional): Maximum number of lags in the time series. Defaults to 3.
- `mutual_information_proxy` (str, optional): Method to use for mutual information computation. Defaults to "Ridge".
- `proxy_params` (dict, optional): Parameters for the mutual information computation. Defaults to None.
- `verbose` (bool, optional): Whether to print verbose output. Defaults to False.
- `seed` (int, optional): Random seed for reproducibility. Defaults to 42.
- `n_jobs` (int, optional): Number of parallel jobs to run. Defaults to 1.
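
To give an intuition for what a regression-based proxy for conditional mutual information might look like, the sketch below measures how much the cross-validated error of predicting `y` drops when `x` is added to a conditioning set `z`. This is one plausible construction under stated assumptions, not the `d2c` implementation: the function names `cv_mse` and `cmi_proxy`, the choice of `alpha=1.0`, and the 5-fold cross-validation are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict


def cv_mse(X, y):
    """Cross-validated mean squared error of a Ridge regression of y on X."""
    pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=5)
    return float(np.mean((y - pred) ** 2))


def cmi_proxy(x, y, z):
    """Relative error reduction for y when x is added to z (clipped at 0).

    Values near 1 suggest x is informative about y given z;
    values near 0 suggest it adds nothing.
    """
    base = cv_mse(z, y)
    full = cv_mse(np.column_stack([z, x]), y)
    return max(0.0, 1.0 - full / base)


rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
w = rng.normal(size=n)        # unrelated to y
z = rng.normal(size=(n, 1))   # conditioning set, also unrelated to y
y = 2.0 * x + 0.1 * rng.normal(size=n)

informative = cmi_proxy(x, y, z)
uninformative = cmi_proxy(w, y, z)
print(informative, uninformative)
```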


**Methods:**
- `initialize(self)`: Initialize the D2C object by computing descriptors in parallel for all observations.
- `compute_descriptors_without_dag(self, n_variables, maxlags)`: Compute all descriptors when a DAG is not available.
- `compute_descriptors_with_dag(self, dag_idx, dag, n_variables, maxlags, num_samples=20)`: Compute all descriptors associated with a DAG.
- `get_markov_blanket(self, dag, node)`: Compute the REAL Markov Blanket of a node in a specific DAG.
- `standardize_data(self, observations)`: Standardize the observation DataFrame.
- `check_data_validity(self, observations)`: Check the validity of the data.
- `update_dictionary_quantiles(self, dictionary, name, quantiles)`: Update the dictionary with quantiles.
- `update_dictionary_distribution(self, dictionary, name, values)`: Update the dictionary with distribution moments.
- `update_dictionary_actual_values(self, dictionary, name, values)`: Update the dictionary with actual values.
- `compute_descriptors_for_couple(self, dag_idx, ca, ef, label)`: Compute descriptors for a given couple of nodes in a DAG.
- `get_descriptors_df(self)`: Get the concatenated DataFrame of X and Y.
- `get_test_couples(self)`: Get the test couples.
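
For reference, the true Markov blanket of a node in a DAG consists of its parents, its children, and the other parents of its children. The sketch below computes it from a plain edge list. It only illustrates the definition: the function `markov_blanket` and the edge-list representation are assumptions for this example and do not reflect how `get_markov_blanket` is written.

```python
def markov_blanket(edges, node):
    """Return parents, children, and co-parents of `node`.

    `edges` is an iterable of (source, target) pairs describing a DAG.
    """
    edges = list(edges)
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    # Other parents of the node's children ("spouses").
    spouses = {u for u, v in edges if v in children}
    return (parents | children | spouses) - {node}


# Toy DAG: a -> x -> c <- b, and c -> d
toy_edges = [("a", "x"), ("x", "c"), ("b", "c"), ("c", "d")]
print(sorted(markov_blanket(toy_edges, "x")))  # ['a', 'b', 'c']
```

Node `d` is a descendant of `x` but not part of its blanket, because `c` already shields `x` from it.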



In [7]:
d2c = D2C(observations=lagged_flattened_observations,
        dags=flattened_dags, 
        couples_to_consider_per_dag=20, 
        n_variables=N_VARS, 
        maxlags=MAXLAGS,
        seed=42,
        n_jobs=N_JOBS,
        full=True)

d2c.initialize()

Now let's look at the descriptors. 

In [8]:
descriptors_df = d2c.get_descriptors_df()
print(descriptors_df.columns)
descriptors_df

Index(['graph_id', 'edge_source', 'edge_dest', 'is_causal', 'coeff_cause',
       'coeff_eff', 'HOC_3_1', 'HOC_1_2', 'HOC_2_1', 'HOC_1_3', 'kurtosis_ca',
       'kurtosis_ef', 'mca_mef_cau_q0', 'mca_mef_cau_q1', 'mca_mef_cau_q2',
       'mca_mef_eff_q0', 'mca_mef_eff_q1', 'mca_mef_eff_q2', 'cau_m_eff_q0',
       'cau_m_eff_q1', 'cau_m_eff_q2', 'eff_m_cau_q0', 'eff_m_cau_q1',
       'eff_m_cau_q2', 'm_cau_q0', 'm_cau_q1', 'm_cau_q2', 'com_cau',
       'cau_eff', 'eff_cau', 'eff_cau_mbeff', 'cau_eff_mbcau',
       'eff_cau_mbcau_plus_q0', 'eff_cau_mbcau_plus_q1',
       'eff_cau_mbcau_plus_q2', 'cau_eff_mbeff_plus_q0',
       'cau_eff_mbeff_plus_q1', 'cau_eff_mbeff_plus_q2', 'm_eff_q0',
       'm_eff_q1', 'm_eff_q2', 'mca_mca_cau_q0', 'mca_mca_cau_q1',
       'mca_mca_cau_q2', 'mbe_mbe_eff_q0', 'mbe_mbe_eff_q1', 'mbe_mbe_eff_q2',
       'n_samples', 'n_features', 'n_features/n_samples', 'skewness_ca',
       'skewness_ef'],
      dtype='object')


| | graph_id | edge_source | edge_dest | is_causal | coeff_cause | coeff_eff | HOC_3_1 | HOC_1_2 | HOC_2_1 | HOC_1_3 | ... | mca_mca_cau_q1 | mca_mca_cau_q2 | mbe_mbe_eff_q0 | mbe_mbe_eff_q1 | mbe_mbe_eff_q2 | n_samples | n_features | n_features/n_samples | skewness_ca | skewness_ef |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 11 | 1 | 1 | 0.012450 | 0.013721 | 0.051125 | 0.082896 | 0.064003 | -0.222003 | ... | 0.0 | 0.013471 | 0.0 | 0.0 | 0.010264 | 247 | 20 | 0.080972 | -0.105222 | -0.119371 |
| 1 | 0 | 1 | 11 | 0 | 0.013721 | 0.012450 | -0.222003 | 0.064003 | 0.082896 | 0.051125 | ... | 0.0 | 0.010264 | 0.0 | 0.0 | 0.013471 | 247 | 20 | 0.080972 | -0.119371 | -0.105222 |
| 2 | 0 | 14 | 4 | 1 | -0.001982 | 0.011765 | 0.567162 | -0.057519 | -0.132904 | 0.421689 | ... | 0.0 | 0.014072 | 0.0 | 0.0 | 0.011671 | 247 | 20 | 0.080972 | -0.071466 | -0.071288 |
| 3 | 0 | 4 | 14 | 0 | 0.011765 | -0.001982 | 0.421689 | -0.132904 | -0.057519 | 0.567162 | ... | 0.0 | 0.011671 | 0.0 | 0.0 | 0.014072 | 247 | 20 | 0.080972 | -0.071288 | -0.071466 |
| 4 | 0 | 7 | 0 | 1 | 0.039412 | 0.082011 | 0.543090 | -0.013511 | -0.104255 | 0.363200 | ... | 0.0 | 0.005817 | 0.0 | 0.0 | 0.000000 | 247 | 20 | 0.080972 | -0.021983 | 0.153496 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1433 | 79 | 19 | 2 | 0 | 0.008442 | 0.015381 | 0.384304 | 0.049649 | 0.016760 | 0.248328 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 247 | 20 | 0.080972 | 0.107268 | -0.024408 |
| 1434 | 79 | 12 | 1 | 0 | -0.043651 | -0.012628 | -0.061665 | 0.129942 | -0.111994 | -0.067194 | ... | 0.0 | 0.000424 | 0.0 | 0.0 | 0.000000 | 247 | 20 | 0.080972 | -0.135363 | 0.037282 |
| 1435 | 79 | 17 | 3 | 0 | 0.008170 | -0.002797 | 0.481132 | -0.078427 | 0.204501 | 0.142043 | ... | 0.0 | 0.000970 | 0.0 | 0.0 | 0.000000 | 247 | 20 | 0.080972 | 0.193563 | -0.041191 |
| 1436 | 79 | 5 | 1 | 0 | -0.018201 | -0.001662 | 0.094858 | -0.152048 | -0.123522 | -0.171476 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 247 | 20 | 0.080972 | -0.152121 | 0.037282 |


Now we can check whether these descriptors carry a learnable signal. We split the rows in order, using the first half for training and the second half for testing, and fit a classifier.

In [9]:
descriptors_df_train = descriptors_df.iloc[:len(descriptors_df)//2]
descriptors_df_test = descriptors_df.iloc[len(descriptors_df)//2:]

X_train = descriptors_df_train.drop(columns=['graph_id','edge_source','edge_dest','is_causal'])
y_train = descriptors_df_train['is_causal']
X_test = descriptors_df_test.drop(columns=['graph_id','edge_source','edge_dest','is_causal'])
y_test = descriptors_df_test['is_causal']

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

clf = BalancedRandomForestClassifier(
    n_estimators=10,
    max_depth=None,
    random_state=0,
    sampling_strategy='auto',
    replacement=True,
    bootstrap=True,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba[:, 1]):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8540
ROC AUC Score: 0.9187
Confusion Matrix:
[[398  82]
 [ 23 216]]
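
To read the confusion matrix above, recall that scikit-learn puts true labels on the rows and predicted labels on the columns, so the entries are `[[TN, FP], [FN, TP]]`. The short sketch below recomputes the accuracy from the printed matrix and derives precision and recall for the causal class. The variable names are chosen for this example only.

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = np.array([[398, 82],
               [23, 216]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # fraction of pairs classified correctly
precision = tp / (tp + fp)        # predicted causal pairs that are truly causal
recall = tp / (tp + fn)           # truly causal pairs that were recovered

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.8540 precision=0.7248 recall=0.9038
```

The classifier recovers about 90% of the causal pairs, at the cost of 82 false positives among the 480 noncausal pairs.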
