# D2C Class

The `D2C` class implements Dependency to Causality (D2C) analysis, a method for inferring causal relationships from observational data. It takes a collection of simulated directed acyclic graphs (DAGs) together with the observations generated from each one, and computes asymmetric descriptors for pairs of variables in each DAG. Those descriptors, paired with labels that say whether one variable is a child of the other, serve as training data for a classifier. The class provides methods for initialization, descriptor computation, and model evaluation.

## Class Attributes

- `DAGs_index`: A numpy array containing the indices of the simulated directed acyclic graphs (DAGs).

- `DAGs`: The list of simulated DAGs.

- `observations_from_DAGs`: The list of observations associated with each simulated DAG.

- `rev`: A boolean flag indicating whether to include reversed edges. When True (the default), each edge `parent -> child` also contributes the reversed pair `child -> parent` as a negative example.

- `X`: A DataFrame containing the computed descriptors, one row per labelled pair of variables. It is None until the `initialize` method populates it.

- `Y`: A DataFrame containing the labels: 1 if the pair is a true edge (the second variable "is.child" of the first) and 0 if it is a reversed edge. It is None until the `initialize` method populates it.

- `verbose`: A boolean flag controlling verbosity. When True, the class prints intermediate values during execution. By default, it's set to False.

- `n_jobs`: The number of parallel jobs to run during computation. By default, it's set to 1, indicating no parallelization.

- `random_state`: The seed for the random number generator. By default, it's set to 42.
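
The interaction between `rev` and the labels stored in `Y` can be illustrated with a minimal sketch. The helper `label_edge_pairs` below is hypothetical (it is not part of the class) and takes a plain list of `(parent, child)` tuples instead of a DAG object; it only mirrors the labelling rule: every true edge gets label 1 and, when `rev` is True, the reversed pair gets label 0.

```python
from typing import List, Tuple


def label_edge_pairs(edges: List[Tuple[int, int]], rev: bool = True):
    """Hypothetical helper: build (cause, effect) pairs and their 0/1 labels."""
    pairs, labels = [], []
    for parent, child in edges:
        pairs.append((parent, child))
        labels.append(1)  # true edge: the effect "is.child" of the cause
        if rev:
            pairs.append((child, parent))
            labels.append(0)  # reversed edge: negative example
    return pairs, labels


edges = [(0, 1), (1, 2)]
pairs, labels = label_edge_pairs(edges, rev=True)
print(pairs)   # [(0, 1), (1, 0), (1, 2), (2, 1)]
print(labels)  # [1, 0, 1, 0]
```

Under this rule, `rev=False` yields only positive labels, so a classifier trained on such data would see a single class.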


## Class Structure

- `__init__(self, simulatedDAGs: SimulatedDAGs, rev: bool = True, verbose=False, random_state: int = 42, n_jobs: int = 1) -> None`: 
The initializer method for the class. It accepts an instance of the SimulatedDAGs class, flags for considering reverse edges, verbosity, a seed for the random state, and the number of jobs for parallel processing.

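- `initialize(self) -> None`:
This method initializes the D2C object by computing descriptors for every simulated DAG, sequentially when `n_jobs` is 1 and in a process pool otherwise, and stores the results in `X` and `Y`.

- `_compute_descriptors_for_edge_pairs(self, DAG_index: Any) -> Tuple[list, list]`:
A helper method that computes the descriptors and labels for every edge of a single DAG.

- `load_descriptors(self, filename='dataframe.csv')`:
A method that loads precomputed descriptors from a CSV file, reading the last column as `Y` and the remaining columns as `X`.

- `_generate_edge_pairs(self, DAG_index, dependency_type: str) -> list`:
This helper method returns the `(parent, child)` node pairs of a DAG for a specified dependency type. Only `"is.child"` is currently handled.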

- `_compute_descriptors(self, DAG_index, ca, ef, ns=None, maxs=20, lin=False, acc=True, struct=True, pq= [0.05,0.1,0.25,0.5,0.75,0.9,0.95], boot="mrmr")`: 
This is a comprehensive function for computing the descriptor of two variables in a dataset under the assumption that one variable is the cause and the other the effect. It includes numerous optional parameters for customization of the descriptor computation process.

- `get_df(self) -> pd.DataFrame`: 
A method that returns `X` (descriptors) and `Y` (labels) concatenated column-wise into a single DataFrame.

- `get_score(self, model: RandomForestClassifier = RandomForestClassifier(), test_size: float = 0.2, metric: str = "accuracy") -> Union[float, None]`: 
This method fits a classifier on a train/test split of the descriptors and returns the chosen metric computed on the test set. By default, it uses a RandomForestClassifier model and the 'accuracy' metric; the other supported metrics are 'f1', 'precision', 'recall', and 'auc'.
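
The role of `pq` in `_compute_descriptors` can be illustrated in isolation. In the class source, ten of the descriptor families are lists whose length depends on the Markov blanket sizes, and each is reduced to `len(pq)` quantiles so that every descriptor has the same number of features. The sketch below shows that reduction with `np.quantile`; the helper name `summarize` is hypothetical, and the count of 11 scalar descriptors plus 10 list-valued families is read off the names built in the source.

```python
import numpy as np

# Quantile levels matching the default `pq` in the method signature
PQ = [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]


def summarize(values, pq=PQ):
    """Hypothetical helper: reduce a list of any length to len(pq) quantiles."""
    return np.quantile(values, q=pq).tolist()


short = summarize([0.2, 0.4])                 # e.g. a very small Markov blanket
long = summarize(np.linspace(0.0, 1.0, 21))   # e.g. many sampled pairs

# 11 scalar descriptors plus 10 list-valued families, each summarized by len(PQ) quantiles
n_features = 11 + 10 * len(PQ)
print(len(short), len(long), n_features)  # 7 7 81
```

Because both inputs map to seven values, rows computed from DAGs of different sizes can be stacked into one feature matrix.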

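The evaluation flow described for `get_score` (split, fit, score) can be sketched on synthetic data. This is an illustration under stated assumptions, not the class's own code: `make_classification` stands in for the descriptor matrix and labels, and the AUC here is computed from predicted probabilities, which is the usual input for `roc_auc_score`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the descriptors (X) and labels (y); an assumption for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
f1 = f1_score(y_test, model.predict(X_test))
# Probability of the positive class, taken from the second column
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"accuracy={accuracy:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Fixing `random_state` in both the split and the model keeps the scores reproducible across runs.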

In [None]:
import sys
sys.path.append("..")

In [1]:
import pandas as pd
import numpy as np 
from multiprocessing import Pool
import networkx as nx

from typing import Union, Tuple, Any

from scipy.stats import skew, tstd
from numpy.linalg import inv, pinv

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

from d2c.simulatedDAGs import SimulatedDAGs
from d2c.utils import *


class D2C:
    def __init__(self, simulatedDAGs: SimulatedDAGs, rev: bool = True, verbose=False, random_state: int = 42, n_jobs: int = 1) -> None:
        """
        Class for D2C analysis.

        D2C (Dependency to Causality) analysis is a method for inferring causal relationships
        from observational data using simulated directed acyclic graphs (DAGs) and computing 
        asymmetric descriptors from the observations associated with each DAG.

        Args:
            simulatedDAGs (SimulatedDAGs): An instance of the SimulatedDAGs class.
            rev (bool, optional): Whether to consider reverse edges. Defaults to True.
            verbose (bool, optional): Whether to print intermediate values. Defaults to False.
            n_jobs (int, optional): Number of parallel jobs. Defaults to 1.
            random_state (int, optional): Random seed. Defaults to 42.
        """
        self.DAGs_index = np.arange(len(simulatedDAGs.list_DAGs))
        self.DAGs = simulatedDAGs.list_DAGs
        self.observations_from_DAGs = simulatedDAGs.list_observations
        self.rev = rev
        self.X = None
        self.Y = None
        self.verbose = verbose
        self.n_jobs = n_jobs
        self.random_state = random_state

    def initialize(self) -> None:
        """
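        Initialize the D2C object by computing descriptors for every simulated DAG.

        The DAGs are processed sequentially when n_jobs == 1 and in a process pool
        otherwise. The results populate self.X (descriptors) and self.Y (labels).
        """
        if self.n_jobs == 1:
            results = [self._compute_descriptors_for_edge_pairs(DAG_index) for DAG_index in self.DAGs_index]
        else:
            with Pool(processes=self.n_jobs) as pool:
                # Each task takes a single argument, so map is sufficient
                results = pool.map(self._compute_descriptors_for_edge_pairs, self.DAGs_index)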

        X_list, Y_list = zip(*results)
        self.X = pd.concat([pd.DataFrame(X) for X in X_list], axis=0)
        self.Y = pd.concat([pd.DataFrame(Y) for Y in Y_list], axis=0)

    def _compute_descriptors_for_edge_pairs(self, DAG_index: Any) -> Tuple[list, list]:
        """
        Compute descriptors and labels for every edge of a single DAG.

        Returns the list of descriptor dicts (X) and the matching 0/1 labels (Y).
        """
        X = []
        Y = []

        edge_pairs = self._generate_edge_pairs(DAG_index,"is.child")

        for edge_pair in edge_pairs:
            parent, child = edge_pair[0], edge_pair[1]
            descriptor = self._compute_descriptors(DAG_index, parent, child)
            X.append(descriptor)
            Y.append(1)  # Label edge as "is.child"

            if self.rev:
                # Reverse edge direction
                descriptor_reverse = self._compute_descriptors(DAG_index, child, parent)
                X.append(descriptor_reverse)
                Y.append(0)  # Label reverse edge as NOT "is.child"

        return X, Y
    
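    def load_descriptors(self, filename='dataframe.csv'):
        """
        Load precomputed descriptors from a CSV file.

        The last column is read as the labels (Y); all other columns as descriptors (X).
        """
        descriptors = pd.read_csv(filename)
        self.X = descriptors.iloc[:,:-1]
        self.Y = descriptors.iloc[:,-1]
        #TODO: handle multivariate case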


    def _generate_edge_pairs(self, DAG_index, dependency_type: str) -> list:
        """
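        Generate the (parent, child) node pairs of a DAG for the given dependency type.

        Args:
            DAG_index: Index of the DAG in self.DAGs.
            dependency_type (str): The type of dependency. Only "is.child" is handled;
                any other value yields an empty list.

        Returns:
            list: List of (parent, child) tuples, one per edge.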

        """
        edge_pairs = []
        if dependency_type == "is.child":
            for parent_node, child_node in self.DAGs[DAG_index].edges:
                edge_pairs.append((parent_node, child_node))
        print("Edge pairs for DAG", DAG_index, "computed:", edge_pairs)
        return edge_pairs

    


    def _compute_descriptors(self, DAG_index, ca, ef, ns=None, maxs=20,
            lin=False, acc=True, struct=True,
            pq= [0.05,0.1,0.25,0.5,0.75,0.9,0.95], boot="mrmr"):
        
        """
        Compute the descriptor of two variables in a dataset under the assumption that one variable is the cause and the other the effect.

        Parameters:
        DAG_index: Index of the DAG whose observations are used.
        ca (int): Node index of the putative cause. Must be in the range [0, n).
        ef (int): Node index of the putative effect. Must be in the range [0, n).
        ns (int, optional): Size of the Markov Blanket. Defaults to min(4, n - 2).
        maxs (int, optional): Max number of pairs MB(i), MB(j) considered. Defaults to 20.
        lin (bool, optional): If True, uses a linear model to assess dependency. Defaults to False.
        acc (bool, optional): If True, uses the accuracy of the regression as a descriptor. Defaults to True.
        struct (bool, optional): If True, uses the ranking in the Markov blanket as a descriptor. Defaults to True. Not used in the current implementation.
        pq (list of float, optional): A list of quantiles used to compute the descriptor. Defaults to [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95].
        boot (str, optional): Feature selection algorithm. Defaults to "mrmr".

        Returns:
        dict: A dictionary with the computed descriptors, or None if acc is False.

        Raises:
        ValueError: If there are missing or infinite values in D, or if ca or ef is out of range.
        """
        D = self.observations_from_DAGs[DAG_index]

        print('Computing descriptors for edge pair: ', ca, ef, 'in DAG', DAG_index)

        # Standardize each column to zero mean and unit standard deviation
        D = (D - D.mean()) / D.std()

        # Check for missing or infinite values in the data
        if np.any(np.isnan(D)) or np.any(np.isinf(D)):
            raise ValueError("Error: NA or Inf in descriptor")
    
        # Number of variables
        n = D.shape[1]
        # Number of observations
        N = D.shape[0]


        # Set default value for ns if not provided
        if ns is None:
            ns = min(4, n-2)

        # Check that ca and ef are within the valid range
        if ca >= n or ef >= n:
            raise ValueError(f"ca={ca}, ef={ef}, n={n}\nerror in D2C_n")
    

        # Initial sets for Markov Blanket
        MBca = set(np.arange(n)) - {ca}
        MBef = set(np.arange(n)) - {ef}
        # intersection of the Markov Blanket of ca and ef
        common_causes = MBca.intersection(MBef)
        if self.verbose: 
            print("common_causes: ", common_causes)
        # Creation of the Markov Blanket of ca (denoted MBca) and ef (MBef)
        if n > (ns+1):

            if self.verbose: print("Computing Markov Blanket")
            # MBca
            ind = list(set(np.arange(n)) - {ca})

            if self.verbose: 
                print("About to rankrho")

            ind = rankrho(D.iloc[:,ind],D.iloc[:,ca],nmax=min(len(ind),5*ns),verbose=self.verbose) - 1 #python starts from 0
            if self.verbose: print('Ind:',ind)
            if self.verbose: print('Exited rankrho')

            if boot == "mrmr":
                mrmr = mRMR(D.iloc[:,ind],D.iloc[:,ca],nmax=ns,verbose=self.verbose)
                MBca = ind[mrmr]  
                if self.verbose: print("MBca: ", MBca)
            # MBef
            ind2 = list(set(np.arange(n)) - {ef})
            ind2 = rankrho(D.iloc[:,ind2],D.iloc[:,ef],nmax=min(len(ind2),5*ns),verbose=self.verbose)  
            if boot == "mrmr":
                MBef = ind2[mRMR(D.iloc[:,ind2],D.iloc[:,ef],nmax=ns,verbose=self.verbose)]  
                if self.verbose: print("MBef: ", MBef)
    
        if acc:
            comcau = 1

            if len(common_causes) > 0:
                if self.verbose: print("common_causes: ", common_causes)
                comcau = normalized_conditional_information(D.iloc[:, ef], D.iloc[:, ca], D.iloc[:, list(common_causes)], lin=lin,verbose=self.verbose) 

            effca = coeff(D.iloc[:, ef], D.iloc[:, ca], D.iloc[:, MBef], verbose=self.verbose)
            effef = coeff(D.iloc[:, ca], D.iloc[:, ef], D.iloc[:, MBca], verbose=self.verbose)

            if self.verbose: print("effca: ", effca, "effef: ", effef)

            ca_ef = normalized_conditional_information(D.iloc[:, ca], D.iloc[:, ef], lin=lin,verbose=self.verbose) 
            ef_ca = normalized_conditional_information(D.iloc[:, ef], D.iloc[:, ca], lin=lin,verbose=self.verbose) 

            if self.verbose: print("ca_ef: ", ca_ef, "ef_ca: ", ef_ca)

            delta = normalized_conditional_information(D.iloc[:, ef], D.iloc[:, ca], D.iloc[:, MBef], lin=lin,verbose=self.verbose) 
            delta2 = normalized_conditional_information(D.iloc[:, ca], D.iloc[:, ef], D.iloc[:, MBca], lin=lin,verbose=self.verbose) 

            if self.verbose: print("delta: ", delta, "delta2: ", delta2)    


            delta_i = []  
            delta2_i = []
            arrays_m_plus_MBca = [np.unique(array).tolist() for array in [np.concatenate(([m], MBca)) for m in MBef]]
            arrays_m_plus_MBef = [np.unique(array).tolist() for array in [np.concatenate(([m], MBef)) for m in MBca]]

            for array in arrays_m_plus_MBca:
                delta_i.append( normalized_conditional_information(D.iloc[:, ef], D.iloc[:, ca], D.iloc[:,  array], lin=lin,verbose=self.verbose))
            for array in arrays_m_plus_MBef:
                delta2_i.append(normalized_conditional_information(D.iloc[:, ca], D.iloc[:, ef], D.iloc[:,  array], lin=lin,verbose=self.verbose))

            if self.verbose: print("delta_i: ", delta_i, "delta2_i: ", delta2_i)

            I1_i = [normalized_conditional_information(D.iloc[:, MBef[j]], D.iloc[:, ca], lin=lin,verbose=self.verbose) for j in range(len(MBef))]
            I1_j = [normalized_conditional_information(D.iloc[:, MBca[j]], D.iloc[:, ef], lin=lin,verbose=self.verbose) for j in range(len(MBca))]

            if self.verbose: print("I1_i: ", I1_i, "I1_j: ", I1_j)

            I2_i = [normalized_conditional_information(D.iloc[:, ca], D.iloc[:, MBef[j]], D.iloc[:, ef], lin=lin,verbose=self.verbose) for j in range(len(MBef))]
            I2_j = [normalized_conditional_information(D.iloc[:, ef], D.iloc[:, MBca[j]], D.iloc[:, ca], lin=lin,verbose=self.verbose) for j in range(len(MBca))]

            if self.verbose: print("I2_i: ", I2_i, "I2_j: ", I2_j)

            # Randomly select maxs pairs
            IJ = np.array(np.meshgrid(np.arange(len(MBca)), np.arange(len(MBef)))).T.reshape(-1,2)
            np.random.shuffle(IJ)
            IJ = IJ[:min(maxs, len(IJ))]

            if self.verbose: print("IJ: ", IJ)

            I3_i = [normalized_conditional_information(D.iloc[:, MBca[i]], D.iloc[:, MBef[j]], D.iloc[:, ca], lin=lin,verbose=self.verbose) for i, j in IJ]
            I3_j = [normalized_conditional_information(D.iloc[:, MBca[i]], D.iloc[:, MBef[j]], D.iloc[:, ef], lin=lin,verbose=self.verbose) for i, j in IJ]

            if self.verbose: print("I3_i: ", I3_i, "I3_j: ", I3_j)

            IJ = np.array([(i, j) for i in range(len(MBca)) for j in range(i+1, len(MBca))])
            np.random.shuffle(IJ)
            IJ = IJ[:min(maxs, len(IJ))]

            Int3_i = [normalized_conditional_information(D.iloc[:, MBca[i]], D.iloc[:, MBca[j]], D.iloc[:, ca], lin=lin,verbose=self.verbose) - normalized_conditional_information(D.iloc[:, MBca[i]], D.iloc[:, MBca[j]], lin=lin,verbose=self.verbose) for i, j in IJ]

            if self.verbose: print("Int3_i: ", Int3_i)

            IJ = np.array([(i, j) for i in range(len(MBef)) for j in range(i+1, len(MBef))])
            np.random.shuffle(IJ)
            IJ = IJ[:min(maxs, len(IJ))]

            Int3_j = [normalized_conditional_information(D.iloc[:, MBef[i]], D.iloc[:, MBef[j]], D.iloc[:, ef], lin=lin,verbose=self.verbose) - normalized_conditional_information(D.iloc[:, MBef[i]], D.iloc[:, MBef[j]], lin=lin,verbose=self.verbose) for i, j in IJ]

            if self.verbose: print("Int3_j: ", Int3_j)

            E_ef = ecdf(D.iloc[:, ef],verbose=self.verbose)(D.iloc[:, ef]) 
            E_ca = ecdf(D.iloc[:, ca],verbose=self.verbose)(D.iloc[:, ca])

            if self.verbose: print("E_ef: ", E_ef, "E_ca: ", E_ca)

            gini_ca_ef = normalized_conditional_information(D.iloc[:, ca], pd.DataFrame(E_ef), lin=lin,verbose=self.verbose)
            gini_ef_ca = normalized_conditional_information(D.iloc[:, ef], pd.DataFrame(E_ca), lin=lin,verbose=self.verbose)

            if self.verbose: print("gini_ca_ef: ", gini_ca_ef, "gini_ef_ca: ", gini_ef_ca)

            gini_delta = normalized_conditional_information(D.iloc[:, ef], pd.DataFrame(E_ca), D.iloc[:, MBef], lin=lin,verbose=self.verbose)
            gini_delta2 = normalized_conditional_information(D.iloc[:, ca], pd.DataFrame(E_ef), D.iloc[:, MBca], lin=lin,verbose=self.verbose)

            if self.verbose: print("gini_delta: ", gini_delta, "gini_delta2: ", gini_delta2)

            namesx = ["effca","effef","comcau","delta","delta2"]
            namesx += ["delta.i" + str(i+1) for i in range(len(pq))]
            namesx += ["delta2.i" + str(i+1) for i in range(len(pq))]
            namesx += ["ca.ef","ef.ca"]
            namesx += ["I1.i" + str(i+1) for i in range(len(pq))]
            namesx += ["I1.j" + str(i+1) for i in range(len(pq))]
            namesx += ["I2.i" + str(i+1) for i in range(len(pq))]
            namesx += ["I2.j" + str(i+1) for i in range(len(pq))]
            namesx += ["I3.i" + str(i+1) for i in range(len(pq))]
            namesx += ["I3.j" + str(i+1) for i in range(len(pq))]
            namesx += ["Int3.i" + str(i+1) for i in range(len(pq))]
            namesx += ["Int3.j" + str(i+1) for i in range(len(pq))]
            namesx += ["gini.delta","gini.delta2","gini.ca.ef","gini.ef.ca"]

            keys = namesx
            
            values = [effca, effef, comcau, delta, delta2]
            values.extend(np.quantile(delta_i, q=pq, axis=0).flatten()) 
            values.extend(np.quantile(delta2_i, q=pq, axis=0).flatten()) 
            values.extend([ca_ef, ef_ca])
            values.extend(np.quantile(I1_i, q=pq, axis=0).flatten()) 
            values.extend(np.quantile(I1_j, q=pq, axis=0).flatten()) 
            values.extend(np.quantile(I2_i, q=pq, axis=0).flatten()) 
            values.extend(np.quantile(I2_j, q=pq, axis=0).flatten()) 
            values.extend(np.quantile(I3_i, q=pq, axis=0).flatten()) 
            values.extend(np.quantile(I3_j, q=pq, axis=0).flatten()) 
            values.extend(np.quantile(Int3_i, q=pq, axis=0).flatten()) 
            values.extend(np.quantile(Int3_j, q=pq, axis=0).flatten()) 
            values.extend([gini_delta, gini_delta2,gini_ca_ef, gini_ef_ca]) 

            
            # Pair each descriptor name with its value
            dictionary = dict(zip(keys, values))
            # Replacing NA values with 0 is currently disabled:
            # for key in dictionary:
            #     if np.isnan(dictionary[key]):
            #         dictionary[key] = 0
            
            print("Descriptors for DAG", DAG_index, "edge pair", ca, ef, "computed")
            return dictionary


    def get_df(self) -> pd.DataFrame:
        """
        Get the concatenated DataFrame of X and Y.

        Returns:
            pd.DataFrame: The concatenated DataFrame of X and Y.

        """
        return pd.concat([self.X,self.Y], axis=1)
    

    def get_score(self, model: RandomForestClassifier = RandomForestClassifier(), test_size: float = 0.2, metric: str = "accuracy") -> Union[float, None]:
        """
        Get the score of a machine learning model using the specified metric.

        Parameters:
            model (RandomForestClassifier): The machine learning model to evaluate.
            test_size (float): The proportion of the data to use for testing.
            metric (str): The evaluation metric to use (default is "accuracy"). Valid metrics are: 'accuracy', 'f1', 'precision', 'recall', 'auc'.

        Returns:
            float: The score of the model using the specified metric.
        
        Raises:
            ValueError: If an invalid metric is provided.

        """
        data = self.X
        labels = self.Y

        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(data, labels, train_size=1-test_size, test_size=test_size, random_state=self.random_state)

        y_train = y_train.values.ravel()
        y_test = y_test.values.ravel()

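        # Fit a fresh copy of the supplied model (never the shared default instance),
        # configured with this object's n_jobs and random_state
        from sklearn.base import clone
        model = clone(model).set_params(n_jobs=self.n_jobs, random_state=self.random_state)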

        # Train the model
        model.fit(X_train, y_train)

        # Compute the requested metric on the test set
        if metric == "accuracy":
            return model.score(X_test, y_test)
        elif metric == "f1":
            return f1_score(y_test, model.predict(X_test))
        elif metric == "precision":
            return precision_score(y_test, model.predict(X_test))
        elif metric == "recall":
            return recall_score(y_test, model.predict(X_test))
        elif metric == "auc":
            # AUC is computed from the predicted probability of the positive class
            return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        else:
            raise ValueError("Invalid metric. Valid metrics are: 'accuracy', 'f1', 'precision', 'recall', 'auc'")
