# TD2C Validation on Real-World Data

This notebook is designed to validate the `TD2C` method on real-world datasets of varying sizes and compositions. The primary goals are to:

1. **Validate on Real-World Datasets**: Apply the `TD2C` method to datasets that differ in size and composition to assess its robustness and effectiveness in practical scenarios.

2. **Model Training Comparison**: Evaluate the performance of the `TD2C` method when trained on models with different process characteristics:
   - **Linear-Only Training**: Validate the method using a model trained exclusively on linear processes.
   - **Mixed Training (Linear and Nonlinear)**: Assess the method using a model equally trained on both linear and nonlinear processes.
   
3. **Performance Comparison**: Compare the results from the different training approaches to determine under which circumstances the `TD2C` estimates perform better, particularly focusing on their application to real-world data.

By the end of this notebook, users will have a clear understanding of how the `TD2C` method performs in various real-world contexts, and will be equipped to apply the method effectively in both linear and nonlinear data environments.


# Settings

### Load Packages

In [1]:
import pickle
import os
import pandas as pd
from tqdm import tqdm
import numpy as np
import joblib

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, balanced_accuracy_score

from d2c.benchmark import D2CWrapper

from d2c.descriptors_generation.loader import DataLoader

from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


### Load Trained Models

In [2]:
# full model 
model = joblib.load('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/model.pkl')
# model trained only on linear generative processes
model_linear = joblib.load('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/td2c_R_N5_LOPO_combined.pkl')

# HYBRID PAPER DATA:

## Antivirus activity

Impacts of antivirus activity in servers.
13 time series such that 3 of them are collected with a one-minute sampling rate and the rest with a five-minute sampling rate.

The two processed datasets consist of 1321 timestamps:
  - preprocessed 1
  - preprocessed 2
  
  ### Ground truth:
  - memory_usage_Portal -> Physical_Memory_prct_used_Portal
  - cpu_usage_Portal -> cpu_prct_used_Portal
  - Physical_Memory_prct_used_Portal -> 0_C_read_Portal
  - cpu_prct_used_Portal -> 0_C_read_Portal
  - memory_usage_VDI -> Physical_Memory_prct_used_VDI
  - cpu_usage_VDI -> cpu_prct_used_VDI
  - Physical_Memory_prct_used_VDI -> 0_C_read_VDI
  - cpu_prct_used_VDI -> 0_C_read_VDI
  - Physical_Memory_prct_used_Portal -> Chargement_portail
  - cpu_prct_used_Portal -> Chargement_portail
  - 0_C_read_Portal -> Chargement_portail
  - Physical_Memory_prct_used_VDI -> Chargement_IE
  - cpu_prct_used_VDI -> Chargement_IE
  - 0_C_read_VDI -> Chargement_IE
  - Chargement_portail -> Default_Transaction
  - Chargement_IE -> Default_Transaction

### Load Data

In [3]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt', delimiter=',',skiprows=1)
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt', delimiter=',',skiprows=1)
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt')
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt')

### Get Causal DF and manipulate it

In [22]:
MODEL = model
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_2
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=13, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:      from  to effect p_value  probability  is_causal
 0      23   4   None    None         0.16      False
 1      17   3   None    None         0.14      False
 2      19   0   None    None         0.04      False
 3      17  12   None    None         0.04      False
 4      19   9   None    None         0.44      False
 ..    ...  ..    ...     ...          ...        ...
 164    15   3   None    None         0.12      False
 165    15  12   None    None         0.06      False
 166    16  11   None    None         0.22      False
 167    18   8   None    None         0.20      False
 168    21   7   None    None         0.14      False
 
 [169 rows x 6 columns]}

In [24]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'by' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,13,0,,,0.60,True
16,13,1,,,0.22,False
94,13,2,,,0.28,False
30,13,3,,,0.22,False
110,13,4,,,0.12,False
...,...,...,...,...,...,...
86,25,8,,,0.10,False
151,25,9,,,0.28,False
99,25,10,,,0.04,False
39,25,11,,,0.12,False


In [25]:
# RELEVEL FROM AND TO COLUMNS WITH THE RIGHT NODES NAMES

import pandas as pd

# Assuming df is your DataFrame

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,1,1,,,0.60,True
16,1,2,,,0.22,False
94,1,3,,,0.28,False
30,1,4,,,0.22,False
110,1,5,,,0.12,False
...,...,...,...,...,...,...
86,13,9,,,0.10,False
151,13,10,,,0.28,False
99,13,11,,,0.04,False
39,13,12,,,0.12,False


In [26]:
# show only df rows that have 'is_causal' == True
df[df['is_causal'] == True]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,1,1,,,0.6,True
78,5,5,,,0.7,True
74,10,10,,,0.68,True


In [27]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'memory_usage_Portal': 1,
 'cpu_usage_Portal': 2,
 'Physical_Memory_prct_used_Portal': 3,
 'cpu_prct_used_Portal': 4,
 '0_C_read_Portal': 5,
 'memory_usage_VDI': 6,
 'cpu_usage_VDI': 7,
 'Physical_Memory_prct_used_VDI': 8,
 'cpu_prct_used_VDI': 9,
 '0_C_read_VDI': 10,
 'Chargement_portail': 11,
 'Chargement_IE': 12,
 'Default_Transaction': 13}

In [31]:
# take only the columns 'from', 'to' for a certain probability threshold
caus = df[df['probability'] > 0.4][['from', 'to']]

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
79,memory_usage_Portal,memory_usage_Portal
47,memory_usage_Portal,memory_usage_VDI
134,cpu_usage_Portal,0_C_read_Portal
139,cpu_prct_used_Portal,0_C_read_Portal
115,cpu_prct_used_Portal,0_C_read_VDI
78,0_C_read_Portal,0_C_read_Portal
67,memory_usage_VDI,memory_usage_Portal
37,memory_usage_VDI,memory_usage_VDI
4,cpu_usage_VDI,0_C_read_VDI
11,cpu_prct_used_VDI,0_C_read_VDI


### Load the Ground truth and print the results

In [32]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                                     From -> To
 79   memory_usage_Portal -> memory_usage_Portal
 47      memory_usage_Portal -> memory_usage_VDI
 134         cpu_usage_Portal -> 0_C_read_Portal
 139     cpu_prct_used_Portal -> 0_C_read_Portal
 115        cpu_prct_used_Portal -> 0_C_read_VDI
 78           0_C_read_Portal -> 0_C_read_Portal
 67      memory_usage_VDI -> memory_usage_Portal
 37         memory_usage_VDI -> memory_usage_VDI
 4                 cpu_usage_VDI -> 0_C_read_VDI
 11            cpu_prct_used_VDI -> 0_C_read_VDI
 74                 0_C_read_VDI -> 0_C_read_VDI
 92     Chargement_portail -> Chargement_portail,
                                            From -> To
 0   memory_usage_Portal -> Physical_Memory_prct_us...
 1            cpu_usage_Portal -> cpu_prct_used_Portal
 2   Physical_Memory_prct_used_Portal -> 0_C_read_P...
 3             cpu_prct_used_Portal -> 0_C_read_Portal
 4   memory_usage_VDI -> Physical_Memory_prct_used_VDI
 5                  cpu_usage_VD

In [33]:
print(f'Numbers of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}'), 
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 2 / 16
Percentage of correctly estimated causal paths: 12.5%


## Dairy markets

Ten years (from 09/2008 to 12/2018) of monthly prices for milk M, butter B, and
cheddar cheese C, so the three time series are of length 124.

Ground truth: B <- M -> C

                  0 0 0 
        adj.mat = 1 0 1
                  0 0 0

### Load Data

In [165]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Dairy_markets/dairy_markets_merged.txt', delimiter=',',skiprows=1, usecols=range(1, 4))
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Dairy_markets/dairy_markets_merged.txt', usecols=range(1, 4))

### Get Causal DF and manipulate it

In [166]:
MODEL = model_linear
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=3, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:    from  to effect p_value  probability  is_causal
 0     4   0   None    None         0.00      False
 1     3   1   None    None         0.01      False
 2     5   1   None    None         0.00      False
 3     4   2   None    None         0.00      False
 4     3   0   None    None         0.49      False
 5     5   0   None    None         0.00      False
 6     3   2   None    None         0.01      False
 7     4   1   None    None         0.47      False
 8     5   2   None    None         0.57       True}

In [167]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'by' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Dairy_markets/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
4,3,0,,,0.49,False
1,3,1,,,0.01,False
6,3,2,,,0.01,False
0,4,0,,,0.0,False
7,4,1,,,0.47,False
3,4,2,,,0.0,False
5,5,0,,,0.0,False
2,5,1,,,0.0,False
8,5,2,,,0.57,True


In [168]:
# RELEVEL FROM AND TO COLUMNS WITH THE RIGHT NODES NAMES

import pandas as pd

# Assuming df is your DataFrame

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
4,1,1,,,0.49,False
1,1,2,,,0.01,False
6,1,3,,,0.01,False
0,2,1,,,0.0,False
7,2,2,,,0.47,False
3,2,3,,,0.0,False
5,3,1,,,0.0,False
2,3,2,,,0.0,False
8,3,3,,,0.57,True


In [169]:
# show only df rows that have 'is_causal' == True
df[df['is_causal'] == True]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
8,3,3,,,0.57,True


In [170]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'Butter': 1, 'Cheese': 2, 'Milk': 3}

In [171]:
# take only the columns 'from', 'to' for a certain probability threshold
caus = df[df['probability'] > 0.4][['from', 'to']]

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Dairy_markets/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
4,Butter,Butter
7,Cheese,Cheese
8,Milk,Milk


### Load the Ground truth and print the results

In [172]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Dairy_markets/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(         From -> To
 4  Butter -> Butter
 7  Cheese -> Cheese
 8      Milk -> Milk,
        From -> To
 0  Milk -> Butter
 1  Milk -> Cheese)

In [173]:
print(f'Numbers of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}'), 
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 2
Percentage of correctly estimated causal paths: 0.0%


## Temperature

Bivariate time series of length 168 about indoor I and outdoor O measurements

Ground truth: O -> I

      adj.mat = 0 1
                0 0 
                


### Load Data

In [138]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/temperature/temperature.txt', delimiter=',',skiprows=1, usecols=range(1, 3))
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/temperature/temperature.txt', usecols=range(1, 3))

### Get Causal DF and manipulate it

In [149]:
MODEL = model_linear
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=2, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:    from  to effect p_value  probability  is_causal
 0     3   1   None    None         0.45      False
 1     2   0   None    None         0.17      False
 2     2   1   None    None         0.17      False
 3     3   0   None    None         0.31      False}

In [150]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'by' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/temperature/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
1,2,0,,,0.17,False
2,2,1,,,0.17,False
3,3,0,,,0.31,False
0,3,1,,,0.45,False


In [151]:
# RELEVEL FROM AND TO COLUMNS WITH THE RIGHT NODES NAMES

import pandas as pd

# Assuming df is your DataFrame

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
1,1,1,,,0.17,False
2,1,2,,,0.17,False
3,2,1,,,0.31,False
0,2,2,,,0.45,False


In [152]:
# show only df rows that have 'is_causal' == True
df[df['is_causal'] == True]

Unnamed: 0,from,to,effect,p_value,probability,is_causal


In [153]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'Indoor temperature': 1, 'Outdoor temperature': 2}

In [154]:
# take only the columns 'from', 'to' for a certain probability threshold
caus = df[df['probability'] > 0.39][['from', 'to']]

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/temperature/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
0,Outdoor temperature,Outdoor temperature


### Load the Ground truth and print the results

In [155]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/temperature/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                                   From -> To
 0  Outdoor temperature -> Outdoor temperature,
                                   From -> To
 0  Outdoor temperature -> Indoor temperature)

In [156]:
print(f'Numbers of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}'), 
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 1
Percentage of correctly estimated causal paths: 0.0%


## Veilleux

Interactions between predatory ciliate Dinidum nasutum and its prey Paramecium aurelia with different values of
Cerophyl concentrations (CC): 0.375 and 0.5. The lengths of the time series are 71 and 65.
- CC05a
- CC035

Ground truth: P -> D in both cases

    adj.mat = 0 1
              1 0  
                

### Load Data

In [121]:
converter = lambda s: float(s.decode().replace(',', '.'))

# Load the data using np.loadtxt
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/veilleux/veilleux_subset_CC05a.txt', delimiter=';', skiprows=1, usecols=range(1, 3), converters={i: converter for i in range(1, 3)})
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/veilleux/veilleux_subset_CC035.txt', delimiter=';', skiprows=1, usecols=range(1, 3), converters={i: converter for i in range(1, 3)})
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/veilleux/veilleux_subset_CC05a.txt', delimiter=';', usecols=range(1, 3))
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/veilleux/veilleux_subset_CC035.txt', delimiter=';', usecols=range(1, 3))

### Get Causal DF and manipulate it

In [130]:
MODEL = model_linear
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_2
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=2, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:    from  to effect p_value  probability  is_causal
 0     3   1   None    None         0.18      False
 1     2   0   None    None         0.48      False
 2     2   1   None    None         0.14      False
 3     3   0   None    None         0.08      False}

In [131]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'by' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/veilleux/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
1,2,0,,,0.48,False
2,2,1,,,0.14,False
3,3,0,,,0.08,False
0,3,1,,,0.18,False


In [132]:
# RELEVEL FROM AND TO COLUMNS WITH THE RIGHT NODES NAMES

import pandas as pd

# Assuming df is your DataFrame

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
1,1,1,,,0.48,False
2,1,2,,,0.14,False
3,2,1,,,0.08,False
0,2,2,,,0.18,False


In [133]:
# show only df rows that have 'is_causal' == True
df[df['is_causal'] == True]

Unnamed: 0,from,to,effect,p_value,probability,is_causal


In [134]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'Paramecium': 1, 'Didinium': 2}

In [135]:
# take only the columns 'from', 'to' for a certain probability threshold
caus = df[df['probability'] > 0.38][['from', 'to']]

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/veilleux/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
1,Paramecium,Paramecium


### Load the Ground truth and print the results

In [136]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/veilleux/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                 From -> To
 1  Paramecium -> Paramecium,
                From -> To
 0  Paramecium -> Didinium)

In [137]:
print(f'Numbers of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}'), 
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 1
Percentage of correctly estimated causal paths: 0.0%


## Web activity

Activity in a web server which is provided by EasyVista
Ten time series collected with a one-minute sampling rate.
The two processed datasets contain 3000 timestamps.
- preprocessed 1
- preprocessed 2

      Ground truth:
          Net_In_Global -> Net_Out_Global
          Net_In_Global -> Nb_process_http
          Net_In_Global -> Nb_connection_mysql
          Nb_process_http -> Cpu_http
          Nb_process_http -> Nb_process_php
          Nb_process_http -> Ram_http
          Nb_process_php -> Cpu_php
          Nb_process_php -> Nb_connection_mysql
          Nb_connection_mysql -> Net_Out_Global
          Nb_connection_mysql -> Disque_write_global
          Nb_connection_mysql -> Cpu_global
          Cpu_http -> Cpu_global
          Cpu_php -> Cpu_global
          Disque_write_global -> Cpu_global
          


### Load Data

In [103]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Web_activity/preprocessed_1.txt', delimiter=',',skiprows=1, usecols=range(1, 11))
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Web_activity/preprocessed_2.txt', delimiter=',',skiprows=1, usecols=range(1, 11))
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Web_activity/preprocessed_1.txt', usecols=range(1, 11))
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Web_activity/preprocessed_2.txt', usecols=range(1, 11))

### Get Causal DF and manipulate it

In [113]:
MODEL = model_linear
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=10, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:     from  to effect p_value  probability  is_causal
 0     12   4   None    None         0.13      False
 1     17   3   None    None         0.00      False
 2     19   0   None    None         0.47      False
 3     19   9   None    None         0.56       True
 4     10   6   None    None         0.07      False
 ..   ...  ..    ...     ...          ...        ...
 95    10   4   None    None         0.00      False
 96    11   3   None    None         0.00      False
 97    13   6   None    None         0.36      False
 98    15   3   None    None         0.22      False
 99    18   8   None    None         0.68       True
 
 [100 rows x 6 columns]}

In [114]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'by' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Web_activity/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,10,0,,,0.54,True
48,10,1,,,0.00,False
88,10,2,,,0.10,False
56,10,3,,,0.16,False
95,10,4,,,0.00,False
...,...,...,...,...,...,...
87,19,5,,,0.09,False
29,19,6,,,0.38,False
94,19,7,,,0.02,False
63,19,8,,,0.41,False


In [115]:
# RELEVEL FROM AND TO COLUMNS WITH THE RIGHT NODES NAMES

import pandas as pd

# Assuming df is your DataFrame

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,1,1,,,0.54,True
48,1,2,,,0.00,False
88,1,3,,,0.10,False
56,1,4,,,0.16,False
95,1,5,,,0.00,False
...,...,...,...,...,...,...
87,10,6,,,0.09,False
29,10,7,,,0.38,False
94,10,8,,,0.02,False
63,10,9,,,0.41,False


In [116]:
# show only df rows that have 'is_causal' == True
df[df['is_causal'] == True]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,1,1,,,0.54,True
49,4,1,,,0.58,True
14,7,7,,,0.55,True
99,9,9,,,0.68,True
3,10,10,,,0.56,True


In [117]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'Cpu_php': 1,
 'Net_In_Global': 2,
 'Disque_write_global': 3,
 'Ram_http': 4,
 'Net_Out_Global': 5,
 'Nb_connection_mysql': 6,
 'Nb_process_php': 7,
 'Cpu_http': 8,
 'Cpu_global': 9,
 'Nb_process_http': 10}

In [118]:
# take only the columns 'from', 'to' for a certain probability threshold
caus = df[df['probability'] > 0.4][['from', 'to']]

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
78,Cpu_php,Cpu_php
49,Ram_http,Cpu_php
6,Ram_http,Cpu_global
22,Nb_connection_mysql,Cpu_php
51,Nb_connection_mysql,Nb_process_php
14,Nb_process_php,Nb_process_php
99,Cpu_global,Cpu_global
2,Nb_process_http,Cpu_php
63,Nb_process_http,Cpu_global
3,Nb_process_http,Nb_process_http


### Load the Ground truth and print the results

In [119]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Web_activity/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                               From -> To
 78                     Cpu_php -> Cpu_php
 49                    Ram_http -> Cpu_php
 6                  Ram_http -> Cpu_global
 22         Nb_connection_mysql -> Cpu_php
 51  Nb_connection_mysql -> Nb_process_php
 14       Nb_process_php -> Nb_process_php
 99               Cpu_global -> Cpu_global
 2              Nb_process_http -> Cpu_php
 63          Nb_process_http -> Cpu_global
 3      Nb_process_http -> Nb_process_http,
                                     From -> To
 0              Net_In_Global -> Net_Out_Global
 1             Net_In_Global -> Nb_process_http
 2         Net_In_Global -> Nb_connection_mysql
 3                  Nb_process_http -> Cpu_http
 4            Nb_process_http -> Nb_process_php
 5                  Nb_process_http -> Ram_http
 6                    Nb_process_php -> Cpu_php
 7        Nb_process_php -> Nb_connection_mysql
 8        Nb_connection_mysql -> Net_Out_Global
 9   Nb_connection_mysql -> Disque_write_globa

In [120]:
print(f'Numbers of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}'), 
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 14
Percentage of correctly estimated causal paths: 0.0%


## Monitoring

### Load Data

In [315]:
import glob

folder_path_1 = '/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/monitoring/returns/storm_chain/'
file_pattern_1 = folder_path_1 + 'contiChainData*.txt'
folder_path_2 = '/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/monitoring/returns/storm_fork/'
file_pattern_2 = folder_path_2 + 'contiForkData*.txt'

# Get a list of all matching files
file_list_1 = glob.glob(file_pattern_1)
file_list_2 = glob.glob(file_pattern_2)

# Load each file and store the data in a list
ts_1 = [np.loadtxt(file, delimiter=',', skiprows=1) for file in file_list_1]
ts_2 = [np.loadtxt(file, delimiter=',', skiprows=1) for file in file_list_2]

### Get Causal DF and manipulate it

In [316]:
MODEL = model_linear
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: storm_chain
# ts_2: storm_fork

d2cwrapper = D2CWrapper(ts_list=ts, 
                        n_variables=5, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:     from  to effect p_value  probability  is_causal
 0      5   4   None    None         0.09      False
 1      5   1   None    None         0.01      False
 2      8   0   None    None         0.04      False
 3      9   2   None    None         0.00      False
 4      8   3   None    None         0.17      False
 5      7   4   None    None         0.39      False
 6      6   2   None    None         0.04      False
 7      7   1   None    None         0.02      False
 8      5   0   None    None         0.28      False
 9      5   3   None    None         0.36      False
 10     8   2   None    None         0.00      False
 11     9   1   None    None         0.08      False
 12     9   4   None    None         0.38      False
 13     6   1   None    None         0.48      False
 14     7   0   None    None         0.01      False
 15     6   4   None    None         0.29      False
 16     7   3   None    None         0.07      False
 17     5   2   None    None         0.00  

In [317]:
causal_df[0] # from 0 to 9

Unnamed: 0,from,to,effect,p_value,probability,is_causal
0,5,4,,,0.09,False
1,5,1,,,0.01,False
2,8,0,,,0.04,False
3,9,2,,,0.0,False
4,8,3,,,0.17,False
5,7,4,,,0.39,False
6,6,2,,,0.04,False
7,7,1,,,0.02,False
8,5,0,,,0.28,False
9,5,3,,,0.36,False


In [318]:
# names of elements in the list
names = [f'causal_df_{i}' for i in range(10)]
names

['causal_df_0',
 'causal_df_1',
 'causal_df_2',
 'causal_df_3',
 'causal_df_4',
 'causal_df_5',
 'causal_df_6',
 'causal_df_7',
 'causal_df_8',
 'causal_df_9']

In [319]:
dataframes = {
    'causal_df_0': causal_df[0],
    'causal_df_1': causal_df[1],
    'causal_df_2': causal_df[2],
    'causal_df_3': causal_df[3],
    'causal_df_4': causal_df[4],
    'causal_df_5': causal_df[5],
    'causal_df_6': causal_df[6],
    'causal_df_7': causal_df[7],
    'causal_df_8': causal_df[8],
    'causal_df_9': causal_df[9]
}

# Convert the list of DataFrame names to actual DataFrame objects using the dictionary
causal_dfs = [dataframes[df_name] for df_name in names]

# Sort each DataFrame in the list by 'from' and 'to' columns
sorted_dfs = [df.sort_values(by=['from', 'to']) for df in causal_dfs]

In [320]:
sorted_dfs[0]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
8,5,0,,,0.28,False
1,5,1,,,0.01,False
17,5,2,,,0.0,False
9,5,3,,,0.36,False
0,5,4,,,0.09,False
23,6,0,,,0.02,False
13,6,1,,,0.48,False
6,6,2,,,0.04,False
24,6,3,,,0.13,False
15,6,4,,,0.29,False


In [321]:
# Assuming sorted_dfs is a list of sorted DataFrames
output_dir = '/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/monitoring/results/'

# Save each DataFrame in the list to a separate CSV file
for i, df in enumerate(sorted_dfs):
    output_path = f'{output_dir}causal_df_{i}.csv'
    df.to_csv(output_path, index=False)

In [322]:
folder_path_1 = '/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/monitoring/returns/storm_chain/'
file_pattern_1 = folder_path_1 + 'contiChainData*.txt'
folder_path_2 = '/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/monitoring/returns/storm_fork/'
file_pattern_2 = folder_path_2 + 'contiForkData*.txt'

# Get a list of all matching files
file_list_1 = glob.glob(file_pattern_1)
file_list_2 = glob.glob(file_pattern_2)

# Load each file as a DataFrame and store the data in a list
ts_1 = [pd.read_csv(file, delimiter=',') for file in file_list_1]
ts_2 = [pd.read_csv(file, delimiter=',') for file in file_list_2]
ts = ts_1

# Assuming ts is a list of DataFrames
name_to_number_list = []

for df in ts:
    # List the names in the first row (columns)
    names = df.columns
    
    # Associate a number to each name
    name_to_number = {name: i+1 for i, name in enumerate(names)}
    
    # Append the dictionary to the list
    name_to_number_list.append(name_to_number)

# Display the list of dictionaries
name_to_number_list

[{'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5},
 {'Metric_0': 1, 'Metric_1': 2, 'Metric_2': 3, 'Metric_3': 4, 'Metric_4': 5}]

In [323]:
# show levels of 'from' in df
sorted_dfs[1]['from'].unique()

array([5, 6, 7, 8, 9])

In [324]:
sorted_dfs[1]['to'].unique()

array([0, 1, 2, 3, 4])

In [325]:
mapping = {5: 1, 6: 2, 7: 3, 8: 4, 9: 5}

# Apply the mapping to variable 'for' to all df in the list
mapped_dfs = [df.replace({'from': mapping, 'to': mapping}) for df in sorted_dfs]

In [326]:
mapping = {0: 1, 1:2, 2:3, 3:4, 4:5}

# Apply the mapping to variable 'to' to all df in the list
mapped_dfs = [df.replace({'from': mapping, 'to': mapping}) for df in mapped_dfs]

In [327]:
mapped_dfs[1]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
8,2,1,,,0.43,False
1,2,2,,,0.1,False
17,2,3,,,0.09,False
9,2,4,,,0.27,False
0,2,5,,,0.45,False
23,3,1,,,0.1,False
13,3,2,,,0.52,True
6,3,3,,,0.04,False
24,3,4,,,0.13,False
15,3,5,,,0.25,False


In [328]:
# show only df rows that have 'is_causal' == True
mapped_dfs[0][mapped_dfs[0]['is_causal'] == True]

Unnamed: 0,from,to,effect,p_value,probability,is_causal


In [329]:
# take only rows that have 'is_causal' == True for each df in the list
causal_dfs = [df[df['probability'] > 0.2] for df in mapped_dfs]

In [330]:
causal_dfs[0]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
8,2,1,,,0.28,False
9,2,4,,,0.36,False
13,3,2,,,0.48,False
15,3,5,,,0.29,False
22,4,3,,,0.22,False
5,4,5,,,0.39,False
12,5,5,,,0.38,False


In [331]:
# Assuming name_to_number_list is a list of dictionaries
number_to_name_list = []

for name_to_number in name_to_number_list:
    # Create the number_to_name dictionary for the current name_to_number dictionary
    number_to_name = {v: k for k, v in name_to_number.items()}
    
    # Append the dictionary to the list
    number_to_name_list.append(number_to_name)

# Display the list of dictionaries
number_to_name_list

[{1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'},
 {1: 'Metric_0', 2: 'Metric_1', 3: 'Metric_2', 4: 'Metric_3', 5: 'Metric_4'}]

In [332]:
# apply the mapping to each caus df
mapped_caus = [df.replace(number_to_name) for df in causal_dfs]

In [333]:
mapped_caus[1]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
8,Metric_1,Metric_0,,,0.43,False
9,Metric_1,Metric_3,,,0.27,False
0,Metric_1,Metric_4,,,0.45,False
13,Metric_2,Metric_1,,,0.52,True
15,Metric_2,Metric_4,,,0.25,False
22,Metric_3,Metric_2,,,0.29,False
16,Metric_3,Metric_3,,,0.32,False
5,Metric_3,Metric_4,,,0.26,False
4,Metric_4,Metric_3,,,0.43,False
12,Metric_4,Metric_4,,,0.56,True


In [334]:
# Assuming mapped_caus is a list of DataFrames
output_dir = '/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/monitoring/results/'

# Save each DataFrame in the list to a separate CSV file
for i, df in enumerate(mapped_caus):
    output_path = f'{output_dir}causal_relations_{i}.csv'
    df.to_csv(output_path, index=False)

In [335]:
# get ground truth
gt_1 = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/monitoring/ground_truth/storm_chain.txt')
gt_2 = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/monitoring/ground_truth/storm_fork.txt')
gt_2

Unnamed: 0,From -> To
0,Metric_2 -> Metric_0
1,Metric_2 -> Metric_1
2,Metric_0 -> Metric_0
3,Metric_1 -> Metric_1
4,Metric_2 -> Metric_2


In [336]:
for i, df in enumerate(mapped_caus):
    df['From -> To'] = df['from'] + ' -> ' + df['to']
    
    # Drop the 'from' and 'to' columns
    df = df.drop(columns=['from', 'to'])
    
    # Update the DataFrame in the list
    mapped_caus[i] = df

In [337]:
# drop all columns except 'From -> To' in each df
mapped_caus[0] = mapped_caus[0][['From -> To']]
mapped_caus[1] = mapped_caus[1][['From -> To']]
mapped_caus[2] = mapped_caus[2][['From -> To']]
mapped_caus[3] = mapped_caus[3][['From -> To']]
mapped_caus[4] = mapped_caus[4][['From -> To']]
mapped_caus[5] = mapped_caus[5][['From -> To']]
mapped_caus[6] = mapped_caus[6][['From -> To']]
mapped_caus[7] = mapped_caus[7][['From -> To']]
mapped_caus[8] = mapped_caus[8][['From -> To']]
mapped_caus[9] = mapped_caus[9][['From -> To']]

In [338]:
# Display the updated list of DataFrames
mapped_caus[0], mapped_caus[1], mapped_caus[2], mapped_caus[3], mapped_caus[4], mapped_caus[5], mapped_caus[6], mapped_caus[7], mapped_caus[8], mapped_caus[9]

(              From -> To
 8   Metric_1 -> Metric_0
 9   Metric_1 -> Metric_3
 13  Metric_2 -> Metric_1
 15  Metric_2 -> Metric_4
 22  Metric_3 -> Metric_2
 5   Metric_3 -> Metric_4
 12  Metric_4 -> Metric_4,
               From -> To
 8   Metric_1 -> Metric_0
 9   Metric_1 -> Metric_3
 0   Metric_1 -> Metric_4
 13  Metric_2 -> Metric_1
 15  Metric_2 -> Metric_4
 22  Metric_3 -> Metric_2
 16  Metric_3 -> Metric_3
 5   Metric_3 -> Metric_4
 4   Metric_4 -> Metric_3
 12  Metric_4 -> Metric_4,
               From -> To
 8   Metric_1 -> Metric_0
 9   Metric_1 -> Metric_3
 13  Metric_2 -> Metric_1
 15  Metric_2 -> Metric_4
 4   Metric_4 -> Metric_3
 12  Metric_4 -> Metric_4,
               From -> To
 9   Metric_1 -> Metric_3
 24  Metric_2 -> Metric_3
 15  Metric_2 -> Metric_4
 22  Metric_3 -> Metric_2
 16  Metric_3 -> Metric_3
 4   Metric_4 -> Metric_3,
               From -> To
 9   Metric_1 -> Metric_3
 13  Metric_2 -> Metric_1
 15  Metric_2 -> Metric_4
 4   Metric_4 -> Metric_3
 11  Met

In [339]:
# Assuming mapped_caus is a list of DataFrames and gt is the ground truth DataFrame
for i, df in enumerate(mapped_caus):
    num_correct_paths = sum(df['From -> To'].isin(gt['From -> To']))
    total_paths = gt.shape[0]
    percentage_correct_paths = round((num_correct_paths / total_paths) * 100, 2)
    
    print(f'DataFrame {i}:')
    print(f'Numbers of correctly estimated causal paths: {num_correct_paths} / {total_paths}')
    print(f'Percentage of correctly estimated causal paths: {percentage_correct_paths}%')
    print()  # For better readability

DataFrame 0:
Numbers of correctly estimated causal paths: 0 / 5
Percentage of correctly estimated causal paths: 0.0%

DataFrame 1:
Numbers of correctly estimated causal paths: 0 / 5
Percentage of correctly estimated causal paths: 0.0%

DataFrame 2:
Numbers of correctly estimated causal paths: 0 / 5
Percentage of correctly estimated causal paths: 0.0%

DataFrame 3:
Numbers of correctly estimated causal paths: 0 / 5
Percentage of correctly estimated causal paths: 0.0%

DataFrame 4:
Numbers of correctly estimated causal paths: 0 / 5
Percentage of correctly estimated causal paths: 0.0%

DataFrame 5:
Numbers of correctly estimated causal paths: 0 / 5
Percentage of correctly estimated causal paths: 0.0%

DataFrame 6:
Numbers of correctly estimated causal paths: 0 / 5
Percentage of correctly estimated causal paths: 0.0%

DataFrame 7:
Numbers of correctly estimated causal paths: 0 / 5
Percentage of correctly estimated causal paths: 0.0%

DataFrame 8:
Numbers of correctly estimated causal paths

In [340]:
print(f'Numbers of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}'), 
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 1 / 5
Percentage of correctly estimated causal paths: 20.0%



# BIGGEST DATASETS


## biggest 1


Data 1

### Load Data

In [None]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt', delimiter=',',skiprows=1)
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt', delimiter=',',skiprows=1)
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt')
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt')

### Get Causal DF and manipulate it

In [None]:
MODEL = model
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=13, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:      from  to effect p_value  probability  is_causal
 0      23   4   None    None         0.06      False
 1      17   3   None    None         0.04      False
 2      19   0   None    None         0.04      False
 3      17  12   None    None         0.04      False
 4      19   9   None    None         0.18      False
 ..    ...  ..    ...     ...          ...        ...
 164    15   3   None    None         0.18      False
 165    15  12   None    None         0.06      False
 166    16  11   None    None         0.16      False
 167    18   8   None    None         0.16      False
 168    21   7   None    None         0.14      False
 
 [169 rows x 6 columns]}

In [None]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'by' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,13,0,,,0.32,False
16,13,1,,,0.20,False
94,13,2,,,0.26,False
30,13,3,,,0.18,False
110,13,4,,,0.10,False
...,...,...,...,...,...,...
86,25,8,,,0.12,False
151,25,9,,,0.16,False
99,25,10,,,0.04,False
39,25,11,,,0.10,False


In [None]:
# RELEVEL FROM AND TO COLUMNS WITH THE RIGHT NODES NAMES

import pandas as pd

# Assuming df is your DataFrame

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,1,1,,,0.32,False
16,1,2,,,0.20,False
94,1,3,,,0.26,False
30,1,4,,,0.18,False
110,1,5,,,0.10,False
...,...,...,...,...,...,...
86,13,9,,,0.12,False
151,13,10,,,0.16,False
99,13,11,,,0.04,False
39,13,12,,,0.10,False


In [None]:
# show only df rows that have 'is_causal' == True
df[df['is_causal'] == True]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,5,5,,,0.62,True
74,10,10,,,0.52,True


In [None]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'memory_usage_Portal': 1,
 'cpu_usage_Portal': 2,
 'Physical_Memory_prct_used_Portal': 3,
 'cpu_prct_used_Portal': 4,
 '0_C_read_Portal': 5,
 'memory_usage_VDI': 6,
 'cpu_usage_VDI': 7,
 'Physical_Memory_prct_used_VDI': 8,
 'cpu_prct_used_VDI': 9,
 '0_C_read_VDI': 10,
 'Chargement_portail': 11,
 'Chargement_IE': 12,
 'Default_Transaction': 13}

In [None]:
# take only the columns 'from', 'to' for a certain probability threshold
caus = df[df['probability'] > 0.4][['from', 'to']]

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
78,0_C_read_Portal,0_C_read_Portal
37,memory_usage_VDI,memory_usage_VDI
119,cpu_prct_used_VDI,cpu_prct_used_VDI
74,0_C_read_VDI,0_C_read_VDI


### Load the Ground truth and print the results

In [None]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                                 From -> To
 78       0_C_read_Portal -> 0_C_read_Portal
 37     memory_usage_VDI -> memory_usage_VDI
 119  cpu_prct_used_VDI -> cpu_prct_used_VDI
 74             0_C_read_VDI -> 0_C_read_VDI,
                                            From -> To
 0   memory_usage_Portal -> Physical_Memory_prct_us...
 1            cpu_usage_Portal -> cpu_prct_used_Portal
 2   Physical_Memory_prct_used_Portal -> 0_C_read_P...
 3             cpu_prct_used_Portal -> 0_C_read_Portal
 4   memory_usage_VDI -> Physical_Memory_prct_used_VDI
 5                  cpu_usage_VDI -> cpu_prct_used_VDI
 6       Physical_Memory_prct_used_VDI -> 0_C_read_VDI
 7                   cpu_prct_used_VDI -> 0_C_read_VDI
 8   Physical_Memory_prct_used_Portal -> Chargement...
 9          cpu_prct_used_Portal -> Chargement_portail
 10              0_C_read_Portal -> Chargement_portail
 11     Physical_Memory_prct_used_VDI -> Chargement_IE
 12                 cpu_prct_used_VDI -> Chargement_IE
 13 

In [None]:
print(f'Numbers of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}'), 
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 16
Percentage of correctly estimated causal paths: 0.0%


## biggest 2


Data 2

### Load Data

In [None]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt', delimiter=',',skiprows=1)
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt', delimiter=',',skiprows=1)
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt')
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt')

### Get Causal DF and manipulate it

In [None]:
MODEL = model
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=13, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:      from  to effect p_value  probability  is_causal
 0      23   4   None    None         0.06      False
 1      17   3   None    None         0.04      False
 2      19   0   None    None         0.04      False
 3      17  12   None    None         0.04      False
 4      19   9   None    None         0.18      False
 ..    ...  ..    ...     ...          ...        ...
 164    15   3   None    None         0.18      False
 165    15  12   None    None         0.06      False
 166    16  11   None    None         0.16      False
 167    18   8   None    None         0.16      False
 168    21   7   None    None         0.14      False
 
 [169 rows x 6 columns]}

In [None]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'by' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,13,0,,,0.32,False
16,13,1,,,0.20,False
94,13,2,,,0.26,False
30,13,3,,,0.18,False
110,13,4,,,0.10,False
...,...,...,...,...,...,...
86,25,8,,,0.12,False
151,25,9,,,0.16,False
99,25,10,,,0.04,False
39,25,11,,,0.10,False


In [None]:
# RELEVEL FROM AND TO COLUMNS WITH THE RIGHT NODES NAMES

import pandas as pd

# Assuming df is your DataFrame

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,1,1,,,0.32,False
16,1,2,,,0.20,False
94,1,3,,,0.26,False
30,1,4,,,0.18,False
110,1,5,,,0.10,False
...,...,...,...,...,...,...
86,13,9,,,0.12,False
151,13,10,,,0.16,False
99,13,11,,,0.04,False
39,13,12,,,0.10,False


In [None]:
# show only df rows that have 'is_causal' == True
df[df['is_causal'] == True]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,5,5,,,0.62,True
74,10,10,,,0.52,True


In [None]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'memory_usage_Portal': 1,
 'cpu_usage_Portal': 2,
 'Physical_Memory_prct_used_Portal': 3,
 'cpu_prct_used_Portal': 4,
 '0_C_read_Portal': 5,
 'memory_usage_VDI': 6,
 'cpu_usage_VDI': 7,
 'Physical_Memory_prct_used_VDI': 8,
 'cpu_prct_used_VDI': 9,
 '0_C_read_VDI': 10,
 'Chargement_portail': 11,
 'Chargement_IE': 12,
 'Default_Transaction': 13}

In [None]:
# take only the columns 'from', 'to' for a certain probability threshold
caus = df[df['probability'] > 0.4][['from', 'to']]

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
78,0_C_read_Portal,0_C_read_Portal
37,memory_usage_VDI,memory_usage_VDI
119,cpu_prct_used_VDI,cpu_prct_used_VDI
74,0_C_read_VDI,0_C_read_VDI


### Load the Ground truth and print the results

In [None]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                                 From -> To
 78       0_C_read_Portal -> 0_C_read_Portal
 37     memory_usage_VDI -> memory_usage_VDI
 119  cpu_prct_used_VDI -> cpu_prct_used_VDI
 74             0_C_read_VDI -> 0_C_read_VDI,
                                            From -> To
 0   memory_usage_Portal -> Physical_Memory_prct_us...
 1            cpu_usage_Portal -> cpu_prct_used_Portal
 2   Physical_Memory_prct_used_Portal -> 0_C_read_P...
 3             cpu_prct_used_Portal -> 0_C_read_Portal
 4   memory_usage_VDI -> Physical_Memory_prct_used_VDI
 5                  cpu_usage_VDI -> cpu_prct_used_VDI
 6       Physical_Memory_prct_used_VDI -> 0_C_read_VDI
 7                   cpu_prct_used_VDI -> 0_C_read_VDI
 8   Physical_Memory_prct_used_Portal -> Chargement...
 9          cpu_prct_used_Portal -> Chargement_portail
 10              0_C_read_Portal -> Chargement_portail
 11     Physical_Memory_prct_used_VDI -> Chargement_IE
 12                 cpu_prct_used_VDI -> Chargement_IE
 13 

In [None]:
print(f'Numbers of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}'), 
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 16
Percentage of correctly estimated causal paths: 0.0%
