# TD2C Validation on Real-World Data

This notebook is designed to validate the `TD2C` method on real-world datasets of varying sizes and compositions. The primary goals are to:

1. **Validate on Real-World Datasets**: Apply the `TD2C` method to datasets that differ in size and composition to assess its robustness and effectiveness in practical scenarios.

2. **Model Training Comparison**: Evaluate `TD2C` with classifiers trained on different kinds of generative processes:
   - **Linear-Only Training**: Validate the method using a model trained exclusively on linear processes.
   - **Mixed Training (Linear and Nonlinear)**: Assess the method using a model trained on an equal mix of linear and nonlinear processes.
   
3. **Performance Comparison**: Compare the results from the different training approaches to determine under which circumstances the `TD2C` estimates perform better, particularly focusing on their application to real-world data.

By the end of this notebook, users will have a clear understanding of how the `TD2C` method performs in various real-world contexts, and will be equipped to apply the method effectively in both linear and nonlinear data environments.


# Settings

### Load Packages

In [1]:
import os
import pickle

import joblib
import numpy as np
import pandas as pd
from tqdm import tqdm

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from imblearn.ensemble import BalancedRandomForestClassifier

from d2c.benchmark import D2CWrapper
from d2c.descriptors_generation.loader import DataLoader


### Load Trained Models

In [2]:
# full model, trained on all generative processes
model = joblib.load('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/model.pkl')
# model trained only on linear generative processes
model_linear = joblib.load('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/td2c_R_N5_LOPO_combined.pkl')

# HYBRID PAPER DATA:

## Antivirus activity

Impact of antivirus activity on servers.
The dataset has 13 time series: 3 are collected at a one-minute sampling rate and the other 10 at a five-minute sampling rate.

The two processed datasets consist of 1321 timestamps:
  - preprocessed 1
  - preprocessed 2
  
  ### Ground truth:
  - memory_usage_Portal -> Physical_Memory_prct_used_Portal
  - cpu_usage_Portal -> cpu_prct_used_Portal
  - Physical_Memory_prct_used_Portal -> 0_C_read_Portal
  - cpu_prct_used_Portal -> 0_C_read_Portal
  - memory_usage_VDI -> Physical_Memory_prct_used_VDI
  - cpu_usage_VDI -> cpu_prct_used_VDI
  - Physical_Memory_prct_used_VDI -> 0_C_read_VDI
  - cpu_prct_used_VDI -> 0_C_read_VDI
  - Physical_Memory_prct_used_Portal -> Chargement_portail
  - cpu_prct_used_Portal -> Chargement_portail
  - 0_C_read_Portal -> Chargement_portail
  - Physical_Memory_prct_used_VDI -> Chargement_IE
  - cpu_prct_used_VDI -> Chargement_IE
  - 0_C_read_VDI -> Chargement_IE
  - Chargement_portail -> Default_Transaction
  - Chargement_IE -> Default_Transaction
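
For a matrix-based comparison, the edge list above can be turned into an adjacency matrix. The following is a minimal sketch, not part of the original notebook: the helper `edges_to_adjacency` is hypothetical, the convention `adj[i, j] = 1` for `i -> j` is an assumption, and the variable order is taken to be the column order of the preprocessed files.

```python
import numpy as np

# Variable order, assumed to match the column order of the preprocessed files.
names = [
    "memory_usage_Portal", "cpu_usage_Portal",
    "Physical_Memory_prct_used_Portal", "cpu_prct_used_Portal",
    "0_C_read_Portal", "memory_usage_VDI", "cpu_usage_VDI",
    "Physical_Memory_prct_used_VDI", "cpu_prct_used_VDI", "0_C_read_VDI",
    "Chargement_portail", "Chargement_IE", "Default_Transaction",
]

# The 16 ground-truth edges listed above, as (cause, effect) pairs.
edges = [
    ("memory_usage_Portal", "Physical_Memory_prct_used_Portal"),
    ("cpu_usage_Portal", "cpu_prct_used_Portal"),
    ("Physical_Memory_prct_used_Portal", "0_C_read_Portal"),
    ("cpu_prct_used_Portal", "0_C_read_Portal"),
    ("memory_usage_VDI", "Physical_Memory_prct_used_VDI"),
    ("cpu_usage_VDI", "cpu_prct_used_VDI"),
    ("Physical_Memory_prct_used_VDI", "0_C_read_VDI"),
    ("cpu_prct_used_VDI", "0_C_read_VDI"),
    ("Physical_Memory_prct_used_Portal", "Chargement_portail"),
    ("cpu_prct_used_Portal", "Chargement_portail"),
    ("0_C_read_Portal", "Chargement_portail"),
    ("Physical_Memory_prct_used_VDI", "Chargement_IE"),
    ("cpu_prct_used_VDI", "Chargement_IE"),
    ("0_C_read_VDI", "Chargement_IE"),
    ("Chargement_portail", "Default_Transaction"),
    ("Chargement_IE", "Default_Transaction"),
]


def edges_to_adjacency(edges, names):
    """Build a binary matrix with adj[i, j] = 1 when names[i] -> names[j]."""
    index = {name: i for i, name in enumerate(names)}
    adj = np.zeros((len(names), len(names)), dtype=int)
    for src, dst in edges:
        adj[index[src], index[dst]] = 1
    return adj


adj = edges_to_adjacency(edges, names)
print(adj.shape, int(adj.sum()))
```

Note that the diagonal of this matrix is all zeros: the ground truth contains no self-loops.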

### Load Data

In [4]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt', delimiter=',',skiprows=1)
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt', delimiter=',',skiprows=1)
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt')
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt')

### Get Causal DF and manipulate it

In [5]:
MODEL = model
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=13, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper
d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:      from  to effect p_value  probability  is_causal
 0      23   4   None    None         0.06      False
 1      17   3   None    None         0.04      False
 2      19   0   None    None         0.04      False
 3      17  12   None    None         0.04      False
 4      19   9   None    None         0.18      False
 ..    ...  ..    ...     ...          ...        ...
 164    15   3   None    None         0.18      False
 165    15  12   None    None         0.06      False
 166    16  11   None    None         0.16      False
 167    18   8   None    None         0.16      False
 168    21   7   None    None         0.14      False
 
 [169 rows x 6 columns]}
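
In this output the `to` column takes values in 0..12 while `from` takes values in 13..25, and with 13 variables and `maxlags=1` there are 13 x 13 = 169 pairs. One reading consistent with these numbers is that a node index encodes `lag * n_variables + variable`, so that 13..25 are the 13 variables at lag 1. This is an assumption about the indexing, not something the notebook states; the helper `decode_node` below is a hypothetical sketch of that reading.

```python
def decode_node(index, n_variables):
    """Split a flat node index into (variable, lag).

    Assumes index = lag * n_variables + variable, which is an
    interpretation of the output above, not a documented convention.
    """
    lag, variable = divmod(index, n_variables)
    return variable, lag


# First row of the output above: from=23, to=4.
print(decode_node(23, 13), decode_node(4, 13))
```

Under this reading, the row `from=23, to=4` describes variable 10 at lag 1 pointing to variable 4 at lag 0.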

In [6]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by the 'from' and 'to' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,13,0,,,0.32,False
16,13,1,,,0.20,False
94,13,2,,,0.26,False
30,13,3,,,0.18,False
110,13,4,,,0.10,False
...,...,...,...,...,...,...
86,25,8,,,0.12,False
151,25,9,,,0.16,False
99,25,10,,,0.04,False
39,25,11,,,0.10,False


In [8]:
# RELABEL THE 'from' AND 'to' COLUMNS WITH 1-BASED VARIABLE NUMBERS

# Step 1: Collect the distinct node indices in ascending order
from_levels = sorted(df['from'].unique())
to_levels = sorted(df['to'].unique())

# Step 2: Map each index to its 1-based rank (e.g. 'from' 13 -> 1, ..., 25 -> 13)
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,1,1,,,0.32,False
16,1,2,,,0.20,False
94,1,3,,,0.26,False
30,1,4,,,0.18,False
110,1,5,,,0.10,False
...,...,...,...,...,...,...
86,13,9,,,0.12,False
151,13,10,,,0.16,False
99,13,11,,,0.04,False
39,13,12,,,0.10,False


In [9]:
# show only the rows flagged as causal
df[df['is_causal']]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,5,5,,,0.62,True
74,10,10,,,0.52,True


In [20]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'memory_usage_Portal': 1,
 'cpu_usage_Portal': 2,
 'Physical_Memory_prct_used_Portal': 3,
 'cpu_prct_used_Portal': 4,
 '0_C_read_Portal': 5,
 'memory_usage_VDI': 6,
 'cpu_usage_VDI': 7,
 'Physical_Memory_prct_used_VDI': 8,
 'cpu_prct_used_VDI': 9,
 '0_C_read_VDI': 10,
 'Chargement_portail': 11,
 'Chargement_IE': 12,
 'Default_Transaction': 13}

In [15]:
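# keep only the 'from' and 'to' columns of the pairs above a probability threshold
# (.copy() avoids pandas' SettingWithCopyWarning when the columns are reassigned below)
caus = df[df['probability'] > 0.4][['from', 'to']].copy()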

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
78,0_C_read_Portal,0_C_read_Portal
37,memory_usage_VDI,memory_usage_VDI
119,cpu_prct_used_VDI,cpu_prct_used_VDI
74,0_C_read_VDI,0_C_read_VDI
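
The 0.4 cutoff used above is a choice, and the set of selected edges depends on it. The sketch below shows one way to see how the selection changes with the threshold. The probabilities in `toy` are made up for illustration and are not the notebook's results; `edges_above` is a hypothetical helper.

```python
import pandas as pd

# Toy scores for illustration only; these are not the notebook's results.
toy = pd.DataFrame({
    "from": [1, 1, 2, 3],
    "to": [2, 3, 3, 1],
    "probability": [0.62, 0.45, 0.30, 0.10],
})


def edges_above(df, threshold):
    """Return the (from, to) pairs whose probability exceeds the threshold."""
    kept = df.loc[df["probability"] > threshold, ["from", "to"]]
    return list(kept.itertuples(index=False, name=None))


# Number of selected edges for a few candidate thresholds.
counts = {t: len(edges_above(toy, t)) for t in (0.2, 0.4, 0.5)}
print(counts)
```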


### Load the Ground truth and print the results

In [16]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                                 From -> To
 78       0_C_read_Portal -> 0_C_read_Portal
 37     memory_usage_VDI -> memory_usage_VDI
 119  cpu_prct_used_VDI -> cpu_prct_used_VDI
 74             0_C_read_VDI -> 0_C_read_VDI,
                                            From -> To
 0   memory_usage_Portal -> Physical_Memory_prct_us...
 1            cpu_usage_Portal -> cpu_prct_used_Portal
 2   Physical_Memory_prct_used_Portal -> 0_C_read_P...
 3             cpu_prct_used_Portal -> 0_C_read_Portal
 4   memory_usage_VDI -> Physical_Memory_prct_used_VDI
 5                  cpu_usage_VDI -> cpu_prct_used_VDI
 6       Physical_Memory_prct_used_VDI -> 0_C_read_VDI
 7                   cpu_prct_used_VDI -> 0_C_read_VDI
 8   Physical_Memory_prct_used_Portal -> Chargement...
 9          cpu_prct_used_Portal -> Chargement_portail
 10              0_C_read_Portal -> Chargement_portail
 11     Physical_Memory_prct_used_VDI -> Chargement_IE
 12                 cpu_prct_used_VDI -> Chargement_IE
 13 

In [17]:
print(f'Number of correctly estimated causal paths: {sum(caus["From -> To"].isin(gt["From -> To"]))} / {gt.shape[0]}')
print(f'Percentage of correctly estimated causal paths: {round((sum(caus["From -> To"].isin(gt["From -> To"])) / gt.shape[0]) * 100, 2)}%')

Number of correctly estimated causal paths: 0 / 16
Percentage of correctly estimated causal paths: 0.0%
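
The figure printed above is the fraction of ground-truth edges that were recovered, i.e. a recall. All four selected edges are self-loops, which cannot match a ground truth that contains none. A fuller picture also reports precision and F1. The following is a minimal sketch on toy edge strings (not the notebook's data); `edge_scores` is a hypothetical helper.

```python
def edge_scores(predicted, truth):
    """Return (precision, recall, f1) for two collections of edges."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: one correct edge, one spurious self-loop, one missed edge.
toy_predicted = ["A -> B", "B -> B"]
toy_truth = ["A -> B", "A -> C"]
print(edge_scores(toy_predicted, toy_truth))
```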


## Dairy markets

Roughly ten years (from 09/2008 to 12/2018) of monthly prices for milk (M), butter (B), and
cheddar cheese (C), so each of the three time series has length 124.

Ground truth: B <- M -> C

                  0 0 0 
        adj.mat = 1 0 1
                  0 0 0
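
The matrix above matches the stated ground truth if rows are causes, columns are effects, and M is the second variable. The sketch below reads the edges back from the matrix under that assumption. The order `["B", "M", "C"]` is inferred rather than stated in the notebook, and `adjacency_to_edges` is a hypothetical helper.

```python
import numpy as np

# Assumed variable order: M must be second for the matrix to encode B <- M -> C.
names = ["B", "M", "C"]
adj = np.array([
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
])


def adjacency_to_edges(adj, names):
    """List (cause, effect) pairs, reading adj[i, j] = 1 as names[i] -> names[j]."""
    rows, cols = np.nonzero(adj)
    return [(names[i], names[j]) for i, j in zip(rows, cols)]


print(adjacency_to_edges(adj, names))
```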


## Temperature

Bivariate time series of length 168 with indoor (I) and outdoor (O) temperature measurements.

Ground truth: O -> I

      adj.mat = 0 1
                0 0 
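
With a ground truth this small, an estimated graph can be compared to it entry by entry. The sketch below uses a simple entry-wise variant of the structural Hamming distance, in which a reversed edge counts as two errors. It assumes the variable order (O, I) with `adj[i, j] = 1` meaning `i -> j`; the function name is hypothetical.

```python
import numpy as np


def entrywise_hamming_distance(predicted, truth):
    """Count the adjacency entries on which two graphs disagree."""
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    if predicted.shape != truth.shape:
        raise ValueError("adjacency matrices must have the same shape")
    return int(np.sum(predicted != truth))


truth = [[0, 1],
         [0, 0]]            # O -> I
reversed_edge = [[0, 0],
                 [1, 0]]    # I -> O
empty_graph = [[0, 0],
               [0, 0]]

print(entrywise_hamming_distance(truth, truth),
      entrywise_hamming_distance(empty_graph, truth),
      entrywise_hamming_distance(reversed_edge, truth))
```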
                



## Veilleux

Interactions between the predatory ciliate Didinium nasutum and its prey Paramecium aurelia at two
Cerophyl concentrations (CC): 0.375 and 0.5. The lengths of the time series are 71 and 65.
- CC05a
- CC035

Ground truth: P -> D in both cases

    adj.mat = 0 1
              1 0  
                


## Web activity

Activity of a web server, provided by EasyVista.
The dataset has ten time series collected at a one-minute sampling rate.
The two processed datasets contain 3000 timestamps.
- preprocessed 1
- preprocessed 2

### Ground truth:
- Net_In_Global -> Net_Out_Global
- Net_In_Global -> Nb_process_http
- Net_In_Global -> Nb_connection_mysql
- Nb_process_http -> Cpu_http
- Nb_process_http -> Nb_process_php
- Nb_process_http -> Ram_http
- Nb_process_php -> Cpu_php
- Nb_process_php -> Nb_connection_mysql
- Nb_connection_mysql -> Net_Out_Global
- Nb_connection_mysql -> Disque_write_global
- Nb_connection_mysql -> Cpu_global
- Cpu_http -> Cpu_global
- Cpu_php -> Cpu_global
- Disque_write_global -> Cpu_global
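
As a quick sanity check, the edge list above can be parsed and tested for cycles. The sketch below uses the standard-library `graphlib` module (Python 3.9+); `parse_edges` is a hypothetical helper, and the check itself is not part of the original notebook.

```python
from graphlib import TopologicalSorter

edge_lines = [
    "Net_In_Global -> Net_Out_Global",
    "Net_In_Global -> Nb_process_http",
    "Net_In_Global -> Nb_connection_mysql",
    "Nb_process_http -> Cpu_http",
    "Nb_process_http -> Nb_process_php",
    "Nb_process_http -> Ram_http",
    "Nb_process_php -> Cpu_php",
    "Nb_process_php -> Nb_connection_mysql",
    "Nb_connection_mysql -> Net_Out_Global",
    "Nb_connection_mysql -> Disque_write_global",
    "Nb_connection_mysql -> Cpu_global",
    "Cpu_http -> Cpu_global",
    "Cpu_php -> Cpu_global",
    "Disque_write_global -> Cpu_global",
]


def parse_edges(lines):
    """Turn 'A -> B' strings into (cause, effect) tuples."""
    return [tuple(part.strip() for part in line.split("->")) for line in lines]


edges = parse_edges(edge_lines)
nodes = {node for edge in edges for node in edge}

# TopologicalSorter expects a mapping from each node to its predecessors.
predecessors = {node: set() for node in nodes}
for src, dst in edges:
    predecessors[dst].add(src)

# static_order() raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(predecessors).static_order())
print(len(nodes), len(edges), order[0])
```

The ten distinct node names agree with the ten time series in the dataset, and the sort succeeds, so the ground truth is a directed acyclic graph with `Net_In_Global` as its only root.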
          


### Load Data

In [None]:
# NOTE: these paths still point to the Antivirus_activity folder;
# update them to the Web activity files before running this section.
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt', delimiter=',',skiprows=1)
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt', delimiter=',',skiprows=1)
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt')
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt')

### Get Causal DF and manipulate it

In [None]:
MODEL = model
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=13, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:      from  to effect p_value  probability  is_causal
 0      23   4   None    None         0.06      False
 1      17   3   None    None         0.04      False
 2      19   0   None    None         0.04      False
 3      17  12   None    None         0.04      False
 4      19   9   None    None         0.18      False
 ..    ...  ..    ...     ...          ...        ...
 164    15   3   None    None         0.18      False
 165    15  12   None    None         0.06      False
 166    16  11   None    None         0.16      False
 167    18   8   None    None         0.16      False
 168    21   7   None    None         0.14      False
 
 [169 rows x 6 columns]}
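
A long table of 169 candidate edges is hard to read at a glance. One option is to reshape it into a `from` by `to` matrix of probabilities, as sketched below. The toy frame only mimics the `from`, `to` and `probability` columns shown above; its values and the names `toy`, `prob_matrix` and `strongest` are hypothetical.

```python
import pandas as pd

# Toy stand-in for causal_df[0]; the values are illustrative only.
toy = pd.DataFrame({
    "from": [3, 3, 4, 4, 5, 5],
    "to": [0, 1, 0, 1, 0, 1],
    "probability": [0.06, 0.62, 0.18, 0.04, 0.52, 0.14],
})

# Rows are source indices, columns are target indices.
prob_matrix = toy.pivot(index="from", columns="to", values="probability")
print(prob_matrix)

# The single most probable candidate edge.
best_row = toy.loc[toy["probability"].idxmax()]
strongest = (int(best_row["from"]), int(best_row["to"]))
print(strongest)
```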

In [None]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'to' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,13,0,,,0.32,False
16,13,1,,,0.20,False
94,13,2,,,0.26,False
30,13,3,,,0.18,False
110,13,4,,,0.10,False
...,...,...,...,...,...,...
86,25,8,,,0.12,False
151,25,9,,,0.16,False
99,25,10,,,0.04,False
39,25,11,,,0.10,False


In [None]:
# RELABEL THE 'from' AND 'to' COLUMNS WITH 1-BASED NODE NUMBERS
# df is the sorted causal DataFrame from the previous cell

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,1,1,,,0.32,False
16,1,2,,,0.20,False
94,1,3,,,0.26,False
30,1,4,,,0.18,False
110,1,5,,,0.10,False
...,...,...,...,...,...,...
86,13,9,,,0.12,False
151,13,10,,,0.16,False
99,13,11,,,0.04,False
39,13,12,,,0.10,False
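
The relabelling above depends on the order in which the unique values appear, so it only gives the intended numbers when every index is present and the frame is sorted. A sketch of an order-independent alternative follows. It rests on an ASSUMPTION that is not confirmed by this notebook: that a flattened index equals `lag * n_variables + variable_position`, which is consistent with `from` running over 13..25 and `to` over 0..12 for 13 variables and `maxlags=1`. The helper `split_lagged_index` is hypothetical.

```python
import pandas as pd

N_VARIABLES = 13  # same value as n_variables passed to D2CWrapper


def split_lagged_index(index, n_variables=N_VARIABLES):
    """Return (1-based variable number, lag) for a flattened node index.

    ASSUMPTION: index == lag * n_variables + variable_position.
    """
    lag, position = divmod(int(index), n_variables)
    return position + 1, lag


# Toy rows using index values that appear in the output above.
toy = pd.DataFrame({"from": [13, 17, 25], "to": [0, 4, 12]})
toy["from_var"] = toy["from"].map(lambda i: split_lagged_index(i)[0])
toy["from_lag"] = toy["from"].map(lambda i: split_lagged_index(i)[1])
toy["to_var"] = toy["to"].map(lambda i: split_lagged_index(i)[0])
print(toy)
```

Under that assumption the result agrees with the table above (13 becomes 1, 25 becomes 13) and also keeps the lag, which the enumeration discards.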


In [None]:
# show only the df rows flagged as causal
df[df['is_causal']]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,5,5,,,0.62,True
74,10,10,,,0.52,True


In [None]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'memory_usage_Portal': 1,
 'cpu_usage_Portal': 2,
 'Physical_Memory_prct_used_Portal': 3,
 'cpu_prct_used_Portal': 4,
 '0_C_read_Portal': 5,
 'memory_usage_VDI': 6,
 'cpu_usage_VDI': 7,
 'Physical_Memory_prct_used_VDI': 8,
 'cpu_prct_used_VDI': 9,
 '0_C_read_VDI': 10,
 'Chargement_portail': 11,
 'Chargement_IE': 12,
 'Default_Transaction': 13}

In [None]:
# keep the 'from' and 'to' columns of the edges whose probability exceeds 0.4
# (.copy() avoids a SettingWithCopyWarning when the columns are reassigned below)
caus = df.loc[df['probability'] > 0.4, ['from', 'to']].copy()

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
78,0_C_read_Portal,0_C_read_Portal
37,memory_usage_VDI,memory_usage_VDI
119,cpu_prct_used_VDI,cpu_prct_used_VDI
74,0_C_read_VDI,0_C_read_VDI
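
Every edge selected above links a variable to itself, while the ground-truth rows displayed in this notebook link distinct variables, so such edges cannot match them. A minimal sketch of separating self-links from cross-variable links before the comparison is shown below. The frame `toy_caus` is hypothetical: its first two rows copy the output above and its third row is invented for illustration.

```python
import pandas as pd

# Toy selection: two self-links from the output above plus one invented cross-link.
toy_caus = pd.DataFrame({
    "from": ["0_C_read_Portal", "cpu_prct_used_VDI", "cpu_prct_used_VDI"],
    "to": ["0_C_read_Portal", "cpu_prct_used_VDI", "0_C_read_VDI"],
})

is_self_link = toy_caus["from"] == toy_caus["to"]
cross_links = toy_caus[~is_self_link].reset_index(drop=True)

print(f"{int(is_self_link.sum())} self-links dropped, {len(cross_links)} cross-links kept")
print(cross_links)
```

Whether self-links should be dropped or reported separately depends on how the ground truth treats autoregressive effects, which this notebook does not state.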


### Load the Ground truth and print the results

In [None]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                                 From -> To
 78       0_C_read_Portal -> 0_C_read_Portal
 37     memory_usage_VDI -> memory_usage_VDI
 119  cpu_prct_used_VDI -> cpu_prct_used_VDI
 74             0_C_read_VDI -> 0_C_read_VDI,
                                            From -> To
 0   memory_usage_Portal -> Physical_Memory_prct_us...
 1            cpu_usage_Portal -> cpu_prct_used_Portal
 2   Physical_Memory_prct_used_Portal -> 0_C_read_P...
 3             cpu_prct_used_Portal -> 0_C_read_Portal
 4   memory_usage_VDI -> Physical_Memory_prct_used_VDI
 5                  cpu_usage_VDI -> cpu_prct_used_VDI
 6       Physical_Memory_prct_used_VDI -> 0_C_read_VDI
 7                   cpu_prct_used_VDI -> 0_C_read_VDI
 8   Physical_Memory_prct_used_Portal -> Chargement...
 9          cpu_prct_used_Portal -> Chargement_portail
 10              0_C_read_Portal -> Chargement_portail
 11     Physical_Memory_prct_used_VDI -> Chargement_IE
 12                 cpu_prct_used_VDI -> Chargement_IE
 13 

In [None]:
n_correct = caus["From -> To"].isin(gt["From -> To"]).sum()
print(f'Numbers of correctly estimated causal paths: {n_correct} / {gt.shape[0]}')
print(f'Percentage of correctly estimated causal paths: {round((n_correct / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 16
Percentage of correctly estimated causal paths: 0.0%



# Biggest Datasets


## Biggest 1


Data 1

### Load Data

In [None]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt', delimiter=',',skiprows=1)
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt', delimiter=',',skiprows=1)
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt')
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt')

### Get Causal DF and manipulate it

In [None]:
MODEL = model
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=13, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:      from  to effect p_value  probability  is_causal
 0      23   4   None    None         0.06      False
 1      17   3   None    None         0.04      False
 2      19   0   None    None         0.04      False
 3      17  12   None    None         0.04      False
 4      19   9   None    None         0.18      False
 ..    ...  ..    ...     ...          ...        ...
 164    15   3   None    None         0.18      False
 165    15  12   None    None         0.06      False
 166    16  11   None    None         0.16      False
 167    18   8   None    None         0.16      False
 168    21   7   None    None         0.14      False
 
 [169 rows x 6 columns]}

In [None]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'to' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,13,0,,,0.32,False
16,13,1,,,0.20,False
94,13,2,,,0.26,False
30,13,3,,,0.18,False
110,13,4,,,0.10,False
...,...,...,...,...,...,...
86,25,8,,,0.12,False
151,25,9,,,0.16,False
99,25,10,,,0.04,False
39,25,11,,,0.10,False


In [None]:
# RELABEL THE 'from' AND 'to' COLUMNS WITH 1-BASED NODE NUMBERS
# df is the sorted causal DataFrame from the previous cell

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,1,1,,,0.32,False
16,1,2,,,0.20,False
94,1,3,,,0.26,False
30,1,4,,,0.18,False
110,1,5,,,0.10,False
...,...,...,...,...,...,...
86,13,9,,,0.12,False
151,13,10,,,0.16,False
99,13,11,,,0.04,False
39,13,12,,,0.10,False


In [None]:
# show only the df rows flagged as causal
df[df['is_causal']]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,5,5,,,0.62,True
74,10,10,,,0.52,True


In [None]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'memory_usage_Portal': 1,
 'cpu_usage_Portal': 2,
 'Physical_Memory_prct_used_Portal': 3,
 'cpu_prct_used_Portal': 4,
 '0_C_read_Portal': 5,
 'memory_usage_VDI': 6,
 'cpu_usage_VDI': 7,
 'Physical_Memory_prct_used_VDI': 8,
 'cpu_prct_used_VDI': 9,
 '0_C_read_VDI': 10,
 'Chargement_portail': 11,
 'Chargement_IE': 12,
 'Default_Transaction': 13}

In [None]:
# keep the 'from' and 'to' columns of the edges whose probability exceeds 0.4
# (.copy() avoids a SettingWithCopyWarning when the columns are reassigned below)
caus = df.loc[df['probability'] > 0.4, ['from', 'to']].copy()

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
78,0_C_read_Portal,0_C_read_Portal
37,memory_usage_VDI,memory_usage_VDI
119,cpu_prct_used_VDI,cpu_prct_used_VDI
74,0_C_read_VDI,0_C_read_VDI


### Load the Ground truth and print the results

In [None]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                                 From -> To
 78       0_C_read_Portal -> 0_C_read_Portal
 37     memory_usage_VDI -> memory_usage_VDI
 119  cpu_prct_used_VDI -> cpu_prct_used_VDI
 74             0_C_read_VDI -> 0_C_read_VDI,
                                            From -> To
 0   memory_usage_Portal -> Physical_Memory_prct_us...
 1            cpu_usage_Portal -> cpu_prct_used_Portal
 2   Physical_Memory_prct_used_Portal -> 0_C_read_P...
 3             cpu_prct_used_Portal -> 0_C_read_Portal
 4   memory_usage_VDI -> Physical_Memory_prct_used_VDI
 5                  cpu_usage_VDI -> cpu_prct_used_VDI
 6       Physical_Memory_prct_used_VDI -> 0_C_read_VDI
 7                   cpu_prct_used_VDI -> 0_C_read_VDI
 8   Physical_Memory_prct_used_Portal -> Chargement...
 9          cpu_prct_used_Portal -> Chargement_portail
 10              0_C_read_Portal -> Chargement_portail
 11     Physical_Memory_prct_used_VDI -> Chargement_IE
 12                 cpu_prct_used_VDI -> Chargement_IE
 13 

In [None]:
n_correct = caus["From -> To"].isin(gt["From -> To"]).sum()
print(f'Numbers of correctly estimated causal paths: {n_correct} / {gt.shape[0]}')
print(f'Percentage of correctly estimated causal paths: {round((n_correct / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 16
Percentage of correctly estimated causal paths: 0.0%
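
The cells above select edges with a fixed probability threshold of 0.4. A sketch of how the number of selected edges and the recall change with that threshold is given below. The helper `recall_by_threshold`, the `edge` column and the toy values are hypothetical and are not notebook results.

```python
import pandas as pd


def recall_by_threshold(scored, truth, thresholds):
    """For each threshold, count the selected edges and the recall against truth."""
    truth = set(truth)
    rows = []
    for threshold in thresholds:
        selected = set(scored.loc[scored["probability"] > threshold, "edge"])
        rows.append({
            "threshold": threshold,
            "n_selected": len(selected),
            "recall": len(selected & truth) / len(truth),
        })
    return pd.DataFrame(rows)


# Toy scored edges for illustration only.
toy_scored = pd.DataFrame({
    "edge": ["a -> b", "b -> c", "c -> a", "a -> c"],
    "probability": [0.62, 0.32, 0.18, 0.45],
})
toy_truth = ["a -> b", "b -> c"]

sweep = recall_by_threshold(toy_scored, toy_truth, [0.2, 0.4, 0.6])
print(sweep)
```

A lower threshold selects more edges and can only keep or raise recall, at the cost of more false positives.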


## Biggest 2


Data 2

### Load Data

In [None]:
ts_1 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt', delimiter=',',skiprows=1)
ts_2 = np.loadtxt('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt', delimiter=',',skiprows=1)
ts_1_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_1.txt')
ts_2_df = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/preprocessed_2.txt')

### Get Causal DF and manipulate it

In [None]:
MODEL = model
# model: USING A MODEL TRAINED ON ALL PROCESSES
# model_linear: USING A MODEL TRAINED ONLY ON LINEAR PROCESSES

ts = ts_1
# ts_1: Antivirus_activity_1
# ts_2: Antivirus_activity_2

d2cwrapper = D2CWrapper(ts_list=[ts], 
                        n_variables=13, 
                        model=MODEL, 
                        maxlags=1, 
                        n_jobs=1, 
                        full=True, 
                        quantiles=True,
                        filename='d2c_results',
                        normalize=True, 
                        cmi='original', 
                        mb_estimator='ts')

d2cwrapper.run()
causal_df = d2cwrapper.get_causal_dfs()
causal_df

{0:      from  to effect p_value  probability  is_causal
 0      23   4   None    None         0.06      False
 1      17   3   None    None         0.04      False
 2      19   0   None    None         0.04      False
 3      17  12   None    None         0.04      False
 4      19   9   None    None         0.18      False
 ..    ...  ..    ...     ...          ...        ...
 164    15   3   None    None         0.18      False
 165    15  12   None    None         0.06      False
 166    16  11   None    None         0.16      False
 167    18   8   None    None         0.16      False
 168    21   7   None    None         0.14      False
 
 [169 rows x 6 columns]}

In [None]:
# PRINT DATAFRAME WITH CAUSAL RELATIONSHIPS

df = causal_df[0]
# order df by 'from' and 'to' columns
df = df.sort_values(by=['from', 'to'])

df.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_df.csv', index=False)

df

Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,13,0,,,0.32,False
16,13,1,,,0.20,False
94,13,2,,,0.26,False
30,13,3,,,0.18,False
110,13,4,,,0.10,False
...,...,...,...,...,...,...
86,25,8,,,0.12,False
151,25,9,,,0.16,False
99,25,10,,,0.04,False
39,25,11,,,0.10,False


In [None]:
# RELABEL THE 'from' AND 'to' COLUMNS WITH 1-BASED NODE NUMBERS
# df is the sorted causal DataFrame from the previous cell

# Step 1: Get the unique levels
from_levels = df['from'].unique()
to_levels = df['to'].unique()

# Step 2: Create mapping dictionaries
from_mapping = {level: i+1 for i, level in enumerate(from_levels)}
to_mapping = {level: i+1 for i, level in enumerate(to_levels)}

# Step 3: Apply the mapping to the columns
df['from'] = df['from'].map(from_mapping)
df['to'] = df['to'].map(to_mapping)

# Display the result
df


Unnamed: 0,from,to,effect,p_value,probability,is_causal
79,1,1,,,0.32,False
16,1,2,,,0.20,False
94,1,3,,,0.26,False
30,1,4,,,0.18,False
110,1,5,,,0.10,False
...,...,...,...,...,...,...
86,13,9,,,0.12,False
151,13,10,,,0.16,False
99,13,11,,,0.04,False
39,13,12,,,0.10,False


In [None]:
# show only the df rows flagged as causal
df[df['is_causal']]

Unnamed: 0,from,to,effect,p_value,probability,is_causal
78,5,5,,,0.62,True
74,10,10,,,0.52,True


In [None]:
# LOAD DATA AS DATAFRAME
ts_df = ts_1_df
# ts_1_df: Antivirus_activity_1
# ts_2_df: Antivirus_activity_2

# list the names in the first row
names = ts_df.columns

# associate a number to each name
name_to_number = {name: i+1 for i, name in enumerate(names)}

name_to_number

{'memory_usage_Portal': 1,
 'cpu_usage_Portal': 2,
 'Physical_Memory_prct_used_Portal': 3,
 'cpu_prct_used_Portal': 4,
 '0_C_read_Portal': 5,
 'memory_usage_VDI': 6,
 'cpu_usage_VDI': 7,
 'Physical_Memory_prct_used_VDI': 8,
 'cpu_prct_used_VDI': 9,
 '0_C_read_VDI': 10,
 'Chargement_portail': 11,
 'Chargement_IE': 12,
 'Default_Transaction': 13}

In [None]:
# keep the 'from' and 'to' columns of the edges whose probability exceeds 0.4
# (.copy() avoids a SettingWithCopyWarning when the columns are reassigned below)
caus = df.loc[df['probability'] > 0.4, ['from', 'to']].copy()

number_to_name = {v: k for k, v in name_to_number.items()}

# apply the mapping
caus['from'] = caus['from'].replace(number_to_name)
caus['to'] = caus['to'].replace(number_to_name)

# save caus to a csv file
caus.to_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/results/causal_relations.csv', index=False)

caus

Unnamed: 0,from,to
78,0_C_read_Portal,0_C_read_Portal
37,memory_usage_VDI,memory_usage_VDI
119,cpu_prct_used_VDI,cpu_prct_used_VDI
74,0_C_read_VDI,0_C_read_VDI


### Load the Ground truth and print the results

In [None]:
# load the ground truth
gt = pd.read_csv('/home/jpalombarini/td2c/notebooks/contributions/Real_data_validation/data/Antivirus_activity/ground_truth.txt')

# merge column 'from' with column 'to' of caus to create a new column 'From -> To'
caus['From -> To'] = caus['from'] + ' -> ' + caus['to']
caus = caus.drop(columns=['from', 'to'])
caus, gt

(                                 From -> To
 78       0_C_read_Portal -> 0_C_read_Portal
 37     memory_usage_VDI -> memory_usage_VDI
 119  cpu_prct_used_VDI -> cpu_prct_used_VDI
 74             0_C_read_VDI -> 0_C_read_VDI,
                                            From -> To
 0   memory_usage_Portal -> Physical_Memory_prct_us...
 1            cpu_usage_Portal -> cpu_prct_used_Portal
 2   Physical_Memory_prct_used_Portal -> 0_C_read_P...
 3             cpu_prct_used_Portal -> 0_C_read_Portal
 4   memory_usage_VDI -> Physical_Memory_prct_used_VDI
 5                  cpu_usage_VDI -> cpu_prct_used_VDI
 6       Physical_Memory_prct_used_VDI -> 0_C_read_VDI
 7                   cpu_prct_used_VDI -> 0_C_read_VDI
 8   Physical_Memory_prct_used_Portal -> Chargement...
 9          cpu_prct_used_Portal -> Chargement_portail
 10              0_C_read_Portal -> Chargement_portail
 11     Physical_Memory_prct_used_VDI -> Chargement_IE
 12                 cpu_prct_used_VDI -> Chargement_IE
 13 

In [None]:
n_correct = caus["From -> To"].isin(gt["From -> To"]).sum()
print(f'Numbers of correctly estimated causal paths: {n_correct} / {gt.shape[0]}')
print(f'Percentage of correctly estimated causal paths: {round((n_correct / gt.shape[0]) * 100, 2)}%')

Numbers of correctly estimated causal paths: 0 / 16
Percentage of correctly estimated causal paths: 0.0%
