# **Scenario 3 : Data Generation from a Probability Distribution**

As before, we train ANNs or define our own functions to calculate each variable in the DAG. However, instead of a Monte Carlo (agent-based) approach, we use Causal Jazz to build a discretised probability distribution.


Import the usual suspects and the visualiser, inference, and data modules from causaljazz.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import csv

from causaljazz.visualiser import Visualiser
from causaljazz.inference import TEDAG_FUNCTION
from causaljazz.inference import TEDAG
import causaljazz.data as data

from scipy.stats import norm

import tensorflow as tf
from tensorflow.keras import layers



# **Helper Functions**

In [2]:
# Return the approximate discretised probability mass function for a normal distribution with x_mean and x_sd. The discretisation goes from x_min to x_max with res bins.
def generateGaussianNoisePmf(x_min, x_max, x_mean, x_sd, res):
    x_space = np.linspace(x_min, x_max, res)
    x_dist = [a * ((x_max-x_min)/res) for a in norm.pdf(x_space, x_mean, x_sd)]
    x_pmf = [a / sum(x_dist) for a in x_dist]
    return x_pmf
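
As a quick sanity check of the discretisation idea, the sketch below rebuilds a similar pmf with NumPy only and confirms that it sums to one and recovers the requested mean and standard deviation. The function name `gaussian_pmf` is hypothetical and is not part of causaljazz. It also shows that any constant factor, such as the bin width, cancels once the values are normalised.

```python
import numpy as np

def gaussian_pmf(x_min, x_max, mean, sd, res):
    """Evaluate a Gaussian on `res` evenly spaced points and normalise to a pmf."""
    x = np.linspace(x_min, x_max, res)
    # Unnormalised density: constant factors cancel in the normalisation below.
    density = np.exp(-0.5 * ((x - mean) / sd) ** 2)
    return x, density / density.sum()

# Same range, spread, and size as the noise pmf used for C later in this notebook.
x, pmf = gaussian_pmf(-0.5, 0.5, 0.0, 0.1, 70)

total = pmf.sum()                                           # total probability mass
approx_mean = float(np.sum(x * pmf))                        # mean of the discretised pmf
approx_sd = float(np.sqrt(np.sum(pmf * (x - approx_mean) ** 2)))
print(total, approx_mean, approx_sd)
```

If the range is too narrow relative to the standard deviation, the tails are cut off and the recovered spread will be smaller than requested.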

Load the original data into a NumPy array.
The original data is made up of up to 25 sets of 1000 data points each, and number_of_experiments controls how many sets are loaded. Loading more sets gives the ANNs more training data, which should improve the learned functions.

In [3]:
csv_names = ['X1', 'X2', 'LC', 'X3', 'SF']

# Load the data into an array
ground = [] # The ground truth array
number_of_experiments = 25
with open('ground.csv') as csvfile:
    ground_reader = csv.DictReader(csvfile)
    for row in ground_reader:
        if int(row['Sim']) > number_of_experiments:
            break
        d = [float(a) for a in [row[k] for k in csv_names]]
        ground += [d]

# Min-max normalise each column to [0, 1], otherwise training doesn't work!
max_vals = np.max(ground, axis=0)
min_vals = np.min(ground, axis=0)
ground = ((np.array(ground)-min_vals)/(max_vals-min_vals))
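
The normalisation above is a per-column min-max scaling. The sketch below applies the same formula to a small made-up array and shows how to invert it, which is useful for reporting generated values in the original units. The array `raw` is illustrative only.

```python
import numpy as np

# Toy data: three rows, two columns on very different scales.
raw = np.array([[1.0, 10.0],
                [3.0, 30.0],
                [2.0, 50.0]])

min_vals = raw.min(axis=0)
max_vals = raw.max(axis=0)

# Forward transform: each column is mapped onto [0, 1].
scaled = (raw - min_vals) / (max_vals - min_vals)

# Inverse transform: recover the original units from the scaled values.
restored = scaled * (max_vals - min_vals) + min_vals
print(scaled)
```

Note that a column with a single constant value would make `max_vals - min_vals` zero, so such columns need to be handled separately.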

Set grid resolution variables and flags.

In [4]:
generate_models = True

input_res = 10
output_res = 30
output_buffer = 5
total_output_size = 2*(output_res+output_buffer)

# **Latent Function C**

Before learning the ANN functions, let's design our function for C.<br>

*func_c* takes two inputs, X<sub>1</sub> and X<sub>2</sub>, and returns a distribution across two values, 0 and 1.

To achieve the expected path coefficients, we first define the expected value of a non-dichotomised (non-binary) C based on X<sub>1</sub> and X<sub>2</sub>.
<br><br>
E[C] <- 0.3X<sub>1</sub> + 0.3X<sub>2</sub>
<br><br>
Next, we must define the distribution of C around its expected value for each value of X<sub>1</sub> and X<sub>2</sub>. For simplicity we assume that C is normally distributed around the expected value with a fixed standard deviation (the code below uses 0.1 on the normalised data). In general, the conditional distributions could depend on the inputs and be considerably more complicated than a normal distribution. Note that any skew or bias in the distribution will affect the resulting covariance.<br><br>
If the latent variable needs no further processing (such as dichotomisation), the function is simple: it can return a normal distribution around the expected value. However, this contributes no information from the latent variable beyond some additional variance, which may be all that is required (enigmatic variation). In this case, though, we also wish to capture a 30/70 split between high- and low-risk groups. <br><br>
To get a distribution across two values, 1 and 0, we need to dichotomise the joint distribution so that 30% falls into the high-risk group (C=1). This step has to be performed on the full joint distribution of X<sub>1</sub>, X<sub>2</sub>, and C so that lower values of X<sub>1</sub> and X<sub>2</sub> are more likely to appear in the low-risk group.
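
The dichotomisation step described above can be sketched on samples rather than on a discretised distribution. The example below is a minimal illustration under stated assumptions: X<sub>1</sub> and X<sub>2</sub> are drawn as independent standard normals (in the notebook they come from the data), the noise standard deviation is 0.1, and the cut point is the 70th percentile of the continuous C over the whole sample. None of the names are part of causaljazz.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumption: independent standard-normal inputs, for illustration only.
x1 = rng.normal(0.0, 1.0, n)
x2 = rng.normal(0.0, 1.0, n)

# Continuous latent C: expected value 0.3*X1 + 0.3*X2 plus Gaussian noise.
noise_sd = 0.1
c_cont = 0.3 * x1 + 0.3 * x2 + rng.normal(0.0, noise_sd, n)

# Dichotomise on the joint sample: the top 30% of C becomes the high-risk group.
threshold = np.quantile(c_cont, 0.7)
c = (c_cont > threshold).astype(int)

high_fraction = c.mean()
mean_x1_high = x1[c == 1].mean()
mean_x1_low = x1[c == 0].mean()
print(high_fraction, mean_x1_high, mean_x1_low)
```

Because the threshold is taken over the joint sample rather than per input value, rows with low X<sub>1</sub> and X<sub>2</sub> land in the low-risk group more often, which is the behaviour the text asks for.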



In [5]:
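def func_c_exp(y):
  # Expected value of the continuous latent C for each row of inputs [X1, X2].
  y = np.array(y)
  x1 = y[:,0]
  x2 = y[:,1]

  out = np.reshape((0.3*x1 + 0.3*x2), (y.shape[0],1))
  print(out.shape)  # debug: (number of rows, 1)
  return out

def func_c_noise(y):
  # Noise pmf for C: the same zero-mean Gaussian (sd 0.1) for every row, independent of the input values.
  n_rows = np.array(y).shape[0]
  noise_pmf = generateGaussianNoisePmf(-0.5, 0.5, 0, 0.1, total_output_size)

  out = np.array([noise_pmf for _ in range(n_rows)]).T
  print(out.shape)  # debug: (total_output_size, number of rows)
  return out

def func_c_sampled(y):
  # Draw one continuous sample of C for a single input row [X1, X2].
  x1 = y[0]
  x2 = y[1]

  exp_c = 0.3*x1 + 0.3*x2
  cont_c = np.random.normal(loc=exp_c, scale=0.1)
  return cont_c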

Learn the functions for X<sub>2</sub>, X<sub>3</sub>, and S. In this example dataset, X<sub>1</sub> is normally distributed around 0.0 with a standard deviation of 1.0.
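
Each learned variable in this notebook is represented by a pair of functions: one for the expected value and one for the noise around it. The internals of `data.trainANN` are not shown here, so the sketch below is only an assumption about the general shape of that decomposition. It uses a linear fit as a stand-in for the ANN and a histogram of residuals as a stand-in for the noise pmf, on made-up data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with a hypothetical relationship X2 = 0.5*X1 + noise.
x1 = rng.uniform(0.0, 1.0, 5000)
x2 = 0.5 * x1 + rng.normal(0.0, 0.05, 5000)

# Step 1: fit the expected value of X2 given X1 (linear stand-in for an ANN).
slope, intercept = np.polyfit(x1, x2, 1)

def expected_x2(x):
    return slope * x + intercept

# Step 2: describe what the expectation misses as a discretised noise distribution.
residuals = x2 - expected_x2(x1)
counts, edges = np.histogram(residuals, bins=30)
noise_pmf = counts / counts.sum()
print(slope, intercept, noise_pmf.sum())
```

This sketch uses one noise pmf for all inputs. A learned noise function can instead return a different pmf for each input value, which is why the training calls below also take an input resolution.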

In [6]:
data_points = np.stack([ground[:,0], ground[:,1]])
func_e_x2, func_x2_noise = data.trainANN('x2_given_x1', generate_models, data_points, [input_res], output_res, output_buffer)

# C <- X1,X2
generated_c = np.array([func_c_sampled(x) for x in ground[:,:2]])

# X3 <- X1,X2,C
data_points = np.stack([ground[:,0], ground[:,1], generated_c, ground[:,3]])
func_e_x3, func_x3_noise = data.trainANN('x3_given_x1x2c', generate_models, data_points, [input_res,input_res,input_res], output_res, output_buffer)

# S <- X1,X2,C,X3
data_points = np.stack([ground[:,0], ground[:,1], generated_c, ground[:,3], ground[:,4]])
func_e_s, func_s_noise  = data.trainANN('s_given_x1x2cx3', generate_models, data_points, [input_res,input_res,input_res,input_res], output_res, output_buffer)



Epoch 1/2000
175/175 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0634 - val_loss: 0.0133
Epoch 2/2000
175/175 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0129 - val_loss: 0.0132
...
Epoch 158/2000
175/175 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 1.5133e-04 - val_loss: 1.8183e-04
Epoch 159/2000
175/175 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 1.6427e-04 - val_loss: 1.8037e-04
Epoch 1/2000
175/175 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0213 - val_loss: 0.0149
Epoch 2/2000
175/175 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0142 - val_loss: 0.0156
...
Epoch 44/2000
175/175 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 7.4237e-04 - val_loss: 7.6415e-04
...

# **Causal Jazz Simulation**
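
The TEDAG definitions in this section split every variable into an expectation node (for example X2E) and a noise node (X2N), which a summing function then combines. Conditional on the inputs, that sum amounts to shifting the noise distribution by the expected value. The sketch below illustrates this with plain NumPy. The grid, the expected values, and all names are hypothetical, and this is not a description of how TEDAG stores distributions internally.

```python
import numpy as np

# Noise support and pmf: a zero-mean Gaussian (sd 0.1) on a small hypothetical grid.
offsets = np.linspace(-0.5, 0.5, 11)
weights = np.exp(-0.5 * (offsets / 0.1) ** 2)
noise_pmf = weights / weights.sum()

# One expected value per input point (made-up numbers).
expected = np.array([0.2, 0.6])

# Support of E + N for each input point: the noise grid shifted by the expectation.
values = expected[:, None] + offsets[None, :]

# The probabilities are unchanged by the shift, so the conditional mean equals E.
cond_mean = (values * noise_pmf).sum(axis=1)
print(values.shape, cond_mean)
```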

In [7]:
def func_sum(y):
  out = np.reshape(np.sum(y, axis=-1), (np.array(y).shape[0],1))
  print(out.shape)
  return out

# Define the variable names for each function in TEDAG
tedag_func_x2_e = TEDAG_FUNCTION(['X1'], 'X2E', 0, func_e_x2)
tedag_func_x2_noise = TEDAG_FUNCTION(['X1'], 'X2N', 0, func_x2_noise)
tedag_func_x2 = TEDAG_FUNCTION(['X2E', 'X2N'], 'X2', 0, func_sum)
tedag_func_lc_e = TEDAG_FUNCTION(['X1', 'X2'], 'LCE', 0, func_c_exp)
tedag_func_lc_noise = TEDAG_FUNCTION(['X1', 'X2'], 'LCN', 0, func_c_noise)
tedag_func_lc = TEDAG_FUNCTION(['LCE', 'LCN'], 'LC', 0, func_sum)
tedag_func_x3_e = TEDAG_FUNCTION(['X1', 'X2', 'LC'], 'X3E', 0, func_e_x3)
tedag_func_x3_noise = TEDAG_FUNCTION(['X1', 'X2', 'LC'], 'X3N', 0, func_x3_noise)
tedag_func_x3 = TEDAG_FUNCTION(['X3E', 'X3N'], 'X3', 0, func_sum)
tedag_func_s_e  = TEDAG_FUNCTION(['X1', 'X2', 'LC', 'X3'], 'SE', 0, func_e_s)
tedag_func_s_noise  = TEDAG_FUNCTION(['X1', 'X2', 'LC', 'X3'], 'SN', 0, func_s_noise)
tedag_func_s = TEDAG_FUNCTION(['SE', 'SN'], 'S', 0, func_sum)

number_agents = 100

# Initialise the TEDAG
tedag = TEDAG(1, [tedag_func_x2_e,tedag_func_x2_noise,tedag_func_lc_e,tedag_func_lc_noise,tedag_func_x2,tedag_func_lc,tedag_func_x3_e,tedag_func_x3_noise,tedag_func_x3,tedag_func_s_e,tedag_func_s_noise,tedag_func_s], observables=['X1', 'X2', 'LC', 'X3', 'S'], verbose=True)

# Add a single intervention to set X1
tedag.addIntervention(['X1'], [np.random.normal(0.0, 1.0, size=number_agents)], 0)

# Forward calculate the distributions
while tedag.findNextFunctionAndApply(0):
    continue

In findNextFunctionAndApply, found function X2E at iteration 0
Required input variables are: ['X10']
['X10']
[array([ 0.65553906, -1.39060544, -0.23496691,  1.30939573, -0.20399215,
       ...

Current Nodes: ['X10', 'X2E0', 'X2N0']
In findNextFunctionAndApply, found function X2 at iteration 0
Required input variables are: ['X2E0', 'X2N0']
['X2E0', 'X2N0']
[array([[0.5474166 ],
       [0.3278647 ],
       [0.32806396],
       ...

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Let's plot the points and compare to the original data.

In [None]:
space = [0.0, 1.0, 0.0, 1.0]
var_names = ['X1', 'S']

state = tedag.getSubState(var_names, 0)

fig = plt.figure(1, dpi=100)
plt.xlim([space[0],space[1]])
plt.xlabel(var_names[0])
plt.ylabel(var_names[1])
plt.ylim([space[2],space[3]])
plt.scatter(ground[:100,0],ground[:100,4],s=1.0,color='#FF00FF')
# Plot X1 (row 0) against S (row 1), assuming getSubState returns one row per requested variable
plt.scatter(state[0,:100],state[1,:100],s=0.2)
plt.show(block=False)
plt.close()