# Verifying Synthetic Dataset

In this colab, we provide some basic plotting and printing commands to explore the synthetic dataset and confirm that the data distribution matches the configured causal graph.

Note: This colab was built to support the data config used in the main paper, so it makes several assumptions:
* The dataset has been saved as a .pkl file (this is how the provided dataset generating script saves datasets).
* A single batch element can be plotted without causing a memory failure.
* There are two labels, and each has a 1-dimensional, 2-element contingency table based on a single concept's binary values.
* Each feature is influenced by a single concept.

If your dataset violates any of these assumptions, you will have to update the code to fit your use case.
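
As a rough aid, the config-related assumptions above can be checked programmatically before running the rest of the colab. The sketch below is not part of the provided utilities: `check_config_assumptions` and `toy_config` are hypothetical names, and it assumes the config and its specs can be indexed like dictionaries with the keys used later in this colab (`label_specs`, `concept_idxs`, `contingency_table`, `concept_specs`, `feature_idx_to_agreement`).

```python
import numpy as np


def check_config_assumptions(config):
    """Returns a list of assumption violations (empty if none were found)."""
    problems = []
    label_specs = config["label_specs"]
    if len(label_specs) != 2:
        problems.append(f"expected 2 labels, found {len(label_specs)}")
    for label_idx, spec in enumerate(label_specs):
        if len(spec["concept_idxs"]) != 1:
            problems.append(f"label {label_idx} does not depend on exactly one concept")
        if np.shape(spec["contingency_table"]) != (2,):
            problems.append(f"label {label_idx} contingency table is not 1-D with 2 elements")
    # Each feature should be listed under at most one concept.
    feature_owner = {}
    for concept_idx, spec in enumerate(config["concept_specs"]):
        for feature_idx in spec["feature_idx_to_agreement"]:
            if feature_idx in feature_owner:
                problems.append(
                    f"feature {feature_idx} is influenced by concepts "
                    f"{feature_owner[feature_idx]} and {concept_idx}")
            feature_owner[feature_idx] = concept_idx
    return problems


# Hypothetical toy config, used only to illustrate the check.
toy_config = {
    "label_specs": [
        {"concept_idxs": [0], "contingency_table": [0.1, 0.9]},
        {"concept_idxs": [1], "contingency_table": [0.2, 0.7]},
    ],
    "concept_specs": [
        {"feature_idx_to_agreement": {0: 0.8, 1: 0.6}},
        {"feature_idx_to_agreement": {2: 0.9}},
    ],
}
print(check_config_assumptions(toy_config))
```

If the real config stores its specs as objects rather than dictionaries, the key lookups would need to be adapted accordingly.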

In [1]:
#@title Imports

import pprint
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import dataset_specs
import dataset_utils


In [None]:
#@title Load Dataset

dataset_filename = "enter .pkl dataset filename here" #@param {type: "string"}
dataset = dataset_utils.load_pickled_data(dataset_filename)
print(dataset.keys())
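
For readers without the helper module at hand, the following is a minimal sketch of what loading a pickled dataset could look like using only the standard library. It is an assumption about the general approach, not the actual implementation of `dataset_utils.load_pickled_data`; the function name `load_pickled_data_sketch` and the toy dictionary are hypothetical.

```python
import os
import pickle
import tempfile


def load_pickled_data_sketch(path):
    """Loads a pickled object from `path` (hypothetical stand-in helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip a toy dictionary that mimics the top-level keys used in this colab.
toy_dataset = {"config": {"seed": 0}, "train_split": {}, "test_split": {}}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_dataset.pkl")
    with open(toy_path, "wb") as f:
        pickle.dump(toy_dataset, f)
    loaded = load_pickled_data_sketch(toy_path)

print(sorted(loaded.keys()))
```

Unpickling executes arbitrary code embedded in the file, so only load .pkl files from sources you trust.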

In [None]:
#@title Verify Data Config
#@markdown We print the dataset config below in order to reveal how we encode the causal graph. Some notes on the various config keys:
#@markdown * feature_specs: A list of specs describing each feature in the dataset. Each spec encodes characteristics such as mean, noise, and
#@markdown pattern-related parameters.
#@markdown * concept_specs: A list of specs describing each concept in the dataset. Each concept spec contains: 
#@markdown  * a list of affected feature indexes
#@markdown  * a mapping from affected feature index to the feature's agreement with that concept, as described in the paper
#@markdown  * a mapping from affected feature index to the temporal feature pattern that the affected feature should exhibit when activated.
#@markdown * label_specs: A list of specs describing each label in the dataset. Each spec includes a list of indexes for the concepts that influence the label,
#@markdown and a binary contingency table whose rank matches the number of concept indexes.
#@markdown * num_trains: number of batch elements in the training split.
#@markdown * num_tests: number of batch elements in the validation & test splits.
#@markdown * scaling_type: the scaling type used for the numerical features.
#@markdown * seed: The RNG seed used to ensure a deterministic data generation process.

pprint.pprint(dataset["config"])

In [None]:
#@title Verify Data Shapes
#@markdown Confirm that the shapes of each of the dataset split elements make sense, and that the number of batch elements in the training
#@markdown and test splits match the num_trains and num_tests config parameters above. Below is a description of each dataset split field:
#@markdown * sequence: Float sequence of features of shape [num_unroll, batch_size, ndims].
#@markdown * label: Float array of labels of shape [num_unroll, batch_size, num_labels].
#@markdown * concept: Bool array of concept indicators of shape [batch_size, num_concepts].
#@markdown * concept_sequence: Int array of concept indicators on a timestep resolution of shape [num_unroll, batch_size, num_concepts].
#@markdown * changes: Int array of concept changepoints of shape [batch_size, num_concepts].
#@markdown * features: Bool array of concept-feature indicators of shape [batch_size, num_concepts, ndims].

train_dataset = dataset["train_split"]
test_dataset = dataset["test_split"]
print("Example dataset split keys:")
print(train_dataset.keys())
print()

train_sequence = train_dataset["sequence"]
train_label = train_dataset["label"]
train_concepts = train_dataset["concept"]
train_concept_sequence = train_dataset["concept_sequence"]
train_changes = train_dataset["changes"]
train_features = train_dataset["features"]

print("train sequence shape:", train_sequence.shape)
print("train label shape:", train_label.shape)
print("train concepts shape:", train_concepts.shape)
print("train concept sequence shape:", train_concept_sequence.shape)
print("train changes shape:", train_changes.shape)
print("train features shape:", train_features.shape)
print()

test_sequence = test_dataset["sequence"]
test_label = test_dataset["label"]
test_concepts = test_dataset["concept"]
test_concept_sequence = test_dataset["concept_sequence"]
test_changes = test_dataset["changes"]
test_features = test_dataset["features"]

print("test sequence shape:", test_sequence.shape)
print("test label shape:", test_label.shape)
print("test concepts shape:", test_concepts.shape)
print("test concept sequence shape:", test_concept_sequence.shape)
print("test changes shape:", test_changes.shape)
print("test features shape:", test_features.shape)
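
The printed shapes can also be cross-checked automatically. The sketch below assumes the field layout described in this cell; `check_split_shapes` and `toy_split` are hypothetical names introduced only for illustration.

```python
import numpy as np


def check_split_shapes(split, expected_batch_size=None):
    """Raises ValueError if the split fields have inconsistent shapes."""
    num_unroll, batch_size, ndims = split["sequence"].shape
    num_concepts = split["concept"].shape[1]
    expected = {
        "concept": (batch_size, num_concepts),
        "concept_sequence": (num_unroll, batch_size, num_concepts),
        "changes": (batch_size, num_concepts),
        "features": (batch_size, num_concepts, ndims),
    }
    for name, shape in expected.items():
        if split[name].shape != shape:
            raise ValueError(f"{name}: expected {shape}, got {split[name].shape}")
    # The number of labels is free, but time and batch axes must match.
    if split["label"].shape[:2] != (num_unroll, batch_size):
        raise ValueError(f"label: unexpected shape {split['label'].shape}")
    if expected_batch_size is not None and batch_size != expected_batch_size:
        raise ValueError(f"expected {expected_batch_size} batch elements, got {batch_size}")
    return num_unroll, batch_size, ndims, num_concepts


# Toy split: 5 timesteps, 4 batch elements, 3 features, 2 concepts, 2 labels.
toy_split = {
    "sequence": np.zeros((5, 4, 3), dtype=float),
    "label": np.zeros((5, 4, 2), dtype=float),
    "concept": np.zeros((4, 2), dtype=bool),
    "concept_sequence": np.zeros((5, 4, 2), dtype=int),
    "changes": np.zeros((4, 2), dtype=int),
    "features": np.zeros((4, 2, 3), dtype=bool),
}
print(check_split_shapes(toy_split, expected_batch_size=4))
```

On the real data, a call such as `check_split_shapes(train_dataset, dataset["config"]["num_trains"])` would compare the training split against the configured size, assuming the config can be indexed by key as it is elsewhere in this colab.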

In [None]:
#@title Plot Example Sequence

example_idx = 6 #@param {type: 'integer'}

#@markdown Below we print information for a single batch element, indexed by the value provided above, and plot its feature values.

#@markdown We also plot the label values to see how they coincide with feature behavior.

example_train_sequence = train_sequence[:, example_idx, :]
example_train_label = train_label[:, example_idx, :]
example_train_concepts = train_concepts[example_idx, :]
example_train_concept_sequence = train_concept_sequence[:, example_idx, :]
example_train_changes = train_changes[example_idx, :]
example_train_features = train_features[example_idx, :, :]

print("example train sequence shape:", example_train_sequence.shape)
print("example train label shape:", example_train_label.shape)
print("example train concept sequence shape:", example_train_concept_sequence.shape)
print("example train concepts shape:", example_train_concepts.shape)
print("example train concepts:", example_train_concepts)
print("example train changes shape:", example_train_changes.shape)
print("example train changes:", example_train_changes)
print("example train features shape:", example_train_features.shape)
print("example train features transposed:")
print(example_train_features.T)

fig = plt.figure(figsize=(20, 5))
ax = sns.heatmap(example_train_sequence.T, linewidth=0)
ax.set_xlabel("Time")
ax.set_ylabel("Feature Index")
plt.show()

fig = plt.figure(figsize=(16, 2))
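# Plot one line per label (two labels under the assumptions listed at the top).
for i in range(example_train_label.shape[-1]):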
  plt.plot(example_train_label[:, i], label=f"label {i}")
plt.legend()
plt.xlabel("Time")
plt.ylabel("Label Value")
plt.show()

In [None]:
#@title Verify Conditional Label Probabilities

#@markdown Here, we compare the theoretical and experimental label probabilities conditioned on concept value.
#@markdown The contingency table for each label gives us the theoretical label probabilities, and we filter the actual data to get the experimental ones.

# Max label over all timesteps in each batch element to get element-level labels
train_batch_labels = np.max(train_label, axis=0)

for label_idx in range(2):
  print(f"label idx: {label_idx}")
  label_spec = dataset["config"]["label_specs"][label_idx]
  # Each label depends on a single concept (see the assumptions at the top).
  concept_idx = label_spec["concept_idxs"][0]
  # Boolean mask over batch elements in which the concept is present.
  concept_present = train_concepts[:, concept_idx]
  theoretical_label_prob_conditioned_on_pos_concept = label_spec["contingency_table"][1]
  exp_label_prob_conditioned_on_pos_concept = np.mean(train_batch_labels[concept_present, label_idx])
  theoretical_label_prob_conditioned_on_neg_concept = label_spec["contingency_table"][0]
  exp_label_prob_conditioned_on_neg_concept = np.mean(train_batch_labels[np.invert(concept_present), label_idx])
  print(f"theoretical label prob conditioned on pos concept: {theoretical_label_prob_conditioned_on_pos_concept}")
  print(f"experimental label prob conditioned on pos concept: {exp_label_prob_conditioned_on_pos_concept}")
  print()
  print(f"theoretical label prob conditioned on neg concept: {theoretical_label_prob_conditioned_on_neg_concept}")
  print(f"experimental label prob conditioned on neg concept: {exp_label_prob_conditioned_on_neg_concept}")
  print()
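
To see why filtering by concept value recovers the contingency table, the comparison can be reproduced on simulated data. This is a toy sketch under stated assumptions, not the dataset generator: the contingency table values, the concept prior of 0.5, and the label onset logic are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_examples = 20000
num_unroll = 10

# Hypothetical table: [p(label | concept absent), p(label | concept present)].
contingency_table = np.array([0.1, 0.8])

# Sample a binary concept per batch element, then a label conditioned on it.
concepts = rng.random(num_examples) < 0.5
label_probs = contingency_table[concepts.astype(int)]
element_labels = rng.random(num_examples) < label_probs

# Spread each positive label over time from a random onset, giving an array
# of shape [num_unroll, batch_size] like one label channel of the dataset.
onset = rng.integers(0, num_unroll, size=num_examples)
timesteps = np.arange(num_unroll)[:, None]
label_sequence = ((timesteps >= onset[None, :]) & element_labels[None, :]).astype(float)

# Max over time recovers the element-level label, as in the cell above.
batch_labels = np.max(label_sequence, axis=0)

exp_pos = np.mean(batch_labels[concepts])
exp_neg = np.mean(batch_labels[np.invert(concepts)])
print(f"theoretical pos: {contingency_table[1]}, experimental pos: {exp_pos:.3f}")
print(f"theoretical neg: {contingency_table[0]}, experimental neg: {exp_neg:.3f}")
```

The experimental values should approach the table entries as the number of batch elements grows, so small gaps on a finite dataset are expected.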

In [None]:
#@title Verify Conditional Feature Probabilities

#@markdown Here we do the same thing as above, but for feature probabilities. The term agreement may be confusing for readers who are not familiar with
#@markdown how it is used in the paper.

#@markdown As a refresher: We design our features to have symmetrical conditional probabilities, such that
#@markdown p(Feature | Concept) = p(No Feature | No Concept). We use the term agreement to refer to the probability that the feature will deterministically
#@markdown match the concept value. If the feature does not deterministically match, we still allow it to match with a probability of 0.5.
#@markdown This is why p(Feature | Concept) = p(No Feature | No Concept) = agreement + ((1 - agreement) / 2), and consequently
#@markdown p(Feature | No Concept) = 1 - (agreement + ((1 - agreement) / 2)).

for concept_idx, concept_spec in enumerate(dataset["config"]["concept_specs"]):
  concept_name = concept_spec["name"]
  print(f"\nConcept idx: {concept_idx} (name: {concept_name})")
  # Boolean mask over batch elements in which this concept is present.
  concept_present = train_concepts[:, concept_idx]
  for feature_idx, feature_agreement in concept_spec["feature_idx_to_agreement"].items():
    # p(Feature | Concept) = agreement + (1 - agreement) / 2
    theoretical_feature_prob_conditioned_on_pos_concept = feature_agreement + ((1 - feature_agreement) / 2)
    exp_feature_prob_conditioned_on_pos_concept = np.mean(train_features[concept_present, 0, feature_idx])
    # p(Feature | No Concept) = 1 - p(No Feature | No Concept)
    theoretical_feature_prob_conditioned_on_neg_concept = 1 - (feature_agreement + ((1 - feature_agreement) / 2))
    exp_feature_prob_conditioned_on_neg_concept = np.mean(train_features[np.invert(concept_present), 0, feature_idx])
    print(f"feature idx: {feature_idx}")
    print(f"theoretical feature prob conditioned on pos concept: {theoretical_feature_prob_conditioned_on_pos_concept}")
    print(f"exp feature prob conditioned on pos concept: {exp_feature_prob_conditioned_on_pos_concept}")
    print(f"theoretical feature prob conditioned on neg concept: {theoretical_feature_prob_conditioned_on_neg_concept}")
    print(f"exp feature prob conditioned on neg concept: {exp_feature_prob_conditioned_on_neg_concept}")
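
The agreement formula can be sanity-checked with a small simulation. The sketch below follows the verbal description of agreement given in this cell (match deterministically with probability `agreement`, otherwise flip a fair coin); it is an illustrative assumption about the sampling process, not the actual generator code, and `simulate_feature_indicators` is a hypothetical helper.

```python
import numpy as np


def simulate_feature_indicators(concepts, agreement, rng):
    """Samples feature indicators that agree with `concepts` as described above."""
    # With probability `agreement`, the feature copies the concept value.
    deterministic = rng.random(concepts.shape) < agreement
    # Otherwise it is a fair coin flip, which still matches half of the time.
    coin_flip = rng.random(concepts.shape) < 0.5
    return np.where(deterministic, concepts, coin_flip)


rng = np.random.default_rng(0)
concepts = rng.random(50000) < 0.5
agreement = 0.6  # hypothetical agreement value
features = simulate_feature_indicators(concepts, agreement, rng)

theoretical = agreement + (1 - agreement) / 2
p_feature_given_concept = np.mean(features[concepts])
p_no_feature_given_no_concept = np.mean(np.invert(features[np.invert(concepts)]))
print(f"theoretical: {theoretical:.3f}")
print(f"p(Feature | Concept): {p_feature_given_concept:.3f}")
print(f"p(No Feature | No Concept): {p_no_feature_given_no_concept:.3f}")
```

Both experimental values should land close to the theoretical one, which illustrates the symmetry p(Feature | Concept) = p(No Feature | No Concept).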