# If 99% Are Vaccinated, 1% Get a Reaction and 2% Get the Disease, How Can You Generate a Synthetic Dataset for Causal Inference?

## How to generate synthetic data for any causal inference or probabilistic project in fewer than 10 lines of Python code

![mufid-majnun-cM1aU42FnRg-unsplash.jpg](attachment:mufid-majnun-cM1aU42FnRg-unsplash.jpg)
Photo by <a href="https://unsplash.com/@mufidpwt?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Mufid Majnun</a> on <a href="https://unsplash.com/s/photos/vaccine?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

### Introduction

In "The Book of Why", Judea Pearl uses an example of vaccination, reaction, smallpox and death to explain "The Ladder of Causation". This motivated me to attempt to develop an example of causal inference in Python. However, any causal inference project needs test data to get started, and this article explains how to generate that data quickly and easily.

https://www.amazon.co.uk/Book-Why-Science-Cause-Effect/dp/0241242630

Before we dive into the synthetic data generation please consider …

Joining Medium with my referral link (I will receive a proportion of the fees if you sign up using this link).

Subscribing to a free e-mail whenever I publish a new story.

Taking a quick look at my previous articles.

Downloading my free strategic data-driven decision making framework.

Visiting my data science website - The Data Blog.

### Background

Pearl proposes a causal diagram for the hypothetical relationships between a vaccination program, reactions (side-effects), development of the disease and fatalities arising from the disease and side-effects as follows -

![diseaase_causal_diagram.png](attachment:diseaase_causal_diagram.png)

Pearl then goes on to describe the findings of a hypothetical survey as follows ...

1. Out of 1 million children, 990,000 (99%) are vaccinated
2. Of those vaccinated, 9,900 (1% of 990,000) have a reaction
3. Of those that have a reaction, 99 (1% of 9,900) die from the reaction 
4. Of those not vaccinated, 200 (2% of 10,000) get smallpox
5. Of those that get smallpox, 40 (20% of 200) die from the disease
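The counts in the survey follow directly from the percentages. As a quick check, here is a minimal sketch of the arithmetic (the variable names are my own and are not taken from the book); integer arithmetic is used to avoid floating-point rounding:

```python
# Derive the head counts in the hypothetical survey from the quoted percentages.
population = 1_000_000

vaccinated = population * 99 // 100            # 99% of the population
unvaccinated = population - vaccinated         # the remaining 1%
reactions = vaccinated * 1 // 100              # 1% of those vaccinated
reaction_deaths = reactions * 1 // 100         # 1% of those with a reaction
smallpox_cases = unvaccinated * 2 // 100       # 2% of those not vaccinated
smallpox_deaths = smallpox_cases * 20 // 100   # 20% of those with smallpox

print(vaccinated, unvaccinated, reactions, reaction_deaths, smallpox_cases, smallpox_deaths)
```

These are the target counts that the generated dataset needs to reproduce.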

Given the causal diagram proposed by Pearl and the statistics from the hypothetical survey, I surmised that my first step was to create a dataset representing the survey as a pandas ``DataFrame``, so I looked around on the Internet for a way to create the data.

I found scikit-learn libraries for generating "blob" and "classification" data (https://scikit-learn.org/stable/datasets/sample_generators.html) and I also found libraries like faker (https://faker.readthedocs.io/en/master/) and SDV (https://sdv.dev/SDV/) but I could not find anything to generate a binary dataset suitable as an input for a causal or probabilistic model.

Given that I could not find anything suitable online and that I needed some data to begin my causal inference journey, I set about writing code that can create any synthetic binary dataset myself ...

### A Procedural Solution
The basic requirements are -
1. Create a blank ``DataFrame`` with the correct number of rows and features
2. Update the ``DataFrame`` with values that represent each one of the 5 findings from the hypothetical survey

Creating the empty ``DataFrame`` is very straightforward -

In [1]:
import pandas as pd
import numpy as np

columns = ['Vaccination?', 'Reaction?', 'Smallpox?', 'Death?']
df = pd.DataFrame(data=np.full(shape=(1_000_000, len(columns)), fill_value=0, dtype=int), columns=columns)
df.head()

| | Vaccination? | Reaction? | Smallpox? | Death? |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 |


The next step is to implement this rule - "Out of 1 million children, 990,000 (99%) are vaccinated" by changing ``Vaccination? = 0`` to ``Vaccination? = 1`` for 990,000 or 99% of the rows.

That turned out to be surprisingly difficult. My first thought was to take a sample using ``DataFrame.sample(990000)`` and update the values, but ``sample`` returns a new ``DataFrame``, so the update is not applied to the underlying ``DataFrame`` and that approach does not work.
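The problem can be seen in miniature with a small sketch. It assumes a hypothetical 10-row ``demo`` frame rather than the full dataset, and shows that writing to the sample leaves the original untouched, whereas writing through the sampled index with ``.loc`` does update it:

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row stand-in for the full dataset
demo = pd.DataFrame({"Vaccination?": np.zeros(10, dtype=int)})

# sample() returns a new DataFrame, so this assignment only changes the sample
sampled = demo.sample(5, random_state=42)
sampled["Vaccination?"] = 1
updated_in_sample = int(sampled["Vaccination?"].sum())
updated_in_original = int(demo["Vaccination?"].sum())

# Writing through the sampled index labels updates the original frame
demo.loc[sampled.index, "Vaccination?"] = 1
updated_via_loc = int(demo["Vaccination?"].sum())

print(updated_in_sample, updated_in_original, updated_via_loc)
```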

After some experimentation, this is the solution I came up with ...

In [2]:
vaccination_indices = list(df.sample(990_000, random_state=42).index)
df.loc[vaccination_indices, ["Vaccination?"]] = 1

... and to make sure the code worked ...

In [3]:
df["Vaccination?"].value_counts()

1    990000
0     10000
Name: Vaccination?, dtype: int64

So far, so good. The next stage is to implement this rule - "Of those vaccinated, 9,900 (1% of 990,000) have a reaction". This is a bit more complicated as it requires 2 stages -

1. Select the rows where ``Vaccination? == 1``
2. Set ``Reaction? = 1`` for 9,900 (or 1%) of those selected rows

After a bit more trial-and-error, this is the solution I developed ...

In [4]:
criteria_for_selection = df["Vaccination?"]==1
reaction_indices = list(df[criteria_for_selection].sample(9_900, random_state=42).index)
df.loc[reaction_indices, ["Reaction?"]] = 1

... and to test it out ...

In [5]:
df.value_counts()

Vaccination?  Reaction?  Smallpox?  Death?
1             0          0          0         980100
0             0          0          0          10000
1             1          0          0           9900
dtype: int64

Now that the hard work of finding a solution is over, the entire dataset that implements all 5 rules can be generated in just 12 lines of code ...

In [6]:
# 1. There are 1,000,000 children in the hypothetical survey ...
columns = ['Vaccination?', 'Reaction?', 'Smallpox?', 'Death?']
df = pd.DataFrame(data=np.full(shape=(1_000_000, len(columns)), fill_value=0, dtype=int), columns=columns)

# 2. Set 990_000 of the dataset to get the vaccination
vaccination_indices = list(df.sample(990_000, random_state=42).index)
df.loc[vaccination_indices, ["Vaccination?"]] = 1

# 3. Set 9_900 of those vaccinated to get the reaction
reaction_indices = list(df[df["Vaccination?"]==1].sample(9_900, random_state=42).index)
df.loc[reaction_indices, ["Reaction?"]] = 1

# 4. 99 of those who have the reaction die from it
reaction_die_indices = list(df[df["Reaction?"]==1].sample(99, random_state=42).index)
df.loc[reaction_die_indices, ["Death?"]] = 1

# 5. Of the 10,000 that do not get vaccinated, 200 get smallpox
smallpox_indices = list(df[df["Vaccination?"]==0].sample(200, random_state=42).index)
df.loc[smallpox_indices, ["Smallpox?"]] = 1

# 6. Of the 200 that get smallpox, 40 die ...
smallpox_die_indices = list(df[df["Smallpox?"]==1].sample(40, random_state=42).index)
df.loc[smallpox_die_indices, ["Death?"]] = 1

... and to verify the solution ...

In [7]:
print(df.shape)
# name= labels the count column "Count" regardless of what pandas calls it by default
df.value_counts().reset_index(name="Count")

(1000000, 4)


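| | Vaccination? | Reaction? | Smallpox? | Death? | Count |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 980100 |
| 1 | 1 | 1 | 0 | 0 | 9801 |
| 2 | 0 | 0 | 0 | 0 | 9800 |
| 3 | 0 | 0 | 1 | 0 | 160 |
| 4 | 1 | 1 | 0 | 1 | 99 |
| 5 | 0 | 0 | 1 | 1 | 40 |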
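Another way to verify the data is to recover the five survey percentages from it. The sketch below is one possible check rather than part of the original solution: it rebuilds the dataset with a hypothetical helper called ``mark`` (my own name, used only to keep the example short and self-contained) and then relies on the fact that the mean of a 0/1 column within a subgroup is the proportion of 1s in that subgroup.

```python
import pandas as pd

def mark(df, column, n, mask=None, seed=42):
    """Hypothetical helper: set `column` to 1 for `n` random rows, optionally only where `mask` is True."""
    pool = df if mask is None else df[mask]
    df.loc[pool.sample(n, random_state=seed).index, column] = 1

# Rebuild the survey dataset using the same five rules as above
columns = ["Vaccination?", "Reaction?", "Smallpox?", "Death?"]
df = pd.DataFrame(0, index=range(1_000_000), columns=columns)
mark(df, "Vaccination?", 990_000)
mark(df, "Reaction?", 9_900, df["Vaccination?"] == 1)
mark(df, "Death?", 99, df["Reaction?"] == 1)
mark(df, "Smallpox?", 200, df["Vaccination?"] == 0)
mark(df, "Death?", 40, df["Smallpox?"] == 1)

# Conditional proportions: the mean of a binary column within each subgroup
p_vaccinated = df["Vaccination?"].mean()
p_reaction_given_vaccinated = df.loc[df["Vaccination?"] == 1, "Reaction?"].mean()
p_death_given_reaction = df.loc[df["Reaction?"] == 1, "Death?"].mean()
p_smallpox_given_unvaccinated = df.loc[df["Vaccination?"] == 0, "Smallpox?"].mean()
p_death_given_smallpox = df.loc[df["Smallpox?"] == 1, "Death?"].mean()

print(p_vaccinated, p_reaction_given_vaccinated, p_death_given_reaction,
      p_smallpox_given_unvaccinated, p_death_given_smallpox)
```

If the generation is correct, the five values should match the 99%, 1%, 1%, 2% and 20% quoted in the survey.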


### An Object Oriented Programming (OOP) Solution
The procedural solution provides a decent implementation for generating synthetic binary datasets to be used as the inputs to causal and probabilistic projects, but it can be rewritten as an OOP solution to reduce code duplication and to maximize code reuse ...

In [8]:
import pandas as pd
import numpy as np

class BinaryDataGenerator():

    def __init__(self, population_size : int, columns : list, random_state : int = 42):
        self.random_state = random_state
        self.__data = pd.DataFrame(data=np.full(shape=(population_size, len(columns)), fill_value=0, dtype=int), columns=columns)

    def set_values(self, col_name : str, count : int = None, frac : float = None, condition : pd.Series = None):
        # If no condition is specified, create a dummy condition that is True for every row so the update is applied to all rows
        if condition is None:
            condition = pd.Series(data=[True] * len(self.__data))

        update_indices = list(self.__data[condition].sample(n=count, frac=frac, random_state=self.random_state).index)
        self.__data.loc[update_indices, [col_name]] = 1

    @property
    def data(self) -> pd.DataFrame:
        return self.__data

    @property
    def summary(self) -> pd.DataFrame:
        # name= labels the count column "Count" regardless of what pandas calls it by default
        return self.__data.value_counts().reset_index(name="Count")
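Note that ``set_values`` forwards ``count`` and ``frac`` straight to ``DataFrame.sample`` as ``n`` and ``frac``, so it inherits pandas' rules: supply one or the other, not both (and if neither is supplied, pandas samples a single row). A minimal sketch of that behaviour, using a hypothetical 100-row ``frame``:

```python
import pandas as pd

# Hypothetical 100-row frame, used only to illustrate the sampling arguments
frame = pd.DataFrame({"flag": [0] * 100})

by_count = frame.sample(n=10, random_state=42)     # an absolute number of rows
by_frac = frame.sample(frac=0.1, random_state=42)  # a fraction of the rows

# Supplying both n and frac is rejected by pandas with a ValueError
try:
    frame.sample(n=10, frac=0.1, random_state=42)
    both_rejected = False
except ValueError:
    both_rejected = True

print(len(by_count), len(by_frac), both_rejected)
```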

Creating our synthetic dataset for the hypothetical smallpox example then becomes very straightforward ...

In [9]:
generator = BinaryDataGenerator(population_size=1_000_000, columns=['Vaccination?', 'Reaction?', 'Smallpox?', 'Death?'], random_state=42) # The Smallpox data contains 1m rows and 4 columns
generator.set_values(col_name="Vaccination?", frac=0.99, condition=None) # 99% of the population get the vaccination
generator.set_values(col_name="Reaction?", frac=0.01, condition=generator.data["Vaccination?"]==1) # 1% of those vaccinated get a reaction
generator.set_values(col_name="Death?", frac=0.01, condition=generator.data["Reaction?"]==1) # 1% of those who get a reaction die
generator.set_values(col_name="Smallpox?", frac=0.02, condition=generator.data["Vaccination?"]==0) # 2% of those who do not get vaccinated get smallpox
generator.set_values(col_name="Death?", frac=0.2, condition=generator.data["Smallpox?"]==1) # 20% of those who get smallpox die from it

In [10]:
print(generator.data.shape)
generator.summary

(1000000, 4)


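| | Vaccination? | Reaction? | Smallpox? | Death? | Count |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 980100 |
| 1 | 1 | 1 | 0 | 0 | 9801 |
| 2 | 0 | 0 | 0 | 0 | 9800 |
| 3 | 0 | 0 | 1 | 0 | 160 |
| 4 | 1 | 1 | 0 | 1 | 99 |
| 5 | 0 | 0 | 1 | 1 | 40 |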


### Conclusion
Causal inference is a hot topic, and the number of libraries and articles supporting and informing this branch of machine learning and artificial intelligence is growing rapidly.

However, the first step in creating a project to test a causal inference solution is to create some synthetic data, and I could not find any libraries or articles that provide a template for the data creation.

This article has explored the background to the problem, provided a procedural solution to generate the synthetic data and also provided an object-oriented solution that can be easily re-used to generate synthetic data to test any causal inference problem.
