# Data generation process

This notebook demonstrates how to generate synthetic data for a causal inference problem.

The diagram below shows how the profit from a customer is related to other factors.
- Profit is the outcome we are interested in.
- Promotion is believed to have a direct impact on profit. The size of that impact varies with the average amount of previous orders.
- Profit also depends on three customer attributes:
  - The income of a customer.
  - The average amount of previous orders.
  - The number of years since registration.

  
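Below is a more formal mathematical representation, where $Y$ is profit, $T$ is promotion, $TE$ is the treatment effect of the promotion, and $X_1$, $X_2$, $X_3$ are income, average order amount, and years since registration.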
- $Y = TE \cdot T + a \cdot X_1 + b \cdot X_2 + c \cdot X_3$
- $TE = d \cdot X_2^2$
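
To make the two equations concrete, the sketch below evaluates them for a single customer. The function names `treatment_effect` and `expected_profit` and the example customer are illustrative assumptions, not part of the original notebook; the default coefficient values are the ones used in the data-generation code in this notebook.

```python
def treatment_effect(x2, d=2.68):
    """TE = d * X_2^2: the effect of promotion grows with the average order amount."""
    return d * x2 ** 2


def expected_profit(t, x1, x2, x3, a=0.01, b=0.3, c=3, d=2.68):
    """Y = TE * T + a * X_1 + b * X_2 + c * X_3."""
    return treatment_effect(x2, d) * t + a * x1 + b * x2 + c * x3


# Hypothetical customer: income 3000, average order 50, 3 years since registration.
te = treatment_effect(50)
y = expected_profit(t=0.5, x1=3000, x2=50, x3=3)
print(te, y)
```

Because $TE$ is quadratic in $X_2$, doubling the average order amount from 50 to 100 quadruples the treatment effect.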


```mermaid
%%{init: {'theme':'default'}}%%

flowchart
subgraph  
direction BT
x1((x1 \n income))
x2((x2 \n avg \n order))
x3((x3 \n yrs since \n registration))
y((Y \n profit))
t((T \n promotion))
x1 --a--> y
t --TE--> y
x2 --b--> y
x3 --c--> y
end
```

In [63]:
import pandas as pd
import numpy as np
import plotly.express as px

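num_samples = 10000
np.random.seed(42)  # fix the seed so the data are reproducible
# abs() keeps the covariates non-negative
x1 = abs(np.random.normal(loc=3000, scale=1500, size=num_samples))  # income
x2 = abs(np.random.normal(loc=50, scale=10, size=num_samples))  # average amount of previous orders
x3 = abs(np.random.randn(num_samples) + 3)  # years since registration
T = np.random.random(num_samples)  # promotion, uniform on [0, 1)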

a = 0.01
b = 0.3
c = 3
d = 2.68

TE = d * x2**2
Y = TE * T + a * x1 + b * x2 + c * x3

df = pd.DataFrame({'income': x1, 'avg_order': x2, 'yrs': x3, 'promotion': T, 'TE': TE, 'profit': Y})
df.describe().transpose().drop(columns='count').round(1)

| | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| income | 3023.0 | 1451.9 | 1.7 | 1992.7 | 2996.1 | 4006.6 | 8889.4 |
| avg_order | 50.1 | 10.0 | 11.4 | 43.4 | 50.2 | 56.9 | 94.8 |
| yrs | 3.0 | 1.0 | 0.0 | 2.3 | 3.0 | 3.7 | 6.7 |
| promotion | 0.5 | 0.3 | 0.0 | 0.2 | 0.5 | 0.8 | 1.0 |
| TE | 7004.8 | 2715.5 | 350.5 | 5043.3 | 6742.5 | 8688.6 | 24080.6 |
| profit | 3560.9 | 2591.3 | 25.2 | 1495.7 | 3081.5 | 5128.4 | 19783.1 |
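
As a quick sanity check on the generated data, one option is to regress profit on the terms of the structural equation and confirm that the coefficients come back. The sketch below regenerates the data with the same seed so that it runs on its own. Using `np.linalg.lstsq` and the names `features` and `coef` are choices made here for illustration, not part of the original notebook. Because the data contain no noise term, the fit should recover `d`, `a`, `b`, and `c` almost exactly.

```python
import numpy as np

# Regenerate the synthetic data with the same seed and draw order as above.
num_samples = 10000
np.random.seed(42)
x1 = abs(np.random.normal(loc=3000, scale=1500, size=num_samples))
x2 = abs(np.random.normal(loc=50, scale=10, size=num_samples))
x3 = abs(np.random.randn(num_samples) + 3)
T = np.random.random(num_samples)

a, b, c, d = 0.01, 0.3, 3, 2.68
Y = d * x2**2 * T + a * x1 + b * x2 + c * x3

# Design matrix: one column per term in
# Y = d * X_2^2 * T + a * X_1 + b * X_2 + c * X_3 (no intercept).
features = np.column_stack([x2**2 * T, x1, x2, x3])
coef, *_ = np.linalg.lstsq(features, Y, rcond=None)

for name, estimate, truth in zip("dabc", coef, (d, a, b, c)):
    print(f"{name}: estimated {estimate:.4f}, true {truth}")
```

A check like this only confirms that the data follow the intended equation. It works here because the correct functional form, including the $X_2^2 \cdot T$ interaction, is supplied to the regression.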
