# Generative process

## Definitions

We can generate data according to the event-based model (EBM).

$S \sim {\rm UniformPermutation}(\cdot)$

$S$ is drawn uniformly at random from all permutations of the biomarkers, so every ordering of the biomarkers is equally likely.
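As a minimal sketch of this draw (the seed and `N = 5` are arbitrary assumptions for illustration), NumPy's `Generator.permutation` returns one of the $N!$ orderings with equal probability. The inverse lookup `position` is a helper name introduced here, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility (assumption)
N = 5                           # illustrative number of biomarkers (assumption)

# S[i] is the biomarker that occupies position i of the disease sequence
S = rng.permutation(N)

# Inverse lookup: position[n] is the 0-based position of biomarker n within S
position = np.argsort(S)

print("ordering S:", S)
print("position of each biomarker:", position)
```

Because `S` is a permutation, `S[position[n]] == n` holds for every biomarker `n`.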

$k_j \sim {\rm DiscreteUniform}(N)$

$k_j$ follows a discrete uniform distribution: a participant is equally likely to be at any progression stage (e.g., any stage from $0$ to $4$, where $0$ indicates that the participant is healthy).
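A short sketch of the stage draw, assuming stages run from $0$ (healthy) to $N$ inclusive; the values of `N`, `J`, and the seed are illustrative assumptions. With many participants, the empirical share at each stage should be close to $1/(N+1)$.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility (assumption)
N, J = 4, 10_000                # illustrative sizes (assumption)

# Generator.integers excludes the upper bound by default,
# so N + 1 is passed to make stage N reachable
kjs = rng.integers(0, N + 1, size=J)

# Empirical share of participants at each stage 0..N
shares = np.bincount(kjs, minlength=N + 1) / J
print(shares)
```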

$$X_{S(n)j} | S, k_j  \sim I(z_j == 1) \left[ I(k_n \leq k_j ) p(X_{S(n)j} \mid \theta_{S(n)} ) +I(k_n \gt k_j) p(X_{S(n)j} \mid \phi_{S(n)}) \right] +  \left(1-I(z_j==1) \right) p(X_{S(n)j} \mid \phi_{S(n)})$$
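The indicator logic of this expression can be sketched as a small function. This is an illustration under assumptions: $k_n$ is read as the stage at which the biomarker becomes affected, both distributions are taken to be Gaussian, and `draw_measurement` with its `(mean, std)` parameter layout is a hypothetical helper, not part of the model definition.

```python
import numpy as np

def draw_measurement(z_j, k_j, k_n, theta, phi, rng):
    """Draw one biomarker value.

    theta and phi are (mean, std) pairs (hypothetical layout).
    Returns the sampled value and whether the theta branch was used.
    """
    # theta applies only if the participant is diseased AND has reached stage k_n;
    # every other case falls back to phi
    affected = (z_j == 1) and (k_n <= k_j)
    mean, std = theta if affected else phi
    return rng.normal(mean, std), affected

rng = np.random.default_rng(0)        # seed is an arbitrary assumption
theta, phi = (1.0, 0.3), (12.0, 1.3)  # illustrative parameters

value, affected = draw_measurement(z_j=1, k_j=3, k_n=2, theta=theta, phi=phi, rng=rng)
print(value, affected)
```

A healthy participant (`z_j=0`) or a diseased participant who has not yet reached the biomarker's stage (`k_j < k_n`) always draws from `phi`.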

## Parameters

$z_j$: $1$ if the participant is diseased; otherwise $0$.

$I(True) = 1$, $I(False) = 0$

$S$ denotes the ordering of a sequence of biomarkers. 

$N$: number of observed biomarkers.

$n$: a specific biomarker; e.g., biomarker $b$.

$J$: number of participants. 

$j$ denotes a participant. 

$X$: the observed biomarker values, stored as a matrix of dimension $N \times J$ or $J \times N$ (the simulation below uses $J \times N$).

$k$: a scalar whose value is the participant's disease stage.

$K$: number of disease stages

$S_n$: the disease stage that biomarker $n$ indicates, i.e., the stage at which biomarker $n$ becomes affected.

$k_j$: disease stage that a participant is at. 

$X_{nj}$ means the observed value of the biomarker $n$ in participant $j$. 

$\theta_n$: the parameters of the probability density function (PDF) of the observed value of biomarker $n$ when this biomarker has been affected by the disease. We assume this distribution is Gaussian.

$\phi_n$: the parameters of the probability density function (PDF) of the observed value of biomarker $n$ when this biomarker has NOT been affected by the disease.
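To make the role of $\theta_n$ and $\phi_n$ concrete, the sketch below evaluates a Gaussian PDF under both parameter sets for one measurement. The `(mean, std)` pairs, the value `x`, and the helper `normal_pdf` are illustrative assumptions; treating $\phi_n$ as Gaussian as well is also an assumption made for this example.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

theta_n = (1.0, 0.3)   # (mean, std) when the biomarker is affected (illustrative)
phi_n = (12.0, 1.3)    # (mean, std) when it is not affected (illustrative)

x = 1.2  # one hypothetical measurement
p_affected = normal_pdf(x, *theta_n)
p_not_affected = normal_pdf(x, *phi_n)

# A measurement near the theta mean is far more plausible under theta than under phi
print(p_affected, p_not_affected)
```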

## Simulation

**We are going to generate biomarker values for each participant by randomly drawing from distributions defined by $\theta$ or $\phi$**

In [1]:
import numpy as np
import scipy.stats as stats
import altair as alt
import pandas as pd

def simulate_ebm(N, J):
    """
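    Simulate an Event-Based Model (EBM) for disease progression.
    
    Args:
    N (int): Number of biomarkers (at most 10, the length of the hard-coded parameter lists).
    J (int): Number of participants.
    
    Returns:
    tuple: A tuple containing:
        - S (numpy.ndarray): The random permutation giving the order in which biomarkers are affected.
        - kjs (numpy.ndarray): The disease stage of each participant.
        - X (numpy.ndarray): The biomarker matrix (J x N) with participant data.
          - Each cell in X is a tuple containing the participant ID, the biomarker value,
            the disease stage of this participant, the disease stage the current biomarker
            indicates, and the affected status ('affected' or 'not_affected').
        - theta_means, theta_vars, phi_means, phi_vars (list): The hard-coded parameters
          of the affected (theta) and not-affected (phi) distributions.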
    """
    
    # Random permutation for ordering biomarkers, starting from 0
    # S indicates the disease progression, S[0] -> stage1, S[1] -> stage2
    S = np.random.permutation(N)
    
    # Generate a random stage for each participant
    # The stage should be between 0 and N, inclusive
    kjs = np.random.randint(0, N+1, size=J)
    
    # Initiate biomarker matrix (J participants x N biomarkers), with entries as None
    X = np.full((J, N), None, dtype=object)

    # Generate theta and phi, which are both a function of biomarker n
    # np.random.seed(42)
    # theta_means = {n: np.random.randint(low=0, high=9) for n in range(N)}
    # theta_vars = {n: np.random.rand() for n in range(N)}
    # phi_means = {n: np.random.randint(low=0, high=9) for n in range(N)}
    # phi_vars = {n: np.random.rand() for n in range(N)}
    theta_means = [1, 3, 5, 6, 8, 0, 4, 2, 7, 9]
    theta_vars = [0.3, 0.5, 0.2, 1.3, 3.3, 2.2, 0.8, 0.9, 0.7, 0.6]
    phi_means = [12, 11, 14, 16, 18, 19, 10, 13, 15, 17]
    phi_vars = [1.3, 2.4, 1.4, 0.9, 1.5, 1.9, 2.4, 1.7, 2.0, 1.0]
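    # Note: stats.norm(loc, scale) takes the standard deviation as `scale`,
    # so the values in theta_vars and phi_vars act as standard deviations here.
    theta = {n: stats.norm(theta_means[n], theta_vars[n]) for n in range(N)}
    phi = {n: stats.norm(phi_means[n], phi_vars[n]) for n in range(N)}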

    # Iterate through participants
    for j in range(J):
        # Iterate through biomarkers
        for n in range(N):
            # Disease stage of the current participant
            k_j = kjs[j]
            # Disease stage indicated by the current biomarker
            # Note that biomarkers always indicate that the participant is diseased
            # Thus, S_n >= 1
            S_n = np.where(S == n)[0][0] + 1
            
            # Assign values based on whether the participant's stage is at or past the biomarker's stage
            if k_j >= 1:
                if k_j >= S_n:
                    X[j, n] = (j, theta[n].rvs(), k_j, S_n, 'affected') 
                else:
                    X[j, n] = (j, phi[n].rvs(), k_j, S_n, 'not_affected')  
            else:
                X[j, n] = (j, phi[n].rvs(), k_j, S_n, 'not_affected')        
    return S, kjs, X, theta_means, theta_vars, phi_means, phi_vars

In [2]:
N = 10
J = 100
S, kjs, X, theta_means, theta_vars, phi_means, phi_vars = simulate_ebm(N, J)

In [3]:
theta_means, theta_vars, phi_means, phi_vars

([1, 3, 5, 6, 8, 0, 4, 2, 7, 9],
 [0.3, 0.5, 0.2, 1.3, 3.3, 2.2, 0.8, 0.9, 0.7, 0.6],
 [12, 11, 14, 16, 18, 19, 10, 13, 15, 17],
 [1.3, 2.4, 1.4, 0.9, 1.5, 1.9, 2.4, 1.7, 2.0, 1.0])

In [4]:
df_means_vars = pd.DataFrame([theta_means, theta_vars, phi_means, phi_vars]).transpose()
df_means_vars.columns = ['theta_mean', 'theta_var', 'phi_mean', 'phi_var']
df_means_vars = df_means_vars.rename_axis("biomarker", axis=0).reset_index()
df_means_vars

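| | biomarker | theta_mean | theta_var | phi_mean | phi_var |
|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0.3 | 12.0 | 1.3 |
| 1 | 1 | 3.0 | 0.5 | 11.0 | 2.4 |
| 2 | 2 | 5.0 | 0.2 | 14.0 | 1.4 |
| 3 | 3 | 6.0 | 1.3 | 16.0 | 0.9 |
| 4 | 4 | 8.0 | 3.3 | 18.0 | 1.5 |
| 5 | 5 | 0.0 | 2.2 | 19.0 | 1.9 |
| 6 | 6 | 4.0 | 0.8 | 10.0 | 2.4 |
| 7 | 7 | 2.0 | 0.9 | 13.0 | 1.7 |
| 8 | 8 | 7.0 | 0.7 | 15.0 | 2.0 |
| 9 | 9 | 9.0 | 0.6 | 17.0 | 1.0 |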


In [5]:
df_means_vars.to_csv('data/means_vars.csv', index=False)

In [6]:
S

array([8, 5, 9, 3, 0, 2, 1, 4, 6, 7])

In [7]:
X[10]

array([(10, 1.360158474671899, 8, 5, 'affected'),
       (10, 2.3421069423052012, 8, 7, 'affected'),
       (10, 5.069026898278845, 8, 6, 'affected'),
       (10, 7.425149380293285, 8, 4, 'affected'),
       (10, 16.00610857217842, 8, 8, 'affected'),
       (10, -3.824634943720829, 8, 2, 'affected'),
       (10, 10.667840722351617, 8, 9, 'not_affected'),
       (10, 10.476070300675264, 8, 10, 'not_affected'),
       (10, 6.352190707579194, 8, 1, 'affected'),
       (10, 8.979658587350963, 8, 3, 'affected')], dtype=object)
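The nested loops in `simulate_ebm` can also be expressed with array broadcasting. The following is a sketch under assumptions, not the notebook's implementation: it returns a plain float matrix plus a boolean mask instead of tuples, and the small parameter arrays, the seed, and the names `theta_sds` and `phi_sds` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary assumption
N, J = 3, 6                     # illustrative sizes

# Illustrative Gaussian parameters, one entry per biomarker
theta_means = np.array([1.0, 3.0, 5.0])
theta_sds = np.array([0.3, 0.5, 0.2])
phi_means = np.array([12.0, 11.0, 14.0])
phi_sds = np.array([1.3, 2.4, 1.4])

S = rng.permutation(N)                # biomarker ordering
S_n = np.argsort(S) + 1               # stage (1..N) indicated by each biomarker
kjs = rng.integers(0, N + 1, size=J)  # participant stages, 0 = healthy

# affected[j, n] is True when participant j has reached biomarker n's stage
affected = kjs[:, None] >= S_n[None, :]

# Pick theta parameters where affected, phi parameters elsewhere, then sample once
means = np.where(affected, theta_means, phi_means)
sds = np.where(affected, theta_sds, phi_sds)
X_values = rng.normal(means, sds)     # shape (J, N)

print(affected.sum(axis=1), kjs)
```

Since `S_n` is a permutation of $1, \dots, N$, the number of affected biomarkers in each row equals that participant's stage, which gives a quick consistency check on the simulated data.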

## Visualizing simulated results

With the above data structure, we can visualize the following data:

- Distribution of all biomarker values by biomarker
- Distribution of all biomarker values when the participant is at a certain disease stage
- Comparison of a single biomarker between affected and not-affected participants
- A certain participant's data

### Distribution of all biomarker values by biomarker

In [8]:
df = pd.DataFrame(X, columns = [f"Biomarker {i}" for i in range(N)])

# make this dataframe wide to long 
df_long = df.melt(var_name = "Biomarker", value_name="Value")

# expand the value column into a dataframe
values_df = df_long['Value'].apply(pd.Series)
values_df.columns = ['participant', 'measurement', 'k_j', 'S_n', 'affected_or_not']

# join values_df with df_long
df_expanded = df_long.drop('Value', axis = 1).join(values_df)

alt.Chart(df_expanded).transform_density(
    'measurement',
    as_=['measurement', 'Density'],
    groupby=['Biomarker']
).mark_area().encode(
    x="measurement:Q",
    y="Density:Q",
    facet = alt.Facet(
        "Biomarker:N",
        columns = 5
    ),
    color=alt.Color(
        'Biomarker:N'
    )
).properties(
    width= 140,
    height = 200,
).properties(
    title='Biomarker data for all participants across all stages'
)

In [9]:
df_expanded.to_csv("data/participant_data.csv", index=False)
df_expanded.head()


| | Biomarker | participant | measurement | k_j | S_n | affected_or_not |
|---|---|---|---|---|---|---|
| 0 | Biomarker 0 | 0 | 10.336151 | 1 | 5 | not_affected |
| 1 | Biomarker 0 | 1 | 10.865467 | 1 | 5 | not_affected |
| 2 | Biomarker 0 | 2 | 1.238857 | 7 | 5 | affected |
| 3 | Biomarker 0 | 3 | 10.163485 | 2 | 5 | not_affected |
| 4 | Biomarker 0 | 4 | 11.036735 | 0 | 5 | not_affected |
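The same long-format table can be built without `melt` and `apply(pd.Series)` by flattening the tuples directly, which may be faster on large matrices. This is a sketch on a tiny stand-in matrix; `X_demo`, `df_demo`, and the values in it are hypothetical and only mirror the tuple layout used above.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for X: 2 participants x 2 biomarkers (hypothetical values)
X_demo = np.empty((2, 2), dtype=object)
X_demo[0, 0] = (0, 10.3, 1, 2, "not_affected")
X_demo[0, 1] = (0, 2.9, 1, 1, "affected")
X_demo[1, 0] = (1, 11.1, 0, 2, "not_affected")
X_demo[1, 1] = (1, 10.8, 0, 1, "not_affected")

columns = ["participant", "measurement", "k_j", "S_n", "affected_or_not"]

# One record per cell, in biomarker-major order (the same row order melt produces)
records = [
    (f"Biomarker {n}", *X_demo[j, n])
    for n in range(X_demo.shape[1])
    for j in range(X_demo.shape[0])
]
df_demo = pd.DataFrame.from_records(records, columns=["Biomarker", *columns])
print(df_demo)
```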


In [10]:
# get data for conjugate priors
# biomarker 1 & drawn from diseased groups
df_expanded[(
    df_expanded.Biomarker == "Biomarker 1") & (
        df_expanded.affected_or_not == "affected")].to_csv("data/conjugate_data.csv", index=False)

In [11]:
# df_expanded[df_expanded.k_j == 0]

### Distribution of all biomarker values when the participant is at a certain disease stage

In [12]:
# biomarker data when the participant is at stage 6
df_kj_6 = df_expanded[df_expanded.k_j == 6]
df_kj_6

alt.Chart(df_kj_6).transform_density(
    'measurement',
    as_=['measurement', 'Density'],
    groupby=['Biomarker']
).mark_area().encode(
    x="measurement:Q",
    y="Density:Q",
    facet = alt.Facet(
        "Biomarker:N",
        columns = 5
    ),
    color=alt.Color(
        'Biomarker:N'
    )
).properties(
    width= 140,
    height = 200,
).properties(
    title='Biomarker data when the participant is at stage six'
)

### Comparison of a single biomarker between affected and not-affected participants

In [13]:
# select only biomarker 2 
bio_2_data = df_expanded[df_expanded.Biomarker=='Biomarker 2'].drop(['k_j', 'S_n', 'Biomarker'], axis = 1)
# biomarker2 data, comparing from diseased and healthy groups
alt.Chart(bio_2_data).transform_density(
    'measurement',
    as_=['measurement', 'Density'],
    groupby=['affected_or_not']
).mark_area().encode(
    x="measurement:Q",
    y="Density:Q",
    facet = alt.Facet(
        "affected_or_not:N",
    ),
    color=alt.Color(
        'affected_or_not:N'
    )
).properties(
    width= 240,
    height = 200,
).properties(
    title='Biomarker2 data, comparing healthy group and diseased group'
)

### A certain participant's data

In [14]:
# participant 10
participant10_data = df_expanded[df_expanded.participant == 10]
alt.Chart(participant10_data).mark_bar().encode(
    x='Biomarker',
    y='measurement',
    color=alt.Color(
        'affected_or_not:N'
    ),
    tooltip=['Biomarker', 'affected_or_not', 'measurement']
).interactive().properties(
    title=f'Biomarker data for participant10 (k_j = {participant10_data.k_j.to_list()[0]})'
)