# Generative process

## Definitions

We can generate data according to the event-based model (EBM).

$S \sim {\rm UniformPermutation}(\cdot)$

$S$ is drawn uniformly at random from all permutations of the biomarkers. That means every ordering of the biomarkers is equally likely.

$k_j \sim {\rm DiscreteUniform}(N)$

$k_j$ follows a discrete uniform distribution, which means a participant is equally likely to be at any progression stage (e.g., from $0$ to $4$, where $0$ indicates that the participant is healthy).
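The two sampling statements above can be sketched in a few lines of NumPy. This is a minimal illustration, not the simulation itself: the values of `N` and `J`, the seed, and the use of `np.random.default_rng` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make this sketch repeatable
N = 4  # number of biomarkers (illustrative value)
J = 6  # number of participants (illustrative value)

# S: a uniformly random ordering of the N biomarkers (labels 0 .. N-1)
S = rng.permutation(N)

# k_j: one stage per participant, uniform over {0, 1, ..., N}; 0 means healthy
k = rng.integers(low=0, high=N + 1, size=J)

print("S =", S)
print("k =", k)
```

Note that `high` is exclusive in `rng.integers`, so `high=N + 1` is needed for stage $N$ to be reachable.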

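$X_{S(n)j} \mid S, k_j, z_j \sim I(z_j = 1) \left[ I(n \leq k_j) \, p(X_{S(n)j} \mid \theta_{S(n)}) + I(n > k_j) \, p(X_{S(n)j} \mid \phi_{S(n)}) \right] + \left(1 - I(z_j = 1)\right) p(X_{S(n)j} \mid \phi_{S(n)})$

Here $z_j = 1$ indicates that participant $j$ is diseased, and $S(n)$ is the biomarker at position $n$ of the ordering. For a diseased participant at stage $k_j$, the first $k_j$ biomarkers in $S$ are drawn from their $\theta$ distributions and the remaining ones from their $\phi$ distributions. For a healthy participant, every biomarker is drawn from its $\phi$ distribution.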
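The indicator structure of this expression can be read as a simple selection rule. The sketch below is one possible reading, under the assumption that "diseased" is equivalent to $k_j \geq 1$; the helper name `draw_measurement` and the two example distributions are hypothetical.

```python
import numpy as np
from scipy import stats

def draw_measurement(n, k_j, theta_dist, phi_dist, rng):
    """Draw one value for the biomarker at position n (1-based) of the ordering S.

    theta_dist / phi_dist are frozen scipy distributions for that biomarker.
    Returns the drawn value and whether it came from theta_dist.
    """
    diseased = k_j >= 1               # assumption: stands in for z_j = 1
    affected = diseased and n <= k_j  # I(z_j = 1) * I(n <= k_j)
    dist = theta_dist if affected else phi_dist
    return dist.rvs(random_state=rng), affected

rng = np.random.default_rng(0)
theta_dist = stats.norm(loc=3.0, scale=0.5)  # hypothetical "affected" distribution
phi_dist = stats.norm(loc=8.0, scale=0.5)    # hypothetical "not affected" distribution

# A participant at stage 2: positions 1 and 2 are affected, position 3 is not
for n in (1, 2, 3):
    value, affected = draw_measurement(n, 2, theta_dist, phi_dist, rng)
    print(n, round(value, 2), "theta" if affected else "phi")
```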

## Parameters

$S$ denotes the ordering of a sequence of biomarkers. 

$N$: number of observed biomarkers.

$n$: a specific biomarker; e.g., biomarker $b$.

$J$: number of participants. 

$j$ denotes a participant. 

$X$: observed values of biomarkers; it is a matrix of dimension $N \times J$ or $J \times N$ (the simulation below uses $J \times N$).

$k$: a scalar whose value is a participant's disease stage.

$K$: number of disease stages

$k_n$ means the disease stage that a specific biomarker $n$ indicates. 

$k_j$: disease stage that a participant is at. 

$X_{nj}$ means the observed value of the biomarker $n$ in participant $j$. 

$\theta_n$ denotes the parameters of the probability density function (PDF) of the observed value of biomarker $n$ when this biomarker has been affected by the disease. We assume this distribution is Gaussian.

$\phi_n$ denotes the parameters of the probability density function (PDF) of the observed value of biomarker $n$ when this biomarker has NOT been affected by the disease.
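As a small illustration of what such a parameter set looks like in code, the sketch below builds a Gaussian with `scipy.stats.norm`. The mean and variance are hypothetical values. One detail worth keeping in mind is that `scipy.stats.norm` takes the standard deviation as its `scale` argument, so a variance has to be square-rooted first.

```python
import numpy as np
from scipy import stats

mu, var = 5.0, 0.25  # hypothetical theta_n = (mean, variance)

# scale is the standard deviation, not the variance
theta_n = stats.norm(loc=mu, scale=np.sqrt(var))

print(theta_n.mean(), theta_n.var())  # mean and variance of the frozen distribution
print(theta_n.pdf(5.0))               # density evaluated at the mean
```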

## Simulation

**We are going to generate biomarker values for each participant by randomly drawing from the distributions defined by $\theta$ or $\phi$.**

In [382]:
import numpy as np
import scipy.stats as stats
import altair as alt
import pandas as pd

def simulate_ebm(N, J):
    """
    Simulate an Event-Based Model (EBM) for disease progression.
    
    Args:
    N (int): Number of biomarkers.
    J (int): Number of participants.
    
    Returns:
    tuple: A tuple containing:
        - S (numpy.ndarray): The random permutation giving the order in which
          biomarkers are affected by the disease.
        - kjs (numpy.ndarray): The disease stage of each participant.
        - X (numpy.ndarray): The J x N biomarker matrix with participant data.
          - Each cell in X is a tuple containing the participant ID, the biomarker
            value, the participant's disease stage, the disease stage the
            biomarker indicates, and whether the value was drawn from the
            'diseased' or 'healthy' distribution.
        - theta_means, theta_vars (dict): Mean and variance of each biomarker's
          distribution when affected by the disease, keyed by biomarker index.
        - phi_means, phi_vars (dict): Mean and variance of each biomarker's
          distribution when not affected, keyed by biomarker index.
    """
    
    # Random permutation for ordering biomarkers, starting from 0
    # S indicates the disease progression, S[0] -> stage1, S[1] -> stage2
    S = np.random.permutation(N)
    
    # Generate a random stage for each participant
    # The stage should be between 0 and N, inclusive
    kjs = np.random.randint(0, N+1, size=J)
    
    # Initialize biomarker matrix (J participants x N biomarkers), with entries as None
    X = np.full((J, N), None, dtype=object)

    # Generate theta and phi, which are both a function of biomarker n
    # np.random.seed(42)
    theta_means = {n: np.random.randint(low=0, high=9) for n in range(N)}
    theta_vars = {n: np.random.rand() for n in range(N)}
    phi_means = {n: np.random.randint(low=0, high=9) for n in range(N)}
    phi_vars = {n: np.random.rand() for n in range(N)}
    # stats.norm takes the standard deviation as its scale, so convert the variances
    theta = {n: stats.norm(theta_means[n], np.sqrt(theta_vars[n])) for n in range(N)}
    phi = {n: stats.norm(phi_means[n], np.sqrt(phi_vars[n])) for n in range(N)}
    
    # Iterate through participants
    for j in range(J):
        # Iterate through biomarkers
        for n in range(N):
            # Disease stage of the current participant
            k_j = kjs[j]
            # Disease stage indicated by the current biomarker
            # Note that biomarkers always indicate that the participant is diseased
            # Thus, k_n >= 1
            k_n = np.where(S == n)[0][0] + 1
            
            # Assign values based on whether the participant's stage is at or past the biomarker's stage
            if k_j >= 1:
                if k_j >= k_n:
                    X[j, n] = (j, theta[n].rvs(), k_j, k_n, 'diseased') 
                else:
                    X[j, n] = (j, phi[n].rvs(), k_j, k_n, 'healthy')  
            else:
                X[j, n] = (j, phi[n].rvs(), k_j, k_n, 'healthy')        
    return S, kjs, X, theta_means, theta_vars, phi_means, phi_vars

In [383]:
N = 10
J = 100
S, kjs, X, theta_means, theta_vars, phi_means, phi_vars = simulate_ebm(N, J)

In [384]:
theta_means, theta_vars, phi_means, phi_vars

({0: 5, 1: 3, 2: 1, 3: 4, 4: 6, 5: 4, 6: 5, 7: 3, 8: 3, 9: 2},
 {0: 0.9185501310287616,
  1: 0.5702123449545294,
  2: 0.1308031737252292,
  3: 0.712476621560504,
  4: 0.480014982033799,
  5: 0.988017948674132,
  6: 0.8771941066363272,
  7: 0.4270363172911351,
  8: 0.2573156090531834,
  9: 0.2190406963064374},
 {0: 8, 1: 6, 2: 4, 3: 7, 4: 1, 5: 2, 6: 6, 7: 4, 8: 8, 9: 8},
 {0: 0.22630021495321773,
  1: 0.4695335648290815,
  2: 0.2585685461063849,
  3: 0.266312113240211,
  4: 0.9161328486402734,
  5: 0.523594065293155,
  6: 0.6405821373576045,
  7: 0.2562941691118066,
  8: 0.7231158430621516,
  9: 0.37307995198288746})

In [385]:
df_means_vars = pd.DataFrame([theta_means, theta_vars, phi_means, phi_vars]).transpose()
df_means_vars.columns = ['theta_mean', 'theta_var', 'phi_mean', 'phi_var']
df_means_vars = df_means_vars.rename_axis("biomarker", axis=0).reset_index()
df_means_vars

| | biomarker | theta_mean | theta_var | phi_mean | phi_var |
|---|---|---|---|---|---|
| 0 | 0 | 5.0 | 0.91855 | 8.0 | 0.2263 |
| 1 | 1 | 3.0 | 0.570212 | 6.0 | 0.469534 |
| 2 | 2 | 1.0 | 0.130803 | 4.0 | 0.258569 |
| 3 | 3 | 4.0 | 0.712477 | 7.0 | 0.266312 |
| 4 | 4 | 6.0 | 0.480015 | 1.0 | 0.916133 |
| 5 | 5 | 4.0 | 0.988018 | 2.0 | 0.523594 |
| 6 | 6 | 5.0 | 0.877194 | 6.0 | 0.640582 |
| 7 | 7 | 3.0 | 0.427036 | 4.0 | 0.256294 |
| 8 | 8 | 3.0 | 0.257316 | 8.0 | 0.723116 |
| 9 | 9 | 2.0 | 0.219041 | 8.0 | 0.37308 |


In [386]:
df_means_vars.to_csv('data/means_vars.csv', index=False)

In [387]:
S

array([8, 5, 1, 7, 9, 4, 3, 6, 0, 2])
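The ordering printed above lists biomarkers by position: `S[0]` is the biomarker affected at stage 1, `S[1]` at stage 2, and so on. To go the other way, from a biomarker to the stage it indicates ($k_n$), one option is to invert the permutation with `np.argsort`. This is a sketch of an equivalent vectorized lookup, not the method used in `simulate_ebm`, which calls `np.where` per biomarker.

```python
import numpy as np

# The ordering shown in the output above
S = np.array([8, 5, 1, 7, 9, 4, 3, 6, 0, 2])

# For a permutation, argsort gives the inverse: the position of each biomarker in S.
# Adding 1 turns 0-based positions into stages 1 .. N.
k_n = np.argsort(S) + 1

print(k_n.tolist())  # [9, 3, 10, 7, 6, 2, 8, 4, 1, 5]
```

For example, biomarker 8 sits at `S[0]`, so it indicates stage 1, and biomarker 2 sits at `S[9]`, so it indicates stage 10.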

In [388]:
X[10]

array([(10, 7.929126616691131, 5, 9, 'healthy'),
       (10, 3.293988866540326, 5, 3, 'diseased'),
       (10, 4.195254461098294, 5, 10, 'healthy'),
       (10, 6.8313757854195085, 5, 7, 'healthy'),
       (10, 0.731401292684964, 5, 6, 'healthy'),
       (10, 5.0513688596628885, 5, 2, 'diseased'),
       (10, 5.017734371923654, 5, 8, 'healthy'),
       (10, 2.778493215087366, 5, 4, 'diseased'),
       (10, 3.1279945641572002, 5, 1, 'diseased'),
       (10, 2.0137658413921526, 5, 5, 'diseased')], dtype=object)

## Visualizing simulated results

With the above data structure, we can visualize the following data:

- Distribution of all biomarker values by biomarker
- Distribution of all biomarker values when the participant is at a certain disease stage
- Comparing a certain biomarker's data between the healthy and diseased groups
- A certain participant's data

### Distribution of all biomarker values by biomarker

In [389]:
df = pd.DataFrame(X, columns = [f"Biomarker {i}" for i in range(N)])

# make this dataframe wide to long 
df_long = df.melt(var_name = "Biomarker", value_name="Value")

# expand the value column into a dataframe
values_df = df_long['Value'].apply(pd.Series)
values_df.columns = ['participant', 'measurement', 'k_j', 'k_n', 'drawn_from']

# join values_df with df_long
df_expanded = df_long.drop('Value', axis = 1).join(values_df)

alt.Chart(df_expanded).transform_density(
    'measurement',
    as_=['measurement', 'Density'],
    groupby=['Biomarker']
).mark_area().encode(
    x="measurement:Q",
    y="Density:Q",
    facet = alt.Facet(
        "Biomarker:N",
        columns = 5
    ),
    color=alt.Color(
        'Biomarker:N'
    )
).properties(
    width= 140,
    height = 200,
).properties(
    title='Biomarker data for all participants across all stages'
)
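The wide-to-long reshaping in the cell above relies on `apply(pd.Series)`, which can be slow on large frames. A possible alternative, sketched here on a tiny hand-made stand-in for `X`, is to build the long table directly from a list of records. The array `X_small` and its values are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for X: 2 participants x 2 biomarkers, each cell a 5-tuple
X_small = np.empty((2, 2), dtype=object)
X_small[0, 0] = (0, 7.9, 0, 2, 'healthy')
X_small[0, 1] = (0, 6.1, 0, 1, 'healthy')
X_small[1, 0] = (1, 5.2, 2, 2, 'diseased')
X_small[1, 1] = (1, 3.3, 2, 1, 'diseased')

cols = ['participant', 'measurement', 'k_j', 'k_n', 'drawn_from']

# One record per (biomarker, participant) pair, in the same order melt() would give
records = [
    (f"Biomarker {n}", *X_small[j, n])
    for n in range(X_small.shape[1])
    for j in range(X_small.shape[0])
]
df_small = pd.DataFrame(records, columns=['Biomarker', *cols])
print(df_small)
```

Building from records also lets pandas infer a numeric dtype for `measurement` directly, rather than starting from an object column.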

In [390]:
df_expanded.to_csv("data/participant_data.csv", index=False)
df_expanded.head()


| | Biomarker | participant | measurement | k_j | k_n | drawn_from |
|---|---|---|---|---|---|---|
| 0 | Biomarker 0 | 0 | 7.946853 | 4 | 9 | healthy |
| 1 | Biomarker 0 | 1 | 4.57732 | 9 | 9 | diseased |
| 2 | Biomarker 0 | 2 | 7.957196 | 8 | 9 | healthy |
| 3 | Biomarker 0 | 3 | 5.508214 | 10 | 9 | diseased |
| 4 | Biomarker 0 | 4 | 3.619316 | 9 | 9 | diseased |
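With the data in this long format, a quick sanity check is to summarize `measurement` per biomarker and per source distribution. The sketch below uses a small hand-made frame (`toy`, with hypothetical values) so that it runs on its own. The same `groupby` call could be applied to `df_expanded`, where the group means would be expected to land near the corresponding `theta_mean` and `phi_mean` values.

```python
import pandas as pd

# Hypothetical measurements for one biomarker, two from each distribution
toy = pd.DataFrame({
    "Biomarker": ["Biomarker 0"] * 4,
    "measurement": [4.0, 6.0, 8.0, 8.5],
    "drawn_from": ["diseased", "diseased", "healthy", "healthy"],
})

# Count, sample mean and sample variance per (biomarker, source) group
summary = (
    toy.groupby(["Biomarker", "drawn_from"])["measurement"]
    .agg(["count", "mean", "var"])
    .reset_index()
)
print(summary)
```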


In [391]:
# get data for conjugate priors
# biomarker 1 & drawn from diseased groups
df_expanded[(
    df_expanded.Biomarker == "Biomarker 1") & (
        df_expanded.drawn_from == "diseased")].to_csv("data/conjugate_data.csv", index=False)

In [392]:
# df_expanded[df_expanded.k_j == 0]

### Distribution of all biomarker values when the participant is at a certain disease stage

In [393]:
# biomarker data when the participant is at stage 6
df_kj_6 = df_expanded[df_expanded.k_j == 6]
df_kj_6

alt.Chart(df_kj_6).transform_density(
    'measurement',
    as_=['measurement', 'Density'],
    groupby=['Biomarker']
).mark_area().encode(
    x="measurement:Q",
    y="Density:Q",
    facet = alt.Facet(
        "Biomarker:N",
        columns = 5
    ),
    color=alt.Color(
        'Biomarker:N'
    )
).properties(
    width= 140,
    height = 200,
).properties(
    title='Biomarker data when the participant is at stage six'
)

### Comparing a certain biomarker's data between the healthy and diseased groups

In [394]:
# select only biomarker 2 
bio_2_data = df_expanded[df_expanded.Biomarker=='Biomarker 2'].drop(['k_j', 'k_n', 'Biomarker'], axis = 1)
# biomarker2 data, comparing from diseased and healthy groups
alt.Chart(bio_2_data).transform_density(
    'measurement',
    as_=['measurement', 'Density'],
    groupby=['drawn_from']
).mark_area().encode(
    x="measurement:Q",
    y="Density:Q",
    facet = alt.Facet(
        "drawn_from:N",
    ),
    color=alt.Color(
        'drawn_from:N'
    )
).properties(
    width= 240,
    height = 200,
).properties(
    title='Biomarker 2 data, comparing the healthy group and the diseased group'
)

### A certain participant's data

In [395]:
# participant 10
participant10_data = df_expanded[df_expanded.participant == 10]
alt.Chart(participant10_data).mark_bar().encode(
    x='Biomarker',
    y='measurement',
    color=alt.Color(
        'drawn_from:N'
    ),
    tooltip=['Biomarker', 'drawn_from', 'measurement']
).interactive().properties(
    title=f'Biomarker data for participant 10 (k_j = {participant10_data.k_j.iloc[0]})'
)