# EBM as a Generative Model {#sec-gen}

We can use EBM to generate synthetic biomarker data if we know:

- The order ($S$) in which different biomarkers get affected by the disease. 
- Parameters (i.e., mean and standard deviation) of biomarkers' distribution when they are affected ($\theta$) and not affected ($\phi$) by the disease. 
- Stages ($k_j$) that each participant is in. 

Data we can generate looks like this:

![Sample Data](img/sample_data.png){#fig-sample-data .lightbox}

This data is from a single participant. 

As we mentioned above, to generate this data, we need to know:

- $S$, i.e., the order of biomarkers. In the above example, $S$ is HIP-FCI, PCC-FCI, HIP-GMI, FUS-GMI, FUS-FCI.
- $\mathcal N(\theta_{\mu}, \theta_{\sigma})$ and $\mathcal N(\phi_{\mu}, \phi_{\sigma})$ for each of the five biomarkers, which are known but not shown directly here in the dataset. 
- $k_j$, which is 2 in the above example. 

We explain how this data is constructed in the following, column by column. 

First, the `participant` ID is $67$. The `biomarker` column names each of the five biomarkers measured, and `measurement` is the observed value of that biomarker. `k_j` is the participant's stage; a stage above 0 means `Diseased = True`. `S_n` is the rank of biomarker $n$ in the order $S$. If `k_j < S_n`, the participant's stage has not yet reached that biomarker's rank, so the biomarker is `not affected`. If `k_j >= S_n`, the biomarker is `affected`. 

If a biomarker is `affected`, its measurement is drawn from that biomarker's $\mathcal N(\theta_{\mu}, \theta_{\sigma})$; if it is `not affected`, the measurement is drawn from $\mathcal N(\phi_{\mu}, \phi_{\sigma})$.
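The affected/not-affected rule can be sketched in a few lines of Python. This is a minimal illustration rather than any package's implementation: the function name `affected_status` is hypothetical, and ranks are assumed to run from 1 to 5 in the order listed for $S$ above.

```python
# Order S from the example above; rank S_n is the 1-based position in this list.
S = ["HIP-FCI", "PCC-FCI", "HIP-GMI", "FUS-GMI", "FUS-FCI"]


def affected_status(order, k_j):
    """Map each biomarker to True (affected) when k_j >= S_n, else False.

    Hypothetical helper for illustration; ranks are assumed to start at 1.
    """
    return {biomarker: k_j >= rank for rank, biomarker in enumerate(order, start=1)}


status = affected_status(S, k_j=2)
for biomarker, is_affected in status.items():
    print(biomarker, "affected" if is_affected else "not affected")
```

With `k_j = 2`, only the first two biomarkers in the order come out as affected; with `k_j = 0`, none do.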

## Generative Process

The generative process of biomarker measurements can be described as:

$$
\begin{aligned}
X_{nj} \mid S, k_j, \theta_n, \phi_n &\sim I(z_j = 1) \bigg[ I(S(n) \leq k_j) \, p(X_{nj} \mid \theta_{n}) + I(S(n) > k_j) \, p(X_{nj} \mid \phi_{n}) \bigg] \\
&\quad + \left(1 - I(z_j = 1) \right) p(X_{nj} \mid \phi_{n})
\end{aligned}
$$ {#eq-gen}



This model says that given that we know $S, k_j, \theta_n, \text{and } \phi_n$, we can draw the biomarker measurement from a distribution. 
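As a rough sketch of @eq-gen, the draw for a single measurement could look like the following. The function name `sample_measurement`, the tuple layout `(mean, std)`, and the numeric parameter values are all assumptions made for illustration; only the branching logic comes from the equation.

```python
import random


def sample_measurement(rank, k_j, z_j, theta, phi, rng):
    """Draw one X_nj following the generative equation.

    rank  : S(n), the 1-based rank of biomarker n in the order S
    theta : (mean, std) when the biomarker is affected
    phi   : (mean, std) when the biomarker is not affected
    rng   : a random.Random instance
    """
    # theta applies only if the participant is diseased AND S(n) <= k_j;
    # in every other case the measurement comes from phi.
    affected = (z_j == 1) and (rank <= k_j)
    mu, sigma = theta if affected else phi
    return rng.gauss(mu, sigma)


rng = random.Random(0)
theta_n = (10.0, 1.0)  # made-up values, for illustration only
phi_n = (0.0, 1.0)     # made-up values, for illustration only
x = sample_measurement(rank=1, k_j=2, z_j=1, theta=theta_n, phi=phi_n, rng=rng)
print(x)
```

Collapsing the two indicator terms into a single `affected` flag works because exactly one of the three terms in the equation is nonzero for any given $(z_j, k_j, S(n))$.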

$S \sim \mathrm{UniformPermutation}(\cdot)$

$S$ is drawn uniformly at random from all permutations of the biomarkers. That means every ordering of the biomarkers is equally likely. 

$k_j \sim \mathrm{DiscreteUniform}(N)$

$k_j$ follows a discrete uniform distribution, which means a participant is equally likely to be in any progression stage (e.g., from $0$ to $5$, where $0$ indicates that the participant is healthy).
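The two priors can be sampled with Python's standard library alone. In this sketch the helper names `sample_order` and `sample_stage` are hypothetical, and the stage range $\{0, 1, \dots, N\}$ (with $0$ for healthy) is an assumption taken from the "0 to 5" example in the text.

```python
import random


def sample_order(biomarkers, rng):
    """Draw S uniformly at random from all permutations of the biomarkers."""
    order = list(biomarkers)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return order


def sample_stage(n_biomarkers, rng):
    """Draw k_j uniformly from {0, 1, ..., n_biomarkers}; 0 means healthy.

    Including 0 is an assumption taken from the text's "0 to 5" example.
    """
    return rng.randint(0, n_biomarkers)  # randint includes both endpoints


rng = random.Random(42)
biomarkers = ["HIP-FCI", "PCC-FCI", "HIP-GMI", "FUS-GMI", "FUS-FCI"]
S = sample_order(biomarkers, rng)
k_j = sample_stage(len(biomarkers), rng)
print(S, k_j)
```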

## Graphical Explanation

In the following, we explain the generative model in three different scenarios using [graphical models](https://en.wikipedia.org/wiki/Graphical_model): (1) all participants are healthy; (2) there are both healthy and diseased participants, and all biomarkers are affected among the diseased participants; (3) there are both healthy and diseased participants, but we do not know which biomarkers are affected among the diseased participants.

### Scenario 1

If all participants are healthy:

$$
X_{nj} \sim p(X_{nj} \mid \phi_{n})
$$ {#eq-gen-s1}

Where

$X_{nj}$ indicates the measurement of biomarker $n$ in participant $j$.

$\phi_{n}$ represents $\mathcal N(\phi_{\mu}, \phi_{\sigma})$ for biomarker $n$.

The graphical model would look like:

![Graphical Model of Scenario 1](img/g1.png){#fig-g1 .lightbox}

### Scenario 2

If we have both diseased and healthy participants, and all biomarkers are affected among diseased participants:

$$
X_{nj} \sim I(z_j = 1) \, p(X_{nj} \mid \theta_n) + (1-I(z_j = 1)) \, p(X_{nj} \mid \phi_n)
$$ {#eq-gen-s2}

Where:

$z_j = 1$ indicates that participant $j$ is diseased and $z_j = 0$ indicates that participant $j$ is healthy. 

$I(\cdot)$ is the indicator function: $I(\text{True}) = 1$ and $I(\text{False}) = 0$.

$\theta_{n}$ represents $\mathcal N(\theta_{\mu}, \theta_{\sigma})$ for biomarker $n$.

The graphical model would look like:

![Graphical Model of Scenario 2](img/g2.png){#fig-g2 .lightbox}
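@eq-gen-s2 can be sketched for a small cohort as follows. The function name `sample_scenario2`, the list-of-tuples parameter layout, and all numeric values are assumptions for illustration only.

```python
import random


def sample_scenario2(z, theta, phi, rng):
    """Return data[j][n] for participants j and biomarkers n.

    z     : list of 0/1 disease indicators, one per participant
    theta : list of (mean, std) per biomarker, used when z_j == 1
    phi   : list of (mean, std) per biomarker, used when z_j == 0
    """
    data = []
    for z_j in z:
        params = theta if z_j == 1 else phi  # all biomarkers switch together
        data.append([rng.gauss(mu, sigma) for mu, sigma in params])
    return data


rng = random.Random(7)
theta = [(10.0, 1.0), (20.0, 2.0)]  # made-up values, for illustration only
phi = [(0.0, 1.0), (5.0, 2.0)]      # made-up values, for illustration only
data = sample_scenario2(z=[1, 0, 0], theta=theta, phi=phi, rng=rng)
print(data)
```

Passing a `z` list of all zeros reproduces Scenario 1, where every measurement comes from $\phi_n$.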

### Scenario 3

If we have both healthy and diseased participants, but we do not know which biomarkers are affected among the diseased participants, the full model in @eq-gen applies.

This is the model used in the usual case. 

The graphical model looks like:

![Graphical Model of Scenario 3](img/g3.png){#fig-g3 .lightbox}
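As a worked instance, take the example participant from the start of this chapter, with $k_j = 2$ (so $z_j = 1$) and, as an assumption for illustration, ranks $1$ through $5$ assigned in the order listed for $S$. @eq-gen then reduces to a single distribution for each biomarker:

$$
\begin{aligned}
X_{\text{HIP-FCI},\,j} &\sim p(X_{nj} \mid \theta_{n}) && \text{since } S(n) = 1 \leq 2 \\
X_{\text{PCC-FCI},\,j} &\sim p(X_{nj} \mid \theta_{n}) && \text{since } S(n) = 2 \leq 2 \\
X_{\text{HIP-GMI},\,j} &\sim p(X_{nj} \mid \phi_{n}) && \text{since } S(n) = 3 > 2 \\
X_{\text{FUS-GMI},\,j} &\sim p(X_{nj} \mid \phi_{n}) && \text{since } S(n) = 4 > 2 \\
X_{\text{FUS-FCI},\,j} &\sim p(X_{nj} \mid \phi_{n}) && \text{since } S(n) = 5 > 2
\end{aligned}
$$

In each row, $\theta_n$ and $\phi_n$ refer to the parameters of the biomarker named on the left.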