# Estimating biomarker ordering

>The sampler for the biomarker ordering can be a bit trickier. The simplest way to do it might be a Metropolis-Hastings step where you select two indices and propose swapping their order. Then you can work out the relative probabilities, evaluate, and accept or reject based on that. It's not the fastest sampler, but it is a lot more straightforward than some ways of doing it.

In the following, we assume that the true $\theta$ and $\phi$ values are known. Beyond those, we know only the participants' observed biomarker values. We want to estimate the order in which the biomarkers are affected by the disease in question.
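
The swap proposal described in the quote is symmetric: proposing ordering $S'$ from $S$ is as likely as proposing $S$ from $S'$. Assuming a uniform prior over orderings (an assumption, since no prior is stated here), the acceptance probability reduces to a likelihood ratio. The symbols $S$, $S'$ and $X$ (all observed measurements) are notation introduced for this sketch:

$$
\alpha(S \to S') = \min\left(1, \frac{p(X \mid S')}{p(X \mid S)}\right)
$$

A proposal that increases the likelihood is always accepted, and one that decreases it is accepted with probability equal to the ratio.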

In [22]:
import pandas as pd 
import numpy as np 
import re 
import altair as alt 
import matplotlib.pyplot as plt 
from collections import Counter

We only have three columns: biomarker, participant, and measurement. 

In [3]:
data = pd.read_csv('data/participant_data.csv')
data.Biomarker = [re.sub("Biomarker ", "", text) for text in data.Biomarker.tolist()]
data_we_have = data.drop(['k_j', 'S_n', 'affected_or_not'], axis = 1)
data_we_have.head()

Unnamed: 0,Biomarker,participant,measurement
0,0,0,10.336151
1,0,1,10.865467
2,0,2,1.238857
3,0,3,10.163485
4,0,4,11.036735


In [4]:
theta_phi = pd.read_csv('data/means_vars.csv')
theta_phi.head()

Unnamed: 0,biomarker,theta_mean,theta_var,phi_mean,phi_var
0,0,1.0,0.3,12.0,1.3
1,1,3.0,0.5,11.0,2.4
2,2,5.0,0.2,14.0,1.4
3,3,6.0,1.3,16.0,0.9
4,4,8.0,3.3,18.0,1.5


In [5]:
type(theta_phi['biomarker'][0])

numpy.int64

In [6]:
def fill_up_pdata(pdata, k_j):
    '''Fill up participant data using k_j
    Input:
    - pdata: a dataframe of ten biomarker values for a specific participant 
    - k_j: a scalar
    '''
    data = pdata.copy()
    data['k_j'] = k_j
    data['affected'] = data.apply(lambda row: row.k_j >= row.S_n, axis = 1)
    return data 

In [7]:
def compute_single_measurement_likelihood(theta_phi, biomarker, affected, measurement):
    '''Computes the likelihood of the measurement value of a single biomarker
    We know the normal distribution defined by either theta or phi
    and we know the measurement. This will give us the probability
    of the given measurement. 

    input:
    - theta_phi: the dataframe containing theta and phi values for each biomarker
    - biomarker: an integer between 0 and 9 
    - affected: boolean 
    - measurement: the observed value for a biomarker in a specific participant

    output: a number 
    '''
    biomarker_params = theta_phi[theta_phi.biomarker == biomarker].reset_index()
    mu = biomarker_params['theta_mean'][0] if affected else biomarker_params['phi_mean'][0]
    var = biomarker_params['theta_var'][0] if affected else biomarker_params['phi_var'][0]
    sigma = np.sqrt(var)
    return np.exp(-(measurement - mu)**2/(2*sigma**2))/np.sqrt(2*np.pi*sigma**2)
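
Products of many small densities can underflow, so it may help to work with log-densities and sum them. The following is a minimal sketch of that idea, not part of the original notebook; the names `normal_logpdf` and `participant_log_likelihood` are hypothetical, and the parameters are passed as plain arrays rather than looked up in `theta_phi`.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    """Log-density of a normal distribution N(mean, var) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def participant_log_likelihood(measurements, affected,
                               theta_mean, theta_var, phi_mean, phi_var):
    """Sum of log-densities over biomarkers for one participant.

    As in the notebook, affected biomarkers use the theta parameters and
    unaffected ones use the phi parameters. All inputs are 1-D arrays of
    equal length (one entry per biomarker).
    """
    mean = np.where(affected, theta_mean, phi_mean)
    var = np.where(affected, theta_var, phi_var)
    return normal_logpdf(np.asarray(measurements), mean, var).sum()
```

Exponentiating the result recovers the ordinary likelihood whenever that value is still representable as a float.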

In [8]:
def compute_likelihood(pdata, k_j):
    '''This implements the formula of https://ebm-book2.vercel.app/distributions.html#known-k-j
    '''
    data = fill_up_pdata(pdata, k_j)
    likelihood = 1
    for i, row in data.iterrows():
        biomarker = int(row['Biomarker'])
        measurement = row['measurement']
        affected = row['affected']
        likelihood *= compute_single_measurement_likelihood(theta_phi, biomarker, affected, measurement)
    return likelihood

## Testing

The above functions can compute the likelihood of a participant's sequence of biomarker data, given that we know the exact ordering and we assume a `k_j`. Next, we will test those functions by selecting a specific participant. We compute the likelihood by trying all possible `k_j` and see whether the one with the highest likelihood is the real `k_j` in the original data. 

In [9]:
p = 15 # we chose this participant
pdata = data[data.participant == p].reset_index(drop=True)
pdata

Unnamed: 0,Biomarker,participant,measurement,k_j,S_n,affected_or_not
0,0,15,0.477114,10,5,affected
1,1,15,3.272517,10,7,affected
2,2,15,5.13286,10,6,affected
3,3,15,4.204429,10,4,affected
4,4,15,7.186972,10,8,affected
5,5,15,0.853901,10,2,affected
6,6,15,6.140174,10,9,affected
7,7,15,2.328832,10,10,affected
8,8,15,7.311669,10,1,affected
9,9,15,9.443621,10,3,affected


In [10]:
# ordering of biomarker affected by the disease
real_ordering_dic = dict(zip(np.arange(10), pdata.S_n))
real_ordering_dic

{0: 5, 1: 7, 2: 6, 3: 4, 4: 8, 5: 2, 6: 9, 7: 10, 8: 1, 9: 3}

In [11]:
# get the participant data without k_j, S_n, and affected or not
pdata = data_we_have[data_we_have.participant == p].reset_index(drop=True)
# obtain real ordering:
pdata['S_n'] = pdata.apply(lambda row: real_ordering_dic[int(row['Biomarker'])], axis = 1)
pdata

Unnamed: 0,Biomarker,participant,measurement,S_n
0,0,15,0.477114,5
1,1,15,3.272517,7
2,2,15,5.13286,6
3,3,15,4.204429,4
4,4,15,7.186972,8
5,5,15,0.853901,2
6,6,15,6.140174,9
7,7,15,2.328832,10
8,8,15,7.311669,1
9,9,15,9.443621,3


In [12]:
num_biomarkers = len(pdata.Biomarker.unique())
# calculate likelihood for all possible k_j
likelihood_list = [compute_likelihood(pdata=pdata, k_j=x) for x in range(num_biomarkers+1)]
kjs = np.arange(num_biomarkers + 1)  # candidate stages 0..num_biomarkers
dic = dict(zip(kjs, likelihood_list))
df = pd.DataFrame.from_dict(dic, orient='index', columns=['likelihood']).reset_index()
df.sort_values('likelihood', ascending=False)

Unnamed: 0,index,likelihood
10,10,1.85686e-06
9,9,4.085494e-21
8,8,1.853449e-21
7,7,3.6020979999999997e-38
6,6,7.004614999999999e-44
5,5,1.764717e-56
4,4,8.863316e-79
3,3,9.910822000000001e-112
2,2,3.6103399999999997e-124
1,1,1.067792e-161


From the above result, we can see that the most likely `k_j` is 10, which is in fact the real `k_j` in the participant data. 

## Metropolis-Hastings Algorithm Implementation

Next, we will implement the Metropolis-Hastings algorithm using the above functions. 

In [13]:
def average_all_likelihood(pdata, num_biomarkers):
    '''This is to compute https://ebm-book2.vercel.app/distributions.html#unknown-k-j
    '''
    return np.mean([compute_likelihood(pdata=pdata, k_j=x) for x in range(num_biomarkers+1)])
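
Written out, `average_all_likelihood` treats every stage `k_j` from 0 to the number of biomarkers as equally likely and averages the known-`k_j` likelihood over them. The notation below is assumed for illustration: $X_j$ is the data of participant $j$, $x_{ij}$ the measurement of biomarker $i$, $S(i)$ the position of biomarker $i$ in the ordering, and $N$ the number of biomarkers.

$$
p(X_j \mid S) = \frac{1}{N+1} \sum_{k_j=0}^{N} p(X_j \mid S, k_j),
\qquad
p(X_j \mid S, k_j) = \prod_{i:\, S(i) \le k_j} p(x_{ij} \mid \theta_i) \prod_{i:\, S(i) > k_j} p(x_{ij} \mid \phi_i)
$$

The condition $S(i) \le k_j$ corresponds to `row.k_j >= row.S_n` in `fill_up_pdata`.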

In [14]:
def metropolis_hastings(pdata, num_biomarkers, iterations):
    '''Metropolis-Hastings-style search over orderings using swap proposals.
    Note: a proposal is kept only when it strictly increases the likelihood,
    so this is a greedy variant without the probabilistic accept/reject step.

    Inputs: 
        - pdata: a dataframe of ten biomarker values for a specific participant
        - num_biomarkers: the number of unique biomarkers
        - iterations: number of iterations

    Outputs:
        - best_order: a numpy array
        - best_likelihood: a scalar 
    '''
    # initialize an ordering (labels 0..num_biomarkers-1) and likelihood
    best_order = np.arange(num_biomarkers)
    best_likelihood = -np.inf 
    for it in range(iterations):
        new_order = best_order.copy()
        # randomly select two distinct indices
        a, b = np.random.choice(num_biomarkers, 2, replace=False)
        # swap the order of the two selected biomarkers
        new_order[a], new_order[b] = new_order[b], new_order[a]
        pdata['S_n'] = new_order
        likelihood = average_all_likelihood(pdata, num_biomarkers)
        # keep the proposal only if it improves on the best likelihood so far
        if likelihood > best_likelihood:
            best_likelihood = likelihood 
            best_order = new_order
    return best_order, best_likelihood
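
The function above keeps a proposal only when the likelihood strictly improves. For comparison, here is a minimal sketch of a swap sampler with the usual probabilistic accept/reject step, working in log space. It is an illustration under stated assumptions, not the notebook's method: `metropolis_hastings_swap` and `toy_log_likelihood` are hypothetical names, the log-likelihood is passed in as a callable, positions are labeled 1 to N, and a uniform prior over orderings is assumed.

```python
import numpy as np

def metropolis_hastings_swap(log_likelihood, num_biomarkers, iterations, rng=None):
    """Swap-proposal sampler over orderings; returns the best state visited."""
    rng = np.random.default_rng() if rng is None else rng
    # Start from a random permutation of the positions 1..num_biomarkers.
    current = rng.permutation(num_biomarkers) + 1
    current_ll = log_likelihood(current)
    best, best_ll = current.copy(), current_ll
    for _ in range(iterations):
        proposal = current.copy()
        # Propose swapping the positions of two distinct biomarkers.
        a, b = rng.choice(num_biomarkers, size=2, replace=False)
        proposal[a], proposal[b] = proposal[b], proposal[a]
        proposal_ll = log_likelihood(proposal)
        # Accept with probability min(1, exp(proposal_ll - current_ll)).
        if np.log(rng.random()) < proposal_ll - current_ll:
            current, current_ll = proposal, proposal_ll
            if current_ll > best_ll:
                best, best_ll = current.copy(), current_ll
    return best, best_ll

# Toy target (made up for illustration): penalize each misplaced biomarker.
true_order = np.array([3, 1, 2, 5, 4])
toy_log_likelihood = lambda order: -5.0 * np.sum(order != true_order)
best, best_ll = metropolis_hastings_swap(
    toy_log_likelihood, 5, 2000, np.random.default_rng(0))
```

Because worse proposals are sometimes accepted, the chain can leave a local optimum, which the greedy rule cannot do.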


Let's test the algorithm with one specific participant:

In [15]:
pdata = data_we_have[data_we_have.participant == p].reset_index(drop=True)

In [16]:
num_biomarkers = len(pdata.Biomarker.unique())
num_iterations = 1000
best_order, best_likelihood = metropolis_hastings(pdata, num_biomarkers, num_iterations)
best_order

array([0, 8, 2, 3, 4, 1, 9, 5, 7, 6])

In [28]:
# real order
real_ordering = np.array(list(real_ordering_dic.values()))
real_ordering

array([ 5,  7,  6,  4,  8,  2,  9, 10,  1,  3])

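The `best_order` obtained from our algorithm differs from the real ordering. Note also that `best_order` uses the labels 0 to 9 while the real ordering uses 1 to 10, so the two arrays cannot match element by element as written. 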
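
Exact equality is a strict test for two orderings of ten biomarkers. One softer comparison, sketched here as an assumption rather than something the notebook uses, is the Kendall tau distance: the number of biomarker pairs that the two orderings rank in opposite order. It depends only on relative order, so the difference between labels 0 to 9 and 1 to 10 does not affect it. The function name is hypothetical.

```python
import numpy as np
from itertools import combinations

def kendall_tau_distance(order_a, order_b):
    """Count biomarker pairs that the two orderings rank in opposite order."""
    a = np.asarray(order_a)
    b = np.asarray(order_b)
    discordant = sum(
        (a[i] - a[j]) * (b[i] - b[j]) < 0
        for i, j in combinations(range(len(a)), 2)
    )
    return int(discordant)

# The two arrays shown above; 0 means identical relative order,
# 45 (= 10 * 9 / 2) means completely reversed.
estimated = np.array([0, 8, 2, 3, 4, 1, 9, 5, 7, 6])
real = np.array([5, 7, 6, 4, 8, 2, 9, 10, 1, 3])
distance = kendall_tau_distance(estimated, real)
```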

### All participants

One problem is that we only considered one specific participant, but we have 100 participants. We can estimate the ordering for each participant:

In [18]:
num_participants = len(data.participant.unique())
p_order_dict = {}
num_biomarkers = len(pdata.Biomarker.unique())
num_iterations = 100
for p in range(num_participants):
    pdata = data_we_have[data_we_have.participant == p].reset_index(drop=True)
    best_order, best_likelihood = metropolis_hastings(pdata, num_biomarkers, num_iterations)
    p_order_dict[p] = best_order
    print(f"participant #{p} is done.")

participant #0 is done.
participant #1 is done.
participant #2 is done.
...

In [24]:
estimated_orderings = list(p_order_dict.values())
tuples = [tuple(arr) for arr in estimated_orderings]
counter = Counter(tuples)
most_common_tuple, count = counter.most_common(1)[0]
most_common_tuple, count

((0, 9, 8, 1, 6, 2, 5, 7, 4, 3), 3)

In [31]:
# to see whether the real ordering is in the estimated ordering list
result = any(np.array_equal(real_ordering, arr) for arr in estimated_orderings)
result

False
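
The loop above estimates a separate ordering for each participant. If one assumes instead that a single ordering is shared by all participants (an assumption made for this sketch, not something the notebook does), the per-participant log-likelihoods can be summed and a single ordering scored against all data at once. The function names, the array layout (`X` with one row per participant and one column per biomarker), and the synthetic data are all hypothetical.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def pooled_log_likelihood(order, X, theta_mean, theta_var, phi_mean, phi_var):
    """Log-likelihood of one shared ordering, summed over all participants.

    order[i] is the position (1..N) of biomarker i; X has shape (J, N).
    Each participant's unknown stage k_j is averaged out uniformly over 0..N.
    """
    order = np.asarray(order)
    num_biomarkers = X.shape[1]
    log_affected = normal_logpdf(X, theta_mean, theta_var)   # shape (J, N)
    log_healthy = normal_logpdf(X, phi_mean, phi_var)        # shape (J, N)
    stages = np.arange(num_biomarkers + 1)
    # is_affected[k, i] is True when biomarker i is affected at stage k.
    is_affected = stages[:, None] >= order[None, :]
    per_stage = np.where(is_affected[None, :, :],
                         log_affected[:, None, :],
                         log_healthy[:, None, :]).sum(axis=2)  # shape (J, N+1)
    # Log of the uniform average over stages, computed stably.
    peak = per_stage.max(axis=1, keepdims=True)
    log_avg = peak[:, 0] + np.log(np.exp(per_stage - peak).mean(axis=1))
    return log_avg.sum()

# Synthetic data, made up for illustration: 3 biomarkers, 50 participants.
rng = np.random.default_rng(0)
true_order = np.array([1, 2, 3])
theta_mean, theta_var = np.zeros(3), np.ones(3)
phi_mean, phi_var = np.full(3, 10.0), np.ones(3)
k = rng.integers(0, 4, size=50)
affected = k[:, None] >= true_order[None, :]
X = np.where(affected,
             rng.normal(theta_mean, 1.0, size=(50, 3)),
             rng.normal(phi_mean, 1.0, size=(50, 3)))
ll_true = pooled_log_likelihood(true_order, X, theta_mean, theta_var, phi_mean, phi_var)
ll_reversed = pooled_log_likelihood(true_order[::-1], X, theta_mean, theta_var, phi_mean, phi_var)
```

On this synthetic data the true ordering scores higher than its reverse, because participants at intermediate stages are informative about which biomarkers come first.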

### Conclusion

This method appears to be both slow and inaccurate. The correct ordering does not appear among the estimated orderings, and the most common estimated ordering occurs only 3 times. I am not sure whether this is because the number of iterations (100) is too small. 