# Estimate

## Conjugacy 

Conjugacy occurs when the posterior distribution belongs to the same family as the prior distribution, only with updated parameter values.

Why is conjugacy important? Without it, computing the posterior requires evaluating an integral (the normalizing constant), which is often hard.

Three major conjugate families:

- Beta-Binomial
- Gamma-Poisson
- Normal-Normal
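
To make the benefit concrete, here is a minimal sketch for the Beta-Binomial family. The prior `Beta(2, 2)` and the data (7 successes in 10 trials) are hypothetical values chosen only for illustration. The conjugate update is a one-line parameter change, and the grid integration is included only to show that it agrees with the brute-force answer.

```python
import numpy as np
from scipy.stats import beta, binom

# Hypothetical prior and data, chosen only for illustration
a0, b0 = 2.0, 2.0   # Beta(a0, b0) prior on the success probability
k, n = 7, 10        # 7 successes observed in 10 trials

# Conjugate update: the posterior is Beta(a0 + k, b0 + n - k)
a_n, b_n = a0 + k, b0 + n - k
closed_form_mean = a_n / (a_n + b_n)

# Same posterior mean by numerical integration on a uniform grid
p = np.linspace(0.0, 1.0, 100_001)[1:-1]          # interior points only
unnormalized = beta.pdf(p, a0, b0) * binom.pmf(k, n, p)
weights = unnormalized / unnormalized.sum()        # grid spacing cancels out
grid_mean = float(np.sum(p * weights))

print(closed_form_mean, grid_mean)
```

With conjugacy, the integral never has to be evaluated; the grid version is the kind of work one would otherwise have to do.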

In our example, we assume that the measurement data for each biomarker follows a normal distribution; however, we do not know the exact $\mu$ and $\sigma$. Our job is to estimate the two parameters for each biomarker based on the data we have. 

According to [*An Introduction to Bayesian Thinking*](https://statswithr.github.io/book/inference-and-decision-making-with-multiple-parameters.html#sec:normal-gamma) by Clyde et al. (2022), if the data come from a normal distribution with unknown $\mu$ and $\sigma$, the conjugate prior for $\mu$ (given $\sigma^2$) is a normal distribution with mean $m_0$ and variance $\frac{\sigma^2}{n_0}$. The conjugate prior for the precision $\frac{1}{\sigma^2}$ is a Gamma distribution with shape $\frac{v_0}{2}$ and rate $\frac{v_0 s_0^{2}}{2}$, where

- $m_0$: prior estimate of $\mu$.
- $n_0$: prior sample size, expressing how strongly the prior belief in $m_0$ is held.
- $s_0^2$: prior estimate of $\sigma^2$.
- $v_0$: prior degrees of freedom, expressing how strongly the prior belief in $s_0^2$ is held.

That is to say: 

$$\mu \mid \sigma^2 \sim \mathcal{N}(m_0, \sigma^2/n_0)$$

$$1/\sigma^2 \sim \mathrm{Gamma}\left(\frac{v_0}{2}, \frac{v_0 s_0^2}{2} \right)$$

Combined, we have:

$$(\mu, 1/\sigma^2) \sim \mathrm{NormalGamma}(m_0, n_0, s_0^2, v_0)$$

The posterior also follows a Normal-Gamma distribution:

$$(\mu, 1/\sigma^2) \mid \text{data} \sim \mathrm{NormalGamma}(m_n, n_n, s_n^2, v_n)$$

More specifically:

$$1/\sigma^2 \mid \text{data} \sim \mathrm{Gamma}(v_n/2, s_n^2 v_n/2)$$

$$\mu \mid \text{data}, \sigma^2 \sim \mathcal{N}(m_n, \sigma^2/n_n)$$

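From these two equations, the posterior mean of $\mu$ is $m_n$. The posterior mean of the precision $1/\sigma^2$ is $\frac{v_n/2}{s_n^2 v_n/2} = \frac{1}{s_n^2}$, because the expected value of $\mathrm{Gamma}(\alpha, \beta)$ with shape $\alpha$ and rate $\beta$ is $\frac{\alpha}{\beta}$. Its reciprocal, $(s_n^2 v_n/2)/(v_n/2) = s_n^2$, serves as a plug-in estimate of $\sigma^2$. (The exact posterior mean of $\sigma^2$ is $\frac{v_n s_n^2}{v_n - 2}$ for $v_n > 2$, which approaches $s_n^2$ as $v_n$ grows.)

Here,

- $m_n$: posterior mean, mode, and median for $\mu$
- $n_n$: posterior sample size
- $s_n^2$: posterior estimate of $\sigma^2$
- $v_n$: posterior degrees of freedom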
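
The difference between the mean of the precision and the mean of the variance can be checked numerically. The sketch below uses hypothetical values for $v_n$ and $s_n^2$ and relies on SciPy's `gamma` and `invgamma` distributions (an assumption about tooling; the notebook itself only imports `norm`). Both take a `scale` argument, so the Gamma rate has to be inverted.

```python
from scipy.stats import gamma, invgamma

# Hypothetical posterior hyperparameters, for illustration only
v_n, s_n_sq = 6.0, 3.8225

alpha = v_n / 2              # Gamma shape
beta = v_n * s_n_sq / 2      # Gamma rate

# Mean of the precision 1/sigma^2: alpha / beta = 1 / s_n_sq
precision_mean = gamma.mean(alpha, scale=1 / beta)

# Plug-in variance estimate: reciprocal of the mean precision, i.e. s_n_sq
plug_in_var = 1 / precision_mean

# Exact posterior mean of sigma^2: if 1/sigma^2 ~ Gamma(alpha, rate=beta),
# then sigma^2 ~ InverseGamma(alpha, scale=beta) with mean beta / (alpha - 1)
exact_var_mean = invgamma.mean(alpha, scale=beta)

print(plug_in_var, exact_var_mean)
```

The two numbers differ noticeably when $v_n$ is small and converge as more data arrive.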

The updating rules for the new hyperparameters are:

$$m_n = \frac{n \bar{y} + n_0m_0}{n + n_0}$$

$$n_n = n_0 + n$$

$$v_n = v_0 + n$$

$$s_n^2 = \frac{1}{v_n}\left[s^2(n-1) + s_0^2v_0 + \frac{n_0n}{n_n}(\bar{y}-m_0)^2\right]$$

where

- $n$: sample size
- $\bar{y}$: sample mean 
- $s^2$: sample variance
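
As a worked illustration of these rules, consider a hypothetical sample (invented here, not taken from the biomarker data) of five measurements $4.8, 5.1, 5.4, 4.9, 5.3$, combined with the weak prior $m_0 = 0$, $n_0 = 1$, $s_0^2 = 1$, $v_0 = 1$. Then $n = 5$, $\bar{y} = 5.1$, and $s^2 = 0.26/4 = 0.065$, so:

$$m_n = \frac{5 \times 5.1 + 1 \times 0}{5 + 1} = 4.25$$

$$n_n = 1 + 5 = 6, \qquad v_n = 1 + 5 = 6$$

$$s_n^2 = \frac{1}{6}\left[0.065 \times 4 + 1 \times 1 + \frac{1 \times 5}{6}(5.1 - 0)^2\right] = \frac{0.26 + 1 + 21.675}{6} = 3.8225$$

The posterior mean is pulled from $\bar{y} = 5.1$ toward the prior mean $0$, and $s_n^2$ is much larger than $s^2$ because the last term penalizes the disagreement between $\bar{y}$ and $m_0$. A prior mean closer to the data would shrink both effects.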

In [124]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import altair as alt

In [125]:
def estimate_params_exact(m0, n0, s0_sq, v0, data):
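    '''
    Closed-form Normal-Gamma update of the prior hyperparameters.

    m0, n0: prior mean of mu and the prior sample size behind it
    s0_sq, v0: prior estimate of sigma^2 and its degrees of freedom
    data: a vector of measurements
    '''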
    # Data summary
    sample_mean = np.mean(data)
    sample_size = len(data)
    sample_var = np.var(data, ddof=1)  # ddof=1 for unbiased estimator

    # Update hyperparameters for the Normal-Gamma posterior
    updated_m0 = (n0 * m0 + sample_size * sample_mean) / (n0 + sample_size)
    updated_n0 = n0 + sample_size
    updated_v0 = v0 + sample_size 
    updated_s0_sq = (1 / updated_v0) * ((sample_size - 1) * sample_var + v0 * s0_sq + 
                    (n0 * sample_size / updated_n0) * (sample_mean - m0)**2)
    updated_alpha = updated_v0/2
    updated_beta = updated_v0*updated_s0_sq/2

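    # Posterior estimates
    mu_posterior_mean = updated_m0
    # beta/alpha equals s_n^2, the reciprocal of the posterior mean precision
    # (a plug-in estimate; the exact posterior mean of sigma^2 is beta/(alpha - 1))
    sigma_squared_posterior_mean = updated_beta/updated_alpha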

    return mu_posterior_mean, sigma_squared_posterior_mean

In [126]:
import pymc as pm
import logging
logger = logging.getLogger('pymc')
logger.setLevel(logging.WARNING)
print(f"Running on PyMC v{pm.__version__}")

def estimate_params(m0, n0, s0_sq, v0, data):
    '''
    Estimate mu and sigma^2 by MCMC under a Normal-Gamma prior.

    m0: prior mean for mu
    n0: prior sample size (how strongly m0 is believed)
    s0_sq: prior estimate of the variance sigma^2
    v0: prior degrees of freedom (how strongly s0_sq is believed)
    data: a vector of measurements
    '''

    with pm.Model() as model:
        # Prior for the precision of the normal distribution (1/sigma^2)
        # lambda is sigma_squared_inverse
        lambda_ = pm.Gamma('lambda', alpha=v0 / 2, beta=v0 * s0_sq / 2)

        # Prior for the mean of the normal distribution
        mu = pm.Normal('mu', mu=m0, tau=n0 * lambda_)

        # Likelihood (how probable the data are under a hypothesis).
        # The hypothesis here is that the measurements follow a normal distribution
        # with mean mu and standard deviation 1/sqrt(lambda_)
        Y_obs = pm.Normal('Y_obs', mu=mu, sigma=1/pm.math.sqrt(lambda_), observed=data)

        # Draw posterior samples
        trace = pm.sample(1000, progressbar=False)
        
    mu_estimate = np.mean(trace.posterior['mu'].values)
    lambda_estimate = np.mean(trace.posterior['lambda'].values)
    var_estimate = 1 / lambda_estimate  # Convert precision to variance
    return mu_estimate, var_estimate

Running on PyMC v5.6.1
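
Because the Normal-Gamma posterior is known in closed form, it can also be sampled directly, which gives a cheap cross-check on MCMC output. The sketch below is not part of the original notebook. It assumes hypothetical posterior hyperparameters and uses NumPy's random generator, then summarizes the draws the same way `estimate_params` does (mean of `mu`, reciprocal of the mean precision).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior hyperparameters (not computed from the notebook's data)
m_n, n_n, s_n_sq, v_n = 4.25, 6.0, 3.8225, 6.0

# 1/sigma^2 | data ~ Gamma(shape=v_n/2, rate=v_n*s_n_sq/2).
# NumPy parameterizes the Gamma by scale = 1/rate.
lam = rng.gamma(shape=v_n / 2, scale=2 / (v_n * s_n_sq), size=200_000)

# mu | data, sigma^2 ~ Normal(m_n, sigma^2 / n_n), with sigma^2 = 1/lam
mu = rng.normal(loc=m_n, scale=np.sqrt(1 / (n_n * lam)))

mu_estimate = mu.mean()          # should be close to m_n
var_estimate = 1 / lam.mean()    # should be close to s_n_sq

print(mu_estimate, var_estimate)
```

If the MCMC estimates for the same hyperparameters differ markedly from these direct draws, that would point to a parameterization or convergence problem rather than to the data.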


In [127]:
# truth data; used to validate the estimates
truth = pd.read_csv('data/means_vars.csv')
# observed data:
observed_data = pd.read_csv('data/participant_data.csv')
observed_data.head()

| | Biomarker | participant | measurement | k_j | k_n | drawn_from |
|---|---|---|---|---|---|---|
| 0 | Biomarker 0 | 0 | 5.103049 | 10 | 10 | diseased |
| 1 | Biomarker 0 | 1 | 7.732748 | 6 | 10 | healthy |
| 2 | Biomarker 0 | 2 | 7.960703 | 9 | 10 | healthy |
| 3 | Biomarker 0 | 3 | 7.640616 | 4 | 10 | healthy |
| 4 | Biomarker 0 | 4 | 7.244324 | 4 | 10 | healthy |
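
The conjugate update only needs three summaries per biomarker and group: the sample size, the sample mean, and the sample variance. As a minimal sketch, the toy frame below mimics three of the columns shown above with invented values (it is not the real `participant_data.csv`) and computes those summaries with a single `groupby`.

```python
import pandas as pd

# Hypothetical rows mimicking the notebook's columns (values are invented)
toy = pd.DataFrame({
    'Biomarker': ['Biomarker 0'] * 4 + ['Biomarker 1'] * 4,
    'measurement': [5.1, 4.9, 7.7, 7.9, 0.2, 0.0, 1.6, 1.8],
    'drawn_from': ['diseased', 'diseased', 'healthy', 'healthy'] * 2,
})

# n, y_bar and s^2 for every (biomarker, group) pair;
# pandas' 'var' uses ddof=1, matching np.var(data, ddof=1)
summary = (
    toy.groupby(['Biomarker', 'drawn_from'])['measurement']
       .agg(n='count', y_bar='mean', s_sq='var')
       .reset_index()
)
print(summary)
```

These three columns are exactly the quantities $n$, $\bar{y}$, and $s^2$ that enter the updating rules.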


In [128]:
# Collect one dictionary of PyMC estimates per biomarker
means_vars_estimate_dict_list = []
biomarkers = list(range(10))
for biomarker in biomarkers:
    dic = {'biomarker': biomarker}  # one row per biomarker, filled by the inner loop
    for group in ['healthy', 'diseased']:
        data_full = observed_data[
            (observed_data.Biomarker == f"Biomarker {biomarker}")
            & (observed_data.drawn_from == group)]
        data = data_full.measurement
        mu_estimate, var_estimate = estimate_params(
            m0=0, n0=1, s0_sq=1, v0=1, data=data)
        if group == 'diseased':
            dic['theta_mean'] = mu_estimate
            dic['theta_var'] = var_estimate
        else:
            dic['phi_mean'] = mu_estimate
            dic['phi_var'] = var_estimate
    print(f"biomarker {biomarker} done!")
    means_vars_estimate_dict_list.append(dic)

biomarker 0 done!
biomarker 1 done!
biomarker 2 done!
biomarker 3 done!
biomarker 4 done!
biomarker 5 done!
biomarker 6 done!
biomarker 7 done!
biomarker 8 done!
biomarker 9 done!


In [129]:
estimate_df = pd.DataFrame(means_vars_estimate_dict_list)
estimate_df.to_csv("data/estimate_means_vars.csv", index = False)
estimate_df

| | biomarker | phi_mean | phi_var | theta_mean | theta_var |
|---|---|---|---|---|---|
| 0 | 0 | 7.898087 | 0.736293 | 4.11282 | 4.60231 |
| 1 | 1 | 1.658082 | 0.42057 | 0.100143 | 0.239418 |
| 2 | 2 | 3.824225 | 0.740501 | 5.80567 | 1.06113 |
| 3 | 3 | 1.000575 | 0.036298 | 7.669765 | 2.139471 |
| 4 | 4 | 5.958301 | 0.541965 | 7.553567 | 4.25432 |
| 5 | 5 | 0.031827 | 0.750299 | 0.964183 | 0.304356 |
| 6 | 6 | 3.864497 | 0.868878 | 6.878794 | 0.93011 |
| 7 | 7 | 7.285676 | 5.370402 | 5.888697 | 0.787011 |
| 8 | 8 | 0.988506 | 0.437517 | 2.872274 | 0.391523 |
| 9 | 9 | 0.041772 | 0.071868 | 7.952178 | 1.147079 |
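
A numeric comparison against the truth table can complement a visual one. The sketch below assumes both tables share a `biomarker` key and `theta_*` columns, and it uses two invented rows rather than the real files, so the numbers carry no meaning beyond showing the mechanics.

```python
import pandas as pd

# Hypothetical values for two biomarkers; the real files are not reproduced here
truth_toy = pd.DataFrame({
    'biomarker': [0, 1], 'theta_mean': [4.0, 0.0], 'theta_var': [4.0, 0.25]})
estimate_toy = pd.DataFrame({
    'biomarker': [0, 1], 'theta_mean': [4.1, 0.1], 'theta_var': [4.6, 0.24]})

# Align truth and estimate row by row, then compute error columns
merged = truth_toy.merge(estimate_toy, on='biomarker', suffixes=('_true', '_est'))
merged['theta_mean_abs_err'] = (merged.theta_mean_est - merged.theta_mean_true).abs()
merged['theta_var_rel_err'] = (
    (merged.theta_var_est - merged.theta_var_true).abs() / merged.theta_var_true)

print(merged[['biomarker', 'theta_mean_abs_err', 'theta_var_rel_err']])
```

The same pattern extends to the `phi_*` columns by repeating the two error lines.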


In [130]:
biomarkers = list(range(10))
charts = []
for n in biomarkers: 
    biomarker_data_est = estimate_df[estimate_df.biomarker == n].reset_index()
    biomarker_data = truth[truth.biomarker == n].reset_index()
    # theta for diseased
    theta_mean_est = biomarker_data_est.theta_mean[0]
    theta_var_est = biomarker_data_est.theta_var[0]
    theta_std_est = np.sqrt(theta_var_est)

    theta_mean = biomarker_data.theta_mean[0]
    theta_var = biomarker_data.theta_var[0]
    theta_std = np.sqrt(theta_var)
    # phi for healthy
    phi_mean_est = biomarker_data_est.phi_mean[0]
    phi_var_est = biomarker_data_est.phi_var[0]
    phi_std_est = np.sqrt(phi_var_est)

    phi_mean = biomarker_data.phi_mean[0]
    phi_var = biomarker_data.phi_var[0]
    phi_std = np.sqrt(phi_var)

    mean1, std1 = theta_mean, theta_std
    mean2, std2 = theta_mean_est, theta_std_est

    # Generating points on the x axis
    x_thetas = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_theta = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Theta"))
    ).properties(
        title=f'Biomarker {n}, Theta'
        )

    mean1, std1 = phi_mean, phi_std
    mean2, std2 = phi_mean_est, phi_std_est

    # Generating points on the x axis
    x_phis = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_phi = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Phi"))
    ).properties(
        title=f'Biomarker {n}, Phi'
        )
    
    # Concatenate theta and phi charts horizontally
    hconcat_chart = alt.hconcat(chart_theta, chart_phi).resolve_scale(color="independent")

    # Append the concatenated chart to the list of charts
    charts.append(hconcat_chart)
# Concatenate all the charts vertically
final_chart = alt.vconcat(*charts)

# Display the final chart
final_chart.display()