# Bayesian hierarchical model with conjugate prior

**1. A generative model:**

$$ y \mid \beta, \sigma^2 \sim N(X \beta, \sigma^2 V)$$

**2. Prior:**

$$ \beta \mid \sigma^2 \sim N(\mu_\beta, \sigma^2 V_\beta)$$

$$ \sigma^2 \sim \text{IG}(a, b) $$

**3. Posterior:**

$$ \beta \mid y, \sigma^2 \sim N(Mm, \sigma^2 M)$$

$$ \sigma^2 \mid y \sim \text{IG}(a^+, b^+)$$

where

$$ M^{-1} = V^{-1}_\beta + X^T V^{-1} X $$

$$ m = V^{-1}_\beta \mu_\beta + X^T V^{-1} y $$

$$ a^+ = a + \frac n2 $$

$$ b^+ = b + \frac{1}{2} \left( y^T V^{-1} y + \mu_\beta^T V^{-1}_\beta \mu_\beta - m^T M m \right) $$
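
As a brief sketch of where $M$, $m$, $a^+$ and $b^+$ come from (the standard completing-the-square argument, written out here under the definitions above rather than taken from the original notes), the quadratic form in the exponent of likelihood times prior on $\beta$ is

$$ (y - X\beta)^T V^{-1} (y - X\beta) + (\beta - \mu_\beta)^T V^{-1}_\beta (\beta - \mu_\beta) = \beta^T M^{-1} \beta - 2 m^T \beta + y^T V^{-1} y + \mu_\beta^T V^{-1}_\beta \mu_\beta $$

$$ = (\beta - Mm)^T M^{-1} (\beta - Mm) + \left( y^T V^{-1} y + \mu_\beta^T V^{-1}_\beta \mu_\beta - m^T M m \right) $$

The first term is the kernel of $N(Mm, \sigma^2 M)$. The remaining bracket, divided by $2$, is what gets added to $b$. Counting powers of $\sigma^2$ after integrating out $\beta$ gives $(\sigma^2)^{-(a + n/2) - 1}$, hence $a^+ = a + \frac n2$.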

### Implement in Julia

In [1]:
using Distributions
using Random
Random.seed!(1234)

function BayesLinReg(y::Vector, X, μᵦ, Vᵦ, V, a, b, nSim)
    # compute M, m, a⁺, b⁺ (posterior parameters defined above)
    n, p = size(X)
    V⁻¹ = inv(V)  # TODO: switch to Cholesky
    Vᵦ⁻¹ = inv(Vᵦ)  # TODO: switch to Cholesky
    M = inv(Vᵦ⁻¹ + X'*V⁻¹*X)  # TODO: switch to Cholesky
    M = (M + M') / 2  # symmetrize: round-off asymmetry from inv() can make MvNormal fail
    m = Vᵦ⁻¹*μᵦ + X'*V⁻¹*y
    a⁺ = a + n/2
    b⁺ = b + 1/2*(y'*V⁻¹*y + μᵦ'*Vᵦ⁻¹*μᵦ - m'*M*m)
    Mm = M*m  # posterior mean of β, computed once outside the loop
    # sample from posterior
    σ²sim = zeros(nSim)
    βsim = zeros(nSim, p)
    for i in 1:nSim
        # sample from p(σ²∣y)
        σ² = rand(InverseGamma(a⁺, b⁺))
        σ²sim[i] = σ²  # store the i-th draw
        # sample from p(β∣y,σ²)
        β = rand(MvNormal(Mm, σ²*M))
        βsim[i, :] = β  # one draw per row
    end
    return σ²sim, βsim
end

BayesLinReg (generic function with 1 method)
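
The "switch to Cholesky" comments above could be addressed along the following lines. This is only a minimal sketch: `posterior_params_chol` and `draw_β` are hypothetical helper names that do not appear in the original notebook, and the sketch assumes `V` and `Vᵦ` are dense symmetric positive-definite matrices.

```julia
using LinearAlgebra

# Hypothetical helper (not from the original notebook): posterior parameters
# computed with Cholesky solves instead of explicit inverses of V and M⁻¹.
function posterior_params_chol(y, X, μᵦ, Vᵦ, V, a, b)
    n = size(X, 1)
    CV = cholesky(Symmetric(V))       # factor V once
    CVᵦ = cholesky(Symmetric(Vᵦ))     # factor Vᵦ once
    V⁻¹X = CV \ X                     # V⁻¹X via triangular solves
    V⁻¹y = CV \ y
    Vᵦ⁻¹μᵦ = CVᵦ \ μᵦ
    M⁻¹ = Symmetric(inv(CVᵦ) + X' * V⁻¹X)
    CM = cholesky(M⁻¹)                # M⁻¹ = U'U
    m = Vᵦ⁻¹μᵦ + X' * V⁻¹y
    Mm = CM \ m                       # posterior mean of β
    a⁺ = a + n / 2
    b⁺ = b + (dot(y, V⁻¹y) + dot(μᵦ, Vᵦ⁻¹μᵦ) - dot(m, Mm)) / 2
    return Mm, CM, a⁺, b⁺
end

# One draw of β given σ²: if M⁻¹ = U'U, then U \ z has covariance M for z ~ N(0, I).
draw_β(Mm, CM, σ²) = Mm + sqrt(σ²) * (CM.U \ randn(length(Mm)))
```

Returning the factorization `CM` rather than `M` lets the sampling loop reuse it for every draw, so no covariance matrix has to be re-factorized inside the loop.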

### Implement in R

In [2]:
using RCall

R"""
library(mvtnorm)

BayesLinReg <- function(y, X, mu.beta, V.beta.inv, a, b, nSim){
    # compute M, m, a_new, b_new
    n <- dim(X)[1]
    p <- dim(X)[2]
    M <- solve(V.beta.inv + t(X) %*% X)
    m <- V.beta.inv %*% mu.beta + t(X) %*% y
    Mm <- M %*% m
    a_new <- a + n / 2
    b_new <- b + 1 / 2 * (t(y) %*% y + t(mu.beta) %*% V.beta.inv %*% mu.beta - t(m) %*% Mm)
    # sample from posterior
    sigma2sim <- rep(NA, nSim)
    betasim <- matrix(NA, nSim, p)
    for (i in 1:nSim){
        # sample sigma2
        sigma2 <- 1 / rgamma(1, a_new, b_new)
        sigma2sim[i] <- sigma2
        # sample beta
        beta <- rmvnorm(1, Mm, sigma2 * M)
        betasim[i, ] <- beta
    }
    return(results = list(sigma2sim = sigma2sim, betasim = betasim))
}
"""





### Choice of prior under which the posterior estimates match the maximum likelihood estimate (MLE)

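Suppose $\boldsymbol{V} = \mathbf{I}_n$, and let $\tilde{\boldsymbol{y}}$ be a vector of new observations with design matrix $\tilde{\boldsymbol{X}}$, so that $\tilde{\boldsymbol{y}} \mid \boldsymbol{\beta}, \sigma^2, \boldsymbol{y} \sim N(\tilde{\boldsymbol{X}} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_m)$. (The subscript in $\mathbf{I}_m$ is the number of new observations, not the vector $m$ defined earlier.)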

We need to find $a, b, \boldsymbol{V}^{-1}_\beta$ such that the posterior $\boldsymbol{\beta} \mid \boldsymbol{y}, \sigma^2 \sim N(Mm, \sigma^2 M)$ coincides with $N(\hat{\boldsymbol{\beta}}, \sigma^2 (\boldsymbol{X}^T \boldsymbol{X})^{-1})$, where $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$ is the MLE. Setting $\boldsymbol{V}^{-1}_\beta = \mathbf{0}_{p \times p}$ gives a non-informative (flat) prior on $\boldsymbol{\beta}$, so the data (i.e. the likelihood) alone determine the estimate: $M^{-1} = \boldsymbol{X}^T \boldsymbol{X}$, $m = \boldsymbol{X}^T \boldsymbol{y}$, and hence $M m = \hat{\boldsymbol{\beta}}$.
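
The claim $Mm = \hat{\boldsymbol{\beta}}$ under a flat prior can be checked numerically. The following is a small sketch on simulated data. The dimensions, coefficients and seed are arbitrary illustrative choices, and `β_ols` is a name introduced here.

```julia
using LinearAlgebra, Random

Random.seed!(1)                       # arbitrary seed, illustrative data only
n, p = 50, 3
X = randn(n, p)
y = X * [1.0, -2.0, 0.5] + 0.1 * randn(n)

Vᵦ⁻¹ = zeros(p, p)                    # flat prior on β
μᵦ = ones(p)                          # any value: it drops out when Vᵦ⁻¹ = 0
M = inv(Vᵦ⁻¹ + X' * X)                # V = I, so V⁻¹ is omitted
m = Vᵦ⁻¹ * μᵦ + X' * y
β_ols = X \ y                         # least-squares solution, i.e. the MLE of β

println(M * m ≈ β_ols)                # compares posterior mean with the MLE
```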

$$
\text{E}[\sigma^2 \mid \boldsymbol{y}] = \frac{b^+}{a^+ - 1} \quad (\because \sigma^2 \mid \boldsymbol{y} \sim \text{IG}(a^+, b^+))
$$

$$
= \frac{b + \frac{1}{2} (\boldsymbol{y}^T \boldsymbol{y} - m^T M m)}{a + \frac{n}{2} - 1}
$$

$$
= \frac{b + \frac{1}{2} (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})^T (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})}{a + \frac{n}{2} - 1}
$$

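The frequentist (unbiased) estimate is $\hat{\sigma}^2 = \frac{(\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})^T (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})}{n - p}$. For a given data set many pairs $a, b$ equate the two, but the only choice that does not depend on the data is $a = 1- \frac{p}{2}, b = 0$, which makes the denominator $a + \frac{n}{2} - 1 = \frac{n - p}{2}$. Note that this corresponds to an improper prior on $\sigma^2$.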
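
As a quick numerical sanity check of the choice $a = 1 - \frac{p}{2}$, $b = 0$, here is a sketch on arbitrary simulated data. The variable names `rss`, `post_mean` and `s²` are introduced here for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(2)                       # arbitrary seed, illustrative data only
n, p = 40, 4
X = randn(n, p)
y = randn(n)

β_ols = X \ y
rss = sum(abs2, y - X * β_ols)        # residual sum of squares

a, b = 1 - p / 2, 0.0                 # prior choice discussed above
M = inv(X' * X)                       # flat prior on β, V = I
m = X' * y
a⁺ = a + n / 2
b⁺ = b + (dot(y, y) - dot(m, M * m)) / 2

post_mean = b⁺ / (a⁺ - 1)             # E[σ² | y] for IG(a⁺, b⁺)
s² = rss / (n - p)                    # frequentist unbiased estimate
println(isapprox(post_mean, s²; rtol = 1e-8))
```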

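To compute $\text{Var}(\tilde{\boldsymbol{y}} \mid \sigma^2, \boldsymbol{y})$, start from the joint density of $\tilde{\boldsymbol{y}}$ and $\boldsymbol{\beta}$ given $\sigma^2$ and $\boldsymbol{y}$, which is the product of the density of $\tilde{\boldsymbol{y}} \mid \boldsymbol{\beta}, \sigma^2$ and the posterior $\boldsymbol{\beta} \mid \boldsymbol{y}, \sigma^2 \sim N(\hat{\boldsymbol{\beta}}, \sigma^2 (\boldsymbol{X}^T \boldsymbol{X})^{-1})$:

$$
p(\tilde{\boldsymbol{y}}, \boldsymbol{\beta} \mid \sigma^2, \boldsymbol{y}) \propto e^{-\frac{1}{2\sigma^2}[(\tilde{\boldsymbol{y}} - \tilde{\boldsymbol{X}} \boldsymbol{\beta})^T (\tilde{\boldsymbol{y}} - \tilde{\boldsymbol{X}} \boldsymbol{\beta}) + (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T \boldsymbol{X}^T \boldsymbol{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})]}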
$$

Collecting the terms in the exponent that involve $\boldsymbol{\beta}$, in order to integrate it out,

$$ 
\boldsymbol{\beta}^T (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X}) \boldsymbol{\beta} - 2 (\tilde{\boldsymbol{y}}^T \tilde{\boldsymbol{X}} + \hat{\boldsymbol{\beta}}^T \boldsymbol{X}^T \boldsymbol{X}) \boldsymbol{\beta}
$$

$$
= [\boldsymbol{\beta} - (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1}(\boldsymbol{X}^T \boldsymbol{X} \hat{\boldsymbol{\beta}} + \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{y}})]^T 
(\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})
[\boldsymbol{\beta} - (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1}(\boldsymbol{X}^T \boldsymbol{X} \hat{\boldsymbol{\beta}} + \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{y}})]
$$

$$ 
- (\boldsymbol{X}^T \boldsymbol{X} \hat{\boldsymbol{\beta}} + \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{y}})^T 
(\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1}
(\boldsymbol{X}^T \boldsymbol{X} \hat{\boldsymbol{\beta}} + \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{y}})
$$

After integrating with respect to $\boldsymbol{\beta}$,

$$
\tilde{\boldsymbol{y}} \mid \sigma^2, \boldsymbol{y} \propto
e^{-\frac{1}{2\sigma^2} \left\{ \tilde{\boldsymbol{y}}^T [\mathbf{I}_m - \tilde{\boldsymbol{X}} (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1} \tilde{\boldsymbol{X}}^T] \tilde{\boldsymbol{y}} - 2 \tilde{\boldsymbol{y}}^T \tilde{\boldsymbol{X}} (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol{X}\hat{\boldsymbol{\beta}} \right\}}
$$

$$
\therefore \tilde{\boldsymbol{y}} \mid \sigma^2, \boldsymbol{y} \sim N([\mathbf{I}_m - \tilde{\boldsymbol{X}} (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1} \tilde{\boldsymbol{X}}^T]^{-1} \tilde{\boldsymbol{X}} (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol{X}\hat{\boldsymbol{\beta}}, \sigma^2 [\mathbf{I}_m - \tilde{\boldsymbol{X}} (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1} \tilde{\boldsymbol{X}}^T]^{-1})
$$
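
The mean of this distribution simplifies considerably. As a short sketch, write $A = \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X}$ and $P = \mathbf{I}_m - \tilde{\boldsymbol{X}} A^{-1} \tilde{\boldsymbol{X}}^T$ (shorthand introduced here, not in the original notes), so that the mean is $P^{-1} \tilde{\boldsymbol{X}} A^{-1} \boldsymbol{X}^T \boldsymbol{X} \hat{\boldsymbol{\beta}}$. Then

$$
P \tilde{\boldsymbol{X}} = \tilde{\boldsymbol{X}} - \tilde{\boldsymbol{X}} A^{-1} \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} = \tilde{\boldsymbol{X}} A^{-1} (A - \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}) = \tilde{\boldsymbol{X}} A^{-1} \boldsymbol{X}^T \boldsymbol{X}
$$

$$
\Rightarrow \quad P^{-1} \tilde{\boldsymbol{X}} A^{-1} \boldsymbol{X}^T \boldsymbol{X} \hat{\boldsymbol{\beta}} = P^{-1} P \tilde{\boldsymbol{X}} \hat{\boldsymbol{\beta}} = \tilde{\boldsymbol{X}} \hat{\boldsymbol{\beta}}
$$

so the predictive mean is simply $\tilde{\boldsymbol{X}} \hat{\boldsymbol{\beta}}$.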

Thus, under the Bayesian approach, $\text{Var}(\tilde{\boldsymbol{y}} \mid \sigma^2, \boldsymbol{y}) = \sigma^2 [\mathbf{I}_m - \tilde{\boldsymbol{X}} (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \boldsymbol{X}^T \boldsymbol{X})^{-1} \tilde{\boldsymbol{X}}^T]^{-1}$. By the binomial inverse theorem (Woodbury matrix identity), this equals $\sigma^2 [\mathbf{I}_m + \tilde{\boldsymbol{X}} (\boldsymbol{X}^T \boldsymbol{X})^{-1} \tilde{\boldsymbol{X}}^T]$. Alternatively, $\text{Var}(\tilde{\boldsymbol{y}} \mid \sigma^2, \boldsymbol{y})$ can be calculated directly as follows,

$$
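\tilde{\boldsymbol{y}} = \tilde{\boldsymbol{X}} {\boldsymbol{\beta}} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \mid \sigma^2 \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_m) \text{ independent of } \boldsymbol{\beta}
$$

$$
\text{Var}(\tilde{\boldsymbol{y}} \mid \sigma^2, \boldsymbol{y}) = \tilde{\boldsymbol{X}} \text{Var}({\boldsymbol{\beta}} \mid \sigma^2, \boldsymbol{y}) \tilde{\boldsymbol{X}}^T + \sigma^2 \mathbf{I}_m = \sigma^2 [\mathbf{I}_m + \tilde{\boldsymbol{X}} (\boldsymbol{X}^T \boldsymbol{X})^{-1} \tilde{\boldsymbol{X}}^T],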
$$

which agrees with the result above and with the prediction variance from the frequentist approach.
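
Both forms of the predictive covariance, and the predictive mean, can be compared numerically. Below is a minimal sketch with $\sigma^2 = 1$ on arbitrary simulated data. `Xnew` stands in for $\tilde{\boldsymbol{X}}$ and `k` for the number of new observations, and all names are introduced here for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(3)                       # arbitrary seed, illustrative data only
n, p, k = 30, 3, 5                    # k new observations
X, Xnew = randn(n, p), randn(k, p)
y = X * [0.5, -1.0, 2.0] + randn(n)
β_ols = X \ y

A = Xnew' * Xnew + X' * X
P = I - Xnew * (A \ Xnew')            # matrix of the quadratic form in ỹ

cov_direct = inv(P)                                   # form obtained by integrating out β
cov_woodbury = I + Xnew * ((X' * X) \ Xnew')          # form after the Woodbury identity
mean_direct = P \ (Xnew * (A \ (X' * X * β_ols)))     # mean from the quadratic form

println(cov_direct ≈ cov_woodbury)
println(mean_direct ≈ Xnew * β_ols)
```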