# Estimating hidden covariates with prior (HCP) for eQTL analyses

Suppose $\mathbf{Y} \in \mathbb{R}^{n\times p}$ is a gene expression matrix with $n$ individuals and $p$ genes, $\mathbf{Z} \in \mathbb{R}^{n\times k}$ is a set of $k$ hidden covariates influencing gene expression with $\mathbf{B} \in \mathbb{R}^{k \times p}$ their corresponding effect sizes, and $\mathbf{F} \in \mathbb{R}^{n\times m}$ is a set of $m$ known covariates with $\mathbf{U} \in \mathbb{R}^{m\times k}$ capturing their relation to the hidden covariates. [The original paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0068141) estimates $\mathbf{Z}, \mathbf{B}, \mathbf{U}$ by minimizing the following objective:

$$
L(\mathbf{Z}, \mathbf{B}, \mathbf{U}) = \|\mathbf{Y} - \mathbf{Z} \mathbf{B}\|_{\text{F}}^2 + \lambda_1 \|\mathbf{Z} - \mathbf{F} \mathbf{U}\|_{\text{F}}^2 + \lambda_2 \|\mathbf{B}\|_{\text{F}}^2 + \lambda_3 \|\mathbf{U}\|_{\text{F}}^2,
$$

where the number of hidden covariates $k$ and the regularization parameters $\lambda_1, \lambda_2, \lambda_3$ are provided by the user. One can set these values using cross-validation, by evaluating the performance of the resulting residual data on a desired task. Typically, if $\lambda_1 > 5$, the hidden covariates closely match the known covariates.
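The tuning step above can be sketched as a plain grid search. This is only an illustration under stated assumptions: `grid_search`, `fit`, and `score` are hypothetical names that do not come from the original paper, the candidate values are arbitrary placeholders, and a full cross-validation would additionally split the data into folds.

```julia
# Hypothetical helper (not part of HCP): exhaustive grid search.
# `fit(params)` should return a fitted model and `score(model)` a number
# where larger is better; both are supplied by the user.
function grid_search(fit, score, grid)
    best_params, best_score = first(grid), -Inf
    for params in grid
        s = score(fit(params))
        if s > best_score
            best_params, best_score = params, s
        end
    end
    return best_params, best_score
end

# Candidate values below are arbitrary placeholders, not recommendations.
grid = vec([(k = k, λ₁ = λ₁, λ₂ = λ₂, λ₃ = λ₃)
            for k in (10, 20, 40), λ₁ in (1.0, 5.0, 20.0),
                λ₂ in (0.1, 1.0, 10.0), λ₃ in (0.1, 1.0, 10.0)])
```

In practice, `fit` could wrap the `hcp` function defined in the Julia section, and `score` could measure how well the residuals $\mathbf{Y} - \mathbf{Z}\mathbf{B}$ perform on the task of interest.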

Given that the objective is a biconvex function, we can implement alternating iterative updates by setting the matrix derivative with respect to each argument to zero while holding all the others constant. Hence,

\begin{eqnarray*}
\mathbf{Z}^{(t+1)} &=& (\lambda_1 \mathbf{F}\mathbf{U}^{(t)} + \mathbf{Y} \mathbf{B}^{(t)T}) (\mathbf{B}^{(t)}\mathbf{B}^{(t)T} + \lambda_1 \mathbf{I}_k)^{-1} \\
\mathbf{B}^{(t+1)} &=& (\mathbf{Z}^{(t+1)T} \mathbf{Z}^{(t+1)} + \lambda_2 \mathbf{I}_k)^{-1} \mathbf{Z}^{(t+1)T} \mathbf{Y} \\
\mathbf{U}^{(t+1)} &=& \lambda_1 (\lambda_1 \mathbf{F}^T \mathbf{F} + \lambda_3 \mathbf{I}_m)^{-1} \mathbf{F}^T \mathbf{Z}^{(t+1)},
\end{eqnarray*}

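where the superscript $t$ indicates the iteration number. Note that the matrix inverses above exist: for $\lambda_1, \lambda_2, \lambda_3 > 0$, each matrix being inverted is the sum of a positive semidefinite Gram matrix and a positive multiple of the identity, and is therefore positive definite.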
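As a worked check on the updates, the block-wise gradients of $L$ can be written out explicitly (iteration superscripts are dropped for readability):

\begin{eqnarray*}
\frac{\partial L}{\partial \mathbf{Z}} &=& -2 (\mathbf{Y} - \mathbf{Z}\mathbf{B}) \mathbf{B}^T + 2 \lambda_1 (\mathbf{Z} - \mathbf{F}\mathbf{U}) \\
\frac{\partial L}{\partial \mathbf{B}} &=& -2 \mathbf{Z}^T (\mathbf{Y} - \mathbf{Z}\mathbf{B}) + 2 \lambda_2 \mathbf{B} \\
\frac{\partial L}{\partial \mathbf{U}} &=& -2 \lambda_1 \mathbf{F}^T (\mathbf{Z} - \mathbf{F}\mathbf{U}) + 2 \lambda_3 \mathbf{U}.
\end{eqnarray*}

Setting each gradient to zero and collecting terms gives the linear systems that the updates solve:

\begin{eqnarray*}
\mathbf{Z} (\mathbf{B}\mathbf{B}^T + \lambda_1 \mathbf{I}_k) &=& \mathbf{Y}\mathbf{B}^T + \lambda_1 \mathbf{F}\mathbf{U} \\
(\mathbf{Z}^T \mathbf{Z} + \lambda_2 \mathbf{I}_k) \mathbf{B} &=& \mathbf{Z}^T \mathbf{Y} \\
(\lambda_1 \mathbf{F}^T \mathbf{F} + \lambda_3 \mathbf{I}_m) \mathbf{U} &=& \lambda_1 \mathbf{F}^T \mathbf{Z}.
\end{eqnarray*}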

### Implement in Julia

In [1]:
using LinearAlgebra, Random, RCall

In [2]:
function hcp(
        Y::Matrix{T},
        F::Matrix{T},
        k::Int,
        λ₁::T,
        λ₂::T,
        λ₃::T;
        maxiter::Int = 1000, 
        tol::Float64 = 1e-6
        ) where T <: AbstractFloat
    # set up starting point
    n, p = size(Y)
    m = size(F, 2)
    Z = zeros(T, n, k)
    B = rand(T, k, p)
    U = zeros(T, m, k)
    # pre-allocate arrays (element type T, so that Float32 input also works with BLAS)
    storage_nk = Matrix{T}(undef, n, k)
    storage_kk = Matrix{T}(undef, k, k)
    storage_kp = Matrix{T}(undef, k, p)
    storage_mm = Matrix{T}(undef, m, m)
    # precompute (λ₁ FᵀF + λ₃ I)⁻¹ once, since F does not change across iterations
    BLAS.gemm!('T', 'N', λ₁, F, F, T(0), storage_mm)
    for j in 1:m
        storage_mm[j, j] += λ₃
    end
    LAPACK.potrf!('L', storage_mm)
    LAPACK.potri!('L', storage_mm)
    LinearAlgebra.copytri!(storage_mm, 'L')
    storage_mk = Matrix{T}(undef, m, k)
    # compute objective
    L1 = copy(Y)
    L2 = copy(Z)
    BLAS.gemm!('N', 'N', T(-1), Z, B, T(1), L1)
    BLAS.gemm!('N', 'N', T(-1), F, U, T(1), L2)
    obj = abs2(norm(L1)) + λ₁ * abs2(norm(L2)) + 
        λ₂ * abs2(norm(B)) + λ₃ * abs2(norm(U))
    # iterative steps
    i = 1
    for outer i in 1:maxiter
        # update Z
        BLAS.gemm!('N', 'N', λ₁, F, U, T(0), storage_nk)
        BLAS.gemm!('N', 'T', T(1), Y, B, T(1), storage_nk)
        BLAS.gemm!('N', 'T', T(1), B, B, T(0), storage_kk)
        for j in 1:k
            storage_kk[j, j] += λ₁
        end
        LAPACK.potrf!('L', storage_kk)
        LAPACK.potri!('L', storage_kk)
        LinearAlgebra.copytri!(storage_kk, 'L')
        BLAS.gemm!('N', 'N', T(1), storage_nk, storage_kk, T(0), Z)
        # update B
        BLAS.gemm!('T', 'N', T(1), Z, Z, T(0), storage_kk)
        for j in 1:k
            storage_kk[j, j] += λ₂
        end
        BLAS.gemm!('T', 'N', T(1), Z, Y, T(0), storage_kp)
        LAPACK.potrf!('L', storage_kk)
        LAPACK.potrs!('L', storage_kk, storage_kp)
        copy!(B, storage_kp)
        # update U
        BLAS.gemm!('T', 'N', λ₁, F, Z, T(0), storage_mk) # right-hand side λ₁ FᵀZ, as in the R code
        BLAS.gemm!('N', 'N', T(1), storage_mm, storage_mk, T(0), U)
        # compare objective
        copyto!(L1, Y)
        copyto!(L2, Z)
        BLAS.gemm!('N', 'N', T(-1), Z, B, T(1), L1)
        BLAS.gemm!('N', 'N', T(-1), F, U, T(1), L2)
        newobj = abs2(norm(L1)) + λ₁ * abs2(norm(L2)) + 
            λ₂ * abs2(norm(B)) + λ₃ * abs2(norm(U))
        if abs(newobj - obj) / (abs(obj) + 1) < tol
            break
        end
        obj = newobj
    end
    return Z, B, U, obj, i
end

hcp (generic function with 1 method)
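For readers less familiar with the BLAS/LAPACK calls, the same three block updates can be sketched with ordinary matrix operations. This is a minimal, allocation-heavy sketch for checking the algebra on small problems, not a replacement for `hcp`; the names `hcp_objective`, `hcp_sweep`, and `hcp_reference` are illustrative and do not appear elsewhere in this notebook. The $\mathbf{U}$ step uses the right-hand side $\lambda_1 \mathbf{F}^T \mathbf{Z}$ obtained by setting $\partial L / \partial \mathbf{U}$ to zero.

```julia
using LinearAlgebra, Random

# Objective L(Z, B, U) as defined at the top of this notebook.
function hcp_objective(Y, F, Z, B, U, λ₁, λ₂, λ₃)
    return sum(abs2, Y - Z * B) + λ₁ * sum(abs2, Z - F * U) +
           λ₂ * sum(abs2, B) + λ₃ * sum(abs2, U)
end

# One sweep of the three block updates; each solves its linear system exactly.
function hcp_sweep(Y, F, B, U, λ₁, λ₂, λ₃)
    # Z (BBᵀ + λ₁I) = YBᵀ + λ₁FU, solved in transposed form (BBᵀ + λ₁I is symmetric)
    rhs = Y * B' + λ₁ * (F * U)
    Z = permutedims(cholesky(Symmetric(B * B' + λ₁ * I)) \ permutedims(rhs))
    # (ZᵀZ + λ₂I) B = ZᵀY
    B = cholesky(Symmetric(Z' * Z + λ₂ * I)) \ (Z' * Y)
    # (λ₁FᵀF + λ₃I) U = λ₁FᵀZ
    U = cholesky(Symmetric(λ₁ * (F' * F) + λ₃ * I)) \ (λ₁ * (F' * Z))
    return Z, B, U
end

# Run a fixed number of sweeps and record the objective after each one.
function hcp_reference(Y, F, k, λ₁, λ₂, λ₃; sweeps = 20)
    n, p = size(Y)
    m = size(F, 2)
    Z, B, U = zeros(n, k), rand(k, p), zeros(m, k)
    objs = [hcp_objective(Y, F, Z, B, U, λ₁, λ₂, λ₃)]
    for _ in 1:sweeps
        Z, B, U = hcp_sweep(Y, F, B, U, λ₁, λ₂, λ₃)
        push!(objs, hcp_objective(Y, F, Z, B, U, λ₁, λ₂, λ₃))
    end
    return Z, B, U, objs
end
```

Because each step minimizes the objective exactly over one block, the recorded objective values should be non-increasing, which makes a convenient sanity check on a small random problem.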

In [3]:
Random.seed!(1234)
Y = randn(1864, 25774)
F = randn(1864, 30)
k = 40
λ₁ = 5.0
λ₂ = 1.0
λ₃ = 1.0
@time Z, B, U, obj, i = hcp(Y, F, k, λ₁, λ₂, λ₃)

 19.534391 seconds (527.23 k allocations: 413.643 MiB, 0.16% gc time, 1.00% compilation time)



### Implement in R

The following code is similar to the code found in [this GitHub repo](https://github.com/mvaniterson/Rhcpp).

In [4]:
R"""
EstimateHCP <- function(F, Y, k, lambda1, lambda2, lambda3, iter = 1000) {
  # input:
  #      F: a matrix n x m of known covariates, where n = number of subjects and m = number of known covariates. 
  #      * must be standardized (columns have 0 mean and constant sd)
  #      Y: a matrix n x p of expression data
  #      * must be standardized (columns have 0 mean and constant sd)
  #      k: number of inferred hidden covariates (k is an integer)
  #      lambda1, lambda2, lambda3 are model parameters
  #      (optional) iter: number of iterations (default = 1000)
  # output:
  #      Z: matrix of hidden covariates, dimension: n x k
  #      B: effect size of hidden covariates, dimension: k x p
  #      o: value of objective function on consecutive iterations
  
  library(MASS)
  library(pracma)
  tol <- 1e-6
  U <- matrix(0, nrow = dim(F)[2], k)
  Z <- matrix(0, nrow = dim(F)[1], k)
  B <- matrix(runif(dim(Z)[2] * dim(Y)[2]), nrow = dim(Z)[2], ncol = dim(Y)[2])
  F <- as.matrix(F)
  n1 <- dim(F)[1]
  d1 <- dim(F)[2]
  n2 <- dim(Y)[1]
  d2 <- dim(Y)[2]
  
  if(n1 != n2) {
    stop("number of rows in F and Y must agree")
  }
  if (k < 1 || lambda1 < 1e-6 || lambda2 < 1e-6 || lambda3 < 1e-6) {
    stop("k must be a positive integer and lambda1, lambda2, lambda3 must be positive")
  }
  
  o <- vector(length = iter)
  for (ii in 1:iter) {
    o[ii] <- sum((Y - Z %*% B)^2) + lambda1 * sum((Z - F %*% U)^2) + lambda2 * (sum(B^2)) + lambda3 * (sum(U^2))
    Z <- (Y %*% t(B) + lambda1 * F %*% U) %*% ginv(B %*% t(B) + lambda1 * diag(dim(B)[1]))
    B <- mldivide(t(Z) %*% Z + lambda2 * diag(dim(Z)[2]), (t(Z) %*% Y))
    U <- mldivide(t(F) %*% F * lambda1 + lambda3 * diag(dim(U)[1]), lambda1 * t(F) %*% Z)
    if (ii > 1 && (abs(o[ii] - o[ii - 1]) / abs(o[ii] + 1)) < tol) {
      break
    }
  }
  dataout <- list(Z = Z, B = B, U = U, obj = o, i = ii)
  return(dataout)
}
"""


In [5]:
R"""
system.time(EstimateHCP(F = $F, Y = $Y, lambda1 = 5, lambda2 = 1, lambda3 = 1, k = 40, iter = 1000))
"""



RObject{RealSxp}
   user  system elapsed 
400.809   2.511 404.486 


Although the `Julia` and `R` implementations give highly similar results (minor differences arise from the random starting point for $\mathbf{B}$), the `Julia` version is roughly 20 times faster (about 20 seconds versus about 404 seconds) on a data matrix of size $1864 \times 25774$.
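One caveat when comparing two fits: the objective is unchanged if $\mathbf{Z}, \mathbf{B}, \mathbf{U}$ are replaced by $\mathbf{Z}\mathbf{R}, \mathbf{R}^T\mathbf{B}, \mathbf{U}\mathbf{R}$ for any orthogonal $\mathbf{R} \in \mathbb{R}^{k \times k}$, so individual entries of $\mathbf{Z}$ and $\mathbf{B}$ need not agree between runs even when the fits are equivalent. A safer comparison is on the product $\mathbf{Z}\mathbf{B}$. The sketch below illustrates this on toy data; `fitted_difference` is a hypothetical helper name, not something used elsewhere in this notebook.

```julia
using LinearAlgebra, Random

# Relative Frobenius-norm difference between the fitted low-rank parts Z*B of two fits.
fitted_difference(Z₁, B₁, Z₂, B₂) = norm(Z₁ * B₁ - Z₂ * B₂) / norm(Z₁ * B₁)

# Toy illustration: rotate one "fit" by a random orthogonal matrix R.
Random.seed!(2)
Z, B = randn(50, 4), randn(4, 30)
R = Matrix(qr(randn(4, 4)).Q)   # random 4 x 4 orthogonal matrix
Zr, Br = Z * R, R' * B          # entries differ from Z and B, but Zr * Br matches Z * B
```

Here `fitted_difference(Z, B, Zr, Br)` is zero up to rounding error, while `norm(Z - Zr)` is not.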

# Other methods for estimating latent factors

### [Surrogate variables](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994707/)

### [Combat](https://academic.oup.com/biostatistics/article/8/1/118/252073)

### [PEER](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2865505/)

[Other examples](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-290) ... TODO

In [6]:
versioninfo()

Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
