In [12]:
versioninfo()

Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)


In [13]:
using Pkg
Pkg.activate("../..")
Pkg.status()

[32m[1m  Activating[22m[39m new environment at `~/Desktop/Project.toml`


[32m[1m      Status[22m[39m `~/Desktop/Project.toml` (empty project)


# QR Decomposition

* We learned Cholesky decomposition as **one** approach for solving linear regression.

* Another approach for linear regression uses the QR decomposition.  
    **This is how the `lm()` function in R does linear regression.**  
    **This is also how Julia's (and MATLAB's) `\` works for rectangular (non-square) matrices.**  
    Note that for a rectangular matrix `A`, the Julia code `A \ b` returns a least squares solution.

In [14]:
using Random

Random.seed!(280) # seed

n, p = 5, 3
X = randn(n, p) # predictor matrix
y = randn(n)    # response vector

# backslash finds the (minimum L2 norm) least squares solution
X \ y
# X is not square, so it has no inverse.
# Backslash does not solve X * beta = y; it minimizes norm(y - X * beta)^2 over beta.

3-element Vector{Float64}:
 0.3795466676698624
 0.6508866456093487
 0.39225041956535506

We want to understand what the QR decomposition is and how it is used to solve least squares problems.

## Definitions

* Assume $\mathbf{X} \in \mathbb{R}^{n \times p}$ has full column rank. Necessarily $n \ge p$.

* **Full QR decomposition**:  
$$
    \mathbf{X} = \mathbf{Q} \mathbf{R},  
$$
where  
- $\mathbf{Q} \in \mathbb{R}^{n \times n}$, $\mathbf{Q}^T \mathbf{Q} = \mathbf{Q}\mathbf{Q}^T = \mathbf{I}_n$. In other words, $\mathbf{Q}$ is an orthogonal matrix.  
    - First $p$ columns of $\mathbf{Q}$ form an orthonormal basis of ${\cal R}(\mathbf{X})$ (**range** or column space of $\mathbf{X}$)      
    - Last $n-p$ columns of $\mathbf{Q}$ form an orthonormal basis of ${\cal N}(\mathbf{X}^T)$ (**null space** of $\mathbf{X}^T$)
    - Recall that $\mathcal{N}(\mathbf{X}^T)=\mathcal{R}(\mathbf{X})^{\perp}$ and $\mathcal{R}(\mathbf{X}) \oplus \mathcal{N}(\mathbf{X}^T) = \mathbb{R}^n$.
- $\mathbf{R} \in \mathbb{R}^{n \times p}$  is upper triangular with positive diagonal entries. 
    - The lower $(n-p)\times p$ block of $\mathbf{R}$ is $\mathbf{0}$ (why?).

* **Reduced QR decomposition**:
$$
    \mathbf{X} = \mathbf{Q}_1 \mathbf{R}_1,
$$
where
- $\mathbf{Q}_1 \in \mathbb{R}^{n \times p}$, $\mathbf{Q}_1^T \mathbf{Q}_1 = \mathbf{I}_p$. In other words, $\mathbf{Q}_1$ is a partially orthogonal matrix. Note $\mathbf{Q}_1\mathbf{Q}_1^T \neq \mathbf{I}_n$.
- $\mathbf{R}_1 \in \mathbb{R}^{p \times p}$  is an upper triangular matrix with positive diagonal entries.
- If $\mathbf{Q}$ and $\mathbf{R}$ are from the full QR, then $\mathbf{Q} = [\,\mathbf{Q}_1 \; | \; \mathbf{Q}_0\,]$ and $\mathbf{R}=\begin{bmatrix} \mathbf{R}_1 \\ \mathbf{0} \end{bmatrix}$, so that $\mathbf{Q}\mathbf{R}=\mathbf{Q}_1\mathbf{R}_1+\mathbf{Q}_0 \mathbf{0}=\mathbf{Q}_1\mathbf{R}_1$.

* Given QR decomposition $\mathbf{X} = \mathbf{Q} \mathbf{R}$,
    $$
    \mathbf{X}^T \mathbf{X} = \mathbf{R}^T \mathbf{Q}^T \mathbf{Q} \mathbf{R} = \mathbf{R}^T \mathbf{R} = \mathbf{R}_1^T \mathbf{R}_1.
    $$
    - Once we have a (reduced) QR decomposition of $\mathbf{X}$, we automatically have the Cholesky decomposition of the *Gram matrix* $\mathbf{X}^T \mathbf{X}$, with Cholesky factor $\mathbf{L} = \mathbf{R}_1^T$ ($\because \mathbf{R}_1$ is upper triangular with positive diagonal entries).
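
The identity $\mathbf{X}^T\mathbf{X} = \mathbf{R}_1^T\mathbf{R}_1$ can be checked numerically. The following is a minimal sketch; the variable names (`F`, `R1`, `U`) and the sign-normalization step are illustrative assumptions. Julia's `qr` does not force the diagonal of `R` to be positive, so the rows of `R1` are sign-flipped before comparing with the Cholesky factor.

```julia
using LinearAlgebra, Random

Random.seed!(280)
X = randn(5, 3)

F = qr(X)
R1 = Matrix(F.R)                    # 3×3 upper triangular factor
# qr may return negative diagonal entries; flip row signs so that diag(R1) > 0
R1 = Diagonal(sign.(diag(R1))) * R1

U = cholesky(Symmetric(X'X)).U      # upper Cholesky factor of the Gram matrix

@show norm(R1'R1 - X'X)             # roundoff level
@show norm(R1 - U)                  # roundoff level
```

Flipping the sign of a row of $\mathbf{R}_1$ (together with the matching column of $\mathbf{Q}_1$) leaves the product $\mathbf{Q}_1\mathbf{R}_1$ unchanged, which is why the comparison is legitimate.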

### Application: least squares

* Normal equation
$$
    \mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{y}
$$
is equivalently written with reduced QR as
$$
    \mathbf{R}_1^T\mathbf{R}_1\beta = \mathbf{R}_1^T\mathbf{Q}_1^T\mathbf{y}
$$

* Since $\mathbf{R}_1$ is invertible (given that $\mathbf{X}$ has full column rank), we only need to solve the triangular system
$$
    \mathbf{R}_1\beta = \mathbf{Q}_1^T\mathbf{y}
$$
Multiplication $\mathbf{Q}_1^T \mathbf{y}$ is done implicitly (see below). 

* This method is numerically more stable than directly solving the normal equation, since $\kappa(\mathbf{X}^T\mathbf{X}) = \kappa(\mathbf{X})^2$! (The largest and smallest eigenvalues of $\mathbf{X}^T\mathbf{X}$ are exactly the squares of the largest and smallest singular values of $\mathbf{X}$.)

* With the QR decomposition we only need to solve one triangular system, not two. The problem is also better conditioned, so the method is numerically more stable.

* In case we need standard errors, compute the inverse of $\mathbf{R}_1^T \mathbf{R}_1$. This involves solving triangular systems.
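
The steps above can be sketched as follows. This is a minimal illustration, not how `\` is implemented internally: the names `Q1`, `R1`, `β_qr`, and `β_ne` are assumptions, and the thin factor `Q1` is formed explicitly only to make the triangular system visible.

```julia
using LinearAlgebra, Random

Random.seed!(280)
n, p = 5, 3
X = randn(n, p)
y = randn(n)

F = qr(X)
Q1 = Matrix(F.Q)                       # thin n×p factor (explicit only for illustration)
R1 = F.R                               # p×p upper triangular factor

β_qr = UpperTriangular(R1) \ (Q1'y)    # one triangular solve
β_ne = (X'X) \ (X'y)                   # normal equations, for comparison

@show β_qr
@show norm(β_qr - β_ne)
@show cond(X'X), cond(X)^2             # the two agree up to roundoff
```

For a well-conditioned `X` such as this one, both routes give the same answer to roundoff; the difference shows up when `cond(X)` is large.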

## Gram-Schmidt procedure

* Wait! Does $\mathbf{X}$ always have a QR decomposition? (Assuming $\mathbf{X}$ is full column rank)
    - Yes. It is equivalent to the Gram-Schmidt procedure for basis orthonormalization. 
    - LU decomposition $\Leftrightarrow$ Gaussian elimination. $\quad$ QR decomposition $\Leftrightarrow$ Gram-Schmidt procedure.

<img src="https://upload.wikimedia.org/wikipedia/commons/e/e7/Jørgen_Pedersen_Gram_by_Johannes_Hauerslev.jpg" width="200" align="center"/>

[Jørgen Pedersen Gram, 1850-1916](https://en.wikipedia.org/wiki/Jørgen_Pedersen_Gram)

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Erhard_Schmidt.jpg/220px-Erhard_Schmidt.jpg" width="200" align="center"/>

[Erhard Schmidt, 1876-1959](https://en.wikipedia.org/wiki/Erhard_Schmidt)

* Assume $\mathbf{X} = [\mathbf{x}_1 | \dotsb | \mathbf{x}_p] \in \mathbb{R}^{n \times p}$ has full column rank. That is,  $\mathbf{x}_1,\ldots,\mathbf{x}_p$ are *linearly independent*.

* The Gram-Schmidt (GS) procedure produces nested orthonormal basis vectors $\{\mathbf{q}_1, \dotsc, \mathbf{q}_p\}$ that span $\mathcal{R}(\mathbf{X})$, i.e.,
$$
\begin{split}
    \text{span}(\{\mathbf{x}_1\}) &= \text{span}(\{\mathbf{q}_1\}) \\
    \text{span}(\{\mathbf{x}_1, \mathbf{x}_2\}) &= \text{span}(\{\mathbf{q}_1, \mathbf{q}_2\}) \\
    & \vdots \\
    \text{span}(\{\mathbf{x}_1, \mathbf{x}_2, \dotsc, \mathbf{x}_p\}) &= \text{span}(\{\mathbf{q}_1, \mathbf{q}_2, \dotsc, \mathbf{q}_p\}) 
\end{split}
$$
and $\langle \mathbf{q}_i, \mathbf{q}_j \rangle = \delta_{ij}=I(i=j)$.

* The algorithm:
0. Initialize $\mathbf{q}_1 = \mathbf{x}_1 / \|\mathbf{x}_1\|_2$
0. For $k=2, \ldots, p$, 
$$
\begin{align*}
	\mathbf{v}_k &= \mathbf{x}_k - P_{\text{span}(\{\mathbf{q}_1,\ldots,\mathbf{q}_{k-1}\})}(\mathbf{x}_k) = \mathbf{x}_k -  \sum_{j=1}^{k-1} \langle \mathbf{q}_j, \mathbf{x}_k \rangle \cdot \mathbf{q}_j \\
	\mathbf{q}_k &= \mathbf{v}_k / \|\mathbf{v}_k\|_2
\end{align*}
$$

### GS conducts reduced QR

* $\mathbf{Q} = [\mathbf{q}_1 | \dotsb | \mathbf{q}_p]$. Obviously $\mathbf{Q}^T \mathbf{Q} = \mathbf{I}_p$. This $\mathbf{Q}$ is in fact equal to $\mathbf{Q}_1$ above.

* Where is $\mathbf{R}$? 
- Let $r_{jk} = \langle \mathbf{q}_j, \mathbf{x}_k \rangle$ for $j < k$, and $r_{kk} = \|\mathbf{v}_k\|_2$.
- Re-write the above expression:
$$
    r_{kk} \mathbf{q}_k = \mathbf{v}_k = \mathbf{x}_k -  \sum_{j=1}^{k-1} r_{jk} \cdot \mathbf{q}_j
$$ 
or
$$
    \mathbf{x}_k = r_{kk} \mathbf{q}_k +  \sum_{j=1}^{k-1} r_{jk} \cdot \mathbf{q}_j=[\mathbf{q}_1\, |\,\cdots\,|\mathbf{q}_p]\begin{bmatrix} r_{1k} \\ \vdots \\ r_{kk} \\ - \\ \mathbf{0} \end{bmatrix} = \mathbf{Q}\mathbf{r}_k  
$$  
- If we let $r_{jk} = 0$ for $j > k$, then $\mathbf{R}=(r_{jk})$ is upper triangular and
$$
    \mathbf{X} = \mathbf{Q}\mathbf{R}
$$
 
<img src="https://dsc-spidal.github.io/harp/img/daalAlgosNew/HarpIllustrations_QR.png" width="800" align="center"/> 
  
Source: <https://dsc-spidal.github.io/harp/docs/harpdaal/algorithms/>  
  

### Classical Gram-Schmidt

In [15]:
using LinearAlgebra
function cgs(X::Matrix{T}) where T<:AbstractFloat
    n, p = size(X)
    Q = Matrix{T}(undef, n, p)
    R = zeros(T, p, p)
    for j=1:p
        Q[:, j] .= X[:, j]
        for i=1:j-1  # i < j 
            R[i, j] = dot(Q[:, i], X[:, j])     
            Q[:, j] .-= R[i, j] * Q[:, i]       # calculating v_j 
        end
        R[j, j] = norm(Q[:, j])             
        Q[:, j] /= R[j, j]          # normalizing v_j to obtain q_j
    end
    Q, R
end

cgs (generic function with 1 method)

* CGS is numerically *unstable* (we lose orthogonality of matrix $\mathbf{Q}$ due to roundoff errors) when columns of $\mathbf{X}$ are almost collinear ( $\mathbf{X}$ is nearly singular ).

In [16]:
e = eps(Float32)
A = [1f0 1f0 1f0; e 0 0; 0 e 0; 0 0 e]
# A has full column rank mathematically, but its columns are almost collinear.

4×3 Matrix{Float32}:
 1.0         1.0         1.0
 1.19209f-7  0.0         0.0
 0.0         1.19209f-7  0.0
 0.0         0.0         1.19209f-7

In [17]:
Q, R = cgs(A)
Q

# 0.707107 is 1/sqrt(2)

4×3 Matrix{Float32}:
 1.0          0.0        0.0
 1.19209f-7  -0.707107  -0.707107
 0.0          0.707107   0.0
 0.0          0.0        0.707107

In [18]:
transpose(Q)*Q

3×3 Matrix{Float32}:
  1.0         -8.42937f-8  -8.42937f-8
 -8.42937f-8   1.0          0.5
 -8.42937f-8   0.5          1.0

* `Q` is hardly orthogonal.
* Where exactly does the problem occur? (HW)

### Modified Gram-Schmidt

* The algorithm:
0. Initialize $\mathbf{q}_1 = \mathbf{x}_1 / \|\mathbf{x}_1\|_2$
0. For $k=2, \ldots, p$, 
$$
\begin{align*}
	\mathbf{v}_k &= \mathbf{x}_k - P_{\text{span}(\{\mathbf{q}_1,\ldots,\mathbf{q}_{k-1}\})}(\mathbf{x}_k) = \mathbf{x}_k -  \sum_{j=1}^{k-1} \langle \mathbf{q}_j, \mathbf{x}_k \rangle \cdot \mathbf{q}_j \\
    &=  \mathbf{x}_k -  \sum_{j=1}^{k-1} \left\langle \mathbf{q}_j, \mathbf{x}_k - \sum_{l=1}^{j-1}\langle \mathbf{q}_l, \mathbf{x}_k \rangle \mathbf{q}_l \right\rangle \cdot \mathbf{q}_j \\
	\mathbf{q}_k &= \mathbf{v}_k / \|\mathbf{v}_k\|_2
\end{align*}
$$
The equation on the second line holds true because $\mathbf{q}_j$ is orthogonal to $\mathbf{q}_1,\cdots, \mathbf{q}_{j-1}$

In [20]:
function mgs!(X::Matrix{T}) where T<:AbstractFloat
    n, p = size(X)
    R = zeros(T, p, p)
    for j=1:p
        for i=1:j-1 # i < j
            R[i, j] = dot(X[:, i], X[:, j])     # Note that X[:, j] mutates as iteration of "i = 1 : j-1" proceeds. 
            X[:, j] -= R[i, j] * X[:, i]        
        end
        R[j, j] = norm(X[:, j])
        X[:, j] /= R[j, j]
    end
    X, R
end

mgs! (generic function with 1 method)

<img src=./images/cgsVSmgs1.jpg width="800" align="center">  
<img src=./images/cgsVSmgs2.jpg width="800" align="center">


* $\mathbf{X}$ is overwritten by $\mathbf{Q}$ and $\mathbf{R}$ is stored in a separate array.

In [21]:
Q, R = mgs!(copy(A))
Q

4×3 Matrix{Float32}:
 1.0          0.0        0.0
 1.19209f-7  -0.707107  -0.408248
 0.0          0.707107  -0.408248
 0.0          0.0        0.816497

In [22]:
transpose(Q)*Q

3×3 Matrix{Float32}:
  1.0         -8.42937f-8  -4.8667f-8
 -8.42937f-8   1.0          3.14007f-8
 -4.8667f-8    3.14007f-8   1.0

* So MGS is more stable than CGS. However, even MGS is not completely immune to instability.

In [23]:
B = [0.7f0 0.7071068f0; 0.7000001f0 0.7071068f0]
# Again, columns of B are almost collinear

2×2 Matrix{Float32}:
 0.7  0.707107
 0.7  0.707107

In [24]:
Q, R = mgs!(copy(B))
Q

2×2 Matrix{Float32}:
 0.707107  1.0
 0.707107  0.0

In [25]:
transpose(Q)*Q

2×2 Matrix{Float32}:
 1.0       0.707107
 0.707107  1.0

* `Q` is hardly orthogonal.
* Where exactly does the problem occur? (HW)

* Computational cost of CGS and MGS is $\sum_{k=1}^p 4n(k-1) = 2np(p-1) \approx 2np^2$ flops. (For the $k$-th column of $\mathbf{X}$ there are $k-1$ inner products of $n$-dimensional vectors and $k-1$ vector updates $\mathbf{v} \gets \mathbf{v} - r\mathbf{q}$, each costing $2n$ flops.)

* There are 3 algorithms to compute QR: (modified) Gram-Schmidt, Householder transform, (fast) Givens transform.

    In particular, the **Householder transform** for QR is implemented in LAPACK and thus used in R and Julia.

## QR by Householder transform

<img src="http://www-history.mcs.st-andrews.ac.uk/BigPictures/Householder_2.jpeg" width="200" align="center"/>

[Alston Scott Householder (1904-1993)](https://en.wikipedia.org/wiki/Alston_Scott_Householder)

* **This is the algorithm for solving linear regression in R**.

* Assume again $\mathbf{X} = [\mathbf{x}_1 | \dotsb | \mathbf{x}_p] \in \mathbb{R}^{n \times p}$ has full column rank.

* Gram-Schmidt can be understood as:
$$
    \mathbf{X}\mathbf{R}_{1} \mathbf{R}_2 \cdots  \mathbf{R}_p = \mathbf{Q}_1
$$
where the $\mathbf{R}_j$ are a sequence of upper triangular matrices. Note that $\mathbf{R}_j$ has nonzero off-diagonal entries only in the $j$-th column.  
The problem is that round-off errors from the floating-point operations in multiplying by those $\mathbf{R}_j$'s may result in a numerically non-orthogonal $\mathbf{Q}_1$.

* Householder QR does
$$
    \mathbf{H}_{p} \cdots \mathbf{H}_2 \mathbf{H}_1 \mathbf{X} = \begin{pmatrix} \mathbf{R}_1 \\ \mathbf{0} \end{pmatrix},
$$
where $\mathbf{H}_j \in \mathbb{R}^{n \times n}$ are a sequence of Householder transformation matrices (which are orthogonal and symmetric).

It yields the **full QR** where $\mathbf{Q} = \mathbf{H}_1 \cdots \mathbf{H}_p \in \mathbb{R}^{n \times n}$. Recall that CGS/MGS only produce the **reduced QR** decomposition.  
Note that numerically, orthogonality is quite well preserved when orthogonal matrices are multiplied. ( Product of orthogonal matrices is orthogonal mathematically. )

* Gram-Schmidt QR vs Householder QR
    - Target is orthogonal matrix $Q$. During the process, orthogonality may be lost. $\leftrightarrow$ Target is upper triangular matrix $R$. Keep multiplying orthogonal matrices during the process.
    - Result is the reduced QR decomposition. $\leftrightarrow$ Result is the full QR decomposition.

* For any two distinct vectors $\mathbf{v}, \mathbf{w} \in \mathbb{R}^{n}$ with $\|\mathbf{v}\|_2 = \|\mathbf{w}\|_2$, we can construct a **Householder matrix** (or **Householder reflector**)
$$
    \mathbf{H} = \mathbf{I}_n - 2 \mathbf{u} \mathbf{u}^T, \quad \mathbf{u} = \frac{1}{\|\mathbf{v} - \mathbf{w}\|_2} (\mathbf{v} - \mathbf{w}),
$$
that transforms $\mathbf{v}$ to $\mathbf{w}$:
$$
	\mathbf{H} \mathbf{v} = \mathbf{w}.
$$
$\mathbf{H}$ is symmetric and orthogonal. Calculation of the Householder vector $\mathbf{u}$ costs $4n$ flops for general $\mathbf{v}$ and $\mathbf{w}$.

<img src="https://www.cs.utexas.edu/users/flame/laff/alaff-beta/images/Chapter03/reflector.png" width="400" align="center"/>

Source: <https://www.cs.utexas.edu/users/flame/laff/alaff-beta/images/Chapter03/reflector.png>  

In the figure, $u^H$ denotes the conjugate (Hermitian) transpose of $u$, which equals $u^T$ for real vectors.  
Identify $v$ with $x$ and $w$ with $(I-2uu^T)x$ in the figure above. Then $u$ points in the direction of $v-w$ and is normalized to unit length.
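
A small numerical sketch of the reflector construction follows. The vectors `v`, `w`, and `x` are arbitrary illustrative choices, and `H` is formed explicitly only to verify its properties.

```julia
using LinearAlgebra

v = [3.0, 4.0, 0.0]
w = [norm(v), 0.0, 0.0]            # same 2-norm as v

u = (v - w) / norm(v - w)          # unit Householder vector
H = I - 2 * u * u'                 # formed explicitly only for illustration

@show H * v                        # ≈ w
@show norm(H - H')                 # symmetric
@show norm(H'H - I)                # orthogonal

# In practice H is applied without being formed: Hx = x - 2u(u'x)
x = [1.0, 2.0, 3.0]
Hx = x - 2 * dot(u, x) * u
@show norm(Hx - H * x)
```

The last lines show the implicit application, which needs only a dot product and a scaled vector subtraction.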

* Now choose $\mathbf{H}_1$ so that
$$
	\mathbf{H}_1 \mathbf{x}_1 = \begin{pmatrix} \|\mathbf{x}_{1}\|_2 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
$$
That is, $\mathbf{v} = \mathbf{x}_1$ and $\mathbf{w} = \|\mathbf{x}_1\|_2\mathbf{e}_1$. (The two vectors have the same norm.)

* Left-multiplying $\mathbf{H}_1$ zeros out the first column of $\mathbf{X}$ below (1, 1).
$$
\begin{align*}
\mathbf{H}_1\mathbf{X} &= 
\begin{bmatrix} 
\times & \times & \times & \times \\ 
0 & \mathbf{\times} & \mathbf{\times} & \mathbf{\times}  \\
0 & \mathbf{\times} & \mathbf{\times} & \mathbf{\times}  \\
0 & \mathbf{\times} & \mathbf{\times} & \mathbf{\times}  \\
0 & \mathbf{\times} & \mathbf{\times} & \mathbf{\times} 
\end{bmatrix} \\
&=
\begin{bmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
\mathbf{0} & \tilde{\mathbf{x}}_2 & \tilde{\mathbf{x}}_3 & \tilde{\mathbf{x}}_4
\end{bmatrix}
\end{align*}
$$


* Take $\mathbf{H}_2$ to zero the second column below the diagonal, i.e. choose $\tilde{\mathbf H}_2\in \mathbb{R}^{(n-1)\times (n-1)}$ and $\mathbf{H}_2\in \mathbb{R}^{n\times n}$ such that
$$
\tilde{\mathbf H}_2 \tilde{\bf x}_2 = \begin{bmatrix}  \|\tilde{\bf x}_2\|_2  \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$

$$
\mathbf{H}_2 = \begin{bmatrix} 1 & \mathbf{0}^T \\ \mathbf{0} & \tilde{\bf H}_2 \end{bmatrix}
$$

$$
\mathbf{H}_2\mathbf{H}_1\mathbf{X} = 
\begin{bmatrix} 
\times & \times & \times & \times \\ 
0 & \boldsymbol{\times} & \boldsymbol{\times} & \boldsymbol{\times}  \\
0 & \mathbf{0} & \boldsymbol{\times} & \boldsymbol{\times}  \\
0 & \mathbf{0} & \boldsymbol{\times} & \boldsymbol{\times}  \\
0 & \mathbf{0} & \boldsymbol{\times} & \boldsymbol{\times} 
\end{bmatrix} 
$$
The first row does not change. The first column also does not change. The second column is changed as we intended. The other columns are also changed but we will deal with them in the next steps.

* In general, choose the $j$-th Householder transform $\mathbf{H}_j = \mathbf{I}_n - 2 \mathbf{u}_j \mathbf{u}_j^T$ for $j=1,2,\cdots, p$, where 
$$
     \mathbf{u}_j = \begin{bmatrix} \mathbf{0}_{j-1} \\ {\tilde u}_j \end{bmatrix}, \quad {\tilde u}_j \in \mathbb{R}^{n-j+1},
$$
to zero the $j$-th column below diagonal. (Note that for $j=1$, the first element need not be zero.)  
$\mathbf{H}_j$ takes the form
$$
	\mathbf{H}_j = \begin{bmatrix}
	\mathbf{I}_{j-1} & \\
	& \mathbf{I}_{n-j+1} - 2 {\tilde u}_j {\tilde u}_j^T
	\end{bmatrix} = \begin{bmatrix}
	\mathbf{I}_{j-1} & \\
	& {\tilde H}_{j}
	\end{bmatrix}.
$$

* Applying a Householder transform $\mathbf{H} = \mathbf{I} - 2 \mathbf{u} \mathbf{u}^T$ to a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$
$$
	\mathbf{H} \mathbf{X} = \mathbf{X} - 2 \mathbf{u} (\mathbf{u}^T \mathbf{X})
$$
costs $4np$ flops. ( Inner product $\mathbf{u}^T\mathbf{X}$ : $2np$ flops / Outer product $2\mathbf{u}(\mathbf{u}^T\mathbf{X})$ : $np$ flops / Subtraction $\mathbf{X}-2\mathbf{u}(\mathbf{u}^T\mathbf{X})$ : $np$ flops )  
Note that naive matrix multiplication $\mathbf{HX}$ would cost $2n^2p$ flops. 
 
**Householder updates never entail explicit formation of the Householder matrices.** We don't use matrix-matrix multiplication. Instead, we take advantage of the special structure of the Householder transform.

* Note that applying ${\tilde H}_j$ to $\mathbf{X}$ only needs $4(n-j+1)(p-j+1)$ flops. As $j$ goes from $1$ to $p$, the dimension of $\tilde H_j$ gets smaller and the flop count decreases.

### Algorithm

```Julia
for j = 1:p
    # House! (not defined here) returns the unit Householder vector u for the
    # j-th column of the current X; the top j-1 entries are not needed
    u = House!(X[j:n, j])
    # apply H = I - 2uu' to the trailing block once: X ==> HX = X - 2u(u'X)
    X[j:n, j:p] .-= 2u * (u' * X[j:n, j:p])
end
```

* The process is done in place. The upper triangular part of $\mathbf{X}$ is overwritten by $\mathbf{R}_1$, and the essential Householder vectors (rescaled so that the first entry $\tilde u_{j1}$ equals 1) are stored in $\mathbf{X}[(j+1):n,j]$. (This is possible because only the direction of $\tilde u_j$ matters: the leading 1 need not be stored, and the unit-norm vector can be recovered by renormalizing.)

* At $j$-th stage  
 1. computing the Householder vector ${\tilde u}_j$ costs $3(n-j+1)$ flops  
 2. applying the Householder transform ${\tilde H}_j$ to the $\mathbf{X}[j:n, j:p]$ block costs $4(n-j+1)(p-j+1)$ flops  
     
* In total we need $\sum_{j=1}^p [3(n-j+1) + 4(n-j+1)(p-j+1)] \approx 2np^2 - \frac 23 p^3$ flops.

* Where is $\mathbf{Q}$? 
    - $\mathbf{Q} = \mathbf{H}_1 \cdots \mathbf{H}_p$. In some applications, it's necessary to form the orthogonal matrix $\mathbf{Q}$. 

    Accumulating $\mathbf{Q}$ costs another $2np^2 - \frac 23 p^3$ flops.

* When computing $\mathbf{Q}^T \mathbf{v}$ or $\mathbf{Q} \mathbf{v}$, as in some applications (e.g., solving the least squares problem using QR: $\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{y} \Rightarrow \mathbf{R}_1\beta = \mathbf{Q}_1^T\mathbf{y}$), there is no need to form $\mathbf{Q}$. Simply apply the Householder transforms successively to the vector $\mathbf{v}$. (HW)  
$\mathbf{Q}\mathbf{v}=\mathbf{H}_1\cdots\mathbf{H}_p\mathbf{v}$ and $\mathbf{Q}^T\mathbf{v}=\mathbf{H}_p\cdots\mathbf{H}_1\mathbf{v}$. Each step is just a vector dot product, a scalar multiplication, and a vector subtraction.

* Computational cost of Householder QR for linear regression: $2n p^2 - \frac 23 p^3$ (regression coefficients and $\hat \sigma^2$) or more (fitted values, s.e., ...).
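
The Householder procedure of this section can be sketched as a self-contained, unoptimized routine. The function names `house`, `householder_qr`, and `apply_Qt` are hypothetical, the sketch copies `X` instead of working in place, and it keeps the unit vectors $\tilde u_j$ in a separate list rather than in the lower triangle. The sign in `house` is chosen to avoid cancellation, so the diagonal of the computed $\mathbf{R}_1$ may be negative, unlike the convention used in the text.

```julia
using LinearAlgebra, Random

# Hypothetical helper: unit vector u such that (I - 2uu')x is a multiple of e₁.
function house(x::AbstractVector)
    u = float.(x)                                  # copy
    u[1] += (x[1] >= 0 ? 1 : -1) * norm(x)         # sign chosen to avoid cancellation
    u ./ norm(u)
end

# Returns R1 (p×p) and the Householder vectors; Q is never formed.
function householder_qr(X::AbstractMatrix)
    A = float.(X)                                  # copy
    n, p = size(A)
    us = Vector{Vector{eltype(A)}}(undef, p)
    for j in 1:p
        u = house(A[j:n, j])
        A[j:n, j:p] .-= 2 .* u .* (u' * A[j:n, j:p])   # apply H̃ⱼ once to the trailing block
        us[j] = u
    end
    triu(A[1:p, :]), us
end

# Q'y = H_p ⋯ H_1 y, applied one reflector at a time
function apply_Qt(us, y::AbstractVector)
    z = float.(y)
    n = length(z)
    for (j, u) in enumerate(us)
        z[j:n] .-= 2 * dot(u, z[j:n]) .* u
    end
    z
end

Random.seed!(280)
X, y = randn(5, 3), randn(5)
R1, us = householder_qr(X)
β = UpperTriangular(R1) \ apply_Qt(us, y)[1:3]    # solve R1 β = (Q'y)[1:p]
@show norm(β - X \ y)                             # roundoff level
```

Only the first $p$ entries of $\mathbf{Q}^T\mathbf{y}$ enter the triangular solve; the remaining $n-p$ entries carry the residual.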

### Householder QR with column pivoting

Consider a rank deficient $\mathbf{X}$ (the columns of $\mathbf{X}$ are linearly dependent). We want to determine the rank of $\mathbf{X}$.

* At the $j$-th stage, swap the column `X[:, j]` with `X[:, k]`, where `k` indexes the column of the trailing block `X[j:n, j:p]` with the largest $\ell_2$ norm (the pivot column). If the largest $\ell_2$ norm is 0, the algorithm stops, ending with
$$
\mathbf{X} \mathbf{P} = \mathbf{Q} \begin{bmatrix} \mathbf{R}_{11} & \mathbf{R}_{12} \\ \mathbf{0}_{(n-r) \times r} & \mathbf{0}_{(n-r) \times (p-r)} \end{bmatrix},
$$
where $\mathbf{P} \in \mathbb{R}^{p \times p}$ is a permutation matrix and $r$ is the rank of $\mathbf{X}$. QR with column pivoting is rank revealing.
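
The rank-revealing property can be seen on a small example. This is a sketch under assumptions: the matrix is constructed to have rank 2, and the tolerance `tol` is a deliberately generous illustrative choice, since in floating point the trailing diagonal entries of $\mathbf{R}$ are tiny rather than exactly zero. The call `qr(X, Val(true))` matches the Julia version used in this note; newer Julia versions spell it `qr(X, ColumnNorm())`.

```julia
using LinearAlgebra

# 6×4 matrix of rank 2: columns 3 and 4 are linear combinations of columns 1 and 2
A = [1.0 0.0; 0.0 1.0; 1.0 1.0; 2.0 -1.0; 0.0 3.0; 1.0 2.0]
X = hcat(A, A[:, 1] + A[:, 2], 2A[:, 1] - A[:, 2])

F = qr(X, Val(true))               # QR with column pivoting
d = abs.(diag(F.R))                # magnitudes of the diagonal of R

tol = sqrt(eps(Float64)) * d[1]    # assumed tolerance, for illustration only
r = count(>(tol), d)               # numerical rank estimate

@show d
@show r, rank(X)
@show norm(Matrix(F.Q) * F.R - X[:, F.p])   # X P = Q R up to roundoff
```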

### Implementation

* Julia functions: [`qr`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.qr), [`qr!`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.qr!), or call LAPACK wrapper functions [`geqrf!`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.LAPACK.geqrf!) and [`geqp3!`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.LAPACK.geqp3!)

* R function: [`qr`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/qr.html). Wraps LAPACK routine [`dgeqp3`](http://www.netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga1b0500f49e03d2771b797c6e88adabbb.html) (with `LAPACK=TRUE`; the default uses LINPACK, the predecessor of LAPACK).

In [27]:
X

5×3 Matrix{Float64}:
  0.126238  -1.12783   -0.88267
 -2.34688    1.14786    1.71384
  1.91661   -1.35471   -0.733952
 -0.239209  -0.237065   1.08182
 -0.578451  -0.680994  -0.170406

In [28]:
y

5-element Vector{Float64}:
 -1.794259674143309
  1.0913793110025305
  0.4266277597108536
 -0.6244337204329091
  0.03204861737738283

In [29]:
X \ y # least squares solution by QR

3-element Vector{Float64}:
 0.3795466676698624
 0.6508866456093487
 0.39225041956535506

In [30]:
# same as
qr(X) \ y

3-element Vector{Float64}:
 0.37954666766986195
 0.6508866456093481
 0.3922504195653549

In [31]:
cholesky(X'X) \ (X'y) # least squares solution by Cholesky

3-element Vector{Float64}:
 0.37954666766986256
 0.6508866456093485
 0.3922504195653555

In [32]:
# QR factorization with column pivoting
xqr = qr(X, Val(true))

QRPivoted{Float64, Matrix{Float64}}
Q factor:
5×5 LinearAlgebra.QRPackedQ{Float64, Matrix{Float64}}:
 -0.0407665  -0.692007    0.318693   0.185526  -0.619257
  0.757887   -0.0465712  -0.260086  -0.522259  -0.288166
 -0.618938   -0.233814   -0.404293  -0.624592  -0.0931611
  0.0772486  -0.235405   -0.808135   0.53435    0.00216687
  0.186801   -0.639431    0.119392  -0.131075   0.724429
R factor:
3×3 Matrix{Float64}:
 -3.09661  1.60888   1.84089
  0.0      1.53501   0.556903
  0.0      0.0      -1.32492
permutation:
3-element Vector{Int64}:
 1
 2
 3

In [33]:
xqr \ y # least squares solution

3-element Vector{Float64}:
 0.3795466676698624
 0.6508866456093487
 0.39225041956535506

In [34]:
# thin Q matrix multiplication (a sequence of Householder transforms)
norm(xqr.Q * xqr.R - X[:, xqr.p]) # recovers X (with columns permuted)

1.071020016095422e-15

## QR by Givens rotation

* Householder transform $\mathbf{H}_j$ introduces batch of zeros into a vector.

* **Givens transform** (aka **Givens rotation**, **Jacobi rotation**, **plane rotation**) selectively zeros one element of a vector.

* Each step creates only one zero. In exchange, each individual step is cheap.

* Overall QR by Givens rotation is less efficient than the Householder method, but is better suited for matrices with structured patterns of nonzero elements.

* **Givens/Jacobi rotations**: 
$$
	\mathbf{G}(i,k,\theta) = \begin{bmatrix} 
	1 & & 0 & & 0 & & 0 \\
	\vdots & \ddots & \vdots & & \vdots & & \vdots \\
	0 & & c & & s & & 0 \\ 
	\vdots & & \vdots & \ddots & \vdots & & \vdots \\
	0 & & - s & & c & & 0 \\
	\vdots & & \vdots & & \vdots & \ddots & \vdots \\
	0 & & 0 & & 0 & & 1 \end{bmatrix},
$$
where $c = \cos(\theta)$ and $s = \sin(\theta)$. $\mathbf{G}(i,k,\theta)$ is orthogonal.  
It is the transformation that rotates the plane spanned by the $i$-th and $k$-th coordinate axes of Euclidean space by the angle $\theta$.

* Pre-multiplication by $\mathbf{G}(i,k,\theta)^T$ rotates counterclockwise $\theta$ radians in the $(i,k)$ coordinate plane. If $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} = \mathbf{G}(i,k,\theta)^T \mathbf{x}$, then
$$
	y_j = \begin{cases}
	cx_i - s x_k & j = i \\
	sx_i + cx_k & j = k \\
	x_j & j \ne i, k
	\end{cases}.
$$
Evidently, if we choose $\tan(\theta) = -x_k / x_i$, or equivalently,
$$
\begin{align*}
	c = \frac{x_i}{\sqrt{x_i^2 + x_k^2}}, \quad s = \frac{-x_k}{\sqrt{x_i^2 + x_k^2}},
\end{align*}
$$
then $y_k=0$.

* Pre-applying the Givens transform $\mathbf{G}(i,k,\theta)^T \in \mathbb{R}^{n \times n}$ to a matrix $\mathbf{A} \in \mathbb{R}^{n \times m}$ only affects two rows of $\mathbf{A}$:
$$
	\mathbf{A}([i, k], :) \gets \begin{bmatrix} c & s \\ -s & c \end{bmatrix}^T \mathbf{A}([i, k], :),
$$
costing $6m$ flops.

* Post-applying the Givens transform $\mathbf{G}(i,k,\theta) \in \mathbb{R}^{m \times m}$ to a matrix $\mathbf{A} \in \mathbb{R}^{n \times m}$ only affects two columns of $\mathbf{A}$:
$$
	\mathbf{A}(:, [i,k]) \gets \mathbf{A}(:, [i,k]) \begin{bmatrix} c & s \\ -s & c \end{bmatrix},
$$
costing $6n$ flops.

* QR by Givens: $\mathbf{G}_t^T \cdots \mathbf{G}_1^T \mathbf{X} =  \begin{bmatrix} \mathbf{R}_1 \\ \mathbf{0} \end{bmatrix}$.

<img src="./images/QR_by_Givens.png" width="600" align="center"/>

* Zeros in $\mathbf{X}$ can also be introduced row-by-row.

* If $\mathbf{X} \in \mathbb{R}^{n \times p}$, the total cost is $3np^2 - p^3$ flops and $O(np)$ square roots.

* Note each Givens transform can be summarized by a single number, which is stored in the zeroed entry of $\mathbf{X}$.
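
QR by Givens rotations can be sketched column by column as below, using the sign convention for $c$ and $s$ given above. The function names `givens_cs` and `givens_qr` are hypothetical. For checking purposes the sketch accumulates $\mathbf{Q}$ explicitly, instead of storing each rotation as a single number in the zeroed entry.

```julia
using LinearAlgebra, Random

# (c, s) such that [c s; -s c]' * [a, b] = [r, 0] with r = sqrt(a^2 + b^2)
function givens_cs(a, b)
    r = hypot(a, b)
    r == 0 ? (one(r), zero(r)) : (a / r, -b / r)
end

# Zero X[k, j] against the diagonal entry X[j, j], for each k > j
function givens_qr(X::AbstractMatrix)
    R = float.(X)                            # copy
    n, p = size(R)
    Qt = Matrix{eltype(R)}(I, n, n)          # accumulates G_t' ⋯ G_1' (for checking only)
    for j in 1:p, k in (j + 1):n
        c, s = givens_cs(R[j, j], R[k, j])
        G2t = [c -s; s c]                    # 2×2 block of G(j, k, θ)'
        R[[j, k], :] = G2t * R[[j, k], :]    # only rows j and k change
        Qt[[j, k], :] = G2t * Qt[[j, k], :]
    end
    Matrix(Qt'), triu(R)
end

Random.seed!(280)
X = randn(5, 3)
Q, R = givens_qr(X)
@show norm(Q * R - X)                        # roundoff level
@show norm(Q'Q - I)                          # roundoff level
```

Here `Q` is the full $n \times n$ factor and `R` is $n \times p$ with a zero lower block, matching the full QR form.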

## Applications

### Linear regression

* QR decomposition of $\mathbf{X}$: $2np^2 - \frac 23 p^3$ flops.

* Solve $\mathbf{R}^T \mathbf{R} \beta = \mathbf{R}^T \mathbf{Q}^T \mathbf{y}$ for $\beta$.

* If $\mathbf{X}$ is full rank, then $\mathbf{R}$ is invertible, so we only need to solve the triangular system
$$
    \mathbf{R} \beta = \mathbf{Q}^T \mathbf{y}
    .
$$
Multiplication $\mathbf{Q}^T \mathbf{y}$ is done implicitly.

* If we need standard errors, compute the inverse of $\mathbf{R}^T \mathbf{R}$. This involves triangular solves.
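
The regression workflow above can be sketched end to end. The data are synthetic and the names (`Qty`, `σ2`, `Rinv`, `covβ`, `se`) are illustrative assumptions. The sketch uses the usual estimate $\hat\sigma^2 = \|\mathbf{y} - \mathbf{X}\hat\beta\|_2^2 / (n - p)$ and the identity $(\mathbf{R}^T\mathbf{R})^{-1} = \mathbf{R}^{-1}\mathbf{R}^{-T}$.

```julia
using LinearAlgebra, Random

Random.seed!(280)
n, p = 20, 3
X = randn(n, p)
y = X * [1.0, -2.0, 0.5] + 0.1 * randn(n)   # synthetic data, for illustration only

F = qr(X)
R = UpperTriangular(F.R)
Qty = (F.Q' * y)[1:p]                   # Q'y applied implicitly; keep the first p entries
β = R \ Qty                             # triangular solve

σ2 = sum(abs2, y - X * β) / (n - p)     # estimate of σ²
Rinv = inv(R)                           # inverse of a triangular matrix
covβ = σ2 * (Rinv * Rinv')              # σ² (R'R)⁻¹ = σ² R⁻¹ R⁻ᵀ
se = sqrt.(diag(covβ))                  # standard errors of the coefficients

@show β
@show se
```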

## Further reading

* Section II.5.3 of [Computational Statistics](https://link.springer.com/book/10.1007%2F978-0-387-98144-4) by James Gentle (2010).

* Chapter 5 of [Matrix Computations](https://www.amazon.com/Computations-Hopkins-Studies-Mathematical-Sciences/dp/1421407949/ref=sr_1_1?keywords=matrix+computation+golub&qid=1567157884&s=gateway&sr=8-1) by Gene Golub and Charles Van Loan (2013).

## Acknowledgment

Many parts of this lecture note are based on [Dr. Hua Zhou](http://hua-zhou.github.io)'s 2019 Spring Statistical Computing course notes available at <http://hua-zhou.github.io/teaching/biostatm280-2019spring/index.html>.