## Understanding `linalg.householder_product`

The documentation of [`torch.linalg.householder_product`](https://docs.pytorch.org/docs/2.8/generated/torch.linalg.householder_product.html) describes what the operation does. I'd like to understand it and write a Python implementation. I'm going to dissect the description by quoting each sentence and expanding on it with a specific example scenario.

> Let $\Bbb K$ be $\Bbb R$ or $\Bbb C$, and let $A \in \Bbb K^{m\times n}$ be a matrix with columns $a_i \in \Bbb K^m$ for $i = 1, ..., m$ with $m \ge n$.

Input $A$ is a real or complex matrix with $m$ rows and $n$ columns.

The matrix must have either an equal number of rows and columns or more rows than columns, $m \ge n$.

We will refer to each column of the matrix as a vector $a_i$. Since the matrix has $m$ rows, each column $a_i$ has $m$ elements. There are $n$ such columns, so $i$ runs from $1$ to $n$ (the quoted sentence says $m$, which looks like a typo).

For instance, given the following matrix:

$$
A =
\begin{pmatrix}
A_{1 1} & A_{1 2} & A_{1 3} \\
A_{2 1} & A_{2 2} & A_{2 3} \\
A_{3 1} & A_{3 2} & A_{3 3} \\
A_{4 1} & A_{4 2} & A_{4 3} \\
A_{5 1} & A_{5 2} & A_{5 3} \\
\end{pmatrix}
$$

We have 5 rows and 3 columns, so $m = 5$ and $n = 3$. $m \ge n$ is satisfied.

Each $a_i$ is:

$$
a_1 = \begin{pmatrix}
A_{1 1} \\
A_{2 1} \\
A_{3 1} \\
A_{4 1} \\
A_{5 1}
\end{pmatrix}

\quad

a_2 = \begin{pmatrix}
A_{1 2} \\
A_{2 2} \\
A_{3 2} \\
A_{4 2} \\
A_{5 2}
\end{pmatrix}

\quad

a_3 = \begin{pmatrix}
A_{1 3} \\
A_{2 3} \\
A_{3 3} \\
A_{4 3} \\
A_{5 3}
\end{pmatrix}
$$

> Denote by $b_i$ the vector resulting from zeroing out the first $i - 1$ components of $a_i$ and setting to $1$ the $i$-th.

In our case:

$$
b_1 = \begin{pmatrix}
1 \\
A_{2 1} \\
A_{3 1} \\
A_{4 1} \\
A_{5 1}
\end{pmatrix}

\quad

b_2 = \begin{pmatrix}
0 \\
1 \\
A_{3 2} \\
A_{4 2} \\
A_{5 2}
\end{pmatrix}

\quad

b_3 = \begin{pmatrix}
0 \\
0 \\
1 \\
A_{4 3} \\
A_{5 3}
\end{pmatrix}
$$
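
To make the construction concrete, here is a minimal sketch that builds the $b_i$ vectors for a small real matrix. The helper name `householder_vectors` is my own and is not part of PyTorch; it only handles an unbatched 2-D input.

```python
import torch

def householder_vectors(A):
    """Return [b_1, ..., b_n] built from the columns of a 2-D matrix A."""
    m, n = A.shape
    vectors = []
    for i in range(n):          # 0-based i plays the role of i - 1 in the text
        b = A[:, i].clone()     # start from column a_i
        b[:i] = 0               # zero out the components above position i
        b[i] = 1                # set the component at position i to 1
        vectors.append(b)
    return vectors

# A 5 x 3 matrix filled with 1, 2, ..., 15 so the pattern is easy to see.
A = torch.arange(1.0, 16.0).reshape(5, 3)
b1, b2, b3 = householder_vectors(A)
print(b1, b2, b3, sep="\n")
```

Only the entries of $A$ strictly below the diagonal end up in the $b_i$ vectors; the diagonal and everything above it are ignored.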

> For a vector $\tau \in \Bbb K^k$ with $k \le n$, this function computes the first $n$ columns of the matrix
>
>  $H_1 H_2 ... H_k$ with $H_i = I_m - \tau_i b_i b_i^H$
>
> where $I_m$ is the m-dimensional identity matrix and $b^H$ is the conjugate transpose when $b$ is complex and the transpose when $b$ is real-valued.

So the input $\tau$ is a vector whose length $k$ must not exceed the number of columns in $A$, $k \le n$. (If $k \lt n$, only the first $k$ columns of $A$ are used to build Householder matrices. Equivalently, the missing values $\tau_i$ for $i \gt k$ can be thought of as 0, so that the corresponding $H_i$'s are the identity matrix.)
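
The remark about short $\tau$ vectors can be checked numerically. This sketch assumes that padding $\tau$ with an explicit zero should give the same result as passing the shorter vector; the variable names are mine.

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 3, dtype=torch.double)

# k = 2 < n = 3: only two Householder matrices are built.
tau_short = torch.randn(2, dtype=torch.double)
# Same values, but with an explicit tau_3 = 0 appended, so H_3 = I.
tau_padded = torch.cat([tau_short, torch.zeros(1, dtype=torch.double)])

Q_short = torch.linalg.householder_product(A, tau_short)
Q_padded = torch.linalg.householder_product(A, tau_padded)

same = torch.allclose(Q_short, Q_padded)
print(Q_short.shape, same)
```

Either way the output has the same shape as $A$, because the function always returns the first $n$ columns of the product.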

In our case, let's use the following $\tau$:

$$
\tau = \begin{pmatrix}
\tau_1 \\
\tau_2 \\
\tau_3
\end{pmatrix}
$$

One thing that wasn't 100% clear to me at first is which kind of vector product to use for the $b_i b_i^H$ term. Since $b_i$ is a column vector with $m$ elements and $b_i^H$ is a row vector with $m$ elements, and the result is subtracted from $I_m$, it has to be an $m \times m$ matrix, which means the outer product. This agrees with the [Wikipedia article](https://en.wikipedia.org/wiki/Householder_transformation#:~:text=.-,Householder%20matrix,-%5Bedit%5D) on the Householder transformation, where the formula for a Householder matrix uses the outer product for this term.

The $b_i^H$ vectors are:

$$
b_1^H = \begin{pmatrix}
1 &
A_{2 1}^* &
A_{3 1}^* &
A_{4 1}^* &
A_{5 1}^*
\end{pmatrix}
$$
$$
b_2^H = \begin{pmatrix}
0 &
1 &
A_{3 2}^* &
A_{4 2}^* &
A_{5 2}^*
\end{pmatrix}
$$
$$
b_3^H = \begin{pmatrix}
0 &
0 &
1 &
A_{4 3}^* &
A_{5 3}^*
\end{pmatrix}
$$

Then the outer products are:

$$
b_1 b_1^H = \begin{pmatrix}
1 & A_{2 1}^* & A_{3 1}^* & A_{4 1}^* & A_{5 1}^* \\
A_{2 1} & A_{2 1} A_{2 1}^* & A_{2 1} A_{3 1}^* & A_{2 1} A_{4 1}^* & A_{2 1} A_{5 1}^* \\
A_{3 1} & A_{3 1} A_{2 1}^* & A_{3 1} A_{3 1}^* & A_{3 1} A_{4 1}^* & A_{3 1} A_{5 1}^* \\
A_{4 1} & A_{4 1} A_{2 1}^* & A_{4 1} A_{3 1}^* & A_{4 1} A_{4 1}^* & A_{4 1} A_{5 1}^* \\
A_{5 1} & A_{5 1} A_{2 1}^* & A_{5 1} A_{3 1}^* & A_{5 1} A_{4 1}^* & A_{5 1} A_{5 1}^* \\
\end{pmatrix}
$$

$$
b_2 b_2^H = \begin{pmatrix}
0 & 0 & 0 & 0 & 0\\
0 & 1 & A_{3 2}^* & A_{4 2}^* & A_{5 2}^* \\
0 & A_{3 2} & A_{3 2} A_{3 2}^* & A_{3 2} A_{4 2}^* & A_{3 2} A_{5 2}^* \\
0 & A_{4 2} & A_{4 2} A_{3 2}^* & A_{4 2} A_{4 2}^* & A_{4 2} A_{5 2}^* \\
0 & A_{5 2} & A_{5 2} A_{3 2}^* & A_{5 2} A_{4 2}^* & A_{5 2} A_{5 2}^* \\
\end{pmatrix}
$$

$$
b_3 b_3^H = \begin{pmatrix}
0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & A_{4 3}^* & A_{5 3}^* \\
0 & 0 & A_{4 3} & A_{4 3} A_{4 3}^* & A_{4 3} A_{5 3}^* \\
0 & 0 & A_{5 3} & A_{5 3} A_{4 3}^* & A_{5 3} A_{5 3}^* \\
\end{pmatrix}
$$
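
Here is a small numerical check of the outer product, as a sketch rather than anything definitive. The choice $\tau = 2 / (b^H b)$ is my own, made to match the Wikipedia form $I - 2 v v^H$ with $v = b / \lVert b \rVert$; `householder_product` itself accepts any $\tau$.

```python
import torch

torch.manual_seed(0)
m = 5
b = torch.randn(m, dtype=torch.cdouble)
b[0] = 0          # shaped like b_2 for m = 5: first component zeroed,
b[1] = 1          # second component set to 1

# Outer product b b^H: column vector times conjugated row vector, m x m.
outer = b.unsqueeze(-1) @ b.conj().unsqueeze(-2)

# Inner product b^H b is a real, positive number.
inner = (b.conj() * b).sum().real

# Illustrative tau (an assumption, see above).
tau = 2 / inner
I = torch.eye(m, dtype=torch.cdouble)
H = I - tau * outer

is_hermitian = torch.allclose(outer, outer.conj().transpose(-1, -2))
is_unitary = torch.allclose(H @ H.conj().transpose(-1, -2), I)
print(is_hermitian, is_unitary)
```

With this particular $\tau$ the matrix $H$ is a reflection, so applying it twice gives the identity. The first row and column of `outer` are all zeros, as in the $b_2 b_2^H$ matrix above.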

## Python prototype implementation

I'll implement it in Python now to make sure I understand it, and I'll compare its output to that of the official PyTorch impl.

In [1]:
import itertools
import torch

def run_test(fn, fn_check):
    dtypes = [
        torch.double,
        torch.cdouble,
    ]

    shapes = [
        # [A_shape, tau_shape]
        [(5, 3), (3,)],
        [(5, 3), (2,)],
        [(5, 3), (1,)],
        [(10, 5, 3), (10, 3,)],
        [(10, 5, 3), (10, 2,)],
        [(2, 10, 5, 3), (2, 10, 3,)],
        [(40, 1, 20, 15), (40, 1, 10,)],
    ]

    for dtype, (A_shape, tau_shape) in itertools.product(dtypes, shapes):
        A = torch.randn(A_shape, dtype=dtype)
        tau = torch.randn(tau_shape, dtype=dtype)

        r_check = fn_check(A, tau)
        r = fn(A, tau)
        assert torch.allclose(r, r_check)


In [2]:
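def householder_prod_0(A, tau):
    m = A.shape[-2]
    n = A.shape[-1]
    k = tau.shape[-1]

    I = torch.eye(m, dtype=A.dtype, device=A.device)
    H_prod = I

    for i in range(k):
        # Build b_i: column i of A with the entries above position i zeroed
        # and a 1 at position i.
        b = A[..., i].clone()
        b[..., :i] = 0
        b[..., i] = 1
        # Outer product b_i b_i^H, shape (..., m, m).
        b_bH = b.unsqueeze(-1) @ b.conj().unsqueeze(-2)
        tau_i = tau[..., i, None, None]
        H = I - tau_i * b_bH
        H_prod = H_prod @ H

    # Keep only the first n columns of H_1 H_2 ... H_k.
    return H_prod[..., :n]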

run_test(householder_prod_0, torch.linalg.householder_product)

This implementation gives the same outputs as the PyTorch impl for the above cases. But each of the Householder matrices is calculated serially. We should be able to batch that part.

In [3]:
def householder_prod_1(A, tau):
    m = A.shape[-2]
    n = A.shape[-1]
    k = tau.shape[-1]

    # Row i of B will hold the vector b_i; B has shape (..., k, m).
    B = A[..., :k].transpose(-1, -2).clone()

    for i in range(k):
        B[..., i, :i] = 0
        B[..., i, i] = 1

    I = torch.eye(m, dtype=A.dtype, device=A.device)
    # All k outer products b_i b_i^H at once, shape (..., k, m, m).
    B_BH = B.unsqueeze(-1) @ B.conj().unsqueeze(-2)
    H_matrices = I - tau[..., None, None] * B_BH
    H_prod = H_matrices[..., 0, :, :]

    # The product itself is still accumulated one matrix at a time.
    for i in range(1, k):
        H = H_matrices[..., i, :, :]
        H_prod = H_prod @ H

    return H_prod[..., :n]

run_test(householder_prod_1, torch.linalg.householder_product)

That should give better performance. We're still looping over the matrices to perform the matmul product, but we have to do that because PyTorch has no batched matrix-product reduction operation (`torch.linalg.multi_dot` only accepts 1-D and 2-D tensors, so it can't handle the batch dimensions here). We're also looping to create the `B` vectors, and there may be a better way to do that in Python, but this Python impl is only a prototype, which has now served its purpose.
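
For what it's worth, one loop-free way to build the `B` vectors is sketched below. It assumes `torch.tril` and a rectangular `torch.eye` are acceptable here; the function names are hypothetical, and the loop version is included only as a reference to compare against.

```python
import torch

def householder_vectors_batched(A, k):
    """Return B with shape (..., k, m), where B[..., i, :] is b_{i+1}."""
    m = A.shape[-2]
    # tril with diagonal=-1 keeps only entries strictly below the diagonal,
    # which zeroes the diagonal and everything above it in each column.
    below = torch.tril(A[..., :k], diagonal=-1)
    # Adding a rectangular identity puts the 1s on the diagonal.
    cols = below + torch.eye(m, k, dtype=A.dtype, device=A.device)
    return cols.transpose(-1, -2)

def householder_vectors_loop(A, k):
    """Reference: the same construction with an explicit loop."""
    B = A[..., :k].transpose(-1, -2).clone()
    for i in range(k):
        B[..., i, :i] = 0
        B[..., i, i] = 1
    return B

A = torch.randn(2, 5, 3, dtype=torch.cdouble)
B = householder_vectors_batched(A, k=3)
print(B.shape)  # one row of length m per Householder vector
```

Whether this is actually faster than the short loop would need to be measured; for small `k` the difference is probably negligible.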

## Planning a Metal GPU impl

I need to implement `householder_product` to run on the Apple Metal GPU. A few possible strategies occur to me:

1. Use only existing `at::` operations.

2. Use an existing linalg library that has Metal GPU support.

3. Write a single Metal kernel to do the whole operation.

4. Write a Metal kernel to do the `(A, tau) -> H_matrices` calculation, and then iteratively call the existing `matmul` op to calculate the product.

Option (1) is of course very easy to do, now that I have the Python impl, but it will be the least efficient one. Since it is so easy to do, I should do this one first and just check how the performance compares to the CPU impl.

It seems that option (2) is not possible, because I can't find any existing implementation of `orgqr` (the LAPACK routine that `householder_product` corresponds to) for Metal.

I think option (3) would probably be pretty difficult, and I think it would actually have to operate serially anyway to do the matrix product.

So it seems to me that option (4) is the best one.

OK, I've tried out option (1), and the performance is bad, as expected. What I didn't expect is that the series of matmuls on its own is significantly slower than the entire CPU impl. That rules out option (4) as well, since it relies on the same series of `matmul` calls. So I'm going to have to go with option (3) to have any hope of the MPS impl giving better performance than CPU.

I'm going to continue with the example I was working out above with the $5 \times 3$ $A$ matrix and $\tau$ with 3 elements. My hope is that if I write out the final result $H_1 H_2 H_3$, then it will be fairly easy to see how to generalize the solution for any input sizes.

$$
H_1
= I_5 - \tau_1 b_1 b_1^H
$$

$$
H_1 =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
- \tau_1
\begin{pmatrix}
1 &
A_{2 1}^* &
A_{3 1}^* &
A_{4 1}^* &
A_{5 1}^* \\

A_{2 1} &
A_{2 1} A_{2 1}^* &
A_{2 1} A_{3 1}^* &
A_{2 1} A_{4 1}^* &
A_{2 1} A_{5 1}^* \\

A_{3 1} &
A_{3 1} A_{2 1}^* &
A_{3 1} A_{3 1}^* &
A_{3 1} A_{4 1}^* &
A_{3 1} A_{5 1}^* \\

A_{4 1} &
A_{4 1} A_{2 1}^* &
A_{4 1} A_{3 1}^* &
A_{4 1} A_{4 1}^* &
A_{4 1} A_{5 1}^* \\

A_{5 1} &
A_{5 1} A_{2 1}^* &
A_{5 1} A_{3 1}^* &
A_{5 1} A_{4 1}^* &
A_{5 1} A_{5 1}^*
\end{pmatrix}
$$

$$
H_1 =
\begin{pmatrix}
1 - \tau_1 &
-\tau_1 A_{2 1}^* &
-\tau_1 A_{3 1}^* &
-\tau_1 A_{4 1}^* &
-\tau_1 A_{5 1}^* \\

-\tau_1 A_{2 1} &
1 - \tau_1 A_{2 1} A_{2 1}^* &
-\tau_1 A_{2 1} A_{3 1}^* &
-\tau_1 A_{2 1} A_{4 1}^* &
-\tau_1 A_{2 1} A_{5 1}^* \\

-\tau_1 A_{3 1} &
-\tau_1 A_{3 1} A_{2 1}^* &
1 - \tau_1 A_{3 1} A_{3 1}^* &
-\tau_1 A_{3 1} A_{4 1}^* &
-\tau_1 A_{3 1} A_{5 1}^* \\

-\tau_1 A_{4 1} &
-\tau_1 A_{4 1} A_{2 1}^* &
-\tau_1 A_{4 1} A_{3 1}^* &
1 - \tau_1 A_{4 1} A_{4 1}^* &
-\tau_1 A_{4 1} A_{5 1}^* \\

-\tau_1 A_{5 1} &
-\tau_1 A_{5 1} A_{2 1}^* &
-\tau_1 A_{5 1} A_{3 1}^* &
-\tau_1 A_{5 1} A_{4 1}^* &
1 - \tau_1 A_{5 1} A_{5 1}^*
\end{pmatrix}
$$

$$
H_2
= I_5 - \tau_2 b_2 b_2^H
$$

$$
H_2 =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
- \tau_2
\begin{pmatrix}
0 & 0 & 0 & 0 & 0\\
0 & 1 & A_{3 2}^* & A_{4 2}^* & A_{5 2}^* \\
0 & A_{3 2} & A_{3 2} A_{3 2}^* & A_{3 2} A_{4 2}^* & A_{3 2} A_{5 2}^* \\
0 & A_{4 2} & A_{4 2} A_{3 2}^* & A_{4 2} A_{4 2}^* & A_{4 2} A_{5 2}^* \\
0 & A_{5 2} & A_{5 2} A_{3 2}^* & A_{5 2} A_{4 2}^* & A_{5 2} A_{5 2}^* \\
\end{pmatrix}
$$

$$
H_2 =
\begin{pmatrix}
1 &
0 &
0 &
0 &
0 \\

0 &
1 - \tau_2 &
- \tau_2 A_{3 2}^* &
- \tau_2 A_{4 2}^* &
- \tau_2 A_{5 2}^* \\

0 &
- \tau_2 A_{3 2} &
1 - \tau_2 A_{3 2} A_{3 2}^* &
- \tau_2 A_{3 2} A_{4 2}^* &
- \tau_2 A_{3 2} A_{5 2}^* \\

0 &
- \tau_2 A_{4 2} &
- \tau_2 A_{4 2} A_{3 2}^* &
1 - \tau_2 A_{4 2} A_{4 2}^* &
- \tau_2 A_{4 2} A_{5 2}^* \\

0 &
- \tau_2 A_{5 2} &
- \tau_2 A_{5 2} A_{3 2}^* &
- \tau_2 A_{5 2} A_{4 2}^* &
1 - \tau_2 A_{5 2} A_{5 2}^*
\end{pmatrix}
$$

$$
H_3
= I_5 - \tau_3 b_3 b_3^H
$$

$$
H_3 =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
- \tau_3
\begin{pmatrix}
0 &
0 &
0 &
0 &
0 \\

0 &
0 &
0 &
0 &
0 \\

0 &
0 &
1 &
A_{4 3}^* &
A_{5 3}^* \\

0 &
0 &
A_{4 3} &
A_{4 3} A_{4 3}^* &
A_{4 3} A_{5 3}^* \\

0 &
0 &
A_{5 3} &
A_{5 3} A_{4 3}^* &
A_{5 3} A_{5 3}^*
\end{pmatrix}
$$

$$
H_3 =
\begin{pmatrix}
1 &
0 &
0 &
0 &
0 \\

0 &
1 &
0 &
0 &
0 \\

0 &
0 &
1 - \tau_3  &
- \tau_3 A_{4 3}^* &
- \tau_3 A_{5 3}^* \\

0 &
0 &
- \tau_3 A_{4 3} &
1 - \tau_3 A_{4 3} A_{4 3}^* &
- \tau_3 A_{4 3} A_{5 3}^* \\

0 &
0 &
- \tau_3 A_{5 3} &
- \tau_3 A_{5 3} A_{4 3}^* &
1 - \tau_3 A_{5 3} A_{5 3}^*
\end{pmatrix}
$$
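
The pattern in these three matrices is that $H_i$ agrees with the identity in its first $i - 1$ rows and columns. Here is a quick sketch that checks this on random complex inputs; the sizes and names are just the ones from the running example.

```python
import torch

torch.manual_seed(0)
m, n = 5, 3
A = torch.randn(m, n, dtype=torch.cdouble)
tau = torch.randn(n, dtype=torch.cdouble)
I = torch.eye(m, dtype=torch.cdouble)

H = []
for i in range(n):                      # 0-based i corresponds to H_{i+1}
    b = A[:, i].clone()
    b[:i] = 0
    b[i] = 1
    outer = b[:, None] * b.conj()[None, :]
    H.append(I - tau[i] * outer)

# H_{i+1} should match the identity in its first i rows and first i columns.
identity_border_ok = all(
    torch.equal(H[i][:i, :], I[:i, :]) and torch.equal(H[i][:, :i], I[:, :i])
    for i in range(n)
)
print(identity_border_ok)
```

So only the trailing $(m - i + 1) \times (m - i + 1)$ block of each $H_i$ carries any information.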

To write a generalized formula for each element of each Householder matrix, I'm going to use my own notation. ${H_i}^{[r, c]}$ represents the element from the $r$-th row and $c$-th column of the $i$-th Householder matrix. The two factors after $\tau_i$ are the $r$-th component of $b_i$ and the conjugate of the $c$-th component of $b_i$, each of which is 0 before position $i$ and 1 at position $i$.

$$
{H_i}^{[r, c]}
= \begin{cases}
1, & \text{if } r = c \\
0, & \text{else}
\end{cases}
- \tau_i
\begin{cases}
A_{r i}, & \text{if } r > i \\
1,       & \text{if } r = i \\
0,       & \text{if } r < i
\end{cases}
\begin{cases}
A_{c i}^*, & \text{if } c > i \\
1,         & \text{if } c = i \\
0,         & \text{if } c < i
\end{cases}
$$
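
Going back to the definition of $b_i$ (zeros before position $i$, a 1 at position $i$, and the entries of column $i$ of $A$ after it), the element-wise calculation can be sketched in plain Python. The function names are hypothetical, indices are 1-based to match the math, and `A` is a list of rows.

```python
def b_component(A, i, r):
    """r-th component of b_i, with 1-based i and r."""
    if r < i:
        return 0
    if r == i:
        return 1
    return A[r - 1][i - 1]

def householder_element(A, tau, i, r, c):
    """Element [r, c] of H_i = I - tau_i * b_i * b_i^H, with 1-based indices."""
    identity = 1 if r == c else 0
    b_r = b_component(A, i, r)
    b_c_conj = complex(b_component(A, i, c)).conjugate()
    return identity - tau[i - 1] * b_r * b_c_conj

# A small 3 x 2 complex example.
A = [[1, 2], [3 + 1j, 4], [5, 6 - 2j]]
tau = [0.5, 0.25]

print(householder_element(A, tau, 2, 1, 1))  # top-left of H_2
print(householder_element(A, tau, 2, 3, 3))  # bottom-right of H_2
```

This is the per-element calculation that a single GPU thread would need to do for its own $(r, c)$ position.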

Now I'll write out the matrix products. Using the ${H_i}^{[r, c]}$ notation, if either $r$ or $c$ is a colon, $:$, then it represents every element, just as in Python array indexing notation. So for instance, ${H_1}^{[4, :]}$ represents the fourth row of matrix 1, and ${H_3}^{[:, 2]}$ represents the second column of matrix 3.

$$
H_{1..2} = H_1 H_2
= \begin{pmatrix}
{H_1}^{[1, :]} \cdot {H_2}^{[:, 1]} &
{H_1}^{[1, :]} \cdot {H_2}^{[:, 2]} &
{H_1}^{[1, :]} \cdot {H_2}^{[:, 3]} &
{H_1}^{[1, :]} \cdot {H_2}^{[:, 4]} &
{H_1}^{[1, :]} \cdot {H_2}^{[:, 5]} \\
{H_1}^{[2, :]} \cdot {H_2}^{[:, 1]} &
{H_1}^{[2, :]} \cdot {H_2}^{[:, 2]} &
{H_1}^{[2, :]} \cdot {H_2}^{[:, 3]} &
{H_1}^{[2, :]} \cdot {H_2}^{[:, 4]} &
{H_1}^{[2, :]} \cdot {H_2}^{[:, 5]} \\
{H_1}^{[3, :]} \cdot {H_2}^{[:, 1]} &
{H_1}^{[3, :]} \cdot {H_2}^{[:, 2]} &
{H_1}^{[3, :]} \cdot {H_2}^{[:, 3]} &
{H_1}^{[3, :]} \cdot {H_2}^{[:, 4]} &
{H_1}^{[3, :]} \cdot {H_2}^{[:, 5]} \\
{H_1}^{[4, :]} \cdot {H_2}^{[:, 1]} &
{H_1}^{[4, :]} \cdot {H_2}^{[:, 2]} &
{H_1}^{[4, :]} \cdot {H_2}^{[:, 3]} &
{H_1}^{[4, :]} \cdot {H_2}^{[:, 4]} &
{H_1}^{[4, :]} \cdot {H_2}^{[:, 5]} \\
{H_1}^{[5, :]} \cdot {H_2}^{[:, 1]} &
{H_1}^{[5, :]} \cdot {H_2}^{[:, 2]} &
{H_1}^{[5, :]} \cdot {H_2}^{[:, 3]} &
{H_1}^{[5, :]} \cdot {H_2}^{[:, 4]} &
{H_1}^{[5, :]} \cdot {H_2}^{[:, 5]} \\
\end{pmatrix}
$$

$$
H_{1..3} = H_1 H_2 H_3
= \begin{pmatrix}
{H_{1..2}}^{[1, :]} \cdot {H_3}^{[:, 1]} &
{H_{1..2}}^{[1, :]} \cdot {H_3}^{[:, 2]} &
{H_{1..2}}^{[1, :]} \cdot {H_3}^{[:, 3]} &
{H_{1..2}}^{[1, :]} \cdot {H_3}^{[:, 4]} &
{H_{1..2}}^{[1, :]} \cdot {H_3}^{[:, 5]} \\
{H_{1..2}}^{[2, :]} \cdot {H_3}^{[:, 1]} &
{H_{1..2}}^{[2, :]} \cdot {H_3}^{[:, 2]} &
{H_{1..2}}^{[2, :]} \cdot {H_3}^{[:, 3]} &
{H_{1..2}}^{[2, :]} \cdot {H_3}^{[:, 4]} &
{H_{1..2}}^{[2, :]} \cdot {H_3}^{[:, 5]} \\
{H_{1..2}}^{[3, :]} \cdot {H_3}^{[:, 1]} &
{H_{1..2}}^{[3, :]} \cdot {H_3}^{[:, 2]} &
{H_{1..2}}^{[3, :]} \cdot {H_3}^{[:, 3]} &
{H_{1..2}}^{[3, :]} \cdot {H_3}^{[:, 4]} &
{H_{1..2}}^{[3, :]} \cdot {H_3}^{[:, 5]} \\
{H_{1..2}}^{[4, :]} \cdot {H_3}^{[:, 1]} &
{H_{1..2}}^{[4, :]} \cdot {H_3}^{[:, 2]} &
{H_{1..2}}^{[4, :]} \cdot {H_3}^{[:, 3]} &
{H_{1..2}}^{[4, :]} \cdot {H_3}^{[:, 4]} &
{H_{1..2}}^{[4, :]} \cdot {H_3}^{[:, 5]} \\
{H_{1..2}}^{[5, :]} \cdot {H_3}^{[:, 1]} &
{H_{1..2}}^{[5, :]} \cdot {H_3}^{[:, 2]} &
{H_{1..2}}^{[5, :]} \cdot {H_3}^{[:, 3]} &
{H_{1..2}}^{[5, :]} \cdot {H_3}^{[:, 4]} &
{H_{1..2}}^{[5, :]} \cdot {H_3}^{[:, 5]} \\
\end{pmatrix}
$$

We can generalize that for the product of Householder matrices 1 to $i$:

$$
H_{1..i} = H_{1..(i-1)} H_i
= \begin{pmatrix}
{H_{1..(i-1)}}^{[1, :]} \cdot {H_i}^{[:, 1]} &
{H_{1..(i-1)}}^{[1, :]} \cdot {H_i}^{[:, 2]} &
{H_{1..(i-1)}}^{[1, :]} \cdot {H_i}^{[:, 3]} &
{H_{1..(i-1)}}^{[1, :]} \cdot {H_i}^{[:, 4]} &
{H_{1..(i-1)}}^{[1, :]} \cdot {H_i}^{[:, 5]} \\
{H_{1..(i-1)}}^{[2, :]} \cdot {H_i}^{[:, 1]} &
{H_{1..(i-1)}}^{[2, :]} \cdot {H_i}^{[:, 2]} &
{H_{1..(i-1)}}^{[2, :]} \cdot {H_i}^{[:, 3]} &
{H_{1..(i-1)}}^{[2, :]} \cdot {H_i}^{[:, 4]} &
{H_{1..(i-1)}}^{[2, :]} \cdot {H_i}^{[:, 5]} \\
{H_{1..(i-1)}}^{[3, :]} \cdot {H_i}^{[:, 1]} &
{H_{1..(i-1)}}^{[3, :]} \cdot {H_i}^{[:, 2]} &
{H_{1..(i-1)}}^{[3, :]} \cdot {H_i}^{[:, 3]} &
{H_{1..(i-1)}}^{[3, :]} \cdot {H_i}^{[:, 4]} &
{H_{1..(i-1)}}^{[3, :]} \cdot {H_i}^{[:, 5]} \\
{H_{1..(i-1)}}^{[4, :]} \cdot {H_i}^{[:, 1]} &
{H_{1..(i-1)}}^{[4, :]} \cdot {H_i}^{[:, 2]} &
{H_{1..(i-1)}}^{[4, :]} \cdot {H_i}^{[:, 3]} &
{H_{1..(i-1)}}^{[4, :]} \cdot {H_i}^{[:, 4]} &
{H_{1..(i-1)}}^{[4, :]} \cdot {H_i}^{[:, 5]} \\
{H_{1..(i-1)}}^{[5, :]} \cdot {H_i}^{[:, 1]} &
{H_{1..(i-1)}}^{[5, :]} \cdot {H_i}^{[:, 2]} &
{H_{1..(i-1)}}^{[5, :]} \cdot {H_i}^{[:, 3]} &
{H_{1..(i-1)}}^{[5, :]} \cdot {H_i}^{[:, 4]} &
{H_{1..(i-1)}}^{[5, :]} \cdot {H_i}^{[:, 5]} \\
\end{pmatrix}
$$

And we can further generalize that to the element at row $r$ and column $c$ of the product of Householder matrices 1 to $i$:

$$
{H_{1..i}}^{[r, c]}
= {H_{1..(i-1)}}^{[r, :]} \cdot {H_i}^{[:, c]}
$$

That last expression can be applied to any size of the inputs $A$ and $\tau$.
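
As a side note, writing that dot product out as a sum and substituting $H_i = I_m - \tau_i b_i b_i^H$ suggests that each step is a rank-1 update of the previous product. Here ${b_i}^{[j]}$ is my own shorthand for the $j$-th component of $b_i$, in the same spirit as the ${H_i}^{[r, c]}$ notation:

$$
{H_{1..i}}^{[r, c]}
= \sum_{j=1}^{m} {H_{1..(i-1)}}^{[r, j]} \, {H_i}^{[j, c]}
= {H_{1..(i-1)}}^{[r, c]}
- \tau_i \left( \sum_{j=i}^{m} {H_{1..(i-1)}}^{[r, j]} \, {b_i}^{[j]} \right) \left( {b_i}^{[c]} \right)^*
$$

The second sum can start at $j = i$ because the first $i - 1$ components of $b_i$ are zero. Note that this is a plain row-by-column product: the only conjugation is the one already present in $b_i^H$.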

Now I can see a strategy emerging for calculating the product of the Householder matrices in a Metal kernel, given $H_1$ through $H_k$. We can parallelize by running one thread per element of the running product. Note that the running product has to be kept at its full $m \times m$ size until the end, because each element of $H_{1..i}$ depends on an entire row of $H_{1..(i-1)}$, including the columns beyond $n$; only the first $n$ columns are returned as the result. First, each thread will calculate its corresponding element of $H_{1..2}$ by doing the appropriate dot product, then all threads will barrier sync, write to the output buffer, and then sync again. Then each thread calculates its corresponding element of $H_{1..3}$, syncs, writes, and syncs again. Repeat for each matrix multiplication.

This process does still have a serial nature, but I think that is unavoidable, and it should drastically cut down on the overhead of my prototype option (1) implementation, since it is done in a single kernel launch.

The only missing piece is how to calculate the Householder matrices. As I see it, we can just calculate the next Householder matrix between each matmul.

So to summarize, each thread executes the following algorithm, where $H_{1..0}$ is taken to be the identity $I_m$:

> $r = $ one of the $m$ rows  
> $c = $ one of the $m$ columns of the running product (only the first $n$ are returned)  
> for $i \in [1, ..., k]$:  
>> calc and write ${H_i}^{[r, c]}$ to secondary buffer  
>> sync  
>> calc ${H_{1..i}}^{[r, c]} = {H_{1..(i-1)}}^{[r, :]} \cdot {H_i}^{[:, c]}$  
>> sync  
>> write ${H_{1..i}}^{[r, c]}$ to output buffer  
>> sync if $i \ne k$  
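
Before writing any Metal code, the per-thread algorithm can be simulated serially in Python to check the logic. This is only a sketch under my own assumptions: it handles a single unbatched matrix, it runs the "threads" one after another with the sync points marked as comments, and it keeps all $m$ columns of the running product because the dot product reads the whole row $r$. The name `simulate_kernel` is hypothetical.

```python
import torch

def simulate_kernel(A, tau):
    """Serial simulation of the per-thread algorithm for one 2-D matrix A."""
    m, n = A.shape
    k = tau.shape[0]
    # The output buffer starts as the identity, i.e. the empty product H_{1..0}.
    out = torch.eye(m, dtype=A.dtype)
    secondary = torch.empty(m, m, dtype=A.dtype)
    threads = [(r, c) for r in range(m) for c in range(m)]

    def b(i, r):
        # r-th component of b_i, with 0-based i and r.
        if r < i:
            return torch.zeros((), dtype=A.dtype)
        if r == i:
            return torch.ones((), dtype=A.dtype)
        return A[r, i]

    for i in range(k):
        # Each thread writes its element of H_i to the secondary buffer.
        for r, c in threads:
            delta = 1.0 if r == c else 0.0
            secondary[r, c] = delta - tau[i] * b(i, r) * b(i, c).conj()
        # -- sync --
        # Each thread computes its dot product into a private value.
        private = {
            (r, c): (out[r, :] * secondary[:, c]).sum() for r, c in threads
        }
        # -- sync --
        # Only now is the output buffer overwritten.
        for r, c in threads:
            out[r, c] = private[(r, c)]
        # -- sync (unless this was the last iteration) --

    # Only the first n columns are part of the result.
    return out[:, :n]

torch.manual_seed(0)
A = torch.randn(5, 3, dtype=torch.cdouble)
tau = torch.randn(3, dtype=torch.cdouble)
matches = torch.allclose(simulate_kernel(A, tau),
                         torch.linalg.householder_product(A, tau))
print(matches)
```

The private value per thread stands in for a register: it has to be fully computed by every thread before anyone overwrites the output buffer, which is what the second sync is for.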
