---
# Section 1.1: Matrix Multiplication
---

### Example:

Let

$$
A = 
\begin{bmatrix} 
2 & 1 & 4 \\ 
3 & -2 & 1 
\end{bmatrix} 
\in \mathbb{R}^{2 \times 3},
\qquad
x = 
\begin{bmatrix}
1 \\ 2 \\ 1
\end{bmatrix}
\in \mathbb{R}^3.
$$

Compute the matrix-vector product $Ax$ by hand and check that Julia gives the same answer.
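
As a quick check of the hand computation, the following minimal sketch enters $A$ and $x$ in Julia and uses the built-in `*` operator. The variable names simply mirror the notation above.

```julia
# Enter the matrix and vector from the example above.
A = [2 1 4; 3 -2 1]
x = [1, 2, 1]

# By hand: row 1 gives 2*1 + 1*2 + 4*1 = 8, and row 2 gives 3*1 + (-2)*2 + 1*1 = 0.
b = A * x
println(b)   # prints [8, 0]
```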

---

In general, if $A$ is a real matrix with $m$ rows and $n$ columns, and $x$ is a real vector with $n$ entries, then

$$
A = 
\begin{bmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots &        & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{bmatrix}
\in \mathbb{R}^{m \times n}
\quad \text{and} \quad
x = 
\begin{bmatrix}
x_1 \\ \vdots \\ x_n
\end{bmatrix}
\in \mathbb{R}^n.
$$

If $b = Ax$, then $b \in \mathbb{R}^m$ and

$$
b_i = \sum_{j = 1}^n a_{ij} x_j = a_{i1}x_1 + \cdots + a_{in}x_n,
\quad
i = 1,\ldots, m.
$$

Thus, $b_i$ is the **inner product** of $\mathrm{row}_i(A) = \begin{bmatrix} a_{i1} & \cdots & a_{in} \end{bmatrix}$ and the vector $x$.

Also, 

$$b = \mathrm{col}_1(A) x_1 + \cdots + \mathrm{col}_n(A) x_n,$$

so $b$ is a **linear combination** of the columns of $A$.

---

### Exercise:

Write a Julia function to multiply a matrix and a vector.
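
One possible sketch, not the only way to do it, follows the two descriptions of $b = Ax$ given above. The function names `matvec_rows` and `matvec_cols` are made up for this illustration. The first computes each $b_i$ as an inner product with a row of $A$, and the second builds $b$ as a linear combination of the columns of $A$.

```julia
# Row view: b[i] is the inner product of row i of A with x.
function matvec_rows(A, x)
    m, n = size(A)
    b = zeros(m)
    for i = 1:m
        for j = 1:n
            b[i] += A[i, j] * x[j]
        end
    end
    return b
end

# Column view: b is built up as x[1]*col_1(A) + ... + x[n]*col_n(A).
function matvec_cols(A, x)
    m, n = size(A)
    b = zeros(m)
    for j = 1:n
        for i = 1:m
            b[i] += A[i, j] * x[j]
        end
    end
    return b
end

A = [2.0 1.0 4.0; 3.0 -2.0 1.0]
x = [1.0, 2.0, 1.0]
println(matvec_rows(A, x))   # prints [8.0, 0.0]
println(matvec_cols(A, x))   # prints [8.0, 0.0]
```

The two functions differ only in the order of the loops, so they perform the same arithmetic.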

---

### Exercise:

Test the speed of your function to compute $b = Ax$.
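
A rough way to run such a test, assuming a hand-written function like the hypothetical `my_matvec` below, is to use the `@elapsed` macro from Base and compare against the built-in product. Each function is called once before timing so that compilation time is not included. The timings depend on the machine, so no particular numbers are claimed here.

```julia
# Hypothetical helper: a plain double loop computing b = A*x.
function my_matvec(A, x)
    m, n = size(A)
    b = zeros(m)
    for j = 1:n, i = 1:m          # j is the outer loop, i the inner loop
        b[i] += A[i, j] * x[j]
    end
    return b
end

n = 2000
A = rand(n, n)
x = rand(n)

my_matvec(A, x); A * x            # run once first so compilation is not timed

t_loop    = @elapsed my_matvec(A, x)
t_builtin = @elapsed A * x
println("loop: ", t_loop, " s,  built-in: ", t_builtin, " s")

# The two results should agree up to rounding error.
println(my_matvec(A, x) ≈ A * x)
```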


### What are your conclusions?

1. 

---

## Storage of arrays in memory

In Julia, the matrix 

$$
A = 
\begin{bmatrix}
2 & 1 & 4 \\
3 & -2 & 1
\end{bmatrix}
$$

is stored in computer memory in **column-major order**, that is, one column after another:

| $\vdots$     |
|:------:|
| 2.0  |
| 3.0  |
| 1.0  |
| -2.0 |
| 4.0  |
| 1.0  |
| $\vdots$     |
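
The storage order shown in the table can be inspected directly. In this small sketch, `vec` reshapes the matrix into a vector without reordering the underlying data, and linear indexing `A[k]` walks through memory in the same order.

```julia
A = [2.0 1.0 4.0; 3.0 -2.0 1.0]

# vec(A) lists the entries in the order they are stored in memory.
println(vec(A))            # prints [2.0, 3.0, 1.0, -2.0, 4.0, 1.0]

# Linear indexing follows the same order: A[3] is the entry A[1, 2].
println(A[3] == A[1, 2])   # prints true
```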

### Computer memory architecture

When the CPU needs data from memory, the small contiguous block of memory containing that data (a **cache line**) gets loaded into the **cache**.

$$
\begin{matrix}
& \text{fast}& & \text{slow} & \\
\fbox{CPU} & \Longleftrightarrow & \fbox{cache} & \longleftrightarrow & \fbox{memory} \\
& & \text{3 MB} & & \text{16 GB}
\end{matrix}
$$

It is therefore faster to access data in the order in which it is stored: entries that are contiguous in memory are loaded into the cache together.
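
The effect of contiguous access can be explored with a sketch like the following, which sums the entries of a matrix twice with the loops in opposite orders. The function names are hypothetical. Since Julia stores matrices column by column, the version whose inner loop runs down a column touches memory contiguously. The measured times depend on the machine and on the matrix size, so none are quoted here.

```julia
# Sum the entries of A with the inner loop running down a column (contiguous in Julia).
function sum_by_columns(A)
    m, n = size(A)
    s = 0.0
    for j = 1:n, i = 1:m
        s += A[i, j]
    end
    return s
end

# Same sum with the inner loop running along a row (strided in Julia).
function sum_by_rows(A)
    m, n = size(A)
    s = 0.0
    for i = 1:m, j = 1:n
        s += A[i, j]
    end
    return s
end

A = rand(2000, 2000)
sum_by_columns(A); sum_by_rows(A)    # warm-up calls so compilation is not timed

t_cols = @elapsed sum_by_columns(A)
t_rows = @elapsed sum_by_rows(A)
println("inner loop over a column: ", t_cols, " s")
println("inner loop over a row:    ", t_rows, " s")
```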



### Row-major vs column-major order

Some languages store arrays in **row-major order**, such as:
- C/C++
- Python (NumPy arrays, by default)
- Mathematica

Other languages use **column-major order**, such as:
- Fortran
- MATLAB
- R
- Julia

See the [Row-major order](https://en.wikipedia.org/wiki/Row-major_order) Wikipedia page for more information.


---

### Floating-point operations (flops)

A **flop** is a *floating-point operation* between numbers stored in a floating-point format on a computer.

We will discuss this floating-point format in detail later in the course. For now, it is enough to know that this format is a way of storing real numbers on a computer that is like *scientific notation*. It allows us to store a large range of numbers, but only to a finite precision.

If $x$ and $y$ are numbers stored in a floating-point format, then the following operations are each *one flop*:

$$
x + y, \quad x - y, \quad xy, \quad x/y.
$$
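
As a small preview of what finite precision means, the following sketch uses `Float64`, the default floating-point type in Julia. It only illustrates that results of flops are rounded; the format itself is left for later in the course.

```julia
# Float64 is Julia's default floating-point type.
println(typeof(1.0))        # prints Float64

# 0.1, 0.2 and 0.3 are not stored exactly, so the rounded sum differs from 0.3.
println(0.1 + 0.2 == 0.3)   # prints false

# eps() is the gap between 1.0 and the next larger Float64.
println(eps() == 2.0^-52)   # prints true
```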

---

### Example:

Count the number of flops in the following code.

```julia
for j = 1:n
    for i = 1:m
        b[i] = b[i] + A[i, j]*x[j]
    end
end
```
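
One way to check a hand count is to instrument the loop. The sketch below wraps the same loop in a hypothetical function `count_flops` and adds 2 to a counter in each inner iteration, one for the multiplication and one for the addition. The inner statement runs $mn$ times, so the total is $2mn$.

```julia
# Same loop as above, with a counter that adds 2 per inner iteration
# (one multiplication and one addition).
function count_flops(m, n)
    A = rand(m, n); x = rand(n); b = zeros(m)
    flops = 0
    for j = 1:n
        for i = 1:m
            b[i] = b[i] + A[i, j]*x[j]
            flops += 2
        end
    end
    return flops
end

println(count_flops(3, 5))       # prints 30, which is 2*3*5
println(count_flops(100, 100))   # prints 20000, which is 2*100^2
```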

---

For $A \in \mathbb{R}^{n \times n}$ and $x \in \mathbb{R}^n$, computing $b = Ax$ requires $2n^2$ flops.

Doubling $n$ multiplies the flop count by 4, so we expect that computing $b = Ax$ with $n = 2000$ will take about 4 times as long as the same computation with $n = 1000$.

We say that $b = Ax$ is an **order $n^2$** operation:

$$\fbox{Computing $b = Ax$ requires $O(n^2)$ flops.}$$

The exact number of flops matters less than how the number of flops grows as $n$ grows.

---

### Speed test

Write code to compare running times when $n$ is doubled.
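
A minimal sketch of such a test is given below. The helper `time_matvec` is hypothetical. It times the built-in product several times and keeps the smallest value, which tends to reduce noise from other activity on the machine. The $O(n^2)$ flop count suggests a ratio near 4, but the measured ratio can differ because of cache effects and timing noise.

```julia
# Hypothetical helper: best time (in seconds) over a few runs of b = A*x.
function time_matvec(n; trials = 5)
    A = rand(n, n)
    x = rand(n)
    A * x                                   # warm-up call
    return minimum([@elapsed(A * x) for _ in 1:trials])
end

t1 = time_matvec(1000)
t2 = time_matvec(2000)
println("n = 1000: ", t1, " s")
println("n = 2000: ", t2, " s")
println("ratio:    ", t2 / t1)   # the flop count suggests a ratio near 4
```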

---

### Matrix-Matrix Multiplication

Let $A \in \mathbb{R}^{m \times n}$ and $X \in \mathbb{R}^{n \times p}$.

If $B = AX$ then $B \in \mathbb{R}^{m \times p}$ and

$$
b_{ij} = \sum_{k = 1}^n a_{ik} x_{kj}, \quad i = 1,\ldots,m, \quad j = 1,\ldots,p.
$$

That is, $b_{ij}$ is the **inner product** of row $i$ of $A$ and column $j$ of $X$.

Also, each column of $B$ is a **linear combination** of the columns of $A$, with coefficients taken from the corresponding column of $X$.

Each of the $mp$ entries of $B$ costs $2n$ flops, so the total number of flops required is $2mnp$.

If $A, X \in \mathbb{R}^{n \times n}$, then computing $B = AX$ requires $2n^3 = O(n^3)$ flops.

---

### Exercise:

Write a Julia function to multiply two matrices.
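
One possible sketch is a triple loop that follows the formula for $b_{ij}$ directly. The name `matmul` is chosen here for illustration. The loops are ordered so that the innermost index `i` runs down columns of `A` and `B`, which matches Julia's column-major storage.

```julia
# Triple loop computing B = A*X from b_ij = sum over k of a_ik * x_kj.
function matmul(A, X)
    m, n = size(A)
    n2, p = size(X)
    n == n2 || error("inner dimensions must match")
    B = zeros(m, p)
    for j = 1:p
        for k = 1:n
            for i = 1:m
                B[i, j] += A[i, k] * X[k, j]   # 2 flops per inner iteration
            end
        end
    end
    return B
end

A = [2.0 1.0 4.0; 3.0 -2.0 1.0]
X = [1.0 0.0; 2.0 1.0; 1.0 -1.0]
println(matmul(A, X))          # prints [8.0 -3.0; 0.0 -3.0]
println(matmul(A, X) ≈ A * X)  # prints true
```

The inner statement runs $mnp$ times, which agrees with the flop count $2mnp$.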

---

### Exercise:

Compare the running time of your function to Julia's built-in matrix-matrix multiplication.
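
A rough comparison could look like the following sketch, which times a hypothetical triple-loop function `my_matmul` against the built-in `A * X` using `@elapsed`. The built-in product typically calls an optimized BLAS library, so it would normally be expected to be faster, but the actual gap depends on the machine and is not quoted here.

```julia
# Hypothetical helper: the triple-loop product from the formula for b_ij.
function my_matmul(A, X)
    m, n = size(A)
    p = size(X, 2)
    B = zeros(m, p)
    for j = 1:p, k = 1:n, i = 1:m
        B[i, j] += A[i, k] * X[k, j]
    end
    return B
end

n = 500
A = rand(n, n)
X = rand(n, n)
my_matmul(A, X); A * X            # warm-up calls so compilation is not timed

t_loop    = @elapsed my_matmul(A, X)
t_builtin = @elapsed A * X
println("triple loop: ", t_loop, " s,  built-in: ", t_builtin, " s")

# The two results should agree up to rounding error.
println(my_matmul(A, X) ≈ A * X)
```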

---

### Block Matrices

Partition $A \in \mathbb{R}^{m \times n}$ and $X \in \mathbb{R}^{n \times p}$ into blocks:

$$
\begin{matrix}
    & & \begin{matrix} n_1 & n_2 \end{matrix} \\
A = & \begin{matrix} m_1 \\ m_2 \end{matrix}
    & \begin{bmatrix}
    A_{11} & A_{12} \\
    A_{21} & A_{22}
    \end{bmatrix},
\end{matrix}
\qquad
\begin{matrix}
    & & \begin{matrix} p_1 & p_2 \end{matrix} \\
X = & \begin{matrix} n_1 \\ n_2 \end{matrix}
    & \begin{bmatrix}
    X_{11} & X_{12} \\
    X_{21} & X_{22}
    \end{bmatrix},
\end{matrix}
$$

where $n = n_1 + n_2$, $m = m_1 + m_2$, and $p = p_1 + p_2$.

If $B = AX$ and

$$
\begin{matrix}
    & & \begin{matrix} p_1 & p_2 \end{matrix} \\
B = & \begin{matrix} m_1 \\ m_2 \end{matrix}
    & \begin{bmatrix}
    B_{11} & B_{12} \\
    B_{21} & B_{22}
    \end{bmatrix},
\end{matrix}
$$

then

$$
\begin{align}
\begin{bmatrix}
    B_{11} & B_{12} \\
    B_{21} & B_{22}
\end{bmatrix} = B = AX &= 
\begin{bmatrix}
    A_{11} & A_{12} \\
    A_{21} & A_{22}
\end{bmatrix}
\begin{bmatrix}
    X_{11} & X_{12} \\
    X_{21} & X_{22}
\end{bmatrix}\\\\
&=
\begin{bmatrix}
    A_{11} X_{11} + A_{12} X_{21} & A_{11} X_{12} + A_{12} X_{22} \\
    A_{21} X_{11} + A_{22} X_{21} & A_{21} X_{12} + A_{22} X_{22} \\
\end{bmatrix}
\end{align}
$$

That is,

$$
B_{ij} = \sum_{k = 1}^2 A_{ik}X_{kj}, \qquad i,j = 1,2.
$$

---

### Exercise:

Verify the above block matrix multiplication formula on random matrices in Julia.
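
A minimal sketch of such a check is shown below. The block sizes are arbitrary choices made for this illustration. The full matrices are assembled from random blocks, and the block formula is compared with the ordinary product up to rounding error.

```julia
# Block sizes (arbitrary choices for illustration).
m1, m2 = 2, 3
n1, n2 = 4, 1
p1, p2 = 2, 2

A11, A12 = rand(m1, n1), rand(m1, n2)
A21, A22 = rand(m2, n1), rand(m2, n2)
X11, X12 = rand(n1, p1), rand(n1, p2)
X21, X22 = rand(n2, p1), rand(n2, p2)

# Assemble the full matrices from their blocks.
A = [A11 A12; A21 A22]
X = [X11 X12; X21 X22]

# Apply the block formula B_ij = A_i1 * X_1j + A_i2 * X_2j.
B_blocks = [(A11*X11 + A12*X21) (A11*X12 + A12*X22);
            (A21*X11 + A22*X21) (A21*X12 + A22*X22)]

println(B_blocks ≈ A * X)   # prints true
```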

---

### Use of Block Matrix Operations to Decrease Data Movement Delays

Suppose $n = rs$.

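If $A, X \in \mathbb{R}^{n \times n}$ are partitioned into $s \times s$ block matrices where each block is of size $r \times r$, then $B = AX$ can be computed block by block as in the following code, which assumes that `n`, `r`, `s`, `A`, and `X` are already defined:

```julia
# Assumes A and X are n-by-n matrices and n = r*s.
blk(i) = (i-1)*r+1 : i*r       # index range of the i-th block
B = zeros(n, n)
for i = 1:s
    for j = 1:s
        for k = 1:s
            # Bij = Bij + Aik * Xkj
            B[blk(i), blk(j)] += A[blk(i), blk(k)] * X[blk(k), blk(j)]
        end
    end
end
```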



We can do the following operations in parallel:

1. Multiply $A_{ik} X_{kj}$ in $O(r^3)$ time.

2. Fetch the next blocks $A_{i,k+1}$ and $X_{k+1,j}$ in $O(r^2)$ time.

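Since the work in step 1 grows like $r^3$ while the data movement in step 2 grows only like $r^2$, we should be able to choose $r$ large enough that step 2 takes less time than step 1.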

Therefore, the CPU will not have to wait to load data from the memory into the cache.

Cache size needs to be taken into account, so $r$ cannot be too large. We need to be able to store the following 5 submatrices in cache at the same time:

$$
B_{ij}, \quad A_{ik}, \quad X_{kj}, \quad A_{i,k+1}, \quad X_{k+1,j}.
$$

In addition, multiple processors can compute different submatrices $B_{ij}$ at the same time.
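
The idea of computing different blocks $B_{ij}$ at the same time can be sketched with Julia's built-in threading. The function `blocked_matmul` below is a hypothetical illustration, not a tuned implementation: it assigns different block rows of $B$ to different threads, so no two threads write to the same block. It assumes $r$ divides $n$, and it copies each block when slicing, so it does not control what actually sits in cache.

```julia
using Base.Threads: @threads, nthreads

# Sketch: blocked product B = A*X for n-by-n matrices with n = r*s,
# with different block rows of B assigned to different threads.
function blocked_matmul(A, X, r)
    n = size(A, 1)
    s = div(n, r)
    n == r * s || error("r must divide n")
    blk(i) = (i-1)*r+1 : i*r          # index range of the i-th block
    B = zeros(n, n)
    @threads for i = 1:s              # each thread writes only its own block rows
        for j = 1:s, k = 1:s
            B[blk(i), blk(j)] += A[blk(i), blk(k)] * X[blk(k), blk(j)]
        end
    end
    return B
end

n, r = 512, 64
A = rand(n, n)
X = rand(n, n)
println("threads available: ", nthreads())
println(blocked_matmul(A, X, r) ≈ A * X)   # prints true
```

With a single thread the loop simply runs serially. Starting Julia with several threads, for example by setting the `JULIA_NUM_THREADS` environment variable, lets the block rows be processed concurrently.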

---