# Lecture 3: Matrix multiplication, scalar product, unitary matrices

## Syllabus
**Week 1:** Intro week, floating point, vector norms, matrix multiplication


## Recap of the previous lecture
- Concept of floating point
- Basic vector norms ($p$-norm, Manhattan distance, 2-norm, Chebyshev norm)
- A short demo on $L_1$-norm minimization
- Concept of forward/backward error

## Today's lecture
Today we will talk about:
- Matrices
- Matrix multiplication
- Matrix norms, operator norms
- Unitary matrices, unitarily invariant norms
- Concept of block algorithms for NLA: why and how.
- Complexity of matrix multiplication

## Matrix-by-matrix product

Consider composition of two linear operators:

1. $y = Bx$
2. $z = Ay$

Then, $z = Ay =  A B x = C x$, where $C$ is the **matrix-by-matrix product**.

A product of an $n \times k$ matrix $A$ and a $k \times m$ matrix $B$ is an $n \times m$ matrix $C$ with the elements  
$$
   c_{ij} = \sum_{s=1}^k a_{is} b_{sj}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m 
$$
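
As a small sanity check of the definition, here is a minimal sketch (the sizes and the random test data are arbitrary choices made for illustration) that builds $C$ entry by entry and compares it with NumPy's built-in product and with the composition of the two linear maps:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 3, 4, 2                      # illustrative sizes
A = rng.standard_normal((n, k))
B = rng.standard_normal((k, m))

# Entry-by-entry definition: c_ij = sum_s a_is * b_sj
C = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        C[i, j] = sum(A[i, s] * B[s, j] for s in range(k))

# Composition of the two linear maps: A (B x) should equal C x
x = rng.standard_normal(m)
print(np.allclose(C, A @ B), np.allclose(A @ (B @ x), C @ x))
```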

## Complexity of MM
The naive algorithm for MM takes $\mathcal{O}(nkm)$ operations, i.e. $\mathcal{O}(n^3)$ for square $n \times n$ matrices.   

Matrix-by-matrix product is the **core** for almost all efficient algorithms in linear algebra.  

Basically, all NLA algorithms are reduced to a sequence of matrix-by-matrix products, so an efficient implementation of MM speeds up the numerical algorithms built on top of it by roughly the same factor.  

However, implementing MM is not easy at all!

## Efficient implementation for MM
Is it easy to multiply a matrix by a matrix?  

The answer is: **no**, if you want it as fast as possible,  

using the computers that are at hand.

## Demo
Let us do a short demo and compare the `np.dot()` procedure (which in my case uses MKL) with a hand-written matrix-by-matrix routine in Python and with its Cython version (this also serves as a very short introduction to Cython).

In [18]:
import numpy as np
def matmul(a, b):
    """Naive triple-loop matrix-by-matrix product."""
    n = a.shape[0]
    k = a.shape[1]
    m = b.shape[1]
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for s in range(k):
                c[i, j] += a[i, s] * b[s, j]
    return c

In [19]:
%load_ext Cython

In [None]:
%%cython
import numpy as np
def cython_matmul(double [:, :] a, double[:, :] b):
    cdef int n = a.shape[0]
    cdef int k = a.shape[1]
    cdef int m = b.shape[1]
    cdef int i
    cdef int j 
    cdef int s
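    c = np.zeros((n, m))
    cdef double[:, :] cview = c  # typed memoryview sharing memory with c
    for i in range(n):
        for j in range(m):
            for s in range(k):
                # index the memoryview, not the Python object, to get C-speed loops
                cview[i, j] += a[i, s] * b[s, j]
    return c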

Then we just compare computational times.

Guess the answer.

In [None]:
n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)
%timeit c = matmul(a, b)
%timeit cf = cython_matmul(a, b)
%timeit c = np.dot(a, b)

Why is it so?   
There are two important issues:

- Computers are more and more parallel (multicore, graphics processing units)
- The memory pyramid: there is a whole hierarchy of levels 

## Memory architecture
<img width=80% src="Computer_Memory_Hierarchy.svg">
Fast memory is small, bigger memory is slow. 

- Data fits into the  fast memory:  
  load all data, compute
- Data does not fit into the fast memory:  
  load data by chunks, compute, load again

We need to reduce the number of read/write operations!  

This is typically achieved in efficient implementations of the BLAS libraries, one of which (Intel MKL) we now use.

## BLAS
Basic Linear Algebra Subprograms (**BLAS**) have three levels:
1. BLAS-1, operations like $c = a + b$
2. BLAS-2, operations like matrix-by-vector product
3. BLAS-3, matrix-by-matrix product

What is the principal difference between them?

The main difference is the number of operations vs. the number of input data!

1. BLAS-1: $\mathcal{O}(n)$ data, $\mathcal{O}(n)$ operations
2. BLAS-2: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^2)$ operations
3. BLAS-3: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^3)$ operations
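
The ratio of operations to data can be made concrete with a rough count. The sketch below is only an illustration: the helper name `blas_intensity` is hypothetical, and the counts assume one representative routine per level and ignore caches entirely.

```python
def blas_intensity(n):
    """Rough (flops / data entries touched) for one representative routine per BLAS level."""
    levels = {
        "BLAS-1 (c = a + b)": (n, 3 * n),               # n additions; read a, b, write c
        "BLAS-2 (y = A x)": (2 * n**2, n**2 + 2 * n),   # n^2 multiply-adds; A, x, y
        "BLAS-3 (C = A B)": (2 * n**3, 3 * n**2),       # n^3 multiply-adds; A, B, C
    }
    return {name: flops / data for name, (flops, data) in levels.items()}

for name, ratio in blas_intensity(1000).items():
    print(f"{name}: about {ratio:.2f} flops per entry touched")
```

Only for BLAS-3 does the ratio grow with $n$, which is why it is the level that can hide slow memory behind computation.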

**Remark**: the quest for an $\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not over yet.

Strassen gives $\mathcal{O}(n^{\log_2 7}) = \mathcal{O}(n^{2.807...})$   

World record $\mathcal{O}(n^{2.37})$ [Reference](http://arxiv.org/pdf/1401.7714v1.pdf)  

The constant is unfortunately too big to make it practical!
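
To see where a sub-cubic exponent can come from, here is a minimal sketch of one recursion level of Strassen's scheme, assuming square matrices of even size (the function name `strassen_one_level` is hypothetical). It uses seven block products instead of eight; applying the same trick recursively leads to the exponent $\log_2 7$.

```python
import numpy as np

def strassen_one_level(A, B):
    """One recursion level of Strassen's scheme for square matrices of even size."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven block products (a naive block algorithm needs eight)
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    # Recombine using additions only
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
print(np.allclose(strassen_one_level(A, B), A @ B))
```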

## Memory hierarchy
How can we use the memory hierarchy? 
<img src="Computer_Memory_Hierarchy.svg" width = 70%>

Break the matrix into blocks! ($2 \times 2$ is an **illustration**)  

$
   A = \begin{bmatrix}
         A_{11} & A_{12} \\
         A_{21} & A_{22}
        \end{bmatrix}$, $B = \begin{bmatrix}
         B_{11} & B_{12} \\
         B_{21} & B_{22}
        \end{bmatrix}$

Then,  

$AB$ = $\begin{bmatrix}A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\
            A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\end{bmatrix}.$
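
The block formula can be checked numerically. The following is a small sketch with an arbitrary $6 \times 6$ example split into $3 \times 3$ blocks (the sizes and the split point `h` are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
h = 3  # split point; the blocks only need compatible sizes

A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

# Assemble AB from the four block expressions
AB_blocks = np.block([
    [A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
    [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22],
])
print(np.allclose(AB_blocks, A @ B))
```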


If $A_{11}, B_{11}$ and their product fit into the cache memory (which is 1024 Kb for the [Haswell Intel Chip](http://en.wikipedia.org/wiki/List_of_Intel_Core_i7_microprocessors#.22Haswell-H.22_.28MCP.2C_quad-core.2C_22_nm.29)), then each of them has to be loaded into the cache only once.  

### Key point
The number of reads/writes is reduced by a factor of $\sqrt{M}$, where $M$ is the cache size.  

- Have to do linear algebra in terms of blocks! 
- So, you cannot even do Gaussian elimination as usual (or you just suffer a 10x performance loss)
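
The structure of a blocked product can be sketched as follows. This is only an illustration of the loop order: the function name `blocked_matmul` and the block size `bs` are hypothetical, and in NumPy each small block product is itself a BLAS call, so no speedup over `np.dot` should be expected from this code.

```python
import numpy as np

def blocked_matmul(a, b, bs=64):
    """Tiled matrix product; bs is an assumed block size meant to fit the cache."""
    n, k = a.shape
    m = b.shape[1]
    c = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for s in range(0, k, bs):
                # One small block product; slicing handles ragged edge blocks
                c[i:i + bs, j:j + bs] += a[i:i + bs, s:s + bs] @ b[s:s + bs, j:j + bs]
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((70, 50))
b = rng.standard_normal((50, 90))
print(np.allclose(blocked_matmul(a, b, bs=32), a @ b))
```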

## Parallelization
Blocking also has a deep connection with parallel computations.  
Consider adding two vectors:
$$ c = a + b$$
and suppose we have two processors.  

How fast can we go?  

Of course, not faster than twice as fast.

In [12]:
## This demo requires Anaconda distribution to be installed
import mkl
import numpy as np
n = 1000
a = np.random.randn(n)
mkl.set_num_threads(1)
%timeit a + a
mkl.set_num_threads(2)
%timeit a + a

The slowest run took 60.65 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 1.25 µs per loop
The slowest run took 32.57 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.29 µs per loop


In [13]:
## This demo requires Anaconda distribution to be installed
import mkl
n = 500
a = np.random.randn(n, n)
mkl.set_num_threads(1)
%timeit a.dot(a)
mkl.set_num_threads(2)
%timeit a.dot(a)

100 loops, best of 3: 12.2 ms per loop
100 loops, best of 3: 10.9 ms per loop


Typically, two cases are distinguished: 
1. Shared memory (e.g., multicore on every desktop/smartphone)
2. Distributed memory (i.e., each processor has its own memory and can send information through a network)

In both cases, the efficiency is governed by the  

**memory bandwidth**:  

i.e., for BLAS-1,2 routines (like the sum of two vectors), reads/writes take almost all the time.  

For BLAS-3 routines, a more noticeable speedup can be obtained.  

For large-scale clusters (>100 000 cores, see the [Top500 list](http://www.top500.org/lists/)) there is still scaling.

## Communication-avoiding algorithms
A new direction in NLA is **communication-avoiding** algorithms (e.g. for Hadoop-like clusters), designed for the case when you have many computing nodes but communication between them is very slow and limited.  

This requires **absolutely different algorithms**.

This can be an interesting **project** (e.g. do NLA in a cloud).

## Summary of MM part
- MM is the core of NLA. You have to think in block terms, if you want high efficiency
- This is all about computer memory hierarchy
- The $\mathcal{O}(n^{2 + \epsilon})$ complexity hypothesis is neither proven nor disproven yet.

Now we go to **matrix norms**.

## Matrices and norms
How to measure distances between matrices?  

A trivial answer is that there is no big difference between matrices and vectors, and here comes the **Frobenius** norm of the matrix:
$$
  \Vert A \Vert_F = \Big(\sum_{i=1}^n \sum_{j=1}^m |a_{ij}|^2\Big)^{1/2}
$$
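But there is a problem with such a definition: it ignores the matrix-by-matrix product, and being a norm on the vector space of matrices is not yet enough to be a **matrix norm**.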

## Matrix norms
$\Vert \cdot \Vert$ is called a **matrix norm** if it is a vector norm on the linear space of $n \times m$ matrices, and it also is consistent with the matrix-by-matrix product, i.e.

$$\Vert A B \Vert \leq \Vert A \Vert \Vert B \Vert$$

The multiplicative property is needed in many places, for example in the estimates for the error of solution of linear systems (we will cover this subject later).   

Can you think of some matrix norms, and is Frobenius norm a matrix norm?
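
A numerical experiment can help with this question. The sketch below (random test matrices and a hypothetical helper `max_abs` are assumptions made for illustration) checks the inequality $\Vert AB \Vert \leq \Vert A \Vert \Vert B \Vert$ for the Frobenius norm on random samples, and also for the "largest entry" norm $\max_{ij} |a_{ij}|$. A passing check is evidence, not a proof; a single failing example is a disproof.

```python
import numpy as np

rng = np.random.default_rng(2)

# Frobenius norm: look for violations of ||AB|| <= ||A|| ||B|| on random samples
violations = 0
for _ in range(1000):
    A = rng.standard_normal((5, 4))
    B = rng.standard_normal((4, 6))
    lhs = np.linalg.norm(A @ B, 'fro')
    rhs = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
    violations += int(lhs > rhs * (1 + 1e-12))
print("Frobenius violations found:", violations)

# Largest-entry norm: a vector norm on matrices that fails the inequality
def max_abs(M):
    return np.abs(M).max()

J = np.ones((2, 2))
print(max_abs(J @ J), "vs", max_abs(J) * max_abs(J))
```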

## Operator norms
The most important class of norms is the class of **operator norms**. Mathematically, they are defined as

$$
    \Vert A \Vert_* = \sup_{x \ne 0} \frac{\Vert A x \Vert_*}{\Vert x \Vert_*},
$$

where $\Vert \cdot \Vert_*$ is a **vector norm**.
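
The supremum in this definition can be probed by brute force. The following sketch (the helper `ratio_lower_bound` and the number of random trials are illustrative assumptions) samples random vectors and reports the largest observed ratio, which can only underestimate the operator norm returned by `np.linalg.norm`:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))

def ratio_lower_bound(A, p, trials=2000):
    """Largest sampled value of ||Ax||_p / ||x||_p: a lower bound for the supremum."""
    X = rng.standard_normal((A.shape[1], trials))   # each column is one test vector
    ratios = np.linalg.norm(A @ X, ord=p, axis=0) / np.linalg.norm(X, ord=p, axis=0)
    return ratios.max()

for p in (1, 2, np.inf):
    print(p, ratio_lower_bound(A, p), "<=", np.linalg.norm(A, p))
```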

## Matrix p-norms
It is not difficult to show that an operator norm is a matrix norm. Among all operator norms, the matrix $p$-norms are the most used; they are induced by the vector $p$-norm. Among all $p$-norms, three are the most common:  

- $p = 2, \quad$ spectral norm, denoted by $\Vert A \Vert_2$.
- $p = \infty, \quad \Vert A \Vert_{\infty} = \max_i \sum_j |A_{ij}|$.
- $p = 1, \quad \Vert A \Vert_{1} = \max_j \sum_i |A_{ij}|$.
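
The two explicit formulas are easy to check on a small example. This sketch uses an arbitrary $2 \times 3$ matrix chosen for illustration and compares the row-sum and column-sum expressions with `np.linalg.norm`:

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [-4.0, 5.0, -6.0]])

norm_inf = np.abs(A).sum(axis=1).max()  # largest absolute row sum
norm_1 = np.abs(A).sum(axis=0).max()    # largest absolute column sum

print(norm_inf, np.linalg.norm(A, np.inf))
print(norm_1, np.linalg.norm(A, 1))
```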

## Spectral norm
The spectral norm $\Vert A \Vert_2$ is undoubtedly the most used matrix norm. Unlike the Frobenius norm, it cannot be computed directly from the entries using a simple formula; however, there are efficient algorithms to compute it. It is directly related to the **singular value decomposition** (SVD) of the matrix. It holds

$$
   \Vert A \Vert_2 = \sigma_1(A)
$$

where $\sigma_1(A)$ is the largest singular value of the matrix $A$. We will soon learn all about this. Meanwhile, we can already compute the norm in Python.

In [16]:
import numpy as np
n = 100
a = np.random.randn(n, n) #Random n x n matrix
s1 = np.linalg.norm(a, 2) #Spectral
s2 = np.linalg.norm(a, 'fro') #Frobenius
s3 = np.linalg.norm(a, 1) #1-norm
s4 = np.linalg.norm(a, np.inf) #It was tricky to find the infinity (np.inf)
print('Spectral:', s1, 'Frobenius:', s2, '1-norm', s3, 'infinity', s4)

Spectral: 19.7089521303 Frobenius: 100.396005941 1-norm 93.761637223 infinity 96.0482085766
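
The relation $\Vert A \Vert_2 = \sigma_1(A)$ can also be checked directly. The sketch below uses an arbitrary random rectangular matrix, and additionally relies on the standard fact (not derived in this lecture) that $\sigma_1^2$ is the largest eigenvalue of $A^* A$:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 30))

sigma = np.linalg.svd(A, compute_uv=False)   # singular values in descending order
lam_max = np.linalg.eigvalsh(A.T @ A).max()  # largest eigenvalue of A^T A (real case)

print(np.linalg.norm(A, 2), sigma[0], np.sqrt(lam_max))
```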


## Scalar product
If a norm is a measure of distance, then the scalar product takes the angle into account.  

The scalar product is defined as
$$
   (x, y) = \sum_{i=1}^n \overline{x}_i y_i,
$$
where $\overline{x}$ denotes the *complex conjugate* of $x$. The Euclidean norm is then

$$
   \Vert x \Vert^2 = (x, x),
$$

or it is said that the norm is **induced** by the scalar product.  

**Remark**. The angle between two vectors is defined as
$$
   \cos \phi = \frac{(x, y)}{\Vert x \Vert \Vert y \Vert} 
$$

An important property of the scalar product is the **Cauchy-Bunyakovski inequality**:
$$
    |(x, y)| \leq \Vert x \Vert \Vert y \Vert,
$$
and thus the angle between two vectors is defined properly.

The scalar product can be written as a matrix-by-matrix product  
$$
  (x, y) = x^* y,
$$
where $^*$ is a **conjugate transpose** of the matrix:  
$$
B = A^*, \quad B_{ij} = \overline{A_{ji}}.
$$
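
In NumPy the conjugation has to be requested explicitly. Here is a minimal sketch with random complex test vectors (an illustrative choice) that compares the definition, the product $x^* y$, and `np.vdot`, which conjugates its first argument; it also checks the Cauchy-Bunyakovski inequality, where for complex vectors the left-hand side is taken in absolute value:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
y = rng.standard_normal(4) + 1j * rng.standard_normal(4)

by_definition = np.sum(np.conj(x) * y)   # sum of conj(x_i) * y_i
as_product = x.conj().T @ y              # x^* y (for 1-D arrays .T is a no-op)
with_vdot = np.vdot(x, y)                # np.vdot conjugates its first argument

# Euclidean norms induced by the scalar product
norm_x = np.sqrt(np.vdot(x, x).real)
norm_y = np.sqrt(np.vdot(y, y).real)
print(abs(by_definition) <= norm_x * norm_y)
```

Note that `np.dot(x, y)` does not conjugate, so it computes a different quantity for complex vectors.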

## Norm conservation
For stability it is really important that the error does not grow. Suppose that you know a vector $x$ only approximately,  

$$
  \Vert x - \widehat{x} \Vert \leq \varepsilon.
$$
Let the final result be (some) linear transformation of $x$:  
$$
   y = Ux, \quad \widehat{y} = U \widehat{x}.
$$
Then the difference between $\widehat{y}$ and $y$ can be estimated as  

$$
   \Vert y - \widehat{y} \Vert = \Vert U ( x - \widehat{x}) \Vert \leq \Vert U \Vert \varepsilon.
$$
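
This estimate can be observed numerically. In the sketch below the transformation $U$ is a generic random matrix and the perturbation size $10^{-6}$ is arbitrary (both are illustrative assumptions); the Euclidean norm is used for vectors and the spectral norm for $U$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20
U = rng.standard_normal((n, n))             # a generic (non-unitary) transformation
x = rng.standard_normal(n)
x_hat = x + 1e-6 * rng.standard_normal(n)   # perturbed input

eps = np.linalg.norm(x - x_hat)             # input error
err = np.linalg.norm(U @ x - U @ x_hat)     # output error
bound = np.linalg.norm(U, 2) * eps          # ||U||_2 * eps
print(err, "<=", bound)
```

For a generic matrix the bound is larger than the input error, so the error may be amplified.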

## Matrices, preserving the norm
The question is for which kind of matrices the norm of the vector **will not change**.  

For the Euclidean norm this produces a very important class of matrices: **unitary** (or orthogonal in the real case) matrices.

## Unitary (orthogonal) matrices
Let $U$ be an $n \times r$ matrix, and $\Vert U z \Vert = \Vert z \Vert$ for all $z$. This can happen if and only if  

$$
   U^* U = I,
$$

where $I$ is the **identity matrix**.

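Indeed, $$\Vert Uz \Vert^2 = (Uz, Uz) = (Uz)^* Uz = z^* (U^* U) z,$$ 

which equals $z^* z = \Vert z \Vert^2$ for all $z$ if and only if $U^* U = I$ (the matrix $U^* U - I$ is Hermitian, and a Hermitian matrix whose quadratic form vanishes for all $z$ is zero).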
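
A matrix with this property is easy to generate for experiments. The sketch below is an illustration only: it obtains an $n \times r$ matrix with orthonormal columns from the QR decomposition of a random matrix (a technique not covered yet in this lecture) and checks both the identity and the norm preservation in the real case:

```python
import numpy as np

rng = np.random.default_rng(7)
n, r = 6, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # n x r, orthonormal columns

z = rng.standard_normal(r)
print(np.allclose(U.T @ U, np.eye(r)))                        # U^* U = I (r x r)
print(np.isclose(np.linalg.norm(U @ z), np.linalg.norm(z)))   # norm is preserved
print(np.allclose(U @ U.T, np.eye(n)))                        # U U^* is not I when r < n
```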

## Unitary matrices
In the real case, when $U^* = U^{\top}$, the matrix is called orthogonal. 

Are there many unitary matrices? First of all, **a product of two unitary matrices is a unitary matrix:**  

$$(UV)^* UV = V^* (U^* U) V = V^* V = I,$$

thus, if we give some non-trivial examples of unitary matrices, we will be able to build other unitary transformations as their products.

## Examples of unitary matrices
There are two important classes of unitary matrices; using them, we can build any unitary matrix:
1. Householder matrices
2. Givens (Jacobi) matrices

## Householder matrices
A Householder matrix is a matrix of the form  
$$H = I - 2 vv^*,$$
where $v$ is an $n \times 1$ column vector and $v^* v = 1$. Can you show that $H$ is unitary?  It is also a reflection. <img src="householder.jpeg">  
A simple proof: $H^* H = (I - 2 vv^*)(I - 2 v v^*) = I - 4 vv^* + 4 v (v^* v) v^* = I$.
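
The same facts can be verified numerically. This is a minimal sketch with a random complex unit vector $v$ (an illustrative choice); it checks that $H$ is unitary, that $v$ is mapped to $-v$, and that a vector orthogonal to $v$ is left unchanged, which is exactly what a reflection does:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5
v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
v /= np.linalg.norm(v)                        # enforce v^* v = 1

H = np.eye(n) - 2.0 * np.outer(v, v.conj())   # H = I - 2 v v^*

x = rng.standard_normal(n)
w = x - v * np.vdot(v, x)                     # component of x orthogonal to v

print(np.allclose(H.conj().T @ H, np.eye(n)))  # H is unitary
print(np.allclose(H @ v, -v))                  # v is reflected
print(np.allclose(H @ w, w))                   # the orthogonal part is untouched
```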

## Givens (Jacobi) matrix
A Givens matrix is a matrix  

$$
    A = \begin{bmatrix}
          \cos \alpha & \sin \alpha \\
          -\sin \alpha & \cos \alpha
        \end{bmatrix},
$$

which is a rotation. In the general case, we select an $(i, j)$ plane and rotate only in it:  

$$
    x'_i = \cos \alpha x_i + \sin \alpha x_j, \quad x'_j = -\sin \alpha x_i + \cos\alpha x_j,
$$

with all other $x_k$, $k \ne i, j$, remaining unchanged.
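
A general Givens rotation can be sketched as follows. The helper name `givens`, the 0-based indexing, and the choice of the angle are illustrative assumptions; here the angle is picked so that the rotation zeroes out the $j$-th component, which is one typical use of such matrices.

```python
import numpy as np

def givens(n, i, j, alpha):
    """n x n rotation by angle alpha in the (i, j) coordinate plane (0-based indices)."""
    G = np.eye(n)
    c, s = np.cos(alpha), np.sin(alpha)
    G[i, i], G[i, j] = c, s
    G[j, i], G[j, j] = -s, c
    return G

x = np.array([3.0, 1.0, 4.0, 2.0])
i, j = 0, 2
alpha = np.arctan2(x[j], x[i])    # angle that rotates x[j] into x[i]
G = givens(4, i, j, alpha)
y = G @ x

print(y)                                  # components 1 and 3 are untouched
print(np.allclose(G.T @ G, np.eye(4)))    # G is orthogonal
```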

## Summary on unitary matrices
- Unitary matrices preserve the norm
- There are two "basic" classes of unitary matrices, Householder and Givens.
- Every unitary matrix can be represented as a product of those.

## Take home message
- Matrix multiplication, idea of blocking, memory hierarchy
- Scalar product, unitary matrices, basic classes of unitary matrices

## Next week
- TA week
- You got PSet 1
- Think of course projects (e.g. oil & gas, power networks, social networks)
- There is a course on Mathematics of the Internet going now at Skoltech, you are welcome to visit.

##### Questions?
