#### **Non-Negative Matrix Factorization**

Daniele Coppola Viktor Gsteiger Maša Nešić Matteo Oldani



Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich

### The Algorithm

$$V = W \times H$$

$$H_{[i,j]}^{n+1} = H_{[i,j]}^n \cdot \frac{((W^n)^T V)_{[i,j]}}{((W^n)^T W^n H^n)_{[i,j]}}$$

$$W_{[i,j]}^{n+1} = W_{[i,j]}^{n} \cdot \frac{(V(H^{n+1})^{T})_{[i,j]}}{(W^{n} H^{n+1} (H^{n+1})^{T})_{[i,j]}}$$







# **Cost analysis**

#### Asymptotic runtime complexity :

$$O(m \cdot n \cdot r + m \cdot r^2 + n \cdot r^2 + n^2 \cdot r + m \cdot n)$$

Cost - measured by counting operations inside the code:

```
2 * m * n * r + 5 * m * n + 3 +

number_of_iterations * (6 * m * n * r +

3 * m * r * r +

3 * n * r * r +

5 * m * n +

2 * m * r +

2 * n * r + 3)
```

 $V: m \times n$   $W: m \times r$  $H: r \times n$ 

## **Baseline Implementations**





#### **Matrix Multiplication Optimizations**

- Blocking (block size = 16)
- Loop unrolling to have 8 independent lines of computations
- Vectorization of the computation

Performance Plots - Matrix Multiplication



int idx b = k \* B n col + jj;for (kk = k; kk < k + nB; kk++){ a0 = mm256 set1 pd(A[Aii + kk]);a1 = mm256 set1 pd(A[Aii + A n col + kk]);b0 = \_mm256\_load\_pd((double \*) &B[idx\_b]);  $b1 = mm256 load_pd((double *) &B[idx_b + 4]);$ b2 = mm256 load pd((double \*) &B[idx b + 8]); $b3 = _mm256\_load\_pd((double *) \&B[idx_b + 12]);$  $r0 = _{mm256_{fmadd_pd(a0, b0, r0)}};$  $r1 = _mm256_fmadd_pd(a0, b1, r1);$  $r2 = _mm256_fmadd_pd(a0, b2, r2);$  $r3 = _{mm256_fmadd_pd(a0, b3, r3)};$ r4 = mm256 fmadd pd(a1, b0, r4); $r5 = mm256 fmadd_pd(a1, b1, r5);$  $r6 = _{mm256}fmadd_pd(a1, b2, r6);$  $r7 = _mm256_fmadd_pd(a1, b3, r7);$  $idx_b += B_n_{col}$ ;

Input matrix size

## **Matrix Padding**

- R multiple of block size
  - Outperforming BLAS
- R not multiple of block size
  - BLAS outperforming us

- Matrix padding to the block size
  - More computation but better microarchitecture usage
  - Outperforming BLAS
  - Padding with 0s



#### Algorithmic optimization 1 – Reuse entries of W

■ Reduces the number of read accesses to (W<sup>n</sup>)<sup>T</sup> in the computation of H<sup>n+1</sup>, and to (H<sup>n+1</sup>)<sup>T</sup> in the computation of W<sup>n+1</sup> by half



#### Algorithmic optimization 4 – Reuse block of W across 2 iterations

■ The calculated block of W<sup>n+1</sup> is immediately used in the calculation of (W<sup>n+1</sup>)<sup>T</sup>V and (W<sup>n+1</sup>)<sup>T</sup> W<sup>n+1</sup>



#### **Runtime and Performance Plots**



#### Flops/cycle



NNMF (double precision) on i5-6600K, 3.5 GHz, r = 16

#### **Roofline Plot**



NNMF (double precision) on i5-6600K, 3.5 GHz, r = 16

### Speedup



NNMF (double precision) on i5-6600K, 3.5 GHz, r = 16

#### Average speedup:

- Optimization 61 23.95
- Optimization 60 22.07
- Optimization 48 21.05
- Optimization 51 20.49
- Optimization 54 19.24
- Optimization 24 17.84
- Optimization 42 7.71

# **Benchmarking plot**



# **Questions?**





# **Appendix**

#### **Runtime and Performance Plots - scalar**



NNMF (double precision) on i5-6600K, 3.5 GHz, r = 16

## **Performance Scalar Optimizations**



## **Runtime Scalar Optimizations**



# **Roofline Scalar Optimizations**



#### **Performance Scalar Algorithmic Optimizations**



## **Runtime Scalar Algorithmic Optimizations**



### **Roofline Scalar Algorithmic Optimizations**



# **Performance BLAS Optimizations**



## **Runtime BLAS Optimizations**



# **Roofline BLAS Optimizations**



#### **Performance Vectorized Optimizations**



NNMF (double precision) on i5-6600K, 3.5 GHz, r = 16

### **Runtime Vectorized Optimizations**



# **Roofline Vectorized Optimizations**



## **Performance Vectorized Algorithmic Optimizations**



#### **Runtime Vectorized Algorithmic Optimizations**



#### **Roofline Vectorized Algorithmic Optimizations**









Rectangular matrices - Optimization 61



Rectangular matrices - Optimization 54





#### **3D Performance Plot Baseline 2**



# **3D Performance Plot Optimization 47**



# **3D Performance Plot Optimization 61**



#### Algorithmic optimization 2 – Interleave matrix multiplications

- Avoids having to store and reread the numerator and the denominator
- The same approach is used in the calculation of W<sup>n+1</sup> as well

numerator - N
$$H_{[i,j]}^{n+1} = H_{[i,j]}^{n} \cdot \frac{((W^n)^T V)_{[i,j]}}{((W^n)^T W^n H^n)_{[i,j]}}$$
denominator\_left - DI
$$denominator - D$$

```
V: m \times n

W: m \times r

H: r \times n

Dl: r \times r

N: r \times n

D: r \times n
```

```
N = matrix_mul(W^T, V)
D = matrix_mul(Dl, H) contain triple loops
for(i=0; i<r; i++)
  for (j=0; j< n; j++)
    H[i][j] = H[i][j] * N[i][j] / D[i][j]
for(i=0; i<r; i++)
  for(j=0; j<n; j++)
     accumulator N = 0
    accumulator D = 0
    for (k=0; k < m; k++)
       accumulator_N += W^{T}[i][k] * V[k][j]
       if(k < r)
         accumulator D += Dl[i][k] * H[k][j]
    H[i][j] = H[i][j] * accumulator N / accumulator D
```

#### Algorithmic optimization 3 – Reuse block of H

■ The calculated block of  $H^{n+1}$  is immediately used in the calculation of  $V(H^{n+1})^T$  and  $H^{n+1}(H^{n+1})^T$ 



# **Additional analysis**

- opt47 reduces cache misses by blocking
- opt53 reduces cache misses reusing W
- opt60 computes the approximation matrix WH one block at the time
- opt61 adopts both opt53 and 60

#### LLC MISSES(M)



V: 1256x1256

W: 1256x16

H: 16x1256