In [1]:
import numpy as np
import numpy.linalg as la
import scipy.signal as scisig

np.set_printoptions(linewidth=100)

### Introduction

We implement [Winograd Convolution](https://arxiv.org/pdf/1509.09308.pdf) and analyze its cost.

We focus on $F(m,r)$ filters, where $m$ is the output size and $r$ is the filter size.

### 1-D Filter F(2,3)

Given a filter of size $r=3$, we'd like to have an output of size $m=2$. This can be done with the algorithm

$$Y = A^T \Big[(Gg) \odot (B^Td) \Big]$$

We define $B$, $G$, and $A$ below (`B23` and `A23` are written row-wise as $B^T$ and $A^T$ and then transposed). Here, $d$ is the data and $g$ is the filter.

In [2]:
B23 = np.asarray([
    [1, 0,-1, 0],
    [0, 1, 1, 0],
    [0,-1, 1, 0],
    [0, 1, 0,-1]
]).T

G23 = np.asarray([
    [ 1,  0, 0],
    [.5, .5,.5],
    [.5,-.5,.5],
    [ 0, 0,  1]
])

A23 = np.asarray([
    [1,1,1,0],
    [0,1,-1,-1]
]).T
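As a quick sanity check of the formula above, the following minimal sketch applies $F(2,3)$ to a single 4-element tile and compares it with a direct sliding dot product. The matrices are re-declared so the snippet runs on its own; the helper name `winograd_f23_tile` and the example data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Transform matrices for F(2,3), written row-wise (same values as B23, G23, A23).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # B^T, 4x4
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # G, 4x3
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # A^T, 2x4

def winograd_f23_tile(d_tile, g_filt):
    """Two outputs of a 3-tap filter from one 4-element input tile (hypothetical helper)."""
    return AT @ ((G @ g_filt) * (BT @ d_tile))

d_tile = np.array([1.0, 2.0, 3.0, 4.0])   # arbitrary example data
g_filt = np.array([1.0, 0.0, -1.0])       # arbitrary example filter

y_wino = winograd_f23_tile(d_tile, g_filt)
# Direct sliding dot product: y[i] = sum_j d[i+j] * g[j]
y_direct = np.array([d_tile[i:i + 3] @ g_filt for i in range(2)])
print(y_wino, y_direct)   # both give [-2, -2]
```

Note that the algorithm computes a sliding dot product (correlation), which is why the filter is reversed when comparing against `np.convolve`.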

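### Full 1-D Convolution with F(2,3)

To compute a full 1-D convolution, we slide the $F(2,3)$ filter over the zero-padded data with a stride of $m=2$, so consecutive input tiles overlap by $r-1=2$ elements.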

In [3]:
n = 2**10

g = np.random.random(3)
d = np.random.random(n)
y = np.zeros(0)

d2 = np.append(np.zeros(2),d)
d2 = np.append(d2, np.zeros(2))
for i in range(len(d2)//2-1):
    yTemp = np.dot(A23.T, (G23 @ g) * (B23.T @ d2[i*2:(i*2)+4]) )
    y = np.append(y,yTemp)

convLib = np.convolve(d,g[::-1])

print("Error:",la.norm(y-convLib)/la.norm(convLib))

Error: 1.3951435055142857e-16
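The loop above can also be written without repeated `np.append` calls. The sketch below is one possible vectorized variant, assuming NumPy 1.20 or newer for `sliding_window_view`; the function name `winograd_conv1d_f23` is a hypothetical helper introduced for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Same values as B23, G23, A23; re-declared so the snippet is self-contained.
B = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float).T
G = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])
A = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float).T

def winograd_conv1d_f23(d, g):
    """Full 1-D sliding dot product of d with a 3-tap g via tiled F(2,3).

    Assumes len(d) is even so the output splits evenly into tiles of m=2.
    """
    m, r = 2, 3
    assert len(g) == r and len(d) % m == 0
    d_pad = np.pad(d, r - 1)                              # zero-pad both ends
    tiles = sliding_window_view(d_pad, m + r - 1)[::m]    # (num_tiles, 4), stride m
    U = G @ g                                             # filter transform, done once
    V = tiles @ B                                         # B^T d for every tile (row-wise)
    return ((U * V) @ A).ravel()                          # A^T [...] per tile, concatenated

rng = np.random.default_rng(0)
d = rng.random(64)
g = rng.random(3)
y = winograd_conv1d_f23(d, g)
ref = np.convolve(d, g[::-1])
print("Error:", np.linalg.norm(y - ref) / np.linalg.norm(ref))
```

Each row of `tiles` is one overlapping input tile, so the two transforms become single matrix products over all tiles at once.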


### Cost Analysis

We count flops as a pair (additions, multiplications) and treat each transform as a dense matrix-vector product. $A^T$, the Winograd domain inversion, is a $2 \times 4$ matrix. $G$, the filter transformation, is a $4 \times 3$ matrix. $B$, the data transformation, is a $4 \times 4$ matrix.

So for each of the small filters, we require
1. $(4 \times 3, 4 \times 3)$ flops for the filter transformation (this only needs to be done once per filter)
1. $(4 \times 4, 4 \times 4)$ flops for the data transformation
1. $(0,4)$ flops for the pointwise multiplication. These are the $\alpha = m + r - 1$ multiplications that the Winograd paper highlights
1. $(2 \times 4, 2 \times 4)$ flops for the transformation back

So in total, excluding the one-time filter transformation, we need $(24,28)$ flops for each $F(2,3)$ filter. Let $\alpha = (m+r-1)$. We can generalize the cost in flops as:

$$( \ \alpha ( \alpha + m) \ , \ \alpha(\alpha + m + 1 ) \ )$$

We purposely leave out the one-time filter transformation, which costs $(\alpha r, \ \alpha r)$. For our specific $F(2,3)$, this turns out to be

$$(24,28) \textrm{ flops}$$

A quick comparison shows this is worse than the $(4,6)$ flops of the direct method, which needs $m(r-1)$ additions and $mr$ multiplications. However, we will see later that the 2-D case with many filters does yield savings in multiplications.
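The counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch under the same dense-transform assumption used in this section; the function names are hypothetical.

```python
def winograd_1d_flops(m, r, include_filter_transform=False):
    """(additions, multiplications) per F(m, r) tile, counting transforms as dense."""
    alpha = m + r - 1
    adds = alpha * alpha + alpha * m            # data transform + inverse transform
    mults = alpha * alpha + alpha + alpha * m   # same, plus alpha pointwise products
    if include_filter_transform:                # one-time setup cost (alpha*r, alpha*r)
        adds += alpha * r
        mults += alpha * r
    return adds, mults

def direct_1d_flops(m, r):
    """(additions, multiplications) for m outputs of an r-tap filter computed directly."""
    return m * (r - 1), m * r

print(winograd_1d_flops(2, 3))   # (24, 28)
print(direct_1d_flops(2, 3))     # (4, 6)
```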

### 2-D Convolution F(2x2, 3x3)

In 2-D, the transforms are applied along both dimensions:

$$Y = A^T \Big[(GgG^T) \odot (B^TdB) \Big]A$$
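To make the 2-D formula concrete, here is a minimal single-tile sketch for $F(2 \times 2, 3 \times 3)$, checked against a direct sliding dot product. The matrices repeat the $F(2,3)$ values so the snippet is self-contained; `winograd_f2x2_3x3_tile` is a hypothetical helper name.

```python
import numpy as np

# Same values as B23, G23, A23.
B = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float).T
G = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])
A = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float).T

def winograd_f2x2_3x3_tile(d_tile, g_filt):
    """2x2 output tile from one 4x4 input tile and a 3x3 filter."""
    U = G @ g_filt @ G.T       # transformed filter, 4x4
    V = B.T @ d_tile @ B       # transformed data, 4x4
    return A.T @ (U * V) @ A   # 16 pointwise products, then inverse transform

rng = np.random.default_rng(0)
d_tile = rng.random((4, 4))
g_filt = rng.random((3, 3))

y_wino = winograd_f2x2_3x3_tile(d_tile, g_filt)
# Direct method: each output is the sum of a 3x3 patch times the filter.
y_direct = np.array([[np.sum(d_tile[i:i + 3, j:j + 3] * g_filt) for j in range(2)]
                     for i in range(2)])
print("Error:", np.linalg.norm(y_wino - y_direct) / np.linalg.norm(y_direct))
```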

We implement Winograd's minimal filtering algorithm for 2-D convolution below.

In [8]:
# assume N=1 image, K=1 filter, C=1 channel
def simpleWinogradAlg(g,d,m,B,G,A):
    """
    @g: 2d.numpy_array as square filter
    @d: 2d.numpy_array as data
    @m: int as output of FIR filter F(m,r)
    @B, @G, @A: 2d.numpy_array transform matrices for F(m,r)
    """
    N = K = C = 1
    h,w = d.shape
    r = g.shape[0]
    
    assert(g.shape[0] == g.shape[1])
    
    # output size of the sliding filter; it must split evenly into m x m tiles
    h -= r-1; w -= r-1
    assert(h%m == 0 and w%m == 0)
    
    P = (h//m)*(w//m) # num of tiles
    a = m+r-1 # input tile size
    
    dChunks = np.zeros((C,P,a,a))
    for c in range(C):
        for y in range(h//m):
            for x in range(w//m):
                b = y*(w//m) + x
                dChunks[c,b] = d[(y*m):(y*m)+a, (x*m):(x*m)+a]
    
    print(a,K,C)
    U = np.zeros((a,a,K,C))
    for k in range(K):
        for c in range(C):
            uInterm = np.dot(G, np.dot(g, G.T))
            for e in range(a):
                for v in range(a):
                    U[e,v,k,c] = uInterm[e,v]
            
    print(a,C,P)
    V = np.zeros((a,a,C,P))
    for b in range(P):
        for c in range(C):
            vInterm = np.dot(B.T, np.dot(dChunks[c,b], B))
            for e in range(a):
                for v in range(a):
                    V[e,v,c,b] = vInterm[e,v]
            
    M = np.zeros((a,a,K,P))
    for e in range(a):
        for v in range(a):
            M[e,v] = np.dot(U[e,v], V[e,v])
            
    Y = np.zeros((K,P,m,m))
    for k in range(K):
        for b in range(P):
            mInterm = np.zeros((a,a))
            for e in range(a):
                for v in range(a):
                    mInterm[e,v] = M[e,v,k,b]         
            Y[k,b] = np.dot(A.T, np.dot(mInterm, A))
        
    Ynew = np.zeros((K,h,w))
    for k in range(K):
        for y in range(h//m):
            for x in range(w//m):
                b = y*(w//m) + x
                Ynew[k,y*m:(y+1)*m, x*m:(x+1)*m] = Y[k,b]
    return Ynew

def padImage(g,r):
    h,w = g.shape
    g2 = np.zeros((2*r-2 + h,2*r-2 + w))
    g2[r-1:r-1+h,r-1:r-1+w] = g
    return g2

def revMatrix(M):
    n1,n2 = M.shape
    return np.eye(n1)[::-1] @ M @ np.eye(n2)[::-1]

In [9]:
n = 2**7
r = 3
f = np.random.random((r,r))
g = np.random.random((n,n))
%time c = scisig.convolve2d(f,g)

g2 = revMatrix(g)
g2 = padImage(g2,r)
%time cWino = simpleWinogradAlg(f,g2,2,B23,G23,A23)[0]
cWino = revMatrix(cWino)

print("Error:",la.norm(c - cWino)/la.norm(c))

CPU times: user 356 ms, sys: 2.05 ms, total: 358 ms
Wall time: 359 ms
4 1 1
4 1 4225
CPU times: user 87.5 ms, sys: 927 µs, total: 88.4 ms
Wall time: 88.8 ms
Error: 1.9056215930084894e-16


### Cost Analysis

Assume a filter of size $R \times R$ and an input of size $H \times W$. Let $N$ be the number of images, $K$ be the number of filters, and $C$ be the number of channels. Define $\alpha = R+m-1$, where $m \times m$ is the output tile size.

#### Direct

The direct method multiplies the filter element-wise against each input patch and sums the products.

In total, we expect approximately $HW$ of these element-wise multiplies with the $R \times R$ filter (slightly more due to padding, which we treat as negligible). For each output, we then add up the products. We again expect approximately $HW$ of these sums, each requiring $R^2-1$ additions.

For one filter, one image, and one channel, we bound the number of additions and multiplications, denoted $( \cdot, \cdot)$, by

$$(0,HWR^2) + (HWR^2,HWR^2)$$

Over the entire algorithm, the direct algorithm will incur a cost of

$$(CNKHWR^2,2 \cdot CNKHWR^2) $$
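To see where the per-output counts come from, the sketch below runs a naive direct method for a single filter, image, and channel and tallies the operations as it goes. It covers only the unpadded ("valid") outputs, and `direct_conv2d_counted` is a hypothetical helper written for illustration.

```python
import numpy as np

def direct_conv2d_counted(d, g):
    """Sliding dot product of d with a square filter g, plus operation tallies.

    Returns (output, additions, multiplications) for the unpadded outputs only.
    """
    R = g.shape[0]
    H_out, W_out = d.shape[0] - R + 1, d.shape[1] - R + 1
    out = np.zeros((H_out, W_out))
    adds = mults = 0
    for y in range(H_out):
        for x in range(W_out):
            products = d[y:y + R, x:x + R] * g   # R^2 multiplications
            out[y, x] = products.sum()           # R^2 - 1 additions
            mults += R * R
            adds += R * R - 1
    return out, adds, mults

d = np.arange(36.0).reshape(6, 6)   # arbitrary 6x6 example input
g = np.ones((3, 3))                 # arbitrary 3x3 example filter
out, adds, mults = direct_conv2d_counted(d, g)
print(adds, mults)   # 128 144
```

Both tallies stay below the $HWR^2$-based bounds used in this section, since the number of outputs is at most $HW$.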

#### Winograd

The algorithm involves four steps:
1. Filter Transformation
1. Data Transformation
1. Multiplication in Winograd Space
1. Winograd Space Inversion

The **filter transformation** is done once per filter per channel. Each filter+channel combination incurs an $(\alpha \times R \times R)$ and an $(R \times \alpha \times \alpha)$ matrix multiplication.

The **data transformation** is done once per image per channel, independently of the filters. For each tile of an image+channel combination, we perform two $\alpha \times \alpha \times \alpha$ matrix multiplications.

The **multiplication** is done once per image. For each of the $\alpha^2$ positions in Winograd space, it involves a $K \times C \times P$ matrix multiplication, where $P$ is the number of tiles. This step also sums over the channels.

The **inversion** is done once per image per filter, after the channels have been summed. For each tile+filter combination, we perform an $(m \times \alpha \times \alpha)$ and an $(\alpha \times \alpha \times m)$ matrix multiplication.

In total, the cost in additions and multiplications is

$$ KC \Big( M(\alpha,R,R) + M(R,\alpha,\alpha) \Big) + 2\frac{CNHW}{m^2}M(\alpha, \alpha, \alpha) + \alpha^2NM(K,C,\frac{HW}{m^2}) + \frac{KNHW}{m^2} \Big( M(m, \alpha, \alpha) + M(\alpha, \alpha, m ) \Big) $$

where $M(\cdot,\cdot,\cdot)$ represents the cost of a matrix-matrix multiplication with those axis sizes, and $P = \frac{HW}{m^2}$ is the number of tiles per image.

Since $m \ge 1$ and $R \ge 1$, we have $\alpha \ge R$ and $\alpha \ge m$, so $\alpha$ dominates $R$ and $m$. A cleaner upper bound is

$$\le 2KC \cdot M(R,\alpha,\alpha) + 2\frac{CNHW}{m^2} \cdot M(\alpha, \alpha, \alpha) + \alpha^2N \cdot M(K,C,\frac{HW}{m^2}) + 2\frac{KNHW}{m^2} \cdot M(\alpha, \alpha, m ) $$

In [6]:
def simpleWinogradAlg_FLOPS(h,w,r,m,B,G,A,N,K,C,matmul):
    assert(h%m == 0 and w%m == 0)
    
    h-=2; w-=2  # equals r-1 only for r=3; a general version would subtract r-1
    
    P = (h//m)*(w//m) # num of tiles
    a = m+r-1 # input tile size
    
    dChunks = np.zeros((C,P,a,a))
    
    flops = np.zeros(2)
    
    U = np.zeros((a,a,K,C))
    g = np.zeros((r,r))
    temp = K * C * ( matmul(G,g) + matmul(g,G.T))
    flops += temp
            
    V = np.zeros((a,a,C,P))
    temp = N * P * C * ( matmul(B.T, dChunks[0,0]) + matmul(dChunks[0,0],B) )
    flops += temp
            
    M = np.zeros((a,a,K,P))
    # (K,C) x (C,P)
    temp = N * a * a * matmul(U[0,0],V[0,0])
    flops += temp
            
    Y = np.zeros((K,P,m,m))
    mInterm = np.zeros((a,a))
    # inversion is done once per image per filter per tile
    temp = N * K * P * ( matmul(A.T, mInterm) + matmul(mInterm, A) )
    flops += temp
        
    # reorder
    return flops

def directMatmul(M1,M2):
    assert(M1.shape[1] == M2.shape[0])
    return M1.shape[0] * M1.shape[1] * M2.shape[1]

def direct2DConvFlops(C,N,K,H,W,R):
    return np.asarray([C*N*K*H*W*(R**2),2*C*N*K*H*W*(R**2)])

In [7]:
N = 1
K = 96 # higher means more savings for multiplications
C = 3

R = 11
M = 2
p = 8
H = M**p
W = M**p

a = M + R - 1
B = np.zeros((a,a)).T
G = np.zeros((a,R))
A = np.zeros((M,a)).T
# h,w,r,m,B,G,A,N,K,C,matmul
winoFlops = simpleWinogradAlg_FLOPS(H,W,R,M,B,G,A,N,K,C,directMatmul)
# C,N,K,H,W,R
directFlops = direct2DConvFlops(C,N,K,H,W,R)

lowest = H*W*K*N*C

print("Direct Flops:",directFlops)
print("Winograd Flops:",winoFlops)

print("Winograd Savings:",directFlops/winoFlops)

Direct Flops: [2283798528 4567597056]
Winograd Flops: [1.7288329e+09 1.7288329e+09]
Winograd Savings: [1.32100594 2.64201188]


### Comparing the Costs

We will use the filter $F(2 \times 2, 3 \times 3)$ and $\alpha = 3+2-1=4$.

In the direct method, we have a cost of about 

$$(9KCNHW,18 \cdot KCNHW) $$

For Winograd, we have

$$2KC \cdot M(3,4,4) + \frac{CNHW}{2} \cdot M(4,4,4) + 16N \cdot M(K,C,\frac{HW}{4}) + \frac{KNHW}{2} \cdot M(4,4,2) $$

Using just direct matrix-multiplication, the cost here is then

$$\le 2KC \cdot 48 + 32CNHW + 4KCNHW + 16KNHW$$

Now let $C=3$, as for a typical RGB input. The direct cost becomes

$$(27KNHW,54KNHW) $$

and the Winograd cost becomes

$$288K + 96NHW + 12KNHW + 16KNHW$$

Given a sufficiently large filter count $K \gg 1$ and image size $HW$, the first two terms become negligible, and we can expect a speedup of approximately $\frac{54}{28} \approx 2$ in the number of multiplications.
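The approach to $\frac{54}{28}$ can be checked numerically. The sketch below evaluates the $F(2 \times 2, 3 \times 3)$ multiplication counts for growing $K$. As an assumption, it counts the data transformation once per channel, as the FLOP-counting code above does, and the function names are hypothetical.

```python
def direct_mults(K, C, N, H, W):
    """Multiplication bound for the direct method with a 3x3 filter."""
    return 18 * K * C * N * H * W

def winograd_mults(K, C, N, H, W):
    """Multiplication bound for F(2x2, 3x3) with dense transforms."""
    filter_tf = 2 * K * C * 48            # 2KC * M(3,4,4)
    data_tf = 32 * C * N * H * W          # assumption: counted once per channel
    gemm = 4 * K * C * N * H * W          # 16N * M(K, C, HW/4)
    inverse_tf = 16 * K * N * H * W       # (KNHW/2) * M(4,4,2)
    return filter_tf + data_tf + gemm + inverse_tf

for K in (1, 8, 96, 1024):
    ratio = direct_mults(K, 3, 1, 256, 256) / winograd_mults(K, 3, 1, 256, 256)
    print(f"K={K:5d}  direct/winograd = {ratio:.3f}")
```

For small $K$ the data transformation dominates and the ratio is below 1; it climbs toward $\frac{54}{28}$ as $K$ grows.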

### Winograd's Saving

In the paper, we see this algorithm requires only $16$ multiplications compared to $36$. This is a saving of about $36/16 = 2.25$, which is approximately what we see in practice.

In general, for an $F(m \times m, r \times r)$ filter, we should see a saving of $\frac{m^2r^2}{(m+r-1)^2}$ in multiplications. The larger $m$ and $r$ are, the better the savings we should expect. However, larger tiles can incur more additions and multiplications by constants in the transforms, as well as numerical instability. For future work, we can investigate how to improve this using our current work with Toom-Cook.
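As a small illustration of how this ratio grows with the tile size, the sketch below evaluates the formula for a few $(m, r)$ pairs. The helper name and the chosen pairs are illustrative assumptions.

```python
def winograd_mult_saving(m, r):
    """Ratio of direct to Winograd multiplications for F(m x m, r x r)."""
    return (m * r) ** 2 / (m + r - 1) ** 2

for m, r in [(2, 3), (4, 3), (6, 3)]:
    print(f"F({m}x{m}, {r}x{r}): {winograd_mult_saving(m, r)}")
```

For $F(2 \times 2, 3 \times 3)$ this gives $36/16 = 2.25$, matching the figure quoted from the paper.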