We illustrate with Numba's `njit` decorator, which compiles a subset of Python to machine code for speed. We implement the dot product with `njit`.

In [1]:
from numba import njit


In [2]:
@njit
def dot(a, b):
    res = 0.
    for i in range(len(a)):
        res += a[i] * b[i]
    return res


In [3]:
from numpy import array


The following shows the difference between the first call, during which the function is compiled, and the second call, which reuses the compiled code.

In [4]:
%time dot(array([1., 2, 3]), array([2., 3, 4]))


CPU times: user 239 ms, sys: 0 ns, total: 239 ms
Wall time: 238 ms


20.0

In [5]:
%time dot(array([1., 2, 3]), array([2., 3, 4]))


CPU times: user 18 µs, sys: 0 ns, total: 18 µs
Wall time: 18.8 µs


20.0
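
Outside a notebook, the same first-call versus second-call comparison can be sketched with `time.perf_counter`. This is a minimal sketch, not a benchmark: the helper name `timed` is hypothetical, and the `try`/`except` fallback to a no-op decorator is an assumption added only so the snippet still runs where Numba is not installed (in that case nothing is compiled and both calls take similar time).

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to a no-op decorator so the sketch still runs
    # without Numba; nothing is compiled in that case.
    def njit(func):
        return func


@njit
def dot(a, b):
    res = 0.
    for i in range(len(a)):
        res += a[i] * b[i]
    return res


def timed(func, *args):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


x = np.array([1., 2., 3.])
y = np.array([2., 3., 4.])

# The first call includes compilation when Numba is present.
first_result, first_elapsed = timed(dot, x, y)
# The second call reuses the already compiled version.
second_result, second_elapsed = timed(dot, x, y)

print(f"first call:  {first_elapsed * 1e6:.1f} microseconds")
print(f"second call: {second_elapsed * 1e6:.1f} microseconds")
```

Numba compiles a separate specialization for each combination of argument types, so calling `dot` with, say, integer arrays would pay the compilation cost again on that first call.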

We now use the compiled `dot` as the innermost loop of a matrix multiplication, so that only two of the three loops run in Python.

In [6]:
import torch

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):          # rows of a
        for j in range(bc):      # columns of b
            # the innermost loop over k runs inside the compiled `dot`
            c[i, j] = dot(a[i, :], b[:, j])
    return c
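
To make the loop structure explicit, here is a minimal sketch of the same decomposition using NumPy only. The names `dot_py` and `matmul_rows_cols` are hypothetical, and `dot_py` is a plain-Python stand-in for the compiled `dot`, so this shows the structure rather than the speedup.

```python
import numpy as np


def dot_py(a, b):
    # Innermost loop: the part that the compiled `dot` replaces.
    res = 0.
    for i in range(len(a)):
        res += a[i] * b[i]
    return res


def matmul_rows_cols(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br, "inner dimensions must match"
    c = np.zeros((ar, bc))
    for i in range(ar):          # loop over rows of a
        for j in range(bc):      # loop over columns of b
            c[i, j] = dot_py(a[i, :], b[:, j])
    return c


a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.arange(12, dtype=np.float64).reshape(3, 4)
c = matmul_rows_cols(a, b)

# Compare against NumPy's built-in matrix product.
print(np.allclose(c, a @ b))
```

Each entry `c[i, j]` is the dot product of row `i` of `a` with column `j` of `b`, which is why the loop over `k` can be handed off to a single function call.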


In [7]:
from pathlib import Path
from urllib.request import urlretrieve
import gzip, pickle

MNIST_URL='https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/data/mnist.pkl.gz?raw=true'
path_data = Path('data')
path_data.mkdir(exist_ok=True)
path_gz = path_data/'mnist.pkl.gz'

if not path_gz.exists():
    urlretrieve(MNIST_URL, path_gz)

with gzip.open(path_gz, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

from torch import tensor

x_train,y_train,x_valid,y_valid = map(tensor, (x_train,y_train,x_valid,y_valid))

torch.manual_seed(1)
weights = torch.randn(784, 10)
bias = torch.zeros(10)

m1 = x_valid[:5]
m2 = weights
ar, ac = m1.shape 
br, bc = m2.shape

t1 = torch.zeros(ar, bc)

for i in range(ar):         # 5
    for j in range(bc):     # 10
        for k in range(ac): # 784
            t1[i, j] += m1[i, k] * m2[k, j]


We repeat the earlier test of the pure-Python matmul, this time with the `njit` version. Because Numba works with NumPy arrays rather than torch tensors, we first convert `m1` and `m2` to NumPy arrays.

In [8]:
m1a, m2a = m1.numpy(), m2.numpy()


We verify correctness.

In [9]:
from fastcore.test import *

test_close(t1, matmul(m1a, m2a))
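
A tolerance-based comparison is used rather than exact equality because two implementations can round differently, for example when one accumulates in 32-bit floats and the other in 64-bit floats. The sketch below illustrates the idea with NumPy's `allclose`; the random inputs and the tolerances are assumptions chosen for illustration, not the values `test_close` uses.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 784)).astype(np.float32)
b = rng.standard_normal((784, 10)).astype(np.float32)

# The same product computed in two precisions.
c32 = a @ b
c64 = a.astype(np.float64) @ b.astype(np.float64)

# Largest absolute difference between the two results.
max_abs_diff = np.abs(c32 - c64).max()

# Tolerances here are illustrative assumptions.
close = np.allclose(c32, c64, rtol=1e-4, atol=1e-3)
print(max_abs_diff, close)
```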


We then test performance, which appears to be about 100x faster than the pure-Python version.

In [10]:
%timeit -n 50 matmul(m1a, m2a)


266 µs ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
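
In a plain script, where the `%timeit` magic is unavailable, a comparable measurement can be sketched with the standard-library `timeit` module. The `workload` function is a hypothetical stand-in for `matmul(m1a, m2a)`; `number=50` and `repeat=7` mirror the `-n 50` option and the 7 runs reported above.

```python
import statistics
import timeit


def workload():
    # Hypothetical stand-in for matmul(m1a, m2a).
    return sum(i * i for i in range(1000))


# Each entry is the total time for 50 calls; divide to get per-call time.
totals = timeit.repeat(workload, number=50, repeat=7)
per_call = [t / 50 for t in totals]

mean_us = statistics.mean(per_call) * 1e6
std_us = statistics.pstdev(per_call) * 1e6
print(f"{mean_us:.1f} us +/- {std_us:.1f} us per loop (7 runs, 50 loops each)")
```

When timing a Numba function this way, it may be worth calling it once beforehand so that compilation time is not included in the first run.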
