# Hello Loopy: Computing a Rank-One Matrix

## Setup Code

In [1]:
import numpy as np
import pyopencl as cl
import pyopencl.array
import pyopencl.clrandom
import loopy as lp

# Importing this name opts in to the 2018.2 version of the Loopy kernel language.
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

In [2]:
ctx = cl.create_some_context(interactive=True)
queue = cl.CommandQueue(ctx)

In [3]:
n = 1024
a = cl.clrandom.rand(queue, n, dtype=np.float32)
b = cl.clrandom.rand(queue, n, dtype=np.float32)

## The Initial Kernel

In [4]:
knl = lp.make_kernel(
    "{[i,j]: 0<=i,j<n}",
    "c[i, j] = a[i]*b[j]")

In [5]:
# Print the generated OpenCL code each time the kernel is invoked.
knl = lp.set_options(knl, write_cl=True)
evt, (mat,) = knl(queue, a=a, b=b)
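For reference, the loop nest that the kernel describes can be written out in plain Python. This is only a sketch of the semantics, not what Loopy generates; the name `rank_one_reference` is made up for illustration.

```python
def rank_one_reference(a, b):
    """Plain-Python version of the loop nest c[i, j] = a[i]*b[j]."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):        # iname "i"
        for j in range(n):    # iname "j"
            c[i][j] = a[i] * b[j]
    return c


c = rank_one_reference([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(c[1])  # [8.0, 10.0, 12.0]
```

Every later transformation in this notebook should leave this result unchanged; only the way the loops are executed differs.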

## Transforming kernels: Loop Splitting

Next, transform the kernel. As a first example, split a loop into fixed-length "chunks".

In [6]:
isplit_knl = knl
isplit_knl = lp.split_iname(isplit_knl, "i", 4)

evt, (mat,) = isplit_knl(queue, a=a, b=b)
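A rough Python picture of what splitting `i` by 4 means, assuming the usual decomposition `i = 4*i_outer + i_inner`. The function name is hypothetical, and the loop order that the code generator actually picks may differ from this sketch.

```python
def rank_one_split(a, b, chunk=4):
    """Emulate the loop structure after splitting i into i_outer/i_inner."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    n_chunks = (n + chunk - 1) // chunk  # ceil(n / chunk)
    for i_outer in range(n_chunks):
        for i_inner in range(chunk):
            i = chunk * i_outer + i_inner
            if i < n:  # guard: the last chunk may be only partly filled
                for j in range(n):
                    c[i][j] = a[i] * b[j]
    return c


a = [1.0, 2.0, 3.0, 4.0, 5.0]  # n = 5 is not a multiple of 4
b = [2.0] * 5
print(rank_one_split(a, b)[4])  # [10.0, 10.0, 10.0, 10.0, 10.0]
```

The `if i < n` test is needed because nothing is known about `n`; with `n = 5` the second chunk holds only one valid value of `i`.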

The generated code contains a conditional that guards against running past `n`. Want to get rid of it?

## Transforming kernels: Implementation Tags

Every loop axis ("iname") comes with an *implementation tag* that tells Loopy how to implement that loop, for example `unr` to unroll it.

In [7]:
isplit_knl = knl
isplit_knl = lp.assume(isplit_knl, "n mod 4 = 0")
isplit_knl = lp.split_iname(isplit_knl, "i", 4)
isplit_knl = lp.tag_inames(isplit_knl, {"i_inner": "unr"})

evt, (mat,) = isplit_knl(queue, a=a, b=b)
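As a sketch of the effect, here is the same computation in plain Python with the `n mod 4 = 0` assumption made explicit and the inner loop written out by hand. The function name is hypothetical, and the placement of the `j` loop is an assumption about loop ordering.

```python
def rank_one_unrolled(a, b):
    """Split i by 4, assume 4 divides n, and unroll i_inner by hand."""
    n = len(a)
    if n % 4 != 0:
        raise ValueError("this variant assumes n mod 4 = 0")
    c = [[0.0] * n for _ in range(n)]
    for i_outer in range(n // 4):
        base = 4 * i_outer
        for j in range(n):
            bj = b[j]
            # i_inner = 0, 1, 2, 3 written out; no bounds check needed
            c[base + 0][j] = a[base + 0] * bj
            c[base + 1][j] = a[base + 1] * bj
            c[base + 2][j] = a[base + 2] * bj
            c[base + 3][j] = a[base + 3] * bj
    return c
```

Because the assumption guarantees that every chunk is full, the guard from the previous version disappears.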

You may also want to influence loop ordering.

----
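"Map to GPU hardware axis" is an iname tag as well: `g.N` maps an iname to work-group axis N, and `l.N` maps it to work-item (local) axis N.

Use the `outer_tag` and `inner_tag` arguments of `split_iname` as shortcuts for less typing: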

In [8]:
split_knl = knl
split_knl = lp.split_iname(split_knl, "i", 16,
        outer_tag="g.0", inner_tag="l.0")
split_knl = lp.split_iname(split_knl, "j", 16,
        outer_tag="g.1", inner_tag="l.1")

evt, (mat,) = split_knl(queue, a=a, b=b)
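To make the mapping concrete, the sketch below emulates the 2D grid serially in Python. The names `work_item` and `launch` are hypothetical; on a real device all work-items would run in parallel rather than in nested loops.

```python
import itertools

GROUP = 16  # work-group edge length, matching the split factor above


def work_item(a, b, c, gid, lid):
    """Body run by one work-item; gid/lid play the roles of g.N and l.N."""
    n = len(a)
    i = GROUP * gid[0] + lid[0]  # i_outer -> g.0, i_inner -> l.0
    j = GROUP * gid[1] + lid[1]  # j_outer -> g.1, j_inner -> l.1
    if i < n and j < n:  # guard for work-items past the matrix edge
        c[i][j] = a[i] * b[j]


def launch(a, b):
    """Serially emulate the grid of work-groups and work-items."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    n_groups = (n + GROUP - 1) // GROUP  # ceil(n / GROUP) groups per axis
    for gid in itertools.product(range(n_groups), repeat=2):
        for lid in itertools.product(range(GROUP), repeat=2):
            work_item(a, b, c, gid, lid)
    return c
```

The loops over `i` and `j` are gone from the per-item body: each work-item computes exactly one entry of `c`, identified by its group and local indices.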

## Transforming kernels: Leveraging Data Reuse

Better! But still not much data reuse.

In [9]:
fetch1_knl = knl

fetch1_knl = lp.add_prefetch(fetch1_knl, "a", fetch_outer_inames="i")
fetch1_knl = lp.add_prefetch(fetch1_knl, "b", fetch_outer_inames="i,j")

evt, (mat,) = fetch1_knl(queue, a=a, b=b)
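Roughly, these two prefetches amount to loading each operand into a temporary at the loop level named by `fetch_outer_inames`. The Python below is a sketch under that assumption; the temporary names are illustrative and need not match the ones Loopy picks.

```python
def rank_one_scalar_prefetch(a, b):
    """Emulate fetching a[i] once per i and b[j] once per (i, j)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a_fetch = a[i]          # fetched inside the i loop only
        for j in range(n):
            b_fetch = b[j]      # fetched inside both the i and j loops
            c[i][j] = a_fetch * b_fetch
    return c
```

In this sequential form `a[i]` is read once and reused for the whole row, which is the reuse the transformation is after.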

But this is useless for the GPU version. (demo)

---

We would like to fetch the entire "access footprint" of a loop.

In [10]:
fetch_knl = split_knl

fetch_knl = lp.add_prefetch(fetch_knl, "a", ["i_inner"], default_tag="l.auto")
fetch_knl = lp.add_prefetch(fetch_knl, "b", ["j_inner"], default_tag="l.auto")

fetch_knl = lp.add_inames_for_unused_hw_axes(fetch_knl, "id:*fetch*")
evt, (mat,) = fetch_knl(queue, a=a, b=b)
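The idea can be sketched in Python by emulating one work-group at a time: the group first copies the 16 entries of `a` and `b` that its work-items will touch, then every work-item reads from those copies. Function names are hypothetical, and the plain lists stand in for GPU local memory.

```python
TILE = 16  # matches the split factor used for i and j


def work_group(a, b, c, g0, g1):
    """Emulate one work-group: fetch the footprint, then compute."""
    n = len(a)
    # Footprint of i_inner and j_inner within this group (0.0 pads the edge).
    a_tile = [a[TILE * g0 + k] if TILE * g0 + k < n else 0.0 for k in range(TILE)]
    b_tile = [b[TILE * g1 + k] if TILE * g1 + k < n else 0.0 for k in range(TILE)]
    # On a real device a barrier would separate the fetch from the use.
    for l0 in range(TILE):
        for l1 in range(TILE):
            i = TILE * g0 + l0
            j = TILE * g1 + l1
            if i < n and j < n:
                c[i][j] = a_tile[l0] * b_tile[l1]


def rank_one_tiled(a, b):
    """Run every work-group of the emulated grid in turn."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    n_groups = (n + TILE - 1) // TILE
    for g0 in range(n_groups):
        for g1 in range(n_groups):
            work_group(a, b, c, g0, g1)
    return c
```

Ignoring hardware caches, a full 16-by-16 group now reads 32 values from `a` and `b` instead of 512, since each fetched value is shared by 16 work-items.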

## Transforming kernels: Eliminating Conditionals

All those conditionals take time to evaluate!

In [11]:
sfetch_knl = knl
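# slabs=(0, 1): generate separate code for the first 0 and the last 1
# iterations of the outer iname, so the bulk can skip the bounds checks.
sfetch_knl = lp.split_iname(sfetch_knl, "i", 16,
        outer_tag="g.0", inner_tag="l.0", slabs=(0,1))
sfetch_knl = lp.split_iname(sfetch_knl, "j", 16,
        outer_tag="g.1", inner_tag="l.1", slabs=(0,1))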

sfetch_knl = lp.add_prefetch(sfetch_knl, "a", ["i_inner"], default_tag="l.auto")
sfetch_knl = lp.add_prefetch(sfetch_knl, "b", ["j_inner"], default_tag="l.auto")
sfetch_knl = lp.add_inames_for_unused_hw_axes(sfetch_knl, "id:*fetch*")

evt, (mat,) = sfetch_knl(queue, a=a, b=b)
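The slab idea can be sketched in Python for the `i` axis alone: all chunks except the last are known to be full and run without a guard, while the last chunk keeps it. This is an illustration with a hypothetical function name, not the code Loopy emits.

```python
def rank_one_slabs(a, b, chunk=16):
    """Split i by `chunk`, handling the last outer iteration separately."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    n_chunks = (n + chunk - 1) // chunk  # ceil(n / chunk)

    # Bulk slab: every chunk here is full, so no bounds check is needed.
    for i_outer in range(n_chunks - 1):
        for i_inner in range(chunk):
            i = chunk * i_outer + i_inner
            for j in range(n):
                c[i][j] = a[i] * b[j]

    # Tail slab: the single last chunk may be partial, so keep the guard.
    if n_chunks > 0:
        i_outer = n_chunks - 1
        for i_inner in range(chunk):
            i = chunk * i_outer + i_inner
            if i < n:
                for j in range(n):
                    c[i][j] = a[i] * b[j]
    return c
```

The conditional is still evaluated, but only in the one slab that can actually run past `n`.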