Tutorial to TVM
============

TVM is a framework that allows you to write high-performance tensor kernels productively.
The key idea behind TVM's productivity is to decouple the kernel description from the code organization.
Before diving into the usage of TVM, we first show how to install it.

Installation
-------------

If you are using an AWS SageMaker notebook with CUDA 10 installed, [@icemelon9](https://github.com/icemelon9/) provides pip packages:
````
pip install https://haichen-tvm.s3-us-west-2.amazonaws.com/tvm_cu100-0.6.dev0-cp36-cp36m-linux_x86_64.whl
pip install https://haichen-tvm.s3-us-west-2.amazonaws.com/topi-0.6.dev0-py3-none-any.whl
````

If you are running a different environment or want full control over the installation, follow the build-from-source instructions below. First, TVM has the following prerequisites:
0. TVM targets Python 3 (3.5); [conda](https://www.anaconda.com/distribution/) is recommended.
1. Make sure your `cmake` version is *3.2* or newer to successfully generate build files.
2. TVM relies on the LLVM JIT to run CPU code. Download a suitable LLVM binary distribution [here](http://releases.llvm.org/download.html). 6.x and 8.x are recommended (versions older than 4.x are not supported, and 6.x has binary-compatibility problems with gcc).
3. For GPU users, CUDA should be installed.

Once the prerequisites are in place, clone the repository and build it.
````
git clone --recursive https://github.com/dmlc/tvm.git && cd tvm
mkdir build && cd build
touch config.cmake # start from an empty config; see cmake/config.cmake in the repo root for all options
echo "set(USE_LLVM path/to/llvm-config)" >> config.cmake
echo "set(USE_CUDA path/to/cuda)" >> config.cmake # GPU users only
cmake ..
make -j`nproc`
````

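After the build is done, add TVM's Python packages to your `PYTHONPATH` (the paths below assume the repository was cloned into `$HOME/tvm-dev`; adjust them to your checkout):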
````
export PYTHONPATH="$HOME/tvm-dev/tvm/python":$PYTHONPATH
export PYTHONPATH="$HOME/tvm-dev/tvm/topi/python":$PYTHONPATH
````
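
Before moving on, it can help to confirm that Python actually sees the package. The snippet below is a minimal sketch, not part of TVM: `find_tvm` is a hypothetical helper name, and it only assumes that the package is importable under the name `tvm`.

```python
import importlib
import importlib.util


def find_tvm():
    """Return TVM's version string, or None if `tvm` is not importable.

    `find_tvm` is a hypothetical helper written for this tutorial.
    """
    # find_spec only searches the import path (which includes the
    # PYTHONPATH entries); it does not import the package yet.
    if importlib.util.find_spec("tvm") is None:
        return None
    tvm = importlib.import_module("tvm")
    # Fall back to a placeholder in case no version attribute is exposed.
    return getattr(tvm, "__version__", "unknown")


version = find_tvm()
if version is None:
    print("tvm not found: check the PYTHONPATH entries above")
else:
    print("found tvm", version)
```

If the package is found but the import itself raises, the shared library built in the previous step is the first thing to check.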

Now, let's take a quick tour of TVM.

Op
----
In TVM, every tensor is an `Op` node. An `Op` has its own inputs, outputs, and semantics.
Writing a TVM application essentially means describing the relationships among `Op`s.
In this tutorial, we show you how to write `ComputeOp`s and `HybridOp`s.

ComputeOp
--------------

As mentioned above, the key to enabling both high performance and productivity is to decouple the kernel description from the code organization. In this section, we first demonstrate how to describe a computation, then show how to manipulate the decoupled code organization.

### A simple vector addition
We first declare the input tensors. The value of each dimension of the shape can either be a constant or a symbolic expression.

```` python
import tvm

# You can also make a TVM symbolic shape.
# We leave this as an exercise.
# Hint: Try `var_n = tvm.var('n')`, and put `var_n` into argument list too.
vec_n = 128

# placeholder(shape: tuple, dtype: str, name: str)
vec_a = tvm.placeholder((vec_n, ), dtype='float32', name='a')
vec_b = tvm.placeholder((vec_n, ), dtype='float32', name='b')
````

After the vectors are declared, we can describe the computation.
Most deep learning kernels are equations or formulae wrapped in levels of loops.
For example, the code for "vector addition" looks like:
```` C
for (int x = 0; x < n; ++x)
  c[x] = a[x] + b[x];
````
Thus, TVM lets you describe a computation by simply specifying the output shape and the formula for each element.

```` python
# compute(shape: tuple, fcompute: callable)
vec_c = tvm.compute((vec_n, ), lambda x: vec_a[x] + vec_b[x], name='c')

# Exercise: write an outer product! Try a 2-D shape!
````
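
To make the semantics concrete, here is a plain-Python model of what a shape-plus-formula description means. It is only a sketch for intuition: `compute_reference` is a hypothetical helper, it evaluates eagerly on Python lists, and it does not reflect how TVM actually builds or executes an Op.

```python
import itertools


def compute_reference(shape, fcompute):
    """Evaluate fcompute at every index of `shape`; return {index_tuple: value}.

    Hypothetical helper: an eager, pure-Python stand-in for the idea behind
    a shape-plus-formula description. It is not a TVM API.
    """
    ranges = [range(extent) for extent in shape]
    return {idx: fcompute(*idx) for idx in itertools.product(*ranges)}


n = 4
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# 1-D: vector addition, c[x] = a[x] + b[x]
c = compute_reference((n,), lambda x: a[x] + b[x])
print([c[(x,)] for x in range(n)])  # [11.0, 22.0, 33.0, 44.0]

# 2-D: outer product, o[x, y] = a[x] * b[y]
o = compute_reference((n, n), lambda x, y: a[x] * b[y])
print(o[(1, 2)])  # 60.0
```

The number of lambda arguments follows the number of dimensions in the shape, which is the same convention the 2-D exercise relies on.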

Once the compute description is done, we can compile this Op by creating a schedule for it and building that schedule.
TVM provides two interfaces for this: `tvm.lower`, used here for sanity checks, and `tvm.build`, used for execution.

```` python
sch = tvm.create_schedule(vec_c.op)

# Arguments: schedule, argument list, options.
# simple_mode=True indicates that this lowering is only for a sanity check.
# Functions compiled by TVM use a side-effect style:
# the output tensor is passed in the argument list instead of being returned.
ir = tvm.lower(sch, [vec_a, vec_b, vec_c], simple_mode=True)

print(ir)
````

````
produce c {
  for (x, 0, 128) {
    c[x] = (a[x] + b[x])
  }
}
````



A schedule provides interfaces for manipulating the code organization. A well-tuned code organization can be orders of magnitude faster than the vanilla code. At the same time, such tuning aggressively refactors the code, which also badly hurts its readability and modularity.

To keep the code organization clear, TVM provides schedule primitives such as `split`, `unroll`, `vectorize`, and `reorder` for you to manipulate it. The example below demonstrates both `split` and `vectorize`. Please refer to [TVM schedule primitives](https://docs.tvm.ai/tutorials/language/schedule_primitives.html#sphx-glr-tutorials-language-schedule-primitives-py) for more details.

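```` python
# Every Op exposes its loop axes via `axis` (and `reduce_axis` for reductions).
x = vec_c.op.axis[0]

# Split x into an outer loop and an inner loop of 8 iterations.
xo, xi = sch[vec_c].split(x, 8)

# Vectorize the inner loop.
sch[vec_c].vectorize(xi)

ir = tvm.lower(sch, [vec_a, vec_b, vec_c], simple_mode=True)

print(ir)
````

````
produce c {
  for (x.outer, 0, 16) {
    c[ramp((x.outer*8), 1, 8)] = (a[ramp((x.outer*8), 1, 8)] + b[ramp((x.outer*8), 1, 8)])
  }
}
````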
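
The effect of `split` can be checked by hand. The sketch below, in plain Python with no TVM involved, rewrites the loop the way the lowered IR above does: an outer loop of 16 iterations and an inner block of 8 contiguous elements, which is the access pattern that `ramp((x.outer*8), 1, 8)` denotes. The function names are made up for illustration.

```python
def vector_add_vanilla(a, b):
    """One flat loop: c[x] = a[x] + b[x]."""
    n = len(a)
    c = [0.0] * n
    for x in range(n):
        c[x] = a[x] + b[x]
    return c


def vector_add_split(a, b, factor=8):
    """Same computation with x split so that x = x_outer * factor + x_inner.

    Assumes len(a) is a multiple of `factor`, as in the 128 / 8 example.
    """
    n = len(a)
    assert n % factor == 0
    c = [0.0] * n
    for x_outer in range(n // factor):
        base = x_outer * factor
        # The whole inner block is written at once; the slice stands in for
        # the vector lanes base, base + 1, ..., base + factor - 1.
        c[base:base + factor] = [
            a[base + x_inner] + b[base + x_inner] for x_inner in range(factor)
        ]
    return c


a = [float(i) for i in range(128)]
b = [float(i) for i in range(128)]
assert vector_add_split(a, b) == vector_add_vanilla(a, b)
print(vector_add_split(a, b)[:4])  # [0.0, 2.0, 4.0, 6.0]
```

Both versions compute the same values; only the loop structure differs, which is exactly the part a schedule is meant to control.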



```` python
# schedule, argument list, options
# This API will call lower with simple_mode=False
module = tvm.build(sch, [vec_a, vec_b, vec_c], target='llvm')

import numpy as np

npa = np.arange(128).astype('float32')
npb = np.arange(128).astype('float32')

nda = tvm.ndarray.array(npa)
ndb = tvm.ndarray.array(npb)
ndc = tvm.ndarray.array(np.zeros((128, ), dtype='float32'))

module(nda, ndb, ndc)

# Test the results
tvm.testing.assert_allclose(ndc.asnumpy(), np.arange(128).astype('float32') * 2)
````

### What about matrix multiplication?
Compared to vector addition, the key difference in matrix multiplication is "reduction".
To write a reduction Op, we first need to define the reduction domain.

```` python
m, k, n = 128, 64, 128

red = tvm.reduce_axis((0, k), name='k')
mat_a = tvm.placeholder((m, k), dtype='float32', name='mat_a')
mat_b = tvm.placeholder((k, n), dtype='float32', name='mat_b')

# `axis` can be a single reduce axis (1-D reduction) or a list of reduce axes (2-D or more).
# Exercise: write a 2-D convolution.
mat_c = tvm.compute((m, n), lambda x, y: tvm.sum(mat_a[x, red] * mat_b[red, y], axis=red), name='mat_c')
````

Optimizing GEMM is beyond the scope of this tutorial. If you are interested, refer to [this tutorial](https://docs.tvm.ai/tutorials/optimize/opt_gemm.html) for more details. Here we only sanity-check that this `Op` can be lowered.

```` python
sch = tvm.create_schedule(mat_c.op)

print(tvm.lower(sch, [mat_a, mat_b, mat_c], simple_mode=True))
````

````
produce mat_c {
  for (x, 0, 128) {
    for (y, 0, 128) {
      mat_c[((x*128) + y)] = 0f
      for (k, 0, 64) {
        mat_c[((x*128) + y)] = (mat_c[((x*128) + y)] + (mat_a[((x*64) + k)]*mat_b[((k*128) + y)]))
      }
    }
  }
}
````
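
The lowered IR works on flat buffers, which is why `mat_c[x, y]` appears as `mat_c[((x*128) + y)]`. The following pure-Python sketch mirrors that loop nest with smaller, arbitrarily chosen sizes so the result can be checked by hand. `matmul_flat` is a hypothetical name, and a row-major layout is assumed to match the indexing shown above.

```python
def matmul_flat(mat_a, mat_b, m, k, n):
    """Multiply an m x k by a k x n matrix, both stored as flat row-major lists."""
    mat_c = [0.0] * (m * n)
    for x in range(m):
        for y in range(n):
            # Initialize the accumulator, then reduce over the k axis.
            mat_c[x * n + y] = 0.0
            for r in range(k):
                mat_c[x * n + y] += mat_a[x * k + r] * mat_b[r * n + y]
    return mat_c


# A 2 x 3 matrix times a 3 x 2 matrix.
mat_a = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0]
mat_b = [1.0, 0.0,
         0.0, 1.0,
         1.0, 1.0]
mat_c = matmul_flat(mat_a, mat_b, m=2, k=3, n=2)
print(mat_c)  # [4.0, 5.0, 10.0, 11.0]
```

The explicit initialization followed by the inner accumulation loop corresponds to the `= 0f` line and the `for (k, 0, 64)` loop in the IR.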



### Put it all together

Recall that in TVM everything is an Op and every Op is a tensor. Thus, `vec_c` and `mat_c` can also interact with each other.

```` python
d = tvm.compute((m, n), lambda x, y: mat_c[x, y] * vec_c[x], name='d')

sch = tvm.create_schedule(d.op)
````

We first look at the vanilla IR. In this case, `vec_c` and `mat_c` become intermediate results, so whether to put them in the argument list depends on whether you need these results (both ways are correct).

```` python
print(tvm.lower(sch, [vec_a, vec_b, mat_a, mat_b, d], simple_mode=True))
````

````
// attr [mat_c] storage_scope = "global"
allocate mat_c[float32 * 16384]
// attr [c] storage_scope = "global"
allocate c[float32 * 128]
produce mat_c {
  for (x, 0, 128) {
    for (y, 0, 128) {
      mat_c[((x*128) + y)] = 0f
      for (k, 0, 64) {
        mat_c[((x*128) + y)] = (mat_c[((x*128) + y)] + (mat_a[((x*64) + k)]*mat_b[((k*128) + y)]))
      }
    }
  }
}
produce c {
  for (x, 0, 128) {
    c[x] = (a[x] + b[x])
  }
}
produce d {
  for (x, 0, 128) {
    for (y, 0, 128) {
      d[((x*128) + y)] = (mat_c[((x*128) + y)]*c[x])
    }
  }
}
````



Next we introduce an important concept, Op fusion: we do not actually need to hold the whole `mat_c` in memory; we can compute it when necessary. TVM provides two primitives for this, `compute_at` and `compute_inline`.

`compute_inline` completely fuses an Op into its consumer.

```` python
sch[vec_c].compute_inline()
print(tvm.lower(sch, [vec_a, vec_b, mat_a, mat_b, d], simple_mode=True))
````

````
// attr [mat_c] storage_scope = "global"
allocate mat_c[float32 * 16384]
produce mat_c {
  for (x, 0, 128) {
    for (y, 0, 128) {
      mat_c[((x*128) + y)] = 0f
      for (k, 0, 64) {
        mat_c[((x*128) + y)] = (mat_c[((x*128) + y)] + (mat_a[((x*64) + k)]*mat_b[((k*128) + y)]))
      }
    }
  }
}
produce d {
  for (x, 0, 128) {
    for (y, 0, 128) {
      d[((x*128) + y)] = (mat_c[((x*128) + y)]*(a[x] + b[x]))
    }
  }
}
````
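
In plain-Python terms, the change is that the intermediate buffer `c` disappears and its formula is evaluated where it is consumed. Below is a sketch with toy sizes and a placeholder `mat_c`; the function names are illustrative only and are not TVM APIs.

```python
def scale_rows_materialized(mat_c, a, b, m, n):
    """Compute c = a + b into a buffer first, then d[x, y] = mat_c[x, y] * c[x]."""
    c = [a[x] + b[x] for x in range(m)]  # intermediate buffer of size m
    return [mat_c[x * n + y] * c[x] for x in range(m) for y in range(n)]


def scale_rows_inlined(mat_c, a, b, m, n):
    """Same result, with the formula for c substituted into the consumer."""
    return [mat_c[x * n + y] * (a[x] + b[x]) for x in range(m) for y in range(n)]


m, n = 2, 3
mat_c = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0]
a = [1.0, 2.0]
b = [0.5, 0.5]
assert scale_rows_materialized(mat_c, a, b, m, n) == scale_rows_inlined(mat_c, a, b, m, n)
print(scale_rows_inlined(mat_c, a, b, m, n))  # [1.5, 3.0, 4.5, 10.0, 12.5, 15.0]
```

The inlined version trades the buffer for recomputing `a[x] + b[x]` once per `y`, which is cheap for an element-wise formula like this one.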



`mat_c` is a reduction Op, so it cannot be inlined. We can still apply `compute_at` to it, which indicates that, at each iteration of `d`'s `axis[1]` loop, we compute only the elements of tensor `mat_c` that this iteration requires.

```` python
sch[mat_c].compute_at(sch[d], d.op.axis[1])
print(tvm.lower(sch, [vec_a, vec_b, mat_a, mat_b, d], simple_mode=True))
````

````
// attr [mat_c] storage_scope = "global"
allocate mat_c[float32 * 1]
produce d {
  for (x, 0, 128) {
    for (y, 0, 128) {
      produce mat_c {
        mat_c[0] = 0f
        for (k, 0, 64) {
          mat_c[0] = (mat_c[0] + (mat_a[((x*64) + k)]*mat_b[((k*128) + y)]))
        }
      }
      d[((x*128) + y)] = (mat_c[0]*(a[x] + b[x]))
    }
  }
}
````
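
The same idea, written out in plain Python: instead of materializing all of `mat_c` first, the reduction is computed into a single scratch value right where `d` needs it. This is a sketch for intuition with toy sizes, not what TVM generates, and the function names are hypothetical.

```python
def fused_pipeline(mat_a, mat_b, a, b, m, k, n):
    """d[x, y] = (sum over r of mat_a[x, r] * mat_b[r, y]) * (a[x] + b[x])."""
    d = [0.0] * (m * n)
    for x in range(m):
        for y in range(n):
            # "produce mat_c" at this (x, y): one scalar instead of an m * n buffer.
            acc = 0.0
            for r in range(k):
                acc += mat_a[x * k + r] * mat_b[r * n + y]
            d[x * n + y] = acc * (a[x] + b[x])
    return d


def unfused_pipeline(mat_a, mat_b, a, b, m, k, n):
    """Reference: materialize the whole product first, then scale each row."""
    mat_c = [
        sum(mat_a[x * k + r] * mat_b[r * n + y] for r in range(k))
        for x in range(m) for y in range(n)
    ]
    return [mat_c[x * n + y] * (a[x] + b[x]) for x in range(m) for y in range(n)]


m, k, n = 2, 3, 2
mat_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mat_b = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
a = [1.0, 2.0]
b = [1.0, 0.0]
assert fused_pipeline(mat_a, mat_b, a, b, m, k, n) == unfused_pipeline(mat_a, mat_b, a, b, m, k, n)
print(fused_pipeline(mat_a, mat_b, a, b, m, k, n))  # [8.0, 10.0, 20.0, 22.0]
```

The scratch scalar `acc` plays the role of the one-element `mat_c` allocation in the IR above.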



Bonus: here, using `sch[vec_c].compute_at(sch[d], d.op.axis[1])` is the same as `compute_inline` for LLVM code generation. `compute_inline` only makes the HalideIR look nicer.

Hybrid Op
------------
As mentioned in the last section, compute Ops are dedicated to Ops that are formulae wrapped in levels of loops. For irregular kernels, you still need a more general tool. In TVM, this is Hybrid Script, which lets you write a subset of Python that can be lowered to TVM Ops.

To use this functionality, we first need to write a Python function and annotate it with a `tvm.hybrid.script` decorator.

```` python
@tvm.hybrid.script
def foo(n, a, b):
    # Do not worry about this undeclared function; the decorator handles it for you.
    c = output_tensor((n, ), a.dtype)
    for i in range(n):
        if a[i] > 100:
            c[i] = a[i] + b[i]
        else:
            c[i] = a[i] - b[i]
    return c
````

This function then has two modes: software emulation and `Op` compilation. It is simply overloaded on its argument types. If you pass actual data (a `list`, `tuple`, or `np.array`), it runs in software emulation mode.

```` python
npa = np.arange(128).astype('float32')
npb = np.arange(128).astype('float32')

# npa and npb are numpy arrays, so it does software emulation.
out = foo(128, npa, npb)
ref = np.array([0] * 101 + list(range(202, 256, 2))).astype('float32')

# We can check the results.
tvm.testing.assert_allclose(out, ref)
````
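
To see where `ref` comes from, the body of `foo` can be traced in plain Python without TVM or NumPy. In this sketch `output_tensor` is a local stand-in written only for illustration (it ignores the dtype and returns a list of zeros); the real one is supplied by the decorator.

```python
def output_tensor(shape, dtype):
    """Stand-in for the hybrid-script intrinsic: a zero-filled 1-D list.

    Illustration only; `dtype` is accepted for signature parity and ignored.
    """
    (n,) = shape
    return [0.0] * n


def foo(n, a, b):
    c = output_tensor((n,), "float32")
    for i in range(n):
        if a[i] > 100:
            c[i] = a[i] + b[i]
        else:
            c[i] = a[i] - b[i]
    return c


a = [float(i) for i in range(128)]
b = [float(i) for i in range(128)]
out = foo(128, a, b)

# Indices 0..100 take the else branch (a - b = 0); 101..127 take a + b = 2 * i.
assert out == [0.0] * 101 + [float(v) for v in range(202, 256, 2)]
print(out[100], out[101], out[127])  # 0.0 202.0 254.0
```

The branch flips at index 101 because the condition is a strict `a[i] > 100`, which is why `ref` starts with 101 zeros rather than 100.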

If you pass TVM tensors to this function, it will do Op compilation. 

```` python
a = tvm.placeholder((128, ), dtype='float32', name='a')
b = tvm.placeholder((128, ), dtype='float32', name='b')

# Note: if you want to pass a constant to the compilation,
# you should convert it to a TVM symbolic value.
# `tvm.convert` inspects the data type and converts the value
# to the corresponding TVM data structure.
c_foo = foo(tvm.convert(128), a, b)

print(c_foo.op.body)

sch = tvm.create_schedule(c_foo.op)
````

````
for (i, 0, 128) {
  if ((a(i) > 100f)) {
    c(i) =(a(i) + b(i))
  } else {
    c(i) =(a(i) - b(i))
  }
}
````



We can use exactly the same APIs to play with this schedule, from loop organization to target compilation.

```` python
nda = tvm.ndarray.array(npa)
ndb = tvm.ndarray.array(npb)
ndc = tvm.ndarray.array(np.zeros((128, ), dtype='float32'))

# Since `n` is replaced by a constant, you no longer need to pass `n` as an argument.
module = tvm.build(sch, [a, b, c_foo], target='llvm')
module(nda, ndb, ndc)
tvm.testing.assert_allclose(ndc.asnumpy(), ref)
````

For the full set of syntax features supported by Hybrid Script, please refer to [the language manual](https://docs.tvm.ai/langref/hybrid_script.html).

Hybrid Script is not perfect, and there may be bugs. If you cannot find a solution after carefully reading the dumped stack trace, feel free to post on [the forum](https://discuss.tvm.ai) and @me.