Due 11:59pm, Tuesday May 5th
In this assignment you will create an efficient schedule for a DNN convolution layer in Halide. Implementing the Halide algorithm for a conv layer is quite easy (we give you the algorithm in Halide in the starter code). The challenge is coming up with an efficient schedule. Good schedules will use a combination of ideas discussed in class, such as SIMD vector processing, multi-core execution, and efficient blocking for cache locality.
In general, this is a free-for-all assignment. We want you to learn a bit about writing Halide schedules, and to try your hand at making code run faster. You will have to do some Halide documentation reading on your own.
Step 1: Grab the assignment starter code:
git clone git@github.com:stanford-cs348k/asst2-convlayer.git
The codebase uses a simple `Makefile` as the build system. However, there is a dependency on Halide.
Step 2: Install Halide:
To install and use Halide, follow the instructions at http://halide-lang.org/. In particular, you should download a binary release of Halide. Once you've downloaded and untar'd the release, say into directory `halide_dir`, change the following line in the `Makefile`:
HALIDE_DIR=/MY/PATH/TO/HALIDE
to
HALIDE_DIR=<halide_dir>
The driver source code is in `src/`, and the implementation of the convolution layer generator you will modify is in the top-level file `conv_layer_generators.cpp`. Object files and binaries will be placed in `build/` and `bin/`, respectively.
Step 3: Build the code:
Run `make` from the top-level directory:
make
This creates the binary `bin/convlayer`.
Running the starter code:
To get commandline help, run the command:
./bin/convlayer -h
To run your scheduled conv layer try:
./bin/convlayer --schedule student
This code will run your version of the convolution layer using randomly generated inputs and weights. It runs for 3 trials and reports the minimum time of the 3 runs. To run correctly, you must ensure that Halide is in your library load path. On macOS this can be done like so:
DYLD_LIBRARY_PATH=<halide_dir>/bin ./bin/convlayer <args>
and on Linux it will be:
LD_LIBRARY_PATH=<halide_dir>/bin ./bin/convlayer <args>
Modifying the code
Programming in Halide is a form of [meta-programming](https://en.wikipedia.org/wiki/Metaprogramming). Halide is embedded in C++, so you write C++ code that in turn generates a Halide program representation. Halide then compiles this representation into a library, which is linked into the `convlayer` binary.
In the code base, a generator is a C++ class that creates a Halide program. Your modifications to the code should go only in the file `conv_layer_generators.cpp`, inside the class `StudentConvLayerGenerator`. Inside that file you should not modify the implementation of the convolution layer algorithm; you should only add Halide scheduling directives to the program to make it run faster. Regions you should modify will be marked by comments that look like so:
// BEGIN: CS348K STUDENTS MODIFY THIS CODE
// END: CS348K STUDENTS MODIFY THIS CODE
We have provided a reference implementation in `conv_layer_generators.cpp` called `DefaultConvLayerGenerator`. This implementation generates a conv layer implementation using the default Halide schedule. The code generated by `DefaultConvLayerGenerator` is called by `HalideConvolutionLayer`, which is located in the source file `./src/default_convolution_layer.cpp`.
You are allowed to use the reference Halide algorithm provided in the codebase verbatim (see `StudentConvLayerGenerator` in `conv_layer_generators.cpp`). The starter code uses a naive/default Halide schedule, which corresponds to an evaluation order equivalent to code with loops that look like:
// Initialization
for n:
  for z:
    for y:
      for x:
        conv(...) = ...
// Updates
for n:
  for z:
    for y:
      for x:
        for r0:
          for r1:
            for r2:
              conv(...) = ...
Your job is to implement a custom Halide schedule that performs better than the default. (See `Halide::Func::print_loop_nest()` to inspect and debug your schedule's loop structure.)
You may wish to consider:
- Parallelization onto multiple CPU cores (`.parallel()`)
- SIMD vector processing (`.vectorize()`) or loop unrolling (`.unroll()`)
- Tiling the computation using `.split()` and `.reorder()`
- More advanced implementations might even consider data layout transformations that interchange the storage order of different axes. See tutorial 16 and tutorial 17 for more info.
- Halide tutorials. In particular, see Tutorial 01 for a basic introduction, Tutorial 05 for an introduction to Halide schedules, Tutorial 07 for a 2D convolution example, and Tutorial 08 for more advanced scheduling topics.
- Exhaustive Halide documentation.
- This is a good YouTube video about Halide scheduling.
Your handin should contain two components:
- Your full source tree (as a zip file). Please make sure you `make clean` prior to zipping.
- A writeup, in a file named `writeup.pdf`, that describes the iterative process you used to arrive at your solution. At each step, we expect the writeup to say:
  - I tried XXX for the following reason.
  - Then I measured my performance and I saw an increase/decrease, etc.
This assignment is all about performance optimization, but we are not grading on the actual performance obtained. We are grading on a three-category basis: no-credit, credit, amazing. Anyone who makes a legitimate effort to optimize the code should get credit. Just please make sure your handin writeup properly describes your process. Students who obtain impressively high performance will get "amazing"!