HALO

Heterogeneity-Aware Lowering and Optimization (HALO) is a heterogeneous computing acceleration platform based on compiler technology. It exploits the power of heterogeneous computing while hiding the heterogeneity of computing resources behind an abstract, extensible interface called the Open Deep Learning API (ODLA). HALO provides a unified ahead-of-time compilation solution, automatically tailored for various cloud, edge, and IoT scenarios.

Get Started

Design Overview

Recently, new domain-specific heterogeneous hardware platforms have been emerging rapidly to accelerate AI applications. However, deploying and enabling these heterogeneous accelerators in a production environment poses a twofold challenge.

First, various ML frameworks are used to build and run ML algorithms and applications, e.g., TensorFlow, PyTorch, Caffe, and MXNet. Each framework forms its own closed software ecosystem, even though converters exist for translating models between them. Second, heterogeneous accelerators are typically fragmented and diversified in terms of functionality, performance, scalability, and integrability.

The objective of HALO is to address these challenges with a centralized, uniform, heterogeneity-aware compiler targeting AI applications in the cloud, edge, and IoT fields. The HALO front end parses models and algorithms built with widely used frameworks, including TensorFlow, Caffe, ONNX, and more, into a multi-level, multi-grained intermediate representation, HALO IR. The HALO middle end then performs essential and profitable transformations and optimizations at the IR level.

To adapt to and integrate with accelerators for which only high-level SDKs are available, HALO introduces a uniform, abstract, hardware-independent interface, the Open Deep Learning API (ODLA). ODLA consists of logical computing operations as well as device-related operations, which are expressive enough to represent device runtime logic yet abstract enough to hide hardware implementation details. The HALO back end generates a device-independent program written against ODLA. This uniform ODLA program is cross-platform and can be ported to and deployed on various accelerators by linking it with hardware-specific runtime libraries.

HALO decouples upper-level algorithms from lower-level hardware implementations. Through compiler technology and the abstract ODLA interface, it unleashes the power of heterogeneous hardware and accelerates applications smoothly and transparently. Furthermore, it allows computing workloads to be dispatched and switched among different accelerators dynamically.

System Requirements

HALO has been fully tested in the following development environment:

OS:

  • Ubuntu 18.04

Tools and libraries:

  • C++ compiler that supports C++17 (e.g., GCC >= 7.5.0)
  • CMake (>= 3.14.5)
  • Clang tools (>= 9.0)
  • glog (>= 0.4)
  • Protobuf 3.9.1

Software packages for some demos and examples:

  • OpenCV 3.2.0
  • Python3
  • PyTorch and TensorFlow / Keras (to get pretrained models)
  • ImageMagick (to preprocess test images)
  • Device acceleration libraries:

NVIDIA® GPU environment:

  • CUDA® (>= 10.0)
  • CUDA® Deep Neural Network library™ (cuDNN) (>= 7.6.0)
  • TensorRT™ (7.0.0)

Docker Environment

For convenience, the above system requirements are also prepared and packaged as a Docker environment, located under utils/docker.
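
A minimal sketch of building and entering such a container (assuming utils/docker contains a default Dockerfile; the image tag and mount path below are illustrative only, so check the scripts under utils/docker for the maintained workflow):

docker build -t halo:dev utils/docker
docker run -it --rm -v $(pwd):/host/halo halo:dev bash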

Build From Scratch

Get HALO

git clone https://github.com/alibaba/heterogeneity-aware-lowering-and-optimization.git --recurse-submodules -j8

Configure and Build

mkdir halo/build
cd halo/build
cmake -DCMAKE_BUILD_TYPE=Release -G Ninja ..
ninja

Some CMake options (an example invocation follows this list):

  • -DCMAKE_BUILD_TYPE=[Release|Debug]: select the build type.
  • -DHALO_USE_GLOG=[ON]: use the glog library for logging (enabled by default).
  • -DHALO_CCACHE_BUILD=[ON]: enable or disable ccache for the build.
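
A sketch of a Debug build with glog-based logging and ccache enabled, combining the options above:

cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DHALO_USE_GLOG=ON -DHALO_CCACHE_BUILD=ON ..
ninja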

Unit Tests

HALO uses the llvm-lit testing tool for unit tests. To run all unit tests, simply run:

ninja check-halo

Using HALO

A computation model, which is similar to a function, consists of input nodes, constant nodes, computational nodes and output nodes.

HALO compiles the whole model into an ODLA-based C/C++ function, in which input nodes and output nodes are represented as function arguments passed by pointers. The function body consists of ODLA-based APIs corresponding to the computational nodes.

Constants (weights) are compiled into a separate file, either a C/C++ source file or a binary ELF file.

Basic usage:

halo [options] model_files -o output_odla_file

Below is a typical workflow of deploying a model using HALO:

  1. Use HALO to compile the model file(s) into an ODLA-based C/C++ source file.
  2. Use a conventional C/C++ compiler to compile the generated C/C++ file into an object file.
  3. Link the object file, the generated weight file, and the ODLA device runtime libraries.

A Simple Example

Let's start with a simple example: a TensorFlow model for MNIST handwritten digit classification, based on the TensorFlow tutorial:

simple MNIST model

First, compile the model into ODLA C++ code:

halo -target cxx mnist_simple.pb -o out/model.cc

It generates three files:

  • out/model.h : the header file to be used by the application
  • out/model.cc : the ODLA C++ file that represents the model
  • out/model.bin : the weights file

From the generated model.h:

extern "C" {
void mnist_simple(const float x[1 * 784], float out_y[1 * 10]);
void mnist_simple_init();
void mnist_simple_fini();
};

mnist_simple() is the entry function for inference: it takes the array x as input and writes the results into out_y. mnist_simple() can be called multiple times, while mnist_simple_init() and mnist_simple_fini() are each called once to initialize and to clean up the whole computation, respectively.

By default, the names of the above three functions are derived from the input file name. The YOLO v3 example below demonstrates how to specify the function names explicitly.

Note that, for portability purposes, HALO always exports functions with C linkage, even though the output file model.cc is C++.

model.cc is an ODLA-based C++ file. Its main part builds the model computation using ODLA APIs:

// Graph building
static void mnist_simple_helper() {
  odla_CreateComputation(&Comp);
  auto x = odla_CreateArgument({ODLA_FLOAT32, {.size = 2, .dims = {1, 784}}},
                               (const odla_value_id)("x"));
  auto V =
      odla_CreateConstant({ODLA_FLOAT32, {.size = 2, .dims = {784, 10}}},
                          Variable, (const odla_value_id) "V");
  auto V1 = odla_CreateConstant(
      {ODLA_FLOAT32, {.size = 2, .dims = {1, 10}}}, Variable_1_broadcasted_7,
      (const odla_value_id) "V1");
  auto MatMul =
      odla_Gemm(x, 0, V, 0, 1, 0, nullptr, {.size = 2, .dims = {1, 10}},
                (const odla_value_id) "MatMul");
  auto add =
      odla_Add(MatMul, V1, (const odla_value_id) "add");
  auto y = odla_Softmax(add, -1, (const odla_value_id) "y");
  odla_SetValueAsOutput(y);
}

// Entry function
void mnist_simple(const float x[1 * 784], float out_y[1 * 10]) {
  // ...some setup code skipped.
  mnist_simple_init(); // it calls mnist_simple_helper() once.
  odla_BindToArgumentById((const odla_value_id) "x", x, Ctx);
  odla_BindToOutputById((const odla_value_id) "y", out_y, Ctx);
  odla_ExecuteComputation(Comp, Ctx, ODLA_COMPUTE_INFERENCE, nullptr);
}

The code snippet of demo application (main.cc):

#include "out/model.h" // include the generated header.
int main(int argc, char** argv) {
  //... read 1000 images & labels.
  mnist_simple_init(); // Initialize computation.
  
  int correct = 0;
  for (int i = 0; i < 1000; ++i) {
    std::array<float, 28 * 28> input;
    std::array<float, 10> output;
    // ... preprocess inputs
    mnist_simple(input.data(), output.data());
    int pred = std::max_element(output.begin(), output.end()) - output.begin();
    correct += (pred == labels[i]);
  }
  std::cout << "Accuracy: " << << correct / 1000.0 << "% \n";
  mnist_simple_fini(); // Clean up.
}

Next, we can use any modern C++ compiler to compile the generated code:

g++ out/model.cc -I<halo_install_path>/include -c -o out/model.o
g++ main.cc -Iout -c -o out/main.o

Assume we link it with a DNNL-based ODLA acceleration runtime library:

g++ -o out/demo out/main.o out/model.o out/model.bin \
  -L<halo_install_path>/lib/ODLA -lodla_dnnl -Wl,-rpath=<halo_install_path>/lib/ODLA

To switch to the TensorRT-based ODLA runtime, simply replace "-lodla_dnnl" with "-lodla_tensorrt".

The MNIST example code can be found here.

Please refer to HALO options list for all command line options.

A Complete Example on Object Detection

This example demonstrates how to deploy a pretrained YOLO v3 model, with pre-processing and post-processing performed on the host.

First, download the model from the ONNX Model Repository. For demonstration purposes, we use the XNNPACK-based ODLA runtime, which supports the ODLA interpret programming mode.

Next, compile the model into C code:

halo -target cc -exec-mode=interpret -emit-value-id-as-int -reorder-data-layout=channel-last -remove-input-transpose -remove-output-transpose  -o out/yolo.c yolov3-10.onnx  --disable-broadcasting -outputs conv2d_59 -outputs conv2d_67 -outputs conv2d_75 -input-shape=input_1:1x3x416x416 -entry-func-name=yolo_v3

Options explained:

  • -target cc: to generate the C99 code.
  • -exec-mode=interpret: to generate the ODLA interpret mode code.
  • -emit-value-id-as-int: to generate the ODLA value ids as integers.
  • -reorder-data-layout=channel-last: to enable the data layout conversion since ONNX uses NCHW while the XNNPACK runtime prefers NHWC.
  • -remove-input-transpose: to optimize away the input transpose.
  • -remove-output-transpose: to optimize away the output transpose.
  • -disable-broadcasting: to disable the offline weights broadcasting since the ODLA runtime supports element-wise ops with broadcasting.
  • -outputs: to specify the output nodes by their names.
  • -input-shape: to explicitly specify the input shape.
  • -entry-func-name=yolo_v3: to specify the generated function names as yolo_v3(), yolo_v3_init(), and yolo_v3_fini().
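
As with the MNIST example, the generated yolo.c can then be compiled with a C compiler and linked against an ODLA runtime library. A sketch, assuming the XNNPACK-based runtime is installed as libodla_xnnpack under <halo_install_path>/lib/ODLA (the library name and the application file yolo_app.c are assumptions for illustration):

gcc -c out/yolo.c -I<halo_install_path>/include -o out/yolo.o
gcc -c yolo_app.c -Iout -o out/yolo_app.o
gcc -o out/yolo_demo out/yolo_app.o out/yolo.o out/yolo.bin \
  -L<halo_install_path>/lib/ODLA -lodla_xnnpack -Wl,-rpath=<halo_install_path>/lib/ODLA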

A complete YOLO application, including input preprocessing, inference, and result rendering, can be found here.

Example of Using HALO Inside Python

The HALO-generated ODLA function can also be used inside Python.

Here we use CaffeNet as an example.

First, we compile the Caffe model into ODLA C++ code:

halo deploy.prototxt bvlc_reference_caffenet.caffemodel -target cxx -disable-broadcasting -entry-func-name=caffenet -batch-size=1 --input-shape=data:1x3x227x227 -o deploy.cc

g++ deploy.cc -c -fPIC -o deploy.o -I<halo_install_path>/include

Then, link it as a shared library using the TensorRT-based ODLA runtime library:

g++ -shared deploy.o deploy.bin -lodla_tensorrt -L <halo_install_path>/lib/ODLA -Wl,-rpath=<halo_install_path>/lib/ODLA -o /tmp/deploy.so

In a Python script, the CaffeNet inference can be invoked as:

import ctypes

import numpy as np

# ... (get_image_as_ndarray() and preprocess() are defined elsewhere)
c_lib = ctypes.CDLL('/tmp/deploy.so')
image = get_image_as_ndarray(path)
image = preprocess(image)
image = image.astype(ctypes.c_float)
ret = (ctypes.c_float * 1000)()
c_lib.caffenet(ctypes.c_void_p(image.ctypes.data), ret)
ret = np.array(ret)
ind = ret.argsort()[-3:][::-1]
#...

The CaffeNet example can be found here.

More Examples

The models directory contains scripts for the following models; the scripts download the pretrained models, then compile and deploy them with HALO on X86 CPU or NVIDIA GPU. Please refer to Instruction.md for more details on how to run the examples.

Image Classification

Model Class      Model Source          HALO Examples
AlexNet          PyTorch               models/vision/classification/alexnet
CaffeNet         BVLC/Caffe            models/vision/classification/caffenet
DenseNet-121     PyTorch               models/vision/classification/densenet
GoogleNet        PyTorch               models/vision/classification/googlenet
Inception_V1     ONNX                  models/vision/classification/inception
Inception_V3     PyTorch               models/vision/classification/inception
MNIST            TensorFlow Tutorial   models/vision/classification/mnist_simple
MobileNet_V2     PyTorch               models/vision/classification/mobilenet
ResNet V1-18     ONNX                  models/vision/classification/resnet
ResNet V2-50     ONNX                  models/vision/classification/resnet
ResNet V2-101    ONNX                  models/vision/classification/resnet
ShuffleNet       ONNX                  models/vision/classification/shufflenet
ShuffleNet_V2    ONNX
SqueezeNet_10    PyTorch               models/vision/classification/squeezenet
SqueezeNet_11    PyTorch               models/vision/classification/squeezenet
VGG-16           PyTorch               models/vision/classification/vgg
VGG-19           PyTorch               models/vision/classification/vgg

Object Detection & Segmentation

Model Class      Model Source          HALO Examples
YOLO v3          ONNX                  models/vision/detection/yolo
UNet             PyTorch               models/vision/segmentation/unet
RetinaNet
SSD

NLP

Model Class Description
BERT

List of HALO Command Line Options

  • --help: Display available options.
  • --target [cxx|cc]: cxx generates C++11 source code; cc generates C99 source code.
  • -o <filename>: Specify the output file. The weights file is automatically generated with a '.bin' suffix.
  • --batch-size <number>: Specify/override the batch size of inputs. It assumes the first dimension of the input is the batch dimension.
  • --exec-mode=[compile|interpret]: Specify the ODLA execution mode. The default is compile mode.
  • --entry-func-name=<name>: Specify the name of the generated function. The default is the model's file name.
  • --reorder-data-layout=[channel-first|channel-last]: Compile the model into the specified data layout. By default, the generated ODLA function uses the same data layout (NHWC or NCHW) as the input model. Transpose operations might be inserted for input nodes.
  • --remove-input-transpose: Remove the transpose operation on input nodes. Usually used together with --reorder-data-layout.
  • --remove-output-transpose: Remove the transpose operation on output nodes. Usually used together with --reorder-data-layout.
  • --inputs=<name>: Specify the input nodes.
  • --input-shape=<shape>: Specify an input shape, e.g. --input-shape=foo:1x3x10 --input-shape=bar:5x4. It overrides the shape defined in the model file.
  • --outputs=<name>: Specify the output nodes. By default, HALO uses all sink nodes as outputs. Together with --inputs, this can be used to compile a part of the computation (see the sketch after this list).
  • --fuse-conv-bias: Fuse convolution and bias.
  • --fuse-matmul-bias: Fuse matmul and bias.
  • --emit-value-reset: Emit odla_ReleaseValue() whenever an ODLA value is no longer needed under the interpret mode.
  • --emit-value-id-as-int: Emit integer ODLA value ids. By default, HALO generates string-based value ids.
  • --emit-data-as-c: Generate the weights file as a C file instead of the default ELF file.
  • --print-mem-stats: Display the estimated memory usage.
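
For instance, a sketch of compiling only a sub-graph by combining --inputs and --outputs (the node names here are hypothetical):

halo -target cxx model.pb -inputs=conv1_output -outputs=fc1_output -o out/subgraph.cc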

Contributions and Feedback

We're always looking for help to improve HALO's quality.

Coding Standards

We mainly follow the Google C++ Style Guide. clang-tidy is used to enforce the coding style.

Issues

We use GitHub issues to track bugs.

License

HALO is licensed under the Apache 2.0 License.
