# DNNL Design

## Primitives

A functor object that encapsulates an operation (e.g. convolution, GEMM) or a fused operation (e.g. GEMM + ReLU).  
Immutable state: operation parameters (e.g. tensor shapes) and precomputed data.  
Mutable state: memory buffers and temporary storage used during execution.

Creating a primitive is expensive (because of the precomputation), so a primitive should be created once and reused many times.
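The create-once, run-many pattern can be sketched with a toy functor. This is a minimal illustration in plain Python, not the DNNL API: `ToyMatVecPrimitive` and its weight "repacking" step are hypothetical stand-ins for a real primitive and its precomputation.

```python
class ToyMatVecPrimitive:
    """Hypothetical primitive computing out = W @ x for a fixed shape."""

    def __init__(self, rows, cols, weights):
        # Immutable state: operation parameters and precomputed data.
        # The "repacking" below stands in for the expensive precomputation.
        self.rows = rows
        self.cols = cols
        self._packed = tuple(
            tuple(weights[r * cols:(r + 1) * cols]) for r in range(rows)
        )

    def __call__(self, x, out):
        # Mutable state (input and output buffers) is supplied per execution.
        for r, row in enumerate(self._packed):
            out[r] = sum(w * v for w, v in zip(row, x))
        return out


# Create once (expensive), then execute many times (cheap).
prim = ToyMatVecPrimitive(2, 3, [1, 2, 3, 4, 5, 6])
out = [0, 0]
first = list(prim([1, 1, 1], out))
second = list(prim([1, 0, 0], out))
```

The same `prim` object is reused for both calls; only the buffers change between executions.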

## Engine

An abstraction of a computational device (e.g. a CPU or a GPU).

## Streams

Encapsulate an execution context (e.g. an OpenCL command queue).

## Memory objects

Handles to memory allocated on a specific engine, together with the memory format and the mapping from tensor indices to memory offsets.
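To make the "tensor indices to offset" mapping concrete, here is a small sketch in plain Python. The helper names `strides_for` and `offset` are made up for illustration, and only plain (non-blocked) layouts are covered.

```python
def strides_for(dims, order):
    """Compute per-axis strides for a dense layout.

    dims:  dict mapping axis name to size, e.g. {"n": 2, "c": 3, ...}
    order: axis names from outermost to innermost, e.g. "nchw"
    """
    strides = {}
    step = 1
    for axis in reversed(order):
        strides[axis] = step
        step *= dims[axis]
    return strides


def offset(index, strides):
    """Map a logical tensor index to a linear offset in the buffer."""
    return sum(index[axis] * strides[axis] for axis in index)


dims = {"n": 2, "c": 3, "h": 4, "w": 5}
nchw = strides_for(dims, "nchw")
nhwc = strides_for(dims, "nhwc")

# The same logical element lands at different offsets in each format.
idx = {"n": 0, "c": 1, "h": 0, "w": 0}
offset_nchw = offset(idx, nchw)  # channels are far apart in nchw
offset_nhwc = offset(idx, nhwc)  # channels are adjacent in nhwc
```

The logical dimensions are identical in both cases; only the memory format, and therefore the offset, differs.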

## Memory descriptors

Tensor dimensions, data type, and memory format.

## Operation descriptors

Describe an operation (e.g. convolution) and its properties (e.g. shapes of the input tensors, forward or backward operation).

## Primitive descriptors

An abstraction that sits between operation descriptors and primitives. They give details about the primitive implementation and its memory formats.

## Basic API Usage

1. Initialize the engine and stream
2. Prepare the input data (e.g. store it in a std::vector)
3. Wrap the data in memory objects
4. Create the operation primitives
5. Execute the primitives
6. Read the output data from the memory objects
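The six steps can be walked through with a toy model. Everything below (`Engine`, `Stream`, `Memory`, `ReluPrimitive`) is a hypothetical plain-Python mock that only mirrors the shape of the workflow; it is not the DNNL API, and the deferred execution in `Stream` is an assumption made for illustration.

```python
class Engine:
    """Mock computational device."""
    def __init__(self, kind):
        self.kind = kind


class Stream:
    """Mock execution context: work is queued, then run on wait()."""
    def __init__(self, engine):
        self.engine = engine
        self._queue = []

    def submit(self, task):
        self._queue.append(task)

    def wait(self):
        for task in self._queue:
            task()
        self._queue.clear()


class Memory:
    """Mock memory object: dimensions plus a flat buffer on an engine."""
    def __init__(self, dims, engine, data=None):
        size = 1
        for d in dims:
            size *= d
        self.dims = dims
        self.engine = engine
        self.buffer = list(data) if data is not None else [0.0] * size


class ReluPrimitive:
    """Mock primitive: element-wise max(x, 0)."""
    def execute(self, stream, src, dst):
        def task():
            for i, v in enumerate(src.buffer):
                dst.buffer[i] = v if v > 0 else 0.0
        stream.submit(task)


engine = Engine("cpu")                      # 1. engine and stream
stream = Stream(engine)
data = [-1.0, 2.0, -3.0, 4.0]               # 2. input data
src = Memory((1, 1, 2, 2), engine, data)    # 3. wrap data in memory objects
dst = Memory((1, 1, 2, 2), engine)
relu = ReluPrimitive()                      # 4. create the primitive
relu.execute(stream, src, dst)              # 5. execute it
stream.wait()
result = list(dst.buffer)                   # 6. read the output
```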

## Memory format propagation

The placeholder memory format lets the implementation choose the actual format.  
For convolutions, the best format depends on the hardware and the operation parameters (e.g. filter size), so it is best to keep the placeholder format.  

Other primitives, such as element-wise operations and batch normalization, should use the same format as the previous layer. This is called format propagation.
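A rough sketch of why propagation matters, in plain Python. The format names `"plain"` and `"blocked"`, the `ANY` tag, and the rule that a convolution always picks `"blocked"` are all assumptions for this toy model; the real library makes its own choice based on hardware and parameters.

```python
ANY = "any"  # hypothetical placeholder tag: "let the implementation choose"


def resolve_formats(layers, conv_preferred="blocked", input_format="plain"):
    """Resolve the concrete format used by each layer.

    layers: list of (kind, requested_format) tuples.
    conv_preferred: stand-in for whatever format the implementation picks.
    """
    resolved = []
    current = input_format
    for kind, requested in layers:
        if requested != ANY:
            current = requested        # the user forced a format
        elif kind == "conv":
            current = conv_preferred   # the implementation chooses
        # otherwise: propagate the previous layer's format unchanged
        resolved.append((kind, current))
    return resolved


def count_reorders(resolved, input_format="plain"):
    """Count format changes, each of which would need a data reorder."""
    reorders = 0
    previous = input_format
    for _, fmt in resolved:
        if fmt != previous:
            reorders += 1
        previous = fmt
    return reorders


propagated = resolve_formats([("conv", ANY), ("relu", ANY), ("conv", ANY)])
forced = resolve_formats([("conv", ANY), ("relu", "plain"), ("conv", ANY)])
reorders_propagated = count_reorders(propagated)
reorders_forced = count_reorders(forced)
```

In this toy model, forcing a plain format on the middle layer triples the number of reorders compared with letting the format propagate.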

## Propagation Kinds

- Forward inference: input data is fed into the model to produce a result
- Forward training: the forward pass during training; it differs from inference (with a performance impact) because data needed by the backward phase must be kept
- Backward data: propagates the error with respect to the input data
- Backward weights: propagates the error with respect to the model weights

For forward inference:
- do as many in-place operations as possible
- some primitives can be chained / fused together

## Workspace

A workspace is an additional tensor that some operations require for their computations.  
For example, some primitives fill it during the forward pass and use it during the backward pass.  
A workspace is different from a scratchpad, which is only needed while a primitive executes and does not need to be preserved between calls.
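Max pooling is a typical case where a workspace helps: the forward pass can record which element won in each window, and the backward pass routes gradients to those positions. The following is a simplified 1-D sketch in plain Python with made-up function names, not the library's implementation.

```python
def maxpool1d_forward(src, window):
    """Non-overlapping 1-D max pooling.

    Returns the pooled output and a workspace holding the index of the
    maximum element of each window.
    """
    dst, workspace = [], []
    for start in range(0, len(src) - window + 1, window):
        best = max(range(start, start + window), key=lambda i: src[i])
        dst.append(src[best])
        workspace.append(best)
    return dst, workspace


def maxpool1d_backward(diff_dst, workspace, src_len):
    """Route each output gradient to the input position stored in the workspace."""
    diff_src = [0.0] * src_len
    for grad, index in zip(diff_dst, workspace):
        diff_src[index] += grad
    return diff_src


src = [1.0, 3.0, 2.0, 0.5, 4.0, 4.5]
dst, workspace = maxpool1d_forward(src, window=2)
diff_src = maxpool1d_backward([10.0, 20.0, 30.0], workspace, len(src))
```

The workspace must survive between the forward and backward calls, which is what distinguishes it from a scratchpad.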

## Quantization

Int8 inference is possible through quantization: fp32 values are converted to int8 using a scaling factor.  
There are some int8 primitives for computation.
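A minimal sketch of symmetric quantization with a single scaling factor, in plain Python. The convention used here (multiply by the scale to quantize, derive the scale from the largest absolute value) is an assumption for illustration; other conventions exist.

```python
def compute_scale(values, qmax=127):
    """Pick a scale so that the largest absolute value maps to qmax."""
    max_abs = max(abs(v) for v in values)
    return qmax / max_abs if max_abs else 1.0


def quantize(values, scale):
    """fp32 -> int8: scale, round, and saturate to the int8 range."""
    return [max(-128, min(127, round(v * scale))) for v in values]


def dequantize(quantized, scale):
    """int8 -> fp32: undo the scaling (the rounding error remains)."""
    return [q / scale for q in quantized]


values = [-1.0, 0.0, 0.25, 1.0]
scale = compute_scale(values)
quantized = quantize(values, scale)
restored = dequantize(quantized, scale)
```

The round trip is lossy: each restored value differs from the original by at most one quantization step, `1 / scale`.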

## API

The library API is written in C.
A header-based wrapper around the C API is available for C++, which makes the library much more convenient to use.  
Some features aren't available in the C++ API.