
Commit

Add doc/public.md and make more documentation improvements
bjacob committed Dec 16, 2016
1 parent b39283c commit 21db823
Showing 7 changed files with 200 additions and 34 deletions.
12 changes: 7 additions & 5 deletions README.md
@@ -72,11 +72,13 @@ gemmlowp's main public interface is in the `public/` subdirectory.

This is a headers-only library, so there is nothing to link to.

Usage documentation, and comments on the deprecation status of each public entry
point, may be found in [doc/public.md](doc/public.md).

A full, self-contained usage example, showing how to quantize float matrices and
perform a quantized matrix multiplication approximating a float matrix
multiplication, is given in
[doc/quantization_example.cc](doc/quantization_example.cc).

### Old EightBitIntGemm legacy deprecated interface

@@ -212,7 +214,7 @@ arm-linux-androideabi-g++ that does include NEON.
The main benchmark is

```
test/benchmark.cc
```

It doesn't need to be linked to any other source file. We recommend building
30 changes: 21 additions & 9 deletions doc/design.md
@@ -135,19 +135,31 @@ for (int r = 0; r < rows; r += block_params.l2_rows) {

The files in `internal/` fall into a few categories:

There are two top-level GEMM implementations,

* [internal/single_thread_gemm.h](../internal/single_thread_gemm.h)
* [internal/multi_thread_gemm.h](../internal/multi_thread_gemm.h)

They both call into pack/compute/unpack stages (see [kernel.md](kernel.md) and
[packing.md](packing.md)) implemented in the following files:

* [internal/pack.h](../internal/pack.h)
* [internal/compute.h](../internal/compute.h)
* [internal/unpack.h](../internal/unpack.h)
  * This in turn calls into [internal/output.h](../internal/output.h) for
    the output pipeline (see [output.md](output.md))

The pack.h and unpack.h files contain generic templated code that can be
overridden by optimized code in template specializations; for example, see the
NEON optimized code here:

* [internal/pack_neon.h](../internal/pack_neon.h)
* [internal/unpack_neon.h](../internal/unpack_neon.h)
  * This in turn calls into
    [internal/output_neon.h](../internal/output_neon.h)

The compute stage contains generic code in compute.h that only calls into
optimized code through the Kernel::Run() entry point. Each kernel is basically
just a struct offering a Run() implementation (a schematic sketch follows below);
see the NEON kernels in:

* [internal/kernel_neon.h](../internal/kernel_neon.h)
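
To make this concrete, here is a schematic, self-contained sketch of "a struct
offering a Run() implementation". It is not gemmlowp's actual kernel interface
(real kernels implement `KernelBase` from `internal/kernel.h` and declare a
packed-block `Format`), and it assumes a simplified packed layout — a plain
row-major LHS block and column-major RHS block — purely for illustration:

```
#include <cstdint>

// Schematic only: a kernel is a struct whose Run() accumulates one block of
// int32 results from packed uint8 operand blocks. The layout assumed here
// (row-major lhs block, column-major rhs block and accumulators) is a
// simplification, not gemmlowp's cell-based packed format.
struct ReferenceKernelSketch {
  const char* Name() const { return "reference sketch (not optimized)"; }

  void Run(std::int32_t* dst, int rows, int cols, int depth,
           const std::uint8_t* lhs, const std::uint8_t* rhs) const {
    for (int r = 0; r < rows; r++) {
      for (int c = 0; c < cols; c++) {
        std::int32_t acc = dst[r + c * rows];  // column-major accumulator block
        for (int d = 0; d < depth; d++) {
          acc += static_cast<std::int32_t>(lhs[r * depth + d]) *
                 static_cast<std::int32_t>(rhs[d + c * depth]);
        }
        dst[r + c * rows] = acc;
      }
    }
  }
};
```

Optimized kernels such as those in `internal/kernel_neon.h` replace the inner
loops with NEON code, but keep the same struct-with-Run() shape.
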
16 changes: 9 additions & 7 deletions doc/kernel.md
@@ -144,18 +144,18 @@ lhs and rhs matrices for optimally efficient traversal by the kernel. This
depends on fine details of the kernel format, in ways that can only be
efficiently handled by knowing these kernel format details at compile-time.

This is the reason why all the code in [internal/pack.h](../internal/pack.h) is
templated in the corresponding kernel format.

The code in internal/pack.h isn't tightly optimized by itself, but it is
structured in such a way that the critical code is in a template,
`PackingRegisterBlock`, that can easily be specialized to override the slow
generic code with fast specific packing code for specific formats, on specific
platforms.

See [internal/pack_neon.h](../internal/pack_neon.h) which provides NEON
specializations of the packing code for the particular kernel formats that are
used by the NEON kernels in [internal/kernel_neon.h](../internal/kernel_neon.h).

## Wrapping up: how to optimize gemmlowp for a CPU architecture

@@ -166,5 +166,7 @@ dictate its required data layout; each data layout then also needs optimized
packing code. The steps are thus:

1. Freely design a GEMM kernel with a freely chosen data layout.
2. Implement the GEMM kernel, similar to
[internal/kernel_neon.h](../internal/kernel_neon.h).
3. Implement the optimized packing code, similar to
[internal/pack_neon.h](../internal/pack_neon.h).
11 changes: 6 additions & 5 deletions doc/low-precision.md
@@ -25,8 +25,8 @@ mechanism by which gemmlowp becomes generic enough to support multiple 8bit
computation paradigms, by allowing the user to set up a chain of transformations
to be performed on internal 32bit accumulators to obtain the final outputs.

The public entry point in [public/gemmlowp.h](../public/gemmlowp.h) that allows
setting up an arbitrary output pipeline is `GemmWithOutputPipeline`.

Refer to [quantization.md](quantization.md) for details of how one gets from
first principles to the actual output pipelines to assemble for successful
@@ -51,7 +51,7 @@ int32 accumulators, to obtain the final outputs.

This older paradigm is the one exposed by the following entry points:

* In [public/gemmlowp.h](../public/gemmlowp.h), the `Gemm` entry point.
* The deprecated `eight_bit_int_gemm` directory.

Originally, gemmlowp started an implementation of the (now deprecated)
@@ -171,7 +171,8 @@ In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
to prepare them for efficient repeated traversal by the kernel), we compute the
sum of each row of the lhs block and the sum of each column of the rhs block.
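
For reference, these row and column sums are exactly the factors appearing in
the rank-one terms of the four-term expansion that the surrounding text refers
to as (2). In the notation used here (an assumption for illustration),
`lhs(i,j)` and `rhs(j,k)` denote the stored uint8 entries and `depth` the
accumulation dimension:

```
\sum_j \bigl(lhs(i,j) + lhs\_offset\bigr)\bigl(rhs(j,k) + rhs\_offset\bigr)
    = \sum_j lhs(i,j)\, rhs(j,k)
    + rhs\_offset \sum_j lhs(i,j)
    + lhs\_offset \sum_j rhs(j,k)
    + depth \cdot lhs\_offset \cdot rhs\_offset
```

Only the first term is computed by the kernel; the second and third terms are
the rank-one updates fed by the row sums of the lhs block and the column sums
of the rhs block respectively.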

See in [internal/pack.h](../internal/pack.h), in the PackedSideBlock class, the
following member:

```
// Handle on the additional buffer backing the vector of sums of slices
```

@@ -186,4 +187,4 @@ After these rank one updates have been computed at the packing stage, they are
ignored at the compute kernel stage, since that stage is only concerned with the
first of the four terms in (2); they are only used at the unpacking stage. See
the default/reference implementation, `UnpackResultImpl`, in
[internal/unpack.h](../internal/unpack.h).
7 changes: 4 additions & 3 deletions doc/output.md
@@ -24,12 +24,13 @@ output pipeline.
## Usage

The gemmlowp entry point that allows using an arbitrary output pipeline is
`GemmWithOutputPipeline` in [public/gemmlowp.h](../public/gemmlowp.h).

The output pipeline is specified as a `std::tuple` of "output stages", each of
which defines an elementary arithmetic transformation.

All available output stages are defined in
[public/output_stages.h](../public/output_stages.h).
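
As a minimal sketch of what such a tuple looks like (the field values below are
hypothetical placeholders, and the include assumes the gemmlowp checkout root
is on the include path), a two-stage pipeline might be built like this:

```
#include <tuple>

#include "public/output_stages.h"

typedef std::tuple<gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale,
                   gemmlowp::OutputStageSaturatingCastToUint8>
    SimplePipeline;

SimplePipeline MakeSimplePipeline() {
  // Requantize the int32 accumulators down toward the uint8 range...
  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down;
  quantize_down.result_offset = 128;  // placeholder values
  quantize_down.result_mult_int = 32;
  quantize_down.result_shift = 8;
  // ...then clamp to [0, 255] and cast. Stages are applied in tuple order.
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast;
  return SimplePipeline(quantize_down, saturating_cast);
}
```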

## Example usage

@@ -49,4 +50,4 @@ TestOutputStages
Separately, a self-contained example showing how to use gemmlowp to compute a
quantized matrix multiplication with a sound quantization paradigm is here:

[doc/quantization_example.cc](quantization_example.cc)
145 changes: 145 additions & 0 deletions doc/public.md
@@ -0,0 +1,145 @@
# Gemmlowp's public entry points

gemmlowp's public interface is defined in
[public/gemmlowp.h](../public/gemmlowp.h).

## GemmWithOutputPipeline

The primary public entry point is: `GemmWithOutputPipeline`.

A usage example is given in
[doc/quantization_example.cc](quantization_example.cc).

The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the [usage example](quantization_example.cc)):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```

### Template parameters

Typically only the first three template parameters need to be specified, the rest
being automatically deduced from function parameters:

* `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
this must be `std::uint8_t`.
* `OutputScalar`: The scalar type of the result. At the moment,
this must be `std::uint8_t`.
* `BitDepthParams`: Defines the bit format of the input and output matrices
and the required accuracy of the computation. At the moment, the only
non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
[less-than-8-bit.md](less-than-8-bit.md) for other values and the general
idea of this, and how it may become more useful in the future.

The other template parameters, which typically do not need to be specified, are:

* `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
column-major) of the LHS, RHS, result matrices. See
[public/map.h](../public/map.h). See the performance note below: we
recommend using RowMajor, ColMajor, ColMajor respectively for optimal
performance.
* `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
See the explanation of the `output_pipeline` parameter below, and
[output.md](output.md).
* `GemmContextType`: the type of the `context` parameter. At the moment, this
must be `gemmlowp::GemmContext`.

### Function parameters

The function parameters taken by `GemmWithOutputPipeline` are (a sketch
assembling them into a complete call follows this list):

* `context`: The `gemmlowp::GemmContext` object holding state and resources to
be used for this gemmlowp call.
* `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
`MatrixMap` objects, mapping external buffers as matrices, not owning data.
See [public/map.h](../public/map.h).
* `result`: pointer to the destination `MatrixMap` object, which must be
already constructed, wrapping the external destination buffer with the
wanted destination matrix shape and storage layout. No memory allocation
will be performed by gemmlowp for the destination buffer. See
[public/map.h](../public/map.h).
* `lhs_offset`, `rhs_offset` are constants added to each matrix entry in the
LHS, RHS matrices respectively, as explained in
[low-precision.md](low-precision.md). This is only the part of the
quantization paradigm explained in [quantization.md](quantization.md) that
needs to be implemented as operations on the operands; everything else is
operations on the result, see `output_pipeline`.
* `output_pipeline` is a `std::tuple` of output stages (see
[public/output_stages.h](../public/output_stages.h)), specifying the output
pipeline (see [output.md](output.md)). This is the part of the quantization
paradigm explained in [quantization.md](quantization.md) that needs to be
implemented as operations on the result matrix.
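
Putting these together, here is a sketch of a complete call (the matrix sizes,
contents, offsets and requantization parameters are hypothetical placeholder
values, and the includes assume the gemmlowp checkout root is on the include
path):

```
#include <cstdint>
#include <tuple>
#include <vector>

#include "public/gemmlowp.h"
#include "public/output_stages.h"

void ExampleGemmCall() {
  const int rows = 2, depth = 3, cols = 4;
  std::vector<std::uint8_t> lhs_data(rows * depth, 1);
  std::vector<std::uint8_t> rhs_data(depth * cols, 2);
  std::vector<std::uint8_t> result_data(rows * cols);

  // MatrixMap objects wrap caller-owned buffers; gemmlowp allocates nothing here.
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs(
      lhs_data.data(), rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs(
      rhs_data.data(), depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor> result(
      result_data.data(), rows, cols);

  // Constants added to each lhs / rhs entry (see low-precision.md).
  const int lhs_offset = -128;
  const int rhs_offset = -128;

  // Output pipeline: scale-down requantization, then saturating cast to uint8.
  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down_stage;
  quantize_down_stage.result_offset = 128;
  quantize_down_stage.result_mult_int = 32;
  quantize_down_stage.result_shift = 8;
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  const auto output_pipeline =
      std::make_tuple(quantize_down_stage, saturating_cast_stage);

  gemmlowp::GemmContext gemm_context;
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &gemm_context, lhs, rhs, &result, lhs_offset, rhs_offset,
      output_pipeline);
}
```

For a complete, commented version that derives real quantization parameters
from float matrices, see [doc/quantization_example.cc](quantization_example.cc).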

### Performance note on storage orders

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all combinations are equally well optimized.

Because gemmlowp is primarily aimed at neural network inference workloads,
optimization focus is on this particular combination of storage orders:

* `LhsOrder=RowMajor`
* `RhsOrder=ColMajor`
* `ResultOrder=ColMajor`

The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a Convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.

Because the RHS and result are activations, we want them to share the same
storage order -- so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.

Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see [packing.md](packing.md). The compute
kernel stage ([kernel.md](kernel.md)) is unaffected.

## GemmWithOutputPipelinePC

This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
scalars. They are then broadcast against the LHS and RHS respectively.

This is useful for some flavors of neural network inference with "per-channel
quantization", whence the PC suffix. This has been useful in some settings where
a neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.

## Gemm

This is gemmlowp's original, now legacy and deprecated, entry point. See the
section of [low-precision.md](low-precision.md) on the legacy quantization
paradigm. Avoid in new code.

## The eight_bit_int_gemm directory

As explained in the top-level [README.md](../README.md#public-interfaces), this
is entirely deprecated.
13 changes: 8 additions & 5 deletions doc/quantization.md
@@ -1,7 +1,7 @@
# Building a quantization paradigm from first principles

**TLDR:** If you prefer example code over theory, look at
`doc/quantization_example.cc`.
[doc/quantization_example.cc](quantization_example.cc).

## Overview

@@ -304,7 +304,8 @@ paradigm, i.e. implementing the precise computation detailed in the previous
section (equation (5)), is
`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`.

Please refer to the comment explaining it in
[public/output_stages.h](../public/output_stages.h).
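
As a sketch of how that stage is configured (the field names are those of
`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`; the numeric values
below are hypothetical placeholders, and in real code they are derived from the
float scales of the matrices involved, as done in
[doc/quantization_example.cc](quantization_example.cc)):

```
#include "public/output_stages.h"  // assumes the gemmlowp root is on the include path

gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint
MakeRequantizeStage() {
  gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint stage;
  stage.result_fixedpoint_multiplier = 1518500250;  // placeholder fixed-point multiplier
  stage.result_shift = 10;                          // placeholder rounding right shift
  stage.result_offset_after_shift = 128;            // placeholder result zero point
  return stage;
}
```

This stage is typically followed by `OutputStageSaturatingCastToUint8` in the
output pipeline, as in the usage example.
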

## How this differs from the older legacy gemmlowp quantization paradigm

@@ -315,8 +316,9 @@ implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new output
stage implementing the new paradigm,
`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`.

Please refer to the comments in
[public/output_stages.h](../public/output_stages.h) for details about these two
output stages and how they differ.

Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are:

@@ -341,4 +343,5 @@ Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are:
## Example code illustrating the new quantization paradigm

Example code showing how to perform a quantized matrix multiplication in the
quantization paradigm discussed here is in
[doc/quantization_example.cc](quantization_example.cc).
