
quantization support in onnx #1872

Merged

Conversation

linkerzhang
Member

@linkerzhang linkerzhang commented Mar 18, 2019

ONNX quantization support:
Requirements:

  1. Interoperability MUST be ensured.
    ONLY widely accepted quantization schemas can be standardized in ONNX. In this design, 8-bit linear (scale/zero_point) quantization will be standardized (a short sketch follows the goals list below).
  2. Customized quantization schemas should be allowed.
    ONNX should be able to represent a customized quantization schema (one that has not yet been standardized in ONNX) as a subgraph consisting of primitive operators.
  3. All ONNX operators must define a mathematical function of the following form:
    outputs = OP(inputs, attrs)
    It means the data needed for mathematical calculation defined by an op must be either an input or an attribute.
  4. Enable both static and dynamic quantization.
    Quantization parameters used in defining an op will be defined as inputs/outputs. Static quantization will be a special case of the dynamic one, where the quantization parameter inputs come from either initializers or constant nodes.
    NOTE: as a best practice, weights in an inference model should be statically quantized.
  5. Support model verification for static quantization models. The verification includes:
    a. The same tensor should have the same real-value representation.
    This is ensured if it uses the same static quantization parameters everywhere it appears.
    b. Any other checks on quantization parameter values before sending a model to a hardware vendor.

Goals of this design/PR:

  1. Add a small set of operators to standardize 8-bit linear (scale/zero_point) quantization.
    QLinearMatMul/QLinearConv are added in this PR, more ops may be added in separate PRs as needed later.
  2. Add a small set of operators to further enable ONNX to represent other quantization schemas.
    MatmulInteger/ConvInteger are added in this PR, more ops may be added in separate PRs as needed later.
  3. Add quantization information as model level annotation for easy model verification.
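
For readers new to the scheme, here is a minimal NumPy sketch of the 8-bit linear (scale/zero_point) quantization described above. It is illustrative only and assumes asymmetric uint8 quantization; the function names are hypothetical, not the ONNX ops themselves.

import numpy as np

def quantize_linear(x, scale, zero_point):
    # y = saturate(round(x / scale) + zero_point); np.round uses ties-to-even, matching the op doc below.
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize_linear(y, scale, zero_point):
    # x_hat = (y - zero_point) * scale; recovers x up to quantization error.
    return (y.astype(np.int32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
scale, zero_point = 2.0 / 255.0, 128            # maps roughly [-1, 1] onto [0, 255]; the 2.0 saturates
y = quantize_linear(x, scale, zero_point)
x_hat = dequantize_linear(y, scale, zero_point)

Static quantization corresponds to scale/zero_point coming from initializers or constant nodes; dynamic quantization computes them at runtime from the tensor being quantized.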

@linkerzhang
Member Author

Test cases for all ops will be added soon.

Contributor

@diyessi diyessi left a comment


I added a couple of wording changes to make it easier to read.
These quantization additions look like what we agreed to at Intel.

namespace ONNX_NAMESPACE {

static const char* QuantizeLinear_ver10_doc = R"DOC(
The linear quantization operator. It consumes a high precision tensor, a scale, a zero point and computes the low precision / quantized tensor.
Contributor


Change to "and a zero point to compute the ..."

}));

static const char* DequantizeLinear_ver10_doc = R"DOC(
The linear dequantization operator. It consumes a quantized tensor, a scale, a zero point and computes the full precision tensor.
Contributor


Change to "and a zero point to compute the ..."


@darrenscrews darrenscrews left a comment


I looked at the PR; it matches what we reviewed in the document, so I'm good for a sign-off.

.Input(
2,
"y_zero_point",
"Zero point for doing quantization to get 'y'. It's a scalar, which means a per-tensor/layer quantization.",


Better to have y_zero_point as optional in the quantizer if it is optional in the dequantizer so that we are consistent.

Member Author


Sounds good.

2,
"a_zero_point",
"Zero point tensor for input 'A'. It's optional and default value is 0. It could be a scalar or a 1-D tensor, "
"which means a per-tensor or per-row quantization. If it's a 1-D tensor, its number of elements "


We should explicitly pass the axis as an argument here to be consistent with the definition of per-row/per-column/per-channel quantization.

Member Author


I don't think we need this flexibility. For per-channel quantization of weights, it has to be per output channel (for conv). For matmul, if it's not per-tensor, then per-row for the first input and per-column for the second is the useful combination; per-column for the first input and per-row for the second is not that useful (the math is not straightforward). So I'd suggest keeping this specification.


I see what you're saying, but it might be good to be consistent across the spec. For Quantize_Linear, we specify axis as an attribute:
.Attr(
"axis",
"The axis along which same quantization parameters are applied. It's optional. If it's not specified, it means per-tensor quantization and input 'x_scale' and 'x_zero_point' must be scalars. If it's specified, it means per 'axis' quantization and input 'x_scale' and 'x_zero_point' must be 1-D tensors.",
AttributeProto::INT,
false)

Might be good to make this consistent across all ops. We could still enforce that the op itself only supports per-row for the first tensor and per column for the second one.

Member Author


My bad. "axis" for quant should be removed.
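
For context on the convention discussed above, here is a small NumPy sketch of an integer matmul with a per-row zero point for the first input and a per-column zero point for the second, accumulating in int32. It is illustrative only, not the ONNX reference implementation, and the function name is hypothetical.

import numpy as np

def matmul_integer(A, B, a_zero_point=0, b_zero_point=0):
    # A: (M, K) uint8, a_zero_point: scalar or shape (M,)  -> per-row
    # B: (K, N) uint8, b_zero_point: scalar or shape (N,)  -> per-column
    A32 = A.astype(np.int32) - np.asarray(a_zero_point, dtype=np.int32).reshape(-1, 1)
    B32 = B.astype(np.int32) - np.asarray(b_zero_point, dtype=np.int32).reshape(1, -1)
    return A32 @ B32                                        # int32 output

If the zero point instead varied along the reduction axis K, it could not be factored out of the accumulation so easily, which is the "math is not straightforward" point above.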

.Input(
1,
"x_scale",
"Scale tensor for input 'x'. It's a scalar, which means a per-tensor/layer quantization.",


Let's explicitly provide the axis as a parameter here. For example, if I have a 3x3x3x32 weight tensor (3 input channels, a 3x3 kernel, and 32 output channels), I would specify axis=3 and use a vector of length 32 for the per-channel scale and zero point.

Member Author


For conv, the most frequently used schemes are per-layer (per-tensor) and per output channel (for weights). I don't see much usage for the other cases. I'd suggest keeping the spec as is and extending it when we need the extra flexibility. Sounds good?
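
To illustrate the per-output-channel case for conv weights mentioned above, here is a hedged NumPy sketch assuming the output-channel axis is last (as in the 3x3x3x32 example) and symmetric int8 quantization; the helper name is hypothetical.

import numpy as np

def quantize_per_output_channel(W, axis=3):
    # One scale (and zero point) per output channel; symmetric int8, so zero_point is 0.
    reduce_axes = tuple(i for i in range(W.ndim) if i != axis)
    max_abs = np.abs(W).max(axis=reduce_axes)               # shape: (num_output_channels,)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    zero_point = np.zeros_like(scale, dtype=np.int8)
    bshape = [1] * W.ndim
    bshape[axis] = -1
    q = np.clip(np.round(W / scale.reshape(bshape)), -128, 127).astype(np.int8)
    return q, scale, zero_point

W = np.random.randn(3, 3, 3, 32).astype(np.float32)         # 3x3 kernel, 3 in, 32 out channels
q, scale, zp = quantize_per_output_channel(W, axis=3)       # scale and zp have length 32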

@linkerzhang
Member Author

I am going to have all test cases covered in a separate PR soon.

@linkerzhang
Member Author

@raghuramank100 I'm going to check it in today. Please let me know if you have more comments.

@spandantiwari please help drive the PyTorch test failure to resolution. I'm not treating the failure as a check-in blocker.

Thank you!

DequantizeLinear,
10,
OpSchema()
.Input(0, "x", "N-D quantized input tensor to be de-quantized.", "T")


Should we also have an axis argument, like we have with the quantizer?

Member Author


Clarified above: "axis" in quant should be removed, given that only weights have per-output-channel quantization (which is static).

Contributor


Well it's certainly simpler to implement DequantizeLinear with zero-point and scale being scalars. We originally included the axis attribute because it was a compromise between space savings and accuracy, and with channel-wise weight packing, @youngkim93 noticed significant accuracy improvements across the various image recognition models. Are you dropping it because QLinearConv doesn't have it? (granted, getting QLinearConv to support it would greatly complicate things) This should be compatible with our current code because the axis was optional anyway.

.Output(0, "Y", "Matrix multiply results from A * B", "T3")
.TypeConstraint(
"T1",
{"tensor(int8)", "tensor(uint8)"},

Please consider supporting int16 as well. There's cblas_gemm_s16s16s32 for int16 inputs to generate int32 outputs.

Member Author


@darrenscrews is going to add that (support for more types) in a separate PR.

@houseroad
Member

CI side is okay. Waiting for @raghuramank100's final approval :-)

@linkerzhang
Member Author

As clarified in the mail thread, the "add axis" change requested by @raghuramank100 may be added when we see use cases and want to support per-axis quantization for activations. I'm merging this PR now. @raghuramank100, please feel free to share more comments if you have any; we can keep tuning it.

static const char* QuantizeLinear_ver10_doc = R"DOC(
The linear per-tensor/layer quantization operator. It consumes a high precision tensor, a scale, a zero point to compute the low precision / quantized tensor.
The quantization formula is y = saturate ((x / y_scale) + y_zero_point). For saturation, it saturates to [0, 255] if it's uint8, or [-128, 127] if it's int8.
For (x / y_scale), it's rounding to nearest ties to even. Refer to https://en.wikipedia.org/wiki/Rounding for details. 'y_zero_point' and 'y' must have same type.
Contributor


Guessing this was input from Intel to change from round-half-away-from-zero (which was our earlier choice, and which is TensorFlow's default and C++ std::round's default)?
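
For reference, the two rounding modes discussed here differ only on halfway values; a quick illustration (not code from this PR), with np.round giving ties-to-even as specified in the doc above, and the floor-based expression giving ties-away-from-zero as in C++ std::round:

import numpy as np

x = np.array([0.5, 1.5, 2.5, -0.5, -1.5])
print(np.round(x))                               # [ 0.  2.  2. -0. -2.]  ties to even
print(np.sign(x) * np.floor(np.abs(x) + 0.5))    # [ 1.  2.  3. -1. -2.]  ties away from zero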

@linkerzhang linkerzhang merged commit 4cfa542 into onnx:master Apr 4, 2019
bool require_kernel_shape,
int input1Idx,
int input2Idx) {
// propagateElemTypeFromInputToOutput(ctx, 0, 0);
Member


what is this?

@@ -872,6 +893,298 @@ ONNX_OPERATOR_SET_SCHEMA(
1,
OpSchema().FillUsing(ConvOpSchemaGenerator("a filter")));

static const char* QLinearConv_ver10_doc = R"DOC(
The convolution operator consumes a quantized input tensor, its scale and zero point,
a quantized filter, its scale and zero point, and output�s scale and zero point,
Member


The apostrophe in "output's" is a non-UTF-8 character.
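
As a rough illustration of how the three scale/zero_point pairs in QLinearConv fit together: the convolution itself accumulates zero-point-adjusted int8/uint8 values into int32, and the result is then requantized with the output's scale and zero point. The sketch below shows only that requantization step (a hedged example, not code from this PR; the function name is hypothetical):

import numpy as np

def requantize(acc_int32, x_scale, w_scale, y_scale, y_zero_point):
    # acc_int32: int32 accumulator of the integer convolution over (x - x_zero_point) and (w - w_zero_point)
    real = acc_int32 * (x_scale * w_scale)       # back to real-valued results
    y = np.round(real / y_scale) + y_zero_point  # onto the output's quantization grid
    return np.clip(y, 0, 255).astype(np.uint8)   # saturate to uint8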

Member

@houseroad houseroad left a comment


This diff didn't correctly register the ops in the operator set and contains a non-UTF-8 character... it needs more polish.

houseroad added a commit that referenced this pull request Apr 5, 2019
@houseroad
Member

The merged version is premature. Please address my comments and resubmit the PR. Thanks

houseroad added a commit that referenced this pull request Apr 5, 2019
hariharans29 pushed a commit to hariharans29/onnx that referenced this pull request Aug 15, 2019
* add quantized ops

* shape inference for matmul ops.

* update

* add quantization parameter annotation in graph proto.

* update IR version

* update proto files

* update operator.md

* update

* fix build break.

* fix build break.

* sync and resolve comments

* revert the change by mistake.

* fix shape inference test failure.

* fix comments
hariharans29 pushed a commit to hariharans29/onnx that referenced this pull request Aug 15, 2019
jcwchen pushed a commit to jcwchen/onnx that referenced this pull request Sep 23, 2020