[Training] Add Adagrad optimizer operator #1955

Merged · 44 commits · Mar 11, 2020

Changes from 14 commits

Commits
fb8f634
Adagrad draft
wschin Apr 21, 2019
f5b090e
MIMO
wschin Apr 22, 2019
5801aba
Support multiple tensors to be optimized
wschin Apr 22, 2019
e6ede10
Address comments
wschin Apr 22, 2019
38eaac9
Merge branch 'master' into adagrad
wschin Apr 22, 2019
167996c
Move optimizers to a new place
wschin Apr 25, 2019
a19309e
Merge branch 'master' into adagrad
wschin Apr 26, 2019
17c35f5
Fix build
wschin Apr 26, 2019
fa9a4cc
Add shape test
wschin Apr 28, 2019
e2fdd2a
Fix shape inf
wschin Apr 29, 2019
fac4ccc
Fix shape inf
wschin Apr 30, 2019
c1885c3
Merge branch 'master' into adagrad
wschin Apr 30, 2019
3a9fdd2
fix shape inf
wschin Apr 30, 2019
a04049e
Merge branch 'adagrad' of github.com:wschin/onnx into adagrad
wschin Apr 30, 2019
2e24eb5
Format
wschin Apr 30, 2019
6053906
Add function type
wschin Apr 30, 2019
2f1a5b1
Merge lines
wschin Apr 30, 2019
9f5abb7
Format
wschin Apr 30, 2019
bff1d53
Merge branch 'master' into adagrad
wschin May 18, 2019
92720a5
Fix version number
wschin May 19, 2019
d3095ad
Update op version in model files
wschin May 19, 2019
974d6ce
Merge branch 'master' into adagrad
wschin May 22, 2019
c071750
Fix a test function and update related test files
wschin May 23, 2019
d38c08f
Merge branch 'adagrad' of github.com:wschin/onnx into adagrad
wschin May 23, 2019
9d31276
Merge branch 'master' into adagrad
wschin May 23, 2019
e8e8ebb
Update onnx/backend/test/case/node/adagrad.py
wschin May 23, 2019
2ecb233
Merge branch 'master' into adagrad
wschin May 24, 2019
4025540
Merge branch 'master' into adagrad
wschin Sep 16, 2019
f149a1a
Merge branch 'master' into adagrad
wschin Feb 21, 2020
1e0c018
Remove unused file
wschin Feb 21, 2020
154f00f
Merge branch 'master' into adagrad
wschin Feb 21, 2020
19306d0
sync docs
wschin Feb 21, 2020
0b29802
Fix shape test
wschin Feb 21, 2020
999a4c5
Merge branch 'master' into adagrad
wschin Feb 21, 2020
6dcdde5
sync doc
wschin Feb 21, 2020
6d921f2
sync doc
wschin Feb 28, 2020
fd795b8
Merge branch 'master' into adagrad
wschin Mar 6, 2020
53989fd
sync with master
wschin Mar 6, 2020
1c143c6
Update onnx/defs/training/defs.cc
wschin Mar 6, 2020
ea8790d
sync doc
wschin Mar 6, 2020
fd7a217
address comments
wschin Mar 10, 2020
9fc0529
address a minor comment
wschin Mar 10, 2020
653c573
Merge branch 'master' into adagrad
wschin Mar 10, 2020
055b02f
Polish one line
wschin Mar 10, 2020
96 changes: 96 additions & 0 deletions docs/Changelog.md
@@ -9261,6 +9261,102 @@ This version of the operator has been available since version 9 of the default ONNX operator set.
</dl>

## Version 10 of the default ONNX operator set
### <a name="Adagrad-10"></a>**Adagrad-10**</a>

Compute one iteration of ADAGRAD, a stochastic gradient based optimization
algorithm. This operator can conduct the optimization of multiple tensor variables.

Let's define the behavior of this operator. ADAGRAD requires
some parameters:

- The initial learning-rate "R".
- The update count "T". That is, the number of training iterations conducted.
- An L2-norm regularization coefficient "lambda".
- A learning-rate decay factor "decay_factor".
- A small constant "epsilon" to avoid dividing-by-zero.

At each ADAGRAD iteration, the optimized tensors are moved along a direction
computed based on their estimated gradient and accumulated squared gradient. Assume
that only a single tensor "X" is updated by this operator. We need the value of "X",
its gradient "G", and its accumulated squared gradient "H". Therefore, variables in
this operator's input list are sequentially "R", "T", "X", "G", and "H". Other
parameters are given as attributes because they are usually constants. Also, the
corresponding output tensors are the new value of "X" (called "X_new"), and then
the new accumulated squared gradient (called "H_new"). Those outputs are computed
from the given inputs following the pseudo code below.

Let "+", "-", "*", and "/" be element-wise arithmetic operations with
numpy-style broadcasting support. The pseudo code to compute those outputs is:

// Compute a scalar learning-rate factor. If X is never updated, T should be 0.
r = R / (1 + T * decay_factor);

// Add gradient of 0.5 * lambda * ||X||_2^2, where ||X||_2 is the 2-norm.
G_regularized = lambda * X + G;

// Compute new accumulated squared gradient.
H_new = H + G_regularized * G_regularized;

// Compute the adaptive part of the per-coordinate learning rate. Note that Sqrt(...)
// computes the square root element-wise.
H_adaptive = Sqrt(H_new) + epsilon;

// Compute the new value of "X".
X_new = X - r * G_regularized / H_adaptive;

If this operator is used to optimize multiple inputs, for example "X_1" and "X_2", the same
pseudo code can be extended to handle all tensors jointly. More specifically, we can view "X" as a
concatenation of "X_1" and "X_2" (their gradients and accumulated squared gradients should
be concatenated too) and then reuse the entire pseudo code.

Note that ADAGRAD was first proposed in http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
In that reference paper, this operator is a special case of the composite mirror
descent update in Figure 1.
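
The pseudo code above maps directly onto NumPy. The following is a minimal illustrative sketch, not part of the spec or of this PR; the function name `adagrad_step` is chosen here for clarity, and the `norm_coefficient` parameter stands in for the "lambda" attribute because `lambda` is a reserved word in Python:

```python
import numpy as np

def adagrad_step(r, t, x, g, h, norm_coefficient=0.0, epsilon=0.0, decay_factor=0.0):
    # Scalar learning-rate factor; t is the number of updates already applied.
    lr = r / (1 + t * decay_factor)
    # Gradient of the regularized objective: lambda * X + G.
    g_regularized = norm_coefficient * x + g
    # New accumulated squared gradient.
    h_new = h + g_regularized * g_regularized
    # Per-coordinate adaptive denominator.
    h_adaptive = np.sqrt(h_new) + epsilon
    # Updated tensor value.
    x_new = x - lr * g_regularized / h_adaptive
    return x_new, h_new

# Single-tensor example: R = 0.1, T = 0, X = [1.0], G = [-1.0], H = [2.0].
x_new, h_new = adagrad_step(
    np.float32(0.1), np.int64(0),
    np.array([1.0], dtype=np.float32),
    np.array([-1.0], dtype=np.float32),
    np.array([2.0], dtype=np.float32),
    norm_coefficient=0.001, epsilon=1e-5, decay_factor=0.1)
```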

#### Version

This version of the operator has been available since version 10 of the default ONNX operator set.

#### Attributes

<dl>
<dt><tt>decay_factor</tt> : float (default is 0.0)</dt>
<dd>The decay factor of the learning rate after one update. The effective learning rate is computed by r = R / (1 + T * decay_factor). Defaults to 0 so that an increasing update count does not reduce the learning rate.</dd>
<dt><tt>epsilon</tt> : float (default is 0.0)</dt>
<dd>Small scalar to avoid dividing by zero.</dd>
<dt><tt>lambda</tt> : float (default is 0.0)</dt>
<dd>Regularization coefficient of 0.5 * lambda * ||X||_2^2. Defaults to 0, which means no regularization.</dd>
</dl>

#### Inputs (3 - &#8734;)

<dl>
<dt><tt>R</tt> : T1</dt>
<dd>The initial learning rate.</dd>
<dt><tt>T</tt> : T2</dt>
<dd>The update count of "X". It should be a scalar.</dd>
<dt><tt>inputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the current values of the optimized tensors, then their gradients, and finally their accumulated squared gradients. For example, if two tensors "X_1" and "X_2" are optimized, the input list would be ["X_1", "X_2", gradient of "X_1", gradient of "X_2", accumulated squared gradient of "X_1", accumulated squared gradient of "X_2"].</dd>
</dl>

#### Outputs (1 - &#8734;)

<dl>
<dt><tt>outputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the new values of the optimized tensors followed by the new values of their accumulated squared gradients. For example, if two tensors "X_1" and "X_2" are optimized, the output list would be [new value of "X_1", new value of "X_2", new accumulated squared gradient of "X_1", new accumulated squared gradient of "X_2"].</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float scalars.</dd>
<dt><tt>T2</tt> : tensor(int64)</dt>
<dd>Constrain input types to 64-bit integer scalars.</dd>
<dt><tt>T3</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input and output types to float tensors.</dd>
</dl>

### <a name="AveragePool-10"></a>**AveragePool-10**</a>

AveragePool consumes an input tensor X and applies average pooling across
98 changes: 98 additions & 0 deletions docs/Operators.md
@@ -7,6 +7,7 @@
* <a href="#Abs">Abs</a>
* <a href="#Acos">Acos</a>
* <a href="#Acosh">Acosh</a>
* <a href="#Adagrad">Adagrad</a>
* <a href="#Add">Add</a>
* <a href="#And">And</a>
* <a href="#ArgMax">ArgMax</a>
@@ -334,6 +335,103 @@ expect(node, inputs=[x], outputs=[y],
</details>


### <a name="Adagrad"></a><a name="adagrad">**Adagrad**</a>

Compute one iteration of ADAGRAD, a stochastic gradient based optimization
algorithm. This operator can conduct the optimization of multiple tensor variables.

Let's define the behavior of this operator. ADAGRAD requires
some parameters:

- The initial learning-rate "R".
- The update count "T". That is, the number of training iterations conducted.
- An L2-norm regularization coefficient "lambda".
- A learning-rate decay factor "decay_factor".
- A small constant "epsilon" to avoid dividing-by-zero.

At each ADAGRAD iteration, the optimized tensors are moved along a direction
computed based on their estimated gradient and accumulated squared gradient. Assume
that only a single tensor "X" is updated by this operator. We need the value of "X",
its gradient "G", and its accumulated squared gradient "H". Therefore, variables in
this operator's input list are sequentially "R", "T", "X", "G", and "H". Other
parameters are given as attributes because they are usually constants. Also, the
corresponding output tensors are the new value of "X" (called "X_new"), and then
the new accumulated squared gradient (called "H_new"). Those outputs are computed
from the given inputs following the pseudo code below.

Let "+", "-", "*", and "/" be element-wise arithmetic operations with
numpy-style broadcasting support. The pseudo code to compute those outputs is:

// Compute a scalar learning-rate factor. If X is never updated, T should be 0.
r = R / (1 + T * decay_factor);

// Add gradient of 0.5 * lambda * ||X||_2^2, where ||X||_2 is the 2-norm.
G_regularized = lambda * X + G;

// Compute new accumulated squared gradient.
H_new = H + G_regularized * G_regularized;

// Compute the adaptive part of the per-coordinate learning rate. Note that Sqrt(...)
// computes the square root element-wise.
H_adaptive = Sqrt(H_new) + epsilon;

// Compute the new value of "X".
X_new = X - r * G_regularized / H_adaptive;

If this operator is used to optimize multiple inputs, for example "X_1" and "X_2", the same
pseudo code can be extended to handle all tensors jointly. More specifically, we can view "X" as a
concatenation of "X_1" and "X_2" (their gradients and accumulated squared gradients should
be concatenated too) and then reuse the entire pseudo code.

Note that ADAGRAD was first proposed in http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
In that reference paper, this operator is a special case of the composite mirror
descent update in Figure 1.
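
To complement the concatenation view above, here is a hedged sketch of how a backend might apply the same update independently to each (X_i, G_i, H_i) triple of the variadic input list; the helper name `adagrad_multi` and its list-based signature are illustrative only and not taken from this PR:

```python
import numpy as np

def adagrad_multi(r, t, xs, gs, hs, norm_coefficient=0.0, epsilon=0.0, decay_factor=0.0):
    # xs, gs, hs are equally long lists mirroring the flattened variadic input
    # layout [X_1..X_n, G_1..G_n, H_1..H_n]; outputs come back as [X_new..., H_new...].
    lr = r / (1 + t * decay_factor)           # shared scalar learning-rate factor
    new_xs, new_hs = [], []
    for x, g, h in zip(xs, gs, hs):           # one independent update per tensor
        g_reg = norm_coefficient * x + g      # regularized gradient ("lambda" term)
        h_new = h + g_reg * g_reg             # accumulated squared gradient
        new_xs.append(x - lr * g_reg / (np.sqrt(h_new) + epsilon))
        new_hs.append(h_new)
    return new_xs + new_hs
```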

#### Version

This version of the operator has been available since version 10 of the default ONNX operator set.

#### Attributes

<dl>
<dt><tt>decay_factor</tt> : float (default is 0.0)</dt>
<dd>The decay factor of the learning rate after one update. The effective learning rate is computed by r = R / (1 + T * decay_factor). Defaults to 0 so that an increasing update count does not reduce the learning rate.</dd>
<dt><tt>epsilon</tt> : float (default is 0.0)</dt>
<dd>Small scalar to avoid dividing by zero.</dd>
<dt><tt>lambda</tt> : float (default is 0.0)</dt>
<dd>Regularization coefficient of 0.5 * lambda * ||X||_2^2. Defaults to 0, which means no regularization.</dd>
</dl>

#### Inputs (3 - &#8734;)

<dl>
<dt><tt>R</tt> : T1</dt>
<dd>The initial learning rate.</dd>
<dt><tt>T</tt> : T2</dt>
<dd>The update count of "X". It should be a scalar.</dd>
<dt><tt>inputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the current values of the optimized tensors, then their gradients, and finally their accumulated squared gradients. For example, if two tensors "X_1" and "X_2" are optimized, the input list would be ["X_1", "X_2", gradient of "X_1", gradient of "X_2", accumulated squared gradient of "X_1", accumulated squared gradient of "X_2"].</dd>
</dl>

#### Outputs (1 - &#8734;)

<dl>
<dt><tt>outputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the new values of the optimized tensors followed by the new values of their accumulated squared gradients. For example, if two tensors "X_1" and "X_2" are optimized, the output list would be [new value of "X_1", new value of "X_2", new accumulated squared gradient of "X_1", new accumulated squared gradient of "X_2"].</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float scalars.</dd>
<dt><tt>T2</tt> : tensor(int64)</dt>
<dd>Constrain input types to 64-bit integer scalars.</dd>
<dt><tt>T3</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input and output types to float tensors.</dd>
</dl>


### <a name="Add"></a><a name="add">**Add**</a>

Performs element-wise binary addition (with Numpy-style broadcasting support).
84 changes: 83 additions & 1 deletion docs/TestCoverage.md
@@ -5,7 +5,7 @@
* [Overall Test Coverage](#overall-test-coverage)
# Node Test Coverage
## Summary
Node tests have covered 125/132 (94.70%, 5 generators excluded) common operators.
Node tests have covered 126/133 (94.74%, 5 generators excluded) common operators.

Node tests have covered 0/0 (N/A) experimental operators.

@@ -88,6 +88,88 @@ expect(node, inputs=[x], outputs=[y],
</details>


### Adagrad
There are 2 test cases, listed as follows:
<details>
<summary>adagrad</summary>

```python
# Define operator attributes.
norm_coefficient = 0.001
epsilon = 1e-5
decay_factor = 0.1

# Create operator.
node = onnx.helper.make_node('Adagrad',
inputs=['R', 'T', 'X', 'G', 'H'],
outputs=['X_new', 'H_new'],
norm_coefficient=norm_coefficient,
epsilon=epsilon,
decay_factor=decay_factor
)

# Define operator inputs.
r = np.array(0.1, dtype=np.float32) # scalar
t = np.array(0, dtype=np.int64) # scalar
x = np.array([1.0], dtype=np.float32)
g = np.array([-1.0], dtype=np.float32)
h = np.array([2.0], dtype=np.float32)

# Compute expected outputs of Adagrad.
x_new, h_new = apply_adagrad(r, t, x, g, h,
norm_coefficient, epsilon, decay_factor)

# Check results.
expect(node, inputs=[r, t, x, g, h],
outputs=[x_new, h_new], name='test_adagrad')
```

</details>
<details>
<summary>adagrad_multiple</summary>

```python
# Define operator attributes.
norm_coefficient = 0.001
epsilon = 1e-5
decay_factor = 0.1

node = onnx.helper.make_node('Adagrad',
inputs=['R', 'T', 'X1', 'X2',
'G1', 'G2', 'H1', 'H2'],
outputs=['X1_new', 'X2_new',
'H1_new', 'H2_new'],
norm_coefficient=norm_coefficient,
epsilon=epsilon,
decay_factor=decay_factor
)

# Define operator inputs.
r = np.array(0.1, dtype=np.float32) # scalar
t = np.array(0, dtype=np.int64) # scalar

x1 = np.array([1.0], dtype=np.float32)
g1 = np.array([-1.0], dtype=np.float32)
h1 = np.array([2.0], dtype=np.float32)

x2 = np.array([1.0, 2.0], dtype=np.float32)
g2 = np.array([-1.0, -3.0], dtype=np.float32)
h2 = np.array([4.0, 1.0], dtype=np.float32)

# Compute expected outputs of Adagrad.
x1_new, h1_new = apply_adagrad(r, t, x1, g1, h1,
norm_coefficient, epsilon, decay_factor)
x2_new, h2_new = apply_adagrad(r, t, x2, g2, h2,
norm_coefficient, epsilon, decay_factor)

# Check results.
expect(node, inputs=[r, t, x1, x2, g1, g2, h1, h2],
outputs=[x1_new, x2_new, h1_new, h2_new], name='test_adagrad_multiple')
```

</details>
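
Both cases above call an `apply_adagrad` helper that is not reproduced in this excerpt (it presumably lives next to these cases in onnx/backend/test/case/node/adagrad.py, one of the files touched by this PR). A minimal sketch of what such a helper might look like, assuming it follows the operator's pseudo code exactly:

```python
import numpy as np

def apply_adagrad(r, t, x, g, h, norm_coefficient, epsilon, decay_factor):
    # Hypothetical re-implementation for illustration; the real helper shipped
    # with this PR may differ in details.
    lr = r / (1 + t * decay_factor)
    g_regularized = norm_coefficient * x + g
    h_new = h + g_regularized * g_regularized
    x_new = x - lr * g_regularized / (np.sqrt(h_new) + epsilon)
    return x_new, h_new
```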


### Add
There are 2 test cases, listed as follows:
<details>