Move constants to attributes
wschin committed Apr 26, 2019
1 parent 3252352 commit 9ef7a96
Showing 4 changed files with 120 additions and 270 deletions.
154 changes: 37 additions & 117 deletions docs/Changelog.md
- The initial learning-rate "R".
- The update count "T". That is, the number of training iterations conducted.
- An L2-norm regularization coefficient "lambda".
- A learning-rate decay factor "decay_factor".
- A small constant "epsilon" to avoid dividing-by-zero.

At each ADAGRAD iteration, the optimized tensors are moved along a direction
computed based on their estimated gradient and accumulated squared gradient. Assume
that only a single tensor "X" is updated by this operator. We need the value of "X",
its gradient "G", and its accumulated squared gradient "H". Consequently, if "X" is
the only one tensor to be optimized, variables in this operator's input list are
sequentially "R", "T", "D", "Eps", "X", "G", and "H". Also, the corresponding output
tensors are the new value of "X" (called "X_new"), and the new accumulated squared
gradient (called "H_new"). Those outputs are computed from the given inputs following
the pseudo code below.
its gradient "G", and its accumulated squared gradient "H". Therefore, variables in
this operator's input list are sequentially "R", "T", "X", "G", and "H". Other
parameters are given as attributes because they are usually constants. Also, the
corresponding output tensors are the new value of "X" (called "X_new"), and then
the new accumulated squared gradient (called "H_new"). Those outputs are computed
from the given inputs following the pseudo code below.

Let "+", "-", "*", and "/" are all element-wise operations with numpy-style broadcasting.
The pseudo code to compute those outputs is:
Let "+", "-", "*", and "/" are all element-wise arithmetic operations with
numpy-style broadcasting support. The pseudo code to compute those outputs is:

// Compute a scalar learning-rate factor. If X is never updated, T should be 0.
r = R / (1 + T * decay_factor);

// Add gradient of 0.5 * lambda * ||X||_2^2, where ||X||_2 is the 2-norm.
G_regularized = lambda * X + G;

// Compute new accumulated squared gradient.
H_new = H + G_regularized * G_regularized;

// Compute the adaptive part of the per-coordinate learning rate. Note that Sqrt(...)
// computes the square root element-wise.
H_adaptive = Sqrt(H_new) + epsilon

// Compute the new value of "X".
X_new = X - r * G_regularized / H_adaptive;
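
For reference, here is a minimal, non-normative NumPy sketch of the single-tensor update described above (the helper name "adagrad_update" and its signature are illustrative assumptions; "lam" stands for the "lambda" attribute):

```python
import numpy as np

def adagrad_update(R, T, X, G, H, decay_factor=0.0, epsilon=0.0, lam=0.0):
    """One ADAGRAD step for a single tensor, following the pseudo code above.

    R: initial learning rate (scalar), T: update count (integer scalar),
    X: current value, G: gradient, H: accumulated squared gradient.
    Returns (X_new, H_new).
    """
    r = R / (1 + T * decay_factor)              # decayed scalar learning rate
    G_regularized = lam * X + G                 # add gradient of 0.5 * lam * ||X||_2^2
    H_new = H + G_regularized * G_regularized   # accumulate squared gradient
    H_adaptive = np.sqrt(H_new) + epsilon       # per-coordinate denominator
    X_new = X - r * G_regularized / H_adaptive
    return X_new, H_new
```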

If one assigns this operator to optimize multiple inputs, for example, "X_1" and "X_2", the same
pseudo code may be extended to handle all tensors jointly. More specifically, we can view "X" as a
concatenation of "X_1" and "X_2" (of course, their gradients and accumulated squared gradients should
be concatenated too) and then just reuse the entire pseudo code.
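
Equivalently, one can simply loop over the tensors and apply the single-tensor update to each (X, G, H) triple; a short sketch reusing the "adagrad_update" helper from the previous example (both are illustrations, not the normative definition):

```python
def adagrad_update_multi(R, T, Xs, Gs, Hs, decay_factor=0.0, epsilon=0.0, lam=0.0):
    """Apply the single-tensor ADAGRAD step to every (X_i, G_i, H_i) triple."""
    new_Xs, new_Hs = [], []
    for X, G, H in zip(Xs, Gs, Hs):
        X_new, H_new = adagrad_update(R, T, X, G, H, decay_factor, epsilon, lam)
        new_Xs.append(X_new)
        new_Hs.append(H_new)
    return new_Xs, new_Hs
```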

Note that ADAGRAD was first proposed in http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
In that reference paper, this operator is a special case of Figure 1's composite mirror
descent update.

#### Version

This version of the operator has been available since version 10 of the default ONNX operator set.

#### Attributes

<dl>
<dt><tt>decay_factor</tt> : float (default is 0.0)</dt>
<dd>The decay factor of learning rate after one update. The effective learning rate is computed by r = R / (1 + T * decay_factor). Defaults to 0 so that an increasing update count does not reduce the learning rate.</dd>
<dt><tt>epsilon</tt> : float (default is 0.0)</dt>
<dd>Small scalar to avoid dividing by zero.</dd>
<dt><tt>lambda</tt> : float (default is 0.0)</dt>
<dd>Regularization coefficient of 0.5 * lambda * ||X||_2^2. Defaults to 0, which means no regularization.</dd>
</dl>

#### Inputs (3 - &#8734;)

<dl>
<dt><tt>R</tt> : T1</dt>
<dd>The initial learning rate.</dd>
<dt><tt>T</tt> : T2</dt>
<dd>The update count of "X". It should be a scalar.</dd>
<dt><tt>inputs</tt> (variadic, heterogeneous) : T3</dt>
<dd>It sequentially contains the current values of optimized tensors, then their gradients, and then their accumulated squared gradients. For example, if two tensors "X_1" and "X_2" are optimized, the input list would be ["X_1", "X_2", gradient of "X_1", gradient of "X_2", accumulated squared gradient of "X_1", accumulated squared gradient of "X_2"].</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float scalars.</dd>
<dt><tt>T2</tt> : tensor(int64)</dt>
<dd>Constrain input types to 64-bit integer scalars.</dd>
<dt><tt>T3</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float tensors.</dd>
</dl>
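
As an illustration of the new signature, the node below is built with "onnx.helper.make_node"; the op type "Adagrad", the default domain, and the exact output ordering are assumptions based on the description above, since this excerpt does not show them explicitly:

```python
from onnx import helper

node = helper.make_node(
    "Adagrad",                                   # assumed op type
    inputs=["R", "T",                            # learning rate, update count
            "X1", "X2",                          # tensors to optimize
            "G1", "G2",                          # their gradients
            "H1", "H2"],                         # their accumulated squared gradients
    outputs=["X1_new", "X2_new", "H1_new", "H2_new"],
    # The former inputs "D", "Eps", and "Lambda" are now attributes:
    decay_factor=0.1,
    epsilon=1e-7,
    **{"lambda": 0.01},                          # "lambda" is a Python keyword, so pass it unpacked
)
```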

### <a name="AveragePool-10"></a>**AveragePool-10**</a>
Expand Down Expand Up @@ -9774,91 +9779,6 @@ This version of the operator has been available since version 10 of the default
<dd>Constrain input and output types to high-precision numeric tensors.</dd>
</dl>

### <a name="Momentum-10"></a>**Momentum-10**</a>

Compute one iteration of stochastic gradient update with momentum.
This operator can conduct the optimization of multiple tensor variables.

Let's define the behavior of this operator. As you can imagine, SG with momentum requires
several parameters:
- The learning-rate "R".
- The decay coefficient of previous accumulated gradient (i.e., momentum) "Alpha".
- The scaling coefficient of current gradient when computing momentum "Beta".
- A Frobenius norm regularization coefficient "Lambda".

Below we explain the computation rule of this operator. For the sake of simplicity,
we assume that there is only one tensor (called "X") to be optimized. Other necessary
variables include "X"'s gradient (called "G"), and "X"'s momentum (called "V"). Moreover,
there will be only two output tensors, the new value of "X" (called "X_new") and its new
momentum (called "V_new"). Depending on the mode attribute, this operator uses either
standard momentum or Nestrove's momentum. Setting the mode attribute to "Nestrove" activates
the second case. Otherwise, standard momentum may be used. Computation is detailed below.

Let "+", "-", "*", and "/" are all element-wise operations with numpy-style broadcasting.

Pseudo code for SG with Standard Momentum:

// Add gradient of 0.5 * Lambda * ||X||_F^2, where ||X||_F is the Frobenius norm.
G_regularized = Lambda * X + G;

// Compute the current momentum based on previous momentum and the regularized gradient.
V_new = Alpha * V + Beta * G_regularized;

// Update X.
X_new = X - R * V_new

Pseudo code for SG with Nesterov's Momentum:

// Add gradient of 0.5 * Lambda * ||X||_F^2, where ||X||_F is the Frobenius norm.
G_regularized = Lambda * X + G;

// Compute the current momentum based on previous momentum and the regularized gradient.
V_new = Alpha * V + Beta * G_regularized;

// Compute final update direction and then update X.
X_new = X - R * (G_regularized + Alpha * V_new)
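
For reference, a minimal, non-normative Python sketch of both variants above (the helper name "momentum_update" and its "nesterov" flag are illustrative assumptions; "lam" stands for "Lambda", and the arrays may be NumPy tensors):

```python
def momentum_update(R, X, G, V, alpha, beta, lam, nesterov=False):
    """One SG-with-momentum step for a single tensor, following the pseudo code above.

    R: learning rate, X: current value, G: gradient, V: previous momentum.
    Returns (X_new, V_new).
    """
    G_regularized = lam * X + G                   # add gradient of 0.5 * lam * ||X||_F^2
    V_new = alpha * V + beta * G_regularized      # new momentum
    if nesterov:
        X_new = X - R * (G_regularized + alpha * V_new)
    else:
        X_new = X - R * V_new
    return X_new, V_new
```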

If one assigns this operator to optimize multiple inputs, for example, "X_1" and "X_2", the same
pseudo code can be extended to handle all tensors jointly. More specifically, we can view "X" as a
concatenation of "X_1" and "X_2" (of course, their gradients and momentums should
be concatenated too) and then our pseudo code becomes applicable naturally.

#### Version

This version of the operator has been available since version 10 of the default ONNX operator set.

#### Inputs (5 - &#8734;)

<dl>
<dt><tt>R</tt> : T1</dt>
<dd>The learning rate.</dd>
<dt><tt>Alpha</tt> : T2</dt>
<dd>The decay factor of momentum. It should be a scalar.</dd>
<dt><tt>Beta</tt> : T2</dt>
<dd>The coefficient of gradient in computing new momentum. It should be a scalar.</dd>
<dt><tt>Lambda</tt> : T2</dt>
<dd>Regularization coefficient of 0.5 * Lambda * ||X||_F^2.</dd>
<dt><tt>inputs</tt> (variadic, heterogeneous) : T2</dt>
<dd>It sequentially contains the current values of optimized tensors and then their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the expected input list would be ["X_1", "X_2", momentum of "X_1", momentum of "X_2"].</dd>
</dl>

#### Outputs (1 - &#8734;)

<dl>
<dt><tt>outputs</tt> (variadic, heterogeneous) : T2</dt>
<dd>It sequentially contains the new values of optimized tensors and then the new values of their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the output list would be [new value of "X_1", new value of "X_2", new momentum of "X_1", new momentum of "X_2"].</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float scalars.</dd>
<dt><tt>T2</tt> : tensor(float), tensor(double)</dt>
<dd>Constrain input types to float tensors.</dd>
</dl>

### <a name="NonMaxSuppression-10"></a>**NonMaxSuppression-10**</a>

Filter out boxes that have high intersection-over-union (IOU) overlap with previously selected boxes.
