
SGDFusion function issue #256

Open
minkkang opened this issue Jan 23, 2019 · 7 comments

@minkkang

minkkang commented Jan 23, 2019

Hi, I'm an Intel Caffe user.

I think I found a problem in the flow of the SGDFusion function (/sgd_solver.cpp).

When using GCC, or when not using "iter_size", there is no problem. But when using the Intel compiler together with "iter_size", LARS produces wrong results.

As far as I know, when building with the Intel compiler, the SGD_FUSION option is turned on.

In the "SGD_FUSION" flow, the steps are executed in the order "GetLocalRate (which includes LARS)", "normalize", "regularization & update".

Here, "normalize" divides "diff_data (mutable_cpu_diff or mutable_prv_diff)" by "iter_size", but "LARS" depends on sumsq_diff and sumsq_data.

So I think "GetLocalRate" should be executed after "normalize".

After changing the SGD_FUSION flow to "normalize" -> "GetLocalRate" -> "regularization & update", LARS works fine.

Would you check SGD_FUSION?
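
To illustrate the effect, here is a minimal standalone sketch (not the actual Intel Caffe code; l2_norm and local_rate are made-up helpers) showing that the LARS local rate changes depending on whether the diff is divided by iter_size before or after GetLocalRate:

#include <cmath>
#include <cstdio>
#include <vector>

// L2 norm of a vector (what sumsq_diff / sumsq_data feed into).
static double l2_norm(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

// LARS local LR: lambda = ||w|| / (||g|| + beta * ||w||).
static double local_rate(const std::vector<double>& w,
                         const std::vector<double>& g, double beta) {
  return l2_norm(w) / (l2_norm(g) + beta * l2_norm(w));
}

int main() {
  const double beta = 0.0005;                  // weight decay
  const int iter_size = 4;                     // accumulated sub-batches
  std::vector<double> w = {0.5, -0.3, 0.8};    // weights
  std::vector<double> g = {4.0, -2.0, 1.2};    // accumulated, un-normalized diff

  // Current SGD_FUSION order: GetLocalRate sees the un-normalized diff.
  double lambda_before = local_rate(w, g, beta);

  // Proposed order: Normalize first (divide by iter_size), then GetLocalRate.
  std::vector<double> g_norm = g;
  for (double& x : g_norm) x /= iter_size;
  double lambda_after = local_rate(w, g_norm, beta);

  // The two values differ whenever iter_size > 1.
  std::printf("lambda (before normalize): %f\n", lambda_before);
  std::printf("lambda (after  normalize): %f\n", lambda_after);
  return 0;
}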

@ftian1
Contributor

ftian1 commented Jan 23, 2019

I think you are right. Could you contribute a PR so we can merge it?

@minkkang
Author

Yes, I'll submit a PR. Thanks.

@ftian1
Contributor

ftian1 commented Jan 24, 2019

@minkkang Hi, could you move GetLocalRate() after regularization(), since the latter also changes learning_param.diff? That would make it exactly the same as the non-fusion version.

@minkkang
Author

minkkang commented Jan 24, 2019

@ftian1

Hi,

When I analyzed the "SGDFusion" version (GetLocalRate -> Regularize & update) and the "NON-SGDFusion" version (Regularize -> GetLocalRate -> update), I'm not completely sure, but I think the "SGDFusion" version is the correct one, provided GetLocalRate (LARS) is executed after "Normalize".

According to the LARS paper ("LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS WITH LAYER-WISE ADAPTIVE RATE SCALING" [1]),

the flow of LARS is as follows:


Parameters: base LR γ0, momentum m, weight decay β, LARS coefficient η, number of steps T

g[t] ←∇L(w[t]) // obtain a stochastic gradient for the current mini-batch (1)

γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate (2)

λ ← ||w[t]|| / (||g[t]|| + β * ||w[t]||) // compute the local LR λ (3)

// update the momentum (4)
v[t+1] ← mv[t] + γ[t+1] * λ * ( g[t] + β * w[t] )
==
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )

w[t+1] ← w[t] - v[t+1] // update the weights (5)
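
For reference, here is a minimal sketch of one LARS step in the paper's order (lars_step is a hypothetical helper, not an Intel Caffe function); note that lambda is computed from the raw gradient, before weight decay is added:

#include <cmath>
#include <cstddef>
#include <vector>

static double l2_norm(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

// One LARS step per the paper:
//   lambda = ||w|| / (||g|| + beta * ||w||)          (3)
//   v_new  = m * v + gamma * lambda * (g + beta * w) (4)
//   w_new  = w - v_new                               (5)
void lars_step(std::vector<double>& w, std::vector<double>& v,
               const std::vector<double>& g,
               double gamma, double m, double beta) {
  const double lambda = l2_norm(w) / (l2_norm(g) + beta * l2_norm(w));
  for (std::size_t i = 0; i < w.size(); ++i) {
    v[i] = m * v[i] + gamma * lambda * (g[i] + beta * w[i]);
    w[i] -= v[i];
  }
}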


But the flow of the "NON-SGDFusion" version is as follows:


g[t] ← ∇L(w[t])
γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate(2)

//Call Normalize function
//Call Regularization function

g[t] ← β * w[t] + g[t]
==
g[t] ← ∇L(w[t]) + β * w[t]

//Call ComputeUpdateValue function

// (3) compute the local LR λ
λ ← ||w[t]|| / ( ||g[t]|| + β * ||w[t]|| )
==
λ ← ||w[t]|| / ( ||∇L(w[t]) + β * w[t]|| + β * ||w[t]|| )

// update the momentum (4)
v[t+1] ← mv[t] + γ[t+1] * λ * g[t]
==
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t]) + β * w[t]|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )

w[t+1] ← w[t] - v[t+1] // update the weights (5)


In this flow, the v[t+1] value is changed:

// LARS original
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )
// NON-SGDFusion
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t]) + β * w[t]|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )
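
The difference is only in lambda: the non-fusion flow measures the norm of the already-regularized diff. A small standalone sketch (hypothetical helpers, not Intel Caffe code; beta is just an example value) of the two lambda values:

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

static double l2_norm(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

int main() {
  const double beta = 0.01;                     // weight decay (example value)
  std::vector<double> w = {0.5, -0.3, 0.8};     // weights
  std::vector<double> grad = {0.9, -0.4, 0.3};  // raw gradient g = dL/dw

  // Regularized diff, as produced by the Regularization step: g + beta * w.
  std::vector<double> reg = grad;
  for (std::size_t i = 0; i < reg.size(); ++i) reg[i] += beta * w[i];

  // LARS paper / SGDFusion: lambda from the raw gradient norm.
  double lambda_paper = l2_norm(w) / (l2_norm(grad) + beta * l2_norm(w));
  // NON-SGDFusion: lambda from the regularized diff norm.
  double lambda_nonfusion = l2_norm(w) / (l2_norm(reg) + beta * l2_norm(w));

  std::printf("lambda (paper / SGDFusion): %f\n", lambda_paper);
  std::printf("lambda (NON-SGDFusion):     %f\n", lambda_nonfusion);
  return 0;
}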

The flow of the "SGDFusion" version is as follows:


g[t] ← ∇L(w[t])
γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate(2)

//Call SGDFusion function

λ ← ||w[t]|| / ( ||g[t]|| + β * ||w[t]|| ) // compute the local LR λ (3)
==
λ ← ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| )

// execute Normalize (it should be executed before GetLocalRate)

// update the momentum (4)
v[t+1] ← mv[t] + γ[t+1] * λ * ( g[t] + β * w[t] )
==
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )

w[t+1] ← w[t] - v[t+1] // update the weights (5)


In this flow, the v[t+1] value is the same:

v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] ) // LARS original
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] ) // SGDFusion

I think, "SGDFusion" version looks same as LARS algorithm in [1].

So I think we just have to change the flow so that "GetLocalRate" is executed after "Normalize".
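
In other words, the proposed order would look roughly like this (a runnable outline with stub functions whose names mirror the solver steps discussed in this thread; it is not the actual Intel Caffe code or its signatures):

#include <cstdio>

// Stubs standing in for the solver steps discussed above (not Caffe code).
void Normalize(int param_id)      { std::printf("Normalize(%d)\n", param_id); }
double GetLocalRate(int param_id) { std::printf("GetLocalRate(%d)\n", param_id); return 1.0; }
void Regularize(int param_id)     { std::printf("Regularize(%d)\n", param_id); }
void ComputeUpdateValue(int param_id, double rate) {
  std::printf("ComputeUpdateValue(%d, %f)\n", param_id, rate);
}

// Proposed ordering for the update path.
void ApplyUpdateProposed(int param_id, double rate) {
  Normalize(param_id);                                // divide diff by iter_size first
  double local_rate = rate * GetLocalRate(param_id);  // LARS lambda from the raw gradient
  Regularize(param_id);                               // add beta * w to the diff
  ComputeUpdateValue(param_id, local_rate);           // momentum and weight update
}

int main() { ApplyUpdateProposed(0, 0.1); return 0; }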

If I'm right, I'll change the "NON-SGDFusion" version accordingly.

@minkkang
Author

@ftian1
Hi,
May I change the flow of the non-fusion version?

@ftian1
Contributor

ftian1 commented Feb 13, 2019

Sorry for the late response due to Chinese New Year. Yes, I think your analysis is right; the non-fusion version should be updated.

@minkkang
Author

Thank you for the reply. I'll submit a PR after changing the flow.
