
SGDFusion function issue #256

Open
minkkang opened this issue Jan 23, 2019 · 7 comments

@minkkang

minkkang commented Jan 23, 2019

Hi, I'm an Intel Caffe user.

I think I found a problem in the flow of the SGDFusion function (/sgd_solver.cpp).

When using GCC, or when not using "iter_size", there is no problem. But when using the Intel compiler together with "iter_size", LARS produces wrong results.

As far as I know, when building with the Intel compiler, the SGD_FUSION option is turned on.

In the "SGD_FUSION" flow, the steps are executed in the order "GetLocalRate (which includes LARS)", "normalize", "regularization & update".

Here, "normalize" divides "diff_data (mutable_cpu_diff or mutable_prv_diff)" by "iter_size", but "LARS" depends on sumsq_diff and sumsq_data.

So I think "GetLocalRate" should be executed after "normalize".

After changing the SGD_FUSION flow to "normalize" -> "GetLocalRate" -> "regularization & update", LARS works fine.

Would you check SGD_FUSION?
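
To illustrate the effect, here is a minimal standalone sketch (not the actual Intel Caffe code; l2_norm and local_rate are made-up helpers) showing that the LARS local rate changes depending on whether the diff is divided by iter_size before or after GetLocalRate:

#include <cmath>
#include <cstdio>
#include <vector>

// L2 norm of a vector (what sumsq_diff / sumsq_data feed into).
static double l2_norm(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

// LARS local LR: lambda = ||w|| / (||g|| + beta * ||w||).
static double local_rate(const std::vector<double>& w,
                         const std::vector<double>& g, double beta) {
  return l2_norm(w) / (l2_norm(g) + beta * l2_norm(w));
}

int main() {
  const double beta = 0.0005;                  // weight decay
  const int iter_size = 4;                     // accumulated sub-batches
  std::vector<double> w = {0.5, -0.3, 0.8};    // weights
  std::vector<double> g = {4.0, -2.0, 1.2};    // accumulated, un-normalized diff

  // Current SGD_FUSION order: GetLocalRate sees the un-normalized diff.
  double lambda_before = local_rate(w, g, beta);

  // Proposed order: Normalize first (divide by iter_size), then GetLocalRate.
  std::vector<double> g_norm = g;
  for (double& x : g_norm) x /= iter_size;
  double lambda_after = local_rate(w, g_norm, beta);

  // The two values differ whenever iter_size > 1.
  std::printf("lambda (before normalize): %f\n", lambda_before);
  std::printf("lambda (after  normalize): %f\n", lambda_after);
  return 0;
}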

@ftian1
Contributor

ftian1 commented Jan 23, 2019

I think you are right. Could you contribute a PR so we can merge it?

@minkkang
Author

Yes, I'll submit a PR. Thanks.

@ftian1
Contributor

ftian1 commented Jan 24, 2019

@minkkang Hi, could you move GetLocalRate() after regularization(), since the latter also changes learning_param.diff? That would make it exactly the same as the non-fusion version.

@minkkang
Author

minkkang commented Jan 24, 2019

@ftian1

Hi,

When I analyzed the "SGDFusion" version (GetLocalRate -> Regularize & update) and the "NON-SGDFusion" version (Regularize -> GetLocalRate -> update), I'm not completely sure, but I think the "SGDFusion" version is the correct one, provided GetLocalRate (LARS) is executed after "Normalize".

According to the LARS paper ("LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS WITH LAYER-WISE ADAPTIVE RATE SCALING" [1]),

the flow of LARS is as follows:


Parameters: base LR γ0, momentum m, weight decay β, LARS coefficient η, number of steps T

g[t] ←∇L(w[t]) // obtain a stochastic gradient for the current mini-batch (1)

γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate (2)

λ ← ||w[t]|| / (||g[t]|| + β * ||w[t]||) // compute the local LR λ (3)

// update the momentum (4)
v[t+1] ← mv[t] + γ[t+1] * λ * ( g[t] + β * w[t] )
==
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )

w[t+1] ← w[t] - v[t+1] // update the weights (5)
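
For reference, here is a minimal sketch of one LARS step in the paper's order (lars_step is a hypothetical helper, not an Intel Caffe function); note that lambda is computed from the raw gradient, before weight decay is added:

#include <cmath>
#include <cstddef>
#include <vector>

static double l2_norm(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

// One LARS step per the paper:
//   lambda = ||w|| / (||g|| + beta * ||w||)          (3)
//   v_new  = m * v + gamma * lambda * (g + beta * w) (4)
//   w_new  = w - v_new                               (5)
void lars_step(std::vector<double>& w, std::vector<double>& v,
               const std::vector<double>& g,
               double gamma, double m, double beta) {
  const double lambda = l2_norm(w) / (l2_norm(g) + beta * l2_norm(w));
  for (std::size_t i = 0; i < w.size(); ++i) {
    v[i] = m * v[i] + gamma * lambda * (g[i] + beta * w[i]);
    w[i] -= v[i];
  }
}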


But the flow of the "NON-SGDFusion" version is as follows:


g[t] ← ∇L(w[t])
γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate(2)

//Call Normalize function
//Call Regularization function

g[t] ← β * w[t] + g[t]
==
g[t] ← ∇L(w[t]) + β * w[t]

//Call ComputeUpdateValue function

// (3) compute the local LR λ
λ ← ||w[t]|| / ( ||g[t]|| + β * ||w[t]|| )
==
λ ← ||w[t]|| / ( ||∇L(w[t]) + β * w[t]|| + β * ||w[t]|| )

// update the momentum (4)
v[t+1] ← mv[t] + γ[t+1] * λ * g[t]
==
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t]) + β * w[t]|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )

w[t+1] ← w[t] - v[t+1] // update the weights (5)


In this flow, the v[t+1] value is changed:

// LARS original
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )
// NON-SGDFusion
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t]) + β * w[t]|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )
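
The difference is only in lambda: the non-fusion flow measures the norm of the already-regularized diff. A small standalone sketch (hypothetical helpers, not Intel Caffe code; beta is just an example value) of the two lambda values:

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

static double l2_norm(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

int main() {
  const double beta = 0.01;                     // weight decay (example value)
  std::vector<double> w = {0.5, -0.3, 0.8};     // weights
  std::vector<double> grad = {0.9, -0.4, 0.3};  // raw gradient g = dL/dw

  // Regularized diff, as produced by the Regularization step: g + beta * w.
  std::vector<double> reg = grad;
  for (std::size_t i = 0; i < reg.size(); ++i) reg[i] += beta * w[i];

  // LARS paper / SGDFusion: lambda from the raw gradient norm.
  double lambda_paper = l2_norm(w) / (l2_norm(grad) + beta * l2_norm(w));
  // NON-SGDFusion: lambda from the regularized diff norm.
  double lambda_nonfusion = l2_norm(w) / (l2_norm(reg) + beta * l2_norm(w));

  std::printf("lambda (paper / SGDFusion): %f\n", lambda_paper);
  std::printf("lambda (NON-SGDFusion):     %f\n", lambda_nonfusion);
  return 0;
}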

The flow of the "SGDFusion" version is as follows:


g[t] ← ∇L(w[t])
γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate(2)

//Call SGDFusion function

λ ← ||w[t]|| / ( ||g[t]|| + β * ||w[t]|| ) // compute the local LR λ (3)
==
λ ← ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| )

// execute Normalize (it should be executed before GetLocalRate)

// update the momentum (4)
v[t+1] ← mv[t] + γ[t+1] * λ * ( g[t] + β * w[t] )
==
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] )

w[t+1] ← w[t] - v[t+1] // update the weights (5)


In this flow, the v[t+1] value is the same:

v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] ) // LARS original
v[t+1] ← mv[t] + γ[t+1] * { ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| ) } * ( ∇L(w[t]) + β * w[t] ) // SGDFusion

I think, "SGDFusion" version looks same as LARS algorithm in [1].

So I think we just have to change the flow so that "GetLocalRate" is executed after "Normalize".
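
In other words, the proposed order would look roughly like this (a runnable outline with stub functions whose names mirror the solver steps discussed in this thread; it is not the actual Intel Caffe code or its signatures):

#include <cstdio>

// Stubs standing in for the solver steps discussed above (not Caffe code).
void Normalize(int param_id)      { std::printf("Normalize(%d)\n", param_id); }
double GetLocalRate(int param_id) { std::printf("GetLocalRate(%d)\n", param_id); return 1.0; }
void Regularize(int param_id)     { std::printf("Regularize(%d)\n", param_id); }
void ComputeUpdateValue(int param_id, double rate) {
  std::printf("ComputeUpdateValue(%d, %f)\n", param_id, rate);
}

// Proposed ordering for the update path.
void ApplyUpdateProposed(int param_id, double rate) {
  Normalize(param_id);                                // divide diff by iter_size first
  double local_rate = rate * GetLocalRate(param_id);  // LARS lambda from the raw gradient
  Regularize(param_id);                               // add beta * w to the diff
  ComputeUpdateValue(param_id, local_rate);           // momentum and weight update
}

int main() { ApplyUpdateProposed(0, 0.1); return 0; }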

If I'm right, I'll change the "NON-SGDFusion" version accordingly.

@minkkang
Author

@ftian1
Hi,
May I change the flow of the non-fusion version?

@ftian1
Contributor

ftian1 commented Feb 13, 2019

Sorry for the late response due to Chinese New Year. Yes, I think your analysis is right; the non-fusion version should be updated.

@minkkang
Author

Thank you for the reply. I'll submit a PR after changing the flow.
