
Conversation

@yiheng (Contributor) commented Sep 20, 2016

We found that all-reduce does not converge as fast as one-reduce when training a model on the ImageNet dataset. The root cause is that we truncate the model parameters from fp32 to fp16 when applying the gradients.

This PR fixes the issue by keeping the parameters in full precision, and includes some other small changes.
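A rough Scala sketch of the idea behind the fix (class and method names here are hypothetical, not the actual BigDL code): the gradient is applied to an fp32 master copy of the parameters, and precision is only dropped to fp16 at the communication boundary, never in the stored parameters.

```scala
// Hypothetical sketch: keep an fp32 "master" copy of the weights and apply
// gradients to it; only the serialized copy sent over the wire is fp16.
class Fp32MasterParameter(size: Int) {
  // fp32 master copy of the weights, kept across iterations
  private val weights: Array[Float] = new Array[Float](size)

  // SGD step performed entirely in fp32
  def applyGradient(grad: Array[Float], learningRate: Float): Unit = {
    require(grad.length == size)
    var i = 0
    while (i < size) {
      weights(i) -= learningRate * grad(i)
      i += 1
    }
  }

  // precision is dropped only at the communication boundary; the lossy copy
  // goes over the wire while `weights` stays fp32
  def compressForTransfer(): Array[Short] = weights.map(halfBits)

  // rough float -> half-bit conversion for illustration only (ignores
  // rounding modes, subnormals and NaN handling)
  private def halfBits(f: Float): Short = {
    val bits = java.lang.Float.floatToIntBits(f)
    val sign = (bits >>> 16) & 0x8000
    val magnitude = ((bits & 0x7fffffff) >>> 13) - 0x1c000
    (sign | math.min(math.max(magnitude, 0), 0x7fff)).toShort
  }
}
```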

@yiheng yiheng merged commit 6c5d08d into intel:master Sep 20, 2016

val driverClassTag = classTag[T]
val driverEV = ev
val broadcastSplit = sc.broadcast(splits)
Contributor commented:

We should make broadcastSplit a member variable so that it can be reused in the subsequent sync and sumAndUpdate methods (that is, there is no need to broadcast it again and again).
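A minimal Spark sketch of this suggestion (class and field names are placeholders, not the actual BigDL code): broadcast splits once in the constructor, keep the Broadcast handle as a member, and reuse it in the jobs launched by sync and sumAndUpdate.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Placeholder class: broadcast `splits` exactly once and reuse the handle.
class ParameterManagerSketch(sc: SparkContext, splits: Array[(Int, Int)]) {

  // broadcast once; every subsequent job reuses this handle
  private val broadcastSplit: Broadcast[Array[(Int, Int)]] = sc.broadcast(splits)

  def sync(data: RDD[Int]): Unit = {
    val localSplits = broadcastSplit              // capture the handle, not the driver array
    data.foreachPartition { _ =>
      val splitsOnExecutor = localSplits.value    // read from the executor-side block cache
      // ... locate and fetch this partition's parameter split using splitsOnExecutor ...
    }
  }
}
```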

val paramBlockIds = localSplits(splitId).blockIds
Parameter[T](bm.getRemoteBytes(paramBlockIds(localSplits(splitId).partitionId)).get)(
  driverClassTag).copyTo(localParam)
localUpdate(localParam, localGradient, localState)
Contributor commented:

We should pass the update method to the class constructor instead of to the sumAndUpdate method, so that we can broadcast it once (and make broadcastUpdate a class member variable) instead of broadcasting it again and again.
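A similar sketch for this suggestion (again with hypothetical names and signatures): the update function is taken as a constructor argument and broadcast exactly once; sumAndUpdate then only dereferences the already-broadcast closure.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Placeholder class: the update closure is broadcast once in the constructor.
class AllReduceSketch(
    sc: SparkContext,
    update: (Array[Float], Array[Float]) => Unit) {  // (weights, gradients) => Unit

  // broadcast once; reused by every sumAndUpdate call
  private val broadcastUpdate: Broadcast[(Array[Float], Array[Float]) => Unit] =
    sc.broadcast(update)

  def sumAndUpdate(params: RDD[(Array[Float], Array[Float])]): Unit = {
    val localUpdate = broadcastUpdate
    params.foreachPartition { iter =>
      iter.foreach { case (weights, grads) => localUpdate.value(weights, grads) }
    }
  }
}
```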

@jason-dai (Contributor) commented:

@yiheng @dding3 We can simplify the AllReduceParameterManager implementation as follows:

  • There is no need to construct the partitionIds table in init() - each task has a unique partition id in the range of 0 to partitionNum-1
  • splits can just be an array indexed by the partition id, where each element is just a tuple of (offset, length, blockId). There is no need to compute a map of blockIds for the weight (see below)
  • Each task should only write one copy of the split (for the weight it is responsible for), instead of writing a copy for every other task (as in the current implementation). The blockId for each copy of the split can be precomputed at the driver side.
  • Similarly, paramTable in sumAndUpdate can be precomputed at the driver side and broadcast only once (see the sketch after this comment)

See other comments inline.
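A sketch of the proposed splits layout (hypothetical names, assuming a simple even partitioning of the weights): an array indexed by partition id whose elements are (offset, length, blockId) tuples, all precomputed on the driver.

```scala
// Hypothetical split descriptor: the slice of the weight this partition owns
// plus the block id it will be written under.
case class Split(offset: Int, length: Int, blockId: String)

object SplitPlan {
  // Divide `totalLength` parameters evenly across `partitionNum` tasks and
  // precompute the block id for each slice on the driver.
  def plan(totalLength: Int, partitionNum: Int): Array[Split] = {
    val base = totalLength / partitionNum
    val extra = totalLength % partitionNum
    var offset = 0
    (0 until partitionNum).map { pid =>
      val length = base + (if (pid < extra) 1 else 0)
      val split = Split(offset, length, s"weight_split_$pid")
      offset += length
      split
    }.toArray
  }
}

// Each task then only writes the single block named by splits(partitionId),
// rather than one copy per peer task.
```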

wzhongyuan pushed a commit to wzhongyuan/BigDL that referenced this pull request Aug 28, 2017
Oscilloscope98 pushed a commit to Oscilloscope98/ipex-llm that referenced this pull request Oct 13, 2022