
Conversation

@yiheng (Contributor) commented Sep 20, 2016

We found that all-reduce does not converge as fast as one-reduce when training a model on the ImageNet dataset. The root cause is that we truncate the model parameters from fp32 to fp16 when applying the gradients.

This PR fixes the issue by keeping the parameters in full precision, and includes some other small changes.
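A rough Scala sketch of the idea behind the fix (class and method names here are hypothetical, not the actual BigDL code): the gradient is applied to an fp32 master copy of the parameters, and precision is only dropped to fp16 at the communication boundary, never in the stored parameters.

```scala
// Hypothetical sketch: keep an fp32 "master" copy of the weights and apply
// gradients to it; only the serialized copy sent over the wire is fp16.
class Fp32MasterParameter(size: Int) {
  // fp32 master copy of the weights, kept across iterations
  private val weights: Array[Float] = new Array[Float](size)

  // SGD step performed entirely in fp32
  def applyGradient(grad: Array[Float], learningRate: Float): Unit = {
    require(grad.length == size)
    var i = 0
    while (i < size) {
      weights(i) -= learningRate * grad(i)
      i += 1
    }
  }

  // precision is dropped only at the communication boundary; the lossy copy
  // goes over the wire while `weights` stays fp32
  def compressForTransfer(): Array[Short] = weights.map(halfBits)

  // rough float -> half-bit conversion for illustration only (ignores
  // rounding modes, subnormals and NaN handling)
  private def halfBits(f: Float): Short = {
    val bits = java.lang.Float.floatToIntBits(f)
    val sign = (bits >>> 16) & 0x8000
    val magnitude = ((bits & 0x7fffffff) >>> 13) - 0x1c000
    (sign | math.min(math.max(magnitude, 0), 0x7fff)).toShort
  }
}
```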

@yiheng yiheng merged commit 6c5d08d into intel:master Sep 20, 2016

val driverClassTag = classTag[T]
val driverEV = ev
val broadcastSplit = sc.broadcast(splits)
Contributor commented:

We should make broadcastSplit a member variable so that it can be reused in the subsequent sync and sumAndUpdate methods (that is, there is no need to broadcast it again and again).
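A minimal Spark sketch of this suggestion (class and field names are placeholders, not the actual BigDL code): broadcast splits once in the constructor, keep the Broadcast handle as a member, and reuse it in the jobs launched by sync and sumAndUpdate.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Placeholder class: broadcast `splits` exactly once and reuse the handle.
class ParameterManagerSketch(sc: SparkContext, splits: Array[(Int, Int)]) {

  // broadcast once; every subsequent job reuses this handle
  private val broadcastSplit: Broadcast[Array[(Int, Int)]] = sc.broadcast(splits)

  def sync(data: RDD[Int]): Unit = {
    val localSplits = broadcastSplit              // capture the handle, not the driver array
    data.foreachPartition { _ =>
      val splitsOnExecutor = localSplits.value    // read from the executor-side block cache
      // ... locate and fetch this partition's parameter split using splitsOnExecutor ...
    }
  }
}
```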

val paramBlockIds = localSplits(splitId).blockIds
Parameter[T](bm.getRemoteBytes(paramBlockIds(localSplits(splitId).partitionId)).get)(
  driverClassTag).copyTo(localParam)
localUpdate(localParam, localGradient, localState)
Contributor commented:

We should pass the update method to the class constructor instead of to the sumAndUpdate method, so that we can broadcast it once (and make broadcastUpdate a class member variable) instead of broadcasting it again and again.
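A similar sketch for this suggestion (again with hypothetical names and signatures): the update function is taken as a constructor argument and broadcast exactly once; sumAndUpdate then only dereferences the already-broadcast closure.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Placeholder class: the update closure is broadcast once in the constructor.
class AllReduceSketch(
    sc: SparkContext,
    update: (Array[Float], Array[Float]) => Unit) {  // (weights, gradients) => Unit

  // broadcast once; reused by every sumAndUpdate call
  private val broadcastUpdate: Broadcast[(Array[Float], Array[Float]) => Unit] =
    sc.broadcast(update)

  def sumAndUpdate(params: RDD[(Array[Float], Array[Float])]): Unit = {
    val localUpdate = broadcastUpdate
    params.foreachPartition { iter =>
      iter.foreach { case (weights, grads) => localUpdate.value(weights, grads) }
    }
  }
}
```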

@jason-dai (Contributor) commented:

@yiheng @dding3 We can simplify the AllReduceParameterManager implementation as follows:

  • There is no need to construct the partitionIds table in init() - each task has a unique partition id in the range of 0 to partitionNum-1
  • splits can just be an array indexed by the partition id, where each element is just a tuple of (offset, length, blockId). There is no need to compute a map of blockIds for the weight (see below)
  • Each task should only write one copy of the split (for the weight it is responsible for), instead of writing a copy for every other task (as in the current implementation). The blockId for each copy of the split can be precomputed at the driver side.
  • Similarly, paramTable in sumAndUpdate can be precomputed at the driver side and broadcast only once (see the sketch after this comment)

See other comments inline.
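A sketch of the proposed splits layout (hypothetical names, assuming a simple even partitioning of the weights): an array indexed by partition id whose elements are (offset, length, blockId) tuples, all precomputed on the driver.

```scala
// Hypothetical split descriptor: the slice of the weight this partition owns
// plus the block id it will be written under.
case class Split(offset: Int, length: Int, blockId: String)

object SplitPlan {
  // Divide `totalLength` parameters evenly across `partitionNum` tasks and
  // precompute the block id for each slice on the driver.
  def plan(totalLength: Int, partitionNum: Int): Array[Split] = {
    val base = totalLength / partitionNum
    val extra = totalLength % partitionNum
    var offset = 0
    (0 until partitionNum).map { pid =>
      val length = base + (if (pid < extra) 1 else 0)
      val split = Split(offset, length, s"weight_split_$pid")
      offset += length
      split
    }.toArray
  }
}

// Each task then only writes the single block named by splits(partitionId),
// rather than one copy per peer task.
```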

wzhongyuan pushed a commit to wzhongyuan/BigDL that referenced this pull request Aug 28, 2017
Oscilloscope98 pushed a commit to Oscilloscope98/ipex-llm that referenced this pull request Oct 13, 2022