Fix All Reduce not converging on ImageNet dataset #18
Conversation
val driverClassTag = classTag[T]
val driverEV = ev
val broadcastSplit = sc.broadcast(splits)
We should make broadcastSplit a member variable so it can be reused in the subsequent sync and sumAndUpdate methods (that is, there is no need to broadcast it again and again).
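A minimal sketch of what that could look like (class and type names here, such as AllReduceParameterManager and ParamSplit, are illustrative placeholders, not the actual code):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Placeholder for the real split metadata type.
case class ParamSplit(blockIds: Array[String], partitionId: Int)

// Broadcast the splits once in the constructor and keep the handle as a member,
// so sync() and sumAndUpdate() reuse it instead of re-broadcasting on every call.
class AllReduceParameterManager(sc: SparkContext, splits: Array[ParamSplit]) {

  // Broadcast exactly once for the lifetime of this manager.
  private val broadcastSplit: Broadcast[Array[ParamSplit]] = sc.broadcast(splits)

  def sync(partitions: RDD[Int]): Unit = {
    val bc = broadcastSplit // capture the handle, not the whole class, in the closure
    partitions.foreachPartition { _ =>
      val localSplits = bc.value // cached on the executor after the first fetch
      // ... fetch remote parameter blocks according to localSplits ...
    }
  }

  def sumAndUpdate(partitions: RDD[Int]): Unit = {
    val bc = broadcastSplit // same handle, no new broadcast
    partitions.foreachPartition { _ =>
      val localSplits = bc.value
      // ... aggregate gradients per split and apply the update ...
    }
  }
}
```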
val paramBlockIds = localSplits(splitId).blockIds
Parameter[T](bm.getRemoteBytes(paramBlockIds(localSplits(splitId).partitionId)).get)(
  driverClassTag).copyTo(localParam)
localUpdate(localParam, localGradient, localState)
We should pass the update method to the class constructor instead of to the sumAndUpdate method, so that we can broadcast it once (and keep broadcastUpdate as a class member variable) instead of broadcasting it on every call.
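Roughly, the constructor-based version could look like this (a sketch only; ParameterManager and the (weights, gradient, state) function signature are simplified stand-ins for the real tensor and state types):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// The optimizer update function is taken in the constructor and broadcast once,
// instead of being broadcast inside every sumAndUpdate call.
class ParameterManager(
    sc: SparkContext,
    update: (Array[Float], Array[Float], Map[String, Any]) => Unit) {

  // Broadcast the update closure a single time and keep it as a member.
  private val broadcastUpdate
      : Broadcast[(Array[Float], Array[Float], Map[String, Any]) => Unit] =
    sc.broadcast(update)

  def sumAndUpdate(partitions: RDD[Int]): Unit = {
    val bc = broadcastUpdate // capture the member, not the outer class, in the closure
    partitions.foreachPartition { _ =>
      val localUpdate = bc.value // deserialized once per executor and cached by Spark
      // localUpdate(localParam, localGradient, localState)
    }
  }
}
```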
@yiheng @dding3 We can simplify the broadcast handling here. See other comments inline.
We found that all-reduce does not converge as fast as one-reduce when training a model on the ImageNet dataset. The root cause is that we truncate the model parameters from fp32 to fp16 when applying the gradient.
This PR fixes the issue by keeping the parameters in full precision, and includes some other small changes.
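A simplified illustration of why the fp16 truncation hurts convergence (a standalone sketch, not the PR's actual code; Fp32UpdateSketch and all values are made up for the example):

```scala
// Keep the master weights in fp32 and apply the gradient in fp32, so any fp16
// truncation is limited to communication buffers and never touches the weight
// update itself.
object Fp32UpdateSketch {
  def main(args: Array[String]): Unit = {
    val weightsFp32  = Array.fill(4)(1.0f)               // fp32 master copy, never truncated
    val gradientFp32 = Array(1e-4f, 2e-4f, 3e-4f, 4e-4f) // aggregated gradient, back in fp32
    val learningRate = 0.1f

    // Apply the update on the fp32 master copy. Doing this on an fp16 copy (the
    // old behaviour) silently drops small updates: in half precision the nearest
    // representable neighbours of 1.0 are ~0.99951 and ~1.00098, so 1.0 - 1e-5
    // rounds back to 1.0 and the step is lost.
    var i = 0
    while (i < weightsFp32.length) {
      weightsFp32(i) = weightsFp32(i) - learningRate * gradientFp32(i)
      i += 1
    }

    println(weightsFp32.mkString(", ")) // e.g. 0.99999, 0.99998, 0.99997, 0.99996
  }
}
```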