Description
@OlivierDehaene @Narsil is the output expected to be the same (or very close) when running a given model with the tensor-parallel (TP) implementation versus non-sharded on a single GPU?
I am seeing quite different output, for example with flan-ul2 or flan-t5-xxl on 2 GPUs, using float16 in both the single-GPU and two-GPU cases.
This is using a different fork of the code. I'm still investigating and will also try the latest main branch of this repo as-is, but it would be very helpful to know what you generally observe and what's expected.
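
For reference, here is a toy sketch of why I would not necessarily expect bit-identical results: float16 addition is not associative, so splitting a reduction across shards and summing the partial results can round differently from a single sequential reduction. This is only an illustration in plain Python that emulates float16 rounding with the standard `struct` module. The helper names (`to_fp16`, `fp16_dot`, `sharded_fp16_dot`) are made up for this example and are not taken from this repo, and real GPU kernels often accumulate in higher precision, so the effect there may be much smaller.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_dot(a, b):
    """Dot product where every multiply and add is rounded to float16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def sharded_fp16_dot(a, b, num_shards=2):
    """Split the reduction dimension into equal shards, reduce each shard,
    then sum the partial results (a stand-in for an all-reduce).
    Assumes len(a) is divisible by num_shards."""
    step = len(a) // num_shards
    total = 0.0
    for i in range(num_shards):
        lo, hi = i * step, (i + 1) * step
        total = to_fp16(total + fp16_dot(a[lo:hi], b[lo:hi]))
    return total


# Hypothetical values chosen so the rounding difference is visible.
activations = [2048.0, 1.0, 1.0, 1.0]
weights = [1.0, 1.0, 1.0, 1.0]

print(fp16_dot(activations, weights))             # 2048.0 (exact sum is 2051)
print(sharded_fp16_dot(activations, weights, 2))  # 2050.0
```

In the sequential case each `+ 1.0` is lost because float16 values between 2048 and 4096 are spaced 2 apart, while the sharded case first adds `1.0 + 1.0 = 2.0` on the second shard and that partial sum survives. Small differences like this seem plausible between sharded and non-sharded runs, but what I am seeing looks much larger than that.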