Question about sharding / TP #349

@njhill

Description

@OlivierDehaene @Narsil is it expected that the output should be the same (or very close) when using the TP implementation for a given model vs running non-sharded on a single GPU?

I'm seeing quite different output, for example with flan-ul2 or flan-t5-xxl on 2 GPUs, using float16 in both the single- and dual-GPU cases.

This is using a fork of the code. I'm still investigating and will also try the latest main branch of this repo as-is, but it would be very helpful to know what you generally observe and what's expected.
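For context on why some divergence can occur even with a correct TP implementation: in a row-parallel linear layer, each GPU reduces over only its slice of the hidden dimension and the partial results are then summed in the all-reduce, so the order of floating-point accumulation differs from the single-GPU matmul. Since float16 addition is not associative, the two paths can round differently. A minimal sketch of this effect (illustrative only; the shapes, NumPy usage, and 2-way split are my assumptions, not code from this repo):

```python
import numpy as np

# Hypothetical illustration: a float16 matmul whose reduction dimension
# is split across 2 "GPUs", mimicking 2-way tensor parallelism.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096)).astype(np.float16)
w = rng.standard_normal((4096, 16)).astype(np.float16)

# Single-GPU path: one matmul reduces over the full 4096 dimension.
y_single = x @ w

# 2-way TP path: each shard reduces over half the dimension, then the
# partial results are summed (standing in for the all-reduce step).
# Each partial product is rounded to float16 before the final add.
y_shard0 = x[:, :2048] @ w[:2048]
y_shard1 = x[:, 2048:] @ w[2048:]
y_tp = y_shard0 + y_shard1

# Float16 addition is not associative, so small elementwise differences
# between the two paths are expected rather than a bug per se.
diff = np.max(np.abs(y_single.astype(np.float32) - y_tp.astype(np.float32)))
print("max elementwise difference:", diff)
```

A per-element difference like this is tiny, but in autoregressive generation it can flip a token choice at one step, after which the sequences diverge entirely, so "quite different output" does not necessarily rule out a numerically equivalent implementation.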
