Question about sharding / TP #349

@njhill

Description

@OlivierDehaene @Narsil is it expected that the output should be the same (or very close) when using the TP implementation for a given model vs running non-sharded on a single GPU?

I'm seeing quite different output, for example with flan-ul2 or flan-t5-xxl on 2 GPUs, using float16 in both the single- and dual-GPU cases.

This is using a fork of the code. I'm still investigating and will also try the latest main branch of this repo as-is, but it would be very helpful to know what you generally observe and what's expected.
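For context on why some divergence can occur even with a correct TP implementation: in a row-parallel linear layer, each GPU reduces over only its slice of the hidden dimension and the partial results are then summed in the all-reduce, so the order of floating-point accumulation differs from the single-GPU matmul. Since float16 addition is not associative, the two paths can round differently. A minimal sketch of this effect (illustrative only; the shapes, NumPy usage, and 2-way split are my assumptions, not code from this repo):

```python
import numpy as np

# Hypothetical illustration: a float16 matmul whose reduction dimension
# is split across 2 "GPUs", mimicking 2-way tensor parallelism.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096)).astype(np.float16)
w = rng.standard_normal((4096, 16)).astype(np.float16)

# Single-GPU path: one matmul reduces over the full 4096 dimension.
y_single = x @ w

# 2-way TP path: each shard reduces over half the dimension, then the
# partial results are summed (standing in for the all-reduce step).
# Each partial product is rounded to float16 before the final add.
y_shard0 = x[:, :2048] @ w[:2048]
y_shard1 = x[:, 2048:] @ w[2048:]
y_tp = y_shard0 + y_shard1

# Float16 addition is not associative, so small elementwise differences
# between the two paths are expected rather than a bug per se.
diff = np.max(np.abs(y_single.astype(np.float32) - y_tp.astype(np.float32)))
print("max elementwise difference:", diff)
```

A per-element difference like this is tiny, but in autoregressive generation it can flip a token choice at one step, after which the sequences diverge entirely, so "quite different output" does not necessarily rule out a numerically equivalent implementation.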
