[BUG] Database or tuning conflict with Multi-GPU Environment #204

@LeiWang1999

Description

When a program applies tensor parallelism, different ranks may receive the same op_config, so the tuning process may be duplicated across ranks. Consider the following:

cpu 0: tune op0, save runtime into local database id 0
cpu 1: tune op0, save runtime into local database id 0 

Both ranks write to the same local database entry (id 0), and this 0 -> 0 cross-overwriting can potentially corrupt the local runtime module.
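To make the failure mode concrete, here is a minimal sketch that replays one unlucky interleaving of two unsynchronized writers on a single file. It is not taken from the project: `interleaved_save` and the payloads are hypothetical stand-ins for two ranks saving the same database entry, and the byte-level outcome assumes ordinary POSIX file semantics.

```python
import os
import tempfile


def interleaved_save(path):
    """Replay two unsynchronized saves to the same file (hypothetical stand-in
    for two ranks writing database id 0) and return what ends up on disk."""
    rank0_payload = b"A" * 8
    rank1_payload = b"B" * 2

    # buffering=0 makes every write() reach the file immediately,
    # so the interleaving below is deterministic.
    f0 = open(path, "wb", buffering=0)
    f0.write(rank0_payload[:4])          # rank 0 is halfway through its save

    f1 = open(path, "wb", buffering=0)   # rank 1 opens the same entry and truncates it
    f1.write(rank1_payload)
    f1.close()

    f0.write(rank0_payload[4:])          # rank 0 resumes at its old file offset
    f0.close()

    with open(path, "rb") as f:
        return f.read(), rank0_payload, rank1_payload


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        on_disk, p0, p1 = interleaved_save(os.path.join(tmp, "op0.bin"))
        # The stored bytes match neither rank's payload.
        print(on_disk not in (p0, p1))
```

In a real run the interleaving depends on timing, which is why the corruption would only show up intermittently.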

This may be related to issue #186.

Recommended solution:

  • Save the op into the database under a spin lock.
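One way the spin-lock idea could look is sketched below. Since ranks are usually separate processes, the sketch spins on a lock file created atomically with `O_CREAT | O_EXCL` instead of an in-process lock. Everything here is an assumption for illustration: `spin_lock`, `save_op`, the `op_<id>.bin` layout, and the timeout values are hypothetical and do not reflect the project's actual database code.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def spin_lock(lock_path, poll_interval=0.01, timeout=30.0):
    """Spin until this process atomically creates lock_path; remove it on exit."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file exists, so only one process wins.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_interval)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def save_op(db_dir, op_id, payload):
    """Save payload for op_id once; return False if another rank already did."""
    os.makedirs(db_dir, exist_ok=True)
    target = os.path.join(db_dir, f"op_{op_id}.bin")  # hypothetical layout
    with spin_lock(target + ".lock"):
        if os.path.exists(target):
            return False  # already saved by another rank, skip the overwrite
        # Write to a temp file, then rename, so readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=db_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, target)
        return True
```

With this shape, the second rank to reach `save_op` for the same `op_id` waits for the lock and then skips the write. A caveat of the lock-file approach is that a rank crashing while holding the lock leaves a stale `.lock` file behind, so a real implementation would need some cleanup or expiry policy.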

TODO Items:

  • Provide a test case that reproduces the bug.
  • Implement spin locks.

Labels: bug
