Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592

RyoYang · 2023-12-12T08:39:54Z

Description

Sync ndv4-topo.xml with the azhpc-images repo to fix the NUMA domains swap issue.
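To illustrate what a NUMA domain swap between two versions of a topology file looks like, here is a minimal sketch that compares the NUMA node assigned to each PCI device in two topology XML files. It assumes an NCCL-style layout in which `<cpu numaid="...">` elements contain nested `<pci busid="...">` elements. The inline XML snippets, the bus IDs, and the helper names `numa_by_busid` and `numa_mismatches` are hypothetical and are not taken from the actual ndv4-topo.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed topology snippets (NOT the real ndv4-topo.xml).
OLD_TOPO = """
<system version="1">
  <cpu numaid="0"><pci busid="ffff:ff:01.0"/></cpu>
  <cpu numaid="1"><pci busid="ffff:ff:02.0"/></cpu>
</system>
"""
NEW_TOPO = """
<system version="1">
  <cpu numaid="1"><pci busid="ffff:ff:01.0"/></cpu>
  <cpu numaid="0"><pci busid="ffff:ff:02.0"/></cpu>
</system>
"""


def numa_by_busid(xml_text):
    """Map each PCI bus ID to the numaid of its enclosing <cpu> element."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for cpu in root.iter("cpu"):
        # iter() walks all descendants, so nested PCI bridges are included.
        for pci in cpu.iter("pci"):
            mapping[pci.get("busid")] = cpu.get("numaid")
    return mapping


def numa_mismatches(old_xml, new_xml):
    """Return {busid: (old_numaid, new_numaid)} for devices whose NUMA node differs."""
    old, new = numa_by_busid(old_xml), numa_by_busid(new_xml)
    return {
        busid: (old[busid], new[busid])
        for busid in old.keys() & new.keys()
        if old[busid] != new[busid]
    }


if __name__ == "__main__":
    diffs = numa_mismatches(OLD_TOPO, NEW_TOPO)
    for busid, (before, after) in sorted(diffs.items()):
        print(f"{busid}: numaid {before} -> {after}")
```

A check like this could be pointed at the file shipped in an image and the upstream copy to confirm that the two agree after a sync; an empty result means no device changed NUMA node.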

codecov · 2023-12-12T08:50:39Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (27374ad) 86.12% compared to head (2d01cbf) 86.12%.

Additional details and impacted files

@@              Coverage Diff              @@
##           release/0.10     #592   +/-   ##
=============================================
  Coverage         86.12%   86.12%           
=============================================
  Files                97       97           
  Lines              6878     6878           
=============================================
  Hits               5924     5924           
  Misses              954      954

Flag	Coverage Δ
cpu-python3.6-unit-test	`71.83% <ø> (ø)`
cpu-python3.7-unit-test	`71.83% <ø> (ø)`
cpu-python3.8-unit-test	`72.24% <ø> (ø)`
cuda-unit-test	`84.15% <ø> (ø)`

guoshzhao · 2023-12-12T09:22:56Z

@abuccts Currently only the cuda11.1.1 image has this file. Do we need to add it to the other images in case users upgrade their images?

abuccts · 2023-12-12T18:50:22Z

> @abuccts Currently only the cuda11.1.1 image has this file. Do we need to add it to the other images in case users upgrade their images?

It should be added to all CUDA Dockerfiles? ROCm does not need this currently.

**Description**

* Sync ndv4-topo.xml with the [azhpc-images repo](https://github.com/Azure/azhpc-images/blob/master/topology/ndv4-topo.xml) to fix the NUMA domains swap issue

**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>

fix topo file numa domains swap issue

d41b83a

RyoYang requested a review from a team as a code owner December 12, 2023 08:39

cp5555 requested review from cp5555 and abuccts December 12, 2023 17:54

cp5555 added the containers SuperBench Containers label Dec 12, 2023

cp5555 approved these changes Dec 12, 2023

cp5555 added the bug Something isn't working label Dec 12, 2023

cp5555 mentioned this pull request Dec 12, 2023

V0.10.0 Release Plan #559

Closed

30 tasks

abuccts approved these changes Dec 12, 2023

guoshzhao approved these changes Dec 13, 2023

RyoYang enabled auto-merge (squash) December 14, 2023 01:18

Merge branch 'release/0.10' into yangwang1/fix-topo-file

2d01cbf

RyoYang merged commit 21a9702 into microsoft:release/0.10 Dec 14, 2023
20 checks passed

abuccts mentioned this pull request Jan 3, 2024

Release - SuperBench v0.10.0 #607

Merged
