[pull] master from tensorflow:master#62

Merged
pull[bot] merged 22 commits into noaai:master from tensorflow:master
Jan 15, 2025

Conversation


@pull pull bot commented Jan 15, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

thomasjoerg and others added 22 commits January 15, 2025 09:23
…er (NaNs go last).

PiperOrigin-RevId: 715826491
Moves
- byte_order.h
- crash_analysis.h
- dynamic_annotations.h
- grpc_credentials.h
- intrusive_ptr.h
- prefetch.h
- ram_file_system.h
- resource.h
- resource_loader.h
- rocm_rocdl_path.h
- stack_frame.h

PiperOrigin-RevId: 715828782
This method was renamed, but the staging function was kept; switch to the renamed variant.

PiperOrigin-RevId: 715859237
Support Lock and Unlock; instantiate the MLD CL environment as a singleton instance.
Added a CompileModel CPU test with OpenCL TensorBuffers as inputs and outputs.

PiperOrigin-RevId: 715860165
…saBufferInterval are inclusive. Update logging in MSA to indicate as much.

PiperOrigin-RevId: 715882309
We had the GIL released when constructing an nb::bytes object, which isn't allowed.

In passing, also avoid an unnecessary string copy.

PiperOrigin-RevId: 715886008
…calizer`

`DynamicDimensionInference` expects all conditional inputs/outputs to be tuplized so that it can easily add more inputs, and it `RET_CHECK`-fails otherwise; however, `ConditionalCanonicalizer` only canonicalizes the outputs. This CL changes the canonicalizer to tuplize the inputs of conditionals as well.

PiperOrigin-RevId: 715887964
…r::MemoryAllocators.

PiperOrigin-RevId: 715890862
PiperOrigin-RevId: 715898241
PiperOrigin-RevId: 715900904
PiperOrigin-RevId: 715902503
Imported from GitHub PR openxla/xla#21273

`ncclCommInitRankScalable` enables initializing communicators via multiple roots, which improves init performance at large scale.
The maximum number of ranks associated with a root rank when initializing a NCCL communicator can be tuned via `--xla_gpu_nccl_init_max_rank_per_root_ratio`. The default is 128 ranks per root.

Copybara import of the project:

--
98ef02dabc0bcb2c8206753bec4873c5f48e269f by Nicolas Castet <ncastet@nvidia.com>:

[XLA:GPU] Add support for NCCL ncclCommInitRankScalable API

--
f146a48fef5f1a1098b5c01ae79c5a0d9a9af8d7 by Nicolas Castet <ncastet@nvidia.com>:

Address review comments

--
dd6362af36a1f4d22532ad15b2007527898b5fa1 by Nicolas Castet <ncastet@nvidia.com>:

Add GpuCliqueKey::GetSubKeys unit test

Merging this change closes #21273

PiperOrigin-RevId: 715903412
+ Correctly (zero/value-)initialize PJRT_ExecuteOptions in tests and pjrt_c_api_client

```
If the number of initializer clauses is less than the number of members or
initializer list is completely empty, the remaining members are value-initialized
```

Context: openxla/xla#20429
PiperOrigin-RevId: 715906024
PiperOrigin-RevId: 715909749
…nge the function name

macOS name mangling changes the function name; use a less strict contains check that works on all platforms.

PiperOrigin-RevId: 715919685
…(dimensions whose size is 1).

It is meaningless to partition a dimension whose size is 1; doing so may insert redundant padding and unpadding. To avoid this, we replicate the sharding on these dimensions as a pre-processing step.

Take the following input as an example:
```
ENTRY entry {
  %constant.785 = f32[1,8] constant({{0,1,2,3,4,5,6,7}}), sharding={devices=[1,8]<=[8]}
  %slice.62 = f32[1,1] slice(%constant.785), slice={[0:1], [0:1]}, sharding={devices=[1,8]<=[8]}
  ROOT %reshape.779 = f32[] reshape(%slice.62), sharding={replicated}
}
```

Previous result, with redundant instructions:
```
ENTRY %entry_spmd () -> f32[] {
  %constant.8 = u32[8]{0} constant({0, 1, 2, 3, 4, 5, 6, 7})
  %partition-id = u32[] partition-id()
  %dynamic-slice.3 = u32[1]{0} dynamic-slice(u32[8]{0} %constant.8, u32[] %partition-id), dynamic_slice_sizes={1}
  %reshape.2 = u32[] reshape(u32[1]{0} %dynamic-slice.3)
  %constant.9 = u32[] constant(0)
  %compare = pred[] compare(u32[] %reshape.2, u32[] %constant.9), direction=EQ
  %broadcast = pred[1,1]{1,0} broadcast(pred[] %compare), dimensions={}
  %constant.0 = f32[1,8]{1,0} constant({ { 0, 1, 2, 3, 4, 5, 6, 7 } })
  %constant.1 = s32[] constant(0)
  %constant.2 = s32[8]{0} constant({0, 1, 2, 3, 4, 5, 6, 7})
  %dynamic-slice = s32[1]{0} dynamic-slice(s32[8]{0} %constant.2, u32[] %partition-id), dynamic_slice_sizes={1}
  %reshape = s32[] reshape(s32[1]{0} %dynamic-slice)
  %dynamic-slice.1 = f32[1,1]{1,0} dynamic-slice(f32[1,8]{1,0} %constant.0, s32[] %constant.1, s32[] %reshape), dynamic_slice_sizes={1,1}
  %copy = f32[1,1]{1,0} copy(f32[1,1]{1,0} %dynamic-slice.1)
  %constant.10 = f32[] constant(0)
  %broadcast.1 = f32[1,1]{1,0} broadcast(f32[] %constant.10), dimensions={}
  %select = f32[1,1]{1,0} select(pred[1,1]{1,0} %broadcast, f32[1,1]{1,0} %copy, f32[1,1]{1,0} %broadcast.1)
  %all-reduce = f32[1,1]{1,0} all-reduce(f32[1,1]{1,0} %select), channel_id=1, replica_groups={{0,1,2,3,4,5,6,7}}, use_global_device_ids=true, to_apply=%add.clone
  ROOT %reshape.3 = f32[] reshape(f32[1,1]{1,0} %all-reduce)
}
```

Result with this improvement:
```
ENTRY %entry_spmd () -> f32[] {
  %constant.0 = f32[1,8]{1,0} constant({ { 0, 1, 2, 3, 4, 5, 6, 7 } })
  %slice.0 = f32[1,1]{1,0} slice(f32[1,8]{1,0} %constant.0), slice={[0:1], [0:1]}
  ROOT %reshape.1 = f32[] reshape(f32[1,1]{1,0} %slice.0)
}
```

PiperOrigin-RevId: 715924899
PiperOrigin-RevId: 715934702
@pull pull bot added the ⤵️ pull label Jan 15, 2025
@pull pull bot merged commit 583b4aa into noaai:master Jan 15, 2025