
Calculate active core correctly for fusions which use shared cache #12208

Closed

Conversation

@lingzhi98 (Contributor) commented May 7, 2024

With the current column-reduction codegen, the SM active-core ratio is low when the last kept dimension is small, as in the HLO below:
fused_reduce {
param_1.15 = bf16[1,2048]{1,0} parameter(1)
bitcast.86.8 = bf16[2048]{0} bitcast(param_1.15)
convert.90.5 = f32[2048]{0} convert(bitcast.86.8)
broadcast.6.6 = f32[2048,256]{1,0} broadcast(convert.90.5), dimensions={0}, metadata={op_name="jit(func)/jit(main)/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=bfloat16]"}
param_0.29 = bf16[2048,256]{1,0} parameter(0)
convert.83.3 = f32[2048,256]{1,0} convert(param_0.29)
multiply.8.3 = f32[2048,256]{1,0} multiply(broadcast.6.6, convert.83.3)
constant_9 = f32[] constant(0)
reduce.5 = f32[256]{0} reduce(multiply.8.3, constant_9), dimensions={0}, to_apply=scalar_add_computation
param_2.12 = bf16[2048,256]{1,0} parameter(2)
convert.87.3.clone.1 = f32[2048,256]{1,0} convert(param_2.12)
multiply.9.3.clone.1 = f32[2048,256]{1,0} multiply(broadcast.6.6, convert.87.3.clone.1)
reduce.1.1.clone.1 = f32[256]{0} reduce(multiply.9.3.clone.1, constant_9), dimensions={0}, to_apply=scalar_add_computation
ROOT tuple = (f32[256]{0}, f32[256]{0}) tuple(reduce.5, reduce.1.1.clone.1)
} // fused_reduce

I submitted a related optimization in another PR: splitting the reduced dimension to improve the active-core ratio (a common and widely used technique). It gives a 2x performance improvement, from 20us to 10us, on an A100 40GB. However, that change makes this test fail. I believe the bug is in the ComputeTime function, which does not model fusions that use shared cache, such as reduction and transpose, correctly. I moved the ComputeTime change out of the reduction PR into this one to make it easier to review.
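As a rough illustration of the occupancy problem (the per-block tile width here is an assumption for illustration, not taken from the actual emitter): the fusion above reduces f32[2048,256] over the 2048 dimension, keeping only 256 output elements. If the column-reduction emitter assigns, say, 32 output columns per block, only 256 / 32 = 8 blocks are launched, so at most 8 of the 108 SMs on an A100 are busy, roughly 7% active cores. Splitting the reduced 2048 dimension across additional blocks and combining the partial sums afterwards multiplies the block count and raises that ratio accordingly.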

The review thread below is anchored on this hunk of the ComputeTime change:

    gpu_device_info.core_count() * gpu_device_info.fpus_per_core();
    int64_t num_threads, int64_t num_blocks,
    const HloFusionAnalysis* fusion_analysis) {
      int64_t core_count = gpu_device_info.core_count();
A Member replied:

Thank you for your PR!

I think there is indeed an issue with our current model: we assume that the threads of one block can be distributed among multiple SMs. That is never true; the threads of a block always reside on the same SM, regardless of shared memory use. core_count = std::min(num_blocks, core_count) makes sense for all emitters.

Since this can potentially change which fusions are formed, I'll patch this line and run our benchmarks and tests to see whether the different fusions cause any other failures or regressions.
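A minimal sketch of the proposed clamp, assuming a simplified ComputeTime signature and a plain FMA throughput model (the function name, signature, and formula are illustrative, not the merged code):

#include <algorithm>
#include <cstdint>

#include "absl/time/time.h"
#include "xla/stream_executor/device_description.h"

namespace se = ::stream_executor;

// Sketch only: estimate compute time with the active-core clamp. A thread
// block always runs on a single SM, so at most `num_blocks` cores can be
// active, regardless of shared memory usage.
absl::Duration ComputeTimeSketch(const se::DeviceDescription& gpu_device_info,
                                 int64_t flops, int64_t num_blocks) {
  int64_t core_count = gpu_device_info.core_count();
  int64_t n_active_cores = std::min(num_blocks, core_count);
  // Assumed throughput model: each FPU on an active core retires one FMA
  // (2 flops) per clock.
  double flops_per_ns = gpu_device_info.clock_rate_ghz() * 2.0 *
                        n_active_cores * gpu_device_info.fpus_per_core();
  return absl::Nanoseconds(flops / flops_per_ns);
}

The std::min clamp is the substance of the fix: with num_blocks = 8 on a 108-SM A100, the modeled FLOP throughput drops to 8/108 of the old estimate, matching the low occupancy described above.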

@lingzhi98 (Contributor, author) replied:

Thanks for your reply. Are there any updates?

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request May 24, 2024
The compute time should be calculated based on the number of active cores, which is the minimum of the number of blocks and the number of cores on the device.

Originally suggested in openxla/xla#12208.

Co-authored-by: lingzhi98 <lingzhi.zhou@intel.com>
PiperOrigin-RevId: 632531510
copybara-service bot pushed a commit that referenced this pull request May 29, 2024
The compute time should be calculated based on the number of active cores, which is the minimum of the number of blocks and the number of cores on the device.

Originally suggested in #12208.

Co-authored-by: lingzhi98 <lingzhi.zhou@intel.com>
PiperOrigin-RevId: 638183155
@lingzhi98 closed this May 29, 2024
@lingzhi98 deleted the lingzhi/shared_cache_cost_model branch June 19, 2024