[X86] Attempt to use VPMADD52L/VPMULUDQ instead of VPMULLQ on slow VPMULLQ targets (or when VPMULLQ is unavailable) #158854

@RKSimon

Description

Some Intel targets have notoriously slow VPMULLQ instructions. On those targets we should attempt to use alternatives whenever possible: VPMULUDQ, or VPMADD52L with the accumulator set to zero if IFMA52 is available.
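For reference, a minimal per-lane scalar model of the three instructions, as a sketch only. The function names `pmullq`, `pmuludq` and `pmadd52l` are hypothetical helpers for illustration, not LLVM or intrinsic APIs; the semantics follow the documented behaviour of each instruction on one 64-bit lane.

```cpp
#include <cstdint>

// VPMULLQ: low 64 bits of a 64x64-bit multiply.
constexpr uint64_t pmullq(uint64_t a, uint64_t b) { return a * b; }

// VPMULUDQ: full 64-bit product of the low 32 bits of each operand.
constexpr uint64_t pmuludq(uint64_t a, uint64_t b) {
  return (a & 0xFFFFFFFFull) * (b & 0xFFFFFFFFull);
}

constexpr uint64_t kMask52 = (1ull << 52) - 1;

// VPMADD52LUQ: multiply the low 52 bits of each operand and add the low
// 52 bits of the 104-bit product to the 64-bit accumulator.
// The wrapping 64-bit multiply yields the low 64 bits of that product,
// which is enough because only the low 52 bits are kept.
constexpr uint64_t pmadd52l(uint64_t acc, uint64_t a, uint64_t b) {
  return acc + (((a & kMask52) * (b & kMask52)) & kMask52);
}
```

With a zero accumulator, `pmadd52l(0, a, b)` matches `pmullq(a, b)` only when both operands and the product fit in 52 bits; for example it matches for `a = 1 << 30, b = 1 << 20` but not for `a = b = 1 << 30`, where the product needs 61 bits.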

  • Confirm which AVX512 targets have slower VPMULLQ than VPMULUDQ/VPMADD52L and add a new TuningSlowPMULLQ tuning flag for those targets - I think it's just Intel targets since Cannonlake?
  • In LowerMUL, on IFMA-capable targets (AVX/AVX512), attempt to use a single VPMADD52L instruction instead of a sequence of multiple VPMULUDQ ops, although a single VPMULUDQ is still the best option. VPMADD52L requires the input operands and the multiplication result to have zeros in the upper 12 bits (see [X86] Recognise VPMADD52L pattern with AVX512IFMA/AVXIFMA (#153787) #156714 for details). We can refactor the existing vXi64 known-bits analysis in LowerMUL to handle this.
  • On TuningSlowPMULLQ targets, attempt to lower to VPMADD52L if the upper 12 bits are all known zero (or to VPMULUDQ). This might be possible as an isel tablegen pattern, or we could perform it in combineMul, or we could set vXi64 ISD::MUL to Custom for TuningSlowPMULLQ targets and handle it in the same LowerMUL logic.
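The selection order described in the list above could be sketched roughly as follows. This is a standalone model, not LLVM code: `Target`, `MulLowering` and `pickMulLowering` are hypothetical names, the leading-zero counts stand in for whatever the known-bits analysis reports per vXi64 lane, and falling back to VPMULLQ on a slow target when nothing better is provable is an assumption rather than something the issue states.

```cpp
// Hypothetical lowering choices for a vXi64 multiply.
enum class MulLowering { PMULLQ, PMULUDQ, PMADD52L, PMULUDQSequence };

// Hypothetical stand-in for the relevant subtarget features/tuning.
struct Target {
  bool HasDQI;      // VPMULLQ is available
  bool HasIFMA;     // VPMADD52L is available
  bool SlowPMULLQ;  // proposed TuningSlowPMULLQ flag
};

// LZA/LZB/LZRes: minimum known leading zero bits of the two operands and
// of the multiplication result.
constexpr MulLowering pickMulLowering(Target T, unsigned LZA, unsigned LZB,
                                      unsigned LZRes) {
  // Fast VPMULLQ: nothing to do.
  if (T.HasDQI && !T.SlowPMULLQ)
    return MulLowering::PMULLQ;
  // Both operands fit in 32 bits: a single VPMULUDQ is exact.
  if (LZA >= 32 && LZB >= 32)
    return MulLowering::PMULUDQ;
  // Operands and result fit in 52 bits: VPMADD52L with a zero accumulator.
  if (T.HasIFMA && LZA >= 12 && LZB >= 12 && LZRes >= 12)
    return MulLowering::PMADD52L;
  // Assumption: a slow VPMULLQ still beats the multi-op expansion.
  if (T.HasDQI)
    return MulLowering::PMULLQ;
  return MulLowering::PMULUDQSequence;
}
```

For instance, `pickMulLowering({true, true, true}, 12, 12, 12)` selects `PMADD52L`, while the same query with `SlowPMULLQ` cleared keeps `PMULLQ`.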
