Missing Constant Propagation in OpenMP Parallel Region #63130

AntonRydahl · 2023-06-05T21:37:40Z

I noticed that Flang-new does not perform constant propagation as I would have expected. I made two simple test programs in C and Fortran using reductions in OpenMP. The programs are the following:

C

Fortran

#include <stdio.h>
#include <stdlib.h>
int main(void) {
	const int length = 1024*1024;
	double * arr = (double *) malloc(length*sizeof(double));
	double sum = 0.0;
	#pragma omp parallel for reduction(+:sum)
	for(int i = 0; i < length; i++) {
		arr[i] = 1.0/length;
		sum += arr[i];
	}
	printf("The result of sum(arr[0:%d]) is %1.9lf\n",length-1,sum);
	free(arr);
	return 0;
}

PROGRAM parallel_loop
        IMPLICIT NONE
        INTEGER(kind=4)			:: length, i
        REAL(kind=8), allocatable 	:: arr(:)
        REAL(kind=8)			:: sumval
        length = 1024*1024
        allocate (arr(length))
        sumval = 0.0
        !$omp parallel do reduction(+:sumval)
        do i=1,length
	        arr(i) = 1.0/length
	        sumval = sumval + arr(i)
        end do
        !$omp end parallel do
        write(*,100) "The result of sum(arr(1:",length,")) is ", sumval
        100 format (A,I7,A,e13.6e2)
        deallocate(arr)
END PROGRAM parallel_loop

I compiled the two programs with:

clang -O3 -pthread -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables  -fopenmp -emit-llvm -S  reduction.c -o c_reduction.ll
flang-new -O3 -fopenmp -emit-llvm -S reduction.f90 -o f_reduction.ll

In the LLVM IR generated with Clang, 1.0/length has been replaced with 0x3EB0000000000000 which is exactly 1.0/(1024.0*1024,0) as one would expect:

omp.inner.for.body:                               ; preds = %omp.inner.for.body.prol.loopexit, %omp.inner.for.body
  %indvars.iv = phi i64 [ %indvars.iv.next.3, %omp.inner.for.body ], [ %indvars.iv.unr, %omp.inner.for.body.prol.loopexit ]
  %arrayidx = getelementptr inbounds double, ptr %3, i64 %indvars.iv
  store double 0x3EB0000000000000, ptr %arrayidx, align 8, !tbaa !11
  %11 = load double, ptr %sum1, align 8, !tbaa !11
  %add5 = fadd double %11, 0x3EB0000000000000
  store double %add5, ptr %sum1, align 8, !tbaa !11
  %indvars.iv.next = add nsw i64 %indvars.iv, 1
  %arrayidx.1 = getelementptr inbounds double, ptr %3, i64 %indvars.iv.next
  store double 0x3EB0000000000000, ptr %arrayidx.1, align 8, !tbaa !11
  %12 = load double, ptr %sum1, align 8, !tbaa !11
  %add5.1 = fadd double %12, 0x3EB0000000000000
  store double %add5.1, ptr %sum1, align 8, !tbaa !11
  %indvars.iv.next.1 = add nsw i64 %indvars.iv, 2
  %arrayidx.2 = getelementptr inbounds double, ptr %3, i64 %indvars.iv.next.1
  store double 0x3EB0000000000000, ptr %arrayidx.2, align 8, !tbaa !11
  %13 = load double, ptr %sum1, align 8, !tbaa !11
  %add5.2 = fadd double %13, 0x3EB0000000000000
  store double %add5.2, ptr %sum1, align 8, !tbaa !11
  %indvars.iv.next.2 = add nsw i64 %indvars.iv, 3
  %arrayidx.3 = getelementptr inbounds double, ptr %3, i64 %indvars.iv.next.2
  store double 0x3EB0000000000000, ptr %arrayidx.3, align 8, !tbaa !11
  %14 = load double, ptr %sum1, align 8, !tbaa !11
  %add5.3 = fadd double %14, 0x3EB0000000000000
  store double %add5.3, ptr %sum1, align 8, !tbaa !11
  %indvars.iv.next.3 = add nsw i64 %indvars.iv, 4
  %lftr.wideiv.3 = trunc i64 %indvars.iv.next.3 to i32
  %exitcond.not.3 = icmp eq i32 %5, %lftr.wideiv.3
  br i1 %exitcond.not.3, label %omp.loop.exit, label %omp.inner.for.body

In the LLVM IR generated with Flang-new, the expression 1.0/length is evaluated every iteration. That happens in line %11 = fdiv contract float 1.000000e+00, %10 since %10 is a floating point cast of %9, which is an extension of %loadgep_, which ultimately refers to the statically allocated variable length.

omp_loop.body:                                    ; preds = %omp_loop.body.lr.ph, %omp_loop.body
  %omp_loop.iv68 = phi i32 [ 0, %omp_loop.body.lr.ph ], [ %7, %omp_loop.body ]
  %7 = add nuw i32 %omp_loop.iv68, 1
  %8 = add i32 %7, %4
  %9 = load i32, ptr %loadgep_, align 4, !tbaa !4
  %10 = sitofp i32 %9 to float
  %11 = fdiv contract float 1.000000e+00, %10
  %12 = fpext float %11 to double
  store ptr %.unpack, ptr %loadgep_4, align 8, !tbaa !8
  store i64 8, ptr %loadgep_4.repack19, align 8, !tbaa !8
  store i32 20180515, ptr %loadgep_4.repack21, align 8, !tbaa !8
  store i8 1, ptr %loadgep_4.repack23, align 4, !tbaa !8
  store i8 28, ptr %loadgep_4.repack25, align 1, !tbaa !8
  store i8 2, ptr %loadgep_4.repack27, align 2, !tbaa !8
  store i8 0, ptr %loadgep_4.repack29, align 1, !tbaa !8
  store i64 1, ptr %loadgep_4.repack31, align 8, !tbaa !8
  store i64 %.unpack14.unpack.unpack16, ptr %loadgep_4.repack31.repack33, align 8, !tbaa !8
  store i64 8, ptr %loadgep_4.repack31.repack35, align 8, !tbaa !8
  %13 = sext i32 %8 to i64
  %14 = add nsw i64 %13, -1
  %15 = getelementptr double, ptr %.unpack, i64 %14
  store double %12, ptr %15, align 8, !tbaa !4
  store ptr %.unpack, ptr %loadgep_6, align 8, !tbaa !8
  store i64 8, ptr %loadgep_6.repack49, align 8, !tbaa !8
  store i32 20180515, ptr %loadgep_6.repack51, align 8, !tbaa !8
  store i8 1, ptr %loadgep_6.repack53, align 4, !tbaa !8
  store i8 28, ptr %loadgep_6.repack55, align 1, !tbaa !8
  store i8 2, ptr %loadgep_6.repack57, align 2, !tbaa !8
  store i8 0, ptr %loadgep_6.repack59, align 1, !tbaa !8
  store i64 1, ptr %loadgep_6.repack61, align 8, !tbaa !8
  store i64 %.unpack14.unpack.unpack16, ptr %loadgep_6.repack61.repack63, align 8, !tbaa !8
  store i64 8, ptr %loadgep_6.repack61.repack65, align 8, !tbaa !8
  %16 = load double, ptr %1, align 8
  %17 = fadd contract double %16, %12
  store double %17, ptr %1, align 8
  %exitcond.not = icmp eq i32 %omp_loop.iv68, %reass.sub
  br i1 %exitcond.not, label %omp_loop.exit, label %omp_loop.body
}

Finally, the Flang IR has not been vectorized like the Clang IR. Should I file that as a separate bug report? The full LLVM IR and source files can be found here:

https://github.com/AntonRydahl/flangtests/blob/main/ir/CC/loop_reduction.ll
https://github.com/AntonRydahl/flangtests/blob/main/ir/FC/loop_reduction.ll
https://github.com/AntonRydahl/flangtests/blob/main/src/loop_reduction.c
https://github.com/AntonRydahl/flangtests/blob/main/src/loop_reduction.f90

The text was updated successfully, but these errors were encountered:

llvmbot · 2023-06-05T22:00:29Z

@llvm/issue-subscribers-flang-ir

llvmbot · 2023-06-05T22:00:35Z

@llvm/issue-subscribers-openmp

jdoerfert · 2023-06-05T23:40:53Z

FWIW, while the frontend might want to do the folding, enabling the Attributor will get us the desired folding as well.

kiranchandramohan · 2023-06-06T11:05:30Z

Thanks for looking into this. I think you are seeing a few issues, with the relevant code being the following.

  %9 = load i32, ptr %loadgep_, align 4, !tbaa !4
  %10 = sitofp i32 %9 to float
  %11 = fdiv contract float 1.000000e+00, %10

There is a pass in FIR called MemRefDataFlowOpt that forwards Stores to Loads. This might help remove the constant load. If you are using the fir-opt tool then this can be enabled using the option -fir-memref-dataflow-opt.
After forwarding, I see IR like the following which still shows a convert before the div, preventing constant propagation.

        %28 = fir.convert %c1048576_i32 : (i32) -> f32
        %29 = arith.divf %cst, %28 fastmath<contract> : f32

Conversion to llvm dialect, show the following IR with sitofp feeding a div.

        %105 = llvm.sitofp %14 : i32 to f32
        %106 = llvm.fdiv %10, %105  {fastmathFlags = #llvm.fastmath<contract>} : f32

But on translation to LLVM IR this conversion seems to be gone. I am able to obtain the following IR in the loop region.

  store double 0x3EB0000000000000, ptr %20, align 8

CommandLine

./bin/flang-new -fc1 -emit-fir -fopenmp -O3 canon.f90 -o - | ./bin/fir-opt --canonicalize --cse --fir-memref-dataflow-opt --canonicalize --cse --cfg-conversion --cg-rewrite --fir-to-llvm-ir | ./bin/mlir-translate --mlir-to-llvmir -allow-unregistered-dialect

It will be good to investigate whether the pass MemRefDataFlowOpt can be enabled by default when the Optimization flags (O1,O2,O3,Ofast) are ON.

It will be good to investigate the vectorization issue as well. One reason could be that we are not generating the atomic version for reductions.

llvmbot · 2023-06-06T21:05:41Z

@llvm/issue-subscribers-bug

github-actions bot added the new issue label Jun 5, 2023

EugeneZelenko added openmp flang:ir and removed new issue labels Jun 5, 2023

psteinfeld added the bug Indicates an unexpected problem or unintended behavior label Jun 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing Constant Propagation in OpenMP Parallel Region #63130

Missing Constant Propagation in OpenMP Parallel Region #63130

AntonRydahl commented Jun 5, 2023 •

edited by VoltrexKeyva

llvmbot commented Jun 5, 2023

llvmbot commented Jun 5, 2023

jdoerfert commented Jun 5, 2023

kiranchandramohan commented Jun 6, 2023 •

edited by VoltrexKeyva

llvmbot commented Jun 6, 2023

Missing Constant Propagation in OpenMP Parallel Region #63130

Missing Constant Propagation in OpenMP Parallel Region #63130

Comments

AntonRydahl commented Jun 5, 2023 • edited by VoltrexKeyva

llvmbot commented Jun 5, 2023

llvmbot commented Jun 5, 2023

jdoerfert commented Jun 5, 2023

kiranchandramohan commented Jun 6, 2023 • edited by VoltrexKeyva

llvmbot commented Jun 6, 2023

AntonRydahl commented Jun 5, 2023 •

edited by VoltrexKeyva

kiranchandramohan commented Jun 6, 2023 •

edited by VoltrexKeyva