Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing Constant Propagation in OpenMP Parallel Region #63130

Open
AntonRydahl opened this issue Jun 5, 2023 · 5 comments
Open

Missing Constant Propagation in OpenMP Parallel Region #63130

AntonRydahl opened this issue Jun 5, 2023 · 5 comments
Labels
bug Indicates an unexpected problem or unintended behavior flang:ir openmp

Comments

@AntonRydahl
Copy link
Contributor

AntonRydahl commented Jun 5, 2023

I noticed that Flang-new does not perform constant propagation as I would have expected. I made two simple test programs in C and Fortran using reductions in OpenMP. The programs are the following:

C Fortran
#include <stdio.h>
#include <stdlib.h>
int main(void) {
	const int length = 1024*1024;
	double * arr = (double *) malloc(length*sizeof(double));
	double sum = 0.0;
	#pragma omp parallel for reduction(+:sum)
	for(int i = 0; i < length; i++) {
		arr[i] = 1.0/length;
		sum += arr[i];
	}
	printf("The result of sum(arr[0:%d]) is %1.9lf\n",length-1,sum);
	free(arr);
	return 0;
}
PROGRAM parallel_loop
        IMPLICIT NONE
        INTEGER(kind=4)			:: length, i
        REAL(kind=8), allocatable 	:: arr(:)
        REAL(kind=8)			:: sumval
        length = 1024*1024
        allocate (arr(length))
        sumval = 0.0
        !$omp parallel do reduction(+:sumval)
        do i=1,length
	        arr(i) = 1.0/length
	        sumval = sumval + arr(i)
        end do
        !$omp end parallel do
        write(*,100) "The result of sum(arr(1:",length,")) is ", sumval
        100 format (A,I7,A,e13.6e2)
        deallocate(arr)
END PROGRAM parallel_loop

I compiled the two programs with:

clang -O3 -pthread -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables  -fopenmp -emit-llvm -S  reduction.c -o c_reduction.ll
flang-new -O3 -fopenmp -emit-llvm -S reduction.f90 -o f_reduction.ll

In the LLVM IR generated with Clang, 1.0/length has been replaced with 0x3EB0000000000000 which is exactly 1.0/(1024.0*1024,0) as one would expect:

omp.inner.for.body:                               ; preds = %omp.inner.for.body.prol.loopexit, %omp.inner.for.body
  %indvars.iv = phi i64 [ %indvars.iv.next.3, %omp.inner.for.body ], [ %indvars.iv.unr, %omp.inner.for.body.prol.loopexit ]
  %arrayidx = getelementptr inbounds double, ptr %3, i64 %indvars.iv
  store double 0x3EB0000000000000, ptr %arrayidx, align 8, !tbaa !11
  %11 = load double, ptr %sum1, align 8, !tbaa !11
  %add5 = fadd double %11, 0x3EB0000000000000
  store double %add5, ptr %sum1, align 8, !tbaa !11
  %indvars.iv.next = add nsw i64 %indvars.iv, 1
  %arrayidx.1 = getelementptr inbounds double, ptr %3, i64 %indvars.iv.next
  store double 0x3EB0000000000000, ptr %arrayidx.1, align 8, !tbaa !11
  %12 = load double, ptr %sum1, align 8, !tbaa !11
  %add5.1 = fadd double %12, 0x3EB0000000000000
  store double %add5.1, ptr %sum1, align 8, !tbaa !11
  %indvars.iv.next.1 = add nsw i64 %indvars.iv, 2
  %arrayidx.2 = getelementptr inbounds double, ptr %3, i64 %indvars.iv.next.1
  store double 0x3EB0000000000000, ptr %arrayidx.2, align 8, !tbaa !11
  %13 = load double, ptr %sum1, align 8, !tbaa !11
  %add5.2 = fadd double %13, 0x3EB0000000000000
  store double %add5.2, ptr %sum1, align 8, !tbaa !11
  %indvars.iv.next.2 = add nsw i64 %indvars.iv, 3
  %arrayidx.3 = getelementptr inbounds double, ptr %3, i64 %indvars.iv.next.2
  store double 0x3EB0000000000000, ptr %arrayidx.3, align 8, !tbaa !11
  %14 = load double, ptr %sum1, align 8, !tbaa !11
  %add5.3 = fadd double %14, 0x3EB0000000000000
  store double %add5.3, ptr %sum1, align 8, !tbaa !11
  %indvars.iv.next.3 = add nsw i64 %indvars.iv, 4
  %lftr.wideiv.3 = trunc i64 %indvars.iv.next.3 to i32
  %exitcond.not.3 = icmp eq i32 %5, %lftr.wideiv.3
  br i1 %exitcond.not.3, label %omp.loop.exit, label %omp.inner.for.body

In the LLVM IR generated with Flang-new, the expression 1.0/length is evaluated every iteration. That happens in line %11 = fdiv contract float 1.000000e+00, %10 since %10 is a floating point cast of %9, which is an extension of %loadgep_, which ultimately refers to the statically allocated variable length.

omp_loop.body:                                    ; preds = %omp_loop.body.lr.ph, %omp_loop.body
  %omp_loop.iv68 = phi i32 [ 0, %omp_loop.body.lr.ph ], [ %7, %omp_loop.body ]
  %7 = add nuw i32 %omp_loop.iv68, 1
  %8 = add i32 %7, %4
  %9 = load i32, ptr %loadgep_, align 4, !tbaa !4
  %10 = sitofp i32 %9 to float
  %11 = fdiv contract float 1.000000e+00, %10
  %12 = fpext float %11 to double
  store ptr %.unpack, ptr %loadgep_4, align 8, !tbaa !8
  store i64 8, ptr %loadgep_4.repack19, align 8, !tbaa !8
  store i32 20180515, ptr %loadgep_4.repack21, align 8, !tbaa !8
  store i8 1, ptr %loadgep_4.repack23, align 4, !tbaa !8
  store i8 28, ptr %loadgep_4.repack25, align 1, !tbaa !8
  store i8 2, ptr %loadgep_4.repack27, align 2, !tbaa !8
  store i8 0, ptr %loadgep_4.repack29, align 1, !tbaa !8
  store i64 1, ptr %loadgep_4.repack31, align 8, !tbaa !8
  store i64 %.unpack14.unpack.unpack16, ptr %loadgep_4.repack31.repack33, align 8, !tbaa !8
  store i64 8, ptr %loadgep_4.repack31.repack35, align 8, !tbaa !8
  %13 = sext i32 %8 to i64
  %14 = add nsw i64 %13, -1
  %15 = getelementptr double, ptr %.unpack, i64 %14
  store double %12, ptr %15, align 8, !tbaa !4
  store ptr %.unpack, ptr %loadgep_6, align 8, !tbaa !8
  store i64 8, ptr %loadgep_6.repack49, align 8, !tbaa !8
  store i32 20180515, ptr %loadgep_6.repack51, align 8, !tbaa !8
  store i8 1, ptr %loadgep_6.repack53, align 4, !tbaa !8
  store i8 28, ptr %loadgep_6.repack55, align 1, !tbaa !8
  store i8 2, ptr %loadgep_6.repack57, align 2, !tbaa !8
  store i8 0, ptr %loadgep_6.repack59, align 1, !tbaa !8
  store i64 1, ptr %loadgep_6.repack61, align 8, !tbaa !8
  store i64 %.unpack14.unpack.unpack16, ptr %loadgep_6.repack61.repack63, align 8, !tbaa !8
  store i64 8, ptr %loadgep_6.repack61.repack65, align 8, !tbaa !8
  %16 = load double, ptr %1, align 8
  %17 = fadd contract double %16, %12
  store double %17, ptr %1, align 8
  %exitcond.not = icmp eq i32 %omp_loop.iv68, %reass.sub
  br i1 %exitcond.not, label %omp_loop.exit, label %omp_loop.body
}

Finally, the Flang IR has not been vectorized like the Clang IR. Should I file that as a separate bug report? The full LLVM IR and source files can be found here:

https://github.com/AntonRydahl/flangtests/blob/main/ir/CC/loop_reduction.ll
https://github.com/AntonRydahl/flangtests/blob/main/ir/FC/loop_reduction.ll
https://github.com/AntonRydahl/flangtests/blob/main/src/loop_reduction.c
https://github.com/AntonRydahl/flangtests/blob/main/src/loop_reduction.f90

@llvmbot
Copy link
Collaborator

llvmbot commented Jun 5, 2023

@llvm/issue-subscribers-flang-ir

@llvmbot
Copy link
Collaborator

llvmbot commented Jun 5, 2023

@llvm/issue-subscribers-openmp

@jdoerfert
Copy link
Member

FWIW, while the frontend might want to do the folding, enabling the Attributor will get us the desired folding as well.

@kiranchandramohan
Copy link
Contributor

kiranchandramohan commented Jun 6, 2023

Thanks for looking into this. I think you are seeing a few issues, with the relevant code being the following.

  %9 = load i32, ptr %loadgep_, align 4, !tbaa !4
  %10 = sitofp i32 %9 to float
  %11 = fdiv contract float 1.000000e+00, %10
  1. There is a pass in FIR called MemRefDataFlowOpt that forwards Stores to Loads. This might help remove the constant load. If you are using the fir-opt tool then this can be enabled using the option -fir-memref-dataflow-opt.
  2. After forwarding, I see IR like the following which still shows a convert before the div, preventing constant propagation.
        %28 = fir.convert %c1048576_i32 : (i32) -> f32
        %29 = arith.divf %cst, %28 fastmath<contract> : f32
  1. Conversion to llvm dialect, show the following IR with sitofp feeding a div.
        %105 = llvm.sitofp %14 : i32 to f32
        %106 = llvm.fdiv %10, %105  {fastmathFlags = #llvm.fastmath<contract>} : f32
  1. But on translation to LLVM IR this conversion seems to be gone. I am able to obtain the following IR in the loop region.
  store double 0x3EB0000000000000, ptr %20, align 8

CommandLine

./bin/flang-new -fc1 -emit-fir -fopenmp -O3 canon.f90 -o - | ./bin/fir-opt --canonicalize --cse --fir-memref-dataflow-opt --canonicalize --cse --cfg-conversion --cg-rewrite --fir-to-llvm-ir | ./bin/mlir-translate --mlir-to-llvmir -allow-unregistered-dialect 

It will be good to investigate whether the pass MemRefDataFlowOpt can be enabled by default when the Optimization flags (O1,O2,O3,Ofast) are ON.

It will be good to investigate the vectorization issue as well. One reason could be that we are not generating the atomic version for reductions.

@psteinfeld psteinfeld added the bug Indicates an unexpected problem or unintended behavior label Jun 6, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Jun 6, 2023

@llvm/issue-subscribers-bug

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior flang:ir openmp
Projects
None yet
Development

No branches or pull requests

6 participants