Proposed changes to enable vectorization for parfors. #2709

Closed
wants to merge 530 commits into from

DrTodd13
Contributor

I have confirmed that, with these changes, a version of blackscholes with the transcendentals removed will vectorize. There are probably many ways to transmit this noalias flag to the relevant spot. Please review, and if you want a different mechanism, let's discuss.

Ehsan Totoni and others added 30 commits December 13, 2017 15:08
…the dimensionality of the input array. If neighborhood not specified, then calculate it only once and store it in StencilFunc and can then be queried through neighborhood attribute.
…ing section that stencil decorator returns StencilFunc type and that that contains a neighborhood attribute.
…ction about the various checks performed and exceptions raised.
… have access to attributes. Added a check to stencil decorator to check for unknown stencil options. Added support for standard_indexing option.
@DrTodd13
Contributor Author

Lots of Travis problems. I think I may know the reason, but if I'm correct, the solution would also prevent vectorization. I will investigate.

@DrTodd13
Contributor Author

So, it seems that with "noalias nocapture" on parfor gufunc arrays, the LLVM LoopVectorizer will report that vectorization is possible but not profitable for the pseudo-blackscholes with transcendentals removed.

I'm not an expert on analyzing this output, but the "LAA: Bad stride" messages look like a likely culprit. I did some experimenting, and we get these same messages whether ascontiguousarray is used, not used, or used only for input params. So, I'm not sure it is doing for us what we thought it should.

LV: Checking a loop in "_ZN5numba8npyufunc6parfor39__numba_parfor_gufunc_0x7f162ddc471$242E5ArrayIxLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE" from gufunc_no_sqrt.ll
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B121
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Did not find one integer induction var.
LAA: Found a loop in _ZN5numba8npyufunc6parfor39__numba_parfor_gufunc_0x7f162ddc471$242E5ArrayIxLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE: B121
LAA: Processing memory accesses...
AST: Alias Set Tracker: 6 alias sets for 6 pointer values.
AliasSet[0x46100f0, 1] must alias, No access Pointers: (double* %.1041, 18446744073709551615)
AliasSet[0x45f5020, 1] must alias, No access Pointers: (double* %.428, 18446744073709551615)
AliasSet[0x45e8380, 1] must alias, No access Pointers: (double* %.461, 18446744073709551615)
AliasSet[0x45e8420, 1] must alias, No access Pointers: (double* %.517, 18446744073709551615)
AliasSet[0x4608270, 1] must alias, No access Pointers: (double* %.628, 18446744073709551615)
AliasSet[0x4608360, 1] must alias, No access Pointers: (double* %.691, 18446744073709551615)

LAA: Accesses(6):
%.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028 (write)
%.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415 (read-only)
%.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448 (read-only)
%.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504 (read-only)
%.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615 (read-only)
%.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678 (read-only)
Underlying objects for pointer %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028
double* %arg._put_144param.4
Underlying objects for pointer %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415
double* %arg.sptpriceparam.4
Underlying objects for pointer %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448
double* %arg.strikeparam.4
Underlying objects for pointer %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504
double* %arg.volatilityparam.4
Underlying objects for pointer %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615
double* %arg.timevparam.4
Underlying objects for pointer %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678
double* %arg.rateparam.4
LAA: We can perform a memory runtime check if needed.
LAA: No unsafe dependent memory operations in loop. We don't need runtime memory checks.
LV: We can vectorize this loop!
LV: Analyzing interleaved accesses...
LAA: Bad stride - Not an AddRecExpr pointer %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415 SCEV: ((8 * ({%.234,+,1}<%B121> + %.414)) + %arg.sptpriceparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448 SCEV: ((8 * ({%.234,+,1}<%B121> + %.447)) + %arg.strikeparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504 SCEV: ((8 * ({%.234,+,1}<%B121> + %.503)) + %arg.volatilityparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615 SCEV: ((8 * ({%.234,+,1}<%B121> + %.614)) + %arg.timevparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678 SCEV: ((8 * ({%.234,+,1}<%B121> + %.677)) + %arg.rateparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028 SCEV: ((8 * ({%.234,+,1}<%B121> + %.1027)) + %arg._put_144param.4)
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction: %.311.0157 = phi i64 [ %.320, %B121.lr.ph ], [ %.367, %B121 ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %.309.0156 = phi i64 [ %.234, %B121.lr.ph ], [ %.371, %B121 ]
LV: Found an estimated cost of 1 for VF 1 For instruction: %.367 = add nsw i64 %.311.0157, -1
LV: Found an estimated cost of 1 for VF 1 For instruction: %.413 = icmp slt i64 %.309.0156, 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.414 = select i1 %.413, i64 %arg.sptpriceparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.415 = add i64 %.414, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415
LV: Found an estimated cost of 1 for VF 1 For instruction: %.429 = load double, double* %.428, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %.447 = select i1 %.413, i64 %arg.strikeparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.448 = add i64 %.447, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448
LV: Found an estimated cost of 1 for VF 1 For instruction: %.462 = load double, double* %.461, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %.371 = add i64 %.309.0156, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %.503 = select i1 %.413, i64 %arg.volatilityparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.504 = add i64 %.503, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504
LV: Found an estimated cost of 1 for VF 1 For instruction: %.518 = load double, double* %.517, align 8
LV: Found an estimated cost of 2 for VF 1 For instruction: %.524 = fmul double %.518, 5.000000e-01
LV: Found an estimated cost of 1 for VF 1 For instruction: %.614 = select i1 %.413, i64 %arg.timevparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.615 = add i64 %.614, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615
LV: Found an estimated cost of 1 for VF 1 For instruction: %.629 = load double, double* %.628, align 8
LV: Found an estimated cost of 2 for VF 1 For instruction: %.563 = fmul double %.518, %.524
LV: Found an estimated cost of 2 for VF 1 For instruction: %.635 = fmul double %.518, %.629
LV: Found an estimated cost of 1 for VF 1 For instruction: %.677 = select i1 %.413, i64 %arg.rateparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.678 = add i64 %.677, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678
LV: Found an estimated cost of 1 for VF 1 For instruction: %.692 = load double, double* %.691, align 8
LV: Found an estimated cost of 2 for VF 1 For instruction: %.698 = fadd double %.563, %.692
LV: Found an estimated cost of 2 for VF 1 For instruction: %.737 = fmul double %.629, %.698
LV: Found an estimated cost of 38 for VF 1 For instruction: %.481.le = fdiv double %.429, %.462
LV: Found an estimated cost of 2 for VF 1 For instruction: %.655.le = fmul double %.635, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %.743 = fadd double %.481.le, %.737
LV: Found an estimated cost of 38 for VF 1 For instruction: %.762.le = fdiv double %.743, %.655.le
LV: Found an estimated cost of 2 for VF 1 For instruction: %.772 = fsub double %.762.le, %.655.le
LV: Found an estimated cost of 2 for VF 1 For instruction: %.778 = fmul double %.762.le, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %.784 = fadd double %.778, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %.790 = fmul double %.772, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %.796 = fadd double %.790, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %0 = fmul double %.629, %.692
LV: Found an estimated cost of 2 for VF 1 For instruction: %1 = fmul double %.462, %0
LV: Found an estimated cost of 2 for VF 1 For instruction: %2 = fmul double %1, %.796
LV: Found an estimated cost of 2 for VF 1 For instruction: %.957 = fmul double %.429, %.784
LV: Found an estimated cost of 2 for VF 1 For instruction: %.963 = fadd double %.957, %2
LV: Found an estimated cost of 2 for VF 1 For instruction: %.969 = fadd double %1, %.963
LV: Found an estimated cost of 2 for VF 1 For instruction: %.1008 = fadd double %.429, %.969
LV: Found an estimated cost of 1 for VF 1 For instruction: %.1027 = select i1 %.413, i64 %arg._put_144param.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.1028 = add i64 %.1027, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028
LV: Found an estimated cost of 1 for VF 1 For instruction: store double %.1008, double* %.1041, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %.358 = icmp sgt i64 %.311.0157, 1
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %.358, label %B121, label %B136.loopexit
LV: Scalar loop costs: 136.
LAA: Bad stride - Not an AddRecExpr pointer %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415 SCEV: ((8 * ({%.234,+,1}<%B121> + %.414)) + %arg.sptpriceparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448 SCEV: ((8 * ({%.234,+,1}<%B121> + %.447)) + %arg.strikeparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504 SCEV: ((8 * ({%.234,+,1}<%B121> + %.503)) + %arg.volatilityparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615 SCEV: ((8 * ({%.234,+,1}<%B121> + %.614)) + %arg.timevparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678 SCEV: ((8 * ({%.234,+,1}<%B121> + %.677)) + %arg.rateparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028 SCEV: ((8 * ({%.234,+,1}<%B121> + %.1027)) + %arg._put_144param.4)
LV: Found uniform instruction: %.358 = icmp sgt i64 %.311.0157, 1
LV: Found uniform instruction: %.311.0157 = phi i64 [ %.320, %B121.lr.ph ], [ %.367, %B121 ]
LV: Found uniform instruction: %.367 = add nsw i64 %.311.0157, -1
LV: Found scalar instruction: %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415
LV: Found scalar instruction: %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448
LV: Found scalar instruction: %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504
LV: Found scalar instruction: %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615
LV: Found scalar instruction: %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678
LV: Found scalar instruction: %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028
LV: Found scalar instruction: %.311.0157 = phi i64 [ %.320, %B121.lr.ph ], [ %.367, %B121 ]
LV: Found scalar instruction: %.367 = add nsw i64 %.311.0157, -1
LV: Found an estimated cost of 0 for VF 2 For instruction: %.311.0157 = phi i64 [ %.320, %B121.lr.ph ], [ %.367, %B121 ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %.309.0156 = phi i64 [ %.234, %B121.lr.ph ], [ %.371, %B121 ]
LV: Found an estimated cost of 1 for VF 2 For instruction: %.367 = add nsw i64 %.311.0157, -1
LV: Found an estimated cost of 8 for VF 2 For instruction: %.413 = icmp slt i64 %.309.0156, 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.414 = select i1 %.413, i64 %arg.sptpriceparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.415 = add i64 %.414, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415
LV: Found an estimated cost of 27 for VF 2 For instruction: %.429 = load double, double* %.428, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %.447 = select i1 %.413, i64 %arg.strikeparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.448 = add i64 %.447, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448
LV: Found an estimated cost of 27 for VF 2 For instruction: %.462 = load double, double* %.461, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %.371 = add i64 %.309.0156, 1
LV: Found an estimated cost of 1 for VF 2 For instruction: %.503 = select i1 %.413, i64 %arg.volatilityparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.504 = add i64 %.503, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504
LV: Found an estimated cost of 27 for VF 2 For instruction: %.518 = load double, double* %.517, align 8
LV: Found an estimated cost of 2 for VF 2 For instruction: %.524 = fmul double %.518, 5.000000e-01
LV: Found an estimated cost of 1 for VF 2 For instruction: %.614 = select i1 %.413, i64 %arg.timevparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.615 = add i64 %.614, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615
LV: Found an estimated cost of 27 for VF 2 For instruction: %.629 = load double, double* %.628, align 8
LV: Found an estimated cost of 2 for VF 2 For instruction: %.563 = fmul double %.518, %.524
LV: Found an estimated cost of 2 for VF 2 For instruction: %.635 = fmul double %.518, %.629
LV: Found an estimated cost of 1 for VF 2 For instruction: %.677 = select i1 %.413, i64 %arg.rateparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.678 = add i64 %.677, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678
LV: Found an estimated cost of 27 for VF 2 For instruction: %.692 = load double, double* %.691, align 8
LV: Found an estimated cost of 2 for VF 2 For instruction: %.698 = fadd double %.563, %.692
LV: Found an estimated cost of 2 for VF 2 For instruction: %.737 = fmul double %.629, %.698
LV: Found an estimated cost of 69 for VF 2 For instruction: %.481.le = fdiv double %.429, %.462
LV: Found an estimated cost of 2 for VF 2 For instruction: %.655.le = fmul double %.635, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %.743 = fadd double %.481.le, %.737
LV: Found an estimated cost of 69 for VF 2 For instruction: %.762.le = fdiv double %.743, %.655.le
LV: Found an estimated cost of 2 for VF 2 For instruction: %.772 = fsub double %.762.le, %.655.le
LV: Found an estimated cost of 2 for VF 2 For instruction: %.778 = fmul double %.762.le, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %.784 = fadd double %.778, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %.790 = fmul double %.772, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %.796 = fadd double %.790, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %0 = fmul double %.629, %.692
LV: Found an estimated cost of 2 for VF 2 For instruction: %1 = fmul double %.462, %0
LV: Found an estimated cost of 2 for VF 2 For instruction: %2 = fmul double %1, %.796
LV: Found an estimated cost of 2 for VF 2 For instruction: %.957 = fmul double %.429, %.784
LV: Found an estimated cost of 2 for VF 2 For instruction: %.963 = fadd double %.957, %2
LV: Found an estimated cost of 2 for VF 2 For instruction: %.969 = fadd double %1, %.963
LV: Found an estimated cost of 2 for VF 2 For instruction: %.1008 = fadd double %.429, %.969
LV: Found an estimated cost of 1 for VF 2 For instruction: %.1027 = select i1 %.413, i64 %arg._put_144param.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.1028 = add i64 %.1027, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028
LV: Found an estimated cost of 27 for VF 2 For instruction: store double %.1008, double* %.1041, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %.358 = icmp sgt i64 %.311.0157, 1
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %.358, label %B121, label %B136.loopexit
LV: Vector loop of width 2 costs: 180.
LV: Selecting VF: 1.
LV: The target has 16 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #4 Interval # 4
LV(REG): At #5 Interval # 5
LV(REG): At #6 Interval # 5
LV(REG): At #7 Interval # 5
LV(REG): At #8 Interval # 5
LV(REG): At #9 Interval # 6
LV(REG): At #10 Interval # 6
LV(REG): At #11 Interval # 6
LV(REG): At #12 Interval # 6
LV(REG): At #13 Interval # 7
LV(REG): At #14 Interval # 8
LV(REG): At #15 Interval # 8
LV(REG): At #16 Interval # 8
LV(REG): At #17 Interval # 8
LV(REG): At #18 Interval # 9
LV(REG): At #19 Interval # 10
LV(REG): At #20 Interval # 10
LV(REG): At #21 Interval # 10
LV(REG): At #22 Interval # 10
LV(REG): At #23 Interval # 10
LV(REG): At #24 Interval # 10
LV(REG): At #25 Interval # 11
LV(REG): At #26 Interval # 11
LV(REG): At #27 Interval # 11
LV(REG): At #28 Interval # 11
LV(REG): At #29 Interval # 11
LV(REG): At #30 Interval # 11
LV(REG): At #31 Interval # 12
LV(REG): At #32 Interval # 12
LV(REG): At #33 Interval # 11
LV(REG): At #34 Interval # 11
LV(REG): At #35 Interval # 11
LV(REG): At #36 Interval # 11
LV(REG): At #37 Interval # 11
LV(REG): At #38 Interval # 11
LV(REG): At #39 Interval # 11
LV(REG): At #40 Interval # 10
LV(REG): At #41 Interval # 9
LV(REG): At #42 Interval # 9
LV(REG): At #43 Interval # 9
LV(REG): At #44 Interval # 8
LV(REG): At #45 Interval # 7
LV(REG): At #46 Interval # 6
LV(REG): At #47 Interval # 6
LV(REG): At #48 Interval # 5
LV(REG): At #50 Interval # 3
LV(REG): VF = 1
LV(REG): Found max usage: 12
LV(REG): Found invariant usage: 2
LV(REG): LoopSize: 52
LV: Loop cost is 136
LV: Not Interleaving.
LV: Vectorization is possible but not beneficial.
LV: Interleaving is not beneficial.
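
One plausible reading of the "Bad stride" lines in the dump above: each array index is computed inside the loop as `i + select(i < 0, n, 0)` (the `icmp slt` / `select` / `add` triple before every `getelementptr`), so the pointer is an induction variable plus a loop-varying term rather than a plain add-recurrence. The sketch below mirrors that index computation in Python. The names `wrapped_index` and `index_strides` are hypothetical, and treating `n` as the array length is an assumption about what `%arg.*param.5.0` holds.

```python
def wrapped_index(i: int, n: int) -> int:
    """Mirror the icmp/select/add sequence seen in the dump."""
    offset = n if i < 0 else 0  # %.414 = select (i < 0), n, 0
    return i + offset           # %.415 = add %.414, i


def index_strides(start: int, stop: int, n: int) -> list:
    """Differences between consecutive wrapped indices over range(start, stop)."""
    idx = [wrapped_index(i, n) for i in range(start, stop)]
    return [b - a for a, b in zip(idx, idx[1:])]


# With a non-negative start the stride is always 1 (unit stride).
unit = index_strides(0, 5, 10)

# If the induction variable could be negative, the stride changes mid-loop,
# so unit stride cannot be proven without knowing the sign of i.
mixed = index_strides(-2, 3, 10)
```

If this reading is right, unit stride would be provable only once the optimizer can show the induction variable is never negative, which would make the `select` fold away.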

@sklam
Member

sklam commented Jan 31, 2018

I wonder if turning on fastmath will help vectorize the code.

@DrTodd13
Contributor Author

When I did these tests, I did put fastmath on for the main function. At that time I wondered whether that flag would be carried through to where the gufunc was compiled. Later, I think I did some printing of the flags at the gufunc compilation point and I believe that fastmath was not defined there. So, I could add it back in there and give it a try.

@DrTodd13
Contributor Author

DrTodd13 commented Feb 1, 2018

Some comments on that LLVM opt output from an expert.

The -debug dump details the estimated cost of the original scalar loop body (“For VF 1”) vs. the estimated cost of vectorizing (“For VF 2”), which clearly favors the scalar version: 136 < 180.
o The expensive instructions are the two fdiv’s (38 vs. 69, so vector is better) and all six memory accesses (five loads and one store) of 64-bit doubles, which are treated as non-unit-stride and thus require gathers and a scatter when vectorized (1 vs. 27, so scalar is a whole lot better).
o But such loads and stores can alternatively be “scalarized”, with a rough cost of 3 or somewhat more: two scalar loads and a shuffle to place them into a vector, or extracting each element from a vector followed by two scalar stores. Even if the cost of such scalarization were 12 instead of 27, the vector version would come out cheaper than the scalar version, because its cost would drop by 6*(27-12) / 2 = 45, which exceeds the current gap of 180 - 136 = 44.
o It would be interesting to force vectorization for this case and measure the effect, to inform what the “true” cost should be.
o Whether these loads and/or the store could be recognized as unit-stride, and vectorized more efficiently, is also worth checking.
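
The arithmetic in these comments can be checked against the dump with a small sketch. The table below is a hand tally of the per-instruction estimates from the dump, grouped by kind. The function name `loop_cost` and the assumption that the reported loop cost is the summed instruction cost divided by VF (truncated when printed) are mine; that assumption does reproduce the 136 and 180 totals.

```python
# (instruction count, cost at VF=1, cost at VF=2), tallied by hand from the dump.
DUMP_COSTS = {
    "phi, getelementptr, br": (9, 0, 0),
    "integer add, select, exit icmp": (15, 1, 1),
    "icmp slt (index sign test)": (1, 1, 8),
    "load/store of double": (6, 1, 27),
    "fdiv": (2, 38, 69),
    "fmul/fadd/fsub": (19, 2, 2),
}


def loop_cost(vf, mem_cost_vf2=None):
    """Sum the instruction costs for VF 1 or 2 and divide by VF.

    mem_cost_vf2 optionally overrides the VF=2 cost of each load/store,
    to model the "scalarized" alternative discussed above.
    """
    column = 1 if vf == 1 else 2
    total = 0
    for kind, row in DUMP_COSTS.items():
        cost = row[column]
        if vf == 2 and mem_cost_vf2 is not None and kind == "load/store of double":
            cost = mem_cost_vf2
        total += row[0] * cost
    return total / vf


scalar = loop_cost(1)           # 136.0, matches "Scalar loop costs: 136."
vector = loop_cost(2)           # 180.5, printed as 180 in the dump
scalarized = loop_cost(2, 12)   # 135.5, just under the scalar cost
```

Under this model, a per-access cost of 12 is the break-even point: anything at or below it makes VF 2 cheaper than the scalar loop.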

@DrTodd13 DrTodd13 closed this Feb 6, 2018
@codecov-io

codecov-io commented Feb 6, 2018

Codecov Report

Merging #2709 into master will increase coverage by 0.01%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master    #2709      +/-   ##
==========================================
+ Coverage   86.23%   86.25%   +0.01%     
==========================================
  Files         321      323       +2     
  Lines       66192    66591     +399     
  Branches     7378     7426      +48     
==========================================
+ Hits        57081    57436     +355     
- Misses       7932     7962      +30     
- Partials     1179     1193      +14

4 participants