Proposed changes to enable vectorization for parfors. #2709

DrTodd13 · 2018-01-31T00:03:39Z

I have confirmed that, with these changes, a version of blackscholes with the transcendentals removed will vectorize. There are probably many ways to transmit this noalias flag to the relevant spot. Please review, and if you want a different mechanism, let's discuss.

DrTodd13 · 2018-01-31T01:30:49Z

Lots of Travis failures. I think I may know the reason, but if I'm correct, the solution would also prevent vectorization. I will investigate.

DrTodd13 · 2018-01-31T17:17:23Z

So, it seems that with "noalias nocapture" on the parfor gufunc array parameters, the LLVM LoopVectorizer reports that vectorization is possible but not profitable for the pseudo-blackscholes with transcendentals removed.

I'm not an expert at analyzing this output, but the "LAA: Bad stride" messages look like a likely culprit. I did some experimenting, and we get these same messages whether ascontiguousarray is used, not used, or used only for input params. So I'm not sure it is doing for us what we thought it would.

LV: Checking a loop in "_ZN5numba8npyufunc6parfor39__numba_parfor_gufunc_0x7f162ddc471$242E5ArrayIxLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE" from gufunc_no_sqrt.ll
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B121
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Found FP op with unsafe algebra.
LV: Did not find one integer induction var.
LAA: Found a loop in _ZN5numba8npyufunc6parfor39__numba_parfor_gufunc_0x7f162ddc471$242E5ArrayIxLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE: B121
LAA: Processing memory accesses...
AST: Alias Set Tracker: 6 alias sets for 6 pointer values.
AliasSet[0x46100f0, 1] must alias, No access Pointers: (double* %.1041, 18446744073709551615)
AliasSet[0x45f5020, 1] must alias, No access Pointers: (double* %.428, 18446744073709551615)
AliasSet[0x45e8380, 1] must alias, No access Pointers: (double* %.461, 18446744073709551615)
AliasSet[0x45e8420, 1] must alias, No access Pointers: (double* %.517, 18446744073709551615)
AliasSet[0x4608270, 1] must alias, No access Pointers: (double* %.628, 18446744073709551615)
AliasSet[0x4608360, 1] must alias, No access Pointers: (double* %.691, 18446744073709551615)

LAA: Accesses(6):
%.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028 (write)
%.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415 (read-only)
%.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448 (read-only)
%.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504 (read-only)
%.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615 (read-only)
%.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678 (read-only)
Underlying objects for pointer %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028
double* %arg._put_144param.4
Underlying objects for pointer %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415
double* %arg.sptpriceparam.4
Underlying objects for pointer %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448
double* %arg.strikeparam.4
Underlying objects for pointer %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504
double* %arg.volatilityparam.4
Underlying objects for pointer %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615
double* %arg.timevparam.4
Underlying objects for pointer %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678
double* %arg.rateparam.4
LAA: We can perform a memory runtime check if needed.
LAA: No unsafe dependent memory operations in loop. We don't need runtime memory checks.
LV: We can vectorize this loop!
LV: Analyzing interleaved accesses...
LAA: Bad stride - Not an AddRecExpr pointer %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415 SCEV: ((8 * ({%.234,+,1}<%B121> + %.414)) + %arg.sptpriceparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448 SCEV: ((8 * ({%.234,+,1}<%B121> + %.447)) + %arg.strikeparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504 SCEV: ((8 * ({%.234,+,1}<%B121> + %.503)) + %arg.volatilityparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615 SCEV: ((8 * ({%.234,+,1}<%B121> + %.614)) + %arg.timevparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678 SCEV: ((8 * ({%.234,+,1}<%B121> + %.677)) + %arg.rateparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028 SCEV: ((8 * ({%.234,+,1}<%B121> + %.1027)) + %arg._put_144param.4)
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction: %.311.0157 = phi i64 [ %.320, %B121.lr.ph ], [ %.367, %B121 ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %.309.0156 = phi i64 [ %.234, %B121.lr.ph ], [ %.371, %B121 ]
LV: Found an estimated cost of 1 for VF 1 For instruction: %.367 = add nsw i64 %.311.0157, -1
LV: Found an estimated cost of 1 for VF 1 For instruction: %.413 = icmp slt i64 %.309.0156, 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.414 = select i1 %.413, i64 %arg.sptpriceparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.415 = add i64 %.414, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415
LV: Found an estimated cost of 1 for VF 1 For instruction: %.429 = load double, double* %.428, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %.447 = select i1 %.413, i64 %arg.strikeparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.448 = add i64 %.447, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448
LV: Found an estimated cost of 1 for VF 1 For instruction: %.462 = load double, double* %.461, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %.371 = add i64 %.309.0156, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %.503 = select i1 %.413, i64 %arg.volatilityparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.504 = add i64 %.503, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504
LV: Found an estimated cost of 1 for VF 1 For instruction: %.518 = load double, double* %.517, align 8
LV: Found an estimated cost of 2 for VF 1 For instruction: %.524 = fmul double %.518, 5.000000e-01
LV: Found an estimated cost of 1 for VF 1 For instruction: %.614 = select i1 %.413, i64 %arg.timevparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.615 = add i64 %.614, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615
LV: Found an estimated cost of 1 for VF 1 For instruction: %.629 = load double, double* %.628, align 8
LV: Found an estimated cost of 2 for VF 1 For instruction: %.563 = fmul double %.518, %.524
LV: Found an estimated cost of 2 for VF 1 For instruction: %.635 = fmul double %.518, %.629
LV: Found an estimated cost of 1 for VF 1 For instruction: %.677 = select i1 %.413, i64 %arg.rateparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.678 = add i64 %.677, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678
LV: Found an estimated cost of 1 for VF 1 For instruction: %.692 = load double, double* %.691, align 8
LV: Found an estimated cost of 2 for VF 1 For instruction: %.698 = fadd double %.563, %.692
LV: Found an estimated cost of 2 for VF 1 For instruction: %.737 = fmul double %.629, %.698
LV: Found an estimated cost of 38 for VF 1 For instruction: %.481.le = fdiv double %.429, %.462
LV: Found an estimated cost of 2 for VF 1 For instruction: %.655.le = fmul double %.635, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %.743 = fadd double %.481.le, %.737
LV: Found an estimated cost of 38 for VF 1 For instruction: %.762.le = fdiv double %.743, %.655.le
LV: Found an estimated cost of 2 for VF 1 For instruction: %.772 = fsub double %.762.le, %.655.le
LV: Found an estimated cost of 2 for VF 1 For instruction: %.778 = fmul double %.762.le, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %.784 = fadd double %.778, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %.790 = fmul double %.772, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %.796 = fadd double %.790, 5.000000e-01
LV: Found an estimated cost of 2 for VF 1 For instruction: %0 = fmul double %.629, %.692
LV: Found an estimated cost of 2 for VF 1 For instruction: %1 = fmul double %.462, %0
LV: Found an estimated cost of 2 for VF 1 For instruction: %2 = fmul double %1, %.796
LV: Found an estimated cost of 2 for VF 1 For instruction: %.957 = fmul double %.429, %.784
LV: Found an estimated cost of 2 for VF 1 For instruction: %.963 = fadd double %.957, %2
LV: Found an estimated cost of 2 for VF 1 For instruction: %.969 = fadd double %1, %.963
LV: Found an estimated cost of 2 for VF 1 For instruction: %.1008 = fadd double %.429, %.969
LV: Found an estimated cost of 1 for VF 1 For instruction: %.1027 = select i1 %.413, i64 %arg._put_144param.5.0, i64 0
LV: Found an estimated cost of 1 for VF 1 For instruction: %.1028 = add i64 %.1027, %.309.0156
LV: Found an estimated cost of 0 for VF 1 For instruction: %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028
LV: Found an estimated cost of 1 for VF 1 For instruction: store double %.1008, double* %.1041, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %.358 = icmp sgt i64 %.311.0157, 1
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %.358, label %B121, label %B136.loopexit
LV: Scalar loop costs: 136.
LAA: Bad stride - Not an AddRecExpr pointer %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415 SCEV: ((8 * ({%.234,+,1}<%B121> + %.414)) + %arg.sptpriceparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448 SCEV: ((8 * ({%.234,+,1}<%B121> + %.447)) + %arg.strikeparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504 SCEV: ((8 * ({%.234,+,1}<%B121> + %.503)) + %arg.volatilityparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615 SCEV: ((8 * ({%.234,+,1}<%B121> + %.614)) + %arg.timevparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678 SCEV: ((8 * ({%.234,+,1}<%B121> + %.677)) + %arg.rateparam.4)
LAA: Bad stride - Not an AddRecExpr pointer %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028 SCEV: ((8 * ({%.234,+,1}<%B121> + %.1027)) + %arg._put_144param.4)
LV: Found uniform instruction: %.358 = icmp sgt i64 %.311.0157, 1
LV: Found uniform instruction: %.311.0157 = phi i64 [ %.320, %B121.lr.ph ], [ %.367, %B121 ]
LV: Found uniform instruction: %.367 = add nsw i64 %.311.0157, -1
LV: Found scalar instruction: %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415
LV: Found scalar instruction: %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448
LV: Found scalar instruction: %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504
LV: Found scalar instruction: %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615
LV: Found scalar instruction: %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678
LV: Found scalar instruction: %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028
LV: Found scalar instruction: %.311.0157 = phi i64 [ %.320, %B121.lr.ph ], [ %.367, %B121 ]
LV: Found scalar instruction: %.367 = add nsw i64 %.311.0157, -1
LV: Found an estimated cost of 0 for VF 2 For instruction: %.311.0157 = phi i64 [ %.320, %B121.lr.ph ], [ %.367, %B121 ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %.309.0156 = phi i64 [ %.234, %B121.lr.ph ], [ %.371, %B121 ]
LV: Found an estimated cost of 1 for VF 2 For instruction: %.367 = add nsw i64 %.311.0157, -1
LV: Found an estimated cost of 8 for VF 2 For instruction: %.413 = icmp slt i64 %.309.0156, 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.414 = select i1 %.413, i64 %arg.sptpriceparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.415 = add i64 %.414, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.428 = getelementptr double, double* %arg.sptpriceparam.4, i64 %.415
LV: Found an estimated cost of 27 for VF 2 For instruction: %.429 = load double, double* %.428, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %.447 = select i1 %.413, i64 %arg.strikeparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.448 = add i64 %.447, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.461 = getelementptr double, double* %arg.strikeparam.4, i64 %.448
LV: Found an estimated cost of 27 for VF 2 For instruction: %.462 = load double, double* %.461, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %.371 = add i64 %.309.0156, 1
LV: Found an estimated cost of 1 for VF 2 For instruction: %.503 = select i1 %.413, i64 %arg.volatilityparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.504 = add i64 %.503, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.517 = getelementptr double, double* %arg.volatilityparam.4, i64 %.504
LV: Found an estimated cost of 27 for VF 2 For instruction: %.518 = load double, double* %.517, align 8
LV: Found an estimated cost of 2 for VF 2 For instruction: %.524 = fmul double %.518, 5.000000e-01
LV: Found an estimated cost of 1 for VF 2 For instruction: %.614 = select i1 %.413, i64 %arg.timevparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.615 = add i64 %.614, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.628 = getelementptr double, double* %arg.timevparam.4, i64 %.615
LV: Found an estimated cost of 27 for VF 2 For instruction: %.629 = load double, double* %.628, align 8
LV: Found an estimated cost of 2 for VF 2 For instruction: %.563 = fmul double %.518, %.524
LV: Found an estimated cost of 2 for VF 2 For instruction: %.635 = fmul double %.518, %.629
LV: Found an estimated cost of 1 for VF 2 For instruction: %.677 = select i1 %.413, i64 %arg.rateparam.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.678 = add i64 %.677, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.691 = getelementptr double, double* %arg.rateparam.4, i64 %.678
LV: Found an estimated cost of 27 for VF 2 For instruction: %.692 = load double, double* %.691, align 8
LV: Found an estimated cost of 2 for VF 2 For instruction: %.698 = fadd double %.563, %.692
LV: Found an estimated cost of 2 for VF 2 For instruction: %.737 = fmul double %.629, %.698
LV: Found an estimated cost of 69 for VF 2 For instruction: %.481.le = fdiv double %.429, %.462
LV: Found an estimated cost of 2 for VF 2 For instruction: %.655.le = fmul double %.635, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %.743 = fadd double %.481.le, %.737
LV: Found an estimated cost of 69 for VF 2 For instruction: %.762.le = fdiv double %.743, %.655.le
LV: Found an estimated cost of 2 for VF 2 For instruction: %.772 = fsub double %.762.le, %.655.le
LV: Found an estimated cost of 2 for VF 2 For instruction: %.778 = fmul double %.762.le, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %.784 = fadd double %.778, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %.790 = fmul double %.772, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %.796 = fadd double %.790, 5.000000e-01
LV: Found an estimated cost of 2 for VF 2 For instruction: %0 = fmul double %.629, %.692
LV: Found an estimated cost of 2 for VF 2 For instruction: %1 = fmul double %.462, %0
LV: Found an estimated cost of 2 for VF 2 For instruction: %2 = fmul double %1, %.796
LV: Found an estimated cost of 2 for VF 2 For instruction: %.957 = fmul double %.429, %.784
LV: Found an estimated cost of 2 for VF 2 For instruction: %.963 = fadd double %.957, %2
LV: Found an estimated cost of 2 for VF 2 For instruction: %.969 = fadd double %1, %.963
LV: Found an estimated cost of 2 for VF 2 For instruction: %.1008 = fadd double %.429, %.969
LV: Found an estimated cost of 1 for VF 2 For instruction: %.1027 = select i1 %.413, i64 %arg._put_144param.5.0, i64 0
LV: Found an estimated cost of 1 for VF 2 For instruction: %.1028 = add i64 %.1027, %.309.0156
LV: Found an estimated cost of 0 for VF 2 For instruction: %.1041 = getelementptr double, double* %arg._put_144param.4, i64 %.1028
LV: Found an estimated cost of 27 for VF 2 For instruction: store double %.1008, double* %.1041, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %.358 = icmp sgt i64 %.311.0157, 1
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %.358, label %B121, label %B136.loopexit
LV: Vector loop of width 2 costs: 180.
LV: Selecting VF: 1.
LV: The target has 16 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #4 Interval # 4
LV(REG): At #5 Interval # 5
LV(REG): At #6 Interval # 5
LV(REG): At #7 Interval # 5
LV(REG): At #8 Interval # 5
LV(REG): At #9 Interval # 6
LV(REG): At #10 Interval # 6
LV(REG): At #11 Interval # 6
LV(REG): At #12 Interval # 6
LV(REG): At #13 Interval # 7
LV(REG): At #14 Interval # 8
LV(REG): At #15 Interval # 8
LV(REG): At #16 Interval # 8
LV(REG): At #17 Interval # 8
LV(REG): At #18 Interval # 9
LV(REG): At #19 Interval # 10
LV(REG): At #20 Interval # 10
LV(REG): At #21 Interval # 10
LV(REG): At #22 Interval # 10
LV(REG): At #23 Interval # 10
LV(REG): At #24 Interval # 10
LV(REG): At #25 Interval # 11
LV(REG): At #26 Interval # 11
LV(REG): At #27 Interval # 11
LV(REG): At #28 Interval # 11
LV(REG): At #29 Interval # 11
LV(REG): At #30 Interval # 11
LV(REG): At #31 Interval # 12
LV(REG): At #32 Interval # 12
LV(REG): At #33 Interval # 11
LV(REG): At #34 Interval # 11
LV(REG): At #35 Interval # 11
LV(REG): At #36 Interval # 11
LV(REG): At #37 Interval # 11
LV(REG): At #38 Interval # 11
LV(REG): At #39 Interval # 11
LV(REG): At #40 Interval # 10
LV(REG): At #41 Interval # 9
LV(REG): At #42 Interval # 9
LV(REG): At #43 Interval # 9
LV(REG): At #44 Interval # 8
LV(REG): At #45 Interval # 7
LV(REG): At #46 Interval # 6
LV(REG): At #47 Interval # 6
LV(REG): At #48 Interval # 5
LV(REG): At #50 Interval # 3
LV(REG): VF = 1
LV(REG): Found max usage: 12
LV(REG): Found invariant usage: 2
LV(REG): LoopSize: 52
LV: Loop cost is 136
LV: Not Interleaving.
LV: Vectorization is possible but not beneficial.
LV: Interleaving is not beneficial.
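
One possible reading of the "Bad stride" lines above: each address is computed as `index + select(index < 0, dim, 0)`, which is the usual negative-index wraparound. The select depends on the induction variable, so the pointer is not a plain add-recurrence even though the stride is 1 in practice. The snippet below is only a pure-Python model of those two IR instructions, not Numba or LLVM code; `wrapped_index` and `unsigned_index` are hypothetical names used for illustration.

```python
def wrapped_index(i, n):
    """Model of the signed path in the log: add n when i is negative, else 0."""
    offset = n if i < 0 else 0   # mirrors: select i1 (i < 0), i64 n, i64 0
    return i + offset            # mirrors: add i64 offset, i


def unsigned_index(i):
    """Model of an index known to be non-negative: no select, address is base + i."""
    return i


n = 8
signed = [wrapped_index(i, n) for i in range(n)]
unsigned = [unsigned_index(i) for i in range(n)]

# For the non-negative indices a parfor loop actually visits, both forms
# agree and consecutive elements are exactly one apart (unit stride).
strides = {b - a for a, b in zip(signed, signed[1:])}
print(signed == unsigned, strides)
```

If this reading is right, removing the negative-index check (for example by using unsigned indices) would leave only the second form, which is the shape the vectorizer can recognize as unit stride.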

sklam · 2018-01-31T19:26:42Z

I wonder if turning on fastmath will help vectorize the code.

DrTodd13 · 2018-01-31T21:06:39Z

When I did these tests, I did put fastmath on for the main function. At that time I wondered whether that flag would be carried through to where the gufunc is compiled. Later, I think I printed the flags at the gufunc compilation point, and I believe fastmath was not set there. So I could add it there and give it a try.

DrTodd13 · 2018-02-01T17:52:23Z

Some comments on that LLVM opt output from an expert.

The -debug dump below details the estimated cost of the original scalar loop body “For VF=1”, vs. the estimated cost of vectorizing “For VF=2”, which clearly favors the scalar version: 136 < 180.
o The expensive instructions are the two fdiv’s (38 vs. 69, so vector is better) and the five loads plus one store (i.e., all memory accesses) of 64-bit doubles, which are treated as non-unit-stride and thus require gathers and a scatter when vectorized (1 vs. 27 each, so scalar is a whole lot better).
o But such loads and stores can alternatively be “scalarized”, with a rough cost of 3 or somewhat more: two scalar loads and a shuffle to place them into a vector, or extracting each element from a vector followed by two scalar stores. Even if the cost of such scalarization were 12 instead of 27, the vector version would still be cheaper than the scalar version, as its cost would be reduced by 6*(27-12) / 2 = 45 > 44 = 180 - 136.
o It would be interesting to measure the effect of vectorizing this case, by forcing it, to get a sense of what the “true” cost should be.
o Whether these loads and/or the store could be recognized as unit-stride, and vectorized more efficiently, is also worth checking.
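
The arithmetic in the second bullet can be checked with a short sketch. The constants come from the log above (scalar cost 136, VF 2 cost 180, cost 27 for each of the six memory operations at VF 2). It follows the expert's formula in dividing the saving by the vector width, which assumes the reported vector cost is per scalar iteration; `adjusted_vector_cost` is a hypothetical helper, not part of LLVM.

```python
SCALAR_COST = 136        # "LV: Scalar loop costs: 136."
VECTOR_COST_VF2 = 180    # "LV: Vector loop of width 2 costs: 180."
VF = 2                   # vector width considered in the log
MEM_OPS = 6              # five loads plus one store
MODEL_MEM_COST = 27      # cost the model assigns to each memory op at VF 2


def adjusted_vector_cost(mem_cost):
    """Vector loop cost if each memory op cost `mem_cost` instead of 27."""
    saving = MEM_OPS * (MODEL_MEM_COST - mem_cost) / VF
    return VECTOR_COST_VF2 - saving


# The expert's hypothetical: scalarized accesses costing 12 each.
hypothetical = adjusted_vector_cost(12)   # 180 - 45 = 135.0

# Per-access cost at which the vector and scalar loops would tie.
break_even = MODEL_MEM_COST - (VECTOR_COST_VF2 - SCALAR_COST) * VF / MEM_OPS

print(hypothetical, hypothetical < SCALAR_COST, round(break_even, 2))
```

Under these assumptions, any per-access cost below roughly 12.3 would tip the model toward vectorizing, so recognizing the accesses as unit-stride (cost close to the scalar case) would clear that bar with a wide margin.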

codecov-io · 2018-02-06T23:09:42Z

Codecov Report

Merging #2709 into master will increase coverage by 0.01%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master    #2709      +/-   ##
==========================================
+ Coverage   86.23%   86.25%   +0.01%     
==========================================
  Files         321      323       +2     
  Lines       66192    66591     +399     
  Branches     7378     7426      +48     
==========================================
+ Hits        57081    57436     +355     
- Misses       7932     7962      +30     
- Partials     1179     1193      +14

Ehsan Totoni and others added 30 commits December 13, 2017 15:08

fix stencil neighborhood option

fa521f6

PEP8 fixes

28624db

support stencil with variable neighborhood

ce1e518

Add result var of dot multiplication to typemap.

732cd24

2.7 fix where *args not supported.

e6aa1f0

support variable stencil offsets

5046ef0

test variable offset

1ff7ff9

refactor stencil index offsets

18bae89

support slices in stencil

f464bda

test slice in stencil

5ccd1c9

constant slice in stencil

56d3061

Benchmark updates.

e766e2a

Update of the examples.

b7d548a

New versions of kernel density.

5007647

New version of wave2d for just Numba.

6e1924c

Additional version of linear regression for Numba and ParallelAcceler…

1ee98c2

…ator.

Raise a runtime error if a user-specified neighborhood doesn't match …

0cb9475

…the dimensionality of the input array. If neighborhood not specified, then calculate it only once and store it in StencilFunc and can then be queried through neighborhood attribute.

Fixes for comments in Stencil PR.

3caa700

Note that user-specified output array for stencil must match type of …

362c808

…the kernel.

Adding Intel copyright.

dd55c98

Don't force Pillow to be installed. Raise ImportError if they try to …

874d045

…run this example without it though.

Fix bug with saving stencil neighborhood once computed.

b38bc5f

Require input file argument. Remove default sample.jpg argument. Clos…

3567c7a

…e img object when done.

Change to ValueError

68d55aa

Add a check that cval type matches kernel return type.

6f3be7e

Adding note that cval type must match stencil kernel return type. Add…

1a592fe

…ing section that stencil decorator returns StencilFunc type and that that contains a neighborhood attribute.

Adding information about StencilFunc neighborhood attribute. Added se…

cb7ce74

…ction about the various checks performed and exceptions raised.

Move replace returns and add indices functions to StencilFunc so they…

54dfdb2

… have access to attributes. Added a check to stencil decorator to check for unknown stencil options. Added support for standard_indexing option.

Adding description of standard_indexing stencil decorator option.

9ef872a

fix stencil() call error msg

846fc45

sklam and others added 17 commits January 3, 2018 12:49

Add is_contiguous and is_fortran util function

89ed79e

Add bypass logic in ascontiguousarray and asfortranarray to check for…

a9d71fc

… memory layout at runtime

Add itemsize argument

e18cbdf

Add tests for is_contiguous and is_fortran

1c5d2b2

and fixes for stride 1

Add tests for is_contiguous and is_fortran and fixes cases when some …

6cca595

…axis have 1s or 0s dims

Merge branch 'master' of https://github.com/numba/numba

f67f55b

Skip test on older numpy version

6afe8e4

Update array.flags.{c,f}_contiguous to also check runtime values

c0bfd78

in case of any layout array.

Add array attribute test on 0-d array

cb4e079

Merge branch 'master' of https://github.com/numba/numba

5ea001a

Merge branch 'master' into enh/ascontiguousarray

3acf1ae

Merge pull request #53 from sklam/enh/ascontiguousarray

219113e

Enh/ascontiguousarray

Merge branch 'master' into enh/ascontiguousarray

3c768c9

# Conflicts: # numba/targets/arrayobj.py

Fix test failure on older numpy.

2a98962

1) C layout array can also be F contiguous. Need to check this in the is_contig checks 2) Numba is now smarter about contiguousness. Update test to reflect that.

Use ascontiguousarray in parfor gufuncs.

ed78a74

Merge remote-tracking branch 'sklam/enh/ascontiguousarray'

ee85079

Add noalias and nocapture to pointer parameters to parfor gufuncs.

eaa4f2c

Only convert input arrays to contiguous.

6d4ec2c

DrTodd13 added 5 commits February 2, 2018 20:26

Support unsigned index for IndexValue.

bcc4207

Change many int to uint where appropriate. This will cause Numba to n…

a396658

…ot check for negative array indices which means that LLVM can figure out it is unit stride and vectorize.

Incoming start and end are signed but output is still unsigned.

2e0bff7

Add noalias and nocapture to pointer parameters to parfor gufuncs.

86225ca

Do an alias analysis of the parfor and don't add the alias flag if an…

61ad062

…y aliases are found.

DrTodd13 closed this Feb 6, 2018
