convolution.convolve() Cython code hoist optimizations #7

jamienoss · 2018-01-31T18:17:27Z

Hoist index computions

astropy/convolution/boundary_none.pyx::convolve2d_boundary_none()

...
 # Now run the proper convolution
        for i in range(wkx, nx - wkx):
            for j in range(wky, ny - wky):
                top = 0.
                bot = 0.
                for ii in range(i - wkx, i + wkx + 1):
                    for jj in range(j - wky, j + wky + 1):
                        val = f[ii, jj]
                        ker = g[<unsigned int>(nkx - 1 - (wkx + ii - i)),
                                <unsigned int>(nky - 1 - (wky + jj - j))]
                        if not npy_isnan(val):
                            top += val * ker
                            bot += ker
                if normalize_by_kernel:
                    if bot == 0:
                        conv[i, j] = f[i, j]
                    else:
                        conv[i, j] = top / bot
                else:
                    conv[i, j] = top
...

Is (at least) more performant as the foloowing (though this is probably (hopefully) optimized as such by the compiler anyhow)

...
   cdef unsigned int i, j, ii, jj
    cdef unsigned int nkx_minus_1 = nkx-1, nky_minus_1 = nky-1
    cdef unsigned int wkx_minus_i, wky_minus_j
    cdef unsigned int ker_i, ker_j
    cdef unsigned int ny_minus_wky = ny - wky
    cdef unsigned int i_minus_wkx, wkx_plus_1 = wkx + 1
    cdef unsigned int j_minus_wky, wky_plus_1 = wky + 1
    cdef unsigned int i_plus_wkx_plus_1, j_plus_wky_plus_1
    cdef unsigned int nkx_minus_1_minus_wkx_plus_i, nky_minus_1_minus_wky_plus_j

    cdef int iimin, iimax, jjmin, jjmax
    cdef DTYPE_t top, bot, ker, val

    # release the GIL
    with nogil:

        # Now run the proper convolution
        for i in range(wkx, nx - wkx):
             # wkx - 1
            wkx_minus_i = wkx - i
            # i - wkx
            i_minus_wkx = i - wkx
            # i + wkx + 1
            i_plus_wkx_plus_1 = i + wkx_plus_1
            # nkx - 1 - (wkx - i)
            nkx_minus_1_minus_wkx_plus_i = nkx_minus_1 - wkx_minus_i
            for j in range(wky, ny_minus_wky):
                wky_minus_j = wkx - j
                j_minus_wky = j - wky
                j_plus_wky_plus_1 = j + wky_plus_1
                nky_minus_1_minus_wky_plus_j = nky_minus_1 - wky_minus_j

                top = 0.
                bot = 0.
                for ii in range(i_minus_wkx, i_plus_wkx_plus_1):
                    # nkx - 1 - (wkx + ii - i)
                    ker_i = nkx_minus_1_minus_wkx_plus_i - ii
                    for jj in range(j_minus_wky, j_plus_wky_plus_1):
                        ker_j = nky_minus_1_minus_wky_plus_j - jj
                        val = f[ii, jj]
                        ker = g[ker_i, ker_j]
                        if not npy_isnan(val):
                            top += val * ker
                            bot += ker
                if normalize_by_kernel:
                    if bot == 0:
                        conv[i, j] = f[i, j]
                    else:
                        conv[i, j] = top / bot
                else:
                    conv[i, j] = top
    # GIL acquired again here
    return conv
...

Hoist kernel.sum()

c.f. #4

bot += ker (sum of the kernel) is being re-computed for every element in the image. This is unnecessary. Since bot is being reset to 0 for each element, it is highly unlikely that the C compiler is clever enough to optimize this out by hoisting it above the outer most loop.

Hmmm, maybe not. The interpolation scheme seems to prevent this hoist and is the reason that the kernel summation exists in the Cython code. If the image element is NaN then the kernel element is not incorporated in the sum. (Only an issue for NaN interpolation.)

Shortcircuit `normalize_by_kernel`

c.f. #4

If this was straight C, I would make this function inline and then wrap it with two funcs or by a single with a condition on normalize_by_kernel so as to shortcircuit the internal conditional pre-call. Since the function is now inline, the C compiler would (should) remove the conditional from within the loop altogether. This way the performance increase can be obtained whilst keeping the code in a single place to remove duplication and aid readability plus and maintenance.

The text was updated successfully, but these errors were encountered:

jamienoss mentioned this issue Jan 31, 2018

Convolution perf notes #2

Open

10 tasks

jamienoss changed the title ~~convolution.convolve() core code general perf rewrite~~ convolution.convolve() hoist optimizations Feb 1, 2018

jamienoss changed the title ~~convolution.convolve() hoist optimizations~~ convolution.convolve() Cython code hoist optimizations Feb 1, 2018

Repository owner locked and limited conversation to collaborators Mar 13, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

convolution.convolve() Cython code hoist optimizations #7

convolution.convolve() Cython code hoist optimizations #7

jamienoss commented Jan 31, 2018 •

edited

Loading

convolution.convolve() Cython code hoist optimizations #7

convolution.convolve() Cython code hoist optimizations #7

Comments

jamienoss commented Jan 31, 2018 • edited Loading

Hoist index computions

Hoist kernel.sum()

Shortcircuit normalize_by_kernel

jamienoss commented Jan 31, 2018 •

edited

Loading

Shortcircuit `normalize_by_kernel`