
(Not so) minor optimizations #1012

Merged

adler-j merged 13 commits into odlgroup:master from minor_optimizations on Jun 6, 2017
Conversation

adler-j (Member) commented May 29, 2017

This started out small, but in the end I achieved a 2x speedup for small tomographic inversion problems, which is clearly nice.

Improvements include:

  • Minor optimizations on hot paths
    • Remove unnecessary "isinstance" calls
    • Remove repeated "size" etc. calls
  • Cache some often re-used stuff (like proximals) using functools.lru_cache
  • Reduce the number of "element" calls by using proper in-place arithmetic in the Douglas-Rachford solver
  • Performed some (rudimentary) performance tests; we now have three "levels" of lincomb, each optimal for its own case:
    • size < 100: Fall back to pure NumPy
    • 100 <= size <= 50000: Use BLAS-style arithmetic, but without BLAS
    • 50000 < size: Use BLAS
  • Renamed _lincomb to _lincomb_impl since the previous name did not play well with the Spyder profiler
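A minimal sketch of the three-level dispatch described above. The thresholds are the ones quoted in this description; the constant and function names are illustrative, not ODL's actual `_lincomb_impl`:

```python
import numpy as np

# Illustrative names for the thresholds quoted above
PURE_NUMPY_MAX_SIZE = 100
BLAS_MIN_SIZE = 50000

def lincomb_sketch(a, x1, b, x2, out):
    """Compute out <- a*x1 + b*x2, picking a strategy by array size."""
    size = out.size
    if size < PURE_NUMPY_MAX_SIZE:
        # Tiny arrays: a single NumPy expression beats any call overhead
        out[:] = a * x1 + b * x2
    elif size < BLAS_MIN_SIZE:
        # Medium arrays: BLAS-style in-place updates, but without BLAS
        out[:] = x1
        out *= a
        out += b * x2
    else:
        # Large arrays: hand off to real BLAS routines (scal/axpy)
        from scipy.linalg.blas import get_blas_funcs
        scal, axpy = get_blas_funcs(['scal', 'axpy'], arrays=(out,))
        out[:] = x1
        scal(a, out)
        axpy(x2, out, a=b)
    return out
```

The point of the middle level is that in-place NumPy operations avoid both temporary allocations and the per-call overhead of going through BLAS wrappers.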

adler-j (Member, Author) commented May 29, 2017

Added several new optimizations, squeezing out about 20% more performance in the aforementioned small examples.

A short guide to how I profiled this:

python -m cProfile -o profiling_data.pyprof profile_mri.py
pyprof2calltree -i profiling_data.pyprof

Here, cProfile ships with Python, and pyprof2calltree can be installed via pip install pyprof2calltree.

I then visualized the result using KCacheGrind.

Overall this was quite easy, and I no longer have to feel as ashamed of my reconstruction times when I present. We should do this more often.
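For a quick look at the numbers without KCacheGrind, the same data can be inspected with the stdlib pstats module. A sketch with a toy workload standing in for profile_mri.py:

```python
import cProfile
import io
import pstats

# Profile a toy workload in-process instead of via "python -m cProfile"
pr = cProfile.Profile()
pr.enable()
sum(i * i for i in range(100000))  # stand-in for the real reconstruction script
pr.disable()

# Print the 5 most time-consuming functions, sorted by cumulative time
buf = io.StringIO()
stats = pstats.Stats(pr, stream=buf)
stats.sort_stats('cumulative').print_stats(5)
report = buf.getvalue()
print(report)
```

pstats.Stats can also load the dump file directly (pstats.Stats('profiling_data.pyprof')), so the command-line workflow above and this in-process variant are interchangeable.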


if _blas_is_applicable(x1, x2, out):
# If data is very big, use BLAS if possible
if size >= 50000 and _blas_is_applicable(x1, x2, out):
adler-j (Author):
Perhaps add names for the constants 100 and 50000

kohr-h (Member):
Yes, please

tmp_2 = sum(Li.adjoint(wi) for Li, wi in zip(L, w2))
z1.lincomb(1.0, w1, - (tau / 2.0), tmp_2)
x += lam(k) * (z1 - p1)
# Compute tmp_domain = sum(Li.adjoint(wi) for Li, w2i in zip(L, w2))
adler-j (Author):
wi -> w2i

kohr-h (Member) left a review comment:
The optimizations look fine. I tried to eyeball them but didn't run any checks. I guess tests / examples will tell if this code is equivalent.
Check the points I raised, then go ahead.

@@ -535,7 +535,6 @@ class DiscretizedSpaceElement(DiscretizedSetElement, FnBaseVector):

def __init__(self, space, data):
"""Initialize a new instance."""
assert isinstance(space, DiscretizedSpace)
kohr-h (Member):
Fully agree, I think I remove all of these in the #861 PR

# Cache for efficiency instead of re-computing
try:
strd = self.__stride
except AttributeError:
kohr-h (Member):
I don't like this pattern of trying to access some attribute of self and react on failure. I would clearly prefer to initialize the attribute to some nonsense ("sentinel") value like None and check for that instead of checking if it's there at all. The check a is None should incur no cost whatsoever and make this more robust.

self.__stride = np.array(strd)
return self.__stride.copy()
else:
return strd.copy()
kohr-h (Member):
Why copy? I can see this as a general pattern, but that would require changing similar code in lots of places.

adler-j (Author):
Well because otherwise there is a severe risk of stuff like this:

strides = grid.strides()
strides += 2

# lots of code

strides = grid.strides()  # returns bullshit

This (with the copy) is equivalent to the old code.
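Both points — the None sentinel instead of try/except AttributeError, and returning a copy so callers cannot corrupt the cache — can be combined in one sketch. Class and attribute names here are illustrative, not the actual ODL code:

```python
import numpy as np

class GridSketch:
    """Toy stand-in for the grid class discussed above."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        self.__stride = None  # sentinel: cache not yet filled

    def strides(self):
        """C-order strides in elements, computed once and cached."""
        if self.__stride is None:
            # A cheap 'is None' check replaces the try/except AttributeError dance
            self.__stride = np.cumprod([1] + list(self.shape[:0:-1]))[::-1]
        # Return a copy so in-place edits by callers cannot corrupt the cache
        return self.__stride.copy()
```

With the copy in place, the hazard from the snippet above (a caller doing strides += 2 and later reads returning garbage) goes away.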

factor = factor * arr[slc]

out *= factor

# Finally multiply by the remaining 1d array
slc = [None] * out.ndim
slc[last_ax] = np.s_[:]
slc[last_ax] = slice(None)
kohr-h (Member):
Good change, now that we understand what's going on :-)
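For reference, np.s_[:] and slice(None) build the same object; the pattern in the hunk above uses None entries to insert broadcast axes and slice(None) to keep one full axis. The array names below are illustrative:

```python
import numpy as np

# The two spellings are interchangeable
assert np.s_[:] == slice(None)

out = np.ones((2, 3, 4))
vec = np.array([1.0, 2.0, 3.0, 4.0])  # 1d array to multiply along the last axis

last_ax = out.ndim - 1
slc = [None] * out.ndim     # None acts as np.newaxis when indexing
slc[last_ax] = slice(None)  # keep the full extent along last_ax
# vec[tuple(slc)] has shape (1, 1, 4) and broadcasts against out
out *= vec[tuple(slc)]
```

Using slice(None) makes the intent explicit: every other entry of slc really is np.newaxis, and only the chosen axis is a genuine slice.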

"""Decorate function to cache the result with given arguments.

This is equivalent to `functools.lru_cache` with Python 3, and currently
does nothing with Python 2 but this may change at some later point.
kohr-h (Member):
WTF?? Okay, I guess this is still better than depending on some backport package for Python 2. The "later point" at which this changes will be the point when we drop Python 2 compatibility.

In the future library, this seems to be available anyway. OK, not without an extra pip install.

adler-j (Author):
I'll leave it for now. Python 2 is mostly supported because we have to, in a sense. Minor optimizations of 10% should not really be expected on legacy platforms.

lam_k = lam(k)

# Compute tmp_domain = sum(Li.adjoint(vi) for Li, vi in zip(L, v))
L[0].adjoint(v[0], out=tmp_domain)
kohr-h (Member):
Do we reach this point if L is empty?

adler-j (Author):
Do we allow an empty L? We did not before, anyway.

I guess we can add an if statement for it, but I'm not even sure if the algorithm is valid in that case.
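A guard along those lines, sketched with NumPy matrices standing in for the operators — M.T @ v plays the role of Li.adjoint(vi), and the function name is illustrative:

```python
import numpy as np

def sum_adjoints(mats, vecs, out):
    """out <- sum(M.T @ v for M, v in zip(mats, vecs)), guarding the empty case.

    Mirrors the pattern above: the first adjoint is written directly into
    ``out`` (no temporary element), and the rest are accumulated in place.
    """
    if not mats:
        out[:] = 0  # an empty sum is zero, if the algorithm even allows this
        return out
    np.matmul(mats[0].T, vecs[0], out=out)
    for M, v in zip(mats[1:], vecs[1:]):
        out += M.T @ v
    return out
```

Whether the degenerate case should be zero or an error is exactly the open question above; the guard just makes the choice explicit instead of failing on L[0].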

adler-j (Member, Author) commented Jun 3, 2017

Fixed the comments, will merge.

@adler-j adler-j merged commit 7118cb1 into odlgroup:master Jun 6, 2017
@adler-j adler-j deleted the minor_optimizations branch June 6, 2017 02:00
@adler-j adler-j mentioned this pull request Jun 9, 2017