PERF: Updated andrews_curves to use Numpy arrays for its samples #11534

Merged
merged 1 commit into pandas-dev:master from khs26:numpify-andrews-curves on Nov 24, 2015

Conversation

4 participants
Contributor

khs26 commented Nov 6, 2015

Hello,

I hope I've followed the contribution guidelines correctly, but am happy to change things if necessary.

I noticed that andrews_curves doesn't make use of numpy arrays in what I thought was a sensible use case: for generating its samples.

I added a test which uses variable length random data, so that I could check the timing changes between the numpy and non-numpy versions and found the following (rough data):

| Length of data | Time / s (non-numpy) | Time / s (numpy) |
|---------------:|---------------------:|-----------------:|
| Without test   | 3.43                 | 3.84             |
| 10             | 4.69                 | 4.21             |
| 100            | 6.58                 | 5.20             |
| 1000           | 20.67                | 16.41            |
| 10000          | 162.17               | 125.37           |

The test adds some overhead (though it is decorated with @slow), so I'm happy to amend the commit and remove it. Otherwise, the changes seem to have resulted in a small speed up, which becomes more important for larger data (my original motivation, since I was trying to do it with a 100k x 5 dataframe).

Thanks,

Kyle
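The core of the change described above can be sketched as follows (a minimal illustrative sketch, not the actual diff from the PR; `samples` is the number of points at which each curve is evaluated):

```python
import numpy as np

# Minimal sketch of the idea (not the actual pandas diff): generate the
# sample points for each curve as a single NumPy array rather than with
# a Python-level list comprehension.
samples = 200

# Before (list-based, illustrative):
x_list = [-np.pi + 2.0 * np.pi * (t / float(samples)) for t in range(samples)]

# After (vectorized):
x = np.linspace(-np.pi, np.pi, samples)
```

Downstream arithmetic on `x` then happens in NumPy rather than in a Python loop, which is where the speed-up comes from.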

khs26 changed the title from Updated andrews_curves to use Numpy arrays for its samples to PERF: Updated andrews_curves to use Numpy arrays for its samples Nov 6, 2015

Member

sinhrks commented Nov 7, 2015

lgtm. @TomAugspurger ?

sinhrks added the Performance label Nov 7, 2015

Contributor

TomAugspurger commented Nov 7, 2015

Yeah. @khs26 could you add an item to the release notes under enhancements? doc/source/whatsnew/v0.17.0

jreback added this to the 0.17.1 milestone Nov 7, 2015

Contributor

jreback commented Nov 7, 2015

@khs26 are those times in seconds?

@jreback jreback and 1 other commented on an outdated diff Nov 7, 2015

pandas/tools/plotting.py
```diff
 x1 = amplitudes[0]
 result = x1 / sqrt(2.0)
 harmonic = 1.0
 for x_even, x_odd in zip(amplitudes[1::2], amplitudes[2::2]):
-    result += (x_even * sin(harmonic * x) +
-               x_odd * cos(harmonic * x))
+    result += (x_even * np.sin(harmonic * t) +
+               x_odd * np.cos(harmonic * t))
```
@jreback

jreback Nov 7, 2015

Contributor

this entire loop can easily be vectorized if you want to give that a shot
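The suggestion above could be sketched like this (my sketch, not code from the PR; it assumes `amplitudes` is the 1-D row of coefficients and `t` the array of sample points, and it mirrors the `zip()` pairing of sin/cos coefficients from the loop):

```python
import numpy as np

def andrews_curve_vectorized(amplitudes, t):
    """Sketch of vectorizing the per-harmonic loop for one row."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    t = np.asarray(t, dtype=float)
    x_even = amplitudes[1::2]          # coefficients of sin(k*t)
    x_odd = amplitudes[2::2]           # coefficients of cos(k*t)
    # zip() stops at the shorter sequence, so mirror that here
    n = min(len(x_even), len(x_odd))
    x_even, x_odd = x_even[:n], x_odd[:n]
    harmonics = np.arange(1, n + 1)
    # shape (n, len(t)): one row of angle values per harmonic
    kt = np.outer(harmonics, t)
    result = amplitudes[0] / np.sqrt(2.0)
    # sum over harmonics in one matrix-vector product per trig function
    result = result + x_even @ np.sin(kt) + x_odd @ np.cos(kt)
    return result
```

The `np.outer` call replaces the `harmonic` counter, and the two matrix products replace the accumulation loop.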

@khs26

khs26 Nov 8, 2015

Contributor

I did look at vectorising the loops, but I think it potentially gets problematic with Numpy slices being views and needing to resize the arrays. I didn't want to get a marginal performance increase by making the code way less clear. I'll certainly give it a look, if I have time, though.

Currently writing up my PhD, so need to make sure I keep focusing on the writing. I must say that you guys have made it really easy to contribute, what with being knowledgeable and decent human beings all at the same time. :) Also, the contribution guidelines are excellent.

Contributor

khs26 commented Nov 8, 2015

  1. I'll add changes to the release notes.

  2. Yeah, the times are in seconds and rough timing for the whole test (as opposed to just my changes). I'm still getting familiar with nose and testing generally in Python.

@jreback jreback commented on an outdated diff Nov 8, 2015

doc/source/whatsnew/v0.17.1.txt

```diff
@@ -64,6 +64,7 @@ Performance Improvements
 - Improved performance of ``rolling_median`` (:issue:`11450`)
 - Improved performance to ``to_excel`` (:issue:`11352`)
+- Improved performance of ``andrews_curves``
```
@jreback

jreback Nov 8, 2015

Contributor

add the issue number here (use this PR number as there is no associated issue)

Contributor

jreback commented Nov 8, 2015

ok, some comments.

Let's create another issue that records the performance implications (after this PR), and maybe you (or another brave soul) can address it in the future.

Contributor

khs26 commented Nov 10, 2015

I actually found a bit of time to fix it up. I'm going to update the branch shortly.

This change didn't really change things hugely. I suspected it would only make a significant difference on wide dataframes (i.e. those with a large number of coefficients), and I think data that wide would be pushing the utility of Andrews plots as a visualisation technique anyway. It's definitely a close call between a slight performance increase and slightly lessened readability.

Here are the timings:

| (length, width) of data | Time / s (non-numpy) | Time / s (first vectorisation) | Time / s (both vectorisations) |
|------------------------:|---------------------:|-------------------------------:|-------------------------------:|
| 10, 1                   | 0.04                 | 0.03                           | 0.03                           |
| 10, 10                  | 0.07                 | 0.06                           | 0.07                           |
| 10, 100                 | 0.16                 | 0.10                           | 0.12                           |
| 10, 1000                | 0.86                 | 0.26                           | 0.20                           |
| 100, 1                  | 0.29                 | 0.31                           | 0.25                           |
| 100, 10                 | 0.47                 | 0.32                           | 0.29                           |
| 100, 100                | 1.06                 | 0.55                           | 0.42                           |
| 100, 1000               | 7.74                 | 1.59                           | 1.22                           |
| 1000, 1                 | 1.88                 | 1.70                           | 1.41                           |
| 1000, 10                | 2.87                 | 1.44                           | 2.06                           |
| 1000, 100               | 9.83                 | 3.07                           | 2.33                           |
| 1000, 1000              | 75.11                | 13.48                          | 10.19                          |
| 10000, 1                | 3.31                 | 2.20                           | 2.00                           |
| 10000, 10               | 4.68                 | 2.29                           | 2.48                           |
| 10000, 100              | 16.94                | 4.22                           | 4.06                           |
| 10000, 1000             | 149.49               | 25.25                          | 19.45                          |

Contributor

khs26 commented Nov 10, 2015

I've left it as a separate commit for the moment, but will squash it if we're happy with it.

Contributor

jreback commented Nov 13, 2015

@jreback jreback commented on the diff Nov 13, 2015

pandas/tools/plotting.py

```diff
@@ -560,14 +575,14 @@ def f(x):
 for i in range(n):
```
@jreback

jreback Nov 13, 2015

Contributor

the way to get a massive speedup is to do this ALL in a vectorized way, e.g.
pass in the entire numpy array (df.values), do the calculation, then plot in a loop
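That suggestion might look something like this (my sketch under stated assumptions, not the PR's code: `values` is the whole 2-D array such as `df.values` with shape `(n_rows, p)`, and the sin/cos pairing mirrors the original `zip()` behaviour):

```python
import numpy as np

def all_andrews_curves(values, samples=200):
    """Compute every Andrews curve in one vectorized pass; the caller
    can then plot the returned rows in a loop."""
    values = np.asarray(values, dtype=float)
    t = np.linspace(-np.pi, np.pi, samples)
    x_even = values[:, 1::2]           # sin coefficients, one row per curve
    x_odd = values[:, 2::2]            # cos coefficients, one row per curve
    n = min(x_even.shape[1], x_odd.shape[1])
    harmonics = np.arange(1, n + 1)
    kt = np.outer(harmonics, t)        # shape (n, samples)
    # broadcast the constant term over all sample points, then add the
    # harmonic sums via two (n_rows, n) @ (n, samples) matrix products
    curves = (values[:, :1] / np.sqrt(2.0)
              + x_even[:, :n] @ np.sin(kt)
              + x_odd[:, :n] @ np.cos(kt))
    return t, curves                   # curves has shape (n_rows, samples)
```

Only the final `plt.plot` calls would remain in Python-level iteration.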

@khs26

khs26 Nov 13, 2015

Contributor

Oh, of course. I'll do that and the asv benchmark. Will probably be over the weekend.

@jreback jreback modified the milestone: Next Major Release, 0.17.1 Nov 18, 2015

Contributor

jreback commented Nov 18, 2015

@khs26 can you rebase / update

Contributor

khs26 commented Nov 22, 2015

Sorry that took a while, I've been quite busy. I ran a profile on it, and it turns out that with the existing changes the slowest part is now the drawing in matplotlib (with the same or lower scaling).

I added an asv benchmark as well. Hopefully I got it right; it seems to run OK, and I copied the basic template from the timeseries plotting benchmarks in benchmarks/plotting.py.
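An asv benchmark along those lines might look like the following (an illustrative sketch, not the benchmark actually committed in the PR; class and method names here are hypothetical, and asv discovers `time_*` methods by convention):

```python
import numpy as np
import pandas as pd

class AndrewsCurves(object):
    # asv calls setup() before each timed method
    def setup(self):
        self.df = pd.DataFrame(np.random.randn(100, 4),
                               columns=list('abcd'))
        self.df['Name'] = np.random.randint(0, 3, size=100)

    def time_plot_andrews_curves(self):
        # note: in the 2015 codebase this lived in pandas.tools.plotting
        pd.plotting.andrews_curves(self.df, 'Name')
```

asv then reports the wall time of `time_plot_andrews_curves` across commits, which is what makes regressions in this code path visible.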

Contributor

khs26 commented Nov 22, 2015

Also rebased the commit on top of the latest master, so it should be good to merge.

@jreback jreback commented on an outdated diff Nov 22, 2015

doc/source/whatsnew/v0.17.1.txt

```diff
@@ -146,6 +146,7 @@ Performance Improvements
 - Performance improvement in ``Categorical.remove_unused_categories``, (:issue:`11643`).
 - Improved performance of ``Series`` constructor with no data and ``DatetimeIndex`` (:issue:`11433`)
 - Improved performance of ``shift``, ``cumprod``, and ``cumsum`` with groupby (:issue:`4095`)
+- Improved performance of ``andrews_curves`` (:issue:`11534`)
```
@jreback

jreback Nov 22, 2015

Contributor

pls move to 0.18.0 file

@jreback jreback modified the milestone: 0.18.0, Next Major Release Nov 22, 2015

@khs26 khs26 committed f3cb813:
PERF: Updated andrews_curves to use a Numpy array for samples
DOC: Added some documentation to andrews_curves
TST: Added a variable length test to TestDataFramePlots.test_andrews_curves
Contributor

jreback commented Nov 23, 2015

looks good. ping on green.

Contributor

khs26 commented Nov 23, 2015

Travis is happy and I'm gonna leave it be now, so you can merge whenever you want.

@jreback jreback added a commit that referenced this pull request Nov 24, 2015

@jreback jreback Merge pull request #11534 from khs26/numpify-andrews-curves
PERF: Updated andrews_curves to use Numpy arrays for its samples
5bc191a

@jreback jreback merged commit 5bc191a into pandas-dev:master Nov 24, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Contributor

jreback commented Nov 24, 2015

thank you sir!

khs26 deleted the khs26:numpify-andrews-curves branch Nov 24, 2015

@jreback jreback added a commit that referenced this pull request Nov 29, 2015

@jreback jreback COMPAT: remove warnings from #11534 d9e679a