Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Sparse IntIndex.make_union / Numeric ops #13036

Closed
wants to merge 1 commit into from

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Apr 30, 2016

  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

Replace repeated list.append with np.union1d in IntIndex.make_union. make_union is used in numeric ops.

NOTE: It is also possible to fix IntIndex.intersect to use np.intersect1d, but it doesn't increase the performance (because the length of the result is smaller).

The below microbench assumes array's 90% is sparse.

import numpy as np
import pandas as pd

np.random.seed(1)
N = 1000000
a = np.array([np.nan] * N)
b = np.array([np.nan] * N)

indexer_a = np.unique(np.random.randint(0, N, N / 10))
indexer_b = np.unique(np.random.randint(0, N, N / 10))
a[indexer_a] = np.random.randint(0, 100, len(indexer_a))
b[indexer_b] = np.random.randint(0, 100, len(indexer_b))

sa = pd.SparseArray(a)
sb = pd.SparseArray(b)

on current master

%timeit sa.sp_index.make_union(sb.sp_index)
#10 loops, best of 3: 52.7 ms per loop

%timeit sa + sb
10 loops, best of 3: 47.8 ms per loop

After this PR

%timeit sa.sp_index.make_union(sb.sp_index)
100 loops, best of 3: 11.6 ms per loop

%timeit sa + sb
100 loops, best of 3: 15.3 ms per loop

@sinhrks sinhrks added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Numeric Operations Arithmetic, Comparison, and Logical operations Sparse Sparse Data Type labels Apr 30, 2016
@sinhrks sinhrks added this to the 0.18.2 milestone Apr 30, 2016
@codecov-io
Copy link

codecov-io commented Apr 30, 2016

Current coverage is 84.06%

Merging #13036 into master will decrease coverage by -0.00%

@@             master     #13036   diff @@
==========================================
  Files           136        136          
  Lines         50005      49994    -11   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          42040      42027    -13   
- Misses         7965       7967     +2   
  Partials          0          0          
  1. 2 files (not in diff) in pandas/tseries were modified. more
    • Misses +1
    • Hits -7
  2. 2 files (not in diff) in pandas/io were modified. more
    • Misses +1
    • Hits -1
  3. 1 files (not in diff) in pandas were modified. more
    • Hits -5

Powered by Codecov. Last updated by e9ef522...33104f5

@jreback
Copy link
Contributor

jreback commented Apr 30, 2016

this looks fine for 0.18.1. do you want to add some sparse benchmarks?

@sinhrks sinhrks force-pushed the sparse_make_union branch 2 times, most recently from a48f05d to d0034fe Compare April 30, 2016 20:31
@sinhrks sinhrks modified the milestones: 0.18.1, 0.18.2 Apr 30, 2016
@sinhrks
Copy link
Member Author

sinhrks commented Apr 30, 2016

OK, added whatsnew and bench.

   before     after       ratio
  [286782da] [a48f05db]
-   18.15ms     8.20ms      0.45  sparse.sparse_arithmetic.time_sparse_addition_10percent
-   18.17ms     8.08ms      0.44  sparse.sparse_arithmetic.time_sparse_division_10percent
-   17.97ms     7.73ms      0.43  sparse.sparse_arithmetic.time_sparse_addition_10percent_zero
-   18.47ms     7.84ms      0.42  sparse.sparse_arithmetic.time_sparse_division_10percent_zero
-    2.06ms   810.82μs      0.39  sparse.sparse_arithmetic.time_sparse_division_1percent
-    2.05ms   804.12μs      0.39  sparse.sparse_arithmetic.time_sparse_addition_1percent


# if is one already, returns self
y = y_.to_int_index()

if self.length != y.length:
raise Exception('Indices must reference same underlying length')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that I am recommending it, because the numpy code is just so much shorter. But this could have been much faster if the array was pre-allocated (to a max len), then you slice it at the end. Rather than appending using a list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah will take suggested approach when I fix others using list.append.

@jreback
Copy link
Contributor

jreback commented Apr 30, 2016

ok ping on green.

@sinhrks
Copy link
Member Author

sinhrks commented Apr 30, 2016

Thanks, now green.

@jreback jreback closed this in 3ff5af0 Apr 30, 2016
@sinhrks sinhrks deleted the sparse_make_union branch April 30, 2016 23:23
@jreback
Copy link
Contributor

jreback commented Apr 30, 2016

thanks @sinhrks

@sinhrks sinhrks mentioned this pull request May 4, 2016
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants