PERF: multi-index selection vs repeated selections #10287

Closed
jreback opened this issue Jun 5, 2015 · 0 comments · Fixed by #10290
Labels: MultiIndex, Performance (memory or execution speed performance)
jreback commented Jun 5, 2015

From a Stack Overflow question:

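    import numpy as np
    import pandas as pd

    idx = pd.IndexSlice

    # Build a 10M-row frame: A-D become the index levels, x and y stay as data columns.
    n = 10000000
    np.random.seed(1234)
    mdt = pd.DataFrame()
    mdt['A'] = np.random.choice(range(10000,45000,1000), n)
    mdt['B'] = np.random.choice(range(10,400), n)
    mdt['C'] = np.random.choice(range(1,150), n)
    mdt['D'] = np.random.choice(range(10000,45000), n)
    mdt['x'] = np.random.choice(range(400), n)
    mdt['y'] = np.random.choice(range(25), n)

    # Centre of the range to select on each level.
    test_A = 25000
    test_B = 25
    test_C = 40
    test_D = 35000

    # Half-width of the range on each level.
    eps_A = 5000
    eps_B = 5
    eps_C = 5
    eps_D = 5000

    # Move A-D into a sorted MultiIndex so the levels can be sliced.
    # sortlevel() was the API at the time; newer pandas versions use sort_index() instead.
    mdt2 = mdt.set_index(['A','B','C','D']).sortlevel()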

Single selection on all four levels at once:

    In [106]: %timeit  mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]
    1 loops, best of 3: 4.34 s per loop

Repeated selection, one level at a time:

    In [105]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]
    10 loops, best of 3: 140 ms per loop
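The two forms are meant to return the same rows, so the comparison can be sanity-checked on a smaller frame. The following is a sketch only: it assumes a recent pandas (hence `sort_index()` rather than `sortlevel()`), uses a reduced `n` so it runs quickly, and the helper names `select_once` and `select_repeated` are illustrative, not from the original report. Since this issue was closed by #10290, the timings printed on a current pandas need not show the gap reported above.

```python
import timeit

import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Reduced size (assumption) so the check runs in seconds, not the report's 10M rows.
n = 200_000
rng = np.random.RandomState(1234)
mdt = pd.DataFrame({
    'A': rng.choice(np.arange(10000, 45000, 1000), n),
    'B': rng.choice(np.arange(10, 400), n),
    'C': rng.choice(np.arange(1, 150), n),
    'D': rng.choice(np.arange(10000, 45000), n),
    'x': rng.choice(400, n),
    'y': rng.choice(25, n),
})
mdt2 = mdt.set_index(['A', 'B', 'C', 'D']).sort_index()

# Same ranges as in the report: test_* - eps_* to test_* + eps_*.
a = slice(20000, 30000)
b = slice(20, 30)
c = slice(35, 45)
d = slice(30000, 40000)


def select_once(df):
    """Slice all four index levels in a single .loc call."""
    return df.loc[idx[a, b, c, d], :]


def select_repeated(df):
    """Slice one index level per .loc call, narrowing the frame each time."""
    out = df.loc[idx[a], :]
    out = out.loc[idx[:, b], :]
    out = out.loc[idx[:, :, c], :]
    return out.loc[idx[:, :, :, d], :]


# Both approaches should select exactly the same rows.
print(select_once(mdt2).equals(select_repeated(mdt2)))

# Rough timings; the numbers depend on the pandas version and the machine.
print(timeit.timeit(lambda: select_once(mdt2), number=3))
print(timeit.timeit(lambda: select_repeated(mdt2), number=3))
```

The slices are label-based and inclusive on both ends, which is why the chained version can reuse the same bounds level by level.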

jreback added the Performance and MultiIndex labels Jun 5, 2015
jreback added this to the Next Major Release milestone Jun 5, 2015
jreback modified the milestones: 0.16.2, Next Major Release, 0.17.0 Jun 5, 2015
jreback added a commit to jreback/pandas that referenced this issue Jun 22, 2015
yarikoptic added a commit to neurodebian/pandas that referenced this issue Jul 2, 2015
* commit 'v0.16.2-42-g383865f': (72 commits)
  BUG: provide categorical concat always on axis 0, pandas-dev#10430; numpy 1.10 makes this an error for 1-d on axis != 0
  DOC: update missing.rst with ref to groupby.rst
  BUG: Timedeltas with no specified units (and frac) should raise, pandas-dev#10426
  BUG: using .loc[:,column] fails when the object is a multi-index, pandas-dev#10408
  Removed scikit-timeseries migration docs from FAQ
  BUG: GH10395 bug in DataFrame.interpolate with axis=1 and inplace=True
  BUG: GH10392 bug where Table.select_column does not preserve column name
  TST: Use unicode literals in string test
  PERF: fix _get_level_indexer to accept an intermediate indexer result
  PERF: bench for pandas-dev#10287
  BUG: drop_duplicates drops name(s).
  ENH: Enable ExcelWriter to construct in-memory sheets
  BLD: remove support for 3.2, pandas-dev#9118
  PERF: timedelta and datetime64 ops improvements
  PERF: parse timedelta strings in cython pandas-dev#6755
  closes bug in reset_index when index contains NaT
  Check for size=0 before setting item Fixes pandas-dev#10193
  closes bug in apply when function returns categorical
  BUG: frequencies.get_freq_code raises an error against offset with n != 1
  CI: run doc-tests always
  ...
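
The commit titled "PERF: fix _get_level_indexer to accept an intermediate indexer result" in the list above hints at why the repeated selection was faster: each step only had to look at the rows that survived the previous one. The sketch below illustrates that general idea in plain NumPy. It is not pandas' implementation; the function names `select_full_masks` and `select_narrowing` and the array layout (one value array per level) are assumptions made for the illustration.

```python
import numpy as np


def select_full_masks(levels, bounds):
    """Build a boolean mask over ALL rows for every level, then intersect them."""
    mask = np.ones(len(levels[0]), dtype=bool)
    for values, (lo, hi) in zip(levels, bounds):
        mask &= (values >= lo) & (values <= hi)
    return np.flatnonzero(mask)


def select_narrowing(levels, bounds):
    """Carry an intermediate indexer so each level inspects only surviving rows."""
    indexer = np.arange(len(levels[0]))
    for values, (lo, hi) in zip(levels, bounds):
        candidates = values[indexer]
        indexer = indexer[(candidates >= lo) & (candidates <= hi)]
    return indexer


rng = np.random.RandomState(0)
n = 100_000
# One array of values per level, loosely mirroring levels A-D (illustrative data).
levels = [
    rng.choice(np.arange(10000, 45000, 1000), n),
    rng.choice(np.arange(10, 400), n),
    rng.choice(np.arange(1, 150), n),
    rng.choice(np.arange(10000, 45000), n),
]
bounds = [(20000, 30000), (20, 30), (35, 45), (30000, 40000)]

# Both strategies return the same row positions, in ascending order.
print(np.array_equal(select_full_masks(levels, bounds),
                     select_narrowing(levels, bounds)))
```

With selective ranges the narrowing version touches far fewer elements on the later levels, which is the effect the repeated `.loc` calls exploit.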