
Multi-index and CategoricalIndex performance #22044

Closed
0x0L opened this issue Jul 24, 2018 · 3 comments


@0x0L (Contributor) commented Jul 24, 2018

Hi,

Building a multi-index from a categorical index should be instantaneous: the labels are the categorical's codes, and the corresponding level is a CategoricalIndex over the same N categories with codes [0 ... N-1], as intended in

categories = CategoricalIndex(values.categories,

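To make that concrete, here is a minimal sketch (illustrative only, not pandas internals) of the intended relationship: the MultiIndex labels are just the categorical's existing codes, and the level's own codes are 0 ... N-1 by construction.

import numpy as np
import pandas as pd

cat = pd.Categorical(['b', 'a', 'b', 'c'])
labels = cat.codes                     # array([1, 0, 1, 2], dtype=int8), reusable as-is
level = pd.CategoricalIndex(cat.categories,
                            categories=cat.categories,
                            ordered=cat.ordered)
# the level's codes are simply 0 ... N-1 over the same categories
print((level.codes == np.arange(len(cat.categories))).all())   # True
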
Unfortunately, that call is painfully slow:

import pandas as pd
import numpy as np

pd.__version__
# 0.23.3

x = np.linspace(-1, 1, 1_000_000)
i = pd.Index(x)
j = i.astype('category')

%timeit pd.MultiIndex.from_arrays([j])
# 180 ms ± 912 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The line above from categorical.py is the first bottleneck; it takes 130 ms in this example.
We could work around it and replace that line with something along the lines of

indexer = np.arange(len(values.categories), dtype=values.codes.dtype)
categories = pd.CategoricalIndex([], fastpath=True).set_categories(values.categories)
categories._data._codes = indexer
categories._cleanup()

On my machine, this takes ~2 ms.
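
Wiring the workaround to the example above as a quick sanity check (values here stands for the Categorical backing j, which is an assumption about how the snippet would plug into pandas): the result matches the index built the slow way.

values = j.values
indexer = np.arange(len(values.categories), dtype=values.codes.dtype)
categories = pd.CategoricalIndex([], fastpath=True).set_categories(values.categories)
categories._data._codes = indexer
categories._cleanup()
# should be equal to the index built by the slow constructor call
print(categories.equals(pd.CategoricalIndex(values.categories)))   # expect True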

The remaining ~50 ms are spent calling _shallow_copy in MultiIndex._set_levels. That copy may not be so shallow: it looks like it boils down to yet another reconstruction of the CategoricalIndex:

%timeit categories._shallow_copy()
# 46 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.CategoricalIndex(categories, categories.categories, categories.ordered)
# 47.2 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note, however, that

%timeit pd.CategoricalIndex(categories)
# 3.69 µs ± 2.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
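
My reading of that gap, as an assumption rather than something verified against the constructor internals: passing categories explicitly seems to force a per-value recode against them (roughly a hash lookup per element), whereas omitting them lets the already-categorical input keep its codes. Very roughly:

# hypothetical illustration of where the extra time plausibly goes:
# an O(n) lookup of every value against the categories ...
recoded = categories.categories.get_indexer(np.asarray(categories))
# ... versus simply reusing the codes already attached to the input
reused = categories.codes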

I would have liked to make a PR, but I don't fully understand the code and all the class overloading.

@TomAugspurger (Contributor) commented Jul 24, 2018

"Building a multi-index from a categorical index should be instantaneous"

Is this true only when building a MultiIndex with a single level?

@0x0L (Contributor, Author) commented Jul 24, 2018

Levels are independent, so this is true of any CategoricalIndex in a MultiIndex. See

def _factorize_from_iterables(iterables):

which is where the work is done when calling pd.MultiIndex.from_arrays.
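
For reference, a hedged paraphrase of the categorical branch of the underlying single-iterable helper (simplified from pandas 0.23; the function name below is just for illustration, not the real one):

def _factorize_one_categorical(values):
    # illustrative paraphrase, not a verbatim copy of pandas internals:
    # the level is rebuilt via the slow constructor call quoted at the
    # top of this issue, while the codes are reused directly as labels
    categories = pd.CategoricalIndex(values.categories,
                                     categories=values.categories,
                                     ordered=values.ordered)
    codes = values.codes
    return codes, categories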

@shenker commented Sep 2, 2018

Just wanted to chime in that this seems to be the main performance bottleneck for me when reading in string index columns from Parquet (using, e.g., pyarrow).

Because they're dictionary-encoded on disk, they're efficiently read in as Categorical columns; calling set_index on a single such column (to make a CategoricalIndex), or on a string column plus a few integer columns (to make a MultiIndex), takes 1-2 minutes for 250M rows, which is longer than it actually takes to read the data from disk.
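
For context, an illustrative reproduction of that workflow (the file and column names are hypothetical, and it assumes, as described above, that the dictionary-encoded string columns come back as category dtype):

import pandas as pd

df = pd.read_parquet("events.parquet", engine="pyarrow")   # string columns arrive as category
df = df.set_index(["name", "year", "month"])               # the MultiIndex build is the slow step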

@0x0L referenced this issue Jun 7, 2019

Merged: PERF: building MultiIndex with categorical levels #26721

@jreback added this to the 0.25.0 milestone Jun 8, 2019
