PERF: categorical rank #15498

Closed
jreback opened this issue Feb 24, 2017 · 5 comments
Labels: Categorical (Categorical Data Type), Performance (Memory or execution speed performance)

@jreback
Contributor

jreback commented Feb 24, 2017

xref #15422 (comment)

Easy enough after #15422 to rank the categories themselves rather than the expanded values; probably most relevant for object dtypes.

In [15]: s = Series(tm.makeCategoricalIndex(100000))

In [16]: res = Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()

In [17]: res2 = s.rank()

In [18]: res.equals(res2)
Out[18]: True

In [19]: %timeit Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()
100 loops, best of 3: 4.39 ms per loop

In [20]: %timeit s.rank()
10 loops, best of 3: 132 ms per loop
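
For anyone reading along, here is a rough sketch of the same idea that avoids `tm.makeCategoricalIndex` and looks up category ranks through the integer codes instead of `rename_categories`. The sample labels, the random seed, and the variable names are made up for illustration, and this is not necessarily how the eventual fix is implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100,000 draws from five string labels.
rng = np.random.default_rng(0)
labels = np.array(["apple", "banana", "cherry", "date", "elderberry"], dtype=object)
s = pd.Series(pd.Categorical(rng.choice(labels, size=100_000)))

# Rank the five categories once, instead of ranking 100,000 strings.
category_ranks = pd.Series(s.cat.categories).rank().to_numpy()

# Look up each row's category rank through its integer code; code -1 marks a missing value.
codes = s.cat.codes.to_numpy()
stand_ins = np.where(codes == -1, np.nan, category_ranks[codes])

# The float stand-ins keep the same order and ties as the strings,
# so ranking them should give the same result as ranking the expanded values.
fast = pd.Series(stand_ins, index=s.index).rank()
slow = s.astype(object).rank()
```

The expensive comparison of Python strings only happens for the handful of categories; the per-row work is integer indexing plus a float rank.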
@jreback added the Categorical (Categorical Data Type), Difficulty Novice, and Performance (Memory or execution speed performance) labels Feb 24, 2017
@jreback jreback added this to the Next Major Release milestone Feb 24, 2017
@jreback jreback mentioned this issue Feb 24, 2017
4 tasks
@jeetjitsu
Contributor

jeetjitsu commented Feb 25, 2017

@jreback @jorisvandenbossche: in the _values_for_rank method in Categorical, reorganizing and moving the typecast to float outside the if condition, like below, has the advantage that I can set a single rank function for categoricals in the _get_data_algo function in pandas/core/algorithms.py, which IMO is cleaner. Should I move it out, or do you think otherwise?

    def _values_for_rank(self):
        from pandas import Series
        if self.ordered:
            values = self.codes
            mask = values == -1
            values = values.astype('float64')
            if mask.any():
                values[mask] = np.nan
        else:
            values = np.array(
                self.rename_categories(Series(self.categories).rank())
            )
        return values


    def _get_data_algo(values, func_map):

        f = None

        if is_float_dtype(values):
            f = func_map['float64']
            values = _ensure_float64(values)
        ...
        elif is_categorical_dtype(values):
            f = func_map['float64']
            values = values._values_for_rank()
        ...

@jreback
Contributor Author

jreback commented Feb 25, 2017

yes, going to need some reorg
mainly have to pass in the actual rank args themselves (easy enough, just pass them through as kwargs)

@jorisvandenbossche
Member

@ikilledthecat You can open a PR with the above change, that will be the easiest to discuss (in any case, the above certainly looks reasonable. A reason to keep the astype in the if condition is to avoid a conversion of the data from int to float when not needed, which will give a (small) performance penalty.)

@jreback why is it needed to pass rank args? The above (or similar) seems OK to me without additional args

@jorisvandenbossche
Member

Another thing we could do for performance is for the unordered categorical to first check whether the categories are sorted before doing the renaming (from a quick test this checking is much less expensive than the actual renaming). Although that may not be worth the complexity.
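
A minimal sketch of what that shortcut might look like, written as a standalone function rather than as the Categorical method. The name `values_for_rank` and the example data are hypothetical, and it assumes the categories are mutually comparable so that `is_monotonic_increasing` is meaningful.

```python
import numpy as np
import pandas as pd

def values_for_rank(cat: pd.Categorical) -> np.ndarray:
    """Hypothetical helper: float stand-ins whose order matches the category values."""
    codes = np.asarray(cat.codes)
    missing = codes == -1
    if cat.ordered or cat.categories.is_monotonic_increasing:
        # Codes already follow the desired order, so no per-category ranking is needed.
        values = codes.astype("float64")
    else:
        # Categories are stored out of order: rank them, then map codes to ranks.
        category_ranks = pd.Series(cat.categories).rank().to_numpy(dtype="float64")
        values = category_ranks[codes]
    # Missing entries become NaN in both branches.
    values[missing] = np.nan
    return values

# Same data, once with sorted categories and once with categories stored out of order.
sorted_cat = pd.Categorical(["b", "a", None, "c"])
shuffled_cat = pd.Categorical(["b", "a", None, "c"], categories=["c", "a", "b"])

ranks_sorted = pd.Series(values_for_rank(sorted_cat)).rank()
ranks_shuffled = pd.Series(values_for_rank(shuffled_cat)).rank()
```

The two branches return different stand-in values (zero-based codes versus one-based ranks), but only their relative order matters for the final `rank()` call.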

@jreback
Contributor Author

jreback commented Feb 25, 2017

so when rank is called on the categories it's fine
but needs na--position in order to order any na

though that will be handled when the categories are re expanded so maybe not needed

yeah, makes more sense that way
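
To illustrate the point about missing values, here is a small sketch. It assumes the rank argument in question is `na_option`, and the toy data is made up. The category-level rank never sees a NaN, and the NaN policy is applied only once the values are re-expanded.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["b", None, "a", "b"]))

# Rank only the categories; categories themselves never contain NaN.
category_ranks = pd.Series(s.cat.categories).rank().to_numpy()

# Re-expand through the codes: missing entries (code -1) come back as NaN.
codes = s.cat.codes.to_numpy()
expanded = pd.Series(np.where(codes == -1, np.nan, category_ranks[codes]), index=s.index)

# The NaN policy is applied here, on the expanded values.
keep = expanded.rank(na_option="keep")
bottom = expanded.rank(na_option="bottom")
top = expanded.rank(na_option="top")
```

So the inner rank on the categories can run with default arguments, and only the outer rank needs whatever options the caller passed.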

@jeetjitsu jeetjitsu mentioned this issue Feb 27, 2017
4 tasks
@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 1, 2017
@jreback jreback closed this as completed in 1c106c8 Mar 1, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#15498

Author: Prasanjit Prakash <jeet@gmail.com>

Closes pandas-dev#15518 from ikilledthecat/rank_categorical_perf and squashes the following commits:

30b49b9 [Prasanjit Prakash] PERF: GH15498 - pep8 changes
ad38544 [Prasanjit Prakash] PERF: GH15498 - asv tests and whatsnew
1ebdb56 [Prasanjit Prakash]  PERF: categorical rank GH#15498
a67cd85 [Prasanjit Prakash] PERF: categorical rank GH#15498
81df7df [Prasanjit Prakash]  PERF: categorical rank GH#15498
45dd125 [Prasanjit Prakash]  PERF: categorical rank GH#15498
33249b3 [Prasanjit Prakash] PERF: categorical rank GH#15498