Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rank categorical perf #15518

Closed
wants to merge 7 commits into from

Conversation

jeetjitsu
Copy link
Contributor

@jeetjitsu jeetjitsu commented Feb 27, 2017

@codecov-io
Copy link

codecov-io commented Feb 27, 2017

Codecov Report

Merging #15518 into master will decrease coverage by -0.71%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15518      +/-   ##
==========================================
- Coverage   91.06%   90.36%   -0.71%     
==========================================
  Files         136      136              
  Lines       49099    49556     +457     
==========================================
+ Hits        44714    44783      +69     
- Misses       4385     4773     +388
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.48% <ø> (ø)
pandas/core/categorical.py 96.92% <100%> (+0.01%)
pandas/io/gbq.py 16.46% <0%> (-70.21%)
pandas/util/decorators.py 68.6% <0%> (-0.63%)
pandas/tools/merge.py 91.78% <0%> (-0.35%)
pandas/core/common.py 91.02% <0%> (-0.34%)
pandas/types/cast.py 85.42% <0%> (-0.24%)
pandas/core/groupby.py 94.98% <0%> (-0.02%)
pandas/core/frame.py 97.83% <0%> (-0.01%)
pandas/core/generic.py 96.31% <0%> (-0.01%)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d0a281f...30b49b9. Read the comment docs.

@jreback jreback added Categorical Categorical Data Type Performance Memory or execution speed performance labels Feb 27, 2017
@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

can you add an asv similar to from the issue (ideally tests multple dtypes as well)

@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

also a whatsnew (in Performance) entry would be great. code lgtm.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

[1, 2, 3, 4, 5, 6],
).astype('category').cat.set_categories(
[1, 2, 3, 4, 5, 6],
ordered=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simplify this creation by doing astype('category', categories=[1, 2, 3, 4, 5, 6], ordered=False)

values = np.array(self)
else:
values = np.array(
self.rename_categories(Series(self.categories).rank())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a small comment on why do this here?

@jeetjitsu
Copy link
Contributor Author

jeetjitsu commented Mar 1, 2017

I have made all the minor changes but can't seem to run the asv benchmarks. Tried running them on upstream/master as well but the tests fail with the following error

TypeError: unbound methods must have non-NULL im_class

I used the following command to run asv benchmarks

asv continuous -f 1.1 -E virtualenv upstream/master HEAD -b ^categorical

Googling for a solution hasn't helped. If you can point me towards a solution i will run the benchmarks and push my changes.

Here's a complete trace

� Creating environments
� Discovering benchmarks
�� Uninstalling from virtualenv-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
�� Installing into virtualenv-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytest-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt..
�� Error running /home/jeet/Projects/pandas/asv_bench/env/dbb8113eac34359a9ef26b93e54a1d03/bin/python /home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/benchmark.py discover /home/jeet/Projects/pandas/asv_bench/benchmarks /tmp/tmpklwg1xod
             STDOUT -------->
             
             STDERR -------->
             Traceback (most recent call last):
               File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/benchmark.py", line 818, in <module>
                 commands[mode](args)
               File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/benchmark.py", line 763, in main_discover
                 list_benchmarks(benchmark_dir, fp)
               File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/benchmark.py", line 748, in list_benchmarks
                 for benchmark in disc_benchmarks(root):
               File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/benchmark.py", line 653, in disc_benchmarks
                 for name, class_attr in inspect.getmembers(module_attr):
               File "/usr/lib/python2.7/inspect.py", line 253, in getmembers
                 value = getattr(object, key)
               File "/home/jeet/Projects/pandas/asv_bench/env/dbb8113eac34359a9ef26b93e54a1d03/local/lib/python2.7/site-packages/pandas/util/decorators.py", line 265, in __get__
                 return types.MethodType(self, instance)
             TypeError: unbound methods must have non-NULL im_class
Traceback (most recent call last):
  File "/home/jeet/.virtualenvs/pandas-dev/bin/asv", line 11, in <module>
    load_entry_point('asv==0.3.dev1099+a95b68de', 'console_scripts', 'asv')()
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/main.py", line 38, in main
    result = args.func(args)
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/commands/__init__.py", line 48, in run_from_args
    return cls.run_from_conf_args(conf, args)
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/commands/continuous.py", line 61, in run_from_conf_args
    quick=args.quick, **kwargs
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/commands/continuous.py", line 86, in run
    _returns=run_objs, _machine_file=_machine_file)
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/commands/run.py", line 185, in run
    benchmarks = Benchmarks(conf, repo, environments, regex=bench)
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/benchmarks.py", line 336, in __init__
    benchmarks = self.disc_benchmarks(conf, repo, environments)
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/benchmarks.py", line 385, in disc_benchmarks
    dots=False)
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/plugins/virtualenv.py", line 192, in run
    return self.run_executable('python', args, **kwargs)
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/environment.py", line 556, in run_executable
    return util.check_output([exe] + args, **kwargs)
  File "/home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/util.py", line 571, in check_output
    raise ProcessError(args, retcode, stdout, stderr)
asv.util.ProcessError: Command '/home/jeet/Projects/pandas/asv_bench/env/dbb8113eac34359a9ef26b93e54a1d03/bin/python /home/jeet/.virtualenvs/pandas-dev/lib/python3.6/site-packages/asv/benchmark.py discover /home/jeet/Projects/pandas/asv_bench/benchmarks /tmp/tmpklwg1xod' returned non-zero exit status 1

@jorisvandenbossche
Copy link
Member

@ikilledthecat Can you push your changes anyway (regardless whether you get asv running), so we can already review (and getting the asv benchs running is not a blocker for merging this PR in this case).

Regarding the asv, I don't have experience with running it using virtualenv, only used it with conda and never saw such an error.
As the error seems to come from asv itself, you may want to raise an issue on their tracker (https://github.com/spacetelescope/asv)

@jorisvandenbossche
Copy link
Member

@ikilledthecat something went wrong with your rebase, as you have many other commits included now as well. Doing

git fetch upstream
git checkout rank_categorical_perf
git rebase upstream/master
git push -f origin rank_categorical_perf

should normally always do the right thing

no need to rename categories where they are already ordered
check for numeric instead of monotonic
no need to rename categories where they are already ordered
@jeetjitsu
Copy link
Contributor Author

@jorisvandenbossche sorry, pushed after re-mergeing upstream/master.

@jreback
Copy link
Contributor

jreback commented Mar 1, 2017

you'll want to rebase on master as travis was having some issues

@jreback jreback added this to the 0.20.0 milestone Mar 1, 2017
@jreback jreback closed this in 1c106c8 Mar 1, 2017
@jreback
Copy link
Contributor

jreback commented Mar 1, 2017

thanks @ikilledthecat

@jeetjitsu jeetjitsu deleted the rank_categorical_perf branch March 1, 2017 23:48
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#15498

Author: Prasanjit Prakash <jeet@gmail.com>

Closes pandas-dev#15518 from ikilledthecat/rank_categorical_perf and squashes the following commits:

30b49b9 [Prasanjit Prakash] PERF: GH15498 - pep8 changes
ad38544 [Prasanjit Prakash] PERF: GH15498 - asv tests and whatsnew
1ebdb56 [Prasanjit Prakash]  PERF: categorical rank GH#15498
a67cd85 [Prasanjit Prakash] PERF: categorical rank GH#15498
81df7df [Prasanjit Prakash]  PERF: categorical rank GH#15498
45dd125 [Prasanjit Prakash]  PERF: categorical rank GH#15498
33249b3 [Prasanjit Prakash] PERF: categorical rank GH#15498
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: categorical rank
5 participants