Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: RangeIndex.format performance #35712

Merged
merged 22 commits into from
Aug 26, 2020

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Aug 13, 2020

#35440 dropped RangeIndex._format_with_header, which was functionally not needed, but needed to avoid creating an internal ndarray.

This rectifies that + gives some perf. improvements.

>>> idx = pd.RangeIndex(1_000_000)
>>> %timeit idx.format()
4.6 s ± 102 ms per loop  # pandas v1.1.0
1.67 s ± 19.6 ms per loop  # master
595 ms ± 2.35 ms per loop  # this PR

Also, now the _data attribute isn't called, so this PR gives a perf. & memory improvement in some use cases compared to master:

>>> idx = pd.RangeIndex(1_000_000)
>>> idx.format()
>>> "_data" in idx._cache
False # pandas v.1.1.0
True  # master
False  # this PR

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Do we have benchmarks for this? If not do we want to add them?

@jreback jreback added Output-Formatting __repr__ of pandas objects, to_string Performance Memory or execution speed performance labels Aug 14, 2020
@@ -187,6 +187,11 @@ def _format_data(self, name=None):
# we are formatting thru the attributes
return None

def _format_with_header(self, header, na_rep="NaN") -> List[str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type

@topper-123 topper-123 force-pushed the RangeIndex.format branch 3 times, most recently from bbb33d5 to 5ff29f3 Compare August 19, 2020 17:43
@jreback jreback added this to the 1.2 milestone Aug 19, 2020
@jreback
Copy link
Contributor

jreback commented Aug 19, 2020

looks good. this is worth a perf mention in 1.2 whatsnew. ping on green.

@topper-123
Copy link
Contributor Author

topper-123 commented Aug 20, 2020

The TravisCI issue is just a timeout issue, but In CI / Web and docs, I've got a problem under Check ipython directive errors that I'm not able to fix:

Run ! grep -B1 "^<<<-------------------------------------------------------------------------$" sphinx.log
[0 rows x 1 columns]
<<<-------------------------------------------------------------------------
##[error]Process completed with exit code 1.

The source code for this in .github/workflows/ci.yml says:

    # This can be removed when the ipython directive fails when there are errors,
    # including the `tee sphinx.log` in te previous step (https://github.com/ipython/ipython/issues/11547)
    - name: Check ipython directive errors
      run: "! grep -B1 \"^<<<-------------------------------------------------------------------------$\" sphinx.log"

I don't understand what this check is for or why my PR fails here. @TomAugspurger, you posted that mentioned ipython issue, do you mind looking at why my PR fails on this?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas/core/indexes/range.py Show resolved Hide resolved
@topper-123
Copy link
Contributor Author

Updated.

@jreback
Copy link
Contributor

jreback commented Aug 21, 2020

lgtm. I think this is ok for 1.1.2 as a perf regresion, can you move the note? ping on green.

@jreback jreback modified the milestones: 1.2, 1.1.2 Aug 21, 2020
@topper-123 topper-123 changed the title PERF: RangeIndex.format perf regression PERF: RangeIndex.format performance Aug 22, 2020
@topper-123
Copy link
Contributor Author

Updated & green.

doc/source/whatsnew/v1.1.2.rst Show resolved Hide resolved
@@ -187,6 +187,15 @@ def _format_data(self, name=None):
# we are formatting thru the attributes
return None

def _format_with_header(self, header: List[str], na_rep: str = "NaN") -> List[str]:
if len(self._range) == 0:
return []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be return header?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

I added tests for this, and found that DateTimeIndex([]).format(name=True) returns ["None"], where it should return [""]. Same for PeriodIndex. I fixed that bug.

@simonjayhawkins
Copy link
Member

@topper-123 merge conflicts in whatsnew

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @topper-123 lgtm pending green

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, if you can address and merge on green.

@@ -187,6 +187,15 @@ def _format_data(self, name=None):
# we are formatting thru the attributes
return None

def _format_with_header(self, header: List[str], na_rep: str = "NaN") -> List[str]:
if len(self._range) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not len(self._range)

@topper-123
Copy link
Contributor Author

That merge brancg master has caused some strange error. anyone know how to fix it?

@topper-123
Copy link
Contributor Author

Thanks, @simonjayhawkins, were there any git oneliners you used for this?

@simonjayhawkins
Copy link
Member

Thanks, @simonjayhawkins, were there any git oneliners you used for this?

I don't profess to know what was going on. resetting HEAD, merging master and then pulling from the branch seemed to fix except for the duplicate release note.

@simonjayhawkins
Copy link
Member

minor comment, if you can address and merge on green.

comment addressed

@simonjayhawkins simonjayhawkins merged commit b7a31eb into pandas-dev:master Aug 26, 2020
@simonjayhawkins
Copy link
Member

Thanks @topper-123

@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.1.x

@lumberbot-app
Copy link

lumberbot-app bot commented Aug 26, 2020

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 1.1.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 b7a31ebeb873f37939838b3b788dab0e4c41b8de
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #35712: PERF: RangeIndex.format performance'
  1. Push to a named branch :
git push YOURFORK 1.1.x:auto-backport-of-pr-35712-on-1.1.x
  1. Create a PR against branch 1.1.x, I would have named this PR:

"Backport PR #35712 on branch 1.1.x"

And apply the correct labels and milestones.

Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instruction are inaccurate, feel free to suggest an improvement.

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Aug 26, 2020
@topper-123 topper-123 deleted the RangeIndex.format branch August 26, 2020 13:29
simonjayhawkins added a commit that referenced this pull request Aug 26, 2020
Co-authored-by: Terji Petersen <contribute@tensortable.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants