DOC: Added note about groupby excluding Decimal columns by default #18953

Merged
4 commits merged into pandas-dev:master on Nov 8, 2018

Conversation

@pdpark commented Dec 27, 2017

Also included an example of how to explicitly aggregate Decimal columns.

@gfyoung added the Docs, Dtype Conversions, and Groupby labels on Dec 27, 2017
@@ -497,6 +497,28 @@ index are the group names and whose values are the sizes of each group.

``nth`` can act as a reducer *or* a filter, see :ref:`here <groupby.nth>`

Decimal columns are "nuisance" columns that .agg automatically excludes in groupby.

If you do wish to aggregate them you must do so explicitly:
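The example code itself is truncated in this review view. As a rough, self-contained sketch of the behaviour the added note describes (the column names and the dec_column2 values match the fragments quoted further down in this review; all other data values here are made up, and the silent exclusion reflects the pandas versions under discussion, not necessarily current releases):

from decimal import Decimal

import pandas as pd

dec = pd.DataFrame(
    {'name': ['foo', 'bar', 'foo', 'bar'],        # made-up values
     'title': ['a', 'b', 'a', 'b'],                # made-up values
     'id': [1, 2, 1, 2],                           # made-up values
     'int_column': [1, 2, 3, 4],                   # made-up values
     'dec_column1': [Decimal('0.50'), Decimal('0.15'),
                     Decimal('0.25'), Decimal('0.40')],   # made-up values
     'dec_column2': [Decimal('0.20'), Decimal('0.30'),
                     Decimal('0.55'), Decimal('0.60')]},  # values from the diff
    columns=['name', 'title', 'id', 'int_column', 'dec_column1', 'dec_column2'])

# In the pandas versions discussed here, the object-dtype Decimal columns
# are treated as "nuisance" columns and silently dropped from the result.
dec.groupby(['name', 'title', 'id'], as_index=False).sum()

# Aggregating them explicitly with .agg keeps them in the output.
dec.groupby(['name', 'title', 'id'], as_index=False).agg(
    {'int_column': 'sum', 'dec_column1': 'sum', 'dec_column2': 'sum'})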
Member:

Add a comma between "them" and "you"

@gfyoung (Member) Dec 27, 2017, on the diff:

The word "nuisance" comes off a little too strong IMO. Just state that they're excluded and (very briefly) why that is the case.

Contributor:

no, we call excluded columns nuisance so this is ok.

@codecov (bot) commented Dec 27, 2017

Codecov Report

Merging #18953 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #18953   +/-   ##
=======================================
  Coverage    92.2%    92.2%           
=======================================
  Files         169      169           
  Lines       50924    50924           
=======================================
  Hits        46952    46952           
  Misses       3972     3972
Flag        Coverage Δ
#multiple   90.62% <ø> (ø) ⬆️
#single     42.3% <ø> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8ce3d0...c755f2c.

@topper-123 (Contributor) commented Dec 27, 2017

I'm not so sure Pandas should mention Decimal on the groupby page.

Operations on Decimals will by nature be slow, and people could mistake that slowness for slowness in Pandas. IMO Pandas should promote using the various numpy and pandas types and python str, and only very seldom other types.

Maybe it could go in the FAQ or maybe just to StackOverflow? If it goes in the pandas docs, there should IMO be a visible disclaimer regarding the slowness and the downsides of doing math operations on python objects rather than numpy objects.


Contributor, on the diff:

add this entire section in a .. note:: block

'dec_column2': [Decimal('0.20'), Decimal('0.30'), Decimal('0.55'), Decimal('0.60')]
},
columns=['name','title','id','int_column','dec_column1','dec_column2']
)
Contributor:

show the ``dec`` DataFrame here

columns=['name','title','id','int_column','dec_column1','dec_column2']
)

dec.groupby(['name', 'title', 'id'], as_index=False).sum()
Contributor:

add a comment before each of the groupbys (e.g. .sum() by default excludes nuisance columns), and how to include them
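A minimal sketch of what that might look like, assuming the ``dec`` DataFrame from the sketch earlier in this thread (an illustration only, not the committed doc text):

# .sum() by default excludes the object-dtype Decimal ("nuisance") columns.
dec.groupby(['name', 'title', 'id'], as_index=False).sum()

# Selecting the Decimal columns explicitly aggregates them on their own.
dec.groupby(['name', 'title', 'id'], as_index=False)[['dec_column1', 'dec_column2']].sum()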

Contributor, on the diff:

move this entire section down to 'Automatic exclusion of nuisance columns'

@pdpark (Author) commented Dec 28, 2017

Made requested changes.

@pdpark (Author) commented Jan 5, 2018

Fix for issue #17027.

@jreback (Contributor) commented Feb 11, 2018

Can you rebase this?

@jreback (Contributor) commented Jun 26, 2018

closing as stale

@jreback jreback closed this Jun 26, 2018
@gfyoung (Member) commented Jun 27, 2018

@jreback Can certainly rebase. 🙂

@gfyoung gfyoung reopened this Jun 27, 2018
@gfyoung gfyoung added this to the 0.24.0 milestone Jun 27, 2018
@jreback (Contributor) commented Jun 27, 2018

@gfyoung i am not sure this is in the right direction
needs significant edits

@gfyoung (Member) commented Jun 27, 2018

> i am not sure this is in the right direction

Hmm... I see. I was under the impression, given the conversation, that this just needed rebasing. What do you mean by the "right direction"?

@jreback (Contributor) commented Jun 27, 2018

needs to be way way simpler

@jreback (Contributor) commented Jun 27, 2018

my point is i would start over

@pdpark (Author) commented Jun 27, 2018

I can revisit this.

@jorisvandenbossche (Member) left a comment:

Some tips:

  • I would leave the change to gotchas.rst as a separate PR, as this is really a different topic
  • In the decimal example, try to trim down the code to what is minimally needed, to have a much simpler example. You don't need multiple decimal columns; a single key to group by is sufficient.
  • This is in general true for object dtype columns, not only for decimals. I would also mention that somewhere.
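Along the lines of those tips, a trimmed-down example might look roughly like this (a sketch only: a single group key and one Decimal column, with made-up data; not the wording that was eventually merged):

from decimal import Decimal

import pandas as pd

df_dec = pd.DataFrame(
    {'id': [1, 2, 1, 2],
     'int_column': [1, 2, 3, 4],
     'dec_column': [Decimal('0.50'), Decimal('0.15'),
                    Decimal('0.25'), Decimal('0.40')]})

# The object-dtype Decimal column is a "nuisance" column and is dropped
# from the default aggregation ...
df_dec.groupby('id').sum()

# ... but it can be aggregated explicitly, together with the integer
# column, by spelling out the aggregation with .agg.
df_dec.groupby('id').agg({'int_column': 'sum', 'dec_column': 'sum'})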

@alphaKuz commented Jun 28, 2018 via email

@pdpark (Author) commented Jul 12, 2018

@jorisvandenbossche Will do.

@pdpark (Author) commented Jul 16, 2018

@jorisvandenbossche & @jreback Requested changes made.

@jreback (Contributor) commented Oct 7, 2018

You need to merge master and force push; you are picking up all commits since July.

@pdpark (Author) commented Oct 11, 2018

Fixed

@datapythonista (Member):

@jreback Is this ready to be merged after the last changes?

@jreback (Contributor) commented Nov 3, 2018

So I am not averse to expanding this section; however, decimals DO now work on master if they are typed as EAs (ExtensionArrays). So maybe add a reference to that possibility here.

@jreback jreback removed this from the 0.24.0 milestone Nov 6, 2018
@datapythonista (Member):

@pdpark do you have time to update based on the last comment?

@jorisvandenbossche (Member):

> however decimals DO now work on master if they are typed as EAs

The decimals EA is not publicly exposed, only for testing purposes for now.

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Nov 8, 2018
@jorisvandenbossche (Member):

I updated this PR (slight rewording + simplified and fixed the example a little bit), should be ready to merge now.

@jorisvandenbossche jorisvandenbossche changed the title Added note about groupby excluding Decimal columns by default DOC: Added note about groupby excluding Decimal columns by default Nov 8, 2018
@jorisvandenbossche (Member):

> Some lines are >80, but I think in rst files we don't care

We care a little bit, to the extent that it fits in the code block box without generating scroll bars, but I think the limit is a bit higher than 80 for that.

@jorisvandenbossche (Member):

Merging since it is a /pandas/doc/source change only (the failing CI is still running an old version).

@jorisvandenbossche jorisvandenbossche merged commit b73d602 into pandas-dev:master Nov 8, 2018
@jorisvandenbossche (Member):

@pdpark Thanks a lot for the PR!

thoo added a commit to thoo/pandas that referenced this pull request Nov 10, 2018
…fixed

* upstream/master: (47 commits)
  CLN: remove values attribute from datetimelike EAs (pandas-dev#23603)
  DOC/CI: Add linting to rst files, and fix issues (pandas-dev#23381)
  PERF: Speeds up creation of Period, PeriodArray, with Offset freq (pandas-dev#23589)
  PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex (pandas-dev#23591)
  TST: Tests and Helpers for Datetime/Period Arrays (pandas-dev#23502)
  Update description of Index._values/values/ndarray_values (pandas-dev#23507)
  Fixes to make validate_docstrings.py not generate warnings or unwanted output (pandas-dev#23552)
  DOC: Added note about groupby excluding Decimal columns by default (pandas-dev#18953)
  ENH: Support writing timestamps with timezones with to_sql (pandas-dev#22654)
  CI: Auto-cancel redundant builds (pandas-dev#23523)
  Preserve EA dtype in DataFrame.stack (pandas-dev#23285)
  TST: Fix dtype mismatch on 32bit in IntervalTree get_indexer test (pandas-dev#23468)
  BUG: raise if invalid freq is passed (pandas-dev#23546)
  remove uses of (ts)?lib.(NaT|iNaT|Timestamp) (pandas-dev#23562)
  BUG: Fix error message for invalid HTML flavor (pandas-dev#23550)
  ENH: Support EAs in Series.unstack (pandas-dev#23284)
  DOC: Updating DataFrame.join docstring (pandas-dev#23471)
  TST: coverage for skipped tests in io/formats/test_to_html.py (pandas-dev#22888)
  BUG: Return KeyError for invalid string key (pandas-dev#23540)
  BUG: DatetimeIndex slicing with boolean Index raises TypeError (pandas-dev#22852)
  ...
Labels
Docs, Dtype Conversions, Groupby

Development

Successfully merging this pull request may close these issues:

'groupby' multiple columns and 'sum' multiple columns with different types

7 participants