API: Deprecate skip_footer in read_csv #13386

Closed
wants to merge 1 commit into
base: master
from

6 participants
@gfyoung
Member

gfyoung commented Jun 7, 2016

Title is self-explanatory.

Closes gh-13349 and partially undoes this commit back in v0.9.0. With such a massive API now, having duplicate arguments makes managing it way less practical.

@gfyoung

Member

gfyoung commented Jun 7, 2016

In light of removing duplicate arguments and stripping down the API a bit, here are a couple of others I would like to propose (perhaps not in this PR but for discussion nonetheless):

  1. Deprecate skipfooter in read_excel (it's allowed as an alternative to skip_footer, though it is not surfaced in the signature but rather encompassed in a **kwargs argument)?

  2. Should we choose between sep and delimiter in read_csv?

@jreback

Contributor

jreback commented Jun 7, 2016

the problem is that we also have skiprows. I would actually deprecate skip_footer instead.

@gfyoung

Member

gfyoung commented Jun 7, 2016

Hmmm...that's a flip-flop AFAICT from what you said earlier here. However, deprecating skip_footer seems a little strange since we use skip_footer all over the place in the code (internally especially). That's why I chose to deprecate skipfooter, since the code impact is not very significant as you can tell. In addition, we use the _ in a lot of the other arguments, so I'd rather side with the majority.

@jreback

Contributor

jreback commented Jun 7, 2016

just change it internally to the correct kw. I think they diverged at some point. It looks (just a quick skim), that skipfooter is more consistent (as we have skiprows, usecols) etc. excel should be fixed to be the same as well.

@gfyoung

Member

gfyoung commented Jun 7, 2016

@jreback : let's actually do the count of multi-word argument names (excluding the skip(_)footer ones) using the documentation as of 0.18.1 here:

Arguments that use _:
filepath_or_buffer
index_col
mangle_dupe_cols
true_values
false_values
na_values
keep_default_na
na_filter
skip_blank_lines
error_bad_lines
warn_bad_lines
delim_whitespace
as_recarray
compact_ints
use_unsigned
low_memory
buffer_lines
memory_map
float_precision

Arguments that don't use _:
usecols
skipinitialspace
skiprows
chunksize
lineterminator
doublequote
quotechar
escapechar

Even excluding ones that have been recently deprecated (compact_ints, buffer_lines, use_unsigned, as_recarray), arguments that use _ vastly outnumber those that don't.
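
The tally above can be reproduced with a few lines of Python. This is only a sketch: the two lists are copied by hand from this comment rather than introspected from pandas, because an underscore test alone cannot tell a run-together multi-word name such as `skiprows` from a genuinely single-word name such as `sep`. The variable names are illustrative.

```python
# Multi-word read_csv argument names, copied from the two lists above.
with_underscore = [
    "filepath_or_buffer", "index_col", "mangle_dupe_cols", "true_values",
    "false_values", "na_values", "keep_default_na", "na_filter",
    "skip_blank_lines", "error_bad_lines", "warn_bad_lines",
    "delim_whitespace", "as_recarray", "compact_ints", "use_unsigned",
    "low_memory", "buffer_lines", "memory_map", "float_precision",
]
without_underscore = [
    "usecols", "skipinitialspace", "skiprows", "chunksize",
    "lineterminator", "doublequote", "quotechar", "escapechar",
]
# Arguments named in this comment as recently deprecated.
recently_deprecated = {"compact_ints", "buffer_lines", "use_unsigned", "as_recarray"}

still_supported = [n for n in with_underscore if n not in recently_deprecated]

print(len(with_underscore), len(without_underscore))  # 19 8
print(len(still_supported), len(without_underscore))  # 15 8
```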

@jreback

Contributor

jreback commented Jun 7, 2016

@gfyoung I am trying to minimize back-compat pain.

usecols and skiprows are prob some of the most used options. I agree we should just deprecate and move to _ separated, but that would be in 0.19.0 if we go whole hog.

@codecov-io

codecov-io commented Jun 7, 2016

Current coverage is 85.23% (diff: 100%)

Merging #13386 into master will increase coverage by <.01%

@@             master     #13386   diff @@
==========================================
  Files           140        140          
  Lines         50420      50419     -1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          42975      42975          
+ Misses         7445       7444     -1   
  Partials          0          0          

Powered by Codecov. Last update cc216ad...d21345f

@gfyoung

Member

gfyoung commented Jun 7, 2016

@jreback : I have no issues waiting until 0.19.0 and going full hog. This duplicate argument is not breaking, and it will then give us the time to properly deprecate and rename arguments with _.

@jreback jreback added this to the 0.19.0 milestone Jun 7, 2016

@gfyoung gfyoung changed the title from DEPR: deprecate skipfooter in read_csv to API: Use underscore in read_* arguments Jun 8, 2016

@gfyoung gfyoung changed the title from API: Use underscore in read_* arguments to API: Use underscore in read_csv arguments Jun 9, 2016

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0, 0.20.0 Jul 8, 2016

@gfyoung

Member

gfyoung commented Jul 9, 2016

Rebased onto master and Travis is passing. I see that @jorisvandenbossche has added the "Needs Discussion" tag, so comments on the deprecation and of what to deprecate are welcome!

@jreback

doc/source/io.rst Outdated
@@ -282,16 +285,16 @@ float_precision : string, default None
Specifies which converter the C engine should use for floating-point values.
The options are ``None`` for the ordinary converter, ``high`` for the
high-precision converter, and ``round_trip`` for the round-trip converter.
lineterminator : str (length 1), default ``None``

@jreback

jreback Jul 10, 2016

Contributor

so anything you are actually deprecating (which is pretty much everything that is changed) needs to have either a separate entry for a doc-string (like you did for skip_footer), or prob better, show the doc-string for skip_footer and then at the end put the DEPRECATION OF skipfooter or somesuch.

Further, need a BIG BIG warning in v0.19.0 (use .. warning::).

@gfyoung

gfyoung Jul 10, 2016

Member

Can you illustrate with an example? I don't quite understand what you're saying here or exactly to implement in this case.

@jorisvandenbossche

jorisvandenbossche Jul 13, 2016

Member

I would add a note (.. note::) somewhere on this page (maybe at the end of the enumeration of all the options), explaining that for some keywords aliases without underscores are accepted for historical reasons, but that it is recommended to use the version with underscores.

@gfyoung

gfyoung Jul 14, 2016

Member

@jorisvandenbossche : I'm not sure how viable that is given that the enumeration is split into different sections.

@jreback

doc/source/whatsnew/v0.19.0.txt Outdated
@@ -439,6 +439,17 @@ Deprecations
- ``buffer_lines`` has been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13360`)
- ``as_recarray`` has been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13373`)
- top-level ``pd.ordered_merge()`` has been renamed to ``pd.merge_ordered()`` and the original name will be removed in a future version (:issue:`13358`)
- The ``chunksize``, ``dayfirst``, ``doublequote``, ``escapechar``, ``lineterminator``, ``quotechar``, ``skipinitialspace``,

@jreback

jreback Jul 10, 2016

Contributor

make this a separate section. add to the highlights (single line with reference to this section).

@gfyoung

gfyoung Jul 11, 2016

Member

Made a separate section, but need help on the table part.

@jreback

doc/source/whatsnew/v0.19.0.txt Outdated
- The ``chunksize``, ``dayfirst``, ``doublequote``, ``escapechar``, ``lineterminator``, ``quotechar``, ``skipinitialspace``,
``skipfooter``, ``skiprows``, and ``usecols`` arguments in ``pd.read_csv()`` have been deprecated in favor of their
underscore counterparts ``chunk_size``, ``day_first``, ``double_quote``, ``escape_char``, ``line_terminator``, ``quote_char``,
``skip_initial_space``, ``skip_footer``, ``skip_rows``, and ``use_cols`` arguments respectively (:issue:`13386`)

@jreback

jreback Jul 10, 2016

Contributor

make this a mini-table for easier viewing.

@gfyoung

gfyoung Jul 11, 2016

Member

How do I make a table in these files?

@jreback

pandas/io/parsers.py Outdated
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
skipfooter : int, default 0
skip_footer : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')

@jreback

jreback Jul 10, 2016

Contributor

whatever you do in the docs needs to be the same here

@gfyoung

gfyoung Jul 10, 2016

Member

Can you clarify? The reason why skipfooter is slightly different from the rest is because there were already duplicate args for these two, whereas the others don't.

@jreback

Contributor

jreback commented Jul 10, 2016

this looks good to me. need to have a nice way of communicating the deprecations in the doc-string. typically we add an additional entry for the new one and make the original kwarg have a DEPRECATED in bold. Here we need to do something different as so many args.

maybe just list the new ones, but have a sentence that says the original arg is deprecated.

AND have a section before the arguments start that says, hey all of these are deprecated.

finally, need to create an issue (maybe just a single one), to have read_excel, read_html, read_hdf made consistent with this (I don't think any others, but should check). e.g. chunksize for sure is one.

@jreback

Contributor

jreback commented Jul 10, 2016

also have issue #4988 which is kind of what I am suggesting for above (e.g. its about read_excel); so maybe repurpose that one to be the non-csv-parser but same kw-in-need-of-deprecation issue.

@gfyoung

Member

gfyoung commented Jul 10, 2016

@jreback : Agreed about read_excel (and the others that you mentioned) - that has a few arguments that could be changed, though best saved for another PR. I was planning on looking at those after working out the kinks with this one here. The ones to check I think are:

read_excel
read_html
read_sql
read_hdf
read_stata

@gfyoung

doc/source/whatsnew/v0.19.0.txt Outdated
All multi-word arguments to ``pd.read_csv()`` will now have their individual words separated by an underscore, and their
underscore-less versions have now been deprecated (:issue:`13386`). They are:
- The ``chunksize``, ``dayfirst``, ``doublequote``, ``escapechar``, ``lineterminator``, ``quotechar``, ``skipinitialspace``,

@gfyoung

gfyoung Jul 11, 2016

Member

@jreback : GitHub unfortunately hid away the other comment about the table, so I'm "refreshing it." Could you explain how to create tables in these files?

@jreback

jreback Jul 11, 2016

Contributor

look in 0.18.0 whatsnew - there is a table about msgpack

@gfyoung

gfyoung Jul 11, 2016

Member

For future reference, the example you are referring to is here. Ah, got it. Done.

@jorisvandenbossche

Member

jorisvandenbossche commented Jul 11, 2016

The reason I added "Needs discussion" is because I am not sure if I find this a good idea. read_csv is one of the most used pandas functions, and causing thousands/millions (?) of users to have to change their code just for 'aesthetic'* reasons, not sure this is worth the trouble.

* I know this is not only aesthetic, because having a consistent API (eg always underscores) also makes it easier to learn, to remember, etc .. And that is very important, but sometimes 'historical' reasons are good reasons.

@jorisvandenbossche

Member

jorisvandenbossche commented Jul 11, 2016

On the specific keywords:

  • dayfirst is also used in to_datetime, Timestamp, and comes from dateutil.parser.parse, so I would leave this as is.
  • chunksize is also used a lot in other places (read_hdf, read_sql, to_sql, ..), so I would also be more hesitant to change this
@jreback

Contributor

jreback commented Jul 11, 2016

an alternative is simply to accept aliases, e.g. skiprows & skip_rows. This only adds a very tiny amount of code (in getting kwargs). and 'solves' the problem pretty easily.
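
The alias idea can be sketched in plain Python. This is a minimal illustration, not pandas code: the `ALIASES` table and the `resolve_aliases` helper are hypothetical names, and the underscore spellings are the ones proposed at this point in the PR.

```python
# Hypothetical alias table: underscore-less spelling -> underscore spelling.
ALIASES = {
    "skiprows": "skip_rows",
    "skipfooter": "skip_footer",
    "usecols": "use_cols",
}


def resolve_aliases(kwargs, aliases=ALIASES):
    """Return a copy of kwargs with every alias mapped to its canonical name."""
    resolved = {}
    for key, value in kwargs.items():
        canonical = aliases.get(key, key)
        if canonical in resolved:
            # Both spellings were passed; refuse to guess which one wins.
            raise TypeError(f"got both spellings of {canonical!r}")
        resolved[canonical] = value
    return resolved


print(resolve_aliases({"skiprows": 2, "sep": ","}))
```

Rejecting a call that passes both spellings is a design choice made here for illustration; silently preferring one spelling would be another option.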

@jorisvandenbossche

Member

jorisvandenbossche commented Jul 11, 2016

Yes, you could see it as a "deprecation by documentation": documenting the recommended way, but not actually deprecating (and maybe a small note the docstring that certain list of keywords are also accepted for historical reasons).
This option of course also has some overhead and disadvantages (possible confusion when users see code using the 'old' args)

@gfyoung

Member

gfyoung commented Jul 13, 2016

@jorisvandenbossche :

chunksize is one that @jreback and I discussed earlier in the conversation as converting to chunk_size (note the list that I made here).

As for dayfirst, I do realize that there are some inconsistencies in deprecating it, especially since datetime uses that parameter. However, a consistent API is useful (i.e. all underscores), so similar to what @jreback is suggesting, we could just document only the recommended parameters but still accept the old parameters with DeprecationWarning and message about the recommended parameters (as I do now). However, I can add a TODO or comment saying that we should NOT remove support for the old parameters for some time.

How does that sound?

@gfyoung

Member

gfyoung commented Jul 13, 2016

@jorisvandenbossche

doc/source/io.rst Outdated
Number of lines at bottom of file to skip (unsupported with engine='c').
skipfooter : int, default ``None``
DEPRECATED: this argument will be removed in a future version. Please
use the ``skip_footer`` parameter instead.

@jorisvandenbossche

jorisvandenbossche Jul 13, 2016

Member

Why did you add this one? (given you did not add it for the others, which is good I think)

@gfyoung

gfyoung Jul 13, 2016

Member

skipfooter and skip_footer were already in the signature before this PR, hence the necessity to document both. Unless you think that is unnecessary?

@jorisvandenbossche

jorisvandenbossche Jul 13, 2016

Member

Personally I don't think that is a reason to do differently for this one (in the end, for all the others, the non-underscore versions were also in the signature)

@gfyoung

gfyoung Jul 13, 2016

Member

I suppose if we're going to keep accepting both versions for ALL of them, that is a fair point. Will remove.

@jorisvandenbossche

pandas/io/parsers.py Outdated
Number of lines at bottom of file to skip (Unsupported with engine='c')
skipfooter : int, default None
DEPRECATED: this argument will be removed in a future version. Please
use the `skip_footer` parameter instead.

@jorisvandenbossche

jorisvandenbossche Jul 13, 2016

Member

Same comment here: why add it for skipfooter?

In the docstring, I would add a similar note about other keyword arguments without underscores being accepted (can be the same as in io.rst) in the 'Notes' section.

@gfyoung

gfyoung Jul 13, 2016

Member

See my response above.

@jorisvandenbossche

Member

jorisvandenbossche commented Jul 13, 2016

> but I agree with @jorisvandenbossche I don't think we should deprecate these keywords. It's really not a big deal to simply accept both in the constructors (and document it in the doc-strings).

Yes that is how I see it as well.

> kind of like a PendingDeprecationWarning.

So then just let's use one :-) But the question is maybe: are we ever going to actually deprecate them? If not, we maybe don't need to use any warning (but also in that case you could see the PendingDeprecationWarning as a kind of NotRecommendedKeywordPleaseUseOtherWarning :-), so maybe no problem with having that).
In any case, I would not use a FutureWarning

> As for dayfirst, I do realize that there are some inconsistencies in deprecating it, especially since datetime uses that parameter. However, a consistent API is useful.

-1 on changing the dayfirst kwarg. An inconsistent naming scheme within a function is not good, but having inconsistent names for the same thing between functions is even worse IMO. So either we should rename it everywhere, or keep it here as dayfirst.
But personally my preference is just to keep it as dayfirst (given that it corresponds with a dateutil kwarg). Having one keyword that deviates from the naming scheme is not that big a deal I think.

> chunksize is also used a lot in other places (read_hdf, read_sql, to_sql, ..), so I would also be more hesitant to change this

> chunksize is one that @jreback and I discussed earlier in the conversation as converting to chunk_size.

Maybe it is my Dutch mother tongue that I find chunksize good as one word :-) (in Dutch, such a compound word would be one word instead of two separate words)

Since this is a rather big change (regarding user impact), I think it is good to have some more eyes on it. cc @TomAugspurger @sinhrks @shoyer @wesm @chris-b1 If you have any objections/opinions on the general idea, time to raise them.
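
The warning classes weighed in this comment differ mainly in who sees them by default: Python's standard filters ignore PendingDeprecationWarning, show DeprecationWarning only when it is triggered by code in `__main__`, and always show FutureWarning. A small sketch of making the category a parameter follows; `warn_old_keyword` and `capture` are hypothetical helpers, not pandas functions.

```python
import warnings


def warn_old_keyword(old, new, category=PendingDeprecationWarning):
    """Emit a warning steering callers from `old` to `new` (hypothetical helper)."""
    warnings.warn(
        f"the {old!r} keyword is not recommended; use {new!r} instead",
        category,
        stacklevel=2,
    )


def capture(category):
    """Return the categories of warnings raised for one old-keyword use."""
    # Record everything, regardless of the interpreter's default filters.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        warn_old_keyword("skiprows", "skip_rows", category)
    return [w.category for w in caught]


for category in (PendingDeprecationWarning, DeprecationWarning, FutureWarning):
    print(capture(category))
```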

@jreback

Contributor

jreback commented Jul 13, 2016

I would say leave chunksize and dayfirst alone as joris indicated

@shoyer

Member

shoyer commented Jul 14, 2016

I would lean against this change -- I don't think deprecating or even adding aliases with underscores for these keywords is worth it. Yes, consistency is nice, but these are keywords that are mostly either (a) copied straight from the API docs or an example or (b) auto-completed. Omitting underscores between lower case words is also common enough in PEP8 compliant code that it doesn't hurt my eyes in the way that camelCase would, for example. I would save changes like this for next time somebody writes a DataFrame library ;). There's just very little upside to the change at this point.

@chris-b1

Contributor

chris-b1 commented Jul 14, 2016

I'd also be inclined to not do this.

Consistency is great, but to me, big breaking (or warning triggering) changes for consistency are only worth it to the extent they actually improve the user experience. E.g. resample, the new sort_... api, or a closer example, when the inconsistency between the rows and index keyword was resolved in pivot_table / pivot - #5505

This doesn't seem to meet that hurdle. Like @shoyer said, these arguments would typically be tab-completed or looked up, or even if not, the inconsistency doesn't cause any confusion.

@gfyoung

Member

gfyoung commented Jul 14, 2016

To put things in perspective, this whole discussion was generated because I wanted to remove the duplicate 'skipfooter/skip_footer' argument in the signature. It really should only be one of the two IMO.

The consistency part came up because we weren't sure which way to go, i.e. do we take the underscore version or the non-underscore one?

I perfectly understand if we don't want to make such a change as drastic as it has become, though some input on the initial goal would be good then.

@chris-b1

Contributor

chris-b1 commented Jul 14, 2016

I don't have a strong opinion on that, but given that all the docs / SO answers / etc in the wild use skipfooter, I'd think deprecating skip_footer makes more sense?

@gfyoung gfyoung changed the title from API: Use underscore in read_csv arguments to API: Deprecate skip_footer in read_csv Jul 15, 2016

@gfyoung

Member

gfyoung commented Jul 15, 2016

In light of the push back, I will rollback my changes and just deprecate skip_footer as @chris-b1 indicated. If skipfooter is being used in the wild, let's stick with that one then.
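
The behaviour settled on here (keep skipfooter, deprecate skip_footer) can be sketched with a toy function. This is a pure-Python stand-in, not the merged pandas change: `read_lines` is hypothetical, and the warning class and message are assumptions rather than text taken from the commit.

```python
import warnings


def read_lines(lines, skipfooter=0, skip_footer=0):
    """Toy reader: drop trailing lines, still accepting the deprecated spelling."""
    if skip_footer:
        # Assumed warning class and wording, for illustration only.
        warnings.warn(
            "The 'skip_footer' argument is deprecated; use 'skipfooter' instead",
            FutureWarning,
            stacklevel=2,
        )
        if skipfooter:
            raise ValueError("pass only one of 'skipfooter' and 'skip_footer'")
        skipfooter = skip_footer
    # Keep everything except the last `skipfooter` lines.
    return list(lines[: max(len(lines) - skipfooter, 0)])


rows = ["a,b", "1,2", "3,4", "total,6"]
print(read_lines(rows, skipfooter=1))
```

Calling `read_lines(rows, skip_footer=1)` returns the same rows but also emits the warning, so existing callers keep working while being pointed at the preferred spelling.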

@gfyoung

Member

gfyoung commented Jul 15, 2016

@jreback , @jorisvandenbossche : Rolled back my changes successfully (i.e. Travis is passing) to just deprecating skip_footer for the reason that @chris-b1 outlined above. Ready to merge if there are no other concerns.

@jreback

Contributor

jreback commented Jul 15, 2016

@gfyoung

Member

gfyoung commented Jul 19, 2016

@jorisvandenbossche : any updates on this?

@jorisvandenbossche

Member

jorisvandenbossche commented Jul 19, 2016

What do we do for the other read_ functions in the skipfooter case?

@gfyoung

Member

gfyoung commented Jul 19, 2016

Same as here pre-PR (i.e. it just accepts both of them). However, I will handle all other cases similarly if this one is merged.

@gfyoung

Member

gfyoung commented Jul 26, 2016

@jorisvandenbossche : Any updates on this?

@@ -1351,7 +1351,7 @@ back to python if C-unsupported options are specified. Currently, C-unsupported
options include:
- ``sep`` other than a single character (e.g. regex separators)
- ``skip_footer``
- ``skipfooter``

@jreback

jreback Jul 27, 2016

Contributor

did the doc-string get changed (to add DEPRECATED)?

@gfyoung

gfyoung Jul 28, 2016

Member

Hmmm...not sure where those changes went. Added them back.

@gfyoung

Member

gfyoung commented Jul 28, 2016

@jreback : added the documentation back about the deprecation, and Travis is still happy after rebase. Ready to merge if there are no other concerns.

@jsexauer jsexauer referenced this pull request Jul 29, 2016

Open

DEPR: deprecations from prior versions #6581

0 of 85 tasks complete

@jreback jreback closed this in aa88215 Jul 29, 2016

@jreback

Contributor

jreback commented Jul 29, 2016

thanks @gfyoung

@gfyoung gfyoung deleted the forking-repos:deprecate-dup-skipfooter branch Jul 29, 2016

gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 11, 2017

@jreback jreback referenced this pull request Dec 11, 2017

Open

DEPR: deprecations log for removed issues #13777

115 of 115 tasks complete

gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 11, 2017

gfyoung added a commit that referenced this pull request Dec 11, 2017
