Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Deprecate compact_ints and use_unsigned in read_csv #13323

Closed

Conversation

Projects
None yet
4 participants
@gfyoung
Copy link
Member

commented May 30, 2016

Title is self-explanatory.

xref #12686 - I don't quite understand why these are marked (if at all) as internal to the C engine only, as the benefits for having these options accepted for the Python engine is quite clear based on the documentation I added as well.

Implementation simply just calls the already-written function in pandas/parsers.pyx - as it isn't specific to the TextReader class, crossing over to grab this function from Cython (instead of duplicating in pure Python) seems reasonable while maintaining that separation between the C and Python engines.

@gfyoung gfyoung force-pushed the forking-repos:python-engine-compact-ints branch 3 times, most recently May 30, 2016

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 30, 2016

FYI: the test_compact_ints test that I deleted was a duplicate of test_compact_ints_as_recarray and also did not correctly reflect on the situation of the test data.

@codecov-io

This comment has been minimized.

Copy link

commented May 30, 2016

Current coverage is 84.22%

Merging #13323 into master will decrease coverage by <.01%

@@             master     #13323   diff @@
==========================================
  Files           138        138          
  Lines         50713      50671    -42   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          42715      42676    -39   
+ Misses         7998       7995     -3   
  Partials          0          0          

Powered by Codecov. Last updated by 99e78da...0d56c09

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 30, 2016

This is fine to document these. In reality the parser should not be doing this, and these options just add to an already huge API. Better to have this an an option to to_numeric() or something. Or the user can simply pass in dtypes as appropriate.

I get that this makes things more 'automatic'.

So I think that we should deprecate these for now.

@jorisvandenbossche ?

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 30, 2016

@jreback : No issue with simplifying the API, though it did seem like it was a somewhat decent memory-conscientious option. If it is deprecated, telling them to refer to dtype might make more sense because I don't see where we would be going with adding to these to to_numeric, though that would be more inconvenient if there was a large table of small integers.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 30, 2016

@gfyoung you could simply add them as options to to_numeric, which is basically a smart .astype.
(that handles errors and such).

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 30, 2016

How would they be toggled from the parser API exactly?

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 30, 2016

@gfyoung they wouldn't that why we deprecate :>

though you could very easily. Move these to a post-processing step and simply call something like

if compact_ints:
    x = pd.to_numeric(x, errors='ignore', compact_ints=compact_ints, use_unsigned=use_unsigned)

and simply move the code downcast_int64 to a call inside pd.to_numeric. (and move the code to inference.pyx just for cleanliness).

This is indepedent of deprecation or not though. (e.g. you can fix this and can separately deprecate).

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 30, 2016

Ah, I see. Will do that then.

@gfyoung gfyoung force-pushed the forking-repos:python-engine-compact-ints branch May 30, 2016

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 30, 2016

Actually, it's not as straightforward as it seems. parser.pyx creates its own list of na_values (see here) that it uses when casting inside _downcast_to_int64, and that list of values is used in other functions within the file. If I move it to inference.pyx to be called via to_numeric (which I'm no longer sure about since that is a further API change), I would have to duplicate the code that creates na_values - how does the refactoring sound now?

@gfyoung gfyoung force-pushed the forking-repos:python-engine-compact-ints branch May 30, 2016

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 30, 2016

you can just do this is maybe_convert_numeric or equivalent

@gfyoung gfyoung force-pushed the forking-repos:python-engine-compact-ints branch 2 times, most recently May 31, 2016

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 31, 2016

@jreback : Refactored downcast_to_int64 as a function in inference.pyx, and Travis is happy. So just waiting to see what @jorisvandenbossche has to say about deprecation.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 31, 2016

I don't think this works if the arr has any na_values in it
(not tested either)

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 31, 2016

@jreback : I literally moved the function from one location to another. If that's an issue, then it would have appeared in the parser as well. Will add a test though for that.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 31, 2016

@gfyoung no, if its by efinition int64 coming in, then na_values are superfulous here. I know you copied it. that's not to say it was completely correct before.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 31, 2016

@jreback : na_values is actually used in several places in the implementation, so it's not as superfluous as you make it sound (unless you would like to hard-code values). Also, the function does work when the na_value for np.int64 is included in the array. Added test to make sure.

@gfyoung gfyoung force-pushed the forking-repos:python-engine-compact-ints branch 2 times, most recently May 31, 2016

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 31, 2016

@gfyoung you are missing my point. int arrays don't support missing values (currently), so if you actually have a missing value then this must be converted to float which is impossible in the current impl. so they ARE superfluous.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 31, 2016

@jreback : That's not the point of this function at all. If it has missing values (i.e.nan), the array won't go through this path. That na_values parameter is hardly superfluous if you look at what it actually has stored in it.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 31, 2016

@gfyoung then prove it by showing an example

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented May 31, 2016

@jreback : I think my test cases speak for themselves.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 31, 2016

it's kind of like a fancy astype

@gfyoung gfyoung force-pushed the forking-repos:python-engine-compact-ints branch May 31, 2016

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Jun 1, 2016

Hmm...not sure why my test for the deprecation is failing on just one machine. What am I doing wrong here? It couldn't be this difficult to add, should it?

@jorisvandenbossche

View changes

doc/source/whatsnew/v0.18.2.txt Outdated
@@ -76,6 +76,7 @@ Other enhancements

- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``decimal`` option (:issue:`12933`)
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``na_filter`` option (:issue:`13321`)
- The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``compact_ints`` and ``use_unsigned`` options (:issue:`13323`)

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Jun 1, 2016

Member

It's a bit strange to announce this as a new feature, while it has been deprecated at the same time ...

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 1, 2016

Author Member

That is a fair point, but I did add support though. The exact direction of this PR is still somewhat in flux, and I'll adjust the whatsnew once things have settled down a bit. More importantly, I would like to figure out why Travis is unhappy with me at the moment, as the tests pass on my machine.

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Jun 1, 2016

Member

yes, certainly fine to change things here in the end (and not sure why travis is failing, I also do not see something obvious)

This comment has been minimized.

Copy link
@jreback

jreback Jun 2, 2016

Contributor

hmm, I agree here. Since we are deprecating this, I think that no need to announce it for python engine support.

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 2, 2016

Author Member

Fair enough. Will change as soon as I placate the Travis.

@jorisvandenbossche

View changes

pandas/io/parsers.py Outdated
the parser will attempt to cast it as the smallest integer dtype possible,
either signed or unsigned depending on the specification from the
`use_unsigned` parameter. Selecting this option can help save memory if
there are a lot of relatively small integers in the resulting DataFrame.

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Jun 1, 2016

Member

Question: is this actually true? (the "can help save memory") In the sense that: are the values from the csv file first parsed as int64's and then downcasted? Because in that case, there is no real memory saving for reading the csv file, but of course the resulting frame will be smaller

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 1, 2016

Author Member

You are correct. The reading parses as int64 and then does downcasting, so you aren't saving memory during reading but the resulting frame will be smaller, which does provide a memory boost still, right?

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Jun 1, 2016

Member

But not in consumed memory (which is eg a problem if you are reading in a very large csv file that would need too much memory to read in, although the result may fit in memory), and therefore I find the current wording a bit misleading.

But also see my comment below on the docstring before we further discuss this.

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 1, 2016

Author Member

Hmmm...that is a fair point. Perhaps I'll just take that out then because trying to clarify the correct meaning would make things overly complicated I'd imagine.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Jun 1, 2016

@jreback : any thoughts on why Travis is unhappy on one machine? Seems like me and @jorisvandenbossche are stumped at the moment

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jun 1, 2016

Regarding the deprecation, I would be fine with that (I don't think many people already have heard of these keywords (in any case, I don't), although it would be nice if we could check this somewhere), but, it would be nice if there is then an alternative available (like the pd.to_numeric idea).

I am only wondering if it is needed to add the documentation to the docstring when we are deprecating it at the same time (so we don't actually want to encourage people to use it, and the docstring is already really long)

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Jun 1, 2016

@jorisvandenbossche : yeah, me neither. I hadn't really heard of it until I started looking into what it did in the CParser. Surprising just how many options are undocumented for a function that works quite well for most everyday purposes such as my own. 😄

The pd.to_numeric idea will need some work to flesh out, but I don't think that it is appropriate for this PR as @jreback has already mentioned. I think the doc-string is still useful because comprehensive documentation is always a good idea IMO just so that there is no confusion about all of these options. Otherwise, how would you know that it was deprecated?

@jreback

View changes

pandas/io/tests/parser/common.py Outdated
})

# default behaviour for 'use_unsigned'
out = self.read_csv(StringIO(data), compact_ints=True)

This comment has been minimized.

Copy link
@jreback

jreback Jun 1, 2016

Contributor

every one of these needs a tm.assert_produces_warning block. The warning on travis is because of below when you DO try to catch the warning it has already been triggered (and hence doesn't show). These are somewhat hard to isolated. Usually just keep running smaller and smaller blocks of code to figure out where it is triggering the uncaught warning, then inspect the code.

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 2, 2016

Author Member

Ah! I did not know that. Fixed.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2016

I am ok with doc-stringing this (always helpful) and deprecating is fine with me. Whenever we add this to pd.to_numeric (which I think is best place). Then can simply update the deprecation warnings to recommend a new usage (even if its done later). maybe make a placeholder issue for that enhancement.

@gfyoung gfyoung force-pushed the forking-repos:python-engine-compact-ints branch Jun 2, 2016

@jsexauer jsexauer referenced this pull request Jun 2, 2016

Open

DEPR: deprecations from prior versions #6581

0 of 98 tasks complete

@gfyoung gfyoung force-pushed the forking-repos:python-engine-compact-ints branch to 95f7ba8 Jun 2, 2016

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Jun 2, 2016

@jreback @jorisvandenbossche : Travis is happy now, and I made the requested doc changes. Ready to merge if there are no other concerns.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jun 2, 2016

It still feels a bit strange to add (to the python parser) and doc a feature that we are deprecating directly :-), but OK to merge for me

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 2, 2016

@jorisvandenbossche though this will prob be deprecated for a while...

@jreback jreback added this to the 0.18.2 milestone Jun 2, 2016

@jreback jreback closed this in 0c6226c Jun 2, 2016

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 2, 2016

@gfyoung ty, pls also create an issue to add to pd.to_numeric

@gfyoung gfyoung deleted the forking-repos:python-engine-compact-ints branch Jun 2, 2016

@gfyoung gfyoung changed the title ENH: Add support for compact_ints and use_unsigned in Python engine API: Deprecate compact_ints and use_unsigned in read_csv Jun 2, 2016

gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 19, 2017

@jreback jreback referenced this pull request Dec 19, 2017

Open

DEPR: deprecations log for removed issues #13777

116 of 116 tasks complete

gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 20, 2017

jreback added a commit that referenced this pull request Dec 21, 2017

CLN: Drop compact_ints/use_unsigned from read_csv (#18851)
* CLN: Drop compact_ints/use_unsigned from read_csv

Deprecated in v0.19.0

xref gh-13323

* CLN: Remove downcast_int64 from inference.pyx

It was only being used for the compact_ints
and use_unsigned parameters in read_csv.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.