Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH:Add EA types to read CSV #23255

Merged
merged 22 commits into from Jan 2, 2019
Merged

Conversation

kprestel
Copy link
Contributor

@kprestel kprestel commented Oct 20, 2018

Closes GH23228

@pep8speaks
Copy link

pep8speaks commented Oct 20, 2018

Hello @kprestel! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 02, 2019 at 02:01 Hours UTC

pandas/_libs/parsers.pyx Outdated Show resolved Hide resolved
@jreback jreback added IO CSV read_csv, to_csv ExtensionArray Extending pandas with custom dtypes or arrays. labels Nov 1, 2018
pandas/_libs/parsers.pyx Outdated Show resolved Hide resolved
pandas/tests/io/parser/common.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor

Where exactly in the chain of pandas' normal parse, infer, astype, chain does this fall? Is it at the very end?

This seems fragile... Consider a person writing a DecimalArray extension type. Would the DecimalDtype._from_sequence be given a list of strings? Would pandas have already converted the numeric-like values to floats, thus losing precision? We'll need to better document exactly when this is called.

@jorisvandenbossche
Copy link
Member

If we want to support this in general for EAs, I think we need to add a new method to the interface to parse from strings. _from_sequence should not be expected to be able to do that.
(and if the EA does not implement it, we can raise in read_csv saying this dtype does not support it)

@jreback
Copy link
Contributor

jreback commented Nov 9, 2018

see #23595 maybe _from_sequence_of_strings ? or a more general function

@kprestel
Copy link
Contributor Author

@TomAugspurger I'm nowhere near an expert in pandas but from what I can tell the casting happens at the very end. The infer/astype chain seems to start in parsers.pyx self._convert_tokens where self == TextReader. From just reading the code I don't think there is a risk of losing precision as you stated, although it should be tested.

@jorisvandenbossche Are you suggesting that we add something like _from_sequence_of_strings to the EA interface as @jreback suggested? I can try this and then we can see what it would look like.

@jreback Can you explain a little bit more about how to create a new test module? I'm not exactly sure what pattern to follow. I'm guessing I need to create a test class and inherit from some base test class but I can't figure out which. Just putting my test in the new io.py as you suggested isn't enough because pytest doesn't pick it up. Sorry if this is a dumb question, I just haven't worked with a test suite like this before.

Thanks everyone for their feedback!

@TomAugspurger
Copy link
Contributor

From just reading the code I don't think there is a risk of losing precision as you stated, although it should be tested.

As a quick bit of debugging, you could drop a print(type(scalars[0])) in DecimalArray._from_sequence when you refactor to use that. Though I think @jorisvandenbossche's suggestion of adding a dedicated method to API should be followed.

We should also think about this in the context of #20612. There will be common patterns for every type of IO we want to add support for (no concrete suggestions right now though).

@kprestel
Copy link
Contributor Author

I think whatever this change ends up being it will close #20612. In the original issue (#23228), it is noted that read_csv is broken because it will return object as the dtype if the specified dtype is an EA. I believe this is due to the _get_dtype_type function in dtypes/common.py.

Basically when you pass a EA to the dtype dict in read_csv it gets treated as a type. Which means _get_dtype_type turns it into a np.object.

It seems like that needs to be fixed before anything.

@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

@kprestel pls merge master.

@codecov
Copy link

codecov bot commented Dec 1, 2018

Codecov Report

Merging #23255 into master will increase coverage by <.01%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23255      +/-   ##
==========================================
+ Coverage   42.46%   42.46%   +<.01%     
==========================================
  Files         161      161              
  Lines       51557    51592      +35     
==========================================
+ Hits        21892    21908      +16     
- Misses      29665    29684      +19
Flag Coverage Δ
#single 42.46% <50%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 47.63% <0%> (-0.79%) ⬇️
pandas/io/parsers.py 48.54% <100%> (ø) ⬆️
pandas/core/series.py 50.6% <100%> (-0.15%) ⬇️
pandas/core/arrays/base.py 50.98% <100%> (+0.64%) ⬆️
pandas/core/arrays/integer.py 38.27% <50%> (+0.29%) ⬆️
pandas/core/dtypes/common.py 69.21% <60%> (-0.69%) ⬇️
pandas/core/arrays/categorical.py 42.01% <0%> (+0.12%) ⬆️
pandas/core/dtypes/dtypes.py 76.68% <0%> (+0.51%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b0610b...3e5ec56. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 1, 2018

Codecov Report

Merging #23255 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23255      +/-   ##
==========================================
+ Coverage    92.3%   92.32%   +0.01%     
==========================================
  Files         166      166              
  Lines       52412    52454      +42     
==========================================
+ Hits        48381    48430      +49     
+ Misses       4031     4024       -7
Flag Coverage Δ
#multiple 90.75% <100%> (+0.02%) ⬆️
#single 43.01% <52.94%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/base.py 98.25% <100%> (+0.02%) ⬆️
pandas/core/arrays/integer.py 96.29% <100%> (+0.06%) ⬆️
pandas/io/parsers.py 95.4% <100%> (+0.02%) ⬆️
pandas/core/arrays/timedeltas.py 87.42% <0%> (-0.26%) ⬇️
pandas/core/reshape/reshape.py 99.35% <0%> (-0.22%) ⬇️
pandas/core/indexes/timedeltas.py 90.47% <0%> (-0.19%) ⬇️
pandas/util/testing.py 87.59% <0%> (-0.17%) ⬇️
pandas/core/series.py 93.64% <0%> (-0.1%) ⬇️
pandas/core/indexes/period.py 92.37% <0%> (-0.09%) ⬇️
pandas/core/indexes/datetimes.py 96.3% <0%> (-0.03%) ⬇️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 681e75c...f42235a. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

can you merge master

@kprestel
Copy link
Contributor Author

kprestel commented Dec 8, 2018

@jreback Done. Sorry about the delay.

@kprestel
Copy link
Contributor Author

kprestel commented Dec 9, 2018

I think the implementation works and is ready for a full review. I may need a bit of guidance on the testing part though. I've tested it locally and it works but I'm not sure how to test it "more generally" following existing patterns. Any guidance on that aspect would be greatly appreciated.

Thanks for all the help/time given to me so far!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are having WAY too many catching of dtypes. Rip all of that out and just leave the code in parser.pyx

doc/source/whatsnew/v0.24.0.rst Outdated Show resolved Hide resolved
pandas/_libs/parsers.pyx Outdated Show resolved Hide resolved
pandas/_libs/parsers.pyx Outdated Show resolved Hide resolved
pandas/_libs/parsers.pyx Outdated Show resolved Hide resolved
pandas/_libs/parsers.pyx Outdated Show resolved Hide resolved
pandas/core/arrays/base.py Show resolved Hide resolved
pandas/core/arrays/integer.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/dtypes/common.py Outdated Show resolved Hide resolved
pandas/tests/extension/base/io.py Outdated Show resolved Hide resolved
@kprestel
Copy link
Contributor Author

kprestel commented Dec 9, 2018

@jreback You're right and not 20 minutes after I pushed it did I think about that... I was too focused on the use case you described in the original issue, however if I just rip out everything that isn't in parsers.pyx I don't think it would work at all. If it did it wouldn't work the python parser for sure.

I will reply to your comments and answer your questions.

Thanks for the review. I really appreciate it.

@jreback
Copy link
Contributor

jreback commented Dec 9, 2018

@kprestel for > #23255 (comment)

right, the python parser will need some small fixes as well. but let's start minimally.

pandas/core/arrays/base.py Show resolved Hide resolved
pandas/core/arrays/base.py Show resolved Hide resolved
pandas/_libs/parsers.pyx Outdated Show resolved Hide resolved
pandas/core/arrays/integer.py Outdated Show resolved Hide resolved
pandas/core/arrays/integer.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Dec 30, 2018

@kprestel no, thank you! this is a really useful feature!

actually now that I think about it. Can you add an example in io.rst (and maybe in the read_csv doc-string) showing say reading in as Int64 (I think we already have an example, so you can just add another column that is Int64), actually maybe add reading as decimal would be cool too.

@kprestel
Copy link
Contributor Author

@jreback Done. Let me know if that isn't what you were looking for.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2018

perfect

@jreback
Copy link
Contributor

jreback commented Dec 31, 2018

@jreback
Copy link
Contributor

jreback commented Dec 31, 2018

ping on green.

@kprestel
Copy link
Contributor Author

kprestel commented Jan 1, 2019

Seems like this needs to get updated or removed? This works now I think...

2018-12-30T20:24:59.3212208Z Check for invalid EA testing
2018-12-30T20:24:59.3294654Z ##[error]pandas/tests/extension/base/io.py(24,): error : Found unwanted pattern: tm.assert_frame_equal(result, expected)

Otherwise it should be fine now.

@jreback
Copy link
Contributor

jreback commented Jan 1, 2019

@kprestel that's a hint to use: self.assert_frame_equal

@kprestel
Copy link
Contributor Author

kprestel commented Jan 2, 2019

@jreback should be good after this build.

Sorry for the delay.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2019

thanks @kprestel we have pretty strict ci / linters now-a-days; a good thing.

@jreback jreback merged commit f67aa13 into pandas-dev:master Jan 2, 2019
@jreback
Copy link
Contributor

jreback commented Jan 2, 2019

thanks @kprestel really nice!

@jreback
Copy link
Contributor

jreback commented Jan 2, 2019

keep em coming!

@kprestel
Copy link
Contributor Author

kprestel commented Jan 2, 2019

I plan on it. Thanks again for the help!

gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 2, 2019
gfyoung added a commit that referenced this pull request Jan 2, 2019
@kprestel kprestel deleted the ea-types-read-csv branch January 5, 2019 16:11
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: support EA types in read_csv
6 participants