ENH:Add EA types to read CSV #23255

kprestel · 2018-10-20T19:12:14Z

Closes GH23228

[x] closes ENH: support EA types in read_csv #23228
[x] tests added / passed
[x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
[x] whatsnew - May need some guidance on this part.

pep8speaks · 2018-10-20T19:12:18Z

Hello @kprestel! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 02, 2019 at 02:01 Hours UTC

pandas/_libs/parsers.pyx

pandas/tests/io/parser/common.py

TomAugspurger · 2018-11-06T15:43:03Z

Where exactly in pandas' normal parse, infer, astype chain does this fall? Is it at the very end?

This seems fragile... Consider a person writing a DecimalArray extension type. Would the DecimalDtype._from_sequence be given a list of strings? Would pandas have already converted the numeric-like values to floats, thus losing precision? We'll need to better document exactly when this is called.
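The precision concern can be shown with the standard library alone. This is a minimal sketch, not the DecimalArray code under discussion; the variable names are illustrative only. It compares building a Decimal straight from the CSV text with building it after the text has already been converted to a float.

```python
from decimal import Decimal

raw = "0.1"  # the token as it appears in the CSV file

# Parsing the text directly keeps the exact decimal value.
from_string = Decimal(raw)

# Going through float first bakes in the binary rounding error.
from_float = Decimal(float(raw))

print(from_string)
print(from_float)
print(from_string == from_float)
```

If the parser had already converted numeric-looking tokens to floats before handing them to the extension type, the second form is what the array would receive.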

jorisvandenbossche · 2018-11-09T12:24:13Z

If we want to support this in general for EAs, I think we need to add a new method to the interface to parse from strings. _from_sequence should not be expected to be able to do that.
(and if the EA does not implement it, we can raise in read_csv saying this dtype does not support it)

jreback · 2018-11-09T13:36:35Z

see #23595 maybe _from_sequence_of_strings ? or a more general function
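A rough sketch of the idea in the two comments above, using only the standard library: the reader hands raw strings to a dedicated classmethod and raises if the array type does not provide one. `ToyDecimalArray`, `ToyOpaqueArray`, and `read_column` are hypothetical names made up for illustration; they are not pandas classes, and this is not the implementation that was merged.

```python
import csv
import io
from decimal import Decimal


class ToyDecimalArray:
    """Hypothetical stand-in for an extension array; not a pandas class."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, scalars):
        # Expects scalars that are already Decimal objects.
        return cls(scalars)

    @classmethod
    def _from_sequence_of_strings(cls, strings):
        # Parses the raw text directly, so no float round-trip happens.
        return cls(Decimal(s) for s in strings)


class ToyOpaqueArray:
    """Hypothetical array type that does not support parsing from text."""

    def __init__(self, values):
        self.values = list(values)


def read_column(text, column, array_type):
    """Read one CSV column as strings and hand it to the array type."""
    rows = csv.DictReader(io.StringIO(text))
    strings = [row[column] for row in rows]
    parse = getattr(array_type, "_from_sequence_of_strings", None)
    if parse is None:
        raise NotImplementedError(
            f"{array_type.__name__} does not support parsing from strings"
        )
    return parse(strings)


data = "price\n0.1\n0.25\n"
prices = read_column(data, "price", ToyDecimalArray)
print(prices.values)
```

Keeping the string parser separate from `_from_sequence` means each method has one clear input contract, which is the point raised above.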

kprestel · 2018-11-12T15:28:48Z

@TomAugspurger I'm nowhere near an expert in pandas but from what I can tell the casting happens at the very end. The infer/astype chain seems to start in parsers.pyx self._convert_tokens where self == TextReader. From just reading the code I don't think there is a risk of losing precision as you stated, although it should be tested.

@jorisvandenbossche Are you suggesting that we add something like _from_sequence_of_strings to the EA interface as @jreback suggested? I can try this and then we can see what it would look like.

@jreback Can you explain a little bit more about how to create a new test module? I'm not exactly sure what pattern to follow. I'm guessing I need to create a test class and inherit from some base test class but I can't figure out which. Just putting my test in the new io.py as you suggested isn't enough because pytest doesn't pick it up. Sorry if this is a dumb question, I just haven't worked with a test suite like this before.

Thanks everyone for their feedback!

TomAugspurger · 2018-11-12T15:57:25Z

> From just reading the code I don't think there is a risk of losing precision as you stated, although it should be tested.

As a quick bit of debugging, you could drop a print(type(scalars[0])) in DecimalArray._from_sequence when you refactor to use that. Though I think @jorisvandenbossche's suggestion of adding a dedicated method to the API should be followed.

We should also think about this in the context of #20612. There will be common patterns for every type of IO we want to add support for (no concrete suggestions right now though).

kprestel · 2018-11-12T19:31:54Z

I think whatever this change ends up being it will close #20612. In the original issue (#23228), it is noted that read_csv is broken because it will return object as the dtype if the specified dtype is an EA. I believe this is due to the _get_dtype_type function in dtypes/common.py.

Basically, when you pass an EA to the dtype dict in read_csv, it gets treated as a type, which means _get_dtype_type turns it into np.object.

It seems like that needs to be fixed before anything.
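The general failure mode described here, where an argument the resolver does not recognise silently becomes object, can be illustrated with a toy function. This is an assumption-laden sketch: `ToyDtype` and `resolve_scalar_type` are hypothetical and do not reproduce the logic of pandas' `_get_dtype_type`.

```python
class ToyDtype:
    """Hypothetical dtype with a scalar `type` attribute; not a pandas class."""

    type = int
    name = "toy_int"


def resolve_scalar_type(dtype):
    """Toy resolver: anything it does not recognise falls back to object."""
    if isinstance(dtype, ToyDtype):
        return dtype.type
    # Silent fallback: the caller gets object instead of an error.
    return object


print(resolve_scalar_type(ToyDtype()))  # an instance is recognised
print(resolve_scalar_type(ToyDtype))    # the class itself falls through
```

A silent fallback like this is hard to notice from the outside, because the read succeeds and only the resulting dtype is wrong.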

jreback · 2018-11-23T20:53:45Z

@kprestel pls merge master.

codecov · 2018-12-01T17:37:40Z

Codecov Report

Merging #23255 into master will increase coverage by <.01%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master   #23255      +/-   ##
==========================================
+ Coverage   42.46%   42.46%   +<.01%     
==========================================
  Files         161      161              
  Lines       51557    51592      +35     
==========================================
+ Hits        21892    21908      +16     
- Misses      29665    29684      +19

| Flag | Coverage Δ | |
|---|---|---|
| #single | `42.46% <50%> (ø)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/dtypes/cast.py | `47.63% <0%> (-0.79%)` | ⬇️ |
| pandas/io/parsers.py | `48.54% <100%> (ø)` | ⬆️ |
| pandas/core/series.py | `50.6% <100%> (-0.15%)` | ⬇️ |
| pandas/core/arrays/base.py | `50.98% <100%> (+0.64%)` | ⬆️ |
| pandas/core/arrays/integer.py | `38.27% <50%> (+0.29%)` | ⬆️ |
| pandas/core/dtypes/common.py | `69.21% <60%> (-0.69%)` | ⬇️ |
| pandas/core/arrays/categorical.py | `42.01% <0%> (+0.12%)` | ⬆️ |
| pandas/core/dtypes/dtypes.py | `76.68% <0%> (+0.51%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b0610b...3e5ec56. Read the comment docs.

codecov · 2018-12-01T17:37:40Z

Codecov Report

Merging #23255 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23255      +/-   ##
==========================================
+ Coverage    92.3%   92.32%   +0.01%     
==========================================
  Files         166      166              
  Lines       52412    52454      +42     
==========================================
+ Hits        48381    48430      +49     
+ Misses       4031     4024       -7

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.75% <100%> (+0.02%)` | ⬆️ |
| #single | `43.01% <52.94%> (-0.05%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/arrays/base.py | `98.25% <100%> (+0.02%)` | ⬆️ |
| pandas/core/arrays/integer.py | `96.29% <100%> (+0.06%)` | ⬆️ |
| pandas/io/parsers.py | `95.4% <100%> (+0.02%)` | ⬆️ |
| pandas/core/arrays/timedeltas.py | `87.42% <0%> (-0.26%)` | ⬇️ |
| pandas/core/reshape/reshape.py | `99.35% <0%> (-0.22%)` | ⬇️ |
| pandas/core/indexes/timedeltas.py | `90.47% <0%> (-0.19%)` | ⬇️ |
| pandas/util/testing.py | `87.59% <0%> (-0.17%)` | ⬇️ |
| pandas/core/series.py | `93.64% <0%> (-0.1%)` | ⬇️ |
| pandas/core/indexes/period.py | `92.37% <0%> (-0.09%)` | ⬇️ |
| pandas/core/indexes/datetimes.py | `96.3% <0%> (-0.03%)` | ⬇️ |

... and 22 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 681e75c...f42235a. Read the comment docs.

jreback · 2018-12-03T01:31:42Z

can you merge master

kprestel · 2018-12-08T16:38:31Z

@jreback Done. Sorry about the delay.

kprestel · 2018-12-09T18:12:52Z

I think the implementation works and is ready for a full review. I may need a bit of guidance on the testing part though. I've tested it locally and it works but I'm not sure how to test it "more generally" following existing patterns. Any guidance on that aspect would be greatly appreciated.

Thanks for all the help/time given to me so far!

jreback

You are catching dtypes in WAY too many places. Rip all of that out and just leave the code in parsers.pyx.

doc/source/whatsnew/v0.24.0.rst

pandas/_libs/parsers.pyx

pandas/core/arrays/base.py

pandas/core/arrays/integer.py

pandas/core/dtypes/cast.py

pandas/core/dtypes/common.py

pandas/tests/extension/base/io.py

kprestel · 2018-12-09T19:21:17Z

@jreback You're right, and not 20 minutes after I pushed it I thought about that... I was too focused on the use case you described in the original issue. However, if I just rip out everything that isn't in parsers.pyx, I don't think it would work at all. If it did, it wouldn't work with the python parser for sure.

I will reply to your comments and answer your questions.

Thanks for the review. I really appreciate it.

jreback · 2018-12-09T20:10:51Z

@kprestel re: #23255 (comment)

right, the python parser will need some small fixes as well. but let's start minimally.

pandas/core/arrays/base.py

pandas/_libs/parsers.pyx

pandas/core/arrays/integer.py

pandas/core/dtypes/cast.py

jreback · 2018-12-30T19:52:36Z

@kprestel no, thank you! this is a really useful feature!

Actually, now that I think about it: can you add an example in io.rst (and maybe in the read_csv doc-string) showing, say, reading in as Int64? I think we already have an example, so you can just add another column that is Int64. Reading as decimal would be cool too.
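For reference, the kind of example requested here might look like the following sketch. It assumes a pandas version that includes this change (0.24 or later); the column names and data are made up for illustration and are not taken from io.rst.

```python
import io

import pandas as pd

# Column "b" has a missing value in the second row.
data = "a,b\n1,2\n3,\n5,6\n"

# Ask read_csv for the nullable integer extension dtype on column "b".
df = pd.read_csv(io.StringIO(data), dtype={"b": "Int64"})

print(df.dtypes)
print(df)
```

With the nullable `Int64` dtype, the column stays integer-typed even though one value is missing, rather than being upcast to float.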

kprestel · 2018-12-30T20:07:41Z

@jreback Done. Let me know if that isn't what you were looking for.

jreback · 2018-12-30T20:11:12Z

perfect

pandas/tests/extension/base/io.py

pandas/core/arrays/base.py

jreback · 2018-12-31T23:32:34Z

@kprestel some lint issues: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=6210

jreback · 2018-12-31T23:32:40Z

ping on green.

kprestel · 2019-01-01T00:11:21Z

Seems like this needs to get updated or removed? This works now I think...

2018-12-30T20:24:59.3212208Z Check for invalid EA testing
2018-12-30T20:24:59.3294654Z ##[error]pandas/tests/extension/base/io.py(24,): error : Found unwanted pattern: tm.assert_frame_equal(result, expected)

Otherwise it should be fine now.

jreback · 2019-01-01T00:21:30Z

@kprestel that's a hint to use: self.assert_frame_equal
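The reasoning behind calling the assertion through `self` can be sketched as follows: shared tests live on a base class, and an array author who subclasses it can swap in a comparison suited to their type. The class and method names below are hypothetical and only illustrate the pattern; they are not the pandas extension test classes.

```python
class BaseToyTests:
    """Hypothetical base class; shared tests call self.assert_equal."""

    @staticmethod
    def assert_equal(left, right):
        # Default comparison: exact equality.
        assert left == right, f"{left!r} != {right!r}"

    def test_sum_parts(self):
        computed, expected = self.make_case()
        # Dispatching through self lets subclasses override the check.
        self.assert_equal(computed, expected)


class TestApproxFloats(BaseToyTests):
    """An array author overrides the comparison for their own type."""

    @staticmethod
    def assert_equal(left, right):
        assert len(left) == len(right)
        assert all(abs(a - b) < 1e-9 for a, b in zip(left, right))

    def make_case(self):
        return [0.1 + 0.2], [0.3]


# Passes only because the subclass's tolerant comparison is used.
TestApproxFloats().test_sum_parts()
```

Had the shared test called a module-level assertion helper directly, the subclass would have no way to substitute its own comparison.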

pandas/tests/extension/base/io.py

kprestel · 2019-01-02T02:02:11Z

@jreback should be good after this build.

Sorry for the delay.

jreback · 2019-01-02T02:10:02Z

thanks @kprestel we have pretty strict ci / linters now-a-days; a good thing.

jreback · 2019-01-02T02:57:26Z

thanks @kprestel really nice!

jreback · 2019-01-02T02:57:30Z

keep em coming!

kprestel · 2019-01-02T02:59:18Z

I plan on it. Thanks again for the help!

jreback requested changes Nov 1, 2018

pandas/_libs/parsers.pyx (outdated)

jreback added the IO CSV (read_csv, to_csv) and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Nov 1, 2018

jreback requested changes Nov 4, 2018

pandas/_libs/parsers.pyx (outdated)

pandas/tests/io/parser/common.py (outdated)

kprestel force-pushed the ea-types-read-csv branch from d6a09e5 to 22f2fa1 Compare November 23, 2018 18:20

kprestel force-pushed the ea-types-read-csv branch 3 times, most recently from db9af0e to 69226de Compare November 30, 2018 20:10

kprestel force-pushed the ea-types-read-csv branch from 69226de to 5b53438 Compare December 1, 2018 16:38

kprestel force-pushed the ea-types-read-csv branch from 856753c to 4fb55b5 Compare December 8, 2018 16:36

kprestel force-pushed the ea-types-read-csv branch from 4fb55b5 to b1aaa36 Compare December 9, 2018 18:11

kprestel force-pushed the ea-types-read-csv branch from b1aaa36 to 2c3d27a Compare December 9, 2018 18:48

jreback requested changes Dec 9, 2018

jorisvandenbossche reviewed Dec 10, 2018

kprestel force-pushed the ea-types-read-csv branch from e8db3d2 to 4937b22 Compare December 21, 2018 21:49

Update docs to show using Int64 dtype

f908e2e

gfyoung reviewed Dec 30, 2018

pandas/tests/extension/base/io.py (outdated)

jreback reviewed Dec 30, 2018

pandas/core/arrays/base.py

Update docs per comments in PR

e60549c

Fix linters

a6a2d99

jreback reviewed Jan 1, 2019

pandas/tests/extension/base/io.py (outdated)

kprestel added 2 commits January 1, 2019 15:47

Fix linters again

053b442

isort

f42235a

jreback merged commit f67aa13 into pandas-dev:master Jan 2, 2019

jreback mentioned this pull request Jan 2, 2019

WIP: make _holder changeover #24540

Closed

gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 2, 2019

MAINT: Remove empty Python file

6db730f

Follow-up to pandas-devgh-23255.

gfyoung mentioned this pull request Jan 2, 2019

MAINT: Remove empty Python file #24544

Merged

gfyoung added a commit that referenced this pull request Jan 2, 2019

MAINT: Remove empty Python file (#24544)

e76c90e

Follow-up to gh-23255.

kprestel deleted the ea-types-read-csv branch January 5, 2019 16:11

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

ENH:Add EA types to read CSV (pandas-dev#23255)

b769eb5

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

MAINT: Remove empty Python file (pandas-dev#24544)

b25b3c2

Follow-up to pandas-devgh-23255.

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

ENH:Add EA types to read CSV (pandas-dev#23255)

635ab33

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

MAINT: Remove empty Python file (pandas-dev#24544)

11fe552

Follow-up to pandas-devgh-23255.
