
Read csv category fix #18402

Merged: 2 commits into pandas-dev:master on Nov 22, 2017

Conversation

sam-cohan (Contributor):

Summary:

@sam-cohan changed the base branch from 0.21.x to master on November 21, 2017 03:17
@@ -64,6 +64,7 @@ Bug Fixes
- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
- Bug in :class:`IntervalIndex` constructor when a list of intervals is passed with non-default ``closed`` (:issue:`18334`)
- Bug in :meth:`IntervalIndex.copy` when copying and ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
- Bug in ``pd.read_csv`` when reading numeric category fields with high cardinality (:issue `18186`)
Member:

  • ``pd.read_csv`` -> :func:`read_csv`
  • :issue is missing its closing :, and there should be no space between :issue: and the issue number (corrected entry shown below)
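
With both fixes applied, the entry would read:

- Bug in :func:`read_csv` when reading numeric category fields with high cardinality (:issue:`18186`)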

dtypes = set(a.dtype for a in arrs)
if len(dtypes) > 1:
    common_type = np.find_common_type(dtypes, [])
dtypes = set([a.dtype for a in arrs])
jschendel (Member) commented Nov 21, 2017:

Can you rewrite this as a set comprehension, i.e. {a.dtype for a in arrs}. Seems to be preferred based on #18383. I think this line is what caused the failure on Travis.

def test_categorical_dtype_high_cardinality_numeric(self):
    # GH 18186
    data = sorted([str(i) for i in range(10**6)])
    expected = pd.DataFrame({'a': Categorical(data, ordered=True)})
Member:

DataFrame has already been imported, so you can remove the pd. prefix.

actual = self.read_csv(StringIO('a\n' + '\n'.join(data)),
                       dtype='category')
actual.a.cat.reorder_categories(sorted(actual.a.cat.categories),
                                ordered=True, inplace=True)
Member:

Can you do this by assignment, i.e. actual['a'] = ..., instead of inplace? The convention is generally to avoid using inplace in tests.
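
For illustration, a non-inplace version of the line under review might look like the following (a sketch of one possibility, not the reviewer's exact code):

# Sketch only: rebuild the column via assignment instead of mutating it in place
actual['a'] = actual['a'].cat.reorder_categories(
    sorted(actual['a'].cat.categories), ordered=True)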

sam-cohan (Contributor, Author):

I will make this change, but I did it inplace hoping it would be faster and more memory efficient. Can you explain why it is preferred not to use inplace?

Contributor:

@sam-cohan inplace=True is not more efficient in any way. Furthermore, it is much harder to read; we do not use it in the codebase except in support of specific user-facing APIs.

Contributor:

Sorting should use np.sort here.

sam-cohan (Contributor, Author) commented Nov 21, 2017:

Done on np.sort.
I see your point about inplace=True in an ideal world, but FYI, in my workflows I typically use it to avoid temporarily needing double the storage for sorting (since I am often playing close to the limits of available physical memory). It would have been nice if assignment were smart enough to be memory efficient.

Member:

To expand on inplace=True a bit: per my understanding, for most operations inplace=True essentially does the same thing as assignment. It still operates on a copy, with the top-level reference then being reassigned, so there usually isn't an actual gain in memory efficiency.
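
A minimal illustration of that equivalence, using sort_values rather than the categorical accessor (my own example, not code from this PR):

import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2]})

# Per the point above: inplace=True still builds the sorted result as a new
# object internally and swaps it in behind the existing reference...
df.sort_values('a', inplace=True)

# ...which is why it is effectively equivalent to plain reassignment.
df = df.sort_values('a')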

sam-cohan (Contributor, Author):

Thanks for the note... that is unfortunate. Perhaps that is worthy of discussion in another venue.

jschendel (Member):

A few comments, but looks good. Thanks!


codecov bot commented Nov 21, 2017

Codecov Report

Merging #18402 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18402      +/-   ##
==========================================
- Coverage   91.36%   91.34%   -0.02%     
==========================================
  Files         164      164              
  Lines       49730    49730              
==========================================
- Hits        45435    45426       -9     
- Misses       4295     4304       +9
Flag        Coverage Δ
#multiple   89.14% <ø> (ø) ⬆️
#single     39.62% <ø> (-0.07%) ⬇️

Impacted Files         Coverage Δ
pandas/io/gbq.py       25% <0%> (-58.34%) ⬇️
pandas/core/frame.py   97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 509e03c...b831699. Read the comment docs.


codecov bot commented Nov 21, 2017

Codecov Report

Merging #18402 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18402      +/-   ##
==========================================
- Coverage   91.35%   91.33%   -0.02%     
==========================================
  Files         163      163              
  Lines       49714    49714              
==========================================
- Hits        45415    45406       -9     
- Misses       4299     4308       +9
Flag        Coverage Δ
#multiple   89.13% <ø> (ø) ⬆️
#single     39.63% <ø> (-0.07%) ⬇️

Impacted Files         Coverage Δ
pandas/io/gbq.py       25% <0%> (-58.34%) ⬇️
pandas/core/frame.py   97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 103ea6f...ff1945d. Read the comment docs.

sam-cohan (Contributor, Author):

@jschendel PTAL. I made the changes you requested but I think there is still something off...

@@ -114,6 +114,17 @@ def test_categorical_dtype(self):
actual = self.read_csv(StringIO(data), dtype='category')
tm.assert_frame_equal(actual, expected)

@pytest.mark.slow
Contributor:

How slow is this test?

sam-cohan (Contributor, Author):

On my machine it is about 4.5 seconds for the high-memory parser, and 6.5 seconds for the low-memory and Python parsers.

Member:

The minimal range necessary to reproduce the error is range(524289), at least locally for me. Might be beneficial to lower the range in the test to offset some of the slowness? If so, not sure if we should push it down to the limit, or just go down to something like 600k to leave a little buffer room.

sam-cohan (Contributor, Author):

I checked and the limit is the same on both Ubuntu and OSX, so I decided to use exactly the limit. This cut the times down to 2.25 seconds and 3 seconds, so I removed the slow mark.

sam-cohan (Contributor, Author):

Added the slow mark back, rebased, and waiting for tests.

@jreback added the Bug and IO CSV (read_csv, to_csv) labels Nov 21, 2017
sam-cohan (Contributor, Author):

@jreback PTAL.

- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
- Bug in :class:`IntervalIndex` constructor when a list of intervals is passed with non-default ``closed`` (:issue:`18334`)
- Bug in :meth:`IntervalIndex.copy` when copying and ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
- Bug in :func:``read_csv`` when reading numeric category fields with high cardinality (:issue:`18186`)
Member:

Looks like all entries above yours got duplicated after rebasing #18408; you can see repeat entries under different sections below. So, I think the following should be done:

  • Delete all entries above yours (the repeat entries should still be there)
  • Move your entry under either the I/O or Categorical section (not sure which is more appropriate, so should double check on this)
  • When using references like :func: you only need single backticks, so :func:``read_csv`` -> :func:`read_csv`

sam-cohan (Contributor, Author):

Done. I put it under I/O because the direct call to CategoricalDtype for the expected value was not broken in the original code.

sam-cohan (Contributor, Author):

@jreback The requested changes are completed. PTAL.

@@ -114,6 +114,16 @@ def test_categorical_dtype(self):
actual = self.read_csv(StringIO(data), dtype='category')
tm.assert_frame_equal(actual, expected)

def test_categorical_dtype_high_cardinality_numeric(self):
Contributor:

@sam-cohan Can you restore the slow marker? Otherwise LGTM. Ping on green.
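
Pulling the requested changes together (np.sort, no pd. prefix, assignment instead of inplace, range(524289), slow marker restored), the test presumably ends up roughly as follows; this is a reconstruction from the review comments above, not a verbatim copy of the merged diff:

@pytest.mark.slow
def test_categorical_dtype_high_cardinality_numeric(self):
    # GH 18186
    # Assumed to live in the shared parser test class that provides
    # self.read_csv; np, pytest, DataFrame, Categorical, StringIO and tm
    # are imported at module level, as in the snippets above.
    data = np.sort([str(i) for i in range(524289)])
    expected = DataFrame({'a': Categorical(data, ordered=True)})
    actual = self.read_csv(StringIO('a\n' + '\n'.join(data)),
                           dtype='category')
    actual['a'] = actual['a'].cat.reorder_categories(
        np.sort(actual['a'].cat.categories), ordered=True)
    tm.assert_frame_equal(actual, expected)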


jreback commented Nov 22, 2017

pls rebase

@jreback added this to the 0.21.1 milestone Nov 22, 2017
@jreback merged commit d421a09 into pandas-dev:master Nov 22, 2017

jreback commented Nov 22, 2017

thanks!

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017
(cherry picked from commit d421a09)
Labels: Bug, IO CSV (read_csv, to_csv)
Projects: None yet
Development:

Successfully merging this pull request may close these issues.

BUG: read_csv(dtype='category') raises with many categories
4 participants