BUG: Thoroughly dedup column names in read_csv #17095

Merged (1 commit, merged Aug 1, 2017)
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -278,6 +278,7 @@ I/O
^^^

- Bug in :func:`read_csv` in which columns were not being thoroughly de-duplicated (:issue:`17060`)
- Bug in :func:`read_csv` in which specified column names were not being thoroughly de-duplicated (:issue:`17095`)
- Bug in :func:`read_csv` in which non integer values for the header argument generated an unhelpful / unrelated error message (:issue:`16338`)
- Bug in :func:`read_csv` in which memory management issues in exception handling, under certain conditions, would cause the interpreter to segfault (:issue:`14696`, :issue:`16798`).
- Bug in :func:`read_csv` when called with ``low_memory=False`` in which a CSV with at least one column > 2GB in size would incorrectly raise a ``MemoryError`` (:issue:`16798`).
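For illustration, a minimal sketch of the behaviour the new entry describes, mirroring the expected values in this PR's tests (the output shown assumes this PR's dedup logic):

from io import StringIO
import pandas as pd

# Duplicate user-specified names are now mangled thoroughly, so the
# resulting columns stay unique even when a mangled name collides with
# another name in the list.
data = "a,b,b\n1,2,3"
names = ["a.1", "a.1", "a.1.1"]

df = pd.read_csv(StringIO(data), names=names, mangle_dupe_cols=True)
print(list(df.columns))  # ['a.1', 'a.1.1', 'a.1.1.1'] per this PR's tests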
18 changes: 11 additions & 7 deletions pandas/io/parsers.py
@@ -1318,14 +1318,18 @@ def _maybe_dedup_names(self, names):
         # would be nice!
         if self.mangle_dupe_cols:
             names = list(names)  # so we can index
-            counts = {}
+            counts = defaultdict(int)
 
             for i, col in enumerate(names):
-                cur_count = counts.get(col, 0)
+                cur_count = counts[col]
 
-                if cur_count > 0:
-                    names[i] = '%s.%d' % (col, cur_count)
+                while cur_count > 0:
+                    counts[col] = cur_count + 1
+
+                    col = '%s.%d' % (col, cur_count)
+                    cur_count = counts[col]
 
+                names[i] = col
                 counts[col] = cur_count + 1
 
         return names
@@ -2330,15 +2334,15 @@ def _infer_columns(self):
                         this_columns.append(c)
 
                 if not have_mi_columns and self.mangle_dupe_cols:
-                    counts = {}
+                    counts = defaultdict(int)
 
                     for i, col in enumerate(this_columns):
-                        cur_count = counts.get(col, 0)
+                        cur_count = counts[col]
 
                         while cur_count > 0:
                             counts[col] = cur_count + 1
                             col = "%s.%d" % (col, cur_count)
-                            cur_count = counts.get(col, 0)
+                            cur_count = counts[col]
 
                         this_columns[i] = col
                         counts[col] = cur_count + 1
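For reference, the while-loop strategy that both hunks now share can be exercised standalone; a rough re-implementation (the function name dedup_names is illustrative, not pandas API):

from collections import defaultdict

def dedup_names(names):
    # Keep probing '<col>.<n>' until the candidate has not been used yet,
    # bumping the count for every candidate stepped through so that later
    # duplicates continue the sequence instead of re-colliding.
    names = list(names)
    counts = defaultdict(int)
    for i, col in enumerate(names):
        cur_count = counts[col]
        while cur_count > 0:
            counts[col] = cur_count + 1
            col = '%s.%d' % (col, cur_count)
            cur_count = counts[col]
        names[i] = col
        counts[col] = cur_count + 1
    return names

print(dedup_names(["a", "a", "a.1"]))        # ['a', 'a.1', 'a.1.1']
print(dedup_names(["a.1", "a.1", "a.1.1"]))  # ['a.1', 'a.1.1', 'a.1.1.1']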
24 changes: 23 additions & 1 deletion pandas/tests/io/parser/mangle_dupes.py
@@ -25,7 +25,7 @@ def test_basic(self):
mangle_dupe_cols=True)
assert list(df.columns) == expected

-def test_thorough_mangle(self):
+def test_thorough_mangle_columns(self):
# see gh-17060
data = "a,a,a.1\n1,2,3"
df = self.read_csv(StringIO(data), sep=",", mangle_dupe_cols=True)
@@ -40,3 +40,25 @@ def test_thorough_mangle(self):
df = self.read_csv(StringIO(data), sep=",", mangle_dupe_cols=True)
assert list(df.columns) == ["a", "a.1", "a.3", "a.1.1",
"a.2", "a.2.1", "a.3.1"]

def test_thorough_mangle_names(self):
# see gh-17095
data = "a,b,b\n1,2,3"
names = ["a.1", "a.1", "a.1.1"]
df = self.read_csv(StringIO(data), sep=",", names=names,
mangle_dupe_cols=True)
assert list(df.columns) == ["a.1", "a.1.1", "a.1.1.1"]
Member:
There is also another behaviour that might be sensible: an expected result of ["a.1", "a.1.1.1", "a.1.1"] (so taking into account all names, not only the ones that were already processed). This ends up with only one column being different from what was specified, while in the test two columns changed.
Not sure this is necessarily better, just pointing out the possibility. But fewer changed names sounds like a nice argument.

I would also add a test with a different order to be specific about this, for example for names=['a', 'a.1', 'a']

Member:
Anyhow, I agree with @jreback that we should actually just raise an error here (possibly after a deprecation cycle), and then this is not really important.

So if we decide to do that, I would not bother discussing / changing this.

Member Author (@gfyoung, Jul 31, 2017):
I would aim for a deprecation cycle. _maybe_dedup_names has been well-ingrained for some time. In addition, the function is actually used to dedup column names when the data portion of the CSV is empty. Thus, I would like to give some more thought to how best to deprecate deduping names without issuing a warning (or raising) in the empty case.
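A small illustration of the empty-data case mentioned above (the output is assumed from read_csv's general behaviour, not taken from this PR's tests):

from io import StringIO
import pandas as pd

# A CSV with duplicated header names but no data rows still goes through
# column de-duplication, so a deprecation of name dedup must not warn here.
df = pd.read_csv(StringIO("a,a\n"))
print(list(df.columns))  # assumed: ['a', 'a.1']
print(len(df))           # 0 rows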

Member Author:
I would also add a test with a different order to be specific about this.

I already have a test for that lower in the file.

Member Author:
@jreback: Your thoughts? I think we're in agreement that this behavior should be disallowed, but given some of the complications I've mentioned here, I would prefer to refrain from disallowing or deprecating until a subsequent PR.

Contributor:
I think the issue is that if we have an ambiguous dedupe like the one above, then I would just raise. This doesn't work now, right?

Member Author:
It doesn't work correctly on master, yes. However, I'm following along with what I did in #17060, so in fact, what you're seeing is consistent with my decision in that PR.


data = "a,b,c,d,e,f\n1,2,3,4,5,6"
names = ["a", "a", "a.1", "a.1.1", "a.1.1.1", "a.1.1.1.1"]
df = self.read_csv(StringIO(data), sep=",", names=names,
mangle_dupe_cols=True)
assert list(df.columns) == ["a", "a.1", "a.1.1", "a.1.1.1",
"a.1.1.1.1", "a.1.1.1.1.1"]

Contributor:
Maybe we should just raise for cases like these. This is clearly a user error in the specification of the names.

Member Author:
The thought has crossed my mind. I don't see why not. However, it should probably go through a deprecation cycle (this behavior of dedup-ing names is pretty baked into the code).

Member Author:
I can do that as a follow-up if that works.

Contributor:
I think this is an ambiguous / unsupported case. I would just raise rather than try to get fancy.

Member Author (@gfyoung, Jul 29, 2017):
I didn't really change too much to make this change. I would still want to deprecate this behavior (in general) instead of raising immediately on a special case, because _maybe_dedup_names, after all, is about dedup-ing names.

Member Author:
On further investigation, _maybe_dedup_names is used beyond just deduping the names parameter. Thus, I would feel better looking into this further in a separate PR if necessary.

data = "a,b,c,d,e,f,g\n1,2,3,4,5,6,7"
names = ["a", "a", "a.3", "a.1", "a.2", "a", "a"]
df = self.read_csv(StringIO(data), sep=",", names=names,
mangle_dupe_cols=True)
assert list(df.columns) == ["a", "a.1", "a.3", "a.1.1",
"a.2", "a.2.1", "a.3.1"]