BUG: Fix handling of missing CSV MI column names #23484

Merged
merged 1 commit into from Nov 6, 2018

3 participants
@gfyoung
Member

commented Nov 4, 2018

De-hackifies this hack:

pandas/pandas/io/parsers.py

Lines 3206 to 3211 in d78bd7a

    # hack
    if (isinstance(index_names[0], compat.string_types) and
            'Unnamed' in index_names[0]):
        index_names[0] = None
    return index_names, columns, index_col

Setup:

from pandas.compat import StringIO
from pandas import read_csv

data = ",,col\na,c,1\na,d,2\nb,c,3\nb,d,4"
print(read_csv(StringIO(data), index_col=[0, 1]))

data = "NotReallyUnnamed,Unnamed: 0,col\na,c,1\na,d,2\nb,c,3\nb,d,4"
print(read_csv(StringIO(data), index_col=[0, 1]))

Before:

# Why is only `index_names[0]` replaced with `None`?
              col
  Unnamed: 1
a c             1
  d             2
b c             3
  d             4

# Having "Unnamed" in the name doesn't make it replace-able ?
# (this is also surfacing the `index_names[0]` bug too)
              col
  Unnamed: 0
a c             1
  d             2
b c             3
  d             4

After:

# All placeholder names get dropped.
              col
a c             1
  d             2
b c             3
  d             4

# Non-placeholder names never get dropped.
                             col
NotReallyUnnamed Unnamed: 0
a                c             1
                 d             2
b                c             3
                 d             4
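
The difference between the two behaviours can be sketched in plain Python, without pandas. This is only an illustration under stated assumptions: both helper functions are hypothetical names, and `unnamed_cols` is assumed to be the set of placeholder names the parser generated for empty header cells. It is not the code in `pandas/io/parsers.py`.

```python
def old_clean_index_names(index_names):
    """Mimic the quoted hack: only the first name is ever inspected."""
    names = list(index_names)
    if isinstance(names[0], str) and "Unnamed" in names[0]:
        names[0] = None
    return names


def new_clean_index_names(index_names, unnamed_cols):
    """Drop every name that was generated as a placeholder.

    `unnamed_cols` is an assumed input: the set of placeholder names
    created for empty header cells. Hypothetical helper, not pandas code.
    """
    return [None if name in unnamed_cols else name for name in index_names]


# Header ",,col": both index names are generated placeholders.
generated = ["Unnamed: 0", "Unnamed: 1"]
print(old_clean_index_names(generated))
# [None, 'Unnamed: 1']
print(new_clean_index_names(generated, {"Unnamed: 0", "Unnamed: 1"}))
# [None, None]

# Header "NotReallyUnnamed,Unnamed: 0,col": the names come from the file.
literal = ["NotReallyUnnamed", "Unnamed: 0"]
print(old_clean_index_names(literal))
# [None, 'Unnamed: 0']
print(new_clean_index_names(literal, set()))
# ['NotReallyUnnamed', 'Unnamed: 0']
```

The substring test in the old version is what drops "NotReallyUnnamed"; membership in a set of generated names avoids that false positive.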

@gfyoung gfyoung added this to the 0.24.0 milestone Nov 4, 2018

@pep8speaks

commented Nov 4, 2018

Hello @gfyoung! Thanks for submitting the PR.

@gfyoung gfyoung force-pushed the forking-repos:multi-index-column-names branch from 41ef255 to 2668351 Nov 4, 2018

@codecov

commented Nov 4, 2018

Codecov Report

Merging #23484 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23484      +/-   ##
==========================================
+ Coverage   92.23%   92.23%   +<.01%     
==========================================
  Files         161      161              
  Lines       51197    51204       +7     
==========================================
+ Hits        47220    47227       +7     
  Misses       3977     3977

| Flag | Coverage Δ |
|---|---|
| #multiple | 90.61% <100%> (ø) ⬆️ |
| #single | 42.27% <64.7%> (ø) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/parsers.py | 95.62% <100%> (+0.01%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 24ab22f...e18fef9.

BUG: Fix handling of missing CSV MI column names
Before, only the first index name got replaced
with `None` so long as it had the string "Unnamed"
in it.

Now we replace all index names with `None` if they
were deliberately set with placeholders.

@gfyoung gfyoung force-pushed the forking-repos:multi-index-column-names branch from 2668351 to e18fef9 Nov 4, 2018

@gfyoung

Member Author

commented Nov 5, 2018

@jreback: I simplified the logic a little, but I still need unnamed_count in some cases, because it is computed on each iteration of a for-loop, whereas unnamed_cols is a global collection. Everything is still green, though. PTAL.

@@ -786,6 +793,9 @@ cdef class TextReader:
name = '%s.%d' % (name, count)
count = counts.get(name, 0)

if old_name == '':

@jreback

jreback Nov 6, 2018

Contributor

it seems like you could just add unnamed_cols.add(name) at line 774 (e.g. after the if name == '')?

@gfyoung

gfyoung Nov 6, 2018

Author Member

Not quite. You need to add the name that you get post-mangling (e.g. if there are dupes). That's why you have to "keep track" after the logic ending at 794.
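
To make the point about post-mangling names concrete, here is a rough pure-Python sketch of header naming with duplicate mangling. It borrows the `'%s.%d'` suffix and the `counts` dictionary from the diff context above, but the function `name_header` and its overall shape are assumptions for illustration, not the actual `TextReader` code.

```python
def name_header(raw_header):
    """Return (names, unnamed_cols) for a single header row.

    Hypothetical helper: empty cells get an 'Unnamed: <i>' placeholder,
    duplicates get a '.<n>' suffix, and a placeholder is recorded only
    under its final, post-mangling name.
    """
    names = []
    unnamed_cols = set()
    counts = {}

    for i, old_name in enumerate(raw_header):
        name = old_name if old_name != "" else "Unnamed: %d" % i

        # Mangle duplicates: keep appending a suffix until the name is free.
        count = counts.get(name, 0)
        while count > 0:
            counts[name] = count + 1
            name = "%s.%d" % (name, count)
            count = counts.get(name, 0)

        # Track the placeholder only now, after mangling has finished.
        if old_name == "":
            unnamed_cols.add(name)

        names.append(name)
        counts[name] = count + 1

    return names, unnamed_cols


# The file literally names its first column "Unnamed: 1"; the second is empty.
names, unnamed_cols = name_header(["Unnamed: 1", "", "col"])
print(names)         # ['Unnamed: 1', 'Unnamed: 1.1', 'col']
print(unnamed_cols)  # {'Unnamed: 1.1'}
```

Recording the name before the de-duplication loop would have stored 'Unnamed: 1', which in this input is the column the file named explicitly rather than the generated placeholder.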

@jreback

jreback approved these changes Nov 6, 2018

@jreback jreback merged commit 819ee75 into pandas-dev:master Nov 6, 2018

3 checks passed

ci/circleci: py36_locale Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed
pandas-dev.pandas Build #20181104.80 has test failures
@jreback

Contributor

commented Nov 6, 2018

thanks!

@gfyoung gfyoung deleted the forking-repos:multi-index-column-names branch Nov 7, 2018

JustinZhengBC added a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018

brute4s99 added a commit to brute4s99/pandas that referenced this pull request Nov 19, 2018

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
