New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading a CSV with duplicated MultiRow columns #18062

Closed
jpiabrantes opened this Issue Nov 1, 2017 · 11 comments

Comments

Projects
None yet
5 participants
@jpiabrantes

jpiabrantes commented Nov 1, 2017

Problem description

I have the following .csv file:

A A B
one one two
0 34 0.1

I want to use the first two rows to create MultiIndex columns. However, the first and second columns are duplicated and pandas is handling that in a weird way.

Code Sample, a copy-pastable example if possible

print(pd.read_csv('test.csv',header=[0,1]).columns)
>> Index([('A', 'one'), '('A', 'one').1', ('B', 'two')], dtype='object')

This output is a total mess, the first column and last are tuples while the second column is a string. Also, this is not a MultiIndex, it's an Index. Raising an error or having the option to delete the duplicate columns would be a better solution.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-97-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.21.0
pytest: 2.9.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.13.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
feather: None
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Nov 1, 2017

Looks like a bad interaction between the name mangling and multiple headers.

This is a bit tricky though, since (IIUC) not name mangling here would cause data lose since you don't have unique columns. I'm not sure the best behavior, but the most consistent is to probably just mangle the inner-most level's names?

@TomAugspurger TomAugspurger added the IO CSV label Nov 1, 2017

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Nov 1, 2017

@WillAyd

This comment has been minimized.

Member

WillAyd commented Nov 1, 2017

I’d have to confirm later but on initial glance I think the problem stems from line 799 in parsers.pyx where the mangling is coercing the duplicated name into a string value.

If that holds true would it not make sense to check the type of the name variable and only perform the mangling if it is a string, providing some warning to the user instead in the case of a sequence?

@WillAyd

This comment has been minimized.

Member

WillAyd commented Nov 1, 2017

My original thought was wrong, but the problem here seems to stem from the _maybe_dedup_names function. I edited the function to the below:

def _maybe_dedup_names(self, names):
    ...
    if self.mangle_dupe_cols:
        names = list(names)  # so we can index
        counts = defaultdict(int)

        for i, col in enumerate(names):
            if isinstance(col, tuple):
                continue
            ...

And this worked as intended. However, this assumes that multi-indexed columns will always be passed in as a tuple, which I'm not sure is valid. An alternate approach would be to differentiate between strings and tuples / lists, but I wasn't sure off the top of my head of a completely reliable way to do that across all Python versions, so there could be some complexity trade-offs there.

@TomAugspurger - do you have a point of view on this? The above would be the simplest solution, but I'm certainly happy to work on something a more robust if we want to handle more than tuples and feel its worth the added complexity. Let me know what you think and I can work something up in a PR

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Nov 2, 2017

cc @gfyoung who is more familiar with this code.

@gfyoung

This comment has been minimized.

Member

gfyoung commented Nov 2, 2017

@jpiabrantes : Thanks for reporting this! Indeed, duplicate data is annoying to handle in pandas, and while we've mostly ironed out the single-column case, it does not surprise me that something would have broken in the MultiIndex case.

I think you should augment _maybe_dedup_names to account for the MultiIndex case more carefully. I agree with your suggest though to "dedup" at the lowest row. Give it a shot in a PR!

@WillAyd

This comment has been minimized.

Member

WillAyd commented Nov 2, 2017

Well I was able to get the correct columns returned with the modifications called out above, but what I didn't notice earlier until I started testing was that it impacted the values within the DataFrame. The issue for that is that the read function within the CParserWrapper converts the data into a dict, using the col_names as a key...So when not mangled the values of the data frame being returned are incorrect.

Still debugging but if anyone has experience dealing with that or thoughts on how to resolve let me know

@gfyoung

This comment has been minimized.

Member

gfyoung commented Nov 2, 2017

@WillAyd : Could you provide an example?

@WillAyd

This comment has been minimized.

Member

WillAyd commented Nov 2, 2017

Assuming you add the code I provided earlier to the _maybe_dedup_names function (actual solution would be more robust, but that will work for this comment) try the following:

data = "A,A,B\none,one,two\n0,34,0.1"
pd.read_csv(io.StringIO(data), header=[0,1])

Doing so yields:

A B
one one two
34 34 0.1

The correct column / MultiIndex will be built, but you'll notice that the values of the DataFrame are 34, 34 and 0.1; not 0, 34 and 0.1 as expected.

I think this gets mucked up within the read functions where you see lines like:

data = dict((k, v) for k, (i, v) in zip(names, data))

Because there are only 2 unique names to be shared across the 3 columns, and using the names within the keys of the dict prevent that from parsing correctly

@gfyoung

This comment has been minimized.

Member

gfyoung commented Nov 2, 2017

@WillAyd : It doesn't look like you're mangling the second level of the names (or name tuple, rather), judging from what I can see from here.

@WillAyd

This comment has been minimized.

Member

WillAyd commented Nov 2, 2017

Are you expecting the output to look something like this?

A   B
one one.1 two
0 34 0.1
@gfyoung

This comment has been minimized.

Member

gfyoung commented Nov 2, 2017

Yes, that would be what I would expect.

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 3, 2017

@gfyoung gfyoung added the MultiIndex label Nov 3, 2017

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 3, 2017

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 4, 2017

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Nov 6, 2017

gfyoung added a commit that referenced this issue Nov 6, 2017

watercrossing added a commit to watercrossing/pandas that referenced this issue Nov 10, 2017

No-Stream added a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment