BUG in MultiIndex truncated repr with integer level names #15262

Closed
KevinBaudin opened this issue Jan 30, 2017 · 11 comments
Labels
Bug Output-Formatting __repr__ of pandas objects, to_string

@KevinBaudin

KevinBaudin commented Jan 30, 2017

Reproducible example:

In [10]: df = pd.DataFrame({'col': range(9)}, index=pd.MultiIndex.from_product([['A0', 'A1', 'A2'], ['B0', 'B1', 'B2']], names=[1,2]))

In [11]: df
Out[11]: 
       col
1  2      
A0 B0    0
   B1    1
   B2    2
A1 B0    3
   B1    4
   B2    5
A2 B0    6
   B1    7
   B2    8

In [12]: pd.options.display.max_rows = 4

In [13]: df
Out[13]: 
       col
1  2      
A0 A0    0
   A0    1
...    ...
A2 A2    7
   A2    8

[9 rows x 1 columns]

So the truncated repr incorrectly shows the values of the first index level (the one with integer level name 1) again in place of the second level.


Original post:

Code Sample, a copy-pastable example if possible

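import pickle

import pandas  # needed so the pickled DataFrame can be reconstructed
import wget

url = 'https://www.dropbox.com/s/aldllo0bi3m3wkl/stock?dl=1'
filename = wget.download(url)
with open(filename, 'rb') as f:  # pickles must be read in binary mode
    df = pickle.load(f)
df         # bad display: is the index duplicated?
df.head()  # expected display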

Problem description

             merged
1     2
a.    a.          2
abel  abel        1
agnes agnes       2
alain alain       8
      alain       2

I have created a multi-index based on 2 columns.
Those two columns don't display properly: index column "2" is shown as a duplicate of index column "1".
When displaying up to the first 60 rows of the dataframe, the output is fine; beyond that, the values of column 1 are repeated in column 2.
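
The 60-row threshold described above matches the default value of pandas' `display.max_rows` option (60), past which the repr switches to its truncated form. The following is a minimal sketch of that switch, not the data from this report; the frame names `short` and `long_` are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: one at the row limit, one a single row past it.
short = pd.DataFrame({"col": range(60)})
long_ = pd.DataFrame({"col": range(61)})

# Pin the relevant options so the result does not depend on local settings.
with pd.option_context(
    "display.max_rows", 60, "display.show_dimensions", "truncate"
):
    short_repr = repr(short)
    long_repr = repr(long_)

# Only a truncated repr carries the "[N rows x M columns]" footer.
print("rows x" in short_repr)
print("rows x" in long_repr)
```

This would explain why `df.head()` and small slices looked fine while the full frame did not: only the latter goes through the truncated code path.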

Expected Output

                       merged
1     2
a.    masson-dubois         2
abel  pinchard              1
agnes paquet                2
alain corcia                8
      hudelot-noellat       2

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.4-moby
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jan 30, 2017

pls show a copy-pastable example that doesn't rely on opening your file

@KevinBaudin
Author

KevinBaudin commented Jan 30, 2017

Sorry, I didn't manage to find a simple reproducible example.
So far, the problem has only appeared on this particular dataframe.

PS: It's only a 22 KB file.

@jreback
Contributor

jreback commented Jan 30, 2017

this is not reproducible. you can try .sort_index()

@adrtod

adrtod commented Jan 30, 2017

I could reproduce it, but .sort_index() did not help.
The data do not seem to be corrupted; only the display is wrong.

@jorisvandenbossche
Member

@KevinBaudin The file is not available anymore?

@KevinBaudin
Author

@jorisvandenbossche edited with new link, sorry.

@jorisvandenbossche
Member

@KevinBaudin The cause of the issue is the index level names ([1, 2]). If you set those to something else, you will see that the issue is resolved:

In [6]: df
Out[6]: 
                                 merged
1               2                      
a.              a.                    2
abel            abel                  1
agnes           agnes                 2
...
[100 rows x 1 columns]

In [7]: df.index.names = ['a', 'b']

In [8]: df
Out[8]: 
                                   merged
a               b                        
a.              masson-dubois           2
abel            pinchard                1
agnes           paquet                  2
...
[100 rows x 1 columns]

The reason for this is the integer level names: an integer such as 1 can be read either as the position of a level (0 for the first level, 1 for the second) or as a level name (1 and 2 here), and the two readings get confused.
So it seems that the repr used in .head() deals with this distinction correctly, but the general repr does not.
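
To make the position-versus-name ambiguity concrete, here is a small sketch using `MultiIndex.get_level_values`, which accepts either a position or a name. It illustrates the kind of confusion described above and is not the actual code path of the repr; the variable names are made up for this example.

```python
import pandas as pd

levels = [["A0", "A1", "A2"], ["B0", "B1", "B2"]]
int_named = pd.MultiIndex.from_product(levels, names=[1, 2])
str_named = pd.MultiIndex.from_product(levels, names=["a", "b"])

# With string names, 1 can only mean a position: the second level.
by_position = list(str_named.get_level_values(1).unique())

# With integer names, 1 is matched against the level *names* first,
# so it selects the first level (the one named 1).
by_name = list(int_named.get_level_values(1).unique())

print(by_position)
print(by_name)

# Workaround from the comment above: give the levels non-integer names.
renamed = int_named.set_names(["a", "b"])
print(list(renamed.names))
```

Any internal code that passes a level position where a name is also accepted can hit the same mix-up whenever the names are themselves small integers.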

@jorisvandenbossche jorisvandenbossche added Bug Output-Formatting __repr__ of pandas objects, to_string and removed Can't Repro labels Jan 31, 2017
@jorisvandenbossche
Member

Smaller reproducible example:

In [10]: df = pd.DataFrame({'col': range(9)}, index=pd.MultiIndex.from_product([['A0', 'A1', 'A2'], ['B0', 'B1', 'B2']], names=[1,2]))

In [11]: df
Out[11]: 
       col
1  2      
A0 B0    0
   B1    1
   B2    2
A1 B0    3
   B1    4
   B2    5
A2 B0    6
   B1    7
   B2    8

In [12]: pd.options.display.max_rows = 4

In [13]: df
Out[13]: 
       col
1  2      
A0 A0    0
   A0    1
...    ...
A2 A2    7
   A2    8

[9 rows x 1 columns]

So it is the truncated repr that has this issue.
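
The same reproduction can be checked programmatically rather than by eye. This is only a sketch: it uses `pd.option_context` so the global display options are left untouched, and the `second_level_visible` name is made up here. On an affected version the check is expected to be `False`; on a version with the fix it should be `True`.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A0", "A1", "A2"], ["B0", "B1", "B2"]], names=[1, 2]
)
df = pd.DataFrame({"col": range(9)}, index=index)

# option_context restores the previous display settings on exit.
with pd.option_context("display.max_rows", 4):
    truncated = repr(df)

print(truncated)

# If the second level is rendered correctly, its first label shows up
# in the first displayed row of the truncated output.
second_level_visible = "B0" in truncated
print("second level rendered:", second_level_visible)
```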

@jorisvandenbossche jorisvandenbossche changed the title from "Multi-Index doesn't display as expected using IPython" to "BUG in MultiIndex truncated repr with integer level names" Jan 31, 2017
@KevinBaudin
Author

@jorisvandenbossche ❤️ 👍

@Dr-Irv
Contributor

Dr-Irv commented Feb 10, 2017

I've started looking at this. Seems to be an issue in pd.concat():

In [2]: df = pd.DataFrame({'col': range(9)}, index=pd.MultiIndex.from_product([ ['A0', 'A1', 'A2'], ['B0', 'B1', 'B2']], names=[1,2]))

In [3]: df.iloc[:2,:]
Out[3]:
       col
1  2
A0 B0    0
   B1    1

In [4]: df.iloc[-2:,:]
Out[4]:
       col
1  2
A2 B1    7
   B2    8

In [5]: pd.concat((df.iloc[:2,:],df.iloc[-2:,:]))
Out[5]:
       col
1  2
A0 A0    0
   A0    1
A2 A2    7
   A2    8

That last result is incorrect. Should the name of this issue be changed? (@jorisvandenbossche)
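
A regression-style check for the `pd.concat()` behaviour could look roughly like the sketch below. It is not the test that was eventually added to pandas; the `expected` list is simply the index one would get by hand from the two slices shown above.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A0", "A1", "A2"], ["B0", "B1", "B2"]], names=[1, 2]
)
df = pd.DataFrame({"col": range(9)}, index=index)

# Concatenate the first two and last two rows, as in the example above.
result = pd.concat([df.iloc[:2], df.iloc[-2:]])

# Index tuples taken directly from the two slices.
expected = [("A0", "B0"), ("A0", "B1"), ("A2", "B1"), ("A2", "B2")]

print(result.index.tolist() == expected)
print(list(result.index.names))
```

Comparing the index tuples rather than the printed output keeps the check independent of display options.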

@jreback
Contributor

jreback commented Feb 17, 2017

I think this is a dupe of: #12223

if this is the case, just use an example from there as well in tests.

Dr-Irv added a commit to Dr-Irv/pandas that referenced this issue Feb 22, 2017
Dr-Irv added a commit to Dr-Irv/pandas that referenced this issue Feb 23, 2017
@jreback jreback added this to the 0.20.0 milestone Feb 23, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
…n MultiIndex

closes pandas-dev#12223
closes pandas-dev#15262

Author: Dr-Irv <irv@princeton.com>

Closes pandas-dev#15478 from Dr-Irv/Issue15262 and squashes the following commits:

15d8433 [Dr-Irv] Address jreback comments
10667a3 [Dr-Irv] Fix types for test
8935068 [Dr-Irv] resolve conflicts
385ca3e [Dr-Irv] BUG: GH pandas-dev#12223, GH pandas-dev#15262. Allow ints for names in MultiIndex

5 participants