Rethink when HTML repr of DataFrame is displayed #4886

Closed
takluyver opened this issue Sep 19, 2013 · 17 comments
Labels
IO HTML read_html, to_html, Styler.apply, Styler.applymap Output-Formatting __repr__ of pandas objects, to_string

@takluyver
Contributor

Helping out with a moderately beginner class recently, I noticed several people having problems, because they could easily display a table view of a small DataFrame, but the representation looked completely different when it exceeded a certain size. People thought that they had a different type of object, or that the detailed information was some kind of an error message. There's no obvious way to get the HTML repr for larger DataFrames.

Suggestions:

  • Increase the size limit for displaying the full HTML repr. When we forced larger DataFrames to display as HTML tables, the IPython notebook easily handled tables substantially larger than the current cutoff.
  • When the DataFrame is too large to display whole, produce a truncated HTML table rather than switching to a completely different kind of repr.

I'll try to work on this soon if no-one objects or beats me to it.

@jtratner
Contributor

Instead of specializing just on HTML, why don't we just change the default max row config option if you detect you're in a notebook?

@jtratner
Contributor

Looks like you can't easily detect whether you're in a notebook, so I guess we could just raise max_rows in _repr_html_. The good part is that info() will still produce the other view.
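
As a rough illustration of the option under discussion, the row limit can already be raised from user code through pandas' display options. This is only a sketch of the user-side workaround, not the change proposed for the HTML repr itself, and the value 200 is an arbitrary choice.

```python
import pandas as pd

# Remember the current limit so it can be compared afterwards.
default_max_rows = pd.get_option("display.max_rows")

# Temporarily raise the limit; the previous value is restored on exit.
with pd.option_context("display.max_rows", 200):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
```

Using `option_context` rather than `set_option` keeps the change local, which is convenient when experimenting with different cutoffs.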

@jtratner
Contributor

This isn't free though; it gets slow by the time you reach 50,000 cells, for example:

df = DataFrame([range(1000) for _ in range(50)])
In [21]: %timeit df.to_string() # method used to print object
1 loops, best of 3: 3.26 s per loop

@takluyver
Contributor Author

Yep, by design, the kernel (where code is executed) doesn't know about the frontend.

Looking at the code, it mentions:

# ipnb in html repr mode allows scrolling
# users strongly prefer to h-scroll a wide HTML table in the browser
# then to get a summary view. GH3541, GH3573

So perhaps this is already improved, and I saw it in an older version. Linking those issues: #3541, #3573, and PR #3663 claiming to fix them.

Oh, and there's code which attempts to detect whether it's running in the Qt console or the notebook... sorry, that won't work all the time (the process which started a kernel isn't necessarily the same as the process making this execution request). I'll bring that up to try to work out a better way to handle the difference.
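
For readers unfamiliar with the mechanism, here is a minimal sketch of the display hook involved. IPython looks for a `_repr_html_` method on the object being displayed, and the object itself has no reliable way to know which frontend will render the result. The `Table` class is a toy stand-in invented for illustration; it is not pandas code.

```python
class Table:
    """Toy object offering both a plain-text and an HTML representation."""

    def __init__(self, rows):
        self.rows = rows

    def __repr__(self):
        # Plain-text fallback, used by terminals and frontends without HTML.
        return "\n".join("\t".join(str(cell) for cell in row) for row in self.rows)

    def _repr_html_(self):
        # Rich representation; nothing here can tell which frontend is asking.
        body = "".join(
            "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
            for row in self.rows
        )
        return f"<table>{body}</table>"


t = Table([[1, 2], [3, 4]])
text_repr = repr(t)
html_repr = t._repr_html_()
```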

@jtratner
Contributor

yeah, I had a sense. Does Qt console also use _repr_html_? If so, that's unfortunate.

@takluyver
Contributor Author

It does. I'm proposing that we (IPython) define a rich HTML repr and a separate 'poor HTML', suitable for use in the Qt console.

I'd still like to leave this issue open, because it looks like when you hit 60 rows (or whatever max_rows is configured to), it still switches abruptly to the short 'info' view, whereas I think it should show a truncated table.

@jtratner
Contributor

that'd be helpful :) - but yes, it seems like it would make sense to change html's repr, instead of just defaulting to info()

@jtratner
Contributor

@takluyver if you can set up how you'd like the repr to look, we can add a config option that can be set either in a .pandasrc or in an ipython startup script/in a notebook.

@takluyver
Contributor Author

I've found time to take a look at this. I reused the max_rows and max_columns display settings, propagating the values down to the HTMLFormatter, and truncating rows and columns. My changes are on this branch - they don't yet handle all the odd cases, so I haven't made a pull request just now. Does this look like the right approach? Or would we like to make this more general than the HTML formatting code? I considered slicing the dataframe and then taking the HTML repr of that, but I think getting the ... truncation markers in place would be tricky in that case.

Here's the current display when you go beyond 60 rows/20 columns:

[Screenshot: pandas_long_repr_before]
[Screenshot: pandas_wide_repr_before]

And here's the new:

[Screenshot: pandas_long_repr_after]
[Screenshot: pandas_wide_repr_after]
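
To make the idea of truncating inside the formatter concrete, here is a small standalone sketch. It is not the code on the branch and not pandas' HTMLFormatter: `truncated_html_table` is a hypothetical helper that works on plain lists, and the trailing "rows x columns" summary line is an assumption added for illustration. It shows why handling truncation in the formatter makes it easy to place the `...` markers, compared with slicing the data first.

```python
from html import escape


def truncated_html_table(rows, max_rows=60, max_cols=20):
    """Render a list of rows as an HTML table, cutting off extra rows/columns.

    Hypothetical helper for illustration; not the pandas HTMLFormatter.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    cut_rows = n_rows > max_rows
    cut_cols = n_cols > max_cols

    lines = ["<table>"]
    for row in rows[:max_rows]:
        cells = [f"<td>{escape(str(c))}</td>" for c in row[:max_cols]]
        if cut_cols:
            cells.append("<td>...</td>")  # marker for hidden columns
        lines.append("<tr>" + "".join(cells) + "</tr>")
    if cut_rows:
        # One marker row spanning the visible columns (plus the column marker).
        width = min(n_cols, max_cols) + (1 if cut_cols else 0)
        lines.append("<tr>" + "<td>...</td>" * width + "</tr>")
    lines.append("</table>")
    # Keep the overall shape visible, since the table no longer shows it.
    lines.append(f"<p>{n_rows} rows x {n_cols} columns</p>")
    return "\n".join(lines)


data = [[r * 100 + c for c in range(30)] for r in range(100)]
html = truncated_html_table(data, max_rows=3, max_cols=2)
```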

@jtratner
Contributor

I personally like how your proposed version looks.

@takluyver
Contributor Author

Thanks Jeff. I've now covered the cases with MultiIndex-es, added tests, and made PR #5550.

@ghost

ghost commented Nov 20, 2013

I don't object to making this controllable via an option, but I'm -1 on making it the default.
Obviously, changes like this can be traumatic to existing users, but I think the current
behavior actually makes sense from a usability point of view.

The way I see it, the default view of a dataframe is the info view. It always
provides schema information and "query info" such as the number of rows.
The reason a frame is displayed in its entirety when it's "small enough" is that a view of
all the data is a superset of the data in the info view (schema + number of records).
But yes, this can have a jarring "jump-cut" effect on new users.

An alternative solution in 2 parts is:

  1. Add a caption to the repr output describing type and display mode: "DataFrame [data view]:",
    "DataFrame [info view]:", "Series [data view]:".
    That also distinguishes Series from single-column frames, another newbie gotcha.
  2. The PR doesn't address the problem of getting a glimpse of larger frames;
    that's df.head()'s role, and altering it to emit an ellipsis for wider frames makes a lot of sense
    (for the HTML repr, and for the text repr when expand_frame_repr is off).

I strongly urge conducting a small usability study (have a few users adopt it for a week
and report back) before making a potentially disruptive change like this to the UX.

@jorisvandenbossche
Member

The Series representation gives the first and last elements. A similar approach might also be interesting for DataFrames: show the first and last rows/columns, rather than only the first rows/columns as in the proposal.

Example of Series (there is also, apart from the data, some extra information on the total length):

In [27]: s = pd.Series(np.arange(61))

In [28]: s
Out[28]: 
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
...
46    46
47    47
48    48
49    49
50    50
51    51
52    52
53    53
54    54
55    55
56    56
57    57
58    58
59    59
60    60
Length: 61, dtype: int32
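
The output above shows 15 leading and 15 trailing elements of a 61-element Series. A head-and-tail selection of that kind could be sketched as below; `head_tail` and its `max_items` parameter are hypothetical names chosen for illustration, and this is not how pandas implements it.

```python
def head_tail(items, max_items=30):
    """Return the items to display, with "..." standing in for the hidden middle."""
    items = list(items)
    if len(items) <= max_items:
        return items
    head = (max_items + 1) // 2  # the head gets the extra item if max_items is odd
    tail = max_items - head
    return items[:head] + ["..."] + (items[-tail:] if tail else [])


shown = head_tail(range(61), max_items=30)
```

The same selection could in principle be applied to both the rows and the columns of a DataFrame.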

@takluyver
Contributor Author

Conversely, there's the .info() method if you want to see a summary of the columns. Showing the data with truncation is in line with numpy reprs and Series reprs, as well as regular Python reprs of collections (although they don't truncate).

Not making it the default would defeat the entire point. New users are not going to hunt around in config settings to set this to behave intuitively. I don't even know what configuration file pandas uses. And I don't think another config setting is necessary: if you want to see the info view, use the info() method.

I would love some people to do user testing - any volunteers? However, in uncontrolled user testing of the current behaviour, I have observed the sudden switch to a completely different repr confusing new users and annoying more experienced users.
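
Both explicit views mentioned in this exchange are ordinary method calls, whatever the default repr ends up showing. A small sketch, using a made-up two-column frame and writing the `info()` output to a buffer so it can be inspected:

```python
import io

import pandas as pd

df = pd.DataFrame({"a": range(100), "b": [x * 0.5 for x in range(100)]})

# First rows only: a quick look at the data itself.
preview = df.head(3)

# Column summary, written to a buffer instead of stdout so it can be inspected.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
```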

@ghost

ghost commented Nov 20, 2013

After playing with this some more, I think it is an improvement - objections withdrawn.

@takluyver
Contributor Author

Cheers @y-p. :-)

@ghost

ghost commented Nov 26, 2013

merged #5550.

This issue was closed.