
ENH: Data formatting with unicode length #11102

Merged (1 commit, Oct 3, 2015)

Conversation

@sinhrks (Member) commented Sep 15, 2015

Closes #2612. Added a display.unicode.east_asian_width option, which calculates text width taking East Asian Width into account. Enabling this option affects performance, since the width must be calculated per character.
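A minimal usage sketch of the option added here (the option name is taken from this PR; the sample frame is made up for illustration):

```python
import pandas as pd

# Hypothetical sample frame mixing double-width (East Asian) and ASCII text.
df = pd.DataFrame({u'あ': [u'ああ', u'いい'], 'b': ['aa', 'bb']})

# Default: widths are counted per character, so columns can look misaligned
# in terminals that render East Asian characters double-width.
print(df)

# Opt in to East Asian Width-aware formatting (slower, per the note above).
pd.set_option('display.unicode.east_asian_width', True)
print(df)
```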

Current results (captured)

(screenshot of current output)

  • Basic impl and test
  • Series / DataFrame truncation
  • Perf test
  • Doc / Release note

@sinhrks added the Output-Formatting (__repr__ of pandas objects, to_string) and Unicode (Unicode strings) labels on Sep 15, 2015
@sinhrks added this to the 0.17.1 milestone on Sep 15, 2015
@shoyer (Member) commented Sep 16, 2015

How much slower is this than the default behavior? My guess is that anyone who uses these characters will want this.

Usually we're not printing enough data to the screen for performance on this sort of stuff to matter.


def east_asian_len(data, encoding=None):
"""
Calcurate display width considering unicode East Asian Width
Review comment (Contributor):

typo - calculate

@kawochen (Contributor):

The east_asian_len calculation can be reduced to a 3-element set membership test. Ever so slightly faster, but probably too micro to matter. I think ambiguous characters need special handling (?).
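A rough illustration of the two lookup styles being compared (hypothetical code, not the PR's implementation; exactly which categories count as wide, and whether the 3-element set includes 'A', is my guess):

```python
from unicodedata import east_asian_width

# Dict-based lookup: map every East Asian Width category to a display width.
WIDTHS = {'F': 2, 'W': 2, 'A': 1, 'H': 1, 'Na': 1, 'N': 1}

def east_asian_len_dict(text):
    return sum(WIDTHS[east_asian_width(c)] for c in text)

# Set-based lookup: only test membership in the categories rendered wide.
WIDE = frozenset({'F', 'W'})  # a 3-element set if 'A' (ambiguous) is included

def east_asian_len_set(text):
    return sum(2 if east_asian_width(c) in WIDE else 1 for c in text)
```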

@sinhrks force-pushed the unicode_justify branch 3 times, most recently from 1db82da to 5c84786, on September 27, 2015 01:47
@sinhrks (Member, Author) commented Sep 27, 2015

OK, this PR should work for all the cases I'm aware of. I'd appreciate it if anyone could provide further test cases if there are any concerns.

@shoyer Yes, East Asian users would prefer this to be True by default. But it is almost 2 times slower in the case below.

  • DataFrame contains 10,000 values (100 rows × 100 columns); each item is a 10-character Unicode string
  • Display options are default:
    • pd.options.display.max_rows: 60
    • pd.options.display.max_columns: 20
import numpy as np
import pandas as pd

chars = list(u'あいうえおかきくけこさしすせそたちつてとなにぬねの')

def rand_jp(x):
    return ''.join(np.random.choice(chars) for _ in range(x))

df = pd.DataFrame(np.empty((100, 100)))
df = df.applymap(lambda x: rand_jp(10))

%timeit unicode(df)  # Python 2; on Python 3 use str(df)
# 10 loops, best of 3: 177 ms per loop

# Enable Unicode handling
pd.options.display.unicode.east_asian_width = True

%timeit unicode(df)
# 1 loops, best of 3: 381 ms per loop

The effect is almost the same when the data is all ASCII (the same conditions, except the characters are all ASCII):

  • Default: 10 loops, best of 3: 131 ms per loop
  • Enable Unicode handling: 1 loops, best of 3: 302 ms per loop

@kawochen I may not understand properly, but reducing the number of East Asian Width categories will not affect performance, because dict lookup is O(1).

I think ambiguous characters need special handling (?).

Do you have any idea about which characters are affected and what the special handling logic would be? My concern is that these characters cannot be aligned properly even if we try.

CC: @ayapi

@shoyer (Member) commented Sep 27, 2015

If I understand correctly, because pandas does not actually print every element of large DataFrames, printing a larger DataFrame would be the same speed?

How does this patch affect the speed of printing Unicode text that contains only ASCII characters?

@sinhrks (Member, Author) commented Sep 27, 2015

@shoyer Correct. Larger data can be printed at almost the same speed; perf is not affected by data size once it exceeds max_columns and max_rows.

Results for all-ASCII input are described above: almost 2 times longer, because the logic is the same. It could be short-circuited if we could distinguish whether the input contains any double-width characters (including symbols). Is that possible using any built-in?
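For what it's worth, one built-in that could serve as such a fast path today is str.isascii(), which was only added in Python 3.7 and so was not available when this PR was written. A hedged sketch (the width rule here is illustrative, not the PR's code):

```python
from unicodedata import east_asian_width

def east_asian_len(text):
    # Fast path: pure-ASCII text is always one column per character.
    # str.isascii() requires Python 3.7+.
    if text.isascii():
        return len(text)
    # Slow path: sum per-character widths, counting 'F'/'W' as two columns.
    return sum(2 if east_asian_width(c) in ('F', 'W') else 1 for c in text)
```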

@shoyer (Member) commented Sep 27, 2015

I would probably try tweaking the string length check and see if a simple variant gives you a speedup. But generally this is pretty reasonable already. Few DataFrames will show this many strings.


@kawochen (Contributor):

@sinhrks Yes, the optimization I mentioned is so minor I don't know why I brought it up. It makes east_asian_len take about 30-40% less time to run, but I don't think that's where the time is spent anyway. Regarding ambiguous characters, I am not sure what can be done (perhaps a config option?). Also wanted to mention in passing that printing ⟼ (neutral) would still not print as one would hope.

else:
name = strlen.__name__

if name == 'east_asian_len':
Review comment (Contributor):

pass a different pad function if you are in east_asian_len mode; then you don't need the if/then block that duplicates code

@sinhrks force-pushed the unicode_justify branch 4 times, most recently from 7a03703 to 7de4215, on October 1, 2015 13:01
@sinhrks (Member, Author) commented Oct 1, 2015

@jreback I've refactored a little based on your comments. I've created a TextAdjustment class that holds the set of East-Asian-dependent functions, rather than defining a separate pad function, because separate definitions may cause unexpected results in the future (e.g. using the East Asian pad together with the normal len). If it looks OK, I'll squash.
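A rough sketch of the shape of that refactoring (pandas' real classes differ; this only shows the idea of bundling len and justify into one object so they cannot be mismatched):

```python
from unicodedata import east_asian_width

class TextAdjustment:
    """Default behaviour: one display column per character."""
    def len(self, text):
        return len(text)

    def justify(self, texts, max_len):
        return [t.ljust(max_len) for t in texts]

class EastAsianTextAdjustment(TextAdjustment):
    """Count 'F'/'W' characters as two columns and pad by display width."""
    def len(self, text):
        return sum(2 if east_asian_width(c) in ('F', 'W') else 1
                   for c in text)

    def justify(self, texts, max_len):
        # Pad with spaces so display widths (not character counts) line up.
        return [t + ' ' * (max_len - self.len(t)) for t in texts]
```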

@kawochen OK, I'll leave the current east_asian_len as-is.

Also, please provide the list of ambiguous characters and their widths. As far as I understand, ambiguous character widths are not an integral multiple of the normal width, so they cannot be aligned by padding with white space. I don't think we can fix that, because it is the Unicode spec.

@jreback (Contributor) commented Oct 1, 2015

looks nice!

can u run a perf check on relevant benchmarks to assert its about the same as current?

add a release note and can do for 0.17.0

@kawochen (Contributor) commented Oct 1, 2015

@sinhrks Ambiguous characters are either wide or narrow. See here for the list: http://unicode.org/reports/tr11-2/

@sinhrks (Member, Author) commented Oct 1, 2015

@kawochen OK. Going back to the first discussion: we can't support this without clarifying the logic, and I can't find per-character logic in your link... Maybe I misunderstood something? Please provide actual code that works as you expect.

@kawochen (Contributor) commented Oct 1, 2015

@sinhrks There is no per-character logic. All ambiguous characters are narrow on my terminal, but that's just my settings. Users should know whether they should be treated as wide or narrow, which depends on where the characters are being printed, so making it configurable might make sense. I don't think it can be figured out from within Python. You can try printing the characters in the list.

@sinhrks (Member, Author) commented Oct 1, 2015

Thanks, I think I understand a little better now. From the link:

When mapping Unicode to East Asian legacy character encodings

  • Wide Unicode characters always map to fullwidth characters.
  • Narrow (and neutral) Unicode characters always map to halfwidth characters.
  • Halfwidth Unicode characters always map to halfwidth characters.
  • Ambiguous Unicode characters always map to fullwidth characters.

When mapping Unicode to non-East Asian legacy character encodings

  • Wide Unicode characters do not map to non-East Asian legacy character encodings.
  • Narrow (and neutral) Unicode characters always map to regular (narrow) characters.
  • Halfwidth Unicode characters do not map.
  • Ambiguous Unicode characters always map to regular (narrow) characters.

When mapping Unicode to East Asian legacy character encodings: Ambiguous should be handled as full-width (length=2). This is covered by the implementation added in this PR (display.unicode.east_asian_width=True).

When mapping Unicode to non-East Asian legacy character encodings: because full-width characters are not mapped (cannot appear), all characters including ambiguous ones can be regarded as half-width (length=1). This corresponds to the current default, display.unicode.east_asian_width=False.

What situation would require a separate option only for ambiguous characters?

@kawochen (Contributor) commented Oct 2, 2015

@sinhrks When we are not mapping to legacy encodings. For example, in my terminal Chinese characters are twice as wide as ambiguous characters. Whether the width should be 1 or 2 depends on where and how the data will be displayed. In the unlikely case that we wanted it to look pretty in Arial, we'd print tabs to align things; in monospace fonts it's easier, so we can use spaces. In modern terminals, ambiguous characters are usually narrow.

@sinhrks (Member, Author) commented Oct 2, 2015

@kawochen Can you share screenshots and code that I can confirm on my terminal?

@kawochen (Contributor) commented Oct 2, 2015

In [1]: from unicodedata import east_asian_width

In [2]: print('中文\n\u00A1\u00A1ab')

In [4]: east_asian_width('\u00A1')
Out[4]: 'A'

(screenshot: the printed lines do not align)

So if I had all of those in a DataFrame, I might have too much padding.

@sinhrks (Member, Author) commented Oct 2, 2015

Understood. How about the name unicode.ambiguous_as_wide, defaulting to False (length=1), since narrow is more common in current terminals? If set to True, the length is regarded as 2.
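The proposed option boils down to a per-character rule along these lines (a sketch: the option name comes from this thread, the rest is illustrative, not pandas' code):

```python
from unicodedata import east_asian_width

def char_width(c, ambiguous_as_wide=False):
    eaw = east_asian_width(c)
    if eaw in ('F', 'W'):      # fullwidth / wide: always 2 columns
        return 2
    if eaw == 'A':             # ambiguous: depends on the terminal
        return 2 if ambiguous_as_wide else 1
    return 1                   # narrow / halfwidth / neutral

# '\u00A1' (inverted exclamation mark) is category 'A', as the IPython
# session earlier in the thread shows.
```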

@sinhrks force-pushed the unicode_justify branch 2 times, most recently from 75763af to 8a442b7, on October 2, 2015 15:09
@sinhrks (Member, Author) commented Oct 2, 2015

Updated to add the release note and unicode.ambiguous_as_wide. "Ambiguous" characters cannot be aligned properly on Sphinx (Jupyter).

terminal

The latter case cannot be aligned because of the mismatch between the terminal's setting and the pandas option.

(screenshot)

Sphinx/Jupyter

Going to add a note saying "This should be aligned properly in a terminal which uses a monospaced font."

(screenshot)

I'll update perf comparison tomorrow.

@@ -304,7 +305,7 @@ See the :ref:`documentation <io.excel>` for more details.
:suppress:

import os
os.remove('test.xlsx')
# os.remove('test.xlsx')
Review comment (Contributor):

?

@jreback (Contributor) commented Oct 2, 2015

@sinhrks ok some minor doc fixes. merge when ready.

@sinhrks changed the title from "(WIP) ENH: Data formatting with unicode length" to "ENH: Data formatting with unicode length" on Oct 3, 2015
@sinhrks (Member, Author) commented Oct 3, 2015

Thanks, doc updated and asv results attached. The differences look like random noise.

All benchmarks:

    before     after       ratio
  [5049b5  ] [53ac28  ]
    19.50ms    22.42ms      1.15  frame_methods.frame_repr_tall.time_frame_repr_tall
     1.30ms     1.43ms      1.10  groupby.series_value_counts.time_value_counts_int64
     3.41μs     3.74μs      1.10  indexing.indexing_frame_get_value.time_indexing_frame_get_value

@sinhrks (Member, Author) commented Oct 3, 2015

@jreback Could you merge this when you prepare the RC?

All: I'm willing to fix anything that is pointed out during the RC.

jreback added a commit that referenced this pull request Oct 3, 2015
ENH: Data formatting with unicode length
@jreback merged commit 75cd3e8 into pandas-dev:master on Oct 3, 2015
@jreback (Contributor) commented Oct 3, 2015

@sinhrks (and @kawochen ) this is fantastic. quite a bit of work and lots of tests yeh!

@sinhrks deleted the unicode_justify branch on October 3, 2015 17:37
@sinhrks (Member, Author) commented Oct 3, 2015

Found that the doc layout can differ by environment. I'll update the notes to describe it.

1st looks OK, 2nd does not:

(screenshot)

1st does not look right, 2nd looks OK:

(screenshot)
