Invalid unicode characters removed from datagrid #578

Merged
merged 6 commits into from May 11, 2023


nicojapas
Contributor

@nicojapas nicojapas commented Apr 26, 2023

Fixes #456

When eliding text that contains astral symbols (code points above U+FFFF), the datagrid generates invalid Unicode characters, because substring() counts UTF-16 code units and can cut a surrogate pair in half.

[Screenshot: before]

With the regular expression /[\u{D800}-\u{DFFF}]/gu we match any character falling within the range of surrogate code points. This includes both high surrogates (0xD800 to 0xDBFF) and low surrogates (0xDC00 to 0xDFFF). Because of the u flag, an intact surrogate pair is treated as a single code point and is not matched, so only the invalid character left behind by splitting a surrogate pair is removed with replace().
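The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the datagrid's actual code: `elideNaive` and its `maxLength` parameter are hypothetical names, and the trailing ellipsis is an assumption about how elided text is marked.

```typescript
// Hypothetical helper illustrating the regex-based fix; not taken from the PR.
function elideNaive(text: string, maxLength: number): string {
  if (text.length <= maxLength) {
    return text;
  }
  // substring() counts UTF-16 code units, so the cut may land in the
  // middle of a surrogate pair and leave a lone surrogate behind.
  const cut = text.substring(0, maxLength);
  // With the `u` flag the pattern sees whole code points, so intact pairs
  // are untouched and only lone surrogates are removed.
  const cleaned = cut.replace(/[\u{D800}-\u{DFFF}]/gu, '');
  return cleaned + '…';
}

// '😀' is two code units; cutting after two units leaves 'a' plus a lone
// high surrogate, which the replace() call then strips.
console.log(elideNaive('a😀b', 2));
```

Note that this approach also deletes lone surrogates that were present in the original text on purpose, not only those produced by the cut.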

[Screenshot: after]

@welcome

welcome bot commented Apr 26, 2023

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@krassowski
Member

Thank you for opening the PR! On a conceptual level, what if someone has a table listing all Unicode code points, or a font that maps those code points to something else? Would it be better to rewrite eliding to use a Unicode-aware slice by first converting the string to an array, as in https://stackoverflow.com/questions/62341685/javascript-unicode-aware-string-slice/62341816#62341816 ?
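A Unicode-aware slice of the kind suggested here might look like the sketch below. It assumes eliding keeps a fixed number of code points and appends an ellipsis; `elideByCodePoint` and `maxCodePoints` are hypothetical names, not identifiers from the datagrid source.

```typescript
// Hypothetical helper illustrating a code-point-aware slice.
function elideByCodePoint(text: string, maxCodePoints: number): string {
  // Array.from() iterates the string by code point, so each surrogate
  // pair becomes a single array element and can never be split.
  const codePoints = Array.from(text);
  if (codePoints.length <= maxCodePoints) {
    return text;
  }
  return codePoints.slice(0, maxCodePoints).join('') + '…';
}

// The emoji is kept whole instead of being cut in half.
console.log(elideByCodePoint('a😀b', 2));
```

Unlike the regex approach, nothing is deleted from the input, so a cell that deliberately contains a lone surrogate keeps it. Sequences made of several code points (for example emoji joined with zero-width joiners) can still be split by this sketch; handling those would need grapheme segmentation, which is outside what the thread discusses.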

@krassowski krassowski added the bug Something isn't working label Apr 26, 2023
@nicojapas
Contributor Author

> Thank you for opening the PR! On a conceptual level, what if someone has a table listing all Unicode code points, or a font that maps those code points to something else? Would it be better to rewrite eliding to use a Unicode-aware slice by first converting the string to an array, as in https://stackoverflow.com/questions/62341685/javascript-unicode-aware-string-slice/62341816#62341816 ?

Hi! Yes, I think that is a better approach. I just committed a new solution.

Member

@fcollonval fcollonval left a comment

Thanks @nicojapas

Leaving this open so that @krassowski can have a look at the latest version.

Member

@krassowski krassowski left a comment

Thank you! I will open a follow-up PR with unit tests.

@krassowski krassowski merged commit e887f33 into jupyterlab:main May 11, 2023
18 of 19 checks passed
@welcome

welcome bot commented May 11, 2023

Congrats on your first merged pull request in this project! 🎉
Thank you for contributing, we are very proud of you! ❤️

@krassowski krassowski changed the title Invalid unicode characters removed from datagrid (#456) Invalid unicode characters removed from datagrid May 11, 2023
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

DataGrid eliding does not respect unicode character integrity
3 participants