Some visible whitespaces in browser are trimmed in the result #206

Closed
dzcpy opened this issue Jul 13, 2020 · 6 comments

@dzcpy

dzcpy commented Jul 13, 2020

htmlToText.fromString('<br>　　123<br>')

The line above should return '\n　　123\n' instead of '\n123\n', since '　' (U+3000, Ideographic Space) is visible whitespace in a browser.
I would suggest providing an option to choose which characters should not be trimmed.

@KillyMXI
Member

Thanks for pointing this out.

Your example uses Ideographic Space (&#x3000;). It is indeed preserved in a browser.

I don't see a need for an option, though. Since the goal of html-to-text is to approximate browser behavior, we just have to make it handle each whitespace character the way a browser does.

I have to take a closer look at the HTML spec, add extra tests, and work from there.
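
The distinction discussed above can be sketched in a few lines. This is an illustration only, not code from html-to-text: `trimHtmlWhitespace` is a hypothetical helper, and the thread does not confirm that the library's trimming was caused by `String.prototype.trim()`. It shows that JavaScript's built-in `trim()` removes U+3000, while the HTML Living Standard's "ASCII whitespace" set (TAB, LF, FF, CR, SPACE) does not include it.

```javascript
// HTML "ASCII whitespace" per the HTML Living Standard: TAB, LF, FF, CR, SPACE.
const HTML_WHITESPACE = '\t\n\f\r ';

// Hypothetical helper (not part of html-to-text): trims only HTML whitespace.
function trimHtmlWhitespace(text) {
  let start = 0;
  let end = text.length;
  while (start < end && HTML_WHITESPACE.includes(text[start])) start++;
  while (end > start && HTML_WHITESPACE.includes(text[end - 1])) end--;
  return text.slice(start, end);
}

// One ASCII space, two Ideographic Spaces, "123", one ASCII space.
const sample = ' \u3000\u3000123 ';

// trim() strips every Unicode whitespace character, including U+3000.
console.log(sample.trim().length);               // 3
// The HTML-aware trim keeps the two Ideographic Spaces.
console.log(trimHtmlWhitespace(sample).length);  // 5
```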

@KillyMXI
Member

I started to look into this issue and realized it will indeed require an option, with the default staying true to the HTML spec.
I have to meet conflicting demands.
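
A rough sketch of what such an option could look like, with a spec-aligned default. Everything here is an assumption for illustration: the function `collapseWhitespace` and the option name `whitespaceChars` are hypothetical and are not taken from html-to-text.

```javascript
// Hypothetical sketch: collapses runs of "whitespace" into single spaces and
// drops leading/trailing runs. Which characters count as whitespace is
// configurable; the default is the HTML ASCII whitespace set.
function collapseWhitespace(text, { whitespaceChars = ' \t\r\n\f' } = {}) {
  const isWhitespace = (ch) => whitespaceChars.includes(ch);
  let out = '';
  let pendingSpace = false;
  for (const ch of text) {
    if (isWhitespace(ch)) {
      // Remember a separator only if some content was already emitted.
      pendingSpace = out.length > 0;
    } else {
      if (pendingSpace) out += ' ';
      pendingSpace = false;
      out += ch;
    }
  }
  return out; // a trailing whitespace run is never emitted
}

// Default: Ideographic Spaces (U+3000) are kept as content.
const kept = collapseWhitespace('\u3000\u3000123 ');
// Opt-in: callers who want them stripped add U+3000 to the set.
const stripped = collapseWhitespace('\u3000\u3000123 ', {
  whitespaceChars: ' \t\r\n\f\u3000',
});
console.log(kept.length, stripped.length); // 5 3
```

Keeping the default aligned with the spec and letting callers widen the set is one way to serve both demands; other designs are possible.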

@dzcpy
Author

dzcpy commented Aug 3, 2020

Thanks very much for your reply.
Also, I found that there are a lot of quirks in the browser too. For example:
<p>　　abc</p> is not trimmed, but
　　<p>abc</p> is trimmed. It's a bit weird.

@KillyMXI
Member

KillyMXI commented Aug 3, 2020

Vivaldi (as well as any Chromium-based browser, I suspect) shows the second example with the Ideographic Spaces put on an extra line.
This aligns well with what I know at this point: a text node containing these two characters is rendered in a block context and gets its own line. It is not removed the way HTML whitespace is; it is just not visible. It can be revealed with mouse selection.

I think this will be handled well ("as in browser", that is) once I'm done with the refactoring.

@KillyMXI
Member

I've pushed the update that should cover this, among many other things.

@KillyMXI
Member

The new version is now live on npm.
