Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trimWhiteSpace only trims newlines at the start of content #19

Closed
markatrosie opened this issue Oct 18, 2018 · 5 comments
Closed

trimWhiteSpace only trims newlines at the start of content #19

markatrosie opened this issue Oct 18, 2018 · 5 comments

Comments

@markatrosie
Copy link

The documentation for the trimWhiteSpace option of HtmlExtractor indicates:

If set to true, white space at the very beginning and at the end of the content will get removed

The actual implementation, below, only trims newline characters from the start of string, but all whitespace characters from the end.

if (options.trimWhiteSpace) {
   content = content.replace(/^\n+|\s+$/g, '');
}

I believe that should be content.replace(/^\s+|\s+$/g, '').

@lukasgeiter
Copy link
Owner

I actually implemented it this way on purpose. This way a conflict with the preserveIndentation option is avoided. Let's look at an example:

<p>
    Lorem ipsum
    dolor sit amet
</p>

The raw content of the element is:

\n    Lorem ipsum\n    dolor sit amet\n

With trimWhiteSpace = true and preserveIndentation = false (the default) we will get:

Lorem ipsum\ndolor sit amet

With trimWhiteSpace = true and preserveIndentation = true we get:

    Lorem ipsum\n    dolor sit amet

Now I agree that the documentation doesn't fully describe how the option works.

I would appreciate if you could share your use case with me so I can determine whether it makes sense to release a new version or if it is enough to update the documentation. Thanks!

@markatrosie
Copy link
Author

I see what you mean. I think it would be nice if it used the current regex or the one I'm recommending above contextually based on whether trimWhiteSpace && preserveIndentation or trimWhiteSpace && !preserveIndentation, but doing that would probably cause a lot of headaches in the current userbase.

Maybe instead just a documentation update to clarify how trimWhitespace behaves?

As for my use case -- I was just trying to generate "clean" message IDs. Once I discovered this quirk, I was able to work around it. The challenge is that I'm consuming my language files inside a web browser and I need my JS to extract message IDs from markup the same way gettext-extractor does so they line up. I was trying to replicate trimWhiteSpace = true in the browser by using theMarkup.trim() and running into issues when the markup started with a tab. Once I encountered this problem, it was simple enough for me to switch to trimWhiteSpace = false and preserveIndentation = true and let both the browser and gettext-extractor use the raw, untouched content.

@lukasgeiter
Copy link
Owner

Thanks for your answer. If your goal is to generate "clean" message IDs why not just use the same regex for removing indentation during runtime?

I'm still contemplating what to do here and having message ids like this doesn't seem very desirable to me:

Lorem ipsum
    dolor sit
    amet

@markatrosie
Copy link
Author

Because I expected trimWhiteSpace to behave the same as String.prototype.trim. Once I dug into the code and realized it didn't, I felt it was better to just leave things untouched on both sides. Copying the regex out of gettext-extractor would have worked for me, too.

@lukasgeiter
Copy link
Owner

Sorry for my late response. I've now added a comment to the documentation that hopefully makes it clear enough that, in combination with preserveIndentation, it doesn't exactly behave like .trim().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants