Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSS selectors only match the first found element #215

Closed
arokanto opened this issue Dec 14, 2020 · 6 comments
Closed

CSS selectors only match the first found element #215

arokanto opened this issue Dec 14, 2020 · 6 comments

Comments

@arokanto
Copy link

Minimal HTML example

<div>
    <div class="my-header">header</div>
    <div class="my-content">one</div>
    <div class="my-content">two</div>
    <div class="my-content">three</div>
    <div class="my-footer">footer</div>
</div>

Options

{
  baseElement: ['div.my-header', 'div.my-content', 'div.my-footer']
}

Observed output

header
one
footer

Expected output

header
one
two
three
footer

Version information

  • html-to-text: 6.0.0
  • node: 14.15.1

I realize that this may be intended behavior, but it just threw me off, because usually the same class is used in several tags when building HTML. Also there doesn't seem to be an easy way to circumvent this problem, since in my use case the code is generated, so I can't just change the div elements to table elements and use the tables attribute.

@KillyMXI
Copy link
Member

KillyMXI commented Dec 14, 2020

Yeah, this was looking weird to me too, when I realized how it works.

I may have to revisit that part of the code to fix that.

Is there any particular reason to use baseElement though? Is it that you're trying to pick a particular div and skip the rest but only have class identifiers on it's children?

For now it might require external preprocessing of the document.
Or a custom formatter for, let's say, the body tag, that will find the right children to walk through.
(This gives me some ideas, but unfortunately I can't test it right now and provide any example code.)

@arokanto
Copy link
Author

Yeah, sorry my example wasn't complete in that regard. I have other div elements that should not be converted to text, which is why I was looking for a way to cherry pick only some elements.

I didn't realize I could build a custom formatter for this. That might be the easiest quick fix. Preprocessing the document sounds somewhat fragile in my use case. Thanks for the tip!

If you do choose to revisit that part of the code, one option might be to replace that custom recursive function with domutils, which should happily eat the dom given by htmlparser2. Then you could just use domutils's findAll function to get the expected result.

@KillyMXI
Copy link
Member

KillyMXI commented Feb 8, 2021

Current behavior:

  • for each selector walk the whole tree (within the configured limits) and stop at the first found item

Possible alternatives:

  • look for every outermost tag matching any of the selectors (it probably makes no sense to look inside of already picked tag for any selector, same or different);
  • look for every selectors that include tag and/or class name and only for the first one that includes an id;
  • have some options to select the desired behavior.

I think the first one is more straightforward but with some possibility of a performance drop when looking for extra occurrences when only one is needed. But the other ones might be a bit too complicated fwiw. Although there was a proposal to provide the limit for the number of picked bases too - it didn't make sense before but might make sense now...

domutils implementation looks inside matched nodes. My other concern about it is that I'm considering to swap the htmlparser2 for something else in the future.
No big deal to provide own find implementation.

@KillyMXI
Copy link
Member

Current behavior allows to manipulate the order of matched elements in the output.
I don't know how often it is used, but that's a meaningful feature.
Walking the whole DOM N times to preserve it - seems wasteful.

Possible solutions:

  • strategy selection specifically for this feature;
  • provide a way to manipulate the parsed DOM before further processing instead;
  • get rid of the specific feature and ship with formatters that allow to achieve the desired goal.

More careful consideration is needed. I think I will not touch this part for now. In the long run I want multiple existing problems to be solvable with built-in formatters and this falls into a larger task of figuring out the set of good building blocks.

@KillyMXI
Copy link
Member

KillyMXI commented Jun 2, 2021

I think I came up with a reasonable solution while integrating new selectors support.

I've pushed new code into a separate branch - selectors
Tracking issue: #228

I'd be grateful if you can test this version on real tasks and provide your feedback before release (in a week probably).
(You can install the package directly from GitHub branch. Instruction is in the tracking issue.)

@KillyMXI
Copy link
Member

KillyMXI commented Jun 9, 2021

Version 8 in now live.
It uses the new selectors feature to match base elements as well.
The behavior is changed to include all outermost elements matching any of provided selectors.
All found bases can be ordered by selectors array (default) or by appearance in the input document.

@KillyMXI KillyMXI closed this as completed Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants