CSS selectors only match the first found element #215

arokanto · 2020-12-14T13:36:44Z

Minimal HTML example

<div>
    <div class="my-header">header</div>
    <div class="my-content">one</div>
    <div class="my-content">two</div>
    <div class="my-content">three</div>
    <div class="my-footer">footer</div>
</div>

Options

{
  baseElement: ['div.my-header', 'div.my-content', 'div.my-footer']
}

Observed output

header
one
footer

Expected output

header
one
two
three
footer

Version information

html-to-text: 6.0.0
node: 14.15.1

I realize that this may be intended behavior, but it just threw me off, because usually the same class is used in several tags when building HTML. Also there doesn't seem to be an easy way to circumvent this problem, since in my use case the code is generated, so I can't just change the div elements to table elements and use the tables attribute.

The text was updated successfully, but these errors were encountered:

KillyMXI · 2020-12-14T13:54:08Z

Yeah, this was looking weird to me too, when I realized how it works.

I may have to revisit that part of the code to fix that.

Is there any particular reason to use baseElement though? Is it that you're trying to pick a particular div and skip the rest but only have class identifiers on it's children?

For now it might require external preprocessing of the document.
Or a custom formatter for, let's say, the body tag, that will find the right children to walk through.
(This gives me some ideas, but unfortunately I can't test it right now and provide any example code.)

arokanto · 2020-12-14T14:06:35Z

Yeah, sorry my example wasn't complete in that regard. I have other div elements that should not be converted to text, which is why I was looking for a way to cherry pick only some elements.

I didn't realize I could build a custom formatter for this. That might be the easiest quick fix. Preprocessing the document sounds somewhat fragile in my use case. Thanks for the tip!

If you do choose to revisit that part of the code, one option might be to replace that custom recursive function with domutils, which should happily eat the dom given by htmlparser2. Then you could just use domutils's findAll function to get the expected result.

KillyMXI · 2021-02-08T16:15:51Z

Current behavior:

for each selector walk the whole tree (within the configured limits) and stop at the first found item

Possible alternatives:

look for every outermost tag matching any of the selectors (it probably makes no sense to look inside of already picked tag for any selector, same or different);
look for every selectors that include tag and/or class name and only for the first one that includes an id;
have some options to select the desired behavior.

I think the first one is more straightforward but with some possibility of a performance drop when looking for extra occurrences when only one is needed. But the other ones might be a bit too complicated fwiw. Although there was a proposal to provide the limit for the number of picked bases too - it didn't make sense before but might make sense now...

domutils implementation looks inside matched nodes. My other concern about it is that I'm considering to swap the htmlparser2 for something else in the future.
No big deal to provide own find implementation.

KillyMXI · 2021-02-10T22:54:14Z

Current behavior allows to manipulate the order of matched elements in the output.
I don't know how often it is used, but that's a meaningful feature.
Walking the whole DOM N times to preserve it - seems wasteful.

Possible solutions:

strategy selection specifically for this feature;
provide a way to manipulate the parsed DOM before further processing instead;
get rid of the specific feature and ship with formatters that allow to achieve the desired goal.

More careful consideration is needed. I think I will not touch this part for now. In the long run I want multiple existing problems to be solvable with built-in formatters and this falls into a larger task of figuring out the set of good building blocks.

KillyMXI · 2021-06-02T15:15:04Z

I think I came up with a reasonable solution while integrating new selectors support.

I've pushed new code into a separate branch - selectors
Tracking issue: #228

I'd be grateful if you can test this version on real tasks and provide your feedback before release (in a week probably).
(You can install the package directly from GitHub branch. Instruction is in the tracking issue.)

KillyMXI · 2021-06-09T15:37:21Z

Version 8 in now live.
It uses the new selectors feature to match base elements as well.
The behavior is changed to include all outermost elements matching any of provided selectors.
All found bases can be ordered by selectors array (default) or by appearance in the input document.

This was referenced Feb 8, 2021

Version 7 roadmap (tracking issue) #222

Closed

Upgrade htmlparser2 to version 6 #220

Merged

KillyMXI mentioned this issue Jun 2, 2021

Version 8 roadmap (tracking issue) #228

Closed

11 tasks

KillyMXI closed this as completed Jun 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSS selectors only match the first found element #215

CSS selectors only match the first found element #215

arokanto commented Dec 14, 2020

KillyMXI commented Dec 14, 2020 •

edited

Loading

arokanto commented Dec 14, 2020

KillyMXI commented Feb 8, 2021 •

edited

Loading

KillyMXI commented Feb 10, 2021

KillyMXI commented Jun 2, 2021

KillyMXI commented Jun 9, 2021

CSS selectors only match the first found element #215

CSS selectors only match the first found element #215

Comments

arokanto commented Dec 14, 2020

KillyMXI commented Dec 14, 2020 • edited Loading

arokanto commented Dec 14, 2020

KillyMXI commented Feb 8, 2021 • edited Loading

KillyMXI commented Feb 10, 2021

KillyMXI commented Jun 2, 2021

KillyMXI commented Jun 9, 2021

KillyMXI commented Dec 14, 2020 •

edited

Loading

KillyMXI commented Feb 8, 2021 •

edited

Loading