jusText skips content from HTML lists (ul, ol) #24

polosatyi · 2018-06-04T22:16:18Z

It seems that jusText cannot extract content from HTML lists (ul, ol tags). For example, only "Some text A. Some text C." is extracted from:
<p>Some text A.</p><ul><li>Some text B.</li></ul><p>Some text C.</p>
Is this the expected behavior? Is it possible to fix it?
Or could you please point me to where I can modify this behavior?
Example: https://plantcaretoday.com/how-to-grow-and-care-for-bougainvillea.html

miso-belica · 2018-06-05T10:14:39Z

@polosatyi Hi, to be honest I don't know where the problem is. jusText has many heuristics and it may be any one of them or a combination. I can see that <li> is a paragraph tag, so "Some text B." is treated as a new paragraph of text, and it can be removed because it's too short or for some other reason. It's been quite a while since I last worked with jusText, so it's hard to tell. I think the best approach is to tweak the CLI args such as min. length, density, ... to their minimal/maximal values to find out which one is causing the problem. If that doesn't help, it's down to good old debugging :)
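
The parameter sweep suggested above could be sketched from Python roughly as follows. This is only a rough sketch, not a diagnosis: it assumes the `justext` package is installed and that the thresholds are passed as keyword arguments to `justext.justext()`. The helper names (`report`, `classify`, `run_trials`) and the choice of trial values are made up for illustration.

```python
# The snippet from the issue report.
HTML = "<p>Some text A.</p><ul><li>Some text B.</li></ul><p>Some text C.</p>"

# One threshold at a time pushed to its most permissive value (illustrative choices).
TRIALS = {
    "defaults": {},
    "length_low=0": {"length_low": 0},
    "stopwords_low=0, stopwords_high=0": {"stopwords_low": 0, "stopwords_high": 0},
    "max_link_density=1.0": {"max_link_density": 1.0},
}


def report(paragraphs):
    """Return one (dom_path, cf_class, class_type, text) row per paragraph."""
    return [(p.dom_path, p.cf_class, p.class_type, p.text) for p in paragraphs]


def classify(html, **overrides):
    """Run jusText on html with the given keyword overrides."""
    import justext  # third-party; imported here so the helpers above work without it

    stoplist = justext.get_stoplist("English")
    return justext.justext(html, stoplist, **overrides)


def run_trials(html=HTML):
    """Print how each paragraph is classified under every trial setting."""
    for name, overrides in TRIALS.items():
        print(name)
        for row in report(classify(html, **overrides)):
            print("   ", row)
```

Calling `run_trials()` prints, for each setting, the DOM path of every paragraph together with its context-free class (`cf_class`) and its final class (`class_type`). If the `<li>` paragraph stops being classified as boilerplate under one of the settings, that threshold is a likely suspect; if it never appears in the output at all, the problem is more likely in how the paragraphs are built than in the thresholds.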

miso-belica added the bug label Jun 5, 2018

miso-belica self-assigned this Oct 14, 2021
