jusText skips content from HTML lists (ul, ol) #24

Open
polosatyi opened this issue Jun 4, 2018 · 1 comment

@polosatyi

Hey, @miso-belica

It seems that jusText cannot extract content from HTML lists (ul, ol tags). For example, only "Some text A. Some text C." is extracted from:
<p>Some text A.</p><ul><li>Some text B.</li></ul><p>Some text C.</p>
Is this normal behavior? Is it possible to fix?
Or could you please point me to where I can modify this behavior?
Example: https://plantcaretoday.com/how-to-grow-and-care-for-bougainvillea.html
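For reference, here is a minimal sketch of how the sample markup above breaks into text blocks when every block-level tag starts a new block. It uses only the Python standard library and does not call jusText. The `BLOCK_TAGS` set and the `split_blocks` helper are assumptions made up for illustration; they are not jusText's actual tag list or code.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative set of block-level tags,
# not the list jusText actually uses.
BLOCK_TAGS = {"p", "li", "ul", "ol", "div"}


class BlockSplitter(HTMLParser):
    """Collect text into blocks, closing the current block at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []

    def _flush(self):
        # Normalize whitespace and keep only non-empty blocks.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def split_blocks(html):
    """Hypothetical helper: return the list of text blocks found in `html`."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


SAMPLE = "<p>Some text A.</p><ul><li>Some text B.</li></ul><p>Some text C.</p>"
for block in split_blocks(SAMPLE):
    print(len(block), repr(block))
```

Under these assumptions the list item becomes its own 12-character block, the same length as its two neighbours, so any per-block length or density rule is applied to it separately.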

@miso-belica
Owner

@polosatyi Hi, to be honest I don't know where the problem is. jusText has many heuristics and it may be any of them, or a combination. I can see that <li> is a paragraph tag, so "Some text B." is treated as a new paragraph of text, and it can be dropped because it's too short or for some other reason. It's been quite a while since I last worked with jusText, so it's hard to tell. I think the best thing is to try tweaking some CLI args (min. length, density, ...) to their minimal/maximal values to find out which one is causing the problem. If that does not help, it's down to good old debugging :)
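The "tweak one setting at a time" approach can be sketched as a small harness. This is only an illustration: `find_responsible_settings` and `toy_extract` are hypothetical names, and `toy_extract` is a stand-in with two made-up thresholds, not jusText's classifier. The harness takes any callable that accepts the HTML plus keyword settings and returns the list of kept texts.

```python
import re


def find_responsible_settings(extract, html, needle, relaxed_settings):
    """Return the settings that, relaxed one at a time, make `needle` reappear.

    `extract(html, **settings)` must return a list of kept text blocks.
    """
    if any(needle in text for text in extract(html)):
        return []  # nothing is missing with the default settings
    culprits = []
    for name, value in relaxed_settings.items():
        kept = extract(html, **{name: value})
        if any(needle in text for text in kept):
            culprits.append(name)
    return culprits


def toy_extract(html, min_length=20, min_words=2):
    """Stand-in extractor (assumption): split on tags, drop blocks that are too short."""
    blocks = [b.strip() for b in re.split(r"<[^>]+>", html) if b.strip()]
    return [b for b in blocks
            if len(b) >= min_length and len(b.split()) >= min_words]


SAMPLE = "<p>Some text A.</p><ul><li>Some text B.</li></ul><p>Some text C.</p>"
culprits = find_responsible_settings(
    toy_extract, SAMPLE, "Some text B.",
    {"min_length": 0, "min_words": 0},
)
print(culprits)
```

To try this against jusText itself, `extract` could be a thin wrapper around `justext.justext(html, justext.get_stoplist("English"), **settings)` that returns the text of every paragraph whose `is_boilerplate` is false, with relaxed values such as `length_low=0`, `stopwords_low=0.0` or `max_link_density=1.0`. Those keyword names should be checked against the installed version. Because the harness relaxes one setting at a time, it will not detect a paragraph that is dropped only by a combination of heuristics.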

@miso-belica miso-belica added the bug label Jun 5, 2018
@miso-belica miso-belica self-assigned this Oct 14, 2021