Stripping parts of extracted html #2527
Do you want to strip all HTML tags? If you just want to remove that one specific …
Yes, I know that these functions strip all HTML tags. What I want is a way to strip HTML tags together with the content inside them, in order to remove unwanted parts like ads, links to related articles, etc. Now, I suspect that …
How would you strip …
You could use the Liquid regex_replace filter, but parsing and handling HTML with regular expressions is a bit tedious. Another option is the ReadabilityAgent; it has a few built-in rules to clean up HTML, but you can also specify a whitelist and a blacklist.
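A minimal sketch of the regex_replace approach, assuming a Huginn version that ships the `regex_replace` Liquid filter and an event field named `body` (the field name and the exact div are taken from the question in this thread; `(?m)` lets `.` match across newlines in Ruby regexes):

```liquid
{{ body | regex_replace: '(?m)<div class="visuallyhidden no-print">.*?</div>', '' }}
```

As noted above, regexes are brittle against markup changes (extra attributes, reordered classes, nested divs), so this only works reliably when the unwanted block has a stable, literal form.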
Thanks for your help, I think I've achieved what I was aiming for. Considering that ad blocks almost always have a unique class name, I've added …
It doesn't look very nice, but it works. If there is a way to make a nicer expression, I'll always be glad to see it!
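The actual expression added in this comment was not captured above, but a commonly used, slightly more robust XPath 1.0 idiom for matching a single class token (so that `visuallyhidden` does not accidentally match a class like `notvisuallyhidden` the way a plain `contains(@class, ...)` would) looks like this — the `//div[@class="article-body"]` container is hypothetical:

```xpath
//div[@class="article-body"]/node()[not(contains(concat(" ", normalize-space(@class), " "), " visuallyhidden "))]
```

Padding the class attribute with spaces and searching for the space-delimited token is the standard XPath 1.0 substitute for a real "has class" test; text nodes have no `@class`, so the predicate keeps them automatically.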
I'm creating a scenario that will let me 1) generate RSS items (via a Website Agent extracting URLs), then 2) scrape the HTML of the corresponding articles (also via a Website Agent), and finally 3) emit full-text RSS (via a Data Output Agent).
I struggle with stripping unwanted parts from the extracted article. For example, when I use XPath to extract the article body, it happens to contain a

<div class="visuallyhidden no-print">some text</div>

part at the end, which then shows up in the Data Output Agent. Is there an option to completely strip this part? Maybe using some XPath function?
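One way this could look in the Website Agent's extract option — a sketch, assuming a hypothetical div.article-body container around the article; the predicate drops any child node whose class attribute contains "visuallyhidden":

```json
{
  "extract": {
    "body": {
      "xpath": "//div[@class='article-body']/node()[not(contains(@class, 'visuallyhidden'))]",
      "value": "."
    }
  }
}
```

Whether `"value": "."` gives you plain text or serialized markup depends on the Website Agent version, so check the agent's inline documentation before relying on this for full-text RSS.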