Stripping parts of extracted html #2527

animusastralis · 2019-04-20T11:50:20Z

I'm creating a scenario that would let me 1) generate an RSS feed (via a Website agent that extracts URLs), then 2) scrape the HTML of the corresponding articles (also via a Website agent), and finally 3) emit a full-text RSS feed (via a DataOutput agent).

I'm struggling to strip unwanted parts from the extracted article. For example, I use XPath to extract the article body:

    "description": {
      "xpath": "//div[@class=\"gutter-left mobile-zero\"]",
      "value": "."
    }

It happens to contain a <div class=\"visuallyhidden no-print\">some text</div> element at the end, which then shows up in the DataOutput agent's output.

Is there an option to completely strip this part? Maybe using some xpath function?

dsander · 2019-04-21T08:16:41Z

Do you want to strip all HTML tags? normalize-space(.) or string(.) do that.

If you just want to remove that one specific div, it's probably easiest to do it with a liquid replace filter in either the template option of the WebsiteAgent or an EventFormattingAgent.
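
To make the difference between the two approaches concrete, here is a minimal Python sketch. It is not Huginn code: it uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed fragment (the `<p>Article text</p>` part is an assumption for illustration). Joining all text nodes is roughly what `string(.)` does, while removing one child element keeps the rest of the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the extracted article body.
html = (
    '<div class="gutter-left mobile-zero">'
    '<p>Article text</p>'
    '<div class="visuallyhidden no-print">some text</div>'
    '</div>'
)

root = ET.fromstring(html)

# Roughly what XPath string(.) yields: every text node joined, all tags dropped.
all_text = "".join(root.itertext())

# Removing only the unwanted direct child keeps the remaining markup intact.
for child in list(root):
    if child.get("class") == "visuallyhidden no-print":
        root.remove(child)

cleaned = ET.tostring(root, encoding="unicode")

print(all_text)
print(cleaned)
```

Note that the first approach keeps the unwanted text and loses the markup, which is the opposite of what the question asks for.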

animusastralis · 2019-04-21T09:06:21Z

> Do you want to strip all HTML tags? normalize-space(.) or string(.) do that.
>
> If you just want to remove that one specific div, it's probably easiest to do it with a liquid replace filter in either the template option of the WebsiteAgent or an EventFormattingAgent.

Yes, I know that these functions strip all HTML tags. What I want is a way to strip specific HTML elements together with their content, in order to remove unwanted parts like ads, links to related articles, etc.

I suspected that the template option is what I need, but the exact implementation is unclear to me. For instance, I have an event with a payload that looks like:

{
  "title": "ARTICLE TITLE",
  "date_published": "21.04.2019",
  "author": [
    "ARTICLE AUTHOR"
  ],
  "description": "<div class=\"article-body\">ARTICLE TEXT<div class=\"ads\">AD TEXT<\/div><\/div>",
  "url": "https://example.com/article"
}

How would you strip <div class=\"ads\">AD TEXT<\/div> from this payload?

dsander · 2019-04-22T08:21:12Z

You could use the liquid regex_replace filter, but parsing and handling HTML with regular expressions is a bit tedious. Another option is the ReadabilityAgent: it has a few built-in rules to clean up HTML, and you can also specify a whitelist and a blacklist.
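
One reason regular expressions get tedious here is nesting. The following Python sketch (an illustration with Python's `re` module, not Huginn's `regex_replace`; the `nested` sample is hypothetical) shows a non-greedy pattern working on the flat payload from above but stopping at the first `</div>` when the ad block contains another div.

```python
import re

# Non-greedy match from the opening ad div to the nearest closing </div>.
AD_BLOCK = re.compile(r'<div class="ads">.*?</div>', re.DOTALL)

# Flat case, as in the payload from the question.
flat = '<div class="article-body">ARTICLE TEXT<div class="ads">AD TEXT</div></div>'

# Hypothetical case where the ad block itself contains a nested div.
nested = '<div class="article-body">TEXT<div class="ads"><div>AD</div> MORE AD</div></div>'

flat_result = AD_BLOCK.sub("", flat)
nested_result = AD_BLOCK.sub("", nested)

print(flat_result)    # ad block removed cleanly
print(nested_result)  # leftover ad text and an unbalanced </div>
```

So a regex is fine as long as the unwanted block never contains another element of the same tag name; otherwise an HTML-aware tool is the safer choice.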

animusastralis · 2019-04-22T10:21:35Z

> You could use the liquid regex_replace filter, but parsing and handling HTML with regular expressions is a bit tedious. Another option is the ReadabilityAgent: it has a few built-in rules to clean up HTML, and you can also specify a whitelist and a blacklist.

Thanks for your help, I think I've achieved what I was aiming for. Considering that ad blocks almost always have a unique classname, regex_replace + template option should work well enough.

So I've added a template option:

  "template": {
    "description": "{{ description | regex_replace: '<div class=\\x22ads\\x22>(.|\n)*?</div>', '' }}"
  }

It doesn't look very nice but it works. If there is a way to make a nicer expression I'll always be glad to see it!
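
On the "nicer expression" question, one possible simplification is to let the dot match newlines instead of spelling out `(.|\n)`. The Python sketch below only demonstrates that the two patterns behave the same on a sample string; it is not Huginn code. In Python the inline flag is `(?s)`. Ruby regular expressions use `(?m)` for the same purpose, and whether that inline flag can be passed through `regex_replace` in Huginn is an assumption I have not verified.

```python
import re

# Sample payload value with a newline inside the ad block.
description = '<div class="article-body">ARTICLE TEXT<div class="ads">AD\nTEXT</div></div>'

# The pattern from the template above: \x22 is a hex escape for the double quote.
original = re.compile(r'<div class=\x22ads\x22>(.|\n)*?</div>')

# Same idea with a "dot matches newline" flag instead of (.|\n).
simpler = re.compile(r'(?s)<div class="ads">.*?</div>')

original_result = original.sub("", description)
simpler_result = simpler.sub("", description)

print(original_result)
print(simpler_result)
```

The `\x22` escapes are still handy inside the JSON template, since they avoid another layer of quote escaping.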

animusastralis changed the title ~~Stripping extracted html~~ Stripping parts of extracted html Apr 20, 2019
