Stripping parts of extracted html #2527

Open
animusastralis opened this issue Apr 20, 2019 · 4 comments

@animusastralis

I'm creating a scenario that would 1) generate an RSS feed (via a WebsiteAgent that extracts URLs), then 2) scrape the HTML of the corresponding articles (also via a WebsiteAgent), and finally 3) emit a full-text RSS feed (via a DataOutputAgent).

I'm struggling to strip unwanted parts from the extracted article. For example, I use XPath to extract the article body:

    "description": {
      "xpath": "//div[@class=\"gutter-left mobile-zero\"]",
      "value": "."
    }

It happens to contain a <div class="visuallyhidden no-print">some text</div> element at the end, which then shows up in the DataOutputAgent output.

Is there an option to strip this part completely? Maybe using some XPath function?

@animusastralis changed the title from "Stripping extracted html" to "Stripping parts of extracted html" on Apr 20, 2019

@dsander
Collaborator

dsander commented Apr 21, 2019

Do you want to strip all HTML tags? normalize-space(.) or string(.) do that.

If you just want to remove that one specific div, it's probably easiest to do it with a Liquid replace filter, in either the template option of the WebsiteAgent or an EventFormattingAgent.

@animusastralis
Author

animusastralis commented Apr 21, 2019

> Do you want to strip all HTML tags? normalize-space(.) or string(.) do that.
> If you just want to remove that one specific div, it's probably easiest to do it with a Liquid replace filter, in either the template option of the WebsiteAgent or an EventFormattingAgent.

Yes, I know that these functions strip all HTML tags. What I want is a way to strip specific HTML elements together with their content, in order to remove unwanted parts like ads, links to related articles, etc.

I suspected that the template option is what I need, but the concrete implementation is unclear to me. For instance, I have an event with a payload that looks like:

{
  "title": "ARTICLE TITLE",
  "date_published": "21.04.2019",
  "author": [
    "ARTICLE AUTHOR"
  ],
  "description": "<div class=\"article-body\">ARTICLE TEXT<div class=\"ads\">AD TEXT<\/div><\/div>",
  "url": "https://example.com/article"
}

How would you strip <div class="ads">AD TEXT</div> from this payload?

@dsander
Collaborator

dsander commented Apr 22, 2019

You could use the Liquid regex_replace filter, but parsing and handling HTML with regular expressions is a bit tedious. Another option is the ReadabilityAgent: it has a few built-in rules to clean up HTML, and you can also specify a whitelist and a blacklist.
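
To illustrate why regular expressions get tedious on HTML, here is a minimal Ruby sketch. It assumes the filter ends up applying a Ruby regex to the description string; the `strip_ads` helper, the `AD_BLOCK` constant and the sample markup are hypothetical, and only the "ads" class name is taken from the payload above.

```ruby
# Hypothetical helper: drop a <div class="ads"> block with a lazy regex.
# (.|\n)*? matches as little as possible, so it stops at the FIRST </div>.
AD_BLOCK = /<div class="ads">(.|\n)*?<\/div>/

def strip_ads(html)
  html.gsub(AD_BLOCK, '')
end

# Flat ad block: removed cleanly.
flat = '<div class="article-body">ARTICLE TEXT<div class="ads">AD TEXT</div></div>'
puts strip_ads(flat)

# Ad block containing another div: the match ends at the inner </div>,
# so one stray closing tag is left behind.
nested = '<div class="article-body">ARTICLE TEXT<div class="ads"><div>AD TEXT</div></div></div>'
puts strip_ads(nested)
```

The regex works as long as the unwanted block contains no nested div. When it does, an HTML-aware approach such as the ReadabilityAgent is likely the safer choice.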

@animusastralis
Author

> You could use the Liquid regex_replace filter, but parsing and handling HTML with regular expressions is a bit tedious. Another option is the ReadabilityAgent: it has a few built-in rules to clean up HTML, and you can also specify a whitelist and a blacklist.

Thanks for your help, I think I've achieved what I was aiming for. Considering that ad blocks almost always have a unique class name, regex_replace plus the template option should work well enough.

So I've added a template option:

  "template": {
    "description": "{{ description | regex_replace: '<div class=\\x22ads\\x22>(.|\n)*?</div>', '' }}"
  }

It doesn't look very nice but it works. If there is a way to make a nicer expression I'll always be glad to see it!
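
One possibly tidier variant, sketched in plain Ruby under the assumption that regex_replace passes its pattern string to Ruby's `Regexp`: the inline `(?m)` flag lets `.` match newlines, so the `(.|\n)` alternation is no longer needed. The constant names, the `remove_ad_block` helper and the sample string are hypothetical, and this has not been verified inside a Huginn template.

```ruby
# Pattern as written in the template above (single quotes keep \x22 and \n
# as escapes for the regex engine to interpret).
VERBOSE_PATTERN = Regexp.new('<div class=\x22ads\x22>(.|\n)*?</div>')

# Same intent with Ruby's inline (?m) flag, which makes "." match newlines too.
COMPACT_PATTERN = Regexp.new('(?m)<div class=\x22ads\x22>.*?</div>')

def remove_ad_block(html, pattern = COMPACT_PATTERN)
  html.gsub(pattern, '')
end

# Hypothetical description with a line break inside the ad block.
sample = "<div class=\"article-body\">ARTICLE TEXT<div class=\"ads\">AD\nTEXT</div></div>"

puts remove_ad_block(sample, VERBOSE_PATTERN)
puts remove_ad_block(sample, COMPACT_PATTERN)
```

If that assumption holds, the pattern in the template would become '(?m)<div class=\\x22ads\\x22>.*?</div>'. Both versions still stop at the first closing div tag, so they share the same limitation with nested markup.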
