HUGINN cannot scrap certain websites #3099

Open
lemonGon opened this issue Mar 22, 2022 · 2 comments


lemonGon commented Mar 22, 2022

Hi all,
I'm new to HUGINN, and while I have managed to scrape 3 websites so far, there are some others from which I can't seem to pull any data at all.

This is more of a question than a real issue, I suppose, as I think I might be doing something wrong.

Here is my case.

I want to scrape this page, which seems rather ordinary to me:
List of jobs

And this is my Website Agent:

{
  "expected_update_period_in_days": "2",
  "url": "https://www.irishjobs.ie/ShowResults.aspx?Keywords=php&Location=102&SortBy=MostRecent&PerPage=100&Recruiter=All",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "title": {
      "css": ".two-thirds .module-content .job-result-title",
      "value": "string(.)"
    }
  }
}

The DryRun result

While the same approach worked on 3 other HTML pages, it doesn't seem to work on this one.
The DryRun returns an empty list (as does the saved agent), and although I have tried both xpath and css, the result doesn't change: an empty list of events.


I encounter the very same issue when I try to scrape [the price of this bookshelf](https://www.trademax.se/f%C3%B6rvaring/hyllor/bokhylla/skanelija-bokhylla-svart-p882980) on Trademax.

It doesn't matter whether I use xpath or css: to HUGINN, it looks as if there is nothing inside the element with the id #productInfoPrice. In other words, <div id="productInfoPrice">....</div> appears to be totally empty.

This is my Website Agent for scraping the bookshelf price:

{
  "expected_update_period_in_days": "2",
  "url": "https://www.trademax.se/f%C3%B6rvaring/hyllor/bokhylla/skanelija-bokhylla-svart-p882980",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "title": {
      "css": "#productInfoPrice",
      "value": "."
    }
  }
}

Result of the dry run
The DryRun shows that <div id="productInfoPrice"> looks completely empty.

As I said, the same approach worked on 3 other websites, but in this case it simply doesn't and returns an empty list of events.

Do you have any suggestions? I'm really grasping at straws here :-(


yorch commented May 14, 2022

The website you are trying to scrape returns HTML like this to the browser:

...
<div class="productInfoContent--buySectionBlock"><div id="productInfoPrice"></div><div id="discountedPeriodInfo"></div>
...

There is JS on the page that fills in the price after the HTML has loaded in the browser. The Website Agent does not run that JS, so it only ever sees the empty <div>.
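To check this kind of thing outside the browser, here is a minimal sketch in Python (standard library only) that reports the text inside an element with a given id in the raw, unrendered HTML. The class and function names are mine, not part of Huginn, and the second sample string with a filled-in price is made up for contrast.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text found inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.done = False   # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_in(html, target_id):
    """Return the stripped text inside the element with id `target_id`."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# The fragment quoted above, as served before any JS has run.
raw = ('<div class="productInfoContent--buySectionBlock">'
       '<div id="productInfoPrice"></div>'
       '<div id="discountedPeriodInfo"></div></div>')

# A made-up example of what a rendered page might contain instead.
rendered = '<div id="productInfoPrice"><span>1 299</span> kr</div>'

print(repr(text_in(raw, "productInfoPrice")))       # empty string
print(repr(text_in(rendered, "productInfoPrice")))  # text is present
```

If the first call comes back empty for the HTML you actually download, no css or xpath selector will help, because the data is simply not in that element yet.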

Without looking too closely, it seems the price you are after is sent in the same HTML document (not retrieved through a subsequent request to an API), so you should take a look at the raw HTML and search for var componentsData = {, which is a giant JS object with all of the product info.
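A rough sketch of that idea in Python, for exploring the page before wiring anything into Huginn. It assumes the object literal after var componentsData = happens to be valid JSON, which I have not verified for this site; the function name and the keys in the sample string are hypothetical.

```python
import json


def extract_js_object(html, marker="var componentsData ="):
    """Parse the object literal that follows `marker` in the raw HTML.

    Assumes the literal is valid JSON; json.JSONDecodeError is raised
    if it is not (for example, unquoted keys or trailing commas).
    """
    start = html.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found in HTML")
    brace = html.find("{", start + len(marker))
    if brace == -1:
        raise ValueError("no object literal found after the marker")
    # raw_decode parses one JSON value starting at `brace` and ignores
    # whatever follows it (the trailing semicolon, the rest of the page).
    obj, _end = json.JSONDecoder().raw_decode(html, brace)
    return obj


# Made-up sample; the real structure and key names will differ.
sample = '<script>var componentsData = {"product": {"price": "1299"}};</script>'

data = extract_js_object(sample)
print(data["product"]["price"])
```

Once you know where the price sits in that object, the Website Agent itself can pull it out of the raw document: with "type": "text", each extract entry takes a "regexp" and an "index" instead of css/xpath.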

@gingerbeardman

An alternative would be to run the page through browserless or PhantomJS to get the HTML after the JavaScript has run, and then parse that with a normal Website Agent.
