HUGINN cannot scrape certain websites #3099

lemonGon · 2022-03-22T09:22:05Z

Hi all,
I'm new to HUGINN. I have managed to scrape three websites so far, but there are others from which I cannot seem to pull any data at all.

This is more of a question than a real issue, as I suspect I am doing something wrong.

Here is my case.

I want to scrape this link, which seems rather ordinary to me:
List of jobs

And this is my Website Agent:

{
  "expected_update_period_in_days": "2",
  "url": "https://www.irishjobs.ie/ShowResults.aspx?Keywords=php&Location=102&SortBy=MostRecent&PerPage=100&Recruiter=All",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "title": {
      "css": ".two-thirds .module-content .job-result-title",
      "value": "string(.)"
    }
  }
}

While the same approach worked on the other three HTML pages, it doesn't seem to work on this one.
The Dry Run returns an empty list (as does the saved agent), and although I have tried both xpath and css, the result is always the same: an empty list of events.

I encounter the very same issue when I try to scrape the price of this bookshelf on Trademax (https://www.trademax.se/f%C3%B6rvaring/hyllor/bokhylla/skanelija-bokhylla-svart-p882980).

It doesn't matter whether I use xpath or css: as far as HUGINN is concerned, there is nothing inside the element with the id productInfoPrice. It looks like <div id="productInfoPrice">....</div> is totally empty.

This is my Website Agent for scraping the bookshelf:

{
  "expected_update_period_in_days": "2",
  "url": "https://www.trademax.se/f%C3%B6rvaring/hyllor/bokhylla/skanelija-bokhylla-svart-p882980",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "title": {
      "css": "#productInfoPrice",
      "value": "."
    }
  }
}

The Dry Run shows that <div id="productInfoPrice"> is completely empty.

As I said, the same approach worked on three other websites, but in this case it simply doesn't and returns an empty list of events.

Do you have any suggestions? I'm really grasping at straws here :-(


yorch · 2022-05-14T04:41:00Z

The website you are trying to scrape returns HTML like this to the browser:

...
<div class="productInfoContent--buySectionBlock"><div id="productInfoPrice"></div><div id="discountedPeriodInfo"></div>
...

JavaScript on the page fills in the price after the HTML has loaded in the browser, so the raw HTML that HUGINN fetches still contains an empty element.
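To make that concrete, here is a minimal sketch (plain Python, standard library only, not anything HUGINN itself runs) that parses the fragment quoted above without executing any JavaScript and collects the text inside the price element. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class ElementTextCollector(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_of(html, element_id):
    """Return the stripped text content of the element with this id."""
    parser = ElementTextCollector(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# The fragment as served, before any JavaScript has run.
raw = ('<div class="productInfoContent--buySectionBlock">'
       '<div id="productInfoPrice"></div><div id="discountedPeriodInfo"></div>')
print(repr(text_of(raw, "productInfoPrice")))  # prints ''
```

Any scraper that does not run the page's scripts sees the same thing, which is why neither css nor xpath selectors find a price here.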

Without looking too closely, it seems the price you are looking for is sent in the same HTML document (not retrieved through a subsequent request to an API), so take a look at the raw HTML and search for var componentsData = {, which is a giant JS object with all of the product info.
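As a rough sketch of that idea, the snippet below locates the assignment in the raw HTML and parses the object that follows it. It assumes the object literal happens to be valid JSON (double-quoted keys, no functions, no trailing commas), which may not hold for the real page. The sample page and its "product" and "price" fields are invented for illustration; the real structure of componentsData has to be checked by hand.

```python
import json

MARKER = "var componentsData"


def extract_components_data(html, marker=MARKER):
    """Return the object assigned after `marker`, parsed as JSON.

    Raises ValueError if the marker or the opening brace is missing,
    or if the object is not valid JSON.
    """
    marker_pos = html.index(marker)
    start = html.index("{", marker_pos)
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so the trailing ";</script>..." does not need to be cut off first.
    data, _end = json.JSONDecoder().raw_decode(html, start)
    return data


# Hypothetical page: the field names below are assumptions, not the real ones.
sample = """
<div id="productInfoPrice"></div>
<script>var componentsData = {"product": {"name": "Bookshelf", "price": 1995}};</script>
"""

data = extract_components_data(sample)
print(data["product"]["price"])  # prints 1995
```

If the object turns out not to be valid JSON, a regular expression aimed at just the price field is a cruder fallback.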

gingerbeardman · 2022-11-02T11:50:33Z

An alternative would be to run the page through browserless or PhantomJS to get the HTML after the JavaScript has run, and parse that with a normal Website Agent.
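A hedged sketch of what the rendering step could look like from a script, assuming a self-hosted browserless instance that exposes a /content endpoint accepting a JSON body with a url field and returning the rendered HTML. The host, port, and any required token are assumptions, so check the documentation for your version before relying on this.

```python
import json
import urllib.request


def build_render_request(service_url, page_url):
    """Build a POST request asking the rendering service for page_url."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(service_url, page_url, timeout=60):
    """Send the request and return the response body as text."""
    request = build_render_request(service_url, page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (hypothetical local instance, not executed here):
# html = fetch_rendered_html("http://localhost:3000/content",
#                            "https://www.trademax.se/")
```

The returned HTML should then contain the filled-in price element, so the original css selector has something to match.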
