[ {} ] does not work #65

Open
monolithed opened this Issue Jul 5, 2015 · 30 comments

@monolithed

monolithed commented Jul 5, 2015

{
     addresses: [
           {
               address: '.address',
               location: '.location'
          }
     ]
}

Returns:

{ addresses: [] }

Expected:

{
     addresses: [
           {
               address: 'address',
               location: 'location'
          }
     ]
}
@matthewmueller

Owner

matthewmueller commented Jul 6, 2015

what does the HTML look like?

@monolithed

monolithed commented Jul 6, 2015

It looks like:

<div id="addresses"> 
   <ul>
        <li> 
               <span class="address"> address  </span>
               <span class="location"> location  </span>
        </li>
  </ul>
</div>
@disbelief

disbelief commented Jul 8, 2015

@monolithed I think for your example to work you'd need to add class="address" and class="location" to those <span> tags.

However I believe the original issue still stands. I've been struggling with the same problem: the only array selector that works is a basic string: x(url, ['.selector']).

@monolithed

monolithed commented Jul 8, 2015

@disbelief, thanks, fixed.

@matthewmueller

Owner

matthewmueller commented Jul 8, 2015

Try this:

x('http://example.com', {
  addresses: x('.addresses li', [{
    address: '.address',
    location: '.location'
  }])
})(fn)
@disbelief

disbelief commented Jul 8, 2015

@matthewmueller yes that works! I guess I hadn't tried that specific combination of nesting the x() call.

However I'm trying to do something along the following lines, and xray seems to just die on me:

x('http://example.com',
  {
    companies: x('.vcard', [
      {
        name: 'a.url',
        domain: x('a.url@href', '.domain')
      }
    ])
  }
)(function(err, obj){
  if (err) {
    console.error(err);
  }
  console.log("RESULT:", obj);
});

So the first page read in by xray contains a list of links, and I'd like xray to follow each link to grab the .domain element on the resulting page.

Should this work?

When I run the above, I see no console output at all but if I remove the domain: key, the correct array of objects with name attributes is returned.

@matthewmueller

Owner

matthewmueller commented Jul 8, 2015

yah, so for the domain key, the first argument would be the scope, the second would be the selector.

I don't understand exactly what you're trying to do, but you should use x(...) if you want to narrow down or scope the selection. My guess is that this:

domain: x('.domain', 'a.url@href')

is what you want. But in this case you could probably just do:

domain: '.domain a.url@href'
@disbelief

disbelief commented Jul 8, 2015

Here's what I'm trying to do:

<!-- example.com/index.html -->
<div id="list">
  <div class="company">
    <a class="url" href="./company1.html">Company1</a>
  </div>
  <div class="company">
    <a class="url" href="./company2.html">Company2</a>
  </div>
</div>
<!-- example.com/company1.html -->
<div id="profile">
  <span class="domain">company1.com</span>
</div>
<!-- example.com/company2.html -->
<div id="profile">
  <span class="domain">company2.com</span>
</div>
x("example.com/index.html",
  {
    companies: x('.company', [
      {
        name: 'a.url',
        domain: x('.domain', 'a.url@href')
      }
    ])
  }
)(function(err, object){
  console.log("done", err, object);
});

returns the following:

{
  companies: [
    {
      name: "Company1"
    },
    {
      name: "Company2"
    }
  ]
}

note the lack of a domain attribute on the nested objects.

If I swap the arguments for the domain lookup to

domain: x('a.url@href', '.domain')

there is no result logged to the console. It does seem to do something, but nothing is returned to the callback function.

@matthewmueller

Owner

matthewmueller commented Jul 8, 2015

oh hm, you're following links. got it. okay i think that should work, can you run the program using DEBUG=x-ray*?

@disbelief

disbelief commented Jul 8, 2015

DEBUG=x-ray* babel-node index.js 
  x-ray params: {"source":"http://example.com/index.html","scope":null,"selector":{}} +0ms
  x-ray starting at: http://example.com/index.html +1ms
  x-ray fetching http://example.com/index.html +0ms
  x-ray-crawler getting: http://example.com/index.html +1ms
  x-ray:phantom going to http://example.com/index.html +0ms
  x-ray:phantom got response from http://example.com/index.html: 200 +2s
  x-ray:phantom redirect: http://example.com/index.html +1ms
  x-ray:phantom got response from http://example.com/index.html: 200 +3ms
  x-ray:phantom http://example.com/index.html - 200 +1s
  x-ray got response for http://example.com/index.html with status code: 200 +3s
  x-ray params: {"scope":".company","selector":[{"name":"a.url"}]} +135ms
[TypeError: Converting circular structure to JSON]
@matthewmueller

Owner

matthewmueller commented Jul 8, 2015

this is exactly the output (minus base url)? you're sure you kept all the pathnames intact? it doesn't seem like it's going to company1 and company2 at all

@disbelief

disbelief commented Jul 8, 2015

I thought the same thing so I double checked and yes this is the output. The secondary pages are never visited.

That TypeError only seems to show up in debug mode also.

@disbelief

disbelief commented Jul 8, 2015

Also notice that the "selector" doesn't include the "domain" key in that last "params" debug line. Not sure if that's expected or not.

@matthewmueller

Owner

matthewmueller commented Jul 8, 2015

Okay hm. Can you verify via curl or whatever that the HTML you're getting back contains the proper links that you described (company1, company2, etc.) and that they're not being set on the client side?

@disbelief

disbelief commented Jul 8, 2015

Yep I'm actually doing this with locally hosted pages. Their markup is exactly as you see in my snippets above, just a local domain name (not example.com).

If I remove that domain lookup, it does return the other attributes on the page correctly. eg.

x('http://localhost/home.html',
  {
    companies: x('.company', [
      {
        name: 'a.url',
        url: 'a.url@href'
      }
    ])
  }
)(function(err, object){
  console.log(object);
});

outputs:

{ 
  companies: [ 
     { name: 'Company1', url: 'http://localhost/company1.html' },
     { name: 'Company2', url: 'http://localhost/company2.html' } 
  ]
}
@matthewmueller

Owner

matthewmueller commented Jul 8, 2015

dang, that's strange. so those domain urls should hit this chunk of the code:

https://github.com/lapwinglabs/x-ray/blob/master/index.js#L88-L104

Can you investigate in your local copy of x-ray why it may not be hitting that?

@matthewmueller

Owner

matthewmueller commented Jul 8, 2015

To be clear... using: domain: x('a.url@href', '.domain') as your logic.

@disbelief

disbelief commented Jul 8, 2015

sure I'll throw a few more debug lines in and see what I can find out.

@disbelief

disbelief commented Jul 9, 2015

No joy yet, but a bit of news:

It appears that this debug call was causing the TypeError whilst in debug mode: https://github.com/lapwinglabs/x-ray/blob/master/index.js#L75

Commenting it out allows xray to execute further. In fact it seems to do exactly what is expected, however the final function is never called. This is the entire output:

  x-ray starting at: http://example.com/index.html +0ms
  x-ray fetching http://example.com/index.html +1ms
  x-ray-crawler getting: http://example.com/index.html +1ms
  x-ray:phantom going to http://example.com/index.html +0ms
  x-ray:phantom got response from http://example.com/index.html: 200 +1s
  x-ray:phantom redirect: http://example.com/index.html +0ms
  x-ray:phantom got response from http://example.com/index.html: 200 +1ms
  x-ray:phantom http://example.com/index.html - 200 +2s
  x-ray got response for http://example.com/index.html with status code: 200 +3s
  x-ray resolving to a url: a.url@href +135ms
  x-ray resolved "a.url@href" to a http://example.com/company1.html +0ms
  x-ray fetching http://example.com/company1.html +0ms
  x-ray-crawler queued: "http://example.com/company1.html", waiting "0ms" +0ms
  x-ray resolving to a url: a.url@href +1ms
  x-ray resolved "a.url@href" to a http://example.com/company2.html +0ms
  x-ray fetching http://example.com/company2.html +0ms
  x-ray-crawler queued: "http://example.com/company2.html", waiting "0ms" +0ms
  x-ray-crawler getting: http://example.com/company1.html +0ms
  x-ray:phantom going to http://example.com/company1.html +146ms
  x-ray-crawler getting: http://example.com/company2.html +1ms
  x-ray:phantom going to http://example.com/company2.html +1ms
  x-ray:phantom got response from http://example.com/company1.html: 200 +1s
  x-ray:phantom redirect: http://example.com/company1.html +0ms
  x-ray:phantom got response from http://example.com/company1.html: 200 +1ms
  x-ray:phantom got response from http://example.com/company2.html: 200 +2s
  x-ray:phantom redirect: http://example.com/company2.html +0ms
  x-ray:phantom got response from http://example.com/company2.html: 200 +1ms
  x-ray:phantom http://example.com/company1.html - 200 +671ms
  x-ray got response for http://example.com/company1.html with status code: 200 +14s

notice that there's no line reading: x-ray got response for http://example.com/company2.html, and the xray callback function does not fire. The process simply terminates silently at this point.

@kvetoslavnovak

kvetoslavnovak commented Aug 11, 2015

Hello, first of all thank you matthewmueller for the great job!

Unfortunately I have to confirm the same problem. I am trying to follow three links from one page and grab some data from those linked pages (the HTML structure is much the same as in disbelief's example). But the array selector for links does not seem to work. If I delete it, I am able to grab data from the first linked page just fine.

Thank you very much for your help.

PS: These issues probably refer to the same problem

@kvetoslavnovak

kvetoslavnovak commented Aug 12, 2015

I can confirm the same bug; see my closed duplicate issue #81. When crawling to other (sub)pages, a plain [ .a ] or [ .b ] collection selector scrapes its multiple contents correctly, but an object collection [{ .a, .b }] does not work.

Your example from the main page works for multiple links on the followed pages (a collection of links):

x('http://google.com', {
  image: x('#gbar a@href', ['a']),
})(function(err, obj) {
})

or for multiple headings on the followed pages (a collection of h1s):

x('http://google.com', {
  image: x('#gbar a@href', ['h1']),
})(function(err, obj) {
})

but not for headings together with links (a collection of links and h1s):

x('http://google.com', {
  image: x('#gbar a@href', [{ a: 'a', h1: 'h1' }]),
})(function(err, obj) {
})

A multi-item collection [{ }] works only when scraping the page directly, not when crawling via a@href.

@jproby

jproby commented Sep 10, 2015

I can confirm the same bug. Any clue how to fix this ?

@onbjerg

onbjerg commented Oct 17, 2015

I can confirm the same bug.

var x = require('x-ray')()

x('https://www.retsinformation.dk/Forms/R0210.aspx', {
  laws: x('.tbl.tbl2 tr:not(.th) td:nth-child(2) a@href', {
    title: 'title',
    paragraphs: x('.ParagrafNr', [{
      no: '@html'
    }])
  })
})(function (err, obj) {
  if (err) {
    console.log(err)
    return
  }

  console.log(obj)
})

The above code works, but this does not

var x = require('x-ray')()

x('https://www.retsinformation.dk/Forms/R0210.aspx', {
  laws: x('.tbl.tbl2 tr:not(.th) td:nth-child(2) a@href', [{
    title: 'title',
    paragraphs: x('.ParagrafNr', [{
      no: '@html'
    }])
  }])
})(function (err, obj) {
  if (err) {
    console.log(err)
    return
  }

  console.log(obj)
})
@KristerV

KristerV commented Oct 18, 2015

alas, I also am at this wall...
got around it by manually creating new xray objects for every link.. very slow and resource intensive.

@sylvery

sylvery commented Mar 21, 2016

@monolithed, I ran into this problem a few weeks back while working on a webscraper and I came up with a simple solution. Here's what you can do:

x("example.com/index.html", 
  {
    companies: x('.company', [ // get company name and link
      {
        name: 'a.url',
        domainURL: 'a.url@href'
      }
    ])
  }
)(function(err, object){ // pass company name and link to this function
  object.companies.forEach(function(company){ // visit each company page to get to '.domain'
    x(company.domainURL, {
      domain: '.domain'
    })(function(err, result){
      console.log(result); // log the domain
    });
  });
  console.log("done", err, object);
});

It is not much, but I hope it helps. You can store the result from the first callback in the DB, then query the DB for the URL links and crawl those links to get the additional information. It takes time and consumes resources, but it will help. :)

@matthewmueller I believe 'q' promises will help solve this bug. Just a suggestion
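
The workaround in this comment logs each domain separately, and its "done" line runs before the follow-up requests have finished. Below is a minimal sketch of one way to merge the domains back into the company list and fire a single final callback. It assumes `x` is an x-ray instance supporting the x(url, selector)(callback) form used in this thread; `attachDomains` is a hypothetical helper name, not part of x-ray, and this sidesteps the nesting bug rather than fixing it.

```javascript
// Hypothetical helper (not part of x-ray). `x` is assumed to behave like an
// x-ray instance: x(url, selector) returns a function that takes a callback.
function attachDomains(x, companies, done) {
  var pending = companies.length; // follow-up requests still in flight
  var failed = false;             // ensures `done` is called only once on error

  if (pending === 0) return done(null, companies);

  companies.forEach(function (company) {
    x(company.domainURL, '.domain')(function (err, domain) {
      if (failed) return;
      if (err) {
        failed = true;
        return done(err);
      }
      company.domain = domain;    // merge the scraped value into the item
      pending -= 1;
      if (pending === 0) done(null, companies);
    });
  });
}

// Possible usage inside the first callback from the comment above:
//   attachDomains(x, object.companies, function (err, companies) {
//     console.log("done", err, companies);
//   });
```

All follow-up requests start at once here, so for long link lists a sequential or throttled variant may be kinder to the target site.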

@Kikobeats Kikobeats added the bug label Mar 21, 2016

@calopez

calopez commented Apr 7, 2016

Same issue,

// xray_poc.js
'use strict';

var Xray = require("x-ray");
var x = new Xray();

x('http://testing-ground.scraping.pro/', '#content > .caseblock:nth-child(3) ', {
    title: 'a',
    link: 'a @href',
    other: x('#content > .caseblock:nth-child(3) a@href',{
        title: 'h1',
        text_links: ['#caseinfo ul li a@href'],
        text_links2: x('#caseinfo ul ', ['li a@href']),          
        text_list: x('#caseinfo ul ', [{link: 'li a@href'}]),        
        content: x('#caseinfo ul li a@href' , 'h1'),
    })
}).write('results.json');

If you run:

DEBUG=x-ray node xray_poc.js && cat results.json

you will see that plain lists work well, but when you need a list of objects, only the first item in the list appears:

{
  "title": "TEXT LIST",
  "link": "http://testing-ground.scraping.pro/textlist",
  "other": {
    "title": "TEXT LIST ",
    "text_links": [
      "http://testing-ground.scraping.pro/textlist?ver=1",
      "http://testing-ground.scraping.pro/textlist?ver=2",
      "http://testing-ground.scraping.pro/textlist?ver=3",
      "http://testing-ground.scraping.pro/textlist?ver=4",
      "http://testing-ground.scraping.pro/textlist?ver=5"
    ],
    "text_links2": [
      "http://testing-ground.scraping.pro/textlist?ver=1",
      "http://testing-ground.scraping.pro/textlist?ver=2",
      "http://testing-ground.scraping.pro/textlist?ver=3",
      "http://testing-ground.scraping.pro/textlist?ver=4",
      "http://testing-ground.scraping.pro/textlist?ver=5"
    ],
    "text_list": [
      {
        "link": "http://testing-ground.scraping.pro/textlist?ver=1"
      }
    ],
    "content": "TEXT LIST (version 1)"
  }
}
@heyimlance

heyimlance commented May 2, 2016

First let me say this is a great project!!

Anyone find a solution for this? Having a similar problem. I'm thinking I may just need to scrape a page, save all the links to other pages into an array, iterate that and scrape the links.

I may have to do that anyway; in some cases I need to scrape 3+ pages deep.
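
The "collect the links, then iterate and scrape" idea could be sketched roughly as follows. This is only an illustration under assumptions: `scrape` is taken to be an x-ray-style function called as scrape(url, selector)(callback), and `scrapeInSeries` is a hypothetical helper name, not an x-ray API. It visits one URL at a time, which is slower but keeps the load low.

```javascript
// Hypothetical helper (not part of x-ray): visits each URL in turn,
// starting the next request only after the previous one has finished.
function scrapeInSeries(scrape, urls, selector, done) {
  var results = [];

  function next(index) {
    // All URLs processed: hand back results in the same order as `urls`.
    if (index === urls.length) return done(null, results);

    scrape(urls[index], selector)(function (err, data) {
      if (err) return done(err); // stop at the first failure
      results.push(data);
      next(index + 1);
    });
  }

  next(0);
}

// Possible usage for one level of links (selectors are placeholders):
//   x('http://example.com/index.html', ['a.url@href'])(function (err, links) {
//     if (err) return console.error(err);
//     scrapeInSeries(x, links, { domain: '.domain' }, function (err, pages) {
//       console.log(err, pages);
//     });
//   });
```

For pages several levels deep, the same helper could be called again with the links gathered from each level's results.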

@gnujeremie

gnujeremie commented May 24, 2016

Any update on this issue ?
I'm trying to follow a list of links from a webpage, and for each link scrape some elements. I can only get the elements from the first followed link.

@milad145

milad145 commented Aug 6, 2016

this is my html

`


a
b
c
d

`

and i want this array

[{title:a},{title:b},{title:c},{title:d}]

what can i do?

@mkdizajn

mkdizajn commented Sep 1, 2016

Has anyone tried using promises to wait for the child x() calls?
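
One way the promise idea could look is sketched below. It wraps each callback-style call in a Promise and waits for all of them with Promise.all. The helper names `scrapeAsPromise` and `scrapeAll` are hypothetical, and `x` is assumed to follow the x(url, selector)(callback) shape used throughout this thread. This works around the nested-collection problem by issuing separate top-level calls; it does not change x-ray itself.

```javascript
// Hypothetical helpers (not part of x-ray).

// Wraps one callback-style scrape in a Promise.
function scrapeAsPromise(x, url, selector) {
  return new Promise(function (resolve, reject) {
    x(url, selector)(function (err, result) {
      if (err) reject(err);
      else resolve(result);
    });
  });
}

// Scrapes every URL with the same selector and resolves once all are done.
// Results keep the order of `urls`; the first error rejects the whole batch.
function scrapeAll(x, urls, selector) {
  return Promise.all(urls.map(function (url) {
    return scrapeAsPromise(x, url, selector);
  }));
}

// Possible usage (selectors are placeholders):
//   scrapeAsPromise(x, 'http://example.com/index.html', ['a.url@href'])
//     .then(function (links) { return scrapeAll(x, links, { domain: '.domain' }); })
//     .then(function (pages) { console.log(pages); })
//     .catch(function (err) { console.error(err); });
```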
