.each not working for classes used a single time #34

Closed
derekdr opened this Issue Jul 9, 2011 · 4 comments

Projects

None yet

3 participants

@derekdr
derekdr commented Jul 9, 2011

There is no way of using .each when there is just a single element matching a class. Take this HTML as an example.

<div class='example'>This is text contained in a div with the class example</div>
<div class='ex2'>This is also text, this time contained in a class named ex2</div>
<div class='example'>Oh look, same class as the first div, example</div>

All fun and games right?

Let's try to get the text out of every div with the class example:

var a = []; $('div.example').each(function(e){ a.push(e.text); }); this.emit(a);

This should work without a hitch.

Now let's try to get the text out of every div with the class ex2, we have no idea how many classes like this are used so we use each to just get them all, even if it's just one:

var z = []; $('div.ex2').each(function(e){ z.push(e.text); }); this.emit(z);

Awww man! That didn't work at all!

.each just works for classes used multiple times. With CSSLint fanatics starting to use classes in place of # this might become a larger issue in the future.

Contributor
chriso commented Jul 10, 2011

To be honest, this is a design fail on my part. I was sick of typing first() every time I wanted a single element from a collection with only one item, so I added this shortcut - I knew it would come back to bite me at some stage!

In the mean time, you could use jQuery/sizzle as your selector engine by setting the jsdom option to true - I'm sure it's handled more elegantly.

I'll have to think of a solution that's backwards compatible.

Hey chris, i think you ought to make jquery selectors the default option. This same issue tripped me up for quite a time last night and i can't think of any advantages that the current default has over jquery.

Furthermore, jquery selectors return a jquery object so I can do things like
sections = $('.section');
$.each(sections, function (i,sec) {
chi = $(sec).find('.child');
})

those few lines of code would avoid a lot of workarounds i had to do for the current default.

To expand on my previous comment, the current default will often fail in a common scraping scenario:

Imagine you are scraping a page with 5 sections, where the sections contains a header and an author. If one of the headers is missing from a section, then the current default selector fails.

With the current default, you'd have to do something like:
headers = $('.section .header');
authors = $('.section .author');

And then you pray and hope that there is the same amount of headers as authors, otherwise your scraped data object will be mix-matched.

The code i put in the previous comment using jquery would avoid this problem, by iterating through each section individually, and doing a selector 'find' function on each section to find the children of each individual section. It appears at first review that the current default selector engine does not have a 'find' or similar method to traverse through children.

Contributor
chriso commented Aug 6, 2011

Hey @devjones. There's a few reasons why jQuery / JSDOM isn't the default:

  • It's slow spinning up a full instance of JSDOM for every page you want to scrape. Most of the time you just need to get a few pieces of information - you don't a full blown DOM to do that
  • Htmlparser can start to parse the page as it's received, while JSDOM doesn't parse it until the entire page has been received. When you're scraping a lot of data, this makes a significant difference to the memory usage and speed of node.io
  • I've added several convenience methods which make scraping easier.

You can do a lot of what you're describing with the default $ object, e.g. to use a custom context or find within another element, use $(selector [, context]);. JSDOM/jQuery is great if you need a full DOM or need complex / compound selectors, but IMO it's not worth the trade off in speed and memory usage.

@chriso chriso closed this Aug 6, 2011
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment