
Better error handling, with proper error messages #9

Closed
oyvindeh opened this issue Sep 5, 2012 · 12 comments

oyvindeh (Owner) commented Sep 5, 2012

Better error handling and reporting is needed when a resource (an HTML or CSS file) cannot be found or downloaded.

There are some aspects complicating this. For example, if there is a list of ten CSS URLs and one of them returns a 404, what should be done? Should the whole script abort, or should the other files still be processed?

joeytwiddle commented:

The behaviour I am currently experiencing is that an error message is displayed but the crawl continues, and then at the end no output is produced. That is horrible!

Better might be to abort immediately after an error.

But I would prefer that it simply display results even after errors have occurred, perhaps with a note at the top giving a count of how many errors occurred during the crawl.
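To make the suggestion concrete, here is a minimal sketch of the policy I mean (not ucss's actual code; it uses Node's built-in fetch purely for illustration): collect failures per URL, keep crawling, and still print output with an error count.

```js
// Minimal sketch of the suggested behaviour (not ucss's actual code):
// continue on error, collect failures per URL, and still produce output.
// Uses Node's built-in fetch (Node 18+) purely for illustration.
async function crawlCss(cssUrls) {
    const errors = [];
    const stylesheets = [];

    for (const url of cssUrls) {
        try {
            const res = await fetch(url);
            if (!res.ok) throw new Error(`HTTP ${res.status}`);
            stylesheets.push({ url, css: await res.text() });
        } catch (err) {
            errors.push({ url, message: err.message });   // e.g. 404, ETIMEDOUT
        }
    }

    // The report is shown even if some URLs failed.
    console.log(`Errors during crawl: ${errors.length}`);
    for (const e of errors) console.log(`  ${e.url}: ${e.message}`);
    console.log(`Stylesheets loaded: ${stylesheets.length}`);
    return stylesheets;
}

crawlCss(['http://example.com/a.css', 'http://example.com/missing.css']);
```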

oyvindeh (Owner, Author) commented:

Yeah, that sounds like a bug: I've seen it a couple of times too, but I haven't had time to investigate yet. I'll look into it as soon as I can, and I've created a separate issue for it (#32).

oyvindeh (Owner, Author) commented:

What error message(s) do you get: ETIMEDOUT and/or ESOCKETTIMEDOUT? Do you get a timeout when loading the CSS?

joeytwiddle commented:

I did have a few timeouts, but I increased the "timeout" parameter in the config to compensate.

The errors I am getting now appear to involve certain binaries. I see them for .zip, .jpg, .png, and .pdf files found during the crawl:

Visited:  http://...-engines-eng.jpg
Unable to load http://...-engines-eng.jpg: RangeError: Maximum call stack size exceeded
undefined

Occasionally I also get them from messy links!

_getHtmlAsString() failed to read javascript: void(0): ENOENT, no such file or directory 'javascript: void(0)'

It would be great if the report could be shown even when failures occur!

oyvindeh (Owner, Author) commented:

Thanks for the info! I have fixed the problem with the timeouts (not released yet). As for the problem with the crawler following binaries, that's a known bug (#29). I am currently travelling, but I will look into all this when I get back in a few days, as well as release a bunch of other fixes.
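Roughly, the kind of filtering I have in mind for the binaries looks like the sketch below. This is just an illustration, not the actual fix for #29; the extension list and the Content-Type check are assumptions.

```js
// Sketch only (not the actual fix for #29): skip resources that look like
// binaries based on file extension, and double-check the Content-Type header
// before parsing a response body as HTML. The extension list is an assumption.
const BINARY_EXTENSIONS = /\.(zip|jpe?g|png|gif|pdf)(\?.*)?$/i;

function looksLikeBinary(url) {
    return BINARY_EXTENSIONS.test(url);
}

function looksLikeHtml(contentType) {
    return typeof contentType === 'string' && contentType.indexOf('text/html') !== -1;
}

console.log(looksLikeBinary('http://example.com/engines-eng.jpg'));  // true
console.log(looksLikeBinary('http://example.com/page.html'));        // false
console.log(looksLikeHtml('text/html; charset=utf-8'));              // true
console.log(looksLikeHtml('image/jpeg'));                            // false
```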

ghost assigned oyvindeh on Jan 27, 2014
oyvindeh (Owner, Author) commented Feb 4, 2014

#29 has now been fixed.

joeytwiddle commented:

Great, thanks; the fix for binaries is working here too. 👍

However, I am still getting a few errors from javascript: "links", and an ENOENT for an ftp: URL that does not exist.

I am also getting this at the end:

.../node_modules/ucss/node_modules/q/q.js:126
                    throw e;
                          ^
RangeError: Maximum call stack size exceeded

The result is that after a long crawl visiting 400 pages, I see no results about the CSS!

(Unfortunately, stack overflows do not produce a stack trace, so it is harder to see where this came from, but I may try running with a debugger at some point...)
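In case it is useful while debugging, here is a generic illustration of the failure mode (nothing to do with ucss's actual code): a recursive walk that uses one stack frame per item hits V8's stack limit on large inputs, while a plain loop does not.

```js
// Generic illustration (not ucss's code): deep recursion over a big result
// set exceeds V8's call stack, while an iterative loop uses constant stack.
function countRecursive(items, i = 0) {
    if (i >= items.length) return 0;
    return 1 + countRecursive(items, i + 1);   // grows the call stack per item
}

function countIterative(items) {
    let n = 0;
    for (const _ of items) n++;
    return n;
}

const items = new Array(100000).fill('x');
console.log(countIterative(items));            // 100000
try {
    console.log(countRecursive(items));
} catch (e) {
    console.log(String(e));                    // RangeError: Maximum call stack size exceeded
}
```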

oyvindeh (Owner, Author) commented Feb 7, 2014

I've fixed the "javascript:" and "ftp:" bug.

I'm looking into the other one as well. It's a bit tricky, so I cannot promise when a fix will be out. I don't know what kind of site you are trying to crawl, but if it resembles a product catalog (i.e. lots of pages with the same markup and CSS but different content), you could add the whole subtree to an exclude list in your config and add a couple of entries to the include list (I've updated the example config to show this).
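For reference, the idea looks roughly like the sketch below. The option names, the wildcard syntax, and the URLs are illustrative only, so check the updated example config in the repo for the real keys.

```js
// Illustrative sketch of the exclude/include idea; the key names and the
// wildcard syntax are assumptions, see the example config for the real ones.
module.exports = {
    pages: {
        crawl: 'http://example.com/',                   // start crawling here
        exclude: ['http://example.com/products/*'],     // skip the whole catalog subtree
        include: [
            'http://example.com/products/sample-product.html'  // but keep one representative page
        ]
    },
    css: ['http://example.com/css/style.css'],
    timeout: 10000                                      // request timeout in ms (mentioned earlier)
};
```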

oyvindeh (Owner, Author) commented Feb 7, 2014

Actually, it seems I may have a fix working: I was just able to crawl 24k+ pages without any crash. I will clean it up somewhat, do some more test runs, and hopefully publish it later today.

oyvindeh (Owner, Author) commented Feb 7, 2014

Published both fixes.

joeytwiddle commented:

Many thanks, oyvindeh. Your fixes have prevented all the errors here, and our crawl of 700 pages now completes! It seems it was worth the wait too...

Total: 1446 (87 used, 1353 unused, 177 duplicates, 5 ignored)

I do wonder whether the report would still be shown if an error did occur (e.g. if a web server incorrectly returned a binary file as text/html).

oyvindeh (Owner, Author) commented:

No problem, happy that I could help! 👍

That's a lot of unused CSS! (Be aware that if your page is JavaScript-heavy and classes are added using JavaScript, those classes will not be captured.)

There are errors that will crash the script: e.g., sometimes the modules used, or V8 itself, will just make the script exit. However, my error handling can improve as well. The case you mention is such a case, and I have a TODO item for it. I will look further into this later, so I am keeping this issue open.
