
Better error handling, with proper error messages #9

Closed
oyvindeh opened this issue Sep 5, 2012 · 12 comments

oyvindeh (Owner) commented Sep 5, 2012

Better error handling and reporting is needed when a resource (an HTML or CSS file) cannot be found or downloaded.

There are some aspects complicating this. For example, if there is a list of ten CSS URLs and one of them returns a 404, what should be done? Should the whole script abort, or should the other files still be processed?

joeytwiddle commented:

The behaviour I am currently experiencing is that an error message is displayed but the crawl continues, and then at the end no output is produced. That is horrible!

Better might be to abort immediately after an error.

But I would prefer that it simply display results even after errors have occurred, perhaps with a note at the top giving a count of how many errors occurred during the crawl.
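To make the suggestion concrete, here is a minimal sketch of the policy I mean (not ucss's actual code; it uses Node's built-in fetch purely for illustration): collect failures per URL, keep crawling, and still print output with an error count.

```js
// Minimal sketch of the suggested behaviour (not ucss's actual code):
// continue on error, collect failures per URL, and still produce output.
// Uses Node's built-in fetch (Node 18+) purely for illustration.
async function crawlCss(cssUrls) {
    const errors = [];
    const stylesheets = [];

    for (const url of cssUrls) {
        try {
            const res = await fetch(url);
            if (!res.ok) throw new Error(`HTTP ${res.status}`);
            stylesheets.push({ url, css: await res.text() });
        } catch (err) {
            errors.push({ url, message: err.message });   // e.g. 404, ETIMEDOUT
        }
    }

    // The report is shown even if some URLs failed.
    console.log(`Errors during crawl: ${errors.length}`);
    for (const e of errors) console.log(`  ${e.url}: ${e.message}`);
    console.log(`Stylesheets loaded: ${stylesheets.length}`);
    return stylesheets;
}

crawlCss(['http://example.com/a.css', 'http://example.com/missing.css']);
```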

oyvindeh (Owner, Author) commented:

Yeah, that sounds like a bug: I've seen it a couple of times too, but I haven't had time to investigate yet. I'll look into it as soon as I can, and I've created a separate issue for it (#32).

oyvindeh (Owner, Author) commented:

What error message(s) do you get: ETIMEDOUT and/or ESOCKETTIMEDOUT? Do you get a timeout when loading the CSS?

joeytwiddle commented:

I did have a few timeouts, but I increased the "timeout" parameter in the config to compensate.

The errors I am getting now appear to involve certain binaries. I see them for .zip, .jpg, .png, and .pdf files found during the crawl:

Visited:  http://...-engines-eng.jpg
Unable to load http://...-engines-eng.jpg: RangeError: Maximum call stack size exceeded
undefined

Occasionally I also get them from messy links!

_getHtmlAsString() failed to read javascript: void(0): ENOENT, no such file or directory 'javascript: void(0)'

It would be great if the report could be shown even when failures occur!

oyvindeh (Owner, Author) commented:

Thanks for the info! I have fixed the problem with the timeouts (not released yet). As for the problem with the crawler following binaries, that's a known bug (#29). I am currently travelling, but I will look into all this when I get back in a few days, as well as release a bunch of other fixes.
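Roughly, the kind of filtering I have in mind for the binaries looks like the sketch below. This is just an illustration, not the actual fix for #29; the extension list and the Content-Type check are assumptions.

```js
// Sketch only (not the actual fix for #29): skip resources that look like
// binaries based on file extension, and double-check the Content-Type header
// before parsing a response body as HTML. The extension list is an assumption.
const BINARY_EXTENSIONS = /\.(zip|jpe?g|png|gif|pdf)(\?.*)?$/i;

function looksLikeBinary(url) {
    return BINARY_EXTENSIONS.test(url);
}

function looksLikeHtml(contentType) {
    return typeof contentType === 'string' && contentType.indexOf('text/html') !== -1;
}

console.log(looksLikeBinary('http://example.com/engines-eng.jpg'));  // true
console.log(looksLikeBinary('http://example.com/page.html'));        // false
console.log(looksLikeHtml('text/html; charset=utf-8'));              // true
console.log(looksLikeHtml('image/jpeg'));                            // false
```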

ghost assigned oyvindeh on Jan 27, 2014
oyvindeh (Owner, Author) commented Feb 4, 2014

#29 has now been fixed.

joeytwiddle commented:

Great, thanks; the fix for binaries is working here too. 👍

However, I am still getting a few errors from javascript: "links", and an ENOENT for an ftp: URL that does not exist.

I am also getting this at the end:

.../node_modules/ucss/node_modules/q/q.js:126
                    throw e;
                          ^
RangeError: Maximum call stack size exceeded

The result is that after a long crawl visiting 400 pages, I see no results about the CSS!

(Unfortunately, stack overflows do not produce a stack trace, so it is harder to see where this came from, but I may try running with a debugger at some point...)
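In case it is useful while debugging, here is a generic illustration of the failure mode (nothing to do with ucss's actual code): a recursive walk that uses one stack frame per item hits V8's stack limit on large inputs, while a plain loop does not.

```js
// Generic illustration (not ucss's code): deep recursion over a big result
// set exceeds V8's call stack, while an iterative loop uses constant stack.
function countRecursive(items, i = 0) {
    if (i >= items.length) return 0;
    return 1 + countRecursive(items, i + 1);   // grows the call stack per item
}

function countIterative(items) {
    let n = 0;
    for (const _ of items) n++;
    return n;
}

const items = new Array(100000).fill('x');
console.log(countIterative(items));            // 100000
try {
    console.log(countRecursive(items));
} catch (e) {
    console.log(String(e));                    // RangeError: Maximum call stack size exceeded
}
```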

oyvindeh (Owner, Author) commented Feb 7, 2014

I've fixed the "javascript:" and "ftp:" bug.

I'm looking into the other one as well. It's a bit tricky, so I cannot promise when a fix will be out. I don't know what kind of site you are trying to crawl, but if it resembles a product catalog (i.e. lots of pages with the same markup and CSS but different content), you could add the whole subtree to an exclude list in your config and add a couple of entries to the include list (I've updated the example config to show this).
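For reference, the idea looks roughly like the sketch below. The option names, the wildcard syntax, and the URLs are illustrative only, so check the updated example config in the repo for the real keys.

```js
// Illustrative sketch of the exclude/include idea; the key names and the
// wildcard syntax are assumptions, see the example config for the real ones.
module.exports = {
    pages: {
        crawl: 'http://example.com/',                   // start crawling here
        exclude: ['http://example.com/products/*'],     // skip the whole catalog subtree
        include: [
            'http://example.com/products/sample-product.html'  // but keep one representative page
        ]
    },
    css: ['http://example.com/css/style.css'],
    timeout: 10000                                      // request timeout in ms (mentioned earlier)
};
```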

oyvindeh (Owner, Author) commented Feb 7, 2014

Actually, it seems I may have a fix working: I was just able to crawl 24k+ pages without any crash. I will clean it up somewhat, do some more test runs, and hopefully publish it later today.

oyvindeh (Owner, Author) commented Feb 7, 2014

Published both fixes.

joeytwiddle commented:

Many thanks, oyvindeh. Your fixes have prevented all the errors here, and our crawl of 700 pages now completes! It seems it was worth the wait too...

Total: 1446 (87 used, 1353 unused, 177 duplicates, 5 ignored)

I do wonder whether the report would still be shown if an error did occur (e.g. if a web server incorrectly returned a binary file as text/html).

oyvindeh (Owner, Author) commented:

No problem, happy that I could help! 👍

That's a lot of unused CSS! (Be aware that if your page is JavaScript-heavy and classes are added using JavaScript, those classes will not be captured.)

There are errors that will crash the script: e.g., sometimes the modules used, or V8 itself, will just make the script exit. However, my error handling can improve as well. The case you mention is such a case, and I have a TODO item for it. I will look further into this later, so I am keeping this issue open.
