Browsertrix crawler stops on page crashing #266

Closed
benoit74 opened this issue Jan 15, 2024 · 8 comments
Labels
scraping_issue Issue occurred while using the scraper upstream

Comments

@benoit74
Collaborator

We have multiple instances where a Browsertrix crawl ends up with errors like these:

{"timestamp":"2024-01-15T08:17:30.893Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/pl/2020/04/fruit-trenches-cultivating-subtropical-plants-in-freezing-temperatures/","workerid":0}}
{"timestamp":"2024-01-15T08:17:30.991Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/pl/2020/01/how-sustainable-is-a-solar-powered-website/","workerid":0}}
{"timestamp":"2024-01-15T08:17:31.478Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://solar.lowtechmagazine.com/pl/2020/04/fruit-trenches-cultivating-subtropical-plants-in-freezing-temperatures/","workerid":0}}
{"timestamp":"2024-01-15T08:17:31.479Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":2,"page":"https://solar.lowtechmagazine.com/pl/2020/04/fruit-trenches-cultivating-subtropical-plants-in-freezing-temperatures/","workerid":0}}

{"timestamp":"2024-01-15T08:17:31.701Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"about:blank?_browsertrixkozykf79y5c","workerid":0}}
node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "crashed".] {
  code: 'ERR_UNHANDLED_REJECTION'
}
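
The final message in this log is what ends the process: a promise rejected with the plain string "crashed" and nothing handled it. Since Node.js 15, an unhandled rejection terminates the process by default. Below is a minimal sketch of that mechanism and of how catching the rejection keeps a worker alive. It is not Browsertrix crawler code: `simulatePageCrash` and `crawlPageSafely` are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-in for a page operation that fails when the page crashes.
// The rejection reason is the plain string "crashed" (as in the log above),
// not an Error object.
function simulatePageCrash() {
  return Promise.reject("crashed");
}

// Hypothetical worker step: awaiting inside try/catch turns the rejection
// into a value the caller can act on, instead of an unhandled rejection.
async function crawlPageSafely(url) {
  try {
    await simulatePageCrash();
    return { url, status: "ok" };
  } catch (reason) {
    return { url, status: "failed", reason: String(reason) };
  }
}

crawlPageSafely("https://example.com/").then((result) => {
  console.log(result.status, result.reason); // prints "failed crashed"
});
```

Without the try/catch (or a `.catch()` on the promise), the same rejection would surface as `ERR_UNHANDLED_REJECTION`, as seen above.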

Let's track these occurrences in this issue until we know what to report upstream (or not).

Last known occurrences:

benoit74 added the scraping_issue (Issue occurred while using the scraper) label Jan 15, 2024
benoit74 changed the title from "Browsertrix crawler has regularly pages crashing" to "Browsertrix crawler stops on page crashing" Jan 15, 2024
@benoit74
Collaborator Author

benoit74 commented Feb 5, 2024

@benoit74
Collaborator Author

benoit74 commented Feb 8, 2024

@benoit74
Collaborator Author

Something important I had not noticed before: there are in fact many "Page Crashed" events in the log (72 here), but only the last one stops the crawl, and it comes with an additional message that the other page crashes do not have. It looks like in that case the crawler does not even manage to initialize the page. Note that between the 72 page crashes, some pages were crawled successfully.

{"timestamp":"2024-02-17T01:51:16.156Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
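
A count like the 72 "Page Crashed" events mentioned above can be obtained from the JSON-lines log with a short script. This is a minimal sketch, assuming each log entry is one JSON object per line with a `message` field, as in the excerpts in this issue. `countByMessage` is a hypothetical helper, not part of Browsertrix crawler, and the sample data is illustrative only.

```javascript
// Count log entries per "message" field in a JSON-lines crawl log.
// Lines that are not JSON objects (e.g. raw Node.js error output) are skipped.
function countByMessage(logText) {
  const counts = new Map();
  for (const line of logText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("{")) continue;
    let entry;
    try {
      entry = JSON.parse(trimmed);
    } catch {
      continue; // not valid JSON, ignore
    }
    if (typeof entry.message !== "string") continue;
    counts.set(entry.message, (counts.get(entry.message) ?? 0) + 1);
  }
  return counts;
}

// Tiny illustrative sample, not the real crawl log.
const sample = [
  '{"logLevel":"error","context":"worker","message":"Page Crashed"}',
  '{"logLevel":"info","context":"pageStatus","message":"Page Finished"}',
  '{"logLevel":"error","context":"worker","message":"Page Crashed"}',
  "node:internal/process/promises:288",
].join("\n");

console.log(countByMessage(sample).get("Page Crashed")); // prints 2
```

Comparing the counts for "Page Crashed" and "Error getting new page" is one way to check whether the crawl-stopping crash is always the one that comes with the extra message.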

@benoit74
Collaborator Author

See #283 for what looks like a new symptom that may have the same underlying cause.

@benoit74
Collaborator Author

I've reported the issue(s) upstream.

@benoit74
Collaborator Author

benoit74 commented Feb 22, 2024

@benoit74
Collaborator Author

It looks like Browsertrix-Crawler 1.x solved the issue, so let's close this one.
