Browsertrix crawler stops on page crashing #266

benoit74 · 2024-01-15T09:40:51Z

We have multiple instances where a Browsertrix crawl ends up with this kind of error:

{"timestamp":"2024-01-15T08:17:30.893Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/pl/2020/04/fruit-trenches-cultivating-subtropical-plants-in-freezing-temperatures/","workerid":0}}
{"timestamp":"2024-01-15T08:17:30.991Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/pl/2020/01/how-sustainable-is-a-solar-powered-website/","workerid":0}}
{"timestamp":"2024-01-15T08:17:31.478Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://solar.lowtechmagazine.com/pl/2020/04/fruit-trenches-cultivating-subtropical-plants-in-freezing-temperatures/","workerid":0}}
{"timestamp":"2024-01-15T08:17:31.479Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":2,"page":"https://solar.lowtechmagazine.com/pl/2020/04/fruit-trenches-cultivating-subtropical-plants-in-freezing-temperatures/","workerid":0}}

{"timestamp":"2024-01-15T08:17:31.701Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"about:blank?_browsertrixkozykf79y5c","workerid":0}}
node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "crashed".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Let's track these occurrences in this issue until we know what to report upstream (or not).

Last known occurrences:

mesquartierschinois_fr on task https://farm.openzim.org/pipeline/787b96ee-36df-4d3d-96d5-758ca67b52df/debug with Browsertrix-Crawler 0.12.3 (with warcio.js 1.6.2 pywb 2.7.4) on worker athena18
solar.lowtechmagazine.com on task https://farm.openzim.org/pipeline/ebbea738-c84d-4b9c-a4a4-2476e03581a1/debug with Browsertrix-Crawler 0.12.3 (with warcio.js 1.6.2 pywb 2.7.4) on worker chance

benoit74 · 2024-02-05T08:00:07Z

mesquartierschinois_fr again on task https://farm.openzim.org/pipeline/29998e1f-6ffc-423f-934f-911aa3eb7ec0 with Browsertrix-Crawler 0.12.4 (with warcio.js 1.6.2 pywb 2.7.4) on worker athena18

benoit74 · 2024-02-08T07:17:45Z

solar.lowtechmagazine.com again on task https://farm.openzim.org/pipeline/6e4a0355-a69b-4849-98ab-ab1f0960aaf2/debug with Browsertrix-Crawler 0.12.4 (with warcio.js 1.6.2 pywb 2.7.4) on worker athena18

benoit74 · 2024-02-19T07:34:20Z

solar.lowtechmagazine.com again on task https://farm.openzim.org/pipeline/ebf9d7d4-ab85-48bd-9363-a988694050ac with Browsertrix-Crawler 0.12.4 (with warcio.js 1.6.2 pywb 2.7.4) on worker athena18

Something important I had not noticed before: there are in fact many "Page Crashed" entries in the log (72 here), but only the last one stops the crawl, and it comes with an additional message that does not appear for the other page crashes. It looks like the crawler does not even manage to initialize the page in that case. Note that between the 72 page crashes, some pages were crawled successfully.

{"timestamp":"2024-02-17T01:51:16.156Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}

benoit74 · 2024-02-19T07:55:44Z

See #283 for what looks like a new symptom that may have the same underlying cause.

benoit74 · 2024-02-19T08:12:21Z

I've reported the issue(s) upstream.

benoit74 · 2024-02-22T12:30:52Z

ir.voanews.com_persian on task https://farm.openzim.org/pipeline/4889a582-f24d-4364-acad-507c5d94ced6 with Browsertrix-Crawler 0.12.4 (with warcio.js 1.6.2 pywb 2.7.4) on worker athena18

benoit74 · 2024-02-23T11:07:11Z

ndla.no_no_all on task https://farm.openzim.org/pipeline/1586bd19-22c2-4275-ae2e-abc266e6c7fe with Browsertrix-Crawler 0.12.4 (with warcio.js 1.6.2 pywb 2.7.4) on worker athena18

benoit74 · 2024-05-28T11:57:17Z

It looks like Browsertrix-Crawler 1.x solved the issue; let's close this one.

benoit74 added the scraping_issue label (Issue occurred while using the scraper) Jan 15, 2024

benoit74 changed the title ~~Browsertrix crawler has regularly pages crashing~~ Browsertrix crawler stops on page crashing Jan 15, 2024

benoit74 mentioned this issue Feb 19, 2024

solar.lowtechmagazine.com is very unstable #283

Closed

benoit74 mentioned this issue Feb 19, 2024

Crawler getting stuck on Page Crashed webrecorder/browsertrix-crawler#391

Closed

benoit74 added the upstream label Feb 19, 2024

This was referenced Feb 19, 2024

Upgrade to browsertrix crawler 1.0.0 beta #284

Closed

New request: ir.voanews.com openzim/zim-requests#833

Open

benoit74 mentioned this issue Feb 23, 2024

New request: NDLA - Norwegian Digital Learning Arena openzim/zim-requests#626

Open

benoit74 closed this as completed May 28, 2024
