
[Crawler] Sometimes gets stuck after navigating to the page and before extracting metadata #93

Closed
SeeJayEmm opened this issue Apr 9, 2024 · 21 comments
Labels
bug Something isn't working

Comments

@SeeJayEmm

I'm seeing this error in my logs:

2024-04-09T19:28:41.366Z error: [Crawler][14] Crawling job failed: {}

For the sites where it reports this, it doesn't update the bookmark (no thumbnail, title, tags, etc.). I've tried refreshing the item, but it always reports the same error (nothing between the curly braces).

An example URL would be https://github.com/filebrowser/filebrowser

Given the lack of actionable info in the log I'm not sure how to proceed in troubleshooting.

@MohamedBassem
Collaborator

Can you try updating your 'HOARDER_VERSION' to 'latest'? I've improved the error logging, so we'll be able to see exactly why it's failing. Please update to latest and report back with the error message :)

@SeeJayEmm
Author

2024-04-09T19:40:35.783Z info: [Crawler][15] Will crawl "https://github.com/filebrowser/filebrowser" for link with id "p8ca026hk3yawz49tkozhge2"
2024-04-09T19:40:38.036Z info: [Crawler][15] Successfully navigated to "https://github.com/filebrowser/filebrowser". Waiting for the page to load ...
2024-04-09T19:40:43.041Z info: [Crawler][15] Finished waiting for the page to load.
2024-04-09T19:41:35.775Z error: [Crawler][15] Crawling job failed: Error: Timed-out after 60 secs

@MohamedBassem MohamedBassem added the bug Something isn't working label Apr 9, 2024
@MohamedBassem
Collaborator

Thanks! This is the second time I'm seeing this issue, so I'll assume it's not a one-off and that it's a bug. I'll try to debug it to understand why it happens. Thanks for the report!

@SeeJayEmm
Author

I tried bookmarking a few different pages on GitHub, so I wonder if the user agent or something similar is being blocked. I've also seen it a few times, though inconsistently, with YouTube.

I've mostly been trying this out by marking things I want to come back to later, so it's been pretty random. I can do a bit more testing, though, to see if I can find a pattern.

@MohamedBassem
Collaborator

The thing is, in this instance and in the previous report, it's not reproducible for me.


I'm yet to find a website that reproduces it on my server. However, with the new log lines, I know exactly where it happens: either when fetching the page content or when closing the browser context. I'll need to dig deeper to understand why either of them can get stuck.
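For anyone following along: the "Timed-out after 60 secs" failures above are consistent with an outer job timeout firing while one inner puppeteer call hangs indefinitely. A minimal sketch of guarding each suspect step with its own deadline, assuming a puppeteer-based crawler (`withTimeout` and the step labels are hypothetical, not Hoarder's actual code):

```typescript
// Hypothetical helper: race a promise against a timer so a hung step fails
// fast with a descriptive error, instead of stalling silently until the
// outer 60-sec job timeout fires.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(
        () => reject(new Error(`Timed-out after ${ms / 1000} secs in ${label}`)),
        ms,
      ),
    ),
  ]);
}

// Sketch of how the two suspect steps could be guarded individually:
//   const html = await withTimeout(page.content(), 10_000, "page.content");
//   await withTimeout(context.close(), 5_000, "context.close");
```

With per-step labels in the error, a log line would pinpoint whether the fetch or the context close is the one hanging.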

@SeeJayEmm
Author

Let me know how I can help.

@MohamedBassem
Collaborator

Thanks a lot. Worst case, I might add some more debugging lines and ask you to try to re-reproduce :)

@MohamedBassem MohamedBassem changed the title error: [Crawler][14] Crawling job failed: {} [Crawler] Sometimes gets stuck after navigating to the page and before extracting metadata Apr 9, 2024
@SeeJayEmm
Author

If it helps, the Docker version is 25.0.4 and the output of uname -a is:

Linux hostname 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

@MohamedBassem
Collaborator

While we're at it, could you also share the logs of the browser container? Maybe there's something interesting there.

@SeeJayEmm
Author

I'm assuming that's the chrome container.

_hoarder-chrome-1_logs.txt

@SeeJayEmm
Author

I just confirmed that the host Hoarder is running on isn't being blocked by GitHub. I was able to load the page via another tool called Ladder, running in Docker on the same host.

@MohamedBassem
Collaborator

OK, another idea: can you try restarting the Chrome container itself and see if that helps?

@SeeJayEmm
Author

Same.

2024-04-09T23:04:42.143Z info: [Crawler] The puppeteer browser got disconnected. Will attempt to launch it again.
2024-04-09T23:04:42.209Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-04-09T23:04:42.266Z info: [Crawler] Successfully resolved IP address, new address: http://10.17.8.4:9222/
2024-04-09T23:04:42.508Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-04-09T23:04:47.510Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-04-09T23:04:47.513Z info: [Crawler] Successfully resolved IP address, new address: http://10.17.8.4:9222/
2024-04-09T23:04:52.939Z info: [Crawler][20] Will crawl "https://github.com/filebrowser/filebrowser" for link with id "p8ca026hk3yawz49tkozhge2"
2024-04-09T23:04:55.854Z info: [Crawler][20] Successfully navigated to "https://github.com/filebrowser/filebrowser". Waiting for the page to load ...
2024-04-09T23:05:00.863Z info: [Crawler][20] Finished waiting for the page to load.
2024-04-09T23:05:52.807Z error: [Crawler][20] Crawling job failed: Error: Timed-out after 60 secs

@NeoHuncho

I was having the same issue with Reddit links yesterday (getting stuck before metadata extraction), but I pasted the same link today and it worked fine.

@MohamedBassem
Collaborator

It's a very weird problem, and so far I haven't been able to reproduce it locally 😔

@SeeJayEmm
Author

I just updated my Hoarder instance again and refreshed the link. It loaded and populated this time.

@MohamedBassem
Collaborator

The problem seems transient: when it happens, it stays stuck for a while and then resolves on its own. I'm still trying to figure out what this "state" is that we get stuck in for some websites.

@MohamedBassem
Collaborator

Is there anyone still facing this issue? If so, can you try adding --disable-dev-shm-usage to your Chrome container flags and check whether it helps?
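For context on why this flag can matter: Chromium uses /dev/shm for shared memory between its processes, and Docker's default /dev/shm is only 64 MB, which is known to cause renderer crashes or hangs on heavy pages; --disable-dev-shm-usage makes it write those files to /tmp instead. An alternative sketch, assuming a docker-compose setup, is to enlarge the shared-memory segment itself rather than bypass it:

```yaml
chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    shm_size: "1gb"   # raise Docker's 64 MB /dev/shm default instead of disabling its use
```

Either approach should avoid the Chrome processes running out of shared memory; the flag has the advantage of not needing to size the segment.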

@SeeJayEmm
Author

I haven't encountered the problem since the last time I posted in this thread.

@talentedmrripley

Guys, is there anyone still facing this issue? If yes, can you try adding --disable-dev-shm-usage to your chrome container flags and check if this helps?

I too was having issues scraping sites, predominantly Reddit. Adding the suggested flag allowed the worker to successfully crawl all of the bookmarks it was previously failing on. I'll see whether it keeps succeeding as I add new Reddit links.

For reference, here is my compose for the chrome container:

chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    container_name: hoarder-chrome
    restart: unless-stopped
    command:
      - --no-sandbox
      - --disable-gpu
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb
      - --disable-dev-shm-usage

@MohamedBassem
Collaborator

OK, I'll consider this resolved for now; we can always reopen if someone faces the issue again.


4 participants