
[Crawler] Sometimes gets stuck after navigating to the page and before extracting metadata #93

Closed
SeeJayEmm opened this issue Apr 9, 2024 · 21 comments
Labels
bug Something isn't working

Comments

@SeeJayEmm

I'm seeing this error in my logs:

2024-04-09T19:28:41.366Z error: [Crawler][14] Crawling job failed: {}

For the sites where it reports this, it doesn't update the bookmark (no thumbnail, title, tags, etc.). I've tried refreshing the item, but it always reports the same error (nothing between the curly braces).

An example URL would be https://github.com/filebrowser/filebrowser

Given the lack of actionable info in the log I'm not sure how to proceed in troubleshooting.

@MohamedBassem
Collaborator

Can you try updating your 'HOARDER_VERSION' to 'latest'? I've improved the error logging, so we'll be able to see exactly why it's failing. Please update to latest and report back with the error message :)

@SeeJayEmm
Author

2024-04-09T19:40:35.783Z info: [Crawler][15] Will crawl "https://github.com/filebrowser/filebrowser" for link with id "p8ca026hk3yawz49tkozhge2"
2024-04-09T19:40:38.036Z info: [Crawler][15] Successfully navigated to "https://github.com/filebrowser/filebrowser". Waiting for the page to load ...
2024-04-09T19:40:43.041Z info: [Crawler][15] Finished waiting for the page to load.
2024-04-09T19:41:35.775Z error: [Crawler][15] Crawling job failed: Error: Timed-out after 60 secs

@MohamedBassem MohamedBassem added the bug Something isn't working label Apr 9, 2024
@MohamedBassem
Collaborator

Thanks! This is the second time I'm seeing this issue, so I'll assume it's not a one-off and that it's a bug. I'll try to debug it to understand why it happens. Thanks for the report!

@SeeJayEmm
Author

I tried bookmarking a few different pages on GitHub, so I wonder if the user agent or something similar is being blocked. I've also seen it a few times, though inconsistently, with YouTube.

I've mostly been trying this out by marking things I want to come back to later, so it's been pretty random. I can do a bit more testing, though, to see if I can find a pattern.

@MohamedBassem
Collaborator

The thing is, in this instance and in the previous report, it's not reproducible for me.


I'm yet to find a website that reproduces it on my server. However, with the new log lines, I know exactly where it happens: either when fetching the page content or when closing the browser context. I'll need to dig deeper to understand why either of them can get stuck.
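For anyone following along: the "Timed-out after 60 secs" failures above are consistent with an outer job timeout firing while one inner puppeteer call hangs indefinitely. A minimal sketch of guarding each suspect step with its own deadline, assuming a puppeteer-based crawler (`withTimeout` and the step labels are hypothetical, not Hoarder's actual code):

```typescript
// Hypothetical helper: race a promise against a timer so a hung step fails
// fast with a descriptive error, instead of stalling silently until the
// outer 60-sec job timeout fires.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(
        () => reject(new Error(`Timed-out after ${ms / 1000} secs in ${label}`)),
        ms,
      ),
    ),
  ]);
}

// Sketch of how the two suspect steps could be guarded individually:
//   const html = await withTimeout(page.content(), 10_000, "page.content");
//   await withTimeout(context.close(), 5_000, "context.close");
```

With per-step labels in the error, a log line would pinpoint whether the fetch or the context close is the one hanging.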

@SeeJayEmm
Author

Let me know how I can help.

@MohamedBassem
Collaborator

Thanks a lot. Worst case, I might add some more debugging lines and ask you to try to re-reproduce :)

@MohamedBassem MohamedBassem changed the title error: [Crawler][14] Crawling job failed: {} [Crawler] Sometimes gets stuck after navigating to the page and before extracting metadata Apr 9, 2024
@SeeJayEmm
Author

If it helps, the Docker version is 25.0.4 and the output of uname -a is:

Linux hostname 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

@MohamedBassem
Collaborator

While we're at it, could you also share the logs of the browser container? Maybe there's something interesting there.

@SeeJayEmm
Author

I'm assuming that's the chrome container.

_hoarder-chrome-1_logs.txt

@SeeJayEmm
Author

I just confirmed that the host Hoarder is running on isn't being blocked by GitHub. I was able to load the page via another tool called Ladder, running in Docker on the same host.

@MohamedBassem
Collaborator

OK, another idea: can you try restarting the Chrome container itself and see if that helps?

@SeeJayEmm
Author

Same.

2024-04-09T23:04:42.143Z info: [Crawler] The puppeteer browser got disconnected. Will attempt to launch it again.
2024-04-09T23:04:42.209Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-04-09T23:04:42.266Z info: [Crawler] Successfully resolved IP address, new address: http://10.17.8.4:9222/
2024-04-09T23:04:42.508Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-04-09T23:04:47.510Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-04-09T23:04:47.513Z info: [Crawler] Successfully resolved IP address, new address: http://10.17.8.4:9222/
2024-04-09T23:04:52.939Z info: [Crawler][20] Will crawl "https://github.com/filebrowser/filebrowser" for link with id "p8ca026hk3yawz49tkozhge2"
2024-04-09T23:04:55.854Z info: [Crawler][20] Successfully navigated to "https://github.com/filebrowser/filebrowser". Waiting for the page to load ...
2024-04-09T23:05:00.863Z info: [Crawler][20] Finished waiting for the page to load.
2024-04-09T23:05:52.807Z error: [Crawler][20] Crawling job failed: Error: Timed-out after 60 secs

@NeoHuncho

I was having the same issue with Reddit links yesterday (getting stuck before metadata extraction), but I pasted the same link today and it worked fine.

@MohamedBassem
Collaborator

It's a very weird problem, and so far I haven't been able to reproduce it locally 😔

@SeeJayEmm
Author

I just updated my Hoarder instance again and refreshed the link. It loaded and populated this time.

@MohamedBassem
Collaborator

The problem seems transient: when it happens, it stays stuck for a while and then resolves on its own. I'm still trying to figure out what this "state" is that we get stuck in for some websites.

@MohamedBassem
Collaborator

Is there anyone still facing this issue? If so, can you try adding --disable-dev-shm-usage to your Chrome container flags and check whether it helps?
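For context on why this flag can matter: Chromium uses /dev/shm for shared memory between its processes, and Docker's default /dev/shm is only 64 MB, which is known to cause renderer crashes or hangs on heavy pages; --disable-dev-shm-usage makes it write those files to /tmp instead. An alternative sketch, assuming a docker-compose setup, is to enlarge the shared-memory segment itself rather than bypass it:

```yaml
chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    shm_size: "1gb"   # raise Docker's 64 MB /dev/shm default instead of disabling its use
```

Either approach should avoid the Chrome processes running out of shared memory; the flag has the advantage of not needing to size the segment.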

@SeeJayEmm
Author

I haven't encountered the problem since the last time I posted in this thread.

@talentedmrripley

Guys, is there anyone still facing this issue? If yes, can you try adding --disable-dev-shm-usage to your chrome container flags and check if this helps?

I too was having issues scraping sites, predominantly Reddit. Adding the suggested flag allowed the worker to successfully crawl all of the bookmarks it was previously failing on. I'll see whether it keeps succeeding as I add new Reddit links.

For reference, here is my compose for the chrome container:

chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    container_name: hoarder-chrome
    restart: unless-stopped
    command:
      - --no-sandbox
      - --disable-gpu
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb
      - --disable-dev-shm-usage

@MohamedBassem
Collaborator

OK, I'll consider this resolved for now; we can always reopen if someone faces the issue again.


4 participants