403 forbidden possibly due to rate limit #12

Closed
SecDWizar opened this issue Aug 29, 2023 · 7 comments

@SecDWizar
SecDWizar commented Aug 29, 2023

Hi,

I started the process yesterday evening. It ran quite quickly until it exhausted the 75,000 "Base URL requests per day" quota. Overnight it kept advancing, very slowly, logging "WARNING/ForkPoolWorker-32] Received 429 Client Error: Too Many Requests for url" (presumably skipping those items), which is fine.

The "Waiting for subtasks to complete... (934 / 2056)" counter advanced by only about 200 during that 429 period. At Pacific midnight the quota was reset and the app started chugging again, but now I'm getting:

"WARNING/ForkPoolWorker-23] Received 403 Client Error: Forbidden for url: https://lh3.googleusercontent.com/lr/... getting media item size"

Tons of them, quickly. The task counter rose a bit, perhaps by 100-200, then stopped rising, while the logs show tens of thousands of requests failing with those 403s.

From reading around, this looks like another rate limit; one suggestion I found mentions adding "referrerpolicy". So what are these 403 errors?

@mtalcott mtalcott added the bug Something isn't working label Aug 29, 2023
@mtalcott mtalcott self-assigned this Aug 29, 2023
@mtalcott
Owner

mtalcott commented Aug 30, 2023

@SecDWizar Thanks for the detailed report! I think I understand the problem. For an immediate fix, try again with "Refresh media items" checked (as it is by default) now that the daily quota has been reset. This should re-fetch media items and resolve the 403s (although it might actually take 2 days' quota, see below).

The Photos API returns a baseUrl that is used to 1) download a small version of the image and 2) get the size. Only 75k baseUrl requests are allowed per day before 429s are returned. These baseUrls also only stay valid for about an hour, so when the quota reset at midnight, it started returning 403s instead as the baseUrls previously fetched had expired.
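To illustrate the two failure modes, here's a rough sketch of a baseUrl fetch (the function name, the =w250-h250 size suffix, and the return-None handling are illustrative, not the exact code in this repo):

import requests
from typing import Optional

def fetch_thumbnail(base_url: str, px: int = 250) -> Optional[bytes]:
    # Appending =w<px>-h<px> asks the Photos CDN for a downsized rendition.
    resp = requests.get(f"{base_url}=w{px}-h{px}")
    if resp.status_code == 429:
        # Daily baseUrl quota exhausted: back off until the quota resets.
        return None
    if resp.status_code == 403:
        # The baseUrl has expired (they live ~1 hour): re-fetch the media
        # item from the Photos API for a fresh baseUrl, then retry.
        return None
    resp.raise_for_status()
    return resp.content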

Based on the 2056 subtasks, I estimate your total library size is ~100k media items. The tool won't call baseUrls again for any media items it already has a size/image for, but it will use 1 call for the image and 1 call for the size, so it might require 3 days total quota to get images and sizes for your whole library.
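Back-of-the-envelope version of that estimate, using the numbers from this thread:

library_size = 100_000   # ~2056 subtasks ≈ 100k media items
calls_per_item = 2       # 1 baseUrl call for the image + 1 for the size
daily_quota = 75_000     # baseUrl requests allowed per day

days_needed = -(-library_size * calls_per_item // daily_quota)  # ceiling division
print(days_needed)  # 3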

Longer term, I want to handle this situation more gracefully upfront with user-facing messaging, and provide an option to not fetch size so that the daily quota can all go towards downloading images.

@SecDWizar
Author

SecDWizar commented Aug 30, 2023

Thank you very much for the explanation; it's all clear now (and those long GUID URLs make sense).

A couple of small questions:
How does the tool know which images it already has, i.e. the list of items to fetch those small versions and sizes for? What quota does building that list take? Does it refresh on its own, or only when "Refresh media items" is checked and you select restart? If so, does that invalidate the whole list, and does fetching it again count as just one API call?

  • It sounds like the expired-URL problem applies to all accounts, so why not just re-fetch automatically whenever those ephemeral (1h) links go stale?

How do you track changes, e.g. images added or deleted from the phone? If getting all the media items is just one API call, then processing it (which can be expensive and needs a nice data structure, but it runs on my computer so who cares) could even be done periodically in the background (maybe a technical challenge) and surfaced in the GUI occasionally: a prompt that media item changes were detected, 100 images added/scanned, 50 removed, etc. That would make the whole workflow (and GUI) dynamic.

What are your thoughts?

@mtalcott
Owner

The image metadata is stored in MongoDB. The port is exposed by default when you run docker-compose, so you can take a peek at the media_items collection it has gathered with a tool like https://www.mongodb.com/products/compass. Refreshing the media items pulls new data from the Photos API into the Mongo collection and spins off new tasks to store downsized images and get the size, each of which counts against that 75k/day quota I mentioned above.
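If you'd rather script the peek than use Compass, something like this works against the exposed port (the database name here is a guess; list_database_names() will show the real one):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # default Mongo port assumed
print(client.list_database_names())  # find the tool's actual database name

db = client["app"]  # hypothetical name; substitute the one printed above
for item in db.media_items.find().limit(5):
    print(item)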

It sounds like the expired-URL problem applies to all accounts, so why not just re-fetch automatically whenever those ephemeral (1h) links go stale?

Yes, that's probably what it should do instead - plus notify the user that the quota was hit, then keep going once the quota resets. I wasn't sure when the quota reset, so knowing it reset for you at midnight Pacific is helpful.

I also don't have an option yet to cancel the current task, that'd be helpful once the quota is hit in case you want to shut the whole thing down and try again later.

How do you track changes, e.g. images added or deleted from the phone? If getting all the media items is just one API call, then processing it (which can be expensive and needs a nice data structure, but it runs on my computer so who cares) could even be done periodically in the background (maybe a technical challenge) and surfaced in the GUI occasionally: a prompt that media item changes were detected, 100 images added/scanned, 50 removed, etc. That would make the whole workflow (and GUI) dynamic.

The Photos API makes this difficult. There's no way to filter by modified date or anything. In fact, it doesn't even tell you how many media items exist - it just returns a next page token and you have to keep iterating until there are no more pages. Getting the media item metadata is the least expensive part though, it's calling the baseUrls and storing the images that takes more time, which is why I've parallelized it.
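Roughly, the listing loop looks like this (a simplified sketch against the public Photos Library API, not the repo's exact code; auth token handling is elided):

import requests

def list_all_media_items(access_token):
    url = "https://photoslibrary.googleapis.com/v1/mediaItems"
    headers = {"Authorization": f"Bearer {access_token}"}
    page_token = None
    while True:
        params = {"pageSize": 100}  # 100 is the documented maximum
        if page_token:
            params["pageToken"] = page_token
        data = requests.get(url, headers=headers, params=params).json()
        yield from data.get("mediaItems", [])
        page_token = data.get("nextPageToken")
        if not page_token:
            break  # no token means we've reached the last page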

@mtalcott
Owner

mtalcott commented Sep 1, 2023

@SecDWizar This should be addressed by #16.

The task will now fail once quota is hit, but progress is saved and it can be restarted the next day.

@mtalcott mtalcott closed this as completed Sep 1, 2023
@SecDWizar
Author

@SecDWizar This should be addressed by #16.

The task will now fail once quota is hit, but progress is saved and it can be restarted the next day.

Hi,
Sorry for the late reply.

Is progress saved with or without "Refresh media items" checked?

I've pulled and rebuilt the images - that should do it, right?
Since then it has failed every time (always with "Refresh media items" checked). I didn't see what happened in the logs because I only look after a few days, and it's hard to find the failure in the logs that way.

(screenshot of the failed run attached)

@SecDWizar
Author

Found it. I think that's a different bug - want me to open a new one for it?
My earlier question still stands: is progress saved with or without "Refresh media items"? (I.e., does checking "Refresh media items" zero out the progress?)

worker_1  | [2023-09-14 13:09:26,333: ERROR/ForkPoolWorker-31] Task app.tasks.process_duplicates[75866370-e90a-4066-a894-42a29a6f3b12] raised unexpected: RuntimeError('Image decoding failed (unknown image type): /mnt/images/AMP2KI72Y-ZxNHXGmA_vS0fgKGuVV-R43RJHXupfrUWomAFDAwANBVTaFUTYkYpBr4PfkTlMuLACm1eDBOfpA4CoCRT30BPurQ-250.jpg')
worker_1  | Traceback (most recent call last):
worker_1  |   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 477, in trace_task
worker_1  |     R = retval = fun(*args, **kwargs)
worker_1  |   File "/usr/src/app/app/__init__.py", line 25, in __call__
worker_1  |     return self.run(*args, **kwargs)
worker_1  |   File "/usr/src/app/app/tasks.py", line 90, in process_duplicates
worker_1  |     results = task_instance.run()
worker_1  |   File "/usr/src/app/app/lib/process_duplicates_task.py", line 109, in run
worker_1  |     similarity_map = duplicate_detector.calculate_similarity_map()
worker_1  |   File "/usr/src/app/app/lib/duplicate_image_detector.py", line 61, in calculate_similarity_map
worker_1  |     embeddings = self._calculate_embeddings()
worker_1  |   File "/usr/src/app/app/lib/duplicate_image_detector.py", line 114, in _calculate_embeddings
worker_1  |     mp_image = mp.Image.create_from_file(storage_path)
worker_1  | RuntimeError: Image decoding failed (unknown image type): /mnt/images/AMP2KI72Y-ZxNHXGmA_vS0fgKGuVV-R43RJHXupfrUWomAFDAwANBVTaFUTYkYpBr4PfkTlMuLACm1eDBOfpA4CoCRT30BPurQ-250.jpg
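Maybe a defensive wrapper around that call could just skip undecodable files instead of failing the whole task? A sketch, not a tested fix (the RuntimeError type is taken from the traceback above):

import logging
import mediapipe as mp

def load_image_safe(storage_path):
    try:
        return mp.Image.create_from_file(storage_path)
    except RuntimeError as exc:
        # Truncated or corrupt downloads raise "Image decoding failed";
        # log and skip so one bad file doesn't kill the embedding task.
        logging.warning("Skipping undecodable image %s: %s", storage_path, exc)
        return None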

@SecDWizar
Author

How do you track changes, e.g. images added or deleted from the phone? If getting all the media items is just one API call, then processing it (which can be expensive and needs a nice data structure, but it runs on my computer so who cares) could even be done periodically in the background (maybe a technical challenge) and surfaced in the GUI occasionally: a prompt that media item changes were detected, 100 images added/scanned, 50 removed, etc. That would make the whole workflow (and GUI) dynamic.

The Photos API makes this difficult. There's no way to filter by modified date or anything. In fact, it doesn't even tell you how many media items exist - it just returns a next page token and you have to keep iterating until there are no more pages. Getting the media item metadata is the least expensive part though, it's calling the baseUrls and storing the images that takes more time, which is why I've parallelized it.

In that case it should be done on a schedule and kept in Mongo, IMHO, perhaps even with a UI option to trigger that refresh and to set the schedule (once a day by default). From what I understand the cost is one API call per page, right? How many entries per page? Say one has ~100K items: at 100 entries per page that's 1K calls, at 10 entries per page it's 10K calls, etc. So that's problematic...
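Quick sanity check on that math, assuming the documented 100-per-page maximum for mediaItems.list:

import math

library_size = 100_000
page_size = 100  # documented maximum for mediaItems.list
print(math.ceil(library_size / page_size))  # 1000 listing calls per full sync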

I'm just thinking that phones add images all the time, and sometimes people delete them too.

Or maybe it shouldn't be on a schedule at all, and "Refresh media items" stays an on-demand choice: once in a while you hit it and everything resyncs, right?
