Error processing duplicates #14

Closed
jkowall opened this issue Aug 30, 2023 · 9 comments

jkowall commented Aug 30, 2023

It seemed to be running, but I think I hit some rate limits. The task was moving forward slowly. Finally, it showed an error in the web UI, but I am not sure which logs to check. It might be good to add this to the docs:

[Screenshot: error shown in the web UI]

@mtalcott (Owner)

The error should show up in the terminal window you're running docker-compose in. Do you see a stack trace when you run docker-compose logs worker? If so, I can diagnose the problem if you share a screenshot or the logs (run docker-compose logs worker > worker.log, then share worker.log here or via email).


jkowall commented Aug 31, 2023

Thanks! The problem is the constant stream of access log data, so it's hard to tell from the terminal window.

I ran that command, and the log file is 400 MB, so no email :)

This is the last part of the file, the traceback:

worker_1  | [2023-08-31 20:29:56,164: ERROR/ForkPoolWorker-31] Task app.tasks.process_duplicates[ecaee328-15f6-4a49-ad10-269a6addc3f0] raised unexpected: RuntimeError('Image decoding failed (unknown image type): /mnt/images/APynsm7bOtET5Euhm6k8bIGPCHQkzZiMhoUevFZyChWvVZ7Cgn_gZA6gb-Rwnf04-T0d5smwE-5KPQgbj86Cvi-QzXlqgVi5sw-250.jpg')
worker_1  | Traceback (most recent call last):
worker_1  |   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 477, in trace_task
worker_1  |     R = retval = fun(*args, **kwargs)
worker_1  |   File "/usr/src/app/app/__init__.py", line 25, in __call__
worker_1  |     return self.run(*args, **kwargs)
worker_1  |   File "/usr/src/app/app/tasks.py", line 78, in process_duplicates
worker_1  |     results = task_instance.run()
worker_1  |   File "/usr/src/app/app/lib/process_duplicates_task.py", line 102, in run
worker_1  |     similarity_map = duplicate_detector.calculate_similarity_map()
worker_1  |   File "/usr/src/app/app/lib/duplicate_image_detector.py", line 61, in calculate_similarity_map
worker_1  |     embeddings = self._calculate_embeddings()
worker_1  |   File "/usr/src/app/app/lib/duplicate_image_detector.py", line 114, in _calculate_embeddings
worker_1  |     mp_image = mp.Image.create_from_file(storage_path)
worker_1  | RuntimeError: Image decoding failed (unknown image type): /mnt/images/APynsm7bOtET5Euhm6k8bIGPCHQkzZiMhoUevFZyChWvVZ7Cgn_gZA6gb-Rwnf04-T0d5smwE-5KPQgbj86Cvi-QzXlqgVi5sw-250.jpg

It seems like a file is not decodable, but which one?
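With a log this large, a small filter can separate error records from the surrounding noise. The following is a minimal sketch, not part of the project: it assumes every record starts with a Celery-style header like the one in the traceback above (`[date time: LEVEL/process]`) and that continuation lines such as traceback frames carry no header of their own. The name `error_records` is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Matches the record header seen in the traceback above, e.g.
# "[2023-08-31 20:29:56,164: ERROR/ForkPoolWorker-31]", and captures the level.
RECORD_HEADER = re.compile(r"\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+: (\w+)/")


def error_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield ERROR/CRITICAL records plus the header-less lines that follow them."""
    keep = False
    for line in lines:
        match = RECORD_HEADER.search(line)
        if match:
            # A new record starts here; keep it only if it is an error.
            keep = match.group(1) in {"ERROR", "CRITICAL"}
        if keep:
            yield line.rstrip("\n")
```

To use it, open worker.log in text mode and print each line that `error_records` yields. Lines without a header that happen to follow an error record are kept as well, so some unrelated output may slip through if other services write header-less lines to the same log.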

@mtalcott (Owner)

I see. Looks like a corrupted image. I should handle that better; thanks for sharing!

Two options:

  1. If you're concerned about daily quota and want to retain as many of your existing images as possible, just delete this single image. Assuming you're using Docker Desktop, you can do this through the UI under Volumes > google-photos-deduper_image-volume > find the corresponding image file (APynsm7bOtET5Euhm6k8bIGPCHQkzZiMhoUevFZyChWvVZ7Cgn_gZA6gb-Rwnf04-T0d5smwE-5KPQgbj86Cvi-QzXlqgVi5sw-250.jpg) > Right click > Delete. There may be another corrupted file, in which case you'd run into the same error and have to repeat the process.

  2. If you are OK with re-downloading all images, you can delete the whole google-photos-deduper_image-volume volume, either through the Docker Desktop UI under the Actions column or from a terminal window by running docker-compose down and then docker volume rm google-photos-deduper_image-volume.

mtalcott self-assigned this Aug 31, 2023
mtalcott added the bug label Aug 31, 2023

jkowall commented Sep 1, 2023

I started doing this manually, but there are too many of them. I guess most of the files that are about 1.5 KB in size are corrupt. There are many thousands of them and I can't batch delete them all.

If I were to do item 2 of your suggestions, I would likely run into the same quota issues and end up with many corrupt image files again. I guess we need an error handler that stops downloading files when the Google Photos quota is hit and exits instead, so the user can run it again later.

Thanks @mtalcott !
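One way to handle thousands of bad files without deleting them by hand is to check each file's leading bytes against known image signatures. This is only a sketch under assumptions: it is not part of the project, the function names are hypothetical, and it must run somewhere the volume contents are reachable (the traceback shows them under /mnt/images inside the worker container). By default it only lists suspects; deletion has to be requested explicitly.

```python
from pathlib import Path
from typing import List

# Leading bytes of common image formats (JPEG, PNG, GIF).
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")


def looks_like_image(path: Path) -> bool:
    """Return True if the file starts with a known image signature."""
    with path.open("rb") as handle:
        head = handle.read(12)
    if head.startswith(IMAGE_SIGNATURES):
        return True
    # WebP files start with "RIFF", a 4-byte size, then "WEBP".
    return head[:4] == b"RIFF" and head[8:12] == b"WEBP"


def find_non_images(directory: Path, delete: bool = False) -> List[Path]:
    """List files in `directory` that do not look like images; optionally delete them."""
    suspects = sorted(
        p for p in directory.iterdir() if p.is_file() and not looks_like_image(p)
    )
    if delete:
        for path in suspects:
            path.unlink()
    return suspects
```

Calling `find_non_images(Path("/mnt/images"))` first and reviewing the result before passing `delete=True` keeps the operation reversible up to the last step. Image formats outside the short signature list would be reported as suspects, so the list may need extending.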


mtalcott commented Sep 2, 2023

Got it. I'm not sure what corrupted them. I will experiment with the responses returned after the quota is hit and see if I can reproduce and handle this.


mtalcott commented Sep 2, 2023

Actually @jkowall, could you download one of the corrupted files and post here? That will help diagnose.


jkowall commented Sep 4, 2023

Yep here you go. Looks like it's a 403 :)

<!DOCTYPE html><html lang=en><meta charset=utf-8><meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"><title>Error 403 (Forbidden)!!1</title><style>*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}</style><a href=//www.google.com/><span id=logo aria-label=Google></span></a><p><b>403.</b> <ins>That’s an error.</ins><p>Your client does not have permission to get URL 
<code>/lr/AJUiC1zxH_sJF87KjrCd-7t4B3E4oe2C96r_sBXmDmTIcs_ds6O6pyKtlbWfalPGW77fcEsjTFSc6fPEwV7jhrI2IlRvvWJm07xB5ao8aQeLYN3JWuv5TYrFe3WJeXnBZZpFCZ452sc4ozf-5BuoFRPLlZn4ACn15cpzs268OdWUukhHFSjRSzUg5OdpHBGlFC5DXmzxylr4Rn_Sh7_peOCqCSo0yXwvK39yy9HRP-QIa4rN3jZxPhKJvs2yGNDLY_hULyyTEBzF0O10JgJ_UqP2MPKvBTbyLYX7H8uYlvFfCYahNN0Xi_9wc-ML7-PhVduyaFKDVbiqfviv7PFP16ssyH-jBJ2m3GCCjeJs4DPa01Lk_sKXmhkuwAdXEkjjIULDONSnE3p5WK_Qxa-zUl9HhQ9wuxgaCNAxvUt8hlqlj5p6IjUzqZVawB9MySkQehn65v3HU1XB9W4ebUnRSYuHiPa0k6XfOV-vyZJfpTQV-zVDCx9oJ_jl1I1NXz-0mrHvxP3Rt91CO3lcwBdafHPvWfKOMzLEj5aIoWvead6eJgx6a4d2-mtHP0HLcAOqJnDWmqIqal1YM7qxzoGryl7SzOiUfaPjcZv9k4KNLcwnS0XSOLriD_SwPrbAvEa3PSj0shzh2UBchQgCIb7tRDO1Zn29a93g18g91n62TDV_bApqGU80O9WYiBcS3CNFHfT4LrfVdKeipIj3uD4xAGnzPEWHNbCXVCAKMXe6TD2bBnOAeJbabO2y2XwtRYHtMe7_1eaSUwsB0W-V2j2J0yhv6tYvMxCiTi8csEIrMtY1wMowP85UpVH3YQlvdQ-CUYV7uTgBrPqDzRd7me97DaCFnDYXJ3YtbGHID46PCMHkmJ897NFkVpFVdTN7cGK_dDztZYIx42KsM9dQNhkZeczEBhhrU-aL3GKPUssOxYSwqBArEWGNDSG7kMby_C5-6V_rq3rtyLVfcK_juLsM8FDV8W4lJyMBL1bNgeGwGyyROqS25AuIkPAUAdoCp_Y3fbc=w250-h250</code> from this server.  (Client IP address: 73.57.6.58)<br><br>
Forbidden
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
  <ins>That’s all we know.</ins>


mtalcott commented Sep 4, 2023

There was a bug that saved error responses as image files. This was resolved in #16, which now raises an error and retries instead.

@jkowall If you pull the latest from main, delete your volume (option 2 above), and try again, you shouldn't get these invalid image files any more.
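The general pattern described here (refuse to store a non-image response, raise, and retry) can be sketched as follows. This is an illustration only and not the code from #16: the `Response` tuple, `DownloadError`, `save_if_image`, and `download_with_retry` are hypothetical names, and the HTTP call is abstracted behind a `fetch` callable so the example stays independent of any particular client library.

```python
import time
from pathlib import Path
from typing import Callable, Tuple

# (status_code, content_type, body): a simplified stand-in for an HTTP response.
Response = Tuple[int, str, bytes]


class DownloadError(Exception):
    """Raised when a response should not be stored as an image."""


def save_if_image(response: Response, dest: Path) -> None:
    """Write the body to `dest` only for a 200 response with an image content type."""
    status, content_type, body = response
    if status != 200:
        raise DownloadError(f"unexpected HTTP status {status}")
    if not content_type.lower().startswith("image/"):
        raise DownloadError(f"unexpected content type {content_type!r}")
    dest.write_bytes(body)


def download_with_retry(
    fetch: Callable[[], Response],
    dest: Path,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> None:
    """Call `fetch` up to `attempts` times, backing off exponentially between failures."""
    for attempt in range(attempts):
        try:
            save_if_image(fetch(), dest)
            return
        except DownloadError:
            if attempt == attempts - 1:
                raise  # Out of retries: surface the error instead of saving bad data.
            time.sleep(base_delay * 2 ** attempt)
```

With this shape, a 403 page like the one posted above never reaches disk, and a persistent failure stops the task with a clear error rather than leaving files that fail later during decoding.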

mtalcott closed this as completed Sep 4, 2023

jkowall commented Sep 5, 2023

All fixed, thanks @mtalcott ! Worked like a charm, and 430 fewer duplicates.
