Faces: Improve performance and strategy for manual tagging #3124


Closed
maxime1992 opened this issue Jan 17, 2023 · 31 comments
Labels

ai: Artificial Intelligence, Machine Learning (ML)
bug: Something isn't working
faces: Face Detection, Recognition and Search
performance: Performance Optimization
priority: Supported by early sponsors or popular demand
released: Available in the stable release
tested: Changes have been tested successfully

Comments

@maxime1992

Let me start by saying this is a new issue summarizing a closed, but still occurring, one. I'm creating it based on @lastzero's comment.

1. What is not working as documented?

For faces that were not added automatically, tagging people manually is slow to extremely slow. With some photos or persons (I'm unsure which), it takes only a few seconds (rare), but once the issue occurs, it usually takes between 1 and 2 minutes to tag a single person in my case. I recently decided to move all my photos to PhotoPrism, so I already have quite a lot, with many more to come. I have more than a couple hundred pictures with unrecognized faces waiting to be tagged, and waiting 1 to 2 minutes for each would be unmanageable.

It's documented as a known issue. I have tried the command docker compose exec photoprism photoprism faces audit --fix, but it didn't fix the issue.

2. How can we reproduce it?

I have no idea how to reproduce this exactly, but both @pjft and I have shared parts of our databases by email to help debug it. See pjft's comment and mine. Best case, that's enough to debug; if not, I'm more than happy to provide more info.

3. What behavior do you expect?

When a face is detected but unknown, it shouldn't take more than a few seconds to add it manually.

4. What could be the cause of your problem?

Really unsure, sorry.

5. Can you provide us with example files for testing, error logs, or screenshots?

As stated above, this has been done by email as it includes personal data.

6. Which software versions do you use?

(a) PhotoPrism Architecture & Build Number: AMD64, ARM64, ARMv7,...

Build 221118-e58fee0fb

(b) Database Type & Version: MariaDB, MySQL, SQLite,...

MariaDB

(c) Operating System Types & Versions: Linux, Windows, Android,...

Linux

(d) Browser Types & Versions: Firefox, Chrome, Safari on iPhone,...

Brave and Chrome

(e) Ad Blockers, Browser Plugins, and/or Firewall Software?

Probably irrelevant here (I've tried turning them off; it doesn't change anything).

7. On what kind of device is PhotoPrism installed?

This is especially important if you are reporting a performance, import, or indexing issue. You can skip this if you're reporting a problem you found in our public demo, or if it's a completely unrelated issue, such as incorrect page layout.

(a) Device / Processor Type: Raspberry Pi 4, Intel Core i7-3770, AMD Ryzen 7 3800X,...

Intel(R) Core(TM) i3-3227U CPU @ 1.90GHz

(b) Physical Memory & Swap Space in GB

4 GB and 4 GB

(c) Storage Type: HDD, SSD, RAID, USB, Network Storage,...

SSD

(d) Anything else that might be helpful to know?

8. Do you use a Reverse Proxy, Firewall, VPN, or CDN?

Yes, I use SWAG, which itself uses NGINX, but it's probably unrelated here as everything used to work perfectly with this config.


I'll summarize what was found in the previous issue:

  • There's a suspicion that this issue may arise when PhotoPrism thinks there are different faces assigned to the same person.
  • Running a "faces reset" works... but in the case of the previous issue's reporter, the problem came back, so it's not really a viable option.
  • It was asked whether it could come from running multiple instances or from directly modifying database content. Neither is the case for me.
  • The original issue was closed because nobody shared their database; we've now sent two of them for debugging.
@maxime1992 maxime1992 added the bug Something isn't working label Jan 17, 2023
@lastzero lastzero changed the title Slow face tagging and error "Failed removing merged clusters for subject" Faces: Failed removing merged clusters for subject Jan 20, 2023
@lastzero lastzero added the performance Performance Optimization label Jan 20, 2023
@lastzero lastzero self-assigned this Jan 20, 2023
@lastzero lastzero added the help wanted Help with this would be much appreciated! label Jan 20, 2023
@lastzero lastzero removed their assignment Jan 20, 2023
@mdmarwil

I am having this same issue, and the command docker compose exec photoprism photoprism faces audit --fix does not help.

My background is mostly in Python, but I tried to do some digging through the code.

The Warning message faces: failed removing merged clusters for subject comes from the following code block in internal/query/faces.go:

// PurgeOrphanFaces removes unused faces from the index.
if removed, err := PurgeOrphanFaces(merge.IDs()); err != nil {
	return merged, err
} else if removed > 0 {
	log.Debugf("faces: removed %d orphans for subject %s", removed, clean.Log(subjUID))
} else {
	log.Warnf("faces: failed removing merged clusters for subject %s", clean.Log(subjUID))
}

Just prior to this block of code, MatchMarkers() is called. It tries to match 1 Face to 4 other Faces with the same subj_uid, but it looks like the embedding distance of the 4 other Faces is too far from the first Face for them to be considered a match, so none of them succeed.

Going back to the block of code shown above that means:

  • No error was encountered
  • The number removed is 0, since no faces were reassigned and thus orphaned

So it falls into the else block with a warning message, although in this case a more accurate message would state that there were no orphaned clusters to remove.
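The match-by-distance behavior described here can be sketched as a toy in Go (illustrative only: the Euclidean metric, the 0.2 cutoff, and all names are placeholders, not PhotoPrism's actual matching implementation or thresholds):

```go
package main

import (
	"fmt"
	"math"
)

// dist is a plain Euclidean distance between two embeddings.
func dist(a, b []float64) float64 {
	var s float64
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return math.Sqrt(s)
}

// matches reports whether two embeddings are close enough to merge.
func matches(a, b []float64, cutoff float64) bool { return dist(a, b) < cutoff }

func main() {
	anchor := []float64{0.1, 0.2, 0.3}
	near := []float64{0.11, 0.19, 0.31}
	far := []float64{0.9, 0.8, 0.7}
	fmt.Println(matches(anchor, near, 0.2)) // true
	fmt.Println(matches(anchor, far, 0.2))  // false
}
```

If all candidate faces land on the far side of the cutoff, nothing merges, removed stays 0, and the code falls through to the misleading warning.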

@dalfry

dalfry commented Feb 5, 2023

I'm running into this issue. I tried running the audit fix command with no change. I have the merge failed error for one single subject. My library has 100k+ files, and I spent a week tagging faces, so I would prefer not to reset faces. :-)

BTW, thanks for all your work! PhotoPrism is awesome, and I am a sponsor as well.

TRAC[2023-02-05T19:31:33Z] config: defaults file /etc/photoprism/defaults.yml does not exist
INFO[2023-02-05T19:31:33Z] config: case-insensitive file system detected
DEBU[2023-02-05T19:31:33Z] config: running on 'Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz', 4.1 GB memory detected
DEBU[2023-02-05T19:31:33Z] settings: loaded from /photoprism/storage/config/settings.yml
DEBU[2023-02-05T19:31:33Z] config: successfully initialized [35.329727ms]
INFO[2023-02-05T19:31:33Z] migrate: executing pre migrations
DEBU[2023-02-05T19:31:33Z] migrate: found 2 previous migrations
DEBU[2023-02-05T19:31:33Z] migrate: 20221015-100000 skipped
DEBU[2023-02-05T19:31:33Z] migrate: 20221015-100100 skipped
INFO[2023-02-05T19:31:34Z] migrate: executing main migrations
DEBU[2023-02-05T19:31:34Z] migrate: found 21 previous migrations
DEBU[2023-02-05T19:31:34Z] migrate: 20211121-094727 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20211124-120008 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-030000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-040000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-050000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-060000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-061000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-070000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-071000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-080000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-081000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-083000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-090000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-091000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220329-093000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220421-200000 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000001 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000002 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220521-000003 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20220927-000100 skipped
DEBU[2023-02-05T19:31:34Z] migrate: 20221002-000100 skipped
TRAC[2023-02-05T19:31:34Z] migrate: migrations migrated
TRAC[2023-02-05T19:31:34Z] migrate: auth_sessions migrated
TRAC[2023-02-05T19:31:34Z] migrate: countries migrated
TRAC[2023-02-05T19:31:34Z] migrate: labels migrated
TRAC[2023-02-05T19:31:35Z] migrate: photos_labels migrated
TRAC[2023-02-05T19:31:35Z] migrate: reactions migrated
TRAC[2023-02-05T19:31:35Z] migrate: files_share migrated
TRAC[2023-02-05T19:31:35Z] migrate: files_sync migrated
TRAC[2023-02-05T19:31:35Z] migrate: details migrated
TRAC[2023-02-05T19:31:35Z] migrate: photos_albums migrated
TRAC[2023-02-05T19:31:35Z] migrate: auth_users_shares migrated
TRAC[2023-02-05T19:31:35Z] migrate: places migrated
TRAC[2023-02-05T19:31:35Z] migrate: lenses migrated
TRAC[2023-02-05T19:31:35Z] migrate: keywords migrated
TRAC[2023-02-05T19:31:35Z] migrate: auth_users migrated
TRAC[2023-02-05T19:31:35Z] migrate: folders migrated
TRAC[2023-02-05T19:31:36Z] migrate: duplicates migrated
TRAC[2023-02-05T19:31:36Z] migrate: photos_users migrated
TRAC[2023-02-05T19:31:36Z] migrate: albums migrated
TRAC[2023-02-05T19:31:36Z] migrate: albums_users migrated
TRAC[2023-02-05T19:31:40Z] migrate: photos_keywords migrated
TRAC[2023-02-05T19:31:40Z] migrate: links migrated
TRAC[2023-02-05T19:31:40Z] migrate: subjects migrated
TRAC[2023-02-05T19:31:40Z] migrate: markers migrated
TRAC[2023-02-05T19:31:40Z] migrate: auth_users_settings migrated
TRAC[2023-02-05T19:31:40Z] migrate: services migrated
TRAC[2023-02-05T19:31:40Z] migrate: errors migrated
TRAC[2023-02-05T19:31:40Z] migrate: passwords migrated
TRAC[2023-02-05T19:31:40Z] migrate: cells migrated
TRAC[2023-02-05T19:31:40Z] migrate: cameras migrated
TRAC[2023-02-05T19:31:40Z] migrate: categories migrated
TRAC[2023-02-05T19:31:40Z] migrate: faces migrated
TRAC[2023-02-05T19:31:40Z] migrate: auth_users_details migrated
TRAC[2023-02-05T19:31:41Z] migrate: files migrated
TRAC[2023-02-05T19:31:41Z] migrate: photos migrated
DEBU[2023-02-05T19:31:41Z] migrate: completed in 7.796722162s
INFO[2023-02-05T19:31:41Z] faces: found 104 subjects
INFO[2023-02-05T19:31:41Z] faces: found no invalid marker subjects
INFO[2023-02-05T19:31:41Z] faces: found no invalid marker faces
INFO[2023-02-05T19:32:10Z] faces: found no ambiguous subjects
INFO[2023-02-05T19:32:10Z] faces: found no orphan face clusters
INFO[2023-02-05T19:32:10Z] faces: found no orphan people
INFO[2023-02-05T19:32:10Z] completed in 36.75042936s
INFO[2023-02-05T19:32:10Z] closed database connection

@lastzero lastzero self-assigned this Feb 9, 2023
@lastzero lastzero added the in-progress Somebody is working on this label Feb 9, 2023
lastzero added a commit that referenced this issue Feb 21, 2023
Signed-off-by: Michael Mayer <michael@photoprism.app>
lastzero added a commit that referenced this issue Feb 21, 2023
Signed-off-by: Michael Mayer <michael@photoprism.app>
@lastzero lastzero added please-test Ready for acceptance test and removed in-progress Somebody is working on this labels Feb 21, 2023
@lastzero
Member

I now have a clearer idea of the underlying problem and think we can improve the situation by not matching manually labeled faces if they do not meet the constraints set for automatic clustering (more specifically, cluster core and core distance).

When testing with the pictures we had been using during development, a more aggressive approach for manually labeled faces seemed to work well, but this is only feasible for faces that the model can categorize well. For example, all these faces were matched after labeling a single face in the edit dialog:

[image: all faces matched after labeling a single face]

@lastzero lastzero added in-progress Somebody is working on this and removed please-test Ready for acceptance test help wanted Help with this would be much appreciated! labels Feb 23, 2023
@pjft

pjft commented Feb 24, 2023

Thank you for the update - this looks promising! What consequences would this change bring about? Meaning, what will change by not matching manually labeled faces if they do not meet the constraints set for automatic clustering?

Just curious to understand the "why the original decision, what problems did it solve" and "what consequences does it have".

Not urgent, by any means - just intellectual curiosity. :)

Thank you, and happy Friday!

@lastzero lastzero changed the title Faces: Failed removing merged clusters for subject Faces: Improve performance and strategy for manual tagging Feb 24, 2023
@lastzero
Member

@pjft We can't tell what the final effect will be until it's implemented. It is expected that you will have to tag a few more faces that belong to the same person before automatic matching starts, which is basically how most other apps work.

Since these changes require extensive testing, we will probably wait until after the next stable release to avoid delaying it even further.

The sample faces we use for testing can be downloaded here:

@pjft

pjft commented Feb 24, 2023

Of course - appreciate the time and effort in looking into this.

Let us know how we can help.

lastzero added a commit that referenced this issue Feb 24, 2023
Signed-off-by: Michael Mayer <michael@photoprism.app>
lastzero added a commit that referenced this issue Mar 8, 2023
This may reduce server load and prevent disks from spinning up.
We welcome test reports!

Signed-off-by: Michael Mayer <michael@photoprism.app>
seeschloss pushed a commit to seeschloss/photoprism that referenced this issue Apr 10, 2023
Signed-off-by: Michael Mayer <michael@photoprism.app>
seeschloss pushed a commit to seeschloss/photoprism that referenced this issue Apr 10, 2023
…sm#3124

This may reduce server load and prevent disks from spinning up.
We welcome test reports!

Signed-off-by: Michael Mayer <michael@photoprism.app>
@pjft

pjft commented May 19, 2023

Just wanted to say that this seems to be solved in the latest builds!

Awesome work :) It's a breeze to label hundreds of clusters now.

@lastzero
Member

@pjft Thank you! That's good to know, since the feedback we've received on the optimizations hasn't been very enthusiastic so far. Most users didn't seem to notice (or care about) the changes when they tested our development preview builds.

@nodecentral

Just wanted to add to this and share how inconsistent it can be when manually tagging (or de-tagging) faces. As this is such a key part of the product for me (it helps my aging parents remember people from their past), anything that can be done to speed things up would be amazing.

The face tagging feature is the most important one for a number of people - thank you so much for making it available.

@theshadow27
Contributor

theshadow27 commented Dec 20, 2024

This has been bugging me for a while, but it's now taking 200-300 seconds per face. That's enough time to start looking through the code.

Background

The call that starts the blocking operation, and subsequently fails with a 429, is PUT /api/v1/markers/:marker_uid (the controller is func UpdateMarker(router *gin.RouterGroup) in internal/api/markers:182). The 429 error is returned on line 191 when mutex.UpdatePeople.Start() is already locked. The next line uses defer mutex.UpdatePeople.Stop() to hold the mutex until the UpdateMarker function returns.

The issue is that the remainder of the function can take a VERY long time to return: upwards of 3 minutes. The application shows both high CPU utilization (100% of one core) and heavy DB traffic (distinct spikes of queries). As best I can tell from the logs, when SaveForm reports changes (which will almost always be true when adding a name to a face that did not have one before), we do some stuff:

  1. get.Faces().Optimize() (when marker.FaceID != "" && marker.SubjUID != "" && marker.SubjSrc == entity.SrcManual, which again is always true from what I can gather based on the front end, since we passed the marker and subject in with the call and the src is always 'manual')
  2. query.UpdateSubjectCovers(true) - seems like a single, well-optimized query
  3. entity.UpdateSubjectCounts(true) - also a single, well-optimized query
  4. query.PhotoByUID(file.PhotoUID) - also a single query
  5. p.UpdateAndSaveTitle() - slightly more complicated, but nothing alarming

Based on log line timing, I suspect that most of the time is being spent in (1), Faces().Optimize(), which is invoked on every single successful call to PUT markers.
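The lock-and-reject flow can be modeled with a small standalone sketch (hypothetical types: Busy stands in for mutex.UpdatePeople, and the real controller returns a gin response rather than a bare status code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Busy emulates the described mutex behavior: Start fails when another
// update is already running, which the controller turns into HTTP 429.
type Busy struct{ running atomic.Bool }

func (b *Busy) Start() bool { return b.running.CompareAndSwap(false, true) }
func (b *Busy) Stop()       { b.running.Store(false) }

// updateMarker sketches the controller flow: acquire the flag, do the
// slow work (Optimize etc.), release on return via defer.
func updateMarker(b *Busy, slowWork func()) int {
	if !b.Start() {
		return 429 // concurrent request rejected
	}
	defer b.Stop()
	slowWork()
	return 200
}

func main() {
	var b Busy
	b.Start() // simulate a long-running Optimize holding the flag
	fmt.Println(updateMarker(&b, func() {})) // 429
	b.Stop()
	fmt.Println(updateMarker(&b, func() {})) // 200
}
```

While the real handler holds the lock for the whole Optimize() run, every concurrent tagging request takes the 429 path, which is why the UI appears throttled.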

faces_optimize.go : Optimize()

The Optimize() function is scary for something routinely invoked by a front-end client. First off, it starts with a loop that runs up to 11 times (although there is an early abort, this is a lot of effort). Up to 11 times, until no progress is made, we:

  1. Query EVERY manually added face from the DB: ManuallyAddedFaces(false, false), basically SELECT * FROM faces WHERE face_hidden=false AND face_src="manual" AND subj_uid <> '' AND face_kind <= 1 ORDER BY subj_uid, samples DESC. Remember - up to 10 times! Why?!? We are not manually adding new faces; the mutex has locked us out of doing so.
  2. For EACH manually added face from the DB (mine has n=322 rows):
  • merge all faces where the subject has the same UID and continue (fine)
  • run merge[0].match(faces[j].embedding()) (I'll come back to this later); if that returns ok, print a log line and append the face to the merge
  • while the faces are the same, try to merge them in the DB using query.MergeFaces - more on this next
  3. Return the result, which has a single int property, Merged: a count of the number merged.
  4. Print the count of merged faces via log.Infof("faces: merged %s", english.Plural(res.Merged, "cluster", "clusters")) so that we know PhotoPrism is ready to accept the next name (3 minutes later).

In summary, we loop up to 11*322 times, calling Match up to len(faces)-len(subjects) times per pass (322-135 = 187 for me) and MergeFaces len(faces)-2*len(subjects) times (per the log lines, this seems to be about 121-122 times).

The 10 iterations are important - I am moderately confident that this loop never exits early (at least on my setup). See the DB traffic graph below for each name-adding operation:

[image: DB traffic graph during a single name-adding operation]

There are 11 distinct spikes of "UPDATE" statements (peak = 220/second).

What happens in MergeFaces

Yikes. Remember, this happens essentially 11 * (number of subjects) times for each PUT API call:

  1. Allocate memory and serialize the embeddings (entity.NewFace)
  2. DB call to find the first face with the subject UID, or create it if it does not exist (FindFirstOrCreate)
  3. Attempt to match markers from the IDs (MatchMarkers), which does another query for all markers with the face IDs, running Match against each one - and potentially calling SetFace, which then calls ResolveCollision, which AGAIN matches all the embeddings and might write to the DB. In my case this is rare, but it does happen, which means most of the cycles are wasted.
  4. Purge orphan faces via PurgeOrphanFaces - this seems like an over-eager operation considering how many times it is run.

The end result is thousands of UPDATE operations on the DB, along with substantial CPU time spent in Match(), for each PUT API call.

Why Optimize Every Call?

TBH I have no clue. I put a lot of the source through GPT-4o and asked, and it said:

Running the Optimize() function every time a face is tagged is likely an overly cautious approach intended to ensure consistency in the face clustering database. However, this is a highly inefficient design for the following reasons:

  1. Redundancy in Optimization: The optimization process (which appears to cluster faces and update subject-related data) does not need to be run for every face tagging event. Most likely, the state of the system doesn't change drastically with a single new tag, especially for large datasets.
  2. Expensive Operations:
  • The clustering (Optimize()) is an iterative and computationally expensive process, particularly for large datasets.
  • Functions like UpdateSubjectCovers() and UpdateSubjectCounts() involve database updates and recalculations that can add significant overhead when called repeatedly.
  3. Mutex Locking (mutex.UpdatePeople.Start()): This locking mechanism prevents concurrent updates to ensure data consistency. While this makes sense to avoid race conditions, it leads to throttling issues (429 Too Many Requests) when the clustering is invoked excessively.

The logic may have been implemented this way to handle:

  • Immediate consistency: Ensuring that all face tags are optimized and clustering results are updated instantly after every change.
  • Avoidance of stale data: The developers might have been overly concerned about stale data in subjects or clusters, leading them to enforce Optimize() every time.

Its guess is as good as mine. I do agree that it does not need to run on each tagging operation. Perhaps it means less efficient clusters for the tagging session - fine. I can enter 50 names in the time it takes for one Optimize cycle to complete!

How to not Optimize on every call

  1. Batch Optimization: Instead of running Optimize() every time a face is tagged, queue the optimization task and run it periodically (e.g., every few minutes) or when a batch of changes is detected.
  2. Health Metrics: Run the Optimize call when a specific metric, which is inexpensive to calculate, indicates that matching would benefit from optimization. Candidates include:
  • Clustering state (average size of GROUP BY subj_uid in the faces table)
  • Tagging a new subject (one that did not exist before) when it's the third or fourth image of them
  3. Async Optimization and Partial Updates:
  • Immediately tag the face and return
  • Run Optimize() asynchronously, updating only the affected clusters/subjects
  • Use a message queue pattern with a deduplicating filter to avoid unnecessary work
  4. User-Triggered Optimization:
  • Use user behavior cues on the front-end (such as having the edit window open and clicking previous/next photo) to determine that the user is "tagging faces" - when this happens, start a timer, and restart it on each next/previous load. Run Optimize only when there have been no API calls for some time (5 minutes, or user-adjustable).
  • Allow a user-initiated optimization via a button in the UI.

If it is necessary to Optimize on every call...

  1. Verify that the loop's early-termination condition is working correctly. Mine never seems to finish in fewer than 11 cycles.
  2. Potentially use the new VEC_DISTANCE in MariaDB 11.7 (and 10.12) to reduce the number of markers that need to be Matched - I believe there is a separate ticket #4669 for this.
  3. Rather than retrying a merge of every single cluster each time, can we focus on just the subject_uid that changed?
  4. Something else?

In conclusion

Regarding @lastzero 's comment:

Unfortunately, we haven't had time to work on this yet. We will reference this issue in the related commits and mention it in the release notes once the improvements are available. Alternatively, feel free to work on this and provide us with a pull request that we just need to merge. Thank you! ❤️

I am happy to help, but I would appreciate some thoughts on the best strategy first so that it does not end up as wasted effort. Do you feel it is critical to Optimize on every PUT API call? If so, why? If not, what is your preferred pattern for hiding this latency from the user? My goal is to contribute something usable, not to add to the pile of 30-something unmerged PRs.

Edit: The loop actually runs 11 times, not 10 (because it's j <= 10). This matches the number of huge query spikes seen during the Optimize run in DBeaver.

@theshadow27
Contributor

theshadow27 commented Dec 20, 2024

To investigate further, I used DBeaver to export the faces table before and after one of these multi-minute, CPU-intensive Optimize() calls. The net diff was a single row (the face that was changed), with only a single field added (subject_uid).

[image: faces table diff showing a single changed row]

From this, I have to assume that the clusters aren't actually being merged in any meaningful way during the long-running Optimize() calls, despite the logs reporting that they are. Perhaps this is a bug; if the clusters were actually merged, there would not be repeated "faces: merged 121 clusters [2m30.469009483s]" lines in the log...

2024-12-20 17:05:55 time="2024-12-20T22:05:55Z" level=info msg="changes successfully saved"
2024-12-20 17:09:07 time="2024-12-20T22:09:07Z" level=info msg="faces: merged 120 clusters"
2024-12-20 17:09:07 time="2024-12-20T22:09:07Z" level=info msg="changes successfully saved"
2024-12-20 17:15:34 time="2024-12-20T22:15:34Z" level=info msg="faces: merged 121 clusters [2m30.469009483s]"
2024-12-20 17:19:29 time="2024-12-20T22:19:29Z" level=info msg="faces: updated 1 marker, recognized 1 face, 0 unknown [3m55.490692353s]"
2024-12-20 17:19:37 time="2024-12-20T22:19:37Z" level=info msg="index: updated 8 photos [7.306792706s]"
2024-12-20 17:33:05 time="2024-12-20T22:33:05Z" level=info msg="changes successfully saved"
2024-12-20 17:33:21 time="2024-12-20T22:33:21Z" level=info msg="changes successfully saved"
2024-12-20 17:35:48 time="2024-12-20T22:35:48Z" level=info msg="changes successfully saved"
2024-12-20 17:36:18 time="2024-12-20T22:36:18Z" level=info msg="changes successfully saved"
2024-12-20 17:36:22 time="2024-12-20T22:36:22Z" level=info msg="changes successfully saved"
2024-12-20 17:39:13 time="2024-12-20T22:39:13Z" level=info msg="faces: merged 121 clusters"
2024-12-20 17:39:14 time="2024-12-20T22:39:14Z" level=info msg="changes successfully saved"
2024-12-20 17:45:33 time="2024-12-20T22:45:33Z" level=info msg="faces: merged 121 clusters [2m29.398296581s]"

@theshadow27
Contributor

theshadow27 commented Dec 22, 2024

I think I found the bug causing the Optimize loop not to terminate early (and to run 11 times, i := 0; i <= 10; i++):

The early termination code:

// Done?
if result.Merged <= c {
	break
}

checks whether any faces were merged on the last run through the loop. Basically, if a call to query.MergeFaces(merge, false) with mergeable faces (clusters that Match with merge[0]) does not error, we assume that all faces were merged, with result.Merged += len(merge).

This is a flawed assumption because:

  • For a valid (matching subj_uid and embedding Match) set of mergeable faces, MergeFaces is not guaranteed to actually eliminate any (or all) of the supplied Faces.
  • Specifically, in MatchMarkers, the "old, suboptimal" cluster face_id is not replaced when m.Match(marker.Embeddings()) is not ok; in this case we // Ignore on face.go:L265.
  • So if the merge does NOT result in EVERY face_id being replaced with the new super-cluster, then PurgeOrphanFaces cannot delete the old clusters.
  • We try to check this with an if removed > 0 condition in faces.go:L216, so it seems we should get a warning log when this happens, yet we don't see any such warnings. Except:
  • In the PurgeOrphanFaces function, there appears to be a subtle bug: if no rows are affected by the delete, i.e. else if result.RowsAffected > 0 is not true, len(ids) (the batch size) is added to the affected counter anyway.
  • So when we try to merge (for example) 3 clusters but 0 can be deleted, there is no warning/error, and MergeFaces returns as if it did useful work.

Because the daemon runs this function every 15 minutes, it is highly likely that all face clusters which can be merged have already been merged. In a static library in steady state that has at least 2 clusters which could be merged according to embedding matching, but where the smaller cluster can't be removed because the new cluster doesn't match the face, this loop will run 11 times, checking every cluster against all others with the same subj_uid each time.

This could be fixed in 2 ways:

  1. Fix PurgeOrphanFaces so that if no rows are affected, it returns 0. This will trigger the Warnf in MergeFaces and the Errorf in Optimize each time, but at least result.Merged won't be incremented, so the 11x loop won't run unnecessarily. The downside (besides the log output) is that if PurgeOrphanFaces or MergeFaces is used elsewhere and relies on the broken behavior, those other consumers may no longer work as expected. GitHub doesn't seem to think that either function has other consumers, but it might have missed something.
  2. Find a different early-termination condition. This has the upside of not disrupting the current behavior. For example, since we call query.ManuallyAddedFaces each time, we could track the number of face clusters returned (n) and terminate if n >= prev_n (i.e. the number of clusters increased or stayed the same). The downside is that it would require an extra query of all faces (with their embeddings).
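Fix (1) can be illustrated with a standalone model of the counting behavior (simplified stand-ins for the thread's description; the real query.PurgeOrphanFaces runs a batched DB delete and reads result.RowsAffected):

```go
package main

import "fmt"

// purgeOrphansBuggy models the reported behavior: when the delete
// affects zero rows, the batch size is added to the counter anyway,
// so callers believe orphans were removed.
func purgeOrphansBuggy(ids []string, rowsAffected int64) int {
	removed := 0
	if rowsAffected > 0 {
		removed += int(rowsAffected)
	} else {
		removed += len(ids) // bug: counts a no-op delete as work done
	}
	return removed
}

// purgeOrphansFixed reports only what the database actually deleted.
func purgeOrphansFixed(ids []string, rowsAffected int64) int {
	return int(rowsAffected)
}

func main() {
	ids := []string{"a", "b", "c"}
	fmt.Println(purgeOrphansBuggy(ids, 0)) // 3: looks like useful work
	fmt.Println(purgeOrphansFixed(ids, 0)) // 0: merge counted as no-op
}
```

With the fixed counter, result.Merged stops growing when nothing is actually deleted, so the if result.Merged <= c check can finally break out of the loop.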

@lastzero if you let me know whether you prefer fix (1) or fix (2), I am happy to submit a PR.

There are a few other things I noticed while tracing this code:

In the Optimize function, the else if len(merge) == 1 { merge = nil } condition is probably a bug. It is trying to reset the set of mergeable faces for the case where it's the same subject but the faces don't match. In doing so, it skips a potential face to merge. Consider clusters A, B, C, D, E, F, all of the same subject:

  • A, B -> merge, reset
  • C (D does not match) so reset
  • Intended Behavior (I think): Try to match D with E
  • Actual Behavior: Try to match E with F

To fix this, change the code to else if len(merge) == 1 { merge = entity.Faces{faces[j]} }
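The skip can be demonstrated with a toy walk over same-subject clusters (proxy logic: two clusters "match" when they carry the same tag, whereas the real code compares embeddings; all names here are illustrative):

```go
package main

import "fmt"

// anchors returns the first cluster of each run of "matching" clusters.
// With merge = nil on a mismatch (seedOnReset=false), the mismatching
// cluster is dropped entirely; seeding the reset with it keeps it as
// the next merge anchor.
func anchors(tags []string, seedOnReset bool) []string {
	var out []string
	var merge []string
	for j := 0; j < len(tags); j++ {
		switch {
		case len(merge) == 0:
			merge = append(merge, tags[j])
		case merge[0] == tags[j]:
			merge = append(merge, tags[j])
		default:
			out = append(out, merge[0])
			if seedOnReset {
				merge = []string{tags[j]} // fix: j stays a candidate
			} else {
				merge = nil // bug: j is dropped
			}
		}
	}
	if len(merge) > 0 {
		out = append(out, merge[0])
	}
	return out
}

func main() {
	tags := []string{"A", "B", "C"}
	fmt.Println(anchors(tags, false)) // [A C]: B was skipped
	fmt.Println(anchors(tags, true))  // [A B C]
}
```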

A more aggressive refactoring could also improve performance significantly: rather than querying all faces on each pass through the loop, we could get the list of faces at the start, then only query for face IDs with a created_at constraint, fetching only the new faces. Better yet (though more involved), all of the operations could be kept in memory, with an append-only log of 'added' and 'removed' entries written to the DB once at the end. When there are hundreds of faces, this repeated querying consumes significant DB resources.

Edit: One further (but schema-changing) optimization would be to add an optimized_at column to the faces table. Whenever a face cluster is matched/added to a marker, this column would be cleared. In the Optimize loop, following an optimization run, each remaining cluster that was processed would have optimized_at set. Add one more argument, ManuallyAddedFaces(hidden, ignored, optimized), that only returns Faces where at least one face with that subj_uid has an empty optimized_at value... The query would be something like:

SELECT *
FROM faces f
WHERE 
    f.face_hidden = false 
    AND f.face_src = 'manual'
    AND f.subj_uid <> ''
    AND f.face_kind <= 1
    AND EXISTS (
        SELECT 1
        FROM faces sub_f
        WHERE 
            sub_f.subj_uid = f.subj_uid
            AND (sub_f.optimized_at IS NULL OR sub_f.optimized_at = '')
    )
ORDER BY f.subj_uid, f.samples DESC;

Or thereabouts. This way, we'd only be processing faces where at least one marker had changed, rather than doing the same work over and over.

Insanity is doing the same thing over and over again and expecting different results - Albert Einstein

@theshadow27
Contributor

I have a PR going; no confirmed fix yet: #4691
BTW, tests are failing in WSL off develop - probably not your native env.

@theshadow27
Contributor

theshadow27 commented Dec 31, 2024

(from the PR for visibility by @lastzero )

This issue is deeper than initially suspected - the proposed fix (1) is necessary but not sufficient.

After implementing the proposed fix (1), the loop still runs all 11 times. It seems that when there are several clusters of the same size (six, in my case), each iteration can attempt to merge them, resulting in one new cluster being created and one old cluster being dropped.

Trying to grok what is going on here...

When the new cluster is created, its ID comes from the embeddings... the centroid of all possible merge candidates. Somehow, this results in a new face hash (rounding issues? order of operations?) and the creation of a new automatic cluster that replaces the old one. Since MatchMarkers eagerly asserts ownership over the markers, it is happy to take them from the other automatic cluster. The database is updated by adding the new cluster and removing the old cluster (effectively a no-op)... and the cycle repeats.

I assumed that the intent of the md5 face_id was to prevent exactly this problem, and initially speculated that a stable sort order from ManuallyAddedFaces might fix it, but thinking about it some more, I can see how it would not:

if A + B + C = D
then D + B + C = E
then E + B + C = F
...

each time, the centroid is shifted slightly, resulting in a new face_id - one that will be very, very close to the previous one, but not identical.
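The drift is easy to demonstrate with toy embeddings (an illustration only; the centroid and hashing here stand in for PhotoPrism's actual face_id derivation):

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// centroid averages a set of toy embedding vectors.
func centroid(vecs [][]float64) []float64 {
	c := make([]float64, len(vecs[0]))
	for _, v := range vecs {
		for i, x := range v {
			c[i] += x
		}
	}
	for i := range c {
		c[i] /= float64(len(vecs))
	}
	return c
}

// faceID hashes a centroid, standing in for the real md5 face_id.
func faceID(vecs [][]float64) string {
	return fmt.Sprintf("%x", md5.Sum([]byte(fmt.Sprintf("%v", centroid(vecs)))))
}

func main() {
	a, b, c := []float64{1, 0}, []float64{0, 1}, []float64{1, 1}
	d := centroid([][]float64{a, b, c}) // merge A+B+C -> D
	e := centroid([][]float64{d, b, c}) // re-merge D+B+C -> E
	fmt.Println(d, e) // the centroid shifts slightly on each pass
	// The shifted centroid hashes to a brand-new face_id:
	fmt.Println(faceID([][]float64{a, b, c}) == faceID([][]float64{d, b, c})) // false
}
```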

Initial thought: One way around this would be to do the match in-memory, i.e. MatchMarkers should store the results but not actually write anything. Then, it could be compared to the previous marker set, and if it was the same set of markers, not persist itself or take them over. Given the structure of the code, this is a complicated change. I suppose it could be done with less friction inside a TRANSACTION (with rollback) at the expense of additional DB overhead. I am not sure the best way to proceed here. If you have any suggestions @lastzero I would be happy to try them with my problematic db.

In the meantime, I will add a parameter to Optimize(subj_uid) that can be passed through to ManuallyAddedFaces for the special case of markers.UpdateMarker when the subj_uid is known in advance. Because of the locking here, this will only ever result in that subject being optimized (it's the only face cluster to have changed, and the merge set is all the same subj_uid). At least that will cut down on the amount of work this loop has to do while the user is blocked.
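A minimal sketch of that filtering (the signature and names are hypothetical, not the actual PhotoPrism API):

```go
package main

import "fmt"

// face is a toy stand-in for a face cluster row.
type face struct {
	id      string
	subjUID string
}

// manuallyAddedFaces filters by subject when one is known; an empty
// subjUID preserves the existing worker/CLI behavior of fetching all.
func manuallyAddedFaces(all []face, subjUID string) []face {
	if subjUID == "" {
		return all // worker/CLI path: behavior unchanged
	}
	var out []face
	for _, f := range all {
		if f.subjUID == subjUID {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	all := []face{{"f1", "alice"}, {"f2", "bob"}, {"f3", "alice"}}
	// The POST handler passes the known subject, so the optimize loop
	// only touches "alice" while the user waits:
	fmt.Println(len(manuallyAddedFaces(all, "alice"))) // 2
	fmt.Println(len(manuallyAddedFaces(all, "")))      // 3
}
```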

Edit: Noticed another problem (bug?). The sort in ManuallyAddedFaces orders by samples on the assumption that this value reflects reality. However, it seems it is never updated - not by Optimize, not by photoprism faces update -f, not by photoprism faces audit --fix. Once it's there, it's set (as far as I have observed). In some spot checks, I have several cases where samples is something like 7-10, but there are 25+ markers with that face_id meeting the minimum score and size. Unless I'm missing something and the intent is not for it to denote how many markers are matched to the cluster?

@lastzero
Member

@theshadow27 Thanks for taking a look and sharing your insights! 🧐

Our current plan is to offload (most of) the calculations to the new MariaDB vector functions in an upcoming release:

This should yield better results and be much faster than our own vector comparisons, which were intended as a temporary solution until we have a proper vector database/library in place. Unfortunately, as we are currently performing a major upgrade of our frontend, I do not have any code or a more specific timeline/roadmap for this yet:

The edge cases that need to be handled/investigated (see your comments and what I've commented in other places) further encourage us to go this way. If you like, you are welcome to experiment with the vector functions and work on a proof of concept? ✨

Of course, if you find improvements to the current code that we can test, merge, and release within a reasonable timeframe, we'd be happy to do that as well - but my feeling is that the effort required for a fast and completely bug-free implementation is beyond what anyone could maintain/contribute... 🤔

@theshadow27
Contributor

@lastzero thanks for getting back, and happy new years! 🥂

First off, IMHO the current implementation/"temporary" solution is VERY good at meeting its functional requirements - better than Apple Photos for sure. In the rare situation where it's confused a subject, it's straightforward to eject them and fix it. New markers are identified and matched quickly and accurately. Youth sports are a brutal use case, and I sincerely appreciate the effort that went into the current version! The only issue I have is with the nonfunctional UX requirement of not having excessive delays while naming many markers in quick succession.

To be clear: the current state of the draft PR offers a significant improvement to the UX; by limiting the Optimize function to the face supplied in the POST, it almost never triggers the 11-iteration loop, unless you are specifically working on one of those subjects that has problematic clusters. It is not a big PR (40 lines, 12 lines functional), and I don't believe it changes the behavior at all for the worker/CLI usage - a huge plus for building confidence after years of stable deployments.

Regarding the Vector implementation - I'm already thinking about it, but the scope will be much wider than this PR, with impacts from DBSCAN up to everywhere that Match is used (at least 11 places) - likely requiring substantial rework. In addition to schema updates, a Vector implementation requires a minimum DB version that could break a lot of existing embedded/hosted installs without manual intervention during the upgrade. To keep backwards compatibility, some version of the current implementation strategy needs to be encapsulated behind an interface, further expanding the scope of refactoring, and potentially running with two different Gorm entities depending on a flag. I don't know enough about Gorm to know if that's ok or not. This is a Big project, lots of new code, lots of testing, bug hunting, and so on.

As such, my suggestion is to try to get the minimum change for maximum UX improvement merged sooner than later - there are quite a few users who have reported this or derivative issues going back to October 2022. It doesn't fix the root cause, but it hides it well enough to make a big difference for those who are impacted. What do you think?

P.S. If you're not comfortable rolling it out everywhere, it would not take much more to hide it behind a flag (like PHOTOSTREAM_EXPERIMENTAL_FACES_OPTIMIZE), which would eliminate nearly all of the risk for "normal" users doing routine upgrades.

P.P.S. I am working on another branch with a more substantial diff: a complete rewrite of the Optimize function for how I think it should work - breaking out each subj_uid into its own struct and operating on them independently, spending more effort on finding mergeable clusters within a subject, going back to ground-truth tagged markers instead of cluster-of-clusters, and only re-fetching faces when that particular subject had a change. This will be several hundred lines of diff, so I don't expect it to be accepted without substantial testing that I will likely not finish this holiday 😄.

@lastzero lastzero added faces Face Detection, Recognition and Search ai Artificial Intelligence, Machine Learning (ML) labels Jan 24, 2025
lastzero pushed a commit that referenced this issue Mar 27, 2025
This PR implements a strategy to improve Optimize() performance by returning the correct number of modified rows in PurgeOrphanFaces and treating the result 0 as an error condition in MergeFaces. Logging around this has been improved in faces.go and faces_optimize.go.

Related Comments:
- #3124 (comment)
- #4691 (comment)
@lastzero
Member

@theshadow27 Thank you so much for your patience while we completed the frontend upgrade! 🙏

The last batch of UI/UX enhancements was released last week, so I'm happy to report that your PR is now merged and ready to be tested with our development preview build:

Any help with that is much appreciated :)

@lastzero lastzero moved this from Development 🐝 to Preview 🐳 in Roadmap 🚀✨ Mar 27, 2025
@lastzero lastzero added the please-test Ready for acceptance test label Mar 27, 2025
@Coloradohusky

Coloradohusky commented Apr 6, 2025

While I am not OP, I have definitely noticed speedups with manual tagging when on the latest preview build, 250405-eed7bfff7 - speeds went from around 30 seconds to a minute+ before, to around 2-10 seconds now. (although in certain cases it can still take 45 seconds, but it is much better overall)

@lastzero
Copy link
Member

lastzero commented Apr 7, 2025

@Coloradohusky Thanks for testing! While the API could also be faster due to other performance improvements we've implemented in the meantime, it's definitely great to hear :)

@graciousgrey graciousgrey added tested Changes have been tested successfully and removed please-test Ready for acceptance test labels Apr 18, 2025
@graciousgrey graciousgrey moved this from Preview 🐳 to Release 🌈 in Roadmap 🚀✨ Apr 25, 2025
@graciousgrey graciousgrey added the released Available in the stable release label Apr 25, 2025