Version 354
hydrusnetwork committed May 29, 2019
1 parent 4dcdc1e commit e556dba
Showing 30 changed files with 1,472 additions and 512 deletions.
46 changes: 46 additions & 0 deletions help/changelog.html
@@ -8,6 +8,52 @@
<div class="content">
<h3>changelog</h3>
<ul>
<li><h3>version 354</h3></li>
<ul>
<li>duplicates important:</li>
<li>duplicates 'false positive' and 'alternates' pairs are now stored in a new more efficient structure that is better suited for larger groups of files</li>
<li>alternate relationships are now implicitly transitive--if A is alternate B and A is alternate C, B is now alternate C</li>
<li>false positive relationships remain correctly non-transitive, but they are now implicitly shared amongst alternates--if A is alternate B and A is false positive with C, B is now false positive with C. further, if C is alternate D, then A and B are implicitly false positive with D as well!</li>
<li>your existing false positive and alternates relationships will be migrated on update. alternates will apply first, so in the case of conflicts due to previous non-excellent filtering workflow, formerly invalid false positives (i.e. false positives between now-transitive alternates) will be discarded. invalid potentials will also be cleared out</li>
<li>attempting to set a 'false positives' or 'alternates' relationship to files that already have a conflicting relation (e.g. setting false positive on two files that are already alternates) now does nothing. in future, this will have graceful failure reporting</li>
<li>the false positive and alternate transitivity clears out potential dupes at a faster rate than previously, speeding up duplicate filter workflow and reducing redundancy on the human end</li>
<li>unfortunately, as potential and better/worse/same pairs have yet to be updated, the system may report the same file as both an alternate and a same quality partner of a file. this will be automatically corrected in the coming weeks</li>
<li>when selecting 'view this file's duplicates' from thumbnail right-click, the focus file will now be the first file displayed in the next page</li>
<li>.</li>
<li>duplicates boring details:</li>
<li>setting 'false positive' and 'alternates' status now accounts for the new data storage, and a variety of follow-on assumptions and transitive properties (such as implying other false positive relationships or clearing out potential dupes between two groups of merging alternates) are now dealt with more rigorously (and more so when I move the true 'duplicate' file relationships over)</li>
<li>fetching file duplicate status counts, file duplicate status hashes, and searching for system:num_dupes now accounts for the new data storage re false positives and alternates</li>
<li>new potential dupes are culled when they conflict with the new transitive alternate and false positive relationships</li>
<li>removed the code that fudges explicit transitive 'false positive' and 'alternate' relationships based on existing same/better/worse pairs when setting new dupe pairs. this temporary gap will be filled back in in the coming weeks (clearing out way more potentials too)</li>
<li>several specific advanced duplicate actions are now cleared out to make way for future streamlining of the filter workflow:</li>
<li>removed the 'duplicate_media_set_false_positive' shortcut, which is an action only appropriate when viewing confirmed potentials through the duplicate filter (or after the 'show random pairs' button)</li>
<li>removed the 'duplicate_media_remove_relationships' shortcut and menu action ('remove x pairs ... from the dupes system'), which will return as multiple more precise and reliable 'dissolve' actions in the coming weeks</li>
<li>removed the 'duplicate_media_reset_to_potential' shortcut and menu action ('send the x pairs ... to be compared in the duplicates filter') as it was always buggy and led to bloating of the filter queue. it is likely to return as part of the 'dissolve'-style reset commands as above</li>
<li>fixed an issue where hitting 'duplicate_media_set_focused_better' shortcut with no focused thumb would throw an error</li>
<li>started proper unit tests for the duplicates system and filled in the phash search, basic current better/worse, and false positive and alternate components</li>
<li>various incidences of duplicate 'action options' and similar phrasing are now unified to 'metadata merge options'</li>
<li>cleaned up 'unknown/potential' phrasing in duplicate pair code and some related duplicate filter code</li>
<li>cleaned up wording and layout of the thumbnail duplicates menu</li>
<li>.</li>
<li>the rest:</li>
<li>tag blacklists in downloaders' tag import options now apply to the parsed tags both before and after a tag sibling collapse. it uses the combined tag sibling rules, so feedback on how well this works irl would be appreciated</li>
<li>I believe I fixed the annoying issue where a handful of thumbnails would sometimes inexplicably not fade in during thumbgrid scrolling (and typically on first thumb load--this problem was aggravated by scroll/thumb-render speed ratio)</li>
<li>when to-be-regenerated thumbnails are taken off the thumbnail waterfall queue due to fast scrolling or page switching, they are now queued up in the new file maintenance system for idle-time work!</li>
<li>the main gui menus will now no longer try to update while they are open! uploading pending tags while lots of new tags are coming in is now much more reliable. let me know if you discover a way to get stuck in this frozen state!</li>
<li>cleaned up some main gui menu regeneration code, reducing the total number of stub objects created and deleted, particularly when the 'pending' menu refreshes its label frequently while uploading many pending tags. should be a bit more stable for some linux flavours</li>
<li>the 'fix siblings and parents' button on manage tags is now a menu button with two options--for fixing according to the 'all services combined' siblings and parents or just for the current panel's service. this overrides the 'apply sibs/parents across all services' options. this will be revisited in future when more complicated sibling application rules are added</li>
<li>the 'hide and anchor mouse' check under 'options->media' is no longer windows-only, if you want to test it, and the previous touchscreen-detecting override (which unhid and unanchored on vigorous movement) is now optional, defaulting to off</li>
<li>greatly reduced typical and max repository pre-processing disk cache time and reworked stop calculations to ensure some work always gets done</li>
<li>fixed an issue with 'show some random dupes' thumbnails not hiding on manual trashing, if that option is set. 'show some random dupes' thumbnail panels will now inherit their file service from the current duplicate search domain</li>
<li>repository processing will now never run for more than an hour at once. this mitigates some edge-case disastrous ui-hanging outcomes and generally gives a chance for hydrus-level jobs like subscriptions and even other programs like defraggers to run even when there is a gigantic backlog of processing to do</li>
<li>added yet another CORS header to improve Client API CORS compatibility, and fixed an overauthentication problem</li>
<li>setting a blank string on the new local booru external port override option will now forgo the host:port colon in the resultant external url. a tooltip on the control repeats this</li>
<li>reworded and coloured the pause/play sync button in review services repository panel to be more clear about current paused status</li>
<li>fixed a problem when closing the gui when the popup message manager is already closed by clever OS-specific means</li>
<li>misc code cleanup</li>
<li>updated sqlite on windows to 3.28.0</li>
<li>updated upnpc exe on windows to 2.1</li>
</ul>
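The transitive alternates and shared false positives described in the 'duplicates important' items above can be pictured with a small sketch. This is not the client's actual storage code: it assumes, purely for illustration, that alternates are kept as merged groups (a union-find forest) and that false positives are recorded between groups rather than between individual files. The class and method names are hypothetical.

```python
class DuplicateGroups:
    """Hypothetical model: alternates are merged groups, false positives link two groups."""

    def __init__(self):
        self._parent = {}               # file -> parent file in a union-find forest
        self._false_positives = set()   # frozenset({group_root_a, group_root_b})

    def _find(self, f):
        # Return the representative file of f's alternates group.
        self._parent.setdefault(f, f)
        while self._parent[f] != f:
            self._parent[f] = self._parent[self._parent[f]]  # path halving
            f = self._parent[f]
        return f

    def are_alternates(self, a, b):
        return a != b and self._find(a) == self._find(b)

    def are_false_positives(self, a, b):
        ra, rb = self._find(a), self._find(b)
        return ra != rb and frozenset((ra, rb)) in self._false_positives

    def set_alternates(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return True   # already in the same group
        if frozenset((ra, rb)) in self._false_positives:
            return False  # conflicting relation: do nothing
        self._parent[rb] = ra
        # False positives recorded against the absorbed group now belong to the merged group.
        self._false_positives = {
            frozenset(ra if root == rb else root for root in pair)
            for pair in self._false_positives
        }
        return True

    def set_false_positive(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return False  # conflicting relation: the files are already alternates
        self._false_positives.add(frozenset((ra, rb)))
        return True


# The example from the changelog: A alt B, A fp C, then C alt D.
groups = DuplicateGroups()
groups.set_alternates("A", "B")
groups.set_false_positive("A", "C")
groups.set_alternates("C", "D")
```

With this model, `groups.are_false_positives("B", "D")` holds without ever being set explicitly, and a later `groups.set_alternates("B", "D")` is refused because it conflicts with the recorded false positive.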
<li><h3>version 353</h3></li>
<ul>
<li>duplicate filter:</li>
23 changes: 0 additions & 23 deletions help/depots.html

This file was deleted.

12 changes: 6 additions & 6 deletions help/duplicates.html
@@ -6,7 +6,7 @@
</head>
<body>
<div class="content">
<p class="warning">This help is slightly out of date--you can now refine your current duplicate 'search domain' with regular file searches. I will update this help in the coming weeks to reflect final changes.</p>
<p class="warning"><b>This is currently out of date! The duplicates system is being reworked right now at the database level to better support large groups of duplicates. The UI and workflow are simultaneously being streamlined. The concepts behind this help remain valid, but it will be properly updated once the work is complete to reflect the changes.</b></p>
<h3>duplicates</h3>
<p>As files are shared on the internet, they are often resized, cropped, converted to a different format, subsequently altered by the original or a new artist, or turned into a template and reinterpreted over and over and over. Even if you have a very restrictive importing workflow, your client is almost certainly going to get some <i>duplicates</i>. Some will be interesting alternate versions that you want to keep, and others will be thumbnails and other low-quality garbage you accidentally imported and would rather delete. Along the way, it would be nice to harmonise your ratings and tags to the better files so you don't lose any work.</p>
<p>Finding and processing duplicates within a large collection is impossible to do by hand, so I have written a system to do the heavy lifting for you. It is all on--</p>
@@ -26,7 +26,7 @@ <h3>preparation</h3>
<h3>discovery</h3>
<p>Once the database is ready to search, you actually have to do it! You can set a 'search distance', which represents how 'fuzzy' or imprecise a match the database will consider a duplicate. I recommend you start with 'exact match', which looks for files that are as similar as it can understand. The smaller the search distance, the faster and better and fewer the results will be. I do not recommend you go above 8--the 'speculative' option--as you will be inundated with false positives.</p>
<p>Like the preparation step, this is very CPU intensive and will lock your db. Either leave it alone while it works or let the client handle everything automatically during idle time.</p>
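As a rough illustration of what a 'search distance' means, the sketch below assumes it is the maximum Hamming distance (the number of differing bits) allowed between two files' 64-bit perceptual hashes, with 'exact match' taken to be a distance of 0. That mapping is an assumption, as are the function names. The pair search here is brute force for clarity and is not how the client searches.

```python
def hamming_distance(phash_a: int, phash_b: int) -> int:
    """Count the bit positions at which two integer hashes differ."""
    return bin(phash_a ^ phash_b).count("1")


def find_potential_duplicates(phashes: dict, search_distance: int) -> list:
    """Return every pair of files whose hashes differ by at most search_distance bits.

    phashes maps a file name to its hash as an int. This compares every pair,
    which is only practical for small collections.
    """
    names = sorted(phashes)
    pairs = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            if hamming_distance(phashes[name_a], phashes[name_b]) <= search_distance:
                pairs.append((name_a, name_b))
    return pairs


# Made-up hashes: "b" differs from "a" by one bit, "c" differs from "a" by eight bits.
example_phashes = {"a": 0x00, "b": 0x01, "c": 0xFF}
exact_pairs = find_potential_duplicates(example_phashes, 0)
loose_pairs = find_potential_duplicates(example_phashes, 8)
```

Raising the distance from 0 to 8 takes this toy collection from no candidate pairs to every possible pair, which is the same effect, in miniature, as being inundated with false positives at wide search distances.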
<p>If you are interested, the current version of this system uses a <i>phash</i> (a 64-bit binary string 'perceptual hash' based on whether the values of an 8x8 DCT of a 32x32 greyscale version of the image are above or below the average value) to represent the image shape and a VPTree to search different files' phashes' relative <a href="https://en.wikipedia.org/wiki/Hamming_distance">hamming distance</a>. I expect to extend it in future with multiple phash generation (flips, rotations, crops on interesting parts of the image) and most-common colour comparisons.</p>
<p>If you are interested, the current version of this system uses a <a href="https://jenssegers.com/61/perceptual-image-hashes">phash</a> to represent the image shape and a <a href="https://en.wikipedia.org/wiki/VP-tree">VPTree</a> to search different files' phashes' relative <a href="https://en.wikipedia.org/wiki/Hamming_distance">hamming distance</a>. I expect to extend it in future with multiple phash generation (flips, rotations, and 'interesting' image crops and video frames) and most-common colour comparisons.</p>
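For readers who want to see the idea in code, here is a minimal pure-Python sketch of a DCT-based perceptual hash: take a 32x32 greyscale image, compute the 8x8 block of lowest-frequency DCT coefficients, and emit one bit per coefficient depending on whether it is above the average. It is not the client's implementation. The unnormalised DCT-II, the bit order, and the choice to leave the first (DC) coefficient out of the average are all assumptions, and the function names are hypothetical.

```python
import math

HASH_SIDE = 8     # keep the 8x8 lowest-frequency coefficients (64 bits)
IMAGE_SIDE = 32   # the input is a 32x32 grid of greyscale values


def low_frequency_dct(pixels):
    """Return the 8x8 low-frequency block of an unnormalised 2D DCT-II as a flat list."""
    cos_table = [
        [math.cos((2 * i + 1) * k * math.pi / (2 * IMAGE_SIDE)) for i in range(IMAGE_SIDE)]
        for k in range(HASH_SIDE)
    ]
    coefficients = []
    for v in range(HASH_SIDE):          # vertical frequency
        for u in range(HASH_SIDE):      # horizontal frequency
            total = 0.0
            for y in range(IMAGE_SIDE):
                row_weight = cos_table[v][y]
                for x in range(IMAGE_SIDE):
                    total += pixels[y][x] * cos_table[u][x] * row_weight
            coefficients.append(total)
    return coefficients


def perceptual_hash(pixels) -> int:
    """Pack one bit per coefficient (1 if above the average) into a 64-bit int."""
    coefficients = low_frequency_dct(pixels)
    # Assumption: the DC term (overall brightness) is left out of the average
    # so that it does not swamp the other 63 coefficients.
    average = sum(coefficients[1:]) / (len(coefficients) - 1)
    value = 0
    for coefficient in coefficients:
        value = (value << 1) | (1 if coefficient > average else 0)
    return value


# A synthetic image: mid-grey plus a single slow horizontal wave.
wave_image = [
    [128 + 100 * math.cos((2 * x + 1) * math.pi / (2 * IMAGE_SIDE)) for x in range(IMAGE_SIDE)]
    for _ in range(IMAGE_SIDE)
]
wave_hash = perceptual_hash(wave_image)
```

Because only low-frequency structure survives, resized or lightly recompressed copies of an image tend to produce hashes that differ in few bits, which is what makes a Hamming-distance search over the hashes useful.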
</li>
<li>
<h3>processing</h3>
@@ -110,7 +110,7 @@ <h3>alternates</h3>
<p><a href="dupe_alternates_progress.png"><img src="dupe_alternates_progress.png" /></a></p>
<p>And a costume change:</p>
<p><a href="dupe_alternates_costume.png"><img src="dupe_alternates_costume.png" /></a></p>
<p>None of these are exact duplicates, but they are obviously related. The duplicate search will notice they are similar, so we should let it know they are 'alternate'.</p>
<p>None of these are strictly duplicates, but they are obviously related. The duplicate search will notice they are similar, so we should let it know they are 'alternate'.</p>
<p>Here's a subtler case:</p>
<p><a href="dupe_alternate_boxer_a.jpg"><img src="dupe_alternate_boxer_a.jpg" /></a> <a href="dupe_alternate_boxer_b.jpg"><img src="dupe_alternate_boxer_b.jpg" /></a></p>
<p>These two files are very similar, but try opening both in separate tabs and then flicking back and forth: the second's glove-string is further into the mouth and has improved chin shading, a more refined eye shape, and shaved pubic hair. It is simple to spot these differences in the client's duplicate filter when you flick back and forth.</p>
@@ -123,13 +123,13 @@ <h3>alternates</h3>
<p><b>The default action here is to do nothing but record the alternate status. A future version of the client will support revisiting the large unsorted archive you build here and adding file relationship metadata, but creating that will be a complicated job that was not in the scope of this initial duplicate management system.</b></p>
</li>
<li>
<h3>not duplicates</h3>
<p>The duplicate finder sometimes has false positives, so this status is to tell the client that the potential pair are actually not duplicates of any kind. This usually happens when two images have a similar shape by accident.</p>
<h3>not related/false positive</h3>
<p>The duplicate finder sometimes has false positives, so this status is to tell the client that the potential pair are not related in any way. This usually happens when two images have a similar shape by accident.</p>
<p>Here are two such files:</p>
<p><a href="dupe_not_dupes_1.png"><img style="max-width: 100%;" src="dupe_not_dupes_1.png" /></a></p>
<p><a href="dupe_not_dupes_2.jpg"><img style="max-width: 100%;" src="dupe_not_dupes_2.jpg" /></a></p>
<p>Despite their similarity, they are neither duplicates nor of even the same topic. The only commonality is the medium. I would not consider them close enough to be alternates--just adding something like 'screenshot' and 'imageboard' as tags to both is probably the closest connection they have.</p>
<p>The default action here is obviously to do nothing but record the status and move on.</p>
<p>The default action here is obviously to do nothing but record the status and move on. Recording the 'false positive' relationship is important to make sure the comparison does not come up again.</p>
<p>The incidence of false positives increases as you broaden the search distance--the less precise your search, the less likely it is to be correct. At distance 14, these files all match, but uselessly:</p>
<p><a href="dupe_garbage.png"><img style="max-width: 100%;" src="dupe_garbage.png" /></a></p>
</li>