Commit 58fc252 (1 parent: bd26b0d): (Obsidian) added some more developer notes.
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 14 changed files with 494 additions and 36 deletions.
# BLAKE3+BASE58 :: The new Qiqqa Document Fingerprint

(TBD: move BLAKE3 performance stuff and fingerprint size calculus here? Currently that info is at [[Fingerprinting - moving forward and away from b0rked SHA1#Quick update]] onwards.)

See also:

- [[Fingerprinting - moving forward and away from b0rked SHA1]]
- [[SHA1B - Qiqqa Fingerprint 1.0 Classic]]
- [[Fingerprinting, Linking and Libraries]]
68 additions & 0 deletions: ...the Way Forward/Compressing document and other hashes to uin32_t or uint64_t.md
# Can't we compress our document and task hashes to a regular integer number so it's swifter and takes up less memory?

Yes, we can. IFF...

...we reckon with the number of documents (or pages / tasks / what-have-you) we expect to *uniquely identify* using the BLAKE3 hash system.

First, let's identify the several sources we would like to index/identify that way:

- **documents** = **document hashes**. Preferably not per-library but *across-all-my-libraries-please*. Which would make the `uint64_t` *shorthand index* a very personal, nay, *machine*-specific shorthand, as we won't be able to know about the documents we haven't seen yet -- while we may be synchronizing and working in a multi-node or multi-user environment. See also [[Multi-user, Multi-node, Sync-Across and Remote-Backup Considerations]].
  That means we might have a different shorthand index value for document XYZ at machine A than at machine B. Definitely not something you would want to pollute your SQLite database with, for it would otherwise complicate Sync-Across activity *quite a bit*, as the thus-shorthand-linked data *would require transposing to the target machine*. **Ugh! *Hörk*!**
- **document+page**. Text extracts, etc. are kept per document, per page.

Must say I don't have a particularly strong feeling towards needing a *shorthand index* for this one, though. Given [[BLAKE3+BASE58 - Qiqqa Fingerprint 2.0]], the raw, unadulterated cost[^1] would run me at:

[^1]: see [[Fingerprinting - moving forward and away from b0rked SHA1|here for the factors]] used in the tables below

For documents:
| encoding     | calculus                         | # output chars |
|--------------|----------------------------------|----------------|
| binary:      | $32$                             | 32 chars       |
| Base64:      | $32 * log(256) / log(64) = 42.7$ | 43 chars       |
| Base58:      | $32 * log(256) / log(58) = 43.7$ | 44 chars       |
| **Base58X**: | $32 * 7 * 8 / 41 = 43.7$         | 44 chars too!  |
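The arithmetic in these tables is easy to double-check with a few lines of Python. This is just a sketch of the size calculus: the Base58X factor (41-byte groups encoded as 56 output characters, hence $7 * 8 / 41$ chars per byte) is taken from the factors note linked above.

```python
import math

RAW_HASH_BYTES = 32  # BLAKE3 digest size in bytes


def encoded_chars(n_bytes: int, base: int) -> int:
    """Characters needed to render n_bytes of raw data in the given base."""
    return math.ceil(n_bytes * math.log(256) / math.log(base))


def base58x_chars(n_bytes: int) -> int:
    """Base58X: 41-byte input groups map to 56 output characters."""
    return math.ceil(n_bytes * 56 / 41)


print(encoded_chars(RAW_HASH_BYTES, 64))   # Base64: 43 chars
print(encoded_chars(RAW_HASH_BYTES, 58))   # Base58: 44 chars
print(base58x_chars(RAW_HASH_BYTES))       # Base58X: 44 chars too
```

The same helpers reproduce the documents+pages and task rows further below (34 and 35 raw bytes, respectively).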

For documents+pages, where we assume a 'safe upper limit' on the page count of `MAX(uint16_t)`, i.e. 65535, which fits in 2 bytes:

| encoding     | calculus                       | # output chars |
|--------------|--------------------------------|----------------|
| binary:      | $32 + 2$                       | 34 chars       |
| Base64:      | $(32 + 2) * 4 / 3 = 45.3$      | 46 chars       |
| **Base58X**: | $(32 + 2) * 7 * 8 / 41 = 46.4$ | 47 chars       |

For *tasks*, which are generally page-oriented (e.g. OCR a document page): document+page+taskCategoryID, where we again assume a 'safe upper limit' on the page count of `MAX(uint16_t)`, i.e. 65535 (2 bytes), plus the taskCategoryID, assumed to always fit in a `uint8_t`, i.e. a single byte:

| encoding     | calculus                           | # output chars |
|--------------|------------------------------------|----------------|
| binary:      | $32 + 2 + 1$                       | 35 chars       |
| Base64:      | $(32 + 2 + 1) * 4 / 3 = 46.7$      | 47 chars       |
| **Base58X**: | $(32 + 2 + 1) * 7 * 8 / 41 = 47.8$ | 48 chars       |

I.e. storing every task performance record in an SQLite database would incur an added key cost of 48 bytes per record (*plus SQLite-internal overhead*).
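A sketch of how such a raw task key could be composed before encoding; the exact field layout (hash, then big-endian page number, then category byte) is an assumption of mine, chosen only to match the size calculus above:

```python
import struct


def make_task_key(doc_hash: bytes, page: int, task_category: int) -> bytes:
    """Raw 35-byte task key: 32-byte BLAKE3 document hash,
    uint16 page number, uint8 task category ID.
    (Hypothetical layout, matching the table above.)"""
    if len(doc_hash) != 32:
        raise ValueError("expected a 32-byte BLAKE3 digest")
    return doc_hash + struct.pack('>HB', page, task_category)
```

Feeding those 35 bytes through the Base58X size formula above then yields the 48 characters from the task table.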

## Is hash compression useful at all?

Maybe.

Hashes for documents, tasks, etc. take, as we saw above, between 44 and 48 bytes each -- *and a string compare for equality checks*.

When we map these to internal, *system-native* `uint64_t` numbers, that would cost 8 bytes per index number and a very fast integer equality test.

*Alas*, we should wonder whether this is a desirable *micro-optimization* now that we still have much bigger fish to fry.
Given the amount of extra work and confusion I can see already, I'd say: *nice thought, but not making it past the mark*. *Rejected.*

> After all, *huge* Qiqqa libraries would be between 10K-100K documents, where each document would, perhaps, average at less than 100 pages, thus resulting in about 100K document hashes and 10M (100K * 100) *task hashes*, which would clock in at 480MB of space (sans terminating NUL bytes, etc.) if we kept all those hashes around forever, which is kind of ridiculous.
>
> Hence it's *probably* far smarter to assign fast `uint32_t` indexes for hashes while the application is running, for use in the application run-time only and with *no persistence ever*. And that's assuming you won't be running your Qiqqa server components for several days on end... *Nah. **Cute but no cigar**!*
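Such a run-time-only assignment could be as simple as the following sketch (all names made up); the key point is that the mapping lives and dies with the process, so nothing ever needs transposing between machines:

```python
class RuntimeHashIndex:
    """Process-local interning of hash strings to small integer indexes.
    Deliberately never persisted: an index is only valid for this run."""

    def __init__(self):
        self._index_of = {}   # hash string -> index
        self._hash_at = []    # index -> hash string

    def intern(self, hash_str: str) -> int:
        """Return the existing index for this hash, or assign the next one."""
        idx = self._index_of.get(hash_str)
        if idx is None:
            idx = len(self._hash_at)
            self._index_of[hash_str] = idx
            self._hash_at.append(hash_str)
        return idx

    def hash_for(self, idx: int) -> str:
        """Reverse lookup: recover the full hash from a run-time index."""
        return self._hash_at[idx]
```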

## TL;DR

*Don't.* Too much fuss for very little gain. Does not mix well with
[[Multi-user, Multi-node, Sync-Across and Remote-Backup Considerations|Sync-Across]] either.
61 additions & 0 deletions: ...otes/Progress in Development/Considering the Way Forward/IPC - Qiqqa Monitor.md
# IPC :: Qiqqa Monitor

The QiqqaMonitor wants a steady stream of performance data from the various Qiqqa components.

The Monitor is a *client*, where the various Qiqqa (background) processes are *servers*.

## Push vs. Pull-based monitor data

While we *could* choose to use a *server push* mechanism, we opt, instead, for a **data pull** approach, where the Monitor determines the pace of the incoming data by explicitly sending requests for the various performance data.

Initially, I had some qualms about this, as it implies we add *requesting overhead*, but a **pull**-based approach is more flexible:

- *server push* would need both a good timer server-side and a fast pace to ensure the client receives ample updates under all circumstances. Of course, this can be made *smarter*, but that would add development cost. (See also the next item: multiple clients.)
- *multiple clients* for the performance data is a consideration: why would we *not* open up this part of the Qiqqa data, when we intend to open up everything else to user-scripted and other kinds of direct external access to our components? **Pull**-based performance data feeds would then automatically pace to each client's 'clock' without the need to complicate the server-side codebase.
- *pace* must be *configured* for *server push* systems if you don't like the current data stream, while we don't have to do *anything* server-side when we go for **pull**-based data feeds: if a client desires a faster (or slower) pace, it can simply and *immediately* attain this by sending more (or fewer) *data requests*.
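The points above boil down to a very small client loop. A minimal sketch, where the `poll` callable is a stand-in (hypothetical, not the real IPC API) for one performance-data request to a server process:

```python
import time


def run_monitor_client(poll, interval_s: float, ticks: int):
    """Pull-based monitoring: the client paces the data flow by issuing
    explicit requests; the server only ever answers, never pushes."""
    samples = []
    for _ in range(ticks):
        samples.append(poll())   # one explicit performance-data request
        time.sleep(interval_s)   # the client alone decides the pace
    return samples
```

Speeding up or slowing down the feed is then purely a client-side decision: change `interval_s` and nothing server-side needs reconfiguring.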

## Which data to track server-side

Next to this, there are a few more bits to keep in mind when we code this baby up:

- we don't want (potentially large) distortions in our performance data under load.

  Previous experience with Qiqqa (v80 series and older) has shown us that the machine can be swiftly consumed and (over)loaded with PDF processing requests when importing, reloading, recovering or otherwise bulk-processing large libraries. We may want to know the size of our task queues, but *performance* is not equal to the delta of the *fill size* of said queues: when the machine is choking on the load, the number of entries processed per second may be low or high, but we wouldn't know, as many requests (PDF text extracts) trigger PDF OCR tasks, filling up the task queue fast while work is being done.

  Hence we would want to see both the **number of tasks completed** and the **number of tasks pending**.

As we could derive the other numbers we would need for a performance chart from these two, these should suffice:

- actual number of queued work items
- running total of work items completed
- extra:
  - running total of work items *skipped*
  - running total of work items *rescheduled*

These latter two are only relevant when we want to observe the effects of (temporarily) disabling certain background processes, as we did in the old Qiqqa application.

Derived data values:

- Total Work Pending =~ Backpressure In The System = `QSize`
- Total Work Completed = `NCompleted`
- Total Work Requested = `QSize + NCompleted`
- Work Speed = `(NCompleted[now] - NCompleted[previous]) / TimeInterval`
- Work Request Speed =~ Queue Growth Speed = `(TotalWorkRequested[now] - TotalWorkRequested[previous]) / TimeInterval`
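The two tracked counters and the derived values above can be sketched like this (class and function names are made up for illustration):

```python
class QueueSnapshot:
    """One polled sample: queued item count plus running completed total."""

    def __init__(self, qsize: int, ncompleted: int):
        self.qsize = qsize            # Total Work Pending / backpressure
        self.ncompleted = ncompleted  # Total Work Completed

    @property
    def requested(self) -> int:
        # Total Work Requested = QSize + NCompleted
        return self.qsize + self.ncompleted


def work_speed(prev: QueueSnapshot, now: QueueSnapshot, dt: float) -> float:
    # (NCompleted[now] - NCompleted[previous]) / TimeInterval
    return (now.ncompleted - prev.ncompleted) / dt


def request_speed(prev: QueueSnapshot, now: QueueSnapshot, dt: float) -> float:
    # (TotalWorkRequested[now] - TotalWorkRequested[previous]) / TimeInterval
    return (now.requested - prev.requested) / dt
```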

If we track the base data (number of queued items + running total of completed items) per task type/category/priority, then we can derive those data values for each priority category and thus show some useful diagnostics to the user when the Qiqqa backend is hard at work.

### Extra data: what are we currently working on?

Qiqqa v80 series has a rather chaotic UI statusbar-based report to show what's being done.

While this can be replicated by asking the background processes what task they're each working on *right now*, perhaps we could provide a smarter and potentially more useful UX by also tracking the amount of time spent per work category, and thus be able to report the (average/minimum/peak) run-time cost per work category.
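Tracking that per-category run-time cost needs little more than a running aggregate per category; a sketch (names made up):

```python
class CategoryRuntimeStats:
    """Average / minimum / peak run-time cost for one work category."""

    def __init__(self):
        self.count = 0
        self.total_s = 0.0
        self.min_s = float('inf')
        self.peak_s = 0.0

    def record(self, seconds: float) -> None:
        """Fold one completed task's run time into the aggregates."""
        self.count += 1
        self.total_s += seconds
        self.min_s = min(self.min_s, seconds)
        self.peak_s = max(self.peak_s, seconds)

    @property
    def average_s(self) -> float:
        return self.total_s / self.count
```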

#### ... and veering off at a tangent...

Sometimes we even dream of being able to drill down to the *document* and possibly *page* level for this data, but that would mean we'd be storing all that work item monitor info in a persistent database.

Which makes one wonder: use SQLite for this (with its query construction and parsing overhead), or go the NoSQL LMDB route?

- https://www.pdq.com/blog/improving-bulk-insert-speed-in-sqlite-a-comparison-of-transactions/
- what about our document hashes? A *task hash* would be a document hash, task type code *plus* page number (up to at least 2500, as that's about the largest page count in a book I've ever seen). [[Compressing document and other hashes to uin32_t or uint64_t|Can't we compress our document and task hashes to a regular integer number so it's swifter and taking up less memory?]]
2 additions & 0 deletions: ...the Way Forward/IPC - monitoring & managing the Qiqqa application components.md
# IPC :: monitoring & managing the Qiqqa application components