(Obsidian) added some more developer notes.
GerHobbelt committed Dec 2, 2021
1 parent bd26b0d commit 58fc252
Showing 14 changed files with 494 additions and 36 deletions.
15 changes: 14 additions & 1 deletion docs-src/Notes/.obsidian/app.json
@@ -90,6 +90,19 @@
"executables",
"Qiqqa",
"submodule",
"icecream"
"icecream",
"Backpressure",
"cryptographic",
"crypto",
"stringified",
"lookups",
"selectable",
"encodings",
"thumbdrive",
"thumbdrive",
"OneDrive",
"DropBox",
"DropBox",
"performant"
]
}
2 changes: 1 addition & 1 deletion docs-src/Notes/Aside/How to attach raw data to a PDF.md
@@ -6,4 +6,4 @@ Regrettably, this almost never happens -- including research/university circles,
So we generally succumb to the reality of *scraping content*, whether it's merely to obtain the obvious metadata elements (title, authors, publishing venue, publishing date, *abstract*, ...) or data/charts. Many of us probably won't even realize we are scraping like that -- and what the consequences are for our extracted data quality for our collection at large -- because it's so pervasive, despite bibliographic sites and metadata-offering websites, e.g. Google Scholar.

Once you have \[scraping] tools, you're not set for life either! Then it turns into a statistics game: how good are your tools, how much effort are you willing to spend (cleaning and /or cajoling your tools to do your bidding) and what is the target accuracy/veracity of your extracted data?
Once you have \[scraping] tools, you're not set for life either! Then it turns into a statistics game: how good are your tools, how much effort are you willing to spend (cleaning and/or cajoling your tools to do your bidding) and what is the target accuracy/veracity of your extracted data?
7 changes: 6 additions & 1 deletion docs-src/Notes/BLAKE3+BASE58 - Qiqqa Fingerprint 2.0.md
@@ -1,3 +1,8 @@
# BLAKE3+BASE58 :: The new Qiqqa Document Fingerprint

(TBD: move BLAKE3 performance stuff and fingerprint size calculus here?)
(TBD: move BLAKE3 performance stuff and fingerprint size calculus here? Currently that info is at [[Fingerprinting - moving forward and away from b0rked SHA1#Quick update]] onwards.)

See also:
- [[Fingerprinting - moving forward and away from b0rked SHA1]]
- [[SHA1B - Qiqqa Fingerprint 1.0 Classic]]
- [[Fingerprinting, Linking and Libraries]]
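For a concrete feel of what such a fingerprint looks like, a minimal sketch (Python, using the off-the-shelf `blake3` and `base58` packages; note the scheme discussed in the linked notes uses a *Base58X* variant rather than plain Base58, so this is illustrative only):

```python
from blake3 import blake3   # pip install blake3
import base58               # pip install base58

def fingerprint(document_bytes: bytes) -> str:
    """BLAKE3 digest (32 bytes) of the raw document, rendered as Base58 (~44 chars)."""
    digest = blake3(document_bytes).digest()
    return base58.b58encode(digest).decode("ascii")

# e.g.: fingerprint(open("paper.pdf", "rb").read())
```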
@@ -0,0 +1,68 @@
# Can't we compress our document and task hashes to a regular integer number so it's swifter and taking up less memory?

Yes, we can. IFF...

...we reckon with the number of documents (or pages / tasks / what-have-you) we expect to *uniquely identify* using the BLAKE3 hash system.

First, let's identify the several sources we would like to index/identify that way:

- **documents** = **document hashes**. Preferably not per-library but *across-all-my-libraries-please*. Which would make the `uint64_t` *shorthand index* a very personal, nay, *machine*-specific shorthand, as we won't be able to know about the documents we haven't seen yet -- while we may be synchronizing and working in a multi-node or multi-user environment. See also [[Multi-user, Multi-node, Sync-Across and Remote-Backup Considerations]].
That means we might have a different shorthand index value for document XYZ at machine A than at machine B. Definitely not something you would want to pollute your SQLite database with, for it would otherwise complicate Sync-Across activity *quite a bit* as the thus-shorthand-linked data *would require transposing to the target machine*. **Ugh! *Hörk*!**
- **document+page**. Text extracts, etc. are kept per document, per page.

Must say I don't have a particularly strong feeling towards needing a *shorthand index* for this one, though. Given [[BLAKE3+BASE58 - Qiqqa Fingerprint 2.0]], the raw, unadulterated cost[^1] would run me at:

[^1]: see [[Fingerprinting - moving forward and away from b0rked SHA1|here for the factors]] used in the tables below

For documents:

| encoding | calculus | # output chars |
|--------------|----------------------------|-----------------------|
| binary: | $32$ | 32 chars |
| Base64: | $32 * log(256) / log(64) = 42.7$ | 43 chars |
| Base58: | $32 * log(256) / log(58) = 43.7$ | 44 chars |
| **Base58X**: | $32 * 7 * 8 / 41 = 43.7$ | 44 chars too! |

For documents+pages, where we assume a 'safe upper limit' in the page count of `MAX(uint16_t)` i.e. 65535, which fits in 2 bytes:

| encoding | calculus | # output chars |
|--------------|----------------------------|-----------------------|
| binary: | $32 + 2$ | 34 chars |
| Base64: | $(32 + 2) * 4 / 3 = 45.3$ | 46 chars |
| **Base58X**: | $(32 + 2) * 7 * 8 / 41 = 46.4$ | 47 chars |

For *tasks*, which are generally page oriented (e.g. OCR a document page), i.e. document+page+taskCategoryID, where we assume a 'safe upper limit' in the page count of `MAX(uint16_t)` i.e. 65535, which fits in 2 bytes, plus the taskCategoryID, which is assumed to always fit in a `uint8_t`, i.e. a single byte:

| encoding | calculus | # output chars |
|--------------|----------------------------|-----------------------|
| binary: | $32 + 2 + 1$ | 35 chars |
| Base64: | $(32 + 2 + 1) * 4 / 3 = 46.7$ | 47 chars |
| **Base58X**: | $(32 + 2 + 1) * 7 * 8 / 41 = 47.8$ | 48 chars |

i.e. storing every task performance record in an SQLite database would incur an added key cost of 48 bytes per record (*plus SQLite-internal overhead*).
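For a quick sanity check of the character counts in the three tables above, a minimal sketch (plain Python, no Qiqqa code assumed; the Base58X factor of 7 characters per 41 bits is taken straight from the formulas above):

```python
import math

# Payload sizes in bytes: 32-byte BLAKE3 hash, +uint16 page number, +uint8 task category.
PAYLOADS = {"document": 32, "document+page": 32 + 2, "task": 32 + 2 + 1}

def encoded_chars(nbytes: int) -> dict:
    bits = nbytes * 8
    return {
        "binary":  nbytes,
        "Base64":  math.ceil(bits / 6),              # 6 bits per Base64 character
        "Base58":  math.ceil(bits / math.log2(58)),  # ~5.858 bits per Base58 character
        "Base58X": math.ceil(nbytes * 7 * 8 / 41),   # 7 characters per 41 bits, per the note
    }

for name, size in PAYLOADS.items():
    print(name, encoded_chars(size))
# document      -> binary 32, Base64 43, Base58 44, Base58X 44
# document+page -> binary 34, Base64 46, Base58 47, Base58X 47
# task          -> binary 35, Base64 47, Base58 48, Base58X 48
```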



## Is hash compression useful at all?

Maybe.

Hashes for documents, tasks, etc. take, as we saw above, between 44 and 48 bytes each -- *and a string compare for equality checks*.

When we map these to internal, *system-native* `uint64_t` numbers, that would cost 8 bytes per index number and a very fast integer equality test.

*Alas*, we should wonder whether this is an undesirable *micro-optimization* now that we still have much bigger fish to fry.
Given the amount of extra work and confusion I can see already, I'd say: *nice thought, but not making it past the mark*. *Rejected.*

> After all, *huge* Qiqqa libraries would be between 10K-100K documents, where each document would, perhaps, average at less than 100 pages, thus resulting in about 100K document hashes and 10M (100K * 100) *task hashes*, which would clock in at 480M space (sans terminating NUL bytes, etc.) if we'd kept all those hashes around forever, which is kind of ridiculous.
>
> Hence it's *probably* far smarter to assign fast `uint32_t` indexes for hashes while the application is running and for use in the application run-time and *no persistence ever*. And that's assuming you won't be running your Qiqqa server components for several days on end... *Nah. **Cute but no cigar**!*
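If one *were* to go the runtime-only route mused about in the quote above, a minimal sketch of such an ephemeral interning table might look like this (hypothetical helper, not part of any existing Qiqqa codebase; equality checks then become plain integer compares):

```python
class HashInterner:
    """Map long hash strings to small per-process integer indexes.

    These indexes live only for the duration of this process; they must never
    be persisted or synced to another machine (see the considerations above).
    """
    def __init__(self) -> None:
        self._index_by_hash: dict[str, int] = {}
        self._hash_by_index: list[str] = []

    def intern(self, hash_str: str) -> int:
        idx = self._index_by_hash.get(hash_str)
        if idx is None:
            idx = len(self._hash_by_index)
            self._index_by_hash[hash_str] = idx
            self._hash_by_index.append(hash_str)
        return idx

    def lookup(self, idx: int) -> str:
        return self._hash_by_index[idx]
```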


## TL;DR

*Don't.* Too much fuss for very little gain. Does not mix well with
[[Multi-user, Multi-node, Sync-Across and Remote-Backup Considerations|Sync-Across]] either.


@@ -6,15 +6,19 @@ However, we've noted elsewhere that we want/need to support a few more base form

- MHTML, CHM or another HTML+CSS+JS bundling which freezes/*archives* web pages for off-line and later perusal.

Also **consider using the HTMLZ format for all of these**: it's compressed and zipped, thus bundling all bits and pieces in a tight format that consumes little disk space and is fast to access and create.

> HTMLZ is one or more HTML files + assets zipped in a ZIP archive, plus an added [OPF metadata](http://idpf.org/epub/20/spec/OPF_2.0_latest.htm) file, which is, for example, used by [Calibre](https://calibre-ebook.com/).
- Images (which serve as single-page documents)

However, we *could* easily wrap any images in a fresh PDF of our own making and then allow the user to edit & annotate *that*: that way we have the annotation and all the other work persisted with minimal fuss, while we only have to account for the image file being *the original document*, i.e. one of the supported *near duplicates* in your Qiqqa library.
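As a side note on the HTMLZ bundling suggested in the list above, a minimal sketch of how such an archive could be assembled (hypothetical helper; the `index.html` + `metadata.opf` naming follows Calibre's HTMLZ convention as far as I can tell, and the OPF stub is deliberately minimal):

```python
import zipfile

# Stub OPF; a real bundle would add a manifest/spine, this is just enough to
# tag the archive with a title.
OPF_STUB = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
  </metadata>
</package>
"""

def bundle_htmlz(out_path: str, html_path: str, asset_paths: list[str], title: str) -> None:
    """Zip a frozen web page plus its assets and a metadata.opf into one archive."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(html_path, "index.html")
        for asset in asset_paths:
            z.write(asset, asset)          # CSS, JS, images referenced by the page
        z.writestr("metadata.opf", OPF_STUB.format(title=title))
```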



Which leaves the HTML-based pages (and all document formats which transform to this format, e.g. MarkDown documents): SumatraPDF & FoxIt are particularly geared towards *editing* and *annotating* such documents.
Which leaves the HTML-based pages (and all document formats which transform to this format, e.g. MarkDown documents): [SumatraPDF](https://www.sumatrapdfreader.org/free-pdf-reader) & [FoxIt](https://www.foxit.com/pdf-reader/) are particularly geared towards *editing* and *annotating* such documents.

We *could* use the same cop-out as for our image file based documents by first transforming it into a PDF before enabling user editing and annotating, but I feel we should also offer a *native* edit facility here, even when that would involve yet another external application.
We *could* use the same cop-out as for our image file based documents *by first transforming it into a PDF (using [wkhtmltopdf](https://wkhtmltopdf.org/index.html)) before enabling user editing and annotating*, but I feel we should also offer a *native* edit facility here, even when that would involve yet another external application.

Turns out a quick search for Open Source WYSIWYG HTML editors only turns up tools that are in-page and JS-based:

@@ -25,7 +29,7 @@ However, we've noted elsewhere that we want/need to support a few more base form
- [Froala](https://github.com/froala/wysiwyg-editor) + https://froala.com/wysiwyg-editor/examples
- [TinyMCE](https://github.com/tinymce/tinymce)
- [CKEditor](https://github.com/ckeditor) + https://ckeditor.com/
-

- Special Mention:
- https://github.com/brackets-cont/brackets -- though this is more a source editor, rather than WYSIWYG when it comes to HTML
- https://github.com/dok/awesome-text-editing
@@ -0,0 +1,61 @@
# IPC :: Qiqqa Monitor

The QiqqaMonitor wants a steady stream of performance data from the various Qiqqa components.

The Monitor is a *client*, where the various Qiqqa (background) processes are *servers*.

## Push vs. Pull-based monitor data

While we *could* choose to use a *server push* mechanism, we opt, instead, to use a **data pull** approach where the Monitor determines the pace of the incoming data by explicitly sending requests for the various performance data.

Initially, I had some qualms about this, as this implies we would add *requesting overhead* that way, but a **pull**-based approach is more flexible:

- *server push* would need both a good timer server-side and a fast pace to ensure the client receives ample updates under all circumstances. Of course, this can be made *smarter*, but that would add development cost. (See also the next item: multiple clients)
- *multiple clients* for the performance data is a consideration: why would we *not* open up this part of the Qiqqa data, when we intend to open up everything else to user-scripted and other kinds of direct external access to our components? **pull**-based performance data feeds would then automatically pace to each client's 'clock' without the need to complicate the server-side codebase.
- *pace* must be *configured* for *server push* systems if you don't like the current data stream, while we don't have to do *anything* server-side when we go for **pull**-based data feeds: if the client desires a faster (or slower) pace, it can simply and *immediately* attain this by sending more (or fewer) *data requests*.
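To illustrate how little the pull approach demands of the servers, a minimal sketch of the Monitor-side polling loop (the JSON-over-socket request shape and `get_perf_stats` command are assumptions; the actual IPC transport is not decided here):

```python
import json
import socket
import time

def poll_performance(host: str, port: int, interval_s: float = 1.0):
    """Monitor-side polling loop: the client alone decides the pace by choosing
    how often it asks; the servers never push and need no timer logic."""
    while True:
        with socket.create_connection((host, port), timeout=2.0) as sock:
            sock.sendall(b'{"cmd": "get_perf_stats"}\n')   # hypothetical request shape
            reply = sock.makefile("r").readline()
        yield json.loads(reply)                            # e.g. {"QSize": ..., "NCompleted": ...}
        time.sleep(interval_s)   # faster/slower pace = change this; no server-side change needed
```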

## Which data to track server-side

Next to this, there are a few more bits to keep in mind when we code this baby up:

- we don't want (potentially large) distortions in our performance data under load.

Previous experience with Qiqqa (v80 series and older) has shown us that the machine can be swiftly consumed and (over)loaded with PDF processing requests when importing, reloading, recovering or otherwise bulk-processing large libraries. We may want to know the size of our task queues, but *performance* is not equal to the delta of the *fill size* of said queues: when the machine is choking on the load, the number of entries processed per second may be low or high, but we wouldn't be able to tell, as many requests (PDF text extracts) trigger PDF OCR tasks, thus filling up the task queue fast while work is being done.

Hence we would want to see both the **number of tasks completed** and the **number of tasks pending**.

As we could derive the other numbers we would need for a performance chart from these two, these should suffice:
- actual number of queued work items
- running total of work items completed
- extra:
- running total of work items *skipped*
- running total of work items *rescheduled*

These latter two are only relevant when we want to observe the effects of (temporarily) disabling certain background processes as we did in the old Qiqqa application.


Derived data values:

- Total Work Pending =~ Backpressure In The System = `QSize`
- Total Work Completed = `NCompleted`
- Total Work Requested = `QSize + NCompleted`
- Work Speed = `(NCompleted[now] - NCompleted[previous]) / TimeInterval`
- Work Request Speed =~ Queue Growth Speed = `(TotalWorkRequested[now] - TotalWorkRequested[previous]) / TimeInterval`

and if we track the base data (number of queued items + running total of completed items) per task type/category/priority, then we can derive those data values for each priority category and thus show some useful diagnostics to the user when the Qiqqa backend is hard at work.
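A minimal sketch of how the Monitor could derive the chart values listed above from just the two tracked counters (the `Sample` structure and field names are assumptions; only the arithmetic follows the formulas above):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    t: float           # poll timestamp (seconds)
    qsize: int         # work items currently queued
    ncompleted: int    # running total of completed work items

def derive(prev: Sample, now: Sample) -> dict:
    """Derive the chart values from two consecutive samples."""
    dt = now.t - prev.t
    requested_now = now.qsize + now.ncompleted
    requested_prev = prev.qsize + prev.ncompleted
    return {
        "work_pending": now.qsize,                              # ~ backpressure in the system
        "work_completed": now.ncompleted,
        "work_speed": (now.ncompleted - prev.ncompleted) / dt,  # items per second
        "request_speed": (requested_now - requested_prev) / dt, # ~ queue growth speed
    }
```

Track one such counter pair per task type/category/priority and the same derivation yields the per-category diagnostics mentioned above.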


### Extra data: what are we currently working on?

Qiqqa v80 series has a rather chaotic statusbar-based UI report to show what's being done.

While this can be replicated by asking the background processes what task they're each working on *right now*, perhaps we could provide a smarter and potentially more useful UX by also tracking the amount of time spent per work category and thus be able to report the (average/minimum/peak) run-time cost per work category.
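A minimal sketch of what such per-category run-time tracking could look like server-side (hypothetical helper; category names and the reporting shape are assumptions):

```python
from collections import defaultdict

class CategoryTimer:
    """Accumulate per-work-category run-time cost: count, total, minimum, peak."""
    def __init__(self) -> None:
        self._stats = defaultdict(lambda: {"count": 0, "total": 0.0, "min": None, "peak": 0.0})

    def record(self, category: str, seconds: float) -> None:
        s = self._stats[category]
        s["count"] += 1
        s["total"] += seconds
        s["min"] = seconds if s["min"] is None else min(s["min"], seconds)
        s["peak"] = max(s["peak"], seconds)

    def report(self, category: str) -> dict:
        s = self._stats[category]
        average = s["total"] / s["count"] if s["count"] else 0.0
        return {"average": average, "minimum": s["min"], "peak": s["peak"], "count": s["count"]}
```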


#### ... and veering off at a tangent...

Sometimes we even dream of being able to drill down to the *document* and possibly *page* level for this data, but that would mean we'd be storing all that work item monitor info in a persistent database.

Which makes one wonder: use SQLite for this (with its query construction and parsing overhead), or go the NoSQL LMDB route?
- https://www.pdq.com/blog/improving-bulk-insert-speed-in-sqlite-a-comparison-of-transactions/
- what about our document hashes? A *task hash* would be a document hash, task type code *plus* page number (up to at least 2500 as that's about the largest page count in a book I've ever seen). [[Compressing document and other hashes to uin32_t or uint64_t|Can't we compress our document and task hashes to a regular integer number so it's swifter and taking up less memory?]]
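To make the key layout concrete, a minimal sketch of composing such a *task hash* as a fixed-size binary key, e.g. for an LMDB-style key/value store (field order and widths are assumptions, not a settled format):

```python
import struct

def make_task_key(doc_hash: bytes, page_no: int, task_type: int) -> bytes:
    """Fixed 35-byte key: 32-byte BLAKE3 document hash + uint16 page number + uint8 task type."""
    assert len(doc_hash) == 32
    assert 0 <= page_no <= 0xFFFF     # 65535 pages max, comfortably above the ~2500 mentioned
    assert 0 <= task_type <= 0xFF
    return doc_hash + struct.pack(">HB", page_no, task_type)
```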
@@ -0,0 +1,2 @@
# IPC :: monitoring & managing the Qiqqa application components
