From 51435ee9f493c800486311b64b05cea1005d74b1 Mon Sep 17 00:00:00 2001 From: Ger Hobbelt Date: Thu, 4 May 2023 18:42:38 +0200 Subject: [PATCH] (Obsidian) typo/spelling fixes --- MuPDF | 2 +- .../Aside/How to attach raw data to a PDF.md | 2 +- ...ding, Annotating and Content Extraction.md | 5 ++++- docs-src/Notes/Misc notes for developers.md | 4 ++-- .../Processing other document types/HTML.md | 10 ++++----- .../Considering the Way Forward/!dummy.md | 2 +- ...port in JS (Web UI) - Links of Interest.md | 4 ++-- ...ibrary in a team vs. personal libraries.md | 2 +- ...w platform or not, that is the question.md | 8 +++---- ...design - the trouble with a string UUID.md | 4 ++-- ...ment metadata - flattened table or what.md | 2 +- ... arbitrary precision in a 64-bit number.md | 2 +- ...act engine - thoughts on the new design.md | 8 +++---- ...sseract - NSFQ (Not Suitable For Qiqqa).md | 2 +- ...g a database for OCR text extract cache.md | 6 ++--- ...ential yet hard(er) to port UI features.md | 11 ++++++++-- ...oving forward and away from b0rked SHA1.md | 4 ++-- ...unique identifier as unique document id.md | 6 ++--- ...s into trigrams (for N \342\211\247 4).md" | 8 +++---- ...ment id integer as a 63-bit one instead.md | 22 +++++++++---------- ...idering the relevant atomic search unit.md | 14 ++++++------ .../Full-Text Search Engines.md | 6 ++--- ...per binary) or Centralized (log server).md | 2 +- ...we divide it up further into subsystems.md | 4 ++-- ... - and which db storage lib to use then.md | 10 ++++----- ... methods - HTTP vs WebSocket, Pipe, etc.md | 9 ++++---- .../IPC/IPC - Qiqqa Monitor.md | 2 +- .../IPC - transferring and storing data.md | 6 ++--- ...ps in our databases (annotations, etc.).md | 2 +- ...ests with CEF+CEFSharp+CEFGlue+Chromely.md | 6 ++--- ...s), backups and backwards compatibility.md | 8 +++++-- ...ch and Produce for Cross-platform Qiqqa.md | 6 ++--- .../Full Text Search Engines.md | 2 +- .../Metadata Search Engines.md | 6 +++-- .../Syncing, zsync style.md | 2 +- ...itor + Web Browser As WWW Search Engine.md | 2 +- ...ls of invoking other child applications.md | 2 +- ...Lite for the OCR and page render caches.md | 6 ++--- ...to download PDF or HTML document at URL.md | 2 +- ... 
Trenches - Odd, Odder, Oddest, ...PDF!.md | 4 ++-- .../notes 001-of-N.md | 2 +- .../notes 002-of-N.md | 2 +- .../notes 003-of-N.md | 2 +- .../notes 004-of-N.md | 2 +- .../notes 005-of-N.md | 2 +- .../notes 006-of-N.md | 2 +- .../notes 007-of-N.md | 2 +- .../notes 008-of-N.md | 2 +- .../notes 009-of-N.md | 2 +- .../notes 010-of-N.md | 2 +- .../notes 011-of-N.md | 2 +- .../notes 012-of-N.md | 2 +- .../notes 013-of-N.md | 2 +- .../notes 014-of-N.md | 2 +- .../notes 015-of-N.md | 2 +- .../notes 016-of-N.md | 2 +- .../The Transitional Period - Extra Notes.md | 14 ++++++------ .../The Transitional Period.md | 8 +++---- ...DF Viewer + Renderer (+ Text Extractor).md | 4 ++-- .../Extracting the text from PDF documents.md | 4 ++-- ...documents' text and the impact on UI+UX.md | 2 +- .../Qiqqa-Repository-Main-README(Copy).md | 7 +++--- .../Software Releases/Where To Get Them.md | 2 +- .../Specialized Bits Of General Technology.md | 1 + ...d elsewhere - stuff to be reckoned with.md | 6 ++--- ...OT mix that bugger with other compilers.md | 4 ++-- ...be Saved.As in browser (Microsoft Edge).md | 2 +- .../Testing - Nasty URLs for PDFs.md | 19 +++++++++++++--- .../Testing - PDF URLs with problems.md | 2 +- ...es.md => curl - command-line and notes.md} | 6 ++--- ...ecovering from b0rked repos and systems.md | 2 +- ...awn of evil and disable it in tesseract.md | 4 ++-- 72 files changed, 182 insertions(+), 152 deletions(-) rename "docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Using ngrams \326\211\326\211 folding N-grams and attributes into trigrams (for N \342\211\247 4).md" => "docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Using n-grams \326\211\326\211 folding N-grams and attributes into trigrams (for N \342\211\247 4).md" (96%) rename docs-src/Notes/Technology/Odds 'n' Ends/{curl - commandline and notes.md => curl - command-line and notes.md} (93%) diff --git a/MuPDF b/MuPDF index 1421a3623..b43cc3fd5 160000 --- a/MuPDF +++ b/MuPDF @@ -1 +1 @@ -Subproject commit 1421a3623b7fbe453136687ef73b390522c73a31 +Subproject commit b43cc3fd5c16c989b06d1d49fd6a238487fd026b diff --git a/docs-src/Notes/Aside/How to attach raw data to a PDF.md b/docs-src/Notes/Aside/How to attach raw data to a PDF.md index 86599e20b..3296000f9 100644 --- a/docs-src/Notes/Aside/How to attach raw data to a PDF.md +++ b/docs-src/Notes/Aside/How to attach raw data to a PDF.md @@ -6,4 +6,4 @@ Regrettably, this almost never happens -- including research/university circles, So we generally succumb to the reality of *scraping content*, whether it's merely to obtain the obvious metadata elements (title, authors, publishing venue, publishing date, *abstract*, ...) or data/charts. Many of us probably won't even realize we are scraping like that -- and what the consequences are for our extracted data quality for our collection at large -- because it's so pervasive, despite bibliographic sites and metadata-offering websites, e.g. Google Scholar. -Once you have \[scraping] tools, you're not set for life either! Then it turns into a statistics game: how good are your tools, how much effort are you willing to spend (cleaning and/or cajoling your tools to do your bidding) and what is the target accuracy/veracity of your extracted data? +Once you have \[scraping\] tools, you're not set for life either! 
Then it turns into a statistics game: how good are your tools, how much effort are you willing to spend (cleaning and/or cajoling your tools to do your bidding) and what is the target accuracy/veracity of your extracted data? diff --git a/docs-src/Notes/Aside/The World of Data Extraction and Re-use/PDF Reading, Annotating and Content Extraction.md b/docs-src/Notes/Aside/The World of Data Extraction and Re-use/PDF Reading, Annotating and Content Extraction.md index 03b91fd15..d7ac76226 100644 --- a/docs-src/Notes/Aside/The World of Data Extraction and Re-use/PDF Reading, Annotating and Content Extraction.md +++ b/docs-src/Notes/Aside/The World of Data Extraction and Re-use/PDF Reading, Annotating and Content Extraction.md @@ -1,6 +1,9 @@ # The World of Data Extraction and Re-use :: PDF Reading, Annotating and Content Extraction -If you want to do more with PDFs than merely *read* them on screen^[for which there's Adobe Acrobat and if you don't like it, there's a plethora of PDF *Viewers* out there, e.g. [SumatraPDF](https://www.sumatrapdfreader.org/free-pdf-reader) and [FoxIt](https://www.foxit.com/).], there's trouble ahead. +If you want to do more with PDFs than merely *read* them on screen[^adobe], there's trouble ahead. + +[^adobe]: for which there's Adobe Acrobat and if you don't like it, there's a plethora of PDF *Viewers* out there, e.g. [SumatraPDF](https://www.sumatrapdfreader.org/free-pdf-reader) and [FoxIt](https://www.foxit.com/). + ## Searchable text, anyone? diff --git a/docs-src/Notes/Misc notes for developers.md b/docs-src/Notes/Misc notes for developers.md index 3c479e640..da6149086 100644 --- a/docs-src/Notes/Misc notes for developers.md +++ b/docs-src/Notes/Misc notes for developers.md @@ -21,9 +21,9 @@ If you run into stuff that's weird, that you don't grok or otherwise have trouble with, please give me feedback on that so it can be improved. I'm not a technical writer (e.g. got too flowery verbiage) and there's already a lot of gain when those pages end up as a useful and *readable* resource for you and others-to-come. -- Might be good to try using the DevStudio debugger on Qiqqa. Though it is a multi-threaded application and that can be confusing with multiple background task copies running in parallel when you debug, so there's a place in the code where you can dial the number of detected processor cores down to 1 and/or disable background tasks entirely -- I coded that but I have to look up the precise spot as it dropped from active memory. The useful bit here is being aware that you can dial down the mayhem when stepping through background or foregound code by limiting/killing the parallelism. +- Might be good to try using the DevStudio debugger on Qiqqa. Though it is a multi-threaded application and that can be confusing with multiple background task copies running in parallel when you debug, so there's a place in the code where you can dial the number of detected processor cores down to 1 and/or disable background tasks entirely -- I coded that but I have to look up the precise spot as it dropped from active memory. The useful bit here is being aware that you can dial down the mayhem when stepping through background or foreground code by limiting/killing the parallelism. -- Also be aware that Qiqqa now has a commandline argument to specify a different base directory for the Qiqqa libraries: that's quite handy when you want to debug/test Qiqqa on a test rig or observe how it behaves when it's a "fresh install", i.e. no user library yet. 
This can be set in DevStudio Project Debug section and is passed to Qiqqa when you start it from DevStudio to run [Ctrl+F5] or debug [F5] +- Also be aware that Qiqqa now has a command-line argument to specify a different base directory for the Qiqqa libraries: that's quite handy when you want to debug/test Qiqqa on a test rig or observe how it behaves when it's a "fresh install", i.e. no user library yet. This can be set in DevStudio Project Debug section and is passed to Qiqqa when you start it from DevStudio to run [Ctrl+F5] or debug [F5] ![](assets/devstudio-project-debug-arguments-view.png) diff --git a/docs-src/Notes/Processing other document types/HTML.md b/docs-src/Notes/Processing other document types/HTML.md index 4457bbb46..dc1df5076 100644 --- a/docs-src/Notes/Processing other document types/HTML.md +++ b/docs-src/Notes/Processing other document types/HTML.md @@ -8,26 +8,26 @@ Next to that, there's notes, comments/critiques and (regrettably relevant some t Now we could take the position where we store everything we encounter as PDF, which is doable most of the time, but sub-optimal, both in formatting and storage costs per item: the HTML-to-PDF renderers available do their best, but most such web-based publications have never been near a page-layout CSS style sheet in their life, or page-layout "print output" work like that is done in a rather low quality fashion: creating great print-ready stylesheets is no *sine cure* and often requires customization per document. Combine this with the relative low priority for the originator to deliver this type of format for their own ultimate benefit and you end up with good will and a "*meh*" result 9 times out of 10. -We also should not forget that, under the hood, we do a lot of work to get back from PDF to a HTML-kind of content format -- you may not notice it as such, but any PDF text extraction process, whether OCR assisted or not, is hard-pressed to produce a continuous data stream uninterrupted by obnoxious page numbers and other page-boundary content that's not relevant to the document flow and the information provided there-in. When we say we're interested in formats such as hOCR, we're not looking at an *addition to the Qiqqa feature set* but rather wonder whether we can store the current text extracts any way **smarter**. Thus the obvious "*least common denominator format for **accessible & processable** document storage*" is HTML/hOCR, rather than PDF. PDF is only very handy because we are, in our own way, *librarians* and thus want to keep the **original source** around for posterity/reference. The academic field (and anyone sane) demands that: no original sources, no back-up for your arguments. *Fake News* and similar patterns make this a still-insecure approach, but it's the best we've got, so storing and keeping those original sources *intact* is a *must*. And there is where PDF shines -- unless our *original sources* are already HTML web pages themselves: then PDF, a page based storage system, shows its non-web roots and fails to deliver. So we would do well to copy/*mirror* web pages we wish to import as *documents* -- keep in mind that the *average* website pages' "*lifetime*" is about 2 years, according to some research and a lot of personal experience. Keep your own copy, which can reproduce the page that *was*, isn't just a nice *hobby* or to be relegated to a visionary like The Wayback Machine: keeping a local copy is an essential part of being a library. 
+We also should not forget that, under the hood, we do a lot of work to get back from PDF to a HTML-kind of content format -- you may not notice it as such, but any PDF text extraction process, whether OCR assisted or not, is hard-pressed to produce a continuous data stream uninterrupted by obnoxious page numbers and other page-boundary content that's not relevant to the document flow and the information provided there-in. When we say we're interested in formats such as hOCR, we're not looking at an *addition to the Qiqqa feature set* but rather wonder whether we can store the current text extracts any way **smarter**. Thus the obvious "*least common denominator format for **accessible & processable** document storage*" is HTML/hOCR, rather than PDF. PDF is only very handy because we are, in our own way, *librarians* and thus want to keep the **original source** around for posterity/reference. The academic field (and anyone sane) demands that: no original sources, no back-up for your arguments. *Fake News* and similar patterns make this a still-insecure approach, but it's the best we've got, so storing and keeping those original sources *intact* is a *must*. And there is where PDF shines -- unless our *original sources* are already HTML web pages themselves: then PDF, a page based storage system, shows its non-web roots and fails to deliver. So we would do well to copy/*mirror* web pages we wish to import as *documents* -- keep in mind that the *average* website pages' "*lifetime*" is about 2 years, according to some research and a lot of personal experience. Keep your own copy, which can reproduce the page that *was*, isn't just a nice *hobby* or to be relegated to a visionary like The WayBack Machine: keeping a local copy is an essential part of being a library. Hence we should be able to store any HTML page *mirror*, i.e. including its CSS, images, etc. *Possibly* we should store those pages as *DOM snapshots* so we are not dependent on changing and buggy (or disappeared) JavaScript code loaded from third-party sites as part of the web page render. The only drawback there is you cannot really snapshot a web page that's very dynamic, i.e. has all kinds of fancy JS-driven content-hiding and showing/revealing built-in. That is regrettable, but does not diminish of having the capability of mirroring/snapshotting a given web page; it rather points at the technical complexities involved when we want to achieve this goal. -## Do we consider storing other *original source* formats, given the logic above? Youtube movies? +## Do we consider storing other *original source* formats, given the logic above? YouTube movies? No. At least not at this moment. -Storing *multimedia* like that will require a few other additions (viewing, processing: CC (Closed Captions) as text extract, etc. -- what to do there and how doable is it, really?!?!) so we'll stop at *documents* which have a large(-ish) *text content* component, which we can then index and search. For video-based multimedia, that *search process* is still the field of a few select players (e.g. automatic CC production in the original language, thanks to speech recognition, as done by Google/Youtube). +Storing *multimedia* like that will require a few other additions (viewing, processing: CC (Closed Captions) as text extract, etc. -- what to do there and how doable is it, really?!?!) so we'll stop at *documents* which have a large(-ish) *text content* component, which we can then index and search. 
For video-based multimedia, that *search process* is still the field of a few select players (e.g. automatic CC production in the original language, thanks to speech recognition, as done by Google/YouTube). ## Do we support more formats? -No. We're not The Wayback Machine: the purpose of all *original source* collecting and storing is to be produce both the original and the text information available therein so we can *find* stuff we seek and *analyze* that (text) content to help our discovery investigative processes. +No. We're not The WayBack Machine: the purpose of all *original source* collecting and storing is to produce both the original and the text information available therein so we can *find* stuff we seek and *analyze* that (text) content to help our discovery investigative processes. -Storing stuff like MSword documents, etc. may sound nice at first, but makes for a quite different library approach: we then SHOULD also offer easy reading/viewing capabilities for all those formats and that's not what we wish to spend our efforts on; meanwhile storing these various formats and only having PDF or HTML/hOCR formats easy-to-read, makes that limited (yet already very complex!) capability a nuisance, rather than a boon. As we cannot treat those other storage formats democratically from a view/render perspective, we are all best best served by not accepting those formats: after all, (almost) all of them can be transported to either the page-oriented PDF format or the continuous-stream-oriented HTML format. +Storing stuff like MS-word documents, etc. may sound nice at first, but makes for a quite different library approach: we then SHOULD also offer easy reading/viewing capabilities for all those formats and that's not what we wish to spend our efforts on; meanwhile storing these various formats and only having PDF or HTML/hOCR formats easy-to-read, makes that limited (yet already very complex!) capability a nuisance, rather than a boon. As we cannot treat those other storage formats democratically from a view/render perspective, we are all best served by not accepting those formats: after all, (almost) all of them can be transported to either the page-oriented PDF format or the continuous-stream-oriented HTML format. With those two we have covered 99.9% of the *original sources* market as far as I'm concerned. diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/!dummy.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/!dummy.md index 8cbe29153..ac07806f7 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/!dummy.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/!dummy.md @@ -22,7 +22,7 @@ https://i.imgur.com/On83uvH.mp4 --- -(Hm, image drag&drop from website behaves wierd: file selector for local storage comes up, then MM seems to fail or lock up?!) +(Hm, image drag&drop from website behaves weird: file selector for local storage comes up, then MM seems to fail or lock up?!)
--- diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Annotating Documents/Annotations Support in JS (Web UI) - Links of Interest.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Annotating Documents/Annotations Support in JS (Web UI) - Links of Interest.md index e6558afe0..10b863ee0 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Annotating Documents/Annotations Support in JS (Web UI) - Links of Interest.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Annotating Documents/Annotations Support in JS (Web UI) - Links of Interest.md @@ -1326,5 +1326,5 @@ https://stackoverflow.com/questions/23205202/user-annotation-overlay-in-html5-ja Installing h in a development environment — h 0.0.2 documentation https://h.readthedocs.io/en/latest/developing/install/ - - + + diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Annotating Documents/Sharing library in a team vs. personal libraries.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Annotating Documents/Sharing library in a team vs. personal libraries.md index c0d8c5b96..4918e2524 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Annotating Documents/Sharing library in a team vs. personal libraries.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Annotating Documents/Sharing library in a team vs. personal libraries.md @@ -26,7 +26,7 @@ If you don't want or like the annotations in the team library, you can always co ## Further thoughts -If folks edit or annotate documents, the idea was to import (and track) each revision of the document: this number can grow when edits happen over a longer period of time and/or in small increments, thus resulting in a large storage cost. Might we consider some sort of *delta compression* here, *iff* that's feasible at all, given the freakiness of the PDF format: a small edit may include a (hidden/non-obvious) restructuring of the binary file layout! Hmmmmmm, might not be bothered with it: if it gets too much, we should allow to easily erase "unimportant revisions", e.g. the ones between the first and the last -- I would personally keep the fist version as a marker of where we started. Besides, the first version is often the 'initial download copy' and relevant due to that fact alone: we should always keep an 'original source copy' (no matter how b0rked it may be). So I strongly prefer to keep at least the first and last revisions at all times, even forbidding deleting those when the document isn't nuked from orbit *entirely already*. +If folks edit or annotate documents, the idea was to import (and track) each revision of the document: this number can grow when edits happen over a longer period of time and/or in small increments, thus resulting in a large storage cost. Might we consider some sort of *delta compression* here, *iff* that's feasible at all, given the freakishness of the PDF format: a small edit may include a (hidden/non-obvious) restructuring of the binary file layout! Hmmmmmm, might not be bothered with it: if it gets too much, we should allow to easily erase "unimportant revisions", e.g. the ones between the first and the last -- I would personally keep the first version as a marker of where we started. Besides, the first version is often the 'initial download copy' and relevant due to that fact alone: we should always keep an 'original source copy' (no matter how b0rked it may be).
So I strongly prefer to keep at least the first and last revisions at all times, even forbidding deleting those when the document isn't nuked from orbit *entirely already*. Thus then the question becomes: what does the user feel is an "unimportant revision"? E.g.: small edits of the OCR extracted text indicate typo fixes: we always keep the latest, thus that would imply we kill the older revision then. Another example: 'unimportant' would be adding and annotation when the document (or page?) already has annotations; meanwhile we have the heuristic that adding an annotation where there were *none* before should be considered an *important* edit: then, when we nuke unimportant edits, we consequently end up with the last unannotated revision, followed by the latest fully annotated revision: all the intermediate steps (revisions) are nuked due to the 'unimportant update' rule. diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Completely new platform or not, that is the question.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Completely new platform or not, that is the question.md index 5e7444b36..5267987ff 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Completely new platform or not, that is the question.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Completely new platform or not, that is the question.md @@ -13,7 +13,7 @@ When I say I am looking into "moving to an electron-like system, e.g. Chromely o Anywhere WPF or XAML gets mentioned in relation to Qiqqa (that's the UI tech currently used for it and it's Microsoft/windows only), the "moving to electron/CEF" means that part is going to be redone as "HTML web pages" one way or another. - electron/Chromely/CEF/... are all basically concepts of a stripped down Chrome browser glued to some "business layer backend": + electron/Chromely/CEF/... are all basically concepts of a stripped down Chrome browser glued to some "business layer back-end": - electron is Chrome+NodeJS, so JavaScript all the way (not my first choice therefor), - Chromely is Chrome+C#/.NET, which would make my *hope* to keep most of the business logic in C# at least *possible* (regrettably Chromely isn't very active, nor is it *finished* for what I need from it 😓 ) and @@ -25,11 +25,11 @@ When I say I am looking into "moving to an electron-like system, e.g. Chromely o Though I am approaching this like the classical solution to the Gordian Knot though: wielding an axe and observing which parts survive. (Alexander wielded a *sword*; I'm not qualified for that 😄 ) -- First *minor* counterpoint to that last one (keeping as much of the original C# code as possible) is my current activity around PDF processing, which is to replace SORAX + QiqqaOCR + very old tesseract.Net (document OCR + text extraction + metadata extraction) with something new, based on Artifex' MuPDF and "master branch" = leading edge `tesseract` v5.0 codebases, which are C/C++. Ultimately that should take care of #35, #86, #165, #193 and (in part) #289. +- First *minor* counterpoint to that last one (keeping as much of the original C# code as possible) is my current activity around PDF processing, which is to replace SORAX + QiqqaOCR + very old tesseract.Net (document OCR + text extraction + metadata extraction) with something new, based on Artifex' MuPDF and "master branch" = leading edge `tesseract` v5.0 code-bases, which are C/C++. 
Ultimately that should take care of #35, #86, #165, #193 and (in part) #289. - QiqqaOCR (a tool used by Qiqqa under the hood) is currently C# glueing those old libraries together and is being replaced that way with an entirely different codebase in C/C++. Linux-ready? Yes. That part will then be ready for Linux et al, requiring some CMake work (or similar) to compile that collective chunk of software on Linuxes, but the codebase itself won't be in the way and the parts used are already in use on Linux platforms, individually. + QiqqaOCR (a tool used by Qiqqa under the hood) is currently C# glueing those old libraries together and is being replaced that way with an entirely different code-base in C/C++. Linux-ready? Yes. That part will then be ready for Linux et al, requiring some CMake work (or similar) to compile that collective chunk of software on Linuxes, but the code-base itself won't be in the way and the parts used are already in use on Linux platforms, individually. -- Second *larger* counterpoint is the current C# Qiqqa codebase, which has its "business logic/glue" quite tightly intertwined with the UI code: that's a bit of a bother to untangle. The v83 experimental Qiqqa releases and the UI issues reported by several users during the last year or so is me fiddling (and screwing up) with that Gordian knot while spending too many hours in travel + house construction work. +- Second *larger* counterpoint is the current C# Qiqqa code-base, which has its "business logic/glue" quite tightly intertwined with the UI code: that's a bit of a bother to untangle. The v83 experimental Qiqqa releases and the UI issues reported by several users during the last year or so is me fiddling (and screwing up) with that Gordian knot while spending too many hours in travel + house construction work. - Another planned mandatory migration is getting rid of the antique Lucene.NET, which is the core facilitating the *document / text search* features in Qiqqa. Lucene.NET still exists out there, but is rather slow in upgrading and much less actively supported than the *true original*: Lucene, which is done in Java. Java+C# is a mix for the sufferers in the (unmentioned) 9th level of Dante's Hell, so best to avoid it, which is why I'm opting for using SOLR, which is bluntly speaking Lucene wrapped in a (local) web server. C# has no trouble talking to web sites like that, so we're staying clear of Java+C# tight mixes. diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Considering the database design - the trouble with a string UUID.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Considering the database design - the trouble with a string UUID.md index 8831401cc..6a23a671a 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Considering the database design - the trouble with a string UUID.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Considering the database design - the trouble with a string UUID.md @@ -24,10 +24,10 @@ With the new BLAKE3+BASE58X-based scheme, that *string*-typed key would be a *fi > - [Datatypes In SQLite](https://www.sqlite.org/datatype3.html) > > Ergo: now that we have discovered that SQLite differs from our usual SQL databases in that it fully supports BLOBs as primary key field type, the question then gets raised: "**What's the use of BASE58X, after all? 
Won't `SELECT hex(blake3) FROM document_table` do very nicely for all involved, including future external use?**" -> x +> > Erm... Good point. Further checks into the SQLite API show that I can feed those BLAKE3 hashes verbatim using query parameters and the appropriate APIs: > - [Binding Values To Prepared Statements](http://www.sqlite.org/c3ref/bind_blob.html) -> `sqlite3_bind_blob()` and `sqlite3_bind_blob64()` -> x +> > This would mean the only reason to employ BASE58X is to keep transmission costs low (storage being covered by this ability of SQLite to store BLOBs as primary keys), but does this weigh against using the default `hex()` encode already offered by SQLite? (`select hex(blake3) from ...` vs. custom function registration requiring `select base58x(blake3) from ...`) Given the note above, *not using any string encoding at all* is a viable option while we're using SQLite. This would mean our BLAKE3 hash can be used as a key of length $256/8 = 32$ bytes. Which is *even shorter* than the original (*string*-typed) SHA1B hash (at 40 characters long). diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Document metadata - flattened table or what.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Document metadata - flattened table or what.md index acf7a5567..79de53518 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Document metadata - flattened table or what.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Document metadata - flattened table or what.md @@ -4,7 +4,7 @@ Current Qiqqa (v83 and older) dump the Qiqqa metadata as a single BibTeX wrapped *We* intend to have *versioning* added, so we can slowly improve our metadata, either as a whole or in part, e.g. by correcting typos in the document title and then *updating* our metadata database: later on we SHOULD be able to observe that attribute update as part of a revision list - a bit a la `git log`. You get the idea. -Part of that *gradual improvement process* is tagging the updates with a reliability / believability / **feducy** rating: auto-guestimated metadata should be ranked rather low, until a human has inspected and *vetted* the data, in which case it will see a significant jump in *feducy ranking*. Specialized (semi-)automated metadata improvement processes MAY bump up the rank a bit here and there, if we find that idea actually useful in practice, e.g. e.g. bumping up the ranking when we've discovered the metadata element matches for several (deemed *independent*) sources? Or when we have spell-checked the Abstract or blurbs in a fully automated pre-process? (So we wouldn't have to do so much vetting/correcting by hand any more, we hope.) Or specialized user-controlled scripts have gone over the available metadata and submitted their own (derived) set as the conclusion to their activity -- assuming they *add* to the overall metadata quality and hence deserve a slight bump in their metadata records' ranking too! +Part of that *gradual improvement process* is tagging the updates with a reliability / believability / **feducy** rating: auto-guesstimated metadata should be ranked rather low, until a human has inspected and *vetted* the data, in which case it will see a significant jump in *feducy ranking*. 
Specialized (semi-)automated metadata improvement processes MAY bump up the rank a bit here and there, if we find that idea actually useful in practice, e.g. bumping up the ranking when we've discovered the metadata element matches for several (deemed *independent*) sources? Or when we have spell-checked the Abstract or blurbs in a fully automated pre-process? (So we wouldn't have to do so much vetting/correcting by hand any more, we hope.) Or specialized user-controlled scripts have gone over the available metadata and submitted their own (derived) set as the conclusion to their activity -- assuming they *add* to the overall metadata quality and hence deserve a slight bump in their metadata records' ranking too! There's also our *multiple sources for (nearly) the same stuff* conundrum to consider here: we've observed there's multiple BibTeX records available out there (Google Scholar being only one of them), with various and *varying* degree of completeness and other marks of quality. For a single document. So we wish to import some or all of those "*variants*" / *alternatives* and mix & mash them as we, the *user*, may see fit. Consequently, we need to record the source / origin for each *element* of our metadata and at the same (low) ranking we MAY expect multiple entries (records?) from different origins. Our final metadata produce COULD be a mix/mash of these multiple sources. Meanwhile I want to be able to offer users the ability to query the database directly (when they're up for it, technically) and then query for a title, author or other metadata *element* match. This implies we're to move from a wrapper-in-a-wrapper whole-record-dump to a per-element database table layout. This speaks strongly for a *normalized* database table layout, e.g. key:(document ID, metadata item name/ID, source ID, ranking) + data:(ranking?, version?, metadata value). Thus we would expect to have several tens of records per document. Which implies we have to reckon with a *large* metadata table with a non-unique index and all the performance worries such a beast entails... diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Storing a wide range of date+time-stamps of arbitrary precision in a 64-bit number.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Storing a wide range of date+time-stamps of arbitrary precision in a 64-bit number.md index 101269fa6..4d0a2f5c7 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Storing a wide range of date+time-stamps of arbitrary precision in a 64-bit number.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Database Design/Storing a wide range of date+time-stamps of arbitrary precision in a 64-bit number.md @@ -165,7 +165,7 @@ Both schemes are perfectly *sortable*, while only the latter *microseconds since All in all, it looks like a tie: when time differences are important and many in your application, then the *since epoch* scheme will be the preferred one, as a difference value will be two bit masking operations and a subtraction away, while Option A will take 7 subtractions and assorted multiplications and additions before you have your time distance value. -Meanwhile, when fast conversion to and from dates and times in various text formats is your prime business, then the bitfields' based Option A will certainly be your preferred choice.
+Meanwhile, when fast conversion to and from dates and times in various text formats is your prime business, then the bit-fields' based Option A will certainly be your preferred choice. ## Conclusion: Option A is the preferred design for Qiqqa diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/OCR text extract engine - thoughts on the new design.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/OCR text extract engine - thoughts on the new design.md index 327f294d0..2f5e12c52 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/OCR text extract engine - thoughts on the new design.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/OCR text extract engine - thoughts on the new design.md @@ -14,15 +14,15 @@ That also involves image sectioning: multi-column and other non-super-simple lay > > Another approach there would be a post-mortem filter where *everything* is extracted and we deal will cover pages by recognizing and filtering them at the hOCR/extracted-text level, i.e. not by looking at the original layout/image, but by looking at the text/layout output. While this is a viable approach IMO, I'd rather also wish to end up with *cleaned up PDFs*, derived from the originals. For print and other viewing purposes where I don't want to be bothered with the cover-page clutter. -Having the entire OCR+textExtract process configurable/modifyable through a little scripting is all nice, but it would lengthen and already long running process, which is, without additional means, very *unsuitable* for any UI responsive approach, be it based on chatting with a webserver or otherwise. +Having the entire OCR+textExtract process configurable/modifiable through a little scripting is all nice, but it would lengthen an already long running process, which is, without additional means, very *unsuitable* for any UI responsive approach, be it based on chatting with a web-server or otherwise. I've consider "push technology" or "chat lines" (SocketIO et al) for the mupdf system to talk to the top layer, i.e. the UI. However, while these technologies exist, it is still a "unconventional"/irregular to have a (web) interface like that. It's easier to code (also by others) to have a system which is basically "web page" with possibly a polling system underneath to provide UI updates. I think polling here would be easier, as that would make for some very simple JavaScript actions to update the UI. Reactive is all nice & dandy, but that would then have travel through the entire system, down into the bowels of mupdf and other systems we're using. So at some point a polling mechanism would be in order to ensure the UI gets periodic updates. Another though in *support* of polling (vs fully reactive / event driven) is the notion that: -- UI updates for a batch processing system don't need to show *all of it*: the batch system should hopefully be faster than the human eye could perceive UI updates, so you want to see work happening, either through snapshots or less strict means where your UI shows updates of the work done in the backend "as it happens".
With events and reactive you then would need to code filters, which collect or discard events percolating up, for otherwise you'ld be swamped in updates that are unnecessary and only loading the UI system tremendously: imagine a system which does process about, say, 50 pages per second through the pipeline. That would mean your entire stack should be realtime video-capable if all those pages (and their intermediate filter stages when you're watching the pipeline progress page I imagine) are to be drawn in the UI. Let alone the cost in computing, it's too fast to be useful for folks anyway. That power could better be used to speeed up the PDF processing itself. -- polling as a mechanism to talk to the mupdf+extras backend means we can use basic HTTP request-response interface approaches, which are well known and supported by many. A lot of knowledge about such data flows is available, when we're stuck. Restartable/Continuable systems are *hard* to engineer and rare besides. Any push tek (or event propagation from backend to UI front, which would be the same, conceptually) is harder to mix with an otherwise request-response based system. Better to keep the reactive part to the UI only and do the mupdf augmentations in a classic fashion. +- UI updates for a batch processing system don't need to show *all of it*: the batch system should hopefully be faster than the human eye could perceive UI updates, so you want to see work happening, either through snapshots or less strict means where your UI shows updates of the work done in the back-end "as it happens". With events and reactive you then would need to code filters, which collect or discard events percolating up, for otherwise you'd be swamped in updates that are unnecessary and only loading the UI system tremendously: imagine a system which does process about, say, 50 pages per second through the pipeline. That would mean your entire stack should be real-time video-capable if all those pages (and their intermediate filter stages when you're watching the pipeline progress page I imagine) are to be drawn in the UI. Let alone the cost in computing, it's too fast to be useful for folks anyway. That power could better be used to speed up the PDF processing itself. +- polling as a mechanism to talk to the mupdf+extras backend means we can use basic HTTP request-response interface approaches, which are well known and supported by many. A lot of knowledge about such data flows is available, when we're stuck. Restartable/Continuable systems are *hard* to engineer and rare besides. Any push tek (or event propagation from back-end to UI front, which would be the same, conceptually) is harder to mix with an otherwise request-response based system. Better to keep the reactive part to the UI only and do the mupdf augmentations in a classic fashion. That drives the question how we can ensure the caller (UI) is able to find out about the progress of our backend work? Do we define a set of states in the process to monitor and list those as part of the initial response? Hmmmm. @@ -51,7 +51,7 @@ What about caching those images, etc.? We don't know what images will be requested by the UI, and in what order. The UI view may be a simplified/contracted display of the process stages, each of which *may* have a 'print' statement at some point. When the user *refreshes* a web page, we loose all images downloaded so far: if we are going to cache all of them, we're back at the memory/storage size problem again. So some images have to loose and disappear...
Current thought is that the cache should monitor the age of creation/updating of the image -- the image may be updated due to looping, next page or whatever the conditions are in the `print` statement in the processing script. -The cache should also keep track of when which image was requested last. Knee-jerk design would say the last-accessed images should be kept around, but then thinking about the visualization of the process, which may follow through different sub-pipelines while processing different pages/documents, we always wish to see "what's been happening most recently in there", so the creation date is perhaps more important. At least the 'bleeding edge' should be maintained and not discarded because older images happen to be requested a lot and thus bumped up the cache lifecycle. +The cache should also keep track of when which image was requested last. Knee-jerk design would say the last-accessed images should be kept around, but then thinking about the visualization of the process, which may follow through different sub-pipelines while processing different pages/documents, we always wish to see "what's been happening most recently in there", so the creation date is perhaps more important. At least the 'bleeding edge' should be maintained and not discarded because older images happen to be requested a lot and thus bumped up the cache life-cycle. That means we should create a cache which tracks two 'ages' (not necessarily *timestamps*; it can be counter-trackers which determine the age by subtracting from a 'right-now' counter marker) at least: create/update + fetch times. Because we cannot know how long it takes for another image to be created and thus land in the cache, if we want only a limited number of 'front of the activity wave' images to stay around no matter what, timestamps are *out*. Tracking this sort of thing is much easier with counter-trackers as the 'front of the wave' will simply be aged 1..N for N front-of-the-wave images. That makes cache management and thus cache coding easy: we can fiddle with the control where request/fetch age becomes more important for live-or-die consideration, past the age=N point for the front-of-the-wave images in there. diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/OpenMP & tesseract - NSFQ (Not Suitable For Qiqqa).md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/OpenMP & tesseract - NSFQ (Not Suitable For Qiqqa).md index c1791bb44..9b41fdc1b 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/OpenMP & tesseract - NSFQ (Not Suitable For Qiqqa).md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/OpenMP & tesseract - NSFQ (Not Suitable For Qiqqa).md @@ -31,7 +31,7 @@ Qiqqa will **always** employ tesseract in batch mode, even when you might not re OpenMP can be quite beneficial for those other applications, which perform a single duty and can assume there's no competition when it comes to assigning CPU cycles. This can apply to games and does certainly apply to a lot of research applications such as simulations, which are, by decree, often allowed to hog an entire machine all by their lonesome. 
*Then*, when there's no smarter solution coded into those applications yet, OpenMP can produce a nett benefit -- while *very probably* increasing the carbon footprint per action, thanks to the overhead inherent in its use, but then again you might not be bothered about that little detail, so OpenMP is *good for you*. - With Qiqqa, after I have looked into the matter in more detail, the conclusion is simple and potentially counter-intuitive: OpenMP is not a sane part of any solution or subsystem we choose to \[contiunue to] use, as almost all CPU-intensive subtasks appear in batches, either in 'monoculture' or 'polyculture' form, and user-visible results depend on multiple subtasks in those batches, rather than a *single* subtask, to complete as soon as possible. + With Qiqqa, after I have looked into the matter in more detail, the conclusion is simple and potentially counter-intuitive: OpenMP is not a sane part of any solution or subsystem we choose to \[continue to\] use, as almost all CPU-intensive sub-tasks appear in batches, either in 'mono-culture' or 'poly-culture' form, and user-visible results depend on multiple sub-tasks in those batches, rather than a *single* sub-task, to complete as soon as possible. > This means that prioritized task queue management can be expected to be far more relevant and important for an optimal user experience. > Indeed, we have been observing this task queue load problem already for several years (with commercial Qiqqa and later, with the open source versions as well) with our large libraries, where task queue pressures mount to very high numbers, particularly when importing / recovering / adding new libraries, which naturally do not have (most of) the OCR, text extraction & processing, FTS indexing and keyword / category detection work done yet -- or having had one or more of those components' overall results flagged as corrupted or invalid, thanks to earlier application crashes and other failures. diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/Using a database for OCR text extract cache.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/Using a database for OCR text extract cache.md index 42d768944..321df1346 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/Using a database for OCR text extract cache.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Document OCR & Text Extraction/Using a database for OCR text extract cache.md @@ -1,9 +1,9 @@ # Using a database for OCR / *text extracts* cache -Qiqqa classically uses a filesystem directory tree (2 levels, segmented by the first byte of the document content hash (a.k.a. *Document ID*), where each (PDF) *document* results in two or more files: +Qiqqa classically uses a file-system directory tree (2 levels, segmented by the first byte of the document content hash (a.k.a. *Document ID*), where each (PDF) *document* results in two or more files: - 1 tiny text file caching the *document page count*. 
-- 1 or more text extract files (text format, but proprietary: each '*word*' in the text is encoded as a serialized tuple: `(bbox coordate x0, y0, w, h, word_text)`, resulting in a rather high overhead for the ASCII-text serialized *bbox* (bounding box) coordinates, using 5-significant-digits per coordinate, taking up 8 (text) bytes per coordinate for a positioning accuracy that's a little less that 32bit IEEE *float* -- plus it costs additional CPU serialization and deserialization overhead on write/read. +- 1 or more text extract files (text format, but proprietary: each '*word*' in the text is encoded as a serialized tuple: `(bbox coordinate x0, y0, w, h, word_text)`, resulting in a rather high overhead for the ASCII-text serialized *bbox* (bounding box) coordinates, using 5-significant-digits per coordinate, taking up 8 (text) bytes per coordinate for a positioning accuracy that's a little less than 32-bit IEEE *float* -- plus it costs additional CPU serialization and deserialization overhead on write/read). An example record for an `ocr` file: @@ -28,7 +28,7 @@ When storage cost is considered*an added challenge*, then we might be better off As this is an item that is expected to need to be highly performant, *and* is very easy and basic re expected content queries, we are considering using a NoSQL database system instead. -These *content queries* are currently used in the Qiqqa codebase: +These *content queries* are currently used in the Qiqqa code-base: - get page count for document `D` - get all text content for document `D` (to feed to the FTS engine, keyword analysis and auto-suggestion, topic analysis, FTS search query highlighting, ...) diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Essential yet hard(er) to port UI features.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Essential yet hard(er) to port UI features.md index e1b04167f..bdd2b1747 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Essential yet hard(er) to port UI features.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Essential yet hard(er) to port UI features.md @@ -57,7 +57,14 @@ You (or rather, your **users**!) **must** be able to ## I don't know about you, but I have a library that's *huge* and that has... *consequences*! -When you test with a 10 - 100 document library, anything in a UI would be performant. However, when you start testing with a 40K - 50K+ long list of documents, things 'suddenly' slow down to a crawl, UX-wise: the list view of the library **must** have a 'virtual scroll' type behaviour built-in to prevent your UI from screeching to a tormented halt as you force it to render 50K+ custom formatted PDF document entries ^[each document entry displays several items from the metadata as text, while temporary/context-dependent bits like search score (0-100% estimates this is what you were looking for), ranking, etc. are included as well, plus a colorful 'mood bar' that represents the Expedition/Tags category-like overview of said item. ^[hard to explain that one; must refer to the Qiqqa manual for the proper words] All those bits together are rendered in a wide rectangular bar, which means you'll need to have a **custom sortable list view** component available which can cope with 100K custom rendered items **at speed**. ^[ Qiqqa slows down to a crawl due to lookup/render work happening *synchronously* in the UI thread, among other things.
Performance profiling has not found the ListView itself to be the culprit but there's plenty around it that together makes your CPU go "unnnng!" for a while there, every time you decide to move/scroll around. Qiqqa, today, does not sport async UI activity throughout the application -- that would be another major refactor to accomplish as I've looked into using C# `async`/`await`, for example, but the current state of the UI code would mean I'll have to do it for all at once (AFAICT) unless I'm okay with (temporary) deterioration of performance while the codebase is refactored to deliver a proper (and stable!) async UX. This, by the way, is another reason why I'm looking at going to Electon: despite my worries over some very important bits, I feel more safe and 'future-proof' there then here in WPF. See also the rant page XXX.] ] ] +When you test with a 10 - 100 document library, anything in a UI would be performant. However, when you start testing with a 40K - 50K+ long list of documents, things 'suddenly' slow down to a crawl, UX-wise: the list view of the library **must** have a 'virtual scroll' type behaviour built-in to prevent your UI from screeching to a tormented halt as you force it to render 50K+ custom formatted PDF document entries.[^1] + +[^1]: each document entry displays several items from the metadata as text, while temporary/context-dependent bits like search score (0-100% estimates this is what you were looking for), ranking, etc. are included as well, plus a colorful 'mood bar' that represents the Expedition/Tags category-like overview of said item. [^2] + +[^2]: hard to explain that one; must refer to the Qiqqa manual for the proper words. All those bits together are rendered in a wide rectangular bar, which means you'll need to have a **custom sortable list view** component available which can cope with 100K custom rendered items **at speed**.[^3] + +[^3]: Qiqqa slows down to a crawl due to lookup/render work happening *synchronously* in the UI thread, among other things. Performance profiling has not found the ListView itself to be the culprit but there's plenty around it that together makes your CPU go "unnnng!" for a while there, every time you decide to move/scroll around. Qiqqa, today, does not sport async UI activity throughout the application -- that would be another major refactor to accomplish as I've looked into using C# `async`/`await`, for example, but the current state of the UI code would mean I'll have to do it for all at once (AFAICT) unless I'm okay with (temporary) deterioration of performance while the codebase is refactored to deliver a proper (and stable!) async UX. This, by the way, is another reason why I'm looking at going to Electron: despite my worries over some very important bits, I feel more safe and 'future-proof' there than here in WPF. See also the rant page XXX. + So you'll need have a list or grid view like component ready, which - can handle large numbers of rows and custom row display (same custom render for each row, though, so that's less of a worry) **and** @@ -66,7 +73,7 @@ So you'll need have a list or grid view like component ready, which ## Another, though maybe lesser, worry is the desire to move Qiqqa forward and part of this is better UX: faster performance for growing collections -Modern day UIs should have some way to be coded **asynchronously**: once you've dealt with the 'virtual scrolling', etc.
performance bits in the UI itself, it's the speed at which the data to the UI can be incoming and Qiqqa is not and will not be *instantaneous* at all times: particularly when working with the bibTeX Sniffer, you will observe in current Qiqqa (and obviously anyone else with any such functionality!) that the stuff you see and collect in the Sniffer (bibTeX and XML metadata, PDF documents which may be new *or* a copy of a document in your library) will have to wait a bit before it is actually available in the library proper and thus added/**visible** in the library list: PDF documents must be collected, hashed and stored in the library, which takes some disk I/O and *takes time*, while incoming metadata must be processed and stored in the database before it's actually part of the library, hence another couple of disk I/O actions *which take time*. Consequently, some or all of a row/record's data is not available at the moment the row is rendered for the first time. This requires async capabilities or you'ld end up with a slew of refresh/rerender actions, **or** your UI (temporarily) slows down to a halt in wait for the data to arrive -- *that* is Qiqqa's current behaviour and it's, well, bloody irritating to me, at least. So do yourself and any users a favor and check async capability before you move. +Modern day UIs should have some way to be coded **asynchronously**: once you've dealt with the 'virtual scrolling', etc. performance bits in the UI itself, it's the speed at which the data to the UI can be incoming and Qiqqa is not and will not be *instantaneous* at all times: particularly when working with the bibTeX Sniffer, you will observe in current Qiqqa (and obviously anyone else with any such functionality!) that the stuff you see and collect in the Sniffer (bibTeX and XML metadata, PDF documents which may be new *or* a copy of a document in your library) will have to wait a bit before it is actually available in the library proper and thus added/**visible** in the library list: PDF documents must be collected, hashed and stored in the library, which takes some disk I/O and *takes time*, while incoming metadata must be processed and stored in the database before it's actually part of the library, hence another couple of disk I/O actions *which take time*. Consequently, some or all of a row/record's data is not available at the moment the row is rendered for the first time. This requires async capabilities or you'ld end up with a slew of refresh/re-render actions, **or** your UI (temporarily) slows down to a halt in wait for the data to arrive -- *that* is Qiqqa's current behaviour and it's, well, bloody irritating to me, at least. So do yourself and any users a favor and check async capability before you move. 
diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Fingerprinting Documents/Fingerprinting - moving forward and away from b0rked SHA1.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Fingerprinting Documents/Fingerprinting - moving forward and away from b0rked SHA1.md index 46260dbef..9455a449e 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Fingerprinting Documents/Fingerprinting - moving forward and away from b0rked SHA1.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Fingerprinting Documents/Fingerprinting - moving forward and away from b0rked SHA1.md @@ -1,6 +1,6 @@ # Fingerprinting :: moving forward and away from b0rked SHA1 -Okay, it's a known fact that the SHA1-based fingerprinting used in qiqqa to identify PDF documents is flawed: that's a bug that exists since the inception of the software. The SHA1 binary hash was encoded as HEX, but any byte value with a zero(0) for the most significant nibble would have that nibble silently discarded. +Okay, it's a known fact that the SHA1-based fingerprinting used in Qiqqa to identify PDF documents is flawed: that's a bug that exists since the inception of the software. The SHA1 binary hash was encoded as HEX, but any byte value with a zero(0) for the most significant nibble would have that nibble silently discarded. This results in a couple of things, none of them major, but enough for me to consider moving: @@ -135,7 +135,7 @@ Notes on these datums: But my concern are the growth factors. Let's have a look: | encoding | calculus & growth factor | - |--------------|----------------------------| + |--------------|----------------------------|-----------------------| | HEX: | $log(256) / log(16) = 2$, which is expected | | Base64: | $log(256) / log(64) = 1.3333333$, as expected. | | Base85: | $log(256) / log(85) = 1.24816852$ vs. Wikipedia's $(1/0.80) = 1.25$| diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Can and should we use our BLAKE3+BASE58X based unique identifier as unique document id.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Can and should we use our BLAKE3+BASE58X based unique identifier as unique document id.md index d8e91ee57..ca20e9c1c 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Can and should we use our BLAKE3+BASE58X based unique identifier as unique document id.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Can and should we use our BLAKE3+BASE58X based unique identifier as unique document id.md @@ -20,9 +20,9 @@ A few questions need to be asked and evaluated: - **performance penalties**: is there any performance penalty for using these 44-character wide BLAKE3+BASE58X globally unique document ids? - > SQLite benefits from having 64bit integer primary keys for its tables. *That* then is a *locally unique document id* as it will benefit the performance of that particular subsystem. + > SQLite benefits from having 64-bit integer primary keys for its tables. *That* then is a *locally unique document id* as it will benefit the performance of that particular subsystem. 
> - > If we want to (be able to) present globally unique identifiers to the outside world, a 64bit `rowid` to BLAKE3 document hash mapping table will have to be provided, making all queries more complex as that's an added `JOIN tables` for every query which has to adhere to these "*globally unique document ids only*" demands. + > If we want to (be able to) present globally unique identifiers to the outside world, a 64-bit `rowid` to BLAKE3 document hash mapping table will have to be provided, making all queries more complex as that's an added `JOIN tables` for every query which has to adhere to these "*globally unique document ids only*" demands. > > > The BASE58X part is only important when you wish to reduce the published ASCII string hash id in a short form: BASE58X delivers the BLAKE3 hash in 44 characters while a simple `hex(hash)` function will produce a 64 characters wide string for the same. > @@ -30,7 +30,7 @@ A few questions need to be asked and evaluated: > > > \[Edit:] looks like there's no impact... but what about their storage cost? That remains unmentioned so we'll have to find out ourselves. -- **acceptable to all subsystems**: is the BLAKE3 256bit hash acceptable as identifying *unique document id* for all our subsystems (much of which we won't have written ourselves) or do we need to jump through a couple of hoops to get them accepted? +- **acceptable to all subsystems**: is the BLAKE3 256-bit hash acceptable as identifying *unique document id* for all our subsystems (much of which we won't have written ourselves) or do we need to jump through a couple of hoops to get them accepted? > My *initial guess* is that everybody will be able to cope, more or less easily, but actual practice can open a few cans of worms, if you are onto Murphy's Law. 
> diff --git "a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Using ngrams \326\211\326\211 folding N-grams and attributes into trigrams (for N \342\211\247 4).md" "b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Using n-grams \326\211\326\211 folding N-grams and attributes into trigrams (for N \342\211\247 4).md" similarity index 96% rename from "docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Using ngrams \326\211\326\211 folding N-grams and attributes into trigrams (for N \342\211\247 4).md" rename to "docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Using n-grams \326\211\326\211 folding N-grams and attributes into trigrams (for N \342\211\247 4).md" index 49e1955fb..594275c8b 100644 --- "a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Using ngrams \326\211\326\211 folding N-grams and attributes into trigrams (for N \342\211\247 4).md" +++ "b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Using n-grams \326\211\326\211 folding N-grams and attributes into trigrams (for N \342\211\247 4).md" @@ -1,8 +1,8 @@ -# Using *ngrams* ։։ folding N-grams and attributes into *trigrams* (for N ≧ 4) +# Using *n-grams* ։։ folding N-grams and attributes into *trigrams* (for N ≧ 4) This idea was triggered after reading [The technology behind GitHub’s new code search | The GitHub Blog](https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search/). Quoting some relevant parts from there: -> The ngram indices we use are especially interesting. While trigrams are a known sweet spot in the design space (as [Russ Cox and others](https://swtch.com/~rsc/regexp/regexp4.html) have noted: bigrams aren’t selective enough and quadgrams take up too much space), they cause some problems at our scale. +> The n-gram indices we use are especially interesting. While trigrams are a known sweet spot in the design space (as [Russ Cox and others](https://swtch.com/~rsc/regexp/regexp4.html) have noted: bigrams aren’t selective enough and quadgrams take up too much space), they cause some problems at our scale. > > For common grams like `for` trigrams aren’t selective enough. We get way too many false positives and that means slow queries. An example of a false positive is something like finding a document that has each individual trigram, but not next to each other. You can’t tell until you fetch the content for that document and double check at which point you’ve done a lot of work that has to be discarded. We tried a number of strategies to fix this like adding follow masks, which use bitmasks for the character following the trigram (basically halfway to quad grams), but they saturate too quickly to be useful. > @@ -27,7 +27,7 @@ This idea was triggered after reading [The technology behind GitHub’s new code > [0,7] = "er " > ``` > -> Using those weights, we tokenize by selecting intervals where the inner weights are strictly smaller than the weights at the borders. The inclusive characters of that interval make up the ngram and we apply this algorithm recursively until its natural end at trigrams. At query time, we use the exact same algorithm, but keep only the covering ngrams, as the others are redundant. 
+> Using those weights, we tokenize by selecting intervals where the inner weights are strictly smaller than the weights at the borders. The inclusive characters of that interval make up the n-gram and we apply this algorithm recursively until its natural end at trigrams. At query time, we use the exact same algorithm, but keep only the covering n-grams, as the others are redundant. While the weights shown in the quoted article's diagram don't make sense (at least the 0-weight I'ld have expected to be something like 4 or 5, or the sequence start with 9,3,6 instead of 9,6,3) the trigger for me was "*like adding follow masks, which use bitmasks for the character following the trigram (basically halfway to quad grams), but they saturate too quickly to be useful*": how about we *encode attributes* in a trigram, eh? @@ -46,7 +46,7 @@ What would make more sense, at least to me, is using a weighting algo where you Hence `"chester "` (*note their inclusion of the trailing whitespace, but not a leading one -- we'll stick with that for now*) would then have 'midpoint' `"st"` (thanks to that trailing whitespace being accounted for) and weights 2,1. *Radiating out* this would give the weight sequence: 6,4,2,1,3,5,7. -Okay, now redo this with word delimiters like I've seen described for regular human languages' word tokenizers: `""`, where "<>" are arbitrarily chosen SOW (Start Of Word) and EOW (End Of Word) markers. Of course, when you intend to parse, index and search *program source code*, those "<>" would be a *particularly* bad choice as for *programming languages* those 'punctuation characters' are rather actual keywords themselves (*operators*), so you might want to pick something way off in the Unicode range where you are sure nobody will be bothering you by feeding you content that carries those actual delimiter codepoints you just picked as part of their incoming token stream. +Okay, now redo this with word delimiters like I've seen described for regular human languages' word tokenizers: `""`, where "<>" are arbitrarily chosen SOW (Start Of Word) and EOW (End Of Word) markers. Of course, when you intend to parse, index and search *program source code*, those "<>" would be a *particularly* bad choice as for *programming languages* those 'punctuation characters' are rather actual keywords themselves (*operators*), so you might want to pick something way off in the Unicode range where you are sure nobody will be bothering you by feeding you content that carries those actual delimiter code-points you just picked as part of their incoming token stream. 
So let's pick something cute for our example:  🙞  🙜  diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Why it might be smart to treat 64-bit document id integer as a 63-bit one instead.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Why it might be smart to treat 64-bit document id integer as a 63-bit one instead.md index 05af98541..08f890662 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Why it might be smart to treat 64-bit document id integer as a 63-bit one instead.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/Why it might be smart to treat 64-bit document id integer as a 63-bit one instead.md @@ -6,7 +6,7 @@ Now, I hear you say -- and you are *so utterly correct* -- "*but that's so damn Yes, as long as we all remain in the nice computer programming languages (C/C++ preferentially, [for there's *no admittance* for ones like, well, *JavaScript* and *TypeScript*](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/MAX_SAFE_INTEGER)!) -However, there are many circumstances where we will have some form of *formatted-as-text* ids traveling the communications' paths, both between Qiqqa subsystems (FTS <-> SQLite metadata database core engine, f.e.) and external access APIs (localhost web queries where external user-written scripts interface with our subsystems directly using cURL/REST like interfaces and/or larger JSON/XML data containers): in these circumstances it is highly desirable to keep the risk of confusion, including confusion and subtle bugs about whether something should be encoded and then (re)parsed as either *signed* or *unsigned* 64bit integer to a bare minimum. +However, there are many circumstances where we will have some form of *formatted-as-text* ids traveling the communications' paths, both between Qiqqa subsystems (FTS <-> SQLite metadata database core engine, f.e.) and external access APIs (localhost web queries where external user-written scripts interface with our subsystems directly using cURL/REST like interfaces and/or larger JSON/XML data containers): in these circumstances it is highly desirable to keep the risk of confusion, including confusion and subtle bugs about whether something should be encoded and then (re)parsed as either *signed* or *unsigned* 64-bit integer to a bare minimum. Better yet: we SHOULD make sure these types of confusion and error cannot ever happen! That's one source for bugs & errors less to worry about! @@ -14,19 +14,19 @@ The solution there is to specify that all *document ids* used where "64 bit inte *That* means we will only accept **63-bit(!) document ids**. -This has consequences elsewhere in our codebases, of course, because folding a 256bit BLAKE3 identifier onto a potential 63-bit *document id* throws up a few obvious questions: +This has consequences elsewhere in our code-bases, of course, because folding a 256-bit BLAKE3 identifier onto a potential 63-bit *document id* throws up a few obvious questions: -- how would we like to fold then? With 64-bit it might be rather trivial, given the origin being a cryptographically strong 256bit hash number, to *fold* that number using a basic 64-bit XOR operation on all the 64bit QuadWords in that hash value: that would be 3 XOR operations and *presto*! 
+- how would we like to fold then? With 64-bit it might be rather trivial, given the origin being a cryptographically strong 256-bit hash number, to *fold* that number using a basic 64-bit XOR operation on all the 64-bit QuadWords in that hash value: that would be 3 XOR operations and *presto*! - Now that we require 6**3**bits, do we fold bit 63 after we have first folded to 64bit using the cheapest approach possible (XOR)? And how? XOR onto bit 0? Or do we choose to do something a little more sophisticated or even less CPU intensive, iff that were possible? So! Many! Choices! -- Or do we just discard those bits (№'s 63, 127, 195 and 255) from the original 256bit hash? Where's that paper I read that clipping/discarding bits from a (cryptographic) hash does make a rotten quality *folded hash*? + Now that we require 6**3**bits, do we fold bit 63 after we have first folded to 64-bit using the cheapest approach possible (XOR)? And how? XOR onto bit 0? Or do we choose to do something a little more sophisticated or even less CPU intensive, *iff* that were possible? So! Many! Choices! +- Or do we just discard those bits (№'s 63, 127, 195 and 255) from the original 256-bit hash? Where's that paper I read that clipping/discarding bits from a (cryptographic) hash does make a rotten quality *folded hash*? ---- **Update**: at least *older* Sphinx/Manticore Search documentation mentions *document ids* cannot be zero. This would mean a minimum legal range would then be $1 .. 2^{63}$ and our folding / id producing algorithm should provide for those -- a simple `if h = 0 then h = +x` won't fly. So I'm thinking along these lines: -- fold onto 64bit space quickly using a blunt 64-bit XOR. -- shift the bits we'ld loose that way so they form a relatively small integer. (Given the "id cannot be zero" condition, I'm counting bit0 among them.) For example: +- fold onto 64-bit space quickly using a blunt 64-bit XOR. +- shift the bits we'ld loose that way so they form a relatively small integer. (Given the "id cannot be zero" condition, I'm counting $bit_{0}$ among them.) For example: $$\begin{aligned} v &= 1 + bit_0 \mathbin{◮} 1 + bit_{63} \mathbin{⧩} (63 - 2) + bit_{64+63} \mathbin{⧩} (64 + 63 - 3) \\ & \qquad + \: bit_{2 \cdot 64+63} \mathbin{⧩} (2 \cdot 64 + 63 - 4) + bit_{3 \cdot 64+63} \mathbin{⧩} (3 \cdot 64 + 63 - 5) \\ @@ -34,19 +34,19 @@ v &= 1 + bit_0 \mathbin{◮} 1 + bit_{63} \mathbin{⧩} (63 - 2) + bit_{64+63} \ &= 1 + bit_0 \mathbin{◮} 1 + bit_{63} \mathbin{⧩} 61 + bit_{127} \mathbin{⧩} 124 + bit_{191} \mathbin{⧩} 187 + bit_{255} \mathbin{⧩} 250 \end{aligned}$$ - where $\mathbin{◮}$ and $\mathbin{⧩}$ are the logical shift *left* and *right* operators, respectively. Then mix this value $v$ into the *masked* 64bit folded hash (the mask throwing away both $bit_{0}$ and $bit_{63}$). The idea being that the $\mathbin{+} 1$ ensures our end value will not be zero(0), but XOR-ing that one in just like that would loose us $bit_0$ so we take that one, alongside all the bits at $bit_{63}$-equivalent positions in the quadwords of the BLAKE3 hash that we would loose by restricting the result to a *positive 64bit integer range* and give them a new place in a new value $v$ (at bit positions 1,2,3,4 and 5) and then mix that value into the folded value. + where $\mathbin{◮}$ and $\mathbin{⧩}$ are the logical shift *left* and *right* operators, respectively. Then mix this value $v$ into the *masked* 64-bit folded hash (the mask throwing away both $bit_{0}$ and $bit_{63}$). 
The idea being that the $\mathbin{+} 1$ ensures our end value will not be zero(0), but XOR-ing that one in just like that would loose us $bit_0$ so we take that one, alongside all the bits at $bit_{63}$-equivalent positions in the quadwords of the BLAKE3 hash that we would loose by restricting the result to a *positive 64-bit integer range* and give them a new place in a new value $v$ (at bit positions 1,2,3,4 and 5) and then mix that value into the folded value. If we were to worry about those bits skewing those bit positions in the final result, we can always go and copy/distribute them across the entire 62bit area ($bit_1 .. bit_{62}$) that we get to work with here in the end. Of course, we shouldn't put too much effort into this *folding* as we're not aiming for a *perfect hash* -- because we know we can't -- and any collision discovered at library sync/merge time will force us to adjust the 63-bit *document id* (with only 62 bits of spread as we now set $bit_0$ to 1 to ensure the *document id* is never zero). Maybe borrow some ideas from linear and quadratic hashing there, if we hit a collision? I.e. calculate a new *document id* "nearby", check if that one is still available and when it's not (another collision), loop (while incrementing our test step). That's a $2^{62}$ range of *document ids* to test, so we're practically guaranteed a free slot here. -> **Note**: note that the "may not be zero" requirement has taken another bit from our range power: while the number will be value in the integer value range $\lbrace 1 .. (2^{63} - 1) \rbrace$ thanks to two's complement 64 bit use, we are left with only $2^{(64-1-1)} = 2^{62}$ values we can land on as every *even number* has now become *off limits* thanks to our quick move to set $bit_0$ to 1. +> **Note**: note that the "may not be zero" requirement has taken another bit from our range power: while the number will be a value in the integer value range $\lbrace 1 .. (2^{63} - 1) \rbrace$ thanks to two's complement 64 bit use, we are left with only $2^{(64-1-1)} = 2^{62}$ values we can land on as every *even number* has now become *off limits* thanks to our quick move to set $bit_0$ to 1. > -> The other option we have been pondering was to take the 256-bit hash value and **add 1 arrhythmically** and then fold *that one*, only this would require the use of BigInt calculus and would be rather more costly to do at run-time. +> The other option we have been pondering was to take the 256-bit hash value and **add 1 arithmetically** and then fold *that one*, only this would require the use of BigInt calculus and would be rather more costly to do at run-time. > > Then there's the next idea: do this *plus one(1)* work on the (already $bit_{63}$-masked) quadwords of the original hash, just before we fold them. While that would work (since drop those $bit_{63}$ bits before we do this) we will end up with a folded value that spans *beyond* the `MAXINT` ($2^{63} - 1$) value, which means another correction is mandatory before we would be done. And that correction would land us in the same (or very similar) situation as we have now, unless we are willing to do this using *arithmetic modulo*. As division is still costly on modern hardware, I prefer the faster bit operations, expecting a negligible difference in quality of output of this "shortened hash"-style *document id*... 
> -> **Third alternative** then would deliver a 62-biit range like we have, but we would have the (dubious) benefit of having both *even* and *odd* document ids: mask off the top 2 bits of every quadword, mix those into a new value $v_2$, resulting in an 8-bit value and mix that one into the folded 62-bit number we now get using XOR around. Now correct for the non-zero requirement by simply adding 1 and you're done. +> **Third alternative** then would be to deliver a 62-bit range like we have, but we would have the (dubious) benefit of having both *even* and *odd* document ids: mask off the top 2 bits of every quadword, mix those into a new value $v_2$, resulting in an 8-bit value and mix that one into the folded 62-bit number we now get using XOR around. Now correct for the non-zero requirement by simply adding 1 and you're done. > > > Hm, *maybe* I like that one better and go with that, but I know it's all bike-shedding from here... > > diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/trigrams, n-grams, words, syllables, stopwords - considering the relevant atomic search unit.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/trigrams, n-grams, words, syllables, stopwords - considering the relevant atomic search unit.md index cc8e39ce8..a8c1f6313 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/trigrams, n-grams, words, syllables, stopwords - considering the relevant atomic search unit.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full Text Search - Exit Lucene.NET, Enter SOLR/trigrams, n-grams, words, syllables, stopwords - considering the relevant atomic search unit.md @@ -1,22 +1,22 @@ # Considering the relevant atomic search unit -- as one essay put it: "*3-grams are the sweet spot*" (*for practical implementations* of (inverse) search indexes): 2-grams (bigrams) are not *specific* enough, i.e. result in huge scan lists per index hit, while quadgrams (4-grams) explode your index size, taking up too much harddisk space for too little gain. +- as one essay put it: "*3-grams are the sweet spot*" (*for practical implementations* of (inverse) search indexes): 2-grams (bigrams) are not *specific* enough, i.e. result in huge scan lists per index hit, while quadgrams (4-grams) explode your index size, taking up too much hard-disk space for too little gain. - 1-gram equals 1 *syllable* or 1-gram equals 1 *word*? Indo-European languages (such as English) come with their words nicely pre-separated (most of the time) and *selectivity* improves for gram=word: you are hashing more *context* (or a longer Markov Chain) into a single n-gram hash, then when using gram=syllable. However, if that's a success, you're in some deep (word separation task) trouble for most Asian languages as they don't know about *whitespace characters*: Chinese is *hard* to chop into actual *words*. - lots of advice starts with chucking *stopwords* out. However, if you look at many published stopword lists, there's quite a few *useful* words in there, for example "*high*" or "*extra*" may be listed in the stopwords set, but when you're looking for "*high voltage*" specifically (e.g. Marx generators, but you didn't recall *that* particular name), then you don't want to be bothered with Fluke Multimeters For The Installation Professional, for example. 
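(Aside: a minimal sketch of the "*third alternative*" fold described in the hunk above -- assuming the 256-bit BLAKE3 hash is handed to us as four 64-bit quadwords; the function name and word order are illustrative only, not the actual Qiqqa implementation.)

```cpp
#include <array>
#include <cstdint>

// "Third alternative" fold sketch: mask off the top 2 bits of every quadword,
// XOR-fold the four 62-bit remainders, mix the collected top bits back in and
// add 1 so the resulting document id is never zero and always lands in the
// positive 63-bit-safe range 1 .. 2^62.
uint64_t fold_blake3_to_document_id(const std::array<uint64_t, 4>& h)
{
    constexpr uint64_t MASK62 = (uint64_t{1} << 62) - 1;

    uint64_t folded = 0;    // XOR-fold of the masked (62-bit) quadwords
    uint64_t top_bits = 0;  // the 2 top bits of each quadword -> one 8-bit value
    for (int i = 0; i < 4; i++)
    {
        folded ^= h[i] & MASK62;
        top_bits |= (h[i] >> 62) << (2 * i);
    }
    return (folded ^ top_bits) + 1;   // never zero, at most 2^62
}
```

Any collision that surfaces at library sync/merge time would still be resolved by the "probe a nearby id" loop sketched in the note above.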
So there's a few folks advocating **not to use stopword filters**: the classical example being the search for the Shakespeare play which contains the line "*To be or not be!*": that one would be *annihilated* by any stopword filter out there as it's *all stopwords*. However, *not ditching stopwords* makes your index bloated (the lesser problem) and probably *useless* due to the machine weighting the index statistics as *not specific enough to be useful* to be used in any sane query execution plan (as considered by the search engine machinery). -- 1-gram=syllable and the *word edge*: some peeps argue that the *position within the word* for a given syllable is useful info, which also helps to improve *selectivity*, e.g. 3gram "*ali*" should be *weighted* differently when expected as the start-of-word, whole-word or smack-in-the-middle-of-a-word: "_**Ali** Express_", "_**Ali**|as Black Bra_", "_He|**ali**|ng Crystals_". As this *ranking* should be replicable at the search query construction site for search 3grams to *match* index 3grams, these folks do suggest some *search query introspection* at query compilation time and marking anything that has equal-*or-lower* rank as a *potential hit*, with some rank-sorting heuristics added for improved results' quality, of course. +- 1-gram=syllable and the *word edge*: some peeps argue that the *position within the word* for a given syllable is useful info, which also helps to improve *selectivity*, e.g. 3-gram "*ali*" should be *weighted* differently when expected as the start-of-word, whole-word or smack-in-the-middle-of-a-word: "_**Ali** Express_", "_**Ali**|as Black Bra_", "_He|**ali**|ng Crystals_". As this *ranking* should be replicable at the search query construction site for search 3-grams to *match* index 3-grams, these folks do suggest some *search query introspection* at query compilation time and marking anything that has equal-*or-lower* rank as a *potential hit*, with some rank-sorting heuristics added for improved results' quality, of course. >You want the *whole-word* or *start-of-word* / *end-of-word* ranked n-gram hits to show up at the top of the list, when it looked like the user query was looking for that particular type, for instance. -- 1-gram=syllable and the *word edge*: others have mentioned another idea about *n-grams at the word edges* -- which incidentally would help *somewhat* when you're also in the business of *not* throwing out those pesky short stopwords: mark up every word with special border markers (which otherwise wouldn't show up in your search index feed), e.g. ``, like: ` ` --> 3grams: {``, ``, ``, ``, `` just happens to be a legal 3gram that way when you allow all words, so as you can search for "*a book*" vs. "*the book*" and get significantly different search answers: "*the book*" might, depending on other context, place more emphasis on results which mention "*the bible*" or "*the koran*", for example. -- *Attributed 3-grams* are another *quite interesting* proposition I encountered while reading up on search/indexing research: the idea there was to take a 3-gram and *make it more specific* by turning it in a sort of 3.5-gram: 4-grams being too specific, i.e. producing *too many different hash values* (about 4 billion), while 3-grams produce a search space of about $256^3 = 2^{24} \approx 16 \times 10^6$ hash values. 
Their idea was to take the input, chop it into 3-grams, *but then take the tail following the **bi**gram and hash it to a single code that turns this into a trigram*, in order to help search indexes produce better rankable / more specific index hits. For example "*James*" would be *grammed* as { `Ja+(mes)`, `Jam`, `am+(es)`, `ame`, `mes` } where the `(...)` parts are first hashed into a single symbol before further 3gram processing is done, e.g. --> { `Ja7`, `Jam`, `am#`, `ame`, `mes`}. - Note that *the bare 3grams are also included* in the set: the "*attributed*" 3grams are only expected to help when the search query compiler manages to produce exactly the same *attributed* 3grams. Which would only happen when the user searches for `James`, but not when he looks for `*ames` (where "`*`" is a wildcard). +- 1-gram=syllable and the *word edge*: others have mentioned another idea about *n-grams at the word edges* -- which incidentally would help *somewhat* when you're also in the business of *not* throwing out those pesky short stopwords: mark up every word with special border markers (which otherwise wouldn't show up in your search index feed), e.g. ``, like: ` ` --> 3-grams: {``, ``, ``, ``, `` just happens to be a legal 3-gram that way when you allow all words, so as you can search for "*a book*" vs. "*the book*" and get significantly different search answers: "*the book*" might, depending on other context, place more emphasis on results which mention "*the bible*" or "*the koran*", for example. +- *Attributed 3-grams* are another *quite interesting* proposition I encountered while reading up on search/indexing research: the idea there was to take a 3-gram and *make it more specific* by turning it in a sort of 3.5-gram: 4-grams being too specific, i.e. producing *too many different hash values* (about 4 billion), while 3-grams produce a search space of about $256^3 = 2^{24} \approx 16 \times 10^6$ hash values. Their idea was to take the input, chop it into 3-grams, *but then take the tail following the **bi**gram and hash it to a single code that turns this into a trigram*, in order to help search indexes produce better rankable / more specific index hits. For example "*James*" would be *grammed* as { `Ja+(mes)`, `Jam`, `am+(es)`, `ame`, `mes` } where the `(...)` parts are first hashed into a single symbol before further 3-gram processing is done, e.g. --> { `Ja7`, `Jam`, `am#`, `ame`, `mes`}. + Note that *the bare 3-grams are also included* in the set: the "*attributed*" 3-grams are only expected to help when the search query compiler manages to produce exactly the same *attributed* 3-grams. Which would only happen when the user searches for `James`, but not when he looks for `*ames` (where "`*`" is a wildcard). - Additional / similar *attributed n-gram* approaches I've seen mention mixing start-of-word and end-of-word marker bits into the 3-gram before it is converted to a search hash index slot number. - Ditto for some folks suggesting *improved results* when you mix 3-grams with 4-grams, 5-grams and/or word-grams, where the latter are all hashed and then mapped onto the 3-gram sized search space. The *3.5-grams* I mentioned above (*my word, not found as such in the research publications*) are *my interpretation* of the latter couple of items, which is taking your *whatever*-grams, producing **unique hash values** for them (so their specificity/quality remains unchanged) and them *folding* / *mapping* them onto a $N$-sized index space using a *mapping/folding operator*, e.g. 
*modulo*, where $$h = H(ngram) \bmod n \enspace | \enspace 0 \le n < {256^p} \wedge 3 \le p < 4$$ - and $p$ may be larger than 3 to facilitate (mostly) unique mapping of those n-grams into the search index. For example, when we decide 2 bits extra is enough / a good practical compromise, then the modulo value would be $2^{26}$ rather than 3gram's regular $2^{24}$ and we end up with a search space $\approx 64 \times 10^6$ index slots. For a yet-collision-less index at, say, 8 bytes per slot, that would naively cost $64*8 = 512$ MBytes, i.e. half a gig of disk or RAM space, which is still within sane limits for modern user-level desktop hardware. This would then be a "$3 \frac 1 4$-gram" or "*3.25-gram*". + and $p$ may be larger than 3 to facilitate (mostly) unique mapping of those n-grams into the search index. For example, when we decide 2 bits extra is enough / a good practical compromise, then the modulo value would be $2^{26}$ rather than 3-gram's regular $2^{24}$ and we end up with a search space $\approx 64 \times 10^6$ index slots. For a yet-collision-less index at, say, 8 bytes per slot, that would naively cost $64*8 = 512$ MBytes, i.e. half a gig of disk or RAM space, which is still within sane limits for modern user-level desktop hardware. This would then be a "$3 \frac 1 4$-gram" or "*3.25-gram*". - *Codepoints versus character bytes*: of course, when you consider non-U.S. American English languages too, i.e. you consider *international language space*, you're going to talk *Unicode*. Which comes as *codepoints*, rather than old-skool ANSI/ASCII *bytes*. And then you have to reconsider the question: "*what is my atomic character unit that I'm constructing my n-gram from?*" Some folks just don't bother and rip *raw bytes*, stating that UTF8 will take care of the rest. Others argue this blunt way of thinking reduces selectivity and therefor *quality* of the search results, both in *performance* and *matching* (ranking, selectivity, ...) as you would now find *half-characters* or even *quarter-characters* treated as full character citizens in your trigram-based search index hashes, so they propose to use *Unicode codepoints* for the trigrams instead. The extra argument for that approach is that it could potentially help improve search quality for Chinese and similar non-word-delimited languages as those usually have a larger *alphabet* than English/European languages, which *might* offset the possible quality gain you are expected to get from including start/end-of-word markers as part of your trigramming action. (Remember those "`<`" and "`>`"?) diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full-Text Search Engines.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full-Text Search Engines.md index 4cca8459d..0a1b2e04c 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Full-Text Search Engines.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Full-Text Search Engines.md @@ -7,7 +7,7 @@ https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/ - you start looking when you feel agonized. 
+ for me, it's WPF/XAML: cute but way too verbose and hard to dive into (not just scratching the surface but getting things done in a nice looking, well-behaved way with attention to details) - + there's already questions about the cross-platform availability/ability of Qiqqa ([#215](https://github.com/jimmejardine/qiqqa-open-source/issues/215)) and he's not the only one: though I am largely Windows-based in my own work, that's not forever and besides, I do realize that if we ever want to climb out/up that we need a Linux base at least. And **I** am not going to port the codebase in .NET+WPF to Linux as it is: porting WPF would mean re-doing the UI anyway as *there is no WPF on Linux or Apple* and then there were my thoughts on separating functionality (local server) from UI (local client) already from before for purposes of making UI dev work easier for me (as in: attempt moving the biatch to a well-supported cross-platform environment that I want to be in: HTML5+CSS+JS (not Qt, GTK or what-have-you -- not gonna use those anywhere else, so why spend the effort on yet another UI framework I am not going to use if I can help it? Besides, go where all the action is. That's web. Not Qt, GTK, WPF, whatever.) + + there's already questions about the cross-platform availability/ability of Qiqqa ([#215](https://github.com/jimmejardine/qiqqa-open-source/issues/215)) and he's not the only one: though I am largely Windows-based in my own work, that's not forever and besides, I do realize that if we ever want to climb out/up that we need a Linux base at least. And **I** am not going to port the code-base in .NET+WPF to Linux as it is: porting WPF would mean re-doing the UI anyway as *there is no WPF on Linux or Apple* and then there were my thoughts on separating functionality (local server) from UI (local client) already from before for purposes of making UI dev work easier for me (as in: attempt moving the biatch to a well-supported cross-platform environment that I want to be in: HTML5+CSS+JS (not Qt, GTK or what-have-you -- not gonna use those anywhere else, so why spend the effort on yet another UI framework I am not going to use if I can help it? Besides, go where all the action is. That's web. Not Qt, GTK, WPF, whatever.) + XAML vs HTML5 ## Upgrading Lucene.NET to the bleeding edge @@ -26,7 +26,7 @@ The bottom line question is always: is it good enough? and alive enough? It's go What spurred me into writing this (and veering off course where it comes to Lucene.NET): I had another "great" experience with the NuGet package manager, which took an entire valuable day (instead of some dumbed down night hours) to recover from as it turned out to be unsolvable save for an **entire re-install of my entire Visual Studio rig**. 
\ -What that did was remind me that Visual Studio has always been a great IDE for me, but some of the bits in the dev flow (NuGet combined with how you can fiddle / edit / tweak the installed versions) still feel as convoluted as back in the day when I was doing C & C++ : I could never see the sanity in the reasoning to use XML in any environment where it might be directly human-facing (me hacking project files, for example) which leaves using XML in a machine-machine environment, which is itself insane again as it's a huge overhead (open/close tags repeating) burdening network, lexers (parsers) and not being whitespace agnostic where it counts anyway (oh, that's a human-facing bit there), so the presence of any XML always was a sign of corporeal/rate insanity to me since the day it was introduced: bloody useless for both types of sender *and* receiver. Anyhooo. Some bits make me ... uncomfortable. +What that did was remind me that Visual Studio has always been a great IDE for me, but some of the bits in the dev flow (NuGet combined with how you can fiddle / edit / tweak the installed versions) still feel as convoluted as back in the day when I was doing C & C++ : I could never see the sanity in the reasoning to use XML in any environment where it might be directly human-facing (me hacking project files, for example) which leaves using XML in a machine-machine environment, which is itself insane again as it's a huge overhead (open/close tags repeating) burdening network, lexers (parsers) and not being whitespace agnostic where it counts anyway (oh, that's a human-facing bit there), so the presence of any XML always was a sign of corporeal/rate insanity to me since the day it was introduced: bloody useless for both types of sender *and* receiver. *Anyhoo.* Some bits make me ... uncomfortable. One of those bits is the migration of the old Lucene API interface code to the new Lucene.NET v4 releases / v5 bleeding edge: somehow I haven't got around to it and that's purely down to /hunch/ AFAICT. @@ -325,7 +325,7 @@ https://trends.google.com/trends/explore?cat=31&date=2011-07-16%202020-08-16&q=c ![](assets/google-trends-cef-et-al2.png) If these trends are anything to go by, there's not much interest in CEFGlue, which is cross platform, while there's much more interest in CEFSharp, which is Windows-only. Also note that the C# CEF wrappers get much less attention than the straight C/C++ embedded browser component CEF, which is the core of Electron and several others. 
Now to put the C# components in perspective, here's the trend graph with Electron included (and we drop CEFGlue to make room): ![](assets/google-trends-cef-et-al3.png) -and to show that we (or rather: Google) did not confuse that one with the physical entity called an electron, here's that electron search filter redone so it's very specific: note the correlation in the upswitng since 2014 for both, while this latter trend query of course delivers fewer hits over the entire period: +and to show that we (or rather: Google) did not confuse that one with the physical entity called an electron, here's that electron search filter redone so it's very specific: note the correlation in the upswing since 2014 for both, while this latter trend query of course delivers fewer hits over the entire period: ![](assets/google-trends-cef-et-al4.png) https://trends.google.com/trends/explore?cat=31&date=2011-07-16%202020-08-16&q=winforms,wpf%20.net,%2Fg%2F11bw_559wr,cefsharp,cef diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Logging - Local (per binary) or Centralized (log server).md b/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Logging - Local (per binary) or Centralized (log server).md index 0ed7b4373..87f872727 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Logging - Local (per binary) or Centralized (log server).md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Logging - Local (per binary) or Centralized (log server).md @@ -8,7 +8,7 @@ Pros and cons: * Pro Local: no working *localhost* socket connect and *listening server* required for the log lines to land on disk and persist beyond crash / termination. * This implies that the individual parts (applications / binaries / processes) can easily be (re-)used in an environment where the *listening log server* is not (yet) present, aiding testing and hackish custom activities involving only a few members of the entire crew. * No worries about potential intermittent errors where the *listening log server* is dropped/killed, even if only temporarily: as we DO NOT use a *log server*, we don't have the added (minimal) risk of a *permanently live* local socket and working log server code. - * Ditto for risks about laggard or severely bottlenecked or throttled (*log server*) processes, while our application process code is running full tilt and swamping the CPU cores. Or other ways you can get at a severely overburdened system where important processes start to cause significant waits due to CPU core unavailability. + * Ditto for risks about laggard or severely bottle-necked or throttled (*log server*) processes, while our application process code is running full tilt and swamping the CPU cores. Or other ways you can get at a severely overburdened system where important processes start to cause significant waits due to CPU core unavailability. We can manage thread priority *locally* (within our current process) and chances are that no matter the scheduler overload, wee'll get a decent *logging thread performance in step with our main, log-generating process(es)*. 
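(Aside: a minimal sketch of the local, per-binary option argued for in the hunk above -- every process owns its own log file plus a dedicated writer thread whose priority is entirely under our own control; class and member names are made up for illustration, this is not an actual Qiqqa component.)

```cpp
#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

// Local per-process logger: no localhost socket and no listening log server
// has to be alive for log lines to land on disk and persist.
class LocalLogger {
public:
    explicit LocalLogger(const std::string& path)
        : out_(path, std::ios::app), writer_([this] { run(); }) {}

    ~LocalLogger() {
        { std::lock_guard<std::mutex> lock(mtx_); done_ = true; }
        cv_.notify_one();
        writer_.join();
    }

    void log(std::string line) {
        { std::lock_guard<std::mutex> lock(mtx_); queue_.push_back(std::move(line)); }
        cv_.notify_one();
    }

private:
    void run() {
        // This thread's priority can be tuned right here via the platform API
        // (e.g. SetThreadPriority() on Win32) -- our own process, our own call,
        // no external log server process to keep in step with.
        std::unique_lock<std::mutex> lock(mtx_);
        while (!done_ || !queue_.empty()) {
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            while (!queue_.empty()) {
                out_ << queue_.front() << '\n';
                queue_.pop_front();
            }
            out_.flush();  // flush so lines survive a crash shortly after
        }
    }

    std::ofstream out_;
    std::mutex mtx_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
    bool done_ = false;
    std::thread writer_;
};
```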
diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Simple - One process per programming language or should we divide it up further into subsystems.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Simple - One process per programming language or should we divide it up further into subsystems.md index e1c19e488..870096ac4 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Simple - One process per programming language or should we divide it up further into subsystems.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Simple - One process per programming language or should we divide it up further into subsystems.md @@ -15,7 +15,7 @@ The grounds for this being that we know we have some *huge* PDFs, which will exp While we do realize this would increase the inter-process data transfer costs significantly as we'll be pumping large quantities of PDF page data (image and text) across to other subsystems (SOLR or FT5/manticore), separating this subsystem into its own process would mean we can expect a higher overall system reliability from scratch: the databases and other processes can remain operational while aborted/crashed/terminated PDF/ OCR processes are restarted under the hood (by the Qiqqa Monitor application). -Meanwhile, critical path user facing data comms will have the same or comparable costs: one of the CPU-heaviest, data-intensive tasks is rendering PDF pages: the UX is, of course, immediately impacted by the performance of that subtask. It does not matter for total system latency and costs whether subsystem A (the previously envisioned database+ monolith) or subsystem B (PDF page renderer + cache) delivers the image data to show on screen. Ditto for the equally important text+position PDF page info stream: in the new design it will be still 1(one) localhost TCP transfer away. +Meanwhile, critical path user facing data comms will have the same or comparable costs: one of the CPU-heaviest, data-intensive tasks is rendering PDF pages: the UX is, of course, immediately impacted by the performance of that sub-task. It does not matter for total system latency and costs whether subsystem A (the previously envisioned database+ monolith) or subsystem B (PDF page renderer + cache) delivers the image data to show on screen. Ditto for the equally important text+position PDF page info stream: in the new design it will be still 1(one) localhost TCP transfer away. Added costs are to be found in the background batch processes (PDF page text copied over to the cache and FTS systems) as those now will need to communicate through localhost sockets or memory-mapped I/O instead of a simple pointer reference to internal heap memory. @@ -29,7 +29,7 @@ Nevertheless, I think it would be good to think of "*one process per architectur ## Post Scriptum -Each of the subsystems can be its own webserver then for when we code the UI as a web browser: +Each of the subsystems can be its own web-server then for when we code the UI as a web browser: - main Qiqqa metadata database -> static web content templates + metadata filled in the views = web pages. - PDF renderer / processor -> cached document content, serves as a CDN when it comes to displaying rendered PDF document pages and text(ified) content. 
diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Storing large (OCR) data in a database vs. in a directory tree - and which db storage lib to use then.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Storing large (OCR) data in a database vs. in a directory tree - and which db storage lib to use then.md index dda8daa28..3f775a8f6 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Storing large (OCR) data in a database vs. in a directory tree - and which db storage lib to use then.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/General Design Choices/Storing large (OCR) data in a database vs. in a directory tree - and which db storage lib to use then.md @@ -9,11 +9,11 @@ Qiqqa stores the text extracted from the PDF documents it manages into separate - some of these files span 20 pages (or less, if they're the last one that runs until the end of the document) and these are relatively large: every word in the text is accompanied by a ASCII-text-printed (think `printf %f` if that helps) coordinate *box* (4 coordinates) indicating the *position* of the word on the page. > This coordinate box is very important: it is used (among other things) to render an overlay over the rendered page so you can use your mouse and select a chunk of text on the page by click&drag. - > x + > > When the extractor has done a *proper job* this will look like you're actually selecting the text you're looking at and all is well in the world. > > When, however, there's been a slight misunderstanding, a tiny b0rk b0rk b0rk of sorts, then it can happen that you're observing the reddish coloured selection area happening in the page's whitespace more or less 'nearby'. This, incidentally, is due to Qiqqa using two different subsystems for PDF rendering and processing: SORAX does the page render job, while an old, minimally hacked, muPDF tool does the text & position decoding and extraction legwork. These two fellows *may* have a different opinion about how the current page is supposed to display, resulting in such "why is my selection off to the side or way above my actual text?!" user questions. - > x + > > > We do *intentionally* neglect the added fact that both subsystems have quite different ideas about the *words* in the text as well, which can be observed when you select text in a PDF and the selection area acts all kinds of 'choppy and fragmented into tiny vertical slices' when you drag your mouse over it to select a chunk of text. This type of behaviour is highly undesirable but requires a complete overhaul of the PDF processing and rendering subsystems. That, however, is a story for another time. The box coordinates account, at about 8 characters per value plus a separating space, for $(8+1) \times 4 = 36$ characters peer content *word*. Each word is also separated only its own line, but that doesn't matter: NewLine (NL) vs. Space (SP), it's all the same size. @@ -55,14 +55,14 @@ Default hOCR, however, does the bbox (*bounding box*) coordinate thing *per char > > > > That way we should be able to have both: binary storage of hOCR-alike data, including raw bbox coordinates (so reduced parse costs on repeated reads), while external users (and our own FTS/SOLR search engine subsystem perhaps as well!) can use statements like `SELECT as_hOCR(text_extract) FROM ocr_table ...` to get easy to use hOCR-formatted HTML pages. 
> > -> > > Which leaves the images (diagrams, formulas, etc.) embedded in the original PDF: we *should* have separated those out at the `ocr` file / database storage level already. Thus we can have that `ocr` store as the backend of a regular web server, which can then produce every PDF as hOCR/HTML web page. +> > > Which leaves the images (diagrams, formulas, etc.) embedded in the original PDF: we *should* have separated those out at the `ocr` file / database storage level already. Thus we can have that `ocr` store as the back-end of a regular web server, which can then produce every PDF as hOCR/HTML web page. > > > Going nuts with this, that same *web server* can use the same (or similar) mechanisms to produce crisp fully rendered page images *alongside* as an alternative for both thee Qiqqa UI and others to see the original PDF, (pre)rendered as high rez page images. The idea there being that we need a page image renderer anyway (to replace the SORAX one) and it being a *smart* choice to have to page images *cached* and (at least partly) pre-rendered for a smoother viewing and PDF document scrolling experience, both inside and outside Qiqqa. > > > > > > #### PDF page renders: caching the images - for how long and at what resolutions? > > > > > > For regular Qiqqa use we need two resolutions: one for side panel thumbnails and one for full image page reader views. Given a 4K monitor screen and a 'full-screened' application window, that'ld be worst-case 4K wide page images for the reader. -> > > As this gets hefty pretty quickly, this should be treated in a rather more **on demand dynamic way**: why not render these on demand only and at the requested resolution if no larger rez is available yet in the cache? While a minimum rez for any request would perhaps be equivalent to 120..200ppi, so we can always serve basic views and thumbnails of any size by rescaling and serving viewers and OCR tools (tesseract!) as needed with a minimum amount of (costly) re-rendering activity. -> > > Also restrict re-rendering decisions to 10..20% size increases at minimum to reduce the number of re-render actions; maybe go even as far as to state that any re-render will increase the render size by 50% at a minimum, so any further demand within that range gets a rescale of that image or the image itself, iff the caller can provide more suitable rescaling facilities itself (an option that would be useful for web browsers, f.e.). +> > > As this gets hefty pretty quickly, this should be treated in a rather more **on demand dynamic way**: why not render these on demand only and at the requested resolution if no larger rez is available yet in the cache? While a minimum rez for any request would perhaps be equivalent to 120..200 ppi, so we can always serve basic views and thumbnails of any size by rescaling and serving viewers and OCR tools (tesseract!) as needed with a minimum amount of (costly) re-rendering activity. +> > > Also restrict re-rendering decisions to 10..20% size increases at minimum to reduce the number of re-render actions; maybe go even as far as to state that any re-render will increase the render size by 50% at a minimum, so any further demand within that range gets a rescale of that image or the image itself, *iff* the caller can provide more suitable rescaling facilities itself (an option that would be useful for web browsers, f.e.). > > > > > > #### That leaves the question of "okay, fine, we cache those images. For how long?!" 
> > > diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/Considering IPC methods - HTTP vs WebSocket, Pipe, etc.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/Considering IPC methods - HTTP vs WebSocket, Pipe, etc.md index a4cd00cd8..067eb27ff 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/Considering IPC methods - HTTP vs WebSocket, Pipe, etc.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/Considering IPC methods - HTTP vs WebSocket, Pipe, etc.md @@ -6,17 +6,17 @@ But before we go there, let's backpedal a bit and look at the bigger picture of ## Old Qiqqa -'*Old Qiqqa*' is a near-monolithic application that way: it has a UI (which in WPF/.NET is served by a single thread by (Microsoft) design), *business logic*, a database (SQLite) and a FTS Engine (Full Text Search Engine) through Lucene.NET, all as *libraries* (Windows DLLs). Then there's a couple of commercial libraries used as well: Intragistics for parts of the UI and some PDF metadata (*page count* extraction) and (*now defunct*) SORAX for PDF page image rendering. All *communications* among these is via library function calls as usual so call overhead is cheap (the biggest costs there are data marshalling in various places and some thread context switching). +'*Old Qiqqa*' is a near-monolithic application that way: it has a UI (which in WPF/.NET is served by a single thread by (Microsoft) design), *business logic*, a database (SQLite) and a FTS Engine (Full Text Search Engine) through Lucene.NET, all as *libraries* (Windows DLLs). Then there's a couple of commercial libraries used as well: Infragistics for parts of the UI and some PDF metadata (*page count* extraction) and (*now defunct*) SORAX for PDF page image rendering. All *communications* among these is via library function calls as usual so call overhead is cheap (the biggest costs there are data marshalling in various places and some thread context switching). '*Old Qiqqa*' also has a few external applications it uses for specific tasks: `QiqqaOCR.exe` is a C#/.NET application using SORAX and `tesseract.net` (a .NET wrapper for `tesseract` v3, the Open Source OCR engine) to help with PDF text extraction and PDF OCR, when the text layer is not present in the PDF. `QiqqaOCR.exe` also uses an old, patched, `pdfdraw` application from Artifex (MuPDF v1.11 or there-about) for the text extraction work itself, when no OCR is required. Qiqqa does this through this external application to improve overall user-facing application stability as these tools are/were quite finicky and brittle. -v83 Qiqqa is still '*Old Qiqqa*' that way, but has been slowly moving all costly business logic out of the UI thread to improve UI responsiveness. A *synchronous* to *asynchronous* event handling transition in the application, which is generally cause for reams of bugs to surface -- as has happened to the various v83 experimental releases. It's never a nice story, but it had to happen as a prelude to transitioning Qiqqa to a '*New Qiqqa*': this phase showed various issues in the Qiqqa codebase that will also cause difficulties when we pull the monolith apart into several processing chunks. +v83 Qiqqa is still '*Old Qiqqa*' that way, but has been slowly moving all costly business logic out of the UI thread to improve UI responsiveness. 
A *synchronous* to *asynchronous* event handling transition in the application, which is generally cause for reams of bugs to surface -- as has happened to the various v83 experimental releases. It's never a nice story, but it had to happen as a prelude to transitioning Qiqqa to a '*New Qiqqa*': this phase showed various issues in the Qiqqa code-base that will also cause difficulties when we pull the monolith apart into several processing chunks. ### Important Note While generally it is easy to invoke external applications on Windows via `execve()` et al (Windows **does not provide a `fork()` API of any kind!**), redirecting stdin/stdout/stderr is a little more involved and pretty important in our case: -- `QiqqaOCR` spits out all kinds of logging info via stdout/stderr, which has to be filed in the logfiles using qiqqa's own log4net logging library. No problem so far. +- `QiqqaOCR` spits out all kinds of logging info via stdout/stderr, which has to be filed in the log-files using qiqqa's own log4net logging library. No problem so far. - `QiqqaOCR` -- or another external application, e.g. bleeding edge MuPDF `mudraw` -- was initially designated as the tool to replace the SORAX render library: *external tool* instead of *library* so we don't have to hassle with marshalling *image data* across the native-to-.NET DLL boundary, *plus* we hoped it would allow us to run 64-bit modern applications as part of the Qiqqa 'back-end'. This was considered a good first step as part of the *transitional phase* of Qiqqa from Old to New. @@ -308,7 +308,8 @@ Flatbuffers/Protocolbuffers/Msgpack have been considered as well and deemed *too > - **WebSockets** > The WebSocket protocol allows for constant, bi-directional communication between the server and client. > - > \[...] + > \[...\] + > > ### Conclusion > > In the current state of the web, short and long-polling have a much higher bandwidth cost than other options, but may be considered if supporting legacy browsers. Short-polling requires estimating an interval that suits an application’s requirements. Regardless of the estimation’s accuracy, there will be a higher cost due to continuously opening new requests. If using HTTP/1.1, this results in passing headers unnecessarily and potentially opening multiple TCP connections if parallel polls are open. Long-polling reduces these costs significantly by receiving one update per request. diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/IPC - Qiqqa Monitor.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/IPC - Qiqqa Monitor.md index 9b966941e..bfa72749f 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/IPC - Qiqqa Monitor.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/IPC - Qiqqa Monitor.md @@ -11,7 +11,7 @@ While we *could* choose to use a *server push* mechanism, we opt, instead, to us Initially, I had some qualms about this, as this implies we would add *requesting overhead* that way, but a **pull**-based approach is more flexible: - *server push* would either need a good timer server-side and a fast pace to ensure the client receives ample updates under all circumstances. Of course, this can be made *smarter*, but that would add development cost. 
(See also the next item: multiple clients) -- *multiple clients* for the performance data is a consideration: why would we *not* open up this part of the Qiqqa data, when we intend to open up everything else to user-scripted and other kinds of direct external access to our components? **pull**-based performance data feeds would then automatically pace to each client's 'clock' without the need to complicate the server-side codebase. +- *multiple clients* for the performance data is a consideration: why would we *not* open up this part of the Qiqqa data, when we intend to open up everything else to user-scripted and other kinds of direct external access to our components? **pull**-based performance data feeds would then automatically pace to each client's 'clock' without the need to complicate the server-side code-base. - *pace* must be *configured* for *server push* systems if you don't like the current data stream, while we don't have to do *anything* server-side when we go for **pull**-based data feeds: if the client desires a faster (or slower) pace, it can simply and *immediately* attain this by sending more (or fewer) *data requests*. ## Which data to track server-side diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/IPC - transferring and storing data.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/IPC - transferring and storing data.md index d0afda9e0..144682406 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/IPC - transferring and storing data.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/IPC/IPC - transferring and storing data.md @@ -21,9 +21,9 @@ What does that mean? Then there's yet another tool called `pdfdraw.exe` which is an *old* (v1.11 or there-abouts) mupdf tool which has been patched for Qiqqa's purposes to spit out *extracted text content* for a given PDF. - > Incidentally, "text extraction" is more like modern hCR output production: the eextracted text is a mix of both *text* and *in-page coordinates for each character/word* as Qiqqa needs the latter to properly *position* the text as a hidden overlay over the rendered PDF when the user is viewing/annotating/reviewing a PDF in Qiqqa: the coordinates are needed so Qiqqa can discover which bit of text you wish to work on when clicking and dragging your mouse over the page image, produced by the SORAX render library -- which simply spits out an image, no text coordinates or whatsoever: that stuff is not always present in a PDF and is the reason why we need to OCR many PDFs: to get at the actual *text* **and** the text *positions* throughout each page. + > Incidentally, "text extraction" is more like modern hOCR output production: the extracted text is a mix of both *text* and *in-page coordinates for each character/word* as Qiqqa needs the latter to properly *position* the text as a hidden overlay over the rendered PDF when the user is viewing/annotating/reviewing a PDF in Qiqqa: the coordinates are needed so Qiqqa can discover which bit of text you wish to work on when clicking and dragging your mouse over the page image, produced by the SORAX render library -- which simply spits out an image, no text coordinates or whatsoever: that stuff is not always present in a PDF and is the reason why we need to OCR many PDFs: to get at the actual *text* **and** the text *positions* throughout each page. > - > Anyhoo... back to the subject matter: + > *Anyhoo*... 
back to the subject matter: The current `pdfdraw.exe -tt` text extraction run and/or QiqqaOCR will produce a set of *cache* files in the Qiqqa `/ocr/` directory tree, indexed by the PDF document Qiqqa fingerprint. (In SHA1B format: the "b0rked" SHA1 checksum. See elsewhere (link:XXXXXXXXXX) for info about this baby.) @@ -244,6 +244,6 @@ The added cost is in the need to produce two *converter tools*: one for reading - + http://swig.org/Doc4.0/SWIGDocumentation.html#Introduction diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/MiddleWare/And here I was - worrying over the binary .NET dumps in our databases (annotations, etc.).md b/docs-src/Notes/Progress in Development/Considering the Way Forward/MiddleWare/And here I was - worrying over the binary .NET dumps in our databases (annotations, etc.).md index b051803d1..f3ae917b8 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/MiddleWare/And here I was - worrying over the binary .NET dumps in our databases (annotations, etc.).md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/MiddleWare/And here I was - worrying over the binary .NET dumps in our databases (annotations, etc.).md @@ -18,7 +18,7 @@ Here ya go, Sunny Jim! No C/C++, but protocol documentation (and the hint somebo - **kaboom! 🎉** - https://docs.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide a.k.a. "Deserialization risks in use of `BinaryFormatter` and related types" - https://github.com/pwntester/ysoserial.net (heh!) -- [# [MS-NRBF]: .NET Remoting: Binary Format Data Structure](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-nrbf/75b9fe09-be15-475f-85b8-ae7b7558cfe5?redirectedfrom=MSDN) -- *paydirt!* This page carries all the format documentation as PDFs. +- [# [MS-NRBF]: .NET Remoting: Binary Format Data Structure](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-nrbf/75b9fe09-be15-475f-85b8-ae7b7558cfe5?redirectedfrom=MSDN) -- *pay-dirt!* This page carries all the format documentation as PDFs. - https://stackoverflow.com/questions/3052202/how-to-analyse-contents-of-binary-serialization-stream/30176566#30176566 diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Moving Away From Windows-only UI/Moving away for Windows-bound UI (WPF) to HTML - feasibility tests with CEF+CEFSharp+CEFGlue+Chromely.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Moving Away From Windows-only UI/Moving away for Windows-bound UI (WPF) to HTML - feasibility tests with CEF+CEFSharp+CEFGlue+Chromely.md index 135dc18a8..70b4b80fd 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Moving Away From Windows-only UI/Moving away for Windows-bound UI (WPF) to HTML - feasibility tests with CEF+CEFSharp+CEFGlue+Chromely.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Moving Away From Windows-only UI/Moving away for Windows-bound UI (WPF) to HTML - feasibility tests with CEF+CEFSharp+CEFGlue+Chromely.md @@ -18,9 +18,9 @@ While this surely will smell like "[Second System Syndrome](https://en.wikipedia - ASP.NET to drive a HTML/CSS UI then? (Electron.NET) not my cup of tea. - Xamarin / Blazor / ...: mobile-focused: not geared towards desktop-level complexity and the longevity concern: they come and they go. - [MAUI](https://devblogs.microsoft.com/dotnet/introducing-net-multi-platform-app-ui/)? WPF regurgitated. Might have the longevity, finally, but no fun. 
- - electron: not a UI per se, but a target when doing X-plat work. Drawback for Qiqqa: comes with NodeJS backend, which we do not need unless I ditch the C# codebase utterly. + - electron: not a UI *per se*, but a target when doing X-plat work. Drawback for Qiqqa: comes with NodeJS back-end, which we do not need unless I ditch the C# code-base utterly. - electron.NET: that's electron + ASP.NET: no fun as I would be moving from WPF to ASP.NET. Motivation. - - Chromely: viable as it's CEF plus C# backend. We'll have a look at this one (see below) + - Chromely: viable as it's CEF plus C# back-end. We'll have a look at this one (see below) ## Having a look at Chromely :: feasibility for Qiqqa (and a few words about [electron](https://www.electronjs.org/) et al) @@ -32,7 +32,7 @@ This is \[one of the reasons] why I always want source code level access to my l ### Chromely and the others: [electron](https://www.electronjs.org/), [NW.js](https://nwjs.io/), ... -While I was considering Chromely, I also looked at electron et al. Only later did I reconsider [NW.js](https://nwjs.io/) as possibly more viable, when I found out that [electron](https://www.electronjs.org/) basically is a webbrowser with a ([NodeJS](https://nodejs.org/en/)) webserver jammed together in a single package, when I started looking at the answers they provide for this question: +While I was considering Chromely, I also looked at electron et al. Only later did I reconsider [NW.js](https://nwjs.io/) as possibly more viable, when I found out that [electron](https://www.electronjs.org/) basically is a web-browser with a ([NodeJS](https://nodejs.org/en/)) web-server jammed together in a single package, when I started looking at the answers they provide for this question: **How does the "backend layer" communicate with the "frontend layer" (i.e. CEF i.e. Chromium browser core)?** diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Qiqqa library storage, database, DropBox (and frenemies), backups and backwards compatibility.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Qiqqa library storage, database, DropBox (and frenemies), backups and backwards compatibility.md index b59888064..21a0eecd1 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Qiqqa library storage, database, DropBox (and frenemies), backups and backwards compatibility.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Qiqqa library storage, database, DropBox (and frenemies), backups and backwards compatibility.md @@ -20,12 +20,16 @@ that the entire database and PDF store must be *processed* to relate the old MD5 New documents would not get a MD5 hash, or it would not be considered unique anymore, anyway), but everyone would be named using SHA256 and the database table in SQlite would need to be changed to use that SHA256 as a unique key. -Hence the thought is, here and now, to keep the old SQLite database table as-is, in case you want to migrate *back* to an older version perhaps, but to copy/transform it into a new table, where everything is using the new SHA256 key. **Plus** you'ld need a lookup table where MD5 is mapped to SHA256 and vice versa. ^[Hm, maybe we could combine that the 'document grouping' feature I want to add to Qiqqa so I can 'bundle' PDFs for chapters into books and such-like. MD5 hash collisions would just be another grouping/mapping type. There's also ^[to be, not implemented yet!] 
the decrypting and cleaning up and PDF/A text embedding of existing PDFs, resulting in more PDFs with basically the same content, but a different internal *shape* and thus different hash key. All these should not land in a single table as they clearly have slightly different structure and widely different semantics, but it all means the same thing: there's some database rework to be done! +Hence the thought is, here and now, to keep the old SQLite database table as-is, in case you want to migrate *back* to an older version perhaps, but to copy/transform it into a new table, where everything is using the new SHA256 key. **Plus** you'ld need a lookup table where MD5 is mapped to SHA256 and vice versa. [^1] + +[^1]: Hm, maybe we could combine that with the 'document grouping' feature I want to add to Qiqqa so I can 'bundle' PDFs for chapters into books and such-like. MD5 hash collisions would just be another grouping/mapping type. There's also ^[to be, not implemented yet!] the decrypting and cleaning up and PDF/A text embedding of existing PDFs, resulting in more PDFs with basically the same content, but a different internal *shape* and thus different hash key. All these should not land in a single table as they clearly have slightly different structure and widely different semantics, but it all means the same thing: there's some database rework to be done! ## Backups to cloud storage -Currently Qiqqa copies the Sqlite DB to cloud using SQlite, which is not very smart as this can break the database due to potential collisions with other accessors^[you or other user accessing the same cloud storage spot and thus shared DB over network, if only for a short moment]: the idea there is to always **binary file copy** the database to cloud storage and only ever let Sqlite access the DB that sits in local *private* storage. +Currently Qiqqa copies the Sqlite DB to cloud using SQlite, which is not very smart as this can break the database due to potential collisions with other accessors[^2]: the idea there is to always **binary file copy** the database to cloud storage and only ever let Sqlite access the DB that sits in local *private* storage. + +[^2]: you or another user accessing the same cloud storage spot and thus a shared DB over the network, if only for a short moment. Multi-user access over cloud storage is a persistent problem as there's no solid file locking solution for such systems: not for basic networking and certainly not for cloud storage systems (such as Google Drive or DropBox, which have their own proprietary ways of 'syncing' files and none of them will be happy with *shared use* of such files while they 'sync'). diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Specific Tools and Tasks to Research and Produce for Cross-platform Qiqqa.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Specific Tools and Tasks to Research and Produce for Cross-platform Qiqqa.md index 57c5c172a..2335bf87b 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Specific Tools and Tasks to Research and Produce for Cross-platform Qiqqa.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Specific Tools and Tasks to Research and Produce for Cross-platform Qiqqa.md @@ -39,11 +39,11 @@ 1. we don't have to bother with coding any Java to make it happen 2. users can invent their own smart ideas accessing and using the collected metadata: we are opening up Qiqqa! -- What's left for the old C# codebase then, you ask?
+- What's left for the old C# code-base then, you ask? - WPF/.NET is **very** *non*-portable so most of it will have to go. What we untangle and isn't replaced by Core Technologies (see above: LuceneNET is out and so will be a couple of other libs and tools) will stay on a "useful middleware" which will be commandline / socket-interface based so **not facing users directly, i.e. no need for any GUI work any more**: all that should be handled by the new web-based GUI above. + WPF/.NET is **very** *non*-portable so most of it will have to go. What we untangle and isn't replaced by Core Technologies (see above: LuceneNET is out and so will be a couple of other libs and tools) will stay on as "useful middle-ware" which will be command-line / socket-interface based so **not facing users directly, i.e. no need for any GUI work any more**: all that should be handled by the new web-based GUI above. - This part will then become cross-platform portable as the remains would fit the bill of C#/.NET.Core which is touted as cross-platform able *today*. We'll see what remains as "middleware" as we work on the other components and discover its relative usefulness along the way. I expect all sorts of import/export jobs will stay in C#/.NET.Core as there's little benefit to re-coding those processes in a otherwise-critical Core Server component, which already does the PDF and Library Database handling then. + This part will then become cross-platform portable as the remains would fit the bill of C#/.NET.Core which is touted as cross-platform able *today*. We'll see what remains as "middle-ware" as we work on the other components and discover its relative usefulness along the way. I expect all sorts of import/export jobs will stay in C#/.NET.Core as there's little benefit to re-coding those processes in an otherwise-critical Core Server component, which already does the PDF and Library Database handling then. ### Demarcated Projects = producing a Tool or a Procedure To Use A Tool: diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/Full Text Search Engines.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/Full Text Search Engines.md index 373fabb94..d9cf07f38 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/Full Text Search Engines.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/Full Text Search Engines.md @@ -30,7 +30,7 @@ https://ui.adsabs.harvard.edu/ - an advanced example of SOLR in actual use. (The https://tantivy-search.github.io/bench/ - benchmark of pisa, tantivy and lucene. Interesting stuff and it teases me to have a look at those. Goes with this: [quickwit-oss/search-benchmark-game: Search engine benchmark (Tantivy, Lucene, PISA, ...) (github.com)](https://github.com/quickwit-oss/search-benchmark-game) -[pisa-engine/pisa: PISA: Performant Indexes and Search for Academia (github.com)](https://github.com/pisa-engine/pisa) - C++ stuff. +[pisa-engine/pisa: PISA: Performant Indexes and Search for Academia (github.com)](https://github.com/pisa-engine/pisa) - C++ stuff. 
(**Not an option, as the entire search index is supposed to fit in memory, according to their documentation.** *Pity.*) diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/Metadata Search Engines.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/Metadata Search Engines.md index 1047ff1ae..a34694e84 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/Metadata Search Engines.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Stuff To Look At/Metadata Search Engines.md @@ -43,7 +43,7 @@ Found via the README of the project [`edsu/etudier`: "Extract a citation network > > I’ve read or heard someone say that Google Scholar is given privileged access to crawl Publisher,aggregator (often enhanced with subject heading and controlled vocab) and none-free abstract and indexing sites like Elsevier and Thomson Reuters’s Scopus and Web of Science respectively. > -> Obviously the latter two wouldn’t be so wild about Google Scholar offering a API that would expose all their content to anyone since they sell access to such metadata. +> Obviously the latter two wouldn’t be so wild about Google Scholar offering an API that would expose all their content to anyone since they sell access to such metadata. > > Currently you only get such content (relatively rare) from GS if you are in the specific institution IP range that has subscriptions. (Also If your institution is already a subscriber to such services such as Web of Science or Scopus, you library could usually with some work allow you access directly via the specific resource API!.) > @@ -55,7 +55,7 @@ Found via the README of the project [`edsu/etudier`: "Extract a citation network > > Also Web Scale discovery services that libraries pay for such as Summon, Ebsco discovery service, Primo etc do have APIs and they come closest to duplicating a (less comprehensive version) Google Scholar API > -> Another poor substitute to a Google Scholar API, is the Crossref Metadata Search. It’s not as comprehensive as Google Scholar but most major publishers do deposit their metadata. +> Another poor substitute to a Google Scholar API is the Crossref Metadata Search. It’s not as comprehensive as Google Scholar but most major publishers do deposit their metadata. > > --- > @@ -3027,6 +3027,8 @@ Just one example: https://pubmed.ncbi.nlm.nih.gov/32668870/ - https://www.bibsonomy.org/ - The blue social bookmark and publication sharing system: BibSonomy helps you to manage your publications and bookmarks, to collaborate with your colleagues and to find new interesting material for your research. - https://typeset.io/ - https://synapse.koreamed.org/advanced/ +- https://bazaar.abuse.ch/browse/tag/pdf/ +- https://epdf.tips/ - --- diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Synchronization & Updates/Syncing, zsync style.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Synchronization & Updates/Syncing, zsync style.md index 9b9829480..f3dc0ce23 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Synchronization & Updates/Syncing, zsync style.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Synchronization & Updates/Syncing, zsync style.md @@ -1,4 +1,4 @@ -# Syncing, zsync style +# Syncing, `zsync` style I've looked at a lot of stuff, including `rsync` & `unison`. 
All the *smart stuff* requires dedicated software running on both sides of the fence. I don't want that: the cheapest 'cloud storage' solutions are plain file and/or static web page access only: diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/The Qiqqa Sniffer UI-UX - PDF Viewer, Metadata Editor + Web Browser As WWW Search Engine.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/The Qiqqa Sniffer UI-UX - PDF Viewer, Metadata Editor + Web Browser As WWW Search Engine.md index 75741c0ac..77dba2902 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/The Qiqqa Sniffer UI-UX - PDF Viewer, Metadata Editor + Web Browser As WWW Search Engine.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/The Qiqqa Sniffer UI-UX - PDF Viewer, Metadata Editor + Web Browser As WWW Search Engine.md @@ -8,7 +8,7 @@ TBD See also: - [[Using embedded cURL to obtain Metadata for a document]] - [[Using embedded cURL to download PDF or HTML document at URL]] -- [[curl - commandline and notes]] +- [[../../Technology/Odds 'n' Ends/curl - command-line and notes]] - [[Testing - PDF URLs with problems]] - [[Testing - Nasty URLs for PDFs]] - [[wxWidgets + CEF for UI - or should we go electron anyway⁈ ⇒ WebView2 et al]] diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/The woes and perils of invoking other child applications.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/The woes and perils of invoking other child applications.md index 3cde2f1d4..acca53b8e 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/The woes and perils of invoking other child applications.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/The woes and perils of invoking other child applications.md @@ -27,7 +27,7 @@ * Of course, trouble never travels alone: I also had some very obnoxious issues re getting hold of the child process' **exit code** and/or precise moment when the child process has indeed fully terminated: I know I still have bugs lurking deep down in the current C#/.NET code handling child process invocations. - I've spent quite some time on those and the conclusion there is that things are too opaque to perform the very detailed problem analysis that these issues require: *that* is one of the reasons why I intend to migrate the whole caboodle to local-loopback IPC (sockets): that approach may be a tad slower, though from what I've gathered from the few banchmarks old and new floating around on the *IntrNetz* the performance is probably *on par*. Heck, as long as the trouble is **not** "on par", I'll be a happy camper! + I've spent quite some time on those and the conclusion there is that things are too opaque to perform the very detailed problem analysis that these issues require: *that* is one of the reasons why I intend to migrate the whole caboodle to local-loopback IPC (sockets): that approach may be a tad slower, though from what I've gathered from the few benchmarks old and new floating around on the *IntrNetz* the performance is probably *on par*. Heck, as long as the trouble is **not** "on par", I'll be a happy camper! * As far as I'm concerned, any stdio + child process invocation will, from now on, only be done in a precisely controlled environment that's -- at least in principle -- cross-platform: a C/C++ based library. And no P/Invoke, either. I've had it with those!
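As an aside to the child-process woes above, here is a minimal sketch of the 'precisely controlled' invocation pattern being argued for -- explicit stdout/stderr capture, a hard timeout, and a definite exit code once the child has really terminated. Python's `subprocess` stands in for the eventual C/C++ library, and the `mutool draw` command line is purely illustrative:

```python
# Sketch only: tool name, arguments and timeout are illustrative assumptions.
import logging
import subprocess

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("child-invoke")

def run_tool(argv, timeout_s=60):
    """Run a child process, file its stderr into our own log, and return (exit_code, stdout_bytes)."""
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,   # capture instead of inheriting our console
            stderr=subprocess.PIPE,
            timeout=timeout_s,        # never hang forever on a wedged child
        )
    except subprocess.TimeoutExpired:
        log.error("child %r timed out after %s seconds", argv[0], timeout_s)
        return None, b""
    except FileNotFoundError:
        log.error("child %r not found", argv[0])
        return None, b""
    # At this point the child has fully terminated; proc.returncode is its real exit code.
    for line in proc.stderr.decode("utf-8", "replace").splitlines():
        log.info("[%s stderr] %s", argv[0], line)
    return proc.returncode, proc.stdout

if __name__ == "__main__":
    code, out = run_tool(["mutool", "draw", "-o", "page1.png", "some.pdf", "1"])
    log.info("exit code: %s, %d bytes of captured stdout", code, len(out))
```

The same shape (argv in; exit code plus captured streams out) is what a local-loopback IPC wrapper around such tools would presumably have to expose as well.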
diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Using SQLite for the OCR and page render caches.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Using SQLite for the OCR and page render caches.md index 1fac9f2d0..9bbe8e697 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Using SQLite for the OCR and page render caches.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Using SQLite for the OCR and page render caches.md @@ -10,7 +10,7 @@ As we would like to have a flexible, adaptive priority queue for the task queue, When we task the PDF page render process to produce one or more pages for viewing (or OCR processing!), it can benefit us to cache these page images. (Less so with the OCR process requesting a page: after all, we expect that process to complete and never ask for that same page again as it would be *done*!) -Given that the page images are expected to be relatively large (several 100's of kilobytes a piece), even when WebP encoded, it might be smart to store these directly in the filesystem and only store the cache info + file reference in the SQL table itself. +Given that the page images are expected to be relatively large (several 100's of kilobytes a piece), even when WebP encoded, it might be smart to store these directly in the file-system and only store the cache info + file reference in the SQL table itself. As these page images will be valid *beyond a single application run* it MAY be useful to make this a persistent cache, which can be re-used the next time the application is run. @@ -24,7 +24,7 @@ Having a persistent cache for this data also enables us to add another feature o > Any such updates should also be forwarded to the FTS engine: when the document text is altered (for whatever reason, be it error corrections, augmentation or otherwise) this should be reflected in the FTS index so that future search activity by the user will produce matches against the latest texts. -Currently, OCR / text extracts are stored in single page and 20page files and are several kilobytes large, also thanks to the relatively expensive ASCII text representation of the word bboxes used in the current format. When we store these numbers as either integers or IEEE 754 *floats* (32-bit floating point values), we will have both plenty precision and a much lower storage cost per word in our store. +Currently, OCR / text extracts are stored in single page and 20-page files and are several kilobytes large, also thanks to the relatively expensive ASCII text representation of the word bboxes used in the current format. When we store these numbers as either integers or IEEE 754 *floats* (32-bit floating point values), we will have both plenty precision and a much lower storage cost per word in our store. Still it would be a pending question whether to store this data as a BLOB inside the SQLite table or merely store a reference to the cache file in the database instead. @@ -50,7 +50,7 @@ While those systems have been tested heavily and are pretty reliable, the argume > This would be particularly important for the OCR/text-extract "cache" as recreating that one is a very heavy burden on the entire system. > > Once the "manual update of extracted text / hOCR layer" feature is available, the cost of re-creating this table and content will include (a lot of) human labor as well, so ruggedness of the data store should trump raw performance in the scoring for selection for use in the code. 
SQLite is the only one of the options which has clear public documentation listing these claims and under what conditions those guarantees are available to us -- f.e. it's good for performance to switch the WAL to be in-memory and thus non-persistent across an application hard abort or system disruption; while using such a *tweak* (pragma) might be nice for the work priority queue and rendered page image caches, it *certainly* would be very wrong to apply that same performance tweak to the OCR/text extract cache table! - > x + > > While I like upscaledb/hamsterdb a lot (and find the lmdb variants very intriguing, while I haven't used them yet for stuff like this with lots of writes happening over time), the arguments for reachability and ruggedness are winning. The additional concern here is that, even when SQLite would be a decade *slower* than LMDB or hamsterdb, this "mediocre" performance would probably go unnoticed against the costs of the tasks themselves: regular usage and various experiments with Qiqqa to date have shown that the task prioritizer/scheduler is a very critical component for the UX (including "freezing the machine due to all threads being loaded with work") while the data storage part went unnoticed in terms of cost, so I expect not having to go to extremes there to make it "fast": there's more to be gained in optimizing the work tasks themselves and the scheduler thereof: that is another reason why I prefer to use SQLite for No.1 now: it gives me options to easily adjust the scheduler mechanism when I want to/have to when performance is at a premium when we process/import large Qiqqa libraries, for example. diff --git a/docs-src/Notes/Progress in Development/Considering the Way Forward/Using embedded cURL to download PDF or HTML document at URL.md b/docs-src/Notes/Progress in Development/Considering the Way Forward/Using embedded cURL to download PDF or HTML document at URL.md index 76c162ec3..da177f721 100644 --- a/docs-src/Notes/Progress in Development/Considering the Way Forward/Using embedded cURL to download PDF or HTML document at URL.md +++ b/docs-src/Notes/Progress in Development/Considering the Way Forward/Using embedded cURL to download PDF or HTML document at URL.md @@ -6,4 +6,4 @@ ## Caveats -See also [[curl - commandline and notes]] for issues discovered when following this approach. +See also [[../../Technology/Odds 'n' Ends/curl - command-line and notes]] for issues discovered when following this approach. diff --git a/docs-src/Notes/Progress in Development/Notes From The Trenches - Odd, Odder, Oddest, ...PDF!.md b/docs-src/Notes/Progress in Development/Notes From The Trenches - Odd, Odder, Oddest, ...PDF!.md index 64ee21a87..828ab2c22 100644 --- a/docs-src/Notes/Progress in Development/Notes From The Trenches - Odd, Odder, Oddest, ...PDF!.md +++ b/docs-src/Notes/Progress in Development/Notes From The Trenches - Odd, Odder, Oddest, ...PDF!.md @@ -5,7 +5,7 @@ Just a collection of notes of what I've run into while working on Qiqqa and rela Most[^not100pct] of this stuff is reproducible using the PDFs in the Evil Collection of PDFs. -[^not100pct]: You don't get **all** of the *eavil basturds* as there's at least lacking a few remarkable PDFs which were over 500MBytes(!) *each* when I received them -- and we're not talking *press-ready preflight PDFs* here, as I would *expect* sizes like that for *that* kind of PDF! 
*No sir-ee!* These fine specimens came straight off the IntarWebz looking like someone had already made the, ah, *effort* to have them "reduced" for web-only screen display viewing (e.g. highly compressed lossy page images, etc.). +[^not100pct]: You don't get **all** of the *eavil basturds* as there's at least lacking a few remarkable PDFs which were over 500MBytes(!) *each* when I received them -- and we're not talking *press-ready pre-flight PDFs* here, as I would *expect* sizes like that for *that* kind of PDF! *No sir-ee!* These fine specimens came straight off the IntarWebz looking like someone had already made the, ah, *effort* to have them "reduced" for web-only screen display viewing (e.g. highly compressed lossy page images, etc.). Alas, you can't have them all, they say... @@ -14,7 +14,7 @@ Alas, you can't have them all, they say... ## Repair & Linearize for Web?... You Better *NOT*! -`qpdf -qdr` does a fine job, but sometimes that job is just **way too fine**: as I was considering iff `qpdf` and/or `mutool repair` and/or other tools could be used to produce PDFs (from source PDFs) which would be more palatable and easier to process by Qiqqa et al, I discovered that, on a few occasions, `qpdf -qdr` was able to produce PDFs of over 1 GIGAbyte(!) from humble beginnings such as a ~10MByte source PDF. Thus 'linearize for web' and similar 'restructuring cleanup' actions SHOULD NOT be performed to help displaying / processing these buggers in Qiqqa et al. +`qpdf -qdr` does a fine job, but sometimes that job is just **way too fine**: as I was considering *iff* `qpdf` and/or `mutool repair` and/or other tools could be used to produce PDFs (from source PDFs) which would be more palatable and easier to process by Qiqqa et al, I discovered that, on a few occasions, `qpdf -qdr` was able to produce PDFs of over 1 GIGAbyte(!) from humble beginnings such as a ~10MByte source PDF. Thus 'linearize for web' and similar 'restructuring cleanup' actions SHOULD NOT be performed to help displaying / processing these buggers in Qiqqa et al. **Note**: this, of course, does not mean that `qpdf` is an unuseful tool: on the contrary! `qpdf decrypt` and a few other tools (`mutool repair`) help greatly in getting some obnoxious PDFs displayed on screen, just like Acrobat would show them (see also https://github.com/jimmejardine/qiqqa-open-source/issues/9 and https://github.com/jimmejardine/qiqqa-open-source/issues/136). diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 001-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 001-of-N.md index 375e9538d..88ac39478 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 001-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 001-of-N.md @@ -4,7 +4,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. 
**Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 002-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 002-of-N.md index 1f200dd20..fd052aaa9 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 002-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 002-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 003-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 003-of-N.md index f72f4a65d..9c447355c 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 003-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 003-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. 
**Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 004-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 004-of-N.md index 3a99b2458..b3e95f9d4 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 004-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 004-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 005-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 005-of-N.md index 1c6e8bd48..5669412de 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 005-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 005-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. 
**Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 006-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 006-of-N.md index 5622498eb..e4b8652f7 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 006-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 006-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 007-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 007-of-N.md index cc448988a..3520b1124 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 007-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 007-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. 
**Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 008-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 008-of-N.md index 7af26f70d..b004c1a52 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 008-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 008-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 009-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 009-of-N.md index 7af26f70d..b004c1a52 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 009-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 009-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. 
**Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 010-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 010-of-N.md index 7af26f70d..b004c1a52 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 010-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 010-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 011-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 011-of-N.md index 7af26f70d..b004c1a52 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 011-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 011-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. 
**Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 012-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 012-of-N.md index 7af26f70d..b004c1a52 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 012-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 012-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 013-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 013-of-N.md index 7af26f70d..b004c1a52 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 013-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 013-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. 
**Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 014-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 014-of-N.md index 7af26f70d..b004c1a52 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 014-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 014-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 015-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 015-of-N.md index 7af26f70d..b004c1a52 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 015-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 015-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. 
**Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 016-of-N.md b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 016-of-N.md index d87f8b08d..703fd09eb 100644 --- a/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 016-of-N.md +++ b/docs-src/Notes/Progress in Development/Testing & Evaluating/Collected RAW logbook notes - PDF bulktest + mutool_ex PDF + URL tests/notes 016-of-N.md @@ -3,7 +3,7 @@ # Test run notes at the bleeding edge -This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even counterdicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** +This is about the multiple test runs covering the `evil-base` PDF corpus: I've been collecting these notes over the years. **Big Caveat: these notes were valid at the time of writing, but MAY be obsolete or even contradicting current behaviour at any later moment, sometimes even *seconds* away from the original event.** This is about the things we observe when applying our tools at the bleeding edge of development to existing PDFs of all sorts, plus more or less wicked Internet URLs we checked out and the (grave) bugs that pop up most unexpectedly. diff --git a/docs-src/Notes/Progress in Development/The Transitional Period - Extra Notes.md b/docs-src/Notes/Progress in Development/The Transitional Period - Extra Notes.md index a969e88dd..204bc6540 100644 --- a/docs-src/Notes/Progress in Development/The Transitional Period - Extra Notes.md +++ b/docs-src/Notes/Progress in Development/The Transitional Period - Extra Notes.md @@ -8,7 +8,7 @@ However, this leads to another round-trip = *latency risk*, as I'll be introduci My excuse is that I will need this type of *cache service* once I've moved the UI to HTML/CSS if I want my HTML/CSS-based UI to *stay relatively easy to maintain* by having reduced the *complicated async cache usage bits* to fundamental web requests. -Another, *more acute*, reason is, while async programming is nice, at least in C#/.NET it *clings to the UI thread* unless you spend *significant additional non-trivial effort*, which already has complicated the current codebase quite a bit and clearly is a support dead end long term. +Another, *more acute*, reason is, while async programming is nice, at least in C#/.NET it *clings to the UI thread* unless you spend *significant additional non-trivial effort*, which already has complicated the current code-base quite a bit and clearly is a support dead end long term. 
Thus simple refactoring of once UI-locking behaviour is a *sine cure* as C# `async` is not sufficient: it merely will move the heavy calculations to another point in time, resulting in UI lockups at arbitrary times, if we hadn't expended that additional (and *code complicating*!) effort. The cache+task system needed in Qiqqa is manifold as it needs to manage all those pesky UI-triggered *actions* in the background: @@ -28,13 +28,13 @@ The cache+task system needed in Qiqqa is manifold as it needs to manage all thos - Quite a few (render-)actions require 1. an *immediate* and *delayed* response, where the *immediate* response would be the swiftest, simplest, render of a "please wait while I render this" user feedback behaviour, e.g. rendering only rendering only a basic, *context-independent*, page view. (Think the modern approach to fast-load web pages where text is *faked* initially by rendering a bunch of gray bars representing a chunk of text-to-be.) 2. plus the final page render, which may have taken some time and is prone to becoming obsolete by the user moving away from the page already before the page finished rendering or due to the user scrolling up&down repeatedly, causing the simple UI code to re-trigger this very same render task as the user re-visits the page already before is has rendered fully. - 3. plus some data queries are sufficiently expensive (LDA-based keyword discovery, for example) that a *good UX* implies we'll be expecting *multiple updates as the backend system delivers more useful results to be added to the pick set presented to the user*. Think keyword / tag suggestions and such-like where swiftness-of-response is a *major factor for usability*: if the user still hasn't decided, we can update the list a la google Search suggestions, otherwise we can abort the costly background operation. + 3. plus some data queries are sufficiently expensive (LDA-based keyword discovery, for example) that a *good UX* implies we'll be expecting *multiple updates as the back-end system delivers more useful results to be added to the pick set presented to the user*. Think keyword / tag suggestions and such-like where swiftness-of-response is a *major factor for usability*: if the user still hasn't decided, we can update the list a la google Search suggestions, otherwise we can abort the costly background operation. - While all these *could* be coded in C# now and then redone in the front-end JS-based UI (CEF/WebView-based UI in *Future Qiqqa*), I think offloading the bulk of this complexity for the relevant backend *servers* to manage would benefit the total CPU cost for a given UX responsiveness at reasonable *developer cost*: I'll need *abortable task execution* server-side anyway, due to no.3, while deciding to keep the queue management client-side implies I'll be doing the thread/CPU-load management of the relevant server *client-side* as well, as then the *client* would then dictate how many tasks are processed at the same time. Not that this is necessarily *bad*, but since I'll need *abort functionality* server-side anyway, I don't have a very good reason *not* to keep the entire queue management on that side as well. 
+ While all these *could* be coded in C# now and then redone in the front-end JS-based UI (CEF/WebView-based UI in *Future Qiqqa*), I think offloading the bulk of this complexity for the relevant back-end *servers* to manage would benefit the total CPU cost for a given UX responsiveness at reasonable *developer cost*: I'll need *abortable task execution* server-side anyway, due to no.3, while deciding to keep the queue management client-side implies I'll be doing the thread/CPU-load management of the relevant server *client-side* as well, as then the *client* would then dictate how many tasks are processed at the same time. Not that this is necessarily *bad*, but since I'll need *abort functionality* server-side anyway, I don't have a very good reason *not* to keep the entire queue management on that side as well. - Another argument pro-server-side is the *way* we would respond to a user-side dictated *abort* or *nil=superseded/anulled* conclusion made there: it's generally swiftest to keep the client-server connection *open* all the time, thus *abort* and *anulled* would have to be *messages*: simply dropping the connection to signal lack-of-interest in the result is too costly, when we're talking many such task requests per second. (*And we are expecting rather high task request numbers indeed as we have both the busy user and the current background processes, such as database completion by queueing a zillion pages' worth of text extracts when processing large library imports, for example, to contend with.*) + Another argument pro-server-side is the *way* we would respond to a user-side dictated *abort* or *nil=superseded/annulled* conclusion made there: it's generally swiftest to keep the client-server connection *open* all the time, thus *abort* and *annulled* would have to be *messages*: simply dropping the connection to signal lack-of-interest in the result is too costly, when we're talking many such task requests per second. (*And we are expecting rather high task request numbers indeed as we have both the busy user and the current background processes, such as database completion by queueing a zillion pages' worth of text extracts when processing large library imports, for example, to contend with.*) - The consequence of *abort* and *anulled* becoming *messages*, is having to expect and process *ack responses* thereof as we need to keep the communication across the socket clean and *intact* while these happen and other tasks await (async) response from the server. This means we're essentially consigning ourselves to a *Check Style* approach, or so it seems, while we'll also need some sort of *server push* if we want those *incremental updates* to work out in a request=response communication scheme across the socket. + The consequence of *abort* and *annulled* becoming *messages*, is having to expect and process *ack responses* thereof as we need to keep the communication across the socket clean and *intact* while these happen and other tasks await (async) response from the server. This means we're essentially consigning ourselves to a *Check Style* approach, or so it seems, while we'll also need some sort of *server push* if we want those *incremental updates* to work out in a request=response communication scheme across the socket. Hm. 
@@ -42,9 +42,9 @@ The cache+task system needed in Qiqqa is manifold as it needs to manage all thos **Nyet.** - If we file a (costly) request which *may* result in multiple responses, we *always* will (need to) know when the query has *completely finished*: it's not really *server push* that way, as our client-side request triggers the behaviour: without it, there's no reason what-so-ever to send those incremental updates. The proper way there thus would be to file the request, wait for a response (any is fine: aborted, anulled, *partial data* or *final data*) and when we happen to observe a *partial data* response form the server, submit a follow-up request message for the server to bind the next (partial or *final*) response to. + If we file a (costly) request which *may* result in multiple responses, we *always* will (need to) know when the query has *completely finished*: it's not really *server push* that way, as our client-side request triggers the behaviour: without it, there's no reason what-so-ever to send those incremental updates. The proper way there thus would be to file the request, wait for a response (any is fine: aborted, annulled, *partial data* or *final data*) and when we happen to observe a *partial data* response from the server, submit a follow-up request message for the server to bind the next (partial or *final*) response to. - It also means we'll need the *client* to be able to send *abort* and *anulled* messages as it's the client who knows further server-side work is obsoleted when the user closes that particular view or *moves away*: when this happens, an "abort/anull anything related to this view" message would be much appreciated as it'd cut *ineffective* server-side CPU costs quite a bit. Think, for example, about a heavily loaded system, where a user opens a PDF reader view, scrolls a bit, thus firing multiple page & thumbnail render requests for document X, then decides otherwise and closes the view and moves to another document, thus causing *any work to-be-done on document X to be utterly useless*: here it would help greatly if the server is kept informed and receives a "no need to work on that document X stuff any more" message. + It also means we'll need the *client* to be able to send *abort* and *annulled* messages as it's the client who knows further server-side work is obsoleted when the user closes that particular view or *moves away*: when this happens, an "abort/annul anything related to this view" message would be much appreciated as it'd cut *ineffective* server-side CPU costs quite a bit. Think, for example, about a heavily loaded system, where a user opens a PDF reader view, scrolls a bit, thus firing multiple page & thumbnail render requests for document X, then decides otherwise and closes the view and moves to another document, thus causing *any work to-be-done on document X to be utterly useless*: here it would help greatly if the server is kept informed and receives a "no need to work on that document X stuff any more" message. Which, when you think about it, means we would be helped *significantly* if we can *discard* pending requests without sending a *nack response*. Thus the request=response 1:1 message exchange is *not true* anymore then: some request messages waiting for a response client side are to be killed without a response?
Or should we keep the interface simple and *accept* the 1:1 rule as strict, thus resulting in an *abort this X work* message in an immediate flurry of *nack* response messages for the pending document X requests? Hmmmmmmm.... Cost would be relatively cheap, very small messages and little work that way to cleanly resolve each pending async request client-side that way, as the client-side code would not have to bother with keeping track of outstanding requests *at all*: sending an "abort this lot" message to the server would take care of all of that by way of the *nack* server responses for each of those pending client-side actions, thus causing the desired *cleanup* via the regular code path. diff --git a/docs-src/Notes/Progress in Development/The Transitional Period.md b/docs-src/Notes/Progress in Development/The Transitional Period.md index ded788568..254da4dfd 100644 --- a/docs-src/Notes/Progress in Development/The Transitional Period.md +++ b/docs-src/Notes/Progress in Development/The Transitional Period.md @@ -13,7 +13,7 @@ This is about getting from current Qiqqa to future Qiqqa. - flaky text extraction by way of QiqqaOCR - *.NET specific serialization* of important structures (configuration, PDF annotations) to disk & database. - SQLite metadata database via SQLite.NET; has sync to cloud storage/NAS/network fatal issues. -- 32-bit application only (restricted to 32bit by the libraries used) +- 32-bit application only (restricted to 32-bit by the libraries used) - Trouble with libraries in the 10K's "*Future Qiqqa*" is: @@ -25,13 +25,13 @@ This is about getting from current Qiqqa to future Qiqqa. - bleeding edge MuPDF + *patches* + Tesseract for all PDF work, including reading/viewing. - SQLite metadata database (opened up to enable user-written scripts to work the data for *advanced usage*) - Revamped NAS/network/cloud Sync for cooperative & backup work on a single library. -- *64bit first* (maybe a 'older boxes' 32bit build alongside?) +- *64-bit first* (maybe a 'older boxes' 32-bit build alongside?) - Copes well with 100K+ libraries on medium hardware. -## Tackling the transition from 32bit to 64bit +## Tackling the transition from 32-bit to 64-bit -Experiments have shown that I have no stable way on Windows to start 64bit executables from a 32bit binary. This restricts all backend changes (including QiqqaOCR *full* or *partial* replacements) to having to be 32bit builds. +Experiments have shown that I have no stable way on Windows to start 64-bit executables from a 32-bit binary. This restricts all back-end changes (including QiqqaOCR *full* or *partial* replacements) to having to be 32-bit builds. Further tests have shown repeatedly (and very recently *again*) that I have *unsolved problems* invoking external applications from the .NET application, where I need very tight control over those external application's stdin+stdout+stderr streams, including *binary data* transmissions. 
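Aside: to make the abort/annul-as-messages scheme mulled over in the *Transitional Period - Extra Notes* hunks above a bit more concrete, here is a minimal C# sketch of the message shapes involved. Every name in it (`MsgKind`, `Envelope`, `IsTerminal`, ...) is hypothetical -- nothing like this exists in the Qiqqa code-base; it only illustrates the intended strict 1:1 bookkeeping where a *partial data* response is answered by a follow-up request and a view-level *abort* fans out into per-request *nack* responses.

```csharp
using System;

// Hypothetical message shapes -- illustration only, not Qiqqa's actual API.
enum MsgKind { Request, PartialResult, FinalResult, Abort, Annul, Nack }

// One envelope per socket message; RequestId ties a response back to the
// pending client-side action, ViewId lets a single "abort everything for
// this view" message fan out into per-request Nack responses server-side.
record Envelope(MsgKind Kind, Guid RequestId, string ViewId, string? Payload);

static class ProtocolSketch
{
    // Every pending request ends in exactly one terminal message; a
    // PartialResult is answered with a follow-up Request, so the client's
    // request=response bookkeeping stays strictly 1:1.
    static bool IsTerminal(Envelope m) => m.Kind is MsgKind.FinalResult or MsgKind.Nack;

    static void Main()
    {
        var req     = new Envelope(MsgKind.Request, Guid.NewGuid(), "pdf-view-X", "render page 3");
        var partial = req with { Kind = MsgKind.PartialResult, Payload = "grey-bars placeholder render" };
        var nack    = req with { Kind = MsgKind.Nack, Payload = null };

        Console.WriteLine($"partial terminal? {IsTerminal(partial)}"); // False -> file a follow-up Request
        Console.WriteLine($"nack terminal?    {IsTerminal(nack)}");    // True  -> pending action cleaned up
    }
}
```

The point of `IsTerminal` is exactly the "flurry of *nack* responses" resolution above: the client never tracks outstanding requests per view itself, it simply waits for each request's terminal message and the regular code path does the cleanup.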
diff --git a/docs-src/Notes/Progress in Development/Towards Migrating the PDF Viewer + Renderer (+ Text Extractor).md b/docs-src/Notes/Progress in Development/Towards Migrating the PDF Viewer + Renderer (+ Text Extractor).md index 94ab804bc..b4d8d7597 100644 --- a/docs-src/Notes/Progress in Development/Towards Migrating the PDF Viewer + Renderer (+ Text Extractor).md +++ b/docs-src/Notes/Progress in Development/Towards Migrating the PDF Viewer + Renderer (+ Text Extractor).md @@ -14,9 +14,9 @@ Qiqqa also used an old patched[^patched] version of `pdfdraw` (v1.1.4) in the Qi ## The goal / what'ld be ideal -- Use a single PDF renderer / text extractor, so that *when* a PDF is accepted, both will "interpret" the PDF the same way: the renderer producing page images while the text extractor part will produce hOCR or similar text+coordinates data from those pages: when both outputs are produced by a single toolchain then my **assumption / expectation** is that the hOCR words SHOULD be at the same spot as the image rendered pixels for them, even when we're processing a somewhat *odd* PDF. +- Use a single PDF renderer / text extractor, so that *when* a PDF is accepted, both will "interpret" the PDF the same way: the renderer producing page images while the text extractor part will produce hOCR or similar text+coordinates data from those pages: when both outputs are produced by a single tool-chain then my **assumption / expectation** is that the hOCR words SHOULD be at the same spot as the image rendered pixels for them, even when we're processing a somewhat *odd* PDF. - No closed source libraries anywhere: if bugs aren't fixed quickly by a support team, they should at least be allowed to be analyzed in depth and for that you need **source code**. Too many very bad experiences with closed source for this fellow. 🤕 -- available in 32bit *and* 64bit with a C# interface so we can move Qiqqa into the 64-bit realm once we've got rid of the 32bit requirement thanks to antiquated XULrunner -- this should make life easier on modern boxes and when perusing (very) large libraries. +- available in 32-bit *and* 64-bit with a C# interface so we can move Qiqqa into the 64-bit realm once we've got rid of the 32-bit requirement thanks to antiquated XULrunner -- this should make life easier on modern boxes and when perusing (very) large libraries. - very near or at Acrobat performance and PDF compatibility, i.e. SHOULD NOT b0rk on many PDFs, even evil ones.[^evilPDF] [^evilPDF]: In the end, the PDF renderer WILL be Internet facing -- even while only *indirectly*, but as PDFs are downloaded and then viewed=rendered and processed by Qiqqa, those PDFs are essentially straight off the Internet and consequently security / stability of the PDF processing code should be up for that level of (unintentional) abuse. diff --git a/docs-src/Notes/Qiqqa Internals/Extracting the text from PDF documents.md b/docs-src/Notes/Qiqqa Internals/Extracting the text from PDF documents.md index 84dd7da2f..81f69ee6b 100644 --- a/docs-src/Notes/Qiqqa Internals/Extracting the text from PDF documents.md +++ b/docs-src/Notes/Qiqqa Internals/Extracting the text from PDF documents.md @@ -38,7 +38,7 @@ Before we dive in, there's one important question to ask: - The incoming **original PDF** is copied to the Qiqqa Library **document store**, which is located in the `/documents/` directory tree. 
- The PDF **content** is hashed (using a [SHA1 derivative](https://github.com/jimmejardine/qiqqa-open-source/blob/0b015c923e965ba61e3f6b51218ca509fcd6cabb/Utilities/Files/StreamFingerprint.cs#L14)) to produce a unique identifier for this particular PDF **content**. That hash is used throughout Qiqqa for indexing *and* is to *name* the cached version of the incoming PDF, using a simple yet effective distribution scheme to help NTFS/filesystem performance for large libraries: the first character of the hash is also used as a *subdirectory* name. + The PDF **content** is hashed (using a [SHA1 derivative](https://github.com/jimmejardine/qiqqa-open-source/blob/0b015c923e965ba61e3f6b51218ca509fcd6cabb/Utilities/Files/StreamFingerprint.cs#L14)) to produce a unique identifier for this particular PDF **content**. That hash is used throughout Qiqqa for indexing *and* is to *name* the cached version of the incoming PDF, using a simple yet effective distribution scheme to help NTFS/file-system performance for large libraries: the first character of the hash is also used as a *subdirectory* name. Example path for a PDF file stored in the `Guest` Qiqqa Library: @@ -46,7 +46,7 @@ Before we dive in, there's one important question to ask: base/Guest/documents/D/DA7B8FDA82E6D7465ADC7590EEC0C914E955C5B8.pdf ``` -- The **extracted text** is saved in a Qiqqa-global store at `base/ocr/` using a similar filesystem performance scheme as for the PDF file itself. +- The **extracted text** is saved in a Qiqqa-global store at `base/ocr/` using a similar file-system performance scheme as for the PDF file itself. Example paths for the OCR output cached for the same PDF file as shown above: diff --git a/docs-src/Notes/Qiqqa Internals/Processing PDF documents' text and the impact on UI+UX.md b/docs-src/Notes/Qiqqa Internals/Processing PDF documents' text and the impact on UI+UX.md index 5e5dca8dd..1343118b1 100644 --- a/docs-src/Notes/Qiqqa Internals/Processing PDF documents' text and the impact on UI+UX.md +++ b/docs-src/Notes/Qiqqa Internals/Processing PDF documents' text and the impact on UI+UX.md @@ -28,7 +28,7 @@ The **primary** method is **direct text extraction**: using the `mupdf` tool, Q ### Text *Recognition* -When the primary method fails to deliver a text for a given page, that page is then *re-queued* to have it OCR-ed using a Tesseract-based subprocess. This is the **secondary** method for obtaining the text of a document (page). +When the primary method fails to deliver a text for a given page, that page is then *re-queued* to have it OCR-ed using a Tesseract-based sub-process. This is the **secondary** method for obtaining the text of a document (page). # How does this impact UX? diff --git a/docs-src/Notes/Qiqqa-Repository-Main-README(Copy).md b/docs-src/Notes/Qiqqa-Repository-Main-README(Copy).md index 7ce41c644..2d19f4620 100644 --- a/docs-src/Notes/Qiqqa-Repository-Main-README(Copy).md +++ b/docs-src/Notes/Qiqqa-Repository-Main-README(Copy).md @@ -10,8 +10,7 @@ Now open source award-winning Qiqqa research management tool for Windows. This version includes **every** feature available in [Commercial Qiqqa](qiqqa.com), including Premium and Premium+. -> Unfortunately we have had to **remove the web cloud sync** ability as that involves storage costs. Users are encouraged to migrate their Web Libraries into Intranet libraries, and **use Google Drive or Dropbox** - as the 'sync point' for those libraries. +> Unfortunately we have had to **remove the web cloud sync** ability as that involves storage costs. 
Users are encouraged to migrate their Web Libraries into Intranet libraries, and **use Google Drive or Dropbox** as the 'sync point' for those libraries. ## Download & Install Qiqqa @@ -42,7 +41,7 @@ To be notified of new releases [subscribe](https://groups.google.com/d/forum/qiq ### Just in case -On the unhappy chance where you want to revert to a previous Qiqqa version, these are all available for download at [https://github.com/GerHobbelt/qiqqa-open-source/releases](https://github.com/GerHobbelt/qiqqa-open-source/releases) (v82 and v81 prereleases) and [https://github.com/jimmejardine/qiqqa-open-source/releases](https://github.com/jimmejardine/qiqqa-open-source/releases) (v80 release). +On the unhappy chance where you want to revert to a previous Qiqqa version, these are all available for download at [https://github.com/GerHobbelt/qiqqa-open-source/releases](https://github.com/GerHobbelt/qiqqa-open-source/releases) (v82 and v81 pre-releases) and [https://github.com/jimmejardine/qiqqa-open-source/releases](https://github.com/jimmejardine/qiqqa-open-source/releases) (v80 release). All v82*, v81*, v80 and (commercial) v79  Qiqqa releases are binary compatible: they use the same database and directory structures, so you can install any of them over the existing Qiqqa install without damaging your Qiqqa libraries. @@ -51,7 +50,7 @@ Enjoy Qiqqa and take care! ### Miscellaneous Notes -* > **DO NOTE** that the v82 releases are prereleases, *some* of which are only lightly tested and may include bugs. Backup your library before testing these, even if you like living on the edge... +* > **DO NOTE** that the v82 releases are pre-releases, *some* of which are only lightly tested and may include bugs. Backup your library before testing these, even if you like living on the edge... > > *@GerHobbelt has joined the team and keeps the bleeding edge rolling. For recent changes see [closed bugs list](https://github.com/jimmejardine/qiqqa-open-source/issues?q=is%3Aissue+is%3Aclosed).* diff --git a/docs-src/Notes/Software Releases/Where To Get Them.md b/docs-src/Notes/Software Releases/Where To Get Them.md index ee9661f80..13146e165 100644 --- a/docs-src/Notes/Software Releases/Where To Get Them.md +++ b/docs-src/Notes/Software Releases/Where To Get Them.md @@ -29,7 +29,7 @@ New Qiqqa releases are published at two different URLs: # Installing a Qiqqa release - When you doubleclick the installer after downloading, it will run and *overwrite* the existing Qiqqa version (after a dialog has reported a different version is being installed). + When you double-click the installer after downloading, it will run and *overwrite* the existing Qiqqa version (after a dialog has reported a different version is being installed). This is harmless, as your libraries reside elsewhere on your disk and those **are not touched** during the install, only the Qiqqa executable and underlying binaries are replaced by an install action. 
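Aside: the fingerprint-based fan-out described in the *Extracting the text from PDF documents* hunk further up (the first character of the hash doubling as a sub-directory name) boils down to a one-liner. A minimal sketch, with illustrative names only -- the real code lives elsewhere in the Qiqqa sources:

```csharp
using System;
using System.IO;

static class DocumentStorePathSketch
{
    // base/<library>/documents/<first hash char>/<hash>.pdf -- the extra
    // directory level keeps any single NTFS directory from growing huge
    // when a library holds tens of thousands of PDFs.
    public static string ForPdf(string libraryBase, string fingerprint) =>
        Path.Combine(libraryBase, "documents",
                     fingerprint.Substring(0, 1), fingerprint + ".pdf");

    static void Main()
    {
        // Reproduces the example path quoted in the notes above
        // (Path.Combine uses '\' as the separator on Windows).
        Console.WriteLine(ForPdf(Path.Combine("base", "Guest"),
            "DA7B8FDA82E6D7465ADC7590EEC0C914E955C5B8"));
    }
}
```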
diff --git a/docs-src/Notes/Specialized Bits Of General Technology.md b/docs-src/Notes/Specialized Bits Of General Technology.md index 883e2f7d7..6a60f7a96 100644 --- a/docs-src/Notes/Specialized Bits Of General Technology.md +++ b/docs-src/Notes/Specialized Bits Of General Technology.md @@ -1,4 +1,5 @@ ## Specialized Bits Of General Technology + + [[IPC - transferring and storing data|IPC :: transferring and storing data in bulk between C++, C# and JavaScript components (and Java too)]] + [[Multiprocessing and IPC]] + [[SHA1B - Qiqqa Fingerprint 1.0 Classic|SHA1B :: The classic Qiqqa document fingerprint, dubbed SHA1B (that's B for B0rk)]] diff --git a/docs-src/Notes/Technology/Odds 'n' Ends/File copy oddities observed elsewhere - stuff to be reckoned with.md b/docs-src/Notes/Technology/Odds 'n' Ends/File copy oddities observed elsewhere - stuff to be reckoned with.md index bcafc4ab9..6d9e82e80 100644 --- a/docs-src/Notes/Technology/Odds 'n' Ends/File copy oddities observed elsewhere - stuff to be reckoned with.md +++ b/docs-src/Notes/Technology/Odds 'n' Ends/File copy oddities observed elsewhere - stuff to be reckoned with.md @@ -44,9 +44,9 @@ Then, of course, I decided to just let those other HDDs rip a backup copy, while - nothing helps, another WTF+ and Process Hacker (yay! great tool!) is called in to do thee dead: kill 'em. - done. - Meanwhile, stuff starts to act weird, then weirder, until the Windows decides to quit to listen to *any* user input and it's total screen freeze time. -- after a bit (I was *slow* today) no bluescreen, but straight to black and reboot it is. -- this doesn't pan out as next thing what happens is the laptop throwing a tantrum, h0rking up a fatal bluescreen every time during boot. -- go ape. disconnect everything. Rip the power cord and accupack, make the bastard **cool boot** like it was on *ice*. +- after a bit (I was *slow* today) no blue-screen, but straight to black and reboot it is. +- this doesn't pan out as next thing what happens is the laptop throwing a tantrum, h0rking up a fatal blue-screen every time during boot. +- go ape. disconnect everything. Rip the power cord and accu-pack, make the bastard **cool boot** like it was on *ice*. - Windows Repair yada yada yada, lots of crap and an *oddly lethargic* boot sequence taking ages and maybe a few crashes along the way (screen going standby mode every once in a while there), but finally The Windows is back from the dead. Registry is filled with disk failures so it's prayer time. - reconnect the USB3 hub and drives. - check the running backup copy: all seems fine, as far as it got. (after all, it got aborted *hard* halfway through) diff --git a/docs-src/Notes/Technology/Odds 'n' Ends/GCC woes - DO NOT mix that bugger with other compilers.md b/docs-src/Notes/Technology/Odds 'n' Ends/GCC woes - DO NOT mix that bugger with other compilers.md index c4b830e14..682a2b3ef 100644 --- a/docs-src/Notes/Technology/Odds 'n' Ends/GCC woes - DO NOT mix that bugger with other compilers.md +++ b/docs-src/Notes/Technology/Odds 'n' Ends/GCC woes - DO NOT mix that bugger with other compilers.md @@ -27,13 +27,13 @@ http://math-atlas.sourceforge.net/errata.html#gccCrazy ## Relevance of this towards Qiqqa? -Been moving towards migrating the crucial bits of backend tech employed by the app, including the PDF renderer and OCR package (antiquated tesseract 3.x something). 
+Been moving towards migrating the crucial bits of back-end tech employed by the app, including the PDF renderer and OCR package (antiquated tesseract 3.x something). When you, like me, appreciate looking into viability of a new approach while you have *ideas* what the future should look like, user-visible functionality wise, you'll be investigating the lay of the land in OCR land, etc. And there you run into several libs used via Python -- if you forget or otherwise have reasons to give commercial solutions a pass^[in my case that would make the free Qiqqa tool very much **not free at all** if users have to pay license fees for extra software they need alongside to make the basic functionalities of the tool work] -- and binary distros of some libs that you can use. *However*, combining these is no sine cure if you want to forego the Python intermediate and link these libs together into a single application. OpenCV, etc. are involved here as OCR isn't an exact science and you need a bit of help to get your page images to OCR reasonably well when you've got a library to process that's not exactly "*mainstream recently published whitepapers only*". -Next thing that happens is you ignore your own adage about compiling everything from scratch (heck, it's a viability test, after all!) and several hours later you discover you're getting cactussed, not by the old famous Windows function stackframes discrepancies, but by entirely different stuff altogether. +Next thing that happens is you ignore your own adage about compiling everything from scratch (heck, it's a viability test, after all!) and several hours later you discover you're getting cactussed, not by the old famous Windows function stack-frames discrepancies, but by entirely different stuff altogether. (I'm running latest MSVC here; the gcc pre-built work came from a foreign place.) So viability tests cannot use precompiled binary **libraries** from anywhere. 
/Period./ diff --git a/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/PDF cannot be Saved.As in browser (Microsoft Edge).md b/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/PDF cannot be Saved.As in browser (Microsoft Edge).md index 3291ff78b..30b2a59b8 100644 --- a/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/PDF cannot be Saved.As in browser (Microsoft Edge).md +++ b/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/PDF cannot be Saved.As in browser (Microsoft Edge).md @@ -3,7 +3,7 @@ > **Note**: Also check these for more PDF download/fetching woes: > -> - [[curl - commandline and notes]] (sections about *nasty PDFs*) +> - [[../curl - command-line and notes]] (sections about *nasty PDFs*) > - [[Testing - Nasty URLs for PDFs]] > - [[Testing - PDF URLs with problems]] > diff --git a/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/Testing - Nasty URLs for PDFs.md b/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/Testing - Nasty URLs for PDFs.md index 04c4793b0..3959730d6 100644 --- a/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/Testing - Nasty URLs for PDFs.md +++ b/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/Testing - Nasty URLs for PDFs.md @@ -2,8 +2,8 @@ > **Note**: Also check these for more PDF download/fetching woes: > -> - [[curl - commandline and notes]] (sections about *nasty PDFs*) -> - [[PDF cannot be Saved.As in browser (Microsoft Edge)]] +> - [[../curl - command-line and notes|curl - command-line and notes]] (sections about *nasty PDFs*) +> - [[PDF cannot be Saved.As in browser (Microsoft Edge)|PDF cannot be 'Saved As' in browser (Microsoft Edge)]] > - [[Testing - PDF URLs with problems]] > @@ -156,7 +156,20 @@ - https://www.dizinot.com/upload/files/2016/12/AOS-AO4606.pdf : this one at least dumps the raw PDF binary content to screen in any browser due to incorrect(?) mimetype setup server-side. Only produces the PDF when done via "save as" popup menu entry in your web browser. Hence we can expect trouble when downloading this one using other tools, such as `curl`. - https://www.pnas.org/doi/epdf/10.1073/pnas.1708279115 : linux firefox requires popups to be enabled for the PDF to be downloaded. - https://opengrey.eu/ -- + + * https://www.proceedings.aaai.org/Papers/ICML/2003/ICML03-102.pdf : browser opens this one, after you explicitly accept to visit the site due to expired/wrong SSL certificate, but `wget` barfs with a HTTP 403 (access denied) error! Very strange indeed. + BTW: `cUrl` spits back HTTP ERROR 406 Not Acceptable. 
+ + PDFs with this problem, all from the same site: + - https://www.proceedings.aaai.org/Papers/ICML/2003/ICML03-000.pdf + - https://www.proceedings.aaai.org/Papers/ICML/2003/ICML03-001.pdf + - https://www.proceedings.aaai.org/Papers/ICML/2003/ICML03-002.pdf + - https://www.proceedings.aaai.org/Papers/ICML/2003/ICML03-011.pdf + - https://www.proceedings.aaai.org/Papers/ICML/2003/ICML03-102.pdf + +* + + ## HTML pages with problems diff --git a/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/Testing - PDF URLs with problems.md b/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/Testing - PDF URLs with problems.md index 8e8e4b6ff..91562e701 100644 --- a/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/Testing - PDF URLs with problems.md +++ b/docs-src/Notes/Technology/Odds 'n' Ends/PDFs in the Wild/Testing - PDF URLs with problems.md @@ -3,7 +3,7 @@ > **Note**: Also check these for more PDF download/fetching woes: > -> - [[curl - commandline and notes]] (sections about *nasty PDFs*) +> - [[../curl - command-line and notes]] (sections about *nasty PDFs*) > - [[PDF cannot be Saved.As in browser (Microsoft Edge)]] > - [[Testing - Nasty URLs for PDFs]] > - [[Testing - PDF URLs with problems]] diff --git a/docs-src/Notes/Technology/Odds 'n' Ends/curl - commandline and notes.md b/docs-src/Notes/Technology/Odds 'n' Ends/curl - command-line and notes.md similarity index 93% rename from docs-src/Notes/Technology/Odds 'n' Ends/curl - commandline and notes.md rename to docs-src/Notes/Technology/Odds 'n' Ends/curl - command-line and notes.md index 187fd8e83..7de22063d 100644 --- a/docs-src/Notes/Technology/Odds 'n' Ends/curl - commandline and notes.md +++ b/docs-src/Notes/Technology/Odds 'n' Ends/curl - command-line and notes.md @@ -1,4 +1,4 @@ -# curl :: commandline and notes +# curl :: command-line and notes ## Fetch a PDF from a website @@ -43,7 +43,7 @@ See [Does curl have a timeout? - Unix & Linux Stack Exchange](https://unix.stack > **Note**: Also check these for more PDF download/fetching woes: > -> - [[PDF cannot be Saved.As in browser (Microsoft Edge)]] +> - [[PDF cannot be Saved.As in browser (Microsoft Edge)|PDF cannot be 'Saved As' in browser (Microsoft Edge)]] > - [[Testing - Nasty URLs for PDFs]] > - [[Testing - PDF URLs with problems]] > @@ -158,6 +158,6 @@ It would also help a lot if the URL that produced the PDF is kept *with the PDF* [^1]: trouble with Windows ADS is two-fold: 1. it's non-portable. It may serve as a metadata *source* where we can extract the download URL for a given document file when there's Zone 3 metadata stored in its ADS, but that's about the size of it when it comes to *usefulness*. - 2. Many tools don't copy the ADS info with the file when it's moved, e.g. when using `robocopy` or other Windows commandline tools. It also looks like third-party file managers (e.g. the nice XYplorer) generally don't care much about ADS either, so the metadata attached that way is easily lost. Same goes for files which are bundled into a ZIP or other archive file and then extracted: the ADS info will be lost. + 2. Many tools don't copy the ADS info with the file when it's moved, e.g. when using `robocopy` or other Windows command-line tools. It also looks like third-party file managers (e.g. the nice XYplorer) generally don't care much about ADS either, so the metadata attached that way is easily lost. Same goes for files which are bundled into a ZIP or other archive file and then extracted: the ADS info will be lost. 
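Aside on the proceedings.aaai.org entries added in the *Testing - Nasty URLs for PDFs* hunk above (browser OK after accepting the bad certificate, `wget` 403, `curl` 406): a quick way to poke at such a server from code is to compare the status codes you get with and without a browser-like `User-Agent`. This is only a probe sketch -- the URL is one of the problem URLs listed above; everything else (class name, the UA string, disabling certificate checks) is illustrative, and a differing result does not by itself prove user-agent sniffing is the cause.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class PdfUrlProbe
{
    static async Task Main()
    {
        var url = "https://www.proceedings.aaai.org/Papers/ICML/2003/ICML03-102.pdf";

        // Certificate validation is switched off only because this site serves
        // an expired/wrong certificate; do NOT do this in production code.
        var handler = new HttpClientHandler
        {
            ServerCertificateCustomValidationCallback = (_, _, _, _) => true
        };
        using var client = new HttpClient(handler);

        foreach (var ua in new[] { null, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" })
        {
            // HEAD keeps the probe cheap; some servers treat HEAD differently,
            // so repeat with HttpMethod.Get if the numbers look suspicious.
            using var req = new HttpRequestMessage(HttpMethod.Head, url);
            if (ua != null) req.Headers.TryAddWithoutValidation("User-Agent", ua);
            using var resp = await client.SendAsync(req);
            Console.WriteLine($"UA = {ua ?? "<none>"} -> {(int)resp.StatusCode} {resp.StatusCode}");
        }
    }
}
```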
diff --git a/docs-src/Notes/Technology/Odds 'n' Ends/git - recovering from b0rked repos and systems.md b/docs-src/Notes/Technology/Odds 'n' Ends/git - recovering from b0rked repos and systems.md index 7d0450933..d99e99cf4 100644 --- a/docs-src/Notes/Technology/Odds 'n' Ends/git - recovering from b0rked repos and systems.md +++ b/docs-src/Notes/Technology/Odds 'n' Ends/git - recovering from b0rked repos and systems.md @@ -2,7 +2,7 @@ Had a fatal systems failure while running MuPDF (Qiqqa-related) bulk tests overnight. -End result: cold hard boot was required (which gave a dreaded win10 BSOD ("Page Fault") during intial bootup, so that was a sure sign things had gone the way of the Dodo...) +End result: cold hard boot was required (which gave a dreaded win10 BSOD ("Page Fault") during initial bootup, so that was a sure sign things had gone the way of the Dodo...) End result: Qiqqa + MuPDF git repos are b0rked; bad ref reports, tortoisegit crashes on startup, etc. diff --git a/docs-src/Notes/Why I consider OpenMP the spawn of evil and disable it in tesseract.md b/docs-src/Notes/Why I consider OpenMP the spawn of evil and disable it in tesseract.md index 924ffd999..7e3c96364 100644 --- a/docs-src/Notes/Why I consider OpenMP the spawn of evil and disable it in tesseract.md +++ b/docs-src/Notes/Why I consider OpenMP the spawn of evil and disable it in tesseract.md @@ -1,6 +1,6 @@ # Why I consider OpenMP the *spawn of evil* and disable it in `tesseract` -This already showed up on my radar in 2022 when bulk-testing our `mupdf` + `tesseract` monolith Windows build: it didn't matter which subcommand of `bulktest` I was running, the fans would start spinning and CPU temperatures would rise quick as they could until a new *burn motherfucker burn* versus cooling solution equilibrium was reached at almost 80 degrees Celcius *core*. Which was, frankly, *insane*, for 'twas simple *single threaded* tasks on an *8/16 core* Ryzen mammoth. +This already showed up on my radar in 2022 when bulk-testing our `mupdf` + `tesseract` monolith Windows build: it didn't matter which sub-command of `bulktest` I was running, the fans would start spinning and CPU temperatures would rise quick as they could until a new *burn motherfucker burn* versus cooling solution equilibrium was reached at almost 80 degrees Celsius *core*. Which was, frankly, *insane*, for 'twas simple *single threaded* tasks on an *8/16 core* Ryzen mammoth. Profiling in Visual Studio plus some debugging showed me some obscure OpenMP internals decided to run like mad, without having been told to do so by yours truly. @@ -8,7 +8,7 @@ That analysis led to `tesseract` [where some OpenMP attributes are sprinkled aro The preliminary conclusion then was that OpenMP *somehow*, stupidly, decided to start all the threads required for *when* and *if* all that parallel exec power was demanded inside the belly of `tesseract`, and, then, once started, **never stopped running, i.e. waiting for stuff to do and spinning like mad while doing the wait thing**. (*All* the cores were maxing out once this started.) -Key to the problem was it always occurred *after* some (minimal) `tesseract` activity, which was part of the randomized bulk test, which is set up to stress test several main components, feeding it a slew of PDFs to chew through. 
Several of those tasks are single-threaded per PDF and to help built-in diagnostics I don't parallelize multiple PDFs for simultaneous testing: while that *will* be a good idea at some point, it's *not exactly helping yourself* when you go and make a multi-threaded parallel execution test rig while you're still hunting obnoxious bugs that occur in single-thread runs, so having OpenMP in there was only there because of `tesseract` having it by default and me considering this as a potential *nice to have & use* system for when I would be ready to migrate my codebase to use it elsewhere in the codebase as well. +Key to the problem was it always occurred *after* some (minimal) `tesseract` activity, which was part of the randomized bulk test, which is set up to stress test several main components, feeding it a slew of PDFs to chew through. Several of those tasks are single-threaded per PDF and to help built-in diagnostics I don't parallelize multiple PDFs for simultaneous testing: while that *will* be a good idea at some point, it's *not exactly helping yourself* when you go and make a multi-threaded parallel execution test rig while you're still hunting obnoxious bugs that occur in single-thread runs, so having OpenMP in there was only there because of `tesseract` having it by default and me considering this as a potential *nice to have & use* system for when I would be ready to migrate my code-base to use it elsewhere in the code-base as well. However, once OpenMP would have run through one such an [`openmp`-enhanced code section](https://github.com/search?q=repo%3AGerHobbelt%2Ftesseract++pragma+omp&type=code), it was apparently "*Bijenkorf Dwaze Dagen*"[^1] at the CPU forever after. The solution, I found, was getting rid of OpenMP entirely. Meanwhile I wondered why everyone is using this one while it clearly is coded like *utter shite*, at least from where I'm standing... And nobody complaining about this?[^2]
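Aside: for anyone stuck with a stock OpenMP-enabled `tesseract` build who cannot (or does not want to) rebuild it without OpenMP, the standard OpenMP environment variables `OMP_THREAD_LIMIT` / `OMP_NUM_THREADS` can at least cap the runaway thread pool from the outside. A minimal launcher sketch; the executable path and arguments are placeholders, and this is a stop-gap, not the rip-it-out solution argued for above:

```csharp
using System;
using System.Diagnostics;

static class TesseractLauncherSketch
{
    static void Main()
    {
        var psi = new ProcessStartInfo
        {
            FileName = "tesseract",                     // assumes tesseract is on PATH
            Arguments = "page-0001.png page-0001 hocr", // placeholder arguments
            UseShellExecute = false,                    // required to edit the child environment
            RedirectStandardOutput = true,
        };
        // Standard OpenMP knobs: cap the pool to a single thread so an idle
        // OpenMP runtime cannot spin up and busy-wait on every core.
        psi.Environment["OMP_THREAD_LIMIT"] = "1";
        psi.Environment["OMP_NUM_THREADS"] = "1";

        using var proc = Process.Start(psi)!;
        Console.WriteLine(proc.StandardOutput.ReadToEnd()); // whatever tesseract prints
        proc.WaitForExit();
    }
}
```

Whether capping the pool actually stops the busy-wait depends on the OpenMP runtime baked into the build; the only cure observed here remains compiling `tesseract` without OpenMP altogether.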