(Obsidian) typo/spelling fixes
GerHobbelt committed May 4, 2023
1 parent c089745 commit 51435ee
Showing 72 changed files with 182 additions and 152 deletions.
2 changes: 1 addition & 1 deletion docs-src/Notes/Aside/How to attach raw data to a PDF.md
@@ -6,4 +6,4 @@ Regrettably, this almost never happens -- including research/university circles,
So we generally succumb to the reality of *scraping content*, whether it's merely to obtain the obvious metadata elements (title, authors, publishing venue, publishing date, *abstract*, ...) or data/charts. Many of us probably won't even realize we are scraping like that -- and what the consequences are for our extracted data quality for our collection at large -- because it's so pervasive, despite bibliographic sites and metadata-offering websites, e.g. Google Scholar.

Once you have \[scraping] tools, you're not set for life either! Then it turns into a statistics game: how good are your tools, how much effort are you willing to spend (cleaning and/or cajoling your tools to do your bidding) and what is the target accuracy/veracity of your extracted data?
Once you have \[scraping\] tools, you're not set for life either! Then it turns into a statistics game: how good are your tools, how much effort are you willing to spend (cleaning and/or cajoling your tools to do your bidding) and what is the target accuracy/veracity of your extracted data?
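
To make the "scraping" above concrete: many scholarly landing pages expose Highwire-style `citation_*` meta tags (the ones Google Scholar reads), and the obvious metadata elements can often be lifted from those. A minimal sketch, assuming the `requests` and `beautifulsoup4` libraries; the helper below is illustrative, not part of any existing Qiqqa tooling:

```python
import requests
from bs4 import BeautifulSoup

def scrape_citation_metadata(url: str) -> dict[str, list[str]]:
    """Collect <meta name="citation_*"> fields: title, authors, venue, date, ..."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    fields: dict[str, list[str]] = {}
    for tag in soup.find_all("meta"):
        name = tag.get("name") or ""
        if name.startswith("citation_"):
            fields.setdefault(name, []).append(tag.get("content") or "")
    return fields
```

Even on pages that do carry these tags, their quality varies wildly -- which is exactly the statistics game described above.
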
@@ -1,6 +1,9 @@
# The World of Data Extraction and Re-use :: PDF Reading, Annotating and Content Extraction

If you want to do more with PDFs than merely *read* them on screen^[for which there's Adobe Acrobat and if you don't like it, there's a plethora of PDF *Viewers* out there, e.g. [SumatraPDF](https://www.sumatrapdfreader.org/free-pdf-reader) and [FoxIt](https://www.foxit.com/).], there's trouble ahead.
If you want to do more with PDFs than merely *read* them on screen[^adobe], there's trouble ahead.

[^adobe]: for which there's Adobe Acrobat and if you don't like it, there's a plethora of PDF *Viewers* out there, e.g. [SumatraPDF](https://www.sumatrapdfreader.org/free-pdf-reader) and [FoxIt](https://www.foxit.com/).


## Searchable text, anyone?

4 changes: 2 additions & 2 deletions docs-src/Notes/Misc notes for developers.md
@@ -21,9 +21,9 @@

If you run into stuff that's weird, that you don't grok or otherwise have trouble with, please give me feedback on that so it can be improved. I'm not a technical writer (e.g. my verbiage may get too flowery) and there's already a lot of gain when those pages end up as a useful and *readable* resource for you and others-to-come.

- Might be good to try using the DevStudio debugger on Qiqqa. Though it is a multi-threaded application and that can be confusing with multiple background task copies running in parallel when you debug, so there's a place in the code where you can dial the number of detected processor cores down to 1 and/or disable background tasks entirely -- I coded that but I have to look up the precise spot as it dropped from active memory. The useful bit here is being aware that you can dial down the mayhem when stepping through background or foregound code by limiting/killing the parallelism.
- Might be good to try using the DevStudio debugger on Qiqqa. Though it is a multi-threaded application and that can be confusing with multiple background task copies running in parallel when you debug, so there's a place in the code where you can dial the number of detected processor cores down to 1 and/or disable background tasks entirely -- I coded that but I have to look up the precise spot as it dropped from active memory. The useful bit here is being aware that you can dial down the mayhem when stepping through background or foreground code by limiting/killing the parallelism.

- Also be aware that Qiqqa now has a commandline argument to specify a different base directory for the Qiqqa libraries: that's quite handy when you want to debug/test Qiqqa on a test rig or observe how it behaves when it's a "fresh install", i.e. no user library yet. This can be set in DevStudio Project Debug section and is passed to Qiqqa when you start it from DevStudio to run [Ctrl+F5] or debug [F5]
- Also be aware that Qiqqa now has a command-line argument to specify a different base directory for the Qiqqa libraries: that's quite handy when you want to debug/test Qiqqa on a test rig or observe how it behaves when it's a "fresh install", i.e. no user library yet. This can be set in DevStudio Project Debug section and is passed to Qiqqa when you start it from DevStudio to run [Ctrl+F5] or debug [F5]

![](assets/devstudio-project-debug-arguments-view.png)
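
A rough sketch of the two debugging affordances described in the notes above: clamping the detected core count down to a single worker, and accepting an alternative base directory for the libraries. Qiqqa itself is a C#/.NET application and its real flag and function names differ; everything below is illustrative Python only.

```python
import argparse
import os
from concurrent.futures import ThreadPoolExecutor

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="illustrative debug switches")
    parser.add_argument("--base-dir", default=None,
                        help="alternative base directory for the libraries (test rig / 'fresh install' runs)")
    parser.add_argument("--single-threaded", action="store_true",
                        help="pretend there is only one core so background work is easy to step through")
    return parser.parse_args()

def make_worker_pool(single_threaded: bool) -> ThreadPoolExecutor:
    # Dial the parallel mayhem down to a single worker while debugging.
    workers = 1 if single_threaded else (os.cpu_count() or 1)
    return ThreadPoolExecutor(max_workers=workers)
```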

10 changes: 5 additions & 5 deletions docs-src/Notes/Processing other document types/HTML.md
@@ -8,26 +8,26 @@ Next to that, there's notes, comments/critiques and (regrettably relevant some t

Now we could take the position where we store everything we encounter as PDF, which is doable most of the time, but sub-optimal, both in formatting and storage costs per item: the HTML-to-PDF renderers available do their best, but most such web-based publications have never been near a page-layout CSS style sheet in their life, or such page-layout "print output" work is done in a rather low-quality fashion: creating great print-ready stylesheets is no *sinecure* and often requires customization per document. Combine this with the relatively low priority for the originator to deliver this type of format for their own ultimate benefit and you end up with good will and a "*meh*" result 9 times out of 10.
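
As an aside, driving a headless HTML-to-PDF renderer takes very little code -- the hard part is the print stylesheet, not the plumbing. A minimal sketch, assuming the WeasyPrint library (not something Qiqqa currently ships):

```python
from weasyprint import HTML

def render_page_to_pdf(url: str, out_path: str) -> None:
    # Renders with whatever @media print styling the page happens to provide;
    # for most web publications that styling is poor or absent.
    HTML(url=url).write_pdf(out_path)
```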

We also should not forget that, under the hood, we do a lot of work to get back from PDF to a HTML-kind of content format -- you may not notice it as such, but any PDF text extraction process, whether OCR assisted or not, is hard-pressed to produce a continuous data stream uninterrupted by obnoxious page numbers and other page-boundary content that's not relevant to the document flow and the information provided there-in. When we say we're interested in formats such as hOCR, we're not looking at an *addition to the Qiqqa feature set* but rather wonder whether we can store the current text extracts any way **smarter**. Thus the obvious "*least common denominator format for **accessible & processable** document storage*" is HTML/hOCR, rather than PDF. PDF is only very handy because we are, in our own way, *librarians* and thus want to keep the **original source** around for posterity/reference. The academic field (and anyone sane) demands that: no original sources, no back-up for your arguments. *Fake News* and similar patterns make this a still-insecure approach, but it's the best we've got, so storing and keeping those original sources *intact* is a *must*. And there is where PDF shines -- unless our *original sources* are already HTML web pages themselves: then PDF, a page based storage system, shows its non-web roots and fails to deliver. So we would do well to copy/*mirror* web pages we wish to import as *documents* -- keep in mind that the *average* website pages' "*lifetime*" is about 2 years, according to some research and a lot of personal experience. Keep your own copy, which can reproduce the page that *was*, isn't just a nice *hobby* or to be relegated to a visionary like The Wayback Machine: keeping a local copy is an essential part of being a library.
We also should not forget that, under the hood, we do a lot of work to get back from PDF to a HTML-kind of content format -- you may not notice it as such, but any PDF text extraction process, whether OCR assisted or not, is hard-pressed to produce a continuous data stream uninterrupted by obnoxious page numbers and other page-boundary content that's not relevant to the document flow and the information provided there-in. When we say we're interested in formats such as hOCR, we're not looking at an *addition to the Qiqqa feature set* but rather wonder whether we can store the current text extracts any way **smarter**. Thus the obvious "*least common denominator format for **accessible & processable** document storage*" is HTML/hOCR, rather than PDF. PDF is only very handy because we are, in our own way, *librarians* and thus want to keep the **original source** around for posterity/reference. The academic field (and anyone sane) demands that: no original sources, no back-up for your arguments. *Fake News* and similar patterns make this a still-insecure approach, but it's the best we've got, so storing and keeping those original sources *intact* is a *must*. And there is where PDF shines -- unless our *original sources* are already HTML web pages themselves: then PDF, a page based storage system, shows its non-web roots and fails to deliver. So we would do well to copy/*mirror* web pages we wish to import as *documents* -- keep in mind that the *average* website pages' "*lifetime*" is about 2 years, according to some research and a lot of personal experience. Keep your own copy, which can reproduce the page that *was*, isn't just a nice *hobby* or to be relegated to a visionary like The WayBack Machine: keeping a local copy is an essential part of being a library.
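
As a small illustration of that page-boundary noise, here is what naive per-page extraction looks like, assuming the pypdf library (this is not Qiqqa's actual extraction pipeline):

```python
from pypdf import PdfReader

def naive_text_stream(path: str) -> str:
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    # Joining pages like this keeps running headers, footers and page numbers
    # embedded in the "continuous" text -- exactly the noise discussed above.
    return "\n".join(pages)
```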

Hence we should be able to store any HTML page *mirror*, i.e. including its CSS, images, etc.

*Possibly* we should store those pages as *DOM snapshots* so we are not dependent on changing and buggy (or disappeared) JavaScript code loaded from third-party sites as part of the web page render. The only drawback there is that you cannot really snapshot a web page that's very dynamic, i.e. has all kinds of fancy JS-driven content-hiding and showing/revealing built in. That is regrettable, but it does not diminish the value of having the capability to mirror/snapshot a given web page; it rather points at the technical complexities involved when we want to achieve this goal.
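
A minimal sketch of taking such a *DOM snapshot*, assuming the Playwright library: the DOM is serialized after the page's scripts have run, so the stored copy no longer depends on third-party JavaScript staying available.

```python
from playwright.sync_api import sync_playwright

def snapshot_dom(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()  # full serialized DOM, post-JavaScript
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(html)
        browser.close()
```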


## Do we consider storing other *original source* formats, given the logic above? Youtube movies?
## Do we consider storing other *original source* formats, given the logic above? YouTube movies?

No. At least not at this moment.

Storing *multimedia* like that will require a few other additions (viewing, processing: CC (Closed Captions) as text extract, etc. -- what to do there and how doable is it, really?!?!) so we'll stop at *documents* which have a large(-ish) *text content* component, which we can then index and search. For video-based multimedia, that *search process* is still the field of a few select players (e.g. automatic CC production in the original language, thanks to speech recognition, as done by Google/Youtube).
Storing *multimedia* like that will require a few other additions (viewing, processing: CC (Closed Captions) as text extract, etc. -- what to do there and how doable is it, really?!?!) so we'll stop at *documents* which have a large(-ish) *text content* component, which we can then index and search. For video-based multimedia, that *search process* is still the field of a few select players (e.g. automatic CC production in the original language, thanks to speech recognition, as done by Google/YouTube).



## Do we support more formats?

No. We're not The Wayback Machine: the purpose of all *original source* collecting and storing is to be produce both the original and the text information available therein so we can *find* stuff we seek and *analyze* that (text) content to help our discovery investigative processes.
No. We're not The WayBack Machine: the purpose of all *original source* collecting and storing is to be produce both the original and the text information available therein so we can *find* stuff we seek and *analyze* that (text) content to help our discovery investigative processes.

Storing stuff like MSword documents, etc. may sound nice at first, but makes for a quite different library approach: we then SHOULD also offer easy reading/viewing capabilities for all those formats and that's not what we wish to spend our efforts on; meanwhile storing these various formats and only having PDF or HTML/hOCR formats easy-to-read, makes that limited (yet already very complex!) capability a nuisance, rather than a boon. As we cannot treat those other storage formats democratically from a view/render perspective, we are all best best served by not accepting those formats: after all, (almost) all of them can be transported to either the page-oriented PDF format or the continuous-stream-oriented HTML format.
Storing stuff like MS-word documents, etc. may sound nice at first, but makes for a quite different library approach: we then SHOULD also offer easy reading/viewing capabilities for all those formats and that's not what we wish to spend our efforts on; meanwhile storing these various formats and only having PDF or HTML/hOCR formats easy-to-read, makes that limited (yet already very complex!) capability a nuisance, rather than a boon. As we cannot treat those other storage formats democratically from a view/render perspective, we are all best served by not accepting those formats: after all, (almost) all of them can be transported to either the page-oriented PDF format or the continuous-stream-oriented HTML format.

With those two we have covered 99.9% of the *original sources* market as far as I'm concerned.

@@ -22,7 +22,7 @@ https://i.imgur.com/On83uvH.mp4
---


(Hm, image drag&drop from website behaves wierd: file selector for local storage comes up, then MM seems to fail or lock up?!)
(Hm, image drag&drop from website behaves weird: file selector for local storage comes up, then MM seems to fail or lock up?!)

---

@@ -1326,5 +1326,5 @@ https://stackoverflow.com/questions/23205202/user-annotation-overlay-in-html5-ja
Installing h in a development environment — h 0.0.2 documentation
https://h.readthedocs.io/en/latest/developing/install/



@@ -26,7 +26,7 @@ If you don't want or like the annotations in the team library, you can always co

## Further thoughts

If folks edit or annotate documents, the idea was to import (and track) each revision of the document: this number can grow when edits happen over a longer period of time and/or in small increments, thus resulting in a large storage cost. Might we consider some sort of *delta compression* here, *iff* that's feasible at all, given the freakiness of the PDF format: a small edit may include a (hidden/non-obvious) restructuring of the binary file layout! Hmmmmmm, might not be bothered with it: if it gets too much, we should allow to easily erase "unimportant revisions", e.g. the ones between the first and the last -- I would personally keep the fist version as a marker of where we started. Besides, the first version is often the 'initial download copy' and relevant due to that fact alone: we should always keep an 'original source copy' (no matter how b0rked it may be). So I strongly prefer to keep at least the first and last revisions at all times, even forbidding deleting those when the document isn't nuked from orbit *entirely already*.
If folks edit or annotate documents, the idea was to import (and track) each revision of the document: this number can grow when edits happen over a longer period of time and/or in small increments, thus resulting in a large storage cost. Might we consider some sort of *delta compression* here, *iff* that's feasible at all, given the freakishness of the PDF format: a small edit may include a (hidden/non-obvious) restructuring of the binary file layout! Hmmmmmm, might not be bothered with it: if it gets too much, we should allow to easily erase "unimportant revisions", e.g. the ones between the first and the last -- I would personally keep the fist version as a marker of where we started. Besides, the first version is often the 'initial download copy' and relevant due to that fact alone: we should always keep an 'original source copy' (no matter how b0rked it may be). So I strongly prefer to keep at least the first and last revisions at all times, even forbidding deleting those when the document isn't nuked from orbit *entirely already*.

Thus the question becomes: what does the user feel is an "unimportant revision"? E.g.: small edits of the OCR-extracted text indicate typo fixes: we always keep the latest, which would imply we then kill the older revision. Another example: 'unimportant' would be adding an annotation when the document (or page?) already has annotations; meanwhile we have the heuristic that adding an annotation where there were *none* before should be considered an *important* edit: then, when we nuke unimportant edits, we consequently end up with the last unannotated revision, followed by the latest fully annotated revision: all the intermediate steps (revisions) are nuked due to the 'unimportant update' rule.
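
A toy sketch of that pruning heuristic -- always keep the first and last revisions, drop only the intermediates flagged as unimportant. None of this is existing Qiqqa behaviour; the names and the `important` flag are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Revision:
    id: str
    important: bool

def prune(revisions: list[Revision]) -> list[Revision]:
    if len(revisions) <= 2:
        return revisions[:]  # the first and last (or sole) revisions are always kept
    first, last = revisions[0], revisions[-1]
    kept_middle = [r for r in revisions[1:-1] if r.important]
    return [first, *kept_middle, last]
```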

