Pausing at the end of each line when listening to pdf #24

Open
killefid opened this issue Jan 28, 2022 · 1 comment

Comments


killefid commented Jan 28, 2022

Actually, this is the best option I have found for listening to text in PDF files, irrespective of the PDF viewer I choose.
But... using Talkie in Firefox, I get a pause at the end of each line. The intonation drops as it would at the end of a sentence. Do you have a solution or workaround for this?

@joelpurra
Owner

@killefid: reading text from PDF documents can be tricky, because what you see (visual representation) is not always the same as what it originally was (for example a paragraph of text) during the document creation.

Your issue is that reading text from a PDF document in Firefox introduces a pause at the end of each line.

  1. If you copy-paste the text into a word processor or text editor, do you see any unexpected text characters?
  2. Are all the line breaks (at the end of each line/column) from the PDF document included in the pasted text?
  3. Optionally, as this is rather technical, can you verify if there are any "invisible" (hidden) control characters in the copied text?
  4. Can you verify if reading PDF documents using Talkie in Google Chrome, or another Chromium-based browser, is working better on your system?

If you have enabled reading longer text in Talkie's settings (this is the default in Firefox), then the text is passed to Firefox's speech synthesizer one paragraph at a time. Depending on the PDF document as well as the PDF viewer, text coming from a PDF might have a lot of line breaks -- which Talkie then interprets as paragraph separators. Splitting on line breaks seems to work well for HTML pages, where getting the selected text is less affected by layout issues and a paragraph is a native "unit" of text.
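
To illustrate the effect, here is a minimal sketch, assuming that every single line break is treated as a paragraph separator. The function name `splitOnEveryLineBreak` and the sample texts are made up for this example and are not taken from Talkie's source code.

```typescript
// Hypothetical helper, not Talkie's actual code: treat every line break as a
// paragraph separator and drop parts which are empty after trimming.
function splitOnEveryLineBreak(text: string): string[] {
  return text
    .split(/\r\n|\r|\n/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Text as it might be copied from an HTML page: an empty line between paragraphs.
const htmlSample = "First paragraph.\n\nSecond paragraph.";

// Text as it might be copied from a PDF viewer: one line break per visual line.
const pdfSample = "This sentence was wrapped\nover three lines in\nthe PDF layout.";

const htmlParts = splitOnEveryLineBreak(htmlSample);
const pdfParts = splitOnEveryLineBreak(pdfSample);
```

With the HTML-style sample the parts match the paragraphs. With the PDF-style sample a single sentence ends up as three separate parts, and each part boundary is a candidate for an audible pause.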

This text splitting behavior may change in future versions of Talkie, in particular as there are already other discussions which involve changing how text is split into parts (#22, #13). Changing how text is split into paragraphs, perhaps by only splitting when there are two line breaks in sequence, may fix some of the PDF pause issues. I would have to verify that it doesn't affect reading other (primarily HTML) text. While this issue could possibly be fixed separately, I'll see if it needs to be performed during a larger rework together with the related issues.
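
A rough sketch of the "two line breaks in sequence" idea could look like the following. It is only an illustration under my own assumptions: the name `splitOnBlankLines` is hypothetical, and replacing the remaining single line breaks with a space is one possible choice, not a decided design.

```typescript
// Hypothetical sketch: only two or more line breaks in sequence (optionally
// with whitespace-only lines in between) separate paragraphs. Single line
// breaks inside a paragraph are replaced with a space.
function splitOnBlankLines(text: string): string[] {
  return text
    .replace(/\r\n|\r/g, "\n")
    .split(/\n(?:[ \t]*\n)+/)
    .map((paragraph) => paragraph.replace(/\s*\n\s*/g, " ").trim())
    .filter((paragraph) => paragraph.length > 0);
}

const wrappedSample = "Line one\nline two\n\nNext paragraph\ncontinues.";
const wrappedParagraphs = splitOnBlankLines(wrappedSample);
```

One trade-off to keep in mind: if the copied PDF text has no empty lines between paragraphs at all, this approach merges everything into one long part.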

I'll keep this issue open. Thanks for reporting!


Below is some background (aka relatively unedited brain dump) on why reading text from PDF documents is tricky. Meant to write this somewhere, perhaps as part of the FAQ in Talkie, but starting out by writing it here now that I am thinking about it anyways. It may be moved to another page at a later time.


Issues reading PDF documents in Talkie

Selecting and reading text does not always work as expected in PDF documents/files, and sometimes Talkie has problems with it. Spoken text may be unnecessarily choppy with pauses between lines, characters may be spoken one by one, special characters may be skipped, etcetera. Here's some background on the PDF format, how PDF viewers work, and how Talkie handles text, to help explain the issue.

The PDF file format

Portable Document Format (PDF) is a file format (from ~1991) based on the PostScript (PS) file format (from ~1982), originally intended for printing documents on physical paper. PDF documents are more focused on how the physical paper output looks, even if you are merely looking at a digital representation of a sheet of paper on the screen.

The print focus means that when a PDF file is created in a document/layout program, what was originally a paragraph of text may be broken up into lines, or individual (non-connected) characters, or even the visual outline of individual characters (resulting in graphics, not text). On top of this, some PDF documents use font substitution tricks (for both file compression and "security") which garble the original text characters without affecting the visual/print output.

With the PDF file format history in mind, PDF viewers are primarily concerned with rendering a digital image of the document. Text flow, in the sense that the encoded visual representation can still be digitally interpreted as continuous text in the original order, is only important when the text is consumed by users directly on a digital device. It may be possible to select/copy text for use with Talkie or a screen reader, but in my experience this is far from guaranteed. Historically, one is lucky if able to select sequential text at all.

There are ways to interpret the rendered/visual PDF back to flowing text, by applying heuristics on how humans would interpret the text. This depends on how the document is styled, but may work well for plain-looking documents. Documents using custom fonts (including the aforementioned font substitution trick) may even need optical character recognition (OCR) as if scanning an already printed piece of paper.

There is an optional way to retain the original text for digital use, by outputting tagged PDF documents during the original document creation. The PDF/A-1a (PDF Archive, Part 1, Level A (Accessible) conformance) standard (from 2005) requires that the original text is tagged and included in the PDF document, both for archival and accessibility reasons. There is even a specific PDF/UA (PDF Universal Accessibility) standard (from 2012) focused on creating PDF files which also include the original document's semantic structure, and are more easily navigated and read non-visually. This includes best practices for tagged PDF documents, enabling specialized PDF viewers to, for example, reflow text to fit a device screen as well as enable text-to-speech functionality.

Viewing PDF documents in browsers

(Standalone PDF viewer applications have not yet been tested/verified.)

General PDF viewers have been improving their text handling, both in standalone applications and in built-in browser support. For example, the open source project PDF.js, which is (by default) used internally by Firefox, has (partial) support for tagged PDF documents (since 2021). This helps when displaying tagged PDF documents, but heuristics are still required to support text in non-tagged PDF documents.

Inspecting the code of a PDF.js rendering of a PDF file shows that for each page there is a <canvas> element with an image of the entire page. On top of that image is a text layer with invisible <span> text elements, mapping as closely to the rendered text as it can. The approach seems very similar in Google Chrome, but there the actual HTML elements seem "hidden" behind an <embed> element loading a proprietary PDF plugin. This invisible text generated in the browser is the text that is selectable, and thus readable by Talkie.

By selecting all text on a PDF page in the browser, it may be possible to see if the text flow is what one would expect. Confirm this by copy-pasting the text into a word processor or text editor, and see if it matches the expected contents from the document, down to the level of individual characters. This may include characters not usually noticed when reading a document, as they are too common to think about. Examples include line breaks at the end of a line and hyphens for word wrapping, which become hard-coded parts of the text layout. PDF viewers may try to mitigate excessive line breaks in the copyable text, but it comes down to trade-offs where it may be easier to err on the side of caution. For example, text in the header/footer of a page should also be copyable, but it may repeat when copying text from across multiple pages at once.

It is also possible to use programming tools to analyze the character code points and, depending on the document, one may also find hidden control characters. I found 0x18 CAN "cancel" characters in both Google Chrome and Firefox, but some of the 0x18 characters in Google Chrome were output as 0x19 EM "end of medium" characters in Firefox. These control codes in the copied text, while related to the PDF
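
As an example of such a programming tool, here is a small sketch which lists control characters in a copied piece of text. The function `findControlCharacters` and the sample string are hypothetical, and the choice to ignore tab, line feed and carriage return is my own assumption about which characters are "expected".

```typescript
interface ControlCharacter {
  index: number; // position in the string (UTF-16 code unit index)
  hex: string; // code formatted as, for example, "0x18"
}

// Hypothetical helper: report C0 control characters (and 0x7F DEL), except
// for tab, line feed and carriage return, which are expected in copied text.
function findControlCharacters(text: string): ControlCharacter[] {
  const found: ControlCharacter[] = [];
  for (let index = 0; index < text.length; index++) {
    const code = text.charCodeAt(index);
    const isExpectedWhitespace = code === 0x09 || code === 0x0a || code === 0x0d;
    const isControl = code < 0x20 || code === 0x7f;
    if (isControl && !isExpectedWhitespace) {
      const digits = code.toString(16).toUpperCase();
      found.push({ index, hex: "0x" + (digits.length < 2 ? "0" + digits : digits) });
    }
  }
  return found;
}

// Made-up sample with a 0x18 CAN and a 0x19 EM character.
const copiedSample = "end of line\u0018next line\u0019";
const controlHits = findControlCharacters(copiedSample);
```

Pasting text copied from a PDF viewer into such a function (for example in the browser's developer console, after removing the type annotations) shows where in the text the hidden characters are located.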

Text handling in Talkie

(The below is subject to change as Talkie adapts and evolves.)

When text is selected to be read in Talkie, it is first analyzed and prepared for the speech synthesizer.

  1. The text is split into paragraphs. This is to remove empty lines (between paragraphs) and to avoid browser speech synthesizers' text length limitations.
  2. Depending on Talkie's settings, the text is split into sentences and clauses, as well as possible. This is also to avoid browser limitations, in particular a bug in Google Chrome (Speech stops after 22-25 works spoken #1). See Talkie's setting for reading longer text.
    • This sentence splitting is enabled by default in Google Chrome.
    • This sentence splitting is disabled by default in Firefox.
  3. The text parts are fed, one by one, to the browser's speech synthesizer.
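
The three steps above can be sketched roughly as below. This is a simplified illustration, not Talkie's actual implementation: all function names are hypothetical, the sentence splitting is a naive regular expression, and the speech synthesizer is represented by a callback so that the sketch stays independent of the browser.

```typescript
// Step 1 (sketch): split into paragraphs on line breaks, dropping empty lines.
function splitIntoParagraphs(text: string): string[] {
  return text
    .split(/\r\n|\r|\n/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0);
}

// Step 2 (sketch): naive sentence splitting on ".", "!" and "?".
function splitIntoSentences(paragraph: string): string[] {
  const matches = paragraph.match(/[^.!?]+[.!?]*/g);
  if (matches === null) {
    return [paragraph];
  }
  return matches.map((sentence) => sentence.trim()).filter((sentence) => sentence.length > 0);
}

// Steps 1 and 2 combined; sentence splitting is optional, as in the settings.
function prepareTextParts(text: string, splitSentences: boolean): string[] {
  const paragraphs = splitIntoParagraphs(text);
  if (!splitSentences) {
    return paragraphs;
  }
  const parts: string[] = [];
  for (const paragraph of paragraphs) {
    for (const sentence of splitIntoSentences(paragraph)) {
      parts.push(sentence);
    }
  }
  return parts;
}

// Step 3 (sketch): feed the parts, one by one, to a speak callback. In a
// browser the callback could wrap the Web Speech API.
function speakParts(parts: string[], speak: (part: string) => void): void {
  for (const part of parts) {
    speak(part);
  }
}
```

Calling `prepareTextParts(text, false)` corresponds to reading longer text one paragraph at a time, while `prepareTextParts(text, true)` corresponds to the sentence splitting mode.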

As mentioned, reading longer text is the default in Firefox, as longer text does not seem to pose a problem -- the specified limit is 32,767 characters. Talkie does not (currently) implement that length limit check, as each paragraph is expected to be shorter.
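
If such a check were added, it could be as simple as the sketch below. The constant and function names are hypothetical, and measuring the length in UTF-16 code units (JavaScript's `string.length`) is an assumption on my part about how the limit is counted.

```typescript
// The limit mentioned above; how it is counted is an assumption here.
const ASSUMED_MAX_UTTERANCE_LENGTH = 32767;

// Hypothetical check: return the indexes of the text parts which are longer
// than the limit, so they can be split further before being spoken.
function findOverlongParts(
  parts: string[],
  maxLength: number = ASSUMED_MAX_UTTERANCE_LENGTH
): number[] {
  const overlong: number[] = [];
  parts.forEach((part, index) => {
    if (part.length > maxLength) {
      overlong.push(index);
    }
  });
  return overlong;
}
```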

PDF text handling in Talkie

The first issue is that Talkie has no concept whatsoever of what a PDF document is. This is by design, since the PDF file format is way too complex for Talkie to be concerned with. Instead, Talkie relies on external PDF viewers to parse the text and feed it to Talkie -- either via the built-in PDF viewer in the browser, or by passing text via the system clipboard. Either way, this gives Talkie only second-hand information to work with.

As text does not always "flow" in PDF documents, and is only optionally tagged to indicate how it relates to the greater text flow of sentences/paragraphs/sections, handling text from PDF documents may require workarounds. Each line break in a PDF document is primarily a hard-coded part of the layout. This also means that hyphens are hard-coded, effectively breaking longer words (either with or without a hyphen to begin with) into two parts, with both a hyphen and a line break character in the middle. Such line breaks are interpreted by Talkie to be paragraph separators, and thus the two parts are spoken separately.
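
One possible workaround, sketched here purely as an illustration, is to rejoin words which were split by a hyphen followed by a line break. The function name `rejoinHyphenatedLineBreaks` is hypothetical, and the heuristic of only joining when the next line starts with a lowercase letter is my own assumption.

```typescript
// Hypothetical sketch: remove a hyphen and line break between a letter at the
// end of one line and a lowercase letter at the start of the next line.
// Limited to basic Latin and some accented Latin letters.
function rejoinHyphenatedLineBreaks(text: string): string {
  return text.replace(
    /([A-Za-z\u00C0-\u024F])-[ \t]*(?:\r\n|\r|\n)[ \t]*([a-z\u00DF-\u00F6\u00F8-\u00FF])/g,
    "$1$2"
  );
}

const hyphenatedSample = "The docu-\nment was hyphen-\nated.";
const rejoinedSample = rejoinHyphenatedLineBreaks(hyphenatedSample);
```

A known weakness of this heuristic is that a word which had a hyphen to begin with, and happened to be wrapped at that hyphen, loses it ("well-\nknown" becomes "wellknown"). The copied text alone does not say which kind of hyphen it was.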

While both line breaks and hyphens exist in HTML text, usually neither is "hard" while still in HTML format. Instead, text in paragraphs can reflow freely, and soft hyphens are dynamically introduced as needed to improve the text flow. When the HTML is reduced to its text contents by the browser, soft hyphens are non-existent and a paragraph separator is represented by two line break characters. The resulting empty lines between paragraphs are never spoken aloud, and are thus removed by Talkie when preparing the text.

Since Talkie is developed primarily for use with HTML pages, the paragraph/line break handling introduces a pause when reading PDF documents. This was not noticed (or not reported) until #24. This difference in usage of line breaks may require further handling in Talkie. One option is to attempt to detect PDF files based on context/metadata, either explicitly by MIME type or by URL. PDF detection by metadata is unfortunately not possible for text read from the clipboard, as it lacks such context. Talkie could also attempt to handle hyphens followed by a line break, since it may indicate PDF text contents. PDF text handling may also be a rabbit hole, depending, for example, on how the PDF viewer providing text to Talkie handles combined glyphs such as ÅÄÖ.
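
As a sketch of what detection by metadata could look like, the function below checks a MIME type and a URL. Both the function `looksLikePdf` and the idea that Talkie would have these two values available are assumptions for illustration only. Passing `null` for both represents the clipboard case.

```typescript
// Return the part of the value before the first occurrence of the separator.
function stripAfter(value: string, separator: string): string {
  const position = value.indexOf(separator);
  return position === -1 ? value : value.substring(0, position);
}

// Hypothetical heuristic: the MIME type is "application/pdf", or the URL path
// ends with ".pdf" (ignoring any query string and fragment).
function looksLikePdf(url: string | null, mimeType: string | null): boolean {
  if (mimeType !== null) {
    const essence = stripAfter(mimeType, ";").trim().toLowerCase();
    if (essence === "application/pdf") {
      return true;
    }
  }
  if (url === null) {
    return false;
  }
  const path = stripAfter(stripAfter(url, "#"), "?");
  return /\.pdf$/i.test(path);
}
```

A URL check like this misses PDF documents served from URLs without a file extension, so it can only ever be a hint.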


That's not all, but it's enough for now. End brain dump. Questions?
