Improve Copy/Paste #5783

speedplane · 2015-03-04T12:35:37Z

The Problem

Copy/paste for many PDFs is broken. Even in the TraceMonkey example, try copying and pasting the area highlighted in the image below, and you will get the following text:

 informationis available

Notice that when you paste, there is no space between information and is. That is, no space is inserted where one line ends and the next begins.

But there are examples that are much worse than the above. In the PDF at the link here, the PDF structure itself has no spaces. That is, whatever software created this PDF only created PDF objects for the text itself and did not consider spacing. Thus, when one copies and pastes anything from this document using pdf.js, the pasted text contains no spaces.

It's not just copy/paste that's a problem: CTRL+F find does not work either. If you CTRL+F for information is in the TraceMonkey example, you will not get any results.
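
To illustrate why both copy/paste and find fail in this situation, here is a minimal sketch. It assumes the text arrives as one string per visual line and that the strings are concatenated without a separator; the chunk values are made up to mirror the TraceMonkey example, and `joinChunks` is a hypothetical helper, not part of pdf.js.

```javascript
// Hypothetical text chunks, one per visual line, with no trailing whitespace.
const exampleChunks = ['information', 'is available'];

// Hypothetical helper: concatenate chunks using the given separator.
function joinChunks(chunks, separator) {
  return chunks.join(separator);
}

const withoutSpaces = joinChunks(exampleChunks, '');
const withSpaces = joinChunks(exampleChunks, ' ');

// A plain substring search only succeeds once the space is present.
console.log(withoutSpaces);                           // informationis available
console.log(withoutSpaces.indexOf('information is')); // -1
console.log(withSpaces.indexOf('information is'));    // 0
```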

Other Viewers

Other PDF software gets this right: it figures out that there is a space between words or lines and automatically inserts the proper space when the user copies text. Adobe Reader, for example, will not only detect a space and insert it, but will also detect when a line ends with a hyphen and automatically remove the hyphen on copy/paste. For example, if you copied the word comp-ile in the image above, Reader would actually copy compile. That is probably overkill, but it is interesting that others gave this a significant amount of thought.

Other software that gets this right includes pdf miner, a project that goes to great lengths to properly extract text from PDFs. The Chrome viewer (Foxit) also gets this right, including the hyphens.
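
The hyphen handling described in this section could look roughly like the following sketch. It only illustrates the idea and is not how Adobe Reader or any other viewer implements it. `joinLines` is a hypothetical name, and the rule used here (drop a trailing hyphen, otherwise insert one space) is an assumption that will wrongly merge genuinely hyphenated compounds.

```javascript
// Join visual lines into one string, removing a hyphen at the end of a line
// and inserting a single space between lines otherwise.
function joinLines(lines) {
  let result = '';
  for (const line of lines) {
    const text = line.trim();
    if (text.length === 0) {
      continue; // skip empty lines
    }
    if (result.endsWith('-')) {
      // Previous line ended in a hyphen: drop it and glue the halves together.
      result = result.slice(0, -1) + text;
    } else if (result.length > 0) {
      result += ' ' + text;
    } else {
      result = text;
    }
  }
  return result;
}

console.log(joinLines(['comp-', 'ile']));                // compile
console.log(joinLines(['information', 'is available'])); // information is available
```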

The Solution

This commit aims to resolve these issues by automatically detecting when we need spaces, and inserting them into the text chunks where appropriate. The insertion occurs within the worker, so not only will copy/paste work, but getTextContent() will now also return text with proper spacing. Accordingly, any headless text extraction will also benefit from this PR.

Unfortunately, for PDFs that do not have the wordSpacing text state set, this feature must resort to using a bit of a heuristic, but I do not see any way around that.
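
As a rough illustration of the kind of detection described above, here is a sketch for left-to-right text only. It is not the code from this PR: the chunk shape (`str`, `x`, `y`, `width`, `height`), the names `needsSpace` and `joinWithSpaces`, the baseline check, and the fallback ratio of 0.3 are all assumptions made for the example. The patch itself inserts the space into the text chunks inside the worker, whereas this sketch simply returns a joined string.

```javascript
// Fallback gap, as a fraction of the chunk height, used when the PDF does
// not set wordSpacing. The value 0.3 is an arbitrary assumption.
const FALLBACK_GAP_RATIO = 0.3;

// Decide whether a space belongs between two consecutive chunks.
function needsSpace(lastChunk, nextChunk, wordSpacing) {
  if (lastChunk.str.length === 0) {
    return false;
  }
  const lastChar = lastChunk.str[lastChunk.str.length - 1];
  if (lastChar === ' ' || lastChar === '-') {
    return false; // already ends with a space or a hyphen
  }
  // A clear change of baseline is treated as a line break.
  if (Math.abs(nextChunk.y - lastChunk.y) > lastChunk.height / 2) {
    return true;
  }
  const gap = wordSpacing > 0 ? wordSpacing
                              : lastChunk.height * FALLBACK_GAP_RATIO;
  // Left to right: add a space if the next chunk starts past the end of
  // the last chunk plus the expected gap.
  return nextChunk.x >= lastChunk.x + lastChunk.width + gap;
}

// Concatenate chunk strings, inserting spaces where needsSpace says so.
function joinWithSpaces(chunks, wordSpacing) {
  let text = '';
  chunks.forEach((chunk, i) => {
    if (i > 0 && needsSpace(chunks[i - 1], chunk, wordSpacing)) {
      text += ' ';
    }
    text += chunk.str;
  });
  return text;
}

// Made-up geometry: a line break, a word gap, and a tightly kerned pair.
const sampleChunks = [
  { str: 'information', x: 100, y: 700, width: 50, height: 10 },
  { str: 'is', x: 20, y: 688, width: 8, height: 10 },
  { str: 'avail', x: 32, y: 688, width: 20, height: 10 },
  { str: 'able', x: 52.5, y: 688, width: 18, height: 10 },
];
console.log(joinWithSpaces(sampleChunks, 0)); // information is available
```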

I also added a test which hooks into the getTextContent function.

existentialism · 2015-03-04T14:18:22Z

Nice work! Early run through looks great, will test more thoroughly and report anything odd. I had started something similar a while back to try and improve copy/paste, glad to see I was on the right path 👍

speedplane · 2015-03-04T16:15:34Z

@timvandermeij Hi Tim - can you make a build of this when you get a chance?

timvandermeij · 2015-03-04T18:50:26Z

Hm, I see no change at all for the Tracemonkey paper. I tried copying the part in your image, but it still copies without a space at the newline for me.

speedplane · 2015-03-05T15:03:02Z

@timvandermeij Hi Tim... I just tried it with the viewer you built, and it works fine.

speedplane · 2015-03-05T15:15:24Z

@timvandermeij I just updated the commit, adding a test specifically for the tracemonkey paper. It's passing for me.

timvandermeij · 2015-03-05T19:24:23Z

/botio-linux preview

pdfjsbot · 2015-03-05T19:24:23Z

From: Bot.io (Linux)

Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/7b23cd1eedc95f6/output.txt

pdfjsbot · 2015-03-05T19:25:11Z

From: Bot.io (Linux)

Success

Full output at http://107.21.233.14:8877/7b23cd1eedc95f6/output.txt

Total script time: 0.79 mins

Published

timvandermeij · 2015-03-05T19:29:55Z

Still no luck for me. If I copy the text "information is available" from the first paragraph, I get:

information
is available

with the current master and

information
is available

with the build above, i.e. both are the same. Also if I copy both texts and paste them in for example the Firefox URL bar, both appear as "informationis available", i.e. without a space before "is". You cannot reproduce that?

speedplane · 2015-03-05T20:21:23Z

@timvandermeij Very strange. First off, where are you testing master? I assumed that the demo page was master, but is that incorrect?

I don't get a line break in either of them.

Here is my result copied from http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html

information is available

And here is the result copied from http://mozilla.github.io/pdf.js/web/viewer.html

informationis available

I copied both into Notepad++. I'm using Chrome on Windows. Could it be a browser thing?

speedplane · 2015-03-05T22:22:24Z

Also, if you have time, could you take a look at the DOM in both and see what you get?

This is the DOM from http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html:

And here is the DOM from http://mozilla.github.io/pdf.js/web/viewer.html:

Notice there is an extra space in the first.

timvandermeij · 2015-03-06T16:41:27Z

I do get the space in the DOM, but for some reason it doesn't copy it for both links you posted. I'm really confused about that: is it a browser bug perhaps? I'm using Firefox 36.0.1 on Windows 7 x64.

timvandermeij · 2015-03-06T19:22:39Z

It does appear to be something with the browser. Take http://www.sersc.org/journals/IJSIP/vol4_no3/5.pdf for example. At the bottom of the first page it says "established by ISO". If you select the text carefully, you can also select a space character after ISO. Copy/pasting that does give "ISO " with a space, but if you also take the next line, the space disappears when you copy/paste. Really weird...

(Note that for this PDF I have tested with the current add-on.)

speedplane · 2015-03-06T20:27:43Z

Maybe it isn't a browser bug. Maybe Firefox is actually smarter than the others. As you pointed out, in the TraceMonkey paper, Firefox is inserting line-breaks between lines. There are no line-break characters within the DOM itself, so Firefox must be doing something to detect it and insert it. Other browsers do not insert line-breaks.

I would bet that as part of that logic, when it detects a line break in selected text, it strips the end of the line of any whitespace.

timvandermeij · 2015-03-06T20:29:37Z

Good point! That would mean that this patch would not have much effect in Firefox, but it will help for other browsers.

speedplane · 2015-03-06T20:30:36Z

Try copying and pasting text in the following jsfiddle: https://jsfiddle.net/4wwd654y/

In Chrome, if you copy and paste the entire thing, you get:

Line OneLine Two

I would bet that in Firefox you get:

Line One
Line Two
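
The hypothesis from the last few comments can be written down as a toy model. This is only a model of the observed behavior, not the actual clipboard serialization code of either browser. It assumes each entry stands for the text of one block-level div, and both function names are hypothetical.

```javascript
// Model of what was observed in Chrome: block texts are concatenated as-is.
function copyLikeChrome(blocks) {
  return blocks.join('');
}

// Model of the Firefox hypothesis: each block becomes its own line, and
// whitespace at the end of a line is stripped.
function copyLikeFirefox(blocks) {
  return blocks.map((block) => block.replace(/\s+$/, '')).join('\n');
}

// Two divs without trailing whitespace, as assumed for the jsfiddle.
const fiddleBlocks = ['Line One', 'Line Two'];
console.log(copyLikeChrome(fiddleBlocks));  // Line OneLine Two
console.log(JSON.stringify(copyLikeFirefox(fiddleBlocks))); // "Line One\nLine Two"

// Two divs where the patch has appended a space to the first one.
const patchedBlocks = ['information ', 'is available'];
console.log(copyLikeChrome(patchedBlocks)); // information is available
console.log(JSON.stringify(copyLikeFirefox(patchedBlocks))); // "information\nis available"
```

Under this model the appended space helps in Chrome but is discarded in Firefox, which matches the results reported earlier in the thread.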

speedplane · 2015-03-06T20:32:26Z

Regarding your second point: this patch would not have an effect on the TraceMonkey paper in Firefox. However, it would still fix messed-up PDFs like this one in Firefox.

timvandermeij · 2015-03-06T20:39:28Z

You're right, that is what is happening in Firefox. However, for the PDF you linked in the comment just above this one, I also see no difference in Firefox. Both https://mozilla.github.io/pdf.js/web/viewer.html and http://107.21.233.14:8877/7b23cd1eedc95f6/web/viewer.html copy any text without spaces for me. The DOM shows the spaces in each div, but since they are separate divs, Firefox strips those spaces too. I think we'll have to look into this later, but nevertheless this seems like a great patch for other browsers and for when we figure out what exactly is different in Firefox.

timvandermeij · 2015-03-06T20:43:10Z

src/core/evaluator.js

+          fontAscent = (1 + font.descent) * fontAscent;
+        }
+        return {
+          x : (angle === 0) ? tx[4] :(tx[4] + (fontAscent * Math.sin(angle))),


Nit: change this line to x: (angle === 0 ? tx[4] : tx[4] + (fontAscent * Math.sin(angle))), to get the spaces and parentheses right

timvandermeij · 2015-03-08T14:19:28Z

src/core/evaluator.js

+            addSpace = newPosisition.x >= lastPosition.x + lastChunk.width +
+              wordSpacing;
+          } else {
+            // Right to left. Add space if next is before sart.


Typo: sart -> start?

timvandermeij · 2015-03-08T14:25:10Z

@speedplane Probably it is trying to spawn a process/application that it cannot find. Run npm update to get all dependencies. Probably it is https://github.com/mozilla/pdf.js/blob/master/test/webbrowser.js#L25 that is failing.

jazzy-em · 2015-03-10T08:01:59Z

@speedplane, great work! IMHO, you are resolving a very important problem!

jazzy-em · 2015-03-10T10:02:18Z

I found some bugs:

Small difference in highlighting: there is a small shift to the left.
For example, search for 'up' in the TraceMonkey example document.
Original:

Yours:

jazzy-em · 2015-03-10T10:37:55Z

In this document (http://www.selab.isti.cnr.it/ws-mate/example.pdf) try to search "an example paper".
There will be no matches, because there is no space between 'example' and 'paper'.

timvandermeij · 2015-03-12T18:50:12Z

@speedplane This also needs a rebase onto the current master.

danez · 2015-06-30T15:18:11Z

What is the status of this? Any news?

Edit: Is this PR only fixing the missing space for line breaks, or also PDFs that completely lack spaces? In our case we have a PDF that, even with this PR applied, does not copy with spaces.
https://www.researchgate.net/publication/257019040_Shortcomings_of_classical_phenological_forcing_models_and_a_way_to_overcome_them

speedplane · 2015-06-30T16:47:24Z

@danez I've been very busy with other projects. I want to clean up this PR (and others that I submitted), but won't realistically get to them until late August.

danez · 2015-07-01T10:12:01Z

In the case of this PDF there are chunks between each word that contain only the space itself.

The text layer in the viewer does not render these text divs, because the width of a space is calculated as 0. I first tried to change that and render these divs with a single space, but then the scale calculation cannot be done because the width is zero.

So I ended up adding spaces to the surrounding words if there are space chunks in between. What do you think about that?

-        // If the last chunk ends with a space it does not need one.
+
         var lastChunk = bidiTexts[bidiTexts.length - 1];
+
+        // If the last chunk equals a single space concat the space to chunk before the last chunk
+        if (lastChunk.str === ' ' && bidiTexts.length - 2 >= 0) {
+            var beforeLastChunk = bidiTexts[bidiTexts.length - 2];
+            if (beforeLastChunk.str.length === 0) {
+                return;
+            }
+            var lastChar = beforeLastChunk.str[beforeLastChunk.str.length - 1];
+            if (lastChar !== ' ') {
+                beforeLastChunk.str += ' ';
+            }
+            return;
+        }
+
         if (lastChunk.str.length === 0) {
           return;
         }
-        var lastChar = lastChunk.str[lastChunk.str.length - 1];
+
+        // If the last chunk ends with a space it does not need one.
+        lastChar = lastChunk.str[lastChunk.str.length - 1];
         if (lastChar === ' ' || lastChar === '-') {
           return;
         }
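
For readers who want to try the idea outside the evaluator, the approach in the diff above can be restated as a standalone function. This is a sketch under assumptions: `mergeSpaceChunks` is a hypothetical name, chunks are plain objects with a `str` field, and unlike the diff it drops the space-only chunk from the result instead of leaving it in `bidiTexts`.

```javascript
// Fold chunks that consist of a single space into the chunk before them.
// Returns a new array; the input chunks are not modified.
function mergeSpaceChunks(chunks) {
  const merged = [];
  for (const chunk of chunks) {
    const previous = merged[merged.length - 1];
    if (chunk.str === ' ' && previous !== undefined) {
      // Append the space to the preceding chunk unless that chunk is
      // empty or already ends with a space.
      if (previous.str.length > 0 && !previous.str.endsWith(' ')) {
        previous.str += ' ';
      }
      continue; // the space-only chunk itself is not kept
    }
    merged.push({ ...chunk }); // shallow copy so the caller's data stays intact
  }
  return merged;
}

const spaceChunkInput = [
  { str: 'an' }, { str: ' ' }, { str: 'example' }, { str: ' ' }, { str: 'paper' },
];
const mergedChunks = mergeSpaceChunks(spaceChunkInput);
console.log(mergedChunks.map((chunk) => chunk.str).join('')); // an example paper
```

Because the space now lives inside a chunk with a non-zero width, it survives in the text layer, at the cost of the selection offset described in the next comment.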

danez · 2015-07-01T10:18:39Z

I also noticed that the selection is now off in the web viewer, because the space is added to the div but the scale is wrong.
You can see that the selection ends inside the character (this is also visible in the image from jazzy-em above).

This of course gets worse with my patch from above.

Vad1mo · 2015-10-12T15:06:17Z

Hello,
what is the status of the pull request? As I see it, even if it doesn't solve the problem as desired, it is already a big step forward! Can we not merge it into master?

timvandermeij · 2015-10-12T20:34:52Z

It needs to be rebased onto the current master and reviewed before we can land this.

fbender · 2015-11-16T15:00:41Z

@speedplane any idea when you will get to this?

timvandermeij · 2019-03-06T22:55:27Z

Closing since this was never completely finished and it's not in a mergeable state. However, the associated issue will remain open with a reference to this pull request so the work can be continued later on.

speedplane force-pushed the text-extract-with-spaces branch from 6fb6dbe to 03fae4b Compare March 4, 2015 12:40

speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Mar 4, 2015

Improve copy/paste by inserting spaces into textChunks if we deem it …

03fae4b

…appropriate. Add test re same. PR mozilla#5783.

speedplane force-pushed the text-extract-with-spaces branch from 03fae4b to c6da1ee Compare March 4, 2015 12:59

speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Mar 4, 2015

Improve copy/paste by inserting spaces into textChunks if we deem it …

c6da1ee

…appropriate. Add test re same. PR mozilla#5783.

Snuffleupagus added the text-selection label Mar 4, 2015

This was referenced Mar 4, 2015

Can't search across lines in some PDFs (like the demo PDF) #4742

Closed

Copy All (control+a) #3907

Closed

Selected text which appears to be normal has spaces occasionally missing. #1883

Closed

speedplane force-pushed the text-extract-with-spaces branch from c6da1ee to 6e56084 Compare March 5, 2015 15:14

speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Mar 5, 2015

Improve copy/paste by inserting spaces into textChunks if we deem it …

6e56084

…appropriate. Add test re same. PR mozilla#5783.

timvandermeij reviewed Mar 6, 2015

timvandermeij reviewed Mar 8, 2015

speedplane force-pushed the text-extract-with-spaces branch from 8519c6b to ee75402 Compare March 8, 2015 15:08

speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Mar 8, 2015

Improve copy/paste by inserting spaces into textChunks if we deem it …

ee75402

…appropriate. Add test re same. PR mozilla#5783.

Improve copy/paste by inserting spaces into textChunks if we deem it …

e9ad0b4

…appropriate. Add test re same. PR mozilla#5783.

speedplane force-pushed the text-extract-with-spaces branch from ee75402 to e9ad0b4 Compare March 8, 2015 15:18

jazzy-em mentioned this pull request Mar 11, 2015

Search matches characters across new lines #2806

Closed

jeffsack pushed a commit to jeffsack/pdf.js that referenced this pull request Mar 16, 2015

Improve copy/paste by inserting spaces into textChunks if we deem it …

7342b6d

…appropriate. Add test re same. PR mozilla#5783.

fbender mentioned this pull request Nov 16, 2015

Find (aka Ctrl-F) inside pdf.js does't work for this case #5955

Closed

This was referenced May 25, 2016

Added multiple term search functionality (with default phrase search) #5579

Merged

String.indexOf() cannot match phrases with variable whitespace #7355

Closed

smileman mentioned this pull request Sep 30, 2016

Improve copy/paste by inserting spaces into textChunks if we deem it … smileman/pdf.js#1

Merged

timvandermeij mentioned this pull request Aug 14, 2017

New line not recognised and instead combines words together without a space #8777

Closed

cforcey mentioned this pull request Mar 1, 2018

Add interword space option to HOCR pdf renderer ocrmypdf/OCRmyPDF#225

Closed

timvandermeij closed this Mar 6, 2019

This was referenced Mar 26, 2020

Can't find words hyphenated across lines #11752

Closed

Can't find phrases in justified text #11753

Closed
