Spaces missing when copying text from PDF #6657

Open
flexpaper opened this issue Nov 18, 2015 · 19 comments

Comments

@flexpaper

commented Nov 18, 2015

I know some improvements have been made recently to the handling of spacing in text, but copying text from this particular PDF causes the spaces between words to go missing entirely (tested with the public PDF.js viewer, version "1.2.131").

The spaces are preserved when copying from Acrobat, Preview on OS X, etc.

LOW_Article_5.pdf

^Erik

@yurydelendik

Contributor

commented Nov 18, 2015

The spaces are missing due to an optimization that skips displaying whitespace-only divs (see https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L92). The PDF commands look like:

/F6 1 Tf
6.3761 0 0 6.3761 257.6175 558.2142 Tm
(Waco,)Tj
/F2 1 Tf
.0002 0 0 -.0002 275.6735 558.2141 Tm
( )Tj
/F6 1 Tf
6.3761 0 0 6.3761 277.3313 558.2142 Tm
(TX)Tj
/F2 1 Tf
.0002 0 0 -.0002 284.7781 558.2141 Tm
( )Tj
/F6 1 Tf
6.3761 0 0 6.3761 286.4359 558.2142 Tm
(76798-7353,)Tj
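
To illustrate why dropping whitespace-only items loses the word breaks in this file, here is a minimal sketch. It is not the actual text_layer_builder.js code: the item list simply mirrors the Tj operands above, and `visibleItems` and `copiedText` are hypothetical helper names introduced for the example.

```javascript
// Text runs as they appear in the content stream above: each word and each
// space is a separate Tj operation, so each becomes a separate text item.
const streamItems = [
  { str: "Waco," },
  { str: " " },
  { str: "TX" },
  { str: " " },
  { str: "76798-7353," },
];

// Hypothetical helper: mimic an optimization that skips whitespace-only items.
function visibleItems(textItems) {
  return textItems.filter((item) => /\S/.test(item.str));
}

// Hypothetical helper: what a copy operation would produce if it simply
// concatenated the items it was given.
function copiedText(textItems) {
  return textItems.map((item) => item.str).join("");
}

console.log(copiedText(streamItems));               // "Waco, TX 76798-7353,"
console.log(copiedText(visibleItems(streamItems))); // "Waco,TX76798-7353,"
```

Because the only separators between the words are the standalone space items, filtering them out leaves nothing to mark the word boundaries.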
@flexpaper

Author

commented Nov 18, 2015

Thanks for the speedy reply. The height and width of the spaces in this particular case seem to be off too. The height of a space is 0.0002 and the width is 0.00005, while the text element before a space has a height of 7.17.
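
As a rough cross-check of where such a tiny height could come from, the sketch below reads the vertical scale out of the two Tm matrices quoted earlier in this thread. `effectiveFontSize` is a hypothetical helper, and the calculation only covers the text matrix itself, so it does not reproduce the 7.17 figure above, which may include further transforms.

```javascript
// Text matrices (Tm operands) copied from the content stream quoted earlier
// in this thread, in the order [a, b, c, d, e, f].
const wordTm = [6.3761, 0, 0, 6.3761, 257.6175, 558.2142];
const spaceTm = [0.0002, 0, 0, -0.0002, 275.6735, 558.2141];

// Hypothetical helper: vertical scale that a text matrix applies to glyphs.
// With "/F6 1 Tf" the nominal font size is 1, so this scale is the
// effective glyph height in text-space units.
function effectiveFontSize(nominalSize, tm) {
  const [, , c, d] = tm;
  return nominalSize * Math.hypot(c, d);
}

const wordSize = effectiveFontSize(1, wordTm);
const spaceSize = effectiveFontSize(1, spaceTm);

console.log(wordSize, spaceSize);
// The space glyph is drawn tens of thousands of times smaller than the words.
console.log(wordSize / spaceSize);
```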

@flexpaper

Author

commented Nov 18, 2015

Sorry, I just saw that the PDF actually seems to specify this as part of the PDF commands, so I guess it's in the PDF itself. It's actually using a different font for the spaces than for the text.

@yurydelendik

Contributor

commented Nov 18, 2015

Can you experiment with disabling the optimization I mentioned above? See if it will resolve the issue.

@flexpaper

Author

commented Nov 18, 2015

I'm not using the text_layer_builder in my case, but disabling the optimisation above resolves it. I confirmed that the spaces are returned from getTextContent(). Their sizes (as I stated) are funny in this PDF, but they are returned properly.
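
A check like the one described could look roughly like the sketch below. `page.getTextContent()` is the PDF.js call mentioned above, and it resolves to an object with an `items` array. `whitespaceItems`, `logWhitespaceItems` and the sample data are hypothetical; the heights and the space width are the values quoted in this thread, and the word widths are placeholders.

```javascript
// Hypothetical helper: list the whitespace-only items in a getTextContent()
// result, together with the sizes reported for them.
function whitespaceItems(textContent) {
  return textContent.items
    .filter((item) => item.str.length > 0 && item.str.trim() === "")
    .map((item) => ({ str: item.str, width: item.width, height: item.height }));
}

// Usage with a real page object: fetch the text content, then inspect it.
async function logWhitespaceItems(page) {
  const textContent = await page.getTextContent();
  console.log(whitespaceItems(textContent));
}

// Stand-in data shaped like a getTextContent() result. Heights and the space
// width are the figures quoted in this thread; word widths are placeholders.
const sampleTextContent = {
  items: [
    { str: "Waco,", width: 18, height: 7.17 },
    { str: " ", width: 0.00005, height: 0.0002 },
    { str: "TX", width: 7, height: 7.17 },
    { str: " ", width: 0.00005, height: 0.0002 },
  ],
};

console.log(whitespaceItems(sampleTextContent).length); // 2
```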

@slavajacobson

commented May 7, 2016

How did you disable the optimization exactly? Almost every document is missing spaces between words when using Find or copying/pasting the text.

@maykefreitas

commented Aug 18, 2017

@flexpaper and @slavajacobson

I made a simple workaround that adds a space character, by changing textDiv.textContent = geom.str; to textDiv.textContent = geom.str + ' '; in the file https://github.com/mozilla/pdf.js/blob/master/src/display/text_layer.js#L109.

Now when I select the text to copy & paste, "line breaks" come out nicely as spaces ;)

@ku20043703

commented Mar 25, 2019

Regarding the optimization @yurydelendik described above (not displaying whitespace divs): which lines of code need to be commented out or removed to remove the spaces in the text?

@ku20043703

commented Apr 1, 2019

@yurydelendik Can you please advise on this? I am urgently looking for a solution.

@timvandermeij

Contributor

commented Apr 1, 2019

You can try tweaking

pdf.js/src/core/evaluator.js

Lines 1230 to 1232 in f9c5811

var SPACE_FACTOR = 0.3;
var MULTI_SPACE_FACTOR = 1.5;
var MULTI_SPACE_FACTOR_MAX = 4;
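
To give a feel for what tuning such factors can do, here is a minimal sketch of a gap-based spacing heuristic. The three constants are the values quoted above; everything else is an assumption made for illustration and is not the logic in evaluator.js. `spacesForGap` and `joinRuns` are hypothetical helpers, the x positions come from the Tm operands earlier in this thread, and the run widths and space width are assumed values.

```javascript
// Values quoted above from src/core/evaluator.js.
const SPACE_FACTOR = 0.3;
const MULTI_SPACE_FACTOR = 1.5;
const MULTI_SPACE_FACTOR_MAX = 4;

// Hypothetical heuristic (NOT the evaluator.js logic): decide how many
// spaces a horizontal gap represents, relative to the font's space width.
function spacesForGap(gap, spaceWidth) {
  const ratio = gap / spaceWidth;
  if (ratio < SPACE_FACTOR) return 0;       // too small: still the same word
  if (ratio < MULTI_SPACE_FACTOR) return 1; // ordinary word break
  // Wide gap: several spaces, capped at MULTI_SPACE_FACTOR_MAX.
  return Math.min(Math.round(ratio), MULTI_SPACE_FACTOR_MAX);
}

// Hypothetical helper: join positioned runs, inserting spaces based on gaps.
function joinRuns(runs, spaceWidth) {
  let text = "";
  let previousEnd = null;
  for (const run of runs) {
    if (previousEnd !== null) {
      text += " ".repeat(spacesForGap(run.x - previousEnd, spaceWidth));
    }
    text += run.str;
    previousEnd = run.x + run.width;
  }
  return text;
}

// x values are taken from the Tm operands in this thread; widths are assumed.
const positionedRuns = [
  { str: "Waco,", x: 257.6175, width: 18.056 },
  { str: "TX", x: 277.3313, width: 7.4468 },
  { str: "76798-7353,", x: 286.4359, width: 40 },
];

console.log(joinRuns(positionedRuns, 1.6578)); // "Waco, TX 76798-7353,"
```

In a scheme like this, lowering SPACE_FACTOR makes smaller gaps count as word breaks, while raising it makes the heuristic more conservative.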

@ku20043703

This comment was marked as off-topic.

commented Apr 4, 2019

@timvandermeij Thanks. I made some changes in the evaluator.js file as you suggested above, but the changes are not reflected after ng serve. How do I rebuild after making changes to a pdf.js file?

@timvandermeij

This comment was marked as off-topic.

Contributor

commented Apr 4, 2019

You'll need to rebuild PDF.js. Refer to the README and the wiki for how to do that.

@ku20043703

This comment was marked as off-topic.

commented Apr 6, 2019

@timvandermeij When I run gulp dist-install to build the project, I get the following error:
Cloning baseline distribution
Error: command "git" with parameters "clone,--depth,1,https://github.com/mozilla/pdfjs-dist,build/dist/" exited with code 1

Can you help with it?

@ku20043703

This comment was marked as off-topic.

commented Apr 8, 2019

@timvandermeij Can you help with the above error?

@timvandermeij

This comment was marked as off-topic.

Contributor

commented Apr 8, 2019

I have never seen that error before. Try setting up a new clean environment. If you're on Windows, some other steps may be required: https://github.com/mozilla/pdf.js/wiki/Setting-up-pdf.js-Development-Environment-for-Windows

@ku20043703

This comment was marked as off-topic.

commented Apr 11, 2019

@timvandermeij Thanks for your help.

I downloaded pdf.js-2.0.943 from git and then ran gulp dist-install to build the project.

Afterwards I copied all the files and folders from the dist folder and pasted them into the pdfjs-dist folder in my project's node_modules.

Earlier, pdf-viewer was able to show the PDF with the prebuilt pdfjs-dist. But now, after the custom build described above, it's not showing the PDF.

I have tried for a few days, but I'm not sure what is going wrong.

I am looking for your help on this. A quick response would be very helpful.

@Snuffleupagus

Contributor

commented Apr 11, 2019

Please refrain from repeatedly posting completely unrelated comments in issues, since that causes notification spam for people and makes it much more difficult to follow the actual discussion.

Basically, everything from #6657 (comment) forwards is completely unrelated here (it possibly even started with #6657 (comment)), and should have been posted in a separate new issue (with all information from ISSUE_TEMPLATE.md provided). @timvandermeij Mind hiding/removing some of the off-topic comments?

@shalva97

commented Jul 4, 2019

Any updates on this? I am suffering a lot; I have to open PDF files in Chrome to copy text properly.

@jobjects

commented Jul 4, 2019

Our server-side tool for highlighting search keywords in PDFs works around this issue by getting the copied text from the server.
https://www.pdf-highlighter.com/docs/Text_Copy_Workaround.html
(it's a slightly customized version of PDF.js)
It could be overkill if all you need is text copy, but maybe you'll find it useful.
