Text Extraction (Too many / less spaces in text) #7327

Closed
TuningGuide opened this issue May 14, 2016 · 6 comments

@TuningGuide

TuningGuide commented May 14, 2016

Link to PDF file (or attach file here):
http://dipbt.bundestag.de/dip21/btp/18/18145.pdf

Configuration:

  • Web browser and its version: Node 5.10.1
  • Operating system and its version: Mac OSX Latest
  • PDF.js version: Latest
  • Is an extension: ?

Steps to reproduce the problem:

  1. Execute the following script:
/**
 * Created by velten on 13.05.16.
 */
"use strict";

require('pdfjs-dist');
const fs = require('fs');
const o_o = require('yield-yield');

const pdfPath = '/Users/velten/Downloads/18145.pdf';
const pdfUrl = "http://dipbt.bundestag.de/dip21/btp/18/18145.pdf";

o_o(function* () {
    try {
        var data = new Uint8Array(yield fs.readFile(pdfPath, yield));
        let pdfDocument = yield PDFJS.getDocument(data);
        let page = yield pdfDocument.getPage(1);
        let content = yield page.getTextContent();

        let stop = false;

        var strings = content.items.map(function (item) {
            // Print the item that follows the first one containing 'Dr', then quit.
            if(stop) {
                console.log(item.str);
                process.exit();
            }
            if(item.str.indexOf('Dr') !== -1) stop = true;
            return item.str;
        });
    }
    catch (err) {
        console.error(err);
    }
})();

What is the expected behavior? (add screenshot)
Output = "." or ". "

What went wrong? (add screenshot)
Output is " . "

@TuningGuide changed the title from "To many spaces in text" to "Too many spaces in text" on May 14, 2016
@TuningGuide
Author

I saw that @speedplane and @yurydelendik did some work in this direction. Maybe they can tell me which functions are relevant?

@TuningGuide
Author

The error can also be inspected directly in Firefox:

  1. Open http://dipbt.bundestag.de/dip21/btp/18/18145.pdf in Firefox
  2. Inspect the page: set display: none on the canvasWrapper and disable the '.textLayer > div { color: transparent }' rule
  3. Look at "Dr . Angela Merkel"

@TuningGuide
Author

I may have found a quick fix for this issue:
The elements "Dr" and " . " overlap: the x-position plus the width of "Dr" is greater than the x-position of " . ". So my code now includes:
if(item.str.startsWith(' ') && item.transform[4] < prev_item.transform[4]+prev_item.width) { return item.str.substring(1); }
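
The one-liner above could be wrapped into a small helper roughly like the following. This is only a sketch of the overlap heuristic from this comment, not anything PDF.js does internally. The function name `joinTextItems` and the sample coordinates are made up for illustration; the only fields it relies on are `str`, `transform` and `width` of the items returned by `getTextContent()`.

```javascript
// Hypothetical helper (not part of PDF.js): concatenates text items and drops
// one leading space when an item starts inside the previous item's x-extent.
function joinTextItems(items) {
  const parts = [];
  let prev = null;
  for (const item of items) {
    let str = item.str;
    if (prev !== null && str.startsWith(' ')) {
      // transform[4] is the x-position of the item.
      const prevRight = prev.transform[4] + prev.width;
      if (item.transform[4] < prevRight) {
        str = str.substring(1);
      }
    }
    parts.push(str);
    prev = item;
  }
  return parts.join('');
}

// Made-up items that mimic the overlapping "Dr" / " . " case.
const sample = [
  { str: 'Dr', transform: [1, 0, 0, 1, 100, 700], width: 12 },
  { str: ' . ', transform: [1, 0, 0, 1, 110, 700], width: 8 },
  { str: 'Angela Merkel', transform: [1, 0, 0, 1, 118, 700], width: 60 },
];
console.log(joinTextItems(sample)); // prints "Dr. Angela Merkel"
```

Note that this only compares x-positions, so it assumes consecutive items sit on the same line; an item at the start of a new line could be trimmed by mistake.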

@TuningGuide changed the title from "Too many spaces in text" to "Text Extraction (Too many / less spaces in text)" on May 29, 2016
@jasonparallel

jasonparallel commented Jul 8, 2016

I'm seeing the same issue with extra spaces in words. Example pdf available at https://github.com/jasonparallel/pdf.js-issues/blob/master/webSnapshot.pdf

For example at the bottom of the first page several spaces are inserted into n148584.rar and a space is inserted into the word energy right before that.

This causes issues for copying text out of the pdf and for searching for text in the pdf.

@luomancs

> I'm seeing the same issue with extra spaces in words. Example pdf available at https://github.com/jasonparallel/pdf.js-issues/blob/master/webSnapshot.pdf
>
> For example at the bottom of the first page several spaces are inserted into n148584.rar and a space is inserted into the word energy right before that.
>
> This causes issues for copying text out of the pdf and for searching for text in the pdf.

Hi,

Have you solved this problem? I have the same issue: my search query has one space between each word, but the PDF expects two spaces. Do you have a solution? Thank you.
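
For the search mismatch described here (one space in the query, two in the extracted text), a possible workaround on the caller's side is to collapse whitespace runs in both strings before comparing. The sketch below assumes you already have the page text as a plain string; `normalizeSpaces` and `containsNormalized` are hypothetical names, not PDF.js functions.

```javascript
// Hypothetical helpers (not part of PDF.js).
// Collapse every run of whitespace into a single space and trim the ends.
function normalizeSpaces(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Whitespace-insensitive substring check on already extracted page text.
function containsNormalized(pageText, query) {
  return normalizeSpaces(pageText).includes(normalizeSpaces(query));
}

console.log(containsNormalized('Dr.  Angela   Merkel', 'Angela Merkel')); // prints true
```

This only helps with doubled spaces between words. It does not repair spaces inserted inside a word, such as the "energy" case mentioned earlier in this thread.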

@timvandermeij
Contributor

Fixed by #13257 and possibly other patches.


4 participants