Text Extraction (Too many / less spaces in text) #7327

TuningGuide · 2016-05-14T14:51:21Z

Link to PDF file (or attach file here):
http://dipbt.bundestag.de/dip21/btp/18/18145.pdf

Configuration:

Web browser and its version: Node 5.10.1
Operating system and its version: Mac OSX Latest
PDF.js version: Latest
Is an extension: ?

Steps to reproduce the problem:

Execute following script:

/**
 * Created by velten on 13.05.16.
 */
"use strict";

// The script relies on a global PDFJS object being available after this require.
require('pdfjs-dist');
const fs = require('fs');
const o_o = require('yield-yield');

const pdfPath = '/Users/velten/Downloads/18145.pdf';
const pdfUrl = "http://dipbt.bundestag.de/dip21/btp/18/18145.pdf";

o_o(function* () {
    try {
        // Load the PDF from disk and read the text content of page 1.
        var data = new Uint8Array(yield fs.readFile(pdfPath, yield));
        let pdfDocument = yield PDFJS.getDocument(data);
        let page = yield pdfDocument.getPage(1);
        let content = yield page.getTextContent();

        let stop = false;

        var strings = content.items.map(function (item) {
            // Print the item that directly follows the first item containing "Dr", then quit.
            if (stop) {
                console.log(item.str);
                process.exit();
            }
            if (item.str.indexOf('Dr') !== -1) stop = true;
            return item.str;
        });
    }
    catch (err) {
        console.error(err);
    }
})();

What is the expected behavior? (add screenshot)
Output = "." or ". "

What went wrong? (add screenshot)
Output is " . "


TuningGuide · 2016-05-19T07:55:03Z

I saw that @speedplane and @yurydelendik did some work in this direction. Maybe they can tell me which functions are relevant?

TuningGuide · 2016-05-26T13:18:06Z

The error can also be inspected directly in Firefox:

Open http://dipbt.bundestag.de/dip21/btp/18/18145.pdf in Firefox
Inspect the page: set display:none on the canvasWrapper and disable the '.textLayer > div { color: transparent }' rule
Look at "Dr . Angela Merkel"

TuningGuide · 2016-05-26T14:41:05Z

I may have found a quick fix for this issue:
The elements "Dr" and " . " overlap: the x-position plus the width of "Dr" is greater than the x-position of " . ". So my code now includes:
if (item.str.startsWith(' ') && item.transform[4] < prev_item.transform[4] + prev_item.width) { return item.str.substring(1); }
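
For anyone who wants to try this, here is a minimal, self-contained sketch of the same overlap check applied to a whole list of text items. The helper name `joinWithoutOverlapSpaces` and the sample coordinates are made up for illustration; only `str`, `width` and `transform[4]` (the x-position) are taken from the items returned by `getTextContent()`. It does not check whether two items are on the same line, so treat it as a workaround sketch rather than a proper fix.

```javascript
// Hypothetical helper (not part of PDF.js): joins text items and drops a
// leading space when an item starts before the previous item has ended.
function joinWithoutOverlapSpaces(items) {
  let prev = null;
  const parts = [];
  for (const item of items) {
    let str = item.str;
    // transform[4] is the x-position; prev x + prev width is where the
    // previous item ends. An overlap means the leading space is spurious.
    if (prev !== null && str.startsWith(' ') &&
        item.transform[4] < prev.transform[4] + prev.width) {
      str = str.substring(1);
    }
    parts.push(str);
    prev = item;
  }
  return parts.join('');
}

// Made-up sample items that mimic the "Dr" / " . " overlap described above.
const sample = [
  { str: 'Dr', transform: [1, 0, 0, 1, 100, 700], width: 12 },
  { str: ' . ', transform: [1, 0, 0, 1, 110, 700], width: 6 },
  { str: 'Angela Merkel', transform: [1, 0, 0, 1, 118, 700], width: 60 },
];

console.log(joinWithoutOverlapSpaces(sample)); // prints "Dr. Angela Merkel"
```

With the real document you would pass `content.items` instead of `sample`.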

jasonparallel · 2016-07-08T17:24:50Z

I'm seeing the same issue with extra spaces in words. Example pdf available at https://github.com/jasonparallel/pdf.js-issues/blob/master/webSnapshot.pdf

For example, at the bottom of the first page, several spaces are inserted into n148584.rar, and a space is inserted into the word "energy" right before that.

This causes issues for copying text out of the pdf and for searching for text in the pdf.

luomancs · 2020-07-19T20:03:37Z

Hi,

Have you solved this problem? I have the same issue: my search query has one space between each word, but the PDF expects two spaces. Do you have a solution? Thank you.
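
One possible caller-side workaround for the doubled-space case, sketched under the assumption that you search the extracted text yourself rather than through the viewer's find bar: collapse runs of whitespace in both the extracted text and the query before comparing. The function names and the sample string are hypothetical, and this does not help when spaces are inserted inside a word.

```javascript
// Hypothetical helpers (not part of PDF.js): collapse every run of
// whitespace to a single space so spacing differences do not affect matching.
function normalizeSpaces(text) {
  return text.replace(/\s+/g, ' ').trim();
}

function containsIgnoringSpacing(haystack, query) {
  return normalizeSpaces(haystack).includes(normalizeSpaces(query));
}

// Made-up extracted text with doubled spaces between words.
const extracted = 'renewable  energy  report';

console.log(extracted.includes('energy report'));                  // false
console.log(containsIgnoringSpacing(extracted, 'energy report'));  // true
```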

timvandermeij · 2021-04-30T18:08:29Z

Fixed by #13257 and possibly other patches.

TuningGuide changed the title ~~To many spaces in text~~ Too many spaces in text May 14, 2016

timvandermeij added the text-selection label May 15, 2016

TuningGuide changed the title ~~Too many spaces in text~~ Text Extraction (Too many / less spaces in text) May 29, 2016

TuningGuide mentioned this issue Sep 10, 2016

Space inserted between each letter in textLayer #6705

Closed

timvandermeij closed this as completed Apr 30, 2021
