Extract PDF file results in a garbled code #144

tayei1997 · 2024-03-13T07:36:23Z

Hello, I am using PoDoFo's PDF text extraction function and ran into garbled output. When I extract text from sample1.pdf, the console prints:
“Updating version from 1.7 to 1.7
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F2
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
......”
sample1.pdf

I checked sample1.pdf and it doesn't seem to have a ToUnicode map. Could that be the cause?

The image below shows the extracted garbled text.

I noticed that the garbled extracted text looks similar to the data that precedes the Tj operator in the PDF content stream. When PoDoFo cannot find the font, does it output the PDF text string as is?

ceztko · 2024-03-16T11:41:48Z

Hello, there may be multiple problems, but I think the most relevant one is the missing support for predefined CMap encodings, for which I just added an entry in the TODO (it was a known missing feature). Embedding all the known predefined CMaps, which I believe can be found in this repository, is quite a big task. My main idea would be to parse all the CMaps into PdfCharCodeMap instances (the code that does the CMap parsing is here, but it may need refactoring so it's callable from elsewhere) and define some kind of binary serialization so they can be efficiently embedded in PoDoFo. Then implement the predefined CMap resolution algorithm as described in the specification, and I believe this problem should be addressed. I would be very glad to see contributions on this matter, as I definitely don't have time to put into the task, I'm sorry.
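
To give an idea of what bulk parsing involves, here is a minimal sketch that reads the "begincidrange ... endcidrange" sections of a predefined CMap file. It is not PoDoFo's parser: CidRange and ParseCidRanges are hypothetical names, and the sketch assumes whitespace-separated tokens and ignores other sections such as begincidchar and usecmap. Malformed numbers would make std::stoul throw.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one "<lo> <hi> cid" line; not a PoDoFo type.
struct CidRange {
    uint32_t lo;        // first character code in the range
    uint32_t hi;        // last character code in the range
    uint32_t firstCid;  // CID assigned to `lo`; later codes count up from it
    unsigned codeSize;  // code length in bytes, derived from the hex digit count
};

// Returns true if the token has the form "<hexdigits>".
static bool IsHexToken(const std::string& t)
{
    return t.size() >= 3 && t.front() == '<' && t.back() == '>';
}

// Collects every range found between "begincidrange" and "endcidrange".
// Assumption: tokens are separated by whitespace.
std::vector<CidRange> ParseCidRanges(const std::string& cmapText)
{
    std::vector<CidRange> ranges;
    std::istringstream in(cmapText);
    std::string token;
    while (in >> token) {
        if (token != "begincidrange")
            continue;  // skip everything outside cidrange sections

        std::string lo, hi, cid;
        while (in >> lo && lo != "endcidrange" && in >> hi >> cid) {
            if (!IsHexToken(lo) || !IsHexToken(hi))
                break;  // unexpected layout: give up on this section
            CidRange range;
            range.lo = static_cast<uint32_t>(
                std::stoul(lo.substr(1, lo.size() - 2), nullptr, 16));
            range.hi = static_cast<uint32_t>(
                std::stoul(hi.substr(1, hi.size() - 2), nullptr, 16));
            range.firstCid = static_cast<uint32_t>(std::stoul(cid));
            range.codeSize = static_cast<unsigned>((lo.size() - 2) / 2);
            ranges.push_back(range);
        }
    }
    return ranges;
}
```

A real tool would feed the resulting ranges into whatever structure ends up being embedded, and would also need to handle the sections skipped here.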

tayei1997 · 2024-03-18T02:41:06Z

Thanks for your answer.

So the problem is that PoDoFo can extract the binary-encoded text data shown in the image below but, lacking the corresponding CMap, cannot decode it correctly.

If I want to decode it myself, I first need the raw string data that precedes the Tj/TJ operators. Can I use PoDoFo to get that data?

ceztko · 2024-03-18T08:42:05Z

You can have a look at how PdfContentStreamReader is used here. But this project would benefit if you tried to implement the system I suggested within the PoDoFo source (at least a prototype of it in a fork). I recently received some very good contributions from a couple of Chinese users who had issues drawing text: I appreciated their level of competence, and their PRs have already been merged.
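
For illustration only, the following sketch shows what "the data before Tj" means at the content stream level. It does not use PdfContentStreamReader or any other PoDoFo API: HexOperandsOfTj is a hypothetical helper that scans an already decompressed content stream for hex string operands directly followed by Tj. Literal "(...)" strings, TJ arrays, and a '<' that happens to appear inside a literal string are deliberately not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns the raw bytes of every hex string operand that is directly followed
// by the Tj operator, e.g. "<00410042> Tj".
// Assumption: `content` is a decoded (uncompressed) content stream.
std::vector<std::vector<uint8_t>> HexOperandsOfTj(const std::string& content)
{
    std::vector<std::vector<uint8_t>> result;
    std::size_t pos = 0;
    while ((pos = content.find('<', pos)) != std::string::npos) {
        if (pos + 1 < content.size() && content[pos + 1] == '<') {
            pos += 2;  // "<<" opens a dictionary, not a hex string
            continue;
        }
        std::size_t end = content.find('>', pos);
        if (end == std::string::npos)
            break;  // unterminated hex string

        // Keep only the hex digits; whitespace inside the string is ignored.
        std::string digits;
        for (std::size_t i = pos + 1; i < end; ++i) {
            if (std::isxdigit(static_cast<unsigned char>(content[i])))
                digits += content[i];
        }
        if (digits.size() % 2 != 0)
            digits += '0';  // a missing final digit is treated as 0

        // Skip whitespace and check that the next token is exactly "Tj".
        std::size_t next = end + 1;
        while (next < content.size()
               && std::isspace(static_cast<unsigned char>(content[next])))
            ++next;
        bool isTj = content.compare(next, 2, "Tj") == 0
            && (next + 2 == content.size()
                || std::isspace(static_cast<unsigned char>(content[next + 2])));

        if (isTj) {
            std::vector<uint8_t> bytes;
            for (std::size_t i = 0; i < digits.size(); i += 2) {
                bytes.push_back(static_cast<uint8_t>(
                    std::stoul(digits.substr(i, 2), nullptr, 16)));
            }
            result.push_back(bytes);
        }
        pos = end + 1;
    }
    return result;
}
```

The bytes returned here are still character codes, so they only become readable text once they are mapped through the font's CMap.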

tayei1997 · 2024-03-19T01:25:46Z

I am willing to take on the development, so I would like to ask a few questions:

If all the CMap mappings are embedded in PoDoFo, will its memory usage increase?
I lack the CMap-related development experience you mentioned, so it is difficult for me to estimate the time needed. How many man-days do you think this work will take?

ceztko · 2024-03-20T17:32:06Z

Hello. Sorry for the delay in answering, it took me some time to do further analysis. First, let me confirm that the issue here really is the missing embedding of predefined CMap encodings. I'll try to answer your questions below.

If all the CMap mappings are embedded in PoDoFo, will its memory usage increase?

Yes, but there are options to reduce the memory consumption by embedding pre-parsed maps. See below.

I lack the CMap-related development experience you mentioned, so it is difficult for me to estimate the time needed. How many man-days do you think this work will take?

It's hard to say, but let's try to define some tasks and (possibly over-)estimate them:

[4 Hours] Refactor the CMap parsing code so it can be used to build a tool that bulk-parses many CMaps;
[4 Hours] Allow PdfCharCodeMap to be initialized from a CodeUnitMap. This may remove the need to define a binary serialization of the map, as I was suggesting before. Basically, you can add a constructor to PdfCharCodeMap like the following:

PdfCharCodeMap(CodeUnitMap&& codeUnitMap);

You can then use it to define many singletons like the following:

static const PdfCharCodeMap& GetInstance_UniGB_UCS2_H()
{
    static PdfCharCodeMap UniGB_UCS2_H(CodeUnitMap({
        { PdfCharCode(32, 2), { 1 } },
        { PdfCharCode(33, 2), { 2 } },
        { PdfCharCode(34, 2), { 3 } }
        // ..
        }));

    return UniGB_UCS2_H;
}

[8 Hours] Make a tool that parses the CMaps and generates the singletons above in many .cpp files.
[8 Hours] Create a script that runs the tool above on the existing CMaps from the cmap-resources and mapping-resources-pdf repositories (both should be needed, in 2 steps).
[32 Hours] Implement the algorithm described in "9.10.2 Mapping character codes to Unicode values" below:

If the font is a composite font that uses one of the predefined CMaps listed in "Table 116 - Predefined CJK CMap names" (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection:
a. Map the character code to a character identifier (CID) according to the font’s CMap.
b. Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c. Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe–Japan1–UCS2).
d. Obtain the CMap with the name constructed in step (c) (available from a variety of online sources, e.g. https://github.com/adobe-type-tools/mapping-resources-pdf).
e. Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the PDF processor.
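
Steps (a) to (e) above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain std::map tables stand in for the embedded CMaps, and CodeToCid, CidToUnicode, PredefinedFont, Ucs2CMapName and CodeToUnicode are hypothetical names, not PoDoFo types or functions. It requires C++17 for std::optional.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-ins for the embedded tables; PoDoFo's real types differ.
using CodeToCid = std::map<uint32_t, uint32_t>;     // font CMap: code -> CID
using CidToUnicode = std::map<uint32_t, char32_t>;  // "-UCS2" CMap: CID -> Unicode

// Hypothetical description of a composite font using a predefined CMap.
struct PredefinedFont {
    CodeToCid cmap;        // table used in step (a)
    std::string registry;  // from CIDSystemInfo, e.g. "Adobe"
    std::string ordering;  // from CIDSystemInfo, e.g. "Japan1"
};

// Step (c): build the name "registry-ordering-UCS2".
std::string Ucs2CMapName(const std::string& registry, const std::string& ordering)
{
    return registry + "-" + ordering + "-UCS2";
}

// Maps one character code to Unicode; returns std::nullopt when any lookup fails.
std::optional<char32_t> CodeToUnicode(
    const PredefinedFont& font, uint32_t code,
    const std::map<std::string, CidToUnicode>& embeddedUcs2Maps)
{
    // Step (a): character code -> CID through the font's CMap.
    auto cid = font.cmap.find(code);
    if (cid == font.cmap.end())
        return std::nullopt;

    // Steps (b) to (d): find the second CMap by its constructed name.
    auto ucs2 = embeddedUcs2Maps.find(Ucs2CMapName(font.registry, font.ordering));
    if (ucs2 == embeddedUcs2Maps.end())
        return std::nullopt;

    // Step (e): CID -> Unicode value.
    auto unicode = ucs2->second.find(cid->second);
    if (unicode == ucs2->second.end())
        return std::nullopt;
    return unicode->second;
}
```

With the registry and ordering from the specification's example, Ucs2CMapName("Adobe", "Japan1") yields "Adobe-Japan1-UCS2". The sketch ignores the supplement number requirement and multi-code-unit Unicode values, both of which a real implementation would have to handle.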

Translated into PoDoFo's architecture, I believe a PdfEncoding instance has to be constructed from the embedded maps once the /Encoding entry is recognized as one of the predefined names. I believe the code-to-CID CMap needed in step (a) is to be found in cmap-resources, while the "toUnicode" CMap needed in step (d) is to be found in mapping-resources-pdf. You then construct an instance like PdfEncoding(cidMap, toUnicode) (the name detection and instance construction should probably be inserted at this location in the source) and text extraction should start to work.

Summarizing, I believe 7-8 man-days is a decent estimate of the work needed to accomplish the task. Following the above approach would make me more willing to fast-track the review/merge of a prototype solving the problem. The more the approach differs, the less comfortable I may be reviewing your work.

ceztko · 2024-03-21T09:02:59Z

Have you considered whether you are willing to implement the above activities? 7-8 days may be an overestimate, and if you are quick it could take less (but remember I would like to see a few unit tests as well for this work).

tayei1997 · 2024-03-21T11:11:08Z

Hi, I intend to complete the above functionality, but I must point out that I can only work on the code outside my official working hours and that I lack experience in this area, so I cannot guarantee a completion date.

ceztko · 2024-03-21T12:52:04Z

OK. I'm sorry for the unsolicited advice: I don't know what your job is, but if a company is paying you to work on PDF-related topics, I still recommend that you not work outside official hours if the work ultimately benefits them. This way, companies using open source software "for free" become more responsible, and the software itself improves in a more professional way.

ceztko added a commit that referenced this issue Mar 16, 2024

Added entry in TODO.md related to implement embedding predefined CMaps

5002c41

Related to #144

ceztko added the enhancement and help needed labels Mar 16, 2024

ceztko mentioned this issue Mar 21, 2024

Question: created pdf file font resources in the absence of ToUnicode map tecnickcom/TCPDF#693

Open
