Extracting text from a PDF file results in garbled output #144
Hello, there may be multiple problems, but I think the most relevant one is related to missing support for predefined CMap encodings, for which I just added an entry in the TODO (it was a known missing feature). Embedding all the known predefined CMaps, which I believe should live in this repository, is quite a big task. The main idea I have for this task would be to parse all the CMaps inside …
You can have a look at the use of …
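For context, Adobe's predefined CMaps are distributed as PostScript-syntax text files that map character-code ranges to CIDs via `begincidrange … endcidrange` sections. As a rough illustration of what "parsing all the CMaps" involves, here is a minimal, hypothetical sketch of a parser for just the `cidrange` sections (the function name is invented for this example; real CMap files also contain codespace ranges, `cidchar` entries, and `usecmap` directives that a complete parser must handle):

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: collect "begincidrange ... endcidrange" sections of a
// predefined CMap file into a code -> CID map. This covers only one of the
// several section types a full parser would need.
std::map<uint32_t, uint32_t> ParseCidRanges(std::istream& cmap)
{
    std::map<uint32_t, uint32_t> codeToCid;
    std::string token;
    bool inRange = false;
    while (cmap >> token)
    {
        if (token == "begincidrange") { inRange = true; continue; }
        if (token == "endcidrange") { inRange = false; continue; }
        if (!inRange)
            continue;

        // Inside a range section, tokens come in triples: <lo> <hi> startCid
        std::string hiTok, cidTok;
        if (!(cmap >> hiTok >> cidTok))
            break;
        uint32_t lo = std::stoul(token.substr(1, token.size() - 2), nullptr, 16);
        uint32_t hi = std::stoul(hiTok.substr(1, hiTok.size() - 2), nullptr, 16);
        uint32_t cid = std::stoul(cidTok);
        for (uint32_t code = lo; code <= hi; code++)
            codeToCid[code] = cid + (code - lo);
    }
    return codeToCid;
}
```

For example, feeding it the section `1 begincidrange <0020> <0022> 1 endcidrange` yields three entries mapping codes 0x20–0x22 to CIDs 1–3.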
I am willing to carry out the development, so I would like to ask a few questions:
Hello. Sorry for the delay in answering; it took me some more time to do further analysis. First, let me confirm that the issue here really is the missing embedding of predefined CMap encodings. I'll try to answer your questions below.
Yes, but there are options to reduce the memory consumption by embedding pre-parsed maps. See below.
It's hard to say, but let's try to break it into tasks and (possibly over-)estimate them:
```cpp
PdfCharCodeMap(CodeUnitMap&& codeUnitMap);
```

which you can use to define many singletons like the following:

```cpp
static const PdfCharCodeMap& GetInstance_UniGB_UCS2_H()
{
    static PdfCharCodeMap UniGB_UCS2_H(CodeUnitMap({
        { PdfCharCode(32, 2), { 1 } },
        { PdfCharCode(33, 2), { 2 } },
        { PdfCharCode(34, 2), { 3 } },
        // ...
    }));
    return UniGB_UCS2_H;
}
```
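As an aside, the pattern above relies on C++11's guarantee that a function-local static is initialized exactly once, thread-safely, on first use (a "Meyers singleton"), so each pre-parsed map is built lazily and then reused. A self-contained illustration of the same pattern with a plain standard-library map (the contents are placeholders, not real UniGB-UCS2-H data):

```cpp
#include <cstdint>
#include <unordered_map>

// Stand-in for a pre-parsed CMap table: built once, on first access,
// with thread-safe initialization guaranteed by the language since C++11.
static const std::unordered_map<uint16_t, uint32_t>& GetDemoMap()
{
    static const std::unordered_map<uint16_t, uint32_t> map{
        { 32, 1 },
        { 33, 2 },
        { 34, 3 },
    };
    return map;
}
```

Every caller receives a reference to the same instance, so the parsing/allocation cost is paid at most once per process no matter how many fonts reference the encoding.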
Translated into the PoDoFo architecture, and summarizing, I believe 7-8 man-days is a decent estimate of the work needed to accomplish the task. Following the above approach would make me more willing to fast-track review/merge of a prototype solving the problem; the more the approach differs, the less comfortable I may be reviewing your work.
Have you considered whether you are willing to implement the above activities? 7-8 days may be a generous estimate, and if you are quick enough it could be shorter (but remember I would like to see a few unit tests as well for this work).
Hi, I do intend to complete the above functionality, but I must note that since I can only work on this code outside of my official working hours, and given my lack of experience with this codebase, I cannot guarantee a completion date.
Ok. I'm sorry for the unsolicited advice: I don't know what your job is, but if a company is paying you to work on PDF-related topics, I still recommend you not work outside official hours when the work ultimately benefits them. That way, companies using open source software "for free" become more responsible, and the software itself improves in a more professional way.
Hello, I am using the PoDoFo library's PDF text extraction feature and have run into a garbled-text problem: when I use PoDoFo to extract sample1.pdf, the console outputs:
```
Updating version from 1.7 to 1.7
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F2
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
...
```
sample1.pdf
I checked sample1.pdf and it doesn't seem to have a ToUnicode map; could this be the cause?
The image below shows the extracted garbled text.
I noticed that the garbled extracted text looks similar to the content that precedes the Tj operator in the PDF content stream. When PoDoFo cannot find the font, does it simply output the raw PDF string as-is?
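That symptom is consistent with a missing code-to-Unicode mapping: without a ToUnicode CMap (or a predefined CMap), an extractor has no way to translate the character codes in the string operand of Tj, so the raw codes leak through. A schematic decoder illustrating the mechanism (hypothetical names, not PoDoFo's actual API; assumes fixed 2-byte codes for simplicity):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch: decode a 2-byte-code PDF string using a
// code -> Unicode map (as a ToUnicode CMap would provide). Codes
// missing from the map are passed through unchanged, which is
// exactly the "garbled output" symptom described above.
std::u32string DecodeString(const std::string& raw,
                            const std::map<uint16_t, char32_t>& toUnicode)
{
    std::u32string out;
    for (size_t i = 0; i + 1 < raw.size(); i += 2)
    {
        uint16_t code = (uint8_t)raw[i] << 8 | (uint8_t)raw[i + 1];
        auto it = toUnicode.find(code);
        out.push_back(it != toUnicode.end() ? it->second : (char32_t)code);
    }
    return out;
}
```

With an empty map, the output is just the reinterpreted raw codes, i.e. garbage for any non-trivial encoding.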