Anyone having success using pdf2htmlex on ubuntu 20.04? #57
Comments
Hello Jeroen, Some time ago I noticed @stephengaito was working on some updates, but they are all in his repo. I haven't tried building on Ubuntu 20.04, but let's see what problems you are getting. Before building pdf2htmlEX, you need Poppler-0.81.0 and Fontforge-20170731. They are pretty easy to build from source, and you can install them in a user (not system) directory. Check issue #56 for a how-to guide. |
Hi Vilius, All the errors are around Cairo, for example:
|
@jeroenjeremy and @ViliusSutkus89, From my previous work (before Christmas 2019), I can confirm that it is highly unlikely that you will be able to compile this on Ubuntu 20.04. The pdf2htmlEX sources have a very intimate relationship with specific releases of both poppler and fontforge. Alas, the poppler source is moving very fast, so pdf2htmlEX's current source will almost certainly not work with the poppler libraries on Ubuntu 20.04. You find me between tasks... (and feeling a tad guilty for not keeping up with pdf2htmlEX)... so I will try to sort this out next week. I think I managed to get my fork up to poppler v0.82.0, and Ubuntu 20.04 uses poppler v0.86.0 (poppler itself is now up to v0.88.0). It will take about two days per poppler release. HOWEVER, last November/December I essentially abandoned trying to make *.deb packages. I can and will make both AppImage and Docker releases. I might make a *.deb package BUT I have to review how to keep alternate versions of poppler and fontforge which do not conflict with the versions installed on Ubuntu 20.04. My current AppImage (https://github.com/stephengaito/pdf2htmlEX/releases/tag/continuous) should still work on Ubuntu 20.04 since it is self-contained (it has its own matching versions of pdf2htmlEX/poppler/fontforge). Just before Christmas, I exhausted myself trying to get Travis/Homebrew to work on macOS... If anyone can contribute that knowledge, that would be extremely helpful. @ViliusSutkus89, can we coordinate your work with mine... so we can eventually merge cleanly back into one project? Regards, Stephen Gaito |
@stephengaito and @ViliusSutkus89 : you're both heroes for keeping pdf2htmlEX alive and kicking! |
Sorry, just realised you build both and they ARE the same version. |
Actually, my search for an updated version started after realising the attached document doesn't convert well. Some characters are simply not shown after pdf2htmlex conversion. It's a document without complications (no odd fonts, no forms or anything like it). Even after normalisation with Ghostscript I had no luck. |
@jeroenjeremy I will have a look at that PDF, but I can give no promises. May I add that document to my repository of problematic documents (which I believe is in a GitHub repo and so public)? Also, can you provide me with a "use case" describing the process events (and underlying scripts) which you would be using with AWS Lambda? Again, I can make no promises, but AWS Lambda sounds more self-contained and so "easier to target" than a general *.deb package. If you can do this, I will try to make whatever I do as AWS Lambda friendly as possible. (Can AWS Lambda make use of Docker images... or is AWS Lambda a Docker competitor?) |
I prefer not to include the document in a public repo. |
@jeroenjeremy I have taken a copy of your PDF... you might want to delete it from your message above... |
@jeroenjeremy , My tools have been "resharpened"... and now work with my substantially different development environments. However, I have tried your PDF with both my currently released AppImage and my newly locally recreated AppImage and I can see no characters missing from either pdf2htmlEX converted versions when I compare them to the version that Okular provides me with. Can you send me, or attach, a screen shot (an image either png or jpeg) of your pdf showing me the missing characters? Regards, |
@jeroenjeremy, Many thanks for this image. I have added it to my private archive of troublesome PDFs. |
@jeroenjeremy Interesting... my version works "out of the box" (which might be why I could not see the problem).
A screenshot of my result (in a Firefox browser) is: If that still fails for you, there is an explicit --decompose-ligature (boolean) option which "decompose ligatures, such as \uFB01 -> fi"... have you tried turning that option on? PS: I am hoping, all being well, to merge my development repo back into the main pdf2htmlEX/pdf2htmlEX by the end of this week. |
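For readers unfamiliar with what "decomposing" a ligature means, here is a minimal Python sketch. It is not how pdf2htmlEX implements --decompose-ligature; it only illustrates the idea using Unicode compatibility normalization (NFKC) on the Latin presentation-form ligatures U+FB00 to U+FB06. The function name `decompose_ligatures` is made up for this illustration.

```python
import unicodedata


def decompose_ligatures(text: str) -> str:
    """Replace the Latin ligature code points U+FB00..U+FB06 with plain letters.

    Illustration only: this relies on Unicode NFKC normalization and is not
    taken from the pdf2htmlEX sources.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if "\uFB00" <= ch <= "\uFB06" else ch
        for ch in text
    )


# U+FB01 is the single-glyph "fi" ligature, U+FB03 is "ffi".
print(decompose_ligatures("\uFB01nal o\uFB03ce"))  # final office
```

Only the ligature code points are normalized, so the rest of the text is left untouched.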
It happens to me even when I don't use any options whatsoever, viewing in Safari/Edge/Chrome on macOS and Windows... |
Can I build your development version on Ubuntu 20 or 19? |
@jeroenjeremy
Note that I build inside a virtual machine... so that I have complete control over my environment.... "Can I build your development version on Ubuntu 20 or 19?" ... Yes I have built it inside a Ubuntu 20.04 virtual machine. There should be an AppImage, Docker image and a *.deb archive by the end of this week as well. |
A recent *.deb (built on Bionic 18.04 but should work on Focal 20.04) can be downloaded from https://www.dropbox.com/s/avwot8sce8s4vti/pdf2htmlEX-updateTravis-2020_06_02-18_50_44-x86_64-bionic.deb?dl=0 A recent AppImage can be downloaded from https://www.dropbox.com/s/4pc3h2p73fai0gj/pdf2htmlEX-updateTravis-2020_06_02-18_50_44-x86_64.AppImage?dl=0 To run an AppImage, rename it to whatever you like and make it executable, then run it with the "usual" pdf2htmlEX command line options. An AppImage is completely self-contained, so it should run in (almost) any Linux environment, as well as recent Windows 10 releases... |
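The "make it executable, then run it" steps can be scripted. The following is a rough Python sketch, assuming the AppImage was downloaded and renamed to a local file; the file names `pdf2htmlEX.AppImage` and `input.pdf` in the comment are placeholders, and the helper names are made up for this example.

```python
import os
import stat
import subprocess
from typing import Sequence


def make_executable(path: str) -> None:
    """Add the execute bits to a file (equivalent to `chmod +x path`)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def run_appimage(appimage: str, args: Sequence[str]) -> subprocess.CompletedProcess:
    """Mark the AppImage executable, then run it with the given arguments."""
    make_executable(appimage)
    # The absolute path avoids depending on "." being on the PATH.
    return subprocess.run([os.path.abspath(appimage), *args], check=True)


# Example call (placeholder file names):
# run_appimage("pdf2htmlEX.AppImage", ["input.pdf"])
```

With `check=True`, a non-zero exit status from the AppImage raises `subprocess.CalledProcessError`, so a failed conversion is not silently ignored.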
It's really baffling: I just ran your new AppImage and a version built per your './buildScripts/buildInstallLocally', and I still see the same issue. I've even tried it on different servers. |
Tried on a freshly installed Ubuntu 18.04 on Amazon today, both with the AppImage and a locally built version, and I'm still getting the same result, I'm afraid. |
Since I have shown that the ligatures work on my installation, I suspect you and I have different underlying installations of our distributions. pdf2htmlEX is heavily dependent upon, among other things, fontconfig (both directly and via poppler and fontforge). The fundamental problem is, you need to be the expert on the fonts your PDFs use, pdf2htmlEX can not dictate this.... I suspect we have different locales and font configuration. Can you provide me with:
I suspect some of this information may be fairly sensitive, so you might want to send it to me out-of-band... Alas I am totally unsure if this is enough nor how to exactly use it to debug this problem... but it would be an important start. |
@stephengaito Thanks for looking into it. None of the info is confidential, as I just took the standard AWS AMI and compiled pdf2htmlEX on it. 0 attached in gaito.txt |
Sorry to say but even running your Docker I get the same result... It seems that you managed to create a perfect setup for yourself where it DOES work ;-) |
When I switch on debugging in pdf2htmlEX I get a lot of errors like this: |
Curiouser and curiouser! Many thanks for this detail... I have now downloaded it if you want to delete it... |
After considerable ((re)re)testing, I have now made an official (alpha) release of the (I think) I have tested I have also (begun) to add more detail about how to install each type of Could you try one or more of these release objects and let me know if they work (or not) for you. Regards, |
Hi Stephen, thanks for the heads up. Great to see full releases for so many environments! Some small constructive comments: Best, Jeroen |
Jeroen (@jeroenjeremy), It has taken me a long time to "twig" that we are looking for the problem SO I have finally found the problem.... the pdf2htmlEX rendering of Now that I have identified the problem, I can work with all three browsers I will class this as a failure of pdf2htmlEX to respect "the ... but, as I am now way overdrawn on my pdf2htmlEX development budget, it
Many thanks for all of your help identifying this issue. I now have much more comprehensive testing in place... which does help me sleep better at night :-) I am sorry that it has taken so long to identify this problem... and possibly longer to solve.. Regards, |
You're still our hero for keeping pdf2htmlEX up to date @stephengaito ! |
Jeroen (@jeroenjeremy), Thanks for those pointers... the As a late afternoon puzzle.... I have begun looking at the html. I can see all of the characters, and more importantly, Chrome is able to copy and paste all of the characters, it just does not display them. I have just attempted to recreate the first part of your pdf using both LibreOffice and LaTeX. I know that LaTeX, at least, will produce ligatures for the 'fi' so I had hoped that I could reproduce the problem. Unfortunately it is not the ligatures themselves, but the interaction between the ligatures and your particular font. (Chrome displays all ligatures for both the LibreOffice and LaTeX versions). SO: can you tell me a bit more about how you created your pdf? What tool did you use? What fonts did you use? Regards, |
Jeroen (@jeroenjeremy), I have found a solution, see: #68 (comment). I will try to implement this as a priority... but might not be able to do this for a couple of weeks. Until I can implement this, the (alas rather tedious) workaround is to manually edit the file specific
to all @font-face definitions generated in the file specific *.css file. (If you are |
That works wonderfully @stephengaito! I can add this as a postprocessing step in my flow without a problem. Do you foresee any disadvantages if applied over all result files? |
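A postprocessing step of this kind could look roughly like the Python sketch below. The actual CSS declaration to add is given in the linked #68 comment and is not reproduced in this thread, so `declaration` is a parameter here and the value in the usage comment is a placeholder, not the real fix. The function name and the regular expression are assumptions for illustration; the sketch assumes each block starts with a literal `@font-face{`, possibly with whitespace before the brace.

```python
import re

# Matches the opening of each @font-face block, e.g. "@font-face{" or "@font-face {".
FONT_FACE_RE = re.compile(r"@font-face\s*\{")


def add_to_font_faces(css: str, declaration: str) -> str:
    """Insert `declaration` at the start of every @font-face block in `css`.

    Blocks that already start with the declaration are left alone, so the
    function can safely be run more than once over the same file.
    """
    def patch(match: "re.Match[str]") -> str:
        already_there = match.string.startswith(declaration, match.end())
        return match.group(0) if already_there else match.group(0) + declaration

    return FONT_FACE_RE.sub(patch, css)


# Usage with a PLACEHOLDER declaration (substitute the one from #68):
# patched = add_to_font_faces(open("doc.css").read(), "placeholder-property:value;")
```

Because already-patched blocks are skipped, applying the step to every result file in a batch, including ones that were processed before, should not duplicate the declaration.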
Jeroen (@jeroenjeremy), It is your lucky day... I had a number of small tasks to complete before diving deeply into my main work later this week... so I took the half day yesterday to fix this issue and run full build/tests on my development repo as well as the main repo (which have just completed). Have a look at the (current) releases and let me know if they work for you. Many thanks for your help debugging pdf2htmlEX. Enjoy! |
That did the job, thanks again @stephengaito ! |
I tried using the latest version on the latest LTS of Ubuntu but it won't work due to the newer version of Poppler.
Compiling it manually doesn't do the trick either.
Do you have any tips? Or will there be a release for the newer Poppler versions?