Anyone having success using pdf2htmlex on ubuntu 20.04? #57

Closed
jeroenjeremy opened this issue May 22, 2020 · 33 comments

@jeroenjeremy

I tried using the latest version on the latest LTS of Ubuntu but it won't work due to the newer version of Poppler.
Compiling it manually doesn't do the trick either.
Do you have any tips? Or will there be a release for the newer Poppler versions?

@ViliusSutkus89
Contributor

Hello Jeroen,

Some time ago I noticed that @stephengaito was working on some updates, but they are all in his repo.

I haven't tried building on Ubuntu 20.04, but let's see what problems you are getting.
What exactly do you mean by "doesn't do the trick" ?
Compile failure or runtime errors?

Before building pdf2htmlEX, you need Poppler-0.81.0 and Fontforge-20170731. They are pretty easy to build from source and you can install them in a user (not system) directory. Check issue #56 for a how-to guide.

@jeroenjeremy
Author

Hi Vilius,

All the errors are around Cairo, for example:

[  8%] Building CXX object CMakeFiles/pdf2htmlEX.dir/3rdparty/poppler/git/CairoFontEngine.cc.o
In file included from /tmp/pdf2htmlEX/3rdparty/poppler/git/CairoFontEngine.cc:43:
/tmp/pdf2htmlEX/3rdparty/poppler/git/CairoOutputDev.h:192:8: error: ‘void CairoOutputDev::drawChar(GfxState*, double, double, double, double, double, double, CharCode, int, Unicode*, int)’ marked ‘override’, but does not override
  192 |   void drawChar(GfxState *state, double x, double y,
      |        ^~~~~~~~

@stephengaito
Contributor

@jeroenjeremy and @ViliusSutkus89,

From my previous work (before Christmas 2019), I can confirm that it will be highly unlikely that you will be able to compile this using Ubuntu 20.04. The pdf2htmlEX sources have a very intimate relationship with specific releases of both poppler and fontforge. Alas, the poppler source is moving very fast so pdf2htmlEX's current source will almost certainly not work with the poppler libraries on Ubuntu 20.04.

You find me between tasks.... (and feeling a tad guilty for not keeping up with pdf2htmlEX)... so I will try to sort this out next week. I think I managed to get my fork up to poppler v0.82.0 and Ubuntu 20.04 uses poppler v0.86.0 (poppler itself is now up to v0.88.0). It will take about two days per poppler release.

HOWEVER, last November/December I essentially abandoned trying to make *.deb packages. I can and will make both AppImage and Docker releases. I might make a *.deb package BUT I have to review how to keep alternate versions of poppler and fontforge which do not conflict with the Ubuntu 20.04 installed versions.

My current AppImage (https://github.com/stephengaito/pdf2htmlEX/releases/tag/continuous) should still work on Ubuntu 20.04 since it is self-contained (it has its own matching versions of pdf2htmlEX/poppler/fontforge).

Just before Christmas, I exhausted myself trying to get Travis/Homebrew to work on macOS... If anyone can contribute that knowledge that would be extremely helpful.

@ViliusSutkus89 can we coordinate your work with mine.... so we can eventually merge cleanly back into one project?

Regards, Stephen Gaito

@jeroenjeremy
Author

@stephengaito and @ViliusSutkus89 : you're both heroes for keeping pdf2htmlEX alive and kicking!
I've been trying to compile different versions of Poppler and Fontforge on 20.04 all afternoon but I realise this is clearly above my paygrade (i.e. I don't understand enough about all the dependencies).
@stephengaito I tried your Appimage. That is a very attractive model, I find. Does it work on AWS Lambda?
What is not immediately clear is whether your Appimage is a more up to date or much different build than release https://github.com/pdf2htmlEX/pdf2htmlEX/releases/tag/v0.18.7-poppler-0.81.0
Do you guys know?

@jeroenjeremy
Author

Sorry, just realised you build both and they ARE the same version.

@jeroenjeremy
Author

jeroenjeremy commented May 22, 2020

Actually, my search for an updated version started after realising the attached document doesn't convert well. Some characters are simply not shown after pdf2htmlex conversion. It's a document without complications (no odd fonts, no forms or anything like it). Even after normalisation with Ghostscript I had no luck.
Running the Appimage doesn't make a difference either.

@stephengaito
Contributor

@jeroenjeremy I will have a look at that PDF, I can give no promises. May I add that document to my repository of problematic documents (which I believe is in a github repo and so public)?

Also, can you provide me with a "use case" describing the process events (and underlying scripts) which you would be using with AWS Lambda? Again, I can make no promises, but AWS Lambda sounds more self-contained and so "easier to target" than a general *.deb package. If you can do this I will try to make whatever I do as AWS Lambda friendly as possible.

(Can AWS Lambda make use of Docker images... or is AWS Lambda a Docker competitor?)

@jeroenjeremy
Author

I prefer not to include the document in a public repo.
About Lambda: first of all, I'm not an expert on Lambda but we're seriously looking into it for our serverless microservices. Pdf2htmlex would be one of the microservices in a sequence of other functions.
Lambda works with 'layers' where you can include self-contained executables that need to have been compiled on Amazon Linux as far as I understand. See here for a walkthrough: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-walkthrough.html
Lambda cannot work with Docker, I'm afraid; I guess you could see it as a competitor.

@stephengaito
Contributor

@jeroenjeremy I have taken a copy of your PDF... you might want to delete it from your message above....

@stephengaito
Contributor

@jeroenjeremy , My tools have been "resharpened"... and now work with my substantially different development environments.

However, I have tried your PDF with both my currently released AppImage and my newly, locally recreated AppImage, and I can see no characters missing from either of the pdf2htmlEX-converted versions when I compare them to the version that Okular shows me.

Can you send me, or attach, a screen shot (an image either png or jpeg) of your pdf showing me the missing characters?

Regards,
Stephen Gaito

@jeroenjeremy
Author

I'm attaching a screenshot with a few of the problem areas marked. It seems to affect the characters 'f' and 't' in combination with 'i' disproportionately. For example: 'fietsenstalling' becomes 'etsenstalling'.

Screenshot 2020-06-02 at 15 56 45

@stephengaito
Contributor

@jeroenjeremy, Many thanks for this image. I have added it to my private archive of troublesome PDFs.
Unfortunately the code for ligatures might be deeply buried across poppler and pdf2htmlEX.... which might make solving this problem much more difficult.
But I will try and have a look.
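
For readers unfamiliar with ligatures, here is a rough Python illustration of the symptom. It rests on an assumption that has not been confirmed for this PDF: that the "fi" pair is stored as the single ligature code point U+FB01. The helper names `drop_ligatures` and `decompose_ligatures` are hypothetical and are not part of pdf2htmlEX or poppler.

```python
import unicodedata

# U+FB01 is LATIN SMALL LIGATURE FI: one code point that is drawn as "fi".
LIGATURE_FI = "\ufb01"

# The word as it might be stored when the "fi" pair is a single ligature glyph.
word = LIGATURE_FI + "etsenstalling"


def drop_ligatures(text: str) -> str:
    """Simulate a viewer that fails to draw the ligature glyph at all."""
    return text.replace(LIGATURE_FI, "")


def decompose_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB01 into plain letters."""
    return unicodedata.normalize("NFKC", text)


print(drop_ligatures(word))       # the damaged form seen in the screenshot
print(decompose_ligatures(word))  # the intended word
```

If the ligature glyph is lost, the whole "fi" pair disappears rather than a single letter, which would match 'fietsenstalling' showing up as 'etsenstalling'.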

@stephengaito
Contributor

@jeroenjeremy Interesting... my version works "out of the box" (which might be why I could not see the problem).
I am using a freshly built pdf2htmlEX (from the development repo stephengaito/pdf2htmlEX - master branch should work).
My command line is:

using svg as a background format

mkdir /tmp/webfsd/html/een_fietsenstalling_laten_opknappen

pdf2htmlEX --zoom 1.3 \
  --embed cfij --bg-format svg \
  --split-pages 1 \
  --dest-dir /tmp/webfsd/html/een_fietsenstalling_laten_opknappen \
  --page-filename een_fietsenstalling_laten_opknappen-%d.page \
  pdfFiles/een_fietsenstalling_laten_opknappen.pdf

Preprocessing: 1/1
Working: 1/1

A screenshot of my result (in a Firefox browser) is:

[image: screenshot of the converted page in Firefox]

If that still fails for you, there is an explicit --decompose-ligature (boolean) option which "decompose ligatures, such as \uFB01 -> fi"... have you tried turning that option on?
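
For anyone scripting this, the command line above could be wrapped roughly as follows. This is only a sketch: it assumes `pdf2htmlEX` is on the PATH, the function names `build_command` and `convert` are made up for illustration, and the option values are simply the ones quoted above (with `--decompose-ligature 1` added on request).

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, dest_dir, decompose_ligature=False):
    """Build the pdf2htmlEX argument list used in the comment above."""
    pdf = Path(pdf_path)
    cmd = [
        "pdf2htmlEX",
        "--zoom", "1.3",
        "--embed", "cfij",
        "--bg-format", "svg",
        "--split-pages", "1",
        "--dest-dir", str(dest_dir),
        "--page-filename", f"{pdf.stem}-%d.page",
    ]
    if decompose_ligature:
        # The option takes an integer; 0 is the default (off).
        cmd += ["--decompose-ligature", "1"]
    cmd.append(str(pdf))
    return cmd


def convert(pdf_path, dest_dir, decompose_ligature=False):
    """Create the destination directory and run the conversion."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdf2htmlEX exits non-zero.
    subprocess.run(build_command(pdf_path, dest_dir, decompose_ligature), check=True)
```

Calling `convert("pdfFiles/example.pdf", "/tmp/out", decompose_ligature=True)` (hypothetical paths) would run the same conversion with ligature decomposition switched on.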

PS: I am hoping, all being well, to merge my development repo back into the main pdf2htmlEX/pdf2htmlEX by the end of this week.

@jeroenjeremy
Author

It happens to me even when I don't use any options whatsoever, viewing in Safari/Edge/Chrome on macOS and Windows...

@jeroenjeremy
Author

Can I build your development version on Ubuntu 20 or 19?

@stephengaito
Contributor

stephengaito commented Jun 2, 2020

@jeroenjeremy
PPS: You can build your own copy of pdf2htmlEX today... on Ubuntu >= 18.04 using the command line:

  cd pdf2htmlEX
  ./buildScripts/buildInstallLocally 

Note that I build inside a virtual machine... so that I have complete control over my environment....

"Can I build your development version on Ubuntu 20 or 19?" ... Yes I have built it inside a Ubuntu 20.04 virtual machine.

There should be an AppImage, Docker image and a *.deb archive by the end of this week as well.

@stephengaito
Contributor

stephengaito commented Jun 2, 2020

@jeroenjeremy

A recent *.deb (built on Bionic 18.04 but should work on Focal 20.04) can be downloaded from https://www.dropbox.com/s/avwot8sce8s4vti/pdf2htmlEX-updateTravis-2020_06_02-18_50_44-x86_64-bionic.deb?dl=0

A recent AppImage can be downloaded from https://www.dropbox.com/s/4pc3h2p73fai0gj/pdf2htmlEX-updateTravis-2020_06_02-18_50_44-x86_64.AppImage?dl=0

To run an AppImage... rename it to whatever you like and make it executable.... then run it with the "usual" pdf2htmlEX command line options.... An AppImage is completely self-contained.... so should run in (almost) any Linux environment.... as well as recent Windows 10 releases...

@jeroenjeremy
Author

It's really baffling: just ran your new Appimage and a version built per your './buildScripts/buildInstallLocally' and I still see the same issue. I've even tried it on different servers.

@jeroenjeremy
Author

Tried on a freshly installed Ubuntu 18.04 on Amazon today, both with Appimage and locally built and still getting the same result, I'm afraid.

@stephengaito
Contributor

stephengaito commented Jun 6, 2020

@jeroenjeremy

Since I have shown that the ligatures work on my installation, I suspect you and I have different underlying installations of our distributions.

pdf2htmlEX is heavily dependent upon, among other things, fontconfig (both directly and via poppler and fontforge). The fundamental problem is, you need to be the expert on the fonts your PDFs use, pdf2htmlEX can not dictate this....

I suspect we have different locales and font configuration.

Can you provide me with:

  1. copies of your /etc/lsb-release and /etc/os-release files
  2. a zipped copy of your /etc/fonts directory
  3. the output of the locale -a command
  4. the output of the fc-list -v command
  5. which AWS EC2 image are you using ?
  6. can you try my docker image ( https://hub.docker.com/r/stephengaito/pdf2htmlex ) ?
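
If it helps, items 1, 3 and 4 could be gathered into one text report with something like the sketch below. It is not a tested tool: the function names are made up, it assumes `locale` and `fc-list` are installed (`locale -a` being the standard command that lists locales), and it deliberately leaves out zipping /etc/fonts and the EC2 image question.

```python
import subprocess
from pathlib import Path

# Files and commands from items 1, 3 and 4 of the list above.
FILES = ["/etc/lsb-release", "/etc/os-release"]
COMMANDS = [["locale", "-a"], ["fc-list", "-v"]]


def read_file_or_note(path):
    """Return the file contents, or a short note if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace")
    except OSError as exc:
        return f"<could not read {path}: {exc.strerror}>\n"


def run_or_note(argv):
    """Return the command output, or a short note if it cannot be started."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, errors="replace", check=False
        )
    except OSError as exc:
        return f"<could not run {argv[0]}: {exc.strerror}>\n"
    return result.stdout + result.stderr


def collect_report():
    """Concatenate every file and command output under a simple header."""
    parts = []
    for path in FILES:
        parts.append(f"==== {path} ====\n{read_file_or_note(path)}")
    for argv in COMMANDS:
        parts.append(f"==== {' '.join(argv)} ====\n{run_or_note(argv)}")
    return "\n".join(parts)
```

Writing the result of `collect_report()` to a file gives a single attachment; please review it before posting, since font lists and locale settings may reveal more about a machine than intended.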

I suspect some of this information may be fairly sensitive, so you might want to send it to me out-of-band...

Alas I am totally unsure if this is enough nor how to exactly use it to debug this problem... but it would be an important start.

@jeroenjeremy
Author

@stephengaito Thanks for looking into it. None of the info is confidential, as I just took the standard AWS AMI and compiled pdf2htmlEX on it.

1. attached in gaito.txt
2. attached in fonts.zip
3. in gaito.txt
4. in gaito.txt
5. I used the standard AWS AMI for Ubuntu 18.04 and updated all the packages present. I didn't add any package myself: 'ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408 (ami-0701e7be9b2a77600)'
6. will get to that now

fonts.zip
gaito_info.txt

@jeroenjeremy
Author

Sorry to say but even running your Docker I get the same result... It seems that you managed to create a perfect setup for yourself where it DOES work ;-)
BTW Your default docker complains there is no manifest but when I specify 'updateBuildScripts-2020_06_06-12_11_26' it does run.

@jeroenjeremy
Author

When I switch on debugging in pdf2htmlEX I get a lot of errors like this:
Lookup subtable contains unused glyph glyph618 making the whole subtable invalid
Could that give a hint of what is going on?

@stephengaito
Contributor

stephengaito commented Jun 8, 2020

Curiouser and curiouser!

Many thanks for this detail... I have now downloaded it if you want to delete it...
I will have a look at it all over the next couple of days.
I will also try running in an AWS EC2 image and see what happens.

@stephengaito
Contributor

@jeroenjeremy ,

After considerable ((re)re)testing, I have now made an official (alpha) release of the pdf2htmlEX tool. See: https://github.com/pdf2htmlEX/pdf2htmlEX/releases

(I think) I have tested pdf2htmlEX in enough controlled environments to be confident it should now work.... (alas that is easier to say than to do.... )

I have also (begun) to add more detail about how to install each type of pdf2htmlEX release object. Each such wiki Download-* page lists any additional packages which might need to be installed to recreate my working testing environment. Please have a look at these notes to make sure you have these packages installed.

Could you try one or more of these release objects and let me know if they work (or not) for you.

Regards,
Stephen Gaito

@jeroenjeremy
Author

Hi Stephen, thanks for the heads up. Great to see full releases for so many environments!
I tested the .deb and the Appimage in Ubuntu Focal freshly installed with the dependencies you mention on the wiki and it still doesn't process the problematic document correctly. I assume it still does on your side, so there must be something I don't have the same as you...

Some small constructive comments:
The version still reports as 'version 0.18.7'
On https://github.com/pdf2htmlEX/pdf2htmlEX/wiki/Download-Debian-Archive I think you mean to instruct people to install by using 'dpkg -i', if necessary followed by an 'apt-get install -f' to pick up missing dependencies?
On ubuntu 20.04 there is no such package as libicu60 or multiarch-support (needed for your Appimage according to https://github.com/pdf2htmlEX/pdf2htmlEX/wiki/Download-AppImage).

Best, Jeroen

@stephengaito
Contributor

Jeroen (@jeroenjeremy),

It has taken me a long time to "twig" that we are looking for the problem at "the wrong end of the stick"... The problem we are having is not in the building or using of pdf2htmlEX... but rather at the browser end... (over which I have rather less control :-( )

SO I have finally found the problem.... the pdf2htmlEX rendering of your ligatures displays properly in any Gecko-based browser (such as the one I use: Firefox), but the same html fails to display correctly in either the Blink-based browsers (such as Chrome/Chromium) or the WebKit-based browsers (such as Safari or Gnome's Epiphany/Web browser).

Now that I have identified the problem, I can work with all three browsers locally to understand it.

I will class this as a failure of pdf2htmlEX to respect "the standard" (whatever that might be)...

... but, as I am now way overdrawn on my pdf2htmlEX development budget, it might take me a wee while to figure out how to fix this... (sorry!)


  1. I have just tried this with the "--decompose-ligature <int>" switch turned on (its help text reads: "decompose ligatures, such as \uFB01 -> fi (default: 0)"). Unfortunately this
    does not fix this issue.

  2. Download-Debian-Archive: well in fact apt install ./pdf2htmlEX.deb
    does work... however apt is not very forgiving and so if you miss the ./
    things won't work. I will improve my discussion.

  3. AppImage dependencies: I will investigate and improve my discussion.


Many thanks for all of your help identifying this issue. I now have much more comprehensive testing in place... which does help me sleep better at night :-)

I am sorry that it has taken so long to identify this problem... and possibly longer to solve..

Regards,
Stephen Gaito

@jeroenjeremy
Author

You're still our hero for keeping pdf2htmlEX up to date @stephengaito !
While you were writing your post, I was also doing further experiments with all the options available. The only one that improves the visual result slightly is '--font-format svg'. This makes the 'direce' now show correctly as 'directie' though the other problems persist. I thought knowing this might help you when debugging in the future. (using svg as a permanent fix is not viable anyway due to the huge size increase of result files)
By the way, and not to add to your worries, but the entire result page from pdf2htmlex can't be seen when browsing in Edge on Mac (depending on pdf2htmlex options) or the new Safari 14.0 preview ('Failed to open page'). Could be issues on the side of the browser in both cases.

@stephengaito
Contributor

stephengaito commented Jun 27, 2020

Jeroen (@jeroenjeremy),

Thanks for those pointers... the --font-format svg is probably a very good hint!

As a late afternoon puzzle.... I have begun looking at the html. I can see all of the characters, and more importantly, Chrome is able to copy and paste all of the characters, it just does not display them.

I have just attempted to recreate the first part of your pdf using both LibreOffice and LaTeX. I know that LaTeX, at least, will produce ligatures for the 'fi' so I had hoped that I could reproduce the problem. Unfortunately it is not the ligatures themselves, but the interaction between the ligatures and your particular font. (Chrome displays all ligatures for both the LibreOffice and LaTeX versions).

SO: can you tell me a bit more about how you created your pdf? What tool did you use? What fonts did you use?

Regards,
Stephen Gaito

@stephengaito
Contributor

Jeroen (@jeroenjeremy),

I have found a solution see: #68 (comment)

I will try and implement this as a priority... but might not be able to do this for a couple of weeks.

Until I can implement this, the (alas rather tedious) workaround is to manually edit the file specific *.css file and add:

 font-variant-ligatures:none;font-feature-settings: "liga" 0, "clig" 0, "dlig" 0, "hlig" 0, "calt" 0;

to all @font-face definitions generated in the file specific *.css file.

(If you are pdf2htmlEXing a file: example.pdf the generated example.css would have to be manually edited as above -- this assumes you have used the --embed-css 0 switch).
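
The manual edit could also be automated as a post-processing step. The following is a minimal sketch, not the eventual pdf2htmlEX fix: the function names are hypothetical, and it assumes each generated @font-face block contains no nested braces and that the CSS was written to its own file (i.e. `--embed-css 0`).

```python
import re
from pathlib import Path

# The declarations quoted above, added to every @font-face block.
LIGATURE_OFF = (
    "font-variant-ligatures:none;"
    'font-feature-settings: "liga" 0, "clig" 0, "dlig" 0, "hlig" 0, "calt" 0;'
)

# Matches "@font-face{ ... }" where the body has no nested braces (assumption).
_FONT_FACE = re.compile(r"(@font-face\s*\{)([^{}]*)(\})")


def disable_ligatures(css: str) -> str:
    """Append the ligature-disabling declarations to each @font-face block."""

    def patch(match):
        head, body, tail = match.groups()
        if "font-variant-ligatures" in body:
            return match.group(0)  # already patched, leave untouched
        body = body.rstrip()
        if body and not body.endswith(";"):
            body += ";"
        return head + body + LIGATURE_OFF + tail

    return _FONT_FACE.sub(patch, css)


def patch_css_file(path):
    """Rewrite one generated CSS file in place (e.g. example.css)."""
    css_path = Path(path)
    css_path.write_text(disable_ligatures(css_path.read_text()))
```

Running `patch_css_file("example.css")` after each conversion applies the workaround; running it twice is harmless because already patched blocks are skipped.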

@jeroenjeremy
Author

That works wonderfully @stephengaito! I can add this as a postprocessing step in my flow without a problem. Do you foresee any disadvantages if applied over all result files?

@stephengaito
Contributor

Jeroen (@jeroenjeremy),

It is your lucky day.... I had a number of small tasks to complete before diving deeply into my main work later this week... so I took the half day yesterday to fix this issue and run full build/tests on my development repo as well as the main repo (which have just completed).

Have a look at the (current) releases and let me know if they work for you.

Many thanks for your help debugging pdf2htmlEX.

Enjoy!
Stephen Gaito

@jeroenjeremy
Author

That did the job, thanks again @stephengaito !
