Anyone having success using pdf2htmlex on ubuntu 20.04? #57

jeroenjeremy · 2020-05-22T10:15:17Z

I tried using the latest version on the latest LTS of Ubuntu but it won't work due to the newer version of Poppler.
Compiling it manually doesn't do the trick either.
Do you have any tips? Or will there be a release for the newer Poppler versions?

ViliusSutkus89 · 2020-05-22T11:29:31Z

Hello Jeroen,

Some time ago I noticed @stephengaito was working on some updates, but they are all in his repo.

Haven't tried building on Ubuntu-20.04, but let's see what problems you are getting.
What exactly do you mean by "doesn't do the trick" ?
Compile failure or runtime errors?

Before building pdf2htmlEX, you need Poppler-0.81.0 and Fontforge-20170731. They are pretty easy to build from source and you can install them in a user (not system) directory. Check issue #56 for a how-to guide.

jeroenjeremy · 2020-05-22T12:16:34Z

Hi Vilius,

All the errors are around Cairo, for example:

[  8%] Building CXX object CMakeFiles/pdf2htmlEX.dir/3rdparty/poppler/git/CairoFontEngine.cc.o
In file included from /tmp/pdf2htmlEX/3rdparty/poppler/git/CairoFontEngine.cc:43:
/tmp/pdf2htmlEX/3rdparty/poppler/git/CairoOutputDev.h:192:8: error: ‘void CairoOutputDev::drawChar(GfxState*, double, double, double, double, double, double, CharCode, int, Unicode*, int)’ marked ‘override’, but does not override
  192 |   void drawChar(GfxState *state, double x, double y,
      |        ^~~~~~~~

stephengaito · 2020-05-22T15:23:43Z

@jeroenjeremy and @ViliusSutkus89,

From my previous work (before Christmas 2019), I can confirm that it will be highly unlikely that you will be able to compile this using Ubuntu 20.04. The pdf2htmlEX sources have a very intimate relationship with specific releases of both poppler and fontforge. Alas, the poppler source is moving very fast so pdf2htmlEX's current source will almost certainly not work with the poppler libraries on Ubuntu 20.04.

You find me between tasks.... (and feeling a tad guilty for not keeping up with pdf2htmlEX)... so I will try to sort this out next week. I think I managed to get my fork up to poppler v0.82.0 and Ubuntu 20.04 uses poppler v0.86.0 (poppler itself is now up to v0.88.0). It will take about two days per poppler release.

HOWEVER, last November/December I essentially abandoned trying to make *.deb packages. I can and will make both AppImage and Docker releases. I might make a *.deb package BUT I have to review how to keep alternate versions of poppler and fontforge which do not conflict with the Ubuntu 20.04 installed versions.

My current AppImage (https://github.com/stephengaito/pdf2htmlEX/releases/tag/continuous) should still work on Ubuntu 20.04 since it is self contained (it has its own matching versions of pdf2htmlEX/poppler/fontforge).

Just before Christmas, I exhausted myself trying to get Travis/Homebrew to work on macOS... If anyone can contribute that knowledge that would be extremely helpful.

@ViliusSutkus89 can we coordinate your work with mine.... so we can eventually merge cleanly back into one project?

Regards, Stephen Gaito

jeroenjeremy · 2020-05-22T15:51:20Z

@stephengaito and @ViliusSutkus89 : you're both heroes for keeping pdf2htmlEX alive and kicking!
I've been trying to compile different versions of Poppler and Fontforge on 20.04 all afternoon but I realise this is clearly above my paygrade (i.e. I don't understand enough about all the dependencies).
@stephengaito I tried your Appimage. That is a very attractive model, I find. Does it work on AWS Lambda?
What is not immediately clear is whether your Appimage is a more up to date or much different build than release https://github.com/pdf2htmlEX/pdf2htmlEX/releases/tag/v0.18.7-poppler-0.81.0
Do you guys know?

jeroenjeremy · 2020-05-22T15:55:00Z

Sorry, just realised you build both and they ARE the same version.

jeroenjeremy · 2020-05-22T16:42:18Z

Actually, my search for an updated version started after realising the attached document doesn't convert well. Some characters are simply not shown after pdf2htmlex conversion. It's a document without complications (no odd fonts, no forms or anything like it). Even after normalisation with Ghostscript I had no luck.
Running the Appimage doesn't make a difference either.

stephengaito · 2020-05-22T18:19:37Z

@jeroenjeremy I will have a look at that PDF, but I can give no promises. May I add that document to my repository of problematic documents (which I believe is in a github repo and so public)?

Also, can you provide me with a "use case" describing the process events (and underlying scripts) which you would be using with AWS Lambda? Again, I can make no promises, but AWS Lambda sounds more self contained and so "easier to target" than a general *.deb package. If you can do this I will try to make whatever I do as AWS Lambda friendly as possible.

(Can AWS Lambda make use of Docker images... or is AWS Lambda a Docker competitor?)

jeroenjeremy · 2020-05-24T14:38:03Z

I prefer not to include the document in a public repo.
About Lambda: first of all, I'm not an expert on Lambda but we're seriously looking into it for our serverless microservices. Pdf2htmlex would be one of the microservices in a sequence of other functions.
Lambda works with 'layers' where you can include self-contained executables that need to have been compiled on Amazon Linux as far as I understand. See here for a walkthrough: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-walkthrough.html
Lambda cannot work with Docker, I'm afraid; I guess you could see it as a competitor.

stephengaito · 2020-05-24T21:08:30Z

@jeroenjeremy I have taken a copy of your PDF... you might want to delete it from your message above....

stephengaito · 2020-05-28T16:21:14Z

@jeroenjeremy , My tools have been "resharpened"... and now work with my substantially different development environments.

However, I have tried your PDF with both my currently released AppImage and my newly locally recreated AppImage and I can see no characters missing from either pdf2htmlEX converted versions when I compare them to the version that Okular provides me with.

Can you send me, or attach, a screen shot (an image either png or jpeg) of your pdf showing me the missing characters?

Regards,
Stephen Gaito

jeroenjeremy · 2020-06-02T14:10:50Z

I'm attaching a screenshot with a few of the problem areas marked. It seems to affect the characters 'f' and 't' in combination with 'i' disproportionately. For example: 'fietsenstalling' becomes 'etsenstalling'.

stephengaito · 2020-06-02T18:34:21Z

@jeroenjeremy, Many thanks for this image. I have added it to my private archive of troublesome PDFs.
Unfortunately the code for ligatures might be deeply buried across poppler and pdf2htmlEX.... which might make solving this problem much more difficult.
But I will try and have a look.

stephengaito · 2020-06-02T19:12:29Z

@jeroenjeremy Interesting... my version works "out of the box" (which might be why I could not see the problem).
I am using a freshly built pdf2htmlEX (from the development repo stephengaito/pdf2htmlEX - master branch should work).
My command line is:

# using svg as a background format

mkdir /tmp/webfsd/html/een_fietsenstalling_laten_opknappen

pdf2htmlEX --zoom 1.3 \
  --embed cfij --bg-format svg \
  --split-pages 1 \
  --dest-dir /tmp/webfsd/html/een_fietsenstalling_laten_opknappen \
  --page-filename een_fietsenstalling_laten_opknappen-%d.page \
  pdfFiles/een_fietsenstalling_laten_opknappen.pdf

Preprocessing: 1/1
Working: 1/1

A screen shot of my result (in a Firefox browser) is:

If that still fails for you, there is an explicit --decompose-ligature (boolean) option which "decompose ligatures, such as \uFB01 -> fi"... have you tried turning that option on?
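As an aside on what "decompose" means in that help text: U+FB01 is a single ligature code point, and Unicode compatibility normalization maps it to the two letters "f" and "i". The sketch below shows that mapping using Python's standard unicodedata module. It only illustrates the character-level idea; it makes no claim about how pdf2htmlEX implements the option internally, and the sample word is simply the one reported earlier in this thread.

```python
import unicodedata

# U+FB01 is the single-character "fi" ligature quoted in the option's help text.
ligature_word = "\ufb01etsenstalling"

# NFKC compatibility normalization replaces the ligature code point
# with the two separate letters "f" and "i".
decomposed = unicodedata.normalize("NFKC", ligature_word)

# The decomposed string is one character longer than the ligature form.
print(len(ligature_word), len(decomposed))
print(decomposed)
```

If the text layer of a converted page contains U+FB01 and the embedded font lacks a glyph for it, a mapping of this kind is one way the missing "fi" could be avoided, though that is only one possible cause of the symptom described above.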

PS: I am hoping, all being well, to merge my development repo back into the main pdf2htmlEX/pdf2htmlEX by the end of this week.

jeroenjeremy · 2020-06-02T19:15:28Z

It happens to me even when I don't use any options whatsoever, viewing in Safari/Edge/Chrome on macOS and Windows...

jeroenjeremy · 2020-06-02T19:16:28Z

Can I build your development version on Ubuntu 20 or 19?

stephengaito · 2020-06-02T19:17:48Z

@jeroenjeremy
PPS: You can build your own copy of pdf2htmlEX today... on a Ubuntu >= 18.04 using the command line:

  cd pdf2htmlEX
  ./buildScripts/buildInstallLocally

Note that I build inside a virtual machine... so that I have complete control over my environment....

"Can I build your development version on Ubuntu 20 or 19?" ... Yes I have built it inside a Ubuntu 20.04 virtual machine.

There should be an AppImage, Docker image and a *.deb archive by the end of this week as well.

stephengaito · 2020-06-02T19:35:14Z

@jeroenjeremy

A recent *.deb (built on Bionic 18.04 but should work on Focal 20.04) can be downloaded from https://www.dropbox.com/s/avwot8sce8s4vti/pdf2htmlEX-updateTravis-2020_06_02-18_50_44-x86_64-bionic.deb?dl=0

A recent AppImage can be downloaded from https://www.dropbox.com/s/4pc3h2p73fai0gj/pdf2htmlEX-updateTravis-2020_06_02-18_50_44-x86_64.AppImage?dl=0

To run an AppImage... rename it to whatever you like and make it executable.... then run it with the "usual" pdf2htmlEX command line options.... An AppImage is completely self contained.... so should run in (almost) any Linux environment.... as well as recent Windows 10 releases...

jeroenjeremy · 2020-06-03T12:57:42Z

It's really baffling: just ran your new Appimage and a version built per your './buildScripts/buildInstallLocally' and I still see the same issue. I've even tried it on different servers.

jeroenjeremy · 2020-06-05T18:41:20Z

Tried on a freshly installed Ubuntu 18.04 on Amazon today, both with Appimage and locally built and still getting the same result, I'm afraid.

stephengaito · 2020-06-06T18:36:44Z

@jeroenjeremy

Since I have shown that the ligatures work on my installation, I suspect you and I have different underlying installations of our distributions.

pdf2htmlEX is heavily dependent upon, among other things, fontconfig (both directly and via poppler and fontforge). The fundamental problem is, you need to be the expert on the fonts your PDFs use, pdf2htmlEX can not dictate this....

I suspect we have different locales and font configuration.

Can you provide me with:

copies of your /etc/lsb-release and /etc/os-release files
a zipped copy of your /etc/fonts directory
the output of the locale -a command
the output of the fc-list -v command
which AWS EC2 image are you using ?
can you try my docker image ( https://hub.docker.com/r/stephengaito/pdf2htmlex ) ?

I suspect some of this information may be fairly sensitive, so you might want to send it to me out-of-band...

Alas I am totally unsure if this is enough nor how to exactly use it to debug this problem... but it would be an important start.
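The text-based items in the list above could be gathered into a single report with a short script. This is a minimal sketch, not part of pdf2htmlEX: the function name collect_report is hypothetical, the file paths and the fc-list -v command come from the list, and the locale listing is assumed to be the standard locale -a command. Zipping /etc/fonts and naming the EC2 image are left as manual steps.

```python
import shutil
import subprocess
from pathlib import Path

# Files and commands taken from the list above. "locale -a" is assumed
# to be the intended spelling of the locale-listing command.
FILES = ["/etc/lsb-release", "/etc/os-release"]
COMMANDS = [["locale", "-a"], ["fc-list", "-v"]]


def collect_report(files=FILES, commands=COMMANDS):
    """Return one text report holding each file's contents and each command's output."""
    sections = []
    for name in files:
        path = Path(name)
        # Missing files are reported rather than raising, so the report is always produced.
        body = path.read_text(errors="replace") if path.is_file() else "(not found)\n"
        sections.append(f"=== {name} ===\n{body}")
    for argv in commands:
        if shutil.which(argv[0]) is None:
            body = "(command not available)\n"
        else:
            result = subprocess.run(argv, capture_output=True, text=True)
            body = result.stdout + result.stderr
        sections.append(f"=== {' '.join(argv)} ===\n{body}")
    return "\n".join(sections)
```

Writing the result of collect_report() to a text file gives one attachment that can be shared or sent out-of-band. The output of fc-list -v can be long, so the report may run to many thousands of lines.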

jeroenjeremy · 2020-06-07T08:24:35Z

@stephengaito Thanks for looking into it. None of the info is confidential, as I just took the standard AWS AMI and compiled pdf2htmlEX on it.

0 attached in gaito.txt
1 attached in fonts.zip
2 in gaito.txt
3 in gaito.txt
4 I used the standard AWS AMI for Ubuntu 18.04 and updated all the packages present. I didn't add any package myself: 'ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408 (ami-0701e7be9b2a77600)'
5 will get to that now

fonts.zip
gaito_info.txt

jeroenjeremy · 2020-06-07T09:20:42Z

Sorry to say but even running your Docker I get the same result... It seems that you managed to create a perfect setup for yourself where it DOES work ;-)
BTW Your default docker complains there is no manifest but when I specify 'updateBuildScripts-2020_06_06-12_11_26' it does run.

jeroenjeremy · 2020-06-07T09:33:08Z

When I switch on debugging in pdf2htmlEX I get a lot of errors like this:
Lookup subtable contains unused glyph glyph618 making the whole subtable invalid
Could that give a hint of what is going on?

stephengaito · 2020-06-08T11:15:31Z

Curiouser and curiouser!

Many thanks for this detail... I have now downloaded it if you want to delete it...
I will have a look at it all over the next couple of days.
I will also try running in an AWS EC2 image and see what happens.

stephengaito · 2020-06-24T13:02:28Z

@jeroenjeremy ,

After considerable ((re)re)testing, I have now made an official (alpha) release of the pdf2htmlEX tool. See: https://github.com/pdf2htmlEX/pdf2htmlEX/releases

(I think) I have tested pdf2htmlEX in enough controlled environments to be confident it should now work.... (alas that is easier to say than to do.... )

I have also (begun) to add more detail about how to install each type of pdf2htmlEX release object. Each such wiki Download-* page lists any additional packages which might need to be installed to recreate my working testing environment. Please have a look at these notes to make sure you have these packages installed.

Could you try one or more of these release objects and let me know if they work (or not) for you.

Regards,
Stephen Gaito

jeroenjeremy · 2020-06-25T17:05:36Z

Hi Stephen, thanks for the heads up. Great to see full releases for so many environments!
I tested the .deb and the Appimage in Ubuntu Focal freshly installed with the dependencies you mention on the wiki and it still doesn't process the problematic document correctly. I assume it still does on your side, so there must be something I don't have the same as you...

Some small constructive comments:
The version still reports as 'version 0.18.7'
On https://github.com/pdf2htmlEX/pdf2htmlEX/wiki/Download-Debian-Archive I think you mean to instruct people to install by using 'dpkg -i', if necessary followed by an 'apt-get install -f' to pick up missing dependencies?
On ubuntu 20.04 there is no such package as libicu60 or multiarch-support (needed for your Appimage according to https://github.com/pdf2htmlEX/pdf2htmlEX/wiki/Download-AppImage).

Best, Jeroen

stephengaito · 2020-06-27T12:47:49Z

Jeroen (@jeroenjeremy),

It has taken me a long time to "twig" that we are looking for the problem
at "the wrong end of the stick"... The problem we are having is not at the
building or using of pdf2htmlEX... but rather at the browser end... (over
which I have rather less control :-(

SO I have finally found the problem.... the pdf2htmlEX rendering of
your ligatures displays properly in any Gecko based browser (such as the
one I use: Firefox), but the same html fails to display correctly in
either the Blink based browsers (such as Chrome/Chromium) or WebKit based
browsers (such as Safari or Gnome's Epiphany/Web browser).

Now that I have identified the problem, I can work with all three browsers
locally to understand the problem.

I will class this as a failure of pdf2htmlEX to respect "the
standard" (whatever that might be)...

... but, as I am now way overdrawn on my pdf2htmlEX development budget, it
might take me a wee while to figure out how to fix this... (sorry!)

I have just tried this with the --decompose-ligature switch turned on (its help text reads "<int> decompose ligatures, such as ﬁ -> fi (default: 0)"). Unfortunately this
does not fix this issue.
Download-Debian-Archive: well in fact apt install ./pdf2htmlEX.deb
does work... however apt is not very forgiving and so if you miss the ./
things won't work. I will improve my discussion.
AppImage dependencies: I will investigate and improve my discussion.

Many thanks for all of your help identifying this issue. I now have much more comprehensive testing in place... which does help me sleep better at night :-)

I am sorry that it has taken so long to identify this problem... and possibly longer to solve..

Regards,
Stephen Gaito

jeroenjeremy · 2020-06-27T13:50:48Z

You're still our hero for keeping pdf2htmlEX up to date @stephengaito !
While you were writing your post, I was also doing further experiments with all the options available. The only one that improves the visual result slightly is '--font-format svg'. This makes the 'direce' now show correctly as 'directie' though the other problems persist. I thought knowing this might help you when debugging in the future. (using svg as a permanent fix is not viable anyway due to the huge size increase of result files)
By the way, and not to add to your worries, but the entire result page from pdf2htmlex can't be seen when browsing in Edge on Mac (depending on pdf2htmlex options) or the new Safari 14.0 preview ('Failed to open page'). Could be issues on the side of the browser in both cases.

stephengaito · 2020-06-27T15:52:45Z

Jeroen (@jeroenjeremy),

Thanks for those pointers... the --font-format svg is probably a very good hint!

As a late afternoon puzzle.... I have begun looking at the html. I can see all of the characters, and more importantly, Chrome is able to copy and paste all of the characters, it just does not display them.

I have just attempted to recreate the first part of your pdf using both LibreOffice and LaTeX. I know that LaTeX, at least, will produce ligatures for the 'fi' so I had hoped that I could reproduce the problem. Unfortunately it is not the ligatures themselves, but the interaction between the ligatures and your particular font. (Chrome displays all ligatures for both the LibreOffice and LaTeX versions).

SO: can you tell me a bit more about how you created your pdf? What tool did you use? What fonts did you use?

Regards,
Stephen Gaito

stephengaito · 2020-06-28T07:08:49Z

Jeroen (@jeroenjeremy),

I have found a solution; see: #68 (comment)

I will try and implement this as a priority... but might not be able to do this for a couple of weeks.

Until I can implement this, the (alas rather tedious) workaround is to manually edit the file-specific *.css file and add:

 font-variant-ligatures:none;font-feature-settings: "liga" 0, "clig" 0, "dlig" 0, "hlig" 0, "calt" 0;

to all @font-face definitions generated in the file specific *.css file.

(If you are pdf2htmlEXing a file: example.pdf the generated example.css would have to be manually edited as above -- this assumes you have used the --embed-css 0 switch).
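The manual edit described above can also be scripted as a post-processing step. The following is a minimal sketch under stated assumptions, not the fix that later landed in pdf2htmlEX: the function names disable_ligatures and patch_css_file are hypothetical, the declarations are copied verbatim from the workaround, and the regular expression assumes each block opens with "@font-face" followed by "{" (optionally separated by whitespace).

```python
import re
from pathlib import Path

# Declarations quoted verbatim from the workaround above.
LIGATURE_OFF = (
    "font-variant-ligatures:none;"
    'font-feature-settings: "liga" 0, "clig" 0, "dlig" 0, "hlig" 0, "calt" 0;'
)

# Matches the opening of each @font-face block, allowing whitespace before "{".
FONT_FACE_OPEN = re.compile(r"@font-face\s*\{")


def disable_ligatures(css: str) -> str:
    """Insert the ligature-disabling declarations at the start of every @font-face block."""
    if LIGATURE_OFF in css:
        return css  # already patched, so running the step twice changes nothing
    # A function is used as the replacement so the inserted text is taken literally.
    return FONT_FACE_OPEN.sub(lambda m: m.group(0) + LIGATURE_OFF, css)


def patch_css_file(path: str) -> None:
    """Rewrite one generated *.css file in place."""
    css_path = Path(path)
    css_path.write_text(disable_ligatures(css_path.read_text()))
```

For a conversion of example.pdf run with --embed-css 0, calling patch_css_file("example.css") would apply the edit. Because the function skips text that already contains the declarations, it should be safe to run over every result file in a batch.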

jeroenjeremy · 2020-06-28T08:10:29Z

That works wonderfully @stephengaito! I can add this as a postprocessing step in my flow without a problem. Do you foresee any disadvantages if applied over all result files?

stephengaito · 2020-06-30T15:02:27Z

Jeroen (@jeroenjeremy),

It is your lucky day.... I had a number of small tasks to complete before diving deeply into my main work later this week... so I took the half day yesterday to fix this issue and run full build/tests on my development repo as well as the main repo (which have just completed).

Have a look at the (current) releases and let me know if they work for you.

Many thanks for your help debugging pdf2htmlEX.

Enjoy!
Stephen Gaito

jeroenjeremy · 2020-07-02T08:47:22Z

That did the job, thanks again @stephengaito !

This was referenced Jun 25, 2020

Add correct pdf2htmlEX version check to tests #66

Closed

Add local browser Selenium tests for Blink(Chrome) and Webkit(Safari) #67

Closed

pdf2htmlEX's implementation of Ligatures do not work in the Blink/Webkit browser engines #68

Closed

stephengaito mentioned this issue Jun 29, 2020

Resurrect Lu Wang's "test_remote_browser" using SauceLab's OpenSauce tools #71

Open

stephengaito closed this as completed Jun 30, 2020
