Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump mupdf, kopt, lept and tess #1203

Closed
wants to merge 2 commits into from
Closed

Bump mupdf, kopt, lept and tess #1203

wants to merge 2 commits into from

Conversation

ezdiy
Copy link
Member

@ezdiy ezdiy commented Sep 29, 2020

mupdf 1.17, tracking https://github.com/ezdiy/mupdf
kopt 2.53, tracking https://github.com/ezdiy/libk2pdfopt
lept 1.79, frozen
tess 4.1, frozen

Due to complexity and heavy interdependencies, these were frozen in time
for past 2-3 years, and are starting showing their age.

This is a large undertaking so it will take a while, any tips are welcome.
Currently "mostly works".

tbd:

  • android build
  • mupdf falling on its face with glyphs that are missing in freefont

This change is Reviewable

@NiLuJe
Copy link
Member

NiLuJe commented Sep 29, 2020

It's even more annoying when you add k2pdfopt in the mix, because it mostly expects a frozen set of dependencies for those, so, you'll probably want to bump that, too.

(Although, last I checked, you should have some breathing room on that front with current k2pdfopt versions).

Hopefully, the way I went about it the last time should make it slightly less arcane to devise what's k2pdfopt and what's our own patchset on top of it...

See #762 for a few comments about this and MµPDF API changes, (and the last time we bumped k2pdfopt: koreader/libk2pdfopt#32).

@NiLuJe
Copy link
Member

NiLuJe commented Sep 29, 2020

Also, I never bothered with it because, ahahaha, but the ZIP encryption thingy relies on an AES lib from minizip, and the minizip build is also frozen to a god-awfully old version.

I don't rightly recall what happened when I tried to bump it, but it was.... not good :D.

@ezdiy
Copy link
Member Author

ezdiy commented Sep 29, 2020

@NiLuJe Minizip is nuked, and moved to mupdf. Ye the hardcoded dependencies of kopt is nasty (the frozen versions instead of tracking is pretty much because of that). Luckily we dont care about lept/tess all that much as long kopt is happy. While with mupdf it pays off to stay on the edge for as long as possible, as both our and kopt use of it is more or less non-invasive so it's just about hoping they won't mess with api all that much in the future.

#762 (comment)

This from what I've seen will be an issue, the api is completely gone replaced with something entirely new. The rest seems to map reasonably well.

@Frenzie
Copy link
Member

Frenzie commented Sep 30, 2020

This is a large undertaking so it will take a while, any tips are welcome.

I assume you know your way around git blame (whether locally or via the GH web GUI) to figure out what some cryptic line is supposed to do, but besides that nothing in particular. Of course you should take a look at #762, possibly #577, as well as koreader/libk2pdfopt#32 but I imagine you either already have or will when you need to.

Thanks so much for looking into this! The amount of time required is a real obstacle. 👍

@NiLuJe
Copy link
Member

NiLuJe commented Sep 30, 2020

Yep, and feel free to poke our brains, I still vaguely remember what I did the last time ^^.

Copy link
Member

@Frenzie Frenzie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hooray, the new OS X CI is working. :-)

@ezdiy
Copy link
Member Author

ezdiy commented Oct 5, 2020

@Frenzie It's kind of long distance telepathy as I have no other way to test it, but better than nothing :)

@Frenzie
Copy link
Member

Frenzie commented Oct 5, 2020

Same, that's why I added it as soon as I noticed GH made it available. It kept silently breaking. ^_^

Makefile.defs Outdated
CSTD_FLAGS:=-std=gnu11
CXXSTD_FLAGS:=-std=gnu++14

# TBA: Do we need this? It seems to actually break clang/OSX in tesseract..
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NiLuJe Is it ok to drop this and instead add case-by-case if really needed - ie things where autotools don't figure this out on their own? It explodes on clang due to tesseract and some mupdf dependencies mixing up CFLAGS vs CXXFLAGS.

Copy link
Member

@NiLuJe NiLuJe Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect it's unnecessary on macOS, since even truly ancient versions of XCode should default to a recent-enough Clang that would make this redundant.

(IIRC, the main user of this hack was the old stock PB TC. With the current set of TCs, this is arguably a hairy NOP ;)).

-- TBA: This dimension has changed with new mupdf,
-- possibly different font/rendering. Is that an issue?
--assert.equals(math.floor(bbox[4]*1000), 106089)
assert.equals(math.floor(bbox[4]*1000), 103669)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TnS-hun Just in case this ping works, is this to be expected or is something wrong here? (there's no visual difference, but who knows)

INSTALL_COMMAND ""
PATCH_COMMAND COMMAND ${PATCH_CMD} COMMAND ${PATCH2_CMD} COMMAND ${PATCH3_CMD} COMMAND sh -c "${ISED} \"s|\\$srcdir\\/configure \\$\\*||g\" autogen.sh" COMMAND ./autogen.sh
CONFIGURE_COMMAND ${CFG_CMD}
BUILD_COMMAND ${KO_MAKE_RECURSIVE} -j1 # TBA: This is heavy weight C++, -j1 necessary so as to not nuke CircleCI instances
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Frenzie Is there a good way to make this conditional only for CI instances that are limited by memory (but without snipping parallism overall)? The reason this goes kaboom on CircleCI is ginormous C++ template expansions in Tesseract 4.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't CI build envs usually set an environment variable to detect 'em? (Or can we ask 'em to do that, somehow?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are various environment variables including a fairly self-evident ${CIRCLECI}, see https://circleci.com/docs/2.0/env-vars/#built-in-environment-variables for the full list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out CMAKE can tell us available memory directly, so I suppose people with old computers can appreciate us not bombing their box too :>

Copy link
Member

@Frenzie Frenzie Oct 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need at least 2.5GB per CPU

Which would include my personal laptop at 4 GB total, which is perfectly usable as long as e.g. tesseract is only compiled irregularly, so yes, much better. ;-)

@ezdiy ezdiy force-pushed the mupdf branch 2 times, most recently from a8bd3c2 to d26be30 Compare October 9, 2020 01:47
@@ -0,0 +1,347 @@
-- generated by https://github.com/ezdiy/unimap/blob/master/gen.lua
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a few comments here describing what these 2 datastructures and their contents are? And what the input to your gen.lua is?
Just so it's a bit less black magic, we know where that comes from, and if/when it needs to be updated.

(Was that unimap some previous project of yours ? Or did you cook it up just for this ?)

Copy link
Member Author

@ezdiy ezdiy Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unimap is ported from different lua thing, yes. I'm not all that thrilled how the table is thrown in here, but the alternative seems even more spammy - makefile/wget Scripts.txt & clone fontconfig/gen.lua somewhere in koreader make - for a table of unicode ranges that's quite unlikely to ever change. Alas, wouldn't describing the format make more sense on harfbuzz.lua end?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, wouldn't describing the format make more sense on harfbuzz.lua end?

Whereever you feel it is best - with just a backlink (-- see harfbuzz.lua for details) in this file so we know where to look.
Just a few words about what this is all about - so one can get what it's about and just skip it as ok, not interesting :)

@poire-z
Copy link
Contributor

poire-z commented Oct 9, 2020

Regarding your addition of FT_Get_Sfnt_Name() : do you plan on handling koreader/koreader#6763 in a standalone way (so I don't have to get xtext involved in that, koreader/koreader#6763 (comment)) ?
Or do you need that for something else?

And in a few words before we see your code: how do you plan on having harfbuzz help you with MuPDF ?

@ezdiy
Copy link
Member Author

ezdiy commented Oct 9, 2020

Regarding your addition of FT_Get_Sfnt_Name() : do you plan on handling koreader/koreader#6763 in a standalone way (so I don't have to get xtext involved in that, koreader/koreader#6763 (comment)) ?

You were right, FT kinda sucks for this, as HB can handle most of the encoding issues for us. I think just rewriting it 1:1 to harfbuzz.lua so utility function so as to not pollute xtext with it, sketch:

local hb = require("ffi/harfbuzz")
local hbface = hb.hb_ft_face_create_referenced(uiface)
local buf = ffi.new("char[256]")
hb_ot_name_get_utf8(hbface, hb.HB_OT_NAME_ID_FULL_NAME, lang, ffi.new("int[1]", 256), buf)
return ffi.string(buf)

There are still some minor gotchas like having to check all FONT_FAMILY, NAME and FULLNAME as only one may have something useful still apply.

I'll bring some cherrypicked PR for this as it's not really dependent on what I'm doing with mupdf anymore, the hb parts are already in master.

Or do you need that for something else?

And in a few words before we see your code: how do you plan on having harfbuzz help you with MuPDF ?

This whole ritual dance is essentially for the sake of ghetto fontconfig to get fonts working in mupdf beyond the extent of hardcoded builtins (which aren't even there to begin with now :). fz_install_load_system_font_funcs() gets you script, language and typeface name in a callback, and we should spit some FT_Face instance out.

To do this with device user/supplied fonts, there's not getting around coverage and some modest heuristics to pick something, as there's no luxury of restarting layout whenever we step on a missing glyph. When we respond with a face, we commit to coverage of requested lang and/or script. This can also help avoiding spastic substitutions in mixed script layout block whenever available font choice is more looser than single main+builtins list.

All of this stuff is meant to live more or less parallel to cre, so as to not step on its toes, though some font sharing (on blob/ftmemory level) should be implemented in the future between the two so as to avoid spamming too many copies of same face.

As for the SFNT name fields in FT, it was meant to get pristine view in the cached info for font select should it be needed, but turns out all the fields there are useless, and the one that is helpful for ui, is better asked via harfbuzz.

@poire-z
Copy link
Contributor

poire-z commented Oct 9, 2020

I think just rewriting it 1:1 to harfbuzz.lua so utility function so as to not pollute xtext with it, sketch

Fine with me 👍

Thanks for the other info.
Just a question, as I'm not really familiar with PDF.
I thought PDF has all the layout fixed, i.e. lines of text are already linebroken and each word has some kind of x/y absolute positionning. Which I feel would require the exact font to be known/used so the text is nice.
By using arbitrary external fonts, isn't there the risk that with a font whose line height and glyph widths are different from the original font used by the author (but not embedded), the text will look bad?
Or is it that real nice documents should ship embedded fonts - and what we do here is for plain simple PDF documents where we have to do with that risk of bad layout? Or is PDF supposed to do reflow/line breaking in that case, and handle more stuff than I thought it does?

@ezdiy
Copy link
Member Author

ezdiy commented Oct 9, 2020

For PDF, it's about those that actually don't embed fonts. CJK ones quite often don't (too much bloat), musical notes, papers that are just .ps in pdf wrapper... - as it stands you get only hardcoded builtins for those, and even that ostensibly fails unless you manually butcher base14 builtin to point at freefont/noto/droid (what old patch did). This at least avoids dropping glyphs for "exotic" latin scripts such as czech, at the cost of barring font choice from the document entirely.

As for line height/glyph width that is an issue indeed - it's generally the reason why we should try real hard to supply the font that is asked for, at least something close to that. PDFs like that are not really fixed-sized, the post script path commands do have wiggle room for layout engine to make decisions. The text itself in a pdf is "reflowable", but the text boxes in the document itself tend to be laid out too haphazardly to be of any practical use for whole-page reflow. In HTML terms, imagine pdf as a document with everything being laid out using <table>, text generally at paragraph granularity.

Finally, there's html itself - while CRE is vastly superior renderer, mupdf does offer some interesting options. svg that at least sometimes gave me better results than tinysvg. One can also get very fast pagination (think instant document opens and em size changes). Though admittedly the pdfdocument frontend needs to be made a lil bit smarter to make reasonable use of that (forgo kopt and tile cache when not needed on reflowable documents etc).

mupdf 1.17, tracking https://github.com/ezdiy/mupdf
kopt 2.53, tracking https://github.com/ezdiy/libk2pdfopt
lept 1.79, frozen
tess 4.1, frozen
@poire-z
Copy link
Contributor

poire-z commented Feb 4, 2021

@ezdiy : long time no news :/ Are you ok and fine? Just busy ? Offline ?

Should somebody else (mhh, not me :) continue this PR?

@Frenzie
Copy link
Member

Frenzie commented May 3, 2024

I'll close this one in favor of #1750 and related PRs, but thanks so much for the effort!

@Frenzie Frenzie closed this May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants