-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bump mupdf, kopt, lept and tess #1203
Conversation
It's even more annoying when you add k2pdfopt in the mix, because it mostly expects a frozen set of dependencies for those, so, you'll probably want to bump that, too. (Although, last I checked, you should have some breathing room on that front with current k2pdfopt versions). Hopefully, the way I went about it the last time should make it slightly less arcane to devise what's k2pdfopt and what's our own patchset on top of it... See #762 for a few comments about this and MµPDF API changes, (and the last time we bumped k2pdfopt: koreader/libk2pdfopt#32). |
Also, I never bothered with it because, ahahaha, but the ZIP encryption thingy relies on an AES lib from minizip, and the minizip build is also frozen to a god-awfully old version. I don't rightly recall what happened when I tried to bump it, but it was.... not good :D. |
@NiLuJe Minizip is nuked, and moved to mupdf. Ye the hardcoded dependencies of kopt is nasty (the frozen versions instead of tracking is pretty much because of that). Luckily we dont care about lept/tess all that much as long kopt is happy. While with mupdf it pays off to stay on the edge for as long as possible, as both our and kopt use of it is more or less non-invasive so it's just about hoping they won't mess with api all that much in the future. This from what I've seen will be an issue, the api is completely gone replaced with something entirely new. The rest seems to map reasonably well. |
I assume you know your way around git blame (whether locally or via the GH web GUI) to figure out what some cryptic line is supposed to do, but besides that nothing in particular. Of course you should take a look at #762, possibly #577, as well as koreader/libk2pdfopt#32 but I imagine you either already have or will when you need to. Thanks so much for looking into this! The amount of time required is a real obstacle. 👍 |
Yep, and feel free to poke our brains, I still vaguely remember what I did the last time ^^. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hooray, the new OS X CI is working. :-)
@Frenzie It's kind of long distance telepathy as I have no other way to test it, but better than nothing :) |
Same, that's why I added it as soon as I noticed GH made it available. It kept silently breaking. ^_^ |
Makefile.defs
Outdated
CSTD_FLAGS:=-std=gnu11 | ||
CXXSTD_FLAGS:=-std=gnu++14 | ||
|
||
# TBA: Do we need this? It seems to actually break clang/OSX in tesseract.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NiLuJe Is it ok to drop this and instead add case-by-case if really needed - ie things where autotools don't figure this out on their own? It explodes on clang due to tesseract and some mupdf dependencies mixing up CFLAGS vs CXXFLAGS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect it's unnecessary on macOS, since even truly ancient versions of XCode should default to a recent-enough Clang that would make this redundant.
(IIRC, the main user of this hack was the old stock PB TC. With the current set of TCs, this is arguably a hairy NOP ;)).
-- TBA: This dimension has changed with new mupdf, | ||
-- possibly different font/rendering. Is that an issue? | ||
--assert.equals(math.floor(bbox[4]*1000), 106089) | ||
assert.equals(math.floor(bbox[4]*1000), 103669) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TnS-hun Just in case this ping works, is this to be expected or is something wrong here? (there's no visual difference, but who knows)
thirdparty/tesseract/CMakeLists.txt
Outdated
INSTALL_COMMAND "" | ||
PATCH_COMMAND COMMAND ${PATCH_CMD} COMMAND ${PATCH2_CMD} COMMAND ${PATCH3_CMD} COMMAND sh -c "${ISED} \"s|\\$srcdir\\/configure \\$\\*||g\" autogen.sh" COMMAND ./autogen.sh | ||
CONFIGURE_COMMAND ${CFG_CMD} | ||
BUILD_COMMAND ${KO_MAKE_RECURSIVE} -j1 # TBA: This is heavy weight C++, -j1 necessary so as to not nuke CircleCI instances |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Frenzie Is there a good way to make this conditional only for CI instances that are limited by memory (but without snipping parallism overall)? The reason this goes kaboom on CircleCI is ginormous C++ template expansions in Tesseract 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't CI build envs usually set an environment variable to detect 'em? (Or can we ask 'em to do that, somehow?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are various environment variables including a fairly self-evident ${CIRCLECI}
, see https://circleci.com/docs/2.0/env-vars/#built-in-environment-variables for the full list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out CMAKE can tell us available memory directly, so I suppose people with old computers can appreciate us not bombing their box too :>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need at least 2.5GB per CPU
Which would include my personal laptop at 4 GB total, which is perfectly usable as long as e.g. tesseract is only compiled irregularly, so yes, much better. ;-)
a8bd3c2
to
d26be30
Compare
ffi/harfbuzz_coverage.lua
Outdated
@@ -0,0 +1,347 @@ | |||
-- generated by https://github.com/ezdiy/unimap/blob/master/gen.lua |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a few comments here describing what these 2 datastructures and their contents are? And what the input to your gen.lua is?
Just so it's a bit less black magic, we know where that comes from, and if/when it needs to be updated.
(Was that unimap some previous project of yours ? Or did you cook it up just for this ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unimap is ported from different lua thing, yes. I'm not all that thrilled how the table is thrown in here, but the alternative seems even more spammy - makefile/wget Scripts.txt & clone fontconfig/gen.lua somewhere in koreader make - for a table of unicode ranges that's quite unlikely to ever change. Alas, wouldn't describing the format make more sense on harfbuzz.lua end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alas, wouldn't describing the format make more sense on harfbuzz.lua end?
Whereever you feel it is best - with just a backlink (-- see harfbuzz.lua for details
) in this file so we know where to look.
Just a few words about what this is all about - so one can get what it's about and just skip it as ok, not interesting :)
Regarding your addition of And in a few words before we see your code: how do you plan on having harfbuzz help you with MuPDF ? |
You were right, FT kinda sucks for this, as HB can handle most of the encoding issues for us. I think just rewriting it 1:1 to harfbuzz.lua so utility function so as to not pollute xtext with it, sketch: local hb = require("ffi/harfbuzz")
local hbface = hb.hb_ft_face_create_referenced(uiface)
local buf = ffi.new("char[256]")
hb_ot_name_get_utf8(hbface, hb.HB_OT_NAME_ID_FULL_NAME, lang, ffi.new("int[1]", 256), buf)
return ffi.string(buf) There are still some minor gotchas like having to check all FONT_FAMILY, NAME and FULLNAME as only one may have something useful still apply. I'll bring some cherrypicked PR for this as it's not really dependent on what I'm doing with mupdf anymore, the hb parts are already in master.
This whole ritual dance is essentially for the sake of ghetto fontconfig to get fonts working in mupdf beyond the extent of hardcoded builtins (which aren't even there to begin with now :). To do this with device user/supplied fonts, there's not getting around coverage and some modest heuristics to pick something, as there's no luxury of restarting layout whenever we step on a missing glyph. When we respond with a face, we commit to coverage of requested lang and/or script. This can also help avoiding spastic substitutions in mixed script layout block whenever available font choice is more looser than single main+builtins list. All of this stuff is meant to live more or less parallel to cre, so as to not step on its toes, though some font sharing (on blob/ftmemory level) should be implemented in the future between the two so as to avoid spamming too many copies of same face. As for the SFNT name fields in FT, it was meant to get pristine view in the cached info for font select should it be needed, but turns out all the fields there are useless, and the one that is helpful for ui, is better asked via harfbuzz. |
Fine with me 👍 Thanks for the other info. |
For PDF, it's about those that actually don't embed fonts. CJK ones quite often don't (too much bloat), musical notes, papers that are just .ps in pdf wrapper... - as it stands you get only hardcoded builtins for those, and even that ostensibly fails unless you manually butcher base14 builtin to point at freefont/noto/droid (what old patch did). This at least avoids dropping glyphs for "exotic" latin scripts such as czech, at the cost of barring font choice from the document entirely. As for line height/glyph width that is an issue indeed - it's generally the reason why we should try real hard to supply the font that is asked for, at least something close to that. PDFs like that are not really fixed-sized, the post script path commands do have wiggle room for layout engine to make decisions. The text itself in a pdf is "reflowable", but the text boxes in the document itself tend to be laid out too haphazardly to be of any practical use for whole-page reflow. In HTML terms, imagine pdf as a document with everything being laid out using Finally, there's html itself - while CRE is vastly superior renderer, mupdf does offer some interesting options. svg that at least sometimes gave me better results than tinysvg. One can also get very fast pagination (think instant document opens and em size changes). Though admittedly the pdfdocument frontend needs to be made a lil bit smarter to make reasonable use of that (forgo kopt and tile cache when not needed on reflowable documents etc). |
mupdf 1.17, tracking https://github.com/ezdiy/mupdf kopt 2.53, tracking https://github.com/ezdiy/libk2pdfopt lept 1.79, frozen tess 4.1, frozen
@ezdiy : long time no news :/ Are you ok and fine? Just busy ? Offline ? Should somebody else (mhh, not me :) continue this PR? |
I'll close this one in favor of #1750 and related PRs, but thanks so much for the effort! |
mupdf 1.17, tracking https://github.com/ezdiy/mupdf
kopt 2.53, tracking https://github.com/ezdiy/libk2pdfopt
lept 1.79, frozen
tess 4.1, frozen
Due to complexity and heavy interdependencies, these were frozen in time
for past 2-3 years, and are starting showing their age.
This is a large undertaking so it will take a while, any tips are welcome.
Currently "mostly works".
tbd:
This change is![Reviewable](https://camo.githubusercontent.com/23b05f5fb48215c989e92cc44cf6512512d083132bd3daf689867c8d9d386888/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)