zran enable down-version zlib random-access clients using byte-aligned indexes by jnorthrup · Pull Request #801 · madler/zlib

jnorthrup · 2023-04-13T07:22:41Z

The zran indexes and similar tools include compressed entry points that are not byte-aligned.

I believe that byte-aligned compressed entry points simplify the index and the serde, and relax the version requirement, particularly where Oracle JDK and JZLib Java clients are bound to zlib 1.1.x, which works for most files you encounter but cannot call inflatePrime.


madler · 2023-04-13T22:57:43Z

I have updated zran.c with a compile-time #define NOPRIME, which will use a substitute for the inflatePrime() function: 7e6dc42. This still permits entry points at arbitrary bit locations.

However neither that commit nor your PR would solve the stated problem, which is to get zran.c to work with zlib 1.1.x. inflatePrime() was introduced in zlib 1.2.3, but Z_BLOCK was introduced in zlib 1.2.1. You need the Z_BLOCK functionality of inflate() in order to find the block boundaries, be they on byte boundaries or not.

The linked commit will allow zran.c to work with zlib versions 1.2.1 and 1.2.2, and any incomplete zlib clones that don't have inflatePrime(), but that do have Z_BLOCK for inflate().

jnorthrup · 2023-04-13T23:52:28Z

> However neither that commit nor your PR would solve the stated problem, which is to get zran.c to work with zlib 1.1.x. inflatePrime() was introduced in zlib 1.2.3, but Z_BLOCK was introduced in zlib 1.2.1. You need the Z_BLOCK functionality of inflate() in order to find the block boundaries, be they on byte boundaries or not.

I am not fully comprehending the scope of your claims; please forgive the naivety.

If I write the zran index on byte-aligned outputs with any recent version of zlib, can I follow up with a reader client of that index that reads the 32K window, then points the z_stream inflater's input buffer at the input file, seeked to the correct location, and continues inflating? I don't understand the role of Z_BLOCK as a requirement in this situation, since by my understanding the inflater simply needs 32K to prime the dictionary.

madler · 2023-04-14T00:08:30Z

If your application is to use byte-aligned entry points (generated somewhere else) with a variant of zlib that does not have inflatePrime(), then no changes to zran.c would be needed at all. Simply link it with a dummy inflatePrime() routine, which will never be called if point->bits is always zero.

My commit permits entry points at any bit offset, even if the zlib variant does not have inflatePrime(). That is a better solution, since candidate byte-aligned entry points occur one-eighth as often. It also has the benefit that no change to zran.c is required for the generation of the index.

And yes, the client side does not need a zlib inflate() with Z_BLOCK, if it does not need to generate the index.
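As a rough illustration of that client-side flow (a sketch, not zran.c itself): the Python below builds a raw deflate stream with a byte-aligned block boundary forced by `Z_SYNC_FLUSH`, records an entry point (compressed offset, uncompressed offset, last 32 KiB of output), and then resumes inflation from that offset using only a preset dictionary. The names `build_stream` and `resume` and the dict keys are hypothetical, and Python's `zlib` module stands in for whatever inflater the client actually has; an older client would need some equivalent way to preload the window.

```python
import zlib

WINDOW = 32768  # deflate's maximum back-reference distance


def build_stream(head: bytes, tail: bytes):
    """Compress head+tail as raw deflate, forcing a byte-aligned
    block boundary between them, and return (stream, entry_point)."""
    comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    # Z_SYNC_FLUSH pads the output to a byte boundary but keeps the
    # compressor's history, so later blocks may still refer back to head.
    first = comp.compress(head) + comp.flush(zlib.Z_SYNC_FLUSH)
    rest = comp.compress(tail) + comp.flush()
    entry = {
        "in": len(first),            # compressed offset of the next block (bits == 0)
        "out": len(head),            # uncompressed offset at that point
        "window": head[-WINDOW:],    # last 32 KiB of uncompressed data
    }
    return first + rest, entry


def resume(stream: bytes, entry: dict) -> bytes:
    """Inflate from a byte-aligned entry point; no Z_BLOCK or
    inflatePrime() equivalent is involved on this side."""
    d = zlib.decompressobj(wbits=-15, zdict=entry["window"])
    return d.decompress(stream[entry["in"]:]) + d.flush()


if __name__ == "__main__":
    head = bytes(range(256)) * 200
    tail = head[-1000:] * 3 + b"new data"
    stream, entry = build_stream(head, tail)
    print(resume(stream, entry) == tail)
```

Only the generator needs to know where blocks start; the reader just seeks to `entry["in"]`, loads the window, and inflates.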

jnorthrup · 2023-04-14T00:29:12Z

Porting the macros to the JDK or JS is a last-resort option for an index reader.

> That is a better solution, since candidate byte-aligned entry points occur one-eighth as often.

1-8 bytes different? Or 1-8 blocks of n K different? Do forgive my ignorance; I was under the impression my mod would push the average entry point about 4 bytes past the target, not an average of 4K*4 past the stride.


madler · 2023-04-14T00:54:49Z

Sorry. I can't make any sense out of any of the content in your last comment.

jnorthrup · 2023-04-14T01:21:09Z

Alright, I want to get to the root of your comment "However neither that commit nor your PR would solve the stated problem". The stated problem is to "enable down-version zlib random-access _clients_". You make the claim that including the bits field is superior because it avoids skipping 7 out of 8 intervals before an index point can be posted. How many bytes are those intervals: 1-byte intervals, or kilobytes? I'm unaware of any minimum chunking requirement on capturing the last 32 KB window and recording a block.


madler · 2023-04-14T02:42:39Z

The stated problem was zran.c not working with zlib 1.1.x.

zran.c has both the index generation and the indexed access. Your PR would not solve that problem, since the index generation is not possible with zlib 1.1.x, due to its lack of Z_BLOCK.

If all you want is the indexed access part of zran.c, then your PR is not needed at all. As I said, if your entry points all have a point->bits of zero, i.e., they are all byte-aligned entry points, then simply link zran.c with a dummy inflatePrime() function that does nothing, and make no changes to the indexed access of zran.c at all.

madler · 2023-04-14T02:51:14Z

Aside from your PR a) not solving the problem, and b) not even being needed in the first place, it would also completely break zran.c. Obviously those changes would prevent the indexed access part from working with the index generation part, which would still be making indexes into arbitrary bit locations. So I'm not clear on why you submitted it.

madler · 2023-04-14T03:01:55Z

The value of allowing entry points at any bit location is that there are eight times as many candidate entry points available to choose from, as compared to only allowing them at byte boundaries. An entry point needs to be at the start of a deflate block. Each deflate block will generate on the order of a few tens to a few hundreds of K bytes of uncompressed data. When you ask zran for an index every megabyte, the distance between entry points will be at least a megabyte, but likely a few tens to a few hundreds of K bytes more than that, since it has to wait for the start of the next deflate block after a megabyte has gone by.

If you only permit byte-aligned entry points, you will have to wait much longer for the start of a deflate block that happens to start on a byte boundary. About eight times as long. So you will have to go a few hundreds of K to a few megabytes of uncompressed data until you finally run across a deflate block on a byte boundary.
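To put rough numbers on that, here is a toy simulation. It is only a sketch under simplifying assumptions that are not in zran.c: every deflate block produces the same amount of uncompressed data (`block`), and each block start lands on a uniformly random bit offset, so one start in eight is byte-aligned. Real streams have varying block sizes. The function name `mean_spacing` and its parameters are made up for the illustration.

```python
import random


def mean_spacing(span, block, aligned_only, blocks=100_000, seed=1):
    """Average uncompressed distance between entry points when an entry
    point is taken at the first eligible block start at least `span`
    bytes after the previous one."""
    rng = random.Random(seed)
    last = 0
    gaps = []
    for i in range(1, blocks):
        pos = i * block              # uncompressed offset of this block start
        if pos - last < span:
            continue                 # too close to the previous entry point
        if aligned_only and rng.randrange(8) != 0:
            continue                 # block start is not on a byte boundary
        gaps.append(pos - last)
        last = pos
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    span, block = 1 << 20, 64 << 10  # 1 MiB requested spacing, 64 KiB blocks
    print("any bit offset:", mean_spacing(span, block, aligned_only=False))
    print("byte-aligned  :", mean_spacing(span, block, aligned_only=True))
```

Under these assumptions the byte-aligned case waits, on average, about seven extra blocks past the requested span before an eligible block start turns up.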

jnorthrup · 2023-04-14T04:05:17Z

Thank you for helping me understand the complexity here with the blocking and chunking. So to be clear, one does not simply keep inflating a stream and storing a 32 KB window until the input boundary falls on an even boundary, IIUC. That's all my PR tries to do. That's all GPT-4 suggests for the same, but the implementation exceeds its grasp.

My end goal for down-version zlib compatibility on byte boundaries is to run {interpreted-language-like-js, in-browser-js, or java/jdk} random access to archive data available through a browser's XHR on http[s] range requests, or similarly with curl, to grab a chunk sized to the uncompressed span(s) needed. The assumption is that a baseline gzip inflater lib can be nearly any version if byte[] is all it requires for inputs.

In terms of technical debt, porting inflatePrime is a harder problem than byte-aligned indexes here. I'd settle for the asymptotic block intervals of byte-aligned indexes as a first goal, and focus on back-porting zlib implementations with inflatePrime/NOPRIME workarounds as a next-level upgrade. The significance of this block-size variance on very large gzip files is in the noise floor.

If my patch, with a handful of comments and a few lines of code added, is not the way to coerce byte-aligned indexes, my next question is: what will get the job done? Can I ask for a #define for that?

I noticed in the comments that zran v3 was written to demonstrate the new 1.2.3.4 features AFTER I tested the main() and submitted the PR. So yeah, I admit, I missed the target and intent.

Here's the landscape: there are any number of C/C++ offshoots of zran, even a Java port which uses a native zlib wrapper to call inflatePrime and doesn't use Java's built-in inflater. zran.c has the smallest source code for obvious reasons; it simply defines the blocks. The other implementations copy the code verbatim.

I've got 100 GB gzip files, which I have no control or influence over, to index and parcel out. This PR is my progress to date to get the index piece on the road for distributed clients.

the index creates byte-aligned index points enabling jdk zlib ports and other low-version zlib options. (commit 120ab2b)

jnorthrup force-pushed the develop branch from bd8ad3a to 120ab2b Compare April 13, 2023 07:30

jnorthrup changed the title ~~zran even-byte index points~~ zran enable down-version zlib random-access using byte-aligned indexes Apr 13, 2023

jnorthrup changed the title ~~zran enable down-version zlib random-access using byte-aligned indexes~~ zran enable down-version zlib random-access clients using byte-aligned indexes Apr 13, 2023

jnorthrup mentioned this pull request Apr 13, 2023

random access/zran using aligned byte entry points #802

Closed

madler closed this Apr 13, 2023

shasheene mentioned this pull request Aug 8, 2025

Splitting "gztool.c" to make it more useful as a library ("libgztool") circulosmeos/gztool#23

Open

jnorthrup wants to merge 1 commit into madler:develop from jnorthrup:develop
