zran enable down-version zlib random-access clients using byte-aligned indexes #801
jnorthrup wants to merge 1 commit into madler:develop
other low-version zlib options.
|
I have updated zran.c with a compile-time #define However neither that commit nor your PR would solve the stated problem, which is to get zran.c to work with zlib 1.1.x. The linked commit will allow zran.c to work with zlib versions 1.2.1 and 1.2.2, and any incomplete zlib clones that don't have |
I am not fully comprehending the scope of your claims; please forgive the naivety. If I write the zran index on byte-aligned outputs with any recent version of zlib, can I follow up with a reader client of that index that reads a 32K window and then points the zstrm inflater's input buffer at the input file, seeked to the correct location, to continue inflating? I don't understand the role of Z_BLOCK as a requirement in this situation, since by my understanding the inflater simply needs 32K to prime the dictionary. |
|
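If your application is to use byte-aligned entry points (generated somewhere else) with a variant of zlib that does not have inflatePrime(), then no changes to zran.c would be needed at all. Simply link it with a dummy inflatePrime() routine, which will never be called if point->bits is always zero. My commit permits entry points at any bit offset, even if the zlib variant does not have inflatePrime(). That is a better solution, since candidate byte-aligned entry points occur one-eighth as often. It also has the benefit that no change to zran.c is required for the generation of the index. And yes, the client side does not need a zlib inflate() with Z_BLOCK, if it does not need to generate the index. |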
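The condition "point->bits is always zero" can be checked before an index is handed to a client that has no inflatePrime(). The following is only a minimal sketch: `struct entry` and `all_byte_aligned` are hypothetical names for illustration, and the struct is a simplified stand-in for the index entry in zran.c, which also carries the 32K window.

```c
#include <stddef.h>

/* Hypothetical, simplified index entry (assumption: the real zran.c
   entry also stores a 32K window of preceding uncompressed data). */
struct entry {
    long long out;  /* offset in the uncompressed data */
    long long in;   /* offset in the compressed file */
    int bits;       /* bits used from the preceding byte, 0 if byte-aligned */
};

/* Return 1 if every entry point starts on a byte boundary, so that a
   client would never need to call inflatePrime(); return 0 otherwise. */
int all_byte_aligned(const struct entry *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (list[i].bits != 0)
            return 0;
    return 1;
}
```

An index that passes this check could, under the assumptions above, be read by a client linked against a dummy inflatePrime().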
|
porting the macros to jdk or js is a last-best option for index reader.
|That is a better solution, since candidate byte-aligned entry points
occur one-eighth as often.
1-8 bytes different? Or 1-8 blocks of /n/K different? Do forgive my
ignorance; I was under the impression my mod would push the
average 4 bytes past the target, not an average of 4K*4 past the stride.
…On Fri, Apr 14, 2023 at 8:08 AM Mark Adler ***@***.***> wrote:
If your application is to use byte-aligned entry points generated somewhere else with a variant of zlib that does not have inflatePrime(), then no changes to zran.c would be needed at all. Simply link it with a dummy inflatePrime() routine, which will never be called if point->bits is always zero.
My commit permits entry points at any bit offset, even the zlib variant does not have inflatePrime(). That is a better solution, since candidate byte-aligned entry points occur one-eighth as often. It also has the benefit that no change to zran.c is required for the generation of the index.
And yes, the client side does not need a zlib inflate() with Z_BLOCK, if it does not need to generate the index.
|
|
Sorry. I can't make any sense out of any of the content in your last comment. |
|
All right, I want to get to the root of your comment "However neither that
commit nor your PR would solve the stated problem."
The stated problem is to "enable down-version zlib random-access _clients_".
You claim that including the bits field is superior because it avoids skipping 7
out of 8 intervals when posting an index entry. How many bytes are those
intervals: 1-byte intervals, or kilobytes? I'm unaware of any minimum
chunking requirement on capturing the last 32 KB window and recording a
block.
…On Fri, Apr 14, 2023 at 8:55 AM Mark Adler ***@***.***> wrote:
Sorry. I can't make any sense out of any of the content in your last
comment.
|
|
The stated problem was zran.c not working with zlib 1.1.x. zran.c has both the index generation and the indexed access. Your PR would not solve that problem, since the index generation is not possible with zlib 1.1.x, due to its lack of If all you want is the indexed access part of zran.c, then your PR is not needed at all. As I said, if your entry points all have a |
|
Aside from your PR a) not solving the problem, and b) not even being needed in the first place, it would also completely break zran.c. Obviously those changes would prevent the indexed access part from working with the index generation part, which would still be making indexes into arbitrary byte locations. So I'm not clear on why you submitted it. |
|
The value of allowing entry points at any bit location is that there are eight times as many candidate entry points available to choose from, as compared to only allowing them at byte boundaries. An entry point needs to be at the start of a deflate block. Each deflate block will generate on the order of a few tens to a few hundreds of K bytes of uncompressed data. When you ask zran for an index every megabyte, the distance between entry points will be at least a megabyte, but likely a few tens to a few hundreds of K bytes more than that, since it has to wait for the start of the next deflate block after a megabyte has gone by. If you only permit byte-aligned entry points, you will have to wait much longer for the start of a deflate block that happens to start on a byte boundary. About eight times as long. So you will have to go a few hundreds of K to a few megabytes of uncompressed data until you finally run across a deflate block on a byte boundary. |
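The "about eight times as long" estimate can be put into a small worked calculation. This is a rough sketch, not anything from zran.c: it assumes deflate blocks start at uniformly distributed bit offsets, so one block start in eight is byte-aligned on average, and `expected_overshoot` is a hypothetical helper name.

```c
/* Rough estimate of how much uncompressed data goes by, past the
   requested span, before a usable entry point appears.
   Assumption: block starts land on a byte boundary one time in eight,
   so a byte-aligned-only index waits for about eight blocks on average
   instead of one. */
long long expected_overshoot(long long mean_block_bytes, int byte_aligned_only)
{
    int blocks_to_wait = byte_aligned_only ? 8 : 1;
    return mean_block_bytes * blocks_to_wait;
}
```

With blocks of roughly 100,000 uncompressed bytes, this gives about 100,000 bytes of overshoot when any bit offset is allowed and about 800,000 bytes when only byte-aligned starts are accepted, in line with the "few hundreds of K to a few megabytes" range described above.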
|
Thank you for helping me understand the complexity here with the blocking
and chunking.
So to be clear, one does not simply keep inflating a stream and storing a
32 KB window until the input position falls on a byte boundary, IIUC. That's
all my PR tries to do. That's all GPT-4 suggests for the same, but the
implementation exceeds its grasp.
My end goal for down-version zlib compatibility on byte boundaries is to
give interpreted-language clients (in-browser JS, or Java/JDK) random
access to archive data available through a browser's XHR on HTTP[S] range
requests, or similarly with curl, to grab a chunk sized to the
uncompressed span(s) needed. The assumption is that a baseline
gzip inflater library can be nearly any version if byte[] is all it requires
for inputs.
In terms of technical debt, porting inflatePrime is a harder problem than
byte-aligned indexes here. I'd settle for the longer block intervals
of byte-aligned indexes as a first goal and focus on back-porting zlib
implementations with inflatePrime/NOT_PRIME workarounds as a next-level
upgrade. The significance of this block-size variance on very large gzip
files is in the noise floor.
If my patch, with a handful of comments and a few lines of code added, is not
the way to coerce byte-aligned indexes, my next question is: what will get
the job done? Can I ask for a #define for that?
I noticed in the comments that zran v3 was written to demonstrate the new
1.2.3.4 features only AFTER I tested main() and submitted the PR. So yes, I
admit I missed the target and intent.
Here's the landscape: there are any number of C/C++ offshoots of zran, even
a Java port which uses a native zlib wrapper to call inflatePrime and doesn't
use Java's built-in inflater.
zran.c has the smallest source code for obvious reasons: it simply defines
the blocks. The other implementations copy the code verbatim. I've got
100 GB gzip files, which I have no control or influence over, to index and
parcel out; this PR is my progress to date to get the index piece on the
road for distributed clients.
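For the range-request use case described in this comment, a client needs to turn an uncompressed span into a compressed byte range. The sketch below shows one way that lookup might work for a byte-aligned index. All names (`span_entry`, `find_entry`, `range_for`) are hypothetical, the entries are assumed to be sorted by `out`, and the 32K window that each entry would also need is left out.

```c
#include <stddef.h>

/* Hypothetical byte-aligned index entry: compressed bytes before `in`
   produce exactly the uncompressed bytes before `out`. */
struct span_entry {
    long long out;  /* uncompressed offset of the entry point */
    long long in;   /* compressed offset of the entry point */
};

/* Binary search for the last entry with out <= offset; -1 if none. */
long find_entry(const struct span_entry *list, size_t n, long long offset)
{
    long lo = 0, hi = (long)n - 1, best = -1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (list[mid].out <= offset) {
            best = mid;
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return best;
}

/* Compute the inclusive compressed byte range [*first, *last] that
   covers uncompressed bytes [offset, offset + len). file_size is the
   compressed file length. Returns 0 on success, -1 on bad input. */
int range_for(const struct span_entry *list, size_t n, long long file_size,
              long long offset, long long len,
              long long *first, long long *last)
{
    long i = find_entry(list, n, offset);
    if (i < 0 || len <= 0)
        return -1;
    *first = list[i].in;
    /* Stop at the first later entry at or beyond the end of the span;
       if there is none, read to the end of the file. */
    size_t j = (size_t)i + 1;
    while (j < n && list[j].out < offset + len)
        j++;
    *last = (j < n ? list[j].in : file_size) - 1;
    return 0;
}
```

The resulting pair could be used as the bounds of an HTTP `Range: bytes=first-last` request; the client would then inflate from `first` and discard `offset - list[i].out` leading bytes of output.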
|
The zran indexes and similar tools include non-byte-aligned compressed entry points.
I believe that byte-aligned compressed entry points simplify the index and the serde, and relax the version requirement, particularly where Oracle JDK and JZLib Java clients are bound to zlib 1.1.x, which works for most files you encounter but cannot call inflatePrime.