Failing for BGZIP'd streaming files #139

abetusk · 2018-05-29T13:55:24Z

Hi all, thanks for the wonderful library!

Unfortunately I think I've found a bug. Files compressed with bgzip (block gzip) are failing when trying to use pako to do streaming decompression.

The file pako-fail-test-data.txt.gz is an example file that is able to trigger what I believe to be an error. The file itself is 65,569 bytes big, which is just larger than what I assume to be a block size relevant to bgzip (somewhere around 65280?). Here is a small shell session that has some relevant information:

$ wc pako-fail-test-data.txt 
 1858 16831 65569 pako-fail-test-data.txt
$ md5sum pako-fail-test-data.txt 
7eae4c6bc0e68326879728f80a0e002b  pako-fail-test-data.txt
$ zcat pako-fail-test-data.gz | bgzip -c > pako-fail-test-data.txt.gz
$ md5sum pako-fail-test-data.txt.gz 
f4d0b896c191f66ff6962de37d69db45  pako-fail-test-data.txt.gz
$ bgzip -h

Version: 1.4.1
Usage:   bgzip [OPTIONS] [FILE] ...
Options:
   -b, --offset INT        decompress at virtual file pointer (0-based uncompressed offset)
   -c, --stdout            write on standard output, keep original files unchanged
   -d, --decompress        decompress
   -f, --force             overwrite files without asking
   -h, --help              give this help
   -i, --index             compress and create BGZF index
   -I, --index-name FILE   name of BGZF index file [file.gz.gzi]
   -r, --reindex           (re)index compressed file
   -g, --rebgzip           use an index file to bgzip a file
   -s, --size INT          decompress INT bytes (uncompressed size)
   -@, --threads INT       number of compression threads to use [1]

Here is some sample code that should decompress the whole file, but doesn't. My apologies for it not being elegant, I'm still learning and I kind of threw a bunch of things together to get something that I believe triggers the error:

var pako = require("pako"),
    fs = require("fs");

var CHUNK_SIZE = 1024*1024,
    buffer = new Buffer(CHUNK_SIZE);

function _node_uint8array_to_string(data) {
  var buf = new Buffer(data.length);
  for (var ii=0; ii<data.length; ii++) {
    buf[ii] = data[ii];
  }
  return buf.toString();
}

var inflator = new pako.Inflate();
inflator.onData = function(chunk) {
  var v = _node_uint8array_to_string(chunk);
  process.stdout.write(v);
};

fs.open("./pako-fail-test-data.txt.gz", "r", function(err,fd) {
  if (err) { throw err; }
  function read_chunk() {
    fs.read(fd, buffer, 0, CHUNK_SIZE, null,
      function(err, nread) {
        var data = buffer;
        if (nread<CHUNK_SIZE) { data = buffer.slice(0, nread); }
        inflator.push(data, false);
        if (nread > 0) { read_chunk(); }
      });
  };
  read_chunk();
});

I did not indicate an end block (that is I did not do inflator.push(data.false) anywhere) and there are maybe other problems with this in how the data blocks are read from fs but I hope you'll forgive this sloppiness in the interest of simplicity to illuminate the relevant issue.

Running this does successfully decompress a portion of the file but then stops at what I believe to the first block. Here are some shell commands that might be enlightening:

$ node pako-error-example.js | wc
   1849   16755   65280
$ node pako-error-example.js | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8  -
$ zcat pako-fail-test-data.txt.gz | md5sum
7eae4c6bc0e68326879728f80a0e002b  -
$ zcat pako-fail-test-data.txt.gz | head -c 65280 | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8  -
$ zcat pako-fail-test-data.txt.gz | wc
   1858   16831   65569

Running another simple example using browser-zlib triggers an error outright:

var fs = require("fs"),
    zlib = require("browserify-zlib");

var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();

z.on("data", function(chunk) {
  process.stdout.write(chunk.toString());
});
r.pipe(z);

And when run via node stream-example-2.js, the error produces is:

events.js:137
      throw er; // Unhandled 'error' event
      ^

Error: invalid distance too far back
    at Zlib._handle.onerror (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/index.js:352:17)
    at Zlib._error (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:283:8)
    at Zlib._checkError (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:254:12)
    at Zlib._after (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:262:13)
    at /home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:126:10
    at process._tickCallback (internal/process/next_tick.js:150:11)

I assume this is a pako error as browserify-zlib uses pako underneath so my apologies if this is browserify-zlib error and has nothing to do with pako.

As a "control", the following code works without issue:

var fs = require("fs"),
    zlib = require("zlib");

var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();

z.on("data", function(chunk) {
  process.stdout.write(chunk.toString());
});
r.pipe(z);

bgzip is used to allow for random access to gzipped files. The resulting block compressed file is a bit bigger than using straight gzip compression but the small compressed file size inflation is often worth it for the ability to efficiently access arbitrary positions in the uncompressed data.

My specific use case is that I want to process a large text file (compressed ~115Mb, ~650Mb uncompressed with other files being even larger). Loading the complete file, either compressed or uncompressed, is not an option either because of memory exhaustion or straight up memory restrictions in JavaScript. I only need to process the data in a streaming manner (that is, I only need to look at the data once and then am able to mostly discard it) so this is why I was looking into this option. The bioinformatics community uses this method quite a bit (bgzip is itself part of tabix which is part of a bioinformatics library called htslib) so it would be nice if pako supported this use case.

If there is another library I should be using to allow for stream processing of compressed data in the browser, I would welcome any suggestions.

The text was updated successfully, but these errors were encountered:

puzrin · 2018-05-29T15:38:55Z

This question is a bit out of project scope, but i recommend split all to smaller steps:

Drop all string conversions and make sure that pako.inflate(data) works correctly in one step.
To the same in chunking mode, and don't forget to finalize with .push(data, false) last step, or try to add .push([], false) after last chunk.
Add { to: 'string' } if you need utf16 output (for JS). Note, input should be still binary (it can be benary string but i don't recommend, that's for old browsers only).

Also, if you are on server size - just use node's zlib instead, pako is good for browsers only.

PS. You may wish to consider use JSzip, it's better suited for end users. This library is a bit low level thing.

abetusk · 2018-05-29T16:49:00Z

Hi @puzrin, thanks for the response.

I'm sorry, I don't think I've communicated the issue properly. I believe this is a bug in pako. I believe that pako does not stream decompress block compressed data properly.

I'll try and address each of your concerns below:

I've only used the string conversion for illustrative purposes and this should have no effect on the buggy behavior.
As I've said in the ticket above I haven't finalized the last chunk but this should have no effect on exposing the buggy behavior.
As stated above, the conversion to string in the example is for illustrative purposes and should have no effect on exposing the buggy behavior.

I use the server size pako example to provide an illustrative example that exposes the buggy behavior. I understand I can use zlib for server side decompression but I ultimately want to use it in the browser. Maybe you missed it but I gave an example in the above ticket of the correct behavior in zlib that pako does not reproduce.

From what I understand, JSzip does not support gzip (as stated in issue #209) so it's not appropriate to use for this purpose (but please let me know if I'm not understanding that properly).

Here is a simpler server side program purely for illustrative purposes. I understand that this is server-side but I'm trying to expose a bug in pako so I'm providing a simple example program exposing the bug in pako. As I've said above I want this to be in browser, not server side so I can't use node's zlib library as it is only (afaik) server side. Here is the example code:

var fs = require("fs"),
    pako = require("pako");

var totalByteCount=0;

var inflator = new pako.Inflate();
inflator.onData = function(chunk) {
  totalByteCount += chunk.length;
  console.log("totalByteCount:", totalByteCount);
};

var readStream = fs.createReadStream("./pako-block-decompress-failure.txt.gz");
readStream.on("data", function(chunk) {
  inflator.push(chunk, true);
}).on("end", function() {
  inflator.push([], false);
});

Running the above code on a file that has been block compressed (pako-block-decompress-failure.txt.gz) produces the following output:

$ node pako-block-decompress-failure.js
totalByteCount: 16384
totalByteCount: 32768
totalByteCount: 49152
totalByteCount: 65280

The file, uncompressed is 214330 bytes long
pako only decompressed the first 65280 bytes
65280 is less than the full file size of 214330 bytes
pako is not decompressing the majority of the file.

I understand that this library is most likely a volunteer effort on your part and so I would completely understand if this fell in a "don't fix" category purely for lack of interest or time but this is within scope of
a zlib port.

I am unfamiliar with zlib in general so I'm not sure how much progress I could make on this but it might be helpful to me and others in the future needing this functionality if you could give some general direction on how to go about fixing this if you're unable or unwilling to look into this further.

puzrin · 2018-05-29T17:13:58Z

Got it (with simple example from last post). I think, problem is here

pako/lib/inflate.js

Line 273 in 893381a

    
           } while ((strm.avail_in > 0 || strm.avail_out === 0) && status !== c.Z_STREAM_END);

Let me explain. Pako consists of 2 parts:

zlib port - very stable and well tested, but difficult to use directly.
sugar wrappers for simple calls.

When we implemented wrappers, we could not find what to do if stream consists of multiple parts (probably returns multiple Z_STREAM_END). That's not widely used mode.

/cc @Kirill89 could you take a look?

puzrin · 2018-05-29T17:53:50Z

That's a minimal sample to reproduce:

const pako = require('pako');

let input = require('fs').readFileSync('./pako-block-decompress-failure.txt.gz');
let output = pako.inflate(input);

console.log(`size = ${output.length}`); // => size = 65280 !!!

puzrin · 2018-05-29T18:08:00Z

After quick look - seems your data really generates Z_STREAM_END status before end. Sure, wrapper can be fixed for this case, but i don't know how.

puzrin · 2018-05-30T00:20:35Z

Googling "zlib inflate multiple streams":

abetusk · 2018-05-30T11:55:57Z

The relevant file (afaict) in the most current version (as of this writing) of node is node_zlib.cc (line 302):

...
        while (strm_.avail_in > 0 &&
               mode_ == GUNZIP &&
               err_ == Z_STREAM_END &&
               strm_.next_in[0] != 0x00) {
          // Bytes remain in input buffer. Perhaps this is another compressed
          // member in the same archive, or just trailing garbage.
          // Trailing zero bytes are okay, though, since they are frequently
          // used for padding.

          Reset();
          err_ = inflate(&strm_, flush_);
        }
        break;
      default:
        UNREACHABLE();
    }
...

puzrin · 2018-05-30T12:13:21Z

Yeah, i've seen it. Could not make quick hack to work. Seems it's better to wait for weekend, when Kirill can take a look.

Kirill89 · 2018-06-03T12:04:57Z

I checked the same file on original zlib code and found the same behavior (inflate returns Z_STREAM_END too early).

Also I found very interesting implementation of wrapper for inflate function. According to these implementation we must do inflateReset on every Z_STREAM_END instead of terminating.

This is possible fix: c60b97e

After that fix one test becomes broken, but I don't understand why (need your help to solve).

Code to reproduce same behavior:

    // READ FILE
    size_t file_size;
    Byte *file_buf = NULL;
    uLong buf_size;

    FILE *fp = fopen("/home/Kirill/Downloads/pako-fail-test-data.txt.gz", "rb");
    fseek(fp, 0, SEEK_END);
    file_size = ftell(fp);
    rewind(fp);
    buf_size = file_size * sizeof(*file_buf);
    file_buf = malloc(buf_size);
    fread(file_buf, file_size, 1, fp);

    // INIT ZLIB
    z_stream d_stream;
    d_stream.zalloc = Z_NULL;
    d_stream.zfree = Z_NULL;
    d_stream.opaque = (voidpf)0;

    d_stream.next_in  = file_buf;
    d_stream.avail_in = (uInt)buf_size;

    int err = inflateInit2(&d_stream, 47);
    CHECK_ERR(err, "inflateInit");

    // Inflate
    uLong chunk_szie = 5000;
    Byte* chunk = malloc(chunk_szie * sizeof(Byte));

    do {
        memset(chunk, 0, chunk_szie);
        d_stream.next_out = chunk;
        d_stream.avail_out = (uInt)chunk_szie;
        err = inflate(&d_stream, Z_NO_FLUSH);
        printf("inflate(): %s\n", (char *)chunk);
        if (err == Z_STREAM_END) {
//            inflateReset(&d_stream);
            break;
        }
    } while (d_stream.avail_in);

    err = inflateEnd(&d_stream);
    CHECK_ERR(err, "inflateEnd");

puzrin · 2018-06-03T12:27:13Z

@Kirill89 what about this code It should not force end in the middle (when multiple .push() callsed and some emit Z_STREAM_END before data ended).

rbuels · 2018-08-30T19:20:33Z

Any update on this issue? I'm running into this problem as well.

puzrin · 2018-08-31T00:06:29Z

c60b97e that needs additionas conditions update after cycle but one more test fails after that.

rbuels · 2018-09-01T18:40:48Z

my #145 pushes this a bit further, fixing (I think) the SYNC test that @puzrin saw failing, and adds 2 failing tests with @abetusk case and another case of my own. What do you guys think of that approach?

rbuels · 2018-09-01T18:46:34Z

The failing tests in that PR reproduce the "too far back" error @abetusk was seeing:

puzrin · 2018-09-01T19:52:49Z

@rbuels look at this lines:

pako/lib/inflate.js

Lines 279 to 281 in c60b97e

    
           if (status === c.Z_STREAM_END) { 
        
             _mode = c.Z_FINISH; 
        
           }

It seems, this should be removed, because Z_STREAM_END is processed inside loop and should not finalize deflate. But after removing that line, one test fails, and that's the main reason why @Kirill89 's commit was postponed.

rbuels · 2018-09-01T22:20:13Z

After discussion with @puzrin, I was able in #146 to write a couple of tests that decompress the bgzip files with pako as-is. The code for doing it can be seen at https://github.com/nodeca/pako/pull/146/files#diff-04f4959c7d84f7da8f54fbf6b0f50553R23

ewimberley · 2018-09-04T18:56:01Z

Thanks for all the work on this. I just ran into this bug, and would appreciate a new release when the pull request is merged.

bovee · 2019-02-28T18:54:39Z

We're seeing this issue quite a bit (I wonder if bioinformaticians just really like gzipping in blocks!)

Making the inflateReset into an inflateResetKeep call fixes the "invalid distance too far back" error, but results in the chunk remainder being written into the out buffer (which is bad). This can be fixed by moving the status === c.Z_STREAM_END condition for the strm.next_out write branch as in #146 (and I think that's the right thing to do?) The "Read stream with SYNC marks" test still fails though and I don't quite understand why either.

I put these changes up at https://github.com/onecodex/pako and I'm happy to restructure or make a PR if that's helpful (thanks for Kirill89's issue-139 branch and rbuels' #145 PR for providing 99% of the work here).

puzrin · 2019-02-28T19:14:02Z

@bovee I'll be happy to accept correct PR.

As far as i remember, #145 was rejected because touched original zlib content (see my comment). @Kirill89's fix was correct in general, but broken 1 strange test. This required to investigate things in debugger, but he had no free time.

I have absolutely no idea (forgot everything) what this test does, but if it exists, it can not be "just skipped". As soon as anyone can resolve this tail, PR will be accepted.

abetusk · 2019-03-01T20:37:26Z

@bovee, bgzip allows for random access to large gzip files. In bioinformatics, there's often a need to access large files efficiently and at random (from 100Mb to 5Gb or more, compressed, representing a whole genome in some format, for example). Vanilla gzip requires to decompress all previous elements before getting at some position.

By splitting the gzip file into blocks, you can create an index which can then be used to allow for efficient random access. The resulting bgzipd files are a bit bigger compressing without block (i.e. just vanilla gzip) but most of the benefits of compression are still retained while still allowing for efficient random access to the file. There's the added benefit that a bgzipd file should look like a regular gzip file so all the "standard" tools should still work to decompress it.

Here is what I believe to be the original paper by Heng Li on Tabix (Tabix has now been subsumed into htslib if I'm not mistaken).

rbuels · 2019-03-01T20:46:51Z

For the bioinformaticians in the thread, just going to say that I ended up coding around this issue and eventually releasing https://github.com/GMOD/bgzf-filehandle and https://github.com/GMOD/tabix-js for accessing BGZIP and tabix files, respectively.

…

On Fri, Mar 1, 2019 at 12:37 PM Abe ***@***.***> wrote: @bovee <https://github.com/bovee>, bgzip allows for random access to large gzip files. In bioinformatics, there's often a need to access large files efficiently and at random (from 100Mb to 5Gb or more, compressed, representing a whole genome in some format, for example). Vanilla gzip requires to decompress all previous elements before getting at some position. By splitting the gzip file into blocks, you can create an index which can then be used to allow for efficient random access. The resulting bgzipd files are a bit bigger compressing without block (i.e. just vanilla gzip) but most of the benefits of compression are still retained while still allowing for efficient random access to the file. There's the added benefit that a bgzipd file should look like a regular gzip file so all the "standard" tools should still work to decompress it. Here is what I believe to be the original paper by Heng Li on Tabix <https://academic.oup.com/bioinformatics/article/27/5/718/262743> (Tabix has now been subsumed into htslib if I'm not mistaken). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#139 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEgFS3Op1zm7Xq4vK8RQ6z9zMnEb5k4ks5vSY-JgaJpZM4URinR> .

anderspitman · 2020-01-31T23:11:38Z

@rbuels I'm working on reading local fastq.gz files in the browser and stumbled upon this issue. Haven't been able to get pako to work so far. Is there currently a working solution for streaming bgzf files in the browser?

EDIT: I need streaming because the files are large. I don't (and can't) need to store the entire file in memory, just need to stream through all the lines to gather some statistics.

rbuels · 2020-02-01T00:18:31Z

We implemented `@gmod/bgzf-filehandle`, which wraps pako. We use in JBrowse. https://www.npmjs.com/package/@gmod/bgzf-filehandle

…

On Fri, Jan 31, 2020 at 3:11 PM Anders Pitman ***@***.***> wrote: @rbuels <https://github.com/rbuels> I'm working on reading local fastq.gz files in the browser and stumbled upon this issue. Haven't been able to get pako to work so far. Is there currently a working solution for streaming bgzf files in the browser? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#139?email_source=notifications&email_token=AAASAFMPEOUBVJB36WTOYNDRASV2XA5CNFSM4FCGFHI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKQJYIQ#issuecomment-580951074>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAASAFIESY53TSQZCLRHJBLRASV2XANCNFSM4FCGFHIQ> .

anderspitman · 2020-02-04T18:07:58Z

pako seems to be working on my files. Not sure what I'm doing differently to not trigger this bug.

puzrin · 2020-11-09T18:00:04Z

e61498c

I've rewritten wrappers. hose works with our old multistream fixture, and with data generated with Z_SYNC_FLUSH. But still fails with provided bgzip file, now with invalid distance too far back error :(.

UPD. Hmm... works if i vary chunkSize

drtconway · 2021-10-20T00:13:40Z

In case it helps, I have a Streams API wrapper, which I modified to support the multiple streams in a file. This supports both regular gzipped files, and bgzip ones.

Basically before pushing more data into the inflator, we check if it hit the end of stream, and rebuffer the remaining input.

class PakoInflateTransformer {
  constructor() {
    this.controller = null;
    this.decoder = new pako.Inflate();
    let self = this;
    this.decoder.onData = (chunk) => {
      self.propagateChunk(chunk);
    }
  }

  propagateChunk(chunk) {
    if (!this.controller) {
      throw "cannot propagate output chunks with no controller";
    }
    //console.log('Inflated chunk with %d bytes.', chunk.byteLength);
    this.controller.enqueue(chunk);
  }

  start(controller) {
    this.controller = controller;
  }

  transform(chunk, controller) {
    //console.log('Pako received chunk with %d bytes.', chunk.byteLength);
    this.resetIfAtEndOfStream();
    this.decoder.push(chunk);
  }

  flush() {
    //console.log('Pako flushing,');
    this.resetIfAtEndOfStream();
    this.decoder.push([], true);
  }

  resetIfAtEndOfStream() {
    // The default behaviour doesn't handle multiple streams
    // such as those produced by bgzip. If the decoder thinks
    // it has ended, but there's available input, save the
    // unused input, reset the decoder, and re-push the unused input.
    //
    while (this.decoder.ended && this.decoder.strm.avail_in > 0) {
      let strm = this.decoder.strm;
      let unused = strm.input.slice(strm.next_in);
      //console.log(`renewing the decoder with ${unused.length} bytes.`);
      this.decoder = new pako.Inflate();
      let self = this;
      this.decoder.onData = (chunk) => {
        self.propagateChunk(chunk);
      }
      this.decoder.push(unused, Z_SYNC_FLUSH);
    }
  }
}

abetusk · 2021-10-20T07:02:51Z

@drtconway, could you provide a working self contained example?

@puzrin, any progress on this?

drtconway · 2021-10-21T00:16:42Z

Here's a stand-alone HTML document. Select a file - it uses suffix matching to guess if it's compressed or uncompressed, and it reads the chunks.

<html>
 <head>
  <title>Uncompress Streams</title>
  <script src="https://cdn.jsdelivr.net/pako/1.0.3/pako.min.js"></script>
  <script src="https://unpkg.com/web-streams-polyfill/dist/polyfill.js"></script>
  <script>
    // Define Z_SYNC_FLUSH since it's not exported from pako.
    //
    const Z_SYNC_FLUSH = 2;

    class PakoInflateTransformer {
      constructor() {
        this.controller = null;
        this.decoder = new pako.Inflate();
        let self = this;
        this.decoder.onData = (chunk) => {
          self.propagateChunk(chunk);
        }
      }

      propagateChunk(chunk) {
        if (!this.controller) {
          throw "cannot propagate output chunks with no controller";
        }
        //console.log('Inflated chunk with %d bytes.', chunk.byteLength);
        this.controller.enqueue(chunk);
      }

      start(controller) {
        this.controller = controller;
      }

      transform(chunk, controller) {
        //console.log('Pako received chunk with %d bytes.', chunk.byteLength);
        this.resetIfAtEndOfStream();
        this.decoder.push(chunk);
      }

      flush() {
        //console.log('Pako flushing,');
        this.resetIfAtEndOfStream();
        this.decoder.push([], true);
      }

      resetIfAtEndOfStream() {
        // The default behaviour doesn't handle multiple streams
        // such as those produced by bgzip. If the decoder thinks
        // it has ended, but there's available input, save the
        // unused input, reset the decoder, and re-push the unused input.
        //
        while (this.decoder.ended && this.decoder.strm.avail_in > 0) {
          let strm = this.decoder.strm;
          let unused = strm.input.slice(strm.next_in);
          //console.log(`renewing the decoder with ${unused.length} bytes.`);
          this.decoder = new pako.Inflate();
          let self = this;
          this.decoder.onData = (chunk) => {
            self.propagateChunk(chunk);
          }
          this.decoder.push(unused, Z_SYNC_FLUSH);
        }
      }
    }

    function blobToReadableStream(blob) {
      let reader = blob.stream().getReader();
      return new ReadableStream({
        start(controller) {
          function push() {
            reader.read().then(({done, value}) => {
              if (done) {
                controller.close();
                return;
              }
              controller.enqueue(value);
              push();
            })
          }
          push();
        }
      });
    }

    function getReader(source) {
      var fileStream = blobToReadableStream(source);
      if (source.name.endsWith(".gz") || source.name.endsWith(".bgz")) {
        fileStream = fileStream.pipeThrough(new TransformStream(new PakoInflateTransformer));
      }
      return fileStream.getReader();
    }

    var readTheFile = async function(event) {
      var inp = event.target;
      let reader = getReader(inp.files[0]);

      let n = 0;
      let s = 0;
      while (true) {
        const { done, value } = await reader.read();
        if (done) {
          break;
        }
        let l = value.byteLength;
        n += 1;
        s += l;
      }
      let resElem = document.getElementById('results');
      let add = function(txt) {
        let para = document.createElement('p');
        resElem.appendChild(para);
        para.appendChild(document.createTextNode(txt));
      }
      let m = s / n;
      add(`number of chunks: ${n}`);
      add(`mean size: ${m}`);
    }
  </script>
 </head>
 <body>
   <div>
     <input type='file' onchange='readTheFile(event)' />
   </div>
   <div id="results">
   </div>
 </body>
</html>

drtconway · 2021-10-25T21:22:33Z

Hmm. I changed the version of pako that I was using from 1.0.3 to 2.0.4 and it fails now. Is there an obvious thing that changed that would break the code? From my initial investigation it looks like it hits the condition to reset at the end of the first stream, but the recovery doesn't work correctly any more.

puzrin · 2021-10-25T21:36:14Z

@drtconway wrapper changed significantly but multistream test exist

pako/test/gzip_specials.js

Lines 60 to 77 in 0398fad

    
           it('Read stream with SYNC marks (multistream source, file 1)', () => { 
        
             const data = fs.readFileSync(path.join(__dirname, 'fixtures/gzip-joined.gz')); 
        
             assert.deepStrictEqual( 
        
               pako.ungzip(data), 
        
               new Uint8Array(zlib.gunzipSync(data)) 
        
             ); 
        
           }); 
        
           it.skip('Read stream with SYNC marks (multistream source, file 2)', () => { 
        
             const data = fs.readFileSync(path.join(__dirname, 'fixtures/gzip-joined-bgzip.gz')); 
        
             assert.deepStrictEqual( 
        
               // Currently fails with this chunk size 
        
               pako.ungzip(data, { chunkSize: 16384 }), 
        
               new Uint8Array(zlib.gunzipSync(data)) 
        
             ); 
        
           });

drtconway · 2021-10-25T21:56:45Z

Ok. I'm looking in the debugger, and I see that it's hitting the error condition "invalid distance too far back" in lib/zlib/inffast.js (in the tagged 2.0.4 version), without ever hitting my code for handling the end of stream.

I've attached the gzipped file. Note because some of these files can be 100s MB to GBs in size, we really want to use the streaming API rather than compress in one chunk.

SRR1301936.trio.genotype.soi.vep.post_filter.vcfanno.rare.cann.vcf.gz
.

drtconway · 2021-10-25T22:02:01Z

i modified the unit test above to use this data, and it fails. gzip --test passes on it. Is there another way to validate the file?

drtconway · 2021-10-25T22:17:34Z

From the bgzip documentation:

The gzip header includes an extra sub-field with identifier 'BC' and the length of the compressed block, including all headers.

Is this the problem?

puzrin · 2021-10-26T04:12:47Z

Is this the problem?

Not sure. See this (disabled) test

pako/test/gzip_specials.js

Lines 69 to 77 in 0398fad

    
           it.skip('Read stream with SYNC marks (multistream source, file 2)', () => { 
        
             const data = fs.readFileSync(path.join(__dirname, 'fixtures/gzip-joined-bgzip.gz')); 
        
             assert.deepStrictEqual( 
        
               // Currently fails with this chunk size 
        
               pako.ungzip(data, { chunkSize: 16384 }), 
        
               new Uint8Array(zlib.gunzipSync(data)) 
        
             ); 
        
           });

. It's behaviour depends on chunk size.

cmdcolin · 2022-03-10T01:13:25Z

this is likely a different manifestation of the same problem but saw an issue where pako did work on a file with 1.0 but stopped working on 2.0, reproducible repo here https://github.com/cmdcolin/pako_error the origin file is not bgzip but may have some multipart stuff perhaps, moved from #250

puzrin closed this as completed May 29, 2018

puzrin reopened this May 29, 2018

puzrin added the bug label May 29, 2018

rbuels mentioned this issue Sep 1, 2018

multi member / bgzip compatibility #145

Closed

rbuels mentioned this issue Sep 1, 2018

add tests for multi member decompression #146

Closed

puzrin mentioned this issue Dec 4, 2018

Plans to upgrade Pako to be equivalent to zlib 1.2.11? #152

Closed

puzrin mentioned this issue Sep 25, 2019

Failing to inflate a sequence of bytes #174

Closed

exander77 mentioned this issue Jan 22, 2021

Majority of files fails to inflate with error -3 invalid distance too far back #216

Closed

puzrin mentioned this issue Aug 30, 2021

Error: invalid distance too far back #236

Closed

ikreymer mentioned this issue Nov 17, 2021

Support reading multi-member gzip files or providing access to remaining data 101arrowz/fflate#102

Closed

puzrin mentioned this issue Mar 9, 2022

"invalid distance too far back" with 2.0? #250

Closed

Failing for BGZIP'd streaming files #139

Failing for BGZIP'd streaming files #139

Comments

abetusk commented May 29, 2018

puzrin commented May 29, 2018

abetusk commented May 29, 2018

puzrin commented May 29, 2018 • edited Loading

puzrin commented May 29, 2018 • edited Loading

puzrin commented May 29, 2018

puzrin commented May 30, 2018 • edited Loading

abetusk commented May 30, 2018

puzrin commented May 30, 2018

Kirill89 commented Jun 3, 2018 • edited Loading

puzrin commented Jun 3, 2018

rbuels commented Aug 30, 2018

puzrin commented Aug 31, 2018 • edited Loading

rbuels commented Sep 1, 2018 • edited Loading

rbuels commented Sep 1, 2018

puzrin commented Sep 1, 2018

rbuels commented Sep 1, 2018

ewimberley commented Sep 4, 2018

bovee commented Feb 28, 2019

puzrin commented Feb 28, 2019

abetusk commented Mar 1, 2019

rbuels commented Mar 1, 2019 via email

anderspitman commented Jan 31, 2020 • edited Loading

rbuels commented Feb 1, 2020 via email

anderspitman commented Feb 4, 2020

puzrin commented Nov 9, 2020 • edited Loading

drtconway commented Oct 20, 2021

abetusk commented Oct 20, 2021

drtconway commented Oct 21, 2021

drtconway commented Oct 25, 2021

puzrin commented Oct 25, 2021

drtconway commented Oct 25, 2021

drtconway commented Oct 25, 2021

drtconway commented Oct 25, 2021

puzrin commented Oct 26, 2021

cmdcolin commented Mar 10, 2022 • edited Loading

puzrin commented May 29, 2018 •

edited

Loading

puzrin commented May 29, 2018 •

edited

Loading

puzrin commented May 30, 2018 •

edited

Loading

Kirill89 commented Jun 3, 2018 •

edited

Loading

puzrin commented Aug 31, 2018 •

edited

Loading

rbuels commented Sep 1, 2018 •

edited

Loading

anderspitman commented Jan 31, 2020 •

edited

Loading

puzrin commented Nov 9, 2020 •

edited

Loading

cmdcolin commented Mar 10, 2022 •

edited

Loading