unzipping large files in chunks #12

Closed
springmeyer opened this Issue Jun 2, 2011 · 12 comments


@springmeyer
Member

files over 1 GB blow up:

DEBUG: unzipping file
DEBUG: saving to: /Users/dane/projects/TileMill/files/.cache/ed4aa8150169ff8b2fc143574020a070/contour-4.shx
DEBUG: saving to: /Users/dane/projects/TileMill/files/.cache/ed4aa8150169ff8b2fc143574020a070/contour-4.qix
DEBUG: saving to: /Users/dane/projects/TileMill/files/.cache/ed4aa8150169ff8b2fc143574020a070/contour-4.prj
DEBUG: saving to: /Users/dane/projects/TileMill/files/.cache/ed4aa8150169ff8b2fc143574020a070/contour-4.dbf
FATAL ERROR: v8::Object::SetIndexedPropertiesToExternalArrayData() length exceeds max acceptable value
@springmeyer
Member

Not seeing this as a major problem since node v0.2.x. Closing.

@springmeyer
Member

This lack of chunked or streaming support continually causes crashes, either on systems with very low memory or on systems like 32-bit Windows. Revisiting and trying to bring to life the code in #16 is the next step toward avoiding massive memory usage for large files.

@springmeyer springmeyer referenced this issue in tilemill-project/tilemill Jul 24, 2014
Closed

Error: child process: "tile" failed with code "3" #2372

@rclark
Member
rclark commented Dec 22, 2014

An oldie but... today I saw this error thrown while extracting a file over 1GB, using node v0.10.x on Ubuntu 14.04 and also on OSX.

@springmeyer springmeyer reopened this Dec 23, 2014
@rclark
Member
rclark commented Dec 23, 2014

Here's a testcase to work against:

var http = require('http');
var crypto = require('crypto');
var path = require('path');
var fs = require('fs');
var os = require('os');
var zipfile = require('zipfile');

var url = 'http://mapbox.s3.amazonaws.com/tmp/too-large.zip';

// Random temp filename to avoid collisions between runs.
var filepath = path.join(
    os.tmpdir(),
    crypto.randomBytes(8).toString('hex')
);

console.log('Fetching file...');
var dst = fs.createWriteStream(filepath + '.zip');
http.get(url, function(res) {
    res.pipe(dst).on('finish', unzip);
}).on('error', function(err) {
    throw err;
});

function unzip() {
    console.log('Unzipping...');
    var zf = new zipfile.ZipFile(filepath + '.zip');
    // readFile loads the entire entry into a single Buffer --
    // this is where memory blows up for very large entries.
    zf.readFile('US_OG_022014.dbf', function(err, buf) {
        if (err) throw err;
        fs.writeFile(filepath + '.dbf', buf, function(err) {
            if (err) throw err;
            console.log('Written %s', filepath + '.dbf');
        });
    });
}
@springmeyer
Member

@rclark - as discussed on chat, a streaming interface would be overkill: what we really need is an efficient method of writing a zip entry to a file. So I just added zf.copyFileSync in fef7bcb. I ran it against your testcase (thanks!) just now and memory peaked at 85 MB (when reading in 5 MB chunks).

@springmeyer
Member

It takes ~16 seconds to decompress the testcase, which is not terribly fast. Do you think this is okay? The next step to optimize would be looking at how to speed up the writes (where most of the time is being spent):

[screenshot: profiling output, 2014-12-22]

Also, if this looks good overall we can add an async version.

@rclark
Member
rclark commented Dec 23, 2014

Do you think this is okay?

Faster is always better, but as long as it doesn't crash I'm happy.

@springmeyer
Member

Noticing that 16 seconds is actually quite fast compared to unzip, which takes 37 seconds to unpack the archive.

@springmeyer
Member

Was able to trigger a crash in this code if I increased the chunk size to 10 MB. Will return to this in 2015 :)

@springmeyer
Member

> was able to trigger a crash in this code if I increased the chunk size to 10 MB. Will return to this in 2015 :)

Problem was stack overflow (thanks @artemp for helping diagnose). Allocating the buffer chunk on the heap solved the problem. Now looking at hardening the code and optimizing a bit before merging and publishing.

@springmeyer
Member

Current profiling output is as expected. Most of the time is taken by reading, because of the overhead of zlib decompression.

[screenshot: profiling output, 2015-01-13]

@springmeyer springmeyer referenced this issue Jan 14, 2015
Merged

Copyfile #52

@springmeyer springmeyer closed this in #52 Jan 14, 2015
@springmeyer
Member

Okay, the v0.5.5 zipfile release is published and binaries are available. /cc @rclark
