New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change off_t to 64-bit to support large files (in WORKERFS partial reads, and elsewhere) #5250

Open
dattaz opened this Issue May 26, 2017 · 8 comments

Comments

Projects
None yet
3 participants
@dattaz

dattaz commented May 26, 2017

I got a int overflow with files > 2 147 483 647 , using workerfs (using file object).

Here is a code to reproduct : https://gist.github.com/dattaz/62254df7b8f3d6453d87218020efbca5

And here is the output from console
Output from console

@mossroy

This comment has been minimized.

mossroy commented Sep 23, 2018

Version 1.38.12 of emscripten seems to have the same issue.
Is there a 2GB file limit in emscripten workerfs or does it come from another layer?
I have the same behavior with Firefox 62 and Chromium 69.

@kripken

This comment has been minimized.

Owner

kripken commented Sep 24, 2018

In practice JS VMs limit typed arrays to 2GB, so the emscripten FS code doesn't try to support any more.

@mossroy

This comment has been minimized.

mossroy commented Sep 26, 2018

Thanks a lot for you answer @kripken , and for that amazing piece of software.

We're trying to use emscripten and workerfs to read small parts of very big files (for example full archives of wikipedia) for project www.kiwix.org.
We don't need to read the whole file and put it in memory. And workerfs does not seem to do that when we read some bytes in it.

I'm sorry that I don't know the internals of emscripten/workerfs : why does the maximum size of a typed array imply a maximum file size that can be accessed through workerfs?

@kripken

This comment has been minimized.

Owner

kripken commented Sep 28, 2018

I think reading a small part of a big file should work (unless you are hitting a bug in the browser). However, what worries me in emscripten's code is setting up WORKERFS - if you give it the big file as a typed array, it's going to hit the browser typed array limit.

@mossroy

This comment has been minimized.

mossroy commented Sep 30, 2018

The files are passed as File javascript objects, and mounted through WORKERFS. We do not use typed arrays ourself, and never read the whole file in memory (it can be much too large).

@dattaz already implemented a prototype that reads files using a C library called libzim : https://dattaz.github.io/libzim_wasm/file_api/index.html
It can be tested with ZIM files coming from :
https://download.kiwix.org/zim/wikipedia/?C=S;O=D
It works perfectly fine with small files (browser console shows a few titles contained in the file), but fails with files larger than 2GB.

@kripken

This comment has been minimized.

Owner

kripken commented Sep 30, 2018

Interesting - I'm not sure how browsers implement File objects. But it's possible they don't have the 2GB limitation.

I think at this point we just need to debug the workerfs code to see where things go wrong. It's in src/library_workerfs.js, and is not very long. Specifically, look at the read: function, maybe add some debug statements in there to see that the parameters arrive ok, and then what happens in the browser calls to slice and readAsArrayBuffer. Also worth checking this in multiple browsers, as not all may have the same arbitrary 2GB limitation.

@dattaz

This comment has been minimized.

dattaz commented Oct 1, 2018

src/library_workerfs.js work good,
but it's seems than offset (off_t) is on 32 bit https://github.com/kripken/emscripten/blob/incoming/src/library_syscall.js#L700
And on system part, we have a "TODO" : https://github.com/kripken/emscripten/blob/incoming/system/lib/fetch/asmfs.cpp#L1125

@kripken

This comment has been minimized.

Owner

kripken commented Oct 3, 2018

Thanks, that makes sense, we need to define off_t as 64-bit for this to work. Historically, the issue is that 64-bit operations were very slow in JS and asm.js, and typed arrays are limited to 2GB anyhow as mentioned above, so we just set it to 32-bit. But it should be fine to switch that now since wasm has proper i64s. I think, but we'd need to investigate, whether that switch would be backwards compatible.

@kripken kripken changed the title from int overflow with file size to Change off_t to 64-bit to support large files (in WORKERFS partial reads, and elsewhere) Oct 3, 2018

@kripken kripken added the help wanted label Oct 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment