Can't upload CSV files > 1MB to Jupyterlite file system by drag and drop #741

Closed
tangocat73 opened this issue Jul 18, 2022 · 4 comments · Fixed by #1024
Labels
bug Something isn't working

Comments

@tangocat73

Description

I'm trying to see if the JupyterLite demo site can process some of my local CSV files with pandas. I've noticed that when dragging and dropping these files into the JupyterLite file system UI, any file larger than ~1 MB is truncated by the system, while smaller files (<1 MB) are processed correctly.

Please see the attached screenshots for this issue:

Screenshot 1. query.csv is a 1.9 MB earthquake dataset I have, but it was NOT read correctly by the file system; many rows were dropped during the upload, and as a result the column headers were not processed correctly.

Screenshot 2. query_2.csv is a 0.9 MB partial dataset I created from query.csv by deleting nearly half of the data; the JupyterLite file system took it without any issue, with the column headers processed correctly.

Screenshot 3. Reading these CSV files with pandas.

I wonder if this is a limitation of the JupyterLite demo website, such that users are not allowed to upload large files (though I'm under the impression that these local files are uploaded to the user's browser storage...), or maybe there is some site setting I need to change.

(Note: I've also noticed that if I serve these CSV files as static assets using my own GitHub JupyterLite deployment, file size does not matter.)
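
For reference, a quick way to confirm the truncation from a notebook (a hedged sketch: it assumes the uploaded query.csv sits next to the notebook, as in the screenshots) is to compare the size and shape of the uploaded copy against the original:

import os

import pandas as pd

print(os.path.getsize('query.csv'))   # compare with the original size on disk
df = pd.read_csv('query.csv')
print(df.shape)                       # far fewer rows than expected if the file was cut off
print(df.columns)                     # odd column names suggest the header row was lost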

Context

  • JupyterLite version: v0.1.0b11
  • Operating System and version: Windows 10
  • Browser and version: Chrome 103.0.5060.114

Screenshots

[Screenshot: query.csv, 1.9 MB]
[Screenshot: query_2.csv, 0.9 MB]
[Screenshot: JupyterLite]

tangocat73 added the bug (Something isn't working) label on Jul 18, 2022
@bollwyvl
Collaborator

bollwyvl commented Jul 18, 2022

> see the screenshots

It would also be interesting to have some browser logs (e.g. revealed by F12), as these might tell us where the problem is.

Recall: everything in this free software, hosted on free servers, is still very much in beta, and the in-browser storage stuff is all best-effort, especially with respect to the less-than-month-old "magic" mirroring of files between kernels and contents.

To fix this for real, it would need to be under test in a real browser, and I haven't had time to start building up a suite. Last I checked, galata took some shortcuts...

> serve these csv files as static assets using my own Github

Right: in general, we're trying to enable flexibility for deployers without taking away power/debuggability for users.

The "lightweight archival data analysis environment for datasets and packages known well in advance" use case is much easier to reason about than "I might upload anything and try to use it with anything", but it would indeed be lovely for it to Just Work.

For heavy duty files, some other things to consider:

  • ipywidgets FileUpload (see the sketch below)
    • this generally bypasses IndexedDB, and potentially even the emscripten MEMORYFS
  • jupyterlab-filesystem-access
    • chrome-only 😿 though theoretically could be polyfilled
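
For the FileUpload route, here is a minimal sketch (assuming ipywidgets 8; the value layout differs in older versions) that reads an uploaded CSV straight from the widget's in-memory value, without going through the contents upload path:

import io

import ipywidgets as widgets
import pandas as pd

uploader = widgets.FileUpload(accept='.csv', multiple=False)
uploader  # display the widget in a notebook cell, then pick a file

# In a later cell, once a file has been selected:
item = uploader.value[0]                              # metadata plus the raw bytes
df = pd.read_csv(io.BytesIO(bytes(item['content'])))  # parse without touching disk
print(item['name'], df.shape)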

@tangocat73
Author

Thanks for the quick response.

  1. Chrome log file (captured when clicking on the uploaded "query.csv" file) attached for troubleshooting. Hope it shows where the issue comes from.

  2. Totally understand that JupyterLite is a WIP, and for now we should use it in a controlled and well-defined manner.

  3. Looking forward to more exciting development from this portable Python scientific computing stack.

jupyterlite.readthedocs.io-1658174123942.log

@tanaga9

tanaga9 commented Jan 14, 2023

Run the following script:

with open('/drive/bigfile.txt', 'w') as f:
    # each line is 63 digits plus a newline, i.e. 64 bytes
    for i in range(16400):
        f.write(str(i).zfill(63) + '\n')

/drive/bigfile.txt will be a 1,049,600-byte (64 × 16400, ~1.1 MB) file like this:

000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
000000000000000000000000000000000000000000000000000000000000003
... (lines omitted) ...
000000000000000000000000000000000000000000000000000000000016396
000000000000000000000000000000000000000000000000000000000016397
000000000000000000000000000000000000000000000000000000000016398
000000000000000000000000000000000000000000000000000000000016399
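
As a sanity check (assuming the Pyodide kernel, where the contents are mirrored under /drive), the size can be confirmed from the notebook before downloading:

import os

print(os.path.getsize('/drive/bigfile.txt'))  # expected: 1049600 (16400 lines * 64 bytes each)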

Download this file, then drag and drop it into JupyterLite's file system UI to (re)upload it.

The (re)uploaded file looks like this:

000000000000000000000000000000000000000000000000000000000016384
000000000000000000000000000000000000000000000000000000000016385
000000000000000000000000000000000000000000000000000000000016386
000000000000000000000000000000000000000000000000000000000016387
000000000000000000000000000000000000000000000000000000000016388
000000000000000000000000000000000000000000000000000000000016389
000000000000000000000000000000000000000000000000000000000016390
000000000000000000000000000000000000000000000000000000000016391
000000000000000000000000000000000000000000000000000000000016392
000000000000000000000000000000000000000000000000000000000016393
000000000000000000000000000000000000000000000000000000000016394
000000000000000000000000000000000000000000000000000000000016395
000000000000000000000000000000000000000000000000000000000016396
000000000000000000000000000000000000000000000000000000000016397
000000000000000000000000000000000000000000000000000000000016398
000000000000000000000000000000000000000000000000000000000016399

The file is now only about 1 KB (the last 16 lines, i.e. the final partial chunk) instead of 1.1 MB.
What happened?

Most likely, when a file larger than 1 MB is uploaded, it is split into 1 MB chunks that are uploaded one by one. The chunks are not combined correctly: each chunk overwrites the file from the beginning, so only the last chunk survives.

+------------------+
|  1.1 MB          |
+------------------+

divided into parts of 1 MB each
+-----------++-----+
|  1.0 MB   ||0.1MB|
+-----------++-----+

Upload by parts
+-----------+
|  1.0 MB   |
+-----------+
             +-----+
             |0.1MB|
             +-----+

Write the first part
+-----------+
|  1.0 MB   |
+-----------+

The next part wrongly overwrites the file from the beginning
(it should really be appended at the end)
+-----+
|0.1MB|
+-----+

For a file of n bytes (larger than 1 MB), the size after upload will be n mod 1,048,576, where 1,048,576 = 1024 × 1024.
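
A minimal Python sketch of the suspected failure mode described above (this only models the arithmetic; the real upload logic lives in the browser-side contents code, not in this snippet):

CHUNK = 1024 * 1024  # 1 MiB per uploaded part

def store_buggy(data):
    stored = b''
    for start in range(0, len(data), CHUNK):
        stored = data[start:start + CHUNK]   # bug: every part overwrites from offset 0
    return stored

def store_fixed(data):
    stored = b''
    for start in range(0, len(data), CHUNK):
        stored += data[start:start + CHUNK]  # fix: append each part at its own offset
    return stored

data = b'x' * 1_049_600                      # same size as bigfile.txt
print(len(store_buggy(data)))                # 1024, i.e. len(data) % CHUNK
print(len(store_fixed(data)))                # 1049600

Only the last 1,024-byte part survives the buggy merge, which matches the 16 remaining lines seen above.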

Related issue: #888

@jtpio
Member

jtpio commented Jan 15, 2023

Thanks all for the examples.

Yes, I also encountered this issue recently after uploading a ~4 MB file and noticed it was truncated.
