Feature description
Currently, dumper fails on notes with large attachments (such as images), likely because the contents are parsed into a DOM via minidom, which builds an in-memory document far larger than the file on disk and can exceed available memory even though the files themselves are fairly small.
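A streaming parser sidesteps this blow-up by visiting one element at a time instead of materializing the whole tree. A minimal sketch of the pattern with lxml.etree.iterparse (the filename is illustrative; `<resource>` is the ENEX element that carries attachment data):

```python
from lxml import etree

# Stream through an ENEX file one <resource> at a time instead of
# building the whole DOM; huge_tree lifts libxml2's default size limits.
for _event, elem in etree.iterparse("notes.enex", tag="resource", huge_tree=True):
    # ... inspect or strip the resource here ...
    elem.clear()                       # free this element's children
    while elem.getprevious() is not None:
        del elem.getparent()[0]        # drop already-processed siblings
```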
Feature motivation
I have a series of notes from college that total gigabytes in size, almost all of it attachments. I also have a few ~80MB files, again mostly attachments, that fail. I could export smaller notes, but a Python script (or similar) that processes files over a certain size, extracts each attachment, and exports the modified ENEX file alongside the attachments (with support for other formats eventually) would be very useful and would solve many of the large-file issues.
If there's any interest, I'd be more than happy to contribute it under any license desired (including public domain, so you can do whatever you wish). Currently I only have access to Evernote ENEX files, but I could use the test cases above to add support for other note file types.
I've got a simple version of the script here; it reduces a 7MB file to ~150KB while keeping everything except the resources, and the result can then be processed by dumper.
Current issues with the script:
1. Assumes base64 encoding of attachments.
2. Only handles filename and contents.
3. XML output currently doesn't preserve CDATA sections, which should be trivial to fix using text processing rather than dumping the tree.
4. Only supports Evernote ENEX files.
5. Doesn't store the attachment data when exporting, meaning the attachment names are lost in the header.
I'm currently working on fixing 1), 3), and 5), and would be more than willing to implement 4) if desired. 5) has since been fixed by writing the buffer out to a file and substituting a unique placeholder attachment (b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09'), since empty files are ignored by dumper; any library users would need to know about this placeholder. A sketch of the approach follows below.
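For reference, a minimal sketch of that approach, assuming base64-encoded `<data>` elements per the ENEX DTD (the function name, file paths, and attachment naming scheme are illustrative, not part of the actual script):

```python
import base64
import pathlib
from lxml import etree

# Placeholder substituted for each attachment; dumper ignores empty
# files, so each stripped resource must still contain some bytes.
PLACEHOLDER = b"\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09"

def strip_attachments(src: str, dst: str, out_dir: str = "attachments") -> None:
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    # huge_tree lifts libxml2's text-size limits so multi-GB base64
    # payloads parse; strip_cdata=False keeps <content> CDATA intact.
    parser = etree.XMLParser(huge_tree=True, strip_cdata=False)
    tree = etree.parse(src, parser)
    for i, data in enumerate(tree.iter("data")):
        if data.get("encoding", "base64") != "base64":
            continue  # only base64 payloads are handled (issue 1)
        payload = base64.b64decode(data.text or "")
        (out / f"attachment-{i:04d}.bin").write_bytes(payload)
        # Swap the payload for the tiny placeholder so the note keeps
        # a valid, non-empty resource that dumper will still accept.
        data.text = base64.b64encode(PLACEHOLDER).decode("ascii")
    tree.write(dst, xml_declaration=True, encoding="UTF-8")

strip_attachments("notes.enex", "notes-stripped.enex")
```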
Update: I've since processed ENEX files up to ~3.4GB in size without a hitch, and then used dumper to export the notes from the resulting files. This includes large attachments, some over 100MB. In addition, running dump_attachments followed by dumper is much faster than dumper alone for files that can be processed either way, thanks to the smaller ENEX file sizes.
I believe the best solution right now would be a Python library built on lxml, since lxml supports huge files (and therefore attachments larger than 10GB) and can be told not to strip CDATA sections.
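Both properties are opt-in parser flags; a quick round-trip check (behavior per the lxml docs):

```python
from lxml import etree

# strip_cdata=False preserves CDATA sections through a parse/serialize
# round trip; huge_tree disables libxml2's hardcoded size limits that
# otherwise reject multi-gigabyte text nodes.
parser = etree.XMLParser(strip_cdata=False, huge_tree=True)
root = etree.fromstring(b"<content><![CDATA[<div>note body</div>]]></content>", parser)
print(etree.tostring(root))
# b'<content><![CDATA[<div>note body</div>]]></content>'
```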