Error handling #9

Closed
mikeabdullah opened this Issue Jan 12, 2013 · 15 comments

Contributor

mikeabdullah commented Jan 12, 2013

At present Zipzap makes relatively little allowance for error-handling, which worries me:

  • In ZZFileChannel, the call to open might fail because the URL is unsuitable or because of a permissions problem. There is no way to indicate that to the caller beyond the likelihood of NSFileHandle blowing up.
  • If writing data to the file handle fails midway through, NSFileHandle is documented to throw an exception. I don't think the code handles this at present, and users of Zipzap aren't advised that such a situation might arise.
  • Also in ZZFileChannel, the -openInput method completely ignores the NSError object, so clients have no idea why it might have failed (see the sketch after this list).
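
A minimal sketch of the kind of error-reporting open being asked for here, assuming ZZFileChannel keeps its target in a URL property; the method name and signature are illustrative only, not existing API:

- (NSFileHandle*)newInput:(NSError**)error
{
    // on failure, the handle is nil and *error describes the cause (missing file, permissions, ...)
    return [NSFileHandle fileHandleForReadingFromURL:self.URL error:error];
}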
Owner

pixelglow commented Jan 13, 2013

Error handling and recovery is fairly tricky and I would like to figure out the best way forward.

Currently if any of the ZZNewArchiveEntry callback blocks signals an error by returning NO or nil as appropriate, the archive omits the entry in the resulting file. I'd assume that the callback block will have its own logic reporting errors to the caller of -setEntries:.

We can probably handle opening errors by throwing an appropriate exception. Since -setEntries: is a return-void method, it would have to be an exception instead of an NSError return, unless we define another API for error returns. Would you know the signature of a suitable I/O opening exception in Cocoa?

If a write fails halfway, as you said the NSFileHandle already throws an exception. Unfortunately it will be difficult to keep a consistent universe after that -- the zip file may be partially written and the central headers not written out yet, so the file could be garbage. I can attempt to do a clean up but there's no guarantee that writes will even work after that. We need to have some kind of journaling or write-ahead transaction log to preserve integrity in that situation but that may complicate a simple enough API.

Contributor

mikeabdullah commented Jan 13, 2013

Please don't throw any exceptions of your own. It's bad enough that NSFileHandle has this behaviour as legacy, but no new code should.

Personally, I would consider redefining the API as:

- (void)setEntries:(NSArray *)entries;  // updates the list in-memory; doesn't touch the disk etc.
- (BOOL)writeToURL:(NSURL *)url error:(NSError **)error;
- (NSData *)dataRepresentation:(NSError **)error;
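
Purely to illustrate the proposal above (this is not the current zipzap API; archive, newEntries and targetURL are placeholder names), a caller might then look like this:

NSError* error = nil;
archive.entries = newEntries;                 // in-memory only, no disk access
if (![archive writeToURL:targetURL error:&error])
    NSLog(@"zip write failed: %@", error);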

Trouble is that does break compatibility for the existing -setEntries: method.

If writing to disk fails at any point, there's highly unlikely to be any recovery route; that's the appropriate time to stop trying to write and fail with an error.

The bigger problem is making callers of the API aware of failures for the individual entries.

  • Right now, for any new entries, the block responsible for providing their data can handle that, although it does seem a little clumsily detached from the code that is asking for the zip file to be written.
  • But for existing, old entries, there's no way for clients to know that one of them failed and was left out of the archive. By my reckoning this should only happen if the source archive can't be read from, but that can happen.
  • There's also the problem that if one individual entry fails, many clients would like to just give up at that point, rather than waste time trying to write the remaining entries.

I can think of two high-level approaches:

  • Expose some sort of ZZWriter class, much like Omni's OUZipArchive, where clients feed in entries one at a time, giving them the result of writing each one.
  • Something like -[NSFileManager enumeratorAtURL:…], where clients specify a block to run whenever there's an error and decide what further action to take (a sketch of this follows below).
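
A sketch of what that second option might look like, loosely modelled on -[NSFileManager enumeratorAtURL:includingPropertiesForKeys:options:errorHandler:]; the method name and block signature here are hypothetical:

[archive writeEntries:entries
         errorHandler:^BOOL(ZZArchiveEntry* entry, NSError* error)
         {
             // returning YES skips the failed entry and continues; NO aborts the whole write
             NSLog(@"could not write %@: %@", entry.fileName, error);
             return NO;
         }];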
Owner

pixelglow commented Jan 17, 2013

Please don't throw any exceptions of your own. It's bad enough that NSFileHandle has this behaviour as legacy, but no new code should.

After doing further research, I have to agree with you. Modern Cocoa reserves exceptions for programmer errors and other unrecoverable situations. This necessarily complicates simple APIs, especially property-based ones, and means our code has to pass NSError** pervasively through internal code.

I may replace NSFileHandle with a simple POSIX file descriptor, and use errno-based NSError* objects for such errors.
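
For instance (just a sketch, with path and error as placeholder variables), the open could report failures like this:

#include <fcntl.h>

int fd = open(path.fileSystemRepresentation, O_WRONLY | O_CREAT, 0644);
if (fd == -1 && error)
    *error = [NSError errorWithDomain:NSPOSIXErrorDomain
                                 code:errno
                             userInfo:@{NSFilePathErrorKey: path}];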

Personally, I would consider redefining the API as:

Expose some sort of ZZWriter class, much like Omni's OUZipArchive where clients feed in entries one at a time, giving them the result of writing each one

Ah, no.

I wanted each method call to be as atomic as possible: ideally, each mutating call should leave the zip file in a consistent state. Several other zip libraries either adopt entry-writing followed by a finalisation, which is not atomic and invites misuse of the API, e.g. missing or double finalisation; or they do atomic entry-writing, which ends up rewriting the entire central header for each entry.

For atomicity, the archive object has to be consistent as well, which in an entry-writing API, implies that the entries array should be correct even between entry writes and/or before finalization. This would complicate the implementation.

Also, one main feature of zipzap is the skipping of initially equal entries -- this allows you to rewrite a large zip file in place without incurring the overhead of rewriting all of the initially unchanged entries. To achieve this, we would have to control carefully the order of writing and also preserve some state between writing entries, which is hard to do if we had an individual entry writing API.

Because of the skipping of initial entries, a file-based ZZArchive doesn't have a consistent contents or dataRepresentation until after all entries have been written and the central header finalized. Having a purely in-memory setEntries: would make it difficult to implement this skipping of initial entries.

The bottom line is we need some way of reporting errors back to the user. What do you think of changing the setEntries: API to something like:

- (BOOL)updateEntries:(NSArray*)entries error:(NSError**)error;

The original setEntries: API is somewhat inconsistent in that [a setEntries:x] followed by x2 = [a entries] does not have [x isEqualToArray:x2].

The bigger problem is making callers of the API aware of failures for the individual entries.

Right now, for any new entries, the block responsible for providing their data can handle that, although it does seem a little clumsily detached from the code that is asking for the zip file to be written. But for existing, old entries, there's no way for clients to know that one of them failed and was left out of the archive.

We could require those blocks to have the usual error-reporting API signature, i.e. BOOL- or id-returning, where NO or nil represents an error, plus an NSError** out parameter.
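
As a sketch of that convention, a data-providing block would then be shaped roughly like this (the exact block type is an assumption, and sourceURL is a placeholder):

NSData* (^dataBlock)(NSError**) = ^NSData* (NSError** error)
{
    // returning nil with *error populated marks this entry as failed
    return [NSData dataWithContentsOfURL:sourceURL options:0 error:error];
};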

We could add an entry to the NSError*'s userInfo dictionary to report the index of the failed entry.

There's also the problem that if one individual entry fails, many clients would like to just give up at that point, rather than waste time trying to write the remaining entries.

Agreed -- my original implementation of not writing out the erroneous entry without warning is pretty dangerous and relies on the blocks doing the right thing.

There remains the problem of the entries property accessor potentially accessing the disk and ending up with an error. The trouble here is that a missing file is not necessarily an error, since the user could simply be trying to write a new zip file entirely and thus there are no initial entries. I would be loath to give up the simplicity of the entries accessor to support returning errors as well. We should think about this one more.

Contributor

mikeabdullah commented Jan 17, 2013

An -updateEntries:error: method certainly solves the problem on one level. But it makes introspection to know which entries failed rather awkward — do clients query the error's userInfo? Or do they call -entries to find out which entries actually ended up in the archive? If the latter, it could be awkward trying to figure out which entries are missing, and what the corresponding error was.

And it doesn't solve the problem of deciding whether to give up or continue after a failure.

I understand your not wanting to go for a writer-type of API. It does tend to place a greater burden on the client.

However, let's take a step back.

I worry you're kidding yourself about atomicity here. The only way that I am aware of to guarantee atomicity is to:

  1. Generate a complete file in a temporary location
  2. Replace the original file with the new, such as with -[NSFileManager replaceItemAtURL:withItemAtURL:backupItemName:options:resultingItemURL:error:]
  3. Delete the original file

When both files are on the same suitable file system, the replacement will be atomic. For other cases, it minimises the time for which the original document is in an incomplete state.
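
For reference, the replacement step looks roughly like this (the URLs here are placeholders):

NSURL* resultingURL = nil;
NSError* error = nil;
BOOL ok = [[NSFileManager defaultManager] replaceItemAtURL:originalURL
                                             withItemAtURL:temporaryURL
                                            backupItemName:nil
                                                   options:0
                                          resultingItemURL:&resultingURL
                                                     error:&error];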

This is exactly the approach UI/NSDocument take to document writing. The document is asked to write to a temporary location, and then the doc system takes responsibility for replacing the original document with the updated version.

I can see that writing out a zip file can be noticeably faster when actually updating an existing archive. But should it fail partway through, you're left with an incomplete/corrupted archive. Even if zipzap itself is perfect and never crashes, you could still find yourself in a failure situation because:

  • The block called in order to provide an entry's data crashes
  • A secondary thread the app is running crashes
  • The app uses too much memory
  • The user force quits the app
  • The battery runs out mid-write

If I were writing a zip file-based app, I'd be very keen to keep writing atomic, as the document system intends. Right now zipzap provides an awkward API for that. I can see a few possible approaches to implementing a document:

  • Create a fresh ZZMutableArchive for the target URL and add to it the entries from a ZZArchive of the original document. A complete copy of the archive must be written (which takes time), but internally it can at least use little memory, writing out small portions of the existing entries at a time.
  • Copy the existing document to the target URL, then update it as efficiently as possible. This probably takes as long as the above, to be honest. It could be sped up by starting the copy operation ahead of the save, though.
  • Keep around the previous revision of a document, move it to the target URL, and update that. This would recycle existing data on disk as often as possible.

I don't believe the current implementation really is atomic. It seems to provide an atomic API to clients, but doesn't actually implement atomic writing underneath.

Owner

pixelglow commented Jan 17, 2013

An -updateEntries:error: method certainly solves the problem on one level. But it makes introspection to know which entries failed rather awkward — do clients query the error's userInfo? Or do they call -entries to find out which entries actually ended up in the archive.

The userInfo would contain the index of the failed entry, if any.
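
For example, the caller could then pull the index back out of userInfo; the domain, code and key names below are placeholders rather than settled API:

NSError* error = [NSError errorWithDomain:@"ZZErrorDomain"                  // placeholder domain
                                     code:1                                // placeholder code
                                 userInfo:@{@"ZZEntryIndexKey": @(failedIndex)}];
NSNumber* failedEntryIndex = error.userInfo[@"ZZEntryIndexKey"];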

If -updateEntries:error: results in an error, then we document that all bets are off and the archive is thereby inconsistent. This is a change from setEntries: which at least tries to keep a consistent archive, but as you rightly pointed out, isn't possible in all scenarios e.g. a partial write.

I don't believe the current implementation really is atomic. It seems to provide an atomic API to clients, but doesn't actually implement atomic writing underneath.

As you've said, the API is atomic and intentionally so, to make it easy and straightforward to use, and for users to reason about. Property accessors return a consistent world in-between mutator calls. That sort of thing.

As for file system atomicity, there are several possibilities for improvement:

  • As you've said, rely on the user to achieve atomicity: he could create copied ZZMutableArchive instances or copy out the URL using NSFileManager and friends. The price for such atomicity and reliability will be performance loss. In some scenarios, the user wants the performance gains of skipping entries, not the performance loss of file system atomicity.
  • We could use write-ahead logging, which is a concept borrowed from databases. The essential bit is that before we write out any data to the zip file, we write it first to an auxiliary log. If that succeeds, then we write it to the zip file as well. If the WAL write fails, we abort and the zip file is untouched. If the zip write fails, the WAL log has the change and we could apply it again at some later stage.
  • Similar to the WAL but simpler: we could batch all changes into an NSData* before writing it out in one hit (a rough sketch follows this list). This still keeps the initial entry skipping optimisation. We would need to figure out how to re-apply it, though.
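
A rough sketch of that batching idea, assuming a file-based archive and using placeholder names for the URL and offset:

NSError* error = nil;
NSMutableData* pending = [NSMutableData data];
// ... append every changed local file record and the new central directory here ...

NSFileHandle* handle = [NSFileHandle fileHandleForWritingToURL:archiveURL error:&error];
[handle seekToFileOffset:firstChangedOffset];
[handle writeData:pending];                                      // the single "one hit" write
[handle truncateFileAtOffset:firstChangedOffset + pending.length];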
Contributor

mikeabdullah commented Jan 18, 2013

OK, I understand your reasons here. You've gone for raw speed over safety, which seems a fair trade-off for many situations.

However I disagree that the current API is "easy and straightforward to use, and for users to reason about". Everything (to my knowledge) in Cocoa that writes to the filesystem is fairly explicit in its naming to indicate so. In part because errors will happen, and they need to be communicated. For example:

  • Modifying an NSMutableData instance created with the contents of a file does not update the file to match; -writeToURL:… must be used to do that instead
  • Likewise, modifying NSMutableStrings doesn't edit their source file or data
  • Even ZZArchive's most similar API in Cocoa, NSFileWrapper, has an explicit -writeToURL:… method

I strongly recommend following this pattern. Sure, internally track the entries that come from an original zip in order to efficiently update. But externally separate out mutation, and committing the changes to disk.

Contributor

mikeabdullah commented Jan 18, 2013

For that matter, it occurs to me you could go the whole hog and hide the entire system behind a specialised subclass of NSFileWrapper.

Owner

pixelglow commented Jan 19, 2013

Based on your feedback, some research and thinking about the design, I will add these:

  • Error reporting via NSError** parameters, at least in the writing case.
  • Write out in phases, to minimize the damage from an aborted partial write. These phases will still happen within a single method, however:
    • Collect all data that needs to be written out. Any errors in this stage, including errors triggered by the writing callbacks, will leave the zip file intact.
    • Write out the collected data in one hit. Any errors here may cause zip file corruption, but see the next point. (A rough sketch of these two phases follows this list.)
  • Some form of WAL to avoid zip file corruption on writing out data. At this stage I'm not sure whether I want to make it explicit in the API -- which adds to cognitive overhead for using zipzap -- or try for a transparent solution which can fail in common edge cases like some other process or event changing the zip file between recording the WAL and reapplying the WAL. Would appreciate advice / feedback on this!
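
The two phases might be shaped like this inside a single method; dataForEntry:error: and writeCollectedData:error: are hypothetical helpers, not actual zipzap code:

// Phase 1: collect. Any failure here returns before the zip file is touched.
NSMutableData* collected = [NSMutableData data];
for (ZZArchiveEntry* entry in newEntries)
{
    NSData* entryData = [self dataForEntry:entry error:error];   // runs the writing callbacks
    if (!entryData)
        return NO;
    [collected appendData:entryData];
}

// Phase 2: write in one hit. A failure here is what the WAL would protect against.
return [self writeCollectedData:collected error:error];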
Owner

pixelglow commented Jan 19, 2013

Everything (to my knowledge) in Cocoa that writes to the filesystem is fairly explicit in its naming to indicate so.

Most Cocoa objects are not incrementally updated, so they can have data that is fully memory-resident, writing methods that take different URLs, and fully safe, atomic saves to the file system. But zipzap's initial-entry skipping is essentially incremental updating. We should compare it to Core Data, the closest incrementally updated API in Cocoa, e.g. NSManagedObjectContext.

OTOH, you can adapt the incremental zipzap API to the non-incremental pattern if you wish, e.g.

NSMutableData* data = [NSMutableData dataWithContentsOfURL:URL1];
ZZArchive* archive = [ZZArchive archiveWithData:data];
archive.entries = newEntries;
[archive.contents writeToURL:URL2 atomically:YES];

But it would be impossible or difficult to adapt a non-incremental zipzap to an incremental pattern.

a specialised subclass of NSFileWrapper

NSFileWrapper isn't a truly incremental API, since it can e.g. be passed between dispatch queues in UIDocument and still work correctly. I had a try at implementing an NSZipFileWrapper but it was hard -- NSFileWrapper doesn't appear to be suitable for public subclassing.

Contributor

mikeabdullah commented Jan 19, 2013

Most Cocoa objects are not incrementally updated, so they can have data that is fully memory-resident, writing methods that take different URLs, and fully safe, atomic saves to the file system. But zipzap's initial-entry skipping is essentially incremental updating. We should compare it to Core Data, the closest incrementally updated API in Cocoa, e.g. NSManagedObjectContext.

Good point. NSFileWrapper makes optimisations when it can while writing existing files inside a package, but that's as far as it goes. As for Core Data, though, it does explicitly separate modifying the context from saving those changes!

Contributor

mikeabdullah commented Jan 20, 2013

As to the idea of a write-ahead log, I argue it should have to be explicitly turned on by clients because:

  • For any use case other than zip-file-as-document-format, the log is useless
  • For documents, clients should ideally use the document system's atomic writing first if they can, and only switch to incremental updates if there's a tangible benefit to them

In the event of a failure mid-save, the log provides for recovery by the app, but it's possible that iTunes could be used to accidentally retrieve the document before it's been repaired, I believe?

Would the log itself just be another zip archive, containing only the new entries? Or some dedicated format?

Owner

pixelglow commented Jan 21, 2013

My intent was to use a ZZChannel to represent a temp channel for either file- or data-based archives. In the case of a pure overwrite, the temp channel contents will simply be swapped out, effectively becoming an atomic update. In the case of an update, the temp channel contents will be appended at the right offset. This temp channel will be the same sort of channel as the original i.e. file for file, data for data.

Once we do this in a commit, we'll then have to consider how to persist the file-based temp channel in the next commit. I was thinking of writing both the offset and the WAL URL to the xattr of the file. On opening the file, if the xattr shows an attached WAL, it would be applied. The use/reuse of the xattr would be controlled by an option or BOOL parameter passed into the initializer.
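
For what it's worth, recording and recovering the WAL reference through xattrs could look roughly like this on Darwin; the attribute name and the property-list payload are assumptions, not settled format:

#include <sys/xattr.h>

// record: archive offset plus WAL URL, serialized as a small property list
NSDictionary* info = @{@"offset": @(offset), @"walURL": walURL.absoluteString};
NSData* payload = [NSPropertyListSerialization dataWithPropertyList:info
                                                              format:NSPropertyListBinaryFormat_v1_0
                                                             options:0
                                                               error:&error];
setxattr(zipURL.path.fileSystemRepresentation, "com.example.zipzap.wal",
         payload.bytes, payload.length, 0, 0);

// recover on open: a non-negative size means a WAL is attached and should be re-applied
ssize_t size = getxattr(zipURL.path.fileSystemRepresentation, "com.example.zipzap.wal",
                        NULL, 0, 0, 0);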

Writing the WAL reference into the xattr solves issues when the zip file is renamed or overwritten. I don't know how yet to resolve it if the zip file is somehow updated between the time the WAL is written and is re-applied.

The WAL format will simply be the raw bytes. I was thinking of prepending the offset and length, but that would make it somewhat different in coding from a data-based temp channel, and not equivalent to a zip file in the optimised case of a pure overwrite. So in the case of a pure overwrite, the WAL will be a zip file; otherwise it will be whatever sequence of bytes is necessary to update the zip file.

Contributor

mikeabdullah commented Jan 21, 2013

Hmmm, this is all starting to get very complex. I wonder if energy would be better directed at carefully updating archives in a manner such that incomplete writes can be gracefully recovered from by ZZArchive's reading?

Owner

pixelglow commented Jan 21, 2013

Yes, the implementation is somewhat complex, although the prose used to describe the process is probably more complex than the actual Objective-C code!

Hopefully the interface will be simple and straightforward enough. This should achieve what you've asked for i.e.

I wonder if energy would be better directed at carefully updating archives in a manner such that incomplete writes can be gracefully recovered from by ZZArchive's reading?

pixelglow added a commit that referenced this issue Jan 22, 2013

Write to temp channel, then apply temp to original (ref #9).
* When updating, appends temp to original after skipped entries.

* When overwriting, atomically replaces original with temp.

* Channel output supports bias applied to offset property.

pixelglow added a commit that referenced this issue Feb 2, 2013

Improve error handling for writing (ref #9)
* Replace writeable entries with -[ZZMutableArchive updateEntries:error:] method.

* Stream, data and data consumer blocks now should report errors.

* Revise error codes to be consistent with write errors.

* Internal API now report file manager, POSIX and stream errors.

* Unit tests for creating or inserting erroneous entries.
Owner

pixelglow commented Feb 2, 2013

Since I've implemented all planned features discussed here except for WAL, I'm closing this issue.

@pixelglow pixelglow closed this Feb 2, 2013
