concurrent reads #14

Closed
mark-buer opened this Issue Jun 14, 2016 · 10 comments

mark-buer commented Jun 14, 2016

Hello Maintainer,

Firstly, thank you for creating and maintaining this library!

Secondly, are you interested in adapting this library to support concurrent reads?

For my application, I need to read entries (at random) from a zip file over an extended period of time. The reads might be concurrent, and might be interleaved (read a few bytes from zip entry A, then a few bytes from zip entry B, then A again, etc).

The current API shape makes this use-case impossible without creating multiple ZipArchives.
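The interleaved-read pattern described above can be sketched with plain std types, no zip parsing involved (all names here are illustrative): each "entry" gets its own OS handle into the same file, so each has its own independent cursor. This is essentially what opening multiple ZipArchives buys today.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

// Interleaved reads from two "entries" (byte ranges) of one file, using one
// independent handle per entry. Each OS handle keeps its own cursor, so
// reading from one view does not disturb the other.
fn interleaved_reads(path: &std::path::Path) -> std::io::Result<(Vec<u8>, Vec<u8>)> {
    let mut a = File::open(path)?; // view for "entry A"
    let mut b = File::open(path)?; // view for "entry B"
    a.seek(SeekFrom::Start(0))?;   // entry A starts at offset 0
    b.seek(SeekFrom::Start(4))?;   // entry B starts at offset 4

    let (mut out_a, mut out_b) = (Vec::new(), Vec::new());
    let mut buf = [0u8; 2];
    for _ in 0..2 {
        a.read_exact(&mut buf)?;           // a few bytes from A...
        out_a.extend_from_slice(&buf);
        b.read_exact(&mut buf)?;           // ...then a few from B, then A again
        out_b.extend_from_slice(&buf);
    }
    Ok((out_a, out_b))
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("interleave_demo.bin");
    File::create(&path)?.write_all(b"AAAABBBB")?;
    let (a, b) = interleaved_reads(&path)?;
    assert_eq!(a, b"AAAA");
    assert_eq!(b, b"BBBB");
    Ok(())
}
```

The question in this issue is whether the library itself can offer this shape without forcing a full re-parse of the archive metadata per handle.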

I've hacked the code to accomplish my goals, and thought I'd share it here in case you or any library users are interested in doing something similar.

I'd rather not maintain a separate fork of this library just to suit my use-case if it can be avoided, so if there is interest in incorporating any kind of support for concurrent reads, I'm happy to help out.

If there is no interest, please don't hesitate to close this issue.

Owner

mvdnes commented Jun 17, 2016

I would be very interested in this.
The problem I ran into, however, was keeping track of the position in the files: I did not want to do a seek for every read just in case someone else had moved the file object's position.

Could you please explain how you tackled the problem of getting two views into one file?
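The seek-per-read approach mentioned above can be sketched with std types; `View` and `views_demo` are illustrative names, not part of the library. Each view remembers its own offset and re-seeks the shared stream before every read, so another view moving the cursor in between is harmless.

```rust
use std::cell::RefCell;
use std::io::{Cursor, Read, Seek, SeekFrom};
use std::rc::Rc;

// Each view holds a shared handle to one underlying stream plus its own
// logical offset, and re-seeks before every read. Views can therefore be
// interleaved freely, at the cost of one seek per read.
struct View<R> {
    inner: Rc<RefCell<R>>,
    pos: u64,
}

impl<R: Read + Seek> Read for View<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let mut inner = self.inner.borrow_mut();
        inner.seek(SeekFrom::Start(self.pos))?; // undo whatever others did
        let n = inner.read(buf)?;
        self.pos += n as u64; // advance only our own bookkeeping
        Ok(n)
    }
}

fn views_demo() -> (Vec<u8>, Vec<u8>) {
    let src = Rc::new(RefCell::new(Cursor::new(b"AAAABBBB".to_vec())));
    let mut a = View { inner: Rc::clone(&src), pos: 0 };
    let mut b = View { inner: Rc::clone(&src), pos: 4 };

    let (mut out_a, mut out_b) = (Vec::new(), Vec::new());
    let mut buf = [0u8; 2];
    for _ in 0..2 {
        a.read_exact(&mut buf).unwrap();
        out_a.extend_from_slice(&buf);
        b.read_exact(&mut buf).unwrap();
        out_b.extend_from_slice(&buf);
    }
    (out_a, out_b)
}

fn main() {
    let (a, b) = views_demo();
    assert_eq!(a, b"AAAA");
    assert_eq!(b, b"BBBB");
}
```

This sketch is single-threaded (`Rc<RefCell<_>>`); a thread-safe variant would use `Arc<Mutex<_>>` and pay a lock per read on top of the seek.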

mark-buer commented Jun 20, 2016

I tackled the "two views into one file" problem by moving the problem onto the consumer of the API.

In my hacky code, the API consumer is allowed to create multiple concurrent Readers into the decompressed data, albeit at the cost of providing multiple source Readers: one source Reader must be supplied per decompressed-data Reader created.

With such an API shape, it would no longer be necessary for the ZipArchive to own the source Reader. Instead, it could mutably borrow it while parsing the metadata.

If the decompressed data Readers implement into_inner() (as BufReader's does), the API consumer can "reuse" the source Readers as needed. Thus, it would still be possible to use a single source Reader to parse and read the entire zip container sequentially; the API consumer would repeatedly call into_inner() to retrieve and reuse the source Reader for the next operation.
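A minimal std-only sketch of this sequential-reuse pattern, using BufReader as a stand-in for the decompressed-data Reader (`sequential_reuse` is an illustrative name):

```rust
use std::io::{BufRead, BufReader, Cursor, Seek, SeekFrom};

// Reuse one source reader sequentially: each wrapping reader hands the
// source back via into_inner(), so the next operation can reuse it.
fn sequential_reuse() -> (String, String) {
    let source = Cursor::new(b"first\nsecond\n".to_vec());

    // "Entry one": wrap the source, read, then recover it.
    let mut r1 = BufReader::new(source);
    let mut line1 = String::new();
    r1.read_line(&mut line1).unwrap();
    let mut source = r1.into_inner(); // source reader handed back

    // Caution: BufReader may have read ahead of what we consumed, so the
    // recovered source must be re-seeked before it is reused.
    source.seek(SeekFrom::Start(line1.len() as u64)).unwrap();

    // "Entry two": wrap the recovered source again.
    let mut r2 = BufReader::new(source);
    let mut line2 = String::new();
    r2.read_line(&mut line2).unwrap();
    (line1, line2)
}

fn main() {
    let (a, b) = sequential_reuse();
    assert_eq!(a, "first\n");
    assert_eq!(b, "second\n");
}
```

The re-seek step is the kind of position bookkeeping a real decompressed-data Reader would have to encapsulate for this API shape to be safe.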

It would also be possible to do more exotic things like those described in my original comment. In that case, the API consumer could use multiple source Readers to read from multiple decompressed streams concurrently. The source Readers might be created on demand, maintained in a pool, or multiplexed from a single Reader (that might make an interesting library!), or whatever. How the API consumer manages these source Readers isn't really a concern of the API itself.

I feel like I struggled to articulate this; I hope it makes sense.

Owner

mvdnes commented Aug 28, 2016

It makes a lot of sense.

I do not have the time to implement such a thing, and I still don't know what the best approach is.
Letting users provide new Readers could have some weird effects, for example when the wrong file is passed in.
But I am always interested in hearing more ideas!

icefoxen commented Jan 26, 2017

I was thinking about how to solve this problem and came to a similar solution: each handle to the zip file has its own Reader, be it a file handle or whatever, and the ZipArchive itself just contains shared data like the index. You could even (optionally) wrap it in an Arc and have each handle hold a reference to it, so there are no lifetime dependencies.

Edit, to expand:

Not sure how reusing a single source Reader would work, since into_inner() consumes the wrapping object. I suppose you'd have to have exactly one ZipFile at a time with access to the Reader.

The essential problem is the shared state of the current file position. We would have to either have that tracked individually per ZipFile (or equivalent), or have it shared between all of them and lock reads to a single ZipFile at a time. The right way, as far as I can tell, might honestly be to suck it up and have multiple Readers, and just have the ZipArchive store the file offset data shared between them. Doing anything else seems to get into the headache of "how much read data do we want to cache and share between ZipFiles?" which shouldn't be this library's job.
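The shared-index-plus-per-handle-reader shape can be sketched with std types alone. Everything here is hypothetical (`Index`, `read_entry`, `concurrent_demo` are made-up names, and a real index would hold central-directory records rather than bare offsets):

```rust
use std::collections::HashMap;
use std::io::{Cursor, Read, Seek, SeekFrom};
use std::sync::Arc;
use std::thread;

// Parse the archive index once, share it via Arc, and give every handle its
// own reader (its own cursor) over the raw bytes.
type Index = HashMap<String, (u64, usize)>; // entry name -> (offset, length)

fn read_entry<R: Read + Seek>(mut reader: R, index: &Index, name: &str) -> Vec<u8> {
    let (offset, len) = index[name];
    reader.seek(SeekFrom::Start(offset)).unwrap();
    let mut out = vec![0u8; len];
    reader.read_exact(&mut out).unwrap();
    out
}

fn concurrent_demo() -> (Vec<u8>, Vec<u8>) {
    let data = Arc::new(b"AAAABBBB".to_vec());
    let index: Arc<Index> = Arc::new(
        [("a".to_string(), (0, 4)), ("b".to_string(), (4, 4))]
            .into_iter()
            .collect(),
    );

    // Each thread gets its own Cursor over the shared bytes; only the
    // lightweight index is shared, and it is read-only.
    let (d1, i1) = (Arc::clone(&data), Arc::clone(&index));
    let t1 = thread::spawn(move || read_entry(Cursor::new(&d1[..]), &i1, "a"));
    let (d2, i2) = (Arc::clone(&data), Arc::clone(&index));
    let t2 = thread::spawn(move || read_entry(Cursor::new(&d2[..]), &i2, "b"));
    (t1.join().unwrap(), t2.join().unwrap())
}

fn main() {
    let (a, b) = concurrent_demo();
    assert_eq!(a, b"AAAA");
    assert_eq!(b, b"BBBB");
}
```

Nothing here caches decompressed data; each handle would own its own decompression state, keeping that headache out of the library.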

icefoxen commented Jan 26, 2017

Experimental implementation of this at icefoxen@6f15148, but see the commit notes for caveats.

I also appear to have mangled it with rustfmt by accident. Bah.

icefoxen commented Jan 27, 2017

Rambling a little bit, but what I propose is breaking the ZipArchive -> ZipFile structure apart a little further. First there would be a ZipHeader, which contains the file metadata that ZipArchive holds now. Then a ZipArchive would contain the Read object that is the source of the data, and from it you could create a ZipFile object that consumes the ZipArchive but can be turned back into it with into_inner(). There could also be a version of ZipFile that just borrows the Read object, for when that's all you need.

The idea is basically that you could share a ZipHeader, so the zip file's header data only needs to be parsed once, but you could then open a file multiple times from multiple different threads and read it independently.

Upon inspection this is basically a slightly more formal version of what @mark-buer proposes anyway...
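A type-level sketch of that split; none of these are the library's real types, and the metadata is reduced to a single stand-in field. The ownership dance (open consumes the archive, into_inner() hands it back) is the point:

```rust
use std::io::Cursor;
use std::sync::Arc;

// Hypothetical shapes: shareable parsed metadata, an archive that owns the
// source reader, and an entry reader that consumes the archive but can
// return it.
struct ZipHeader {
    entry_count: usize, // stand-in for real central-directory data
}

struct Archive<R> {
    header: Arc<ZipHeader>,
    reader: R,
}

struct EntryReader<R> {
    archive: Archive<R>, // owns the archive while an entry is open
}

impl<R> Archive<R> {
    fn open_entry(self) -> EntryReader<R> {
        EntryReader { archive: self }
    }
}

impl<R> EntryReader<R> {
    fn into_inner(self) -> Archive<R> {
        self.archive // hand the archive (and its reader) back
    }
}

fn round_trip() -> usize {
    let header = Arc::new(ZipHeader { entry_count: 1 });
    let archive = Archive { header, reader: Cursor::new(vec![1u8, 2, 3]) };
    let entry = archive.open_entry(); // archive consumed here...
    let archive = entry.into_inner(); // ...and recovered here
    archive.header.entry_count
}

fn main() {
    assert_eq!(round_trip(), 1);
}
```

Because the ZipHeader lives behind an Arc, a second Archive over a second reader could share it without re-parsing, which is the multi-thread story above.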

Owner

mvdnes commented Jan 29, 2017

I love the idea of having this feature, but I don't know whether it is very practical.

Especially since I think the user could also do one of the following things:

  • Copying the decompressed file and keeping multiple cursors on it
  • Opening the ZipArchive multiple times (by calling ZipArchive::new(...))

Passing a reader object to any of the functions would be a no: it may introduce very confusing bugs for users, and misuse is hard to detect on the library side.

One other possibility is to hand out Cursor-like objects to the user. In the case of a stored (uncompressed) file this would be easy, by simply seeking to the correct location. The deflate (and bzip2) cases would be harder, as we would also need to keep track of the decompression context.
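For the stored case, handing out a Cursor-like object over an in-memory archive is indeed straightforward; a sketch under made-up offsets (`stored_entry_cursor` and `demo` are illustrative names):

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

// For a *stored* (uncompressed) entry backed by in-memory bytes, the
// handed-out reader can simply be a Cursor over that entry's byte range;
// seeking within it is trivial. A deflated entry would additionally need
// its decompression state carried along, which is the hard part.
fn stored_entry_cursor(archive: &[u8], offset: usize, len: usize) -> Cursor<&[u8]> {
    Cursor::new(&archive[offset..offset + len])
}

fn demo() -> (Vec<u8>, Vec<u8>) {
    let archive = b"junkHELLOjunk"; // pretend entry data sits at offset 4, len 5
    let mut c1 = stored_entry_cursor(archive, 4, 5);
    let mut c2 = stored_entry_cursor(archive, 4, 5);

    // The two cursors have independent positions within the same entry.
    c2.seek(SeekFrom::Start(1)).unwrap();
    let (mut a, mut b) = (Vec::new(), Vec::new());
    c1.read_to_end(&mut a).unwrap();
    c2.read_to_end(&mut b).unwrap();
    (a, b)
}

fn main() {
    let (a, b) = demo();
    assert_eq!(a, b"HELLO");
    assert_eq!(b, b"ELLO");
}
```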

@icefoxen: I do want to take a look at your code, but since the formatting changed I have a hard time seeing the changes. Could you undo them, or split them into separate commits?

icefoxen commented Jan 30, 2017

@mvdnes Sorry 'bout that, here's a somewhat more cleaned-up version with just the relevant changes: icefoxen@a52af55

The goals, as far as I can tell, are:

  • allow the user to have multiple Read objects into a compressed file via ZipFile or something like that, while ideally being thread-safe,
  • without having to just open the ZipArchive multiple times and re-parse the zip header each time
  • while keeping it safe so that people can't screw up by opening one file and trying to read from it using data from a different file.

The hard part really is that you can't necessarily just clone() a Read, because it could be anything from an in-memory buffer to a file on disk to a network connection. But you also can't easily share one Read's state, because it has a cursor inside it, so if multiple threads try to access it at the same time they clash. So you need multiple cursors into a buffer, or multiple open file handles (which achieves the same thing at the OS level), or... something.

Goal 2 might or might not really matter depending on what you're doing; certainly, relaxing it would make life much easier. All of the problems disappear if you just create multiple ZipArchives.

(edited to ramble more)
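Of the options above, the buffer-backed one is the easy case, as this std-only sketch shows (`clone_demo` is an illustrative name). File, by contrast, has no Clone impl, and while File::try_clone exists, the duplicate handle shares the underlying cursor; that is exactly the clash described above.

```rust
use std::io::{Cursor, Read};

// A Cursor over an in-memory buffer implements Clone, and the clone gets
// its own independent position.
fn clone_demo() -> (Vec<u8>, Vec<u8>) {
    let mut original = Cursor::new(b"abcdef".to_vec());
    let mut buf = [0u8; 3];
    original.read_exact(&mut buf).unwrap(); // advance the original to offset 3

    let mut copy = original.clone(); // the clone starts at offset 3 too...
    let (mut rest_a, mut rest_b) = (Vec::new(), Vec::new());
    original.read_to_end(&mut rest_a).unwrap();
    copy.read_to_end(&mut rest_b).unwrap(); // ...but advances independently
    (rest_a, rest_b)
}

fn main() {
    let (a, b) = clone_demo();
    assert_eq!(a, b"def");
    assert_eq!(b, b"def");
}
```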

spease commented Dec 17, 2017

Was poking around to see if this was possible...would be nice to have this.

Owner

mvdnes commented Jun 22, 2018

Closing this.
I think opening a ZipArchive multiple times is the best option for now, as zip files are not parallel by nature.

I have added a derive(Clone) for ZipArchive, which enables cloning when the reader supports it. This will not be the case for a File, but it will be for a Cursor.
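A minimal sketch of how a derived Clone propagates the reader's cloneability; `Archive` here is an illustrative stand-in, not the crate's actual ZipArchive:

```rust
use std::io::Cursor;

// The derived Clone exists exactly when the reader type R is Clone.
#[derive(Clone)]
struct Archive<R> {
    reader: R,
}

fn main() {
    // Works: Cursor<Vec<u8>> implements Clone, so Archive does too.
    let a = Archive { reader: Cursor::new(vec![1u8, 2, 3]) };
    let b = a.clone();
    assert_eq!(a.reader.into_inner(), b.reader.into_inner());

    // Would not compile: std::fs::File does not implement Clone.
    // let f = Archive { reader: std::fs::File::open("data.zip").unwrap() };
    // let _g = f.clone();
}
```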

mvdnes closed this Jun 22, 2018
