7431 ZFS Channel Programs #198
Can one of the admins verify this patch? |
Question: if I have lua5.2 as a userland package, can I use it for builds? Or do you import all of the lua5.2 sources into uts? |
Agree with Igor; we should probably do the same as we did with ficl for the loader project: install the binaries with a -sys postfix. |
It was necessary to slightly modify the base lua 5.2.4 interpreter for a couple reasons:
From looking at the configuration options for the packaged Lua interpreter, unfortunately I don't think we'd be able to just use binaries from a userland package with these modifications. As such we've added the full source. I'm open to suggestions if there's a better home for the interpreter code than For reference, here's the diff between the stock Lua 5.2.4 interpreter and the modified one we've included: |
615a8d5 to f646606
I think, given that the modified Lua is built as part of the kernel modules, it would be better to put the sources in: -Igor
|
Doesn't this sort of violate the illumos-gate rule of (in general) not introducing APIs without consumers? I haven't seen this year's presentation about channel programs, but based on the previously presented info, one of the goals motivating this work was to allow simpler interactions between userspace tools and kernel. To that end, I think it would make sense to attack some of the (uglier) parts of libzpool/libzfs/libzfs_core and have them make use of this API instead of the current mess of C. This would not only take care of the unused-API aspect, but it would also provide a set of excellent examples how to use the API. (Yes, I realize that the tests provide minimal examples, but it'd be nice to see real world ones as well.) |
We would like to make libzfs use channel programs where possible. We may be However, this API is not without consumers. The snapshot deletion ioctl That said, we are concerned with adding any new functionality that may see --matt On Sunday, October 2, 2016, jeffpc notifications@github.com wrote:
|
The manual page suggests that "Channel programs may only be run with root privileges". Would it be better to introduce a specific privileges(5) privilege for this functionality? How does this mechanism interact with delegated datasets? If we're expecting to re-do existing C-based functionality in terms of channel programs, that seems like an important consideration. I took a look in Have you given any thought to the potentially huge new attack surface that a Turing-complete interpreted language adds to the kernel? At least with the traditional The manual page suggests that if a program runs longer than its timeout allows, it will be "stopped and an error will be returned". What happens to the actions that the program was able to complete before termination? Is everything rolled back? If a program runs for 10 seconds, what impact does that have on ZFS I/O activity for the pool in question? Does it mean that, say, all synchronous writes (e.g. |
The channel program ioctl currently uses
Nothing in that area right now, but on the slate as a future feature.
The first half of the
Nope, anything executed before an error in the channel program stays executed. We've taken measures to prevent this causing problems - each effectful library function has a corresponding dry-run check function which can be used to make sure in advance that an operation will succeed. It's not foolproof (since the script could change system state between the checkfunc and syncfunc), but it's generally possible to specify whatever error handling behavior you want within the script itself.
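To make the check/sync pairing concrete, here is a minimal Python sketch of the pattern, not the actual ZCP Lua API. `FakePool`, `check_destroy`, `sync_destroy`, and the error strings are all hypothetical stand-ins invented for illustration. It contrasts a script that stops at the first error (earlier effects stay applied) with one that runs every dry-run check up front.

```python
class FakePool:
    """Hypothetical in-memory stand-in for a pool; not a real ZFS interface."""

    def __init__(self, snapshots, held=()):
        self.snapshots = set(snapshots)
        self.held = set(held)

    def check_destroy(self, name):
        """Dry run: return 0 if a destroy would succeed right now, else an error."""
        if name not in self.snapshots:
            return "ENOENT"
        if name in self.held:
            return "EBUSY"
        return 0

    def sync_destroy(self, name):
        """Effectful call: destroy the snapshot, or return the check's error."""
        err = self.check_destroy(name)
        if err == 0:
            self.snapshots.remove(name)
        return err


def destroy_stop_on_error(pool, names):
    """Stop at the first failure; whatever was already destroyed stays destroyed."""
    done = []
    for name in names:
        err = pool.sync_destroy(name)
        if err != 0:
            return err, done
        done.append(name)
    return 0, done


def destroy_all_or_nothing(pool, names):
    """Run every dry-run check first; modify the pool only if all of them pass."""
    failures = {}
    for name in names:
        err = pool.check_destroy(name)
        if err != 0:
            failures[name] = err
    if failures:
        return failures  # nothing was modified
    for name in names:
        pool.sync_destroy(name)
    return {}
```

In this single-threaded toy, the second variant is safe because nothing can change the pool between the checks and the destroys, which is the property the caveat above is about.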
We block only the transaction group syncing thread with the channel program execution. A very long-running channel program executed repeatedly could cause sync writes to get throttled, but this only really happens when e.g. destroying 10,000 snapshots, which has the same effect anyway. (In practice, using channel programs tends to make these bulky operations complete much faster, letting the system move on rather than taking up time in every txg sync.) |
Which checks for the
Specifically,
More precisely, we only allow a channel program to be run with the |
@zettabot go |
Looks like there was a commit since these tests were written that changed how certain un-set properties behave, causing a few failures. Wrote up a fix, verifying it now. |
@zettabot go |
Circling back around to this one. Sorry it's taken so long -- I've been outrageously busy the last month or so.
It'd be really great if we could get a big theory statement comment as part of this integration.
It doesn't seem like the script could, itself, gracefully handle the failure mode of running longer than 10 seconds? I'm still a bit uncertain about how to reason about the correctness of an arbitrary channel program in the face of a wall clock expiry time. What if this channel program is running within a virtual machine, and the virtual CPU upon which the channel program is executing is, itself, not scheduled for 9.99 of those 10 seconds? What if this consistently happens, in a heavily overloaded hypervisor environment? I don't see anything addressing one of my questions from earlier:
I have also thought of a few more questions:
|
You're right - a channel program can handle errors returned from ZFS but can't specify anything about what happens if there's an error with the script itself. In the case of timeouts, we have two ways of ensuring this won't cripple anything important:
Hopefully, this should cover any use cases that are either larger scale or otherwise not tolerant of timeout failures.
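The timeout failure mode under discussion can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not how the ZCP code enforces its limit: `run_with_deadline`, the injected `clock`, and the `"ETIME"` error string are hypothetical. It shows that a stall between steps (for example, a descheduled virtual CPU) exhausts a wall-clock budget even though little work ran, and that the steps completed before expiry stay applied.

```python
def run_with_deadline(steps, timeout, clock):
    """Run (name, fn) steps in order until the wall-clock budget is spent.

    Returns (error, completed_names). Completed steps are not rolled back.
    """
    deadline = clock() + timeout
    completed = []
    for name, fn in steps:
        # The budget is checked only between steps, against wall-clock time.
        if clock() >= deadline:
            return "ETIME", completed
        fn()
        completed.append(name)
    return 0, completed


def make_fake_clock(times):
    """Return a clock() that replays the given readings, for deterministic runs."""
    readings = iter(times)
    return lambda: next(readings)
```

With a 10-second budget and clock readings of 0, 1, and 11, the first step runs and the second is refused: the script itself did almost no work, but the wall clock moved on while it was stalled.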
As far as security goes, we're somewhat leaning on the fact that we require root in the global zone to run a channel program script, given that with those privileges anything a malicious script could do could also easily be accomplished with With respect to resource usage and testing, this is an area for future work I've been looking into. The limits we have in place so far seem to do a pretty good job of preventing a ZCP script from tanking performance, but I've been considering adding some randomized testing for channel program scripts, possibly to
Lua linting and checkers exist, but from what I've seen are pretty limited in usefulness - they're able to check for {undeclared, multiply-declared, attempting to change constant} variables, and not much else. It would probably be possible to move the scripts themselves out to their own files, but I'm not sure this would give any advantage over having the script with the related C code, given the lack of good static analysis.
I'm not sure how much additional info could be gathered with the help of an mdb module. As it stands, the lua interpreter's state structure gives a reasonably good idea of what a running script is doing, so it's definitely possible at present to diagnose a crashed channel program, if finicky. Function entry/exit probes in the zcp code and lua interpreter have proved generally sufficient so far for live debugging, but I suspect there may be a number of places where adding static probes could be quite helpful. This would be a good addition, though I think it's minor enough to be added gradually and/or with future changes. |
@zettabot go |
@zettabot go |
Closing this as it has been updated and reposted as #397. |
Reviewed by: Matt Ahrens mahrens@delphix.com ( @ahrens )
Reviewed by: George Wilson george.wilson@delphix.com ( @grwilson )
Reviewed by: John Kennedy john.kennedy@delphix.com ( @jwk404 )
Reviewed by: Chris Williamson chris.williamson@delphix.com ( @cwill )
ZFS channel programs (ZCP) add support for performing compound ZFS administrative actions via Lua scripts in a sandboxed environment with time and memory limits.
Upstream bugs: DLPX-39221, DLPX-40120, DLPX-44957, DLPX-46641, DLPX-46247, DLPX-46672, DLPX-47073