Conversation

@kohsuke (Contributor) commented Nov 30, 2013

spa_tx_assign_add_nsecs was added in 2d37239 to record the amount of time taken to assign a transaction to TXG.

Unfortunately, this only kicks in for dmu_tx_try_assign with TXG_WAIT, and it misses all the interesting and important cases (such as zfs_write and all the other ZPL functions) where dmu_tx_assign is called with TXG_NOWAIT and ERESTART results in a new transaction being created.

This change introduces a new function dmu_tx_assign_add_nsecs, and makes these TXG_NOWAIT callers call it appropriately. This allows more accurate insight into the time spent assigning a TXG.


Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
@nedbass (Contributor) commented Dec 2, 2013

I'm not sure it's useful to record assign times for TXG_NOWAIT transactions. Either they're assigned immediately or they are aborted. The idea of the histogram is to track delay times for transactions that are forced to wait.

Also, #1696 includes 57b270f which converts most of the ZPL functions to use TXG_WAIT.

@kohsuke (Contributor, Author) commented Dec 2, 2013

My point is that this "if fail, abort, wait, and retry" cycle effectively constitutes "delay times for transactions that are forced to wait," since all it's doing is waiting for an available TXG to join.

The context of this is that I was looking into my "dd if=/dev/zero of=/file/on/zfs" performance, which eventually led me to the TXG size problem (see #1913). During this investigation, strace reported that some write system calls were taking a really long time, but that didn't show up in the dmu_tx_assign histogram. I'd rather the histogram showed that, since I can't run every program under strace.

It's good to hear that #1696 addresses (some of?) this issue. The same pull request was mentioned in #1913 as well, so it's clearly a very important change for me. Do you have any guesstimate of when it's going to go in?

@nedbass (Contributor) commented Dec 2, 2013

Yes I see the usefulness of this now. #1696 will fix zfs_write() but others like zfs_mkdir() will still not record their wait times completely. The only problem I see with your solution is that it will call spa_tx_assign_add_nsecs() twice, once in the ZPL function and once in dmu_tx_assign() for the tx that gets assigned.

I wonder if we should just move the spa_tx_assign_add_nsecs() call from dmu_tx_assign() to dmu_tx_wait(), and change "assign" to "wait" in the names of associated functions and variables. With 9036294, dmu_tx_wait() is called at most once per operation in nearly all cases. So this approach would simplify the code and still supply most of the information we want.

@behlendorf (Contributor) commented

@kohsuke @nedbass After the write throttle code is merged (soon, I hope) it may make sense to move this to dmu_tx_wait(). As you've pointed out, after that change we should never wait more than once for TXG_WAIT callers. However, before doing that we should audit the rest of the dmu_tx_assign() callers; we may be able to convert more of them to TXG_WAIT.

@nedbass (Contributor) commented Dec 3, 2013

The write throttle code changes the remaining dmu_tx_assign() callers to

  dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);

so they call it at most once with TXG_NOWAIT, and with TXG_WAITED after that. With TXG_WAITED it will not wait again for the dirty data limit, so I think this already gives most of the performance benefit of TXG_WAIT, and still calls dmu_tx_wait() only once except in some unusual cases.

@behlendorf (Contributor) commented

@kohsuke @nedbass The write throttle changes have just been merged; we should review what, if any, changes are still needed here.

@behlendorf (Contributor) commented

@nedbass Can you take a quick look at this? Is there still remaining cleanup to be done here to make these stats useful again?

@nedbass (Contributor) commented Feb 24, 2014

I think it still makes sense to move the spa_tx_assign_add_nsecs() call to dmu_tx_wait(). That should give us a more accurate and useful accounting of wait times. I can make a quick patch to do that.

@behlendorf (Contributor) commented

@nedbass That would be great, thanks.

nedbass added a commit to nedbass/zfs referencing this pull request on Feb 28, 2014:
Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign().  This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.

One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded.  However, this case should not often arise in practice,
because most callers currently use one of these forms:

  dmu_tx_assign(tx, TXG_WAIT);
  dmu_tx_assign(tx, waited ?  TXG_WAITED : TXG_NOWAIT);

The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed.  Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.

Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram.  Therefore the histogram should strictly
be used for visibility into the normal delay mechanisms and not as a
profiling tool for code performance.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes openzfs#1915
behlendorf closed this in a77c4c8 on Mar 7, 2014