commit zfs-0.6.4-93-gca0bf58 causes excessive server load #4210
Comments
Here are the backtraces of all processes that I extracted from the crashdump file.
@jxiong any reason you can't use the 0.6.5* branch (which should include fixes against that commit)?
@kernelOfTruth we tried the latest revision, 0.6.5.4, which came out last Friday, and it hit the same issue.
Agreed, this looks different. That said, the 0.6.5.4 tag definitely does contain fixes for some deadlocks introduced by that big Illumos commit you referenced which reworked the ARC. I'd suggest using that or an 0.6.4 variant which predates the change. We're locally running 0.6.5.3 on our Lustre systems and will be rolling to 0.6.5.4 with the next update. To my knowledge we haven't hit this specific issue, but we may just not be pushing the system hard enough. A few things caught my eye in the stacks you posted.
@behlendorf I didn't see this pattern in the other crashdumps. But for this specific call, I can see the size passed to kdump was interrupted …
I would like to look into this with your help. I would appreciate it if you could tell me where and how to start the investigation.
@jxiong thanks for the quick follow up. Since you have a crash dump, I'd start by carefully inspecting the zio (including its children) which the … Have you ruled out hardware problems on this system? Real drives can misbehave in the strangest ways. You might be able to get some insight by looking at the … I noticed the pool in question has a low txg number, which indicates it's a new pool. Does this mean you're able to easily reproduce this issue? If so, there's some debugging you could enable which might help shed some light on the issue. The latest SPL code has added a new … I know you've identified …
When this crash dump was collected, the OST … I believe the disk I/O had not yet finished, because the executor of this ZIO is the OST thread: … Then I investigated the ZIO at the head of the active queue: … The disk I/O of this ZIO must have finished; however, its stage is still … I'm pretty sure the hardware is okay, because we ran zfs-0.6.4-92-gb9541d6 for many hours and never hit this issue, while we can usually hit it within half an hour when running a problematic revision. I didn't have a copy of … A lot of labor, sigh.
@jxiong I've definitely been down this exact rabbit hole before, and we've found and fixed a few rare issues. One such problem was addressed by commit aa159af, but that fix is part of 0.6.5.3 and later, so if you reproduced the issue on 0.6.5.4 that's not it. I'm just speculating here without too much information, but it sounds like either … Or possibly …
@kernelOfTruth Those patches have not landed in zfs-0.6.5-release but they are in master. I will check them and give them a try. Thanks.
@jxiong if you test out these commits on a testing box, make sure to also include 23de906 "Illumos 5745 - zfs set allows only one dataset property to be set at a time" and 37f8a88 "Illumos 5746 - more checksumming in zfs send", otherwise the previously mentioned commits (related to prefetch) won't apply cleanly. An alternative would be to simply test master on a dedicated testing box.
@jxiong I definitely want to get to the bottom of this. Do you have a test setup where you're able to reproduce the issue fairly easily? I wouldn't mind picking through the crash dumps myself if they're still available.
I would like to update the status of this problem - we tried the ZFS master revision with the fixes @kernelOfTruth mentioned above, but unfortunately it didn't help. @behlendorf Yes, this problem can be reproduced on Hyperion really easily; most of the time we start seeing it within several minutes after the test is launched. Sure, I will upload a crash dump to Dropbox soon and update this ticket again. I have crash dumps for both a 'good' and a 'bad' revision.
@behlendorf Please check the crashdump at: https://www.dropbox.com/s/23my4xnpahiibgk/zfs-reproduced.tar?dl=0, thanks.
@jxiong thanks for posting the crash dump; I've spent the morning digging through it. It contains some really interesting insights into Lustre's behavior. I suspect you were on the right path investigating the vdev queue. What I found confirms your observation above that the last serviced I/O for that vdev took a little over one second to complete. On the surface that sounds pretty bad, but as you noted there were 507 pending zios in the sync read class (all other classes were empty). Since sync reads are treated in a strictly FIFO fashion, that implies the disk was running at about ~500 read IOPS. Not too shabby. This means that any on-demand reads which are issued are going to have to pay a 1 second latency. Based on the other stack traces, it looks like the vast majority of those sync reads are coming from the ll_ost_io threads; I count roughly 224 of them actively waiting on a read in the crash dump. As you pointed out, this definitely isn't a deadlock. It seems to be more a matter of … This behavior has always existed in Lustre, but my guess is that ca0bf58 perturbed things to make this case worse for Lustre. Maybe by reducing contention in the ARC all those Lustre IO threads are now able to drive the disk more, so … The good news is that we might be able to fix this fairly easily. In our ideal world what we want is for all those … Here's an untested patch which adds this prefetch logic to Lustre. Can you try applying it and see if it resolves the issue? I would think this might also help improve the maximum create rate on an idle system.

```diff
diff --git a/lustre/osd-zfs/osd_object.c b/lustre/osd-zfs/osd_object.c
index 9884333..5f1e85c 100644
--- a/lustre/osd-zfs/osd_object.c
+++ b/lustre/osd-zfs/osd_object.c
@@ -1086,6 +1086,10 @@ static int osd_declare_object_create(const struct lu_env *env,
 	/* and we'll add it to some mapping */
 	zapid = osd_get_name_n_idx(env, osd, fid, buf);
+
+	/* prefetch any needed zap blocks for this key */
+	zap_prefetch(osd->od_os, zapid, buf);
+
 	dmu_tx_hold_bonus(oh->ot_tx, zapid);
 	dmu_tx_hold_zap(oh->ot_tx, zapid, TRUE, buf);
```
An additional thing you could try is increasing the … Finally, I noticed in that crash dump that the threads blocked in …
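As a rough sanity check of the sync-read numbers quoted at the top of this comment (507 queued zios, ~500 IOPS): under a strict FIFO policy the expected wait for a freshly issued read is simply queue depth divided by throughput.

```python
# Little's-law style estimate using the figures from the crash dump analysis
# above: a new sync read queued behind 507 others at ~500 IOPS waits ~1 s.
pending_sync_reads = 507
measured_iops = 500
print(f"expected wait for a new sync read: ~{pending_sync_reads / measured_iops:.2f} s")
```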
@behlendorf thanks a lot for looking into this issue.
Yes, I noticed this too. The IOPS was pretty good for a hard drive. This is why I suspected that this problem may boil down to the fact that ZFS/Lustre needs too many metadata reads before reaching the actual data. I will apply your patch and see how it goes.
Actually this is because txg_sync() was being blocked by a slow dmu_read(), and I guess this will be okay after the read problem is addressed. This problem could be mitigated significantly once @don-brady's isolated metadata allocation class is in place, because we could store metadata on low-latency drives. Thanks again.
Yes, I tried this but it didn't help (I don't remember the details though). I also tried reducing the number of OSS threads, but that didn't help either.
This wouldn't be a bad thing to try and rule out. I did make a decent amount of changes to the eviction code, and while I tried to keep it roughly the same as before, there's always a chance that I botched it for a specific workload (Lustre does appear to drive ZFS very differently than any ZPL workload I'm aware of). If there's some evidence that I broke the eviction algorithm, I'm more than happy to revisit that code with this specific workload in mind.
Are there any stats you want to collect to identify the ARC behavior in the next run?
@behlendorf We tried the proposed patch today and it didn't help. ARC stats without the multilist ARC: … With the multilist ARC: … There is only a 20% cache hit ratio in the latter case, while it was ~100% before the patch was introduced. Is there any further information we need to collect to identify the issue?
That's concerning. I need to think about this some more. Is the workload described in the comments above in any detail? If not, can you explain the workload to help me try to understand what might be going wrong? Is this a Lustre OSS or MDS? Do you think this behavior could be exhibited outside of Lustre; is there a ZPL workload that we could apply to reproduce this? At a glance, the demand data misses went from 0 to over 700K?! Yikes! I think it'd be great to get a historical log of the arcstats information from "before" and "after", e.g. … It should be possible to take those dumps and write a script to generate graphs for the various fields to see how the ARC is behaving over time. I've used that method to track down a misbehaving ARC in the past, although it can be time consuming and I don't have those graph generation scripts anymore (sigh, I knew I should have pushed to get those merged into the …). It might also be interesting to see which objects/blocks we're missing on, and which blocks are being evicted from the caches, although that might be more difficult to gather. I vaguely remember there being a ZoL-specific kstat that was introduced when I was at LLNL to gather info about the blocks that were being evicted, but I can't recall what that was called off the top of my head (and there might be a better way to get that information now, using eBPF or tracepoints). There's also the …
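For what it's worth, a minimal sketch of the kind of collection script being suggested here, assuming the standard ZoL kstat location /proc/spl/kstat/zfs/arcstats (the script name, field list, and CSV format are just illustrative choices):

```python
#!/usr/bin/env python3
# Sample a handful of arcstats fields once per interval and append them to a
# CSV, so the "before" and "after" ARC behavior can be graphed over time.
import csv
import sys
import time

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"
FIELDS = ["hits", "misses", "demand_data_hits", "demand_data_misses",
          "mru_ghost_hits", "mfu_ghost_hits", "evict_skip", "c", "size"]

def read_arcstats():
    stats = {}
    with open(ARCSTATS) as f:
        for line in f.readlines()[2:]:      # skip the two kstat header lines
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

def main(out_path="arcstats.csv", interval=1.0):
    with open(out_path, "a", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["timestamp"] + FIELDS)
        while True:
            stats = read_arcstats()
            writer.writerow([int(time.time())] + [stats.get(k, 0) for k in FIELDS])
            out.flush()
            time.sleep(interval)

if __name__ == "__main__":
    main(*sys.argv[1:2])
```

The resulting CSV can then be fed to any plotting tool to produce the per-field graphs described above.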
You can only use the dtrace probes if you've enabled debugfs and tracing in the kernel and have set the license to GPL in META. If you have, then … [EDIT: oops, that should be ">>" in the second one.] As to the actual problem being reported in this issue, I've got no idea at the moment and will have to read this thread in its entirety. @jxiong It might be worth running … I've posted this without having read the entire thread, so there's a chance I've missed something; I mainly wanted to point out how to use the tracepoints.
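The exact shell commands didn't survive here, but the workflow being described (enable the ZFS trace events, then stream trace_pipe into a log) looks roughly like the sketch below; the "zfs" event group name and the log path are assumptions to adjust for the local build:

```python
#!/usr/bin/env python3
# Enable the ZFS tracepoints (requires a tracing-enabled kernel, debugfs
# mounted, and a GPL-licensed build as noted above) and copy the live
# trace_pipe stream into a log file until interrupted.
import shutil
import sys

TRACING = "/sys/kernel/debug/tracing"

def enable_zfs_events():
    with open(f"{TRACING}/events/zfs/enable", "w") as f:
        f.write("1\n")

def capture(log_path):
    with open(f"{TRACING}/trace_pipe", "rb") as pipe, open(log_path, "ab") as log:
        try:
            shutil.copyfileobj(pipe, log)   # stream until Ctrl-C
        except KeyboardInterrupt:
            pass

if __name__ == "__main__":
    enable_zfs_events()
    capture(sys.argv[1] if len(sys.argv) > 1 else "zfs-trace.log")
```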
Ah, right! I almost forgot that we're converting the static dtrace probes to Linux tracepoints. I also need to read through all the comments in this thread in detail; I'll try and do that tomorrow (my brain needs some rest before I can think clearly enough to tackle this one).
@prakashsurya This is the SWL workload on Hyperion. The nodes in question were OSS nodes. We've never tried to reproduce it through the ZPL. I will try to collect the historical information as you mentioned and draw a graph, but it will take time. We collected arcstats for a good and a bad run, please take a look: iws20.arcstats.bad.txt … For the other Linux-specific … on ZoL, let's wait for @behlendorf's comment. @dweeezil The server load was measured by uptime; in this context, high server load means lots of threads are waiting for disk I/O to finish. I had run `perf top` but it seemed normal. There is a crashdump that was collected while this problem was happening at https://www.dropbox.com/s/23my4xnpahiibgk/zfs-reproduced.tar?dl=0, please take a look if you're interested.
BTW, does 'CACHE MISSES BY Demand Data' mean the data was not in the ARC when it was requested? And does the corresponding 'By Prefetch Data' mean the data was not in the cache when it was prefetched?
I believe so.
Those values correspond to the counters that are incremented in … As you can see, we have a … I'm looking over the arcstat files that you posted, and I'm still a little uncertain what to make out of them, but I think there's a lot of good information here (I just need to think a little bit more). So far, here's what strikes me as interesting. Let's look at a few particular fields that (I think) are helping me build a picture of what might be happening: … OK, so if we look at this, … Additionally, … If we look at the "good" case, … If we also look at … Just as expected, there are zero ghost list hits in the "good" case, and non-zero ghost list hits in the "bad" case. Additionally, we can see from the … that headers are being evicted from the ghost lists in the "bad" case (that field is incremented as headers roll off of the ghost lists). I think the question we need to answer is: with the … What do folks think? Does that sound reasonable? Am I overlooking anything?
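To make the comparison above easier to repeat, here is a small illustrative sketch that pulls the fields discussed out of the posted arcstats dumps and prints the derived ratios (the script name is hypothetical; the field names are the standard arcstats kstat names):

```python
#!/usr/bin/env python3
# Usage: python3 arcstats_summary.py iws20.arcstats.bad.txt [more files...]
# Prints the overall hit ratio, demand data misses, and ghost list activity.
import sys

def load(path):
    stats = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and parts[2].isdigit():
                stats[parts[0]] = int(parts[2])
    return stats

def summarize(stats):
    hits, misses = stats.get("hits", 0), stats.get("misses", 0)
    total = (hits + misses) or 1
    print(f"  overall hit ratio : {100.0 * hits / total:6.2f}%")
    print(f"  demand data misses: {stats.get('demand_data_misses', 0)} "
          f"(hits {stats.get('demand_data_hits', 0)})")
    print(f"  mru_ghost_hits    : {stats.get('mru_ghost_hits', 0)}")
    print(f"  mfu_ghost_hits    : {stats.get('mfu_ghost_hits', 0)}")
    print(f"  size / c (target) : {stats.get('size', 0)} / {stats.get('c', 0)}")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"== {path} ==")
        summarize(load(path))
```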
@prakashsurya I think you've definitely put your finger on the problem. Why are we evicting at all from the ARC in the "bad" case? We need to identify the call path where this is happening. I spent some time carefully reviewing the patch and the ARC code and I wasn't able to identify a probable cause. To understand why, I think we may need a clue from either an ftrace call graph or perhaps some additional debug kstats. The good news is that it looks like we can initially rule out memory pressure as the cause, not only for the reasons @prakashsurya mentioned but because all of the other memory-pressure-related kstats don't indicate any issue. That helps exclude some ZoL-specific changes in places like … There are two other notable related differences between illumos and Linux which we should rule out. Neither one looks obviously like the cause to me, but they're in roughly the right area and it may help us narrow down the issue or result in some new insight. Plus they're easy to test.
@jxiong I'm still hopeful that little Lustre prefetch patch might help after we resolve this much bigger problem!
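If the ftrace call graph route is taken, a rough starting point might look like the following; the arc_* symbol names are guesses and should be checked against /sys/kernel/debug/tracing/available_filter_functions for this particular ZFS build:

```python
#!/usr/bin/env python3
# Configure the function_graph tracer to capture the call paths leading into
# ARC eviction, so the offending caller shows up in trace_pipe.
TRACING = "/sys/kernel/debug/tracing"
CANDIDATE_SYMBOLS = ["arc_adjust", "arc_evict_state", "arc_prune_async"]

def write(path, value, mode="w"):
    with open(f"{TRACING}/{path}", mode) as f:
        f.write(value + "\n")

if __name__ == "__main__":
    write("current_tracer", "nop")               # reset any previous tracer
    for sym in CANDIDATE_SYMBOLS:
        write("set_graph_function", sym, "a")    # append each graph filter
    write("current_tracer", "function_graph")
    write("tracing_on", "1")
    print(f"call graph is streaming in {TRACING}/trace_pipe")
```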
@prakashsurya we collected the trace log as you mentioned, please check it here: https://www.dropbox.com/sh/nrdhrwoylk2c9c2/AACid-KDadAdi4gRf2R8Bfjwa?dl=0 Thanks. @behlendorf We tried to set the module option …
Haven't read all the comments, but could it be the case that "size" got high enough at one point to trigger eviction and then it fell back down afterwards?
iws20.arcstats.bad.txt shows: … This should mean that we never experienced memory pressure and therefore never evicted anything. However, we also see: … which should mean that we did evict some buffers due to memory pressure. So something here is not as it appears to be.
@thegreatgazoo after getting on Hyperion and running the SWL, I wasn't able to reproduce the poor ARC hit rate problem described above. With zfs-0.6.5.5 I was observing a hit rate of ~99%, which is more consistent with what we've seen historically. However, I did observe some other issues which could help explain what's going on. The first thing I looked at was … Digging a little deeper, we can look at the … The TXG history in … I manually decreased … My impression was that the ARC dirty size was pretty badly mismatched with the actual disk performance. It seems plausible that we were getting away with this for the most part up until the ARC multilist improvements were made; the increased concurrency and large number of Lustre threads may have pushed us over a tipping point with respect to the Lustre timeouts. I've left the SWL running with a 1G dirty size to see if we're able to reproduce the issue. If possible it would be great to let this run for a while. We most likely haven't seen this at Livermore for two reasons: 1) our storage hardware can deliver many more IOPs, so even large TXGs are written quickly, and 2) we've updated Lustre to not block on a txg sync in …
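For anyone wanting to repeat the TXG-history part of this analysis, a small sketch along these lines summarizes sync times and dirty data per txg; the /proc/spl/kstat/zfs/&lt;pool&gt;/txgs location is assumed, and the column names are read from the file's own header row rather than hard-coded:

```python
#!/usr/bin/env python3
# Summarize the per-pool TXG history kstat: average/max sync time and the
# average amount of dirty data per txg (times are reported in nanoseconds).
import sys

def summarize(pool):
    path = f"/proc/spl/kstat/zfs/{pool}/txgs"
    with open(path) as f:
        lines = [l.split() for l in f if l.strip()]
    header = next(l for l in lines if l and l[0] == "txg")
    rows = [dict(zip(header, l)) for l in lines[lines.index(header) + 1:]
            if len(l) == len(header)]
    synced = [r for r in rows if r.get("stime", "").isdigit() and int(r["stime"]) > 0]
    if not synced:
        print(f"{pool}: no completed txgs found")
        return
    stimes = [int(r["stime"]) / 1e9 for r in synced]
    dirty = [int(r["ndirty"]) for r in synced if r.get("ndirty", "").isdigit()]
    print(f"{pool}: {len(synced)} synced txgs, "
          f"avg sync {sum(stimes) / len(stimes):.2f}s, max sync {max(stimes):.2f}s, "
          f"avg dirty {sum(dirty) / max(len(dirty), 1) / 2**20:.1f} MiB")

if __name__ == "__main__":
    for pool in sys.argv[1:]:
        summarize(pool)
```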
@behlendorf just to make it clear, you couldn't see this problem after you lowered zfs_dirty_data_max? Probably the I/O was throttled by the ARC lock contention in previous versions. I would suggest investigating the performance difference between 0.6.4 and 0.6.5.5 with the new zfs_dirty_data_max setting. This is a piece of wonderful work - I tried to reduce the OSS workload by reducing the number of OSS threads, but it didn't help.
@jxiong correct, after I reduced zfs_dirty_data_max I could no longer reproduce the problem. It would be great if you could independently verify this fixes the issue.
We verified that the problem can't be reproduced after setting zfs_dirty_data_max to a lower value.
@jxiong awesome. I think we still want to completely explain why the delays injected by the throttle led to these timeouts.
@behlendorf Probably the injected delay worked and it caused Lustre I/O to take a really long time to complete, which eventually caused the I/O timeout.
Just a second thought: let's keep this issue open for at least 24 hours in case people in different time zones want to comment.
If we had a mechanism in the code to dynamically tune the maximum amount of dirty data, would that be the ideal way to solve this problem? And if so, how would it be tuned? E.g. maybe if the TXG takes longer than X seconds to sync, the maximum amount allowed is decreased, and if it takes less than X seconds to sync, the maximum amount is increased. At some point, the maximum should reach some equilibrium point and maintain roughly the same value, right? (This is a drastic oversimplification, but I think it gets the idea across.) If I'm understanding correctly, the "problem" is that TXG sync times sometimes take "too long", and by reducing the amount of dirty data that can accumulate, we ensure the TXG sync times are "sufficiently small" to prevent these timeouts? This sort of also feels like Lustre is to blame, in that it doesn't apply sufficient back-pressure to the clients. I.e. instead of telling the clients to slow down by some percentage based on how fast the server can write out the data (e.g. perhaps injecting small artificial delays into the clients' requests), the server allows clients to stream as much data as they can (as fast as they can) up until a timeout occurs. Simply telling ZFS to write less data every TXG feels like an ugly workaround to a problem that is more fundamental to the client/server architecture of Lustre. Obviously I don't have a stake in this, so feel free to ignore me as it pertains to fixing the timeouts; I'm just genuinely curious to learn more and/or more clearly understand the root cause, as this feels more like a systems architecture issue vs. the more mundane defects that usually crop up.
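As a purely illustrative toy model of the feedback loop sketched in the previous comment (not ZFS code; the target time, step sizes, and bounds are arbitrary choices), the cap shrinks multiplicatively when a txg syncs too slowly and creeps back up otherwise, settling near whatever the storage can sync within the target:

```python
# Toy dirty-data-cap controller: decrease the cap when txg sync exceeds the
# target time, increase it gently when syncs are comfortably fast.
def adjust_dirty_max(dirty_max, sync_seconds, target_seconds=5.0,
                     floor=256 << 20, ceiling=4 << 30):
    if sync_seconds > target_seconds:
        dirty_max = int(dirty_max * 0.75)       # back off multiplicatively
    elif sync_seconds < 0.5 * target_seconds:
        dirty_max += 64 << 20                   # creep back up additively
    return max(floor, min(ceiling, dirty_max))

# Example: a 4 GiB cap on storage that syncs roughly 200 MiB/s converges
# toward a cap the disks can actually write out in about the 5 second target.
cap = 4 << 30
sync_rate = (4 << 30) / 20.0                    # bytes per second
for _ in range(12):
    cap = adjust_dirty_max(cap, cap / sync_rate)
print(f"settled cap: {cap / 2**20:.0f} MiB")
```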
I was thinking along the same lines. This is something ZFS should be able to handle more gracefully; it's not at all clear to me that Lustre is behaving particularly badly. Having a large number of threads all writing may simply be able to overwhelm the default throttle settings. It would be interesting to see the behavior when 1000 threads are all writing through the ZPL. Just to throw an idea out there as a possible solution: we could start dynamically ratcheting up zfs_delay_scale … Other ideas are welcome; this was just the first thing which occurred to me and I haven't done a careful analysis. This feels to me like a systems architecture issue as well. The throttle shouldn't allow this to happen.
@behlendorf I agree that zfs_delay_scale is the right place to start. I'm pretty sure the default is not great for systems with high performance and a high number of threads calling zfs_write (or equivalent functions that create DMU transactions). You want to avoid the dirty data ever getting close to 100%. Of course, I have a bunch of dtrace scripts for monitoring this, which doesn't help you on Linux :-(
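For reference, the delay curve in question, as described in the write throttle comments in dmu_tx.c (the default constants shown here may differ between versions), grows steeply as dirty data approaches the cap, which is exactly why the dirty amount should never get close to 100%:

```python
# Per-transaction delay from the ZFS write throttle: zero below
# zfs_delay_min_dirty_percent of zfs_dirty_data_max, then
# zfs_delay_scale * (dirty - min) / (max - dirty) nanoseconds.
def tx_delay_ns(dirty, dirty_max, delay_scale=500_000, min_dirty_pct=60):
    min_dirty = dirty_max * min_dirty_pct // 100
    if dirty <= min_dirty:
        return 0
    if dirty >= dirty_max:
        return float("inf")
    return delay_scale * (dirty - min_dirty) // (dirty_max - dirty)

dirty_max = 1 << 30                              # e.g. a 1 GiB dirty data cap
for pct in (50, 70, 80, 90, 95, 99):
    dirty = dirty_max * pct // 100
    print(f"{pct:3d}% dirty -> {tx_delay_ns(dirty, dirty_max) / 1e6:7.3f} ms per tx")
```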
We might be able to adapt them to SystemTap or add something as a kstat. Specifically, what are you monitoring with the dtrace scripts?
@behlendorf Easiest to explain by showing. See https://gist.github.com/ahrens/ca0b0c1009e873c80d53467749085ec2 and https://gist.github.com/ahrens/7e110344c9732d22e143b48e8050923a
Recently we have been experiencing a timeout issue with a Lustre workload on all ZFS revisions after commit ca0bf58, 'Illumos 5497 - lock contention on arcs_mtx'. The symptom is that the server load becomes really high and then, after a while, a timeout happens. The timed-out thread always has the following backtrace: …
I collected a crashdump while this issue was happening; it is available at http://goo.gl/L52nf5, please take a look.
This is a snapshot of the server load when this issue is about to happen: …
There are many servers, and the issue is most likely to be seen on the busy ones.
And the output of vmstat on one of the busy servers: …
As you can see, 94% of CPU time is spent waiting for I/O. I'm not sure if this is normal, as we're running an I/O-intensive test case.
This is not a deadlock issue like the ones we have seen in other tickets. When this issue occurs, the corresponding clients, where the workload is generated, are evicted, and then the server can go back to normal again.
From the crashdump I mentioned above, there are lots of ZIO_PRIORITY_SYNC_READ ZIOs queued in the vdev. For example, this is one of the child ZIOs the server thread is waiting for: … And the status of the corresponding vdev: …
There were 507 ZIOs pending in the queue. I have seen more in the other crashdump.
I have checked that the last finished I/O was the first I/O in the active tree. It took more than 1 second to finish this I/O.
We didn't see this issue on zfs-0.6.4-93-gca0bf58~1. Please let me know if you want any further information, thanks.