All writers block on same hash table mutex hurting performance #238
As can be seen from the stack traces below, the writer threads are waiting for the mutex in open context (not syncing context). Stack trace of the writer thread that has acquired the hash table mutex:
Stack traces of the 7 writer threads that are waiting for the mutex acquired by the thread above:
This thread: is it still alive? Does it eventually finish the task and release the mutex? It's just that the stack looks exactly like that when you get …
There is no crash; it's just that write performance is significantly degraded. In perfmon, we see writes happening at around 10 MBps on the zvol for a couple of seconds, then it drops to 0 for a few seconds, then another burst of writes, then back to 0, and so on. I suspect the way locks are acquired is hurting throughput.
Not particularly familiar with this code, but all 8 threads go down from …
where we convert it: …
which is computed to the same value each time. Even though there are 8 threads writing to different locations, they are all blocked by one mutex due to: …
Since none of the arguments change, the result is computed to the same value each time. @ahrens Did I miss anything? Are there any mitigations for multiple writing threads?
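For context, here is a condensed sketch of how the bucket mutex gets chosen, modeled on OpenZFS's dbuf_hash()/DBUF_HASH; the helper and macro names here are illustrative, not the exact ZFSin source. Identical (os, obj, level, blkid) inputs always fold to the same CRC, hence the same bucket and the same mutex:

```c
/*
 * Illustrative sketch modeled on OpenZFS's dbuf_hash(): the bucket index
 * is a pure function of (objset, object, level, blkid), so callers that
 * pass identical arguments always contend on the same bucket mutex.
 */
static uint64_t
dbuf_hash_sketch(void *os, uint64_t obj, uint8_t lvl, uint64_t blkid)
{
	uintptr_t osv = (uintptr_t)os;
	uint64_t crc = -1ULL;

	/* Fold the four inputs into a CRC64, one byte at a time. */
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (osv >> 6)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 0)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 8)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (lvl >> 0)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 0)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 8)) & 0xFF];
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 16)) & 0xFF];
	return (crc);
}

/* Same inputs => same index => same mutex in h->hash_mutexes[]. */
#define	DBUF_HASH_IDX(h, os, obj, lvl, blkid) \
	(dbuf_hash_sketch(os, obj, lvl, blkid) & (h)->hash_table_mask)
```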
I would recommend using …
Unfortunately, the structs are different in Linux and Windows. Specifically, dnode_t* is missing from Windows' zvol_state_t. Thoughts?
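For illustration, the dnode-based approach would look roughly like the sketch below: hold the zvol's dnode once at open, reuse it for every write via dmu_write_by_dnode(), and release it at close. dnode_hold()/dnode_rele() and dmu_write_by_dnode() are real upstream APIs, but the zv_dn field and the helper names here are assumptions, not the actual ZFSin code:

```c
/* Sketch only: assumes zvol_state_t gains a dnode_t *zv_dn member. */

static int
zvol_hold_dnode(zvol_state_t *zv)
{
	/* ZVOL_OBJ is the fixed object number of a zvol's data object. */
	return (dnode_hold(zv->zv_objset, ZVOL_OBJ, zv, &zv->zv_dn));
}

static void
zvol_rele_dnode(zvol_state_t *zv)
{
	dnode_rele(zv->zv_dn, zv);
	zv->zv_dn = NULL;
}

/*
 * In the write path, dmu_write_by_dnode() replaces the dmu_write() call
 * that previously re-resolved ZVOL_OBJ through the dbuf hash table on
 * every request.
 */
static void
zvol_write_chunk(zvol_state_t *zv, uint64_t off, uint64_t bytes,
    const void *buf, dmu_tx_t *tx)
{
	dmu_write_by_dnode(zv->zv_dn, off, bytes, buf, tx);
}
```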
OK, it would look something like this (2 commits): https://github.com/openzfsonwindows/ZFSin/tree/bydnode/ZFSin Took the chance to clean things up again; we don't actually need … I've not had a chance to test it much.
Thanks @ahrens @lundman for the quick turnaround! Above is a snapshot with these changes. I still see writes dropping to 0 on the zvol (on the zpool the writes are still continuous), but those 0-write periods appear far less frequently than they did without this latest change. Overall, the change looks good.
@lundman I just picked up all your changes from the bydnode branch. Here are some observations:
… and 250 threads are in a blocked state with the stack trace below:
Does that give any clue as to what might have gone wrong? Stacks of all threads running zfsin code, if that helps:
Update: going by the discussion on the write amplification issue, I specified the ashift and volblocksize explicitly, and the write amplification does look under control. However, the write choppiness remains. I will break into the debugger after a few hours of Iometer writes and see what is happening with the thread counts and locking.
Ok, let's see. The … But more importantly, it should be user-selectable now, ie … Now, I've often wondered if there is some issue in the mutex/condvar code: could it be missing signals/broadcasts? Sometimes a thread that was supposed to have been signaled just sits there doing nothing until the next signal comes in. Is this something you have noticed?
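On the missed-signal theory: the usual defense in this codebase's style is that every cv_wait() sits in a loop that re-checks its predicate, so a lost cv_signal()/cv_broadcast() costs latency (the thread sleeps until the next signal) rather than correctness, which would match the "sits there doing nothing until the next one comes in" symptom. A hypothetical consumer loop, names invented for illustration:

```c
#include <sys/mutex.h>
#include <sys/condvar.h>

static kmutex_t work_lock;
static kcondvar_t work_cv;
static int work_available;

/*
 * Hypothetical consumer, illustration only.  Because the predicate is
 * re-checked in a loop around cv_wait(), a lost cv_signal() or
 * cv_broadcast() only delays this thread until the next signal arrives;
 * it never consumes a task it wasn't handed.
 */
static void
worker_wait_for_task(void)
{
	mutex_enter(&work_lock);
	while (!work_available)
		cv_wait(&work_cv, &work_lock);
	work_available = 0;	/* take the task */
	mutex_exit(&work_lock);
}
```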
ZFS spawns a bunch of taskq threads that idle waiting for something to do: … and when they are given a task: …

The taskq "number of threads" is based on a bunch of things: number of cores, writers, each pool has a number, each dataset, etc. But it is worth noting those values are all "Solaris" defaults; I have not yet looked at what tweaks Windows might want. We might be over-creating threads, but Windows does seem pretty decent at threads.

One txg will fill with data until it hits the per-txg limit, then it will wait for it to quiesce and for spa_sync to complete, then it starts again. The txg limits have many knobs you can tune and tweak (defaults sketched below): https://www.delphix.com/blog/delphix-engineering/zfs-fundamentals-transaction-groups

I'm no expert with those tunables - so experiment and report what you find. There is also a write throttle, which I believe was added so there is always a little room and other datasets are not starved out. Are we making progress at least?
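For reference, here are the knobs behind that fill/quiesce/sync cycle, with upstream defaults as they appear in OpenZFS-derived code of this vintage (the ZFSin copies may differ):

```c
/* Upstream defaults; the ZFSin values may differ. */
int zfs_txg_timeout = 5;		 /* force a txg at least every 5 s */
uint64_t zfs_dirty_data_max = 0;	 /* set at boot: ~10% of RAM, capped */
uint64_t zfs_dirty_data_sync = 64 << 20; /* start syncing a txg early */
int zfs_delay_min_dirty_percent = 60;	 /* write throttle kicks in here */
```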
Compression seems to be happening although I didn't turn it on (I am using default settings). Is this expected?
What is the downside of using the "standard" setting for sync? Can there be data loss?
Yes, up to one txg. The pool will always be consistent (pointing either to txg-1 data or to txg data, since the uberblock is updated last), but the last 5 s of writes may not be there.
OK, thanks for clarifying. Is one txg always 5 s worth of data, or can it be flushed before 5 s if it hits a size limit?
Take a look at ZFSin/ZFSin/zfs/module/zfs/dsl_pool.c line 111 (commit 6304af8) and ZFSin/ZFSin/zfs/module/zfs/dmu_tx.c line 755 (commit 0766fda).
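Those two references point at the size-based trigger: a txg can indeed close before the 5 s timeout once the open txg's dirty data crosses zfs_dirty_data_sync. A condensed copy of the OpenZFS logic of that era (illustrative; the exact ZFSin lines may differ):

```c
/*
 * Condensed from OpenZFS-era dsl_pool.c.  txg_kick() closes the open
 * txg early, without waiting for the zfs_txg_timeout (5 s) clock, and
 * writers start being delayed once dirty data passes the
 * zfs_delay_min_dirty_percent threshold.
 */
boolean_t
dsl_pool_need_dirty_delay(dsl_pool_t *dp)
{
	uint64_t delay_min_bytes =
	    zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100;
	boolean_t rv;

	mutex_enter(&dp->dp_lock);
	if (dp->dp_dirty_total > zfs_dirty_data_sync)
		txg_kick(dp);	/* close the txg before the timeout */
	rv = (dp->dp_dirty_total > delay_min_bytes);
	mutex_exit(&dp->dp_lock);
	return (rv);
}
```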
@lundman Could you also comment on the "compression" seen in the stack trace? I will put together all my findings based on the above comments and let you know. We are making good progress.
It would be interesting to know if …
Right now, I have the debugger attached and the target paused for some investigation. What can I inspect through the debugger to find out whether compression is on or off?
there's no rush - if it's doing lz4 when it shouldn't, we'll come across it again :)
@lundman Glad to report that I am not seeing writes choking on physical (512-byte sector size) hard drives attached to a VM. I have been running the Iometer workload for more than 10 hours. Here's my zpool/zvol config:
Looking forward to merging your change into our fork once it is merged upstream.
Attachment: threads.txt
I see a performance issue with all writers trying to acquire the same mutex in dbuf_hash_table while writing to a zvol (RAW disk). I am using Iometer to generate the workload.
During live debugging, I found 8 threads executing dbuf_find (as part of zvol_write_win), of which 7 are waiting for the mutex held by the 8th thread. The output of !stacks 2 zfsin is attached.
Here's the problem: all 8 threads pass the same arguments to dbuf_find:
os = 0xffffc78f`a4734180
and the remaining three arguments (object, level, blkid) are all 0,
which causes them all to map to the same mutex within the hash table.
Is this expected and unavoidable for zvol writes? Are there ways to improve it?
Below is my setup:
VM configuration:
4 vCPU
8 GB RAM
zvol settings:
8 GB (thick provisioned)
Everything is default: dedup=off, compression=off, ...
Iometer settings:
4 workers
Data pattern = full random
Number of outstanding I/Os = 64
Access spec = 4 KiB; 0% read; 100% random
Thanks,
Imtiaz