
thread hung in txg_wait_open() forever in D state #3064

Closed
thegreatgazoo opened this issue Jan 31, 2015 · 4 comments
Labels
Status: Inactive (Not being actively updated) · Status: Stale (No recent activity for issue) · Status: Understood (The root cause of the issue is known) · Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@thegreatgazoo

We run into this hang quite often with SPL/ZFS v0.6.3-1.2 (DEBUG mode). A thread hung in txg_wait_open() in D state and never recovered, and it appeared the thread was actually running in D state rather than sleeping: the kernel hung-task watcher never warned about it, and top showed its CPU time growing.

PID: 21277  TASK: ffff88006130c040  CPU: 0   COMMAND: "ll_ost00_007"
 #0 [ffff88003ed5b880] schedule at ffffffff815296a0
 #1 [ffff88003ed5b948] cv_wait_common at ffffffffa011b415 [spl]
 #2 [ffff88003ed5b9c8] __cv_wait at ffffffffa011b495 [spl]
 #3 [ffff88003ed5b9d8] txg_wait_open at ffffffffa02215e3 [zfs]
 #4 [ffff88003ed5ba18] dmu_tx_wait at ffffffffa01e4939 [zfs]
 #5 [ffff88003ed5ba78] dmu_tx_assign at ffffffffa01e49e9 [zfs]
 #6 [ffff88003ed5bb28] osd_trans_start at ffffffffa0dd45ad [osd_zfs]
 #7 [ffff88003ed5bb58] ofd_trans_start at ffffffffa0f1f07c [ofd]
 #8 [ffff88003ed5bb88] ofd_object_destroy at ffffffffa0f21530 [ofd]
 #9 [ffff88003ed5bbd8] ofd_destroy_by_fid at ffffffffa0f1b79d [ofd]
#10 [ffff88003ed5bcd8] ofd_destroy_hdl at ffffffffa0f150ea [ofd]

The txg_sync/txg_quiesce threads seemed OK, busy alternating between S/D and R states. But the pool state seemed quite screwed up. The TXG # increased by 2186 in just 1 second:

cat lustre-ost1/txgs; sleep 1; cat lustre-ost1/txgs
469 0 0x01 3 336 31269437119932 94328219768999
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
504280372 94327859132340   S     0            0            0            0        0        52962        5155         34018        0           
504280373 94327859185302   W     0            0            0            0        0        42311        5308         0            0           
504280374 94327859227613   O     0            0            0            0        0        0            0            0            0           
469 0 0x01 3 336 31269437119932 94329224682995
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
504282558 94329223042758   S     0            0            0            0        0        819957       7673         68244        0           
504282559 94329223862715   W     0            0            0            0        0        119631       5697         0            0           
504282560 94329223982346   O     0            0            0            0        0        0            0            0            0

The read/write tests were able to continue OK, i.e. no later calls to dmu_tx_assign() hung, until we tried to umount the dataset, which hung as well.

This is easily reproducible and I have a crashdump available. Please let me know if any debug information is needed.

@behlendorf
Contributor

This was determined to be caused by a dmu_tx_hold_free() reserving a large amount of memory for the ARC in a single TX. This resulted in the TX never being able to be assigned and the thread looping in dmu_tx_assign(). Further details are available in the Lustre Jira issue:

https://jira.hpdd.intel.com/browse/LU-5242

Some work may be needed in dmu_tx_hold_free() to reduce the worst-case estimate which caused the issue. This needs to be investigated.
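
For context, DMU consumers normally assign a TX with a retry loop along the lines of the sketch below (a simplified sketch of the common ZFS pattern, not the actual osd-zfs code): when dmu_tx_assign() fails with ERESTART, the caller waits in dmu_tx_wait(), aborts the TX, and retries. If the reservation can never be satisfied, the loop spins forever, which matches the running-in-D-state thread above.

top:
	tx = dmu_tx_create(os);
	dmu_tx_hold_free(tx, object, 0, DMU_OBJECT_END);
	error = dmu_tx_assign(tx, TXG_NOWAIT);
	if (error != 0) {
		if (error == ERESTART) {
			/* TX could not be assigned to an open TXG */
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;	/* spins forever if the hold can never fit */
		}
		dmu_tx_abort(tx);
		return (error);
	}
	/* ... do the work, then dmu_tx_commit(tx) ... */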

@thegreatgazoo
Author

In arc_tempreserve_space() where ERESTART was returned:

arc_tempreserve_space(uint64_t reserve, uint64_t txg)
{
	int error;
	uint64_t anon_size;

	if (reserve > arc_c/4 && !arc_no_grow)
		arc_c = MIN(arc_c_max, reserve * 4);

	/*
	 * Throttle when the calculated memory footprint for the TXG
	 * exceeds the target ARC size.
	 */
	if (reserve > arc_c) {
		DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);
		return (SET_ERROR(ERESTART));
	}

The condition would test true only when "arc_no_grow || reserve > arc_c_max". In either case I don't think ERESTART is the correct return code: a retry would NOT make any difference unless an admin showed up and bumped zfs_arc_max above reserve. In the Lustre test, the TX was retried more than 1012288401 times to no avail.

So I believe in this case a hard error should be returned to the caller of dmu_tx_assign().
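
A minimal sketch of that suggestion (hypothetical, not a tested patch; ENOMEM is just a placeholder for whatever hard error is chosen): only return ERESTART when a retry could actually succeed, and fail hard when growing the ARC cannot possibly cover the reservation.

	/* Hypothetical sketch of the suggested change, not a tested patch. */
	if (reserve > arc_c) {
		DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);
		if (arc_no_grow || reserve > arc_c_max)
			return (SET_ERROR(ENOMEM));	/* retrying cannot help */
		return (SET_ERROR(ERESTART));		/* retrying may succeed */
	}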

@thegreatgazoo
Author

It seemed dmu_tx_count_free() estimated about 1G memory overhead for this sparse object:

# zdb -e -ddddd lustre-ost2/ost2 195
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
       195    5    16K   128K  1.01M  1.00T    0.00  ZFS plain file
Indirect blocks:
               0 L4     0:54a00:400 4000L/400P F=8 B=513/513
     10000000000  L3    0:54600:400 4000L/400P F=8 B=513/513
     10000000000   L2   0:54200:400 4000L/400P F=8 B=513/513
     10000000000    L1  0:4fe00:400 4000L/400P F=8 B=513/513
     10000000000     L0 0:76aa00:20000 20000L/20000P F=1 B=513/513
     10000020000     L0 0:78aa00:20000 20000L/20000P F=1 B=513/513
     10000040000     L0 0:7aaa00:20000 20000L/20000P F=1 B=513/513
     10000060000     L0 0:7caa00:20000 20000L/20000P F=1 B=513/513
     10000080000     L0 0:7eaa00:20000 20000L/20000P F=1 B=513/513
     100000a0000     L0 0:80aa00:20000 20000L/20000P F=1 B=513/513
     100000c0000     L0 0:82aa00:20000 20000L/20000P F=1 B=513/513
     100000e0000     L0 0:84aa00:20000 20000L/20000P F=1 B=513/513

                segment [0000010000000000, 0000010000100000) size    1M

There's only 1 L1 block, and my understanding is that there can't be more level-N blocks than level N-1 blocks, so without scanning them all we know there can be at most DN_MAX_LEVELS indirect blocks. The 1G overhead is a gross overestimate.
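
For what it's worth, the ~1G figure is consistent with charging for the worst-case number of L1 indirect blocks across the freed range of this 1T file, rather than the handful of blocks that actually exist (my reading; rough arithmetic below, using the iblk=16K and dblk=128K values from the zdb output):

    lsize                      = 1 TiB = 2^40 bytes
    possible L0 blocks         = 2^40 / 128 KiB = 8M
    blkptrs per 16K indirect   = 16 KiB / 128 B = 128
    worst-case L1 blocks       = 8M / 128 = 64K
    worst-case L1 memory       = 64K * 16 KiB = 1 GiB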

Also, once the TX was delayed, the TXGs began to move forward very fast, at about 5141 TXGs per second:

# cat lustre-ost1/txgs; sleep 1; cat lustre-ost1/txgs 
25 0 0x01 3 336 244675310220 123698699004892
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
1010319525 123698585210993  S     0            0            0            0        0        84246        3380         33414        0           
1010319526 123698585295239  W     0            0            0            0        0        36794        2496         0            0           
1010319527 123698585332033  O     0            0            0            0        0        0            0            0            0           
25 0 0x01 3 336 244675310220 123699705972680
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
1010324666 123699503872529  S     0            0            0            0        0        111328       5851         53035        0           
1010324667 123699503983857  W     0            0            0            0        0        59840        6406         0            0           
1010324668 123699504043697  O     0            0            0            0        0        0            0            0            0

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 25, 2020
@stale stale bot closed this as completed Nov 23, 2020