-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Crash in dbuf_evict_one with DTRACE_PROBE #6463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Looks good, you just need to address the style issue. |
|
Does this mean we can't trace https://github.com/openzfs/openzfs/blob/master/usr/src/uts/common/fs/zfs/dbuf.c#L476 /*
* Evict the oldest eligible dbuf from the dbuf cache.
*/
static void
dbuf_evict_one(void)
{
int idx = multilist_get_random_index(dbuf_cache);
multilist_sublist_t *mls = multilist_sublist_lock(dbuf_cache, idx);
ASSERT(!MUTEX_HELD(&dbuf_evict_lock));
/*
* Set the thread's tsd to indicate that it's processing evictions.
* Once a thread stops evicting from the dbuf cache it will
* reset its tsd to NULL.
*/
ASSERT3P(tsd_get(zfs_dbuf_evict_key), ==, NULL);
(void) tsd_set(zfs_dbuf_evict_key, (void *)B_TRUE);
dmu_buf_impl_t *db = multilist_sublist_tail(mls);
while (db != NULL && mutex_tryenter(&db->db_mtx) == 0) {
db = multilist_sublist_prev(mls, db);
}
DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
multilist_sublist_t *, mls);
if (db != NULL) {
multilist_sublist_remove(mls, db);
multilist_sublist_unlock(mls);
(void) refcount_remove_many(&dbuf_cache_size,
db->db.db_size, db);
dbuf_destroy(db);
} else {
multilist_sublist_unlock(mls);
}
(void) tsd_set(zfs_dbuf_evict_key, NULL);
} |
|
@loli10K we can, but the probe in |
|
@behlendorf Agreed. I personally don't like this change the way it is, unless it's a high priority fix can we implement this properly? |
|
Sure, I agree that would be better. This isn't critical, no user will be able to hit this unless they're rolling their own build to enable tracepoints. |
behlendorf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this does work let's update the tracepoint itself to handle NULL.
|
@behlendorf @loli10K Thanks for feedback..i will push a new patch soon |
include/sys/trace.h
Outdated
| #undef DTRACE_PROBE2 | ||
| #define DTRACE_PROBE2(name, t1, arg1, t2, arg2) \ | ||
| trace_zfs_##name((arg1), (arg2)) | ||
| if (arg1 && arg2) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm new to Linux Tracepoints so apologies if i'm asking something that seems unreasonable, but i was thinking we could do something like this:
root@linux:~# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1435/1435 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
dbuf_evict-1422 [006] .... 956.984961: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 0 object 63 level 0 blkid 0 offset 0 size 1024 state 4 holds 0 }
dbuf_evict-1422 [006] .... 956.984980: zfs_dbuf__evict__one: dbuf { NULL }
dbuf_evict-1422 [006] .... 956.984992: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 0 object 57 level 0 blkid 0 offset 0 size 16384 state 4 holds 0 }
dbuf_evict-1422 [006] .... 956.985003: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 0 object 60 level 0 blkid 0 offset 0 size 16384 state 4 holds 0 }
dbuf_evict-1422 [006] .... 956.985013: zfs_dbuf__evict__one: dbuf { NULL }
dbuf_evict-1422 [006] .... 956.985028: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 51 object 1 level 0 blkid 0 offset 0 size 512 state 4 holds 0 }
dbuf_evict-1422 [006] .... 956.985037: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 51 object 2 level 0 blkid 5 offset 655360 size 131072 state 4 holds 0 }
If i read the code correctly we are just skipping the Tracepoint when at least one parameter is NULL, which kind of defeat the purpose of keeping DTRACE_PROBE2(dbuf__evict__one) outside the if (db != NULL).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@loli10K how did u generate the above output...Are the NULL entries manually updated to denote the expected results? Talking about Linux tracepoints, I have not encountered any instance in Linux code where the args passed to the tracing function are NULL. Moreover, based on my understanding of Linux tracepoints and the API it exposes, there is no way to handle NULL without making changing to the tracing functions. As far as i know its the responsibility of the caller to make sure args are non-null just as any other Linux logging infra (printk etc)
With respect to the new patch, the intent is to have the developer not worry about NULL args, otherwise, i would follow the same semantics followed in other places which is to trace only with Non-NULL args.
But if still, we want the results to look like you mentioned, we can probably, create a dummy tracepoint, that just prints NULL.
Let me know what you think and accordingly i can modify the patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the NULL entries manually updated to denote the expected results?
Exactly, it was just to show what i expected from this change.
I'm not sure if the code was originally implemented this way so we could trace a NULL dbuf (which seems would only happen when the dbuf cache is already empty), other platforms share this same code so i doubt it was done this way by chance.
Someone with better understanding of dbuf eviction would probably know, my concern is to minimize code differences with other implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a good LWN article describing how to use tracepoints which I'd recommend. It provides a nice summary including some examples of what's possible.
https://lwn.net/Articles/379903/
After a little digging I could not find an example of a case where a tracepoint argument is NULL. Usually this is handled by defining another tracepoint which only takes the required number of arguments. For example.
diff --git a/module/zfs/dbuf.c b/module/zfs/dbuf.c
index dc2c004..eea209b 100644
--- a/module/zfs/dbuf.c
+++ b/module/zfs/dbuf.c
@@ -508,16 +508,19 @@ dbuf_evict_one(void)
db = multilist_sublist_prev(mls, db);
}
- DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
- multilist_sublist_t *, mls);
-
if (db != NULL) {
+ DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
+ multilist_sublist_t *, mls);
+
multilist_sublist_remove(mls, db);
multilist_sublist_unlock(mls);
(void) refcount_remove_many(&dbuf_cache_size,
db->db.db_size, db);
dbuf_destroy(db);
} else {
+ DTRACE_PROBE1(dbuf__evict__one_mls,
+ multilist_sublist_t *, mls);
+
multilist_sublist_unlock(mls);
}
(void) tsd_set(zfs_dbuf_evict_key, NULL);That may be what we have to do here even if it means diverging slightly from upstream. It might be possible to use the same conditional logic for printing custom fields on the arguments themselves in the dbuf__evict__one definition but I haven't tried if that works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@behlendorf ....today even though we use DTRACE_PROBE2 and pass two arguments db and mls, we actually are using only the db arg to log various information in the trace. mls arg is ignored, so, question is do we want to still use DTRACE_PROBE1 passing mls as the only arg and if so, then do we want to log NULL in output or extract sth useful from the multilist arg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be possible to use the same conditional logic for printing custom fields on the arguments themselves in the dbuf__evict__one definition but I haven't tried if that works.
The following patch should allow us to keep tracepoints filtering capabilities while also being able to conditionally change the print format used in TP_printk(). On the other hand this will also increase the probe-effect since we will always have to snprintf() in TP_fast_assign() even for events that could potentially be filtered out.
diff --git a/include/sys/trace_dbuf.h b/include/sys/trace_dbuf.h
index edaa77f..f6e8b96 100644
--- a/include/sys/trace_dbuf.h
+++ b/include/sys/trace_dbuf.h
@@ -41,9 +41,22 @@
* zio_t *, ...);
*/
+#ifndef TRACE_DBUF_MSG_MAX
+#define TRACE_DBUF_MSG_MAX 255
+#endif
+
+#define DBUF_TP_PRINTK_FMT \
+ "dbuf { spa \"%s\" objset %llu object %llu level %llu " \
+ "blkid %llu offset %llu size %llu state %llu holds %lld }"
+
+#define DBUF_TP_PRINTK_ARGS \
+ __get_str(os_spa), __entry->ds_object, \
+ __entry->db_object, __entry->db_level, \
+ __entry->db_blkid, __entry->db_offset, \
+ __entry->db_size, __entry->db_state, __entry->db_holds
+
#define DBUF_TP_STRUCT_ENTRY \
- __string(os_spa, \
- spa_name(DB_DNODE(db)->dn_objset->os_spa)) \
+ __dynamic_array(char, os_spa, TRACE_DBUF_MSG_MAX) \
__field(uint64_t, ds_object) \
__field(uint64_t, db_object) \
__field(uint64_t, db_level) \
@@ -52,31 +65,37 @@
__field(uint64_t, db_size) \
__field(uint64_t, db_state) \
__field(int64_t, db_holds) \
+ __dynamic_array(char, msg, TRACE_DBUF_MSG_MAX)
#define DBUF_TP_FAST_ASSIGN \
- __assign_str(os_spa, \
- spa_name(DB_DNODE(db)->dn_objset->os_spa)); \
- \
- __entry->ds_object = db->db_objset->os_dsl_dataset ? \
- db->db_objset->os_dsl_dataset->ds_object : 0; \
- \
- __entry->db_object = db->db.db_object; \
- __entry->db_level = db->db_level; \
- __entry->db_blkid = db->db_blkid; \
- __entry->db_offset = db->db.db_offset; \
- __entry->db_size = db->db.db_size; \
- __entry->db_state = db->db_state; \
- __entry->db_holds = refcount_count(&db->db_holds);
-
-#define DBUF_TP_PRINTK_FMT \
- "dbuf { spa \"%s\" objset %llu object %llu level %llu " \
- "blkid %llu offset %llu size %llu state %llu holds %lld }"
+ if (db != NULL) { \
+ __assign_str(os_spa, \
+ spa_name(DB_DNODE(db)->dn_objset->os_spa)); \
+ __entry->ds_object = db->db_objset->os_dsl_dataset ? \
+ db->db_objset->os_dsl_dataset->ds_object : 0; \
+ __entry->db_object = db->db.db_object; \
+ __entry->db_level = db->db_level; \
+ __entry->db_blkid = db->db_blkid; \
+ __entry->db_offset = db->db.db_offset; \
+ __entry->db_size = db->db.db_size; \
+ __entry->db_state = db->db_state; \
+ __entry->db_holds = refcount_count(&db->db_holds); \
+ snprintf(__get_str(msg), TRACE_DBUF_MSG_MAX, \
+ DBUF_TP_PRINTK_FMT, DBUF_TP_PRINTK_ARGS); \
+ } else { \
+ __assign_str(os_spa, "NULL") \
+ __entry->ds_object = 0; \
+ __entry->db_object = 0; \
+ __entry->db_level = 0; \
+ __entry->db_blkid = 0; \
+ __entry->db_offset = 0; \
+ __entry->db_size = 0; \
+ __entry->db_state = 0; \
+ __entry->db_holds = 0; \
+ snprintf(__get_str(msg), TRACE_DBUF_MSG_MAX, \
+ "dbuf { NULL }"); \
+ }
-#define DBUF_TP_PRINTK_ARGS \
- __get_str(os_spa), __entry->ds_object, \
- __entry->db_object, __entry->db_level, \
- __entry->db_blkid, __entry->db_offset, \
- __entry->db_size, __entry->db_state, __entry->db_holds
/* BEGIN CSTYLED */
DECLARE_EVENT_CLASS(zfs_dbuf_class,
@@ -84,7 +103,7 @@ DECLARE_EVENT_CLASS(zfs_dbuf_class,
TP_ARGS(db, zio),
TP_STRUCT__entry(DBUF_TP_STRUCT_ENTRY),
TP_fast_assign(DBUF_TP_FAST_ASSIGN),
- TP_printk(DBUF_TP_PRINTK_FMT, DBUF_TP_PRINTK_ARGS)
+ TP_printk("%s", __get_str(msg))
);
/* END CSTYLED */
@@ -102,7 +121,7 @@ DECLARE_EVENT_CLASS(zfs_dbuf_evict_one_class,
TP_ARGS(db, mls),
TP_STRUCT__entry(DBUF_TP_STRUCT_ENTRY),
TP_fast_assign(DBUF_TP_FAST_ASSIGN),
- TP_printk(DBUF_TP_PRINTK_FMT, DBUF_TP_PRINTK_ARGS)
+ TP_printk("%s", __get_str(msg))
);
/* END CSTYLED */
This results in:
root@linux:/usr/src/zfs# echo 0 > /sys/kernel/debug/tracing/trace
root@linux:/usr/src/zfs# echo 'os_spa == "NULL"' > /sys/kernel/debug/tracing/events/zfs/zfs_dbuf__evict__one/filter
root@linux:/usr/src/zfs# echo 1 > /sys/kernel/debug/tracing/events/zfs/zfs_dbuf__evict__one/enable
root@linux:/usr/src/zfs# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 5/5 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
dbuf_evict-14559 [004] .... 8730.442979: zfs_dbuf__evict__one: dbuf { NULL }
dbuf_evict-14559 [006] .... 8762.791772: zfs_dbuf__evict__one: dbuf { NULL }
dbuf_evict-14559 [006] .... 8762.973285: zfs_dbuf__evict__one: dbuf { NULL }
dbuf_evict-14559 [006] .... 8762.989567: zfs_dbuf__evict__one: dbuf { NULL }
dbuf_evict-14559 [006] .... 8762.989959: zfs_dbuf__evict__one: dbuf { NULL }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@loli10K Thanks for the patch. Just summarizing options we have here:
- Do not log an entry if there is no dbuf that can be evicted. This can also be achieved via conditional tracepoint. Looks like we are not in favor of it.
- Call different trace function for Null vs Non-Null dbufs. Again not preferable.
- Have dummy value assigned to dbuf when it is NULL so that underlying infra doesn't complain/crash. With this, the output can be sth like:
dbuf_evict-1422 [006] .... 956.984961: zfs_dbuf__evict__one: dbuf NULL{ spa "NULL" objset 0 object 0 level 0 blkid 0 offset 0 size 0 state 0 holds 0 } or
dbuf_evict-1422 [006] .... 956.984961: zfs_dbuf__evict__one: dbuf NULL{ spa "NULL" objset ? object ? level ? blkid ? offset ? size ? state ? holds ? }
In this case we just end up logging some useless information. - The above patch as suggested by @loli10K that takes care of different format string at expense of snprintf. I do see similar approach being followed in multiple places in Linux kernel. Even though it involves an additional copy, it may still be ok. If we don't want that then option 3 looks to be the more favorable one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with either 3 and 4 as long as option 3 takes care of the extra "dummy value" in include/sys/trace_dbuf.h and not in dbuf_evict_one().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 4th option would be my preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@behlendorf @loli10K uploaded the pull request for the patch suggested by @loli10K ...The only additional change is the size of the buffer is increased to 512 bytes. The reason for doing it that the spa name itself can be up to 256 bytes. We ourselves are using bigger names for spa. Sample o/p:
dbuf_evict-4115 [005] .... 460.018195: zfs_dbuf__evict__one: dbuf { spa "zpool-XOXO-WT-e1c0d782-ca7f-4916-bf68-8fc0650a1172-fd66d34a-43c1-44de-8970-5e659863b235" objset 91 object 4381207 level 0 blkid 0 offset 0 size 50688 state 4 holds 0 }
dbuf_evict-4115 [003] .... 1394.738505: zfs_dbuf__evict__one: dbuf { NULL }
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
loli10K
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM
Update the dbuf__evict__one() tracepoint so that it can safely handle a NULL dmu_buf_impl_t pointer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: gaurkuma <gauravk.18@gmail.com> Closes #6463
Update the dbuf__evict__one() tracepoint so that it can safely handle a NULL dmu_buf_impl_t pointer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: gaurkuma <gauravk.18@gmail.com> Closes openzfs#6463
The change takes care of checking if NULL args are passed to DTRACE_PROBEx and if so, then skip calling the tracing utility.
Description
Motivation and Context
The system will crash if a NULL arg is passed in DTRACE_PROBE call.
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by.