Skip to content

Conversation

@gaurkuma
Copy link
Contributor

@gaurkuma gaurkuma commented Aug 4, 2017

The change takes care of checking if NULL args are passed to DTRACE_PROBEx and if so, then skip calling the tracing utility.

Description

Motivation and Context

The system will crash if a NULL arg is passed in DTRACE_PROBE call.

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@behlendorf
Copy link
Contributor

Looks good, you just need to address the style issue.

@loli10K
Copy link
Contributor

loli10K commented Aug 4, 2017

Does this mean we can't trace NULL args? Is this Linux specific or does this also affect Illumos?

https://github.com/openzfs/openzfs/blob/master/usr/src/uts/common/fs/zfs/dbuf.c#L476

/*
 * Evict the oldest eligible dbuf from the dbuf cache.
 */
static void
dbuf_evict_one(void)
{
	int idx = multilist_get_random_index(dbuf_cache);
	multilist_sublist_t *mls = multilist_sublist_lock(dbuf_cache, idx);

	ASSERT(!MUTEX_HELD(&dbuf_evict_lock));

	/*
	 * Set the thread's tsd to indicate that it's processing evictions.
	 * Once a thread stops evicting from the dbuf cache it will
	 * reset its tsd to NULL.
	 */
	ASSERT3P(tsd_get(zfs_dbuf_evict_key), ==, NULL);
	(void) tsd_set(zfs_dbuf_evict_key, (void *)B_TRUE);

	dmu_buf_impl_t *db = multilist_sublist_tail(mls);
	while (db != NULL && mutex_tryenter(&db->db_mtx) == 0) {
		db = multilist_sublist_prev(mls, db);
	}

	DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
	    multilist_sublist_t *, mls);

	if (db != NULL) {
		multilist_sublist_remove(mls, db);
		multilist_sublist_unlock(mls);
		(void) refcount_remove_many(&dbuf_cache_size,
		    db->db.db_size, db);
		dbuf_destroy(db);
	} else {
		multilist_sublist_unlock(mls);
	}
	(void) tsd_set(zfs_dbuf_evict_key, NULL);
}

@behlendorf
Copy link
Contributor

@loli10K we can, but the probe in trace_dbuf.h needs to be updated to understand that a NULL dbuf is a possibility. That would be a better fix.

@loli10K
Copy link
Contributor

loli10K commented Aug 4, 2017

@behlendorf Agreed. I personally don't like this change the way it is, unless it's a high priority fix can we implement this properly?

@behlendorf
Copy link
Contributor

Sure, I agree that would be better. This isn't critical, no user will be able to hit this unless they're rolling their own build to enable tracepoints.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this does work let's update the tracepoint itself to handle NULL.

@gaurkuma
Copy link
Contributor Author

gaurkuma commented Aug 4, 2017

@behlendorf @loli10K Thanks for feedback..i will push a new patch soon

@gaurkuma gaurkuma changed the title Fixing crash in dbuf_evict_one when DTRACE_PROBE is enabled Crash in dbuf_evict_one when DTRACE_PROBE is on Aug 5, 2017
@gaurkuma gaurkuma changed the title Crash in dbuf_evict_one when DTRACE_PROBE is on Crash in dbuf_evict_one with DTRACE_PROBE Aug 5, 2017
#undef DTRACE_PROBE2
#define DTRACE_PROBE2(name, t1, arg1, t2, arg2) \
trace_zfs_##name((arg1), (arg2))
if (arg1 && arg2) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm new to Linux Tracepoints so apologies if i'm asking something that seems unreasonable, but i was thinking we could do something like this:

root@linux:~# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1435/1435   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
 dbuf_evict-1422  [006] ....   956.984961: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 0 object 63 level 0 blkid 0 offset 0 size 1024 state 4 holds 0 }
 dbuf_evict-1422  [006] ....   956.984980: zfs_dbuf__evict__one: dbuf { NULL }
 dbuf_evict-1422  [006] ....   956.984992: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 0 object 57 level 0 blkid 0 offset 0 size 16384 state 4 holds 0 }
 dbuf_evict-1422  [006] ....   956.985003: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 0 object 60 level 0 blkid 0 offset 0 size 16384 state 4 holds 0 }
 dbuf_evict-1422  [006] ....   956.985013: zfs_dbuf__evict__one: dbuf { NULL }
 dbuf_evict-1422  [006] ....   956.985028: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 51 object 1 level 0 blkid 0 offset 0 size 512 state 4 holds 0 }
 dbuf_evict-1422  [006] ....   956.985037: zfs_dbuf__evict__one: dbuf { spa "poolname" objset 51 object 2 level 0 blkid 5 offset 655360 size 131072 state 4 holds 0 }

If i read the code correctly we are just skipping the Tracepoint when at least one parameter is NULL, which kind of defeat the purpose of keeping DTRACE_PROBE2(dbuf__evict__one) outside the if (db != NULL).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loli10K how did u generate the above output...Are the NULL entries manually updated to denote the expected results? Talking about Linux tracepoints, I have not encountered any instance in Linux code where the args passed to the tracing function are NULL. Moreover, based on my understanding of Linux tracepoints and the API it exposes, there is no way to handle NULL without making changing to the tracing functions. As far as i know its the responsibility of the caller to make sure args are non-null just as any other Linux logging infra (printk etc)

With respect to the new patch, the intent is to have the developer not worry about NULL args, otherwise, i would follow the same semantics followed in other places which is to trace only with Non-NULL args.

But if still, we want the results to look like you mentioned, we can probably, create a dummy tracepoint, that just prints NULL.

Let me know what you think and accordingly i can modify the patch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the NULL entries manually updated to denote the expected results?

Exactly, it was just to show what i expected from this change.

I'm not sure if the code was originally implemented this way so we could trace a NULL dbuf (which seems would only happen when the dbuf cache is already empty), other platforms share this same code so i doubt it was done this way by chance.

Someone with better understanding of dbuf eviction would probably know, my concern is to minimize code differences with other implementations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a good LWN article describing how to use tracepoints which I'd recommend. It provides a nice summary including some examples of what's possible.

https://lwn.net/Articles/379903/

After a little digging I could not find an example of a case where a tracepoint argument is NULL. Usually this is handled by defining another tracepoint which only takes the required number of arguments. For example.

diff --git a/module/zfs/dbuf.c b/module/zfs/dbuf.c
index dc2c004..eea209b 100644
--- a/module/zfs/dbuf.c
+++ b/module/zfs/dbuf.c
@@ -508,16 +508,19 @@ dbuf_evict_one(void)
                db = multilist_sublist_prev(mls, db);
        }
 
-       DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
-           multilist_sublist_t *, mls);
-
        if (db != NULL) {
+               DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
+                   multilist_sublist_t *, mls);
+
                multilist_sublist_remove(mls, db);
                multilist_sublist_unlock(mls);
                (void) refcount_remove_many(&dbuf_cache_size,
                    db->db.db_size, db);
                dbuf_destroy(db);
        } else {
+               DTRACE_PROBE1(dbuf__evict__one_mls,
+                   multilist_sublist_t *, mls);
+
                multilist_sublist_unlock(mls);
        }
        (void) tsd_set(zfs_dbuf_evict_key, NULL);

That may be what we have to do here even if it means diverging slightly from upstream. It might be possible to use the same conditional logic for printing custom fields on the arguments themselves in the dbuf__evict__one definition but I haven't tried if that works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behlendorf ....today even though we use DTRACE_PROBE2 and pass two arguments db and mls, we actually are using only the db arg to log various information in the trace. mls arg is ignored, so, question is do we want to still use DTRACE_PROBE1 passing mls as the only arg and if so, then do we want to log NULL in output or extract sth useful from the multilist arg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be possible to use the same conditional logic for printing custom fields on the arguments themselves in the dbuf__evict__one definition but I haven't tried if that works.

The following patch should allow us to keep tracepoints filtering capabilities while also being able to conditionally change the print format used in TP_printk(). On the other hand this will also increase the probe-effect since we will always have to snprintf() in TP_fast_assign() even for events that could potentially be filtered out.

diff --git a/include/sys/trace_dbuf.h b/include/sys/trace_dbuf.h
index edaa77f..f6e8b96 100644
--- a/include/sys/trace_dbuf.h
+++ b/include/sys/trace_dbuf.h
@@ -41,9 +41,22 @@
  *     zio_t *, ...);
  */
 
+#ifndef	TRACE_DBUF_MSG_MAX
+#define	TRACE_DBUF_MSG_MAX	255
+#endif
+
+#define	DBUF_TP_PRINTK_FMT                                              \
+	"dbuf { spa \"%s\" objset %llu object %llu level %llu "         \
+	"blkid %llu offset %llu size %llu state %llu holds %lld }"
+
+#define	DBUF_TP_PRINTK_ARGS					\
+	__get_str(os_spa), __entry->ds_object,                  \
+	__entry->db_object, __entry->db_level,                  \
+	__entry->db_blkid, __entry->db_offset,                  \
+	__entry->db_size, __entry->db_state, __entry->db_holds
+
 #define	DBUF_TP_STRUCT_ENTRY					\
-	__string(os_spa,					\
-	    spa_name(DB_DNODE(db)->dn_objset->os_spa))		\
+	__dynamic_array(char,	os_spa,	TRACE_DBUF_MSG_MAX)	\
 	__field(uint64_t,	ds_object)			\
 	__field(uint64_t,	db_object)			\
 	__field(uint64_t,	db_level)			\
@@ -52,31 +65,37 @@
 	__field(uint64_t,	db_size)			\
 	__field(uint64_t,	db_state)			\
 	__field(int64_t,	db_holds)			\
+	__dynamic_array(char,	msg,	TRACE_DBUF_MSG_MAX)
 
 #define	DBUF_TP_FAST_ASSIGN					\
-	__assign_str(os_spa,					\
-	    spa_name(DB_DNODE(db)->dn_objset->os_spa));		\
-								\
-	__entry->ds_object = db->db_objset->os_dsl_dataset ?	\
-	    db->db_objset->os_dsl_dataset->ds_object : 0;	\
-								\
-	__entry->db_object = db->db.db_object;			\
-	__entry->db_level  = db->db_level;			\
-	__entry->db_blkid  = db->db_blkid;			\
-	__entry->db_offset = db->db.db_offset;			\
-	__entry->db_size   = db->db.db_size;			\
-	__entry->db_state  = db->db_state;			\
-	__entry->db_holds  = refcount_count(&db->db_holds);
-
-#define	DBUF_TP_PRINTK_FMT						\
-	"dbuf { spa \"%s\" objset %llu object %llu level %llu "		\
-	"blkid %llu offset %llu size %llu state %llu holds %lld }"
+	if (db != NULL) {					\
+		__assign_str(os_spa,				\
+		    spa_name(DB_DNODE(db)->dn_objset->os_spa));	\
+		__entry->ds_object = db->db_objset->os_dsl_dataset ?	\
+		    db->db_objset->os_dsl_dataset->ds_object : 0;	\
+		__entry->db_object = db->db.db_object;		\
+		__entry->db_level  = db->db_level;		\
+		__entry->db_blkid  = db->db_blkid;		\
+		__entry->db_offset = db->db.db_offset;		\
+		__entry->db_size   = db->db.db_size;		\
+		__entry->db_state  = db->db_state;		\
+		__entry->db_holds  = refcount_count(&db->db_holds);	\
+		snprintf(__get_str(msg), TRACE_DBUF_MSG_MAX,	\
+		    DBUF_TP_PRINTK_FMT, DBUF_TP_PRINTK_ARGS);	\
+	} else {						\
+		__assign_str(os_spa, "NULL")			\
+		__entry->ds_object = 0;				\
+		__entry->db_object = 0;				\
+		__entry->db_level  = 0;				\
+		__entry->db_blkid  = 0;				\
+		__entry->db_offset = 0;				\
+		__entry->db_size   = 0;				\
+		__entry->db_state  = 0;				\
+		__entry->db_holds  = 0;				\
+		snprintf(__get_str(msg), TRACE_DBUF_MSG_MAX,	\
+		    "dbuf { NULL }");				\
+	}
 
-#define	DBUF_TP_PRINTK_ARGS					\
-	__get_str(os_spa), __entry->ds_object,			\
-	__entry->db_object, __entry->db_level,			\
-	__entry->db_blkid, __entry->db_offset,			\
-	__entry->db_size, __entry->db_state, __entry->db_holds
 
 /* BEGIN CSTYLED */
 DECLARE_EVENT_CLASS(zfs_dbuf_class,
@@ -84,7 +103,7 @@ DECLARE_EVENT_CLASS(zfs_dbuf_class,
 	TP_ARGS(db, zio),
 	TP_STRUCT__entry(DBUF_TP_STRUCT_ENTRY),
 	TP_fast_assign(DBUF_TP_FAST_ASSIGN),
-	TP_printk(DBUF_TP_PRINTK_FMT, DBUF_TP_PRINTK_ARGS)
+	TP_printk("%s", __get_str(msg))
 );
 /* END CSTYLED */
 
@@ -102,7 +121,7 @@ DECLARE_EVENT_CLASS(zfs_dbuf_evict_one_class,
 	TP_ARGS(db, mls),
 	TP_STRUCT__entry(DBUF_TP_STRUCT_ENTRY),
 	TP_fast_assign(DBUF_TP_FAST_ASSIGN),
-	TP_printk(DBUF_TP_PRINTK_FMT, DBUF_TP_PRINTK_ARGS)
+	TP_printk("%s", __get_str(msg))
 );
 /* END CSTYLED */
 

This results in:

root@linux:/usr/src/zfs# echo 0 > /sys/kernel/debug/tracing/trace
root@linux:/usr/src/zfs# echo 'os_spa == "NULL"' > /sys/kernel/debug/tracing/events/zfs/zfs_dbuf__evict__one/filter 
root@linux:/usr/src/zfs# echo 1 > /sys/kernel/debug/tracing/events/zfs/zfs_dbuf__evict__one/enable 
root@linux:/usr/src/zfs# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 5/5   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
      dbuf_evict-14559 [004] ....  8730.442979: zfs_dbuf__evict__one: dbuf { NULL }
      dbuf_evict-14559 [006] ....  8762.791772: zfs_dbuf__evict__one: dbuf { NULL }
      dbuf_evict-14559 [006] ....  8762.973285: zfs_dbuf__evict__one: dbuf { NULL }
      dbuf_evict-14559 [006] ....  8762.989567: zfs_dbuf__evict__one: dbuf { NULL }
      dbuf_evict-14559 [006] ....  8762.989959: zfs_dbuf__evict__one: dbuf { NULL }

Copy link
Contributor Author

@gaurkuma gaurkuma Aug 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loli10K Thanks for the patch. Just summarizing options we have here:

  1. Do not log an entry if there is no dbuf that can be evicted. This can also be achieved via conditional tracepoint. Looks like we are not in favor of it.
  2. Call different trace function for Null vs Non-Null dbufs. Again not preferable.
  3. Have dummy value assigned to dbuf when it is NULL so that underlying infra doesn't complain/crash. With this, the output can be sth like:
    dbuf_evict-1422 [006] .... 956.984961: zfs_dbuf__evict__one: dbuf NULL{ spa "NULL" objset 0 object 0 level 0 blkid 0 offset 0 size 0 state 0 holds 0 } or
    dbuf_evict-1422 [006] .... 956.984961: zfs_dbuf__evict__one: dbuf NULL{ spa "NULL" objset ? object ? level ? blkid ? offset ? size ? state ? holds ? }
    In this case we just end up logging some useless information.
  4. The above patch as suggested by @loli10K that takes care of different format string at expense of snprintf. I do see similar approach being followed in multiple places in Linux kernel. Even though it involves an additional copy, it may still be ok. If we don't want that then option 3 looks to be the more favorable one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with either 3 and 4 as long as option 3 takes care of the extra "dummy value" in include/sys/trace_dbuf.h and not in dbuf_evict_one().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 4th option would be my preference.

Copy link
Contributor Author

@gaurkuma gaurkuma Aug 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behlendorf @loli10K uploaded the pull request for the patch suggested by @loli10K ...The only additional change is the size of the buffer is increased to 512 bytes. The reason for doing it that the spa name itself can be up to 256 bytes. We ourselves are using bigger names for spa. Sample o/p:

dbuf_evict-4115  [005] ....   460.018195: zfs_dbuf__evict__one: dbuf { spa "zpool-XOXO-WT-e1c0d782-ca7f-4916-bf68-8fc0650a1172-fd66d34a-43c1-44de-8970-5e659863b235" objset 91 object 4381207 level 0 blkid 0 offset 0 size 50688 state 4 holds 0 }
dbuf_evict-4115  [003] ....  1394.738505: zfs_dbuf__evict__one: dbuf { NULL }

Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Copy link
Contributor

@loli10K loli10K left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

@behlendorf behlendorf merged commit 520faf5 into openzfs:master Aug 9, 2017
tonyhutter pushed a commit that referenced this pull request Aug 22, 2017
Update the dbuf__evict__one() tracepoint so that it can safely
handle a NULL dmu_buf_impl_t pointer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>    
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #6463
Fabian-Gruenbichler pushed a commit to Fabian-Gruenbichler/zfs that referenced this pull request Sep 29, 2017
Update the dbuf__evict__one() tracepoint so that it can safely
handle a NULL dmu_buf_impl_t pointer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>    
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes openzfs#6463
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants