implement per-cpu cache for SA layouts and index tables. #5533
Conversation
@bzzz77, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @nedbass and @javenwu to be potential reviewers.
I don't think that this is the best approach, but the code looks correct. I'll leave it to others to judge if it's worth investigating other approaches vs. taking this one despite the shortcomings.
    }
    if (tb != NULL) {
        /* shift toward the most recently used */
        j = (i + 1) % SA_LAYOUT_ENTRIES;
I don't understand this. You're swapping the entries at i and i+1, i.e. moving the found entry towards the end of the array. The entries are searched from the beginning, so wouldn't we want to move the entry towards the beginning of the array?
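To make the suggestion concrete, here is a minimal userspace sketch of a lookup that moves a hit toward the front of the array, so that the linear scan from the beginning finds recently used entries first. The names and sizes are illustrative only, not the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>

#define SA_LAYOUT_ENTRIES 4

/* Hypothetical simplified per-CPU layout cache: just the hash keys. */
typedef struct layout_cache {
	uint64_t slc_hash[SA_LAYOUT_ENTRIES];	/* 0 = empty slot */
} layout_cache_t;

/*
 * Search from the beginning; on a hit, swap the entry one slot toward
 * index 0 (rather than toward the end), so hot entries migrate to where
 * the scan starts.  Returns the entry's new index, or -1 on a miss.
 */
static int
layout_cache_lookup(layout_cache_t *slc, uint64_t hash)
{
	for (int i = 0; i < SA_LAYOUT_ENTRIES; i++) {
		if (slc->slc_hash[i] != hash)
			continue;
		if (i > 0) {
			/* shift toward the beginning of the array */
			uint64_t tmp = slc->slc_hash[i - 1];
			slc->slc_hash[i - 1] = slc->slc_hash[i];
			slc->slc_hash[i] = tmp;
			return (i - 1);
		}
		return (0);
	}
	return (-1);
}
```

Repeated lookups of the same key bubble it to index 0, so the common case terminates on the first comparison.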
IMHO, the most important thing is to keep the recently used items current; the lookup itself should be cheap and simple, given that it's local and hits a single cacheline. Again, I'm fine with fixing the lookup as suggested.
    sa_lot_t *tb = NULL;
    int i, j;

    kpreempt_disable();
This is not a general-purpose, cross-platform synchronization technique. The typical way of doing this would be to have a lock accompanying each per-CPU structure (i.e. in the sa_layout_cache_t), and to acquire that lock. In the rare case that this thread is rescheduled to another CPU, we have correctness. In the common case, the cost of acquiring a lock which is rarely acquired by a different CPU is extremely small.
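A rough userspace sketch of this suggestion, assuming a small fixed CPU count; a tiny `atomic_flag` spinlock stands in for a kernel `kmutex_t`, and all names here are illustrative rather than taken from the patch:

```c
#include <stdalign.h>
#include <stdatomic.h>

#define NCPUS		4	/* illustrative; real code queries the system */
#define CACHELINE	64	/* assumed line size */

/*
 * Each per-CPU cache carries its own lock; the alignment keeps two
 * CPUs' structures off the same cache line.
 */
typedef struct sa_layout_cache {
	alignas(CACHELINE) atomic_flag slc_lock;  /* stands in for kmutex_t */
	int slc_head;
	/* ... cached layout entries would live here ... */
} sa_layout_cache_t;

static sa_layout_cache_t slc_cache[NCPUS];

static void
sa_cache_init(void)
{
	for (int i = 0; i < NCPUS; i++)
		atomic_flag_clear(&slc_cache[i].slc_lock);
}

/*
 * Instead of kpreempt_disable(): read the CPU id, then take that CPU's
 * own lock.  If the thread migrated in between, correctness still
 * holds; in the common case the lock is cache-hot and uncontended.
 */
static void
sa_cache_bump(int cpu)
{
	sa_layout_cache_t *slc = &slc_cache[cpu];

	while (atomic_flag_test_and_set_explicit(&slc->slc_lock,
	    memory_order_acquire))
		;	/* spin; essentially never contended */
	slc->slc_head++;
	atomic_flag_clear_explicit(&slc->slc_lock, memory_order_release);
}
```

The per-structure alignment matters here: without it, neighboring array elements could share a line and the locks would ping-pong between cores even when uncontended.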
I saw these primitives used in ZoL, so I used them. I'm fine with reworking this as suggested.
I forgot to explain why I wanted to use this approach: per-CPU locks are fine as long as they don't share the same cacheline; otherwise, at some number of cores we start to see a serious penalty due to cacheline ping-pong.
That's true. E.g. we may want to add padding to the rrm_lock_t (between each rrw_lock_t) to align the members to cache line boundaries.
Hmm, I can't find rrm_lock_t and rrw_lock_t.. are those present on Illumos only?
I mistyped, it should be rrmlock_t / rrwlock_t (no underscore). It's in module/zfs/rrwlock.c.
It's recommended in the Linux kernel to use ____cacheline_aligned to get the right alignment; this way the right thing happens for each architecture. I'm not sure if the other platforms define a similar macro.
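Outside the kernel, where ____cacheline_aligned isn't available, C11 `alignas` can play a similar role. A small sketch; the 64-byte line size is an assumption (the kernel macro picks the right value per architecture, which is exactly why it's preferred there), and the names are hypothetical:

```c
#include <stdalign.h>

/* Assumed line size; real code should use the platform-reported value. */
#define CACHELINE_SIZE 64

typedef struct padded_counter {
	/* alignas on the member forces the whole struct to 64-byte
	 * alignment, so sizeof() is padded up to a full line. */
	alignas(CACHELINE_SIZE) long value;
} padded_counter_t;

/* One counter per CPU; adjacent elements no longer share a line. */
static padded_counter_t per_cpu_counters[8];
```

With this layout, `sizeof(padded_counter_t)` comes out as exactly one cache line, so each array element starts on its own line.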
@ahrens if it's supposed to be per-cpu, then why isn't a regular mutex good enough?
    slc->slc_head = (slc->slc_head + 1) % SA_LAYOUT_ENTRIES;
    kpreempt_enable();
}

static void
sa_find_layout(objset_t *os, uint64_t hash, sa_attr_type_t *attrs,
Given that the number of layouts is typically very small, and the set of layouts changes very infrequently, I think we could find a cleaner solution to this performance problem. For example, we could replace the sa_lock mutex with a rrmlock_t, or similar read-mostly lock. I won't attempt to veto this change as-is, if others are OK with it, but I think that we could find a more maintainable solution.
The same idea applies to the sa_idx_cache.
In my (rather limited) experience the number of layouts is relatively small, so I'd agree about the layouts. But I'm not sure this is true for the indices: the MDS stores many different EAs, and a few of them (LOV, linkEA) are variable-length, so we have to deal with a large set of "lengths".
That's certainly possible - in which case we'd expect a bunch of cache misses?
The majority of Lustre use cases are HPC applications where many nodes share very similar (if not the same) naming rules, stripes, etc. So when a cluster runs such an application, the MDS should be serving files with very few different EAs (because the striping and naming schema are the same), and most requests should hit the LAYOUT/IDX cache. In some cases there may be a few big applications, or lots of small ones when a cluster is shared among many groups; in this case I'd expect fewer hits. Still, most applications work with the MDS when they are starting (opening/creating files) and stopping (closing files, updating attributes). These are relatively short periods, and I think different applications don't overlap much, which should result in more cache hits.
Answering your question: the more applications working with different striping and naming schemas, the fewer the hits.
It'd be very interesting to learn how many layouts/indices are used on real big sites. I guess it should be easy to count the numbers and dump them via /proc. I'm fine to cook a patch. @behlendorf, would it be possible to check on your production systems?
@bzzz77 it's possible we could get it in to the next rounds of updates if you can provide a patch.
@@ -2010,7 +2167,25 @@ sa_handle_unlock(sa_handle_t *hdl)
    mutex_exit(&hdl->sa_lock);
}

sa_handle_t *
sa_handle_alloc(objset_t *os, dmu_buf_t *db, void *userp)
I don't see where this is used.
Ah, my bad - I forgot to remove this from the patch. In short, the problem is that to be able to update attributes (even regular ones), we need an SA handle. sa_handle_get() builds the index, which starts with an empty EA; then we have to add EA(s) in all cases, so we rebuild the index again. Each rebuild has to find an appropriate layout and then check for an existing index table, which is quite expensive. So the idea was to get an empty SA handle and build the layout/index when we instantiate attributes along with the mandatory EAs.
What would be your idea for this problem?
typedef struct sa_idx_cache {
    sa_idx_tab_t *sac_idx[SA_IDX_ENTRIES];
    sa_lot_t *sac_lot[SA_IDX_ENTRIES];
    int sac_head;
I think that sac_lot and sac_idx are accessed together, i.e. element i of both of these arrays together constitutes one entry in the cache. Therefore we should use a struct to indicate this:
struct sa_idx_cache_entry {
sa_lot_t *sice_lot;
sa_idx_tab_t *sice_idx;
} sac_entries[SA_IDX_ENTRIES];
Yes, that makes sense. I was going to change SA_IDX_ENTRIES so that the whole structure fits in a cacheline.
Thanks for the review. Anyone else, please?
I've addressed a few issues Matt pointed out; a few more fixes are coming. Please don't hesitate to comment ;)

I think this is addressing #4806
Force-pushed from d5c2151 to 19dbe9c.
Closing. We can revisit this after #6117 is merged to reassess the performance impact of this proposed optimization.
E.g. 1000 file creations show 5262 hits and 19 misses. This significantly improves the multicore creation rate.