
implement per-cpu cache for SA layouts and index tables. #5533

Closed
wants to merge 1 commit

Conversation

@bzzz77 (Contributor) commented Dec 28, 2016

For example, 1000 file creations show 5262 hits and 19 misses.
This significantly improves the multicore creation rate.

Description

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • Change has been approved by a ZFS on Linux member.

@mention-bot commented:

@bzzz77, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @nedbass and @javenwu to be potential reviewers.

@ahrens (Member) left a comment:

I don't think that this is the best approach, but the code looks correct. I'll leave it to others to judge if it's worth investigating other approaches vs. taking this one despite the shortcomings.

	}
	if (tb != NULL) {
		/* shift toward the most recently used */
		j = (i + 1) % SA_LAYOUT_ENTRIES;
Member:

I don't understand this. You're swapping the entries at i and i+1, i.e. moving the found entry towards the end of the array. The entries are searched from the beginning, so wouldn't we want to move the entry towards the beginning of the array?

Contributor Author:

IMHO, the most important thing is to keep the recently used items current; the lookup itself should be cheap and simple, given that it's local and hits a single cacheline. Again, I'm fine with fixing the lookup as suggested.
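For illustration, here is a minimal sketch of the "move toward the front" ordering discussed above; the field names (slc_hash, slc_lot, slc_head) and this helper are assumptions for the sketch, not the patch's actual code.

typedef struct sa_layout_cache {
	uint64_t	slc_hash[SA_LAYOUT_ENTRIES];	/* hash of each cached layout */
	sa_lot_t	*slc_lot[SA_LAYOUT_ENTRIES];	/* cached layout pointers */
	int		slc_head;			/* next slot to replace */
} sa_layout_cache_t;

/*
 * Scan from index 0; on a hit, swap the entry one slot toward the front so
 * hot layouts are found after fewer iterations on the next lookup.
 */
static sa_lot_t *
sa_layout_cache_lookup(sa_layout_cache_t *slc, uint64_t hash)
{
	int i;

	for (i = 0; i < SA_LAYOUT_ENTRIES; i++) {
		if (slc->slc_lot[i] == NULL || slc->slc_hash[i] != hash)
			continue;
		if (i > 0) {
			/* swap entries i and i-1: the hit becomes "more recent" */
			uint64_t thash = slc->slc_hash[i - 1];
			sa_lot_t *tlot = slc->slc_lot[i - 1];

			slc->slc_hash[i - 1] = slc->slc_hash[i];
			slc->slc_lot[i - 1] = slc->slc_lot[i];
			slc->slc_hash[i] = thash;
			slc->slc_lot[i] = tlot;
			i--;
		}
		return (slc->slc_lot[i]);
	}
	return (NULL);
}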

	sa_lot_t *tb = NULL;
	int i, j;

	kpreempt_disable();
Member:

This is not a general-purpose, cross-platform synchronization technique. The typical way of doing this would be to have a lock accompanying each per-CPU structure (i.e. in the sa_layout_cache_t), and to acquire that lock. In the rare case that this thread is rescheduled to another CPU, we have correctness. In the common case, the cost of acquiring a lock which is rarely acquired by a different CPU is extremely small.
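As a rough sketch of this pattern, assuming a per-CPU structure like the hypothetical one above with a kmutex_t added, and assuming CPU_SEQID (or an equivalent preempt-safe CPU index) picks the per-CPU slot:

typedef struct sa_layout_cache {
	kmutex_t	slc_lock;			/* protects the fields below */
	uint64_t	slc_hash[SA_LAYOUT_ENTRIES];
	sa_lot_t	*slc_lot[SA_LAYOUT_ENTRIES];
	int		slc_head;
} sa_layout_cache_t;

static sa_lot_t *
sa_layout_cache_find(sa_layout_cache_t *caches, uint64_t hash)
{
	sa_layout_cache_t *slc = &caches[CPU_SEQID];
	sa_lot_t *tb;

	/*
	 * If the thread migrates after picking slc, it just takes a lock
	 * another CPU now prefers; correctness is preserved, and in the
	 * common uncontended case the mutex is nearly free.
	 */
	mutex_enter(&slc->slc_lock);
	tb = sa_layout_cache_lookup(slc, hash);	/* scan as in the earlier sketch */
	mutex_exit(&slc->slc_lock);

	return (tb);
}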

Contributor Author:

I saw these primitives used in ZoL, so I used them. I'm fine with reworking this as suggested.

Contributor Author:

I forgot to explain why I wanted to use this approach. Per-CPU locks are fine unless they share the same cacheline; otherwise, at some number of cores we start to see a serious penalty due to cacheline ping-pong.

Member:

That's true. E.g. we may want to add padding to the rrm_lock_t (between each rrw_lock_t) to align the members to cache lines.

Contributor Author:

Hmm, I can't find rrm_lock_t and rrw_lock_t... are those present only on Illumos?

Member:

I mistyped, it should be rrmlock_t / rrwlock_t (no underscore). It's in module/zfs/rrwlock.c.

Contributor:

It's recommended in the Linux kernel to use ____cacheline_aligned to get the right alignment; this way the right thing happens for each architecture. I'm not sure if the other platforms define a similar macro.
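For reference, a sketch of how that macro could be applied to the hypothetical per-CPU structure from the earlier sketches; on non-Linux builds (e.g. libzpool) the attribute would need a fallback definition, which is exactly the portability concern raised above.

#include <linux/cache.h>	/* ____cacheline_aligned */

/*
 * Aligning (and thus padding) the whole per-CPU structure to a cacheline
 * keeps one CPU's lock and entries from sharing a line with its
 * neighbour's, avoiding the ping-pong described earlier.
 */
typedef struct sa_layout_cache {
	kmutex_t	slc_lock;
	uint64_t	slc_hash[SA_LAYOUT_ENTRIES];
	sa_lot_t	*slc_lot[SA_LAYOUT_ENTRIES];
	int		slc_head;
} ____cacheline_aligned sa_layout_cache_t;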

Contributor Author:

@ahrens if it's supposed to be per-CPU, then why is a regular mutex not good enough?

	slc->slc_head = (slc->slc_head + 1) % SA_LAYOUT_ENTRIES;
	kpreempt_enable();
}

static void
sa_find_layout(objset_t *os, uint64_t hash, sa_attr_type_t *attrs,
Member:

Given that the number of layouts is typically very small, and the set of layouts changes very infrequently, I think we could find a cleaner solution to this performance problem. For example, we could replace the sa_lock mutex with a rrmlock_t, or similar read-mostly lock. I won't attempt to veto this change as-is, if others are OK with it, but I think that we could find a more maintainable solution.

The same idea applies to the sa_idx_cache.
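A minimal sketch of that read-mostly alternative, assuming a hypothetical sa_layout_rrlock field and an AVL lookup standing in for whatever the existing layout search does; rrm_enter()/rrm_exit() are the interfaces from module/zfs/rrwlock.c.

typedef struct sa_os_sketch {
	rrmlock_t	sa_layout_rrlock;	/* hypothetical read-mostly lock */
	avl_tree_t	sa_layout_hash_tree;	/* existing layout lookup tree */
} sa_os_sketch_t;

static sa_lot_t *
sa_layout_find_read(sa_os_sketch_t *sa, sa_lot_t *search)
{
	sa_lot_t *tb;

	/*
	 * Readers are spread across the lock's internal slots, so frequent
	 * lookups from many CPUs rarely serialize on a single cacheline;
	 * layout changes (writers) are rare.
	 */
	rrm_enter(&sa->sa_layout_rrlock, RW_READER, FTAG);
	tb = avl_find(&sa->sa_layout_hash_tree, search, NULL);
	rrm_exit(&sa->sa_layout_rrlock, FTAG);

	return (tb);
}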

Contributor Author:

In my (rather small) experience the number of layouts is relatively small, so I'd agree about the layouts. But I'm not sure this is true for the indices: the MDS stores many different EAs, and a few of them (LOV, linkEA) are variable-length, so we have to deal with a large set of "lengths".

Member:

That's certainly possible - in which case we'd expect a bunch of cache misses?

Contributor Author:

The majority of Lustre use cases are HPC applications where many nodes share very similar (if not the same) naming rules, stripes, etc. So when a cluster runs such an application, the MDS should be serving files with very few different EAs (because the striping and naming schemas are the same), and most requests should hit the LAYOUT/IDX cache. In some cases there may be a few big applications, or lots of small ones when a cluster is shared among many groups; in that case I'd expect fewer hits. Still, most applications talk to the MDS when they are starting (open/create files) and stopping (closing files, updating attributes). These are relatively short periods, and I think different applications don't overlap much, which should result in more cache hits.
Answering your question: the more applications working with different striping and naming schemas, the fewer hits.
It would be very interesting to learn how many layouts/indices are used on real big sites. I guess it should be easy to count the numbers and dump them via /proc; I'm fine with cooking a patch. @behlendorf, would it be possible to check on your production systems?
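If hit/miss counters are wanted, a sketch along the lines of ZoL's existing kstats (which appear under /proc/spl/kstat/zfs/) might look like the following; the "sa_cache" kstat name and its fields are made up for illustration.

typedef struct sa_cache_stats {
	kstat_named_t	sa_layout_hits;
	kstat_named_t	sa_layout_misses;
	kstat_named_t	sa_idx_hits;
	kstat_named_t	sa_idx_misses;
} sa_cache_stats_t;

static sa_cache_stats_t sa_cache_stats = {
	{ "layout_hits",	KSTAT_DATA_UINT64 },
	{ "layout_misses",	KSTAT_DATA_UINT64 },
	{ "idx_hits",		KSTAT_DATA_UINT64 },
	{ "idx_misses",		KSTAT_DATA_UINT64 },
};

static kstat_t *sa_cache_ksp;

void
sa_cache_stat_init(void)
{
	sa_cache_ksp = kstat_create("zfs", 0, "sa_cache", "misc",
	    KSTAT_TYPE_NAMED, sizeof (sa_cache_stats) / sizeof (kstat_named_t),
	    KSTAT_FLAG_VIRTUAL);
	if (sa_cache_ksp != NULL) {
		sa_cache_ksp->ks_data = &sa_cache_stats;
		kstat_install(sa_cache_ksp);
	}
}

Lookup sites would then bump the counters with, e.g., atomic_inc_64(&sa_cache_stats.sa_layout_hits.value.ui64).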

Contributor:

@bzzz77 it's possible we could get it into the next round of updates if you can provide a patch.

@@ -2010,7 +2167,25 @@ sa_handle_unlock(sa_handle_t *hdl)
mutex_exit(&hdl->sa_lock);
}

sa_handle_t *
sa_handle_alloc(objset_t *os, dmu_buf_t *db, void *userp)
Member:

I don't see where this is used.

Contributor Author:

Ah, my bad: I forgot to remove this from the patch. In short, the problem is that to be able to update attributes (even regular ones), we need an SA handle. sa_handle_get() builds the index, which starts with an empty EA; then we have to add EA(s) in all cases, so we rebuild the index again. Each rebuild has to find an appropriate layout and then check for an existing index table, which is quite expensive. So the idea was to get an empty SA handle and build the layout/index when we instantiate the attributes along with the mandatory EAs.
What would be your suggestion for this problem?

typedef struct sa_idx_cache {
	sa_idx_tab_t *sac_idx[SA_IDX_ENTRIES];
	sa_lot_t *sac_lot[SA_IDX_ENTRIES];
	int sac_head;
Member:

I think that sac_lot and sac_idx are accessed together, i.e. element i of both of these arrays together constitutes one entry in the cache. Therefore we should use a struct to indicate this:

struct sa_idx_cache_entry {
    sa_lot_t *sice_lot;
    sa_idx_tab_t *sice_idx;
} sac_entries[SA_IDX_ENTRIES];

Contributor Author:

Yes, that makes sense. I was going to change SA_IDX_ENTRIES so that the whole structure fits in a cacheline.
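For what it's worth, a sketch of that sizing idea using the entry struct suggested above; the entry count of 3 and the 64-byte cacheline are assumptions (on LP64, 3 entries of two pointers plus the head index come to 56 bytes).

#define	SA_IDX_ENTRIES	3

typedef struct sa_idx_cache {
	struct sa_idx_cache_entry {
		sa_lot_t	*sice_lot;
		sa_idx_tab_t	*sice_idx;
	} sac_entries[SA_IDX_ENTRIES];
	int	sac_head;
} sa_idx_cache_t;

/* compile-time check that one per-CPU cache stays within a cacheline */
_Static_assert(sizeof (sa_idx_cache_t) <= 64,
	"sa_idx_cache_t no longer fits in one cacheline");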

Contributor Author:

Thanks for the review. Anyone else, please?

@bzzz77 (Contributor Author) commented Jan 11, 2017

I've addressed a few of the issues Matt pointed out; a few more fixes are coming. Please don't hesitate to comment ;)

@ahrens (Member) commented Jan 11, 2017

I think this is addressing #4806

@bzzz77 force-pushed the sa-cache branch 2 times, most recently from d5c2151 to 19dbe9c on January 17, 2017 at 19:00
…le creations shows 5262 hits and 19 misses. this significantly improves multicore creation rate.
@behlendorf added the Type: Performance label on Feb 16, 2017
@behlendorf removed this from TODO in 0.7.0-rc4 on Mar 29, 2017
@behlendorf (Contributor) commented:

Closing. We can revisit this after #6117 is merged to reassess the performance impact of this proposed optimization.

@behlendorf closed this on May 19, 2017
Labels: Type: Performance

4 participants