WIP - ZAP Shrinking #14088

Closed
wants to merge 1 commit into from

Conversation

sdimitro
Contributor

@sdimitro sdimitro commented Oct 25, 2022

Ported from:

/*
 * Shrinking Algorithm:
 * 1. Check if a sibling leaf exists.
 * 2. Check if the sibling leaf is empty.
 * 3. If sibling bit of initial leaf is not 0, release it.
 *    To avoid deadlock, leaves must always be dereferenced in the same
 *    order: the leaf with sibling bit 0 first, then the leaf with sl_bit 1.
 * 4. Upgrade zapdir lock to WRITER (once).
 * 5. Deref leaves if needed.
 * 6. Recheck both leaves if required.
 * 7. Update ptrtbl pointers of the sibling leaf (sl_bit 1) to point to
 *    the initial leaf (sl_bit 0).
 * 8. Free disk space of the sibling leaf (dmu_free_range).
 * 9. Update the leaf prefix and prefix_len.
 * 10. Repeat the procedure from the beginning on the updated leaf.
 *
 *		+---------------+
 *		| fzap_remove() |
 *		+---------------+
 *			|
 *			v
 *		+---------------+
 *		| zap_shrink()  |
 *		+---------------+
 *		        |
 *		        v
 *		+================+
 *	        < is leaf empty? >---(no)---> OUT
 *		+================+
 *			|
 *		      (yes)
 *		        |
 *	+------->-------+
 *	|	        |
 *	|	        v
 *	|	+---------------------------+
 *	|	| check_sibling_by_ptrtbl() |
 *	|	+---------------------------+
 *	|	        |
 *	|	        v
 *	|	+====================+
 *	|       < sibl. leaf exists? >---(no)---> OUT
 *	|	+====================+
 *	|	      (yes)
 *	|	        |
 *	|	        v
 *	|	+---------------------------+
 *	|	| deref sibl. leaf (READER) |
 *	|	+---------------------------+
 *	|	        |
 *	|	        v
 *	|	+===================+
 *	|       < is sibling empty? >---(no)---> OUT
 *	|	+===================+
 *	|		|
 *	|	      (yes)
 *	|	        |
 *	|	        v
 *	|	+-------------------------------------+
 *	|	| put sibl. leaf cause we need writer |
 *	|	+-------------------------------------+
 *	|		|
 *	|		v
 *	|	+============================+		+-----------------+
 *	|        < do we hold zap as WRITER? >--(no)--> | tryupgradedir() |
 *	|	+============================+		+-----------------+
 *	|		|				       |
 *	|		|				       v
 *	|		|				+==============+
 *	|		|<--------(yes)-----------------<   success?   >
 *	|		|				+==============+
 *	|		|				       |
 *	|		|				      (no)
 *	|		|				       |
 *	|		|				       v
 *	|		|				+--------------+
 *	|		|				| upgrade dir  |
 *	|		|				+--------------+
 *	|		|				       |
 *	|		|				       v
 *	|		|<-------------------------------------+
 *	|		|
 *	|	+-----------------------------------------------+
 *	|	| swap leaf hashes if initial leaf had slbit==1 |
 *	|	| make sure: l (slbit==0), sl (slbit==1)        |
 *	|	+-----------------------------------------------+
 *	|		|
 *	|		v
 *	|	+--------------------------------------+
 *	|	| deref sibl. leaf (WRITER) if required|
 *	|	+--------------------------------------+
 *	|		|
 *	|		v
 *	|	+---------------------------+
 *	|	| deref sibl. leaf (WRITER) |
 *	|	+---------------------------+
 *	|		|
 *	|		v
 *	|	+===========================================+
 *	|	< (recheck) both leaves are empty siblings  >--(no)--> OUT
 *	|	+===========================================+
 *	|		|
 *	|	      (yes)
 *	|	        |
 *	|	        v
 *	|	+----------------------------------+
 *	|	| update sibling leaf ptrtbl range |
 *	|	| to point to initial leaf	   |
 *	|	+----------------------------------+
 *	|		|
 *	|		v
 *	|	+--------------------------------------+
 *	|	| free disk space for the sibling leaf |
 *	|	+--------------------------------------+
 *	|		|
 *	|		v
 *	|	+---------------------------------------+
 *	|	| update initial leaf prefix/prefix_len |
 *	|	| (now this leaf goes to another level,	|
 *	|	|  and it may have another sibling)	|
 *	|	+---------------------------------------+
 *	|		|
 *	+---------------+
 */
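
As a reading aid for steps 1, 2, 7 and 9 of the comment above, here is a minimal sketch on a toy in-memory model. Everything in it is an assumption made for illustration and is not code from this PR: the `toy_*` names, the 8-slot pointer table, and the omission of locking, leaf dereferencing and `dmu_free_range`.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model only: the names and the 8-slot table are illustrative. */
#define	TOY_SHIFT	3
#define	TOY_SLOTS	(1u << TOY_SHIFT)

typedef struct toy_leaf {
	uint64_t prefix;	/* top prefix_len bits of the hashes covered */
	unsigned prefix_len;	/* number of significant prefix bits */
	unsigned nentries;	/* 0 means the leaf is empty */
} toy_leaf_t;

typedef struct toy_zap {
	toy_leaf_t *ptrtbl[TOY_SLOTS];	/* slot = top TOY_SHIFT hash bits */
} toy_zap_t;

/* First pointer-table slot covered by (prefix, prefix_len). */
static unsigned
toy_first_slot(uint64_t prefix, unsigned prefix_len)
{
	return ((unsigned)(prefix << (TOY_SHIFT - prefix_len)));
}

/*
 * Step 1: the sibling differs only in the least significant prefix bit.
 * Returns NULL when there is no prefix bit to flip, or when the sibling
 * range is covered by leaves with a longer prefix.
 */
static toy_leaf_t *
toy_sibling(const toy_zap_t *zap, const toy_leaf_t *l)
{
	toy_leaf_t *sl;

	if (l->prefix_len == 0)
		return (NULL);
	sl = zap->ptrtbl[toy_first_slot(l->prefix ^ 1, l->prefix_len)];
	return (sl->prefix_len == l->prefix_len ? sl : NULL);
}

/* Merge empty sibling pairs starting at l; returns the number of merges. */
static unsigned
toy_shrink(toy_zap_t *zap, toy_leaf_t *l)
{
	unsigned merges = 0;

	while (l->nentries == 0) {
		toy_leaf_t *sl = toy_sibling(zap, l);
		unsigned first, i, n;

		if (sl == NULL || sl->nentries != 0)	/* step 2 */
			break;
		if (l->prefix & 1) {	/* keep the leaf with sibling bit 0 */
			toy_leaf_t *tmp = l;
			l = sl;
			sl = tmp;
		}
		/* Step 7: repoint the sibling's slots at the surviving leaf. */
		first = toy_first_slot(sl->prefix, sl->prefix_len);
		n = 1u << (TOY_SHIFT - sl->prefix_len);
		for (i = 0; i < n; i++)
			zap->ptrtbl[first + i] = l;
		/* Step 8 (freeing sl on disk) is omitted in this toy model. */
		/* Step 9: the survivor now covers both halves. */
		l->prefix >>= 1;
		l->prefix_len--;
		merges++;
	}
	return (merges);
}
```

In this model a merge only happens when the sibling range is held by a single leaf at the same depth. Starting from four empty leaves with two-bit prefixes, shrinking the leaf with prefix `11` merges once and then stops, because its new sibling range is still split between two deeper leaves.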

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 26, 2022
@jumbi77
Contributor

jumbi77 commented Oct 30, 2022

This is awesome! If I understood @ahrens correctly at the OpenZFS DevSummit Hackathon 2022, this could finally fix long waits for a simple "ls" on directories that contained many files in the past. If this gets merged, will it only work for newly created directories, or also for existing directories after upgrading to a ZFS release that contains this feature?

Can't wait to get this upstreamed! :) Thanks in advance to all contributors!

Referencing a github discussion regarding zap shrinking: #8420

@@ -110,6 +111,8 @@ typedef enum zap_flags {
* already randomly distributed.
*/
ZAP_FLAG_PRE_HASHED_KEY = 1 << 2,
/* XXX */
ZAP_FLAG_NO_SHRINK = 1 << 3,
Member

I don't see where this is ever passed in; can we remove it?

}
zap_put_leaf(l);
return (err);
}

#define ZAP_PREFIX_HASH(pref, pref_len) ((pref) << (64 - (pref_len)))


Member

[nit] one blank line is enough here
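
For readers following the hunk above: `ZAP_PREFIX_HASH` turns a leaf prefix back into the smallest 64-bit hash that the prefix covers, which makes it the inverse of `ZAP_HASH_IDX`. The sketch below assumes that `ZAP_HASH_IDX` is defined as in the existing `zap_impl.h`; the two helper functions are hypothetical and are not part of this PR. Note that the macro as quoted shifts by 64 when `pref_len` is 0, which is undefined behavior in C for a 64-bit operand, so the sketch guards that case explicitly.

```c
#include <stdint.h>

/* As quoted in the hunk above. */
#define	ZAP_PREFIX_HASH(pref, pref_len) ((pref) << (64 - (pref_len)))
/* Assumed to match the existing definition in zap_impl.h. */
#define	ZAP_HASH_IDX(hash, n) (((n) == 0) ? 0 : ((hash) >> (64 - (n))))

/* Hypothetical helper: does this hash fall under the leaf's prefix? */
static int
hash_in_leaf(uint64_t hash, uint64_t prefix, unsigned prefix_len)
{
	return (ZAP_HASH_IDX(hash, prefix_len) == prefix);
}

/*
 * Hypothetical helper: smallest hash covered by the prefix. A zero-length
 * prefix covers every hash, so the smallest one is 0; handling it here
 * avoids the 64-bit shift by 64.
 */
static uint64_t
prefix_first_hash(uint64_t prefix, unsigned prefix_len)
{
	return (prefix_len == 0 ? 0 : ZAP_PREFIX_HASH(prefix, prefix_len));
}
```

Whether a caller in the PR can ever pass a prefix length of 0 to the macro is not visible from the quoted hunk, so the guard is a precaution, not a reported bug.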

zc->zc_leaf = NULL;

/*
* The leaf was either shrunk or splitted.
Member

[nit] "split" is the correct past tense.

(ZAP_HASH_IDX(zc->zc_hash,
zap_leaf_phys(zc->zc_leaf)->l_hdr.lh_prefix_len) !=
zap_leaf_phys(zc->zc_leaf)->l_hdr.lh_prefix)) {
// XXX
Member

I think this change is correct, and the commented-out code can be removed

*
* Any ZAP leaf might have a sibling - a leaf with the same prefix length and
* with a prefix that differs only in the least significant (sibling) bit.
* If both leaves are empty, we can remove one of them. For simplicity, we
Member

Could we remove one of them if only one of them is empty (not both empty)? Maybe it doesn't matter much in practice, since you would have to remove the vast majority (>99%?) of entries in order to get either one or two adjacent leaves empty.


/*
* Instead of calling zap_unlockdir(); zap_lockdir();
* we do it in more optimized way.
Member

This seems like an unnecessary optimization, given that we don't make this optimization in zap_expand_leaf(), which is probably called much more often than this.

@jumbi77
Contributor

jumbi77 commented Nov 12, 2022

@sdimitro Sorry for bothering you, but could you address the feedback and rebase this? I am really looking forward to it.
@behlendorf Could you ping some additional reviewers if required (including yourself)?

Many thanks in advance to all participants!

@jumbi77
Contributor

jumbi77 commented Jan 19, 2023

Politely pinging @amotin, given the recent work on the ZAP code, to maybe bring more attention and review to this, in case iX might be interested.

Getting this integrated would be awesome. Anyway, many thanks!

@amotin
Member

amotin commented Jan 19, 2023

@jumbi77 It is interesting, but so far I've worked with MicroZAPs, not so much with the FatZAPs handled here, as far as I can see. I'd need to dig deeper into it.
@sdimitro Is this PR abandoned, or do you plan to return to it?

@sdimitro
Contributor Author

@amotin Feel free to pick this up!

BTW I was planning on trying out a few other designs, as this PR is not really my code (I uncovered it from an old illumos PR). Maybe something along the lines of recreating the whole ZAP once too many entries are gone (1/4?), potentially converting it to a microZAP too.

@jumbi77
Contributor

jumbi77 commented Jul 4, 2023

@allanjude I saw your recent presentation at the June 2023 OpenZFS leadership meeting regarding the dedup rework. There (as far as I understood) you also mentioned some ZAP optimizations, including ZAP shrinking. Do you plan to use this PR, or would you consider finishing it before, or separately from, the dedup work? Getting ZAP shrinking and the other optimizations would be great. In any case, many thanks.

@allanjude
Contributor

We expect to post an updated version of ZAP shrinking in the next week or two.

@jumbi77
Contributor

jumbi77 commented Sep 16, 2023

> We expect to post an updated version of ZAP shrinking in the next week or two.

Hello @allanjude, may I ask about the progress on this? Could you give an update?

In any case, many thanks for working on ZFS!

snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 9, 2023
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 20, 2023
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 31, 2023
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Nov 13, 2023
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Nov 13, 2023
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Nov 17, 2023
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Jan 3, 2024
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Jan 12, 2024
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Jan 22, 2024
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
@snajpa
Contributor

snajpa commented Jan 23, 2024

I'd just like to say here, that this code seems solid so far. We've been running it (after a few testing rounds) in production for a while now, no problems to chase, just works. I think there are quite a number of users, who would benefit from this being upstreamed. Not sure it's worth the wait for a newer version... That can eventually still land in master even after this gets merged, can't it?

(also sorry for the spam, I'll remove the reference to this PR from the commit)

@behlendorf
Contributor

@snajpa it's good to know this is holding up well in your testing. There's still some outstanding feedback to tackle, that work just needs to be picked up by someone and a fresh PR opened.

@allanjude
Contributor

> @snajpa it's good to know this is holding up well in your testing. There's still some outstanding feedback to tackle, that work just needs to be picked up by someone and a fresh PR opened.

Klara's improved version of ZAP shrinking should get a pull request before the end of February.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Jan 29, 2024
@allanjude allanjude mentioned this pull request Feb 14, 2024
@behlendorf
Contributor

Replaced by #15888

@behlendorf behlendorf closed this Feb 15, 2024
Labels
Status: Code Review Needed Ready for review and testing Status: Work in Progress Not yet ready for general review
7 participants