Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read and understand btrfs code (necessary?) complexity (madness?) #9

Open
mbideau opened this issue Mar 20, 2023 · 0 comments
Open

Comments

@mbideau
Copy link
Owner

mbideau commented Mar 20, 2023

An extract of the comments in the source code...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n206

/*
 * We process inodes by their increasing order, so if before an
 * incremental send we reverse the parent/child relationship of
 * directories such that a directory with a lower inode number was
 * the parent of a directory with a higher inode number, and the one
 * becoming the new parent got renamed too, we can't rename/move the
 * directory with lower inode number when we finish processing it - we
 * must process the directory with higher inode number first, then
 * rename/move it and then rename/move the directory with lower inode
 * number. Example follows.
 *
 * Tree state when the first send was performed:
 *
 * .
 * |-- a                   (ino 257)
 *     |-- b               (ino 258)
 *         |
 *         |
 *         |-- c           (ino 259)
 *         |   |-- d       (ino 260)
 *         |
 *         |-- c2          (ino 261)
 *
 * Tree state when the second (incremental) send is performed:
 *
 * .
 * |-- a                   (ino 257)
 *     |-- b               (ino 258)
 *         |-- c2          (ino 261)
 *             |-- d2      (ino 260)
 *                 |-- cc  (ino 259)
 *
 * The sequence of steps that lead to the second state was:
 *
 * mv /a/b/c/d /a/b/c2/d2
 * mv /a/b/c /a/b/c2/d2/cc
 *
 * "c" has lower inode number, but we can't move it (2nd mv operation)
 * before we move "d", which has higher inode number.
 *
 * So we just memorize which move/rename operations must be performed
 * later when their respective parent is processed and moved/renamed.
 */

/* Indexed by parent directory inode number. */
struct rb_root pending_dir_moves;

/*
 * Reverse index, indexed by the inode number of a directory that
 * is waiting for the move/rename of its immediate parent before its
 * own move/rename can be performed.
 */
struct rb_root waiting_dir_moves;


from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n260

/*
 * A directory that is going to be rm'ed might have a child directory
 * which is in the pending directory moves index above. In this case,
 * the directory can only be removed after the move/rename of its child
 * is performed. Example:
 *
 * Parent snapshot:
 *
 * .                        (ino 256)
 * |-- a/                   (ino 257)
 *     |-- b/               (ino 258)
 *         |-- c/           (ino 259)
 *         |   |-- x/       (ino 260)
 *         |
 *         |-- y/           (ino 261)
 *
 * Send snapshot:
 *
 * .                        (ino 256)
 * |-- a/                   (ino 257)
 *     |-- b/               (ino 258)
 *         |-- YY/          (ino 261)
 *              |-- x/      (ino 260)
 *
 * Sequence of steps that lead to the send snapshot:
 * rm -f /a/b/c/foo.txt
 * mv /a/b/y /a/b/YY
 * mv /a/b/c/x /a/b/YY
 * rmdir /a/b/c
 *
 * When the child is processed, its move/rename is delayed until its
 * parent is processed (as explained above), but all other operations
 * like update utimes, chown, chgrp, etc, are performed and the paths
 * that it uses for those operations must use the orphanized name of
 * its parent (the directory we're going to rm later), so we need to
 * memorize that name.
 *
 * Indexed by the inode number of the directory to be deleted.
 */
struct rb_root orphan_dirs;

...


from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n323

struct waiting_dir_move {
	struct rb_node node;
	u64 ino;
	/*
	 * There might be some directory that could not be removed because it
	 * was waiting for this directory inode to be moved first. Therefore
	 * after this directory is moved, we can try to rmdir the ino rmdir_ino.
	 */
	u64 rmdir_ino;
	u64 rmdir_gen;
	bool orphanized;
};

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n1792

/*
 * Helper function to generate a file name that is unique in the root of
 * send_root and parent_root. This is used to generate names for orphan inodes.
 */
static int gen_unique_name(struct send_ctx *sctx,
			   u64 ino, u64 gen,
			   struct fs_path *dest)
...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n2099

/*
 * Used by process_recorded_refs to determine if a new ref would overwrite an
 * already existing ref. In case it detects an overwrite, it returns the
 * inode/gen in who_ino/who_gen.
 * When an overwrite is detected, process_recorded_refs does proper orphanizing
 * to make sure later references to the overwritten inode are possible.
 * Orphanizing is however only required for the first ref of an inode.
 * process_recorded_refs does an additional is_first_ref check to see if
 * orphanizing is really required.
 */
static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
			      const char *name, int name_len,
			      u64 *who_ino, u64 *who_gen, u64 *who_mode)
			      
...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n2125

/*
 * If we have a parent root we need to verify that the parent dir was
 * not deleted and then re-created, if it was then we have no overwrite
 * and we can just unlink this entry.
 *
 * @parent_root_dir_gen was set to 0 if the inode does not exist in the
 * parent root.
 */
 
...
 
from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n2144
 
/*
 * Check if the overwritten ref was already processed. If yes, the ref
 * was already unlinked/moved, so we can safely assume that we will not
 * overwrite anything at this point in time.
 */
 
...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n2164

/*
 * Checks if the ref was overwritten by an already processed inode. This is
 * used by __get_cur_name_and_parent to find out if the ref was orphanized and
 * thus the orphan name needs be used.
 * process_recorded_refs also uses it to avoid unlinking of refs that were
 * overwritten.
 */
static int did_overwrite_ref(struct send_ctx *sctx,
			    u64 dir, u64 dir_gen,
			    u64 ino, u64 ino_gen,
			    const char *name, int name_len)

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n2237

/*
 * Same as did_overwrite_ref, but also checks if it is the first ref of an inode
 * that got overwritten. This is used by process_recorded_refs to determine
 * if it has to use the path as returned by get_cur_path or the orphan name.
 */
static int did_overwrite_first_ref(struct send_ctx *sctx, u64 ino, u64 gen)

 
...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n2398

/*
 * Magic happens here. This function returns the first ref to an inode as it
 * would look like while receiving the stream at this point in time.
 * We walk the path up to the root. For every inode in between, we check if it
 * was already processed/sent. If yes, we continue with the parent as found
 * in send_root. If not, we continue with the parent as found in parent_root.
 * If we encounter an inode that was deleted at this point in time, we use the
 * inodes "orphan" name instead of the real name and stop. Same with new inodes
 * that were not created yet and overwritten inodes/refs.
 *
 * When do we have orphan inodes:
 * 1. When an inode is freshly created and thus no valid refs are available yet
 * 2. When a directory lost all it's refs (deleted) but still has dir items
 *    inside which were not processed yet (pending for move/delete). If anyone
 *    tried to get the path to the dir items, it would get a path inside that
 *    orphan directory.
 * 3. When an inode is moved around or gets new links, it may overwrite the ref
 *    of an unprocessed inode. If in that case the first ref would be
 *    overwritten, the overwritten inode gets "orphanized". Later when we
 *    process this overwritten inode, it is restored at a new place by moving
 *    the orphan inode.
 *
 * sctx->send_progress tells this function at which point in time receiving
 * would be.
 */
static int get_cur_path(struct send_ctx *sctx, u64 ino, u64 gen,
			struct fs_path *dest)

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n2924

/*
 * We need some special handling for inodes that get processed before the parent
 * directory got created. See process_recorded_refs for details.
 * This function does the check if we already created the dir out of order.
 */
static int did_create_dir(struct send_ctx *sctx, u64 dir)

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n3074

/*
 * Renames/moves a file/dir to its orphan name. Used when the first
 * ref of an unprocessed inode gets overwritten and for all non empty
 * directories.
 */
static int orphanize_inode(struct send_ctx *sctx, u64 ino, u64 gen,
			  struct fs_path *path)


...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n3709

/*
 * We might need to delay a directory rename even when no ancestor directory
 * (in the send root) with a higher inode number than ours (sctx->cur_ino) was
 * renamed. This happens when we rename a directory to the old name (the name
 * in the parent root) of some other unrelated directory that got its rename
 * delayed due to some ancestor with higher number that got renamed.
 *
 * Example:
 *
 * Parent snapshot:
 * .                                       (ino 256)
 * |---- a/                                (ino 257)
 * |     |---- file                        (ino 260)
 * |
 * |---- b/                                (ino 258)
 * |---- c/                                (ino 259)
 *
 * Send snapshot:
 * .                                       (ino 256)
 * |---- a/                                (ino 258)
 * |---- x/                                (ino 259)
 *       |---- y/                          (ino 257)
 *             |----- file                 (ino 260)
 *
 * Here we can not rename 258 from 'b' to 'a' without the rename of inode 257
 * from 'a' to 'x/y' happening first, which in turn depends on the rename of
 * inode 259 from 'c' to 'x'. So the order of rename commands the send stream
 * must issue is:
 *
 * 1 - rename 259 from 'c' to 'x'
 * 2 - rename 257 from 'a' to 'x/y'
 * 3 - rename 258 from 'b' to 'a'
 *
 * Returns 1 if the rename of sctx->cur_ino needs to be delayed, 0 if it can
 * be done right away and < 0 on error.
 */
static int wait_for_dest_dir_move(struct send_ctx *sctx,
				  struct recorded_ref *parent_ref,
				  const bool is_orphan)
...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n3970

/*
 * Our current directory inode may not yet be renamed/moved because some
 * ancestor (immediate or not) has to be renamed/moved first. So find if
 * such ancestor exists and make sure our own rename/move happens after
 * that ancestor is processed to avoid path build infinite loops (done
 * at get_cur_path()).
 */
while (ino > BTRFS_FIRST_FREE_OBJECTID) {
	u64 parent_ino_after_gen;

	if (is_waiting_for_move(sctx, ino)) {
		/*
		 * If the current inode is an ancestor of ino in the
		 * parent root, we need to delay the rename of the
		 * current inode, otherwise don't delayed the rename
		 * because we can end up with a circular dependency
		 * of renames, resulting in some directories never
		 * getting the respective rename operations issued in
		 * the send stream or getting into infinite path build
		 * loops.
		 */
		ret = is_ancestor(sctx->parent_root,
				  sctx->cur_ino, sctx->cur_inode_gen,
				  ino, path_before);
		if (ret)
			break;

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n4082

/*
 * When processing the new references for an inode we may orphanize an existing
 * directory inode because its old name conflicts with one of the new references
 * of the current inode. Later, when processing another new reference of our
 * inode, we might need to orphanize another inode, but the path we have in the
 * reference reflects the pre-orphanization name of the directory we previously
 * orphanized. For example:
 *
 * parent snapshot looks like:
 *
 * .                                     (ino 256)
 * |----- f1                             (ino 257)
 * |----- f2                             (ino 258)
 * |----- d1/                            (ino 259)
 *        |----- d2/                     (ino 260)
 *
 * send snapshot looks like:
 *
 * .                                     (ino 256)
 * |----- d1                             (ino 258)
 * |----- f2/                            (ino 259)
 *        |----- f2_link/                (ino 260)
 *        |       |----- f1              (ino 257)
 *        |
 *        |----- d2                      (ino 258)
 *
 * When processing inode 257 we compute the name for inode 259 as "d1", and we
 * cache it in the name cache. Later when we start processing inode 258, when
 * collecting all its new references we set a full path of "d1/d2" for its new
 * reference with name "d2". When we start processing the new references we
 * start by processing the new reference with name "d1", and this results in
 * orphanizing inode 259, since its old reference causes a conflict. Then we
 * move on the next new reference, with name "d2", and we find out we must
 * orphanize inode 260, as its old reference conflicts with ours - but for the
 * orphanization we use a source path corresponding to the path we stored in the
 * new reference, which is "d1/d2" and not "o259-6-0/d2" - this makes the
 * receiver fail since the path component "d1/" no longer exists, it was renamed
 * to "o259-6-0/" when processing the previous new reference. So in this case we
 * must recompute the path in the new reference and use it for the new
 * orphanization operation.
 */
static int refresh_ref_path(struct send_ctx *sctx, struct recorded_ref *ref)

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n4216

/*
 * Before doing any rename and link operations, do a first pass on the
 * new references to orphanize any unprocessed inodes that may have a
 * reference that conflicts with one of the new references of the current
 * inode. This needs to happen first because a new reference may conflict
 * with the old reference of a parent directory, so we must make sure
 * that the path used for link and rename commands don't use an
 * orphanized name when an ancestor was not yet orphanized.
 *
 * Example:
 *
 * Parent snapshot:
 *
 * .                                                      (ino 256)
 * |----- testdir/                                        (ino 259)
 * |          |----- a                                    (ino 257)
 * |
 * |----- b                                               (ino 258)
 *
 * Send snapshot:
 *
 * .                                                      (ino 256)
 * |----- testdir_2/                                      (ino 259)
 * |          |----- a                                    (ino 260)
 * |
 * |----- testdir                                         (ino 257)
 * |----- b                                               (ino 257)
 * |----- b2                                              (ino 258)
 *
 * Processing the new reference for inode 257 with name "b" may happen
 * before processing the new reference with name "testdir". If so, we
 * must make sure that by the time we send a link command to create the
 * hard link "b", inode 259 was already orphanized, since the generated
 * path in "valid_path" already contains the orphanized name for 259.
 * We are processing inode 257, so only later when processing 259 we do
 * the rename operation to change its temporary (orphanized) name to
 * "testdir_2".
 */
list_for_each_entry(cur, &sctx->new_refs, list) {

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n6633

/*
 * We have processed the refs and thus need to advance send_progress.
 * Now, calls to get_cur_xxx will take the updated refs of the current
 * inode into account.
 *
 * On the other hand, if our current inode is a directory and couldn't
 * be moved/renamed because its parent was renamed/moved too and it has
 * a higher inode number, we can only move/rename our current inode
 * after we moved/renamed its parent. Therefore in this case operate on
 * the old path (pre move/rename) of our current inode, and the
 * move/rename will be performed later.
 */

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n6865

/*
	 * Normally we do not find inodes with a link count of zero (orphans)
	 * because the most common case is to create a snapshot and use it
	 * for a send operation. However other less common use cases involve
	 * using a subvolume and send it after turning it to RO mode just
	 * after deleting all hard links of a file while holding an open
	 * file descriptor against it or turning a RO snapshot into RW mode,
	 * keep an open file descriptor against a file, delete it and then
	 * turn the snapshot back to RO mode before using it for a send
	 * operation. The former is what the receiver operation does.
	 * Therefore, if we want to send these snapshots soon after they're
	 * received, we need to handle orphan inodes as well. Moreover, orphans
	 * can appear not only in the send snapshot but also in the parent
	 * snapshot. Here are several cases:
	 *
	 * Case 1: BTRFS_COMPARE_TREE_NEW
	 *       |  send snapshot  | action
	 * --------------------------------
	 * nlink |        0        | ignore
	 *
	 * Case 2: BTRFS_COMPARE_TREE_DELETED
	 *       | parent snapshot | action
	 * ----------------------------------
	 * nlink |        0        | as usual
	 * Note: No unlinks will be sent because there're no paths for it.
	 *
	 * Case 3: BTRFS_COMPARE_TREE_CHANGED
	 *           |       | parent snapshot | send snapshot | action
	 * -----------------------------------------------------------------------
	 * subcase 1 | nlink |        0        |       0       | ignore
	 * subcase 2 | nlink |       >0        |       0       | new_gen(deletion)
	 * subcase 3 | nlink |        0        |      >0       | new_gen(creation)
	 *
	 */

...

from https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/tree/fs/btrfs/send.c#n7086

/*
 * We have found an extent item that changed without the inode item
 * having changed. This can happen either after relocation (where the
 * disk_bytenr of an extent item is replaced at
 * relocation.c:replace_file_extents()) or after deduplication into a
 * file in both the parent and send snapshots (where an extent item can
 * get modified or replaced with a new one). Note that deduplication
 * updates the inode item, but it only changes the iversion (sequence
 * field in the inode item) of the inode, so if a file is deduplicated
 * the same amount of times in both the parent and send snapshots, its
 * iversion becomes the same in both snapshots, whence the inode item is
 * the same on both snapshots.
 */

...

continues to 8K loc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant