Initial implementation of the metadata log #61
Conversation
Thanks, looks very good after a cursory review. I'll need more time to review it properly, but I left initial comments anyway.
Is it correct that unit tests for cut_out_data_range are not part of this pull request? If so, please add some - it's a fragile part of the code, and ranges are always error-prone.
Also, please go over the series and add some text to the commit messages - we usually only merge patches without descriptions (title only) if they are trivial (e.g. fixing a typo), while this series is rather complex. Instead of marking some titles with [see details] (which can, by the way, be removed), assume that all commit messages will have details inside them - take a look at the seastar commit history for examples of what info can be put in the description.
The "apply numerous suggestions..." commit will need to be rebased out, but I assume it just waits here until the review is finished and will be rebased out afterwards, which is fine.
Good job!
src/fs/metadata_log_bootstrap.hh
boost::crc_32_type crc;
if (not _curr_cluster.process_crc_without_reading(crc, checkpoint.checkpointed_data_length)) {
Indeed. Example:
#include <seastar/util/log.hh>
seastar::logger mlogger("fs_metadata");
...
mlogger.warn("Something went wrong: {}", something);
src/fs/metadata_log.hh
namespace seastar::fs {

struct fs_exception : public std::exception {
Maybe we would like to export all exceptions to seastar/include/fs, since they will propagate up to the user?
Now this PR depends on #80.
SeastarFS is a log-structured filesystem. Every shard will have 3 private logs:
- metadata log
- medium data log
- big data log (this is not actually a log, but in the big picture it looks like it was)

Disk space is divided into clusters (typically around several MiB) that all have equal size that is a multiple of the alignment (typically 4096 bytes). Each shard has its private pool of clusters (the assignment is stored in the bootstrap record). Each log consumes clusters one by one -- it writes the current one, and if the cluster becomes full, the log switches to a new one obtained from the pool of free clusters managed by the cluster_allocator. The metadata log and the medium data log write data in the same manner: they fill up the cluster gradually from left to right. The big data log takes a cluster and completely fills it with data at once -- it is only used during big writes.

This commit adds the skeleton of the metadata log:
- data structures for holding metadata in memory, with all operations on this data structure, i.e. manipulating files and their contents
- locking logic (a detailed description can be found in metadata_log.hh)
- buffers for writing logs to disk (one for metadata and one for medium data)
- a basic higher-level interface, e.g. path lookup, iterating over a directory
- bootstrapping the metadata log == reading the metadata log from disk and reconstructing the shard's filesystem structure from just before shutdown

File content is stored as a set of data vectors that may have one of three kinds: in-memory data, on-disk data, hole. Small writes are written directly to the metadata log, and because all metadata is stored in memory these writes are also in memory, hence the in-memory kind. Medium and large data are not stored in memory, so they are represented using the on-disk kind. Enlarging a file via truncate may produce holes, hence the hole kind. Directory entries are stored as metadata log entries -- directory inodes have no content.

To-disk buffers buffer data that will be written to disk.
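The three data-vector kinds described above can be sketched as a tagged union over file offsets. This is a minimal illustration, not the actual seastar-fs types; all names here are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <variant>
#include <vector>

// One of the three kinds a data vector may have:
struct in_memory_data { std::vector<uint8_t> bytes; };            // small writes
struct on_disk_data   { uint64_t disk_offset; uint64_t length; }; // medium/big writes
struct hole           { uint64_t length; };                       // e.g. from truncate

using data_vector = std::variant<in_memory_data, on_disk_data, hole>;

// File content: map from file offset to the data vector starting there.
using file_content = std::map<uint64_t, data_vector>;

inline uint64_t vector_length(const data_vector& dv) {
    if (auto* im = std::get_if<in_memory_data>(&dv)) return im->bytes.size();
    if (auto* od = std::get_if<on_disk_data>(&dv)) return od->length;
    return std::get<hole>(dv).length;
}

// Logical file size = offset of the last data vector plus its length.
inline uint64_t file_size(const file_content& fc) {
    if (fc.empty()) return 0;
    auto& [off, dv] = *fc.rbegin();
    return off + vector_length(dv);
}
```

A file written as 3 in-memory bytes, a 5-byte hole, then 100 on-disk bytes would then report a size of 108.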
There are two kinds: the (normal) to-disk buffer and the metadata to-disk buffer. The latter is implemented using the former, but provides a higher-level interface for appending metadata log entries rather than raw bytes. The normal to-disk buffer appends data sequentially, but if a flush occurs, the offset where the next data will be appended is aligned up to the alignment to ensure that writes to the same cluster are non-overlapping. The metadata to-disk buffer appends data using the normal to-disk buffer, but does some formatting along the way.

The structure of the metadata log on disk is as follows:
| checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
|              | <---- checkpointed data -----> |              | etc.

Every batch of metadata log entries is preceded by a checkpoint entry. Appending to the metadata log appends to the current batch of entries. A flush or lack of space ends the current batch of entries; then the checkpoint entry is updated (because it holds the CRC code of all checkpointed data), a write of the whole batch is requested, and a new checkpoint (if there is space for it) is started. The last checkpoint in a cluster contains a special entry pointing to the next cluster utilized by the metadata log.

Bootstrapping is, in fact, just a replay of all actions from the metadata log that were saved on disk. It works as follows:
- reads metadata log clusters one by one
- for each cluster, until the last checkpoint contains a pointer to the next cluster, processes the checkpoint and the entries it checkpoints
- processing works as follows:
  - the checkpoint entry is read, and if it is invalid, it means that the metadata log ends here (the last checkpoint was partially written, or the metadata log really ended here, or there was some data corruption...)
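The checkpoint framing above can be sketched as follows. This is an illustrative model only: the header layout and names are assumptions, and a trivial additive checksum stands in for the CRC-32 (boost::crc_32_type) the real code uses, so the sketch stays self-contained:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical checkpoint header: checksum plus length of the data it covers.
struct checkpoint_header {
    uint32_t checksum;                  // over the checkpointed bytes
    uint32_t checkpointed_data_length;  // number of bytes following the header
};

// Stand-in for CRC-32; NOT the real algorithm.
inline uint32_t toy_checksum(const uint8_t* data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i) sum = sum * 31 + data[i];
    return sum;
}

// Returns the checkpointed bytes, or nullopt if the checkpoint is invalid
// (which, during bootstrap, means "the metadata log ends here").
inline std::optional<std::vector<uint8_t>>
read_checkpointed_data(const std::vector<uint8_t>& cluster, size_t pos) {
    checkpoint_header hdr;
    if (pos + sizeof hdr > cluster.size()) return std::nullopt;
    std::memcpy(&hdr, cluster.data() + pos, sizeof hdr);
    size_t data_pos = pos + sizeof hdr;
    if (data_pos + hdr.checkpointed_data_length > cluster.size()) return std::nullopt;
    if (toy_checksum(cluster.data() + data_pos, hdr.checkpointed_data_length) != hdr.checksum)
        return std::nullopt; // partially written or corrupted checkpoint
    return std::vector<uint8_t>(cluster.begin() + data_pos,
                                cluster.begin() + data_pos + hdr.checkpointed_data_length);
}
```

Corrupting any checkpointed byte makes the checksum mismatch, so the reader stops exactly as the bootstrap description above requires.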
and we stop
  - if it is correct, it contains the length of the checkpointed data (the metadata log entries), so then we process all of them (an error there indicates that there was data corruption while the CRC is still somehow correct, so we abort the whole bootstrapping with an error)

Locking ensures that concurrent modifications of the metadata do not corrupt it. E.g. creating a file is a complex operation: you have to create an inode, add a directory entry that will represent this inode with a path, and write the corresponding metadata log entries to the disk. Simultaneous attempts at creating the same file could corrupt the file system. Not to mention a concurrent create and unlink on the same path... Thus a careful and robust locking mechanism is used. For details see metadata_log.hh.

Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
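The locking structure hinted at above (the real, future-based mechanism lives in metadata_log.hh) can be illustrated with ordinary mutexes: creating or unlinking a name takes a shared lock on the directory inode (so it stays alive) plus an exclusive lock on that particular directory entry, serializing create/unlink races on the same path while leaving other names unaffected. All names here are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Per-inode locks: a lock on the inode itself and one lock per directory entry.
struct inode_locks {
    std::shared_mutex inode_lock;                          // shared: inode kept alive
    std::map<std::string, std::shared_mutex> entry_locks;  // exclusive: one dir entry
    std::mutex entry_locks_guard;                          // protects the map itself

    std::shared_mutex& entry_lock(const std::string& name) {
        std::lock_guard g(entry_locks_guard);
        return entry_locks[name]; // created on first use; map references are stable
    }
};

// Sketch of create: directory held shared, the entry name held exclusively.
// Returns false if the name already exists (the racing creator lost).
bool create_file(inode_locks& dir, const std::string& name,
                 std::map<std::string, int>& entries, int inode_no) {
    std::shared_lock dir_held(dir.inode_lock);         // directory must not vanish
    std::unique_lock entry_held(dir.entry_lock(name)); // serialize per entry name
    return entries.emplace(name, inode_no).second;
}
```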
Creating an unlinked file may be useful for a temporary file, or to expose the file via a path only after it is filled with contents. Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Some operations need to schedule deleting an inode in the background. One of these is closing an unlinked file if nobody else holds it open. Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Allows the same file to be visible via different paths or to give a path to an unlinked file. Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Marks that the file is opened by increasing the opened file counter. Signed-off-by: Michał Niciejewski <quport@gmail.com>
Decreases opened file counter. If the file is unlinked and the counter is zero then the file is automatically removed. Signed-off-by: Michał Niciejewski <quport@gmail.com>
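The open/close counter semantics of the two commits above can be modeled in a few lines. The names are illustrative, and where the real code schedules removal in the background, this sketch just flags it:

```cpp
#include <cassert>

// Minimal model of an inode's open-tracking state (hypothetical names).
struct inode_info {
    unsigned opened_count = 0; // how many times the file is currently open
    bool unlinked = false;     // no path points at this inode anymore
    bool removed = false;      // inode has been (or is scheduled to be) deleted
};

inline void open_file(inode_info& in) { ++in.opened_count; }

// Returns true when this close triggered removal of the inode.
inline bool close_file(inode_info& in) {
    assert(in.opened_count > 0);
    --in.opened_count;
    if (in.unlinked && in.opened_count == 0) {
        in.removed = true; // the real code schedules deletion in the background
        return true;
    }
    return false;
}
```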
Each write can be divided into multiple smaller writes that fall into one of the following categories:
- small write: writes below SMALL_WRITE_THRESHOLD bytes; those writes are stored fully in memory
- medium write: writes above SMALL_WRITE_THRESHOLD and below cluster_size bytes; those writes are stored on disk, appended to the on-disk data log where data from different writes can be stored in one cluster
- big write: writes that completely fill one cluster, stored on disk

For example, one write can be divided into multiple big writes, some small writes and some medium writes. The current implementation does not make any unnecessary data copies: data given by the caller is either used directly to write to disk or is copied as a small write.

Added a cluster writer, which is used to perform medium writes. The cluster writer keeps the current position in the data log and appends new data by writing it directly to disk.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
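The size-based splitting above can be sketched as follows. The constant values, names, and boundary behavior are assumptions (the commit only fixes the three categories), and for simplicity this ignores the additional splitting caused by unaligned buffers:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

constexpr uint64_t cluster_size = 1u << 20;        // assumed: 1 MiB clusters
constexpr uint64_t SMALL_WRITE_THRESHOLD = 8192;   // assumed value

struct write_part { uint64_t offset, length; std::string kind; };

// Split one logical write: carve out whole clusters as big writes,
// then classify the remainder by size.
inline std::vector<write_part> split_write(uint64_t offset, uint64_t length) {
    std::vector<write_part> parts;
    while (length > 0) {
        uint64_t len;
        std::string kind;
        if (length >= cluster_size) {
            len = cluster_size; kind = "big";    // completely fills a cluster
        } else if (length >= SMALL_WRITE_THRESHOLD) {
            len = length; kind = "medium";       // appended to the data log
        } else {
            len = length; kind = "small";        // kept in the metadata log
        }
        parts.push_back({offset, len, kind});
        offset += len;
        length -= len;
    }
    return parts;
}
```

So a write of cluster_size + SMALL_WRITE_THRESHOLD + 10 bytes becomes one big write followed by one medium write.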
Truncate can be used on a file to change its size. When the new size is lower than the current one, the data at higher offsets is lost; when it is larger, the file is filled with null bytes. Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Reads file data from disk and memory based on information stored in the inode's data vectors. Unoptimized version: reads from disk always go into temporary buffers before copying to the buffer given by the caller. Signed-off-by: Michał Niciejewski <quport@gmail.com>
Provides an interface to query file attributes, which include permissions, btime, mtime and ctime. Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
The test checks, on small examples, that the data written to disk by a to_disk_buffer is the same as the data appended to the buffer, and that the remaining buffer space is correctly calculated. Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Added mockers:
- mockers store information about every operation
- a list of virtually created mockers is stored

Added tests for the metadata_to_disk_buffer mocker. The tests check that the mocker behaves similarly to metadata_to_disk_buffer. Signed-off-by: Michał Niciejewski <quport@gmail.com>
- random tests
- tests for corner cases:
  * basic single small writes
  * basic single medium writes
  * basic single large writes
  * new cluster allocation for medium writes
  * medium write split into two smaller writes due to lack of space in the data-log cluster
  * single write split into more smaller writes because of an unaligned buffer
  * big write (bigger than cluster size) split into multiple writes

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Checks whether the data that will be written to disk after truncate is correct, the reads from a truncated file are accurate, and the file's metadata is set to the new size. Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
For every on-disk entry, check that:
- it is correctly appended to the buffer when it would fit
- the buffer returns TOO_BIG when it would not fit
- it is written to disk after a successful append and flush.

Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Optimization for aligned reads. When the on-disk data and the given buffer are properly aligned, the data read from disk is not staged in a temporary buffer but is read directly into the buffer given by the caller. Added a device_reader to perform unaligned reads with caching. Signed-off-by: Michał Niciejewski <quport@gmail.com>
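The precondition for this fast path can be sketched as an alignment predicate: the read can go straight into the caller's buffer only when the disk offset, the destination pointer, and the size are all aligned to the device alignment. The function name and exact predicate are assumptions, not the real seastar-fs code:

```cpp
#include <cassert>
#include <cstdint>

// True when a disk read may bypass the temporary bounce buffer.
inline bool can_read_directly(uint64_t disk_offset, const void* dst,
                              uint64_t size, uint64_t alignment) {
    return disk_offset % alignment == 0 &&
           reinterpret_cast<uintptr_t>(dst) % alignment == 0 &&
           size % alignment == 0;
}
```

If any of the three is misaligned, the reader falls back to the temporary-buffer path described in the earlier read commit.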
A random test checking the aligned writes and reads optimizations. Signed-off-by: Michał Niciejewski <quport@gmail.com>
Checks if there is access to the newly created directories after bootstrapping. Signed-off-by: Aleksander Sorokin <ankezy@gmail.com>