Initial implementation of the metadata log #61
Conversation
Thanks, looks very good after a cursory review. I'll need more time to review it properly, but I left initial comments anyway.
Is it correct that unit tests for cut_out_data_range are not part of this pull request? If so, please add some - it's a fragile part of the code, and ranges are always error-prone.
Also, please go over the series and add some text to the commit messages - we usually only merge patches without descriptions (title only) if they are trivial (e.g. fixing a typo), while this series is rather complex. Instead of marking some titles with [see details] (which can, by the way, be removed), assume that all commit messages will have details inside them - take a look at the seastar commit history for examples of what info can be put in the description.
The "apply numerous suggestions..." commit will need to be rebased out, but I assume it just waits here until the review is finished and will be rebased out afterwards, which is fine.
Good job!
src/fs/metadata_log_bootstrap.hh
boost::crc_32_type crc;
if (not _curr_cluster.process_crc_without_reading(crc, checkpoint.checkpointed_data_length)) {
Indeed. Example:
#include <seastar/util/log.hh>
seastar::logger mlogger("fs_metadata");
...
mlogger.warn("Something went wrong: {}", something);
src/fs/metadata_log.hh
namespace seastar::fs {

struct fs_exception : public std::exception {
Maybe we would like to export all exceptions to seastar/include/fs, since they will propagate up to the user?
Now this PR depends on #80.
SeastarFS is a log-structured filesystem. Every shard will have 3 private logs:
- metadata log
- medium data log
- big data log (this is not actually a log, but in the big picture it looks like it was)

Disk space is divided into clusters (typically around several MiB) that all have equal size that is a multiple of the alignment (typically 4096 bytes). Each shard has its private pool of clusters (the assignment is stored in the bootstrap record). Each log consumes clusters one by one -- it writes the current one, and if the cluster becomes full, the log switches to a new one obtained from the pool of free clusters managed by the cluster_allocator. The metadata log and the medium data log write data in the same manner: they fill up the cluster gradually from left to right. The big data log takes a cluster and completely fills it with data at once -- it is only used during big writes.

This commit adds the skeleton of the metadata log:
- data structures for holding metadata in memory, with all operations on this data structure, i.e. manipulating files and their contents
- locking logic (a detailed description can be found in metadata_log.hh)
- buffers for writing logs to disk (one for metadata and one for medium data)
- a basic higher-level interface, e.g. path lookup, iterating over a directory
- bootstrapping the metadata log == reading the metadata log from disk and reconstructing the shard's filesystem structure from just before shutdown

File content is stored as a set of data vectors that may have one of three kinds: in-memory data, on-disk data, hole. Small writes are written directly to the metadata log, and because all metadata is stored in memory these writes are also in memory, hence the in-memory kind. Medium and large data are not stored in memory, so they are represented using the on-disk kind. Enlarging a file via truncate may produce holes, hence the hole kind. Directory entries are stored as metadata log entries -- directory inodes have no content.

To-disk buffers buffer data that will be written to disk.
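The three data-vector kinds described above can be sketched as a tagged union over file offsets. This is a minimal illustration, not the actual seastar-fs types; all names here are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <variant>
#include <vector>

// One of the three kinds a data vector may have:
struct in_memory_data { std::vector<uint8_t> bytes; };            // small writes
struct on_disk_data   { uint64_t disk_offset; uint64_t length; }; // medium/big writes
struct hole           { uint64_t length; };                       // e.g. from truncate

using data_vector = std::variant<in_memory_data, on_disk_data, hole>;

// File content: map from file offset to the data vector starting there.
using file_content = std::map<uint64_t, data_vector>;

inline uint64_t vector_length(const data_vector& dv) {
    if (auto* im = std::get_if<in_memory_data>(&dv)) return im->bytes.size();
    if (auto* od = std::get_if<on_disk_data>(&dv)) return od->length;
    return std::get<hole>(dv).length;
}

// Logical file size = offset of the last data vector plus its length.
inline uint64_t file_size(const file_content& fc) {
    if (fc.empty()) return 0;
    auto& [off, dv] = *fc.rbegin();
    return off + vector_length(dv);
}
```

A file written as 3 in-memory bytes, a 5-byte hole, then 100 on-disk bytes would then report a size of 108.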
There are two kinds: the (normal) to-disk buffer and the metadata to-disk buffer. The latter is implemented using the former, but provides a higher-level interface for appending metadata log entries rather than raw bytes. The normal to-disk buffer appends data sequentially, but if a flush occurs, the offset where the next data will be appended is aligned up to the alignment to ensure that writes to the same cluster are non-overlapping. The metadata to-disk buffer appends data using the normal to-disk buffer, but does some formatting along the way.

The structure of the metadata log on disk is as follows:
| checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
|              | <---- checkpointed data -----> |              | etc.

Every batch of metadata log entries is preceded by a checkpoint entry. Appending to the metadata log appends to the current batch of entries. A flush or lack of space ends the current batch of entries; then the checkpoint entry is updated (because it holds the CRC code of all checkpointed data), a write of the whole batch is requested, and a new checkpoint (if there is space for it) is started. The last checkpoint in a cluster contains a special entry pointing to the next cluster utilized by the metadata log.

Bootstrapping is, in fact, just a replay of all actions from the metadata log that were saved on disk. It works as follows:
- reads metadata log clusters one by one
- for each cluster, until the last checkpoint contains a pointer to the next cluster, processes the checkpoint and the entries it checkpoints
- processing works as follows:
  - the checkpoint entry is read, and if it is invalid, it means that the metadata log ends here (the last checkpoint was partially written, or the metadata log really ended here, or there was some data corruption...)
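The checkpoint framing above can be sketched as follows. This is an illustrative model only: the header layout and names are assumptions, and a trivial additive checksum stands in for the CRC-32 (boost::crc_32_type) the real code uses, so the sketch stays self-contained:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical checkpoint header: checksum plus length of the data it covers.
struct checkpoint_header {
    uint32_t checksum;                  // over the checkpointed bytes
    uint32_t checkpointed_data_length;  // number of bytes following the header
};

// Stand-in for CRC-32; NOT the real algorithm.
inline uint32_t toy_checksum(const uint8_t* data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i) sum = sum * 31 + data[i];
    return sum;
}

// Returns the checkpointed bytes, or nullopt if the checkpoint is invalid
// (which, during bootstrap, means "the metadata log ends here").
inline std::optional<std::vector<uint8_t>>
read_checkpointed_data(const std::vector<uint8_t>& cluster, size_t pos) {
    checkpoint_header hdr;
    if (pos + sizeof hdr > cluster.size()) return std::nullopt;
    std::memcpy(&hdr, cluster.data() + pos, sizeof hdr);
    size_t data_pos = pos + sizeof hdr;
    if (data_pos + hdr.checkpointed_data_length > cluster.size()) return std::nullopt;
    if (toy_checksum(cluster.data() + data_pos, hdr.checkpointed_data_length) != hdr.checksum)
        return std::nullopt; // partially written or corrupted checkpoint
    return std::vector<uint8_t>(cluster.begin() + data_pos,
                                cluster.begin() + data_pos + hdr.checkpointed_data_length);
}
```

Corrupting any checkpointed byte makes the checksum mismatch, so the reader stops exactly as the bootstrap description above requires.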
and we stop
  - if it is correct, it contains the length of the checkpointed data (the metadata log entries), so then we process all of them (an error there indicates that there was data corruption while the CRC is still somehow correct, so we abort the whole bootstrapping with an error)

Locking ensures that concurrent modifications of the metadata do not corrupt it. E.g. creating a file is a complex operation: you have to create an inode, add a directory entry that will represent this inode with a path, and write the corresponding metadata log entries to the disk. Simultaneous attempts at creating the same file could corrupt the file system. Not to mention a concurrent create and unlink on the same path... Thus a careful and robust locking mechanism is used. For details see metadata_log.hh.

Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
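The locking structure hinted at above (the real, future-based mechanism lives in metadata_log.hh) can be illustrated with ordinary mutexes: creating or unlinking a name takes a shared lock on the directory inode (so it stays alive) plus an exclusive lock on that particular directory entry, serializing create/unlink races on the same path while leaving other names unaffected. All names here are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Per-inode locks: a lock on the inode itself and one lock per directory entry.
struct inode_locks {
    std::shared_mutex inode_lock;                          // shared: inode kept alive
    std::map<std::string, std::shared_mutex> entry_locks;  // exclusive: one dir entry
    std::mutex entry_locks_guard;                          // protects the map itself

    std::shared_mutex& entry_lock(const std::string& name) {
        std::lock_guard g(entry_locks_guard);
        return entry_locks[name]; // created on first use; map references are stable
    }
};

// Sketch of create: directory held shared, the entry name held exclusively.
// Returns false if the name already exists (the racing creator lost).
bool create_file(inode_locks& dir, const std::string& name,
                 std::map<std::string, int>& entries, int inode_no) {
    std::shared_lock dir_held(dir.inode_lock);         // directory must not vanish
    std::unique_lock entry_held(dir.entry_lock(name)); // serialize per entry name
    return entries.emplace(name, inode_no).second;
}
```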
Creating an unlinked file may be useful for a temporary file, or to expose the file via a path only after it is filled with contents. Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Some operations need to schedule deleting an inode in the background. One of these is closing an unlinked file if nobody else holds it open. Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Allows the same file to be visible via different paths or to give a path to an unlinked file. Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Marks that the file is opened by increasing the opened file counter. Signed-off-by: Michał Niciejewski <quport@gmail.com>
Decreases opened file counter. If the file is unlinked and the counter is zero then the file is automatically removed. Signed-off-by: Michał Niciejewski <quport@gmail.com>
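The open/close counter semantics of the two commits above can be modeled in a few lines. The names are illustrative, and where the real code schedules removal in the background, this sketch just flags it:

```cpp
#include <cassert>

// Minimal model of an inode's open-tracking state (hypothetical names).
struct inode_info {
    unsigned opened_count = 0; // how many times the file is currently open
    bool unlinked = false;     // no path points at this inode anymore
    bool removed = false;      // inode has been (or is scheduled to be) deleted
};

inline void open_file(inode_info& in) { ++in.opened_count; }

// Returns true when this close triggered removal of the inode.
inline bool close_file(inode_info& in) {
    assert(in.opened_count > 0);
    --in.opened_count;
    if (in.unlinked && in.opened_count == 0) {
        in.removed = true; // the real code schedules deletion in the background
        return true;
    }
    return false;
}
```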
Each write can be divided into multiple smaller writes that fall into one of the following categories:
- small write: writes below SMALL_WRITE_THRESHOLD bytes; those writes are stored fully in memory
- medium write: writes above SMALL_WRITE_THRESHOLD and below cluster_size bytes; those writes are stored on disk, appended to the on-disk data log where data from different writes can be stored in one cluster
- big write: writes that completely fill one cluster, stored on disk

For example, one write can be divided into multiple big writes, some small writes and some medium writes. The current implementation does not make any unnecessary data copies: data given by the caller is either used directly to write to disk or is copied as a small write.

Added a cluster writer, which is used to perform medium writes. The cluster writer keeps the current position in the data log and appends new data by writing it directly to disk.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
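The size-based splitting above can be sketched as follows. The constant values, names, and boundary behavior are assumptions (the commit only fixes the three categories), and for simplicity this ignores the additional splitting caused by unaligned buffers:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

constexpr uint64_t cluster_size = 1u << 20;        // assumed: 1 MiB clusters
constexpr uint64_t SMALL_WRITE_THRESHOLD = 8192;   // assumed value

struct write_part { uint64_t offset, length; std::string kind; };

// Split one logical write: carve out whole clusters as big writes,
// then classify the remainder by size.
inline std::vector<write_part> split_write(uint64_t offset, uint64_t length) {
    std::vector<write_part> parts;
    while (length > 0) {
        uint64_t len;
        std::string kind;
        if (length >= cluster_size) {
            len = cluster_size; kind = "big";    // completely fills a cluster
        } else if (length >= SMALL_WRITE_THRESHOLD) {
            len = length; kind = "medium";       // appended to the data log
        } else {
            len = length; kind = "small";        // kept in the metadata log
        }
        parts.push_back({offset, len, kind});
        offset += len;
        length -= len;
    }
    return parts;
}
```

So a write of cluster_size + SMALL_WRITE_THRESHOLD + 10 bytes becomes one big write followed by one medium write.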
Truncate can be used on a file to change its size. When the new size is lower than the current one, the data at higher offsets is lost; when it is larger, the file is filled with null bytes. Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Reads file data from disk and memory based on information stored in the inode's data vectors. Unoptimized version: reads from disk always go into temporary buffers before copying to the buffer given by the caller. Signed-off-by: Michał Niciejewski <quport@gmail.com>
Provides an interface to query file attributes, which include permissions, btime, mtime and ctime. Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
The test checks, on small examples, that the data written to disk by a to_disk_buffer is the same as the data appended to the buffer, and that the remaining buffer space is correctly calculated. Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Added mockers:
- mockers store information about every operation
- a list of virtually created mockers is stored

Added tests for the metadata_to_disk_buffer mocker. The tests check that the mocker behaves similarly to metadata_to_disk_buffer. Signed-off-by: Michał Niciejewski <quport@gmail.com>
- random tests
- tests for corner cases:
  * basic single small writes
  * basic single medium writes
  * basic single large writes
  * new cluster allocation for medium writes
  * medium write split into two smaller writes due to lack of space in the data-log cluster
  * single write split into more smaller writes because of an unaligned buffer
  * big write (bigger than cluster size) split into multiple writes

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Checks whether the data that will be written to disk after truncate is correct, the reads from a truncated file are accurate, and the file's metadata is set to the new size. Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
For every on-disk entry, check that:
- it is correctly appended to the buffer when it would fit
- the buffer returns TOO_BIG when it would not fit
- it is written to disk after a successful append and flush.

Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Optimization for aligned reads. When the on-disk data and the given buffer are properly aligned, the data read from disk is not staged in a temporary buffer but is read directly into the buffer given by the caller. Added a device_reader to perform unaligned reads with caching. Signed-off-by: Michał Niciejewski <quport@gmail.com>
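The precondition for this fast path can be sketched as an alignment predicate: the read can go straight into the caller's buffer only when the disk offset, the destination pointer, and the size are all aligned to the device alignment. The function name and exact predicate are assumptions, not the real seastar-fs code:

```cpp
#include <cassert>
#include <cstdint>

// True when a disk read may bypass the temporary bounce buffer.
inline bool can_read_directly(uint64_t disk_offset, const void* dst,
                              uint64_t size, uint64_t alignment) {
    return disk_offset % alignment == 0 &&
           reinterpret_cast<uintptr_t>(dst) % alignment == 0 &&
           size % alignment == 0;
}
```

If any of the three is misaligned, the reader falls back to the temporary-buffer path described in the earlier read commit.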
A random test checking the aligned writes and reads optimizations. Signed-off-by: Michał Niciejewski <quport@gmail.com>
Checks if there is access to the newly created directories after bootstrapping. Signed-off-by: Aleksander Sorokin <ankezy@gmail.com>