
Initial implementation of the metadata log #61

Merged · 20 commits into zpp_fs · Apr 20, 2020
Conversation

@varqox varqox commented Feb 23, 2020

No description provided.

Review comments (resolved) on: include/seastar/fs/path.hh, src/fs/metadata_log.cc, src/fs/metadata_log.hh, src/fs/metadata_to_disk_buffer.hh, src/fs/to_disk_buffer.hh
@psarna (Owner) left a comment

Thanks, looks very good after a cursory review. I'll need more time to review it properly, but I left initial comments anyway.

Is it correct that unit tests for cut_out_data_range are not part of this pull request? If so, please add some - it's a fragile part of the code, and ranges are always error-prone.

Also, please go over the series and add some text to the commit messages - we usually merge patches without descriptions (title only) only if they are trivial (e.g. fixing a typo), while this series is rather complex. Instead of marking some titles with [see details] (which, by the way, can be removed), assume that all commit messages will have details inside them - take a look at the seastar commit history for examples of what info can go in a description.

The "apply numerous suggestions..." commit will need to be rebased out, but I assume it just waits here until the review is finished and will be rebased out afterwards, which is fine.

Good job!

Further review comments (resolved) on: src/fs/inode_info.hh, src/fs/metadata_log.cc
}

boost::crc_32_type crc;
if (not _curr_cluster.process_crc_without_reading(crc, checkpoint.checkpointed_data_length)) {
@psarna (Owner) commented:
Indeed. Example:

#include <seastar/util/log.hh>
logger mlogger("fs_metadata");
...
mlogger.warn("Something went wrong: {}", something);

Review comments (resolved) on: src/fs/metadata_log_bootstrap.hh, src/fs/metadata_log_operations/create_file.hh, src/fs/inode_info.hh

namespace seastar::fs {

struct fs_exception : public std::exception {
@varqox (Author) commented Feb 25, 2020:

Maybe we would like to export all exceptions to seastar/include/fs, since they will propagate up to the user?

@varqox (Author) commented Feb 26, 2020:

Now this PR depends on #80.

@varqox force-pushed the fs-metadata-log branch 2 times, most recently from 6d52430 to 28a66c8 on February 28, 2020 12:16
varqox and others added 20 commits April 20, 2020 09:39
SeastarFS is a log-structured filesystem. Every shard will have 3
private logs:
- metadata log
- medium data log
- big data log (this is not actually a log, but in the big picture it
  behaves like one)

Disk space is divided into clusters (typically several MiB each) that
all have equal size, a multiple of the alignment (typically 4096 bytes).
Each shard has its private pool of clusters (the assignment is stored in
the bootstrap record). Each log consumes clusters one by one -- it
writes the current one and, when the cluster becomes full, switches to a
new one obtained from the pool of free clusters managed by the
cluster_allocator. The metadata log and the medium data log write data
in the same manner: they fill up the cluster gradually from left to
right. The big data log takes a cluster and completely fills it with
data at once -- it is only used during big writes.

This commit adds the skeleton of the metadata log:
- data structures for holding metadata in memory, with all operations on
  them, i.e. manipulating files and their contents
- locking logic (a detailed description can be found in metadata_log.hh)
- buffers for writing logs to disk (one for metadata and one for medium
  data)
- basic higher-level interface, e.g. path lookup, iterating over a
  directory
- bootstrapping the metadata log, i.e. reading the metadata log from
  disk and reconstructing the shard's filesystem structure from just
  before shutdown
File content is stored as a set of data vectors that may have one of
three kinds: in-memory data, on-disk data, or a hole. Small writes are
written directly to the metadata log, and because all metadata is stored
in memory these writes are also in memory, hence the in-memory kind.
Medium and large data are not stored in memory, so they are represented
using the on-disk kind. Enlarging a file via truncate may produce holes,
hence the hole kind.
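The three kinds of data vectors could be modeled with a tagged union. The sketch below is hypothetical; the real types in src/fs/inode_info.hh may look quite different:

```cpp
#include <cstdint>
#include <variant>
#include <vector>

// Illustrative model of the three data-vector kinds described above.
struct in_memory_data { std::vector<uint8_t> data; }; // small writes
struct on_disk_data   { uint64_t disk_offset; };      // medium/big writes
struct hole           {};                             // made by truncate

struct data_vector {
    uint64_t file_offset; // where in the file this vector starts
    uint64_t length;      // how many bytes it covers
    std::variant<in_memory_data, on_disk_data, hole> kind;
};

// Reading a hole yields zeros; only in-memory vectors carry bytes directly.
inline bool is_hole(const data_vector& dv) {
    return std::holds_alternative<hole>(dv.kind);
}
```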

Directory entries are stored as metadata log entries -- directory inodes
have no content.

To-disk buffers hold data that will be written to disk. There are two
kinds: the (normal) to-disk buffer and the metadata to-disk buffer. The
latter is implemented using the former, but provides a higher-level
interface for appending metadata log entries rather than raw bytes.

The normal to-disk buffer appends data sequentially, but when a flush
occurs the offset where the next data will be appended is aligned up to
the alignment to ensure that writes to the same cluster are
non-overlapping.
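The align-up step after a flush amounts to rounding the append offset to the next multiple of the alignment. A minimal sketch, assuming the alignment is a power of two such as 4096:

```cpp
#include <cassert>
#include <cstdint>

// Rounds offset up to the next multiple of alignment, so the next
// append after a flush starts on a fresh aligned block and writes to
// the same cluster never overlap. Assumes alignment is a power of two.
inline uint64_t align_up(uint64_t offset, uint64_t alignment) {
    assert((alignment & (alignment - 1)) == 0); // power-of-two check
    return (offset + alignment - 1) & ~(alignment - 1);
}
```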

The metadata to-disk buffer appends data using the normal to-disk buffer
but does some formatting along the way. The structure of the metadata
log on disk is as follows:
| checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
               | <---- checkpointed data -----> |
etc. Every batch of metadata log entries is preceded by a checkpoint
entry. Appending to the metadata log appends to the current batch of
entries. A flush or lack of space ends the current batch: the checkpoint
entry is updated (because it holds the CRC code of all checkpointed
data), a write of the whole batch is requested, and a new checkpoint (if
there is space for it) is started. The last checkpoint in a cluster
contains a special entry pointing to the next cluster utilized by the
metadata log.
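The checkpoint entry holds a CRC code of all checkpointed data; the real code uses boost::crc_32_type. For illustration, a self-contained bitwise CRC-32 (reflected polynomial 0xEDB88320, the same parameters boost::crc_32_type uses) computes the same check value:

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 over a byte range: init 0xFFFFFFFF, reflected
// polynomial 0xEDB88320, final XOR with 0xFFFFFFFF. This matches the
// value boost::crc_32_type would produce for the checkpointed data.
uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k) {
            // Conditionally XOR the polynomial if the low bit is set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

During bootstrap, the checkpoint is accepted only if the CRC recomputed over the checkpointed data matches the stored value.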

Bootstrapping is, in fact, just replaying all actions from the metadata
log that were saved on disk. It works as follows:
- reads metadata log clusters one by one
- for each cluster, processes each checkpoint and the entries it covers,
  continuing to the next cluster only if the last checkpoint contains a
  pointer to it
- processing works as follows:
  - the checkpoint entry is read; if it is invalid, the metadata log
    ends here (the last checkpoint was partially written, or the
    metadata log really ended here, or there was some data
    corruption...) and we stop
  - if it is correct, it contains the length of the checkpointed data
    (metadata log entries), so we process all of them (an error there
    indicates that there was data corruption but the CRC is still
    somehow correct, so we abort all bootstrapping with an error)
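The steps above can be sketched as a replay loop. Every type and callback below is a hypothetical stand-in, not the real metadata_log_bootstrap interface:

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical parsed checkpoint: validity of its CRC, whether its
// entries replayed cleanly, and an optional pointer to the next cluster
// (present only in the last checkpoint of a cluster).
struct checkpoint {
    bool valid;
    bool entries_ok;
    std::optional<uint64_t> next_cluster;
};

enum class bootstrap_result { ok, corruption };

// Replays the metadata log starting from first_cluster. read_cluster is
// a stand-in for reading and parsing one cluster's checkpoints.
bootstrap_result replay(uint64_t first_cluster,
        const std::function<std::vector<checkpoint>(uint64_t)>& read_cluster) {
    std::optional<uint64_t> cluster = first_cluster;
    while (cluster) {
        std::optional<uint64_t> next;
        for (const checkpoint& cp : read_cluster(*cluster)) {
            if (!cp.valid) {
                break; // log ends here: partial write or the real end
            }
            if (!cp.entries_ok) {
                return bootstrap_result::corruption; // CRC ok, entries bad
            }
            next = cp.next_cluster; // set only by the last checkpoint
        }
        cluster = next; // no pointer to a next cluster => replay is done
    }
    return bootstrap_result::ok;
}
```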

Locking ensures that concurrent modifications of the metadata do not
corrupt it. E.g. creating a file is a complex operation: you have to
create an inode, add a directory entry that will represent this inode
with a path, and write the corresponding metadata log entries to disk.
Simultaneous attempts to create the same file could corrupt the file
system. Not to mention a concurrent create and unlink on the same
path... Thus a careful and robust locking mechanism is used. For details
see metadata_log.hh.

Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Creating an unlinked file may be useful as a temporary file, or to
expose the file via a path only after it is filled with contents.

Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Some operations need to schedule deleting an inode in the background.
One of these is closing an unlinked file when nobody else holds it open.

Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Allows the same file to be visible via different paths, or gives a path
to an unlinked file.

Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Marks that the file is opened by increasing the opened file counter.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Decreases the opened file counter. If the file is unlinked and the
counter reaches zero, the file is automatically removed.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Each write can be divided into multiple smaller writes that fall into
one of the following categories:
- small write: below SMALL_WRITE_THRESHOLD bytes; stored entirely in
  memory
- medium write: above SMALL_WRITE_THRESHOLD and below cluster_size
  bytes; stored on disk, appended to the on-disk data log where data
  from different writes can share one cluster
- big write: a write that fills a whole cluster; stored on disk
For example, one write can be divided into multiple big writes, some
small writes and some medium writes. The current implementation avoids
unnecessary data copying: data given by the caller is either used
directly to write to disk or copied (for a small write).
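The size-based split can be sketched as a classification function. SMALL_WRITE_THRESHOLD's actual value lives in the source; the 8192 here, and the exact boundary conditions, are only assumed for illustration:

```cpp
#include <cstdint>

// Illustrative classification of one contiguous piece of a write by its
// length, following the categories described above.
enum class write_kind { small, medium, big };

constexpr uint64_t SMALL_WRITE_THRESHOLD = 8192; // assumed example value

constexpr write_kind classify_write(uint64_t len, uint64_t cluster_size) {
    if (len < SMALL_WRITE_THRESHOLD) {
        return write_kind::small;  // kept in memory, logged in metadata log
    }
    if (len < cluster_size) {
        return write_kind::medium; // appended to the shared data log
    }
    return write_kind::big;        // fills a whole cluster at once
}
```

A single large user write would be cut into cluster-sized big pieces plus medium/small remainders, each classified this way.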

Added a cluster writer, which is used to perform medium writes. The
cluster writer keeps the current position in the data log and appends
new data by writing it directly to disk.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Truncate can be used to change a file's size. When the new size is
smaller than the current one, the data at higher offsets is lost; when
it is larger, the file is extended with null bytes.
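On a plain in-memory byte buffer these semantics reduce to a resize; in the real filesystem the growth is recorded as a hole in the data vectors rather than zeros written to disk. A minimal sketch under that simplification:

```cpp
#include <cstdint>
#include <vector>

// Models truncate on an in-memory byte buffer: shrinking drops data
// past new_size; growing appends null bytes. (The real implementation
// manipulates data vectors and records growth as a hole.)
void truncate_file(std::vector<uint8_t>& content, uint64_t new_size) {
    content.resize(new_size, 0);
}
```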

Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Reads file data from disk and memory based on the information stored in
the inode's data vectors. Unoptimized version: reads from disk always go
through temporary buffers before being copied to the buffer given by the
caller.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Provides an interface to query file attributes, which include
permissions, btime, mtime and ctime.

Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
The test checks, on small examples, that the data written to disk by a
to_disk_buffer matches the data appended to the buffer, and that the
remaining buffer space is correctly calculated.

Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Added mockers:
- mockers store information about every operation
- a list of virtually created mockers is stored

Added tests for the metadata_to_disk_buffer mocker. The tests check that
the mocker behaves like metadata_to_disk_buffer.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
- random tests
- tests for corner cases
  * basic single small writes
  * basic single medium writes
  * basic single large writes
  * new cluster allocation for medium writes
  * medium write split into two smaller writes due to lack of space in
    data-log cluster
  * split of a single write into several smaller writes because of an
    unaligned buffer
  * split big write (bigger than cluster size) into multiple writes

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Checks that the data written to disk after a truncate is correct, that
reads from a truncated file are accurate, and that the file's metadata
is set to the new size.

Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
For every on-disk entry, check that:
- it is correctly appended to the buffer when it fits
- the buffer returns TOO_BIG when it does not fit
- it is written to disk after a successful append and flush.

Signed-off-by: Wojciech Mitros <wmitros@protonmail.com>
Optimization for aligned reads: when the on-disk data and the given
buffer are properly aligned, the data is read directly into the caller's
buffer rather than being staged in a temporary buffer.

Added device_reader to perform unaligned reads with caching.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Random test checking the aligned write and read optimizations.

Signed-off-by: Michał Niciejewski <quport@gmail.com>
Checks that newly created directories are accessible after bootstrapping.

Signed-off-by: Aleksander Sorokin <ankezy@gmail.com>
@github-actions github-actions bot added the tests label Apr 20, 2020
@varqox varqox marked this pull request as ready for review April 20, 2020 08:52
@varqox varqox merged commit 3a92552 into zpp_fs Apr 20, 2020