Commits
Commits on May 23, 2023
-
Enable configuration and building of dm-vdo.
This adds dm-vdo to the drivers/md Kconfig and Makefile.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
This adds the dm-vdo target.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add support for dumping detailed vdo state to the kernel log via a dmsetup message. The dump code is not thread-safe and is generally intended for use only when the vdo is hung.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add sysfs support for setting vdo parameters and fetching statistics.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add the on-disk formats and marshalling of vdo structures.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add repair (crash recovery and read-only rebuild) of damaged vdos.
When a vdo is restarted after a crash, it will automatically attempt to recover from its journals. If a vdo encounters an unrecoverable error, it will enter read-only mode. This mode indicates that some previously acknowledged data may have been lost. The vdo may be instructed to rebuild as best it can in order to return to a writable state. Although some data may be lost, this process will ensure that the vdo's own metadata is self-consistent.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
The recovery journal is used to amortize updates across the block map and slab depot. Each write request causes an entry to be made in the journal. Entries are either "data remappings" or "block map remappings." For a data remapping, the journal records the logical address affected and its old and new physical mappings. For a block map remapping, the journal records the block map page number and the physical block allocated for it (block map pages are never reclaimed, so the old mapping is always 0). Each journal entry and the data write it represents must be stable on disk before the other metadata structures may be updated to reflect the operation.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
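The two entry kinds described above can be sketched as a small C structure. This is an illustrative model only; the field and function names are hypothetical, not the driver's on-disk layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the two recovery journal entry kinds:
 * field names are illustrative, not dm-vdo's actual format. */
enum journal_entry_type {
	DATA_REMAPPING,      /* logical address: old -> new physical mapping */
	BLOCK_MAP_REMAPPING, /* block map page: old mapping is always 0 */
};

struct journal_entry {
	enum journal_entry_type type;
	uint64_t logical;      /* logical address or block map page number */
	uint64_t old_physical; /* always 0 for block map remappings */
	uint64_t new_physical;
};

/* Record a data write: the logical address and both physical mappings. */
struct journal_entry make_data_remapping(uint64_t lbn, uint64_t old_pbn,
					 uint64_t new_pbn)
{
	struct journal_entry e = { DATA_REMAPPING, lbn, old_pbn, new_pbn };
	return e;
}

/* Block map pages are never reclaimed, so the old mapping is 0. */
struct journal_entry make_block_map_remapping(uint64_t page, uint64_t pbn)
{
	struct journal_entry e = { BLOCK_MAP_REMAPPING, page, 0, pbn };
	return e;
}
```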
-
Implement the vdo block map page cache.
The set of leaf pages of the block map tree is too large to fit in memory, so each block map zone maintains a cache of leaf pages. This patch adds the implementation of that cache.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
The block map contains the logical to physical mapping. It can be thought of as an array with one entry per logical address. Each entry is 5 bytes: 36 bits contain the physical block number which holds the data for the given logical address, and the remaining 4 bits are used to indicate the nature of the mapping. Of the 16 possible states, one represents a logical address which is unmapped (i.e. it has never been written, or has been discarded), one represents an uncompressed block, and the other 14 states are used to indicate that the mapped data is compressed, and which of the compression slots in the compressed block this logical address maps to.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
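The 5-byte entry described above (a 36-bit physical block number plus a 4-bit mapping state) can be packed and unpacked as follows. The byte order chosen here is an assumption for illustration, not the actual on-disk format.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative packing of one 5-byte block map entry: a 4-bit mapping
 * state in the low bits and a 36-bit physical block number above it.
 * The little-endian layout here is an assumption, not dm-vdo's format. */

enum { MAPPING_STATE_BITS = 4 };

/* Pack state and pbn into 5 bytes, least-significant byte first. */
void encode_entry(uint8_t entry[5], unsigned state, uint64_t pbn)
{
	uint64_t packed = (uint64_t)state | (pbn << MAPPING_STATE_BITS);

	for (int i = 0; i < 5; i++)
		entry[i] = (uint8_t)(packed >> (8 * i));
}

/* Reverse of encode_entry: recover the state and physical block number. */
void decode_entry(const uint8_t entry[5], unsigned *state, uint64_t *pbn)
{
	uint64_t packed = 0;

	for (int i = 0; i < 5; i++)
		packed |= (uint64_t)entry[i] << (8 * i);
	*state = (unsigned)(packed & 0xf);
	*pbn = packed >> MAPPING_STATE_BITS;
}
```

Note how 36 + 4 bits fill the 40 bits of the 5-byte entry exactly, with no padding.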
-
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add the block allocators and physical zones.
Each slab is independent of every other. They are assigned to "physical zones" in round-robin fashion. If there are P physical zones, then slab n is assigned to zone n mod P. The set of slabs in each physical zone is managed by a block allocator.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
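The round-robin assignment stated above is just a modulus; a minimal sketch (the function name is illustrative):

```c
#include <assert.h>

/* With P physical zones, slab n belongs to zone n mod P. */
unsigned slab_to_zone(unsigned slab_number, unsigned physical_zones)
{
	return slab_number % physical_zones;
}
```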
-
The slab depot maintains an additional small data structure, the "slab summary," which is used to reduce the amount of work needed to come back online after a crash. The slab summary maintains an entry for each slab indicating whether or not the slab has ever been used, whether it is clean (i.e. all of its reference count updates have been persisted to storage), and approximately how full it is. During recovery, each physical zone will attempt to recover at least one slab, stopping whenever it has recovered a slab which has some free blocks. Once each zone has some space (or has determined that none is available), the target can resume normal operation in a degraded mode. Read and write requests can be serviced, perhaps with degraded performance, while the remainder of the dirty slabs are recovered.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Most of the vdo volume belongs to the slab depot. The depot contains a collection of slabs. The slabs can be up to 32GB, and are divided into three sections. Most of a slab consists of a linear sequence of 4K blocks. These blocks are used either to store data, or to hold portions of the block map (see subsequent patches). In addition to the data blocks, each slab has a set of reference counters, using 1 byte for each data block. Finally, each slab has a journal. Reference updates are written to the slab journal, which is written out one block at a time as each block fills. A copy of the reference counters is kept in memory, and is written out a block at a time, in oldest-dirtied order, whenever there is a need to reclaim slab journal space. The journal is used both to ensure that the main recovery journal (see subsequent patches) can regularly free up space, and also to amortize the cost of updating individual reference blocks. This patch adds the slab structure as well as the slab journal and reference counters.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
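A quick sizing sketch follows from the numbers above: one reference counter byte per 4K data block. For a maximally sized 32GB slab that works out to 8MB of reference counters. (This ignores the journal and metadata blocks; the function name is illustrative.)

```c
#include <assert.h>
#include <stdint.h>

/* One reference counter byte per data block, per the slab layout above.
 * A sizing sketch only; real slabs also reserve journal/metadata blocks. */
uint64_t ref_counter_bytes(uint64_t slab_bytes, uint64_t block_size)
{
	return slab_bytes / block_size;
}
```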
-
Add the compressed block bin packer.
When blocks do not deduplicate, vdo will attempt to compress them. Up to 14 compressed blocks may be packed into a single data block (this limitation is imposed by the block map). The packer implements a simple best-fit packing algorithm and also manages the formatting and writing of compressed blocks when bins fill.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
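A best-fit bin choice, in the spirit of the packer described above, picks the bin with the least free space that can still hold the fragment, subject to the 14-slot limit the block map imposes. The `struct bin` and function below are a simplified model, not the driver's code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 14 /* imposed by the block map's 14 compressed states */

/* Simplified model of a packer bin: remaining space and slots used. */
struct bin {
	size_t free_space; /* bytes left in the compressed block */
	int used_slots;
};

/* Best fit: the fullest bin that still holds the fragment, or -1. */
int select_bin(const struct bin *bins, int nbins, size_t fragment_size)
{
	int best = -1;

	for (int i = 0; i < nbins; i++) {
		if (bins[i].used_slots >= MAX_SLOTS)
			continue; /* no compression slot left */
		if (bins[i].free_space < fragment_size)
			continue; /* fragment does not fit */
		if (best < 0 || bins[i].free_space < bins[best].free_space)
			best = i; /* tighter fit than the current best */
	}
	return best;
}
```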
-
Add use of the deduplication index in hash zones.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add hash locks and hash zones.
In order to deduplicate concurrent writes of the same data (to different locations), data_vios which are writing the same data are grouped together in a "hash lock," named for and keyed by the hash of the data being written. Each hash lock is assigned to a hash zone based on a portion of its hash.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
The io_submitter handles bio submission from the vdo data store to the underlying storage. It will merge bios when possible.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
This patch adds support for handling incoming flush and/or FUA bios. Each such bio is assigned to a struct vdo_flush. These are allocated as needed, but there is always one kept in reserve in case allocations fail. In the event of an allocation failure, bios may need to wait for an outstanding flush to complete. The logical address space is partitioned into logical zones, each handled by its own thread. Each zone keeps a list of all data_vios handling write requests for logical addresses in that zone. When a flush bio is processed, each logical zone is informed of the flush. When all of the writes which are in progress at the time of the notification have completed in all zones, the flush bio is then allowed to complete.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add data_vio, the request object which services incoming bios.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add vio, the request object for vdo metadata.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add administrative state and scheduling for vdo.
This patch adds the admin_state structures which are used to track the states of individual vdo components for handling of operations like suspend and resume. It also adds the action manager which is used to schedule and manage cross-thread administrative and internal operations.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Implement external deduplication index interface.
The deduplication index interface for index clients includes the deduplication request and index session structures. This is the interface that the rest of the vdo target uses to make requests, receive responses, and collect statistics. This patch also adds sysfs nodes for inspecting various index properties at runtime.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Implement top-level deduplication index.
The top-level deduplication index brings all the earlier components together. The top-level index creates the separate zone structures that enable the index to handle several requests in parallel, handles dispatching requests to the right zones and components, and coordinates metadata to ensure that it remains consistent. It also coordinates recovery in the event of an unexpected index failure. If sparse caching is enabled, the top-level index also handles the coordination required by the sparse chapter index cache, which (unlike most index structures) is shared among all zones.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Implement the chapter volume store.
The volume store structures manage the reading and writing of chapter pages. When a chapter is closed, it is packed into a read-only structure, split across several pages, and written to storage. The volume store also contains a cache and specialized queues that sort and batch requests by the page they need, in order to minimize latency and I/O requests when records have to be read from storage. The cache and queues also coordinate with the volume index to ensure that the volume does not waste resources reading pages that are no longer valid.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Implement the open chapter and chapter indexes.
Deduplication records are stored in groups called chapters. New records are collected in a structure called the open chapter, which is optimized for adding, removing, and sorting records. When a chapter fills, it is packed into a read-only structure called a closed chapter, which is optimized for searching and reading. The closed chapter includes a delta index, called the chapter index, which maps each record name to the record page containing the record and allows the index to read at most one record page when looking up a record.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
The volume index is a large delta index that maps each record name to the chapter which contains the newest record for that name. The volume index can contain several million records and is stored entirely in memory while the index is operating, accounting for the majority of the deduplication index's memory budget. The volume index is composed of two subindexes in order to handle sparse hook names separately from regular names. If sparse indexing is not enabled, the sparse hook portion of the volume index is not used or instantiated.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
The delta index is a space and memory efficient alternative to a hash table. Instead of storing the entire key for each entry, the entries are sorted by key and only the difference between adjacent keys (the delta) is stored. If the keys are evenly distributed, the size of the deltas follows an exponential distribution, and the deltas can use a Huffman code to take up even less space. This structure allows the index to use many fewer bytes per entry than a traditional hash table, but it is slightly more expensive to look up entries, because a request must read and sum every entry in a list of deltas in order to find a given record. The delta index reduces this lookup cost by splitting its key space into many sub-lists, each starting at a fixed key value, so that each individual list is short.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
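The read-and-sum lookup described above can be illustrated with a toy delta list. Real delta lists are bit-packed and Huffman coded; this sketch stores each delta as a plain integer, and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy delta list lookup: keys are stored as differences from the
 * previous key, so finding a key means summing deltas from the start
 * of the (short) sub-list, which begins at a fixed base key. */
bool delta_list_contains(const uint32_t *deltas, size_t count,
			 uint64_t list_base, uint64_t key)
{
	uint64_t current = list_base;

	for (size_t i = 0; i < count; i++) {
		current += deltas[i]; /* reconstruct the next stored key */
		if (current == key)
			return true;
		if (current > key)
			return false; /* keys are sorted; we passed it */
	}
	return false;
}
```

Because each sub-list starts at a fixed key value, the walk can stop as soon as the running sum passes the target, keeping lookups short.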
-
Add deduplication index storage interface.
This patch adds infrastructure for managing reads and writes to the underlying storage layer for the deduplication index. The deduplication index uses dm-bufio for all of its reads and writes, so part of this infrastructure is managing the various dm-bufio clients required. It also adds the buffered reader and buffered writer abstractions, which simplify reading and writing metadata structures that span several blocks. This patch also includes structures and utilities for encoding and decoding all of the deduplication index metadata, collectively called the index layout.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add deduplication configuration structures.
Add structures which record the configuration of various deduplication index parameters. This also includes facilities for saving and loading the configuration and validating its integrity.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
This patch adds two hash maps, one keyed by integers, the other by pointers, and also a priority heap. The integer map is used for locking of logical and physical addresses. The pointer map is used for managing concurrent writes of the same data, ensuring that those writes are deduplicated. The priority heap is used to minimize the search time for free blocks.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add specialized request queueing functionality.
This patch adds funnel_queue, a mostly lock-free multi-producer, single-consumer queue. It also adds the request queue used by the dm-vdo deduplication index, and the work_queue used by the dm-vdo data store. Both of these are built on top of funnel queue and are intended to support the dispatching of many short-running tasks. The work_queue also supports priorities. Finally, this patch adds vdo_completion, the structure which is enqueued on work_queues.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
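A minimal sketch of a funnel-queue-style MPSC queue, assuming C11 atomics rather than the kernel's primitives: producers link an entry with a single atomic exchange on the tail, and the lone consumer walks next pointers from the head through a permanent stub node. Names are illustrative, and the real funnel_queue's memory-ordering and stub handling are simplified.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct fq_entry {
	_Atomic(struct fq_entry *) next;
};

struct funnel_queue {
	_Atomic(struct fq_entry *) newest; /* shared among producers */
	struct fq_entry *oldest;           /* owned by the one consumer */
	struct fq_entry stub;              /* permanent placeholder node */
};

void fq_init(struct funnel_queue *q)
{
	atomic_store(&q->stub.next, NULL);
	atomic_store(&q->newest, &q->stub);
	q->oldest = &q->stub;
}

/* Multi-producer put: one atomic exchange, then publish the link. */
void fq_put(struct funnel_queue *q, struct fq_entry *entry)
{
	struct fq_entry *prev;

	atomic_store(&entry->next, NULL);
	prev = atomic_exchange(&q->newest, entry);
	/* Until this store, the consumer sees the queue end at prev. */
	atomic_store(&prev->next, entry);
}

/* Single-consumer poll; returns NULL if the queue looks empty. */
struct fq_entry *fq_poll(struct funnel_queue *q)
{
	struct fq_entry *oldest = q->oldest;
	struct fq_entry *next = atomic_load(&oldest->next);

	if (oldest == &q->stub) { /* skip over the stub node */
		if (next == NULL)
			return NULL;
		oldest = next;
		next = atomic_load(&oldest->next);
	}
	if (next == NULL) {
		if (oldest != atomic_load(&q->newest))
			return NULL; /* a producer is mid-put */
		/* oldest is the last real entry: push the stub back */
		fq_put(q, &q->stub);
		next = atomic_load(&oldest->next);
		if (next == NULL)
			return NULL;
	}
	q->oldest = next;
	return oldest;
}
```

The "mostly lock-free" caveat shows up in the mid-put window: after the exchange but before the next-pointer store, the consumer may briefly see an empty queue.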
-
Add thread and synchronization utilities.
This patch adds utilities for managing and using named threads, as well as several locking and synchronization utilities. These utilities help dm-vdo minimize thread transitions and manage cross-thread interactions.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>
-
Add vdo type declarations, constants, and simple data structures.
Signed-off-by: J. corwin Coburn <corwin@redhat.com>