DocBmap
When a client issues a read or write command, an RPC (SRMT_GETBMAP, defined in slash2/include/slashrpc.h) is made to the MDS requesting the bmap for the associated file region.
This document describes the metadata operation involved when handling a
bmap request from the client.
A bmap has its generation number bumped:

- when it receives a CRC update from an ION and has at least one other VALID state in its replica table (sketched below). Any other replicas made obsolete by the update are marked as GARBAGE and queued into the upsch (Update Scheduler) for reclamation. It should be noted that only one ION may send CRC updates for a bmap. This ION is serially chosen by the MDS, so as long as this bmap ↔ ION association is in place, no other IONs may issue CRC updates to the MDS.
- during partial truncation resolution, when all bmaps past the ptrunc position have their bgen bumped.
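A minimal sketch of the first case, under assumed names: the structure layout, replica states, and upsch_enqueue_reclaim() below are illustrative placeholders, not the actual slashd symbols.

```c
#include <stdint.h>

#define MAX_REPLICAS	64
enum repl_state { REPL_INVALID, REPL_VALID, REPL_GARBAGE };

struct bmap_meta {
	uint32_t	bgen;			/* bmap generation number */
	enum repl_state	repls[MAX_REPLICAS];	/* per-IOS residency state */
	int		nrepls;
};

/*
 * A CRC update arrived from the assigned ION.  If any other replica is
 * still VALID, it is now stale: mark it GARBAGE for the update scheduler
 * (upsch) to reclaim and bump the bmap generation.
 */
void
bmap_crc_update_bump(struct bmap_meta *b, int updating_ios)
{
	int i, stale = 0;

	for (i = 0; i < b->nrepls; i++) {
		if (i == updating_ios || b->repls[i] != REPL_VALID)
			continue;
		b->repls[i] = REPL_GARBAGE;	/* obsolete copy */
		stale = 1;
	}
	if (stale) {
		b->bgen++;
		/* upsch_enqueue_reclaim(b);	-- queue garbage for reclamation */
	}
}
```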
Note: only on read or write should the bmap tracking incur a log operation. Otherwise, operations such as touch(1) will cause an unnecessary log operation.
Upon receipt of a GETBMAP request, the MDS issues a lease to the client, which is used to authorize I/O activity.
The bmap cache lookup consists of searching the bmap tree attached to
each open file (called a "FID cache member handle" or fcmh) for the bmap
specified (by numerical ID) in the request.
If the same client is re-requesting a lease for the same bmap, a "duplicate" lease is issued; this is necessary in the protocol for situations when the client loses contact with the MDS but the MDS hasn't discovered this situation until the reissued request comes in.
If the bmap already has leases (read or write) to other clients, the bmap is first degraded into "direct I/O" mode before reply. In DIO mode, all clients accessing the bmap are forced to perform all I/O without local caching to maintain coherency.
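The decision logic around duplicate leases and direct I/O degradation can be summarized with the sketch below; the lease and bmap structures and the odtable comment are simplified assumptions for illustration, not the actual MDS code.

```c
#include <stdint.h>
#include <stdlib.h>

struct lease {
	uint64_t	 client_id;
	int		 rw;		/* read or write lease */
	struct lease	*next;
};

struct bmap {
	struct lease	*leases;	/* leases currently outstanding */
	int		 dio;		/* direct I/O (no client caching) mode */
};

/*
 * Issue a lease on a bmap already looked up (or paged in) through the
 * fcmh's bmap tree.  A repeat request from the same client re-issues a
 * "duplicate" lease; a lease held by any other client degrades the bmap
 * to direct I/O mode before the reply is sent.
 */
struct lease *
mds_issue_lease(struct bmap *b, uint64_t client_id, int rw)
{
	struct lease *l, *dup = NULL;

	for (l = b->leases; l != NULL; l = l->next) {
		if (l->client_id == client_id) {
			if (l->rw == rw)
				dup = l;	/* same client re-requesting */
		} else
			b->dio = 1;		/* other client holds a lease */
	}
	if (dup != NULL)
		return (dup);			/* duplicate lease re-issue */

	/* New lease: record it persistently (bia odtable) and link it in. */
	l = calloc(1, sizeof(*l));
	if (l == NULL)
		return (NULL);
	l->client_id = client_id;
	l->rw = rw;
	l->next = b->leases;
	b->leases = l;
	return (l);
}
```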
Upon lease issuance, an entry is stored in the MDS persistent operations table (called the bmap or bia, bmap-ION-assignment, on-disk table or "odtable"), recording the lease so it can be rebuilt in recovery scenarios. During recovery (i.e. after failure), these entries are replayed to recreate the MDS's cache.
There is a proposal to behave similarly to TCP's 2MSL wait where, on startup, the MDS ignores all requests concerning bmaps leased from prior instances. This strategy would require strict coherence among load-balanced MDS peers so clients don't stall, and/or automatic re-requesting of leases once the connection to the failed MDS has been re-established.
While the bmap is being paged in (if it is not already present in the MDS memory cache), a placeholder is allocated to prevent reentrant page-ins, and any additional requesting clients will wait on the bcm_waitq until the bmap has been loaded.
bmaps are fixed size structures, about 1KiB in size, which represent
128MiB (by default) of user data.
bmap structures are written into a FID's inode metafile, which is a file in the MDFS that corresponds to the FID, such as /deployment_s2md/.slmd/fidns/0/1/2/3/00000000123000_0. This file contains all metadata for the file and for each bmap in the file.
To read a specific bmap from an inode's metafile requires the bmap index number, which is simply the offset of file access divided by the bmap representation size (again, by default 128MiB). Conversely, the bmap index number multiplied by the bmap_ondisk size, plus the starting offset (SL_BMAP_START_OFF), gives the offset into the metafile.
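In code form, the two mappings look roughly as follows; the 128MiB bmap size and the ~1KiB on-disk size come from the text, while the concrete SL_BMAP_START_OFF value used here is only a placeholder (the real constant lives in the slash2 headers).

```c
#include <stdint.h>

#define SLASH_BMAP_SIZE		(128ULL * 1024 * 1024)	/* user data covered per bmap (default) */
#define BMAP_ONDISK_SIZE	1024ULL			/* ~1KiB on-disk bmap structure */
#define SL_BMAP_START_OFF	0x1000ULL		/* placeholder start of bmap area in the metafile */

/* bmap index for a byte offset within the file */
static inline uint64_t
bmap_index(uint64_t file_off)
{
	return (file_off / SLASH_BMAP_SIZE);
}

/* byte offset of that bmap's structure within the inode metafile */
static inline uint64_t
bmap_metafile_off(uint64_t bmap_idx)
{
	return (SL_BMAP_START_OFF + bmap_idx * BMAP_ONDISK_SIZE);
}
```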
When an ION processes a write on a bmap and sends the MDS the bmap's CRCs, the MDS is required to store an initialized bmap at the respective index in the metafile.
All modifications to directory inodes, file inodes, and bmaps are
recorded in a transaction log called the MDS journal, for replay and MDS
replication purposes.
This journal (in pfl/journal.c) is invoked from ZFS routines when data modifications are made (e.g. zfs_write).
The journal typically sits on a device outside of the MDFS and is written synchronously to ensure data consistency. As such, the device should ideally provide low-latency write IOPS and does not need much storage. A larger journal requires more time to process after startup, while a smaller journal limits the number of transactions that may be open simultaneously before new ones block waiting for old ones to finish.
In normal read mode, a bmap read lease request issued by a client also retrieves the file's inode replica table, which contains the list of IOS replicas where the data resides.
For IONs, a similar SRMT_BMAPGETCRCS RPC request is sent via iod_bmap_retrieve() to load the CRCs of data in the bmap.
The bmap has its own CRC which protects the CRC table and replication table against silent corruption. If the requested bmap does not yet exist on disk, a new bmap must be created on the fly with the CRC table containing SL_NULL_CRCs (i.e. the CRC of a sliver filled with all null bytes).
When CRC updates (CRUDs) are received by the MDS from an IOS (which accepted WRITEs from clients), the bmap's bcs_crcstates are updated and the BMAP_SLVR_DATA flag is set, signifying that this region of the bmap is no longer a hole (i.e. filled with zeroes). The region size defaults to 1MiB. This flag is conveyed in the GETBMAP RPC's flags field as SRM_LEASEBMAPF_DATA if the region is not all zeroes. The client uses this information to know which chunks of the bmap must be retrieved from the ION.
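A small sketch of how a client could consult this state when servicing a read, assuming a per-sliver crcstates array mirrored from the lease reply; the flag and granularity follow the text, the surrounding names are invented for illustration.

```c
#include <stdint.h>

#define SLVR_SIZE	(1024 * 1024)	/* 1MiB sliver / CRC granularity */
#define BMAP_SLVR_DATA	0x01		/* sliver holds data, not a hole */

/*
 * Returns nonzero if the sliver containing this bmap-relative offset has
 * been written (and must be fetched from the ION); zero means the sliver
 * is still a hole and the client can satisfy the read with zeroes.
 */
static int
slvr_needs_fetch(const uint8_t *crcstates, uint64_t off_in_bmap)
{
	return ((crcstates[off_in_bmap / SLVR_SIZE] & BMAP_SLVR_DATA) != 0);
}
```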
Bmap metadata is updated and rewritten as a result of numerous operations:

- Receipt of a chunk CRC update, which causes two fields to be updated: the store of the chunk CRC into its appropriate slot and the recomputation and rewriting of the bmap CRC.
- Replica management: upon successful replication of the bmap, or when replicas become invalid because of an overwrite. This also causes two writes (including rewriting of the bmap CRC).
- Replay of journal activity.
- Update of residency states during partial truncation resolution.
- User-initiated modification of the bmap replication policy.
- User-initiated change of the residency states, e.g. as a result of issuing a replication or residency ejection request.
- A residency change update from an ION, e.g. replication, partial garbage reclamation, etc.

Note that any write of a bmap to disk causes on-demand computation of the bmap to-disk CRC.
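A sketch of that seal step, assuming a simplified on-disk layout; the field names, sizes, and the placeholder checksum routine below are illustrative (the real code uses slash2's bmap_ondisk layout and a proper CRC-64).

```c
#include <stddef.h>
#include <stdint.h>

#define NSLVRS	128	/* 128MiB bmap / 1MiB slivers */
#define NREPLS	64

/* Simplified on-disk bmap: the trailing CRC covers everything before it. */
struct bmap_ondisk_sketch {
	uint64_t	crcs[NSLVRS];		/* per-sliver CRC table */
	uint8_t		crcstates[NSLVRS];	/* e.g. BMAP_SLVR_DATA bits */
	uint8_t		repls[NREPLS];		/* replica residency states */
	uint32_t	bgen;			/* generation number */
	uint64_t	bod_crc;		/* CRC protecting the fields above */
};

/* Placeholder 64-bit checksum (FNV-1a); stands in for the real CRC-64. */
static uint64_t
checksum64(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t h = 0xcbf29ce484222325ULL;
	size_t i;

	for (i = 0; i < len; i++)
		h = (h ^ p[i]) * 0x100000001b3ULL;
	return (h);
}

/* Recompute the self-protecting CRC just before the bmap is written out. */
void
bmap_ondisk_seal(struct bmap_ondisk_sketch *bod)
{
	bod->bod_crc = checksum64(bod, offsetof(struct bmap_ondisk_sketch, bod_crc));
}
```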
The dominant issue with CRCs revolves around the data size encompassed by a single 8-byte CRC. This has direct ramifications in the amount of buffering required and the MDS capacity needed to store CRCs. Also, since the MDS stores the CRCs, the system ingest bandwidth is essentially limited to the number of CRCs the MDS can process. Issues regarding the synchronous storing of MDS-side CRCs need to be explored. For our purposes we will assume that the MDS has safely stored the CRCs before acknowledging back to the IOS.
The MDS has a fixed size array for CRC storage; the array size is the product of the CRC granularity and the bmap size. For now we assume that the bmap size is 128MiB and the CRC granularity is 1MiB, resulting in an array size of 1KiB required for CRC storage per bmap. Here we can see that 8 bytes per 1MiB provides a reasonable growth path for CRC storage:
(1024^2/(1024^2))*8 = 8 # 1MB requires 8B of CRCs
(1024^3/(1024^2))*8 = 8192 # 1GB requires 8KB of CRCs
(1024^4/(1024^2))*8 = 8388608 # 1TB requires 8MB of CRCs
(1024^5/(1024^2))*8 = 8589934592 # 1PB requires 8GB of CRCs
(1024^6/(1024^2))*8 = 8796093022208 # 1EB requires 8TB of CRCs
As writes are processed by the IOS we must ensure that the CRCs are accurate and take into account any cache coherency issues that may arise. One problem we face with parallel IOSes and CRCs is that we have no way to guarantee which IOS wrote last and therefore which CRCs accurately reflect the state of the file. Therefore, revising the parallel IOS write protocol, the MDS will determine the IOS ↔ bmap association and provide the same IOS to all clients for a given bmap write session. This will ensure that only one source is valid for issuing CRC updates into a bmap region, and that this source is verifiable by the MDS.
Some failure ramifications:

- Should the write occur but return with a failure, the IOS must have a way of notifying the MDS that the CRC state on-disk is unknown.
- The IOS performs a write and then fails before sending the CRC update. The CRC should be calculated and stored before sending to the MDS.
Synchronously delivered update RPCs will surely slow down the write process. Perhaps we should be able to batch an entire bmap's worth of updates.
CRC updates are batched into bulk RPCs:
sizeof(struct srm_bmap_crcwrt_req) = 72 + 64 * (sizeof(struct srm_bmap_crcup) = 48 + 24 * sizeof(struct srt_bmap_crcwire) = 16) = 27,720 bytes ≈ 27.07KiB

So MAX_BMAP_INODE_PAIRS should be bumped to 128.
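Expanding that calculation with the struct sizes quoted above; reading the 64 as the current MAX_BMAP_INODE_PAIRS and the 24 as the CRC entries carried per crcup is an assumption based on the formula, not on the header definitions.

```c
#include <stdio.h>

int
main(void)
{
	const unsigned req_fixed = 72;	/* fixed part of struct srm_bmap_crcwrt_req */
	const unsigned crcup	 = 48;	/* sizeof(struct srm_bmap_crcup) */
	const unsigned crcwire	 = 16;	/* sizeof(struct srt_bmap_crcwire) */
	const unsigned npairs	 = 64;	/* assumed MAX_BMAP_INODE_PAIRS today */
	const unsigned ncrcs	 = 24;	/* assumed CRC entries per crcup */

	unsigned total = req_fixed + npairs * (crcup + ncrcs * crcwire);

	/* prints: 27720 bytes (~27.07 KiB) */
	printf("%u bytes (~%.2f KiB)\n", total, total / 1024.0);
	return (0);
}
```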
Also we should have a journal of CRCs on the IOS where possible to deal with failures, so that any unsent CRCs (post-failure) may be verified: compare the buffer-side CRCs against the on-disk state and then update the MDS.
Need to consider what happens when an IOS fails from the perspective of the client and the MDS. The MDS may have to log/record bmap ↔ IOS associations to protect against updates from a previous IOS ownership.
- MDS chooses the IOS for a given bmap; CRC updates may only come from that IOS.
- This means that we can batch CRC updates up to the size of the bmap (big performance win).
- Journal buffer-side CRCs (pre-write) to guard against IOS failure. (Perhaps not.)
- The MDS needs to send an RPC to an IOS requesting calculation of the CRCs for a bmap; this would be issued when the MDS detects the failure of an IOS and needs to reassign.
- When the MDS chooses an ION for write, it must first notify other read-leased clients of this.