DocBmap
When a client issues a read or write command, an RPC (SRMT_GETBMAP, defined in slash2/include/slashrpc.h) is made to the MDS requesting the bmap for the associated file region.
This document describes the metadata operation involved when handling a
bmap request from the client.
A bmap has its generation number bumped:

- when it receives a CRC update from an ION and has at least one other VALID state in its replica table (sketched below). Any other replicas made obsolete by the update are marked as GARBAGE and queued into the upsch (Update Scheduler) for reclamation. It should be noted that only one ION may send CRC updates for a bmap. This ION is serially chosen by the MDS, so as long as this bmap ↔ ION association is in place, no other IONs may issue CRC updates to the MDS.
- during partial truncation resolution, when all bmaps past the ptrunc position have their bgen bumped.
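A minimal sketch of the first case, under assumed names: the structure layout, replica states, and upsch_enqueue_reclaim() below are illustrative placeholders, not the actual slashd symbols.

```c
#include <stdint.h>

#define MAX_REPLICAS	64
enum repl_state { REPL_INVALID, REPL_VALID, REPL_GARBAGE };

struct bmap_meta {
	uint32_t	bgen;			/* bmap generation number */
	enum repl_state	repls[MAX_REPLICAS];	/* per-IOS residency state */
	int		nrepls;
};

/*
 * A CRC update arrived from the assigned ION.  If any other replica is
 * still VALID, it is now stale: mark it GARBAGE for the update scheduler
 * (upsch) to reclaim and bump the bmap generation.
 */
void
bmap_crc_update_bump(struct bmap_meta *b, int updating_ios)
{
	int i, stale = 0;

	for (i = 0; i < b->nrepls; i++) {
		if (i == updating_ios || b->repls[i] != REPL_VALID)
			continue;
		b->repls[i] = REPL_GARBAGE;	/* obsolete copy */
		stale = 1;
	}
	if (stale) {
		b->bgen++;
		/* upsch_enqueue_reclaim(b);	-- queue garbage for reclamation */
	}
}
```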
Note: only on read or write should the bmap tracking incur a log operation. Otherwise, operations such as touch(1) will cause an unnecessary log operation.
Upon receipt of a GETBMAP request, the MDS issues a lease to the client, which is used to authorize I/O activity.
The bmap cache lookup consists of searching the bmap tree attached to
each open file (called a "FID cache member handle" or fcmh) for the bmap
specified (by numerical ID) in the request.
If the same client is re-requesting a lease for the same bmap, a "duplicate" lease is issued; this is necessary in the protocol for situations when the client loses contact with the MDS but the MDS hasn't discovered this situation until the reissued request comes in.
If the bmap already has leases (read or write) to other clients, the bmap is first degraded into "direct I/O" mode before reply. In DIO mode, all clients accessing the bmap are forced to perform all I/O without local caching to maintain coherency.
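The decision logic around duplicate leases and direct I/O degradation can be summarized with the sketch below; the lease and bmap structures and the odtable comment are simplified assumptions for illustration, not the actual MDS code.

```c
#include <stdint.h>
#include <stdlib.h>

struct lease {
	uint64_t	 client_id;
	int		 rw;		/* read or write lease */
	struct lease	*next;
};

struct bmap {
	struct lease	*leases;	/* leases currently outstanding */
	int		 dio;		/* direct I/O (no client caching) mode */
};

/*
 * Issue a lease on a bmap already looked up (or paged in) through the
 * fcmh's bmap tree.  A repeat request from the same client re-issues a
 * "duplicate" lease; a lease held by any other client degrades the bmap
 * to direct I/O mode before the reply is sent.
 */
struct lease *
mds_issue_lease(struct bmap *b, uint64_t client_id, int rw)
{
	struct lease *l, *dup = NULL;

	for (l = b->leases; l != NULL; l = l->next) {
		if (l->client_id == client_id) {
			if (l->rw == rw)
				dup = l;	/* same client re-requesting */
		} else
			b->dio = 1;		/* other client holds a lease */
	}
	if (dup != NULL)
		return (dup);			/* duplicate lease re-issue */

	/* New lease: record it persistently (bia odtable) and link it in. */
	l = calloc(1, sizeof(*l));
	if (l == NULL)
		return (NULL);
	l->client_id = client_id;
	l->rw = rw;
	l->next = b->leases;
	b->leases = l;
	return (l);
}
```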
Upon lease issuance, an entry is stored in the MDS persistent operations table (called the bmap or bia, bmap-ION-assignment, on-disk table or "odtable"), recording the lease so it can be rebuilt in recovery scenarios. During recovery (i.e. after failure), these entries are replayed to recreate the MDS's cache.
There is a proposal to behave similarly to TCP's 2MSL wait where, on startup, the MDS ignores all requests concerning bmaps leased from prior instances. This strategy would require strict coherence among load-balanced MDS peers so clients don't stall, and/or automatic re-requesting of leases once the connection to the failed MDS has been re-established.
While the bmap is being paged in (if it is not already present in the MDS memory cache), a placeholder is allocated to prevent reentrant page-ins, and any additional requesting clients will wait on the bcm_waitq until the bmap has been loaded.
bmaps are fixed size structures, about 1KiB in size, which represent
128MiB (by default) of user data.
bmap structures are written into a FID's inode metafile, which is a file in the MDFS that corresponds to the FID, such as /deployment_s2md/.slmd/fidns/0/1/2/3/00000000123000_0. This file contains all metadata for the file and for each bmap in the file.
To read a specific bmap from an inode's metafile requires the bmap index number, which is simply the offset of file access divided by the bmap representation size (again, by default 128MiB). Conversely, the bmap index number multiplied by the bmap_ondisk size, plus the starting offset (SL_BMAP_START_OFF), gives the offset into the metafile.
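In code form, the two mappings look roughly as follows; the 128MiB bmap size and the ~1KiB on-disk size come from the text, while the concrete SL_BMAP_START_OFF value used here is only a placeholder (the real constant lives in the slash2 headers).

```c
#include <stdint.h>

#define SLASH_BMAP_SIZE		(128ULL * 1024 * 1024)	/* user data covered per bmap (default) */
#define BMAP_ONDISK_SIZE	1024ULL			/* ~1KiB on-disk bmap structure */
#define SL_BMAP_START_OFF	0x1000ULL		/* placeholder start of bmap area in the metafile */

/* bmap index for a byte offset within the file */
static inline uint64_t
bmap_index(uint64_t file_off)
{
	return (file_off / SLASH_BMAP_SIZE);
}

/* byte offset of that bmap's structure within the inode metafile */
static inline uint64_t
bmap_metafile_off(uint64_t bmap_idx)
{
	return (SL_BMAP_START_OFF + bmap_idx * BMAP_ONDISK_SIZE);
}
```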
When an ION processes a write on a bmap and sends the MDS the bmap's CRCs, the MDS is required to store an initialized bmap at the respective index in the metafile.
All modifications to directory inodes, file inodes, and bmaps are
recorded in a transaction log called the MDS journal, for replay and MDS
replication purposes.
This journal (in pfl/journal.c) is invoked from ZFS routines when data modifications are made (e.g. zfs_write).
The journal typically sits on a device outside of the MDFS and is written synchronously to ensure data consistency. As such, the device should ideally provide low-latency write IOPS and does not need much storage. A larger journal requires more time to process after startup, while a smaller journal limits the number of transactions that may be open simultaneously before new ones block waiting for old ones to finish.
In normal read mode, a bmap read lease request issued by a client also retrieves the file's inode replica table, which contains the list of IOS replicas where the data resides.
For IONs, a similar SRMT_BMAPGETCRCS RPC request is sent via iod_bmap_retrieve() to load the CRCs of data in the bmap.
The bmap has its own CRC which protects the CRC table and replication table against silent corruption. If the requested bmap does not yet exist on disk, a new bmap must be created on the fly with the CRC table containing SL_NULL_CRCs (i.e. the CRC of a sliver filled with all null bytes).
When CRC updates (CRUDs) are received by the MDS from an IOS (which accepted WRITEs from clients), the bmap's bcs_crcstates are updated and the BMAP_SLVR_DATA flag is set, signifying that this region of the bmap is no longer a hole (i.e. filled with zeroes). The region size defaults to 1MiB. This flag is conveyed in the GETBMAP RPC's flags field as SRM_LEASEBMAPF_DATA if the region is not all zeroes. The client uses this information to know which chunks of the bmap must be retrieved from the ION.
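A small sketch of how a client could consult this state when servicing a read, assuming a per-sliver crcstates array mirrored from the lease reply; the flag and granularity follow the text, the surrounding names are invented for illustration.

```c
#include <stdint.h>

#define SLVR_SIZE	(1024 * 1024)	/* 1MiB sliver / CRC granularity */
#define BMAP_SLVR_DATA	0x01		/* sliver holds data, not a hole */

/*
 * Returns nonzero if the sliver containing this bmap-relative offset has
 * been written (and must be fetched from the ION); zero means the sliver
 * is still a hole and the client can satisfy the read with zeroes.
 */
static int
slvr_needs_fetch(const uint8_t *crcstates, uint64_t off_in_bmap)
{
	return ((crcstates[off_in_bmap / SLVR_SIZE] & BMAP_SLVR_DATA) != 0);
}
```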
Bmap metadata is updated and rewritten as a result of numerous operations:

- Receipt of a chunk CRC update, which causes two fields to be updated: the store of the chunk CRC into its appropriate slot and the recomputation and rewriting of the bmap CRC.
- Replica management: upon successful replication of the bmap, or when replicas become invalid because of an overwrite. This also causes two writes (including rewriting of the bmap CRC).
- Replay of journal activity.
- Update of residency states during partial truncation resolution.
- User-initiated modification of the bmap replication policy.
- User-initiated change of the residency states, e.g. as a result of issuing a replication or residency ejection request.
- A residency change update from an ION, e.g. replication, partial garbage reclamation, etc.

Note that any write of a bmap to disk causes on-demand computation of the bmap to-disk CRC.
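A sketch of that seal step, assuming a simplified on-disk layout; the field names, sizes, and the placeholder checksum routine below are illustrative (the real code uses slash2's bmap_ondisk layout and a proper CRC-64).

```c
#include <stddef.h>
#include <stdint.h>

#define NSLVRS	128	/* 128MiB bmap / 1MiB slivers */
#define NREPLS	64

/* Simplified on-disk bmap: the trailing CRC covers everything before it. */
struct bmap_ondisk_sketch {
	uint64_t	crcs[NSLVRS];		/* per-sliver CRC table */
	uint8_t		crcstates[NSLVRS];	/* e.g. BMAP_SLVR_DATA bits */
	uint8_t		repls[NREPLS];		/* replica residency states */
	uint32_t	bgen;			/* generation number */
	uint64_t	bod_crc;		/* CRC protecting the fields above */
};

/* Placeholder 64-bit checksum (FNV-1a); stands in for the real CRC-64. */
static uint64_t
checksum64(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t h = 0xcbf29ce484222325ULL;
	size_t i;

	for (i = 0; i < len; i++)
		h = (h ^ p[i]) * 0x100000001b3ULL;
	return (h);
}

/* Recompute the self-protecting CRC just before the bmap is written out. */
void
bmap_ondisk_seal(struct bmap_ondisk_sketch *bod)
{
	bod->bod_crc = checksum64(bod, offsetof(struct bmap_ondisk_sketch, bod_crc));
}
```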
The dominant issue with CRCs revolves around the data size encompassed by a single 8-byte CRC. This has direct ramifications in the amount of buffering required and the MDS capacity needed to store CRCs. Also, since the MDS stores the CRCs, the system ingest bandwidth is essentially limited to the number of CRCs the MDS can process. Issues regarding the synchronous storing of MDS-side CRCs need to be explored. For our purposes we will assume that the MDS has safely stored the CRCs before acknowledging back to the IOS.
The MDS has a fixed size array for CRC storage; the array size is the product of the CRC granularity and the bmap size. For now we assume that the bmap size is 128MiB and the CRC granularity is 1MiB, resulting in an array size of 1KiB required for CRC storage per bmap. Here we can see that 8 bytes per 1MiB provides a reasonable growth path for CRC storage:
(1024^2/(1024^2))*8 = 8 # 1MB requires 8B of CRCs
(1024^3/(1024^2))*8 = 8192 # 1GB requires 8KB of CRCs
(1024^4/(1024^2))*8 = 8388608 # 1TB requires 8MB of CRCs
(1024^5/(1024^2))*8 = 8589934592 # 1PB requires 8GB of CRCs
(1024^6/(1024^2))*8 = 8796093022208 # 1EB requires 8TB of CRCs
As writes are processed by the IOS we must ensure that the CRCs are accurate and take into account any cache coherency issues that may arise. One problem we face with parallel IOSes and CRCs is that we have no way to guarantee which IOS wrote last and therefore which CRCs accurately reflect the state of the file. Therefore, revising the parallel IOS write protocol, the MDS will determine the IOS ↔ bmap association and provide the same IOS to all clients for a given bmap write session. This will ensure that only one source is valid for issuing CRC updates into a bmap region, and that this source is verifiable by the MDS.
Some failure ramifications:

- Should the write occur but return with a failure, the IOS must have a way of notifying the MDS that the CRC state on-disk is unknown.
- The IOS performs a write and then fails before sending the CRC update. The CRC should be calculated and stored before sending to the MDS.
Synchronously delivered update RPCs will surely slow down the write process. Perhaps we should be able to batch an entire bmap's worth of updates.
CRC updates are batched into bulk RPCs:
sizeof(struct srm_bmap_crcwrt_req) = 72 + 64 * (sizeof(struct srm_bmap_crcup) = 48 + 24 * sizeof(struct srt_bmap_crcwire) = 16) = 27,720 bytes ≈ 27.07KiB

So MAX_BMAP_INODE_PAIRS should be bumped to 128.
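Expanding that calculation with the struct sizes quoted above; reading the 64 as the current MAX_BMAP_INODE_PAIRS and the 24 as the CRC entries carried per crcup is an assumption based on the formula, not on the header definitions.

```c
#include <stdio.h>

int
main(void)
{
	const unsigned req_fixed = 72;	/* fixed part of struct srm_bmap_crcwrt_req */
	const unsigned crcup	 = 48;	/* sizeof(struct srm_bmap_crcup) */
	const unsigned crcwire	 = 16;	/* sizeof(struct srt_bmap_crcwire) */
	const unsigned npairs	 = 64;	/* assumed MAX_BMAP_INODE_PAIRS today */
	const unsigned ncrcs	 = 24;	/* assumed CRC entries per crcup */

	unsigned total = req_fixed + npairs * (crcup + ncrcs * crcwire);

	/* prints: 27720 bytes (~27.07 KiB) */
	printf("%u bytes (~%.2f KiB)\n", total, total / 1024.0);
	return (0);
}
```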
Also we should have a journal of CRCs on the IOS where possible to deal with failures, so that any unsent CRCs (post-failure) may be verified: compare the buffer-side CRCs against the on-disk state and then update the MDS.
Need to consider what happens when an IOS fails from the perspective of the client and the MDS. The MDS may have to log/record bmap ↔ IOS associations to protect against updates from a previous IOS ownership.
- MDS chooses the IOS for a given bmap; CRC updates may only come from that IOS.
- This means that we can batch CRC updates up to the size of the bmap (big performance win).
- Journal buffer-side CRCs (pre-write) to guard against IOS failure. (Perhaps not.)
- The MDS needs to send an RPC to an IOS requesting calculation of the CRCs for a bmap; this would be issued when the MDS detects the failure of an IOS and needs to reassign.
- When the MDS chooses an ION for write, it must first notify other read-leased clients of this.