DocConfig
A metadata server should employ at least one persistent storage device (e.g. an SSD) for metadata storage and another persistent storage device for the system journal. The journal is latency bound, so a device that offers non-volatile, low-latency writes is desirable, keeping in mind that the expected workload is continual sequential rewriting of the device's contents. The system may recover from a lost journal, but doing so imposes challenges and potential data loss and should be avoided. Additional devices may be added to either the metadata zpool or the system journal to increase the level of fault tolerance.
Here is an example zpool configuration taken from the PSC Data Supercell, where the SLASH2 MDS sits atop two vdevs, each of which is a three-way mirror:
mds# zpool status
pool: arc_s2mds
state: ONLINE
scrub: none requested
config:
NAME                                   STATE     READ WRITE CKSUM
arc_s2mds                              ONLINE       0     0     0
  mirror-0                             ONLINE       0     0     0
    disk/by-id/scsi-35000c50033f6624b  ONLINE       0     0     0
    disk/by-id/scsi-35000c50033f64b0f  ONLINE       0     0     0
    disk/by-id/scsi-35000c50033f6439f  ONLINE       0     0     0
  mirror-1                             ONLINE       0     0     0
    disk/by-id/scsi-35000c500044205df  ONLINE       0     0     0
    disk/by-id/scsi-35000c50033ea560b  ONLINE       0     0     0
    disk/by-id/scsi-35000c50033f63417  ONLINE       0     0     0

errors: No known data errors

mds# zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
arc_s2mds    379G  3.26T    493     70  2.52M   351K
PSC's Data Supercell has on the order of 10^8 files and directories. The ZFS storage required to store these metadata items is about 300GB (uncompressed). Using the default ZFS compression we generally see about 3.5 : 1 compression of SLASH2 metadata. Without compression one should expect about 150k files to consume about 1GB of ZFS metadata storage. With compression, that ratio would improve to about 500k files per 1GB of ZFS metadata storage.
The heuristic for calculating metadata storage requirements is:
avg_number_of_bmaps_per_file =
    (average file size) / (128MiB per bmap)
total_storage_needed =
    (number of files) *
    ((1.5KiB of metadata per file) +
     (1.1KiB per bmap) * avg_number_of_bmaps_per_file)
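As a rough worked example (the file count and average file size are assumptions chosen purely for illustration): with 1,000,000 files averaging 640MiB each, avg_number_of_bmaps_per_file = 640MiB / 128MiB = 5, so
total_storage_needed = 1,000,000 * (1.5KiB + 1.1KiB * 5) = 1,000,000 * 7KiB, or roughly 6.7GiB
which is in line with the uncompressed figure of about 150k files per 1GB quoted above.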
Ideal storage for the MDS journal is 256MB to 2GB.
Losing a journal will not result in the loss of the file system; only the uncommitted changes would be lost. If mirroring is desired, Linux MD mirroring is suitable, though we recommend partitioning the devices so that rebuilds of the mirror do not require rebuilding the entire device.
The following steps create and tailor a ZFS pool for use as SLASH2
metadata storage.
These steps are also documented in sladm(7).
slmkfs(8), when used to format the metadata file system, will output an FSUUID, which is the identifier for the file system. This hexadecimal value needs to be copied into the SLASH2 configuration file for clients and I/O servers.
Warning! Use devices which pertain to your system and are not currently in use! It is recommended to use the global device identifiers listed in /dev/disk/by-id, especially for the journal. This may prevent you from accidentally trashing another mounted file system.
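For example, the identifiers available on a given machine can be listed as follows (the entries shown will of course differ from system to system):
mds# ls -l /dev/disk/by-id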
mds# zfs-fuse && sleep 3
mds# zpool create -f s2mds_pool mirror /dev/sdX1 /dev/sdX2
mds# zfs set atime=off s2mds_pool
mds# zfs set compression=lz4 s2mds_pool
mds# slmkfs -I $site_id:$resource_id /s2mds_pool
The UUID of the pool is 0x2a8ae931a776366e
mds# pkill zfs-fuse
Create the journal on a separate device with the FSUUID generated from slmkfs.
The journal created by this command will be 512MiB:
mds# slmkjrnl -f -b /dev/sdJ1 -n 1048576 -u 0x2a8ae931a776366e
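Assuming -n gives the number of journal entries and each entry occupies 512 bytes (inferences from the sizes quoted here, not values taken from the slmkjrnl documentation), 1048576 entries works out to the 512MiB mentioned above (1048576 * 512B = 512MiB).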
SLASH2 uses the Lustre networking stack (aka LNET), so configuration will be somewhat familiar to those who have used Lustre. At this time SLASH2 supports TCP and the deprecated Sockets Direct Protocol (SDP). Mixed network topologies are supported to some degree. For instance, if clients and I/O servers have both InfiniBand and Ethernet, they may use IB to communicate with each other even if the MDS has only Ethernet connectivity.
Example configurations are provided in projects/slash2/config.
The slcfg(5) man page also contains more information on this topic.
Setting the metadata zpool name and FSUUID is straightforward:
set zpool_name="s2mds_pool";
set fsuuid="2a8ae931a776366e";
Next, set the TCP port and nets configuration settings.
In the example below, SLASH2 is configured to use TCP port 989 for connections among hosts in the deployment (non-privileged ports are allowed).
set port=989;
Next, the LNET network identifier for the TCP network is set to tcp1.
Any clients or servers with interfaces on the 192.168/16 network will match the rule 192.168.*.* and be configured to use the tcp1 network.
Hosts with InfiniBand interfaces on the 10.0.0.* network will be configured on the sdp0 network.
set nets="tcp1 192.168.*.*; sdp0 10.0.0.*";
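To illustrate (the addresses are hypothetical), a client with an interface at 192.168.0.50 would match the 192.168.*.* rule and be addressed by the LNET NID 192.168.0.50@tcp1, while a host with an InfiniBand interface at 10.0.0.5 would be reachable as 10.0.0.5@sdp0.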
A site is considered to be a management domain in the cloud or across the wide-area network.
slcfg(5) details the resource types, and the example configurations in the distribution may be used as guides.
Here we will walk through a simple site configuration.
set zpool_name="s2mds_pool";
set fsuuid="2a8ae931a776366e";
set port=989;
set nets="tcp1 192.168.*.*; sdp0 10.0.0.*";
set pref_mds="mds1@MYSITE";
set pref_ios="ion1@MYSITE";
site @MYSITE {
    site_desc = "test SLASH2 site configuration";
    site_id = 1;

    # MDS resource #
    resource mds1 {
        desc = "my metadata server";
        type = mds;
        id = 0;

        # `nids' should be the IP or hostname of your MDS node.
        # It should be on the network specified in the `nets'
        # variable above.
        nids = 192.168.0.100;

        # `journal' must be the device formatted above.
        journal = /dev/sdJ1;
    }

    resource ion1 {
        desc = "I/O server 1";
        type = standalone_fs;
        id = 1;
        nids = 192.168.0.101, 10.0.0.1;

        # `fsroot' points to the storage mounted on the I/O server
        # which is to be used by SLASH2.
        fsroot = /disk;
    }

    resource ion2 {
        desc = "I/O server 2";
        type = standalone_fs;
        id = 2;
        nids = 192.168.0.102, 10.0.0.2;
        fsroot = /disk;
    }
}
Note that the pref_mds and pref_ios global values have been filled in based on the names specified in the site.
The pref_ios is important because it is used by clients as the default target when writing new files.
The configuration above lists two SLASH2 I/O servers: ion1 and ion2.
Both list their fsroot as /disk, which is the location in their local namespace that they expose to the SLASH2 deployment as part of their participation.
The SLASH2 I/O server (sliod) is a stateless process which exports locally mounted storage into the respective SLASH2 file system.
In this case, we assume that /disk is a mounted file system with some available storage behind it, although it could be a shared parallel file system as well.
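As a minimal sketch of preparing that backing storage (the device name and choice of ext4 are placeholders, not part of this example deployment), a local file system might be created and mounted at /disk like so:
io# mkfs.ext4 /dev/sdb1
io# mkdir -p /disk
io# mount /dev/sdb1 /disk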
In order to use the storage mounted at /disk, a specific directory structure must be created with the slmkfs command prior to starting the I/O server.
The FSUUID must be supplied as a parameter to this command.
Per this example, the following command is run on both ion1 and ion2 with different -I (ID) values, corresponding to the ID values listed in the slcfg configuration file:
io# slmkfs -i -u 0x2a8ae931a776366e -I 0x401:0x6210 /disk
io$ ls -l /disk/.slmd/
total 4
drwx------ 3 root root 4096 Jun 18 15:10 2a8ae931a776366e
Once completed, the directory /disk/.slmd should appear, containing a subdirectory named by the FSUUID.
Under this is a hierarchy of 16^4 directories used for storing SLASH2 file objects named by object ID.
An explanation of more sophisticated I/O system types is given in SLASH2IOServices and DocAdmin.