DocConfig

Configuring SLASH2

Configuring the MDS server

Selecting storage for SLASH2 metadata

A metadata server should employ at least one persistent storage device (e.g., an SSD) for metadata storage and another persistent storage device for the system journal. The journal is latency bound, so a nonvolatile, low-latency device is desirable; bear in mind that the expected workload is continual sequential rewriting of the device's contents. The system may recover from a lost journal, but doing so is difficult and risks data loss, so it should be avoided. Additional devices may be added to either the metadata zpool or the system journal to increase the level of fault tolerance.

Here is an example zpool configuration taken from the PSC Data SuperCell. Here the SLASH2 MDS sits atop two vdevs, each of which is triplicated:

mds# zpool status
pool: arc_s2mds
state: ONLINE
scrub: none requested
config:

  NAME                                   STATE     READ WRITE CKSUM
  arc_s2mds                              ONLINE       0     0     0
    mirror-0                             ONLINE       0     0     0
      disk/by-id/scsi-35000c50033f6624b  ONLINE       0     0     0
      disk/by-id/scsi-35000c50033f64b0f  ONLINE       0     0     0
      disk/by-id/scsi-35000c50033f6439f  ONLINE       0     0     0
    mirror-1                             ONLINE       0     0     0
      disk/by-id/scsi-35000c500044205df  ONLINE       0     0     0
      disk/by-id/scsi-35000c50033ea560b  ONLINE       0     0     0
      disk/by-id/scsi-35000c50033f63417  ONLINE       0     0     0

errors: No known data errors

mds# zpool iostat
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
arc_s2mds    379G  3.26T    493     70  2.52M   351K

Metadata storage requirements

PSC's Data Supercell has on the order of 10^8 files and directories. The ZFS storage required to hold these metadata items is about 300GB (uncompressed). With the default ZFS compression we generally see about 3.5:1 compression of SLASH2 metadata. Without compression, expect about 150k files to consume about 1GB of ZFS metadata storage; with compression, that ratio improves to roughly 500k files per 1GB (150k * 3.5 ≈ 525k).

The heuristic for estimating metadata storage is:

avg_number_of_bmaps_per_file =
    (average file size) / (128MiB per bmap)

total_storage_needed =
    (number of files) * (1.5KiB of metadata per file) +
    (number of files) * (1.1KiB per bmap) * avg_number_of_bmaps_per_file
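
For example (using hypothetical round numbers, not measurements from a real deployment): a deployment holding 10 million files averaging 256MiB each would have 2 bmaps per file, giving roughly:

avg_number_of_bmaps_per_file =
    256MiB / 128MiB = 2

total_storage_needed =
    10,000,000 * 1.5KiB + 10,000,000 * (1.1KiB * 2)
  = 15,000,000KiB + 22,000,000KiB
  = 37,000,000KiB, or roughly 35GiB of metadata before ZFS compression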

MDS journal storage

The ideal size for the MDS journal is 256MB to 2GB.

Should the journal device be mirrored?

Losing a journal will not result in the loss of the file system; only uncommitted changes would be lost. If mirroring is desired, Linux MD mirroring is suitable, though we recommend partitioning the devices so that a rebuild of the mirror does not require rebuilding the entire device.
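
As a minimal sketch of such a setup (the device names /dev/sdY and /dev/sdZ and the 2GiB partition size are placeholders, not SLASH2 requirements), a partitioned MD mirror for the journal might be created as follows:

mds# parted -s /dev/sdY mklabel gpt mkpart journal 1MiB 2GiB
mds# parted -s /dev/sdZ mklabel gpt mkpart journal 1MiB 2GiB
mds# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdY1 /dev/sdZ1

The resulting /dev/md0 would then take the place of the raw journal device in the slmkjrnl command and in the `journal' setting of the configuration file below.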

Creating an MDS ZFS pool and journal

The following steps create and tailor a ZFS pool for use as SLASH2 metadata storage. These steps are also documented in sladm(7). slmkfs(8), when used for formatting the metadata file system, will output an FSUUID which is the identifier for the file system. This hexadecimal value needs to be copied into the SLASH2 configuration file for clients and I/O servers.

Warning! Use devices that pertain to your system and are not currently in use! We recommend using the stable device identifiers listed in /dev/disk/by-id, especially for the journal; this helps prevent accidentally trashing another mounted file system.
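
For instance, the stable identifiers available on a given system can be listed before choosing devices (the zpool and journal commands below use /dev/sdX-style placeholders only for brevity):

mds# ls -l /dev/disk/by-id/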

mds# zfs-fuse && sleep 3
mds# zpool create -f s2mds_pool mirror /dev/sdX1 /dev/sdX2
mds# zfs set atime=off s2mds_pool
mds# zfs set compression=lz4 s2mds_pool
mds# slmkfs -I $site_id:$resource_id /s2mds_pool
The UUID of the pool is 0x2a8ae931a776366e
mds# pkill zfs-fuse
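
Before stopping zfs-fuse in the last step above, the pool properties can optionally be verified (a quick sanity check, not a required step):

mds# zfs get atime,compression s2mds_pool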

Create the journal on a separate device with the FSUUID generated from slmkfs. The journal created by this command will be 512MiB:

mds# slmkjrnl -f -b /dev/sdJ1 -n 1048576 -u 0x2a8ae931a776366e
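
A note on the arithmetic implied by this example (the per-slot size below is inferred from the numbers above, not taken from slmkjrnl documentation): -n 1048576 together with the stated 512MiB size works out to 512 bytes per journal slot, so a journal at the top of the recommended range would correspond to roughly four times that count:

1048576 * 512 bytes = 512MiB
4194304 * 512 bytes = 2GiB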

Network configuration

SLASH2 uses the Lustre networking stack (aka LNET), so configuration will be somewhat familiar to those who have used Lustre. At this time SLASH2 supports TCP and the deprecated Sockets Direct Protocol (SDP). Mixed network topologies are supported to some degree. For instance, if clients and I/O servers have both InfiniBand and Ethernet, they may use IB to communicate with each other even if the MDS has only Ethernet connectivity.

Setting up the master configuration file

Example configurations are provided in projects/slash2/config. The slcfg(5) man page also contains more information on this topic.

Setting the zpool, fsuuid and network settings

Setting the metadata zpool name and FSUUID is straightforward:

set zpool_name="s2mds_pool";
set fsuuid="2a8ae931a776366e";

Next come the TCP port and nets configuration settings. In the example below, SLASH2 is configured to use TCP port 989 for connections among hosts in the deployment (non-privileged ports are also allowed).

set port=989;

Next, the LNET network identifier for the TCP network is set to tcp1. Any clients or servers with interfaces on the 192.168/16 network will match the rule 192.168.*.* and be configured to use the tcp1 network. Hosts with InfiniBand interfaces on the 10.0.0.* network will be configured on the sdp0 network.

set nets="tcp1 192.168.*.*; sdp0 10.0.0.*";

Configuring a SLASH2 deployment

A site is a management domain within the cloud or across the wide-area network. slcfg(5) details the resource types, and the example configurations in the distribution may be used as guides. Here we walk through a simple site configuration.

set zpool_name="s2mds_pool";
set fsuuid="2a8ae931a776366e";
set port=989;
set nets="tcp1 192.168.*.*; sdp0 10.0.0.*";
set pref_mds="mds1@MYSITE";
set pref_ios="ion1@MYSITE";

site @MYSITE {
	site_desc = "test SLASH2 site configuration";
	site_id   = 1;

	# MDS resource #
	resource mds1 {
		desc = "my metadata server";
		type = mds;
		id   = 0;
		# `nids' should be the IP or hostname of your MDS node.
		# It should be on the network specified in the `nets'
		# variable above.
		nids = 192.168.0.100;
		# `journal' must be the device formatted above.
		journal = /dev/sdJ1;
	}

	resource ion1 {
		desc = "I/O server 1";
		type = standalone_fs;
		id   = 1;
		nids = 192.168.0.101, 10.0.0.1;
		# `fsroot' points to the storage mounted on the I/O server
		# which is to be used by SLASH2.
		fsroot = /disk;
	}

	resource ion2 {
		desc = "I/O server 2";
		type = standalone_fs;
		id   = 2;
		nids = 192.168.0.102, 10.0.0.2;
		fsroot = /disk;
	}
}

Note that the pref_mds and pref_ios global values have been filled in based on the names specified in the site. pref_ios is important because clients use it as the default target when writing new files.

I/O Server Setup

The configuration above lists two SLASH2 I/O servers: ion1 and ion2. Both list their fsroot as /disk, which is the location in their local namespace that they expose to the SLASH2 deployment as part of their participation.

The SLASH2 I/O server (sliod) is a stateless process which exports locally mounted storage into the respective SLASH2 file system. In this case, we assume that /disk is a mounted file system with some available storage behind it, although it could be a shared parallel file system as well.
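
For example (a minimal sketch; the device /dev/sdb1 and the choice of XFS are placeholders rather than SLASH2 requirements), the backing storage might be prepared and mounted as follows:

io# mkfs.xfs /dev/sdb1
io# mkdir -p /disk
io# mount /dev/sdb1 /disk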

In order to use the storage mounted at /disk, a specific directory structure must be created with the slmkfs command prior to starting the I/O server. The FSUUID must be supplied as a parameter to this command. Per this example, the following command is run on both ion1 and ion2 with different -I (ID) values, matching the ID values listed in the slcfg configuration file:

io# slmkfs -i -u 0x2a8ae931a776366e -I 0x401:0x6210 /disk
io# ls -l /disk/.slmd/
total 4
drwx------ 3 root root 4096 Jun 18 15:10 2a8ae931a776366e

Once this completes, the directory /disk/.slmd should exist, containing a subdirectory named after the FSUUID. Under this is a hierarchy of 16^4 directories used for storing SLASH2 file objects, which are named by object ID.

An explanation of more sophisticated I/O system types is given in SLASH2IOServices and DocAdmin.