
SAFS is a user-space filesystem designed for a large SSD array. The goal of SAFS is to maximize the I/O performance of the SSD array on a NUMA machine while still providing a filesystem interface to users. SAFS is specifically optimized for large files. A file exposed by SAFS is partitioned, and each partition is stored as a physical file on an SSD. SAFS currently does not support directory operations.

Programming interface

SAFS provides basic operations on files: create, delete, read and write.

File metadata operations

The class safs_file represents an SAFS file and provides methods for metadata operations such as creating, deleting and renaming a file.

class safs_file
{
public:
    /* The constructor method. The file doesn't need to exist. */
    safs_file(const RAID_config &conf, const std::string &file_name);
    /* Test whether the SAFS file exists. */
    bool exist() const;
    /* Get the size of the SAFS file. */
    ssize_t get_size() const;
    /* Create the SAFS file with the specified size. */
    bool create_file(size_t file_size);
    /* Delete the SAFS file. */
    bool delete_file();
    /* Rename the SAFS file to a new name. */
    bool rename(const std::string &new_name);
};

SAFS does not support directories. The function get_all_safs_files returns all files in SAFS.

size_t get_all_safs_files(std::set<std::string> &files);
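
A minimal sketch of these metadata operations, assuming a RAID_config object has already been obtained from the initialized I/O system (how it is obtained depends on the setup) and that safs_file is declared in safs_file.h (the header name is an assumption):

#include <set>
#include <string>
#include <cstdio>

#include "safs_file.h"

void metadata_example(const RAID_config &conf)
{
    safs_file f(conf, "test_file");
    // Create a 1GB SAFS file if it doesn't exist yet.
    if (!f.exist())
        f.create_file(1024L * 1024 * 1024);
    printf("test_file: %ld bytes\n", (long) f.get_size());
    // Rename the file, then list all files currently stored in SAFS.
    f.rename("renamed_file");
    std::set<std::string> files;
    get_all_safs_files(files);
    for (const std::string &name : files)
        printf("%s\n", name.c_str());
    // Delete the file through a safs_file object bound to the new name.
    safs_file f2(conf, "renamed_file");
    f2.delete_file();
}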

File access

Two classes (file_io_factory and io_interface) are used for accessing data in a file. The class file_io_factory creates and destroys io_interface objects, which provide methods to read and write an SAFS file. An io_interface instance can only access a single file and can only be used in a single thread. The implementations of io_interface are intentionally not thread-safe for the sake of performance.

When using file_io_factory and io_interface in multiple threads (a main thread and several worker threads), the recommended approach is to create a single file_io_factory instance for an SAFS file in the main thread and create an io_interface instance in each worker thread.
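
The sketch below illustrates this pattern. It is only a sketch: it assumes init_io_system has already been called, and that thread::get_curr_thread() returns a usable thread object inside threads created with std::thread (in practice the library's own thread class may be required):

#include <string>
#include <thread>
#include <vector>

#include "io_interface.h"

void worker_func(file_io_factory::shared_ptr factory)
{
    // Each worker creates its own io_interface; the instances are not
    // thread-safe and must never be shared across threads.
    io_interface::ptr io = create_io(factory, thread::get_curr_thread());
    // ... issue I/O requests through io ...
}

void run_workers(const std::string &file_name, int num_workers)
{
    // Create a single factory in the main thread and share it.
    file_io_factory::shared_ptr factory = create_io_factory(file_name,
            REMOTE_ACCESS);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; i++)
        workers.emplace_back(worker_func, factory);
    for (auto &t : workers)
        t.join();
}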

File open and close

The function create_io_factory creates a file_io_factory instance for a file. It allows a user to specify an access option, which determines what type of file_io_factory instance is created. Right now, SAFS supports two access options:

  • REMOTE_ACCESS: this corresponds to direct I/O in Linux. An io_interface instance created by such a file_io_factory doesn't use the page cache in SAFS.
  • GLOBAL_CACHE_ACCESS: this corresponds to buffered I/O in Linux. An io_interface instance created by such a file_io_factory uses the page cache in SAFS.

Opening a file involves two steps: invoking create_io_factory to create a file_io_factory object, then invoking create_io to create an io_interface object from the factory. Files are closed implicitly when the file_io_factory object is destroyed.
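
A minimal open/close sketch, assuming the I/O system has already been initialized:

void open_close_example()
{
    // Open: create a factory for the file, then an io_interface bound to
    // the calling thread. REMOTE_ACCESS bypasses the SAFS page cache;
    // GLOBAL_CACHE_ACCESS would use it instead.
    file_io_factory::shared_ptr factory = create_io_factory("test_file",
            REMOTE_ACCESS);
    io_interface::ptr io = create_io(factory, thread::get_curr_thread());
    // ... read and write through io ...
}   // Both smart pointers are destroyed here, implicitly closing the file.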

Synchronous read and write

A user can use the following method of io_interface to issue synchronous I/O requests. The access_method argument determines whether it is a read or a write request: 0 indicates read and 1 indicates write.

class io_interface
{
public:
    io_status access(char *buf, off_t off, ssize_t size, int access_method);
    ...
};
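
For illustration, a synchronous read might look like the following sketch; it assumes io is an io_interface created as described above. The alignment requirement in the comments applies to REMOTE_ACCESS (direct I/O), as noted in the examples further below:

void sync_read_example(io_interface::ptr io)
{
    // Synchronously read 4096 bytes at offset 0 of the file. With
    // REMOTE_ACCESS (direct I/O), the buffer, offset and size all need
    // to be aligned to the I/O block size, hence valloc.
    char *buf = (char *) valloc(4096);
    io_status status = io->access(buf, 0, 4096, 0 /* 0 = read, 1 = write */);
    // ... consume the data in buf ...
    free(buf);
}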

Asynchronous read and write

Users can use the following set of methods for asynchronous I/O. First, users need to implement the callback interface and register it with an io_interface object before issuing any I/O requests, in order to be notified when requests complete. Then they use the asynchronous version of the access method to issue I/O requests. When a request completes, the callback is invoked, and it is guaranteed to be invoked in the same thread where the I/O request was issued. An io_interface instance does not limit the number of parallel I/O requests that can be issued to it. Users can monitor the number of incomplete I/O requests with the num_pending_ios method and wait for I/O to complete with the wait4complete method.

class io_interface
{
public:
    ...
    /* Issue asynchronous I/O requests. */
    void access(io_request *, int, io_status *);
    /* Flush I/O requests buffered by the io_interface instance. */
    void flush_requests();
    /* Wait for at least the specified number of I/O requests to complete. */
    int wait4complete(int);
    /* Get the number of pending I/O requests. */
    int num_pending_ios() const;
    /* Set the callback function. */
    bool set_callback(callback::ptr);
};

class callback
{
public:
    virtual int invoke(io_request *reqs[], int num) = 0;
};

A simple example of using the library

The following pseudocode illustrates a simple use case of SAFS: it reads data from a file synchronously by issuing one I/O request at a time and waiting for each to complete.

#include "io_interface.h"

class task
{
    // Defined by users.
    ...
public:
    size_t get_size() const;
    off_t get_offset() const;
};

static void test(const std::string &conf_file, const std::string &graph_file,
        const std::vector<task> &tasks)
{   
    config_map::ptr configs = config_map::create(conf_file);
    init_io_system(configs);
    file_io_factory::shared_ptr factory = create_io_factory(graph_file,
                    REMOTE_ACCESS);
    io_interface::ptr io = create_io(factory, thread::get_curr_thread());

    char *buf = NULL;
    size_t buf_capacity = 0;
    BOOST_FOREACH(task t, tasks) {
        // This is direct I/O. The memory buffer, I/O offset and I/O size
        // all need to be aligned to the I/O block size.
        size_t io_size = ROUNDUP_PAGE(t.get_size());
        data_loc_t loc(factory->get_file_id(), t.get_offset());
        if (io_size > buf_capacity) {
            free(buf); 
            buf_capacity = io_size;
            buf = (char *) valloc(buf_capacity);
        }
        assert(buf_capacity >= io_size);
        io_request req(buf, loc, io_size, READ);
        io->access(&req, 1);
        io->wait4complete(1);
        run_computation(buf, io_size);
    }
    free(buf);
}

The following pseudocode illustrates a use case of SAFS' asynchronous I/O interface to read data from a file. It is slightly more complex than the synchronous example: it requires defining a callback class, and the computation is performed in the callback.

#include "io_interface.h"

class compute_callback: public callback
{
    public:
        virtual int invoke(io_request *reqs[], int num);
};  
    
int compute_callback::invoke(io_request *reqs[], int num)
{
    for (int i = 0; i < num; i++) {
        char *buf = reqs[i]->get_buf();
        run_computation(buf, reqs[i]->get_size());
        free(buf);
    }
    return 0;
}   
    
static void test(const std::string &conf_file, const std::string &graph_file,
        const std::vector<task> &tasks)
{
    config_map::ptr configs = config_map::create(conf_file);
    init_io_system(configs);
    file_io_factory::shared_ptr factory = create_io_factory(graph_file,
            REMOTE_ACCESS);
    io_interface::ptr io = create_io(factory, thread::get_curr_thread());
    io->set_callback(callback::ptr(new compute_callback()));

    int max_ios = 20;
    BOOST_FOREACH(task t, tasks) {
        while (io->num_pending_ios() >= max_ios)
            io->wait4complete(1);

        // This is direct I/O. The memory buffer, I/O offset and I/O size
        // all need to be aligned to the I/O block size.
        size_t io_size = ROUNDUP_PAGE(t.get_size());
        data_loc_t loc(factory->get_file_id(), t.get_offset());
        // The buffer will be free'd in the callback function.
        char *buf = (char *) valloc(io_size);
        io_request req(buf, loc, io_size, READ);
        io->access(&req, 1);
    }
    io->wait4complete(io->num_pending_ios());
}

Utility tool in SAFS

SAFS-util is a tool that helps manage SAFS. It provides a few commands to operate on SAFS:

  • create: create a file in SAFS.
  • delete file_name: delete a file in SAFS.
  • list: list all existing files in SAFS.
  • load: load a file from an external filesystem to a file in SAFS.
  • verify: verify the data of a file in SAFS. It’s mainly used for testing.
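
A hypothetical session with SAFS-util might look like the following; the argument order shown for each command is an assumption and may differ in the actual tool:

SAFS-util create test_file 1G
SAFS-util load test_file /path/to/external_file
SAFS-util list
SAFS-util verify test_file /path/to/external_file
SAFS-util delete test_file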

Configurations

SAFS requires proper Linux kernel and filesystem configuration to get the maximal performance from an SSD array.

Linux kernel configurations for SAFS:

Because SAFS runs on a large SSD array in a machine with a non-uniform memory architecture (NUMA), the Linux kernel needs to be configured properly to get the maximal performance from the SSD array. The kernel configurations include:

  • evenly distribute interrupts to all CPU cores;
  • use the noop I/O scheduler;
  • set I/O request affinity for each SSD to force I/O request completion on the requesting CPU core;
  • prevent I/O on SSDs from contributing to the entropy pool of the random number generator;
  • use a large sector size for SSDs (the maximal sector size is an SSD-specific parameter).

We provide two scripts to automate the process.

  • conf/set_affinity.sh: The script distributes IRQs (interrupt requests) evenly across CPU cores. It is only required when running SAFS on a NUMA machine, and it is written specifically for an LSI host bus adapter (HBA). For other HBAs, users need to adapt the script to their specific hardware.
  • conf/set_ssds.pl: The script takes an input file that lists the device files to run SAFS on, one device file per line. The script sets up the remaining configurations, mounts the SSDs, and creates conf/data_files.txt, which the library uses as the configuration file for its root directories.

SAFS configurations

SAFS defines the following parameters for customizing SAFS. Users have the opportunity to set them when SAFS is initialized.

  • root_conf: a config file that specifies the directories on SSDs where SAFS runs. The config file has one line per directory, in the format node_id:abs_path_of_directory. SAFS requires users to provide absolute paths to the directories on the SSDs.
  • RAID_block_size: defines the size of a data block on an SSD. Users can specify the size in the format x(k, K, m, M, g, G), e.g., 4K = 4096 bytes. The default block size is 512KB.
  • RAID_mapping: defines how data blocks of a file are mapped to SSDs. Currently, the library provides three mapping functions: RAID0, RAID5 and HASH. The default mapping is RAID5.
  • cache_size: defines the size of the page cache, in the same x(k, K, m, M, g, G) format. The default cache size is 512MB.
  • num_nodes: defines the number of NUMA nodes where the page cache is allocated. The default number is 1.
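
For illustration, a configuration might look like the following. Only the line format of the root_conf file (node_id:abs_path_of_directory) is prescribed above; the key=value layout of the main config file passed to config_map::create is an assumption:

conf/data_files.txt:

0:/mnt/ssd0/safs
0:/mnt/ssd1/safs
1:/mnt/ssd2/safs
1:/mnt/ssd3/safs

Main config file:

root_conf=conf/data_files.txt
RAID_block_size=512K
RAID_mapping=RAID5
cache_size=512M
num_nodes=2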